Machine Learning Neeru


UNIT I: Introduction to Machine Learning

1.1 Understanding Machine Learning

Machine learning involves the programming of computers to optimize performance by utilizing
example data or prior experiences. This process revolves around defining a model with parameters, and
learning is the execution of a computer program that fine-tunes these parameters using training data or
historical experiences. The model may possess predictive capabilities for making future predictions, be
descriptive for extracting knowledge from data, or exhibit a combination of both functionalities.

The term "Machine Learning" was introduced by Arthur Samuel in 1959, a pioneering figure in early
American computer gaming and artificial intelligence, during his time at IBM. Samuel characterized
machine learning as "the field of study that gives computers the ability to learn without being explicitly
programmed." However, consensus on a universal definition for machine learning remains elusive,
with different authors providing varying interpretations.

Definition of Learning

Learning in the context of machine learning is defined as follows: A computer program is said to
learn from experience E concerning a class of tasks T and a performance measure P, if its performance
in tasks T, as measured by P, improves with experience E.

Examples

i) Handwriting Recognition Learning Problem

Task T : Recognizing and classifying handwritten words within images.

Performance P: Percentage of words correctly classified.

Training experience E : A dataset of handwritten words with given classifications.

ii) Robot Driving Learning Problem

Task T : Driving on highways using vision sensors.

Performance measure P: Average distance traveled before an error.

Training experience E : A sequence of images and steering commands recorded while observing a
human driver.

iii) Chess Learning Problem

Task T : Playing chess.

Performance measure P: Percentage of games won against opponents.

Training experience E : Playing practice games against itself.

Definition
A computer program that learns from experience is termed a machine learning program or simply a
learning program. It is also occasionally referred to as a learner.
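As a toy illustration of the T, P, E formalization above, the sketch below treats estimating a coin's bias as the task: the experience is a set of observed flips, the learned parameter is the relative frequency of heads, and the performance measure is the squared error of that estimate. The coin, its bias value, and the function names are hypothetical, chosen only for this illustration.

```python
import random

random.seed(0)
true_p = 0.7   # hidden parameter the learner must estimate (hypothetical task)

def experience(n):
    """Experience E: n observed flips of a biased coin."""
    return [1 if random.random() < true_p else 0 for _ in range(n)]

def performance(estimate):
    """Performance measure P: squared error of the learned parameter."""
    return (estimate - true_p) ** 2

# Task T: estimate the coin's bias. More experience should improve P.
for n in (10, 100, 10000):
    flips = experience(n)
    estimate = sum(flips) / n   # "learning" = fitting the parameter to data
    print(n, round(performance(estimate), 5))
```

Running the loop with increasing n shows the performance measure shrinking as the experience grows, which is exactly the sense in which the program "learns".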

1.2 Basic Components of the Machine Learning Process

The machine learning process comprises four fundamental components, namely data storage,
abstraction, generalization, and evaluation. Each plays a crucial role in the learning journey,
contributing to the understanding and utilization of information. The following outlines these key
components and their respective functions:

1. Data Storage:

Role: Data storage facilities are pivotal in the learning process, serving as repositories for extensive
data. Both humans and machines rely on efficient data storage as a foundational element for advanced
reasoning.

Human Analog: In humans, data is stored in the brain, and retrieval involves electrochemical signals.

Machine Implementation: Computers utilize various storage devices like hard disk drives, flash memory,
and random access memory, with retrieval handled by the system's memory and input/output interfaces.

2. Abstraction:

Role: Abstraction is the second component, involving the extraction of knowledge from stored data.
This process includes forming general concepts that encapsulate the essence of the data. Knowledge
creation encompasses applying existing models and developing new ones, with training being the
process of fitting a model to a dataset.

Transformation: Once trained, the model transforms the data into an abstract representation that
summarizes the original information.

3. Generalization:

Role: Generalization, the third component, is the process of translating knowledge about stored data
into a form applicable for future actions. These actions are designed for tasks that share similarities
with, but are not identical to, those encountered before.

Objective: The goal in generalization is to uncover the properties of the data that are most relevant to
future tasks.
4. Evaluation:

Role: Evaluation is the final step in the learning process, assessing the effectiveness and efficiency of
the acquired knowledge and models.

Feedback Loop: The evaluation results guide adjustments to the learning process, ensuring continuous
refinement and improvement.

In summary, the machine learning process involves storing and retrieving data, abstracting knowledge
from the data through model training, generalizing this knowledge for future tasks, and finally,
evaluating and refining the acquired knowledge for ongoing improvement. This iterative cycle forms
the foundation of the machine learning journey.

1.3 Applications of Machine Learning

Machine learning has found diverse applications across various industries, revolutionizing how tasks
are performed, decisions are made, and systems operate. Here are some notable applications of
machine learning:

1. Healthcare:
Disease Diagnosis: Machine learning models analyze medical data to assist in early diagnosis of
diseases.

Drug Discovery: ML aids in identifying potential drug candidates and predicting their effectiveness.

2. Finance:
Fraud Detection: ML algorithms analyze transaction data to detect and prevent fraudulent activities.

Credit Scoring: Models predict creditworthiness based on customer data.

3. Retail:
Recommendation Systems: ML powers personalized product recommendations for users.

Demand Forecasting: Predictive models help optimize inventory management and supply chains.

4. Marketing:
Customer Segmentation: ML categorizes customers based on behavior for targeted marketing.

Ad Targeting: Algorithms optimize ad placements by analyzing user preferences.

5. Manufacturing:
Predictive Maintenance: ML predicts equipment failures, optimizing maintenance schedules.

Quality Control: Image recognition and analysis enhance product quality inspection.

6. Autonomous Vehicles:
Object Detection: ML enables vehicles to recognize and respond to objects and obstacles.

Path Planning: Algorithms optimize routes and navigate complex environments.


7. Natural Language Processing (NLP):
Chatbots: ML-driven chatbots provide automated customer support.

Language Translation: NLP models facilitate accurate language translation.

8. Image and Video Analysis:


Facial Recognition: ML identifies faces for security and authentication.

Object Detection: Algorithms analyze images and videos for object identification.

9. Cybersecurity:
Anomaly Detection: ML identifies unusual patterns to detect cybersecurity threats.

Malware Detection: Algorithms analyze code for signs of malicious activity.

10. Education:
Personalized Learning: ML tailors educational content based on individual student progress.

Predictive Analytics: Models identify students at risk of falling behind.

11. Energy Management:


Smart Grids: ML optimizes energy distribution and consumption in smart grids.

Predictive Maintenance: Algorithms forecast equipment failures in the energy sector.

12. Human Resources:


Recruitment: ML assists in screening resumes and identifying suitable candidates.

Employee Retention: Predictive models analyze factors influencing employee retention.

These applications showcase the versatility of machine learning, demonstrating its impact on
optimizing processes, enhancing decision-making, and enabling innovations across various domains.

1.4 Perspectives in Machine Learning:

1. Evolutionary Advancements:

 Perspective: Machine learning is viewed as an evolving field with continuous
advancements, incorporating new algorithms, techniques, and paradigms to address
complex challenges.

2. Ubiquitous Integration:

 Perspective: Machine learning is becoming integral to various industries, including
healthcare, finance, and manufacturing, transforming how tasks are performed and
decisions are made.

3. Human-Augmented AI:
 Perspective: There is a shift towards human-machine collaboration, where machine
learning augments human capabilities rather than replacing them. This perspective
emphasizes the symbiotic relationship between humans and AI.

4. Exponential Data Growth:

 Perspective: The explosion of data availability is seen as a driving force for machine
learning, offering opportunities for more accurate models and insights across diverse
domains.

5. Cross-Disciplinary Collaboration:

 Perspective: Collaboration between machine learning experts, domain specialists,
ethicists, and policymakers is essential for addressing complex challenges and ensuring
responsible AI development.

6. Explainable AI (XAI):

 Perspective: There is a growing need for machine learning models to be explainable and
interpretable, especially in critical applications such as healthcare and finance, to build
trust and transparency.

1.5 Issues and Challenges in Machine Learning:

1. Ethical Considerations:

 Issue: The ethical implications of machine learning, including bias, fairness, and the
societal impact of AI applications, raise complex challenges that require careful
consideration.

2. Bias and Fairness:

 Issue: Machine learning models can perpetuate and amplify biases present in training
data, leading to unfair and discriminatory outcomes.

3. Data Privacy:

 Issue: Concerns about the privacy of personal data used in machine learning models
highlight the need for robust data protection measures and compliance with privacy
regulations.

4. Security Vulnerabilities:

 Issue: Machine learning models are susceptible to adversarial attacks, highlighting the
need for enhanced security measures to protect against manipulation and exploitation.

5. Interpretability:
 Issue: The lack of interpretability in complex models, such as deep neural networks,
poses challenges in understanding model decisions, particularly in sensitive
applications.

6. Regulatory Frameworks:

 Issue: The absence of comprehensive and standardized regulatory frameworks for AI
and machine learning can lead to uncertainties in legal and ethical considerations.

7. Resource Intensiveness:

 Issue: Training and deploying resource-intensive machine learning models, especially
deep learning models, pose challenges in terms of computational resources and energy
consumption.

8. Excessive Reliance on Big Data:

 Issue: The assumption that more data always leads to better models may overlook the
quality of data, and reliance on massive datasets may not be feasible or ethical in certain
situations.

9. Generalization Challenges:

 Issue: Ensuring that machine learning models generalize well to unseen data and diverse
scenarios remains a significant challenge, especially in dynamic environments.

10. Accountability and Transparency:

 Issue: Establishing accountability and ensuring transparency in machine learning
processes, including model development and decision-making, is critical for responsible
AI deployment.

Navigating these perspectives and addressing the associated issues requires ongoing interdisciplinary
collaboration, ethical considerations, and a commitment to developing machine learning solutions that
benefit society while minimizing potential risks and biases.

1.6 Types of Learning


In the realm of machine learning, algorithms are broadly categorized as:

1. Supervised Learning:

Supervised learning involves providing a machine learning algorithm with a training set of examples,
each paired with correct responses or targets. The algorithm generalizes from this training set to
respond correctly to new, unseen inputs.

Characteristics:

 Training set includes input-output pairs.


 Algorithm learns a function mapping inputs to outputs.

 Used for both classification and regression problems.

 A teacher supervises the learning process, correcting the algorithm iteratively.

Example: Consider patient data with gender, age, and health status labels. The algorithm learns to
predict health status based on gender and age.
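The patient example above can be sketched as a minimal supervised learner. The tiny dataset, the labels "fit"/"unfit", and the 1-nearest-neighbour rule are all hypothetical choices for illustration, not the only way to learn this mapping.

```python
# Hypothetical toy dataset: (gender, age) -> health status label.
train = [
    ("M", 25, "fit"), ("F", 30, "fit"), ("M", 62, "unfit"),
    ("F", 58, "unfit"), ("F", 41, "fit"), ("M", 70, "unfit"),
]

def predict(gender, age):
    """1-nearest-neighbour on age (matching gender breaks ties):
    a minimal supervised learner mapping inputs to a labelled output."""
    nearest = min(train, key=lambda row: (abs(row[1] - age), row[0] != gender))
    return nearest[2]

print(predict("F", 28))   # nearest to the young, "fit" examples
print(predict("M", 65))   # nearest to the old, "unfit" examples
```

The "teacher" here is the labelled column of the training set: every prediction can be checked against a known correct response.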

2. Unsupervised Learning:

Unsupervised learning involves algorithms without labeled responses. The algorithm aims to identify
similarities between inputs, grouping together those with common characteristics. A common approach
is density estimation.

Characteristics:

 No labeled responses in the training data.

 Focus on finding hidden patterns or grouping in data.

 Cluster analysis is a common unsupervised learning method.

 Evaluation is challenging as there are no output values for comparison.

Example: Using patient data with gender and age but without health status labels, the algorithm seeks
to uncover patterns or groupings.
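The grouping idea can be sketched with a minimal 2-means clustering on ages alone; the data and starting centres are hypothetical, and cluster analysis in practice would use a library implementation.

```python
ages = [22, 25, 27, 61, 64, 70]   # hypothetical patient ages, no labels

def kmeans_1d(data, c1, c2, iters=10):
    """Minimal 2-means clustering in one dimension: assign each point to the
    nearer centre, then move each centre to the mean of its group."""
    for _ in range(iters):
        g1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in data if abs(x - c1) > abs(x - c2)]
        c1 = sum(g1) / len(g1)   # assumes neither group becomes empty
        c2 = sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

young, old = kmeans_1d(ages, 20.0, 80.0)
print(young, old)   # the algorithm separates a younger and an older group
```

No labels were supplied; the two groups emerge purely from similarity between inputs, which is the defining trait of unsupervised learning.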

3. Semi-Supervised Learning:

Semi-supervised learning is a type of machine learning where the algorithm is trained on a dataset that
contains both labeled and unlabeled data. While a subset of the training data has explicit labels, a larger
portion remains unlabeled. The algorithm leverages both labeled and unlabeled examples to improve
its learning and generalization capabilities.

Characteristics:

1. Combination of Labeled and Unlabeled Data:

 Semi-supervised learning utilizes a mix of data with known labels and data without
labels in the training set.

2. Cost-Efficient Labeling:

 Suited for scenarios where obtaining labeled data is expensive or time-consuming.
Semi-supervised learning allows for more extensive use of unlabeled data, reducing the
need for extensive labeling efforts.

3. Improved Generalization:
 By learning from both labeled and unlabeled instances, the algorithm aims to generalize
better to new, unseen data.

4. Flexible Learning Paradigm:

 Adaptable to different levels of labeling availability, making it suitable for practical
applications where obtaining fully labeled datasets may be challenging.

5. Application in Real-World Scenarios:

 Commonly applied when labeled data is scarce, as is often the case in various domains,
such as medical imaging, speech recognition, and natural language processing.

Example:

Scenario: Fraud Detection in Financial Transactions

In a dataset of financial transactions, only a small subset is labeled as fraudulent. Collecting labeled
instances of fraud is a costly and time-consuming process. Semi-supervised learning can be employed
by using the limited labeled instances of fraud along with a larger pool of unlabeled transactions.

 Labeled Data:

 A small number of transactions labeled as either "fraudulent" or "non-fraudulent."

 Unlabeled Data:

 A larger set of transactions without explicit labels.

The semi-supervised learning algorithm learns patterns from the labeled instances of fraud and non-
fraud, and it generalizes this knowledge to make predictions on the unlabeled transactions. This
approach allows for more efficient and cost-effective fraud detection compared to a fully supervised
model, as it benefits from a more extensive dataset while minimizing the need for exhaustive labeling.
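The fraud scenario above can be sketched as a tiny self-training loop, one common semi-supervised strategy: fit a model on the labelled data, pseudo-label the unlabeled pool, and refit on both. The transaction amounts and the one-dimensional threshold classifier are hypothetical simplifications.

```python
# Hypothetical transaction amounts; only a few are labelled.
labeled = [(20, "ok"), (35, "ok"), (900, "fraud"), (1100, "fraud")]
unlabeled = [15, 40, 50, 950, 1000, 1200]

def midpoint_threshold(pairs):
    """Fit a 1-D threshold classifier: the midpoint between class means."""
    ok = [a for a, y in pairs if y == "ok"]
    fr = [a for a, y in pairs if y == "fraud"]
    return (sum(ok) / len(ok) + sum(fr) / len(fr)) / 2

t = midpoint_threshold(labeled)
# Self-training: pseudo-label the unlabeled pool with the current model...
pseudo = [(a, "fraud" if a > t else "ok") for a in unlabeled]
# ...then retrain on labelled + pseudo-labelled data together.
t2 = midpoint_threshold(labeled + pseudo)
print(round(t, 2), round(t2, 2))
```

The retrained threshold is informed by all ten transactions even though only four were ever labelled by hand.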

4. Reinforcement Learning:

Definition: Reinforcement learning lies between supervised and unsupervised learning. The algorithm
receives feedback when its responses are incorrect but is not explicitly told how to correct them. It
explores different possibilities until it learns how to achieve the correct answer.

Characteristics:

 Learns through trial and error.

 The agent receives rewards or penalties for actions.

 Actions may impact future rewards and situations.

 Described as learning with a critic, as it discovers optimal actions through exploration.

Example: Teaching a dog a new trick involves rewarding or punishing based on its actions. Similarly,
reinforcement learning trains computers for tasks where explicit instructions are not provided.
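The trial-and-error idea can be sketched with a minimal two-armed bandit: the agent is never told which arm is better, only rewarded or not after each pull, and it discovers the better action through exploration. The payout probabilities, learning rate, and exploration rate are hypothetical.

```python
import random

random.seed(1)
payout = [0.2, 0.8]   # hypothetical bandit: arm 1 pays off more often

def pull(arm):
    """Environment feedback: reward 1 with the arm's payout probability, else 0."""
    return 1 if random.random() < payout[arm] else 0

q = [0.0, 0.0]   # learned value estimate per arm
for step in range(2000):
    # Explore 10% of the time, otherwise exploit the current best estimate.
    arm = random.randrange(2) if random.random() < 0.1 else q.index(max(q))
    reward = pull(arm)
    q[arm] += 0.05 * (reward - q[arm])   # move estimate toward observed reward

print([round(v, 2) for v in q])   # q[1] should end up higher than q[0]
```

No supervisor ever tells the agent the correct arm; the "critic" is only the stream of rewards, which is exactly the reinforcement learning setting described above.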
These four types of learning represent fundamental approaches in machine learning, each addressing
distinct challenges and applications. Supervised learning focuses on labeled data, unsupervised
learning explores data patterns without labels, semi-supervised learning combines the two, and
reinforcement learning involves learning optimal actions through exploration and feedback.

1.7 Review of probability

In random experiments, we are interested in the numerical outcomes, i.e., numbers associated with the
outcomes of the experiment. For example, when 50 coins are tossed, we ask for the number of heads.
Whenever we associate a real number with each outcome of a trial, we are dealing with a function
whose range is a set of real numbers. Such a function is called a random variable (r.v.), chance
variable, stochastic variable or simply a variate.
Definition: Quantities which vary with some probabilities are called random variables.

Definition: By a random variable we mean a real number associated with the outcomes of a random
experiment.

Example 1.1: Suppose two coins are tossed simultaneously; the sample space is S = {HH, HT, TH,
TT}. Let X denote the number of heads. If X = 0 the outcome is {TT} and P(X = 0) = 1/4. If X = 1 the
outcome is {HT, TH} and P(X = 1) = 2/4. If X = 2 the outcome is {HH} and P(X = 2) = 1/4. The
probability distribution of this random variable X is given by the following table:

X = x       0     1     2     Total
P(X = x)    1/4   2/4   1/4   1
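The distribution in Example 1.1 can be checked by enumerating the sample space directly; exact fractions keep the probabilities in the same form as the table.

```python
from fractions import Fraction
from itertools import product

# Enumerate the sample space of two tosses and tabulate X = number of heads.
space = list(product("HT", repeat=2))          # HH, HT, TH, TT
dist = {x: Fraction(0) for x in (0, 1, 2)}
for outcome in space:
    dist[outcome.count("H")] += Fraction(1, len(space))

print(dist)   # X = 0, 1, 2 with probabilities 1/4, 2/4, 1/4
```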

Example 1.2: Out of 24 mangoes 6 are rotten; 2 mangoes are drawn. Obtain the probability distribution
of the number of rotten mangoes that can be drawn.

Let X denote the number of rotten mangoes drawn; then X can take the values 0, 1, 2.

P(X = 0) = 18C2 / 24C2 = (18 × 17) / (24 × 23) = 51/92

P(X = 1) = (18C1 × 6C1) / 24C2 = (18 × 6 × 2) / (24 × 23) = 9/23

P(X = 2) = 6C2 / 24C2 = (6 × 5) / (24 × 23) = 5/92

X = x       0       1      2      Total
P(X = x)    51/92   9/23   5/92   1
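The hypergeometric computation in Example 1.2 can be verified with `math.comb` and exact fractions:

```python
from fractions import Fraction
from math import comb

good, rotten, drawn = 18, 6, 2   # 24 mangoes, 6 rotten, 2 drawn

def p(k):
    """P(X = k): probability that k of the 2 drawn mangoes are rotten."""
    return Fraction(comb(rotten, k) * comb(good, drawn - k),
                    comb(good + rotten, drawn))

print([p(k) for k in range(3)])   # 51/92, 9/23, 5/92; the three sum to 1
```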

Types of Random Variables:

There are two types of random variables:


(i) Discrete random variables (ii) Continuous random variables
Distribution function: Let X be a one-dimensional random variable. The function F defined for all x
by the equation F(x) = P(X ≤ x) is called the cumulative distribution function of X.

Note 1: We write c.d.f. for cumulative distribution function; often simply d.f. is written instead of c.d.f.

Note 2: The suffix X in F_X is used to emphasize the fact that the distribution function is associated
with the random variable X. When the underlying variate is clear from the context, we shall simply
write F(x) instead of F_X(x).

Note 3: Tail events: let x be any real number; then the events {X < x}, {X > x}, {X ≤ x} and {X ≥ x}
are called tail events. For distinction, we may label them open, closed, upper and lower tails. Often,
simple r.v.'s are expanded as linear combinations of tail events.

Some Properties of a c. d. f.:

1) P{a < X ≤ b} = F(b) − F(a)  (Interval property)

2) 0 ≤ F(x) ≤ 1 for all x ∈ R  (Boundedness property)

3) F is non-decreasing, i.e., if x ≤ y then F(x) ≤ F(y)  (Monotone increasing property)

4) lim F(x) = 0 as x → −∞ and lim F(x) = 1 as x → ∞  (Limits property)

5) F is continuous from the right at each point, i.e., F(x+) = F(x)  (Right continuity property)

6) F(x) − F(x−) = P(X = x)  (Jump discontinuity)

Conditions (3), (4) and (5) are necessary as well as sufficient for F to be a c.d.f. on R.

Discrete Random variables:

Quantities which can take only integer values are called discrete random variables.

Examples:

The number of children in a family of a colony.


The number of rooms in the houses of a township.

Probability mass function (probability distribution):

Definition: Let X be a discrete random variable taking values x = 0, 1, 2, 3, .... Then P(X = x) is called
the probability mass function of X, and it satisfies the following:

(i) P(X = x) ≥ 0

(ii) ∑_x P(X = x) = 1, the sum taken over all values x.

Discrete distribution function:

A r.v. X is said to be discrete if there exist a countable number of points x1, x2, x3, . . . and numbers
p(x_i) ≥ 0 with ∑_i p(x_i) = 1 such that

F(x) = ∑_{x_i ≤ x} p(x_i).

Finite equiprobable space (Uniform space)

A finite equiprobable space is a finite probability distribution in which each sample point x1, x2, x3,
. . ., xn has the same probability, i.e.,

P(X = x_i) = p_i = a constant for all i, and ∑_i p_i = 1.

Example 1.3: A random variable X has the following probability distribution:

X       0   1    2    3    4    5     6     7     8
P(x)    k   3k   5k   7k   9k   11k   13k   15k   17k

(a) Determine the value of k
(b) Find P(X < 4), P(X ≥ 5), P(0 < X < 4)
(c) Find the c.d.f.
(d) Find the smallest value of x for which P(X ≤ x) > 0.5

[Ans. k = 1/81; P(X < 4) = 16/81; P(X ≥ 5) = 56/81; P(0 < X < 4) = 15/81;
F(5) = 0.44 and F(6) = 0.61, so the smallest such x is 6]
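The answers to Example 1.3 can be reproduced exactly: normalizing the weights k, 3k, ..., 17k fixes k, and the c.d.f. follows from cumulative sums.

```python
from fractions import Fraction

# P(X = x) = (2x + 1)k for x = 0..8, so the weights are k, 3k, ..., 17k.
weights = [2 * x + 1 for x in range(9)]
k = Fraction(1, sum(weights))                # probabilities must sum to 1
p = [w * k for w in weights]

print(k)                                     # 1/81
print(sum(p[:4]))                            # P(X < 4) = 16/81
print(sum(p[1:4]))                           # P(0 < X < 4) = 15/81
F = [sum(p[:i + 1]) for i in range(9)]       # c.d.f. values F(0)..F(8)
print(float(F[5]), float(F[6]))              # ≈ 0.44 and ≈ 0.60, so x = 6 is
                                             # the smallest x with F(x) > 1/2
```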

Expectation: The behaviour of a r.v., either discrete or continuous, is completely characterized by the
distribution function F(x) or the density f(x) [P(x_i) in the discrete case]. Instead of a function, a more
compact description can be made by a single number such as the mean (expectation), median and
mode, known as measures of central tendency of the r.v. X.

The expectation, or mean, or expected value of a r.v. X, denoted by E[X] or μ, is defined as

E[X] = ∑_x x P(X = x), if X is discrete;

E[X] = ∫_{−∞}^{∞} x f(x) dx, if X is continuous.
Definition: Variance: Variance characterizes the variability in a distribution, since two distributions
with the same mean can still have different dispersion of data about their means. The variance of a
r.v. X is

σ² ≡ E[(X − μ)²] = ∑_x (x − μ)² f(x) for X discrete;

σ² ≡ E[(X − μ)²] = ∫ (x − μ)² f(x) dx for X continuous.

Standard Deviation: The standard deviation, denoted by σ (S.D.), is the positive square root of the
variance.

For the discrete case,

σ² = E[(X − μ)²] = ∑_x (x² − 2μx + μ²) f(x)
   = ∑_x x² f(x) − 2μ ∑_x x f(x) + μ² ∑_x f(x)
   = E(X²) − 2μ² + μ² = E(X²) − μ²,

since μ = ∑_x x f(x) and ∑_x f(x) = 1.
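The shortcut σ² = E(X²) − μ² can be checked on the two-coin distribution of Example 1.1:

```python
from fractions import Fraction

# Distribution of X = number of heads in two tosses (Example 1.1).
dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

mean = sum(x * p for x, p in dist.items())        # E[X]
ex2 = sum(x * x * p for x, p in dist.items())     # E[X^2]
var = ex2 - mean ** 2                             # E[X^2] - mu^2

print(mean, var)   # E[X] = 1 and Var(X) = 1/2
```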

Continuous Random variables:

Let X be a continuous random variable taking values x, a ≤ x ≤ b. Its probability density function
(p.d.f.) f(x) satisfies the following:

(i) f(x) ≥ 0   (ii) ∫_a^b f(x) dx = 1.

Note: 1. For a continuous variate, point probabilities are zero.

2. The area under the probability curve y = f(x) is unity; the fact that f(x) ≥ 0 implies the
graph of f(x) lies above the x-axis.

3. The area under the probability curve y = f(x) bounded by x = a, x = b is simply
P(a ≤ x ≤ b).

4. Relation between p.d.f. and c.d.f.: the density f and the c.d.f. F are always connected by

(a) F(x) = ∫_{−∞}^{x} f(t) dt ∀ x ∈ R   (b) d/dx [F(x)] = f(x) ∀ x ∈ R.

Moments: If the range of the probability density function is from −∞ to ∞, the r-th moment about the
origin is defined as

μ′_r = ∫_{−∞}^{∞} x^r f(x) dx.

The r-th moment about an arbitrary origin a is

μ_r = ∫_{−∞}^{∞} (x − a)^r f(x) dx.

The mean is given by (taking the moment about x = 0)

μ′_1 = ∫_{−∞}^{∞} x f(x) dx.

The variance σ² is given by

σ² = μ′_2 − μ′_1² = ∫_{−∞}^{∞} x² f(x) dx − [∫_{−∞}^{∞} x f(x) dx]².
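The moment formulas can be checked numerically with a simple midpoint Riemann sum, here on the uniform density over [0, 1], whose mean is 1/2 and variance 1/3 − 1/4 = 1/12. The function names are illustrative.

```python
def moment(f, r, a, b, n=100000):
    """Approximate the r-th moment about the origin, the integral of x^r f(x),
    by a midpoint Riemann sum over [a, b]."""
    h = (b - a) / n
    mids = (a + (i + 0.5) * h for i in range(n))
    return sum((x ** r) * f(x) * h for x in mids)

uniform = lambda x: 1.0   # density of U(0, 1) on its support [0, 1]

m1 = moment(uniform, 1, 0.0, 1.0)   # first moment (mean) -> 1/2
m2 = moment(uniform, 2, 0.0, 1.0)   # second moment -> 1/3
print(round(m1, 4), round(m2 - m1 ** 2, 4))   # variance = 1/3 - 1/4 = 1/12
```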
Jointly Distributed Random Variables:

When the outcome of a random experiment can be characterized in more than one way, the probability
density is a function of more than one variate.

Example 1.4: When a card is drawn from an ordinary deck, it may be characterized according to its
suit: let X assume the values 1, 2, 3, 4 corresponding to clubs, diamonds, hearts and spades in some
order, and let Y be a variate that assumes the values 1, 2, 3, . . ., 13 corresponding to the denominations
Ace, 2, 3, . . ., 10, J, Q, K. Then (X, Y) is a 2-dimensional variate. The probability of drawing a
particular card will be denoted by f(x, y), and if each card is equally likely to be drawn, the density of
(X, Y) is

f(x, y) = 1/52, ∀ 1 ≤ x ≤ 4, ∀ 1 ≤ y ≤ 13.

Trials whose outcomes can be characterized by two (three) variates give rise to bivariate (trivariate)
distributions etc. Extensions to n-variate distributions are fairly straightforward.

Joint discrete Distribution Function:

The joint c.d.f. of X and Y is said to be discrete if there exists a non-negative function P such that P
vanishes everywhere except at a finite or countably infinite number of points in the plane, and at such
points (x, y), P(x, y) = P(X = x, Y = y), for all x, y ∈ R.

Let X and Y have a joint discrete distribution. A function P which does not vanish on the set {(x_i, y_j) :
i, j = 1, 2, 3, . . .} and satisfies the following properties:

(i) P(x_i, y_j) ≥ 0 for all i, j = 1, 2, 3, . . . and (ii) ∑_{i=1}^{∞} ∑_{j=1}^{∞} P(x_i, y_j) = 1

is called the joint probability (mass) function of X and Y, or simply the joint probability function.

Individual and Marginal Probability Functions:

Let X and Y be two jointly distributed variables with joint discrete density P(x , y), the individual
variates X and Y themselves are random variables.

The individual distributions of X and Y are called marginal distributions of X and Y


(i) The marginal probability function for X is denoted by P_X(x) or P(x) and is given by

P(x) = P(X = x) = ∑_y P(X = x, Y = y) = ∑_y P(x, y).

(ii) The marginal probability function for Y is denoted by P_Y(y) and is given by

P(y) = P(Y = y) = ∑_x P(X = x, Y = y) = ∑_x P(x, y).

Note: It is convenient to display the probability function of a bivariate distribution in a rectangular


array, in which the row totals and column totals provide the marginal probability functions of X and Y
respectively.
y →         y1      y2      y3      ...   yj      ...   ym      P(X = x_i)
x ↓
x1          P11     P12     P13     ...   P1j     ...   P1m     P(x1)
x2          P21     P22     P23     ...   P2j     ...   P2m     P(x2)
...         ...     ...     ...     ...   ...     ...   ...     ...
xi          Pi1     Pi2     Pi3     ...   Pij     ...   Pim     P(xi)
...         ...     ...     ...     ...   ...     ...   ...     ...
xn          Pn1     Pn2     Pn3     ...   Pnj     ...   Pnm     P(xn)
P(Y = y_j)  P(y1)   P(y2)   P(y3)   ...   P(yj)   ...   P(ym)   1

Table 2.1: Joint Probability Table with Marginal Totals

We have here P_ij = P(X = x_i, Y = y_j); P(x_i) = ∑_j P_ij and P(y_j) = ∑_i P_ij.

Conditional Probability functions (cond. p.f.):

Let X and Y have a joint discrete distribution with associated probability function P. Let the possible
values of X be {x1, x2, x3, . . ., xi, . . .} and those of Y be {y1, y2, y3, . . ., yj, . . .} respectively.

The conditional probability function of X, given Y = y_j, denoted by P_{X/y_j}(x_i / y_j), is defined by

P_{X/y_j}(x_i / y_j) = P(x_i, y_j) / P_Y(y_j) for i, j = 1, 2, 3, . . .
                     = 0 if P_Y(y_j) = 0.

The conditional probability function of Y, given X = x_i, denoted by P_{Y/x_i}(y_j / x_i), is defined by

P_{Y/x_i}(y_j / x_i) = P(x_i, y_j) / P_X(x_i) for i, j = 1, 2, 3, . . .
                     = 0 if P_X(x_i) = 0.

Here P(x_i, y_j) = P(X = x_i, Y = y_j); P(Y = y_j) = P_Y(y_j) and P(X = x_i) = P_X(x_i).
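Marginal and conditional probability functions can be computed directly from a joint table; the joint p.m.f. below is a hypothetical example chosen so the four entries sum to 1.

```python
from fractions import Fraction

# Hypothetical joint p.m.f. P(X = x, Y = y) over x, y in {1, 2}.
joint = {
    (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 4),
    (2, 1): Fraction(1, 8), (2, 2): Fraction(1, 2),
}

def marginal_x(x):
    """Row total: P_X(x) = sum over y of P(x, y)."""
    return sum(p for (xi, y), p in joint.items() if xi == x)

def marginal_y(y):
    """Column total: P_Y(y) = sum over x of P(x, y)."""
    return sum(p for (x, yi), p in joint.items() if yi == y)

def cond_x_given_y(x, y):
    """P(X = x | Y = y) = P(x, y) / P_Y(y), zero when P_Y(y) = 0."""
    py = marginal_y(y)
    return joint.get((x, y), Fraction(0)) / py if py else Fraction(0)

print(marginal_x(1), marginal_y(2), cond_x_given_y(1, 2))
```

The row and column totals reproduce the marginal probability functions exactly as in the rectangular array above.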

Joint continuous distribution function:

A 2-dimensional random vector (X, Y) is called a continuous random vector if there exists a function
f(x, y) ≥ 0 such that for −∞ < x, y < ∞, the c.d.f. of (X, Y), given by

F(x, y) = ∫_{−∞}^{x} [∫_{−∞}^{y} f(u, v) dv] du,

is continuous. The function f(x, y) is called the joint p.d.f. of (X, Y).

Some properties of the joint density: Let f(x, y) ≥ 0 be the joint p.d.f. of a continuous random vector
(X, Y) and F(x, y) be the c.d.f. of (X, Y); then the following properties hold:

(i) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

(ii) P{a < X ≤ b, c < Y ≤ d} = ∫_a^b [∫_c^d f(x, y) dy] dx

(iii) f(x, y) = ∂²F(x, y) / ∂x ∂y.

Individual or Marginal Distributions: Let (X, Y) be a continuous random vector with joint c.d.f. F and
joint p.d.f. f. Then

F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du.

Definition: Let (X, Y) be a 2-dimensional continuous random vector with joint p.d.f. f(x, y). Then the
individual or marginal distributions of X and Y are defined by the p.d.f.'s

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy and f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx.

On observation, we have P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx = ∫_a^b [∫_{−∞}^{∞} f(x, y) dy] dx.

Conditional Distribution Function: The conditional c.d.f. of a variate X, given Y = y, is written

F_{X/Y}(x/y) = lim_{ε→0+} P{X ≤ x / y − ε ≤ Y ≤ y + ε}.

The conditional p.d.f. of X given Y = y, written f_{X/Y}(x/y) ∀ x ∈ R, is a non-negative function
satisfying

F_{X/Y}(x/y) = ∫_{−∞}^{x} f_{X/Y}(t/y) dt ∀ x ∈ R.

Note: The conditional p.d.f. f(x/y) is given by f_{X/Y}(x/y) = f(x, y) / f_Y(y), where f_Y(y) is the
marginal p.d.f. of Y, f_Y(y) > 0, and is continuous.


Chapter 6 Linear Algebra
6.1 Linear Equations

Elementary algebra, using the rules of completion and balancing developed by al-Khwarizmi, allows
us to determine the value of an unknown variable x that satisfies an equation like the one below:

10x − 5 = 15 + 5x

An equation like this that only involves an unknown (like x) and not its higher powers (x², x³),
along with additions (or subtractions) of the unknown multiplied by numbers (like 10x and 5x), is
called a linear equation. We now know, of course, that the equation above can be converted to a
special form ("number multiplied by unknown equals number", or ax = b, where a and b are
numbers):

5x = 20

Once in this form, it becomes easy to see that x = b/a = 4. Linear algebra is, in essence, concerned
with the solution of several linear equations in several unknowns.

6.2 What is a system of linear equations?

A system of m linear equations in n unknown variables x1, x2, x3, . . ., xn is a collection of m
equations of the form

a11 x1 + a12 x2 + a13 x3 + … + a1n xn = b1
a21 x1 + a22 x2 + a23 x3 + … + a2n xn = b2
a31 x1 + a32 x2 + a33 x3 + … + a3n xn = b3
  :        :        :              :
am1 x1 + am2 x2 + am3 x3 + … + amn xn = bm

The numbers a_ij are called the coefficients of the linear system; because there are m
equations and n unknown variables, there are therefore m × n coefficients. The main
problem with a linear system is of course to solve it:

Problem 6.1: Find a list of n numbers ( s1 , s 2 , … , sn ) that satisfy the above system of
linear equations

In other words, if we substitute the list of numbers ( s1 , s 2 , … , s n) for the unknown


variables ( x 1 , x 2 , x 3 , … ., x n) in the above equation then the left-hand side of the ith
equation will equal bi. We call such a list ( s1 , s 2 , … , s n) a solution to the system of
equations. Notice that we say “a solution” because there may be more than one. The
set of all solutions to a linear system is called its solution set. As an example of a
linear system, below is a linear system consisting of m = 2 equations and n = 3
unknowns:
x1 − 5x2 − 7x3 = 0
5x2 + 11x3 = 1

Here is a linear system consisting of m = 3 equations and n = 2 unknowns:


−5x1 + x2 = −1
πx1 − √5 x2 = 0
63x1 − 2x2 = −7

And finally, below is a linear system consisting of m = 4 equations and n = 6


unknowns:

−5x1 + x3 − 44x4 − 55x6 = −1
πx1 − 5x2 − x3 + 4x4 − 5x5 + √5 x6 = 0
63x1 − √2 x2 − (1/5)x3 + ln(3) x4 + 4x5 − (1/33)x6 = 0
63x1 − √2 x2 − (1/5)x3 − (1/8)x4 − 5x6 = 5

Problem 6.2. Verify that (1,2,−4) is a solution to the system of equations

2x1 + 2x2 + x3 = 2
x1 + 3x2 − x3 = 11

Is (1,−1,2) a solution to the system?

Solution. The number of equations is m = 2 and the number of unknowns is n = 3.
There are m × n = 6 coefficients: a11 = 2, a12 = 2, a13 = 1, a21 = 1, a22 = 3, a23 = −1. And b1 = 2
and b2 = 11. The list of numbers (1, 2, −4) is a solution because

2(1) + 2(2) + (−4) = 2

(1) + 3(2) − (−4) = 11

On the other hand, for (1, −1, 2) we have that

2(1) + 2(−1) + (2) = 2

but

1 + 3(−1) − 2 = −4 ≠ 11.

Thus, (1, −1, 2) is not a solution to the system.
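The substitution check in Problem 6.2 mechanizes naturally: a list is a solution exactly when every row's dot product with it equals the corresponding b_i. A small sketch of that check:

```python
# Coefficients and right-hand side of the system in Problem 6.2.
A = [[2, 2, 1],
     [1, 3, -1]]
b = [2, 11]

def is_solution(A, b, s):
    """Check that substituting s makes every left-hand side equal its b_i."""
    return all(sum(a * x for a, x in zip(row, s)) == bi
               for row, bi in zip(A, b))

print(is_solution(A, b, (1, 2, -4)))   # True: both equations are satisfied
print(is_solution(A, b, (1, -1, 2)))   # False: second equation gives -4, not 11
```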

A linear system may not have a solution at all. If this is the case, we say that the
linear system is inconsistent:
INCONSISTENT ⇔ NO SOLUTION

A linear system is called consistent if it has at least one solution:

CONSISTENT ⇔ AT LEAST ONE SOLUTION

We will see shortly that a consistent linear system will have either just one solution
or infinitely many solutions. For example, a linear system cannot have just 4 or 5
solutions. If it has multiple solutions, then it will have infinitely many solutions.

Problem 6.3. Show that the linear system does not have a solution.

−x1 + x2 = 3
x1 − x2 = 1.

Solution. If we add the two equations we get

0=4

which is a contradiction. Therefore, there does not exist a list ( s1 , s 2) that satisfies the
system because this would lead to the contradiction 0 = 4.

Problem 6.4. Let t be an arbitrary real number and let


s1 = −3/2 − 2t
s2 = 3/2 + t
s3 = t

Show that for any choice of the parameter t, the list ( s1 , s 2 , s 3) is a solution to the linear system

x1 + x2 + x3 = 0
x1 + 3x2 − x3 = 3.

Solution. Substitute the list (s1, s2, s3) into the left-hand side of the first equation:

(−3/2 − 2t) + (3/2 + t) + (t) = 0

and into the second:

(−3/2 − 2t) + 3(3/2 + t) − (t) = 6/2 = 3.

Both equations are satisfied for any value of t. Because we can vary t arbitrarily, we get an
infinite number of solutions parameterized by t. For example, compute the list (s1, s2, s3) for t
= 3 and confirm that the resulting list is a solution to the linear system.

We will use matrices to develop systematic methods to solve linear systems and to study the
properties of the solution set of a linear system. Informally speaking, a matrix is an array or
table consisting of rows and columns.
For example

    [  1  −2   1   0 ]
A = [  0   2  −8   8 ]
    [ −4   7  11  −5 ]
is a matrix having m = 3 rows and n = 4 columns. In general, a matrix with m rows and n
columns is an m × n matrix, and the set of all such matrices will be denoted by M m×n. Hence, A
above is a 3 × 4 matrix. The entry of A in the ith row and jth column will be denoted by a ij. A
matrix containing only one column is called a column vector and a
matrix containing only one row is called a row vector. For example, here is a row vector

u = [ 1  −3  4 ]

and here is a column vector

v = [ −3 ]
    [  1 ]
We can associate to a linear system three matrices: (1) the coefficient matrix, (2) the output
column vector, and (3) the augmented matrix. For example, for the linear system
5 x 1−3 x2 +8 x 3=−1

x 1+ 4 x 2−6 x 3=0

2 x 2+ 4 x 3=3

the coefficient matrix A, the output vector b, and the augmented matrix [A b] are:

    [ 5  −3   8 ]       [ −1 ]           [ 5  −3   8  −1 ]
A = [ 1   4  −6 ] , b = [  0 ] , [A b] = [ 1   4  −6   0 ]
    [ 0   2   4 ]       [  3 ]           [ 0   2   4   3 ]

If a linear system has m equations and n unknowns then the coefficient matrix A must be an m × n
matrix, that is, A has m rows and n columns. Using our previously defined notation, we can
write this as A ∈ M m×n.
If we are given an augmented matrix, we can write down the associated linear system in an obvious
way. For example, the linear system associated to the augmented matrix

[ 1  4  −2   8  12 ]
[ 0  1  −7   2  −4 ]
[ 0  0   5  −1   7 ]

is

x1 + 4 x2 − 2 x3 + 8 x4 = 12
x2 − 7 x3 + 2 x4 = −4
5 x3 − x4 = 7.
We can study matrices without interpreting them as coefficient matrices or augmented matrices
associated to a linear system. Matrix algebra is a fascinating subject with numerous applications
in every branch of engineering, medicine, statistics, mathematics, finance, biology, chemistry,
etc.

6.3 Solving linear systems


In algebra, you learned to solve equations by first “simplifying” them using operations that do
not alter the solution set. For example, to solve 2x = 8 − 2x we can add 2x to both sides and
obtain 4x = 8, and then multiply both sides by 1/4, yielding x = 2. We can do similar operations
on a linear system. There are three basic operations, called elementary operations, that can be
performed:

1. Interchange two equations.

2. Multiply an equation by a nonzero constant.

3. Add a multiple of one equation to another.

These operations do not alter the solution set. The idea is to apply these operations iteratively to
simplify the linear system to a point where one can easily write down the solution set. It is
convenient to apply elementary operations on the augmented matrix [A b] representing the
linear system. In this case, we call the operations elementary row operations, and the process of
simplifying the linear system using these operations is called row reduction. The goal with row
reducing is to transform the original linear system into one having a triangular structure and
then perform back substitution to solve the system. This is best explained via an example.

Problem 6.5. Use back substitution on the augmented matrix

[ 1  0  −2  −4 ]
[ 0  1  −1   0 ]
[ 0  0   1   1 ]

to solve the associated linear system.


Solution. Notice that the augmented matrix has a triangular structure. The third row corresponds
to the equation x 3=1. The second row corresponds to the equation x 2−x 3=0 and therefore
x 2=x 3=1. The first row corresponds to the equation x 1−2 x3 =−4
and therefore
x 1=−4+2 x 3=−4 +2=−2.
Therefore, the solution is (−2,1,1).
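The back substitution carried out above can be sketched in a few lines of Python (a minimal sketch of my own, assuming the augmented matrix is already triangular with leading 1s):

```python
# Back substitution for an augmented matrix [A | b] already in triangular
# form with 1s on the diagonal, as in Problem 6.5.

def back_substitute(aug):
    n = len(aug)
    x = [0] * n
    for i in range(n - 1, -1, -1):
        # x_i = b_i minus the contribution of the already-solved unknowns
        x[i] = aug[i][-1] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
    return x

aug = [[1, 0, -2, -4],
       [0, 1, -1,  0],
       [0, 0,  1,  1]]
print(back_substitute(aug))  # [-2, 1, 1]
```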
Problem 6.6. Solve the linear system using elementary row operations.

−3 x 1+ 2 x 2 +4 x 3=12

x 1−2 x3 =−4

2 x1 −3 x 2 + 4 x 3=−3

Solution. Our goal is to perform elementary row operations to obtain a triangular structure
and then use back substitution to solve. The augmented matrix is

[ −3   2   4  12 ]
[  1   0  −2  −4 ]
[  2  −3   4  −3 ]

Interchange Row 1 (R1) and Row 2 (R2):

[ −3   2   4  12 ]                 [  1   0  −2  −4 ]
[  1   0  −2  −4 ]  --R1 ↔ R2-->   [ −3   2   4  12 ]
[  2  −3   4  −3 ]                 [  2  −3   4  −3 ]

As you will see, this first operation will simplify the next step. Add 3R1 to R2:

[  1   0  −2  −4 ]                 [  1   0  −2  −4 ]
[ −3   2   4  12 ]  --3R1 + R2-->  [  0   2  −2   0 ]
[  2  −3   4  −3 ]                 [  2  −3   4  −3 ]

Add −2R1 to R3:

[  1   0  −2  −4 ]                  [  1   0  −2  −4 ]
[  0   2  −2   0 ]  --−2R1 + R3-->  [  0   2  −2   0 ]
[  2  −3   4  −3 ]                  [  0  −3   8   5 ]

Multiply R2 by 1/2:

[  1   0  −2  −4 ]                [  1   0  −2  −4 ]
[  0   2  −2   0 ]  --(1/2)R2-->  [  0   1  −1   0 ]
[  0  −3   8   5 ]                [  0  −3   8   5 ]

Add 3R2 to R3:

[  1   0  −2  −4 ]                 [  1   0  −2  −4 ]
[  0   1  −1   0 ]  --3R2 + R3-->  [  0   1  −1   0 ]
[  0  −3   8   5 ]                 [  0   0   5   5 ]

Multiply R3 by 1/5:

[  1   0  −2  −4 ]                [  1   0  −2  −4 ]
[  0   1  −1   0 ]  --(1/5)R3-->  [  0   1  −1   0 ]
[  0   0   5   5 ]                [  0   0   1   1 ]

We can continue row reducing but the row reduced augmented matrix is in triangular form. So
now use back substitution to solve. The linear system associated to the row reduced augmented
matrix is

x 1−2 x3 =−4

x 2−x 3=0

x 3=1

The last equation gives that x 3=1. From the second equation, we obtain that x 2−x 3=0, and thus
x 2=1. The first equation then gives that x 1=−4+2(1)=−2. Thus, the solution to the original
system is (−2, 1, 1). You should verify that (−2, 1, 1) is a solution to the original system

The original augmented matrix of the previous example is:

    [ −3   2   4  12 ]       −3 x1 + 2 x2 + 4 x3 = 12
M = [  1   0  −2  −4 ]   →    x1 − 2 x3 = −4
    [  2  −3   4  −3 ]       2 x1 − 3 x2 + 4 x3 = −3

After row reducing, we obtained the row reduced matrix

    [ 1  0  −2  −4 ]       x1 − 2 x3 = −4
N = [ 0  1  −1   0 ]   →   x2 − x3 = 0
    [ 0  0   1   1 ]       x3 = 1

Although the two augmented matrices M and N are clearly distinct, it is a fact that they have the
same solution set.
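The three elementary row operations can be written as small helper functions (the function names are mine, not from the text) and replayed on the augmented matrix of Problem 6.6:

```python
# The three elementary row operations, applied to a list-of-lists matrix.

def swap(M, i, j):                  # interchange two rows
    M[i], M[j] = M[j], M[i]

def scale(M, i, c):                 # multiply a row by a nonzero constant
    M[i] = [c * a for a in M[i]]

def add_multiple(M, src, c, dst):   # add c * (row src) to (row dst)
    M[dst] = [a + c * b for a, b in zip(M[dst], M[src])]

# Replay the reduction of Problem 6.6:
M = [[-3, 2, 4, 12], [1, 0, -2, -4], [2, -3, 4, -3]]
swap(M, 0, 1)
add_multiple(M, 0, 3, 1)
add_multiple(M, 0, -2, 2)
scale(M, 1, 0.5)
add_multiple(M, 1, 3, 2)
scale(M, 2, 1 / 5)
print(M)  # rows are now (1,0,-2,-4), (0,1,-1,0), (0,0,1,1), matching the text
```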

Problem 6.7. Using elementary row operations, show that the linear system is inconsistent.
x 1+ 2 x 3=1

x 2+ x3 =0

2 x1 + 4 x 3=1

Solution. The augmented matrix is

[ 1  0  2  1 ]
[ 0  1  1  0 ]
[ 2  0  4  1 ]

Perform the operation −2R1 + R3:

[ 1  0  2  1 ]                  [ 1  0  2   1 ]
[ 0  1  1  0 ]  --−2R1 + R3-->  [ 0  1  1   0 ]
[ 2  0  4  1 ]                  [ 0  0  0  −1 ]

The last row of the simplified augmented matrix

[ 1  0  2   1 ]
[ 0  1  1   0 ]
[ 0  0  0  −1 ]

corresponds to the equation

0 x 1+ 0 x2 +0 x 3=−1

Obviously, there are no numbers x 1 , x 2 , x 3 that satisfy this equation, and therefore, the linear
system is inconsistent, i.e., it has no solution. In general, if we obtain a row in an augmented
matrix of the form

[ 0  0  0  …  0  c ]
where c is a nonzero number, then the linear system is inconsistent. We will call this type of row
an inconsistent row. However, a row of the form

[ 0  1  0  0  0 ]
corresponds to the equation x 2=0 which is perfectly valid.
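The inconsistency test just described is mechanical: scan for a row whose coefficients are all zero but whose right-hand side is nonzero. A small sketch (helper name mine):

```python
# Detect an inconsistent row [0 0 ... 0 | c] with c != 0 in an augmented matrix.

def has_inconsistent_row(aug):
    return any(
        all(a == 0 for a in row[:-1]) and row[-1] != 0
        for row in aug
    )

reduced = [[1, 0, 2, 1],
           [0, 1, 1, 0],
           [0, 0, 0, -1]]
print(has_inconsistent_row(reduced))      # True: last row reads 0 = -1
print(has_inconsistent_row([[0, 1, 0, 0, 0]]))  # False: this row is x2 = 0
```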

6.4 Geometric interpretation of the solution set

The set of points ( x 1 , x 2) that satisfy the linear system

x 1−2 x2 =−1

−x 1+ 3 x 2=3

is the intersection of the two lines determined by the equations of the system. The solution for
this system is (3,2). The two lines intersect at the point ( x 1 , x 2) = (3,2), see Figure 6.1.

Figure 6.1: The intersection point of the two lines is the solution of the linear system

Similarly, the solution of the linear system

x 1−2 x2 + x 3=0

2 x 2−8 x3 =8

−4 x 1+5 x 2+ 9 x3 =−9
is the intersection of the three planes determined by the equations of the system. In this case,
there is only one solution: (29, 16, 3). In the case of a consistent system of two equations, the
solution set is the line of intersection of the two planes determined by the equations of the
system, see Figure 6.2.

Figure 6.2: The intersection of the two planes is the solution set of the linear system

6.5 Row Reduction and Echelon Forms

Consider the linear system

x 1+ 5 x 2−2 x 4 −x5 +7 x 6=−4

2 x 2−2 x 3 +3 x 6=0

−9 x 4 −x5 + x 6=−1

5 x 5+ x 6 =5

0=0

having augmented matrix

[ 1  5   0  −2  −1  7  −4 ]
[ 0  2  −2   0   0  3   0 ]
[ 0  0   0  −9  −1  1  −1 ]
[ 0  0   0   0   5  1   5 ]
[ 0  0   0   0   0  0   0 ]

The above augmented matrix has the following properties:

P1. All nonzero rows are above any rows of all zeros.

P2. The leftmost nonzero entry of a row is to the right of the leftmost nonzero entry of the row
above it.

Any matrix satisfying properties P1 and P2 is said to be in row echelon form (REF). In REF, the
leftmost nonzero entry in a row is called a leading entry; in the matrix above, the leading entries
are 1, 2, −9, and 5. A consequence of property P2 is that every entry below a leading entry is zero.

We can perform elementary row operations, or row reduction, to transform a matrix into REF.

Problem 6.8. Explain why the following matrices are not in REF. Use elementary row
operations to put them in REF.

    [ 3  −1  0  3 ]        [ 7  5   0  −3 ]
M = [ 0   0  0  0 ]    N = [ 0  3  −1   1 ]
    [ 0   1  3  0 ]        [ 0  6  −5   2 ]

Solution. Matrix M fails property P1. To put M in REF we interchange R2 with R3:

[ 3  −1  0  3 ]                [ 3  −1  0  3 ]
[ 0   0  0  0 ]  --R2 ↔ R3-->  [ 0   1  3  0 ]
[ 0   1  3  0 ]                [ 0   0  0  0 ]

The matrix N fails property P2. To put N in REF we perform the operation −2R2 + R3 → R3:

[ 7  5   0  −3 ]                       [ 7  5   0  −3 ]
[ 0  3  −1   1 ]  --−2R2 + R3 → R3-->  [ 0  3  −1   1 ]
[ 0  6  −5   2 ]                       [ 0  0  −3   0 ]

Why is REF useful? Certain properties of a matrix can be easily deduced if it is in REF. For
now, REF is useful to us for solving a linear system of equations. If an augmented matrix is in
REF, we can use back substitution to solve the system, just as we did in the previous problems.

For example, consider the system

8 x 1−2 x2 + x 3=4

3 x 2−x 3=7

2 x3 =4

whose augmented matrix is already in REF

[ 8  −2   1  4 ]
[ 0   3  −1  7 ]
[ 0   0   2  4 ]

From the last equation we obtain that 2 x3 =4 , and thus x 3=2. Substituting x 3=2 into the second
equation we obtain that x 2=3. Substituting x 3=2 and x 2=3 into the first equation we obtain that
x 1=1.

6.6 Reduced row echelon form (RREF)

Although REF simplifies the problem of solving a linear system, later in the course we will need
to completely row reduce matrices into what is called reduced row echelon form (RREF). A
matrix is in RREF if it is in REF (so it satisfies properties P1 and P2) and in addition satisfies
the following properties:

P3. The leading entry in each nonzero row is a 1.


P4. All the entries above (and below) a leading 1 are all zero.

A leading 1 in the RREF of a matrix is called a pivot. For example, the following matrix in
RREF:

[ 1  6  0   3  0  0 ]
[ 0  0  1  −4  0  5 ]
[ 0  0  0   0  1  7 ]

has three pivots: the leading 1s in columns 1, 3, and 5.

Problem 6.9. Use row reduction to transform the matrix into RREF.

[ 0   3  −6   6  4  −5 ]
[ 3  −7   8  −5  8   9 ]
[ 3  −9  12  −9  6  15 ]

Solution. The first step is to make the top leftmost entry nonzero:

[ 0   3  −6   6  4  −5 ]                [ 3  −9  12  −9  6  15 ]
[ 3  −7   8  −5  8   9 ]  --R3 ↔ R1-->  [ 3  −7   8  −5  8   9 ]
[ 3  −9  12  −9  6  15 ]                [ 0   3  −6   6  4  −5 ]

Now create a leading 1 in the first row:

[ 3  −9  12  −9  6  15 ]                [ 1  −3   4  −3  2   5 ]
[ 3  −7   8  −5  8   9 ]  --(1/3)R1-->  [ 3  −7   8  −5  8   9 ]
[ 0   3  −6   6  4  −5 ]                [ 0   3  −6   6  4  −5 ]

Create zeros under the newly created leading 1:

[ 1  −3   4  −3  2   5 ]                  [ 1  −3   4  −3  2   5 ]
[ 3  −7   8  −5  8   9 ]  --−3R1 + R2-->  [ 0   2  −4   4  2  −6 ]
[ 0   3  −6   6  4  −5 ]                  [ 0   3  −6   6  4  −5 ]

Create a leading 1 in the second row:

[ 1  −3   4  −3  2   5 ]                [ 1  −3   4  −3  2   5 ]
[ 0   2  −4   4  2  −6 ]  --(1/2)R2-->  [ 0   1  −2   2  1  −3 ]
[ 0   3  −6   6  4  −5 ]                [ 0   3  −6   6  4  −5 ]

Create zeros under the newly created leading 1:

[ 1  −3   4  −3  2   5 ]                  [ 1  −3   4  −3  2   5 ]
[ 0   1  −2   2  1  −3 ]  --−3R2 + R3-->  [ 0   1  −2   2  1  −3 ]
[ 0   3  −6   6  4  −5 ]                  [ 0   0   0   0  1   4 ]

We have now completed the top-to-bottom phase of the row reduction algorithm. In the next
phase, we work bottom-to-top and create zeros above the leading 1’s. Create zeros above the
leading 1 in the third row:

[ 1  −3   4  −3  2   5 ]                 [ 1  −3   4  −3  2   5 ]
[ 0   1  −2   2  1  −3 ]  --−R3 + R2-->  [ 0   1  −2   2  0  −7 ]
[ 0   0   0   0  1   4 ]                 [ 0   0   0   0  1   4 ]

[ 1  −3   4  −3  2   5 ]                  [ 1  −3   4  −3  0  −3 ]
[ 0   1  −2   2  0  −7 ]  --−2R3 + R1-->  [ 0   1  −2   2  0  −7 ]
[ 0   0   0   0  1   4 ]                  [ 0   0   0   0  1   4 ]

Create zeros above the leading 1 in the second row:

[ 1  −3   4  −3  0  −3 ]                 [ 1  0  −2  3  0  −24 ]
[ 0   1  −2   2  0  −7 ]  --3R2 + R1-->  [ 0  1  −2  2  0   −7 ]
[ 0   0   0   0  1   4 ]                 [ 0  0   0  0  1    4 ]

This completes the row reduction algorithm, and the matrix is in RREF.
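The two-phase algorithm used above can be written compactly. The sketch below (the function name is mine) uses exact fractions so the arithmetic matches hand computation:

```python
# Row reduction to RREF using exact rational arithmetic.
from fractions import Fraction

def rref(M):
    M = [[Fraction(a) for a in row] for row in M]
    rows, cols = len(M), len(M[0])
    pivot_row = 0
    for col in range(cols):
        # find a row at or below pivot_row with a nonzero entry in this column
        pr = next((r for r in range(pivot_row, rows) if M[r][col] != 0), None)
        if pr is None:
            continue
        M[pivot_row], M[pr] = M[pr], M[pivot_row]          # interchange
        pivot = M[pivot_row][col]
        M[pivot_row] = [a / pivot for a in M[pivot_row]]   # create a leading 1
        for r in range(rows):                              # zeros above and below
            if r != pivot_row and M[r][col] != 0:
                c = M[r][col]
                M[r] = [a - c * b for a, b in zip(M[r], M[pivot_row])]
        pivot_row += 1
    return M

R = rref([[0, 3, -6, 6, 4, -5],
          [3, -7, 8, -5, 8, 9],
          [3, -9, 12, -9, 6, 15]])
print(R)  # rows (1,0,-2,3,0,-24), (0,1,-2,2,0,-7), (0,0,0,0,1,4), as in Problem 6.9
```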

Problem 6.10. Use row reduction to solve the linear system

2 x1 + 4 x 2+ 6 x3 =8

x 1+ 2 x 2 +4 x 3 =8
3 x 1+ 6 x 2 +9 x 3=12

Solution. The augmented matrix is

[ 2  4  6   8 ]
[ 1  2  4   8 ]
[ 3  6  9  12 ]

Create a leading 1 in the first row:

[ 2  4  6   8 ]                [ 1  2  3   4 ]
[ 1  2  4   8 ]  --(1/2)R1-->  [ 1  2  4   8 ]
[ 3  6  9  12 ]                [ 3  6  9  12 ]

Create zeros under the first leading 1:

[ 1  2  3   4 ]                 [ 1  2  3   4 ]
[ 1  2  4   8 ]  --−R1 + R2-->  [ 0  0  1   4 ]
[ 3  6  9  12 ]                 [ 3  6  9  12 ]

[ 1  2  3   4 ]                  [ 1  2  3  4 ]
[ 0  0  1   4 ]  --−3R1 + R3-->  [ 0  0  1  4 ]
[ 3  6  9  12 ]                  [ 0  0  0  0 ]

The system is consistent; however, there are only 2 nonzero rows but 3 unknown variables. This
means that the solution set will contain 3 − 2 = 1 free parameter. The second row of the
augmented matrix is equivalent to the equation x3 = 4.

The first row is equivalent to the equation:

x 1+ 2 x 2 +3 x3 =4

and after substituting x 3=4 we obtain

x 1+ 2 x 2=−8.

We now must choose one of the variables x 1or x 2 to be a parameter, say t, and solve for the
remaining variable. If we set x 2=t then from x 1+ 2 x 2=−8 we obtain that

x 1=−8−2 t .

We can therefore write the solution set for the linear system as
x 1=−8−2 t

x 2=t

x 3=4

where t can be any real number. If we had chosen x 1 to be the parameter, say x 1=t , then the
solution set can be written as

x1 = t

x2 = −4 − (1/2) t

x3 = 4

Although these are two different parameterizations, they both give the same solution set.

In general, if a linear system has n unknown variables and the row reduced augmented matrix
has r leading entries, then the number of free parameters d in the solution set is d=n−r . Thus,
when performing back substitution, we will have to set d of the unknown variables to arbitrary
parameters. In the previous example, there are n=3 unknown variables and the row reduced
augmented matrix contained r =2 leading entries. The number of free parameters was therefore
d = n − r = 3 − 2 = 1. Because the number of leading entries r in the row reduced coefficient
matrix determines the number of free parameters, we will refer to r as the rank of the coefficient
matrix: r = rank(A).
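The relation d = n − r can be illustrated with a tiny helper (the names are mine) applied to a coefficient matrix that is already in row echelon form, here the one from Problem 6.10:

```python
# Count leading entries (nonzero rows) of a matrix already in REF to get its
# rank r, then the number of free parameters d = n - r.

def rank_of_ref(A):
    return sum(1 for row in A if any(a != 0 for a in row))

A_reduced = [[1, 2, 3],
             [0, 0, 1],
             [0, 0, 0]]   # REF of the coefficient matrix in Problem 6.10
n = 3
r = rank_of_ref(A_reduced)
print(r, n - r)  # rank 2, so one free parameter
```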

Problem 6.11. Solve the linear system represented by the augmented matrix

[ 1  −7   2  −5   8  10 ]
[ 0   1  −3   3   1  −5 ]
[ 0   0   0   1  −1   4 ]

Solution. The number of unknowns is n=5 and the augmented matrix has rank r =3 (leading
entries). Thus, the solution set is parameterized by d=5−3=2 free variables, call them t and s.
The last equation of the augmented matrix is x 4 −x5 =4 . We choose x 5 to be the first parameter
so we set x 5=t . Therefore, x 4 =4+ t . The second equation of the augmented matrix is

x 2−3 x 3+ 3 x 4 + x 5 =−5
and the unassigned variables are x 2 and x 3. We choose x 3 to be the second parameter, say
x 3=s. Then

x2 = −5 + 3 x3 − 3 x4 − x5
   = −5 + 3s − 3(4 + t) − t
   = −17 − 4t + 3s.

We now use the first equation of the augmented matrix to write x 1 in terms of the other
variables:

x1 = 10 + 7 x2 − 2 x3 + 5 x4 − 8 x5
   = 10 + 7(−17 − 4t + 3s) − 2s + 5(4 + t) − 8t
   = −89 − 31t + 19s

Thus, the solution set is

x 1=−89−31t +19 s

x 2=−17−4 t+3 s

x 3=s

x 4 =4+ t

x 5=t

where t and s are arbitrary real numbers. Choose arbitrary numbers for t and s and substitute the
corresponding list (x 1 , x 2 ,... , x5 ) into the system of equations to verify that it is a solution.

6.7 Existence and uniqueness of solutions


The REF or RREF of an augmented matrix leads to three distinct possibilities for the
solution set of a linear system.
Let [A b] be the augmented matrix of a linear system. One of the following distinct
possibilities will occur:

1. The augmented matrix will contain an inconsistent row.

2. All the rows of the augmented matrix are consistent and there are no free parameters.

3. All the rows of the augmented matrix are consistent and there are d ≥ 1 variables
that must be set to arbitrary parameters

In Case 1., the linear system is inconsistent and thus has no solution. In Case 2., the linear
system is consistent and has only one (and thus unique) solution. This case occurs when r
= rank(A) = n since then the number of free parameters is d = n−r = 0. In Case 3., the
linear system is consistent and has infinitely many solutions. This case occurs when r < n
and thus d = n −r > 0 is the number of free parameters.
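The three cases can be sketched as a classifier for an augmented matrix that is already in row echelon form (the function and its name are my own sketch, not from the text):

```python
# Classify the solution set of a linear system from its row-reduced
# augmented matrix: inconsistent, unique, or infinitely many solutions.

def classify(aug, n_unknowns):
    rows = [row for row in aug if any(a != 0 for a in row)]
    if any(all(a == 0 for a in row[:-1]) for row in rows):
        return "inconsistent"            # case 1: a row 0 = c with c != 0
    r = len(rows)                        # rank of the coefficient part
    return "unique" if r == n_unknowns else "infinitely many"

print(classify([[1, 0, 2, 1], [0, 1, 1, 0], [0, 0, 0, -1]], 3))   # inconsistent
print(classify([[1, 0, -2, -4], [0, 1, -1, 0], [0, 0, 1, 1]], 3)) # unique
print(classify([[1, 2, 3, 4], [0, 0, 1, 4], [0, 0, 0, 0]], 3))    # infinitely many
```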

13.1 Vector: A vector V can be considered as an ordered list of numbers. In other words, a
vector is a mathematical entity characterized by both magnitude and direction, often represented
as an ordered set of values in a multi-dimensional space:

    [ v1 ]
V = [ v2 ]
    [ ⋮  ]
    [ vn ]

The above vector is an example of an n-dimensional vector, or n-dimensional column vector. The set
of all n-tuples of real numbers is denoted by Rⁿ. As an equation it is written as

Rⁿ := { (v1, v2, …, vn) | v1, v2, …, vn ∈ R }

Thus, a particular n-tuple in Rⁿ, say V = (v1, v2, …, vn), denotes a point in n-space. The numbers
vi are called the coordinates, components, entries, or elements of V. Another vector
U = (u1, u2, …, um) is an example of an m-dimensional row vector, and
Rᵐ := { (u1, u2, …, um) | u1, u2, …, um ∈ R }.
m
13.2 Vector addition and scalar multiplication: Consider two vectors U = (u1, u2, …, un) and
V = (v1, v2, …, vn) in Rⁿ. Then the sum, written as U + V, is the vector in Rⁿ obtained by adding
the corresponding elements of U and V:

        [ u1 + v1 ]
U + V = [ u2 + v2 ]
        [    ⋮    ]
        [ un + vn ]

The scalar product of the vector V by a real number k is obtained by multiplying each element of
V by k:

     [ k v1 ]
kV = [ k v2 ]
     [   ⋮  ]
     [ k vn ]

Example 1:

(a) Let A = (1, 2, 3, 4) and B = (4, 3, 2, 0), written as column vectors. Then

A + B = (1 + 4, 2 + 3, 3 + 2, 4 + 0) = (5, 5, 5, 4) and
3A − B = (3(1) − 4, 3(2) − 3, 3(3) − 2, 3(4) − 0) = (−1, 3, 7, 12)

(b) Let A=(2 , 4 ,−5) and B=(1 ,−6 , 9), then

A+ B= ( 2+ ( 1 ) , 4+ (−6 ) ,−5+ ( 9 ) )=(3 ,−2 , 4) ,

5 A= ( 5 (2 ) , 5 ( 4 ) ,5 (−5 ) ) =(10 , 20 ,−25) ,

−B=(−1 )( 1 ,−6 ,9 )=(−1, 6 ,−9), and

3 A−5 B=3 ( 2 , 4 ,−5 ) + (−5 )( 1 ,−6 , 9 )=(1 , 42,−60).


(c) A special vector is the zero vector O = (0, 0, …, 0), all of whose elements are zero. The zero
vector labels the origin. In this sense, the zero vector is the only one with zero magnitude, and
the only one which points in no direction. For any vector A = (a1, a2, …, an) ∈ Rⁿ,

A + O = (a1 + 0, a2 + 0, …, an + 0) = A.
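The componentwise rules of Section 13.2 can be checked numerically; the NumPy code below is my addition, reusing the vectors of Example 1(b):

```python
# Componentwise vector addition and scalar multiplication with NumPy.
import numpy as np

A = np.array([2, 4, -5])
B = np.array([1, -6, 9])

print(A + B)          # [ 3 -2  4]
print(5 * A)          # [ 10  20 -25]
print(3 * A - 5 * B)  # [  1  42 -60]
```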

13.3 Hyperplanes and lines in Rn

A hyperplane in Rn is defined as the solution set of a linear equation of the form

a1 x1 + a2 x2 + … + an xn = b

where A = (a1, a2, …, an) is a non-zero constant vector in Rⁿ, x1, x2, …, xn are variables
representing the coordinates in Rⁿ, and b ∈ R is a constant.

Thus, a hyperplane in R² is a line and a hyperplane in R³ is a plane. Given two non-zero
vectors, U and V, they will usually determine a plane, unless both vectors lie on the same line.
The line L in Rⁿ along the direction of a non-zero vector U = (u1, u2, …, un) and passing
through a point P = (b1, b2, …, bn) can be written as

L = { P + tU | t ∈ R }

The plane determined by two vectors U and V in Rⁿ can be written as

{ P + tU + sV | t, s ∈ R }

Example 2:

(a) The set { (1, 2, 5, 4) + t (1, 0, 0, 0) | t ∈ R } describes a line in R⁴ parallel to the
x1-axis.

(b) The set { (2, 1, 4, 1, 5, 8) + s (1, 0, 0, 0, 0, 0) + t (0, 1, 0, 0, 0, 0) | s, t ∈ R }
describes a plane in R⁶ parallel to the x1-x2 plane.

(c) (Specifying a hyperplane with one linear algebraic equation): The solution set to
x1 + x2 + x3 + x4 + x5 = 1 can be represented equivalently as

(x1, x2, x3, x4, x5) = (1 − x2 − x3 − x4 − x5, x2, x3, x4, x5),

which further can be written as the hyperplane

{ (1, 0, 0, 0, 0) + s2 (−1, 1, 0, 0, 0) + s3 (−1, 0, 1, 0, 0) + s4 (−1, 0, 0, 1, 0)
  + s5 (−1, 0, 0, 0, 1) | s2, s3, s4, s5 ∈ R }

Thus, the solution set to x1 + x2 + x3 + x4 + x5 = 1 is a 4-dimensional hyperplane in R⁵.

Example 3:

(a) Let H be the plane in R³ corresponding to the linear equation 2 x1 − 5 x2 + 7 x3 = 4.

Notice that the points P = (1, 1, 1) and Q = (5, 4, 2) are solutions of the equation. Thus, P
and Q and the directed line segment V = ⃗PQ = Q − P = (4, 3, 1) lie on the plane H. The vector
U = (2, −5, 7) is normal to H, and, as expected,

U ⋅ V = (2, −5, 7) ⋅ (4, 3, 1) = 8 − 15 + 7 = 0. Therefore, U is orthogonal to V.

(b) Find an equation of the hyperplane H in R4 that passes through the point P=(1 ,3 ,−4 ,2)
and is normal to the vector U =(4 ,−2, 5 , 6). The coefficients of the unknowns of an
equation of H are the components of the normal vector U ; hence, the equation of H must
be of the form 4 x1 −2 x 2 +5 x3 +6 x 4 =k . Substituting P into this equation, we obtain
4 ( 1 )−2 ( 3 ) +5 (−4 ) +6 ( 2 )=k or k =−10.
Thus, 4 x1 −2 x 2 +5 x3 +6 x 4 =−10. is the equation of hyperplane H .

13.4 Directions and Magnitudes

Definition: The dot product, also known as the inner product or scalar product of vectors
U = (u1, u2, …, un) and V = (v1, v2, …, vn) in Rⁿ, is defined as

U ⋅ V = u1 v1 + u2 v2 + … + un vn

That is, the dot product of vectors U and V is obtained by multiplying corresponding elements
and adding the resulting products.

The vectors U and V are said to be perpendicular (or orthogonal) if their dot product is zero,
that is, U ⋅ V = 0. The inner product is also denoted as ⟨U, V⟩.

Example 4:

(a) Let U = (1, 2, −2), V = (2, −3, 5), and W = (2, 3, 4). Then

U ⋅ V = 1(2) + 2(−3) + (−2)(5) = 2 − 6 − 10 = −14
U ⋅ W = 1(2) + 2(3) + (−2)(4) = 2 + 6 − 8 = 0
V ⋅ W = 2(2) + (−3)(3) + 5(4) = 4 − 9 + 20 = 15

Thus, U and W are orthogonal.

(b) Let U =(1 , 2 ,3 , 5) and V =(2 , 3 , k ,5). Then find k so that U and V are perpendicular.
We have, U . V =1 ( 2 ) +2 (3 )+ 3 k +5 ( 5 )=2+6 +3 k +25=33+3 k
U and V are perpendicular then U . V =0.
Therefore, 33+3 k =0, or k =−11.
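Both parts of Example 4 can be verified numerically; the NumPy usage here is my addition:

```python
# Dot products and the orthogonality test U . V == 0, using NumPy's @ operator.
import numpy as np

U = np.array([1, 2, -2])
V = np.array([2, -3, 5])
W = np.array([2, 3, 4])

print(U @ V)   # -14
print(U @ W)   # 0, so U and W are orthogonal

# Part (b): choose k so that (1,2,3,5) . (2,3,k,5) = 33 + 3k = 0
k = -(1*2 + 2*3 + 5*5) / 3
print(k)       # -11.0
```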

Some Properties of the Dot Product: Consider the non-zero vectors U, V, and W in Rⁿ. Then some
important properties of dot products are as follows:

(i) Symmetry: U ⋅ V = V ⋅ U
(ii) Distributive: U ⋅ (V + W) = U ⋅ V + U ⋅ W
(iii) Bilinear (linear in both arguments): U ⋅ (cV + dW) = c U ⋅ V + d U ⋅ W and
(cU + dW) ⋅ V = c U ⋅ V + d W ⋅ V, where c, d are scalars in R.
(iv) Positive definite: U ⋅ U > 0.

Euclidean length (Norm or Magnitude) of a vector:

The Euclidean norm of a vector V ∈ Rⁿ is denoted by ‖V‖ and defined to be the non-negative
square root of ⟨V, V⟩. For a vector V = (v1, v2, …, vn) in Rⁿ,

‖V‖ = √⟨V, V⟩ = √((v1)² + (v2)² + … + (vn)²)

Thus, ‖V‖ ≥ 0, and ‖V‖ = 0 if and only if V is the zero vector. A vector is called a unit vector
(or unit norm vector) if ‖V‖ = 1, or equivalently if ⟨V, V⟩ = 1.

For any non-zero vector V ∈ Rⁿ, the vector V̂ = V/‖V‖ is the unit norm vector in the direction
of V. The process of obtaining V̂ from V is called the normalization of vector V.
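Normalization as just defined is one line with NumPy; the example vector below is my own choice:

```python
# Normalizing a vector: divide by its Euclidean norm.
import numpy as np

V = np.array([3.0, 4.0])
V_hat = V / np.linalg.norm(V)   # unit vector in the direction of V

print(np.linalg.norm(V))        # 5.0
print(V_hat)                    # [0.6 0.8], which has norm 1
```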

Distance, Angles, and Projections: Consider two non-zero vectors U = (u1, u2, …, un) and
V = (v1, v2, …, vn) in Rⁿ. The distance between U and V is denoted and defined as

d(U, V) = ‖U − V‖ = √((u1 − v1)² + (u2 − v2)² + … + (un − vn)²)

The angle θ between U and V is determined by the formula

U ⋅ V = ‖U‖‖V‖ cos θ,  or  cos θ = (U ⋅ V)/(‖U‖‖V‖).

The projection of a vector U onto a non-zero vector V is denoted and determined by

proj(U, V) = ((U ⋅ V)/‖V‖²) V = ((U ⋅ V)/(V ⋅ V)) V

Theorem 1 (Cauchy-Schwarz Inequality): For any non-zero vectors U and V in Rⁿ,

|U ⋅ V| ≤ ‖U‖‖V‖

Proof: Let α ∈ R be any real number and consider the following nonnegative quadratic polynomial
in α:

0 ≤ (U + αV) ⋅ (U + αV) = U ⋅ U + 2α U ⋅ V + α² V ⋅ V
  = ‖U‖² + 2(U ⋅ V)α + ‖V‖²α²

Let a = ‖V‖², b = 2(U ⋅ V), and c = ‖U‖². Then for every value of α,

aα² + bα + c ≥ 0.

This means that the discriminant D = b² − 4ac ≤ 0, or equivalently b² ≤ 4ac. Thus,
4(U ⋅ V)² ≤ 4‖U‖²‖V‖², or |U ⋅ V| ≤ ‖U‖‖V‖.

Theorem 2 (Triangle Inequality): For any vectors U and V in Rⁿ,

‖U + V‖ ≤ ‖U‖ + ‖V‖.

Proof:

‖U + V‖² = (U + V) ⋅ (U + V)
         = U ⋅ U + 2 U ⋅ V + V ⋅ V
         = ‖U‖² + ‖V‖² + 2‖U‖‖V‖ cos θ
         = (‖U‖ + ‖V‖)² + 2‖U‖‖V‖(cos θ − 1)
         ≤ (‖U‖ + ‖V‖)²

Thus, ‖U + V‖ ≤ ‖U‖ + ‖V‖. The triangle inequality is also self-evident when examining a
sketch of U, V, and U + V, as in Figure 13.1.

Figure 13.1: Sketch of U, V, and U + V
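Both inequalities can also be sanity-checked numerically. The small experiment below is my addition, not part of the text; it tests the inequalities on many random vector pairs:

```python
# Numerical check of the Cauchy-Schwarz and triangle inequalities.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    U = rng.normal(size=5)
    V = rng.normal(size=5)
    # |U.V| <= ||U|| ||V||  (small tolerance for floating-point rounding)
    assert abs(U @ V) <= np.linalg.norm(U) * np.linalg.norm(V) + 1e-12
    # ||U + V|| <= ||U|| + ||V||
    assert np.linalg.norm(U + V) <= np.linalg.norm(U) + np.linalg.norm(V) + 1e-12
print("both inequalities held on 1000 random pairs")
```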

Example 5:

Consider two vectors in R³, U = (1, −2, 3) and V = (2, 4, 6). Then obtain the distance and angle
between U and V.

The distance between U and V is obtained as

d(U, V) = √((u1 − v1)² + (u2 − v2)² + (u3 − v3)²)
        = √((1 − 2)² + (−2 − 4)² + (3 − 6)²)
        = √(1 + 36 + 9)
        = √46

To find the angle between U and V, we first obtain

U ⋅ V = (1)(2) + (−2)(4) + (3)(6) = 12,
‖U‖² = (1)² + (−2)² + (3)² = 14,
‖V‖² = (2)² + (4)² + (6)² = 56.

Then, cos θ = (U ⋅ V)/(‖U‖‖V‖) = 12/(√14 √56) = 3/7, or θ = cos⁻¹(3/7).

Also, proj(U, V) = ((U ⋅ V)/‖V‖²) V = (12/56)(2, 4, 6) = (3/7)(1, 2, 3) = (3/7, 6/7, 9/7).
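Example 5 can be verified numerically; the NumPy calls below are my addition:

```python
# Distance, angle, and projection for the vectors of Example 5.
import numpy as np

U = np.array([1, -2, 3])
V = np.array([2, 4, 6])

dist = np.linalg.norm(U - V)                                   # sqrt(46)
cos_theta = (U @ V) / (np.linalg.norm(U) * np.linalg.norm(V))  # 3/7
proj = (U @ V) / (V @ V) * V                                   # (3/7) * (1, 2, 3)

print(dist, cos_theta, proj)
```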

13.5 Field: In the context of vector spaces, a field F is a mathematical structure that provides the
scalars used for scalar multiplication in the vector space. A field is a set equipped with two
operations, addition and multiplication, such that the set satisfies certain algebraic properties.
The following properties hold for all elements a, b, c in the field:

(i). Closure under addition and multiplication: a + b and a ⋅ b are in the field F.

(ii). Associativity of addition and multiplication: a + (b + c) = (a + b) + c

and a ⋅ (b ⋅ c) = (a ⋅ b) ⋅ c

(iii). Commutativity of addition: a+ b=b+a

(iv). Existence of additive identity: There exists an element 0 in F such that,

a+ 0=a for all a in F .

(v). Existence of additive inverses: For every a in F, there exists an element −a in F


such that a+ (−a ) =0.

(vi). Nonzero element property: a . b=0 implies that either a=0 or b=0.

(vii). Existence of multiplicative identity: There exists an element 1 in F such that a .1=a for all
a in F .

13.6 Vector Space: Formally, a vector space over a field (usually the real numbers or complex
numbers) is a set V equipped with two operations, vector addition and scalar multiplication, such
that for any vectors u, v, w ∈ V and any scalars c, d from the field F, the following properties hold:
(i). Closure under addition: u+ v ∈ V

(ii). Associativity of addition: u+ ( v +w )=( u+ v )+ w

(iii). Commutativity of addition: u+ v=v+ u

(iv). Existence of zero vector: There exists a vector 0 ∈V such that u+0=u for all u ∈V

(v). Existence of additive inverses: For every u ∈V , there exists a vector −u ∈V such that
u+ (−u )=0

(vi). Closure under scalar multiplication: c .u ∈ V

(vii). Compatibility of scalar multiplication with field multiplication: c(d ⋅ u) = (cd) ⋅ u

(viii). Identity element for scalar multiplication: 1. u=u for all u ∈V

(ix). Distributivity of scalar multiplication with respect to vector addition:

c . (u+ v )=c . u+c . v

(x). Distributivity of scalar multiplication with respect to field addition:

( c +d ) . u=c . u+d .u

13.7 Spanning Sets: Let V be a vector space over F. Then vectors u1, u2, …, un are said to span
V, or form a spanning set of V, if every v ∈ V is a linear combination of the vectors
u1, u2, …, un; that is, if there exist scalars a1, a2, …, an ∈ F such that
v = a1 u1 + a2 u2 + … + an un. The following remarks can be concluded from the above definition.

Remark 1: Suppose u1 ,u 2 , … . un span V . Then, for any vector w , the set w ,u1 , u2 , …. un also
spans V .
Remark 2: Suppose u1 ,u 2 , … . un span V and suppose uk is a linear combination of some of the
other u ' s . Then the u ' s without uk also span V .
Remark 3: Suppose u1 ,u 2 , … . un span V and suppose one of the u ' s is the zero vector. Then
the u ' s without the zero vector also span V .
Example 6: Let V = R² (the vector space of all 2-dimensional real vectors) and consider the set
S = { (1, 0), (0, 1) }, written as column vectors. The question is whether S spans V. Any vector
in R² can be written as a linear combination of the vectors in S. For example:

The vector v1 = (2, 3) can be expressed as v1 = 2 (1, 0) + 3 (0, 1).

Since we can represent any vector in R² using linear combinations of the vectors in S, we can
say that S spans R².

Example 7: Now, let's consider V = R³ and the set S = { (1, 1, 1), (1, 1, 0), (1, 0, 0) }.

Again, we want to know if S spans V. For a vector in R³, say (a, b, c), we need to check if there
exist scalars x, y, and z such that

x (1, 1, 1) + y (1, 1, 0) + z (1, 0, 0) = (a, b, c)

In this case, we can see that the vector (a, b, c) can be written as

(a, b, c) = c (1, 1, 1) + (b − c)(1, 1, 0) + (a − b)(1, 0, 0)

So, it is concluded that S spans R3. These examples illustrate the concept of spanning sets,
where a set of vectors spans a vector space if every vector in that space can be expressed as a
linear combination of the vectors in the set.
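The coefficients in Example 7 can also be found numerically: stack the spanning vectors as the columns of a matrix and solve the resulting 3 × 3 system. The sketch below (my addition) checks the closed form (c, b − c, a − b) for one arbitrary target:

```python
# Express a target vector as a linear combination of the vectors in S.
import numpy as np

v1 = np.array([1, 1, 1])
v2 = np.array([1, 1, 0])
v3 = np.array([1, 0, 0])
M = np.column_stack([v1, v2, v3])    # columns are the spanning vectors

target = np.array([5.0, 3.0, 2.0])   # an arbitrary (a, b, c)
coeffs = np.linalg.solve(M, target)

print(coeffs)  # approximately (2, 1, 2), i.e. (c, b - c, a - b)
```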

13.8 Subspaces: In the realm of subspace geometry, envision the conventional three-dimensional
space, R³, and select any plane passing through the origin. This chosen plane constitutes a
distinct vector space. When we scale a vector within this plane by a factor like 2, −3, or any
scalar, the result remains within the same plane. Similarly, the sum of two vectors within the
plane remains in the plane. This specific plane, passing through the origin (0, 0, 0), exemplifies
a foundational concept in linear algebra: it serves as a subspace within the original space R³.

Definition. A subspace S of a vector space V is a nonempty subset that satisfies the requirements
of a vector space itself: linear combinations stay in the subspace.
(i) The sum of any two vectors x and y in the subspace also resides in the subspace: (x + y) ∈ S.
(ii) The product of any vector x in the subspace by any scalar c remains within the subspace:
cx ∈ S.

Example 8. Consider all vectors in R² whose components are positive or zero. This
subset is the first quadrant of the x-y plane; the coordinates satisfy x ≥ 0 and
y ≥ 0. It is not a subspace, even though it contains zero and addition does leave us within
the subset. Rule (ii) is violated, since if the scalar is c = −1 and the vector is x = (1, 2),
then the multiple cx = (−1, −2) is in the third quadrant instead of the first.
If we include the third quadrant along with the first, scalar multiplication is all right:
every multiple cx will stay in this subset. However, rule (i) is now violated, since
(1, 2) + (−2, −1) = (−1, 1), which is not in either quadrant. The smallest subspace
containing the first quadrant is the whole space R².

Example 9. Start from the vector space of 3 by 3 matrices, say R³ˣ³. One possible
subspace is the set of lower triangular matrices. Another is the set of symmetric matrices:
A + B and cA are lower triangular if A and B are lower triangular, and they
are symmetric if A and B are symmetric. Of course, the zero matrix is in both subspaces.
Let's see two fundamental examples: the column space and the nullspace of a matrix A.

Column space: The column space encompasses every possible linear combination that can be
formed using the columns of matrix A .

Explanation: The column space of a matrix refers to the subspace formed by all the possible
linear combinations of its columns. Consider a matrix A with columns [v1, v2, v3]. The column
space would include every vector that can be expressed as a linear combination of these columns,
such as c1 v1 + c2 v2 + c3 v3, where c1, c2, and c3 are scalars. In essence, it represents the
span of the column vectors of the matrix.
Example 10: If

    [ 1  2 ]
A = [ 3  4 ]
    [ 5  6 ]

then the column space would include all vectors of the form c1 (1, 3, 5) + c2 (2, 4, 6), where c1
and c2 can be any scalars. The column space is the subspace spanned by the columns (1, 3, 5) and
(2, 4, 6) in this example.

Note: A system of linear equations, say Ax = b, has a solution if and only if the vector b can be
represented as a linear combination of the columns of matrix A. In other words, b lies within the
column space of A.

Nullspace: The null space of a matrix, denoted as N(A), comprises all vectors x that satisfy the
equation Ax = 0. Therefore, the null space of a matrix A represents the set of all solutions
(vectors x) that, when multiplied by A, result in the zero vector (Ax = 0).

In other words, it consists of vectors that get "mapped" to the zero vector under the linear
transformation represented by the matrix A .

This null space is a subspace of the vector space Rn , analogous to how the column space is a
subspace of Rm.

Example 11: Let's consider a specific example of a matrix A and find its null space in R³:

A = [1 2 −1; 0 1 1; 2 0 3]

Now, we are looking for vectors X = [x1, x2, x3]ᵀ such that AX = 0.

Setting up the system of equations: [1 2 −1; 0 1 1; 2 0 3][x1, x2, x3]ᵀ = [0, 0, 0]ᵀ

Solving this system gives the solutions for x1, x2, x3. The null space N(A) consists of all vectors [x1, x2, x3]ᵀ that satisfy these equations; this set of solutions is the null space of matrix A in R³.
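It is worth noting that the 3×3 matrix of Example 11 has nonzero determinant, so Ax = 0 forces x = 0 and its null space is just {0}. A NumPy sketch contrasting it with a rank-deficient matrix B (an illustrative choice of ours) that does have nontrivial null vectors:

```python
import numpy as np

A = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0,  1.0],
              [2.0, 0.0,  3.0]])

# Full rank means Ax = 0 has only the trivial solution x = 0.
print(np.linalg.matrix_rank(A))   # 3 -> null space is {0}

# A rank-deficient matrix, by contrast, has nontrivial null vectors:
B = np.array([[ 1.0,  2.0, -3.0],
              [ 2.0,  4.0, -6.0],
              [-1.0, -2.0,  3.0]])
x = np.array([-2.0, 1.0, 0.0])    # satisfies x1 + 2*x2 - 3*x3 = 0
print(np.linalg.matrix_rank(B))   # 1
print(B @ x)                      # the zero vector
```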

13.9 Linear Dependence and Independence:

Linear dependence and independence are important concepts that describe the relationships
between vectors.

1. Linearly Dependent Vectors:

A set of vectors is said to be linearly dependent if there exists a nontrivial linear combination of these vectors that equals the zero vector. In simpler terms, vectors v1, v2, ..., vn are linearly dependent if there exist scalars c1, c2, ..., cn, not all zero, such that c1·v1 + c2·v2 + ... + cn·vn = 0.

Example 12: Consider two vectors in ℝ²: v1 = [1, 2]ᵀ and v2 = [2, 4]ᵀ.

These vectors are linearly dependent because 2·v1 = v2.

You can verify that 2·[1, 2]ᵀ − [2, 4]ᵀ = [0, 0]ᵀ.
2. Linearly Independent Vectors

A set of vectors is said to be linearly independent if the only linear combination that equals the
zero vector is the trivial one, where all the coefficients are zero.

In other words, vectors v1, v2, ..., vn are linearly independent if c1·v1 + c2·v2 + ... + cn·vn = 0 implies that c1 = c2 = ... = cn = 0.

Example 13: Consider two vectors in ℝ²: v1 = [1, 0]ᵀ and v2 = [0, 1]ᵀ.

These vectors are linearly independent because the only way to get the zero vector as a linear combination is by setting c1 = c2 = 0, since c1·[1, 0]ᵀ + c2·[0, 1]ᵀ = [0, 0]ᵀ only when c1 = c2 = 0.

In general, if you have a set of vectors v1, v2, ..., vn, and you can express one of them as a linear combination of the others, the set is linearly dependent. If no vector in the set can be written as a linear combination of the others, the set is linearly independent. Linear independence is a desirable property because it means that none of the vectors in the set is redundant; each contributes something unique to the span of the set.
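This criterion reduces to a rank computation: vectors are independent exactly when the matrix holding them as columns has rank equal to the number of vectors. A NumPy sketch (the helper name independent is ours), applied to Examples 12 and 13:

```python
import numpy as np

def independent(*vectors):
    """Vectors are linearly independent iff the matrix holding them
    as columns has rank equal to the number of vectors."""
    M = np.column_stack(vectors)
    return np.linalg.matrix_rank(M) == M.shape[1]

# Example 12: v2 = 2*v1, so the pair is dependent.
print(independent(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # False

# Example 13: the standard basis vectors of R^2 are independent.
print(independent(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # True
```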

13.10 Basis and Dimensions:

A basis is a fundamental concept in the context of vector spaces. A basis is a set of vectors that
spans the entire vector space and is linearly independent. The idea is that any vector in the vector
space can be uniquely expressed as a linear combination of the vectors in the basis. The number
of vectors in the basis is known as the dimension of the vector space. Some key properties of a basis are listed below:

1. Spanning Property: A basis must span the entire vector space, meaning that any vector in the
space can be represented as a linear combination of the basis vectors.

2. Linear Independence: The vectors in a basis must be linearly independent, ensuring that no
vector in the basis can be written as a combination of the others.

Example 14: Consider the vector space ℝ3 , and let's define a basis for this space. A common
basis for ℝ3 is the standard basis, which consists of three vectors:

e1 = [1, 0, 0]ᵀ, e2 = [0, 1, 0]ᵀ, and e3 = [0, 0, 1]ᵀ

These vectors form a basis for ℝ3 because:


Spanning Property: Any vector V = [a, b, c]ᵀ ∈ ℝ³ can be expressed as a linear combination of the standard basis vectors: V = a·e1 + b·e2 + c·e3

Linear Independence: The coefficients in the linear combination above are unique. If
a . e 1+ b . e2 +c . e3 =0 , then a=b=c=0 . This ensures linear independence.

Now, let's take an example vector W = [2, 3, 1]ᵀ. We can express W as a linear combination of the standard basis vectors as: W = 2·[1, 0, 0]ᵀ + 3·[0, 1, 0]ᵀ + 1·[0, 0, 1]ᵀ

This demonstrates the spanning property of the basis, showing that any vector in ℝ3 can be
represented using the basis vectors.

Thus, we can say that a basis is a set of vectors that not only spans the vector space but also
ensures that the vectors are linearly independent, providing a unique way to represent any vector
in the space. The standard basis for ℝn is a common example that satisfies these properties.
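Finding the coordinates of a vector in a given basis amounts to solving the linear system Bc = W, where the basis vectors are the columns of B. A NumPy sketch (the non-standard basis B2 is an illustrative choice of ours):

```python
import numpy as np

# Columns of B are the basis vectors; the coordinates c of W in this
# basis solve B c = W.  For the standard basis, c is W itself.
B = np.eye(3)
W = np.array([2.0, 3.0, 1.0])
c = np.linalg.solve(B, W)
print(c)          # [2. 3. 1.]

# The same idea works for any basis, e.g. a non-standard one:
B2 = np.array([[1.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 0.0, 1.0]])
c2 = np.linalg.solve(B2, W)
print(B2 @ c2)    # reproduces W
```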

Problem Set

P.1 Let U1 = [1, 2, 4]ᵀ, U2 = [4, −1, 3]ᵀ, and U3 = [1, 3, 5]ᵀ. Find
(a) 3 U 1−2 U 2

(b) 2 U 1 +U 2−7 U 3

P.2 Write the vector V = [1, 2, 3]ᵀ as a linear combination of the vectors U1 = [1, 0, 0]ᵀ, U2 = [1, −1, 0]ᵀ, and U3 = [1, 0, 1]ᵀ.

P.3 Find U ⋅V where:

(a) U = (2, 3, 4, 5) and V = (3, −2, 5, 4)

(b) U = (2, −3, 4, −5, 1) and V = (3, −1, −5, 4, 2)

P.4 Consider U = (3, 2, −2, 1) and V = (3, k, 5, 4). Then find the value of k such that U and V are orthogonal.

P.5 Find the unit vector in the direction of vector V where:

(a) V =(3 ,−2, 2 , 4)

(b) V =(2 ,−3 , 4 , 5 ,1)

P.6 Consider U =(1 ,−2 , 4 ) and V =(−3 , 2 ,5). Then find:

(a) Cosθ where θ is the angle between U and V .

(b) proj(U, V), the projection of U onto V.

(c) d ( U ,V ) , the distance between U and V .


P.7 Find an equation of the hyperplane H that passes through the point P = (3, −4, 2) and is normal to the vector V = (−3, 2, 5).

P.8 Check whether the vectors U1 = [1, 2, 4]ᵀ, U2 = [4, −1, 3]ᵀ, and U3 = [3, −3, −1]ᵀ are linearly independent.

P.9 Determine whether the vectors U1 = [1, 0, 4]ᵀ, U2 = [4, −1, 3]ᵀ, and U3 = [3, 1, 0]ᵀ span R³.
Chapter 14: Eigenvalues and Eigenvectors

14.1 Introduction

Eigenvalues and eigenvectors provide a way to understand the inherent properties of linear transformations represented by matrices. In linear algebra and statistical modelling, eigenvalues and eigenvectors are used to uncover the principal components of a dataset, enabling dimensionality reduction and capturing the most significant variability in the data.

Consider the effect of multiplying the nonzero vectors [2, 5]ᵀ and [3, 4]ᵀ by the square matrix A = [6 3; 4 7]:

Case 1: A·[2, 5]ᵀ = [27, 43]ᵀ

Case 2: A·[3, 4]ᵀ = [30, 40]ᵀ = 10·[3, 4]ᵀ

We aim to examine the impact of multiplying the given matrix on vectors. In Case 1, the result is a completely different vector with altered direction and length, which is unremarkable for our current discussion. However, in Case 2 something noteworthy unfolds: the multiplication yields a scalar multiple of the original vector, so the new vector maintains the same direction as the original one. The scaling factor, denoted λ, is 10. This chapter will delve into the systematic exploration of such scaling factors λ and nonzero vectors X for a given square matrix A by considering equations of the form AX = λX.

14.2 Cayley-Hamilton theorem, Characteristic Polynomial:

The Cayley-Hamilton theorem is a fundamental result in linear algebra that establishes a


relationship between a square matrix and its characteristic polynomial. The theorem is named
after the mathematicians Arthur Cayley and William Rowan Hamilton.
Consider a square matrix A of order n × n. The characteristic polynomial of A is denoted by p(λ) = det(A − λI), where I is the identity matrix and λ is a scalar. The characteristic polynomial is a polynomial in λ whose roots are the eigenvalues of the matrix A.

A matrix polynomial in A is an expression of the form P(A) = c0·I + c1·A + c2·A² + … + ck·Aᵏ, where c0, c1, …, ck are constants. The degree of the matrix polynomial is the highest power of A that appears in the expression.

Cayley-Hamilton theorem: The Cayley-Hamilton theorem states that every square matrix satisfies its own characteristic equation. In other words, if p(λ) = det(A − λI) is the characteristic polynomial of the matrix A, then substituting A for λ in the polynomial yields the zero matrix: p(A) = 0.

In particular, if the characteristic polynomial is written as p(λ) = c0 + c1·λ + c2·λ² + … + cn·λⁿ, the Cayley-Hamilton theorem implies:

p(A) = c0·I + c1·A + c2·A² + … + cn·Aⁿ = 0.

Example 1:

Consider a 2×2 matrix A given by: A = [1 2; 3 4]

The characteristic polynomial p(λ) is found by evaluating det(A − λI):

p(λ) = det [1−λ 2; 3 4−λ] = (1 − λ)(4 − λ) − 2·3 = λ² − 5λ − 2

Now, according to the Cayley-Hamilton theorem:

P(A) = A² − 5A − 2I = 0

Substitute the matrix A into the polynomial:

P(A) = [1 2; 3 4]² − 5·[1 2; 3 4] − 2·[1 0; 0 1] = [0 0; 0 0]

As expected, P(A) is the zero matrix, confirming the Cayley-Hamilton theorem for this example.
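The same verification can be done numerically, using the general 2×2 formula p(λ) = λ² − tr(A)·λ + det(A) derived in Example 2 below. A NumPy sketch:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# For a 2x2 matrix, p(lambda) = lambda^2 - tr(A)*lambda + det(A).
t = np.trace(A)          # 5.0
d = np.linalg.det(A)     # -2.0 (up to floating-point rounding)

# Cayley-Hamilton: substituting A for lambda gives the zero matrix.
P = A @ A - t * A + d * np.eye(2)
print(np.allclose(P, 0))   # True
```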
Example 2:

(a) Let us find the characteristic polynomial of a 2×2 matrix A = [a11 a12; a21 a22].

The characteristic polynomial of A is given as p(λ) = det(A − λI):

p(λ) = det [a11−λ a12; a21 a22−λ]
     = (a11 − λ)(a22 − λ) − a12·a21
     = λ² − (a11 + a22)·λ + (a11·a22 − a12·a21)

p(λ) = λ² − tr(A)·λ + det(A),

where tr(A) denotes the trace of A, that is, the sum of the diagonal elements of A.

(b) Now consider a 3×3 matrix A = [a11 a12 a13; a21 a22 a23; a31 a32 a33].

With a similar approach (as in part (a)), the characteristic polynomial of A is given as:

p(λ) = λ³ − tr(A)·λ² + (A11 + A22 + A33)·λ − det(A),

where A11, A22, A33 denote the cofactors of a11, a22, and a33 respectively.

14.3 Eigenvalues and Eigenvectors:

Definition: Let A be any square matrix. Then a scalar λ is called an eigenvalue of A if there
exists a nonzero (column) vector X such that AX=λX holds. Any vector X satisfying this
relation is called an eigenvector of A associated with the eigenvalue λ .
Note that each scalar multiple of an eigenvector X associated with the eigenvalue λ is also such
an eigenvector, as: A(kX )=k ( AX )=k (λX )=λ(kX ).

Determining Eigenvalues and Eigenvectors:


Example 3: Consider the 2×2 matrix A = [3 1; 2 2]; determine the eigenvalues and eigenvectors of matrix A.

(a) To find the eigenvalues:

1. Solve the characteristic equation det(A − λI) = 0:

det [3−λ 1; 2 2−λ] = (3 − λ)(2 − λ) − 2·1 = 0

2. Solve for λ, which gives the eigenvalues:

(3 − λ)(2 − λ) − 2 = 0
λ² − 5λ + 4 = 0
(λ − 1)(λ − 4) = 0

So, the eigenvalues are λ1 = 1 and λ2 = 4.

(b) To find the eigenvectors:

For each eigenvalue λ, solve the system (A − λI)X = 0.

1. For λ = 1:

(A − I)X = [3−1 1; 2 2−1][x1, x2]ᵀ = [2 1; 2 1][x1, x2]ᵀ = [0, 0]ᵀ

One solution is X1 = [−1, 2]ᵀ.

2. For λ = 4:

(A − 4I)X = [3−4 1; 2 2−4][x1, x2]ᵀ = [−1 1; 2 −2][x1, x2]ᵀ = [0, 0]ᵀ

One solution is X2 = [1, 1]ᵀ.

Therefore, the eigenvalues are λ1 = 1 and λ2 = 4, and the corresponding eigenvectors are X1 = [−1, 2]ᵀ and X2 = [1, 1]ᵀ.
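The same eigenvalues and eigenvectors can be obtained numerically. A NumPy sketch (np.linalg.eig returns unit-norm eigenvectors as columns, so they may differ from X1 and X2 by a scalar factor, and the eigenvalue order is not fixed):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 2.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors
# (as the columns of V), normalized to unit length.
lams, V = np.linalg.eig(A)
print(np.sort(lams))             # [1. 4.]

# Each column satisfies A v = lambda v:
for lam, v in zip(lams, V.T):
    print(np.allclose(A @ v, lam * v))   # True for each pair
```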

Theorem 1: Eigenvalues
The eigenvalues of a square matrix A are the roots of the characteristic equation of A. Hence an n × n matrix has at least one eigenvalue and at most n numerically different eigenvalues.

Theorem 2: Eigenvectors, Eigenspace


If W and X are eigenvectors of a matrix A corresponding to the same eigenvalue λ, then so are W + X (provided X ≠ −W) and kX for any k ≠ 0. Hence the eigenvectors corresponding to one and the same eigenvalue λ of A, together with 0, form a vector space, called the eigenspace of A corresponding to that eigenvalue.

Example 4 (Multiple Eigenvalues):

Let us find the eigenvalues and eigenvectors of the matrix A = [−2 2 −3; 2 1 −6; −1 −2 0].

The characteristic equation is given as: det ( A−λI ) =0

Or, det [−2−λ 2 −3; 2 1−λ −6; −1 −2 −λ] = 0

Or, λ³ + λ² − 21λ − 45 = 0.

The roots of the above equation give the eigenvalues of A.

Thus, we get the eigenvalues: λ 1=5 , λ2=λ 3=−3.

The eigenvector X associated with eigenvalue λ is given by ( A−λI ) X =0,

When λ = 5, then [−2−5 2 −3; 2 1−5 −6; −1 −2 −5][x1, x2, x3]ᵀ = [0, 0, 0]ᵀ,

which row-reduces to [−7 2 −3; 0 −24/7 −48/7; 0 0 0][x1, x2, x3]ᵀ = [0, 0, 0]ᵀ;

thus −7x1 + 2x2 − 3x3 = 0 and −(24/7)x2 − (48/7)x3 = 0.

Let us choose x3 = −1; then we get x2 = 2 and x1 = 1.

Thus, the eigenvector of A corresponding to λ = 5 is X = [x1, x2, x3]ᵀ = [1, 2, −1]ᵀ.
Similarly, when λ = −3, then [−2+3 2 −3; 2 1+3 −6; −1 −2 3][x1, x2, x3]ᵀ = [0, 0, 0]ᵀ,

which row-reduces to [1 2 −3; 0 0 0; 0 0 0][x1, x2, x3]ᵀ = [0, 0, 0]ᵀ;

thus x1 + 2x2 − 3x3 = 0, or x1 = −2x2 + 3x3.

Let us choose x2 = 1 and x3 = 0; then we get x1 = −2, that is, [x1, x2, x3]ᵀ = [−2, 1, 0]ᵀ.

And choosing x2 = 0 and x3 = 1, we get x1 = 3, that is, [x1, x2, x3]ᵀ = [3, 0, 1]ᵀ.

Thus, we get two linearly independent eigenvectors corresponding to λ=−3.

Note that the order of an eigenvalue λ as a root of the characteristic polynomial is called the algebraic multiplicity of λ, denoted Mλ. The number of linearly independent eigenvectors corresponding to λ is called the geometric multiplicity of λ, denoted mλ. Thus mλ is the dimension of the eigenspace corresponding to this λ.

Definitions: (a) A real square matrix A is called symmetric if Aᵀ = A.

(b) A real square matrix A is called skew-symmetric if Aᵀ = −A.

(c) A real square matrix A is called orthogonal if Aᵀ = A⁻¹.

Theorem 4: (Eigenvalues of the Transpose)


The transpose Aᵀ of a square matrix A has the same eigenvalues as A.

Theorem 5: (Eigenvalues of Symmetric and Skew-Symmetric Matrices)


(a) The eigenvalues of a symmetric matrix are real.
(b) The eigenvalues of a skew-symmetric matrix are pure imaginary or zero.

Theorem 6: (Invariance of Inner Product)


Consider two vectors X, Y ∈ Rⁿ with inner product ⟨X, Y⟩ = XᵀY = x1·y1 + x2·y2 + ⋯ + xn·yn, and an n × n orthogonal matrix A. Then the orthogonal transformations U = AX and V = AY preserve the value of the inner product of the vectors X, Y; that is, ⟨U, V⟩ = ⟨X, Y⟩. Moreover, the transformation also preserves the length or norm of any vector, that is, ‖U‖ = ‖AX‖ = ‖X‖.
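Theorem 6 can be illustrated with a rotation matrix, which is orthogonal. A NumPy sketch (the angle and the test vectors are arbitrary illustrative choices):

```python
import numpy as np

# A rotation matrix is orthogonal: Q^T Q = I.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

X = np.array([3.0, -1.0])
Y = np.array([2.0,  4.0])
U, V = Q @ X, Q @ Y

print(np.allclose(U @ V, X @ Y))                          # inner product preserved
print(np.allclose(np.linalg.norm(U), np.linalg.norm(X)))  # length preserved
```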

Example 5: (Real Matrices with Complex Eigenvalues and Eigenvectors)


We know that a real polynomial may have complex roots (which then occur in conjugate pairs),
therefore a real matrix may have complex eigenvalues and eigenvectors. For example,
consider the matrix A = [0 1; −1 0]. The characteristic equation of this skew-symmetric matrix is

det(A − λI) = det [−λ 1; −1 −λ] = λ² + 1 = 0.

Thus, the eigenvalues are λ1 = −i and λ2 = i.

The eigenvector [x1, x2]ᵀ corresponding to λ1 satisfies [i 1; −1 i][x1, x2]ᵀ = [0, 0]ᵀ. Solving, we get i·x1 + x2 = 0; let x1 = 1, then x2 = −i. Therefore, the eigenvector corresponding to λ1 is [x1, x2]ᵀ = [1, −i]ᵀ.

Similarly, the eigenvector corresponding to λ2 is [x1, x2]ᵀ = [1, i]ᵀ.

Theorem 7: Any real square matrix A may be written as the sum of a symmetric matrix E and a skew-symmetric matrix O, where E and O are given as E = (1/2)(A + Aᵀ) and O = (1/2)(A − Aᵀ).
2 2

Example 6: Consider the real square matrix A = [1 2 3; 4 5 6; 7 8 9]; then Aᵀ = [1 4 7; 2 5 8; 3 6 9].

Thus, E = (1/2)(A + Aᵀ) = [1 3 5; 3 5 7; 5 7 9], and O = (1/2)(A − Aᵀ) = [0 −1 −2; 1 0 −1; 2 1 0].

We observe that A = E + O, where E is a symmetric matrix and O is a skew-symmetric matrix.
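The decomposition of Theorem 7 and Example 6 is straightforward to verify numerically. A NumPy sketch:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

E = (A + A.T) / 2        # symmetric part
O = (A - A.T) / 2        # skew-symmetric part

print(np.allclose(E, E.T))       # True: E is symmetric
print(np.allclose(O, -O.T))      # True: O is skew-symmetric
print(np.allclose(E + O, A))     # True: they sum back to A
```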

Theorem 9: (Orthonormality of Column and Row Vectors)


A real square matrix A ∈ R^(n×n) is orthogonal if and only if its column vectors A1, A2, ⋯, An ∈ Rⁿ (and its row vectors) form an orthonormal system, that is,

⟨Aj, Ak⟩ = Ajᵀ·Ak = 1 if j = k, and 0 if j ≠ k.

Theorem 10: (Determinant of an Orthogonal Matrix)


The determinant of an orthogonal matrix has the value −1 or +1.

Theorem 11: (Eigenvalues of an Orthogonal Matrix)


The eigenvalues of an orthogonal matrix A are either real or complex conjugates in pairs,
and have absolute value 1.

Theorem 12: (Basis of Eigenvectors)

If an n × n matrix A possesses n distinct eigenvalues, then A has a set of eigenvectors X1, X2, ⋯, Xn that forms a basis for the vector space Rⁿ.

Theorem 13: (Symmetric matrices)

An n × n symmetric matrix A possesses an orthonormal set of eigenvectors, X 1 , X 2 , ⋯ X n that


forms a basis for the vector space Rn .

Definition: (Similar Matrices. Similarity Transformation)

Two square matrices A and B are said to be similar if there exists an invertible matrix P such that B = P⁻¹AP. Here, P is a non-singular (invertible) matrix that facilitates the transformation. This transformation is called a similarity transformation.

Theorem 14: (Eigenvalues and Eigenvectors of Similar Matrices)

If a matrix B is similar to A, then B has the same eigenvalues as A.


Furthermore, if X is an eigenvector of A, then Y = P⁻¹X is an eigenvector of B corresponding to the same eigenvalue.
14.4 Some other Properties of Eigen values

1. Sum of Eigenvalues (Trace):

The trace of a matrix A is equal to the sum of its eigenvalues.

trace(A) = λ1 + λ2 + ⋯ + λn

2. Product of Eigenvalues (Determinant):

The determinant of a matrix A is equal to the product of its eigenvalues.

det(A) = λ1 × λ2 × ⋯ × λn

3. Eigenvalues of Inverse:

If the eigenvalues of an invertible matrix A are λ1, λ2, ⋯, λn, then the eigenvalues of the matrix A⁻¹ are 1/λ1, 1/λ2, ⋯, 1/λn.

For example, if the eigenvalues of an invertible matrix A are λ1 = 2 and λ2 = 3, then the eigenvalues of A⁻¹ are 1/2 and 1/3.

4. Eigenvalues of Product:

The eigenvalues of the product AB are the same as the eigenvalues of BA , where A and B are
square matrices.

5. Eigenvalues of Power:

If the eigenvalues of A are λ1, λ2, ⋯, λn, then the eigenvalues of the matrix Aᵏ (where k is a positive integer) are λ1ᵏ, λ2ᵏ, ⋯, λnᵏ.
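These properties are straightforward to confirm numerically. A NumPy sketch using the illustrative matrix A = [4 1; 2 3], whose eigenvalues are 2 and 5:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
lams = np.linalg.eigvals(A)      # eigenvalues 2 and 5 (order not fixed)

print(np.isclose(lams.sum(),  np.trace(A)))          # sum equals the trace
print(np.isclose(lams.prod(), np.linalg.det(A)))     # product equals the determinant

# The inverse and powers act on the eigenvalues directly:
print(np.allclose(np.sort(np.linalg.eigvals(np.linalg.inv(A))),
                  np.sort(1.0 / lams)))
print(np.allclose(np.sort(np.linalg.eigvals(A @ A)),
                  np.sort(lams ** 2)))
```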
14.5 Diagonalization of a Matrix

Theorem 15: (Diagonalization of a Matrix)

If an n × n matrix A possesses a set of eigenvectors forming a basis of Rⁿ, then the matrix X with these eigenvectors as its columns diagonalizes A:

D = X⁻¹AX

Here, the diagonal matrix D has the eigenvalues of A along its main diagonal, and X is the matrix with the eigenvectors of A as column vectors. Also,

Dᵐ = X⁻¹AᵐX for m = 2, 3, ⋯

Example 7: (Diagonalization of Matrix)

Consider the matrix A = [4 1; 2 3], for which we will find the eigenvalues and eigenvectors, and then demonstrate its diagonalization.

1. Eigenvalues:

The eigenvalues are obtained by solving the characteristic equation det(A − λI) = 0:

det [4−λ 1; 2 3−λ] = (4 − λ)(3 − λ) − (2·1) = λ² − 7λ + 10 = 0

This gives us two eigenvalues: λ1 = 2 and λ2 = 5.

2. Eigenvectors:

Now, for each eigenvalue, we find the corresponding eigenvector V by solving (A − λI)V = 0.

For λ = 2:

(A − 2I)V = [4−2 1; 2 3−2][x, y]ᵀ = [2 1; 2 1][x, y]ᵀ = [0, 0]ᵀ

Solving this system of equations, we find an eigenvector V1 = [1, −2]ᵀ.

For λ = 5:

(A − 5I)V = [4−5 1; 2 3−5][x, y]ᵀ = [−1 1; 2 −2][x, y]ᵀ = [0, 0]ᵀ

Solving this system of equations, we find an eigenvector V2 = [1, 1]ᵀ.

3. Diagonalization:

To diagonalize A, we form the matrix P with the eigenvectors as columns and D as the diagonal matrix of eigenvalues. Therefore,

P = [1 1; −2 1] and D = [2 0; 0 5]

Then the diagonalization is given by A = PDP⁻¹:

PDP⁻¹ = [1 1; −2 1][2 0; 0 5][1 1; −2 1]⁻¹ = [4 1; 2 3] = A

Moreover, PD²P⁻¹ = [1 1; −2 1][4 0; 0 25][1 1; −2 1]⁻¹ = [18 7; 14 11] = A²

This demonstrates the diagonalization of the matrix A.
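The diagonalization can also be carried out numerically. A NumPy sketch (np.linalg.eig may order the eigenvalues differently from the worked example, which does not affect the identities checked):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

lams, P = np.linalg.eig(A)       # columns of P are eigenvectors
D = np.diag(lams)

print(np.allclose(P @ D @ np.linalg.inv(P), A))             # A   = P D   P^{-1}
print(np.allclose(P @ (D ** 2) @ np.linalg.inv(P), A @ A))  # A^2 = P D^2 P^{-1}
```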

Example 8: (Inverse of a matrix using the Cayley-Hamilton theorem)

Consider the matrix A = [2 −1 1; −1 2 −1; 1 −1 2].
The characteristic equation is given by det ( A−λI ) =0
det [2−λ −1 1; −1 2−λ −1; 1 −1 2−λ] = 0

(2 − λ)·det[2−λ −1; −1 2−λ] + det[−1 −1; 1 2−λ] + det[−1 2−λ; 1 −1] = 0

(2 − λ)(λ² − 4λ + 3) + (λ − 1) + (λ − 1) = 0

λ³ − 6λ² + 9λ − 4 = 0

Correspondingly, using the Cayley-Hamilton theorem,

A³ − 6A² + 9A − 4I = 0

Multiplying the above equation by A⁻¹, we get

A² − 6A + 9I − 4A⁻¹ = 0, that is, A⁻¹ = (1/4)(A² − 6A + 9I)

A⁻¹ = (1/4)·([2 −1 1; −1 2 −1; 1 −1 2]² − 6·[2 −1 1; −1 2 −1; 1 −1 2] + 9·[1 0 0; 0 1 0; 0 0 1])

A⁻¹ = (1/4)·([6 −5 5; −5 6 −5; 5 −5 6] + [−12 6 −6; 6 −12 6; −6 6 −12] + [9 0 0; 0 9 0; 0 0 9])

A⁻¹ = (1/4)·[3 1 −1; 1 3 1; −1 1 3]
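The Cayley-Hamilton route to the inverse can be checked against NumPy's direct computation. A sketch:

```python
import numpy as np

A = np.array([[ 2.0, -1.0,  1.0],
              [-1.0,  2.0, -1.0],
              [ 1.0, -1.0,  2.0]])

# From A^3 - 6A^2 + 9A - 4I = 0:  A^{-1} = (A^2 - 6A + 9I) / 4
A_inv = (A @ A - 6 * A + 9 * np.eye(3)) / 4

print(np.allclose(A_inv, np.linalg.inv(A)))   # matches the direct inverse
print(np.allclose(A @ A_inv, np.eye(3)))      # A times A^{-1} is the identity
```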
Problem Set

P1. Determine A², A³, and A⁻¹, where A = [1 2; 0 3].

P2. Determine a real matrix A such that A³ = B, where B = [4 6 0; 0 4 6; 0 0 4].

P3. Find a polynomial having the following matrix as a root:

(a) A = [3 1; 2 2], (b) A = [1 3 4; 1 4 6; 0 1 2].

P4. For each of the following matrices, find the eigenvalues and corresponding eigenvectors:

(a) A = [6 −1; 2 3], (b) A = [2 0 1; 0 3 0; 1 0 2].

P5. Consider the matrix A = [4 1; 2 3]; then

(a) Find a non-singular matrix P such that D=P−1 AP is diagonal.


(b) Find a matrix B such that B2= A .
(c) Find A 4 and f ( A) , where f ( u )=2 u2−3 u+5.
P6. Find a real 2×2 symmetric matrix A with eigenvalues:

(a) λ1 = 1, λ2 = 4, where the eigenvector corresponding to λ1 is given as X = [2, 2]ᵀ.

(b) λ1 = 3, λ2 = 2, where the eigenvector corresponding to λ2 is given as X = [1, 2]ᵀ.
P7. Let A be a square matrix with eigenvalues λ 1 , λ2 , … , λn. Prove that the sum of the
eigenvalues (trace) is equal to the sum of the elements on the main diagonal, and the product of
the eigenvalues (determinant) is equal to the determinant of matrix A .

P8. If A is a square matrix with eigenvalues λ1, λ2, …, λn, show that the transpose of A has the same eigenvalues.

P9. If A is invertible with eigenvalues λ 1 , λ2 , … , λn . Then find the eigenvalues of the inverse
of A .

P10. If A is a square matrix with eigenvalues λ 1 , λ2 , … , λn .Then find the eigenvalues of matrix
cA , where c is a scalar.

P11. Let A and B be matrices with eigenvalues λ 1 , λ2 , … , λn and μ1 , μ 2 , … , μn, respectively.


Then find the eigenvalues of A+ B.

P12. If A and B be matrices with eigenvalues λ 1 , λ2 , … , λn and μ1 , μ 2 , … , μn, respectively. Then


find the eigenvalues of AB.

P13. Determine the inverse of the matrix A = [1 2 −2; 2 1 2; −2 2 1] using the Cayley-Hamilton theorem.
Dataset in Machine Learning

The Oxford Dictionary defines a dataset as “a collection of data that is treated as a single unit by
a computer.” This implies that a dataset consists of numerous individual pieces of data but is
utilized to train an algorithm with the objective of identifying patterns within the entire dataset.

Data is a fundamental component of any AI model and has played a pivotal role in the surge of
popularity in machine learning. With the availability of data, scalable ML algorithms have
transitioned from theoretical concepts to practical products capable of delivering value to
businesses.

A dataset in machine learning is a structured collection of data used to train, test, and validate
machine learning models. It serves as the foundation for building and evaluating algorithms by
providing examples that the model can learn from. Datasets typically consist of input features
and corresponding output labels or target values. The quality and characteristics of the dataset
significantly influence the performance and generalization ability of the machine learning model.

Key Components of a Dataset:

1. Features:

 Input variables or attributes that represent the characteristics or properties of the data.

2. Labels or Targets:

 Output variables that indicate the desired prediction or classification.

3. Instances or Samples:

 Individual data points in the dataset, each comprising a set of features and corresponding labels.

4. Training Set:

 Subset of the dataset used to train the machine learning model.

5. Testing Set:

 Subset of the dataset used to evaluate the model's performance on unseen data.
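The split between training and testing sets is commonly done by shuffling the instances and holding out a fraction. A minimal NumPy sketch (the toy data, the random seed, and the 80/20 ratio are illustrative assumptions, not fixed rules):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 10 instances, 3 features each, with binary labels.
X = rng.normal(size=(10, 3))      # features
y = rng.integers(0, 2, size=10)   # labels

# Shuffle the indices, then hold out 20% of the instances for testing.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)   # (8, 3) (2, 3)
```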

Types of Datasets in Machine Learning:

1. Training Dataset:

 Definition: The portion of the dataset used to train the machine learning model.

 Characteristics: It consists of labeled examples used to adjust the model's parameters during training.

2. Testing Dataset:

 Definition: The portion of the dataset reserved for evaluating the model's performance.

 Characteristics: It contains examples not seen by the model during training, allowing for an assessment of generalization.

3. Validation Dataset:

 Definition: An optional subset used to fine-tune model parameters and avoid overfitting.

 Characteristics: It provides an independent evaluation before deploying the model on real-world data.

4. Unlabeled Dataset:

 Definition: A dataset where instances lack corresponding output labels or targets.

 Characteristics: Used in unsupervised learning to discover patterns, relationships, or groupings in the data.

5. Labeled Dataset:

 Definition: A dataset with instances paired with corresponding output labels or targets.

 Characteristics: Used in supervised learning for training models to make predictions or classifications.

6. Time Series Dataset:

 Definition: Data organized in chronological order, often used for predicting future values based on historical patterns.

 Characteristics: Contains temporal dependencies, suitable for time series forecasting.

7. Image Dataset:

 Definition: A dataset comprising images as instances, used in computer vision tasks.

 Characteristics: Each instance is represented by pixel values, and labels may indicate object categories or annotations.

8. Text Dataset:

 Definition: A dataset consisting of text data, used in natural language processing (NLP) tasks.

 Characteristics: Instances may include sentences, documents, or paragraphs, and labels can represent categories or sentiment.

9. Multimodal Dataset:

 Definition: A dataset containing diverse data types, such as a combination of text, images, and numerical features.

 Characteristics: Used in applications that require the analysis of multiple data modalities simultaneously.

10. Imbalanced Dataset:

 Definition: A dataset where the distribution of classes is uneven, with some classes having significantly fewer instances than others.

 Characteristics: May pose challenges for machine learning models in correctly predicting minority classes.

11. Synthetic Dataset:

 Definition: A dataset generated artificially to simulate specific scenarios or conditions.

 Characteristics: Used for testing model robustness, handling edge cases, or augmenting real-world datasets.

12. Publicly Available Dataset:

 Definition: A dataset made accessible to the public for research, experimentation, and benchmarking purposes.

 Characteristics: Often used by researchers, practitioners, and educators to develop and test machine learning models.

While raw data serves as a starting point, it cannot be directly input into a machine learning
algorithm. Several steps must be taken before the dataset becomes usable.

Three Steps of Data Processing in Machine Learning:


1. Collect:

 Decide on the sources for collecting data, choosing from open-source datasets, the Internet, or generators of artificial data.

 Types of sources: freely available open-source datasets, the Internet, generators of artificial data.

 This step is discussed in detail in the following section.

2. Preprocess:

 Determine if the dataset has been used before; if it has not, assume it is flawed.

 Adapt the dataset to fit specific project goals.

3. Annotate:

 Ensure the data is understandable for computer processing.

 Consider outsourcing annotation tasks to trained professionals.

Quest for a Dataset in Machine Learning: Where to Find It and What Sources Fit Your
Case Best?

The sources for collecting a dataset vary based on your project, budget, and business size.
Collecting data directly correlated with business goals is ideal but can be resource-intensive. Free
datasets for machine learning offer a cost-effective option, although adjustments may be needed
to align them with specific project requirements.

Features of a Proper, High-Quality Dataset in Machine Learning:

Quality of a Dataset: Relevance and Coverage

 Ensure data pieces are relevant to the project goals.

 Verify that data is of sufficient quality and corresponds to required features.

 Address blind spots and biases to avoid imbalances.

Tip: Use live data whenever possible to avoid issues with predictability.

Sufficient Quantity of a Dataset in Machine Learning

 Have enough data to train the algorithm effectively.


 Be cautious of overtraining (overfitting) and aim for a balance.

Note: Consult with a data scientist for advice on the volume of data needed for a specific AI
project.

Before Deploying, Analyze Your Dataset

Analyzing the dataset before deploying it is a crucial step. Real-life cases emphasize the
dependency of an ML algorithm on the comprehensive analysis of its dataset. Blind spots, biases,
and unexpected consequences may arise if the dataset is not thoroughly examined.

An anecdote about a hospital using an ML algorithm to reduce treatment costs for pneumonia
patients illustrates the importance of dataset analysis. The algorithm, trained on clinic data,
inaccurately classified asthma as a non-aggravating condition due to the absence of recorded
deaths for asthmatics with pneumonia in the historic dataset.

This case underscores the need for human supervision and control over machine learning
algorithms. Machines cannot independently perform the analytic work of humans, and dataset
analysis is essential before using the data for training.

In Summary: What You Need to Know About Datasets in Machine Learning

Collecting a dataset for an AI project may seem straightforward, but it can be a time-consuming
task that requires careful consideration. Understanding what a dataset in machine learning is,
how to collect and preprocess the data, and the features of a proper dataset is crucial.

A dataset is a collection of data pieces treated as a single unit for analytic and prediction
purposes by a computer. Preprocessing involves cleaning and annotating the data, making it
understandable for machine processing. Key features of a good dataset include quality,
relevance, coverage, and sufficient quantity.

Collecting data from sources directly related to business goals is ideal, but free datasets for
machine learning offer a cost-effective option. However, adjustments may be necessary to align
them with specific project requirements. Analyzing the dataset before deployment is crucial to
identify blind spots, biases, and potential issues, ensuring a more accurate and reliable machine
learning model.

Data Preprocessing in Machine Learning


Data preprocessing is a critical step in the machine learning pipeline, serving as the foundation
for creating accurate and efficient models. It involves cleaning and transforming raw data into a
format suitable for analysis and model training. This process is indispensable because real-world
datasets are often noisy, contain missing values, and may be presented in formats unsuitable for
machine learning algorithms. In this discussion, we will delve into the significance of data
preprocessing, the key steps involved, and the tools commonly used in the process.
Importance of Data Preprocessing:

1. Noise Reduction: Real-world data is susceptible to noise, which refers to irrelevant or random
variations in the data. Noise can arise from various sources, including errors in data collection
instruments or inconsistencies in data entry. By identifying and eliminating noise, data
preprocessing ensures that the model does not learn from irrelevant patterns, contributing to the
model's robustness and accuracy.

2. Handling Missing Values: Datasets often contain missing values, which can adversely impact
model training. Data preprocessing involves strategies for handling missing data, such as
imputation (estimating missing values based on existing data) or removal of instances with
missing values. Proper handling of missing data ensures that the model is trained on complete
and reliable information.

3. Format Standardization: Datasets may come in different formats, including CSV, Excel,
HTML, or others. Data preprocessing involves standardizing the format to ensure consistency
and compatibility with machine learning algorithms. This step facilitates seamless data
manipulation and analysis.

4. Encoding Categorical Data: Machine learning algorithms typically require numerical input,
but real-world datasets often include categorical variables. Data preprocessing includes encoding
categorical variables into a numerical format. This can be achieved through techniques like label
encoding, assigning numerical labels to categories, or one-hot encoding, creating binary columns
for each category.

5. Splitting into Training and Test Sets: To evaluate the performance of a machine learning
model, the dataset is split into training and test sets. The training set is used to train the model,
while the test set assesses its performance on unseen data. Data preprocessing ensures a
representative split, helping the model generalize well to new, unseen instances.

6. Feature Scaling: Machine learning algorithms are sensitive to the scale of input features. Data
preprocessing includes feature scaling techniques to bring features to a similar scale. Common
methods include Min-Max scaling, which scales features to a specified range, and Z-score
normalization, which standardizes features by subtracting the mean and dividing by the standard
deviation.

Steps in Data Preprocessing:

1. Getting the Dataset:

The first step is obtaining the dataset relevant to the machine learning problem. Datasets can vary
in format and content based on the nature of the problem, and they are typically stored in files
like CSV, Excel, or databases.
2. Importing Libraries:

Python libraries such as Numpy, Matplotlib, and Pandas are commonly used for data
preprocessing. Numpy facilitates mathematical operations, Matplotlib supports data
visualization, and Pandas is instrumental in importing, cleaning, and managing datasets.

3. Importing Datasets:

After setting the working directory, the dataset is imported using Pandas. Pandas provides
powerful data structures like DataFrames, making it easy to manipulate and analyze data.

4. Finding Missing Data:

Data preprocessing involves identifying and handling missing values. Techniques include
removing instances with missing values or imputing missing values based on statistical methods.
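As an illustration of the imputation idea, here is a minimal pure-Python sketch of mean imputation (the `impute_mean` helper and the sample ages are made up for illustration; in practice Pandas or scikit-learn would be used):

```python
# Mean imputation: replace missing entries (None) with the mean of the observed values.

def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(impute_mean(ages))  # [25, 30.0, 30, 35, 30.0] — missing ages become the mean, 30.0
```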

5. Encoding Categorical Data:

Categorical variables are transformed into numerical format using encoding techniques. Label
encoding assigns numerical labels to categories, while one-hot encoding creates binary columns
for each category.

6. Splitting Dataset into Training and Test Set:

The dataset is divided into training and test sets to train and evaluate the machine learning
model, ensuring a balance between the two sets.
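The split can be sketched in plain Python (the `train_test_split` helper below is a simplified stand-in for library routines such as scikit-learn's function of the same name; the 80/20 ratio and the seed are illustrative):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data with a fixed seed, then split it."""
    rows = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)    # seeded shuffle => reproducible split
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]  # (train, test)

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```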

7. Feature Scaling:

Feature scaling is applied to standardize the scale of input features, preventing any feature from
dominating the learning process.
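The two scaling methods named above can be sketched as follows (the helper names and salary values are illustrative; scikit-learn's `MinMaxScaler` and `StandardScaler` provide the same transformations):

```python
# Min-Max scaling: map values linearly onto [0, 1].
def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Z-score normalization: subtract the mean, divide by the standard deviation.
def z_score(xs):
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

salaries = [30000, 50000, 70000, 90000]
print(min_max_scale(salaries))  # ≈ [0.0, 0.333, 0.667, 1.0]
```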

Tools for Data Preprocessing:

1. Numpy:

Numpy is a fundamental library for scientific computing in Python, providing support for large,
multidimensional arrays and matrices. It is essential for performing mathematical operations
during data preprocessing.

import numpy as np

2. Matplotlib:

Matplotlib, along with its sub-library pyplot, is widely used for creating visualizations and plots
in Python.
import matplotlib.pyplot as plt

3. Pandas:

Pandas is a powerful library for data manipulation and analysis. It provides data structures like
DataFrames, simplifying the handling of datasets.

import pandas as pd

In conclusion, data preprocessing is a vital step in machine learning, contributing to the success
of models by ensuring that they are trained on clean, relevant, and properly formatted data.
Through steps like noise reduction, handling missing values, and encoding categorical data, data
preprocessing sets the stage for robust and accurate machine learning models. Utilizing tools like
Numpy, Matplotlib, and Pandas streamlines the preprocessing workflow, making it accessible
and efficient for data scientists and machine learning practitioners.

Understanding Bias and Variance in Machine Learning

In machine learning, achieving optimal model performance requires a delicate balance between
bias and variance. Bias and variance are two sources of error that influence a model's ability to
generalize to unseen data. Understanding these concepts is crucial for fine-tuning models and
making informed decisions during the model development process.

Bias:

Bias refers to the error introduced by approximating a real-world problem, which may be
highly complex, by a simplified model. A high bias model oversimplifies the underlying
patterns in the data and tends to make strong assumptions about the relationship between features
and the target variable. This can lead to systematic errors, causing the model to consistently
underperform.

“Bias is the difference between the average prediction of our model and the correct
value which we are trying to predict.”

1. Average Prediction of Our Model:

 This refers to the predictions made by the machine learning model across the
entire dataset.

2. Correct Value We Are Trying to Predict:

 This is the true or actual value of the target variable in the dataset. In supervised
learning, the model is trained to predict this target variable.
3. Difference:

 The disparity or gap between the average prediction of the model and the actual
value is what we define as bias.

In mathematical terms, bias (B) can be expressed as:

B = (1/N) ∑ᵢ₌₁ᴺ (ŷᵢ − yᵢ)

where:

 ŷᵢ is the predicted value by the model for the i-th instance.

 yᵢ is the true value for the i-th instance.

 N is the total number of instances in the dataset.

A high bias implies that the model is making simplistic assumptions about the underlying
patterns in the data and may not be capturing its complexity. This can result in underfitting,
where the model fails to perform well on both the training and unseen data.
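The definition above can be computed directly; here is a minimal sketch with made-up predictions and targets:

```python
# Bias as defined above: the average of (prediction − true value) over the dataset.

def bias(predictions, targets):
    n = len(predictions)
    return sum(p - t for p, t in zip(predictions, targets)) / n

preds = [2.5, 0.0, 2.0, 8.0]   # illustrative model outputs
truth = [3.0, -0.5, 2.0, 7.0]  # illustrative true values
print(bias(preds, truth))  # 0.25: on average this model over-predicts slightly
```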

Characteristics of High Bias Models:

 Overly Simple: High bias models may be too simplistic and unable to capture the
complexity of the underlying data patterns.

 Underfitting: Such models may struggle to fit the training data and perform poorly on
both the training and test sets.

Addressing Bias:

 Model Complexity: Increasing the complexity of the model by adding more features or
using more sophisticated algorithms can help reduce bias.

 Feature Engineering: Carefully selecting and engineering features can provide the
model with more information to learn from.

Variance:

Variance, on the other hand, measures the model's sensitivity to fluctuations in the training
data. A high variance model is overly complex and tends to fit the training data too closely,
capturing noise and outliers. While these models may perform well on the training set, they often
fail to generalize to new data, leading to poor performance on the test set.

Characteristics of High Variance Models:


 Overfitting: High variance models may memorize the training data instead of learning
general patterns, leading to poor generalization.

 Sensitive to Noise: These models can be overly sensitive to variations in the training
data, capturing noise rather than the underlying trends.

Addressing Variance:

 Regularization: Techniques like L1 and L2 regularization can penalize overly complex
models, discouraging them from fitting the training data too closely.

 Data Augmentation: Increasing the size of the training dataset or applying data
augmentation techniques can help the model generalize better.

 Feature Selection: Removing irrelevant or redundant features can reduce the complexity
of the model and mitigate overfitting.
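To illustrate how an L2 penalty shrinks a coefficient, the sketch below fits the slope of a one-variable model with and without a ridge penalty λ (the data, the λ values, and the assumption of centred data with no intercept are all made up for illustration):

```python
# Ridge (L2-regularized) slope for centred data: beta = Σxy / (Σx² + λ).
# With λ = 0 this is ordinary least squares; larger λ shrinks the slope toward 0.

def ridge_slope(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [-2, -1, 0, 1, 2]
ys = [-4.1, -1.9, 0.0, 2.1, 3.9]   # roughly y = 2x
print(ridge_slope(xs, ys, 0.0))    # ≈ 2.0 (plain least squares)
print(ridge_slope(xs, ys, 10.0))   # shrunk toward zero
```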

Bulls-eye diagram
Let's visualize the four different cases representing combinations of both high and low bias
and variance using a bulls-eye diagram:
1. Low Bias, Low Variance:

In this case, the model consistently predicts values very close to the correct ones. The hits on
the target are tightly clustered around the bulls-eye, indicating low variability. This scenario
represents a well-fitted model that captures the underlying patterns in the data accurately.

2. Low Bias, High Variance:

The model still predicts values close to the correct ones on average, but the predictions vary
more. The hits on the target are more scattered, indicating higher variability. This scenario
suggests that the model is sensitive to fluctuations in the training data, potentially overfitting
noise.

3. High Bias, Low Variance:


Here, the model consistently predicts values far from the correct ones. The hits on the target
are tightly clustered, but they are far from the bulls-eye. This scenario indicates that the
model is too simplistic and fails to capture the complexity of the underlying patterns in the
data.

4. High Bias, High Variance:

In this case, the model consistently predicts values far from the correct ones, and the
predictions vary widely. The hits on the target are scattered, indicating both a systematic error
(high bias) and sensitivity to fluctuations in the training data (high variance). This scenario
represents a poorly fitted model that fails to capture the underlying patterns and is overly
sensitive to noise.

Mathematical Definition

If we label the variable we aim to predict as y and our covariates as X, it is reasonable to
assume a relationship between them, expressed as y = f(x) + ε. Here, ε represents the error
term, assumed to follow a normal distribution with mean zero (ε ∼ N(0, σ_ε)).

We can derive an estimate ŷ = f̂(x) of f(x) using techniques like linear regression. In this
context, the anticipated squared prediction error at a given point x is defined as:

Err(x) = E[(ŷ − y)²]

This error can be broken down into bias and variance components. Writing y = f(x) + ε and
ŷ = f̂(x), we have:

MSE = E[(f̂(x) − f(x) − ε)²]

    = E[(f̂(x) − f(x))²] + E[ε²] − 2 E[ε (f̂(x) − f(x))]

    = E[(f̂(x) − f(x))²] + σ_ε² − 2 E[ε (f̂(x) − f(x))]

As the error ε has zero mean, i.e. E[ε] = 0, and is independent of f̂(x), the above equation
reduces to:

MSE = E[(f̂(x) − f(x))²] + σ_ε²

Adding and subtracting E[f̂(x)] inside the square, we have:

E[(f̂(x) − f(x))²] = E[(f̂(x) − E[f̂(x)] + E[f̂(x)] − f(x))²]

    = E[(E[f̂(x)] − f(x))²] + E[(f̂(x) − E[f̂(x)])²] + 2 E[(E[f̂(x)] − f(x))(f̂(x) − E[f̂(x)])]

As 2 E[(E[f̂(x)] − f(x))(f̂(x) − E[f̂(x)])] = 0, the above equation reduces to:

MSE = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ_ε²

MSE = Bias² + Variance + σ_ε²

where Bias = E[f̂(x)] − f(x) and Variance = E[(f̂(x) − E[f̂(x)])²].
Here, the third term, the irreducible error (σ_ε²), represents the noise inherent in the true
relationship, which cannot be fundamentally diminished by any model. In an ideal scenario
with the true model and an infinite dataset for calibration, both bias and variance terms
should be reducible to zero. However, operating in a world with imperfect models and finite
data introduces a trade-off between minimizing bias and minimizing variance.
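The first two terms of this decomposition obey the exact identity E[(f̂ − f)²] = Bias² + Variance, which can be checked numerically. The sketch below simulates predictions at a single point from many hypothetical training runs (all numbers are made up for illustration):

```python
import random

random.seed(0)
f0 = 2.0  # true value f(x0) at some fixed point x0

# `preds` stands for the predictions fhat(x0) produced by an estimator that is
# both biased (shifted by about +0.3) and noisy (standard deviation 0.5).
preds = [f0 + random.gauss(0.3, 0.5) for _ in range(10000)]

mean_pred = sum(preds) / len(preds)
bias = mean_pred - f0                                         # E[fhat] - f
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
mse = sum((p - f0) ** 2 for p in preds) / len(preds)          # E[(fhat - f)^2]

print(abs(mse - (bias ** 2 + variance)))  # ~0: the identity holds exactly
```

Because the identity is pure algebra on the sample, the difference printed is at floating-point rounding level regardless of the seed.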

Bias-Variance Tradeoff:

The relationship between bias and variance is often depicted as a trade-off. Increasing the
complexity of a model typically reduces bias but increases variance, and vice versa. Striking
the right balance is crucial for creating models that generalize well to new, unseen data. This
balance is often referred to as the bias-variance trade-off.
Optimal Model Complexity:

 Underfitting: Models that are too simple and underfit the data.

 Optimal: Models with balanced bias and variance, achieving good generalization.

 Overfitting: Models that are too complex and overfit the training data.

Model Evaluation and Selection:

To evaluate the bias and variance of a model, various techniques can be employed, such as cross-
validation and learning curves. Learning curves provide insights into how a model's performance
changes with the size of the training dataset, offering valuable information about bias and
variance.

In conclusion, navigating the bias-variance trade-off is a key challenge in machine learning.


Striking the right balance is essential for developing models that generalize well to new data. By
understanding the characteristics of high bias and high variance models and employing
appropriate strategies, machine learning practitioners can build models that perform well on a
variety of datasets, ensuring robust and reliable predictions.
Function approximation
Function approximation in machine learning refers to the process of finding an approximation or
model that can represent an underlying function within a given dataset. The goal is to learn a
mapping from input data to output values, capturing the underlying patterns or relationships in
the data. This process is fundamental to various machine learning tasks, including regression and
classification.

Here are some key concepts related to function approximation in machine learning:

1. Supervised Learning:

 Function approximation is often achieved through supervised learning, where the
algorithm is trained on a labeled dataset. The algorithm learns to map input
features to corresponding output labels.

2. Regression:

 In regression tasks, the goal is to predict a continuous output variable. The
algorithm approximates the underlying function that relates input features to the
continuous target variable.

3. Classification:

 In classification tasks, the goal is to assign input data to one of several predefined
classes or categories. The algorithm learns a decision boundary that separates
different classes.

4. Model Selection:

 Different machine learning models can be used for function approximation,
including linear regression, decision trees, support vector machines, neural
networks, and more. The choice of model depends on the nature of the data and
the complexity of the underlying function.

5. Overfitting and Underfitting:

 Overfitting occurs when a model learns the training data too well, capturing noise
and outliers that do not generalize well to new data. Underfitting occurs when a
model is too simple to capture the underlying patterns in the data. Balancing
between overfitting and underfitting is crucial for effective function
approximation.

6. Hyperparameter Tuning:
 Hyperparameters are configuration settings for a machine learning algorithm.
Tuning these hyperparameters is essential to optimize the model's performance.
Techniques such as cross-validation can be employed to find the best
hyperparameter values.

7. Loss Functions:

 During training, the model's performance is evaluated using a loss function, which
measures the difference between the predicted outputs and the actual labels. The
objective is to minimize this loss to improve the model's accuracy.

8. Neural Networks:

 Deep learning, particularly neural networks, has gained popularity for complex
function approximation tasks. Deep neural networks can automatically learn
hierarchical representations of data, allowing them to capture intricate patterns.

9. Ensemble Methods:

 Ensemble methods, such as random forests or gradient boosting, combine
multiple weak models to create a stronger overall model. These methods can
improve the accuracy and robustness of function approximation.

In summary, function approximation in machine learning involves training models to capture the
relationships between input data and output labels. The choice of model, handling
overfitting/underfitting, tuning hyperparameters, and using appropriate evaluation metrics are
critical aspects of successful function approximation.

Overfitting
Overfitting is a common challenge in machine learning, occurring when a model learns the
training data too well to the point that it captures noise and idiosyncrasies rather than the
underlying patterns of the data. It is a crucial concept to understand because overfit models
may perform exceptionally well on the training data but generalize poorly to new, unseen
data. This phenomenon can lead to suboptimal model performance and limits the model's
ability to make accurate predictions in real-world scenarios.

One of the primary causes of overfitting is excessive model complexity. A model with too
many parameters or that is too flexible can fit the training data precisely, even capturing
random fluctuations. As a result, the model becomes overly tailored to the specificities of
the training set, losing its ability to generalize to new, unseen data. Overfitting is often
visualized in a graph where the model's performance on the training data continues to
improve, but its performance on a separate validation or test set starts to decline.

Several factors contribute to overfitting, and understanding them is essential for effective
model development:

1. Model Complexity:

 Models with a large number of parameters, such as high-degree polynomial
regression or deep neural networks, are more prone to overfitting. The
complexity of the model should be chosen carefully based on the complexity
of the underlying data-generating process.

2. Insufficient Data:

 Overfitting is more likely to occur when the available dataset is small. In such
cases, the model may find spurious correlations in the limited data, mistaking
them for genuine patterns.

3. Noise in the Data:

 If the training data contains noise or outliers, an overfit model may capture
these anomalies as if they were meaningful patterns. This results in a lack of
generalizability to new, clean data.

4. Feature Engineering:

 The inclusion of irrelevant or redundant features in the model can contribute
to overfitting. Feature selection and engineering are crucial steps to ensure
that the model focuses on the most informative aspects of the data.

Several techniques can be employed to mitigate overfitting:

1. Regularization:

 Regularization methods introduce a penalty term for large coefficients in the
model, discouraging overly complex models. L1 and L2 regularization are
common techniques used in linear models and neural networks.

2. Cross-Validation:

 Cross-validation involves splitting the dataset into multiple subsets for
training and testing the model. This helps assess the model's performance on
different subsets of the data and provides a more reliable estimate of its
generalization ability.
3. Data Augmentation:

 Augmenting the training dataset by creating variations of existing data (e.g.,
through rotations, translations, or other transformations) can help expose
the model to a more diverse range of examples, reducing the risk of
overfitting.

4. Early Stopping:

 Monitoring the model's performance on a validation set during training and
stopping the training process when the performance begins to degrade can
prevent the model from becoming too specialized on the training data.

In conclusion, overfitting is a significant challenge in machine learning that arises from
overly complex models and insufficient data. It is crucial for practitioners to be aware of the
signs of overfitting and to employ appropriate techniques, such as regularization and cross-
validation, to develop models that generalize well to new, unseen data. Striking the right
balance between model complexity and generalization is essential for creating robust and
reliable machine learning models.
Chapter 2: Regression Analysis in Machine Learning

Introduction

Regression analysis is a powerful statistical method used in machine learning to model the
relationship between a dependent variable (target) and one or more independent variables
(predictors). The primary goal of regression is to understand how changes in the
independent variables correspond to changes in the dependent variable while holding
other variables constant. This method is particularly useful for predicting continuous or
real-valued outcomes, such as temperature, age, salary, and prices.

Example of Regression Analysis: Consider a marketing company (Company A) that
conducts various advertising campaigns each year, resulting in sales. If we have data on the
amount spent on advertising and the corresponding sales for the past five years, we can use
regression analysis to model the relationship between these variables. The company may
want to predict sales for the upcoming year based on a planned advertising budget, and
regression analysis facilitates this prediction.

Regression as a Supervised Learning Technique: Regression is categorized as a
supervised learning technique because it learns from labeled data, where the outcomes
(dependent variable) are known. In the example above, sales would be the dependent
variable, and the advertising expenditure would be the independent variable. Regression is
widely used for prediction, forecasting, time series modeling, and understanding causal-
effect relationships between variables.

In the context of regression, a graph is plotted between the variables, aiming to find the
best-fitting line or curve that passes through the given data points. This line or curve
minimizes the vertical distance between the data points and the regression line, indicating
the strength of the relationship. The machine learning model can then make predictions
based on this fitted line or curve.

Terminologies Related to Regression Analysis:

 Dependent Variable: The main factor we aim to predict or understand is the
dependent variable (also called the target variable).

 Independent Variable: The factors affecting the dependent variable, used for
prediction, are independent variables (also called predictors).

 Outliers: Observations with very low or high values compared to others are outliers
and can impact results.
 Multicollinearity: Highly correlated independent variables can lead to
multicollinearity, which may affect the ranking of important variables.

 Underfitting and Overfitting: Overfitting occurs when a model fits the training
data too closely but generalizes poorly to new data, while underfitting happens
when the model is too simple and performs poorly on both training and test data.

Why Use Regression Analysis?

 Prediction of Continuous Variables: Regression is specifically designed for
predicting continuous or real-valued outcomes.

 Trend Identification: It helps identify trends in data, making it valuable for
understanding patterns and making informed decisions.

 Real-World Applications: Regression is applied in various scenarios such as
weather prediction, sales forecasting, and market trend analysis.

 Factor Importance: Regression enables the identification of the most and least
important factors affecting the dependent variable.

Regression analysis is a foundational technique in machine learning, providing a systematic
way to model and understand relationships between variables. It is widely used for
prediction and decision-making in diverse fields due to its ability to handle continuous
outcomes and unveil valuable insights from data.

Types of Regression

There are various types of regressions which are used in data science and machine
learning. Each variant holds significance in diverse scenarios. Despite their differences, all
regression methods fundamentally assess how independent variables impact dependent
variables. Here we are discussing some important types of regression which are given
below:

 Linear Regression
 Logistic Regression
 Polynomial Regression
 Support Vector Regression
 Decision Tree Regression
 Random Forest Regression
 Ridge Regression
 Lasso Regression
Linear Regression
Linear regression is a fundamental statistical method used in machine learning and data
analysis to model the relationship between a dependent variable and one or more
independent variables. The primary goal of linear regression is to find the best-fitting
linear relationship that describes how changes in the independent variables are
associated with changes in the dependent variable.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

 Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called
Simple Linear Regression.

 Multiple Linear Regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is
called Multiple Linear Regression.

Simple Linear Regression Model:

Regression analysis utilizes the statistical relation between two or more variables so that
one variable can be predicted from the other variable(s).

In simple linear regression, we attempt to model the relationship between two variables, as
seen in the following examples.

 income and number of years of education,
 height and weight of people,
 length and width of envelopes,
 temperature and output of an industrial process,
 altitude and boiling point of water, or
 dose of a drug and response.
In this section, we shall consider a regression model where there is only one independent
variable and the regression function is linear in the regression parameters. The model can
be stated as,
y = β₀ + β₁x + ε

where

i) β₀ and β₁ are regression parameters. β₀ is the intercept and β₁ is the slope
parameter.
ii) x is the independent (or predictor) variable. We assume that it can be measured
without error (i.e. x is not a random variable in the regression model).
iii) ε is the error term, with E(ε) = 0 and Var(ε) = σ². The error term is not observable.
In this context, error does not mean a mistake but is a statistical term representing
random fluctuations, measurement errors, or the effect of factors outside our
control.
iv) y is the dependent (or response) variable.

The designation simple indicates that there is only one x to predict the response y, and
linear means that the model is linear in β₀ and β₁.

The simple linear regression model for n observations can be written as:

yᵢ = β₀ + β₁xᵢ + εᵢ for all i = 1, 2, 3, …, n

In this case, yᵢ and εᵢ are random variables and the xᵢ are known constants.
Therefore, these assumptions hold for the above model:

a) E[εᵢ] = 0 for i = 1, 2, 3, …, n, or equivalently E[yᵢ] = β₀ + β₁xᵢ

b) Var[εᵢ] = σ² for i = 1, 2, 3, …, n, or equivalently Var[yᵢ] = σ²

c) Cov[εᵢ, εⱼ] = 0 for i ≠ j, or equivalently Cov[yᵢ, yⱼ] = 0.

Assumption (a) implies that yᵢ depends only on xᵢ and that all other variation in yᵢ is random.
Assumption (b) is also known as the assumption of homoscedasticity, homogeneous
variance, or constant variance.
Assumption (c) implies that the εᵢ variables are uncorrelated.
7.3 Estimation of the regression parameters (β₀, β₁)

In practical scenarios, the regression parameters are typically not known beforehand.
However, by utilizing a random sample of n observations represented as
(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), we have the means to estimate the parameters β₀, β₁, and σ².
The estimates of β₀ and β₁, denoted as β̂₀ and β̂₁, are obtained through the method of least
squares. Importantly, this method does not necessitate any assumptions regarding the
distribution of the data.

In the method of least squares the point estimates of β₀ and β₁ are the values that
minimize the sum of squared errors (SSE) between observed values yᵢ and predicted values
ŷᵢ:

SSE = ∑ (yᵢ − ŷᵢ)²

    = ∑ (yᵢ − β̂₀ − β̂₁xᵢ)²

where:

∑ represents the sum over all data points (i = 1 to n).

yᵢ is the observed value for the i-th data point.

ŷᵢ is the predicted value for the i-th data point.

To find the minimum of SSE, we take partial derivatives with respect to β₀ and β₁ and set
them equal to zero:

∂SSE/∂β₀ = −2 ∑ (yᵢ − ŷᵢ) = 0

∂SSE/∂β₁ = −2 ∑ xᵢ (yᵢ − ŷᵢ) = 0

Solving for β̂₀ and β̂₁

Solving the above equations gives us the least squares estimates for β₀ and β₁.

For β̂₀:

β̂₀ = ȳ − β̂₁x̄

where:
ȳ is the mean of the observed values (y).
x̄ is the mean of the predictor variable (x).

For β̂₁:

β̂₁ = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = (∑ᵢ₌₁ⁿ xᵢyᵢ − n x̄ȳ) / (∑ᵢ₌₁ⁿ xᵢ² − n x̄²)

These equations provide the estimated coefficients β̂₀ and β̂₁ that define the best-fitting
linear regression line, minimizing the sum of squared differences between the observed
and predicted values. These estimates are used to make predictions and describe the linear
relationship between the predictor variable (x) and the response variable (y).

7.4 Meaning of Regression Parameters

(i) In linear regression, β̂₁ represents the slope or coefficient associated with the
predictor variable x. This coefficient tells us how much the expected value
(mean) of the dependent variable y is expected to change for a one-unit increase
in x. In other words, it quantifies the impact of x on the predicted values of y. If
β̂₁ is positive, it indicates that an increase in x is associated with an increase in
the expected value of y, and if β̂₁ is negative, it indicates the opposite.
(ii) The intercept β̂₀ represents the predicted value of the dependent variable y when
the predictor variable x is equal to zero. In this context, it provides the starting
point or baseline value of y when all other predictors are held constant. If x is a
meaningful predictor variable that can take on a value of zero in the context of
the problem, then the intercept β̂₀ has a specific interpretation: β̂₀ = E(y) at
x = 0.
(iii) In cases where the predictor variable x does not have a meaningful
interpretation at x = 0, the intercept β̂₀ remains relevant as the baseline value of y
when all other predictors are held constant. However, the coefficient β̂₁ for x still
retains its meaning as the change in the expected value of y associated with a
one-unit change in x, even though x = 0 might not be a meaningful point within
the context of the problem.

Example 7.1: Students in a statistics class (taught by one of the authors) claimed that doing
the homework had not helped prepare them for the midterm exam. The exam score y and
homework score x (averaged up to the time of the midterm) for the 11 students in the class
were as follows:

x: 5 6 7 8 9 10 11 12 13 14 15
y: 1 4 3 8 7 7 13 10 16 17 13

xᵢ    yᵢ    xᵢ²    xᵢyᵢ
5     1     25     5
6     4     36     24
7     3     49     21
8     8     64     64
9     7     81     63
10    7     100    70
11    13    121    143
12    10    144    120
13    16    169    208
14    17    196    238
15    13    225    195

∑xᵢ = 110
∑yᵢ = 99
∑xᵢ² = 1210
∑xᵢyᵢ = 1151

x̄ = ∑xᵢ / n = 110/11 = 10

ȳ = ∑yᵢ / n = 99/11 = 9

n x̄ ȳ = 11 × 10 × 9 = 990

β̂₁ = (∑xᵢyᵢ − n x̄ȳ) / (∑xᵢ² − n x̄²) = (1151 − 990) / (1210 − 11 × 10 × 10) = 1.4636

β̂₀ = ȳ − β̂₁x̄ = 9 − 1.4636 × 10 = −5.6364

The prediction equation is thus given by:

ŷ = β̂₀ + β̂₁x = −5.6364 + 1.4636x
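The hand calculation above can be reproduced in a few lines of Python (a minimal sketch of the least-squares formulas applied to the same 11 data points):

```python
# Least-squares estimates for simple linear regression (Example 7.1 data).
x = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
y = [1, 4, 3, 8, 7, 7, 13, 10, 16, 17, 13]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# beta1 = (sum x_i y_i - n*x_bar*y_bar) / (sum x_i^2 - n*x_bar^2)
beta1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
        (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
beta0 = y_bar - beta1 * x_bar

print(round(beta1, 4), round(beta0, 4))  # 1.4636 -5.6364
```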

Figure 7.1 Regression of y on x.

7.5 Properties of the Least Squares Estimators

If the three assumptions of the model hold, then:

a. The least-squares estimators β̂₀ and β̂₁ are unbiased, i.e.

E[β̂₁] = β₁ and

E[β̂₀] = β₀

b. β̂₀ and β̂₁ have minimum variance among all linear unbiased estimators (Gauss–Markov
theorem).

Proof of (a):

β̂₁ = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²

where yᵢ = β₀ + β₁xᵢ + εᵢ and, using assumption (a), E[yᵢ] = β₀ + β₁xᵢ.

We have

E[ȳ] = β₀ + β₁x̄
E[yᵢ − ȳ] = β₁(xᵢ − x̄)

Therefore

E[β̂₁] = E[∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)²]

      = ∑(xᵢ − x̄) E[yᵢ − ȳ] / ∑(xᵢ − x̄)²

      = ∑(xᵢ − x̄) β₁(xᵢ − x̄) / ∑(xᵢ − x̄)² = β₁

Therefore β̂₁ is an unbiased estimator of β₁.

E[β̂₀] = E[ȳ − β̂₁x̄] = β₀ + β₁x̄ − x̄ E[β̂₁] = β₀

The estimator β̂₀ is also unbiased.

(b) Variance of the Regression Parameters

(i) Var[β̂₁] = σ² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²

(ii) Var[β̂₀] = σ² (1/n + x̄² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²)

Proof of (i):

Var[yᵢ] = Var[β₀ + β₁xᵢ + εᵢ] = Var[εᵢ] = σ² (because β₀ + β₁xᵢ is fixed.)

Since ∑(xᵢ − x̄)(yᵢ − ȳ) = ∑(xᵢ − x̄)yᵢ, we can write

Var[β̂₁] = Var[∑(xᵢ − x̄)yᵢ / ∑(xᵢ − x̄)²]

Using the property of variance

Var[aX] = a² Var[X]

and the fact that the yᵢ are uncorrelated, we have

Var[β̂₁] = ∑(xᵢ − x̄)² Var[yᵢ] / (∑(xᵢ − x̄)²)²

        = σ² ∑(xᵢ − x̄)² / (∑(xᵢ − x̄)²)²

        = σ² / ∑(xᵢ − x̄)²

Note that the variance of β̂₁ is minimum when ∑ᵢ₌₁ⁿ (xᵢ − x̄)² is as large as possible, and
because of this reason, values of x should be chosen so as to cover the entire range of its
values.

Proof of (ii):

Var[β̂₀] = Var[ȳ − β̂₁x̄] = Var[ȳ] + x̄² Var[β̂₁] − 2x̄ Cov[ȳ, β̂₁]

Cov[ȳ, β̂₁] = 0 as ȳ and β̂₁ are uncorrelated.

Var[ȳ] = σ²/n

Var[β̂₀] = σ²/n + x̄² σ² / ∑(xᵢ − x̄)²

Var[β̂₀] = σ² (1/n + x̄² / ∑(xᵢ − x̄)²)

The variance of β̂₀ is minimum when x̄ = 0.

7.6 Coefficient of Determination


The coefficient of determination $r^2$ is defined as:

$r^2 = \dfrac{SSR}{SST} = \dfrac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \dfrac{SSE}{SST}$

where $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is the regression sum of squares and $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares. The total sum of squares can be partitioned into $SST = SSR + SSE$, that is,

$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

Thus $r^2$ gives the proportion of variation in $y$ that is explained by the model or, equivalently, accounted for by regression on $x$.

Example 7.2: For the grades data of Example 7.1, we have:

x_i    y_i    x_i^2    x_i*y_i    (ŷ_i − ȳ)^2    (y_i − ȳ)^2
5      1      25       5          53.559          64
6      4      36       24         34.2787         25
7      3      49       21         19.2826         36
8      8      64       64         8.57084         1
9      7      81       63         2.1433          4
10     7      100      70         1.6E-07         4
11     13     121      143        2.14095         16
12     10     144      120        8.56616         1
13     16     169      208        19.2756         49
14     17     196      238        34.2693         64
15     13     225      195        53.5473         16

$\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = 235.64$

$\sum_{i=1}^{n}(y_i - \bar{y})^2 = 280$

$r^2 = \dfrac{SSR}{SST} = \dfrac{235.64}{280} = 0.84159$

The correlation between homework score and exam score is $r = \sqrt{0.84159} = 0.91736$.
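The hand calculations of Examples 7.1 and 7.2 can be reproduced numerically; a minimal sketch with NumPy, using the grades data from the table above:

```python
import numpy as np

# Grades data from Example 7.1/7.2 (x = homework score, y = exam score)
x = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([1, 4, 3, 8, 7, 7, 13, 10, 16, 17, 13], dtype=float)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Least-squares estimates of slope and intercept
beta1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)
beta0 = y_bar - beta1 * x_bar

# Fitted values and coefficient of determination
y_hat = beta0 + beta1 * x
ssr = np.sum((y_hat - y_bar) ** 2)   # regression sum of squares
sst = np.sum((y - y_bar) ** 2)       # total sum of squares
r2 = ssr / sst

print(beta1, beta0, r2)  # ≈ 1.4636, -5.6364, 0.84159
```

The same estimates can be obtained with `np.polyfit(x, y, 1)`, which returns the coefficients in descending order of degree.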

Interval estimation in Simple Linear Regression

In simple linear regression, interval estimation involves estimating a range within which a
population parameter, such as the slope or intercept of the regression line, is likely to fall.
This process is essential for understanding the precision and reliability of the regression
model's parameters. Two common types of interval estimation in simple linear regression
are confidence intervals for the regression coefficients (slope and intercept).

1. Confidence Interval for the Slope (β ₁):

The slope of the regression line in simple linear regression represents the rate of change in the dependent variable for a one-unit change in the independent variable. A confidence interval for the slope provides a range of values within which the true population slope is likely to lie. The width of these confidence intervals is a measure of the overall quality of the regression line. If the errors are normally and independently distributed, then $(\hat{\beta}_j - \beta_j)/SE(\hat{\beta}_j)$, for $j = 0, 1$, follows a $t$-distribution with $n - 2$ degrees of freedom.

The formula for the confidence interval for the slope ($\beta_1$) is typically given as:

$\hat{\beta}_1 - t_{\alpha/2,\,n-2}\,SE(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\,SE(\hat{\beta}_1)$

where:

 $\hat{\beta}_1$ is the estimated slope from the regression analysis,

 $t_{\alpha/2,\,n-2}$ is the critical value from the $t$-distribution based on the desired confidence level, with $n - 2$ degrees of freedom,

 $SE(\hat{\beta}_1)$ is the standard error of the slope.

This interval provides a range of values for the population slope with a specified level of
confidence.

2. Confidence Interval for the Intercept (β₀):


The intercept of the regression line is the predicted value of the dependent variable when
the independent variable is zero. Similar to the slope, a confidence interval for the intercept
provides a range of values within which the true population intercept is likely to fall.

The formula for the confidence interval for the intercept ($\beta_0$) is typically given as:

$\hat{\beta}_0 - t_{\alpha/2,\,n-2}\,SE(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2,\,n-2}\,SE(\hat{\beta}_0)$

where:

 $\hat{\beta}_0$ is the estimated intercept from the regression analysis,

 $t_{\alpha/2,\,n-2}$ is the critical value from the $t$-distribution based on the desired confidence level,

 $SE(\hat{\beta}_0)$ is the standard error of the intercept.

This interval provides a range of values for the population intercept with a specified
level of confidence.

In both cases, the critical value t is determined based on the desired confidence level
and degrees of freedom associated with the regression analysis. Common confidence
levels are 95% or 99%, and the degrees of freedom depend on the sample size and the
number of parameters estimated.

Interval estimation in simple linear regression is crucial for understanding the


uncertainty associated with the estimated coefficients and provides insights into the
range of values that the true population parameters are likely to encompass.
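As a concrete illustration, a 95% confidence interval for the slope of the grades data of Example 7.1 can be computed as follows; the critical value $t_{0.025,\,9} \approx 2.262$ is taken from a $t$-table rather than computed:

```python
import numpy as np

# Grades data from Example 7.1
x = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([1, 4, 3, 8, 7, 7, 13, 10, 16, 17, 13], dtype=float)
n = len(x)

# Least-squares fit and residuals
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
resid = y - (beta0 + beta1 * x)

# Estimate of sigma^2 (with n - 2 degrees of freedom) and SE of the slope
sigma2_hat = np.sum(resid ** 2) / (n - 2)
se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))

t_crit = 2.262  # t_{0.025, 9} from a t-table (95% confidence, n - 2 = 9 df)
ci = (beta1 - t_crit * se_beta1, beta1 + t_crit * se_beta1)
print(ci[0], ci[1])  # ≈ 0.985 1.942
```

With SciPy available, `scipy.stats.t.ppf(0.975, n - 2)` computes the critical value instead of a table lookup.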

Residual
The residual is defined as the disparity between the observed value and the fitted value of
the study variable. For the i th observation, the residual is computed as follows:

$e_i = y_i - \hat{y}_i, \quad i = 1, 2, \dots, n$

Here, $y_i$ represents an observed data point, and $\hat{y}_i$ is the corresponding predicted value obtained from the regression model.

Residuals can be seen as a measure of the deviation between the actual data and the
model's predictions. They represent the amount of variability in the response variable that
cannot be accounted for by the regression model.
Residuals can also be interpreted as the observed values of the model's errors. Therefore,
any departure from the assumptions regarding random errors should be reflected in the
residuals. Analysing the residuals aids in identifying inadequacies in the model and
assessing whether the model's assumptions hold.

Understanding and analyzing residuals is crucial in assessing the performance of the


regression model. Here are some key points related to residuals in simple linear regression:

 Residual Analysis: Examining the pattern of residuals can provide insights into the
adequacy of the model. Ideally, residuals should be randomly distributed around
zero, indicating that the model is capturing the underlying relationship.

 Residual Plots: Creating residual plots, such as scatterplots of residuals against


predicted values or independent variables, can help identify patterns or trends that
the model might have missed.

 Homoscedasticity: Residuals should exhibit homoscedasticity, meaning they should


have constant variance across all levels of the independent variable.
Heteroscedasticity (varying variance) might indicate a violation of assumptions.

 Normality of Residuals: While normality is an assumption of linear regression, it is


more critical for hypothesis testing and confidence interval estimation. If the sample
size is large, the Central Limit Theorem often mitigates concerns about normality.

 Outliers and Influential Points: Residual analysis can help identify outliers or
influential points that have a substantial impact on the regression model.

Residuals in simple linear regression quantify the unexplained variability in the


relationship between the dependent and independent variables. Analyzing residuals is an
integral part of assessing the model's performance and ensuring that the underlying
assumptions of linear regression are met.

The general linear model, often referred to as the multiple linear regression model, is used
for analysing data involving multiple independent variables. In this context, we consider a
linear multiple regression model with k independent variables. Multiple linear regression
extends the principles of simple linear regression to accommodate situations where there
are more than one independent variable influencing the dependent variable.
8.2 Multiple Linear Regression Model
The representation of the regression model is as follows:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$

Where
(i) The regression parameters, denoted as β 0, β 1, ..., β k , represent key coefficients within the
model.
(ii) The independent variables, namely x ₁ , x ₂ , ... , x ₖ, are the predictor variables. These
variables are considered fixed and can be measured without any error or uncertainty.
(iii) The variable y, which serves as the response variable, is observable and subject to
random error ε.

This regression model comes into play when the response variable relies on multiple
independent variables.
The regression function, also known as the response function, associated with the above
model can be expressed as:

$E[y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$

The outcome y is frequently impacted by multiple predictor variables. For instance, the
crop yield could be influenced by the quantities of nitrogen, potash, and phosphate
fertilizers applied. While these factors can be manipulated by the experimenter, the yield
might also be affected by uncontrollable variables, such as those related to weather
conditions.

Let's consider that we have a random sample consisting of n observations of Y , collected at


specific values of the independent variables.

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$

The assumptions for ε i or y i are the same as those for simple linear regression:

a) $E[\varepsilon_i] = 0$ for $i = 1, 2, \dots, n$, or equivalently $E[y_i] = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki}$

b) $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for $i = 1, 2, \dots, n$, or equivalently $\mathrm{Var}[y_i] = \sigma^2$

c) $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0$ for $i \neq j$, or equivalently $\mathrm{Cov}[y_i, y_j] = 0$

Assumption 1 states that the model is correct, in other words that all relevant x ’ s are
included, and the model is indeed linear. Assumption 2 asserts that the variance of y is
constant and therefore does not depend on the x ’ s . Assumption 3 states that the y ’ s are
uncorrelated with each other, which usually holds in a random sample (the observations
would typically be correlated in a time series or when repeated measurements are made on
a single plant or animal). Later we will add a normality assumption, under which the y
variable will be independent as well as uncorrelated.
When all three assumptions hold, the least-squares estimators of the β ’s have some good
properties. If one or more assumptions do not hold, the estimators may be poor. Under the
normality assumption, the maximum likelihood estimators have excellent properties.
Any of the three assumptions may fail to hold with real data. Several procedures have been
devised for checking the assumptions. These diagnostic techniques are discussed in
Chapter 10.

We get the following set of $n$ equations:

$y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \dots + \beta_k x_{k1} + \varepsilon_1$
$y_2 = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \dots + \beta_k x_{k2} + \varepsilon_2$
$\vdots$
$y_n = \beta_0 + \beta_1 x_{1n} + \beta_2 x_{2n} + \dots + \beta_k x_{kn} + \varepsilon_n$

These $n$ equations can be written in matrix form as:

$Y = X\beta + \varepsilon \qquad (8.4)$

$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \qquad (8.5)$

where,

(i) Y represents a column vector consisting of n observed values.


(ii) X is an (n × p) matrix comprising known values, where p = k + 1.
(iii) β is a column vector representing p unknown parameters.
(iv) ε is a column vector that contains unobservable errors associated with n observations, and
it's important to note that p is less than n.

We assume that the Rank ( X )= p. In such a scenario, the matrix X is considered to have full
rank.

The preceding three assumptions on $\varepsilon_i$ or $y_i$ can be expressed in terms of the model as:

a) $E[\varepsilon] = 0$ or $E[Y] = X\beta$
b) $\mathrm{Cov}(\varepsilon) = \sigma^2 I$ or $\mathrm{Cov}(Y) = \sigma^2 I$

8.3 Meaning of Regression Parameter


(i) The parameter $\beta_i$ signifies the change in $E(y)$ for each unit increase in $x_i$ while keeping all other variables constant.
(ii) If the model's scope includes the point $x_1 = 0, x_2 = 0, \dots, x_k = 0$, the intercept $\beta_0$ represents the expected value $E(y)$ when all independent variables are set to zero.

When the influence of x i on E( y) remains consistent regardless of the values of the other
variables for all i, it can be said that the variables exhibit an additive effect or do not
interact with each other.

8.4 Least-Squares Estimator for β


The least squares method is employed to calculate estimates for the p unknown regression
parameters. It's important to note that no distributional assumptions regarding y are
necessary to obtain these estimators.
For the parameters $\beta_0, \beta_1, \dots, \beta_k$, our objective is to find estimators, denoted $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$, that minimize the sum of squared deviations between the $n$ observed values of $y$ and their corresponding predicted values $\hat{y}$. These estimates are the values that minimize the sum of squared errors (SSE):

$SSE = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \dots - \hat{\beta}_k x_{ki})^2 \qquad (8.6)$

Note that the predicted value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \dots + \hat{\beta}_k x_{ki}$ estimates $E(y_i)$, not $y_i$; a better notation would be $\hat{E}(y_i)$, but $\hat{y}_i$ is commonly used.

To find the values of $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ that minimize SSE, we differentiate SSE with respect to each $\beta_j$ and set the results equal to zero, yielding $p = k + 1$ equations that can be solved simultaneously for the $\hat{\beta}_j$'s.

$\dfrac{\partial\,SSE}{\partial \beta_j} = 0, \quad j = 0, 1, \dots, k$

$\dfrac{\partial\,SSE}{\partial \beta_0} = -2 \sum (y_i - \hat{y}_i) = 0 \qquad (1)$

$\dfrac{\partial\,SSE}{\partial \beta_1} = -2 \sum x_{1i} (y_i - \hat{y}_i) = 0 \qquad (2)$

$\vdots$

$\dfrac{\partial\,SSE}{\partial \beta_k} = -2 \sum x_{ki} (y_i - \hat{y}_i) = 0 \qquad (p)$

Solving the above gives the following set of $p$ normal equations:

$\sum y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum x_{1i} + \hat{\beta}_2 \sum x_{2i} + \dots + \hat{\beta}_k \sum x_{ki} \qquad (1)$

$\sum x_{1i} y_i = \hat{\beta}_0 \sum x_{1i} + \hat{\beta}_1 \sum x_{1i}^2 + \hat{\beta}_2 \sum x_{1i} x_{2i} + \dots + \hat{\beta}_k \sum x_{1i} x_{ki} \qquad (2)$

$\vdots$

$\sum x_{ki} y_i = \hat{\beta}_0 \sum x_{ki} + \hat{\beta}_1 \sum x_{ki} x_{1i} + \hat{\beta}_2 \sum x_{ki} x_{2i} + \dots + \hat{\beta}_k \sum x_{ki}^2 \qquad (p)$

These equations can be written in matrix form as:

$X^T X \hat{\beta} = X^T y$

Pre-multiplying both sides by $(X^T X)^{-1}$ gives

$(X^T X)^{-1} X^T X \hat{\beta} = (X^T X)^{-1} X^T y$

The least squares estimate of the parameters is therefore given by

$\hat{\beta} = (X^T X)^{-1} X^T y$

Because we assumed that X is of full rank, the matrix ( X T X ) has an inverse, and the
solution of ^β is unique.

Example 8.1:

Consider a data set,

area (x1)   rooms (x2)   age (x3)   price (y)
23          3            8          6562
15          2            7          4569
24          4            9          6897
29          5            4          7562
31          7            6          8234
25          3            10         7485

Here area, rooms, and age are the features (independent variables) and price is the target (dependent variable). Including a column of ones for the intercept, we form:

X = [[ 1, 23, 3, 8],
[ 1, 15, 2, 7],
[ 1, 24, 4, 9],
[ 1, 29, 5, 4],
[ 1, 31, 7, 6],
[ 1, 25, 3, 10]]

X^T = [[ 1,  1,  1,  1,  1,  1],
       [23, 15, 24, 29, 31, 25],
       [ 3,  2,  4,  5,  7,  3],
       [ 8,  7,  9,  4,  6, 10]]

y = [6562, 4569, 6897, 7562, 8234, 7485]

$\hat{\beta} = (X^T X)^{-1} X^T y = \begin{bmatrix} 305.33 \\ 236.86 \\ -4.76 \\ 102.90 \end{bmatrix}$

Hence

$\hat{\beta}_0 = 305.33$
$\hat{\beta}_1 = 236.86$
$\hat{\beta}_2 = -4.76$
$\hat{\beta}_3 = 102.90$

The model can be interpreted as:

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \hat{\beta}_3 x_{3i}$

$\hat{y}_i = 305.33 + 236.86\,x_{1i} - 4.76\,x_{2i} + 102.90\,x_{3i}$


Actual value of Test Data = [8234, 7485]

Predicted value of Test Data = [8232. 7241.52380952]
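The estimate $\hat{\beta} = (X^T X)^{-1} X^T y$ from Example 8.1 can be reproduced numerically. A minimal sketch with NumPy (fitting all six observations, since the train/test split used above is not spelled out):

```python
import numpy as np

# Data from Example 8.1: intercept column, area (x1), rooms (x2), age (x3)
X = np.array([[1, 23, 3,  8],
              [1, 15, 2,  7],
              [1, 24, 4,  9],
              [1, 29, 5,  4],
              [1, 31, 7,  6],
              [1, 25, 3, 10]], dtype=float)
y = np.array([6562, 4569, 6897, 7562, 8234, 7485], dtype=float)

# Least-squares estimate via the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_pred = X @ beta_hat

# The text reports beta_hat ≈ [305.33, 236.86, -4.76, 102.90]
print(beta_hat)
```

`np.linalg.lstsq(X, y, rcond=None)` gives the same solution and is numerically more stable when $X^T X$ is ill-conditioned.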

Figure 8.1 Regression of y on x1 ignoring x2, x3.

Figure 8.2 Regression of y on x2 ignoring x1, x3.


Figure 8.3 Regression of y on x3 ignoring x1, x2.

Figure 8.4 Multiple linear regression: predicted price versus actual price (y).

8.5 Properties of the estimated parameters


a. $\hat{\beta}$ is unbiased.
b. $\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$.
c. Under the normality assumption, $\hat{\beta}$ is a normal random vector.
d. The least squares estimators have the minimum variance (are the most efficient) among the class of all linear unbiased estimators of $\beta$ (Gauss-Markov theorem).

8.5.1 Unbiasedness of $\hat{\beta}$

$E[Y] = E[X\beta + \varepsilon] = X\beta$

$E[\hat{\beta}] = E\!\left[(X^T X)^{-1} X^T Y\right] = (X^T X)^{-1} X^T X \beta = \beta$

Therefore, $\hat{\beta}$ is unbiased. For each $j$, $\hat{\beta}_j$ is an unbiased estimator of $\beta_j$.

8.5.2 $\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$

We use the following property of the covariance of a random vector: let $V$ be a random vector; for a given matrix $M$ of constant elements,

$\mathrm{Cov}(MV) = M\,\mathrm{Cov}(V)\,M^T$

Let $M = (X^T X)^{-1} X^T$ and $V = Y$. Then

$\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} X^T \,\sigma^2 I\, X (X^T X)^{-1} = (X^T X)^{-1} \sigma^2$

The $i$-th diagonal element of $\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$ gives the variance of $\hat{\beta}_i$, and the $(i, j)$-th element gives the covariance of $\hat{\beta}_i$ and $\hat{\beta}_j$ for $i \neq j$.


The matrix $\mathrm{Cov}(\hat{\beta})$ is a $(p \times p)$ symmetric matrix. Because it gives the variances as well as the covariances, it is also referred to as the variance-covariance matrix of $\hat{\beta}$. The matrix is positive definite if and only if all of its eigenvalues are greater than zero.

Under the normality assumption, $\hat{\beta} = (X^T X)^{-1} X^T y$ is a linear function of the normal vector $y$; therefore $\hat{\beta}$ is also a normal random vector.
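The unbiasedness and covariance results above can be checked empirically with a small Monte Carlo simulation; a sketch under assumed values $\beta = (2, 3)$ and $\sigma = 1$ (both arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, sigma = 50, 2000, 1.0
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])   # design matrix with intercept column
beta_true = np.array([2.0, 3.0])       # assumed "true" coefficients (illustrative)

# Repeatedly draw y = X beta + eps and re-estimate beta by least squares
estimates = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta_true + rng.normal(0, sigma, n)
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

emp_mean = estimates.mean(axis=0)            # should be close to beta_true
emp_cov = np.cov(estimates, rowvar=False)    # should be close to (X^T X)^{-1} sigma^2
theory_cov = np.linalg.inv(X.T @ X) * sigma**2
print(emp_mean)
print(emp_cov)
print(theory_cov)
```

With 2000 replications the empirical mean and covariance of the estimates track the theoretical values closely, illustrating both $E[\hat{\beta}] = \beta$ and $\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1}\sigma^2$.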

Example 8.2

(a) Use the matrix method to estimate the parameters of the simple linear regression
model.
(b) Obtain the variance-covariance matrix of the estimators.
Solution

(a)

We have

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \dots, n$

Using matrix notation, the set of $n$ equations can be written as:

$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$

$y = X\beta + \varepsilon$

$X^T X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}$

$X^T y = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$

$(X^T X)^{-1} = \dfrac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}$

$\hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = (X^T X)^{-1} X^T y = \dfrac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} \sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i \\ n \sum x_i y_i - \sum x_i \sum y_i \end{bmatrix}$

We obtain:

$\hat{\beta}_0 = \dfrac{\bar{y} \sum x_i^2 - \bar{x} \sum x_i y_i}{\sum x_i^2 - n \bar{x}^2}, \qquad \hat{\beta}_1 = \dfrac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}$
(b) $\mathrm{Var}(\hat{\beta}) = \mathrm{Var}\!\left((X^T X)^{-1} X^T y\right) = (X^T X)^{-1} \sigma^2$

$(X^T X)^{-1} \sigma^2 = \dfrac{\sigma^2}{n \sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix} = \begin{bmatrix} \mathrm{Var}(\hat{\beta}_0) & \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\ \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) & \mathrm{Var}(\hat{\beta}_1) \end{bmatrix}$

$\mathrm{Var}(\hat{\beta}_0) = \dfrac{\sigma^2 \sum x_i^2}{n \sum x_i^2 - (\sum x_i)^2} = \dfrac{\sigma^2}{n} + \dfrac{\bar{x}^2 \sigma^2}{\sum (x_i - \bar{x})^2}$

$\mathrm{Var}(\hat{\beta}_1) = \dfrac{\sigma^2}{\sum (x_i - \bar{x})^2}$

$\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = \dfrac{-\sigma^2 \sum x_i}{n \sum x_i^2 - (\sum x_i)^2} = \dfrac{-\bar{x} \sigma^2}{\sum (x_i - \bar{x})^2}$
Statistical measures to assess the goodness of fit and the significance
of the model
When interpreting the output of a multiple linear regression model, several key statistical
measures are essential to assess the goodness of fit and the significance of the model. Here
are some common elements found in the output and their interpretations:

R-squared ($R^2$):

R-squared represents the proportion of the variance in the dependent variable that is
explained by the independent variables in the model. It ranges from 0 to 1, where a higher
R-squared value indicates a better fit.
$R^2 = \dfrac{SSR}{SST} = \dfrac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \dfrac{SSE}{SST}$

where $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is the regression sum of squares and $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares. The total sum of squares can be partitioned into $SST = SSR + SSE$, that is,

$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

Thus $R^2$ gives the proportion of variation in $y$ that is explained by the model or, equivalently, accounted for by regression on the $x$'s.

 Example: An R-squared of 0.75 means that 75% of the variability in the dependent
variable is explained by the independent variables in the model.

Adjusted R-squared ($\bar{R}^2$):

The Adjusted R-squared is a modification of the R-squared statistic in multiple linear


regression that adjusts for the number of predictors in the model. While R-squared
measures the proportion of variance in the dependent variable explained by the
independent variables, the Adjusted R-squared provides a more nuanced evaluation by
penalizing the inclusion of irrelevant predictors.

$\bar{R}^2 = 1 - \dfrac{SSE / (n - k - 1)}{SST / (n - 1)}$

where

 $n$ is the number of observations,

 $k$ is the number of predictors.

Interpretation of Adjusted R-squared:

 Increased Predictors:

 As more predictors are added to a model, R-squared tends to increase even if the
additional predictors do not contribute significantly to explaining the variance.
Adjusted R-squared adjusts for this by penalizing the increase in R-squared due to
adding less relevant predictors.

 Comparison between Models:

 Adjusted R-squared allows for a fair comparison of models with different numbers
of predictors. A higher Adjusted R-squared suggests a better balance between model
complexity and explanatory power.

 Model Parsimony:

 Adjusted R-squared encourages model parsimony, favoring models that achieve a


high degree of explanatory power with fewer predictors. This is important in
avoiding overfitting.

 Key Points:

 Range: Like R-squared, Adjusted R-squared ranges from 0 to 1. A value closer to 1


indicates a better fit, accounting for both explanatory power and model simplicity.

 Interpretation: An increase in Adjusted R-squared indicates that the additional


predictors are contributing meaningfully to the model's explanatory power.
Conversely, a decrease may suggest that the new predictors are not providing
substantial benefits.

 Caveats: While Adjusted R-squared is a valuable metric, it has limitations. It


assumes that the underlying assumptions of multiple linear regression are met and
may not adequately penalize certain types of overfitting.

Adjusted R-squared is a valuable metric for assessing the goodness of fit in multiple
linear regression models, providing a balanced measure that considers both
explanatory power and model complexity. It aids in selecting models that strike an
optimal balance between capturing variance in the data and avoiding unnecessary
complexity.

F -statistic

The F-statistic in multiple linear regression is a statistical measure that assesses the
overall significance of the regression model. It tests the null hypothesis that all the
coefficients of the independent variables in the model are equal to zero, suggesting that
the model has no explanatory power. In contrast, the alternative hypothesis is that at
least one of the coefficients is not equal to zero, indicating that the model is statistically
significant.

The formula for the $F$-statistic in multiple linear regression is as follows:

$F = \dfrac{SSR / k}{SSE / (n - k - 1)}$

where:

 $SSR$ is the regression (explained) sum of squares,

 $SSE$ is the error (residual) sum of squares,

 $k$ is the number of predictors (independent variables),

 $n$ is the number of observations.

The F -statistic follows an F -distribution with degrees of freedom k for the numerator
and n−k −1for the denominator.

Interpretation of the F-statistic:

1. Null Hypothesis: The null hypothesis for the F-test is that all coefficients of the
independent variables are equal to zero, implying that the model has no explanatory
power.

2. Alternative Hypothesis: The alternative hypothesis is that at least one coefficient is


not equal to zero, indicating that the model is statistically significant and has
explanatory power.
3. Significance Level: The F -statistic is compared to a critical value from the F -
distribution based on a chosen significance level (e.g., 0.05). If the calculated F-
statistic is greater than the critical value, the null hypothesis is rejected.

4. Model Significance: If the F -statistic is statistically significant, it suggests that the


overall model is meaningful, and at least one independent variable contributes
significantly to explaining the variance in the dependent variable.

5. Caution: A significant F -statistic does not provide information about which specific
predictors are significant; for that, individual t-tests for each coefficient are typically
conducted.

The F-statistic in multiple linear regression is a crucial tool for determining whether the
overall model is statistically significant. It helps assess whether the inclusion of
independent variables in the model contributes meaningfully to explaining the variation
in the dependent variable.

Significance F ( p-value):

The significance F , often referred to as the p-value associated with the F -statistic, indicates
the probability of observing an F -statistic as extreme as the one calculated if the null
hypothesis were true.

If the significance F is less than a chosen significance level (e.g., 0.05), you reject the null
hypothesis, suggesting that the overall model is statistically significant.

A significant F -statistic implies that at least one independent variable in the model
contributes significantly to explaining the variance in the dependent variable.

The degrees of freedom for the F -statistic are associated with the number of predictors ( k )
and the number of observations (n ) in the model.

Example: Suppose you obtain an F -statistic of 15.5 with 3 and 96 degrees of freedom
(associated with k and n ). The Significance F ( p-value) is calculated to be 0.0002 (less than
0.05). In this case:

Conclusion: Since the p-value is less than the chosen significance level (e.g., 0.05), you
would reject the null hypothesis.

Interpretation: This suggests that at least one of the predictors in the model is significant,
and the overall model is statistically significant.

The F -statistic and Significance F are crucial for determining whether the regression
model as a whole is statistically significant. A significant F-statistic implies that the model
has explanatory power, and at least one independent variable contributes significantly to
the prediction of the dependent variable.

Standard Error
In multiple linear regression, the standard error (often referred to as the standard error of
the residuals or residual standard error) is a measure of the dispersion of the observed
values around the predicted values. It provides an estimate of how much the actual values
deviate from the predicted values and is a crucial metric for assessing the accuracy and
precision of the regression model.

The standard error in multiple linear regression is mathematically calculated as the square
root of the mean squared residual (MSR), which is the average of the squared differences
between the observed and predicted values. The formula for the standard error is as
follows:

$\text{Standard Error} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}}$

where:

 y i is the observed value for the i-th observation,

 ^y iis the predicted value for the i-th observation,

 n is the number of observations,

 k is the number of predictors (independent variables) in the model.

Key points regarding the standard error in multiple linear regression:

1. Accuracy of Predictions: A lower standard error indicates that the model's


predictions are, on average, closer to the actual values. It serves as a measure of the
accuracy of the model.

2. Comparison of Models: When comparing different regression models, the model


with a lower standard error is generally preferred, as it suggests better predictive
performance.
3. Degrees of Freedom: The denominator in the formula accounts for the degrees of
freedom, specifically $n - k - 1$, where $n$ is the number of observations and $k$ is the
number of predictors. This adjustment is important for unbiased estimation.

4. Assumption of Homoscedasticity: The standard error is related to the assumption


of homoscedasticity, which means that the variance of the residuals is constant
across all levels of the independent variables.

5. Model Evaluation: In conjunction with other metrics such as R-squared and the F-
statistic, the standard error is useful for evaluating the overall performance of the
multiple linear regression model.

In summary, the standard error in multiple linear regression is a fundamental metric that
quantifies the precision of the model's predictions. A lower standard error indicates a
better fit, but its interpretation should be considered alongside other model evaluation
metrics and assumptions.
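These summary measures can be computed directly from the sums of squares; a sketch using the simple-regression grades data of Example 7.1 (so $k = 1$), which generalizes unchanged to multiple predictors:

```python
import numpy as np

# Grades data from Example 7.1 (k = 1 predictor)
x = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([1, 4, 3, 8, 7, 7, 13, 10, 16, 17, 13], dtype=float)
n, k = len(x), 1

# Least-squares fit and sums of squares
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x
sse = np.sum((y - y_hat) ** 2)          # error (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = sst - sse                         # regression sum of squares

r2 = ssr / sst                                      # R-squared
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))  # adjusted R-squared
f_stat = (ssr / k) / (sse / (n - k - 1))            # F-statistic
std_err = np.sqrt(sse / (n - k - 1))                # residual standard error

print(r2, adj_r2, f_stat, std_err)  # ≈ 0.8416 0.8240 47.81 2.2200
```

For a single predictor, the F-statistic equals the square of the slope's t-statistic, so the F-test and the t-test for the slope are equivalent in this case.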

Coefficient p-values
In multiple linear regression, the p-values associated with the coefficients of the
independent variables provide information about the statistical significance of each
predictor in the model. Each coefficient p-value is associated with a null hypothesis that the
corresponding coefficient is equal to zero, indicating no effect of that particular predictor
on the dependent variable.

Here's how to interpret the p-values for the coefficients in multiple linear regression:

1. Null Hypothesis: The null hypothesis (H 0 ) for each coefficient is that the
corresponding predictor has no effect on the dependent variable.

2. Alternative Hypothesis: The alternative hypothesis (H a ) is that the coefficient is


not equal to zero, suggesting that the predictor has a statistically significant effect on
the dependent variable.

3. Significance Level (α ): The significance level, often denoted as α , is the chosen


threshold for determining statistical significance. Common values are 0.05 or 0.01.

4. Interpretation:

 p-value < α : If the p-value for a coefficient is less than the chosen
significance level (e.g., 0.05), you reject the null hypothesis. This suggests that
the corresponding predictor is statistically significant in predicting the
dependent variable.
 p-value ≥ α : If the p-value is greater than or equal to the significance level,
you fail to reject the null hypothesis. This implies that there is not enough
evidence to conclude that the predictor has a significant effect.

5. Decision Rule: A common decision rule is to reject the null hypothesis when the p-
value is less than α and, conversely, fail to reject the null hypothesis when the p-
value is greater than or equal to α .

6. Importance of Individual Significance: While the F -statistic tests the overall


significance of the model, examining individual coefficient p-values is essential for
identifying which specific predictors contribute significantly to the model.

Example: Suppose you have a multiple linear regression model with predictors $X_1, X_2, X_3$ and corresponding coefficients $\beta_1, \beta_2, \beta_3$. The p-values associated with these coefficients are $p_1, p_2, p_3$. If $p_1 < 0.05$, you might conclude that $X_1$ is a significant predictor, while values of $p_2$ and $p_3$ above 0.05 suggest that $X_2$ and $X_3$ are not statistically significant predictors.

Coefficient p-values in multiple linear regression help assess the individual contributions of
each predictor to the model. A low p-value indicates that the corresponding predictor is
likely to be a significant factor in predicting the dependent variable.

Feature Selection and Dimensionality Reduction

Introduction

In machine learning and data analysis, handling datasets with numerous features or
variables is a common challenge. High-dimensional data can lead to increased
computational complexity, overfitting, and difficulties in visualizing and interpreting
results. Feature selection and dimensionality reduction techniques are employed to address
these issues. This discussion focuses on three widely used methods: Principal Component
Analysis (PCA), Linear Discriminant Analysis (LDA), and Independent Component Analysis
(ICA).

Feature Selection vs. Dimensionality Reduction

Feature Selection: Feature selection involves choosing a subset of relevant features from
the original set. The goal is to retain the most informative variables while discarding
irrelevant or redundant ones. This process simplifies models, improves interpretability, and
reduces the risk of overfitting. Common techniques include filter methods (e.g., correlation
analysis), wrapper methods (e.g., recursive feature elimination), and embedded methods
(e.g., regularization).
Dimensionality Reduction: Dimensionality reduction aims to transform the data into a
lower-dimensional space while preserving its essential characteristics. This is achieved by
creating new variables, known as components or factors, that capture most of the original
data's variance. Dimensionality reduction techniques are particularly useful when the
dataset has a large number of features. PCA, LDA, and ICA are three powerful methods for
dimensionality reduction.

Principal Component Analysis (PCA)

PCA is an unsupervised linear transformation technique that identifies the directions


(principal components) along which the data varies the most. The first principal
component captures the maximum variance, the second component captures the second
most, and so on. By projecting the data onto a lower-dimensional subspace defined by these
principal components, PCA reduces dimensionality.

Procedure:

1. Standardization: Standardize the data to ensure that all variables are on the same
scale.

2. Covariance Matrix: Compute the covariance matrix of the standardized data.

3. Eigenvalue Decomposition: Find the eigenvectors and eigenvalues of the covariance matrix.

4. Select Principal Components: Sort the eigenvalues in descending order and choose
the top k eigenvectors to form the matrix W .

5. Project Data: Multiply the original data by W to obtain the lower-dimensional representation.

Applications:

 Image Compression: Reduce the dimensionality of image data while retaining key
features.

 Noise Reduction: Identify and remove noise or irrelevant information.

Example

Suppose we have a dataset with three data points in two dimensions:

X = [ 1  2 ]
    [ 3  4 ]
    [ 5  6 ]
Step 1: Standardize the Data

We start by standardizing the data (subtracting the mean and dividing by the standard
deviation for each feature). In this small example, we won't need to do that since the
data is already small and simple.

Step 2: Calculate the Covariance Matrix

The covariance matrix is calculated using the formula:

Σ = (1/(n−1)) · (X − X̄)ᵀ (X − X̄)

where X̄ is the mean vector; here X̄ = (3, 4).

Σ = (1/2) · [ 8  8 ] = [ 4  4 ]
            [ 8  8 ]   [ 4  4 ]

Step 3: Calculate Eigenvectors and Eigenvalues

Next, we calculate the eigenvectors and eigenvalues of the covariance matrix. The
eigenvectors represent the directions of maximum variance, and the eigenvalues
represent the magnitude of the variance in those directions.

Eigenvalues (λ): {8, 0}

Eigenvector for λ = 8 (v): [1, 1]ᵀ


Step 4: Sort Eigenvectors by Eigenvalues

Sort the eigenvectors in descending order based on their corresponding eigenvalues. In this case, we only have one non-zero eigenvalue, so the sorting is straightforward.

Step 5: Choose Principal Components

Choose the top k eigenvectors to form the matrix W . Since we have only one non-zero
eigenvalue, we choose the corresponding eigenvector as our principal component.

W = [1, 1]ᵀ

Step 6: Project Data onto Principal Components

Project the original data onto the subspace defined by the principal components.
Projected Data (Z): Z = X · W = [3, 7, 11]ᵀ

The resulting matrix Z represents the dataset in the reduced space defined by the
principal components.

This is a simplified example, and in real-world scenarios, PCA is applied to datasets with
many more dimensions to capture the most significant sources of variance in the data.
The eigenvalues and eigenvectors are crucial in determining the principal components
and their importance in representing the data.
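The worked example above can be checked numerically. A minimal sketch with NumPy, following the same steps (the unnormalized eigenvector [1, 1] is used for the projection, exactly as in the example):

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Step 2: covariance matrix (rows of X.T are the two features).
cov = np.cov(X.T)                        # [[4, 4], [4, 4]]

# Step 3: eigen-decomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues {0, 8}

# Steps 4-6: project onto the top direction, using [1, 1] as in the example.
W = np.array([1.0, 1.0])
Z = X @ W                                # [3, 7, 11]
```

In practice the eigenvector would be normalized to unit length before projecting; the unnormalized version is kept here only to match the hand computation.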

Linear Discriminant Analysis (LDA)

LDA is a supervised technique that aims to find the linear combinations of features that
best separate different classes in a classification problem. Unlike PCA, LDA considers class
labels, making it particularly useful for feature extraction in the context of classification
tasks.

Linear Discriminant Analysis (LDA) is widely regarded as the preferred method for
effectively distinguishing between two or more classes characterized by multiple features
in classification problems. For instance, in scenarios where there are two classes with
multiple features and a need for efficient separation, LDA emerges as a commonly
employed technique. Attempting classification based on a single feature may result in
overlapping and insufficient discrimination between the classes.

To overcome such overlap in the classification process, we would otherwise have to keep increasing the number of features.

Let's assume we have to classify two different classes, each with its own set of data points in a 2-dimensional plane, as shown in the figure below:
Nevertheless, drawing an efficient straight line in a 2-dimensional plane to separate these
data points proves to be impossible. However, by employing Linear Discriminant Analysis,
we can achieve dimensional reduction from a 2-dimensional plane to a 1-dimensional
plane. This technique allows us to enhance the separability between multiple classes.

Working of LDA

Linear Discriminant Analysis serves as a dimensionality reduction technique in machine learning, allowing the transformation of a 2-dimensional or 3-dimensional graph into a 1-dimensional plane.

For instance, let's consider a scenario with two classes represented on a 2-dimensional
plane with an X −Y axis, and the objective is to achieve efficient classification. As
demonstrated in the example, LDA facilitates the creation of a straight line that distinctly
separates the two classes of data points. In this context, LDA utilizes the X −Y axis to
establish a new axis by employing a straight line for segregation and subsequently
projecting the data onto this new axis.

Hence, we can maximize the separation between these classes and reduce the 2-D plane
into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:

o It maximizes the distance between means of two classes.

o It minimizes the variance within the individual class.

Using the above two conditions, LDA generates a new axis in such a way that it can
maximize the distance between the means of the two classes and minimizes the variation
within each class.

In other words, we can say that the new axis will increase the separation between the data
points of the two classes and plot them onto the new axis.

Procedure:

1. Compute Class Means: Calculate the mean vectors for each class.

2. Compute Scatter Matrices: Compute the within-class scatter matrix ( SW ) and the
between-class scatter matrix ( S B).

3. Eigenvalue Decomposition: Find the eigenvectors and eigenvalues of S_W⁻¹S_B.

4. Select Discriminant Vectors: Choose the top k eigenvectors as discriminant vectors.
5. Project Data: Project the original data onto the subspace defined by the
discriminant vectors.

Example

Suppose we have the following dataset:

Class 1:

 Data points X 1 =(x 1 , x 2)={(4 ,2),(2 , 4),(2 , 3),(3 , 6),( 4 , 4)}

Class 2:

 Data points X 2 =(x 1 , x 2 )={(9 , 10) ,(6 ,8),(9 , 5),(8 , 7), (10 ,8) }

Step 1: Compute Class Means

Calculate the mean vector for each class:

Mean for Class 1: x̄₁ = ((4+2+2+3+4)/5, (2+4+3+6+4)/5) = (3, 3.8)

Mean for Class 2: x̄₂ = ((9+6+9+8+10)/5, (10+8+5+7+8)/5) = (8.4, 7.6)
Step 2: Compute Scatter Matrices

I. Calculate the within-class scatter matrix S_W = S₁ + S₂,

where S₁ is the class 1 covariance matrix and S₂ is the class 2 covariance matrix:

S₁ = (1/(n₁−1)) · (X₁ − x̄₁)ᵀ (X₁ − x̄₁)

S₂ = (1/(n₂−1)) · (X₂ − x̄₂)ᵀ (X₂ − x̄₂)

(X₁ − x̄₁) = [ 4−3  2−3.8 ]   [  1  −1.8 ]
             [ 2−3  4−3.8 ]   [ −1   0.2 ]
             [ 2−3  3−3.8 ] = [ −1  −0.8 ]
             [ 3−3  6−3.8 ]   [  0   2.2 ]
             [ 4−3  4−3.8 ]   [  1   0.2 ]

(X₂ − x̄₂) = [  9−8.4  10−7.6 ]   [  0.6   2.4 ]
             [  6−8.4   8−7.6 ]   [ −2.4   0.4 ]
             [  9−8.4   5−7.6 ] = [  0.6  −2.6 ]
             [  8−8.4   7−7.6 ]   [ −0.4  −0.6 ]
             [ 10−8.4   8−7.6 ]   [  1.6   0.4 ]

S₁ = (1/4) · [  4   −1  ] = [  1     −0.25 ]
             [ −1   8.8 ]   [ −0.25   2.2  ]

Similarly,

S₂ = [  2.3   −0.05 ]
     [ −0.05   3.3  ]

Therefore,

S_W = [  1     −0.25 ] + [  2.3   −0.05 ] = [  3.3  −0.3 ]
      [ −0.25   2.2  ]   [ −0.05   3.3  ]   [ −0.3   5.5 ]
II. Calculate the between-class scatter matrix S_B:

S_B = (x̄₁ − x̄₂) · (x̄₁ − x̄₂)ᵀ

(x̄₁ − x̄₂) = [ 3   ] − [ 8.4 ] = [ −5.4 ]
             [ 3.8 ]   [ 7.6 ]   [ −3.8 ]

S_B = [ −5.4 ] · [ −5.4  −3.8 ] = [ 29.16  20.52 ]
      [ −3.8 ]                    [ 20.52  14.44 ]

Step 3: Solve for Eigenvalues and Eigenvectors

Solve the generalized eigenvalue problem for S_W⁻¹S_B to find the eigenvalues and corresponding eigenvectors.

S_W⁻¹S_B = [ 9.2213  6.489  ]
           [ 4.2339  2.9794 ]
Step 4: Sort Eigenvalues and Choose Top Eigenvectors
Sort the eigenvalues in descending order and choose the top k (in this case, k = 1) eigenvector(s).

|S_W⁻¹S_B − λI| = | 9.2213−λ   6.489    | = 0
                  | 4.2339     2.9794−λ |

λ² − 12.2007λ = 0

λ = 0, 12.2007

Eigenvector for λ = 0: v₁ = [−0.5755, 0.8178]ᵀ

Eigenvector for λ = 12.2007: v₂ = [0.9088, 0.4173]ᵀ

The eigenvector corresponding to the largest eigenvalue is v₂ = [0.9088, 0.4173]ᵀ.

Step 5: Project Data onto New Subspace

Project the original data onto the selected eigenvector(s) to obtain the transformed 1-
dimensional data. This leads to good separability between the two classes

The above steps illustrate how LDA can be applied to this simple dataset, resulting in a 1-
dimensional subspace that maximizes the separation between the two classes. In a real-
world scenario with higher-dimensional data, the same principles apply, and LDA aims to
find the linear combinations of features that best discriminate between different classes.
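The hand computation above can be verified numerically; a minimal NumPy sketch of the same steps:

```python
import numpy as np

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

# Step 1: class means.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # (3, 3.8) and (8.4, 7.6)

# Step 2: within-class scatter (sum of class covariances) and between-class scatter.
SW = np.cov(X1.T) + np.cov(X2.T)               # [[3.3, -0.3], [-0.3, 5.5]]
d = (m1 - m2).reshape(-1, 1)
SB = d @ d.T                                   # [[29.16, 20.52], [20.52, 14.44]]

# Steps 3-4: eigen-decomposition of SW^-1 SB; keep the dominant eigenvector.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w = eigvecs[:, np.argmax(eigvals.real)].real   # proportional to [0.9088, 0.4173]
```

Because S_B is a rank-one outer product, one eigenvalue is exactly zero, and the discriminant direction is determined only up to scale and sign.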

Applications:

 Face Recognition: Reduce face image data to improve classification accuracy.

 Biomedical Studies: Identify relevant features for classifying different medical conditions.

Independent Component Analysis (ICA)

Overview: ICA is an unsupervised method that separates a multivariate signal into additive, independent components. It assumes that the observed data is a linear combination of independent sources and aims to recover these sources. ICA is particularly valuable when the sources are non-Gaussian and statistically independent.

The typical scenario used to illustrate Independent Component Analysis (ICA) is known as
the "Cocktail Party Problem." In its simplest form, consider two individuals engaged in a
conversation at a cocktail party, analogous to the red and blue speakers in the scenario
described. In this setup, two microphones are strategically positioned near the partygoers,
resembling the placement of the purple and pink microphones. Both microphones capture
the voices of both individuals at varying volumes determined by their proximity to the
microphones. Essentially, two audio files are recorded, each containing a mix of the two
speakers.

Figure The simplest version of the “Cocktail Party Problem”.

The challenge presented is how to effectively separate the voices within each file, aiming to
acquire distinct recordings for each speaker. Independent Component Analysis (ICA) proves
to be a straightforward solution to this problem. ICA transforms a set of vectors into a
maximally independent set. In the context of the "Cocktail Party Problem," ICA works to
convert the two mixed audio recordings (depicted by the purple and pink waveforms) into
two separate, unmixed recordings, each representing the individual speaker (illustrated by
the blue and red waveforms). It's noteworthy that the number of inputs and outputs
remains the same in this process. Additionally, since the outputs are mutually independent,
there isn't an apparent method for discarding components, as seen in Principal Component
Analysis (PCA).

Figure: Converting a mixed signal into independent components using ICA

Procedure:

1. Whitening: Decorrelate and standardize the data.

2. Maximize Independence: Find a transformation matrix that maximizes the independence of the components.

3. Orthogonality: Ensure that the components are orthogonal.

4. Projection: Project the original data onto the subspace defined by the independent
components.
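Step 1 (whitening) can be sketched directly with NumPy; the independence-maximization step is usually delegated to a library implementation such as FastICA. The two source signals and the mixing matrix below are invented for illustration:

```python
import numpy as np

t = np.linspace(0, 8, 1000)

# Two statistically independent, non-Gaussian sources (sine and sawtooth).
s1 = np.sin(2 * t)
s2 = 2 * (t % 1) - 1
S = np.c_[s1, s2]

# Observed signals: an unknown linear mix of the sources ("two microphones").
A = np.array([[1.0, 0.5], [0.4, 1.0]])
X = S @ A.T

# Step 1 - Whitening: center, then rotate/scale so the covariance is identity.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc.T)
eigvals, eigvecs = np.linalg.eigh(cov)
whitener = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
Z = Xc @ whitener.T   # decorrelated, unit-variance signals
```

After whitening, only a rotation remains to be found, which is what the independence-maximization step (e.g., FastICA) solves.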

Applications:
 Speech Signal Separation: Separate different speakers in a mixed audio signal.

 Neuroimaging: Extract independent sources from EEG or fMRI data.

Comparison and Selection Criteria

PCA vs. LDA

 Supervision: PCA is unsupervised, while LDA is supervised and considers class labels.

 Objective: PCA focuses on maximizing variance, while LDA aims to maximize the
separation between classes.

 Use Case: PCA is suitable for tasks like noise reduction and visualization, while LDA
is effective in classification problems.

PCA vs. ICA

 Independence: PCA captures orthogonal components, while ICA seeks statistically independent components.

 Supervision: Both PCA and ICA are unsupervised.

 Assumption: PCA relies on second-order statistics and implicitly assumes Gaussian-distributed data, whereas ICA exploits non-Gaussianity and does not assume a specific distribution.

LDA vs. ICA

 Objective: LDA focuses on maximizing class separability for classification tasks, while ICA is an unsupervised approach designed for blind source separation by extracting statistically independent components from mixed signals.
Chapter 3 Introduction to Classification
and Classification Algorithms
3.1 Introduction

Classification, a fundamental concept in machine learning, plays a pivotal role in solving a wide range of problems by categorizing input data into distinct classes or labels. This powerful technique is employed in various domains, from image recognition and natural language processing to medical diagnosis and fraud detection. In this exploration, we delve into the essence of classification, examining its definition, significance, and the general approach employed to tackle classification problems.

At its core, classification is a supervised machine learning task where the primary objective
is to assign predefined categories or labels to input data based on its features. The process
involves training a model on a labeled dataset, where instances are associated with known
class labels. The goal is to develop a predictive model that can generalize well to unseen
instances, thereby accurately assigning appropriate labels. The significance of classification
lies in its ability to automate decision-making processes, enabling systems to categorize
and interpret data in a manner analogous to human reasoning.

Significance of Classification: The significance of classification is pervasive across diverse domains. In healthcare, classification models contribute to disease diagnosis based on
patient symptoms and medical test results. In finance, these models aid in fraud detection
by distinguishing between legitimate and suspicious transactions. Image classification is a
cornerstone of computer vision applications, enabling machines to recognize and
categorize objects in images. The versatility of classification underscores its applicability in
addressing complex real-world challenges, making it a cornerstone of modern machine
learning.

3.2 General Approach to Classification: Successfully tackling a classification problem involves a systematic and well-defined approach. The following steps outline a general framework for classification:

1. Problem Definition: Clearly articulate the classification problem at hand. Define the
classes or categories that the algorithm should predict, setting the foundation for
subsequent steps.
2. Data Collection and Preparation: Gather a labeled dataset where each instance is
paired with a corresponding class label. This dataset is then divided into two subsets:
training data, used to train the model, and testing data, reserved for evaluating the model's
performance. Data preprocessing techniques are applied to handle missing values,
normalize features, and encode categorical variables.

3. Feature Selection and Extraction: Identify relevant features that contribute to the
prediction task. Optionally, perform feature extraction or dimensionality reduction
techniques to enhance model efficiency and interpretability.

4. Choose a Classification Algorithm: Select an appropriate classification algorithm based on the characteristics of the data, the problem's nature, and the size of the dataset. Popular
algorithms include Decision Trees, Support Vector Machines, Naive Bayes, and Neural
Networks.

5. Data Splitting: Divide the dataset into training and testing sets, ensuring that the model
is evaluated on unseen data. This step guards against overfitting and provides a realistic
assessment of the model's generalization capabilities.

6. Model Training: Feed the training data into the chosen classification algorithm, allowing
the model to learn patterns and relationships between features and class labels.

7. Model Evaluation: Assess the model's performance using the testing dataset. Common
evaluation metrics such as accuracy, precision, recall, and the confusion matrix offer
insights into the model's effectiveness. Fine-tune hyperparameters if necessary to optimize
performance.

8. Prediction and Deployment: Once satisfied with the model's performance, deploy it to
make predictions on new, unseen data. Continuous monitoring and potential updates are
essential to ensure the model remains relevant as new data becomes available.
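Steps 5-7 can be illustrated with a minimal, library-free sketch: a deterministic shuffle-and-split followed by a trivial majority-class baseline. The tiny dataset is invented for illustration:

```python
import random

# Toy labeled dataset: (feature, class) pairs.
data = [(0.1, 0), (0.4, 0), (0.35, 0), (0.8, 1), (0.9, 1),
        (0.7, 1), (0.2, 0), (0.85, 1), (0.15, 0), (0.75, 1)]

# Step 5 - Data splitting: shuffle, then hold out 20% for testing.
random.Random(42).shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Step 6 - "Training": a majority-class baseline learned from the training set.
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)

# Step 7 - Evaluation: accuracy on the held-out test set.
accuracy = sum(y == majority for _, y in test) / len(test)
```

A real workflow would replace the majority baseline with one of the algorithms from step 4, but the split-train-evaluate skeleton stays the same.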

Classification stands as a foundational pillar in machine learning, empowering systems to make informed decisions based on learned patterns. Its broad applicability and
effectiveness make it an invaluable tool in diverse fields. By following a systematic
approach to classification, practitioners can navigate the intricacies of problem-solving,
unleashing the full potential of this powerful machine learning technique. As technology
advances and datasets grow in complexity, the role of classification will continue to evolve,
contributing to the development of intelligent systems capable of addressing increasingly
sophisticated challenges.

3.3 Nearest Neighbour Methods: Learning from Proximity


In the bustling atmosphere of a nightclub, where the rhythm of the music invites you to
dance, the scenario unfolds an analogy that resonates with one of the fundamental concepts
in machine learning—nearest neighbor methods. Imagine you're on the dance floor,
unfamiliar with the dance moves for the current song. In such a situation, observing and
imitating those around you becomes a natural strategy.

The Dance Floor Analogy:

Consider the dance floor as your data space, where each dancer represents a data point.
When you decide to dance, you might instinctively choose the person closest to you and
mimic their moves. This initial approach reflects the essence of the nearest neighbour
method: learning from the proximity of similar instances. However, acknowledging that not
everyone may be an expert dancer, you expand your observation to a few more people,
trying to discern a consensus. This act of observing multiple nearby dancers and adapting
your moves based on what most of them are doing mirrors the core principle of nearest
neighbour methods in machine learning.

Nearest Neighbour Methods in Machine Learning:

In the realm of machine learning, nearest neighbour methods are a family of algorithms
that operate based on the idea of proximity. When faced with a task where a model is not
readily available, and the underlying structure of the data is unknown, these methods
leverage the similarity of data points to make predictions or classifications.

FIGURE 5: The nearest neighbors decision boundary with left: one neighbor and right:
two neighbors

Key Characteristics of Nearest Neighbour Methods:

1. Data-Driven Decision Making:


 Nearest neighbour methods make decisions based on the similarity of
instances in the feature space. The assumption is that similar instances share
similar characteristics or behaviours.

2. Localized Decision Boundaries:

 Rather than defining global decision boundaries, nearest neighbour methods create localized decision boundaries around individual data points. This allows for flexibility in capturing complex and non-linear patterns within the data.

3. K-Nearest Neighbours (KNN):

 A specific implementation of nearest neighbour methods is the K-Nearest Neighbours algorithm. In KNN, a data point is classified based on the majority class of its K nearest neighbours in the feature space.

4. Distance Metrics:

 The concept of proximity is quantified using distance metrics, such as Euclidean distance or Manhattan distance. These metrics measure the dissimilarity between data points.

Application Scenarios:

1. Classification:

 Nearest neighbour methods are commonly used for classification tasks. A new data point is assigned the class label of the majority of its nearest neighbours.

2. Regression:

 In regression problems, nearest neighbour methods predict the target value for a data point based on the average or weighted average of its nearest neighbours.

3. Anomaly Detection:

 Nearest neighbour methods can be employed for anomaly detection by identifying data points that deviate significantly from their neighbours.

Strengths and Considerations:

 Strengths:

 Nearest neighbour methods are simple, intuitive, and effective at capturing complex underlying structure, though distance measures become less informative in very high-dimensional spaces.

 Considerations:

 They may be computationally expensive, especially with large datasets, as they require pairwise comparisons between data points.

k-Nearest Neighbors (KNN)

The k-Nearest Neighbors (KNN) algorithm is a simple yet powerful supervised machine
learning algorithm used for both classification and regression tasks. In KNN, the prediction
for a new data point is based on the majority class (for classification) or the average (for
regression) of its k nearest neighbors in the feature space. The "nearest" is determined by a
distance metric, commonly Euclidean distance.

In the realm of machine learning, k-Nearest Neighbors (KNN) is a straightforward and intuitive algorithm used for classification and regression tasks. Let's delve into the wine example to comprehend how KNN operates in the context of the chemical components rutin and myricetin.

Dataset: Imagine we have data points representing red and white wines based on the levels
of rutin and myricetin. Each wine sample is a data point in a two-dimensional space.

Wine Type Rutin Level Myricetin Level

Red 3 4

White 2 3
In the above figure, the x-axis represents rutin levels and the y-axis represents myricetin
levels, the data points for red and white wines can be plotted. This graph visually
represents the trends in data for the two types of wines.

The "K" in KNN refers to the number of nearest neighbors considered in the decision-
making process. In simpler terms, when determining the class of a new data point, KNN
looks at its closest neighbors to make a prediction. The choice of " k " is a crucial parameter
that influences the algorithm's performance.

Classifying a New Wine: Suppose we introduce a new glass of wine into the dataset, and we want to determine whether it is red or white. Let's set k = 5 for this scenario.
To classify the new wine, we identify its five nearest neighbors on the chart. The
classification is then determined by the majority of votes from these neighbors. If four out
of the five nearest neighbors are red wines, the new point would be classified as a red wine.

Applications of K-Nearest Neighbors (KNN) in Machine Learning:


K-Nearest Neighbors (KNN) is a versatile algorithm widely employed in various machine
learning applications due to its simplicity and effectiveness. Here are some notable use
cases where KNN finds application:

1. Recommendation Engine:

 A recommendation engine suggests products or services to users based on their preferences or historical interactions.
 In recommendation systems, KNN is utilized to identify items or products that
are similar to those previously liked or interacted with by a user. It is
particularly effective for collaborative filtering.

2. Concept Search:

 Concept search involves finding semantically similar documents and classifying documents containing similar topics.

 With the exponential growth of data generating tons of documents, KNN is employed to extract key concepts from sets of documents. It helps in identifying and grouping documents with similar topics or concepts.

3. Missing Data Imputation:

 Datasets often contain missing values, which can be problematic for machine
learning models or analysis.

 KNN is an effective algorithm for imputing missing values through a process called "nearest neighbor imputation." It predicts missing values based on the values of their nearest neighbors in the feature space.

4. Pattern Recognition:

 Pattern recognition involves identifying patterns in data, whether in text or images.

 KNN is applied in pattern recognition tasks such as handwritten digit recognition, credit card usage pattern detection, and image recognition. It helps classify or categorize data based on the patterns identified in the training set.

5. Banking:

 In the banking and financial sector, predictive modeling is crucial for decision-making.
 KNN is widely used to predict the risk associated with giving a loan to a
customer. It helps financial institutions assess the creditworthiness of
customers by considering the similarity of their financial attributes to those
of known cases.

Benefits and Considerations:

 Benefits:

 Simple and intuitive algorithm.

 Effective in capturing non-linear relationships in data.

 Versatile in handling both classification and regression tasks.

 Considerations:

 Computationally intensive for large datasets.

 Sensitive to irrelevant or redundant features.

 The choice of the "K" parameter influences the algorithm's performance.

Computing K-Nearest Neighbor Distance Metrics:

K-Nearest Neighbors (KNN) relies on distance metrics to determine the proximity of data
points. Different distance metrics are suitable for various types of data and scenarios. Let's
explore some common distance metrics used in KNN:

1. Hamming Distance:

 Hamming distance is utilized in text data, particularly for calculating the distance between two binary vectors (binary strings). It measures the number of positions at which corresponding bits differ.

Hamming Distance

d(x, y) = (Number of differing bits) / (Total number of bits)

2. Euclidean Distance:

 Euclidean distance is the most popular distance measure used for finding the
distance between two real-valued vectors (e.g., integers or floats). It
measures the straight-line distance between two points in a
multidimensional space.
Euclidean Distance

d(x, y) = ( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )^(1/2)

3. Manhattan Distance:

 Manhattan distance, also known as "Taxicab" or "City Block" distance, calculates the distance between two real-valued vectors by summing the absolute differences along each dimension.

Manhattan Distance

d(x, y) = Σᵢ₌₁ⁿ |xᵢ − yᵢ|

4. Minkowski Distance:

 Minkowski distance is a generalized form that includes Euclidean distance and Manhattan distance as special cases. It introduces a parameter p that can be adjusted to calculate different distance measures.

Minkowski Distance

d(x, y) = ( Σᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ )^(1/p)

For Euclidean distance, p=2, and for Manhattan distance, p=1

Choosing the appropriate distance metric is crucial in the KNN algorithm, as it influences
the model's performance and ability to capture relationships within the data. The selection
depends on the nature of the features and the characteristics of the problem at hand.
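The metrics above can be written compactly; a minimal sketch in plain Python, where Euclidean and Manhattan fall out of Minkowski for p = 2 and p = 1:

```python
def minkowski(x, y, p):
    # Generalized distance; p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def hamming(x, y):
    # Number of positions at which two equal-length bit strings differ.
    return sum(a != b for a, b in zip(x, y))

x, y = (1, 2, 3), (4, 0, 3)
euclidean = minkowski(x, y, 2)   # sqrt(9 + 4 + 0)
manhattan = minkowski(x, y, 1)   # 3 + 2 + 0 = 5
bits = hamming("1011", "1001")   # differs only in the third position -> 1
```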

Steps in KNN Algorithm:

1. Step 1 - Choose the Value of k :

 Decide the number of neighbors (k) to consider when making predictions. A common practice is to experiment with different values of k to find the one that works best for the given problem.

2. Step 2 - Calculate Distance:

 For each data point in the dataset, calculate its distance to the new data point using a distance metric (e.g., Euclidean distance). The distance is computed in the multi-dimensional feature space.
3. Step 3 - Identify Neighbors:

 Identify the k data points with the shortest distances to the new data point.
These k data points are considered the "nearest neighbors."

4. Step 4 - Make Prediction:

 For classification tasks, assign the class label that is most common among the
k nearest neighbors to the new data point.

 For regression tasks, predict the average or weighted average of the target
values of the k nearest neighbors.

Example:

Let's consider a simple example for a binary classification problem with two features (X1
and X2) and two classes (Class 0 and Class 1).

Data Point X1 X2 Class

A 1 1 0

B 2 0 0

C 2 1 1

D 3 3 1

E 4 2 0

Now, suppose we have a new data point with features X1=3 and X2=2, and we want to
classify it using KNN with k=3.

1. Calculate Distances:

 Calculate the Euclidean distance between the new data point and each
existing data point in the dataset.

Distance to A = √((3−1)² + (2−1)²) = √5

Distance to B = √((3−2)² + (2−0)²) = √5

Distance to C = √((3−2)² + (2−1)²) = √2

Distance to D = √((3−3)² + (2−3)²) = 1

Distance to E = √((3−4)² + (2−2)²) = 1

2. Identify Neighbors:

 Select the three data points with the shortest distances: D , E , and C .

3. Make Prediction (Classification):

 Since two out of the three nearest neighbors belong to Class 1, we predict
that the new data point belongs to Class 1.

In this example, the KNN algorithm predicts that the new data point with features X1=3 and
X2=2 belongs to Class 1 based on the majority class among its three nearest neighbors (
D , E , and C ).
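The worked example above can be reproduced with a short script; a minimal sketch in plain Python (point names follow the table):

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# The dataset from the table above: name -> ((X1, X2), class).
points = {
    "A": ((1, 1), 0), "B": ((2, 0), 0), "C": ((2, 1), 1),
    "D": ((3, 3), 1), "E": ((4, 2), 0),
}
new_point, k = (3, 2), 3

# Steps 2-3: rank all points by distance and keep the k closest.
ranked = sorted(points.items(), key=lambda kv: dist(kv[1][0], new_point))
neighbors = ranked[:k]                          # D, E, and C

# Step 4: majority vote among the neighbors' class labels.
votes = [cls for _, (_, cls) in neighbors]
prediction = max(set(votes), key=votes.count)   # Class 1
```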

K-Nearest Neighbor Pros

1. It’s simple to implement.

2. It’s flexible to different feature/distance choices.

3. It naturally handles multi-class cases.

4. It can do well in practice with enough representative data.

K-Nearest Neighbor Cons

1. We need to determine the value of parameter “K” (number of nearest neighbors).

2. Computation cost is quite high because we need to compute the distance of each
query instance to all training samples.

3. It requires a large storage of data.

4. We must have a meaningful distance function.


Support Vector Machine


Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. SVM aims to find the optimal hyperplane in a feature
space that separates different classes while maximizing the margin between them. Support
Vector Machine is a system for efficiently training linear learning machines in kernel-
induced feature spaces, while respecting the insights of generalisation theory and
exploiting optimisation theory.

In mathematical terms, a hyperplane is defined as a subspace with a dimension one less than that of the ambient space. This implies that in a one-dimensional ambient space, the hyperplane reduces to a point. Likewise, in a two-dimensional ambient space, the hyperplane manifests as a line, and so forth. When a hyperplane is employed to separate two classes, the data points associated with Class A are situated on one side of the hyperplane, while those affiliated with Class B occupy the opposite side. Consequently, in a one-dimensional space, the hyperplane that serves as a separator takes the form of a point.

Figure: Separating Hyperplane in 1D (A Point)

In two dimensions, the hyperplane that separates class A and class B is a line:
Figure: Separating Hyperplane in 2D (A Line)

And in three dimensions, the separating hyperplane is a plane:

Figure: Separating Hyperplane in 3D (A Plane)

Similarly, in N dimensions the separating hyperplane will be an (N−1)-dimensional subspace.

If you take a closer look, for the two dimensional space example, each of the following is a
valid hyperplane that separates the classes A and B:
Figure: Separating Hyperplanes

So how do we decide which hyperplane is the most optimal? Enter maximum margin
classifier.

Maximum Margin Classifier

The optimal hyperplane is the one that separates the two classes while maximizing the
margin between them. And a classifier that functions thus is called a maximum margin
classifier.
Figure: Maximum Margin Classifier
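The margin being maximized can be made concrete: for a hyperplane w·x + b = 0, the geometric margin of a labeled point (x, y) is y(w·x + b)/‖w‖, and the classifier's margin is the smallest such value over the training set. A minimal NumPy sketch with an invented hyperplane and points:

```python
import numpy as np

# Hand-picked separating hyperplane: x1 = 1 (w = [1, 0], b = -1).
w, b = np.array([1.0, 0.0]), -1.0

# Labeled points: class +1 to the right of the hyperplane, -1 to the left.
X = np.array([[3.0, 0.0], [4.0, 1.0], [0.0, 0.0], [-1.0, 2.0]])
y = np.array([1, 1, -1, -1])

# Geometric margin of each point, then the margin of the classifier.
margins = y * (X @ w + b) / np.linalg.norm(w)
margin = margins.min()   # distance of the closest point to the hyperplane
```

The maximum margin classifier is the choice of w and b that makes this minimum as large as possible; all margins being positive confirms the hyperplane separates the classes.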

Hard and Soft Margins

We considered a super simplified example where the classes were perfectly separable, and the maximum margin classifier was a good choice.

But what if your data points were distributed like this? The classes are still perfectly
separable by a hyperplane, and the hyperplane that maximizes the margin will look like
this:
Figure: Is the Maximum Margin Classifier Optimal?

But do you see the problem with this approach? Well, it still achieves class separation.
However, this is a high variance model that is, perhaps, trying to fit the class A points too
well.

Notice, however, that the margin does not have any misclassified data point. Such a
classifier is called a hard margin classifier.

Take a look at this classifier instead. Won't such a classifier perform better? This is a
substantially lower variance model that would do reasonably well on classifying both
points from class A and class B.
Figure: Linear Support Vector Classifier

Notice that we have a misclassified data point inside the margin. Such a classifier that
allows minimal misclassifications is a soft margin classifier.
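The hard/soft margin trade-off can be sketched with scikit-learn's SVC (a library the source does not use, so treat this as an illustrative assumption): the C parameter controls margin softness, with a very large C approximating a hard margin and a small C allowing points inside the margin. The dataset below is synthetic.

```python
# Hedged sketch: margin softness via scikit-learn's SVC and its C parameter.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs standing in for classes A and B
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([0] * 50 + [1] * 50)

hard = SVC(kernel="linear", C=1e6).fit(X, y)   # approximates a hard margin
soft = SVC(kernel="linear", C=0.1).fit(X, y)   # soft margin, lower variance

# A softer margin typically rests on more support vectors
print(hard.n_support_.sum(), soft.n_support_.sum())
```

A smaller C widens the margin and lets more points act as support vectors, which is exactly the lower-variance behaviour argued for above.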

Support Vector Classifier

The soft margin classifier we have is a linear support vector classifier. The points are
separable by a line (or a linear equation). If you’ve been following along so far, it should be
clear what support vectors are and why they are called so.

Each data point is a vector in the feature space. The data points that are closest to the
separating hyperplane are called support vectors because they support or aid the
classification.

It's also interesting to note that if you remove a single data point or a subset of data points that are not support vectors, the separating hyperplane does not change. But if you remove one or more support vectors, the hyperplane changes.

In the examples so far, the data points were linearly separable, so we could fit a soft margin classifier with the least possible error. But what if the data points were distributed like this?
Figure: Non-linearly Separable Data

Support Vector Machines and the Kernel Trick

Here’s a summary of what we’d do:

 Problem: The data points are not linearly separable in the original feature space.

 Solution: Project the points onto a higher dimensional space where they are linearly separable.

But projecting the points onto a higher dimensional feature space requires us to map the data points from the original feature space to the higher dimensional space.

This recomputation comes with non-negligible overhead, especially when the space that we want to
project onto is of much higher dimensions than the original feature space. Here's where the kernel trick
comes into play.
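As a hedged sketch of the kernel trick in practice (using scikit-learn, which the source does not reference): make_circles produces two concentric rings that no line can separate, while an RBF kernel separates them without ever materializing the higher dimensional feature space.

```python
# Hedged sketch: the kernel trick on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)   # a line cannot separate rings
rbf = SVC(kernel="rbf").fit(X, y)         # kernel trick handles it implicitly

print(linear.score(X, y), rbf.score(X, y))
```

The RBF kernel computes similarities as if the points had been projected into a much higher dimensional space, but only kernel evaluations are ever performed.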

Mathematically, the support vector classifier can be represented by the following equation [1]:
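The referenced equation did not survive extraction; a standard form of the linear support vector classifier, given here as a hedged reconstruction, expresses the decision function as a sum over the set of support vectors S, and the kernel trick then replaces the inner product with a kernel function:

```latex
% Linear support vector classifier: only support vectors i \in S contribute
f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle

% Kernel trick: replace the inner product with a kernel K(\cdot, \cdot)
f(x) = \beta_0 + \sum_{i \in S} \alpha_i \, K(x, x_i)
```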
Ensemble Learning

Ensemble Learning is a technique that creates multiple models and then combines them to produce improved results. Ensemble learning usually produces more accurate solutions than a single model would.

 Ensemble learning methods are applied to regression as well as classification.

o Ensemble learning for regression creates multiple regressors, i.e., multiple regression models such as linear, polynomial, etc.

o Ensemble learning for classification creates multiple classifiers, i.e., multiple classification models such as logistic regression, decision trees, KNN, SVM, etc.

Figure 1: Ensemble learning view

Which components to combine?

• different learning algorithms

• same learning algorithm trained in different ways

• same learning algorithm trained the same way

There are two steps in ensemble learning:

1. Multiple machine learning models are generated using the same or different machine learning algorithms. These are called "base models".

2. The prediction is performed on the basis of the base models.

Techniques/Methods in ensemble learning


Voting, Error-Correcting Output Codes, Bagging: Random Forest Trees, Boosting: Adaboost,
Stacking.

3.1 Model Combination Schemes - Combining Multiple Learners

We discussed many different learning algorithms in the previous chapters. Though these
are generally successful, no one single algorithm is always the most accurate. Now, we are
going to discuss models composed of multiple learners that complement each other so that
by combining them, we attain higher accuracy.

There are also different ways the multiple base-learners can be combined to generate the final output:

Figure2: General Idea - Combining Multiple Learners


Multiexpert combination

Multiexpert combination methods have base-learners that work in parallel. These methods
can in turn be divided into two:

 In the global approach, also called learner fusion, given an input, all base-learners generate
an output and all these outputs are used.

Examples are voting and stacking.

 In the local approach, or learner selection, for example, in mixture of experts, there is a
gating model, which looks at the input and chooses one (or very few) of the learners as
responsible for generating the output.

Multistage combination

Multistage combination methods use a serial approach where the next base-learner is
trained with or tested on only the instances where the previous base-learners are not
accurate enough. The idea is that the base-learners (or the different representations they
use) are sorted in increasing complexity so that a complex base-learner is not used (or its
complex representation is not extracted) unless the preceding simpler base-learners are
not confident.

An example is cascading.
Let us say that we have L base-learners. We denote by d_j(x) the prediction of base-learner M_j given the arbitrary dimensional input x. In the case of multiple representations, each M_j uses a different input representation x_j. The final prediction is calculated from the predictions of the base-learners:

y = f(d_1, d_2, ..., d_L | Φ)

where f(·) is the combining function with Φ denoting its parameters.


Figure 1: Base-learners are d j and their outputs are combined using f (·). This is for a single
output; in the case of classification, each base-learner has K outputs that are separately
used to calculate y i, and then we choose the maximum. Note that here all learners observe
the same input; it may be the case that different learners observe different representations
of the same input object or event.

When there are K outputs, for each learner there are d_ji(x), i = 1, ..., K, j = 1, ..., L, and, combining them, we also generate K values, y_i, i = 1, ..., K. Then, for example in classification, we choose the class with the maximum y_i value:

Choose C_i if y_i = max_{k=1,...,K} y_k

3.2 Voting

The simplest way to combine multiple classifiers is by voting, which corresponds to taking a
linear combination of the learners, Refer above figure 1.

y_i = Σ_j w_j d_ji, where w_j ≥ 0 and Σ_j w_j = 1

This is also known as ensembles and linear opinion pools. In the simplest case, all learners
are given equal weight and we have simple voting that corresponds to taking an average.
Still, taking a (weighted) sum is only one of the possibilities and there are also other
combination rules, as shown in table 1. If the outputs are not posterior probabilities, these
rules require that outputs be normalized to the same scale.

Table 1 - Classifier combination rules


An example of the use of these rules is shown in table 2, which demonstrates the effects of
different rules. Sum rule is the most intuitive and is the most widely used in practice.
Median rule is more robust to outliers; minimum and maximum rules are pessimistic and
optimistic, respectively. With the product rule, each learner has veto power; regardless of
the other ones, if one learner has an output of 0, the overall output goes to 0. Note that after
the combination rules, yi do not necessarily sum up to 1.

Table 2: Example of combination rules on three learners and three classes
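The combination rules can be sketched in NumPy; the numbers below are illustrative stand-ins, not the values from Table 2.

```python
# Hedged sketch of the combination rules applied to the outputs of
# L=3 learners over K=3 classes (made-up posterior-like values).
import numpy as np

# d[j, i] = output of learner j for class i
d = np.array([
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.9, 0.1, 0.0],
])

rules = {
    "sum":     d.mean(axis=0),        # simple average, w_j = 1/L
    "median":  np.median(d, axis=0),  # robust to outliers
    "minimum": d.min(axis=0),         # pessimistic
    "maximum": d.max(axis=0),         # optimistic
    "product": d.prod(axis=0),        # each learner has veto power
}

for name, y in rules.items():
    print(f"{name:8s} y = {np.round(y, 3)}  ->  choose class {np.argmax(y)}")
```

Note the veto effect of the product rule: learner 3 outputs 0 for the last class, so its combined value is 0 regardless of the other learners, and after combination the y_i need not sum to 1.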

In weighted sum, d ji is the vote of learner j for class C i and w j is the weight of its vote.
Simple voting is a special case where all voters have equal weight, namely, w j=1/L . In
classification, this is called plurality voting where the class having the maximum number of
votes is the winner.

When there are two classes, this is majority voting where the winning class gets more than half of the votes. If the voters can also supply the additional information of how much they vote for each class (e.g., by the posterior probability), then after normalization, these can be used as weights in a weighted voting scheme. Equivalently, if d_ji are the class posterior probabilities, P(C_i | x, M_j), then we can just sum them up (w_j = 1/L) and choose the class with maximum y_i.
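A minimal NumPy sketch of the two voting schemes just described, with made-up votes and posteriors:

```python
# Hedged sketch: plurality voting on hard labels vs. averaging posteriors.
import numpy as np

# Hard votes (class labels) from L=3 classifiers for one instance
votes = np.array([1, 1, 2])
plurality = np.bincount(votes).argmax()   # class with the most votes

# Soft voting: average the class posteriors P(C_i | x, M_j) with w_j = 1/L
posteriors = np.array([
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
])
soft = posteriors.mean(axis=0).argmax()

print(plurality, soft)  # both pick class 1 here
```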

In the case of regression, simple or weighted averaging or median can be used to fuse the
outputs of base-regressors. Median is more robust to noise than the average.
Another possible way to find w j is to assess the accuracies of the learners (regressor or
classifier) on a separate validation set and use that information to compute the weights, so
that we give more weights to more accurate learners.

Voting schemes can be seen as approximations under a Bayesian framework with weights
approximating prior model probabilities, and model decisions approximating model-
conditional likelihoods.

P(C_i | x) = Σ_{all models M_j} P(C_i | x, M_j) P(M_j)

Simple voting corresponds to a uniform prior. If we have a prior distribution preferring simpler models, this will give larger weights to them. We cannot integrate over all models; we only choose a subset for which we believe P(M_j) is high, or we can have another Bayesian step and calculate P(M_j | X), the probability of a model given the sample, and sample highly probable models from this density.

Let us assume that d_j are iid with expected value E[d_j] and variance Var(d_j). Then, when we take a simple average with w_j = 1/L, the expected value and variance of the output are:

E[y] = E[(1/L) Σ_j d_j] = (1/L) · L · E[d_j] = E[d_j]

Var(y) = Var((1/L) Σ_j d_j) = (1/L²) Var(Σ_j d_j) = (1/L²) · L · Var(d_j) = (1/L) Var(d_j)

We see that the expected value does not change, so the bias does not change. But variance,
and therefore mean square error, decreases as the number of independent voters, L,
increases. In the general case,

Var(y) = (1/L²) Var(Σ_j d_j) = (1/L²) [Σ_j Var(d_j) + 2 Σ_j Σ_{i<j} Cov(d_i, d_j)]

which implies that if learners are positively correlated, variance (and error) increase. We
can thus view using different algorithms and input features as efforts to decrease, if not
eliminate, the positive correlation.
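The result above can be checked with a small simulation (synthetic Gaussian "predictions"; the distributional choice is an assumption for illustration): averaging L independent, identically distributed predictions leaves the mean unchanged but shrinks the variance by roughly 1/L.

```python
# Hedged simulation: averaging L iid learners keeps the bias, cuts the variance.
import numpy as np

rng = np.random.default_rng(0)
L, trials = 10, 100_000

d = rng.normal(loc=1.0, scale=2.0, size=(trials, L))  # L iid learners, Var = 4
y = d.mean(axis=1)                                    # simple average, w_j = 1/L

print(round(y.mean(), 2))   # ~1.0 : expected value (bias) unchanged
print(round(y.var(), 2))    # ~0.4 : Var(d_j)/L = 4/10
```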

3.3 Bagging

Bootstrap aggregating, often abbreviated as bagging, involves having each model in the
ensemble vote with equal weight. In order to promote model variance, bagging trains each
model in the ensemble using a randomly drawn subset of the training set. As an example,
the random forest algorithm combines random decision trees with bagging to achieve very
high classification accuracy.

The simplest method of combining classifiers is known as bagging, which stands for
bootstrap aggregating, the statistical description of the method. This is fine if you know
what a bootstrap is, but fairly useless if you don’t. A bootstrap sample is a sample taken
from the original dataset with replacement, so that we may get some data several times and
others not at all. The bootstrap sample is the same size as the original, and lots and lots of
these samples are taken: B of them, where B is at least 50, and could even be in the
thousands. The name bootstrap is more popular in computer science than anywhere else,
since there is also a bootstrap loader, which is the first program to run when a computer is
turned on. It comes from the nonsensical idea of ‘picking yourself up by your bootstraps,’
which means lifting yourself up by your shoelaces, and is meant to imply starting from
nothing.

Bootstrap sampling seems like a very strange thing to do. We’ve taken a perfectly good
dataset, mucked it up by sampling from it, which might be good if we had made a smaller
dataset (since it would be faster), but we still ended up with a dataset the same size. Worse,
we’ve done it lots of times. Surely this is just a way to burn up computer time without
gaining anything. The benefit of it is that we will get lots of learners that perform slightly
differently, which is exactly what we want for an ensemble method. Another benefit is that
estimates of the accuracy of the classification function can be made without complicated
analytic work, by throwing computer resources at the problem (technically, bagging is a
variance reducing algorithm; the meaning of this will become clearer when we talk about
bias and variance). Having taken a set of bootstrap samples, the bagging method simply
requires that we fit a model to each dataset, and then combine them by taking the output to
be the majority vote of all the classifiers. A NumPy implementation is shown next, and then
we will look at a simple example.

# Compute bootstrap samples (sampling indices with replacement)
samplePoints = np.random.randint(0, nPoints, (nPoints, nSamples))
classifiers = []

for i in range(nSamples):
    sample = []
    sampleTarget = []
    for j in range(nPoints):
        sample.append(data[samplePoints[j, i]])
        sampleTarget.append(targets[samplePoints[j, i]])

    # Train a classifier (here, a decision tree) on this bootstrap sample
    classifiers.append(self.tree.make_tree(sample, sampleTarget, features))

The example consists of taking the party data that was used to demonstrate the decision
tree, and restricting the trees to stumps, so that they can make a classification based on just
one variable

When we want to construct the decision tree to decide what to do in the evening, we start

by listing everything that we’ve done for the past few days to get a suitable dataset (here,
the last ten days):

The output of a decision tree that uses the whole dataset for this is not surprising: it takes
the two largest classes, and separates them. However, using just stumps of trees and 20
samples, bagging can separate the data perfectly, as this output shows:
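A self-contained variant of the same bagging-of-stumps idea can be sketched with scikit-learn on a synthetic dataset (the Party data and the book's tree class are not reproduced here, so the dataset and names below are assumptions):

```python
# Hedged sketch: bag decision stumps (max_depth=1) by majority vote
# over bootstrap samples, using scikit-learn trees and synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
nSamples, nPoints = 20, len(Xtr)
stumps = []
for _ in range(nSamples):
    idx = rng.integers(0, nPoints, nPoints)   # bootstrap sample, same size
    stumps.append(DecisionTreeClassifier(max_depth=1).fit(Xtr[idx], ytr[idx]))

# Combine by majority vote over all stumps
all_preds = np.array([s.predict(Xte) for s in stumps])
bagged = (all_preds.mean(axis=0) > 0.5).astype(int)

single = DecisionTreeClassifier(max_depth=1).fit(Xtr, ytr).predict(Xte)
print((bagged == yte).mean(), (single == yte).mean())
```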
3.3.1 RANDOM FORESTS

A random forest is an ensemble learning method where multiple decision trees are
constructed and then they are merged to get a more accurate prediction.

If there is one method in machine learning that has grown in popularity over the last few
years, then it is the idea of random forests. The concept has been around for longer than
that, with several different people inventing variations, but the name that is most strongly
attached to it is that of Breiman, who also described the CART algorithm in unit 2.

Figure 3: Example of random forest with majority voting


The idea is largely that if one tree is good, then many trees (a forest) should be better,
provided that there is enough variety between them. The most interesting thing about a
random forest is the ways that it creates randomness from a standard dataset. The first of
the methods that it uses is the one that we have just seen: bagging. If we wish to create a
forest then we can make the trees different by training them on slightly different data, so
we take bootstrap samples from the dataset for each tree. However, this isn’t enough
randomness yet. The other obvious place where it is possible to add randomness is to limit
the choices that the decision tree can make. At each node, a random subset of the features is
given to the tree, and it can only pick from that subset rather than from the whole set.

As well as increasing the randomness in the training of each tree, it also speeds up the
training, since there are fewer features to search over at each stage. Of course, it does
introduce a new parameter (how many features to consider), but the random forest does
not seem to be very sensitive to this parameter; in practice, a subset size that is the square
root of the number of features seems to be common. The effect of these two forms of
randomness is to reduce the variance without affecting the bias. Another benefit of this is
that there is no need to prune the trees. There is another parameter that we don’t know
how to choose yet, which is the number of trees to put into the forest. However, this is fairly
easy to pick if we want optimal results: we can keep on building trees until the error stops
decreasing.

Once the set of trees are trained, the output of the forest is the majority vote for
classification, as with the other committee methods that we have seen, or the mean
response for regression. And those are pretty much the main features needed for creating a
random forest. The algorithm is given next before we see some results of using the random
forest.
Algorithm

Here is an outline of the random forest algorithm.

1. The random forests algorithm generates many classification trees. Each tree is generated as
follows:

a) If the number of examples in the training set is N, take a sample of N examples at random -
but with replacement, from the original data. This sample will be the training set for
generating the tree.

b) If there are M input variables, a number m is specified such that at each node, m variables
are selected at random out of the M and the best split on these m is used to
split the node. The value of m is held constant during the generation of the various trees in
the forest.

c) Each tree is grown to the largest extent possible.

2. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes over all the trees in the forest.

The implementation of this is very easy: we modify the decision to take an extra parameter,
which is m, the number of features that should be used in the selection set at each stage. We
will look at an example of using it shortly as a comparison to boosting.
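The algorithm above maps directly onto scikit-learn's RandomForestClassifier (a hedged illustration, not the book's implementation): bootstrap sampling gives step (a), and max_features gives the per-node feature subset m of step (b).

```python
# Hedged sketch: the two sources of randomness as sklearn parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    max_features="sqrt",    # m = sqrt(M) features considered at each split
    bootstrap=True,         # each tree trains on a bootstrap sample
    random_state=0,
)
score = cross_val_score(forest, X, y, cv=5).mean()
print(score)
```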

Looking at the algorithm you might be able to see that it is a very unusual machine learning
method because it is embarrassingly parallel: since the trees do not depend upon each
other, you can both create and get decisions from different trees on different individual
processors if you have them. This means that the random forest can run on as many
processors as you have available with nearly linear speedup.

There is one more nice thing to mention about random forests, which is that with a little bit
of programming effort they come with built-in test data: the bootstrap sample will miss out
about 35% of the data on average, the so-called out-of-bootstrap examples. If we keep track
of these datapoints then they can be used as novel samples for that particular tree, giving
an estimated test error that we get without having to use any extra datapoints.

This avoids the need for cross-validation.
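This out-of-bootstrap estimate is exposed in scikit-learn via oob_score (again an illustrative mapping, not the book's code): each tree is evaluated on the datapoints its bootstrap sample missed.

```python
# Hedged sketch: out-of-bootstrap (OOB) error estimate without a test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0).fit(X, y)
print(forest.oob_score_)   # accuracy estimated on out-of-bootstrap samples
```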

As a brief example of using the random forest, we start by demonstrating that the random
forest gets the correct results on the Party example that has been used in both this and the
previous chapters, based on 10 trees, each trained on 7 samples, and with just two levels
allowed in each tree:
As a rather more involved example, the car evaluation dataset in the UCI Repository
contains 1,728 examples aiming to classify whether or not a car is a good purchase based
on six attributes. The following results compare a single decision tree, bagging, and a
random forest with 50 trees, each based on 100 samples, and with a maximum depth of five
for each tree. It can be seen that the random forest is the most accurate of the three
methods.
Strengths and weaknesses

Strengths

The following are some of the important strengths of random forests.

• It runs efficiently on large databases.

• It can handle thousands of input variables without variable deletion.

• It gives estimates of what variables are important in the classification.

• It has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data are missing.

• Generated forests can be saved for future use on other data.

• Prototypes are computed that give information about the relation between the variables
and the classification.

• The capabilities of the above can be extended to unlabeled data, leading to unsupervised
clustering, data views and outlier detection.

• It offers an experimental method for detecting variable interactions.


• Random forest run times are quite fast, and they are able to deal with unbalanced and
missing data.

• They can handle binary features, categorical features, numerical features without any need
for scaling.

Weaknesses

• A weakness of random forest algorithms is that when used for regression they cannot
predict beyond the range in the training data, and that they may over-fit data sets that are
particularly noisy.
• The sizes of the models created by random forests may be very large. It may take hundreds
of megabytes of memory and may be slow to evaluate.

• Random forest models are black boxes that are very hard to interpret.

3.4 Boosting

 Boosting: train next learner on mistakes made by previous learner(s)

In bagging, generating complementary base-learners is left to chance and to the instability of the learning method. In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners. The original boosting algorithm combines three weak learners to generate a strong learner. A weak learner has error probability less than 1/2, which makes it better than random guessing on a two-class problem, and a strong learner has arbitrarily small error probability.

Original Boosting Concept

Given a large training set, we randomly divide it into three. We use X1 and train d1. We then
take X2 and feed it to d1. We take all instances misclassified by d1 and also as many
instances on which d1 is correct from X2, and these together form the training set of d2. We
then take X3 and feed it to d1 and d2. The instances on which d1 and d2 disagree form the
training set of d3. During testing, given an instance, we give it to d1 and d2; if they agree,
that is the response, otherwise the response of d3 is taken as the output.

1. Split data X into {X1, X2, X3}

2. Train d1 on X1

 Test d1 on X2

3. Train d2 on d1’s mistakes on X2 (plus some right)

 Test d1 and d2 on X3

4. Train d3 on disagreements between d1 and d2

 Testing: apply d1 and d2; if disagree, use d3

 Drawback: need large X

It can be shown that the overall system has a reduced error rate, and the error rate can be arbitrarily reduced by using such systems recursively; that is, a boosting system of three models can itself be used as a d_j in a higher-level system.
Though it is quite successful, the disadvantage of the original boosting method is that it requires a very large training sample. The sample should be divided into three and, furthermore, the second and third classifiers are only trained on a subset on which the previous ones err. So unless one has a quite large training set, d2 and d3 will not have training sets of reasonable size.

3.4.1 AdaBoost

Freund and Schapire (1996) proposed a variant, named AdaBoost, short for adaptive boosting, that uses the same training set over and over and thus need not be large, but the classifiers should be simple so that they do not overfit. AdaBoost can also combine an arbitrary number of base-learners, not three.

AdaBoost algorithm

The idea is to modify the probabilities of drawing the instances as a function of the error. Let us say p_j^t denotes the probability that the instance pair (x^t, r^t) is drawn to train the jth base-learner. Initially, all p_1^t = 1/N. Then we add new base-learners as follows, starting from j = 1: ε_j denotes the error rate of d_j.

AdaBoost requires that learners are weak, that is, ε_j < 1/2, ∀j; if not, we stop adding new base-learners. Note that this error rate is not on the original problem but on the dataset used at step j. We define β_j = ε_j / (1 − ε_j) < 1, and we set p_{j+1}^t = β_j p_j^t if d_j correctly classifies x^t; otherwise, p_{j+1}^t = p_j^t. Because p_{j+1}^t should be probabilities, there is a normalization where we divide p_{j+1}^t by Σ_t p_{j+1}^t, so that they sum up to 1. This has the effect that the probability of a correctly classified instance is decreased, and the probability of a misclassified instance increases. Then a new sample of the same size is drawn from the original sample according to these modified probabilities, with replacement, and is used to train d_{j+1}.
This has the effect that dj+1 focuses more on instances misclassified by dj ; that is why the
base-learners are chosen to be simple and not accurate, since otherwise the next training
sample would contain only a few outlier and noisy instances repeated many times over. For
example, with decision trees, decision stumps, which are trees grown only one or two levels,
are used. So it is clear that these would have bias but the decrease in variance is larger and
the overall error decreases. An algorithm like the linear discriminant has low variance, and
we cannot gain by AdaBoosting linear discriminants.
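The weight update at the heart of AdaBoost can be sketched in NumPy for one boosting step; the hit/miss pattern of d_j below is made up for illustration.

```python
# Hedged NumPy sketch of one AdaBoost weight update (illustrative data).
import numpy as np

N = 8
p = np.full(N, 1.0 / N)                     # initially p_1^t = 1/N
correct = np.array([1, 1, 1, 0, 1, 0, 1, 1], dtype=bool)  # d_j's hits/misses

eps = p[~correct].sum()                     # weighted error rate of d_j
assert eps < 0.5, "learner must be weak but better than chance"

beta = eps / (1.0 - eps)                    # beta_j < 1
p_next = np.where(correct, beta * p, p)     # shrink correctly classified
p_next /= p_next.sum()                      # normalize back to probabilities

print(np.round(p_next, 3))
```

After the update, the two misclassified instances carry more probability mass than any correctly classified one, so the next learner's bootstrap sample will focus on them.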

3.5 Stacking - Stacked Generalization

Stacked generalization is a technique proposed by Wolpert (1992) that extends voting in that the way the output of the base-learners is combined need not be linear but is learned through a combiner system, f(·|Φ), which is another learner, whose parameters Φ are also trained (see the figure below).
Figure: In stacked generalization, the combiner is another learner and is not restricted
to being a linear combination as in voting.

y = f (d1, d2, . . . , dL |Φ)

The combiner learns what the correct output is when the base-learners give a certain
output combination. We cannot train the combiner function on the training data because
the base-learners may be memorizing the training set; the combiner system should
actually learn how the baselearners make errors. Stacking is a means of estimating and
correcting for the biases of the base-learners. Therefore, the combiner should be trained
on data unused in training the base-learners.

If f(·|w1, . . . , wL) is a linear model with constraints, wj ≥ 0, Σ_j wj = 1, the optimal weights can be found by constrained regression, but of course we do not need to enforce this; in stacking, there is no restriction on the combiner function and, unlike voting, f(·) can be nonlinear. For example, it may be implemented as a multilayer perceptron with Φ its connection weights.

The outputs of the base-learners dj define a new L-dimensional space in which the
output discriminant/regression function is learned by the combiner function.

In stacked generalization, we would like the base-learners to be as different as possible so that they will complement each other, and, for this, it is best if they are based on different learning algorithms. If we are combining classifiers that can generate continuous outputs, for example, posterior probabilities, it is better that these continuous outputs be combined rather than hard decisions.

When we compare a trained combiner as we have in stacking, with a fixed rule such as
in voting, we see that both have their advantages: A trained rule is more flexible and
may have less bias, but adds extra parameters, risks introducing variance, and needs
extra time and data for training. Note also that there is no need to normalize classifier
outputs before stacking.
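A hedged sketch of stacking with scikit-learn's StackingClassifier (not referenced by the source): the combiner is itself a trained learner, and cv=5 fits it on out-of-fold predictions, i.e., on data unused when training the base-learners.

```python
# Hedged sketch: stacked generalization with a trained combiner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),   # the combiner f(.|Phi)
    cv=5,   # combiner is trained on out-of-fold base-learner predictions
)
acc = stack.fit(X, y).score(X, y)
print(acc)
```

Using different base algorithms (a tree and KNN here) follows the advice above that base-learners should be as different as possible.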

Summary:

 Hamming distance is suitable for binary data or categorical features.

 Euclidean distance is widely used for continuous numerical features.

 Manhattan distance is preferred for cases where features are in an integer space.

 Minkowski distance allows flexibility by adjusting the parameter "p" to suit specific
requirements.
