MA2401 Lecture Notes
Department of Mathematics
NATIONAL UNIVERSITY OF SINGAPORE
Contents
2 Linear Algebra 12
2.1 Matrices and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Introduction to Matrices . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2 Defining Matrices and Vectors in R . . . . . . . . . . . . . . . . . 13
2.1.3 Special types of Matrices . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Matrix and Vector Operations . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Definitions and Properties . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 Matrix Operations in R . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.4 Relational Operators in R . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Introduction to Linear Systems . . . . . . . . . . . . . . . . . . . 28
2.3.2 Solutions to a Linear System . . . . . . . . . . . . . . . . . . . . . 31
2.3.3 Gaussian and Gauss-Jordan Elimination . . . . . . . . . . . . . . 32
2.3.4 Solving a Linear System in R . . . . . . . . . . . . . . . . . . . . . 37
2.4 Submatrices and Block Multiplication . . . . . . . . . . . . . . . . . . . . 42
2.4.1 Block Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.2 Entries and Submatrices of Vectors and Matrices . . . . . . . . . 45
2.5 Inverse of a Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.1 Definition and Properties . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.2 Algorithm to Finding the Inverse . . . . . . . . . . . . . . . . . . 49
2.5.3 Finding Inverse in R . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5.4 Inverse and Linear System . . . . . . . . . . . . . . . . . . . . . . 52
2.6 Least Square Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.7 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.7.1 Definition: Cofactor Expansion . . . . . . . . . . . . . . . . . . . 56
2.7.2 Computing Determinant in R . . . . . . . . . . . . . . . . . . . . 59
2.7.3 Properties of Determinant . . . . . . . . . . . . . . . . . . . . . . 59
2.7.4 Determinants of Partitioned Matrices . . . . . . . . . . . . . . . . 60
2.8 Eigenanalysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8.1 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . 61
2.8.2 Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.9 Appendix for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.9.1 Matrix operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.9.2 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.9.3 Solving Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . 79
2.9.4 Elementary Row Operations . . . . . . . . . . . . . . . . . . . . . 81
2.9.5 Gaussian Elimination and Gauss-Jordan Elimination . . . . . . . 84
2.9.6 Orthogonal and Orthonormal . . . . . . . . . . . . . . . . . . . . 86
2.9.7 Gram-Schmidt Process . . . . . . . . . . . . . . . . . . . . . . . . 88
2.9.8 Least square solution . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.9.9 Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.9.10 Orthogonal Diagonalization . . . . . . . . . . . . . . . . . . . . . 92
2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
References 269
Chapter 1
1.1 Introduction
RStudio is an integrated development environment (IDE) for R that provides an alternative interface to R. RStudio runs on Mac, PC, and Linux machines and provides a simplified interface that looks and feels identical on all of them.
The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft, and an official part of the
GNU project called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of
Statistics of the University of Auckland in Auckland, New Zealand. R made its first
appearance in 1993.
• A large group of individuals has contributed to R by sending code and bug reports.
• Since mid-1997 there has been a core group (the “R Core Team”) who can modify
the R source code archive.
Features of R
As stated earlier, R is a programming language and software environment for statistical
analysis, graphics representation and reporting. The following are the important features
of R
• R is a well-developed, simple and effective programming language which includes
conditionals, loops, user defined recursive functions and input and output facilities.
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large, coherent and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either on screen or as printed output.
1. Go to https://rstudio.cloud/plans/free.
1.3 R Script
You may type a series of commands in an R script and execute them at one time. An R script is a normal text file; you may type your script in any text editor, such as Notepad. You may also prepare your script in a word processor, like Microsoft Word, TextEdit, or WordPad, provided you can save the script in plain text (ASCII) format. Save the text file with .R at the end, for example, test.R.
To open a new R Script in RStudio, go to File, New File, and click R Script, or
by pressing Ctrl+Shift+N (for the desktop RStudio) or Ctrl+Alt+Shift+N (for RStudio
Cloud). You may now begin to type your commands in the R Script. Pressing Ctrl+Enter
will run the command in the console. Selecting multiple lines and pressing Ctrl+Enter
will run all the selected commands in the console.
You may save your script file by pressing Ctrl+s or clicking the blue file button.
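For example, a script file test.R might contain the following lines (a minimal illustration; the particular commands are arbitrary):
# test.R -- a small example script
a <- 9*2        # store a value in the variable a
b <- a + 3      # use a in another computation
a; b            # display both values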
1.4 Operations
1.4.1 Basic Operations
R can perform simple calculations. Commands entered in the Console tab are immedi-
ately executed by R by pressing Enter. Here are some basic operators:
• Addition:
> 5+3
[1] 8
• Subtraction:
> 8-5
[1] 3
• Product:
> 123 * 321
[1] 39483
• Division:
> 123/3
[1] 41
• Powers:
> 9^3
[1] 729
• Roots:
> 9^(1/3)
[1] 2.080084
• Log:
In R, log is the natural log, that is, with base e = exp(1). In order to compute the
log with another base a, use log(x,base=a).
> log(exp(2)) #log is the natural log, and exp(x) is e^x, where e is the
natural number
[1] 2
> log10(2) #base 10 log of 2; R also provides log2, and log(y, base=a) for any other base a
[1] 0.30103
> log(100, base=2) #alternative way of base 2 log of 100
[1] 6.643856
• To display a number in fractions, we need the library MASS. Then use the function
fractions
> library(MASS)
> fractions(0.3333)
[1] 1/3
> fractions(0.333)
[1] 333/1000
The >= function checks if the first number is greater than or equals to the second
number.
> 1>=1
[1] TRUE
> 3>=1
[1] TRUE
1.5 Guide to R
1.5.1 Comments and Storing Variables
The symbol > is the command line prompt symbol; typing a command or instruction will
cause it to be executed or performed immediately. If you press Enter before completing
the command (e.g., if the command is very long), the prompt changes to + indicating that
more is needed. Sometimes you do not know what else is expected and the + prompt
keeps coming. Pressing Esc will kill the command and return to the > prompt. If you
want to issue several commands on one line, they should be separated by a semicolon (;)
and they will be executed sequentially.
Comments are prefaced with the # character. You can save values to named variables
for later reuse.
> a=9*2 #store the answer as variable a
> a #to display the stored value
[1] 18
> a<-9*2 #may use <- instead of =
> a
[1] 18
Once the variables are defined, they can be used in other operations and functions.
> a<-2; b<-pi; #the semi-colon can be used to place multiple commands on one
line.
> a*b
[1] 6.283185
The broom icon can be used to clear all the commands in the console. Alternatively,
pressing Ctrl+L will clear all the commands in the console. Note that this will not delete
the commands from the Environment and History tab.
If you don’t know the exact name of a function, you can use the function apropos()
and give part of the name and R will find all functions that match. Quotation marks are
mandatory here. For example
> apropos("matrix")
You can do a broader search using ?? or help.search(), which will find matches not
only in the names of functions and data sets, but also in the documentation for them.
> ??histogram
or
> help.search("histogram")
1.6.4 Plots Tab
Plots created in the console are displayed in the Plots tab. For example
> plot(1,1) #this will display a point at (1,1) on the xy graph
> x = -9:9 #x takes integer values from -9 to 9
> plot(x,x^2) #plot x against x^2, for x from -9 to 9
Chapter 2
Linear Algebra
The size of a matrix is given by m × n, where m is the number of rows and n is the
number of columns. The (i, j)-entry of the matrix is the number aij in the i-th row and
j-th column, for i = 1, ..., m, j = 1, ..., n. A matrix can also be denoted as A = (aij)m×n, or simply A = (aij) when the size is clear from the context.
The last example shows that all real numbers can be thought of as 1 × 1 matrices.
Remark. 1. To be precise, the above examples are called real-valued matrices, or ma-
trices with real number entries. Later we will be introduced to complex-valued and
even matrices with function entries.
2. The choice of using round or square brackets is a matter of taste.
Example. 1. A = (aij)2×3, aij = i + j. Then
\[ A = \begin{pmatrix} 2 & 3 & 4 \\ 3 & 4 & 5 \end{pmatrix}. \]
A column vector is an n × 1 matrix, and a row vector is a 1 × n matrix.
The numeric function creates a vector with all its elements being 0.
> zeros3<-numeric(3)
> zeros3
[1] 0 0 0
> v1<-c(1,2,3,numeric(2))
> v1
[1] 1 2 3 0 0
The function rbind concatenates vector as rows of a matrix, and the function cbind
concatenates vectors as columns of a matrix.
> v1<-c(1,3,3,1); v2<-c(2,2,5,3); v3<-c(1,1,5,-1); mr<-rbind(v1,v2,v3); mr
[,1] [,2] [,3] [,4]
v1 1 3 3 1
v2 2 2 5 3
v3 1 1 5 -1
> mc<-cbind(v1,v2,v3); mc
v1 v2 v3
[1,] 1 2 1
[2,] 3 2 1
[3,] 3 5 5
[4,] 1 3 -1
> m1<-rbind(mr,c(1:4)); m1
[,1] [,2] [,3] [,4]
v1 1 3 3 1
v2 2 2 5 3
v3 1 1 5 -1
1 2 3 4
> m2<-cbind(numeric(4),mc); m2
v1 v2 v3
[1,] 0 1 2 1
[2,] 0 3 2 1
[3,] 0 3 5 5
[4,] 0 1 3 -1
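Another way to enter a matrix, used heavily in the rest of these notes, is the matrix function: matrix(v, m, n, T) arranges the entries of the vector v into an m × n matrix filled row by row (the final T is short for byrow = TRUE; with F, or if it is omitted, the matrix is filled column by column). A quick sketch reproducing the matrix mr above:
> A <- matrix(c(1,3,3,1,2,2,5,3,1,1,5,-1),3,4,T) #same entries as mr, filled row by row
> A
     [,1] [,2] [,3] [,4]
[1,]    1    3    3    1
[2,]    2    2    5    3
[3,]    1    1    5   -1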
Diagonal matrices. A square matrix with all the non diagonal entries equal 0 is
called a diagonal matrix, D = (dij )n with dij = 0 for all i ̸= j. It is usually denoted by
D = diag{d1 , d2 , ..., dn }.
Example.
\[ \operatorname{diag}\{1, 1\} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \operatorname{diag}\{0, 0, 0\} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \operatorname{diag}\{1, 2, 3, 4\} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}. \]
Scalar matrices. A diagonal matrix A = diag{a1, a2, ..., an} such that all the diagonal entries are equal, a1 = a2 = ... = an, is called a scalar matrix.
Example.
\[ \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}, \quad \begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 3 \end{pmatrix}. \]
Identity matrices. A scalar matrix with all diagonal entries equal 1 is called an
identity matrix. An identity matrix of order n is denoted as In . If there is no confusion
with the order of the matrix, we will write I instead. So a scalar matrix can be written
as cI.
Zero matrices. A matrix (of any size) with all entries equal 0 is called a zero matrix.
Usually denoted as 0m×n for the size m×n zero matrix, and 0n for the zero square matrix
of order n. If it is clear in the context, we will just denote it as 0.
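The special matrices above are conveniently created in R with diag and matrix; a short sketch using only base R:
> diag(c(1,2,3))   #diagonal matrix diag{1,2,3}
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
> diag(2)          #identity matrix I2
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> matrix(0,2,3)    #2 x 3 zero matrix
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0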
Triangular matrices. A square matrix A = (aij )n with all entries below (above) the
diagonal equal 0, that is, aij = 0 for all i > j (i < j), is called an upper (lower) triangular matrix.
It is a strictly upper or lower matrix if the diagonals are equal to zero too, that is, aij = 0
for all i ≥ j (i ≤ j).
Example. 1.
\[ \begin{pmatrix} 2 & 1 \\ 3 & 2 \end{pmatrix} \neq \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix} \]
for any choice of a, b, c, d, e, f, since the two matrices have different sizes.
2.
\[ \begin{pmatrix} 1 & 1 \\ 3 & 2 \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \]
if and only if a = 1, b = 1, c = 3, d = 2.
2. −A = (−1)A.
(i) (Commutative) A + B = B + A,
(ii) (Associative) A + (B + C) = (A + B) + C,
Proof. To show equality, we have to show that the matrices on the left and right of the equality have the same size, and that the corresponding entries are equal. It is clear that the matrices on both sides have the same size, so we will only check that the entries agree.
(i) aij + bij = bij + aij follows directly from commutativity of addition of real numbers.
(ii) aij + (bij + cij ) = (aij + bij ) + cij follows directly from associativity of addition of
real numbers.
(iii) 0 + aij = aij follows directly from the additive identity property of real numbers.
(iv) aij + (−aij ) = 0 follows directly from additive inverse property of real numbers.
(v) a(aij + bij ) = aaij + abij follows directly from distributive property of addition of
real numbers.
(vi) (a + b)aij = aaij + baij follows directly from distributive property of addition of real
numbers.
(vii) (ab)aij = a(baij ) follows directly from associativity of multiplication of real numbers.
(viii) If aaij = 0, then a = 0 or aij = 0. Suppose a ̸= 0, then aij = 0 for all i, j. So
A = 0.
Remark. 1. Since addition is associative, we will not write the parentheses when adding
multiple matrices.
Matrix Multiplication
Let A = (aij)m×p and B = (bij)p×n. The product AB is defined to be the m × n matrix whose (i, j)-entry is
\[ \sum_{k=1}^{p} a_{ik} b_{kj} = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{ip} b_{pj}. \]
Example. The (i, j)-entry of the product is (row i of the first matrix) times (column j of the second matrix). For instance,
\[ \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 2 & 3 \\ -1 & -2 \end{pmatrix} = \begin{pmatrix} 1 + 4 - 3 & 1 + 6 - 6 \\ 4 + 10 - 6 & 4 + 15 - 12 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 8 & 7 \end{pmatrix}, \]
a (2 × 3)(3 × 2) = (2 × 2) product.
Remark. 1. For AB to be defined, the number of columns of A must agree with the
number of rows of B. The resultant matrix has the same number of rows as A, and
the same number of columns as B.
(m × p)(p × n) = (m × n).
2.
2. Matrix multiplication is not commutative, that is, AB ≠ BA in general. For example,
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \neq \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \]
3. If we are multiplying A to the left of B, we are pre-multiplying A to B, AB. If we
multiply A to the right of B, we are post-multiplying A to B, BA. Pre-multiplying
A to B is the same as post-multiplying B to A.
(i) (Associative law) For matrices A = (aij)m×p, B = (bij)p×q, and C = (cij)q×n, (AB)C = A(BC).
(ii) (Left distributive law) For matrices A = (aij)m×p, B = (bij)p×n, and C = (cij)p×n,
A(B + C) = AB + AC.
(iii) (Right distributive law) For matrices A = (aij )m×p , B = (bij )m×p , and C = (cij )p×n ,
(A + B)C = AC + BC.
(iv) (Commute with scalar multiplication) For any real number c ∈ R, and matrices
A = (aij )m×p , B = (bij )p×n , c(AB) = (cA)B = A(cB).
(v) (Multiplicative identity) For any m × n matrix A, Im A = A = AIn .
(vi) (Zero divisor) There exists A ̸= 0m×p and B ̸= 0p×n such that AB = 0m×n .
The product of two non-zero matrices can be a zero matrix
(vii) (Zero matrix) For any m × n matrix A, A0n×p = 0m×p and 0p×m A = 0p×n .
The proof is beyond the scope of this course. Interested readers may refer to the
appendix.
Note that for diagonal matrices A and B of the same order, AB = BA.
Remark. 1. For square matrices, we define \(A^2 = AA\), and define inductively \(A^n = AA^{n-1}\), for n ≥ 2. It follows that \(A^n A^m = A^{n+m}\).
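In R the ^ operator acts entrywise on a matrix (see Section 2.2.3), so matrix powers have to be built with repeated %*%. A minimal sketch of one way to do this (the helper matpow below is ours, not a base R function):
> matpow <- function(A, n) {   #A a square matrix, n a positive integer
+   P <- diag(nrow(A))         #start from the identity matrix
+   for (i in seq_len(n)) P <- P %*% A
+   P                          #P is now A multiplied by itself n times
+ }
> A1 <- matrix(c(1,1,1,1),2,2)
> matpow(A1, 3)                #A1^3; for this A1 it equals 4*A1
     [,1] [,2]
[1,]    4    4
[2,]    4    4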
Transpose
For an m × n matrix A, the transpose of A, written as \(A^T\), is an n × m matrix whose (i, j)-entry is the (j, i)-entry of A, that is, if \(A^T = (b_{ij})_{n \times m}\), then
bij = aji
for all i = 1, ..., n, j = 1, ..., m. Equivalently, the rows of A are the columns of AT and
vice versa.
Example. 1. \( \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix} \)  2. \( \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}^T = \begin{pmatrix} 1 & 1 & 0 \end{pmatrix} \)  3. \( \begin{pmatrix} 1 & 2 \\ 2 & 0 \end{pmatrix}^T = \begin{pmatrix} 1 & 2 \\ 2 & 0 \end{pmatrix} \)
This gives us an alternative way to define symmetric matrices. A square matrix A is
symmetric if and only if AT = A.
(b) If A and B are symmetric matrices (with the appropriate sizes), then so is AB.
Trace
Let A = (aij)n×n be a square matrix of order n. Then the trace of A, denoted by tr(A), is the sum of the diagonal entries of A,
\[ \operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn}. \]
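Base R has no dedicated trace function, but tr(A) is simply the sum of the diagonal extracted with diag; a one-line sketch with an arbitrary matrix:
> A <- matrix(c(1,2,3,4,5,6,7,8,9),3,3,T)
> sum(diag(A))   #tr(A) = 1 + 5 + 9
[1] 15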
The first multiplication is known as outer product, denoted as u ⊗ v, and the second is
known as inner product, or dot product, denoted as u · v. In this course, we will only be
discussing inner product.
Example. 1. \( \begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix} \cdot \begin{pmatrix} 2 \\ 2 \\ 2 \end{pmatrix} = 2 + 4 - 2 = 4. \)
2. \( \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = 1 + 0 - 1 = 0. \)
3. \( \begin{pmatrix} 2 \\ 3 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ -2 \end{pmatrix} = 2 - 6 = -4. \)
The norm of a vector u ∈ Rⁿ is defined to be the square root of the inner product of u with itself, and is denoted as ∥u∥,
\[ \|u\| = \sqrt{u \cdot u}. \]
Geometric meaning of norm. The distance between the point \( \binom{x}{y} \) and the origin in R² is given by
\[ \text{distance} = \sqrt{x^2 + y^2} = \left\| \binom{x}{y} \right\|. \]
That is, in R², the norm of a vector can be interpreted as its distance from the origin.
Similarly, in R³, the distance of a vector \( (x, y, z)^T \) to the origin is
\[ \text{distance} = \sqrt{x^2 + y^2 + z^2} = \left\| (x, y, z)^T \right\|. \]
We may thus generalize and define the distance between a vector v and the origin in Rⁿ to be its norm, ∥v∥.
Observe that the distance between two vectors v = (vi) and u = (ui) is
\[ d(u, v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_n - v_n)^2} = \|u - v\|. \]
Example. 1. \( \left\| (1, 2, -1)^T \right\| = \sqrt{1^2 + 2^2 + (-1)^2} = \sqrt{6}. \)
2. \( d\!\left( \binom{1}{3}, \binom{0}{5} \right) = \left\| \binom{1-0}{3-5} \right\| = \sqrt{1^2 + (-2)^2} = \sqrt{5}. \)
The angle between two nonzero vectors, u, v ̸= 0 is the number θ with 0 ≤ θ ≤ π
such that
\[ \cos(\theta) = \frac{u \cdot v}{\|u\| \|v\|}. \]
This is a natural definition because once again, in R2 , this is indeed the definition of the
trigonometric function cosine.
(i) (Symmetric) u · v = v · u.
(iii) \( \sum_{i=1}^{n} u_i (a v_i + b w_i) = \sum_{i=1}^{n} (a u_i v_i + b u_i w_i) = a \sum_{i=1}^{n} u_i v_i + b \sum_{i=1}^{n} u_i w_i. \)
(iv) \( u \cdot u = \sum_{i=1}^{n} u_i^2 \ge 0 \) since the \(u_i\) are real numbers. It is clear that a sum of squares of real numbers is equal to 0 if and only if all the numbers are 0.
(v) \( \|cu\| = \sqrt{\sum_{i=1}^{n} (c u_i)^2} = \sqrt{c^2 \sum_{i=1}^{n} u_i^2} = |c| \|u\|. \)
Hence, for a nonzero vector u, the normalized vector u/∥u∥ is a unit vector, since
\[ \frac{u}{\|u\|} \cdot \frac{u}{\|u\|} = \frac{u \cdot u}{\|u\|^2} = 1. \]
Example. Let u = (1, 2, −1)ᵀ. We have computed that ∥u∥ = √6, and so
\[ \frac{u}{\|u\|} = \frac{1}{\sqrt{6}} \begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix} \]
is a unit vector.
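In R these quantities can be computed directly from the definitions with sum and sqrt; a short sketch using the vectors from the examples above (the name theta is ours):
> u <- c(1,2,-1); v <- c(2,2,2)
> sum(u*v)                          #dot product u.v
[1] 4
> sqrt(sum(u*u))                    #norm of u, equal to sqrt(6)
[1] 2.44949
> sqrt(sum((c(1,3)-c(0,5))^2))      #distance between (1,3) and (0,5), equal to sqrt(5)
[1] 2.236068
> theta <- acos(sum(u*v)/(sqrt(sum(u*u))*sqrt(sum(v*v))))   #angle between u and v, in radians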
Exercise:
2. For the vectors that are not unit, normalize them, if possible.
0 1 2 1
1 1
(i) 1
(ii) 2 −2 (iii) 2 0 (iv) cos(π/2) 1
0 1 −2 1
• Subtraction A − B, A-B
• Transpose Aᵀ, t(A)
The ^ function raises the entries in the first vector or matrix to the exponent of the
second vector or matrix.
> v1^v2
[1] 4 125 1296
> A1^A2
[,1] [,2]
[1,] 1 4
[2,] 64 4
[3,] 25 -1
The >= function checks if each entry in the first vector or matrix is greater than or
equal to the corresponding entry in the second vector or matrix.
> v1>=v2
[1] TRUE TRUE FALSE FALSE
> A1>=A2
[,1] [,2]
[1,] TRUE TRUE
[2,] FALSE FALSE
The == function checks if each entry in the first vector or matrix is equal to the cor-
responding entry in the second vector or matrix.
> v1==v2
[1] TRUE FALSE FALSE FALSE
> A1==A2
[,1] [,2]
[1,] TRUE FALSE
[2,] FALSE FALSE
The != function checks if each entry in the first vector or matrix is different from the
corresponding entry in the second vector or matrix.
> v1!=v2
[1] FALSE TRUE TRUE TRUE
> A1!=A2
[,1] [,2]
[1,] FALSE TRUE
[2,] TRUE TRUE
1. 2x + y = 3
2. x1 − x2 + 3x3 − 5x4 = 2
The following are examples of linear systems that are not in standard form
1. y = x sin( π6 )
2. x = 2y
The following are examples of equations that are not linear:
1. xy = 3
2. x2 + y 2 = 1
3. cos(x) + 4 sin(y) = 2
we say that
x1 = c1 , x2 = c2 , ..., xn = cn
is a solution to the linear system if the equations are simultaneously satisfied after making
the substitution, that is,
Example. x = 1 = y is a solution to
3x − 2y = 1
x + y = 2
Example.
x + 2y = 5
,
2x + 4y = 10
solutions: x = 1, y = 2, or x = 3, y = 1, etc.
2.
x − y + 3z = 1 ,
General solution: x = 1 + s − 3t, y = s, z = t, s, t ∈ R.
x + y = 2
x − y = 0
2x + y = 1
1. If there are any rows that consist entirely of zeros, then they are grouped together at
the bottom of the matrix. A row consisting entirely of zeros is called a zero row.
2. For any nonzero row, the first nonzero number of a row from the left is called a
leading entry. In any two successive nonzero rows, the leading entry in the lower row
occurs farther to the right than the leading entry in the higher row.
3. If a row does not consist entirely of zeros, then the first nonzero number in the row is
a 1.
4. Each column that contains a leading entry is called a pivot column. In each pivot column, all entries except the leading entry are zero.
A matrix that has the first three properties is said to be in row-echelon form.
Schematically, a matrix in row-echelon form looks like
\[ \begin{pmatrix} 0 & \cdots & 0 & \ast & \ast & \cdots & \cdots & \ast \\ 0 & \cdots & 0 & 0 & \cdots & \ast & \cdots & \ast \\ \vdots & & & & & & \ddots & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0 \end{pmatrix}, \]
where the first ∗ in each nonzero row is its leading entry.
A matrix in RREF has the form
\[ \begin{pmatrix} 0 & \cdots & 0 & 1 & \ast & 0 & \ast & 0 & \ast \\ 0 & \cdots & 0 & 0 & \cdots & 1 & \ast & 0 & \ast \\ 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 1 & \ast \\ 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0 & 0 \end{pmatrix}, \]
where every leading entry is 1 and is the only nonzero entry in its pivot column.
Example. The following are examples of augmented matrices in row-echelon form but
not in reduced row-echelon form.
1. \( \begin{pmatrix} -1 & 2 & 3 & 4 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 2 & 3 \end{pmatrix} \)
2. \( \begin{pmatrix} 1 & -1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \)
3. \( \begin{pmatrix} 1 & -1 & 0 \\ 0 & 0 & 3 \end{pmatrix} \)
The following are examples of augmented matrices in reduced row-echelon form.
1. \( \begin{pmatrix} 0 & 1 & 2 & 0 & 1 \\ 0 & 0 & 0 & 1 & 3 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} \)
2. \( \begin{pmatrix} 1 & 2 & 0 & -5 \end{pmatrix} \)
The following is an augmented matrix that is not in row-echelon form
1. \( \begin{pmatrix} 1 & 2 & 0 & -5 \\ 1 & 0 & 1 & 3 \end{pmatrix} \)
Example. From the REF or RREF, we are able to read off a general solution.
1.
1 1 1 1
0 1 1 0
This is in REF. We let the third variable be the parameter s, then we get y = −s from
the second row, and x = 1 − s − (−s) = 1 from the first row. So a general solution is
x = 1, y = −s, z = s.
2.
1 0 0 1
0 1 −1 0
This is in RREF. General solution: x = 1, y = s, z = s.
To solve a linear system, we can perform algebraic operations on the system (equiv-
alently, the augmented matrix) that do not alter the solution set until it is in REF or
RREF. This is achieved using the following 3 types of elementary row operations.
1. Exchanging 2 rows, Ri ↔ Rj ,
Remark. 1. Note that we cannot multiply a row by 0, as it may change the linear
system. For example, consider
x + y = 2
x − y = 0
For the second type of elementary row operation, the row we put first is the row we are performing the operation upon:
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \xrightarrow{R_1 + 2R_2} \begin{pmatrix} 1 & 2 & 0 \\ 0 & 1 & 0 \end{pmatrix} \]
instead of
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \xrightarrow{2R_2 + R_1} \begin{pmatrix} 1 & 0 & 0 \\ 1 & 2 & 0 \end{pmatrix}. \]
In fact, 2R_2 + R_1 is not an elementary row operation, but a combination of 2 operations, 2R_2 then R_2 + R_1. Here's another example,
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \xrightarrow{R_1 + R_2} \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \end{pmatrix} \]
and
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \xrightarrow{R_2 + R_1} \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix}. \]
The algorithm that reduces an (augmented) matrix to REF is called Gaussian elimination; continuing the reduction until the matrix is in RREF is called Gauss-Jordan elimination.
Step 1: Locate the leftmost column that does not consist entirely of zeros.
Step 2: Interchange the top row with another row, if necessary, to bring a nonzero entry to
the top of the column found in Step 1.
Step 3: For each row below the top row, add a suitable multiple of the top row to it so that
the entry below the leading entry of the top row becomes zero.
Step 4: Now cover the top row in the augmented matrix and begin again with Step 1
applied to the submatrix that remains. Continue this way until the entire matrix
is in row-echelon form.
Once the above process is completed, we will end up with a REF. The following steps
continue the process to reduce it to its RREF.
Step 5: Multiply a suitable constant to each row so that all the leading entries become 1.
Step 6: Beginning with the last nonzero row and working upward, add suitable multiples
of each row to the rows above to introduce zeros above the leading entries.
Algorithm. 1. Express the given linear system as an augmented matrix. Make sure that the linear system is in standard form.
4. Use back substitution (if the augmented matrix is in REF) to obtain a general
solution, or read off the general solution (if the augmented matrix is in RREF).
Example. 1.
\[ \begin{pmatrix} 1 & 1 & 2 & 4 \\ -1 & 2 & -1 & 1 \\ 2 & 0 & 3 & -2 \end{pmatrix} \xrightarrow[R_3 - 2R_1]{R_2 + R_1} \begin{pmatrix} 1 & 1 & 2 & 4 \\ 0 & 3 & 1 & 5 \\ 0 & -2 & -1 & -10 \end{pmatrix} \xrightarrow{R_3 + \frac{2}{3}R_2} \begin{pmatrix} 1 & 1 & 2 & 4 \\ 0 & 3 & 1 & 5 \\ 0 & 0 & -1/3 & -20/3 \end{pmatrix} \]
Indeed, the system is consistent, with unique solution x = −31, y = −5, z = 20.
The augmented matrix is now in REF. We may use back substitution to obtain the
solution, or continue to reduce to its RREF.
Continuing the reduction to RREF: after scaling the leading entries with −R_2 and \( \frac{1}{6}R_3 \), the matrix becomes
\[ \begin{pmatrix} 1 & 3 & -2 & 0 & 2 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 & 3 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \xrightarrow{R_2 - 3R_3} \begin{pmatrix} 1 & 3 & -2 & 0 & 2 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \xrightarrow{R_1 + 2R_2} \begin{pmatrix} 1 & 3 & 0 & 4 & 2 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \]
we will let A = (aij ) be the coefficient matrix, and b = (bi ) be the constant vector. Let’s
input this into R.
> A <- matrix(c(a11 , a12 , ..., a1n , a21 , a22 , ...a2n , ..., am1 , am2 , ..., amn ),m,n,T)
> b=matrix(c(b1 , b2 , ..., bm )).
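The functions showEqn, plotEqn and Solve used in the examples below are not part of base R; they come from the matlib package, which has to be loaded first. A minimal sketch, reusing the system of Example 3 below:
> # install.packages("matlib")   #run this once if matlib is not yet installed
> library(matlib)
> A <- matrix(c(1,1,1,-1,2,1),3,2,T); b <- matrix(c(2,0,1))
> plotEqn(A,b)                    #draws the lines and prints the equations
x[1] + x[2] = 2
x[1] - 1*x[2] = 0
2*x[1] + x[2] = 1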
Observe from the plot that the intersection is a line. Indeed, as we will see later
that there will be a parameter in the solution, and thus the solution is a line.
> plotEqn(A,b)
From the plot we can see that all 3 lines intersect at a point, and hence, the system
has a unique solution. Indeed,
> Solve(A, b, fractions = TRUE)
x1 = 3/2
x2 = 1/2
0 = 0
tells us that x1 = 3/2, x2 = 1/2 is the unique solution.
3. Let A = \( \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ 2 & 1 \end{pmatrix} \) and b = \( \begin{pmatrix} 2 \\ 0 \\ 1 \end{pmatrix} \),
> A <- matrix(c(1,1,1,-1,2,1),3,2,T); b=matrix(c(2,0,1))
> plotEqn(A,b)
x[1] + x[2] = 2
x[1] - 1*x[2] = 0
2*x[1] + x[2] = 1
From the plot we can see that the 3 lines do not intersect at any common point,
hence, the system has no solutions. Indeed,
> Solve(A, b, fractions = TRUE)
x1 = 1/3
x2 = 1/3
0 = 4/3
4. Let A = \( \begin{pmatrix} 1 & 3 & -2 & 0 & 2 & 0 \\ 2 & 6 & -5 & -2 & 4 & -3 \\ 0 & 0 & 5 & 10 & 0 & 15 \\ 2 & 6 & 0 & 8 & 4 & 18 \end{pmatrix} \) and b = \( \begin{pmatrix} 0 \\ -1 \\ 5 \\ 6 \end{pmatrix} \),
> A <- matrix(c(1,3,-2,0,2,0,2,6,-5,-2,4,-3,0,0,5,10,0,15,2,6,0,8,4,18),4,6,T);
b <- matrix(c(0,-1,5,6))
> showEqn(A,b)
1*x1 + 3*x2 - 2*x3 + 0*x4 + 2*x5 + 0*x6 = 0
2*x1 + 6*x2 - 5*x3 - 2*x4 + 4*x5 - 3*x6 = -1
0*x1 + 0*x2 + 5*x3 + 10*x4 + 0*x5 + 15*x6 = 5
2*x1 + 6*x2 + 0*x3 + 8*x4 + 4*x5 + 18*x6 = 6
In this case since there are more than 3 variables in the system, we cannot plot the
equations. We will proceed to find a general solution.
> Solve(A, b, fractions = TRUE)
x1 + 3*x2 + 4*x4 + 2*x5 = 0
x3 + 2*x4 = 0
x6 = 1/3
0 = 0
A general solution is
x1 = −3r − 4s − 2t, x2 = r, x3 = −2s, x4 = s, x5 = t, x6 = 1/3, r, s, t ∈ R.
Suppose now we are given A = (aij)m×n and B = (bij)m×p, and we want to find an X = (xij)n×p such that AX = B,
\[ \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mp} \end{pmatrix}. \]
Example. 1. Solve
\[ \begin{pmatrix} 1 & 2 & -3 \\ 2 & 6 & -11 \\ 1 & -2 & 7 \end{pmatrix} \begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \\ z_1 & z_2 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 3 & 2 \\ -1 & 1 \end{pmatrix}. \]
This is equivalent to solving the 2 linear systems
> Solve(A,b2)
x1 + 2*x3 = 1
x2 - 2.5*x3 = 0
0 = 0
A general solution for X is
\[ X = \begin{pmatrix} -2s & 1 - 2t \\ 0.5 + 2.5s & 2.5t \\ s & t \end{pmatrix}, \quad s, t \in \mathbb{R}. \]
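When the coefficient matrix A is square and invertible, R can solve AX = B for all columns of B at once, since solve(A, B) accepts a matrix right-hand side; a small sketch (the matrix A here is invertible and reappears in the inverse section below; the right-hand sides are illustrative):
> A <- matrix(c(1,0,1,1,1,0,0,1,1),3,3,T)   #invertible coefficient matrix
> B <- matrix(c(2,4,6,1,1,1),3,2)           #two right-hand sides, one per column
> solve(A,B)                                #X with AX = B, one solution per column
     [,1] [,2]
[1,]    0  0.5
[2,]    4  0.5
[3,]    2  0.5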
Hence, in order to define, if possible, the inverse of a matrix, we first need to identify the matrix that serves the same role as 1 does for the real numbers. We indeed have such an object.
Recall that the identity matrix has the multiplicative identity property, for any m × n
matrix A,
Im A = A = AIn .
A square matrix A of order n is invertible if there exists a square matrix B of order
n such that
AB = In = BA.
In this case, we say that B is an inverse of A. A square matrix is singular if it is not
invertible.
Remark. 1. Only square matrices can be invertible; a non-square matrix is never invertible.
2. For a square matrix A to be invertible, we need to check that there exists a B that is
simultaneously a left and right inverse of A, that is, we need to show both BA = In
and AB = In . However, it turns out that BA = In if and only if AB = In for a square
matrix B. That is, as long as there is a square matrix B of the same order that either
pre or post multiplied to A to give the identity matrix, then A is invertible.
3. Moreover, the matrix B is unique, that is, if there is a C such that either AC = In or
CA = In , then necessarily B = C.
Since inverse is unique, we can denote the inverse of an invertible matrix A by A−1 .
That is, if A is invertible, there exists a unique matrix A−1 such that
AA−1 = In = A−1 A.
Remark. 1. By induction, one can prove that the product of invertible matrices is invertible, and \( (A_1 A_2 \cdots A_k)^{-1} = A_k^{-1} \cdots A_2^{-1} A_1^{-1} \) if \(A_i\) is an invertible matrix for i = 1, ..., k.
2. We define the negative power of an invertible matrix to be
A−n = (A−1 )n
for any positive integer n.
Exercise: Show that AB is invertible if and only if both A and B are invertible.
Hint: If AB is invertible, let C be the inverse. Pre and post multiply AB with C.
Theorem. An order 2 square matrix
\[ A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \]
is invertible if and only if ad − bc ≠ 0, in which case the inverse is given by the formula
\[ A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}. \]
The formula is obtained by using the adjoint of A, which is beyond the scope of this
module. However, readers may verify that indeed we have AA−1 = I2 = A−1 A.
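The formula is also easy to check numerically in R; a quick sketch with an arbitrary invertible 2 × 2 matrix, comparing the formula against solve (which returns the inverse):
> A <- matrix(c(1,2,3,4),2,2,T)            #a = 1, b = 2, c = 3, d = 4
> (1/(A[1,1]*A[2,2]-A[1,2]*A[2,1])) * matrix(c(A[2,2],-A[1,2],-A[2,1],A[1,1]),2,2,T)
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5
> solve(A)                                 #agrees with the formula
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5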
AX = I,
So,
\[ \begin{pmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} 0.5 & 0.5 & -0.5 \\ -0.5 & 0.5 & 0.5 \\ 0.5 & -0.5 & 0.5 \end{pmatrix}. \]
The inverse of an invertible matrix A can be obtained in R using the function solve.
such that the coefficient matrix A is invertible. Then by pre-multiplying the inverse of
A to both sides of the corresponding matrix equation
Ax = b,
we have
x = A−1 b.
That is, A−1 b will be the unique solution to the system.
Example. Consider the linear system
x1 + x3 = 2
x1 + x2 = 4
x2 + x3 = 6
Using R, we have
> A <- matrix(c(1,0,1,1,1,0,0,1,1),3,3,T); b <- matrix(c(2,4,6))
> solve(A)%*%b
[,1]
[1,] 0
[2,] 4
[3,] 2
Indeed,
> Solve(A,b)
x1 = 0
x2 = 4
x3 = 2
x1 = c1 , x2 = c2 , ..., xn = cn
x1 + x2 + x4 = 1
x 2 + x3 = 1
x1 + 2x2 + x3 + x4 = 1
It is inconsistent,
> A <- matrix(c(1,1,0,1,0,1,1,0,1,2,1,1),3,4,T); b <- matrix(c(1,1,1))
> Solve(A,b)
x1 - 1*x3 + x4 = 0
x2 + x3 = 1
0 = -1
A least square solution to the system is u = (0, 2/3, 0, 0)ᵀ. One can check that, given any v = (a, b, c, d)ᵀ, the distance between
\[ Au = \begin{pmatrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 2 & 1 & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 2/3 \\ 0 \\ 0 \end{pmatrix} = \frac{2}{3} \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix} \]
and b = (1, 1, 1)ᵀ, which is
\[ \sqrt{(1 - 2/3)^2 + (1 - 2/3)^2 + (1 - 4/3)^2} = \frac{\sqrt{3}}{3}, \]
is shorter than the distance between Av and b. For example,
• if v = (1, 0, 0, 0)ᵀ, then the distance between Av = (1, 0, 1)ᵀ and b is
\[ \sqrt{(1-1)^2 + (0-1)^2 + (1-1)^2} = 1. \]
• if v = (0, 1, 0, 0)ᵀ, then the distance between Av = (1, 1, 2)ᵀ and b is
\[ \sqrt{(1-1)^2 + (1-1)^2 + (2-1)^2} = 1. \]
• if v = (0, 0, 1, 0)ᵀ, then the distance between Av = (0, 1, 1)ᵀ and b is
\[ \sqrt{(0-1)^2 + (1-1)^2 + (1-1)^2} = 1. \]
• if v = (1, 1, 1, 1)ᵀ, then the distance between Av = (3, 2, 5)ᵀ and b is
\[ \sqrt{(3-1)^2 + (2-1)^2 + (5-1)^2} = \sqrt{21}. \]
Theorem. Let A be a m × n matrix and b ∈ Rm . A vector u ∈ Rn is a least square
solution to Ax = b if and only if u is a solution to AT Ax = AT b.
That is, to find a least square solution, we solve for the following matrix equation
AT Ax = AT b.
x1 = s − t, x2 = 2/3 − s, x3 = s, x4 = t, s, t ∈ R
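In R a least square solution can therefore be found by solving the normal equations directly; a sketch for the system of this section, assuming the matlib package is loaded (crossprod(A) computes AᵀA and crossprod(A, b) computes Aᵀb):
> A <- matrix(c(1,1,0,1,0,1,1,0,1,2,1,1),3,4,T); b <- matrix(c(1,1,1))
> AtA <- crossprod(A); Atb <- crossprod(A,b)     #A^T A and A^T b
> Solve(AtA, Atb, fractions = TRUE)              #gives the general solution x1 = s - t, x2 = 2/3 - s, x3 = s, x4 = t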
x1 + x3 = 2
x1 + x2 = 4
x2 + x3 = 6
2.7 Determinant
2.7.1 Definition: Cofactor Expansion
We will define the determinant of A of order n by induction.
• Define Mij, called the (i, j) matrix minor of A, to be the matrix obtained from A by deleting the i-th row and j-th column.
Example. Let A = \( \begin{pmatrix} 1 & 2 & -1 \\ -1 & 1 & 3 \\ 3 & 2 & 1 \end{pmatrix} \). Then
(i) \( M_{23} = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} \),
(ii) \( M_{12} = \begin{pmatrix} -1 & 3 \\ 3 & 1 \end{pmatrix} \), and
(iii) \( M_{31} = \begin{pmatrix} 2 & -1 \\ 1 & 3 \end{pmatrix} \).
• For a square matrix of order n, the (i, j)-cofactor of A, denoted as Aij , is the (real)
number given by
Aij = (−1)i+j det(Mij ).
This definition is well-defined since Mij is a square matrix of order n − 1, and by
induction hypothesis, the determinant is well defined. Take note of the sign of the
(i, j)-entry, (−1)i+j . Here’s a visualization of the sign of the entries of the matrix
\[ \begin{pmatrix} + & - & + & \cdots \\ - & + & - & \cdots \\ + & - & + & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \]
Example. Let A = \( \begin{pmatrix} 1 & 2 & -1 \\ -1 & 1 & 3 \\ 3 & 2 & 1 \end{pmatrix} \) as before. Then, for instance, the (2, 3)-cofactor is \( A_{23} = (-1)^{2+3}\det(M_{23}) = -(1 \cdot 2 - 2 \cdot 3) = 4 \).
• The determinant of a square matrix A of order n is
\[ \det(A) = a_{i1}A_{i1} + a_{i2}A_{i2} + \cdots + a_{in}A_{in} \qquad (2.1) \]
\[ \det(A) = a_{1j}A_{1j} + a_{2j}A_{2j} + \cdots + a_{nj}A_{nj} \qquad (2.2) \]
Formula (2.1) is called the cofactor expansion along row i, and (2.2) the cofactor expansion along column j.
The determinant of A is also denoted as det(A) = |A|.
Remark. The equality between cofactor expansion along any row or any column is a
theorem. The proof requires knowledge of the symmetric groups, which is beyond the
scope of this module.
\[ \begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = a \begin{vmatrix} e & f \\ h & i \end{vmatrix} - b \begin{vmatrix} d & f \\ g & i \end{vmatrix} + c \begin{vmatrix} d & e \\ g & h \end{vmatrix} = aei - afh - bdi + bfg + cdh - ceg. \]
This can be remembered by copying the first two columns to the right of the matrix and summing the products along the six diagonals, subtracting the three anti-diagonal products:
\[ \det(A): \quad \begin{matrix} a & b & c & a & b \\ d & e & f & d & e \\ g & h & i & g & h \end{matrix} \]
Example. Compute the determinant of
\[ \begin{pmatrix} 1 & 5 & 1 & 2 \\ 0 & 2 & 6 & 3 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 1 & 1 \end{pmatrix}. \]
Cofactor expansion along the first column:
\[ \begin{vmatrix} 1 & 5 & 1 & 2 \\ 0 & 2 & 6 & 3 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 1 & 1 \end{vmatrix} = 1 \begin{vmatrix} 2 & 6 & 3 \\ 0 & 1 & 2 \\ 0 & 1 & 1 \end{vmatrix} - 0 \begin{vmatrix} 5 & 1 & 2 \\ 0 & 1 & 2 \\ 0 & 1 & 1 \end{vmatrix} + 0 \begin{vmatrix} 5 & 1 & 2 \\ 2 & 6 & 3 \\ 0 & 1 & 1 \end{vmatrix} - 0 \begin{vmatrix} 5 & 1 & 2 \\ 2 & 6 & 3 \\ 0 & 1 & 2 \end{vmatrix} = 2(1 - 2) = -2. \]
det(A) = det(AT ).
This statement is proved by induction on the size n of the matrix A, and using fact
that cofactor expansion along first row of A is equal to cofactor expansion along the first
column of AT .
Sketch of proof:
For an upper triangular matrix, cofactor expansion along the first column gives
\[ \det \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{pmatrix} = a_{11} a_{22} \cdots a_{nn}, \]
the product of the diagonal entries.
> A <- matrix(c(1,5,1,2,0,2,6,3,0,0,1,2,0,0,1,1),4,4,T); det(A)
[1] -2
By induction, we get det(A₁A₂···A_k) = det(A₁) det(A₂) ··· det(A_k).
Corollary. If A is invertible, then det(A⁻¹) = det(A)⁻¹.
Proof. Since the identity matrix I is a triangular matrix, det(I) = 1. Then
\[ 1 = \det(I) = \det(AA^{-1}) = \det(A)\det(A^{-1}). \]
So det(A)⁻¹ = det(A⁻¹).
Corollary (Equivalence of invertibility and determinant). A square matrix A is invertible
if and only if det(A) ̸= 0.
Corollary (Determinant of scalar multiplication). For any square matrix A of order n
and scalar c ∈ R,
det(cA) = cn det(A).
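These properties are easy to check numerically; a small sketch reusing the 4 × 4 matrix A above, for which det(A) = −2:
> A <- matrix(c(1,5,1,2,0,2,6,3,0,0,1,2,0,0,1,1),4,4,T)
> det(solve(A))          #det(A^{-1}) = det(A)^{-1} = -0.5
[1] -0.5
> det(3*A)               #det(3A) = 3^4 * det(A) = -162
[1] -162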
Example. Let A = \( \begin{pmatrix} 1 & 5 & 1 & 2 \\ 0 & 2 & 6 & 3 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 1 & 1 \end{pmatrix} \) and B = \( \begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & -1 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 3 \end{pmatrix} \). Given that det(A) = −2,
1. det(3A) = 3⁴(−2) = −162.
2. If B is a square matrix of order m and C is a square matrix of order n, then
\[ \begin{vmatrix} 0 & B \\ C & 0 \end{vmatrix} = (-1)^{mn} \lvert B \rvert \, \lvert C \rvert. \]
This follows from
\[ \begin{pmatrix} 0 & B \\ C & 0 \end{pmatrix} = \begin{pmatrix} B & 0 \\ 0 & C \end{pmatrix} \begin{pmatrix} 0 & I_m \\ I_n & 0 \end{pmatrix}. \]
It can be shown that if B and C are not square matrices, then the matrix must be singular and so has zero determinant.
3. \( \begin{vmatrix} I_m & B \\ 0 & I_n \end{vmatrix} = 1 \) since it is an upper triangular matrix.
4. If A and D are square matrices, then
\[ \begin{vmatrix} A & B \\ 0 & D \end{vmatrix} = \lvert A \rvert \, \lvert D \rvert. \]
2.8 Eigenanalysis
2.8.1 Eigenvalues and Eigenvectors
Let A be a square matrix of order n. Then notice that for any vector u ∈ Rn , Au is
also a vector in Rn . So we may think of A as a map Rn → Rn , taking a vector and
transforming it to another vector in the same Euclidean space.
Example. 1.
\[ A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \]
Geometrically the matrix A reflects a vector along the line x = y.
\[ A\begin{pmatrix} -2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \end{pmatrix}, \quad A\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad A\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \]
Observe that any vector on the line x = y gets transform back to itself, and any
vector along x = −y line get transformed to the negative of itself.
2.
\[ A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \]
The matrix A takes a vector and maps it to a vector along the line x = y such that both coordinates in Av are the sum of the coordinates in v.
\[ A\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad A\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad A\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \end{pmatrix}. \]
Observe that any vector v along the line x = y is mapped to twice itself, Av = 2v,
and it take any vector v along the line x = −y to the origin, Av = 0.
Let A be a square matrix of order n. A real number λ ∈ R is an eigenvalue of A if
there is a nonzero vector v ∈ Rn , v ̸= 0, such that Av = λv. In this case, the nonzero
vector v is called an eigenvector associated to λ. In other words, A transforms its eigen-
vectors by scaling it by a factor of the associated eigenvalue.
1. For \( A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \):
Eigenvalue λ = 1, eigenvector \( v_\lambda = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \); eigenvalue λ = −1, eigenvector \( v_\lambda = \begin{pmatrix} -1 \\ 1 \end{pmatrix} \).
2. For \( A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \), \( A\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \end{pmatrix} = 2\begin{pmatrix} 1 \\ 1 \end{pmatrix} \), \( A\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} = 0\begin{pmatrix} 1 \\ -1 \end{pmatrix} \).
Eigenvalue λ = 2, eigenvector \( v_\lambda = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \); eigenvalue λ = 0, eigenvector \( v_\lambda = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \).
The eigenvalues of a square matrix can be obtained through its characteristic poly-
nomial.
Lemma. Let A be a square matrix of order n. Then det(xI − A) is a polynomial of
degree n.
Example. Let A = \( \begin{pmatrix} a & b \\ c & d \end{pmatrix} \) be an order 2 square matrix. Then the characteristic polynomial of A is
\[ \begin{vmatrix} x - a & -b \\ -c & x - d \end{vmatrix} = (x - a)(x - d) - bc = x^2 - (a + d)x + ad - bc. \]
It is a degree 2 polynomial.
This means that λ is an eigenvalue of A if and only if it is a root of the polynomial
det(xI − A). This motivates the following definition.
\[ \det(xI - A) = \begin{vmatrix} x - 1 & 0 & 0 \\ 0 & x & -2 \\ 0 & -3 & x - 1 \end{vmatrix} = (x - 1)[x(x - 1) - 6] = (x - 1)(x + 2)(x - 3). \]
Example. Let A = \( \begin{pmatrix} a & b & c \\ 0 & d & e \\ 0 & 0 & f \end{pmatrix} \). Then
\[ \det(xI - A) = \begin{vmatrix} x - a & -b & -c \\ 0 & x - d & -e \\ 0 & 0 & x - f \end{vmatrix} = (x - a)\begin{vmatrix} x - d & -e \\ 0 & x - f \end{vmatrix} = (x - a)(x - d)(x - f). \]
So, the roots of the characteristic polynomial, and hence the eigenvalues are λ = a, d, f ,
which are the diagonal entries of A.
Once we have found the eigenvalues of a matrix, the eigenvectors are obtained by
solving the homogeneous system (λI − A)x = 0.
For λ = 0:
> Solve(0*diag(3)-A)
x1 + x2 = 0
x3 = 0
0 = 0
So the nonzero vectors of the form \( \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} -s \\ s \\ 0 \end{pmatrix} \), for any s ∈ R, s ≠ 0, are the eigenvectors associated to 0.
For λ = 2:
> Solve(2*diag(3)-A)
x1 - 1*x2 = 0
0 = 0
0 = 0
So the nonzero vectors of the form \( \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} t \\ t \\ s \end{pmatrix} \), for any s, t ∈ R not both zero, are the eigenvectors associated to 2.
2.
\[ A = \begin{pmatrix} 2 & 3 & -4 & 6 & -3 & -2 & -6 \\ 13 & -8 & 6 & 6 & -9 & -12 & -10 \\ 3 & -8 & 6 & -4 & 1 & -2 & 0 \\ -5 & 0 & -1 & -5 & 5 & 6 & 4 \\ -7 & 7 & -8 & 0 & 2 & 8 & 4 \\ 2 & -7 & 7 & -4 & 2 & -3 & 0 \\ -3 & 3 & -4 & 0 & 3 & 4 & -1 \end{pmatrix} \]
> A <- matrix(c(2,3,-4,6,-3,-2,-6,13,-8,6,6,-9,-12,-10,3,-8,6,-4,1,-2,0,
-5,0,-1,-5,5,6,4,-7,7,-8,0,2,8,4,2,-7,7,-4,2,-3,0,-3,3,-4,0,3,4,-1),7,7,T)
> charpoly(A)
[1] 1 7 -8 -158 -419 -425 -150 0
> p <- polynomial(rev(charpoly(A)))
> p
-150*x - 425*x^2 - 419*x^3 - 158*x^4 - 8*x^5 + 7*x^6 + x^7
> solve(p)
[1] -5.0000000 -3.0000000 -2.0000000 -1.0000001 -0.9999999 0.0000000
5.0000000
That is, the characteristic polynomial of A is x(x − 5)(x + 5)(x + 3)(x + 2)(x + 1)², so the eigenvalues are 5, −5, −3, −2, −1 (with algebraic multiplicity 2) and 0.
For λ = 0:
> Solve(0*diag(7)-A)
x1 = 0
x2 = 0
x3 - 1*x6 = 0
x4 - 1*x6 = 0
x5 = 0
x7 = 0
0 = 0
We will just pick a particular solution (0, 0, 1, 1, 0, 1, 0)ᵀ to be the eigenvector associated to 0.
For λ = 5:
> Solve(5*diag(7)-A)
x1 + x6 = 0
x2 + x6 = 0
x3 - 1*x6 = 0
x4 - 1*x6 = 0
x5 = 0
x7 = 0
0 = 0
So, an associated eigenvector is (−1, −1, 1, 1, 0, 1, 0)ᵀ.
For λ = −1:
> Solve(-1*diag(7)-A)
x1 - 1*x6 - 1*x7 = 0
x2 - 1*x6 = 0
x3 - 1*x6 = 0
x4 - 1*x7 = 0
x5 - 1*x7 = 0
0 = 0
0 = 0
Observe that here we need 2 parameters in the general solution. Letting x6 = s, x7 = t, we can get 2 (independent) solutions (1, 1, 1, 0, 0, 1, 0)ᵀ and (1, 0, 0, 1, 1, 0, 1)ᵀ.
For λ = −2:
> Solve(-2*diag(7)-A)
x1 - 1*x7 = 0
x2 - 1*x7 = 0
x3 - 1*x7 = 0
x4 - 1*x7 = 0
x5 - 1*x7 = 0
x6 = 0
0 = 0
So, an associated eigenvector is (1, 1, 1, 1, 1, 0, 1)ᵀ.
For λ = −3:
> Solve(-3*diag(7)-A)
x1 - 1*x7 = 0
x2 = 0
x3 = 0
x4 = 0
x5 + x7 = 0
x6 - 1*x7 = 0
0 = 0
So, an associated eigenvector is (1, 0, 0, 0, −1, 1, 1)ᵀ.
For λ = −5:
> Solve(-5*diag(7)-A)
x1 = 0
x2 - 1*x6 = 0
x3 - 1*x6 = 0
x4 = 0
x5 + x6 = 0
x7 = 0
0 = 0
So, an associated eigenvector is (0, 1, 1, 0, −1, 1, 0)ᵀ.
We may use the function eigen in R to obtain the eigenvalues and eigenvectors of a
matrix A.
Example. 1. Let A = \( \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 2 \end{pmatrix} \).
> A <- matrix(c(1,1,0,1,1,0,0,0,2),3,3,T)
> eigen(A)
eigen() decomposition
$values
[1] 2.000000e+00 2.000000e+00 1.110223e-15
$vectors
[,1] [,2] [,3]
[1,] 0 0.7071068 0.7071068
[2,] 0 0.7071068 -0.7071068
[3,] 1 0.0000000 0.0000000
The entries of $values are the eigenvalues, and the i-th column of the matrix $vectors is an eigenvector associated to the i-th entry of $values. To verify this, let lambda be the vector containing the eigenvalues
the vector containing the eigenvalues
> lambda <- eigen(A)$values
and P to be the matrix eigen(A)$vectors
> P <- eigen(A)$vectors
Remark. (i) Observe that, due to rounding error, the last 2 differences are not exactly 0. However, they are very small (on the order of 10⁻¹⁶). Note also that the third eigenvalue should be 0. One must exercise discretion when interpreting the output.
(ii) Observe that R will always choose the eigenvector v (written as column vector)
such that vT v = 1 (such a vector is called a unit vector). For example,
referring to the previous example where we found the eigenvectors by solving (λI − A)x = 0, R chose s = 1/√2 for the eigenvector associated to eigenvalue 0. Let us verify this
> crossprod(P[,1],P[,1])
[,1]
[1,] 1
> crossprod(P[,2],P[,2])
[,1]
[1,] 1
> crossprod(P[,3],P[,3])
[,1]
[1,] 1
(iii) Finally observe that since there are 2 parameters in the general solution to find the eigenvectors for eigenvalue 2, R will "separate" them, that is, it will choose s = 0, t = 1/√2 for one of the eigenvectors, and s = 1, t = 0 for the other eigenvector associated to eigenvalue 2.
2. Let
\[ A = \begin{pmatrix} 2 & 3 & -4 & 6 & -3 & -2 & -6 \\ 13 & -8 & 6 & 6 & -9 & -12 & -10 \\ 3 & -8 & 6 & -4 & 1 & -2 & 0 \\ -5 & 0 & -1 & -5 & 5 & 6 & 4 \\ -7 & 7 & -8 & 0 & 2 & 8 & 4 \\ 2 & -7 & 7 & -4 & 2 & -3 & 0 \\ -3 & 3 & -4 & 0 & 3 & 4 & -1 \end{pmatrix} \]
> A <- matrix(c(2,3,-4,6,-3,-2,-6,13,-8,6,6,-9,-12,-10,3,-8,6,-4,1,-2,0,
-5,0,-1,-5,5,6,4,-7,7,-8,0,2,8,4,2,-7,7,-4,2,-3,0,-3,3,-4,0,3,4,-1),7,7,T)
> eigen(A)
eigen() decomposition
$values
[1] 5.000000e+00 -5.000000e+00 -3.000000e+00 -2.000000e+00 -1.000000e+00
[6] -1.000000e+00 1.493133e-15
$vectors
[,1] [,2] [,3] [,4] [,5]
[1,] 4.472136e-01 8.927420e-16 5.000000e-01 4.082483e-01 5.000000e-01
[2,] 4.472136e-01 -5.000000e-01 -2.187319e-15 4.082483e-01 1.271365e-15
[3,] -4.472136e-01 -5.000000e-01 -1.885620e-15 4.082483e-01 1.252999e-15
[4,] -4.472136e-01 3.246335e-16 5.421158e-16 4.082483e-01 5.000000e-01
[5,] -8.448406e-16 5.000000e-01 -5.000000e-01 4.082483e-01 5.000000e-01
[6,] -4.472136e-01 -5.000000e-01 5.000000e-01 2.051160e-15 1.767114e-16
[7,] -5.011742e-16 1.062829e-15 5.000000e-01 4.082483e-01 5.000000e-01
[,6] [,7]
[1,] -0.5718368 3.711441e-16
[2,] -0.1115115 -3.976544e-17
[3,] -0.1115115 -5.773503e-01
[4,] -0.4603253 -5.773503e-01
[5,] -0.4603253 6.362470e-16
[6,] -0.1115115 -5.773503e-01
[7,] -0.4603253 3.582023e-16
We will just check that the 5-th and 6-th columns of P are the eigenvectors associated to eigenvalue −1. The rest of the verification of the eigenvalue-eigenvector pairs is
left to the reader. Also, it is a good exercise to interpret the data and decide which
of the entries are the result of rounding error, and what the correct values should
be.
> lambda <- eigen(A)$values; P <- eigen(A)$vectors
> A%*%P[,5]-lambda[5]*P[,5]
[,1]
[1,] 2.220446e-16
[2,] 3.047722e-15
[3,] -9.878031e-16
[4,] -1.998401e-15
[5,] -1.776357e-15
[6,] -1.313782e-16
[7,] -1.221245e-15
> A%*%P[,6]-lambda[6]*P[,6]
[,1]
[1,] -5.551115e-16
[2,] -1.706968e-15
[3,] 1.595946e-15
[4,] 2.109424e-15
[5,] 6.106227e-16
[6,] 1.026956e-15
[7,] 9.992007e-16
2.8.2 Diagonalization
A square matrix A is said to be diagonalizable if there exists an invertible matrix P such
that P−1 AP = D is a diagonal matrix.
Remark. The statement above is equivalent to being able to express A as A = PDP−1
for some invertible P and diagonal matrix D.
Example. 1. Any square zero matrix is diagonalizable, 0 = I0I−1 .
2. Any diagonal matrix D is diagonalizable, D = IDI−1 .
3. \( A = \begin{pmatrix} 3 & 1 & -1 \\ 1 & 3 & -1 \\ 0 & 0 & 2 \end{pmatrix} \) is diagonalizable, with
\[ A = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix} \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix} \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix}^{-1}. \]
The invertible matrix P that diagonalizes A has the form P = (u₁ u₂ ··· uₙ), where the uᵢ are eigenvectors of A, and the diagonal matrix is D = diag(λ₁, λ₂, ..., λₙ), where λᵢ is the eigenvalue associated to the eigenvector uᵢ. In other words, the i-th column of the invertible matrix P is an eigenvector of A whose eigenvalue is the i-th diagonal entry of D.
We have already seen this in a previous example.
Example. Let A = \( \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 2 \end{pmatrix} \).
> A <- matrix(c(1,1,0,1,1,0,0,0,2),3,3,T)
> eigen(A)
eigen() decomposition
$values
[1] 2.000000e+00 2.000000e+00 1.110223e-15
$vectors
[,1] [,2] [,3]
[1,] 0 0.7071068 0.7071068
[2,] 0 0.7071068 -0.7071068
[3,] 1 0.0000000 0.0000000
> P <- eigen(A)$vectors
> D <- diag(eigen(A)$values)
> D
[,1] [,2] [,3]
[1,] 2 0 0.000000e+00
[2,] 0 2 0.000000e+00
[3,] 0 0 1.110223e-15
Let us fix the last eigenvalue to its correct value 0,
> D[3,3] <- 0
> D
[,1] [,2] [,3]
[1,] 2 0 0
[2,] 0 2 0
[3,] 0 0 0
Now let us verify that A = PDP−1
> P%*%D%*%solve(P)
[,1] [,2] [,3]
[1,] 1 1 0
[2,] 1 1 0
[3,] 0 0 2
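One common use of a diagonalization A = PDP⁻¹ is computing matrix powers, since Aᵏ = PDᵏP⁻¹ and, for a diagonal matrix D, the entrywise power D^k coincides with the matrix power. A quick sketch continuing with A, P and D from above:
> round(P %*% D^3 %*% solve(P))   #A^3 = P D^3 P^(-1); round() removes rounding error
     [,1] [,2] [,3]
[1,]    4    4    0
[2,]    4    4    0
[3,]    0    0    8
> A %*% A %*% A                   #same result by direct multiplication
     [,1] [,2] [,3]
[1,]    4    4    0
[2,]    4    4    0
[3,]    0    0    8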
$vectors
[,1] [,2] [,3]
[1,] 0.7071068 -7.071068e-01 -0.7071068
[2,] 0.7071068 -7.071068e-01 0.7071068
[3,] 0.0000000 6.280370e-16 0.0000000
Observe that in this case, if we account for rounding error and identify the (3, 2)-entry
of the $vectors as 0, the second column of $vectors is equal to the negative of the first.
So, the matrix $vectors is not invertible,
> P <- eigen(A)$vectors
> solve(P)
Error in solve.default(P) :
system is computationally singular: reciprocal condition number
= 2.22045e-16
$vectors
[,1] [,2] [,3]
[1,] 0.7071068+0.0000000i 0.7071068+0.0000000i 0.7071068+0i
[2,] 0.0000000-0.7071068i 0.0000000+0.7071068i 0.0000000+0i
[3,] 0.0000000+0.0000000i 0.0000000+0.0000000i 0.7071068+0i
2. Consider the example given above, A = \( \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 2 \end{pmatrix} \). We have computed that the eigenvalues are λ = 0 and 2. Now find the eigenvectors.
For λ = 0,
> Solve(-A)
x1 + x2 = 0
x3 = 0
0 = 0
For λ = 2,
> Solve(2*diag(3)-A)
x1 - 1*x2 = 0
x3 = 0
0 = 0
This shows that we only have a 2-parameter family of eigenvectors, \( s\begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix} \) and \( t\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \), s, t ∈ R\{0} (associated to 0 and 2, respectively). Since A has order 3, it needs a 3-parameter family of eigenvectors to be diagonalizable. Hence, A is not diagonalizable, as shown above.
3. Let A = \( \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \).
> A <- matrix(c(1,1,0,1),2,2,T);
> solve(polynomial(rev(charpoly(A))))
[1] 1 1
A has only one eigenvalue, λ = 1. Now compute the eigenvectors.
> Solve(diag(2)-A)
x2 = 0
0 = 0
We obtain a 1-parameter family of eigenvectors \( s\begin{pmatrix} 1 \\ 0 \end{pmatrix} \), s ∈ R\{0}. Since A has order 2, it does not have a large enough family of eigenvectors to be diagonalizable. Indeed,
> eigen(A)
eigen() decomposition
$values
[1] 1 1
$vectors
[,1] [,2]
[1,] 1 -1.000000e+00
[2,] 0 2.220446e-16
Taking into account that the (2, 2)-entry of $vectors should be 0, the second column is equal to the negative of the first. Hence, the matrix $vectors is not invertible.
> P <- eigen(A)$vectors
> solve(P)
Error in solve.default(P) :
system is computationally singular: reciprocal condition number
= 1.11022e-16
A = PDPT
$vectors
[,1] [,2] [,3]
[1,] -0.5773503 -0.3160858 0.7528323
[2,] -0.5773503 -0.4939290 -0.6501544
[3,] -0.5773503 0.8100148 -0.1026779
> P <- eigen(A)$vectors
> crossprod(P,P)
[,1] [,2] [,3]
[1,] 1.000000e+00 -1.110223e-16 -2.567391e-16
[2,] -1.110223e-16 1.000000e+00 -4.163336e-17
[3,] -2.567391e-16 -4.163336e-17 1.000000e+00
Observe that the diagonal entries of PᵀP (= crossprod(P,P)) are 1, and the off-diagonal entries are supposed to be 0. Hence, PᵀP = I is the identity matrix, verifying that P is indeed orthogonal. Let us check that A = PDPᵀ. Recall that the transpose of P in R is t(P).
> D <- diag(eigen(A)$values)
> P%*%D%*%t(P)
[,1] [,2] [,3]
[1,] 5 1 1
[2,] 1 5 1
[3,] 1 1 5
2. Consider A = \( \begin{pmatrix} 1 & -1 & 3 \\ -1 & 1 & 5 \\ 3 & 5 & 1 \end{pmatrix} \).
> A <- matrix(c(1,-1,3,-1,1,5,3,5,1),3,3,T); eigen(A,symmetric = TRUE)
eigen() decomposition
$values
[1] 6.429008 1.876374 -5.305382
$vectors
[,1] [,2] [,3]
[1,] 0.2893960 -0.86088999 -0.4184715
[2,] 0.6190719 0.50177055 -0.6041327
[3,] 0.7300685 -0.08423029 0.6781632
(i) (Associative law) For matrices A = (aij)m×p, B = (bij)p×q, and C = (cij)q×n, (AB)C = A(BC).
(ii) (Left distributive law) For matrices A = (aij)m×p, B = (bij)p×n, and C = (cij)p×n,
A(B + C) = AB + AC.
(iii) (Right distributive law) For matrices A = (aij )m×p , B = (bij )m×p , and C = (cij )p×n ,
(A + B)C = AC + BC.
(iv) (Commute with scalar multiplication) For any real number c ∈ R, and matrices
A = (aij )m×p , B = (bij )p×n , c(AB) = (cA)B = A(cB).
(v) (Multiplicative identity) For any m × n matrix A, Im A = A = AIn .
(vi) (Zero divisor) There exists A ≠ 0m×p and B ≠ 0p×n such that AB = 0m×n .
(vii) (Zero matrix) For any m × n matrix A, A0n×p = 0m×p and 0p×m A = 0p×n .
Proof. We will check that the corresponding entries on each side agrees. The check for
the size of matrices agree is trivial and is left to the reader.
(i) The (i, j)-entry of (AB)C is
\[ \sum_{l=1}^{q} \left( \sum_{k=1}^{p} a_{ik} b_{kl} \right) c_{lj} = \sum_{l=1}^{q} \sum_{k=1}^{p} a_{ik} b_{kl} c_{lj}. \]
Since both sums have finitely many terms, the sums commute, and thus the (i, j)-entry of (AB)C is equal to the (i, j)-entry of A(BC).
(ii) The (i, j)-entry of A(B + C) is \( \sum_{k=1}^{p} a_{ik}(b_{kj} + c_{kj}) = \sum_{k=1}^{p} (a_{ik} b_{kj} + a_{ik} c_{kj}) = \sum_{k=1}^{p} a_{ik} b_{kj} + \sum_{k=1}^{p} a_{ik} c_{kj} \), which is the (i, j)-entry of AB + AC.
(v) The (i, j)-entry of Im A is δi1 a1j + · · · + δii aij + · · · + δim amj = 0a1j + · · · + 1aij + · · · + 0amj = aij , which is the (i, j)-entry of A.
Finally, for the transpose, the (i, j)-entry of BᵀAᵀ is
\[ \sum_{k=1}^{p} b_{ki} a_{jk} = \sum_{k=1}^{p} a_{jk} b_{ki}, \]
which is exactly the (i, j)-entry of (AB)ᵀ.
x1 = 1 − 2t, x2 = 1 + 3t, x3 = t, t ∈ R.
The next 2 theorems provides us with an algorithm to find solutions of a linear system.
Theorem. Two linear systems have the same solution set if their augmented matrices have the same RREF.
This means that by reading off the solutions from the RREF of the augmented matrix,
we are able to obtain the solutions for the linear system.
x + 2y − z = 1
x + 5y + z = 3
x + y + z = 2
Readers can check that indeed x = 9/8, y = 1/4, z = 5/8 is the unique solution to
the linear system.
3x + 6y = −3
3x + 5y + z = 2
x + 2y = −1
The RREF of the augmented matrix is
1 0 2 9
0 1 −1 −5
0 0 0 0
Readers can check that
x = 9 − 2t, y = t − 5, z = t, t ∈ R
is a general solution to the linear system.
3. Consider the following linear system
x + y = 2
x − y = 0
2x + y = 1
The RREF of the augmented matrix is
1 0 0
0 1 0
0 0 1
that is, the system is inconsistent.
3. For the second type of elementary row operation, the row we put first is the row we are performing the operation upon,
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \xrightarrow{R_1 + 2R_2} \begin{pmatrix} 1 & 2 & 0 \\ 0 & 1 & 0 \end{pmatrix} \]
instead of
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \xrightarrow{2R_2 + R_1} \begin{pmatrix} 1 & 0 & 0 \\ 1 & 2 & 0 \end{pmatrix}. \]
In fact, 2R_2 + R_1 is not an elementary row operation, but a combination of 2 operations, 2R_2 then R_2 + R_1. Here's another example,
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \xrightarrow{R_1 + R_2} \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \end{pmatrix} \]
and
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \xrightarrow{R_2 + R_1} \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix}. \]
Two augmented matrices are row equivalent if one can be obtained from the other by
elementary row operations.
Theorem. Two augmented matrices are row equivalent if and only if they have the same RREF.
Observe that from the RREF we are able to uniquely obtain the solution set, and from
the solution set, if we know the number of equations the linear system has, we are able
to reconstruct the RREF uniquely. Hence, the previous theorem gives us the following
statement.
Theorem. Two linear systems have the same solution set if their augmented matrices
are row equivalent.
Note that every augmented matrix is row equivalent to a unique RREF, but it can
be row equivalent to different REFs.
Example. From the REF or RREF, we are able to read off a general solution.
1.
1 1 1 1
0 1 1 0
This is in REF. We let the third variable be the parameter s, then we get y = −s from
the second row, and x = 1 − s − (−s) = 1 from the first row. So a general solution is
x = 1, y = −s, z = s.
2.
1 0 0 1
0 1 −1 0
This is in RREF. General solution: x = 1, y = s, z = s.
Example. We will now reconstruct the RREF of the augmented matrix of a linear system
given a general solution.
3. x = 3, y = 2, z = 1, 3 equations. RREF:
1 0 0 3
0 1 0 2 .
0 0 1 1
2.9.5 Gaussian Elimination and Gauss-Jordan Elimination
Step 1: Locate the leftmost column that does not consist entirely of zeros.
Step 2: Interchange the top row with another row, if necessary, to bring a nonzero entry to the top of the column found in Step 1.
Step 3: For each row below the top row, add a suitable multiple of the top row to it so that
the entry below the leading entry of the top row becomes zero.
Step 4: Now cover the top row in the augmented matrix and begin again with Step 1
applied to the submatrix that remains. Continue this way until the entire matrix
is in row-echelon form.
Once the above process is completed, we will end up with a REF. The following steps
continue the process to reduce it to its RREF.
Step 5: Multiply a suitable constant to each row so that all the leading entries become 1.
Step 6: Beginning with the last nonzero row and working upward, add suitable multiples
of each row to the rows above to introduce zeros above the leading entries.
Remark. The Gaussian elimination and Gauss-Jordan elimination may not be the fastest
way to obtain the RREF of an augmented matrix. We do not have to follow the algorithm
strictly when reducing the augmented matrix.
Algorithm
1. Express the given linear system as an augmented matrix. Make sure that the linear system is in standard form.
4. Use back substitution (if the augmented matrix is in REF) to obtain a general
solution, or read off the general solution (if the augmented matrix is in RREF).
Example.
\[ \begin{pmatrix} 1 & 1 & 2 & 4 \\ -1 & 2 & -1 & 1 \\ 2 & 0 & 3 & -2 \end{pmatrix} \xrightarrow[R_3 - 2R_1]{R_2 + R_1} \begin{pmatrix} 1 & 1 & 2 & 4 \\ 0 & 3 & 1 & 5 \\ 0 & -2 & -1 & -10 \end{pmatrix} \xrightarrow{R_3 + \frac{2}{3}R_2} \begin{pmatrix} 1 & 1 & 2 & 4 \\ 0 & 3 & 1 & 5 \\ 0 & 0 & -1/3 & -20/3 \end{pmatrix} \]
Indeed, the system is consistent, with unique solution x = −31, y = −5, z = 20.
Remark. We may include verbose = TRUE as an argument of the function Solve to
show the steps of the Gaussian elimination algorithm. For example, solve
x1 + x2 + 2x3 = 4
−x1 + 2x2 − x3 = 1
2x1 + 3x2 = 2
Initial matrix:
[,1] [,2] [,3] [,4]
[1,] 1 1 2 4
[2,] -1 2 -1 1
[3,] 2 0 3 2
row: 1
row: 2
row: 3
multiply row 3 by 4
[,1] [,2] [,3] [,4]
[1,] 1 0 3/2 1
[2,] 0 1 1/4 1
[3,] 0 0 1 8
Two vectors u, v ∈ Rⁿ are said to be orthogonal if u · v = 0.
• Case 1: Either u = 0 or v = 0.
• Case 2: Otherwise,
\[ \cos(\theta) = \frac{u \cdot v}{\|u\|\|v\|} = 0 \]
tells us that θ = π/2, that is, u and v are perpendicular.
That is, u, v are orthogonal if and only if either one of them is the zero vector or they
are perpendicular to each other.
Example.
\[ \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix} = 0, \quad \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = 0, \quad \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = 0. \]
Exercise: Suppose u, v are orthogonal. Show that for any s, t ∈ R scalars, su, tv
are also orthogonal.
4. \( S = \left\{ \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{pmatrix}, \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\} \) is an orthonormal set.
5. \( S = \left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \right\} \) is an orthogonal set, but it cannot be normalized to an orthonormal set since it contains the zero vector.
2.9.7 Gram-Schmidt Process
Theorem (Gram-Schmidt Process). Let S = {u₁, u₂, ..., u_k} be a linearly independent set. Let
\[ \begin{aligned}
v_1 &= u_1 \\
v_2 &= u_2 - \frac{v_1 \cdot u_2}{\|v_1\|^2} v_1 \\
v_3 &= u_3 - \frac{v_1 \cdot u_3}{\|v_1\|^2} v_1 - \frac{v_2 \cdot u_3}{\|v_2\|^2} v_2 \\
&\;\;\vdots \\
v_i &= u_i - \frac{v_1 \cdot u_i}{\|v_1\|^2} v_1 - \frac{v_2 \cdot u_i}{\|v_2\|^2} v_2 - \cdots - \frac{v_{i-1} \cdot u_i}{\|v_{i-1}\|^2} v_{i-1} \\
&\;\;\vdots \\
v_k &= u_k - \frac{v_1 \cdot u_k}{\|v_1\|^2} v_1 - \frac{v_2 \cdot u_k}{\|v_2\|^2} v_2 - \cdots - \frac{v_{k-1} \cdot u_k}{\|v_{k-1}\|^2} v_{k-1}.
\end{aligned} \]
Example. Apply the Gram-Schmidt process to u₁ = (1, 2, 1)ᵀ, u₂ = (1, 1, 1)ᵀ, u₃ = (1, 1, 2)ᵀ:
\[ v_1 = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} \]
\[ v_2 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - \frac{1 + 2 + 1}{1^2 + 2^2 + 1^2} \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} = \frac{1}{3} \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix}, \quad \text{let } v_2 = \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix} \text{ instead} \]
\[ v_3 = \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix} - \frac{1 + 2 + 2}{1^2 + 2^2 + 1^2} \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} - \frac{1 - 1 + 2}{1^2 + (-1)^2 + 1^2} \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}, \quad \text{let } v_3 = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix} \text{ instead.} \]
Why are we allowed to take v₂ and v₃ to be a multiple of the original vector found?
So \( \left\{ \frac{1}{\sqrt{6}} \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix}, \frac{1}{\sqrt{3}} \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix}, \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix} \right\} \) is an orthonormal set.
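The process is straightforward to code in R; below is a minimal sketch (the helper gram_schmidt is ours, and the columns of the input are assumed to be linearly independent). The matlib package also provides a GramSchmidt function that does this, including the normalization.
> gram_schmidt <- function(U) {      #columns of U are u1, ..., uk
+   V <- U
+   for (i in seq_len(ncol(U))) {
+     v <- U[,i]
+     if (i > 1) for (j in 1:(i-1)) v <- v - sum(V[,j]*U[,i])/sum(V[,j]*V[,j]) * V[,j]
+     V[,i] <- v                     #v_i = u_i minus its projections onto v_1, ..., v_{i-1}
+   }
+   V                                #columns are orthogonal (not yet normalized)
+ }
> U <- cbind(c(1,2,1), c(1,1,1), c(1,1,2))
> V <- gram_schmidt(U)               #columns (1,2,1), (1,-1,1)/3, (-1,0,1)/2, as computed above
> Q <- sweep(V, 2, sqrt(colSums(V^2)), "/")   #divide each column by its norm: an orthonormal set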
2.9.8 Least square solution
Let A be an m × n matrix and b ∈ Rᵐ. A vector u ∈ Rⁿ is a least square solution to Ax = b if for every vector v ∈ Rⁿ, ∥Au − b∥ ≤ ∥Av − b∥.
It is the generalization of the distance of a vector from the origin, using the Pythagoras
theorem.
2.9.9 Diagonalization
A square matrix A is said to be diagonalizable if there exists an invertible matrix P such
that P−1 AP = D is a diagonal matrix.
Suppose A = PDP⁻¹, for some matrix D and invertible matrix P. Then the characteristic polynomial of A is
\[ \det(xI - A) = \det(P(xI - D)P^{-1}) = \det(P)\det(xI - D)\det(P)^{-1} = \det(xI - D), \]
the characteristic polynomial of D.
Theorem. Suppose A is a square matrix such that its characteristic polynomial can be written as a product of linear factors,
\[ \det(xI - A) = (x - \lambda_1)^{r_{\lambda_1}} (x - \lambda_2)^{r_{\lambda_2}} \cdots (x - \lambda_k)^{r_{\lambda_k}}, \]
where r_{λi} is the algebraic multiplicity of λi, for i = 1, ..., k, and the eigenvalues are distinct, λi ≠ λj for all i ≠ j. Then A is diagonalizable if and only if for each eigenvalue of A, its geometric multiplicity is equal to its algebraic multiplicity,
\[ \dim(E_{\lambda_i}) = r_{\lambda_i}. \]
(i) (Not enough eigenvalues) The characteristic polynomial of A does not split into (real) linear factors.
(ii) (Not enough eigenvectors) There is an eigenvalue of A whose geometric multiplicity is strictly less than its algebraic multiplicity, dim(E_{λi}) < r_{λi}.
For in either case, there will not be enough linearly independent eigenvectors to form a basis for Rⁿ.
Corollary. If a square matrix A of order n has n distinct eigenvalues, then A is diagonalizable.
Proof. If A has n distinct eigenvalues, then the algebraic multiplicity of each eigenvalue must be 1. Thus
\[ 1 \le \dim(E_\lambda) \le r_\lambda = 1 \;\Rightarrow\; \dim(E_\lambda) = 1 = r_\lambda \]
for every eigenvalue λ of A. Therefore A is diagonalizable.
Algorithm to diagonalization
(i) Compute the characteristic polynomial of A
det(xI − A).
(ii) Factorize the characteristic polynomial into linear factors (if possible),
\[ \det(xI - A) = (x - \lambda_1)^{r_{\lambda_1}} (x - \lambda_2)^{r_{\lambda_2}} \cdots (x - \lambda_k)^{r_{\lambda_k}}, \]
where r_{λi} is the algebraic multiplicity of λi, for i = 1, ..., k, and the eigenvalues are distinct, λi ≠ λj for all i ≠ j. For each eigenvalue λi of A, i = 1, ..., k, find a basis for the eigenspace, that is, find the solution space of the following linear system,
\[ (\lambda_i I - A)x = 0. \]
If there is a i such that dim(Eλi ) < rλi , that is, if the number of parameters in the
solution space of the above linear system is not equal to the algebraic multiplicity,
then A is not diagonalizable.
(iii) Otherwise, find a basis S_{λi} of the eigenspace E_{λi} for each eigenvalue λi, i = 1, ..., k. Necessarily |S_{λi}| = r_{λi} for all i = 1, ..., k. Let \( S = \bigcup_{i=1}^{k} S_{\lambda_i} \). Then
\[ |S| = \sum_{i=1}^{k} |S_{\lambda_i}| = \sum_{i=1}^{k} r_{\lambda_i} = n, \]
so S = {u₁, u₂, ..., uₙ} consists of n linearly independent eigenvectors of A.
(iv) Let
\[ P = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix}, \quad \text{and} \quad D = \operatorname{diag}(\mu_1, \mu_2, ..., \mu_n) = \begin{pmatrix} \mu_1 & 0 & \cdots & 0 \\ 0 & \mu_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mu_n \end{pmatrix}, \]
where μi is the eigenvalue associated to the eigenvector ui. Then
\[ A = PDP^{-1}. \]
Example. 1. \( A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 2 \end{pmatrix} \). It has eigenvalues 0 and 2 with multiplicity r₀ = 1 and r₂ = 2, respectively. Also, \( \left\{ \begin{pmatrix} -1 \\ 1 \\ 0 \end{pmatrix} \right\} \) is a basis for E₀ and \( \left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\} \) is a basis for E₂. Then dim(E₀) = 1 = r₀ and dim(E₂) = 2 = r₂. Hence, A is diagonalizable, with
\[ A = PDP^{-1} = \begin{pmatrix} -1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix} \begin{pmatrix} -1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}^{-1}. \]
2. \( A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 2 & 2 \\ 0 & 0 & 3 \end{pmatrix} \). A is a triangular matrix, hence the diagonal entries, 1, 2, 3, are the eigenvalues, each with algebraic multiplicity 1. Therefore A is diagonalizable. We will need to find a basis for each of the eigenspaces.
λ = 1: \( \begin{pmatrix} 1-1 & -1 & -1 \\ 0 & 1-2 & -2 \\ 0 & 0 & 1-3 \end{pmatrix} = \begin{pmatrix} 0 & -1 & -1 \\ 0 & -1 & -2 \\ 0 & 0 & -2 \end{pmatrix} \longrightarrow \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \). So \( \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} \) is a basis for E₁.
λ = 2: \( \begin{pmatrix} 2-1 & -1 & -1 \\ 0 & 2-2 & -2 \\ 0 & 0 & 2-3 \end{pmatrix} = \begin{pmatrix} 1 & -1 & -1 \\ 0 & 0 & -2 \\ 0 & 0 & -1 \end{pmatrix} \longrightarrow \begin{pmatrix} 1 & -1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \). So \( \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \) is a basis for E₂.
λ = 3: \( \begin{pmatrix} 3-1 & -1 & -1 \\ 0 & 3-2 & -2 \\ 0 & 0 & 3-3 \end{pmatrix} = \begin{pmatrix} 2 & -1 & -1 \\ 0 & 1 & -2 \\ 0 & 0 & 0 \end{pmatrix} \longrightarrow \begin{pmatrix} 2 & 0 & -3 \\ 0 & 1 & -2 \\ 0 & 0 & 0 \end{pmatrix} \). So \( \begin{pmatrix} 3 \\ 4 \\ 2 \end{pmatrix} \) is a basis for E₃.
\[ \Rightarrow A = \begin{pmatrix} 1 & 1 & 3 \\ 0 & 1 & 4 \\ 0 & 0 & 2 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix} \begin{pmatrix} 1 & 1 & 3 \\ 0 & 1 & 4 \\ 0 & 0 & 2 \end{pmatrix}^{-1}. \]
3. \( A = \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix} \). λ = 1 is the only eigenvalue, with algebraic multiplicity r₁ = 2.
λ = 1: \( I - A = \begin{pmatrix} 0 & -2 \\ 0 & 0 \end{pmatrix} \). There is only one non-pivot column, hence dim(E₁) = 1 < 2 = r₁. This shows that A is not diagonalizable.
Theorem. Let A be a square matrix of order n. The following statements are equivalent.
(i) A is orthogonal, that is, AᵀA = I = AAᵀ.
(ii) The columns of A form an orthonormal set.
(iii) The rows of A form an orthonormal set.
Proof. Write
\[ A = \begin{pmatrix} c_1 & c_2 & \cdots & c_n \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{pmatrix}, \]
where for i = 1, ..., n, cᵢ and rᵢ are the columns and rows of A, respectively. Then
\[ A^T A = \begin{pmatrix} c_1^T \\ c_2^T \\ \vdots \\ c_n^T \end{pmatrix} \begin{pmatrix} c_1 & c_2 & \cdots & c_n \end{pmatrix} = \begin{pmatrix} c_1^T c_1 & c_1^T c_2 & \cdots & c_1^T c_n \\ c_2^T c_1 & c_2^T c_2 & \cdots & c_2^T c_n \\ \vdots & \vdots & \ddots & \vdots \\ c_n^T c_1 & c_n^T c_2 & \cdots & c_n^T c_n \end{pmatrix} = \begin{pmatrix} c_1 \cdot c_1 & c_1 \cdot c_2 & \cdots & c_1 \cdot c_n \\ c_2 \cdot c_1 & c_2 \cdot c_2 & \cdots & c_2 \cdot c_n \\ \vdots & \vdots & \ddots & \vdots \\ c_n \cdot c_1 & c_n \cdot c_2 & \cdots & c_n \cdot c_n \end{pmatrix}, \]
and
\[ AA^T = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{pmatrix} \begin{pmatrix} r_1^T & r_2^T & \cdots & r_n^T \end{pmatrix} = \begin{pmatrix} r_1 r_1^T & r_1 r_2^T & \cdots & r_1 r_n^T \\ r_2 r_1^T & r_2 r_2^T & \cdots & r_2 r_n^T \\ \vdots & \vdots & \ddots & \vdots \\ r_n r_1^T & r_n r_2^T & \cdots & r_n r_n^T \end{pmatrix} = \begin{pmatrix} r_1 \cdot r_1 & r_1 \cdot r_2 & \cdots & r_1 \cdot r_n \\ r_2 \cdot r_1 & r_2 \cdot r_2 & \cdots & r_2 \cdot r_n \\ \vdots & \vdots & \ddots & \vdots \\ r_n \cdot r_1 & r_n \cdot r_2 & \cdots & r_n \cdot r_n \end{pmatrix}. \]
So AᵀA = I if and only if \( c_i \cdot c_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \), and AAᵀ = I if and only if \( r_i \cdot r_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases} \)
Example. 1. [1/√2 −1/√2; 1/√2 1/√2] [1/√2 1/√2; −1/√2 1/√2] = [1 0; 0 1].
2. [1/√3 1/√2 1/√6; 1/√3 −1/√2 1/√6; 1/√3 0 −2/√6] [1/√3 1/√3 1/√3; 1/√2 −1/√2 0; 1/√6 1/√6 −2/√6] = [1 0 0; 0 1 0; 0 0 1].
3. [1/√2 0 1/√2; 0 1 0; −1/√2 0 1/√2] [1/√2 0 −1/√2; 0 1 0; 1/√2 0 1/√2] = [1 0 0; 0 1 0; 0 0 1].
4. [−1/√2 0 1/√2; 0 1 0; 1/√2 0 1/√2] [−1/√2 0 1/√2; 0 1 0; 1/√2 0 1/√2] = [1 0 0; 0 1 0; 0 0 1].
2. [3 0 −1; 0 2 0; −1 0 3] = [1/√2 0 1/√2; 0 1 0; −1/√2 0 1/√2] [4 0 0; 0 2 0; 0 0 2] [1/√2 0 −1/√2; 0 1 0; 1/√2 0 1/√2].
Suppose A = PDP^T for some orthogonal matrix P and diagonal matrix D. Then
A^T = (PDP^T)^T = (P^T)^T D^T P^T = PDP^T = A,
since D is diagonal, and hence symmetric. This shows that if A is orthogonally diago-
nalizable, it is symmetric. The converse is also true, but the proof is beyond the scope
of this course.
Theorem. An order n square matrix is orthogonally diagonalizable if and only if it is
symmetric.
The algorithm to orthogonally diagonalize a matrix is the same as the usual diago-
nalization, except until the last step, instead of using a basis of eigenvectors to form the
matrix P, we have to turn it into an orthonormal basis of eigenvectors, that is, to use the
Gram-Schmidt process. However, we do not need to use the Gram-Schmidt process for
the whole basis, but only among those eigenvectors that belong to the same eigenspace.
This follows from the fact that the eigenspaces are already orthogonal to each other.
Theorem. If A is orthogonally diagonalizable, then the eigenspaces are orthogonal to
each other. That is, suppose λ1 and λ2 are distinct eigenvalues of a symmetric matrix A,
λ1 ̸= λ2 . Let Eλi denote the eigenspace associated to eigenvalue λi , for i = 1, 2. Then for
any v1 ∈ Eλ1 and v2 ∈ Eλ2 , v1 · v2 = 0.
Example. A = [5 −1 −1; −1 5 −1; −1 −1 5].
det(xI − A) = det [x−5 1 1; 1 x−5 1; 1 1 x−5] = (x − 3)(x − 6)².
A has eigenvalues λ = 3, 6 with algebraic multiplicities r3 = 1, r6 = 2. Let us now compute the eigenspaces.
λ = 3: [−2 1 1; 1 −2 1; 1 1 −2] −→ [1 0 −1; 0 1 −1; 0 0 0] ⇒ {v1 = (1, 1, 1)^T} is a basis for E3.
λ = 6: [1 1 1; 1 1 1; 1 1 1] −→ [1 1 1; 0 0 0; 0 0 0] ⇒ {v2 = (−1, 1, 0)^T, v3 = (−1, 0, 1)^T} is a basis for E6.
Observe that v1 is orthogonal to v2 and v3 , but v2 and v3 are not orthogonal to each
other. So we need only to perform Gram-Schmidt process on {v2 , v3 }.
Algorithm to orthogonal diagonalization
Follow steps (i) to (iii) in the algorithm to diagonalization.
(iv) Apply the Gram-Schmidt process (with normalization) to each basis Sλi separately, to obtain an orthonormal basis {u1, u2, ..., un} of eigenvectors.
(v) Let
P = [u1 u2 · · · un] and D = diag(µ1, µ2, ..., µn),
where µj is the eigenvalue associated to uj. Then A = PDP^T.
Continuing the example above, applying Gram-Schmidt to {v2, v3} gives the orthonormal basis {(−1, 1, 0)^T/√2, (−1, −1, 2)^T/√6} of E6, and normalizing v1 gives (1, 1, 1)^T/√3. So
A = [1/√3 −1/√2 −1/√6; 1/√3 1/√2 −1/√6; 1/√3 0 2/√6] [3 0 0; 0 6 0; 0 0 6] [1/√3 −1/√2 −1/√6; 1/√3 1/√2 −1/√6; 1/√3 0 2/√6]^T.
2. Let A = [3 0 −1; 0 2 0; −1 0 3]. A is symmetric, thus orthogonally diagonalizable.
det(xI − A) = det [x−3 0 1; 0 x−2 0; 1 0 x−3] = (x − 4)(x − 2)².
λ = 4: 4I − A = [1 0 1; 0 2 0; 1 0 1] −→ [1 0 1; 0 1 0; 0 0 0] ⇒ {(−1, 0, 1)^T} is a basis for E4.
λ = 2: 2I − A = [−1 0 1; 0 0 0; 1 0 −1] −→ [1 0 −1; 0 0 0; 0 0 0] ⇒ {(0, 1, 0)^T, (1, 0, 1)^T} is a basis for E2.
Observe that in this case, the basis for E2 is already orthogonal. Hence, there is no need to perform the Gram-Schmidt process. Thus, we have
A = [−1/√2 0 1/√2; 0 1 0; 1/√2 0 1/√2] [4 0 0; 0 2 0; 0 0 2] [−1/√2 0 1/√2; 0 1 0; 1/√2 0 1/√2]^T.
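Since eigen returns an orthonormal set of eigenvectors when the input matrix is symmetric, the same numerical check works for orthogonal diagonalization; this is a sketch for verification, not a method prescribed by the notes.
# A sketch: orthogonal diagonalization of a symmetric matrix with eigen()
A <- matrix(c( 3, 0, -1,
               0, 2,  0,
              -1, 0,  3), nrow = 3, byrow = TRUE)
ev <- eigen(A)
P <- ev$vectors
D <- diag(ev$values)
t(P) %*% P                 # identity matrix, so P is orthogonal
P %*% D %*% t(P)           # reproduces A, i.e. A = P D P^T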
2.10 Exercises
1. Let A = [1 0 1; 1 2 3] and B = [1 2 1; 2 1 0; 1 1 −1].
2. Let
A = [1 1; 1 1], B = [1 1; −1 −1], C = [0 1; 0 0].
3. (a) Use the seq and rep function to define the following vectors.
(i) v1 = (2, 3, 4).
(ii) v2 = (1, 3, 5).
(iii) v3 = (1, 1, 1).
(b) Convert the vectors v1 , v2 , v3 defined in (a) to column vectors, then
(i) define A = v1 v2 v3 ,
(ii) compute v1T v2 and v2 v3T .
(c) Normalize v3 defined in (a) to a unit vector.
(a)
3x1 + 2x2 − 4x3 = 3
2x1 + 3x2 + 3x3 = 15
5x1 − 3x2 + x3 = 14
(b)
2x2 + x3 + 2x4 − x5 = 4
x2 + x4 − x5 = 3
4x1 + 6x2 + x3 + 4x4 − 3x5 = 8
2x1 + 2x2 + x4 − x5 = 2
(c)
x − 4y + 2z = −2
x + 2y − 2z = −3
x − y = 4
5. Let A = [2 −1; 2 1]. Find all the matrices B = [a b; c d] such that AB = BA.
6. (a) Solve the matrix equation [2 1 1; 0 1 2; 1 3 2] X = [2 3 4 1; 1 0 3 7; 2 1 1 2].
(b) Using the answer in (a), without the aid of R, solve [2 1 1; 0 1 2; 1 3 2] x = (1, −1, −1)^T.
(Hint: look at the columns of the matrix on the right in (a).)
7. Let
A = [1 0; −1 1], B = [2 3; −1 −2], C = [3 2; −1 2],
Compute A−1 , B−1 , C−1 . Using your answers for A−1 , B−1 , C−1 , find (ABC)−1 .
8. Let
A = [1 0; −1 1], B = [2 3; −1 −2], C = [3 2; −1 2],
(a) Compute A−1 , B−1 , C−1 . Using your answers for A−1 , B−1 , C−1 , find (ABC)−1 .
(b) Find det(A), det(B), and det(C). Use your answers for det(A), det(B), and
det(C) to find det(ABC).
(c) Use your answers in (b) to find det((ABC)−1 ).
9. Let A = [0 1 1 0; 1 −1 1 −1; 1 −1 0 1; 1 1 1 1] and b = (6, 3, 0, 1)^T.
(a) Is the linear system Ax = b consistent?
(b) Find a least squares solution to the system. Is the solution unique? Why?
10. A line
p(x) = a1 x + a0
is said to be the least squares approximating line for a given set of data points (x1, y1), (x2, y2), ..., (xm, ym) if the sum
S = (y1 − p(x1))² + (y2 − p(x2))² + · · · + (ym − p(xm))²
is minimized. Writing
x = (x1, x2, ..., xm)^T, y = (y1, y2, ..., ym)^T, and p(x) = (p(x1), p(x2), ..., p(xm))^T = (a1 x1 + a0, a1 x2 + a0, ..., a1 xm + a0)^T,
the sum above is
S = ||y − p(x)||².
Observe that if we let
N = [1 x1; 1 x2; ...; 1 xm] and a = (a0, a1)^T,
then Na = p(x). And so our aim is to find a that minimizes ||y − Na||².
R = ρ L/A,
where R is the resistance measured in Ohms Ω, L is the length of the material in
meters m, A is the cross-sectional area of the material in meter squared m2 , and ρ
is the resistivity of the material in Ohm meters Ωm. A student wants to measure
the resistivity of a certain material. Keeping the cross-sectional area constant at
0.002 m², he connected the power sources along the material at various lengths and
measured the resistance and obtained the following data.
It is known that the Ohm meter might not be calibrated. Taking that into account,
ρ
the student wants to find a linear graph R = 0.002 L + R0 from the data obtained to
compute the resistivity of the material.
ρ
(a) Relabeling, we let R = y, 0.002 = a1 and R0 = a0 . Is it possible to find a graph
y = a1 x + a0 satisfying the points?
(b) Find the least square approximating line for the data points and hence find the
resistivity of the material. Would this material make a good wire?
11. Suppose the equation governing the relation between data pairs is not known. We
may want to then find a polynomial
p(x) = a0 + a1 x + a2 x2 + · · · + an xn
of degree n, n ≤ m − 1, that best approximates the data pairs (x1 , y1 ), (x2 , y2 ), ...,
(xm , ym ). A least square approximating polynomial of degree n is such that
||y − p(x)||2
is minimized. If we write
x = (x1, x2, ..., xm)^T, y = (y1, y2, ..., ym)^T, N = [1 x1 x1² ··· x1^n; 1 x2 x2² ··· x2^n; ...; 1 xm xm² ··· xm^n], and a = (a0, a1, ..., an)^T,
then p(x) = Na, and the task is to find a such that ||y − Na||2 is minimized.
p(x) = a0 + a1 x + a2 x2 + a3 x3 + a4 x4
that is a least square approximating polynomial for the following data points
(a) Find det(A), det(B), and det(C). Use your answers for det(A), det(B), and
det(C) to find det(ABC).
(b) Use your answers in (b) to find det((ABC)−1 ).
Determine if A is invertible.
(a) A = [1 −3 3; 3 −5 3; 6 −6 4].
(b) A = [9 8 6 3; 0 −1 3 −4; 0 0 3 0; 0 0 0 2].
16. Let
A = [9 −1 −6; 0 8 0; 0 0 −3].
17. Diagonalize
A = [1 −3 0 3; 3 7 0 −3; −18 −23 9 13; 3 3 0 1].
Hence, find a matrix B such that B² = A.
A = PDPT .
where for all i = 1, ..., m, fi is a real-valued function and is called the i-th component function
of F. If x = (x1, x2, ..., xn)^T, we may express the component functions as fi(x1, x2, ..., xn), for all i = 1, ..., m.
Example. Let T : R2 → R3, T(x, y) = (2x − 3y, x, 5y). Then for any (x1, y1), (x2, y2) ∈ R2
and α, β ∈ R,
(i) T(0) = 0,
Example. 1. T : R4 → R3,
T(x1, x2, x3, x4) = (2x1 − 3x2 + x3 − 5x4, 4x1 + x2 − 2x3 + x4, 5x1 − x2 + 4x3)^T
= x1 (2, 4, 5)^T + x2 (−3, 1, −1)^T + x3 (1, −2, 4)^T + x4 (−5, 1, 0)^T
= [2 −3 1 −5; 4 1 −2 1; 5 −1 4 0] (x1, x2, x3, x4)^T.
Conversely, the linear mapping defined by the zero matrix 0(m,n) is the zero mapping,
T0 (u) = 0u = 0 ∈ Rm ,
for all u ∈ Rn .
Conversely, the linear mapping defined by the identity matrix In is the identity
mapping,
TI (u) = Iu = u,
for all u ∈ Rn .
A linear functional is a linear mapping L : Rn → R, that is, the codomain is the real
numbers (m = 1 in the definition for linear mappings).
Theorem. Every linear functional L : Rn → R can be written as
L(x) = a^T x = a1 x1 + a2 x2 + · · · + an xn
for a (column) vector a = (a1, a2, ..., an)^T ∈ Rn.
Theorem (Orthogonally Diagonalize a Quadratic Form). Every quadratic form Q : Rn → R can be expressed as
Q(x1, x2, ..., xn) = λ1 y1² + λ2 y2² + · · · + λn yn² = (y1, y2, ..., yn) diag(λ1, λ2, ..., λn) (y1, y2, ..., yn)^T
for some linear mapping y = (y1, y2, ..., yn)^T : Rn → Rn. The linear mapping is given by y = P^T x for some orthogonal matrix P.
Proof. Since A is symmetric, it is orthogonally diagonalizable. Hence, we can find λ1, λ2, ..., λn and an orthogonal matrix P such that P^T A P = diag(λ1, λ2, ..., λn). Hence, if we let y = P^T x, then x = Py, and
Q(x) = Q(Py) = (Py)^T A (Py) = y^T P^T A P y = y^T diag(λ1, λ2, ..., λn) y = λ1 y1² + λ2 y2² + · · · + λn yn².
Example. 1. Let Q(x, y) = x² − xy + y². The symmetric matrix [1 −1/2; −1/2 1] can be orthogonally diagonalized as such
[1/√2 −1/√2; 1/√2 1/√2]^T [1 −1/2; −1/2 1] [1/√2 −1/√2; 1/√2 1/√2] = [1/2 0; 0 3/2].
Hence, if we let (x′, y′)^T = [1/√2 −1/√2; 1/√2 1/√2]^T (x, y)^T = ((x + y)/√2, (−x + y)/√2)^T, then
Q(x, y) = (1/2) x′² + (3/2) y′² = (1/4)(x + y)² + (3/4)(y − x)².
2. Let Q(x, y, z) = 2xy + 2xz + 2yz, then
Q(x, y, z) = (x, y, z) [0 1 1; 1 0 1; 1 1 0] (x, y, z)^T.
The matrix [0 1 1; 1 0 1; 1 1 0] can be orthogonally diagonalized as such
[1/√2 1/√6 1/√3; −1/√2 1/√6 1/√3; 0 −2/√6 1/√3]^T [0 1 1; 1 0 1; 1 1 0] [1/√2 1/√6 1/√3; −1/√2 1/√6 1/√3; 0 −2/√6 1/√3] = [−1 0 0; 0 −1 0; 0 0 2].
So let (x′, y′, z′)^T = [1/√2 1/√6 1/√3; −1/√2 1/√6 1/√3; 0 −2/√6 1/√3]^T (x, y, z)^T = ((x − y)/√2, (x + y − 2z)/√6, (x + y + z)/√3)^T; we have
Q(x, y, z) = −x′² − y′² + 2z′² = −(1/2)(x − y)² − (1/6)(x + y − 2z)² + (2/3)(x + y + z)².
Remark. A quadratic form is an example of a bilinear form,
B(x, y) = x^T A y = Σ_{i,j} xi aij yj.
3.2 Functions in R
3.2.1 Creating Functions in R
In R everything we do involves a function, explicitly or implicitly. Functions are the fundamental building blocks of R. We can either use a primitive function, or create our own functions. For the purpose of this course, an R function has two parts, the body and the formals (or arguments). The body contains the code inside the function, and the formals contain the list of all the arguments that control how you call the function. The primitive functions call C code directly with .Primitive(); they contain no R code in their bodies. Type names(methods:::.BasicFunsList) to get a list of all primitive functions in R.
In the next example, we want to find the total number of seconds after some certain number of hours, minutes, and seconds have passed.
no of sec <- function(h,m,s){
#h is the number of hours
#m is the number of minutes
#s is the number of seconds
s + 60*(m+60*h)
}
We used comments in this case to remind us of what the arguments represent. So, for
example, 2 hours 30 minutes and 15 seconds is 9015 seconds,
> no of sec(2,30,15)
[1] 9015
In R functions, any argument can be given a default value. The function will take this
value for that argument if no input is given. For example,
no of sec <- function(s,m,h,d=0){
#d is the number of days, default is 0
#h is the number of hours
#m is the number of minutes
#s is the number of seconds
s + 60*(m+60*(h+24*d))
}
> no of sec(15,30,2)
[1] 9015
> no of sec(0,0,0,1)
[1] 86400
Note that it is a good habit to put arguments with default values at the back since the
position of the argument matters. For example,
test <- function(x=10,y){
x*y
}
> test(2)
Error in test(2) : argument "y" is missing, with no default
> test(,2)
[1] 20
vis-á-vis
test <- function(x,y=10){
x*y
}
> test(2)
[1] 20
Exercise. Using the function no of sec defined above, what happens if one of the ar-
gument is a vector? For example
> no of sec(c(1,2,3),0,0)
What happens if two or more of the arguments are vectors of different length? For
example
> no of sec(c(1,2,3),c(0,1),0)
and
> no of sec(seq(4),c(0,1),0)
To evaluate the expression for a specific number, use the eval function.
> x <- 3
> eval(f)
[1] 9
The function findZeros takes a function as its argument and returns zeros of the
function (that is, the points where the function is 0).
2. If the function has many zeros, findZeros will display some of the zeros.
> findZeros(sin(pi*x)∼x)
x
1 -4
2 -3
3 -2
4 -1
5 0
6 1
7 2
8 3
9 4
To find the zeros of a function within a confined domain, say within the interval
(a, b), we include the argument xlim = range(a,b).
> findZeros(sin(pi*x)∼x,xlim = range(-2,2))
x
1 -1
2 0
3 1
We may use findZeros to find the intersection of two functions f(x) and g(x), since f(x) = g(x) exactly when f(x) − g(x) = 0.
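For instance (a hypothetical pair of functions, used only for illustration), the intersections of x² and x + 2 are the zeros of their difference:
> findZeros(x^2 - (x + 2) ~ x)
   x
1 -1
2  2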
3. Solve for x2 = −1
> findZeros(x^2+1∼x)
numeric(0)
Warning message:
In findZeros(x^2 + 1 ∼ x) :
No zeros found. You might try modifying your search window or increasing
npts.
is singular.
3.3 Graphs
3.3.1 Plots in R
Plot
The function plot in R requires the (preinstalled) system package graphics. Here is the
description of the some of the arguments for the function plot.
plot(x, y, type = "p", main, xlim, ylim, xlab, ylab)
• type is the type of plot desired. "p" for points, "l" for lines, "b"
for both points and lines, "c" for empty points joined by lines, "o" for
overplotted. points and lines, "s" and "S" for stair steps and "h" for
histogram-like vertical lines. Finally, "n" does not produce any points
or lines.
Example. 1. We use the data set mtcars available in the R environment to create a
basic scatterplot. Let’s use the columns wt and mpg in mtcars.
> input <- mtcars[,c(‘wt’,‘mpg’)]
> input
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
...
> plot(x = input$wt,y = input$mpg, xlab = "Weight", ylab = "Milage", xlim
= c(1.5,5), ylim = c(10,34), main = "Weight vs Milage")
However, most of the time we are plotting the graph to observe the trends; it is not
necessary to include the labels for the axes and the title. Moreover, in this case,
input is already a matrix with 2 columns. Finally, if we do not specify the limits of
the plot, R will automatically choose a big enough interval to contain all the data.
> plot(input,type=‘p’)
Finally, it is clear from this example that we do not want to use line plot.
plot(input,type=‘b’)
We may try to fix the line plot by rearranging the wt in ascending order.
plot(input[order(input$wt),],type = "b")
However, it is still not very meaningful to use line plot in this case.
3. In some cases, the function defined may have more than 1 input, but we want to
observe the trend for the variation of one of the argument, while keeping the rest
constant. Let use our function halflife defined in the previous section. Suppose
we fix the amount and type of substance, we want to observe how the function
varies with t.
> t <- seq(0,100,4)
> plot(t,halflife(1,1/5,t),type = "l")
Or suppose we want to know how the function will vary with respect to substance
with different half-lives, after a fixed time.
> lambda <- seq(from = 0.1, to = 1, length.out = 20)
> plot(lambda,halflife(1,lambda,1),type = "l")
Sapply
For most of the examples above we are able to evaluate the functions for vectors too, since
the operations involved in defining the function accept vectors as arguments. However,
consider the following function.
f(x) = 0 if x ≤ −1,
       x + 1 if −1 < x ≤ 0,
       −x + 1 if 0 < x ≤ 1,
       0 if x > 1.
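The exact R definition of f is not shown here; a plausible version (an assumption) using if and return is the following sketch.
# A sketch of one way f might be defined (assumption; not shown in the notes)
f <- function(x){
  if (x <= -1 | x > 1) return(0)
  if (x <= 0) return(x + 1)
  return(-x + 1)
}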
If we try to plot it with the method defined above, we obtain the following.
> plot(x=seq(-2,2,length=1000),f(x),type="l")
Error in xy.coords(x, y, xlabel, ylabel, log) :
‘x’ and ‘y’ lengths differ
In addition: Warning message:
In if (or(x <= -1, x > 1)) return(0) :
the condition has length > 1 and only the first element will be used
Hence, we want instead the function f to act on each component of x. To this end,
we use the function sapply (simplified list apply function).
> x <- seq(-2,2,length=1000)
> y <- sapply(x,f)
> plot(x,y,type = "l")
Curve
The function curve in R can be used to plot the graph of a symbolic function. Let us
take a look at the description of some of the arguments that we need.
curve(expression, from, to, type, n)
(c) n=10000
In the curve function, we may include the argument add = TRUE to add another
graph onto the current one.
> curve(x^2, from = -5, to = 5)
> curve(x^4, add = TRUE)
Histogram
A histogram represents the frequencies of values of a variables belonging to different
ranges. Each bar in a histogram represents the number of times the values within a given
range appears. Here are the description of some of the arguments in the function hist
in R.
hist(x, breaks, col, border)
For now we don’t have to concern ourselves with the function norm.
> hist(x,breaks = c(10,30,40,45,50,55,60,65,70,75,80,90),col = "BLUE", border
= "RED")
Contour
The R function contour creates a contour plot, or add contour lines to an existing plot.
Here are description of some of the arguments we will be using.
contour(x, y, z, nlevels, levels, col, lwd, lty, labcex, label, drawlabels,
add)
• x, y are the locations of grid lines at which the values in z are measured.
These must be in ascending order. By default, equally spaced values from
0 to 1 are used
• lty is the type of lines drawn, 1 is straight line, 2 and above are broken
lines.
• label is a vector giving the labels for the contour lines. If NULL then
the levels are used as labels.
2. > contour(z,nlevels=20,lty=2)
Linear Regression
The simplest process of curve fitting is to use a line, or a linear equation. The function
lm is used to fit linear models. A line y = mx + c is uniquely determined by its gradient
the y-intercept, m and c, respectively. This is what lm returns. Let us take a look at
some of the arguments of lm.
lm(formula, data, na.action)
• na.action is a function which indicates what should happen when the data
contain NAs. The default is set by the na.action setting of options,
and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit.
Another possible value is NULL, no action. Value na.exclude can be useful.
To plot the linear curve over the scatter plot of the data, we use the function abline.
abline(a , b , and other graphical.parameters)
• a is the y-intercept.
• b is the gradient.
To read the value predicted by the curve, we can use the function predict. Here are
some of the arguments of the function.
predict(object,newdata)
• object is the formula which is already created using the lm() function.
• newdata is the vector containing the new value for predictor variable.
Example. 1. We will use the weights and miles per gallon data from the data mtcars.
> Weight <- mtcars$wt #1000 lbs
> Miles per gallon <- mtcars$mpg
> lm(Miles per gallon∼Weight, data=mtcars)
Call:
lm(formula = Miles per gallon ∼ Weight, data = mtcars)
Coefficients:
(Intercept) Weight
37.285 -5.344
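As a follow-up sketch (not part of the original example), the fitted line can be drawn with abline and a value read off with predict; here the mtcars column names wt and mpg are used directly.
> fit <- lm(mpg ~ wt, data = mtcars)
> plot(mtcars$wt, mtcars$mpg)
> abline(fit, col = "blue")
> predict(fit, data.frame(wt = 3))    # predicted mpg for a weight of 3 (i.e. 3000 lbs)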
2. We will be using the data beavers in the library datasets. It records the body
temperature of 2 beavers at 10 minutes interval. Let us plot the graph of the first
beaver.
> data(beavers)
> Minutes <- seq(10, nrow(beaver1) * 10, 10) #starting at 10 minutes to
114*10 minutes, interval of 10 minutes
> Temperature <- beaver1$temp #the recorded body temperatures
of the first beaver
> plot(Minutes, Temperature, pch = 19, frame = FALSE)
> lm(Temperature ∼ Minutes, data = beaver1)
Call:
lm(formula = Temperature ∼ Minutes, data = beaver1)
Coefficients:
(Intercept) Minutes
3.672e+01 2.456e-04
• x, y are vectors giving the coordinates of the points in the scatter plot.
• f is the amount of smooth, the larger the value, the smooth the curve.
It must be a positive number.
Example. 1. We will plot both the linear model and the lowess smoother model. >
Weight <- mtcars$wt
> Miles per gallon <- mtcars$mpg
> plot(Weight, Miles per gallon,pch = 19, frame = FALSE)
> abline(lm(Miles per gallon ∼ Weight, data = mtcars), col = "blue")
> lines(lowess(Weight,Miles per gallon), col = "red")
Now to read the value predicted by the lowess curve,
> predict(loess(Miles per gallon ∼ Weight,mtcars),data.frame(Weight=3))
1
20.48786
Note that here the object is loess instead of lowess.
2. > data(beavers)
> Minutes <- seq(10, nrow(beaver1) * 10, 10)
> Temperature <- beaver1$temp
> plot(Minutes, Temperature, frame = FALSE, type = "l")
> abline(lm(Temperature ∼ Minutes, data = beaver1), col = "blue")
> lines(lowess(Minutes,Temperature), col ="red")
Using the lowess model, it is no longer an observed trend that the body temperature
of the beaver will continue to raise.
3. We will now observe the graphs of the function lowess for different smoother val-
ues.
> plot(Minutes, Temperature, frame = FALSE, type = "l")
> lines(lowess(Minutes,Temperature), col ="blue")
> lines(lowess(Minutes,Temperature,f=0.1), col ="green")
> lines(lowess(Minutes,Temperature,f=10), col ="red")
> legend("topleft",col = c("blue", "green", "red"),lwd = 2,c("f = default",
"f = 0.1", "f = 5")) #add legend
We may alternatively use the scatter.smooth function in R to plot the graph with a
lowess curve.
3.4 Derivatives
3.4.1 Definitions and Properties
Single variable real-valued function
Recall that a real function of a single variable is differentiable at a if there is a real number m such that the limit
lim_{h→0} (f(a + h) − f(a) − mh)/h
goes to 0. Readers may refer to the appendix for the formal definition of limits. In this case, we can write
lim_{h→0} (f(a + h) − f(a))/h = m,
and m is called the derivative of f at a. The value m is commonly denoted as f′(a), d/dx|_{x=a} f, or df/dx (a). Then the linear approximation of f at a is f(a) + f′(a)(x − a).
Remark. 1. Recall from section 2.2.2 that the norm of a vector h = (hi) ∈ Rn, ∥h∥, is given by √(h1² + h2² + · · · + hn²).
Observe that a real function of a single value is a special case of this, when a = a and
L = f ′ (a).
∂^r f / (∂xi1 ∂xi2 · · · ∂xir) = ∂/∂xi1 ( ∂/∂xi2 ( · · · ∂/∂xir f ) ),
where xi1, xi2, ..., xir ∈ {x1, x2, ..., xn}.
For example, let f(x, y) = x³y² + xy³. Then
∂²/∂x∂y f(x, y) = ∂/∂x (2x³y + 3xy²) = 6x²y + 3y²,
and
∂²/∂y∂x f(x, y) = ∂/∂y (3x²y² + y³) = 6x²y + 3y².
Observe that in the example above the order of the mixed partial derivatives does
not matter. This is not true in general. However, it will be true for all the cases that we
come across in this course.
df (a) = La = ∇f (a)T .
∇f(x) = ( ∂/∂x1 Σ_{i=1}^{n} yi xi, ∂/∂x2 Σ_{i=1}^{n} yi xi, ..., ∂/∂xn Σ_{i=1}^{n} yi xi )^T = (y1, y2, ..., yn)^T = y.
To perform differentiation, we use the function Deriv. The arguments that we need
are
Deriv(function,name,nderiv)
• function is the user-defined function.
• name is the variable we want to differentiate with respect to, placed inside quotation marks " ".
• nderiv is the order of derivative to calculate. The default is 1.
2
Example. 1. Find the derivative of ex .
> f <- function(x) {exp(x^2)}
> df <- Deriv(f,"x")
> df
function (x)
2 * (x * exp(x^2))
2. Find the derivative of cos(sin2 (2x − 1)).
> f <- function(x){cos(sin(2*x-1)^2)}
> df <- Deriv(f,"x")
> df
function (x)
{
.e2 <- 2 * x - 1
.e3 <- sin(.e2)
-(4 * (cos(.e2) * .e3 * sin(.e3^2)))
}
Remark. Recall in section 1.4.1 that the primitive function log is the natural
logarithm, it is already in base e = exp(1). However, if we try to differentiate
log(x,base=a), it will evaluate log(a).
> f <- function(x){log(x,2)} > df <- Deriv(f,"x")
> df
function (x)
1/(0.693147180559945 * x)
So here, we might want to differentiate symbolically using D. However, we must use
the logarithm change of base formula.
> f <- expression(log(x)/log(2))
> df <- D(f,"x")
> df
1/x/log(2)
1
Indeed, the answer is xln(2) , which agrees with the one given by Deriv above, except
it does not evaluate ln(2).
Remark. One has to be careful that the domain of f and the domain of its partial derivatives may not be the same. For example, (x, y, z) = (1, −1, 0) is in the domain of f, but it is not in the domain of ∂f/∂z.
One advantage of using this package is the ability to differentiate some functions that
the preinstalled function D cannot.
> f <- expression(abs(x))
> D(f,"x")
Error in D(f, "x") : Function ‘abs’ is not in the derivatives table
> f <- function(x,y){abs(x*y)}
> dfx <- Deriv(f,"x")
> dfx
function (x, y)
y * sign(x * y)
> dfy <- Deriv(f,"y")
> dfy
function (x, y)
x * sign(x * y)
Vector-valued functions
The function Deriv can be used to differentiate vector-valued functions.
Example. 1. Consider the multivariable vector-valued function
xyz
F(x, y, z) = x + y + z .
z2
Remark. Note that R returns each dFx, dFy, dFz as a 3 × 1 matrix, that is, it is
a column vector.
Then f is differentiable at every point a ≠ 0, with derivative 0 for a < 0 and derivative 1 for a > 0. Let us try to use the function Deriv to find the derivative of f at some points a ≠ 0.
f <- function(x){
if (x>=0) return(x)
else return(0)
}
> Deriv(f,"x")
Error in Deriv (st[[3]], x, env, use.D, dsym, scache, drule. = drule.) :
Could not retrieve body of ’return()’
We will now compute the derivatives of the function f defined above at a few points.
> fderiv(f,1)
[1] 1
> fderiv(f,-1)
[1] 0
> fderiv(f,10)
[1] 1
We may also define a function that return the derivative of f at different points.
> df <- function(x) fderiv(f,x)
Even though we are not able to obtain a symbolic function in R for such functions,
we may still plot the graph for visualization.
> x <- seq(-1,1,length=100)
> y <- sapply(x,f)
> plot(x,y,type="l")
However, the package pracma has the function grad that computes the gradient of a function numerically. Here are some of the arguments of the function grad.
grad(f,x0)
For example, consider f(x, y) = xy, as in the sketch below.
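A minimal sketch (using f(x, y) = xy from above; pracma's grad expects a function of a single numeric vector):
> library(pracma)
> f <- function(v) v[1] * v[2]     # v = c(x, y)
> grad(f, c(2, 3))                 # numerical gradient at (2, 3); analytically (y, x) = (3, 2)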
One may similarly compute the numerical matrix derivative of a multivariable vector-
valued function, by applying the numerical partial derivatives or the gradient on each
component of the function.
3.5 Integration
3.5.1 Definitions and Properties
Single variable integration
Recall that the (definite) integration of a single variable real-valued function f (x) from
limits a to b,
∫_a^b f(x) dx,
is the area bounded by the function, the x-axis, and the limits a and b.
By the fundamental theorem of calculus, one way to find the integral is to use the anti-derivative of the function f, that is, a function F such that d/dx F(x) = f(x). If F is an anti-derivative of f, then
∫_a^b f(x) dx = F(b) − F(a).
Example. Let f(x) = x. An anti-derivative of f(x) is x²/2; indeed, d/dx (x²/2) = 2x/2 = x. Hence,
∫_0^1 x dx = 1²/2 − 0²/2 = 1/2,
which is indeed the area of the triangle.
Multivariable Integration
Let us start with a function f from R² to R. Recall that the graph of f is the set of all points (x, y, f(x, y)) in R³. Then the integral of f over the region R ⊆ R², where R is a subset of the domain of f, written as ∫_R f dA, is the volume enclosed by R and the function.
3. Let D be the region bounded by the circle of radius r in the xy-plane. Then D is both x- and y-simple. We will just present the case for D being x-simple. The functions are a(y) = −√(r² − y²), b(y) = √(r² − y²), −r ≤ y ≤ r.
If the region D is x-simple, then we can perform the integration with respect to x first, then with respect to y. This is because the integral
∫_{a(y)}^{b(y)} f(x, y) dx
is a function of y alone; writing Fy for an anti-derivative of f(x, y) with respect to x, we have
∫_D f dA = ∫_c^d ( Fy(b(y)) − Fy(a(y)) ) dy.
The algorithm for D being y-simple is analogous, interchanging x and y in the algo-
rithm above.
∫_D f dA = ∫_0^1 ∫_{y²}^{√y} xy dx dy = ∫_0^1 [ y x²/2 ]_{x=y²}^{x=√y} dy
= (1/2) ∫_0^1 ( y² − y⁵ ) dy = (1/2) [ y³/3 − y⁶/6 ]_0^1
= (1/2)( 1/3 − 1/6 ) = 1/12.
2. Let D be the region bounded by the curves y = sin(x) and y = 0, for 0 ≤ x ≤ 2π. Find ∫_D f dA, where f is the constant function 1, f(x, y) = 1. The region D is the union of two y-simple sets. For 0 ≤ x ≤ π, we have 0 ≤ y ≤ sin(x), and for π ≤ x ≤ 2π, we have sin(x) ≤ y ≤ 0. So, the integral is split into two iterated integrals.
∫_D f dA = ∫_0^π ∫_0^{sin(x)} 1 dy dx + ∫_π^{2π} ∫_{sin(x)}^{0} 1 dy dx
= ∫_0^π sin(x) dx + ∫_π^{2π} (−sin(x)) dx
= [−cos(x)]_0^π + [cos(x)]_π^{2π} = −(−1 − 1) + (1 + 1) = 4.
3. Evaluate ∫_D y dA, where D is the half disk where x² + y² ≤ 1 and y ≥ 0.
∫_D y dA = ∫_{−1}^{1} ∫_0^{√(1−x²)} y dy dx = ∫_{−1}^{1} [ y²/2 ]_0^{√(1−x²)} dx
= (1/2) ∫_{−1}^{1} (1 − x²) dx = (1/2) [ x − x³/3 ]_{−1}^{1}
= (1/2)( 1 − 1/3 + 1 − 1/3 ) = 2/3.
4. Let R be the rectangle [1, 3] × [2, 3]. Find ∫_R xy dA.
∫_R xy dA = ( ∫_1^3 x dx ) ( ∫_2^3 y dy ) = [ x²/2 ]_1^3 [ y²/2 ]_2^3
= ( 9/2 − 1/2 )( 9/2 − 4/2 ) = 10.
We can apply the idea of iterative integration to more general sets. Let D(y) denote
the set of all points in D whose second coordinate is y. Suppose for each value of y, the set
D(y) consists of a finite number of intervals, whose end points are piecewise continuous
functions of y. Then we can integrate f (x, y) with respect to x over D(y), and obtain a
piecewise continuous function with respect to y. The argument can be same similarly for
D such that for each x, D(x) consist of finite number of intervals, whose end points are
continuous functions of x.
Example. An annulus is the region bounded by two circles. Suppose the two circles have radii r1 and r2, with r1 < r2. Then for each y with |y| < r1, the set D(y) consists of the intervals [−√(r2² − y²), −√(r1² − y²)] and [√(r1² − y²), √(r2² − y²)].
Exercise. Let R be the annulus where the radii of the circles are 1 and 4. Find ∫_R y dA.
This is also known as integration by substitution. This idea can be generalized to multi-
variable functions.
Example. Find
∫_0^1 x² √(1 − x²) dx.
Let x = sin(t) and x(S) = (0, 1); then d/dt x(t) = cos(t), and S = (0, π/2). So,
∫_0^1 x² √(1 − x²) dx = ∫_0^{π/2} sin²(t) √(1 − sin²(t)) cos(t) dt = ∫_0^{π/2} sin²(t) cos²(t) dt
= (1/4) ∫_0^{π/2} (2 sin(t) cos(t))² dt = (1/4) ∫_0^{π/2} sin²(2t) dt
= (1/8) ∫_0^{π/2} (1 − cos(4t)) dt = π/16 − (1/32) [sin(4t)]_0^{π/2}
= π/16.
Multivariable functions
A continuously differentiable multivariable vector-valued function F is called a smooth
change of variables over an open set U ⊆ Rn if F is one-to-one and its derivative matrix
dF is invertible at each point of U . Recall that the derivative matrix is invertible at a
point a if and only if the Jacobian JF is nonzero at a. Suppose further that F maps a
smoothly bounded set C onto a smoothly bounded set D, so that the boundary of C is
mapped to the boundary of D. If f is a continuous function whose domain contains D,
then
∫_D f dV = ∫_C (f ◦ F) |JF| dV.
Let us explicitly spell out the details for a function with two variables. Write the components of F as F(u, v) = (x(u, v), y(u, v)). Then
∫_D f(x, y) dx dy = ∫_C f(x(u, v), y(u, v)) | ∂x/∂u ∂y/∂v − ∂x/∂v ∂y/∂u | du dv.
Example. 1. Let
C = { (u, v) : u² + v² ≤ r² } and D = { (x, y) : (x/a)² + (y/b)² ≤ r² }.
C is a disk of radius r, and D is the region bounded by an ellipse. The mapping F(u, v) = (au, bv), x = au, y = bv, a > 0, b > 0, is one-to-one on C, with image F(C) = D. The Jacobian is
JF = ab − 0 = ab,
which is nonzero since a, b > 0. So, F is a smooth change of variables on C. We will use F to find the area of the ellipse D.
Area of D = ∫_D 1 dx dy = ∫_C 1 · |ab| du dv = ab ∫_C 1 du dv,
and since the area of the circle is πr² (derived next), we have
Area of D = abπr².
JF(u, v) = det [1 1; −1 2] = 3,
= 3e ∫_0^1 e^u du ∫_0^1 e^v dv = 3e(e − 1)².
Readers may refer to the appendix for the rigorous definition of integration over un-
bounded set. The techniques however are shown in the examples below.
Example. 1. Show that
p(x, y) = (1/π) e^{−(x²+y²)}
is a probability density function, and find the probability that (x, y) is in the region D = { (x, y) : x, y ≥ 0 }, that is, in the first quadrant.
We will first show that ∫_{R²} e^{−(x²+y²)} dx dy = π. First, define
Then
∫_D (1/π) e^{−(x²+y²)} dx dy = (1/π) lim_{R→∞} ∫_{DR+} e^{−(x²+y²)} dx dy
= (1/π) ∫_0^{π/2} dθ · lim_{R→∞} ∫_0^R e^{−r²} r dr
= (1/4) lim_{R→∞} (1 − e^{−R²}) = 1/4.
2. Let
p(x, y) = (2x + c − y)/4 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 2, and p(x, y) = 0 otherwise.
Find c so that p is a probability density function.
∫_{R²} p(x, y) dx dy = ∫_{[0,1]×[0,2]} (2x + c − y)/4 dx dy = ∫_0^2 ∫_0^1 (2x + c − y)/4 dx dy
= (1/4) ∫_0^2 [ x² + xc − xy ]_{x=0}^{x=1} dy = (1/4) ∫_0^2 (1 + c − y) dy = (1/4) [ y(1 + c) − y²/2 ]_0^2
= (1/4)( 2(1 + c) − 2 ) = c/2.
So p is a probability density function if and only if c/2 = 1, or c = 2. We also need to check that p is nonnegative. Since x ≥ 0 and y ≤ 2 (or 2 − y ≥ 0),
p(x, y) = (2x + 2 − y)/4 ≥ 0/4 = 0.
Exercise. Let a, b > 0. Show that
∫_{x²+y² ≤ R²} e^{−(ax²+by²)} dx dy = ∫_{u²/a + v²/b ≤ R²} e^{−(u²+v²)} (1/√(ab)) du dv,
and evaluate
∫_{R²} e^{−(ax²+by²)} dA.
Suggest how we can modify e^{−(ax²+by²)}, if necessary, such that it is a probability density function.
3.5.4 Integration in R
Single integral
To perform indefinite integrals in R, we need the package mosaicCalc.
> install.packages("mosaicCalc")
> library(mosaicCalc)
The function antiD can be used to find anti-derivative of a function, that is, to perform
indefinite integral.
However, antiD may not be able to return a symbolic function all the time, even for those whose symbolic anti-derivative exists. For example, the anti-derivative of 2x e^{x²} is e^{x²}; however, antiD will perform numerical (definite) integration.
> F <- antiD(2*x*exp(x^2)∼x)
> F
function (x, C = 0)
{
numerical integration(.newf, .wrt, as.list(match.call())[-1],
formals(), from, ciName = intC, .tol)
}
<environment: 0×00000219d8bf6050>
However, we may still use it to perform definite integrals.
> F(1)-F(0)
[1] 1.718282
> exp(1)-1
[1] 1.718282
Iterative integral
We will use the function integrate to perform iterated integral over x-simple or y-simple
domains.
Here is the explanation for the code. First we color code the various parts of the code
that corresponds to the double integral.
Z 1 Z √y
xydxdy.
0 y2
> integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y), y^2, sqrt(y))$value
})
},0, 1)
When we run the code, the function integrate will partition [0, 1], P = {a = y0 < y1 <
· · · < yn = b}, and let y be the partition points, y = (y0 , y1 , ..., yn ). Then for each value yi ,
i = 0, ..., 1, it will evaluate integrate(function(x) f(x,yi ), yi ^2, sqrt(yi ))$value,
viewing the function f (x, y) as only a function of x, with the y component fixed = yi .
Then the Riemann (or Darboux) sum is taken over this partition P . Readers may refer
to appendix 3.7.5 for the details.
R
In general, if we want to compute the integral D f dA, where D is x-simple with
smooth functions a(y), b(y) defining the end points of x, and c, d defining the end points
of y, we type
> integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y), a(y), b(y))$value
})
},c, d)
is a probability density function. Let ρ = 0.2. Find the probability that (x, y) is in the
region 0 ≤ x ≤ 1, 1 ≤ y ≤ 2.
f <- function(x,y,rho){
(1/(2*pi*sqrt(1-rho^2)))*exp(-(x^2-2*rho*x*y+y^2)/(2*(1-rho^2)))
}
frho <- function(rho) {
integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y,rho), -10, 10)$value
})
}, -10, 10)$value
}
(the code was written in Rscript).
The probability is
> integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y,rho=0.2), 0, 1)$value
})
}, 1, 2)$value
[1] 0.05170924
s′ (t) = −βs(t)i(t),
i′ (t) = βs(t)i(t) − γi(t),
r′ (t) = γi(t),
where β and γ are some fixed constants, is a first order nonlinear system of ordinary
differential equations.
Readers may visit https://www.maa.org/press/periodicals/loci/joma/the-sir-model-for-spre
for details.
In this course we will only focus our attention on first order system of ordinary dif-
ferential equations (first order SDE in short) that can be written as
y1 = s, y2 = i, y3 = r,
F1(y1, y2, y3) = −βy1y2, F2(y1, y2, y3) = βy1y2 − γy2, F3(y1, y2, y3) = γy2.
Note that the functions Fi are assumed to be functions of the yi only; they do not (explicitly) depend on t (even though each yi is a function of t). For example, we will not be dealing with the cases where F1(y1, y2, y3, t) = sin(t)y1 + y2y3, since it involves t.
The initial conditions of a first order SDE are the values that the functions yi take at
a given point,
the aim is to find functions y1, y2, ..., yn that satisfy the equations above; the set {y1, y2, ..., yn} is called a solution of the system. If the SDE has an initial condition, then we further insist that the functions yi satisfy the initial conditions. For example, an initial condition for the SIR model will be
Here we are assuming that it is the beginning of the outbreak, at t = 0, and a small proportion of the population has already been infected, but no one has yet recovered from the disease.
To find the solution of a SDE with initial conditions in R, we need the package deSolve.
> install.packages("deSolve")
> library(deSolve)
out <- ode(y = state, times = times, func = SDE, parms = parameters)
Example. Let’s solve the SIR model with β = 0.7 and γ = 0.1, with initial conditions
out <- ode(y = state, times = times, func = sirmodel, parms = parameters)
Here we want to plot all three functions s, i, r in a single graph, so we instead use
plot(out[,"time"],out[,"s"],type="l",col="green",xlab="Days", ylab="")
lines(out[,"time"],out[,"i"],col="red")
lines(out[,"time"],out[,"r"],col="blue")
3.7 Appendix for Chapter 3
3.7.1 Sequences and Limits
A sequence is list of objects in which repetitions are allowed and order matters. The
objects are usually indexed by the natural numbers. Hence, a sequence can be viewed
as a function whose domain is either the set of non-negative integers {0, 1, 2, ...} or the
positive integers {1, 2, 3, ...}. It is customary to denote the elements of a sequence by a
letter and the index in the subscript, for example, (a1 , a2 , a3 , ...).
Example. 1. Let an = 1/n for n ≥ 1, that is, the sequence is (1, 1/2, 1/3, ...).
2. Define an recursively, an+1 = an + an−1, a0 = 0, a1 = 1. This is the Fibonacci sequence, (0, 1, 1, 2, 3, 5, 8, ...). The formula for the n-th term is an = (φ^n − ψ^n)/√5, where φ = (1 + √5)/2, ψ = (1 − √5)/2.
3. Let an = (−1)^n, for n ≥ 1. The sequence is (−1, 1, −1, 1, ...).
4. Let an = (1 + 1/n)^n. Then the sequence is (2, (3/2)², (4/3)³, (5/4)⁴, ...).
The limit of a sequence is a real number that the values an are close to for large values
of n. That is, a is the limit of an if as n becomes bigger, the difference |a − an | becomes
smaller. We shall denote this as
lim an = a.
n→∞
Here is the precise definition. A sequence (an ) is said to converge to a real number a if
for every ε > 0, there is a N > 0 such that for all n > N , |a − an | < ε. The number a
is called the limit of the sequence. If the sequence does not converge to some real number, then it is said to diverge.
Example. 1. The sequence (1, 1/2, 1/3, ...) converges to 0, lim_{n→∞} 1/n = 0.
2. The Fibonacci sequence diverges.
3. The sequence (−1, 1, −1, 1, ..., (−1)^n, ...) diverges.
4. The sequence (2, (3/2)², (4/3)³, (5/4)⁴, ..., (1 + 1/n)^n, ...) converges to e, the Euler number, lim_{n→∞} (1 + 1/n)^n = e.
Remark. We may allow the limit a to be ±∞. In this case, we must modify the definition
as such. We say that an converges to ∞ if for every (large) number M > 0, there is a
N such that for all n > N , an > M . Similarly, we say that an converges to −∞
if for every (large negative) number M < 0, there is a N such that for all n > N ,
an < M . For example, the Fibonacci sequence converges to ∞. However, the sequence
(−1, 1, −1, 1, ..(−1)n , ...) is still divergent.
Theorem (Uniqueness of limit). If limn→∞ an = a and limn→∞ an = b, then a = b.
Proof. We will show that for every ε > 0, |a − b| < ε. Given any ε > 0, since limn→∞ an = a (limn→∞ an = b, respectively), we can find N1 (N2, respectively) such that for all n > N1 (n > N2, respectively),
|an − a| < ε/2   (|an − b| < ε/2, respectively).
Let N = max{N1, N2} + 1. Then by the triangle inequality,
|a − b| = |(a − aN) − (b − aN)| ≤ |a − aN| + |b − aN| < ε/2 + ε/2 = ε.
Example. 1. Define the function f on R to be f (x) = n for all x such that n ≤ x <
n + 1. Then f is right continuous.
2. If we instead define f to be f (x) = n for all x such that n − 1 < x ≤ n. Then f is
left continuous.
An open interval is the set { x ∈ R : a < x < b }. Here we allow a = −∞ and b = ∞. It is usually denoted as (a, b). A closed interval is the set { x ∈ R : a ≤ x ≤ b }, that is, it includes the points a and b. It is usually denoted as [a, b].
Here the supremum and infimum is taken over all possible partitions of [a, b].
A function f is said to be Darboux integrable over the closed interval [a, b] if L(f ) =
U (f ). In this case, we denote the integral as
∫_a^b f(x) dx = L(f) = U(f).
Let [a, b] be a closed interval. Divide the interval into n subintervals, each of length δnx = (b − a)/n. Let xi = a + iδnx, that is, x0 = a, x1 = a + δnx, ..., xn = a + nδnx = b. Then the left Riemann sum is
Σ_{i=1}^{n} f(xi−1) δnx,
Hence, we will just say that a function is integrable over [a, b] if it is either Darboux
or Riemann integrable over [a, b].
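As a small illustration (a sketch, not part of the appendix text), the left Riemann sum can be computed directly in R:
# Left Riemann sum of f on [a, b] with n subintervals
left_riemann <- function(f, a, b, n){
  dx <- (b - a) / n
  x  <- a + (0:(n - 1)) * dx              # left endpoints x_0, ..., x_{n-1}
  sum(f(x)) * dx
}
left_riemann(function(x) x, 0, 1, 1000)   # approaches 1/2 as n grows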
Theorem (Fundamental theorem of calculus II). Let f be an integrable function on [a, b].
For any x in [a, b], let
F(x) = ∫_a^x f(t) dt.
Then F is continuous on [a, b]. If f is continuous at x0 in (a, b), then F is differentiable at x0 and
d/dx F(x0) = f(x0).
(ii) ∫ x^a dx = x^{a+1}/(a + 1) + C for a ≠ −1,   ∫ (1/x) dx = ln|x| + C,   ∫ f′(x)/f(x) dx = ln|f(x)| + C.
(iii) ∫ e^x dx = e^x + C,   ∫ f′(x) e^{f(x)} dx = e^{f(x)} + C,   ∫ a^x dx = a^x/ln a + C.
Integration formulas for trigonometric functions
(i) ∫ sin x dx = −cos x + C,   ∫ cos x dx = sin x + C,   ∫ sec² x dx = tan x + C,
    ∫ csc² x dx = −cot x + C,   ∫ sec x tan x dx = sec x + C,   ∫ csc x cot x dx = −csc x + C,
    ∫ tan x dx = ln|sec x| + C,   ∫ cot x dx = ln|sin x| + C,
    ∫ sec x dx = ln|sec x + tan x| + C,   ∫ csc x dx = ln|csc x − cot x| + C.
(ii) ∫ 1/√(1 − x²) dx = sin⁻¹ x + C,   ∫ 1/(1 + x²) dx = tan⁻¹ x + C,   ∫ 1/(x√(x² − 1)) dx = sec⁻¹ x + C.
converge and are equal. The limit of these sequences is called the integral of f over D, and we denote it with the usual notation
∫_D f dV = lim_{k→∞} ∫_{Dk} f dV = lim_{k→∞} ∫_{D(k)} f dV.
Theorem. Let D be an unbounded set, and f a continuous function on D whose absolute value is integrable over D, that is,
∫_D |f| dV exists.
Then for any increasing sequence of smoothly bounded sets Dk whose union is D, the limit of integrals
lim_{k→∞} ∫_{Dk} f dV exists,
(a) T1 : R² → R², T1(x, y) = (x + y, x − y) for all (x, y) ∈ R².
(b) T2 : R² → R², T2(x, y) = (x², 0) for all (x, y) ∈ R².
(c) T3 : R² → R³, T3(x, y) = (x + y, x, y) for all (x, y) ∈ R².
(d) T4 : R³ → R³, T4(x, y, z) = (1, y − x, y − z) for all (x, y, z) ∈ R³.
within the range −10 ≤ x ≤ 10. Use a line for the plot. Is the function continuous?
5. The dataset airquality records daily air quality measurements in New York, May
1 to September 30, in the year 1973. It has with 153 observations on 6 variables.
• Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt
Island, units (ppb).
• Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms
from 0800 to 1200 hours at Central Park, units (lang).
• Wind: Average wind speed in miles per hour at 0700 and 1000 hours at La-
Guardia Airport, units (mph).
• Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Air-
port, units (degrees F).
• Month: the month (May to September).
• Day: numeric Day of month (1 to 31).
We will now plot a scatterplot with temperature in the x-axis and wind speed in
the y-axis.
> Temp <- airquality$Temp
> Wind <- airquality$Wind
(a) Plot a scatterplot of the graph of the wind speed against the temperature. Title
the graph as ”Wind Speed vs Temperature”, label the x-axis with ”Temperature
(degree F)” and the y-axis with ”Wind Speed (mph)”.
(b) Add a linear fit line (in blue) using the function lm to fit the graph in (a).
(c) Add the lowess fit curve (in red) with f value = 1.
7. Let
f(x) = x e^{x²}.
Find d²/dx² f(x), and d/dx f(0).
8. Let
f(x, y) = e^{x/y − y²}.
Find the gradient ∇f(x, y), ∂²/∂x∂y f(x, y), and ∂²/∂y∂x f(x, y). Also, find the linear approximation of f at (−1, −1).
y · ∇f(x, y, z) = r ∂/∂x f(x, y, z) + s ∂/∂y f(x, y, z) + t ∂/∂z f(x, y, z) = 0
Find ∇Q(1, 0, 1), that is, find the gradient of Q(x1 , x2 , x3 ) at the point (1, 0, 1).
13. Let
F(x, y, z) = (x + y + z, xy + yz + zx).
Find the matrix derivative, DF of F.
14. Let
F(x, y) = (e^x cos(y), e^x sin(y))^T.
Find the Jacobian JF of F.
d/dx sin⁻¹(x) |_{x=√2/2} = 1/√(1 − (√2/2)²) = 1/cos(π/4) = √2.
with initial conditions y1 (0) = y2 (0) = 0. Plot the graph of y1 and y2 for 0 ≤ t ≤ 4π.
25. Consider the following 4th order differential equation
y (4) (t) + y(t) = 0,
with initial conditions y(0) = 1, y ′ (0) = 1, y (2) (0) = 0, y (3) (0) = 1. Here, y (k) (t) is
the k-th derivative of y(t). We can convert it to a first order system of differential
equations as such. Let
y1 (t) = y(t), y2 (t) = y ′ (t), y3 (t) = y (2) (t), y4 (t) = y (3) (t).
Then the above 4th order differential equation becomes the following first order
system of differential equations
y1′ (t) = y2 (t)
y2′ (t) = y3 (t)
y3′ (t) = y4 (t)
y4′ (t) = −y1 (t)
with initial conditions
with initial conditions y1 (0) = 0, y2 (0) = 1. One can check that y1 (t) = sin(t), and
y2 (t) = y1′ (t) = cos(t) is the solution. To solve it in R, we need to first create the
interpolating function, using approxfun
times <- seq(0,2*pi,length=1000) #0<=t<=2*pi
t <- data.frame(times=times,t=times)
t <- approxfun(t,rule=2)
Now use the interpolation function in the ODE function
ODE <- function(t,state,parameters){
y1 <- state[1]
y2 <- state[2]
dy1 <- y2
dy2 <- -sin(t)
Sol <- c(dy1,dy2)
list(Sol)
}
Plot the graph of the solution of the following 4th order ODE
for 0 ≤ t ≤ 10, with initial conditions y(0) = y ′ (0) = y ′′ (0) = y (3) (0) = 0.
Chapter 4
Total letters = 10
No. of O = 2
No. of K = 2
No. of E = 3
2. Multinomial expansion
(x1 + x2 + · · · + xk)^n = Σ_{n1+n2+···+nk=n} [ n! / (n1! n2! · · · nk!) ] x1^{n1} x2^{n2} · · · xk^{nk}.
Example. 1. How many ways are there to have n binary digits such that m of them are 0 and no two 0's are consecutive?
To solve this, imagine we first lay out the n − m 1's. Then we slot the 0's into the n − m + 1 possible slots (including the left and right ends). Since there can be no consecutive 0's, each slot can contain at most one 0. Hence, we are choosing m of the n − m + 1 slots to put the 0's. The solution is C(n − m + 1, m).
2. Binomial expansion
(x + y)^n = Σ_{k=0}^{n} C(n, k) x^k y^{n−k}.
(i) 0 ≤ P(E) ≤ 1,
(ii) P(S) = 1,
(iii) For any sequence of events E1, E2, ..., such that Ei ∩ Ej = ∅ for all i ≠ j,
P( ∪_{i=1}^{∞} Ei ) = Σ_{i=1}^{∞} P(Ei).
Axiomatic approach: Since there are six sides to each die, the total number of outcomes for two dice is 36. There are 6 possible outcomes that sum to 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1). By the assumption that the dice are fair, each of these outcomes is equally likely. So, the probability is
6/36 = 1/6.
Relative frequency approach, by simulation: The function sample in R generates a
sample of the specified size from the elements from a vector x. Here are some of the
arguments.
sample(x,size,replace,prob)
• x is a vector containing the list of all the objects from which to choose
from. If it is a positive integers, then the list is from 1 to that integer.
• prob is the probability that each of the object is being chosen, it should
be a vector of the same length as x.
We will now simulate rolling two fair dice and finding the sum.
sum2dice <- function(nreps){
count <- 0
for (i in 1:nreps) {
sum <- sum(sample(1:6,2,replace=TRUE))
if (sum==7) count <- count + 1
}
return(count/nreps)
}
Let us explain the code. The argument of the defined function sum2dice is nreps,
which is the total number of times the experiment is to be conducted. count is the
number of times the sum is 7. We start the code with count = 0. The line for (i in
1:nreps) runs the experiments nreps times. For each experiment, we take two numbers
from 1 to 6 with replacement, then find the sum. The line if (sum==7) checks if for this
particular experiment, and perform count <- count +1 if the sum is 7. That is, it will
add 1 to the total number of times the sum is 7 if the outcome is 7 in the experiment.
Once nreps number of experiments have been conducted, the code will move to the next
command, that is to return the fraction count/nreps.
> sum2dice(10000)
[1] 0.1684
> sum2dice(1000000)
[1] 0.166963
So, observe that as we perform more experiments, the number approaches 1/6, which
was the answer derived theoretically.
By using the function replicate, we could shorten the code. Let us do it more
generally. The code to find the relative frequency of the sum of rolling d dice is k is as
such.
rollddice <- function(d) {sample(1:6,d,replace=TRUE)}
sumddice <- function(d,k,nreps){
sum <- replicate(nreps,sum(rollddice(d)))
return(mean(sum==7))
}
Remark. Though the second code is shorter, it takes more memory space, and thus is
subjected to higher inaccuracy.
Personal probability
Sometimes people use probability to measure their degree of confidence that an event would occur. This is known as the personal or subjective view of probability. For example, one might say that he or she might be moving house next year with a probability of 0.9. This tells us that the person is quite certain that he or she will move. It is impossible to perform this experiment repeatedly under the same conditions and take the relative frequency in this situation. However, personal probabilities must still be subject to the axioms of probability. For example, when asked the probability that he or she will not move, the person must say that it is 0.1.
In this course, we will not deal with or use probability in the sense of personal prob-
ability.
Exercise. Suppose a weather forecast predicts that it might rain the next day with
probability 90%. Can this be interpreted as relative frequency approach to probability?
Bear in mind that it is impossible to “conduct an experiment under the same conditions”
to check how many of the outcome of the experiments results in being raining the next
day.
Since the coin is assumed to be fair, there are four equally likely outcomes, S = {(h, h), (h, t), (t, h), (t, t)}. So, the probability that the event (h, h) occurs is 1/4. However, we are told that the first coin was observed to land on heads. This makes the events (t, h) and (t, t) impossible for this experiment. Hence, S = {(h, h), (h, t)}, and therefore, the probability for (h, h) is 1/2. This is called conditional probability.
If A and B are two events in a sample space S and P (A) ̸= 0, the conditional probability
of B given A is defined as
P (A ∩ B)
P (B|A) = .
P (A)
(Note that P (A ∩ B) is the probability that both events A and B occurs.) From the
relative frequency explanation, we are finding the frequency of event B occuring over all
the times that event A happens.
Example. 1. Find the probability that the sum of the outcomes of tossing three dice
is at least 12, given that the outcome of the first die is 5.
sum12 <- function(nreps){
count12 <- 0
count5 <- 0
for (i in 1:nreps) {
toss <- sample(1:6,3,replace=TRUE)
if (toss[1]==5) {
count5 <- count5 + 1
if (sum(toss)>=12) count12 <- count12 + 1
}
}
return(count12/count5)
}
> sum12(10000)
[1] 0.5828505
> sum7(10000)
[1] 0.5839
So, the probability that the sum of three tosses is at least 12 given the first is 5 is
the same as the probability that the sum of the other 2 dice toss is at least 7.
2. Suppose a box contains 50 defective light bulbs, 100 partially defective light bulbs
that will not last more than 3 hours, and 250 good light bulbs. If a bulb is taken
from the box and it lights up when used, what is the probability that the light bulb
is actually a good light bulb?
Here we are to find the probability that the light bulb is good, given the condition
that it is not defective.
P(Good | Not defective) = P(Good)/P(Not defective) = 250/(100 + 250) = 5/7.
Remark. Note that the good light bulbs are a subset of the not defective light
bulbs, hence, the intersection of the good light bulbs and the not defective ones are
just the good ones.
3. Consider a game where each player take turns to toss a die, the score they obtain
for that round will be the outcome of the toss added to their previous tosses. If the
outcome of the toss is 3, a player gets a bonus toss (only once, even if it lands on
3 the second time, it is the end of the player’s turn). What is the probability that
the player’s score is less than 8 after the first turn?
At first look, this does not look like a conditional probability problem. However, to
compute the probability analytically (vis-a-vis via simulation), we need to break-
down the event into smaller events; whether the player’s first toss lands on 3. Let
T be the outcome of the player’s first toss, and B the outcome of player’s bonus
toss with B = 0 if the player did not get one.
Now suppose we know that the player’s score is 4. Let’s find the probability that
it was obtained with the help of a bonus toss.
Here we want to compute the probability that the player did have a bonus toss
B > 0, given that his score was 4, T + B = 4.
P(A ∩ B) = P(A)P(B).
In this case, if P(B) ≠ 0,
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A).
Similarly, P (B|A) = P (B). This says that the condition that B happens will not affect
the probability of A happening and vice versa. For example, the probability that the
second toss of a coin will land on heads is independent on the condition that the first
landed on heads. Another example is the condition that the outcome of the third toss of
a die is 3 will not affect the probability that the sum of the first two tosses is 8.
More generally, n events, E1, E2, ..., En, are independent if for any subset Ej1, Ej2, ..., Ejm of them,
P( ∩_{i=1}^{m} Eji ) = ∏_{i=1}^{m} P(Eji).
P(B|A) = P(A ∩ B)/P(A) = P(A|B)P(B)/P(A).
A = (A ∩ B) ∪ (A ∩ B c ),
P (A) = P (A ∩ B) + P (A ∩ B c )
= P (A|B)P (B) + P (A|B c )P (B c )
= P (A|B)P (B) + P (A|B c )[1 − P (B)].
More generally, the law of total probability states that if A1, A2, ..., An are mutually exclusive events such that the union of these events is the entire sample space, ∪_{i=1}^{n} Ai = S, then for any event E,
P(E) = Σ_{i=1}^{n} P(E ∩ Ai) = Σ_{i=1}^{n} P(E|Ai)P(Ai).
Suppose now E has occurred and we are interested in determining the probability that one of the Ak also occurred. Then by the law of total probability and Bayes' rule, we arrive at
P(Ak|E) = P(E|Ak)P(Ak)/P(E) = P(E|Ak)P(Ak) / Σ_{i=1}^{n} P(E|Ai)P(Ai).
P(A1|E) = P(E ∩ A1)/P(E)
= P(A1) / ( P(A1) + P(E ∩ A2) )
= p / ( p + (1 − p)(1/k) )
= kp / ( 1 + p(k − 1) ).
2. (Monty Hall problem) In a television show, the host, Monty Hall, gave contestants
the chance to choose to open one of three doors. Behind a door is a luxurious
car, while a goat is behind two others. When a contestant picks a door, the host,
knowing which door hides the car, will reveal a goat behind one of the two
doors that the contestant did not pick. The contestant is then given a choice to
either remain with his original decision, or switch to the other unopened door. Is
it to the contestant’s advantage to switch?
> Montyhall(100000)
P(Win with Switch) P(Win Without Switching)
0.6675855 0.3341477
The simulation tells us that the contestant has about a 2/3 chance of winning if he chooses to switch, while only about a 1/3 chance of winning if he does not switch.
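The Montyhall function itself is not shown in this excerpt; one possible simulation (a sketch, with the function name and output labels matching the call above) is:
Montyhall <- function(nreps){
  win_switch <- 0
  win_stay   <- 0
  for (i in 1:nreps){
    car  <- sample(1:3, 1)                      # door hiding the car
    pick <- sample(1:3, 1)                      # contestant's first pick
    goat_doors <- setdiff(1:3, c(car, pick))    # doors Monty is allowed to open
    opened <- if (length(goat_doors) == 1) goat_doors else sample(goat_doors, 1)
    switched <- setdiff(1:3, c(pick, opened))   # the door taken when switching
    if (switched == car) win_switch <- win_switch + 1
    if (pick == car)     win_stay   <- win_stay + 1
  }
  c("P(Win with Switch)" = win_switch/nreps,
    "P(Win Without Switching)" = win_stay/nreps)
}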
We will now derive this analytically. Let us call the door that the contestant opens
door 1. Let Di , i = 1, 2, 3 be the door that conceals the car. Let Mi , i = 1, 2, 3,
be the door that Monty opens. Without loss of generality, let's name the door that Monty opens door 3. Then before Monty opens any door, the probability of the car being behind each door is P(Di) = 1/3. After Monty has opened a door, since he won't open the door that hides the car, the probability that the car is behind door 3 given that Monty has opened door 3 is P(D3|M3) = 0. Our aim is to find
the probabilities P (Di |M3 ), for i = 1, 2. The probability P (D1 |M3 ) is the chance
that contestant wins when he did not switch doors, while P (D2 |M3 ) is the chance
that he wins by switching doors. To compute P (Di |M3 ), we use Bayes’ rule, that
is, we first compute P (M3 |Di ) for i = 1, 2, 3.
• P (M3 |D1 ) = 1/2, since Monty can choose to open either door 2 or 3.
• P(M3|D2) = 1, since the participant has chosen door 1, and door 2 contains
the car, so Monty can only open door 3.
• P (M3 |D3 ) = 0, Monty knows which door conceals the car and will not open
it.
Hence,
P(D1|M3) = P(M3|D1)P(D1) / ( P(M3|D1)P(D1) + P(M3|D2)P(D2) + P(M3|D3)P(D3) )
= (1/2 · 1/3) / ( 1/2 · 1/3 + 1 · 1/3 + 0 · 1/3 ) = (1/6)/(1/2) = 1/3,
and hence P(D2|M3) = 1 − P(D1|M3) = 2/3,
which agrees with the simulation. Therefore, it is always to the contestant’s advan-
tage to switch.
3. X3 = the duration a person has to wait in a queue for the famous bubble tea.
4. X4 = the distance a car can travel on a full tank.
Food for thought: Is the score of a randomly chosen student for a particular test a
discrete or continuous random variable?
where D is the domain of the probability density function p. The support of a probability density function p(x) is the subset of the domain on which p(x) > 0.
1. 0 ≤ F (x) ≤ 1,
2. (non-decreasing function) for any real numbers a and b with a < b, F (a) ≤ F (b).
3. limx→∞ F (x) = 1,
4. limx→−∞ F (x) = 0,
Example. 1. Four balls are to be randomly selected, without replacement from a bag
containing twenty balls numbered 1 to 20. Let X denote the largest number among
the four selected balls, then X takes on one of the values 4 to 20. There are \binom{20}{4} equally likely choices, and for X = i, one of the four selected balls must be i, while the other three are chosen from the numbers 1 to i − 1. Hence, there are \binom{i−1}{3} such choices. Therefore, the probability density function is

p(i) = \binom{i−1}{3} / \binom{20}{4}.
To compute the cumulative distribution function F (k) = P (X ≤ k), observe that
this happens when the four balls are chosen between 1 to k. Therefore
F(k) = \binom{k}{4} / \binom{20}{4}.
> plot(x,F(x),type=’h’,lwd=8,col="blue")
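For reference, a minimal sketch of how the vector x and the function F used in the plot command above might have been defined (the names x and F are assumptions chosen to match the call above):

x <- 4:20                                    # possible values of the largest number
F <- function(k) choose(k,4)/choose(20,4)    # F(k) = P(X <= k)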
2. Two fair dice are tossed. Let X denote the sum of the outcomes of the dice. The probability density function is

p(k) = m_k/36 for 2 ≤ k ≤ 12, and p(k) = 0 otherwise,

where m_k = 6 − |k − 7| is the number of ways the two dice can sum to k.
In words, the expected value of X is the weighted average of the possible values that X can take on, each value being weighted by the probability that X assumes it. However, the term “expected value” is something of a misnomer; it is not a value we “expect” to observe. For example, let X be the outcome of the roll of a fair die. Then the expected value of
X is
E[X] = Σ_{k=1}^{6} k/6 = 3.5.
But it is impossible to expect that the outcome of a die toss is 3.5.
Consider another example, where a fair coin is tossed 1000 times. Let a tail count as 0 and a head as 1, and let X be the sum of the outcomes. Then the expected value is E[X] = 500. But P(X = 500) ≈ 0.025, which is small and not something we would “expect” to occur.
The intuition behind expected value, or mean, is the average value of X when we
conduct large numbers of experiments. Let xn be the outcome of the n-th experiment.
Then,

E[X] = lim_{n→∞} (x_1 + x_2 + · · · + x_n)/n.
Example. 1. I is an indicator variable for an event A if
I = 1 if A occurs, and I = 0 if A^c occurs.
Then p(1) = P (A) and p(0) = P (Ac ), and thus
E[I] = P (A).
2. Find the expected value of the sum of the outcomes of two dice.
E[X] = 2 · 1/36 + 3 · 2/36 + 4 · 3/36 + · · · + 11 · 2/36 + 12 · 1/36 = 7.
> x <- c(2,3,4,5,6,7,8,9,10,11,12)
> px <- (1/36)*c(1,2,3,4,5,6,5,4,3,2,1)
> mean <- sum(x*px)
> mean
[1] 7
Simulation
sum2dice <- function(nreps){
  sum <- 0
  for (i in 1:nreps){
    sum <- sum + sum(sample(1:6,2,replace=TRUE))   # add the total of two dice for this repetition
  }
  return(sum/nreps)                                # average over all repetitions
}
> sum2dice(100000)
[1] 6.9966
Alternatively,
sum2dice <- function(nreps){
outcome <- replicate(nreps,sum(sample(1:6,2,replace=TRUE)))
return(mean(outcome))
}
> sum2dice(100000)
[1] 7.00168
Theorem (Properties of expected values). Let X and Y be discrete random variables,
and a, b ∈ R be real numbers. Then
(i) (Joint linearity) E[aX + bY ] = aE[X] + bE[Y ],
(ii) (Linearity) E[aX + b] = aE[X] + b,
(iii) (Independence) if X and Y are independent, then
E[X · Y ] = E[X] · E[Y ].
Example. 1. Let X denote a random variable that takes on all of the values −1, 0,
and 1, with respective probabilities
p(−1) = 0.2, p(0) = 0.5, p(1) = 0.3,
and p(x) = 0, otherwise. Then E[X 2 ] is
E[X 2 ] = (−1)2 · 0.2 + 02 · 0.5 + 12 · 0.3 = 0.5
Note that

E[X]^2 = (−1 · 0.2 + 0 · 0.5 + 1 · 0.3)^2 = 0.1^2 = 0.01 ≠ 0.5 = E[X^2].
V ar[aX + b] = a2 V ar[X].
Proof. (i) By the lemma above and the properties of expected value,
2. Let X denote the sum of the outcomes of tossing two fair dice. The variance of X
is
Var[X] = E[X^2] − E[X]^2 = (2^2 · 1/36 + 3^2 · 2/36 + 4^2 · 3/36 + · · · + 11^2 · 2/36 + 12^2 · 1/36) − 7^2 = 35/6.
> x <- c(2,3,4,5,6,7,8,9,10,11,12)
> px <- (1/36)*c(1,2,3,4,5,6,5,4,3,2,1)
> var <- sum(x^2*px)-(sum(x*px))^2
> var
[1] 5.833333
Theorem (Chebychev's inequality). For a random variable X with mean µ_X and variance σ_X^2, and any real number α > 0, we have the following (equivalent) inequalities.

(i) P(|X − µ_X| ≥ ασ_X) ≤ 1/α^2.

(ii) P(|X − µ_X| ≥ α) ≤ σ_X^2/α^2.

(iii) P(|X − µ_X| < ασ_X) ≥ 1 − 1/α^2.

(iv) P(|X − µ_X| < α) ≥ 1 − σ_X^2/α^2.
Take inequality (i) for example. It says that the probability that a random variable
takes values, say 4 standard deviations away from the mean, is less than 1/16 = 0.0625.
Rephrasing it using (iii), we are sure with probability 1 − 0.0625 = 0.9375 that X will be within 4 standard deviations of the mean.
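As a quick illustration (a sketch, not part of the original notes), one can compare the Chebyshev bound with the actual tail probability of a standard normal random variable in R:

1/4^2              # Chebyshev bound for 4 standard deviations: 0.0625
2*(1 - pnorm(4))   # actual P(|Z| >= 4) for a standard normal, roughly 6e-05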
E[X] = p,
V ar[X] = p(1 − p).
E[X] = np,
V ar[X] = np(1 − p).
Example. Consider the following gambling game. A player bets on one of the numbers
1 through 6. Three dice are then rolled. Let i = 0, 1, 2, 3 be the number of times the
number the player bets on appears as the outcome of the dice roll. If i = 1, 2, 3, the
player wins i units. If i = 0, the player loses 1 unit. Is the game fair to the player?
This is equivalent to finding the expected value of the gambling game. If we assume
that the dice are fair and act independently, this is a binomial random variable with
parameter (3, 61 ). Hence, by letting X denote the player’s winnings in the game,
E[X] = −P(i = 0) + P(i = 1) + 2P(i = 2) + 3P(i = 3)
     = −\binom{3}{0}(5/6)^3 + \binom{3}{1}(1/6)(5/6)^2 + 2\binom{3}{2}(1/6)^2(5/6) + 3\binom{3}{3}(1/6)^3
     = −17/216.
> i <- c(0,1,2,3)
> x <- c(-1,1,2,3)
> px <- choose(3,i)*(1/6)^i*(5/6)^(3-i)
> fractions(sum(x*px))
[1] -17/216
In other words, in the long run, the player will lose 17 units for every 216 games he plays. Hence, the game is not fair to the player (not surprisingly).
The functions in R relevant to a binomial distribution with parameters (n, p) are
• rbinom(m,n,p), to generate m independent experiments
• dbinom(k,n,p), the probability density function p(k) = P (X = k)
• pbinom(k,n,p), the cumulative distribution function F (k) = P (X ≤ k)
• qbinom(q,n,p), to find a k such that F (k) = P (X ≤ k) = q
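For instance, for the gambling game above, where the number of appearances of the chosen number in three rolls is a binomial random variable with parameters (3, 1/6), one might check (an illustrative sketch):

dbinom(0, 3, 1/6)   # P(no appearances) = (5/6)^3
pbinom(1, 3, 1/6)   # P(at most one appearance)
rbinom(5, 3, 1/6)   # five simulated counts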
4.4.4 Geometric Distributions
Suppose in an experiment, independent identical Bernoulli trials with parameter p are performed until a success appears. Let X be the random variable denoting the number of trials needed for the first success; then the probability density function is
p(k) = P (X = k) = (1 − p)k−1 p, k ≥ 1.
Remark. R defines the geometric distribution as the number of failures before the first success, instead of the number of trials needed to achieve the first success. This explains why, for example, dgeom(k,p) computes p(k + 1) instead of p(k).
Example. Suppose there is a 10% chance that a particular pokémon card vending machine dispenses a rare pokémon card. Let X denote the random variable representing the number of cards dispensed to obtain the first rare pokémon card. Then the probability that the fourth card is the first rare pokémon card is
> dgeom(3,0.1)
[1] 0.0729
X = G1 + G2 + · · · + Gm .
Note also that G1 , G2 , ..., Gm are independent identical trials. Therefore, the expectation
and variance are
E[X] = m/p,
Var[X] = m(1 − p)/p^2.
Note also that the derivations above tell us that if we have m independent identical geometric distributions G_1, G_2, ..., G_m with parameter p, then the sum X = G_1 + G_2 + · · · + G_m is a negative binomial distribution with parameters (m, p). By induction, if X_1, ..., X_r are negative binomial distributions with parameters (m_1, p), (m_2, p), ..., (m_r, p), respectively, then the sum X = X_1 + X_2 + · · · + X_r is also a negative binomial distribution with parameters (Σ_{i=1}^r m_i, p). On the other hand, a negative binomial distribution with parameters (1, p) is a geometric distribution with parameter p.
This is a negative binomial distribution with parameters (4, 0.7), and we are finding
the probability
p(8) = \binom{8−1}{4−1} 0.7^4 · 0.3^4 = 0.06806835.
> dnbinom(4,4,0.7)
[1] 0.06806835
p(k) = P(X = k) = e^{−λ} λ^k/k!, k ≥ 0.
The Poisson distribution is very popular for modeling the number of times a particular event occurs in a given time interval or on a defined space. For example, the number of customers that walk into a given shop between 9 a.m. and 10 a.m., or the number of people queuing at a particular ATM. The expectation and variance are
E[X] = λ,
V ar[X] = λ.
Example. Suppose a piece of glass rod of unit length drops and breaks into pieces. Let’s
assume that the number of broken pieces is a Poisson distribution of parameter λ, and
the break points are uniformly distributed. Let X be the length of the shortest broken
piece. Let us simulate this. Note that the support of a Poisson distribution starts from
0, however, we cannot have 0 (broken) pieces. Hence, let the number of break points be
a Poisson distribution.
Let us explain the code. The value no breakpts is the number of break points. Then runif randomly and uniformly generates no breakpts numbers between 0 and 1. The function sort orders the numbers from smallest to largest. The function diff then computes the differences between adjacent numbers in c(0,breakpts,1). Note that we must include the end points 0 and 1 to account for the lengths from the start of the rod to the first break point and from the last break point to the end of the rod.
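Based on the description above, a sketch of how the minpiece function might look is the following (the variable name no_breakpts and the exact structure are assumptions, not the original code):

minpiece <- function(lambda){
  no_breakpts <- rpois(1, lambda)        # number of break points
  breakpts <- sort(runif(no_breakpts))   # break points, uniform on (0,1), ordered
  min(diff(c(0, breakpts, 1)))           # length of the shortest piece (the whole rod if there is no break)
}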
EX <- mean(replicate(100000,minpiece(4)))
> EX
[1] 0.08190401
> Var=mean(replicate(100000,minpiece(4))^2)-EX^2
> Var
[1] 0.02231847
Remark. See section 4.6.1 for the function runif(n,min,max). In summary, it simulates n uniform random variables on the interval [min, max], where the defaults for min and max are 0 and 1, respectively.
The Poisson distribution was introduced by Siméon Denis Poisson in a book he wrote
regarding the application of probability theory to lawsuits, criminal trials, and the like.
This book, published in 1837, was entitled Recherches sur la probabilitè des jugements en
matière criminelle et en matière civile (Investigations into the Probability of Verdicts in
Criminal and Civil Matters).
p(k) = P(X = k) ≃ e^{−λ} λ^k/k!.
In other words, if n independent trials, each of which results in a success with probability
p, are performed, then when n is large and p is small enough to make np moderate, the
number of successes occurring is approximately a Poisson random variable with param-
eter λ = np. This value λ (which is the expected number of successes) will usually be
determined empirically. (This discussion is from 7.)
P(X ≤ 2) = e^{−3.2} + 3.2e^{−3.2} + (3.2^2/2)e^{−3.2} ≃ 0.3799.
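This value can be checked with the cumulative distribution function of the Poisson distribution in R:

ppois(2, 3.2)   # P(X <= 2), approximately 0.3799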
Remark. If instead of the average number λ of events, we are given the average rate r of an event occurring, then letting t be a time frame or interval length (noting that the units must agree with those of r), the probability density function of a Poisson distribution is

p(k) = e^{−rt} (rt)^k/k!.
2. Be careful that the term continuous random variable is used to distinguish it from a discrete random variable. It is not necessary that the probability density function of a continuous random variable X is continuous.
Example. The amount of time in hours that a computer functions before breaking down
is a continuous random variable with probability density function given by
p(x) = λe^{−x/100} for x ≥ 0, and p(x) = 0 for x < 0.
1. Find the probability that it will function between 50 and 150 hours before breaking
down.
2. Find the probability that it will last more than 100 hours.
Since ∫_R p(x)dx = 1, we have

1 = ∫_{−∞}^{∞} p(x)dx = lim_{N→∞} ∫_0^N λe^{−x/100} dx = λ(−100) lim_{N→∞} [e^{−x/100}]_0^N = −100λ(lim_{N→∞} e^{−N/100} − 1) = 100λ,

so λ = 1/100.
1. The probability that a computer will function between 50 and 150 hours before
breaking down is
P(50 ≤ X ≤ 150) = ∫_{50}^{150} (1/100)e^{−x/100} dx = [−e^{−x/100}]_{50}^{150} = e^{−1/2} − e^{−3/2} ≃ 0.383.
2. The probability that a computer will last longer than 100 hours is
P(X > 100) = ∫_{100}^{∞} (1/100)e^{−x/100} dx = [−e^{−x/100}]_{100}^{∞} = e^{−1} ≃ 0.368.
Note that this is also 1−F (100), where F (t) is the cumulative distribution function.
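Both probabilities can be checked with the exponential distribution functions in R (with rate λ = 1/100):

pexp(150, 1/100) - pexp(50, 1/100)   # P(50 <= X <= 150), approximately 0.383
1 - pexp(100, 1/100)                 # P(X > 100) = e^{-1}, approximately 0.368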
Here are some properties of the probability and cumulative distribution function. Let
X be a continuous random variable with probability density function p(x).
(i) The probability of a point is 0: P(X = a) = 0 for any real number a. This follows from

P(X = a) = lim_{ε→0} P(a − ε ≤ X ≤ a + ε) = lim_{ε→0} ∫_{a−ε}^{a+ε} p(x)dx ≃ lim_{ε→0} 2εp(a) = 0.
(ii) P(X < b) = P(X ≤ b) = ∫_{−∞}^{b} p(x)dx and P(X > a) = P(X ≥ a) = ∫_{a}^{∞} p(x)dx (see section 3.7.7). Hence,

P(a < X < b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a ≤ X ≤ b) = ∫_a^b p(x)dx.
(iii) Since F(t) = ∫_{−∞}^{t} p(x)dx, by the fundamental theorem of calculus,

F′(a) = p(a), for all a ∈ R.
That is, the density is the derivative of the cumulative distribution function.
E[aX + b] = aE[X] + b.

Proof. Let p(x) be the probability density function of X and take g(X) = aX + b in the theorem above; then

E[aX + b] = ∫_{−∞}^{∞} (ax + b)p(x)dx = a∫_{−∞}^{∞} xp(x)dx + b∫_{−∞}^{∞} p(x)dx = aE[X] + b,

where in the last equality we use the fact that ∫_R p(x)dx = 1.
Remark. 1. As in the case for discrete random variable, we need to define joint
probability density function to be able to prove the results above.
2. By induction on property (i), we have for random variables X_1, X_2, ..., X_k, and real numbers a_1, a_2, ..., a_k ∈ R,

E[a_1X_1 + a_2X_2 + · · · + a_kX_k] = a_1E[X_1] + a_2E[X_2] + · · · + a_kE[X_k].
Lemma.
V ar[X] = E[X 2 ] − E[X]2 .
V ar[aX + b] = a2 V ar[X].
Corollary (Properties of Variance). From the theorem, we get the following properties.
2. If X_1, X_2, ..., X_n are independent random variables, then for any real numbers a_1, a_2, ..., a_n ∈ R,

Var[a_1X_1 + a_2X_2 + · · · + a_nX_n] = a_1^2 Var[X_1] + a_2^2 Var[X_2] + · · · + a_n^2 Var[X_n].
Remark. Note that in this course we have chosen the definition of a uniform distribution
to exclude the end points a and b.
The expected value and variance of a uniform distribution on the interval (a, b) are
E[X] = (b + a)/2,
Var[X] = (b − a)^2/12.
Proof.

E[X] = ∫_a^b x/(b − a) dx = (1/(b − a)) [x^2/2]_a^b = (b^2 − a^2)/(2(b − a)) = (b + a)/2,

and

E[X^2] = ∫_a^b x^2/(b − a) dx = (1/(b − a)) [x^3/3]_a^b = (b^3 − a^3)/(3(b − a)) = (b^2 + ab + a^2)/3,

and thus

Var[X] = (b^2 + ab + a^2)/3 − ((b + a)/2)^2 = (4b^2 + 4ab + 4a^2 − 3b^2 − 6ab − 3a^2)/12 = (b − a)^2/12.
Let X denote the number of minutes past 7 a.m. that the man arrives at the station. Then X is a uniform distribution on (0, 15).
For the man to wait less than 2 minutes, he must arrive between 7:05 and 7:07, and
between 7:12 and 7:14. Hence, the probability is
P(5 < X < 7) + P(12 < X < 14) = (7 − 5)/15 + (14 − 12)/15 = 4/15.
For the man to wait more than 5 minutes, he must arrive between 7 and 7:02, between
7:07 and 7:09, and between 7:14 and 7:15. Hence, the probability is
P(0 < X < 2) + P(7 < X < 9) + P(14 < X < 15) = (2 − 0)/15 + (9 − 7)/15 + (15 − 14)/15 = 1/3.
Hence, he is more likely to wait more than 5 minutes for the train than to wait less
than 2 minutes for the train.
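These two probabilities can also be computed with punif (an illustrative sketch):

(punif(7,0,15) - punif(5,0,15)) + (punif(14,0,15) - punif(12,0,15))                                    # wait less than 2 minutes: 4/15
(punif(2,0,15) - punif(0,0,15)) + (punif(9,0,15) - punif(7,0,15)) + (punif(15,0,15) - punif(14,0,15))  # wait more than 5 minutes: 1/3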
The expected value and variance of the standard normal distribution are
E[Z] = 0,
V ar[Z] = 1.
Hence, the expected value and variance of a normal distribution with parameters (µ, σ) are

E[X] = µ,
Var[X] = σ^2,

that is, the parameters of a normal distribution are the mean and the standard deviation (which explains the use of the notations µ and σ).
It is usually assumed that the marks distribution of a class follows a normal distribution, provided the class is large enough. This will be justified in section 4.7.6. So, if X is the mark obtained by a randomly chosen student, X is a normal distribution with parameters (45, 12).
2. An expert witness in a paternity suit testifies that the length (in days) of hu-
man gestation is approximately normally distributed with parameters µ = 270 and
σ 2 = 100. The defendant in the suit is able to prove that he was out of the country
during a period that began 290 days before the birth of the child and ended 240
days before the birth. If the defendant was, in fact, the father of the child, what
is the probability that the mother could have had the very long or short gestation
indicated by the testimony?
Let X denote the length of the gestation. and assume that the defendant is the
father. Then the probability that the birth could occur within the indicated period
is
> 1-pnorm(290,270,10)+pnorm(240,270,10)
[1] 0.02410003
Exercise. Is this the probability that the defendant is the father of the child?
This is called an exponential distribution with parameter λ. This follows from the following reasoning. Let X denote the waiting time between a pair of successive events. Since waiting
time is nonnegative, P (X < x) = 0 for x < 0. For x ≥ 0,
F(x) = P(X ≤ x) = 1 − P(X > x)
     = 1 − P(no events in [0, x])
     = 1 − e^{−λx} (λx)^0/0!
     = 1 − e^{−λx}.
Hence, noting that the derivative of a cumulative distribution function is the probability density function, we have d/dx F(x) = p(x) = λe^{−λx}, as desired (see the remarks at the end of section 4.4.6). The derivations above also show that the cumulative distribution function is given by

F(c) = 1 − e^{−λc}.
The exponential distributions are memoryless, that is, the waiting time until the next event is independent of how long one has already waited. Precisely,
for any positive real numbers t1 , t2 ∈ R,
P (X > t1 + t2 |X > t1 ) = P (X > t2 ).
This follows from
P(X > t_1 + t_2 | X > t_1) = P((X > t_1 + t_2) ∩ (X > t_1))/P(X > t_1) = P(X > t_1 + t_2)/P(X > t_1)
= ∫_{t_1+t_2}^{∞} λe^{−λx} dx / ∫_{t_1}^{∞} λe^{−λx} dx = [−e^{−λx}]_{t_1+t_2}^{∞} / [−e^{−λx}]_{t_1}^{∞} = e^{−λ(t_1+t_2)}/e^{−λt_1}
= e^{−λt_2} = P(X > t_2).
In fact, the exponential distributions are the only distributions that possess this property. Readers may refer to the appendix for details.
This says, for example, that if the waiting time for finding a parking space in a busy mall follows an exponential distribution, then the probability that a patron has to wait, say, a further fifteen minutes after searching for a while, is the same as the probability that he has to wait fifteen minutes when he first arrives at the carpark.
(a) What is the probability that a ride costs more than $5.50?
First, let S be the random variable denoting the distance travelled. Since the mean is 1/λ = E[S] = 10, λ = 0.1. Next, since 5.50 > 2.50, the ride must be more than 1 km. So,

P(X > 5.50) = P(S − 1 > (5.5 − 2.5)/0.5) = P(S > 7) = e^{−7(0.1)} ≃ 0.497.
> exp(-7*0.1)
[1] 0.4965853
> 1-pexp(7,0.1)
[1] 0.4965853
(b) What is the expected value E[X]?
Then
E[X] = 2.5 + ∫_1^∞ (2 + 0.5s)(0.1)e^{−0.1s} ds
     = 2.5 + 0.2∫_1^∞ e^{−0.1s} ds + 0.05∫_1^∞ s e^{−0.1s} ds
     ≃ 9.29.
2. Suppose the time a patient has to wait in minutes in a clinic for his turn is expo-
nentially distributed with mean 15 minutes. What is the probability that he has to
wait for more than 10 minutes given that the previous patient has already entered
the consultation room for 3 minutes?
Since the mean is 15 minutes, λ = 1/15. By the memoryless property of the exponential distribution,
> 1-pexp(7,1/15)
[1] 0.6270891
The R function for the gamma function is gamma(x), for any real number x.
A random variable is said to have a gamma distribution with parameters (α, λ), for
some positive λ, α > 0, if its probability density function is given by
p(x) = λe^{−λx}(λx)^{α−1}/Γ(α) for x ≥ 0, and p(x) = 0 for x < 0.
When α = n, the gamma distribution with parameters (n, λ) is the distribution of the amount of time one has to wait for a total of n independent Poisson-distributed events with parameter λ. This distribution is called an Erlang distribution. The probability density function is

p(x) = λe^{−λx}(λx)^{n−1}/(n − 1)! for x ≥ 0, and p(x) = 0 for x < 0.
Readers may refer to the appendix for details.
1. When α = 1, the gamma distribution with parameters (1, λ) is an exponential distribution with parameter λ:

p(x) = λe^{−λx}(λx)^{1−1}/Γ(1) = λe^{−λx}.
2. When α = n/2 and λ = 1/2, the gamma distribution with parameters (n/2, 1/2) is a chi-square distribution with n degrees of freedom (unfortunately, we will not be discussing chi-square distributions in this course).
The R functions relevant to a gamma distribution with parameters (α, λ) are
• rgamma(n,alpha,lambda), to generate n independent experiments
• dgamma(x,alpha,lambda), the probability density function p(x)
• pgamma(k,alpha,lambda), the cumulative distribution function F (k) = P (X ≤ k)
• qgamma(q,alpha,lambda), to find a c such that F (c) = P (X ≤ c) = q
The default for lambda is 1.
Example. 1. Suppose in a network context, a node does not transmit until it has ac-
cumulated five messages in its buffer. Suppose the times between message arrivals are independent and exponentially distributed with mean 100 milliseconds. What is the probability that more than 552 milliseconds will pass before a transmission is made, starting with an empty buffer?
Since the mean of the inter-arrival (exponential) distribution is 100 milliseconds, λ = 1/100 = 0.01. Hence, the time until the accumulation of five messages is a gamma distribution with parameters (α = 5, λ = 0.01). So,
P (X > 552) ≃ 0.35.
> 1-pgamma(552,5,0.01)
[1] 0.3544101
2. Suppose that the average arrival rate at a local fast food drive-through window is three cars per minute (λ = 3). If one car has already gone through the drive-through, what is the average waiting time before the third car arrives?
The problem is asking for the mean of a gamma distribution with parameters (α =
2, λ = 3), which is
E[X] = α/λ = 2/3.
The reasoning is as follows. Since on average there are 3 cars per minute, on average the time between the arrivals of consecutive cars is 1/3 of a minute. Since 1 car has already gone through, the average waiting time before the third car arrives is the average waiting time for 2 more cars to arrive, which is 2 times 1/3 of a minute.
B(α, β) = Γ(α)Γ(β)/Γ(α + β).
for all 0 < x < 1. This is the reason this distribution is called a beta distribution, and clearly

∫_R p(x)dx = (∫_0^1 x^{α−1}(1 − x)^{β−1} dx)/B(α, β) = B(α, β)/B(α, β) = 1.
The expected value and variance of a standard beta distribution with parameters (α, β) are

E[X] = α/(α + β),
Var[X] = αβ/((α + β)^2(α + β + 1)).
More generally, the probability density function of the beta distribution with parameters (α, β) over the interval (a, b) is defined to be

p(x) = (1/(b − a)) (1/B(α, β)) ((x − a)/(b − a))^{α−1} ((b − x)/(b − a))^{β−1},
Let X represent the proportion of the rice sold before 3p.m. Then the expected
value is
E[X] = α/(α + β) = 3/5.
The probability that at least 85% of the rice in the bucket is sold before 3p.m is
P (X ≥ 0.85) = 1 − P (X < 0.85) ≃ 0.11
> 1-pbeta(0.85,3,2)
[1] 0.1095188
The graph of the probability density function of the above beta distribution is as
follows.
> curve(dbeta(x,3,2))
2. Project managers often use a Program Evaluation and Review Technique (PERT)
to manage large scale projects. PERT was actually developed by the consulting
firm of Booz, Allen, & Hamilton in conjunction with the United States Navy as
a tool for coordinating the activities of several thousands of contractors working
on the Polaris missile project. A standard assumption in PERT analysis is that
the time to complete any given activity follows a general beta distribution, where
a is the optimistic time to complete an activity and b is the pessimistic time to
complete the activity. Suppose the time X (in hours) it takes a three man crew to
re-roof a single-family house has a beta distribution with a = 8, b = 16, α = 2, and
β = 3. The crew will complete the reroofing in a single day provided the total time
to complete the job is no more than 10 hours. If this crew is contracted to re-roof a
single-family house, what is the chance that they will finish the job in the same day?
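One way to compute this in R (a sketch, not from the notes): since X is a general beta distribution on (a, b) = (8, 16), the standardized variable (X − a)/(b − a) is a standard beta distribution with parameters (2, 3), so

pbeta((10 - 8)/(16 - 8), 2, 3)   # P(X <= 10), approximately 0.26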
In particular, choose values in the interval 0 < α, β ≤ 1 and/or 1 < α, β. What can
you observe about the shape of the graphs for the different values of α and β, and explain
it.
for each pair of (x, y) in the domain of X and Y is called the joint probability density function
of X and Y . It must fulfill the following properties.
(i) (Nonnegative) pX,Y (x, y) ≥ 0 for all (x, y).
(ii) (Sum to one) Σ_x Σ_y p_{X,Y}(x, y) = 1.

(iii) (Probability of an event) P((X, Y) ∈ A) = Σ_{(x,y)∈A} p_{X,Y}(x, y).
Let pX,Y be the joint probability density function for discrete random variables X
and Y . We are able to obtain the probability density functions of X and Y , called the
marginal probability density functions by
p_X(x) = Σ_y p_{X,Y}(x, y),
p_Y(y) = Σ_x p_{X,Y}(x, y),

respectively.
Example. 1. Suppose that 3 balls are randomly selected from a bag containing 3 red,
4 white, and 5 blue balls. If we let X and Y denote, respectively, the number of
red and white balls chosen, then the joint probability mass function of X and Y , is
given by
p_{X,Y}(x, y) = \binom{3}{x}\binom{4}{y}\binom{5}{3−x−y} / \binom{12}{3}, for 0 ≤ x + y ≤ 3.
> x <- y <- seq(from=0,to=3)
> p <- outer(x,y,function(x,y){
choose(3,x)*choose(4,y)*choose(5,3-x-y)/choose(12,3)})
> rownames(p) <- c("X=0","X=1","X=2","X=3")
> colnames(p) <- c("Y=0","Y=1","Y=2","Y=3")
> fractions(p)
Y=0 Y=1 Y=2 Y=3
X=0 1/22 2/11 3/22 1/55
X=1 3/22 3/11 9/110 0
X=2 3/44 3/55 0 0
X=3 1/220 0 0 0
Indeed,
> fractions(choose(3,x)*choose(9,3-x)/choose(12,3))
[1] 21/55 27/55 27/220 1/220
> fractions(choose(4,y)*choose(8,3-y)/choose(12,3))
[1] 14/55 28/55 12/55 1/55
Exercise. Find the marginal probability density functions and the joint cumulative
distribution function.
Exercise. In both the examples above, verify that the joint probability density functions
satisfy the sum to one property.
Continuous random variables
A probability density function p : R^n → R with n ≥ 2 is called a joint probability density function. We will only discuss the case n = 2, that is, X and Y are continuous random variables, and p_{X,Y} : R^2 → R is the joint probability density function. Recall that p_{X,Y} must satisfy the following properties.
Let pX,Y be the joint probability density function for continuous random variables X
and Y . The marginal probability density functions for X and Y are
p_X(x) = ∫_{−∞}^{∞} p_{X,Y}(x, y) dy,
p_Y(y) = ∫_{−∞}^{∞} p_{X,Y}(x, y) dx,
respectively. That is, the probabilities for X ∈ A and Y ∈ B for subsets A, B ⊆ R are

P(X ∈ A) = ∫_A p_X(x) dx = ∫_A ∫_{−∞}^{∞} p_{X,Y}(x, y) dy dx,
P(Y ∈ B) = ∫_B p_Y(y) dy = ∫_B ∫_{−∞}^{∞} p_{X,Y}(x, y) dx dy,

respectively.
Compute
(a) P ((X > 1) ∩ (Y < 1)),
P((X > 1) ∩ (Y < 1)) = ∫_0^1 ∫_1^∞ 2e^{−x}e^{−2y} dx dy = 2∫_0^1 e^{−2y} dy ∫_1^∞ e^{−x} dx = (1 − e^{−2})(e^{−1}).
(Observe that the random variables X and Y are independent, see section
4.7.2.)
(b) P(X < Y),

P(X < Y) = ∫∫_{0<x<y<∞} p_{X,Y} dA = ∫_0^∞ ∫_0^y 2e^{−x}e^{−2y} dx dy = ∫_0^∞ 2e^{−2y}(1 − e^{−y}) dy = 1 − 2/3 = 1/3.

(c) P(X < a), for a > 0,

P(X < a) = ∫_0^a ∫_0^∞ 2e^{−x}e^{−2y} dy dx = ∫_0^a e^{−x} dx = 1 − e^{−a}.
Exercise. Refer to section 3.5.4. Compute the probabilities above using the R
function integrate.
2. Consider a circle of radius R, and suppose that a point within the circle is randomly
chosen in such a manner that all regions within the circle of equal area are equally
likely to contain the point. (In other words, the point is uniformly distributed
within the circle.) If we let the center of the circle denote the origin and define X
and Y to be the coordinates of the point chosen (see picture below),
then, since (X, Y ) is equally likely to be near each point in the circle, it follows that
the joint density function of X and Y is given by

p_{X,Y}(x, y) = c if x^2 + y^2 ≤ R^2, and p_{X,Y}(x, y) = 0 if x^2 + y^2 > R^2,
(a) Determine c.
Hence,

c = 1/(πR^2).
(b) Find the marginal probability density functions of X and Y.
P(D ≤ a) = P(√(X^2 + Y^2) ≤ a) = P(X^2 + Y^2 ≤ a^2)
         = ∫∫_{x^2+y^2≤a^2} 1/(πR^2) dA
         = (1/(πR^2)) · (area of circle of radius a)
         = (1/(πR^2))(πa^2) = (a/R)^2.
P (A ∩ B) = P (A) · P (B).
Translating this to discrete random variables, two discrete random variables X and Y are independent if for any a and b in the supports of X and Y respectively,

P((X = a) ∩ (Y = b)) = P(X = a) · P(Y = b).

In other words, the joint probability density function is the product of the marginal probability density functions,

p_{X,Y}(x, y) = p_X(x) p_Y(y).

This extends verbatim to continuous random variables. We say that X and Y are dependent otherwise.
Indeed, suppose X and Y are independent random variables with joint probability density function p_{X,Y}(x, y) = p_X(x)p_Y(y); then for any subsets A and B of the supports of X and Y,

P((X ∈ A) ∩ (Y ∈ B)) = Σ_{x∈A} Σ_{y∈B} p_{X,Y}(x, y) = Σ_{x∈A} Σ_{y∈B} p_X(x)p_Y(y) = (Σ_{x∈A} p_X(x))(Σ_{y∈B} p_Y(y)) = P(X ∈ A)P(Y ∈ B).
A necessary and sufficient condition for the random variables X and Y to be inde-
pendent is for their joint probability density function pX,Y (x, y) to factor into two terms,
one depending only on x and the other depending only on y. Readers may refer to the
appendix for the proof and further details.
Theorem. The continuous (discrete) random variables X and Y are independent if and only if their joint probability density function can be expressed as

p_{X,Y}(x, y) = f(x)g(y), for some functions f and g depending only on x and y, respectively.
Let M and F denote the number of males and females, respectively, that enter the post office in a given day. Our task is then to show that p_{M,F} = p_M p_F. Conditioning on M + F, which is a Poisson random variable, we have
pM,F (i, j) = P ((M = i) ∩ (F = j))
= P ((M = i) ∩ (F = j) | M + F = i + j)P (M + F = i + j)
+P ((M = i) ∩ (F = j) | M + F ̸= i + j)P (M + F ̸= i + j)
= P ((M = i) ∩ (F = j) | M + F = i + j)P (M + F = i + j)
since P ((M = i) ∩ (F = j) | M + F ̸= i + j) = 0 (it is not possible that M = i and
F = j, but the total M + F = i + j). Since M + F is a Poisson random variable
with parameters λ,
P(M + F = i + j) = e^{−λ} λ^{i+j}/(i + j)!.
Next, given that there are i + j number of people entering the post office and the
number of males entering is a binomial random distribution with probability of
success p, we have
P((M = i) ∩ (F = j) | M + F = i + j) = \binom{i+j}{i} p^i (1 − p)^j.
Hence,
p_{M,F}(i, j) = P((M = i) ∩ (F = j) | M + F = i + j) P(M + F = i + j)
            = \binom{i+j}{i} p^i (1 − p)^j e^{−λ} λ^{i+j}/(i + j)!
            = e^{−λ} ((i + j)!/(i!j!)) p^i (1 − p)^j λ^{i+j}/(i + j)!
            = e^{−λ} (λp)^i/i! · (λ(1 − p))^j/j!
            = e^{−λp} (λp)^i/i! · e^{−λ(1−p)} (λ(1 − p))^j/j!.
Since the joint probability density function splits into a product of a function of i alone and a function of j alone, we conclude that the random variables M and F are independent with probability density functions
P(M = i) = e^{−λp} (λp)^i/i!,
P(F = j) = e^{−λ(1−p)} (λ(1 − p))^j/j!,
which are Poisson distributions with parameters λp and λ(1 − p), respectively.
4. A man and a woman decide to meet at a certain location. If each of them inde-
pendently arrives at a time uniformly distributed between 12 noon and 1 p.m., find
the probability that the first to arrive has to wait longer than 10 minutes.
Let M and W denote, respectively, the time in minutes past 12 that the man and
woman arrive. Then M and W are uniform distributions over the interval (0, 60).
The desired probability is

P((M + 10 < W) ∪ (W + 10 < M)) = P(M + 10 < W) + P(W + 10 < M)
= ∫∫_{m+10<w, 0<m<60} p_{M,W}(m, w) dA + ∫∫_{w+10<m, 0<w<60} p_{M,W}(m, w) dA
= ∫_{10}^{60} ∫_0^{w−10} (1/60^2) dm dw + ∫_{10}^{60} ∫_0^{m−10} (1/60^2) dw dm
= (2/60^2) ∫_{10}^{60} (w − 10) dw
= (2/60^2) [w^2/2 − 10w]_{10}^{60}
= 25/36.
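A quick simulation of this problem (a sketch, not from the notes):

meet <- function(nreps){
  m <- runif(nreps, 0, 60)   # man's arrival time in minutes past 12
  w <- runif(nreps, 0, 60)   # woman's arrival time in minutes past 12
  mean(abs(m - w) > 10)      # proportion of repetitions where the first to arrive waits more than 10 minutes
}
meet(100000)                 # approximately 25/36 = 0.694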
P(A|B) = P(A ∩ B)/P(B).
Now suppose X and Y are two discrete random variables with joint probability den-
sity function pX,Y , and a and b are in the support of X and Y respectively, define the
conditional probability density function of X given that Y = b to be
p_{X|Y}(x|b) = P(X = x | Y = b) = p_{X,Y}(x, b)/p_Y(b).
Remark. Note that since b is in the support of Y , pY (b) > 0, hence, the function above
is well-defined.
where the third equality follows from the fact that X and Y are independent, and the fourth equality follows from the fact that the sum of independent Poisson distributions is a Poisson distribution (see section 4.7.2).
Observe that this is a binomial distribution with parameters (n, λ1 /(λ1 + λ2 )).
The definition of the conditional probability density function for continuous random variables is analogous. Suppose X and Y are two continuous random variables with joint probability density function p_{X,Y}. The conditional probability density function of X given Y = y is

p_{X|Y}(x|y) = p_{X,Y}(x, y)/p_Y(y),

provided p_Y(y) > 0. That is, for any set of real numbers A,

P(X ∈ A | Y = y) = ∫_A p_{X|Y}(x|y) dx.
Compute the conditional probability density of X given that Y = y, for 0 < y < 1.
For 0 < x < 1,

p_{X|Y}(x|y) = p_{X,Y}(x, y)/∫_{−∞}^{∞} p_{X,Y}(x, y) dx
            = x(2 − x − y)/∫_0^1 x(2 − x − y) dx
            = x(2 − x − y)/[x^2 − x^3/3 − yx^2/2]_0^1
            = x(2 − x − y)/(2/3 − y/2)
            = 6x(2 − x − y)/(4 − 3y).
p_{X|Y}(x|y) = p_{X,Y}(x, y)/∫_{−∞}^{∞} p_{X,Y}(x, y) dx
            = (e^{−x/y} e^{−y}/y) / ∫_0^∞ (e^{−x/y} e^{−y}/y) dx
            = e^{−x/y} / ∫_0^∞ e^{−x/y} dx
            = e^{−x/y} / [−y e^{−x/y}]_0^∞
            = e^{−x/y}/y,
for 0 < x, y < ∞, and 0 otherwise. Hence,
P(X > 1 | Y = y) = ∫_1^∞ p_{X|Y}(x|y) dx = ∫_1^∞ (e^{−x/y}/y) dx = [−e^{−x/y}]_1^∞ = e^{−1/y}.
Readers may refer to the appendix for the derivations.
Conditional expectation
Let pX|Y (x|y) be the conditional probability density function of X, given that Y = y.
Define the conditional expectation of X given that Y = y as
E[X|Y = y] = Σ_x x p_{X|Y}(x|y)  (discrete),
E[X|Y = y] = ∫_{−∞}^{∞} x p_{X|Y}(x|y) dx  (continuous).
E[XY] = ∫_0^∞ ∫_0^∞ xy (e^{−x/y} e^{−y}/y) dx dy
      = ∫_0^∞ e^{−y} (∫_0^∞ x e^{−x/y} dx) dy
      = ∫_0^∞ e^{−y} ([−xy e^{−x/y}]_{x=0}^{x=∞} + ∫_0^∞ y e^{−x/y} dx) dy
      = ∫_0^∞ e^{−y} (−y^2 [e^{−x/y}]_{x=0}^{x=∞}) dy
      = ∫_0^∞ y^2 e^{−y} dy
      = [−y^2 e^{−y}]_{y=0}^{y=∞} + ∫_0^∞ 2y e^{−y} dy
      = [−2y e^{−y}]_{y=0}^{y=∞} + ∫_0^∞ 2e^{−y} dy
      = 2.
# numerical check of E[XY] = ∫∫ xy p_{X,Y}(x,y) dx dy, using xy·p_{X,Y}(x,y) = x exp(-x/y - y)
integrate(function(x){
  sapply(x,function(x){
    integrate(function(y)x*exp(-x/y-y),0,Inf)$value
  })
},0,Inf)$value
[1] 1.999994
Recall from the previous section that the conditional probability density function of
X given Y = y is
p_{X|Y}(x|y) = e^{−x/y}/y
for 0 < x, y < ∞, and 0 otherwise. Hence, the conditional expected value of X given
Y = y is
E[X|Y = y] = ∫_{−∞}^{∞} x p_{X|Y}(x|y) dx = ∫_0^∞ x (e^{−x/y}/y) dx
           = [−x e^{−x/y}]_0^∞ + ∫_0^∞ e^{−x/y} dx = [−y e^{−x/y}]_0^∞
           = y,

for 0 < y < ∞, where we use integration by parts, with u = x and v = −e^{−x/y} (dv/dx = e^{−x/y}/y).
(ii) For a real square matrix A of order n, and an n-dimensional random vector X = (X_1, X_2, ..., X_n)^T,

E[AX] = A E[X].
Covariance
Theorem (Expected value of product of functions on independent random variables). If
X and Y are independent random variables, then for any functions f and g,
E[f (X)g(Y )] = E[f (X)]E[g(Y )].
The covariance between X and Y , denoted by Cov[X, Y ], is defined by
Cov[X, Y ] = E [(X − E[X]) (Y − E[Y ])] .
Equivalently,

Cov[X, Y] = E[(X − E[X])(Y − E[Y])]
          = E[XY − X E[Y] − Y E[X] + E[X]E[Y]]
          = E[XY] − 2E[X]E[Y] + E[X]E[Y]
          = E[XY] − E[X]E[Y].
Suppose that typically when X is larger than its mean, Y is also larger than its mean,
and vice versa for below-mean values. Then (X − E[X])(Y − E[Y ]) will usually be posi-
tive, and hence their covariance is positive. Similarly, if X is often smaller than its mean
whenever Y is larger than its mean, the covariance between them will be negative. All
of this is roughly speaking, of course, since it depends on how much and how often X is
larger or smaller than its mean, etc.
Observe that
Cov[X, X] = V ar[X].
By the theorem above, it is clear that if X and Y are independent, then Cov[X, Y ] = 0.
But the converse is not true, that is, it is possible for Cov[X, Y ] = 0, but X and Y are
dependent. For example, suppose X is a random variable such that
P(X = 0) = P(X = 1) = P(X = −1) = 1/3,

and

Y = 0 if X ≠ 0, and Y = 1 if X = 0.
Then XY = 0, and so E[XY ] = 0. Also, E[X] = 0. Thus
Cov[X, Y ] = E[XY ] − E[X]E[Y ] = 0.
However, it is clear from construction that X and Y are not independent.
Theorem (Properties of covariance).
(i) (Symmetry) Cov[X, Y ] = Cov[Y, X]
(ii) (Additive constant) Cov[X + a, Y ] = Cov[X, Y ].
(iii) (Linearity) Cov[Σ_{i=1}^n a_i X_i, Σ_{j=1}^m b_j Y_j] = Σ_{i=1}^n Σ_{j=1}^m a_i b_j Cov[X_i, Y_j].
is called the sample variance. We will compute (a) V ar[X] and (b) E[S 2 ].
(a) Since X1 , ..., Xn are independent, they are pairwise independent too. Hence,
" n # n
X Xi 1 X
V ar[X] = V ar = 2 V ar[Xi ]
i=1
n n i=1
1 2 σ2
= (nσ ) = .
n2 n
This shows that if the sample size is large enough, the variance of the sample mean is small, and hence the sample mean is a good estimate of the actual mean.
(b)

E[S^2] = E[(1/(n−1)) Σ_{i=1}^n (X_i − µ + µ − X̄)^2]
       = E[(1/(n−1)) Σ_{i=1}^n (X_i − µ)^2 + (1/(n−1)) Σ_{i=1}^n (X̄ − µ)^2 − 2(X̄ − µ)(1/(n−1)) Σ_{i=1}^n (X_i − µ)]
       = E[(1/(n−1)) Σ_{i=1}^n (X_i − µ)^2 + (n/(n−1))(X̄ − µ)^2 − (2n/(n−1))(X̄ − µ)(X̄ − µ)]
       = E[(1/(n−1)) Σ_{i=1}^n (X_i − µ)^2 − (n/(n−1))(X̄ − µ)^2]
       = (1/(n−1)) Σ_{i=1}^n E[(X_i − µ)^2] − (n/(n−1)) E[(X̄ − µ)^2]
       = (1/(n−1)) Σ_{i=1}^n Var[X_i] − (n/(n−1)) Var[X̄]
       = (n/(n−1)) σ^2 − (n/(n−1)) (σ^2/n)
       = σ^2,

where we used the definition of the sample mean X̄ = Σ_{i=1}^n X_i/n to replace Σ_{i=1}^n (X_i − µ) by n(X̄ − µ), together with the facts that the means of X_i and of X̄ are µ, that Var[X_i] = σ^2 for all i = 1, ..., n, and that Var[X̄] = σ^2/n from part (a).
This shows that the expected value of the sample variance equals the distribution variance, that is, the sample variance can be used to estimate the distribution variance.
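A quick numerical illustration of both facts (a sketch using standard normal samples, so µ = 0 and σ^2 = 1, with n = 10):

n <- 10
xbars <- replicate(100000, mean(rnorm(n)))
s2s <- replicate(100000, var(rnorm(n)))   # var() in R uses the 1/(n-1) definition above
var(xbars)   # approximately sigma^2/n = 0.1
mean(s2s)    # approximately sigma^2 = 1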
Covariance matrices

Let X be a random vector. The covariance matrix of X = (X_1, X_2, ..., X_n)^T is the n × n matrix Σ whose (i, j)-entry is Cov[X_i, X_j]; equivalently, Σ = E[(X − E[X])(X − E[X])^T].
Correlation
Covariance does measure how much or how little X and Y vary together, but it is hard to decide whether a given value of covariance is “large” or not. For example, if we change the units of both variables from meters to centimeters, then by the linearity property of covariance, Cov[X, Y] increases by a factor of 100^2. Thus it makes sense to scale covariance according to the variables' standard deviations. Accordingly, the correlation between two random variables X and Y is defined by

ρ(X, Y) = Cov[X, Y]/(√(Var[X]) √(Var[Y])),

provided Var[X]Var[Y] > 0. So, the correlation is unitless, that is, it does not depend on which units we are using for our variables. Moreover, it is bounded between −1 and 1 (correlation is a kind of normalization of covariance).
Lemma. For random variables X and Y ,
−1 ≤ ρ(X, Y ) ≤ 1,
where we use the fact that covariance is symmetric, Cov[X, Y] = Cov[Y, X]. Hence,

√(det(Σ)) = √(σ_X^2 σ_Y^2 − Cov[X, Y]^2) = σ_X σ_Y √(1 − (Cov[X, Y]/(σ_X σ_Y))^2) = σ_X σ_Y √(1 − ρ^2).

Next,

Σ^{−1} = [ σ_X^2       Cov[X, Y] ]^{−1}  =  (1/(σ_X^2 σ_Y^2 (1 − ρ^2))) [ σ_Y^2        −Cov[X, Y] ]
         [ Cov[Y, X]   σ_Y^2     ]                                       [ −Cov[Y, X]   σ_X^2      ],

and hence

(X − µ)^T Σ^{−1} (X − µ) = (1/(σ_X^2 σ_Y^2 (1 − ρ^2))) [(x − µ_X)^2 σ_Y^2 + (y − µ_Y)^2 σ_X^2 − 2Cov[X, Y](x − µ_X)(y − µ_Y)]
                        = (1/(1 − ρ^2)) [((x − µ_X)/σ_X)^2 + ((y − µ_Y)/σ_Y)^2 − 2ρ(x − µ_X)(y − µ_Y)/(σ_X σ_Y)].
p(x) = (1/((2π)^{n/2} √(det(Σ)))) e^{−(1/2)(x − µ)^T Σ^{−1}(x − µ)}.
The R functions for the multivariate normal distribution can be found in the package mvtnorm. The multivariate normal joint probability density function in R is dmvnorm. Here are some of the arguments we will need,

dmvnorm(x, mean, sigma)
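For example, the density of a bivariate normal distribution with zero means, unit variances, and correlation 0.5, evaluated at the origin (an illustrative sketch):

library(mvtnorm)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)     # covariance matrix with correlation 0.5
dmvnorm(c(0, 0), mean = mu, sigma = Sigma)   # approximately 0.184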
Theorem (Univariate central limit theorem). Let X1 , X2 , ..., Xn , ... be a sequence of inde-
pendent and identically distributed random variables with common mean µ and variance
σ 2 . Then for large n, the new random variable X1 + X2 + · · · + Xn is approximately
normal with mean nµ and variance nσ^2. In other words, the distribution of

(X_1 + X_2 + · · · + X_n − nµ)/(√n σ)

approaches that of a standard normal random variable as n → ∞.
Example. 1. Binomially distributed random variables, though discrete, also are ap-
proximately normally distributed. For example, let’s find the approximate proba-
bility of getting more than 60 heads in 100 tosses of a coin.
> 1-pbinom(60,100,1/2)
[1] 0.0176001
Recall that the mean of a Bernoulli distribution with parameter p is p, and the variance is p(1 − p). By the central limit theorem, the new random variable X_1 + X_2 + · · · + X_{100}, where X_i is the outcome of the i-th toss, is approximately normal with mean 100(0.5) = 50 and variance 100(0.5)(0.5) = 5^2.
> 1-pnorm(60,50,5)
[1] 0.02275013
That doesn’t seem very accurate. The problem is, do we treat the problem as
P (X > 60) or P (X ≥ 61)? Let’s try P (X ≥ 61).
> 1-pnorm(61,50,5)
[1] 0.01390345
So, P (X > 60) is too big and P (X ≥ 61) is too small, which tells us that the
answer is somewhere in between.
> 1-pnorm(60.5,50,5)
[1] 0.01786442
Now this probability is close to the actual one. This is known as correction for continuity.
Suppose X_1, X_2, ..., X_n are the n measurements; then, from the central limit theorem, it follows that

Z_n = (Σ_{i=1}^n X_i − nd)/(2√n)

has approximately a standard normal distribution. The average of the n readings is Σ_{i=1}^n X_i/n. The task is then to find the probability that the difference between the average and d is accurate to within ±0.5 light-year.
P(−0.5 ≤ Σ_{i=1}^n X_i/n − d ≤ 0.5) = P(−0.5·(√n/2) ≤ Z_n ≤ 0.5·(√n/2)) ≃ Φ(√n/4) − Φ(−√n/4).

By symmetry, Φ(−√n/4) = 1 − Φ(√n/4). Hence, the probability is 2Φ(√n/4) − 1.
Suppose the astronomer wants to be 95% certain that his estimated value is accurate to within ±0.5 light-year. Then we need

2Φ(√n/4) − 1 ≥ 0.95,

or

Φ(√n/4) ≥ 0.975.

> qnorm(0.975)
[1] 1.959964

Hence,

√n/4 ≥ 1.96  ⇒  n ≥ 61.5,

that is, he needs to take 62 readings.
Central limit theorems also exist when the Xi are independent, but not necessarily
identically distributed random variables. One version, by no means the most general, is
as follows.
Theorem (Central limit theorem for independent random variables). Let X1 , X2 , ..., Xn , ...
be a sequence of independent random variables having respective means µi = E[Xi ] and
variance σi2 = V ar[Xi ]. If
(i) the X_i are uniformly bounded, that is, there is an M such that P(|X_i| < M) = 1 for all i, and

(ii) Σ_{i=1}^∞ σ_i^2 = ∞,

then

P[ (Σ_{i=1}^n (X_i − µ_i)) / √(Σ_{i=1}^n σ_i^2) ≤ a ] → Φ(a), as n → ∞.
Finally, we have the central limit theorem for multivariate independent identically
distributed random vectors.
Theorem (Multivariate central limit theorem). Suppose X1 , X2 , ..., Xn , ... are indepen-
dent random vectors, all having the same distribution which has mean vector µ and
covariance matrix Σ. Then for large n, the new random variable X1 + X2 + · · · + Xn is
approximately multivariate normal with mean nµ and covariance matrix nΣ. That is,
Proof. In order to compute E[g(X)], we need to know the probability density function
of Y = g(X), that is, need to find a function h(y), for y ∈ {y1 = g(x1 ), y2 = g(x2 ), ...}
such that h(yi ) = P (Y = g(xi )).
To this end, we group all the terms in Σ_i g(x_i)p(x_i) having the same g(x_i); that is, we let y_j be a fixed number, and let {x_{j1}, x_{j2}, ...} be all the x_i such that g(x_{ji}) = y_j (so we are not using the same notation as above). Then

Σ_i g(x_i)p(x_i) = Σ_j Σ_i g(x_{ji})p(x_{ji})
                 = Σ_j Σ_i y_j p(x_{ji})
                 = Σ_j y_j Σ_i p(x_{ji})
                 = Σ_j y_j P(g(X) = y_j)
                 = E[g(X)],
E[aX + b] = aE[X] + b.

Proof.

E[aX + b] = Σ_{p(x)>0} (ax + b)p(x) = a Σ_{p(x)>0} x p(x) + b Σ_{p(x)>0} p(x) = aE[X] + b,

where in the last equality, we use the fact that Σ_{p(x)>0} p(x) = 1.
By letting a = 0, and b = a in the theorem, we arrive at the corollary.
Corollary. For a constant a,
E[a] = a.
Lemma. Let X be a discrete random variable. Then the variance is given by
V ar[X] = E[X 2 ] − E[X]2 .
From the theorem on the expected value of a function of a random variable, let g(X) = (X − µ)^2:

Var[X] = E[(X − µ)^2] = Σ_i (x_i − µ)^2 p(x_i)
       = Σ_i x_i^2 p(x_i) − 2µ Σ_i x_i p(x_i) + µ^2 Σ_i p(x_i)
       = E[X^2] − 2E[X]^2 + E[X]^2
       = E[X^2] − E[X]^2,

where we use g(X) = X^2 for the first term, and Σ_i p(x_i) = 1 for the last term, in the second last equality.
Lemma (Markov inequality). If X is a random variable and g(x) ≥ 0 is a nonnegative function, then for any positive d > 0,

P(g(X) ≥ d) ≤ E[g(X)]/d.

Proof. Let I be the indicator random variable for the event {g(X) ≥ d},

I(g(X)) = 1 if g(X) ≥ d, and 0 otherwise.

Then since g(X) ≥ 0 and I(g(X)) ≤ 1,

g(X) ≥ dI.

Hence,

E[g(X)] ≥ E[dI] = dE[I] = dP(g(X) ≥ d),

which is what we want.
We will just prove (i) of the Chebychev’s inequality. The others can be derived from
(i).
Theorem (Chebychev's inequality). For a random variable X with mean µ_X and variance σ_X^2,

P(|X − µ_X| ≥ ασ_X) ≤ 1/α^2

for any real number α > 0.

Proof. Let g(X) = (X − µ_X)^2 and d = α^2 σ_X^2. By the Markov inequality,

P((X − µ_X)^2 ≥ α^2 σ_X^2) ≤ E[(X − µ_X)^2]/(α^2 σ_X^2).

The event on the left-hand side is the same as {|X − µ_X| ≥ ασ_X}. Observe that the numerator on the right-hand side is just the variance, σ_X^2. Hence, the right-hand side reduces to 1/α^2, which is what we want.
4.8.2 Expected Value and Variance of Geometric Distributions
Lemma (Geometric series). For any x ≠ 1 and nonnegative integers m < n,

Σ_{k=m}^n x^k = x^m (1 − x^{n−m+1})/(1 − x).

where in the third equality, we remove the first term from S(n) and remove the last term from xS(n). Hence,

S(n) = (1 − x^{n+1})/(1 − x).

So,

Σ_{k=m}^n x^k = x^m Σ_{k=m}^n x^{k−m} = x^m Σ_{k=0}^{n−m} x^k = x^m (1 − x^{n−m+1})/(1 − x).
Letting m = 0, differentiating both sides of the equality in the lemma with respect to x, and then letting n → ∞, for |x| < 1,

Σ_{k=1}^∞ k x^{k−1} = 1/(1 − x)^2.

Letting m = 0, taking the second derivative on both sides of the equality in the lemma with respect to x, and then letting n → ∞, for |x| < 1,

2/(1 − x)^3 = Σ_{k=2}^∞ k(k − 1)x^{k−2} = Σ_{k=1}^∞ k(k + 1)x^{k−1}
            = Σ_{k=1}^∞ k^2 x^{k−1} + Σ_{k=1}^∞ k x^{k−1}
            = Σ_{k=1}^∞ k^2 x^{k−1} + 1/(1 − x)^2,

and hence,

Σ_{k=1}^∞ k^2 x^{k−1} = 2/(1 − x)^3 − 1/(1 − x)^2 = (x + 1)/(1 − x)^3.
Theorem (Expectation and variance of geometric distributions). Let X be a geometric distribution with parameter p. Then the expectation and variance of X are

E[X] = 1/p,
Var[X] = (1 − p)/p^2.

Proof. By the derivations above,

E[X] = Σ_{k=1}^∞ k(1 − p)^{k−1} p = p Σ_{k=1}^∞ k(1 − p)^{k−1} = p · 1/(1 − (1 − p))^2 = 1/p,

and

E[X^2] = Σ_{k=1}^∞ k^2 (1 − p)^{k−1} p = p Σ_{k=1}^∞ k^2 (1 − p)^{k−1} = p · ((1 − p) + 1)/(1 − (1 − p))^3 = (2 − p)/p^2.

Hence,

Var[X] = E[X^2] − E[X]^2 = (2 − p)/p^2 − (1/p)^2 = (1 − p)/p^2.
Taking the second derivative with respect to x on both sides of the Maclaurin series expansion of e^x, we have

e^x = Σ_{k=2}^∞ k(k − 1) x^{k−2}/k! = Σ_{k=2}^∞ k^2 x^{k−2}/k! − Σ_{k=2}^∞ k x^{k−2}/k!
    = (1/x)(Σ_{k=1}^∞ k^2 x^{k−1}/k! − 1 − Σ_{k=1}^∞ k x^{k−1}/k! + 1)
    = (1/x)(Σ_{k=1}^∞ k^2 x^{k−1}/k! − e^x),

and so

Σ_{k=1}^∞ k^2 x^{k−1}/k! = (x + 1)e^x.

Therefore, the variance of a Poisson distribution with parameter λ is

Var[X] = E[X^2] − E[X]^2 = Σ_{k=1}^∞ k^2 e^{−λ} λ^k/k! − λ^2
       = λ e^{−λ} Σ_{k=1}^∞ k^2 λ^{k−1}/k! − λ^2
       = λ e^{−λ} (λ + 1) e^{λ} − λ^2
       = λ.
Proof. Let p_Y(x) be the probability density function of Y. Then using the same derivations as in the lemma, we have

∫_0^∞ P(Y < −y) dy = ∫_0^∞ P(−Y > y) dy = ∫_0^∞ t p_Y(−t) dt = −∫_{−∞}^0 x p_Y(x) dx,

where in the last equality we made the change of variable t = −x. Similarly,

∫_0^∞ P(Y > y) dy = ∫_0^∞ x p_Y(x) dx.

Hence,

∫_0^∞ P(Y > y) dy − ∫_0^∞ P(Y < −y) dy = ∫_0^∞ x p_Y(x) dx + ∫_{−∞}^0 x p_Y(x) dx = ∫_{−∞}^∞ x p_Y(x) dx = E[Y].
where we used the smooth change of variables F(r, θ) = (r cos(θ), r sin(θ)) (see sections 3.5.2 and 3.5.3). Thus, taking the square root on both sides, we have

∫_{−∞}^∞ e^{−t^2/2} dt = √(2π).
The expected value and variance of the standard normal distribution are
E[Z] = 0,
V ar[Z] = 1.
Proof.

E[Z] = (1/√(2π)) ∫_R z e^{−z^2/2} dz
     = (1/√(2π)) lim_{N→∞} [−e^{−z^2/2}]_{−N}^{N}
     = (1/√(2π)) lim_{N→∞} (e^{−(−N)^2/2} − e^{−N^2/2})
     = 0.

Then

Var[Z] = E[Z^2] = (1/√(2π)) ∫_R z^2 e^{−z^2/2} dz,

and by integration by parts, using u = z and v = −e^{−z^2/2} (see section 3.7.6), we have

Var[Z] = (1/√(2π)) (lim_{N→∞} [−z e^{−z^2/2}]_{−N}^{N} + ∫_R e^{−z^2/2} dz)
       = (1/√(2π)) ∫_R e^{−z^2/2} dz
       = 1.
4.8.6 Memoryless
A nonnegative random variable X is memoryless if for any positive real numbers t1 , t2 > 0,
P (X > t1 + t2 |X > t2 ) = P (X > t1 ).
Recall that P(A|B) = P(A ∩ B)/P(B). Hence, the equation above is equivalent to

P(X > t_1) = P((X > t_1 + t_2) ∩ (X > t_2))/P(X > t_2) = P(X > t_1 + t_2)/P(X > t_2).
Therefore, a nonnegative distribution is memoryless if for all positive real numbers t1 , t2 >
0,
P (X > t1 + t2 ) = P (X > t1 )P (X > t2 ).
Suppose now X is a memoryless distribution. Let F̂ (x) = P (X > x), then we have
F̂ (t1 + t2 ) = F̂ (t1 )F̂ (t2 ).
Lemma. Suppose g(x) is a continuous function satisfying
g(s + t) = g(s)g(t),
then
g(x) = e−λx
for some real number λ.
Proof. (i) Note that since

g(2/n) = g(1/n + 1/n) = g(1/n)^2,

by induction, we have

g(m/n) = g(1/n)^m.

(ii) Also,

g(1) = g(1/n + 1/n + · · · + 1/n) = g(1/n)^n,

which is equivalent to

g(1/n) = g(1)^{1/n}.

(iii) Hence,

g(m/n) = g(1)^{m/n}

for all nonnegative integers m, n. By continuity, this means that g(x) = g(1)^x for all nonnegative real numbers x. Since g(1) = g(1/2)^2 ≥ 0, letting λ = −log(g(1)), we have

g(x) = e^{−λx}.
Thus, by the lemma,

P(X > x) = F̂(x) = e^{−λx}

for some λ. Hence,

F(x) = 1 − P(X > x) = 1 − e^{−λx},

and by taking the derivative with respect to x, the probability density function is

p(x) = d/dx F(x) = λe^{−λx}, for x ≥ 0,

which is the probability density function of an exponential distribution.
Let T_n be the random variable denoting the amount of time one has to wait before a total of n independent Poisson-distributed events with parameter λ have occurred, that is,

T_n = X_1 + X_2 + · · · + X_n,

where X_i is an exponential random variable with parameter λ. Note that T_n is less than or equal to t if and only if the number of events that have occurred by time t is at least n. Let N(t) denote the number of events in the time interval [0, t]. Then
p(t) = λe^{−λt} (λt)^{α−1}/Γ(α), for t > 0,

which is the probability density function for a Gamma distribution with parameters (α, λ).
Let x = λt; then dt/dx = 1/λ, and we have

E[X] = (1/(λΓ(α))) ∫_0^∞ e^{−x} x^α dx = Γ(α + 1)/(λΓ(α)) = αΓ(α)/(λΓ(α)) = α/λ.

E[X^2] = ∫_0^∞ t^2 λe^{−λt} (λt)^{α−1}/Γ(α) dt = (1/(λ^2 Γ(α))) ∫_0^∞ λe^{−λt} (λt)^{α+1} dt
       = (1/(λ^2 Γ(α))) ∫_0^∞ e^{−x} x^{α+1} dx = Γ(α + 2)/(λ^2 Γ(α)) = (α + 1)αΓ(α)/(λ^2 Γ(α))
       = α(α + 1)/λ^2.

Hence,

Var[X] = E[X^2] − E[X]^2 = α(α + 1)/λ^2 − (α/λ)^2 = α/λ^2.
4.8.8 Beta Distributions
Define the beta function to be
B(α, β) = ∫_0^1 x^{α−1}(1 − x)^{β−1} dx.

We claim that

B(α, β) = Γ(α)Γ(β)/Γ(α + β).
With the smooth change of variables F(z, t) = (zt, z(1 − t)) (for which |J_F| = |t(−z) − z(1 − t)| = z), the equation becomes

Γ(α)Γ(β) = ∫_0^∞ ∫_0^1 e^{−z} (zt)^{α−1} (z(1 − t))^{β−1} z dt dz
         = ∫_0^∞ e^{−z} z^{α+β−1} dz ∫_0^1 t^{α−1}(1 − t)^{β−1} dt
         = Γ(α + β) B(α, β),

as desired.
Therefore, a beta distribution with parameters (α, β) can be written as

p(x) = x^{α−1}(1 − x)^{β−1}/B(α, β),

for all 0 < x < 1. This is the reason this distribution is called a beta distribution, and clearly

∫_R p(x)dx = (∫_0^1 x^{α−1}(1 − x)^{β−1} dx)/B(α, β) = B(α, β)/B(α, β) = 1.

The expected value and variance of a beta distribution with parameters (α, β) are

E[X] = α/(α + β),
Var[X] = αβ/((α + β)^2(α + β + 1)).
Proof. Recall that Γ(α + 1) = αΓ(α). Hence,

E[X] = (1/B(α, β)) ∫_0^1 x · x^{α−1}(1 − x)^{β−1} dx = (1/B(α, β)) ∫_0^1 x^{α}(1 − x)^{β−1} dx
     = B(α + 1, β)/B(α, β) = (Γ(α + 1)Γ(β)/Γ(α + β + 1)) · (Γ(α + β)/(Γ(α)Γ(β)))
     = (αΓ(α)Γ(β)/((α + β)Γ(α + β))) · (Γ(α + β)/(Γ(α)Γ(β)))
     = α/(α + β).

Similarly,

E[X^2] = (1/B(α, β)) ∫_0^1 x^2 x^{α−1}(1 − x)^{β−1} dx
       = B(α + 2, β)/B(α, β) = ((α + 1)αΓ(α)Γ(β)/((α + β + 1)(α + β)Γ(α + β))) · (Γ(α + β)/(Γ(α)Γ(β)))
       = α(α + 1)/((α + β)(α + β + 1)).

Therefore,

Var[X] = E[X^2] − E[X]^2 = α(α + 1)/((α + β)(α + β + 1)) − (α/(α + β))^2
       = (α/(α + β)) · ((α + 1)(α + β) − α(α + β + 1))/((α + β)(α + β + 1))
       = (α/(α + β)) · β/((α + β)(α + β + 1))
       = αβ/((α + β)^2(α + β + 1)).
On the other hand, suppose p_{X,Y}(x, y) = f(x)g(y) for some functions f and g, depending only on x and y, respectively. Then

1 = ∫_{−∞}^∞ ∫_{−∞}^∞ p_{X,Y}(x, y) dx dy = ∫_{−∞}^∞ ∫_{−∞}^∞ f(x)g(y) dx dy = (∫_{−∞}^∞ f(x) dx)(∫_{−∞}^∞ g(y) dy) = c_1 c_2,

where c_1 = ∫_{−∞}^∞ f(x) dx and c_2 = ∫_{−∞}^∞ g(y) dy. Also,

p_X(x) = ∫_{−∞}^∞ p_{X,Y}(x, y) dy = f(x) ∫_{−∞}^∞ g(y) dy = c_2 f(x),
p_Y(y) = ∫_{−∞}^∞ p_{X,Y}(x, y) dx = g(y) ∫_{−∞}^∞ f(x) dx = c_1 g(y).

Hence, since c_1 c_2 = 1,

p_X(x) p_Y(y) = c_1 c_2 f(x)g(y) = f(x)g(y) = p_{X,Y}(x, y),

so X and Y are independent.
Remark. Observe from the proof of the theorem that even though the joint probability density function can be written as the product of two functions, each depending on one variable, it is not necessarily true that p_X(x) = f(x) and p_Y(y) = g(y). For example, let
p_{X,Y}(x, y) = (1/4)xy for 0 < x < 1, 0 < y < 1, and 0 otherwise;  f(x) = (1/4)x for 0 < x < 1, and 0 otherwise;  g(y) = y for 0 < y < 1, and 0 otherwise.
Proof. • Suppose X and Y are discrete. Then, following the same idea as in the discrete univariate case, we group all the terms in Σ_x Σ_y g(x, y) p_{X,Y}(x, y) that have the same value g(x_i, y_j); that is, let h_t be a fixed number, and let

{(x_{t1}, y_{t(1,1)}), (x_{t1}, y_{t(1,2)}), ..., (x_{t1}, y_{t(1,j)}), ..., (x_{ti}, y_{t(i,j)}), ...}

be the set of all pairs (x, y) such that g(x_{ti}, y_{t(i,j)}) = h_t for all i, j ≥ 1. Then

Σ_x Σ_y g(x, y) p_{X,Y}(x, y) = Σ_t Σ_i Σ_j g(x_{ti}, y_{t(i,j)}) p_{X,Y}(x_{ti}, y_{t(i,j)})
                             = Σ_t h_t Σ_i Σ_j p_{X,Y}(x_{ti}, y_{t(i,j)})
                             = Σ_t h_t P(g(X, Y) = h_t)
                             = E[g(X, Y)].
we have

E[g(X, Y)] = ∫_0^∞ ∫∫_{g(x,y)>t} p_{X,Y}(x, y) dA dt.
Proof. (i) Cov[X, Y] = E[XY] − E[X]E[Y] = E[Y X] − E[Y]E[X] = Cov[Y, X].
Proof. Let µ be the expected value of X. Suppose X is discrete. Then since (x_i − µ)^2 p(x_i) ≥ 0 and p(x_i) > 0 for all x_i in the support of X,

0 = Var[X] = E[(X − µ)^2] = Σ_i (x_i − µ)^2 p(x_i)

forces (x_i − µ)^2 = 0, that is, x_i = µ for every x_i in the support of X.
−1 ≤ ρ(X, Y) ≤ 1,

and thus,

−1 ≤ ρ(X, Y).

Similarly,

0 ≤ Var[X/σ_X − Y/σ_Y] = Var[X]/σ_X^2 + Var[Y]/σ_Y^2 − 2Cov[X, Y]/(σ_X σ_Y) = 2(1 − ρ(X, Y)),

and thus ρ(X, Y) ≤ 1.
Now suppose Y = a + bX for some real numbers a, b, with b ≠ 0. Then σ_Y^2 = Var[Y] = b^2 Var[X] = b^2 σ_X^2, and since σ_X, σ_Y ≥ 0 we have σ_Y = sgn(b) b σ_X, where

sgn(b) = 1 if b > 0, 0 if b = 0, and −1 if b < 0.
2. Consider a game where each player takes turns to toss a die; the score a player obtains for that round is the outcome of the toss added to his previous tosses. If the outcome of the toss is 3, the player gets a bonus toss (only once; even if it lands on 3 the second time, it is the end of the player's turn).
(a) Find the probability, by simulation and analytically, that a player's score after the second round is 8. (Hint: for the analytical computation, consider cases where the player gets no bonus toss, exactly one bonus toss, and 2 bonus tosses; that is,
(b) Suppose the player’s score is 8 after the second round. Find the probability
by simulation and analytically that it was obtained with the help of exactly 1
bonus toss.
3. Four buses carrying 148 students from the same school arrive at a football stadium.
The buses carry, respectively, 40, 33, 25, and 50 students. One of the students
is randomly selected. Let X denote the number of students who were on the bus
carrying the randomly selected student. One of the 4 bus drivers is also randomly
selected. Let Y denote the number of students on her bus.
5. Suppose a university has 10 blocks of hostels and each block has 10 bathrooms.
During a pandemic, a university periodically collects wastewater sample from the
bathrooms of the hostels to test for the presence of the virus. However, rather than
testing wastewater sample from each bathroom separately, it has been decided to
test the wastewater sample collected from bathrooms of an entire block. If the test
is negative, one test will suffice for the whole block, whereas if the test is positive,
wastewater sample will be collected from each bathroom and tested individually,
and in all, 11 tests will be made for that block. Assume that the probability that
the wastewater sample of a bathroom is tested positive is 0.1 for all bathrooms,
independently from one another. Compute the expected number of tests necessary
for all the blocks. What is the variance?
6. There are 100 marbles, of which 40 of them are blue and 60 of them are red. The
marbles are distributed randomly into 10 bags with 10 marbles each. Let p(k) be
the probability density function for the probability that a randomly chosen bag has
exactly k blue marbles.
7. There are 95 marbles, of which 40 of them are blue and 55 of them are red. The
marbles are distributed randomly into 10 bags, with 9 of them having 10 marbles
and 1 bag containing 5 marbles. Let p(k) be the probability density function for
the probability that a randomly chosen bag has exactly k blue marbles.
9. Consider a game of marbles played between 2 players. In each round, each player picks either 1 or 2 marbles, and guesses the number of marbles their opponent will pick. If only one player guesses correctly, he is considered the winner for the round. If either both players guess correctly, or neither guesses correctly, then no one wins the round. Suppose a game consists of 10 rounds. Consider a specified player. Find a number k such that the probability that he wins at most k rounds in a game is 0.98.
10. A person tosses a fair coin until a tail appears for the first time. If the tail appears
on the nth flip, the person wins 2n dollars. Let X denote the player’s winnings.
Show that E[X] = +∞. This problem is known as the St. Petersburg paradox.
(a) Would you be willing to pay $1 million to play this game once?
(b) Would you be willing to pay $1 million for each game if you could play for as
long as you liked and only had to settle up when you stopped playing?
11. A satellite system consists of 12 components and functions on any given day if at
least 9 of the 12 components function on that day. On a rainy day, each of the
components independently functions with probability 0.9, whereas on a dry day,
each independently functions with probability 0.7. If the probability of rain for the
next 5 days is 0.3 independently,
(a) what is the probability that the satellite system will function for at least 3 of
the 5 days?
(b) what is the probability that the satellite system first malfunction happens on
the 5th day?
12. It is estimated that a certain lecturer makes on average 2 typographical errors every 10000 words. Suppose a page of the lecture notes consists of 700 words. What is the probability that the page contains at least 1 typographical error?
13. Consider an online player versus player game where there are exactly 8 players (4v4)
in each match. Suppose approximately 10,000 matches were played last year.
(a) Estimate the probability that for at least 1 of these matches, at least 2 players
were born on 1st of January;
(b) Estimate the probability that for at least 2 of these matches, exactly 3 players
celebrated their birthday on the same day of the year.
(c) Repeat (b) under the assumption that there was at least a match with exactly
3 players celebrating their birthday on the same day of the year.
14. On average, a person contracts a cold 5 times in a given year. Suppose that a new
wonder drug (based on large quantities of vitamin C) has just been marketed that
reduces this number to 3 for 75 percent of the population. For the other 25 percent
of the population, the drug has no appreciable effect on colds. If an individual tries
the drug for a year and has 2 colds in that time, how likely is it that the drug is
beneficial for him or her? (Hint: Which discrete distribution best models the number
of times a person contracts a cold in a given year?)
15. A bag contains 4 white and 4 black balls. We randomly choose 4 balls. If 2 of them
are white and 2 are black, we stop. If not, we replace the balls in the bag and again
randomly select 4 balls. This continues until exactly 2 of the 4 chosen are white.
16. Consider a game where 2 players each pick a card at random from a stack of cards
labeled 1 to 6, without replacement. The player who picks the card with the larger
number wins the game. The first person to win 4 games is declared the overall winner.
(a) What must the capacity of the tank be so that the probability of the supply
being exhausted in a given week is 0.01?
(b) What should the volume of the tank be so that the weekly sales demand is
always met?
(c) What is the expected weekly volume of sales?
(a) What is the probability that 3 double-decker buses arrive within 30 minutes?
What assumptions did you make?
(b) What is the mean waiting time for 5 double-decker buses to arrive at the bus
stop?
30. (a) If X is uniformly distributed over (0, 1), find the density function of Y = e^X.
(Hint: First find the cumulative distribution function of Y, that is, find P(Y ≤ y).)
(b) Suppose a random variable X has the following probability density function
\[
p(x) = \begin{cases} \dfrac{1}{x}, & 1 \le x \le e, \\[4pt] 0, & \text{otherwise.} \end{cases}
\]
Suggest how we can generate values of the random variable X in R using runif.
(c) Suggest how we might generate in R random values of a random variable X
with probability density function
\[
p(x) = \begin{cases} \dfrac{3\sqrt{x}}{4\sqrt{2}}, & 0 \le x \le 2, \\[4pt] 0, & \text{otherwise.} \end{cases}
\]
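Parts (b) and (c) rely on the inverse-transform method: if U is uniform on (0, 1) and F is the desired cumulative distribution function, then F^{-1}(U) has distribution F. The sketch below illustrates the pattern with the Exp(1) distribution (a stand-in example, so as not to give the answers away); the same recipe with the appropriate inverse CDF applies to (b) and (c).

set.seed(1)
u <- runif(10000)          # uniform values on (0, 1)
x <- -log(1 - u)           # inverse CDF of Exp(1) applied to u
mean(x); var(x)            # both should be close to 1 for Exp(1)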
31. (a) Find the parameters α and β of a standard beta distribution with expected
value E[X] = µ and variance Var[X] = σ^2. Hint: Observe that
\[
\frac{\alpha\beta}{(\alpha+\beta)^2} = \frac{\alpha}{\alpha+\beta}\left(1 - \frac{\alpha}{\alpha+\beta}\right).
\]
(b) It is known that students take between 2 and 8 hours to complete their MA2104
assignment, with a mean of 6 hours and a variance of 2 hours². Compute the probability
that a student will take more than 7 hours to finish the assignment.
32. Show that if the support of a random variable (either discrete or continuous) is
bounded, a < X < b, that is, p(x) > 0 only within the interval a < x < b, then
a < E[X] < b.
33. A bin of 5 transistors is known to contain 2 that are defective. The transistors are
to be tested, one at a time, until the defective ones are identified. Denote by N1 the
number of tests made until the first defective is identified and by N2 the number of
additional tests until the second defective is identified.
(a) Find c.
(b) Find the marginal densities of X and Y . (Hint: Integration by parts)
(c) Find E[X].
36. A man and a woman agree to meet at a certain location about 12:30 p.m. If the
man arrives at a time uniformly distributed between 12:15 and 12:45, and if the
woman independently arrives at a time uniformly distributed between 12:00 and 1
p.m., find the probability that the first to arrive waits no longer than 5 minutes.
39. Choose a number X at random from the set {1, 2, 3, 4, 5}. Now choose a number
at random from the subset of numbers no larger than X, that is, from {1, . . . , X}. Call
this second number Y.
41. How many times would you expect to roll a fair die before all 6 sides appeared at
least once? What is the variance?
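One way to check your answer to the die question above is by simulation; a minimal sketch:

set.seed(1)
rolls_needed <- function() {
  seen <- integer(0)
  n <- 0
  while (length(seen) < 6) {            # keep rolling until all 6 faces have appeared
    seen <- union(seen, sample(1:6, 1))
    n <- n + 1
  }
  n
}
results <- replicate(10000, rolls_needed())
mean(results)   # compare with the analytic value 6(1 + 1/2 + ... + 1/6) = 14.7
var(results)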
42. Let X and Y be discrete random variables and let E[X | Y] denote the function of the
random variable Y whose value at Y = y is the conditional expectation E[X | Y = y].
Note that E[X | Y] is itself a random variable. Show that E[E[X | Y]] = E[X].
43. Let X be the number of 1’s and Y be the number of 2’s that occur in n rolls of a
fair die. Compute Cov[X, Y ]. Verify your answer via simulation.
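For the simulation check, a minimal sketch along the following lines may be used; the choice of n = 60 rolls is purely illustrative, since the exercise leaves n general.

set.seed(1)
n      <- 60               # illustrative number of rolls per experiment
n_sims <- 20000
rolls  <- matrix(sample(1:6, n * n_sims, replace = TRUE), nrow = n_sims)
X <- rowSums(rolls == 1)   # number of 1's in each simulated set of n rolls
Y <- rowSums(rolls == 2)   # number of 2's
cov(X, Y)                  # compare with your analytic answer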
44. Suppose X1, ..., Xn are independent and identically distributed, each uniform over the
interval (0, 1). Define R = max{X1, ..., Xn}. Find the density of R. (Hint: First,
use the fact that R ≤ t if and only if Xi ≤ t for all i = 1, ..., n to find the
cumulative distribution function of R.)
46. Given an order n nonzero square matrix A, we define a new matrix Ã = (1/|a_{lk}|) A,
where a_{lk} is the entry of A with the largest absolute value, that is, |a_{lk}| ≥ |a_{ij}| for all
i, j = 1, ..., n. Then all the entries of Ã lie in the interval [−1, 1]. Note that Ã is
invertible if and only if A is invertible.
Write a function whose formal argument is n and whose output is an order n square
matrix whose entries are randomly generated values in the interval (−1, 1).
Generate a large number of such matrices and estimate the proportion that are
singular. This exercise shows that most matrices are invertible. A possible sketch is
given below.
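A possible sketch of such a function, using runif for the entries and the determinant (with a small tolerance, since an exact comparison with 0 is not meaningful in floating point) as the singularity check:

random_matrix <- function(n) {
  # order-n matrix with entries drawn uniformly from (-1, 1)
  matrix(runif(n * n, min = -1, max = 1), nrow = n, ncol = n)
}
set.seed(1)
n_sims   <- 10000
singular <- replicate(n_sims, abs(det(random_matrix(5))) < 1e-12)  # order 5 as an example
mean(singular)   # proportion judged singular; essentially 0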
(a) Compute an approximation to the probability that the number 6 will appear
between 150 and 200 times inclusive.
(b) If the number 6 appears exactly 200 times, find the probability that the number
5 will appear fewer than 150 times.
49. A model for the movement of a stock supposes that if the present price of the
stock is s, then after one period, it will be either 1.012s with probability 0.52 or
0.99s with probability 0.48. Assuming that successive movements are independent,
approximate the probability that the stock’s price will be up at least 30 percent
after the next 1000 periods.
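The intended approach is presumably a normal approximation to the number of up-moves over the 1000 periods; the simulation sketch below is one way to check that approximation.

set.seed(1)
n_sims  <- 10000
n_steps <- 1000
up_30 <- replicate(n_sims, {
  moves <- sample(c(1.012, 0.99), n_steps, replace = TRUE, prob = c(0.52, 0.48))
  prod(moves) >= 1.3               # final price relative to the starting price s
})
mean(up_30)    # Monte Carlo estimate of the required probability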
References
[1] https://www.tutorialspoint.com/r/r_overview.htm
[2] Pruim, R., Horton, N., & Kaplan, D. (2014). Start Teaching with R and RStudio.
DOI: 10.13140/2.1.4414.6567.
[3] Ma, S. L., Ng, K. L., & Tan, V. (2016). Linear Algebra: Concepts and Techniques
on Euclidean Spaces. McGraw-Hill Education (Asia).
[4] Fieller, N. (2018). Basics of Matrix Algebra for Statistics with R. Chapman and
Hall/CRC.
[5] Lax, P. D., & Terrell, M. S. (2018). Multivariable Calculus with Applications.
Springer.
[8] Matloff, N. (2019). Probability and Statistics for Data Science: Math + R + Data.
CRC Press.
[9] Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.