
MA2401 Introductory Mathematics with R

Department of Mathematics
NATIONAL UNIVERSITY OF SINGAPORE

Jonathon Teo Yi Han


[email protected]
Contents

MA2401 Introductory Mathematics with R 1

Contents 4

1 Introduction to R and Rstudio 5


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Installing R and RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 R Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Relational Operators . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Guide to R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Comments and Storing Variables . . . . . . . . . . . . . . . . . . 8
1.5.2 The Class of an Object . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.3 R is a Case-Sensitive Language . . . . . . . . . . . . . . . . . . . . 9
1.6 The Tabs in Rstudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6.1 The Environment Tab . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6.2 History Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6.3 Help Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6.4 Plots Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Linear Algebra 12
2.1 Matrices and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Introduction to Matrices . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2 Defining Matrices and Vectors in R . . . . . . . . . . . . . . . . . 13
2.1.3 Special types of Matrices . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Matrix and Vector Operations . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Definitions and Properties . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 Matrix Operations in R . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.4 Relational Operators in R . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Introduction to Linear Systems . . . . . . . . . . . . . . . . . . . 28
2.3.2 Solutions to a Linear System . . . . . . . . . . . . . . . . . . . . . 31
2.3.3 Gaussian and Gauss-Jordan Elimination . . . . . . . . . . . . . . 32
2.3.4 Solving a Linear System in R . . . . . . . . . . . . . . . . . . . . . 37
2.4 Submatrices and Block Multiplication . . . . . . . . . . . . . . . . . . . . 42
2.4.1 Block Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.4.2 Entries and Submatrices of Vectors and Matrices . . . . . . . . . 45
2.5 Inverse of a Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.1 Definition and Properties . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.2 Algorithm to Finding the Inverse . . . . . . . . . . . . . . . . . . 49
2.5.3 Finding Inverse in R . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5.4 Inverse and Linear System . . . . . . . . . . . . . . . . . . . . . . 52
2.6 Least Square Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.7 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.7.1 Definition: Cofactor Expansion . . . . . . . . . . . . . . . . . . . 56
2.7.2 Computing Determinant in R . . . . . . . . . . . . . . . . . . . . 59
2.7.3 Properties of Determinant . . . . . . . . . . . . . . . . . . . . . . 59
2.7.4 Determinants of Partitioned Matrices . . . . . . . . . . . . . . . . 60
2.8 Eigenanalysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8.1 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . 61
2.8.2 Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.9 Appendix for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.9.1 Matrix operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.9.2 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.9.3 Solving Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . 79
2.9.4 Elementary Row Operations . . . . . . . . . . . . . . . . . . . . . 81
2.9.5 Gaussian Elimination and Gauss-Jordan Elimination . . . . . . . 84
2.9.6 Orthogonal and Orthonormal . . . . . . . . . . . . . . . . . . . . 86
2.9.7 Gram-Schmidt Process . . . . . . . . . . . . . . . . . . . . . . . . 88
2.9.8 Least square solution . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.9.9 Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.9.10 Orthogonal Diagonalization . . . . . . . . . . . . . . . . . . . . . 92
2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

3 Functions and Calculus 103


3.1 Functions of several variables . . . . . . . . . . . . . . . . . . . . . . . . 103
3.1.1 Linear Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.1.2 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.2 Functions in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.2.1 Creating Functions in R . . . . . . . . . . . . . . . . . . . . . . . 109
3.2.2 Symbolic Functions in R . . . . . . . . . . . . . . . . . . . . . . . 112
3.2.3 Solving equations in R . . . . . . . . . . . . . . . . . . . . . . . . 113
3.3 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.3.1 Plots in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.3.2 Curve Fitting in R . . . . . . . . . . . . . . . . . . . . . . . . . . 127
3.4 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.4.1 Definitions and Properties . . . . . . . . . . . . . . . . . . . . . . 133
3.4.2 Special Cases: Vector and Matrix Derivatives . . . . . . . . . . . 140
3.4.3 Symbolic Derivatives in R . . . . . . . . . . . . . . . . . . . . . . 141
3.4.4 Numerical Differentiation in R . . . . . . . . . . . . . . . . . . . . 145
3.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.5.1 Definitions and Properties . . . . . . . . . . . . . . . . . . . . . . 148
3.5.2 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.5.3 Probability Density Function . . . . . . . . . . . . . . . . . . . . 155
3.5.4 Integration in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.6 System of Differential Equations . . . . . . . . . . . . . . . . . . . . . . . 160
3.6.1 Introduction and Definitions . . . . . . . . . . . . . . . . . . . . . 160
3.6.2 Solving System of Ordinary Differential Equations in R . . . . . . 161
3.7 Appendix for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.7.1 Sequences and Limits . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.7.2 Introduction to Functions . . . . . . . . . . . . . . . . . . . . . . 165
3.7.3 Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
3.7.4 Differentiation Rules . . . . . . . . . . . . . . . . . . . . . . . . . 166
3.7.5 Introduction to Integration . . . . . . . . . . . . . . . . . . . . . . 166
3.7.6 Integration Rules and Techniques . . . . . . . . . . . . . . . . . . 169
3.7.7 Integration over Unbounded Sets . . . . . . . . . . . . . . . . . . 170
3.8 Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

4 Probability and Statistics 179


4.1 Fundamental of Probability . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.1.1 Basic Combinatorics with R . . . . . . . . . . . . . . . . . . . . . 179
4.1.2 Introduction to Probability . . . . . . . . . . . . . . . . . . . . . 180
4.2 Conditional Probabilities, Independent Events, Bayes’ Rule . . . . . . . . 183
4.2.1 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 183
4.2.2 Independent Events . . . . . . . . . . . . . . . . . . . . . . . . . . 185
4.2.3 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.3 Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . 188
4.3.1 Introduction to Random Variables . . . . . . . . . . . . . . . . . . 188
4.3.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . 189
4.3.3 Expected Value of Discrete Random Variables . . . . . . . . . . . 191
4.3.4 Variance of Discrete Random Variables . . . . . . . . . . . . . . . 195
4.4 Some families of Discrete Univariate Distributions . . . . . . . . . . . . . 196
4.4.1 Discrete Uniform Distributions . . . . . . . . . . . . . . . . . . . 196
4.4.2 Bernoulli Distributions . . . . . . . . . . . . . . . . . . . . . . . . 197
4.4.3 Binomial Distributions . . . . . . . . . . . . . . . . . . . . . . . . 197
4.4.4 Geometric Distributions . . . . . . . . . . . . . . . . . . . . . . . 199
4.4.5 Negative Binomial Distributions . . . . . . . . . . . . . . . . . . . 200
4.4.6 Poisson Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 201
4.5 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . 203
4.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.5.2 Expected Value and Variance of Continuous Random Variables . . 205
4.6 Some Families of Continuous Univariate Distributions . . . . . . . . . . . 207
4.6.1 Uniform Distributions . . . . . . . . . . . . . . . . . . . . . . . . 207
4.6.2 Normal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 209
4.6.3 Exponential Distributions . . . . . . . . . . . . . . . . . . . . . . 212
4.6.4 Gamma Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 215
4.6.5 Beta Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 216
4.7 Multivariable Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 220
4.7.1 Joint Probability Density Functions . . . . . . . . . . . . . . . . . 220
4.7.2 Independent Random Variables . . . . . . . . . . . . . . . . . . . 226
4.7.3 Conditional Random Variables . . . . . . . . . . . . . . . . . . . . 230
4.7.4 Expected Values, Covariance, and Correlation . . . . . . . . . . . 232
4.7.5 Bivariate Normal Distributions . . . . . . . . . . . . . . . . . . . 238
4.7.6 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . 240
4.8 Appendix for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.8.1 Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . . 243
4.8.2 Expected Value and Variance of Geometric Distributions . . . . . 245
4.8.3 Expected Value and Variance of a Poisson Distribution . . . . . . 246
4.8.4 Expected Value and Variance of Continuous Random Variables . . 247
4.8.5 Normal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 248
4.8.6 Memoryless . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
4.8.7 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.8.8 Beta Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4.8.9 Independent Random Variables . . . . . . . . . . . . . . . . . . . 254
4.8.10 Expected Values, Covariance, and Correlation . . . . . . . . . . . 255
4.9 Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

References 269
Chapter 1

Introduction to R and Rstudio

1.1 Introduction
RStudio is an integrated development environment (IDE) for R that provides an alternative interface to R. RStudio runs on Mac, PC, and Linux machines and provides a simplified interface that looks and feels identical on all of them.

R is a programming language and software environment for statistical analysis, graph-


ics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at
the University of Auckland, New Zealand, and is currently developed by the R Develop-
ment Core Team.

The core of R is an interpreted computer language which allows branching and looping
as well as modular programming using functions. R allows integration with the proce-
dures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.

R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems like Linux, Windows and Mac.

R is free software distributed under a GNU-style copyleft, and an official part of the
GNU project called GNU S.

Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of
Statistics of the University of Auckland in Auckland, New Zealand. R made its first
appearance in 1993.

• A large group of individuals has contributed to R by sending code and bug reports.

• Since mid-1997 there has been a core group (the “R Core Team”) who can modify
the R source code archive.

Features of R
As stated earlier, R is a programming language and software environment for statistical
analysis, graphics representation and reporting. The following are the important features
of R

• R is a well-developed, simple and effective programming language which includes
conditionals, loops, user defined recursive functions and input and output facilities.

• R has an effective data handling and storage facility.

• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.

• R provides a large, coherent and integrated collection of tools for data analysis.

• R provides graphical facilities for data analysis and display, either directly on the
computer or as printed output.

In conclusion, R is the world's most widely used statistical programming language. It is
the first choice of many data scientists and is supported by a vibrant and talented
community of contributors. R is taught in universities and deployed in mission-critical
business applications.

1.2 Installing R and RStudio


1. Download R from http://cran.r-project.org/.

2. Download RStudio from http://www.rstudio.org/.

Alternatively, students may run RStudio in a web browser.

1. Go to https://rstudio.cloud/plans/free.

2. Choose “Cloud Free” and click Sign up.

3. Go ahead and sign up, and choose an account name.

1.3 R Script
You may type a series of commands in an R script and execute them all at one time. An R script
is a normal text file; you may type your script in any text editor, such as Notepad. You
may also prepare your script in a word processor, like Microsoft Word, TextEdit, or
WordPad, provided you can save the script in plain text (ASCII) format. Save the text file
with .R at the end, for example, test.R.

To open a new R Script in RStudio, go to File, New File, and click R Script, or
by pressing Ctrl+Shift+N (for the desktop RStudio) or Ctrl+Alt+Shift+N (for RStudio
Cloud). You may now begin to type your commands in the R Script. Pressing Ctrl+Enter
will run the current command in the console. Selecting multiple lines and pressing Ctrl+Enter
will run all the selected commands in the console.

You may save your script file by pressing Ctrl+s or clicking the blue file button.
1.4 Operations
1.4.1 Basic Operations
R can perform simple calculations. Commands entered in the Console tab are immedi-
ately executed by R by pressing Enter. Here are some basic operators:
• Addition:
> 5+3
[1] 8

• Subtraction:
> 8-5
[1] 3

• Product:
> 123 * 321
[1] 39483

• Division:
> 123/3
[1] 41

• Powers:
> 9^3
[1] 729

• Roots:
> 9^(1/3)
[1] 2.080084

• Log:
In R, log is the natural log, that is, with base e = exp(1). In order to compute the
log with another base a, use log(x,base=a).
> log(exp(2)) #log is the natural log, and exp(x) is e^x, where e is the
natural number
[1] 2
> log10(2) #base 10 log of 2; similarly, log2(y) gives the base 2 log of y
[1] 0.30103
> log(100, base=2) #alternative way of base 2 log of 100
[1] 6.643856

• To perform an operation modulo n, use %% n.


> (8+3) %% 5 #the parentheses are necessary
[1] 1
> (pi*3) %% 1
[1] 0.424778

• To display a number as a fraction, load the library MASS and then use the function
fractions.
> library(MASS)
> fractions(0.3333)
[1] 1/3
> fractions(0.333)
[1] 333/1000

1.4.2 Relational Operators


The > function checks if the first number is greater than the second number.
> 1>3
[1] FALSE
> pi>1
[1] TRUE

The >= function checks if the first number is greater than or equals to the second
number.
> 1>=1
[1] TRUE
> 3>=1
[1] TRUE

The < and <= functions work analogously.


> 1<3
[1] TRUE
> pi<=1
[1] FALSE

The == function checks if the numbers are equal.


> 1==0
[1] FALSE
> 1==0.99999999999999999
[1] TRUE

The != function checks if the two numbers are different.


> 1!=0
[1] TRUE
> 1!=0.99999999999999999
[1] FALSE

1.5 Guide to R
1.5.1 Comments and Storing Variables
The symbol > is the command line prompt symbol; typing a command or instruction will
cause it to be executed or performed immediately. If you press Enter before completing
the command (e.g., if the command is very long), the prompt changes to + indicating that
more is needed. Sometimes you do not know what else is expected and the + prompt
keeps coming. Pressing Esc will kill the command and return to the > prompt. If you
want to issue several commands on one line, they should be separated by a semicolon (;)
and they will be executed sequentially.

Comments are prefaced with the # character. You can save values to named variables
for later reuse.
> a=9*2 #store the answer as variable a
> a #to display the stored value
[1] 18
> a<-9*2 #may use <- instead of =
> a
[1] 18
Once the variables are defined, they can be used in other operations and functions.
> a<-2; b<-pi; #the semi-colon can be used to place multiple commands on one
line.
> a*b
[1] 6.283185

Note that R is case sensitive. The variables a and A are different:


> a<-2; A<-10
> a
[1] 2
> A
[1] 10

Assigning a new value to an existing variable overrides the previous value:


> a<-2
> a<-3
> a
[1] 3

The broom icon can be used to clear all the commands in the console. Alternatively,
pressing Ctrl+L will clear all the commands in the console. Note that this does not remove
objects from the Environment tab or commands from the History tab.

1.5.2 The Class of an Object


Every object in R has a class (or type of object) associated with it. Examples of classes
are “numeric”, “matrix” and “character”. The function t.test() produces objects of
class “htest”. The importance of class is that some functions require that their arguments
are of a particular class or even may operate differently depending upon which class their
argument happens to be. The class of an object can be changed by functions such as
as.matrix(.) which will convert a “numeric” class vector into a “matrix” class. This is
only critical in a few cases since R usually takes care of this matter internally.
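As a small illustration (not one of the course examples), the class function reports the class of an object, and as.matrix converts a numeric vector into a matrix; note that in recent versions of R a matrix also carries the class "array".
> v <- c(1,2,3)
> class(v)
[1] "numeric"
> M <- as.matrix(v)   #convert the numeric vector into a 3 by 1 matrix
> class(M)
[1] "matrix" "array"
> class("hello")
[1] "character"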

1.5.3 R is a Case-Sensitive Language


Note that R treats lower case and upper case letters as different, for example, inverting a
matrix is performed using the function solve() but R does not recognize Solve(), nor
SOLVE(), nor .... The objects x and X are distinct (and easy to confuse). The function
matrix and the library Matrix are distinct.
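For example, in a fresh R session with no extra packages loaded (the function Solve is provided by the matlib package used later, so the second result assumes matlib has not been attached):
> x <- 5; X <- 10
> x + X                        #x and X are distinct objects
[1] 15
> exists("solve"); exists("Solve")
[1] TRUE
[1] FALSE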

1.6 The Tabs in Rstudio


1.6.1 The Environment Tab
The Environment tab shows the objects available to the console. These are subdivided
into data, values (non-data frame, non-function objects) and functions. The broom icon
can be used to remove all objects from the environment, and it is good to do this from
time to time, especially when running in RStudio server or if you choose to save the
environment when shutting down RStudio since in these cases objects can stay in the
environment essentially indefinitely.

1.6.2 History Tab


The commands that are entered will appear in the History tab. Commands in the history
tab can be transferred back to the console by simply double clicking them, or selecting
them and pressing Enter. Multiple lines can be selected by holding down Shift and
clicking the lines, or pressing the Up and Down keys. Any lines can be deleted from the
history by using the delete button (red circle with a white cross). The broom icon can
be used to clear all commands in the history. History files can be saved and shared with
others so that they can rerun the code.

1.6.3 Help Tab


The Help tab is where RStudio displays R help files. These can be searched and navigated
in the Help tab. You can also open a help file using the ? operator in the console. For
example
> ?abs
or
> help(abs) #help is viewed as a function
will provide the help file for the absolute (modulus) function.

If you don’t know the exact name of a function, you can use the function apropos()
and give part of the name and R will find all functions that match. Quotation marks are
mandatory here. For example
> apropos("matrix")

You can do a broader search using ?? or help.search(), which will find matches not
only in the names of functions and data sets, but also in the documentation for them.
> ??histogram
or
> help.search("histogram")
1.6.4 Plots Tab
Plots created in the console are displayed in the Plots tab. For example
> plot(1,1) #this will display a point at (1,1) on the xy graph
> x = -9:9 #x takes integer values from -9 to 9
> plot(x,x^2) #plot x against x^2, for x from -9 to 9
Chapter 2

Linear Algebra

2.1 Matrices and Vectors


2.1.1 Introduction to Matrices
A matrix is a rectangular array of numbers
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix},$$
where $a_{ij} \in \mathbb{R}$ are real numbers.

The size of a matrix is given by m × n, where m is the number of rows and n is the
number of columns. The (i, j)-entry of the matrix is the number $a_{ij}$ in the i-th row and
j-th column, for i = 1, ..., m, j = 1, ..., n. A matrix can also be denoted as
$$A = (a_{ij})_{m\times n} = (a_{ij})_{i=1,\dots,m}^{j=1,\dots,n}.$$

Matrices are usually denoted by upper case bolded letters.


 
Example.
1. $\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 0 & -1 \end{pmatrix}$ is a 3 × 2 matrix. The (2, 1)-entry is 3.

2. $\begin{pmatrix} 2 & 1 & 0 \end{pmatrix}$ is a 1 × 3 matrix. The (1, 2)-entry is 1.

3. $\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$ is a 3 × 1 matrix. The (3, 1)-entry is 3.

4. $\begin{pmatrix} 4 \end{pmatrix}$ is a 1 × 1 matrix.

The last example shows that all real numbers can be thought of as 1 × 1 matrices.

Remark. 1. To be precise, the above examples are called real-valued matrices, or matrices
with real number entries. Later we will be introduced to complex-valued matrices and
even matrices with function entries.

2. The choice of using round or square brackets is a matter of taste.
Example. 1. $A = (a_{ij})_{2\times 3}$, $a_{ij} = i + j$:
$$A = \begin{pmatrix} 2 & 3 & 4 \\ 3 & 4 & 5 \end{pmatrix}$$

2. $B = (b_{ij})_{3\times 2}$, $b_{ij} = (-1)^{i+j}$:
$$B = \begin{pmatrix} 1 & -1 \\ -1 & 1 \\ 1 & -1 \end{pmatrix}$$

3. $C = (c_{ij})_{3\times 3}$, $c_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise:} \end{cases}$
$$C = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

A (real) column n-vector (or vector) is a collection of n ordered real numbers,
$$\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}, \quad \text{where } v_i \in \mathbb{R} \text{ for } i = 1, ..., n.$$
It is an n × 1 matrix.

A (real) row n-vector (or vector) is a collection of n ordered real numbers,
$$\mathbf{v} = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}, \quad \text{where } v_i \in \mathbb{R} \text{ for } i = 1, ..., n.$$
It is a 1 × n matrix.

The set of all n-vectors (column or row) is denoted as $\mathbb{R}^n$.

2.1.2 Defining Matrices and Vectors in R


To create a vector, use the concatenation function c.
> v<-c(1,2,3,4,5) #create a vector v with entries (1,2,3,4,5)
> v
[1] 1 2 3 4 5
This vector is viewed as neither a column nor a row vector, as using the dim function
returns NULL
> dim(v)
NULL

Use the matrix function to change it to a column vector.


> v<-matrix(v); v
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5

To change it to a row vector, type t(matrix(v)) or matrix(v,T)

> t(matrix(v))
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
> matrix(v,T)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5

Create a vector of strings or characters


> string<-c("weight","height","gender") #create a vector of strings/characters
> string
[1] "weight" "height" "gender"

Adding quotation marks causes R to recognize numbers as strings,


> stringnum<-c("2","3","4")
[1] "2" "3" "4"

Adding numbers to strings causes R to view the numbers as strings,


> stringwnum<-c("Hello",2,3,4)
[1] "Hello" "2" "3" "4"

We may concatenate vectors with vectors,


> v1<-c(1,2,3,4); v2<-c(2,4,6)
> v3<-c(v1,v2)
> v3
[1] 1 2 3 4 2 4 6
> string1<-c(v1,stringnum)
> string1
[1] "1" "2" "3" "4" "2" "3" "4"
> v5<-c(string,v2)
> v5
[1] "weight" "height" "gender" "2" "4" "6"

The numeric function creates a vector with all its elements being 0.
> zeros3<-numeric(3)
> zeros3
[1] 0 0 0
> v1<-c(1,2,3,numeric(2))
> v1
[1] 1 2 3 0 0

The rep function replicates an item.


> a<-rep(2,5) #replicates the number 2 by 5 times
> a
[1] 2 2 2 2 2
> v1=c(1,2,3); v2<-rep(v1,3) #replicates vector v1 by 3 times
> v2
[1] 1 2 3 1 2 3 1 2 3
> v3<-rep(c(1,2,3),c(2,1,3)) #replicates 1 by 3 times, 2 by 1 time, and 3 by
3 times
> v3
[1] 1 1 2 3 3 3
> string<-c("Age","Gender","Weight"); string1<-rep(string,c(3,2,3)); string1
[1] "Age" "Age" "Age" "Gender" "Gender" "Weight" "Weight" "Weight"

The seq function creates a sequence of numbers.


> s1<-seq(1,10,by=2); s1 #sequence of numbers from 1 to 10, by intervals of
2
[1] 1 3 5 7 9
> s2<-seq(1,2,length=10); s2 #sequence of 10 equally spaced numbers from 1
to 2
[1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778
1.888889
[10] 2.000000
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10

Creating a matrix using the dim function.


> v<-c(1:6); dim(v)=c(2,3) #uses entries from v to create a 2 by 3 matrix
> v
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> u=seq(10); dim(u)=c(5,2); u
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
By default, the entries of v are filled column by column. To convert the matrix back to a
vector,
> dim(v)=NULL; v
[1] 1 2 3 4 5 6

Creating a matrix using the matrix function.


> m<-matrix(seq(1:10),2,5) #matrix(v,m,n) takes entries from v to create a
m by n matrix
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> m1<-matrix(seq(10),2,5,T) #put T in the 4th argument to fill matrix by row
> m1
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10

The function rbind concatenates vector as rows of a matrix, and the function cbind
concatenates vectors as columns of a matrix.
> v1<-c(1,3,3,1); v2<-c(2,2,5,3); v3<-c(1,1,5,-1); mr<-rbind(v1,v2,v3); mr
[,1] [,2] [,3] [,4]
v1 1 3 3 1
v2 2 2 5 3
v3 1 1 5 -1
> mc<-cbind(v1,v2,v3); mc
v1 v2 v3
[1,] 1 2 1
[2,] 3 2 1
[3,] 3 5 5
[4,] 1 3 -1
> m1<-rbind(mr,c(1:4)); m1
[,1] [,2] [,3] [,4]
v1 1 3 3 1
v2 2 2 5 3
v3 1 1 5 -1
1 2 3 4
> m2<-cbind(numeric(4),mc); m2
v1 v2 v3
[1,] 0 1 2 1
[2,] 0 3 2 1
[3,] 0 3 5 5
[4,] 0 1 3 -1

We may create a matrix by defining its entries.

Example. 1. A = (aij )2×3 , aij = i + j: type the following in a script


A <- matrix(numeric(6),2,3)
for (i in 1:2) {
for (j in 1:3) {
A[i,j]=i+j
}
}
> A
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5

2. B = (bij )3×2 , bij = (−1)i+j ,


B <- matrix(numeric(6),3,2)
for (i in 1:3) {
for (j in 1:2) {
B[i,j]=(-1)^(i+j)
}
}
> B
[,1] [,2]
[1,] 1 -1
[2,] -1 1
[3,] 1 -1
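As a side note (not part of the original examples), the same matrices can be built without the explicit double loops by using the base R function outer(), which evaluates a function at every pair (i, j):
> A <- outer(1:2, 1:3, "+")                       #a_ij = i + j
> A
     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    3    4    5
> B <- outer(1:3, 1:2, function(i,j) (-1)^(i+j))  #b_ij = (-1)^(i+j)
> B
     [,1] [,2]
[1,]    1   -1
[2,]   -1    1
[3,]    1   -1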

2.1.3 Special types of Matrices


Here are some special types of matrices.

Square matrices. An m × n matrix is a square matrix if the number of columns is
equal to the number of rows, m = n. A square matrix of size n × n is called an order n
square matrix. It is usually denoted by A = (a_{ij})_n.

Example. Order 2: $\begin{pmatrix} 2 & 3 \\ 1 & 2 \end{pmatrix}$   Order 3: $\begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 4 \\ 5 & 6 & 6 \end{pmatrix}$

The i-th diagonal entry of a square matrix is its (i, i)-entry. The diagonal entries of
a square matrix A = (a_{ij})_n of order n are the collection {a_{11}, a_{22}, ..., a_{nn}}.

Diagonal matrices. A square matrix with all the non-diagonal entries equal to 0 is
called a diagonal matrix, D = (d_{ij})_n with d_{ij} = 0 for all i ≠ j. It is usually denoted by
D = diag{d_1, d_2, ..., d_n}.

Example.
$$\mathrm{diag}\{1,1\} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad \mathrm{diag}\{0,0,0\} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \qquad \mathrm{diag}\{1,2,3,4\} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}$$

Scalar matrices. A diagonal matrix A = diag{a_1, a_2, ..., a_n} such that all the diagonal
entries are equal, a_1 = a_2 = ... = a_n, is called a scalar matrix.

Example. $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \qquad \begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 3 \end{pmatrix}$
Identity matrices. A scalar matrix with all diagonal entries equal 1 is called an
identity matrix. An identity matrix of order n is denoted as In . If there is no confusion
with the order of the matrix, we will write I instead. So a scalar matrix can be written
as cI.

Zero matrices. A matrix (of any size) with all entries equal 0 is called a zero matrix.
Usually denoted as 0m×n for the size m×n zero matrix, and 0n for the zero square matrix
of order n. If it is clear in the context, we will just denote it as 0.

Triangular matrices. A square matrix A = (a_{ij})_n with all entries below (above) the
diagonal equal to 0, that is, a_{ij} = 0 for all i > j (i < j), is called an upper (lower) triangular matrix.
It is a strictly upper (lower) triangular matrix if the diagonal entries are zero as well, that is,
a_{ij} = 0 for all i ≥ j (i ≤ j).

Upper triangular (all entries below the diagonal are 0):
$$\begin{pmatrix} * & * & \cdots & * \\ 0 & * & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & * \end{pmatrix}$$
Strictly upper triangular (all entries below and on the diagonal are 0):
$$\begin{pmatrix} 0 & * & \cdots & * \\ 0 & 0 & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}$$
Lower triangular (all entries above the diagonal are 0):
$$\begin{pmatrix} * & 0 & \cdots & 0 \\ * & * & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ * & * & \cdots & * \end{pmatrix}$$
Strictly lower triangular (all entries above and on the diagonal are 0):
$$\begin{pmatrix} 0 & 0 & \cdots & 0 \\ * & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ * & * & \cdots & 0 \end{pmatrix}$$
Symmetric matrices. A square matrix A = (a_{ij})_n such that a_{ij} = a_{ji} for all
i, j = 1, ..., n is called a symmetric matrix; that is, the entries are reflected along the
diagonal of A (for example, the (2,1)-entry equals the (1,2)-entry and the (3,2)-entry equals the (2,3)-entry).

Example. $\begin{pmatrix} 1 & 2 \\ 2 & 0 \end{pmatrix} \qquad \begin{pmatrix} 1 & 2 & 3 \\ 2 & 0 & 4 \\ 3 & 4 & 5 \end{pmatrix} \qquad \begin{pmatrix} 1 & 2 & 1 & 2 \\ 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 2 \\ 2 & 1 & 2 & 1 \end{pmatrix}$

Note that a symmetric upper triangular matrix is not necessarily a zero matrix: the diagonal
entries may be nonzero, so it is in fact a diagonal matrix.
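The special matrices above are easy to generate and check in R using the base functions diag, t, and lower.tri; the following short sketch anticipates the fuller summary of matrix operations in Section 2.2.3.
> diag(c(1,2,3))            #the diagonal matrix diag{1,2,3}
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
> 3*diag(3)                 #a scalar matrix, 3 times the identity I_3
     [,1] [,2] [,3]
[1,]    3    0    0
[2,]    0    3    0
[3,]    0    0    3
> A <- matrix(c(1,2,3,2,0,4,3,4,5),3,3)
> all(A == t(A))            #A equals its transpose, so A is symmetric
[1] TRUE
> A[lower.tri(A)] <- 0; A   #zero out entries below the diagonal: upper triangular
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    0    0    4
[3,]    0    0    5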
2.2 Matrix and Vector Operations
2.2.1 Definitions and Properties
Two matrices are equal if they are of the same size and all the entries are equal.

Example. 1. $\begin{pmatrix} 2 & 1 \\ 3 & 2 \end{pmatrix} \neq \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}$ for any choice of a, b, c, d, e, f, since the sizes differ.

2. $\begin{pmatrix} 1 & 1 \\ 3 & 2 \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ if and only if a = 1, b = 1, c = 3, d = 2.

Matrix Addition and Scalar Multiplication


Let A = (aij )m×n and B = (bij )m×n be two matrices and c ∈ R a real number. We define
the following operations as such.

1. (Scalar multiplication) cA = (caij ).

2. (Matrix addition) A + B = (aij + bij )m×n


   
Example. 1. $3\begin{pmatrix} 1 & 5 \\ 3 & 3 \\ -1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 15 \\ 9 & 9 \\ -3 & 6 \end{pmatrix}$

2. $\begin{pmatrix} 1 & 0 & 2 \\ -1 & 2 & 5 \end{pmatrix} + \begin{pmatrix} 6 & 2 & 0 \\ 1 & 4 & 0 \end{pmatrix} = \begin{pmatrix} 7 & 2 & 2 \\ 0 & 6 & 5 \end{pmatrix}$
Remark. 1. Matrix addition is only defined between matrices of the same size.

2. −A = (−1)A.

3. Matrix subtraction is defined to be the addition of a negative multiple of another
matrix,
A − B = A + (−1)B.
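As a quick illustrative check of the two examples above (the dedicated summary of matrix operations in R appears in Section 2.2.3):
> A <- matrix(c(1,0,2,-1,2,5),2,3,T); B <- matrix(c(6,2,0,1,4,0),2,3,T)
> A+B
     [,1] [,2] [,3]
[1,]    7    2    2
[2,]    0    6    5
> 3*matrix(c(1,5,3,3,-1,2),3,2,T)
     [,1] [,2]
[1,]    3   15
[2,]    9    9
[3,]   -3    6
> A-B                        #subtraction, i.e. A + (-1)B
     [,1] [,2] [,3]
[1,]   -5   -2    2
[2,]   -2   -2    5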

Theorem (Properties of matrix addition and scalar multiplication). For matrices A =


(aij )m×n , B = (bij )m×n , C = (cij )m×n , and real numbers a, b ∈ R,

(i) (Commutative) A + B = B + A,

(ii) (Associative) A + (B + C) = (A + B) + C,

(iii) (Additive identity) 0m×n + A = A,

(iv) (Additive inverse) A + (−A) = 0m×n ,

(v) (Distributive law) a(A + B) = aA + aB,

(vi) (Scalar addition) (a + b)A = aA + bA,

(vii) (ab)A = a(bA),

(viii) if aA = 0m×n , then either a = 0 or A = 0.

Proof. To show equality, we have to show that the matrices on the left and right of the
equality have the same size, and that the corresponding entries are equal. It is clear that
the matrices on both sides have the same size, so we will only check that the entries agree.

(i) aij + bij = bij + aij follows directly from commutativity of addition of real numbers.

(ii) aij + (bij + cij ) = (aij + bij ) + cij follows directly from associativity of addition of
real numbers.

(iii) 0 + aij = aij follows directly from the additive identity property of real numbers.

(iv) aij + (−aij ) = 0 follows directly from additive inverse property of real numbers.

(v) a(aij + bij ) = aaij + abij follows directly from distributive property of addition of
real numbers.

(vi) (a + b)aij = aaij + baij follows directly from distributive property of addition of real
numbers.

(vii) (ab)aij = a(baij ) follows directly from associativity of multiplication of real numbers.
(viii) If aaij = 0, then a = 0 or aij = 0. Suppose a ̸= 0, then aij = 0 for all i, j. So
A = 0.

Remark. 1. Since addition is associative, we will not write the parentheses when adding
multiple matrices.

2. Property (iii) and (i) imply that A + 0 = A.

3. Property (iv) and (i) imply that −A + A = 0.

Matrix Multiplication
Let A = (a_{ij})_{m×p} and B = (b_{ij})_{p×n}. The product AB is defined to be an m × n matrix
whose (i, j)-entry is
$$\sum_{k=1}^{p} a_{ik}b_{kj} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{ip}b_{pj}.$$

Example.
$$\underset{(2 \times 3)}{\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}} \; \underset{(3 \times 2)}{\begin{pmatrix} 1 & 1 \\ 2 & 3 \\ -1 & -2 \end{pmatrix}} = \underset{(2 \times 2)}{\begin{pmatrix} 1+4-3 & 1+6-6 \\ 4+10-6 & 4+15-12 \end{pmatrix}} = \begin{pmatrix} 2 & 1 \\ 8 & 7 \end{pmatrix}$$
Remark. 1. For AB to be defined, the number of columns of A must agree with the
number of rows of B. The resultant matrix has the same number of rows as A, and
the same number of columns as B.

(m × p)(p × n) = (m × n).

2. Matrix multiplication is not commutative, that is, AB ≠ BA in general. For example,
$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 3 & 4 \\ 1 & 2 \end{pmatrix} \neq \begin{pmatrix} 2 & 1 \\ 4 & 3 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

3. If we multiply A to the left of B, we are pre-multiplying A to B, AB. If we
multiply A to the right of B, we are post-multiplying A to B, BA. Pre-multiplying
A to B is the same as post-multiplying B to A.

Theorem (Properties of matrix multiplication). (i) (Associative) For matrices A =


(aij )m×p , B = (bij )p×q , and C = (cij )q×n (AB)C = A(BC).

(ii) (Left distributive law) For matrices A = (aij )m×p , B = (bij )p×n , and C = (cij )p×n ,
A(B + C) = AB + AC.

(iii) (Right distributive law) For matrices A = (aij )m×p , B = (bij )m×p , and C = (cij )p×n ,
(A + B)C = AC + BC.

(iv) (Commute with scalar multiplication) For any real number c ∈ R, and matrices
A = (aij )m×p , B = (bij )p×n , c(AB) = (cA)B = A(cB).
(v) (Multiplicative identity) For any m × n matrix A, Im A = A = AIn .

(vi) (Zero divisors) There exist A ≠ 0m×p and B ≠ 0p×n such that AB = 0m×n ; that is,
the product of two nonzero matrices can be a zero matrix.

(vii) (Zero matrix) For any m × n matrix A, A0n×p = 0m×p and 0p×m A = 0p×n .

The proof is beyond the scope of this course. Interested readers may refer to the
appendix.
Remark. 1. For square matrices, we define $A^2 = AA$, and inductively $A^n = AA^{n-1}$ for
n ≥ 2. It follows that $A^nA^m = A^{n+m}$.

2. In general $(AB)^n \neq A^nB^n$. (Why?)

3. Nilpotent matrices are examples of zero divisors. A square matrix A is said to be
nilpotent if there is a positive integer k ∈ Z such that A^k = 0. For example,
$$\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 5 & -3 & 2 \\ 15 & -9 & 6 \\ 10 & -6 & 4 \end{pmatrix}$$
are nilpotent matrices.
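A quick numerical check of the first nilpotent matrix in R (illustrative only):
> N <- matrix(c(0,1,0,0),2,2,T)   #the 2 by 2 matrix above
> N%*%N                           #N^2 is the zero matrix
     [,1] [,2]
[1,]    0    0
[2,]    0    0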

Transpose
For an m × n matrix A, the transpose of A, written A^T, is the n × m matrix whose
(i, j)-entry is the (j, i)-entry of A; that is, if $A^T = (b_{ij})_{n\times m}$, then
$$b_{ij} = a_{ji}$$
for all i = 1, ..., n, j = 1, ..., m. Equivalently, the rows of A are the columns of A^T and
vice versa.

Example. 1. $\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$   2. $\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}^T = \begin{pmatrix} 1 & 1 & 0 \end{pmatrix}$   3. $\begin{pmatrix} 1 & 2 \\ 2 & 0 \end{pmatrix}^T = \begin{pmatrix} 1 & 2 \\ 2 & 0 \end{pmatrix}$
This gives us an alternative way to define symmetric matrices. A square matrix A is
symmetric if and only if AT = A.

Theorem (Properties of transpose). (i) For any matrix A, (AT )T = A.

(ii) For any matrix A, and real number c ∈ R, (cA)T = cAT .

(iii) For matrices A and B of the same size, (A + B)T = AT + BT .

(iv) For matrices A = (aij )m×p and B = (bij )p×n , (AB)T = BT AT .

Refer to the appendix for the proof.


Example.
$$\left(\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 2 & 3 \\ -1 & -2 \end{pmatrix}\right)^T = \begin{pmatrix} 2 & 1 \\ 8 & 7 \end{pmatrix}^T = \begin{pmatrix} 2 & 8 \\ 1 & 7 \end{pmatrix}$$
$$= \begin{pmatrix} 1 & 2 & -1 \\ 1 & 3 & -2 \end{pmatrix}\begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 3 \\ -1 & -2 \end{pmatrix}^T\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}^T$$
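The last identity, (AB)^T = B^T A^T, can also be checked numerically in R for the matrices above (an illustrative sketch):
> A <- matrix(1:6,2,3,T); B <- matrix(c(1,1,2,3,-1,-2),3,2,T)
> t(A%*%B)
     [,1] [,2]
[1,]    2    8
[2,]    1    7
> all(t(A%*%B) == t(B)%*%t(A))    #property (iv) of the theorem
[1] TRUE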

Which of the following statements are true? Justify.

(a) If A and B are symmetric matrices of the same size, then so is A + B.

(b) If A and B are symmetric matrices (with the appropriate sizes), then so is AB.

Trace
Let A = (a_{ij})_{n×n} be a square matrix of order n. Then the trace of A, denoted by tr(A),
is the sum of the diagonal entries of A,
$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn}.$$
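In R, the trace can be computed as the sum of the diagonal entries, sum(diag(A)) (see also the summary in Section 2.2.3). For example:
> A <- matrix(1:9,3,3)
> sum(diag(A))          #tr(A) = 1 + 5 + 9
[1] 15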

2.2.2 Dot Product

How do we multiply vectors? Given two (column) vectors u, v ∈ $\mathbb{R}^n$, we cannot multiply
them directly since their sizes don't match. However, if we transpose one of the vectors, we can
multiply:

1. $\mathbf{u}\mathbf{v}^T = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}\begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix} = \begin{pmatrix} u_1v_1 & u_1v_2 & \cdots & u_1v_n \\ u_2v_1 & u_2v_2 & \cdots & u_2v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_nv_1 & u_nv_2 & \cdots & u_nv_n \end{pmatrix} = (u_iv_j)_n$, or

2. $\mathbf{u}^T\mathbf{v} = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = u_1v_1 + u_2v_2 + \cdots + u_nv_n = \sum_{i=1}^n u_iv_i$.

The first multiplication is known as the outer product, denoted $\mathbf{u} \otimes \mathbf{v}$, and the second is
known as the inner product, or dot product, denoted $\mathbf{u} \cdot \mathbf{v}$. In this course, we will only be
discussing the inner product.
   
Example. 1. $\begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix} \cdot \begin{pmatrix} 2 \\ 2 \\ 2 \end{pmatrix} = 2 + 4 - 2 = 4.$

2. $\begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = 1 + 0 - 1 = 0.$

3. $\begin{pmatrix} 2 \\ 3 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ -2 \end{pmatrix} = 2 - 6 = -4.$
The norm of a vector u ∈ $\mathbb{R}^n$ is defined to be the square root of the inner product of
u with itself, and is denoted ∥u∥,
$$\|\mathbf{u}\| = \sqrt{\mathbf{u} \cdot \mathbf{u}}.$$

Geometric meaning of norm. The distance between the point $\begin{pmatrix} x \\ y \end{pmatrix}$ and the origin in
$\mathbb{R}^2$ is given by
$$\text{distance} = \sqrt{x^2 + y^2} = \left\| \begin{pmatrix} x \\ y \end{pmatrix} \right\|.$$
That is, in $\mathbb{R}^2$, the norm of a vector can be interpreted as its distance from the origin.


 
Similarly, in $\mathbb{R}^3$, the distance of a vector $\begin{pmatrix} x \\ y \\ z \end{pmatrix}$ to the origin is
$$\text{distance} = \sqrt{x^2 + y^2 + z^2} = \left\| \begin{pmatrix} x \\ y \\ z \end{pmatrix} \right\|.$$
We may thus generalize and define the distance between a vector v and the origin in
$\mathbb{R}^n$ to be its norm, ∥v∥.

Observe that the distance between two vectors v = (v_i) and u = (u_i) is
$$d(\mathbf{u}, \mathbf{v}) = \sqrt{(u_1-v_1)^2 + (u_2-v_2)^2 + \cdots + (u_n-v_n)^2} = \|\mathbf{u} - \mathbf{v}\|.$$

Example. 1. $\left\| \begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix} \right\| = \sqrt{1^2 + 2^2 + (-1)^2} = \sqrt{6}.$

2. $d\left( \begin{pmatrix} 1 \\ 3 \end{pmatrix}, \begin{pmatrix} 0 \\ 5 \end{pmatrix} \right) = \left\| \begin{pmatrix} 1-0 \\ 3-5 \end{pmatrix} \right\| = \sqrt{1^2 + (-2)^2} = \sqrt{5}.$
The angle between two nonzero vectors u, v ≠ 0 is the number θ with 0 ≤ θ ≤ π
such that
$$\cos(\theta) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|}.$$
This is a natural definition because, once again, in $\mathbb{R}^2$ this is indeed the definition of the
trigonometric function cosine.


Theorem (Properties of dot product and norm). Let u, v, w ∈ Rn be vectors and a, b, c ∈


R be scalars.

(i) (Symmetric) u · v = v · u.

(ii) (Scalar multiplication) cu · v = (cu) · v = u · (cv).

(iii) (Distribution) u · (av + bw) = au · v + bu · w.

(iv) (Positive definite) u · u ≥ 0 with equality if and only if u = 0.

(v) ∥cu∥ = |c|∥u∥.

Proof. Let u = (u_i), v = (v_i), and w = (w_i).

(i) $\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n u_iv_i = \sum_{i=1}^n v_iu_i = \mathbf{v} \cdot \mathbf{u}$. Alternatively, since $\mathbf{u}^T\mathbf{v}$ is a 1 × 1 matrix, it
is symmetric. So $\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T\mathbf{v} = (\mathbf{u}^T\mathbf{v})^T = \mathbf{v}^T\mathbf{u} = \mathbf{v} \cdot \mathbf{u}$.

(ii) $c\sum_{i=1}^n u_iv_i = \sum_{i=1}^n (cu_i)v_i = \sum_{i=1}^n u_i(cv_i)$.

(iii) $\sum_{i=1}^n u_i(av_i + bw_i) = \sum_{i=1}^n (au_iv_i + bu_iw_i) = a\sum_{i=1}^n u_iv_i + b\sum_{i=1}^n u_iw_i$.

(iv) $\mathbf{u} \cdot \mathbf{u} = \sum_{i=1}^n u_i^2 \ge 0$ since the $u_i \in \mathbb{R}$ are real numbers. A sum of squares
of real numbers is equal to 0 if and only if all the numbers are 0.

(v) $\|c\mathbf{u}\| = \sqrt{\sum_{i=1}^n (cu_i)^2} = \sqrt{c^2\sum_{i=1}^n u_i^2} = |c|\|\mathbf{u}\|$.

A vector u is a unit vector if ∥u∥ = 1. We can normalize every nonzero vector by
multiplying it by the reciprocal of its norm,
$$\mathbf{u} \longrightarrow \frac{\mathbf{u}}{\|\mathbf{u}\|}, \quad \text{for } \mathbf{u} \neq \mathbf{0}.$$
Then
$$\frac{\mathbf{u}}{\|\mathbf{u}\|} \cdot \frac{\mathbf{u}}{\|\mathbf{u}\|} = \frac{\mathbf{u} \cdot \mathbf{u}}{\|\mathbf{u}\|^2} = 1.$$

Example. Let $\mathbf{u} = \begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix}$. We have computed that $\|\mathbf{u}\| = \sqrt{6}$, and so
$$\frac{1}{\sqrt{6}}\begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix}$$
is a unit vector. Indeed,
$$\frac{1}{\sqrt{6}}\begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix} \cdot \frac{1}{\sqrt{6}}\begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix} = \frac{1}{\sqrt{6}}\frac{1}{\sqrt{6}}\begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix} = \frac{6}{6} = 1.$$

Exercise:

1. Which of the following vectors are unit vectors?

2. For the vectors that are not unit vectors, normalize them, if possible.

(i) $\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$  (ii) $\frac{1}{2}\begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}$  (iii) $\frac{1}{2}\begin{pmatrix} 2 \\ 0 \\ -2 \end{pmatrix}$  (iv) $\cos(\pi/2)\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$
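All of these quantities can be computed with base R arithmetic. The sketch below reuses the vector u = (1, 2, −1) from the examples above; the vectors a and b at the end are extra illustrative choices, not taken from the notes.
> u <- c(1,2,-1); v <- c(2,2,2)
> sum(u*v)                       #dot product u . v, as in Example 1 above
[1] 4
> sqrt(sum(u*u))                 #norm of u, sqrt(6)
[1] 2.44949
> u/sqrt(sum(u*u))               #the normalized unit vector u/||u||
[1]  0.4082483  0.8164966 -0.4082483
> a <- c(1,0); b <- c(1,1)
> acos(sum(a*b)/(sqrt(sum(a*a))*sqrt(sum(b*b))))   #angle between a and b: pi/4
[1] 0.7853982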

2.2.3 Matrix Operations in R


Summary of notations
• Addition A + B, A+B

• Subtraction A − B, A-B

• Multiplication AB, A%*%B

• Hadamard multiplication A ⊙ B, A*B

• Kronecker multiplication A ⊗ B, A%x%B

• Transpose AT , t(A)

• The product AT A, crossprod(A,A)


• The product AAT , crossprod(t(A),t(A))

• Inversion A−1 , solve(A)

• Trace tr(A), sum(diag(A))

• Determinant det(A) or |A|, det(A)

• Eigenvalues and eigenvectors, eigen(A)

• To produce a vector with the diagonals of a matrix A, diag(A)

• Trace of a matrix A, sum(diag(A))

• Create a diagonal matrix with vector v, diag(v)

• To change a dataframe into a matrix, data.matrix(dataframe)

• To change some other object into a matrix, as.matrix(object)

• To find the length of a vector x, length(x)

• To find the dimensions of a matrix A, dim(A)

• Singular value decomposition of a matrix A, svd(A)

To define an order n identity matrix In , type diag(n)


> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1

To define a size m × n zero matrix 0m×n , type matrix(0,m,n) or matrix(numeric(m*n),m,n)


> matrix(0,3,2)
[,1] [,2]
[1,] 0 0
[2,] 0 0
[3,] 0 0

Matrix multiplication, A%*%B.


> A <- matrix(c(1,2,3,4,5,6),2,3,T); B <- matrix(c(1,1,2,3,-1,-2),3,2,T)
> A%*%B
[,1] [,2]
[1,] 2 1
[2,] 8 7

The Hadamard multiplication multiplies vectors or matrices entry-wise; it is defined


only for vectors or matrices with the same size. It can also be used for scalar multiplica-
tion.
> v1<-c(2,5,6); v2<-c(2,3,4); v1; v2
[1] 2 5 6
[1] 2 3 4
> v1*v2
[1] 4 15 24
> A1<-matrix(c(1,2,4,2,5,-1),3,2,T); A2<-matrix(c(2,2,3,2,2,5),3,2,T); A1;
A2
[,1] [,2]
[1,] 1 2
[2,] 4 2
[3,] 5 -1
[,1] [,2]
[1,] 2 2
[2,] 3 2
[3,] 2 5
> A1*A2
[,1] [,2]
[1,] 2 4
[2,] 12 4
[3,] 10 -5
> 3*A1
[,1] [,2]
[1,] 3 6
[2,] 12 6
[3,] 15 -3

The divide function / divides vectors or matrices entry-wise.


> v1/v2
[1] 1.000000 1.666667 1.500000
> A1/A2
[,1] [,2]
[1,] 0.500000 1.0
[2,] 1.333333 1.0
[3,] 2.500000 -0.2

The ^ function raises the entries in the first vector or matrix to the exponent of the
second vector or matrix.
> v1^v2
[1] 4 125 1296
> A1^A2
[,1] [,2]
[1,] 1 4
[2,] 64 4
[3,] 25 -1

2.2.4 Relational Operators in R


The > function checks if each entry in the first vector or matrix is greater than the cor-
responding entry in the second vector or matrix.
> v1 <- c(2,5,0,-1); v2 <- c(2,3,3,5)
> v1>v2
[1] FALSE TRUE FALSE FALSE
> A1<-matrix(v1,2,2,T); A2<-matrix(v2,2,2,T)
> A1>A2
[,1] [,2]
[1,] FALSE TRUE
[2,] FALSE FALSE

The >= function checks if each entry in the first vector or matrix is greater than or
equal to the corresponding entry in the second vector or matrix.
> v1>=v2
[1] TRUE TRUE FALSE FALSE
> A1>=A2
[,1] [,2]
[1,] TRUE TRUE
[2,] FALSE FALSE

The < and <= functions work analogously.

The == function checks if each entry in the first vector or matrix is equal to the cor-
responding entry in the second vector or matrix.
> v1==v2
[1] TRUE FALSE FALSE FALSE
> A1==A2
[,1] [,2]
[1,] TRUE FALSE
[2,] FALSE FALSE

The != function checks if each entry in the first vector or matrix is different from the
corresponding entry in the second vector or matrix.
> v1!=v2
[1] FALSE TRUE TRUE TRUE
> A1!=A2
[,1] [,2]
[1,] FALSE TRUE
[2,] TRUE TRUE

2.3 Linear Systems


2.3.1 Introduction to Linear Systems
An equation is linear if the variables (unknowns) are only acted upon by multiplying by
constants and adding them up. A linear equation in n variables has the form
a1 x1 + a2 x2 + · · · + an xn = b.
Here a1 , a2 , ..., an are constants (fixed real numbers), called the coefficients, b is called
the constant, and x1 , x2 , ..., xn are the variables. The linear equation is said to be in
standard form if all the terms involving the variables are on one side of the equation, and
the constant is on the other side.

Example. The following are examples of linear equations in standard form

1. 2x + y = 3

2. x1 − x2 + 3x3 − 5x4 = 2

The following are examples of linear equations that are not in standard form

1. y = x sin( π6 )

2. x = 2y

The following are not linear equations

1. xy = 3

2. x2 + y 2 = 1

3. cos(x) + 4 sin(y) = 2

A system of linear equations, or a linear system consists of a finite number of linear


equations. In general, a linear system with n variables and m equations has the form

a11 x1 + a12 x2 + · · · + a1n xn = b1


a21 x1 + a22 x2 + · · · + a2n xn = b2
..
.
am1 x1 + am2 x2 + · · · + amn xn = bm

A linear system can be expressed uniquely as an augmented matrix
$$\left(\begin{array}{cccc|c} a_{11} & a_{12} & \cdots & a_{1n} & b_1 \\ a_{21} & a_{22} & \cdots & a_{2n} & b_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} & b_m \end{array}\right).$$

The linear system is homogeneous if there are no constant terms

a11 x1 + a12 x2 + · · · + a1n xn = 0


a21 x1 + a22 x2 + · · · + a2n xn = 0
..
.
am1 x1 + am2 x2 + · · · + amn xn = 0

Then the corresponding augmented matrix is
$$\left(\begin{array}{cccc|c} a_{11} & a_{12} & \cdots & a_{1n} & 0 \\ a_{21} & a_{22} & \cdots & a_{2n} & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} & 0 \end{array}\right).$$
Given a linear system
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
..
.
am1 x1 + am2 x2 + · · · + amn xn = bm
the homogeneous system associated to it is
a11 x1 + a12 x2 + · · · + a1n xn = 0
a21 x1 + a22 x2 + · · · + a2n xn = 0
..
.
am1 x1 + am2 x2 + · · · + amn xn = 0
Example. 1. (Nonhomogeneous) Linear system:
3x + 2y − z = 1
5y + 2z = 3
x + z = 2

The corresponding augmented matrix is


 
3 2 −1 1
 0 5 2 3 
1 0 1 2

The associated homogeneous system is


3x + 2y − z = 0
5y + 2z = 0
x + z = 0
with augmented matrix  
3 2 −1 0
 0 5 2 0 
1 0 1 0
2. Linear system:
2x − 1 = 3y
3 − 9y = 6x
The augmented matrix is  
2 −3 1
.
6 9 3
The associated homogeneous system is
2x − 3y = 0
6x + 9y = 0

with augmented matrix  


2 −3 0
.
6 9 0
3. Sometimes the linear system may contain unknown coefficients or constants. We
may still proceed to find the augmented matrix.

x1 − 4x2 + ax3 − 6x4 = 2


3x1 + 2x2 = a
6x1 + 2x2 − x3 + (a − 1)x4 = −1

The augmented matrix is


 
1 −4 a −6 2
 3 2 0 0 a .
6 2 −1 (a − 1) −1

2.3.2 Solutions to a Linear System


Given a linear system

a11 x1 + a12 x2 + · · · + a1n xn = b1


a21 x1 + a22 x2 + · · · + a2n xn = b2
..
.
am1 x1 + am2 x2 + · · · + amn xn = bm

we say that
x1 = c1 , x2 = c2 , ..., xn = cn
is a solution to the linear system if the equations are simultaneously satisfied after making
the substitution, that is,

a11 c1 + a12 c2 + · · · + a1n cn = b1


a21 c1 + a22 c2 + · · · + a2n cn = b2
..
.
am1 c1 + am2 c2 + · · · + amn cn = bm

Example. x = 1 = y is a solution to

3x − 2y = 1
x + y = 2

Note that solutions may not be unique.

Example.
x + 2y = 5
,
2x + 4y = 10
solutions: x = 1, y = 2, or x = 3, y = 1, etc.

If the solution is not unique, we need to introduce parameters, usually denoted by


r, s, t, or s1 , s2 , ..., sk . This means that any choice of a real number for each of the
parameter is a solution to the linear system. A general solution to a linear system captures
all possible solutions to the linear system.
Example. 1.
x + 2y = 5
,
2x + 4y = 10
General solutions: x = 5 − 2s, y = s, s ∈ R, or x = t, y = (5 − t)/2, t ∈ R, etc.

2.
x − y + 3z = 1 ,
General solution: x = 1 + s − 3t, y = s, z = t, s, t ∈ R.

It is possible for a linear system to have no solutions, for example,

x + y = 2
x − y = 0
2x + y = 1

In this case, we say that the system is inconsistent.

2.3.3 Gaussian and Gauss-Jordan Elimination


Consider a linear system with variables x, y, z with the following augmented matrix
 
1 0 0 1
 0 1 0 2 .
0 0 1 3

It is immediate that the unique solution is x = 1, y = 2, z = 3. This is an exam-


ple of a (an augmented) matrix that is in reduced row-echelon form. A matrix is in
reduced row-echelon form (RREF) if it has the following properties.

1. If there are any rows that consist entirely of zeros, then they are grouped together at
the bottom of the matrix. A row consisting entirely of zeros is called a zero row.

2. For any nonzero row, the first nonzero number of a row from the left is called a
leading entry. In any two successive nonzero rows, the leading entry in the lower row
occurs farther to the right than the leading entry in the higher row.

3. If a row does not consist entirely of zeros, then the first nonzero number in the row is
a 1.

4. Each column that contains a leading entry is called a pivot column. In each pivot
column, all entries except the leading entry is zero.

A matrix that has the first three properties is said to be in row-echelon form. Schematically, a
matrix in row-echelon form has a staircase pattern: the leading entry of each nonzero row lies
strictly to the right of the leading entry of the row above it, and any zero rows are at the bottom,
for example,
$$\begin{pmatrix} * & * & * & * & * & * \\ 0 & 0 & * & * & * & * \\ 0 & 0 & 0 & 0 & * & * \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
A matrix in RREF has the same staircase pattern, with every leading entry equal to 1 and every
other entry in each pivot column equal to 0, for example,
$$\begin{pmatrix} 1 & * & 0 & * & 0 & * \\ 0 & 0 & 1 & * & 0 & * \\ 0 & 0 & 0 & 0 & 1 & * \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$

Example. The following are examples of augmented matrices in row-echelon form but
not in reduced row-echelon form.

1. $\left(\begin{array}{ccc|c} -1 & 2 & 3 & 4 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 2 & 3 \end{array}\right)$   2. $\left(\begin{array}{cc|c} 1 & -1 & 1 \\ 0 & 0 & 1 \end{array}\right)$   3. $\left(\begin{array}{cc|c} 1 & -1 & 0 \\ 0 & 0 & 3 \end{array}\right)$

The following are examples of augmented matrices in reduced row-echelon form.

1. $\left(\begin{array}{cccc|c} 0 & 1 & 2 & 0 & 1 \\ 0 & 0 & 0 & 1 & 3 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right)$   2. $\left(\begin{array}{ccc|c} 1 & 2 & 0 & -5 \end{array}\right)$

The following is an augmented matrix that is not in row-echelon form.

1. $\left(\begin{array}{ccc|c} 1 & 2 & 0 & -5 \\ 1 & 0 & 1 & 3 \end{array}\right)$
Example. From the REF or RREF, we are able to read off a general solution.

1. $\left(\begin{array}{ccc|c} 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 \end{array}\right)$
This is in REF. We let the third variable be the parameter s; then the second row gives
y = −s, and the first row gives x = 1 − (−s) − s = 1. So a general solution is
x = 1, y = −s, z = s.

2. $\left(\begin{array}{ccc|c} 1 & 0 & 0 & 1 \\ 0 & 1 & -1 & 0 \end{array}\right)$
This is in RREF. General solution: x = 1, y = s, z = s.
To solve a linear system, we can perform algebraic operations on the system (equiv-
alently, the augmented matrix) that do not alter the solution set until it is in REF or
RREF. This is achieved using the following 3 types of elementary row operations.
1. Exchanging 2 rows, Ri ↔ Rj ,

2. Adding a multiple of a row to another, Ri + cRj , c ∈ R,

3. Multiplying a row by a nonzero constant, aRj , a ̸= 0.

Example. 1. Exchange row 2 and row 4:
$$\begin{pmatrix} 1 & 1 & 2 & 5 \\ 0 & 0 & 2 & 4 \\ 0 & 1 & 1 & 2 \\ 0 & 1 & 4 & 3 \end{pmatrix} \xrightarrow{R_2 \leftrightarrow R_4} \begin{pmatrix} 1 & 1 & 2 & 5 \\ 0 & 1 & 4 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 2 & 4 \end{pmatrix}$$

2. Add 2 times row 3 to row 2:
$$\begin{pmatrix} 1 & 1 & 2 & 5 \\ 0 & 0 & 2 & 4 \\ 0 & 1 & 1 & 2 \\ 0 & 1 & 4 & 3 \end{pmatrix} \xrightarrow{R_2 + 2R_3} \begin{pmatrix} 1 & 1 & 2 & 5 \\ 0 & 2 & 4 & 8 \\ 0 & 1 & 1 & 2 \\ 0 & 1 & 4 & 3 \end{pmatrix}$$

3. Multiply row 2 by 1/2:
$$\begin{pmatrix} 1 & 1 & 2 & 5 \\ 0 & 0 & 2 & 4 \\ 0 & 1 & 1 & 2 \\ 0 & 1 & 4 & 3 \end{pmatrix} \xrightarrow{\frac{1}{2}R_2} \begin{pmatrix} 1 & 1 & 2 & 5 \\ 0 & 0 & 1 & 2 \\ 0 & 1 & 1 & 2 \\ 0 & 1 & 4 & 3 \end{pmatrix}$$

Remark. 1. Note that we cannot multiply a row by 0, as it may change the linear
system. For example, consider

x + y = 2
x − y = 0

It has a unique solution x = 1, y = 1. Suppose in the augmented matrix we multiply


row 2 by 0,    
1 1 2 0R2 1 1 2
−−→
1 −1 0 0 0 0
then the system now has a general solution x = 2 − s, y = s.

2. Elementary row operations may not commute. For example,


   
1 0 0 2R1 R2 ↔R1 0 1 0
−−→−−−−→
0 1 0 2 0 0

is not the same as    


1 0 0 R2 ↔R1 2R1 0 2 0
−−−−→−−→
0 1 0 1 0 0
But if the elementary row operations do commute, we can stack them
   
1 0 0 2R2 2 0 0
−−→
0 1 0 2R1 0 2 0
3. For the second type of elementary row operation, the row we put first is the row we
are performing the operation upon,
   
1 0 0 R1 +2R2 1 2 0
−−−−→
0 1 0 0 1 0

instead of    
1 0 0 2R2 +R1 1 0 0
−−−−→
0 1 0 1 2 0
In fact, the 2R2 + R1 is not an elementary row operation, but a combination of 2
operations, 2R2 then R2 + R1 . Here’s another example,
   
1 0 0 R1 +R2 1 1 0
−−−−→
0 1 0 1 0 0

and    
1 0 0 R2 +R1 1 0 0
−−−−→
0 1 0 1 1 0
The algorithm to reduce a matrix (or augmented matrix) to REF is known as Gaussian
elimination; continuing the process to reach the RREF is known as Gauss-Jordan elimination.
Step 1: Locate the leftmost column that does not consist entirely of zeros.

Step 2: Interchange the top row with another row, if necessary, to bring a nonzero entry to
the top of the column found in Step 1.

Step 3: For each row below the top row, add a suitable multiple of the top row to it so that
the entry below the leading entry of the top row becomes zero.

Step 4: Now cover the top row in the augmented matrix and begin again with Step 1
applied to the submatrix that remains. Continue this way until the entire matrix
is in row-echelon form.
Once the above process is completed, we will end up with a REF. The following steps
continue the process to reduce it to its RREF.
Step 5: Multiply a suitable constant to each row so that all the leading entries become 1.

Step 6: Beginning with the last nonzero row and working upward, add suitable multiples
of each row to the rows above to introduce zeros above the leading entries.
Algorithm. 1. Express the given linear system as an augmented matrix. Make sure
that the linear system is in standard form.

2. Use the Gaussian elimination to reduce the augmented matrix to a row-echelon


form, or use the Gauss-Jordan elimination to reduce the augmented matrix to its
reduced row-echelon form.

3. If the system is consistent, assign the variables corresponding to the nonpivot


columns as parameters.

4. Use back substitution (if the augmented matrix is in REF) to obtain a general
solution, or read off the general solution (if the augmented matrix is in RREF).
Example. 1.
     
$$\begin{pmatrix} 1 & 1 & 2 & 4 \\ -1 & 2 & -1 & 1 \\ 2 & 0 & 3 & -2 \end{pmatrix} \xrightarrow[R_3 - 2R_1]{R_2 + R_1} \begin{pmatrix} 1 & 1 & 2 & 4 \\ 0 & 3 & 1 & 5 \\ 0 & -2 & -1 & -10 \end{pmatrix} \xrightarrow{R_3 + \frac{2}{3}R_2} \begin{pmatrix} 1 & 1 & 2 & 4 \\ 0 & 3 & 1 & 5 \\ 0 & 0 & -1/3 & -20/3 \end{pmatrix}$$

The augmented matrix is now in REF. By back substitution, we have
$$z = 20, \quad y = \tfrac{1}{3}(5 - z) = -5, \quad x = 4 - y - 2z = -31.$$
Alternatively, we can continue to reduce it to its RREF and read off the solution.
     
$$\begin{pmatrix} 1 & 1 & 2 & 4 \\ 0 & 3 & 1 & 5 \\ 0 & 0 & -1/3 & -20/3 \end{pmatrix} \xrightarrow{-3R_3} \begin{pmatrix} 1 & 1 & 2 & 4 \\ 0 & 3 & 1 & 5 \\ 0 & 0 & 1 & 20 \end{pmatrix} \xrightarrow[R_1 - 2R_3]{R_2 - R_3} \begin{pmatrix} 1 & 1 & 0 & -36 \\ 0 & 3 & 0 & -15 \\ 0 & 0 & 1 & 20 \end{pmatrix}$$
$$\xrightarrow{\frac{1}{3}R_2} \begin{pmatrix} 1 & 1 & 0 & -36 \\ 0 & 1 & 0 & -5 \\ 0 & 0 & 1 & 20 \end{pmatrix} \xrightarrow{R_1 - R_2} \begin{pmatrix} 1 & 0 & 0 & -31 \\ 0 & 1 & 0 & -5 \\ 0 & 0 & 1 & 20 \end{pmatrix}$$

Indeed, the system is consistent, with unique solution x = −31, y = −5, z = 20.

2. Solve the following linear system

x1 + 3x2 − 2x3 + + 2x5 = 0


2x1 + 6x2 − 5x3 − 2x4 + 4x5 − 3x6 = −1
5x3 + 10x4 + 15x6 = 5
2x1 + 6x2 + 8x4 + 4x5 + 18x6 = 6

The augmented matrix is


 
1 3 −2 0 2 0 0
 2
 6 −5 −2 4 −3 −1 

 0 0 5 10 0 15 5 
2 6 0 8 4 18 6

We will begin the Gaussian elimination.


   
1 3 −2 0 2 0 0 1 3 −2 0 2 0 0
 2 6 −5 −2 4 −3 −1  R2 −2R1 R4 −2R1  0 0 −1 −2 0 −3 −1 
 0 0 5 10 0 15 5  −−−−→−−−−→  0 0
   
5 10 0 15 5 
2 6 0 8 4 18 6 0 0 4 8 0 18 6
   
1 3 −2 0 2 0 0 1 3 −2 0 2 0 0
R3 +5R2 R4 +4R2  0 0 −1 −2 0 −3 −1  R3 ↔R4  0
   0 −1 −2 0 −3 −1 
−− −−→−−−−→  −−−−→  
0 0 0 0 0 0 0  0 0 0 0 0 6 2 
0 0 0 0 0 6 2 0 0 0 0 0 0 0

The augmented matrix is now in REF. We may use back substitution to obtain the
solution, or continue to reduce to its RREF.
$$\xrightarrow[\frac{1}{6}R_3]{-R_2} \begin{pmatrix} 1 & 3 & -2 & 0 & 2 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 & 3 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \xrightarrow{R_2 - 3R_3} \begin{pmatrix} 1 & 3 & -2 & 0 & 2 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$
$$\xrightarrow{R_1 + 2R_2} \begin{pmatrix} 1 & 3 & 0 & 4 & 2 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$

This corresponds to the following linear system
x1 + 3x2 + 4x4 + 2x5 = 0
x3 + 2x4 = 0
x6 = 1/3

Letting x2 = r, x4 = s, x5 = t, a general solution is

x1 = −3r − 4s − 2t, x2 = r, x3 = −2s, x4 = s, x5 = t, x6 = 1/3.

2.3.4 Solving a Linear System in R


Install the packages pracma and matlib.
> install.packages("pracma")
> library(pracma)
> install.packages("matlib")
> library(matlib)
> library(MASS)

Example. 1. Solve the linear system
$$\begin{cases} x + y + 2z = 4 \\ -x + 2y - z = 1 \\ 2x + 3z = -2 \end{cases}$$

> A <- matrix(c(1, 1, 2, 4,-1,2,-1,1,2,0,3,-2),3,4,T)


> View(A)
> A[2,] <- A[2,]+A[1,]
> A[3,] <- A[3,]-2*A[1,]
> A[3,] <- A[3,]+(2/3)*A[2,]
> fractions(A)
[,1] [,2] [,3] [,4]
[1,] 1 1 2 4
[2,] 0 3 1 5
[3,] 0 0 -1/3 -20/3
> A[3,] <- -3*A[3,]
> A[2,] <- A[2,]-A[3,];A[1,] <- A[1,]-2*A[3,]
> A[2,] <- A[2,]/3
> A[1,] <- A[1,]-A[2,]
> A
[,1] [,2] [,3] [,4]
[1,] 1 0 0 -31
[2,] 0 1 0 -5
[3,]    0    0    1   20
The unique solution is x = −31, y = −5, z = 20.

2. Solve the following linear system
   x1 + 3x2 - 2x3      + 2x5        = 0
   2x1 + 6x2 - 5x3 - 2x4 + 4x5 - 3x6 = -1
             5x3 + 10x4       + 15x6 = 5
   2x1 + 6x2       + 8x4 + 4x5 + 18x6 = 6

> A <- matrix(c(1,3,-2,0,2,0,0,2,6,-5,-2,4,-3,-1,0,0,5,10,0,15,5,2,6,0,8,4,18,6),4,7,T)


> A <- fractions(A)
> View(A)
> A[2,] <- A[2,]-2*A[1,]
> A[4,] <- A[4,]-2*A[1,]
> A[3,] <- A[3,]+5*A[2,]
> A[4,] <- A[4,]+4*A[2,]
> A[3:4,] <- A[c(4,3),]
> A[2,] <- -A[2,]; A[3,] <- A[3,]/6
> A[2,] <- A[2,] -3*A[3,]
> A[1,] <- A[1,]+2*A[2,]
> A
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 3 0 4 2 0 0
[2,] 0 0 1 2 0 0 0
[3,] 0 0 0 0 0 1 1/3
[4,]    0    0    0    0    0    0    0
Letting x2 = r, x4 = s, x5 = t, the general solution is

x1 = −3r − 4s − 2t, x2 = r, x3 = −2s, x4 = s, x5 = t, x6 = 1/3.

Given a linear system,

a11 x1 + a12 x2 + · · · + a1n xn = b1


a21 x1 + a22 x2 + · · · + a2n xn = b2
..
.
am1 x1 + am2 x2 + · · · + amn xn = bm

we will let A = (aij ) be the coefficient matrix, and b = (bi ) be the constant vector. Let’s
input this into R.
> A <- matrix(c(a11 , a12 , ..., a1n , a21 , a22 , ...a2n , ..., am1 , am2 , ..., amn ),m,n,T)
> b=matrix(c(b1 , b2 , ..., bm )).

To display the linear system, type


> showEqn(A,b).
If the system has 2 variables, we may plot the equations using plotEqn(A,b). If the
system has 3 variables, we will use plotEqn3d(A,b) instead.

We will use the function Solve in R to solve a linear system,


> Solve(A,b,fractions = TRUE)

From the output, we will read off a general solution.


   
Example. 1. Let $A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}$ and $b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
> A <- matrix(seq(6),2,3,T); b <- matrix(c(1,1))
> showEqn(A,b)
1*x1 + 2*x2 + 3*x3 = 1
4*x1 + 5*x2 + 6*x3 = 1

To plot the equations, type


> plotEqn3d(A,b)

Observe from the plot that the intersection is a line. Indeed, as we will see later
that there will be a parameter in the solution, and thus the solution is a line.

Solving the linear system, we have


> Solve(A,b,fractions = TRUE)
x1 - 1*x3 = -1
x2 + 2*x3 = 1
By letting x3 be a parameter t ∈ R, we obtain a general solution
x1 = t − 1, x2 = 1 − 2t, x3 = t, t ∈ R.
   
2. Let $A = \begin{pmatrix} 1 & 1 \\ 2 & 4 \\ 1 & -1 \end{pmatrix}$ and $b = \begin{pmatrix} 2 \\ 5 \\ 1 \end{pmatrix}$.
> A <- matrix(c(1,1,2,4,1,-1),3,2,T); b <- matrix(c(2,5,1))
> showEqn(A,b)
1*x1 + 1*x2 = 2
2*x1 + 4*x2 = 5
1*x1 - 1*x2 = 1

> plotEqn(A,b)

From the plot we can see that all 3 lines intersect at a point, and hence, the system
has a unique solution. Indeed,
> Solve(A, b, fractions = TRUE)
x1 = 3/2
x2 = 1/2
0 = 0
tells us that x1 = 3/2, x2 = 1/2 is the unique solution.
   
3. Let $A = \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ 2 & 1 \end{pmatrix}$ and $b = \begin{pmatrix} 2 \\ 0 \\ 1 \end{pmatrix}$,
> A <- matrix(c(1,1,1,-1,2,1),3,2,T); b=matrix(c(2,0,1))
> plotEqn(A,b)
x[1] + x[2] = 2
x[1] - 1*x[2] = 0
2*x[1] + x[2] = 1

From the plot we can see that the 3 lines do not intersect at any common point,
hence, the system has no solutions. Indeed,
> Solve(A, b, fractions = TRUE)
x1 = 1/3
x2 = 1/3
0 = 4/3
   
4. Let $A = \begin{pmatrix} 1 & 3 & -2 & 0 & 2 & 0 \\ 2 & 6 & -5 & -2 & 4 & -3 \\ 0 & 0 & 5 & 10 & 0 & 15 \\ 2 & 6 & 0 & 8 & 4 & 18 \end{pmatrix}$ and $b = \begin{pmatrix} 0 \\ -1 \\ 5 \\ 6 \end{pmatrix}$,
> A <- matrix(c(1,3,-2,0,2,0,2,6,-5,-2,4,-3,0,0,5,10,0,15,2,6,0,8,4,18),4,6,T);
b <- matrix(c(0,-1,5,6))
> showEqn(A,b)
1*x1 + 3*x2 - 2*x3 + 0*x4 + 2*x5 + 0*x6 = 0
2*x1 + 6*x2 - 5*x3 - 2*x4 + 4*x5 - 3*x6 = -1
0*x1 + 0*x2 + 5*x3 + 10*x4 + 0*x5 + 15*x6 = 5
2*x1 + 6*x2 + 0*x3 + 8*x4 + 4*x5 + 18*x6 = 6
In this case since there are more than 3 variables in the system, we cannot plot the
equations. We will proceed to find a general solution.
> Solve(A, b, fractions = TRUE)
x1 + 3*x2 + 4*x4 + 2*x5 = 0
x3 + 2*x4 = 0
x6 = 1/3
0 = 0

A general solution is
x1 = −3r − 4s − 2t, x2 = r, x3 = −2s, x4 = s, x5 = t, x6 = 1/3, r, s, t ∈ R.

2.4 Submatrices and Block Multiplication


2.4.1 Block Multiplication
Let A be an m × n matrix,
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}.$$
The rows of A are the 1 × n submatrices of A,
$$r_1 = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \end{pmatrix},\quad r_2 = \begin{pmatrix} a_{21} & a_{22} & \cdots & a_{2n} \end{pmatrix},\quad \ldots,\quad r_m = \begin{pmatrix} a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix},$$
and the columns of A are the m × 1 submatrices of A,
$$c_1 = \begin{pmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \end{pmatrix},\quad c_2 = \begin{pmatrix} a_{12} \\ a_{22} \\ \vdots \\ a_{m2} \end{pmatrix},\quad \ldots,\quad c_n = \begin{pmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{mn} \end{pmatrix}.$$
In general, a p × q submatrix of an m × n matrix A, p ≤ m, q ≤ n, is formed by
taking a p × q block of the entries of the matrix A.

Example. $\begin{pmatrix} 4 & 6 & 1 \\ 2 & 2 & 1 \end{pmatrix}$ is a 2 × 3 submatrix of $A = \begin{pmatrix} 1 & 3 & 5 & 7 & 1 \\ 2 & 4 & 6 & 1 & 2 \\ 1 & 2 & 2 & 1 & 3 \end{pmatrix}$, taken from rows 2
and 3, and columns 2 to 4 of A.
Observe that matrix multiplication respects submatrices, in the sense that if we take
k ×p submatrices of A and multiply to p×l submatrices of B, we obtain k ×l submatrices
of AB. We call this block multiplication.
 
Example. Let $A = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 2 & 2 \\ 3 & 2 & 1 \\ 4 & 4 & 2 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 3 & 5 & 7 & 1 \\ 2 & 4 & 6 & 1 & 2 \\ 1 & 2 & 2 & 1 & 3 \end{pmatrix}$. Then multiplying the
submatrix $\begin{pmatrix} 1 & 0 & 1 \\ 1 & 2 & 2 \end{pmatrix}$ of A, consisting of the first 2 rows of A, to the submatrix
$\begin{pmatrix} 1 & 3 \\ 2 & 4 \\ 1 & 2 \end{pmatrix}$ of B, consisting of the first 2 columns of B, we get $\begin{pmatrix} 2 & 5 \\ 7 & 15 \end{pmatrix}$, which is the 2 × 2 submatrix
of AB consisting of the first 2 rows and first 2 columns of AB.
$$\begin{pmatrix} 1 & 0 & 1 \\ 1 & 2 & 2 \\ 3 & 2 & 1 \\ 4 & 4 & 2 \end{pmatrix}\begin{pmatrix} 1 & 3 & 5 & 7 & 1 \\ 2 & 4 & 6 & 1 & 2 \\ 1 & 2 & 2 & 1 & 3 \end{pmatrix} = \begin{pmatrix} 2 & 5 & 7 & 8 & 4 \\ 7 & 15 & 21 & 11 & 11 \\ 8 & 19 & 29 & 24 & 10 \\ 14 & 32 & 48 & 34 & 18 \end{pmatrix}.$$
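Here is a small sketch verifying this block multiplication in R: the product of the first two rows of A with the first two columns of B agrees with the corresponding 2 × 2 submatrix of AB. (The submatrix indexing syntax A[1:2,] is explained in Section 2.4.2.)

A <- matrix(c(1,0,1, 1,2,2, 3,2,1, 4,4,2), 4, 3, byrow = TRUE)
B <- matrix(c(1,3,5,7,1, 2,4,6,1,2, 1,2,2,1,3), 3, 5, byrow = TRUE)
A[1:2, ] %*% B[, 1:2]     # block product: rows (2, 5) and (7, 15)
(A %*% B)[1:2, 1:2]       # same 2 x 2 submatrix of AB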
In particular, we have the following cases.

1. If $B = \begin{pmatrix} b_1 & b_2 & \cdots & b_n \end{pmatrix}$, where $b_i$ is the i-th column of B, for i = 1, ..., n, then
$$AB = \begin{pmatrix} Ab_1 & Ab_2 & \cdots & Ab_n \end{pmatrix},$$
that is, the columns of AB are just A pre-multiplied to the columns of B.

2. If $A = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix}$, where $a_j$ is the j-th row of A, for j = 1, ..., m, then
$$AB = \begin{pmatrix} a_1 B \\ a_2 B \\ \vdots \\ a_m B \end{pmatrix},$$
that is, the rows of AB are just B post-multiplied to the rows of A.

3. If $A = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix}$ and $B = \begin{pmatrix} b_1 & b_2 & \cdots & b_n \end{pmatrix}$, then
$$AB = \begin{pmatrix} a_1 b_1 & a_1 b_2 & \cdots & a_1 b_n \\ a_2 b_1 & a_2 b_2 & \cdots & a_2 b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_m b_1 & a_m b_2 & \cdots & a_m b_n \end{pmatrix}.$$

Indeed, the (i, j)-entry of AB is
$$a_i b_j = \begin{pmatrix} a_{i1} & a_{i2} & \cdots & a_{ip} \end{pmatrix}\begin{pmatrix} b_{1j} \\ b_{2j} \\ \vdots \\ b_{pj} \end{pmatrix} = \sum_{k=1}^{p} a_{ik} b_{kj}.$$

Exercise: If $A_i$ is an $m_i \times p$ matrix, for i = 1, 2, and $B_i$ is a $p \times n_i$ matrix, for i = 1, 2,
show that the following block multiplication holds,
$$\begin{pmatrix} A_1 \\ A_2 \end{pmatrix}\begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A_1 B_1 & A_1 B_2 \\ A_2 B_1 & A_2 B_2 \end{pmatrix},$$
where $\begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$ is a $(m_1 + m_2) \times p$ matrix whose first $m_1$ rows are from $A_1$ and whose rows
$m_1 + 1$ to $m_1 + m_2$ are from $A_2$, and $\begin{pmatrix} B_1 & B_2 \end{pmatrix}$ is a $p \times (n_1 + n_2)$ matrix whose
first $n_1$ columns are from $B_1$ and whose columns $n_1 + 1$ to $n_1 + n_2$ are from $B_2$.

Suppose now we are given A = (a_{ij})_{m×n} and B = (b_{ij})_{m×p}, and we want to find an
X = (x_{ij})_{n×p} such that AX = B,
$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mp} \end{pmatrix}.$$

By block multiplication, for k = 1, ..., p,
$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}\begin{pmatrix} x_{1k} \\ x_{2k} \\ \vdots \\ x_{nk} \end{pmatrix} = \begin{pmatrix} b_{1k} \\ b_{2k} \\ \vdots \\ b_{mk} \end{pmatrix},$$

which can be viewed as p linear systems,
$$\begin{aligned} a_{11}x_{1k} + a_{12}x_{2k} + \cdots + a_{1n}x_{nk} &= b_{1k}\\ a_{21}x_{1k} + a_{22}x_{2k} + \cdots + a_{2n}x_{nk} &= b_{2k}\\ &\;\;\vdots\\ a_{m1}x_{1k} + a_{m2}x_{2k} + \cdots + a_{mn}x_{nk} &= b_{mk}. \end{aligned}$$

Then we may proceed to solve each of the p linear systems to obtain X.

Example. 1. Solve
$$\begin{pmatrix} 1 & 2 & -3 \\ 2 & 6 & -11 \\ 1 & -2 & 7 \end{pmatrix}\begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \\ z_1 & z_2 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 3 & 2 \\ -1 & 1 \end{pmatrix}.$$
This is equivalent to solving the 2 linear systems
$$\begin{cases} x_1 + 2y_1 - 3z_1 = 1\\ 2x_1 + 6y_1 - 11z_1 = 3\\ x_1 - 2y_1 + 7z_1 = -1 \end{cases}\qquad\text{and}\qquad\begin{cases} x_2 + 2y_2 - 3z_2 = 1\\ 2x_2 + 6y_2 - 11z_2 = 2\\ x_2 - 2y_2 + 7z_2 = 1. \end{cases}$$

> A <- matrix(c(1,2,-3,2,6,-11,1,-2,7),3,3,T); b1 <- matrix(c(1,3,-1));


b2 <- matrix(c(1,2,1))
> Solve(A,b1)
x1 + 2*x3 = 0
x2 - 2.5*x3 = 0.5
0 = 0

> Solve(A,b2)
x1 + 2*x3 = 1
x2 - 2.5*x3 = 0
0 = 0
A general solution for X is
$$X = \begin{pmatrix} -2s & 1 - 2t \\ 0.5 + 2.5s & 2.5t \\ s & t \end{pmatrix}, \qquad s, t \in \mathbb{R}.$$

2. Solve the matrix equation
$$\begin{pmatrix} 3 & 2 & -1 \\ 5 & -1 & 3 \\ 2 & 1 & -1 \end{pmatrix}\begin{pmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \\ z_1 & z_2 & z_3 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 1 \\ 2 & 1 & 1 \\ 3 & 1 & 0 \end{pmatrix}$$

This is equivalent to solving the following 3 linear systems
$$\begin{cases} 3x + 2y - z = a\\ 5x - y + 3z = b\\ 2x + y - z = c \end{cases}, \qquad \text{for } \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}.$$

> A <- matrix(c(3,2,-1,5,-1,3,2,1,-1),3,3,T);


B <- matrix(c(1,2,1,2,1,1,3,1,0),3,3,T)
> Solve(A,B[,1],fractions = TRUE)
x1 = 5/3
x2 = -11/3
x3 = -10/3
> Solve(A,B[,2],fractions = TRUE)
x1 = 2/9
x2 = 7/9
x3 = 2/9
> Solve(A,B[,3],fractions = TRUE)
x1 = -1/9
x2 = 10/9
x3 = 8/9
(See the next section for the command in R to obtain the columns of B.)
So the solution is
$$\begin{pmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \\ z_1 & z_2 & z_3 \end{pmatrix} = \begin{pmatrix} 5/3 & 2/9 & -1/9 \\ -11/3 & 7/9 & 10/9 \\ -10/3 & 2/9 & 8/9 \end{pmatrix}.$$

2.4.2 Entries and Submatrices of Vectors and Matrices


To access the i-th entry in a vector v, type v[i].
> v<-c(1,2,3,4,5); v[3]
[1] 3

To access the (i, j)-th entry of a matrix A, type A[i,j].


> A<-matrix(seq(10),2,5, T); A
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
> A[2,3]
[1] 8

To access the i-th row of a matrix A, type A[i,].


> A[2,]
[1] 6 7 8 9 10

To access the j-th column of a matrix A, type A[,j].


> A[,3]
[1] 3 8

To access a subset of rows of a matrix A, type A[i1 :i2 ,]


> A<-matrix(seq(9),3,3,T); A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> A[2:3,]
[,1] [,2] [,3]
[1,] 4 5 6
[2,] 7 8 9

To access a subset of columns of a matrix A, type A[,j1 :j2 ]


> A[,1:2]
[,1] [,2]
[1,] 1 2
[2,] 4 5
[3,] 7 8

More generally, > A<-matrix(seq(25),5,5,T); A


[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
[5,] 21 22 23 24 25
> A[c(1,3),c(2,4,5)] #Row 1 and 3, column 2, 4, and 5
[,1] [,2] [,3]
[1,] 2 4 5
[2,] 12 14 15

2.5 Inverse of a Matrix


2.5.1 Definition and Properties
Recall that for a nonzero real number c ∈ R, the multiplicative inverse (or just inverse)
of c, denoted as 1/c, is defined to be the number such that when multiplied to c gives 1,
(1/c) × c = 1 (this is why 0 has no inverse, for a × 0 = 0 for any real number a ∈ R, and so
there can be no a ∈ R such that a × 0 = 1).

Hence, in order to define, if possible, the inverse of a matrix, we first need to identify
the matrix that serves the same role as 1 in real numbers. We indeed have such an object.
Recall that the identity matrix has the multiplicative identity property, for any m × n
matrix A,
Im A = A = AIn .
A square matrix A of order n is invertible if there exists a square matrix B of order
n such that
AB = In = BA.
In this case, we say that B is an inverse of A. A square matrix is singular if it is not
invertible.

Remark. 1. Only square matrices can be invertible. All non-square matrices are not
invertible by default.

2. For a square matrix A to be invertible, we need to check that there exists a B that is
simultaneously a left and right inverse of A, that is, we need to show both BA = In
and AB = In . However, it turns out that BA = In if and only if AB = In for a square
matrix B. That is, as long as there is a square matrix B of the same order that either
pre or post multiplied to A to give the identity matrix, then A is invertible.

3. Moreover, the matrix B is unique, that is, if there is a C such that either AC = In or
CA = In , then necessarily B = C.

Since inverse is unique, we can denote the inverse of an invertible matrix A by A−1 .
That is, if A is invertible, there exists a unique matrix A−1 such that

AA−1 = In = A−1 A.

Example. 1. The identity matrix I is invertible with I−1 = I.


2. $\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} -1 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} -1 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$. So $\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} -1 & 1 \\ 1 & 0 \end{pmatrix}$.

3. $\begin{pmatrix} 2 & -5 \\ -1 & 3 \end{pmatrix}\begin{pmatrix} 3 & 5 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 5 \\ 1 & 2 \end{pmatrix}\begin{pmatrix} 2 & -5 \\ -1 & 3 \end{pmatrix}$. So $\begin{pmatrix} 2 & -5 \\ -1 & 3 \end{pmatrix}^{-1} = \begin{pmatrix} 3 & 5 \\ 1 & 2 \end{pmatrix}$.

Theorem (Cancellation law for matrices). Let A be an invertible matrix of order n.

(i) If B and C are n × m matrices with AB = AC, then B = C.

(ii) If B and C are m × n matrices with BA = CA, then B = C.

Proof. (i) Pre-multiply A−1 to both sides of AB = AC, we get B = IB = A−1 AB =


A−1 AC = IC = C.

(ii) Post-multiply A−1 to both sides of BA = CA, we get B = BI = BAA−1 =


CAA−1 = CI = C.
Theorem (Properties of inverses). Let A be an invertible matrix of order n.
(i) (A−1 )−1 = A.
(ii) For any nonzero real number a ∈ R, (aA) is invertible with inverse (aA)−1 = a1 A−1 .
(iii) AT is invertible with inverse (AT )−1 = (A−1 )T .
(iv) If B is an invertible matrix of order n, then (AB) is invertible with inverse (AB)−1 =
B−1 A−1
Proof. (i) Since AA−1 = In = A−1 A, then A is the inverse of A−1 .
(ii) We directly check that aA( a1 A−1 ) = aa In = In = ( a1 A−1 )(aA).
(iii) Since I is symmetric, I = IT = (AA−1 )T = (A−1 )T AT , and similarly, I = AT (A−1 )T .
Hence, (AT )−1 = (A−1 )T .
(iv) We directly check that (B−1 A−1 )(AB) = B−1 (A−1 A)B = B−1 B = In and (AB)(B−1 A−1 ) =
A(BB−1 )A−1 = AA−1 = In

Remark. 1. By induction, one can prove that the product of invertible matrices is in-
vertible, and (A1 A2 · · · Ak )−1 = A−1 −1 −1
k · · · A 2 A1 if Ai is an invertible matrix for
i = 1, ..., k.
2. We define the negative power of an invertible matrix to be
A−n = (A−1 )n
for any positive integer n.
Exercise: Show that AB is invertible if and only if both A and B are invertible.
Hint: If AB is invertible, let C be the inverse. Pre and post multiply AB with C.
Theorem. An order 2 square matrix
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
is invertible if and only if ad − bc ≠ 0, in which case the inverse is given by the formula
$$A^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$
The formula is obtained by using the adjoint of A, which is beyond the scope of this
module. However, readers may verify that indeed we have AA−1 = I2 = A−1 A.
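A minimal sketch checking this formula in R on the matrix from Example 3 above; solve(A), which returns the inverse of A (see Section 2.5.3), should agree with the formula.

A <- matrix(c(2, -5, -1, 3), 2, 2, byrow = TRUE)
a <- A[1, 1]; b <- A[1, 2]; cc <- A[2, 1]; d <- A[2, 2]
matrix(c(d, -b, -cc, a), 2, 2, byrow = TRUE) / (a*d - b*cc)   # rows (3, 5) and (1, 2)
solve(A)                                                      # same matrix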

Orthogonal matrices. A square matrix A is an orthogonal matrix if the transpose


of the matrix is its inverse, AT = A−1
AT A = I = AAT .
The crossprod function in R can be used to check if a matrix is orthogonal; A
is orthogonal if and only if crossprod(A,A) is the identity matrix. In the following
examples, it is a good exercise for the readers to check that crossprod(A,A) is the
identity matrix.
Example. 1. For any real number θ ∈ R,
 
cos(θ) − sin(θ)
A=
sin(θ) cos(θ)
is an orthogonal matrix, since
  
T cos(θ) sin(θ) cos(θ) − sin(θ)
A A =
− sin(θ) cos(θ) sin(θ) cos(θ)
2
 
cos2 (θ) + sin (θ) − cos(θ) sin(θ) + sin(θ) cos(θ)
=
− sin(θ) cos(θ) + cos(θ) sin(θ) sin2 (θ) + cos2 (θ)
 
1 0
=
0 1
 √ √ √  √ √ √   
1/√3 1/ √2 1/√6 1/√3 1/ √3 1/ 3 1 0 0
2. 1/√3 −1/ 2 1/ √6
   1/√2 −1/√ 2 0√  = 0 1 0.
1/ 3 0 −2/ 6 1/ 6 1/ 6 −2/ 6 0 0 1
 √ √  √ √   
1/ 2 0 1/ 2 1/ 2 0 −1/ 2 1 0 0
3.  0√ 1 0√   0√ 1 0√  = 0 1 0.
−1/ 2 0 1/ 2 1/ 2 0 1/ 2 0 0 1
 √ √  √ √   
−1/ 2 0 1/ 2 −1/ 2 0 1/ 2 1 0 0
4.  0√ 1 0√   0√ 1 0√  = 0 1 0.
1/ 2 0 1/ 2 1/ 2 0 1/ 2 0 0 1
Observe that in this case, the inverse and the transpose are all equal to A. Such a
matrix is called an involutory matrix.
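Here is a brief sketch of the crossprod check for the rotation matrix in Example 1, using an arbitrary angle θ = π/6 (any value works); the result should be the 2 × 2 identity matrix up to rounding error.

theta <- pi/6
A <- matrix(c(cos(theta), -sin(theta),
              sin(theta),  cos(theta)), 2, 2, byrow = TRUE)
crossprod(A, A)    # t(A) %*% A: the identity matrix, up to ~1e-17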

2.5.2 Algorithm to Finding the Inverse


Let A be a square matrix of order n.

Step 1: Form a new n × 2n matrix $(A \mid I_n)$.

Step 2: Reduce the matrix $(A \mid I_n) \longrightarrow (R \mid B)$ to its reduced row-echelon form.

Step 3: If R ≠ I, then A is not invertible. If R = I, then A is invertible with inverse A^{-1} = B.
 
2 7 1
Example. Find the inverse of 1 4 −1.
1 3 0
   
2 7 1 1 0 0 1 3 0 0 0 1
R1 ↔R3 R3 −2R1
 1 4 −1 0 1 0  − −−−→  1 4 −1 0 1 0  −− −−→
R2 −R1
1 3 0 0 0 1 2 7 1 1 0 0
   
1 3 0 0 0 1 1 1 3 0 0 0 1
R2 +R3 2 R2 R −R2
 0 1 −1 0 1 −1  − −−−→−−→  0 1 0 1/2 1/2 −3/2  −−3−−→
R1 −3R2
0 1 1 1 0 −2 0 1 1 1 0 −2
 
1 0 0 −3/2 −3/2 11/2
 0 1 0 1/2 1/2 −3/2  .
0 0 1 1/2 −1/2 −1/2
So  −1  
2 7 1 −3 −3 11
1 4 −1 = 1  1 1 −3 .
2
1 3 0 1 −1 −1
The algorithm works because finding the inverse of a square matrix A of order n is
equivalent to solving the following matrix equation

AX = I,

for an unknown matrix X of order n.


Example. Solve the following matrix equation
    
3 5 a b 1 0
= .
1 2 c d 0 1

Recall that this is solving two linear systems simultaneously


         
3 5 a 1 3 5 b 0
= , =
1 2 c 0 1 2 d 1
   
3 5 1 0 R1 −3R2 R2 +2R1 0 −1 1 −3
−−−−→−−−−→
1 2 0 1 1 0 2 −5
 
−R1 R1 ↔R2 1 0 2 −5
−−→−−−−→
0 1 −1 3
Hence,  
−1 2 −5
A = .
−1 3

2.5.3 Finding Inverse in R


We will perform the above algorithm to compute the inverse in R.
Example. 1. Compute the inverse of
 
2 −5
A=
−1 3

> A <- matrix(c(2,-5,-1,3),2,2,T)


> B <- cbind(A,diag(2))
> B <- fractions(B)
> View(B)
> B[2,] <- B[2,]+B[1,]/2
> B[1,] <- B[1,]/2
> B[2,] <- 2*B[2,]
> B[1,] <- B[1,]+2.5*B[2,]
> B[,c(3,4)]
[,1] [,2]
[1,] 3 5
[2,] 1 2
So,  −1  
2 −5 3 5
= .
−1 3 1 2

2. Compute the inverse of  


1 0 1
A = 1 1 0 .
0 1 1

> A <- matrix(c(1,0,1,1,1,0,0,1,1),3,3,T)


> B <- cbind(A,diag(3))
> B <- fractions(B)
> View(B)
> B[2,] <- B[2,]-B[1,]
> B[3,] <- B[3,]+B[2,]
> B[3,] <- B[3,]-B[2,]
> B[3,] <- B[3,]-B[2,]
> B[3,] <- B[3,]/2
> B[2,] <- B[2,]+B[3,]
> B[1,] <- B[1,]-B[3,]
> B[,4:6]
[,1] [,2] [,3]
[1,] 1/2 1/2 -1/2
[2,] -1/2 1/2 1/2
[3,] 1/2 -1/2 1/2

So,  −1  
1 0 1 0.5 0.5 −0.5
1 1 0 = −0.5 0.5 0.5  .
0 1 1 0.5 −0.5 0.5
The inverse of a invertible matrix A can be obtained in R using the function solve.

Example. 1. > A <- matrix(c(1,0,1,1,1,0,0,1,1),3,3,T)


> solve(A)
[,1] [,2] [,3]
[1,] 0.5 0.5 -0.5
[2,] -0.5 0.5 0.5
[3,] 0.5 -0.5 0.5
That is,  −1  
1 0 1 0.5 0.5 −0.5
1 1 0 = −0.5 0.5 0.5  .
0 1 1 0.5 −0.5 0.5

2. > A<- matrix(c(2,-5,-1,3),2,2,T)


> solve(A)
[,1] [,2]
[1,] 3 5
[2,] 1 2
That is,  −1  
2 −5 3 5
= .
−1 3 1 2

2.5.4 Inverse and Linear System


Consider now a linear system

a11 x1 + a12 x2 + · · · + a1n xn = b1


a21 x1 + a22 x2 + · · · + a2n xn = b2
..
.
an1 x1 + an2 x2 + · · · + ann xn = bn

such that the coefficient matrix A is invertible. Then by pre-multiplying the inverse of
A to both sides of the corresponding matrix equation

Ax = b,

we have
x = A−1 b.
That is, A−1 b will be the unique solution to the system.
Example. Consider the linear system

x1 + x3 = 2
x1 + x2 = 4
x2 + x3 = 6

The corresponding matrix equation is


    
1 0 1 x1 2
1 1 0 x2 = 4 .
 
0 1 1 x3 6

Pre-multiplying the inverse of A to both sides, we get


   −1       
x1 1 0 1 2 0.5 0.5 −0.5 2 0
x2  = 1 1 0 4 = −0.5 0.5 0.5   4 = 4 .
 
x3 0 1 1 6 0.5 −0.5 0.5 6 2

Using R, we have
> A <- matrix(c(1,0,1,1,1,0,0,1,1),3,3,T); b <- matrix(c(2,4,6))
> solve(A)%*%b
[,1]
[1,] 0
[2,] 4
[3,] 2

We could instead just type solve(A,b)


> solve(A,b)
[,1]
[1,] 0
[2,] 4
[3,] 2

Indeed,
> Solve(A,b)
x1 = 0
x2 = 4
x3 = 2

2.6 Least Square Solutions


More often than not, in real-life applications, the data collected or the mathematical model used
might not be very accurate, and thus there might not be a solution to the system

a11 x1 + a12 x2 + · · · + a1n xn = b1


a21 x1 + a22 x2 + · · · + a2n xn = b2
..
.
am1 x1 + am2 x2 + · · · + amn xn = bm

However, we might want to find an approximate solution

x1 = c1 , x2 = c2 , ..., xn = cn

(vis-à-vis dismissing the model or data), such that if we write

a11 c1 + a12 c2 + · · · + a1n cn = d1


a21 c1 + a22 c2 + · · · + a2n cn = d2
..
.
am1 c1 + am2 c2 + · · · + amn cn = dm
     
d1 b1 c1
 d2   b2   c2 
then d =  ..  is as close to b =  ..  as possible. A vector u =  ..  is called a
     
 .   .  .
dm bm cn
least square solution to Ax = b if Au is the closest to b. We will phrase this formally.

Let A be a m × n matrix and b ∈ Rm . A vector u ∈ Rn is a least square solution to


Ax = b if for every vector v ∈ Rn ,

∥Au − b∥ ≤ ∥Av − b∥.

Example. 1. Consider the following linear system

x1 + x2 + x4 = 1
x 2 + x3 = 1
x1 + 2x2 + x3 + x4 = 1
It is inconsistent,
> A <- matrix(c(1,1,0,1,0,1,1,0,1,2,1,1),3,4,T); b <- matrix(c(1,1,1))
> Solve(A,b)
x1 - 1*x3 + x4 = 0
x2 + x3 = 1
0 = -1
 
0
2/3
A least square solution to the system is u = 
 0 . One can check that given any

0
 
a
b
v=  c , the distance between

d
 
  0  
1 1 0 1   1
2/3 2  
Au = 0 1 1 0   =
   1
0 3
1 2 1 1 2
0
 
1
and b = 1, which is
1

p 3
((1 − 2/3)2 + (1 − 2/3)2 + (1 − 4/3)2 = ,
3
is shorter than the distance between Av and b. For example,
 
1  
0 1
• if v =  , then the distance between Av = 0 and b is
  
0
1
0
p
(1 − 1)2 + (0 − 1)2 + (1 − 1)2 = 1.
 
0  
1 1
• if v =  , then the distance between Av = 1 and b is
  
0
2
0
p
(1 − 1)2 + (1 − 1)2 + (2 − 1)2 = 1.
 
0  
0 0
• if v =  , then the distance between Av = 1 and b is
  
1
1
0
p
(0 − 1)2 + (1 − 1)2 + (1 − 1)2 = 1.
 
1  
1 3
• if v = 
1
, then the distance between Av = 2 and b is
5
1
p √
(3 − 1)2 + (2 − 1)2 + (5 − 1)2 = 21.
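The distances above can be verified numerically. This is a small sketch (the helper dist_to_b is ours) computing ∥Av − b∥ for the least square solution u and for the other vectors v tried.

A <- matrix(c(1,1,0,1, 0,1,1,0, 1,2,1,1), 3, 4, byrow = TRUE)
b <- matrix(c(1, 1, 1))
dist_to_b <- function(v) sqrt(sum((A %*% v - b)^2))   # ||Av - b||
dist_to_b(c(0, 2/3, 0, 0))    # sqrt(3)/3 = 0.577..., at the least square solution u
dist_to_b(c(1, 0, 0, 0))      # 1
dist_to_b(c(0, 1, 0, 0))      # 1
dist_to_b(c(0, 0, 1, 0))      # 1
dist_to_b(c(1, 1, 1, 1))      # sqrt(21) = 4.58...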
Theorem. Let A be a m × n matrix and b ∈ Rm . A vector u ∈ Rn is a least square
solution to Ax = b if and only if u is a solution to AT Ax = AT b.
That is, to find a least square solution, we solve for the following matrix equation

AT Ax = AT b.

We can do this in R by using Solve(crossprod(A,A),crossprod(A,b)). Recall that


for matrices A and B, crossprod(A,B) is the product AT B.
Example. 1. Find a least square solution to the following linear system
x1 + x2 + x4 = 1
x2 + x3 = 1
x1 + 2x2 + x3 + x4 = 1
> A <- matrix(c(1,1,0,1,0,1,1,0,1,2,1,1),3,4,T);
b <- matrix(c(1,1,1))
> Solve(crossprod(A,A),crossprod(A,b),fractions=TRUE)
x1 - 1*x3 + x4 = 0
x2 + x3 = 2/3
0 = 0
0 = 0

Observe that the solution is not unique, a general solution is

x1 = s − t, x2 = 2/3 − s, x3 = s, x4 = t, s, t ∈ R

So, any choice


 ofs and t will produce a least square solution. We may let s = 0 = t.
0
2/3
Then u =  0  is a least square solution.

Readers may check that for any s and t,


     
  0 1 −1  
1 1 0 1   1
2/3 −1  0  2  

Au = 0 1 1 0   + s   + t   =
      1 .
0 1 0 3
1 2 1 1 2
0 0 1

2. Find a least square solution to the following linear system


x + y = 2
x − y = 0
2x + y = 1
> A <- matrix(c(1,1,1,-1,2,1),3,2,T);
b <- matrix(c(2,0,1))
> Solve(crossprod(A,A),crossprod(A,b),fractions=TRUE)
x1 = 3/7
x2 = 5/7
 
3/7
So, u = is a least square solution.
5/7
Remark. Observe that here the least square solution is unique. In cases when the
least square solution is unique, we may alternatively use qr.solve,
> fractions(qr.solve(A,b))
[,1]
[1,] 3/7
[2,] 5/7
However, the topic on when the least square solution is unique is beyond the scope
of this course. Hence, in general, we will use
Solve(crossprod(A,A),crossprod(A,b),fractions=TRUE) instead.

3. Find a least square solution to the following linear system

x1 + x3 = 2
x1 + x2 = 4
x2 + x3 = 6

> A <- matrix(c(1,0,1,1,1,0,0,1,1),3,3,T); b <- matrix(c(2,4,6))


> Solve(crossprod(A,A),crossprod(A,b), fractions = TRUE)
x1 = 0
x2 = 4
x3 = 2
 
0
So, a least square solution is u = 4.

2
Remark. Observe that this is exactly the unique solution to the system. In general,
if a system is consistent, then the set of all least square solutions is exactly the set
of all solutions.

2.7 Determinant
2.7.1 Definition: Cofactor Expansion
We will define the determinant of A of order n by induction.

1. For n = 1, A = (a), det(A) = a.

2. For n = 2, $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, det(A) = ad − bc.
Suppose we have defined the determinant of all square matrices of order ≤ n − 1. Let A
be a square matrix of order n.

• Define Mij , called the (i, j) matrix minor of A, to be the matrix obtained from A
by deleting the i-th row and j-th column.

Example. Let $A = \begin{pmatrix} 1 & 2 & -1 \\ -1 & 1 & 3 \\ 3 & 2 & 1 \end{pmatrix}$. Then
(i) $M_{23} = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}$,
(ii) $M_{12} = \begin{pmatrix} -1 & 3 \\ 3 & 1 \end{pmatrix}$, and
(iii) $M_{31} = \begin{pmatrix} 2 & -1 \\ 1 & 3 \end{pmatrix}$.

• For a square matrix of order n, the (i, j)-cofactor of A, denoted as Aij , is the (real)
number given by
$$A_{ij} = (-1)^{i+j}\det(M_{ij}).$$
This definition is well-defined since Mij is a square matrix of order n − 1, and by
induction hypothesis, the determinant is well defined. Take note of the sign of the
(i, j)-entry, (−1)^{i+j}. Here’s a visualization of the sign of the entries of the matrix
$$\begin{pmatrix} + & - & + & \cdots \\ - & + & - & \cdots \\ + & - & + & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}.$$
 
Example. Let $A = \begin{pmatrix} 1 & 2 & -1 \\ -1 & 1 & 3 \\ 3 & 2 & 1 \end{pmatrix}$. Then

(i) A23 = (−1)^5 (2 − 6) = 4,

(ii) A12 = (−1)^3 (−1 − 9) = 10, and

(iii) A31 = (−1)^4 (6 + 1) = 7.

• The determinant of A is defined to be
$$\det(A) = a_{i1}A_{i1} + a_{i2}A_{i2} + \cdots + a_{in}A_{in} = \sum_{k=1}^{n} a_{ik}A_{ik} \qquad (2.1)$$
$$\phantom{\det(A)} = a_{1j}A_{1j} + a_{2j}A_{2j} + \cdots + a_{nj}A_{nj} = \sum_{k=1}^{n} a_{kj}A_{kj} \qquad (2.2)$$
This is called the cofactor expansion along row i (2.1) or along column j (2.2).
The determinant of A is also denoted as det(A) = |A|.

Remark. The equality between cofactor expansion along any row or any column is a
theorem. The proof requires knowledge of the symmetric groups, which is beyond the
scope of this module.

So for order 3 matrices, the determinant is defined to be
$$\begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = a\begin{vmatrix} e & f \\ h & i \end{vmatrix} - b\begin{vmatrix} d & f \\ g & i \end{vmatrix} + c\begin{vmatrix} d & e \\ g & h \end{vmatrix} = aei - afh - bdi + bfg + cdh - ceg.$$

There is an easy way to compute the determinant of an order 3 matrix: write the first two
columns again to the right of A; the determinant is the sum of the products along the three
diagonals going down to the right minus the sum of the products along the three diagonals
going up to the right,
$$\det(A) = \begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix}\begin{matrix} a & b \\ d & e \\ g & h \end{matrix}$$
 
Example. Compute the determinant of $\begin{pmatrix} 1 & 5 & 1 & 2 \\ 0 & 2 & 6 & 3 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 1 & 1 \end{pmatrix}$.

Cofactor expansion along the first column:
$$\begin{vmatrix} 1 & 5 & 1 & 2 \\ 0 & 2 & 6 & 3 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 1 & 1 \end{vmatrix} = 1\begin{vmatrix} 2 & 6 & 3 \\ 0 & 1 & 2 \\ 0 & 1 & 1 \end{vmatrix} - 0\begin{vmatrix} 5 & 1 & 2 \\ 0 & 1 & 2 \\ 0 & 1 & 1 \end{vmatrix} + 0\begin{vmatrix} 5 & 1 & 2 \\ 2 & 6 & 3 \\ 0 & 1 & 1 \end{vmatrix} - 0\begin{vmatrix} 5 & 1 & 2 \\ 2 & 6 & 3 \\ 0 & 1 & 2 \end{vmatrix} = 2(1 - 2) = -2.$$
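The cofactor expansion can also be carried out programmatically. The following is a sketch (the function cofactor_det is ours, not part of any package) that expands recursively along the first row; for small matrices it agrees with the built-in det function used in Section 2.7.2.

cofactor_det <- function(A) {
  n <- nrow(A)
  if (n == 1) return(A[1, 1])
  total <- 0
  for (j in 1:n) {
    M1j <- A[-1, -j, drop = FALSE]                      # (1, j) matrix minor
    total <- total + (-1)^(1 + j) * A[1, j] * cofactor_det(M1j)
  }
  total
}
A <- matrix(c(1,5,1,2, 0,2,6,3, 0,0,1,2, 0,0,1,1), 4, 4, byrow = TRUE)
cofactor_det(A)    # -2, as computed above
det(A)             # -2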

Corollary. The determinant of A and AT are equal,

det(A) = det(AT ).

This statement is proved by induction on the size n of the matrix A, and using fact
that cofactor expansion along first row of A is equal to cofactor expansion along the first
column of AT .

Corollary. If a square matrix A has a zero row or column, then det(A) = 0.

Proof. Cofactor expand along the zero row or column.

Corollary. The determinant of a triangular matrix is the product of the diagonal
entries. If A = (aij )n is a triangular matrix, then
$$\det(A) = a_{11}a_{22}\cdots a_{nn} = \prod_{i=1}^{n} a_{ii}.$$

det(A) = a11 a22 = · · · ann = Πnk=1 aii .

Sketch of proof:
Upper triangular matrix , cofactor expand along first column,
 
a11 a12 · · · a1n
 0 a22 · · · a2n 
..  .
 
 .. .. . .
 . . . . 
0 0 · · · ann

Lower triangular matrix, cofactor expand along the first row,


 
a11 0 · · · 0
 a21 a22 · · · 0 
..  .
 
 .. .. . .
 . . . . 
an1 an2 · · · ann
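A quick sketch checking the corollary numerically (det is the built-in determinant function, discussed in the next subsection): for a triangular matrix, det(A) equals prod(diag(A)).

A <- matrix(c(3, 5, 1,
              0, 2, 7,
              0, 0, 4), 3, 3, byrow = TRUE)   # upper triangular
det(A)           # 24
prod(diag(A))    # 3 * 2 * 4 = 24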

2.7.2 Computing Determinant in R


The determinant of a square matrix A can be computed in R using the function det(A)
 
1 2 −1
Example. 1. A = −1 1 3 
3 2 1
> A <- matrix(c(1,2,-1,-1,1,3,3,2,1),3,3,T); det(A)
[1] 20
 
1 5 1 2
0 2 6 3
2. A = 0 0 1 2

0 0 1 1
> A <- matrix(c(1,5,1,2,0,2,6,3,0,0,1,2,0,0,1,1),4,4,T); det(A)
[1] -2

Exercise: Show that
$$\begin{vmatrix} x - 1 & 0 & 0 \\ 0 & x & -2 \\ 0 & -3 & x - 1 \end{vmatrix} = (x - 1)(x + 2)(x - 3).$$

2.7.3 Properties of Determinant


Theorem (Determinant of product of matrices). Let A and B be square matrices of the
same size. Then
det(AB) = det(A) det(B).

By induction, we get

det(A1 A2 · · · Ak ) = det(A1 ) det(A2 ) · · · det(Ak ).

Corollary (Determinant of inverse). If A is invertible, then

det(A−1 ) = det(A)−1 .
Proof. Since the identity matrix I is a triangular matrix, det(I) = 1. Then

1 = det(I) = det(AA−1 ) = det(A) det(A−1 ).

So det(A)−1 = det(A−1 ).
Corollary (Equivalence of invertibility and determinant). A square matrix A is invertible
if and only if det(A) ̸= 0.
Corollary (Determinant of scalar multiplication). For any square matrix A of order n
and scalar c ∈ R,
det(cA) = cn det(A).
   
Example. Let $A = \begin{pmatrix} 1 & 5 & 1 & 2 \\ 0 & 2 & 6 & 3 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 1 & 1 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & -1 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 3 \end{pmatrix}$. Given that det(A) = −2,
and noting that B is triangular with det(B) = 3,

1. det(3A) = 3^4 (−2) = −162.

2. det(3AB^{−1}) = 3^4 (−2)(3)^{−1} = −54.

3. det((3B)^{−1}) = (3^4 × 3)^{−1} = 3^{−5}.
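These values can be confirmed in R; a short sketch:

A <- matrix(c(1,5,1,2, 0,2,6,3, 0,0,1,2, 0,0,1,1), 4, 4, byrow = TRUE)
B <- matrix(c(1,0,1,1, 0,1,1,-1, 0,0,1,2, 0,0,0,3), 4, 4, byrow = TRUE)
det(3 * A)                  # 3^4 * det(A) = -162
det(3 * A %*% solve(B))     # 3^4 * det(A) / det(B) = -54
det(solve(3 * B))           # 1/243 = 3^(-5), approximately 0.00412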

2.7.4 Determinants of Partitioned Matrices


 
Consider the partitioned matrix $\begin{pmatrix} A & B \\ C & D \end{pmatrix}$, where the sizes of the submatrices A, B, C, D
match suitably.

1. $\begin{vmatrix} 0 & I_m \\ I_n & 0 \end{vmatrix} = (-1)^{mn}$.

2. If B and C are both square matrices, of order n and m respectively, then
$$\begin{vmatrix} 0 & B \\ C & 0 \end{vmatrix} = (-1)^{mn}\,|B|\,|C|.$$
This follows from $\begin{pmatrix} 0 & B \\ C & 0 \end{pmatrix} = \begin{pmatrix} B & 0 \\ 0 & C \end{pmatrix}\begin{pmatrix} 0 & I_n \\ I_m & 0 \end{pmatrix}$.
It can be shown that if B and C are not square matrices, then the matrix must be
singular and so has zero determinant.

3. $\begin{vmatrix} I_m & B \\ 0 & I_n \end{vmatrix} = 1$ since it is an upper triangular matrix.

4. If A and D are square matrices, then
$$\begin{vmatrix} A & B \\ 0 & D \end{vmatrix} = |A|\,|D|.$$
This follows from
$$\begin{pmatrix} I_m & 0 \\ 0 & D \end{pmatrix}\begin{pmatrix} I_m & B \\ 0 & I_n \end{pmatrix}\begin{pmatrix} A & 0 \\ 0 & I_n \end{pmatrix} = \begin{pmatrix} A & B \\ 0 & D \end{pmatrix}.$$

5. In general, if all the submatrices are square matrices, we have
$$\begin{vmatrix} A & B \\ C & D \end{vmatrix} = \begin{cases} |A|\,|D - CA^{-1}B| & \text{if } A \text{ is invertible},\\ |D|\,|A - BD^{-1}C| & \text{if } D \text{ is invertible}. \end{cases}$$
For if A is invertible,
$$\begin{pmatrix} I_m & 0 \\ -CA^{-1} & I_n \end{pmatrix}\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A & B \\ 0 & D - CA^{-1}B \end{pmatrix},$$
and if D is invertible,
$$\begin{pmatrix} I_m & -BD^{-1} \\ 0 & I_n \end{pmatrix}\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A - BD^{-1}C & 0 \\ C & D \end{pmatrix}.$$
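A small numerical sketch of case 5 (the matrices A, B, C, D here are our own examples): building the partitioned matrix with rbind and cbind, its determinant matches det(A) · det(D − CA⁻¹B).

A <- matrix(c(2, 1, 1, 3), 2, 2, byrow = TRUE)   # invertible: det(A) = 5
B <- matrix(c(1, 0, 2, 1), 2, 2, byrow = TRUE)
C <- matrix(c(0, 1, 1, 1), 2, 2, byrow = TRUE)
D <- matrix(c(4, 1, 1, 2), 2, 2, byrow = TRUE)
M <- rbind(cbind(A, B), cbind(C, D))             # the 4 x 4 partitioned matrix
det(M)
det(A) * det(D - C %*% solve(A) %*% B)           # same value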

2.8 Eigenanalysis
2.8.1 Eigenvalues and Eigenvectors
Let A be a square matrix of order n. Then notice that for any vector u ∈ Rn , Au is
also a vector in Rn . So we may think of A as a map Rn → Rn , taking a vector and
transforming it to another vector in the same Euclidean space.
Example. 1.
$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$
Geometrically, the matrix A reflects a vector about the line x = y.
$$A\begin{pmatrix} -2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \end{pmatrix}, \qquad A\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \qquad A\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
(Figure: sketches of v and Av for each of these vectors.)

Observe that any vector on the line x = y gets transformed back to itself, and any
vector along the line x = −y gets transformed to the negative of itself.
2.
$$A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$$
The matrix A takes a vector and maps it to a vector along the line x = y such that
both coordinates of Av are the sum of the coordinates of v.
$$A\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad A\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad A\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \end{pmatrix}.$$
(Figure: sketches of v and Av for each of these vectors.)

Observe that any vector v along the line x = y is mapped to twice itself, Av = 2v,
and any vector v along the line x = −y is taken to the origin, Av = 0.
Let A be a square matrix of order n. A real number λ ∈ R is an eigenvalue of A if
there is a nonzero vector v ∈ Rn , v ̸= 0, such that Av = λv. In this case, the nonzero
vector v is called an eigenvector associated to λ. In other words, A transforms its eigen-
vectors by scaling it by a factor of the associated eigenvalue.

Remark. Note that for v to be an eigenvector associated to λ, necessarily v ̸= 0. Other-


wise, A0 = λ0 for any λ ∈ R and thus every number is an eigenvalue, which makes the
definition pointless.
           
0 1 1 1 −1 1 −1
Example. 1. For A = ,A = ,A = =− .
1 0 1 1 1 −1 1

Eigenvalue
Eigenvector
 
1
λ=1 vλ =
1
−1
λ = −1 vλ =
1
             
1 1 1 2 1 1 0 1
2. For A = ,A = =2 ,A = =0 .
1 1 1 2 1 −1 0 −1

Eigenvalue Eigenvector
 
1
λ=2 vλ =
1
1
λ=0 vλ =
−1
The eigenvalues of a square matrix can be obtained through its characteristic poly-
nomial.
Lemma. Let A be a square matrix of order n. Then det(xI − A) is a polynomial of
degree n.
 
Example. Let $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ be an order 2 square matrix. Then the characteristic
polynomial of A is
$$\det(xI - A) = \begin{vmatrix} x - a & -b \\ -c & x - d \end{vmatrix} = (x - a)(x - d) - bc = x^2 - (a + d)x + ad - bc.$$
It is a degree 2 polynomial.
This means that λ is an eigenvalue of A if and only if it is a root of the polynomial
det(xI − A). This motivates the following definition.

Let A be a square matrix of order n, the characteristic polynomial of A, denoted as


char(A), is the degree n polynomial

char(A) = det(xI − A).

So, we have the following theorem.


Theorem. Let A be a square matrix of order n. λ ∈ R is an eigenvalue of A if and only
if λ is a root of the characteristic polynomial det(xI − A).
The characteristic polynomial of a matrix can be found in R using the function
charpoly. This function requires the package pracma. The roots of a polynomial can be
found using the function solve; we need to first define the polynomial using the function
polynomial. These functions require the package polynom.
> install.packages(‘polynom’)
> library(polynom)
> library(pracma)
> library(matlib)
 
0 1
Example. 1. A = .
1 0
> A <- matrix(c(0,1,1,0),2,2,T); charpoly(A)
[1] 1 0 -1
This means that the coefficient of x2 is 1, coefficient of x is 0, and the constant is
−1,
$$\det(xI - A) = \begin{vmatrix} x & -1 \\ -1 & x \end{vmatrix} = x^2 - 1.$$
So the eigenvalues of A are λ = ±1. Indeed,
> p <- polynomial(c(-1,0,1))
> p
-1 + x^2
> solve(p)
[1] -1 1
Remark. The coefficients are displayed in descending powers of x in the function
charpoly, but when defining the characteristic polynomial, we key in the coeffi-
cients in ascending powers of x. We may use the rev function to reverse the order of
the vector, that is, define the polynomial as p <- polynomial(rev(charpoly(A))).
 
1 1
2. A = .
1 1
> A <- matrix(c(1,1,1,1),2,2,T); charpoly(A)
[1] 1 -2 0
So the characteristic polynomial is x2 − 2x.
$$\det(xI - A) = \begin{vmatrix} x - 1 & -1 \\ -1 & x - 1 \end{vmatrix} = (x - 1)^2 - 1 = x(x - 2).$$
So the eigenvalues of A are λ = 0 and λ = 2.
> p <- polynomial(c(0,-2,1))
> p
-2*x + x^2
> solve(p)
[1] 0 2
 
1 0 0
3. A = 0 0 2.
0 3 1
> A <- matrix(c(1,0,0,0,0,2,0,3,1),3,3,T); charpoly(A)
[1] 1 -2 -5 6
So the characteristic polynomial is x3 − 2x2 − 5x + 6.

$$\det(xI - A) = \begin{vmatrix} x - 1 & 0 & 0 \\ 0 & x & -2 \\ 0 & -3 & x - 1 \end{vmatrix} = (x - 1)[x(x - 1) - 6] = (x - 1)(x + 2)(x - 3).$$

So the eigenvalues of A are λ = 1, λ = −2, and λ = 3.


> p <- polynomial(c(6,-5,-2,1))
> p
6 - 5*x - 2*x^2 + x^3
> solve(p)
[1] -2 1 3

Remark. In this course, we only deal with real roots (eigenvalues); we will ignore all
complex eigenvalues. For example, the matrix $A = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ has eigenvalues ±i,
> A <- matrix(c(0,1,-1,0),2,2,T); charpoly(A)
[1] 1 0 1
> p <- polynomial(c(1,0,1))
> p
1 + x^2
> solve(p)
[1] 0-1i 0+1i
In this case, we will reject the complex roots, and thus say that the characteristic poly-
nomial of A has no (real) roots.

Theorem. A square matrix A is invertible if and only if 0 is not an eigenvalue of A.

Proof. 0 is an eigenvalue of A ⇔ 0 is a root of the polynomial det(xI − A) ⇔ 0 =


det(0I − A) = det(A) ⇔ A not invertible.
Recall that the determinant of a triangular matrix is the product of the diagonal
entries. Suppose A is a triangular matrix. Then xI − A is also a triangular matrix.
Hence, we have the following statement.

Lemma. The eigenvalues of a triangular matrix are the diagonal entries.


 
a11 a12 · · · a1n
 0 a22 · · · a2n 
Proof. A =  ..
 
 . . . . .. 
. 
0 0 · · · ann

x − a11 −a12 · · · −a1n


0 x − a22 · · · −a2n
⇒ .. .. .. = (x − a11 )(x − a22 ) · · · (x − ann ).
. . .
0 0 · · · x − ann

 
Example. Let $A = \begin{pmatrix} a & b & c \\ 0 & d & e \\ 0 & 0 & f \end{pmatrix}$. Then
$$\det(xI - A) = \begin{vmatrix} x - a & -b & -c \\ 0 & x - d & -e \\ 0 & 0 & x - f \end{vmatrix} = (x - a)\begin{vmatrix} x - d & -e \\ 0 & x - f \end{vmatrix} = (x - a)(x - d)(x - f).$$

So the roots of the characteristic polynomial, and hence the eigenvalues, are λ = a, d, f,
which are the diagonal entries of A.
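This can be checked with the tools already set up (charpoly from pracma, polynomial and solve from polynom); a quick sketch for a concrete upper triangular matrix:

A <- matrix(c(2, 7, 1,
              0, 4, -1,
              0, 0, 3), 3, 3, byrow = TRUE)
solve(polynomial(rev(charpoly(A))))    # roots 2, 3, 4: the diagonal entries of A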

Once we have found the eigenvalues of a matrix, the eigenvectors are obtained by
solving the homogeneous system (λI − A)x = 0.

Theorem. Let λ be an eigenvalue of A. The nonzero solutions to the homogeneous


system (λI − A)x = 0 are eigenvectors of A associated to λ.

Proof. Av = λv if and only if 0 = λv − Av = (λI − A)v. Recall that an eigenvector


must be a nonzero vector. This means that if λ is an eigenvalue of A, then the nontrivial
solutions of the homogeneous system (λI − A)x = 0 are the eigenvectors associated to
λ.
 
1 1 0
Example. 1. A = 1 1 0.
0 0 2
> A <- matrix(c(1,1,0,1,1,0,0,0,2),3,3,T)
> charpoly(A)
[1] 1 -4 4 0
> p <- polynomial(c(0,4,-4,1))
> p
4*x - 4*x^2 + x^3
> solve(p)
[1] 0 2 2
So, the characteristic polynomial of A is

det(xI − A) = x3 − 4x2 − 4x = x(x − 2)2 ,

and the eigenvalues are 0, 2. Since 0 is an eigenvalue of A, A is not invertible.


Now we will find the eigenvectors

For λ = 0:
> Solve(0*diag(3)-A)
x1 + x2 = 0
x3 = 0
0 = 0    
x1 −s
So the nonzero vectors of the form x2 =
   s , for any s ∈ R, is an eigenvector.
x3 0

For λ = 2:
> Solve(2*diag(3)-A)
x1 - 1*x2 = 0
0 = 0
0 = 0    
x1 t
So the nonzero vectors of the form x2  =  t , for any s, t ∈ R, is an eigenvector.
x3 s
2.  
2 3 −4 6 −3 −2 −6
 13 −8 6 6 −9 −12 −10
 
 3 −8 6 −4 1 −2 0 
 
−5 0 −1 −5 5
A= 6 4 

−7 7 −8 0 2 8 4 
 
 2 −7 7 −4 2 −3 0 
−3 3 −4 0 3 4 −1
> A <- matrix(c(2,3,-4,6,-3,-2,-6,13,-8,6,6,-9,-12,-10,3,-8,6,-4,1,-2,0,
-5,0,-1,-5,5,6,4,-7,7,-8,0,2,8,4,2,-7,7,-4,2,-3,0,-3,3,-4,0,3,4,-1),7,7,T)
> charpoly(A)
[1] 1 7 -8 -158 -419 -425 -150 0
> p <- polynomial(rev(charpoly(A)))
> p
-150*x - 425*x^2 - 419*x^3 - 158*x^4 - 8*x^5 + 7*x^6 + x^7
> solve(p)
[1] -5.0000000 -3.0000000 -2.0000000 -1.0000001 -0.9999999 0.0000000
5.0000000
That is, the characteristic polynomial of A is

det(xI − A) = x7 + 7x6 − 8x5 − 158x4 − 419x3 − 425x2 − 150x


= x(x − 5)(x + 1)2 (x + 2)(x + 3)(x + 5),

and the eigenvalues are λ = 0, 5, −1, −2, −3, −5.

Remark. Due to rounding error, -1.0000001 and -0.9999999 should both be −1 instead.
We can infer this by observing that the coefficients of the characteristic polynomial
are all integers.
We will proceed to compute the eigenvectors.

For λ = 0:
> Solve(0*diag(7)-A)
x1 = 0
x2 = 0
x3 - 1*x6 = 0
x4 - 1*x6 = 0
x5 = 0
x7 = 0
0 = 0  
0
0
 
1
 
1 to be the eigenvector associated to 0.
We will just pick a particular solution  
0
 
1
0

For λ = 5:
> Solve(5*diag(5)-A)
x1 + x6 = 0
x2 + x6 = 0
x3 - 1*x6 = 0
x4 - 1*x6 = 0
x5 = 0
x7 = 0
0 = 0  
−1
−1
 
1
 
 1 .
So, an associated eigenvector is  
0
 
1
0

For λ = −1:
> Solve(-1*diag(7)-A)
x1 - 1*x6 - 1*x7 = 0
x2 - 1*x6 = 0
x3 - 1*x6 = 0
x4 - 1*x7 = 0
x5 - 1*x7 = 0
0 = 0
0 = 0
Observe that here we need 2 parameters in the general solution. Letting x6 = s,
   
1 1
1 0
   
1 0
   
x7 = t, we can get 2 (independent) solutions 0 and 
 
1.

0 1
   
1 0
0 1

For λ = −2:
> Solve(-2*diag(7)-A)
x1 - 1*x7 = 0
x2 - 1*x7 = 0
x3 - 1*x7 = 0
x4 - 1*x7 = 0
x5 - 1*x7 = 0
x6 = 0
0 = 0  
1
1
 
1
 
1.
So, an associated eigenvector is  
1
 
0
1

For λ = −3:
> Solve(-3*diag(7)-A)
x1 - 1*x7 = 0
x2 = 0
x3 = 0
x4 = 0
x5 + x7 = 0
x6 - 1*x7 = 0
0 = 0  
1
0
 
0
 
 0 .
So, an associated eigenvector is  
−1
 
1
1

For λ = −5:
> Solve(-5*diag(7)-A)
x1 = 0
x2 - 1*x6 = 0
x3 - 1*x6 = 0
x4 = 0
x5 + x6 = 0
x7 = 0
0 = 0  
0
1
 
1
 
 0 .
So, an associated eigenvector is  
−1
 
1
0
We may use the function eigen in R to obtain the eigenvalues and eigenvectors of a
matrix A.
 
1 1 0
Example. 1. Let A = 1 1 0.
0 0 2
> A <- matrix(c(1,1,0,1,1,0,0,0,2),3,3,T)
> eigen(A)
eigen() decomposition
$values
[1] 2.000000e+00 2.000000e+00 1.110223e-15

$vectors
[,1] [,2] [,3]
[1,] 0 0.7071068 0.7071068
[2,] 0 0.7071068 -0.7071068
[3,] 1 0.0000000 0.0000000

The entries of $values are the eigenvalues, and the i-th column of the matrix $vectors
is an eigenvector associated to the i-th entry of $values. To verify this, let lambda be
the vector containing the eigenvalues
> lambda <- eigen(A)$values
and P to be the matrix eigen(A)$vectors
> P <- eigen(A)$vectors

Let us now show that AP[, i] = λ[i] ∗ P[, i],


> > A%*%P[,1]
[,1]
[1,] 0
[2,] 0
[3,] 2
> lambda[1]*P[,1]
[1] 0 0 2
We may also verify that AP[, i] − λ[i] ∗ P[, i] = 0,
> A%*%P[,2]-lambda[2]*P[,2]
[,1]
[1,] 2.220446e-16
[2,] -2.220446e-16
[3,] 0.000000e+00
> A%*%P[,3]-lambda[3]*P[,3]
[,1]
[1,] -5.630016e-16
[2,] 1.007091e-15
[3,] 0.000000e+00

Remark. (i) Observe that due to rounding error, the last 2 differences are not 0.
However, the differences are very small (in the order of 10−16 ). Note also that the
third eigenvalue should be 0. One must exercise discretion when interpreting
the data.
(ii) Observe that R will always choose the eigenvector v (written as a column vector)
such that vT v = 1 (such a vector is called a unit vector). For example,
referring to the previous example where we found the eigenvector by solving
(λI − A)x = 0, R chose s = 1/√2 for the eigenvector associated to eigenvalue
0. Let us verify this
> crossprod(P[,1],P[,1])
[,1]
[1,] 1
> crossprod(P[,2],P[,2])
[,1]
[1,] 1
> crossprod(P[,3],P[,3])
[,1]
[1,] 1
(iii) Finally, observe that since there are 2 parameters in the general solution for the
eigenvectors of eigenvalue 2, R will “separate” them, that is, it will choose
s = 0, t = 1/√2 for one of the eigenvectors, and s = 1, t = 0 for
the other eigenvector associated to eigenvalue 2.

2. Let  
2 3 −4 6 −3 −2 −6
 13 −8 6 6 −9 −12 −10
 
 3 −8 6 −4 1 −2 0 
 
A = −5 0 −1 −5 5
 6 4 

−7 7 −8 0 2 8 4 
 
 2 −7 7 −4 2 −3 0 
−3 3 −4 0 3 4 −1
> A <- matrix(c(2,3,-4,6,-3,-2,-6,13,-8,6,6,-9,-12,-10,3,-8,6,-4,1,-2,0,
-5,0,-1,-5,5,6,4,-7,7,-8,0,2,8,4,2,-7,7,-4,2,-3,0,-3,3,-4,0,3,4,-1),7,7,T)
> eigen(A)
eigen() decomposition
$values
[1] 5.000000e+00 -5.000000e+00 -3.000000e+00 -2.000000e+00 -1.000000e+00
[6] -1.000000e+00 1.493133e-15

$vectors
[,1] [,2] [,3] [,4] [,5]
[1,] 4.472136e-01 8.927420e-16 5.000000e-01 4.082483e-01 5.000000e-01
[2,] 4.472136e-01 -5.000000e-01 -2.187319e-15 4.082483e-01 1.271365e-15
[3,] -4.472136e-01 -5.000000e-01 -1.885620e-15 4.082483e-01 1.252999e-15
[4,] -4.472136e-01 3.246335e-16 5.421158e-16 4.082483e-01 5.000000e-01
[5,] -8.448406e-16 5.000000e-01 -5.000000e-01 4.082483e-01 5.000000e-01
[6,] -4.472136e-01 -5.000000e-01 5.000000e-01 2.051160e-15 1.767114e-16
[7,] -5.011742e-16 1.062829e-15 5.000000e-01 4.082483e-01 5.000000e-01
[,6] [,7]
[1,] -0.5718368 3.711441e-16
[2,] -0.1115115 -3.976544e-17
[3,] -0.1115115 -5.773503e-01
[4,] -0.4603253 -5.773503e-01
[5,] -0.4603253 6.362470e-16
[6,] -0.1115115 -5.773503e-01
[7,] -0.4603253 3.582023e-16

We will just check that the 5-th and 6-th columns of P are the eigenvectors associated
to eigenvalue −1. The rest of the verification of the eigenvalue-eigenvector pairs is
left to the reader. Also, it is a good exercise to interpret the data and decide which
of the entries are the result of rounding error, and what the correct values should
be.
> lambda <- eigen(A)$values; P <- eigen(A)$vectors
> A%*%P[,5]-lambda[5]*P[,5]
[,1]
[1,] 2.220446e-16
[2,] 3.047722e-15
[3,] -9.878031e-16
[4,] -1.998401e-15
[5,] -1.776357e-15
[6,] -1.313782e-16
[7,] -1.221245e-15
> A%*%P[,6]-lambda[6]*P[,6]
[,1]
[1,] -5.551115e-16
[2,] -1.706968e-15
[3,] 1.595946e-15
[4,] 2.109424e-15
[5,] 6.106227e-16
[6,] 1.026956e-15
[7,] 9.992007e-16

2.8.2 Diagonalization
A square matrix A is said to be diagonalizable if there exists an invertible matrix P such
that P−1 AP = D is a diagonal matrix.
Remark. The statement above is equivalent to being able to express A as A = PDP−1
for some invertible P and diagonal matrix D.
Example. 1. Any square zero matrix is diagonalizable, 0 = I0I−1 .
2. Any diagonal matrix D is diagonalizable, D = IDI^{-1}.

3. $A = \begin{pmatrix} 3 & 1 & -1 \\ 1 & 3 & -1 \\ 0 & 0 & 2 \end{pmatrix}$ is diagonalizable, with $A = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix}\begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix}\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix}^{-1}$.

The invertible matrix P that diagonalizes A has the form $P = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix}$,
where the $u_i$ are eigenvectors of A, and the diagonal matrix is $D = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_n)$, where
$\lambda_i$ is the eigenvalue associated to the eigenvector $u_i$. In other words, the i-th column of the
invertible matrix P is an eigenvector of A whose eigenvalue is the i-th diagonal entry of D.
We have already seen this in a previous example.
 
1 1 0
Example. Let A = 1 1 0.
0 0 2
> A <- matrix(c(1,1,0,1,1,0,0,0,2),3,3,T)
> eigen(A)
eigen() decomposition
$values
[1] 2.000000e+00 2.000000e+00 1.110223e-15

$vectors
[,1] [,2] [,3]
[1,] 0 0.7071068 0.7071068
[2,] 0 0.7071068 -0.7071068
[3,] 1 0.0000000 0.0000000
> P <- eigen(A)$vectors
> D <- diag(eigen(A)$values)
> D
[,1] [,2] [,3]
[1,] 2 0 0.000000e+00
[2,] 0 2 0.000000e+00
[3,] 0 0 1.110223e-15
Let us fix the last eigenvalue to its correct value 0,
> D[3,3] <- 0
> D
[,1] [,2] [,3]
[1,] 2 0 0
[2,] 0 2 0
[3,] 0 0 0
Now let us verify that A = PDP−1
> P%*%D%*%solve(P)
[,1] [,2] [,3]
[1,] 1 1 0
[2,] 1 1 0
[3,] 0 0 2

However, not every square matrix is diagonalizable.


 
1 1 0
Example. Let A = 1 1 1.
0 0 2
> A <- matrix(c(1,1,0,1,1,1,0,0,2),3,3,T)
> eigen(A)
eigen() decomposition
$values
[1] 2 2 0

$vectors
[,1] [,2] [,3]
[1,] 0.7071068 -7.071068e-01 -0.7071068
[2,] 0.7071068 -7.071068e-01 0.7071068
[3,] 0.0000000 6.280370e-16 0.0000000

Observe that in this case, if we account for rounding error and identify the (3, 2)-entry
of the $vectors as 0, the second column of $vectors is equal to the negative of the first.
So, the matrix $vectors is not invertible,
> P <- eigen(A)$vectors
> solve(P)
Error in solve.default(P) :
system is computationally singular: reciprocal condition number
= 2.22045e-16

A square matrix A is not diagonalizable if

(i) A does not have enough real eigenvalues, or

(ii) A does not have enough eigenvectors.


 
0 −1 1
Example. 1. A = 1 0 −1.
0 0 1
> A <- matrix(c(0,-1,1,1,0,-1,0,0,1),3,3,T);
> solve(polynomial(rev(charpoly(A))))
[1] 0-1i 0+1i 1+0i
In this case, A has 1 real root and 2 complex roots. Since in this course we are only
working with real-valued matrices, we do not have enough eigenvalues to form the
diagonal matrix D. Hence, A is not diagonalizable. Moreover, the matrix P will
also not be a real-valued matrix.

Solving the eigenvectors associated to 1,


> Solve(diag(3)-A)
x1 - 1*x3 = 0
x2 = 0
0 = 0  
1
we obtain only 1 parameter family of eigenvectors s 0, s ∈ R\{0}, instead of 3

1
(= order of A) parameters family of eigenvectors. The other 2 parameter family of
eigenvectors are complex eigenvectors,
> eigen(A)
eigen() decomposition
$values
[1] 0+1i 0-1i 1+0i

$vectors
[,1] [,2] [,3]
[1,] 0.7071068+0.0000000i 0.7071068+0.0000000i 0.7071068+0i
[2,] 0.0000000-0.7071068i 0.0000000+0.7071068i 0.0000000+0i
[3,] 0.0000000+0.0000000i 0.0000000+0.0000000i 0.7071068+0i
 
1 1 0
2. Consider the example given above, A = 1 1 1. We have computed that the
0 0 2
eigenvalues are λ = 0 and 2. Now find the eigenvectors.
For λ = 0,
> Solve(-A)
x1 + x2 = 0
x3 = 0
0 = 0
For λ = 2,
> Solve(2*diag(3)-A)
x1 - 1*x2 = 0
x3 = 0
0 = 0
 
1
This shows that we only have 2 parameter family of eigenvectors, s −1 and
  0
1
t 1, s, t ∈ R\{0} (associated to 0 and 2, respectively). Since A has order 3,
0
it needs 3 parameter family of eigenvectors to be diagonalizable. Hence, A is not
diagonalizable, as shown above.
 
1 1
3. Let A = .
0 1
> A <- matrix(c(1,1,0,1),2,2,T);
> solve(polynomial(rev(charpoly(A))))
[1] 1 1
A has only one eigenvalue, λ = 1. Now compute the eigenvectors.
> Solve(diag(2)-A)
x2 = 0
0 = 0
We obtain a 1 parameter family of eigenvectors $s\begin{pmatrix} 1 \\ 0 \end{pmatrix}$, s ∈ R\{0}. Since A has order
2, it does not have enough parameter families of eigenvectors to be diagonalizable.
Indeed,
> eigen(A)
eigen() decomposition
$values
[1] 1 1
$vectors
[,1] [,2]
[1,] 1 -1.000000e+00
[2,] 0 2.220446e-16
Taking into account that the (2, 2)-entry of $vectors should be 0, the second column is
equal to the negative of the first. Hence, the matrix $vectors is not invertible.
> P <- eigen(A)$vectors
> solve(P)
Error in solve.default(P) :
system is computationally singular: reciprocal condition number
= 1.11022e-16

Hence, to check if a matrix is diagonalizable, we can check if solve(eigen(A)$vectors)


exists, and is a real-valued matrix. However, if A is a symmetric matrix, then it will be
diagonalizable. In fact, we can say more.

An order n square matrix A is orthogonally diagonalizable if

A = PDPT

for some orthogonal matrix P and diagonal matrix D.

Theorem. An order n square matrix is orthogonally diagonalizable if and only if it is


symmetric.

We orthogonally diagonalize a symmetric matrix in R using the same function eigen,


except in the second argument we add symmetric = TRUE.
 
5 1 1
Example. 1. Consider A = 1 5 1.
1 1 5
A <- matrix(c(5,1,1,1,5,1,1,1,5),3,3,T); eigen(A,symmetric = TRUE)
eigen() decomposition
$values
[1] 7 4 4

$vectors
[,1] [,2] [,3]
[1,] -0.5773503 -0.3160858 0.7528323
[2,] -0.5773503 -0.4939290 -0.6501544
[3,] -0.5773503 0.8100148 -0.1026779
> P <- eigen(A)$vectors
> crossprod(P,P)
[,1] [,2] [,3]
[1,] 1.000000e+00 -1.110223e-16 -2.567391e-16
[2,] -1.110223e-16 1.000000e+00 -4.163336e-17
[3,] -2.567391e-16 -4.163336e-17 1.000000e+00

Observe that the diagonal entries of PT P (= crossprod(P,P)) are 1, and the off-diagonal
entries are supposed to be 0. Hence, PT P = I is the identity matrix, verifying
that P is indeed orthogonal. Let us check that A = PDPT . Recall that the
transpose of P in R is t(P).
> D <- diag(eigen(A)$values)
> P%*%D%*%t(P)
[,1] [,2] [,3]
[1,] 5 1 1
[2,] 1 5 1
[3,] 1 1 5
 
1 −1 3
2. Consider A = −1 1 5.
3 5 1
> A <- matrix(c(1,-1,3,-1,1,5,3,5,1),3,3,T); eigen(A,symmetric = TRUE)
eigen() decomposition
$values
[1] 6.429008 1.876374 -5.305382

$vectors
[,1] [,2] [,3]
[1,] 0.2893960 -0.86088999 -0.4184715
[2,] 0.6190719 0.50177055 -0.6041327
[3,] 0.7300685 -0.08423029 0.6781632

> P <- eigen(A)$vectors


> crossprod(P,P)
[,1] [,2] [,3]
[1,] 1.000000e+00 4.857226e-17 0.000000e+00
[2,] 4.857226e-17 1.000000e+00 -6.245005e-17
[3,] 0.000000e+00 -6.245005e-17 1.000000e+00
> D <- diag(eigen(A)$values)
> P%*%D%*%t(P)
[,1] [,2] [,3]
[1,] 1 -1 3
[2,] -1 1 5
[3,] 3 5 1
2.9 Appendix for Chapter 2
2.9.1 Matrix operations
Theorem (Properties of matrix multiplication). (i) (Associative) For matrices A =
(aij )m×p , B = (bij )p×q , and C = (cij )q×n (AB)C = A(BC).

(ii) (Left distributive law) For matrices A = (aij )m×p , B = (bij )p×n , and C = (cij )p×n ,
A(B + C) = AB + AC.

(iii) (Right distributive law) For matrices A = (aij )m×p , B = (bij )m×p , and C = (cij )p×n ,
(A + B)C = AC + BC.

(iv) (Commute with scalar multiplication) For any real number c ∈ R, and matrices
A = (aij )m×p , B = (bij )p×n , c(AB) = (cA)B = A(cB).

(v) (Multiplicative identity) For any m × n matrix A, Im A = A = AIn .

(vi) (Zero divisor) There exists A ̸= 0m×p and B ̸= 0p×n such that AB = 0m×n .

(vii) (Zero matrix) For any m × n matrix A, A0n×p = 0m×p and 0p×m A = 0p×n .
Proof. We will check that the corresponding entries on each side agree. The check that
the sizes of the matrices agree is trivial and is left to the reader.
(i) The (i, j)-entry of (AB)C is
q p q p
X X X X
( aik bkl )clj = aik bkl clj .
l=1 k=1 l=1 k=1

The (i, j)-entry of A(BC) is


p q p q
X X X X
aik ( bkl clj ) = aik bkl clj .
k=1 l=1 k=1 l=1

Since both sums has finitely many terms, the sums commute and thus the (i, j)-entry
of (AB)C is equal to the (i, j)-entry of A(BC).
Pp Pp
(ii) P
The (i, j)-entry of A(B + C) is k=1 aik (bkj + ckj ) = k=1 (aik bkj + aik ckj ) =
p Pp
k=1 aik bkj + k=1 aik ckj , which is the (i, j)-entry of AB + AC.

(iii) The proof is analogous to left distributive law.

(iv) Left to reader.



1 if i = j
(v) Note that I = (δij ), where δij = . So the (i, j)-entry of Im A is
0 if i ̸= j

δi1 a1j + · · · + δii aij + · · · + δim amj = 0a1j + · · · + 1aij + · · · + 0amj = aij .

The proof for A = AIn is analogous.


    
1 0 0 0 0 0
(vi) Consider for example = .
1 0 1 1 0 0
(vii) Left to reader, if you have read till this far, surely this proof is trivial to you.

Theorem (Properties of transpose). (i) For any matrix A, (AT )T = A.


(ii) For any matrix A, and real number c ∈ R, (cA)T = cAT .
(iii) For matrices A and B of the same size, (A + B)T = AT + BT .
(iv) For matrices A = (aij )m×p and B = (bij )p×n , (AB)T = BT AT .
Proof. We will only prove (iv). The rest is left to the reader. The (j, i)-entry of AB is
p
X
ajk bki ,
k=1

which is the (i, j)-entry of (AB) . The (i, j)-entry of BT AT is


T

p p
X X
bki ajk = ajk bki ,
k=1 k=1
T
which is exactly the (i, j)-entry of (AB) .

2.9.2 Linear Systems


A system of linear equations can be expressed uniquely as an augmented matrix
 
a11 a12 · · · a1n b1
 a21 a22 · · · a2n b2 
..  .
 
 .. .. ... ..
 . . . . 
am1 am2 · · · amn bm
The linear system is homogeneous if there are no constant terms
a11 x1 + a12 x2 + · · · + a1n xn = 0
a21 x1 + a22 x2 + · · · + a2n xn = 0
..
.
am1 x1 + am2 x2 + · · · + amn xn = 0
Then the corresponding augmented matrix is
 
a11 a12 · · · a1n 0
 a21 a22 · · · a2n 0 
..  .
 
 .. .. ... ..
 . . . . 
am1 am2 · · · amn 0
Example. (Nonhomogeneous) Linear system:
3x + 2y − z = 1
5y + z = 3
x + z = 2
The associated homogeneous system is
3x + 2y − z = 0
5y + z = 0
x + z = 0
2.9.3 Solving Linear Systems
In an (augmented) matrix, a zero row is a row with all entries 0. The leading entry of a
nonzero row is the first nonzero entry of the row counting from the left.

An (augmented) matrix is in reduced row-echelon form (RREF) if


1. All zero rows are at the bottom of the matrix.
2. The leading entries are further to the right as we move down the rows.
3. The leading entries are 1.
4. In each column that contains a leading entry, all entries except the leading entry is 0.
In RREF, a pivot column is a column containing a leading entry. A column is called
nonpivot otherwise.

A matrix in RREF has the form


 
1 ∗ 0 ∗ 0 ∗
 0 ··· 0 1 ∗ 0 ∗ 
 
 0 ··· 0 0 ··· 0 1 ∗ 
.
 
 0 0 0 0 0
 
 .. .. .. .. .. 
 . . . . . 
0 ··· 0 ··· 0 0 0
If an augmented matrix is in RREF, we are able to read off the solutions, if it exists.
The corresponding linear system is inconsistent (has no solutions) if and only if the last
column (after the vertical line) is a pivot column.
 
∗ ··· ∗ 0
 .. . . .. .. 
 . . . . .
 
 0 ··· 0 1 
0 ··· 0 0
This means that the system has no solution. For if the leading entry is in the last column,
we are solving for
0x1 + 0x2 + · · · + 0x1 = 1,
which is impossible. Otherwise, the linear system is consistent, that is, there will be
solutions to the linear system. In this case,
# of parameters = # of nonpivot columns on the left of the vertical line
Example. 1. The system corresponding to
 
1 0 2 1
0 1 −3 1
is consistent since the last column is not a pivot column. The system has 1 param-
eter since the third column is also a nonpivot column. The corresponding linear
system is
x1 + 2x3 = 1
x2 − 3x3 = 1
By letting x3 to be the parameter t, we have a general solution

x1 = 1 − 2t, x2 = 1 + 3t, x3 = t, t ∈ R.

2. The system corresponding to


 
1 0 1 0
 0 1 2 0 
0 0 0 1

is inconsistent since the last column is a pivot column.

3. The system corresponding to


 
1 0 0 1
 0 1 0 −2 
0 0 1 9

has a unique solution x1 = 1, x2 = −2, x3 = 9.

The next 2 theorems provides us with an algorithm to find solutions of a linear system.

Theorem. Every (augmented) matrix corresponds to a unique (augmented) matrix in


reduced row-echelon form. This (augmented) matrix will be called its RREF.

Theorem. Two linear system has the same solution set if their augmented matrix has
the same RREF.

This means that by reading off the solutions from the RREF of the augmented matrix,
we are able to obtain the solutions for the linear system.

Example. 1. Consider the following linear system

x + 2y − z = 1
x + 5y + z = 3
x + y + z = 2

The RREF of the augmented matrix is


 
1 0 0 9/8
 0 1 0 1/4 
0 0 1 5/8

Readers can check that indeed x = 9/8, y = 1/4, z = 5/8 is the unique solution to
the linear system.

2. Consider the following linear system

3x + 6y = −3
3x + 5y + z = 2
x + 2y = −1
The RREF of the augmented matrix is
 
1 0 2 9
 0 1 −1 −5 
0 0 0 0
Readers can check that
x = 9 − 2t, y = t − 5, z = t, t ∈ R
is a general solution to the linear system.
3. Consider the following linear system
x + y = 2
x − y = 0
2x + y = 1
The RREF of the augmented matrix is
 
1 0 0
 0 1 0 
0 0 1
that is, the system is inconsistent.
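As a complement to reading these RREFs off by hand, here is a brief sketch using rref from the pracma package to produce them directly from the augmented matrices of Examples 1 and 2.

library(pracma)
M1 <- matrix(c(1, 2, -1, 1,
               1, 5,  1, 3,
               1, 1,  1, 2), 3, 4, byrow = TRUE)
rref(M1)    # rows (1,0,0,1.125), (0,1,0,0.25), (0,0,1,0.625): x = 9/8, y = 1/4, z = 5/8
M2 <- matrix(c(3, 6, 0, -3,
               3, 5, 1,  2,
               1, 2, 0, -1), 3, 4, byrow = TRUE)
rref(M2)    # rows (1,0,2,9), (0,1,-1,-5), (0,0,0,0)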

2.9.4 Elementary Row Operations


There are 3 types of elementary row operations.
1. Exchanging 2 rows, Ri ↔ Rj ,
2. Adding a multiple of a row to another, Ri + cRj , c ∈ R,
3. Multiplying a row by a nonzero constant, aRj , a ̸= 0.
Example. 1. Exchange row 2 and row 4
   
1 1 2 5 1 1 2 5
 0 0 2 4  R2 ↔R4  0 1 4 3 
 − −−−→  
 0 1 1 2   0 1 1 2 
0 1 4 3 0 0 2 4

2. Add 2 times row 3 to row 2


   
1 1 2 5 1 1 2 5
 0 0 2 4  R2 +2R3  0 2 4 8 
 − −−−→  
 0 1 1 2   0 1 1 2 
0 1 4 3 0 1 4 3

3. Multiply row 2 by 1/2


   
1 1 2 5 1 1 2 5
 0 1
0 2 4 
 −2−R2  0 0 1 2 

 0 → 
1 1 2   0 1 1 2 
0 1 4 3 0 1 4 3
Remark. 1. Note that we cannot multiply a row by 0, as it may change the linear
system. For example, consider
x + y = 2
x − y = 0
It has a unique solution x = 1, y = 1. Suppose in the augmented matrix we multiply
row 2 by 0,    
1 1 2 0R2 1 1 2
−−→
1 −1 0 0 0 0
then the system now has a general solution x = 2 − s, y = s.
2. Elementary row operations may not commute. For example,
   
1 0 0 2R1 R2 ↔R1 0 1 0
−−→−−−−→
0 1 0 2 0 0
is not the same as    
1 0 0 R2 ↔R1 2R1 0 2 0
−−−−→−−→
0 1 0 1 0 0
But if the elementary row operations do commute, we can stack them
   
1 0 0 2R2 2 0 0
−−→
0 1 0 2R1 0 2 0

3. For the second type of elementary row operation, the row we put first is the row we
are performing the operation upon,
   
1 0 0 R1 +2R2 1 2 0
−−−−→
0 1 0 0 1 0
instead of    
1 0 0 2R2 +R1 1 0 0
−−−−→
0 1 0 1 2 0
In fact, the 2R2 + R1 is not an elementary row operation, but a combination of 2
operations, 2R2 then R2 + R1 . Here’s another example,
   
1 0 0 R1 +R2 1 1 0
−−−−→
0 1 0 1 0 0
and    
1 0 0 R2 +R1 1 0 0
−−−−→
0 1 0 1 1 0
Two augmented matrices are row equivalent if one can be obtained from the other by
elementary row operations.
Theorem. Two augmented matrices are row equivalent if and only if they have the
same RREF.
Observe that from the RREF we are able to uniquely obtain the solution set, and from
the solution set, if we know the number of equations the linear system has, we are able
to reconstruct the RREF uniquely. Hence, the previous theorem gives us the following
statement.
Theorem. Two linear systems have the same solution set if their augmented matrices
are row equivalent.

Note that every augmented matrix is row equivalent to a unique RREF, but it can
be row equivalent to different REFs.

Example. From the REF or RREF, we are able to read off a general solution.

1.  
1 1 1 1
0 1 1 0
This is in REF. We let the third variable be the parameter s, then we get y = −s from
the second row, and x = 1 − s − (−s) = 1 from the first row. So a general solution is
x = 1, y = −s, z = s.

2.  
1 0 0 1
0 1 −1 0
This is in RREF. General solution: x = 1, y = s, z = s.

Example. We will now reconstruct the RREF of the augmented matrix of a linear system
given a general solution.

1. x = 1 − 2s + t, y = s, z = t, the linear system has 3 equations. Then substituting


back y = s and z = t into x = 1 − 2s + t, we get x + 2y − z = 1. So the RREF of the
augmented matrix is  
1 2 −1 1
 0 0 0 0 .
0 0 0 0

2. x = 3 − 5s, y = 2 + 2s, z = s, the linear system has 3 equations. Substituting back,


we get,
x + 5z = 3
y − 2z = 2
and thus the RREF of the augmented matrix is
 
1 0 5 3
 0 1 −2 2  .
0 0 0 0

3. x = 3, y = 2, z = 1, 3 equations. RREF:
 
1 0 0 3
 0 1 0 2 .
0 0 1 1
2.9.5 Gaussian Elimination and Gauss-Jordan Elimination
Step 1: Locate the leftmost column that does not consist entirely of zeros.

Step 2: Interchange the top row with another row, if necessary, to bring a nonzero entry to
the top of the column found in Step 1.

Step 3: For each row below the top row, add a suitable multiple of the top row to it so that
the entry below the leading entry of the top row becomes zero.

Step 4: Now cover the top row in the augmented matrix and begin again with Step 1
applied to the submatrix that remains. Continue this way until the entire matrix
is in row-echelon form.

Once the above process is completed, we will end up with a REF. The following steps
continue the process to reduce it to its RREF.

Step 5: Multiply a suitable constant to each row so that all the leading entries become 1.

Step 6: Beginning with the last nonzero row and working upward, add suitable multiples
of each row to the rows above to introduce zeros above the leading entries.

Remark. The Gaussian elimination and Gauss-Jordan elimination may not be the fastest
way to obtain the RREF of an augmented matrix. We do not have to follow the algorithm
strictly when reducing the augmented matrix.

Algorithm

1. Express the given linear system as an augmented matrix. Make sure that the linear
system is in standard form.

2. Use the Gaussian elimination to reduce the augmented matrix to a row-echelon


form, or use the Gauss-Jordan elimination to reduce the augmented matrix to its
reduced row-echelon form.

3. If the system is consistent, assign the variables corresponding to the nonpivot


columns as parameters.

4. Use back substitution (if the augmented matrix is in REF) to obtain a general
solution, or read off the general solution (if the augmented matrix is in RREF).

Example. Applying R2 + R1 and R3 − 2R1, followed by R3 + (2/3)R2:

        [  1  1  2    4 ]      [ 1  1   2     4 ]      [ 1  1    2       4   ]
        [ −1  2 −1    1 ]  →   [ 0  3   1     5 ]  →   [ 0  3    1       5   ]
        [  2  0  3   −2 ]      [ 0 −2  −1   −10 ]      [ 0  0  −1/3   −20/3  ]

The augmented matrix is now in REF. By back substitution, we have

        z = 20,   y = (1/3)(5 − z) = −5,   x = 4 − y − 2z = −31.

Alternatively, we can continue to reduce it to its RREF and read off the solution.
Applying −3R3, then R2 − R3 and R1 − 2R3, then (1/3)R2, and finally R1 − R2:

        [ 1  1    2       4   ]      [ 1  1  2    4 ]      [ 1  1  0  −36 ]
        [ 0  3    1       5   ]  →   [ 0  3  1    5 ]  →   [ 0  3  0  −15 ]
        [ 0  0  −1/3   −20/3  ]      [ 0  0  1   20 ]      [ 0  0  1   20 ]

        [ 1  1  0  −36 ]      [ 1  0  0  −31 ]
    →   [ 0  1  0   −5 ]  →   [ 0  1  0   −5 ]
        [ 0  0  1   20 ]      [ 0  0  1   20 ]

Indeed, the system is consistent, with unique solution x = −31, y = −5, z = 20.
Remark. We may include verbose = TRUE as an argument of the function Solve to
show the steps of the Gaussian elimination algorithm. For example, solve
        x1 + x2 + 2x3 = 4
       −x1 + 2x2 − x3 = 1
        2x1      + 3x3 = 2

> A <- matrix(c(1,1,2,-1,2,-1,2,0,3),3,3,T); b <- matrix(c(4,1,2))


> Solve(A,b,fractions = TRUE,verbose = TRUE)

Initial matrix:
[,1] [,2] [,3] [,4]
[1,] 1 1 2 4
[2,] -1 2 -1 1
[3,] 2 0 3 2

row: 1

exchange rows 1 and 3


[,1] [,2] [,3] [,4]
[1,] 2 0 3 2
[2,] -1 2 -1 1
[3,] 1 1 2 4

multiply row 1 by 1/2


[,1] [,2] [,3] [,4]
[1,] 1 0 3/2 1
[2,] -1 2 -1 1
[3,] 1 1 2 4

multiply row 1 by 1 and add to row 2


[,1] [,2] [,3] [,4]
[1,] 1 0 3/2 1
[2,] 0 2 1/2 2
[3,] 1 1 2 4

subtract row 1 from row 3


[,1] [,2] [,3] [,4]
[1,] 1 0 3/2 1
[2,] 0 2 1/2 2
[3,] 0 1 1/2 3

row: 2

multiply row 2 by 1/2


[,1] [,2] [,3] [,4]
[1,] 1 0 3/2 1
[2,] 0 1 1/4 1
[3,] 0 1 1/2 3

subtract row 2 from row 3


[,1] [,2] [,3] [,4]
[1,] 1 0 3/2 1
[2,] 0 1 1/4 1
[3,] 0 0 1/4 2

row: 3

multiply row 3 by 4
[,1] [,2] [,3] [,4]
[1,] 1 0 3/2 1
[2,] 0 1 1/4 1
[3,] 0 0 1 8

multiply row 3 by 3/2 and subtract from row 1


[,1] [,2] [,3] [,4]
[1,] 1 0 0 -11
[2,] 0 1 1/4 1
[3,] 0 0 1 8

multiply row 3 by 1/4 and subtract from row 2


[,1] [,2] [,3] [,4]
[1,] 1 0 0 -11
[2,] 0 1 0 -1
[3,] 0 0 1 8
x1 = -11
x2 = -1
x3 = 8

2.9.6 Orthogonal and Orthonormal


We say that vectors u, v ∈ Rn are orthogonal if

u · v = 0.

Suppose u, v ∈ Rn are orthogonal.

• Case 1: Either u = 0 or v = 0.
• Case 2: Otherwise,

        cos(θ) = (u · v) / (∥u∥∥v∥) = 0

  tells us that θ = π/2, that is, u and v are perpendicular.
That is, u, v are orthogonal if and only if either one of them is the zero vector or they
are perpendicular to each other.
Example.
        (1,1,1) · (1,0,−1) = 0,    (1,0,0) · (0,1,0) = 0,    (1,1,0) · (0,0,1) = 0.
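
In R, such orthogonality checks come down to computing a dot product (a small illustrative check):

> u <- c(1,1,1); v <- c(1,0,-1)
> sum(u*v)      #dot product; 0 means u and v are orthogonal
[1] 0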
Exercise: Suppose u, v are orthogonal. Show that for any s, t ∈ R scalars, su, tv
are also orthogonal.

A set S = {v1 , v2 , ..., vk } ⊆ Rn of vectors is orthogonal if vi · vj = 0 for every i ̸= j,
that is, the vectors in S are pairwise orthogonal. A set S = {v1 , v2 , ..., vk } ⊆ Rn of vectors
is orthonormal if

        vi · vj = 0 if i ̸= j,   and   vi · vj = 1 if i = j.

That is, S is orthogonal, and all the vectors are unit vectors.
Remark. Every orthogonal set of nonzero vectors can be normalized to an orthonormal
set.
      
 1 0 0 
Example. 1. S = 0 , 1 , 0 is an orthonormal set.
0 0 1
 
          
 1 0 −1  0 −1
2. S =  1 , 1 , 0
     is not an orthogonal set since  1  ·  0 =
1 −1 1 −1 1
 
−1 ̸= 0.
     
 1 1 1 
3. S = 1 ,  0  , −2 is an orthogonal but not orthonormal set. It can
1 −1 1
 
be normalized to an orthonormal set
      
 1 1 1 1 
1 1
√ 1 , √  0  , √ −2 .
 3 2 −1 6
1 1

 √   √   
 1/√2 −1/√ 2 0 
4. S = 1/ 2 ,  1/ 2  , 0 is an orthonormal set.
0 0 1
 
      
 1 1 0 
5. S =  1 , −1 , 0 is an orthogonal set but it cannot be normalized to
   
0 0 0
 
an orthonormal set since it contains the zero vector.
2.9.7 Gram-Schmidt Process
Theorem (Gram-Schmidt Process). Let S = {u1 , u2 , ..., uk } be a linearly independent
set. Let

v1 = u1
 
v1 · u2
v2 = u2 − v1
∥v1 ∥2
   
v1 · u3 v2 · u3
v3 = u3 − v1 − v2
∥v1 ∥2 ∥v2 ∥2
..
.      
v1 · ui v2 · ui vi−1 · ui
vi = ui − v1 − v2 − · · · − vi−1
∥v1 ∥2 ∥v2 ∥2 ∥vi−1 ∥2
..
.      
v1 · uk v2 · uk vk−1 · uk
vk = uk − v1 − v2 − · · · − vk−1 .
∥v1 ∥2 ∥v2 ∥2 ∥vk−1 ∥2

Then {v1 , v2 , ..., vk } is an orthogonal set of nonzero vectors. Hence,


 
v1 v2 vk
w1 = , w2 = , ..., wk =
∥v1 ∥ ∥v2 ∥ ∥vk ∥

is an orthonormal set such that span{w1 , ..., wk } = span{u1 , ..., uk }.


      
 1 1 1 
Example. We will use the Gram-Schmidt process to convert S =  2 , 1 , 1
   
1 1 2
 
into an orthonormal set.

 
1
v1 =  2
1
       
1 1 1 1
1+2+1 2 = 1 −1 , let v2 = −1 instead
v2 =  1 −
12 + 22 + 12 3
1 1 1 1
       
1 1 1 −1
1+2+2 1−1+2 −1 = 1  0 
v3 = 1 − 2 −
12 + 22 + 12 12 + (−1)2 + 12 2
2 1 1 1
 
−1
let v3 =  0  instead.
1

Why are we allowed to take v2 and v3 to be a multiple of the original vector found?
      
 1 1 −1 
So √16 2 , √13 −1 , √12  0  is an orthonormal set.
1 1 1
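
Here is a small R sketch of the Gram-Schmidt process; the function name gram_schmidt and the implementation details are our own illustrative choices, not a course-provided function.

gram_schmidt <- function(U){
  #columns of U are the input vectors u1, ..., uk, assumed linearly independent
  V <- U
  for (i in seq_len(ncol(U))){
    v <- U[,i]
    for (j in seq_len(i-1)){
      #subtract the projection onto the j-th vector (already a unit vector)
      v <- v - as.numeric(crossprod(V[,j],U[,i]))*V[,j]
    }
    V[,i] <- v/sqrt(sum(v^2))   #normalize
  }
  V
}
> U <- cbind(c(1,2,1),c(1,1,1),c(1,1,2))
> W <- gram_schmidt(U)
> round(t(W)%*%W,10)    #should be the identity matrix, so the columns of W are orthonormal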
 
2.9.8 Least square solution
Let A be a m × n matrix and b ∈ Rm . A vector u ∈ Rn is a least square solution to
Ax = b if for every vector v ∈ Rn ,

∥Au − b∥ ≤ ∥Av − b∥.


 
Here, for a vector v = (v1 , v2 , ..., vn ), the norm ∥v∥ is defined to be

        ∥v∥ = √(v1² + v2² + · · · + vn²).

It is the generalization of the distance of a vector from the origin, using the Pythagoras
theorem.
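
One way (among several) to obtain a least square solution in R is the base function qr.solve, which solves an overdetermined system Ax = b in the least squares sense. The numbers below are made up for illustration only.

> A <- matrix(c(1,1,1,2,1,3),3,2,T); b <- c(1,2,2)
> u <- qr.solve(A,b)             #a least square solution
> sqrt(sum((A%*%u - b)^2))       #the minimized value of ||Au - b||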

2.9.9 Diagonalization
A square matrix A is said to be diagonalizable if there exists an invertible matrix P such
that P−1 AP = D is a diagonal matrix.

Remark. The statement above is equivalent to being able to express A as A = PDP−1


for some invertible P and diagonal matrix D.

Example. 1. Any square zero matrix is diagonalizable, 0 = I0I−1 .

2. Any diagonal matrix D is diagonalizable, D = IDI−1 .


     −1
3 1 −1 1 0 1 2 0 0 1 0 1
3. A = 1 3 −1 is diagonalizable, with A = 0 1 1 0 2 0 0 1 1 .
0 0 2 1 1 0 0 0 4 1 1 0

Suppose A = PDP−1 , for some matrix D and invertible matrix P. Then the charac-
teristic polynomial of A is

        det(xI − A) = det(xI − PDP−1 ) = det(PxP−1 − PDP−1 )
                    = det(P(xI − D)P−1 ) = det(P) det(xI − D) det(P−1 )          (2.3)
                    = det(P) det(P)−1 det(xI − D) = det(xI − D).

In other words, the characteristic polynomial of A is the same as that of D. Furthermore, if D
is a diagonal matrix, then the eigenvalues of A are exactly the diagonal entries of D, with
algebraic multiplicity equal to the number of diagonal entries taking that value. Hence, if
A is diagonalizable, A = PDP−1 , then the diagonal entries of the matrix D are exactly
the eigenvalues of A, each appearing as many times as its algebraic multiplicity. The invertible
matrix P is constructed from the eigenvectors of A.

Theorem. Let A be a square matrix of order n. A is diagonalizable if and only if there


exists a basis {u1 , u2 , ..., un } of Rn of eigenvectors of A.

The invertible matrix P that diagonalizes A has the form P = ( u1 u2 · · · un ),
where the ui are eigenvectors of A, and the diagonal matrix is D = diag(λ1 , λ2 , ..., λn ), where
λi is the eigenvalue associated to the eigenvector ui . In other words, the i-th column of the
invertible matrix P is an eigenvector of A whose eigenvalue is the i-th diagonal entry of D.

Theorem (Geometric Multiplicity is no greater than Algebraic multiplicity). The geo-


metric multiplicity of an eigenvalue λ of a square matrix A is no greater than the algebraic
multiplicity, that is,
1 ≤ dim(Eλ ) ≤ rλ .

Theorem. Suppose A is a square matrix such that its characteristic polynomial can be
written as a product of linear factors,

det(xI − A) = (x − λ1 )rλ1 (x − λ2 )rλ2 · · · (x − λk )rλk ,

where rλi is the algebraic multiplicity of λi , for i = 1, ..., k, and the eigenvalues are
distinct, λi ̸= λj for all i ̸= j. Then A is diagonalizable if and only if for each eigenvalue
of A, its geometric multiplicity is equal to its algebraic multiplicity,

dim(Eλi ) = rλi

for all eigenvalues λi of A.

In other words, there are two obstructions to a matrix A being diagonalizable.

(i) (Not enough eigenvalues) The characteristic polynomial of A does not split into (real)
linear factors.

(ii) (Not enough eigenvectors) There is an eigenvalue of A whose geometric multi-
plicity is strictly less than the algebraic multiplicity, dim(Eλi ) < rλi .

In either case, there will not be enough linearly independent eigenvectors to form a
basis for Rn .

Corollary. If A is a square matrix of order n with n distinct eigenvalues, then A is


diagonalizable.

Proof. If A has n distinct eigenvalues, then the algebraic multiplicity of each eigenvalue
must be 1. Thus
1 ≤ dim(Eλ ) ≤ rλ = 1 ⇒ dim(Eλ ) = 1 = rλ
for every eigenvalue λ of A. Therefore A is diagonalizable.

Algorithm to diagonalization
(i) Compute the characteristic polynomial of A

det(xI − A).

If it cannot be factorized into linear factors, then A is not diagonalizable.


(ii) Otherwise, write

det(xI − A) = (x − λ1 )rλ1 (x − λ2 )rλ2 · · · (x − λk )rλk ,

where rλi is the algebraic multiplicity of λi , for i = 1, ..., k, and the eigenvalues are
distinct, λi ̸= λj for all i ̸= j. For each eigenvalue λi of A, i = 1, ..., k, find a basis
for the eigenspace, that is, find the solution space of the following linear system,

(λi I − A)x = 0.

If there is an i such that dim(Eλi ) < rλi , that is, if the number of parameters in the
solution space of the above linear system is not equal to the algebraic multiplicity,
then A is not diagonalizable.

(iii) Otherwise, find a basis Sλi of the eigenspace Eλi for each eigenvalue λi , i = 1, ..., k.
Necessarily |Sλi | = rλi for all i = 1, ..., k. Let S = Sλ1 ∪ Sλ2 ∪ · · · ∪ Sλk . Then

        |S| = |Sλ1 | + |Sλ2 | + · · · + |Sλk | = rλ1 + rλ2 + · · · + rλk = n,

and S = {u1 , u2 , ..., un } is a basis for Rn .

(iv) Let
 
µ1 0 · · · 0
  0 µ2 · · · 0 
P = u1 u2 · · · un , and D = diag(µ1 , µ2 , ..., µn ) =  .. ..  ,
 
. .
. . .
0 0 · · · µn

where µi is the eigenvalue associated to ui , i = 1, ..., n, Aui = µi ui . Then

A = PDP−1 .
 
Example. 1. A = [ 1 1 0 ; 1 1 0 ; 0 0 2 ]. It has eigenvalues 0 and 2 with multiplicity r0 = 1
   and r2 = 2, respectively. Also, { (−1,1,0) } is a basis for E0 and { (1,1,0), (0,0,1) }
   is a basis for E2 . Then dim(E0 ) = 1 = r0 and dim(E2 ) = 2 = r2 . Hence, A is
   diagonalizable, with

                      [ −1 1 0 ] [ 0 0 0 ] [ −1 1 0 ]−1
        A = PDP−1 =   [  1 1 0 ] [ 0 2 0 ] [  1 1 0 ]   .
                      [  0 0 1 ] [ 0 0 2 ] [  0 0 1 ]
 
2. A = [ 1 1 1 ; 0 2 2 ; 0 0 3 ]. A is a triangular matrix, hence the diagonal entries 1, 2, 3 are
   the eigenvalues, each with algebraic multiplicity 1. Therefore A is diagonalizable.
   We will need to find a basis for each of the eigenspaces.

   λ = 1:  1I − A = [ 0 −1 −1 ; 0 −1 −2 ; 0 0 −2 ]  →  [ 0 1 0 ; 0 0 1 ; 0 0 0 ].
           So { (1,0,0) } is a basis for E1 .

   λ = 2:  2I − A = [ 1 −1 −1 ; 0 0 −2 ; 0 0 −1 ]  →  [ 1 −1 0 ; 0 0 1 ; 0 0 0 ].
           So { (1,1,0) } is a basis for E2 .

   λ = 3:  3I − A = [ 2 −1 −1 ; 0 1 −2 ; 0 0 0 ]  →  [ 2 0 −3 ; 0 1 −2 ; 0 0 0 ].
           So { (3,4,2) } is a basis for E3 .

            [ 1 1 3 ] [ 1 0 0 ] [ 1 1 3 ]−1
   ⇒  A =   [ 0 1 4 ] [ 0 2 0 ] [ 0 1 4 ]   .
            [ 0 0 2 ] [ 0 0 3 ] [ 0 0 2 ]
 
3. A = [ 1 2 ; 0 1 ]. λ = 1 is the only eigenvalue, with algebraic multiplicity r1 = 2.

   λ = 1:  1I − A = [ 0 −2 ; 0 0 ]. There is only one non-pivot column, hence
   dim(E1 ) = 1 < 2 = r1 . This shows that A is not diagonalizable.
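
In R, the base function eigen carries out this computation numerically; a short sketch for the matrix in example 1 (eigen may return a different, but equally valid, choice of eigenvectors):

> A <- matrix(c(1,1,0,1,1,0,0,0,2),3,3,T)
> e <- eigen(A)
> P <- e$vectors; D <- diag(e$values)
> P%*%D%*%solve(P)      #recovers A, up to rounding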

2.9.10 Orthogonal Diagonalization


Orthogonal Matrices. A square matrix A of order n is an orthogonal matrix if AT =
A−1 , equivalently, AT A = I = AAT .

Theorem. Let A be a square matrix of order n. The following statements are equivalent.

(i) A is orthogonal.

(ii) The columns of A forms an orthonormal basis for Rn .

(iii) The rows of A forms an orthonormal basis for Rn .

Proof. Write A = ( c1 c2 · · · cn ), where c1 , ..., cn are the columns of A, and let r1 , ..., rn
denote the rows of A. Then the (i, j)-entry of AT A is ciT cj = ci · cj , that is,

               [ c1 · c1   c1 · c2   · · ·   c1 · cn ]
        AT A = [ c2 · c1   c2 · c2   · · ·   c2 · cn ]
               [    :         :       .        :    ]
               [ cn · c1   cn · c2   · · ·   cn · cn ]

and, similarly, the (i, j)-entry of AAT is ri rjT = ri · rj , that is,

               [ r1 · r1   r1 · r2   · · ·   r1 · rn ]
        AAT  = [ r2 · r1   r2 · r2   · · ·   r2 · rn ]
               [    :         :       .        :    ]
               [ rn · r1   rn · r2   · · ·   rn · rn ]

So AT A = I if and only if ci · cj = 1 when i = j and ci · cj = 0 when i ̸= j, and AAT = I
if and only if ri · rj = 1 when i = j and ri · rj = 0 when i ̸= j.
Example. 1.
        [ 1/√2  −1/√2 ] [  1/√2  1/√2 ]   [ 1  0 ]
        [ 1/√2   1/√2 ] [ −1/√2  1/√2 ] = [ 0  1 ] .

2.
        [ 1/√3   1/√2   1/√6 ] [ 1/√3   1/√3    1/√3 ]   [ 1  0  0 ]
        [ 1/√3  −1/√2   1/√6 ] [ 1/√2  −1/√2      0  ] = [ 0  1  0 ] .
        [ 1/√3     0   −2/√6 ] [ 1/√6   1/√6   −2/√6 ]   [ 0  0  1 ]

3.
        [  1/√2  0  1/√2 ] [ 1/√2  0  −1/√2 ]   [ 1  0  0 ]
        [    0   1    0  ] [   0   1     0  ] = [ 0  1  0 ] .
        [ −1/√2  0  1/√2 ] [ 1/√2  0   1/√2 ]   [ 0  0  1 ]

4.
        [ −1/√2  0  1/√2 ] [ −1/√2  0  1/√2 ]   [ 1  0  0 ]
        [    0   1    0  ] [    0   1    0  ] = [ 0  1  0 ] .
        [  1/√2  0  1/√2 ] [  1/√2  0  1/√2 ]   [ 0  0  1 ]
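
These products can be checked quickly in R; for instance, for the first matrix (an illustrative check):

> P <- matrix(c(1/sqrt(2),-1/sqrt(2),1/sqrt(2),1/sqrt(2)),2,2,T)
> round(t(P)%*%P,10)    #the identity matrix, so P is orthogonal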

Orthogonally Diagonalizable. An order n square matrix A is orthogonally diagonalizable


if
A = PDPT
for some orthogonal matrix P and diagonal matrix D.
Example. 1.
        [  5 −1 −1 ]   [ 1/√3   1/√2   1/√6 ] [ 3 0 0 ] [ 1/√3   1/√3    1/√3 ]
        [ −1  5 −1 ] = [ 1/√3  −1/√2   1/√6 ] [ 0 6 0 ] [ 1/√2  −1/√2      0  ]
        [ −1 −1  5 ]   [ 1/√3     0   −2/√6 ] [ 0 0 6 ] [ 1/√6   1/√6   −2/√6 ]

2.
        [  3  0 −1 ]   [  1/√2  0  1/√2 ] [ 4 0 0 ] [  1/√2  0  −1/√2 ]
        [  0  2  0 ] = [    0   1    0  ] [ 0 2 0 ] [    0   1     0  ]
        [ −1  0  3 ]   [ −1/√2  0  1/√2 ] [ 0 0 2 ] [  1/√2  0   1/√2 ]
Suppose A = PDPT for some orthogonal matrix P and diagonal matrix D. Then

        AT = (PDPT )T = (PT )T DT PT = PDPT = A,

since D is diagonal, and hence symmetric. This shows that if A is orthogonally diago-
nalizable, then it is symmetric. The converse is also true, but the proof is beyond the scope
of this course.
Theorem. An order n square matrix is orthogonally diagonalizable if and only if it is
symmetric.
The algorithm to orthogonally diagonalize a matrix is the same as the usual diago-
nalization, except until the last step, instead of using a basis of eigenvectors to form the
matrix P, we have to turn it into an orthonormal basis of eigenvectors, that is, to use the
Gram-Schmidt process. However, we do not need to use the Gram-Schmidt process for
the whole basis, but only among those eigenvectors that belong to the same eigenspace.
This follows from the fact that the eigenspaces are already orthogonal to each other.
Theorem. If A is orthogonally diagonalizable, then the eigenspaces are orthogonal to
each other. That is, suppose λ1 and λ2 are distinct eigenvalues of a symmetric matrix A,
λ1 ̸= λ2 . Let Eλi denote the eigenspace associated to eigenvalue λi , for i = 1, 2. Then for
any v1 ∈ Eλ1 and v2 ∈ Eλ2 , v1 · v2 = 0.
 
Example. A = [ 5 −1 −1 ; −1 5 −1 ; −1 −1 5 ].

                      | x−5    1     1  |
        det(xI − A) = |  1    x−5    1  | = (x − 3)(x − 6)² .
                      |  1     1    x−5 |

A has eigenvalues λ = 3, 6 with algebraic multiplicities r3 = 1, r6 = 2. Let us now compute
the eigenspaces.

λ = 3:  [ −2 1 1 ; 1 −2 1 ; 1 1 −2 ]  →  [ 1 0 −1 ; 0 1 −1 ; 0 0 0 ]
        ⇒  v1 = (1,1,1) is a basis for E3 .

λ = 6:  [ 1 1 1 ; 1 1 1 ; 1 1 1 ]  →  [ 1 1 1 ; 0 0 0 ; 0 0 0 ]
        ⇒  v2 = (−1,1,0), v3 = (−1,0,1) is a basis for E6 .

Observe that v1 is orthogonal to v2 and v3 , but v2 and v3 are not orthogonal to each
other. So we need only to perform the Gram-Schmidt process on {v2 , v3 }.
Algorithm to orthogonal diagonalization
Follow step (i) to (iii) in algorithm to diagonalization.

(iv) Apply the Gram-Schmidt process to each basis Sλi of the eigenspace Eλi to obtain an
     orthonormal basis Tλi . Let T = Tλ1 ∪ Tλ2 ∪ · · · ∪ Tλk ; it is an orthonormal set. Similarly,
     we have |Tλi | = rλi , and so

        |T | = |Tλ1 | + |Tλ2 | + · · · + |Tλk | = rλ1 + rλ2 + · · · + rλk = n,

     which shows that T = {u1 , u2 , ..., un } is an orthonormal basis for Rn .

(v) Let
 
µ1 0 · · · 0
  0 µ2 · · · 0 
P = u1 u2 · · · un , and D = diag(µ1 , µ2 , ..., µn ) =  .. ..  ,
 
. .
. . .
0 0 · · · µn

where µi is the eigenvalue associated to ui , i = 1, ..., n, Aui = µi ui . Then P is an


orthogonal matrix, and
A = PDPT .
 
Example. 1. Let A = [ 5 −1 −1 ; −1 5 −1 ; −1 −1 5 ]. A is symmetric, hence it is orthogonally
   diagonalizable. We have found the eigenvalues and eigenvectors above. We will
   now perform the Gram-Schmidt process on the basis for E6 .

        u1 = (−1, 0, 1),

        u2 = (−1, 1, 0) − (1/2)(−1, 0, 1) = −(1/2)(1, −2, 1).

   Indeed, { (1,1,1), (−1,0,1), (1,−2,1) } is an orthogonal basis. Normalizing, we get an
   orthonormal basis

        { (1/√3)(1,1,1), (1/√2)(−1,0,1), (1/√6)(1,−2,1) }.

   So
        [ 1/√3  −1/√2    1/√6 ] [ 3 0 0 ] [ 1/√3  −1/√2    1/√6 ]T
   A =  [ 1/√3     0    −2/√6 ] [ 0 6 0 ] [ 1/√3     0    −2/√6 ]  .
        [ 1/√3   1/√2    1/√6 ] [ 0 0 6 ] [ 1/√3   1/√2    1/√6 ]
 
3 0 −1
2. Let A =  0 2 0 . A is symmetric, thus orthogonally diagonalizable.
−1 0 3

x−3 0 1
det(xI − A) = 0 x−2 0 = (x − 4)(x − 2)2 .
1 0 x−3
      
1 0 1 1 0 1  −1 
λ = 4: 4I − A = 0 2 0 −→ 0
   1 0 ⇒  0  is a basis for E4

1 0 1 0 0 0 1
 
        
−1 0 1 1 0 −1  0 1 
λ = 2: 2I − A =  0 0 0  −→ 0 0 0  ⇒ 1 , 0 is a basis for
1 0 −1 0 0 0 0 1
 
E2 .

Observe that in this case, the basis for E2 is also orthogonal. Hence, there is no
need to performance Gram-Schmidt process. Thus, we have
 √ √   √ √ 
−1/ 2 0 1/ 2 4 0 0 −1/ 2 0 1/ 2
A =  0√ 1 0√  0 2 0  0√ 1 0√  .
1/ 2 0 1/ 2 0 0 2 1/ 2 0 1/ 2
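
For a symmetric matrix, the R function eigen returns eigenvectors that are already orthonormal, so an orthogonal diagonalization can be checked numerically. A sketch for example 2 (the ordering of the eigenvalues may differ from the hand computation):

> A <- matrix(c(3,0,-1,0,2,0,-1,0,3),3,3,T)
> e <- eigen(A)
> P <- e$vectors
> round(t(P)%*%P,10)                      #identity, so P is orthogonal
> round(P%*%diag(e$values)%*%t(P),10)     #recovers A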
2.10 Exercises
 
  1 2 1
1 0 1
1. Let A = and B = 2 1 0 .
1 2 3
1 1 −1

(a) Define A in R, filling in the entries by its rows.

(b) Define B in R, filling in the entries by its columns.
(c) Use R to determine the size of AB.
(d) Compute BA.
(e) Compute BAT .

2. Let      
1 1 1 1 0 1
A= , B= , C= .
1 1 −1 −1 0 0

(a) Compute AB. Is A or B a zero matrix?


(b) Compute C2 = CC.
(c) Is BC = CB?

3. (a) Use the seq and rep function to define the following vectors.
(i) v1 = (2, 3, 4).
(ii) v2 = (1, 3, 5).
(iii) v3 = (1, 1, 1).
(b) Convert the vectors v1 , v2 , v3 defined in (a) to column vectors, then

(i) define A = v1 v2 v3 ,
(ii) compute v1T v2 and v2 v3T .
(c) Normalize v3 defined in (a) to a unit vector.

4. Solve the following linear systems.

(a) 
 3x1 + 2x2 − 4x3 = 3
2x1 + 3x2 + 3x3 = 15
5x1 − 3x2 + x3 = 14

(b) 

 2x2 + x3 + 2x4 − x5 = 4
x2 + x4 − x5 = 3


 4x1 + 6x2 + x3 + 4x4 − 3x5 = 8
2x1 + 2x2 + x4 − x5 = 2

(c) 
 x − 4y + 2z = −2
x + 2y − 2z = −3
x − y = 4

   
2 −1 a b
5. Let A = . Find all the matrices B = such that AB = BA.
2 1 c d
   
2 1 1 2 3 4 1
6. (a) Solve the matrix equation 0 1 2 X = 1
   0 3 7.
1 3 2 2 1 1 2
   
2 1 1 1
(b) Using the answer in (a), without the aid of R, solve 0 1 2 x = −1.
  
1 3 2 −1
(Hint: look at the columns of the matrix on the right in (a).)

7. Let      
1 0 2 3 3 2
A= , B= , C= ,
−1 1 −1 −2 −1 2
Compute A−1 , B−1 , C−1 . Using your answers for A−1 , B−1 , C−1 , find (ABC)−1 .

8. Let      
1 0 2 3 3 2
A= , B= , C= ,
−1 1 −1 −2 −1 2

(a) Compute A−1 , B−1 , C−1 . Using your answers for A−1 , B−1 , C−1 , find (ABC)−1 .
(b) Find det(A), det(B), and det(C). Use your answers for det(A), det(B), and
det(C) to find det(ABC).
(c) Use your answers in (b) to find det((ABC)−1 ).
   
0 1 1 0 6
1 −1 1 −1 3
9. Let A = 1
 and b =  .
−1
0 1 0
1 1 1 1 1
(a) Is the linear system Ax = b consistent?
(b) Find a least squares solution to the system. Is the solution unique? Why?

10. A line
p(x) = a1 x + a0
is said to be the least squares approximating line for a given a set of data points
(x1 , y1 ), (x2 , y2 ), ..., (xm , ym ) if the sum

S = [y1 − p(x1 )]2 + [y2 − p(x2 )]2 + · · · + [ym − p(xm )]2

is minimized. Writing
       
x1 y1 p(x1 ) a1 x 1 + a0
 x2   y2   p(x2 )   a1 x2 + a0 
x =  ..  , y =  ..  , and p(x) =  ..  = 
       
.. 
 .   .   .   . 
xm ym p(xm ) a1 x m + a0

the problem is now rephrased as finding a0 , a1 such that

S = ||y − p(x)||2
is minimized. Observe that if we let
 
1 x1
 1 x2   
a0
N =  .. ..  and a = ,
 
. .  a1
1 xm

then Na = p(x). And so our aim is to find a that minimizes ||y − Na||2 .

It is known that the equation describing the resistance of a cylindrically shaped conductor
(a wire) at 20°C is given by

        R = ρ L / A,

where R is the resistance measured in Ohms Ω, L is the length of the material in
meters m, A is the cross-sectional area of the material in meters squared m², and ρ
is the resistivity of the material in Ohm meters Ωm. A student wants to measure
the resistivity of a certain material. Keeping the cross-sectional area constant at
0.002 m², he connected the power sources along the material at various lengths and
measured the resistance, obtaining the following data.

        L    0.01          0.012         0.015         0.02
        R    2.75 × 10−4   3.31 × 10−4   3.92 × 10−4   4.95 × 10−4

It is known that the Ohm meter might not be calibrated. Taking that into account,
the student wants to find a linear graph R = (ρ/0.002)L + R0 from the data obtained to
compute the resistivity of the material.

(a) Relabeling, we let R = y, ρ/0.002 = a1 and R0 = a0 . Is it possible to find a graph
y = a1 x + a0 satisfying the points?
(b) Find the least square approximating line for the data points and hence find the
resistivity of the material. Would this material make a good wire?

11. Suppose the equation governing the relation between data pairs is not known. We
may want to then find a polynomial

p(x) = a0 + a1 x + a2 x2 + · · · + an xn

of degree n, n ≤ m − 1, that best approximates the data pairs (x1 , y1 ), (x2 , y2 ), ...,
(xm , ym ). A least square approximating polynomial of degree n is such that

||y − p(x)||2

is minimized. If we write
       
x1 y1 1 x1 x21 · · · xn1 a0
 x2   y2   1 x2 x2 n
· · · x2   a1 
 
2
x =  ..  , y =  ..  , N =  .. .. and a = . ,
    
.. .. .
. ..   .. 
 
 .   .  . . .
xm ym 1 xm x2m · · · xnm an
then p(x) = Na, and the task is to find a such that ||y − Na||2 is minimized.

We shall now find a quartic polynomial

p(x) = a0 + a1 x + a2 x2 + a3 x3 + a4 x4

that is a least square approximating polynomial for the following data points

x 4 4.5 5 5.5 6 6.5 7 8 8.5


y 0.8651 0.4828 2.590 -4.389 -7.858 3.103 7.456 0.0965 4.326

To define the matrix N, we need to install the package matrixcalc.


> install.packages("matrixcalc")
> library(matrixcalc)
The function is vandermonde.matrix(x, n+1), where

• x is the vector representing the x-values of the data point.


• n is the degree of the polynomial.

Enter the data points.


x <- c(4,4.5,5,5.5,6,6.5,7,8,8.5)
y <- matrix(c(0.8651,0.4828,2.590,-4.389,-7.858,3.103,7.456,0.0965,4.326))
Now define the matrix N
> N <- vandermonde.matrix(x,5)

Find a least square solution of Na = y.

12. Let A, B, and C be the matrices defined in question 7.

(a) Find det(A), det(B), and det(C). Use your answers for det(A), det(B), and
det(C) to find det(ABC).
(b) Use your answers in (a) to find det((ABC)−1 ).

13. (Cramer’s Rule)

(a) Compute the determinant of the following matrices.


 
1 5 3
(i) A = 0 2 −2
0 1 3
 
1 5 3
(ii) A1 = 2 2 −2
0 1 3
 
1 1 3
(iii) A2 = 0 2 −2
0 0 3
 
1 5 1
(iv) A3 = 0 2 2
0 1 0
 
(b) Solve the matrix equation Ax = b, where b = (1, 2, 0)T .

(c) Compute (1/det(A)) · ( det(A1 ), det(A2 ), det(A3 ) )T . How is this related to the
    answer in (b)?

Observe that the matrix Ak is obtained by replacing the k-th column of A by b.
Cramer's rule states that if A is an invertible matrix of order n and Ak is the matrix
obtained from A by replacing the k-th column of A by b, then the matrix equation
Ax = b has a unique solution

        x = (1/det(A)) · ( det(A1 ), det(A2 ), ..., det(An ) )T .

14. A square matrix P = (pij ) of order n is a stochastic matrix, or a Markov matrix if


the sum of each column vector is equal to 1,

p1j + p2j + · · · + pnj = 1

for every j = 1, ..., n. Let  


0.2 0.8 0.4
P = 0.3 0.2 0.4 .
0.5 0 0.2
(a) Check that P is a stochastic matrix.
(b) Compute the determinant of I − P. Is it invertible?
(c) Solve the homogeneous system (I − P)x = 0.

15. For each of the matrices A,

(i) find its characteristic polynomial,


(ii) find all the eigenvalues,
(iii) for each of the eigenvalues, find an associated eigenvector.

Determine if A is invertible.
 
1 −3 3
(a) A = 3
 −5 3.
6 −6 4
 
9 8 6 3
0 −1 3 −4
(b) A = 
0

0 3 0
0 0 0 2
16. Let  
9 −1 −6
A = 0 8 0 .
0 0 −3

(a) To compute powers of a matrix in R, we need the package expm.


> install.packages("expm")
> library(expm)
Then the n-th power of A can be computed via A%^%n. Compute A3 using R.
(b) Express A as PDP−1 for some invertible matrix P and diagonal matrix D.
(c) Compute PD3 P−1 . Compare with the answer you obtained in (a). Explain
your observation.

17. Diagonalize  
1 −3 0 3
 3 7 0 −3
A=
−18 −23 9 13  .

3 3 0 1
Hence, find a matrix B such that B2 = A.

18. Orthogonally diagonalize the following matrices A; that is, express A as

A = PDPT .

Make sure to check that P is indeed orthogonal.


 
3 1
(a) A = .
1 3
 
2 2 −2
(b) A =  2 −1 4 .
−2 4 −1
 
1 −2 0 0
−2 1 0 0
(c) A =  .
0 0 1 −2
0 0 −2 1
19. Is  
1 0 1
A= 0 2 −1
−1 −1 3
orthogonally diagonalizable?
Chapter 3

Functions and Calculus

3.1 Functions of several variables


Informally, a function (or mapping) takes either numbers or vectors, and returns numbers
or vectors. We shall use the notation F : D ⊆ Rn → Rm to denote a function defined on
a domain D, returning vectors in Rm . That is, for every vector v ∈ D, F(v) is a vector
in Rm . D is called the domain of the function F. When the domain D of a function F is
not specified, we will assume that the domain is the largest set for which the definition
makes sense. We will say that F is a function from Rn to Rm . If n = 1, the function is
said to be of a single variable, and it is said to be multivariable if n > 1. It is called a
real-valued function if m = 1, and it is called vector-valued if m > 1. The range of F
is a subset R ⊆ Rm containing all vectors that can be mapped by F from some vectors
v ∈ D, that is, R = F(D) = { u ∈ Rm : F(v) = u for some v ∈ D }. Finally, Rm , the
ambient space that contains the range is called the codomain of F.

Example. Let F be a function from R3 to R2 defined by

        F((x, y, z)) = ( (sin(x) + cos(y))/z , √(x + y + z) ).

Then the domain of F is D = { (x, y, z) : x + y + z ≥ 0, z ̸= 0 }. The range of F is

        R = { (x, y) : x ∈ R, y ≥ 0 }.
We usually use bolded upper case letters F to denote vector-valued functions, and
unbolded lower case letters f for real-valued functions. We may express a vector-valued
function F : D ⊆ Rn → Rm in terms of its components,

        F(x) = ( f1 (x), f2 (x), ..., fm (x) ),

where for all i = 1, ..., m, fi is a real-valued function and is called the i-th component function
of F. If x = (x1 , x2 , ..., xn ), we may express the component functions as fi (x1 , x2 , ..., xn ),
for all i = 1, ..., m.

Example. 1. Using the previous example, f1 (x, y, z) = (sin(x) + cos(y))/z and
f2 (x, y, z) = √(x + y + z).

2. A F : D ⊆ Rn → Rm is a constant function if it returns the same vector v ∈ Rm


for every u ∈ D. That is, if we write v = (vi ), then fi (x1 , ..., xn ) = vi for all
i = 1, ..., m.

3.1.1 Linear Mappings


A mapping T : Rn → Rm is said to be linear if for every vectors u, v ∈ Rn and scalar
α, β ∈ R,
T(αu + βv) = αT(u) + βT(v).
In words, it means that the map of linear combinations is the linear combination of the
maps.

Example. Let T : R2 → R3 , T(x, y) = (2x − 3y, x, 5y). Then for any (x1 , y1 ), (x2 , y2 ) ∈ R2
and α, β ∈ R,

        T(α(x1 , y1 ) + β(x2 , y2 )) = T((αx1 + βx2 , αy1 + βy2 ))
                = (2(αx1 + βx2 ) − 3(αy1 + βy2 ), αx1 + βx2 , 5(αy1 + βy2 ))
                = (α(2x1 − 3y1 ) + β(2x2 − 3y2 ), αx1 + βx2 , α5y1 + β5y2 )
                = α(2x1 − 3y1 , x1 , 5y1 ) + β(2x2 − 3y2 , x2 , 5y2 )
                = αT(x1 , y1 ) + βT(x2 , y2 ).

Hence, T is a linear mapping.

The following theorem allows us to check if a mapping is linear.

Theorem (Properties of Linear Mapping). Let T : Rn → Rm be a linear mapping. Then

(i) T(0) = 0,

(ii) for any α ∈ R and u ∈ Rn , T(αu) = αT(u),

(iii) for any u, v ∈ Rn , T(u + v) = T(u) + T(v),

(iv) for any c1 , c2 , ..., ck ∈ R and u1 , u2 , ..., uk ∈ Rn , T(c1 u1 + c2 u2 + · · · + ck uk ) =


c1 T(u1 ) + c2 T(u2 ) + · · · + ck T(uk ).

Proof. Left as exercise.


This means that if a mapping F : D ⊆ Rn → Rm fails any of the properties above,
then it cannot be linear.
 
Example. 1. T : R2 → R3 , T(x, y) = (x, y, 1) is not a linear mapping since

        T(0, 0) = (0, 0, 1) ̸= 0.

2. T : R2 → R, T(x, y) = xy is not a linear mapping since

        T(1, 1) = 1,   but   T(2(1, 1)) = T(2, 2) = 4 ̸= 2 = 2T(1, 1).

3. T : R2 → R2 , T(x, y) = ( ∛(x³ + y³) , 0 ) is not a linear mapping since

        T((1, 0) + (0, 1)) = T(1, 1) = ( ∛2 , 0 )
                           ̸= (1, 0) + (1, 0) = T(1, 0) + T(0, 1).
Theorem (Standard Matrix of Linear Mapping). A mapping T : Rn → Rm is linear if
and only if it can be written as a matrix multiplication,
T(u) = Au
for all u ∈ Rn , for some m × n matrix A. The matrix A is called the standard matrix,
or matrix representation of T.
Proof. Given any x = (xi ) ∈ Rn , we can write it as x = x1 e1 + x2 e2 + · · · + xn en , where ei
is the vector that takes 0 in all its coordinates except the i-th coordinate, where it takes
1, for i = 1, ..., n. Then by linearity,
T(x) = T(x1 e1 + x2 e2 + · · · + xn en )
= x1 T(e1 ) + x2 T(e2 ) + · · · + x2 T(en )
 
x1
  x2 

= T(e ) T(e ) · · · T(e )

1 2 n  .. 
.
xn
= Ax
where the third equality follows from expressing a matrix equation as a vector
 equation,
and the last equation follows from letting A = T(e1 ) T(e2 ) · · · T(en ) , that is, the
i-th column of A is T(ei ), the map of ei under the mapping T.
The proof tells us how to construct the matrix representation A of T,

A = T(e1 ) T(e2 ) · · · T(en ) ,

Example. 1. T : R4 → R3 ,

        T(x1 , x2 , x3 , x4 ) = ( 2x1 − 3x2 + x3 − 5x4 , 4x1 + x2 − 2x3 + x4 , 5x1 − x2 + 4x3 )
                              = x1 (2, 4, 5) + x2 (−3, 1, −1) + x3 (1, −2, 4) + x4 (−5, 1, 0)

                                [ 2  −3   1  −5 ] [ x1 ]
                              = [ 4   1  −2   1 ] [ x2 ]
                                [ 5  −1   4   0 ] [ x3 ]
                                                  [ x4 ]

   The standard matrix of T is

                                                  [ 2  −3   1  −5 ]
        A = ( T(e1 ) T(e2 ) T(e3 ) T(e4 ) )   =   [ 4   1  −2   1 ] .
                                                  [ 5  −1   4   0 ]

Indeed, T(u) = Au for all u ∈ R4 .

2. Zero mappings T : Rn → Rm , T(u) = 0 for all u ∈ Rn . The standard matrix is


 
AT = T(e1 ) T(e2 ) · · · T(en ) = 0 0 · · · 0 = 0(m,n) ,

the zero matrix.

Conversely, the linear mapping defined by the zero matrix 0(m,n) is the zero mapping,

T0 (u) = 0u = 0 ∈ Rm ,

for all u ∈ Rn .

3. Identity mapping: T : Rn → Rn , T(u) = u for all u ∈ Rn . The standard matrix is


 
A = T(e1 ) T(e2 ) · · · T(en ) = e1 e2 · · · en = In ,

the identity matrix.

Conversely, the linear mapping defined by the identity matrix In is the identity
mapping,
TI (u) = Iu = u,
for all u ∈ Rn .
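
In R, applying a linear mapping is just matrix multiplication by its standard matrix. For instance, for the mapping T in example 1 above (an illustrative check; the vector u is arbitrary):

> A <- matrix(c(2,-3,1,-5,4,1,-2,1,5,-1,4,0),3,4,T)
> u <- c(1,0,2,-1)
> A%*%u        #the same as evaluating T(u) directly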
A linear functional is a linear mapping L : Rn → R, that is, the codomain is the real
numbers (m = 1 in the definition for linear mappings).
Theorem. Every linear functional L : Rn → R can be written as

        L(x) = aT x = a1 x1 + a2 x2 + · · · + an xn

for a (column) vector a = (a1 , a2 , ..., an ) ∈ Rn .

3.1.2 Quadratic Forms


A function Q : Rn → R is called a (real) quadratic form if it has the following expression

        Q(x1 , x2 , ..., xn ) = Σ_{i≤j} qij xi xj
                              = q11 x1² + q12 x1 x2 + · · · + q1n x1 xn
                                + q22 x2² + q23 x2 x3 + · · · + q2n x2 xn
                                + · · · · · · + qnn xn² ,

where qij ∈ R are real numbers.


Example. 1. Q(x, y) = x2 + 4xy − y 2
2. Q(x, y, z) = x2 + 2xy + 2xz + 2yz + z 2
3. Q(t, x, y, z) = t2 − x2 − y 2 − z 2
Theorem (Quadratic form and Symmetric Matrix). Any real quadratic form Q : Rn → R
can be expressed as
Q(x) = xT Ax
for every column vector x ∈ Rn , for some symmetric matrix A. Conversely, any real
symmetric matrix A defines a quadratic form given by the expression above.
P
Proof. Let Q(x1 , x2 , ..., xn ) = Σ_{i≤j} qij xi xj be a quadratic form. Define

        aij = (1/2) qij   if i < j,
        aij = qii         if i = j,
        aij = (1/2) qji   if i > j.

Define A = (aij ). Then one can check that

                                                [ q11     q12 /2   · · ·   q1n /2 ] [ x1 ]
        Q(x1 , x2 , ..., xn ) = ( x1 x2 · · · xn ) [ q12 /2  q22      · · ·   q2n /2 ] [ x2 ]  = xT Ax.
                                                [   :        :        .       :   ] [  : ]
                                                [ q1n /2  q2n /2   · · ·   qnn    ] [ xn ]
It is clear that if A is a symmetric matrix, then Q(x1 , ..., xn ) = xT Ax is a quadratic
form.
Example. 1. Let Q(x, y) = x² − xy + y² . Then

        Q(x, y) = ( x y ) [ 1 −1/2 ; −1/2 1 ] ( x y )T .

2. More generally, for any real numbers a, b, c ∈ R,

        Q(x, y) = ax² + bxy + cy² = ( x y ) [ a b/2 ; b/2 c ] ( x y )T .

3. Let Q(x, y, z) = x² + 2xy + 2xz + 2yz + z² . Then

        Q(x, y, z) = ( x y z ) [ 1 1 1 ; 1 0 1 ; 1 1 1 ] ( x y z )T .

4. Let Q(t, x, y, z) = t² − x² − y² − z² . Then

        Q(t, x, y, z) = ( t x y z ) [ 1 0 0 0 ; 0 −1 0 0 ; 0 0 −1 0 ; 0 0 0 −1 ] ( t x y z )T .
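
In R, a quadratic form can be evaluated directly from its symmetric matrix; a small sketch for example 1 (the helper Q below is our own illustrative definition):

> A <- matrix(c(1,-1/2,-1/2,1),2,2,T)
> Q <- function(v) as.numeric(t(v)%*%A%*%v)
> Q(c(2,3))      #2^2 - 2*3 + 3^2 = 7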
Theorem (Orthogonally Diagonalize a Quadratic Form). Every quadratic form Q :
Rn → R can be expressed as

        Q(x1 , x2 , ..., xn ) = λ1 y1² + λ2 y2² + · · · + λn yn²
                              = ( y1 y2 · · · yn ) diag(λ1 , λ2 , ..., λn ) ( y1 y2 · · · yn )T

for some linear mapping y = (y1 , y2 , ..., yn ) : Rn → Rn . The linear mapping is given by y = PT x
for some orthogonal matrix P.
Proof. Since A is symmetric, it is orthogonally diagonalizable. Hence, we can find
λ1 , λ2 , ..., λn and an orthogonal matrix P such that PT AP = diag(λ1 , λ2 , ..., λn ). Hence,
if we let y = PT x, then x = Py, and

        Q(x) = Q(Py) = (Py)T A(Py)
             = yT PT APy
             = yT diag(λ1 , λ2 , ..., λn ) y
             = λ1 y1² + λ2 y2² + · · · + λn yn² .
 
Example. 1. Let Q(x, y) = x² − xy + y² . The symmetric matrix [ 1 −1/2 ; −1/2 1 ] can
   be orthogonally diagonalized as such

        [ 1/√2  −1/√2 ]T [  1    −1/2 ] [ 1/√2  −1/√2 ]   [ 1/2    0  ]
        [ 1/√2   1/√2 ]  [ −1/2    1  ] [ 1/√2   1/√2 ] = [  0    3/2 ] .

   Hence, if we let

        ( x′ , y′ ) = [ 1/√2 −1/√2 ; 1/√2 1/√2 ]T ( x , y ) = ( (x + y)/√2 , (−x + y)/√2 ),

   then

        Q(x, y) = (1/2) x′² + (3/2) y′² = (1/4)(x + y)² + (3/4)(y − x)² .

2. Let Q(x, y, z) = 2xy + 2xz + 2yz, then

        Q(x, y, z) = ( x y z ) [ 0 1 1 ; 1 0 1 ; 1 1 0 ] ( x y z )T .

   The matrix [ 0 1 1 ; 1 0 1 ; 1 1 0 ] can be orthogonally diagonalized as such

        [  1/√2   1/√6   1/√3 ]T [ 0 1 1 ] [  1/√2   1/√6   1/√3 ]   [ −1   0   0 ]
        [ −1/√2   1/√6   1/√3 ]  [ 1 0 1 ] [ −1/√2   1/√6   1/√3 ] = [  0  −1   0 ] .
        [    0   −2/√6   1/√3 ]  [ 1 1 0 ] [    0   −2/√6   1/√3 ]   [  0   0   2 ]

   So letting

        ( x′ , y′ , z′ ) = PT ( x , y , z ) = ( (x − y)/√2 , (x + y − 2z)/√6 , (x + y + z)/√3 ),

   we have

        Q(x, y, z) = −x′² − y′² + 2z′²
                   = −(1/2)(x − y)² − (1/6)(x + y − 2z)² + (2/3)(x + y + z)² .
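
The coefficients λi in the diagonalized form are just the eigenvalues of the symmetric matrix, so they can be found numerically in R with eigen (a sketch for example 2):

> A <- matrix(c(0,1,1,1,0,1,1,1,0),3,3,T)
> eigen(A)$values      #the eigenvalues 2, -1, -1 (up to rounding)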
Remark. A quadratic form is an example of a bilinear form,
X
B(x, y) = xT Ay = xi aij yj .
i,j

The discussion on bilinear forms is beyond the scope of this course.

3.2 Functions in R
3.2.1 Creating Functions in R
In R everything we do involves a function, explicitly or implicitly. They are the funda-
mental building block of R. We can either use a primitive function, or create our own
functions. For the purpose of this course, a R function has two parts, the body and the
formals (or arguments). The body contains the code inside the function, and the formals
contains the list of all the arguments that control how you call the function. The primi-
tive functions call C code directly with .Primitive(), it contains no R code in its body.
Type names(methods:::.BasicFunsList) to get a list of all primitive function in R.

In mathematics, we normally use alphabets with superscripts or subscripts to denote


functions. However, it may get very confusing as more functions are created, and it
does not take into account units (dimensions). Hence, it is a good practice to name the
function by what it does, as well as the units if necessary. It is also a good practice to
define the functions in the Rscript and run it later. For example, we want a function
that converts inch to centimeter.
inch_to_cm <- function(x){
x*2.54
}
Here, the name of the function is inch_to_cm, the formal is just x, and the body is x*2.54
> formals(inch_to_cm)
$x

> body(inch_to_cm)


{
x * 2.54
}

In the next example, we want find the total number of seconds after some certain
number of hours, minutes, and seconds have past.
no_of_sec <- function(h,m,s){
#h is the number of hours
#m is the number of minutes
#s is the number of seconds
s + 60*(m+60*h)
}
We used comments in this case to remind us of what the arguments represent. So, for
example, 2 hours 30 minutes and 15 seconds is 9015 seconds,
> no_of_sec(2,30,15)
[1] 9015

In R functions, any argument can be given a default value. The function will take this
value for that argument if no input is given. For example,
no_of_sec <- function(s,m,h,d=0){
#d is the number of days, default is 0
#h is the number of hours
#m is the number of minutes
#s is the number of seconds
s + 60*(m+60*(h+24*d))
}
> no_of_sec(15,30,2)
[1] 9015
> no_of_sec(0,0,0,1)
[1] 86400
Note that it is a good habit to put arguments with default values at the back since the
position of the argument matters. For example,
test <- function(x=10,y){
x*y
}
> test(2)
Error in test(2) : argument "y" is missing, with no default
> test(,2)
[1] 20
vis-á-vis
test <- function(x,y=10){
x*y
}
> test(2)
[1] 20

To summarize, here's a practical example: computing the amount of radioactive sub-
stance left after some time.
halflife <- function(N_in,lambda,t){
#N_in is the initial amount of radioactive substance in mole
#lambda is the decay constant
#t is the amount of time that has passed, in seconds
N_in*exp(-lambda*t)
}

Exercise. Using the function no_of_sec defined above, what happens if one of the ar-
guments is a vector? For example
> no_of_sec(c(1,2,3),0,0)

What happens if two or more of the arguments are vectors of different lengths? For
example
> no_of_sec(c(1,2,3),c(0,1),0)
and
> no_of_sec(seq(4),c(0,1),0)

Explain your results.


An alternative to define a matrix by its entry is to use the function outer.
Example. 1. A = (aij )2×3 , aij = i + j
> i <- seq(2); j <- seq(3); a <- function(i,j){i+j}
> A <- outer(i,j,a)
> A
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
2. B = (bij )3×2 , bij = (−1)i+j ,
> i <- seq(3); j<- seq(2); b <- function(i,j){(-1)^{i+j}}
> B <- outer(i,j,b)
> B
[,1] [,2]
[1,] 1 -1
[2,] -1 1
[3,] 1 -1

3.2.2 Symbolic Functions in R


In the previous section we have been creating functions that produce a numeric
value whenever you input some numbers as arguments. However, we may want to define
a function symbolically, that is, we may want to define f (x) = x² without having to
specify a priori what x is.
> f <- x^2
Error: object ‘x’ not found

We can create symbolic functions in R using the expression function.


> f <- expression(x^2)
> f
expression(x^2)

To evaluate the expression for a specific number, use the eval function.
> x <- 3
> eval(f)
[1] 9

In this case, we may also evaluate a function on a vector


> x <- seq(10)
> eval(f)
[1] 1 4 9 16 25 36 49 64 81 100

The function in the expression can take any number of arguments


g <- expression(2*x^2*y)

In this case, we need to define x and y before we evaluate the function


> x <- seq(10); y <- rep(1,10)
> eval(g)
[1] 2 8 18 32 50 72 98 128 162 200

Exercise. Let f1 (x) = x2 and f2 (x) = 2x,


f1 <- expression(x^2); f2 <- expression(2*x)
How do we evaluate f1 (x) + f2 (x)?
> x <- 1
> eval(f1+f2)
Error in f1 + f2 : non-numeric argument to binary operator
3.2.3 Solving equations in R
We can solve equations in R using the function findZeros from the library mosaic. First
install the package.
> install.packages("mosaic")
> library(mosaic)

The function findZeros takes a function as its argument and returns zeros of the
function (that is, the points where the function is 0).

Example. 1. Solve 2x2 − 1 = 0.


> f <- function(x) 2*x^2-1
> findZeros(f(x)∼x)
x
1 -0.7071
2 0.7071

2. If the function has many zeros, findZeros will display some of the zeros.
> findZeros(sin(pi*x)∼x)
x
1 -4
2 -3
3 -2
4 -1
5 0
6 1
7 2
8 3
9 4
To find the zeros of a function within a confined domain, say within the interval
(a, b), we include the argument xlim = range(a,b).
> findZeros(sin(pi*x)∼x,xlim = range(-2,2))
x
1 -1
2 0
3 1

We may use findZeros to find the intersection of two functions f (x) and g(x), since

f (x) = g(x) ⇔ f (x) − g(x) = 0.

Example. 1. Find the intersection of x3 and 3x2 − 1.


> f <- function(x) 3*x^2-1
> g <- function(x) x^3
> findZeros(f(x)-g(x)∼x)
x
1 -1.8794
2 0.3473
3 1.5321
2. Solve for sin(x) = cos(x), for x ∈ (0, 2π).
> findZeros(sin(x)-cos(x)∼x,xlim = range(0,2*pi))
x
1 -2.3562
2 0.7854
3 3.9270
4 7.0686

3. Solve for x2 = −1
> findZeros(x^2+1∼x)
numeric(0)
Warning message:
In findZeros(x^2 + 1 ∼ x) :
No zeros found. You might try modifying your search window or increasing
npts.

Exercise. Find the values of a such that


 
2 2a + 1 3
A = a + 2 2a + 2 2a + 2
0 1 1

is singular.
3.3 Graphs
3.3.1 Plots in R
Plot
The function plot in R requires the (preinstalled) system package graphics. Here is the
description of the some of the arguments for the function plot.
plot(x, y, type = "p", main, xlim, ylim, xlab, ylab)

• x is the data set whose values are the horizontal coordinates.

• y is the data set whose values are the vertical coordinates.

• type is the type of plot desired. "p" for points, "l" for lines, "b"
for both points and lines, "c" for empty points joined by lines, "o" for
overplotted. points and lines, "s" and "S" for stair steps and "h" for
histogram-like vertical lines. Finally, "n" does not produce any points
or lines.

• xlim is the limits of the values of x used for plotting.

• ylim is the limits of the values of y used for plotting.

• main is the title of the graph.

• xlab is the label in the horizontal axis.

• ylab is the label in the vertical axis.

• axes indicates whether both axes should be drawn on the plot.

Example. 1. We use the data set mtcars available in the R environment to create a
basic scatterplot. Let’s use the columns wt and mpg in mtcars.
> input <- mtcars[,c(‘wt’,‘mpg’)]
> input
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
...
> plot(x = input$wt,y = input$mpg, xlab = "Weight", ylab = "Milage", xlim
= c(1.5,5), ylim = c(10,34), main = "Weight vs Milage")
However, most of the time we are plotting the graph to observe the trends; it is not
necessary to include the labels for the axes and the title. Moreover, in this case,
input is already a matrix with 2 columns. Finally, if we do not specify the limits of
the plot, R will automatically choose a big enough interval to contain all the data.
> plot(input,type=‘p’)

Finally, it is clear from this example that we do not want to use line plot.
plot(input,type=‘b’)
We may try to fix the line plot by rearranging the wt in ascending order.
plot(input[order(input$wt),],type = "b")
However, it is still not very meaningful to use line plot in this case.

2. Consider now the following data.


> x <- seq(-5,5); y <- x^2; x; y
[1] -5 -4 -3 -2 -1 0 1 2 3 4 5
[1] 25 16 9 4 1 0 1 4 9 16 25
> plot(x,y,type="p")
In this example, there is clearly a trend in the data; in fact it is a graph of a
function. So, we may use line plot instead.
> plot(x,y,type="l")

3. In some cases, the function defined may have more than 1 input, but we want to
observe the trend for the variation of one of the argument, while keeping the rest
constant. Let use our function halflife defined in the previous section. Suppose
we fix the amount and type of substance, we want to observe how the function
varies with t.
> t <- seq(0,100,4)
> plot(t,halflife(1,1/5,t),type = "l")

Or suppose we want to know how the function will vary with respect to substance
with different half-lives, after a fixed time.
> lambda <- seq(from = 0.1, to = 1, length.out = 20)
> plot(lambda,halflife(1,lambda,1),type = "l")
Sapply
For most of the examples above we are able to evaluate the functions for vectors too, since
the operations involved in defining the function accept vectors as arguments. However,
consider the following function.


        f (x) = 0        if x ≤ −1,
                x + 1    if −1 < x ≤ 0,
                −x + 1   if 0 < x ≤ 1,
                0        if x > 1.

> f <- function(x){


if (or(x<=-1,x>1)) return(0)
if (and(x>-1,x<=0)) return(x+1)
if (and(x>0,x<=1)) return(-x+1)
}

If we try to plot it with the method defined above, we obtain the following.
> plot(x=seq(-2,2,length=1000),f(x),type="l")
Error in xy.coords(x, y, xlabel, ylabel, log) :
‘x’ and ‘y’ lengths differ
In addition: Warning message:
In if (or(x <= -1, x > 1)) return(0) :
the condition has length > 1 and only the first element will be used

Hence, we want instead the function f to act on each component of x. To this end,
we use the function sapply (simplified list apply function).
> x <- seq(-2,2,length=1000)
> y <- sapply(x,f)
> plot(x,y,type = "l")
Curve
The function curve in R can be used to plot the graph of a symbolic function. Let us
take a look at the description of some of the arguments that we need.
curve(expression, from, to, type, n)

• expression is the name of a function or an expression written as a function


of ‘x’.

• from is the lower boundary of the limit of x.

• to is the upper boundary of the limit of x.

• type is the type of plot, see description for plot.

• n is integer; the number of x values at which to evaluate, by default


it is set to 101.

Example. 1. curve(sin(x), from = -2*pi, to = 2*pi)

2. Sometimes we want to increase n to increase the resolution of the graph. Let us


plot the curve x sin( x12 ) for 0 < x ≤ 0.5 for different values of n.
> curve(x*sin(1/x^2), from = 0, to = 0.5, n = ...)

[Resulting plots: (a) n = 101 (default), (b) n = 1000, (c) n = 10000]

In the curve function, we may include the argument add = TRUE to add another
graph onto the current one.
> curve(x^2, from = -5, to = 5)
> curve(x^4, add = TRUE)
Histogram
A histogram represents the frequencies of values of a variables belonging to different
ranges. Each bar in a histogram represents the number of times the values within a given
range appears. Here are the description of some of the arguments in the function hist
in R.
hist(x, breaks, col, border)

• x is a vector, representing the variable.

• breaks is one of the following :

– a vector giving the breakpoints between histogram cells,


– a function to compute the vector of breakpoints,
– a single number giving the number of cells for the histogram,
– a function to compute the number of cells.

• col is the color to fill the bar, default is "lightgray".

• border is the color of the border of the bar, default is no color.

Example. > x <- rnorm(1000,50,10)


> hist(x,breaks = 10,col = "BLUE", border = "RED")

For now we don't have to concern ourselves with the function rnorm.
> hist(x,breaks = c(10,30,40,45,50,55,60,65,70,75,80,90),col = "BLUE", border
= "RED")
Contour
The R function contour creates a contour plot, or add contour lines to an existing plot.
Here are description of some of the arguments we will be using.
contour(x, y, z, nlevels, levels, col, lwd, lty, labcex, label, drawlabels,
add)

• x, y are the locations of grid lines at which the values in z are measured.
These must be in ascending order. By default, equally spaced values from
0 to 1 are used

• z is a matrix containing the values to be plotted.

• nlevels is the number of contour levels desired .

• levels is the numeric vector of levels at which to draw contour lines.

• col is the color of the lines drawn, default is black.

• lwd is the width of the line drawn.

• lty is the type of lines drawn, 1 is straight line, 2 and above are broken
lines.

• labcex is the size of the contour label.

• label is a vector giving the labels for the contour lines. If NULL then
the levels are used as labels.

• drawlabels, if = TRUE, contours will be labelled. Default is TRUE.

• add, if value = TRUE, add to the previous plot.


Example. 1. > x <- y <- seq(-10,10); f <- function(x,y){x^2+y^2}
> z <- outer(x,y,f)
> contour(x,y,z)

2. > contour(z,nlevels=20,lty=2)

3. > x <- runif(1000); y <- runif(1000); z <- kde2d(x, y, n = 50)


> contour(z,nlevels=20, col = hcl.colors(10,"Temps"))
Remark. (a) See section 4.6.1 for the function runif.
(b) The function kde2d requires the (pre-installed) package MASS, it calculates the
kernel density estimate of the variables.
(c) See https://developer.r-project.org/Blog/public/2019/04/01/hcl-based-color-pa
for hcl.colors.

We may include the scatter plot of x and y to the contour plot.


> plot(x,y,pch=20); contour(z,nlevels=10,lwd=2, col = hcl.colors(10,"Temps"),
add=TRUE)
The function filled.contour is similar to contour, except it fills the areas between
the contour lines with colors. The arguments for filled.contour are exactly those of
contour. We may plot a contour plot on top of a filled contour plot using the function
plot.axes for better visualization.
filled.contour(z,col=hcl.colors(10,"Zissou1"),plot.axes = {contour(z, add =
TRUE, lwd = 2,col = hcl.colors(10,"Temps" ))})

3.3.2 Curve Fitting in R


Given a set of data points, we might want to find either a curve that exactly fits all the
data, or a curve that approximately fits the data. Finding a curve that exactly fits all the points
is called interpolation, and finding a curve that approximately fits the data is called smoothing.
We might want to choose one over the other depending on the situation. However, in
most cases in reality, due to measurement errors or small-scale fluctuations, it would be
preferable to use smoothing.
There are two distinct ways that you can use the curve.
• We can read the value predicted by the curve for a specific explanatory variable.
This is useful when we want to make a prediction, or when we want to compare the
actual value of the data to what the curve says is a typical value. The difference
between the actual value and the value given by the curve is called the residual.

• Characterize the relationship between data. It is useful when we want to make


statements about the overall trend rather than individual cases.

Linear Regression
The simplest process of curve fitting is to use a line, or a linear equation. The function
lm is used to fit linear models. A line y = mx + c is uniquely determined by its gradient
and y-intercept, m and c, respectively. This is what lm returns. Let us take a look at
some of the arguments of lm.
lm(formula, data, na.action)

• formula is a symbol presenting the relation between x and y. The syntax


is y∼ x.

• data is an optional data frame, list or environment (or object coercible by
as.data.frame to a data frame) containing the variables in the model.

• na.action is a function which indicates what should happen when the data
contain NAs. The default is set by the na.action setting of options,
and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit.
Another possible value is NULL, no action. Value na.exclude can be useful.

To plot the linear curve over the scatter plot of the data, we use the function abline.
abline(a , b , and other graphical.parameters)

• a is the y-intercept.

• b is the gradient.

To read the value predicted by the curve, we can use the function predict. Here are
some of the arguments of the function.
predict(object,newdata)

• object is the formula which is already created using the lm() function.

• newdata is the vector containing the new value for predictor variable.

Example. 1. We will use the weights and miles per gallon data from the data mtcars.
> Weight <- mtcars$wt #1000 lbs
> Miles_per_gallon <- mtcars$mpg
> lm(Miles_per_gallon~Weight, data=mtcars)

Call:
lm(formula = Miles_per_gallon ~ Weight, data = mtcars)

Coefficients:
(Intercept) Weight
37.285 -5.344

> plot(Weight, Miles_per_gallon, pch = 19, frame = FALSE)
> abline(lm(Miles_per_gallon ~ Weight, data = mtcars), col = "blue")
> predict(lm(Miles_per_gallon~Weight, data=mtcars), data.frame(Weight=4.5))
1
13.235
That is, the linear model predicts that at 4500 lbs, the car would be able to travel
13.235 miles per gallon.

2. We will be using the data beavers in the library datasets. It records the body
temperature of 2 beavers at 10-minute intervals. Let us plot the graph of the first
beaver.
> data(beavers)
> Minutes <- seq(10, nrow(beaver1) * 10, 10) #from 10 minutes to 114*10 minutes, in 10-minute intervals
> Temperature <- beaver1$temp #the recorded body temperatures of the first beaver
> plot(Minutes, Temperature, pch = 19, frame = FALSE)
> lm(Temperature ~ Minutes, data = beaver1)

Call:
lm(formula = Temperature ~ Minutes, data = beaver1)

Coefficients:
(Intercept) Minutes
3.672e+01 2.456e-04

> abline(lm(Temperature ~ Minutes, data = beaver1), col = "blue")

In this case, we would (falsely) assume that the body temperature of the beaver
will rise over time. This is a limitation of using a linear model.

Locally Weighted Scatterplot Smoothing


The function lowess in R can be used to fit a nonlinear curve for a data. It is the acronym
for locally weighted scatterplot smoothing. Given vectors x and y, it will return a new
set of vectors that produces a smooth nonlinear curve. Here are some of its arguments.
lowess(x, y , f = 2/3)

• x, y are vectors giving the coordinates of the points in the scatter plot.

• f is the amount of smoothing; the larger the value, the smoother the curve.
It must be a positive number.

Example. 1. We will plot both the linear model and the lowess smoother model.
> Weight <- mtcars$wt
> Miles_per_gallon <- mtcars$mpg
> plot(Weight, Miles_per_gallon, pch = 19, frame = FALSE)
> abline(lm(Miles_per_gallon ~ Weight, data = mtcars), col = "blue")
> lines(lowess(Weight,Miles_per_gallon), col = "red")
Now to read the value predicted by the lowess curve,
> predict(loess(Miles_per_gallon ~ Weight,mtcars),data.frame(Weight=3))
1
20.48786
Note that here the object is loess instead of lowess.

2. > data(beavers)
> Minutes <- seq(10, nrow(beaver1) * 10, 10)
> Temperature <- beaver1$temp
> plot(Minutes, Temperature, frame = FALSE, type = "l")
> abline(lm(Temperature ∼ Minutes, data = beaver1), col = "blue")
> lines(lowess(Minutes,Temperature), col ="red")
Using the lowess model, it is no longer an observed trend that the body temperature
of the beaver will continue to rise.

3. We will now observe the graphs of the function lowess for different smoother values.
> plot(Minutes, Temperature, frame = FALSE, type = "l")
> lines(lowess(Minutes,Temperature), col ="blue")
> lines(lowess(Minutes,Temperature,f=0.1), col ="green")
> lines(lowess(Minutes,Temperature,f=10), col ="red")
> legend("topleft",col = c("blue", "green", "red"),lwd = 2,c("f = default",
"f = 0.1", "f = 10")) #add legend
We may alternatively use the scatter.smooth function in R to plot the graph with a
lowess curve.
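As a minimal sketch (assuming the vectors Minutes and Temperature from the beaver example above are still in the workspace), the scatter plot and lowess curve can be produced in a single call:
> scatter.smooth(Minutes, Temperature, span = 2/3)
Here span plays the same role as the smoother value f in lowess.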

3.4 Derivatives
3.4.1 Definitions and Properties
Single variable real-valued function
Recall that a real-valued function of a single variable is differentiable at $a$ if there is a real number $m$ such that
$$\lim_{h\to 0} \frac{f(a+h) - f(a) - mh}{h} = 0.$$
Readers may refer to the appendix for the formal definition of limits. In this case, we can write
$$\lim_{h\to 0} \frac{f(a+h) - f(a)}{h} = m,$$
and $m$ is called the derivative of $f$ at $a$. The value $m$ is commonly denoted as $f'(a)$, $\frac{d}{dx}\big|_{x=a} f$, or $\frac{df}{dx}(a)$. Then the linear approximation of $f$ at $a$ is
$$f(a) + f'(a)(x - a).$$

Example. Let $f(x) = x^3 - 2x$. Then the derivative of $f$ at $a = -1$ is $f'(-1) = 1$, and the linear approximation is the line $x + 2$.

A function is said to be differentiable if it is differentiable at every point in its domain. In this case, we let $f'(x)$ denote the function that sends every point $a$ in the domain of $f$ to its derivative at $a$,
$$f' : D \to \mathbb{R}, \quad a \mapsto \lim_{h\to 0} \frac{f(a+h) - f(a)}{h} = f'(a).$$
Remark. It can be the case that a function is differentiable almost everywhere in its domain, but still not differentiable. For example, consider the piecewise function
$$f(x) = \begin{cases} 2 & \text{for all } x > 0 \\ 1 & \text{for all } x \le 0 \end{cases}$$
Then $f$ is differentiable at every $a \ne 0$, with derivative
$$f'(a) = \begin{cases} \lim_{h\to 0} \frac{f(a+h)-f(a)}{h} = \lim_{h\to 0} \frac{2-2}{h} = 0 & \text{for all } a > 0 \\[4pt] \lim_{h\to 0} \frac{f(a+h)-f(a)}{h} = \lim_{h\to 0} \frac{1-1}{h} = 0 & \text{for all } a < 0 \end{cases}$$
But if $a = 0$, then
$$\lim_{h\to 0^+} \frac{f(h) - f(0)}{h} = \lim_{h\to 0^+} \frac{2-1}{h} = \lim_{h\to 0^+} \frac{1}{h} = +\infty.$$
So $f$ is not differentiable, since it fails to be differentiable at the point $a = 0$.

Multivariable real-valued function


Suppose now $f$ is a multivariable function from $\mathbb{R}^n$ to $\mathbb{R}$ and $f$ is defined in a small open ball around a point $\mathbf{a} \in \mathbb{R}^n$. The function $f$ is said to be differentiable at $\mathbf{a}$ if there exists a linear function $L : \mathbb{R}^n \to \mathbb{R}$ such that
$$\lim_{\|\mathbf{h}\|\to 0} \frac{(f(\mathbf{a}+\mathbf{h}) - f(\mathbf{a})) - L(\mathbf{h})}{\|\mathbf{h}\|} = 0.$$

Remark. 1. Recall from section 2.2.2 that the norm of a vector $\mathbf{h} = (h_i) \in \mathbb{R}^n$ is given by $\|\mathbf{h}\| = \sqrt{h_1^2 + h_2^2 + \cdots + h_n^2}$.

2. Recall that a linear function $L : \mathbb{R}^n \to \mathbb{R}$ can be written as $L(\mathbf{x}) = \mathbf{m}^T\mathbf{x} = \sum_i m_i x_i$ for some column vector $\mathbf{m} \in \mathbb{R}^n$, where we expressed $\mathbf{x}$ as a column vector.
Suppose $f$ is differentiable at the point $\mathbf{a} = (a_i)$, and $L(\mathbf{h}) = \sum_i m_i h_i$ for some $\mathbf{m} = (m_i) \in \mathbb{R}^n$. Set $h_j = 0$ for all $j \ne i$. Then in the definition above, we have
$$0 = \lim_{h_i \to 0} \frac{f(a_1,...,a_i+h_i,...,a_n) - f(a_1,...,a_i,...,a_n) - L(0,...,h_i,...,0)}{h_i} = \lim_{h_i \to 0} \frac{f(a_1,...,a_i+h_i,...,a_n) - f(a_1,...,a_i,...,a_n) - m_i h_i}{h_i},$$
and thus
$$\lim_{h_i \to 0} \frac{f(a_1,...,a_i+h_i,...,a_n) - f(a_1,...,a_i,...,a_n)}{h_i} = m_i.$$
The number $m_i$ is the partial derivative of $f$ with respect to $x_i$ at the point $\mathbf{a}$, and is denoted by
$$\frac{\partial f}{\partial x_i}(\mathbf{a}), \quad f_{x_i}(\mathbf{a}) \quad\text{or}\quad \partial_{x_i} f(\mathbf{a}).$$
To obtain the partial derivative $\partial_{x_i} f$, we are treating $f$ as a function of only $x_i$, while taking the other $x_j$ to be fixed, $x_j = a_j$ for all $j \ne i$. Since this is true for all $i = 1,...,n$, it shows that if $f$ is differentiable at $\mathbf{a}$, then its partial derivative $\partial_{x_i} f(\mathbf{a})$ exists for all $i = 1,...,n$, and the linear function $L$ is given by
$$L(\mathbf{x}) = \begin{pmatrix} \frac{\partial}{\partial x_1}f(\mathbf{a}) & \frac{\partial}{\partial x_2}f(\mathbf{a}) & \cdots & \frac{\partial}{\partial x_n}f(\mathbf{a}) \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \sum_i \frac{\partial}{\partial x_i}f(\mathbf{a})\,x_i,$$
that is, the matrix representation of $L$ is
$$L = \begin{pmatrix} \frac{\partial}{\partial x_1}f(\mathbf{a}) & \frac{\partial}{\partial x_2}f(\mathbf{a}) & \cdots & \frac{\partial}{\partial x_n}f(\mathbf{a}) \end{pmatrix}.$$
The gradient of $f$ at $\mathbf{a}$ is the matrix representation of $L$ at $\mathbf{a}$ written as a column vector, that is, the transpose of $L$, and is denoted as $\nabla f(\mathbf{a})$:
$$\nabla f(\mathbf{a}) = \begin{pmatrix} \frac{\partial}{\partial x_1}f(\mathbf{a}) \\ \frac{\partial}{\partial x_2}f(\mathbf{a}) \\ \vdots \\ \frac{\partial}{\partial x_n}f(\mathbf{a}) \end{pmatrix}.$$

The linear approximation of $f$ at $\mathbf{a}$ is
$$f(\mathbf{a}) + L(\mathbf{x} - \mathbf{a}) = f(\mathbf{a}) + \sum_i \frac{\partial}{\partial x_i}f(\mathbf{a})(x_i - a_i).$$
Observe that a real-valued function of a single variable is a special case of this, with $n = 1$ and $L = f'(a)$.

A function $f$ from $\mathbb{R}^n$ to $\mathbb{R}$ is differentiable if it is differentiable at every point in its domain. In this case, we let $\nabla f$ denote the vector-valued function that sends every point $\mathbf{a}$ in the domain of $f$ to its gradient at $\mathbf{a}$,
$$\nabla f : \mathbb{R}^n \to \mathbb{R}^n, \quad \mathbf{a} \mapsto \begin{pmatrix} \frac{\partial}{\partial x_1}f(\mathbf{a}) \\ \frac{\partial}{\partial x_2}f(\mathbf{a}) \\ \vdots \\ \frac{\partial}{\partial x_n}f(\mathbf{a}) \end{pmatrix}.$$

Example. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be $f(x,y) = x^2 + y^2$. Then $f$ is differentiable and the derivative at the point $(1,1)$ is
$$\begin{pmatrix} \frac{\partial}{\partial x}f(1,1) & \frac{\partial}{\partial y}f(1,1) \end{pmatrix} = \begin{pmatrix} 2 & 2 \end{pmatrix},$$
and its linear approximation at $(1,1)$ is
$$1^2 + 1^2 + \begin{pmatrix} 2 & 2 \end{pmatrix}\begin{pmatrix} x-1 \\ y-1 \end{pmatrix} = 2x + 2y - 2.$$
The gradient of $f$ is
$$\nabla f(x,y) = \begin{pmatrix} 2x \\ 2y \end{pmatrix}.$$
The usual rules for ordinary differentiation hold for partial derivatives.

Theorem (Properties of Partial Derivatives). Suppose $f$ and $g$ are functions from $\mathbb{R}^n$ to $\mathbb{R}$ such that $f$ and $g$ are differentiable at $\mathbf{a}$. Then

(i) $\frac{\partial}{\partial x_i}(f+g)(\mathbf{a}) = \frac{\partial}{\partial x_i}f(\mathbf{a}) + \frac{\partial}{\partial x_i}g(\mathbf{a})$.

(ii) $\frac{\partial}{\partial x_i}(fg)(\mathbf{a}) = \frac{\partial}{\partial x_i}f(\mathbf{a})\,g(\mathbf{a}) + f(\mathbf{a})\,\frac{\partial}{\partial x_i}g(\mathbf{a})$.

(iii) $\frac{\partial}{\partial x_i}\left(\frac{1}{f}\right)(\mathbf{a}) = -\dfrac{\partial_{x_i}f(\mathbf{a})}{f(\mathbf{a})^2}$, if $f(\mathbf{a}) \ne 0$.
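For instance, a quick check of the product rule (ii) with $f(x,y) = x^2$ and $g(x,y) = y$:
$$\frac{\partial}{\partial x}(fg)(x,y) = \frac{\partial}{\partial x}(x^2y) = 2xy = (2x)\,y + x^2\cdot 0 = \frac{\partial f}{\partial x}g + f\frac{\partial g}{\partial x}.$$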
Remark. It can be the case that the partial derivatives of a function exist at a point, but the function is not differentiable at that point. For example, consider the function
$$f(x,y) = |x+y| - |x-y|.$$
Then along the line $y = 0$, $f$ as a function of $x$ is
$$f(x,0) = |x+0| - |x-0| = 0.$$
Similarly, along the line $x = 0$, $f$ as a function of $y$ is
$$f(0,y) = |0+y| - |0-y| = 0.$$
However, the derivative of $f$ at $(0,0)$ is not well defined, since if we let $\mathbf{h} = h\begin{pmatrix}1\\1\end{pmatrix}$, then
$$L(h,h) = \begin{pmatrix} \frac{\partial}{\partial x}f(0,0) & \frac{\partial}{\partial y}f(0,0) \end{pmatrix}\begin{pmatrix} h \\ h \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix}\begin{pmatrix} h \\ h \end{pmatrix} = 0,$$
and so
$$\lim_{(h,h)\to(0,0)} \frac{(f(0+h,0+h) - f(0,0)) - L(h,h)}{\sqrt{h^2+h^2}} = \lim_{(h,h)\to(0,0)} \frac{|2h| - |0| - 0}{\sqrt{2}\,|h|} = \frac{2}{\sqrt{2}} \ne 0.$$
However, f is differentiable at a if all the partial derivatives of f are continuous near a.
Theorem. If the partial derivatives of f are continuous in an open neighborhood of a,
then f is differentiable at a.

Mixed partial derivatives



Suppose $f$ is a differentiable multivariable function such that the partial derivatives $\frac{\partial}{\partial x_i}f$ are also differentiable. Then we define the mixed partial derivatives of $f$ as
$$\frac{\partial^r}{\partial x_{i_1}\partial x_{i_2}\cdots\partial x_{i_r}} f = \frac{\partial}{\partial x_{i_1}}\frac{\partial}{\partial x_{i_2}}\cdots\frac{\partial}{\partial x_{i_r}} f,$$
where $x_{i_1}, x_{i_2}, ..., x_{i_r} \in \{x_1, x_2, ..., x_n\}$.

For example, suppose $r = 2$ and $n = 2$. That is, $f(x,y)$ is a function from $\mathbb{R}^2$ to $\mathbb{R}$, then
$$\frac{\partial^2}{\partial x\partial y} f(x,y) = \frac{\partial}{\partial x}\left(\frac{\partial}{\partial y} f(x,y)\right).$$
Example. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by
$$f(x,y) = xy^2(x^2+y).$$
Then
$$\frac{\partial^2}{\partial x\partial y} f(x,y) = \frac{\partial}{\partial x}(2x^3y + 3xy^2) = 6x^2y + 3y^2,$$
and
$$\frac{\partial^2}{\partial y\partial x} f(x,y) = \frac{\partial}{\partial y}(3x^2y^2 + y^3) = 6x^2y + 3y^2.$$
Observe that in the example above the order of the mixed partial derivatives does
not matter. This is not true in general. However, it will be true for all the cases that we
come across in this course.

Multivariable vector-valued functions


We shall now extend the definitions of differentiability and partial derivatives to vector-valued functions. Suppose $\mathbf{F} : \mathbb{R}^n \to \mathbb{R}^m$ is a multivariable vector-valued function. Then $\mathbf{F}$ is differentiable at $\mathbf{a}$ if there exists a linear mapping $L_{\mathbf{a}} : \mathbb{R}^n \to \mathbb{R}^m$ such that
$$\lim_{\|\mathbf{h}\|\to 0} \frac{\|(\mathbf{F}(\mathbf{a}+\mathbf{h}) - \mathbf{F}(\mathbf{a})) - L_{\mathbf{a}}(\mathbf{h})\|}{\|\mathbf{h}\|} = 0.$$
By writing $\mathbf{F}$ in its components,
$$\mathbf{F}(x_1,...,x_n) = \begin{pmatrix} f_1(x_1,...,x_n) \\ f_2(x_1,...,x_n) \\ \vdots \\ f_m(x_1,...,x_n) \end{pmatrix}, \quad f_i : \mathbb{R}^n \to \mathbb{R},$$
and following an analogous argument given above on each of the components $f_i$, $i = 1,...,m$, we have the following result.
Theorem. Suppose a multivariable vector-valued function $\mathbf{F}$ is differentiable at $\mathbf{a} \in \mathbb{R}^n$. Then all the partial derivatives
$$\frac{\partial f_i}{\partial x_j}, \quad i = 1,...,m, \quad j = 1,...,n,$$
exist at the point $\mathbf{a}$, and the linear mapping $L_{\mathbf{a}}$ has the matrix representation
$$d\mathbf{F}(\mathbf{a}) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{a}) & \frac{\partial f_1}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_1}{\partial x_n}(\mathbf{a}) \\ \frac{\partial f_2}{\partial x_1}(\mathbf{a}) & \frac{\partial f_2}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_2}{\partial x_n}(\mathbf{a}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{a}) & \frac{\partial f_m}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_m}{\partial x_n}(\mathbf{a}) \end{pmatrix}.$$

The matrix representation $d\mathbf{F}(\mathbf{a})$ of $L_{\mathbf{a}}$ is called the matrix derivative of $\mathbf{F}$ at $\mathbf{a}$. Observe that if $m = 1$, then the matrix derivative of $f$ is the matrix representation of the linear functional $L$, that is, the transpose of the gradient of $f$,
$$df(\mathbf{a}) = L_{\mathbf{a}} = \nabla f(\mathbf{a})^T.$$

A multivariable vector-valued function $\mathbf{F}$ is said to be differentiable if it is differentiable at every point in its domain. In this case, we let $d\mathbf{F}$ denote the matrix-valued function that sends every point $\mathbf{a}$ in the domain of $\mathbf{F}$ to its matrix derivative at $\mathbf{a}$.

If $n = m$ in the definition above, then the matrix derivative $d\mathbf{F}(\mathbf{a})$ of $\mathbf{F}$ at $\mathbf{a}$ is a square matrix. Its determinant is called the Jacobian of $\mathbf{F}$ at $\mathbf{a}$, and is denoted as
$$J\mathbf{F}(\mathbf{a}) = \det(d\mathbf{F}(\mathbf{a})) = \begin{vmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{a}) & \frac{\partial f_1}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_1}{\partial x_n}(\mathbf{a}) \\ \frac{\partial f_2}{\partial x_1}(\mathbf{a}) & \frac{\partial f_2}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_2}{\partial x_n}(\mathbf{a}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial x_1}(\mathbf{a}) & \frac{\partial f_n}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_n}{\partial x_n}(\mathbf{a}) \end{vmatrix}.$$
 
Example. 1. Let $\mathbf{F} : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $\mathbf{F}(x,y) = \begin{pmatrix} x+y \\ xy \end{pmatrix}$. Then $\mathbf{F}$ is differentiable on the whole $\mathbb{R}^2$, and its matrix derivative is
$$d\mathbf{F} = \begin{pmatrix} 1 & 1 \\ y & x \end{pmatrix},$$
and the Jacobian is
$$J\mathbf{F} = x - y.$$

2. (Polar coordinates) Let $\mathbf{F} : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $\mathbf{F}(r,\theta) = \begin{pmatrix} r\cos(\theta) \\ r\sin(\theta) \end{pmatrix}$. Then $\mathbf{F}$ is differentiable on the whole $\mathbb{R}^2$, and its matrix derivative is
$$d\mathbf{F} = \begin{pmatrix} \frac{\partial}{\partial r}r\cos(\theta) & \frac{\partial}{\partial \theta}r\cos(\theta) \\ \frac{\partial}{\partial r}r\sin(\theta) & \frac{\partial}{\partial \theta}r\sin(\theta) \end{pmatrix} = \begin{pmatrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{pmatrix},$$
and its Jacobian is
$$J\mathbf{F} = r\cos^2(\theta) + r\sin^2(\theta) = r.$$

3. (Spherical coordinates) Let $\mathbf{F} : \mathbb{R}^3 \to \mathbb{R}^3$ be defined by $\mathbf{F}(r,\theta,\phi) = \begin{pmatrix} r\cos(\phi)\sin(\theta) \\ r\sin(\phi)\sin(\theta) \\ r\cos(\theta) \end{pmatrix}$. Then $\mathbf{F}$ is differentiable on the whole $\mathbb{R}^3$, and its matrix derivative is
$$d\mathbf{F} = \begin{pmatrix} \cos(\phi)\sin(\theta) & r\cos(\phi)\cos(\theta) & -r\sin(\phi)\sin(\theta) \\ \sin(\phi)\sin(\theta) & r\sin(\phi)\cos(\theta) & r\cos(\phi)\sin(\theta) \\ \cos(\theta) & -r\sin(\theta) & 0 \end{pmatrix},$$
and the Jacobian is
$$J\mathbf{F} = r^2\cos^2(\phi)\cos^2(\theta)\sin(\theta) + r^2\sin^2(\phi)\sin^3(\theta) + r^2\sin^2(\phi)\cos^2(\theta)\sin(\theta) + r^2\cos^2(\phi)\sin^3(\theta) = r^2\sin(\theta).$$

4. (Cylindrical coordinates) Let $\mathbf{F} : \mathbb{R}^3 \to \mathbb{R}^3$ be defined by $\mathbf{F}(\rho,\theta,z) = \begin{pmatrix} \rho\cos(\theta) \\ \rho\sin(\theta) \\ z \end{pmatrix}$. Then $\mathbf{F}$ is differentiable on the whole $\mathbb{R}^3$, and its matrix derivative is
$$d\mathbf{F} = \begin{pmatrix} \cos(\theta) & -\rho\sin(\theta) & 0 \\ \sin(\theta) & \rho\cos(\theta) & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
and its Jacobian is
$$J\mathbf{F} = \rho\cos^2(\theta) + \rho\sin^2(\theta) = \rho.$$

Refer to https://youtu.be/wCZ1VEmVjVo for a thorough discussion of the Jacobian of a multivariable vector-valued function.

3.4.2 Special Cases: Vector and Matrix Derivatives

1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $f(\mathbf{x}) = \mathbf{y}^T\mathbf{x} = \sum_{i=1}^n y_i x_i$ for some $\mathbf{y} \in \mathbb{R}^n$. Then
$$\nabla f(\mathbf{x}) = \begin{pmatrix} \partial_{x_1}\sum_{i=1}^n y_i x_i \\ \partial_{x_2}\sum_{i=1}^n y_i x_i \\ \vdots \\ \partial_{x_n}\sum_{i=1}^n y_i x_i \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \mathbf{y}.$$

2. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $f(\mathbf{x}) = \mathbf{x}^T\mathbf{x} = \sum_{i=1}^n x_i^2$. Then
$$\nabla f(\mathbf{x}) = \begin{pmatrix} \partial_{x_1}\sum_{i=1}^n x_i^2 \\ \partial_{x_2}\sum_{i=1}^n x_i^2 \\ \vdots \\ \partial_{x_n}\sum_{i=1}^n x_i^2 \end{pmatrix} = \begin{pmatrix} 2x_1 \\ 2x_2 \\ \vdots \\ 2x_n \end{pmatrix} = 2\mathbf{x}.$$

3. Let $Q : \mathbb{R}^n \to \mathbb{R}$ be a quadratic form, $Q(\mathbf{x}) = \mathbf{x}^TA\mathbf{x}$ for some symmetric matrix $A = (a_{ij})$. Then the $i$-th component of the gradient is
$$(\nabla Q(\mathbf{x}))_i = \partial_{x_i}\sum_{k=1}^n\sum_{l=1}^n x_k a_{kl} x_l = \sum_{l=1}^n a_{il}x_l + \sum_{k=1}^n x_k a_{ki}.$$
Observe that since $k$ and $l$ are dummy indices, we may take both of them to be, say, $k$. Since $A$ is symmetric, $a_{ik} = a_{ki}$. Hence, the sum equals
$$\sum_{k=1}^n a_{ik}x_k + \sum_{k=1}^n x_k a_{ik} = \sum_{k=1}^n 2a_{ik}x_k,$$
which is the $i$-th component of $2A\mathbf{x}$. Hence,
$$\nabla Q(\mathbf{x}) = 2A\mathbf{x}.$$
(A concrete check is given after this list.)

4. Let $L : \mathbb{R}^n \to \mathbb{R}^m$ be a linear mapping with matrix representation $A = (a_{ij})_{m\times n}$, $L(\mathbf{x}) = A\mathbf{x}$. Then the $(i,j)$-entry of the matrix derivative $dL$ is
$$\frac{\partial}{\partial x_j}\sum_{k=1}^n a_{ik}x_k = a_{ij}.$$
Hence,
$$dL = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} = A.$$
This shows that if $L$ is a linear mapping with matrix representation $A$ (with $m = n$), then
$$JL = \det(A).$$
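As a concrete check of item 3 (the matrix $A$ here is just an illustrative choice), take the symmetric matrix $A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$, so that $Q(\mathbf{x}) = 2x_1^2 + 2x_1x_2 + 3x_2^2$. Differentiating directly,
$$\nabla Q(\mathbf{x}) = \begin{pmatrix} 4x_1 + 2x_2 \\ 2x_1 + 6x_2 \end{pmatrix} = 2\begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 2A\mathbf{x},$$
which agrees with the general formula.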

3.4.3 Symbolic Derivatives in R


Single variable real-valued functions
Although we may use the function D from the pre-installed packages in R to perform sym-
bolic differentiation, there are some limitations. Instead, we will be using the package
Deriv.
> install.packages("Deriv")
> library(Deriv)

To perform differentiation, we use the function Deriv. The arguments that we need are
Deriv(function, name, nderiv)
• function is the user-defined function to be differentiated.
• name is the variable we want to differentiate with respect to, placed
  inside quotation marks " ".
• nderiv is the order of the derivative to calculate. The default is 1.
Example. 1. Find the derivative of $e^{x^2}$.
> f <- function(x) {exp(x^2)}
> df <- Deriv(f,"x")
> df
function (x)
2 * (x * exp(x^2))
2. Find the derivative of $\cos(\sin^2(2x-1))$.
> f <- function(x){cos(sin(2*x-1)^2)}
> df <- Deriv(f,"x")
> df
function (x)
{
.e2 <- 2 * x - 1
.e3 <- sin(.e2)
-(4 * (cos(.e2) * .e3 * sin(.e3^2)))
}

3. Find the derivative of $\ln(\tan(x^2))$.


> f <- function(x){log(tan(x^2))}
> df <- Deriv(f,"x")
> df
function (x)
{
.e1 <- x^2
2 * (x/(cos(.e1)^2 * tan(.e1)))
}

Remark. Recall in section 1.4.1 that the primitive function log is the natural
logarithm, it is already in base e = exp(1). However, if we try to differentiate
log(x,base=a), it will evaluate log(a).
> f <- function(x){log(x,2)}
> df <- Deriv(f,"x")
> df
function (x)
1/(0.693147180559945 * x)
So here, we might want to differentiate symbolically using D. However, we must use
the logarithm change of base formula.
> f <- expression(log(x)/log(2))
> df <- D(f,"x")
> df
1/x/log(2)
Indeed, the answer is $\frac{1}{x\ln(2)}$, which agrees with the one given by Deriv above, except that it does not evaluate $\ln(2)$.

To evaluate the derivative at a point, we just make the substitution.


Example. 1. Let $f(x) = x^2$, find $\frac{d}{dx}f(2)$.
> f <- function(x){x^2}
> df <- Deriv(f,"x")
> df(2)
[1] 4
2. Let $f(x) = e^{x^2}$. Find $\frac{d}{dx}f(1)$.
> f <- function(x){exp(x^2)}
> df <- Deriv(f,"x")
> df(1)
[1] 5.436564
> 2*exp(1)
[1] 5.436564
Exercise. Use R to obtain the third derivative of $e^{x^2}$.

Multivariable real-valued functions


The function Deriv can also be used to perform partial differentiation.
Example. 1. Find the partial derivatives of f (x, y) = xy.
> f <- function(x,y){x*y}
> dfx <- Deriv(f,"x"); dfy <- Deriv(f,"y")
> dfx
function (x, y)
y
> dfy
function (x, y)
x
So, the gradient of $f$ is
$$\nabla f(x,y) = \begin{pmatrix} y \\ x \end{pmatrix}.$$

2. Find the partial derivatives of $f(x,y,z) = xy^z$.


> f <- function(x,y,z){x*y^z}
> dfx <- Deriv(f,"x");dfy <- Deriv(f,"y");dfz <- Deriv(f,"z")
> dfx
function (x, y, z)
y^z
> dfy
function (x, y, z)
x * y^(z - 1) * z
> dfz
function (x, y, z)
x * y^z * log(y)
This tells us that the gradient of $f$ is
$$\nabla f(x,y,z) = \begin{pmatrix} y^z \\ xzy^{z-1} \\ xy^z\ln(y) \end{pmatrix}.$$

Remark. One has to be careful that the domain of $f$ and the domain of its partial derivatives may not be the same. For example, $(x,y,z) = (1,-1,0)$ is in the domain of $f$, but it is not in the domain of $\frac{\partial f}{\partial z}$.
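The mixed partial derivatives from section 3.4.1 can also be obtained by nesting two Deriv calls; a minimal sketch using the earlier example $f(x,y) = xy^2(x^2+y)$ (we assume here that Deriv accepts the function returned by a previous Deriv call, as it does for any other R function):
> f <- function(x,y){x*y^2*(x^2+y)}
> dfyx <- Deriv(Deriv(f,"y"),"x")   #differentiate with respect to y, then x
The resulting function should agree with the hand computation $6x^2y + 3y^2$ from section 3.4.1, possibly written in a different but equivalent form.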
One advantage of using this package is the ability to differentiate some functions that
the preinstalled function D cannot.
> f <- expression(abs(x))
> D(f,"x")
Error in D(f, "x") : Function ‘abs’ is not in the derivatives table
> f <- function(x,y){abs(x*y)}
> dfx <- Deriv(f,"x")
> dfx
function (x, y)
y * sign(x * y)
> dfy <- Deriv(f,"y")
> dfy
function (x, y)
x * sign(x * y)

Vector-valued functions
The function Deriv can be used to differentiate vector-valued functions.
Example. 1. Consider the multivariable vector-valued function
$$\mathbf{F}(x,y,z) = \begin{pmatrix} xyz \\ x+y+z \\ z^2 \end{pmatrix}.$$

Let us use R to compute its matrix derivative.


> F <- function(x,y,z){matrix(c(x*y*z,x+y+z,z^2))}
> dFx <- Deriv(F,"x")
> dFy <- Deriv(F,"y")
> dFz <- Deriv(F,"z")
> dFx;dFy;dFz
function (x, y, z)
matrix(c(y * z, 1, 0), nrow = 3, ncol = 1, byrow = , dimnames = )
function (x, y, z)
matrix(c(x * z, 1, 0), nrow = 3, ncol = 1, byrow = , dimnames = )
function (x, y, z)
matrix(c(x * y, 1, 2 * z), nrow = 3, ncol = 1, byrow = , dimnames = )
So reading off the three columns, we get
$$d\mathbf{F} = \begin{pmatrix} yz & xz & xy \\ 1 & 1 & 1 \\ 0 & 0 & 2z \end{pmatrix}.$$

Remark. Note that R returns each dFx, dFy, dFz as a 3 × 1 matrix, that is, it is
a column vector.

2. Find the matrix derivative of the linear mapping
$$L\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 1 & 5 \\ 0 & 3 & -2 \end{pmatrix}\begin{pmatrix} x \\ y \\ z \end{pmatrix}.$$

> L <- function(x,y,z){matrix(c(1,0,-1,2,1,5,0,3,-2),3,3,T)


%*%matrix(c(x,y,z),3,1)}
> DLx <- Deriv(L,"x")
> DLy <- Deriv(L,"y")
> DLz <- Deriv(L,"z")
> DLx;DLy;DLz
function (x, y, z)
matrix(c(1, 0, -1, 2, 1, 5, 0, 3, -2), 3, 3, T) %*% matrix(c(1,
0, 0), nrow = 3, ncol = 1, byrow = , dimnames = )
function (x, y, z)
matrix(c(1, 0, -1, 2, 1, 5, 0, 3, -2), 3, 3, T) %*% matrix(c(0,
1, 0), nrow = 3, ncol = 1, byrow = , dimnames = )
function (x, y, z)
matrix(c(1, 0, -1, 2, 1, 5, 0, 3, -2), 3, 3, T) %*% matrix(c(0,
0, 1), nrow = 3, ncol = 1, byrow = , dimnames = )
So,
$$DL(x,y,z) = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 1 & 5 \\ 0 & 3 & -2 \end{pmatrix}.$$
Exercise. Use R to find the Jacobian of

1. (Polar coordinates) $\mathbf{F}(r,\theta) = \begin{pmatrix} r\cos(\theta) \\ r\sin(\theta) \end{pmatrix}$

2. (Spherical coordinates) $\mathbf{F}(r,\theta,\phi) = \begin{pmatrix} r\cos(\phi)\sin(\theta) \\ r\sin(\phi)\sin(\theta) \\ r\cos(\theta) \end{pmatrix}$

3. (Cylindrical coordinates) $\mathbf{F}(\rho,\theta,z) = \begin{pmatrix} \rho\cos(\theta) \\ \rho\sin(\theta) \\ z \end{pmatrix}$

3.4.4 Numerical Differentiation in R


Consider the following piecewise function
$$f(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases}$$
Then $f$ is differentiable at every point $a \ne 0$, with derivative 0 for $a < 0$ and derivative 1 for $a > 0$. Let us try to use the function Deriv to find the derivative of $f$ at some points $a \ne 0$.
f <- function(x){
if (x>=0) return(x)
else return(0)
}
> Deriv(f,"x")
Error in Deriv (st[[3]], x, env, use.D, dsym, scache, drule. = drule.) :
Could not retrieve body of ’return()’

So instead of differentiating symbolically, we might want to compute the numerical


differential of f at specific points. To that end, we use the function fderiv. It requires
the package pracma. We will assume that the package has already been installed (see
section 2.3.4), and just proceed to load it.
> library(pracma)
Let us take a look at some of the arguments of the function fderiv.
fderiv(f, x, n)

• f is the function to be differentiated.

• x is the point(s) the differentiation will take place. If it is a vector,


the function will be differentiated at those points.

• n is the order of derivative, default is 1.

We will now compute the derivatives of the function f defined above at a few points.
> fderiv(f,1)
[1] 1
> fderiv(f,-1)
[1] 0
> fderiv(f,10)
[1] 1
We may also define a function that return the derivative of f at different points.
> df <- function(x) fderiv(f,x)

Even though we are not able to obtain a symbolic function in R for such functions,
we may still plot the graph for visualization.
> x <- seq(-1,1,length=100)
> y <- sapply(x,f)
> plot(x,y,type="l")

> dy <- sapply(x,df)


> plot(x,dy,type="l")
We may use fderiv to perform partial derivatives numerically on multivariable functions.
> f <- function(x,y) x*y
The partial derivative of f with respect to x is
> dfx <- function(x,y)fderiv(function(x)f(x,y),x)
The partial derivative of f with respect to y is
> dfy <- function(x,y)fderiv(function(y)f(x,y),y)
In this case, we may obtain the gradient of f via
> gradf <- function(x,y)matrix(c(dfx(x,y),dfy(x,y)))

However, the package pracma has the function grad that computes the gradient of a function numerically. Here are some of the arguments of the function grad.
grad(f, x0)

• f is a function of several variables.

• x0 is the point at which the gradient is to be computed.

Example. 1. Compute the gradient of

f (x, y) = xy

at the points (1, 2) and (2, 1).


> f <- function(u){
x <- u[1];
y <- u[2];
return(x*y)
}
> grad(f,c(1,2))
[1] 2 1
> grad(f,c(2,1))
[1] 1 2
2. Compute the gradient of
$$f(x,y) = \begin{cases} e^{-(x^2+y^2)} & \text{if } 0 \le x^2+y^2 \le 4 \\ 0 & \text{otherwise} \end{cases}$$

at the points (1, 1) and (0, 0).


> f <- function(u){
x <- u[1];
y <- u[2];
if (and(x^2+y^2>=0,x^2+y^2<=4)) return(exp(-(x^2+y^2)))
else return(0)
}
> grad(f,c(1,1))
[1] -0.2706706 -0.2706706
> grad(f,c(0,0))
[1] 0 0

One may similarly compute the numerical matrix derivative of a multivariable vector-
valued function, by applying the numerical partial derivatives or the gradient on each
component of the function.
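As a rough sketch of this idea (the helper dF below is our own illustration, not a pracma function), we can stack the numerical gradients of the components of the function $\mathbf{F}(x,y,z) = (xyz,\ x+y+z,\ z^2)$ from the earlier example:
> library(pracma)
> F <- function(u){ c(u[1]*u[2]*u[3], u[1] + u[2] + u[3], u[3]^2) }
> dF <- function(u) t(sapply(1:3, function(i) grad(function(v) F(v)[i], u)))
> dF(c(1,2,3))   #each row is the numerical gradient of one component
The rows should be close to (6, 3, 2), (1, 1, 1) and (0, 0, 6), matching the symbolic matrix derivative obtained above.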

3.5 Integration
3.5.1 Definitions and Properties
Single variable integration
Recall that the (definite) integral of a single variable real-valued function $f(x)$ from limits $a$ to $b$,
$$\int_a^b f(x)\,dx,$$
is the area bounded by the function, the $x$-axis, and the limits $a$ and $b$.
By the fundamental theorem of calculus, one way to find the integral is to use the anti-derivative of the function $f$, that is, a function $F$ such that $\frac{d}{dx}F(x) = f(x)$. If $F$ is an anti-derivative of $f$, then
$$\int_a^b f(x)\,dx = F(b) - F(a).$$

Example. Let $f(x) = x$. An anti-derivative of $f(x)$ is $\frac{x^2}{2}$; indeed, $\frac{d}{dx}\frac{x^2}{2} = \frac{2x}{2} = x$. Hence,
$$\int_0^1 x\,dx = \frac{1^2}{2} - \frac{0^2}{2} = \frac{1}{2},$$
which is indeed the area of the triangle under the graph of $f$ between 0 and 1.

Multivariable Integration
Let us start with a function $f$ from $\mathbb{R}^2$ to $\mathbb{R}$. Recall that the graph of $f$ is the set of all points $(x, y, f(x,y))$ in $\mathbb{R}^3$. Then the integral of $f$ over a region $R \subseteq \mathbb{R}^2$, where $R$ is a subset of the domain of $f$, written as $\int_R f\,dA$, is the volume enclosed by $R$ and the function.

More generally, suppose $f$ is a real-valued multivariable function from $\mathbb{R}^n$ to $\mathbb{R}$. Then the graph of $f$ is the set of all points $(x_1, x_2, ..., x_n, f(x_1, x_2, ..., x_n))$ in $\mathbb{R}^{n+1}$. The integral of $f$ over a region $R \subseteq \mathbb{R}^n$, written as $\int_R f\,dV$, is the volume enclosed by $R$ and the function. For $n > 2$, we are unable to plot the graph or visualize it.

Under certain conditions, we are able to solve a multivariable integral analytically by performing iterated single integrals (or iterative integration), that is, by performing a series of single integrals with respect to the variables $x_i$, $i = 1, ..., n$, in some order. Let us again begin with a function from $\mathbb{R}^2$ to $\mathbb{R}$. A bounded set $D$ is x-simple if there exist real numbers $c$ and $d$ such that $D$ is bounded between the two (horizontal) lines $y = c$ and $y = d$, and there exist real-valued single variable (continuous) functions $a$ and $b$ such that for each $y$ in the interval $[c,d]$, the set of $x$ such that $(x,y)$ is in $D$ is bounded by $a(y)$ and $b(y)$, $a(y) \le x \le b(y)$. That is, $D = \{\, (x,y) \mid c \le y \le d,\ a(y) \le x \le b(y) \,\}$. The definition for a bounded set $D$ to be y-simple is analogous, exchanging the roles of $x$ and $y$ in the definition above.

(a) D is x-simple (b) D is y-simple

Example. 1. The simplest examples of x- and y-simple regions are rectangles, $R = \{\, (x,y) \mid a \le x \le b,\ c \le y \le d \,\}$. The functions defining the intervals for each $x$ or $y$ value are constant functions.

2. Let $D$ be the region bounded by the curves $y = x^2$ and $y^2 = x$. Then $D$ is both x- and y-simple. In the perspective of x-simple, the functions are $a(y) = y^2$, $b(y) = \sqrt{y}$, for $0 \le y \le 1$. In the perspective of y-simple, the functions are $c(x) = x^2$, $d(x) = \sqrt{x}$, for $0 \le x \le 1$.

3. Let $D$ be the region bounded by the circle of radius $r$ in the $xy$-plane. Then $D$ is both x- and y-simple. We will just present the case for $D$ being x-simple. The functions are $a(y) = -\sqrt{r^2 - y^2}$, $b(y) = \sqrt{r^2 - y^2}$, $-r \le y \le r$.
2 2 2

If the region $D$ is x-simple, we can then perform the integration with respect to $x$ first, then with respect to $y$. This is because the integral
$$\int_{a(y)}^{b(y)} f(x,y)\,dx$$
is a function of $y$. So, if $F_y(x)$ is an anti-derivative of $f(x,y)$, thinking of it as only a function of $x$ and fixing $y$, then
$$\int_{a(y)}^{b(y)} f(x,y)\,dx = F_y(b(y)) - F_y(a(y)),$$
and hence
$$\int_D f\,dA = \int_c^d \big(F_y(b(y)) - F_y(a(y))\big)\,dy.$$
The algorithm for $D$ being y-simple is analogous, interchanging $x$ and $y$ in the algorithm above.

Example. 1. Let $D$ be the region bounded by the curves $y = x^2$ and $y^2 = x$, and let $f : \mathbb{R}^2 \to \mathbb{R}$ be $f(x,y) = xy$. Find $\int_D f\,dA$. Here we treat $D$ as being x-simple.
$$\int_D f\,dA = \int_0^1\int_{y^2}^{\sqrt{y}} xy\,dx\,dy = \int_0^1 \left[\frac{x^2y}{2}\right]_{x=y^2}^{x=\sqrt{y}} dy = \frac{1}{2}\int_0^1 \big(y^2 - y^5\big)\,dy = \frac{1}{2}\left[\frac{y^3}{3} - \frac{y^6}{6}\right]_0^1 = \frac{1}{2}\left(\frac{1}{3} - \frac{1}{6}\right) = \frac{1}{12}.$$

2. Let $D$ be the region bounded by the curves $y = \sin(x)$ and $y = 0$, for $0 \le x \le 2\pi$. Find $\int_D f\,dA$, where $f$ is the constant function 1, $f(x,y) = 1$. The region $D$ is the union of two y-simple sets. For $0 \le x \le \pi$, we have $0 \le y \le \sin(x)$, and for $\pi \le x \le 2\pi$, we have $\sin(x) \le y \le 0$. So, the integral is split into two iterated integrals.
$$\int_D f\,dA = \int_0^\pi\int_0^{\sin(x)} 1\,dy\,dx + \int_\pi^{2\pi}\int_{\sin(x)}^0 1\,dy\,dx = \int_0^\pi \sin(x)\,dx + \int_\pi^{2\pi} -\sin(x)\,dx = [-\cos(x)]_0^\pi + [\cos(x)]_\pi^{2\pi} = -(-1-1) + (1+1) = 4.$$

3. Evaluate $\int_D y\,dA$, where $D$ is the half disk where $x^2+y^2 \le 1$ and $y \ge 0$.
$$\int_D y\,dA = \int_{-1}^1\int_0^{\sqrt{1-x^2}} y\,dy\,dx = \int_{-1}^1 \left[\frac{y^2}{2}\right]_0^{\sqrt{1-x^2}} dx = \frac{1}{2}\int_{-1}^1 (1-x^2)\,dx = \frac{1}{2}\left[x - \frac{x^3}{3}\right]_{-1}^1 = \frac{1}{2}\left(1 - \frac{1}{3} + 1 - \frac{1}{3}\right) = \frac{2}{3}.$$

4. Let $R$ be the rectangle $[1,3]\times[2,3]$. Find $\int_R xy\,dA$.
$$\int_R xy\,dA = \left(\int_1^3 x\,dx\right)\left(\int_2^3 y\,dy\right) = \left[\frac{x^2}{2}\right]_1^3\left[\frac{y^2}{2}\right]_2^3 = \left(\frac{9}{2} - \frac{1}{2}\right)\left(\frac{9}{2} - \frac{4}{2}\right) = 10.$$

We can apply the idea of iterative integration to more general sets. Let $D(y)$ denote the set of all points in $D$ whose second coordinate is $y$. Suppose for each value of $y$, the set $D(y)$ consists of a finite number of intervals, whose end points are piecewise continuous functions of $y$. Then we can integrate $f(x,y)$ with respect to $x$ over $D(y)$, and obtain a piecewise continuous function of $y$. The same argument applies for $D$ such that for each $x$, $D(x)$ consists of a finite number of intervals, whose end points are continuous functions of $x$.

Example. An annulus is the region bounded by two concentric circles. Suppose the two circles have radii $r_1$ and $r_2$, with $r_1 < r_2$. Then for each $y$ with $|y| \le r_1$, the set $D(y)$ consists of the intervals $\left[-\sqrt{r_2^2 - y^2}, -\sqrt{r_1^2 - y^2}\right]$ and $\left[\sqrt{r_1^2 - y^2}, \sqrt{r_2^2 - y^2}\right]$.

Exercise. Let $R$ be the annulus where the radii of the circles are 1 and 4. Find $\int_R y\,dA$.

3.5.2 Change of Variables


Single variable functions
Let $f$ be a single variable real-valued function, and $S$ a subset of $\mathbb{R}$ contained in the domain of $f$. Suppose $y : S \to \mathbb{R}$ is a single variable real-valued function that is one-to-one and continuously differentiable on $S$ with nonzero derivative, $\frac{d}{dx}y(x) \ne 0$ for all $x \in S$. Then we have the following change of variable formula:
$$\int_S f(y(x))\frac{dy}{dx}(x)\,dx = \int_{y(S)} f(y)\,dy.$$
This is also known as integration by substitution. This idea can be generalized to multivariable functions.

Example. Find
$$\int_0^1 x^2\sqrt{1-x^2}\,dx.$$
Let $x = \sin(t)$, so that $\frac{d}{dt}x(t) = \cos(t)$, $S = (0, \pi/2)$ and $x(S) = (0,1)$. So,
$$\int_0^1 x^2\sqrt{1-x^2}\,dx = \int_0^{\pi/2} \sin^2(t)\sqrt{1-\sin^2(t)}\cos(t)\,dt = \int_0^{\pi/2} \sin^2(t)\cos^2(t)\,dt = \frac{1}{4}\int_0^{\pi/2} (2\sin(t)\cos(t))^2\,dt = \frac{1}{4}\int_0^{\pi/2} \sin^2(2t)\,dt = \frac{1}{8}\int_0^{\pi/2} \big(1 - \cos(4t)\big)\,dt = \frac{\pi}{16} - \frac{1}{32}\left[\sin(4t)\right]_0^{\pi/2} = \frac{\pi}{16}.$$
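As a quick numerical sanity check (using the integrate function discussed later in section 3.5.4), we can compare this with $\pi/16 \approx 0.19635$:
> integrate(function(x) x^2*sqrt(1-x^2), 0, 1)$value
which should return approximately 0.1963495.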

Multivariable functions
A continuously differentiable multivariable vector-valued function $\mathbf{F}$ is called a smooth change of variables over an open set $U \subseteq \mathbb{R}^n$ if $\mathbf{F}$ is one-to-one and its derivative matrix $d\mathbf{F}$ is invertible at each point of $U$. Recall that the derivative matrix is invertible at a point $\mathbf{a}$ if and only if the Jacobian $J\mathbf{F}$ is nonzero at $\mathbf{a}$. Suppose further that $\mathbf{F}$ maps a smoothly bounded set $C$ onto a smoothly bounded set $D$, so that the boundary of $C$ is mapped to the boundary of $D$. If $f$ is a continuous function whose domain contains $D$, then
$$\int_D f\,dV = \int_C (f\circ\mathbf{F})\,|J\mathbf{F}|\,dV.$$
Let us explicitly spell out the details for a function of two variables. Write the components of $\mathbf{F}$ as $\mathbf{F}(u,v) = (x(u,v), y(u,v))$. Then
$$\int_D f(x,y)\,dx\,dy = \int_C f(x(u,v), y(u,v))\left|\frac{\partial x}{\partial u}\frac{\partial y}{\partial v} - \frac{\partial x}{\partial v}\frac{\partial y}{\partial u}\right| du\,dv.$$
Example. 1. Let
$$C = \{\, (u,v) \mid u^2+v^2 \le r^2 \,\} \quad\text{and}\quad D = \{\, (x,y) \mid \left(\tfrac{x}{a}\right)^2 + \left(\tfrac{y}{b}\right)^2 \le r^2 \,\}.$$
$C$ is a disk, and $D$ is the region enclosed by an ellipse. The mapping $\mathbf{F}(u,v) = (au, bv)$, $x = au$, $y = bv$, $a > 0$, $b > 0$, is one-to-one on $C$, with image $\mathbf{F}(C) = D$. The Jacobian is
$$J\mathbf{F} = ab - 0 = ab,$$
which is nonzero since $a, b > 0$. So, $\mathbf{F}$ is a smooth change of variables on $C$. We will use $\mathbf{F}$ to find the area of the ellipse $D$.
$$\text{Area of } D = \int_D 1\,dx\,dy = \int_C 1\cdot|ab|\,du\,dv = ab\int_C 1\,du\,dv,$$
and since the area of the disk is $\pi r^2$ (derived next), we have
$$\text{Area of } D = ab\pi r^2.$$

2. Let us find the area of the disk $D = \{\, (x,y) \mid x^2+y^2 \le R^2 \,\}$. Let $\mathbf{F}(r,\theta) = (r\cos(\theta), r\sin(\theta))$, $x = r\cos(\theta)$, $y = r\sin(\theta)$, and $C = \{\, (r,\theta) \mid 0 < r \le R,\ 0 \le \theta < 2\pi \,\}$. $\mathbf{F}$ is one-to-one on $C$, with image $\mathbf{F}(C) = D$. We have seen that the Jacobian is $J\mathbf{F} = r$. Hence,
$$\text{Area of disk } D = \int_D 1\,dA = \int_0^{2\pi}\int_0^R r\,dr\,d\theta = \int_0^{2\pi} \frac{R^2}{2}\,d\theta = 2\pi\frac{R^2}{2} = \pi R^2.$$
Remark. Strictly speaking, we are not allowed to define $C$ like this, since it is not a smoothly bounded set. There is a way to rigorously evaluate the integral above, but we will not discuss the technicalities here since it will still give the same answer. Readers may refer to the appendix for the discussion on integration on unbounded sets.
3. Let $D$ be the half annulus, $D = \{\, (x,y) \mid r_1^2 \le x^2+y^2 \le r_2^2,\ y \ge 0 \,\}$. Find $\int_D e^{x^2+y^2}\,dA$. Let $\mathbf{F}(r,\theta)$ be as in the previous example, and $C = \{\, (r,\theta) \mid r_1 \le r \le r_2,\ 0 \le \theta \le \pi \,\}$. Then
$$\int_D e^{x^2+y^2}\,dA = \int_0^\pi\int_{r_1}^{r_2} e^{r^2}r\,dr\,d\theta = \left(\int_0^\pi 1\,d\theta\right)\frac{1}{2}\left[e^{r^2}\right]_{r_1}^{r_2} = \frac{\pi\big(e^{r_2^2} - e^{r_1^2}\big)}{2},$$
where in the first equality we used the identity $r^2\cos^2(\theta) + r^2\sin^2(\theta) = r^2$.
4. Let $D$ be the parallelogram in $\mathbb{R}^2$ bounded by the lines
$$y = -x+5, \quad y = -x+2, \quad y = 2x-1, \quad y = 2x-4.$$
Compute
$$\int_D e^x\,dA.$$
First we make a change of variables such that the parallelogram transforms into a rectangle. Define $\mathbf{F}(u,v) = (u+v+1, -u+2v+1)$, $x = u+v+1$, $y = -u+2v+1$, and $C = \{\, (u,v) \mid 0 \le u, v \le 1 \,\}$. Then $\mathbf{F}$ is one-to-one, $\mathbf{F}(C) = D$, and
$$J\mathbf{F}(u,v) = \begin{vmatrix} 1 & 1 \\ -1 & 2 \end{vmatrix} = 3,$$
which shows that $\mathbf{F}$ is indeed a smooth change of variables. Hence,
$$\int_D e^x\,dA = \int_0^1\int_0^1 e^{u+v+1}\cdot 3\,du\,dv = 3e\int_0^1\int_0^1 e^ue^v\,du\,dv = 3e\left(\int_0^1 e^u\,du\right)\left(\int_0^1 e^v\,dv\right) = 3e(e-1)^2.$$

Exercise. Find the volume of the 3-dimensional sphere of radius $R$,
$$\int_{x^2+y^2+z^2 \le R^2} 1\,dV.$$

3.5.3 Probability Density Function


A real-valued function $p : \mathbb{R}^n \to \mathbb{R}$ is a probability density function if

(i) it is defined and nonnegative on the whole $\mathbb{R}^n$, $p(\mathbf{x}) \ge 0$, and

(ii) $\int_{\mathbb{R}^n} p(\mathbf{x})\,dV = 1$.

If $p$ is integrable on a set $D \subseteq \mathbb{R}^n$, then the probability that $\mathbf{x}$ is in $D$ is defined to be
$$\int_D p(\mathbf{x})\,dV.$$
Readers may refer to the appendix for the rigorous definition of integration over unbounded sets. The techniques however are shown in the examples below.
Example. 1. Show that
$$p(x,y) = \frac{1}{\pi}e^{-(x^2+y^2)}$$
is a probability density function, and find the probability that $(x,y)$ is in the region $D = \{\, (x,y) \mid x, y \ge 0 \,\}$, that is, in the first quadrant.

We will first show that $\int_{\mathbb{R}^2} e^{-(x^2+y^2)}\,dx\,dy = \pi$. First, define
$$D_R = \{\, (x,y) \mid x^2+y^2 \le R^2 \,\}, \quad C_R = \{\, (r,\theta) \mid 0 < r \le R,\ 0 \le \theta < 2\pi \,\},$$
and $\mathbf{F}(r,\theta) = (r\cos(\theta), r\sin(\theta))$. Then $\mathbf{F}(C_R) = D_R$, and so
$$\int_{D_R} e^{-(x^2+y^2)}\,dx\,dy = \int_0^{2\pi}\int_0^R e^{-r^2}r\,dr\,d\theta = \pi\big(1 - e^{-R^2}\big).$$
So,
$$\int_{\mathbb{R}^2} e^{-(x^2+y^2)}\,dx\,dy = \lim_{R\to\infty}\int_{D_R} e^{-(x^2+y^2)}\,dx\,dy = \lim_{R\to\infty}\pi\big(1 - e^{-R^2}\big) = \pi.$$
Clearly $p$ is defined and nonnegative on the whole $\mathbb{R}^2$. Hence, $p$ is a probability density function. By observing that $p$ is symmetric in all 4 quadrants, we can conclude that the probability that $(x,y)$ is in the first quadrant is $1/4$. We will show this analytically. Let
$$C_R^+ = \{\, (r,\theta) \mid 0 < r \le R,\ 0 \le \theta \le \tfrac{\pi}{2} \,\}, \quad\text{and}\quad D_R^+ = \{\, (x,y) \mid x^2+y^2 \le R^2,\ x, y \ge 0 \,\}.$$
Then
$$\int_D \frac{1}{\pi}e^{-(x^2+y^2)}\,dx\,dy = \frac{1}{\pi}\lim_{R\to\infty}\int_{D_R^+} e^{-(x^2+y^2)}\,dx\,dy = \frac{1}{\pi}\int_0^{\pi/2} d\theta\,\lim_{R\to\infty}\int_0^R e^{-r^2}r\,dr = \frac{1}{4}\lim_{R\to\infty}\big(1 - e^{-R^2}\big) = \frac{1}{4}.$$

2. Let
$$p(x,y) = \begin{cases} \dfrac{2x+c-y}{4} & \text{if } 0 \le x \le 1 \text{ and } 0 \le y \le 2, \\ 0 & \text{otherwise.} \end{cases}$$
Find $c$ so that $p$ is a probability density function.
$$\int_{\mathbb{R}^2} p(x,y)\,dx\,dy = \int_{[0,1]\times[0,2]} \frac{2x+c-y}{4}\,dx\,dy = \int_0^2\int_0^1 \frac{2x+c-y}{4}\,dx\,dy = \frac{1}{4}\int_0^2 \left[x^2 + xc - xy\right]_0^1 dy = \frac{1}{4}\int_0^2 (1+c-y)\,dy = \frac{1}{4}\left[y(1+c) - \frac{y^2}{2}\right]_0^2 = \frac{1}{4}\big(2(1+c) - 2\big) = \frac{c}{2}.$$
So $p$ integrates to 1 if and only if $\frac{c}{2} = 1$, or $c = 2$. We also need to check that $p$ is nonnegative. Since $x \ge 0$ and $y \le 2$ (or $2-y \ge 0$),
$$p(x,y) = \frac{2x+2-y}{4} \ge \frac{0}{4} = 0.$$
Exercise. Let $a, b > 0$. Show that
$$\int_{x^2+y^2 \le R^2} e^{-(ax^2+by^2)}\,dx\,dy = \int_{\frac{u^2}{a}+\frac{v^2}{b} \le R^2} e^{-(u^2+v^2)}\frac{1}{\sqrt{ab}}\,du\,dv,$$
and evaluate
$$\int_{\mathbb{R}^2} e^{-(ax^2+by^2)}\,dA.$$
Suggest how we can modify $e^{-(ax^2+by^2)}$, if necessary, so that it is a probability density function.
3.5.4 Integration in R
Single integral
To perform indefinite integrals in R, we need the package mosaicCalc.
> install.packages("mosaicCalc")
> library(mosaicCalc)

The function antiD can be used to find anti-derivative of a function, that is, to perform
indefinite integral.

Example. Find the indefinite integral
$$\int \sin(x)\,dx.$$

> F <- antiD(sin(x)∼x)


> F
function (x, C = 0)
-cos(x) + C

To find the definite integral
$$\int_0^\pi \sin(x)\,dx,$$
we substitute the limits into the function and take the difference.
> F(pi)-F(0)
[1] 2

However, antiD may not be able to return a symbolic function all the time, even for those whose symbolic anti-derivative exists. For example, the anti-derivative of $2xe^{x^2}$ is $e^{x^2}$; in this case antiD will instead perform numerical (definite) integration.
> F <- antiD(2*x*exp(x^2)∼x)
> F
function (x, C = 0)
{
numerical integration(.newf, .wrt, as.list(match.call())[-1],
formals(), from, ciName = intC, .tol)
}
<environment: 0×00000219d8bf6050>
However, we may still use it to perform definite integrals.
> F(1)-F(0)
[1] 1.718282
> exp(1)-1
[1] 1.718282

We may also use the function integrate to perform definite integral.


> f <- function(x) 2*x*exp(x^2)
> integrate(f,0,1)
1.718282 with absolute error < 1.9e-14
The function integrate returns the numerical answer of the integral, as well as the error.
If we want the function to only return the numerical value, we type instead
> integrate(f,0,1)$value
[1] 1.718282

Iterative integral
We will use the function integrate to perform iterated integral over x-simple or y-simple
domains.

Example. Let $D$ be the region bounded by the curves $y = x^2$ and $y^2 = x$, and let $f : \mathbb{R}^2 \to \mathbb{R}$ be $f(x,y) = xy$. Find $\int_D f\,dA$.

By thinking of $D$ as x-simple, we have the following equality,
$$\int_D f\,dA = \int_0^1\int_{y^2}^{\sqrt{y}} xy\,dx\,dy.$$

Then to evaluate the integral in R, we simply type


> f <- function(x,y) x*y
> integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y), y^2, sqrt(y))$value
})
},0, 1)

Here is the explanation for the code. Let us match the various parts of the code to the corresponding parts of the double integral
$$\int_0^1\int_{y^2}^{\sqrt{y}} xy\,dx\,dy.$$

> integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y), y^2, sqrt(y))$value
})
},0, 1)

When we run the code, the function integrate will partition $[0,1]$, $P = \{0 = y_0 < y_1 < \cdots < y_n = 1\}$, and pass the vector of partition points $y = (y_0, y_1, ..., y_n)$ to the outer function. Then for each value $y_i$, $i = 0, ..., n$, it will evaluate integrate(function(x) f(x,y_i), y_i^2, sqrt(y_i))$value, viewing the function $f(x,y)$ as only a function of $x$, with the $y$ component fixed at $y_i$. Then the Riemann (or Darboux) sum is taken over this partition $P$. Readers may refer to appendix 3.7.5 for the details.
In general, if we want to compute the integral $\int_D f\,dA$, where $D$ is x-simple with smooth functions $a(y)$, $b(y)$ defining the end points of $x$, and $c$, $d$ defining the end points of $y$, we type
> integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y), a(y), b(y))$value
})
},c, d)

Exercise. Write a code to perform iterative integral over y-simple domain.

Example. Show that for $-1 < \rho < 1$,
$$f(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}}\,e^{-\frac{1}{2(1-\rho^2)}\left[x^2 - 2\rho xy + y^2\right]}$$
is a probability density function. Let $\rho = 0.2$. Find the probability that $(x,y)$ is in the region $0 \le x \le 1$, $1 \le y \le 2$.

f <- function(x,y,rho){
(1/(2*pi*sqrt(1-rho^2)))*exp(-(x^2-2*rho*x*y+y^2)/(2*(1-rho^2)))
}
frho <- function(rho) {
integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y,rho), -10, 10)$value
})
}, -10, 10)$value
}
(The code above was written in an R script.)

> x <- seq(-0.99,0.99,length=100)


> y <- sapply(x,frho)
> plot(x,y,type = "l")
Observe that, except near the end points, the function frho returns 1 for all $-1 < \rho < 1$. This shows that $f(x,y)$ is a probability density function for all $-1 < \rho < 1$.

The probability is
> integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y,rho=0.2), 0, 1)$value
})
}, 1, 2)$value
[1] 0.05170924

3.6 System of Differential Equations


3.6.1 Introduction and Definitions
A differential equation is an equation involving differentiation. The order of the equation
is the highest derivative that appears in the equation.
(i) It is an ordinary differential equation if the functions involved are single variable
functions.
(ii) It is a partial differential equation if the equation involves multivariable functions
and their partial derivatives.
(iii) It is linear if the unknown functions and their derivatives are only acted upon by
multiplying by known functions (or constant) and additions.
(iv) It is nonlinear if it is not linear.
Example. 1. y ′ (t) = sin(y(t)) is a nonlinear first order ordinary differential equation.
2. $y''(t) + a(t)y(t) = b(t)$, for some fixed functions $a(t)$ and $b(t)$, is a linear second order ordinary differential equation.
3. y ′ (t) + p(t)y(t) = q(t)y(t)n is a nonlinear first order ordinary differential equation.
4. $\frac{\partial^2 y}{\partial s^2}(s,t) + \frac{\partial^2 y}{\partial t^2}(s,t) + \frac{\partial^2 y}{\partial s\partial t}(s,t) = 0$ is a second order linear partial differential equation.
(s, t) + ∂s∂t
(s, t) = 0 is a second order linear partial differential equa-
tion.
A system of differential equations consists of a few differential equations. The order
of the system is the highest order of all the differential equations. The system is ordinary
if all the equations involves derivative of only one variable; otherwise, it is a partial
differential system. The system is linear if all the equations are; otherwise, it is nonlinear.
Example. The SIR model

s′ (t) = −βs(t)i(t),
i′ (t) = βs(t)i(t) − γi(t),
r′ (t) = γi(t),

where β and γ are some fixed constants, is a first order nonlinear system of ordinary
differential equations.
Readers may visit https://www.maa.org/press/periodicals/loci/joma/the-sir-model-for-spre
for details.
In this course we will only focus our attention on first order systems of ordinary differential equations (first order SDE in short) that can be written as
$$\begin{aligned} y_1'(t) &= F_1(y_1, y_2, ..., y_n) \\ y_2'(t) &= F_2(y_1, y_2, ..., y_n) \\ &\;\;\vdots \\ y_n'(t) &= F_n(y_1, y_2, ..., y_n) \end{aligned}$$
for some (multivariable) functions $F_1, F_2, ..., F_n$.

For example, for the SIR model, we have
$$y_1 = s, \quad y_2 = i, \quad y_3 = r,$$
$$F_1(y_1,y_2,y_3) = -\beta y_1y_2, \quad F_2(y_1,y_2,y_3) = \beta y_1y_2 - \gamma y_2, \quad F_3(y_1,y_2,y_3) = \gamma y_2.$$
Note that the functions $F_i$ are assumed to be functions of the $y_i$ only; they do not (explicitly) depend on $t$ (even though each $y_i$ is a function of $t$). For example, we will not be dealing with cases such as $F_1(y_1,y_2,y_3,t) = \sin(t)y_1 + y_2y_3$, since it involves $t$.

The initial conditions of a first order SDE are the values that the functions yi take at
a given point,

y1 (t0 ) = a1 , y2 (t0 ) = a2 , ..., yn (t0 ) = an for some real numbers a1 , a2 , ..., an ∈ R.

3.6.2 Solving System of Ordinary Differential Equations in R


Given a first order system of ordinary differential equations,
$$\begin{aligned} y_1'(t) &= F_1(y_1, y_2, ..., y_n) \\ y_2'(t) &= F_2(y_1, y_2, ..., y_n) \\ &\;\;\vdots \\ y_n'(t) &= F_n(y_1, y_2, ..., y_n) \end{aligned}$$
the aim is to find functions $y_1, y_2, ..., y_n$ that satisfy the equations above; the set $\{y_1, y_2, ..., y_n\}$ is called a solution of the system. If the SDE has an initial condition, then we further insist that the functions $y_i$ satisfy the initial conditions. For example, an initial condition for the SIR model would be

s(0) = 1 - percentage of population being infected


i(0) = the percentage of population being infected
r(0) = 0.

Here we are assuming that it is the beginning of the outbreak, at t = 0, and a small
proportion of the population has already been infected, but no one has yet recovered from the
disease.
To find the solution of a SDE with initial conditions in R, we need the package deSolve.
> install.packages("deSolve")
> library(deSolve)

The code to (numerically) solve a first order SDE (in an R script) is as follows.


SDE <- function(t, state, parameters){
y1 <- state[1] #these are the functions to be solved
y2 <- state[2]
y3 <- state[3]
#y4...yn
p1 <- parameters["p1"] #these are the parameters F1,F2,...,Fn need
p2 <- parameters["p2"]
#p3...pk
dy1 <- F1(y1,y2,...,yn) #the SDE
dy2 <- F2(y1,y2,...,yn)
dy3 <- F3(y1,y2,...,yn)
#dy4...dyn
Sol <- c(dy1,dy2,dy3,...,dyn)
list(Sol)
}

parameters <- c(p1 = , p2 = ,p3 = ,..., pk = )


#put in numbers for the parameters
state <- c(y1 = a1, y2 = a2,..., yn = an)
#the initial conditions yi(t0) = ai
times <- seq(start of t, end of t, length = )
#the start of t must be t0

out <- ode(y = state, times = times, func = SDE, parms = parameters)

To visualize the result, we plot a graph.


> plot(out)

Example. Let’s solve the SIR model with β = 0.7 and γ = 0.1, with initial conditions

s(0) = 1 − 0.0001, i(0) = 0.0001, r(0) = 0.

sirmodel <- function(t, state, parameters){


s <- state[1]
i <- state[2]
r <- state[3]
beta <- parameters["beta"]
gamma <- parameters["gamma"]
ds <- -beta*s*i
di <- beta*s*i-gamma*i
dr <- gamma*i
Sol <- c(ds,di,dr)
list(Sol)
}
parameters <- c(beta = 0.7, gamma = 0.1)
state <- c(s = 1-0.0001, i = 0.0001, r=0)
times <- seq(0, 50, length=1000)

out <- ode(y = state, times = times, func = sirmodel, parms = parameters)

Here we want to plot all three functions s, i, r in a single graph, so we instead use
plot(out[,"time"],out[,"s"],type="l",col="green",xlab="Days", ylab="")
lines(out[,"time"],out[,"i"],col="red")
lines(out[,"time"],out[,"r"],col="blue")
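To tell the three curves apart, we can optionally add a legend (the placement "topright" is just one possible choice):
legend("topright", c("s", "i", "r"), col = c("green", "red", "blue"), lwd = 1)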
3.7 Appendix for Chapter 3
3.7.1 Sequences and Limits
A sequence is a list of objects in which repetitions are allowed and order matters. The objects are usually indexed by the natural numbers. Hence, a sequence can be viewed as a function whose domain is either the set of non-negative integers $\{0, 1, 2, ...\}$ or the positive integers $\{1, 2, 3, ...\}$. It is customary to denote the elements of a sequence by a letter with the index in the subscript, for example, $(a_1, a_2, a_3, ...)$.
Example. 1. Let $a_n = \frac{1}{n}$ for $n \ge 1$, that is, the sequence is $\left(1, \frac{1}{2}, \frac{1}{3}, ...\right)$.

2. Define $a_n$ recursively, $a_{n+1} = a_n + a_{n-1}$, $a_0 = 0$, $a_1 = 1$. This is the Fibonacci sequence, $(0, 1, 1, 2, 3, 5, 8, ...)$. The formula for the $n$-th term is $a_n = \frac{\varphi^n - \psi^n}{\sqrt{5}}$, where $\varphi = \frac{1+\sqrt{5}}{2}$, $\psi = \frac{1-\sqrt{5}}{2}$.

3. Let $a_n = (-1)^n$, for $n \ge 1$. The sequence is $(-1, 1, -1, 1, ...)$.

4. Let $a_n = \left(1 + \frac{1}{n}\right)^n$. Then the sequence is $\left(2, \left(\frac{3}{2}\right)^2, \left(\frac{4}{3}\right)^3, \left(\frac{5}{4}\right)^4, ...\right)$.

The limit of a sequence is a real number that the values $a_n$ are close to for large values of $n$. That is, $a$ is the limit of $a_n$ if, as $n$ becomes bigger, the difference $|a - a_n|$ becomes smaller. We shall denote this as
$$\lim_{n\to\infty} a_n = a.$$
Here is the precise definition. A sequence $(a_n)$ is said to converge to a real number $a$ if for every $\varepsilon > 0$, there is an $N > 0$ such that for all $n > N$, $|a - a_n| < \varepsilon$. The number $a$ is called the limit of the sequence. If the sequence does not converge to some real number, then it is said to diverge.
Example. 1. The sequence $\left(1, \frac{1}{2}, \frac{1}{3}, ...\right)$ converges to 0, $\lim_{n\to\infty}\frac{1}{n} = 0$.

2. The Fibonacci sequence diverges.

3. The sequence $(-1, 1, -1, 1, ..., (-1)^n, ...)$ diverges.

4. The sequence $\left(2, \left(\frac{3}{2}\right)^2, \left(\frac{4}{3}\right)^3, \left(\frac{5}{4}\right)^4, ..., \left(1+\frac{1}{n}\right)^n, ...\right)$ converges to $e$, the Euler number, $\lim_{n\to\infty}\left(1+\frac{1}{n}\right)^n = e$.
Remark. We may allow the limit $a$ to be $\pm\infty$. In this case, we must modify the definition as such. We say that $a_n$ converges to $\infty$ if for every (large) number $M > 0$, there is an $N$ such that for all $n > N$, $a_n > M$. Similarly, we say that $a_n$ converges to $-\infty$ if for every (large negative) number $M < 0$, there is an $N$ such that for all $n > N$, $a_n < M$. For example, the Fibonacci sequence converges to $\infty$. However, the sequence $(-1, 1, -1, 1, ..., (-1)^n, ...)$ is still divergent.
Theorem (Uniqueness of limit). If limn→∞ an = a and limn→∞ an = b, then a = b.
Proof. We will show that for every $\varepsilon > 0$, $|a-b| < \varepsilon$. Given any $\varepsilon > 0$, since $\lim_{n\to\infty} a_n = a$ ($\lim_{n\to\infty} a_n = b$, respectively), we can find $N_1$ ($N_2$, respectively) such that for all $n > N_1$ ($n > N_2$, respectively),
$$|a_n - a| < \frac{\varepsilon}{2} \quad \left(|a_n - b| < \frac{\varepsilon}{2}, \text{ respectively}\right).$$
Let $N = \max\{N_1, N_2\} + 1$. Then by the triangle inequality,
$$|a - b| = |(a - a_N) - (b - a_N)| \le |a - a_N| + |b - a_N| < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.$$

3.7.2 Introduction to Functions


A single variable real-valued function f (x) is said to be continuous at the point a in its
domain if for every sequence an converging to a, we have
$$\lim_{n\to\infty} f(a_n) = f(a).$$
The function f is said to be continuous on a subset S of the domain of f if it is continuous
at every point in S. The function f is said to be continuous if it is continuous on its
domain.
Example. 1. $f(x) = 2e^x + 1$ is continuous.

2. $f(x) = \frac{\sin(x)}{x}$ is continuous on its domain, $\mathbb{R}\setminus\{0\}$.

3. $f(x) = \sin\left(\frac{1}{x}\right)$ is continuous everywhere except 0, i.e., on $\mathbb{R}\setminus\{0\}$.


A function f is said to be right continuous at a point a in its domain if for every


sequence an converging to a with an > a for every n > 0, we have
$$\lim_{n\to\infty} f(a_n) = f(a).$$
It is said to be left continuous at a point $a$ in its domain if for every sequence $a_n$ converging to $a$ with $a_n < a$ for every $n > 0$, we have
$$\lim_{n\to\infty} f(a_n) = f(a).$$

Example. 1. Define the function f on R to be f (x) = n for all x such that n ≤ x <
n + 1. Then f is right continuous.
2. If we instead define f to be f (x) = n for all x such that n − 1 < x ≤ n. Then f is
left continuous.
An open interval is the set $\{\, x \in \mathbb{R} \mid a < x < b \,\}$. Here we allow $a = -\infty$ and $b = \infty$. It is usually denoted as $(a,b)$. A closed interval is the set $\{\, x \in \mathbb{R} \mid a \le x \le b \,\}$, that is, it includes the points $a$ and $b$. It is usually denoted as $[a,b]$.

A function $f$ is said to be piecewise continuous if there is a partition of its domain into intervals, $\mathrm{dom}(f) = \bigcup_{n=1}^\infty I_n$, such that $f$ is continuous on each interval $I_n$ excluding the endpoints.

Example.
$$f(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases}$$
is a piecewise continuous function as it is continuous on $(-\infty, 0)$ and $(0, \infty)$.
Theorem (ε − δ definition for continuity). A function f (x) is continuous at a if and
only if for every ε > 0, there is a δ > 0 such that for all x such that |x − a| < δ,
|f (x) − f (a)| < ε.
Using this theorem, we can define continuity for multivariable vector-valued functions. A function $\mathbf{F} : \mathbb{R}^n \to \mathbb{R}^m$ is continuous at $\mathbf{a} \in \mathbb{R}^n$ if for every $\varepsilon > 0$, there is a $\delta > 0$ such that for all $\mathbf{x}$ with $\|\mathbf{x} - \mathbf{a}\| < \delta$, $\|\mathbf{F}(\mathbf{x}) - \mathbf{F}(\mathbf{a})\| < \varepsilon$.
3.7.3 Derivative
A single variable real-valued function $f$ is said to be continuously differentiable on an open set $U$ in its domain if its derivative $f'$ is continuous on $U$. A multivariable vector-valued function $\mathbf{F}$ is said to be continuously differentiable on an open set $U$ in its domain if all its partial derivatives are continuous on $U$. A continuously differentiable function is called a $C^1$ function.

A function $\mathbf{F} = (f_i) : \mathbb{R}^n \to \mathbb{R}^m$ is a $C^k$ function on an open set $U$ if the $r$-th order mixed partial derivatives of its components
$$\frac{\partial^r}{\partial x_1^{r_1}\partial x_2^{r_2}\cdots\partial x_n^{r_n}} f_i, \quad r_1 + r_2 + \cdots + r_n = r, \quad i = 1, ..., m,$$
are defined and continuous on $U$ for all $0 \le r \le k$. $\mathbf{F}$ is a smooth function, or a $C^\infty$ function, if it is a $C^k$ function for every positive integer $k$; that is, all the mixed partial derivatives of its components are continuously differentiable.

3.7.4 Differentiation Rules


Properties of differentiation
(i) $[cf(x)]' = cf'(x)$, $\quad [f(x) \pm g(x)]' = f'(x) \pm g'(x)$.
(ii) $[f(x)g(x)]' = f'(x)g(x) + f(x)g'(x)$, $\quad \left[\dfrac{f(x)}{g(x)}\right]' = \dfrac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}$.
(iii) $[f(g(x))]' = f'(g(x))g'(x)$.

Some common differentiation formulas
(i) $\dfrac{d}{dx}(a^x) = a^x\ln a$ $(a > 0)$, $\quad \dfrac{d}{dx}(x^a) = ax^{a-1}$.
(ii) $\dfrac{d}{dx}e^x = e^x$, $\quad \dfrac{d}{dx}\ln|x| = \dfrac{1}{x}$.
(iii) $\dfrac{d}{dx}\sin x = \cos x$, $\quad \dfrac{d}{dx}\cos x = -\sin x$, $\quad \dfrac{d}{dx}\tan x = \sec^2 x$,
$\dfrac{d}{dx}\cot x = -\csc^2 x$, $\quad \dfrac{d}{dx}\sec x = \sec x\tan x$, $\quad \dfrac{d}{dx}\csc x = -\csc x\cot x$.
(iv) $\dfrac{d}{dx}\sin^{-1} x = \dfrac{1}{\sqrt{1-x^2}}$, $\quad \dfrac{d}{dx}\tan^{-1} x = \dfrac{1}{1+x^2}$, $\quad \dfrac{d}{dx}\sec^{-1} x = \dfrac{1}{x\sqrt{x^2-1}}$.

3.7.5 Introduction to Integration


Suppose $a, b$ are real numbers such that $a < b$. A partition of the closed interval $[a,b] = \{\, x \in \mathbb{R} \mid a \le x \le b \,\}$ is a finite ordered subset $P$ having the form
$$\{a = x_0 < x_1 < \cdots < x_n = b\}.$$
Let $f$ be a single variable real-valued function, defined on a closed interval $[a,b]$. The upper Darboux sum of $f$ with respect to $P$ is the sum
$$U(f,P) = \sum_{i=1}^n \sup_{x\in[x_{i-1},x_i]} f(x)\,(x_i - x_{i-1}),$$
and the lower Darboux sum is
$$L(f,P) = \sum_{i=1}^n \inf_{x\in[x_{i-1},x_i]} f(x)\,(x_i - x_{i-1}).$$

(Figure: upper Darboux sum of $f(x) = e^x$ on the interval $[-2, 2]$.)

(Figure: lower Darboux sum of $f(x) = e^x$ on the interval $[-2, 2]$.)

Readers may visit the website https://www.geogebra.org/m/qe6jMAK2 for a visualization of the upper and lower Darboux sums of a function.
The upper Darboux integral of a function $f$ over the interval $[a,b]$ is
$$U(f) = \inf\{\, U(f,P) \mid P \text{ is a partition of } [a,b] \,\},$$
and the lower Darboux integral is
$$L(f) = \sup\{\, L(f,P) \mid P \text{ is a partition of } [a,b] \,\}.$$
Here the infimum and supremum are taken over all possible partitions of $[a,b]$.

A function $f$ is said to be Darboux integrable over the closed interval $[a,b]$ if $L(f) = U(f)$. In this case, we denote the integral as
$$\int_a^b f(x)\,dx = L(f) = U(f).$$

Let $[a,b]$ be a closed interval. Divide the interval into $n$ subintervals, each of length $\delta_nx = \frac{b-a}{n}$. Let $x_i = a + i\,\delta_nx$, that is, $x_0 = a$, $x_1 = a + \delta_nx$, ..., $x_n = a + n\,\delta_nx = b$. Then the left Riemann sum is
$$\sum_{i=1}^n f(x_{i-1})\,\delta_nx,$$
the mid-point Riemann sum is
$$\sum_{i=1}^n f\big((x_{i-1}+x_i)/2\big)\,\delta_nx,$$
and the right Riemann sum is
$$\sum_{i=1}^n f(x_i)\,\delta_nx.$$
A function $f$ is said to be Riemann integrable if
$$\lim_{n\to\infty}\sum_{i=1}^n f(x_{i-1})\,\delta_nx = \lim_{n\to\infty}\sum_{i=1}^n f\big((x_{i-1}+x_i)/2\big)\,\delta_nx = \lim_{n\to\infty}\sum_{i=1}^n f(x_i)\,\delta_nx.$$
In this case, we denote the common value of these limits as
$$\int_a^b f(x)\,dx.$$
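As a small illustration in R (our own sketch), we can approximate $\int_0^1 x^2\,dx = 1/3$ with the three Riemann sums:
> f <- function(x) x^2
> a <- 0; b <- 1; n <- 1000
> dx <- (b - a)/n
> x <- a + (0:n)*dx
> sum(f(x[1:n]))*dx                     #left Riemann sum
> sum(f((x[1:n] + x[2:(n+1)])/2))*dx    #mid-point Riemann sum
> sum(f(x[2:(n+1)]))*dx                 #right Riemann sum
All three values should be close to 1/3 for large n.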

Theorem. A function f is Darboux integrable if and only if it is Riemann integrable.

Hence, we will just say that a function is integrable over [a, b] if it is either Darboux
or Riemann integrable over [a, b].

Theorem (Continuous functions are integrable). Every continuous function f defined


on [a, b] is integrable.

Theorem (Piecewise continuous functions are integrable). If f is a piecewise continuous


function on [a, b], then f is integrable on [a, b].
Theorem (Properties of integration). Let $f$ and $g$ be functions defined on $[a,b]$.
(i) If $f$ is integrable on $[a,b]$ and $\alpha$ is a real number, then $\alpha f$ is integrable with $\int_a^b \alpha f(x)\,dx = \alpha\int_a^b f(x)\,dx$.
(ii) If $f$ and $g$ are integrable on $[a,b]$, then $f+g$ is integrable with $\int_a^b (f+g)(x)\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx$.
(iii) If $c$ is a number in $[a,b]$, $a < c < b$, and $f$ is integrable on $[a,c]$ and $[c,b]$, then $f$ is integrable on $[a,b]$ with $\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx$. The converse is true too.
Theorem (Fundamental theorem of calculus I). If $F$ is a continuous function on $[a,b]$ that is differentiable on $(a,b)$, and if $f = \frac{d}{dx}F$ is integrable on $[a,b]$, then
$$\int_a^b f(x)\,dx = F(b) - F(a).$$

Theorem (Fundamental theorem of calculus II). Let $f$ be an integrable function on $[a,b]$. For any $x$ in $[a,b]$, let
$$F(x) = \int_a^x f(t)\,dt.$$
Then $F$ is continuous on $[a,b]$. If $f$ is continuous at $x_0$ in $(a,b)$, then $F$ is differentiable at $x_0$ and
$$\frac{d}{dx}F(x_0) = f(x_0).$$

3.7.6 Integration Rules and Techniques


1. (Integration by parts) If $u$ and $v$ are continuous functions on $[a,b]$ that are differentiable on $(a,b)$, and if $\frac{d}{dx}u = u'$ and $\frac{d}{dx}v = v'$ are integrable on $[a,b]$, then
$$\int_a^b u(x)v'(x)\,dx = [u(x)v(x)]_a^b - \int_a^b u'(x)v(x)\,dx.$$

2. (Change of variable) Let $u$ be a continuously differentiable function on an open interval $J$, and let $I$ be an open interval such that $u(J) \subseteq I$. If $f$ is continuous on $I$, then $f\circ u$ is continuous on $J$ and
$$\int_a^b f\circ u(x)\,u'(x)\,dx = \int_{u(a)}^{u(b)} f(u)\,du.$$

Basic integration formulas

(i) $\displaystyle\int (af(x) + bg(x))\,dx = a\int f(x)\,dx + b\int g(x)\,dx$.

(ii) $\displaystyle\int x^a\,dx = \frac{x^{a+1}}{a+1} + C$ for $a \ne -1$, $\quad\displaystyle\int \frac{1}{x}\,dx = \ln|x| + C$, $\quad\displaystyle\int \frac{f'(x)}{f(x)}\,dx = \ln|f(x)| + C$.

(iii) $\displaystyle\int e^x\,dx = e^x + C$, $\quad\displaystyle\int f'(x)e^{f(x)}\,dx = e^{f(x)} + C$, $\quad\displaystyle\int a^x\,dx = \frac{a^x}{\ln a} + C$.

Integration formulas for trigonometric functions

(i) $\displaystyle\int \sin x\,dx = -\cos x + C$, $\quad\displaystyle\int \cos x\,dx = \sin x + C$, $\quad\displaystyle\int \sec^2 x\,dx = \tan x + C$,
$\displaystyle\int \csc^2 x\,dx = -\cot x + C$, $\quad\displaystyle\int \sec x\tan x\,dx = \sec x + C$, $\quad\displaystyle\int \csc x\cot x\,dx = -\csc x + C$,
$\displaystyle\int \tan x\,dx = \ln|\sec x| + C$, $\quad\displaystyle\int \cot x\,dx = \ln|\sin x| + C$,
$\displaystyle\int \sec x\,dx = \ln|\sec x + \tan x| + C$, $\quad\displaystyle\int \csc x\,dx = \ln|\csc x - \cot x| + C$.

(ii) $\displaystyle\int \frac{1}{\sqrt{1-x^2}}\,dx = \sin^{-1}x + C$, $\quad\displaystyle\int \frac{1}{1+x^2}\,dx = \tan^{-1}x + C$, $\quad\displaystyle\int \frac{1}{x\sqrt{x^2-1}}\,dx = \sec^{-1}x + C$.

3.7.7 Integration over Unbounded Sets


A set $D \subseteq \mathbb{R}^n$ is said to be smoothly bounded if it is a closed bounded set whose boundary is the union of a finite number of graphs of continuously differentiable functions
$$x_i = b(x_1, ..., x_{i-1}, x_{i+1}, ..., x_n).$$
To integrate a function $f$ over an unbounded set $D$, we must assume that there exists an increasing sequence of smoothly bounded sets
$$D_1 \subseteq D_2 \subseteq \cdots \subseteq D_k \subseteq \cdots, \quad k \ge 1,$$
such that $D$ is the union,
$$D = \bigcup_{k\ge 1} D_k.$$

Example. (a) The first quadrant $D \subseteq \mathbb{R}^2$ consists of all points $D = \{\, (x,y) \mid x, y \ge 0 \,\}$. It is the union of the sequence $D_k = \{\, (x,y) \mid 0 \le x, y \le k \,\}$ of sets.

(b) The upper half space $D = \{\, (x_1, x_2, ..., x_n) \mid x_1, x_2, ..., x_{n-1} \in \mathbb{R},\ x_n \ge 0 \,\} \subseteq \mathbb{R}^n$ is the union of the rectangular prisms $D_k = \{\, (x_1, x_2, ..., x_n) \mid -k \le x_1, x_2, ..., x_{n-1} \le k,\ 0 \le x_n \le k \,\}$.

(c) The whole space $\mathbb{R}^n$ is the union of the balls of radius $k$, $D_k = \{\, \mathbf{x} \in \mathbb{R}^n \mid \|\mathbf{x}\| \le k \,\}$.

Let $D \subseteq \mathbb{R}^n$ be an unbounded set. Define $D(k) = D \cap S_k$, where $S_k = \{\, \mathbf{x} \mid \|\mathbf{x}\| \le k \,\}$ is the ball of radius $k$. We say that $f$ is integrable over $D$ if for any increasing sequence of smoothly bounded sets $D_k$ such that $D = \bigcup_{k\ge 1} D_k$, the sequences
$$\int_{D_k} f\,dV \quad\text{and}\quad \int_{D(k)} f\,dV$$
converge and are equal. The limit of these sequences is called the integral of $f$ over $D$, and we denote it with the usual notation
$$\int_D f\,dV = \lim_{k\to\infty}\int_{D_k} f\,dV = \lim_{k\to\infty}\int_{D(k)} f\,dV.$$
Theorem. Let $D$ be an unbounded set, and $f$ a continuous function on $D$ whose absolute value is integrable over $D$,
$$\int_D |f|\,dV \text{ exists.}$$
Then for any increasing sequence of smoothly bounded sets $D_k$ whose union is $D$, the limit of integrals
$$\lim_{k\to\infty}\int_{D_k} f\,dV \text{ exists,}$$
and this limit is the same for all such sequences.


3.8 Exercise 3
1. Determine whether the following mappings are linear transformations. Write down the standard matrix for each of the linear transformations.

(a) $T_1 : \mathbb{R}^2 \to \mathbb{R}^2$,
$$T_1\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x+y \\ x-y \end{pmatrix} \quad\text{for all } \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^2.$$

(b) $T_2 : \mathbb{R}^2 \to \mathbb{R}^2$,
$$T_2\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \frac{x}{2} \\ 0 \end{pmatrix} \quad\text{for all } \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^2.$$

(c) $T_3 : \mathbb{R}^2 \to \mathbb{R}^3$,
$$T_3\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x+y \\ x \\ y \end{pmatrix} \quad\text{for all } \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^2.$$

(d) $T_4 : \mathbb{R}^3 \to \mathbb{R}^3$,
$$T_4\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ y-x \\ y-z \end{pmatrix} \quad\text{for all } \begin{pmatrix} x \\ y \\ z \end{pmatrix} \in \mathbb{R}^3.$$

(e) $T_5 : \mathbb{R}^n \to \mathbb{R}$, $T_5(\mathbf{x}) = \mathbf{x}\cdot\mathbf{y}$ for all $\mathbf{x} \in \mathbb{R}^n$, where $\mathbf{y} = (y_1, y_2, ..., y_n)^T$ is a fixed vector in $\mathbb{R}^n$.

(f) $T_6 : \mathbb{R}^n \to \mathbb{R}$, $T_6(\mathbf{x}) = \mathbf{x}\cdot\mathbf{x}$ for all $\mathbf{x} \in \mathbb{R}^n$.

2. Express the following quadratic forms as
$$Q(x_1, x_2, ..., x_n) = a_1y_1^2 + a_2y_2^2 + \cdots + a_ny_n^2$$
for some $y_i$, $i = 1, ..., n$, depending on $x_1, x_2, ..., x_n$.

(a) $Q(x_1,x_2,x_3) = 7x_1^2 + 6x_2^2 + 5x_3^2 - 4x_1x_2 + 4x_2x_3$.

(b) $Q(x_1,x_2,x_3) = 2x_1^2 + 2x_2^2 + 2x_3^2 - 2x_1x_3$.

3. Use the outer function in R to define the following matrices.

(a) A = (ai,j )4×5 , where ai,j = |2j − 3i|.


(b) A = (ai,j )7×7 , where ai,j = (i − j)/(i + j).

4. Plot the graph of the following function



$$f(x) = \begin{cases} 1 & \text{if } x < 0, \\ e^x & \text{if } x \ge 0, \end{cases}$$
within the range $-10 \le x \le 10$. Use a line for the plot. Is the function continuous?

5. The dataset airquality records daily air quality measurements in New York, May
1 to September 30, in the year 1973. It has with 153 observations on 6 variables.
• Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt
Island, units (ppb).
• Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms
from 0800 to 1200 hours at Central Park, units (lang).
• Wind: Average wind speed in miles per hour at 0700 and 1000 hours at La-
Guardia Airport, units (mph).
• Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Air-
port, units (degrees F).
• Month: the month (May to September).
• Day: numeric Day of month (1 to 31).

We will now plot a scatterplot with temperature in the x-axis and wind speed in
the y-axis.
> Temp <- airquality$Temp
> Wind <- airquality$Wind

(a) Plot a scatterplot of the graph of the wind speed against the temperature. Title
the graph as ”Wind Speed vs Temperature”, label the x-axis with ”Temperature
(degree F)” and the y-axis with ”Wind Speed (mph)”.
(b) Add a linear fit line (in blue) using the function lm to fit the graph in (a).
(c) Add the lowess fit curve (in red) with f value = 1.

6. Input the following data (refer to exercise 2 question 10).


> x <- data.frame("L" = c(0.01,0.012,0.015,0.02),
"R" = c(2.75e-4,3.31e-4,3.92e-4,4.95e-4))
Use the function lm in R to find the gradient and y-intercept of the line that best
fits the given data (compare the answers obtained with the answers in exercise 2
question 10).

7. Let
$$f(x) = xe^{x^2}.$$
Find $\frac{d^2}{dx^2}f(x)$, and $\frac{d}{dx}f(0)$.

8. Let
$$f(x,y) = e^{x/y - y^2}.$$
Find the gradient $\nabla f(x,y)$, $\frac{\partial^2}{\partial x\partial y}f(x,y)$, and $\frac{\partial^2}{\partial y\partial x}f(x,y)$. Also, find the linear approximation of $f$ at $(-1,-1)$.

9. Consider the function
$$f(\mathbf{x}) = f(x_1, x_2, ..., x_n) = -\left(x_1^2 + x_2^2 + \cdots + x_n^2\right)^{-\frac{1}{2}} = -\|\mathbf{x}\|^{-1}.$$
Find the gradient of $f(\mathbf{x})$, $\nabla f(\mathbf{x})$.


10. Let
$$f(x,y) = e^x\cos(y).$$
Show that
$$\frac{\partial^2}{\partial x^2}f(x,y) + \frac{\partial^2}{\partial y^2}f(x,y) = 0 \quad\text{for all } (x,y).$$
11. Let $a, b, c$ be some fixed real numbers and define
$$f(x,y,z) = \sin(ax + by + cz).$$
Let $\mathbf{y} = \begin{pmatrix} r \\ s \\ t \end{pmatrix}$ be such that $ar + bs + ct = 0$. Show that
$$\mathbf{y}\cdot\nabla f(x,y,z) = r\frac{\partial}{\partial x}f(x,y,z) + s\frac{\partial}{\partial y}f(x,y,z) + t\frac{\partial}{\partial z}f(x,y,z) = 0$$
for all $(x,y,z)$.

12. Let (refer to exercise 2 question 2b)
$$Q(x_1,x_2,x_3) = 2x_1^2 + 2x_2^2 + 2x_3^2 - 2x_1x_3.$$
Find $\nabla Q(1,0,1)$, that is, find the gradient of $Q(x_1,x_2,x_3)$ at the point $(1,0,1)$.

13. Let
F(x, y, z) = (x + y + z, xy + yz + zx).
Find the matrix derivative, DF of F.

14. Let

   F(x, y) = (e^x cos(y), e^x sin(y))^T.

Find the Jacobian J_F of F.

15. Consider the function


   f(x) = e^{−x} if x < 0, and f(x) = e^x if x ≥ 0.

Plot the (line) graph of the derivative df/dx of f, and its second derivative d^2f/dx^2, within
the range −2 ≤ x ≤ 2.

16. Inverse functions


Given a single variable real-valued function f , defined on an interval (a, b), a func-
tion g is said to be the inverse function of f in the interval (a, b) if g(f (x)) = x.
For example, the inverse of ex over the whole real line is ln(x), the inverse of sin(x)
over the interval (0, π/2) is arcsin(x) = sin−1 (x). Suppose f is defined on (a, b) and
g is the inverse function. Then
   d/dx g(f(a)) = 1/f′(a),

where f′(a) = d/dx f(a). For example,

   d/dx ln(e^1) = 1/e^1 = 1/(d/dx e^1),

and (recall that d/dx sin^{−1}(x) = 1/√(1 − x^2)),

   d/dx sin^{−1}(√2/2) = 1/cos(π/4) = √2 = 1/√(1 − (√2/2)^2).

We have a similar theorem for multivariable vector-valued functions. Let F(x) be


defined on an open set containing a point a. Suppose in a small neighborhood of a,
F has an inverse G, that is, G(F(x)) = x in this neighborhood, then the derivative
matrix of G at F(a) is
DG(F(a)) = DF(a)−1 .

Let (see question 14)

F(x, y) = (ex cos(y), ex sin(y)), for all (x, y).

(a) Verify that

   G(u, v) = ( (1/2) ln(u^2 + v^2), tan^{−1}(v/u) )

is an inverse of F in a small neighborhood of (0, π/4).
(b) Verify that
DG(F((0, π/4))) = DF((0, π/4))−1 .
17. Divergence
Let F(x) = (f1 (x), f2 (x), ..., fn (x)) be a multivariable function from Rn to Rn . The
divergence of F is defined to be
   divF = ∂/∂x1 f1(x) + ∂/∂x2 f2(x) + · · · + ∂/∂xn fn(x).

Explicitly, for n = 3, for F(x, y, z) = (f1(x, y, z), f2(x, y, z), f3(x, y, z)),

   divF(x, y, z) = ∂/∂x f1(x, y, z) + ∂/∂y f2(x, y, z) + ∂/∂z f3(x, y, z).

Let F(x, y, z) = (xy 2 , yz 2 , zx2 ). Find divF(1, 2, 3).


18. Let D be the area bounded by the x-axis and the graph of y = 1 − x2 . Express D
as a y-simple set.
19. Find the area bounded by the circle centered at the origin with radius 2, and the
graph of y = x^2. Leave your answer to 3 significant figures.
20. Compute

   ∫∫_D x^2 y^2 (x + y) dA

where D is the set in R^2 bounded by the graphs of y = 1, x = y, and
x = 4 − y. You may leave your answer to 3 significant figures.
21. Let

   D = {(x, y) | 1 ≤ x^2 + y^2 ≤ 4}.

Use the change of variables (x, y) = (r cos(θ), r sin(θ)) to evaluate the integral

   ∫∫_D (3 + x^4 + 2x^2 y^2 + y^4) dA.

22. Let D be the parallelogram in R2 bounded by the lines


y = −x − 2, y = −x + 1, y = x + 2, y = x − 3.
Evaluate the integral

   ∫∫_D (x − y)^2 dA.
Hint: Let the change of variable be
(x, y) = F(u, v) = (a1 u + b1 v + c1 , a2 u + b2 v + c2 ),
for some real numbers ai , bi , ci , i = 1, 2, and its domain to be C = {(u, v) | 0 ≤
u, v ≤ 1}.
23. Let

   p(x, y) = c cos(πx/2) cos(πy/2) for 0 ≤ x, y ≤ 1, and p(x, y) = 0 otherwise.

Find c so that p is a probability density function. For this value of c, find the
probability that (x, y) is in the region

   R = {(x, y) | 0 ≤ x, y ≤ 0.5}.

24. Solve the following first order ordinary differential equations


y1′ (t) = sin(y2 (t))
y2′ (t) = cos(y1 (t))

with initial conditions y1 (0) = y2 (0) = 0. Plot the graph of y1 and y2 for 0 ≤ t ≤ 4π.
25. Consider the following 4th order differential equation
y (4) (t) + y(t) = 0,
with initial conditions y(0) = 1, y ′ (0) = 1, y (2) (0) = 0, y (3) (0) = 1. Here, y (k) (t) is
the k-th derivative of y(t). We can convert it to a first order system of differential
equations as such. Let
y1 (t) = y(t), y2 (t) = y ′ (t), y3 (t) = y (2) (t), y4 (t) = y (3) (t).
Then the above 4th order differential equation becomes the following first order
system of differential equations
y1′ (t) = y2 (t)
y2′ (t) = y3 (t)
y3′ (t) = y4 (t)
y4′ (t) = −y1 (t)
with initial conditions

y1 (0) = 1, y2 (0) = 1, y3 (0) = 0, y4 (0) = 1.

Plot the solution of the 4th order ODE for 0 ≤ t ≤ 8.

26. So far we have only considered system of ordinary differential equations

y1′ (t) = F1 (y1 , y2 , ..., yn )


y2′ (t) = F2 (y1 , y2 , ..., yn )
..
.

yn′(t) = Fn(y1, y2, ..., yn)

where the multivariable functions Fi, i = 1, ..., n, do not depend on t. Consider


now the following system of ODE

y1′ (t) = y2 (t)


y2′ (t) = −sin(t)

with initial conditions y1 (0) = 0, y2 (0) = 1. One can check that y1 (t) = sin(t), and
y2 (t) = y1′ (t) = cos(t) is the solution. To solve it in R, we need to first create the
interpolating function, using approxfun
times <- seq(0,2*pi,length=1000) #0<=t<=2*pi
t <- data.frame(times=times,t=times)
t <- approxfun(t,rule=2)
Now use the interpolation function in the ODE function
ODE <- function(t,state,parameters){
y1 <- state[1]
y2 <- state[2]
dy1 <- y2
dy2 <- -sin(t)
Sol <- c(dy1,dy2)
list(Sol)
}

state <- c(y1=0,y2=1)


parameters <- 0

library(deSolve) # for the ode() solver, if not already loaded
out <- ode(y=state,times=times,func=ODE,parms=parameters)


plot(out)
Indeed, from the graph, we can observe that y1 (t) = sin(t) and y2 (t) = y1′ (t) = cos(t)
is the solution.

Plot the graph of the solution of the following 4th order ODE

y (4) (t) + cos(t)y(t) = t

for 0 ≤ t ≤ 10, with initial conditions y(0) = y ′ (0) = y ′′ (0) = y (3) (0) = 0.
Chapter 4

Probability and Statistics

4.1 Fundamental of Probability


4.1.1 Basic Combinatorics with R
The total number of arrangements of n objects into m positions with replacement or
with repetitions, is
   n^m.
The total number of arrangements of n objects into m positions without replacement
or without repetitions, is
   n(n − 1) · · · (n − m + 1) = n!/(n − m)!.
When among the n objects, n1 of them are indistinguishable, n2 of them are indistin-
guishable, and so on, until nk of them that are indistinguishable, then the total number
of possible arrangements is
 
   (n choose n1, n2, ..., nk) = n!/(n1! n2! · · · nk!).
This also applies to the situation where we are distributing n objects into k groups of
respective sizes n1 , n2 ,..., nk .
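For instance (a small illustrative sketch with made-up numbers n = 5 and m = 3), these counts can be evaluated directly in R:

> n <- 5; m <- 3
> n^m # arrangements with replacement
[1] 125
> factorial(n)/factorial(n-m) # arrangements without replacement
[1] 60
> factorial(n)/(factorial(2)*factorial(2)*factorial(1)) # groups of sizes 2, 2, 1
[1] 30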
Example. 1. How many possible ways are there to arrange the letters in the word BOOKKEEPER?

Total letters = 10
No. of O = 2
No. of K = 2
No. of E = 3

Hence, the total number of ways is

   10!/(2!2!3!) = 151200.
> factorial(10)/(factorial(2)*factorial(2)*factorial(3))
[1] 151200

179
2. Multinomial expansion

   (x1 + x2 + · · · + xk)^n = Σ_{n1+n2+···+nk=n} (n choose n1, n2, ..., nk) x1^{n1} x2^{n2} · · · xk^{nk}.

An arrangement of r objects taken from n objects without regard to order is called a
combination, denoted as nCr or (n choose r), and is given by

   nCr = (n choose r) = n!/(r!(n − r)!).

The function in R is given by choose(n,r).


> choose(5,3)
[1] 10
> choose(10,2)
[1] 45

Example. 1. How many ways are there to have n binary digits such that m of them are 0 and
no two 0's are consecutive?

To solve this, imagine we first lay out the n − m 1's. Then we slot the 0's into the
n − m + 1 possible slots (including the left and right ends). Since there can be no
consecutive 0's, each slot can contain at most one 0. Hence, we are choosing m of
the n − m + 1 slots to put the 0's. The solution is (n−m+1 choose m).

2. Binomial expansion

   (x + y)^n = Σ_{k=0}^{n} (n choose k) x^k y^{n−k}.

The function combn in R allows us to generate all possible combinations of m objects
taken from n objects. For example, suppose we want to choose three letters from A, B, C, D.
> x <- c("A","B","C","D")
> combn(x,3)
[,1] [,2] [,3] [,4]
[1,] "A" "A" "A" "B"
[2,] "B" "B" "C" "C"
[3,] "C" "D" "D" "D"
Each column gives us a possible outcome. Indeed, (4 choose 3) = 4.


4.1.2 Introduction to Probability


An experiment is any action or process that generates observations. The sample space
of an experiment is the set of all possible outcomes of an experiment. An event E is
any subset of the sample space. If there is only one element in the event, we may abuse
notations and let the element denote that event. For an event E, let E c , called the
complement of E, denote the event that E did not occur. A sequence of events, E1 , E2 ,
..., En is said to be mutually exclusive if they are pairwise disjoint, Ei ∩ Ej = ∅ for all
i ̸= j.
Axioms of probability
The axiomatic (rigorous) approach to probability is as such. Let S denote the sample
space. For each event E, let P (E) denote the probability of the event. P (E) must satisfy
the following:

(i) 0 ≤ P (E) ≤ 1

(ii) P (S) = 1

(iii) For any sequence of events E1, E2, ..., such that Ei ∩ Ej = ∅ for all i ̸= j,

   P(∪_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei).

If events A and B are not mutually exclusive, then

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
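As a quick sanity check (an illustrative sketch, with hypothetical events not taken from the notes), we can verify this formula by simulation for a single fair die, taking A = "the outcome is even" and B = "the outcome is at least 4":

rolls <- sample(1:6,100000,replace=TRUE)
A <- rolls %% 2 == 0   # event: outcome is even
B <- rolls >= 4        # event: outcome is at least 4
mean(A | B)                      # estimate of P(A u B), close to 4/6
mean(A) + mean(B) - mean(A & B)  # same value by the formula above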

Relative frequency approach to probability


The intuitive approach to define probability of an event is the fraction of the number of
times it occurs as the experiment is performed many times. That is, let n denote the
total number of experiments conducted, then
   P(E) = lim_{n→∞} (no. of times E occurs)/n.
Example. If two fair dice are rolled, what is the probability that the sum of the result
of the dice is 7?

Axiomatic approach: Since there are six sides to each die, the total number of outcomes
for two dice is 36. There are 6 possible outcomes that sum to 7: (1,6), (2,5), (3,4), (4,3),
(5,2), (6,1). By the assumption that the dice are fair, each of these outcomes is equally
likely. So, the probability is

   6/36 = 1/6.
Relative frequency approach, by simulation: The function sample in R generates a
sample of the specified size from the elements from a vector x. Here are some of the
arguments.
sample(x,size,replace,prob)

• x is a vector containing the list of all the objects from which to choose.
If it is a positive integer, then the list is from 1 to that integer.

• size is a positive integer, the number of objects to choose.

• replace is a logical argument, if it is true, objects will be chosen with


replacement, and false otherwise. The default is false.

• prob is a vector of probabilities with which each of the objects is chosen; it
should be a vector of the same length as x.
We will now simulate rolling two fair dice and finding the sum.
sum2dice <- function(nreps){
count <- 0
for (i in 1:nreps) {
sum <- sum(sample(1:6,2,replace=TRUE))
if (sum==7) count <- count + 1
}
return(count/nreps)
}

Let us explain the code. The argument of the defined function sum2dice is nreps,
which is the total number of times the experiment is to be conducted. count is the
number of times the sum is 7. We start the code with count = 0. The line for (i in
1:nreps) runs the experiment nreps times. For each experiment, we take two numbers
from 1 to 6 with replacement, then find the sum. The line if (sum==7) checks whether the
sum for this particular experiment is 7, and performs count <- count + 1 if it is. That is,
it adds 1 to the running count of experiments whose sum is 7.
Once nreps experiments have been conducted, the code moves to the next
command, which returns the fraction count/nreps.

> sum2dice(10000)
[1] 0.1684
> sum2dice(1000000)
[1] 0.166963

So, observe that as we perform more experiments, the number approaches 1/6, which
was the answer derived theoretically.

By using the function replicate, we could shorten the code. Let us do it more
generally. The code to find the relative frequency that the sum of rolling d dice equals k is as
such.
rollddice <- function(d) {sample(1:6,d,replace=TRUE)}
sumddice <- function(d,k,nreps){
sum <- replicate(nreps,sum(rollddice(d)))
return(mean(sum==k)) # fraction of experiments whose sum equals k
}

Remark. Though the second version is shorter, it stores all nreps simulated sums in memory
at once, and thus takes more memory space than the loop version.

Personal probability
Sometimes people use probability to measure their degree of confidence that an event
would occur. This is known as the personal or subjective view of probability. For example, one
might say that he or she might be moving house next year with a probability of 0.9.
This tells us that the person is quite certain that he or she will move. It is impossible
to perform this experiment repeatedly under the same conditions and take the relative
frequency in this situation. However, personal probabilities must still be subject
to the axioms of probability. For example, when asked the probability that he or she will
not move, the person must say that it is 0.1.

In this course, we will not deal with or use probability in the sense of personal prob-
ability.

Exercise. Suppose a weather forecast predicts that it might rain the next day with
probability 90%. Can this be interpreted as relative frequency approach to probability?
Bear in mind that it is impossible to “conduct an experiment under the same conditions”
to check how many of the outcome of the experiments results in being raining the next
day.

4.2 Conditional Probabilities, Independent Events,


Bayes’ Rule
4.2.1 Conditional Probability
Suppose two fair coins are being tossed. It is observed that one of the coins landed on
heads. What is the probability that the toss will end up with two heads?

Since the coins are assumed to be fair, there are four equally likely outcomes, S =
{(h, h), (h, t), (t, h), (t, t)}. So, the probability that the event (h, h) occurs is 1/4. How-
ever, we are told that the first coin was observed to land on heads. This makes the
events (t, h) and (t, t) impossible for this experiment. Hence, S = {(h, h), (h, t)}, and
therefore, the probability for (h, h) is 1/2. This is called conditional probability.

If A and B are two events in a sample space S and P(A) ̸= 0, the conditional probability
of B given A is defined as

   P(B|A) = P(A ∩ B)/P(A).

(Note that P(A ∩ B) is the probability that both events A and B occur.) From the
relative frequency explanation, we are finding the frequency of event B occurring over all
the times that event A happens:

   lim_{n→∞} (no. of times B and A occur)/(no. of times A occurs)
   = lim_{n→∞} [(no. of times B and A occur)/n] / [(no. of times A occurs)/n] = P(A ∩ B)/P(A).

Example. 1. Find the probability that the sum of the outcomes of tossing three dice
is at least 12, given that the outcome of the first die is 5.
sum12 <- function(nreps){
count12 <- 0
count5 <- 0
for (i in 1:nreps) {
toss <- sample(1:6,3,replace=TRUE)
if (toss[1]==5) {
count5 <- count5 + 1
if (sum(toss)>=12) count12 <- count12 + 1
}
}
return(count12/count5)
}

> sum12(10000)
[1] 0.5828505

sum7 <- function(nreps){


count <- 0
for (i in 1:nreps) {
sum <- sum(sample(1:6,2,replace=TRUE))
if (sum>=7) count <- count + 1
}
return(count/nreps)
}

> sum7(10000)
[1] 0.5839

So, the probability that the sum of three tosses is at least 12 given the first is 5 is
the same as the probability that the sum of the other 2 dice toss is at least 7.

2. Suppose a box contains 50 defective light bulbs, 100 partially defective light bulbs
that will not last more than 3 hours, and 250 good light bulbs. If a bulb is taken
from the box and it lights up when used, what is the probability that the light bulb
is actually a good light bulb?

Here we are to find the probability that the light bulb is good, given the condition
that it is not defective.
P (Good) 250 5
P (Good | Not defective) = = = .
P (Not defective) 100 + 250 7

Remark. Note that the good light bulbs are a subset of the not defective light
bulbs, hence, the intersection of the good light bulbs and the not defective ones are
just the good ones.

3. Consider a game where each player take turns to toss a die, the score they obtain
for that round will be the outcome of the toss added to their previous tosses. If the
outcome of the toss is 3, a player gets a bonus toss (only once, even if it lands on
3 the second time, it is the end of the player’s turn). What is the probability that
the player’s score is less than 8 after the first turn?

At first look, this does not look like a conditional probability problem. However, to
compute the probability analytically (vis-a-vis via simulation), we need to break-
down the event into smaller events; whether the player’s first toss lands on 3. Let
T be the outcome of the player’s first toss, and B the outcome of player’s bonus
toss with B = 0 if the player did not get one.

P(score < 8) = P(T + B < 8)
             = P(((T ̸= 3) ∩ (T < 8)) ∪ ((T = 3) ∩ (B ≤ 4)))
             = P((T ̸= 3) ∩ (T < 8)) + P((T = 3) ∩ (B ≤ 4))
             = P(T ̸= 3) + P(T = 3)P(B ≤ 4)
             = 5/6 + (1/6)(4/6) = 17/18

Now suppose we know that the player’s score is 4. Let’s find the probability that
it was obtained with the help of a bonus toss.

Here we want to compute the probability that the player did have a bonus toss
B > 0, given that his score was 4, T + B = 4.

P(B > 0 | T + B = 4) = P((B > 0) ∩ (T + B = 4)) / P(T + B = 4)
                     = P((T = 3) ∩ (B = 1)) / [P((T = 3) ∩ (B = 1)) + P(T = 4)]
                     = (1/6 · 1/6) / (1/6 · 1/6 + 1/6)
                     = 1/7
Exercise. 1. Show that
P (A|B) = P (B|A)
is false by giving a counter-example scenario.

2. Is P (A|B)P (B) = P (B|A)P (A) true?

4.2.2 Independent Events


Two events A and B are independent if

P (A ∩ B) = P (A)P (B).

Applying this to the conditional probability, if A and B are independent, then

P (A ∩ B) P (A)P (B)
P (A|B) = = = P (A).
P (B) P (B)

Similarly, P(B|A) = P(B). This says that the condition that B happens will not affect
the probability of A happening, and vice versa. For example, the probability that the
second toss of a coin will land on heads is independent of the condition that the first
landed on heads. Another example: the condition that the outcome of the third toss of
a die is 3 will not affect the probability that the sum of the first two tosses is 8.
More generally, n events E1, E2, ..., En are independent if for any subset Ej1, Ej2, ..., Ejm
of them,

   P(∩_{i=1}^m Eji) = Π_{i=1}^m P(Eji).

Explicitly, for 3 events E1 , E2 , E3 to be independent, we must have

P (E1 ∩ E2 ) = P (E1 )P (E2 ),


P (E1 ∩ E3 ) = P (E1 )P (E3 ),
P (E2 ∩ E3 ) = P (E2 )P (E3 ),
P (E1 ∩ E2 ∩ E3 ) = P (E1 )P (E2 )P (E3 ).
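As an illustrative check (a hypothetical sketch, not one of the notes' examples), we can estimate both sides of the defining equation by simulation, for A = "the first die is even" and B = "the second die is at least 5" on two independent dice:

d1 <- sample(1:6,100000,replace=TRUE)
d2 <- sample(1:6,100000,replace=TRUE)
A <- d1 %% 2 == 0   # event on the first die
B <- d2 >= 5        # event on the second die
mean(A & B)     # estimate of P(A n B)
mean(A)*mean(B) # estimate of P(A)P(B); both should be close to (1/2)(1/3) = 1/6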

4.2.3 Bayes’ Rule


From the definition of conditional probability, we have

P (A ∩ B) P (A|B)P (B)
P (B|A) = = .
P (A) P (A)

This is known as Bayes’ rule, or Bayes’ formula. It relates P (B|A) to P (A|B).

Consider events A and B. Then it is clear that

A = (A ∩ B) ∪ (A ∩ B c ),

that is, for an outcome to be in A, it must either be in A and B, or in A but not in B.


Note that (A ∩ B) and (A ∩ B c ) are mutually exclusive. Hence,

P (A) = P (A ∩ B) + P (A ∩ B c )
= P (A|B)P (B) + P (A|B c )P (B c )
= P (A|B)P (B) + P (A|B c )[1 − P (B)].

More generally, the law of total probability states that if A1, A2, ..., An are mutually ex-
clusive events such that the union of these events is the entire sample space, ∪_{i=1}^n Ai = S,
then for any event E,

   P(E) = Σ_{i=1}^n P(E ∩ Ai) = Σ_{i=1}^n P(E|Ai)P(Ai).

Suppose now E has occurred and we are interested in determining the probability
that one of the Ak also occurs. Then by the law of total probability and Bayes' rule, we
arrive at

   P(Ak|E) = P(E|Ak)P(Ak)/P(E) = P(E|Ak)P(Ak) / Σ_{i=1}^n P(E|Ai)P(Ai).
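A small helper function along these lines (a sketch with made-up numbers, not part of the notes) makes the formula concrete: given the prior probabilities P(Ai) and the likelihoods P(E|Ai), it returns the posterior probabilities P(Ai|E).

posterior <- function(prior, likelihood){
  joint <- likelihood*prior   # the terms P(E|Ai)P(Ai)
  joint/sum(joint)            # divide by P(E), the sum of those terms
}
# e.g. three mutually exclusive causes with priors 0.5, 0.3, 0.2
# and likelihoods of observing E equal to 0.1, 0.4, 0.8
posterior(c(0.5,0.3,0.2),c(0.1,0.4,0.8))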

Example. 1. A student is attempting a multiple-choice question in a test, where there


are k options. Let p be the probability that the student actually knows the answer
to the question, and 1 − p be the probability that the student is guessing the an-
swer. Assume that all the options are equally likely to be chosen when the student
is guessing. What is the conditional probability that a student knew the answer
given that he answered correctly?

Let events E, A1 and A2 represents the events “answered correctly”,“knows the


answer”, and “guessed”, respectively. Then by Bayes’ rule and law of total proba-
bility,

P(A1|E) = P(E ∩ A1)/P(E)
        = P(A1)/[P(A1) + P(E ∩ A2)]
        = p/(p + (1 − p)(1/k))
        = kp/(1 + p(k − 1)).

In particular, if k = 4 and p = 1/2, then the probability is 4/5.

2. (Monty Hall problem) In a television show, the host, Monty Hall, gave contestants
the chance to choose to open one of three doors. Behind one door is a luxurious
car, while a goat is behind each of the other two. When a contestant picks a door, the host,
knowing which door hides the car, will reveal a goat behind one of the two
doors that the contestant did not pick. The contestant is then given a choice to
either remain with his original decision, or switch to the other unopened door. Is
it to the contestant’s advantage to switch?

Let us first simulate the problem.


#Monty hall problem
Montyhall <- function(nreps) {
switch_win <- 0
dontswitch_win <- 0
no_switch <- 0 #count the number of times the contestant switches
for (i in 1:nreps){
car <- sample(1:3,1) #the door which conceals the car
guess <- sample(1:3,1) #the door that the contestant guesses
switch <- sample(0:1,1) #1 if the contestant switches, 0 otherwise
if (switch==1) no_switch <- no_switch + 1
if (all(switch==1,guess!=car)) switch_win <- switch_win + 1
if (all(switch==0,guess==car)) dontswitch_win <- dontswitch_win + 1
}
Probs <- c(switch_win/no_switch,dontswitch_win/(nreps-no_switch))
names(Probs) <- c("P(Win with Switch)","P(Win Without Switching)")
return(Probs)
}

> Montyhall(100000)
P(Win with Switch) P(Win Without Switching)
0.6675855 0.3341477
The simulation tells us that the contestant has about a 2/3 chance of winning if he
chooses to switch, and only about a 1/3 chance of winning if he does not switch.

We will now derive this analytically. Let us call the door that the contestant picks
door 1. Let Di, i = 1, 2, 3, be the event that door i conceals the car. Let Mi, i = 1, 2, 3,
be the event that Monty opens door i. Without loss of generality, let us name the door that
Monty opens door 3. Then before Monty opens any door, the probability of the car
being behind each door is P(Di) = 1/3. After Monty has opened a door, since he
will not open the door that hides the car, the probability that the car is behind
door 3 given that Monty has opened door 3 is P(D3|M3) = 0. Our aim is to find
the probabilities P(Di|M3), for i = 1, 2. The probability P(D1|M3) is the chance
that the contestant wins when he does not switch doors, while P(D2|M3) is the chance
that he wins by switching doors. To compute P(Di|M3), we use Bayes' rule; that
is, we first compute P(M3|Di) for i = 1, 2, 3.
• P (M3 |D1 ) = 1/2, since Monty can choose to open either door 2 or 3.
• P(M3|D2) = 1, since the participant has chosen door 1, and door 2 contains
the car, so Monty can only open door 3.
• P (M3 |D3 ) = 0, Monty knows which door conceals the car and will not open
it.

Hence,

   P(D1|M3) = P(M3|D1)P(D1) / [P(M3|D1)P(D1) + P(M3|D2)P(D2) + P(M3|D3)P(D3)]
            = (1/2 · 1/3) / (1/2 · 1/3 + 1 · 1/3 + 0 · 1/3) = (1/6)/(1/2) = 1/3,

   P(D2|M3) = P(M3|D2)P(D2) / [P(M3|D1)P(D1) + P(M3|D2)P(D2) + P(M3|D3)P(D3)]
            = (1 · 1/3) / (1/2 · 1/3 + 1 · 1/3 + 0 · 1/3) = (1/3)/(1/2) = 2/3,

which agrees with the simulation. Therefore, it is always to the contestant’s advan-
tage to switch.

4.3 Discrete Random Variable


4.3.1 Introduction to Random Variables
Often we are interested in some function of the outcome of an experiment, rather than the
actual outcome itself. This is provided the function values are real numbers. For example,
1. toss two dice. X1 = the sum of the outcomes of the dice,

2. X2 = the number of toss of a coin needed to have 3 consecutive heads,

3. X3 = the duration a person has to wait in a queue for the famous bubble tea.
4. X4 = the distance a car can travel on a full tank.

Formally, a random variable is a real-valued function whose domain is the sample


space. Random variables can either be discrete or continuous. A random variable is
discrete if the set of possible outcomes is countable (finite or countably infinite). In the
examples above, X1 and X2 are discrete random variables. We will see later that X3 and
X4 are continuous random variables.

Food for thought: Is the score of a randomly chosen student for a particular test a
discrete or continuous random variable?

4.3.2 Discrete Random Variables


The function that assigns probability to the values of the random variable is called the
probability density function (pdf), denoted as p(x) = P(X = x), and it must satisfy the
following two conditions:

1. p(x) ≥ 0 for all x ∈ D,

2. Σ_{x∈D} p(x) = 1,

where D is the domain of the probability density function p. The support of a probability
density function p(x) is the subset of the domain on which p(x) is positive,

   S = {x ∈ D | p(x) > 0}.

The cumulative distribution function (cdf) is defined as

   F(x) = P(X ≤ x) = Σ_{k≤x} p(k).

Discrete cumulative distribution functions satisfy the following properties:

1. 0 ≤ F (x) ≤ 1,

2. (non-decreasing function) for any real numbers a and b with a < b, F (a) ≤ F (b).

3. limx→∞ F (x) = 1,

4. limx→−∞ F (x) = 0,

5. F (x) is a step function, and the height of the increment at x is p(x).

Example. 1. Four balls are to be randomly selected, without replacement, from a bag
containing twenty balls numbered 1 to 20. Let X denote the largest number among
the four selected balls; then X takes on one of the values 4 to 20. There are (20 choose 4)
equally likely choices, and for X = i, it must be the case that one of the four chosen balls
is i, while the other three are chosen from the numbers 1 to i − 1. Hence, there
are (i−1 choose 3) equally likely choices. Therefore, the probability density function is

   p(i) = (i−1 choose 3) / (20 choose 4).

To compute the cumulative distribution function F(k) = P(X ≤ k), observe that
X ≤ k happens exactly when all four balls are chosen from 1 to k. Therefore

   F(k) = (k choose 4) / (20 choose 4).

> p <- function(i) choose(i-1,3)/choose(20,4)
> F <- function(k) choose(k,4)/choose(20,4)
> x <- seq(4,20)
> plot(x,p(x),type='h',lwd=8,col="blue")

> plot(x,F(x),type='h',lwd=8,col="blue")
2. Two fair dice are being tossed. Let X denote the sum of the outcomes of the dice.
The probability density function is

   p(k) = m · (1/36) for 2 ≤ k ≤ 12, and p(k) = 0 otherwise,

where m = min{k − 1, 13 − k}.

The cumulative distribution function is

   F(k) = 0 for k < 2,
   F(k) = k(k − 1)/72 for 2 ≤ k ≤ 7,
   F(k) = [72 − (12 − k)(13 − k)]/72 for 8 ≤ k ≤ 12,
   F(k) = 1 for k > 12.

4.3.3 Expected Value of Discrete Random Variables


Let X be a discrete random variable with a probability density function p(x). The
expected value (or mean) of X, denoted by E[X], is

   E[X] = Σ_{x∈S} x · p(x).

In words, the expected value of X is the weighted average of the possible values that X
can take on, each value being weighted by the probability that X assumes it. However,
the term “expected value” is a misnomer, it is not something we “expect” to happen.
For example, let X be the outcome of the roll of a fair die. Then the expected value of
X is

   E[X] = Σ_{k=1}^{6} k/6 = 3.5.
But it is impossible to expect that the outcome of a die toss is 3.5.

Consider another example, where a fair coin is tossed 1000 times. Let tail be 0 and head
be 1. Then the expected value of the total number of heads is 500. But P(X = 500) is about 0.025,
which is rare and not something we “expect” would occur.

The intuition behind expected value, or mean, is the average value of X when we
conduct large numbers of experiments. Let xn be the outcome of the n-th experiment.
Then,

   E[X] = lim_{n→∞} (x1 + x2 + · · · + xn)/n.
Example. 1. I is an indicator variable for an event A if I = 1 when A occurs, and I = 0 when A^c occurs.
Then p(1) = P(A) and p(0) = P(A^c), and thus

   E[I] = P(A).

2. Find the expected value of the sum of the outcomes of two dice.

   E[X] = 2·(1/36) + 3·(2/36) + 4·(3/36) + · · · + 11·(2/36) + 12·(1/36) = 7.
> x <- c(2,3,4,5,6,7,8,9,10,11,12)
> px <- (1/36)*c(1,2,3,4,5,6,5,4,3,2,1)
> mean <- sum(x*px)
> mean
[1] 7

Simulation
sum2dice <- function(nreps){
sum <- 0
for (i in 1:nreps){
sum <- sum + sum(sample(1:6,2,replace=TRUE))
}
return(sum/nreps)
}

> sum2dice(100000)
[1] 6.9966

Alternatively,
sum2dice <- function(nreps){
outcome <- replicate(nreps,sum(sample(1:6,2,replace=TRUE)))
return(mean(outcome))
}

> sum2dice(100000)
[1] 7.00168
Theorem (Properties of expected values). Let X and Y be discrete random variables,
and a, b ∈ R be real numbers. Then
(i) (Joint linearity) E[aX + bY ] = aE[X] + bE[Y ],
(ii) (Linearity) E[aX + b] = aE[X] + b,
(iii) (Independence) if X and Y are independent, then
E[X · Y ] = E[X] · E[Y ].

To be rigorous, we need to define joint probability density function before we can


define E[aX + bY ] and E[(X · Y )] (see section 4.7.1). Interested readers may refer to the
appendix for the proof.
Remark. 1. The fact that E[aX + b] = aE[X] + b should be clear. Intuitively, if the
mean of X is E[X], then the mean of aX + b will be scaled by a, and shifted by b.
2. From property (ii) we have E[a] = a for any constant a.
3. By induction on property (i), we have for random variables X1 , X2 , ..., Xk , and real
numbers a1 , a2 , ..., ak ∈ R,
E[a1 X1 + a2 X2 + · · · + ak Xk ] = a1 E[X1 ] + a2 E[X2 ] + · · · + ak E[Xk ].

4. By induction on property (iii), if X1, X2, ..., Xk are independent random variables, then

   E[X1 · X2 · · · Xk] = E[X1] · E[X2] · · · E[Xk].
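As an illustrative (hypothetical) check of properties (i) and (iii), we can simulate two independent fair dice X and Y and compare sample averages with the theoretical values E[2X + 3Y] = 2(3.5) + 3(3.5) = 17.5 and E[X · Y] = 3.5^2 = 12.25:

X <- sample(1:6,100000,replace=TRUE)
Y <- sample(1:6,100000,replace=TRUE)
mean(2*X + 3*Y)   # close to 2*3.5 + 3*3.5 = 17.5
mean(X*Y)         # close to 3.5^2 = 12.25, since X and Y are independent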
More generally, the expectation of a function of a random variable, denoted as E[g(X)],
is given by

   E[g(X)] = Σ_{x∈S} g(x)p(x).

Example. 1. Let X denote a random variable that takes on all of the values −1, 0,
and 1, with respective probabilities
p(−1) = 0.2, p(0) = 0.5, p(1) = 0.3,
and p(x) = 0, otherwise. Then E[X^2] is

   E[X^2] = (−1)^2 · 0.2 + 0^2 · 0.5 + 1^2 · 0.3 = 0.5.

Note that

   E[X]^2 = (−1 · 0.2 + 0 · 0.5 + 1 · 0.3)^2 = 0.1^2 = 0.01 ̸= 0.5 = E[X^2].

2. Let X be the outcome of a toss of a fair die. Find the expectation of 2X + 3.


E[2X + 3] = 2E[X] + 3 = 2(3.5) + 3 = 10.
> x <- c(1,2,3,4,5,6)
> px <- replicate(6,1/6)
> mean <- sum((2*x+3)*px)
> mean
[1] 10
3. Suppose five balls are drawn at random without replacement from a bag containing
six blue balls and eight red balls. Let X be the difference between the number of
red balls and the number of blue balls drawn, X = R − B, where R and B denote the number of red
and blue balls drawn, respectively. Let's find the mean of X.

The values at which X takes nonzero probability are −5 = 0 − 5, −3 = 1 − 4,
−1 = 2 − 3, 1 = 3 − 2, 3 = 4 − 1, and 5 = 5 − 0.

E[X] = −5 P(X = 0 − 5) − 3 P(X = 1 − 4) − P(X = 2 − 3)
       + P(X = 3 − 2) + 3 P(X = 4 − 1) + 5 P(X = 5 − 0)
     = [−5 (8 choose 0)(6 choose 5) − 3 (8 choose 1)(6 choose 4) − (8 choose 2)(6 choose 3)
       + (8 choose 3)(6 choose 2) + 3 (8 choose 4)(6 choose 1) + 5 (8 choose 5)(6 choose 0)] / (14 choose 5)
     = 5/7.
> library(MASS) # for fractions()
> R <- c(0,1,2,3,4,5)
> B <- c(5,4,3,2,1,0)
> x <- R-B
> px <- choose(8,R)*choose(6,B)/choose(14,5)
> fractions(sum(x*px))
[1] 5/7

Alternatively, note that X = R − B = R − (5 − R) = 2R − 5. Hence,


E[X] = E[2R − 5] = 2E[R] − 5.
Thus, our task is to find the expected value of the number of red balls chosen. Now
for i = 1, ..., 5 let Ii be the indicator random variable that the i-th ball chosen is
red. Then
R = I1 + I2 + I3 + I4 + I5 ,
and so
E[R] = E[I1 + I2 + I3 + I4 + I5 ]
= E[I1 ] + E[I2 ] + E[I3 ] + E[I4 ] + E[I5 ]
= P (I1 = 1) + P (I2 = 1) + P (I3 = 1) + P (I4 = 1) + P (I5 = 1),
where P(Ii = 1) is the probability that the i-th ball chosen is red. Note that by
symmetry, P(Ii = 1) is the same for all i = 1, ..., 5. Without loss of generality,
let i = 1. Let us make some comments about finding P(I1 = 1). Firstly, we do
not care about the choices for the other four balls; we are only concerned
with the probability that this position gets filled by a red ball. Secondly, this
probability is not affected by the number of red balls that might be chosen for the
other positions. In other words, all 8 red balls have equal probability of being
chosen for this position. Hence,

   P(I1 = 1) = 8/14 = 4/7.

Therefore, E[R] = 5 · 4/7 = 20/7, and so

   E[X] = 2 · 20/7 − 5 = 5/7.
4.3.4 Variance of Discrete Random Variables
Let X be a discrete random variable with mean µ = E[X]. The variance of X, denoted
as V ar[X], is defined to be
V ar[X] = E[(X − µ)2 ].
Observe from the formula that the variance of a discrete random variable X is the
mean squared difference between the random variable and its mean. From the relative frequency
approach to probability, if xn denotes the outcome of the n-th experiment, then

   Var[X] = lim_{n→∞} [(x1 − µ)^2 + (x2 − µ)^2 + · · · + (xn − µ)^2]/n.
Lemma.
V ar[X] = E[X 2 ] − E[X]2 .
Example. Let I denote the indicator random variable for the event A. Let p = P (A).
Recall that P (I = 1) = P (A) = p, and the mean is E[I] = P (A) = p. Then the variance
of I is

   Var[I] = E[I^2] − E[I]^2 = 0^2 · P(I = 0) + 1^2 · P(I = 1) − p^2 = p − p^2 = p(1 − p).

Theorem (Properties of Variance). (i) Let X be a discrete random variable, and a, b ∈


R be real numbers. Then

V ar[aX + b] = a2 V ar[X].

(ii) If X and Y are independent random variables, then

V ar[X + Y ] = V ar[X] + V ar[Y ].

Proof. (i) By the lemma above and the properties of expected value,

Var[aX + b] = E[(aX + b)^2] − E[aX + b]^2
            = E[a^2X^2 + 2abX + b^2] − (aE[X] + b)^2
            = a^2E[X^2] + 2abE[X] + b^2 − a^2E[X]^2 − 2abE[X] − b^2
            = a^2(E[X^2] − E[X]^2)
            = a^2 Var[X].

(ii) Recall that if X and Y are independent, then

E[X · Y ] = E[X] · E[Y ].

Hence, by the lemma and the linearity of expectation,

Var[X + Y] = E[(X + Y)^2] − E[X + Y]^2
           = E[X^2 + 2X·Y + Y^2] − (E[X]^2 + 2E[X]·E[Y] + E[Y]^2)
           = E[X^2] + 2E[X]·E[Y] + E[Y^2] − E[X]^2 − 2E[X]·E[Y] − E[Y]^2
           = E[X^2] − E[X]^2 + E[Y^2] − E[Y]^2
           = Var[X] + Var[Y].
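To make property (i) concrete, here is a small (hypothetical) simulation check with a fair die X, a = 2 and b = 3; the sample variance of 2X + 3 should be close to 2^2 · Var[X] = 4 · 35/12 ≈ 11.67:

X <- sample(1:6,100000,replace=TRUE)
var(2*X + 3)   # sample variance of 2X + 3
4*var(X)       # a^2 times the sample variance of X; both are near 4*35/12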
Example. 1. Let X denote a random variable that takes on any of the values −1, 0,
and 1, with respective probabilities

p(−1) = 0.2, p(0) = 0.5, p(1) = 0.3,

and p(x) = 0, otherwise. The variance of X is

V ar[X] = E[X 2 ] − E[X]2 = 0.5 − 0.01 = 0.49.

2. Let X denote the sum of the outcomes of tossing two fair dice. The variance of X
is

   Var[X] = E[X^2] − E[X]^2 = (2^2·(1/36) + 3^2·(2/36) + 4^2·(3/36) + · · · + 11^2·(2/36) + 12^2·(1/36)) − 7^2 = 35/6.
> x <- c(2,3,4,5,6,7,8,9,10,11,12)
> px <- (1/36)*c(1,2,3,4,5,6,5,4,3,2,1)
> var <- sum(x^2*px)-(sum(x*px))^2
> var
[1] 5.833333

The standard deviation of a discrete random variable X, denoted by σX, is the square root
of the variance,

   σX = √(Var[X]).

We often denote the variance of a random variable X by σX^2 = Var[X].

Theorem (Chebychev's inequality). For a random variable X with mean µX and variance
σX^2, and any real number α ∈ R, we have the following (equivalent) inequalities.

(i) P(|X − µX| ≥ ασX) ≤ 1/α^2.

(ii) P(|X − µX| ≥ α) ≤ σX^2/α^2.

(iii) P(|X − µX| < ασX) ≥ 1 − 1/α^2.

(iv) P(|X − µX| < α) ≥ 1 − σX^2/α^2.

Take inequality (i) for example. It says that the probability that a random variable
takes values, say 4 standard deviations away from the mean, is less than 1/16 = 0.0625.
Rephrasing it using (iii), we are sure with probability 1 − 0.0625 = 0.9375 that X
will be within 4 standard deviations of the mean.
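As an illustrative (hypothetical) check, we can compare the actual probability with the Chebychev bound for the sum of two fair dice (µ = 7, σ^2 = 35/6) and α = 2:

sums <- replicate(100000,sum(sample(1:6,2,replace=TRUE)))
sigma <- sqrt(35/6)
mean(abs(sums - 7) >= 2*sigma)   # actual probability (only the sums 2 and 12 qualify)
1/2^2                            # the Chebychev bound 1/alpha^2 = 0.25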

4.4 Some families of Discrete Univariate Distribu-


tions
4.4.1 Discrete Uniform Distributions
A random variable X is said to follow a discrete uniform distribution with parameter n
if X takes on n equally likely values.
Here are some examples. A fair coin toss is a discrete uniform distribution with
parameter 2. The outcome of tossing a fair die is a discrete uniform distribution with
parameter 6. Selecting randomly a person from a group of n people is a discrete uniform
distribution with parameter n.

If the possible values of X are x1, x2, ..., xn, then

   P(X = xi) = 1/n, for i = 1, ..., n,

   E[X] = (1/n) Σ_{i=1}^n xi,

   Var[X] = (1/n) Σ_{i=1}^n (xi − E[X])^2.

The R function to simulate k independent observations from a discrete uniform distribution with parameter n is

sample(n,k,replace=TRUE)

More generally, if the possible values of X are {x1, x2, ..., xn},

sample(c(x1,x2,...,xn),k,replace=TRUE)
Example. > x <- c("Amanda", "Beatrice","Candice","Denis","Elisa")
> Females <- sample(x,5000000,replace=TRUE)
> mean(Females=="Amanda")
[1] 0.2000634

4.4.2 Bernoulli Distributions


A Bernoulli random variable X can take only two possible outcomes, success (or
X = 1) and failure (or X = 0). The distribution is determined by p, for some 0 ≤ p ≤ 1,
where p is the probability of success. The probability density function of a Bernoulli
distribution is

   p(x) = p if x = 1, and p(x) = 1 − p if x = 0.

The expectation and variance are (refer to the examples on the expectation and variance of
an indicator random variable)

   E[X] = p,
   Var[X] = p(1 − p).
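R has no separate Bernoulli family: a Bernoulli trial is simply a binomial with a single trial. As a small (hypothetical) check with p = 0.3, the sample mean and variance should be close to p = 0.3 and p(1 − p) = 0.21:

x <- rbinom(100000,size=1,prob=0.3)   # 100000 Bernoulli(0.3) trials
mean(x)   # close to 0.3
var(x)    # close to 0.3*0.7 = 0.21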

4.4.3 Binomial Distributions


If a series of n independent Bernoulli trials with parameter p are being performed in an
experiment, then the probability density function that k of the trials are successful is
 
   p(k) = (n choose k) p^k (1 − p)^{n−k}, 0 ≤ k ≤ n.
The random variable X of such experiments is called a Binomial random variable. A
binomial distribution is determined by the parameters (n, p), n for number of Bernoulli
trials, and p for the probability of success for each Bernoulli trials. The expectation and
variance of binomial distribution is

E[X] = np,
V ar[X] = np(1 − p).

Proof. Let Bi denote the indicator random variable that the i-th trial is a success. Then
P(Bi = 1) = p and X = Σ_{i=1}^n Bi. Hence,

   E[X] = E[Σ_{i=1}^n Bi] = Σ_{i=1}^n E[Bi] = Σ_{i=1}^n P(Bi = 1) = Σ_{i=1}^n p = np,

and since B1, B2, ..., Bn are independent,

   Var[X] = Var[Σ_{i=1}^n Bi] = Σ_{i=1}^n Var[Bi] = Σ_{i=1}^n p(1 − p) = np(1 − p).

Example. Consider the following gambling game. A player bets on one of the numbers
1 through 6. Three dice are then rolled. Let i = 0, 1, 2, 3 be the number of times the
number the player bets on appears as the outcome of the dice roll. If i = 1, 2, 3, the
player wins i units. If i = 0, the player loses 1 unit. Is the game fair to the player?

This is equivalent to finding the expected value of the gambling game. If we assume
that the dice are fair and act independently, this is a binomial random variable with
parameter (3, 61 ). Hence, by letting X denote the player’s winnings in the game,

E[X] = −P(i = 0) + P(i = 1) + 2P(i = 2) + 3P(i = 3)
     = −(3 choose 0)(5/6)^3 + (3 choose 1)(1/6)(5/6)^2 + 2(3 choose 2)(1/6)^2(5/6) + 3(3 choose 3)(1/6)^3
     = −17/216
> i <- c(0,1,2,3)
> x <- c(-1,1,2,3)
> px <- choose(3,i)*(1/6)^i*(5/6)^(3-i)
> fractions(sum(x*px))
[1] -17/216

In other words, in the long run, the player will lose 17 units per every 216 games he
plays. Hence, the game is not fair to the player (not surprising).
The functions in R relevant to a binomial distribution with parameters (n, p) are
• rbinom(m,n,p), to generate m independent experiments
• dbinom(k,n,p), the probability density function p(k) = P (X = k)
• pbinom(k,n,p), the cumulative distribution function F (k) = P (X ≤ k)
• qbinom(q,n,p), to find a k such that F (k) = P (X ≤ k) = q
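For instance (a small usage sketch with illustrative numbers), for a binomial random variable X with parameters (10, 0.3):

dbinom(3,10,0.3)    # P(X = 3)
pbinom(3,10,0.3)    # P(X <= 3)
qbinom(0.9,10,0.3)  # smallest k with P(X <= k) >= 0.9
mean(rbinom(100000,10,0.3))   # sample mean of simulated values, close to np = 3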
4.4.4 Geometric Distributions
Suppose in an experiment, independent identical Bernoulli trials with parameter p are
performed until a success appears. Let X be the random variable denoting the number
of trials needed for the first success, then the probability density function is

p(k) = P (X = k) = (1 − p)k−1 p, k ≥ 1.

This is known as a geometric distribution with parameter p, where p is the probability
of success for each of the Bernoulli trials. The expectation and variance are

   E[X] = 1/p,

   Var[X] = (1 − p)/p^2.
Readers may refer to the appendix for the proof. However, one should be able to explain
the expected value from intuition.

The functions in R relevant to a binomial distribution with parameter p are

• rgeom(m,p), to generate m independent experiments

• dgeom(k-1,p), the probability density function p(k) = P (X = k)

• pgeom(k-1,p), the cumulative distribution function F (k) = P (X ≤ k)

• qgeom(q,p), to find a k such that F (k + 1) = P (X ≤ k + 1) = q

Remark. R defines geometric distribution as the number of failures before the first suc-
cess, instead of the number of trials to achieve the first success. This explains why, for
example, dgeom(k,p) compute p(k + 1) instead of p(k).

Example. Suppose there is a 10% chance that a particular pokémon card vending ma-
chine dispenses a rare pokémon card. Let X denote the random variable representing the
number of cards dispensed to obtain the first rare pokémon card. Then the probability
that the fourth card is the first rare pokémon card is

   P(X = 4) = 0.1(1 − 0.1)^3 = 0.0729.

> dgeom(3,0.1)
[1] 0.0729

Suppose each card costs 2 dollars. The expected cost of obtaining the first rare card is

   E[2X] = 2E[X] = 2 · (1/0.1) = 20 dollars,

and the variance is

   Var[2X] = 2^2 Var[X] = 4 · (1 − 0.1)/0.1^2 = 360 dollars^2.
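As a quick (hypothetical) simulation check of the convention noted in the remark above, adding 1 to rgeom draws gives the number of trials, whose sample mean and variance should be close to 1/p = 10 and (1 − p)/p^2 = 90 for p = 0.1:

trials <- rgeom(100000,0.1) + 1   # rgeom counts failures, so add 1 to get the number of trials
mean(trials)   # close to 1/p = 10
var(trials)    # close to (1-p)/p^2 = 90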
4.4.5 Negative Binomial Distributions
More generally, suppose in an experiment, independent identical Bernoulli trials with
parameter p are performed until m successes have been achieved. Let X be the random
variable representing the number of trials needed to achieve m successes. Then the
probability density function is

   p(k) = P(X = k) = (k−1 choose m−1) p^m (1 − p)^{k−m}, k ≥ m.
This is because among the k trials, the last trial must be the m-th success, and the
rest of the m − 1 successes can be among any of the first k − 1 trials. This is called a
negative binomial distribution with parameters (m, p). For i = 1, ..., m, let Gi be the
number of trials between the (i − 1)-th and the i-th success, counting also the i-th success. Then
Gi is a geometric distribution for i = 1, ..., m. Hence, we have

   X = G1 + G2 + · · · + Gm.

Note also that G1, G2, ..., Gm are independent identical trials. Therefore, the expectation
and variance are

   E[X] = m/p,

   Var[X] = m(1 − p)/p^2.
Note also that the derivations above tell us that if we have m independent identical
geometric distributions G1, G2, ..., Gm with parameter p, then the sum X = G1 +
G2 + · · · + Gm is a negative binomial distribution with parameters (m, p). By induction, if
X1, ..., Xr are negative binomial distributions with parameters (m1, p), (m2, p), ..., (mr, p),
respectively, then the sum X = X1 + X2 + · · · + Xr is also a negative binomial distribution
with parameters (Σ_{i=1}^r mi, p). On the other hand, a negative binomial distribution with
parameters (1, p) is a geometric distribution with parameter p.
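This relationship is easy to check by simulation (a hypothetical sketch with m = 4 and p = 0.7): summing the trial counts of m independent geometric random variables should reproduce the negative binomial mean m/p ≈ 5.71.

m <- 4; p <- 0.7
G <- matrix(rgeom(m*100000,p) + 1, nrow=100000)  # m geometric trial counts per row
X <- rowSums(G)                    # total number of trials to reach m successes
mean(X)                            # close to m/p
mean(rnbinom(100000,m,p) + m)      # the same distribution simulated directly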

The R functions relevant to a negative binomial distribution with parameters (m, p)


are
• rnbinom(n,m,p), to generate n independent experiments

• dnbinom(k-m,m,p), the probability density function p(k) = P (X = k)

• pnbinom(k-m,m,p), the cumulative distribution function F (k) = P (X ≤ k)

• qnbinom(q,m,p), to find a k such that F (k) = P (X ≤ k) = q


Remark. Just as in the case of the geometric distribution, R defines X as the number of
failures instead of the total number of trials. So, for example, in rnbinom we should add m to the
generated numbers (the numbers of failed trials) to obtain the total numbers of trials. This also explains why,
when inputting the value of X in the various functions, we supply k − m (the number of failures) instead of k.
Example. Consider a dating agency that claims that each partner it introduces to a client has a 70% chance
of ending up in a serious relationship with the client. Suppose a man keeps returning to the agency until he
finds the right partner to marry. Find the probability that he marries his fourth successful serious
relationship after being introduced to eight potential partners by the agency.

This is a negative binomial distribution with parameters (4, 0.7), and we are finding
the probability

   p(8) = (8−1 choose 4−1) 0.7^4 · 0.3^4 = 0.06806835.
> dnbinom(4,4,0.7)
[1] 0.06806835

4.4.6 Poisson Distributions


A random variable X that takes on any possible nonnegative integer is said to be a
Poisson random variable with parameter λ if its probability density function is given by

   p(k) = P(X = k) = e^{−λ} λ^k/k!, k ≥ 0.
The Poisson distribution is very popular for modeling the number of times a particular
event occurs in a given time period or in a defined region of space. For example, the number of customers
that walk into a given shop between 9 a.m. and 10 a.m., or the number of people queuing
at a particular ATM. The expectation and variance are

E[X] = λ,
V ar[X] = λ.

Readers may refer to the appendix for the derivations.

The R functions relevant to a Poisson distribution with parameter λ are

• rpois(n,lambda), to generate n independent experiments

• dpois(k,lambda), the probability density function p(k) = P (X = k)

• ppois(k,lambda), the cumulative distribution function F (k) = P (X ≤ k)

• qpois(q,lambda), to find a k such that F (k) = P (X ≤ k) = q

Example. Suppose a piece of glass rod of unit length drops and breaks into pieces. Let’s
assume that the number of broken pieces is a Poisson distribution of parameter λ, and
the break points are uniformly distributed. Let X be the length of the shortest broken
piece. Let us simulate this. Note that the support of a Poisson distribution starts from
0, however, we cannot have 0 (broken) pieces. Hence, let the number of break points be
a Poisson distribution.

minpiece <- function(lambda){
no_breakpts <- rpois(1,lambda)                    # number of break points
breakpts <- sort(runif(no_breakpts,min=0,max=1))  # break points, in increasing order
lengths <- diff(c(0,breakpts,1))                  # lengths of the resulting pieces
return(min(lengths))
}

Let us explain the code. The value no_breakpts is the number of break points. Then
runif will randomly and uniformly generate no_breakpts numbers between 0 and 1. The
function sort will order the numbers from smallest to largest. The function diff
then computes the differences between adjacent numbers in c(0,breakpts,1). Note that
we must include the end points 0 and 1 to compute the lengths from the start of the rod to the
first break point and from the last break point to the end of the rod.

Let us now find the mean and variance for λ = 4.

EX <- mean(replicate(100000,minpiece(4)))
> EX
[1] 0.08190401
> Var=mean(replicate(100000,minpiece(4))^2)-EX^2
> Var
[1] 0.02231847
Remark. See section 4.6.1 for the function runif(n,min,max). In summary, it simulates
n uniform random variable between the interval [min, max], where the default for min
and max are 0 and 1, respectively.
The Poisson distribution was introduced by Siméon Denis Poisson in a book he wrote
regarding the application of probability theory to lawsuits, criminal trials, and the like.
This book, published in 1837, was entitled Recherches sur la probabilitè des jugements en
matière criminelle et en matière civile (Investigations into the Probability of Verdicts in
Criminal and Civil Matters).

The Poisson random distribution has a tremendous range of applications in diverse


areas because it may be used as an approximation for a binomial random variable with
parameters (n, p) when n is large and p is small enough so that np is of moderate size.
To see this, suppose that X is a binomial random variable with parameters (n, p), and
let λ = np. Then
   P(X = k) = [n!/((n − k)! k!)] p^k (1 − p)^{n−k}
            = [n!/((n − k)! k!)] (λ/n)^k (1 − λ/n)^{n−k}
            = [n(n − 1) · · · (n − k + 1)/n^k] · (λ^k/k!) · (1 − λ/n)^n · (1 − λ/n)^{−k}.

Now, for n large and λ moderate,

   (1 − λ/n)^n ≃ e^{−λ},
   n(n − 1) · · · (n − k + 1)/n^k ≃ 1,
   (1 − λ/n)^{−k} ≃ 1.
Hence, for n large and λ moderate,

   p(k) = P(X = k) ≃ e^{−λ} λ^k/k!.
In other words, if n independent trials, each of which results in a success with probability
p, are performed, then when n is large and p is small enough to make np moderate, the
number of successes occurring is approximately a Poisson random variable with param-
eter λ = np. This value λ (which is the expected number of successes) will usually be
determined empirically. (This discussion is from [7].)
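We can see this approximation numerically (an illustrative sketch with n = 1000 and p = 0.0032, so that λ = np = 3.2):

n <- 1000; p <- 0.0032; lambda <- n*p
k <- 0:10
round(cbind(binomial = dbinom(k,n,p), poisson = dpois(k,lambda)), 5)  # the two columns nearly agree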

Example. Consider an experiment that consists of counting the number of α particles


given off in a 1-second interval by 1 gram of radioactive material. If we know from past
experience that on the average, 3.2 such α particles are given off, what is a good approx-
imation to the probability that no more than 2α particles will appear?

If we think of the gram of radioactive material as consisting of large number n of


atoms, each of which has probability 3.2/n of disintegrating and sending off an α particle
during the second considered, then we see that to a very close approximation, the number
of α particles given off will be a Poisson random variable with parameter λ = 3.2. Hence,
the desired probability is

3.22 −3.2
P (X ≤ 2) = e−3.2 + 3.2e−3.2 + e ≃ 0.3799.
2
Remark. If instead of the average number λ of events, we are given the average rate r at which
an event occurs, then letting t be a time frame or interval length (noting that the units
must agree with those of r), the probability density function of a Poisson distribution is

   p(k) = e^{−rt} (rt)^k/k!.

4.5 Continuous Random Variables


4.5.1 Introduction
Recall (section 3.5.3) that p(x) is a probability density function if

(a) it is defined and nonnegative on the whole of R, p(x) ≥ 0, and

(b) ∫_R p(x) dx = 1.

If p is integrable on a set D ⊆ R, then the probability that x is in D is defined to be

   ∫_D p(x) dx.

We say that X is a continuous random variable if there is a probability density function
p such that

   P(X ∈ D) = ∫_D p(x) dx

for any set D.
Remark. 1. To be technical, we need the set D to be measurable. However, all sets
that will be introduced will be measurable. In fact, all sets in this course will be
Borel sets.

2. Be careful that the term continuous random variable is used to distinguish it against
a discrete random variable. It is not necessary that the probability density function
of a continuous random variable X is continuous.

The cumulative distribution function of a continuous random variable with probability
density function p(x) is

   F(t) = P(X ≤ t) = ∫_{−∞}^t p(x) dx.

Example. The amount of time in hours that a computer functions before breaking down
is a continuous random variable with probability density function given by

   p(x) = λ e^{−x/100} for x ≥ 0, and p(x) = 0 for x < 0.

1. Find the probability that it will function between 50 and 150 hours before breaking
down.

2. Find the probability that it will last more than 100 hours.
Since ∫_R p(x) dx = 1, we have

   1 = ∫_{−∞}^∞ p(x) dx = λ lim_{N→∞} ∫_0^N e^{−x/100} dx
     = λ(−100) lim_{N→∞} [e^{−x/100}]_0^N
     = −100λ (lim_{N→∞} e^{−N/100} − 1) = 100λ,

that is, λ = 1/100.

1. The probability that a computer will function between 50 and 150 hours before
breaking down is

   P(50 ≤ X ≤ 150) = ∫_{50}^{150} (1/100) e^{−x/100} dx = [−e^{−x/100}]_{50}^{150} = e^{−1/2} − e^{−3/2} ≃ 0.383.

2. The probability that a computer will last longer than 100 hours is

   P(X > 100) = ∫_{100}^∞ (1/100) e^{−x/100} dx = [−e^{−x/100}]_{100}^∞ = e^{−1} ≃ 0.368.

Note that this is also 1−F (100), where F (t) is the cumulative distribution function.
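We can confirm these numbers in R by numerical integration (a small sketch; integrate() approximates the definite integrals above):

p <- function(x) ifelse(x >= 0, exp(-x/100)/100, 0)
integrate(p, 50, 150)$value    # P(50 <= X <= 150), about 0.383
integrate(p, 100, Inf)$value   # P(X > 100), about 0.368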

Here are some properties of the probability and cumulative distribution function. Let
X be a continuous random variable with probability density function p(x).
(i) The probability of a point is 0, P(X = a) = 0 for any real number a. This follows
from

   P(X = a) = lim_{ε→0} P(a − ε ≤ X ≤ a + ε) = lim_{ε→0} ∫_{a−ε}^{a+ε} p(x) dx ≃ lim_{ε→0} 2ε p(a) = 0.

(ii) P(X < b) = P(X ≤ b) = ∫_{−∞}^b p(x) dx and P(X > a) = P(X ≥ a) = ∫_a^∞ p(x) dx
(see section 3.7.7). Hence,

   P(a < X < b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a ≤ X ≤ b) = ∫_a^b p(x) dx.

(iii) Since F(t) = ∫_{−∞}^t p(x) dx, by the fundamental theorem of calculus,

   d/dx F(a) = p(a), for all a ∈ R.

That is, the density is the derivative of the cumulative distribution function.

4.5.2 Expected Value and Variance of Continuous Random Vari-


ables
The expected value of a continuous random variable X with probability density function
p(x) is defined as

   E[X] = ∫_{−∞}^∞ x p(x) dx.

Theorem (Expected value of a function of a random variable). If X is a continuous
random variable with probability density function p(x), then, for any real-valued function
g,

   E[g(X)] = ∫_{−∞}^∞ g(x) p(x) dx.

Readers may refer to the appendix for the proof.

Corollary. For real numbers a and b,

E[aX + b] = aE[X] + b.

Proof. Let p(x) be the probability density function of X and g(X) = aX + b in the
theorem above; then

   E[aX + b] = ∫_{−∞}^∞ (ax + b) p(x) dx = a ∫_{−∞}^∞ x p(x) dx + b ∫_{−∞}^∞ p(x) dx = aE[X] + b,

where in the last equality we use the fact that ∫_R p(x) dx = 1.

Theorem (Properties of expected values). Let X and Y be continuous random variables,


and a, b ∈ R be real numbers. Then

(i) (Linearity) E[aX + bY] = aE[X] + bE[Y]


(ii) (Independence) If X and Y are independent, then

E[X · Y ] = E[X] · E[Y ].

Remark. 1. As in the case for discrete random variable, we need to define joint
probability density function to be able to prove the results above.

2. By induction on property (i), we have for random variables X1 , X2 , ..., Xk , and real
numbers a1 , a2 , ..., ak ∈ R,

E[a1 X1 + a2 X2 + · · · + ak Xk ] = a1 E[X1 ] + a2 E[X2 ] + · · · + ak E[Xk ].

3. By induction on property (ii), if X1, X2, ..., Xk are independent random variables, then

   E[X1 · X2 · · · Xk] = E[X1] · E[X2] · · · E[Xk].

The variance of a continuous random variable X with probability density function
p(x) is defined as

   Var[X] = E[(X − µ)^2],

where µ = E[X] is the expected value. Just as in the case for discrete random variables,
we have the following lemma. The proof is analogous to the discrete case, by replacing
the summations with integrations.

Lemma.
V ar[X] = E[X 2 ] − E[X]2 .

Similarly, we have the following properties.

Theorem (Properties of Variance). (i) Let X be a continuous random variable, and a, b ∈
R be real numbers. Then

V ar[aX + b] = a2 V ar[X].

(ii) If X and Y are independent random variables, then

V ar[X + Y ] = V ar[X] + V ar[Y ].

Corollary (Properties of Variance). From the theorem, we get the following properties.

1. V ar[a] = 0 for any real number a ∈ R.

2. If X1, X2, ..., Xn are independent random variables, then for any real numbers
a1, a2, ..., an ∈ R,

   Var[a1X1 + a2X2 + · · · + anXn] = a1^2 Var[X1] + a2^2 Var[X2] + · · · + an^2 Var[Xn].


4.6 Some Families of Continuous Univariate Distri-
butions
4.6.1 Uniform Distributions
A continuous random variable X is a uniform random variable on the interval (a, b) if
the probability density function of X is given by

   p(x) = 1/(b − a) if a < x < b, and p(x) = 0 otherwise.

Remark. Note that in this course we have chosen the definition of a uniform distribution
to exclude the end points a and b.

For any c, d ∈ (a, b), c < d,

   P(c < X < d) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a).

Let F(x) be the cumulative distribution function of a uniform distribution on the
interval (a, b). For any c ∈ (a, b),

   F(c) = ∫_a^c 1/(b − a) dx = (c − a)/(b − a).

Hence,

   F(x) = 0 if x ≤ a,
   F(x) = (x − a)/(b − a) if a < x < b,
   F(x) = 1 if x ≥ b.

The expected value and variance of a uniform distribution on the interval (a, b) are

   E[X] = (b + a)/2,

   Var[X] = (b − a)^2/12.
12
Proof.

   E[X] = ∫_a^b x/(b − a) dx = (1/(b − a)) [x^2/2]_a^b = (b^2 − a^2)/(2(b − a)) = (b + a)/2,

and

   E[X^2] = ∫_a^b x^2/(b − a) dx = (1/(b − a)) [x^3/3]_a^b = (b^3 − a^3)/(3(b − a)) = (b^2 + ab + a^2)/3,

and thus

   Var[X] = (b^2 + ab + a^2)/3 − ((b + a)/2)^2 = (4b^2 + 4ab + 4a^2 − 3b^2 − 6ab − 3a^2)/12 = (b − a)^2/12.

The R functions relevant to a uniform distribution on (a, b) are


• runif(n,a,b), to generate n independent experiments

• dunif(x,a,b), the probability density function p(x)

• punif(k,a,b), the cumulative distribution function F (k) = P (X ≤ k)

• qunif(q,a,b), to find a c such that F (c) = P (X ≤ c) = q


Note that the default values for a and b are 0 and 1, respectively.
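For instance (a small usage sketch for a uniform distribution on the interval (2, 5)):

punif(3,2,5)     # F(3) = (3-2)/(5-2) = 1/3
qunif(0.5,2,5)   # the median, 3.5
mean(runif(100000,2,5))   # sample mean, close to (2+5)/2 = 3.5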
Example. During the peak hour, trains arrive at a particular stations at 7-minutes in-
terval starting at 7a.m. That is, they arrive at 7, 7:07, 7:14, etc. Suppose the time that
a man arrives at the station is a uniform distribution between 7a.m and 7:15a.m. Is he
more likely to wait less than 2 minutes for the train, or more than 5 minutes for the train?

Let X denote the amount of minutes pass 7a.m that the man arrives at the station.
Then X is a uniform distribution on (0, 15).

For the man to wait less than 2 minutes, he must arrive between 7:05 and 7:07, or
between 7:12 and 7:14. Hence, the probability is

   P(5 < X < 7) + P(12 < X < 14) = (7 − 5)/15 + (14 − 12)/15 = 4/15.

For the man to wait more than 5 minutes, he must arrive between 7:00 and 7:02, between
7:07 and 7:09, or between 7:14 and 7:15. Hence, the probability is

   P(0 < X < 2) + P(7 < X < 9) + P(14 < X < 15) = (2 − 0)/15 + (9 − 7)/15 + (15 − 14)/15 = 1/3.
Hence, he is more likely to wait more than 5 minutes for the train than to wait less
than 2 minutes for the train.

Let us simulate this.


waitingtime <- function(t){
diff <- c(7-t,14-t,21-t)
return(min(diff[diff>0]))
}

waittrain <- function(nreps){


count2 <- 0
count5 <- 0
for (i in 1:nreps){
arrive <- runif(1,0,15)
wait <- waitingtime(arrive)
if (wait<2) count2 <- count2 + 1
if (wait>5) count5 <- count5 + 1
}
Probs <- c(count2/nreps,count5/nreps)
names(Probs) <- c("P(wait less than 2mins)","P(wait more than 5mins)")
return(Probs)
}
> waittrain(1000000)
P(wait less than 2mins) P(wait more than 5mins)
0.266914 0.333202

4.6.2 Normal Distributions


A continuous random variable X is a normal distribution with parameters (µ, σ) if its
probability density function is of the form

   p(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}, for all x ∈ R.
The graph of the probability density function of a normal distribution with parameters
(µ, σ) is a bell-shaped curve, symmetric about µ, and σ tells us how wide the base of the
bell is; the larger σ is, the wider the base of the bell.

The cumulative distribution function of a normal distribution with parameters (µ, σ) is

   F(c) = P(X ≤ c) = ∫_{−∞}^c (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)} dx.
The standard normal distribution is a normal distribution with parameters (0, 1); that
is, the probability density function is given by

   p(z) = (1/√(2π)) e^{−z^2/2}, for all z ∈ R.

The cumulative distribution function of the standard normal distribution is

   F(c) = P(X ≤ c) = ∫_{−∞}^c (1/√(2π)) e^{−x^2/2} dx.
Given a normal distribution X with parameters (µ, σ), we can standardize it:

   Z = (X − µ)/σ.

Then the cumulative distribution function of Z satisfies

   P(Z ≤ (c − µ)/σ) = P(X ≤ c) = ∫_{−∞}^c (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)} dx = ∫_{−∞}^{(c−µ)/σ} (1/√(2π)) e^{−z^2/2} dz.

But the integral on the right is the cumulative distribution function of the standard nor-
mal distribution. This shows that if X is a normal distribution with parameters (µ, σ),
then after standardizing, Z is the standard normal distribution.

The expected value and variance of the standard normal distribution are

E[Z] = 0,
V ar[Z] = 1.

Readers may refer to the appendix for the derivations.

Hence, the expected value and variance of a normal distribution with parameters
(µ, σ) are

E[X] = E[σZ + µ] = σE[Z] + µ = µ,
Var[X] = Var[σZ + µ] = σ^2 Var[Z] = σ^2,

that is, the parameters of a normal distribution are the mean and the standard deviation
(which explains the use of the notations µ and σ).

The R functions relevant to a normal distribution with parameters (µ, σ) are


• rnorm(n,mean,sd), to generate n independent experiments, where mean is the ex-
pected value µ, and sd is the standard deviation σ.

• dnorm(x,mean,sd), the probability density function p(x)

• pnorm(k,mean,sd), the cumulative distribution function F (k) = P (X ≤ k)

• qnorm(q,mean,sd), to find a c such that F (c) = P (X ≤ c) = q


Note that the default values for mean and sd are 0 and 1, respectively.

It is customary to denote the cumulative distribution function of the standard normal distribution as Φ, that is, Φ(c) = P(Z ≤ c), where Z is the standard normal distribution. In the past, without the use of computers, one would have to standardize a normal distribution and find the probabilities using a table of Φ values (readers may search the internet for a standard normal distribution table).
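For instance, a small table of Φ values can be generated directly with pnorm; this is a minimal sketch, and the choice of z values below is ours.

# A miniature standard normal table: Phi(z) = P(Z <= z)
z <- seq(0, 3, by = 0.5)
Phi <- pnorm(z)                 # mean = 0 and sd = 1 are the defaults
round(data.frame(z, Phi), 4)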
Example. 1. Suppose in a particular test, with total marks 100, the mean and standard deviation of the scores of the cohort are 45 and 12, respectively.
(a) What is the probability that a randomly selected student in the cohort will get
between 40 to 65 marks?
(b) Suppose to score an A grade, a student must be in the top 10% of the cohort.
What marks must a student obtain to score an A grade?
(c) Find the constant c such that P (60 ≤ X ≤ c) = 0.1.

It is usually assumed that the marks distribution of a class follows a normal distribu-
tion, provided the class is large enough. This will be justified in section 4.7.6. So, if
X is the marks obtained by a randomly chosen student, X is a normal distribution
with parameters (45, 12).

(a) Since P (40 ≤ X ≤ 65) = P (X ≤ 65) − P (X ≤ 40), the probability is approxi-


mately 0.61.
> pnorm(65,45,12)-pnorm(40,45,12)
[1] 0.6137485

As a practice, we will compute the probability by standardizing,

P(40 ≤ X ≤ 65) = P((40 − 45)/12 ≤ Z ≤ (65 − 45)/12) = P(−5/12 ≤ Z ≤ 5/3).
Then the probability is
> pnorm(5/3)-pnorm(-5/12)
[1] 0.6137485
(b) We are looking for a c such that P (X > c) = 0.1. This is equivalent to finding
c such that P (X ≤ c) = 0.9.
> qnorm(0.9,45,12)
[1] 60.37862
That is, a student must obtain 61 marks or higher to get an A grade (it must
be a really difficult test!).

Alternatively, P(X ≤ c) = 0.9 ⇔ 0.9 = P(Z ≤ (c − 45)/12).
> qnorm(0.9)
[1] 1.281552
which means

(c − 45)/12 = 1.28 ⇒ c ≈ 60.4.
> 12*qnorm(0.9)+45
[1] 60.37862
(c) We have 0.1 = P(60 ≤ X ≤ c) = P(X ≤ c) − P(X ≤ 60), so P(X ≤ c) = 0.1 + P(X ≤ 60) ≈ 0.994.

> 0.1 + pnorm(60,45,12)


[1] 0.9943502
So, c is 75.4.
> qnorm(0.1 + pnorm(60,45,12),45,12)
[1] 75.39955
Exercise. Answer (c) using standardization.

2. An expert witness in a paternity suit testifies that the length (in days) of hu-
man gestation is approximately normally distributed with parameters µ = 270 and
σ 2 = 100. The defendant in the suit is able to prove that he was out of the country
during a period that began 290 days before the birth of the child and ended 240
days before the birth. If the defendant was, in fact, the father of the child, what
is the probability that the mother could have had the very long or short gestation
indicated by the testimony?

Let X denote the length of the gestation, and assume that the defendant is the
father. Then the probability that the birth could occur within the indicated period
is

P ((X > 290) ∪ (X < 240)) = P (X > 290) + P (X < 240)


= (1 − P (X ≤ 290)) + P (X < 240)
≃ 0.024

> 1-pnorm(290,270,10)+pnorm(240,270,10)
[1] 0.02410003

Exercise. Is this the probability that the defendant is the father of the child?

4.6.3 Exponential Distributions


Recall that in a Poisson distribution, the (discrete) random variable counts the number of events that take place in a given fixed interval. However, we could also consider the waiting time between successive events. This is now a continuous random variable, and its probability density function is given by

p(x) = λe^{−λx} if x ≥ 0, and 0 if x < 0.

This is called an exponential distribution with parameter λ. This follows from the following reasoning. Let X denote the waiting time between a pair of successive events. Since waiting time is nonnegative, P(X < x) = 0 for x < 0. For x ≥ 0,

F(x) = P(X ≤ x) = 1 − P(X > x)
= 1 − P(no events in [0, x])
= 1 − e^{−λx} (λx)^0/0!
= 1 − e^{−λx}.

Hence, noting that the derivative of a cumulative distribution function is the probability density function, we have p(x) = (d/dx)F(x) = λe^{−λx}, as desired (see the remarks at the end of section 4.4.6). The derivations above also show that the cumulative distribution function is given by

F(c) = 1 − e^{−λc}, for c ≥ 0.
The exponential distributions are memoryless, that is, the probability that one has to wait a further amount of time is independent of how long one has already waited. Precisely, for any positive real numbers t1, t2 > 0,

P(X > t1 + t2 | X > t1) = P(X > t2).
This follows from

P(X > t1 + t2 | X > t1) = P((X > t1 + t2) ∩ (X > t1))/P(X > t1) = P(X > t1 + t2)/P(X > t1)
= (∫_{t1+t2}^{∞} λe^{−λx} dx)/(∫_{t1}^{∞} λe^{−λx} dx) = [−e^{−λx}]_{t1+t2}^{∞} / [−e^{−λx}]_{t1}^{∞} = e^{−λ(t1+t2)}/e^{−λt1}
= e^{−λt2} = P(X > t2).
In fact, the exponential distributions are the only distributions that possess this property. Readers may refer to the appendix for details.

This says, for example, that if the waiting time for finding a parking space in a busy mall follows an exponential distribution, then the probability that a patron has to wait, say, a further fifteen minutes after searching for a while, is the same as the probability that he has to wait for fifteen minutes when he first arrived at the carpark.

The expected value and variance are

E[X] = 1/λ,
Var[X] = 1/λ^2.
Proof. By integration by parts, with u = x and v = −e^{−λx},

E[X] = ∫_0^∞ xλe^{−λx} dx = lim_{N→∞} [−xe^{−λx}]_0^N + ∫_0^∞ e^{−λx} dx
= lim_{N→∞} [−e^{−λx}/λ]_0^N
= 1/λ.

More generally, by integration by parts, with u = x^n and v = −e^{−λx},

E[X^n] = ∫_0^∞ x^n λe^{−λx} dx = lim_{N→∞} [−x^n e^{−λx}]_0^N + ∫_0^∞ n x^{n−1} e^{−λx} dx
= (n/λ) ∫_0^∞ x^{n−1} λe^{−λx} dx
= (n/λ) E[X^{n−1}].

So, letting n = 2, we have E[X^2] = (2/λ)(1/λ) = 2/λ^2. Hence,

Var[X] = E[X^2] − E[X]^2 = 2/λ^2 − (1/λ)^2 = 1/λ^2.
The R functions relevant to an exponential distribution with parameter λ are
• rexp(n,lambda), to generate n independent experiments

• dexp(x,lambda), the probability density function p(x)

• pexp(k,lambda), the cumulative distribution function F (k) = P (X ≤ k)

• qexp(q,lambda), to find a c such that F (c) = P (X ≤ c) = q


Example. 1. A certain private hire car company charges $2.50 for the first kilometer, and $0.50 per kilometer thereafter. Assume that the fare is prorated after the first kilometer. Suppose the ride distance follows an exponential distribution with mean 10 km. Let X be the total fees paid.

(a) What is the probability that a ride costs more than $5.50?

First, let S be the random variable denoting the distance. Since the mean distance is E[S] = 1/λ = 10, we have λ = 0.1. Next, since 5.50 > 2.50, the ride must be more than 1 km. So,

P(X > 5.50) = P(S − 1 > (5.5 − 2.5)/0.5) = P(S > 7) = e^{−7(0.1)} ≈ 0.497.
> exp(-7*0.1)
[1] 0.4965853
> 1-pexp(7,0.1)
[1] 0.4965853
(b) What is the expected value E[X]?

Next, note that X is a function of S,

X = 2.5 if 0 < S ≤ 1, and X = 2.5 + 0.5(S − 1) if S > 1.

Then

E[X] = 2.5 P(0 < S ≤ 1) + ∫_1^∞ (2 + 0.5s)(0.1)e^{−0.1s} ds
= 2.5(1 − e^{−0.1}) + 0.2 ∫_1^∞ e^{−0.1s} ds + 0.05 ∫_1^∞ s e^{−0.1s} ds
≈ 7.02

> 2.5*pexp(1,0.1)+0.2*(integrate(function(s) exp(-0.1*s),1,Inf)$value)
+0.05*(integrate(function(s) s*exp(-0.1*s),1,Inf)$value)
[1] 7.024187

2. Suppose the time a patient has to wait in minutes in a clinic for his turn is expo-
nentially distributed with mean 15 minutes. What is the probability that he has to
wait for more than 10 minutes given that the previous patient has already entered
the consultation room for 3 minutes?
Since the mean is 15 minutes, λ = 1/15. By the memoryless property of the exponential distribution,

P (X > 10|X > 3) = P (X > 7) = e−7(1/15) ≃ 0.63

> 1-pexp(7,1/15)
[1] 0.6270891
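The memoryless property can also be checked by simulation. Below is a brief sketch (the function name checkmemoryless and the number of replications are our own choices) that estimates P(X > 10 | X > 3) from exponential samples and compares it with P(X > 7).

checkmemoryless <- function(nreps){
  x <- rexp(nreps, 1/15)              # waiting times, mean 15 minutes
  est <- sum(x > 10)/sum(x > 3)       # estimate of P(X > 10 | X > 3)
  c(estimate = est, exact = 1 - pexp(7, 1/15))
}
checkmemoryless(1000000)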

4.6.4 Gamma Distributions


Define the gamma function, denoted as Γ(α), to be

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx, for all α > 0.

Lemma (Some properties of the gamma function).

(i) For any positive real number α, Γ(α + 1) = αΓ(α).

(ii) For any positive integer n, Γ(n) = (n − 1)!.

The R function for the gamma function is gamma(x), for any real number x.

A random variable is said to have a gamma distribution with parameters (α, λ), for some positive λ, α > 0, if its probability density function is given by

p(x) = λe^{−λx} (λx)^{α−1}/Γ(α) for x ≥ 0, and 0 for x < 0.

When α = n is a positive integer, the gamma distribution with parameters (n, λ) is the distribution of the amount of time one has to wait until a total of n events have occurred, where the waiting times between successive events are independent exponential distributions with parameter λ. This distribution is called an Erlang distribution. The probability density function is

p(x) = λe^{−λx} (λx)^{n−1}/(n − 1)! for x ≥ 0, and 0 for x < 0.
Readers may refer to the appendix for details.

The expected value and variance are

E[X] = α/λ,
Var[X] = α/λ^2.
Remark. 1. When α = 1, the gamma distribution with parameters (1, λ) is the exponential distribution with parameter λ. This follows from the fact (shown in the appendix) that Γ(1) = 1, and for x > 0,

p(x) = λe^{−λx} (λx)^{1−1}/Γ(1) = λe^{−λx}.

2. When α = n/2 and λ = 1/2, the gamma distribution with parameters (n/2, 1/2) is a chi-square distribution with n degrees of freedom (unfortunately, we will not be discussing chi-square distributions in this course).
The R functions relevant to a gamma distribution with parameters (α, λ) are
• rgamma(n,alpha,lambda), to generate n independent experiments
• dgamma(x,alpha,lambda), the probability density function p(x)
• pgamma(k,alpha,lambda), the cumulative distribution function F (k) = P (X ≤ k)
• qgamma(q,alpha,lambda), to find a c such that F (c) = P (X ≤ c) = q
The default for lambda is 1.
Example. 1. Suppose in a network context, a node does not transmit until it has accumulated five messages in its buffer. Suppose the times between message arrivals are independent and exponentially distributed with mean 100 milliseconds (that is, the arrivals form a Poisson process). What is the probability that more than 552 milliseconds will pass before a transmission is made, starting with an empty buffer?

Since the mean of the exponential inter-arrival time is 100 milliseconds, λ = 1/100 = 0.01.
Hence, the time until the accumulation of five messages is a gamma distribution
with parameters (α = 5, λ = 0.01). So,
P (X > 552) ≃ 0.35.
> 1-pgamma(552,5,0.01)
[1] 0.3544101
2. Suppose that the average arrival rate at a local fast food drive-through window is three cars per minute (λ = 3). If one car has already gone through the drive-through, what is the average waiting time before the third car arrives?

The problem is asking for the mean of a gamma distribution with parameters (α = 2, λ = 3), which is

E[X] = 2/3.

The reasoning is as follows. Since on average there are 3 cars per minute, on average the time between the arrivals of successive cars is 1/3 minute. Since 1 car has already gone through, the average waiting time before the third car arrives is the average waiting time for 2 more cars to arrive, which is then 2 times 1/3 minute.
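The connection between the gamma (Erlang) distribution and sums of exponential waiting times can be illustrated by simulation. The sketch below (the function name and replication count are our own choices) sums α independent exponential waiting times and compares the simulated mean with α/λ.

gammawait <- function(nreps, alpha, lambda){
  # each row: alpha independent exponential inter-arrival times
  waits <- matrix(rexp(nreps*alpha, lambda), nrow = nreps)
  total <- rowSums(waits)             # waiting time until the alpha-th arrival
  c(simulated.mean = mean(total), exact.mean = alpha/lambda)
}
gammawait(100000, 2, 3)   # example 2 above: roughly 2/3 minute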

4.6.5 Beta Distributions


Besides the uniform, another distribution with positive probability density function in an interval is the beta distribution. A distribution with probability density function given by

p(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1} if 0 < x < 1, and 0 otherwise,

is called a standard beta distribution with parameters (α, β).

Define the beta function to be

B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx,

for any positive real numbers α, β > 0.

Lemma. For any positive real numbers α, β > 0,

B(α, β) = Γ(α)Γ(β)/Γ(α + β).

Therefore, the probability density function of a standard beta distribution with parameters (α, β) can be written as

p(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1},

for all 0 < x < 1. This is the reason this distribution is called a beta distribution, and clearly

∫_R p(x) dx = (∫_0^1 x^{α−1} (1 − x)^{β−1} dx)/B(α, β) = B(α, β)/B(α, β) = 1.
The expected value and variance of a standard beta distribution with parameters (α, β) are

E[X] = α/(α + β),
Var[X] = αβ/((α + β)^2 (α + β + 1)).

The cumulative distribution function of a standard beta distribution with parameters (α, β) is

F(c) = P(X ≤ c) = ∫_0^c (1/B(α, β)) x^{α−1} (1 − x)^{β−1} dx,

for 0 ≤ c ≤ 1.

More generally, the probability density function of the beta distribution with parameters (α, β) over the interval (a, b) is defined to be

p(x) = (1/(b − a)) (1/B(α, β)) ((x − a)/(b − a))^{α−1} ((b − x)/(b − a))^{β−1},

for all a < x < b, and 0 everywhere else.

The cumulative distribution function of a beta distribution with parameters (α, β) over the interval (a, b) is

F(c) = ∫_a^c (1/(b − a)) (1/B(α, β)) ((x − a)/(b − a))^{α−1} ((b − x)/(b − a))^{β−1} dx.
Let X be a beta random variable with parameters (α, β) over the interval (a, b), and let

Y = (X − a)/(b − a).

Let FX and FY be the cumulative distribution functions of X and Y, respectively. Then, with the substitution y = (x − a)/(b − a),

P(Y ≤ (c − a)/(b − a)) = P(X ≤ c) = ∫_a^c (1/(b − a)) (1/B(α, β)) ((x − a)/(b − a))^{α−1} ((b − x)/(b − a))^{β−1} dx
= (1/B(α, β)) ∫_0^{(c−a)/(b−a)} y^{α−1} (1 − y)^{β−1} dy.

That is, Y is the standard beta distribution with parameters (α, β). Hence, we can always convert any beta distribution to a standard one. Also, from the discussions above, the expected value and variance of a beta distribution with parameters (α, β) over the interval (a, b) are

E[X] = E[a + Y(b − a)] = a + (b − a) α/(α + β),
Var[X] = Var[a + Y(b − a)] = (b − a)^2 αβ/((α + β)^2 (α + β + 1)).
The R functions relevant to a standard beta distribution with parameters (α, β) are
• rbeta(n,alpha,beta), to generate n independent experiments
• dbeta(x,alpha,beta), the probability density function p(x)
• pbeta(k,alpha,beta), the cumulative distribution function F (k) = P (X ≤ k)
• qbeta(q,alpha,beta), to find a c such that F (c) = P (X ≤ c) = q
Example. 1. A certain economic rice stall cooks a new batch of rice every day at 3 p.m. The cook would like to know how much rice to cook to fill the rice bucket. The cashier claims that the proportion of the rice sold from its opening to 3 p.m. can be modeled with a standard beta distribution with parameters (α = 3, β = 2). Compute the expected value of the proportion of the rice sold before 3 p.m. How likely is it that at least 85% of the rice in the bucket will be sold before 3 p.m.?

Let X represent the proportion of the rice sold before 3p.m. Then the expected
value is

E[X] = α/(α + β) = 3/5.
The probability that at least 85% of the rice in the bucket is sold before 3p.m is
P (X ≥ 0.85) = 1 − P (X < 0.85) ≃ 0.11
> 1-pbeta(0.85,3,2)
[1] 0.1095188

The graph of the probability density function of the above beta distribution is as
follows.
> curve(dbeta(x,3,2))
2. Project managers often use a Program Evaluation and Review Technique (PERT)
to manage large scale projects. PERT was actually developed by the consulting
firm of Booz, Allen, & Hamilton in conjunction with the United States Navy as
a tool for coordinating the activities of several thousands of contractors working
on the Polaris missile project. A standard assumption in PERT analysis is that
the time to complete any given activity follows a general beta distribution, where
a is the optimistic time to complete an activity and b is the pessimistic time to
complete the activity. Suppose the time X (in hours) it takes a three man crew to
re-roof a single-family house has a beta distribution with a = 8, b = 16, α = 2, and
β = 3. The crew will complete the reroofing in a single day provided the total time
to complete the job is no more than 10 hours. If this crew is contracted to re-roof a
single-family house, what is the chance that they will finish the job in the same day?

The distribution is a beta distribution with parameters (α = 2, β = 3) over the interval (8, 16). We are to find P(X ≤ 10).

P(X ≤ 10) = ∫_8^10 (1/(16 − 8)) (1/B(2, 3)) ((x − 8)/(16 − 8))^{2−1} ((16 − x)/(16 − 8))^{3−1} dx
= (1/8^4) (Γ(5)/(Γ(2)Γ(3))) ∫_8^10 (x − 8)(16 − x)^2 dx
= (1/4096)(4!/2) ∫_8^10 (x^3 − 40x^2 + 512x − 2048) dx
= (3/1024) [x^4/4 − 40x^3/3 + 256x^2 − 2048x]_8^10
= (3/1024)(268/3) ≈ 0.2617.
Alternatively,
> a = 8; b = 16; alpha = 2; beta = 3
> pbeta((10-a)/(b-a),alpha,beta)
[1] 0.2617188
Exercise. Plot the graph of the probability density function for different parameters
(α, β),
> curve(dbeta(x,alpha,beta))

In particular, choose values in the interval 0 < α, β ≤ 1 and/or 1 < α, β. What can
you observe about the shape of the graphs for the different values of α and β, and explain
it.
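As a starting point for this exercise, the following sketch (the parameter choices are ours) overlays a few beta densities on one plot using curve with add = TRUE.

curve(dbeta(x, 0.5, 0.5), from = 0, to = 1, ylab = "p(x)", ylim = c(0, 3))
curve(dbeta(x, 1, 1), add = TRUE, lty = 2)    # the uniform distribution
curve(dbeta(x, 2, 3), add = TRUE, lty = 3)
curve(dbeta(x, 5, 5), add = TRUE, lty = 4)
legend("top", legend = c("(0.5,0.5)", "(1,1)", "(2,3)", "(5,5)"), lty = 1:4)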

4.7 Multivariable Distributions


4.7.1 Joint Probability Density Functions
Discrete random variables
Consider now an experiment involving two or more random variables: for example, the sum of the outcomes of two dice rolls, or, more relevant to daily life, the way an online shopping app suggests some products to you after you have searched for an item, since these are items frequently bought together with the item you searched for. The app is relying on the fact that sales of certain groups of items are correlated.

If X and Y are discrete random variables, the function given by

pX,Y (x, y) = P ((X = x) ∩ (Y = y))

for each pair of (x, y) in the domain of X and Y is called the joint probability density function
of X and Y . It must fulfill the following properties.
(i) (Nonnegative) pX,Y (x, y) ≥ 0 for all (x, y).
(ii) (Sum to one) Σ_x Σ_y pX,Y(x, y) = 1.

(iii) (Probability of an event) P((X, Y) ∈ A) = Σ_{(x,y)∈A} pX,Y(x, y).
Let pX,Y be the joint probability density function for discrete random variables X
and Y . We are able to obtain the probability density functions of X and Y , called the
marginal probability density functions by
pX(x) = Σ_y pX,Y(x, y),
pY(y) = Σ_x pX,Y(x, y),

respectively.

The joint cumulative distribution function of X and Y is defined to be

FX,Y(a, b) = P((X ≤ a) ∩ (Y ≤ b)) = Σ_{x≤a} Σ_{y≤b} pX,Y(x, y).

The marginal cumulative distribution functions of X and Y are defined to be

FX(a) = P((X ≤ a) ∩ (Y < ∞)) = Σ_{x≤a} pX(x) = Σ_{x≤a} Σ_y pX,Y(x, y),
FY(b) = P((X < ∞) ∩ (Y ≤ b)) = Σ_{y≤b} pY(y) = Σ_{y≤b} Σ_x pX,Y(x, y),

respectively.
Example. 1. Suppose that 3 balls are randomly selected from a bag containing 3 red,
4 white, and 5 blue balls. If we let X and Y denote, respectively, the number of
red and white balls chosen, then the joint probability mass function of X and Y , is
given by
pX,Y(x, y) = C(3, x) C(4, y) C(5, 3 − x − y)/C(12, 3), for 0 ≤ x + y ≤ 3,

where C(n, k) denotes the binomial coefficient (n choose k).
> library(MASS)   # needed for the fractions() function used below
> x <- y <- seq(from=0,to=3)
> p <- outer(x,y,function(x,y){
choose(3,x)*choose(4,y)*choose(5,3-x-y)/choose(12,3)})
> rownames(p) <- c("X=0","X=1","X=2","X=3")
> colnames(p) <- c("Y=0","Y=1","Y=2","Y=3")
> fractions(p)
Y=0 Y=1 Y=2 Y=3
X=0 1/22 2/11 3/22 1/55
X=1 3/22 3/11 9/110 0
X=2 3/44 3/55 0 0
X=3 1/220 0 0 0

The joint cumulative distribution function is


CDF <- function(x,y)sum(p[seq(1,x),seq(1,y)])
> F <- matrix(,4,4)
for(i in 1:4){
for(j in 1:4) F[i,j] <- CDF(i,j)
}
> rownames(F) <- c("X<=0","X<=1","X<=2","X<=3")
> colnames(F) <- c("Y<=0","Y<=1","Y<=2","Y<=3")
> fractions(F)
Y<=0 Y<=1 Y<=2 Y<=3
X<=0 1/22 5/22 4/11 21/55
X<=1 2/11 7/11 47/55 48/55
X<=2 1/4 167/220 43/44 219/220
X<=3 14/55 42/55 54/55 1

The marginal probability density functions are


> px <- rowSums(p)
> fractions(px)
X=0 X=1 X=2 X=3
21/55 27/55 27/220 1/220
> py <- colSums(p)
> fractions(py)
Y=0 Y=1 Y=2 Y=3
14/55 28/55 12/55 1/55

Indeed,
> fractions(choose(3,x)*choose(9,3-x)/choose(12,3))
[1] 21/55 27/55 27/220 1/220

> fractions(choose(4,y)*choose(8,3-y)/choose(12,3))
[1] 14/55 28/55 12/55 1/55

Exercise. Find P ((1 ≤ X ≤ 3) ∩ (2 ≤ Y ≤ 3)).

2. Suppose that 15 percent of the families in a certain community have no children,


20 percent have 1 child, 35 percent have 2 children, and 30 percent have 3. Suppose
further that in each family each child is equally likely (independently) to be a boy
or a girl. If a family is chosen at random from this community, then B, the number
of boys, and G, the number of girls, in this family will have the joint probability
mass function as follows.

> b <- g <- seq(from=0,to=3)


> f <- function(t){
if (t>3) return(0)
if (t==0) return(0.15)
if (t==1) return(0.2*0.5)
if (t==2) return(0.35*0.5^2)
if (t==3) return(0.3*0.5^3)
}
> p <- outer(b,g,function(b,g)b+g)
> p <- sapply(p,f)
> p <- matrix(p,4,4)
> rownames(p) <- c("B=0","B=1","B=2","B=3")
> colnames(p) <- c("Y=0","Y=1","Y=2","Y=3")
> p
Y=0 Y=1 Y=2 Y=3
B=0 0.1500 0.1000 0.0875 0.0375
B=1 0.1000 0.0875 0.0375 0.0000
B=2 0.0875 0.0375 0.0000 0.0000
B=3 0.0375 0.0000 0.0000 0.0000

Exercise. Find the marginal probability density functions and the joint cumulative
distribution function.

Exercise. In both the examples above, verify that the joint probability density functions
satisfy the sum to one property.
Continuous random variables
A probability density function p : R^n → R is called a joint probability density function when n ≥ 2. We will only discuss the case where n = 2, that is, X and Y are continuous random variables, and pX,Y : R^2 → R is the joint probability density function. Recall that pX,Y must satisfy the following properties.

(i) (Nonnegative) pX,Y(x, y) ≥ 0 for all (x, y) ∈ R^2.

(ii) (Integrate to one) ∫∫_{R^2} pX,Y dA = 1.

(iii) (Probability of an event) P((X, Y) ∈ D) = ∫∫_D pX,Y dA.

Remark. From (iii), given subsets A, B ⊆ R, if we let D = { (x, y) : x ∈ A, y ∈ B },

P((X ∈ A) ∩ (Y ∈ B)) = ∫∫_D pX,Y dA = ∫_B ∫_A pX,Y(x, y) dx dy.

Let pX,Y be the joint probability density function for continuous random variables X and Y. The marginal probability density functions of X and Y are

pX(x) = ∫_{−∞}^{∞} pX,Y(x, y) dy,
pY(y) = ∫_{−∞}^{∞} pX,Y(x, y) dx,

respectively. That is, the probabilities that X ∈ A and Y ∈ B for subsets A, B ⊆ R are

P(X ∈ A) = ∫_A pX(x) dx = ∫_A ∫_{−∞}^{∞} pX,Y(x, y) dy dx,
P(Y ∈ B) = ∫_B pY(y) dy = ∫_B ∫_{−∞}^{∞} pX,Y(x, y) dx dy,

respectively.

The joint cumulative distribution function is

FX,Y(a, b) = ∫_{−∞}^{b} ∫_{−∞}^{a} pX,Y(x, y) dx dy.

Exercise. Show that

pX,Y(x, y) = ∂^2 FX,Y(x, y)/(∂x ∂y).
Example. 1. The joint probability density function of X and Y is given by
pX,Y(x, y) = 2e^{−x} e^{−2y} for 0 < x, y < ∞, and 0 otherwise.

Compute
(a) P ((X > 1) ∩ (Y < 1)),

P((X > 1) ∩ (Y < 1)) = ∫_0^1 ∫_1^∞ 2e^{−x} e^{−2y} dx dy
= (∫_0^1 2e^{−2y} dy)(∫_1^∞ e^{−x} dx)
= (1 − e^{−2})(e^{−1}).

(Observe that the random variables X and Y are independent, see section
4.7.2.)
(b) P (X < Y ),

P(X < Y) = ∫∫_{0<x<y<∞} pX,Y dA
= ∫_0^∞ ∫_0^y 2e^{−x} e^{−2y} dx dy
= ∫_0^∞ 2(1 − e^{−y}) e^{−2y} dy
= ∫_0^∞ 2e^{−2y} dy − ∫_0^∞ 2e^{−3y} dy
= 1 − 2/3 = 1/3.

(c) P (X < a).

P(X < a) = ∫_0^a ∫_0^∞ 2e^{−x} e^{−2y} dy dx
= ∫_0^a e^{−x} dx
= 1 − e^{−a}.

Exercise. Refer to section 3.5.4. Compute the probabilities above using the R
function integrate.

2. Consider a circle of radius R, and suppose that a point within the circle is randomly
chosen in such a manner that all regions within the circle of equal area are equally
likely to contain the point. (In other words, the point is uniformly distributed
within the circle.) If we let the center of the circle denote the origin and define X
and Y to be the coordinates of the point chosen (see picture below),
then, since (X, Y ) is equally likely to be near each point in the circle, it follows that
the joint density function of X and Y is given by

pX,Y(x, y) = c if x^2 + y^2 ≤ R^2, and 0 if x^2 + y^2 > R^2,

for some value c.

(a) Determine c.

By the integrate to one property, we have

1 = ∫∫_{x^2+y^2≤R^2} c dA = c ∫∫_{x^2+y^2≤R^2} dA = c · (Area of circle with radius R) = c(πR^2).

Hence,

c = 1/(πR^2).
(b) Find the marginal probability density functions of X and Y.

The marginal probability density function of X is

pX(x) = ∫_{−∞}^{∞} pX,Y(x, y) dy = ∫_{−√(R^2−x^2)}^{√(R^2−x^2)} 1/(πR^2) dy = 2√(R^2 − x^2)/(πR^2), for x^2 ≤ R^2,

and pX(x) = 0 for all x^2 > R^2. By symmetry, the marginal probability density function of Y is

pY(y) = 2√(R^2 − y^2)/(πR^2) for y^2 ≤ R^2, and 0 for y^2 > R^2.
(c) Compute the probability that D, the distance from the origin of the point
selected, is less than or equal to a.


P(D ≤ a) = P(√(X^2 + Y^2) ≤ a) = P(X^2 + Y^2 ≤ a^2)
= ∫∫_{x^2+y^2≤a^2} 1/(πR^2) dA
= (1/(πR^2)) · (Area of circle of radius a)
= (1/(πR^2))(πa^2) = (a/R)^2.

(d) Find E[D].

From (c), we obtained that the cumulative distribution function of D is

FD(a) = (a/R)^2, for 0 ≤ a ≤ R.

Hence, by taking the derivative with respect to a, we have

pD(a) = 2a/R^2, for 0 ≤ a ≤ R.

Therefore,

E[D] = ∫_0^R a (2a/R^2) da = (2/R^2) ∫_0^R a^2 da = (2/R^2) [a^3/3]_0^R = 2R/3.
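A quick Monte Carlo check of E[D] = 2R/3 (here with R = 1, sampling uniformly in the disc by rejection; the function name is our own):

meandist <- function(nreps){
  x <- runif(nreps, -1, 1); y <- runif(nreps, -1, 1)
  inside <- x^2 + y^2 <= 1                # keep only points inside the unit circle
  mean(sqrt(x[inside]^2 + y[inside]^2))   # should be close to 2/3
}
meandist(1000000)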

4.7.2 Independent Random Variables


Recall that 2 events A and B are said to be independent if

P (A ∩ B) = P (A) · P (B).

Translating this to discrete random variables, two discrete random variables X and
Y are independent if for any a and b in the supports of X and Y respectively,

P ((X = a) ∩ (Y = b)) = P (X = a) · P (Y = b).

In other words, the joint probability density function is the product of the marginal
probability density functions

pX,Y (x, y) = pX (x)pY (y).

This extends verbatim to continuous random variables. We say that X and Y are
dependent otherwise.
Indeed, suppose X and Y are independent random variables with joint probability density function pX,Y(x, y) = pX(x)pY(y). Then for any subsets A and B of the supports of X and Y,

P((X ∈ A) ∩ (Y ∈ B)) = Σ_{x∈A} Σ_{y∈B} pX,Y(x, y)
= Σ_{x∈A} Σ_{y∈B} pX(x)pY(y)
= (Σ_{x∈A} pX(x))(Σ_{y∈B} pY(y))
= P(X ∈ A)P(Y ∈ B)

for X and Y discrete random variables, and

P((X ∈ A) ∩ (Y ∈ B)) = ∫_{x∈A} ∫_{y∈B} pX,Y(x, y) dy dx
= ∫_{x∈A} ∫_{y∈B} pX(x)pY(y) dy dx
= (∫_{x∈A} pX(x) dx)(∫_{y∈B} pY(y) dy)
= P(X ∈ A)P(Y ∈ B)

for X and Y continuous random variables.

In particular, the joint cumulative distribution function is

FX,Y (a, b) = P ((X ≤ a) ∩ (Y ≤ b)) = P (X ≤ a)P (Y ≤ b) = FX (a)FY (b)

the product of the marginal cumulative distribution functions.

A necessary and sufficient condition for the random variables X and Y to be inde-
pendent is for their joint probability density function pX,Y (x, y) to factor into two terms,
one depending only on x and the other depending only on y. Readers may refer to the
appendix for the proof and further details.

Theorem. The continuous (discrete) random variables X and Y are independent if and
only if their joint probability density function can be expressed as

pX,Y (x, y) = f (x)g(y)

for some functions f and g, depending only on x and y, respectively.

A sequence of random variables X1 , X2 , ... is said to be independent and identically


distributed if the random variables are independent and have the same probability density
function (or cumulative distribution function). We say that X1 , X2 , ..., Xn is a random
sample of size n from a population if the random variables are independent and identically
distributed and their common distribution is that of the population.

Example. 1. Suppose that n + m independent Bernoulli trials with probability of


success p are performed. If X is the number of successes in the first n trials, and Y
is the number of successes in the final m trials, then X and Y are independent, since
knowing the number of successes in the first n trials does not affect the distribution
of the number of successes in the final m trials (by the assumption of independent
trials). In fact, X and Y are independent binomial distributions with parameters
(n, p) and (m, p), respectively. Hence,

pX,Y(x, y) = C(n, x) p^x (1 − p)^{n−x} · C(m, y) p^y (1 − p)^{m−y} = pX(x)pY(y).
In contrast, if Z is the random variable denoting the total number of success in the
n + m trials, then X and Z are not independent.
Exercise. Find the joint probability density function pX,Z of X and Z, and prove
that X and Z are dependent.
2. Suppose that the number of people who enter a post office on a given day is a
Poisson random variable with parameter λ. Show that if each person who enters
the post office is a male with probability p and a female with probability 1 − p, then
the number of males and females entering the post office are independent Poisson
random variables with respective parameters λp and λ(1 − p).

Let M and F denote the number of males and females, respectively, that enter the
post office in a given day. Our task is then to show that pM,F = pM pF . Condition
on M + F , which is a Poisson random variable, we have
pM,F (i, j) = P ((M = i) ∩ (F = j))
= P ((M = i) ∩ (F = j) | M + F = i + j)P (M + F = i + j)
+P ((M = i) ∩ (F = j) | M + F ̸= i + j)P (M + F ̸= i + j)
= P ((M = i) ∩ (F = j) | M + F = i + j)P (M + F = i + j)
since P ((M = i) ∩ (F = j) | M + F ̸= i + j) = 0 (it is not possible that M = i and
F = j, but the total M + F = i + j). Since M + F is a Poisson random variable
with parameters λ,
λi+j
P (M + F = i + j) = e−λ .
(i + j)!
Next, given that there are i + j number of people entering the post office and the
number of males entering is a binomial random distribution with probability of
success p, we have
 
i+j i
P ((M = i) ∩ (F = j)|M + F = i + j) = p (1 − p)j .
i
Hence,

pM,F(i, j) = P((M = i) ∩ (F = j) | M + F = i + j)P(M + F = i + j)
= C(i + j, i) p^i (1 − p)^j · e^{−λ} λ^{i+j}/(i + j)!
= e^{−λ} ((i + j)!/(i!j!)) p^i (1 − p)^j λ^{i+j}/(i + j)!
= e^{−λ} (λp)^i (λ(1 − p))^j/(i! j!)
= (e^{−λp} (λp)^i/i!)(e^{−λ(1−p)} (λ(1 − p))^j/j!).
Since the joint probability density function splits into a function of i alone times a function of j alone, we can conclude that the random variables M and F are independent with probability density functions

P(M = i) = e^{−λp} (λp)^i/i!,
P(F = j) = e^{−λ(1−p)} (λ(1 − p))^j/j!,

which are Poisson distributions with parameters λp and λ(1 − p), respectively.

3. Show that the sum of independent Poisson distributions is a Poisson distribution. That is, suppose X1 and X2 are independent Poisson distributions with parameters λ1 and λ2, respectively; then X1 + X2 is a Poisson distribution with parameter λ1 + λ2.

P(X1 + X2 = n) = Σ_{k=0}^{n} P((X1 = k) ∩ (X2 = n − k))
= Σ_{k=0}^{n} P(X1 = k)P(X2 = n − k)   (since X1 and X2 are independent)
= Σ_{k=0}^{n} (e^{−λ1} λ1^k/k!)(e^{−λ2} λ2^{n−k}/(n − k)!)
= e^{−(λ1+λ2)} Σ_{k=0}^{n} λ1^k λ2^{n−k}/(k!(n − k)!)
= e^{−(λ1+λ2)} (1/n!) Σ_{k=0}^{n} (n!/(k!(n − k)!)) λ1^k λ2^{n−k}
= e^{−(λ1+λ2)} (λ1 + λ2)^n/n!,

where the last equality follows from the binomial expansion

(λ1 + λ2)^n = Σ_{k=0}^{n} (n!/(k!(n − k)!)) λ1^k λ2^{n−k}.

4. A man and a woman decide to meet at a certain location. If each of them inde-
pendently arrives at a time uniformly distributed between 12 noon and 1 p.m., find
the probability that the first to arrive has to wait longer than 10 minutes.

Let M and W denote, respectively, the time in minutes past 12 that the man and woman arrive. Then M and W are independent uniform distributions over the interval (0, 60). The desired probability is

P((M + 10 < W) ∪ (W + 10 < M)) = P(M + 10 < W) + P(W + 10 < M)
= ∫∫_{m+10<w, 0<m<60} pM,W(m, w) dA + ∫∫_{w+10<m, 0<w<60} pM,W(m, w) dA
= ∫_{10}^{60} ∫_0^{w−10} (1/60^2) dm dw + ∫_{10}^{60} ∫_0^{m−10} (1/60^2) dw dm
= (2/60^2) ∫_{10}^{60} (w − 10) dw
= (2/60^2) [w^2/2 − 10w]_{10}^{60}
= 25/36.
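A simulation sketch of this probability (the function name and replication count are our own choices):

meetwait <- function(nreps){
  m <- runif(nreps, 0, 60)
  w <- runif(nreps, 0, 60)
  mean(abs(m - w) > 10)   # P(first to arrive waits more than 10 minutes); exact value 25/36
}
meetwait(1000000)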

4.7.3 Conditional Random Variables


Recall that the conditional probability of event A given B is defined to be

P(A|B) = P(A ∩ B)/P(B).

Now suppose X and Y are two discrete random variables with joint probability density function pX,Y, and a and b are in the supports of X and Y respectively. Define the conditional probability density function of X given that Y = b to be

pX|Y(x|b) = P(X = x | Y = b) = pX,Y(x, b)/pY(b).

Remark. Note that since b is in the support of Y , pY (b) > 0, hence, the function above
is well-defined.

The conditional cumulative distribution function of X given that Y = b, for b in the support of Y, is defined to be

FX|Y(a|b) = P(X ≤ a | Y = b) = Σ_{x≤a} pX|Y(x|b).

If X and Y are independent, then

pX|Y(x|b) = pX,Y(x, b)/pY(b) = pX(x)pY(b)/pY(b) = pX(x),

that is, the conditional probability of X given Y = b is just the probability of X.

Example. If X and Y are independent Poisson random variables with parameters λ1 and λ2, respectively, calculate the conditional probability of X given that X + Y = n.

P(X = k | X + Y = n) = P((X = k) ∩ (X + Y = n))/P(X + Y = n)
= P((X = k) ∩ (Y = n − k))/P(X + Y = n)
= P(X = k)P(Y = n − k)/P(X + Y = n)
= (e^{−λ1} λ1^k/k!)(e^{−λ2} λ2^{n−k}/(n − k)!)(e^{−(λ1+λ2)} (λ1 + λ2)^n/n!)^{−1}
= (λ1^k λ2^{n−k}/(λ1 + λ2)^n)(n!/(k!(n − k)!))
= C(n, k) (λ1/(λ1 + λ2))^k (λ2/(λ1 + λ2))^{n−k},

where the third equality follows from the fact that X and Y are independent, and the fourth equality follows from the fact that the sum of independent Poisson distributions is a Poisson distribution (see section 4.7.2).

Observe that this is a binomial distribution with parameters (n, λ1 /(λ1 + λ2 )).

The definition of the conditional probability density function for continuous random variables is analogous. Suppose X and Y are two continuous random variables with joint probability density function pX,Y. The conditional probability density function of X given Y = y is

pX|Y(x|y) = pX,Y(x, y)/pY(y),

provided pY(y) > 0. That is, for any set of real numbers A,

P(X ∈ A | Y = y) = ∫_A pX|Y(x|y) dx.

In particular, by letting A = (−∞, a), the conditional cumulative distribution function of X given Y = y is

FX|Y(a|y) = ∫_{−∞}^{a} pX|Y(x|y) dx.

Example. 1. The joint probability density function of X and Y is given by

pX,Y(x, y) = (12/5) x(2 − x − y) for 0 < x, y < 1, and 0 otherwise.

Compute the conditional probability density of X given that Y = y, for 0 < y < 1. For 0 < x < 1,

pX|Y(x|y) = pX,Y(x, y)/∫_{−∞}^{∞} pX,Y(x, y) dx
= x(2 − x − y)/∫_0^1 x(2 − x − y) dx
= x(2 − x − y)/[x^2 − x^3/3 − yx^2/2]_0^1
= x(2 − x − y)/(2/3 − y/2)
= 6x(2 − x − y)/(4 − 3y).

2. Suppose the joint density of X and Y is given by

pX,Y(x, y) = e^{−x/y} e^{−y}/y for 0 < x, y < ∞, and 0 otherwise.

Find P(X > 1 | Y = y).

The conditional probability density function of X given Y = y is

pX|Y(x|y) = pX,Y(x, y)/∫_{−∞}^{∞} pX,Y(x, y) dx
= (e^{−x/y} e^{−y}/y)/(e^{−y} ∫_0^∞ (1/y) e^{−x/y} dx)
= (e^{−x/y}/y)/([−e^{−x/y}]_0^∞)
= e^{−x/y}/y,

for 0 < x, y < ∞, and 0 otherwise. Hence,

P(X > 1 | Y = y) = ∫_1^∞ pX|Y(x|y) dx
= ∫_1^∞ (e^{−x/y}/y) dx
= [−e^{−x/y}]_1^∞
= e^{−1/y}.

4.7.4 Expected Values, Covariance, and Correlation


Expected value
Suppose X and Y are random variables with joint probability distribution function
pX,Y (x, y). Then for any function g(x, y) of two variables,
• if X and Y are discrete,

E[g(X, Y)] = Σ_x Σ_y g(x, y) pX,Y(x, y),

• if X and Y are continuous,

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) pX,Y(x, y) dx dy.

Readers may refer to the appendix for the derivations. This shows that

E[aX + bY ] = aE[X] + bE[Y ],

which proves the joint linearity property of expected values.

Conditional expectation
Let pX|Y (x|y) be the conditional probability density function of X, given that Y = y.
Define the conditional expectation of X given that Y = y as
E[X|Y = y] = Σ_x x pX|Y(x|y) (discrete),
E[X|Y = y] = ∫_{−∞}^{∞} x pX|Y(x|y) dx (continuous).

Example. Suppose the joint density of X and Y is given by

pX,Y(x, y) = e^{−x/y} e^{−y}/y for 0 < x, y < ∞, and 0 otherwise.

(a) Compute E[XY ].

E[XY] = ∫_0^∞ ∫_0^∞ xy (e^{−x/y} e^{−y}/y) dx dy
= ∫_0^∞ e^{−y} (∫_0^∞ x e^{−x/y} dx) dy
= ∫_0^∞ e^{−y} ([−xy e^{−x/y}]_{x=0}^{x=∞} + ∫_0^∞ y e^{−x/y} dx) dy
= ∫_0^∞ e^{−y} (−y^2 [e^{−x/y}]_{x=0}^{x=∞}) dy
= ∫_0^∞ y^2 e^{−y} dy
= [−y^2 e^{−y}]_{y=0}^{y=∞} + ∫_0^∞ 2y e^{−y} dy
= [−2y e^{−y}]_{y=0}^{y=∞} + ∫_0^∞ 2e^{−y} dy
= [−2e^{−y}]_{y=0}^{y=∞} = 2.
integrate(function(x){
sapply(x,function(x){
integrate(function(y)x*exp(-x/y-y),0,Inf)$value
})
},0,Inf)$value
[1] 1.999994

(b) Compute the conditional expected value of X given Y = y.

Recall from the previous section that the conditional probability density function of X given Y = y is pX|Y(x|y) = e^{−x/y}/y for 0 < x, y < ∞, and 0 otherwise. Hence, the conditional expected value of X given Y = y is

E[X|Y = y] = ∫_{−∞}^{∞} x pX|Y(x|y) dx = ∫_0^∞ x (e^{−x/y}/y) dx
= [−x e^{−x/y}]_0^∞ + ∫_0^∞ e^{−x/y} dx = [−y e^{−x/y}]_0^∞
= y,

for 0 < y < ∞, where we used integration by parts, with u = x and v = −e^{−x/y} (so that dv/dx = e^{−x/y}/y).

Exgiveny <- function(y)integrate(function(x)x*exp(-x/y)/y,0,Inf)$value

Let X1, ..., Xn be independent and identically distributed random variables having distribution F and expected value µ. Such a sequence of random variables is said to constitute a sample from the distribution F. The quantity

X̄ = Σ_{i=1}^{n} Xi/n

is called the sample mean. Let us now compute E[X̄].

By the joint linearity property of expected values,

E[X̄] = E[Σ_{i=1}^{n} Xi/n] = (1/n) Σ_{i=1}^{n} E[Xi] = (1/n)(nµ) = µ.
That is, the expected value of the sample mean is the actual mean of the distribution.
So, when the mean µ of a distribution is unknown, we can use the sample mean to estimate the actual mean. For example, suppose we want to know the average PSLE score of all NUS students. We can estimate it by, say, surveying 500 students. Here Xi is the PSLE score of the i-th student in our sample, and we use X̄ to estimate µ, as illustrated in the sketch below.
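A minimal simulation of such a survey (all numbers below are made up for the sketch; they are not real score data):

set.seed(1)
population <- rnorm(30000, mean = 200, sd = 20)  # hypothetical scores of all students
sample500 <- sample(population, 500)             # survey 500 students
c(sample.mean = mean(sample500), population.mean = mean(population))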
Expected value of random vectors
A random vector X = (X1 , X2 , ..., Xn )T is a vector whose coordinates are random vari-
ables. The mean vector of a random vector X is the vector

E[X] = (E[X1], E[X2], ..., E[Xn])^T.

Lemma (Properties of the mean vector of random vectors).

(i) (Linearity) For any random vectors X and Y and real numbers a, b ∈ R,
E[aX + bY] = aE[X] + bE[Y].

(ii) For a real square matrix A of order n and an n-dimensional random vector X = (X1, X2, ..., Xn)^T,
E[AX] = AE[X].
Covariance
Theorem (Expected value of product of functions on independent random variables). If
X and Y are independent random variables, then for any functions f and g,
E[f (X)g(Y )] = E[f (X)]E[g(Y )].
The covariance between X and Y , denoted by Cov[X, Y ], is defined by
Cov[X, Y ] = E [(X − E[X]) (Y − E[Y ])] .
Equivalently,
Cov[X, Y ] = E [(X − E[X]) (Y − E[Y ])]
= E[XY − XE[Y] − Y E[X] + E[X]E[Y]]
= E[XY ] − 2E[X]E[Y ] + E[X]E[Y ]
= E[XY ] − E[X]E[Y ].
Suppose that typically when X is larger than its mean, Y is also larger than its mean,
and vice versa for below-mean values. Then (X − E[X])(Y − E[Y ]) will usually be posi-
tive, and hence their covariance is positive. Similarly, if X is often smaller than its mean
whenever Y is larger than its mean, the covariance between them will be negative. All
of this is roughly speaking, of course, since it depends on how much and how often X is
larger or smaller than its mean, etc.

Observe that
Cov[X, X] = V ar[X].
By the theorem above, it is clear that if X and Y are independent, then Cov[X, Y ] = 0.
But the converse is not true, that is, it is possible for Cov[X, Y ] = 0, but X and Y are
dependent. For example, suppose X is a random variable such that
P(X = 0) = P(X = 1) = P(X = −1) = 1/3,

and Y = 0 if X ≠ 0, and Y = 1 if X = 0.
Then XY = 0, and so E[XY ] = 0. Also, E[X] = 0. Thus
Cov[X, Y ] = E[XY ] − E[X]E[Y ] = 0.
However, it is clear from construction that X and Y are not independent.
Theorem (Properties of covariance).
(i) (Symmetry) Cov[X, Y ] = Cov[Y, X]
(ii) (Additive constant) Cov[X + a, Y ] = Cov[X, Y ].
(iii) (Linearity) Cov[Σ_{i=1}^{n} ai Xi, Σ_{j=1}^{m} bj Yj] = Σ_{i=1}^{n} Σ_{j=1}^{m} ai bj Cov[Xi, Yj].

Hence, it follows that

Var[Σ_{i=1}^{n} Xi] = Cov[Σ_{i=1}^{n} Xi, Σ_{j=1}^{n} Xj]
= Σ_{i=1}^{n} Σ_{j=1}^{n} Cov[Xi, Xj]
= Σ_{i=1}^{n} Var[Xi] + Σ_{i≠j} Cov[Xi, Xj],

and by the symmetry property of covariance, we have

Var[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} Var[Xi] + 2 Σ_{i<j} Cov[Xi, Xj].

We will write this in full for n = 3.


V ar[X1 + X2 + X3 ] = V ar[X1 ] + V ar[X2 ] + V ar[X3 ]
+2Cov[X1 , X2 ] + 2Cov[X1 , X3 ] + 2Cov[X2 , X3 ].
Finally, if X1, ..., Xn are pairwise independent, that is, Xi and Xj are independent for all i ≠ j, then

Var[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} Var[Xi].
Let X1, X2, ..., Xn be independent and identically distributed random variables having expected value µ and variance σ^2. Let X̄ = Σ_{i=1}^{n} Xi/n be the sample mean. The quantities

Xi − X̄, for i = 1, ..., n,

are called deviations, as they equal the differences between the individual data and the sample mean. The random variable

S^2 = Σ_{i=1}^{n} (Xi − X̄)^2/(n − 1)

is called the sample variance. We will compute (a) Var[X̄] and (b) E[S^2].
(a) Since X1, ..., Xn are independent, they are pairwise independent too. Hence,

Var[X̄] = Var[Σ_{i=1}^{n} Xi/n] = (1/n^2) Σ_{i=1}^{n} Var[Xi] = (1/n^2)(nσ^2) = σ^2/n.

This shows that if the sample size is large enough, the variance of the sample mean is small, and hence, the sample mean is a good estimate of the actual mean.
(b)

E[S^2] = E[(1/(n − 1)) Σ_{i=1}^{n} (Xi − µ + µ − X̄)^2]
= E[(1/(n − 1)) Σ_{i=1}^{n} (Xi − µ)^2 + (1/(n − 1)) Σ_{i=1}^{n} (X̄ − µ)^2 − 2(X̄ − µ)(1/(n − 1)) Σ_{i=1}^{n} (Xi − µ)]
= E[(1/(n − 1)) Σ_{i=1}^{n} (Xi − µ)^2 + (n/(n − 1))(X̄ − µ)^2 − 2(X̄ − µ)(n/(n − 1))(X̄ − µ)]
= E[(1/(n − 1)) Σ_{i=1}^{n} (Xi − µ)^2 − (n/(n − 1))(X̄ − µ)^2]
= (1/(n − 1)) Σ_{i=1}^{n} E[(Xi − µ)^2] − (n/(n − 1)) E[(X̄ − µ)^2]
= (1/(n − 1)) Σ_{i=1}^{n} Var[Xi] − (n/(n − 1)) Var[X̄]
= (n/(n − 1)) σ^2 − (n/(n − 1)) (σ^2/n)
= σ^2,

where we used the definition of the sample mean X̄ = Σ_{i=1}^{n} Xi/n (so that Σ_{i=1}^{n} (Xi − µ) = n(X̄ − µ)), and the facts that the means of Xi, for all i = 1, ..., n, and of X̄ are µ, the variance of each Xi is σ^2, and Var[X̄] = σ^2/n.

This shows that the expected value of the sample variance is the distribution variance, that is, the sample variance can be used to estimate the distribution variance.
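The fact that E[S^2] = σ^2 is the reason the R function var divides by n − 1. A small sketch (our own choice of population and replication counts: an exponential population with rate 1, so the true variance is 1) checks this by averaging many sample variances:

set.seed(2)
sampvars <- replicate(10000, var(rexp(5, 1)))   # sample variances of samples of size 5
mean(sampvars)                                  # close to the true variance 1/lambda^2 = 1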

Covariance matrices
Let X be a random vector. The covariance matrix of X = (X1 , X2 , ..., Xn )T is

Cov[X] = (Cov[Xi , Xj ]i,j ),

that is, it is an order n square matrix whose (i, j)-entry is Cov[Xi , Xj ].


Lemma. Let X = (X1 , X2 , ..., Xn )T be a random vector. The covariance matrix of X
can be written as
Cov[X] = E[(X − µ)(X − µ)T ],
where µ is the mean vector.

Correlation
Covariance does measure how much or how little X and Y vary together, but it is hard to decide whether a given value of covariance is "large" or not. For example, if we change the units from meters to centimeters, then by the linearity property of covariance, Cov[X, Y] increases by a factor of 100^2. Thus it makes sense to scale covariance according to the variables' standard deviations. Accordingly, the correlation between two random variables X and Y is defined by

ρ(X, Y) = Cov[X, Y]/(√Var[X] √Var[Y]),

provided Var[X]Var[Y] > 0. So, the correlation is unitless, that is, it does not depend on which units we are using for our variables. Moreover, it is bounded between −1 and 1 (correlation is a kind of normalization of covariance).
Lemma. For random variables X and Y ,

−1 ≤ ρ(X, Y ) ≤ 1,

with ρ(X, Y ) = ±1 if and only if Y = a + bX for some real numbers a, b ∈ R.
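In R, covariance and correlation of data can be computed with cov and cor. A brief sketch (the simulated data are our own) also illustrates that correlation is unitless while covariance is not:

set.seed(3)
x <- rnorm(1000)               # say, lengths in metres
y <- 2*x + rnorm(1000)         # a related quantity
c(cov = cov(x, y), cor = cor(x, y))
c(cov = cov(100*x, y), cor = cor(100*x, y))   # metres to centimetres:
                                              # covariance scales, correlation does not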

4.7.5 Bivariate Normal Distributions


The joint distribution of random variables X and Y is said to be bivariate normal if their joint probability density function is given by

pX,Y(x, y) = (1/(2π σX σY √(1 − ρ^2))) exp{ −(1/(2(1 − ρ^2))) [ ((x − µX)/σX)^2 + ((y − µY)/σY)^2 − 2ρ(x − µX)(y − µY)/(σX σY) ] },

for x, y ∈ R, where ρ = ρ(X, Y) is the correlation.


Lemma. The joint probability distribution function of a bivariate normal distribution
with random variable X = (X, Y )T is given by
1 1 T −1
p(x) = p e− 2 (X−µ) Σ (X−µ) ,
2π det(Σ)
where Σ = Cov[X] is the covariance matrix of X.
The determinant of Σ is

det(Σ) = det[ σX^2, Cov[X, Y]; Cov[Y, X], σY^2 ] = σX^2 σY^2 − Cov[X, Y]^2,

where we use the fact that covariance is symmetric, Cov[X, Y] = Cov[Y, X]. Hence,

√det(Σ) = √(σX^2 σY^2 − Cov[X, Y]^2) = σX σY √(1 − (Cov[X, Y]/(σX σY))^2) = σX σY √(1 − ρ^2).

Next,

Σ^{−1} = [ σX^2, Cov[X, Y]; Cov[Y, X], σY^2 ]^{−1} = (1/(σX^2 σY^2 (1 − ρ^2))) [ σY^2, −Cov[X, Y]; −Cov[Y, X], σX^2 ],

and hence

(x − µ)^T Σ^{−1} (x − µ)
= (1/(σX^2 σY^2 (1 − ρ^2))) (x − µX, y − µY) [ σY^2, −Cov[X, Y]; −Cov[Y, X], σX^2 ] (x − µX, y − µY)^T
= (1/(σX^2 σY^2 (1 − ρ^2))) [ (x − µX)^2 σY^2 + (y − µY)^2 σX^2 − 2Cov[X, Y](x − µX)(y − µY) ]
= (1/(1 − ρ^2)) [ ((x − µX)/σX)^2 + ((y − µY)/σY)^2 − 2ρ(x − µX)(y − µY)/(σX σY) ].

In general, the joint probability density function of an n-variate normal distribution with random vector X = (X1, X2, ..., Xn)^T is defined analogously,

p(x) = (1/((2π)^{n/2} √det(Σ))) exp{ −(1/2)(x − µ)^T Σ^{−1} (x − µ) },

where Σ = Cov[X] is the covariance matrix of X.

Suppose X and Y have a bivariate normal distribution. Then

(i) The marginal distribution of X is a normal distribution with parameters (µX , σX ).

(ii) The marginal distribution of Y is a normal distribution with parameters (µY , σY ).

(iii) The conditional density of X given Y = y is a normal distribution with mean µ_{X|y} = E[X|y] = µX + ρ(σX/σY)(y − µY) and variance σ_{X|y}^2 = σX^2 (1 − ρ^2).

The R functions for the multivariate normal distribution can be found in the package mvtnorm. The multivariate normal joint probability density function in R is dmvnorm. Here are some of the arguments we will need,

dmvnorm(x, mean, sigma)

• x is the vector or matrix of quantiles. If x is a matrix, each row is


taken to be a quantile.

• mean is the mean vector.

• sigma is the covariance matrix.

To generate n trials of a multivariate normal distribution, use the function rmvnorm,


rmvnorm(n, mean , sigma)
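A minimal sketch of these functions (the mean vector and covariance matrix below are our own choices):

library(mvtnorm)
mu <- c(0, 1)
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)    # a covariance matrix
dmvnorm(c(0, 0), mean = mu, sigma = Sigma)  # joint density at the point (0, 0)
x <- rmvnorm(1000, mean = mu, sigma = Sigma)
colMeans(x)                                 # approximately (0, 1)
cov(x)                                      # approximately Sigma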
4.7.6 Central Limit Theorem
The central limit theorem is one of the most remarkable results in probability theory.
Loosely put, it states that the sum of a large number of independent random variables
has a distribution that is approximately normal. Hence, it not only provides a simple
method for computing approximate probabilities for sums of independent random vari-
ables, but also helps explain the remarkable fact that the empirical frequencies of so many
natural populations exhibit bell-shaped (that is, normal) curves.

The univariate central limit theorem is as follows.

Theorem (Univariate central limit theorem). Let X1 , X2 , ..., Xn , ... be a sequence of inde-
pendent and identically distributed random variables with common mean µ and variance
σ 2 . Then for large n, the new random variable X1 + X2 + · · · + Xn is approximately
normal with mean nµ and variance nσ^2. In other words, the distribution of

(X1 + X2 + · · · + Xn − nµ)/(√n σ)

tends to the standard normal distribution as n → ∞,

P((X1 + X2 + · · · + Xn − nµ)/(√n σ) ≤ a) → (1/√(2π)) ∫_{−∞}^{a} e^{−x^2/2} dx.

Example. 1. Binomially distributed random variables, though discrete, also are ap-
proximately normally distributed. For example, let’s find the approximate proba-
bility of getting more than 60 heads in 100 tosses of a coin.
> 1-pbinom(60,100,1/2)
[1] 0.0176001

Recall that each toss Xi is a Bernoulli trial with probability of success p, so its mean is p and its variance is p(1 − p). By the central limit theorem, the new random variable X1 + X2 + · · · + X100, where Xi is the outcome of the i-th toss, is approximately normal with mean 100(0.5) = 50 and variance 100(0.5)(0.5) = 5^2.
> 1-pnorm(60,50,5)
[1] 0.02275013
That doesn’t seem very accurate. The problem is, do we treat the problem as
P (X > 60) or P (X ≥ 61)? Let’s try P (X ≥ 61).
> 1-pnorm(61,50,5)
[1] 0.01390345
So, P (X > 60) is too big and P (X ≥ 61) is too small, which tells us that the
answer is somewhere in between.
> 1-pnorm(60.5,50,5)
[1] 0.01786442
Now this probability is close to the actual one. This is known as correction for continuity.

2. An astronomer is interested in measuring the distance, in light-years, from his ob-


servatory to a distant star. Although the astronomer has a measuring technique,
he knows that because of changing atmospheric conditions and normal error, each
time a measurement is made, it will not yield the exact distance, but merely an
estimate. As a result, the astronomer plans to make a series of measurements and
then use the average value of these measurements as his estimated value of the
actual distance. If the astronomer believes that the values of the measurements are
independent and identically distributed random variables having a common mean d
(the actual distance) and a common variance of 4 (light-years), how many measure-
ments need he make to be reasonably sure that his estimated distance is accurate
to within ±0.5 light-year?

Suppose X1, X2, ..., Xn are n measurements; then, from the central limit theorem, it follows that

Zn = (Σ_{i=1}^{n} Xi − nd)/(2√n)

has approximately a standard normal distribution. The average of the n readings is Σ_{i=1}^{n} Xi/n. The task is then to find the probability that the difference between the average and d is within ±0.5 light-year.

P(−0.5 ≤ Σ_{i=1}^{n} Xi/n − d ≤ 0.5) = P(−0.5(√n/2) ≤ Zn ≤ 0.5(√n/2))
≃ Φ(√n/4) − Φ(−√n/4).

By symmetry, Φ(−√n/4) = 1 − Φ(√n/4). Hence, the probability is 2Φ(√n/4) − 1.
Suppose the astronomer wants to be 95% certain that his estimated value is accurate to within ±0.5 light-year. Then we need

2Φ(√n/4) − 1 ≥ 0.95, or Φ(√n/4) ≥ 0.975.

> qnorm(0.975)
[1] 1.959964

Hence,

√n/4 ≥ 1.96 ⇒ n ≥ 61.5,

that is, he needs to take 62 readings.
that is, he needs to take 62 readings.
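The central limit theorem can also be seen by simulation. The following sketch (our own choice of distribution and sample size) standardizes sums of uniform random variables and compares their histogram with the standard normal density:

set.seed(4)
n <- 30
sums <- replicate(10000, sum(runif(n)))   # each uniform has mean 1/2, variance 1/12
z <- (sums - n*1/2)/sqrt(n*1/12)          # standardize
hist(z, breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE)               # close to the standard normal density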

Central limit theorems also exist when the Xi are independent, but not necessarily
identically distributed random variables. One version, by no means the most general, is
as follows.

Theorem (Central limit theorem for independent random variables). Let X1, X2, ..., Xn, ... be a sequence of independent random variables having respective means µi = E[Xi] and variances σi^2 = Var[Xi]. If

(i) the Xi are uniformly bounded, that is, there is an M such that P(|Xi| < M) = 1 for all i, and

(ii) Σ_{i=1}^{∞} σi^2 = ∞,

then

P( Σ_{i=1}^{n} (Xi − µi)/√(Σ_{i=1}^{n} σi^2) ≤ a ) → Φ(a), as n → ∞.

Finally, we have the central limit theorem for multivariate independent identically
distributed random vectors.

Theorem (Multivariate central limit theorem). Suppose X1, X2, ..., Xn, ... are independent random vectors, all having the same distribution, which has mean vector µ and covariance matrix Σ. Then for large n, the new random vector X1 + X2 + · · · + Xn is approximately multivariate normal with mean nµ and covariance matrix nΣ. That is,

P(X1 + X2 + · · · + Xn ≤ a = (a1, a2, ..., an))
→ ∫_{−∞}^{a_n} · · · ∫_{−∞}^{a_2} ∫_{−∞}^{a_1} (1/((2π)^{n/2} √det(nΣ))) exp{ −(1/2)(x − nµ)^T (nΣ)^{−1} (x − nµ) } dx_1 dx_2 · · · dx_n,

as n → ∞.
4.8 Appendix for Chapter 4
4.8.1 Discrete Random Variable
Theorem (Expectation of a function of a random variable). If X is a discrete random
variable with support {x1 , x2 , ..., xi , ...} and probability density function p(x), then for any
real-valued function g,

E[g(X)] = Σ_i g(xi) p(xi).

Proof. In order to compute E[g(X)], we need to know the probability density function of Y = g(X), that is, we need to find a function h(y), for y ∈ {y1 = g(x1), y2 = g(x2), ...}, such that h(yi) = P(Y = g(xi)).

To this end, we group all the terms in Σ_i g(xi)p(xi) having the same value of g(xi); that is, we let yj be a fixed value, and let {xj1, xj2, ...} be all the xi such that g(xji) = yj (so we are not using the same notation as above). Then

Σ_i g(xi)p(xi) = Σ_j Σ_i g(xji)p(xji)
= Σ_j Σ_i yj p(xji)
= Σ_j yj Σ_i p(xji)
= Σ_j yj P(g(X) = yj)
= E[g(X)],

where the fourth equality follows from

P(g(X) = yj) = Σ_i P(X = xji) = Σ_i p(xji).

Theorem (Linearity of Expected Value). If a and b are constants, then

E[aX + b] = aE[X] + b.

Proof. Let g(X) = aX + b. Then by the theorem on the expected value of a function of a random variable,

E[aX + b] = Σ_{p(x)>0} (ax + b)p(x)
= a Σ_{p(x)>0} x p(x) + b Σ_{p(x)>0} p(x)
= aE[X] + b,

where in the last equality, we use the fact that Σ_{p(x)>0} p(x) = 1.

By letting a = 0 and b = a in the theorem, we arrive at the following corollary.

Corollary. For a constant a,
E[a] = a.
Lemma. Let X be a discrete random variable. Then the variance is given by

Var[X] = E[X^2] − E[X]^2.

From the theorem on the expected value of a function of a random variable, with g(X) = (X − µ)^2,

Var[X] = E[(X − µ)^2] = Σ_i (xi − µ)^2 p(xi)
= Σ_i xi^2 p(xi) − 2µ Σ_i xi p(xi) + µ^2 Σ_i p(xi)
= E[X^2] − 2E[X]^2 + E[X]^2
= E[X^2] − E[X]^2,

where we use g(X) = X^2 for the first term, and Σ_i p(xi) = 1 for the last term, in the second last equality.
Lemma (Markov inequality). If X is a random variable and g(x) ≥ 0 is a nonnegative function, then for any positive d > 0,

P(g(X) ≥ d) ≤ E[g(X)]/d.

Proof. Let I be the indicator random variable for the event {g(X) ≥ d},

I(g(X)) = 1 if g(X) ≥ d, and 0 otherwise.

Then since g(X) ≥ 0 and I(g(X)) ≤ 1,

g(X) ≥ dI.

Hence,

E[g(X)] ≥ E[dI] = dE[I] = dP(g(X) ≥ d),

which is what we want.
We will just prove (i) of Chebychev's inequality. The others can be derived from (i).

Theorem (Chebychev's inequality). For a random variable X with mean µX and variance σX^2,

P(|X − µX| ≥ ασX) ≤ 1/α^2

for any positive real number α > 0.

Proof. Let g(X) = (X − µX)^2 and d = α^2 σX^2. By the Markov inequality,

P((X − µX)^2 ≥ α^2 σX^2) ≤ E[(X − µX)^2]/(α^2 σX^2).

By taking square roots inside the event on the left-hand side, it is equal to P(|X − µX| ≥ ασX). Observe that the numerator on the right-hand side is just the variance, σX^2. Hence, the right-hand side reduces to 1/α^2, which is what we want.
4.8.2 Expected Value and Variance of Geometric Distributions
Lemma (Geometric series). For any x ≠ 1 and nonnegative integers m < n,

Σ_{k=m}^{n} x^k = x^m (1 − x^{n−m+1})/(1 − x).

Proof. We first show that

Σ_{k=0}^{n} x^k = (1 − x^{n+1})/(1 − x).

Let S(n) = Σ_{k=0}^{n} x^k. Then

S(n)(1 − x) = S(n) − xS(n) = Σ_{k=0}^{n} x^k − Σ_{k=0}^{n} x^{k+1} = 1 + Σ_{k=1}^{n} x^k − Σ_{k=1}^{n} x^k − x^{n+1} = 1 − x^{n+1},

where in the third equality, we remove the first term from S(n) and the last term from xS(n). Hence,

S(n) = (1 − x^{n+1})/(1 − x).

So,

Σ_{k=m}^{n} x^k = x^m Σ_{k=m}^{n} x^{k−m} = x^m Σ_{k=0}^{n−m} x^k = x^m (1 − x^{n−m+1})/(1 − x).

Letting m = 0 and n → ∞ in the equality in the lemma, for |x| < 1,

Σ_{k=0}^{∞} x^k = 1/(1 − x).

Letting m = 0, differentiating both sides of the equality in the lemma with respect to x, then letting n → ∞, for |x| < 1,

Σ_{k=1}^{∞} k x^{k−1} = 1/(1 − x)^2.

Letting m = 0, taking the second derivative of both sides of the equality in the lemma with respect to x, then letting n → ∞, for |x| < 1,

2/(1 − x)^3 = Σ_{k=2}^{∞} k(k − 1) x^{k−2} = Σ_{k=1}^{∞} k(k + 1) x^{k−1}
= Σ_{k=1}^{∞} k^2 x^{k−1} + Σ_{k=1}^{∞} k x^{k−1}
= Σ_{k=1}^{∞} k^2 x^{k−1} + 1/(1 − x)^2,

and hence,

Σ_{k=1}^{∞} k^2 x^{k−1} = 2/(1 − x)^3 − 1/(1 − x)^2 = (x + 1)/(1 − x)^3.
Theorem (Expectation and variance of geometric distributions). Let X be a geometric distribution with parameter p. Then the expectation and variance of X are

E[X] = 1/p,
Var[X] = (1 − p)/p^2.

Proof. By the derivations above,

E[X] = Σ_{k=1}^{∞} k(1 − p)^{k−1} p = p Σ_{k=1}^{∞} k(1 − p)^{k−1} = p · 1/(1 − (1 − p))^2 = 1/p,

and

E[X^2] = Σ_{k=1}^{∞} k^2 (1 − p)^{k−1} p = p Σ_{k=1}^{∞} k^2 (1 − p)^{k−1} = p · ((1 − p) + 1)/(1 − (1 − p))^3 = (2 − p)/p^2.

Hence,

Var[X] = E[X^2] − E[X]^2 = (2 − p)/p^2 − (1/p)^2 = (1 − p)/p^2.

4.8.3 Expected Value and Variance of a Poisson Distribution


Recall the Maclaurin series expansion for e^x,

e^x = Σ_{k=0}^{∞} x^k/k!.

Taking the derivative with respect to x on both sides, we have

e^x = Σ_{k=1}^{∞} k x^{k−1}/k!.

Hence, the expectation of a Poisson distribution with parameter λ is

E[X] = Σ_{k=1}^{∞} k e^{−λ} λ^k/k! = λe^{−λ} Σ_{k=1}^{∞} k λ^{k−1}/k! = λe^{−λ} e^λ = λ.

Taking the second derivative with respect to x on both sides of the Maclaurin series expansion of e^x, we have

e^x = Σ_{k=2}^{∞} k(k − 1) x^{k−2}/k! = Σ_{k=2}^{∞} k^2 x^{k−2}/k! − Σ_{k=2}^{∞} k x^{k−2}/k!
= (1/x)(Σ_{k=1}^{∞} k^2 x^{k−1}/k! − 1 − Σ_{k=1}^{∞} k x^{k−1}/k! + 1)
= (1/x)(Σ_{k=1}^{∞} k^2 x^{k−1}/k! − e^x),

and so

Σ_{k=1}^{∞} k^2 x^{k−1}/k! = (x + 1)e^x.

Therefore, the variance of a Poisson distribution with parameter λ is

Var[X] = E[X^2] − E[X]^2 = Σ_{k=1}^{∞} k^2 e^{−λ} λ^k/k! − λ^2
= λe^{−λ} Σ_{k=1}^{∞} k^2 λ^{k−1}/k! − λ^2
= λe^{−λ} (λ + 1)e^λ − λ^2
= λ.

4.8.4 Expected Value and Variance of Continuous Random Variables

Just as in the case of discrete random variables, in order to compute the expected value E[g(X)] of a function of a random variable X, we need to first find the probability density function of the random variable g(X).

Lemma. For a nonnegative random variable Y,

E[Y] = ∫_0^∞ P(Y > y) dy.

Proof. Let pY(x) be the probability density function of Y. Then

∫_0^∞ P(Y > y) dy = ∫_0^∞ ∫_y^∞ pY(x) dx dy
= ∫_0^∞ (∫_0^x dy) pY(x) dx
= ∫_0^∞ x pY(x) dx
= E[Y],

where we use P(Y > y) = ∫_y^∞ pY(x) dx in the first equality, and a change of integration order (that is, we change from viewing the domain as being x-simple to being y-simple) in the second equality.

Theorem. Let Y be a continuous random variable. Then

E[Y] = ∫_0^∞ P(Y > y) dy − ∫_0^∞ P(Y < −y) dy.

Proof. Let pY(x) be the probability density function of Y. Then, using the same derivations as in the lemma, we have

∫_0^∞ P(Y < −y) dy = ∫_0^∞ P(−Y > y) dy = ∫_0^∞ t pY(−t) dt = −∫_{−∞}^{0} x pY(x) dx,

where in the last equality we did a change of variable t = −x. Similarly,

∫_0^∞ P(Y > y) dy = ∫_0^∞ x pY(x) dx.

Hence,

∫_0^∞ P(Y > y) dy − ∫_0^∞ P(Y < −y) dy = ∫_0^∞ x pY(x) dx + ∫_{−∞}^{0} x pY(x) dx
= ∫_{−∞}^{∞} x pY(x) dx
= E[Y].

Theorem (Expected value of a function of a random variable). Let X be a continuous random variable with probability density function p(x). Then
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)p(x)\,dx.$$

Proof. Letting Y = g(X) in the previous theorem, we have
$$E[g(X)] = \int_0^{\infty} P(g(X) > y)\,dy - \int_0^{\infty} P(g(X) < -y)\,dy = \int_0^{\infty}\int_{g(x)>y} p(x)\,dx\,dy - \int_0^{\infty}\int_{g(x)<-y} p(x)\,dx\,dy$$
$$= \int_{g(x)>0}\left(\int_0^{g(x)} dy\right)p(x)\,dx - \int_{g(x)<0}\left(\int_{g(x)}^{0} dy\right)p(x)\,dx = \int_{g(x)>0} g(x)p(x)\,dx + \int_{g(x)<0} g(x)p(x)\,dx = \int_{-\infty}^{\infty} g(x)p(x)\,dx,$$
where we used the fact that $P(g(X) > y) = \int_{g(x)>y} p(x)\,dx$ and $P(g(X) < -y) = \int_{g(x)<-y} p(x)\,dx$ in the second equality, and a change of integration order in the third equality.
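As an illustration of the theorem in R (the choice g(x) = x² and the standard normal density are arbitrary examples), integrate can evaluate ∫ g(x)p(x) dx numerically:

```r
# Sketch: E[g(X)] as the integral of g(x)*p(x), here with g(x) = x^2
# and X standard normal, so the value should be E[X^2] = 1.
g <- function(x) x^2
integrate(function(x) g(x) * dnorm(x), lower = -Inf, upper = Inf)   # approximately 1
```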

4.8.5 Normal Distributions


The probability density function of a normal distribution is
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \text{for all } x \in \mathbb{R}.$$
Let us show that p(x) is indeed a probability density function. It is clear that p(x) ≥ 0 for all x ∈ ℝ. Hence, it suffices to show that $\int_{\mathbb{R}} p(x)\,dx = 1$. Let t = (x − μ)/σ, or x = σt + μ. Then $\frac{dx}{dt} = \sigma$, and so
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{t^2}{2}}\,dt.$$
Our task is thus reduced to showing that $\int_{-\infty}^{\infty} e^{-\frac{t^2}{2}}\,dt = \sqrt{2\pi}$. Taking its square, we have
$$\left(\int_{-\infty}^{\infty} e^{-\frac{t^2}{2}}\,dt\right)^2 = \left(\int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\,dx\right)\left(\int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\,dy\right) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{x^2+y^2}{2}}\,dx\,dy = \int_0^{2\pi}\int_0^{\infty} e^{-\frac{r^2}{2}}\,r\,dr\,d\theta = 2\pi,$$
where we used the smooth change of variables F(r, θ) = (r cos(θ), r sin(θ)) (see sections 3.5.2 and 3.5.3). Thus, taking the square root on both sides, we obtain
$$\int_{-\infty}^{\infty} e^{-\frac{t^2}{2}}\,dt = \sqrt{2\pi},$$
and the result is proved.

The expected value and variance of the standard normal distribution are
$$E[Z] = 0, \qquad Var[Z] = 1.$$

Proof.
$$E[Z] = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} z\,e^{-\frac{z^2}{2}}\,dz = \frac{1}{\sqrt{2\pi}}\lim_{N\to\infty}\left[-e^{-\frac{z^2}{2}}\right]_{-N}^{N} = \frac{1}{\sqrt{2\pi}}\lim_{N\to\infty}\left(e^{-\frac{(-N)^2}{2}} - e^{-\frac{N^2}{2}}\right) = 0.$$

Then
$$Var[Z] = E[Z^2] = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} z^2 e^{-\frac{z^2}{2}}\,dz,$$
and by integration by parts, using u = z and $v = -e^{-\frac{z^2}{2}}$ (see section 3.7.6), we have
$$Var[Z] = \frac{1}{\sqrt{2\pi}}\lim_{N\to\infty}\left(\left[-z\,e^{-\frac{z^2}{2}}\right]_{-N}^{N} + \int_{\mathbb{R}} e^{-\frac{z^2}{2}}\,dz\right) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac{z^2}{2}}\,dz = 1.$$
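Both claims can be checked numerically in R (an illustrative sketch, not a proof):

```r
# Sketch: numerical checks for the standard normal density.
p <- function(z) exp(-z^2 / 2) / sqrt(2 * pi)
integrate(p, -Inf, Inf)                        # total probability, approximately 1
integrate(function(z) z * p(z), -Inf, Inf)     # E[Z], approximately 0
integrate(function(z) z^2 * p(z), -Inf, Inf)   # E[Z^2] = Var[Z], approximately 1
```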
4.8.6 Memoryless
A nonnegative random variable X is memoryless if for any positive real numbers $t_1, t_2 > 0$,
$$P(X > t_1 + t_2 \mid X > t_2) = P(X > t_1).$$
Recall that $P(A \mid B) = P(A \cap B)/P(B)$. Hence, the equation above is equivalent to
$$P(X > t_1) = \frac{P((X > t_1 + t_2) \cap (X > t_2))}{P(X > t_2)} = \frac{P(X > t_1 + t_2)}{P(X > t_2)}.$$
Therefore, a nonnegative distribution is memoryless if for all positive real numbers $t_1, t_2 > 0$,
$$P(X > t_1 + t_2) = P(X > t_1)\,P(X > t_2).$$
Suppose now X is a memoryless distribution. Let $\hat{F}(x) = P(X > x)$; then we have
$$\hat{F}(t_1 + t_2) = \hat{F}(t_1)\hat{F}(t_2).$$

Lemma. Suppose g(x) is a continuous function satisfying
$$g(s + t) = g(s)g(t),$$
then
$$g(x) = e^{-\lambda x}$$
for some real number λ.

Proof. (i) Note that since
$$g\left(\frac{2}{n}\right) = g\left(\frac{1}{n} + \frac{1}{n}\right) = g\left(\frac{1}{n}\right)^2,$$
by induction, we have
$$g\left(\frac{m}{n}\right) = g\left(\frac{1}{n}\right)^m.$$

(ii) Also,
$$g(1) = g\left(\frac{1}{n} + \frac{1}{n} + \cdots + \frac{1}{n}\right) = g\left(\frac{1}{n}\right)^n,$$
which is equivalent to
$$g\left(\frac{1}{n}\right) = g(1)^{1/n}.$$

(iii) Hence,
$$g\left(\frac{m}{n}\right) = g(1)^{\frac{m}{n}}$$
for all nonnegative integers m, n. By continuity, this means that $g(x) = g(1)^x$ for all nonnegative real numbers x. Since $g(1) = g\left(\frac{1}{2}\right)^2 \geq 0$, letting $\lambda = -\log(g(1))$, we have
$$g(x) = e^{-\lambda x}.$$

Thus, by the lemma,
$$P(X > x) = \hat{F}(x) = e^{-\lambda x}$$
for some λ. Hence,
$$F(x) = 1 - P(X > x) = 1 - e^{-\lambda x},$$
and by taking the derivative with respect to x, the probability density function is
$$p(x) = \frac{d}{dx}F(x) = \lambda e^{-\lambda x}, \quad \text{for } x \geq 0,$$
which is the probability density function of an exponential distribution.
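The memoryless property is easy to see numerically in R with pexp (the rate and the times below are arbitrary choices; lower.tail = FALSE gives P(X > x)):

```r
# Sketch: for an exponential random variable,
# P(X > t1 + t2) / P(X > t2) equals P(X > t1).
rate <- 0.5; t1 <- 2; t2 <- 3
lhs <- pexp(t1 + t2, rate, lower.tail = FALSE) / pexp(t2, rate, lower.tail = FALSE)
rhs <- pexp(t1, rate, lower.tail = FALSE)
c(lhs, rhs)   # both equal exp(-rate * t1)
```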

4.8.7 Gamma Distribution


Define the gamma function, denoted as Γ(α), to be
$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx, \quad \text{for all } \alpha > 0.$$

Lemma (Some properties of the gamma function).

(i) For any positive real number α, Γ(α + 1) = αΓ(α).

(ii) For any positive integer n, Γ(n) = (n − 1)!.

Proof. By integration by parts, letting $u = x^{\alpha}$ and $v = -e^{-x}$,
$$\Gamma(\alpha+1) = \int_0^{\infty} x^{\alpha}e^{-x}\,dx = \lim_{N\to\infty}\left[-x^{\alpha}e^{-x}\right]_0^N + \int_0^{\infty}\alpha x^{\alpha-1}e^{-x}\,dx = \alpha\int_0^{\infty} x^{\alpha-1}e^{-x}\,dx = \alpha\Gamma(\alpha).$$

Next, observe that
$$\Gamma(1) = \int_0^{\infty} e^{-x}\,dx = 1.$$
Hence, if α = n is a positive integer,
$$\Gamma(n) = (n-1)\Gamma(n-1) = (n-1)(n-2)\Gamma(n-2) = \cdots = (n-1)(n-2)\cdots 2\cdot 1\cdot\Gamma(1) = (n-1)!.$$
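Property (ii) can be checked directly with R's built-in gamma and factorial functions (illustrative):

```r
# Sketch: Gamma(n) = (n - 1)! for positive integers n.
n <- 1:8
cbind(gamma(n), factorial(n - 1))   # the two columns agree
```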

Let $T_n$ be the random variable denoting the amount of time one has to wait before a total of n independent Poisson-distributed events with parameter λ have occurred, that is,
$$T_n = X_1 + X_2 + \cdots + X_n,$$
where each $X_i$ is an exponential random variable with parameter λ. Note that $T_n$ is less than or equal to t if and only if the number of events that have occurred by time t is at least n. Let N(t) denote the number of events in the time interval [0, t]. Then
$$F(t) = P(T_n \leq t) = P(N(t) \geq n) = \sum_{k=n}^{\infty} P(N(t) = k) = \sum_{k=n}^{\infty} e^{-\lambda t}\frac{(\lambda t)^k}{k!}.$$
Hence, taking the derivative with respect to t, the probability density function is
$$p(t) = \sum_{k=n}^{\infty} e^{-\lambda t}\frac{\lambda^k k t^{k-1}}{k!} - \sum_{k=n}^{\infty}\lambda e^{-\lambda t}\frac{(\lambda t)^k}{k!} = \sum_{k=n}^{\infty}\lambda e^{-\lambda t}\frac{(\lambda t)^{k-1}}{(k-1)!} - \sum_{k=n}^{\infty}\lambda e^{-\lambda t}\frac{(\lambda t)^k}{k!} = \lambda e^{-\lambda t}\frac{(\lambda t)^{n-1}}{(n-1)!}.$$

This is called an Erlang distribution.

Generalizing this, letting n be any positive real number α > 0, we have to replace (n − 1)! with the gamma function Γ(α), and thus the probability density function is
$$p(t) = \lambda e^{-\lambda t}\frac{(\lambda t)^{\alpha-1}}{\Gamma(\alpha)}, \quad \text{for } t > 0,$$
which is the probability density function for a Gamma distribution with parameters (α, λ).

The expected value and variance are
$$E[X] = \frac{\alpha}{\lambda}, \qquad Var[X] = \frac{\alpha}{\lambda^2}.$$

Proof.
$$E[X] = \int_0^{\infty} t\,\lambda e^{-\lambda t}\frac{(\lambda t)^{\alpha-1}}{\Gamma(\alpha)}\,dt = \frac{1}{\lambda\Gamma(\alpha)}\int_0^{\infty}\lambda e^{-\lambda t}(\lambda t)^{\alpha}\,dt.$$
Let x = λt; then $\frac{dt}{dx} = \frac{1}{\lambda}$, and we have
$$E[X] = \frac{1}{\lambda\Gamma(\alpha)}\int_0^{\infty} e^{-x}x^{\alpha}\,dx = \frac{\Gamma(\alpha+1)}{\lambda\Gamma(\alpha)} = \frac{\alpha\Gamma(\alpha)}{\lambda\Gamma(\alpha)} = \frac{\alpha}{\lambda}.$$
Similarly,
$$E[X^2] = \int_0^{\infty} t^2\lambda e^{-\lambda t}\frac{(\lambda t)^{\alpha-1}}{\Gamma(\alpha)}\,dt = \frac{1}{\lambda^2\Gamma(\alpha)}\int_0^{\infty}\lambda e^{-\lambda t}(\lambda t)^{\alpha+1}\,dt = \frac{1}{\lambda^2\Gamma(\alpha)}\int_0^{\infty} e^{-x}x^{\alpha+1}\,dx = \frac{\Gamma(\alpha+2)}{\lambda^2\Gamma(\alpha)} = \frac{(\alpha+1)\alpha\Gamma(\alpha)}{\lambda^2\Gamma(\alpha)} = \frac{\alpha(\alpha+1)}{\lambda^2}.$$
Hence,
$$Var[X] = E[X^2] - E[X]^2 = \frac{\alpha(\alpha+1)}{\lambda^2} - \left(\frac{\alpha}{\lambda}\right)^2 = \frac{\alpha}{\lambda^2}.$$
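An illustrative check in R by simulation (the parameter values are arbitrary):

```r
# Sketch: the sample mean and variance of a Gamma(shape = alpha, rate = lambda)
# sample should be close to alpha/lambda and alpha/lambda^2.
alpha <- 3; lambda <- 2
x <- rgamma(100000, shape = alpha, rate = lambda)
mean(x)   # approximately alpha/lambda = 1.5
var(x)    # approximately alpha/lambda^2 = 0.75
```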
4.8.8 Beta Distributions
Define the beta function to be
$$B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx,$$
for any positive real numbers α, β > 0.

Lemma. For any positive real numbers α, β > 0,
$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.$$

Proof. Let α, β be positive real numbers. Then
$$\Gamma(\alpha)\Gamma(\beta) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx\int_0^{\infty} y^{\beta-1}e^{-y}\,dy = \int_0^{\infty}\int_0^{\infty} e^{-x-y}x^{\alpha-1}y^{\beta-1}\,dx\,dy.$$
With the smooth change of variables F(z, t) = (zt, z(1 − t)) (then $J_F = t(-z) - z(1-t) = -z$, so $|J_F| = z$), the equation becomes
$$\Gamma(\alpha)\Gamma(\beta) = \int_0^{\infty}\int_0^1 e^{-z}(zt)^{\alpha-1}(z(1-t))^{\beta-1}z\,dt\,dz = \int_0^{\infty} e^{-z}z^{\alpha+\beta-1}\,dz\int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt = \Gamma(\alpha+\beta)B(\alpha,\beta),$$
as desired.

Therefore, a beta distribution with parameters (α, β) can be written as
$$p(x) = \frac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1},$$
for all 0 < x < 1. This is the reason this distribution is called a beta distribution, and clearly
$$\int_{\mathbb{R}} p(x)\,dx = \frac{\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx}{B(\alpha,\beta)} = \frac{B(\alpha,\beta)}{B(\alpha,\beta)} = 1.$$
The expected value and variance of a beta distribution with parameters (α, β) are
$$E[X] = \frac{\alpha}{\alpha+\beta}, \qquad Var[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

Proof. Recall that Γ(α + 1) = αΓ(α). Hence,
$$E[X] = \frac{1}{B(\alpha,\beta)}\int_0^1 x\,x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{1}{B(\alpha,\beta)}\int_0^1 x^{\alpha}(1-x)^{\beta-1}\,dx = \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{\alpha\Gamma(\alpha)\Gamma(\beta)}{(\alpha+\beta)\Gamma(\alpha+\beta)}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{\alpha}{\alpha+\beta}.$$
Similarly,
$$E[X^2] = \frac{1}{B(\alpha,\beta)}\int_0^1 x^2 x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{B(\alpha+2,\beta)}{B(\alpha,\beta)} = \frac{(\alpha+1)\alpha\,\Gamma(\alpha)\Gamma(\beta)}{(\alpha+\beta+1)(\alpha+\beta)\Gamma(\alpha+\beta)}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}.$$
Therefore,
$$Var[X] = E[X^2] - E[X]^2 = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \left(\frac{\alpha}{\alpha+\beta}\right)^2 = \frac{\alpha}{\alpha+\beta}\cdot\frac{(\alpha+1)(\alpha+\beta) - \alpha(\alpha+\beta+1)}{(\alpha+\beta)(\alpha+\beta+1)} = \frac{\alpha}{\alpha+\beta}\cdot\frac{\beta}{(\alpha+\beta)(\alpha+\beta+1)} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

4.8.9 Independent Random Variables


Theorem. The continuous (discrete) random variables X and Y are independent if and only if their joint probability density function can be expressed as
$$p_{X,Y}(x, y) = f(x)g(y)$$
for some functions f and g, depending only on x and y, respectively.

Proof. Suppose X and Y are independent. Then $p_{X,Y}(x, y) = p_X(x)p_Y(y)$, where $p_X(x)$ and $p_Y(y)$ are the marginal probability density functions of X and Y, respectively. Hence, we may let $f(x) = p_X(x)$ and $g(y) = p_Y(y)$.

On the other hand, suppose $p_{X,Y}(x, y) = f(x)g(y)$ for some functions f and g, depending only on x and y, respectively. Then
$$1 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x)g(y)\,dx\,dy = \left(\int_{-\infty}^{\infty} f(x)\,dx\right)\left(\int_{-\infty}^{\infty} g(y)\,dy\right) = c_1c_2,$$
where $c_1 = \int_{-\infty}^{\infty} f(x)\,dx$ and $c_2 = \int_{-\infty}^{\infty} g(y)\,dy$. Also,
$$p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dy = f(x)\int_{-\infty}^{\infty} g(y)\,dy = c_2 f(x),$$
$$p_Y(y) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dx = g(y)\int_{-\infty}^{\infty} f(x)\,dx = c_1 g(y).$$
Hence, since $c_1c_2 = 1$,
$$p_{X,Y}(x,y) = c_1c_2 f(x)g(y) = p_X(x)p_Y(y),$$
which shows that X and Y are independent.

Remark. Observe from the proof of the theorem that even though the joint probability density function can be written as the product of two functions depending on each variable, it is not necessarily true that $p_X(x) = f(x)$ and $p_Y(y) = g(y)$. For example, let
$$p_{X,Y}(x,y) = \begin{cases}\frac{1}{4}xy & 0 < x < 2,\ 0 < y < 2,\\ 0 & \text{otherwise},\end{cases} \qquad f(x) = \begin{cases}\frac{1}{4}x & 0 < x < 2,\\ 0 & \text{otherwise},\end{cases} \qquad g(y) = \begin{cases}y & 0 < y < 2,\\ 0 & \text{otherwise}.\end{cases}$$
Then $p_{X,Y}(x,y) = f(x)g(y)$, but
$$p_X(x) = \begin{cases}\frac{1}{2}x & 0 < x < 2,\\ 0 & \text{otherwise},\end{cases} \neq f(x), \qquad (4.1)$$
$$p_Y(y) = \begin{cases}\frac{1}{2}y & 0 < y < 2,\\ 0 & \text{otherwise},\end{cases} \neq g(y). \qquad (4.2)$$
In this case, $c_1 = \frac{1}{2}$, $c_2 = 2$, $p_X(x) = c_2 f(x)$, and $p_Y(y) = c_1 g(y)$.
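As a numerical check of this example in R (a sketch using the f and g above, supported on (0, 2)):

```r
# Sketch: numerical check of c1, c2 and the marginal of X in the remark,
# with f(x) = x/4 and g(y) = y on (0, 2).
f <- function(x) x / 4
g <- function(y) y
c1 <- integrate(f, 0, 2)$value   # 0.5
c2 <- integrate(g, 0, 2)$value   # 2
c1 * c2                          # 1, so f(x)g(y) is a valid joint density
x0 <- 1.2
integrate(function(y) f(x0) * g(y), 0, 2)$value   # pX(x0) = c2 * f(x0) = x0/2
```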

4.8.10 Expected Values, Covariance, and Correlation


Theorem (Expected value of a function on random variables). Suppose X and Y are random variables with joint probability density function $p_{X,Y}(x, y)$. Then for any (reasonable) function g(x, y) of two variables,

• if X and Y are discrete,
$$E[g(X, Y)] = \sum_x\sum_y g(x, y)\,p_{X,Y}(x, y),$$

• if X and Y are continuous,
$$E[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\,p_{X,Y}(x, y)\,dx\,dy.$$

Proof. • Suppose X and Y are discrete. Then, following the same idea as in the discrete univariate case, we group all the terms in $\sum_x\sum_y g(x, y)p_{X,Y}(x, y)$ that have the same value of $g(x_i, y_j)$; that is, let $h_t$ be a fixed number, and let
$$\{(x_{t_1}, y_{t(1,1)}), (x_{t_1}, y_{t(1,2)}), ..., (x_{t_1}, y_{t(1,j)}), ..., (x_{t_i}, y_{t(i,j)}), ...\}$$
be the set of all pairs (x, y) such that $g(x_{t_i}, y_{t(i,j)}) = h_t$ for all i, j ≥ 1. Then
$$\sum_x\sum_y g(x, y)p_{X,Y}(x, y) = \sum_t\sum_i\sum_j g(x_{t_i}, y_{t(i,j)})\,p_{X,Y}(x_{t_i}, y_{t(i,j)}) = \sum_t h_t\sum_i\sum_j p_{X,Y}(x_{t_i}, y_{t(i,j)}) = \sum_t h_t\,P(g(X, Y) = h_t) = E[g(X, Y)].$$

• Suppose X and Y are continuous. Suppose further that g(x, y) is a nonnegative function, g(x, y) ≥ 0 for all (x, y) ∈ ℝ². Then, from the lemma in section 4.8.4,
$$E[g(X, Y)] = \int_0^{\infty} P(g(X, Y) > t)\,dt.$$
Writing
$$P(g(X, Y) > t) = \int_{g(x,y)>t} p_{X,Y}(x, y)\,dA,$$
we have
$$E[g(X, Y)] = \int_0^{\infty}\int_{g(x,y)>t} p_{X,Y}(x, y)\,dA\,dt.$$
Interchanging the order of integration,
$$E[g(X, Y)] = \int_{\mathbb{R}^2}\int_0^{g(x,y)} p_{X,Y}(x, y)\,dt\,dA = \int_{\mathbb{R}^2} g(x, y)\,p_{X,Y}(x, y)\,dA,$$
as desired. Now, in general, we have
$$E[g(X, Y)] = \int_0^{\infty} P(g(X, Y) > t)\,dt - \int_0^{\infty} P(g(X, Y) < -t)\,dt,$$
and the result follows from derivations similar to the univariate case and the derivations above.

Theorem (Expected value of product of functions on independent random variables). If X and Y are independent random variables, then for any functions f and g,
$$E[f(X)g(Y)] = E[f(X)]\,E[g(Y)].$$

Proof. Since X and Y are independent, the joint probability density function splits, $p_{X,Y}(x, y) = p_X(x)p_Y(y)$, and thus
$$E[f(X)g(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x)g(y)\,p_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x)g(y)\,p_X(x)p_Y(y)\,dx\,dy = \int_{-\infty}^{\infty} f(x)p_X(x)\,dx\int_{-\infty}^{\infty} g(y)p_Y(y)\,dy = E[f(X)]\,E[g(Y)],$$
where the first and last equalities follow from the property of expected value of a function on a random variable.
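An illustrative simulation in R (the choices of f, g and the distributions of X and Y below are arbitrary):

```r
# Sketch: for independent X and Y, E[f(X)g(Y)] = E[f(X)] E[g(Y)].
set.seed(1)
x <- runif(100000)       # X ~ Uniform(0, 1)
y <- rexp(100000, 2)     # Y ~ Exponential(2), independent of X
f <- function(x) x^2
g <- function(y) exp(-y)
mean(f(x) * g(y))          # approximately 1/3 * 2/3 = 2/9
mean(f(x)) * mean(g(y))    # approximately the same value
```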
Theorem (Properties of covariance).

(i) (Symmetry) Cov[X, Y] = Cov[Y, X].

(ii) (Additive constant) Cov[X + a, Y] = Cov[X, Y].

(iii) (Linearity) $Cov\left[\sum_{i=1}^{n} a_iX_i,\ \sum_{j=1}^{m} b_jY_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m} a_ib_j\,Cov[X_i, Y_j]$.

Proof. (i) Cov[X, Y] = E[XY] − E[X]E[Y] = E[YX] − E[Y]E[X] = Cov[Y, X].

(ii) By the linearity property of expected value,
$$Cov[X+a, Y] = E[(X+a)Y] - E[X+a]E[Y] = E[XY + aY] - (E[X]+a)E[Y] = E[XY] + aE[Y] - E[X]E[Y] - aE[Y] = E[XY] - E[X]E[Y] = Cov[X, Y].$$

(iii) By the linearity property of expected value,
$$Cov\left[\sum_{i=1}^{n} a_iX_i,\ \sum_{j=1}^{m} b_jY_j\right] = E\left[\left(\sum_{i=1}^{n} a_iX_i\right)\left(\sum_{j=1}^{m} b_jY_j\right)\right] - E\left[\sum_{i=1}^{n} a_iX_i\right]E\left[\sum_{j=1}^{m} b_jY_j\right]$$
$$= E\left[\sum_{i=1}^{n}\sum_{j=1}^{m} a_ib_j X_iY_j\right] - \left(\sum_{i=1}^{n} a_iE[X_i]\right)\left(\sum_{j=1}^{m} b_jE[Y_j]\right)$$
$$= \sum_{i=1}^{n}\sum_{j=1}^{m} a_ib_j E[X_iY_j] - \sum_{i=1}^{n}\sum_{j=1}^{m} a_ib_j E[X_i]E[Y_j] = \sum_{i=1}^{n}\sum_{j=1}^{m} a_ib_j\left(E[X_iY_j] - E[X_i]E[Y_j]\right) = \sum_{i=1}^{n}\sum_{j=1}^{m} a_ib_j\,Cov[X_i, Y_j].$$

Lemma. For a random variable X, Var[X] = 0 if and only if X = a is a constant.

Proof. Let μ be the expected value of X. Suppose X is discrete. Then since $(x_i - \mu)^2 p(x_i) \geq 0$ and $p(x_i) > 0$ for all $x_i$ in the support of X,
$$0 = Var[X] = E[(X - \mu)^2] = \sum_i (x_i - \mu)^2 p(x_i)$$
if and only if $x_i - \mu = 0$ for all $x_i$ in the support of X. This is equivalent to X = μ being a constant. The proof for X being continuous is analogous, interchanging the summation with integration.

Lemma. Let $X = (X_1, X_2, ..., X_n)^T$ be a random vector. The covariance matrix of X can be written as
$$Cov[X] = E[(X - \mu)(X - \mu)^T],$$
where μ is the mean vector.

Proof. Write
$$\begin{pmatrix}\mu_1\\ \mu_2\\ \vdots\\ \mu_n\end{pmatrix} = \mu = E[X] = \begin{pmatrix}E[X_1]\\ E[X_2]\\ \vdots\\ E[X_n]\end{pmatrix},$$
that is, $E[X_i] = \mu_i$ for i = 1, ..., n. Then
$$E[(X - \mu)(X - \mu)^T] = E\left[\begin{pmatrix}X_1-\mu_1\\ X_2-\mu_2\\ \vdots\\ X_n-\mu_n\end{pmatrix}\begin{pmatrix}X_1-\mu_1 & X_2-\mu_2 & \cdots & X_n-\mu_n\end{pmatrix}\right]$$
$$= E\begin{pmatrix}(X_1-\mu_1)^2 & (X_1-\mu_1)(X_2-\mu_2) & \cdots & (X_1-\mu_1)(X_n-\mu_n)\\ (X_2-\mu_2)(X_1-\mu_1) & (X_2-\mu_2)^2 & \cdots & (X_2-\mu_2)(X_n-\mu_n)\\ \vdots & \vdots & \ddots & \vdots\\ (X_n-\mu_n)(X_1-\mu_1) & (X_n-\mu_n)(X_2-\mu_2) & \cdots & (X_n-\mu_n)^2\end{pmatrix}$$
$$= \begin{pmatrix}E[(X_1-\mu_1)^2] & E[(X_1-\mu_1)(X_2-\mu_2)] & \cdots & E[(X_1-\mu_1)(X_n-\mu_n)]\\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)^2] & \cdots & E[(X_2-\mu_2)(X_n-\mu_n)]\\ \vdots & \vdots & \ddots & \vdots\\ E[(X_n-\mu_n)(X_1-\mu_1)] & E[(X_n-\mu_n)(X_2-\mu_2)] & \cdots & E[(X_n-\mu_n)^2]\end{pmatrix}$$
$$= \left(Cov[X_i, X_j]\right)_{i,j} = Cov[X].$$
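The sample analogue of this lemma can be checked in R (a sketch with simulated data): centering the data matrix and averaging the outer products reproduces the built-in cov.

```r
# Sketch: the sample covariance matrix via the (X - mu)(X - mu)^T formula
# agrees with R's built-in cov().
set.seed(1)
n <- 1000
X  <- cbind(rnorm(n), rnorm(n, sd = 2), runif(n))   # n observations of a 3-dim vector
Xc <- sweep(X, 2, colMeans(X))                      # subtract the mean of each column
S  <- t(Xc) %*% Xc / (n - 1)                        # average of the outer products
max(abs(S - cov(X)))                                # essentially 0
```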

The correlation between two random variables X and Y is defined by
$$\rho(X, Y) = \frac{Cov[X, Y]}{\sqrt{Var[X]}\,\sqrt{Var[Y]}},$$
provided Var[X]Var[Y] > 0.

Lemma. For random variables X and Y,
$$-1 \leq \rho(X, Y) \leq 1,$$
with ρ(X, Y) = ±1 if and only if Y = a + bX for some real numbers a, b ∈ ℝ.


Proof. Let $\sigma_X^2$ and $\sigma_Y^2$ be the variances of X and Y, respectively. Since variance is nonnegative, by the linearity property of variance, we have
$$0 \leq Var\left[\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right] = \frac{Var[X]}{\sigma_X^2} + \frac{Var[Y]}{\sigma_Y^2} + \frac{2\,Cov[X,Y]}{\sigma_X\sigma_Y} = 2(1 + \rho(X,Y)),$$
and thus,
$$-1 \leq \rho(X, Y).$$
Similarly,
$$0 \leq Var\left[\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right] = \frac{Var[X]}{\sigma_X^2} + \frac{Var[Y]}{\sigma_Y^2} - \frac{2\,Cov[X,Y]}{\sigma_X\sigma_Y} = 2(1 - \rho(X,Y)),$$
which tells us that
$$\rho(X, Y) \leq 1,$$
thus proving the inequality.

Now suppose Y = a + bX for some real numbers a, b, with b ≠ 0. Then $\sigma_Y^2 = Var[Y] = b^2\,Var[X] = b^2\sigma_X^2$, and since $\sigma_X, \sigma_Y \geq 0$ we have $\sigma_Y = \mathrm{sgn}(b)\,b\,\sigma_X$, where
$$\mathrm{sgn}(b) = \begin{cases}1 & \text{if } b > 0,\\ 0 & \text{if } b = 0,\\ -1 & \text{if } b < 0.\end{cases}$$
Suppose sgn(b) = 1. Then
$$2(1 - \rho(X,Y)) = Var\left[\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right] = Var\left[\frac{X}{\sigma_X} - \frac{a+bX}{b\sigma_X}\right] = Var\left[\frac{a}{b\sigma_X}\right] = 0,$$
and hence, ρ(X, Y) = 1. If sgn(b) = −1, then
$$2(1 + \rho(X,Y)) = Var\left[\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right] = Var\left[\frac{X}{\sigma_X} - \frac{a+bX}{b\sigma_X}\right] = Var\left[\frac{a}{b\sigma_X}\right] = 0,$$
and hence, ρ(X, Y) = −1.

Conversely, if ρ(X, Y) = 1, then necessarily
$$0 = Var\left[\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right],$$
which means that
$$\frac{Y}{\sigma_Y} - \frac{X}{\sigma_X} = a \Rightarrow Y = a + \frac{\sigma_Y}{\sigma_X}X = a + bX,$$
for some constant a, where b = σY/σX. If ρ(X, Y) = −1, then necessarily
$$0 = Var\left[\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right],$$
which means that
$$\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y} = a \Rightarrow Y = a - \frac{\sigma_Y}{\sigma_X}X = a + bX,$$
for some constant a, where b = −σY/σX.
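An illustrative check in R with simulated data: cor always lies in [−1, 1], and a linear relationship gives exactly ±1.

```r
# Sketch: correlation lies in [-1, 1], and equals +1 or -1 exactly
# when Y = a + bX.
set.seed(1)
x <- rnorm(1000)
cor(x, 3 + 2 * x)           #  1  (b > 0)
cor(x, 5 - 4 * x)           # -1  (b < 0)
cor(x, x + rnorm(1000))     # strictly between -1 and 1
```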
4.9 Exercise 4
1. A class is given 10 problems, 5 of which will be randomly selected for the final
exam. Suppose a student is able to do 7 of the problems. Find the probability that
the student will be able to do

(a) all 5 problems in the exam?


(b) at least 4 of the problems?

2. Consider a game where each player takes turns to toss a die; the score they obtain for that round is the outcome of the toss added to their previous tosses. If the outcome of the toss is 3, a player gets a bonus toss (only once; even if it lands on 3 the second time, it is the end of the player's turn).

(a) Find the probability, by simulation and analytically, that a player's score after
the second round is 8. (Hint: for the analytical computation, consider cases
where the player gets no bonus, exactly one bonus, or 2 bonus tosses; that is,

P (score = 8) = P ((score = 8) ∩ (no bonus toss))


+P ((score = 8) ∩ (exactly 1 bonus toss))
+P ((score = 8) ∩ (2 bonus tosses)).

(b) Suppose the player’s score is 8 after the second round. Find the probability
by simulation and analytically that it was obtained with the help of exactly 1
bonus toss.

3. Four buses carrying 148 students from the same school arrive at a football stadium.
The buses carry, respectively, 40, 33, 25, and 50 students. One of the students
is randomly selected. Let X denote the number of students who were on the bus
carrying the randomly selected student. One of the 4 bus drivers is also randomly
selected. Let Y denote the number of students on her bus.

(a) Which of E[X] or E[Y ] do you think is larger? Why?


(b) Compute E[X] and E[Y ].
(c) Compute V ar[X] and V ar[Y ].

4. An insurance company writes a policy to the effect that an amount of money x


must be paid if some event E occurs within a year. If the company estimates that
E will occur within a year with probability p, what should it charge the customer
in order that its expected profit will be 10 percent of x?

5. Suppose a university has 10 blocks of hostels and each block has 10 bathrooms.
During a pandemic, the university periodically collects wastewater samples from the
bathrooms of the hostels to test for the presence of the virus. However, rather than
testing wastewater sample from each bathroom separately, it has been decided to
test the wastewater sample collected from bathrooms of an entire block. If the test
is negative, one test will suffice for the whole block, whereas if the test is positive,
wastewater sample will be collected from each bathroom and tested individually,
and in all, 11 tests will be made for that block. Assume that the probability that
the wastewater sample of a bathroom is tested positive is 0.1 for all bathrooms,
independently from one another. Compute the expected number of tests necessary
for all the blocks. What is the variance?

6. There are 100 marbles, of which 40 of them are blue and 60 of them are red. The
marbles are distributed randomly into 10 bags with 10 marbles each. Let p(k) be
the probability density function for the probability that a randomly chosen bag has
exactly k blue marbles.

(a) Define the function p(k) in R.


(b) Find the expected value and variance of the distribution defined by p(k).
(c) Let F (k) denote the cumulative distribution function. Compute F (8), and
without computing, find F (10).

7. There are 95 marbles, of which 40 of them are blue and 55 of them are red. The
marbles are distributed randomly into 10 bags, with 9 of them having 10 marbles
and 1 bag containing 5 marbles. Let p(k) be the probability density function for
the probability that a randomly chosen bag has exactly k blue marbles.

(a) Define the function p(k) in R.


(b) Find the expected value and variance of the distribution defined by p(k).

8. Suppose a particular trait (such as eye color or left-handedness) of a person is


classified based on one pair of genes, and suppose also that d represents a dominant
gene and r a recessive gene. Thus, a person with dd genes is purely dominant, one
with rr is purely recessive, and one with rd is hybrid. The purely dominant and
the hybrid individuals are alike in appearance. Children receive 1 gene from each
parent. Suppose, with respect to a particular trait, a pair of hybrid parents have
a total of 4 children. Assume that each child is equally likely to inherit either of 2
genes from each parents. What is the probability that 3 of the 4 children have the
outward appearance of the dominant gene?

9. Consider a game of marbles played between 2 players. For each round, each player
picks either 1 or 2 marbles, and guesses the number of marbles their opponent
will pick. If only one player guesses correctly, he is considered the winner for the
round. If both players guess correctly, or neither guesses correctly, then
no one wins the round. Suppose a game consists of 10 rounds. Consider a specified
player. Find a number k such that the probability that he wins at most k rounds
in a game is 0.98.

10. A person tosses a fair coin until a tail appears for the first time. If the tail appears
on the nth flip, the person wins $2^n$ dollars. Let X denote the player's winnings.
Show that E[X] = +∞. This problem is known as the St. Petersburg paradox.

(a) Would you be willing to pay $1 million to play this game once?
(b) Would you be willing to pay $1 million for each game if you could play for as
long as you liked and only had to settle up when you stopped playing?
11. A satellite system consists of 12 components and functions on any given day if at
least 9 of the 12 components function on that day. On a rainy day, each of the
components independently functions with probability 0.9, whereas on a dry day,
each independently functions with probability 0.7. If the probability of rain for the
next 5 days is 0.3 independently,

(a) what is the probability that the satellite system will function for at least 3 of
the 5 days?
(b) what is the probability that the satellite system's first malfunction happens on
the 5th day?

12. It is estimated that a certain lecturer makes on average 2 typographical errors every
10000 words. Suppose a page of the lecture notes consists of 700 words. What is
the probability that the page contains at least 1 typographical error?

13. Consider an online player versus player game where there are exactly 8 players (4v4)
in each match. Suppose approximately 10,000 matches were played last year.

(a) Estimate the probability that for at least 1 of these matches, at least 2 players
were born on 1st of January;
(b) Estimate the probability that for at least 2 of these matches, exactly 3 players
celebrated their birthday on the same day of the year.
(c) Repeat (b) under the assumption that there was at least a match with exactly
3 players celebrating their birthday on the same day of the year.

State your assumptions.

14. On average, a person contracts a cold 5 times in a given year. Suppose that a new
wonder drug (based on large quantities of vitamin C) has just been marketed that
reduces the number of times to 3 for 75 percent of the population. For the other
25 percent of the population, the drug has no appreciable effect on colds. If an
individual tries the drug for a year and has 2 colds in that time, how likely is it
that the drug is beneficial for him or her? (Hint: What is the discrete random
distribution that best models the number of times a person contracts a cold in a
given year.)

15. A bag contains 4 white and 4 black balls. We randomly choose 4 balls. If 2 of them
are white and 2 are black, we stop. If not, we replace the balls in the bag and again
randomly select 4 balls. This continues until exactly 2 of the 4 chosen are white.

(a) What is the probability that we shall make exactly 3 selections?


(b) What is the expected number of selections? What is the variance?

16. Consider a game where 2 players will each pick at random from a stack of cards
labeled 1 to 6, without replacement. The player who picked the card with
the larger number wins. The person to win 4 games is declared the overall winner.

(a) Find the probability that a winner is decided on the k = 4, 5, 6, 7 game.


(b) What is the expected number of games played? What is the variance?
(c)
17. For a discrete random variable X, its hazard function is defined as
$$h_X(k) = P(X = k+1 \mid X > k) = \frac{p_X(k+1)}{1 - F_X(k)},$$
where pX is the probability density function and FX is the cumulative distribution
function for X. The idea here is as follows: Say X is battery lifetime in months.
Then for instance hX (32) is the conditional probability that the battery will fail in
the next month, given that it has lasted 32 months so far. The notion is widely
used in medicine, insurance, device reliability and so on (though more commonly
for continuous random variables than discrete ones).
Show that for a geometrically distributed random variable, its hazard function is
constant. We say that geometric random variables are memoryless: It doesn’t
matter how long some process has been going; the probability is the same that it
will end in the next time epoch, as if it doesn’t ”remember” how long it has lasted
so far.
18. Let X be a random variable with probability density function

$$p(x) = \begin{cases} c(1-x^2) & -1 \leq x \leq 1,\\ 0 & \text{otherwise}.\end{cases}$$
(a) What is the value of c?
(b) What is the cumulative distribution function of X?
(c) What is the expected value and variance?
19. A certain system requires 3 identical components to function properly. For redun-
dancy, the system has 5 of such components. Suppose all 5 components function
independently and function for a random amount of time X (in months) with iden-
tical probability density function
$$p(x) = \begin{cases} Cx\,e^{-x/2} & x > 0,\\ 0 & x \leq 0.\end{cases}$$
What is the probability that the system functions for at least 6 months?
20. A filling station is supplied with gasoline once a week. Suppose its weekly volume of
sales in thousands of gallons is a random variable with probability density function

$$p(x) = \begin{cases} 5(1-x)^4 & 0 < x < 1,\\ 0 & \text{otherwise}.\end{cases}$$

(a) What must the capacity of the tank be so that the probability of the supply
being exhausted in a given week is 0.01?
(b) What should the volume of the tank be so that the weekly sales demand is
always met?
(c) What is the expected weekly volume of sales?

21. The density function of X is given by



$$p(x) = \begin{cases} a + bx^2 & 0 \leq x \leq 1,\\ 0 & \text{otherwise}.\end{cases}$$
If $E[X] = \frac{3}{5}$, find a and b.
22. A point is chosen at random on a line segment of length 1. Find the probability
that the difference in the length of the two segments is less than 0.2. What is the
expected value for the difference in the length of the two segments? What is the
variance?
23. Let X be a uniform (0, 1) random variable.
(a) Find E[X n ].
(b) Find the cumulative density function of Y = X n .
(c) Use (b) to find the probability density function of Y .
(d) Use (c) to find E[Y ]. Compare to the answer obtained in (a).
24. A man arrives at a train station at 10 A.M., knowing that the train will arrive at
some time normally distributed with mean 10:15 A.M. and standard deviation 5
minutes.
(a) What is the probability that the man will have to wait more than 10 minutes?
(b) What is the probability that the man will have to wait for less than 10 minutes?
(c) If, at 10:15 A.M., the train has not yet arrived, what is the probability that he
will have to wait at least an additional 10 minutes?
25. The annual rainfall (in mm) in a certain region is normally distributed with µ =
2200 and σ = 600.
(a) What is the probability that starting with this year, it will take more than
10 years before a year occurs having a rainfall of more than 3000 mm? What
assumptions are you making?
(b) Find the probability in (a) by simulation.
26. An examination is frequently regarded as being good (in the sense of determining
a valid grade spread for those taking it) if the test scores of those taking the ex-
amination can be approximated by a normal density function. (In other words, a
graph of the frequency of grade scores should have approximately the bell-shaped
form of the normal density.) The instructor often uses the test scores to estimate
the normal parameters µ and σ 2 and then assigns the letter grade A to those whose
test score is greater than µ + σ, B to those whose score is between µ and µ + σ, C to
those whose score is between µ − σ and µ, D to those whose score is between µ − 2σ
and µ − σ, and F to those getting a score below µ − 2σ. (This strategy is sometimes
referred to as grading “on the curve.”) Determine the approximate percentage of
the class receiving an A, B, C, D, and F respectively.
27. The total number of thousands of kilometers a motorcycle can be driven before it
would need to be scrapped is an exponential random variable with parameter $\frac{1}{20}$.
David purchased a used motorcycle from the previous owner who claimed that it
has been driven for less than 10,000 kilometers. What is the probability that David
gets at least 20,000 more kilometers out of the motorbike given that the previous
owner’s claim is true?
28. Let X be a gamma random variable with parameters (α = 2, λ = 6). Find the value
a such that P (X < a) = 0.95.
29. The amount of time one has to wait for a double-decker bus to arrive at a certain
bus stop is an exponential random variable with mean 10 minutes.

(a) What is the probability that 3 double-decker buses arrive within 30 minutes?
What assumptions did you make?
(b) What is the mean waiting time for 5 double decker buses to arrive at the bus
stop?

30. (a) If X is a uniform distribution over (0, 1), find the density function of $Y = e^X$.
(Hint: First find the cumulative density function of Y , that is, find P (Y ≤ y).)
(b) Suppose a random variable X has the following probability density function
$$p(x) = \begin{cases} \frac{1}{x} & 1 \leq x \leq e,\\ 0 & \text{otherwise}.\end{cases}$$

Suggest how we can generate values of the random variable X in R using runif.
(c) Suggest how we might generate in R random values of a random variable X
with probability density function
$$p(x) = \begin{cases} x^3 & 0 \leq x \leq \sqrt{2},\\ 0 & \text{otherwise}.\end{cases}$$

31. (a) Find the parameters α and β of a standard beta distribution with expected
value E[X] = µ and variance V ar[X] = σ 2 . Hint: Observe that
  
$$\frac{\alpha\beta}{(\alpha+\beta)^2} = \frac{\alpha}{\alpha+\beta}\left(1 - \frac{\alpha}{\alpha+\beta}\right).$$

(b) It is known that students take between 2 and 8 hours to complete their MA2104
assignment, with mean 6 hours and variance 2 hours². Compute the probability
that a student will take more than 7 hours to finish the assignment.

32. Show that if the support of a random variable (either discrete or continuous) is
bounded, a < X < b, that is, p(x) > 0 only within the interval a < x < b, then
a < E[X] < b.

33. A bin of 5 transistors is known to contain 2 that are defective. The transistors are
to be tested, one at a time, until the defective ones are identified. Denote by N1 the
number of tests made until the first defective is identified and by N2 the number of
additional tests until the second defective is identified.

(a) Find the joint probability density function of N1 and N2 .


(b) Find the marginal probability density functions.

34. The joint probability density function of X and Y is given by

p(x, y) = c(y 2 − x2 )e−y , −y ≤ x ≤ y, 0 < y < ∞.

(a) Find c.
(b) Find the marginal densities of X and Y . (Hint: Integration by parts)
(c) Find E[X].

35. The joint probability density function of X and Y is given by


$$p(x, y) = \frac{6}{7}\left(x^2 + \frac{xy}{2}\right), \quad 0 < x < 1,\ 0 < y < 2.$$
(a) Verify that this is indeed a joint density function.
(b) Find P (X > Y ).
(c) Find $P\left(Y > \frac{1}{2} \,\middle|\, X < \frac{1}{2}\right)$.
X< 2
.
(d) Find E[X].
(e) Find E[Y ].
(f) Find the covariance Cov[X, Y ] and the correlation ρ(X, Y ).

36. A man and a woman agree to meet at a certain location about 12:30 p.m. If the
man arrives at a time uniformly distributed between 12:15 and 12:45, and if the
woman independently arrives at a time uniformly distributed between 12:00 and 1
p.m., find the probability that the first to arrive waits no longer than 5 minutes.

(a) What is the probability that the man arrives first?


(b) Verify your answer in (a) by simulation.

37. The joint probability density function of X and Y is given by


$$p(x, y) = \begin{cases} xe^{-(x+y)} & x > 0,\ y > 0,\\ 0 & \text{otherwise}.\end{cases}$$

Are X and Y independent? If, instead, p(x, y) were given by



$$p(x, y) = \begin{cases} 2 & 0 < x < y,\ 0 < y < 1,\\ 0 & \text{otherwise},\end{cases}$$

would X and Y be independent?

38. The joint probability density function of X and Y is



$$p(x, y) = \begin{cases} x + y & 0 < x < 1,\ 0 < y < 1,\\ 0 & \text{otherwise}.\end{cases}$$

(a) Are X and Y independent?


(b) Find P (X + Y < 1).
(c) Find E[X].
(d) Find V ar[X]
(e) Find the correlation ρ(X, Y ).

39. Choose a number X at random from the set of numbers {1, 2, 3, 4, 5}. Now choose
a number at random from the subset of numbers no larger than X, that is, from {1, ..., X}. Call
this second number Y .

(a) Find the joint probability density function of X and Y .


(b) Find the conditional probability density function of X given that Y = i, where
i = 1, 2, 3, 4, 5.
(c) Are X and Y independent? Why?
(d) Find the conditional cumulative distribution FX|Y (4, 2) = P (X ≤ 4 | Y = 2).
(e) Find E[X|Y = 2].
(f) Find Cov[X, Y ].

40. The joint density function of X and Y is given by

$$p(x, y) = xe^{-x(y+1)}, \quad x > 0,\ y > 0.$$

(a) Find the conditional density of X, given Y = y, and that of Y , given X = x.


(b) Find the probability density function of Z = XY .
(c) Find P (X > 10 | Y = y).
(d) Find E[Y |X = x].

41. How many times would you expect to roll a fair die before all 6 sides appeared at
least once? What is the variance?

42. Let X and Y be discrete random variables and E[X|Y ] denote the function of the
random variable Y whose value at Y = y is the conditional expectation E[X|Y = y].
Note that E[X|Y ] is itself a random variable. Show that

E[X] = E[E[X|Y ]].

43. Let X be the number of 1’s and Y be the number of 2’s that occur in n rolls of a
fair die. Compute Cov[X, Y ]. Verify your answer via simulation.

44. Suppose $X_1, ..., X_n$ are independent and identically distributed uniform random variables over the
interval (0, 1). Define R = max{X1 , ..., Xn }. Find the density of R. (Hint: First,
use the fact that R ≤ t if and only if all Xi ≤ t for all i = 1, ..., n to find the
cumulative distribution function of R.)

45. Suppose X1 , ..., Xn are independent, with Xi having an exponential distribution


with parameter λi . Let S = min{X1 , ..., Xn }. Show that S has an exponential
distribution as well, and state the parameter. (See hint provided for question 44.)

46. Given an order n nonzero square matrix A, we define a new matrix $\bar{A} = \frac{1}{|a_{lk}|}A$,
where $a_{lk}$ is the entry of A with the largest absolute value, $|a_{lk}| \geq |a_{ij}|$ for all
i, j = 1, ..., n. Then all the entries of $\bar{A}$ are in the interval [−1, 1]. Note that $\bar{A}$ is
invertible if and only if A is invertible.
Write a function whose formal argument is n, and whose output is an order n square
matrix whose entries are randomly generated values within the interval (−1, 1).
Generate a large number of such matrices and find the probability that they are
singular. This exercise shows that most matrices are invertible.

47. If 65 percent of the population of a large community is in favor of a proposed rise


in school taxes, approximate the probability that a random sample of 100 people
will contain
(a) at least 50 who are in favor of the proposition;
(b) between 60 and 70 inclusive who are in favor;
(c) fewer than 75 in favor.

48. One thousand independent rolls of a fair die will be made.

(a) Compute an approximation to the probability that the number 6 will appear
between 150 and 200 times inclusively.
(b) If the number 6 appears exactly 200 times, find the probability that the number
5 will appear less than 150 times.

49. A model for the movement of a stock supposes that if the present price of the
stock is s, then after one period, it will be either 1.012s with probability 0.52 or
0.99s with probability 0.48. Assuming that successive movements are independent,
approximate the probability that the stock’s price will be up at least 30 percent
after the next 1000 periods.
