Introduction
• Born 1985 in Sofia, Bulgaria
• BA in 2008 from Pomona College, CA (Computer Science & Media Studies)
https://towardsdatascience.com/an-intro-to-deep-learning-for-face-recognition-aa8dfbbc51fb
Example deep learning tasks
• Image captioning
http://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_MSCap_Multi-Style_Image_Captioning_With_Unpaired_Stylized_Text_CVPR_2019_paper.pdf
http://openaccess.thecvf.com/content_CVPR_2019/papers/Kim_Dense_Relational_Captioning_Triple-Stream_Networks_for_Relationship-
Example deep learning tasks
• Image generation
Choi et al., “StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation”, CVPR 2018
Example deep learning tasks
• Fake news generation
https://www.youtube.com/watch?v=-QvIX3cY4lc
Example deep learning tasks
• Machine translation
Slide credit: Andrej Karpathy
Example deep learning tasks
• Fake news generation and detection
https://grover.allenai.org/detect
Example deep learning tasks
• Question answering
https://www.youtube.com/watch?v=wE3fmFTtP9g
Example deep learning tasks
• Artificial general intelligence???
https://www.dailymail.co.uk/sciencetech/article-5287647/Humans-robot-second-self.html
Example deep learning tasks
• Why are these tasks challenging?
• What are some problems from everyday life
that can be helped by deep learning?
• What are some ethical concerns about using
deep learning?
DL in a Nutshell
• Deep learning is a specific group of algorithms
falling in the broader realm of machine learning
• All ML/DL algorithms roughly match schema:
– Learn a mapping from input to output f: x → y
– x: image, text, etc.
– y: {cat, notcat}, {1, 1.5, 2, …}, etc.
– f: this is where the magic happens
ML/DL in a Nutshell
y' = f(x), where x is the input, f is the function, and y' is the output (prediction)
[Figure: two example images compared — is the label 1 or 0?]
f(image of an apple) = “apple”
f(image of a tomato) = “tomato”
f(image of a cow) = “cow”
Slide credit: L. Lazebnik
ML/DL in a Nutshell
• Example:
– x = pixels of the image (concatenated to form
a vector)
– y = integer (1 = apple, 2 = tomato, etc.)
– y’ = f(x) = wᵀx
• w is a vector of the same size as x
• One weight per dimension of x (i.e. one weight per pixel)
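A minimal NumPy sketch of this linear model (the image size and weights are hypothetical, not from the slides):

import numpy as np

# hypothetical 4x4 grayscale image, random weights
image = np.random.rand(4, 4)
x = image.reshape(-1)        # concatenate pixels into a vector (here: 16 values)
w = np.random.rand(x.size)   # one weight per pixel, same size as x
y_pred = w @ x               # y' = w^T x, a single score
print(y_pred)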
DL in a Nutshell
• Input → network → outputs
• Input X is raw (e.g. raw image,
one-hot representation of text)
• Network extracts features: abstraction of input
• Output is the labels Y
• All parameters of the network trained by
checking how well predicted/true Y agree, using
labels in the training set
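A toy sketch of that training signal (squared error and a single gradient step on one linear layer; all names and sizes hypothetical):

import numpy as np

x = np.random.rand(16)            # input features (e.g. flattened image)
y_true = 1.0                      # label from the training set
w = np.zeros(16)                  # trainable parameters of the "network"

y_pred = w @ x                    # predicted Y
loss = (y_pred - y_true) ** 2     # squared error: how well predicted/true Y agree
grad = 2 * (y_pred - y_true) * x  # gradient of the loss w.r.t. w
w -= 0.1 * grad                   # nudge parameters to improve agreement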
Validation strategies
• Ultimately, for our application, what do we want?
– High accuracy on training data?
– No, high accuracy on unseen/new/test data!
– Why is this tricky?
• Training data
– Features (x) and labels (y) used to learn mapping f
• Test data
– Features used to make a prediction
– Labels only used to see how well we’ve learned f!!!
• Validation data
– Held-out set of the training data
– Can use both features and labels to tune model hyperparameters
– Hyperparameters are “knobs” of the algorithm tuned by the designer: number of
iterations for learning, learning rate, etc.
– We train multiple models (one per hyperparameter setting) and choose the one that performs best on the validation set
Validation strategies
Idea #1: Choose hyperparameters that work best on the data.
[Your Dataset: all data used for training]
BAD: Overfitting; e.g. in K-nearest neighbors, K = 1 always works perfectly on training data.
Idea #2: Split data into train and test; choose hyperparameters that work best on test data.
[train | test]
BAD: No idea how algorithm will perform on new data; cheating.
Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test. Better!
[train | validation | test]
Useful for small datasets, but not used too frequently in deep learning
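A minimal NumPy sketch of Idea #3 (hypothetical 60/20/20 split, a made-up ridge-regression model, and a made-up hyperparameter grid):

import numpy as np

X, y = np.random.rand(100, 5), np.random.rand(100)   # toy dataset
idx = np.random.permutation(len(X))
train, val, test = idx[:60], idx[60:80], idx[80:]    # 60/20/20 split

best_lam, best_err = None, np.inf
for lam in [0.01, 0.1, 1.0]:                         # hyperparameter grid
    # ridge regression: w = (X^T X + lam*I)^-1 X^T y, fit on train only
    A = X[train].T @ X[train] + lam * np.eye(5)
    w = np.linalg.solve(A, X[train].T @ y[train])
    err = np.mean((X[val] @ w - y[val]) ** 2)        # choose on validation
    if err < best_err:
        best_lam, best_err = lam, err

# evaluate the chosen model once on the held-out test set
A = X[train].T @ X[train] + best_lam * np.eye(5)
w = np.linalg.solve(A, X[train].T @ y[train])
test_err = np.mean((X[test] @ w - y[test]) ** 2)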
• Unsupervised learning
– Training data does not include desired outputs
• Reinforcement learning
– Rewards from sequence of actions
• Supervised learning (regression): learn f: x → y where the output y is continuous
Unsupervised Learning
[Figure: unlabeled example images shown as grids of raw pixel-intensity values]
Red dots = training data (all that we see before we ship off our model!)
Green curve = true underlying model; blue curve = our predicted model/fit
[Figures: fits illustrating underfitting vs. overfitting, with the corresponding training, validation, and test error curves]
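A small sketch of that picture (toy sine data, low- vs. high-degree polynomial fits; all details hypothetical):

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 15))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(15)   # "red dots"

for degree in [1, 3, 9]:               # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x, y, degree)  # "blue curve": our fitted model
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_err)           # training error only shrinks with degree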
Vector
• A column vector v ∈ ℝⁿˣ¹ stacks its entries vertically: v = (v₁, v₂, …, vₙ)ᵀ
Matrix
• A matrix A ∈ ℝᵐˣⁿ is an array of numbers with size m by n, i.e. m rows and n columns.
Matrix Operations
• Addition
• Scaling
• X * Y = matrix product
• X .* Y = element-wise product
Inner Product
• Multiply corresponding entries of two vectors and add up the result
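In NumPy (the language of the course's tutorial) these operations look like the sketch below; the matrices and vectors are made up:

import numpy as np

X = np.array([[1., 2.], [3., 4.]])
Y = np.array([[5., 6.], [7., 8.]])

print(X + Y)    # addition
print(2 * X)    # scaling
print(X @ Y)    # matrix product        (MATLAB: X * Y)
print(X * Y)    # element-wise product  (MATLAB: X .* Y)

a = np.array([1., 2., 3.])
b = np.array([4., 5., 6.])
print(a @ b)    # inner product: 1*4 + 2*5 + 3*6 = 32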
Matrix Multiplication
• Example:
– Each entry of the matrix product is made by taking the dot product of the corresponding row in the left matrix with the corresponding column in the right one.
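A quick check of that rule in NumPy (a sketch; any small matrices work):

import numpy as np

A = np.arange(6.).reshape(2, 3)    # left matrix, 2x3
B = np.arange(12.).reshape(3, 4)   # right matrix, 3x4
C = A @ B                          # product, 2x4

# entry (i, j) = dot product of row i of A with column j of B
i, j = 1, 2
assert np.isclose(C[i, j], A[i, :] @ B[:, j])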
Matrix Operation Properties
• Matrix addition is commutative and associative
– A + B = B + A
– A + (B + C) = (A + B) + C
• Matrix multiplication is associative and distributive but not commutative
– A(BC) = (AB)C
– A(B + C) = AB + AC
– AB ≠ BA in general
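These properties are easy to sanity-check numerically (a sketch with random matrices):

import numpy as np

A, B, C = (np.random.rand(3, 3) for _ in range(3))

assert np.allclose(A + B, B + A)                # commutative addition
assert np.allclose(A @ (B @ C), (A @ B) @ C)    # associative multiplication
assert np.allclose(A @ (B + C), A @ B + A @ C)  # distributive
print(np.allclose(A @ B, B @ A))                # almost always False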
Matrix Operations
• Transpose – flip matrix, so row 1 becomes column 1
• A useful identity: (AB)ᵀ = BᵀAᵀ
Inverse
• Given a matrix A, its inverse A⁻¹ is a matrix such that AA⁻¹ = A⁻¹A = I
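A sketch with NumPy (the 2×2 matrix is a hypothetical example):

import numpy as np

A = np.array([[2., 0.], [1., 1.]])
A_inv = np.linalg.inv(A)

assert np.allclose(A @ A_inv, np.eye(2))   # A A^-1 = I
assert np.allclose(A_inv @ A, np.eye(2))   # A^-1 A = I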
Special Matrices
• Identity matrix I
– Square matrix, 1’s along diagonal, 0’s elsewhere
– I ∙ [another matrix] = [that matrix]
• Diagonal matrix
– Square matrix with numbers along diagonal, 0’s elsewhere
– A diagonal ∙ [another matrix] scales the rows of that matrix
Special Matrices
• Symmetric matrix
– A matrix A with Aᵀ = A
Norms
• L1 norm: ‖x‖₁ = Σᵢ |xᵢ|
• L2 norm: ‖x‖₂ = √(Σᵢ xᵢ²)
Example (MATLAB), solving a linear system Ax = B:
>> x = A\B
x =
1.0000
-0.5000
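The NumPy equivalent of MATLAB’s backslash is np.linalg.solve (a sketch; the A and B below are made up):

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([0., 1.])
x = np.linalg.solve(A, B)     # solves A x = B, like MATLAB's A\B

print(np.linalg.norm(x, 1))   # L1 norm: sum of absolute values
print(np.linalg.norm(x))      # L2 norm (default): Euclidean length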
Matrix Rank
• Column rank = number of linearly independent columns; row rank = number of linearly independent rows
• Column rank always equals row rank, so we simply call it the rank of the matrix
Linear independence
• Suppose we have a set of vectors v1, …, vn
• If we can express v1 as a linear combination of the
other vectors v2…vn, then v1 is linearly dependent on
the other vectors.
– The direction v1 can be expressed as a combination of the
directions v2…vn. (E.g. v1 = .7 v2 -.5 v4)
• If no vector is linearly dependent on the rest of the set,
the set is linearly independent.
– Common case: a set of vectors v1, …, vn is always linearly
independent if each vector is perpendicular to every other
vector (and non-zero)
Linear independence
[Figure: left, a linearly independent set of vectors; right, a set that is not linearly independent]
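Independence can be checked numerically via the rank (a sketch; the vectors are hypothetical):

import numpy as np

V = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [1., 1., 0.]])       # rows are the vectors v1, v2, v3

# rank < number of vectors => some vector depends on the others (v3 = v1 + v2)
print(np.linalg.matrix_rank(V))    # 2, so the set is not linearly independent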
Singular Value Decomposition (SVD)
• There are several computer algorithms that
can “factor” a matrix, representing it as the
product of some other matrices
• The most useful of these is the Singular Value
Decomposition
• Represents any matrix A as a product of three matrices: UΣVᵀ
Singular Value Decomposition (SVD)
UΣVᵀ = A
Singular Value Decomposition (SVD)
• In general, if A is m × n, then U will be m × m, Σ will be m × n, and Vᵀ will be n × n.
Singular Value Decomposition (SVD)
• U and V are always rotation matrices.
– Geometric rotation may not be an applicable
concept, depending on the matrix. So we call
them “unitary” matrices – each column is a unit
vector.
• Σ is a diagonal matrix
– The number of nonzero entries = rank of A
– The algorithm always sorts the entries high to low
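A sketch of these facts with NumPy’s SVD (random matrix; shapes as stated above):

import numpy as np

A = np.random.rand(5, 3)             # m x n, here 5 x 3
U, s, Vt = np.linalg.svd(A)          # A = U @ diag(s) @ Vt

print(U.shape, s.shape, Vt.shape)    # (5, 5), (3,), (3, 3)
print(np.all(s[:-1] >= s[1:]))       # singular values sorted high to low
print(np.sum(s > 1e-10))             # number of nonzero entries = rank of A

# reconstruct A: embed s in an m x n diagonal matrix
S = np.zeros((5, 3))
np.fill_diagonal(S, s)
assert np.allclose(U @ S @ Vt, A)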
Singular Value Decomposition (SVD)
M = UΣVᵀ
The derivative of the rate function is the
acceleration function.
Image: Wikipedia
Derivative = rate of change
• Linear function y = mx + b
• Slope m is the derivative: dy/dx = m
Image: Wikipedia
Ways to Write the Derivative
Given the function f(x), we can write its
derivative in the following ways:
- f′(x)
- (d/dx) f(x)
The derivative of x is commonly written dx.
Example: (d/dt)(t⁴) = 4t³
More Formulas
- The derivative of u to a constant power: d(uⁿ) = n·uⁿ⁻¹·du
- The derivative of e: d(eᵘ) = eᵘ·du
- The derivative of log: d(log u) = (1/u)·du
More Examples
- The derivative of u to a constant power: (d/dx) 3x³ = 9x²
- The derivative of e: (d/dy) e⁴ʸ = 4e⁴ʸ
- The derivative of log: (d/dx) 3·log(x) = 3/x
Product and Quotient
The product rule and quotient rule are commonly used in differentiation.
- Product rule: (d/du)[f(u)·g(u)] = f(u)·g′(u) + g(u)·f′(u)
- Quotient rule: (d/du)[f(u)/g(u)] = [g(u)·f′(u) − f(u)·g′(u)] / (g(u))²
Chain Rule
The chain rule allows you to combine any of the differentiation rules we have already covered.
(d/du) f(g(u)) = f′(g(u)) · g′(u)
Examples:
g(y) = 4y³ + 2y → g′(y) = 12y² + 2
p(x) = log(x²)/x → p′(x) = (2 − log(x²))/x²
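Those examples are easy to verify with SymPy (an assumption — the course doesn’t mention SymPy, but it’s handy for checking hand derivatives):

import sympy as sp

y, x = sp.symbols('y x')

g = 4 * y**3 + 2 * y
print(sp.diff(g, y))               # 12*y**2 + 2

p = sp.log(x**2) / x
print(sp.simplify(sp.diff(p, x)))  # (2 - log(x**2))/x**2, possibly rearranged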
https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html
• Do first reading
• Go through NumPy tutorial
• Start thinking about your project!