
MA3K1 Mathematics of Machine Learning April 10, 2021

Solution (23) Given a vector $x \in \mathbb{R}^d$ and a sequence $g = (g_{-d+1}, \dots, g_{d-1})^T$, the convolution is defined as the vector $y = (y_1, \dots, y_d)^T$, where
$$y_k = \sum_{i=1}^{d} x_i \cdot g_{k-i}, \qquad k \in \{1, \dots, d\}.$$

In terms of matrices,
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{pmatrix} = \begin{pmatrix} g_0 & g_{-1} & \cdots & g_{-d+1} \\ g_1 & g_0 & \cdots & g_{-d+2} \\ \vdots & \vdots & \ddots & \vdots \\ g_{d-1} & g_{d-2} & \cdots & g_0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}.$$
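As a quick illustration, here is a minimal NumPy sketch (ours, not from the notes; the helper names `conv_matrix` and `conv_direct` are made up) that builds this Toeplitz matrix for a small filter and checks it against the sum formula above:

```python
import numpy as np

def conv_matrix(g, d):
    """Build the d x d Toeplitz matrix with entries G[k, i] = g_{k-i}."""
    # g stores the offsets -(d-1), ..., d-1; g[offset + d - 1] holds g_offset.
    return np.array([[g[(k - i) + d - 1] for i in range(d)] for k in range(d)])

def conv_direct(x, g):
    """Compute y_k = sum_i x_i * g_{k-i} straight from the definition."""
    d = len(x)
    return np.array([sum(x[i] * g[(k - i) + d - 1] for i in range(d))
                     for k in range(d)])

d = 4
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
g = rng.standard_normal(2 * d - 1)     # filter entries g_{-3}, ..., g_3

# Matrix-vector product and direct summation agree.
assert np.allclose(conv_matrix(g, d) @ x, conv_direct(x, g))
```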

The sequence $g$ is called a filter. Convolutional Neural Networks (CNNs) operate on image data, and in the context of colour image data, a vector $x$ that enters a layer of the CNN will not be seen as a single vector in some $\mathbb{R}^d$, but as a tensor with format $N \times M \times 3$. Here, the image is considered to be an $N \times M$ matrix, with each entry corresponding to a pixel, and each pixel is represented by a 3-tuple consisting of the red, green and blue components. A convolution filter will therefore operate on a subset of this tensor, typically an $n \times m \times 3$ window. Typical parameters are $N = M = 32$ and $n = m = 5$.

[Figure: a $5 \times 5$ window sliding across the $N \times M$ image matrix; each window position is mapped to a single entry of the output matrix.]

Each $5 \times 5 \times 3$ block is mapped linearly to an entry of a $28 \times 28$ matrix in the same way (that is, applying the same filter). The filter applied is called a kernel and the resulting map is interpreted as representing a feature. Specifically, instead of writing it as a vector $g$, we can write it as a matrix $K$, and then have, if we denote the input image by $X$,
$$Y = X * K, \qquad Y_{ij} = \sum_{k,\ell} X_{k\ell} \cdot K_{i-k,\,j-\ell}.$$

In a convolutional layer, several different kernels are applied to the image. This can be seen as extracting several features from the image (see the sketch below). CNNs have the advantage of being scalable: the same kernels work on images of arbitrary size.
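To make the shapes concrete, here is a minimal NumPy sketch (ours, with random data purely for illustration) of a convolutional layer acting on a $32 \times 32 \times 3$ image with several $5 \times 5 \times 3$ kernels, producing one $28 \times 28$ feature map per kernel. As in most deep-learning frameworks, the sliding-window operation below is technically cross-correlation, which is what is usually implemented under the name convolution.

```python
import numpy as np

def conv_layer(image, kernels):
    """Slide each kernel over the image; every window position gives one entry.

    image:   (N, M, 3) array
    kernels: (K, n, m, 3) array, one n x m x 3 kernel per feature
    returns: (N - n + 1, M - m + 1, K) stack of feature maps
    """
    N, M, _ = image.shape
    K, n, m, _ = kernels.shape
    out = np.zeros((N - n + 1, M - m + 1, K))
    for k in range(K):
        for i in range(N - n + 1):
            for j in range(M - m + 1):
                # Linear map of the n x m x 3 block: elementwise product and sum.
                out[i, j, k] = np.sum(image[i:i + n, j:j + m, :] * kernels[k])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
kernels = rng.standard_normal((6, 5, 5, 3))   # six features

features = conv_layer(image, kernels)
print(features.shape)                          # (28, 28, 6)
```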


Solution (24) Let $h : \mathbb{R}^d \to \mathcal{Y}$ be any classifier. Define the smallest perturbation that moves a data point into a different class as
$$\Delta_h(x) = \inf_r \{\|r\| : h(x + r) \neq h(x)\}. \tag{3.5}$$
Given a probability distribution on the input space, define the robustness as
$$\rho(h) = \mathbb{E}\left[\frac{\Delta_h(X)}{\|X\|}\right].$$

For a linear classifier with only two classes, where $h(x) = \mathrm{sign}(w^T x + b)$, we get the vector that moves point $x$ to the boundary by solving the optimization problem (as for SVMs)
$$r^* = \operatorname*{argmin}_r \frac{1}{2}\|r\|^2 \quad \text{subject to} \quad w^T(x + r) + b = 0.$$
The solution is the projection of $x$ onto the hyperplane,
$$r^* = -\frac{w^T x + b}{\|w\|^2}\, w, \qquad \|r^*\| = \frac{|w^T x + b|}{\|w\|}.$$
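A minimal NumPy sketch (ours; the classifier parameters are made up) of this computation: it finds the boundary perturbation for a sample point, verifies the norm formula, and Monte-Carlo estimates the robustness $\rho(h)$ as the average of $\Delta_h(X)/\|X\|$ over random inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([2.0, -1.0, 0.5]), 0.3      # hypothetical linear classifier

def boundary_perturbation(x, w, b):
    """Smallest r with w @ (x + r) + b = 0 (projection onto the hyperplane)."""
    return -(w @ x + b) / (w @ w) * w

x = np.array([1.0, 2.0, -1.0])
r = boundary_perturbation(x, w, b)
print(np.allclose(w @ (x + r) + b, 0.0))    # True: x + r lies on the boundary
print(np.linalg.norm(r), np.abs(w @ x + b) / np.linalg.norm(w))  # equal norms

# Monte Carlo estimate of rho(h) under a standard normal input distribution.
X = rng.standard_normal((10_000, 3))
delta = np.abs(X @ w + b) / np.linalg.norm(w)
print((delta / np.linalg.norm(X, axis=1)).mean())
```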
Assume that we now have $k$ linear functions $f_1, \dots, f_k$, with $f_i(x) = w_i^T x + b_i$, and a classifier $h : \mathbb{R}^d \to \{1, \dots, k\}$ that assigns to each $x$ the index $j$ of the largest value $f_j(x)$ (this corresponds to the one-versus-rest setting for multiclass classification). Let $x$ be such that $\max_j f_j(x) = f_k(x)$ and define the linear functions
$$g_i(x) := f_i(x) - f_k(x) = (w_i - w_k)^T x + (b_i - b_k), \qquad i \in \{1, \dots, k-1\}.$$
Then
$$x \in \bigcap_{1 \le i \le k-1} \{y : g_i(y) < 0\}.$$
The intersection of half-spaces is a polyhedron, and $x$ is in the interior of the polyhedron $P$ delineated by the hyperplanes $H_i = \{y : g_i(y) = 0\}$ (see Figure 6).

[Figure: the point $x$ in the interior of a polyhedron bounded by the hyperplanes $H_1, \dots, H_5$.]

Figure 6: The distance to misclassification is the radius of the largest ball around $x$ enclosed in the polyhedron $P$.


A perturbation $x + r$ ceases to be in class $k$ as soon as $g_j(x + r) > 0$ for some $j$, and the smallest length of such an $r$ equals the radius of the largest ball centred at $x$ that is enclosed in the polyhedron. Formally, noting that $w_j - w_k = \nabla g_j(x)$, we get
$$\hat{\jmath} := \operatorname*{arg\,min}_j \frac{|g_j(x)|}{\|\nabla g_j(x)\|}, \qquad r^* = \frac{|g_{\hat{\jmath}}(x)|}{\|\nabla g_{\hat{\jmath}}(x)\|^2} \cdot \nabla g_{\hat{\jmath}}(x). \tag{3.6}$$
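The following NumPy sketch (ours; the random weights are purely illustrative) implements (3.6) for a multiclass linear classifier and checks that $x + r^*$ lands exactly on the nearest decision boundary, where the top two scores tie. This is essentially the DeepFool step for affine classifiers.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 4, 3
W = rng.standard_normal((k, d))            # hypothetical rows w_i
b = rng.standard_normal(k)

x = rng.standard_normal(d)
c = int(np.argmax(W @ x + b))              # predicted class, playing the role of k

def min_perturbation(x, W, b, c):
    """Equation (3.6): smallest r moving x onto the nearest boundary g_j = 0."""
    others = [j for j in range(len(b)) if j != c]
    g = np.array([(W[j] - W[c]) @ x + (b[j] - b[c]) for j in others])
    grads = np.array([W[j] - W[c] for j in others])
    norms = np.linalg.norm(grads, axis=1)
    jhat = int(np.argmin(np.abs(g) / norms))       # closest hyperplane
    return np.abs(g[jhat]) / norms[jhat] ** 2 * grads[jhat]

r = min_perturbation(x, W, b, c)
scores = W @ (x + r) + b
print(np.isclose(np.sort(scores)[-1], np.sort(scores)[-2]))  # tie on the boundary
```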

Solution (25) The discriminator $D$ would like to achieve, on average, a large value on $G_0(Z_0)$ and a small value on $G_1(Z_1)$. Using the logarithm, this can be expressed as the problem of maximizing
$$\mathbb{E}_{Z_0}[\log(D(G_0(Z_0)))] + \mathbb{E}_{Z_1}[\log(1 - D(G_1(Z_1)))].$$
As an integral, using the push-forward measures, we get the expression
$$\int_{\mathcal{X}} \left[\log(D(x))\,\rho_{X_0}(x) + \log(1 - D(x))\,\rho_{X_1}(x)\right] \mathrm{d}x.$$

We choose $D(x)$ so that the integrand becomes maximal. So, considering the function
$$f(y) = \log(y)\,\rho_{X_0}(x) + \log(1 - y)\,\rho_{X_1}(x)$$
and computing the first and second derivatives,
$$f'(y) = \frac{\rho_{X_0}(x)}{y} - \frac{\rho_{X_1}(x)}{1 - y}, \qquad f''(y) = -\frac{\rho_{X_0}(x)}{y^2} - \frac{\rho_{X_1}(x)}{(1 - y)^2} < 0,$$
we see that setting $f'(y) = 0$, i.e. $(1 - y)\,\rho_{X_0}(x) = y\,\rho_{X_1}(x)$, gives a maximum at
$$y = \frac{\rho_{X_0}(x)}{\rho_{X_0}(x) + \rho_{X_1}(x)},$$
which is the value of $D^*(x)$ that we want.
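As a numerical sanity check (a sketch of ours, with two made-up Gaussian densities standing in for $\rho_{X_0}$ and $\rho_{X_1}$), a grid search over $y$ recovers the closed-form optimum at a given $x$:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Two illustrative densities playing the roles of rho_{X_0} and rho_{X_1}.
x = 0.7
rho0 = gauss_pdf(x, mu=0.0, sigma=1.0)
rho1 = gauss_pdf(x, mu=1.5, sigma=0.8)

# Maximize f(y) = log(y) * rho0 + log(1 - y) * rho1 over a fine grid in (0, 1).
y = np.linspace(1e-6, 1 - 1e-6, 100_000)
f = np.log(y) * rho0 + np.log(1 - y) * rho1
print(y[np.argmax(f)])            # numerical maximizer
print(rho0 / (rho0 + rho1))       # closed-form D*(x); the two agree
```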

