MIT Introduction To Deep Learning 6.S191

This document provides an introduction to the MIT 6.S191 course on deep learning. The instructor explains that they will cover the fundamentals of deep learning over two weeks through lectures and projects. Students can earn course credit by developing a deep learning project or writing a paper review. The course aims to provide both a technical and practical foundation in deep learning, covering topics like neural networks, probabilistic deep learning, and algorithmic bias.

Good afternoon everyone, and welcome to MIT 6.S191 -- Introduction to Deep Learning. My name is Alexander Amini, and I'm so excited to be your instructor this year, along with Ava Soleimany, in this new virtual format. 6.S191 is a two-week bootcamp on everything deep learning, and we'll cover a ton of material in only two weeks, so I think it's really important for us to dive right in with these lectures. But before we do that, I do want to motivate exactly why I think this is such an awesome field to study. When we taught this class last year, I decided to try introducing the class very differently: instead of me telling the class how great 6.S191 is, I wanted to let someone else do that instead. So I actually want to start this year by showing you how we introduced 6.S191 last year.

[Obama] Hi everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT. Deep learning is revolutionizing so many things, from robotics to medicine and everything in between. You'll learn the fundamentals of this field and how you can build some of these incredible algorithms. In fact, this entire speech and video are not real and were created using deep learning and artificial intelligence, and in this class you'll learn how. It has been an honor to speak with you today, and I hope you enjoy the course.

So in case you couldn't tell, that was not a real video or audio. The audio you heard was purposely degraded a bit to make it even more obvious that this was not real, and to avoid some potential misuse. Even with the purposely degraded audio, that intro went somewhat viral last year after the course, and we got some really great and interesting feedback. To be honest, after we did this last year, I thought it was going to be really hard for us to top it this year. But actually I was wrong, because the one thing I love about this field is that it's moving so incredibly fast that even within the past year the state of the art has significantly advanced. The video you saw, which we used last year, used deep learning, but it was not a particularly easy video to create: it required a full video of Obama speaking, and it used this to intelligently stitch together parts of the scene to make it look and appear as if he was mouthing the words that I said. To see the behind the scenes, here now you can see the same video with my voice.

[Alexander] Hi everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT.

Now it's actually possible to use just a single static image, not a full video, to achieve the exact same thing. Here you can see eight more examples of Obama, now created using just a single static image -- no more full dynamic videos -- but achieving the same incredible realism and result using deep learning. Of course, there's nothing restricting us to one person: this method generalizes to different faces. And there's nothing restricting us to humans anymore, or even to individuals that the algorithm has ever seen before.

[Alexander] Hi everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT.

The ability to generate these types of dynamic, moving videos from only a single image is remarkable to me, and it's a testament to the true power of deep learning. In this class you're going to learn not only the technical basis of this technology, but also some of the very important ethical and societal implications of this work. Now, I hope this was a really great way to get you excited about this course and 6.S191.

And with that, let's get started. We can start by taking a step back and asking ourselves: what is deep learning? To define deep learning in the context of intelligence: intelligence is the ability to process information such that it can be used to inform a future decision. The field of artificial intelligence, or AI, is a science that focuses on building algorithms to do exactly this: to process information such that they can inform future predictions. Machine learning you can think of as a subset of AI that focuses on teaching an algorithm to learn from experience without being explicitly programmed. Deep learning takes this idea even further: it's a subset of machine learning that focuses on using neural networks to automatically extract useful patterns in raw data, and then on using these patterns, or features, to learn to perform a task. And that's exactly what this class is about: teaching algorithms how to learn a task directly from raw data. We want to provide you with a solid foundation, both technical and practical, for you to understand under the hood how these algorithms are built and how they can learn.

So this course is split between technical lectures and project software labs. We'll cover the foundations, starting today with neural networks, which are really the building blocks of everything that we'll see in this course. This year we also have two brand new, really exciting hot-topic lectures, focusing on uncertainty and probabilistic deep learning as well as algorithmic bias and fairness. Finally, we'll conclude with some really exciting guest lectures and student project presentations, as part of a final project competition in which all of you will be eligible to win some really exciting prizes.

Now, a bit of logistics before we dive into the technical side of the lecture. For those of you taking this course for credit, you will have two options to fulfill your credit requirement. The first option is to work, in teams of up to four or individually, to develop a cool new deep learning idea. Doing so will make you eligible to win some of the prizes that you can see on the right-hand side. We realize that in the context of this class, which is only two weeks, that's an extremely short amount of time to come up with an impressive project or research idea. So we're not going to be judging you on the results of that idea, but rather on the novelty of the idea, your thinking process, and how impactful the idea could be. On the last day of class you will give a three-minute presentation to a group of judges, who will then award the winners and the prizes. Again, three minutes is extremely short to present your ideas and your project, but I do believe that there's an art to presenting and conveying your ideas concisely and clearly in such a short amount of time, so we will be holding you strictly to that deadline.

The second option to fulfill your credit requirement is to write a one-page review of a deep learning paper. Here the grade is based more on the clarity of the writing and the technical communication of the main ideas. This will be due on the last Thursday of the class, and you can pick whatever deep learning paper you would like; if you would like some pointers, we have provided some guide papers that can help you get started, if you would like to use one of those for your review.

In addition to the final project prizes, this year we'll also be awarding three lab prizes, one associated with each of the software labs that students will complete. Again, completion of the software labs is not required for credit in this course, but it will make you eligible for some of these cool prizes, so we encourage everyone to compete for them and get the opportunity to win. Please post on Piazza if you have any questions, visit the course website for announcements and digital recordings of the lectures, and please email us if you have any questions. There are also software labs and office hours right after each of these technical lectures, held in Gather.Town, so please drop by Gather.Town to ask any questions about the software labs specifically, or more generally about past software labs or about the lecture that occurred that day.

This course also has an incredible group of teaching assistants that you can reach out to at any time in case you have any issues or questions about the material that you're learning.

And finally, we want to give a huge thanks to all of our sponsors, without whose help this class would not be possible. This is the fourth year that we're teaching this class, and each year it just keeps getting bigger and bigger, and we really want to give a huge shout-out to our sponsors for helping us make this happen each year, especially this year in light of the virtual format. So now, let's start with the fun stuff.

Let's start by asking ourselves a question: why do we all care about deep learning, and specifically, why do we care right now? To understand that, it's important to first understand how deep learning differs from traditional machine learning. Traditional machine learning algorithms define a set of features in their data; usually these features are hand-crafted or hand-engineered, and as a result they tend to be pretty brittle in practice when they're deployed. The key idea of deep learning is to learn these features directly from data, in a hierarchical manner. That is, if we want to learn how to detect a face, for example, can we learn to first detect edges in the image, compose these edges together to detect mid-level features such as an eye, a nose, or a mouth, and then go deeper and compose these features into structural facial features, so that we can recognize the face? This hierarchical way of thinking is really core to deep learning and to everything that we're going to learn in this class.

Actually, the fundamental building blocks of deep learning and neural networks have existed for decades. So one interesting thing to consider is: why are we studying this now? Now is an incredibly amazing time to study these algorithms. For one, data has become much more pervasive: these models are extremely hungry for data, and at the moment we're living in an era where we have more data than ever before. Secondly, these algorithms are massively parallelizable, so they can benefit tremendously from modern GPU hardware that simply did not exist when these algorithms were developed. And finally, due to open-source toolboxes like TensorFlow, building and deploying these models has become extremely streamlined.
So let's start with the fundamental building block of deep learning and of every neural network: a single neuron, also known as a perceptron. We're going to walk through exactly what a perceptron is and how it's defined, and we're going to build our way up from there, all the way to deep neural networks. The idea of a perceptron, or a single neuron, is actually very simple, so I think it's really important for all of you to understand it at its core. Let's start by talking about the forward propagation of information through this single neuron. We can define a set of inputs x1 through xm, which you can see on the left-hand side. Each of these inputs, each of these numbers, is multiplied by its corresponding weight, and the results are added together. We take this single number, the result of that addition, and pass it through what's called a nonlinear activation function to produce our final output y. Actually, this is not entirely correct, because one thing I forgot to mention is that we also have what's called a bias term, which allows you to shift your activation function left or right. On the right-hand side of this diagram you can see this concept written out mathematically as a single equation. You can rewrite this in terms of linear algebra, using matrix multiplications and dot products, to represent it a bit more concisely. So let's do that, with capital X being a vector of our inputs x1 through xm, and capital W a vector of our weights w1 through wm. Each of these is a vector of length m, and the output is very simply obtained by taking their dot product, adding a bias, which in this case is w0, and then applying a nonlinearity g.
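The three steps just described (dot product, add a bias, apply a nonlinearity) can be sketched in plain Python. The specific inputs, weights, and bias below are made-up values for illustration, and sigmoid is just one possible choice of g:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b):
    # Step 1: dot product of inputs and weights.
    z = sum(xi * wi for xi, wi in zip(x, w))
    # Step 2: add the bias term, which shifts the activation left or right.
    z += b
    # Step 3: apply the nonlinear activation function g.
    return sigmoid(z)

# Example with made-up values: two inputs, two weights, and a bias.
y = perceptron(x=[1.0, 2.0], w=[0.5, -0.5], b=0.1)
```

Here z = 0.5 - 1.0 + 0.1 = -0.4, so y = sigmoid(-0.4), a number just below 0.5.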
Now, one thing I've mentioned a couple of times is this nonlinearity g. What exactly is it? Well, it is a nonlinear function. One common example of a nonlinear activation function is what is known as the sigmoid function, defined here on the right. In fact, there are many types of nonlinear functions; you can see three more examples here, including the sigmoid function. Throughout this presentation you'll see TensorFlow code blocks that illustrate how we can take some of the topics that we're learning in this class and practically use them with the TensorFlow software library. Now, the sigmoid activation function, which I presented on the previous slide, is very popular because it takes as input any real number, any activation value, and outputs a number always between 0 and 1. This makes it really well suited for problems involving probability, because probabilities also have to be between 0 and 1. In modern deep neural networks, the ReLU activation function, which you can see on the right, is also extremely popular because of its simplicity: it's a piecewise linear function that is zero in the negative regime and strictly the identity function in the positive regime.
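As a minimal sketch of common activation functions like the ones just discussed (the slide's exact set isn't reproduced here, so treat these three as illustrative examples):

```python
import math

def sigmoid(z):
    # Smoothly maps any real number into (0, 1); well suited to probabilities.
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Like sigmoid, but zero-centered, with outputs in (-1, 1).
    return math.tanh(z)

def relu(z):
    # Zero in the negative regime, the identity in the positive regime.
    return max(0.0, z)
```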

But one really important question that I hope you're asking yourselves right now is: why do we even need activation functions? In fact, throughout this course, no matter what I say, I hope that you're always questioning why each step is necessary, because often these are the questions that can lead to really amazing research breakthroughs. So why do we need activation functions? The point of an activation function is to introduce nonlinearities into our network. Because these are nonlinear functions, they allow us to deal with nonlinear data. This is extremely important in real life, because in the real world data is almost always nonlinear. Imagine I told you to separate the green points here from the red points, but all you could use is a single straight line. You might think this is easy with multiple lines or curved lines, but you can only use a single straight line -- and that's what using a neural network with a linear activation function would be like. That makes the problem really hard, because no matter how deep the neural network is, you'll only be able to produce a single-line decision boundary: you're only able to separate your space with one line. Using nonlinear activation functions allows your neural network to approximate arbitrarily complex functions, and that's what makes neural networks extraordinarily powerful.

Let's understand this with a simple example, so that we can build up our intuition even further. Imagine I give you this trained network, with weights 3 and -2 on the left-hand side. This network has only two inputs, x1 and x2. If we want to get its output, we simply follow the same story as before: first take a dot product of our inputs with our weights, add the bias, and apply a nonlinearity. But let's take a look at what's inside of that nonlinearity: it's simply a weighted combination of our inputs, in the form of a two-dimensional line, because in this case we only have two inputs. So if we want to compute the output, it's the same story as before: we take the dot product of x and w, add our bias, and apply our nonlinearity. What about what's inside of this nonlinearity g? Well, this is just a 2D line, and since it's just a two-dimensional line, we can even plot it in two-dimensional space. This is called the feature space, or the input space; in this case the feature space and the input space are equal, because we only have one neuron. So in this plot, let me describe what you're seeing: on the two axes you're seeing our two inputs, x1 on one axis and x2 on the other, and we can plot the decision boundary of this trained neural network that I gave you as a line in this space. This line corresponds to all of the decisions that this neural network can make. If I give you a new data point, for example (-1, 2), this point lies somewhere in this space, specifically at x1 equal to -1 and x2 equal to 2. If I want you to compute its weighted combination, we can follow the perceptron equation to get the answer. Here we can see that if we plug it into the perceptron equation, we get 1 - 3 - 4, and the result is -6. We plug that into our nonlinear activation function g, and we get a final output of about 0.002. Now, in fact, remember that the sigmoid function divides the space into two parts: because it outputs everything between 0 and 1, its output is less than 0.5 when the input is negative and greater than 0.5 when the input is positive. We can illustrate this feature space when we're dealing with low-dimensional data, like in this case, where we only have two dimensions. Soon we'll start to talk about problems where we have thousands, or millions, or in some cases even billions of weights in our neural network, and then drawing these types of plots becomes extremely challenging and not really possible anymore. But at least while we're in this regime of a small number of inputs and a small number of weights, we can make these plots to really understand the entire space. And for any new input that we obtain -- for example, an input right here -- we can see immediately that this point will have an activation value less than zero, so its output will be less than 0.5. The magnitude of that output is computed by plugging it into the perceptron equation, so we can't avoid that, but we can immediately read off the decision, depending on which side of this hyperplane we lie on. So now that we have an idea of how to build a perceptron, let's start building neural networks and seeing how they all come together.
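The arithmetic in that example can be checked directly, using the values the lecture gives: bias w0 = 1, weights (3, -2), and the new data point (-1, 2):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Trained weights from the example: bias w0 = 1, weights (3, -2).
w0, w = 1.0, (3.0, -2.0)
x = (-1.0, 2.0)  # the new data point to classify

# Perceptron equation: z = w0 + x . w  ->  1 + (3)(-1) + (-2)(2) = -6
z = w0 + sum(xi * wi for xi, wi in zip(x, w))
y = sigmoid(z)  # roughly 0.002, i.e. well below the 0.5 decision boundary
```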

So let's revisit that diagram of the perceptron that I showed you before. If there are only a few things you get from this class, I really want everyone to take away how a perceptron works, and there are three steps -- remember them always: you take a dot product of your inputs and your weights, you add a bias, and you apply your nonlinearity. Let's simplify this diagram a little bit: let's clean up some of the arrows and remove the bias. We can now see that every line here has its own associated weight; I'll remove the bias term, like I said, for simplicity. Note that z here is the result of that dot product plus bias, before we apply the activation function g. The final output is simply y, which is equal to the activation function applied to z, our activation value.

Now, if we want to define a multi-output neural network, we can simply add another perceptron to this picture. Instead of having one perceptron, we now have two perceptrons and two outputs. Each one is a normal perceptron, exactly like we saw before, taking all of the inputs x1 through xm, taking the dot product, and adding a bias. Now we have two outputs, but each of those perceptrons will have a different set of weights -- remember that; we'll get back to it.

One thing to keep in mind here is that all the inputs are densely connected: every input has a connection to every perceptron. These are often called dense layers, or sometimes fully connected layers. Throughout this class you're going to get a lot of experience coding up and practically creating some of these algorithms using a software toolbox called TensorFlow. So now that we understand how a single perceptron works and how a dense layer works -- a dense layer is a stack of perceptrons -- let's try to see how we can actually build up a dense layer like this from scratch. To do that, we can start by initializing the two components of our dense layer: the weights and the biases. Once we have these two parameters of our dense layer, we can define the forward propagation of information, just like we saw and learned about already. That forward propagation is simply the dot product, or matrix multiplication, of our inputs with our weights, plus a bias; that gives us our activation value, and then we apply the nonlinearity to compute the output.
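A from-scratch sketch of that forward pass in plain Python (the layer sizes, the random initialization, and the sigmoid choice are illustrative assumptions; the lecture's own from-scratch version uses TensorFlow operations):

```python
import math
import random

def dense_layer(x, W, b):
    """Forward pass of a dense (fully connected) layer.

    x: list of m inputs; W: m-by-n weight matrix (list of rows);
    b: list of n biases. Returns n outputs, one per perceptron.
    """
    n = len(b)
    outputs = []
    for j in range(n):
        # Dot product of the inputs with this perceptron's weights, plus bias.
        z = b[j] + sum(x[i] * W[i][j] for i in range(len(x)))
        # Nonlinear activation (sigmoid here).
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

# Initialize the two components of the layer: weights and biases.
random.seed(0)
m, n = 3, 2  # three inputs feeding two perceptrons -- made-up sizes
W = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
b = [0.0] * n
y = dense_layer([1.0, 2.0, 3.0], W, b)
```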

Now, TensorFlow has actually implemented this dense layer for us, so we don't need to write it from scratch; instead, we can just call it as shown here. To create a dense layer with two outputs, we specify units equal to two.
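In code, that call looks roughly like this (the input values and feature count are made up; the slide's exact snippet isn't reproduced here):

```python
import tensorflow as tf

# A dense (fully connected) layer with two output units.
layer = tf.keras.layers.Dense(units=2)

# Calling the layer on a batch of inputs builds its weights and
# computes outputs = activation(x @ W + b). Here: batch of 1, 3 features.
x = tf.constant([[1.0, 2.0, 3.0]])
y = layer(x)
```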
Now let's take a look at what's called a single-layer neural network. This is one where we have a single hidden layer between our inputs and our outputs. This layer is called the hidden layer because, unlike the input layer and the output layer, its states are typically unobserved: they're hidden to some extent, and they're not strictly enforced either. Since we now have a transformation from the input layer to the hidden layer, and from the hidden layer to the output layer, each of these layers is going to have its own weight matrix: we'll call W1 the weight matrix for the first layer and W2 the weight matrix for the second layer.

If we take a zoomed-in look at one of the neurons in this hidden layer -- let's take z2, for example -- it is the exact same perceptron that we saw before. We can compute its output with the exact same story: taking all of its inputs x1 through xm, applying a dot product with the weights, and adding a bias; that gives us z2. If we look at a different neuron, say z3, we'll get a different value, because the weights leading to z3 are probably different from those leading to z2. Now, this picture looks a bit messy, so let's try to clean things up a bit more.

From now on I'll just use this symbol to denote what we call a dense, or fully connected, layer. Here you can see an example of how we can create this exact neural network using TensorFlow, with the predefined dense layer notation. We're creating a sequential model, where we can stack layers on top of each other: a first layer with n neurons, and a second layer -- the output layer -- with two neurons.

And if we want to create a deep neural network, all we have to do is keep stacking these layers, to create more and more hierarchical models, ones where the final output is computed by going deeper and deeper into the network. To implement this in TensorFlow, it's very similar to what we saw before: again using the tf.keras Sequential call, we can stack each of these dense layers on top of each other, each one specified by the number of neurons in that layer, n1 and n2, with the last output layer fixed to two outputs, if that's how many outputs we have.
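Putting that together, a deep sequential model along the lines just described might look like this (the hidden sizes n1 and n2 are placeholders, since the slide's exact values aren't shown here):

```python
import tensorflow as tf

n1, n2 = 32, 16  # hidden layer sizes -- illustrative placeholders

# Stack dense layers on top of each other; the final layer has 2 outputs.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n1, activation="relu"),
    tf.keras.layers.Dense(n2, activation="relu"),
    tf.keras.layers.Dense(2),
])

# Forward-propagate a batch of one example with 3 input features.
y = model(tf.constant([[1.0, 2.0, 3.0]]))
```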

Okay, so that's awesome. Now we have an idea not only of how to build a neural network up from a perceptron, but of how to compose them together to form complex deep neural networks. Let's take a look at how we can apply them to a very real problem that I believe all of you should care very deeply about. Here's the problem: we want to build an AI system that learns to answer "will I pass this class?" We can start with a simple two-feature model. One feature, let's say, is the number of lectures that you attend as part of this class, and the second feature is the number of hours that you spend working on your final project. You do have some training data, from all of the past participants of 6.S191, and we can plot this data in the feature space like this. Each green point is one student who has passed the class, and the red points are students who have failed the class. Where each student falls in this feature space depends on the number of lectures they attended and the number of hours they spent on the final project. And then there's you: you have attended four lectures and spent five hours on your final project, and you want to understand, given everyone else in this class, how you can build a neural network to predict whether you will pass or fail this class, based on the training data that you see. So let's do it; we now have all of the tools to do this. Let's build a neural network with two inputs, x1 and x2, with x1 being the number of lectures you attend and x2 the number of hours you spend on your final project. We'll have one hidden layer with three units, and we'll feed those into a final output: the probability of passing this class. We can see that the probability that we pass is 0.1, or 10 percent. That's not great, but the reason is that this model was never actually trained -- it's basically just a baby; it's never seen any data. Even though you have seen the data, it hasn't, and more importantly, you haven't told the model how to interpret this data. It needs to learn about this problem first; it knows nothing about this class, or final projects, or any of that.

So one of the most important things to do is to tell the model when it is making bad predictions, so that it is able to correct itself. The loss of a neural network defines exactly this: it defines how wrong a prediction was. It takes as input the predicted outputs and the ground-truth outputs. If those two things are very far apart from each other, the loss will be very large; on the other hand, the closer these two things are to each other, the smaller the loss, and the more accurate the model will be. So we always want to minimize the loss: we want to predict something that's as close as possible to the ground truth.

Now let's assume we have data not just from one student but, as in this case, from many students. We now care not just about how the model did on one prediction, but about how it did on average across all of these students. This is what we call the empirical loss, and it's simply the mean, or average, of the loss from each individual example -- each individual student.
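As a minimal sketch of the empirical loss (the per-example loss here is a squared error, and the predictions and targets are made-up numbers, both chosen only for illustration):

```python
def empirical_loss(predictions, targets, loss):
    # Mean of the per-example losses over the whole dataset.
    return sum(loss(p, t) for p, t in zip(predictions, targets)) / len(targets)

# Illustrative per-example loss: squared error.
squared_error = lambda p, t: (p - t) ** 2

# Three students: predicted pass probabilities vs. actual outcomes.
J = empirical_loss([0.9, 0.2, 0.6], [1.0, 0.0, 1.0], squared_error)
```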

When training a neural network, we want to find a network that minimizes the empirical loss between our predictions and the true outputs.

Now, if we look at the problem of binary classification, where the neural network is supposed to answer yes or no -- one or zero -- as we want to do in this case, we can use what is called a softmax cross-entropy loss. The softmax cross-entropy loss is written out here, and it's defined by what's called the cross entropy between two probability distributions: it measures how far apart the ground-truth probability distribution is from the predicted probability distribution.
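For the binary case, the cross entropy between the true label and the predicted probability can be sketched by hand as follows (a hand-rolled version for illustration; in practice TensorFlow provides its own, e.g. `tf.keras.losses.BinaryCrossentropy`):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Cross entropy between the true distribution (y_true, 1 - y_true)
    # and the predicted distribution (y_pred, 1 - y_pred).
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(y_pred)
             + (1.0 - y_true) * math.log(1.0 - y_pred))

# A confident correct prediction incurs a small loss...
low = binary_cross_entropy(1.0, 0.95)
# ...while a confident wrong prediction incurs a large one.
high = binary_cross_entropy(1.0, 0.05)
```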

Let's suppose that instead of predicting a binary output -- will I pass this class or will I not -- you want to predict the final grade, as a real number rather than a probability or a percentage: the grade that you will get in this class. In this case, because the type of the output is different, we also need to use a different loss, since our outputs are no longer 0 or 1 but can be any real number. For example, since the grade is a continuous variable, we want to use what's called the mean squared error: this measures the squared difference between our ground truth and our predictions, again averaged over the entire dataset.
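A minimal sketch of the mean squared error over a dataset (the grades below are made-up numbers for illustration):

```python
def mean_squared_error(y_true, y_pred):
    # Squared difference between ground truth and prediction,
    # averaged over the entire dataset.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up ground-truth grades vs. predicted grades for three students.
mse = mean_squared_error([90.0, 75.0, 82.0], [88.0, 80.0, 82.0])
```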

Okay, great. So now we've seen two loss functions, one for classification with binary outputs and one for regression with continuous outputs. The question we now need to ask ourselves is this: we've seen our loss function, and we've seen our network; how can we put those two things together? How can we use our loss function to train the weights of our neural network such that it can actually learn the problem?

Well, what we want to do is find the weights of the neural network that minimize the loss over our data set. That essentially means that we want to find the W's in our neural network that minimize J(W), where J(W) is the empirical cost function that we saw in the previous slides: the average loss over each data point in the data set.

Now remember that W, capital W, is simply the collection of all of the weights in our neural network, not just from one layer but from every single layer: W0 from the zeroth layer to the first layer, to the second layer, and so on, all concatenated into one. In this optimization problem we want to optimize all of the W's to minimize this empirical loss.
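One standard way to write that objective out, using the notation above with n training examples x^(i) and labels y^(i):

```latex
W^{*} = \arg\min_{W} J(W)
      = \arg\min_{W} \frac{1}{n} \sum_{i=1}^{n}
        \mathcal{L}\!\left( f\!\left(x^{(i)}; W\right),\; y^{(i)} \right)
```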

Now remember, our loss function is just a function of our weights. If we have only two weights, we can actually plot the entire loss landscape over a grid of weights: on one axis on the bottom you can see weight one, and on the other you can see weight zero. There are only two weights in this very simple neural network, so for every w0 and w1 we can plot the loss, the error, that we'd expect to obtain from this neural network. Now, the whole process of training a neural network, of optimizing it, is to find the lowest point in this loss landscape; that will tell us our optimal w0 and w1. How can we do that? The first thing we have to do is pick a point, so let's pick any w0, w1. Starting from this point we can compute the gradient of the landscape at that point. The gradient tells us the direction of steepest ascent, so it tells us which way is up. If we compute the gradient of our loss with respect to our weights, that derivative tells us which way is up on the loss landscape from where we stand right now.

Instead of going up, though, we want to find the lowest loss, so let's take the negative of our gradient and take a small step in that direction. This moves us a little bit closer to the lowest point, and we just keep repeating this: we compute the gradient at the new point and repeat the process until we converge. We will converge to a local minimum; we don't know if it will converge to a global minimum, but at least we know that, in theory, it should converge to a local minimum. We can summarize this algorithm, also known as gradient descent, as follows: we start by initializing all of our weights randomly, and then we loop until convergence. Starting from that initial point, we compute the gradient, which tells us which way is up, and we take a small step in the opposite direction. How small is determined by multiplying our gradient by a factor eta; this factor is called the learning rate.
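The loop above can be sketched in plain Python on a made-up one-weight loss, J(w) = (w - 3)^2, whose gradient is 2(w - 3); the loss, starting point, and learning rate are all illustrative:

```python
import random

def gradient(w):
    # dJ/dw for the toy loss J(w) = (w - 3)**2
    return 2 * (w - 3)

eta = 0.1                    # learning rate
w = random.uniform(-10, 10)  # 1. initialize the weight randomly
for _ in range(200):         # 2. loop (here: a fixed number of steps)
    grad = gradient(w)       # 3. compute the gradient (which way is "up")
    w = w - eta * grad       # 4. take a small step in the opposite direction
print(w)  # ~3.0, the minimum of the toy loss
```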

We'll learn more about that later. Now, in TensorFlow, we can actually see this pseudocode of the gradient descent algorithm written out in code. We randomize all of our weights, which initializes our search, our optimization process, at some point in space, and then we keep looping over and over again: we compute the loss, we compute the gradient, and we take a small step of our weights in the direction of the negative gradient. But now let's take a look at this term here, the gradient itself. It tells us how the loss is changing with respect to the weights, but I never actually told you how to compute it. So let's talk about that process, which is extremely important in training neural networks: it's known as backpropagation.

So how does backpropagation work? How do we compute this gradient? Let's start with a very simple neural network, probably the simplest neural network in existence: it has only one input, one hidden neuron, and one output. Computing the gradient of our loss J(W) with respect to one of the weights, in this case just w2 for example, tells us how much a small change in w2 is going to affect our loss J. So if we change w2 by an infinitesimally small amount, how will that affect our loss? That's what the gradient, the derivative of J(W) with respect to w2, is going to tell us. If we write out this derivative, we can apply the chain rule to compute it.

So what does that look like specifically? We can decompose that derivative into the derivative of J with respect to our output, multiplied by the derivative of our output with respect to w2: ∂J(W)/∂w2 = (∂J(W)/∂ŷ) · (∂ŷ/∂w2). Now, if we want to compute not the derivative of our loss with respect to w2 but with respect to w1, we can do the same as before: we apply the chain rule recursively, expanding the second term again through z1, the state of the first hidden unit: ∂J(W)/∂w1 = (∂J(W)/∂ŷ) · (∂ŷ/∂z1) · (∂z1/∂w1). We can backpropagate this information, starting from our loss all the way through w2, and then recursively applying the chain rule again to get to w1. This gives us the gradient at both w2 and w1. Just to reiterate once again, dJ/dw1 is telling us how a small change in our weight is going to affect our loss. If increasing the weight by a small amount increases our loss, that means we will want to decrease the weight to decrease our loss; that's what the gradient tells us, which direction we need to step in order to decrease or increase our loss function. Now, we showed this here for just the two weights in our neural network, because it only has two weights, but imagine we have a very deep neural network, one with more than just one layer of hidden units: we can just repeat this process of recursively applying the chain rule to determine how every single weight in the model needs to change to impact the loss. Really, all of this boils down to recursively applying the chain rule formulation that you can see here.
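To make the chain rule concrete, here's a sketch of that one-input, one-hidden-neuron, one-output network in plain Python, with a sigmoid hidden activation and a squared-error loss (the input, label, and starting weights are all made up). The hand-derived gradient is checked against a finite-difference approximation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    z1 = w1 * x       # pre-activation of the hidden neuron
    a1 = sigmoid(z1)  # hidden activation
    y_hat = w2 * a1   # network output
    return z1, a1, y_hat

def loss(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

x, y = 1.5, 1.0      # one (input, label) pair
w1, w2 = 0.4, -0.7   # arbitrary starting weights

z1, a1, y_hat = forward(x, w1, w2)

# Chain rule: dJ/dw2 = dJ/dy_hat * dy_hat/dw2
dJ_dyhat = y_hat - y
dJ_dw2 = dJ_dyhat * a1
# Applied recursively: dJ/dw1 = dJ/dy_hat * dy_hat/dz1 * dz1/dw1
dJ_dw1 = dJ_dyhat * w2 * a1 * (1 - a1) * x

# Sanity check against a finite-difference approximation of dJ/dw1
eps = 1e-6
_, _, y_hat_eps = forward(x, w1 + eps, w2)
numeric_dw1 = (loss(y_hat_eps, y) - loss(y_hat, y)) / eps
print(abs(dJ_dw1 - numeric_dw1) < 1e-4)  # True
```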

And that's the backpropagation algorithm. In theory it sounds very simple; it's just a basic extension of derivatives and the chain rule. But now let's touch on some insights from training these networks in practice that make this process much more complicated, and explain why using backpropagation as we saw there is not always so easy. In practice, training and optimizing neural networks can be extremely difficult and extremely computationally intensive. Here's a visualization of what the loss landscape of a real neural network can look like, visualized in just two dimensions. You can see that the loss is extremely non-convex, meaning that it has many, many local minima, which can make using an algorithm like gradient descent very challenging: gradient descent will always step toward the nearest local minimum, and it can get stuck there. So finding the global minimum, or a really good solution for your neural network, can often be very sensitive to your hyperparameters, such as where the optimizer starts in this loss landscape; if it starts in a bad part of the landscape, it can very easily get stuck in one of these local minima.

Now recall the equation for gradient descent that I showed you: your next weight update is your current weights minus a small amount, the learning rate, multiplied by the gradient. We have the minus sign because we want to step in the opposite direction of the gradient, and we multiply by the small number eta, which is what we call the learning rate: how fast we want to do the learning. Actually, that's maybe not the best way to say it; it really tells us how large each step we take should be with regard to the gradient. The gradient tells us the direction, but it doesn't necessarily tell us the magnitude of the step, so eta sets the scale of how much we want to trust that gradient and step in its direction. In practice, setting even eta, this one parameter, this one number, can be extremely difficult, and I want to give you a quick example of why.

If you have a very non-convex loss landscape with local minima and you set the learning rate too low, then the model can get stuck in these local minima; it can never escape them, because it does optimize itself, but toward a non-optimal minimum, and it can also converge very slowly. On the other hand, if we increase our learning rate too much, then we can overshoot our minima, diverge, lose control, and basically explode the training process completely. One of the challenges is how to use stable learning rates that are large enough to avoid the local minima but small enough that they don't diverge completely, so that they actually converge to that global spot once they reach it.
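A quick plain-Python illustration of that trade-off on a made-up loss, J(w) = w^2 with gradient 2w: a small eta converges, while an eta above 1 makes every step overshoot so the weights explode:

```python
def run_gd(eta, steps=50, w0=1.0):
    """Gradient descent on the toy loss J(w) = w**2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w = w - eta * (2 * w)  # each step multiplies w by (1 - 2*eta)
    return w

print(abs(run_gd(eta=0.1)))  # ~0: converges toward the minimum
print(abs(run_gd(eta=1.1)))  # huge: overshoots and diverges
```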

So how can we actually set this learning rate? One option, which is actually somewhat popular in practice, is just to try a lot of different learning rates, and that works; it is a feasible approach. But let's see if we can do something a little smarter, more intelligent. What if we could instead build an adaptive learning rate that looks at the loss landscape and adapts itself to account for what it sees? There are actually many types of optimizers that do exactly this. It means that the learning rates are no longer fixed: they can increase or decrease depending on how large the gradient is at that location and how fast we're actually learning,

among many other options, which can also take into account the size of the weights at that point, their magnitudes, et cetera. In fact, these optimizers have been widely explored and published as part of TensorFlow as well, and during your labs we encourage each of you to try out these different types of optimizers and experiment with their performance on different types of problems, so that you can gain important intuition about when to use each type of optimizer and what their advantages and disadvantages are in certain applications.
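As one concrete, simplified example of the idea, here's an RMSProp-style rule in plain Python: it keeps a running average of squared gradients per weight and shrinks the effective step size where recent gradients have been large. This is a sketch of the general mechanism on a made-up one-weight loss, not any particular TensorFlow optimizer:

```python
import math

def grad(w):
    # Gradient of the toy loss J(w) = (w - 3)**2
    return 2 * (w - 3)

w = 10.0
eta = 0.01    # base learning rate
avg_sq = 0.0  # running (exponential) average of squared gradients
decay, eps = 0.9, 1e-8

for _ in range(2000):
    g = grad(w)
    avg_sq = decay * avg_sq + (1 - decay) * g * g
    # Effective step shrinks where recent gradients have been large
    w = w - eta * g / (math.sqrt(avg_sq) + eps)
print(w)  # close to 3, the minimum of the toy loss
```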

So let's try and put all of this together. Here we can see a full loop in TensorFlow: define your model on the first line; define your optimizer, where you can use any optimizer you want (here I'm just using stochastic gradient descent, like we saw before); and feed the data through the model. We loop forever: we make a forward prediction using our model; we compute the loss on that prediction, the loss again telling us how incorrect the prediction is with respect to the ground truth y; we compute the gradient of the loss with respect to each of the weights in our neural network; and finally, we apply those gradients using our optimizer to step and update our weights. This really takes everything that we've learned in the lecture so far and applies it in one whole piece of code written in TensorFlow.
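The same loop structure can be sketched in framework-free Python, fitting a one-weight linear model to made-up data; in the TensorFlow version described above, the forward pass and gradient would be handled by the model and the framework's automatic differentiation, but everything here, data included, is illustrative:

```python
# Made-up training data following y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0     # model: y_hat = w * x
eta = 0.01  # learning rate (the "optimizer" here is plain gradient descent)

for _ in range(1000):            # loop "forever" (a fixed number of steps here)
    preds = [w * x for x in xs]  # 1. forward pass: predict
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)  # 2. loss
    # 3. gradient of the mean squared loss with respect to w
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    w = w - eta * grad           # 4. apply the gradient to update the weight
print(w)  # ~2.0, the true slope
```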

I want to continue this talk with tips for training these networks in practice, starting with the very powerful idea of batching your data into mini-batches. With gradient descent as we saw it before, the gradient computed using backpropagation can be very intensive to compute, especially when it's computed over your entire training set: it's a summation over every single data point in the entire data set. In most real-life applications it is simply not feasible to compute this on every single iteration of your optimization loop. Alternatively, let's consider a different variant of this algorithm called stochastic gradient descent. Instead of computing the gradient over the entire data set, we just pick a single point, compute the gradient of that single point with respect to the weights, and then update all of our weights based on that gradient. This has some advantages: it's very easy to compute, because it's only using one data point, so it's very fast. But it's also very noisy, because it comes from only one data point.

Instead, there's a middle ground: rather than computing the noisy gradient of a single point, let's get a better estimate of our gradient by using a batch of B data points. We pick a batch of B data points and compute the gradient estimate simply as the average over this batch. Since B here is usually not that large, on the order of tens or hundreds of samples, this is much faster to compute than regular gradient descent, and also much more accurate than purely stochastic gradient descent, which only uses a single example. This increased gradient accuracy allows us to converge much more smoothly.
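A small plain-Python sketch of that gradient estimate (the data set, model, batch size, and learning rate are all made up): each step samples a batch of B points and averages their per-example gradients.

```python
import random

# Made-up data set following y = 3x; model: y_hat = w * x
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

def example_grad(w, x, y):
    # Gradient of the squared error (w*x - y)**2 for one example
    return 2 * (w * x - y) * x

w, eta, B = 0.0, 0.05, 4
random.seed(0)
for _ in range(500):
    batch = random.sample(data, B)  # pick a mini-batch of B points
    # Gradient estimate: the average over the mini-batch
    g = sum(example_grad(w, x, y) for x, y in batch) / B
    w = w - eta * g
print(w)  # ~3.0, the true slope
```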

It also means that we can trust our gradient more than in stochastic gradient descent, so we can actually increase our learning rate a bit as well. Mini-batching also leads to massively parallelizable computation: we can split the batches up across separate workers and separate machines, and thus achieve even more parallelization and speed increases on our GPUs. Now, the last topic I want to talk about is overfitting. This is also known as the problem of generalization, and it is one of the most fundamental problems in all of machine learning, not just deep learning.

Overfitting, like I said, is critical to understand, so I really want to make sure this concept is clear in everyone's mind. Ideally, in machine learning we want to learn a model that accurately describes our test data, not just the training data: even though we're optimizing the model on the training data, what we really want is for it to perform well on the test data. Said differently, we want to build representations that can learn from our training data but still generalize well to unseen test data.

Now assume you want to fit a line to describe these points. Underfitting means that the model simply does not have enough capacity to represent these points: no matter how well we try to fit this model, it does not have the capacity to represent this type of data. On the far right-hand side we can see the other extreme, where the model is too complex: it has too many parameters, and it does not generalize well to new data. In the middle, though, we can see what's called the ideal fit. It's not overfitting and it's not underfitting; it has a medium number of parameters, it fits the data in a generalizable way, and it is able to generalize well to brand-new data when it sees it at test time.

To address this problem, let's talk about regularization. How can we make sure that our models do not end up overfit, given that neural networks have a ton of parameters? How can we enforce some form of regularization on them? What is regularization? Regularization is a technique that constrains our optimization problem so as to discourage these complex models from being learned and overfitting. Again, why do we need it? We need it so that our model can generalize to unseen data. In neural networks we have many techniques for imposing regularization on the model. One very common and very simple technique is called dropout; it's one of the most popular forms of regularization in deep learning. Let's revisit this picture of a neural network, a network with two hidden layers. In dropout, during training, all we do is randomly set some of the activations to zero with some probability. So let's say we pick our probability to be 50%, or 0.5: we randomly drop 50% of the activations. This is extremely powerful, as it lowers the capacity of our neural network so that it has to learn to perform better on test sets: during training it simply cannot rely on some of those parameters, so it has to be resilient to that kind of dropout. It also means the network is easier to train, because on each forward pass we're training only 50% of the weights and only 50% of the gradients, which cuts our gradient computation time down by a factor of two, since we only have to compute half the number of neuron gradients. On every iteration we drop out a different set of 50% of the neurons than on the previous iteration, and this basically forces the network to learn how to take different pathways to its answer: it can't rely on any one pathway too strongly and overfit to it. This is a way to really force it to generalize to new data.
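A minimal sketch of the dropout mask on a layer's activations in plain Python (the activation values are made up; note that library implementations typically also rescale the surviving activations so their expected value is unchanged, which is omitted here to match the description above):

```python
import random

def dropout(activations, p_drop):
    """Randomly zero each activation with probability p_drop (training only).
    Library versions usually also rescale survivors by 1 / (1 - p_drop)."""
    return [0.0 if random.random() < p_drop else a for a in activations]

random.seed(0)
hidden = [0.3, 1.2, -0.7, 0.9, 0.5, -1.1, 0.2, 0.8]
print(dropout(hidden, p_drop=0.5))  # roughly half the entries are zeroed
```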

The second regularization technique we'll talk about is the notion of early stopping, and again the idea is very basic: let's stop training once we realize that our loss is increasing on a held-out validation set, or let's call it a test set. When we start training, we all know that the definition of overfitting is when our model starts to perform worse on the test set. So if we set aside some of this training data to be quote-unquote test data, we can monitor how our network is doing on this data and simply stop before it has a chance to overfit. On the x-axis you can see the number of training iterations, and on the y-axis you can see the loss we get after training for that number of iterations. As we continue to train, in the beginning both lines continue to decrease. This is as we'd expect, and it is excellent, since it means our model is getting stronger. Eventually, though, the network's testing loss plateaus and starts to increase. Note that the training loss will always continue to go down as long as the network has the capacity to memorize the data, and this pattern continues for the rest of training. So it's important to focus on this point here: this is the point where we need to stop training, because after this point, assuming that our held-out set is a valid representation of the true test set, the performance of the model will only get worse. We can stop training here and take this model, and this should be the model that we actually deploy into the real world. Any model taken from the left-hand side is going to be underfit, not utilizing the full capacity of the network, and anything taken from the right-hand side is overfit and performing worse than it needs to on that held-out test set.
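That stopping rule can be sketched in plain Python as a monitor over a made-up validation-loss curve: keep the model from the epoch with the best validation loss, and stop once the loss has failed to improve for a few epochs in a row (the "patience"):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch whose model we should keep: training stops once the
    validation loss has failed to improve for `patience` epochs in a row."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss has started climbing: stop here
    return best_epoch

# Made-up validation losses: improving at first, then overfitting sets in
val_losses = [1.0, 0.7, 0.5, 0.45, 0.47, 0.52, 0.6]
print(early_stop_epoch(val_losses))  # 3: the epoch with the lowest loss
```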

I'll conclude this lecture by summarizing the three key points that we've covered so far. We started with the fundamental building block of neural networks, the perceptron. We learned about stacking and composing these perceptrons together to form complex hierarchical neural networks, and how to mathematically optimize these models with backpropagation. And finally, we addressed the practical side of these models that you'll find useful for the labs today, including adaptive learning rates, batching, and regularization. So thank you for attending the first lecture in 6.S191. Thank you very much.
