Learning 3
Roadmap
Backpropagation
K-means
Generalization
Best practices
• In this module, I’ll discuss backpropagation, an algorithm to automatically compute gradients.
• It is generally associated with training neural networks, but actually it is much more general and applies to any function.
Motivation: regression with four-layer neural networks
Loss on one example:
Loss(x, y, V1, V2, V3, w) = (w · σ(V3 σ(V2 σ(V1 φ(x)))) − y)²

V1 ← V1 − η ∇V1 Loss(x, y, V1, V2, V3, w)
V2 ← V2 − η ∇V2 Loss(x, y, V1, V2, V3, w)
V3 ← V3 − η ∇V3 Loss(x, y, V1, V2, V3, w)
w ← w − η ∇w Loss(x, y, V1, V2, V3, w)
• So far, we’ve defined neural networks, which take an initial feature vector φ(x) and send it through a sequence of matrix multiplications and
non-linear activations σ. At the end, we take the dot product with a weight vector w to produce the score.
• In regression, we predict the score, and use the squared loss, which looks at the squared difference between the score and the target y.
• Recall that we can use (stochastic) gradient descent to optimize the training loss (which is an average over the per-example losses). Now, we
need to update all the weight matrices, not just a single weight vector. This can be done by taking the gradient with respect to each weight
vector/matrix separately, and updating the respective weight vector/matrix by subtracting the gradient times a step size η.
• We can now proceed to take the gradient of the loss function with respect to the various weight vectors/matrices. You should know how to
do this: just apply the chain rule. But grinding through this complex expression by hand can be quite tedious. If only we had a way for this
to be done automatically for us...
Computation graphs
Loss(x, y, V1, V2, V3, w) = (w · σ(V3 σ(V2 σ(V1 φ(x)))) − y)²

A directed acyclic graph whose root node represents the final mathematical expression
and whose other nodes represent intermediate subexpressions.
• Enter computation graphs, which will rescue us.
• A computation graph is a directed acyclic graph that represents an arbitrary mathematical expression. The root of the graph represents the
final expression, and the other nodes represent intermediate subexpressions.
• After having constructed the graph, we can compute all the gradients we want by running the general-purpose backpropagation algorithm,
which operates on an arbitrary computation graph.
• There are two purposes to using computation graphs. The first and most obvious one is that it saves us from doing pages of calculus by hand,
and instead delegates this work to a computer. This is what packages such as TensorFlow or PyTorch do, and essentially all non-trivial deep learning
models are trained like this.
• The second purpose is that by defining the graph, we can gain more insight into the nature of how gradients are computed in a modular way.
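• To make this concrete, here is a minimal sketch (assuming PyTorch is installed; the dimensions and random values are made up for illustration) showing that once the loss from the motivating example is written as ordinary code, the framework builds the computation graph and backpropagation produces all the gradients:

```python
import torch

d, k = 5, 3                          # hypothetical feature and hidden dimensions
phi_x = torch.randn(d)               # feature vector phi(x)
y = torch.tensor(1.0)                # regression target

# Parameters; requires_grad=True tells PyTorch to track them in the computation graph.
V1 = torch.randn(k, d, requires_grad=True)
V2 = torch.randn(k, k, requires_grad=True)
V3 = torch.randn(k, k, requires_grad=True)
w = torch.randn(k, requires_grad=True)

# Forward pass: Loss(x, y, V1, V2, V3, w) = (w . sigma(V3 sigma(V2 sigma(V1 phi(x)))) - y)^2
h1 = torch.sigmoid(V1 @ phi_x)
h2 = torch.sigmoid(V2 @ h1)
h3 = torch.sigmoid(V3 @ h2)
loss = (torch.dot(w, h3) - y) ** 2

# Backward pass: backpropagation fills in .grad for every parameter.
loss.backward()
print(V1.grad.shape, V2.grad.shape, V3.grad.shape, w.grad.shape)
```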
Functions as boxes
c = a + b:   ∂c/∂a = 1,   ∂c/∂b = 1
c = a · b:   ∂c/∂a = b,   ∂c/∂b = a

(a + ε) + b = c + 1·ε        (a + ε) · b = c + b·ε
a + (b + ε) = c + 1·ε        a · (b + ε) = c + a·ε
• The first conceptual step is to think of functions as boxes that take a set of inputs and produce an output.
• For example, take c = a + b. The key question is: if we perturb a by a small amount ε, how much does the output c change? In this case,
the output c is also perturbed by 1·ε, so the gradient (partial derivative) is 1. We put this gradient on the edge.
• We can handle c = a · b in a similar way.
• Intuitively, the gradient is a measure of local sensitivity: how much input perturbations get amplified when they go through the various functions.
Basic building blocks
c = a + b:      ∂c/∂a = 1,          ∂c/∂b = 1
c = a − b:      ∂c/∂a = 1,          ∂c/∂b = −1
c = a · b:      ∂c/∂a = b,          ∂c/∂b = a
c = a²:         ∂c/∂a = 2a
c = max{a, b}:  ∂c/∂a = 1[a > b],   ∂c/∂b = 1[b > a]
c = σ(a):       ∂c/∂a = σ(a)(1 − σ(a))
• Here are some more examples of simple functions and their gradients. Let’s walk through them together.
• These should be familiar from basic calculus. All we’ve done is present them in a visually more intuitive way.
• For the max function, changing a impacts the max only if a > b; and analogously for b.
• For the logistic function σ(z) = 1/(1 + e−z), a bit of algebraic elbow grease produces the gradient σ(z)(1 − σ(z)). You can check that the
gradient tends to zero as |a| → ∞.
• It turns out that these simple functions are all we need to build up many of the more complex and potentially scarier looking functions that
we’ll encounter.
Function composition
a → b = a² → c = b²

∂b/∂a = 2a        ∂c/∂b = 2b

Chain rule:
∂c/∂a = (∂c/∂b)(∂b/∂a) = (2b)(2a) = (2a²)(2a) = 4a³
• Given these building blocks, we can now put them together to create more complex functions.
• Consider applying some function (e.g., squaring) to a to get b, and then applying some other function (e.g., squaring) to b to get c.
• What is the gradient of c with respect to a?
• We know from our building blocks the gradients on the edges.
• The final answer is given by the chain rule from calculus: just multiply the two gradients together.
• You can verify that this yields the correct answer (2b)(2a) = 4a3 .
• This visual intuition will help us better understand more complex functions.
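• As a quick sanity check (a small Python sketch with made-up values), we can compare the chain-rule answer 4a³ against a numerical derivative of c = (a²)²:

```python
# Chain rule example: b = a^2, c = b^2, so dc/da should be (2b)(2a) = 4a^3.
def c(a):
    b = a ** 2
    return b ** 2

a, eps = 1.5, 1e-6
numeric = (c(a + eps) - c(a - eps)) / (2 * eps)   # central finite difference
analytic = 4 * a ** 3
print(numeric, analytic)  # both should be approximately 13.5
```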
Linear classification with hinge loss
Loss(x, y, w) = max{1 − (w · φ(x)) y, 0}

Computation graph (bottom up): leaves w, φ(x), y; margin = (w · φ(x)) y; then 1 − margin; then loss = max{1 − margin, 0}.
Edge gradients on the path from w to the loss: φ(x) y at the dot-product nodes, −1 at the subtraction node, and 1[1 − margin > 0] at the max node.
• Now let’s turn to our first real-world example: the hinge loss for linear classification. We already computed the gradient before, but let’s do
it using computation graphs.
• We can construct the computation graph for this expression, proceeding bottom up. At the leaves are the inputs and the constants. Each
internal node is labeled with the operation (e.g., ·) and is labeled with a variable naming that subexpression (e.g., margin).
• In red, we have highlighted the weights w with respect to which we want to take the gradient. The central question is how small perturbations
in w change the output (loss).
• We can examine each edge on the path from w to the loss, and compute the gradient using our handy reference of building blocks.
• The actual gradient is the product of the edge-wise gradients from w to the loss output.
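• The following sketch (using numpy; the weight vector, features, and label are made-up values) multiplies the edge gradients along the path from w to the loss:

```python
import numpy as np

def hinge_loss(w, phi, y):
    # Loss(x, y, w) = max{1 - (w . phi(x)) y, 0}
    return max(1 - np.dot(w, phi) * y, 0)

def hinge_grad(w, phi, y):
    # Product of the edge gradients on the path from w to the loss:
    # phi(x)*y at the dot-product/margin nodes, -1 at the subtraction node,
    # and the indicator 1[1 - margin > 0] at the max node.
    margin = np.dot(w, phi) * y
    return (1 - margin > 0) * (-1) * y * phi

w = np.array([0.5, -1.0])
phi = np.array([1.0, 2.0])
y = 1
print(hinge_loss(w, phi, y))   # 2.5 (the margin is -1.5)
print(hinge_grad(w, phi, y))   # [-1. -2.], i.e. -phi(x)*y since the margin is below 1
```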
Two-layer neural networks
Loss(x, y, V, w) = (w · σ(Vφ(x)) − y)²

Computation graph (bottom up): leaves V, φ(x), w, y; h = σ(Vφ(x)); score = w · h; residual = score − y; loss = residual².
Edge gradients include 2(residual) at the squaring node, w and h at the dot-product node, and h ◦ (1 − h) at the σ node.

∇w Loss(x, y, V, w) = 2(residual) h
• We now finally turn to neural networks, but the idea is essentially the same.
• Specifically, consider a two-layer neural network driving the squared loss.
• Let us build the computation graph bottom up.
• Now we need to take the gradient with respect to w and V. Again, these are just the product of the gradients on the paths from w or V to
the loss node at the root.
• Note that the two gradients have the first two terms in common. Common paths result in common subexpressions for the gradient.
• There are some technicalities when dealing with vectors worth mentioning: First, the ◦ in h ◦ (1 − h) is elementwise multiplication (not the
dot product), since the non-linearity σ is applied elementwise. Second, there is a transpose for the gradient expression with respect to V and
not w because we are taking Vφ(x), while taking w · h = w⊤h.
• This computation graph also highlights the modularity of hypothesis class and loss function. You can pick any hypothesis class (linear
predictors or neural networks) to drive the score, and the score can be fed into any loss function (squared, hinge, etc.).
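• Here is a small numpy sketch (made-up dimensions and values) that evaluates these path products, ∇w Loss = 2(residual) h and ∇V Loss = 2(residual) (w ◦ h ◦ (1 − h)) φ(x)⊤, and spot-checks one entry against a numerical derivative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(V, w, phi, y):
    # Loss(x, y, V, w) = (w . sigmoid(V phi(x)) - y)^2
    h = sigmoid(V @ phi)
    return (np.dot(w, h) - y) ** 2

def gradients(V, w, phi, y):
    h = sigmoid(V @ phi)                  # hidden activations
    residual = np.dot(w, h) - y           # score - y
    grad_w = 2 * residual * h             # shared factor 2(residual), then h
    grad_V = np.outer(2 * residual * w * h * (1 - h), phi)  # outer product supplies phi(x) transposed
    return grad_w, grad_V

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 2))
w = rng.normal(size=3)
phi = np.array([1.0, 2.0])
y = 1.0

grad_w, grad_V = gradients(V, w, phi, y)

# Numerical spot check on one entry of V.
eps = 1e-6
Vp, Vm = V.copy(), V.copy()
Vp[0, 1] += eps
Vm[0, 1] -= eps
print(grad_V[0, 1], (loss(Vp, w, phi, y) - loss(Vm, w, phi, y)) / (2 * eps))
```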
Backpropagation
Loss(x, y, w) = (w · φ(x) − y)²

Definition: Forward/backward values
The forward value fi of a node is the value of the subexpression rooted at that node; the backward value gi is the gradient of the output (loss) with respect to that node.

Example with φ(x) = [1, 2], y = 2: forward values score = 5 and loss = 9; backward values 1 at the loss node and 6 at the score node.
• So far, we have mainly used the graphical representation to visualize the computation of function values and gradients for our conceptual
understanding.
• Now let us introduce the backpropagation algorithm, a general procedure for computing gradients given only the specification of the function.
• Let us go back to the simplest example: linear regression with the squared loss.
• All the quantities that we’ve been computing have been so far symbolic, but the actual algorithm works on real numbers and vectors. So let’s
use concrete values to illustrate the backpropagation algorithm.
• The backpropagation algorithm has two phases: forward and backward. In the forward phase, we compute a forward value fi for each node,
corresponding to the evaluation of that subexpression. Let’s work through the example.
• In the backward phase, we compute a backward value gi for each node. This value is the gradient of the loss with respect to that node,
which is also the product of all the gradients on the edges from the node to the root. To compute this backward value, we simply take the
parent’s backward value and multiply by the gradient on the edge to the parent. Let’s work through the example.
• Note that both fi and gi can either be scalars, vectors, or matrices, but have the same dimensionality.
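• The following sketch runs the two phases by hand on this example; the weight vector w = [1, 2] is an assumed value (any w with w · φ(x) = 5 gives the same forward/backward values shown on the slide):

```python
import numpy as np

phi = np.array([1.0, 2.0])   # phi(x) = [1, 2]
y = 2.0
w = np.array([1.0, 2.0])     # assumption: chosen so that score = w . phi(x) = 5

# Forward phase: compute the forward value of each node, leaves to root.
score = np.dot(w, phi)       # 5
loss = (score - y) ** 2      # 9

# Backward phase: each backward value is the parent's backward value
# times the gradient on the edge to the parent.
g_loss = 1.0                         # gradient of the loss with respect to itself
g_score = g_loss * 2 * (score - y)   # 6, since the edge gradient of the squaring node is 2(score - y)
g_w = g_score * phi                  # [6, 12], since the edge gradient of the dot product w.r.t. w is phi(x)
print(score, loss, g_score, g_w)
```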
A note on optimization
minV,w TrainLoss(V, w)

Linear predictors: TrainLoss is convex.    Neural networks: TrainLoss is non-convex.
• So now we can apply the backpropagation algorithm and compute gradients, stick them into stochastic gradient descent, and get some answer
out.
• One question which we haven’t addressed is whether stochastic gradient descent will actually find the weights that minimize the training loss.
• For linear predictors (using the squared loss or hinge loss), TrainLoss(w) is a convex function, which means that SGD (with an appropriately
chosen step size) is theoretically guaranteed to converge to the global optimum.
• However, for neural networks, TrainLoss(V, w) is typically non-convex, which means that there are multiple local optima, and SGD is not
guaranteed to converge to the global optimum. There are many settings where SGD fails both theoretically and empirically, but in practice,
SGD on neural networks can work much better than theory would predict, provided certain precautions are taken. The gap between theory
and practice is not well understood and is an active area of research.
How to train neural networks
score = w · σ(Vφ(x))
• Training a neural network is very much like driving stick. In practice, there are some "tricks" that are needed to make things work properly.
Just to name a few to give you a sense of the considerations:
• Initialization (where you start the weights) matters for non-convex optimization. Unlike for linear models, you can’t start at zero or else all
the subproblems will be the same (all rows of V will be the same). Instead, you want to initialize with a small amount of random noise.
• It is common to use overparameterized neural networks, ones with more hidden units (k) than is needed, because then there are more
"chances" that some of them will pick up on the right signal, and it is okay if some of the hidden units become "dead".
• There are small but important extensions of stochastic gradient descent that allow the step size to be tuned per weight.
• Perhaps one high-level piece of advice is that when training a neural network, it is important to monitor the gradients. If they vanish (get
too small), then training won’t make progress. If they explode (get too big), then training will be unstable.
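• As a rough illustration (not a recipe; the network, data, and thresholds are made up), here is what small random initialization and gradient-norm monitoring might look like for the two-layer network from earlier:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def example_gradients(V, w, phi, y):
    # Same two-layer squared-loss gradients as before.
    h = sigmoid(V @ phi)
    residual = np.dot(w, h) - y
    return 2 * residual * h, np.outer(2 * residual * w * h * (1 - h), phi)

rng = np.random.default_rng(0)
d, k = 2, 3
# Small random initialization: starting V at exactly zero would make all rows of V
# (all hidden units) identical forever, so we break the symmetry with a little noise.
V = 0.1 * rng.normal(size=(k, d))
w = 0.1 * rng.normal(size=k)

phi, y, eta = np.array([1.0, 2.0]), 1.0, 0.1
for step in range(100):
    grad_w, grad_V = example_gradients(V, w, phi, y)
    grad_norm = np.sqrt(np.sum(grad_w ** 2) + np.sum(grad_V ** 2))
    if grad_norm < 1e-8:
        print(f"step {step}: gradients are vanishing ({grad_norm:.2e})")
    if grad_norm > 1e3:
        print(f"step {step}: gradients are exploding ({grad_norm:.2e})")
    w = w - eta * grad_w
    V = V - eta * grad_V
```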
Summary
[Computation graph for the two-layer neural network with the squared loss: V, φ(x) → h = σ(Vφ(x)) → score = w · h → residual = score − y → loss = residual², with edge gradients 2(residual), h and w, and h ◦ (1 − h) along the path back down.]
• The most important concept in this module is the idea of a computation graph, which allows us to represent arbitrary mathematical expressions
built out of simple building blocks. Computation graphs hopefully have given you a more visual and better understanding of what gradients
are about.
• The backpropagation algorithm allows us to simply write down an expression, and never have to take a gradient manually again. However,
it is still important to understand how the gradient arises, so that when you try to train a deep neural network and your gradients vanish, you
know how to think about debugging your network.
• The generality of computation graphs and backpropagation makes it possible to iterate very quickly on new types of models and loss functions
and opens up a new paradigm for model development: differentiable programming.
Roadmap
Backpropagation
K-means
Generalization
Best practices
• In this module, we’ll talk about K-means, a simple algorithm for clustering, a form of unsupervised learning.
[Brown et al, 1992]
Word clustering
Input: raw text (100 million words of news articles)...
Output:
Cluster 1: Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
Cluster 2: June March July April January December October November September August
Cluster 3: water gas coal liquid acid sand carbon steam shale iron
Cluster 4: great big vast sudden mere sheer gigantic lifelong scant colossal
Cluster 5: man woman boy girl lawyer doctor guy farmer teacher citizen
Cluster 6: American Indian European Japanese German African Catholic Israeli Italian Arab
Cluster 7: pressure temperature permeability density porosity stress velocity viscosity gravity tension
Cluster 8: mother wife father son husband brother daughter sister boss uncle
Cluster 9: machine device controller processor CPU printer spindle subsystem compiler plotter
Cluster 10: John George James Bob Robert Paul William Jim David Mike
Cluster 11: anyone someone anybody somebody
Cluster 12: feet miles pounds degrees inches barrels tons acres meters bytes
Cluster 13: director chief professor commissioner commander treasurer founder superintendent dean custodian
Cluster 14: had hadn’t hath would’ve could’ve should’ve must’ve might’ve
Cluster 15: head body hands eyes voice arm seat eye hair mouth
• Here is a classic example of clustering from the NLP literature, called Brown clustering. This was the unsupervised learning method of choice
before word vectors.
• The input to the algorithm is simply raw text, and the output is a clustering of the words.
• The first cluster more or less represents days of the week, the second is months, the third is natural resources, and so on.
• It is important to note that no one told the algorithm what days of the week were or months or family relations. The clustering algorithm
discovered this structure automatically.
• On a personal note, Brown clustering was actually my first experience that got me to pursue research in NLP. Seeing the results of unsupervised
learning when it works was just magical. And of course today, we’re seeing even more strongly the potential of unsupervised learning with
neural language models such as BERT and GPT-3.
Classification (supervised learning)
[Figure: a table of training examples Dtrain with columns φ(x)1, φ(x)2, y is fed into a learning algorithm, which produces a classifier f; a scatter plot shows the labeled points in feature space (φ(x)1, φ(x)2).]
• I want to contrast unsupervised learning with supervised learning.
• Recall that in classification you’re given a set of labeled training examples.
• A learning algorithm produces a classifier that can classify new points.
• Note that we’re now plotting the (two-dimensional) feature vector rather than the raw input, since the learning algorithms only depend on
the feature vectors.
• However, the main challenge with supervised learning is that it can be expensive to collect the labels for data.
Clustering (unsupervised learning)
[Figure: a table of unlabeled training examples Dtrain with columns φ(x)1, φ(x)2 is fed into a learning algorithm, which produces cluster assignments z; scatter plots show the points in feature space (φ(x)1, φ(x)2) before and after being colored by their assigned cluster.]
• In contrast, in clustering, you are only given unlabeled training examples.
• Our goal is to assign each point to a cluster. In this case, there are two clusters, 1 (blue) and 2 (orange).
• Intuitively, nearby points should be assigned to the same cluster.
• The advantage of unsupervised learning is that unlabeled data is often very cheap and almost free to obtain, especially text or images on the
web.
Clustering task
Definition: clustering
Input: a set of points. Output: an assignment vector z = [z1, . . . , zn], where each zi ∈ {1, . . . , K} assigns point i to one of K clusters.
• Formally, the task of clustering is to take a set of points as input and return a partitioning of the points into K clusters.
• We will represent the partitioning using an assignment vector z = [z1 , . . . , zn ].
• For each i, zi ∈ {1, . . . , K} specifies which of the K clusters point i is assigned to.
Centroids
[Plot: data points in feature space (φ(x)1, φ(x)2), with each cluster represented by a centroid.]

Intuition: want each point φ(xi) to be close to its assigned centroid µzi
• What makes a cluster? The key assumption is that each cluster k is represented by a centroid µk .
• Now the intuition is that we want each point φ(xi ) to be close to its assigned centroid µzi .
K-means objective
Losskmeans(z, µ) = Σi=1..n ‖φ(xi) − µzi‖²

[Plot: each point connected by a dashed line to its assigned centroid.]
• To formalize this, we define the K-means objective (distinct from the K-means algorithm).
• The variables are the assignments z and centroids µ.
• We examine the squared distance (dashed lines) from a point φ(xi ) to the centroid of its assigned cluster µzi . Summing over all these squared
distances gives the K-means objective.
• This loss can be interpreted as a reconstruction loss: imagine replacing each data point by its assigned centroid. Then the objective captures
how lossy this compression was.
• Now our goal is to minimize the K-means loss.
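• A tiny numpy sketch of the objective (the points, assignments, and centroids are made-up values; the next slides use the same four one-dimensional points):

```python
import numpy as np

def kmeans_loss(points, z, mu):
    # Sum over points of the squared distance to the assigned centroid mu[z[i]].
    return sum(np.sum((points[i] - mu[z[i]]) ** 2) for i in range(len(points)))

points = np.array([0.0, 2.0, 10.0, 12.0])   # one-dimensional points
z = np.array([0, 0, 1, 1])                  # cluster assignments (0-indexed here)
mu = np.array([1.0, 11.0])                  # centroids
print(kmeans_loss(points, z, mu))           # 1 + 1 + 1 + 1 = 4
```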
Alternating minimization from optimum
Points: 0, 2, 10, 12.  Centroids: µ1 = 1, µ2 = 11.

If we know the centroids, the best assignments are z1 = z2 = 1, z3 = z4 = 2 (each point goes to its closer centroid).
If we know the assignments z1 = z2 = 1, z3 = z4 = 2:
µ1 = arg minµ (0 − µ)² + (2 − µ)² = 1
µ2 = arg minµ (10 − µ)² + (12 − µ)² = 11
• Before we present the K-means algorithm, let us form some intuitions.
• Consider the following one-dimensional clustering problem with 4 points. Intuitively there are two clusters.
• Suppose we know the centroids. Then for each point the assignment that minimizes the K-means loss is the closer of the two centroids.
• Suppose we know the assignments. Then for each cluster, we average the points that are assigned to that cluster.
Alternating minimization from random initialization
Points: 0, 2, 10, 12.  Initialize the centroids arbitrarily, e.g. µ1 = 0, µ2 = 2.

Iteration 1: assignments z = [1, 2, 2, 2]; new centroids µ1 = 0, µ2 = (2 + 10 + 12)/3 = 8
Iteration 2: assignments z = [1, 1, 2, 2]; new centroids µ1 = 1, µ2 = 11
Converged.
• But of course we don’t know either the centroids or assignments.
• So we simply start with an arbitrary setting of the centroids.
• Then alternate between choosing the best assignments given the centroids, and choosing the best centroids given the assignments.
• This is the K-means algorithm.
K-means algorithm
Algorithm: K-means
Initialize the centroids µ = [µ1, . . . , µK] (e.g., randomly).
For t = 1, . . . , T:
Step 1: given µ, set each assignment zi to the cluster whose centroid is closest to φ(xi).
Step 2: given z, set each centroid µk to the average of the points assigned to cluster k.
• Now we can state the K-means algorithm formally. We start by initializing all the centroids randomly. Then, we iteratively alternate back
and forth between steps 1 and 2, optimizing z given µ and vice-versa.
• Step 1 of K-means fixes the centroids µ. Then we can optimize the K-means objective with respect to z alone quite easily. It is easy to show
that the best label for zi is the cluster k that minimizes the distance to the centroid µk (which is fixed).
• Step 2 turns things around and fixes the assignments z. We can again look at the K-means objective function and optimize it with respect
to the centroids µ. The best µk is to place the centroid at the average of all the points assigned to cluster k.
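• Here is a compact numpy sketch of the algorithm (initializing at randomly chosen data points and stopping when the assignments stop changing are implementation choices, not prescribed by the slide):

```python
import numpy as np

def kmeans(points, K, T=100, seed=0):
    """Alternate between Step 1 (assignments given centroids) and Step 2 (centroids given assignments)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    mu = points[rng.choice(n, size=K, replace=False)].astype(float)  # initialize at K random data points
    z = None
    for t in range(T):
        # Step 1: assign each point to its closest centroid.
        dists = np.sum((points[:, None, :] - mu[None, :, :]) ** 2, axis=2)  # (n, K) squared distances
        new_z = np.argmin(dists, axis=1)
        if z is not None and np.array_equal(new_z, z):
            break                                    # assignments stopped changing: converged
        z = new_z
        # Step 2: move each centroid to the average of its assigned points.
        for k in range(K):
            if np.any(z == k):
                mu[k] = points[z == k].mean(axis=0)
    return z, mu

points = np.array([[0.0], [2.0], [10.0], [12.0]])
z, mu = kmeans(points, K=2)
print(z, mu.ravel())   # e.g. assignments [0 0 1 1] and centroids [1. 11.]
```

• Since the result depends on the initialization, one simple remedy (discussed on the next slide) is to call this with several different seeds and keep the run with the lowest K-means loss.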
Local minima
K-means is guaranteed to converge to a local minimum, but is not guaranteed to find the global
minimum.
Solutions:
• Run multiple times from different random initializations
• Initialize with a heuristic (K-means++)
• K-means is guaranteed to decrease the loss function each iteration and will converge to a local minimum, but it is not guaranteed to find the
global minimum, so one must exercise caution when applying K-means.
• Advanced: One solution is to simply run K-means several times from multiple random initializations and then choose the solution that has
the lowest loss.
• Advanced: Or we could try to be smarter in how we initialize K-means. K-means++ is an initialization scheme which places centroids on
training points so that these centroids tend to be distant from one another.
Summary
Clustering: discover structure in unlabeled data
K-means objective: Losskmeans(z, µ) = Σi ‖φ(xi) − µzi‖²
K-means algorithm: alternate between optimizing the assignments z and the centroids µ
[Plot: clustered points with their centroids.]
Roadmap
Backpropagation
K-means
Generalization
Best practices
• In this module, I will talk about the generalization of machine learning algorithms.
Minimizing training loss
Hypothesis class:
fw (x) = w · φ(x)
Training objective (loss function):
TrainLoss(w) = (1/|Dtrain|) Σ(x,y)∈Dtrain Loss(x, y, w)
Optimization algorithm:
stochastic gradient descent
• Recall that our machine learning framework consists of specifying the hypothesis class, loss function, and the optimization algorithm.
• The hypothesis class could be linear predictors or neural networks. The loss function could be the hinge loss or the squared loss, which is
averaged to produce the training loss.
• The default optimization algorithm is (stochastic) gradient descent.
• But let’s be a bit more critical about the training loss. Is the training loss the right thing to optimize?
A strawman algorithm

Algorithm: rote learning
Training: store Dtrain.
Predictor f(x): if (x, y) ∈ Dtrain for some y, return that y; otherwise, crash.
• Here is a strategy to consider: the rote learning algorithm, which just memorizes the training data and crashes otherwise.
• The rote learning algorithm does a perfect job of minimizing the training loss.
• But it’s clearly a bad idea: It overfits to the training data and doesn’t generalize to unseen examples.
• So clearly machine learning can’t be about just minimizing the training loss.
Overfitting pictures
[Two figures: overfitting in classification (left) and regression (right).]
• This is an extreme example of overfitting, which is when a learning algorithm outputs a predictor that does well on the training data but not
well on new examples.
• Here are two pictures that illustrate what overfitting looks like for binary classification and regression.
• On the left, we see that the green decision boundary gets zero training loss by separating all the blue points from the red ones. However, the
smoother and simpler black curve is intuitively more likely to be the better classifier.
• On the right, we see that the predictor that goes through all the points will get zero training loss, but intuitively, the black line is perhaps a
better option.
• In both cases, what is happening is that by over-optimizing on the training set, we risk fitting noise in the data.
Evaluation
• So what is the true objective then? Taking a step back, machine learning is just a means to an end. What we’re really doing is building a
predictor to be deployed in the real world, and we just happen to be using machine learning. What we really care about is how accurate that
predictor is on those unseen future inputs.
• Of course, we can’t access unseen future examples, so the next best thing is to create a test set. As much as possible, we should treat the
test set as a pristine thing that’s unseen. We definitely should not tune our predictor based on the test set, because we wouldn’t be able to
do that on future examples.
• Of course at some point we have to run our algorithm on the test set, but just be aware that each time this is done, the test set becomes
less good of an indicator of how well your predictor is actually doing.
Generalization
When will a learning algorithm generalize well?
Dtrain Dtest
• So far, we have an intuitive feel for what overfitting is. How do we make this precise? In particular, when does a learning algorithm generalize
from the training set to the test set?
Approximation and estimation error
[Figure: the set of all predictors contains the hypothesis class F; f∗ is the optimal predictor, g is the best predictor within F, and fˆ is the predictor returned by learning; approximation error separates f∗ from g, and estimation error separates g from fˆ.]
• Here’s a cartoon that can help you understand the balance between fitting and generalization. Out there somewhere, there is a magical
predictor f ∗ that classifies everything perfectly. This predictor is unattainable; all we can hope to do is to use a combination of our domain
knowledge and data to approximate that. The question is: how far are we away from f ∗ ?
• Recall that our learning framework consists of (i) choosing a hypothesis class F (e.g., by defining the feature extractor) and then (ii) choosing
a particular predictor fˆ from F.
• Approximation error is how far the entire hypothesis class is from the target predictor f∗. Larger hypothesis classes have lower approximation
error. Let g ∈ F be the best predictor in the hypothesis class in the sense of minimizing test error: g = arg minf∈F Err(f). Here, distance is
just the difference in test error: Err(g) − Err(f∗).
• Estimation error is how good the predictor fˆ returned by the learning algorithm is with respect to the best in the hypothesis class:
Err(fˆ) − Err(g). Larger hypothesis classes have higher estimation error because it’s harder to find a good predictor based on limited data.
• We’d like both approximation and estimation errors to be small, but there’s a tradeoff here.
Effect of hypothesis class size
[Figure: as the hypothesis class F grows, the approximation error (from f∗ to g) shrinks while the estimation error (from g to fˆ) grows.]
• The approximation error decreases monotonically as the hypothesis class size increases for a simple reason: you’re taking a minimum over a
larger set.
• The estimation error increases monotonically as the hypothesis class size increases for a deeper reason involving statistical learning theory
(explained in CS229T).
Strategy 1: dimensionality
w ∈ Rd
• Let’s focus our attention on linear predictors. For each weight vector w, we have a predictor fw (for classification, fw(x) = w · φ(x)). So the
hypothesis class F = {fw} is the set of all such predictors as w ranges over its possible values. By controlling the number of possible values of w that the learning algorithm is
allowed to choose from, we control the size of the hypothesis class and thus guard against overfitting.
• One straightforward strategy is to change the dimensionality, which is the number of features. For example, linear functions are lower-
dimensional than quadratic functions.
Controlling the dimensionality
Manual feature (template) selection:
• Add feature templates if they help
• Remove feature templates if they don’t help
Automatic feature selection (beyond the scope of this class):
• Forward selection
• Boosting
• L1 regularization
• Mathematically, you can think about removing a feature φ(x)37 as simply constraining its corresponding weight to be zero (w37 = 0).
• Operationally, if you have a few feature templates, then it’s probably easier to just manually include or exclude them — this will give you
more intuition.
• If you have a lot of individual features, you can apply more automatic methods for selecting features, but these are beyond the scope of this
class.
• An important point is that it’s the number of features that matters, not the number of feature templates. (Can you define one feature
template that results in severe overfitting?) Nor is it the amount of code that you have to write to generate a particular feature.
Strategy 2: norm
w ∈ Rd
• A related way to keep the weights small is called regularization, which involves adding an additional term to the objective function which
penalizes the norm (length) of w. This is probably the most common way to control the norm.
Controlling the norm
Regularized objective:
minw TrainLoss(w) + (λ/2) ‖w‖²

Initialize w = [0, . . . , 0]
For t = 1, . . . , T:
w ← w − η(∇w TrainLoss(w) + λw)
• This form of regularization is also known as L2 regularization, or weight decay in deep learning literature.
• We can use gradient descent on this regularized objective, and this simply leads to an algorithm which subtracts a scaled down version of w
in each iteration. This has the effect of keeping w closer to the origin than it otherwise would be.
• Note: Support Vector Machines are exactly hinge loss + L2 regularization.
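• Here is a minimal numpy sketch of this regularized update (the data, step size, and λ are made-up values; the loss is the average squared loss from earlier):

```python
import numpy as np

def train_loss_grad(w, X, Y):
    # Gradient of TrainLoss(w) = (1/|Dtrain|) sum of (w . phi(x) - y)^2 over the training set.
    return 2 * X.T @ (X @ w - Y) / len(Y)

# Toy training data: each row of X is a feature vector phi(x).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 2.0, 3.0])

eta, lam, T = 0.1, 0.5, 200
w = np.zeros(2)
for t in range(T):
    # The regularized objective TrainLoss(w) + (lambda/2)||w||^2 adds a lambda*w term to the gradient,
    # which shrinks w toward the origin on every update (hence "weight decay").
    w = w - eta * (train_loss_grad(w, X, Y) + lam * w)
print(w)   # smaller norm than the unregularized least-squares solution [1, 1]
```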
Controlling the norm: early stopping
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
w ← w − η∇w TrainLoss(w)
Intuition: if we have fewer updates, then ‖w‖ can’t get too big.
Lesson: try to minimize the training error, but don’t try too hard.
• A really cheap way to keep the weights small is to do early stopping. As we run more iterations of gradient descent, the objective function
improves. If we cared about the objective function, this would always be a good thing. However, our true objective is not the training loss.
• Each time we update the weights, w has the potential of getting larger, so by running gradient descent for fewer iterations, we are
implicitly ensuring that w stays small.
• Though early stopping seems hacky, there is actually some theory behind it. And one paradoxical note is that we can sometimes get better
solutions by performing less computation.
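• One common way to implement this in practice (a sketch with made-up data, not literally the fixed-T version on the slide) is to monitor the loss on a validation set and keep the best weights seen so far:

```python
import numpy as np

def avg_squared_loss(w, X, Y):
    return np.mean((X @ w - Y) ** 2)

rng = np.random.default_rng(0)
X_train, Y_train = rng.normal(size=(20, 5)), rng.normal(size=20)   # toy training set
X_val, Y_val = rng.normal(size=(10, 5)), rng.normal(size=10)       # held-out validation set

eta, T = 0.05, 500
w = np.zeros(5)
best_w, best_val = w.copy(), avg_squared_loss(w, X_val, Y_val)
for t in range(T):
    grad = 2 * X_train.T @ (X_train @ w - Y_train) / len(Y_train)
    w = w - eta * grad
    val = avg_squared_loss(w, X_val, Y_val)
    if val < best_val:
        best_w, best_val = w.copy(), val   # remember the best (earliest good) iterate
# best_w is the early-stopped solution; the final w may have a larger norm and overfit.
print(best_val, avg_squared_loss(w, X_val, Y_val))
```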
Summary
Not the real objective: training loss
Real objective: loss on unseen future examples
Semi-real objective: test loss
Try to minimize training error, but keep the hypothesis class small.
• In summary, we started by noting that the training loss is not the real objective. Instead, the real objective is the loss on unseen future examples, which is
approximated by the test set provided you are careful.
• We’ve seen several ways to control the size of the hypothesis class (and thus reducing the estimation error) based on either reducing the
dimensionality or reducing the norm.
• It is important to note that what matters is the size of the hypothesis class, not how ”complex” the predictors in the hypothesis class look.
To put it another way, using complex features backed by 1000 lines of code doesn’t hurt you if there are only 5 of them.
• So far, we’ve talked about the various knobs that we can turn to control the size of the hypothesis class, but how much do we turn each
knob?
Roadmap
Backpropagation
K-means
Generalization
Best practices
• We’ve spent a lot of time talking about the formal principles of machine learning.
• In this module, I will discuss some of the more empirical aspects you encounter in practice.
Choose your own adventure
Hypothesis class: fw(x) = sign(w · φ(x))
• Feature extractor φ: linear, quadratic
• Architecture: number of layers, number of hidden units

Training objective: (1/|Dtrain|) Σ(x,y)∈Dtrain Loss(x, y, w) + Reg(w)
• Loss function: hinge, logistic
• Regularization: none, L2

Optimization algorithm: stochastic gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T: for each (x, y) ∈ Dtrain: w ← w − η ∇w Loss(x, y, V, w)
• Number of epochs
• Step size: constant, decreasing, adaptive
• Initialization: amount of noise, pre-training
• Batch size
• Dropout
• Recall that there are three design decisions for setting up a machine learning algorithm: the hypothesis class, the training objective, and the
optimization algorithm.
• For the hypothesis class, there are two knobs you can turn. The first is the feature extractor φ (linear features, quadratic features, indicator
features on regions, etc.). The second is the architecture of the predictor: linear (one layer) or a neural network with multiple layers, and, in the case of
neural networks, how many hidden units (k) we have.
• The second design decision is to specify the training objective, which we do by specifying the loss function depending how we want the
predictor to fit our data, and also whether we want to regularize the weights to guard against overfitting.
• The final design decision is how to optimize the predictor. Even the basic stochastic gradient descent algorithm has at least two knobs: how
long to train (number of epochs) and how aggressively to update (the step size). On top of that are many enhancements and tweaks common
to training deep neural networks: changing the step size over time, perhaps adaptively, how we initialize the weights, whether we update on
batches (say of 16 examples) instead of 1, and whether we apply dropout to guard against overfitting.
• So it is really a choose your own machine learning adventure. Sometimes decisions can be made via prior knowledge and are thoughtful (e.g.,
features that capture periodic trends). But in many (even most) cases, we don’t really know what the proper values should be. Instead, we
want a way to have these just set automatically.
Hyperparameters
Definition: hyperparameters
Hyperparameters are the design decisions (hypothesis class, training objective, optimization algorithm) that are not set by the learning algorithm itself and must be chosen some other way.
• Each of these many design decisions is a hyperparameter.
• We could choose the hyperparameters to minimize the training loss. However, this would lead to a degenerate solution. For example, by
adding additional features, we can always decrease the training loss, so we would just end up adding all the features in the world, leading to
a model that wouldn’t generalize. We would turn off all regularization, because that just gets in the way of minimizing the training loss.
• What if we instead chose hyperparameters to minimize the test loss? This might lead to good hyperparameters, but is problematic because
you then lose the ability to measure how well you’re doing. Recall that the test set is supposed to be a surrogate for unseen examples, and
the more you optimize over them, the less unseen they become.
Validation set
A validation set is taken out of the training set and used to optimize hyperparameters.
• The solution is to invent something that looks like a test set. There’s no other data lying around, so we’ll have to steal it from the training
set. The resulting set is called the validation set.
• The size of the validation set should be large enough to give you a reliable estimate, but you don’t want to take away too many examples
from the training set.
• With this validation set, now we can simply try out a bunch of different hyperparameters and choose the setting that yields the lowest error
on the validation set. Which hyperparameter values should we try? Generally, you should start by getting the right order of magnitude (e.g.,
λ = 0.0001, 0.001, 0.01, 0.1, 1, 10) and then refining if necessary.
• In K-fold cross-validation, you divide the training set into K parts. Repeat K times: train on K − 1 of the parts and use the other part as
a validation set. You then get K validation errors, from which you can compute and report both the mean and the variance, which gives you
more reliable information.
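• A minimal sketch of this loop (made-up data; the model is the L2-regularized linear regression from earlier, and λ is the hyperparameter being tuned):

```python
import numpy as np

def train(X, Y, lam, eta=0.1, T=500):
    # Gradient descent on the L2-regularized average squared loss.
    w = np.zeros(X.shape[1])
    for t in range(T):
        w = w - eta * (2 * X.T @ (X @ w - Y) / len(Y) + lam * w)
    return w

def error(w, X, Y):
    return np.mean((X @ w - Y) ** 2)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 5)), rng.normal(size=100)

# Carve a validation set out of the training data; the test set stays untouched.
X_train, Y_train = X[:80], Y[:80]
X_val, Y_val = X[80:], Y[80:]

# Try values spanning several orders of magnitude; keep the one with the lowest validation error.
best_lam, best_err = None, float("inf")
for lam in [0.0001, 0.001, 0.01, 0.1, 1, 10]:
    w = train(X_train, Y_train, lam)
    err = error(w, X_val, Y_val)
    if err < best_err:
        best_lam, best_err = lam, err
print(best_lam, best_err)
```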
Model development strategy

• Split your data: hold out a test set and don’t look at it until the very end.
• Look at the training/validation data to get intuition; compute basic statistics.
• Repeat: implement a model or feature template; inspect error rates, weights, and predictions; do error analysis.
• Finally, run on the test set once and report the result.
• This slide represents the most important yet most overlooked part of machine learning: how to actually apply it in practice.
• We have so far talked about the mathematical foundation of machine learning (loss functions and optimization), and discussed some of the
conceptual issues surrounding overfitting, generalization, and the size of hypothesis classes. But what actually takes most of your time is not
writing new algorithms, but going through a development cycle, where you iteratively improve your system.
• The key is to stay connected with the data and the model, and have intuition about what’s going on. Make sure to empirically examine the
data before proceeding to the actual machine learning. It is imperative to understand the nature of your data in order to understand the
nature of your problem.
• First, maintain data hygiene. Hold out a test set from your data that you don’t look at until you’re done. Start by looking at the (training or
validation) data to get intuition. You can start to brainstorm what features / predictors you will need. You can compute some basic statistics.
• Then you enter a loop: implement a new model architecture or feature template. There are three things to look at: error rates, weights, and
predictions. First, sanity check the error rates and weights to make sure you don’t have an obvious bug. Then do an error analysis to see
which examples your predictor is actually getting wrong. The art of practical machine learning is turning these observations into new features.
• Finally, run your system once on the test set and report the number you get. If your test error is much higher than your validation error, then
you probably did too much tweaking and were overfitting (at a meta-level) the validation set.
Tips
Start simple:
• Run on small subsets of your data or synthetic data
• Start with a simple baseline model
• Sanity check: can you overfit 5 examples
Log everything:
• Track training loss and validation loss over time
• Record hyperparameters, statistics of data, model, and predictions
• Organize experiments (each run goes in a separate folder)
Report your results:
• Run each experiment multiple times with different random seeds
• Compute multiple metrics (e.g., error rates for minority groups)
• There is more to be said about the practice of machine learning. Here are some pieces of advice. Note that many of them are simply good
software engineering practices.
• First, don’t start out by coding up a large complex model and try running it on a million examples. Start simple, both with the data (small
number of examples) and the model (e.g., linear classifier). Sanity check that things are working first before increasing the complexity. This
will help you debug in a regime where things are more interpretable and also things run faster. One sanity check is to train a sufficiently
expressive model on a very few examples and see if the model can overfit the examples (get zero training error). This does not produce a
useful model, but is a diagnostic to see if the optimization is working. If you can’t overfit on 5 examples, then you have a problem: maybe
the hypothesis class is too small, the data is too noisy, or the optimization isn’t working.
• Second, log everything so you can diagnose problems. Monitor the losses over epochs. It is also important to track the training loss so that
if you get bad results, you can find out if it is due to bad optimization or overfitting. Record all the hyperparameters, so that you have a full
record of how to reproduce the results.
• Third, when you report your results, you should be able to run an experiment multiple times with different randomness to see how stable the
results are. Report error bars. And finally, if it makes sense for your application, report more than just a single test accuracy: report the
errors for minority groups and check whether your model is treating every group fairly.
Summary
Start simple!
Practice!
• To summarize, we’ve talked about the practice of machine learning.
• First, make sure you follow good data hygiene, separating out the test set and don’t look at it.
• But you should look at the training or validation set to get intuition about your data before you start.
• Then, start simple and make sure you understand how things are working.
• Beyond that, there are a lot of design decisions to be made (hyperparameters). So the most important thing is to practice, so that you can
start developing more intuition, and developing a set of best practices that works for you.
Roadmap
Backpropagation
K-means
Generalization
Best practices
Machine Learning Summary
• Feature extraction (think hypothesis classes) [modeling]
• We are not covering some other important aspects, e.g., fairness, privacy, interpretability
• This concludes our tour of the foundations of machine learning, although machine learning will come up again later in the course. You should
have gotten more than just a few isolated equations and algorithms. It is really important to think about the overarching principles in a
modular way.
• First, feature extraction is where you put in your domain knowledge. In designing features, it’s useful to think in terms of the induced
hypothesis classes — what kind of functions can your learning algorithm potentially learn?
• These features then drive prediction: either linearly or through a neural network. We can even think of k-means as trying to predict the data
points using the centroids.
• Loss functions connect predictions with the actual training examples.
• Note that all of the design decisions up to this point are about modeling. Algorithms are very important, but only come in once we have the
right optimization problem to solve.
• Finally, machine learning requires a leap of faith. How does optimizing anything at training time help you generalize to new unseen examples
at test time? Learning can only work when there’s a common core that cuts past all the idiosyncrasies of the examples. This is exactly what
features are meant to capture.
• Note that our lectures are a limited tour of machine learning, and do not cover some important aspects, e.g., fairness, privacy, and interpretability
in machine learning.
Machine learning
• If we generalize for a moment, machine learning is really about programs that can improve with experience.
• So far, we have only focused on reflex-based models where the program only outputs a yes/no or a number, and the experience is examples
of input-output pairs.
• Next time, we will start looking at models which can perform higher-level reasoning, but machine learning will remain our companion for the
remainder of the class.