Fall 2022 Midterm Notes PDF
• No writing of formulas.
• Mix of multiple choice and essay
• Basic math, no calculator needed
• 90 minutes
Review items
• Attraction basin - the set of points from which local search will lead to a particular local (or global) optimum
• Backpropagation - Computes the gradient of the loss function with respect to all of the weights by applying the chain rule layer by layer. Backpropagation refers to the algorithm for computing the gradient, not to how the gradient is used (e.g., by gradient descent)
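A quick sketch of the distinction: the chain-rule gradient can be checked against numeric differentiation. (Toy one-hidden-unit network; all names and numbers are my own, not from the notes.)

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, y):
    h = sigmoid(w1 * x)        # hidden activation
    out = w2 * h               # linear output
    return 0.5 * (out - y) ** 2

def backprop_grads(w1, w2, x, y):
    # Forward pass
    h = sigmoid(w1 * x)
    out = w2 * h
    err = out - y
    # Backward pass: the chain rule gives the gradient w.r.t. every weight
    d_w2 = err * h
    d_h = err * w2
    d_w1 = d_h * h * (1 - h) * x   # sigmoid'(z) = h * (1 - h)
    return d_w1, d_w2

# Sanity-check the analytic gradient with a finite difference
w1, w2, x, y = 0.3, -0.5, 1.2, 0.7
g1, g2 = backprop_grads(w1, w2, x, y)
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
print(abs(g1 - n1) < 1e-6)  # analytic and numeric gradients agree
```

What you do with `d_w1`, `d_w2` afterwards (gradient descent, momentum, etc.) is a separate choice.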
• Bagging (bootstrap aggregation) - take several random subsets of the data (sampled with replacement), train a model on each, and average their predictions
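A minimal bagging sketch, where each "model" is just the mean of a bootstrap sample and the ensemble averages the models (toy data, my own example):

```python
import random

random.seed(0)
data = [2.0, 4.0, 6.0, 8.0, 10.0]

def bootstrap_sample(xs):
    # Random subset of the same size, drawn with replacement
    return [random.choice(xs) for _ in xs]

# Train one trivial "model" (a mean) per bootstrap sample
models = [sum(s) / len(s) for s in (bootstrap_sample(data) for _ in range(200))]
ensemble = sum(models) / len(models)   # aggregate by averaging
print(ensemble)                        # close to the overall mean, 6.0
```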
• Bayesian Network / Bayes Net / Graphical Models
o Each node represents a random variable; edges represent dependence. If two nodes are not connected directly, they are conditionally independent of each other (given their parents)
o We can always recreate the joint probability distribution: it's the product of each node's conditional probability given its parents
o Why do we sample?
▪ Probability of value
▪ generate values according to a distribution in order to simulate a process
▪ approximate inference
▪ Visualization; not always making a chart, but also viewing the data.
• Bayesian Learning - We want to learn the most probable (most likely) hypothesis given the data
and the domain knowledge that we bring.
o P(h | D) - D for data, not distribution. We want argmax_{h in H} P(h | D)
o Bayes rule
o P(h) is a prior on the hypothesis - that a particular hypothesis in the hypothesis space is
likely or unlikely. What's interesting is that this prior is our domain knowledge.
o MAP - Maximum A Posteriori hypothesis: h_MAP = argmax_{h in H} P(D | h) P(h)
▪ If we assume a uniform prior P(h), this reduces to the maximum likelihood hypothesis: h_ML = argmax_{h in H} P(D | h)
o The probability of P(D|h) for a noise free dataset is 1 if d_i = h(x_i) for every training
sample. If any disagree, then the probability is 0
o Our bias toward shorter decision trees is actually a prior - the thing that says smaller trees are more likely
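A tiny MAP sketch: pick argmax_h P(D|h) P(h) over a two-hypothesis space. (The hypotheses, priors, and likelihoods below are made up for illustration.)

```python
# P(h) encodes domain knowledge (e.g., a preference for smaller trees);
# P(D|h) is how well each hypothesis explains the data.
hypotheses = {
    "h_small": {"prior": 0.6, "likelihood": 0.2},
    "h_large": {"prior": 0.4, "likelihood": 0.5},
}

def h_map(hs):
    # P(h|D) is proportional to P(D|h) * P(h); the normalizer P(D)
    # is the same for every h, so it doesn't affect the argmax
    return max(hs, key=lambda h: hs[h]["likelihood"] * hs[h]["prior"])

print(h_map(hypotheses))  # h_large: 0.5*0.4 = 0.20 beats 0.2*0.6 = 0.12
```

With a uniform prior the `prior` factor drops out and this becomes the ML hypothesis.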
• Boosting
o If you divide the weighted vote by the sum of the alphas (each alpha measures how good a weak hypothesis was), the answer has the same sign but a normalized output, -1 to +1
o As we create more hypotheses, we end up with something smoother, with a larger margin, and thus less likely to overfit
o Normal independence means the joint distribution between two variables is equal to the product of their marginals: P(x,y) = P(x) * P(y)
▪ Conditional independence means the independence only holds when conditioned on some third variable: P(x,y | z) = P(x | z) * P(y | z)
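A quick check of P(x,y) = P(x) P(y) on a small joint table (the numbers are made up, chosen so the variables actually are independent):

```python
joint = {("a", 0): 0.12, ("a", 1): 0.18,
         ("b", 0): 0.28, ("b", 1): 0.42}

px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p   # marginalize over y
    py[y] = py.get(y, 0) + p   # marginalize over x

# Independence holds iff every cell equals the product of its marginals
independent = all(abs(joint[(x, y)] - px[x] * py[y]) < 1e-9
                  for (x, y) in joint)
print(independent)  # True for this table
```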
• Cross Validation - The goal is to generalize.
o Nothing we do on the training set actually makes sense unless we believe the training
set is representative of the actual data.
o We count on the data being IID
▪ Independently and identically distributed - all the data we collect will be from
the same source. They're all drawn from the same distribution.
▪ This is a fundamental assumption on many of the algorithms we work with
o We hold out a portion of the training set to be a stand-in for the test data.
o We'll train on multiple folds and average the error across all folds. The model with the lowest average cross-validation error is the one to use.
o Training error improves as it fits the data more closely, but cross validation error falls
then rises.
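The fold mechanics can be sketched in a few lines; here the "model" is just the mean of the training portion (toy data, my own example):

```python
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
k = 3

def kfold_error(xs, k):
    n = len(xs)
    fold_errs = []
    for f in range(k):
        # Hold out one fold as a stand-in for the test data
        test = xs[f * n // k:(f + 1) * n // k]
        train = xs[:f * n // k] + xs[(f + 1) * n // k:]
        model = sum(train) / len(train)        # "train" a mean predictor
        err = sum((x - model) ** 2 for x in test) / len(test)
        fold_errs.append(err)
    return sum(fold_errs) / k                  # average across all folds

print(kfold_error(data, k))  # average squared error across the 3 folds: 6.25
```

Repeating this for several candidate models and picking the lowest average error is the model-selection step described above.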
• Hill climbing
o Random Restart Hill climbing. Basically just hill climbing repeated from a constant number of random starting points; keep the best result found.
▪ Convergence
• One way to converge would be to count the number of times you
haven't done better than your last local optima
• Another way would be to ensure that you're covering the space evenly.
o Assumption (bias)
▪ You can make local improvements and those local improvements add up to a good local optimum. The fitness surface is relatively smooth over your state space, so you can find the optimum by moving between neighbors
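A sketch of random-restart hill climbing on a made-up 1-D fitness function with a local peak near x=2 and a global peak near x=8:

```python
import random

random.seed(1)

def f(x):
    # Two basins: peak at x=2 (value 0, local) and x=8 (value 5, global)
    return -min((x - 2) ** 2, (x - 8) ** 2 - 5)

def hill_climb(x, step=0.1, iters=200):
    for _ in range(iters):
        best = max((x - step, x, x + step), key=f)  # best neighbor
        if best == x:
            break          # local optimum: no neighbor improves
        x = best
    return x

# Restarts: climb from several random starting points, keep the best
best = max((hill_climb(random.uniform(0, 10)) for _ in range(10)), key=f)
print(round(best))  # restarts let us escape the x=2 attraction basin
```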
• Hypothesis Class - The set of all concepts that you're willing to entertain. All functions you're
willing to entertain. (could be all possible functions in the world)
• Hypothesis Spaces
o Syntactic hypothesis space - all of the hypotheses that you could possibly write.
o Semantic hypothesis space - the actually different functions you can practically represent. These are meaningfully different
• Inferencing rules
o marginalization
o Chain rule
o Bayes Rule
• Information Theory - If we think of input and output vectors as a probability density functions,
we can compare how similar they are: mutual information.
o Entropy - how much information (uncertainty) a distribution contains: H(X) = -Σ_x P(x) log2 P(x)
▪ When calculating, make sure to add the probability of all possible outcomes.
▪ We use log base 2.
o If a sequence is predictable or it has less uncertainty, then it has less information.
o Variable length encoding can give less expected bits per word/letter. A language which
can be expressed in variable length encoding has less information.
o Joint entropy - the randomness contained in two variables together
▪ H(X,Y) = -Σ_{x,y} P(x,y) log2 P(x,y)
o Conditional entropy - a measure of the randomness of one variable given another variable: H(Y|X) = -Σ_{x,y} P(x,y) log2 P(y|x)
▪ if the two variables x and y are independent, then the conditional entropy
simply becomes the entropy of that variable, and the joint entropy is simply
both added together
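These identities can be checked directly. A sketch with log base 2, summing over all outcomes; the joint table is made up but constructed so X and Y are independent:

```python
import math

def entropy(ps):
    # Sum over all possible outcomes; skip zero-probability terms
    return -sum(p * math.log2(p) for p in ps if p > 0)

joint = {("a", 0): 0.12, ("a", 1): 0.18,
         ("b", 0): 0.28, ("b", 1): 0.42}
px = [0.30, 0.70]   # marginal of X
py = [0.40, 0.60]   # marginal of Y

h_x = entropy(px)
h_y = entropy(py)
h_xy = entropy(joint.values())
h_y_given_x = h_xy - h_x         # chain rule: H(X,Y) = H(X) + H(Y|X)

# Independence: H(Y|X) = H(Y), and H(X,Y) = H(X) + H(Y)
print(abs(h_y_given_x - h_y) < 1e-9, abs(h_xy - (h_x + h_y)) < 1e-9)
```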
• Instances - input. Vectors of attributes that define whatever your input space is.
• Instance Based Learning - See kNN
• kNN
o Bias
▪ Preference Bias
• locality - near points are similar
• smoothness - averaging
• all features matter equally
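The locality and smoothness biases show up directly in the algorithm: find the k nearest points, then average. A minimal 1-D regression sketch (toy data, my own example):

```python
def knn_predict(train, query, k=3):
    # train: list of (x, y) pairs; distance is |x - query| (1-D for simplicity,
    # so "all features matter equally" is trivially true here)
    nearest = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    return sum(y for _, y in nearest) / k     # smoothness bias: average

train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 100.0)]
print(knn_predict(train, 1.1, k=3))  # nearest are x=1, 2, 0 → (1+2+0)/3 = 1.0
```

Note the distant outlier at x=10 has no effect: only local points vote.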
• Linearly Separable - True if there is a line / half-plane that separates the positive and negative
examples
• MIMIC
o attempts to directly model distribution, iteratively refine the model, and attempt to
convey structure
o Algorithm:
▪ Generate samples consistent with our probability distribution (start with
uniform)
▪ Set the threshold θ_{t+1} to the nth percentile of fitness (the best, most fit samples)
▪ Retain only the samples at or above that threshold
▪ Estimate the new distribution P_{θ_{t+1}}(x) from the retained samples
▪ repeat
o Structure is hidden in how we represent probability distributions
o Theta is slowly ramped up over iterations
o Estimating Distribution
▪ The joint distribution is a product over each of the features depending only on
its parent
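A heavily simplified MIMIC-style loop on OneMax (maximize the number of 1-bits). This sketch models each bit independently; real MIMIC fits a dependency tree so each feature depends on its parent, which this omits. All parameters are my own:

```python
import random

random.seed(2)
N_BITS, POP, KEEP = 8, 100, 0.5   # keep the top 50th percentile

def fitness(bits):
    return sum(bits)              # OneMax: count of ones

probs = [0.5] * N_BITS            # start from a uniform distribution
for _ in range(30):
    # Generate samples consistent with the current distribution
    pop = [[1 if random.random() < p else 0 for p in probs]
           for _ in range(POP)]
    pop.sort(key=fitness, reverse=True)
    best = pop[:int(POP * KEEP)]  # retain only the fittest samples
    # Re-estimate P(x) from the retained samples (marginal per bit)
    probs = [sum(b[i] for b in best) / len(best) for i in range(N_BITS)]

print(fitness(max(pop, key=fitness)))  # converges toward all ones
```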
• Neural Network
o Complexity -
▪ You can add more nodes and more layers to a nn to increase complexity. The
downside of this is it gives you the ability to model noise and adds local minima
▪ Larger weights also add complexity and the possibility of overfitting
o Restriction Bias
▪ Perceptron units were limited to half-planes
▪ nn with activation functions should be able to model just about any function.
• boolean expression can be represented
• continuous functions can be represented (with one hidden layer)
• arbitrary functions can be represented (with two hidden layers)
o Preference Bias
▪ We initialize the weights to small random values.
• We prefer smaller weights and prefer simpler explanations because we
won't allow our weights to grow large.
o Avoiding overfitting
▪ Use cross-validation
• Naïve Bayes
o Naïve = attributes are independent of one another
o The probability of a value given a bunch of attributes is equal to the product of the probability of each attribute given the value, multiplied by the probability of the value, and divided by a normalization factor: P(v | a_1..a_n) = P(v) Π_i P(a_i | v) / Z
o You could also find the MAP class of V using Naïve Bayes
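A minimal sketch of finding the MAP class: multiply the prior by the per-attribute conditionals and take the argmax (the normalizer doesn't matter for the argmax). All probabilities below are made up:

```python
priors = {"spam": 0.4, "ham": 0.6}
cond = {  # P(word | class)
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham":  {"offer": 0.2, "meeting": 0.6},
}

def map_class(words):
    def score(v):
        p = priors[v]
        for w in words:
            p *= cond[v][w]       # naïve: attributes independent given v
        return p                  # unnormalized, fine for the argmax
    return max(priors, key=score)

print(map_class(["offer"]))    # spam: 0.4*0.7 = 0.28 beats 0.6*0.2 = 0.12
print(map_class(["meeting"]))  # ham:  0.6*0.6 = 0.36 beats 0.4*0.1 = 0.04
```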
• Simulated Annealing
o Properties of the temperature T:
▪ T = 0 acts like hill climbing
▪ T = inf acts like a random walk
▪ Decrease T slowly to let the algorithm home in on the best local optimum
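The temperature properties above (T = 0 like hill climbing, T = inf like a random walk, decrease T slowly) describe simulated annealing; a minimal sketch on a made-up single-peak function:

```python
import math
import random

random.seed(3)

def f(x):
    return -(x - 5) ** 2          # single peak at x = 5

x, T = 0.0, 10.0
while T > 1e-3:
    nxt = x + random.uniform(-1, 1)
    # Always accept improvements; accept worse moves with probability
    # e^{ΔE/T}, which goes to 0 as T cools (hill climbing) and to 1 as
    # T grows (random walk)
    if f(nxt) > f(x) or random.random() < math.exp((f(nxt) - f(x)) / T):
        x = nxt
    T *= 0.99                     # decrease T slowly
print(abs(x - 5) < 1.0)           # homed in near the optimum
```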
• Sum of Squares
o Residual - the difference between an observed value and a predicted value. Residuals can be negative, which is why we square them.
o When we try to minimize this loss function, we are trying to find the value where the
derivative of the residual squared function is zero.
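For a constant prediction, setting the derivative of the squared-residual sum to zero gives the mean; a quick check (toy data, my own example):

```python
ys = [2.0, 3.0, 7.0]

def sse(pred):
    # Squaring keeps every residual's contribution non-negative
    return sum((y - pred) ** 2 for y in ys)

# d/dpred SSE = -2 Σ (y - pred) = 0  →  pred = mean of the ys
mean = sum(ys) / len(ys)
print(sse(mean) < sse(mean + 0.1) and sse(mean) < sse(mean - 0.1))
```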
• SVM - support vector machines
o Finding the optimal decision boundary is the same as finding a line that maximizes the
w.
▪ This is done by calculating:
To Review:
ID3
Bayes Nets; how to draw them? especially with conditional independence
Cross Validation
Information Theory - https://faculty.cc.gatech.edu/~isbell/tutorials/InfoTheory.fm.pdf
Boosting paper - https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/boosting-survey.pdf
https://storage.googleapis.com/supplemental_media/udacityu/367378584/Intro%20to%20Boosting.pdf
Maybe statquest boosting?
What is polynomial time?
Haussler's Theorem
No Free Lunch
Student notes:
https://github.com/mohamedameen93/CS-7641-Machine-Learning-Notes