
OH 3 Notes

• No writing of formulas.
• Mix of multiple choice and essay
• Basic math, no calculator needed
• 90 minutes

Review items
• Attraction basin - the set of points from which local search will converge to a particular local (or global) optimum
• Backpropagation - an algorithm for computing the gradient of the loss function with respect to all of
the weights. Backpropagation refers only to how the gradient is computed, not how the
gradient is used (that is the optimizer's job, e.g. gradient descent)
• Bagging (bootstrap aggregation) - taking several random subsets of the data, training a model on
each, and averaging their predictions
• Bayesian Network / Bayes Net / Graphical Models
o Each edge represents a dependence. If two nodes are not connected directly, they are
conditionally independent of each other given their parents
o We can always recreate the joint probability distribution: it's the product of the
conditional probabilities at every node, P(x_1, ..., x_n) = Π_i P(x_i | parents(x_i))
o Why do we sample?
▪ Probability of value
▪ generate values according to a distribution in order to simulate a process
▪ approximate inference
▪ Visualization; not always making a chart, but also viewing the data.
• Bayesian Learning - We want to learn the most probable (most likely) hypothesis given the data
and the domain knowledge that we bring.
o P(h | D): D stands for data, not distribution. We want argmax_{h ∈ H} P(h | D)
o Bayes rule: P(h | D) = P(D | h) · P(h) / P(D)
▪ The probability of the data D, given that a hypothesis is true, is essentially the
probability that the labels are correct given the data
o P(D) is a prior on the data
▪ It's often hard to know what the prior probability of the data is. We can
sometimes ignore this when we try to find the argmax.

o P(h) is a prior on the hypothesis - that a particular hypothesis in the hypothesis space is
likely or unlikely. What's interesting is that this prior is our domain knowledge.
o If we assume a uniform prior P(h):
▪ MAP - Maximum A Posteriori hypothesis: h_MAP = argmax_{h ∈ H} P(D | h) · P(h)
▪ Maximum Likelihood hypothesis: h_ML = argmax_{h ∈ H} P(D | h) (the MAP
hypothesis with P(h) dropped, since a uniform prior makes it a constant)
o The probability P(D | h) for a noise-free dataset is 1 if d_i = h(x_i) for every training
sample. If any disagree, then the probability is 0. (A brute-force MAP sketch appears
after this section.)
o Our bias for shorter decision trees is actually the prior: the thing that says smaller
trees are more likely
▪ The description length is minimized by minimizing the number of
misclassifications (the error) together with the size of the hypothesis
▪ You want (1) minimal error and (2) the simplest hypothesis; this trade-off is
known as minimum description length
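To make the MAP idea above concrete, here is a minimal brute-force sketch over a tiny finite hypothesis space; the threshold hypotheses, uniform prior, and noise-free likelihood are illustrative assumptions, not anything from the lectures:

```python
import numpy as np

# Brute-force MAP over a tiny finite hypothesis space (illustrative).
# Hypotheses are threshold functions h(x) = 1 if x >= t, for various t.

xs = np.array([1, 2, 3, 4, 5])
ds = np.array([0, 0, 1, 1, 1])                  # noise-free labels

hypotheses = {f"t={t}": (lambda x, t=t: (x >= t).astype(int))
              for t in range(7)}
prior = {name: 1 / len(hypotheses) for name in hypotheses}  # uniform P(h)

def likelihood(h):
    # Noise-free P(D | h): 1 if h matches every label, else 0.
    return 1.0 if np.all(h(xs) == ds) else 0.0

posterior = {n: likelihood(h) * prior[n] for n, h in hypotheses.items()}
h_map = max(posterior, key=posterior.get)
print(h_map, posterior[h_map])   # -> t=3, the only consistent hypothesis
```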
• Bias
o Inductive Bias -
▪ Induction - all of machine learning (certainly supervised learning) is about
induction: the process of going from specific examples to a more general
rule, i.e. generalization
o Restriction Bias - the hypothesis class/set that you actually care about. It tells you
something about the representational power of the hypothesis set that you're using
o Preference bias - what sorts of hypotheses from the hypothesis set we prefer: given
two hypotheses, why your learning algorithm would prefer one over the other
• Boosting - pick the hardest examples (lowest performance)
o combine with weighted mean
o We can define error as:
▪ # of mismatches
▪ the probability that our hypothesis disagrees with the true concept on some
instance, x
o How do we construct the new distribution?
▪ At every timestep we take the distribution at the previous timestep multiplied by
how well the current hypothesis does:
D_{t+1}(i) = D_t(i) · exp(-α_t · y_i · h_t(x_i)) / Z_t
• The term y_i · h_t(x_i) is +1 if they agree and -1 if they disagree
• α_t is always a positive number
• Thus, if they agree, you raise e to a negative number, making that weight smaller; if they
disagree you raise e to a positive number, making it larger
• This has the effect of decreasing the prevalence of instances that agree, though whether a
given weight actually shrinks also depends on the normalizer Z_t, i.e. what happened to the
other instances
o How do we combine the learners?
▪ The learners are combined with a weighted vote, weighted by how well each hypothesis
did: H(x) = sign(Σ_t α_t · h_t(x)). (A minimal AdaBoost sketch appears at the end of
this Boosting section.)
o Why don't boosters suffer from overfitting?
▪ We normally keep track of error; we should also keep track of confidence (which could be
variance, etc.)
▪ The final output of the boosted classifier is sign(Σ_t α_t · h_t(x))
• if you divide the sum by Σ_t α_t (the measures of how good each hypothesis
was), the sign is unchanged but the output is normalized to the range
-1 to +1
• As we create more hypotheses, this normalized output becomes
smoother, with a larger margin, and is thus less likely to overfit

o When do boosting algorithms overfit?


▪ If the underlying learners all overfit, then there is little that boosting can do
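A minimal AdaBoost sketch tying the pieces above together (distribution update, α, weighted vote). The decision stumps and toy 1-D dataset are illustrative:

```python
import numpy as np

# Minimal AdaBoost sketch with decision stumps on a toy 1-D dataset.
# Labels are in {-1, +1}; this is illustrative, not a reference implementation.

def stump_predict(x, thresh, sign):
    return sign * np.where(x > thresh, 1, -1)

def best_stump(x, y, D):
    # Exhaustively pick the stump with the lowest weighted error.
    best = None
    for thresh in np.unique(x):
        for sign in (+1, -1):
            err = np.sum(D * (stump_predict(x, thresh, sign) != y))
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    return best

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([-1, -1, +1, +1, -1, -1])   # not separable by one stump
D = np.ones(len(x)) / len(x)             # start with uniform weights
stumps, alphas = [], []

for t in range(5):
    err, thresh, sign = best_stump(x, y, D)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # confidence of h_t
    pred = stump_predict(x, thresh, sign)
    D = D * np.exp(-alpha * y * pred)    # shrink weights of correct examples
    D = D / D.sum()                      # renormalize (the Z_t term)
    stumps.append((thresh, sign)); alphas.append(alpha)

H = np.sign(sum(a * stump_predict(x, th, s)
                for a, (th, s) in zip(alphas, stumps)))
print("final predictions:", H, "truth:", y)
```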
• Candidate - A concept that you think may be the target concept.
• Computational Learning Theory
o Important items
▪ What is a learning problem
▪ Show specific algorithms work
▪ show these problems are fundamentally hard (maybe no algorithm of a
particular class can solve it)
o Learners with constraint queries
▪ For binary functions you may have to ask 2^k questions, where k is the
number of features (in the worst case you enumerate the truth table)
o Learners with mistake bounds
▪ Start by assuming each variable appears both positive and negated; given an input, compute the
output; if wrong, set to absent every positive literal that was 0 and every negated
literal that was 1
• You will never make more than k + 1 mistakes, where k is the number of
features (because you'll eliminate at least one literal per mistake)
o When nature chooses what samples to provide
▪ sample complexity
• Concept (function) - maps inputs to outputs. Takes instances and maps them to an output.
o Target Concept - Actual function we are trying to find. The answer.
• Conditional Independence
o X is conditionally independent of Y given Z if the probability distribution governing X is
independent of the value of Y given the value of Z

o Ordinary independence means the joint distribution of two variables equals the
product of their marginals: P(x, y) = P(x) · P(y)
▪ Conditional independence simply lets you know that the independence happens
when given some third variable
• Cross Validation - The goal is to generalize.
o Nothing we do on the training set actually makes sense unless we believe the training
set is representative of the actual data.
o We count on the data being IID
▪ Independently and identically distributed - all the data we collect will be from
the same source. They're all drawn from the same distribution.
▪ This is a fundamental assumption on many of the algorithms we work with
o We hold out a portion of the training set to be a stand-in for the test data.
o We train on multiple folds and average the validation error across all folds; the
model (or model complexity) with the lowest average error is the one to use (see the
sketch after this list).
o Training error improves as the model fits the data more closely, but cross-validation error falls
and then rises.
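A minimal k-fold cross-validation sketch in plain numpy, as referenced above; it assumes a model object exposing fit(X, y) and predict(X), which is an illustrative interface:

```python
import numpy as np

def k_fold_cv_error(model, X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))  # shuffle once
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                  # held-out fold
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model.fit(X[train], y[train])
        errors.append(np.mean(model.predict(X[val]) != y[val]))
    return np.mean(errors)   # average validation error across folds
```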

• Curse of dimensionality - as the number of features/dimensions grows, the amount of data we
need to generalize accurately grows exponentially
• Decision Tree
o Components
▪ Nodes
• Decision nodes - represent attributes
• Edges - possible values of attribute
▪ Leaves (answer)
o Expressiveness
▪ AND function is easily represented
▪ OR function is easily represented
▪ XOR function is easily represented
▪ n-OR : the size of the decision tree is linear
▪ n-XOR : Odd Parity - if an odd number of attributes are true, we output true.
Otherwise if an even number of attributes are true it is False. (0 counts as even)
• You need exponential number of nodes : 2^n where n is the number of
attributes
▪ As a truth table: n attributes give 2^n rows, and there are 2^(2^n) possible output
labelings (distinct boolean functions)
o ID3
▪ Information gain - a mathematical way to capture how much information is gained
by picking an attribute: the reduction in randomness. We want to maximize the
information gain (a small entropy/gain sketch follows this Decision Tree section)
• Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v): the entropy minus
the weighted average entropy of the subsets S_v produced by splitting on A
• Entropy(S) = -Σ_v p(v) · log2 p(v): the sum over all possible values you
might see, of the probability of seeing that value times the log of that
probability, times -1
o Inductive Bias of ID3


▪ Good splits at top
• shorter trees - comes naturally from doing good splits
▪ correct over incorrect
o When to stop expanding a tree?
▪ Cross validation: hold out a set to see if expanding reduces error
▪ Build the entire tree and prune (checking held-out error before and after each prune)
o Regression Decision Tree
▪ Split based on variance
▪ At the leaves, return the average or a local linear fit
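The entropy and information-gain formulas above, as a small sketch (toy categorical data, illustrative):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))   # -sum p log2 p

def information_gain(attribute, labels):
    gain = entropy(labels)
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Toy check: a perfectly informative attribute recovers all the entropy.
y = np.array([0, 0, 1, 1])
print(information_gain(np.array([0, 0, 1, 1]), y))  # 1.0 bit
print(information_gain(np.array([0, 1, 0, 1]), y))  # 0.0 bits
```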
• Dependency tree
o special case of bayesian network, where every node has exactly one parent
o n^2 parameters
▪ you could fall into a bit of overfitting because, in the case of independent
variables, many of the modeled dependencies don't mean anything
• Genetic algorithms
o population - individuals
o mutation - local search (tweak)
o cross-over : population holds information
▪ Types:
• One-point crossover - choose one position at which to split; each child takes one part
from each parent
• Assumptions : locality of the bits matter. It also assumes there
are sub-parts of the space that can be optimized independently.
• Uniform crossover - randomize which parent's data to take at each bit
position
o generations - iterations of improvements
o Algorithm:
▪ Generate an initial population
▪ Repeat until converge
• Compute fitness of all individuals in the population
• Select the most fit individuals (e.g. the top half, or selection with probability
weighted by fitness)
• weighted-probability selection gives more exploration
• Pair up the selected individuals and replace the least fit individuals with offspring
produced via crossover/mutation (a minimal sketch follows this list)
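A minimal genetic-algorithm sketch of the loop above, on the classic "one-max" toy problem (fitness = number of 1 bits); population size, mutation rate, and the other parameters are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
POP, BITS, GENS = 30, 20, 40

def fitness(pop):
    return pop.sum(axis=1)   # one-max: count the 1 bits

pop = rng.integers(0, 2, size=(POP, BITS))   # initial random population
for gen in range(GENS):
    keep = pop[np.argsort(fitness(pop))[POP // 2:]]   # keep the top half
    kids = []
    while len(kids) < POP - len(keep):
        a, b = keep[rng.integers(len(keep), size=2)]
        cut = rng.integers(1, BITS)                    # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(BITS) < 0.01                 # mutation: rare bit flips
        kids.append(np.where(flip, 1 - child, child))
    pop = np.vstack([keep, kids])

print("best fitness:", fitness(pop).max(), "of", BITS)
```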
• Gradient Descent - good for problems that are not linearly separable
o Neuron output values are not thresholded (a differentiable activation replaces the hard threshold).
o Take partial derivative with respect to each weight.
o converges to local optimum
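A minimal gradient-descent sketch: taking the partial derivative of a squared-error loss with respect to each weight of a linear model. The toy data and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 100)   # true w=3, b=0.5, plus noise

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = (w * x + b) - y
    w -= lr * 2 * np.mean(err * x)   # partial derivative of MSE w.r.t. w
    b -= lr * 2 * np.mean(err)       # partial derivative of MSE w.r.t. b

print(f"w≈{w:.2f} (true 3.0), b≈{b:.2f} (true 0.5)")
```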
• Haussler's Theorem (bound true error)
o All of the hypotheses can be categorized as having either high true error or low true error
o A hypothesis with true error greater than ε agrees with a random training example with
probability at most 1 - ε, so the chance it stays consistent with m examples is at most (1 - ε)^m
o If you know the size of your hypothesis space |H| and your ε and δ targets, then
m ≥ (1/ε)(ln |H| + ln(1/δ)) samples suffice (a worked example follows)
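Worked example of the sample bound above, with illustrative numbers:

```python
import math

# m >= (1/eps)(ln|H| + ln(1/delta)),
# with |H| = 1000 hypotheses, eps = 0.1, delta = 0.05 (illustrative numbers):
H_size, eps, delta = 1000, 0.1, 0.05
m = (1 / eps) * (math.log(H_size) + math.log(1 / delta))
print(math.ceil(m))   # -> 100 samples suffice for these targets
```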

• Hill climbing
o Random-restart hill climbing: basically just hill climbing repeated a constant number of
times from random starting points, keeping the best result (a sketch follows this item)
▪ Convergence
• One way to decide you're done is to count the number of times you
haven't improved on your last local optimum
• Another way would be to ensure that you're covering the space evenly
o Assumption (bias)
▪ you can make local improvements, and those local improvements add up to a
good local optimum. The fitness surface is relatively smooth over your state space,
so you can find the optimum by moving between neighbors
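A random-restart hill-climbing sketch, as referenced above; the bumpy 1-D objective and step size are illustrative:

```python
import numpy as np

def f(x):
    return -x**2 + 5 * np.sin(3 * x)   # several local optima

rng = np.random.default_rng(2)
best_x, best_f = None, -np.inf
for _ in range(10):                     # constant number of restarts
    x = rng.uniform(-5, 5)
    while True:                         # plain hill climbing
        nxt = max([x - 0.05, x + 0.05], key=f)   # best neighbor
        if f(nxt) <= f(x):
            break                       # local optimum reached
        x = nxt
    if f(x) > best_f:
        best_x, best_f = x, f(x)        # keep the best restart

print(f"best x≈{best_x:.2f}, f≈{best_f:.2f}")
```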
• Hypothesis Class - The set of all concepts that you're willing to entertain. All functions you're
willing to entertain. (could be all possible functions in the world)
• Hypothesis Spaces
o Syntactic hypothesis space - all of the hypotheses that you could possibly write.
o Semantic hypothesis space - the actually different functions that you can practically
represent. These are the meaningfully different ones
• Inference rules
o Marginalization: P(x) = Σ_y P(x, y)
o Chain rule: P(x, y) = P(y | x) · P(x)
o Bayes rule: P(x | y) = P(y | x) · P(x) / P(y)
• Information Theory - If we think of input and output vectors as probability density functions,
we can compare how similar they are: mutual information.
o Entropy - is any information contained at all? H(X) = -Σ_x P(x) · log2 P(x)
▪ When calculating, make sure to sum over all possible outcomes.
▪ We use log base 2.
o If a sequence is predictable or it has less uncertainty, then it has less information.
o Variable length encoding can give less expected bits per word/letter. A language which
can be expressed in variable length encoding has less information.
o Joint entropy - the randomness contained in two variables together:
H(X, Y) = -Σ_{x,y} P(x, y) · log2 P(x, y)
o Conditional entropy - a measure of the randomness of one variable given another
variable: H(Y | X) = -Σ_{x,y} P(x, y) · log2 P(y | x)
▪ if the two variables X and Y are independent, then the conditional entropy
simply becomes the entropy of that variable (H(Y | X) = H(Y)), and the joint entropy is simply
both added together (H(X, Y) = H(X) + H(Y))

o Mutual information - a measure of the reduction of randomness of a variable given
some other variable: I(X; Y) = H(Y) - H(Y | X)

o Kullback-Leibler Divergence (KL Divergence): D_KL(p ‖ q) = Σ_x p(x) · log2(p(x) / q(x))
▪ Measures the distance between two probability distributions
▪ Serves as a distance measure (though not a true metric: it is asymmetric)
• If we had a well-known distribution we modeled as p(x), we could
sample from q(x) and use this as a distance metric instead of least
squares (see the sketch after this section)
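A small sketch computing the quantities above from a joint probability table; the table values are illustrative:

```python
import numpy as np

pxy = np.array([[0.25, 0.25],
                [0.40, 0.10]])          # P(X, Y): rows = x, cols = y
px = pxy.sum(axis=1)                    # marginal P(X)
py = pxy.sum(axis=0)                    # marginal P(Y)

H = lambda p: -np.sum(p * np.log2(p))   # entropy of a distribution
H_xy = H(pxy.flatten())                 # joint entropy H(X, Y)
I = H(px) + H(py) - H_xy                # mutual information I(X; Y)

# KL divergence between the true joint and the independence assumption:
q = np.outer(px, py)                    # P(X)P(Y)
kl = np.sum(pxy * np.log2(pxy / q))     # equals I(X; Y) in this case
print(f"H(X,Y)={H_xy:.3f}, I(X;Y)={I:.3f}, KL={kl:.3f}")
```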

• Instances - input. Vectors of attributes that define whatever your input space is.
• Instance Based Learning - See kNN
• kNN
o Bias
▪ Preference Bias
• locality - near points are similar
• smoothness - averaging
• all features matter equally
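A minimal kNN sketch (majority vote over Euclidean neighbors); the toy clusters and k are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)     # distance to every point
    nearest = y_train[np.argsort(dists)[:k]]        # labels of the k closest
    return np.bincount(nearest).argmax()            # majority vote

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.5, 0.5])))   # -> 0 (near first cluster)
print(knn_predict(X, y, np.array([5.5, 5.5])))   # -> 1 (near second cluster)
```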
• Linearly Separable - True if there is a line / half-plane that separates the positive and negative
examples
• MIMIC
o attempts to directly model the distribution of good solutions, iteratively refine that
model, and convey structure
o Algorithm (a simplified sketch follows this MIMIC section):
▪ Generate samples consistent with our current probability distribution P^{θ_t}(x)
(start uniform)
▪ Set θ_{t+1} to the n-th percentile of fitness and retain only the samples at or
above it (the best / most fit examples)
▪ Estimate P^{θ_{t+1}}(x) from the retained samples
▪ Repeat
o Structure is hidden in how we represent probability distributions
o θ is slowly ramped up over iterations
o Estimating the distribution with a dependency tree
▪ The joint distribution is a product over the features, each depending only on
its parent: P(x) = Π_i P(x_i | parent(x_i))
• Because you're only ever conditioned on one parent, the conditional
probability tables stay very small: quadratic in the number of features.
• Dependency trees are nice because they let you model relationships
between features.
▪ How do we create the dependency tree from the samples?
• There's an underlying probability distribution that we want to model;
we want to find the best / closest probability distribution, measured by
KL divergence
o If the distributions are the same, the divergence is zero; as they
diverge the number gets larger. Divergence is unitless.
• Minimizing the KL divergence amounts to minimizing the entropy of each
feature given its parent, which is the same as maximizing the sum of
mutual information along the tree's edges
• So we want to find the subset of edges that forms a tree with the
highest total mutual information: a maximum spanning tree, found with
Prim's or Kruskal's algorithm
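A heavily simplified MIMIC-style sketch of the loop above, on the one-max problem. For brevity it models the distribution with independent Bernoullis instead of the dependency tree the real algorithm uses; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
BITS, N = 20, 200
p = np.full(BITS, 0.5)                      # start with a uniform distribution

for it in range(30):
    samples = (rng.random((N, BITS)) < p).astype(int)
    fit = samples.sum(axis=1)               # fitness = number of 1 bits
    theta = np.percentile(fit, 50)          # raise the fitness floor
    keep = samples[fit >= theta]            # retain only the most fit samples
    p = keep.mean(axis=0).clip(0.05, 0.95)  # re-estimate the distribution

print("P(bit=1) after fitting:", np.round(p, 2))  # most bits near the 0.95 cap
```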

• Neural Network
o Complexity
▪ You can add more nodes and more layers to a neural net to increase complexity. The
downside is that this gives you the ability to model noise and adds local minima
▪ Larger weight values also add complexity and the possibility of overfitting
o Restriction Bias
▪ A perceptron unit was limited to half-planes
▪ A neural net with nonlinear activation functions can model just about any function:
• boolean expressions can be represented
• continuous functions can be represented (with one hidden layer)
• arbitrary functions can be represented (with two hidden layers)
o Preference Bias
▪ We initialize the weights to small random values.
• We prefer smaller weights and prefer simpler explanations because we
won't allow our weights to grow large.
o Avoiding overfitting
▪ Use cross-validation
• Naïve Bayes
o Naïve = attributes are independent of one another
o The probability of a class value given a bunch of attributes is proportional to the
product of the probability of each attribute given the value, multiplied by the prior
probability of the value, divided by a normalization factor:
P(V | a_1, ..., a_n) = P(V) · Π_i P(a_i | V) / Z
o You could also find the MAP class using Naïve Bayes:
v_MAP = argmax_v P(v) · Π_i P(a_i | v)
o Why Naïve Bayes is useful:


▪ inference is cheap
▪ few parameters
▪ estimate parameters with labeled data
• typically count, for each class, how often each attribute value is seen
• In practice we typically smooth the counts to make sure that
every attribute value has been seen at least once
• this creates an inductive bias: all outcomes are
at least mildly possible
▪ connects inference and classification
▪ empirically successful
o Where Naïve Bayes breaks down
▪ no free lunch
▪ doesn't model interrelationships between attributes
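A minimal Naïve Bayes sketch for binary attributes, with the add-one smoothing mentioned above; the toy data is illustrative:

```python
import numpy as np

def train_nb(X, y):
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    # P(attribute i = 1 | class c), with +1/+2 smoothing so no count is 0
    likelihood = {c: (X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                  for c in classes}
    return classes, priors, likelihood

def predict_nb(model, x):
    classes, priors, likelihood = model
    def score(c):   # log P(c) + sum_i log P(a_i | c)
        p = likelihood[c]
        return np.log(priors[c]) + np.sum(np.log(np.where(x == 1, p, 1 - p)))
    return max(classes, key=score)   # the MAP class

X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([0, 0, 1, 1])
print(predict_nb(train_nb(X, y), np.array([1, 1, 0])))   # -> 0
```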
• PAC Learning
o Error of h
▪ training error - fraction of training examples misclassified by h
▪ true error - fraction of examples that would be misclassified on sample drawn
from D

o m ≥ (1/ε)(ln n + ln(1/δ)), where n is the size of the hypothesis space, ε the error
target, and δ the certainty target


• Perceptron - returns zero or one based on meeting a threshold.
o Perceptrons are always going to be linear functions and compute lines (half-planes).
o The thresholding of a perceptron is done with a bias term in practice.
o If the problem is linearly separable, we will find the solution in a finite number of
iterations.
▪ We don't and cannot know quantitatively what "finite" means
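A minimal perceptron training-rule sketch on the (linearly separable) AND function, with the threshold implemented as a bias term as noted above; learning rate and epoch count are illustrative:

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])              # the AND function
w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(25):                  # finite for separable problems
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b >= 0 else 0   # hard threshold via bias term
        w += lr * (yi - pred) * xi           # update only on mistakes
        b += lr * (yi - pred)

print([1 if xi @ w + b >= 0 else 0 for xi in X])   # -> [0, 0, 0, 1]
```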
• Pink noise - uniform noise.
• Regression
o We can minimize the sum of squares as our error/loss function. To do this we find the
derivative of our sum of squared error function.
o Ways training data can have errors
▪ Sensor error
▪ Maliciously (given bad data)
▪ transcription error
▪ unmodeled influence (noise in the real world). Housing data could have things
like: builder of the house, interest rates, etc.
• Sample - (Training set). For classification it would be an example of inputs paired with the
correct label.
• Sigmoid - activation function which has asymptotes at 0 and 1. It is differentiable
• Simulated Annealing
o Algorithm:
▪ Sample new point x in N(x)
▪ jump to new sample with probability given by acceptance probability function
P(x,x_t,T)
▪ Decrease temperature T
• If the new point is not an improvement, we evaluate exp((f(x_new) - f(x)) / T) and use
that as the probability of making the move. If the new point is much worse,
we raise e to a large negative number and the probability of
making the move is low (unless the temperature is high, which makes
the exponent a smaller negative number)

o Properties
▪ T → 0: behaves like hill climbing
▪ T → ∞: behaves like a random walk
▪ decrease T slowly to allow the algorithm to home in on the best local optimum
(a sketch follows)
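A minimal simulated-annealing sketch of the algorithm above; the objective, neighborhood, and cooling schedule are illustrative:

```python
import numpy as np

def f(x):
    return -x**2 + 5 * np.sin(3 * x)        # same bumpy 1-D objective

rng = np.random.default_rng(4)
x, T = rng.uniform(-5, 5), 2.0
for step in range(2000):
    x_new = x + rng.normal(0, 0.2)          # sample a neighbor
    # Accept improvements always; worse moves with probability e^(dE/T).
    if f(x_new) >= f(x) or rng.random() < np.exp((f(x_new) - f(x)) / T):
        x = x_new
    T = max(T * 0.995, 1e-3)                # decrease the temperature slowly

print(f"x≈{x:.2f}, f≈{f(x):.2f}")
```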

• Sum of Squares
o Residual - the difference between the observed value and the predicted value. Residuals
can be negative; this is why we square them
o When we try to minimize this loss function, we are trying to find the value where the
derivative of the residual squared function is zero.
• SVM - support vector machines
o Finding the optimal decision boundary is the same as finding the line that maximizes the
margin 2 / ||w||.
▪ Rather than maximizing 2 / ||w||, we can equivalently minimize (1/2) ||w||^2
• It is a quadratic programming problem.
• This is the same thing as maximizing the dual:
W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j),
subject to α_i ≥ 0 and Σ_i α_i y_i = 0
• which makes w the sum of the data times the labels times the
alphas: w = Σ_i α_i y_i x_i
• The alphas are mostly 0: much of the data does not
matter. Each data point is a vector, but you can
find the support you need by just using a
few of these vectors. The data points which
have a non-zero alpha are the support vectors (a small
demonstration follows this section).
o Are there any bad kernel functions?
▪ All kernel functions must satisfy the Mercer condition
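A small demonstration that the alphas are mostly zero, using scikit-learn's SVC (assuming the library is available); the data is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(f"{len(clf.support_vectors_)} of {len(X)} points are support vectors")
# Only the handful of points near the margin carry non-zero alpha.
```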
• Testing set - looks like a training set. We take the candidate concept and measure its performance
against the testing set. If you only trained and measured performance on your training set, you
would not have demonstrated the ability to generalize.
• Version Space
o Trying to learn from a training set S, from the hypothesis class H
▪ h is a candidate, and it is consistent if c(x) = h(x) for x in S.
• version space is the space of all hypotheses that are consistent
o The version space is epsilon-exhausted when every hypothesis remaining in it has true error
less than epsilon
• VC Dimensions
o What is the largest set on inputs that the hypothesis space can label in all possible
ways?
▪ Labeling in all possible ways = shatter
• the largest set of inputs is the VC dimension
o The VC dimension makes a statement about the amount of data that we need to learn.
▪ To show the VC dimension is at least d, you only need to exhibit a single set of d
points that can be shattered ("there exists" a set, "for all" labelings)
o The VC dimension is often the number of parameters (for a d-dimensional hyperplane, d + 1:
the weights for each of the dimensions plus θ, the threshold)
o The VC dimension of a finite H:
▪ If we say the VC dimension of H is some number d, there have to be at least 2^d
distinct concepts, because each of the 2^d different labelings must be achieved by a
different hypothesis; hence d ≤ log2 |H|

• H is PAC-learnable if and only if the VC dimension is finite
• Weak Learner - a learner that, no matter what the distribution is, will do better than chance (an
expected error < 1/2 - ε)
o If there exists a distribution where none of the hypotheses will do better than chance,
there is no weak learner (for a particular hypothesis space & instance set)
▪ if you have a lot of hypotheses that are bad at everything, it's going to be tough
to find a weak learner.
▪ If you have a lot of hypotheses that are good at everything, it's going to be easy
to find a weak learner.
• White noise - gaussian noise

To Review:
ID3
Bayes Nets; how to draw them? especially with conditional independence
Cross Validation
Information Theory - https://faculty.cc.gatech.edu/~isbell/tutorials/InfoTheory.fm.pdf
Boosting paper - https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/boosting-survey.pdf
https://storage.googleapis.com/supplemental_media/udacityu/367378584/Intro%20to%20Boosting.pdf
Maybe statquest boosting?
What is polynomial time?
Haussler's Theorem
No Free Lunch

Student notes:
https://github.com/mohamedameen93/CS-7641-Machine-Learning-Notes
