
OH 3 Notes

• No writing of formulas.
• Mix of multiple choice and essay
• Basic math, no calculator needed
• 90 minutes

Review items
• Attraction basin - the set of points from which local search will converge to a particular local (or global) optimum
• Backpropagation - an algorithm for computing the gradient of the loss function with respect to all of
the weights. Backpropagation refers only to how the gradient is computed, not how the
gradient is used (that is the optimizer's job, e.g. gradient descent)
• Bagging (bootstrap aggregation) - taking several random subsets of the data, training a model on
each, and averaging their predictions
• Bayesian Network / Bayes Net / Graphical Models
o Each edge represents a dependence. If two nodes are not connected directly, they are
conditionally independent of each other given their parents
o We can always recreate the joint probability distribution: it's the product of the
conditional probabilities at every node, P(x_1, ..., x_n) = Π_i P(x_i | parents(x_i))
o Why do we sample?
▪ Probability of value
▪ generate values according to a distribution in order to simulate a process
▪ approximate inference
▪ Visualization; not always making a chart, but also viewing the data.
• Bayesian Learning - We want to learn the most probable (most likely) hypothesis given the data
and the domain knowledge that we bring.
o P(h | D): D stands for data, not distribution. We want argmax_{h ∈ H} P(h | D)
o Bayes rule: P(h | D) = P(D | h) · P(h) / P(D)
▪ The probability of the data D, given that a hypothesis is true, is essentially the
probability that the labels are correct given the data
o P(D) is a prior on the data
▪ It's often hard to know what the prior probability of the data is. We can
sometimes ignore this when we try to find the argmax.

o P(h) is a prior on the hypothesis - that a particular hypothesis in the hypothesis space is
likely or unlikely. What's interesting is that this prior is our domain knowledge.
o If we assume a uniform prior P(h):
▪ MAP - Maximum A Posteriori hypothesis: h_MAP = argmax_{h ∈ H} P(D | h) · P(h)
▪ Maximum Likelihood hypothesis: h_ML = argmax_{h ∈ H} P(D | h) (the MAP
hypothesis with P(h) dropped, since a uniform prior makes it a constant)
o The probability P(D | h) for a noise-free dataset is 1 if d_i = h(x_i) for every training
sample. If any disagree, then the probability is 0. (A brute-force MAP sketch appears
after this section.)
o Our bias for shorter decision trees is actually the prior: the thing that says smaller
trees are more likely
▪ The description length is minimized by minimizing the number of
misclassifications (the error) together with the size of the hypothesis
▪ You want (1) minimal error and (2) the simplest hypothesis; this trade-off is
known as minimum description length
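To make the MAP idea above concrete, here is a minimal brute-force sketch over a tiny finite hypothesis space; the threshold hypotheses, uniform prior, and noise-free likelihood are illustrative assumptions, not anything from the lectures:

```python
import numpy as np

# Brute-force MAP over a tiny finite hypothesis space (illustrative).
# Hypotheses are threshold functions h(x) = 1 if x >= t, for various t.

xs = np.array([1, 2, 3, 4, 5])
ds = np.array([0, 0, 1, 1, 1])                  # noise-free labels

hypotheses = {f"t={t}": (lambda x, t=t: (x >= t).astype(int))
              for t in range(7)}
prior = {name: 1 / len(hypotheses) for name in hypotheses}  # uniform P(h)

def likelihood(h):
    # Noise-free P(D | h): 1 if h matches every label, else 0.
    return 1.0 if np.all(h(xs) == ds) else 0.0

posterior = {n: likelihood(h) * prior[n] for n, h in hypotheses.items()}
h_map = max(posterior, key=posterior.get)
print(h_map, posterior[h_map])   # -> t=3, the only consistent hypothesis
```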
• Bias
o Inductive Bias -
▪ Induction - all of machine learning (certainly supervised learning) is about
induction: the process of going from specific examples to a more general
rule, i.e. generalization
o Restriction Bias - the hypothesis class/set that you actually care about. It tells you
something about the representational power of the hypothesis set that you're using
o Preference bias - what sorts of hypotheses from the hypothesis set we prefer: given
two hypotheses, why your learning algorithm would prefer one over the other
• Boosting - pick the hardest examples (lowest performance)
o combine with weighted mean
o We can define error as:
▪ # of mismatches
▪ the probability that our hypothesis disagrees with the true concept on some
instance, x
o How do we construct the new distribution?
▪ At every timestep we take the distribution at the previous timestep multiplied by
how well the current hypothesis does:
D_{t+1}(i) = D_t(i) · exp(-α_t · y_i · h_t(x_i)) / Z_t
• The term y_i · h_t(x_i) is +1 if they agree and -1 if they disagree
• α_t is always a positive number
• Thus, if they agree, you raise e to a negative number, making that weight smaller; if they
disagree you raise e to a positive number, making it larger
• This has the effect of decreasing the prevalence of instances that agree, though whether a
given weight actually shrinks also depends on the normalizer Z_t, i.e. what happened to the
other instances
o How do we combine the learners?
▪ The learners are combined with a weighted vote, weighted by how well each hypothesis
did: H(x) = sign(Σ_t α_t · h_t(x)). (A minimal AdaBoost sketch appears at the end of
this Boosting section.)
o Why don't boosters suffer from overfitting?
▪ We normally keep track of error; we should also keep track of confidence (which could be
variance, etc.)
▪ The final output of the boosted classifier is sign(Σ_t α_t · h_t(x))
• if you divide the sum by Σ_t α_t (the measures of how good each hypothesis
was), the sign is unchanged but the output is normalized to the range
-1 to +1
• As we create more hypotheses, this normalized output becomes
smoother, with a larger margin, and is thus less likely to overfit

o When do boosting algorithms overfit?


▪ If the underlying learners all overfit, then there is little that boosting can do
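A minimal AdaBoost sketch tying the pieces above together (distribution update, α, weighted vote). The decision stumps and toy 1-D dataset are illustrative:

```python
import numpy as np

# Minimal AdaBoost sketch with decision stumps on a toy 1-D dataset.
# Labels are in {-1, +1}; this is illustrative, not a reference implementation.

def stump_predict(x, thresh, sign):
    return sign * np.where(x > thresh, 1, -1)

def best_stump(x, y, D):
    # Exhaustively pick the stump with the lowest weighted error.
    best = None
    for thresh in np.unique(x):
        for sign in (+1, -1):
            err = np.sum(D * (stump_predict(x, thresh, sign) != y))
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    return best

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([-1, -1, +1, +1, -1, -1])   # not separable by one stump
D = np.ones(len(x)) / len(x)             # start with uniform weights
stumps, alphas = [], []

for t in range(5):
    err, thresh, sign = best_stump(x, y, D)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # confidence of h_t
    pred = stump_predict(x, thresh, sign)
    D = D * np.exp(-alpha * y * pred)    # shrink weights of correct examples
    D = D / D.sum()                      # renormalize (the Z_t term)
    stumps.append((thresh, sign)); alphas.append(alpha)

H = np.sign(sum(a * stump_predict(x, th, s)
                for a, (th, s) in zip(alphas, stumps)))
print("final predictions:", H, "truth:", y)
```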
• Candidate - A concept that you think may be the target concept.
• Computational Learning Theory
o Important items
▪ What is a learning problem
▪ Show specific algorithms work
▪ show these problems are fundamentally hard (maybe no algorithm of a
particular class can solve it)
o Learners with constraint queries
▪ For binary functions you may have to ask 2^k questions, where k is the
number of features (in the worst case you enumerate the truth table)
o Learners with mistake bounds
▪ Start by assuming each variable appears both positive and negated; given an input, compute the
output; if wrong, set to absent every positive literal that was 0 and every negated
literal that was 1
• You will never make more than k + 1 mistakes, where k is the number of
features (because you'll eliminate at least one literal per mistake)
o When nature chooses what samples to provide
▪ sample complexity
• Concept (function) - maps inputs to outputs. Takes instances and maps them to an output.
o Target Concept - Actual function we are trying to find. The answer.
• Conditional Independence
o X is conditionally independent of Y given Z if the probability distribution governing X is
independent of the value of Y given the value of Z

o Ordinary independence means the joint distribution of two variables equals the
product of their marginals: P(x, y) = P(x) · P(y)
▪ Conditional independence simply lets you know that the independence happens
when given some third variable
• Cross Validation - The goal is to generalize.
o Nothing we do on the training set actually makes sense unless we believe the training
set is representative of the actual data.
o We count on the data being IID
▪ Independently and identically distributed - all the data we collect will be from
the same source. They're all drawn from the same distribution.
▪ This is a fundamental assumption on many of the algorithms we work with
o We hold out a portion of the training set to be a stand-in for the test data.
o We train on multiple folds and average the validation error across all folds; the
model (or model complexity) with the lowest average error is the one to use (see the
sketch after this list).
o Training error improves as the model fits the data more closely, but cross-validation error falls
and then rises.
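A minimal k-fold cross-validation sketch in plain numpy, as referenced above; it assumes a model object exposing fit(X, y) and predict(X), which is an illustrative interface:

```python
import numpy as np

def k_fold_cv_error(model, X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))  # shuffle once
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                  # held-out fold
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model.fit(X[train], y[train])
        errors.append(np.mean(model.predict(X[val]) != y[val]))
    return np.mean(errors)   # average validation error across folds
```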

• Curse of dimensionality - as the number of features/dimensions grows, the amount of data we
need to generalize accurately grows exponentially
• Decision Tree
o Components
▪ Nodes
• Decision nodes - represent attributes
• Edges - possible values of attribute
▪ Leaves (answer)
o Expressiveness
▪ AND function is easily represented
▪ OR function is easily represented
▪ XOR function is easily represented
▪ n-OR : the size of the decision tree is linear
▪ n-XOR : Odd Parity - if an odd number of attributes are true, we output true.
Otherwise if an even number of attributes are true it is False. (0 counts as even)
• You need exponential number of nodes : 2^n where n is the number of
attributes
▪ As a truth table: n attributes give 2^n rows, and there are 2^(2^n) possible output
labelings (distinct boolean functions)
o ID3
▪ Information gain - a mathematical way to capture how much information is gained
by picking an attribute: the reduction in randomness. We want to maximize the
information gain (a small entropy/gain sketch follows this Decision Tree section)
• Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v): the entropy minus
the weighted average entropy of the subsets S_v produced by splitting on A
• Entropy(S) = -Σ_v p(v) · log2 p(v): the sum over all possible values you
might see, of the probability of seeing that value times the log of that
probability, times -1
o Inductive Bias of ID3


▪ Good splits at top
• shorter trees - comes naturally from doing good splits
▪ correct over incorrect
o When to stop expanding a tree?
▪ Cross validation: hold out a set to see if expanding reduces error
▪ Build the entire tree and prune (checking held-out error before and after each prune)
o Regression Decision Tree
▪ Split based on variance
▪ At the leaves, return the average or a local linear fit
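The entropy and information-gain formulas above, as a small sketch (toy categorical data, illustrative):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))   # -sum p log2 p

def information_gain(attribute, labels):
    gain = entropy(labels)
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Toy check: a perfectly informative attribute recovers all the entropy.
y = np.array([0, 0, 1, 1])
print(information_gain(np.array([0, 0, 1, 1]), y))  # 1.0 bit
print(information_gain(np.array([0, 1, 0, 1]), y))  # 0.0 bits
```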
• Dependency tree
o special case of bayesian network, where every node has exactly one parent
o n^2 parameters
▪ you could fall into a bit of overfitting because, in the case of independent
variables, many of the modeled dependencies don't mean anything
• Genetic algorithms
o population - individuals
o mutation - local search (tweak)
o cross-over : population holds information
▪ Types:
• One-point crossover - choose one position at which to split; each child takes one part
from each parent
• Assumptions : locality of the bits matter. It also assumes there
are sub-parts of the space that can be optimized independently.
• Uniform crossover - randomize which parent's data to take at each bit
position
o generations - iterations of improvements
o Algorithm:
▪ Generate an initial population
▪ Repeat until converge
• Compute fitness of all individuals in the population
• Select the most fit individuals (e.g. the top half, or selection with probability
weighted by fitness)
• weighted-probability selection gives more exploration
• Pair up the selected individuals and replace the least fit individuals with offspring
produced via crossover/mutation (a minimal sketch follows this list)
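A minimal genetic-algorithm sketch of the loop above, on the classic "one-max" toy problem (fitness = number of 1 bits); population size, mutation rate, and the other parameters are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
POP, BITS, GENS = 30, 20, 40

def fitness(pop):
    return pop.sum(axis=1)   # one-max: count the 1 bits

pop = rng.integers(0, 2, size=(POP, BITS))   # initial random population
for gen in range(GENS):
    keep = pop[np.argsort(fitness(pop))[POP // 2:]]   # keep the top half
    kids = []
    while len(kids) < POP - len(keep):
        a, b = keep[rng.integers(len(keep), size=2)]
        cut = rng.integers(1, BITS)                    # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(BITS) < 0.01                 # mutation: rare bit flips
        kids.append(np.where(flip, 1 - child, child))
    pop = np.vstack([keep, kids])

print("best fitness:", fitness(pop).max(), "of", BITS)
```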
• Gradient Descent - good for problems that are not linearly separable
o Neuron output values are not thresholded (a differentiable activation replaces the hard threshold).
o Take partial derivative with respect to each weight.
o converges to local optimum
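A minimal gradient-descent sketch: taking the partial derivative of a squared-error loss with respect to each weight of a linear model. The toy data and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 100)   # true w=3, b=0.5, plus noise

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = (w * x + b) - y
    w -= lr * 2 * np.mean(err * x)   # partial derivative of MSE w.r.t. w
    b -= lr * 2 * np.mean(err)       # partial derivative of MSE w.r.t. b

print(f"w≈{w:.2f} (true 3.0), b≈{b:.2f} (true 0.5)")
```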
• Haussler's Theorem (bound true error)
o All of the hypotheses can be categorized as having either high true error or low true error
o A hypothesis with true error greater than ε agrees with a random training example with
probability at most 1 - ε, so the chance it stays consistent with m examples is at most (1 - ε)^m
o If you know the size of your hypothesis space |H| and your ε and δ targets, then
m ≥ (1/ε)(ln |H| + ln(1/δ)) samples suffice (a worked example follows)
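Worked example of the sample bound above, with illustrative numbers:

```python
import math

# m >= (1/eps)(ln|H| + ln(1/delta)),
# with |H| = 1000 hypotheses, eps = 0.1, delta = 0.05 (illustrative numbers):
H_size, eps, delta = 1000, 0.1, 0.05
m = (1 / eps) * (math.log(H_size) + math.log(1 / delta))
print(math.ceil(m))   # -> 100 samples suffice for these targets
```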

• Hill climbing
o Random-restart hill climbing: basically just hill climbing repeated a constant number of
times from random starting points, keeping the best result (a sketch follows this item)
▪ Convergence
• One way to decide you're done is to count the number of times you
haven't improved on your last local optimum
• Another way would be to ensure that you're covering the space evenly
o Assumption (bias)
▪ you can make local improvements, and those local improvements add up to a
good local optimum. The fitness surface is relatively smooth over your state space,
so you can find the optimum by moving between neighbors
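A random-restart hill-climbing sketch, as referenced above; the bumpy 1-D objective and step size are illustrative:

```python
import numpy as np

def f(x):
    return -x**2 + 5 * np.sin(3 * x)   # several local optima

rng = np.random.default_rng(2)
best_x, best_f = None, -np.inf
for _ in range(10):                     # constant number of restarts
    x = rng.uniform(-5, 5)
    while True:                         # plain hill climbing
        nxt = max([x - 0.05, x + 0.05], key=f)   # best neighbor
        if f(nxt) <= f(x):
            break                       # local optimum reached
        x = nxt
    if f(x) > best_f:
        best_x, best_f = x, f(x)        # keep the best restart

print(f"best x≈{best_x:.2f}, f≈{best_f:.2f}")
```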
• Hypothesis Class - The set of all concepts that you're willing to entertain. All functions you're
willing to entertain. (could be all possible functions in the world)
• Hypothesis Spaces
o Syntactic hypothesis space - all of the hypotheses that you could possibly write.
o Semantic hypothesis space - the actually different functions that you can practically
represent. These are the meaningfully different ones
• Inference rules
o Marginalization: P(x) = Σ_y P(x, y)
o Chain rule: P(x, y) = P(y | x) · P(x)
o Bayes rule: P(x | y) = P(y | x) · P(x) / P(y)
• Information Theory - If we think of input and output vectors as probability density functions,
we can compare how similar they are: mutual information.
o Entropy - is any information contained at all? H(X) = -Σ_x P(x) · log2 P(x)
▪ When calculating, make sure to sum over all possible outcomes.
▪ We use log base 2.
o If a sequence is predictable or it has less uncertainty, then it has less information.
o Variable length encoding can give less expected bits per word/letter. A language which
can be expressed in variable length encoding has less information.
o Joint entropy - the randomness contained in two variables together:
H(X, Y) = -Σ_{x,y} P(x, y) · log2 P(x, y)
o Conditional entropy - a measure of the randomness of one variable given another
variable: H(Y | X) = -Σ_{x,y} P(x, y) · log2 P(y | x)
▪ if the two variables X and Y are independent, then the conditional entropy
simply becomes the entropy of that variable (H(Y | X) = H(Y)), and the joint entropy is simply
both added together (H(X, Y) = H(X) + H(Y))

o Mutual information - a measure of the reduction of randomness of a variable given
some other variable: I(X; Y) = H(Y) - H(Y | X)

o Kullback-Leibler Divergence (KL Divergence): D_KL(p ‖ q) = Σ_x p(x) · log2(p(x) / q(x))
▪ Measures the distance between two probability distributions
▪ Serves as a distance measure (though not a true metric: it is asymmetric)
• If we had a well-known distribution we modeled as p(x), we could
sample from q(x) and use this as a distance metric instead of least
squares (see the sketch after this section)
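A small sketch computing the quantities above from a joint probability table; the table values are illustrative:

```python
import numpy as np

pxy = np.array([[0.25, 0.25],
                [0.40, 0.10]])          # P(X, Y): rows = x, cols = y
px = pxy.sum(axis=1)                    # marginal P(X)
py = pxy.sum(axis=0)                    # marginal P(Y)

H = lambda p: -np.sum(p * np.log2(p))   # entropy of a distribution
H_xy = H(pxy.flatten())                 # joint entropy H(X, Y)
I = H(px) + H(py) - H_xy                # mutual information I(X; Y)

# KL divergence between the true joint and the independence assumption:
q = np.outer(px, py)                    # P(X)P(Y)
kl = np.sum(pxy * np.log2(pxy / q))     # equals I(X; Y) in this case
print(f"H(X,Y)={H_xy:.3f}, I(X;Y)={I:.3f}, KL={kl:.3f}")
```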

• Instances - input. Vectors of attributes that define whatever your input space is.
• Instance Based Learning - See kNN
• kNN
o Bias
▪ Preference Bias
• locality - near points are similar
• smoothness - averaging
• all features matter equally
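A minimal kNN sketch (majority vote over Euclidean neighbors); the toy clusters and k are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)     # distance to every point
    nearest = y_train[np.argsort(dists)[:k]]        # labels of the k closest
    return np.bincount(nearest).argmax()            # majority vote

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.5, 0.5])))   # -> 0 (near first cluster)
print(knn_predict(X, y, np.array([5.5, 5.5])))   # -> 1 (near second cluster)
```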
• Linearly Separable - True if there is a line / half-plane that separates the positive and negative
examples
• MIMIC
o attempts to directly model the distribution of good solutions, iteratively refine that
model, and convey structure
o Algorithm (a simplified sketch follows this MIMIC section):
▪ Generate samples consistent with our current probability distribution P^{θ_t}(x)
(start uniform)
▪ Set θ_{t+1} to the n-th percentile of fitness and retain only the samples at or
above it (the best / most fit examples)
▪ Estimate P^{θ_{t+1}}(x) from the retained samples
▪ Repeat
o Structure is hidden in how we represent probability distributions
o θ is slowly ramped up over iterations
o Estimating the distribution with a dependency tree
▪ The joint distribution is a product over the features, each depending only on
its parent: P(x) = Π_i P(x_i | parent(x_i))
• Because you're only ever conditioned on one parent, the conditional
probability tables stay very small: quadratic in the number of features.
• Dependency trees are nice because they let you model relationships
between features.
▪ How do we create the dependency tree from the samples?
• There's an underlying probability distribution that we want to model;
we want to find the best / closest probability distribution, measured by
KL divergence
o If the distributions are the same, the divergence is zero; as they
diverge the number gets larger. Divergence is unitless.
• Minimizing the KL divergence amounts to minimizing the entropy of each
feature given its parent, which is the same as maximizing the sum of
mutual information along the tree's edges
• So we want to find the subset of edges that forms a tree with the
highest total mutual information: a maximum spanning tree, found with
Prim's or Kruskal's algorithm
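A heavily simplified MIMIC-style sketch of the loop above, on the one-max problem. For brevity it models the distribution with independent Bernoullis instead of the dependency tree the real algorithm uses; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
BITS, N = 20, 200
p = np.full(BITS, 0.5)                      # start with a uniform distribution

for it in range(30):
    samples = (rng.random((N, BITS)) < p).astype(int)
    fit = samples.sum(axis=1)               # fitness = number of 1 bits
    theta = np.percentile(fit, 50)          # raise the fitness floor
    keep = samples[fit >= theta]            # retain only the most fit samples
    p = keep.mean(axis=0).clip(0.05, 0.95)  # re-estimate the distribution

print("P(bit=1) after fitting:", np.round(p, 2))  # most bits near the 0.95 cap
```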

• Neural Network
o Complexity
▪ You can add more nodes and more layers to a neural net to increase complexity. The
downside is that this gives you the ability to model noise and adds local minima
▪ Larger weight values also add complexity and the possibility of overfitting
o Restriction Bias
▪ A perceptron unit was limited to half-planes
▪ A neural net with nonlinear activation functions can model just about any function:
• boolean expressions can be represented
• continuous functions can be represented (with one hidden layer)
• arbitrary functions can be represented (with two hidden layers)
o Preference Bias
▪ We initialize the weights to small random values.
• We prefer smaller weights and prefer simpler explanations because we
won't allow our weights to grow large.
o Avoiding overfitting
▪ Use cross-validation
• Naïve Bayes
o Naïve = attributes are independent of one another
o The probability of a class value given a bunch of attributes is proportional to the
product of the probability of each attribute given the value, multiplied by the prior
probability of the value, divided by a normalization factor:
P(V | a_1, ..., a_n) = P(V) · Π_i P(a_i | V) / Z
o You could also find the MAP class using Naïve Bayes:
v_MAP = argmax_v P(v) · Π_i P(a_i | v)
o Why Naïve Bayes is useful:


▪ inference is cheap
▪ few parameters
▪ estimate parameters with labeled data
• typically count, for each class, how often each attribute value is seen
• In practice we typically smooth the counts to make sure that
every attribute value has been seen at least once
• this creates an inductive bias: all outcomes are
at least mildly possible
▪ connects inference and classification
▪ empirically successful
o Where Naïve Bayes breaks down
▪ no free lunch
▪ doesn't model interrelationships between attributes
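A minimal Naïve Bayes sketch for binary attributes, with the add-one smoothing mentioned above; the toy data is illustrative:

```python
import numpy as np

def train_nb(X, y):
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    # P(attribute i = 1 | class c), with +1/+2 smoothing so no count is 0
    likelihood = {c: (X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                  for c in classes}
    return classes, priors, likelihood

def predict_nb(model, x):
    classes, priors, likelihood = model
    def score(c):   # log P(c) + sum_i log P(a_i | c)
        p = likelihood[c]
        return np.log(priors[c]) + np.sum(np.log(np.where(x == 1, p, 1 - p)))
    return max(classes, key=score)   # the MAP class

X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([0, 0, 1, 1])
print(predict_nb(train_nb(X, y), np.array([1, 1, 0])))   # -> 0
```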
• PAC Learning
o Error of h
▪ training error - fraction of training examples misclassified by h
▪ true error - fraction of examples that would be misclassified on sample drawn
from D

o m ≥ (1/ε)(ln n + ln(1/δ)), where n is the size of the hypothesis space, ε the error
target, and δ the certainty target


• Perceptron - returns zero or one based on meeting a threshold.
o Perceptrons are always going to be linear functions and compute lines (half-planes).
o The thresholding of a perceptron is done with a bias term in practice.
o If the problem is linearly separable, we will find the solution in a finite number of
iterations.
▪ We don't and cannot know quantitatively what "finite" means
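A minimal perceptron training-rule sketch on the (linearly separable) AND function, with the threshold implemented as a bias term as noted above; learning rate and epoch count are illustrative:

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])              # the AND function
w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(25):                  # finite for separable problems
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b >= 0 else 0   # hard threshold via bias term
        w += lr * (yi - pred) * xi           # update only on mistakes
        b += lr * (yi - pred)

print([1 if xi @ w + b >= 0 else 0 for xi in X])   # -> [0, 0, 0, 1]
```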
• Pink noise - uniform noise.
• Regression
o We can minimize the sum of squares as our error/loss function. To do this we find the
derivative of our sum of squared error function.
o Ways training data can have errors
▪ Sensor error
▪ Maliciously (given bad data)
▪ transcription error
▪ unmodeled influence (noise in the real world). Housing data could have things
like: builder of the house, interest rates, etc.
• Sample - (Training set). For classification it would be an example of inputs paired with the
correct label.
• Sigmoid - activation function which has asymptotes at 0 and 1. It is differentiable
• Simulated Annealing
o Algorithm:
▪ Sample new point x in N(x)
▪ jump to new sample with probability given by acceptance probability function
P(x,x_t,T)
▪ Decrease temperature T
• If the new point is not an improvement, we evaluate exp((f(x_new) - f(x)) / T) and use
that as the probability of making the move. If the new point is much worse,
we raise e to a large negative number and the probability of
making the move is low (unless the temperature is high, which makes
the exponent a smaller negative number)

o Properties
▪ T → 0: behaves like hill climbing
▪ T → ∞: behaves like a random walk
▪ decrease T slowly to allow the algorithm to home in on the best local optimum
(a sketch follows)
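A minimal simulated-annealing sketch of the algorithm above; the objective, neighborhood, and cooling schedule are illustrative:

```python
import numpy as np

def f(x):
    return -x**2 + 5 * np.sin(3 * x)        # same bumpy 1-D objective

rng = np.random.default_rng(4)
x, T = rng.uniform(-5, 5), 2.0
for step in range(2000):
    x_new = x + rng.normal(0, 0.2)          # sample a neighbor
    # Accept improvements always; worse moves with probability e^(dE/T).
    if f(x_new) >= f(x) or rng.random() < np.exp((f(x_new) - f(x)) / T):
        x = x_new
    T = max(T * 0.995, 1e-3)                # decrease the temperature slowly

print(f"x≈{x:.2f}, f≈{f(x):.2f}")
```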

• Sum of Squares
o Residual - the difference between the observed value and the predicted value. Residuals
can be negative; this is why we square them
o When we try to minimize this loss function, we are trying to find the value where the
derivative of the residual squared function is zero.
• SVM - support vector machines
o Finding the optimal decision boundary is the same as finding the line that maximizes the
margin 2 / ||w||.
▪ Rather than maximizing 2 / ||w||, we can equivalently minimize (1/2) ||w||^2
• It is a quadratic programming problem.
• This is the same thing as maximizing the dual:
W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j),
subject to α_i ≥ 0 and Σ_i α_i y_i = 0
• which makes w the sum of the data times the labels times the
alphas: w = Σ_i α_i y_i x_i
• The alphas are mostly 0: much of the data does not
matter. Each data point is a vector, but you can
find the support you need by just using a
few of these vectors. The data points which
have a non-zero alpha are the support vectors (a small
demonstration follows this section).
o Are there any bad kernel functions?
▪ All kernel functions must satisfy the Mercer condition
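A small demonstration that the alphas are mostly zero, using scikit-learn's SVC (assuming the library is available); the data is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(f"{len(clf.support_vectors_)} of {len(X)} points are support vectors")
# Only the handful of points near the margin carry non-zero alpha.
```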
• Testing set - looks like a training set. We take the candidate concept and measure its performance
against the testing set. If you only trained and measured performance on your training set, you
would not have demonstrated the ability to generalize.
• Version Space
o Trying to learn from a training set S, from the hypothesis class H
▪ h is a candidate, and it is consistent if c(x) = h(x) for x in S.
• version space is the space of all hypotheses that are consistent
o The version space is epsilon-exhausted when every hypothesis remaining in it has true error
less than epsilon
• VC Dimensions
o What is the largest set on inputs that the hypothesis space can label in all possible
ways?
▪ Labeling in all possible ways = shatter
• the largest set of inputs is the VC dimension
o The VC dimension makes a statement about the amount of data that we need to learn.
▪ To show the VC dimension is at least d, you only need to exhibit a single set of d
points that can be shattered ("there exists" a set, "for all" labelings)
o The VC dimension is often the number of parameters (for a d-dimensional hyperplane, d + 1:
the weights for each of the dimensions plus θ, the threshold)
o The VC dimension of a finite H:
▪ If we say the VC dimension of H is some number d, there have to be at least 2^d
distinct concepts, because each of the 2^d different labelings must be achieved by a
different hypothesis; hence d ≤ log2 |H|

• H is PAC-learnable if and only if the VC dimension is finite
• Weak Learner - a learner that, no matter what the distribution is, will do better than chance (an
expected error < 1/2 - ε)
o If there exists a distribution where none of the hypotheses will do better than chance,
there is no weak learner (for a particular hypothesis space & instance set)
▪ if you have a lot of hypotheses that are bad at everything, it's going to be tough
to find a weak learner.
▪ If you have a lot of hypotheses that are good at everything, it's going to be easy
to find a weak learner.
• White noise - gaussian noise

To Review:
ID3
Bayes Nets; how to draw them? especially with conditional independence
Cross Validation
Information Theory - https://faculty.cc.gatech.edu/~isbell/tutorials/InfoTheory.fm.pdf
Boosting paper - https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/boosting-survey.pdf
https://storage.googleapis.com/supplemental_media/udacityu/367378584/Intro%20to%20Boosting.pdf
Maybe statquest boosting?
What is polynomial time?
Haussler's Theorem
No Free Lunch

Student notes:
https://github.com/mohamedameen93/CS-7641-Machine-Learning-Notes
