
CS 188: Artificial Intelligence

Neural Nets (wrap-up) and Decision Trees

Instructors: Pieter Abbeel and Dan Klein --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Today
§ Neural Nets -- wrap

§ Formalizing Learning
§ Consistency
§ Simplicity

§ Decision Trees
§ Expressiveness
§ Information Gain
§ Overfitting
Deep Neural Network

[Figure: a deep network with inputs x_1, …, x_L, hidden layers z^{(1)}, …, z^{(n)}, and a three-way softmax output layer z^{(OUT)}.]

$$P(y_i \mid x; w) \;=\; \frac{e^{z_i^{(\mathrm{OUT})}}}{e^{z_1^{(\mathrm{OUT})}} + e^{z_2^{(\mathrm{OUT})}} + e^{z_3^{(\mathrm{OUT})}}}, \qquad i = 1, 2, 3$$

$$z_i^{(k)} \;=\; g\Big(\sum_j W_{i,j}^{(k-1,k)}\, z_j^{(k-1)}\Big), \qquad g = \text{nonlinear activation function}$$
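As a concrete illustration, here is a minimal NumPy sketch of that forward pass, assuming one weight matrix per layer and a tanh activation (the layer shapes and activation choice are illustrative, not taken from the slides):

```python
import numpy as np

def forward(x, weights):
    """Forward pass: z^(k) = g(W^(k-1,k) @ z^(k-1)), softmax over the last layer."""
    g = np.tanh                          # some nonlinear activation g
    z = np.asarray(x, dtype=float)
    for W in weights[:-1]:
        z = g(W @ z)                     # hidden layers z^(1) ... z^(n)
    scores = weights[-1] @ z             # output-layer activations z^(OUT)
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()               # P(y_i | x; w) for each class i
```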

Deep Neural Network: Also Learn the Features!


§ Training the deep neural network is just like logistic regression:

$$\max_w \; ll(w) \;=\; \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

just w tends to be a much, much larger vector

→ just run gradient ascent


+ stop when log likelihood of hold-out data starts to decrease
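A rough sketch of that training loop, gradient ascent on the log-likelihood with early stopping on held-out data; grad_ll and holdout_ll are hypothetical helpers standing in for the actual gradient and hold-out evaluation:

```python
def train(w, grad_ll, holdout_ll, step_size=0.01, max_iters=10000):
    """Gradient ascent on log-likelihood; stop when hold-out likelihood drops."""
    best_w, best_score = w, holdout_ll(w)
    for _ in range(max_iters):
        w = w + step_size * grad_ll(w)      # ascend the training log-likelihood
        score = holdout_ll(w)
        if score < best_score:              # hold-out log-likelihood decreasing
            break                           # early stopping
        best_w, best_score = w, score
    return best_w
```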
Neural Networks Properties
§ Theorem (Universal Function Approximators). A two-layer neural
network with a sufficient number of neurons can approximate
any continuous function to any desired accuracy.

§ Practical considerations
§ Can be seen as learning the features
§ Large number of neurons
§ Danger of overfitting
§ (hence early stopping!)

How well does it work?


Computer Vision

Object Detection
Manual Feature Design

Features and Generalization

[HoG: Dalal and Triggs, 2005]


Features and Generalization

[Figure: an image and its HoG features]

Performance

AlexNet

graph credit Matt Zeiler, Clarifai
MS COCO Image Captioning Challenge

Karpathy & Fei-Fei, 2015; Donahue et al., 2015; Xu et al, 2015; many more

Visual QA Challenge
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh
Speech Recognition

graph credit Matt Zeiler, Clarifai

Machine Translation
Google Neural Machine Translation (in production)
Today
§ Neural Nets -- wrap

§ Formalizing Learning
§ Consistency
§ Simplicity

§ Decision Trees
§ Expressiveness
§ Information Gain
§ Overfitting

§ Clustering

Inductive Learning
Inductive Learning (Science)
§ Simplest form: learn a function from examples
§ A target function: g
§ Examples: input-output pairs (x, g(x))
§ E.g. x is an email and g(x) is spam / ham
§ E.g. x is a house and g(x) is its selling price

§ Problem:
§ Given a hypothesis space H
§ Given a training set of examples xi
§ Find a hypothesis h(x) such that h ~ g

§ Includes:
§ Classification (outputs = class labels)
§ Regression (outputs = real numbers)

§ How do perceptron and naïve Bayes fit in? (H, h, g, etc.)

Inductive Learning
§ Curve fitting (regression, function approximation):

§ Consistency vs. simplicity


§ Ockham’s razor
Consistency vs. Simplicity
§ Fundamental tradeoff: bias vs. variance

§ Usually algorithms prefer consistency by default (why?)

§ Several ways to operationalize “simplicity”


§ Reduce the hypothesis space
§ Assume more: e.g. independence assumptions, as in naïve Bayes
§ Have fewer, better features / attributes: feature selection
§ Other structural limitations (decision lists vs trees)
§ Regularization
§ Smoothing: cautious use of small counts
§ Many other generalization parameters (pruning cutoffs today)
§ Hypothesis space stays big, but harder to get to the outskirts

Decision Trees
Reminder: Features
§ Features, aka attributes
§ Sometimes: TYPE=French
§ Sometimes: f_{TYPE=French}(x) = 1

Decision Trees
§ Compact representation of a function:
§ Truth table
§ Conditional probability table
§ Regression values

§ True function
§ Realizable: in H
Expressiveness of DTs

§ Can express any function of the features

§ However, we hope for compact trees

Comparison: Perceptrons
§ What is the expressiveness of a perceptron over these features?

§ For a perceptron, a feature’s contribution is either positive or negative


§ If you want one feature’s effect to depend on another, you have to add a new conjunction feature
§ E.g. adding “PATRONS=full ∧ WAIT = 60” allows a perceptron to model the interaction between the two atomic features

§ DTs automatically conjoin features / attributes


§ Features can have different effects in different branches of the tree!

§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs)
§ Though if the interactions are too complex, may not find the DT greedily
Hypothesis Spaces
§ How many distinct decision trees with n Boolean attributes?
= number of Boolean functions over n attributes
= number of distinct truth tables with 2^n rows
= 2^(2^n)
§ E.g., with 6 Boolean attributes, there are
18,446,744,073,709,551,616 trees

§ How many trees of depth 1 (decision stumps)?


= number of Boolean functions over 1 attribute
= number of truth tables with 2 rows, times n
= 4n
§ E.g. with 6 Boolean attributes, there are 24 decision stumps
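A quick check of those counts for n = 6 Boolean attributes:

```python
n = 6
print(2 ** (2 ** n))  # 18446744073709551616 distinct Boolean functions (full trees)
print(4 * n)          # 24 decision stumps
```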

§ More expressive hypothesis space:


§ Increases chance that target function can be expressed (good)
§ Increases number of hypotheses consistent with training set
(bad, why?)
§ Means we can get better predictions (lower bias)
§ But we may get worse predictions (higher variance)

Decision Tree Learning


§ Aim: find a small tree consistent with the training examples
§ Idea: (recursively) choose “most significant” attribute as root of (sub)tree
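A rough sketch of that recursion (ID3-style), assuming examples are (feature-dict, label) pairs and score is some measure of attribute "significance", such as the information gain defined later in these slides; all names here are illustrative:

```python
def learn_tree(examples, attributes, score):
    """Recursively choose the 'most significant' attribute as the root of each
    (sub)tree; stop when a node is pure or no attributes are left."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attributes:
        return max(set(labels), key=labels.count)        # leaf: majority label
    best = max(attributes, key=lambda a: score(examples, a))
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, remaining, score)
    return tree
```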
Choosing an Attribute
§ Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or
“all negative”

§ So: we need a measure of how “good” a split is, even if the results aren’t perfectly
separated out

Entropy and Information

§ Information answers questions


§ The more uncertain about the answer initially, the more
information in the answer
§ Scale: bits
§ Answer to Boolean question with prior <1/2, 1/2>?
§ Answer to 4-way question with prior <1/4, 1/4, 1/4, 1/4>?
§ Answer to 4-way question with prior <0, 0, 0, 1>?
§ Answer to 3-way question with prior <1/2, 1/4, 1/4>?

§ A probability p is typical of:


§ A uniform distribution of size 1/p
§ A code of length log 1/p
Entropy
§ General answer: if prior is <p1,…,pn>:
§ Information is the expected code length

§ Also called the entropy of the distribution

§ More uniform = higher entropy
§ More values = higher entropy
§ More peaked = lower entropy
§ Rare values almost “don’t count”

[Figure: example distributions with entropies of 1 bit, 0 bits, and 0.5 bit]
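The expected code length for a prior <p1,…,pn> is H = Σ_i p_i log2(1/p_i). A minimal sketch, checked against the question priors from the previous slide:

```python
import numpy as np

def entropy(probs):
    """H(<p1,...,pn>) = sum_i p_i * log2(1/p_i); zero-probability values drop out."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))                # Boolean question, uniform prior: 1.0 bit
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 4-way uniform: 2.0 bits
print(entropy([0, 0, 0, 1]))              # answer already known: 0.0 bits
print(entropy([0.5, 0.25, 0.25]))         # 3-way <1/2, 1/4, 1/4>: 1.5 bits
```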

Information Gain
§ Back to decision trees!
§ For each split, compare entropy before and after
§ Difference is the information gain
§ Problem: there’s more than one distribution after split!

§ Solution: use expected entropy, weighted by the number of examples
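A sketch of that computation, reusing the entropy function above; labels is the array of class labels before the split and groups is the list of label arrays in each child (names are illustrative):

```python
def information_gain(labels, groups):
    """Entropy before the split minus the expected (example-weighted) entropy after."""
    def class_dist(ys):
        _, counts = np.unique(ys, return_counts=True)
        return counts / counts.sum()
    n = len(labels)
    expected_after = sum(len(g) / n * entropy(class_dist(g)) for g in groups)
    return entropy(class_dist(labels)) - expected_after
```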
Next Step: Recurse
§ Now we need to keep growing the tree!
§ Two branches are done (why?)
§ What to do under “full”?
§ See what examples are there…

Example: Learned Tree

§ Decision tree learned from these 12 examples:

§ Substantially simpler than “true” tree


§ A more complex hypothesis isn't justified by data
§ Also: it’s reasonable, but wrong
Example: Miles Per Gallon

40 examples in total; a sample:

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

Find the First Split

§ Look at information gain for each attribute

§ Note that each attribute is correlated with the target!

§ What do we split on?


Result: Decision Stump

Second Level
Final Tree

Reminder: Overfitting
§ Overfitting:
§ When you stop modeling the patterns in the training data (which
generalize)
§ And start modeling the noise (which doesn’t)

§ We had this before:


§ Naïve Bayes: needed to smooth
§ Perceptron: early stopping
MPG Training Error

The test set error is much worse than the training set error…
…why?

Consider this split
Significance of a Split
§ Starting with:
§ Three cars with 4 cylinders, from Asia, with medium HP
§ 2 bad MPG
§ 1 good MPG

§ What do we expect from a three-way split?


§ Maybe each example in its own subset?
§ Maybe just what we saw in the last slide?

§ Probably shouldn’t split if the counts are so small they could be due to chance

§ A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance*

§ Each split will have a significance value, pCHANCE
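A sketch of how pCHANCE could be computed with an off-the-shelf chi-squared test; the exact statistic is an assumption here, not something the slides specify:

```python
from scipy.stats import chi2_contingency

def p_chance(counts):
    """counts[i][j] = number of class-j examples in the i-th child of the split.
    Returns the probability that the deviation from the parent's class
    proportions is due to chance."""
    _, p_value, _, _ = chi2_contingency(counts)
    return p_value

# The three cars above (2 bad MPG, 1 good) split three ways, one car per child:
print(p_chance([[1, 0], [1, 0], [0, 1]]))   # ~0.22, well above a MaxPCHANCE of 0.1
```

With pCHANCE computed this way, the pruning rule on the next slide deletes any split whose pCHANCE exceeds MaxPCHANCE.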

Keeping it General

§ Example: y = a XOR b

  a  b  y
  0  0  0
  0  1  1
  1  0  1
  1  1  0

§ Pruning:
§ Build the full decision tree
§ Begin at the bottom of the tree
§ Delete splits in which pCHANCE > MaxPCHANCE
§ Continue working upward until there are no more prunable nodes
§ Note: some chance nodes may not get pruned because they were “redeemed” later
Pruning example

§ With MaxPCHANCE = 0.1:

Note the improved test set accuracy compared with the unpruned tree

Regularization

§ MaxPCHANCE is a regularization parameter


§ Generally, set it using held-out data (as usual)

[Graph: training accuracy vs. held-out / test accuracy as MaxPCHANCE varies.
Decreasing MaxPCHANCE gives small trees (high bias); increasing it gives large trees (high variance).]
Two Ways of Controlling Overfitting

§ Limit the hypothesis space


§ E.g. limit the max depth of trees
§ Easier to analyze

§ Regularize the hypothesis selection


§ E.g. chance cutoff
§ Disprefer most of the hypotheses unless data is clear
§ Usually done in practice

Next Lecture: Applications!
