CS 188: Artificial Intelligence
Neural Nets (Wrap-Up) and Decision Trees
Instructors: Pieter Abbeel and Dan Klein --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Today
§ Neural Nets (wrap-up)
§ Formalizing Learning
§ Consistency
§ Simplicity
§ Decision Trees
§ Expressiveness
§ Information Gain
§ Overfitting
Deep Neural Network
$z_i^{(k)} = g\left(\sum_j W_{i,j}^{(k-1,k)} \, z_j^{(k-1)}\right)$, where $g$ is a nonlinear activation function

$\max_w \; \ell\ell(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
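A minimal NumPy sketch of the layer equation above, with ReLU standing in for the nonlinearity $g$; the layer sizes and random weights are illustrative assumptions, not part of the slides.

```python
# Sketch of a deep-net forward pass: z^(k) = g(W^(k-1,k) z^(k-1)) at each layer.
# ReLU is an assumed choice for g; the sizes and weights below are made up.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, g=relu):
    """Apply each layer's linear map followed by the nonlinearity g."""
    z = x
    for W in weights:
        z = g(W @ z)
    return z

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]  # 3 -> 5 -> 2 units
print(forward(rng.normal(size=3), weights))
```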
§ Practical considerations
§ Can be seen as learning the features
§ Large number of neurons
§ Danger for overfitting
§ (hence early stopping!)
Object Detection
Manual Feature Design
[Figure: image and its hand-designed HoG features]
Performance
AlexNet
Karpathy & Fei-Fei, 2015; Donahue et al., 2015; Xu et al., 2015; many more
Visual QA Challenge
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh
Speech Recognition
Machine Translation
Google Neural Machine Translation (in production)
Today
§ Neural Nets (wrap-up)
§ Formalizing Learning
§ Consistency
§ Simplicity
§ Decision Trees
§ Expressiveness
§ Information Gain
§ Overfitting
§ Clustering
Inductive Learning
Inductive Learning (Science)
§ Simplest form: learn a function from examples
§ A target function: g
§ Examples: input-output pairs (x, g(x))
§ E.g. x is an email and g(x) is spam / ham
§ E.g. x is a house and g(x) is its selling price
§ Problem:
§ Given a hypothesis space H
§ Given a training set of examples (xi, g(xi))
§ Find a hypothesis h(x) such that h ~ g
§ Includes:
§ Classification (outputs = class labels)
§ Regression (outputs = real numbers)
Inductive Learning
§ Curve fitting (regression, function approximation):
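A small sketch of the curve-fitting picture using NumPy's polyfit; the data points and polynomial degrees are made-up assumptions. The degree-1 fit is the simpler hypothesis; the degree-5 fit is consistent with all six noisy points but models the noise.

```python
# Curve fitting sketch: simple vs. overly flexible hypotheses on made-up noisy data.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0 + np.array([0.3, -0.2, 0.4, -0.1, 0.2, -0.3])  # noisy line

line = np.polyfit(x, y, deg=1)    # simple hypothesis: a line
wiggly = np.polyfit(x, y, deg=5)  # passes through every point: consistent, but fits the noise

x_new = 6.0
print(np.polyval(line, x_new))    # close to the true value of 13
print(np.polyval(wiggly, x_new))  # can be far off: the price of overfitting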
Decision Trees
Reminder: Features
§ Features, aka attributes
§ Sometimes: TYPE=French
§ Sometimes: f_{TYPE=French}(x) = 1
Decision Trees
§ Compact representation of a function:
§ Truth table
§ Conditional probability table
§ Regression values
§ True function
§ Realizable: in H
Expressiveness of DTs
Comparison: Perceptrons
§ What is the expressiveness of a perceptron over these features?
§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs)
§ Though if the interactions are too complex, may not find the DT greedily
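A hedged illustration of this expressiveness gap (assuming scikit-learn is available): a single perceptron cannot represent y = a XOR b, while a depth-2 decision tree can.

```python
# XOR is not linearly separable, so a lone perceptron cannot fit it; a small tree can.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # y = a XOR b

perc = Perceptron(max_iter=1000).fit(X, y)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

print("perceptron accuracy:", perc.score(X, y))   # stuck below 1.0
print("decision tree accuracy:", tree.score(X, y))  # 1.0: split on a, then on b
```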
Hypothesis Spaces
§ How many distinct decision trees with n Boolean attributes?
= number of Boolean functions over n attributes
= number of distinct truth tables with 2^n rows
= 2^(2^n)
§ E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
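A quick check of the count above:

```python
# Number of distinct Boolean functions over n attributes: 2^(2^n).
n = 6
print(2 ** (2 ** n))  # 18446744073709551616
```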
§ So: we need a measure of how “good” a split is, even if the results aren’t perfectly separated out
Information Gain
§ Back to decision trees!
§ For each split, compare entropy before and after
§ Difference is the information gain
§ Problem: there’s more than one distribution after the split!
§ Solution: use the expected entropy, weighted by the number of examples in each branch (see the sketch below)
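A minimal sketch of the computation, with made-up label counts: entropy of the labels before the split minus the size-weighted expected entropy of the branches after it.

```python
# Information gain = entropy(parent) - expected entropy of the branches,
# weighted by how many examples land in each branch. Counts are made up.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    total = sum(parent_counts)
    expected = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - expected

# 6 positive / 6 negative examples before the split:
print(information_gain([6, 6], [[6, 0], [0, 6]]))  # 1.0 bit: a perfect split
print(information_gain([6, 6], [[4, 2], [2, 4]]))  # ~0.08 bits: a weak split
```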
Second Level
Final Tree
Reminder: Overfitting
§ Overfitting:
§ When you stop modeling the patterns in the training data (which generalize)
§ And start modeling the noise (which doesn’t)
Consider this split
Significance of a Split
§ Starting with:
§ Three cars with 4 cylinders, from Asia, with medium HP
§ 2 bad MPG
§ 1 good MPG
§ Probably shouldn’t split if the counts are so small they could be due to chance
§ A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance*
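A hedged sketch of such a test using SciPy's chi-squared test of independence; the 2-vs-1 contingency table is one assumed way the three cars above might be split. With counts this small, even a "perfect" split yields a large p-value.

```python
# Chi-squared test of independence between a candidate split and the MPG label.
# The tiny table is illustrative; SciPy is an assumed dependency.
from scipy.stats import chi2_contingency

# Rows: the two branches of the split; columns: [bad MPG, good MPG] counts.
table = [[2, 0],
         [0, 1]]
chi2, p_value, dof, expected = chi2_contingency(table)

print(p_value)  # large p-value: the split's apparent purity could easily be chance,
                # so a MaxPCHANCE-style pruning threshold would reject this split
```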
Keeping it General
§ Pruning: y = a XOR b
Regularization
[Plot: accuracy of Training vs. Held-out/Test data as MaxPCHANCE goes from decreasing to increasing]