
CS 188: Artificial Intelligence

Neural Nets (wrap-up) and Decision Trees

Instructors: Pieter Abbeel and Dan Klein --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Today
§ Neural Nets -- wrap

§ Formalizing Learning
§ Consistency
§ Simplicity

§ Decision Trees
§ Expressiveness
§ Information Gain
§ Overfitting
Deep Neural Network

[Figure: a deep network with inputs x_1, …, x_L, hidden layers z^{(1)}, …, z^{(n)}, and a three-way softmax output layer z^{(OUT)}.]

$$P(y_i \mid x; w) \;=\; \frac{e^{z_i^{(\mathrm{OUT})}}}{e^{z_1^{(\mathrm{OUT})}} + e^{z_2^{(\mathrm{OUT})}} + e^{z_3^{(\mathrm{OUT})}}}, \qquad i = 1, 2, 3$$

$$z_i^{(k)} \;=\; g\Big(\sum_j W_{i,j}^{(k-1,k)}\, z_j^{(k-1)}\Big), \qquad g = \text{nonlinear activation function}$$
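As a concrete illustration, here is a minimal NumPy sketch of that forward pass, assuming one weight matrix per layer and a tanh activation (the layer shapes and activation choice are illustrative, not taken from the slides):

```python
import numpy as np

def forward(x, weights):
    """Forward pass: z^(k) = g(W^(k-1,k) @ z^(k-1)), softmax over the last layer."""
    g = np.tanh                          # some nonlinear activation g
    z = np.asarray(x, dtype=float)
    for W in weights[:-1]:
        z = g(W @ z)                     # hidden layers z^(1) ... z^(n)
    scores = weights[-1] @ z             # output-layer activations z^(OUT)
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()               # P(y_i | x; w) for each class i
```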

Deep Neural Network: Also Learn the Features!


§ Training the deep neural network is just like logistic regression:

$$\max_w \; ll(w) \;=\; \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

just w tends to be a much, much larger vector

→ just run gradient ascent


+ stop when log likelihood of hold-out data starts to decrease
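A rough sketch of that training loop, gradient ascent on the log-likelihood with early stopping on held-out data; grad_ll and holdout_ll are hypothetical helpers standing in for the actual gradient and hold-out evaluation:

```python
def train(w, grad_ll, holdout_ll, step_size=0.01, max_iters=10000):
    """Gradient ascent on log-likelihood; stop when hold-out likelihood drops."""
    best_w, best_score = w, holdout_ll(w)
    for _ in range(max_iters):
        w = w + step_size * grad_ll(w)      # ascend the training log-likelihood
        score = holdout_ll(w)
        if score < best_score:              # hold-out log-likelihood decreasing
            break                           # early stopping
        best_w, best_score = w, score
    return best_w
```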
Neural Networks Properties
§ Theorem (Universal Function Approximators). A two-layer neural
network with a sufficient number of neurons can approximate
any continuous function to any desired accuracy.

§ Practical considerations
§ Can be seen as learning the features
§ Large number of neurons
§ Danger of overfitting
§ (hence early stopping!)

How well does it work?


Computer Vision

Object Detection
Manual Feature Design

Features and Generalization

[HoG: Dalal and Triggs, 2005]


Features and Generalization

[Figure: an image and its HoG features]

Performance

AlexNet

graph credit Matt Zeiler, Clarifai
MS COCO Image Captioning Challenge

Karpathy & Fei-Fei, 2015; Donahue et al., 2015; Xu et al, 2015; many more

Visual QA Challenge
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh
Speech Recognition

graph credit Matt Zeiler, Clarifai

Machine Translation
Google Neural Machine Translation (in production)
Today
§ Neural Nets -- wrap

§ Formalizing Learning
§ Consistency
§ Simplicity

§ Decision Trees
§ Expressiveness
§ Information Gain
§ Overfitting

§ Clustering

Inductive Learning
Inductive Learning (Science)
§ Simplest form: learn a function from examples
§ A target function: g
§ Examples: input-output pairs (x, g(x))
§ E.g. x is an email and g(x) is spam / ham
§ E.g. x is a house and g(x) is its selling price

§ Problem:
§ Given a hypothesis space H
§ Given a training set of examples xi
§ Find a hypothesis h(x) such that h ~ g

§ Includes:
§ Classification (outputs = class labels)
§ Regression (outputs = real numbers)

§ How do perceptron and naïve Bayes fit in? (H, h, g, etc.)

Inductive Learning
§ Curve fitting (regression, function approximation):

§ Consistency vs. simplicity


§ Ockham’s razor
Consistency vs. Simplicity
§ Fundamental tradeoff: bias vs. variance

§ Usually algorithms prefer consistency by default (why?)

§ Several ways to operationalize “simplicity”


§ Reduce the hypothesis space
§ Assume more: e.g. independence assumptions, as in naïve Bayes
§ Have fewer, better features / attributes: feature selection
§ Other structural limitations (decision lists vs trees)
§ Regularization
§ Smoothing: cautious use of small counts
§ Many other generalization parameters (pruning cutoffs today)
§ Hypothesis space stays big, but harder to get to the outskirts

Decision Trees
Reminder: Features
§ Features, aka attributes
§ Sometimes: TYPE=French
§ Sometimes: f_{TYPE=French}(x) = 1

Decision Trees
§ Compact representation of a function:
§ Truth table
§ Conditional probability table
§ Regression values

§ True function
§ Realizable: in H
Expressiveness of DTs

§ Can express any function of the features

§ However, we hope for compact trees

Comparison: Perceptrons
§ What is the expressiveness of a perceptron over these features?

§ For a perceptron, a feature’s contribution is either positive or negative


§ If you want one feature’s effect to depend on another, you have to add a new conjunction feature
§ E.g. adding “PATRONS=full ∧ WAIT = 60” allows a perceptron to model the interaction between the two atomic features

§ DTs automatically conjoin features / attributes


§ Features can have different effects in different branches of the tree!

§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs)
§ Though if the interactions are too complex, may not find the DT greedily
Hypothesis Spaces
§ How many distinct decision trees with n Boolean attributes?
= number of Boolean functions over n attributes
= number of distinct truth tables with 2^n rows
= 2^(2^n)
§ E.g., with 6 Boolean attributes, there are
18,446,744,073,709,551,616 trees

§ How many trees of depth 1 (decision stumps)?


= number of Boolean functions over 1 attribute
= number of truth tables with 2 rows, times n
= 4n
§ E.g. with 6 Boolean attributes, there are 24 decision stumps
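A quick check of those counts for n = 6 Boolean attributes:

```python
n = 6
print(2 ** (2 ** n))  # 18446744073709551616 distinct Boolean functions (full trees)
print(4 * n)          # 24 decision stumps
```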

§ More expressive hypothesis space:


§ Increases chance that target function can be expressed (good)
§ Increases number of hypotheses consistent with training set
(bad, why?)
§ Means we can get better predictions (lower bias)
§ But we may get worse predictions (higher variance)

Decision Tree Learning


§ Aim: find a small tree consistent with the training examples
§ Idea: (recursively) choose “most significant” attribute as root of (sub)tree
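A rough sketch of that recursion (ID3-style), assuming examples are (feature-dict, label) pairs and score is some measure of attribute "significance", such as the information gain defined later in these slides; all names here are illustrative:

```python
def learn_tree(examples, attributes, score):
    """Recursively choose the 'most significant' attribute as the root of each
    (sub)tree; stop when a node is pure or no attributes are left."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attributes:
        return max(set(labels), key=labels.count)        # leaf: majority label
    best = max(attributes, key=lambda a: score(examples, a))
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, remaining, score)
    return tree
```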
Choosing an Attribute
§ Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or
“all negative”

§ So: we need a measure of how “good” a split is, even if the results aren’t perfectly
separated out

Entropy and Information

§ Information answers questions


§ The more uncertain about the answer initially, the more
information in the answer
§ Scale: bits
§ Answer to Boolean question with prior <1/2, 1/2>?
§ Answer to 4-way question with prior <1/4, 1/4, 1/4, 1/4>?
§ Answer to 4-way question with prior <0, 0, 0, 1>?
§ Answer to 3-way question with prior <1/2, 1/4, 1/4>?

§ A probability p is typical of:


§ A uniform distribution of size 1/p
§ A code of length log 1/p
Entropy
§ General answer: if prior is <p1,…,pn>:
§ Information is the expected code length

§ Also called the entropy of the distribution

§ More uniform = higher entropy
§ More values = higher entropy
§ More peaked = lower entropy
§ Rare values almost “don’t count”

[Figure: example distributions with entropies of 1 bit, 0 bits, and 0.5 bit]
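The expected code length for a prior <p1,…,pn> is H = Σ_i p_i log2(1/p_i). A minimal sketch, checked against the question priors from the previous slide:

```python
import numpy as np

def entropy(probs):
    """H(<p1,...,pn>) = sum_i p_i * log2(1/p_i); zero-probability values drop out."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))                # Boolean question, uniform prior: 1.0 bit
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 4-way uniform: 2.0 bits
print(entropy([0, 0, 0, 1]))              # answer already known: 0.0 bits
print(entropy([0.5, 0.25, 0.25]))         # 3-way <1/2, 1/4, 1/4>: 1.5 bits
```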

Information Gain
§ Back to decision trees!
§ For each split, compare entropy before and after
§ Difference is the information gain
§ Problem: there’s more than one distribution after split!

§ Solution: use expected entropy, weighted by the number of examples
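A sketch of that computation, reusing the entropy function above; labels is the array of class labels before the split and groups is the list of label arrays in each child (names are illustrative):

```python
def information_gain(labels, groups):
    """Entropy before the split minus the expected (example-weighted) entropy after."""
    def class_dist(ys):
        _, counts = np.unique(ys, return_counts=True)
        return counts / counts.sum()
    n = len(labels)
    expected_after = sum(len(g) / n * entropy(class_dist(g)) for g in groups)
    return entropy(class_dist(labels)) - expected_after
```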
Next Step: Recurse
§ Now we need to keep growing the tree!
§ Two branches are done (why?)
§ What to do under “full”?
§ See what examples are there…

Example: Learned Tree

§ Decision tree learned from these 12 examples:

§ Substantially simpler than “true” tree


§ A more complex hypothesis isn't justified by data
§ Also: it’s reasonable, but wrong
Example: Miles Per Gallon

40 examples in total; a sample:

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

Find the First Split

§ Look at information gain for each attribute

§ Note that each attribute is correlated with the target!

§ What do we split on?


Result: Decision Stump

Second Level
Final Tree

Reminder: Overfitting
§ Overfitting:
§ When you stop modeling the patterns in the training data (which
generalize)
§ And start modeling the noise (which doesn’t)

§ We had this before:


§ Naïve Bayes: needed to smooth
§ Perceptron: early stopping
MPG Training Error

The test set error is much worse than the training set error…
…why?

Consider this split
Significance of a Split
§ Starting with:
§ Three cars with 4 cylinders, from Asia, with medium HP
§ 2 bad MPG
§ 1 good MPG

§ What do we expect from a three-way split?


§ Maybe each example in its own subset?
§ Maybe just what we saw in the last slide?

§ Probably shouldn’t split if the counts are so small they could be due to chance

§ A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance*

§ Each split will have a significance value, pCHANCE
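A sketch of how pCHANCE could be computed with an off-the-shelf chi-squared test; the exact statistic is an assumption here, not something the slides specify:

```python
from scipy.stats import chi2_contingency

def p_chance(counts):
    """counts[i][j] = number of class-j examples in the i-th child of the split.
    Returns the probability that the deviation from the parent's class
    proportions is due to chance."""
    _, p_value, _, _ = chi2_contingency(counts)
    return p_value

# The three cars above (2 bad MPG, 1 good) split three ways, one car per child:
print(p_chance([[1, 0], [1, 0], [0, 1]]))   # ~0.22, well above a MaxPCHANCE of 0.1
```

With pCHANCE computed this way, the pruning rule on the next slide deletes any split whose pCHANCE exceeds MaxPCHANCE.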

Keeping it General

§ Example: y = a XOR b

  a  b  y
  0  0  0
  0  1  1
  1  0  1
  1  1  0

§ Pruning:
§ Build the full decision tree
§ Begin at the bottom of the tree
§ Delete splits in which pCHANCE > MaxPCHANCE
§ Continue working upward until there are no more prunable nodes
§ Note: some chance nodes may not get pruned because they were “redeemed” later
Pruning example

§ With MaxPCHANCE = 0.1:

Note the improved test set accuracy compared with the unpruned tree

Regularization

§ MaxPCHANCE is a regularization parameter


§ Generally, set it using held-out data (as usual)

[Graph: training accuracy vs. held-out / test accuracy as MaxPCHANCE varies.
Decreasing MaxPCHANCE gives small trees (high bias); increasing it gives large trees (high variance).]
Two Ways of Controlling Overfitting

§ Limit the hypothesis space


§ E.g. limit the max depth of trees
§ Easier to analyze

§ Regularize the hypothesis selection


§ E.g. chance cutoff
§ Disprefer most of the hypotheses unless data is clear
§ Usually done in practice

Next Lecture: Applications!
