
COMP101: Foundations of Information Systems

Grant Dick
Department of Information Science

Lecture 02: Information Theory
Before we start …

• Labs start this week:
  • OBS 1.18 (Otago Business School building), or
  • Arts CAL (Arts Building) (Thursdays)
  • See your timetable on eVision

• First lab assessment this week:
  • Each lab has a small assessment activity or sign-off
  • Best 10 of 11 contribute to internal assessment (20%)
  • Minimum of 6 assessments completed for hurdle

2
Class Reps

• For a class of ~140 students, it would be great to have one rep from each lab stream (i.e., 3-4 reps)

• If interested, please let me know (we can do the registration for you!)

5
Goals for today

• Information theory – what is “information”, and how do we measure information quantity?

• Examples of Information Theory:
  • Sizing data requirements
  • Estimating required effort
  • Estimating compression efficiency
  • Creating decision rules from data

6
Recap from last lecture:

• I’m booking a table at a restaurant, and the waiter has asked “How many guests?”
  (note: it’s a loud restaurant, so I answer with a gesture instead of just talking)

• My response is:
  a) “Table for 7, please?”
  b) “Table for 25, please?”
  c) “Table for 17, please?”
  d) “Table for 127, please?”

That gesture could be anything!
7
Data is just symbols

These symbols can be used to represent things, but context is needed!
(Image source: Wikimedia Commons)
So, what is Information?

(Pictured: Claude Shannon)
Information

• Very hard to define without using tautology!

From OED:

Information

• “Knowledge communicated concerning some particular fact, subject, or event; that of which one is apprised or told; intelligence, news.” (OED)

• Keywords: knowledge, communicate, insight, fact, input.

• Information forms the input into decision making
11
Information as “States”

• A source of information takes on different values/concepts.

• Each value is a unique state for that information source

• Note: a “value” is independent of its representation (e.g., 100₁₀, C, 1100100₂, and 百 all represent the same information state: one hundred)
12
Information as “Surprise”

• If information provides insight into decision making/understanding, then we can measure the value/importance/size of information in terms of its “Surprise”

• A rare state is usually surprising, and usually also informative

• Therefore, the Surprise of an information source is measured in terms of the (inverse) frequency of occurrence of its states
13
Information as Surprise

• A common state (high $p(x)$) is not informative, so Surprise is low

• A deterministic state ($p(x) = 1$) is completely uninformative, so Surprise is zero

• The Surprise of an impossible state ($p(x) = 0$) is undefined
14
The bit – smallest unit of information

• Information must have variation:
  • i.e., more than one state

• The smallest number of states is therefore 2:
  • Presence or absence
  • On or Off
  • Yes or No
  • 1 or 0

• Shannon gave this unit a special name: the bit (binary digit)
15
Representing Multiple States in
Bits
• A bit can represent two states (0, 1)

• To represent multiple states, we simply chain together more bits:
  • e.g., to represent four states: (00, 01, 10, 11); six states: (000, 001, 010, 011, 100, 101), …
  • We’ll elaborate on this in lectures 17-19 (representations)

• We therefore use the bit as the basic unit of information (i.e., we “size” a source of information as the number of bits required to represent all of its states)
16
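As a quick illustration of the chaining idea above, here is a minimal Python sketch (my own, not from the slides) that enumerates the patterns k bits can take; k bits always give 2^k distinct states.

```python
# Minimal sketch: chaining k bits yields 2**k distinct patterns (states).
from itertools import product

for k in (1, 2, 3):
    patterns = ["".join(bits) for bits in product("01", repeat=k)]
    print(k, "bits ->", len(patterns), "states:", patterns)
# 1 bits -> 2 states: ['0', '1']
# 2 bits -> 4 states: ['00', '01', '10', '11']
# 3 bits -> 8 states: ['000', '001', '010', '011', '100', '101', '110', '111']
```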
“Sizing” information
Just how many bits are required?

Examples:
Days of the week
Degrees offered by the university
Age of an individual

17
Size as “Expected Surprise”

• For a source of information with $n$ states, we can measure the surprise of each state $i$ (occurring with probability $p_i$) as:

  $s_i = \log_2(1/p_i) = -\log_2 p_i$

• The expected (average) surprise that we get from this source is then:

  $H = \sum_{i=1}^{n} p_i \log_2(1/p_i) = -\sum_{i=1}^{n} p_i \log_2 p_i$

• We call this value the Shannon Entropy, measured in bits

18
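A small Python sketch of the entropy formula above; the function name shannon_entropy is my own choice, not something defined in the lecture.

```python
# Shannon entropy: H = -sum(p * log2(p)) over all states, in bits.
import math

def shannon_entropy(probs):
    """Expected surprise of a source whose state probabilities are given in probs."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # a fair coin: 1.0 bit
print(shannon_entropy([1.0]))       # a deterministic source: 0.0 bits
```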
Example:

• Imagine that we encounter weather states with the following probabilities:

  State     i    p_i
  Sunny     1    0.20
  Raining   2    0.20
  Foggy     3    0.10
  Cloudy    4    0.45
  Frosty    5    0.05

• The entropy of this system is therefore:

  $H = -(0.20\log_2 0.20 + 0.20\log_2 0.20 + 0.10\log_2 0.10 + 0.45\log_2 0.45 + 0.05\log_2 0.05) \approx 1.995$ bits
19
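A quick check of the weather example, using the probabilities from the table above (a small sketch; rounding to three decimal places is my choice).

```python
# Entropy of the weather source from the table above.
import math

weather = {"Sunny": 0.20, "Raining": 0.20, "Foggy": 0.10, "Cloudy": 0.45, "Frosty": 0.05}
H = -sum(p * math.log2(p) for p in weather.values())
print(round(H, 3))  # 1.995 bits
```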
Equal Probability States

• Shannon Entropy is maximised when states are equally-likely (when $p_i = 1/n$ for all $i$); in these cases, we typically simplify to:

  $H = \log_2 n$
21
Example: Password Characters

• Assume (unrealistically!) that each character in a password can be A-Z, a-z, 0-9, _, or a space with equal probability (i.e., random password).

• Each symbol equates to one of $26 + 26 + 10 + 1 + 1 = 64$ states

• Therefore, each character in a password contributes $\log_2 64 = 6$ bits of entropy.

22
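The same counting in Python, as a small sketch (variable names are mine).

```python
# Password alphabet from the slide: A-Z, a-z, 0-9, underscore, space.
import math

n_states = 26 + 26 + 10 + 1 + 1
bits_per_char = math.log2(n_states)
print(n_states, bits_per_char)  # 64 6.0
```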
Uses of Entropy

• Storage Requirements

• Estimating Computational Effort

• Decision Rules in Machine Learning

• As a basis for compression

23
Storage: How many bits are needed?

• Often we need to work out how many bits will be required to store a value in memory, on disk, etc.

• Entropy defines the lower bound on the number of required bits

• But we can’t store fractions of a bit, so we round up to the nearest whole number:

  $\text{bits required} = \lceil H \rceil$  (for $n$ equally-likely states, $\lceil \log_2 n \rceil$)

24
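A minimal sketch of the rounding rule above. The day-of-week count comes from the earlier “sizing” examples; the count of degrees (150) is a made-up placeholder, not a real figure.

```python
# Round log2(n) up to the nearest whole bit.
import math

def bits_needed(n_states):
    """Smallest whole number of bits that can distinguish n_states values."""
    return math.ceil(math.log2(n_states))

print(bits_needed(7))    # days of the week -> 3 bits
print(bits_needed(150))  # hypothetical count of degrees offered -> 8 bits
```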
Computational Effort

• Entropy defines an exponential scale of work:
  • Every additional bit of entropy doubles the number of states in the system.

• We can therefore use entropy to measure the difficulty of, or required work for, a problem

• e.g., passwords: every extra bit of password entropy (for a given length, e.g., by allowing more characters) doubles the effort required to brute-force the password.
25
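A small sketch of the doubling argument, reusing the 6-bits-per-character figure from the password slide; the 8-character length is an assumption for illustration.

```python
# Brute-force search space grows as 2**(total entropy); one extra bit doubles it.
length = 8             # assumed password length
bits_per_char = 6      # from the 64-symbol alphabet above
total_bits = length * bits_per_char
print(2 ** total_bits)        # 281474976710656 candidate passwords (2**48)
print(2 ** (total_bits + 1))  # one more bit of entropy -> twice as many
```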
Decision Rules

• Decisions are often made by “divide and conquer”:
  • Partition the problem into smaller (simpler) sets
  • Use the average outcome of the smaller sets to inform the decision (majority rule)

• How do we systematically identify the best “splitting rule”?

• Entropy! (more accurately, entropy loss)

26
Entropy Loss (Information Gain!)

• Recall that a “pure” (no diversity) system has zero entropy, while a perfectly mixed system has maximum entropy

• To test a splitting rule:
  • Measure the entropy of the whole set
  • “Split” the whole set into two (or more) subsets
  • Compute the entropy of each subset and combine these (weighted by subset size)
  • Measure the difference in entropy before and after the split
27
If we have a lot of candidate splitting rules, pick the one that produces the largest difference!

This forms the basis of a method called decision tree learning! (A small code sketch of the gain calculation follows below.)


28
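A sketch of the procedure above in Python; the helper names entropy and information_gain are my own, and this is not the exact code used in the course.

```python
# Entropy before a split minus the size-weighted entropy after it ("information gain").
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(labels, subsets):
    """How much entropy a candidate split removes; bigger is better."""
    n = len(labels)
    after = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - after

# Toy usage with made-up labels: a perfect split removes the full 1 bit.
whole = ["Yes", "Yes", "No", "No"]
print(information_gain(whole, [["Yes", "Yes"], ["No", "No"]]))  # 1.0
```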
Example (n.b., not assessed!)

• Predicting whether a person will default on a loan based on the following data:
default student balance income
No No 266 60183
Yes Yes 1551 19028
Yes No 1666 25055
No Yes 435 14813
Yes No 1320 36549
No No 1639 30625
Yes Yes 2039 12182
No No 0 62129
No No 566 39109
Yes No 1586 52275

• Entropy of “default” is: $-(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1$ bit (5 “Yes”, 5 “No”)

• Let’s try splitting on “student = No” …
29
Example

• Predicting whether a person will default on a loan based on the following data:
Subset “student = No” (7 records):
default student
No   No
Yes  No
Yes  No
No   No
No   No
No   No
Yes  No

Subset “student = Yes” (3 records):
default student
Yes  Yes
No   Yes
Yes  Yes

• Entropy before: 1 bit

• Entropy after: 0.7 × 0.985 + 0.3 × 0.918 = 0.965 bits (weights are the subset proportions, 7/10 and 3/10)

• Difference: 0.035 bits (not much!)
30
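Re-doing the split arithmetic above as a quick sketch (H here is a binary-entropy helper of my own; 3/7 and 1/3 are the “Yes” proportions in the two subsets).

```python
# Weighted entropy after the "student = No" split, and the resulting gain.
import math

def H(p):
    """Binary entropy (bits) of a Yes/No mix with Yes-proportion p."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

after = (7 / 10) * H(3 / 7) + (3 / 10) * H(1 / 3)
print(round(after, 3), round(1 - after, 3))  # 0.965 0.035
```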
Example

• Predicting whether a person will default on a loan based on the following data:
default student balance income
No No 266 60183
Yes Yes 1551 19028
Yes No 1666 25055
No Yes 435 14813
Yes No 1320 36549
No No 1639 30625
Yes Yes 2039 12182
No No 0 62129
No No 566 39109
Yes No 1586 52275

• Let’s try splitting on “balance ≤ 1000” …


31
Example

• Predicting whether a person will default on a loan based on the following data:
Subset “balance ≤ 1000” (4 records):
default balance
No   266
No   435
No   0
No   566

Subset “balance > 1000” (6 records):
default balance
Yes  1551
Yes  1666
Yes  1320
No   1639
Yes  2039
Yes  1586

• Entropy before: 1 bit

• Entropy after: 0.4 × 0.0 + 0.6 × 0.650 = 0.390 bits (weights are the subset proportions, 4/10 and 6/10)

• Difference: 0.610 bits (much better!)
32
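Comparing the two candidate rules from the slides above: pick whichever removes the most entropy. The dict labels are mine; the numbers come from the two worked examples.

```python
# Gain = entropy before (1 bit) minus weighted entropy after each split.
gains = {"student = No": 1 - 0.965, "balance <= 1000": 1 - 0.390}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))  # balance <= 1000 0.61
```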
Further splitting makes for potentially even better predictions!

Splitting rule:
If balance < 943 or balance between 1612.5 and 1652.5, then default is “No”, otherwise default is “Yes”

33
Compression

• Entropy defines the number of bits required to perfectly represent (encode) a system:
  • part of Shannon's source coding theorem
  • Naïve encodings rarely achieve this efficiency

• Efficiency of an encoding can be measured as:

  $\text{efficiency} = \dfrac{H}{\text{average code length (bits per symbol)}}$

• Example …

34
Efficiency of naïve encoding of states

• Consider our earlier example (states of weather):

  State     i    p_i     Naïve Code
  Sunny     1    0.20    000
  Raining   2    0.20    001
  Foggy     3    0.10    010
  Cloudy    4    0.45    011
  Frosty    5    0.05    100

• Efficiency of this 3-bit encoding is: $1.995 / 3 \approx 66.5\%$
35
Huffman Coding

• Entropy and probability are closely related

• Theory: give each state a potentially different length (number of bits) to encode

• More frequent states get shorter codes

• Encoding generated by building a Huffman tree.
36
Building a Huffman tree

Can use the following algorithm to build a tree:
1. Start with an empty set T
2. Add all the symbols to set T
3. While there are multiple "trees" in the set:
a. Remove the lowest probability tree from set T (break ties in
terms of smaller tree size), and call this tree A
b. Remove the lowest probability tree from set T (break ties in
terms of smaller tree size), and call this tree B
c. Make a new tree C by joining A and B, and set the
probability of C to p(A) + p(B)
d. Add the new tree C to set T
4. Return the only tree in set T as the Huffman tree
37
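A compact Python sketch of the algorithm above, using a heap as the set T. Note that ties here are broken by insertion order rather than by the “smaller tree” rule on the slide, so individual code lengths can differ from slide 39, although the average code length for the weather example still comes out at 2.05 bits.

```python
# Build Huffman codes by repeatedly joining the two lowest-probability trees.
import heapq

def huffman_codes(probs):
    """probs: dict of symbol -> probability. Returns dict of symbol -> bit-string code."""
    # Each heap entry is (probability, tie-breaker, {symbol: partial code}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p_a, _, a = heapq.heappop(heap)   # tree A: lowest probability
        p_b, _, b = heapq.heappop(heap)   # tree B: next lowest
        # Join A and B into C: A's codes gain a leading "0", B's a leading "1".
        c = {sym: "0" + code for sym, code in a.items()}
        c.update({sym: "1" + code for sym, code in b.items()})
        counter += 1
        heapq.heappush(heap, (p_a + p_b, counter, c))
    return heap[0][2]

codes = huffman_codes({"Sunny": 0.20, "Raining": 0.20, "Foggy": 0.10,
                       "Cloudy": 0.45, "Frosty": 0.05})
print(codes)  # Cloudy gets the shortest code; Foggy and Frosty the longest
```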
Example: building a Huffman
tree

Using: p(Sunny)=0.2, p(Raining)=0.2, p(Foggy)=0.1, p(Cloudy)=0.45, p(Frosty)=0.05
38
From Tree to Code

State     p_i     Huffman Code
Sunny     0.20    01
Raining   0.20    000
Foggy     0.10    0010
Cloudy    0.45    1
Frosty    0.05    0011

39
Efficiency of Huffman Code

• Efficiency is computed as before, using the average code length:
  • Average = 0.2×2 + 0.2×3 + 0.1×4 + 0.45×1 + 0.05×4 = 2.05 bits

  State     i    p_i     Naïve Code    Huffman Code
  Sunny     1    0.20    000           01
  Raining   2    0.20    001           000
  Foggy     3    0.10    010           0010
  Cloudy    4    0.45    011           1
  Frosty    5    0.05    100           0011

• Efficiency is therefore $1.995 / 2.05 \approx 97.3\%$ (compared with 66.5% for the naïve 3-bit code)
40
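Checking both efficiencies from the table above as a small sketch (entropy ≈ 1.995 bits; the Huffman code lengths are taken from the table on this slide).

```python
# Efficiency = entropy / average code length, for the naive (3-bit) and Huffman codes.
import math

p = {"Sunny": 0.20, "Raining": 0.20, "Foggy": 0.10, "Cloudy": 0.45, "Frosty": 0.05}
huff_len = {"Sunny": 2, "Raining": 3, "Foggy": 4, "Cloudy": 1, "Frosty": 4}

H = -sum(q * math.log2(q) for q in p.values())   # ~1.995 bits
avg_huff = sum(p[s] * huff_len[s] for s in p)    # 2.05 bits
print(round(H / 3, 3), round(H / avg_huff, 3))   # 0.665 (naive) vs 0.973 (Huffman)
```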
A Huffman tree for Wikipedia!

Generated by scraping text from 10000 Wikipedia pages and computing frequencies

• The code from this tree has an efficiency of 99.4% (a naïve 6-bit encoding is 73.8%)

41
Remark

• Huffman codes are rarely used in practice, but variants of the approach are used in many places, e.g.:
  • ZIP compression, text compression, JPEG

42
In this week’s lab

• Compute the entropy of a system

• Build a Huffman tree and measure the efficiency of its corresponding coding

• Use a Huffman code to encode and decode signals

• Compare Huffman codes to Scrabble scores and Morse code (you should see something interesting here!)
43
Thanks!
Questions?
