Comp101 Lect02
Foundations of Information Systems
Grant Dick
Department of Information Science
Lecture 02: Information Theory
Before we start …
Class Reps
Goals for today
Recap from last lecture:
• My response is:
a) “Table for 7, please?”
b) “Table for 25, please?”
c) “Table for 17, please?”
d) “Table for 127, please?”
That gesture could be anything!
Data is just symbols
These symbols can be used to represent things, but context is needed!
Source: Wikimedia Commons
So, what is Information?
Claude Shannon
Information
From OED:
Information
• A common state (high p) is not informative, so Surprise is low
• A deterministic state (p = 1) is completely uninformative, so Surprise is zero
• For an impossible state (p = 0), Surprise is undefined
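The three cases above can be checked numerically. This is a minimal sketch that assumes Surprise is the standard Shannon self-information, log2(1/p), measured in bits; the function name `surprise` and the example probabilities are illustrative only.

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) in bits of a state with probability p.

    Assumes the standard definition log2(1/p); undefined for p = 0.
    """
    if p <= 0.0 or p > 1.0:
        raise ValueError("surprise is only defined for 0 < p <= 1")
    return math.log2(1.0 / p)

print(surprise(0.9))   # common state: low surprise
print(surprise(1.0))   # deterministic state: zero surprise
print(surprise(0.01))  # rare state: high surprise
```

Raising an error for p = 0 mirrors the "undefined" case rather than returning infinity; either convention could be argued for.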
The bit – smallest unit of information
• Information must have variation:
• i.e., more than one state
Representing Multiple States in Bits
• A bit can represent two states (0, 1)
Examples:
Days of the week
Degrees offered by the university
Age of an individual
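A rough sketch of how many bits such examples need, assuming every state gets its own fixed-length code, so n states need ceil(log2(n)) bits. The degree count and the age range below are hypothetical values chosen for illustration, not figures from the lecture.

```python
def bits_needed(num_states: int) -> int:
    """Smallest number of bits that can label num_states distinct states.

    Equal to ceil(log2(num_states)), computed with integer arithmetic
    to avoid floating-point rounding.
    """
    if num_states < 1:
        raise ValueError("need at least one state")
    return (num_states - 1).bit_length()

print(bits_needed(7))    # days of the week -> 3
print(bits_needed(40))   # hypothetical: 40 degrees on offer
print(bits_needed(128))  # hypothetical: ages 0..127
```

Note that 3 bits give 8 patterns, so one pattern goes unused when encoding the 7 days of the week.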
Size as “Expected Surprise”
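"Expected surprise" can be sketched as a probability-weighted average of each state's surprise. This assumes the standard Shannon entropy, H = sum of p(x) * log2(1/p(x)); the helper name `entropy` and the example distributions are illustrative.

```python
import math

def entropy(probs) -> float:
    """Expected surprise in bits: sum of p * log2(1/p) over all states.

    States with p = 0 contribute nothing and are skipped.
    """
    probs = list(probs)
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # fair coin -> 1.0
print(entropy([1.0]))                     # deterministic -> 0.0
print(entropy([0.25, 0.25, 0.25, 0.25]))  # four equally likely states -> 2.0
```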
Example:
Example: Password Characters
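One way to work a password example, as a sketch only: if each character is chosen uniformly and independently from an alphabet of k symbols, each character carries log2(k) bits, and a password of length n carries n * log2(k) bits. The alphabets and the length of 8 below are assumptions for illustration, not the lecture's figures.

```python
import math
import string

def password_entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a password whose characters are drawn
    uniformly and independently from an alphabet of the given size."""
    return length * math.log2(alphabet_size)

# Hypothetical alphabets for comparison.
lowercase = len(string.ascii_lowercase)                # 26 symbols
alphanumeric = len(string.ascii_letters + string.digits)  # 62 symbols

for name, size in [("lowercase", lowercase), ("alphanumeric", alphanumeric)]:
    bits = password_entropy_bits(size, 8)
    print(f"8 characters, {name}: {bits:.1f} bits")
```

Real passwords chosen by people are far from uniform, so this figure is an upper bound on their entropy.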
Uses of Entropy
• Storage Requirements
Storage: How many bits are needed?
• Often we need to work out how many bits will be required to store a value in memory, on disk, etc.
Computational Effort
Entropy Loss (Information Gain!)
Splitting rule:
If balance < 943 or balance between 1612.5 and 1652.5, then default is “No”, otherwise default is “Yes”
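A sketch of how the entropy loss from a split like this could be measured: take the entropy of the class labels before the split and subtract the size-weighted entropy of the groups after it. The class counts below are hypothetical, since the underlying data set is not reproduced here, and the function names are illustrative.

```python
import math

def entropy_from_counts(counts) -> float:
    """Entropy in bits of a label distribution given as class counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def information_gain(parent, children) -> float:
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy_from_counts(child)
                for child in children)
    return entropy_from_counts(parent) - after

# Hypothetical counts as [No, Yes]; not taken from the lecture's data.
parent = [5, 5]
perfect_split = [[5, 0], [0, 5]]  # each group is pure
weak_split = [[3, 2], [2, 3]]     # groups are still mixed

print(information_gain(parent, perfect_split))  # all uncertainty removed
print(information_gain(parent, weak_split))     # little uncertainty removed
```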
Compression
• Example …
Efficiency of naïve encoding of states
• Consider our earlier example (states of weather):
State     #   p(State)   Naïve Code
Sunny     1   0.20       000
Raining   2   0.20       001
Foggy     3   0.10       010
Cloudy    4   0.45       011
Frosty    5   0.05       100
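The efficiency of this fixed-length code can be sketched as entropy divided by bits used per state. Defining efficiency as that ratio is an assumption here; the probabilities and the 3-bit code length come from the table.

```python
import math

# Probabilities from the weather table.
weather = {"Sunny": 0.20, "Raining": 0.20, "Foggy": 0.10,
           "Cloudy": 0.45, "Frosty": 0.05}

def entropy_bits(probs) -> float:
    """Entropy in bits: sum of p * log2(1/p)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

naive_bits = 3  # every state uses a 3-bit code
h = entropy_bits(weather.values())
efficiency = h / naive_bits  # assumed definition: entropy / code length

print(f"entropy = {h:.3f} bits")                # entropy = 1.995 bits
print(f"naive efficiency = {efficiency:.3f}")   # naive efficiency = 0.665
```

So roughly a third of each 3-bit code is wasted relative to the information actually carried.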
Huffman Coding
Using:
p(Sunny)=0.2,
p(Raining)=0.2,
p(Foggy)=0.1,
p(Cloudy)=0.45,
p(Frosty)=0.05
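A minimal sketch of Huffman's procedure on these probabilities: repeatedly merge the two least probable nodes, prefixing "0" to the codes on one side and "1" on the other. The exact bit patterns depend on tie-breaking and on which branch is labelled 0, so they may differ from the tree on the slide; the code lengths should not.

```python
import heapq
from itertools import count

def huffman_codes(probs: dict) -> dict:
    """Build a prefix code by repeatedly merging the two least likely nodes."""
    tiebreak = count()  # unique counter so equal probabilities never compare dicts
    heap = [(p, next(tiebreak), {state: ""}) for state, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)  # least probable node
        p1, _, codes1 = heapq.heappop(heap)  # second least probable node
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

weather = {"Sunny": 0.2, "Raining": 0.2, "Foggy": 0.1,
           "Cloudy": 0.45, "Frosty": 0.05}
codes = huffman_codes(weather)
for state, code in codes.items():
    print(f"{state:8s} {code}")

average_length = sum(weather[s] * len(codes[s]) for s in weather)
print(f"average code length = {average_length:.2f} bits")
```

The most probable state (Cloudy) ends up with a 1-bit code, while the two rarest states (Foggy, Frosty) get 4-bit codes.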
From Tree to Code
State     p(State)   Huffman Code
Sunny     0.20
Raining   0.20
Foggy     0.10
Cloudy    0.45
Frosty    0.05
Efficiency of Huffman Code
• Efficiency is therefore
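A worked version of the calculation, as a sketch under two assumptions: efficiency is taken as entropy divided by average code length, and the Huffman code lengths for the weather probabilities are 1 bit (Cloudy), 2 and 3 bits (Sunny and Raining, in either order depending on tie-breaking), and 4 bits (Foggy, Frosty). The symbols H, \bar{L} and \eta are notation introduced here.

```latex
\begin{align*}
H &= \sum_{x} p(x)\,\log_2\frac{1}{p(x)} \approx 1.995 \text{ bits} \\
\bar{L} &= 0.45(1) + 0.20(2) + 0.20(3) + 0.10(4) + 0.05(4) = 2.05 \text{ bits} \\
\eta &= \frac{H}{\bar{L}} \approx \frac{1.995}{2.05} \approx 0.973
\end{align*}
```

For comparison, the same ratio for the 3-bit naïve code is about 1.995 / 3 ≈ 0.665.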
A Huffman tree for Wikipedia!
Generated by scraping text from 10000 Wikipedia pages and computing frequencies
Remark
In this week’s lab