01 Intro Bits Bytes

Introduction:

algorithms (recapitulation),
bits, strings
Jesper Larsson
Course, teaching
• Me: Jesper Larsson, teaching at MAU since 2014, research
background in string algorithms and data compression

• You?

• Languages? Spoken and for programming

• Lectures, assignments, computer time

• Not based on a single book, but the material exists in books

• Small and familiar course


Lectures (preliminary)
1. Intro, bits and strings
2. Bucket and radix sorting
3. Trie (digital tree)
4. Suffix tree
5. Inverted file (search engine data structure) + regular expressions
6. Suffix array
7. Suffix data structure algorithms, supplemental
8. Information theory, codes, entropy coding
9. More on codes and their applications, Ziv-Lempel compression
10. The Burrows–Wheeler transform (BWT)
11. Substring search (KMP, BM, Karp–Rabin)
12. Catching up and summary
I expect that you know

• Sorting and searching as taught in basic algorithms
courses: binary search trees, hash tables, {selection,
insertion, quick-, merge-}sort

• Programming

• Principles of algorithm analysis, O-notation etc.


Assignments (preliminary)

1. Word frequencies + compact file
2. Radix sorting (general interface)
3. Word frequencies 2 (trie) + search engine
4. Suffix sorting
5. Entropy coding
6. BWT
Today

• Brief recapitulation of algorithm time complexity

• We take a step back: digital representation of information

• Bitwise operations, numbers


(Not what this course is supposed to teach, but you are
going to need this in your programming assignments)

• If time: counting/bucket sort (“key-indexed counting”)


Algorithmic research
• Come up with algorithms for specific problems

• Determine the “speed” of algorithms

• Find “faster” algorithms

• Prove that:
- A specific algorithm does what it’s supposed to
- A specific algorithm has a certain “speed”
- There can be no algorithm for the problem “faster”
than a certain “speed”
[Figure: Charles Babbage’s analytical engine, “programmed” by Ada Lovelace. Caption: how many times do you have to turn the crank?]
Time complexity of algorithm
• T(N) = a measure of the time it takes to run the program
on an input of size N

• Approximate with “at most proportional to”, O-notation


~, Θ, O

Ex 1. ⅙N³ + 20N + 16 ~ ⅙N³, which is Θ(N³)
Ex 2. ⅙N³ + 100N^(4/3) + 56 ~ ⅙N³ = proportional to N³
Ex 3. ⅙N³ − ½N² + ⅓N ~ ⅙N³ = cubic

Most common measure:
f(N) is O(g(N)) means: f(N) is at most proportional to g(N)
f(N) is Θ(g(N)) means: f(N) is precisely proportional to g(N),
i.e. g(N) is the “best” function such that f(N) is O(g(N))

Formally
f(N) is O(g(N)) means: ∃ constants N0 and c so that if N > N0, then |f(N)| < c·g(N)
f(N) is Ω(g(N)) means: ∃ constants N0 and c so that if N > N0, then |f(N)| > c·g(N)
f(N) is Θ(g(N)) means: f(N) is both O(g(N)) and Ω(g(N))
Time for sorting with pairwise compares (lg = log₂)

• Upper bound: ~N lg N compares (given by mergesort)

• Lower bound: ?

• Optimal algorithm: ?

Start by considering how many possible orderings there are.

Decision tree to find, using compares, which possible ordering of 3 elements is correct:

[Figure: decision tree. The root compares a<b; its children compare b<c and a<c; internal nodes are compares, and the leaves are the orderings abc, acb, cab, bac, bca, cba. Tree height = number of compares in the worst case.]

one leaf per possible ordering: 3! = 6 leaves
Decision tree for possible orderings of N values

• N values a1 to aN. Assume they are all different (equal values are a case we need to manage).
• N! = N · (N−1) · (N−2) · … · 3 · 2 · 1 different orderings
• Tree with compares (internal nodes), with orderings as leaves:
at least N! leaves, no more than 2ʰ leaves

• Worst-case time: height h. A binary tree with h levels has at most 2ʰ leaves, so
2ʰ ≥ N!
h ≥ lg(N!) ~ N lg N (Stirling’s approximation)

• Conclusion: Any algorithm must use lg(N!) ~ N lg N compares (worst case)
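(For the record, the step from lg(N!) to N lg N is Stirling’s approximation: ln N! = N ln N − N + O(ln N), hence lg(N!) = N lg N − N·lg e + O(lg N), whose leading term is N lg N.)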
Stupid question?
• Well known: computers at machine level represent
everything using only 0 and 1

• So how can computers process and output graphics,
audio, or even text, which don’t look like 0s and 1s?

• What does it even mean, “only 0 and 1”?

• (How to explain this to someone without comp sci
knowledge?)
Blondinrikard Fröberg, Listen closely https://flic.kr/p/tRbAcU
στρατός (formerly known as Michelangelo_MI), At the end of the track https://flic.kr/p/4wMSNh
Sound
• A sequence of amplitude values in binary representation

• Parameters: frequency, bits per sample…

Pictures
• Bit-mapped: PNG, JPEG, GIF, …
• Vector: PDF, Postscript, …


http://blog.xkcd.com/2010/05/03/color-survey-results/
Pictures
• Colors: RGB

• 3-byte numbers, 256×256×256 = 16,777,216 different colors

D. B. Gaston, Arabian Nights text cropped https://flic.kr/p/5QvRXv
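A quick sketch of how such a 3-byte color packs into one int, using the bitwise operations covered later in these slides (Java; the class and variable names are my own, chosen for illustration):

class RgbDemo {
    public static void main(String[] args) {
        int r = 200, g = 120, b = 40;
        int rgb = (r << 16) | (g << 8) | b;   // one byte per channel: 0xC87828
        int red   = (rgb >>> 16) & 0xff;      // 200
        int green = (rgb >>> 8) & 0xff;       // 120
        int blue  = rgb & 0xff;               // 40
        System.out.println(red + " " + green + " " + blue);
    }
}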
7-bit ASCII
International (e.g. Scandinavian) characters
• Replace some glyphs: [ ] \ { }
• Use 8 bits: Latin-1 (ISO 8859-1)
• Replace for new chars (€ etc.): Latin-9 (ISO 8859-15)
• Microsoft variant: Windows-1252
• Unicode multibyte: UTF-8, de-facto standard?
UTF-8

Bits needed | Bytes in UTF-8 | Byte 1   | Byte 2   | Byte 3   | Byte 4
7           | 1              | 0xxxxxxx |          |          |
8–11        | 2              | 110xxxxx | 10xxxxxx |          |
12–16       | 3              | 1110xxxx | 10xxxxxx | 10xxxxxx |
17–21       | 4              | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
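A rough sketch of how the table translates into code (Java; the method name utf8Encode is my own, and for real strings the standard library’s String.getBytes(StandardCharsets.UTF_8) does this for you):

class Utf8Demo {
    // Encode one code point (up to 21 bits) following the table above.
    static byte[] utf8Encode(int cp) {
        if (cp < 0x80)      // 7 bits fit in 1 byte
            return new byte[] { (byte) cp };
        if (cp < 0x800)     // 8–11 bits: 2 bytes
            return new byte[] { (byte) (0xC0 | (cp >>> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        if (cp < 0x10000)   // 12–16 bits: 3 bytes
            return new byte[] { (byte) (0xE0 | (cp >>> 12)),
                                (byte) (0x80 | ((cp >>> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        // 17–21 bits: 4 bytes
        return new byte[] { (byte) (0xF0 | (cp >>> 18)),
                            (byte) (0x80 | ((cp >>> 12) & 0x3F)),
                            (byte) (0x80 | ((cp >>> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        for (byte x : utf8Encode('é'))        // U+00E9 -> C3 A9
            System.out.printf("%02X ", x);
    }
}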


Two approaches (which are both kind of the same)

• Recursively split the universe in two parts. Denote the
parts 1 and 0. Binary tree with what we want to represent
in the leaves

• Assign numbers to everything we want to represent.
Encode in base 2 (binary numbers)
Game: 20 questions

• How many “messages” are possible with 20 questions =
bits? (2²⁰ = 1,048,576)

• What if some messages are (much) more likely than
others?

• (How many bits would we need to distinguish any
message ever written by a human?)
Claude Shannon
X has probability p.
Optimal number of bits to represent X: log₂(1/p) bits
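For instance, a message with probability p = 1/256 optimally gets log₂ 256 = 8 bits, and each of 2²⁰ equally likely messages in the 20-questions game gets log₂ 2²⁰ = 20 bits; rarer messages get longer codes, which is what entropy codes like Huffman coding exploit.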

Unary number representation

Positional number system

عبد الله محمد بن موسى الخوارزمي
Abū ʿAbdallāh Muḥammad ibn Mūsā al-Khwārizmī
Indian positional system → “Arabic” numbers
782 = 2·10⁰ + 8·10¹ + 7·10²

[Figure: the ten digit glyphs 0–9]
Binary number representation
decimal:
782 = 2·10⁰ + 8·10¹ + 7·10²

binary:
1100001110 = 0·2⁰ + 1·2¹ + 1·2² + 1·2³ + 0·2⁴ + 0·2⁵ + 0·2⁶ + 0·2⁷ + 1·2⁸ + 1·2⁹
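A minimal sketch of this conversion in code (Java; toBinary is my own name for it, and the standard library’s Integer.toBinaryString does the same job):

class BinaryDemo {
    // Produce the binary digits of n by repeatedly extracting the lowest bit.
    static String toBinary(int n) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(n & 1);   // digit for the current power of two
            n >>>= 1;           // move on to the next power
        } while (n != 0);
        return sb.reverse().toString();   // lowest digit was produced first
    }

    public static void main(String[] args) {
        System.out.println(toBinary(782));   // 1100001110
    }
}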
“8 Questions” for unsigned numbers (octets or “bytes”)

00000000: 0
00000001: 1
00000010: 2
00000011: 3
00000100: 4
⋮
01111110: 126
01111111: 127
10000000: 128
10000001: 129
10000010: 130
⋮
11111110: 254
11111111: 255

Signed 8-bit numbers (octets or “bytes”): two’s complement

By bit pattern:
00000000: 0
00000001: 1
00000010: 2
00000011: 3
00000100: 4
⋮
01111110: 126
01111111: 127
10000000: −128
10000001: −127
10000010: −126
⋮
11111110: −2
11111111: −1

In numeric order: 10000000 (−128) through 11111111 (−1), then 00000000 (0) through 01111111 (127).
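A small sketch of the two readings in Java (whose byte type is signed two’s complement; the class and variable names are mine):

class TwosComplementDemo {
    public static void main(String[] args) {
        int bits = 0b10000001;       // 129 when read as unsigned
        byte signed = (byte) bits;   // Java bytes are signed: -127
        System.out.println(bits + " " + signed);   // 129 -127
        // For patterns 10000000 and up: signed value = unsigned value - 256,
        // e.g. 10000000 -> 128 - 256 = -128, 11111111 -> 255 - 256 = -1.
    }
}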

Floating-point representation
0.250244140625:

Single precision, 32 bits: 1 sign, 8 exponent, 23 mantissa

0 01111101 00000000010000000000000

sign 0: positive
exponent: 01111101 = 125, subtract 127 (exponent bias): −2
mantissa: 1·2⁰ (implicit first 1 bit)
+ 0·2⁻¹ + 0·2⁻² + 0·2⁻³ + …
+ 0·2⁻⁹ + 1·2⁻¹⁰ + 0·2⁻¹¹ + …
= 1.0009765625

1.0009765625 · 2⁻² = 0.250244140625
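A sketch of extracting the three fields in Java, using the standard Float.floatToIntBits (the class and variable names are mine):

class FloatBitsDemo {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(0.250244140625f);
        int sign     = bits >>> 31;            // 1 bit:  0 (positive)
        int exponent = (bits >>> 23) & 0xFF;   // 8 bits: 125 -> 125 - 127 = -2
        int mantissa = bits & 0x7FFFFF;        // 23 bits: 8192 = 2^13, the 2^-10 bit
        System.out.println(sign + " " + exponent + " " + mantissa);
        // value = (1 + 8192/2^23) * 2^(125-127)
        //       = 1.0009765625 * 2^-2 = 0.250244140625
    }
}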
Bitwise operations
• and & ∧ ·

• or | ∨ +

• xor ^ ⊻ ⊕

• not ~ ¬ ¯

• << (shift left)

• >>, >>> (shift right: sign-extending vs. zero-filling)
a b a&b
0 0 0
0 1 0
1 0 0
1 1 1

a & b is true only if both a and b are true


a b a|b
0 0 0
0 1 1
1 0 1
1 1 1
a ∨ b is true if at least one of a and b is true
a b a^b
0 0 0
0 1 1
1 0 1
1 1 0

a ⊕ b is true when a and b are not equal


• Setting bit to 1: or 1 (or 0 means don’t change)

• Setting bit to 0: and 0 (and 1 means don’t change)

• Flipping bit: xor 1 (xor 0 means don’t change)
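In code, the three rules look like this (a Java sketch; the method names are mine, and 1 << i is the mask with only bit i set):

class BitOpsDemo {
    static int setBit(int x, int i)   { return x | (1 << i);  }   // or 1 sets
    static int clearBit(int x, int i) { return x & ~(1 << i); }   // and 0 clears
    static int flipBit(int x, int i)  { return x ^ (1 << i);  }   // xor 1 flips

    public static void main(String[] args) {
        int x = 0b1010;
        System.out.println(Integer.toBinaryString(setBit(x, 0)));    // 1011
        System.out.println(Integer.toBinaryString(clearBit(x, 1)));  // 1000
        System.out.println(Integer.toBinaryString(flipBit(x, 3)));   // 10
    }
}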


decimal: 9 + 5 = 14          decimal: 11 + 7 = 18
binary: 1001 + 101 = 1110    binary: 1011 + 111 = 10010

[Figure: column addition with a carry row C above the rows for A and B]

Each row is an operation with three input bits and two output bits:

outᵢ = Aᵢ ^ Bᵢ ^ Cᵢ
Cᵢ₊₁ = (Aᵢ & Bᵢ) | (Cᵢ & (Aᵢ ^ Bᵢ))
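A sketch of these two formulas applied bit by bit, as a ripple-carry adder in Java (the class and method names are mine; hardware evaluates the same logic in parallel rather than in a loop):

class AdderDemo {
    // Add two ints using only the bitwise formulas above.
    static int add(int a, int b) {
        int sum = 0, c = 0;                    // c = carry into bit i
        for (int i = 0; i < 32; i++) {
            int ai = (a >>> i) & 1, bi = (b >>> i) & 1;
            sum |= (ai ^ bi ^ c) << i;         // out_i = A_i ^ B_i ^ C_i
            c = (ai & bi) | (c & (ai ^ bi));   // C_(i+1) per the formula above
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(add(9, 5) + " " + add(11, 7));   // 14 18
    }
}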
Groups of bits

• Groups of 3: octal

• Groups of 4: hex[adecimal]

• Groups of 8: byte “strings”


(One strange representation: IPv4 32-bit address: “192.168.10.199”)
Let’s calculate

• 2022 · 666 = 1346652

• In binary: 111|11100110 · 10|10011010 = 10100|10001100|01011100

Split each number into base-256 digits:
2022 & 0xff = 230, 2022 >>> 8 = 7, so 2022 = (7, 230) in base 256
666 & 0xff = 154, 666 >>> 8 = 2, so 666 = (2, 154) in base 256

Long multiplication, one base-256 digit at a time:
154·230 = 35420: 35420 & 0xff = 92, carry 35420 >>> 8 = 138
154·7 + 138 = 1216: 1216 & 0xff = 192, carry 1216 >>> 8 = 4
2·230 = 460: 460 & 0xff = 204, carry 460 >>> 8 = 1
2·7 + 1 = 15

Add the two partial products, digit by digit:
192 + 204 = 396: 396 & 0xff = 140, carry 396 >>> 8 = 1
1 + 4 + 15 = 20

20·256² + 140·256¹ + 92·256⁰ = 1346652 = 666·2022
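The same procedure as general code: a sketch of schoolbook multiplication on base-256 digits in Java (the method name multiply and the least-significant-digit-first layout are my choices):

class MultiplyDemo {
    // Schoolbook multiplication on base-256 digits, least significant first.
    static int[] multiply(int[] x, int[] y) {
        int[] z = new int[x.length + y.length];
        for (int i = 0; i < x.length; i++) {
            int carry = 0;
            for (int j = 0; j < y.length; j++) {
                int t = z[i + j] + x[i] * y[j] + carry;
                z[i + j] = t & 0xff;   // low byte stays in this position
                carry = t >>> 8;       // high part carries to the next
            }
            z[i + y.length] += carry;
        }
        return z;
    }

    public static void main(String[] args) {
        // 2022 = (230, 7), 666 = (154, 2) in base 256, low digit first:
        int[] z = multiply(new int[] {230, 7}, new int[] {154, 2});
        System.out.println(java.util.Arrays.toString(z));   // [92, 140, 20, 0]
        // 92 + 140*256 + 20*256^2 = 1346652
    }
}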
Endianness

• Little-endian: byte 0 at address 0, byte 1 at address 1, …
Makes sense because it’s analogous to bit 0 being the least significant bit

• Big-endian: byte 0 at the last address of the word,
most significant byte at address 0…
Makes sense because you can sort (unsigned) numbers as if they were strings
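A sketch of the two byte orders in Java, using the standard java.nio.ByteBuffer (the class name EndianDemo is mine):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class EndianDemo {
    public static void main(String[] args) {
        int n = 1346652;   // 0x00148C5C, from the multiplication above
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                                  .putInt(n).array();
        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN)
                               .putInt(n).array();
        System.out.println(Arrays.toString(little));  // [92, -116, 20, 0]  = 5C 8C 14 00
        System.out.println(Arrays.toString(big));     // [0, 20, -116, 92] = 00 14 8C 5C
    }
}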
Counting multi-byte ints
A first go at sorting in less than O(N log N)

Next lecture!
