
Lecture 2: Models of Computation

6.006 Fall 2011

Lecture Overview
• What is an algorithm? What is time?

• Random access machine

• Pointer machine

• Python model

• Document distance: problem & algorithms

History
Al-Khwārizmī (“al-kha-raz-mi”, c. 780–850)

• “father of algebra” with his book “The Compendious Book on Calculation by Completion & Balancing”

• linear & quadratic equation solving: some of the first algorithms

What is an Algorithm?
• Mathematical abstraction of computer program

• Computational procedure to solve a problem

Figure 1: An algorithm is the mathematical analog of a computer program: a program is written in a programming language and built on top of a computer, while an algorithm is written in pseudocode and built on top of a model of computation.

Model of computation specifies


• what operations an algorithm is allowed to perform

• cost (time, space, . . . ) of each operation

• cost of algorithm = sum of operation costs


Random Access Machine (RAM)

[Figure: memory drawn as a big array of words, addressed 0, 1, 2, 3, . . . ; each cell holds one word]

• Random Access Memory (RAM) modeled by a big array

• Θ(1) registers (each 1 word)

• In Θ(1) time, can

  – load word @ ri into register rj
  – compute (+, −, ∗, /, &, |, ˆ) on registers
  – store register rj into memory @ ri

• What’s a word? w ≥ lg (memory size) bits

– assume basic objects (e.g., int) fit in word


– unit 4 in the course deals with big numbers

• realistic and powerful → implement abstractions
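
To make this concrete, here is a tiny Python sketch (our illustration, not part of the lecture) treating a list as the RAM’s word array and a few locals as registers:

    # A toy RAM: memory is a big array of words; registers are a few locals.
    memory = [0] * 1024             # each cell holds one machine word

    def ram_demo():
        r0, r1 = 5, 7               # Θ(1) registers
        memory[0] = r0              # store register into memory @ address 0
        memory[1] = r1              # store register into memory @ address 1
        r2 = memory[0] + memory[1]  # load two words and add: each step Θ(1)
        memory[2] = r2
        return memory[2]

    print(ram_demo())               # 12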

Pointer Machine
• dynamically allocated objects (namedtuple)

• object has O(1) fields

• field = word (e.g., int) or pointer to object/null (a.k.a. reference)

• weaker than (can be implemented on) RAM


[Figure: a doubly linked list of two nodes. Node 1: val = 5, prev = null, next → node 2. Node 2: val = −1, prev → node 1, next = null.]
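
A sketch of the figure’s two-node list as Python objects (the lecture mentions namedtuple; a plain class is used here so the pointer fields can be rewired, which namedtuple’s immutability would not allow):

    class Node:
        """A pointer-machine object: O(1) fields, each a word or a pointer."""
        def __init__(self, val):
            self.val = val      # word field
            self.prev = None    # pointer field (null)
            self.next = None    # pointer field (null)

    a, b = Node(5), Node(-1)
    a.next, b.prev = b, a       # wire up the doubly linked list from the figure

    x = a
    x = x.next                  # Θ(1): follow one pointer
    print(x.val)                # -1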

Python Model
Python lets you use either mode of thinking

1. “list” is actually an array → RAM


L[i] = L[j] + 5 → Θ(1) time

2. object with O(1) attributes (including references) → pointer machine


x = x.next → Θ(1) time

Python has many other operations. To determine their cost, imagine implementation in
terms of (1) or (2):

1. list

   (a) L.append(x) → Θ(1) time
       obvious if you think of an infinite array,
       but how would you have an infinite array on the RAM?
       via table doubling [Lecture 9]

   (b) L = L1 + L2  ≡  L = []              → Θ(1)
                       for x in L1:        → Θ(|L1|) iterations
                           L.append(x)     → Θ(1)
                       for x in L2:        → Θ(|L2|) iterations
                           L.append(x)     → Θ(1)
       total: Θ(1 + |L1| + |L2|) time (see the timing sketch after this list)



3
Lecture 2 6.006 Fall 2011

   (c) L1.extend(L2)  ≡  for x in L2:              → Θ(1 + |L2|) time
       (≡ L1 += L2)          L1.append(x) → Θ(1)

   (d) L2 = L1[i:j]  ≡  L2 = []                    → Θ(j − i + 1) = O(|L1|)
                        for k in range(i, j):
                            L2.append(L1[k]) → Θ(1)

   (e) b = x in L    ≡  for y in L:                → Θ(index of x) = O(|L|)
       & L.index(x)         if x == y:     → Θ(1)
       & L.find(x)              b = True
                                break
                        else:
                            b = False

   (f) len(L) → Θ(1) time (the list stores its length in a field)

   (g) L.sort() → Θ(|L| log |L|) time, via comparison sort [Lectures 3, 4 & 7]

2. tuple, str: similar (think of them as immutable lists)

3. dict: via hashing [Unit 3 = Lectures 8–10]

   D[key] = val  → Θ(1) time w.h.p.
   key in D      → Θ(1) time w.h.p.

4. set: similar (think of as dict without vals)

5. heapq: heappush & heappop, via heaps [Lecture 4] → Θ(log n) time

6. long: via Karatsuba algorithm [Lecture 11]

   x + y → O(|x| + |y|) time, where |x| and |y| denote lengths in words
   x ∗ y → O((|x| + |y|)^lg 3) ≈ O((|x| + |y|)^1.58) time
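
To see these costs empirically, here is a minimal timing sketch (our own illustration) contrasting Θ(1) amortized append against repeated concatenation, which copies the whole list each time:

    import timeit

    def by_append(n):
        L = []
        for i in range(n):
            L.append(i)      # Θ(1) amortized, via table doubling
        return L

    def by_concat(n):
        L = []
        for i in range(n):
            L = L + [i]      # Θ(|L|) per step: copies the whole list
        return L             # total Θ(n²)

    for n in (1000, 2000, 4000):
        t_app = timeit.timeit(lambda: by_append(n), number=10)
        t_cat = timeit.timeit(lambda: by_concat(n), number=10)
        # append should scale roughly linearly in n, concat roughly quadratically
        print(f"n={n}: append {t_app:.4f}s, concat {t_cat:.4f}s")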

Document Distance Problem — compute d(D1, D2)


The document distance problem has applications in finding similar documents, detecting
duplicates (Wikipedia mirrors and Google) and plagiarism, and also in web search (D2 =
query).
Some Definitions:

• Word = sequence of alphanumeric characters

• Document = sequence of words (ignore space, punctuation, etc.)

The idea is to define distance in terms of shared words. Think of document D as a vector:
D[W] = # occurrences of word W. For example:

4
Lecture 2 6.006 Fall 2011

[Figure 2: D1 = “the cat” and D2 = “the dog” drawn as vectors on axes “the”, “cat”, and “dog”]

As a first attempt, define document distance as

d′(D1, D2) = D1 · D2 = Σ_W D1[W] · D2[W]

The problem is that this is not scale invariant: long documents with 99% of their words in
common appear farther apart than short documents with only 10% in common.
This can be fixed by normalizing by the lengths of the vectors:

d′′(D1, D2) = (D1 · D2) / (|D1| · |D2|)

where |Di| = √(Di · Di) is the Euclidean length (norm) of document vector Di. The geometric
(rescaling) interpretation is

d(D1, D2) = arccos(d′′(D1, D2))

i.e., the document distance is the angle between the vectors. An angle of 0◦ means the two
documents are identical, whereas an angle of 90◦ means they have no words in common. This
approach was introduced by [Salton, Wong, Yang 1975].
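
As a quick illustration, here is a minimal sketch (our own code, not the course’s) that computes this angle directly from two strings:

    import math
    from collections import Counter

    def doc_dist(doc1, doc2):
        """Angle in radians between the word-frequency vectors of two texts."""
        D1 = Counter(doc1.lower().split())
        D2 = Counter(doc2.lower().split())
        dot = sum(D1[w] * D2[w] for w in D1 if w in D2)     # D1 · D2
        norm1 = math.sqrt(sum(c * c for c in D1.values()))  # |D1|
        norm2 = math.sqrt(sum(c * c for c in D2.values()))  # |D2|
        return math.acos(dot / (norm1 * norm2))

    print(doc_dist("the cat", "the dog"))  # ≈ 1.047 rad = 60°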

Document Distance Algorithm


1. split each document into words

2. count word frequencies (document vectors)

3. compute dot product (& divide)


(1) re.findall(r"\w+", doc) → what cost?
    in general, re can take exponential time, so instead:

    for char in doc:                             → Θ(|doc|)
        if not alphanumeric:
            add previous word (if any) to list   → Θ(1)
            start new word
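
For example, a quick sanity check of the regex split (a sketch; the course code processes input line by line):

    import re

    doc = "The cat, the dog -- and 2 hats!"
    words = re.findall(r"\w+", doc.lower())  # runs of alphanumerics, lowercased
    print(words)  # ['the', 'cat', 'the', 'dog', 'and', '2', 'hats']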

(2) sort word list                       ← O(k log k · |word|) where k = # words

    for word in list:                    → O(Σ |word|) = O(|doc|)
        if same as last word:            ← O(|word|)
            increment counter            → Θ(1)
        else:
            add last word and count to output list
            reset counter


(3) for word, count1 in doc1:            ← Θ(k1) iterations
        if word in doc2 (as count2):     ← Θ(k2) per lookup
            total += count1 * count2     → Θ(1)
    total: O(k1 · k2)

(3)′ start at first word of each (sorted) list    → O(Σ |word|) = O(|doc|) total
     if words equal:                 ← O(|word|)
         total += count1 * count2
     if word1 ≤ word2:               ← O(|word|)
         advance list1
     else:
         advance list2
     repeat until either list is done
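
A minimal Python sketch of this merge (assuming each input is a list of (word, count) pairs already sorted by word):

    def merge_dot(list1, list2):
        """Dot product of two sorted (word, count) lists in linear time."""
        total = i = j = 0
        while i < len(list1) and j < len(list2):
            w1, c1 = list1[i]
            w2, c2 = list2[j]
            if w1 == w2:
                total += c1 * c2  # shared word contributes to the dot product
            if w1 <= w2:
                i += 1            # advance list1 (also when words are equal)
            else:
                j += 1            # advance list2
        return total

    # D1 = "the cat", D2 = "the dog": only "the" is shared
    print(merge_dot([("cat", 1), ("the", 1)], [("dog", 1), ("the", 1)]))  # 1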

Dictionary Approach

(2)′ count = {}
     for word in doc:                                → O(|doc|) w.h.p.
         if word in count:      ← Θ(|word|) + Θ(1) w.h.p.
             count[word] += 1   → Θ(1)
         else:
             count[word] = 1    → Θ(1)

(3) as above, now with Θ(1) w.h.p. dictionary lookups → O(|doc1|) w.h.p.
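
Putting the dictionary approach together (a sketch with our own helper names, mirroring (2)′ and the dictionary version of (3)):

    def count_words(words):
        """(2)′: build a word-frequency dictionary in O(|doc|) time w.h.p."""
        count = {}
        for word in words:
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
        return count

    def dict_dot(D1, D2):
        """(3) with Θ(1) w.h.p. lookups: O(k1) time total."""
        return sum(c1 * D2[w] for w, c1 in D1.items() if w in D2)

    D1 = count_words("the cat".split())
    D2 = count_words("the dog".split())
    print(dict_dot(D1, D2))  # 1 (only "the" is shared)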


Code (lecture2 code.zip & data.zip on website)

t2.bobsey.txt: 268,778 chars / 49,785 words / 3,354 uniq
t3.lewis.txt: 1,031,470 chars / 182,355 words / 8,534 uniq
seconds on Pentium 4, 2.8 GHz, C-Python 2.6.2, Linux 2.6.26

• docdist1: 228.1 — (1), (2), (3) (with extra sorting): words = words + words on line

• docdist2: 164.7 — words += words on line

• docdist3: 123.1 — (3)’ . . . with insertion sort

• docdist4: 71.7 — (2)’ but still sort to use (3)’

• docdist5: 18.3 — split words via string.translate

• docdist6: 11.5 — merge sort (vs. insertion)

• docdist7: 1.8 — (3) (full dictionary)

• docdist8: 0.2 — whole doc, not line by line

MIT OpenCourseWare
http://ocw.mit.edu

6.006 Introduction to Algorithms


Fall 2011

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
