Principles of Algorithm
Analysis
Biostatistics 615
Lecture 3
Problem Set…
z Questions?
z FAQ:
• How to compile C code?
• 10 minute introduction for the adventurous…
C Compilers
z Popular commercial compilers are:
• Borland C++ Builder
• Microsoft Visual C++
• Metrowerks Codewarrior
z Several free compilers also available
• Borland C++ (older version, no graphics)
• GCC
GCC
z GNU C Compiler
• Available on most UNIX systems
• Also on newer Macintosh computers
z For Windows, download from
• www.mingw.org
• www.cygwin.com
Running GCC
z Command line application
z Basic usage is:
gcc -o program_name source.c
• Use extension “.c” or “.cpp” for source code
z Reads in source(s), creates executable
program
A simple program …
/* This is a comment */
#include <stdio.h>
#include <stdlib.h>
int main()
{
    int lucky;             // Variable declaration
    srand(123456);         // Initialize random numbers
    lucky = rand() % 49;   // Generate a random number between 0 and 48
    printf("Hello! My lucky number is %d\n", lucky);
    return 0;
}
Good editors for C programs…
z Commercial compilers provide very
fancy editors
z A good free alternative is nedit
• On Windows, available when you install
cygwin.
Today
z Strategies for comparing algorithms
z Common relationships between
algorithm complexity and input data
z Compare two simple search algorithms
Objectives
z Framework for
• empirical testing
• approximate analysis
z Highlight performance characteristics of
algorithms
Specific Questions
z Compare two algorithms for one task
z Predict performance in a new environment
• If we had a computer that was 10x faster and could
handle 10x more data, how would each approach perform?
z Set values of algorithm parameters
Two Common Mistakes
z Ignore performance of algorithm
• Shun faster algorithms to avoid complexity in program
• Instead, wait for simple N^2 algorithms to finish, when N log N
alternatives of modest complexity are available
z Too much weight on performance of algorithm
• Improving a very fast program is not worth it
• Spending too much time tinkering with code is rarely
good use of time
Empirical analysis
z Given two algorithms … which is better?
z Run both
• Say, algorithm A takes 3 seconds
• Say, algorithm B takes 30 seconds
z Empirical studies may not always be practical
• Some algorithms may take too long to run!
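As a rough illustration of "run both", the sketch below times two interchangeable routines with the standard `clock()` function. The routines `sum_by_loop` and `sum_by_formula` and the helper `time_algorithm` are hypothetical stand-ins invented for this example, not algorithms from the lecture.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical "algorithm A": add up 1..n one term at a time. */
long long sum_by_loop(long long n)
{
    long long total = 0;
    for (long long i = 1; i <= n; i++)
        total += i;
    return total;
}

/* Hypothetical "algorithm B": closed form n(n+1)/2. */
long long sum_by_formula(long long n)
{
    return n * (n + 1) / 2;
}

/* Call f(n) `repeats` times and return the elapsed CPU time in seconds. */
double time_algorithm(long long (*f)(long long), long long n, int repeats)
{
    assert(repeats > 0);
    /* Accumulating into a volatile gives each call an observable effect,
       so the calls are not optimized away entirely. */
    volatile unsigned long long sink = 0;
    clock_t begin = clock();
    for (int r = 0; r < repeats; r++)
        sink += (unsigned long long) f(n);
    clock_t end = clock();
    return (double) (end - begin) / CLOCKS_PER_SEC;
}
```

Calling `time_algorithm(sum_by_loop, n, repeats)` and `time_algorithm(sum_by_formula, n, repeats)` with the same arguments gives two numbers to compare. The measured values depend on the compiler, optimization settings and machine.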
Choices of Input Data
z Actual data
• Measures performance in use
z Random data
• Generic approach, may not be representative
z Perverse data
• Attempt worst case analysis
Limitations of Empirical
Analysis
z Quality of implementation
• Is our favored implementation coded more
carefully than another?
z Extraneous factors
• Compiler
• Machine
• Computer system
Limitations of Empirical
Analysis
z Requires a working program
z Theoretical analysis is an alternative
• Estimate potential gains
z Predict effectiveness relative to new
algorithms or computers (that may not
yet exist)
Theoretical Analysis
z Predict performance of algorithm based
on theoretical properties
z “Independent” of actual implementation
z Several constructs occur frequently in
algorithm analysis
Limitations of Theoretical
Analysis
z Efficiency can depend on compiler
z Efficiency may fluctuate with input data
z Some algorithms are not well understood
The idea…
z Given a code fragment
// Find parent of node i
i = a[i];
z Consider how many times it is executed
z But not how long each execution takes
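One way to make the counting concrete is to wrap the fragment in a loop and count how often it runs. This is a minimal sketch that assumes a parent array in which a root points to itself (`a[i] == i`). That convention and the function name `count_parent_steps` are assumptions for illustration, not something stated on the slide.

```c
#include <assert.h>

/* Follow parent links from node i up to the root and return how many
   times the statement `i = a[i]` was executed.
   Assumption: a root is a node with a[i] == i. */
int count_parent_steps(const int a[], int i)
{
    assert(i >= 0);
    int steps = 0;
    while (a[i] != i) {
        i = a[i];      /* the fragment being counted */
        steps++;
    }
    return steps;
}
```

The return value is the quantity the analysis cares about: the number of executions, independent of how fast any single execution is on a given machine.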
Two typical analyses
z Average-case for random input
z Worst-case
z Are these representative of real world
problems?
• Check theoretical predictions against empirical results…
The Primary Parameter N
z Examples
• Degree of polynomial
• Number of characters in a string
• Size of file to be sorted
• Number of input data items
• Some other abstract measure of problem size
z With multiple parameters, we can often hold
one of them constant
Running time as a function of N
f(N)       Description    Running time when N doubles…
1          constant       -
log N      logarithmic    constant increase
N          linear         doubles
N log N    log-linear     more than doubles
N^2        quadratic      increases fourfold
N^3        cubic          increases eightfold
2^N        exponential    running time squares
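The doubling column can be checked numerically by dividing the cost at 2N by the cost at N. The sketch below does this for four rows of the table. All function names are hypothetical, and `cost_loglinear` uses a loop that counts halvings in place of a library logarithm, which is exact only when N is a power of two.

```c
#include <assert.h>

/* Cost models from the table, counted in abstract "steps". */
double cost_linear(double n)    { return n; }
double cost_quadratic(double n) { return n * n; }
double cost_cubic(double n)     { return n * n * n; }

/* N log2 N, with log2 N obtained by counting halvings
   (equals log2 N exactly when N is a power of two). */
double cost_loglinear(double n)
{
    int lg = 0;
    for (double m = n; m > 1; m /= 2)
        lg++;
    return n * lg;
}

/* Factor by which the cost grows when the problem size doubles. */
double doubling_ratio(double (*cost)(double), double n)
{
    assert(n > 0);
    return cost(2 * n) / cost(n);
}
```

For example, `doubling_ratio(cost_quadratic, 1000)` evaluates to 4 and `doubling_ratio(cost_loglinear, 1024)` to 2.2, matching "increases fourfold" and "more than doubles".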
Running time as a function of N
z Multiple terms may be involved
• e.g. N + N log N
z Typically, we ignore
• Smaller terms
• Constant coefficient
• Focus on inner loop
z In rare cases, smaller terms and constant
coefficient will be important
Time to Solve Large Problem
Problem Size N = 1,000,000
operations per second    N          N log N    N^2
10^6                     seconds    minutes    months
10^9                     instant    instant    hours
10^12                    instant    instant    seconds
Time to Solve Huge Problem
Problem Size N = 1,000,000,000
operations per second    N          N log N    N^2
10^6                     hours      days       never
10^9                     seconds    minutes    centuries
10^12                    instant    instant    months
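Entries like these come from dividing an operation count by the machine speed. The helper below is a minimal sketch of that arithmetic. The name `seconds_needed` is hypothetical, and the labels in the tables are best read as coarse order-of-magnitude descriptions rather than exact results.

```c
#include <assert.h>

/* Rough running time in seconds: number of operations divided by the
   machine's speed in operations per second. */
double seconds_needed(double operations, double ops_per_second)
{
    assert(ops_per_second > 0);
    return operations / ops_per_second;
}
```

With N = 1,000,000, an N^2 algorithm performs 10^12 operations, so `seconds_needed(1e12, 1e12)` is 1 second on the fastest machine. With N = 1,000,000,000, an N^2 algorithm performs 10^18 operations, and `seconds_needed(1e18, 1e6)` is 10^12 seconds, which is tens of thousands of years on the slowest machine.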
Big-Oh Notation
z Algorithm is O(N) or O(N log N)
• Common statement
• What does it mean?
z Summarizes performance for large N
z Focuses on leading terms of expression
describing running time
Big-Oh Notation
z Consider function g(N)
z It is said to be O(f(N))
z If there exist constants c_0 and N_0 such that:
• N > N_0 implies c_0 f(N) > g(N)
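A small worked instance may help. The function g(N) = 3N + 10 and the constants below are chosen purely for illustration. They show that g(N) is O(N):

```latex
\[
g(N) = 3N + 10,\quad f(N) = N,\quad c_0 = 4,\quad N_0 = 10:
\qquad N > 10 \;\Longrightarrow\; c_0 f(N) = 4N = 3N + N > 3N + 10 = g(N).
\]
```

Any larger c_0 or N_0 would serve equally well; the definition only requires that some pair of constants exists.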
From N to Running Time…
z Common relationships
• N^2
• log N
• N log N
• N
z Describe examples of how these arise
z Cost of running program is C_N
O(N^2)
z Loop through input successively, eliminate
one item at a time
C_N = C_{N-1} + N for N ≥ 2, C_1 = 1
= C_{N-2} + (N - 1) + N
...
= 1 + 2 + ... + (N - 1) + N
= N (N + 1) / 2
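The closed form can be checked by evaluating the recurrence directly. This is a minimal sketch; the function name `cost_eliminate_one` is hypothetical.

```c
#include <assert.h>

/* Evaluate C_N = C_{N-1} + N with C_1 = 1 by direct iteration. */
long long cost_eliminate_one(int n)
{
    assert(n >= 1);
    long long c = 1;               /* C_1 */
    for (int k = 2; k <= n; k++)
        c += k;                    /* C_k = C_{k-1} + k */
    return c;
}
```

For any N ≥ 1 the result equals N(N + 1)/2, for instance 55 when N = 10, so the cost grows as N^2/2 for large N.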
O(log N)
z Recursive program, halves input in one step
C_{2^n} = C_{2^{n-1}} + 1 for N ≥ 2, C_1 = 1
= C_{2^{n-2}} + 1 + 1
= C_{2^{n-3}} + 3
...
= C_{2^0} + n
= n + 1
N = 2^n
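Since N = 2^n means n = log2 N, the result reads C_N = log2 N + 1 when N is a power of two. The recursive sketch below evaluates the same recurrence, written as C_N = C_{N/2} + 1. The name `cost_halve` is hypothetical.

```c
#include <assert.h>

/* Evaluate C_N = C_{N/2} + 1 with C_1 = 1 (integer division for N/2). */
long long cost_halve(long long n)
{
    assert(n >= 1);
    return (n == 1) ? 1 : cost_halve(n / 2) + 1;
}
```

For N = 1024 = 2^10 this returns 11, which is n + 1 with n = 10.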
O(N log N)
z Recursive program, processes each item,
splits input into two halves, examines each
one…
C_N = 2 C_{N/2} + N for N ≥ 2, C_1 = 0
With N = 2^n:
C_{2^n} = 2 C_{2^{n-1}} + 2^n
C_{2^n} / 2^n = (2 C_{2^{n-1}} + 2^n) / 2^n
= C_{2^{n-1}} / 2^{n-1} + 1
= C_{2^{n-2}} / 2^{n-2} + 1 + 1
...
= n
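Multiplying the last line back by 2^n gives C_{2^n} = n 2^n, that is, C_N = N log2 N when N is a power of two. A direct evaluation of the recurrence, sketched below with the hypothetical name `cost_split`, agrees with this.

```c
#include <assert.h>

/* Evaluate C_N = 2 C_{N/2} + N with C_1 = 0 (integer division for N/2). */
long long cost_split(long long n)
{
    assert(n >= 1);
    if (n == 1)
        return 0;
    return 2 * cost_split(n / 2) + n;
}
```

For N = 8 the function returns 24 = 8 × 3, and for N = 1024 it returns 10240 = 1024 × 10.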
O(2N)
z Halves input, must examine each item…
C_N = C_{N/2} + N for N ≥ 2, C_1 = 1
= N + N/2 + N/4 + N/8 + ...
≈ 2N
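For powers of two the sum can be evaluated exactly: N + N/2 + ... + 2 + 1 = 2N - 1, which is where the approximation 2N comes from. The sketch below evaluates the recurrence directly. The name `cost_halve_and_scan` is hypothetical.

```c
#include <assert.h>

/* Evaluate C_N = C_{N/2} + N with C_1 = 1 (integer division for N/2). */
long long cost_halve_and_scan(long long n)
{
    assert(n >= 1);
    if (n == 1)
        return 1;
    return cost_halve_and_scan(n / 2) + n;
}
```

For N = 1024 this returns 2047, so the total work stays proportional to N even though the input is processed repeatedly.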
Application
z Analysis of two search algorithms
z Consider a set of items
• Evaluate functions to decide whether a
particular item is present…
Sequential Search
int search(int a[], int value, int start, int stop)
{
    // Variable declarations
    int i;
    // Search through each item
    for (i = start; i <= stop; i++)
        if (value == a[i])
            return i;
    // Search failed
    return -1;
}
Sequential Search Properties
z Algorithm:
• Look through array sequentially, until we find a match
z Average cost
• If match found: N/2
• If match not found: N
z Actual cost depends on fraction of successful
searches
Better Sequential Search
z If items are sorted…
z Stop unsuccessful search early, when
we reach item with higher value
• Cost for unsuccessful searches is now N/2
z Overall, algorithm is still O(N)
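A minimal sketch of the early-stopping variant follows, assuming the array is sorted in ascending order. The function name `search_sorted` is hypothetical; the parameters mirror the sequential search routine.

```c
#include <assert.h>

/* Sequential search in an array sorted in ascending order.
   Returns the index of value, or -1 if it is absent. */
int search_sorted(const int a[], int value, int start, int stop)
{
    assert(start >= 0);
    for (int i = start; i <= stop; i++) {
        if (a[i] == value)
            return i;
        if (a[i] > value)      /* passed the place where value would be */
            break;
    }
    return -1;
}
```

An unsuccessful search now stops at the first item larger than the target instead of scanning to the end. For targets spread evenly over the range this examines about half the array on average, but the cost still grows linearly with N.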
Binary Search
int search(int a[], int value, int start, int stop)
{
    while (stop >= start)
    {
        // Find midpoint (written this way so start + stop cannot overflow)
        int mid = start + (stop - start) / 2;
        // Compare midpoint to value
        if (value == a[mid]) return mid;
        // Reduce input in half!!!
        if (value > a[mid])
            { start = mid + 1; }
        else
            { stop = mid - 1; }
    }
    // Search failed
    return -1;
}
Binary Search Properties
z Algorithm:
• Halve number of items to consider with each
comparison
z Worst-case cost
• Maximum cost is never greater than ⌊log2 N⌋ + 1 comparisons
z Much better than sequential search, but even
better methods exist!
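The worst-case bound can be observed by instrumenting the search to count how many array elements it examines. In the sketch below the output parameter `probes` and the name `search_counting` are hypothetical additions for illustration. The array must be sorted in ascending order.

```c
#include <assert.h>

/* Binary search that also reports how many array elements it examined.
   Returns the index of value, or -1 if it is absent. */
int search_counting(const int a[], int value, int start, int stop, int *probes)
{
    assert(probes != 0);
    *probes = 0;
    while (stop >= start) {
        int mid = start + (stop - start) / 2;
        (*probes)++;                    /* one more element examined */
        if (value == a[mid])
            return mid;
        if (value > a[mid])
            start = mid + 1;            /* keep the upper half */
        else
            stop = mid - 1;             /* keep the lower half */
    }
    return -1;
}
```

For a table of N = 1000 elements, `*probes` never exceeds 10 = ⌊log2 1000⌋ + 1, whether or not the value is present.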
Sequential vs. Binary Search
        M = 1,000     M = 10,000    M = 100,000
N       S      B      S      B      S       B
125     1      1      13     2      130     20
250     3      0      25     2      251     22
500     5      0      49     3      492     23
1250    13     0      128    3      1276    25
2500    26     1      267    3      *       28
Timings in seconds, for M searches in table of N elements
(S = sequential search, B = binary search)
Summary
z Outline principles for analysis of
algorithms
z Introduced some common relationships
between N and running time
z Described two simple search algorithms
Further Reading
z Read chapter 2 of Sedgewick
Tip of the Day:
Defensive Programming
z Document code
• Indicate intended purpose
• Specify required inputs
• Always indicate author
z Check for error conditions