CS Preliminaries: ECS289A
Computer Science
• Computational solutions to problems: algorithms
• Programming the solutions: programs
• Data storage and access: databases
• Data Analysis: for hypothesis generation and
testing
• Human-computer Interfaces: interaction with data
• Building systems: hardware and software
• Education
What is a solution to a problem:
an algorithm
• A procedure designed to perform a certain
task, or solve a particular problem
• Algorithms are recipes: ordered lists of
steps to follow in order to complete a task
• Abstract idea behind particular
implementation in a computer program
1. Algorithms in Bioinformatics
Theoretical Computer Scientists are
contributors to the genomic revolution
• Sequence comparison
• Genome Assembly
• Phylogenetic Trees
• Microarray design (SBH)
• Data Integration
• Gene network inference
Algorithm Design
• Recognize the structure of a given problem:
– Where does it come from?
– What does it remind you of?
– How does it relate to established problems?
• Build on existing, efficient data structures
and algorithms to solve the problem
• If the problem is difficult to solve efficiently,
use approximation algorithms
Problems and Solutions
In algorithmic lingo:
• Problems are precisely specified, general
mathematical tasks that take variables as
input and yield variables as output.
• Particularizations (assigning values to the
variables) are called instances.
• Problem: Multiply(a,b): Given integers a
and b, compute their product a*b.
• Instance: Multiply (13, 243).
Algorithms produce solutions for any given
instance of a general problem
Multiply(a,b):
0) Let Product = 0
1) Take the k-th rightmost digit of b
and multiply a by it. Attach k-1 zeros
to the right, and add to Product.
2) Repeat Step 1 for all digits of b (k = 1, 2, ...).
3) Product now equals a*b
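The steps above can be sketched in Python. This is a minimal illustration, assuming a and b are non-negative integers; the function name `multiply` and the use of string conversion to read the digits of b are choices made here, not part of the slide.

```python
def multiply(a: int, b: int) -> int:
    """Schoolbook multiplication of non-negative integers, one digit of b at a time."""
    product = 0                                   # Step 0
    digits = str(b)[::-1]                         # digits of b, rightmost first
    for k, digit in enumerate(digits, start=1):
        partial = a * int(digit)                  # multiply a by the k-th rightmost digit
        partial *= 10 ** (k - 1)                  # attach k-1 zeros to the right
        product += partial                        # add to Product
    return product                                # Step 3: product equals a*b


print(multiply(13, 243))  # 3159
```

The call at the end solves the instance Multiply(13, 243) from the previous slide.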
Algorithm Analysis
• Correctness
– Exact solutions require a proof of correctness
– Heuristics: approximate solutions
• Resource Efficiency (complexity)
– Time: number of steps to follow to obtain a
solution as a function of the input size
– Space: amount of memory required for the
algorithm execution
• Best, Average, and Worst Case Analysis
Time / Space Complexity
• Input size: how many units of
constant size does it take to represent
the input? This is dependent on the
computational model, but can be
thought of as the storage size of the
input. The input size is usually n.
• Running time: f(n) = const., n, log n,
Poly(n), e^n
Big Oh Notation
• Asymptotic upper bound on the number of
steps an algorithm takes (in the worst case)
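Stated formally, using the standard textbook definition (the symbols c and n_0 are not named on the slide): f is O(g) when g, scaled by some constant, bounds f from above for all sufficiently large inputs.

```latex
f(n) = O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \ge 0 \ \text{such that}\ f(n) \le c \cdot g(n) \ \text{for all}\ n \ge n_0
```

For example, f(n) = 3n^2 + 5n is O(n^2): take c = 4 and n_0 = 5, since 5n <= n^2 whenever n >= 5.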
Complexity, contd.
• NP-complete problems: a polynomial algorithm
for one of them would mean all problems in NP
are polynomial time
• But no polynomial-time algorithm is known
for any NP-complete problem
• P ≠ NP? Still unsolved, although strongly
suspected true.
• NP-complete problems: 3-SAT, Hamiltonian
Cycle, Vertex Cover, Maximum Clique, etc.
Thousands of NP-complete problems known
• Compendium:
http://www.nada.kth.se/~viggo/problemlist/compendium.html
Why All That?
• Many important problems in the real world
tend to be NP-complete
• That means exact solutions are
intractable, except for very small instances
• Proving a problem to be NP-complete is
just a first step: a good algorist would use
good and efficient heuristics
Popular Algorithms
• Sorting
• String Matching
• Graph Algorithms
– Graph representation: linked lists, incidence matrix
– Graph Traversal (Depth First and Breadth First)
– Minimum Spanning Trees
– Shortest Paths
• Linear Programming
Algorithmic Techniques
• Combinatorial Optimization Problems
– Find min (max) of a given function under given
constraints
• Greedy – best solution locally
• Dynamic Programming – best global
solution, if the problem has a nice structure
• Simulated Annealing: if not much is known
about the problem. Good general technique
Data Structures
• Once a given problem is digested,
algorithm design becomes an engineering
discipline: having a big toolbox and
matching the tools to the task at hand
• Data structures are a major part of the
toolbox: data representations that allow
basic operations to be performed efficiently
Basic Operations
• Store/Search:
– Search(x)
– Delete(x)
– Insert(x)
• Priority:
– FindMIN
– FindMAX
• Set:
– UnionSet
– FindElement
Basic Data Structures
• Static: arrays and matrices
– Array of n elements: a[i], 0 <= i <= n-1
[Figure: array cells numbered 1 to 5, holding a[1] a[2] a[3] a[4] a[5]]
– Matrix of n*n elements: m[i][j], 0 <= i, j <= n-1
[Figure: matrix with columns numbered 1 to 4; row 1 holds m[1][1] m[1][2] m[1][3] m[1][4]]
Dynamic Data Structures: linked lists,
trees and balanced trees, hash tables
• No static memory allocation: items are added/deleted on
the go
• Trees
Hash Tables
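[Figure: a hash function f(key) maps each key to a table slot; each slot points to a NIL-terminated chain of the keys stored there: a → b → c → NIL, d → e → f → NIL, g → h → i → NIL]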
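A hash table with chaining, as pictured, can be sketched as follows. This is an illustration under assumptions: the class name `ChainedHashTable`, the default slot count, and the use of Python's built-in `hash` as f(key) are all choices made here, and Python lists stand in for the linked chains of the figure.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a chain of keys per slot."""

    def __init__(self, num_slots: int = 8):
        # One (initially empty) chain per slot; a list stands in for a linked list.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        # f(key): map the key to a slot index, then return that slot's chain.
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key) -> None:
        chain = self._chain(key)
        if key not in chain:          # avoid storing duplicates
            chain.append(key)

    def search(self, key) -> bool:
        return key in self._chain(key)

    def delete(self, key) -> None:
        chain = self._chain(key)
        if key in chain:
            chain.remove(key)


table = ChainedHashTable(num_slots=3)
for key in "abcdefghi":
    table.insert(key)
table.delete("e")
print(table.search("d"), table.search("e"))  # True False
```

With only three slots and nine keys, several keys necessarily share a slot, which is the situation the chains handle.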
Set Data Structures
• Given sets A={1,2,3,4} and B={1,3}
• Operations: Find, Union
• Example:
– Find(A,3) = yes
– Find(A,5) = no
– Find(B,3) = yes
– Union(A,B) = {1,2,3,4}
• Very efficient: almost linear in the number
of union+find operations
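The near-linear bound refers to the disjoint-set (union-find) structure. A minimal sketch follows, assuming the sets are disjoint and that Find returns a representative element of the set rather than yes/no; the class name `DisjointSets` and the two heuristics used (union by rank, path compression) are choices made for this illustration.

```python
class DisjointSets:
    """Union-find over a fixed collection of elements, each starting in its own set."""

    def __init__(self, elements):
        self.parent = {x: x for x in elements}   # every element is its own root
        self.rank = {x: 0 for x in self.parent}  # upper bound on tree height

    def find(self, x):
        """Return the representative of the set containing x."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:            # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        """Merge the sets containing x and y; return the new representative."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        if self.rank[rx] < self.rank[ry]:        # attach the shallower tree
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return rx


sets = DisjointSets([1, 2, 3, 4, 5])
sets.union(1, 2)
sets.union(3, 4)
print(sets.find(1) == sets.find(2))  # True
print(sets.find(1) == sets.find(3))  # False
```

Two elements are in the same set exactly when their representatives are equal, which is how membership questions are answered in this representation.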
Graphs
• Graph G(V, E). V is a set of vertices, E a
set of edges
V = { v1, v2, v3, v4, v5, v6 }
E = { (v1, v2), (v1, v5), (v1, v6),
(v2, v3), (v2, v5), (v2, v6),
(v3, v4), (v3, v5), (v3, v6) }
[Figure: drawing of G with vertices V1 to V6]
• Linked list representation:
[Figure: linked list representation of the graph]
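As an illustration, the edge set E from the Graphs slide can be turned into one neighbor list per vertex. The sketch assumes the graph is undirected (the slide does not say), and the function name `adjacency_lists` is hypothetical; Python lists stand in for linked lists.

```python
from collections import defaultdict

# Edge set E as listed on the Graphs slide.
edges = [
    ("v1", "v2"), ("v1", "v5"), ("v1", "v6"),
    ("v2", "v3"), ("v2", "v5"), ("v2", "v6"),
    ("v3", "v4"), ("v3", "v5"), ("v3", "v6"),
]


def adjacency_lists(edges):
    """Build a vertex -> list-of-neighbors mapping from an edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)   # assumption: edges are undirected
    return dict(adj)


adj = adjacency_lists(edges)
print(adj["v3"])  # ['v2', 'v4', 'v5', 'v6']
```

This representation uses space proportional to |V| + |E|, whereas a matrix uses |V|^2 regardless of how many edges exist.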
A Greedy Clustering Example
• Clustering is a very important tool in
the analysis of large quantities of data
Clustering Points in the Plane
Problem 1: Given n points p1(x1, y1), p2(x2, y2), …, pn(xn, yn)
in a plane, cluster them so that if the distance
between two points is less than D they are in the
same cluster
Input: D, p1(x1, y1), p2(x2, y2), …, pn(xn, yn)
Output: Sets (clusters) of points C1, C2, …, Ck.
[Figure: two clusters of points, C1 and C2, with the distance D marked]
Algorithm Draft
• Calculate distances between point pairs
• Sort the distances in ascending order
p2  p1  d2,1        p7  p5  d7,5
p3  p1  d3,1        p4  p3  d4,3
…   …   …           …   …   …
• Move through the sorted list of distances and add a new
point to a cluster if the distance is < D.
Algorithm in Detail
• Data Structure for the graph: adjacency matrix
p2 p1 d2,1
p3 p2 d3,2
p3 p1 d3,1
… … …
Algorithm in Detail, contd.
• Calculate distances: O(n^2)
– For all pairs i,j calculate d(i,j)
• Sort adjacency table: O(n^2 log n)
• Start with n sets, {p1}, {p2}, …, {pn}. Build a linked-list
representation of a graph:
– Get the next smallest distance, d(i,j)
– If d(i,j) >= D, done
– Else Union(Find(pi),Find(pj))
• Traverse the graph to find the connected
components (DFS)
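The procedure can be sketched in Python as below. This is an illustration under assumptions, not the course's implementation: the function name `single_link_clusters` is hypothetical, distances are Euclidean via `math.dist` (Python 3.8+), and instead of building a graph and running DFS, the final clusters are read off by grouping points on their union-find representative, which yields the same connected components.

```python
from itertools import combinations
from math import dist


def single_link_clusters(points, D):
    """Group points so that any two points closer than D share a cluster."""
    n = len(points)
    parent = list(range(n))             # start with n singleton sets

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Calculate all pairwise distances: O(n^2) pairs.
    pairs = [(dist(points[i], points[j]), i, j)
             for i, j in combinations(range(n), 2)]
    pairs.sort()                        # O(n^2 log n)

    for d, i, j in pairs:
        if d >= D:                      # all remaining distances are too large
            break
        parent[find(i)] = find(j)       # Union(Find(pi), Find(pj))

    # Group points by representative (stands in for the DFS traversal).
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())


points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(single_link_clusters(points, D=1.5))
# [[(0, 0), (1, 0), (2, 0)], [(10, 0), (11, 0)]]
```

In the example, (0, 0) and (2, 0) are 2 apart, which exceeds D, yet they share a cluster because (1, 0) lies within D of both.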
Algorithm Analysis
• Correctness:
– All distances less than D are added
– Clusters contain all points with distance < D to
some other point in the cluster
• Time complexity:
– Dominated by the sorting step
– O(n^2 log n)
Discussion
• This algorithm is known as Single-Link
Hierarchical Clustering
• It is a version of Kruskal’s Minimum
Spanning Tree Algorithm
• It is fast
Performance on Real Data
• Lousy: chaining effects (distant points end up in one cluster through chains of close neighbors)
Better Approaches:
Complete-Link Clustering
Problem 2: Given n points p1(x1, y1), p2(x2, y2), …, pn(xn, yn)
in a plane, cluster them so that the distance
between any two points in a cluster is less than D
Input: D, p1(x1, y1), p2(x2, y2), …, pn(xn, yn)
Output: Sets (clusters) of points C1, C2, …, Ck.
[Figure: two clusters of points, C1 and C2, with the distance D marked]
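The constraint in Problem 2 is easy to check for a given clustering, even though the slide does not give an algorithm for constructing one. The sketch below only verifies the condition; the function name `is_complete_link` is hypothetical and Euclidean distance via `math.dist` (Python 3.8+) is assumed.

```python
from itertools import combinations
from math import dist


def is_complete_link(clusters, D):
    """Return True if every pair of points inside each cluster is closer than D."""
    return all(
        dist(p, q) < D
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )


chained = [[(0, 0), (1, 0), (2, 0)]]        # endpoints are 2 apart
split = [[(0, 0), (1, 0)], [(2, 0)]]
print(is_complete_link(chained, D=1.5))     # False
print(is_complete_link(split, D=1.5))       # True
```

A chain of three collinear points one unit apart is a valid single-link cluster for D = 1.5 but fails this stricter condition, which is the difference between the two problem statements.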
2. Bio-databases
• A biological database is a large, organized
body of persistent data, usually associated
with computerized software designed to
update, query, and retrieve components of
the data stored within the system.
– easy access to the information
– a method for extracting only that information
needed to answer a specific biological question
• Many databases are linked through a
single search and retrieval system, e.g.
NCBI's Entrez.
Database Interfacing
• APIs: scripts in Perl, Python, R
• Direct online:
– NCBI Entrez
– KEGG
– Reactome
– etc.
3. Workflows