DeepBind: 6.874 - Pranam Chatterjee
● Sequence specificity
Position Weight Matrix
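Steps:
1. Get the position frequency matrix (PFM) by counting occurrences of each nucleotide at each position.
2. Divide each count by the total number of sequences.
3. Formally, given a set X of N aligned sequences of length l, each matrix entry is $M_{k,j} = \frac{1}{N} \sum_{i=1}^{N} I(X_{i,j} = k)$, where k is a nucleotide, j = 1, ..., l is a position, and I is the indicator function.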
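The counting and normalizing steps above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function names and the four toy sequences are made up, and it assumes a DNA alphabet with no gaps or ambiguous bases. Note that the result is a matrix of per-position probabilities; the log-odds conversion usually applied to obtain final weights is not shown.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: DNA, no gaps or ambiguity codes


def position_frequency_matrix(seqs):
    """Step 1: count each nucleotide at each position of the aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    columns = [Counter(s[j] for s in seqs) for j in range(length)]
    return {base: [col[base] for col in columns] for base in ALPHABET}


def position_probability_matrix(seqs):
    """Step 2: divide every count by the number of sequences N."""
    n = len(seqs)
    pfm = position_frequency_matrix(seqs)
    return {base: [count / n for count in counts] for base, counts in pfm.items()}


# Hypothetical toy alignment, for illustration only.
seqs = ["ACGT", "ACGA", "TCGT", "ACTT"]
pfm = position_frequency_matrix(seqs)
ppm = position_probability_matrix(seqs)
print(pfm["A"])  # counts of A at each of the four positions
print(ppm["A"])  # the same counts divided by N = 4
```

Each column of the probability matrix sums to 1, which is a quick sanity check on any implementation.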
Data Issues
● Different forms of data
○ Specificity coefficient
■ Protein Binding Microarrays
■ RNAcompete arrays
○ Ranked Lists of Bound Sequences
■ ChIP-Seq
○ High Affinity Sequence List
■ HT-SELEX
● Large Quantities of Data
○ 10,000-100,000 sequences from a single experiment
● Additional Biases/Limitations
○ e.g., hyper-ChIPable regions of the genome
○ Need to filter
DeepBind Claims
● Apply to both microarray and sequencing data
● Generalize well across technologies
● Tolerate noise and mislabeled data
● Can learn from millions of sequences through parallel implementation on a graphics processing unit (GPU)
● Train models and tune parameters automatically
● Can discover new patterns without location information
Figure (model overview): pooling by MAX or MEAN, output is a BINDING SCORE.
Alipanahi, et al., Nature Biotechnology, 2015.
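To make the MAX or MEAN step concrete, here is a small sketch that reads it as the pooling stage: a motif detector is slid along the sequence, negative responses are clipped to zero, and the per-position responses are collapsed into one number. Everything below is an assumption for illustration: the detector weights, the threshold, and the function names are made up, and a real DeepBind model learns many detectors and passes the pooled values through further layers before producing a binding score.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator rows."""
    return [[1.0 if BASE_INDEX[b] == k else 0.0 for k in range(4)] for b in seq]


def scan(seq, detector, threshold=0.0):
    """Slide an m x 4 detector along seq and rectify each response."""
    x = one_hot(seq)
    m = len(detector)
    responses = []
    for start in range(len(seq) - m + 1):
        total = sum(
            detector[j][k] * x[start + j][k] for j in range(m) for k in range(4)
        )
        responses.append(max(0.0, total - threshold))
    return responses


def pool(responses, mode="max"):
    """Collapse per-position responses into one value by MAX or MEAN."""
    if mode == "max":
        return max(responses)
    if mode == "mean":
        return sum(responses) / len(responses)
    raise ValueError("mode must be 'max' or 'mean'")


# Hypothetical detector that rewards the 3-mer TGA (columns are A, C, G, T).
detector = [
    [-1.0, -1.0, -1.0, 1.0],  # position 1 prefers T
    [-1.0, -1.0, 1.0, -1.0],  # position 2 prefers G
    [1.0, -1.0, -1.0, -1.0],  # position 3 prefers A
]

middle = scan("CCTGACC", detector)  # motif in the middle
start = scan("TGACCCC", detector)   # same motif at the start
print(pool(middle, "max"), pool(start, "max"))
print(pool(middle, "mean"), pool(start, "mean"))
```

Both toy sequences get the same pooled value even though the motif sits at different offsets, which is one way to see how a model can pick up patterns without location information.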
Calibration and Testing Procedure
12 terabases of data!!!
Alipanahi, et al., Nature Biotechnology, 2015.
Let’s unpack that...
● Thousands of PBM, RNAcompete, ChIP-Seq, and HT-SELEX experiments
● Create 927 DeepBind models
● 538 Transcription Factors
● 194 RNA-binding Proteins (RBPs)
● Pearson Correlation
○ Measures linear correlation between predicted intensities and measured probe intensities
○ Values lie between -1 and 1; the closer to 1, the better the model performs.
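For reference, the Pearson correlation between predicted and measured intensities can be computed as below. This is a plain-Python sketch with made-up numbers; the function name and the example intensities are illustrative and do not come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical probe intensities and model predictions.
measured = [1.0, 2.0, 3.0, 4.0]
predicted_good = [2.0, 4.0, 6.0, 8.0]  # perfectly linear in the measurements
predicted_bad = [8.0, 6.0, 4.0, 2.0]   # perfectly anti-correlated

r_good = pearson(measured, predicted_good)
r_bad = pearson(measured, predicted_bad)
print(r_good, r_bad)
```

The coefficient ignores scale and offset, so predictions only need to track the measured intensities linearly, not match them in absolute value.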
Quantitative Performance Against Other Methods
● How to do this?
● MUTATION MAPS!
○ Importance of each base
○ Effect of each mutation on binding score
● Illustrates the effect of point mutations on binding affinity
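A mutation map can be sketched as an in-silico scan: substitute every base at every position and record how the score changes. The code below is a simplified illustration under stated assumptions. `toy_score` is a hypothetical stand-in for a trained model, and the change is measured as a plain difference in score, which may not match the exact weighting used in the paper.

```python
def mutation_map(seq, score, alphabet="ACGT"):
    """Score change for every single-base substitution in seq.

    Returns a dict mapping each base to a list with one entry per position.
    Entries for the original base are 0.0.
    """
    base_score = score(seq)
    rows = {b: [] for b in alphabet}
    for i, original in enumerate(seq):
        for b in alphabet:
            if b == original:
                rows[b].append(0.0)
            else:
                mutant = seq[:i] + b + seq[i + 1:]
                rows[b].append(score(mutant) - base_score)
    return rows


def base_importance(rows):
    """Per-position importance: largest absolute score change at that position."""
    length = len(next(iter(rows.values())))
    return [max(abs(rows[b][i]) for b in rows) for i in range(length)]


def toy_score(seq):
    """Hypothetical scorer: counts occurrences of the 3-mer TGA."""
    return float(seq.count("TGA"))


mm = mutation_map("CTGAC", toy_score)
importance = base_importance(mm)
print(mm["A"])
print(importance)
```

In this toy case only the three bases that form the motif show a nonzero importance; mutating the flanking bases leaves the score unchanged.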