0% found this document useful (0 votes)
143 views25 pages

Deepbind: 6.874 - Pranam Chatterjee

DeepBind is a deep learning model that learns sequence specificities from large datasets of protein binding sequences. It generates Position Weight Matrix models for over 500 transcription factors and 194 RNA binding proteins. DeepBind outperforms other non-deep learning methods on tasks like predicting DNA and RNA binding. It can also identify variants that affect protein binding and help understand alternative splicing regulation. Overall, DeepBind demonstrates the ability of deep learning to discover sequence motifs from huge amounts of protein binding data.

Uploaded by

Oky Hermansyah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
143 views25 pages

Deepbind: 6.874 - Pranam Chatterjee

DeepBind is a deep learning model that learns sequence specificities from large datasets of protein binding sequences. It generates Position Weight Matrix models for over 500 transcription factors and 194 RNA binding proteins. DeepBind outperforms other non-deep learning methods on tasks like predicting DNA and RNA binding. It can also identify variants that affect protein binding and help understand alternative splicing regulation. Overall, DeepBind demonstrates the ability of deep learning to discover sequence motifs from huge amounts of protein binding data.

Uploaded by

Oky Hermansyah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

DeepBind

6.874 - Pranam Chatterjee


Why do we care?
● Regulatory processes
○ Transcription
○ Alternative Splicing
○ Disease correlation

● Sequence specificity
Position Weight Matrix
Steps:
1. Get PFM by counting
occurrences of each
nucleotide at each
position.
2. Divide frequency by total
# of sequences.
3. Formally, given a set X of
N aligned sequences of
length i:
Data Issues
● Different forms of data
○ Specifity coefficient
■ Protein Binding Microarrays
■ RNAcompete arrays
○ Ranked Lists of Bound Sequences
■ ChIP-Seq
○ High Affinity Sequence List
■ HT-SELEX
● Large Quantities of Data
○ 10,000-100,000 sequences (1 EXPERIMENT)
● Additional Biases/Limitations
○ i.e., hyper-ChIPable regions of genome
○ Need to filter
DeepBind Claims
● Apply to both microarray and sequencing data
● Generalize well across technologies
● Tolerate noise and mislabeled data
● Can learn from millions of sequences through parallel implementation on a
graphics processing unit (GPU)
● Train models and tune parameters automatically
● Can discover new patterns without location information

Alipanahi, et al., Nature Biotechnology, 2015.


Overview of DeepBind

Alipanahi, et al., Nature Biotechnology, 2015.


Training Procedure

MAX or MEAN

BINDING SCORE
Alipanahi, et al., Nature Biotechnology, 2015.
Calibration and Testing Procedure

12 terabases of data!!!
Alipanahi, et al., Nature Biotechnology, 2015.
Let’s unpack that...
● Thousands of PBM, RNAcompete, ChIP-Seq, and HT-SELEX experiments
● Create 927 DeepBind models
● 538 Transcription Factors
● 194 RNA-binding Proteins (RBPs)

(This took 4+ years, btw)

Alipanahi, et al., Nature Biotechnology, 2015.


How well does it work?
● Test on PBM data from DREAM5 TF-DNA Motif Recognition Challenge
● 86 different mouse transcription factors
● 2 array designs (~40,000 probes each)
○ All possible 10-mers, non-palindromic 8-mers (32x)
● Train on probe intensities, predict on held-out test array design

Example Competing Algorithms (26 in total)

● FeatureREDUCE (biophysical PWM/k-mer) None of these are deep-learning-based!


● BEEML-PBM (weighted regression)
● RankMotif++ (probabilistic)
● PFM models (position frequency matrices)
Alipanahi, et al., Nature Biotechnology, 2015.
Metrics
● Area Under Curve (AUC)
○ Measures true positive rate of model as a function of
false positive rate (ROC curve)
○ Tells us how good the model identifies actual positives
○ Higher AUC means better performing model

● Pearson Correlation
○ Measures linear correlation between predicted intensity
and probe intensities
○ Higher absolute values (maxed at 1), indicate better
performing mode.
Quantitative Performance Against Other Methods

Alipanahi, et al., Nature Biotechnology, 2015.


Do in vitro models accurately identify in vivo bound
sequences?
● 506 ENCODE ChIP-Seq data sets
● In vivo laboratory biases
○ Cell-type specificities
○ Nucleosome interactions
○ Chromatin remodeling, etc.
● 137 transcription factors
● Performed better than other
non-deep learning methods based
on AUC
● Can generalize to other data
acquisition methods
Alipanahi, et al., Nature Biotechnology, 2015.
First place goes to….DEEPBIND!
Why are RBPs sequence specifities difficult to
predict?
● Usually bind to ssRNA
● More flexible than DNA
● Can fold into stable secondary structures
● Recognition motif is highly flexible
○ Multiple domains neeeded for binding
● RNA structure also affects binding

Rhee, et al., Nucleic Acids Research 2008,


Identifying Damaging
Genetic Variants

● How to do this?
● MUTATION MAPS!
○ Importance of each base
○ Effect of each mutation on
binding score
● Illustrates effect of point
mutations on binding
affinity

Alipanahi, et al., Nature Biotechnology, 2015.


Mutation in MYC Enhancer Weakens TCF7L2
Binding Site

Alipanahi, et al., Nature Biotechnology, 2015.


SNP in Globin Cluster Creates GATA1 Binding Site

Alipanahi, et al., Nature Biotechnology, 2015.


DeepFind: an aggregate model
● What’s the point? To provide
collective contexts.
● I.e., true TF binding sites are
likely to be located with other
TF binding sites
● AUC ~ 0.76
● Predicts deleterious SNVs in
promoters

Alipanahi, et al., Nature Biotechnology, 2015.


One more application: Alternative Splicing
● AS generates transcriptional
diversity
● RBPs regulate splicing
● Binding scores at exon
junctions regulated by splicing
regulators
● Consistent with experimental
CLIP-seq data and known
binding profiles of RBP’s

Alipanahi, et al., Nature Biotechnology, 2015.


Prediction of Nova Regulation Mechanism
DeepBind Motif Learning
Key Takeaways
● GOAL: given regions experimentally determined to be bound by proteins,
what is the model describing bound sequences?
● Sequences/Binding Scores -> CNN -> binding scores for novel sequences
● Generates weighted ensembles of PWM’s and mutational maps
● ~600 different DeepBind models generated
● Identified RNA-binding sites involved in splicing regulation
● Identified disease-associated variants that affect TF binding

CHECK IT OUT YOURSELF: http://tools.genes.toronto.edu/deepbind/


Shortcomings and Future Work
● Comparisons with only non-deep learning models
● Not much better than non-deep learning models
● Assumes one motif in each probe
● Non-coding factors/variants ignored
● Does not account for positional dynamics of probe sequences -> DeeperBind
● How about epigenetic regulation of binding to sequences? -> DeepSEA

Cool name bro


Any Questions?

You might also like