Gene Expression: Quantification of Information Molecules and Their Applications
Molecules or Objects
Proteins & Peptides: • Structure prediction • Subcellular localization • Therapeutic application • Ligand binding
Gene Expression: • Disease biomarkers • Drug biomarkers • mRNA expression • Copy number variation
Chemoinformatics: • Drug design • Chemical descriptors • QSAR models • Personalized inhibitors
Image annotation: • Image classification • Medical images • Disease classification • Disease diagnostics
Molecular Biology Overview
[Figure: the omics layers of the cell. Chromosomes (23 pairs) in the nucleus are studied by genomics and epigenomics (DNA methylation and histone acetylation marks); mRNA (the transcriptome) by functional genomics; proteins (the proteome) by 2D electrophoresis, gel-free methods, mass spectrometry, protein sequencing, translational fusions, immunodetection and enzyme activities (proteomics); metabolites (the metabolome) by chromatography, mass spectrometry and NMR (metabolomics), alongside glycomics (sugars) and lipidomics (lipids).]
History
➢ 1980s: antibody-based assays (protein chips)
➢ P. Brown et al.: gene expression profiling using spotted cDNA microarrays, measuring expression levels of known genes
➢ Affymetrix: whole-genome expression profiling using tiling arrays, identifying and profiling novel genes and splicing variants
➢ 2008, many groups: mRNA-seq, direct sequencing of mRNAs using next-generation sequencing (NGS) techniques
http://www.affymetrix.com/technology/ge_analysis/index.affx
Terms/Jargons
Applications:
Gene (exon, isoform) expression estimation
Differential gene (exon, isoform) expression analysis
Transcriptome assembly - Map exon, intron boundaries, splice junctions
Discovery of novel transcribed regions
Analyse alternative splicing
Overview of RNA-Seq
Transcriptome profiling using NGS
How RNA-seq works
Sample preparation
Data analysis:
✓Mapping reads
✓Visualization (Gbrowser)
✓De novo assembly
✓Quantification
Figure from Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nat. Rev. Genetics 10, 57-63, 2009.
FASTQ format: Line 1: sequence identifier; Line 2: raw sequence; Line 3: '+' (optionally followed by the identifier); Line 4: per-base quality scores.
Alignment to genome:
- HISAT2
- STAR
Expression units:
RPKM: Reads Per Kilobase of transcript per Million mapped reads
FPKM: Fragments Per Kilobase of transcript per Million mapped reads
TPM: Transcripts Per Million
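A minimal sketch of how these units are computed from raw counts (the gene counts and lengths below are made-up toy values):

```python
import numpy as np

# Hypothetical raw read counts per gene and gene lengths in kilobases
counts = np.array([500, 1200, 300], dtype=float)
lengths_kb = np.array([2.0, 4.0, 1.0])

# RPKM: scale by library size (per million mapped reads), then by gene length (per kb)
per_million = counts.sum() / 1e6
rpkm = (counts / per_million) / lengths_kb

# TPM: normalize by gene length first, then rescale so all values sum to one million
rate = counts / lengths_kb
tpm = rate / rate.sum() * 1e6

print("RPKM:", rpkm)
print("TPM :", tpm)
```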
From reads to differential expression
Raw sequence data (FASTQ files) -> QC by FastQC/R -> Read mapping -> Mapped reads (SAM/BAM files) -> Expression quantification -> Differential expression
Downstream RNA-Seq analyses: transcriptomics, small indels, gene fusion, alternative splicing, RNA editing, network and pathway analysis
Integrative analysis with epigenomics: DNA methylation (Bisulfite-Seq), histone modification and transcription factor binding (ChIP-Seq)
https://doi.org/10.1038/nprot.2017.149
From Svensson et al., 2018.
Benefits of single cell sequencing
Opens the door to several biological and clinical questions
Gene Expression Omnibus (GEO)
➢ A public repository for the archiving and distribution of gene expression data submitted by the scientific community.
➢ GEO is a public functional genomics data repository supporting MIAME (Minimum Information About a Microarray Experiment)-compliant data submissions.
➢ Array- and sequence-based data are accepted.
➢ Curated, online resource for gene expression data browsing, query, analysis and retrieval.
➢ Convenient for deposition of gene expression data, as required by funding agencies and journals.
Goals of GEO
1. Provide a robust, versatile database for efficient storage of high-throughput functional genomic data
2. Offer simple submission procedures and formats that support complete and well-annotated data [deposited by the research community]
3. Provide user-friendly mechanisms that allow users to query, locate, review and download studies and gene expression profiles of interest
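As an illustration of programmatic access to GEO (not part of the slides), a hedged sketch using the third-party GEOparse Python package; the accession used here is only a placeholder example:

```python
import GEOparse

# Download and parse a series SOFT file (placeholder accession; replace as needed)
gse = GEOparse.get_GEO(geo="GSE2553", destdir="./geo_cache")

# Series-level metadata (title, platform, submission date, ...)
for key, value in gse.metadata.items():
    print(key, ":", value[:1])

# Each sample (GSM) exposes its expression table as a pandas DataFrame
for gsm_name, gsm in list(gse.gsms.items())[:3]:
    print(gsm_name)
    print(gsm.table.head())
```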
The GDC Data Portal: An Overview
The Genomic Data Commons (GDC) Data Portal provides users with web-based access to data from cancer genomics studies.
Key GDC Data Portal features
• Open, granular access to information about all datasets available in the GDC.
• Advanced search and visualization-assisted filtering of data files.
• Data visualization tools to support the analysis and exploration of data (including at the gene and mutation level from open-access MAF files).
• Cart for collecting data files of interest.
• Authentication using eRA Commons credentials and authorization using dbGaP for access to controlled data files.
• Secure data download directly from the cart or using the GDC Data Transfer Tool.
• For more information about available datasets, see the GDC Website.
https://docs.gdc.cancer.gov/Data_Portal/PDF/Data_Portal_UG.pdf
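The GDC also exposes a public REST API. A hedged sketch of a query with the requests library is shown below; the endpoint and field names follow the documented /files endpoint but should be checked against the current API reference:

```python
import json
import requests

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Filter: open-access gene expression quantification files from TCGA-BRCA
filters = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]}},
        {"op": "in", "content": {"field": "data_type", "value": ["Gene Expression Quantification"]}},
    ],
}

params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,cases.submitter_id",
    "format": "JSON",
    "size": "5",
}

response = requests.get(FILES_ENDPOINT, params=params)
for hit in response.json()["data"]["hits"]:
    print(hit["file_id"], hit["file_name"])
```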
GDC data portal Repository
Data Category: SNV, transcriptome profiling, CNV, sequencing reads, biospecimen, clinical, DNA methylation, somatic mutation, combined nucleotide variation
Data Type: raw simple somatic mutation, annotated somatic mutation, aligned reads, gene expression quantification, and so on
GDC clinical data
The Cancer Genome Atlas (TCGA)
TCGA is a project, begun in 2005, to catalogue genetic mutations responsible for cancer, using genome sequencing.
TCGA is supervised by the Center for Cancer Genomics and the National Human Genome Research Institute. A three-year pilot project, begun in 2006, focused on characterization of three types of human cancers: glioblastoma, lung, and ovarian cancer.
In 2009, it expanded into phase II, which planned to complete the genomic characterization and sequence analysis of 20–25 different tumor types by 2014.
Data types generated include: gene expression (RNA-seq and array), copy number variation (array and genome sequencing), SNP genotyping (GGS and WES), DNA methylation (array and RRBS), microRNA profiling, proteomics (RPPA), and chromatin accessibility (ATAC-seq).
There are 3554 authorized requesters associated with TCGA study (currently)
Project Cases Seq Exp SNV CNV Meth Clinical Clinical Supplement
TCGA-BRCA 1,098 1,098 1,097 1,044 1,098 1,095 1,098 1,098
TCGA-GBM 617 406 166 396 599 423 617 617
TCGA-OV 608 575 492 443 597 602 608 608
TCGA-LUAD 585 582 519 569 518 579 585 585
TCGA-UCEC 560 559 559 542 558 559 560 560
TCGA-KIRC 537 535 534 339 534 533 537 537
TCGA-HNSC 528 528 528 510 526 528 528 528
TCGA-LGG 516 516 516 513 515 516 516 516
TCGA-THCA 507 507 507 496 505 507 507 507
TCGA-LUSC 504 504 504 497 504 503 504 504
TCGA-PRAD 500 498 498 498 498 498 500 500
TCGA-SKCM 470 470 469 470 470 470 470 470
TCGA-COAD 461 460 459 433 460 458 461 461
TCGA-STAD 443 443 439 441 443 443 443 443
TCGA-BLCA 412 412 412 412 412 412 412 412
TCGA-LIHC 377 377 376 375 376 377 377 377
TCGA-CESC 307 307 307 305 302 307 307 307
TCGA-KIRP 291 291 291 288 290 291 291 291
TCGA-SARC 261 261 261 255 261 261 261 261
TCGA-LAML 200 195 188 149 200 140 200 200
TCGA-ESCA 185 185 184 184 185 185 185 185
TCGA-PAAD 185 185 178 183 185 184 185 185
TCGA-PCPG 179 179 179 179 179 179 179 179
TCGA-READ 172 171 167 158 167 165 172 172
TCGA-TGCT 150 150 150 150 150 150 150 150
TCGA-THYM 124 124 124 123 124 124 124 124
TCGA-KICH 113 66 66 66 66 66 113 113
TCGA-ACC 92 92 80 92 92 80 92 92
TCGA-MESO 87 87 87 83 87 87 87 87
TCGA-UVM 80 80 80 80 80 80 80 80
TCGA-DLBC 58 48 48 37 50 48 58 58
TCGA-UCS 57 57 57 57 57 57 57 57
TCGA-CHOL 51 51 36 51 36 36 51 51
Total 11,315 10,999 10,558 10,418 11,124 10,943 11,315 11,315
Types of data
www.ebi.ac.uk/gxa/sc/
Http://webs.iiitd.edu.in/raghava/cancerdr/
Overall Architecture of CancerLivER
Biomarkers
◆ A biomarker is a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease (National Cancer Institute (NCI)).
By molecule: RNA biomarkers, protein biomarkers, glyco biomarkers
By use: prognostic, predictive
Concept of Deep Learning
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology
Convolutional DBN for audio
[Figure: a spectrogram is fed to a convolutional DBN; each CDBN layer consists of detection units followed by probabilistic max pooling, and a second CDBN layer is stacked on the pooled output of the first.]
Deep Learning = Learning Hierarchical
Representations
Learning of object parts
Examples of learned object parts from object categories: faces, cars, elephants, chairs.
Training on multiple objects
Trained on 4 classes (cars, faces, motorbikes, airplanes).
Second layer: object-specific features.
Third layer: More specific features.
Application of Artificial Intelligence (Convolutional Neural Network)
(Identification of drug resistant strains from genomic information: Local & global annotation)
Flow of information from genotype to phenotype
[Figure: SNPs (S1–S6) map to genes (G1–G6), genes map to pathways (Pa1–Pa3), and pathways map to phenotypes (Ph1).]
Deep Belief Networks
A deep belief network (DBN) is a probabilistic, generative model made up of multiple layers of hidden units.
It can be viewed as a composition of simple learning modules that make up each layer.
Advantages:
Pre-training is particularly helpful when limited training data are available; the pre-trained weights are closer to the optimal weights than randomly chosen initial weights.
Restricted Boltzmann Machine
The training method for RBMs proposed by Geoffrey Hinton for training "Product of Experts" models is called contrastive divergence (CD).
[Figure: an RBM with fully connected visible and hidden units; note there are no hidden-hidden or visible-visible connections.]
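A minimal numpy sketch of one contrastive divergence (CD-1) update for a binary RBM with toy dimensions; it is illustrative only, not the exact procedure from the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 step on a single binary visible vector v0."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities and a sample given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # Negative phase: reconstruct visibles, then recompute hidden probabilities
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(n_visible) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # CD-1 updates: data statistics minus reconstruction statistics
    W = W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v = b_v + lr * (v0 - v1)
    b_h = b_h + lr * (p_h0 - p_h1)

v = np.array([1, 0, 1, 0, 1, 1], dtype=float)  # toy training vector
for _ in range(100):
    cd1_update(v)
print(np.round(W, 3))
```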
Deep Belief Network Architecture
Once an RBM is trained, another RBM is "stacked" atop it, taking its input from the final already-trained layer.
The new visible layer is initialized to a training vector, and values for the units in the already-trained layers are assigned using the current weights and biases.
The new RBM is then trained with the CD procedure.
This whole process is repeated until some desired stopping criterion is met.
DBN Example: Hand-written Digit Recognition
The network stacks a 28 x 28 pixel image layer, two hidden layers of 500 neurons each, and 2000 top-level neurons connected to 10 label neurons.
The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits.
The model learns to generate combinations of labels and images.
To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory.
Deep Learning using Pytorch
PyTorch is a library for Python programs
It allows deep learning models to be expressed in idiomatic Python.
Clear syntax, streamlined API, and easy debugging
Programming deep learning models feels very natural in PyTorch.
PyTorch gives us a data type, the Tensor, to hold numbers, vectors,
matrices, or arrays in general.
In addition, it provides functions for operating on them.
We can program with them incrementally and, if we want, interactively,
just like we are used to from Python.
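A small illustrative sketch of working with PyTorch tensors (not taken from the slides):

```python
import torch

# Tensors hold scalars, vectors, matrices, or higher-dimensional arrays
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
w = torch.randn(2, 3, requires_grad=True)   # track gradients for learning

y = x @ w                 # matrix multiplication
loss = (y ** 2).mean()    # a toy scalar loss
loss.backward()           # autograd fills w.grad

print(y.shape)            # torch.Size([2, 3])
print(w.grad.shape)       # torch.Size([2, 3])
```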
PyTorch Deep Learning Model Life-Cycle
Prepare the Data.
Define the Model.
Train the Model.
Evaluate the Model.
Make Predictions.
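A compact sketch of that life-cycle on synthetic data; the model, dimensions and hyperparameters are illustrative assumptions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# 1. Prepare the data (synthetic binary classification problem)
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# 2. Define the model
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 3. Train the model
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

# 4. Evaluate the model
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean()
    print(f"training accuracy: {accuracy:.3f}")

# 5. Make predictions on new data
new_sample = torch.randn(1, 20)
print("predicted class:", model(new_sample).argmax(dim=1).item())
```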
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.
[Figure: a "beak detector" filter responding to a beak-like patch in a bird image.]
Convolution
The filters are the network parameters to be learned. Each filter detects a small (3 x 3) pattern.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:          Filter 2:
 1 -1 -1           -1  1 -1
-1  1 -1           -1  1 -1
-1 -1  1           -1  1 -1
Convolution, stride = 1
Sliding Filter 1 over the 6 x 6 image with stride 1, the dot product with the top-left 3 x 3 patch is 3 and with the next patch is -1.
Convolution, stride = 2
With stride 2, Filter 1 produces 3 and -3 for the first two positions in the top row.
Convolution, stride = 1
Convolving the 6 x 6 image with Filter 1 (stride 1) gives the 4 x 4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
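A short numpy sketch reproducing this feature map by sliding the 3 x 3 filter over the 6 x 6 image (valid convolution, stride 1):

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])

filter1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

# Valid convolution (a cross-correlation, as in CNNs), stride 1
out = np.zeros((4, 4), dtype=int)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * filter1)

print(out)   # first row is [ 3 -1 -3 -1]
```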
Convolution, stride = 1: repeat this for each filter.
Filter 2 gives a second 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
The two 4 x 4 feature maps together form a 2 x 4 x 4 output.
Color image: RGB 3 channels
[Figure: for a colour image the input has three channels (R, G, B); each filter then has a 3 x 3 slice per channel, i.e. the filters become 3 x 3 x 3.]
Convolution v.s. Fully Connected
1 0 0 0 0 1 1 -1 -1 -1 1 -1
0 1 0 0 1 0 -1 1 -1 -1 1 -1
0 0 1 1 0 0 -1 -1 1 -1 1 -1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
convolution
image
x1
1 0 0 0 0 1
0 1 0 0 1 0 x2
Fully- 0 0 1 1 0 0
1 0 0 0 1 0
connected
…
…
…
…
0 1 0 0 1 0
0 0 1 0 1 0
x36
1 -1 -1 Filter 1 1 1
-1 1 -1 2 0
-1 -1 1 3 0
4: 0 3
…
1 0 0 0 0 1
0 1 0 0 1 0 0
0 0 1 1 0 0 8 1
1 0 0 0 1 0 9 0
0 1 0 0 1 0 10: 0
…
0 0 1 0 1 0
1 0
6 x 6 image
3 0
14
fewer parameters! 15 1 Only connect to 9
16 1 inputs, not fully
connected
…
1 -1 -1 1: 1
-1 1 -1 Filter 1 2: 0
-1 -1 1 3: 0
4: 0 3
…
1 0 0 0 0 1
0 1 0 0 1 0 7: 0
0 0 1 1 0 0 8: 1
1 0 0 0 1 0 9: 0 -1
0 1 0 0 1 0 10: 0
…
0 0 1 0 1 0
1 0
6 x 6 image
3: 0
14:
Fewer parameters 15: 1
16: 1 Shared weights
Even fewer parameters
…
The whole CNN
Input image -> Convolution -> Max Pooling -> Convolution -> Max Pooling (the convolution + max pooling pair can be repeated many times) -> Flattened -> Fully connected feedforward network -> output (e.g. cat, dog, ...)
Max Pooling
The two 4 x 4 feature maps produced by Filter 1 and Filter 2:
Filter 1:            Filter 2:
 3 -1 -3 -1          -1 -1 -1 -1
-3  1  0 -3          -1 -1 -2  1
-3 -3  0  1          -1 -1 -2  1
 3 -2 -2 -1          -1  0 -4  3
Why Pooling
Subsampling pixels will not change the object: a subsampled bird is still a bird, so pooling produces a new, smaller image.
Starting from the 6 x 6 image, convolution gives a 4 x 4 feature map per filter; 2 x 2 max pooling then reduces each to a 2 x 2 image:
Filter 1:   Filter 2:
3 0         -1 1
3 1          0 3
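A short numpy sketch of 2 x 2 max pooling applied to the Filter 1 feature map above:

```python
import numpy as np

feature_map = np.array([
    [ 3, -1, -3, -1],
    [-3,  1,  0, -3],
    [-3, -3,  0,  1],
    [ 3, -2, -2, -1],
])

# 2 x 2 non-overlapping max pooling: split into 2 x 2 blocks and take each block's max
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[3 0]
                #  [3 1]]
```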
Each filter produces one channel of the new image.
The whole CNN: convolution and max pooling can be repeated many times; each round gives a new, smaller image whose number of channels equals the number of filters.
Flattened: the final pooled feature maps (here the 2 x 2 outputs 3 0 / 3 1 and -1 1 / 0 3) are flattened into a single vector that feeds a fully connected feedforward network.
CNN in Keras
Only the network structure and the input format change (vector -> 3-D tensor), input_shape = (28, 28, 1).
Input: 1 x 28 x 28
Convolution (25 filters of size 3 x 3): output 25 x 26 x 26; each filter has 9 parameters.
Max Pooling (2 x 2): output 25 x 13 x 13
Convolution (50 filters of size 3 x 3): output 50 x 11 x 11; each filter has 225 = 25 x 9 parameters.
Max Pooling (2 x 2): output 50 x 5 x 5
Flattened: 50 x 5 x 5 = 1250 values feed the fully connected layers.
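A hedged Keras sketch of this architecture (standard tf.keras layers; the dense layer sizes after the flatten are assumptions, since the slides stop at 1250):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(25, (3, 3), activation="relu"),   # -> 26 x 26 x 25
    layers.MaxPooling2D((2, 2)),                    # -> 13 x 13 x 25
    layers.Conv2D(50, (3, 3), activation="relu"),   # -> 11 x 11 x 50
    layers.MaxPooling2D((2, 2)),                    # ->  5 x  5 x 50
    layers.Flatten(),                               # -> 1250
    layers.Dense(100, activation="relu"),           # assumed hidden layer size
    layers.Dense(10, activation="softmax"),         # 10 digit classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```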
AlphaGo
A neural network maps the board, encoded as a 19 x 19 matrix (black: 1, white: -1, none: 0), to the next move (19 x 19 positions).
A fully-connected feedforward network can be used, but a CNN performs much better.
AlphaGo's policy network is described in their Nature article. Note: AlphaGo does not use max pooling.
CNN in speech recognition
The spectrogram is treated as an image, with time and frequency as the two axes.
CNN in text classification
Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
Image-based Biomarkers
Major applications of Images
1. Gel electrophoresis (SDS-PAGE)
2. 2-D gel electrophoresis
3. Microarray or gene expression data
4. Karyotyping (chromosome mapping)
5. MRI (Cardiac function imaging)
6. Ultrasound (3-D, breast cancer)
7. Radionuclide (Detection of Cancer)
8. Histology -> Biopsy (Neurological diseases)
9. Electron Microscope
10. Imaging in therapy
DNAsize: Measuring size of DNA fragments from gel electrophoresis data
AC2DGel: Analysis and comparison of 2D gels
Classification of healthy and diseased leaves
Image-based prominent nucleoli detection (J Pathol Inform. 2015; 6: 39)
X-ray based diagnostics
Overview
Image (or region) similarity is used in many applications:
Object recognition
Scene classification
Image registration
Image retrieval
Robot localization
Template matching
Building panorama
And many more…
SIFT: Scale-invariant feature transform
[Figure: SIFT and spin-image descriptors. The SIFT descriptor takes a 16 x 16 window of grayscale values (0-255) around a keypoint, divides it into 4 x 4 squares and, for each square, computes a gradient direction histogram over 8 directions.]
OpenCV - Features
Cross platform: Windows, Linux, Mac OS
Portable: iPhone, Android
Language support: C/C++, Python
OpenCV Overview
More than 500 functions (opencv.willowgarage.com), organized into modules for image processing and transforms, geometric descriptors, features, segmentation, camera calibration / stereo / 3D, tracking, fitting, machine learning (detection, recognition), matrix math, utilities and data structures, and robot support.
OpenCV – Getting Started
Download OpenCV: http://opencv.org
Setting up: comprehensive guide on setting up OpenCV in various environments at
Two books
OpenCV Highlights
Focus on real-time image processing
Written in C/C++
Interface: C/C++, Python, Java, Matlab/Octave
Cross-platform: Windows, Mac, Linux, Android, iOS, etc
Open source and free!
Applications
Feature extraction
Recognition (facial, gesture, etc)
Segmentation
Robotics
Machine learning support: Boosting, k-nearest neighbor, SVM, etc
OpenCV for Computing SIFT Descriptors using Python
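A hedged sketch using the opencv-python bindings (SIFT is in the main module in OpenCV >= 4.4; in older builds it lived in opencv-contrib under cv2.xfeatures2d). The image path is a placeholder:

```python
import cv2

# Placeholder path: replace with an actual image file
image = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                       # SIFT keypoint detector + descriptor
keypoints, descriptors = sift.detectAndCompute(image, None)

print("number of keypoints:", len(keypoints))
print("descriptor shape:", descriptors.shape)  # (n_keypoints, 128)

# Draw the detected keypoints and save the visualization
annotated = cv2.drawKeypoints(image, keypoints, None,
                              flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", annotated)
```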
Implementation of CNN using Keras
https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
Image Classification using PyTorch
https://www.analyticsvidhya.com/blog/2019/10/building-image-classification-models-cnn-pytorch/
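To close, a minimal PyTorch sketch of a CNN image classifier mirroring the Keras architecture described above (an assumption of this rewrite, not taken from the linked tutorial); it runs on random tensors standing in for 28 x 28 grayscale images:

```python
import torch
from torch import nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 25, kernel_size=3), nn.ReLU(),   # -> 25 x 26 x 26
            nn.MaxPool2d(2),                              # -> 25 x 13 x 13
            nn.Conv2d(25, 50, kernel_size=3), nn.ReLU(),  # -> 50 x 11 x 11
            nn.MaxPool2d(2),                              # -> 50 x 5 x 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # -> 1250
            nn.Linear(50 * 5 * 5, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
dummy_batch = torch.randn(8, 1, 28, 28)   # random stand-in for grayscale images
logits = model(dummy_batch)
print(logits.shape)                        # torch.Size([8, 10])
```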