Gene Expression: Quantification of Information Molecules and Their Applications



Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Biomedical Applications
Concept Level
★Proteome annotation ★Drug discovery ★Vaccine Design ★Biomarkers

Molecules or Objects
Proteins & Peptides
• Structure prediction
• Subcellular localization
• Therapeutic application
• Ligand binding
Gene Expression
• Disease biomarkers
• Drug biomarkers
• mRNA expression
• Copy number variation
Chemoinformatics
• Drug design
• Chemical descriptor
• QSAR models
• Personalized inhibitors
Image annotation
• Image classification
• Medical images
• Disease classification
• Disease diagnostics
Molecular Biology Overview
Cell Nucleus

Chromosome

Protein Gene (mRNA), Gene (DNA)


single strand
History of genome sequencing
• 1977 bacteriophage φX174 (5386bp, 11 genes)
• 1981 mitochondrial genome (16,568bp)
• 1986 chloroplast genome (120,000 bp)
• 1995 Haemophilus influenzae (1.8Mb)
• 1996 Saccharomyces whole genome (12.1Mb)
• 1997 E. coli (4.6Mb; 4200 proteins)
• 1998 Caenorhabditis elegans (97 Mb; 19,000 genes)
• 2000 Arabidopsis thaliana (115Mb, 30,000 genes)
• 2001 mouse (1 year!)
• 2001 Homo sapiens (2 projects)
• 2005 Pan, rice
• 2006 Populus
Analysing the flow of genetic information
DNA (Genome), in the nucleus: structural genomics
• Genome mapping
• Genome sequencing
• Genome annotations
pre-mRNA and mRNA (Transcriptome): functional genomics
• DNA arrays and chips
• RNA sequencing
• (semi) qRT-PCR
• Northern blot + hybridization
• Transcriptional fusions
Proteins (Proteome), in the cytoplasm: functional genomics
• 2D electrophoresis
• Gel-free methods: mass spectrometry, protein sequencing
• Translational fusions
• Immunodetection
• Enzyme activities
Metabolites (Metabolome)
• Chromatography
• Mass spectrometry
• NMR
World of OMICs
• Organ, Tissue → Cell → Nucleus → Chromosome (23 pair) → Chromatin
• Genomics (3×10^9): DNA (4 chemicals: A, T, G, C)
• Epigenomics: M and Ac marks on chromatin
• Transcriptomics: mRNA (copies), non-coding RNA, miRNA
• Proteomics: Protein (20 chemicals: A, C, D ..)
• Glycomics (sugars attached to proteins)
• Metabolomics: Glycomics (Sugars), Lipidomics (Lipids)

The evolution of transcriptomics
Hybridization-based
• P. Brown et al.: gene expression profiling using spotted cDNA microarray: expression levels of known genes
• Affymetrix: whole genome expression profiling using tiling array: identifying and profiling novel genes and splicing variants
• 2008, many groups: mRNA-seq: direct sequencing of mRNAs using next generation sequencing techniques (NGS)
History
➢ 1980s: antibody-based assay (protein chip?)
➢ ~1991: high-density DNA-synthetic chemistry (Affymetrix/oligo chips)
➢ ~1995: microspotting (Stanford Univ/cDNA chips)
➢ Improvements: replacing porous surface with solid surface; replacing radioactive label with fluorescent label; improvement on sensitivity
cDNA Microarray Technologies

❖Spot cloned cDNAs onto a glass microscope slide
❖usually PCR amplified segments of plasmids
❖Label 2 RNA samples with 2 different colors of fluorescent dye - control vs. experimental
❖Mix two labeled RNAs and hybridize to the chip
❖Make two scans - one for each color
❖Combine the images to calculate ratios of amounts of each RNA that bind to each spot
Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

Genes   slide 1   slide 2   slide 3   slide 4   slide 5   …
1        0.46      0.30      0.80      1.51      0.90     ...
2       -0.10      0.49      0.24      0.06      0.46     ...
3        0.15      0.74      0.04      0.10      0.20     ...
4       -0.45     -1.03     -0.79     -0.56     -0.32     ...
5       -0.06      1.06      1.35      1.09     -1.09     ...

Gene expression level of gene 5 in slide 4 (1.09) = Log2(Red intensity / Green intensity)

These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.
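
The log-ratio defined above can be illustrated with a minimal sketch. The spot names and intensities below are made up for illustration, and the function names `log2_ratio` and `color_for` are hypothetical helpers, not part of any microarray package.

```python
import math

def log2_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive (subtract background first)")
    return math.log2(red / green)

def color_for(value, tol=1e-9):
    """Map a log-ratio to the display scale named on the slide."""
    if value > tol:
        return "red"
    if value < -tol:
        return "green"
    return "yellow"

# Hypothetical (red, green) intensities for three spots
spots = {"up": (2000.0, 500.0), "same": (800.0, 800.0), "down": (300.0, 1200.0)}
ratios = {name: log2_ratio(r, g) for name, (r, g) in spots.items()}
colors = {name: color_for(v) for name, v in ratios.items()}
```

Working in log2 makes a 4-fold increase (+2) and a 4-fold decrease (-2) symmetric around zero, which is why the table mixes positive and negative values.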
Affymetrix Expression Arrays

http://www.affymetrix.com/technology/ge_analysis/index.affx
Terms/Jargons

Stanford/cDNA chip
• one slide/experiment
• one spot
• 1 gene => one spot or few spots (replica)
• control: control spots
• control: two fluorescent dyes (Cy3/Cy5)
Affymetrix/oligo chip
• one chip/experiment
• one probe/feature/cell
• 1 gene => many probes (20~25 mers)
• control: match and mismatch cells.
Practical Application of DNA Microarrays
• DNA microarrays are used to study gene activity (expression)
  – What proteins are being actively produced by a group of cells?
  – "Which genes are being expressed?"
• How?
  – When a cell is making a protein, it transcribes the genes (made of DNA) which code for the protein into RNA, which is then used in the protein's production
  – The RNA present in a cell can be extracted
  – If a gene has been expressed in a cell, its RNA will bind to "a copy of itself" on the array
  – RNA with no complementary site will wash off the array
  – The RNA can be "tagged" with a fluorescent dye to determine its presence
• DNA microarrays provide a high throughput technique for quantifying the presence of specific RNA sequences
Analysis of Microarray Data
• Analysis of images
• Preprocessing of gene expression data
  – Normalization of data
    – Subtraction of background noise
    – Global/local normalization
    – Housekeeping genes (or same gene)
    – Expression as a ratio (test/reference) in log
• Differential gene expression
  – Repeats and calculate significance (t-test)
  – Significance of fold change assessed using a statistical method
• Clustering
  – Supervised/Unsupervised (Hierarchical, K-means, SOM)
  – Prediction or supervised machine learning (SVM)
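
As an illustration of the "repeats and calculate significance (t-test)" step, here is a minimal sketch of Welch's t statistic for one gene measured in replicate. The expression values are hypothetical, the function name `welch_t` is an illustrative helper, and converting the statistic to a p-value (which needs the t distribution) is left out.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two groups.

    Assumes each group has at least two replicates and nonzero variance.
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical log2 expression values for one gene (four replicates each)
treated = [1.51, 1.09, 1.35, 1.06]
reference = [0.10, 0.06, 0.20, 0.04]

t_stat, dof = welch_t(treated, reference)
log_fold_change = mean(treated) - mean(reference)  # difference of logs = log of ratio
```

Because the values are already log2 ratios, the difference of group means is the log2 fold change; the t statistic then says how large that change is relative to the replicate noise.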
Videos on Microarray
• https://www.youtube.com/watch?v=0ATUjAxNf6U (animation)
• https://www.youtube.com/watch?v=0Hj3f7vQFZU
• https://www.youtube.com/watch?v=W8BCXQtH0I8
What is RNA-Seq?
RNA-Seq is the process of sequencing the transcriptome which includes
protein coding and non-coding transcripts.

Applications:
 Gene (exon, isoform) expression estimation
 Differential gene (exon, isoform) expression analysis
 Transcriptome assembly - Map exon, intron boundaries, splice junctions
 Discovery of novel transcribed regions
 Analyse alternate splicing
Overview of RNA-Seq
Transcriptome profiling using NGS
How RNA-seq works

Sample preparation

Next generation sequencing (NGS)

Data analysis:
✓Mapping reads
✓Visualization (Gbrowser)
✓De novo assembly
✓Quantification

Figure from Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nat. Rev. Genetics 10, 57-63 (2009).
FASTQ files
Line 1: Sequence identifier
Line 2: Raw sequence
Line 3: Separator line starting with "+" (carries no information)
Line 4: Quality values for the sequence
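
The four-line record layout can be made concrete with a small reader. This is a sketch under two assumptions that the slide does not state: records are not line-wrapped, and quality characters use the Phred+33 encoding. The name `read_fastq` and the example read are illustrative only.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # line 3: '+' separator, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: " + header)
        # Assumption: Phred+33 encoding (quality = ASCII code - 33)
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Hypothetical single-read file held in memory
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
```

Each base gets exactly one quality character, which is why the reader checks that lines 2 and 4 have the same length.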
Analysis Pipeline
Classic quantification of gene expression using RNA-seq
• Mapping: alignment to genome (Hisat2, STAR)
• Count reads per transcript: read count tables
• Normalization:
  – RPKM: Reads Per Kilobase of transcript per Million mapped reads
  – FPKM: Fragments Per Kilobase of transcript per Million mapped reads
  – TPM: Transcripts Per Kilobase Million
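
A short sketch may help show how these normalizations differ. The gene names, counts and lengths are made up, and the functions `rpkm` and `tpm` are illustrative helpers that use the commonly quoted formulas; FPKM follows the same formula as RPKM with fragments (read pairs) counted instead of reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}

def tpm(counts, lengths_bp):
    """Length-normalize first, then scale so that all values sum to one million."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] * 1e6 / scale for g in counts}

# Hypothetical read counts and transcript lengths (in base pairs)
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

In this toy example geneA and geneB receive the same value under both measures even though geneB has three times the reads, because it is also three times longer. TPM values always sum to one million per sample, which RPKM values do not.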
From reads to differential expression
1. Raw sequence data (FASTQ files); QC by FastQC/R
2. Reads mapping
   – Unspliced mapping: BWA, Bowtie
   – Spliced mapping: TopHat, MapSplice
3. Mapped reads (SAM/BAM files); QC by RNA-SeQC
4. Expression quantification
   – Summarize read counts
   – FPKM/RPKM (Cufflinks)
5. DE testing
   – DEseq, edgeR, etc
   – Cuffdiff
6. List of DE
7. Functional interpretation: function enrichment, infer networks, integrate with other data
8. Biological insights & hypothesis



Patient → Technologies → Data Analysis → Integration and interpretation
• Genomics (WGS, WES): point mutation, small indels, copy number variation, structural variation → functional effect of mutation
• Transcriptomics (RNA-Seq): differential expression, gene fusion, alternative splicing, RNA editing → network and pathway analysis
• Epigenomics (Bisulfite-Seq, ChIP-Seq): methylation, histone modification, transcription factor binding → integrative analysis
• Outcome: further understanding of cancer and clinical applications

Shyr D, Liu Q. Biol Proced Online. (2013) 15, 4


Concept of Single Cell

The basic unit of life


Why single cell gene expression?
Improvements in scRNA-seq methods
[Figure: scale of scRNA-seq experiments growing from ~10 to ~100 000 cells]
From Svensson et al, 2018. https://doi.org/10.1038/nprot.2017.149
Benefits of single cell sequencing
Opens the door to several biological and clinical questions

✓ Understanding heterogeneous samples:


✓ E.g. analyse cellular heterogeneity during immune or
stem cell development
✓ Identification and analysis of rare cell types
✓ E.g. circulating tumor cells from liquid biopsy
✓ Understanding cellular transitions and switches in
cell state
✓ Dissecting complex infections and revealing drug resistance
genotypes
Gene Expression Omnibus (GEO)

➢A public repository for the archiving and distribution of gene expression data
submitted by the scientific community.
➢GEO is a public functional genomics data repository supporting MIAME
(Minimum Information About a Microarray Experiment)-compliant data
submissions.
➢Array- and sequence-based data are accepted.
➢Curated, online resource for gene expression data browsing, query, analysis
and retrieval.
➢Convenient for deposition of gene expression data, as required by funding
agencies and journals.
Goals of GEO
1. Provide a robust, versatile database for efficient storage of high-
throughput functional genomic data
2. Offer simple submission procedures and formats that support
complete and well-annotated data [deposited by the research
community]
3. Provide user-friendly mechanisms that allow users to query, locate,
review and download studies and gene expression profiles of
interest
The GDC Data Portal: An Overview
The Genomic Data Commons (GDC) Data Portal provides users with web-based access to data from cancer
genomics studies.
Key GDC Data Portal features
•Open, granular access to information about all datasets available in the GDC.
•Advanced search and visualization-assisted filtering of data files.
•Data visualization tools to support the analysis and exploration of data (including on a gene and mutation level from Open-Access MAF files).
•Cart for collecting data files of interest.
•Authentication using eRA Commons credentials and authorization using dbGaP for access to controlled data files.
•Secure data download directly from the cart or using the GDC Data Transfer Tool.
•For more information about available datasets, see the GDC Website.

Accessing the GDC Data Portal

The GDC Data Portal is accessible using a web browser such as Chrome, Firefox, and Microsoft Edge at the following URL: https://portal.gdc.cancer.gov
The front page displays a summary of all available datasets.

https://docs.gdc.cancer.gov/Data_Portal/PDF/Data_Portal_UG.pdf
GDC data portal Repository

 Data Category
SNV, transcriptome profiling, CNV, sequencing reads,
biospecimens, clinical, DNA methylation, somatic mutation,
combined nucleotide variation

 Data Type
Raw simple somatic mutation, annotated somatic mutation, aligned reads, gene expression quantification, and so on.
GDC clinical data

• Clinical data is a collection of data related to patient diagnosis, demographics, exposures, laboratory tests, and family relationships.
• In the GDC, clinical data is searchable in the API, Data Portal, or Legacy Archive. This only includes data that has been indexed and aligns with the GDC Data Dictionary.
https://www.nejm.org/doi/full/10.1056/NEJMp1607591
The Cancer Genome Atlas (TCGA)

• Launched in 2006 as a pilot, expanded in 2009, ended in 2017

• NIH-funded program to perform a comprehensive and integrated analysis of key


genomic/molecular features of many cancers

• A ‘marker paper’ in each project to provide fundamental insights

• Make the data publicly available to the research community

• Serves as a model for the power of teamwork in science.


U.K., France, Netherlands, Canada, U.S.

• Uveal melanoma chosen as one of 10 rare cancers included


TCGA history

• TCGA is a project, begun in 2005, to catalogue genetic mutations responsible for cancer, using genome sequencing.
• TCGA is supervised by the Center for Cancer Genomics and the National Human Genome Research Institute. A three-year pilot project, begun in 2006, focused on characterization of three types of human cancers: glioblastoma, lung, and ovarian cancer.
• In 2009, it expanded into phase II, which planned to complete the genomic characterization and sequence analysis of 20–25 different tumor types by 2014.
• Data types: gene expression (RNA-seq and array), copy number variation (array and genome sequencing), SNP genotyping (WGS and WES), DNA methylation (array and RRBS), microRNA profiling, proteomics (RPPA), and chromatin accessibility (ATAC-seq).
• There are 3554 authorized requesters associated with TCGA study (currently)
Project Cases Seq Exp SNV CNV Meth Clinical Clinical Supplement
TCGA-BRCA 1,098 1,098 1,097 1,044 1,098 1,095 1,098 1,098
TCGA-GBM 617 406 166 396 599 423 617 617
TCGA-OV 608 575 492 443 597 602 608 608
TCGA-LUAD 585 582 519 569 518 579 585 585
TCGA-UCEC 560 559 559 542 558 559 560 560
TCGA-KIRC 537 535 534 339 534 533 537 537
TCGA-HNSC 528 528 528 510 526 528 528 528
TCGA-LGG 516 516 516 513 515 516 516 516
TCGA-THCA 507 507 507 496 505 507 507 507
TCGA-LUSC 504 504 504 497 504 503 504 504
TCGA-PRAD 500 498 498 498 498 498 500 500
TCGA-SKCM 470 470 469 470 470 470 470 470
TCGA-COAD 461 460 459 433 460 458 461 461
TCGA-STAD 443 443 439 441 443 443 443 443
TCGA-BLCA 412 412 412 412 412 412 412 412
TCGA-LIHC 377 377 376 375 376 377 377 377
TCGA-CESC 307 307 307 305 302 307 307 307
TCGA-KIRP 291 291 291 288 290 291 291 291
TCGA-SARC 261 261 261 255 261 261 261 261
TCGA-LAML 200 195 188 149 200 140 200 200
TCGA-ESCA 185 185 184 184 185 185 185 185
TCGA-PAAD 185 185 178 183 185 184 185 185
TCGA-PCPG 179 179 179 179 179 179 179 179
TCGA-READ 172 171 167 158 167 165 172 172
TCGA-TGCT 150 150 150 150 150 150 150 150
TCGA-THYM 124 124 124 123 124 124 124 124
TCGA-KICH 113 66 66 66 66 66 113 113
TCGA-ACC 92 92 80 92 92 80 92 92
TCGA-MESO 87 87 87 83 87 87 87 87
TCGA-UVM 80 80 80 80 80 80 80 80
TCGA-DLBC 58 48 48 37 50 48 58 58
TCGA-UCS 57 57 57 57 57 57 57 57
TCGA-CHOL 51 51 36 51 36 36 51 51
Total 11,315 10,999 10,558 10,418 11,124 10,943 11,315 11,315
Types of data

• Core dataset:
– Pathology report
– Histology images
– Clinical data
– Whole exome-seq
– SNP 6.0 array
– mRNAseq
– miRNAseq
– Methylation array
• Future datasets:
– 50x Whole-genome sequencing
– Bisulfite sequencing
– Protein Array
Single Cell Expression Atlas

Discover and interpret gene


expression analysis results
at single cell level

www.ebi.ac.uk/gxa/sc/
Http://webs.iiitd.edu.in/raghava/cancerdr/
Overall Architecture of CancerlivER
◆ A biomarker is a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease (National Cancer Institute (NCI))

Biomarkers
Based on Disease State
• Diagnostics
• Prognostics
• Predictive
Based on Biomolecules
• DNA Biomarker
• RNA Biomarker
• Protein Biomarker
• Glyco Biomarker
Concept of Deep Learning
Deep Learning in Biology: Mining Omic Data
Deep Learning
We will cover two major deep learning models:
• Deep Belief Networks and Autoencoders employ layer-wise unsupervised learning to initialize each layer and capture multiple levels of representation simultaneously.
  – Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554.
  – Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks, Advances in Neural Information Processing Systems 19.
• Convolutional Neural Networks organize neurons based on the animal visual cortex, which allows for learning patterns at both the local level and the global level.
  – Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998.
Convolutional DBN for audio
[Figure: spectrogram → detection units → max pooling unit]

Probabilistic max pooling
• Convolutional neural net: the pooling unit outputs max {x1, x2, x3, x4}, where xi are real numbers.
• Convolutional DBN: the pooling unit outputs max {x1, x2, x3, x4}, where xi are {0,1} and mutually exclusive. Thus, 5 possible cases for (x1, x2, x3, x4):
  0 0 0 0 (pooling unit = 0)
  1 0 0 0 (pooling unit = 1)
  0 1 0 0 (pooling unit = 1)
  0 0 1 0 (pooling unit = 1)
  0 0 0 1 (pooling unit = 1)
• Collapse 2^n configurations into n+1 configurations.
• Permits bottom up and top down inference.

Convolutional DBN for audio
[Figure: spectrogram → detection units → max pooling (one CDBN layer) → detection units → max pooling (second CDBN layer)]
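
The collapse from 2^n to n+1 configurations can be checked by enumeration. The sketch below also includes one way to assign probabilities to the n+1 cases, a softmax over the bottom-up inputs with an extra "all off" state; that formula is an assumption added here for illustration, since the slide does not give it, and the function names are hypothetical.

```python
import math
from itertools import product

def allowed_states(n):
    """Binary detection-unit configurations with at most one unit switched on."""
    return [bits for bits in product((0, 1), repeat=n) if sum(bits) <= 1]

def pooling_probs(inputs):
    """Probabilities of the n+1 cases: index 0 is 'all off', index i+1 is 'unit i on'.

    Assumed form: softmax over the bottom-up inputs plus an off state with energy 0.
    """
    weights = [1.0] + [math.exp(x) for x in inputs]
    total = sum(weights)
    return [w / total for w in weights]

n = 4
all_states = list(product((0, 1), repeat=n))   # 2**n unconstrained configurations
states = allowed_states(n)                     # n + 1 mutually exclusive configurations
pooled = [max(bits) for bits in states]        # pooling unit value for each case
probs = pooling_probs([0.5, -1.0, 2.0, 0.0])   # hypothetical bottom-up inputs
```

With n = 4 the enumeration gives 16 unconstrained configurations but only 5 allowed ones, matching the five cases listed on the slide.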
Deep Learning = Learning Hierarchical
Representations

Learning of object parts
Examples of learned object parts from object categories
Faces Cars Elephants Chairs
Training on multiple
objects
Trained on 4 classes (cars, faces, motorbikes, airplanes).
Second layer: object-specific features.
Third layer: More specific features.
Application of Artificial Intelligence (Convolutional Neural Network)
(Identification of drug resistant strains from genomic information: Local & global annotation)
Flow of information from genotype to phenotype
[Figure: layered network in which SNPs (S1 to S6) feed into genes (G1 to G6), genes feed into pathways (Pa1 to Pa3), and pathways feed into phenotypes (Ph1)]
Deep Belief Networks
• A deep belief network (DBN) is a probabilistic, generative model made up of multiple layers of hidden units.
  – A composition of simple learning modules that make up each layer
• A DBN can be used to generatively pre-train a DNN by using the learned DBN weights as the initial DNN weights.
  – Back-propagation or other discriminative algorithms can then be applied for fine-tuning of these weights.
• Advantages:
  – Particularly helpful when limited training data are available
  – These pre-trained weights are closer to the optimal weights than are randomly chosen initial weights.
Restricted Boltzmann Machine

• A DBN can be efficiently trained in an unsupervised, layer-by-layer manner, where the layers are typically made of restricted Boltzmann machines (RBM).
• RBM: undirected, generative energy-based model with a "visible" input layer and a hidden layer, and connections between the layers but not within layers.
• The training method for RBMs proposed by Geoffrey Hinton for use with training "Product of Experts" models is called contrastive divergence (CD).
[Figure: An RBM with fully connected visible and hidden units. Note there are no hidden-hidden or visible-visible connections.]

Deep Belief Network Architecture

• Once an RBM is trained, another RBM is "stacked" atop it, taking its input from the final already-trained layer.
• The new visible layer is initialized to a training vector, and values for the units in the already-trained layers are assigned using the current weights and biases.
• The new RBM is then trained with the CD procedure.
• This whole process is repeated until some desired stopping criterion is met.

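
The stacking procedure described above can be sketched in plain Python. This is a toy illustration under several assumptions not taken from the slides: binary units, one step of contrastive divergence (CD-1) with a mean-field reconstruction, a fixed number of epochs as the stopping criterion, and a tiny made-up data set. The names `RBM` and `train_stack` are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.c))]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b))]

    def sample(self, probs):
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_update(self, v0, lr=0.1):
        """One CD-1 step: data statistics minus one-step reconstruction statistics."""
        h0 = self.hidden_probs(v0)
        v1 = self.visible_probs(self.sample(h0))  # reconstruction (probabilities)
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
            self.b[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.c[j] += lr * (h0[j] - h1[j])

def train_stack(data, layer_sizes, epochs=5, seed=0):
    """Greedy layer-wise training: each new RBM sees the previous layer's activations."""
    rng = random.Random(seed)
    rbms, inputs = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for v in inputs:
                rbm.cd1_update(v)
        rbms.append(rbm)
        inputs = [rbm.hidden_probs(v) for v in inputs]  # feed upward to the next RBM
    return rbms

# Hypothetical 4-pixel binary "images"
data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
stack = train_stack(data, layer_sizes=[3, 2])
top_features = stack[1].hidden_probs(stack[0].hidden_probs(data[0]))
```

The learned weights of such a stack are what would be copied into a feedforward network before discriminative fine-tuning.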
DBN Example: Hand-written Digit Recognition
• Input: 28 x 28 pixel image
[Figure: 28 x 28 pixel image → 500 neurons → 500 neurons → 2000 top-level neurons, with 10 label neurons attached at the top]
• The top two layers form an associative memory whose energy landscape models the low dimensional manifolds of the digits.
• The model learns to generate combinations of labels and images.
• To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory.
Deep Learning using PyTorch
 PyTorch is a library for Python programs
 It allows deep learning models to be expressed in idiomatic Python.
 Clear syntax, streamlined API, and easy debugging
 Programming the deep learning machine is very natural in PyTorch.
 PyTorch gives us a data type, the Tensor, to hold numbers, vectors,
matrices, or arrays in general.
 In addition, it provides functions for operating on them.
 We can program with them incrementally and, if we want, interactively,
just like we are used to from Python.
PyTorch Deep Learning Model Life-Cycle
 Prepare the Data.
 Define the Model.
 Train the Model.
 Evaluate the Model.
 Make Predictions
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation. Each filter acts as a detector for a small pattern (for example, a beak detector).

Convolution
6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

The filter values are the network parameters to be learned. Each filter detects a small pattern (3 x 3).

Filter 1, stride=1: slide the filter over the image one pixel at a time and take the dot product with each 3 x 3 patch. The first two outputs are 3 and -1.
Filter 1, if stride=2: the first two outputs are 3 and -3.

Filter 1, stride=1, full 4 x 4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Repeat this for each filter. Filter 2, stride=1, feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

Two 4 x 4 images, forming a 2 x 4 x 4 matrix.
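
The sliding dot product can be reproduced with a few lines of plain Python, using the 6 x 6 image and the two filters from the slides. The function name `conv2d` is an illustrative helper; as is usual in CNNs, the filter is applied without flipping and without padding.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over the image and take the dot product at each position."""
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

IMAGE = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
FILTER_1 = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]   # responds to a diagonal
FILTER_2 = [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]   # responds to a vertical line

fmap1 = conv2d(IMAGE, FILTER_1)               # 4 x 4 feature map
fmap1_stride2 = conv2d(IMAGE, FILTER_1, 2)    # 2 x 2 feature map
fmap2 = conv2d(IMAGE, FILTER_2)               # 4 x 4 feature map
```

The value 3 appears in `fmap1` exactly where the diagonal pattern of Filter 1 occurs in the image (top-left and bottom-left corners), which is the sense in which a filter "detects" a pattern.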
Color image: RGB 3 channels
[Figure: Filter 1 and Filter 2 drawn as stacks of three 3 x 3 slices, applied to a color image drawn as a stack of three 6 x 6 channels]
Convolution v.s. Fully Connected

[Figure: the 6 x 6 image is convolved with Filter 1 and Filter 2; for the fully connected view, the same image is flattened into inputs x1, x2, ..., x36]

• With Filter 1, the output neuron with value 3 is connected only to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15. Only connect to 9 inputs, not fully connected: fewer parameters!
• The next output neuron (value -1) is connected to inputs 2, 3, 4, 8, 9, 10, 14, 15, 16 and uses the same 9 weights. Shared weights: even fewer parameters.




The whole CNN
Input image → Convolution → Max Pooling → Convolution → Max Pooling → Flattened → Fully Connected Feedforward network → cat, dog, ……
(Convolution and Max Pooling can repeat many times.)

Max Pooling
Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

Feature map of Filter 1:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Feature map of Filter 2:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Why Pooling
• Subsampling pixels will not change the object
[Figure: an image of a bird is still recognizable as a bird after subsampling]

We can subsample the pixels to make the image smaller → fewer parameters to characterize the image.
A CNN compresses a fully connected network in two ways:
• Reducing number of connections
• Shared weights on the edges
Max pooling further reduces the complexity.
Max Pooling

6 x 6 image → Conv → Max Pooling → new image but smaller (2 x 2 image). Each filter is a channel.

Channel from Filter 1:
3 0
3 1

Channel from Filter 2:
-1 1
 0 3
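
The pooled 2 x 2 channels can be reproduced from the two 4 x 4 feature maps with a short sketch. The function name `max_pool` is an illustrative helper, non-overlapping 2 x 2 windows are assumed, and the channel-by-channel flattening order at the end is one possible convention rather than the one any particular library uses.

```python
def max_pool(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Feature maps of Filter 1 and Filter 2 from the convolution slides
fmap1 = [[3, -1, -3, -1], [-3, 1, 0, -3], [-3, -3, 0, 1], [3, -2, -2, -1]]
fmap2 = [[-1, -1, -1, -1], [-1, -1, -2, 1], [-1, -1, -2, 1], [-1, 0, -4, 3]]

pooled = [max_pool(fmap1), max_pool(fmap2)]   # 2 channels, each 2 x 2
# Flatten channel by channel (assumed order) for the fully connected layers
flattened = [x for channel in pooled for row in channel for x in row]
```

Pooling shrinks each 4 x 4 map to 2 x 2 while keeping the strongest response in each region, so the two filters leave a 2 x 2 x 2 "image" with 8 values in total.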
The whole CNN
• Convolution followed by Max Pooling produces a new image.
• The new image is smaller than the original image.
• The number of channels is the number of filters.
• Convolution and Max Pooling can repeat many times, each time producing a new image.
• The last image is flattened and passed to a fully connected feedforward network, which outputs the class (cat, dog, ……).

Flattening
The two pooled 2 x 2 channels (3 0 / 3 1 and -1 1 / 0 3) are flattened into a single vector of 8 values, which is the input to the fully connected feedforward network.

CNN in Keras
Only modified the network structure and input format (vector -> 3-D tensor).

• Input_shape = ( 28 , 28 , 1): 28 x 28 pixels; last value 1: black/white, 3: RGB.
• First convolution: there are 25 3x3 filters.

Layer shapes:
Input                                  1 x 28 x 28
Convolution (25 filters, 3 x 3)        25 x 26 x 26
Max Pooling                            25 x 13 x 13
Convolution (50 filters, 3 x 3)        50 x 11 x 11
Max Pooling                            50 x 5 x 5
Flattened                              1250
Fully connected feedforward network → Output

How many parameters for each filter?
• First convolution: 9
• Second convolution: 225 = 25 x 9
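
The shapes and per-filter parameter counts in this walk-through can be re-derived with simple arithmetic. The sketch below is plain Python rather than Keras code; it assumes 3 x 3 filters with no padding and stride 1, 2 x 2 pooling that drops any leftover row or column, and it counts weights only (bias terms excluded). The function names are hypothetical.

```python
def conv_shape(shape, n_filters, k=3):
    """Output shape of a k x k convolution without padding, and weights per filter."""
    channels, height, width = shape
    weights_per_filter = channels * k * k   # each filter spans all input channels
    return (n_filters, height - k + 1, width - k + 1), weights_per_filter

def pool_shape(shape, size=2):
    """Output shape of non-overlapping size x size max pooling."""
    channels, height, width = shape
    return (channels, height // size, width // size)

shape = (1, 28, 28)
shape, per_filter_1 = conv_shape(shape, 25)
after_conv1 = shape
shape = pool_shape(shape)
after_pool1 = shape
shape, per_filter_2 = conv_shape(shape, 50)
after_conv2 = shape
shape = pool_shape(shape)
after_pool2 = shape
flattened_size = shape[0] * shape[1] * shape[2]
```

The second convolution needs 225 weights per filter because each of its filters covers all 25 channels produced by the first layer, not a single 3 x 3 plane.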
AlphaGo
• Input: 19 x 19 matrix (Black: 1, white: -1, none: 0)
• Neural network output: next move (19 x 19 positions)
• A fully-connected feedforward network can be used, but a CNN performs much better.

AlphaGo's policy network
The following is a quotation from their Nature article:
[Quoted passage shown as an image in the original slide]
Note: AlphaGo does not use Max Pooling.
CNN in speech recognition
[Figure: a spectrogram (axes: Time and Frequency) is treated as an image and fed to a CNN]
The filters move in the frequency direction.

CNN in text classification
Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
Image-based Biomarkers

Major applications of Images
1. Gel electrophoresis (SDS-PAGE)
2. 2-D gel electrophoresis
3. Microarray or gene expression data
4. Karyotyping (chromosome mapping)
5. MRI (Cardiac function imaging)
6. Ultrasound (3-D, breast cancer)
7. Radionuclide (Detection of Cancer)
8. Histology -> Biopsy (Neurological diseases)
9. Electron Microscope
10. Imaging in therapy
DNAsize: Measuring size of DNA fragments from
Gel Electrophoresis data
AC2DGel: Analysis and comparison of 2D gels
Classification of healthy and diseased leaves
Image based prominent nucleoli detection

J Pathol Inform.
2015; 6: 39.
X-ray based diagnostics
Overview
 Image (or region) similarity is used in many applications:
 Object recognition
 Scene classification
 Image registration
 Image retrieval
 Robot localization
 Template matching
 Building panorama
 And many more…
SIFT: Scale-invariant feature transform
Spin image
RIFT: Rotation-invariant feature transform
HoG: Histogram of oriented gradients
GLOH: Gradient Location and Orientation Histogram


Descriptors
Types of descriptors
 Intensity based
 Histogram
 Gradient based
 Color Based
 Frequency
 Shape
 Combination of the above
Descriptors
Intensity Histogram

0 255

- Not invariant to light intensity change


- Does not capture geometric information
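
Both limitations can be seen in a toy example. The pixel values below are made up, the 8-bin layout is an arbitrary choice, and `intensity_histogram` is an illustrative helper rather than a library function.

```python
def intensity_histogram(pixels, bins=8, max_value=255):
    """Count pixels per intensity bin over the range 0..max_value."""
    hist = [0] * bins
    width = (max_value + 1) / bins
    for p in pixels:
        hist[min(int(p / width), bins - 1)] += 1
    return hist

# Hypothetical 8-pixel patch: a dark half followed by a bright half
patch = [10, 20, 30, 40, 200, 210, 220, 230]
brighter = [min(p + 40, 255) for p in patch]   # same scene under stronger light
mirrored = list(reversed(patch))               # same pixels, different layout

h_patch = intensity_histogram(patch)
h_brighter = intensity_histogram(brighter)
h_mirrored = intensity_histogram(mirrored)
```

The brightened patch yields a different histogram although the content is the same, while the mirrored patch yields an identical histogram although the layout changed.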
Descriptors
Solution:
• Divide the area
• For each section compute its own histogram

SIFT - David Lowe 1999

Descriptors - SIFT
Step 2: Compute the gradient for each pixel (direction and magnitude)

16 x 16

Step 3: Divide the pixels into 16, 4x4 squares


Descriptors - SIFT

Step 4: For each square, compute gradient direction histogram over 8 directions.

The result: 128 dimensions feature vector.


Descriptors - SIFT
• Warp the feature into a 16x16 square.
• Divide into 16 4x4 squares.
• For each square, compute a histogram of the gradient directions.

=> Feature vector (128)
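
The 16 x 8 = 128 layout can be sketched in plain Python. This is a simplified illustration, not Lowe's full method and not the OpenCV implementation: it assumes the 16 x 16 patch is already extracted, uses central differences with clamped borders, and omits Gaussian weighting, orientation normalization, interpolation between bins and value clipping. The name `sift_like_descriptor` is hypothetical.

```python
import math

def sift_like_descriptor(patch):
    """Return a 128-value descriptor: 16 cells (4x4 pixels each) x 8 direction bins."""
    n = 16
    if len(patch) != n or any(len(row) != n for row in patch):
        raise ValueError("expected a 16 x 16 patch")
    desc = [0.0] * 128
    for y in range(n):
        for x in range(n):
            # Gradient by central differences, clamped at the patch border
            dx = patch[y][min(x + 1, n - 1)] - patch[y][max(x - 1, 0)]
            dy = patch[min(y + 1, n - 1)][x] - patch[max(y - 1, 0)][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            direction = int(angle / (2 * math.pi) * 8) % 8   # one of 8 directions
            cell = (y // 4) * 4 + (x // 4)                    # one of 16 squares
            desc[cell * 8 + direction] += magnitude
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc] if norm > 0 else desc

# Hypothetical patch: intensity increases from left to right
ramp = [[x for x in range(16)] for _ in range(16)]
descriptor = sift_like_descriptor(ramp)
```

For the left-to-right ramp every gradient points in the same direction, so only the first direction bin of each of the 16 squares is non-zero.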


OpenCV – Introduction
• OpenCV stands for the Open Source Computer Vision Library. It was founded at Intel in 1999, went through some lean years after the dot-com bust, and is now under active development with ongoing support from Willow Garage.
• OpenCV is free for commercial and research use. It has a BSD license. The library runs across many platforms and actively supports Linux, Windows and Mac OS.
• OpenCV was founded to advance the field of computer vision. It gives everyone a reliable, real time infrastructure to build on. It collects and makes available the most useful algorithms.

OpenCV - Features
• Cross Platform: Windows, Linux, Mac OS
• Portable: iPhone, Android
• Language Support: C/C++, Python

OpenCV Overview: > 500 functions (opencv.willowgarage.com)
• General image processing functions
• Image pyramids
• Segmentation
• Geometric descriptors
• Camera calibration, stereo, 3D
• Features
• Transforms
• Utilities and data structures
• Tracking
• Machine learning: detection, recognition
• Fitting
• Matrix math
• Robot support
OpenCV – Getting Started
• Download OpenCV
  – http://opencv.org
  – Install from macports/aptitude
• Setting up
  – Comprehensive guide on setting up OpenCV in various environments at the official wiki.
• Online Reference:
  – http://docs.opencv.org
• Two books

OpenCV Highlights
 Focus on real-time image processing
 Written in C/C++
 Interface: C/C++, Python, Java, Matlab/Octave
 Cross-platform: Windows, Mac, Linux, Android, iOS, etc
 Open source and free!
Applications
 Feature extraction
 Recognition (facial, gesture, etc)
 Segmentation
 Robotics
 Machine learning support: Boosting, k-nearest neighbor, SVM, etc
OpenCV for Computing SIFT Descriptors using Python
Implementation of CNN using KERAS
https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
Image Classification using PyTorch
https://www.analyticsvidhya.com/blog/2019/10/building-image-classification-models-cnn-pytorch/
