Gene Expression: Quantification of Information Molecules and Their Applications
Molecules or Objects
Proteins & Peptides: • Structure prediction • Subcellular localization • Therapeutic application • Ligand binding
Gene Expression: • Disease biomarkers • Drug biomarkers • mRNA expression • Copy number variation
Chemoinformatics: • Drug design • Chemical descriptors • QSAR models • Personalized inhibitors
Image annotation: • Image classification • Medical images • Disease classification • Disease diagnostics
Molecular Biology Overview
[Figure: the omics layers of the cell. Chromosomes (23 pairs) in the nucleus are studied by genomics and epigenomics (DNA methylation and histone acetylation marks); mRNA (the transcriptome) by functional genomics; proteins (the proteome) by 2D electrophoresis, gel-free methods, mass spectrometry, protein sequencing, translational fusions, immunodetection and enzyme activities (proteomics); metabolites (the metabolome) by chromatography, mass spectrometry and NMR (metabolomics), alongside glycomics (sugars) and lipidomics (lipids).]
History
➢ 1980s: antibody-based assays (protein chips)
➢ P. Brown et al.: gene expression profiling using spotted cDNA microarrays, measuring expression levels of known genes
➢ Affymetrix: whole-genome expression profiling using tiling arrays, identifying and profiling novel genes and splicing variants
➢ 2008, many groups: mRNA-seq, direct sequencing of mRNAs using next-generation sequencing (NGS) techniques
http://www.affymetrix.com/technology/ge_analysis/index.affx
Terms/Jargons
Applications:
Gene (exon, isoform) expression estimation
Differential gene (exon, isoform) expression analysis
Transcriptome assembly - Map exon, intron boundaries, splice junctions
Discovery of novel transcribed regions
Analyse alternative splicing
Overview of RNA-Seq
Transcriptome profiling using NGS
How RNA-seq works
Sample preparation
Data analysis:
✓Mapping reads
✓Visualization (Gbrowser)
✓De novo assembly
✓Quantification
Figure from Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nat. Rev. Genetics 10, 57-63, 2009.
FASTQ format: Line 1: sequence identifier; Line 2: raw sequence; Line 3: '+' (optionally followed by the identifier); Line 4: per-base quality scores.
Alignment to genome:
- HISAT2
- STAR
Expression units:
RPKM: Reads Per Kilobase of transcript per Million mapped reads
FPKM: Fragments Per Kilobase of transcript per Million mapped reads
TPM: Transcripts Per Million
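A minimal sketch of how these units are computed from raw counts (the gene counts and lengths below are made-up toy values):

```python
import numpy as np

# Hypothetical raw read counts per gene and gene lengths in kilobases
counts = np.array([500, 1200, 300], dtype=float)
lengths_kb = np.array([2.0, 4.0, 1.0])

# RPKM: scale by library size (per million mapped reads), then by gene length (per kb)
per_million = counts.sum() / 1e6
rpkm = (counts / per_million) / lengths_kb

# TPM: normalize by gene length first, then rescale so all values sum to one million
rate = counts / lengths_kb
tpm = rate / rate.sum() * 1e6

print("RPKM:", rpkm)
print("TPM :", tpm)
```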
From reads to differential expression
Raw sequence data (FASTQ files) -> QC by FastQC/R -> Read mapping -> Mapped reads (SAM/BAM files) -> Expression quantification -> Differential expression
Downstream RNA-Seq analyses: transcriptomics, small indels, gene fusion, alternative splicing, RNA editing, network and pathway analysis
Integrative analysis with epigenomics: DNA methylation (Bisulfite-Seq), histone modification and transcription factor binding (ChIP-Seq)
https://doi.org/10.1038/nprot.2017.149
From Svensson et al., 2018.
Benefits of single cell sequencing
Opens the door to several biological and clinical questions
Gene Expression Omnibus (GEO)
➢ A public repository for the archiving and distribution of gene expression data submitted by the scientific community.
➢ GEO is a public functional genomics data repository supporting MIAME (Minimum Information About a Microarray Experiment)-compliant data submissions.
➢ Array- and sequence-based data are accepted.
➢ Curated, online resource for gene expression data browsing, query, analysis and retrieval.
➢ Convenient for deposition of gene expression data, as required by funding agencies and journals.
Goals of GEO
1. Provide a robust, versatile database for efficient storage of high-throughput functional genomic data
2. Offer simple submission procedures and formats that support complete and well-annotated data [deposited by the research community]
3. Provide user-friendly mechanisms that allow users to query, locate, review and download studies and gene expression profiles of interest
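As an illustration of programmatic access to GEO (not part of the slides), a hedged sketch using the third-party GEOparse Python package; the accession used here is only a placeholder example:

```python
import GEOparse

# Download and parse a series SOFT file (placeholder accession; replace as needed)
gse = GEOparse.get_GEO(geo="GSE2553", destdir="./geo_cache")

# Series-level metadata (title, platform, submission date, ...)
for key, value in gse.metadata.items():
    print(key, ":", value[:1])

# Each sample (GSM) exposes its expression table as a pandas DataFrame
for gsm_name, gsm in list(gse.gsms.items())[:3]:
    print(gsm_name)
    print(gsm.table.head())
```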
The GDC Data Portal: An Overview
The Genomic Data Commons (GDC) Data Portal provides users with web-based access to data from cancer genomics studies.
Key GDC Data Portal features
• Open, granular access to information about all datasets available in the GDC.
• Advanced search and visualization-assisted filtering of data files.
• Data visualization tools to support the analysis and exploration of data (including at the gene and mutation level from open-access MAF files).
• Cart for collecting data files of interest.
• Authentication using eRA Commons credentials and authorization using dbGaP for access to controlled data files.
• Secure data download directly from the cart or using the GDC Data Transfer Tool.
• For more information about available datasets, see the GDC Website.
https://docs.gdc.cancer.gov/Data_Portal/PDF/Data_Portal_UG.pdf
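The GDC also exposes a public REST API. A hedged sketch of a query with the requests library is shown below; the endpoint and field names follow the documented /files endpoint but should be checked against the current API reference:

```python
import json
import requests

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Filter: open-access gene expression quantification files from TCGA-BRCA
filters = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]}},
        {"op": "in", "content": {"field": "data_type", "value": ["Gene Expression Quantification"]}},
    ],
}

params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,cases.submitter_id",
    "format": "JSON",
    "size": "5",
}

response = requests.get(FILES_ENDPOINT, params=params)
for hit in response.json()["data"]["hits"]:
    print(hit["file_id"], hit["file_name"])
```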
GDC data portal Repository
Data Category: SNV, transcriptome profiling, CNV, sequencing reads, biospecimen, clinical, DNA methylation, somatic mutation, combined nucleotide variation
Data Type: raw simple somatic mutation, annotated somatic mutation, aligned reads, gene expression quantification, and so on
GDC clinical data
The Cancer Genome Atlas (TCGA)
TCGA is a project, begun in 2005, to catalogue genetic mutations responsible for cancer, using genome sequencing.
TCGA is supervised by the Center for Cancer Genomics and the National Human Genome Research Institute. A three-year pilot project, begun in 2006, focused on characterization of three types of human cancers: glioblastoma, lung, and ovarian cancer.
In 2009, it expanded into phase II, which planned to complete the genomic characterization and sequence analysis of 20–25 different tumor types by 2014.
Data types generated include: gene expression (RNA-seq and array), copy number variation (array and genome sequencing), SNP genotyping (GGS and WES), DNA methylation (array and RRBS), microRNA profiling, proteomics (RPPA), and chromatin accessibility (ATAC-seq).
There are 3554 authorized requesters associated with TCGA study (currently)
Project Cases Seq Exp SNV CNV Meth Clinical Clinical Supplement
TCGA-BRCA 1,098 1,098 1,097 1,044 1,098 1,095 1,098 1,098
TCGA-GBM 617 406 166 396 599 423 617 617
TCGA-OV 608 575 492 443 597 602 608 608
TCGA-LUAD 585 582 519 569 518 579 585 585
TCGA-UCEC 560 559 559 542 558 559 560 560
TCGA-KIRC 537 535 534 339 534 533 537 537
TCGA-HNSC 528 528 528 510 526 528 528 528
TCGA-LGG 516 516 516 513 515 516 516 516
TCGA-THCA 507 507 507 496 505 507 507 507
TCGA-LUSC 504 504 504 497 504 503 504 504
TCGA-PRAD 500 498 498 498 498 498 500 500
TCGA-SKCM 470 470 469 470 470 470 470 470
TCGA-COAD 461 460 459 433 460 458 461 461
TCGA-STAD 443 443 439 441 443 443 443 443
TCGA-BLCA 412 412 412 412 412 412 412 412
TCGA-LIHC 377 377 376 375 376 377 377 377
TCGA-CESC 307 307 307 305 302 307 307 307
TCGA-KIRP 291 291 291 288 290 291 291 291
TCGA-SARC 261 261 261 255 261 261 261 261
TCGA-LAML 200 195 188 149 200 140 200 200
TCGA-ESCA 185 185 184 184 185 185 185 185
TCGA-PAAD 185 185 178 183 185 184 185 185
TCGA-PCPG 179 179 179 179 179 179 179 179
TCGA-READ 172 171 167 158 167 165 172 172
TCGA-TGCT 150 150 150 150 150 150 150 150
TCGA-THYM 124 124 124 123 124 124 124 124
TCGA-KICH 113 66 66 66 66 66 113 113
TCGA-ACC 92 92 80 92 92 80 92 92
TCGA-MESO 87 87 87 83 87 87 87 87
TCGA-UVM 80 80 80 80 80 80 80 80
TCGA-DLBC 58 48 48 37 50 48 58 58
TCGA-UCS 57 57 57 57 57 57 57 57
TCGA-CHOL 51 51 36 51 36 36 51 51
Total 11,315 10,999 10,558 10,418 11,124 10,943 11,315 11,315
Types of data
www.ebi.ac.uk/gxa/sc/
Http://webs.iiitd.edu.in/raghava/cancerdr/
Overall Architecture of CancerLivER
Biomarkers
◆ A biomarker is a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease (National Cancer Institute (NCI)).
By molecule: RNA biomarkers, protein biomarkers, glyco biomarkers
By use: prognostic, predictive
Concept of Deep Learning
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology
Convolutional DBN for audio
[Figure: a spectrogram is fed to a convolutional DBN; each CDBN layer consists of detection units followed by probabilistic max pooling, and a second CDBN layer is stacked on the pooled output of the first.]
Deep Learning = Learning Hierarchical
Representations
Learning of object parts
Examples of learned object parts from object categories: faces, cars, elephants, chairs.
Training on multiple objects
Trained on 4 classes (cars, faces, motorbikes, airplanes).
Second layer: object-specific features.
Third layer: More specific features.
Application of Artificial Intelligence (Convolutional Neural Network)
(Identification of drug resistant strains from genomic information: Local & global annotation)
Flow of information from genotype to phenotype
[Figure: SNPs (S1–S6) map to genes (G1–G6), genes map to pathways (Pa1–Pa3), and pathways map to phenotypes (Ph1).]
Deep Belief Networks
A deep belief network (DBN) is a probabilistic, generative model made up of multiple layers of hidden units.
It can be viewed as a composition of simple learning modules that make up each layer.
Advantages:
Pre-training is particularly helpful when limited training data are available; the pre-trained weights are closer to the optimal weights than randomly chosen initial weights.
Restricted Boltzmann Machine
The training method for RBMs proposed by Geoffrey Hinton for training "Product of Experts" models is called contrastive divergence (CD).
[Figure: an RBM with fully connected visible and hidden units; note there are no hidden-hidden or visible-visible connections.]
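A minimal numpy sketch of one contrastive divergence (CD-1) update for a binary RBM with toy dimensions; it is illustrative only, not the exact procedure from the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 step on a single binary visible vector v0."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities and a sample given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # Negative phase: reconstruct visibles, then recompute hidden probabilities
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(n_visible) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # CD-1 updates: data statistics minus reconstruction statistics
    W = W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v = b_v + lr * (v0 - v1)
    b_h = b_h + lr * (p_h0 - p_h1)

v = np.array([1, 0, 1, 0, 1, 1], dtype=float)  # toy training vector
for _ in range(100):
    cd1_update(v)
print(np.round(W, 3))
```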
Deep Belief Network Architecture
Once an RBM is trained, another RBM is "stacked" atop it, taking its input from the final already-trained layer.
The new visible layer is initialized to a training vector, and values for the units in the already-trained layers are assigned using the current weights and biases.
The new RBM is then trained with the CD procedure.
This whole process is repeated until some desired stopping criterion is met.
DBN Example: Hand-written Digit Recognition
The network stacks a 28 x 28 pixel image layer, two hidden layers of 500 neurons each, and 2000 top-level neurons connected to 10 label neurons.
The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits.
The model learns to generate combinations of labels and images.
To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory.
Deep Learning using Pytorch
PyTorch is a library for Python programs
It allows deep learning models to be expressed in idiomatic Python.
Clear syntax, streamlined API, and easy debugging
Programming deep learning models feels very natural in PyTorch.
PyTorch gives us a data type, the Tensor, to hold numbers, vectors,
matrices, or arrays in general.
In addition, it provides functions for operating on them.
We can program with them incrementally and, if we want, interactively,
just like we are used to from Python.
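A small illustrative sketch of working with PyTorch tensors (not taken from the slides):

```python
import torch

# Tensors hold scalars, vectors, matrices, or higher-dimensional arrays
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
w = torch.randn(2, 3, requires_grad=True)   # track gradients for learning

y = x @ w                 # matrix multiplication
loss = (y ** 2).mean()    # a toy scalar loss
loss.backward()           # autograd fills w.grad

print(y.shape)            # torch.Size([2, 3])
print(w.grad.shape)       # torch.Size([2, 3])
```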
PyTorch Deep Learning Model Life-Cycle
Prepare the Data.
Define the Model.
Train the Model.
Evaluate the Model.
Make Predictions.
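A compact sketch of that life-cycle on synthetic data; the model, dimensions and hyperparameters are illustrative assumptions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# 1. Prepare the data (synthetic binary classification problem)
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# 2. Define the model
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 3. Train the model
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

# 4. Evaluate the model
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean()
    print(f"training accuracy: {accuracy:.3f}")

# 5. Make predictions on new data
new_sample = torch.randn(1, 20)
print("predicted class:", model(new_sample).argmax(dim=1).item())
```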
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.
[Figure: a "beak detector" filter responding to a beak-like patch in a bird image.]
Convolution
The filters are the network parameters to be learned. Each filter detects a small (3 x 3) pattern.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:          Filter 2:
 1 -1 -1           -1  1 -1
-1  1 -1           -1  1 -1
-1 -1  1           -1  1 -1
Convolution, stride = 1
Sliding Filter 1 over the 6 x 6 image with stride 1, the dot product with the top-left 3 x 3 patch is 3 and with the next patch is -1.
Convolution, stride = 2
With stride 2, Filter 1 produces 3 and -3 for the first two positions in the top row.
Convolution, stride = 1
Convolving the 6 x 6 image with Filter 1 (stride 1) gives the 4 x 4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
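A short numpy sketch reproducing this feature map by sliding the 3 x 3 filter over the 6 x 6 image (valid convolution, stride 1):

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])

filter1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

# Valid convolution (a cross-correlation, as in CNNs), stride 1
out = np.zeros((4, 4), dtype=int)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * filter1)

print(out)   # first row is [ 3 -1 -3 -1]
```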
Convolution, stride = 1: repeat this for each filter.
Filter 2 gives a second 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
The two 4 x 4 feature maps together form a 2 x 4 x 4 output.
Color image: RGB 3 channels
[Figure: for a colour image the input has three channels (R, G, B); each filter then has a 3 x 3 slice per channel, i.e. the filters become 3 x 3 x 3.]
Convolution v.s. Fully Connected
1 0 0 0 0 1 1 -1 -1 -1 1 -1
0 1 0 0 1 0 -1 1 -1 -1 1 -1
0 0 1 1 0 0 -1 -1 1 -1 1 -1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
convolution
image
x1
1 0 0 0 0 1
0 1 0 0 1 0 x2
Fully- 0 0 1 1 0 0
1 0 0 0 1 0
connected
…
…
…
…
0 1 0 0 1 0
0 0 1 0 1 0
x36
1 -1 -1 Filter 1 1 1
-1 1 -1 2 0
-1 -1 1 3 0
4: 0 3
…
1 0 0 0 0 1
0 1 0 0 1 0 0
0 0 1 1 0 0 8 1
1 0 0 0 1 0 9 0
0 1 0 0 1 0 10: 0
…
0 0 1 0 1 0
1 0
6 x 6 image
3 0
14
fewer parameters! 15 1 Only connect to 9
16 1 inputs, not fully
connected
…
1 -1 -1 1: 1
-1 1 -1 Filter 1 2: 0
-1 -1 1 3: 0
4: 0 3
…
1 0 0 0 0 1
0 1 0 0 1 0 7: 0
0 0 1 1 0 0 8: 1
1 0 0 0 1 0 9: 0 -1
0 1 0 0 1 0 10: 0
…
0 0 1 0 1 0
1 0
6 x 6 image
3: 0
14:
Fewer parameters 15: 1
16: 1 Shared weights
Even fewer parameters
…
The whole CNN
Input image -> Convolution -> Max Pooling -> Convolution -> Max Pooling (the convolution + max pooling pair can be repeated many times) -> Flattened -> Fully connected feedforward network -> output (e.g. cat, dog, ...)
Max Pooling
The two 4 x 4 feature maps produced by Filter 1 and Filter 2:
Filter 1:            Filter 2:
 3 -1 -3 -1          -1 -1 -1 -1
-3  1  0 -3          -1 -1 -2  1
-3 -3  0  1          -1 -1 -2  1
 3 -2 -2 -1          -1  0 -4  3
Why Pooling
Subsampling pixels will not change the object: a subsampled bird is still a bird, so pooling produces a new, smaller image.
Starting from the 6 x 6 image, convolution gives a 4 x 4 feature map per filter; 2 x 2 max pooling then reduces each to a 2 x 2 image:
Filter 1:   Filter 2:
3 0         -1 1
3 1          0 3
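A short numpy sketch of 2 x 2 max pooling applied to the Filter 1 feature map above:

```python
import numpy as np

feature_map = np.array([
    [ 3, -1, -3, -1],
    [-3,  1,  0, -3],
    [-3, -3,  0,  1],
    [ 3, -2, -2, -1],
])

# 2 x 2 non-overlapping max pooling: split into 2 x 2 blocks and take each block's max
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[3 0]
                #  [3 1]]
```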
Each filter produces one channel of the new image.
The whole CNN: convolution and max pooling can be repeated many times; each round gives a new, smaller image whose number of channels equals the number of filters.
Flattened: the final pooled feature maps (here the 2 x 2 outputs 3 0 / 3 1 and -1 1 / 0 3) are flattened into a single vector that feeds a fully connected feedforward network.
CNN in Keras
Only the network structure and the input format change (vector -> 3-D tensor), input_shape = (28, 28, 1).
Input: 1 x 28 x 28
Convolution (25 filters of size 3 x 3): output 25 x 26 x 26; each filter has 9 parameters.
Max Pooling (2 x 2): output 25 x 13 x 13
Convolution (50 filters of size 3 x 3): output 50 x 11 x 11; each filter has 225 = 25 x 9 parameters.
Max Pooling (2 x 2): output 50 x 5 x 5
Flattened: 50 x 5 x 5 = 1250 values feed the fully connected layers.
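A hedged Keras sketch of this architecture (standard tf.keras layers; the dense layer sizes after the flatten are assumptions, since the slides stop at 1250):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(25, (3, 3), activation="relu"),   # -> 26 x 26 x 25
    layers.MaxPooling2D((2, 2)),                    # -> 13 x 13 x 25
    layers.Conv2D(50, (3, 3), activation="relu"),   # -> 11 x 11 x 50
    layers.MaxPooling2D((2, 2)),                    # ->  5 x  5 x 50
    layers.Flatten(),                               # -> 1250
    layers.Dense(100, activation="relu"),           # assumed hidden layer size
    layers.Dense(10, activation="softmax"),         # 10 digit classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```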
AlphaGo
A neural network maps the board, encoded as a 19 x 19 matrix (black: 1, white: -1, none: 0), to the next move (19 x 19 positions).
A fully-connected feedforward network can be used, but a CNN performs much better.
AlphaGo's policy network is described in their Nature article. Note: AlphaGo does not use max pooling.
CNN in speech recognition
The spectrogram is treated as an image, with time and frequency as the two axes.
CNN in text classification
Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
Image-based Biomarkers
Major applications of Images
1. Gel electrophoresis (SDS-PAGE)
2. 2-D gel electrophoresis
3. Microarray or gene expression data
4. Karyotyping (chromosome mapping)
5. MRI (Cardiac function imaging)
6. Ultrasound (3-D, breast cancer)
7. Radionuclide (Detection of Cancer)
8. Histology -> Biopsy (Neurological diseases)
9. Electron Microscope
10. Imaging in therapy
DNAsize: Measuring size of DNA fragments from gel electrophoresis data
AC2DGel: Analysis and comparison of 2D gels
Classification of healthy and diseased leaves
Image-based prominent nucleoli detection (J Pathol Inform. 2015; 6: 39)
X-ray based diagnostics
Overview
Image (or region) similarity is used in many applications:
Object recognition
Scene classification
Image registration
Image retrieval
Robot localization
Template matching
Building panorama
And many more…
SIFT: Scale-invariant feature transform
[Figure: SIFT and spin-image descriptors. The SIFT descriptor takes a 16 x 16 window of grayscale values (0-255) around a keypoint, divides it into 4 x 4 squares and, for each square, computes a gradient direction histogram over 8 directions.]
OpenCV - Features
Cross platform: Windows, Linux, Mac OS
Portable: iPhone, Android
Language support: C/C++, Python
OpenCV Overview
More than 500 functions (opencv.willowgarage.com), organized into modules for image processing and transforms, geometric descriptors, features, segmentation, camera calibration / stereo / 3D, tracking, fitting, machine learning (detection, recognition), matrix math, utilities and data structures, and robot support.
OpenCV – Getting Started
Download OpenCV: http://opencv.org
Setting up: comprehensive guide on setting up OpenCV in various environments at
Two books
OpenCV Highlights
Focus on real-time image processing
Written in C/C++
Interface: C/C++, Python, Java, Matlab/Octave
Cross-platform: Windows, Mac, Linux, Android, iOS, etc
Open source and free!
Applications
Feature extraction
Recognition (facial, gesture, etc)
Segmentation
Robotics
Machine learning support: Boosting, k-nearest neighbor, SVM, etc
OpenCV for Computing SIFT Descriptors using Python
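A hedged sketch using the opencv-python bindings (SIFT is in the main module in OpenCV >= 4.4; in older builds it lived in opencv-contrib under cv2.xfeatures2d). The image path is a placeholder:

```python
import cv2

# Placeholder path: replace with an actual image file
image = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                       # SIFT keypoint detector + descriptor
keypoints, descriptors = sift.detectAndCompute(image, None)

print("number of keypoints:", len(keypoints))
print("descriptor shape:", descriptors.shape)  # (n_keypoints, 128)

# Draw the detected keypoints and save the visualization
annotated = cv2.drawKeypoints(image, keypoints, None,
                              flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", annotated)
```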
Implementation of CNN using Keras
https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
Image Classification using PyTorch
https://www.analyticsvidhya.com/blog/2019/10/building-image-classification-models-cnn-pytorch/
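To close, a minimal PyTorch sketch of a CNN image classifier mirroring the Keras architecture described above (an assumption of this rewrite, not taken from the linked tutorial); it runs on random tensors standing in for 28 x 28 grayscale images:

```python
import torch
from torch import nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 25, kernel_size=3), nn.ReLU(),   # -> 25 x 26 x 26
            nn.MaxPool2d(2),                              # -> 25 x 13 x 13
            nn.Conv2d(25, 50, kernel_size=3), nn.ReLU(),  # -> 50 x 11 x 11
            nn.MaxPool2d(2),                              # -> 50 x 5 x 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # -> 1250
            nn.Linear(50 * 5 * 5, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
dummy_batch = torch.randn(8, 1, 28, 28)   # random stand-in for grayscale images
logits = model(dummy_batch)
print(logits.shape)                        # torch.Size([8, 10])
```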