
Intro Biol Datav2

This document provides an overview of the BT3041 course on analysis and interpretation of biological data. It discusses the large amounts of data being generated across various areas of biology such as genomics, proteomics, connectomics and more. It outlines the course structure which will cover topics like data representation, comparison, clustering, classification and statistical analysis techniques to help interpret these large and complex biological datasets.

Uploaded by

a.vidhya 12345

BT3041:

Analysis and Interpretation of Biological Data

Instructor: V. Srinivasa Chakravarthy


Slot: D
Room: CRC301
Life…
under the burden of
BIG DATA
Telecom
• 4.6 billion mobile-phone subscriptions worldwide
• 1-2 billion people accessing the internet. [1]

• The world's effective capacity to exchange information through telecommunication networks:
– 281 petabytes in 1986,
– 471 petabytes in 1993,
– 2.2 exabytes in 2000,
– 65 exabytes in 2007 [8]
– predicted to reach 667 exabytes annually by 2014. (Wiki)
Video Surveillance
• Up to 5.9 million closed-circuit television cameras in the UK
• Including 750,000 in “sensitive locations” such as schools, hospitals and care homes.
• 1 camera for every 11 people
• A few GB/camera/day

http://www.telegraph.co.uk/technology/10172298/One-surveillance-camera-for-every-11-people-in-Britain-says-CCTV-survey.html
Space Exploration
• Square Kilometre Array (SKA) project
• Radio telescopes with a total collecting area of about 1 sq km
• “sensitive enough to detect airport radar on a planet 50 light years away”
• Generates 750 terabytes every SECOND!

http://venturebeat.com/2014/10/05/how-big-data-is-fueling-a-new-age-in-space-exploration/
DATA in modern world
• Data as the fourth pillar of science
• The first 3 pillars are:
– Theory
– Experiment
– Computation
Jobs!
• Needed by 2018, in US alone:
– 140,000 to 190,000 big data analysts
– 1.5 million managers who understand big data
(Source: McKinsey & Company)
Big data in Biology
A Big Data place
• The European Bioinformatics Institute (EBI) in Hinxton, UK,
– part of the European Molecular Biology Laboratory
– one of the world's largest biology-data repositories,
– currently stores 20 petabytes of data and back-ups
– Data about genes, proteins and small molecules.
Another Big Data place
• Beijing Genomics Institute (BGI) in Shenzhen, China
• “The Sequence Factory”
• 157 genome sequencing instruments working 24x7
• Samples from people, plants, animals and microbes.
• Each day, it generates 6 terabytes of genomic data.
• Every instrument can decode one human genome per week (this used to take months or years and many staff).
(Marx, Nature, 2013)

…Where is it all coming from?


THE
OMICS
HUMAN GENOMICS
Human Genome Project
• Aim
– Identify the sequence of bases on all 23 pairs of human chromosomes (3 billion bases, 3 Gb)
– Identify genes within those sequences (~30 000 genes)
– Locate the position of the genes on the chromosomes

• $6 bn, 1000 scientists, 50 countries, 10+ years!

• Human genome can now be sequenced in a few days on ‘next-generation sequencing’ (NGS) machines

• Full genome data being collected from disease conditions
– the combined cancer genome and normal genome from a single patient constitutes about 1 terabyte (10^12 bytes)
– a million genomes would generate an exabyte (10^18 bytes).

(Courtesy: Karthik Raman)


Types of Genomics
• Disease genomics
– Millions of patients per disease
• DNA profiling
– Family lineages, parentage, forensics etc.
• Comparative genomics
• Plant genomics
• Bacterial genomics
• Viral genomics
Transcriptome
• The set of all RNA molecules in a given cell, population of cells or an organism
• A gene may produce many different types of mRNA molecules, so a transcriptome is much more complex than the genome that encodes it.
Proteins
• Peptide: a chain of amino acids (AAs)
• Assuming an average size of 200 AAs, the number of possible proteins is 20^200 > # protons in the universe
• Assume:
– there are 10^7–10^8 species on Earth and
– 10^3–10^5 genes/species,
– then there are 10^10–10^13 unique protein sequences,
– << possible sequence space,
– >> known protein number
– http://www.ncbi.nlm.nih.gov/books/NBK20267/
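The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that works in powers of ten; the figure of roughly 10^80 protons in the observable universe is a commonly quoted order of magnitude assumed here, not a number taken from the slides.

```python
import math

# Figures from the slide: 20 amino acids, average protein length 200.
AMINO_ACIDS = 20
AVG_LENGTH = 200

# Assumption (not from the slides): ~10^80 protons in the observable universe.
PROTONS_LOG10 = 80

# log10 of the number of possible sequences: log10(20^200) = 200 * log10(20)
sequence_space_log10 = AVG_LENGTH * math.log10(AMINO_ACIDS)

# Unique natural sequences = (number of species) x (genes per species),
# so the exponents simply add.
natural_low_log10 = 7 + 3    # 10^7 species x 10^3 genes/species
natural_high_log10 = 8 + 5   # 10^8 species x 10^5 genes/species

print(f"possible sequences: about 10^{sequence_space_log10:.0f}")
print(f"natural sequences:  10^{natural_low_log10} to 10^{natural_high_log10}")
print(f"exceeds proton count: {sequence_space_log10 > PROTONS_LOG10}")
```

Working with exponents keeps the comparison readable: even the upper estimate of natural sequences occupies a vanishing fraction of the possible sequence space.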
Single Protein Study

• Structure prediction
– Secondary structure
– Tertiary structure
– Quaternary Structure
• Can be quite complex
– p53 – tumor suppressor protein (encoded by the TP53 gene)
– p53 mutation database exists
– 60,000 publications on p53 alone!!!
Proteomics
• The study of the proteome: the full complement of proteins expressed in a cell, organ or an organism
Interactome
• The whole set of molecular interactions in a particular cell.
I’m here

NOW…
LET’S GET TO THE CELL LEVEL
Neurons come in different Shapes
Neuromorph
The Connectome Project
• Find out the complete wiring diagram of the brain
• Human brain
– Has 100 billion neurons
– Each neuron has about 1k-10k connections
• Feasible for smaller organisms
– C. elegans connectome – 300 neurons, 7000 connections (White et al. 1986)
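To see why the full wiring diagram is feasible for C. elegans but not yet for the human brain, the slide's figures can be multiplied out. This is a rough sketch; treating the total as neurons times connections per neuron is a simplifying assumption.

```python
# Figures from the slide above. The product neurons x connections-per-neuron
# is a simplifying assumption used only to get an order of magnitude.
HUMAN_NEURONS = 100 * 10**9              # 100 billion
CONNECTIONS_PER_NEURON = (10**3, 10**4)  # 1k to 10k
C_ELEGANS_CONNECTIONS = 7000

human_low = HUMAN_NEURONS * CONNECTIONS_PER_NEURON[0]
human_high = HUMAN_NEURONS * CONNECTIONS_PER_NEURON[1]

print(f"human: {human_low:.0e} to {human_high:.0e} connections")
print(f"at least {human_low / C_ELEGANS_CONNECTIONS:.1e} times C. elegans")
```

Under these assumptions the human connectome has on the order of 10^14 to 10^15 connections, more than ten orders of magnitude above the C. elegans count.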
Human Connectome Project

- 1200 individuals
- fMRI + dMRI + MRI + MEG
- Washington U + U. Minnesota

http://www.humanconnectome.org/
Connectome Data Sizes
HCP Data Sizes (per Subject)

Session                                    Format                   .zip File Size
Structural                                 Unprocessed              70.99 MB
                                           Preprocessed             1.19 GB
Resting State fMRI (each of 2 sessions)    Unprocessed              2 GB
                                           Preprocessed             3.24 GB
Task fMRI (avg per Task)                   Unprocessed              490 MB
                                           Preprocessed             771 MB
Task fMRI (all 7 Tasks)                    Unprocessed              3.43 GB
                                           Preprocessed             5.4 GB
Diffusion                                  Unprocessed              2.18 GB
                                           Preprocessed             2.81 GB
Group-Average on Unrelated 20              Additionally Processed   289 MB

Total                 Unprocessed    Preprocessed    Both
per Subject           9.81 GB        15.77 GB        25.58 GB
5 Subjects            62.16 GB       78.83 GB        141 GB
20 Subjects           247.34 GB      315.05 GB       562.39 GB
68 Subjects           815.4 GB       1.058 TB        1.873 TB

For 68 subjects: 1.8 Terabytes!!
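The table covers at most 68 subjects, while the project targets 1200 individuals. A rough linear extrapolation gives a feel for the final size. This is a sketch under stated assumptions: storage grows linearly with subject count, and 1 TB = 1000 GB (the table's unit convention is not stated). The table's own totals are not exactly linear in the number of subjects, so two estimates are shown.

```python
# Figures from the table above ("Both" = unprocessed + preprocessed).
PER_SUBJECT_GB = 25.58
TOTAL_68_SUBJECTS_TB = 1.873
PLANNED_SUBJECTS = 1200  # from the Human Connectome Project slide

GB_PER_TB = 1000  # assumption: decimal units


def projected_total_tb(subjects, per_subject_gb=PER_SUBJECT_GB):
    """Linear extrapolation from the per-subject size."""
    return subjects * per_subject_gb / GB_PER_TB


# Estimate 1: scale the per-subject figure.
from_per_subject = projected_total_tb(PLANNED_SUBJECTS)

# Estimate 2: scale the measured 68-subject total.
from_68_total = TOTAL_68_SUBJECTS_TB / 68 * PLANNED_SUBJECTS

print(f"from per-subject size: {from_per_subject:.1f} TB")
print(f"from 68-subject total: {from_68_total:.1f} TB")
```

Both estimates land in the low tens of terabytes, which is the order of magnitude to keep in mind for the full data set.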
The Hierarchy

Tissue/organ: Large scale networks
Cell (e.g. neuron): Microcircuits
Proteome: Metabolome/Interactome
Transcriptome: Regulatory Networks
Genome
ANALYZE THAT!
Questions?
• How to represent an object?
• How to compare objects?
– Same type or different?
– How different?
• How to group/cluster objects based on similarity?
• How to assign objects to classes?
• How to compare groups of objects?
– Are two groups of objects really different?
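The first few questions can be made concrete with a small sketch. Here each object is represented as a feature vector, two objects are compared by Euclidean distance, and an object is assigned to the class whose prototype is nearest. The measurements, class labels and function names are hypothetical, chosen only for illustration; Euclidean distance is one possible choice of comparison, not necessarily the one used in the course.

```python
import math


def euclidean(a, b):
    """Distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_class(obj, prototypes):
    """Return the label of the prototype closest to obj."""
    return min(prototypes, key=lambda label: euclidean(obj, prototypes[label]))


# Hypothetical objects: each is a vector of three made-up measurements.
sample_a = [1.0, 2.0, 0.5]
sample_b = [1.1, 1.9, 0.4]
sample_c = [5.0, 0.2, 3.0]

# Hypothetical class prototypes (one representative vector per class).
prototypes = {"class_1": [1.0, 2.0, 0.5], "class_2": [5.0, 0.0, 3.0]}

print(euclidean(sample_a, sample_b))           # small: similar objects
print(euclidean(sample_a, sample_c))           # large: dissimilar objects
print(assign_to_class(sample_c, prototypes))
```

The last question, whether two groups of objects are really different, needs statistical testing and is not covered by this sketch.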
Course Structure
• Mathematical Preliminaries
– Vectors, vector spaces
– Eigenvalues and eigenvectors
– Derivatives in higher dimensions
– Linear Least Squares problem
– Optimization
• Lagrange multipliers
– Probability and Bayes theorem
Unsupervised Learning methods
• Clustering
– K-means
– Hierarchical clustering
– Scale-based clustering
– Fuzzy clustering
– Graph based clustering
– Self-organizing map
• Dimensionality reduction
– PCA and ICA
Classification
• Prototype-based classification
– K Nearest-neighbor classifier
– Learning Vector Quantization
Classification
• Discriminant-based classification
– Linear Discriminant Analysis
– Neural Networks
• Multilayer perceptron
• Radial Basis Function Network
– Support Vector Machines
– Bayesian Classifier
Biostatistics
• Standard Normal Distribution;
• Hypothesis testing;
• Multiple hypothesis testing;
• Chi-squared distribution;
• F-test and Student’s t-test;
• ANOVA;
• Regression Analysis
Text Books
• Introduction to Data Mining –
Tan/Steinbach/Kumar
• Neural Networks: A classroom approach –
Satish Kumar
• Analysis of Biological Data – Whitlock/Schluter
Grading
• Quiz 1 – 20
• Quiz 2 – 20
• Assignments – 20
• Endsem – 40

• Grading policy – RG!!!


May the
DATA
be with you!
