Intro Biol Datav2
http://www.telegraph.co.uk/technology/10172298/One-surveillance-camera-for-every-11-people-in-Britain-says-CCTV-survey.html
Space Exploration
• Square Kilometre Array (SKA) project
• Radio telescopes with a combined collecting area of about 1 sq km
• “sensitive enough to detect airport radar on a
planet 50 light years away”
• Generates 750 terabytes every SECOND!
http://venturebeat.com/2014/10/05/how-big-data-is-fueling-a-new-age-in-space-exploration/
Data in the modern world
• Data as the fourth pillar of science
• The first 3 pillars are:
– Theory
– Experiment
– Computation
Jobs!
• Needed by 2018, in the US alone:
– 140,000 to 190,000 big data analysts
– 1.5 million managers who understand big data
Source: McKinsey & Company report on big data
Big Data in Biology
A Big Data place
• The European Bioinformatics Institute (EBI) in Hinxton, UK,
– part of the European Molecular Biology Laboratory
– one of the world's largest biology-data repositories,
– currently stores 20 petabytes of data and back-ups
– Data about genes, proteins and small molecules.
Another Big Data place
• Beijing Genomics Institute (BGI) in Shenzhen, China
• “The Sequence Factory”
• 157 genome sequencing instruments working 24/7
• Samples from people, plants, animals and microbes.
• Each day, it generates 6 terabytes of genomic data.
• Every instrument can decode one human genome per week
(used to take months or years and many staff).
(Marx, Nature, 2013)
• Structure prediction
– Secondary structure
– Tertiary structure
– Quaternary Structure
• Can be quite complex
– p53 – protein associated with a tumor-suppressor gene
– A p53 mutation database exists
– 60,000 publications on p53 alone!
Proteomics
• The study of the full complement of proteins (the proteome)
expressed in a cell, organ or organism
Interactome
• The whole set of molecular interactions in a
particular cell.
I’m here
NOW… LET’S GET TO THE CELL LEVEL
Neurons come in different shapes
Neuromorph
The Connectome Project
• Find out the complete wiring diagram of the brain
• Human brain
– Has about 100 billion neurons
– Each neuron has about 1,000 to 10,000 connections
• Feasible for smaller organisms
– C. elegans connectome – about 300 neurons, 7000
connections (White et al., 1986)
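The scale gap between the two connectomes can be made concrete with a quick calculation. The sketch below simply multiplies the figures quoted on the slide (about 100 billion neurons, 1k to 10k connections each) and compares the result with the 7000 connections of C. elegans. The function and variable names are illustrative, and the estimate is only an order of magnitude because it ignores how connections are shared between neurons.

```python
# Back-of-the-envelope estimate using the figures quoted on the slide.
HUMAN_NEURONS = 100e9                   # about 100 billion neurons
CONNECTIONS_PER_NEURON = (1e3, 1e4)     # 1k to 10k connections per neuron
C_ELEGANS_CONNECTIONS = 7000            # White et al., 1986 (as quoted)


def total_connections(neurons, per_neuron):
    """Rough connection count: neurons times connections per neuron."""
    return neurons * per_neuron


low = total_connections(HUMAN_NEURONS, CONNECTIONS_PER_NEURON[0])
high = total_connections(HUMAN_NEURONS, CONNECTIONS_PER_NEURON[1])

print(f"Human brain: {low:.0e} to {high:.0e} connections")
print(f"Low estimate vs. C. elegans: {low / C_ELEGANS_CONNECTIONS:.1e} times larger")
```

Even the low estimate is about ten orders of magnitude above the C. elegans wiring diagram, which is why the complete mapping is feasible only for smaller organisms.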
Human Connectome Project
• 1200 individuals
• fMRI + dMRI + MRI + MEG
• Washington U + U. Minnesota
http://www.humanconnectome.org/
Connectome Data Sizes
HCP Data Sizes (per Subject)

Session                                   Format                   .zip File Size
Structural                                Unprocessed              70.99 MB
                                          Preprocessed             1.19 GB
Resting State fMRI (each of 2 sessions)   Unprocessed              2 GB
                                          Preprocessed             3.24 GB
Task fMRI (avg per Task)                  Unprocessed              490 MB
                                          Preprocessed             771 MB
Task fMRI (all 7 Tasks)                   Unprocessed              3.43 GB
                                          Preprocessed             5.4 GB
Diffusion                                 Unprocessed              2.18 GB
                                          Preprocessed             2.81 GB
Group-Average on Unrelated 20             Additionally Processed   289 MB
Total (per Subject)                       Unprocessed              9.81 GB
                                          Preprocessed             15.77 GB
                                          Both                     25.58 GB

For 68 subjects: 1.8 terabytes!
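As a rough check on the 68-subject figure, the per-subject totals can be scaled up directly. This is a minimal sketch that assumes decimal units (1 TB = 1000 GB) and uses only the per-subject totals quoted above; the names are illustrative. It lands at roughly 1.7 TB, the same order as the 1.8 TB quoted on the slide, and the small gap may come from rounding or from files not listed here.

```python
# Per-subject totals as quoted on the slide, in GB.
PER_SUBJECT_GB = {"unprocessed": 9.81, "preprocessed": 15.77}
N_SUBJECTS = 68


def total_tb(per_subject_gb, n_subjects, gb_per_tb=1000):
    """Scale a per-subject size in GB to a cohort size in TB."""
    return per_subject_gb * n_subjects / gb_per_tb


both_gb = sum(PER_SUBJECT_GB.values())      # unprocessed + preprocessed
cohort_tb = total_tb(both_gb, N_SUBJECTS)

print(f"Per subject: {both_gb:.2f} GB")
print(f"{N_SUBJECTS} subjects: about {cohort_tb:.1f} TB")
```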
Figure labels: Genome, Transcriptome, Proteome, Metabolome/Interactome, Regulatory Networks
ANALYZE THAT!
Questions?
• How to represent an object?
• How to compare objects?
– Same type or different?
– How different?
• How to group/cluster objects based on similarity?
• How to assign objects to classes?
• How to compare groups of objects?
– Are two groups of objects really different?
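One common way to approach the first two questions is to represent each object as a vector of numeric features and to measure difference as a distance between vectors. The sketch below assumes Euclidean distance and made-up three-feature vectors; the names `cell_a`, `cell_b` and `cell_c` and their values are hypothetical, and other representations and distance measures are equally possible.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, objects):
    """Return the name of the stored object closest to the query vector."""
    return min(objects, key=lambda name: euclidean_distance(query, objects[name]))


# Hypothetical objects, each described by three made-up numeric features.
objects = {
    "cell_a": [1.0, 2.0, 0.5],
    "cell_b": [1.1, 1.9, 0.4],
    "cell_c": [5.0, 0.2, 3.0],
}

print(euclidean_distance(objects["cell_a"], objects["cell_b"]))  # small: similar
print(euclidean_distance(objects["cell_a"], objects["cell_c"]))  # large: different
print(nearest([4.0, 0.0, 3.0], objects))
```

Grouping, classification and group comparison all build on a distance or similarity measure of this kind.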
Course Structure
• Mathematical Preliminaries
– Vectors, vector spaces
– Eigenvalues and eigenvectors
– Derivatives in higher dimensions
– Linear Least Squares problem
– Optimization
• Lagrange multipliers
– Probability and Bayes theorem
Unsupervised Learning methods
• Clustering
– K-means
– Hierarchical clustering
– Scale-based clustering
– Fuzzy clustering
– Graph based clustering
– Self-organizing map
• Dimensionality reduction
– PCA and ICA
Classification
• Prototype-based classification
– K Nearest-neighbor classifier
– Learning Vector Quantization
• Discriminant-based classification
– Linear Discriminant Analysis
– Neural Networks
• Multilayer perceptron
• Radial Basis Function Network
– Support Vector Machines
– Bayesian Classifier
Biostatistics
• Standard Normal Distribution
• Hypothesis testing
• Multiple hypothesis testing
• Chi-squared distribution
• F-test and Student’s t-test
• ANOVA
• Regression Analysis
Text Books
• Introduction to Data Mining – Tan/Steinbach/Kumar
• Neural Networks: A Classroom Approach – Satish Kumar
• The Analysis of Biological Data – Whitlock/Schluter
Grading
• Quiz 1 – 20
• Quiz 2 – 20
• Assignments – 20
• Endsem – 40