Fondazione Pisana per la Scienza,

January 28, 2019.

Bioinformatics
Paolo Aretini
Senior Researcher
FPS Bioinformatic Lab
Wifi: FPS-corporate
Password: FPScorporate
Bioinformatics

Bioinformatics is an applied science that uses computer programs to access molecular biology databanks to make inferences about the information contained within the data archives.
Bioinformatics

• This lab introduces you to some of the main databanks, their applications, and programs.

• You will learn how to retrieve information from the databases and analyze it to obtain useful knowledge about a DNA sequence and its protein product.
Why it’s useful
• All of the information needed to build an organism is
contained in its DNA. If we could understand it, we
would know how life works.
– Preventing and curing diseases like cancer (which is
caused by mutations in DNA) and inherited diseases.
– Curing infectious diseases (everything from AIDS and
malaria to the common cold). If we understand how a
microorganism works, we can figure out how to block it.
– Understanding genetic and evolutionary relationships
between species
– Understanding genetic relationships between humans.
What are we looking for?
Data & databases
Biologists Collect Lots of Data
• Hundreds of thousands of species to explore; Millions of written articles
in scientific journals; Detailed genetic information:
• gene names
• phenotype of mutants
• location of genes/mutations on chromosomes
• linkage (distances between genes)

High Throughput lab technologies


• PCR
• Next Generation Sequencing (Illumina, Ion Torrent technologies)
• Microarrays (Affymetrix)
• Genome-wide SNP chips / SNP arrays (Illumina)
Main databases by category
Literature
• PubMed: scientific & medical abstracts/citations
Health
• OMIM: Online Mendelian Inheritance in Man
Nucleotide Sequences
• Nucleotide: DNA and RNA sequences
Genomes
• Genome: genome sequencing projects by organism
• dbSNP: short genetic variations
• Ensembl: genome browser for vertebrate genomes
Genes & Proteins
• Protein: protein sequences
• UniProt: protein sequences and related information
• Several mutation databases (COSMIC; TP53 Database; BRCA1 & BRCA2 databases)
Pathways
• BioSystems: molecular pathways with links to genes and proteins
• KEGG Pathway: information on main biological pathways
DATABASES
Primary databases

REAL EXPERIMENTAL DATA (raw)

Biomolecular sequences or structures and associated annotation information (organism, function, mutations linked to disease, functional/structural patterns, bibliographic references, etc.)

Secondary databases

DERIVED INFORMATION (analyzed and annotated)

Results of analyses of the data held in the primary sources (patterns, blocks, profiles, etc., which represent the most conserved features of multiple alignments)
GENBANK DATABASE

• Contains publicly available DNA sequences, with the protein translations of their coding regions, described in the scientific literature or collected in publicly funded research
• One can search by protein name to get DNA/mRNA sequences
• The search results can be filtered by species and other parameters
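
Retrieval of a record can also be scripted. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service and does not contact the network; it uses the accession JQ288018.1 that serves as the FASTA example in this deck. The function name `genbank_fasta_url` is a hypothetical helper introduced here, not part of any NCBI library.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def genbank_fasta_url(accession: str) -> str:
    """Build an efetch URL that returns a nucleotide record in FASTA format.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    params = {
        "db": "nuccore",     # NCBI Nucleotide database, which serves GenBank records
        "id": accession,     # accession.version identifier
        "rettype": "fasta",  # ask for FASTA rather than the full flat file
        "retmode": "text",   # plain-text response
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = genbank_fasta_url("JQ288018.1")
print(url)
```

Opening the printed URL in a browser, or passing it to any HTTP client, should return the record as FASTA text.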
FASTA format to store sequences
• The FASTA format is supported by virtually all databases and software that handle DNA and protein sequences
• Specifications:
• One header line, which starts with > and ends with [return]
• The sequence itself follows on one or more lines after the header
• Example: Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, GenBank: JQ288018.1
>gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds
TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAACACCCTGTTCTTTTGACTGAAG
CTCCAATGAACCCTAAATCAAACAGAGAAAAGATGACTCAAATTATGTTTGAAACTTTCAACGTTCC
AGCCTTCTACGTTTCCATCCAAGCCGTTTTGTCCTTGTACTCTTCCGGTAGAACTACTGGTATTGTTT
TGGATTCCGGTGATGGTGTTACTCACGTCGTTCCAATTTACGCTGGTTTCTCTCTACCTCACGCCATT
TTGAGAATCGATTTGGCCGGTAGAGATTTGACTGACTACTTGATGAAGATCTTGAGTGAACGTGGTT
ACTCTTTCTCCACCACTGCTGAAAGAGAAATTGTCCGTGACATCAAGGAAAAACTATGTTACGTCG
CCTTGGACTTCGAGCAAGAAATGCAAACCGCTGCTCAATCTTCTTCAATTGAAAAATCCTACGAAC
TTCCAGATGGTCAAGTCATCACTATTGGTAAC
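
The FASTA layout (one header line starting with >, then sequence lines) is simple enough to parse by hand. The following is a minimal sketch, not a replacement for an established parser; `parse_fasta` and `gc_content` are names chosen here for illustration, and the short record in the usage lines is a made-up toy sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated; case is normalized to upper.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gc_content(sequence):
    """Fraction of G and C bases in a nucleotide sequence (0.0 for empty input)."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


# Toy input (hypothetical record, not taken from a database).
toy = ">toy_record example\nACGTAC\nggcc\n"
for header, sequence in parse_fasta(toy):
    print(header, len(sequence), round(gc_content(sequence), 2))
```

The same function can be applied to the ACT1 record above by pasting the header and sequence lines into a string or reading them from a file.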
BLASTN
BLAST
NCBI databases contain more than just DNA & protein sequences

NCBI main portal: http://www.ncbi.nlm.nih.gov/


OMIM Database

• Online Mendelian Inheritance in Man (OMIM)
• "Information on all known mendelian disorders linked to over 12,000 genes"
• Started in the 1960s by Dr. Victor A. McKusick "as a catalog of mendelian traits and disorders"
• Linked disease data
• Links disease phenotypes and causative genes
• Used by physicians and geneticists
PUBMED

• PubMed is one of the best-known databases in the whole scientific community
• It indexes most of the biology-related literature from all the related fields
• It has a very powerful mechanism for constructing search queries
• Many search fields
• Logical operators (AND, OR)
• Provides electronic links to most journals
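
As an illustration of search fields combined with logical operators, the sketch below assembles a PubMed query string and the matching E-utilities `esearch` URL without sending any request. The helper names `pubmed_query` and `esearch_url` are hypothetical, and the search terms (ECHS1, Leigh syndrome) are borrowed from the case study later in this deck.

```python
from urllib.parse import urlencode


def pubmed_query(*clauses, operator="AND"):
    """Join (term, field) clauses with a Boolean operator.

    Each term is tagged with a PubMed search field in square brackets;
    multi-word terms are quoted so they are searched as a phrase.
    """
    parts = []
    for term, field in clauses:
        if " " in term:
            term = f'"{term}"'
        parts.append(f"{term}[{field}]" if field else term)
    return f" {operator} ".join(parts)


def esearch_url(term, retmax=20):
    """Build (but do not send) an E-utilities esearch request for PubMed."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(params)


query = pubmed_query(("ECHS1", "Title/Abstract"), ("Leigh syndrome", "Title/Abstract"))
print(query)
print(esearch_url(query))
```

The printed query string can also be pasted directly into the PubMed search box.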
UNIPROT

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
MUTATION DATABASES

• Databases of mutations causing Mendelian disease or cancer play a crucial role in research, diagnostics and genetic health care, and can inform life-and-death decisions.
The Human Gene Mutation Database

• The Human Gene Mutation Database (HGMD®) represents an attempt to collate all known (published) gene lesions responsible for human inherited disease
VarSome

• VarSome's mission is to bring together the global life sciences community and facilitate the exchange of information that will lead to new discoveries.
Human Protein Atlas
The Human Protein Atlas is a Swedish-based program
initiated in 2003 with the aim to map all the human proteins
in cells, tissues and organs using integration of various omics
technologies, including antibody-based imaging, mass
spectrometry-based proteomics, transcriptomics and systems
biology. All the data in the knowledge resource is open
access to allow scientists both in academia and industry to
freely access the data for exploration of the human
proteome.
Revolution of NGS technologies
Sanger Sequencing
Comparison of Technologies

Technology   Max output per run
Sanger       57 Kb (1 h run)
NGS          1,800 Gb (3.5-day run)
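
To put the two maximum outputs on a common scale, they can be converted to bases per hour of run time. The sketch below does only that arithmetic on the figures quoted in the comparison (57 Kb in 1 h for Sanger, 1,800 Gb in 3.5 days for NGS); it ignores library preparation, parallel instruments and read quality, so the ratio is a rough indication only.

```python
# Figures quoted in the comparison of technologies.
SANGER_BASES_PER_RUN = 57e3   # 57 Kb
SANGER_RUN_HOURS = 1.0        # 1 h
NGS_BASES_PER_RUN = 1_800e9   # 1,800 Gb
NGS_RUN_HOURS = 3.5 * 24      # 3.5 days expressed in hours


def bases_per_hour(bases, hours):
    """Average sequencing output per hour of instrument run time."""
    return bases / hours


sanger_rate = bases_per_hour(SANGER_BASES_PER_RUN, SANGER_RUN_HOURS)
ngs_rate = bases_per_hour(NGS_BASES_PER_RUN, NGS_RUN_HOURS)
fold = ngs_rate / sanger_rate

print(f"Sanger: {sanger_rate:,.0f} bases/hour")
print(f"NGS:    {ngs_rate:,.0f} bases/hour")
print(f"NGS/Sanger ratio: about {fold:,.0f}x")
```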
Genome Sequencing Cost per Mb (30x)

Relative throughput of high-throughput technologies (HTT)
Next Generation Sequencing has emerged with a data-production potential that will eventually displace conventional high-throughput (HT) technologies in the coming years.

NGS: too many sequences to be handled on standard hardware
NGS Technologies
NGS sequencers

• Roche: 454 FLX+, 454 Junior
• Illumina: GAIIx, MiSeq, NextSeq, HiSeq
• Life Technologies: SOLiD 5500, Ion Torrent, Ion Proton
• Helicos: HeliScope
• Pacific Biosciences: RS, PacBio Sequel
• Oxford Nanopore: MinION, GridION, PromethION
• Complete Genomics: Revolocity
Illumina Sequencers

Platform          Max Output   Max Read Number   Max Read Length
MiSeq             15 Gb        25 M              2x300 bp
NextSeq 500/550   120 Gb       400 M             2x150 bp

Source: www.illumina.com
Ion S5 and Ion S5XL
Illumina Sequencers

Platform                Max Output   Max Read Number   Max Read Length
HiSeq 2500*/3000/4000   1,000* Gb    4,000* M          2x125* bp
HiSeq X Ten/X Five      1,800 Gb     6,000 M           2x150 bp
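
The three columns of the Illumina specifications are linked: maximum output is approximately the number of reads multiplied by the bases read per fragment. The sketch below checks this for the listed platforms and estimates how many 30x human genomes one run could cover. Two assumptions are made here and are not stated in the slides: "Max Read Number" is treated as the number of read pairs, and the human genome is taken as roughly 3.1 Gb.

```python
HUMAN_GENOME_GB = 3.1  # approximate haploid human genome size in Gb (assumption)

# platform: (max read number in millions, read length in bp, listed max output in Gb)
PLATFORMS = {
    "MiSeq": (25, 300, 15),
    "NextSeq 500/550": (400, 150, 120),
    "HiSeq 2500*/3000/4000": (4000, 125, 1000),
    "HiSeq X Ten/X Five": (6000, 150, 1800),
}


def paired_end_output_gb(read_pairs_millions, read_length_bp):
    """Output in Gb for paired-end reads: pairs x 2 reads x read length."""
    return read_pairs_millions * 1e6 * 2 * read_length_bp / 1e9


def genomes_per_run(output_gb, coverage=30, genome_gb=HUMAN_GENOME_GB):
    """How many genomes of the given size fit in one run at the given coverage."""
    return output_gb / (coverage * genome_gb)


for name, (pairs_m, length_bp, listed_gb) in PLATFORMS.items():
    computed = paired_end_output_gb(pairs_m, length_bp)
    print(f"{name}: computed {computed:,.0f} Gb, listed {listed_gb:,} Gb, "
          f"about {genomes_per_run(listed_gb):.1f} human genomes at 30x")
```

Under the read-pair assumption the computed outputs match the listed values, which is why that interpretation was chosen for this sketch.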
NGS in Genomics
DATA ANALYSIS ISSUES
Storing and analyzing the huge amounts of data generated by sequencing and other high-throughput technologies requires infrastructure that provides high-performance computing and large-scale storage resources.
Local Framework

CHALLENGES

• Laboratory-hosted servers require investment in informatics support for configuring and using software;

• Servers are expensive to set up and maintain;

• The equipment needs adequate space and environmental conditions (a "server room").

Local Framework

ADVANTAGES

• Many computational resources available;

• Customization and testing of pipelines with newly developed in-house software;

• No data transfer;

• No ethical issues from moving data off-site.
FPS BIOINFORMATICS

The laboratory was created to analyze and manage Next Generation Sequencing (NGS) data. It provides:

• IT support to the institution (backup and data storage, software and device installation, database management);

• Bioinformatic and statistical analysis;

• NGS data analysis.

FPS BIOINFORMATICS

NGS technologies are used for many applications:

• rare variant discovery by whole genome resequencing or targeted sequencing (exome analysis);

• transcriptome profiling of cells, tissues or organisms;

• many more applications (alternative splicing; identification of epigenetic markers; ChIP-Seq).

NGS technologies in our lab: GeneStudio S5 (Thermo Fisher) and NextSeq 500 (Illumina)
IT Technologies
The Bioinformatics section is equipped for intensive computation and for short- and long-term data storage, thanks to a collaboration with the IT Center of the University of Pisa, which hosts the informatics infrastructure.

• 5 Dell PowerEdge C8000 and FC630 servers, each with 32 CPU cores, 128 GB RAM and 16 TB of storage.

• 70 TB storage system based on Dell EqualLogic and PowerVault.

• 1 Torrent server with 8 CPU cores, 128 GB RAM, 27 TB of storage and 2 NVIDIA® Tesla® GPUs.
Open source software
• We mainly use open source software running on Bio-Linux (Ubuntu Linux);

• Command line software;

• Software for primary analysis (mapping; variant calling; gene expression and differential gene expression);

• Software for data visualization (mapping data; gene expression data; mutation data);

• Pathway and network analysis.


RNA-seq Analysis

Execution Time for 1 sample

• 1-2 hours with 28 threads (depending on data quality and size)


DNA-seq pipeline
Execution time for Illumina NextSeq 500 data

• Genome and exome analysis with the described pipeline, implemented in SeqMule (http://seqmule.openbioinformatics.org/en/latest/)

• About 40 hours to run the pipeline for genome analysis (28 threads, Dell PowerEdge C8000)

• 6-15 hours to run the pipeline for exome analysis (28 threads, Dell PowerEdge C8000)
Critical Step

Characterize biological meaning of data:

• Variant annotation and filtering;

• Pathway and Network Analysis;

Time consuming!
Critical Step: variant annotation and filtering
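
A filtering step of this kind typically narrows an annotated variant list using a few criteria. The sketch below is one possible illustration and not the pipeline used in the lab: it keeps protein-altering variants that are rare in the population and homozygous in the proband. The `Variant` record, the 1% frequency threshold, the consequence labels and the GENE_A to GENE_D entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str       # annotation label, e.g. "missense" (hypothetical vocabulary)
    population_af: float   # allele frequency in a reference population
    proband_genotype: str  # "hom" or "het"


# Consequence classes treated as protein-altering in this sketch (assumption).
PROTEIN_ALTERING = frozenset({"missense", "nonsense", "frameshift", "splice_site"})


def filter_candidates(variants, max_af=0.01, genotypes=("hom",)):
    """Keep rare, protein-altering variants with the requested proband genotype."""
    return [
        v for v in variants
        if v.consequence in PROTEIN_ALTERING
        and v.population_af <= max_af
        and v.proband_genotype in genotypes
    ]


# Made-up records for illustration only.
annotated = [
    Variant("GENE_A", "synonymous", 0.0005, "hom"),  # not protein-altering
    Variant("GENE_B", "missense", 0.2500, "hom"),    # too common
    Variant("GENE_C", "missense", 0.0001, "het"),    # wrong genotype
    Variant("GENE_D", "missense", 0.0001, "hom"),    # passes all filters
]

candidates = filter_candidates(annotated)
for v in candidates:
    print(v.gene, v.consequence, v.population_af, v.proband_genotype)
```

Passing `genotypes=("hom", "het")` relaxes the genotype criterion, which is one way to widen the search when no candidate survives.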
Critical Step: Pathway and Network Analysis;
Genetic characterization of a Leigh syndrome case

Leigh syndrome (LS, OMIM 256000) is a rare, heterogeneous, progressive neurodegenerative disorder that usually presents in infancy or early childhood. LS inheritance is complex, since patients may carry mutations in mitochondrial DNA (mtDNA) or in nuclear genes, which predominantly encode proteins involved in respiratory chain structure and assembly or in coenzyme Q10 biogenesis.

The proband is a 19-year-old male born to non-consanguineous parents of Caucasian origin, after a normal pregnancy, at 40 weeks of gestation and with normal birth measurements. Both parents and the 18-year-old brother are healthy.

Exome analysis was performed on the affected individual (proband) and his relatives (mother, father and brother).
Identification of a rare homozygous missense mutation in the ECHS1 (short-chain enoyl-CoA hydratase) gene, present in a 19-year-old individual with Leigh syndrome.
Using CeQer (http://www.ngsbicocca.org/html/ceqer.html), a software tool able to detect copy number variation from exome data, we detected a deletion in an extensive region of chromosome 10 (from 135120573 to 135187238) involving ZNF511, CALY, PRAP1, FUOM and ECHS1. This deletion is present in the proband, in his mother and in his brother, but not in the father.
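
The reported deletion can be summarized with a small calculation. The sketch below computes the size of the deleted interval from the coordinates given above and reads the parental origin from the carrier status of the family members. It assumes 1-based inclusive coordinates, which the slides do not state; the function names are chosen here for illustration.

```python
# Deletion reported on chromosome 10 (coordinates as given in the text).
DELETION = ("chr10", 135120573, 135187238)
GENES_INVOLVED = ["ZNF511", "CALY", "PRAP1", "FUOM", "ECHS1"]

# Carrier status as reported: proband, mother and brother carry the deletion.
CARRIERS = {"proband": True, "mother": True, "brother": True, "father": False}


def span_bp(start, end):
    """Interval length, assuming 1-based inclusive coordinates (assumption)."""
    return end - start + 1


def parental_origin(carriers):
    """Classify where a variant seen in the proband may have come from."""
    mother, father = carriers["mother"], carriers["father"]
    if mother and father:
        return "either parent"
    if mother:
        return "maternal"
    if father:
        return "paternal"
    return "not detected in either parent"


chrom, start, end = DELETION
print(f"{chrom}:{start}-{end} spans {span_bp(start, end):,} bp "
      f"and involves {len(GENES_INVOLVED)} genes")
print("Origin of the deletion:", parental_origin(CARRIERS))
```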
