
COMP90016 2023 06 Data Sources

This lecture discusses sources of sequencing data. It covers common file formats for raw sequencing data like fastq and fasta. It also discusses processed data formats like assembled genomes in fasta format and annotated genomes in gff format. The lecture describes major sequencing archives like SRA that house raw sequencing data and curated databases of reference genomes and proteins. It provides an overview of searching and accessing sequencing data from these various sources.

Uploaded by

Lynn CHEN

Computational Genomics

Lecture 6
Sources of Sequencing Data
Dr Vicky Perreau

Before watching this lecture, make sure you are familiar with:
1 Intro & Genomics I
2 Genomics II
3 Sequencing technologies
4 Intro to computing

Today:
6 Sequencing data sources
Sources of sequencing data

● Data types and File types


○ Common flat file formats

● Sequence archives
○ Searching and retrieval

● Curated data resources

Data types and file types
• Data types and file types

• Sequence archives

• Curated collections

Data ‘mining’

Sequencing data
Raw Processed Curated

Data types

• “Raw”
– Amplicon (.fasta), readsets (.fastq / .fastq.gz)
– Amino acid sequence from mass spectrometry
• Derived (processed)
– Assembled genomes (.fasta)
– Annotation (.gbk, .gff, .gtf, .gff3)
– Predicted protein (predicted from DNA or mRNA sequence)
– Aligned reads (.sam and .bam), variants (.vcf)
• Curated
– Organised, annotated, filtered
• Metadata
– Sample data (source, treatments, batch, quality, phenotype, etc.)
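
As a rough illustration of the list above, the sketch below maps a file name to the data type its extension suggests. The function name `classify` and the category labels are assumptions made for this example, not part of any standard tool, and an extension is only a hint: .fasta appears under both raw and derived data.

```python
from pathlib import Path

# Extension-to-category table copied from the slide above.
# ".fasta" is listed under both raw (amplicon) and derived (assembled genome),
# so the extension alone cannot tell the two apart.
EXTENSION_CATEGORY = {
    ".fastq": "raw readset",
    ".fasta": "raw amplicon or derived assembly",
    ".gbk": "derived annotation",
    ".gff": "derived annotation",
    ".gff3": "derived annotation",
    ".gtf": "derived annotation",
    ".sam": "derived aligned reads",
    ".bam": "derived aligned reads",
    ".vcf": "derived variants",
}


def classify(filename: str) -> str:
    """Guess the data type of a file from its extension (hypothetical helper)."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Ignore a trailing ".gz" so that "reads.fastq.gz" is treated as ".fastq".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        return "unknown"
    return EXTENSION_CATEGORY.get(suffixes[-1], "unknown")


for name in ["sample_R1.fastq.gz", "gencode.v33.annotation.sorted.gtf", "notes.txt"]:
    print(name, "->", classify(name))
```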
Raw Data
● Amplicon
○ PCR product, usually Sanger sequence (.ab1, .fasta)
● Locus
○ Multiple overlapping amplicons assembled (.fasta)
● Genome
○ Whole genome shotgun reads (.fastq.gz)
● Prepared libraries (.fastq.gz)
○ Exome
○ RNAseq
○ ChIP-seq
○ Single cell etc...

.fasta format
.fa, .fsa, .fna, .faa
Used for nucleotides or amino acids
Single line header
Sequence may have numbers and spaces
No additional columns
No blank lines

https://en.wikipedia.org/wiki/FASTA_format
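
The rules listed for .fasta can be sketched as a small reader. This is a minimal illustration, not a replacement for an established parser: `read_fasta` is a name chosen for this example, the two records are made up, and digits and spaces in sequence lines are simply dropped, following the slide's note that sequences may contain them.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new single-line header closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Keep residues only: drop position numbers and spaces.
            chunks.append("".join(c for c in line if not (c.isdigit() or c.isspace())))
    if header is not None:
        yield header, "".join(chunks)


# Made-up example: one nucleotide record and one amino acid record.
example = io.StringIO(">seq1 example amplicon\n1 ACGT ACGT\n9 TTGA\n>seq2\nMKV\n")
for name, sequence in read_fasta(example):
    print(name, sequence)
```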
.fastq format (‘reads’ compiled into ‘readsets’)

https://en.wikipedia.org/wiki/FASTQ_format
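
A FASTQ record is four lines: an `@` header, the sequence, a `+` separator, and a quality string with one character per base. The sketch below reads such records and converts quality characters to Phred scores. It assumes the Phred+33 offset, which is the common modern encoding but is not stated on the slide; `read_fastq` and the example read are made up for illustration.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (an empty line also ends this simple reader)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Assumption: Phred+33 encoding, so '!' is 0 and 'I' is 40.
        scores = [ord(c) - 33 for c in quality]
        yield header[1:], sequence, scores


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for read_id, sequence, scores in read_fastq(example):
    print(read_id, sequence, scores)
```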
Derived Data

● Assembled genome (.fasta)
○ Draft - multiple contigs
○ Complete - one contig per replicon

● Annotated genome (.gbk or .gff)
○ Genomic features labelled e.g. genes

● Protein sequences
○ Translated from predicted genes
○ Translated from assembled transcripts

General feature format (.gff)
Describes gene models

[Figure: gene model drawn from 5’ to 3’]

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
.gff format

Similar formats: GFF2, GTF, GFF3

Row = feature
Each row has 9 required fields

https://en.wikipedia.org/wiki/General_feature_format
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
.gff format columns

Column 1: Sequence ID
Column 2: Source
Column 3: Type of feature
Column 4: Feature start site
Column 5: Feature stop site
Column 6: Score/confidence
Column 7: Strand the feature is encoded on
Column 8: Phase 0, 1, or 2. Only present for protein-encoding features.
Column 9: Other attributes

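
The nine columns can be made concrete with a short parser. This is a sketch that assumes GFF3 conventions (tab-separated fields, `.` for a missing value, `key=value` attributes separated by semicolons, 1-based inclusive coordinates). The class name, function name, and example line are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GffFeature:
    seqid: str                   # column 1: sequence ID
    source: str                  # column 2: source
    feature_type: str            # column 3: type of feature
    start: int                   # column 4: feature start site
    end: int                     # column 5: feature stop site
    score: Optional[float]       # column 6: score/confidence
    strand: str                  # column 7: strand
    phase: Optional[int]         # column 8: phase (protein-encoding features only)
    attributes: Dict[str, str]   # column 9: other attributes

    @property
    def length(self) -> int:
        # Coordinates are 1-based and inclusive, hence the +1.
        return self.end - self.start + 1


def parse_gff3_line(line: str) -> GffFeature:
    """Split one GFF3 feature row into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffFeature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


# Made-up example row, not taken from a real annotation file.
EXAMPLE_LINE = "chr1\texample\tCDS\t1201\t1500\t.\t+\t0\tID=cds1;Parent=mRNA1"
feature = parse_gff3_line(EXAMPLE_LINE)
print(feature.feature_type, feature.length, feature.attributes)
```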
.gtf format

gencode.v33.annotation.sorted.gtf
Downloaded from GENCODE and viewed on the command line using the command:
$ head -n 20 gencode.v33.annotation.sorted.gtf | cut -c 1-15
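
GTF differs from GFF3 mainly in column 9, where attributes are written as `key "value";` pairs and a key such as `tag` may repeat. The sketch below parses that column. The function name and the attribute values in the example are made up for illustration; they are not copied from the GENCODE file.

```python
import re
from typing import Dict, List

# One attribute: a key, then either a quoted value or a bare token, then ';'.
_ATTRIBUTE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_gtf_attributes(text: str) -> Dict[str, List[str]]:
    """Parse the ninth GTF column into a dict of key -> list of values."""
    attributes: Dict[str, List[str]] = {}
    for match in _ATTRIBUTE.finditer(text):
        key = match.group(1)
        value = match.group(2) if match.group(2) is not None else match.group(3)
        # Keys can repeat (for example "tag"), so values are collected in lists.
        attributes.setdefault(key, []).append(value)
    return attributes


# Hypothetical attribute column.
example = 'gene_id "GENE1.1"; gene_name "EXAMPLE"; level 2; tag "basic"; tag "canonical";'
print(parse_gtf_attributes(example))
```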
GenBank file format

.gb
.gbk

Sequence info
Many additional elements

https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Sequence archives
• Data types and file types

• Sequence archives

• Curated collections

http://www.insdc.org/
Controlled access repositories
Human data

NCBI has the Database of Genotypes and Phenotypes (dbGaP)
• https://www.ncbi.nlm.nih.gov/gap/

EMBL-EBI has the European Genome-phenome Archive (EGA)
• https://www.ebi.ac.uk/ega/about

DDBJ has the Japanese Genotype-phenotype Archive (JGA)
• https://www.ddbj.nig.ac.jp/jga/index-e.html

Hosted at NCBI in
Bethesda, Maryland, USA

SRA Toolkit
https://www.ncbi.nlm.nih.gov/books/NBK158900/
Hosted at EBI in
Cambridge, UK

Hosted at NIG in
Mishima, Japan

Sequence Read Archive (SRA)
https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/
https://www.ncbi.nlm.nih.gov/sra/docs/sragrowth/
https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/
“Study” architecture in SRA (SRP#)

BioProject is a collection of biological data for a single initiative, originating from a single organization or from a consortium.

BioSample (SRS#): descriptive information about the source materials.

Patient A: Sample 1, Sample 4
Patient B: Sample 2, Sample 5
Patient C: Sample 3, Sample 6
Each experiment can generate multiple runs (SRR#)

[Figure: experiments and runs for Sample 3 (SRS#3) and Sample 6 from Patient C]
SRX#1 (RNAseq library, machine 1): SRR#1.fastq, SRR#2.fastq
SRX#2 (RNAseq library, machine 2): SRR#3.fastq, SRR#4.fastq
SRX#3 (RNAseq library, machine 3): SRR#5.fastq, SRR#6.fastq
SRX#4 (whole genome sequencing, Nanopore): SRR#7.fastq, SRR#8.fastq, SRR#9.fastq, SRR#10.fastq
SRX#5 (whole genome sequencing, Illumina): SRR#11.fastq, SRR#12.fastq, SRR#13.fastq, SRR#14.fastq

An SRA Experiment (SRX#) is the main publishable unit and describes:
• Replicate number
• Library
• Sequencing strategy
• Layout
• Instrument model
A run (SRR#) is the sequencing data associated with an experiment.
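
The accession prefixes described above (SRP study, SRS sample, SRX experiment, SRR run) can be told apart programmatically, and runs can be grouped under their experiment. The sketch below is illustrative only: the function names and accession numbers are hypothetical, and it also accepts the E and D prefixes used by the European and Japanese archives, which goes beyond what the slide shows.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Third letter of the accession gives its level in the hierarchy.
ACCESSION_LEVELS = {"P": "study", "S": "sample", "X": "experiment", "R": "run"}

# S = NCBI SRA; E and D (ENA, DDBJ) are an addition not shown on the slide.
_ACCESSION = re.compile(r"^([SED])R([PSXR])(\d+)$")


def accession_level(accession: str) -> str:
    """Return 'study', 'sample', 'experiment' or 'run' for an accession."""
    match = _ACCESSION.match(accession)
    if match is None:
        raise ValueError("not a recognised accession: " + accession)
    return ACCESSION_LEVELS[match.group(2)]


def group_runs(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (run, experiment) pairs so each experiment lists its runs."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for run, experiment in pairs:
        if accession_level(run) != "run" or accession_level(experiment) != "experiment":
            raise ValueError("expected a (run, experiment) pair")
        grouped[experiment].append(run)
    return dict(grouped)


# Hypothetical accessions mirroring the figure: one experiment, several runs.
pairs = [("SRR0000001", "SRX0000001"), ("SRR0000002", "SRX0000001"),
         ("SRR0000003", "SRX0000002")]
print(group_runs(pairs))
```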
Protein sequence databases

Curated annotated resources

● Protein https://www.ncbi.nlm.nih.gov/protein/
● UniProt https://www.uniprot.org/
● neXtProt https://www.nextprot.org/

Protein/peptide sequence data dumps

● PeptideAtlas http://www.peptideatlas.org/
● PRIDE https://www.ebi.ac.uk/pride/

Searching for data

NCBI-GEO database

Originally developed for array format data.

Now also holds sequencing data for experiments looking at gene expression, epigenetics and other functional genomics.

DRA search at DDBJ

OmicsDI (https://www.omicsdi.org/)
Meta search engine searching multiple databases and repositories simultaneously.

SciCrunch (https://scicrunch.org/browse/datadashboard)

Reference genomes

Reference genomes: NCBI Assembly database

Human genome
Current human genome version is GRCh38.p14
Agreed reference genomes for all organisms that have been sequenced are important.
Features are mapped to nucleotide numbers on chromosomes.
Updates to the genome can alter the numbering.
The version of the genome that you are working in is critical to your analysis and determines what other mapped data can be included in your analysis.
‘p14’ refers to “patch 14”; patches don’t alter nucleotide numbering.
Reproducible data requires that details of genome versions used in any analysis, and their sources, are described in detail in your methods and appropriately referenced/cited.
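
Because patches do not change nucleotide numbering while a new genome version can, two datasets share coordinates when their assembly names match up to the patch suffix. The sketch below applies that rule to labels such as GRCh38.p14. The function names and the regular expression are assumptions for this example and only cover GRC-style names.

```python
import re
from typing import NamedTuple, Optional


class Assembly(NamedTuple):
    name: str             # e.g. "GRCh38"
    patch: Optional[int]  # e.g. 14, or None when no patch is given


# Assumption: GRC-style labels only, such as GRCh38.p14 or GRCh37.
_ASSEMBLY = re.compile(r"^(GRC[a-z]+\d+)(?:\.p(\d+))?$")


def parse_assembly(label: str) -> Assembly:
    """Split an assembly label into its version name and patch number."""
    match = _ASSEMBLY.match(label)
    if match is None:
        raise ValueError("unrecognised assembly label: " + label)
    patch = int(match.group(2)) if match.group(2) is not None else None
    return Assembly(match.group(1), patch)


def same_coordinates(label_a: str, label_b: str) -> bool:
    """True when two labels differ at most by patch, so numbering is shared."""
    return parse_assembly(label_a).name == parse_assembly(label_b).name


print(parse_assembly("GRCh38.p14"))
print(same_coordinates("GRCh38.p13", "GRCh38.p14"))
print(same_coordinates("GRCh37.p13", "GRCh38.p14"))
```

Recording the full label, including the patch, alongside every result keeps the analysis reproducible even though the patch does not affect coordinates.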
GENCODE annotation files

Curated collections
• File types

• Sequence archives

• Curated collections

Curated databases enable:
● Comparative genomics
○ Orthologs
○ Protein families
○ Evolutionary conservation
● Functional genomics
○ Homologous genes/proteins
○ Co-expression analysis
○ Phenotype (knockout studies)
○ Disease associations
○ Interactions with genes/proteins
○ Pathway analysis

EnteroBase
A Powerful, User-Friendly Online Resource for Analyzing and Visualizing Genomic Variation within Enteric Bacteria

Tutorials
https://enterobase.readthedocs.io/en/latest/enterobase-tutorials/tutorials.html

Users guide
https://genome.cshlp.org/content/early/2019/12/05/gr.251678.119

PlasmoDB

Virus Pathogen Resource (ViPR)

Model organism databases
● Drosophila http://flybase.org/
● Mouse http://www.informatics.jax.org/
● Rat https://www.rgd.mcw.edu/
● Yeast https://www.yeastgenome.org/
● C. elegans https://wormbase.org/
● Zebrafish http://zfin.org/

Ensembl database

Tutorials (inc. short videos)

OMIM
Human genetic disease
Collates diseases associated with specific regions of nucleotides in human DNA.
Many useful links to other databases are available from within each entry.

Expression data: Gemma
Over 14,977 curated expression studies
2021 publication

Expression data: GREIN

Scrapes SRA data and reprocesses it through a standardized pipeline.

Selected large projects
Focused on human genomics and functional genomics

https://www.encodeproject.org/

gnomAD
https://gnomad.broadinstitute.org/

https://www.gtexportal.org/home/

Some focused smaller projects
Neuroscience: Allen Brain Map (https://portal.brain-map.org/)
Immunology: Immunological Genome Project (https://www.immgen.org/)
Genomics: 100,000 Genomes Project (https://www.genomicsengland.co.uk/initiatives/100000-genomes-project)
Interferome (http://www.interferome.org/interferome/home.jspx)

https://academic.oup.com/nar

Link to issue
Data mining overview
• Define your research question
• Plan your data search
• Explore the area
• Research the domain
• Gather appropriate tools
• Curate the data
• Filter for quality
• Validate your findings
• Annotate your code
• Store all your files safely
• Reproducible research
• Share your findings
• Practice
• Stay on task (focus)
• Learn from mistakes

[Figure: iron pyrite vs. gold]

Summary
Data types (raw, derived/processed, metadata, curated)
Common file types (.fastq, .fasta, .gff, .gbk)
Importance of reference genomes (version number and patches) and annotation files
• Mapping diverse types of features to nucleotides in genomes
The main sequence archives and some smaller ones
• Architecture of the SRA sequence archive for deposit and retrieval
Variety of different ways to search for sequence data that may be of interest
• Federated search engines
• Standardized processed data
Diversity of curated data resources
• Bringing together different data types from different sources to facilitate one particular area of interest
Exponential growth of available sequence data
• Many new questions can be, and are being, asked of existing available data
• Bioinformatics can reuse data in many ways

Thank you
Please contact me if you have additional
questions or know of some great databases
that I haven’t mentioned.

[email protected]
