COMP90016 2023 06 Data Sources
Lecture 6
Sources of Sequencing Data
Dr Vicky Perreau
Before watching this lecture, make sure you are familiar with…
Today
● Sequence archives
○ Searching and retrieval
Data types and file types
• Data types and file types
• Sequence archives
• Curated collections
Data ‘mining’
Sequencing data
Raw Processed Curated
Data types
• “Raw”
– Amplicon (.fasta), Readsets (.fastq / .fastq.gz)
– Amino acid sequences from mass spectrometry
• Derived (processed)
– Assembled genomes (.fasta)
– Annotation (.gbk, .gff, .gtf, .gff3)
– Predicted protein (predicted from DNA or mRNA sequence)
– Aligned reads (.sam and .bam), variants (.vcf)
• Curated
– Organised, annotated, filtered
• Metadata
– Sample data (source, treatments, batch, quality, phenotype etc.)
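As a rough illustration of the list above, the following sketch guesses the data type of a file from its extension. The function name `guess_data_type` and the category labels are hypothetical choices made for this example; the mapping only covers the extensions named on this slide, and an extension alone cannot tell a raw amplicon .fasta from an assembled genome .fasta.

```python
from pathlib import Path

# Extension-to-category table built from the slide above.
# The label strings are illustrative, not an official vocabulary.
EXTENSION_TO_DATA_TYPE = {
    ".fastq": "raw (readset)",
    ".fasta": "raw (amplicon) or derived (assembled genome)",
    ".gbk": "derived (annotation)",
    ".gff": "derived (annotation)",
    ".gtf": "derived (annotation)",
    ".gff3": "derived (annotation)",
    ".sam": "derived (aligned reads)",
    ".bam": "derived (aligned reads)",
    ".vcf": "derived (variants)",
}


def guess_data_type(filename: str) -> str:
    """Guess the data type of a file from its (possibly gzipped) extension."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    # Ignore a trailing compression suffix such as .gz
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        return "unknown"
    return EXTENSION_TO_DATA_TYPE.get(suffixes[-1], "unknown")


if __name__ == "__main__":
    for name in ["sample_R1.fastq.gz", "assembly.fasta", "calls.vcf", "notes.txt"]:
        print(name, "->", guess_data_type(name))
```

In practice the metadata that accompanies a file is a more reliable guide than its name.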
Raw Data
● Amplicon
○ PCR product, usually Sanger sequence (.ab1, .fasta)
● Locus
○ Multiple overlapping amplicons assembled (.fasta)
● Genome
○ Whole genome shotgun reads (.fastq.gz)
● Prepared libraries (.fastq.gz)
○ Exome
○ RNAseq
○ ChIP-seq
○ Single cell etc...
.fasta format
.fa, .fsa, .fna, .faa
Used for nucleotides or
amino acids
Single line header
Sequence may have numbers and spaces
No additional columns
No blank lines
https://en.wikipedia.org/wiki/FASTA_format
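The rules above can be sketched as a small reader. This is a minimal illustration, not a replacement for an established parser: the function name `parse_fasta` is hypothetical, the record identifier is assumed to be the first word of the header, and digits and spaces inside sequence lines are simply dropped.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA text into a dict of {record id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # be tolerant of stray blank lines
        if line.startswith(">"):
            # Single-line header: take the first word as the identifier
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = ""
        elif current_id is not None:
            # Keep letters plus '*' (stop) and '-' (gap); drop digits and spaces
            records[current_id] += "".join(
                ch for ch in line if ch.isalpha() or ch in "*-"
            )
    return records


if __name__ == "__main__":
    # Made-up records for illustration only
    example = [
        ">seq1 example amplicon",
        "ACGT ACGT",
        "1 TTGA",
        ">seq2",
        "MKV*",
    ]
    print(parse_fasta(example))
```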
.fastq format (‘reads’ compiled into ‘readsets’)
https://en.wikipedia.org/wiki/FASTQ_format
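For orientation, a FASTQ record normally spans four lines: an `@` header, the sequence, a `+` separator, and a quality string with one character per base. The sketch below reads such records and converts the quality characters to Phred scores. It assumes the common Phred+33 encoding and unwrapped (single-line) sequences; the function name `parse_fastq` is a hypothetical choice for this example.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding, as used by most current data


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read id, sequence, Phred scores) for each four-line record."""
    iterator = iter(lines)
    for header in iterator:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(iterator, "").strip()
        separator = next(iterator, "").strip()
        quality = next(iterator, "").strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        scores = [ord(ch) - PHRED_OFFSET for ch in quality]
        yield header[1:], sequence, scores


if __name__ == "__main__":
    # Made-up read for illustration only
    example = ["@read1", "ACGT", "+", "II5!"]
    for read_id, seq, scores in parse_fastq(example):
        print(read_id, seq, scores)
```

Compressed readsets (.fastq.gz) could be passed in by opening them with `gzip.open(path, "rt")` from the standard library.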
Derived Data
● Protein sequences
○ Translated from predicted genes
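To make "translated from predicted genes" concrete, here is a minimal sketch that translates a coding sequence with the standard genetic code. It assumes the sequence is already in frame and on the coding strand; the function name `translate` is hypothetical, and real annotation pipelines also handle alternative genetic codes and ambiguous bases.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_sequence: str) -> str:
    """Translate an in-frame DNA or mRNA sequence; '*' marks a stop codon."""
    sequence = coding_sequence.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; trailing bases are ignored
    for start in range(0, len(sequence) - 2, 3):
        codon = sequence[start:start + 3]
        protein.append(CODON_TABLE.get(codon, "X"))  # 'X' for unrecognised codons
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```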
General feature format (.gff)
describes gene models
5’ 3’
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
.gff format
Similar formats
GFF2
GTF
GFF3
9 required fields
https://en.wikipedia.org/wiki/General_feature_format
.gff format
Row = feature
Each row has 9 fields
Column 1: Sequence ID
Column 2: Source
Column 3: Type of feature
Column 4: Feature start site
Column 5: Feature stop site
Column 6: Score/confidence
Column 7: Strand the feature is encoded on
Column 8: Phase (0, 1, or 2); only present for protein-encoding features
Column 9: Other attributes
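The nine columns can be sketched as a small parser for one tab-separated GFF3 row. The class and function names (`GffFeature`, `parse_gff3_line`) are hypothetical, the example row uses made-up identifiers, and the attribute handling assumes simple `key=value` pairs separated by semicolons without escaped characters.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GffFeature:
    seqid: str                 # column 1: sequence ID
    source: str                # column 2: source
    feature_type: str          # column 3: type of feature
    start: int                 # column 4: feature start site
    end: int                   # column 5: feature stop site
    score: Optional[float]     # column 6: score/confidence ('.' if absent)
    strand: str                # column 7: strand
    phase: Optional[int]       # column 8: phase ('.' if not protein-encoding)
    attributes: Dict[str, str] # column 9: other attributes


def parse_gff3_line(line: str) -> GffFeature:
    """Split one GFF3 feature row into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"Expected 9 tab-separated fields, found {len(fields)}")
    seqid, source, feature_type, start, end, score, strand, phase, raw_attrs = fields
    attributes: Dict[str, str] = {}
    for pair in raw_attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffFeature(
        seqid=seqid,
        source=source,
        feature_type=feature_type,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        phase=None if phase == "." else int(phase),
        attributes=attributes,
    )


if __name__ == "__main__":
    # Hypothetical row, for illustration only
    row = "chrDemo\texample\tCDS\t100\t250\t.\t+\t0\tID=cds1;Parent=mrna1"
    print(parse_gff3_line(row))
```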
.gtf format
gencode.v33.annotation.sorted.gtf
Downloaded from GENCODE and viewed on the command line using the command:
$ head -n 20 gencode.v33.annotation.sorted.gtf | cut -c 1-15
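GTF uses the same nine columns as GFF, but the ninth column is written as `key "value";` pairs rather than `key=value`. The sketch below parses that attribute column. The function name `parse_gtf_attributes` is hypothetical, the example string uses made-up values, and the simple split assumes no semicolons inside quoted values.

```python
from typing import Dict, List


def parse_gtf_attributes(column9: str) -> Dict[str, List[str]]:
    """Parse the GTF attribute column into {key: [values]}.

    Values are collected in lists because a key can appear more than
    once on a single row.
    """
    attributes: Dict[str, List[str]] = {}
    for item in column9.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        # The key is the first word; the rest is the (usually quoted) value
        key, _, value = item.partition(" ")
        attributes.setdefault(key, []).append(value.strip().strip('"'))
    return attributes


if __name__ == "__main__":
    # Hypothetical attribute string, for illustration only
    example = 'gene_id "geneA"; gene_name "DEMO1"; tag "basic"; tag "canonical"; level 2;'
    print(parse_gtf_attributes(example))
```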
GenBank file format
.gb
.gbk
Sequence info
Many additional elements
https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Sequence archives
• Data types and file types
• Sequence archives
• Curated collections
http://www.insdc.org/
Controlled access repositories
Human data
Hosted at NCBI in Bethesda, Maryland, USA (near Washington, DC)
SRA Toolkit
https://www.ncbi.nlm.nih.gov/books/NBK158900/
Hosted at EBI in Cambridge, UK
Hosted at NIG in Mishima, Japan
Sequence Read Archive (SRA)
https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/
https://www.ncbi.nlm.nih.gov/sra/docs/sragrowth/
https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/
“Study” architecture in SRA (SRP#)
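The accession prefix encodes both the archive and the level in this hierarchy: the first letter indicates where the record was deposited (S for NCBI, E for EBI, D for DDBJ) and the third letter the object type (P study, S sample, X experiment, R run). The sketch below decodes a prefix. The function name `describe_accession` is hypothetical, the accession numbers in the example are made up, and no lookup against the archive is performed.

```python
import re
from typing import Optional, Tuple

ARCHIVES = {"S": "NCBI", "E": "EBI", "D": "DDBJ"}
LEVELS = {
    "A": "submission",
    "P": "study",
    "S": "sample",
    "X": "experiment",
    "R": "run",
}
ACCESSION_PATTERN = re.compile(r"^([SED])R([APSXR])(\d+)$")


def describe_accession(accession: str) -> Optional[Tuple[str, str]]:
    """Return (archive, level) for an SRA-style accession, or None if it
    does not follow the archive/level prefix pattern."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    archive_code, level_code, _number = match.groups()
    return ARCHIVES[archive_code], LEVELS[level_code]


if __name__ == "__main__":
    # Made-up accession numbers, for illustration only
    for accession in ["SRP000001", "SRX000002", "ERR000003", "DRS000004"]:
        print(accession, "->", describe_accession(accession))
```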
● Protein https://www.ncbi.nlm.nih.gov/protein/
● UniProt https://www.uniprot.org/
● neXtProt https://www.nextprot.org/
Searching for data
NCBI-GEO database
DRA search at DDBJ
OmicsDI (https://www.omicsdi.org/)
Meta search engine that searches multiple databases and repositories simultaneously.
SciCrunch (https://scicrunch.org/browse/datadashboard)
Reference genomes
Reference genomes: NCBI Assembly database
Human genome
Agreed reference genomes are important for all organisms that have been sequenced.
Features are mapped to nucleotide numbers on chromosomes.
Updates to the genome can alter the numbering.
The version of the genome that you are working in is critical to your analysis: it determines what other mapped data can be included.
Current human genome version is GRCh38.p14
‘p’ refers to “patch”, so “p14” is patch 14; patches don’t alter nucleotide numbering.
Reproducibility requires that details of the genome versions used in any analysis, and their sources, are described in detail in your methods and appropriately referenced/cited.
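A small sketch of the versioning point: because patches do not alter nucleotide numbering, two datasets can share coordinates when they use the same major build, whatever their patch level. The names `AssemblyVersion`, `parse_assembly_version` and `same_coordinates` are hypothetical, and the pattern only covers GRC-style names such as GRCh38.p14.

```python
import re
from typing import NamedTuple, Optional


class AssemblyVersion(NamedTuple):
    build: str            # e.g. "GRCh38"
    patch: Optional[int]  # e.g. 14, or None when no patch is stated


def parse_assembly_version(name: str) -> AssemblyVersion:
    """Split a GRC-style assembly name into its build and patch number."""
    match = re.fullmatch(r"(GRC[a-z]+\d+)(?:\.p(\d+))?", name.strip())
    if match is None:
        raise ValueError(f"Unrecognised assembly name: {name!r}")
    build, patch = match.groups()
    return AssemblyVersion(build, int(patch) if patch is not None else None)


def same_coordinates(first: str, second: str) -> bool:
    """True when both names refer to the same major build, so features
    mapped to one can be compared by position with the other."""
    return parse_assembly_version(first).build == parse_assembly_version(second).build


if __name__ == "__main__":
    print(parse_assembly_version("GRCh38.p14"))
    print(same_coordinates("GRCh38.p13", "GRCh38.p14"))
    print(same_coordinates("GRCh37.p13", "GRCh38.p14"))
```

Recording the full string, including the patch, in your methods is still worthwhile even though only the build affects coordinates.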
GENCODE annotation files
Curated collections
• File types
• Sequence archives
• Curated collections
Curated databases enable:
● Comparative genomics
○ Orthologs
○ Protein families
○ Evolutionary conservation
● Functional genomics
○ Homologous genes/proteins
○ Co-expression analysis
○ Phenotype (knockout studies)
○ Disease associations
○ Interactions with genes/proteins
○ Pathway analysis
EnteroBase
A Powerful, User-Friendly Online Resource for Analyzing and Visualizing Genomic Variation within Enteric Bacteria
Tutorials
https://enterobase.readthedocs.io/en/latest/enterobase-tutorials/tutorials.html
User’s guide
https://genome.cshlp.org/content/early/2019/12/05/gr.251678.119
PlasmoDB
Virus Pathogen Resource (ViPR)
Model organism databases
● Drosophila http://flybase.org/
● Mouse http://www.informatics.jax.org/
● Rat https://www.rgd.mcw.edu/
● Yeast https://www.yeastgenome.org/
● C. elegans https://wormbase.org/
Ensembl database
Expression data: Gemma
Over 14,977 curated expression studies
2021 publication
Expression data: GREIN
Selected large projects
Focused on human genomics and functional genomics
https://www.encodeproject.org/
gnomAD
https://gnomad.broadinstitute.org/
https://www.gtexportal.org/home/
Some focused smaller projects
Neuroscience: Allen Brain Map (https://portal.brain-map.org/)
Immunology: Immunological Genome Project (https://www.immgen.org/)
Genomics: 100,000 Genomes Project (https://www.genomicsengland.co.uk/initiatives/100000-genomes-project)
Interferome (http://www.interferome.org/interferome/home.jspx)
https://academic.oup.com/nar
Data mining overview
● Define your research question
● Plan your data search
● Explore the area
● Research the domain
● Gather appropriate tools
● Curate the data
● Filter for quality
● Practice
● Stay on task (focus)
● Learn from mistakes
● Validate your findings
● Annotate your code
● Store all your files safely
● Reproducible research
● Share your findings
Summary
Data types (raw, derived/processed, metadata, curated)
Common file types (.fastq, .fasta, .gff, .gbk)
Importance of reference genomes (version number and patches) and annotation files
• mapping diverse types of features to nucleotides in genomes
The main sequence archives and some smaller ones
• Architecture of SRA sequence archive for deposit and retrieval
Variety of different ways to search for sequence data that may be of interest
• federated search engines
• Standardized processed data
Diversity of curated data resources
• Bringing together different data types from different sources to facilitate one particular area of interest.
Exponential growth of available sequence data
• Many new questions can be, and are being, asked of existing available data