IIITD

Outline
● Introduction to Biological Databases
● List of Cancer Databases
● The Cancer Genome Atlas (TCGA)
● GDC portal
● Data accessibility
● Dataset retrieval (TCGA-Assembler)
● Q&A

What do we need to know?
• What is a database, and what are the features of an ideal database?
• What are the relationships and differences between primary and derived sequence databases?
• What are databases used for?
• Why is data integration useful?

What are Databases?
• A structured collection of information.
• Consists of basic units called records or entries.
• Each record consists of fields, which hold pre-defined data related to the record.
• For example, a protein database would have protein entries as records and protein properties as fields (e.g., protein name, length, amino-acid sequence).

Key Features of an Ideal DB
• Comprehensive, but easy to search.
• A simple, easy-to-understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data.

Types of Molecular DB
• Primary Databases
  o Original submissions by experimentalists
  o Content controlled by the submitter
  o Examples: GenBank, Trace, SRA, SNP, GEO
• Derivative Databases
  o Derived from primary data
  o Content controlled by a third party
  o Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO datasets, UniGene, HomoloGene, Structure, Conserved Domain

Sequence Databases at NCBI
• Primary
  o GenBank: NCBI's primary sequence database
  o Trace Archive: reads from capillary sequencers
  o Sequence Read Archive: next-generation sequencing data
• Derivative
  o GenPept (GenBank translations)
  o Protein (UniProt/Swiss-Prot, PDB)
  o NCBI Reference Sequences (RefSeq)

List of Databases for Oncogenomic Research

TCGA (The Cancer Genome Atlas)
o TCGA comprises the following data types:
o BiospecimenClinicalData, CNAData, MethylationData, miRNASeqData, RNASeqData, RPPAData, SomaticMutationData

Levels of datasets in TCGA
• Level 1: raw data, controlled access.
• Level 2: processed data, controlled access.
• Level 3: segmented or interpreted data, open access.
• Level 4: region-of-interest data, open access.
• While the TCGA data portal provided level 1 to 3 data, Firehose provides only levels 3 and 4.

Cancer Dataset Retrieval

FIREHOSE-GDAC
https://gdac.broadinstitute.org
Download the desired file. Here we take mRNA-seq as the example.

TCGA-ASSEMBLER
Go to the link; the folder has all the supporting files.
Click the download link to get them.

COMMANDS
rm(list = ls())        # Clear workspace
source("Module_A.R")   # Load Module A functions
source("Module_B.R")   # Load Module B functions
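
The source() calls above fail with a fairly terse error if R is not started in the folder that holds the TCGA-Assembler files. A minimal sketch of a guard, assuming both module files sit in one directory: load_modules is a hypothetical helper written for this illustration and is not part of TCGA-Assembler.

```r
# Hypothetical helper (NOT part of TCGA-Assembler): source the module files
# only if every one of them is present in `dir`.
load_modules <- function(dir = ".", modules = c("Module_A.R", "Module_B.R")) {
  paths <- file.path(dir, modules)
  missing <- modules[!file.exists(paths)]
  if (length(missing) > 0) {
    # Stop early with a message that names the files that were not found
    stop("Missing module file(s) in '", dir, "': ",
         paste(missing, collapse = ", "))
  }
  # source() evaluates in the global environment by default, so the
  # module functions end up in the workspace as with a direct call
  for (p in paths) source(p)
  invisible(paths)
}

# Usage (run from the folder that contains the module files):
# load_modules()
```

Checking all files before sourcing any of them avoids a half-loaded workspace in which Module A is available but Module B is not.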