Cancer Databases

Uploaded by

Alexis Torres
Cancer Databases and Retrieval

~ Dilraj Kaur, PhD


IIITD
Outline
● Introduction to Biological Databases
● List of Cancer databases
● The Cancer Genome Atlas (TCGA)
● GDC portal
● Data accessibility
● Dataset retrieval (TCGA-Assembler)
● Q&A
What do we need to know?
• What is a database, and what are the features of an ideal database?
• What are the relationships/differences between primary and
derived sequence databases?
• What are databases used for?
• Why is data integration useful?
What are Databases?
• Structured collection of information.
• Consists of basic units called records or entries.
• Each record consists of fields, which hold pre-defined data
related to the record.
• For example, a protein database would have protein entries as
records and protein properties as fields (e.g., name of protein,
length, amino-acid sequence)
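The record/field idea can be sketched in a few lines of R. This is only an illustrative toy, not a real database: the protein names and sequences below are made up, and `get_record` is a hypothetical helper introduced for this example. Each row of the data frame is a record, and each column is a field.

```r
# Toy protein "database": one row per record, one column per field.
# Names and sequences are invented for illustration only.
proteins <- data.frame(
  name     = c("protein_A", "protein_B", "protein_C"),
  sequence = c("MKTAYIAK", "MVLSPADKTN", "MGSSHH"),
  stringsAsFactors = FALSE
)

# Derive the "length" field from the "sequence" field
proteins$length <- nchar(proteins$sequence)

# Hypothetical helper: retrieve the record whose name field matches
get_record <- function(db, protein_name) {
  db[db$name == protein_name, ]
}

rec <- get_record(proteins, "protein_B")
print(rec)
```

Looking up a record by one of its fields, as `get_record` does, is the simplest form of the retrieval that real databases offer through their search interfaces.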
Key Features of an ideal DB
• Comprehensive, but easy to search.
• A simple, easy to understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data.
Types of Molecular DB
• Primary Databases
  • Original submissions by experimentalists
  • Content controlled by the submitter
  • Examples: GenBank, Trace Archive, SRA, dbSNP, GEO
• Derivative Databases
  • Derived from primary data
  • Content controlled by a third party
  • Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO DataSets,
    UniGene, HomoloGene, Structure, Conserved Domain
Sequence Databases at NCBI
• Primary
  • GenBank: NCBI's primary sequence database
  • Trace Archive: reads from capillary sequencers
  • Sequence Read Archive: next-generation sequencing data
• Derivative
  • GenPept (GenBank translations)
  • Protein (UniProt/Swiss-Prot, PDB)
  • NCBI Reference Sequences (RefSeq)
List of Databases for Oncogenomic Research
TCGA (The Cancer Genome Atlas)
o TCGA comprises the following data types:
o BiospecimenClinicalData, CNAData, MethylationData, miRNASeqData, RNASeqData, RPPAData, SomaticMutationData
Data levels in TCGA
• Level 1: raw data, controlled access.
• Level 2: processed data, controlled access.
• Level 3: segmented or interpreted data, open access.
• Level 4: region-of-interest data, open access.
• The TCGA data portal provided Level 1 to 3 data, whereas
Firehose provides only Level 3 and 4.
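As a rough sketch, the level scheme above can be written down as a small lookup table in R. The table only restates what the slide says; the object names (`tcga_levels`, `source_levels`) and the helper `levels_available` are hypothetical and exist only for this example.

```r
# Lookup table restating the TCGA data levels described above
tcga_levels <- data.frame(
  level       = 1:4,
  description = c("Raw data",
                  "Processed data",
                  "Segmented or interpreted data",
                  "Region of interest data"),
  access      = c("controlled", "controlled", "open", "open"),
  stringsAsFactors = FALSE
)

# Levels offered by each source, as stated on the slide
source_levels <- list(tcga_data_portal = 1:3, firehose = 3:4)

# Hypothetical helper: rows of the table available from a given source
levels_available <- function(source_name) {
  tcga_levels[tcga_levels$level %in% source_levels[[source_name]], ]
}

print(levels_available("firehose"))
```

Under this description, everything obtainable from Firehose is open access, which is why it is a convenient starting point for retrieval.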
Cancer Dataset Retrieval
FIREHOSE-GDAC
https://gdac.broadinstitute.org
Download the desired file. Here we take the mRNASeq data as the example.
TCGA-ASSEMBLER
Go to the link; the folder has all the supporting files.
Click the download button to download them.
COMMANDS

rm(list = ls())        # Clear the workspace

source("Module_A.R")   # Load Module A (data download) functions
source("Module_B.R")   # Load Module B (data processing) functions

# Download normalized gene expression data for LIHC
RNASeqRawData <- DownloadRNASeqData(saveFolderName = "./output path",
                                    cancerType = "LIHC",
                                    assayPlatform = "gene.normalized_RNAseq")

# Process the first downloaded file into a gene expression table
GeneExpData <- ProcessRNASeqData(inputFilePath = RNASeqRawData[1],
                                 outputFileName = "LIHC__illuminahiseq_rnaseqv2__GeneExp",
                                 outputFileFolder = "./output path",
                                 dataType = "geneExp",
                                 verType = "RNASeqV2")
UCEC-Clinical data
UCEC-RNAseq File
THANK YOU
