Cancer Databases

Uploaded by

Alexis Torres
Copyright
© All Rights Reserved
Cancer Databases and Retrieval

~ Dilraj Kaur, PhD
IIITD
Outline
● Introduction to Biological Databases
● List of Cancer databases
● The Cancer Genome Atlas (TCGA)
● GDC portal
● Data accessibility
● Dataset retrieval (TCGA-Assembler)
● Q&A
What do we need to know?
• What is a database, and what are the features of an ideal database?
• What are the relationships and differences between primary and derived sequence databases?
• What are databases used for?
• Why is data integration useful?
What are Databases?
• Structured collection of information.
• Consists of basic units called records or entries.
• Each record consists of fields, which hold pre-defined data
related to the record.
• For example, a protein database would have protein entries as
records and protein properties as fields (e.g., name of protein,
length, amino-acid sequence)
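As a minimal sketch of the record/field idea, the R snippet below models a tiny protein database as a data frame: each row is a record and each column is a field. The two entries and the helper name `get_record` are made up for illustration; they are not taken from any real database.

```r
# Toy protein "database": one row per record, one column per field.
# The entries below are illustrative placeholders, not real database content.
protein_db <- data.frame(
  name     = c("toy_protein_A", "toy_protein_B"),
  sequence = c("MKTAYIAK", "MSDNE"),
  stringsAsFactors = FALSE
)

# The length field is derived from the sequence field.
protein_db$length <- nchar(protein_db$sequence)

# Hypothetical helper: retrieve a single record by its name field.
get_record <- function(db, protein_name) {
  db[db$name == protein_name, , drop = FALSE]
}

record <- get_record(protein_db, "toy_protein_A")
print(record)
```

Keeping the fields pre-defined (name, sequence, length) is what makes every record searchable in the same way.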
Key Features of an Ideal Database
• Comprehensive, but easy to search.
• A simple, easy-to-understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data.
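Two of these features, cross-referencing and minimum redundancy, can be sketched in a few lines of R. The tables, identifiers, and gene symbols below are invented for illustration only; the snippet assumes nothing beyond base R.

```r
# Two toy tables that share an identifier field (all values are made up).
genes <- data.frame(
  gene_id = c("G1", "G2", "G2", "G3"),
  symbol  = c("geneA", "geneB", "geneB", "geneC"),
  stringsAsFactors = FALSE
)
proteins <- data.frame(
  protein_id = c("P1", "P2"),
  gene_id    = c("G1", "G3"),
  stringsAsFactors = FALSE
)

# Minimum redundancy: drop exact duplicate records.
genes_unique <- genes[!duplicated(genes), ]

# Cross-referencing: follow the shared gene_id from one table to the other.
linked <- merge(proteins, genes_unique, by = "gene_id")
print(linked)
```

The shared `gene_id` column plays the role of a cross-reference: it lets a protein record be resolved to its gene record without copying the gene fields into the protein table.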
Types of Molecular Databases
• Primary Databases
  o Original submissions by experimentalists
  o Content controlled by the submitter
  o Examples: GenBank, Trace, SRA, SNP, GEO
• Derivative Databases
  o Derived from primary data
  o Content controlled by a third party
  o Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO DataSets, UniGene, HomoloGene, Structure, Conserved Domain
Sequence Databases at NCBI
• Primary
  o GenBank: NCBI's primary sequence database
  o Trace Archive: reads from capillary sequencers
  o Sequence Read Archive: next-generation sequencing data
• Derivative
  o GenPept (GenBank translations)
  o Protein (UniProt/Swiss-Prot, PDB)
  o NCBI Reference Sequences (RefSeq)
List of Databases for Oncogenomic Research
TCGA (The Cancer Genome Atlas)
o TCGA comprises the following data types:
o BiospecimenClinicalData, CNAData, MethylationData,
o miRNASeqData, RNASeqData, RPPAData,
o SomaticMutationData
Data Levels in TCGA
• Level 1: raw data, controlled access.
• Level 2: processed data, controlled access.
• Level 3: segmented or interpreted data, open access.
• Level 4: region-of-interest data, open access.
• The TCGA data portal provided Level 1 to 3 data, whereas Firehose provides only Level 3 and 4.
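The level scheme can be written down as a small lookup table. The R sketch below encodes only what is stated above; the object names and the helper `open_levels_from` are hypothetical and chosen for illustration.

```r
# TCGA data levels as described above, stored as a small lookup table.
tcga_levels <- data.frame(
  level       = 1:4,
  description = c("raw", "processed", "segmented or interpreted",
                  "region of interest"),
  access      = c("controlled", "controlled", "open", "open"),
  stringsAsFactors = FALSE
)

# Levels each source offers, as stated above.
source_levels <- list(
  tcga_data_portal = 1:3,
  firehose         = 3:4
)

# Hypothetical helper: which open-access levels can a given source supply?
open_levels_from <- function(source_name) {
  offered <- source_levels[[source_name]]
  is_open <- tcga_levels$access == "open"
  tcga_levels$level[is_open & tcga_levels$level %in% offered]
}

print(open_levels_from("firehose"))
print(open_levels_from("tcga_data_portal"))
```

Under these assumptions, every level Firehose provides is open access, whereas only Level 3 of the data portal's range was open.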
Cancer Dataset Retrieval
FIREHOSE-GDAC
https://gdac.broadinstitute.org
Download the desired file.
Here we take the mRNA-seq data as an example.
TCGA-ASSEMBLER
Go to the link; the folder contains all the supporting files.
Click this to download.
COMMANDS

rm(list = ls()) # Clear workspace

source("Module_A.R") # Load Module A functions (data download)
source("Module_B.R") # Load Module B functions (data processing)

# Download normalized gene expression data for LIHC
RNASeqRawData <- DownloadRNASeqData(saveFolderName = "./output path",
                                    cancerType = "LIHC",
                                    assayPlatform = "gene.normalized_RNAseq")

# Process the first downloaded file into a gene expression matrix
GeneExpData <- ProcessRNASeqData(inputFilePath = RNASeqRawData[1],
                                 outputFileName = "LIHC__illuminahiseq_rnaseqv2__GeneExp",
                                 outputFileFolder = "./output path",
                                 dataType = "geneExp",
                                 verType = "RNASeqV2")
UCEC-Clinical data
UCEC-RNAseq File
THANK YOU
