R & NGS

Dr. G. Ramesh Kumar, PhD.,
AU-KBC Research Centre,
MIT, Anna University,
Chromepet, Chennai-44.
UNIT III
• Application of R in NGS analysis:
• 5 TOPICS
• Introduction to Bioconductor GR
• Reading of RNA-seq data (ShortRead, Rsamtools, GenomicRanges),
• annotation (biomaRt, genomeIntervals),
• reads coverage and assign counts (IRanges, GenomicFeatures),
• differential expression (DESeq).
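To make the "reads coverage and assign counts" topic concrete, here is a minimal base R sketch of the idea. All coordinates and gene names are made up, and the loop is only a toy stand-in for what packages such as IRanges and GenomicFeatures do far more efficiently on real data.

```r
# Toy reads as start/end coordinates on one chromosome (made-up values).
reads <- data.frame(start = c(1, 3, 5, 12), end = c(4, 6, 8, 15))
# Hypothetical gene intervals to count reads against.
genes <- data.frame(id = c("geneA", "geneB"), start = c(2, 10), end = c(7, 16))

# Per-base coverage: number of reads overlapping each position 1..chrom_len.
chrom_len <- 16
read_coverage <- integer(chrom_len)
for (i in seq_len(nrow(reads))) {
  pos <- reads$start[i]:reads$end[i]
  read_coverage[pos] <- read_coverage[pos] + 1L
}

# Count reads overlapping each gene (any overlap counts as a hit).
gene_counts <- vapply(seq_len(nrow(genes)), function(g) {
  sum(reads$start <= genes$end[g] & reads$end >= genes$start[g])
}, integer(1))
names(gene_counts) <- genes$id
```

The resulting per-gene count vector is the kind of input that a differential expression method works on.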
REF

• http://manuals.bioinformatics.ucr.edu/home/ht-seq#R_BACK
Application of R in NGS analysis
• R and Bioconductor packages are central to many applications in:
• Genome annotation and
• NGS analysis areas, such as
• RNA-Seq,
• ChIP-Seq and
• SNP-Seq.
Application of R in NGS analysis
• Seq2pathway: an R/Bioconductor package for pathway analysis of next-generation sequencing data
R
• In recent years the R language has become the lingua franca of data-intensive research, and it is now by far the most widely used data analysis programming language in bioinformatics.
• One of the outstanding strengths of the R language is the ease of programming extensions to automate the analysis and mining of almost any data type.
R
• The following topics will be introduced:
• (1) conditional executions,
• (2) loops,
• (3) writing custom functions,
• (4) calling external software,
• (5) running and debugging R programs, and
• (6) building custom R packages.
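A few of the listed topics can be sketched in base R. The function name `classify_gc`, the toy sequences, and the 0.5 threshold are illustrative assumptions, not part of the course material.

```r
# (3) A custom function that uses (1) a conditional.
classify_gc <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  gc <- mean(bases %in% c("G", "C"))  # fraction of G/C bases
  if (gc > 0.5) "GC-rich" else "AT-rich or balanced"
}

# (2) A loop over several toy sequences.
seqs <- c("GGCCGA", "ATATTA", "ACGT")
labels <- character(length(seqs))
for (i in seq_along(seqs)) {
  labels[i] <- classify_gc(seqs[i])
}

# (4) Calling external software: here the Rscript binary that ships with R,
# chosen only because it is present wherever R is installed.
rscript <- file.path(R.home("bin"), "Rscript")
out <- system2(rscript, c("-e", shQuote("cat(1 + 1)")), stdout = TRUE)
```

In practice the external call would be an aligner or another command-line tool; `system2` captures its standard output as a character vector.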
R
• R (http://www.r-project.org) is a versatile data analysis environment that has a broad application spectrum in all experimental and quantitative scientific areas.
• The associated Bioconductor project provides
access to over 700 R extension packages for
the analysis of modern biological and
biomedical data sets, such as next generation
sequences, comparative genomics, network
modeling and statistical analysis.
R
• The R software is free and runs on all common operating systems.
• The following topics will be covered:
• (1) command syntax,
• (2) basic functions,
• (3) data import/export,
• (4) data/object types,
• (5) graphical display,
• (6) usage of R packages/libraries (e.g. Bioconductor) and
• (7) using R for basic data analysis operations.
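Several of these topics fit in one short base R sketch. The count values, column names, and the pseudocount of 1 are made up for illustration.

```r
# (4) A data frame is a common object type for tabular data (toy values).
counts <- data.frame(gene = c("g1", "g2", "g3"),
                     control = c(10, 0, 25),
                     treated = c(30, 2, 24))

# (3) Export to a tab-delimited file, then import it again.
path <- tempfile(fileext = ".tsv")
write.table(counts, path, sep = "\t", quote = FALSE, row.names = FALSE)
imported <- read.delim(path, stringsAsFactors = FALSE)

# (7) A basic analysis step: log2 fold change with a pseudocount of 1.
imported$log2fc <- log2((imported$treated + 1) / (imported$control + 1))

# (5) Graphical display, written to a temporary PDF file.
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)
barplot(imported$log2fc, names.arg = imported$gene, ylab = "log2 fold change")
invisible(dev.off())
```

The pseudocount avoids division by zero for genes with no reads in one condition.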
Bioconductor
• https://en.wikipedia.org/wiki/Bioconductor
• Bioconductor is a free, open source and open
development software project for the analysis and
comprehension of genomic data generated by wet
lab experiments in molecular biology.
• https://www.bioconductor.org/
• Bioconductor provides tools for the analysis and
comprehension of high-throughput genomic data.
• Bioconductor uses the R statistical programming
language, and is open source and open
development.
Why Open Source
• so that you can find out what algorithm is being
used, and how it is being used
• so that you can modify these algorithms to try
out new ideas or to accommodate local
conditions or needs
• so you can read the code, find bugs, suggest
improvements etc.
• so that they can be used as components (potentially modified) in other people's software
Overview
• biology is a computational science
• problems of data analysis, data generation,
reproducibility require computational support and
computational solutions
• we value code reuse
– many of the tasks have already been solved
– if we use those solutions we can put effort into new
research
• well designed, self-describing data structures help us
deal with complex data
Goals
• Provide access to powerful statistical and graphical methods
for the analysis of genomic data.
• Facilitate the integration of biological metadata (GenBank,
GO, Entrez Gene, PubMed) in the analysis of experimental
data.
• Allow the rapid development of extensible, interoperable, and
scalable software.
• Promote high-quality documentation and reproducible
research.
• Provide training in computational and statistical methods.
Bioconductor packages
Release 2.10, 554 Software Packages!
• General infrastructure
Biobase, Biostrings, biocViews
• Annotation:
annotate, annaffy, biomaRt, AnnotationDbi, annotation data packages.
• Graphics/GUIs:
geneplotter, hexbin, limmaGUI, exploRase
• Pre-processing:
affy, affycomp, oligo, makecdfenv, vsn, gcrma, limma
• Differential gene expression:
genefilter, limma, ROC, siggenes, EBarrays, factDesign
• GSEA/Hypergeometric Testing
GSEABase, Category, GOstats, topGO
• Graphs and networks:
graph, RBGL, Rgraphviz
• Flow Cytometry:
flowCore, flowViz, flowUtils
• Protein Interactions:
ppiData, ppiStats, ScISI, Rintact
• Sequence Data:
Biostrings, ShortRead, rtracklayer, IRanges, GenomicFeatures, VariantAnnotation
• Other data:
xcms, DNAcopy, PROcess, aCGH, rsbml, SBMLR, Rdisop
Component software
• interesting problems will require the coordinated application of many different techniques
• thus we need integrated interoperable software
• of primary importance is well designed and shared data structures
Data complexity
• Dimensionality.
• Dynamic/evolving data: e.g., gene annotation, sequence,
literature.
• Multiple data sources and locations: in-house, WWW.
• Multiple data types: numeric, textual, graphical.
No longer just an n × p data matrix X!
We distinguish between biological metadata and
experimental metadata.
Experimental metadata
• when were the samples processed and how
• what arrays were used/what kits
• if size selection of some sort (e.g. fractionation for proteomics experiments) was used
• date the samples were run
• lane or chip information
• treatments
Biological metadata
• Biological attributes that can be applied to the
experimental data.
• E.g. for genes
– chromosomal location;
– gene annotation (Entrez Gene, GO);
– gene models
– relevant literature (PubMed)
• Biological metadata sets are large, evolving rapidly, and
typically distributed via the WWW.
• Tools: annotate, biomaRt, and
AnnotationDbi, GenomicFeatures packages,
and annotation data packages.
Annotation packages
annotate, annaffy, biomaRt, and AnnotationDbi
• Assemble and process genomic annotation data from public repositories.
• Build annotation data packages.
• Associate experimental data in real time to biological metadata from web databases such as GenBank, GO, KEGG, Entrez Gene, and PubMed.
• Process and store query results: e.g., search PubMed abstracts.
• Generate HTML reports of analyses.
Metadata package hgu95av2: mappings between different gene IDs for this chip. Example for AffyID 41046_s_at:
– SYMBOL: ZNF261
– GENENAME: zinc finger protein 261
– ENTREZID: 9203
– ACCNUM: X95808
– MAP: Xq13.1
– PMID: 10486218, 9205841, 8817323
– GO: GO:0003677, GO:0007275, GO:0016021
– + many other mappings
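The ID mappings shown for probe set 41046_s_at can be mimicked with a plain nested list. This is only a toy stand-in for a real annotation data package; the `lookup` helper is hypothetical and is not part of annotate or AnnotationDbi.

```r
# Mappings for one probe set, copied from the slide (toy stand-in for hgu95av2).
annotation <- list(
  "41046_s_at" = list(
    SYMBOL   = "ZNF261",
    GENENAME = "zinc finger protein 261",
    ENTREZID = "9203",
    ACCNUM   = "X95808",
    MAP      = "Xq13.1",
    PMID     = c("10486218", "9205841", "8817323"),
    GO       = c("GO:0003677", "GO:0007275", "GO:0016021")
  )
)

# Hypothetical helper: look up one annotation field for a vector of probe IDs.
# Unknown probe IDs yield NULL.
lookup <- function(probe_ids, field) {
  lapply(setNames(probe_ids, probe_ids), function(id) annotation[[id]][[field]])
}

symbols <- lookup("41046_s_at", "SYMBOL")
go_terms <- lookup("41046_s_at", "GO")
```

Note that some fields are one-to-one (SYMBOL) while others are one-to-many (GO, PMID), which is why the lookup returns a list rather than a flat vector.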
Sequence Annotation
• for a given gene:
– gene models
– sequence
– exon/intron boundaries
– location
– conservation
• often in the form of tracks
• it is important to keep track of the reference
genome being used
Vignettes
• Bioconductor developed a new documentation
paradigm, the vignette.
• A vignette is an executable document consisting of a
collection of documentation text and code chunks.
• Vignettes form dynamic, integrated, and reproducible
statistical documents that can be automatically
updated if either data or analyses are changed.
• Vignettes can be generated using the Sweave function, which lives in the utils package in current R releases (older releases shipped it in the tools package).
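As a rough illustration of the vignette idea, the sketch below writes a tiny Sweave document to a temporary directory, weaves it into a .tex file, and extracts its code with Stangle. The file names and the chunk content are made up; turning the .tex file into a PDF would additionally need a LaTeX installation, which is not attempted here.

```r
# Write a minimal Sweave (.Rnw) document: LaTeX text plus one R code chunk.
workdir <- tempfile("vignette")
dir.create(workdir)
rnw <- file.path(workdir, "toy.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The sum of the toy values is computed below.",
  "<<toysum>>=",
  "x <- c(1, 2, 3)",
  "sum(x)",
  "@",
  "\\end{document}"
), rnw)

# Run the code chunk and weave its output into a .tex file.
tex <- file.path(workdir, "toy.tex")
utils::Sweave(rnw, output = tex, quiet = TRUE)
tex_lines <- readLines(tex)

# Extract only the R code from the same document.
rfile <- file.path(workdir, "toy.R")
utils::Stangle(rnw, output = rfile, quiet = TRUE)
code_lines <- readLines(rfile)
```

Because the text and the code live in one file, re-running Sweave after changing the data or the analysis regenerates the document with updated results.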
Bioconductor Software
• concentrate development resources on a few important aspects
• Biobase: core classes and definitions that allow for succinct description and handling of the data
• annotate: generic functions for annotation that can be specialized
• genefilter/limma/DESeq/DEXSeq: differential expression
• ShortRead/IRanges/GenomicFeatures/VariantAnnotation: string manipulations, sequence analysis
Quality Assessment
• ensuring that the data are of sufficient quality
is an essential first step
• arrayQualityMetrics: comprehensive QA assessment of microarrays (one color or two color)
– modifications are coming to make it more suitable
for sequence data
• ShortRead: tools for QA of short reads,
primarily Illumina
Biobase:ExpressionSet
• software should help organize and manipulate your
data
• the data need to be assembled correctly once, and
then they can be processed, subset etc without
worrying about them
• we developed the ExpressionSet class
• SummarizedExperiment class is the next iteration in
this process (in the GenomicRanges package)
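The point about assembling the data correctly once can be sketched with a toy container in base R. This is not Biobase's ExpressionSet class; `make_eset` and `subset_samples` are hypothetical helpers that only illustrate why keeping expression values and sample metadata in one validated object is useful.

```r
# Toy expression matrix: rows are features (genes), columns are samples.
exprs_mat <- matrix(c(5.1, 7.2, 6.3, 8.4, 2.5, 3.6), nrow = 2,
                    dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
# Sample metadata: one row per column of the matrix.
pheno <- data.frame(treatment = c("ctrl", "drug", "drug"),
                    row.names = c("s1", "s2", "s3"))

# Hypothetical constructor: validate once that data and metadata line up.
make_eset <- function(exprs, pheno) {
  stopifnot(identical(colnames(exprs), rownames(pheno)))
  list(exprs = exprs, pheno = pheno)
}

# Hypothetical subsetting helper: keeps matrix and metadata in sync.
subset_samples <- function(eset, keep) {
  make_eset(eset$exprs[, keep, drop = FALSE], eset$pheno[keep, , drop = FALSE])
}

eset <- make_eset(exprs_mat, pheno)
drug_only <- subset_samples(eset, eset$pheno$treatment == "drug")
```

Subsetting through the helper means the expression columns and the metadata rows can never drift out of step, which is the guarantee the real classes provide.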
Microarray data analysis
[Workflow diagram]
• Input and pre-processing: CEL, CDF files (affy, vsn) or .gpr, .Spot files (marray, limma, vsn)
• Pre-processing produces an ExpressionSet
• Differential expression: edd, genefilter, limma, multtest, ROC, + CRAN
• Graphs & networks: graph, RBGL, Rgraphviz
• Cluster analysis: CRAN packages class, cluster, MASS, mva
• Prediction: CRAN packages class, e1071, ipred, LogitBoost, MASS, nnet, randomForest, rpart
• Annotation: annotate, annaffy, biomaRt, + metadata packages
• Graphics: geneplotter, hexbin, + CRAN
Differential Expression
• limma: provides a linear models interface for
DE
– uses a moderated variance
– a variety of p-value correction methods are
provided
• DESeq and edgeR: for sequence data
– similar approach to limma
– make use of count data (negative binomial distribution)
• DEXSeq for exon level differential expression
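Two ideas from this slide can be illustrated in base R without any Bioconductor package: why count data are modeled with a negative binomial distribution rather than a Poisson, and what p-value correction does. The mean, dispersion parameter, and raw p-values below are made up; this is not how limma, DESeq, or edgeR are actually called.

```r
set.seed(1)

# Negative binomial counts: variance = mu + mu^2 / size, so larger than the mean.
mu <- 100
size <- 5  # smaller size means more overdispersion
nb_counts <- rnbinom(10000, mu = mu, size = size)
pois_counts <- rpois(10000, lambda = mu)  # Poisson: variance equals the mean
theoretical_var <- mu + mu^2 / size       # 2100 for these settings

# Multiple-testing correction on made-up raw p-values.
raw_p <- c(0.001, 0.008, 0.039, 0.041, 0.6)
bh <- p.adjust(raw_p, method = "BH")            # controls the false discovery rate
bonf <- p.adjust(raw_p, method = "bonferroni")  # controls the family-wise error rate
```

Comparing `var(nb_counts)` with `var(pois_counts)` shows the extra variability that replicate sequencing counts typically display, and comparing `bh` with `bonf` shows that the Benjamini-Hochberg adjustment is less conservative than Bonferroni.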
Machine Learning
• Software for machine learning has been written by many
different people
– the calling sequences and return values are unique to each
method
• MLInterfaces
• provides uniform calling sequences and return values for
all machine learning algorithms
• MLearn is the main wrapper function
– methods, e.g. knnI, are passed to the wrapper
• return values are of class MLOutput
• see the MLInterfaces vignette for more details
Publications
• Bioconductor: Open software development for computational biology and bioinformatics, Genome Biology 2004, 5:R80, http://genomebiology.com/2004/5/10/R80
• Bioinformatics and Computational Biology Solutions
using R and Bioconductor, Springer, 2005, R.
Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit
eds.
• Bioconductor Case Studies, Springer
• R Programming for Bioinformatics, Chapman Hall
Comprehensive R Archive Network
• CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R.
• https://cran.r-project.org/
