- Bioinformatics Toolbox User's Guide-MathWorks (2022)
User's Guide
R2022b
How to Contact MathWorks
Phone: 508-647-7000
Contents

1 Getting Started
  Bioinformatics Toolbox Product Description (page 1-2)
    Key Features (page 1-2)

2 High-Throughput Sequence Analysis
  Work with Next-Generation Sequencing Data (page 2-2)
    Overview (page 2-2)
    What Files Can You Access? (page 2-2)
    Before You Begin (page 2-3)
    Create a BioIndexedFile Object to Access Your Source File (page 2-3)
    Determine the Number of Entries Indexed By a BioIndexedFile Object (page 2-3)
    Retrieve Entries from Your Source File (page 2-4)
    Read Entries from Your Source File (page 2-4)

3 Sequence Analysis
  Exploring a Nucleotide Sequence Using Command Line (page 3-2)
    Overview of Example (page 3-2)
    Searching the Web for Sequence Information (page 3-2)
    Reading Sequence Information from the Web (page 3-4)
    Determining Nucleotide Composition (page 3-5)
    Determining Codon Composition (page 3-8)
    Open Reading Frames (page 3-11)
    Amino Acid Conversion and Composition (page 3-13)
  Aligning Pairs of Sequences (page 3-177)

4 Microarray Analysis
  Managing Gene Expression Data in Objects (page 4-2)
  Visualizing Microarray Data (page 4-74)
  Analyzing Affymetrix SNP Arrays for DNA Copy Number Variants (page 4-142)

5 Phylogenetic Analysis
  Using the Phylogenetic Tree App (page 5-2)
    Overview of the Phylogenetic Tree App (page 5-2)
    Opening the Phylogenetic Tree App (page 5-2)
    File Menu (page 5-3)
    Tools Menu (page 5-11)
    Window Menu (page 5-17)
    Help Menu (page 5-18)

  Genetic Algorithm Search for Features in Mass Spectrometry Data (page 6-71)

1
Getting Started
Bioinformatics Toolbox provides algorithms and apps for Next Generation Sequencing (NGS),
microarray analysis, mass spectrometry, and gene ontology. Using toolbox functions, you can read
genomic and proteomic data from standard file formats such as SAM, FASTA, CEL, and CDF, as well
as from online databases such as the NCBI Gene Expression Omnibus and GenBank®. You can explore
and visualize this data with sequence browsers, spatial heatmaps, and clustergrams. The toolbox also
provides statistical techniques for detecting peaks, imputing values for missing data, and selecting
features.
You can combine toolbox functions to support common bioinformatics workflows. You can use ChIP-
Seq data to identify transcription factors; analyze RNA-Seq data to identify differentially expressed
genes; identify copy number variants and SNPs in microarray data; and classify protein profiles using
mass spectrometry data.
Key Features
• Next Generation Sequencing analysis and browser
• Sequence analysis and visualization, including pairwise and multiple sequence alignment and
peak detection
• Microarray data analysis, including reading, filtering, normalizing, and visualization
• Mass spectrometry analysis, including preprocessing, classification, and marker identification
• Phylogenetic tree analysis
• Graph theory functions, including interaction maps, hierarchy plots, and pathways
• Data import from genomic, proteomic, and gene expression files, including SAM, FASTA, CEL, and
CDF, and from databases such as NCBI and GenBank
Product Overview
Features
The Bioinformatics Toolbox product extends the MATLAB® environment to provide an integrated
software environment for genome and proteome analysis. Scientists and engineers can answer
questions, solve problems, prototype new algorithms, and build applications for drug discovery and
design, genetic engineering, and biological research. An introduction to these features will help you
to develop a conceptual model for working with the toolbox and your biological data.
The Bioinformatics Toolbox product includes many functions to help you with genome and proteome
analysis. Most functions are implemented in the MATLAB programming language, with the source
available for you to view. This open environment lets you explore and customize the existing toolbox
algorithms or develop your own.
You can use the basic bioinformatic functions provided with this toolbox to create more complex
algorithms and applications. These robust and well-tested functions are the functions that you would
otherwise have to create yourself.
• Data formats and databases — Connect to Web-accessible databases containing genomic and
proteomic data. Read and convert between multiple data formats.
• High-throughput sequencing — Gene expression and transcription factor analysis of next-
generation sequencing data, including RNA-Seq and ChIP-Seq.
• Sequence analysis — Determine the statistical characteristics of a sequence, align two
sequences, and multiply align several sequences. Model patterns in biological sequences using
hidden Markov model (HMM) profiles.
• Phylogenetic analysis — Create and manipulate phylogenetic tree data.
• Microarray data analysis — Read, normalize, and visualize microarray data.
• Mass spectrometry data analysis — Analyze and enhance raw mass spectrometry data.
• Statistical learning — Classify and identify features in data sets with statistical learning tools.
• Programming interface — Use other bioinformatic software (BioPerl and BioJava) within the
MATLAB environment.
The field of bioinformatics is rapidly growing and will become increasingly important as biology
becomes a more analytical science. The toolbox provides an open environment that you can customize
for development and deployment of the analytical tools you will need.
• Prototype and develop algorithms — Prototype new ideas in an open and extensible
environment. Develop algorithms using efficient string processing and statistical functions, view
the source code for existing functions, and use the code as a template for customizing, improving,
or creating your own functions. See “Prototyping and Development Environment” on page 1-17.
• Visualize data — Visualize sequences and alignments, gene expression data, phylogenetic trees,
mass spectrometry data, protein structure, and relationships between data with interconnected
graphs. See “Data Visualization” on page 1-18.
• Share and deploy applications — Use an interactive GUI builder to develop a custom graphical
front end for your data analysis programs. Create standalone applications that run separately
from the MATLAB environment.
Expected Users
The Bioinformatics Toolbox product is intended for computational biologists and research scientists
who need to develop new algorithms or implement published ones, visualize results, and create
standalone applications.
While the toolbox includes many bioinformatic functions, it is not intended to be a complete set of
tools for scientists to analyze their biological data. However, the MATLAB environment is ideal for
rapidly designing and prototyping the tools you need.
Data Formats and Databases
Web-based databases — You can directly access public databases on the Web and copy sequence
and gene expression information into the MATLAB environment.
The sequence databases currently supported are GenBank (getgenbank), GenPept (getgenpept),
European Molecular Biology Laboratory (EMBL) (getembl), and Protein Data Bank (PDB) (getpdb).
You can also access data from the NCBI Gene Expression Omnibus (GEO) Web site by using a single
function (getgeodata).
Get multiply aligned sequences (gethmmalignment), hidden Markov model profiles (gethmmprof),
and phylogenetic tree data (gethmmtree) from the PFAM database.
Gene Ontology database — Load the database from the Web into a gene ontology object (geneont).
Select sections of the ontology with methods for the geneont object (getancestors,
getdescendants, getmatrix, getrelatives), and manipulate data with utility functions
(goannotread, num2goid).
Read data from instruments — Read data generated from gene sequencing instruments (scfread,
joinseq, traceplot), mass spectrometers (jcampread), and Agilent® microarray scanners
(agferead).
Reading data formats — The toolbox provides a number of functions for reading data from common
bioinformatic file formats.
Writing data formats — The functions for getting data from the Web include the option to save the
data to a file. In addition, a dedicated function (fastawrite) writes data to a file in the FASTA
format.
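As a minimal sketch of the read and write functions mentioned above, the following round trip writes one short sequence to a FASTA file with fastawrite and reads it back with fastaread. The header text, the sequence, and the temporary file name are made up for illustration.

```matlab
% Illustrative round trip; the header and sequence are invented for this sketch.
tmpFile = [tempname '.fa'];                    % temporary file name (assumption)
fastawrite(tmpFile, 'demo_seq', 'ACGTACGT');   % write one FASTA record
[hdr, seq] = fastaread(tmpFile);               % read the header and sequence back
delete(tmpFile)                                % clean up the temporary file
```

With a single record in the file, fastaread returns the header and the sequence as character vectors.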
BLAST searches — Request Web-based BLAST searches (blastncbi), get the results from a search
(getblast) and read results from a previously saved BLAST formatted report file (blastread).
The MATLAB environment has built-in support for other industry-standard file formats including
Microsoft® Excel® and comma-separated-value (CSV) files. Additional functions perform ASCII and
low-level binary I/O, allowing you to develop custom functions for working with any data format.
See Also
More About
• “High-Throughput Sequencing”
• “Microarray Analysis”
• “Sequence Analysis”
• “Structural Analysis”
• “Mass Spectrometry and Bioanalytics”
Sequence Alignments
You can select from a list of analysis methods to compare nucleotide or amino acid sequences using
pairwise or multiple sequence alignment functions.
Multiple sequence profiles — Implementations for multiple alignment and profile hidden Markov
model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign,
hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).
Biological codes — Look up the letters or numeric equivalents for commonly used biological codes
(aminolookup, baselookup, geneticcode, revgeneticcode).
See Also
More About
• “Sequence Utilities and Statistics” on page 1-8
• “Sequence Analysis”
• “Data Formats and Databases” on page 1-5
Sequence Utilities and Statistics
Sequence conversion and manipulation — The toolbox provides routines for common operations,
such as converting DNA or RNA sequences to amino acid sequences, that are basic to working with
nucleic acid and protein sequences (aa2int, aa2nt, dna2rna, rna2dna, int2aa, int2nt, nt2aa,
nt2int, seqcomplement, seqrcomplement, seqreverse).
You can manipulate your sequence by performing an in silico digestion with restriction endonucleases
(restrict) and proteases (cleave).
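The conversion routines listed above can be tried on a short sequence. This is only a sketch: the nine-base DNA string is a made-up example, and the comments describe what each call is expected to return for it.

```matlab
dna = 'ATGGCCTAA';               % hypothetical coding sequence (Met, Ala, stop)

rna     = dna2rna(dna);          % replace T with U
revComp = seqrcomplement(dna);   % reverse complement of the DNA strand
protein = nt2aa(dna);            % translate with the standard genetic code
codes   = nt2int('ACGT');        % integer codes for the four bases
```

Working with the integer codes from nt2int is convenient when a sequence has to be stored or compared as numeric data.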
Sequence utilities — Determine a consensus sequence from a set of multiply aligned amino acid or
nucleotide sequences (seqconsensus), or a sequence profile (seqprofile). Format a sequence for
display (seqdisp) or graphically show a sequence alignment with frequency data (seqlogo).
Additional MATLAB functions efficiently handle string operations with regular expressions (regexp,
seq2regexp) to look for specific patterns in a sequence and search through a library for string
matches (seqmatch).
Look for possible cleavage sites in a DNA/RNA sequence by searching for palindromes
(palindromes).
See Also
More About
• “Sequence Alignments” on page 1-7
• “Sequence Analysis”
• “Protein and Amino Acid Sequence Analysis”
• “Data Formats and Databases” on page 1-5
Protein Property Analysis
Amino acid sequence utilities — Calculate amino acid statistics for a sequence (aacount) and get
information about character codes (aminolookup).
See Also
More About
• “Protein and Amino Acid Sequence Analysis”
• “Structural Analysis”
Phylogenetic Analysis
Phylogenetic analysis is the process you use to determine the evolutionary relationships between
organisms. The results of an analysis can be drawn in a hierarchical diagram called a cladogram or
phylogram (phylogenetic tree). The branches in a tree are based on the hypothesized evolutionary
relationships (phylogeny) between organisms. Each member in a branch, also known as a
monophyletic group, is assumed to be descended from a common ancestor. Originally, phylogenetic
trees were created using morphology, but now, determining evolutionary relationships includes
matching patterns in nucleic acid and protein sequences. The Bioinformatics Toolbox provides the
following data structure and functions for phylogenetic analysis.
Phylogenetic tree data — Read and write Newick-formatted tree files (phytreeread,
phytreewrite) into the MATLAB Workspace as phylogenetic tree objects (phytree).
Create a phylogenetic tree — Calculate the pairwise distance between biological sequences
(seqpdist), estimate the substitution rates (dnds, dndsml), build a phylogenetic tree from pairwise
distances (seqlinkage, seqneighjoin, reroot), and view the tree in an interactive GUI that
allows you to view, edit, and explore the data (phytreeviewer or view). This GUI also allows you to
prune branches, reorder, rename, and explore distances.
Phylogenetic tree object methods — You can access the functionality of the phytreeviewer user
interface using methods for a phylogenetic tree object (phytree). Get property values (get) and
node names (getbyname). Calculate the patristic distances between pairs of leaf nodes (pdist,
weights) and draw a phylogenetic tree object in a MATLAB Figure window as a phylogram,
cladogram, or radial treeplot (plot). Manipulate tree data by selecting branches and leaves using a
specified criterion (select, subtree) and removing nodes (prune). Compare trees (getcanonical)
and use Newick-formatted strings (getnewickstr).
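The workflow described above (pairwise distances, tree building, and tree object methods) can be sketched on a toy data set. The three sequences and the taxon names are invented for illustration, and the p-distance method is chosen here only because it is easy to check by hand.

```matlab
% Three short made-up DNA sequences of equal length (illustration only).
seqs  = {'ACGTACGTAC', 'ACGTACGTTC', 'TCGAACGTTC'};
names = {'taxonA', 'taxonB', 'taxonC'};

% Pairwise distances as the proportion of differing sites.
dist = seqpdist(seqs, 'Method', 'p-distance', 'Alphabet', 'NT');

% Build a tree from the distances by average linkage and label the leaves.
tree = seqlinkage(dist, 'average', names);

numLeaves = get(tree, 'NumLeaves');   % number of leaf nodes in the tree
newick    = getnewickstr(tree);       % Newick-formatted text for the tree
```

The resulting phytree object can then be passed to plot or phytreeviewer for display.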
See Also
More About
• “Sequence Utilities and Statistics” on page 1-8
• “Sequence Analysis”
Microarray Data Analysis Tools
Microarray data — Read Affymetrix GeneChip files (affyread) and plot data (probesetplot),
ImaGene results files (imageneread), SPOT files (sptread) and Agilent microarray scanner files
(agferead). Read GenePix GPR files (gprread) and GAL files (galread). Get Gene Expression
Omnibus (GEO) data from the Web (getgeodata) and read GEO data from files (geosoftread).
A utility function (magetfield) extracts data from one of the microarray reader functions (gprread,
agferead, sptread, imageneread).
Microarray normalization and filtering — The toolbox provides a number of methods for
normalizing microarray data, such as lowess normalization (malowess) and mean normalization
(manorm), as well as normalization across multiple arrays (quantilenorm). You can use filtering functions to clean raw
data before analysis (geneentropyfilter, genelowvalfilter, generangefilter,
genevarfilter), and calculate the range and variance of values (exprprofrange, exprprofvar).
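As a small illustration of normalizing across arrays, the sketch below applies quantilenorm to a made-up matrix in which rows stand for probes and columns for arrays. The values are invented and contain no ties within a column, so the effect is easy to inspect.

```matlab
% Made-up intensities: 4 probes (rows) on 3 arrays (columns), no ties.
raw = [5 4 3;
       2 1 4;
       3 6 7;
       4 2 8];

normData = quantilenorm(raw);   % give every column the same distribution

% After normalization, the sorted values of every column should agree.
sortedCols = sort(normData);
```

Comparing the columns of sortedCols is a quick sanity check that the arrays now share one distribution.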
Microarray visualization — The toolbox contains routines for visualizing microarray data. These
routines include spatial plots of microarray data (maimage, redgreencmap), box plots (maboxplot),
loglog plots (maloglog), and intensity-ratio plots (mairplot). You can also view clustered expression
profiles (clustergram, redgreencmap). You can create 2-D scatter plots of principal components
from the microarray data (mapcaplot).
Microarray utility functions — Use the following functions to work with Affymetrix GeneChip data
sets. Get library information for a probe (probelibraryinfo), gene information from a probe set
(probesetlookup), and probe set values from CEL and CDF information (probesetvalues). Show
probe set information from NetAffx™ Analysis Center (probesetlink) and plot probe set values
(probesetplot).
The toolbox accesses statistical routines to perform cluster analysis and to visualize the results, and
you can view your data through statistical visualizations such as dendrograms and classification and
regression trees.
See Also
More About
• “Microarray Data Storage” on page 1-12
• “Microarray Analysis”
Microarray Data Storage
The object constructor function, DataMatrix, lets you create a DataMatrix object to encapsulate
data and metadata from a microarray experiment. A DataMatrix object stores experimental data in a
matrix, with rows typically corresponding to gene names or probe identifiers, and columns typically
corresponding to sample identifiers. A DataMatrix object also stores metadata, including the gene
names or probe identifiers (as the row names) and sample identifiers (as the column names).
You can reference microarray expression values in a DataMatrix object the same way you reference
data in a MATLAB array, that is, by using linear or logical indexing. Alternatively, you can reference this
experimental data by gene (probe) identifiers and sample identifiers. Indexing by these identifiers lets
you quickly and conveniently access subsets of the data without having to maintain additional index
arrays.
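The two indexing styles described above can be sketched as follows. The expression values, gene names, and sample names are hypothetical, and the example assumes the constructor is called through its package name, bioma.data.DataMatrix.

```matlab
% Made-up expression values: 3 genes (rows) by 2 samples (columns).
values = [1.2 3.4;
          5.6 7.8;
          9.0 2.1];
genes   = {'geneA', 'geneB', 'geneC'};   % hypothetical row names
samples = {'sample1', 'sample2'};        % hypothetical column names

dm = bioma.data.DataMatrix(values, genes, samples);

byPosition = double(dm(2, 1));                     % numeric indexing
byName     = double(dm({'geneB'}, {'sample1'}));   % indexing by identifiers
```

Both expressions select the same element; double converts the one-element DataMatrix result to a plain number.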
Many MATLAB operators and arithmetic functions are available to DataMatrix objects by means of
methods. These methods let you modify, combine, compare, analyze, plot, and access information
from DataMatrix objects. Additionally, you can easily extend the functionality by using general
element-wise functions, dmarrayfun and dmbsxfun, and by manually accessing the properties of a
DataMatrix object.
Note For more information on creating and using DataMatrix objects, see “Representing Expression
Data Values in DataMatrix Objects” on page 4-5.
See Also
More About
• “Microarray Data Analysis Tools” on page 1-11
• “Microarray Analysis”
Mass Spectrometry Data Analysis
Reading raw data — Load raw mass/charge and ion intensity data from comma-separated-value
(CSV) files, or read a JCAMP-DX-formatted file with mass spectrometry data (jcampread) into the
MATLAB environment.
If your data is stored in TXT files, you can load it with the importdata function.
Spectrum analysis — Load spectra into a GUI (msviewer) for selecting mass peaks and further
analysis.
The following graphic illustrates the roles of the various mass spectrometry functions in the toolbox.
See Also
More About
• “Mass Spectrometry and Bioanalytics”
• “Data Formats and Databases” on page 1-5
Graph Theory Functions
• Directed Graph — Sparse matrix, either double real or logical. Row (column) index indicates the
source (target) of the edge. Self-loops (values in the diagonal) are allowed, although most of the
algorithms ignore these values.
• Undirected Graph — Lower triangle of a sparse matrix, either double real or logical. An
algorithm expecting an undirected graph ignores values stored in the upper triangle of the sparse
matrix and values in the diagonal.
• Directed Acyclic Graph (DAG) — Sparse matrix, double real or logical, with zero values in the
diagonal. While a zero-valued diagonal is a requirement of a DAG, it does not guarantee a DAG. An
algorithm expecting a DAG will not test for cycles because this will add unwanted complexity.
• Spanning Tree — Undirected graph with no cycles and with one connected component.
There are no attributes attached to the graphs; sparse matrices representing all four types of graphs
can be passed to any graph algorithm. All functions will return an error on nonsquare sparse
matrices.
Graph algorithms do not pretest for graph properties because such tests can introduce a time penalty.
For example, there is an efficient shortest path algorithm for DAGs; however, testing whether a graph is
acyclic is expensive compared to the algorithm. Therefore, it is important to select a graph theory
function and properties appropriate for the type of the graph represented by your input matrix. If the
algorithm receives a graph type that differs from what it expects, it will either:
• Return an error when it reaches an inconsistency. For example, if you pass a cyclic graph to the
graphshortestpath function and specify Acyclic as the method property.
• Produce an invalid result. For example, if you pass a directed graph to a function with an
algorithm that expects an undirected graph, it will ignore values in the upper triangle of the
sparse matrix.
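The sparse matrix convention described above can be illustrated with a small directed graph. The four nodes and the edge weights are made up; the sketch checks acyclicity explicitly with graphisdag, since the algorithms do not pretest for it.

```matlab
% Directed graph with 4 nodes; entry (i,j) holds the weight of edge i -> j.
% Made-up edges: 1->2 (weight 1), 1->3 (4), 2->4 (2), 3->4 (1)
DG = sparse([1 1 2 3], [2 3 4 4], [1 4 2 1], 4, 4);

isAcyclic = graphisdag(DG);                   % explicit test for cycles

% Shortest path from node 1 to node 4 using the default method.
[dist, path] = graphshortestpath(DG, 1, 4);
```

Here the route through node 2 (total weight 3) is shorter than the route through node 3 (total weight 5).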
See Also
graphallshortestpaths | graphconncomp | graphshortestpath | graphisdag
The toolbox provides functions that build on the classification and statistical learning tools in the
Statistics and Machine Learning Toolbox™ software (classify, kmeans, fitctree, and
fitrtree).
These functions include imputation tools (knnimpute), and K-nearest neighbor classifiers (fitcknn).
Other functions include set up of cross-validation experiments (crossvalind) and comparison of the
performance of different classification methods (classperf). In addition, there are tools for
selecting diversity and discriminating features (rankfeatures, randfeatures).
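Setting up a cross-validation experiment with crossvalind might look like the sketch below. The number of observations and the number of folds are arbitrary choices, and the training step is left as a comment because it depends on the classifier you use.

```matlab
numObs = 20;   % number of observations (made up)
k = 5;         % number of folds

foldIdx = crossvalind('Kfold', numObs, k);   % fold label (1..k) per observation

for fold = 1:k
    testMask  = (foldIdx == fold);   % observations held out in this fold
    trainMask = ~testMask;           % observations used for training
    % Train a classifier on trainMask and evaluate it on testMask here.
end

foldSizes = accumarray(foldIdx, 1);  % number of observations in each fold
```

The fold assignment is random, so foldIdx differs between runs, while the fold sizes stay balanced.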
Prototyping and Development Environment
Using matrices for sequences or groups of sequences allows you to work efficiently and not worry
about writing loops or other programming controls.
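For instance, because a sequence is stored as a character array, a property such as GC content can be computed with array operations and no explicit loop. The sequences below are made up for illustration.

```matlab
seq = 'ATGCGCTA';                      % made-up nucleotide sequence
isGC = (seq == 'G') | (seq == 'C');    % logical mask, one element per base
gcFraction = mean(isGC);               % fraction of G or C bases

% The same idea extends to equal-length sequences stored as rows of a matrix.
seqs = ['ATGC'; 'GGCC'; 'ATAT'];
gcPerSeq = mean(seqs == 'G' | seqs == 'C', 2);   % one value per row
```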
• Programming tools — Use a visual debugger for algorithm development and refinement and an
algorithm performance profiler to accelerate development.
Data Visualization
You can visually compare pairwise sequence alignments, multiply aligned sequences, gene expression
data from microarrays, and plot nucleic acid and protein characteristics. The 2-D and volume
visualization features let you create custom graphical representations of multidimensional data sets.
You can also create montages and overlays, and export finished graphics to an Adobe® PostScript®
image file or copy directly into Microsoft PowerPoint®.
Exchange Bioinformatics Data Between Excel and MATLAB
Note The following example assumes you have Spreadsheet Link software installed on your system.
The Excel file used in the following example contains data from DeRisi, J.L., Iyer, V.R., and Brown, P.O.
(Oct. 24, 1997). Exploring the metabolic and genetic control of gene expression on a genomic scale.
Science 278(5338), 680–686. PMID: 9381177. The data was filtered using the steps described in
“Gene Expression Profile Analysis” on page 4-95.
Note matlabroot is the MATLAB root folder, which is where MATLAB software is installed on
your system.
6 In the Excel software, enable macros. Click the Developer tab, and then select Macro Security
from the Code group. If the Developer tab is not displayed on the Excel ribbon, consult Excel
Help to display it. If you encounter the "Can't find project or library" error, you might need to
update the references in the Visual Basic software. Open Visual Basic by clicking the Developer
tab and selecting Visual Basic. Then select Tools > References > SpreadsheetLink. If the
MISSING: exclink2007.xlam check box is selected, clear it.
Tip To view a cell's formula, select the cell, and then view the formula in the formula bar
at the top of the Excel window.
2 Execute the formulas in cells J5, J6, J7, and J12, by selecting the cell, pressing F2, and then
pressing Enter.
Each of the first three cells contains a formula using the Spreadsheet Link function
MLPutMatrix, which creates a MATLAB variable from the data in the spreadsheet. Cell J12
contains a formula using the Spreadsheet Link function MLEvalString, which runs the
Bioinformatics Toolbox clustergram function using the three variables as input. For more
information on adding formulas using Spreadsheet Link functions, see “Create Diagonal Matrix
Using Worksheet Cells” (Spreadsheet Link).
3 Note that cell J17 contains a formula using a macro function Clustergram, which was created in
the Visual Basic® Editor. Running this macro does the same as the formulas in cells J5, J6, J7, and
J12. Optionally, view the Clustergram macro function by clicking the Developer tab, and then
clicking the Visual Basic button. (If the Developer tab is not on the Excel ribbon, consult
Excel Help to display it.)
For more information on creating macros using Visual Basic Editor, see “Create Diagonal Matrix
Using VBA Macro” (Spreadsheet Link).
4 Execute the formula in cell J17 to analyze and visualize the data:
The macro function Clustergram runs, creating three MATLAB variables (data, Genes, and
TimeSteps) and displaying a Clustergram window containing dendrograms and a heat map of
the data.
a Select cell J5, and then press F2 to display the formula for editing. Change H617 to H33,
and then press Enter.
b Select cell J6, then press F2 to display the formula for editing. Change A617 to A33, and
then press Enter.
2 Run the formulas in cells J5, J6, J7, and J12 to analyze and visualize a subset of the data:
For example, create a variable in MATLAB containing a 3-by-7 matrix of the data, plot the data in a
Figure window, and then add the plot to your spreadsheet:
Note Make sure you use the ' (transpose) symbol when plotting the data in this step. You need
to transpose the data in YAGenes so that it plots as three genes over seven time intervals.
6 Select cell J20, and then, from the MATLAB group, select Get MATLAB figure.
Working with Whole Genome Data
This example shows how to create a memory mapped file for sequence data and work with it without
loading all the genomic sequence into memory. Whole genomes are available for human, mouse, rat,
fugu, and several other model organisms. For many of these organisms one chromosome can be
several hundred million base pairs long. Working with such large data sets can be challenging as you
may run into limitations of the hardware and software that you are using. This example shows one
way to work around these limitations in MATLAB®.
Solving technical computing problems that require processing and analyzing large amounts of data
puts a high demand on your computer system. Large data sets take up significant memory during
processing and can require many operations to compute a solution. It can also take a long time to
access information from large data files.
Computer systems, however, have limited memory and finite CPU speed. Available resources vary by
processor and operating system, the latter of which also consumes resources. For example:
• 32-bit processors and operating systems can address up to 2^32 = 4,294,967,296 = 4 GB of memory
(also known as virtual address space).
• Windows® XP and Windows® 2000 allocate only 2 GB of this virtual memory to each process (such
as MATLAB). On UNIX®, the virtual memory allocated to a process is system-configurable and is
typically around 3 GB.
• The application carrying out the calculation, such as MATLAB, can require storage in addition to
the user task.
The main problem when handling large amounts of data is that the memory requirements of the
program can exceed that available on the platform. For example, MATLAB generates an "out of
memory" error when data requirements exceed approximately 1.7 GB on Windows XP.
For more details on memory management and large data sets, see “Performance and Memory”.
On a typical 32-bit machine, the maximum size of a single data set that you can work with in MATLAB
is a few hundred MB, or about the size of a large chromosome. Memory mapping of files allows
MATLAB to work around this limitation and enables you to work with very large data sets in an
intuitive way.
The latest whole genome data sets can be downloaded from the Ensembl Website. The data are
provided in several formats. These are updated regularly as new sequence information becomes
available. This example will use human DNA data stored in FASTA format. Chromosome 1 is (in the
GRCh37.56 Release of September 2009) a 65.6 MB compressed file. After uncompressing the file it is
about 250MB. MATLAB uses 2 bytes per character, so if you read the file into MATLAB, it will require
about 500MB of memory.
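The storage arithmetic above can be checked on a small scale. This sketch uses a made-up 1000-character sequence and compares the memory taken by the character array with the same data stored as 8-bit integers.

```matlab
seqChar  = repmat('ACGT', 1, 250);   % 1000-character made-up sequence
seqUint8 = uint8(seqChar);           % same data stored as 8-bit integers

infoChar  = whos('seqChar');
infoUint8 = whos('seqUint8');

bytesChar  = infoChar.bytes;    % 2 bytes per character
bytesUint8 = infoUint8.bytes;   % 1 byte per element
```

The factor of two is the reason a 250 MB text file needs about 500 MB when read in as characters.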
This example assumes that you have already downloaded and uncompressed the FASTA file into your
local directory. Change the name of the variable FASTAfilename if appropriate.
FASTAfilename = 'Homo_sapiens.GRCh37.56.dna.chromosome.1.fa';
fileInfo = dir(which(FASTAfilename))
fileInfo =
name: 'Homo_sapiens.GRCh37.56.dna.chromosome.1.fa'
folder: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\biomemorymapdemo'
date: '01-Feb-2013 11:54:41'
bytes: 253404851
isdir: 0
datenum: 7.3527e+05
Memory mapping allows MATLAB to access data in a file as though it is in memory. You can use
standard MATLAB indexing operations to access data. See the documentation for memmapfile for
more details.
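Before working with the chromosome file, it can help to try memory mapping on a tiny file. The sketch below is self-contained: it writes a few made-up integer codes to a temporary file (the file name is an assumption of the example), maps the file, indexes into it, and then cleans up.

```matlab
% Write a few made-up integer codes to a temporary binary file.
tmpFile = [tempname '.mm'];
fid = fopen(tmpFile, 'w');
fwrite(fid, uint8([1 2 3 4 15 15]), 'uint8');
fclose(fid);

% Map the file; the Data property behaves like a uint8 column vector.
m = memmapfile(tmpFile, 'Format', 'uint8');
numValues = numel(m.Data);
firstFour = m.Data(1:4)';

% Clear the map before deleting the file it refers to.
clear m
delete(tmpFile)
```

Only the indexed part of the file has to be read, which is what makes the approach practical for files much larger than this one.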
You could just map the FASTA file and access the data directly from there. However the FASTA format
file includes new line characters. The memmapfile function treats these characters in the same way
as all other characters. Removing these before memory mapping the file will make indexing
operations simpler. Also, memory mapping does not work directly with character data so you will
have to treat the data as 8-bit integers (uint8 class). The function nt2int in the Bioinformatics
Toolbox™ can be used to convert character information into integer values. int2nt is used to
convert back to characters.
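Open the FASTA file, read its header line, and open the output file for the memory-mapped data.
fidIn = fopen(FASTAfilename,'r');
header = fgetl(fidIn)
header =
[~, filename] = fileparts(FASTAfilename);
mmFilename = [filename '.mm']
mmFilename =
'Homo_sapiens.GRCh37.56.dna.chromosome.1.mm'
fidOut = fopen(mmFilename,'w');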
Read the FASTA file in blocks of 1MB, remove new line characters, convert to uint8, and write to the
MM file.
newLine = sprintf('\n');
blockSize = 2^20;
while ~feof(fidIn)
% Read in the data
charData = fread(fidIn,blockSize,'*char')';
% Remove new lines
charData = strrep(charData,newLine,'');
% Convert to integers
intData = nt2int(charData);
% Write to the new file
fwrite(fidOut,intData,'uint8');
end
fclose(fidIn);
fclose(fidOut);
The new file is about the same size as the old file but does not contain new lines or the header
information.
mmfileInfo = dir(mmFilename)
mmfileInfo =
name: 'Homo_sapiens.GRCh37.56.dna.chromosome.1.mm'
folder: 'C:\TEMP\Bdoc22b_2134332_7304\ibD166FA\17\tp9f51422f\bioinfo-ex57563178'
date: '26-Nov-2022 10:52:13'
bytes: 249250621
isdir: 0
datenum: 7.3885e+05
The command memmapfile constructs a memmapfile object that maps the new file to memory. In
order to do this, it needs to know the format of the file. The format of this file is simple, though much
more complicated formats can be mapped.
chr1 = memmapfile(mmFilename,'Format','uint8')
chr1 =
Filename: 'C:\TEMP\Bdoc22b_2134332_7304\ibD166FA\17\tp9f51422f\bioinfo-ex57563178\Homo_sapien
Writable: false
Offset: 0
Format: 'uint8'
Repeat: Inf
Data: 249250621x1 uint8 array
The memmapfile object has various properties. Filename stores the full path to the file. Writable
indicates whether or not the data can be modified. Note that if you do modify the data, this will also
modify the original file. Offset allows you to specify the space used by any header information.
Format indicates the data format. Repeat is used to specify how many blocks (as defined by
Format) to map. This can be useful for limiting how much memory is used to create the memory map.
These properties can be accessed in the same way as other MATLAB data. For more details, type
help memmapfile or doc memmapfile.
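As a self-contained illustration of these properties, the following minimal sketch maps a small temporary file instead of the chromosome file. The file name and its contents are made up for the example, and only base MATLAB functions (fopen, fwrite, memmapfile) are used.

```matlab
% Write a few uint8 values to a temporary file (hypothetical data).
tmpFile = [tempname '.mm'];
fid = fopen(tmpFile,'w');
fwrite(fid,uint8([1 2 3 4 15 15 2 3]),'uint8');
fclose(fid);

% Map the whole file as uint8 values. The map is read-only by default.
m = memmapfile(tmpFile,'Format','uint8');

n          = numel(m.Data);   % number of mapped elements
middle     = m.Data(2:4)';    % index the mapped data like an ordinary array
isWritable = m.Writable;      % false unless 'Writable',true is requested

% Release the map before deleting the file.
clear m
delete(tmpFile)
```

Requesting 'Writable',true would let assignments to m.Data change the file on disk, which is why the default is read-only.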
chr1.Data(1:10)
ans =
15
15
15
15
15
15
15
15
15
15
You can access any region of the data using indexing operations.
chr1.Data(10000000:10000010)'
ans =
1 1 2 2 2 2 3 4 2 4 2
Remember that the nucleotide information was converted to integers. You can use int2nt to get the
sequence information back.
int2nt(chr1.Data(10000000:10000010)')
ans =
'AACCCCGTCTC'
Now that you can easily access the whole chromosome, you can analyze the data. This example shows
one way to look at the GC content along the chromosome.
numNT = numel(chr1.Data);
blockSize = 500000;
numBlocks = floor(numNT/blockSize);
One way to help MATLAB performance when working with large data sets is to "preallocate" space
for data. This allows MATLAB to allocate enough space for all of the data rather than having to grow
the array in small chunks. This will speed things up and also protect you from problems of the data
getting too large to store. For more details on pre-allocating arrays, see:
https://www.mathworks.com/matlabcentral/answers/99124-how-do-i-pre-allocate-memory-when-using-matlab
ratio = zeros(numBlocks+1,1);
Loop over the data looking for C or G and then divide this number by the total number of A, T, C, and
G. This will take about a minute to run.
block = chr1.Data(stop+1:end);
gc = (sum(block == G | block == C));
at = (sum(block == A | block == T));
ratio(end) = gc/(gc+at);
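A minimal sketch of the block-wise calculation, run on a toy sequence rather than chr1.Data, might look like this. The integer codes (A=1, C=2, G=3, T=4, N=15) are taken from the output shown earlier in this example. The toy data, the block size, and the for loop structure are assumptions made for illustration.

```matlab
% Integer codes as they appear in this example's output.
A = 1; C = 2; G = 3; T = 4;

% Toy sequence standing in for chr1.Data (hypothetical data, 11 bases, two Ns).
seq = uint8([1 2 3 4 2 3 15 15 1 4 3])';
blockSize = 4;
numNT = numel(seq);
numBlocks = floor(numNT/blockSize);
ratio = zeros(numBlocks+1,1);      % preallocate: full blocks plus one tail block

% Full blocks
for k = 1:numBlocks
    start = (k-1)*blockSize + 1;
    stop  = k*blockSize;
    block = seq(start:stop);
    gc = sum(block == G | block == C);
    at = sum(block == A | block == T);
    ratio(k) = gc/(gc+at);         % Ns are left out of the denominator
end

% Remaining partial block
block = seq(stop+1:end);
gc = sum(block == G | block == C);
at = sum(block == A | block == T);
ratio(end) = gc/(gc+at);
```

If the sequence length is an exact multiple of the block size, the tail block is empty and its ratio is NaN (0/0). The same happens for any block that contains only Ns.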
The region in the center of the plot around 140Mbp is a large region of Ns.
seqdisp(chr1.Data(140000000:140001000))
ans =
indices = find(ratio>0.5);
ranges = [(1 + blockSize*(indices-1)), blockSize*indices];
fprintf('Region %d:%d has GC content %f\n',[ranges ,ratio(indices)]')
If you want to remove the temporary file, you must first clear the memmapfile object.
clear chr1
delete(mmFilename)
Comparing Whole Genomes
This example shows how to compare whole genomes of organisms, which lets you compare the
organisms at a very different resolution than single-gene comparisons. Instead of focusing only
on the differences between homologous genes, you can gain insight into the large-scale features of
genomic evolution.
This example uses two related bacterial species, Chlamydia trachomatis and Chlamydophila pneumoniae.
These are closely related bacteria that cause different, though both very common, diseases in
humans. Whole genomes are available in the GenBank® database for both organisms.
You can download these genomes using the getgenbank function. First, download the Chlamydia
trachomatis genome. Notice that the genome is circular and just over one million bp in length. These
sequences are quite large, so they may take a while to download.
seqtrachomatis = getgenbank('NC_000117');
Next, download Chlamydophila pneumoniae. This genome is also circular and a little longer at 1.2
Mbp.
seqpneumoniae = getgenbank('NC_002179');
For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in
public repositories is frequently curated and updated. Hence, the results of this example might be
slightly different when you use up-to-date datasets.
load('chlamydia.mat','seqtrachomatis','seqpneumoniae')
A very simple approach for comparing the two genomes is to perform pairwise alignment between all
genes in the genomes. Given that these are bacterial genomes, a simple approach would be to
compare all ORFs in the two genomes. However, the GenBank data includes more information about
the genes in the sequences. This is stored in the CDS field of the data structure. Chlamydia
trachomatis has 895 coding regions, while Chlamydophila pneumoniae has 1112.
M = numel(seqtrachomatis.CDS)
N = numel(seqpneumoniae.CDS)
M =
895
N =
1112
Most of the CDS records contain the translation to amino acid sequences. The first CDS record in the
Chlamydia trachomatis data is a hypothetical protein of length 591 residues.
seqtrachomatis.CDS(1)
ans =
location: 'join(1041920..1042519,1..1176)'
gene: []
product: 'hypothetical protein'
codon_start: '1'
indices: [1041920 1042519 1 1176]
protein_id: 'NP_219502.1'
db_xref: 'GeneID:884145'
note: []
translation: 'MSIRGVGGNGNSRIPSHNGDGSNRRSQNTKGNNKVEDRVCSLYSSRSNENRESPYAVVDVSSMIESTPTSGETTRASRG
text: [19x58 char]
The fourth CDS record is for the gatA gene, which has product glutamyl-tRNA amidotransferase
subunit A. The length of the product sequence is 491 residues.
seqtrachomatis.CDS(4)
ans =
location: '2108..3583'
gene: 'gatA'
product: [2x47 char]
codon_start: '1'
indices: [2108 3583]
protein_id: 'NP_219505.1'
db_xref: 'GeneID:884087'
note: [7x58 char]
translation: 'MYRKSALELRDAVVNRELSVTAITEYFYHRIESHDEQIGAFLSLCKERALLRASRIDDKLAKGDPIGLLAGIPIGVKDN
text: [26x58 char]
A few of the Chlamydophila pneumoniae CDS have empty translations. Fill them in as follows. First,
find all empty translations, then display the first empty translation.
missingPn = find(cellfun(@isempty,{seqpneumoniae.CDS.translation}));
seqpneumoniae.CDS(missingPn(1))
ans =
location: 'complement(73364..73477)'
gene: []
product: 'hypothetical protein'
codon_start: '1'
indices: [73477 73364]
protein_id: 'NP_444613.1'
db_xref: 'GeneID:963699'
note: 'hypothetical protein; identified by Glimmer2'
translation: []
text: [10x52 char]
The function featureparse extracts features, such as the CDS, from the sequence structure. You
can then use cellfun to apply nt2aa to the sequences with missing translations.
allCDS = featureparse(seqpneumoniae,'Feature','CDS','Sequence',true);
missingSeqs = cellfun(@nt2aa,{allCDS(missingPn).Sequence},'UniformOutput',false);
[seqpneumoniae.CDS(missingPn).translation] = deal(missingSeqs{:});
seqpneumoniae.CDS(missingPn(1))
ans =
location: 'complement(73364..73477)'
gene: []
product: 'hypothetical protein'
codon_start: '1'
indices: [73477 73364]
protein_id: 'NP_444613.1'
db_xref: 'GeneID:963699'
note: 'hypothetical protein; identified by Glimmer2'
translation: 'MLTDQRKHIQMLHKHNSIEIFLSNMVVEVKLFFKTLK*'
text: [10x52 char]
To compare the gatA gene in Chlamydia trachomatis with all the CDS genes in Chlamydophila
pneumoniae, put a for loop around the nwalign function. You could alternatively use local
alignment (swalign).
tic
gatAScores = zeros(1,N);
for inner = 1:N
gatAScores(inner) = nwalign(seqtrachomatis.CDS(4).translation,...
seqpneumoniae.CDS(inner).translation);
end
toc % |tic| and |toc| are used to report how long the calculation takes.
A histogram of the scores shows a large number of negative scores and one very high positive score.
hist(gatAScores,100)
title(sprintf(['Alignment Scores for Chlamydia trachomatis %s\n',...
'with all CDS in Chlamydophila pneumoniae'],seqtrachomatis.CDS(4).gene))
As expected, the highest-scoring match is with the gatA gene in Chlamydophila pneumoniae.
[gatABest, gatABestIdx] = max(gatAScores);
seqpneumoniae.CDS(gatABestIdx)
ans =
location: 'complement(838828..840306)'
gene: 'gatA'
product: [2x47 char]
codon_start: '1'
indices: [840306 838828]
protein_id: 'NP_445311.1'
db_xref: 'GeneID:963139'
note: [7x58 char]
translation: 'MYRYSALELAKAVTLGELTATGVTQHFFHRIEEAEGQVGAFISLCKEQALEQAELIDKKRSRGEPLGKLAGVPVGIKDN
text: [26x58 char]
The pairwise alignment of one gene from Chlamydia trachomatis with all genes from Chlamydophila
pneumoniae takes just under a minute on an Intel® Pentium 4, 2.0 GHz machine running Windows®
XP. To do this calculation for all 895 CDS in Chlamydia trachomatis would take about 12 hours on the
same machine. Uncomment the following code if you want to run the whole calculation.
scores = zeros(M,N);
parfor outer = 1:M
theScore = zeros(1,N);
theSeq = seqtrachomatis.CDS(outer).translation;
for inner = 1:N
theScore(inner) = ...
nwalign(theSeq,...
seqpneumoniae.CDS(inner).translation);
end
scores(outer,:) = theScore;
end
Note that the parfor command is used in the outer loop. If your machine is configured to run multiple
workers (called labs in older releases), the outer loop is executed in parallel. For a full
understanding of this construct, see doc parfor.
The distributions of the scores for several genes show a pattern. CDS(3) of Chlamydia
trachomatis is the gatC gene. This has a relatively short product, aspartyl/glutamyl-tRNA
amidotransferase subunit C, with only 100 residues.
gatCScores = zeros(1,N);
for inner = 1:N
gatCScores(inner) = nwalign(seqtrachomatis.CDS(3).translation,...
seqpneumoniae.CDS(inner).translation);
end
figure
hist(gatCScores,100)
title(sprintf(['Alignment Scores for Chlamydia trachomatis %s\n',...
'with all CDS in Chlamydophila pneumoniae'],seqtrachomatis.CDS(3).gene))
xlabel('Score');ylabel('Number of genes');
The best score again corresponds to the same gene in Chlamydophila pneumoniae.
[gatCBest, gatCBestIdx] = max(gatCScores);
seqpneumoniae.CDS(gatCBestIdx).product
ans =
CDS(339) of Chlamydia trachomatis is the uvrA gene. This has a very long product, excinuclease ABC
subunit A, of length 1786.
uvrAScores = zeros(1,N);
for inner = 1:N
uvrAScores(inner) = nwalign(seqtrachomatis.CDS(339).translation,...
seqpneumoniae.CDS(inner).translation);
end
figure
hist(uvrAScores,100)
title(sprintf(['Alignment Scores for Chlamydia trachomatis %s\n',...
'with all CDS in Chlamydophila pneumoniae'],seqtrachomatis.CDS(339).gene))
xlabel('Score');ylabel('Number of genes');
[uvrABest, uvrABestIdx] = max(uvrAScores);
seqpneumoniae.CDS(uvrABestIdx)
ans =
location: '716887..722367'
gene: []
product: 'excinuclease ABC subunit A'
codon_start: '1'
indices: [716887 722367]
protein_id: 'NP_445220.1'
db_xref: 'GeneID:963214'
note: [6x58 char]
translation: 'MKSLPVYVSGIKVRNLKNVSIHFNSEEIVLLTGVSGSGKSSIAFDTLYAAGRKRYISTLPTFFATTITTLPNPKVEEIH
text: [46x58 char]
The distribution of the scores is affected by the length of the sequences, with very long sequences
potentially having much higher or lower scores than shorter sequences. You can normalize for this in
a number of ways. One way is to divide by the length of the sequences.
lnormgatABest = gatABest./length(seqtrachomatis.CDS(4).translation)
lnormgatCBest = gatCBest./length(seqtrachomatis.CDS(3).translation)
lnormuvrABest = uvrABest./length(seqtrachomatis.CDS(339).translation)
An alternative normalization method is to use the self-alignment score, that is, the score from
aligning the sequence with itself.
gatASelf = nwalign(seqtrachomatis.CDS(4).translation,...
seqtrachomatis.CDS(4).translation);
gatCSelf = nwalign(seqtrachomatis.CDS(3).translation,...
seqtrachomatis.CDS(3).translation);
uvrASelf = nwalign(seqtrachomatis.CDS(339).translation,...
seqtrachomatis.CDS(339).translation);
normgatABest = gatABest./gatASelf
normgatCBest = gatCBest./gatCSelf
normuvrABest = uvrABest./uvrASelf
normgatABest =
0.7380
normgatCBest =
0.5212
normuvrABest =
0.5253
The all-against-all alignment calculation not only takes a lot of time, it also generates a large matrix
of scores. If you are looking for similar genes across species, then the scores that are interesting are
the positive scores that indicate good alignment. However, most of these scores are negative, and the
actual values are not particularly useful for this type of study. Sparse matrices allow you to store the
interesting values in a more efficient way.
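As a small illustration of the idea, the following sketch keeps only the positive entries of a score matrix in sparse form. The matrix values are invented for the example and are not taken from the Chlamydia data.

```matlab
% Hypothetical normalized score matrix: mostly negative, a few positive hits.
S = [-0.4  0.9 -0.2 -0.1;
     -0.3 -0.5  0.7 -0.6;
      0.8 -0.1 -0.2 -0.3];

% Clip negative scores to zero, then store only the nonzero entries.
spS = sparse(max(S,0));

numStored = nnz(spS);           % how many scores are actually stored
[row,col,val] = find(spS);      % their positions and values, column by column
```

Only the three positive scores are stored. All other entries are implicit zeros, so the storage cost depends on the number of good matches rather than on the full size of the matrix.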
The sparse matrix, spScores, in the MAT-file chlamydia.mat contains the positive values from the
all-against-all pairwise alignment calculation, normalized by self-alignment score.
load('chlamydia.mat','spScores')
With the matrix of scores, you can look at the distribution of scores of Chlamydophila pneumoniae
genes aligned with Chlamydia trachomatis and the converse, Chlamydia trachomatis genes
aligned with Chlamydophila pneumoniae genes.
figure
subplot(2,1,1)
hist(max(spScores),100)
title('Highest Alignment Scores for Chlamydophila pneumoniae Genes')
xlabel('Score');ylabel('Number of genes');
subplot(2,1,2)
hist(max(spScores,[],2),100)
title('Highest Alignment Scores for Chlamydia trachomatis Genes')
xlabel('Score');ylabel('Number of genes');
Remember that there are 1112 CDS in Chlamydophila pneumoniae and only 895 in Chlamydia
trachomatis. The high number of zero scores in the top histogram indicates that many of the extra
CDS in Chlamydophila pneumoniae do not have good matches in Chlamydia trachomatis.
Another way to visualize the data is to look at the positions of points in the scores matrix that are
positive. The sparse function spy is an easy way to quickly view dotplots of matrices. This shows
some interesting structure in the positions of the high scoring matches.
figure
spy(spScores > 0)
title(sprintf('Dot Plot of High-Scoring Alignments.\nNormalized Threshold = 0'))
Raise the threshold a little higher to see clear diagonal lines in the plot.
spy(spScores >.1)
title(sprintf('Dot Plot of High-Scoring Alignments.\nNormalized Threshold = 0.1'))
Remember that these are circular genomes, and it seems that the starting points in GenBank are
arbitrary. Permute the scores matrix so that the best match of the first CDS in Chlamydophila
pneumoniae is in the first row to see a clear diagonal plot. This shows the synteny between the two
organisms.
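One way to express such a permutation is a circular shift of the rows. The sketch below uses a toy matrix whose high scores lie on a wrapped diagonal. The matrix and the variable names are made up for illustration, and the same indexing could be applied to spScores before calling spy.

```matlab
% Toy 5-by-5 score matrix with ones on a diagonal that wraps around,
% as happens when two circular genomes start at different points.
S = zeros(5);
S(sub2ind(size(S),[3 4 5 1 2],1:5)) = 1;

% Row holding the best match for the first column
% (the first CDS of the second genome).
[~,bestMatch] = max(S(:,1));

% Rotate the rows so that this row comes first.
P = S([bestMatch:end 1:bestMatch-1],:);
```

After the rotation, the wrapped diagonal becomes the main diagonal, which is the pattern the permuted dot plot is meant to reveal.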
Genes in different genomes that are related to each other are said to be homologous. Homology can
arise by speciation (orthologous genes) or by gene duplication (paralogous genes). Having the
scoring matrix lets you look for both types of relationships.
The most obvious way to find orthologs is to look for the highest scoring pairing for each gene. If the
score is significant then these best reciprocal pairs are very likely to be orthologous.
The variable bestIndices contains the index of the best reciprocal pairs for the genes in
Chlamydophila pneumoniae. Sort the best scores and create a table to compare the description of the
best reciprocal pairs and discover very high similarity between the highest scoring best reciprocal
pairs.
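The best reciprocal pair idea can be sketched on a small full matrix as follows. The scores and all variable names here are hypothetical, and this is only one possible way to compute such pairs, not necessarily how bestIndices was produced.

```matlab
% Hypothetical normalized scores: rows are genes of genome 1,
% columns are genes of genome 2.
S = [0.9 0.1 0.0;
     0.2 0.0 0.8;
     0.3 0.7 0.6];

[bestScore,rowOfCol] = max(S,[],1);   % best row (genome 1 gene) for each column
[~,colOfRow]         = max(S,[],2);   % best column (genome 2 gene) for each row

% Column c and row r = rowOfCol(c) form a reciprocal pair when r also
% picks c as its best column. Columns whose best score is zero are ignored.
cols = (1:size(S,2))';
rowOfCol = rowOfCol(:);
isReciprocal = colOfRow(rowOfCol) == cols & bestScore(:) > 0;
pairs = [rowOfCol(isReciprocal) cols(isReciprocal)];
```

Each row of pairs lists a genome 1 gene index followed by the genome 2 gene index it is reciprocally matched with.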
Score 0.982993
Chlamydia trachomatis Gene : 50S ribosomal protein L36
Chlamydophila pneumoniae Gene : 50S ribosomal protein L36
Score 0.981818
Chlamydia trachomatis Gene : 30S ribosomal protein S15
Chlamydophila pneumoniae Gene : 30S ribosomal protein S15
Score 0.975422
Chlamydia trachomatis Gene : integration host factor alpha-subunit
Chlamydophila pneumoniae Gene : integration host factor beta-subunit
Score 0.971647
Chlamydia trachomatis Gene : 50S ribosomal protein L16
Chlamydophila pneumoniae Gene : 50S ribosomal protein L16
Score 0.970105
Chlamydia trachomatis Gene : 30S ribosomal protein S10
Chlamydophila pneumoniae Gene : 30S ribosomal protein S10
Score 0.969554
Chlamydia trachomatis Gene : rod shape-determining protein MreB
Chlamydophila pneumoniae Gene : rod shape-determining protein MreB
Score 0.953654
Chlamydia trachomatis Gene : hypothetical protein
Chlamydophila pneumoniae Gene : hypothetical protein
You can use the Variable Editor to look at the data in a spreadsheet format.
open('matches')
Compare the descriptions to see that the majority of the best reciprocal pairs have identical
descriptions.
exactMatches = strcmpi(matches(:,4),matches(:,5));
sum(exactMatches)
ans =
808
Perhaps more interesting are the best reciprocal pairs where the descriptions are not identical. Some
are simply differences in how the same gene is described, but others show quite different
descriptions.
mismatches = matches(~exactMatches,:);
for count = 1:7
fprintf(['Score %f\nChlamydia trachomatis Gene : %s\n',...
'Chlamydophila pneumoniae Gene : %s\n\n'],...
mismatches{count,1}, mismatches{count,4}, mismatches{count,5})
end
Score 0.975422
Chlamydia trachomatis Gene : integration host factor alpha-subunit
Chlamydophila pneumoniae Gene : integration host factor beta-subunit
Score 0.929565
Chlamydia trachomatis Gene : low calcium response D
Chlamydophila pneumoniae Gene : type III secretion inner membrane protein SctV
Score 0.905000
Chlamydia trachomatis Gene : NrdR family transcriptional regulator
Chlamydophila pneumoniae Gene : transcriptional regulator NrdR
Score 0.903226
Chlamydia trachomatis Gene : Yop proteins translocation protein S
Chlamydophila pneumoniae Gene : type III secretion inner membrane protein SctS
Score 0.896212
Chlamydia trachomatis Gene : ATP-dependent protease ATP-binding subunit ClpX
Chlamydophila pneumoniae Gene : ATP-dependent protease ATP-binding protein ClpX
Score 0.890705
Chlamydia trachomatis Gene : ribonuclease E
Chlamydophila pneumoniae Gene : ribonuclease G
Score 0.884234
Chlamydia trachomatis Gene : ClpC protease ATPase
Chlamydophila pneumoniae Gene : ATP-dependent Clp protease ATP-binding protein
open('mismatches')
Once you have the scoring matrix, many possibilities open up for further investigation. For
example, you could look for CDS where there are multiple high-scoring reciprocal CDS. See
Cristianini and Hahn [1] for further ideas.
References
[1] Cristianini, N. and Hahn, M.W., "Introduction to Computational Genomics: A Case Studies
Approach", Cambridge University Press, 2007.
See Also
getgenbank | nwalign | featureparse
2
Overview
Many biological experiments produce huge data files that are difficult to access due to their size,
which can cause memory issues when reading the file into the MATLAB Workspace. You can construct
a BioIndexedFile object to access the contents of a large text file containing nonuniform size
entries, such as sequences, annotations, and cross-references to data sets. The BioIndexedFile
object lets you quickly and efficiently access this data without loading the source file into memory.
You can use the BioIndexedFile object to access individual entries or a subset of entries when the
source file is too big to fit into memory. You can access entries using indices or keys. You can read and
parse one or more entries using provided interpreters or a custom interpreter function.
You can use the BioIndexedFile object with a large source file in any of these formats:
• FASTA
• FASTQ
• SAM
• Table — Tab-delimited table with multiple columns. Keys can be in any column. Rows with the
same key are considered separate entries.
• Multi-row Table — Tab-delimited table with multiple columns. Keys can be in any column.
Contiguous rows with the same key are considered a single entry. Noncontiguous rows with the
same key are considered separate entries.
• Flat — Flat file with concatenated entries separated by a character vector, typically //. Within an
entry, the key is separated from the rest of the entry by a white space.
Work with Next-Generation Sequencing Data
When you construct a BioIndexedFile object from your source file for the first time, you also
create an auxiliary index file, which by default is saved to the same location as your source file.
However, if your source file is in a read-only location, you can specify a different location to save the
index file.
Tip If you construct a BioIndexedFile object from your source file on subsequent occasions, it
takes advantage of the existing index file, which saves time. However, the index file must be in the
same location or a location specified by the subsequent construction syntax.
Tip If memory is not an issue when accessing your source file, you may want to try an
appropriate read function, such as genbankread, for importing data from GenBank files.
Additionally, several read functions such as fastaread, fastqread, samread, and sffread include
a Blockread property, which lets you read a subset of entries from a file, thus saving memory.
Caution Do not modify the index file. If you modify it, you can get invalid results. Also, the
constructor function cannot use a modified index file to construct future objects from the
associated source file.
2 High-Throughput Sequence Analysis
ans =
6476
Note For a list and description of all properties of the object, see BioIndexedFile.
Use the getEntryByIndex method to retrieve a subset of entries from your source file that
correspond to specified indices. For example, retrieve the first 12 entries from the yeastgenes.sgd
source file:
Use the getEntryByKey method to retrieve a subset of entries from your source file that are
associated with specified keys. For example, retrieve all entries with keys of AAC1 and AAD10 from
the yeastgenes.sgd source file:
The output subset_entries is a character vector of concatenated entries. Because the keys in the
yeastgenes.sgd source file are not unique, this method returns all entries that have a key of AAC1
or AAD10.
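The two retrieval calls described above might be sketched as follows. This assumes that the sample file yeastgenes.sgd is on the MATLAB path, that it is indexed as a multi-row table with gene names in column 3 and header lines starting with '!', and that the current folder is writable for the index file. The object name gene2goObj is an illustrative choice.

```matlab
% Assumption: yeastgenes.sgd is available on the MATLAB path.
sourcefile = which('yeastgenes.sgd');

% Index the file as a multi-row table keyed on column 3 (assumed to hold
% gene names). The index file is written to the current folder ('.').
gene2goObj = BioIndexedFile('mrtab',sourcefile,'.', ...
    'KeyColumn',3,'HeaderPrefix','!');

% First 12 entries, selected by position in the file.
subset_by_index = getEntryByIndex(gene2goObj,1:12);

% All entries whose key is AAC1 or AAD10.
subset_entries = getEntryByKey(gene2goObj,{'AAC1','AAD10'});
```

Both calls return the raw text of the selected entries, so no parsing happens at this stage.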
Before using the read method, make sure the Interpreter property of the BioIndexedFile
object is set appropriately.
There are two ways to set the Interpreter property of the BioIndexedFile object:
• When constructing the BioIndexedFile object, use the Interpreter property name/property
value pair
• After constructing the BioIndexedFile object, set the Interpreter property
Note For more information on setting the Interpreter property of the object, see
BioIndexedFile.
The read method reads and parses a subset of entries that you specify using either entry indices or
keys.
Example
Because the entry keys are gene names, you can quickly find all the gene ontology (GO) terms
associated with a particular gene:
GO_YAT2_entries =
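A sketch of such a lookup is shown below. It assumes the yeastgenes.sgd sample file, indexed as a multi-row table keyed on the gene-name column (assumed to be column 3). The object name gene2goObj, the regular expression, and the choice of the YAT2 key are illustrative.

```matlab
% Assumption: yeastgenes.sgd is available on the MATLAB path.
sourcefile = which('yeastgenes.sgd');
gene2goObj = BioIndexedFile('mrtab',sourcefile,'.', ...
    'KeyColumn',3,'HeaderPrefix','!');

% Interpreter: a function handle that receives the text of the selected
% entries and returns every GO identifier (GO: followed by digits) in it.
gene2goObj.Interpreter = @(x) regexp(x,'GO:\d+','match');

% Read and parse only the entries whose key is YAT2.
GO_YAT2_entries = read(gene2goObj,'YAT2');
```

Because the interpreter runs only on the entries that are read, the rest of the source file is never parsed.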
Overview
High-throughput sequencing instruments produce large amounts of sequence read data that can be
challenging to store and manage. Using objects to contain this data lets you easily access,
manipulate, and filter the data.
Bioinformatics Toolbox includes two objects for working with sequence read data.
Manage Sequence Read Data in Objects
A BioRead object represents a collection of sequence reads. Each element in the object is associated
with a sequence, sequence header, and sequence quality information.
• Indexed — The data remains in the source file. Constructing the object and accessing its contents
is memory efficient. However, you cannot modify object properties, other than the Name property.
This is the default method if you construct a BioRead object from a FASTQ- or SAM-formatted
file.
• In Memory — The data is read into memory. Constructing the object and accessing its contents is
limited by the amount of available memory. However, you can modify object properties. When you
construct a BioRead object from a FASTQ structure or cell arrays, the data is read into memory.
When you construct a BioRead object from a FASTQ- or SAM-formatted file, use the InMemory
name-value pair argument to read the data into memory.
Note This example constructs a BioRead object from a FASTQ-formatted file. Use similar steps to
construct a BioRead object from a SAM-formatted file.
Use the BioRead constructor function to construct a BioRead object from a FASTQ-formatted file
and set the Name property:
BRObj1 =
The constructor function constructs a BioRead object and, if an index file does not already exist, it
also creates an index file with the same file name, but with an .IDX extension. This index file, by
default, is stored in the same location as the source file.
Caution Your source file and index file must always be in sync.
• After constructing a BioRead object, do not modify the index file, or you can get invalid results
when using the existing object or constructing new objects.
• If you modify the source file, delete the index file, so the object constructor creates a new index
file when constructing new objects.
Note Because you constructed this BioRead object from a source file, you cannot modify the
properties (except for Name) of the BioRead object.
A BioMap object represents a collection of sequence reads that map against a single reference
sequence. Each element in the object is associated with a read sequence, sequence header, sequence
quality information, and alignment/mapping information.
When constructing a BioMap object from a BAM file, the maximum size of the file is limited by your
operating system and available memory.
• Indexed — The data remains in the source file. Constructing the object and accessing its contents
is memory efficient. However, you cannot modify object properties, other than the Name property.
This is the default method if you construct a BioMap object from a SAM- or BAM-formatted file.
• In Memory — The data is read into memory. Constructing the object and accessing its contents is
limited by the amount of available memory. However, you can modify object properties. When you
construct a BioMap object from a structure, the data stays in memory. When you construct a
BioMap object from a SAM- or BAM-formatted file, use the InMemory name-value pair argument
to read the data into memory.
Note This example constructs a BioMap object from a SAM-formatted file. Use similar steps to
construct a BioMap object from a BAM-formatted file.
1 If you do not know the number and names of the reference sequences in your source file,
determine them using the saminfo or baminfo function and the ScanDictionary name-value
pair argument.
ans =
'seq1'
'seq2'
Tip The previous syntax scans the entire SAM file, which is time consuming. If you are confident
that the Header information of the SAM file is correct, omit the ScanDictionary name-value
pair argument, and inspect the SequenceDictionary field instead.
2 Use the BioMap constructor function to construct a BioMap object from the SAM file and set the
Name property. Because the SAM-formatted file in this example, ex2.sam, contains multiple
reference sequences, use the SelectRef name-value pair argument to specify one reference
sequence, seq1:
BMObj2 =
SequenceDictionary: 'seq1'
Reference: [1501x1 File indexed property]
Signature: [1501x1 File indexed property]
Start: [1501x1 File indexed property]
MappingQuality: [1501x1 File indexed property]
Flag: [1501x1 File indexed property]
MatePosition: [1501x1 File indexed property]
Quality: [1501x1 File indexed property]
Sequence: [1501x1 File indexed property]
Header: [1501x1 File indexed property]
NSeqs: 1501
Name: 'MyObject'
The constructor function constructs a BioMap object and, if index files do not already exist, it also
creates one or two index files:
• If constructing from a SAM-formatted file, it creates one index file that has the same file name as
the source file, but with an .IDX extension. This index file, by default, is stored in the same
location as the source file.
• If constructing from a BAM-formatted file, it creates two index files that have the same file name
as the source file, but one with a .BAI extension and one with a .LINEARINDEX extension. These
index files, by default, are stored in the same location as the source file.
Caution Your source file and index files must always be in sync.
• After constructing a BioMap object, do not modify the index files, or you can get invalid results
when using the existing object or constructing new objects.
• If you modify the source file, delete the index files, so the object constructor creates new index
files when constructing new objects.
Note Because you constructed this BioMap object from a source file, you cannot modify the
properties (except for Name and Reference) of the BioMap object.
Note This example constructs a BioMap object from a SAM structure using samread. Use similar
steps to construct a BioMap object from a BAM structure using bamread.
1 Use the samread function to create a SAM structure from a SAM-formatted file:
SAMStruct = samread('ex2.sam');
2 To construct a valid BioMap object from a SAM-formatted file, the file must contain only one
reference sequence. Determine the number and names of the reference sequences in your
SAM-formatted file using the unique function to find unique names in the ReferenceName field of
the structure:
unique({SAMStruct.ReferenceName})
ans =
'seq1' 'seq2'
3 Use the BioMap constructor function to construct a BioMap object from a SAM structure.
Because the SAM structure contains multiple reference sequences, use the SelectRef name-
value pair argument to specify one reference sequence, seq1:
BMObj1 =
SequenceDictionary: {'seq1'}
Reference: {1501x1 cell}
Signature: {1501x1 cell}
Start: [1501x1 uint32]
MappingQuality: [1501x1 uint8]
Flag: [1501x1 uint16]
MatePosition: [1501x1 uint32]
Quality: {1501x1 cell}
Sequence: {1501x1 cell}
Header: {1501x1 cell}
NSeqs: 1501
Name: ''
You can retrieve a specific property from elements in a BioRead or BioMap object.
For example, to retrieve all headers from a BioRead object, use the Header property as follows:
allHeaders = BRObj1.Header;
This syntax returns a cell array containing the headers for all elements in the BioRead object.
Similarly, to retrieve all start positions of aligned read sequences from a BioMap object, use the
Start property of the object:
allStarts = BMObj1.Start;
This syntax returns a vector containing the start positions of aligned read sequences with respect to
the position numbers in the reference sequence in a BioMap object.
You can retrieve multiple properties from a BioRead or BioMap object in a single command using the
get method. For example, to retrieve both the start positions and the header information of a BioMap
object, use the get method as follows:
This syntax returns a cell array containing all start positions and header information of a BioMap
object.
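A minimal sketch of this call, assuming the sample file ex1.sam is on the MATLAB path and its folder is writable for the index file. The object name BMObj and the variable names are illustrative.

```matlab
% Assumption: ex1.sam (a sample file with one reference sequence) is on the path.
BMObj = BioMap('ex1.sam');

% One call, two properties: the result holds one cell per requested property.
multiProp = get(BMObj,{'Start','Header'});
starts  = multiProp{1};   % start positions of the aligned reads
headers = multiProp{2};   % headers of the reads
```

The cells come back in the order in which the property names were requested.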
For a list and description of all properties of a BioRead object, see BioRead class. For a list and
description of all properties of a BioMap object, see BioMap class.
Use specialized get methods with a numeric vector, logical vector, or cell array of headers to retrieve
a subset of information from an object. For example, to retrieve the first 10 elements from a BioRead
object, use the getSubset method:
This syntax returns a new BioRead object containing the first 10 elements in the original BioRead
object.
For example, to retrieve the first 12 positions of sequences with headers SRR005164.1,
SRR005164.7, and SRR005164.16, use the getSubsequence method:
subSeqs =
'TGGCTTTAAAGC'
'CCCGAAAGCTAG'
'AATTTTGCGGCT'
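The getSubset and getSubsequence calls described above might look like the following sketch. It assumes the sample file SRR005164_1_50.fastq is on the MATLAB path. The object and variable names are illustrative.

```matlab
% Assumption: SRR005164_1_50.fastq is available on the MATLAB path.
BRObj = BioRead('SRR005164_1_50.fastq');

% New BioRead object holding the first 10 elements.
first10 = getSubset(BRObj,1:10);

% First 12 positions of three reads, selected by header.
subSeqs = getSubsequence(BRObj, ...
    {'SRR005164.1','SRR005164.7','SRR005164.16'},1:12);
```

The subset can be given as a numeric vector, a logical vector, or a cell array of headers, so the same two methods cover selection by position and by name.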
For example, to retrieve information about the third element in a BioMap object, use the getInfo
method:
This syntax returns a tab-delimited character vector containing this information for the third element:
• Sequence header
• SAM flags for the sequence
• Start position of the aligned read sequence with respect to the reference sequence
• Mapping quality score for the sequence
• Signature (CIGAR-formatted character vector) for the sequence
• Sequence
For a complete list and description of methods of a BioRead object, see BioRead class. For a
complete list and description of methods of a BioMap object, see BioMap class.
To modify properties (other than Name and Reference) of a BioRead or BioMap object, the data
must be in memory, and not indexed. To ensure the data is in memory, do one of the following:
• Construct the object from a structure as described in “Construct a BioMap Object from a SAM or
BAM Structure” on page 2-9.
• Construct the object from a source file using the InMemory name-value pair argument.
BRObj1 = BioRead('SRR005164_1_50.fastq','InMemory',true);
To provide custom headers for sequences of interest (in this case sequences 1 to 5), do the following:
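One way this might look, assuming the object is built in memory from the sample file SRR005164_1_50.fastq. The header names H1 to H5 and the variable names are hypothetical.

```matlab
% The data must be in memory for properties to be modifiable.
BRObj = BioRead('SRR005164_1_50.fastq','InMemory',true);

% Hypothetical custom headers for sequences 1 to 5.
newHeaders = {'H1','H2','H3','H4','H5'};
BRObj = setHeader(BRObj,newHeaders,1:5);

% Confirm the change.
h = getHeader(BRObj,1:5);
```

The set method returns a modified object, so the result is assigned back to the same variable.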
Several other specialized set methods let you set the properties of a subset of elements in a
BioRead or BioMap object.
For a complete list and description of methods of a BioRead object, see BioRead class. For a
complete list and description of methods of a BioMap object, see BioMap class.
For example, you can compute the number, indices, and start positions of the read sequences that
align within the first 25 positions of the reference sequence. To do so, use the getCounts,
getIndex, and getStart methods:
Manage Sequence Read Data in Objects
Cov =
12
Indices =
1
2
3
4
5
6
7
8
9
10
11
12
startPos =
1
3
5
6
9
13
13
15
18
22
22
24
The first two syntaxes return the number and indices of the read sequences that align within the
specified region of the reference sequence. The last syntax returns a vector containing the start
position of each aligned read sequence, corresponding to the position numbers of the reference
sequence.
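The three calls described above can be sketched as follows. The object name BMObj1 and its construction from the ex1.sam example file are assumptions made for this sketch; the region (positions 1 to 25) comes from the text.

```matlab
% Assumption: the BioMap object is built from the ex1.sam example file.
BMObj1 = BioMap('ex1.sam');

% Number of reads that align within positions 1 to 25 of the reference.
Cov = getCounts(BMObj1,1,25)

% Indices of those reads within the object.
Indices = getIndex(BMObj1,1,25)

% Start position (in reference coordinates) of each of those reads.
startPos = getStart(BMObj1,Indices)
```

Passing the indices from getIndex to getStart keeps the two outputs aligned element by element.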
You can also compute the number of read sequences that align to each of the first 10 positions of
the reference sequence. For this computation, use the getBaseCoverage method:
Cov =
1 1 2 2 3 4 4 4 5 5
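The coverage vector above can be checked by hand in base MATLAB. The sketch below is a toy illustration, not the toolbox implementation: it uses the first five start positions listed earlier in this section and assumes every read is 35 bases long (any length of at least 10 gives the same result for this window).

```matlab
% Start positions of the reads that begin within positions 1 to 10.
starts = [1 3 5 6 9];
readLen = 35;                      % assumed read length, for illustration only
stops = starts + readLen - 1;

positions = 1:10;
baseCov = zeros(size(positions));
for k = 1:numel(starts)
    % Add 1 at every position that read k covers.
    covered = positions >= starts(k) & positions <= stops(k);
    baseCov = baseCov + covered;
end
disp(baseCov)
```

Each new read start raises the coverage by one from that position onward, which is why the values step up at positions 3, 5, 6, and 9.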
For example, to retrieve the alignment of read sequences to the first 12 positions of the reference
sequence in a BioMap object, use the getAlignment method:
Alignment_1_12 =
CACTAGTGGCTC
CTAGTGGCTC
AGTGGCTC
GTGGCTC
GCTC
Indices =
1
2
3
4
5
Return the headers of the read sequences that align to a specific region of the reference sequence:
alignedHeaders =
'B7_591:4:96:693:509'
'EAS54_65:7:152:368:113'
'EAS51_64:8:5:734:57'
'B7_591:1:289:587:906'
'EAS56_59:8:38:671:758'
BMObj2 = BioMap('ex1.sam');
2 Use the filterByFlag method to create a logical vector indicating the read sequences in a
BioMap object that are mapped.
BMObj2 = BioMap('ex1.sam');
2 Use the filterByFlag method to create a logical vector indicating the read sequences in a
BioMap object that are mapped in a proper pair, that is, both the read sequence and its mate are
mapped to the reference sequence.
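The two criteria above correspond to individual bits of the SAM flag stored with each read. The bit tests can be sketched in base MATLAB as follows. The flag values are made up, and the bit meanings (0x2 for a proper pair, 0x4 for an unmapped read) are taken from the SAM format specification rather than from this guide.

```matlab
% Hypothetical SAM flags for six reads.
flags = [99 147 4 0 83 77];

% Bit 0x4 clear: the read itself is mapped.
isMapped = bitand(flags,4) == 0;

% Bit 0x2 set: the read and its mate are mapped in a proper pair.
isProperPair = bitand(flags,2) ~= 0;

disp(isMapped)
disp(isProperPair)
```

Either logical vector can then be used to select the matching elements, in the same way as the logical vector returned by filterByFlag.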
GFFAnnotObj = GFFAnnotation('tair8_1.gff')
GFFAnnotObj =
Use the GTFAnnotation constructor function to construct a GTFAnnotation object from a GTF-
formatted file:
GTFAnnotObj = GTFAnnotation('hum37_2_1M.gtf')
GTFAnnotObj =
GFFAnnotObj.FieldNames
Store and Manage Feature Annotations in Objects
ans =
Columns 1 through 6
Columns 7 through 9
GTFAnnotObj.FieldNames
ans =
Columns 1 through 6
Columns 7 through 11
Determine the range of the reference sequences that are covered by feature annotations by using the
getRange method with the annotation object constructed in the previous section:
range = getRange(GFFAnnotObj)
range =
3631 498516
Creating a structure of the annotation data lets you access the field values. Use the getData method
to create a structure containing a subset of the data in a GFFAnnotation object constructed in the
previous section.
AnnotStruct =
Extract the start positions for annotations 12 through 17. Notice that you must use square brackets
when indexing a range of positions:
Starts_12_17 = [AnnotStruct(12:17).Start]
Starts_12_17 =
Extract the start position and the feature for the 12th annotation:
Start_12 = AnnotStruct(12).Start
Start_12 =
4706
Feature_12 = AnnotStruct(12).Feature
Feature_12 =
CDS
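The bracket rule follows from how MATLAB indexes structure arrays: a field reference over several elements produces a comma-separated list, and the square brackets collect that list into one vector. A small sketch with a made-up structure array (the field names match the ones used above, the values are hypothetical):

```matlab
% Hypothetical annotation structure with three elements.
S = struct('Start',{100, 250, 400},'Feature',{'gene','CDS','exon'});

% A single element returns the field value directly.
oneStart = S(2).Start

% A range of elements needs brackets to gather the values into a vector.
someStarts = [S(2:3).Start]

% Character fields are gathered into a cell array with braces instead.
someFeatures = {S(2:3).Feature}
```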
• Determine counts of sequence reads aligned to regions of a reference sequence associated with
specific annotations, such as in RNA-Seq workflows.
• Find annotations within a specific range of a peak of interest in a reference sequence, such as in
ChIP-Seq workflows.
refNames =
'chr2'
3 Use the getFeatureNames method to retrieve the feature names from the annotation object:
featureNames = getFeatureNames(GTFAnnotObj)
featureNames =
'CDS'
'exon'
'start_codon'
'stop_codon'
4 Use the getGeneNames method to retrieve a list of the unique gene names from the annotation
object:
geneNames = getGeneNames(GTFAnnotObj)
geneNames =
'uc002qvu.2'
'uc002qvv.2'
'uc002qvw.2'
'uc002qvx.2'
'uc002qvy.2'
'uc002qvz.2'
'uc002qwa.2'
'uc002qwb.2'
'uc002qwc.1'
'uc002qwd.2'
'uc002qwe.3'
'uc002qwf.2'
'uc002qwg.2'
'uc002qwh.2'
'uc002qwi.3'
'uc002qwk.2'
'uc002qwl.2'
'uc002qwm.1'
'uc002qwn.1'
'uc002qwo.1'
'uc002qwp.2'
'uc002qwq.2'
'uc010ewe.2'
'uc010ewf.1'
'uc010ewg.2'
'uc010ewh.1'
'uc010ewi.2'
'uc010yim.1'
The previous steps gave us a list of available reference sequences, features, and genes associated
with the available annotations. Use this information to determine annotations of interest. For
instance, you might be interested only in annotations that are exons associated with the uc002qvv.2
gene on chromosome 2.
Filter Annotations
Use the getData method to filter the annotations and create a structure containing only the
annotations of interest, which are annotations that are exons associated with the uc002qvv.2 gene on
chromosome 2.
AnnotStruct = getData(GTFAnnotObj,'Reference','chr2',...
'Feature','exon','Gene','uc002qvv.2')
AnnotStruct =
The returned structure contains 12 elements, indicating there are 12 annotations that meet your filter
criteria.
After filtering the data to include only annotations that are exons associated with the uc002qvv.2
gene on chromosome 2, use the Start and Stop fields to create vectors of the start and end positions
for the ranges associated with the 12 annotations.
StartPos = [AnnotStruct.Start];
EndPos = [AnnotStruct.Stop];
Construct a BioMap object from a BAM-formatted file containing sequence read data aligned to
chromosome 2.
BMObj3 = BioMap('ex3.bam');
Then use the range for the annotations of interest as input to the getCounts method of a BioMap
object. This returns the counts of short reads aligned to the annotations of interest.
counts =
1399
1
54
221
97
125
0
1
0
65
9
12
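The idea behind these per-annotation counts can be illustrated in base MATLAB. The sketch below is a toy model, not the getCounts implementation: the read and annotation coordinates are made up, each range is counted independently, and a read is counted when it overlaps the range by at least one base.

```matlab
% Hypothetical reads, each 10 bases long.
readStart = [5 20 40 42 90];
readStop  = readStart + 9;

% Hypothetical annotation ranges (one start and end per annotation).
StartPos = [1 45 100];
EndPos   = [25 60 120];

counts = zeros(numel(StartPos),1);
for k = 1:numel(StartPos)
    % A read overlaps range k if it starts before the range ends
    % and stops after the range starts.
    overlaps = readStart <= EndPos(k) & readStop >= StartPos(k);
    counts(k) = sum(overlaps);
end
disp(counts)
```

A count of zero, as for the third range here, simply means that no read touches that annotation.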
Bioinformatics Toolbox Software Support Packages
1 In the Environment section of the MATLAB toolstrip, select Add-Ons > Get Add-Ons.
2 In the Add-On Explorer, search for the support package that you want to install by entering its
name.
3 Install the support package.
For details about installing add-ons, see “Get and Manage Add-Ons”. For other information, see “Add-
Ons”.
† Version of the original (third-party) software
‡ You need to install Windows Subsystem for Linux (WSL) and a Linux distribution on your Windows
machine. For details on installing WSL, see here.
See Also
More About
• “Count Features from NGS Reads” on page 2-23
• “High-Throughput Sequencing”
References
[1] Langmead, Ben, and Steven L Salzberg. “Fast Gapped-Read Alignment with Bowtie 2.” Nature
Methods 9, no. 4 (April 2012): 357–59. https://doi.org/10.1038/nmeth.1923.
[2] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren,
Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification
by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell
Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.
[3] Li, Heng, and Richard Durbin. “Fast and Accurate Long-Read Alignment with Burrows–Wheeler
Transform.” Bioinformatics 26, no. 5 (March 1, 2010): 589–95.
https://doi.org/10.1093/bioinformatics/btp698.
[4] Li, Heng, and Richard Durbin. “Fast and Accurate Short Read Alignment with Burrows–Wheeler
Transform.” Bioinformatics 25, no. 14 (July 15, 2009): 1754–60.
https://doi.org/10.1093/bioinformatics/btp324.
Count Features from NGS Reads
This example shows how to count features from paired-end sequencing reads after aligning them to
the whole human genome curated by the Genome Reference Consortium. This example uses Genome
Reference Consortium Human Build 38 patch release 12 (GRCh38.p12) as the human genome
reference.
This example works on the UNIX® and Mac platforms only. Download the Bioinformatics Toolbox™
Interface for Bowtie Aligner support package from the Add-On Explorer. For details, see
“Bioinformatics Toolbox Software Support Packages” on page 2-21.
This example assumes that you have:
• Downloaded and extracted the RefSeq assembly from Genome Reference Consortium Human
Build 38 patch release 12 (GRCh38.p12).
• Downloaded and organized some paired-end reads data. This example uses the exome sequencing
data from the 1000 genomes project. Paired-end reads are indicated by '_1' and '_2' in the
filenames.
Build Index
Construct an index for aligning reads to the reference using bowtie2build. The file
GCF_000001405.38_GRCh38.p12_genomic.fna contains the human reference genome in the
FASTA format. bowtieIdx is the base name of the reference index files. The '--threads 8' option
specifies the number of parallel threads to build index files faster. You do not need to specify full file
paths for *.fna or *.index files if you are running the example from the same folder location. Specify
the full paths if you wish to store the files elsewhere or run the example from a different folder.
bowtieIdx = 'GCF_000001405.38_GRCh38.p12_genomic.index';
buildFlag = bowtie2build('GCF_000001405.38_GRCh38.p12_genomic.fna',...
bowtieIdx,'--threads 8');
Align paired-end reads to the reference using bowtie2. You can create a Bowtie2AlignOptions
object to specify different options, such as the number of parallel threads to use.
opt = Bowtie2AlignOptions;
opt.NumThreads = 8;
reads1 = 'HG00096_1.fastq';
reads2 = 'HG00096_2.fastq';
bowtie2(bowtieIdx,reads1,reads2,'HG00096.sam',opt);
SAM files can be very large. Use BioMap to select only the data for the correct reference. For this
example, consider APOE, which is a gene on Chromosome 19 linked to Alzheimer's disease. Create a
smaller BAM file for APOE to improve performance.
Warning: Found invalid tag in header type: 'PG'. Ignoring tag 'PN:bowtie2'.
Warning: The read sequences in input SAM file do not appear to be ordered
according to the start position of their alignments with the reference sequence.
Because of this, there will be a decrease in performance when accessing the
reads. For maximum performance, order the read sequences in the SAM file, before
creating a BioMap object.
Use featurecount to compare the number of transcripts for each APOE variant using a GTF file. A
full table of features is included in the GRCh38.p12 assembly in GFF format, which can be converted
to GTF using cuffgffread. This example uses a simplified GTF based on APOE transcripts.
APOE_gene.gtf is included with the software.
See Also
bamsort | samsort | bwamem | bowtie2 | bowtie2build | featurecount | BioMap |
cuffgffread | cufflinks
This example shows how to test RNA-Seq data for differentially expressed genes using a negative
binomial model.
Introduction
A typical differential expression analysis of RNA-Seq data consists of normalizing the raw counts and
performing statistical tests to decide whether to reject the null hypothesis that two groups of samples
show no significant difference in gene expression. This example shows how to inspect the basic statistics of
raw count data, how to determine size factors for count normalization and how to infer the most
differentially expressed genes using a negative binomial model.
The dataset for this example comprises RNA-Seq data obtained in the experiment described by
Brooks et al. [1]. The authors investigated the effect of siRNA knock-down of pasilla, a gene known to
play an important role in the regulation of splicing in Drosophila melanogaster. The dataset consists
of 2 biological replicates of the control (untreated) samples and 2 biological replicates of the knock-
down (treated) samples.
The starting point for this analysis of RNA-Seq data is a count matrix, where the rows correspond to
genomic features of interest, the columns correspond to the given samples and the values represent
the number of reads mapped to each feature in a given sample.
The included file pasilla_count_noMM.mat contains two tables with the count matrices at the
gene level and at the exon level for each of the considered samples. You can obtain similar matrices
using the function featurecount.
load pasilla_count_noMM.mat
"FBgn0000003" "3R" 0 1 1 2
"FBgn0000008" "2R" 142 117 138 132
"FBgn0000014" "3R" 20 12 10 19
"FBgn0000015" "3R" 2 4 0 1
"FBgn0000017" "3L" 6591 5127 4809 6027
"FBgn0000018" "2L" 469 530 492 574
"FBgn0000024" "3R" 5 6 10 8
"FBgn0000028" "X" 0 0 2 1
Note that when counting is performed without summarization, the individual features (exons in this
case) are reported with their metafeature assignment (genes in this case) followed by the start and
stop positions.
% preview the table of read counts for exons
head(exonCountTable)
Identifying Differentially Expressed Genes from RNA-Seq Data
"FBgn0000003_2648220_2648518" "3R" 0 0 0 1
"FBgn0000008_18024938_18025756" "2R" 0 1 0 2
"FBgn0000008_18050410_18051199" "2R" 13 9 14 12
"FBgn0000008_18052282_18052494" "2R" 4 2 5 0
"FBgn0000008_18056749_18058222" "2R" 32 27 26 23
"FBgn0000008_18058283_18059490" "2R" 14 18 29 22
"FBgn0000008_18059587_18059757" "2R" 1 4 3 0
"FBgn0000008_18059821_18059938" "2R" 0 0 2 0
You can annotate and group the samples by creating logical vectors as follows:
samples = geneCountTable(:,3:end).Properties.VariableNames;
untreated = strncmp(samples,'untreated',length('untreated'))
treated = strncmp(samples,'treated',length('treated'))
untreated =
1 1 0 0
treated =
0 0 1 1
The included file also contains a table geneSummaryTable with the summary of assigned and
unassigned SAM entries. You can plot the basic distribution of the counting results by considering the
number of reads that are assigned to the given genomic features (exons or genes for this example), as
well as the number of reads that are unassigned (i.e. not overlapping any feature) or ambiguous (i.e.
overlapping multiple features).
st = geneSummaryTable({'Assigned','Unassigned_ambiguous','Unassigned_noFeature'},:)
bar(table2array(st)','stacked');
legend(st.Properties.RowNames','Interpreter','none','Location','southeast');
xlabel('Sample')
ylabel('Number of reads')
st =
3x4 table
Note that a small fraction of the alignment records in the SAM files is not reported in the summary
table. You can notice this in the difference between the total number of records in a SAM file and the
total number of records processed during the counting procedure for that same SAM file. These
unreported records correspond to the records mapped to reference sequences that are not annotated
in the GTF file and therefore are not processed in the counting procedure. If the gene models account
for all the reference sequences used during the read mapping step, then all records are reported in
one of the categories of the summary table.
geneSummaryTable{'TotalEntries',:} - sum(geneSummaryTable{2:end,:})
ans =
When read counting is performed without summarization using the function featurecount, the
default IDs are composed of the attribute or metafeature (by default, gene_id) followed by the start
and the stop positions of the feature (by default, exon). You can use the exon start positions to plot
the read coverage across any chromosome under consideration, for example chromosome arm 2L.
% consider chromosome arm 2L
chr2L = strcmp(exonCountTable.Reference,'2L');
exonCount = exonCountTable{:,3:end};
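Because each exon ID has the form metafeature_start_stop, the start positions can be recovered by parsing the ID text. The following sketch shows one way to do this in base MATLAB. It uses two IDs copied from the exon table preview and stores them in a cell array of character vectors, which is an assumption made for this illustration.

```matlab
% Two exon IDs in the default metafeature_start_stop format.
ids = {'FBgn0000003_2648220_2648518'; 'FBgn0000008_18024938_18025756'};

% Capture the gene ID and the two trailing numeric fields.
tok = regexp(ids,'^(.*)_(\d+)_(\d+)$','tokens','once');

geneID    = cellfun(@(t) t{1},tok,'UniformOutput',false);
exonStart = cellfun(@(t) str2double(t{2}),tok);
exonStop  = cellfun(@(t) str2double(t{3}),tok);

disp(exonStart)
```

Anchoring the pattern at the end of the ID means that a gene ID containing underscores would still be split correctly.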
Alternatively, you can plot the read coverage considering the starting position of each gene in a given
chromosome. The file pasilla_geneLength.mat contains a table with the start and stop position of
each gene in the corresponding gene annotation file.
% load gene start and stop position information
load pasilla_geneLength
head(geneLength)
gstart = geneLength{k,'Start'};
gcounts = counts(j,:);
ans =
6360
The read count in RNA-Seq data has been found to be linearly related to the abundance of transcripts
[2]. However, the read count for a given gene depends not only on the expression level of the gene,
but also on the total number of reads sequenced and the length of the gene transcript. Therefore, in
order to infer the expression level of a gene from the read count, we need to account for the
sequencing depth and the gene transcript length. One common technique to normalize the read count
is to use the RPKM (Reads Per Kilobase of transcript per Million mapped reads) values, where the read
count is normalized by the total number of reads yielded (in millions) and the length of each transcript
(in kilobases). This normalization technique, however, is not always effective because a few very highly
expressed genes can dominate the total lane count and skew the expression analysis.
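As a worked illustration of the RPKM calculation, the sketch below normalizes three made-up read counts. All numbers are hypothetical and chosen so that the arithmetic is easy to follow.

```matlab
% Hypothetical read counts, transcript lengths (bp), and library size.
counts     = [100; 500; 50];
lengthBp   = [1000; 2000; 500];
totalReads = 2e6;

% Divide by transcript length in kilobases and by library size in millions.
rpkm = counts ./ (lengthBp/1e3) ./ (totalReads/1e6);
disp(rpkm)
```

The first and third genes end up with the same RPKM even though their raw counts differ by a factor of two, because the third transcript is half as long.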
A better normalization technique consists of computing the effective library size by considering a size
factor for each sample. By dividing each sample's counts by the corresponding size factors, we bring
all the count values to a common scale, making them comparable. Intuitively, if sample A is
sequenced N times deeper than sample B, the read counts of non-differentially expressed genes are
expected to be on average N times higher in sample A than in sample B, even if there is no difference
in expression.
To estimate the size factors, take the median of the ratios of observed counts to those of a pseudo-
reference sample, whose counts can be obtained by considering the geometric mean of each gene
across all samples [3]. Then, to transform the observed counts to a common scale, divide the
observed counts in each sample by the corresponding size factor.
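The median-of-ratios estimate described above can be sketched in base MATLAB as follows. The count matrix is made up (rows are genes, columns are samples), and the geometric mean is computed with exp(mean(log(...))) so that no additional toolbox function is needed. Treat it as an illustration of the idea rather than the exact listing used in this example.

```matlab
% Hypothetical counts: 4 genes x 2 samples. Sample 2 is sequenced twice as deep.
counts = [10 20; 100 200; 50 100; 0 4];

% Pseudo-reference sample: geometric mean of each gene across samples.
pseudoRef = exp(mean(log(counts),2));

% Genes with a zero count in any sample have a zero geometric mean; skip them.
nz = pseudoRef > 0;

% Size factor of a sample: median ratio of its counts to the pseudo-reference.
ratios = counts(nz,:) ./ pseudoRef(nz);
sizeFactors = median(ratios,1)

% Bring all samples to a common scale.
normCounts = counts ./ sizeFactors
```

In this toy case the two size factors differ by a factor of two, so after normalization the first three genes have equal values in both samples.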
sizeFactors =
ans =
1.0e+03 *
You can appreciate the effect of this normalization by using the function maboxplot to represent
statistical measures such as median, quartiles, minimum and maximum.
figure;
subplot(2,1,1)
maboxplot(log2(counts),'title','Raw Read Count','orientation','horizontal')
ylabel('sample')
xlabel('log2(counts)')
subplot(2,1,2)
maboxplot(log2(normCounts),'title','Normalized Read Count','orientation','horizontal')
ylabel('sample')
xlabel('log2(counts)')
To better characterize the data, consider the mean and the dispersion of the normalized
counts. The variance of read counts is given by the sum of two terms: the variation across samples
(raw variance) and the uncertainty of measuring the expression by counting reads (shot noise, or
Poisson noise). The raw variance term dominates for highly expressed genes, whereas the shot noise
dominates for weakly expressed genes. You can plot the empirical dispersion values against the mean
of the normalized counts in a log scale as shown below.
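The per-condition mean and dispersion can be computed along these lines. The sketch uses a made-up matrix of normalized counts, and it takes the dispersion to be the standard deviation divided by the mean. That definition is an assumption for this illustration, since the text does not spell out the formula.

```matlab
% Hypothetical normalized counts: 2 genes x 4 samples
% (columns 1-2 untreated, columns 3-4 treated).
normCounts = [10 14 30 34; 200 220 90 110];
untreated = logical([1 1 0 0]);
treated   = logical([0 0 1 1]);

% Mean of each gene within each condition.
meanTreated   = mean(normCounts(:,treated),2);
meanUntreated = mean(normCounts(:,untreated),2);

% Dispersion taken here as standard deviation relative to the mean.
dispTreated   = std(normCounts(:,treated),0,2) ./ meanTreated;
dispUntreated = std(normCounts(:,untreated),0,2) ./ meanUntreated;

disp([meanTreated dispTreated])
```

With only two replicates per condition, each dispersion value rests on a single pairwise difference, which is why the empirical values scatter widely.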
loglog(meanTreated,dispTreated,'r.');
hold on;
loglog(meanUntreated,dispUntreated,'b.');
xlabel('Mean (log scale)');
ylabel('Dispersion (log scale)');
legend('Treated','Untreated','Location','southwest');
Given the small number of replicates, it is not surprising that the dispersion values scatter
with some variance around the true value. Some of this variance reflects sampling variance and some
reflects the true variability among the gene expressions of the samples.
You can look at the difference in gene expression between two conditions by calculating the fold
change (FC) for each gene, i.e. the ratio of the counts in the treated group to the counts in
the untreated group. Generally these ratios are considered in the log2 scale, so that any change is
symmetric with respect to zero (e.g. a ratio of 1/2 or 2/1 corresponds to -1 or +1 in the log scale).
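A minimal numeric illustration of the fold change and its log2 transform, using made-up mean values for three genes:

```matlab
% Hypothetical mean normalized counts for three genes.
meanTreated   = [20; 5; 10];
meanUntreated = [10; 10; 10];

% Fold change: treated relative to untreated.
foldChange = meanTreated ./ meanUntreated;

% In log2 scale, doubling and halving are symmetric around zero.
log2FC = log2(foldChange);
disp([foldChange log2FC])
```

A gene with a zero mean in one condition gives a fold change of 0 or Inf, and therefore a log2 fold change of -Inf or Inf.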
It is possible to annotate the values in the plot with the corresponding gene names, interactively
select genes, and export gene lists to the workspace by calling the mairplot function as illustrated
below:
mairplot(meanTreated,meanUntreated,'Labels',geneCountTable.ID,'Type','MA');
It is convenient to store the information about the mean value and fold change for each gene in a
table. You can then access information about a given gene or a group of genes satisfying specific
criteria by indexing the table by gene names.
% summary
summary(geneTable)
Variables:
Values:
Min 0.21206
Median 201.24
Max 2.6789e+05
Values:
Min 0
Median 201.54
Max 2.5676e+05
Values:
Min 0
Median 199.44
Max 2.7903e+05
Values:
Min 0
Median 0.99903
Max Inf
Values:
Min -Inf
Median -0.001406
Max Inf
% preview
head(geneTable)
ans =
1x5 table
ans =
1x2 table
meanBase log2FC
________ _______
ans =
4x5 table
Determining whether the gene expressions in two conditions are statistically different consists of
testing the null hypothesis that the two data samples come from distributions with equal means.
This analysis assumes the read counts are modeled according to a negative binomial distribution (as
proposed in [3]). The function rnaseqde performs this type of hypothesis testing with three possible
options to specify the type of linkage between the variance and the mean.
By specifying the link between variance and mean as an identity, we assume the variance is equal to
the mean, and the counts are modeled by the Poisson distribution [4]. "IDColumns" specifies columns
from the input table to append to the output table to help keep data organized.
diffTableIdentity = rnaseqde(geneCountTable,["untreated3","untreated4"],["treated2","treated3"],V
Alternatively, by specifying the variance as the sum of the shot noise term (i.e. mean) and a constant
multiplied by the squared mean, the counts are modeled according to a distribution described in [5].
The constant term is estimated using all the rows in the data.
diffTableConstant = rnaseqde(geneCountTable,["untreated3","untreated4"],["treated2","treated3"],V
Finally, by considering the variance as the sum of the shot noise term (i.e. mean) and a locally
regressed non-parametric smooth function of the mean, the counts are modeled according to the
distribution proposed in [3].
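The difference between the first two variance models can be made concrete with a few numbers. The sketch below evaluates the identity link (variance equal to the mean) and the constant link (mean plus a constant times the squared mean) for made-up mean values. The constant c is hypothetical, and the third model is omitted because its smooth function is estimated from the data.

```matlab
% Hypothetical mean counts spanning several orders of magnitude.
mu = [1 10 100 1000];
c  = 0.1;                       % hypothetical constant, for illustration only

% Identity link: variance equals the mean (Poisson).
varIdentity = mu;

% Constant link: shot noise term plus a constant times the squared mean.
varConstant = mu + c*mu.^2;

% Ratio of the two models, which equals 1 + c*mu.
overdispersion = varConstant ./ varIdentity;
disp(overdispersion)
```

The ratio grows with the mean, so the two models agree for weakly expressed genes and diverge for highly expressed ones.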
The output of rnaseqde includes a vector of P-values. A P-value indicates the probability that a
change in expression as strong as the one observed (or even stronger) would occur under the null
hypothesis, i.e. the conditions have no effect on gene expression. In the histogram of the P-values we
observe an enrichment of low values (due to differentially expressed genes), whereas other values are
uniformly spread (due to non-differentially expressed genes). The enrichment of values equal to 1 is
due to genes with very low counts.
figure;
histogram(diffTableLocal.PValue,100)
xlabel('P-value')
ylabel('Frequency')
title('P-value enrichment')
Filter out those genes with relatively low count to observe a more uniform spread of non-significant P-
values across the range (0,1]. Note that this does not affect the distribution of significant P-values.
lowCountThreshold = 10;
lowCountGenes = all(counts < lowCountThreshold, 2);
histogram(diffTableLocal.PValue(~lowCountGenes),100)
xlabel('P-value')
ylabel('Frequency')
title('P-value enrichment without low count genes')
Thresholding P-values to determine which fold changes are more significant than others is not
appropriate for this type of data analysis, due to the multiple testing problem. When performing a
large number of simultaneous tests, the probability of getting a significant result simply due to
chance increases with the number of tests. In order to account for multiple testing, perform a
correction (or adjustment) of the P-values so that the probability of observing at least one significant
result due to chance remains below the desired significance level.
The Benjamini-Hochberg adjustment [6] is a statistical method that provides an adjusted P-value
answering the following question: what would be the fraction of false positives if all the genes with
adjusted P-values below a given threshold were considered significant?
The output of rnaseqde includes a vector of adjusted P-values in the "AdjustedPValue" field. By
default, the P-values are adjusted using the Benjamini-Hochberg adjustment. Alternatively, the
"FDRMethod" Name-Value argument in rnaseqde can be set to "storey" to perform Storey's
procedure [7].
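For intuition, the Benjamini-Hochberg adjustment can be written in a few lines of base MATLAB. The sketch below is a generic implementation of the published step-up procedure applied to four made-up P-values. It is not the code that rnaseqde runs internally.

```matlab
% Hypothetical raw P-values.
p = [0.01 0.04 0.03 0.20];
m = numel(p);

% Sort ascending and scale each P-value by m divided by its rank.
[pSorted,order] = sort(p);
scaled = pSorted .* m ./ (1:m);

% Enforce monotonicity: take the running minimum from the largest rank down.
adjSorted = fliplr(cummin(fliplr(scaled)));
adjSorted = min(adjSorted,1);   % adjusted values cannot exceed 1

% Return the adjusted values in the original order.
pAdj = zeros(size(p));
pAdj(order) = adjSorted;
disp(pAdj)
```

The running minimum is what makes the second and third adjusted values equal here: the raw scaling alone would give the smaller P-value the larger adjusted value.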
Set a threshold of 0.1 for the adjusted P-values, which is equivalent to accepting a false discovery
rate of 10%, and identify the genes that are significantly differentially expressed by selecting all the
genes with adjusted P-values below this threshold.
% create a table with significant genes
sig = diffTableLocal.AdjustedPValue < 0.1;
diffTableLocalSig = diffTableLocal(sig,:);
diffTableLocalSig = sortrows(diffTableLocalSig,'AdjustedPValue');
numberSigGenes = size(diffTableLocalSig,1)
numberSigGenes =
1904
You can now identify the most up-regulated or down-regulated genes by considering an absolute fold
change above a chosen cutoff. For example, a cutoff of 1 in log2 scale yields the list of genes that are
up-regulated with at least a 2-fold change.
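The selection can be sketched on a small made-up table. The variable names Log2FoldChange and AdjustedPValue follow the fields used elsewhere in this example; the gene IDs and values are hypothetical.

```matlab
% Hypothetical results table for five genes.
ID = {'geneA';'geneB';'geneC';'geneD';'geneE'};
Log2FoldChange = [2.3; -1.5; 0.4; -0.2; 1.1];
AdjustedPValue = [0.001; 0.02; 0.03; 0.5; 0.08];
T = table(ID,Log2FoldChange,AdjustedPValue);

% Keep genes that pass the adjusted P-value threshold.
sig = T.AdjustedPValue < 0.1;

% Apply the log2 fold change cutoff in each direction.
cutoff = 1;
up   = T(sig & T.Log2FoldChange >=  cutoff,:);
down = T(sig & T.Log2FoldChange <= -cutoff,:);

% List the strongest up-regulated genes first.
up = sortrows(up,'Log2FoldChange','descend')
```

Combining both conditions in one logical index keeps the significance filter and the fold change filter visible side by side.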
numberSigGenesUp =
129
top10GenesUp =
10x6 table
numberSigGenesDown =
181
top10GenesDown =
10x6 table
A good visualization of the gene expressions and their significance is given by plotting the fold
change versus the mean in log scale and coloring the data points according to the adjusted P-values.
figure
scatter(log2(geneTable.meanBase),diffTableLocal.Log2FoldChange,3,diffTableLocal.AdjustedPValue,'o')
colormap(flipud(cool(256)))
colorbar;
ylabel('log2(Fold Change)')
xlabel('log2(Mean of normalized counts)')
title('Fold change by FDR')
You can see here that for weakly expressed genes (i.e. those with low means), the FDR is generally
high because low read counts are dominated by Poisson noise and consequently any biological
variability is drowned in the uncertainties from the read counting.
References
[1] Brooks et al. Conservation of an RNA regulatory map between Drosophila and mammals. Genome
Research 2011. 21:193-202.
[2] Mortazavi et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature
Methods 2008. 5:621-628.
[3] Anders et al. Differential expression analysis for sequence count data. Genome Biology 2010.
11:R106.
[4] Marioni et al. RNA-Seq: An assessment of technical reproducibility and comparison with gene
expression arrays. Genome Research 2008. 18:1509-1517.
[5] Robinson et al. Moderated statistical test for assessing differences in tag abundance.
Bioinformatics 2007. 23(21):2881-2887.
[6] Benjamini et al. Controlling the false discovery rate: a practical and powerful approach to multiple
testing. Journal of the Royal Statistical Society, Series B 1995. 57(1):289-300.
[7] Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society,
Series B 2002. 64(3):479-498.
See Also
featurecount | nbintest | mairplot | plotVarianceLink
More About
• “High-Throughput Sequencing”
Visualize NGS Data Using Genomics Viewer App
The first part of this example gives a brief overview of the app and supported file formats. The second
part of the example explores a single nucleotide variation in the cytochrome p450 gene (CYP2C19).
By default, the app loads Human (GRCh38/hg38) as the reference sequence and Refseq Genes as the
annotation file. There are two main panels in the app. The left panel is the Tracks panel and the right
panel is the embedded IGV web application. The Tracks panel is a read-only area displaying the track
names, source file names, and track types. The Tracks panel updates accordingly as you configure
the tracks in the embedded IGV app.
The Reset button restores the app to the default view with two tracks (HG38 with Refseq Genes) and
removes any other existing tracks. Before resetting, you can save the current view as a session
(.json) file and restore it later.
You can import a single reference sequence. The reference sequence must be in a FASTA file. Select
Import Reference on the Home tab. You can also import a corresponding cytoband file that contains
cytogenetic G-banding data. You can add local files or specify external URLs. The URL must start with
either https or gs. Other file transfer protocols, such as ftp, are not supported.
You can import multiple data sets of sequence read alignment data. The alignment data must be a
BAM or CRAM file. It is not required that you have the corresponding index file (.BAI or .CRAI) in
the same location as your BAM or CRAM file. However, the absence of the index file will make the
app slower.
You can add read alignment files using Add tracks from file and Add tracks from URL options
from the Add tracks button. If you are specifying a URL, the URL must start with either https or gs.
Other file transfer protocols, such as ftp, are not supported.
You can import multiple sets of feature annotations from several files that contain data for a single
reference sequence. The supported annotation files are: .BED, .GFF, .GFF3, and .GTF.
You can also import structural variants (.VCF) and visualize genetic alterations, such as insertions
and deletions.
You can view segmented copy number data (.SEG) and quantitative genomic data (.WIG, .BIGWIG,
and .BEDGRAPH), such as ChIP peaks and alignment coverage.
You can add annotation and genomic data files using Add tracks from file and Add tracks from
URL options from the Add tracks button. If you are specifying a URL, the URL must start with either
https or gs. Other file transfer protocols, such as FTP, are not supported.
For the purposes of this example, start with a session file that has some preloaded tracks. To load the
file, click Open. Navigate to matlabroot\examples\bioinfo\data, where matlabroot is the
folder where you have installed MATLAB. Select rs4986893.json.
The low coverage alignment data comes from a female Han Chinese from Beijing, China. The sample
ID is NA18564 and the sample has been identified with the CYP2C19*3 mutation [4].
In this session file, the alignment data has been centered around the location of the mutation on the
CYP2C19 gene.
1 Click the orange bar in the coverage area to look at the position and allele distribution
information.
It shows that 71% of the reads have G while 29% have A at the location chr10:94,780,653. This
is low coverage data and may not show all the occurrences of this mutation. High
coverage data is explored later in the example.
For details on the embedded IGV app and its available options, visit here.
You can look at the high coverage data from the same sample to see the occurrences of this mutation.
The high coverage data appears as track3. You can now see many occurrences of the mutation in
several reads.
6 Click the orange bar in the coverage area to see the allele distribution. It shows that G is
replaced by A in almost 50% of the reads.
References
[1] Robinson, J., H. Thorvaldsdóttir, W. Winckler, M. Guttman, E. Lander, G. Getz, J. Mesirov. 2011.
Integrative Genomics Viewer. Nature Biotechnology. 29:24–26.
[2] Thorvaldsdóttir, H., J. Robinson, J. Mesirov. 2013. Integrative Genomics Viewer (IGV): High-
performance genomics data visualization and exploration. Briefings in Bioinformatics.
14:178–192.
[3] https://medlineplus.gov/genetics/gene/cyp2c19/
[4] https://www.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=NA18564&Product=DNA
See Also
Genomics Viewer | Sequence Alignment | Sequence Viewer
Exploring Genome-Wide Differences in DNA Methylation Profiles
This example shows how to perform a genome-wide analysis of DNA methylation in the human genome by
using genome sequencing.
Note: For enhanced performance, MathWorks recommends that you run this example on a 64-bit
platform, because the memory footprint is close to 2 GB. On a 32-bit platform, if you receive "Out of
memory" errors when running this example, try increasing the virtual memory (or swap space) of
your operating system or try setting the 3GB switch (32-bit Windows® XP only). For details, see
“Resolve “Out of Memory” Errors”.
Introduction
DNA methylation is an epigenetic modification that modulates gene expression and the maintenance
of genomic organization in normal and disease processes. DNA methylation can define different
states of the cell, and it is inheritable during cell replication. Aberrant DNA methylation patterns
have been associated with cancer and tumor suppressor genes.
In this example you will explore the DNA methylation profiles of two human cancer cells: parental
HCT116 colon cancer cells and DICERex5 cells. DICERex5 cells are derived from HCT116 cells after
the truncation of the DICER1 alleles. Serre et al. [1] proposed studying DNA methylation profiles by
using the MBD2 protein as a methyl-CpG binding domain, followed by high-throughput
sequencing (HTseq). This technique is commonly known as MBD-Seq. Short reads for two replicates
of the two samples have been submitted to NCBI's SRA archive by the authors of [1]. Other
technologies are available to interrogate the DNA methylation status of CpG sites in combination with
HTseq, for example MeDIP-seq or the use of restriction enzymes. You can also analyze these types of
data sets by following the approach presented in this example.
Data Sets
You can obtain the unmapped single-end reads for four sequencing experiments from NCBI. Short
reads were produced using Illumina®'s Genome Analyzer II. The average insert size is 120 bp, and the
length of the short reads is 36 bp.
This example assumes that you:
(1) downloaded the unmapped short reads for the four sequencing experiments from NCBI,
(2) produced SAM-formatted files by mapping the short reads to the reference human genome (NCBI
Build 37.5) using the Bowtie [2] algorithm, reporting only uniquely mapped reads, and
(3) compressed the SAM-formatted files to BAM and ordered them by reference name first, then by
genomic position, by using SAMtools [3].
This example also assumes that you downloaded the reference human genome (GRCh37.p5). You can
use the bowtie-inspect command to reconstruct the human reference directly from the bowtie
indices. Or you may download the reference from the NCBI repository by uncommenting the
following line:
% getgenbank('NC_000009','FileFormat','fasta','tofile','hsch9.fasta');
To explore the signal coverage of the HCT116 samples you need to construct a BioMap. BioMap has
an interface that provides direct access to the mapped short reads stored in the BAM-formatted file,
thus minimizing the amount of data that is actually loaded into memory. Use the function baminfo to
obtain a list of the existing references and the actual number of short reads mapped to each one.
info = baminfo('SRR030224.bam','ScanDictionary',true);
fprintf('%-35s%s\n','Reference','Number of Reads');
for i = 1:numel(info.ScannedDictionary)
    fprintf('%-35s%d\n',info.ScannedDictionary{i},...
        info.ScannedDictionaryCount(i));
end
In this example you will focus on the analysis of chromosome 9. Create a BioMap for the two HCT116
sample replicates.
bm_hct116_1 = BioMap('SRR030224.bam','SelectRef','gi|224589821|ref|NC_000009.11|')
bm_hct116_2 = BioMap('SRR030225.bam','SelectRef','gi|224589821|ref|NC_000009.11|')
bm_hct116_1 =
SequenceDictionary: 'gi|224589821|ref|NC_000009.11|'
Reference: [106189x1 File indexed property]
Signature: [106189x1 File indexed property]
bm_hct116_2 =
SequenceDictionary: 'gi|224589821|ref|NC_000009.11|'
Reference: [107586x1 File indexed property]
Signature: [107586x1 File indexed property]
Start: [107586x1 File indexed property]
MappingQuality: [107586x1 File indexed property]
Flag: [107586x1 File indexed property]
MatePosition: [107586x1 File indexed property]
Quality: [107586x1 File indexed property]
Sequence: [107586x1 File indexed property]
Header: [107586x1 File indexed property]
NSeqs: 107586
Name: ''
Using a binning algorithm provided by the getBaseCoverage method, you can plot the coverage of
both replicates for an initial inspection. For reference, you can also add the ideogram for the human
chromosome 9 to the plot using the chromosomeplot function.
figure
ha = gca;
hold on
n = 141213431; % length of chromosome 9
[cov,bin] = getBaseCoverage(bm_hct116_1,1,n,'binWidth',100);
h1 = plot(bin,cov,'b'); % plots the binned coverage of bm_hct116_1
[cov,bin] = getBaseCoverage(bm_hct116_2,1,n,'binWidth',100);
h2 = plot(bin,cov,'g'); % plots the binned coverage of bm_hct116_2
chromosomeplot('hs_cytoBand.txt', 9, 'AddToPlot', ha) % plots an ideogram along the x-axis
axis(ha,[1 n 0 100]) % zooms-in the y-axis
fixGenomicPositionLabels(ha) % formats tick labels and adds datacursors
legend([h1 h2],'HCT116-1','HCT116-2','Location','NorthEast')
ylabel('Coverage')
title('Coverage for two replicates of the HCT116 sample')
fig = gcf;
fig.Position = max(fig.Position,[0 0 900 0]); % resize window
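To make the idea of a coverage signal concrete, the following is a minimal sketch of how per-base coverage can be derived from read start and stop positions. It uses a small set of made-up reads and only base MATLAB functions. It illustrates the concept and is not a description of how getBaseCoverage is implemented; the variable names starts, stops, and baseCov are assumptions introduced for this sketch.

```matlab
% Toy alignments: start and stop positions (1-based, inclusive) of five reads.
starts = [1 3 3 8 10];
stops  = [4 6 5 9 12];
n = 12;                                   % length of the toy reference

% Difference array: +1 where a read begins, -1 just past where it ends.
delta = accumarray(starts(:), 1, [n+1 1]) - accumarray(stops(:)+1, 1, [n+1 1]);

% The running sum of the difference array is the number of reads covering each base.
baseCov = cumsum(delta(1:n))';
disp(baseCov)
```

Each read contributes one unit of coverage to every base between its start and stop positions, so regions where many reads pile up appear as peaks in baseCov.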
Because short reads represent the methylated regions of the DNA, there is a correlation between
aligned coverage and DNA methylation. Observe the increased DNA methylation close to the
chromosome telomeres; it is known that there is an association between DNA methylation and the
role of telomeres for maintaining the integrity of the chromosomes. In the coverage plot you can also
see a long gap over the chromosome centromere. This is due to the repetitive sequences present in
the centromere, which prevent us from aligning short reads to a unique position in this region. In the
data sets used in this example only about 30% of the short reads were uniquely mapped to the
reference genome.
DNA methylation normally occurs in CpG dinucleotides. Alteration of the DNA methylation patterns
can lead to transcriptional silencing, especially in the gene promoter CpG islands. It is also
known that DNA methylation can block CTCF binding and can silence miRNA transcription, among
other relevant functions. In general, mapped reads are expected to align preferentially to CpG-rich
regions.
Load the human chromosome 9 from the reference file hs37.fasta. For this example, it is assumed
that you recovered the reference from the Bowtie indices using the bowtie-inspect command;
therefore hs37.fasta contains all the human chromosomes. To load only chromosome 9, you can
use the name-value pair BLOCKREAD with the fastaread function.
chr9 = fastaread('hs37.fasta','blockread',9);
chr9.Header
ans =
Use the cpgisland function to find the CpG clusters. Using the standard definition for CpG islands
[4] (islands of 200 bp or more with a CpG observed/expected ratio of 60% or greater) leads to 1682 CpG
islands found in chromosome 9.
cpgi = cpgisland(chr9.Sequence)
cpgi =
Starts: [10783 29188 73049 73686 113309 114488 116877 117469 117987 … ]
Stops: [11319 29409 73624 73893 114336 114809 117105 117985 118203 … ]
Use the getCounts method to calculate the ratio of aligned bases that are inside CpG islands. For
the first replicate of the sample HCT116, the ratio is close to 45%.
aligned_bases_in_CpG_islands = getCounts(bm_hct116_1,cpgi.Starts,cpgi.Stops,'method','sum')
aligned_bases_total = getCounts(bm_hct116_1,1,n,'method','sum')
ratio = aligned_bases_in_CpG_islands ./ aligned_bases_total
aligned_bases_in_CpG_islands =
1724363
aligned_bases_total =
3822804
ratio =
0.4511
You can explore high resolution coverage plots of the two sample replicates and observe how the
signal correlates with the CpG islands. For example, explore the region between 23,820,000 and
23,830,000 bp. This is the 5' region of the human gene ELAVL2.
To find regions that contain more mapped reads than would be expected by chance, you can follow a
similar approach to the one described by Serre et al. [1]. The number of counts for non-overlapping
contiguous 100 bp windows is statistically modeled.
First, use the getCounts method to count the number of mapped reads that start in each window. In
this example you use a binning approach that considers only the start position of every mapped read,
following the approach of Serre et al. However, you may also use the OVERLAP and METHOD name-
value pairs in getCounts to compute more accurate statistics. For instance, to obtain the maximum
coverage for each window considering base pair resolution, set OVERLAP to 1 and METHOD to MAX.
n = numel(chr9.Sequence); % length of chromosome
w = 1:100:n; % windows of 100 bp
counts_1 = getCounts(bm_hct116_1,w,w+99,'independent',true,'overlap','start');
counts_2 = getCounts(bm_hct116_2,w,w+99,'independent',true,'overlap','start');
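Counting reads by their start position in fixed 100-bp windows amounts to a simple histogram. The following minimal sketch shows the idea on made-up start positions with base MATLAB only. It is meant to illustrate what the counts represent, under the assumption that a read belongs to the window containing its start position; it is not the implementation of getCounts, and readStart and idx are names introduced here.

```matlab
n = 1000;                                     % toy chromosome length
readStart = [5 17 99 100 101 250 251 999];    % toy read start positions
winSize = 100;
w = 1:winSize:n;                              % window start positions

idx = floor((readStart - 1) / winSize) + 1;   % window index for each read start
counts = accumarray(idx(:), 1, [numel(w) 1]); % number of reads starting in each window
disp(counts')
```

Here the first window spans positions 1 to 100, so the reads starting at 5, 17, 99, and 100 are all counted in it.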
First, try to model the counts assuming that all the windows with counts are biologically significant
and therefore from the same distribution. Use the negative binomial distribution to fit a model to the
count data.
nbp = nbinfit(counts_1);
figure
hold on
emphist = histc(counts_1,0:100); % calculate the empirical distribution
bar(0:100,emphist./sum(emphist),'c','grouped') % plot histogram
plot(0:100,nbinpdf(0:100,nbp(1),nbp(2)),'b','linewidth',2); % plot fitted model
axis([0 50 0 .001])
legend('Empirical Distribution','Negative Binomial Fit')
ylabel('Frequency')
xlabel('Counts')
title('Frequency of counts, 100bp windows (HCT116-1)')
The poor fit indicates that the observed distribution may be due to the mixture of two models, one
that represents the background and one that represents the count data in methylated DNA windows.
A more realistic scenario would be to assume that windows with a small number of mapped reads are
mainly the background (or null model). Serre et al. assumed that 100-bp windows containing four or
more reads are unlikely to be generated by chance. To estimate a good approximation to the null
model, you can fit the left body of the empirical distribution to a truncated negative binomial
distribution. To fit a truncated distribution use the mle function. First you need to define an
anonymous function that defines the right-truncated version of nbinpdf.
Before fitting the real data, let us assess the fitting procedure with some sampled data from a known
distribution.
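The two steps just described (defining a right-truncated negative binomial fit and checking it on simulated data) could be sketched as follows. This assumes Statistics and Machine Learning Toolbox is available. The name rtnbinfit and its (data, threshold) calling form follow the way the function is used later in this example; the name rtnbinpdf, the parameter values in nbp, the sample size, and the truncation threshold are assumptions chosen for illustration.

```matlab
% Right-truncated negative binomial pdf: the ordinary pdf renormalized over 0..t-1.
rtnbinpdf = @(x,p1,p2,t) nbinpdf(x,p1,p2) ./ nbincdf(t-1,p1,p2);

% Maximum likelihood fit of the truncated model, started from the untruncated fit.
rtnbinfit = @(x,t) mle(x,'pdf',@(x,p1,p2) rtnbinpdf(x,p1,p2,t), ...
    'start',nbinfit(x),'lower',[0 0]);

rng(0)                                   % make the simulated sample reproducible
nbp = [0.5 0.2];                         % assumed "true" parameters (R and p)
x = nbinrnd(nbp(1),nbp(2),10000,1);      % simulated counts
trun = 6;                                % assumed truncation threshold

nbphat1 = nbinfit(x);                    % untruncated model fitted to all data
nbphat2 = nbinfit(x(x<trun));            % untruncated model fitted to truncated data
nbphat3 = rtnbinfit(x(x<trun),trun);     % truncated model fitted to truncated data
```

The variables x, nbphat1, nbphat2, and nbphat3 are named to match the plotting code that follows, so the three fits can be compared against the empirical distribution of the simulated sample.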
figure
hold on
emphist = histc(x,0:100); % Calculate the empirical distribution
bar(0:100,emphist./sum(emphist),'c','grouped') % plot histogram
h1 = plot(0:100,nbinpdf(0:100,nbphat1(1),nbphat1(2)),'b-o','linewidth',2);
h2 = plot(0:100,nbinpdf(0:100,nbphat2(1),nbphat2(2)),'r','linewidth',2);
h3 = plot(0:100,nbinpdf(0:100,nbphat3(1),nbphat3(2)),'g','linewidth',2);
axis([0 25 0 .2])
legend([h1 h2 h3],'Neg-binomial fitted to all data',...
'Neg-binomial fitted to truncated data',...
'Truncated neg-binomial fitted to truncated data')
ylabel('Frequency')
xlabel('Counts')
For the two replicates of the HCT116 sample, fit a right-truncated negative binomial distribution to
the observed null model using the rtnbinfit anonymous function previously defined.
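pn1 = rtnbinfit(counts_1(counts_1<4),4); % fit to HCT116-1 counts below the four-read threshold of [1]
pn2 = rtnbinfit(counts_2(counts_2<4),4); % fit to HCT116-2 counts below the four-read threshold of [1]
pval1 = 1 - nbincdf(counts_1,pn1(1),pn1(2)); % p-value of each window under the null model
pval2 = 1 - nbincdf(counts_2,pn2(1),pn2(2));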
Calculate the false discovery rate using the mafdr function. Use the name-value pair BHFDR to use
the linear step-up (LSU) procedure [6] to calculate the FDR-adjusted p-values. Setting the FDR <
0.01 permits you to identify the 100-bp windows that are significantly methylated.
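fdr1 = mafdr(pval1,'bhfdr',true);
fdr2 = mafdr(pval2,'bhfdr',true);
w1 = fdr1<.01; % logical vector indicating significant windows in HCT116-1
w2 = fdr2<.01; % logical vector indicating significant windows in HCT116-2
w12 = w1 & w2; % logical vector indicating significant windows in both replicates
Number_of_sig_windows_HCT116_1 = sum(w1)
Number_of_sig_windows_HCT116_2 = sum(w2)
Number_of_sig_windows_HCT116 = sum(w12)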
Number_of_sig_windows_HCT116_1 =
1662
Number_of_sig_windows_HCT116_2 =
1674
Number_of_sig_windows_HCT116 =
1346
Overall, you identified 1662 and 1674 non-overlapping 100-bp windows in the two replicates of the
HCT116 samples, which indicates there is significant evidence of DNA methylation. There are 1346
windows that are significant in both replicates.
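For readers who want to see what the linear step-up adjustment of [6] does to a list of p-values, here is a minimal sketch in base MATLAB on four made-up p-values. It follows the textbook Benjamini-Hochberg procedure and is expected to agree with the BHFDR option in spirit, but it has not been checked against mafdr; all variable names are introduced for this sketch.

```matlab
p = [0.01 0.04 0.03 0.005];              % toy p-values
m = numel(p);                            % number of tests

[ps, order] = sort(p);                   % p-values in ascending order
adj = ps .* m ./ (1:m);                  % scale the i-th smallest p-value by m/i
adj = min(cummin(adj, 'reverse'), 1);    % enforce monotonicity and cap at 1

fdr = zeros(size(p));
fdr(order) = adj;                        % return to the original order
sig = fdr < 0.03;                        % tests kept at an FDR threshold of 0.03
disp(fdr)
```

In this toy case the two smallest p-values receive an adjusted value of 0.02 and the other two receive 0.04.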
For example, looking again in the promoter region of the ELAVL2 human gene you can observe that
in both sample replicates, multiple 100-bp windows have been marked significant.
DNA methylation is involved in the modulation of gene expression. For instance, it is well known that
hypermethylation is associated with the inactivation of several tumor suppressor genes. You can
study in this data set the methylation of gene promoter regions.
First, download from Ensembl a tab-separated-value (TSV) table with all protein-coding genes to a
text file, ensemblmart_genes_hum37.txt. This example uses Ensembl release 64.
Using Ensembl's BioMart service, you can select a table with the following attributes: chromosome
name, gene biotype, gene name, gene start/end, and strand direction.
Use the provided helper function ensemblmart2gff to convert the downloaded TSV file to a
GFF-formatted file. Then use GFFAnnotation to load the file into MATLAB and create a subset with the
genes present in chromosome 9 only. This results in 800 annotated protein-coding genes in the Ensembl
database.
GFFfilename = ensemblmart2gff('ensemblmart_genes_hum37.txt');
a = GFFAnnotation(GFFfilename)
a9 = getSubset(a,'reference','9')
numGenes = a9.NumEntries
a =
a9 =
numGenes =
800
Find the promoter regions for each gene. In this example, the proximal promoter is taken to be the
-500/+100 region relative to the start of each gene.
downstream = 500;
upstream = 100;
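Because genes can lie on either strand, the promoter window has to be placed relative to the end of the gene where transcription begins. The following is a minimal sketch of that calculation on two made-up genes, using base MATLAB only. The variables geneStart, geneStop, strand, up, and down are assumptions for this sketch (in the example itself the corresponding values would come from the a9 annotation object), and the output names promoterStart and promoterStop match those used in the code that follows.

```matlab
geneStart = [1000 5000];                 % toy gene coordinates
geneStop  = [2000 6000];
strand    = {'+','-'};                   % one forward-strand and one reverse-strand gene
up   = 500;                              % bases upstream of the gene start
down = 100;                              % bases downstream of the gene start

isFwd = strcmp(strand,'+');
promoterStart = zeros(size(geneStart));
promoterStop  = zeros(size(geneStart));

% Forward strand: transcription begins at geneStart and runs toward higher coordinates.
promoterStart(isFwd)  = geneStart(isFwd) - up;
promoterStop(isFwd)   = geneStart(isFwd) + down;

% Reverse strand: transcription begins at geneStop and runs toward lower coordinates.
promoterStart(~isFwd) = geneStop(~isFwd) - down;
promoterStop(~isFwd)  = geneStop(~isFwd) + up;

disp([promoterStart; promoterStop])
```

For the reverse-strand gene, the window is mirrored so that the longer 500-bp arm still lies upstream of the transcription start.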
Use a dataset as a container for the promoter information, as we can later add new columns to
store gene counts and p-values.
promoters = dataset({a9.Feature,'Gene'});
promoters.Strand = char(a9.Strand);
promoters.Start = promoterStart';
promoters.Stop = promoterStop';
Find genes with significant DNA methylation in the promoter region by looking at the number of
mapped short reads that overlap at least one base pair in the defined promoter region.
promoters.Counts_1 = getCounts(bm_hct116_1,promoters.Start,promoters.Stop,...
'overlap',1,'independent',true);
promoters.Counts_2 = getCounts(bm_hct116_2,promoters.Start,promoters.Stop,...
'overlap',1,'independent',true);
Fit a null distribution for each sample replicate and compute the p-values:
Number_of_sig_promoters = sum(promoters.pval_1<.01 & promoters.pval_2<.01)
Ratio_of_sig_methylated_promoters = Number_of_sig_promoters./numGenes
Number_of_sig_promoters =
74
Ratio_of_sig_methylated_promoters =
0.0925
Observe that only 74 (out of 800) genes in chromosome 9 have significantly methylated promoter regions
(pval<0.01 in both replicates). Display a report of the 30 genes with the most significantly methylated
promoter regions.
[~,order] = sort(promoters.pval_1.*promoters.pval_2);
promoters(order(1:30),[1 2 3 4 5 7 6 8])
ans =
Serre et al. [1] reported that, in these data sets, approximately 90% of the uniquely mapped reads fall
outside the 5' gene promoter regions. Using a similar approach as before, you can find genes that
have intergenic methylated regions. To compensate for the varying lengths of the genes, you can use
the maximum coverage, computed base-by-base, instead of the raw number of mapped short reads.
An alternative way to normalize the counts by the gene length is to set the METHOD name-value pair
to rpkm in the getCounts function.
intergenic = dataset({a9.Feature,'Gene'});
intergenic.Strand = char(a9.Strand);
intergenic.Start = a9.Start;
intergenic.Stop = a9.Stop;
intergenic.Counts_1 = getCounts(bm_hct116_1,intergenic.Start,intergenic.Stop,...
'overlap','full','method','max','independent',true);
intergenic.Counts_2 = getCounts(bm_hct116_2,intergenic.Start,intergenic.Stop,...
'overlap','full','method','max','independent',true);
trun = 10; % Set a truncation threshold
pn1 = rtnbinfit(intergenic.Counts_1(intergenic.Counts_1<trun),trun); % Fit to HCT116-1 intergenic
pn2 = rtnbinfit(intergenic.Counts_2(intergenic.Counts_2<trun),trun); % Fit to HCT116-2 intergenic
intergenic.pval_1 = 1 - nbincdf(intergenic.Counts_1,pn1(1),pn1(2)); % p-value for every intergenic region
intergenic.pval_2 = 1 - nbincdf(intergenic.Counts_2,pn2(1),pn2(2)); % p-value for every intergenic region
Number_of_sig_genes = sum(intergenic.pval_1<.01 & intergenic.pval_2<.01)
Ratio_of_sig_methylated_genes = Number_of_sig_genes./numGenes
[~,order] = sort(intergenic.pval_1.*intergenic.pval_2);
intergenic(order(1:30),[1 2 3 4 5 7 6 8])
Number_of_sig_genes =
62
Ratio_of_sig_methylated_genes =
0.0775
ans =
pval_1 Counts_2 pval_2
6.4307e-08 45 6.7163e-07
5.585e-07 49 1.8188e-07
6.4307e-08 42 1.7861e-06
1.4079e-06 51 9.4566e-08
4.1027e-07 46 4.8461e-07
2.2131e-07 42 1.7861e-06
2.6058e-06 43 1.2894e-06
4.1027e-07 36 1.2564e-05
1.4079e-06 39 4.7417e-06
1.9155e-06 36 1.2564e-05
1.9155e-06 35 1.7377e-05
4.8199e-06 37 9.0815e-06
6.5537e-06 37 9.0815e-06
1.0346e-06 31 6.3417e-05
3.0371e-05 41 2.4736e-06
2.2358e-05 40 3.4251e-06
4.1245e-05 41 2.4736e-06
2.2358e-05 38 6.5629e-06
2.2358e-05 37 9.0815e-06
For instance, explore the methylation profile of the BARX1 gene, the sixth significant gene with
intergenic methylation in the previous list. The GTF formatted file ensemblmart_barx1.gtf
contains structural information for this gene obtained from Ensembl using the BioMart service.
Use GTFAnnotation to load the structural information into MATLAB. There are two annotated
transcripts for this gene.
barx1 = GTFAnnotation('ensemblmart_barx1.gtf')
transcripts = getTranscriptNames(barx1)
barx1 =
transcripts =
{'ENST00000253968'}
{'ENST00000401724'}
Plot the DNA methylation profile for both HCT116 sample replicates with base-pair resolution.
Overlay the CpG islands and plot the exons for each of the two transcripts along the bottom of the
plot.
range = barx1.getRange;
r1 = range(1)-1000; % set the region limits
r2 = range(2)+1000;
figure
hold on
% plot high-resolution coverage of bm_hct116_1
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
% plot high-resolution coverage of bm_hct116_2
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');
Observe the highly methylated region in the 5' promoter region (right-most CpG island). Recall that
for this gene transcription occurs in the reverse strand. More interestingly, observe the highly
methylated regions that overlap the initiation of each of the two annotated transcripts (two middle
CpG islands).
In the study by Serre et al., another cell line is also analyzed. New cells (DICERex5) are derived from
the same HCT116 colon cancer cells after truncating the DICER1 alleles. It has been reported in the
literature [5] that there is a localized change of DNA methylation at a small number of gene promoters.
In this example, you will find significant 100-bp windows in two sample replicates of the DICERex5
cells following the same approach as for the parental HCT116 cells, and then you will search for
statistically significant differences between the two cell lines.
The helper function getWindowCounts encapsulates the steps used before to find windows with
significant coverage. getWindowCounts returns vectors with counts, p-values, and false discovery
rates for each new replicate.
bm_dicer_1 = BioMap('SRR030222.bam','SelectRef','gi|224589821|ref|NC_000009.11|');
bm_dicer_2 = BioMap('SRR030223.bam','SelectRef','gi|224589821|ref|NC_000009.11|');
[counts_3,pval3,fdr3] = getWindowCounts(bm_dicer_1,4,w,100);
[counts_4,pval4,fdr4] = getWindowCounts(bm_dicer_2,4,w,100);
w3 = fdr3<.01; % logical vector indicating significant windows in DICERex5-1
w4 = fdr4<.01; % logical vector indicating significant windows in DICERex5-2
w34 = w3 & w4; % logical vector indicating significant windows in both replicates
Number_of_sig_windows_DICERex5_1 = sum(w3)
Number_of_sig_windows_DICERex5_2 = sum(w4)
Number_of_sig_windows_DICERex5 = sum(w34)
Number_of_sig_windows_DICERex5_1 =
908
Number_of_sig_windows_DICERex5_2 =
1041
Number_of_sig_windows_DICERex5 =
759
To perform a differential analysis you use the 100-bp windows that are significant in at least one of
the samples (either HCT116 or DICERex5).
wd = w34 | w12; % logical vector indicating windows included in the diff. analysis
ws = w(wd); % start position of each window included in the diff. analysis
Use the function manorm to normalize the data. The PERCENTILE name-value pair lets you filter out
windows with very large number of counts while normalizing, since these windows are mainly due to
artifacts, such as repetitive regions in the reference chromosome.
counts = [counts_1(wd) counts_2(wd) counts_3(wd) counts_4(wd)]; % HCT116 and DICERex5 replicates
counts_norm = round(manorm(counts,'percentile',90).*100);
Use the function mattest to perform a two-sample t-test to identify differentially covered windows
from the two different cell lines.
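To show what a per-window two-sample t-test computes, the following minimal sketch works through the arithmetic in base MATLAB on two made-up windows with two replicates per cell line. It uses the equal-variance (pooled) form of the test and obtains the two-sided p-value from the incomplete beta function. This is an illustration only: mattest has its own options and defaults, so its results may differ, and all names and values here are assumptions.

```matlab
% Toy normalized counts: rows are windows, columns are
% [HCT116-1 HCT116-2 DICERex5-1 DICERex5-2].
c = [10 12 20 22;
      5  7  5  7];
a = c(:,1:2);                            % first cell line
b = c(:,3:4);                            % second cell line
na = size(a,2);
nb = size(b,2);
df = na + nb - 2;                        % degrees of freedom

sp2 = ((na-1)*var(a,0,2) + (nb-1)*var(b,0,2)) / df;   % pooled variance per window
t = (mean(a,2) - mean(b,2)) ./ sqrt(sp2 * (1/na + 1/nb));
p = betainc(df ./ (df + t.^2), df/2, 0.5);            % two-sided p-value per window
disp([t p])
```

With only two replicates per group the test has just two degrees of freedom, which is one reason windows with zero within-group variance need special care in practice.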
Create a report with the 25 most significant differentially covered windows. While creating the report
use the helper function findClosestGene to determine if the window is intergenic, intragenic, or if
it is in a proximal promoter region.
[~,ord] = sort(pval);
fprintf('Window Pos Type p-value HCT116 DICERex5\n\n');
for i = 1:25
    j = ord(i);
    [~,msg] = findClosestGene(a9,[ws(j) ws(j)+99]);
    fprintf('%10d %-25s %7.6f%5d%5d %5d%5d\n', ...
        ws(j),msg,pval(j),counts_norm(j,:));
end
Plot the DNA methylation profile for the promoter region of gene FAM189A2, the most significant
differentially covered promoter region from the previous list. Overlay the CpG islands and the
FAM189A2 gene.
range = getRange(getSubset(a9,'Feature','FAM189A2'));
r1 = range(1)-1000;
r2 = range(2)+1000;
figure
hold on
Observe that the CpG islands are clearly unmethylated for both of the DICERex5 replicates.
References
[1] Serre, D., Lee, B.H., and Ting A.H., "MBD-isolated Genome Sequencing provides a high-
throughput and comprehensive survey of DNA methylation in the human genome", Nucleic Acids
Research, 38(2):391-9, 2010.
[2] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L., "Ultrafast and Memory-efficient
Alignment of Short DNA Sequences to the Human Genome", Genome Biology, 10(3):R25, 2009.
[3] Li, H., et al., "The Sequence Alignment/map (SAM) Format and SAMtools", Bioinformatics,
25(16):2078-9, 2009.
[4] Gardiner-Garden, M. and Frommer, M., "CpG islands in vertebrate genomes", Journal of Molecular
Biology, 196(2):261-82, 1987.
[5] Ting, A.H., et al., "A Requirement for DICER to Maintain Full Promoter CpG Island
Hypermethylation in Human Cancer Cells", Cancer Research, 68(8):2570-5, 2008.
[6] Benjamini, Y. and Hochberg, Y., "Controlling the false discovery rate: a practical and powerful
approach to multiple testing", Journal of the Royal Statistical Society, 57(1):289-300, 1995.
Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data
This example shows how to perform a genome-wide analysis of a transcription factor in the
Arabidopsis thaliana (Thale Cress) model organism.
For enhanced performance, it is recommended that you run this example on a 64-bit platform,
because the memory footprint is close to 2 GB. On a 32-bit platform, if you receive "Out of
memory" errors when running this example, try increasing the virtual memory (or swap space) of
your operating system or try setting the 3GB switch (32-bit Windows® XP only). For details, see
“Resolve “Out of Memory” Errors”.
Introduction
ChIP-Seq is a technology that is used to identify transcription factors that interact with specific DNA
sites. First, chromatin immunoprecipitation enriches DNA-protein complexes using an antibody that
binds to a particular protein of interest. Then, all the resulting fragments are processed using
high-throughput sequencing. Sequencing fragments are mapped back to the reference genome. By
inspecting over-represented regions, it is possible to mark the genomic location of DNA-protein
interactions.
In this example, short reads are produced by the paired-end Illumina® platform. Each fragment is
reconstructed from two successfully mapped short reads, so the exact length of the fragment
can be computed. Using paired-end information from sequence reads maximizes the accuracy of
predicting DNA-protein binding sites.
Data Set
This example explores the paired-end ChIP-Seq data generated by Wang et al. [1] using the Illumina®
platform. The data set has been courteously submitted to the Gene Expression Omnibus repository
with accession number GSM424618. The unmapped paired-end reads can be obtained from the NCBI
FTP site.
This example assumes that you:
(1) downloaded the data containing the unmapped short reads and converted it to FASTQ-formatted
files using the NCBI SRA Toolkit,
(2) produced a SAM-formatted file by mapping the short reads to the Thale Cress reference genome,
using a mapper such as BWA [2], Bowtie, or SSAHA2 (which is the mapper used by the authors of [1]),
and
(3) ordered the SAM-formatted file by reference name first, then by genomic position.
For the published version of this example, 8,655,859 paired-end short reads were mapped using the
BWA mapper [2]. BWA produced a SAM-formatted file (aratha.sam) with 17,311,718 records
(8,655,859 x 2). For repetitive hits, one hit was chosen at random and reported, but with a lower
mapping quality. The SAM file was ordered and converted to a BAM-formatted file using SAMtools [3]
before being loaded into MATLAB.
The last part of the example also assumes that you downloaded the reference genome for the Thale
Cress model organism (which includes five chromosomes). Uncomment the following lines of code to
download the reference from the NCBI repository:
% getgenbank('NC_003070','FileFormat','fasta','tofile','ach1.fasta');
% getgenbank('NC_003071','FileFormat','fasta','tofile','ach2.fasta');
% getgenbank('NC_003074','FileFormat','fasta','tofile','ach3.fasta');
% getgenbank('NC_003075','FileFormat','fasta','tofile','ach4.fasta');
% getgenbank('NC_003076','FileFormat','fasta','tofile','ach5.fasta');
To create local alignments and look at the coverage we need to construct a BioMap. BioMap has an
interface that provides direct access to the mapped short reads stored in the BAM formatted file, thus
minimizing the amount of data that is actually loaded to the workspace. Create a BioMap to access all
the short reads mapped in the BAM formatted file.
bm = BioMap('aratha.bam')
bm =
Use the getSummary method to obtain a list of the existing references and the actual number of
short reads mapped to each one.
getSummary(bm)
BioMap summary:
Name: ''
Container_Type: 'Data is file indexed.'
Total_Number_of_Sequences: 14637324
Number_of_References_in_Dictionary: 5
Number_of_Sequences Genomic_Range
Chr1 3151847 1 30427671
Chr2 3080417 1000 19698292
Chr3 3062917 94 23459782
Chr4 2218868 1029 18585050
Chr5 3123275 11 26975502
The remainder of this example focuses on the analysis of one of the five chromosomes, Chr1. Create a
new BioMap to access the short reads mapped to the first chromosome by subsetting the original BioMap.
bm1 = getSubset(bm,'SelectReference','Chr1')
bm1 =
SequenceDictionary: 'Chr1'
Reference: [3151847x1 File indexed property]
Signature: [3151847x1 File indexed property]
Start: [3151847x1 File indexed property]
MappingQuality: [3151847x1 File indexed property]
Flag: [3151847x1 File indexed property]
MatePosition: [3151847x1 File indexed property]
Quality: [3151847x1 File indexed property]
Sequence: [3151847x1 File indexed property]
Header: [3151847x1 File indexed property]
NSeqs: 3151847
Name: ''
By accessing the Start and Stop positions of the mapped short reads, you can obtain the genomic
range.
x1 = min(getStart(bm1))
x2 = max(getStop(bm1))
x1 =
uint32
1
x2 =
uint32
30427671
To explore the coverage for the whole range of the chromosome, a binning algorithm is required. The
getBaseCoverage method produces a coverage signal based on effective alignments. It also allows
you to specify a bin width to control the size (or resolution) of the output signal. However, internal
computations are still performed at base pair (bp) resolution. This means that despite setting a
large bin size, narrow peaks in the coverage signal can still be observed. Once the coverage signal is
plotted, you can program the figure's data cursor to display the genomic position when using the
tooltip. You can zoom and pan the figure to determine the position and height of the ChIP-Seq peaks.
[cov,bin] = getBaseCoverage(bm1,x1,x2,'binWidth',1000,'binType','max');
figure
plot(bin,cov)
axis([x1,x2,0,100]) % sets the axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage in Chromosome 1')
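The reason narrow peaks survive a large bin width when the bin type is max can be seen in a small numeric sketch. The following base MATLAB code builds a made-up per-base coverage signal with a 2-bp peak and summarizes it with 10-bp bins, once by maximum and once by mean. The signal and variable names are assumptions for illustration, and the code does not call getBaseCoverage.

```matlab
baseCov = zeros(1,20);                    % toy per-base coverage over 20 positions
baseCov(7:8) = 50;                        % a narrow 2-bp peak
binWidth = 10;

blocks  = reshape(baseCov, binWidth, []); % one column per bin
covMax  = max(blocks, [], 1);             % bin summarized by its maximum
covMean = mean(blocks, 1);                % bin summarized by its mean
disp([covMax; covMean])
```

The maximum keeps the full height of the peak in its bin, while the mean dilutes it in proportion to the bin width.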
It is also possible to explore the coverage signal at the bp resolution (also referred to as the pile-up
profile). Explore one of the large peaks observed in the data at position 4598837.
p1 = 4598837-1000;
p2 = 4598837+1000;
figure
plot(p1:p2,getBaseCoverage(bm1,p1,p2))
xlim([p1,p2]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage in Chromosome 1')
Observe the large peak with coverage depth of 800+ between positions 4599029 and 4599145.
Investigate how these reads are aligning to the reference chromosome. You can retrieve a subset of
these reads that is just large enough to satisfy a coverage depth of 25, since this is sufficient to
understand what is happening in this region. Use getIndex to obtain indices to this subset. Then use
getCompactAlignment to display the corresponding multiple alignment of the short reads.
i = getIndex(bm1,4599029,4599145,'depth',25);
bmx = getSubset(bm1,i,'inmemory',false)
getCompactAlignment(bmx,4599029,4599145)
bmx =
SequenceDictionary: 'Chr1'
Reference: [62x1 File indexed property]
Signature: [62x1 File indexed property]
Start: [62x1 File indexed property]
MappingQuality: [62x1 File indexed property]
Flag: [62x1 File indexed property]
MatePosition: [62x1 File indexed property]
Quality: [62x1 File indexed property]
Sequence: [62x1 File indexed property]
Header: [62x1 File indexed property]
NSeqs: 62
Name: ''
ans =
In addition to visually confirming the alignment, you can also explore the mapping quality for all the
short reads in this region, as this may hint at a potential problem. In this case, less than one percent
of the short reads have a Phred quality of 60, indicating that the mapper most likely found multiple
hits within the reference genome, hence assigning a lower mapping quality.
figure
i = getIndex(bm1,4599029,4599145);
hist(double(getMappingQuality(bm1,i)))
title('Mapping Quality of the reads between 4599029 and 4599145')
xlabel('Phred Quality Score')
ylabel('Number of Reads')
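Mapping qualities are Phred-scaled, so a score Q nominally corresponds to a probability 10^(-Q/10) that the reported mapping position is wrong. The following minimal base MATLAB sketch converts a few example scores into such probabilities. The scores are made up, and how closely a given mapper's scores follow this nominal definition depends on the mapper.

```matlab
mapq = [0 10 20 30 60];                  % example Phred-scaled mapping qualities
pWrong = 10 .^ (-mapq ./ 10);            % nominal probability of a wrong mapping position
for k = 1:numel(mapq)
    fprintf('MAPQ %2d -> %g\n', mapq(k), pWrong(k));
end
```

Under this definition, a score of 60 corresponds to about one chance in a million of a wrong placement, while a score of 0 carries no confidence in the reported position.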
Most of the large peaks in this data set occur due to satellite repeat regions or due to their closeness to
the centromere [4], and show characteristics similar to the example just explored. You may explore
other regions with large peaks using the same procedure.
To avoid these problematic regions, two techniques are used. First, given that the provided data set
uses paired-end sequencing, removing the reads that are not aligned in a proper pair reduces the
number of potential aligner errors or ambiguities. You can achieve this by exploring the flag field of
the SAM-formatted file, in which the second least significant bit is used to indicate if the short read is
mapped in a proper pair.
i = find(bitget(getFlag(bm1),2));
bm1_filtered = getSubset(bm1,i)
bm1_filtered =
SequenceDictionary: 'Chr1'
Reference: [3040724x1 File indexed property]
Signature: [3040724x1 File indexed property]
Start: [3040724x1 File indexed property]
MappingQuality: [3040724x1 File indexed property]
Flag: [3040724x1 File indexed property]
MatePosition: [3040724x1 File indexed property]
Quality: [3040724x1 File indexed property]
Sequence: [3040724x1 File indexed property]
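The flag test relies only on bit arithmetic, so it can be tried without a BioMap object. The following is a minimal sketch that assumes made-up flag values (they are illustrative and not taken from the data set); in the SAM format, bit 2 (value 2) marks a read mapped in a proper pair.

```matlab
% Toy SAM flags (hypothetical values chosen for illustration)
flags = uint16([99 147 73 83 163]);

% bitget(x,2) returns the second least significant bit:
% 1 if the read is mapped in a proper pair, 0 otherwise
isProperPair = bitget(flags, 2);

% Indices of the reads to keep, analogous to the index passed to getSubset
keep = find(isProperPair);
disp(keep)
```

Only the third flag (73) lacks the proper-pair bit, so it is the one read dropped from this toy set.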
Second, consider only uniquely mapped reads. You can detect reads that are equally mapped to
different regions of the reference sequence by looking at the mapping quality, because BWA assigns a
lower mapping quality (less than 60) to this type of short read.
i = find(getMappingQuality(bm1_filtered)==60);
bm1_filtered = getSubset(bm1_filtered,i)
bm1_filtered =
SequenceDictionary: 'Chr1'
Reference: [2313252x1 File indexed property]
Signature: [2313252x1 File indexed property]
Start: [2313252x1 File indexed property]
MappingQuality: [2313252x1 File indexed property]
Flag: [2313252x1 File indexed property]
MatePosition: [2313252x1 File indexed property]
Quality: [2313252x1 File indexed property]
Sequence: [2313252x1 File indexed property]
Header: [2313252x1 File indexed property]
NSeqs: 2313252
Name: ''
Visualize the filtered data set again, using both a coarse resolution with 1000 bp bins for the whole
chromosome and a fine resolution for a small region of 20,000 bp. Most of the large peaks due to
artifacts have been removed.
[cov,bin] = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','max');
figure
plot(bin,cov)
axis([x1,x2,0,100]) % sets the axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base Position')
ylabel('Depth')
title('Coverage in Chromosome 1 after Filtering')
p1 = 24275801-10000;
p2 = 24275801+10000;
figure
plot(p1:p2,getBaseCoverage(bm1_filtered,p1,p2))
xlim([p1,p2]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base Position')
ylabel('Depth')
title('Coverage in Chromosome 1 after Filtering')
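To make the 'binWidth' and 'binType' options more concrete, here is a minimal sketch of per-base coverage followed by max-binning, written in base MATLAB with made-up read coordinates. It illustrates the idea only and is not a description of how getBaseCoverage is implemented.

```matlab
% Toy alignments: start and stop position of each read (hypothetical values)
starts = [1 3 3 8];
stops  = [4 6 5 10];
x1 = 1; x2 = 10;                  % region of interest

% Per-base depth: add 1 over the span covered by each read
depth = zeros(1, x2 - x1 + 1);
for k = 1:numel(starts)
    span = (starts(k):stops(k)) - x1 + 1;
    depth(span) = depth(span) + 1;
end

% Coarse signal: maximum depth within consecutive bins of binWidth bases
binWidth = 5;
binned = max(reshape(depth, binWidth, []), [], 1);
disp(depth)
disp(binned)
```

The reshape step assumes that the region length is a multiple of the bin width, which holds for this toy region.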
Wang et al. [1] hypothesize that paired-end sequencing data can increase the accuracy with which
chromosome binding sites of DNA-associated proteins are identified, because the fragment length can
be derived accurately. With single-end sequencing, by contrast, it is necessary to resort to a
statistical approximation of the fragment length and apply it uniformly to all putative binding sites.
Use the paired-end reads to reconstruct the sequencing fragments. First, get the indices for the
forward and the reverse reads in each pair. This information is captured in the fifth bit of the flag
field, according to the SAM file format.
fow_idx = find(~bitget(getFlag(bm1_filtered),5));
rev_idx = find(bitget(getFlag(bm1_filtered),5));
SAM-formatted files use the same header string for both mates of a pair. By pairing the header strings
you can determine how the short reads in the BioMap are paired. To pair the header strings, sort
them in ascending order and use the sorting indices (hf and hr) to link the unsorted header strings.
[~,hf] = sort(getHeader(bm1_filtered,fow_idx));
[~,hr] = sort(getHeader(bm1_filtered,rev_idx));
mate_idx = zeros(numel(fow_idx),1);
mate_idx(hf) = rev_idx(hr);
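The sort-based pairing can be checked on a small example. This sketch assumes three made-up header strings and hypothetical read indices; it uses the same indexing pattern as the code above, with plain cell arrays standing in for the output of getHeader.

```matlab
% Toy headers for forward and reverse reads (hypothetical names),
% listed in different orders, with the index each read has in the BioMap
fowHeaders = {'r3','r1','r2'};  fow_idx = [10 11 12];
revHeaders = {'r2','r3','r1'};  rev_idx = [20 21 22];

% Sorting both header lists puts mates in the same rank
[~,hf] = sort(fowHeaders);
[~,hr] = sort(revHeaders);

% Map each forward read (in its original order) to the index of its mate
mate_idx = zeros(numel(fow_idx),1);
mate_idx(hf) = rev_idx(hr);
disp([fow_idx(:) mate_idx])
```

Each row of the displayed matrix lists a forward read index next to the index of the reverse read that carries the same header.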
Use the resulting fow_idx and mate_idx variables to retrieve pair mates. For example, retrieve the
paired-end reads for the first 10 fragments.
for j = 1:10
disp(getInfo(bm1_filtered, fow_idx(j)))
disp(getInfo(bm1_filtered, mate_idx(j)))
end
Use the paired-end indices to construct a new BioMap with the minimal information needed to
represent the sequencing fragments. First, calculate the insert sizes.
J = getStop(bm1_filtered, fow_idx);
K = getStart(bm1_filtered, mate_idx);
L = K - J - 1;
Obtain the new signature (or CIGAR string) for each fragment by using the short read original
signatures separated by the appropriate number of skip CIGAR symbols (N).
n = numel(L);
cigars = cell(n,1);
for i = 1:n
cigars{i} = sprintf('%dN',L(i)); % skip symbol spanning the insert
end
cigars = strcat(getSignature(bm1_filtered,fow_idx),cigars,getSignature(bm1_filtered,mate_idx));
Reconstruct the sequences for the fragments by concatenating the respective sequences of the
paired-end short reads.
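Both reconstruction steps can be tried on toy data. The following minimal sketch assumes made-up signatures, insert sizes, and sequences in place of the values that getSignature, getSequence, and the insert-size calculation would return for bm1_filtered.

```matlab
% Toy CIGAR strings of the forward and reverse mates (hypothetical values)
fwdSig = {'36M','36M'};
revSig = {'36M','30M'};
insertSize = [120 85];            % bases between the two mates

% One skip symbol (N) per fragment, spanning the unsequenced insert
skip = arrayfun(@(g) sprintf('%dN', g), insertSize, 'UniformOutput', false);

% Fragment signature: forward signature, skip, reverse signature
cigars = strcat(fwdSig, skip, revSig);

% Fragment sequence: the two mate sequences concatenated (toy sequences)
fwdSeq = {'ACGT','GGCC'};
revSeq = {'TTAA','CCGG'};
seqs = strcat(fwdSeq, revSeq);
disp(cigars)
disp(seqs)
```

Because the insert is represented by N symbols in the signature, the concatenated sequence contains only the sequenced bases of the two mates.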
Calculate the fragment sizes and plot their distribution.
J = getStart(bm1_filtered,fow_idx);
K = getStop(bm1_filtered,mate_idx);
L = K - J + 1;
figure
hist(double(L),100)
title(sprintf('Fragment Size Distribution\n %d Paired-end Fragments Mapped to Chromosome 1',n))
xlabel('Fragment Size')
ylabel('Count')
Construct a new BioMap to represent the sequencing fragments. With this object, you can explore
the coverage signals as well as local alignments of the fragments.
bm1_fragments = BioMap('Sequence',seqs,'Signature',cigars,'Start',J)
bm1_fragments =
Compare the coverage signal obtained by using the reconstructed fragments with the coverage signal
obtained by using individual paired-end reads. Notice that enriched binding sites, represented by
peaks, can be better discriminated from the background signal.
cov_reads = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','max');
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,x1,x2,'binWidth',1000,'binType','max');
figure
plot(bin,cov_reads,bin,cov_fragments)
xlim([x1,x2]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage Comparison')
legend('Short Reads','Fragments')
Perform the same comparison at bp resolution. In this data set, Wang et al. [1] investigated a basic
helix-loop-helix (bHLH) transcription factor. bHLH proteins typically bind to a consensus sequence
called an E-box (with a CANNTG motif). Use fastaread to load the reference chromosome, search for
the E-box motif in the 3' and 5' directions, and then overlay the motif positions on the coverage
signals. This example works over the region 1-200,000; however, the figure limits are narrowed to a
3000 bp region to better depict the details.
p1 = 1;
p2 = 200000;
cov_reads = getBaseCoverage(bm1_filtered,p1,p2);
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,p1,p2);
chr1 = fastaread('ach1.fasta');
mp1 = regexp(chr1.Sequence(p1:p2),'CA..TG')+3+p1;
mp2 = regexp(chr1.Sequence(p1:p2),'GT..AC')+3+p1;
motifs = [mp1 mp2];
figure
plot(bin,cov_reads,bin,cov_fragments)
hold on
plot([1;1;1]*motifs,[0;max(ylim);NaN],'r')
xlim([111000 114000]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage Comparison')
legend('Short Reads','Fragments','E-box motif')
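The motif search itself is an ordinary regular expression match. As a minimal sketch, assuming a short made-up sequence rather than the reference chromosome, the two patterns used above can be tried like this:

```matlab
% Toy sequence (hypothetical, not from the reference chromosome)
seq = 'AACAGCTGTTCACGTGAA';

% regexp returns the start index of each non-overlapping match;
% '.' matches any single character, so 'CA..TG' encodes CANNTG
mp1 = regexp(seq, 'CA..TG');
mp2 = regexp(seq, 'GT..AC');

% Combine both searches and measure the spacing between adjacent sites
motifs = sort([mp1 mp2]);
motif_sep = diff(motifs);
disp(motifs)
disp(motif_sep)
```

In the example above, a constant offset is added to these start indices to convert them to chromosome positions; the sketch leaves them as indices into the toy sequence.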
Observe that it is not possible to associate each peak in the coverage signals with an E-box motif. This
is because the length of the sequencing fragments is comparable to the average motif distance,
blurring peaks that are close. Plot the distribution of the distances between the E-box motif sites.
motif_sep = diff(sort(motifs));
figure
hist(motif_sep(motif_sep<500),50)
title('Distance (bp) between adjacent E-box motifs')
xlabel('Distance (bp)')
ylabel('Counts')
Use the mspeaks function to perform peak detection with wavelet denoising on the coverage signal
of the fragment alignments. Filter putative ChIP peaks using a height filter to remove peaks that are
not enriched by the binding process under consideration.
putative_peaks = mspeaks(bin,cov_fragments,'noiseestimator',20,...
'heightfilter',10,'showplot',true);
hold on
legend('off')
plot([1;1;1]*motifs(motifs>p1 & motifs<p2),[0;max(ylim);NaN],'r')
xlim([111000 114000]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
legend('Coverage from Fragments','Wavelet Denoised Coverage','Putative ChIP peaks','E-box Motifs')
xlabel('Base position')
ylabel('Depth')
title('ChIP-Seq Peak Detection')
Use the knnsearch function to find the closest motif to each one of the putative peaks. As expected,
most of the enriched ChIP peaks are close to an E-box motif [1]. This reinforces the importance of
performing peak detection at the finest resolution possible (bp resolution) when the expected density
of binding sites is high, as it is in the case of the E-box motif. This example also illustrates that for
this type of analysis, paired-end sequencing should be considered over single-end sequencing [1].
h = knnsearch(motifs',putative_peaks(:,1));
distance = putative_peaks(:,1)-motifs(h(:))';
figure
hist(distance(abs(distance)<200),50)
title('Distance to Closest E-box Motif for Each Detected Peak')
xlabel('Distance (bp)')
ylabel('Counts')
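For one-dimensional positions, the nearest-motif lookup can also be written with base MATLAB functions. The sketch below assumes made-up motif and peak positions and replaces knnsearch with a brute-force minimum over all pairwise distances, which is practical only for small inputs.

```matlab
% Toy motif and peak positions (hypothetical values)
motifs = [100 250 400];           % row vector, as in the example
peaks  = [110; 390; 260];         % column vector of peak locations

% Pairwise absolute distances: one row per peak, one column per motif
d = abs(bsxfun(@minus, peaks, motifs));

% Index of the closest motif for each peak
[~, h] = min(d, [], 2);

% Signed distance from each peak to its closest motif
distance = peaks - motifs(h)';
disp([h distance])
```

A positive distance means the peak lies downstream of its closest motif, and a negative distance means it lies upstream.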
References
[1] Wang, Congmao, Jie Xu, Dasheng Zhang, Zoe A Wilson, and Dabing Zhang. “An Effective Approach
for Identification of in Vivo Protein-DNA Binding Sites from Paired-End ChIP-Seq Data.” BMC
Bioinformatics 11, no. 1 (2010): 81.
[2] Li, H., and R. Durbin. “Fast and Accurate Short Read Alignment with Burrows-Wheeler
Transform.” Bioinformatics 25, no. 14 (July 15, 2009): 1754–60.
[3] Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin,
and 1000 Genome Project Data Processing Subgroup. “The Sequence Alignment/Map Format
and SAMtools.” Bioinformatics 25, no. 16 (August 15, 2009): 2078–79.
[4] Jothi, R., S. Cuddapah, A. Barski, K. Cui, and K. Zhao. “Genome-Wide Identification of in Vivo
Protein-DNA Binding Sites from ChIP-Seq Data.” Nucleic Acids Research 36, no. 16 (August 1,
2008): 5221–31.
[5] Hoffman, Brad G, and Steven J M Jones. “Genome-Wide Identification of DNA–Protein Interactions
Using Chromatin Immunoprecipitation Coupled with Flow Cell Sequencing.” Journal of
Endocrinology 201, no. 1 (April 2009): 1–13.
[6] Ramsey, Stephen A., Theo A. Knijnenburg, Kathleen A. Kennedy, Daniel E. Zak, Mark Gilchrist,
Elizabeth S. Gold, Carrie D. Johnson, et al. “Genome-Wide Histone Acetylation Data Improve
Prediction of Mammalian Transcription Factor Binding Sites.” Bioinformatics 26, no. 17
(September 1, 2010): 2071–75.
See Also
BioMap | getBaseCoverage | getgenbank | getSummary
Related Examples
• “Identifying Differentially Expressed Genes from RNA-Seq Data” on page 2-32
• “Count Features from NGS Reads” on page 2-23
• “Exploring Genome-Wide Differences in DNA Methylation Profiles” on page 2-58
Working with Illumina/Solexa Next-Generation Sequencing Data
This example shows how to read and perform basic operations with data produced by the Illumina®/
Solexa Genome Analyzer®.
Introduction
During an analysis run with the Genome Analyzer Pipeline software, several intermediate files are
produced. In this example, you will learn how to read and manipulate the information contained in
sequence files (_sequence.txt).
The _sequence.txt files are FASTQ-formatted files that contain the sequence reads and their
quality scores, after quality trimming and filtering. You can use the fastqinfo function to display a
summary of the contents of a _sequence.txt file, and the fastqread function to read the contents
of the file. The output, reads, is a structure array with the fields Header, Sequence, and
Quality.
filename = 'ilmnsolexa_sequence.txt';
info = fastqinfo(filename)
reads = fastqread(filename)
info =
Filename: 'ilmnsolexa_sequence.txt'
FilePath: 'C:\TEMP\Bdoc22b_2134332_7304\ibD166FA\17\tp9f51422f\bioinfo-ex25447385'
FileModDate: '06-May-2009 16:02:48'
FileSize: 30124
NumberOfEntries: 260
reads =
Header
Sequence
Quality
Because there is one sequence file per tile, it is not uncommon to have a collection of over 1,000 files
in total. You can read the entire collection of files associated with a given analysis run by
concatenating the _sequence.txt files into a single file. However, because this operation usually
produces a large file that requires ample memory to be stored and processed, it is advisable to read
the content in chunks using the blockread option of the fastqread function. For example, you can
read the first M sequences, or the last M sequences, or any M sequences in the file.
M = 150;
N = info.NumberOfEntries;
readsFirst = fastqread(filename,'blockread',[1 M])
readsLast = fastqread(filename,'blockread',[N-M+1 N])
readsFirst =
Header
Sequence
Quality
readsLast =
Header
Sequence
Quality
Once you load the sequence information into your workspace, you can determine the number and
length of the sequence reads and plot their distribution as follows:
seqs = {reads.Sequence};
readsLen = cellfun(@length, seqs);
figure(); hist(readsLen);
xlabel('Number of bases'); ylabel('Number of sequence reads');
title('Length distribution of sequence reads')
You can also examine the nucleotide composition by surveying the number of occurrences of each
base type in each sequence read, as shown below:
nt = {'A', 'C', 'G', 'T'};
for i = 1:4
pos(i,:) = strfind(seqs, nt{i});
end
count = zeros(4,N);
for i = 1:4
count(i,:) = cellfun(@length, pos(i,:));
end
figure(); hist(count');
xlabel('Occurrences');
ylabel('Number of sequence reads');
legend('A', 'C', 'G', 'T');
title('Base distribution by nucleotide type');
Each base call of a sequence read in the _sequence.txt file is associated with a quality score. The
score is defined as SQ = -10 * log10(p / (1-p)), where p is the error probability of a base. You can
examine the quality scores associated with the base calls by converting the ASCII format into a
numeric representation and then plotting their distribution.
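As a minimal sketch of the ASCII-to-numeric step, the following code decodes one short quality string. It assumes that the scores are stored as the ASCII code minus 64, the offset used by early Solexa/Illumina pipelines; the string is a made-up example.

```matlab
% One quality string in Solexa/Illumina ASCII encoding (toy example)
sq = ']YVCHA';

% Assumption: numeric score = ASCII code - 64
SQ = double(sq) - 64;
disp(SQ)

% Error probability implied by each score, from SQ = -10*log10(p/(1-p))
p = 1 ./ (1 + 10.^(SQ/10));
disp(p)
```

For a cell array of quality strings, the same expression can be applied to every read with cellfun and 'UniformOutput' set to false.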
The quality scores found in Solexa/Illumina files are asymptotically equal, but not identical, to the
quality scores used in the Sanger standard (Phred-like scores, Q). Q is defined as -10 * log10(p), where
p is the error probability of a base. For example, if the quality score of a base is Q = 20, then
p = 10^(-20/10) = 0.01. This means that, on average, one in every 100 base calls with a score of 20
is wrong. While Phred quality scores are positive integers, Solexa/Illumina quality scores can be
negative. You can convert Solexa quality scores into Phred quality scores using the following code:
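Combining the two definitions gives Q = 10 * log10(10^(SQ/10) + 1). The following minimal sketch applies this formula to a few made-up Solexa scores and re-encodes the result as Sanger-style ASCII; the rounding to the nearest integer and the ASCII offset of 33 are assumptions of the sketch.

```matlab
% Toy Solexa scores (hypothetical values)
SQ = [29 25 22 3 8 13 16 12 1 19];

% Phred score from Solexa score, rounded to the nearest integer
Q = round(10 * log10(10.^(SQ/10) + 1));

% Sanger-style ASCII encoding (assumed offset of 33)
q = char(Q + 33);
disp(Q)
disp(q)
```

High scores are essentially unchanged by the conversion, while low scores shift upward (for example, SQ = 3 becomes Q = 5), which is the asymptotic behavior described above.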
sanger = q(1:3)'
solexa = sq(1:3)'
sanger =
{'>>>>>>>>>>>>:>:>>>>>>>>>>>>7&*7.1-%4'}
{'>>>>>>>>>>>>:>>>>>>>>>:17>5><1;1+&&,'}
{'>>>>:>>>>>7>5>>>>>5>>>>>7>5.+'69'(-%'}
solexa =
{']]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS'}
{']]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK'}
{']]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA'}
Signal purity filtering has already been applied to the sequences in the _sequence.txt files. You
can perform additional filtering, for example by considering only those sequence reads whose bases
have all quality scores above a specific threshold:
%=== find sequence reads whose bases all have quality above threshold
len = 36;
qt = 10; % minimum quality threshold
a = cellfun(@(x) x > qt, SQ, 'UniformOutput', false);
b = cellfun(@sum, a);
c1 = find(b == len);
n1= numel(c1); % number of sequence reads passing the filter
disp([num2str(n1) ' sequence reads have all bases above threshold ' num2str(qt)]);
Alternatively, you can consider only those sequence reads that have at most a given number of bases
with quality scores at or below the threshold:
%=== find sequence reads having at most M bases with quality at or below threshold
M = 5; % max number of bases with poor quality
a = cellfun(@(x) x <= qt, SQ, 'UniformOutput', false);
b = cellfun(@sum, a);
c2 = find(b <= M);
n2 = numel(c2); % number of sequence reads passing the filter
disp([num2str(n2) ' sequence reads have at most ' num2str(M) ' bases at or below threshold ' num2str(qt)]);
Finally, you can apply a lowercase mask to those bases that have quality scores below the threshold:
seq = reads(1).Sequence
mseq = seq;
mask = SQ{1} < qt; % reuses the threshold qt = 10 defined above
mseq(mask) = lower(seq(mask))
seq =
'GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT'
mseq =
'GGACTTTGTAGGATACCCTCGCTTTCCTtcTCCTgT'
To summarize read occurrences, you can determine the number of unique read sequences and their
distribution across the data set. You can also identify those sequence reads that occur multiple times,
often because they correspond to adapters or primers used in the sequencing process.
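One way to obtain such a summary is with unique and accumarray. The following minimal sketch assumes a small made-up set of reads; the variable names mirror the output names used in this example, but the code is an illustration rather than the original listing.

```matlab
% Toy read sequences (hypothetical)
seqs = {'GGCC','AAAA','ACGT','AAAA','GGCC','AAAA'};

% Unique sequences and, for each read, the index of its unique sequence
[uniqueSeqs, ~, j] = unique(seqs);
numUnique = numel(uniqueSeqs);

% Number of occurrences of each unique sequence
numOccurrences = accumarray(j(:), 1);

% Sequences that occur more than once, with their frequencies
dupMask = numOccurrences > 1;
dupReads = uniqueSeqs(dupMask);
dupFreq = numOccurrences(dupMask)';
disp(numUnique)
disp(dupReads)
disp(dupFreq)
```

Because unique returns the sequences in sorted order, the duplicated reads are listed alphabetically.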
numUnique =
250
dupReads =
{'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'}
{'GATTTTATTGGTATCAGGGTTAATCGTGCCAAGAAA'}
{'GCATGGGTGATGCTGGTATTAAATCTGCCATTCAAG'}
{'GGGATGAACATAATAAGCAATGACGGCAGCAATAAA'}
{'GGGGGAGCACATTGTAGCATTGTGCCAATTCATCCA'}
{'GGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAA'}
{'GTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCT'}
{'GTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGC'}
{'GTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAG'}
dupFreq =
2 2 2 2 2 2 3 2 2
Illumina/Solexa sequencing may produce false polyA at the edges of a tile. To identify these artifacts,
you need to identify homopolymers, that is, sequence reads composed of one type of nucleotide only.
In the data set under consideration, there are two homopolymers, both of which are polyA.
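A read is a homopolymer when every base equals its first base. As a minimal sketch, assuming a few made-up reads in place of the data set, this test can be written as:

```matlab
% Toy read sequences (hypothetical)
seqs = {'ACGT','AAAA','TTTT','AAAC'};

% True for reads in which every base equals the first base
isHomopol = cellfun(@(s) all(s == s(1)), seqs);

% Indices and sequences of the homopolymer reads
homopolIndex = find(isHomopol);
homopol = seqs(homopolIndex);
disp(homopolIndex)
disp(homopol)
```

The sketch assumes that no read is empty, since s(1) is undefined for an empty sequence.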
homopolIndex
homopol = {reads(homopolIndex).Sequence}'
homopolIndex =
251
257
homopol =
{'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'}
{'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'}
Similarly, you can identify sequence reads that are near-matches to homopolymers, that is, sequence
reads that are composed almost exclusively of one nucleotide type.
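One possible criterion is the fraction of the read taken up by its most frequent base. The sketch below assumes made-up reads and a hypothetical cutoff of 80 percent; the cutoff is an illustrative choice, not a value taken from this example.

```matlab
% Toy read sequences (hypothetical)
seqs = {'AAAAAAAAAC','ACGTACGTAC','TTTTTTTTTT','GGGGAGGGGG'};
nt = 'ACGT';
minFraction = 0.8;                % hypothetical cutoff

% Fraction of each read (columns) taken up by each base type (rows)
frac = zeros(numel(nt), numel(seqs));
for k = 1:numel(nt)
    frac(k,:) = cellfun(@(s) mean(s == nt(k)), seqs);
end

% Near-matches: dominated by one base, but not pure homopolymers
topFrac = max(frac, [], 1);
nearHomopolIndex = find(topFrac >= minFraction & topFrac < 1);
disp(nearHomopolIndex)
```

Excluding reads whose top fraction is exactly 1 keeps this list separate from the pure homopolymers identified earlier.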
nearhomopolIndex =
4
243
nearHomopol =
{'AAAAACATAAAAAAAAAAATAAAAAAACAAAAAAAA'}
{'AAAAAAATAAAAAAAAAAATAAAAAAAAATTAAAAA'}
Once you have processed and analyzed your data, it might be convenient to save a subset of
sequences in a separate FASTQ file for future consideration. For this purpose you can use the
fastqwrite function.
3
Sequence Analysis
Sequence analysis is the process you use to find information about a nucleotide or amino acid
sequence using computational methods. Common tasks in sequence analysis are identifying genes,
determining the similarity of two genes, determining the protein coded by a gene, and determining
the function of a gene by finding a similar gene in another organism with a known function.
Overview of Example
After sequencing a piece of DNA, one of the first tasks is to investigate the nucleotide content in the
sequence. Starting with a DNA sequence, this example uses sequence statistics functions to
determine mono-, di-, and trinucleotide content, and to locate open reading frames.
First research information about the human mitochondria and find the nucleotide sequence for the
genome. Next, look at the nucleotide content for the entire sequence. And finally, determine open
reading frames and extract specific gene sequences.
1 Use the MATLAB Help browser to explore the Web. In the MATLAB Command Window, type
web('http://www.ncbi.nlm.nih.gov/')
A separate browser window opens with the home page for the NCBI Web site.
2 Search the NCBI Web site for information. For example, to search for the human mitochondrion
genome, from the Search list, select Genome, and in the search box, enter mitochondrion
homo sapiens.
Exploring a Nucleotide Sequence Using Command Line
3 Select a result page. For example, click the link labeled NC_012920.
The MATLAB Help browser displays the NCBI page for the human mitochondrial genome.
The consensus sequence for the human mitochondrial genome has the GenBank accession number
NC_012920. Since the whole GenBank entry is quite large and you might only be interested in the
sequence, you can get just the sequence information.
1 Get sequence information from a Web database. For example, to retrieve sequence information
for the human mitochondrial genome, in the MATLAB Command Window, type
mitochondria = getgenbank('NC_012920','SequenceOnly',true)
The getgenbank function retrieves the nucleotide sequence from the GenBank database and
creates a character array.
mitochondria =
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGT . . .
2 If you don't have a Web connection, you can load the data from a MAT file included with the
Bioinformatics Toolbox software, using the command
load mitochondria
The load function loads the sequence mitochondria into the MATLAB Workspace.
3 Get information about the sequence. Type
whos mitochondria
Information about the size of the sequence displays in the MATLAB Command Window.
After you read a sequence into the MATLAB environment, you can use the sequence statistics
functions to determine if your sequence has the characteristics of a protein-coding region. This
procedure uses the human mitochondrial genome as an example. See “Reading Sequence Information
from the Web” on page 3-4.
1 Plot monomer densities and combined monomer densities in a graph. In the MATLAB Command
Window, type
ntdensity(mitochondria)
3 Count the nucleotides in the reverse complement of a sequence using the seqrcomplement
function.
basecount(seqrcomplement(mitochondria))
As expected, the nucleotide counts on the reverse complement strand are complementary to the
5'-3' strand.
ans =
A: 4094
C: 2169
G: 5181
T: 5124
4 Use the function basecount with the chart option to visualize the nucleotide distribution.
figure
basecount(mitochondria,'chart','pie');
5 Count the dimers in a sequence and display the information in a bar chart.
figure
dimercount(mitochondria,'chart','bar')
ans =
AA: 1604
AC: 1495
AG: 795
AT: 1230
CA: 1534
CC: 1771
CG: 435
CT: 1440
GA: 613
GC: 711
GG: 425
GT: 419
TA: 1373
TC: 1204
TG: 513
TT: 1004
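The complementarity between the two strands noted in step 3 can be verified on a small example without the toolbox. This minimal sketch assumes a made-up sequence and counts bases with plain comparisons instead of basecount and seqrcomplement.

```matlab
% Toy sequence (hypothetical)
seq = 'AACGTTTG';
nt = 'ACGT';

% Base counts in the order [A C G T]
counts = arrayfun(@(b) sum(seq == b), nt);

% Reverse complement: reverse the sequence, then swap A<->T and C<->G
comp = 'TGCA';                    % complement of each letter in nt
[~, loc] = ismember(fliplr(seq), nt);
rc = comp(loc);

% Counts on the reverse complement, again in the order [A C G T]
rcCounts = arrayfun(@(b) sum(rc == b), nt);
disp(counts)
disp(rcCounts)
```

The A count on one strand equals the T count on the other, and likewise for C and G, so rcCounts is counts in reverse order.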
After you read a sequence into the MATLAB environment, you can analyze the sequence for codon
composition. This procedure uses the human mitochondrial genome as an example. See “Reading
Sequence Information from the Web” on page 3-4.
codoncount(mitochondria)
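Conceptually, counting codons means splitting the sequence into consecutive triplets in one reading frame and tallying them. The following minimal sketch does this in base MATLAB for a made-up sequence in the first reading frame; it is an illustration of the idea and not the codoncount implementation.

```matlab
% Toy sequence (hypothetical)
seq = 'ATGACCACCATATAA';

% Split the first reading frame into codons, ignoring trailing bases
nCodons = floor(numel(seq)/3);
codons = cellstr(reshape(seq(1:3*nCodons), 3, [])');

% Tally the occurrences of each distinct codon
[codonNames, ~, j] = unique(codons);
codonCounts = accumarray(j(:), 1);
for k = 1:numel(codonNames)
    fprintf('%s: %d\n', codonNames{k}, codonCounts(k));
end
```

Starting the reshape at seq(2) or seq(3) instead would give the counts for the second or third reading frame.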
After you read a sequence into the MATLAB environment, you can analyze the sequence for open
reading frames. This procedure uses the human mitochondrial genome as an example. See “Reading
Sequence Information from the Web” on page 3-4.
1 Display open reading frames (ORFs) in a nucleotide sequence. In the MATLAB Command
Window, type:
seqshoworfs(mitochondria);
If you compare this output to the genes shown on the NCBI page for NC_012920, there are fewer
genes than expected. This is because vertebrate mitochondria use a genetic code slightly
different from the standard genetic code. For a list of genetic codes, see the Genetic Code table
in the aa2nt reference page.
2 Display ORFs using the Vertebrate Mitochondrial code.
orfs= seqshoworfs(mitochondria,...
'GeneticCode','Vertebrate Mitochondrial',...
'alternativestart',true);
Notice that there are now two large ORFs on the third reading frame. One starts at position 4470
and the other starts at 5904. These correspond to the genes ND2 (NADH dehydrogenase subunit 2)
and COX1 (cytochrome c oxidase subunit I).
3 Find the corresponding stop codon. The start and stop positions of each ORF are stored at the
same index in the fields Start and Stop.
ND2Start = 4470;
StartIndex = find(orfs(3).Start == ND2Start)
ND2Stop = orfs(3).Stop(StartIndex)
ND2Stop =
5511
4 Using the sequence indices for the start and stop of the gene, extract the subsequence from the
sequence.
ND2Seq = mitochondria(ND2Start:ND2Stop)
The subsequence (protein-coding region) is stored in ND2Seq and displayed on the screen.
attaatcccctggcccaacccgtcatctactctaccatctttgcaggcac
actcatcacagcgctaagctcgcactgattttttacctgagtaggcctag
aaataaacatgctagcttttattccagttctaaccaaaaaaataaaccct
cgttccacagaagctgccatcaagtatttcctcacgcaagcaaccgcatc
cataatccttc . . .
5 Determine the codon distribution.
codoncount(ND2Seq)
The codon count shows high counts for ACC, ATA, CTA, and ATC.
6 Look up the amino acids for the codons ATA, CTA, ACC, and ATC.
aminolookup('code',nt2aa('ATA'))
aminolookup('code',nt2aa('CTA'))
aminolookup('code',nt2aa('ACC'))
aminolookup('code',nt2aa('ATC'))
Ile isoleucine
Leu leucine
Thr threonine
Ile isoleucine
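One reason the ORFs in this procedure depend on the genetic code is the set of stop codons: the standard code stops at TAA, TAG, and TGA, whereas the vertebrate mitochondrial code reads TGA as tryptophan and stops at TAA, TAG, AGA, and AGG. The following minimal sketch, which assumes a made-up sequence and scans only the first reading frame, shows how this changes where an ORF ends.

```matlab
% Toy sequence (hypothetical): ATG TGA AAA TAA
seq = 'ATGTGAAAATAA';
codons = cellstr(reshape(seq, 3, [])');

% Stop codons under the two genetic codes
standardStops = {'TAA','TAG','TGA'};
mitoStops     = {'TAA','TAG','AGA','AGG'};

% Position (in codons) of the first stop under each code
firstStopStandard = find(ismember(codons, standardStops), 1);
firstStopMito     = find(ismember(codons, mitoStops), 1);
fprintf('Standard code stops at codon %d\n', firstStopStandard);
fprintf('Vertebrate mitochondrial code stops at codon %d\n', firstStopMito);
```

Under the standard code the toy ORF is cut short at the TGA codon, which is consistent with the shorter and fewer ORFs observed before the 'GeneticCode' option was set.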
After you locate an open reading frame (ORF) in a gene, you can convert it to an amino acid sequence
and determine its amino acid composition. This procedure uses the human mitochondrial genome as an
example. See “Open Reading Frames” on page 3-11.
1 Convert a nucleotide sequence to an amino acid sequence. In this example, only the protein-
coding sequence between the start and stop codons is converted.
ND2AASeq = nt2aa(ND2Seq,'geneticcode',...
'Vertebrate Mitochondrial')
The sequence is converted using the Vertebrate Mitochondrial genetic code. Because the
property AlternativeStartCodons is set to 'true' by default, the first codon att is
converted to M instead of I.
MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVLTKKMNP
RSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMM
AMAMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLN
VSLLLTLSILSIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNM
TILNLTIYIILTTTAFLLLNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLS
LGGLPPLTGFLPKWAIIEEFTKNNSLIIPTIMATITLLNLYFYLRLIYST
SITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTTLLLPISPFMLMIL
2 Compare your conversion with the published conversion in the GenPept database.
ND2protein = getgenpept('YP_003024027','sequenceonly',true)
The getgenpept function retrieves the published conversion from the NCBI database and reads
it into the MATLAB Workspace.
3 Count the amino acids in the protein sequence.
aacount(ND2AASeq, 'chart','bar')
A bar graph displays. Notice the high content for leucine, threonine and isoleucine, and also
notice the lack of cysteine and aspartic acid.
4 Determine the atomic composition and molecular weight of the protein.
atomiccomp(ND2AASeq)
molweight(ND2AASeq)
ans =
C: 1818
H: 2882
N: 420
O: 471
S: 25
ans =
3.8960e+004
If this sequence were unknown, you could use this information to identify the protein by
comparing it with the atomic composition of other proteins in a database.
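The amino acid tally from step 3 can be approximated in base MATLAB. This minimal sketch uses only the first 50 residues of ND2AASeq as printed above, so its counts describe that fragment and not the whole protein; the residue ordering is an arbitrary choice.

```matlab
% First 50 residues of the ND2 protein sequence shown in this procedure
aaSeq = 'MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVLTKKMNP';

% The 20 standard amino acids as one-letter codes
residues = 'ARNDCQEGHILKMFPSTWYV';

% Count each residue and list the ones that do not occur
counts = arrayfun(@(r) sum(aaSeq == r), residues);
absent = residues(counts == 0);
for k = 1:numel(residues)
    fprintf('%s: %d\n', residues(k), counts(k));
end
disp(absent)
```

Even in this short fragment, cysteine (C) and aspartic acid (D) are missing, in line with the observation made for the full sequence.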
Exploring a Nucleotide Sequence Using the Sequence Viewer App
The following procedure illustrates how to retrieve sequence information from the NCBI database on
the Web. This example uses the GenBank accession number NM_000520, which is the human gene
HEXA that is associated with Tay-Sachs disease.
Note Data in public repositories is frequently curated and updated; therefore, the results of this
example might be slightly different when you use up-to-date sequences.
seqviewer
The Sequence Viewer opens without a sequence loaded. Notice that the panes to the right and
bottom are blank.
2 To retrieve a sequence from the NCBI database, select File > Download Sequence from >
NCBI.
3 In the Enter Sequence box, type an accession number for an NCBI database entry, for example,
NM_000520. Click the Nucleotide option button, and then click OK.
The MATLAB software accesses the NCBI database on the Web, loads nucleotide sequence
information for the accession number you entered, and calculates some basic statistics.
1 In the left pane tree, click Comments. The right pane displays general information about the
sequence.
2 Now click Features. The right pane displays NCBI feature information, including index numbers
for a gene and any CDS sequences.
3 Click ORF to show the search results for ORFs in the six reading frames.
4 Click Annotated CDS to show the protein coding part of a nucleotide sequence.
For instance, if you search for the word 'TAR' with the Regular Expression box checked, the app
highlights all the occurrences of 'TAA' and 'TAG' in the sequence since R = [AG].
The Sequence Viewer searches and displays the location of the selected word.
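The same kind of search can be reproduced at the command line with regexp by writing the IUPAC code R as the character class [AG]. This is a minimal sketch that assumes a made-up sequence; it is not a description of how the app performs its search.

```matlab
% Toy nucleotide sequence (hypothetical)
seq = 'ATGTAACCTAGGTAT';

% 'TAR' with R = A or G, written as a regular expression
pattern = 'TA[AG]';

% Start index and text of each match
[starts, matches] = regexp(seq, pattern, 'start', 'match');
disp(starts)
disp(matches)
```

The final TAT in the toy sequence is not reported because T is not covered by R.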
3 Clear the display by clicking the Clear Word Selection button on the toolbar.
The Sequence Viewer displays the ORFs for the six reading frames in the lower-right pane.
Hover the cursor over a frame to display information about it.
The ORF is highlighted to indicate the part of the sequence that is selected.
3 Right-click the selected ORF and then select Export to Workspace. In the Export to MATLAB
Workspace dialog box, type a variable name, for example, NM_000520_ORF_2, then click
Export.
The Sequence Viewer adds a tab at the bottom for the new sequence while leaving the original
sequence open.
5 In the left pane, click Full Translation. Select Display > Amino Acid Residue Display > One
Letter Code.
The Sequence Viewer displays the amino acid sequence below the nucleotide sequence.
seqviewer('close')
The Sequence Viewer accesses the NCBI database on the Web and loads amino acid sequence
information for the accession number you entered.
Explore a Protein Sequence Using the Sequence Viewer App
3 Select Display > Amino Acid Color Scheme, and then select Charge, Function,
Hydrophobicity, Structure, or Taylor. For example, select Function.
The display colors change to highlight functional information about the amino acid residues. The
following table shows color legends for the amino acid color schemes.
seqviewer('close')
References
[1] Taylor, W.R. (1997). Residual colours: a proposal for aminochromography. Protein Engineering 10,
7, 743–746.
Determining the similarity between two sequences is a common task in computational biology.
Starting with a nucleotide sequence for a human gene, this example uses alignment algorithms to
locate and verify a corresponding gene in a model organism.
In this example, you are interested in studying Tay-Sachs disease. Tay-Sachs is an autosomal
recessive disease caused by the absence of the enzyme beta-hexosaminidase A (Hex A). This enzyme
is responsible for the breakdown of gangliosides (GM2) in brain and nerve cells.
First, research information about Tay-Sachs and the enzyme that is associated with this disease, then
find the nucleotide sequence for the human gene that codes for the enzyme, and finally find a
corresponding gene in another organism to use as a model for study.
Your help browser opens with the Tay-Sachs disease page in the Genes and Diseases section of the
NCBI web site. This section provides a comprehensive introduction to medical genetics. In particular,
this page contains an introduction and pictorial representation of the enzyme Hex A and its role in
the metabolism of the lipid GM2 ganglioside.
The gene HEXA codes for the alpha subunit of the dimeric enzyme hexosaminidase A (Hex A), while the
gene HEXB codes for the beta subunit of the enzyme. A third gene, GM2A, codes for the GM2 activator
protein. However, it is a mutation in the gene HEXA that causes Tay-Sachs.
The following procedure illustrates how to find the nucleotide sequence for a human gene in a public
database and read the sequence information into the MATLAB environment. Many public databases
for nucleotide sequences (for example, GenBank®, EMBL-EBI) are accessible from the Web. The
MATLAB Command Window with the MATLAB Help browser provide an integrated environment for
searching the Web and bringing sequence information into the MATLAB environment.
After you locate a sequence, you need to move the sequence data into the MATLAB Workspace.
Open the MATLAB Help browser to the NCBI Web site. In the MATLAB Command Window, enter:
web('https://www.ncbi.nlm.nih.gov/')
Search for the gene you are interested in studying. For example, from the Search list, select
Nucleotide, and in the for box enter Tay-Sachs. Look for the genes that code for the alpha and beta
subunits of the enzyme hexosaminidase A (Hex A), and the gene that codes for the GM2 activator
protein. The NCBI reference for the human gene HEXA has accession number NM_000520.
Get sequence data into the MATLAB environment. For example, to get sequence information for the
human gene HEXA, enter:
humanHEXA = getgenbank('NM_000520')
LocusSequenceLength: '4785'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'PRI'
LocusModificationDate: '18-JAN-2021'
Definition: 'Homo sapiens hexosaminidase subunit alpha (HEXA), transcript variant
Accession: 'NM_000520'
Version: 'NM_000520.6'
GI: ''
Project: []
DBLink: []
Keywords: 'RefSeq; MANE Select.'
Segment: []
Source: 'Homo sapiens (human)'
SourceOrganism: [4×65 char]
Reference: {[1×1 struct] [1×1 struct] [1×1 struct] [1×1 struct] [1×1 struct]
Comment: [48×66 char]
Features: [160×74 char]
CDS: [1×1 struct]
Sequence: 'ctcacgtggccagccccctccgagaggggagaccagcgggccatgacaagctccaggctttggttttcg
SearchURL: 'https://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=NM_0005
RetrieveURL: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&
The following procedure illustrates how to find the nucleotide sequence for a mouse gene related to a
human gene, and read the sequence information into the MATLAB environment. The sequence and
function of many genes are conserved during the evolution of species through homologous genes.
Homologous genes are genes that have a common ancestor and similar sequences. One goal of
searching a public database is to find similar genes. If you are able to locate a sequence in a database
that is similar to your unknown gene or protein, it is likely that the function and characteristics of the
known and unknown genes are the same.
After finding the nucleotide sequence for a human gene, you can do a BLAST search or search in the
genome of another organism for the corresponding gene. This procedure uses the mouse genome as
an example.
web('http://www.ncbi.nlm.nih.gov')
Search the nucleotide database for the gene or protein you are interested in studying. For example,
from the Search list, select Nucleotide, and in the for box enter hexosaminidase A.
The search returns entries for the mouse and human genomes. For the purposes of this example, use
the accession number AK080777 for the mouse gene HEXA.
Get sequence information for the mouse gene into the MATLAB environment.
mouseHEXA = getgenbank('AK080777')
The following procedure illustrates how to convert a sequence from nucleotides to amino acids and
identify the open reading frames. A nucleotide sequence includes regulatory sequences before and
after the protein coding section. By analyzing this sequence, you can determine the nucleotides that
code for the amino acids in the final protein.
After you have a list of genes you are interested in studying, you can determine the protein coding
sequences. This procedure uses the human gene HEXA and mouse gene HEXA as an example.
If you did not retrieve gene data from the Web, you can load example data from a MAT-file included
with the Bioinformatics Toolbox™ software. In the MATLAB Command Window, enter:
load hexosaminidase
Locate open reading frames (ORFs) in the human gene. For example, for the human gene HEXA,
enter:
humanORFs = seqshoworfs(humanHEXA.Sequence)
seqshoworfs creates the output structure humanORFs. This structure contains the position of the
start and stop codons for all open reading frames (ORFs) on each reading frame. The figure displays
the three reading frames with the ORFs colored blue, red, and green. Notice that the longest ORF is
in the first reading frame.
The mouse gene shows the longest ORF on the first reading frame.
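As a rough illustration of what an ORF search does, the following base-MATLAB sketch scans the three forward reading frames of a short, made-up sequence for the first start codon and the next in-frame stop codon. The sequence and all variable names are hypothetical, and the logic is a simplification for illustration, not the seqshoworfs implementation.

```matlab
% Toy ORF scan over the three forward reading frames (illustrative only).
seq = 'ggatgaaaccctagtt';            % hypothetical sequence, not HEXA
stopCodons = {'taa','tag','tga'};
orfStart = zeros(1,3);               % 0 means no ORF found in that frame
orfStop  = zeros(1,3);
for frame = 1:3
    sub = seq(frame:end);
    nCodons = floor(numel(sub)/3);
    codons = cellstr(reshape(sub(1:3*nCodons),3,[])');  % one codon per cell
    iStart = find(strcmp(codons,'atg'),1);               % first start codon
    if isempty(iStart), continue; end
    isStop = ismember(codons,stopCodons);
    iStop = find(isStop(iStart+1:end),1) + iStart;       % next in-frame stop
    if isempty(iStop), continue; end
    orfStart(frame) = frame + 3*(iStart-1);              % first nt of the start codon
    orfStop(frame)  = frame + 3*iStop - 1;               % last nt of the stop codon
end
[~,longestFrame] = max(orfStop - orfStart);
longestORF = seq(orfStart(longestFrame):orfStop(longestFrame));
```

For this toy input, the only ORF is in the third reading frame. For real data, use the Start and Stop fields returned by seqshoworfs.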
The following procedure illustrates how to use global and local alignment functions to compare two
amino acid sequences. You could use alignment functions to look for similarities between two
nucleotide sequences, but alignment functions return more biologically meaningful results when you
are using amino acid sequences.
After you have located the open reading frames on your nucleotide sequences, you can convert the
protein coding sections of the nucleotide sequences to their corresponding amino acid sequences,
and then you can compare them for similarities.
Using the open reading frames identified previously, convert the human and mouse DNA sequences to
amino acid sequences. Because the longest ORFs of both the human and mouse HEXA genes are in the
first reading frame (the default), you do not need to specify a frame.
humanProtein = nt2aa(humanHEXA.Sequence);
mouseProtein = nt2aa(mouseHEXA.Sequence);
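If the coding region were in a different reading frame, you could pass the frame explicitly. The following sketch uses a made-up 12-nucleotide sequence (not HEXA) and assumes the 'Frame' option of nt2aa; the variable names are hypothetical.

```matlab
% Requires Bioinformatics Toolbox. The input is a made-up sequence.
toySeq = 'atgaaaccctag';
aaFrame1 = nt2aa(toySeq);              % frame 1 is the default; gives 'MKP*'
aaFrame2 = nt2aa(toySeq,'Frame',2);    % start translating at the second nucleotide
```

The asterisk in the frame 1 translation marks the stop codon tag.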
Draw a dot plot comparing the human and mouse amino acid sequences. Dot plots are one of the
easiest ways to look for similarity between sequences. The diagonal line shown below indicates that
there may be a good alignment between the two sequences.
warning('off','bioinfo:seqdotplot:imageTooBigForScreen');
seqdotplot(mouseProtein,humanProtein,4,3);
ylabel('Mouse hexosaminidase A (alpha subunit)')
xlabel('Human hexosaminidase A (alpha subunit)')
uif = gcf;
uif.Position(:) = [100 100 1280 800]; % Resize the figure.
warning('on','bioinfo:seqdotplot:imageTooBigForScreen');
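To make the window and match arguments (4 and 3 in the call above) more concrete, here is a minimal base-MATLAB sketch of the same idea: mark a dot wherever at least 3 of 4 consecutive residues along a diagonal match. It uses only the first 16 residues of the two protein ORFs shown later in this example, and it is an approximation of the concept, not the seqdotplot implementation. The variable names are hypothetical.

```matlab
% Windowed dot matrix (illustrative only).
s1 = 'MTSSRLWFSLLLAAAF';              % first 16 residues of the human ORF
s2 = 'MAGCRLWVSLLLAAAL';              % first 16 residues of the mouse ORF
windowLen = 4;                        % residues per diagonal window
minMatches = 3;                       % matches required to draw a dot
M = double(s1(:) == s2(:)');          % M(i,j) = 1 when s1(i) matches s2(j)
W = conv2(M, eye(windowLen), 'valid');% matches in each diagonal window
dots = W >= minMatches;               % logical dot matrix
```

A run of true values along the main diagonal of dots corresponds to the diagonal line you look for in the plot.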
Globally align the two amino acid sequences, using the Needleman-Wunsch algorithm.
[GlobalScore, GlobalAlignment] = nwalign(humanProtein,mouseProtein)
GlobalScore = 634.3333
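The dynamic programming recurrence behind a global alignment score can be sketched in a few lines of base MATLAB. This sketch assumes a toy scoring scheme (match +1, mismatch -1, linear gap penalty -1) and made-up sequences, so its scores are not comparable with those from nwalign, which uses a substitution matrix and its own gap penalties by default.

```matlab
% Needleman-Wunsch score with a toy scoring scheme (illustrative only).
s = 'ACGT';  t = 'AGT';               % hypothetical sequences
matchScore = 1; mismatchScore = -1; gapScore = -1;
n = numel(s); m = numel(t);
F = zeros(n+1,m+1);
F(:,1) = gapScore*(0:n)';             % s aligned against leading gaps
F(1,:) = gapScore*(0:m);              % t aligned against leading gaps
for i = 1:n
    for j = 1:m
        if s(i) == t(j), sub = matchScore; else, sub = mismatchScore; end
        % best of: align s(i) with t(j), gap in t, gap in s
        F(i+1,j+1) = max([F(i,j)+sub, F(i,j+1)+gapScore, F(i+1,j)+gapScore]);
    end
end
toyGlobalScore = F(end,end);          % 2 for these inputs (3 matches, 1 gap)
```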
You can also visualize the alignment in the Sequence Alignment app. The alignment is very good
between amino acid position 69 and 599, after which the two sequences appear to be unrelated.
Notice that there is a stop (*) in the sequence at this point. If you shorten the sequences to include
only the amino acids that are in the protein you might get a better alignment. Include the amino acid
positions from the first methionine (M) to the first stop (*) that occurs after the first methionine.
seqalignviewer(GlobalAlignment);
Trim the sequence from the first start amino acid (usually M) to the first stop (*) and then try
alignment again. Find the indices for the stops in the sequences.
humanStops = 1×6
mouseStops = 1×4
Looking at the amino acid sequence for humanProtein, the first M is at position 70, and the first stop
after that position is actually the second stop in the sequence (position 599). Looking at the amino
acid sequence for mouseProtein, the first M is at position 11, and the first stop after that position is
the first stop in the sequence (position 557).
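The index bookkeeping described above can be sketched with logical indexing in base MATLAB. The protein string below is made up for illustration, and the variable names are hypothetical; with the real data you would apply the same steps to humanProtein and mouseProtein.

```matlab
% Locate stops, the first M, and the first stop after that M (toy data).
toyProtein = 'AR*LLMTSSK*QQ*';                   % hypothetical translation
toyStops = find(toyProtein == '*');              % indices of all stops: [3 11 14]
firstM = find(toyProtein == 'M', 1);             % index of the first methionine: 6
stopAfterM = toyStops(find(toyStops > firstM, 1)); % first stop after that M: 11
toyORF = toyProtein(firstM:stopAfterM);          % 'MTSSK*'
```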
Truncate the sequences to include only amino acids in the protein and the stop.
humanProteinORF = humanProtein(70:humanStops(2))
humanProteinORF =
'MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLV
mouseProteinORF = mouseProtein(11:mouseStops(1))
mouseProteinORF =
'MAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYTLYPNNFQFRYHVSSAAQAGCVVLDEAFRRYRNLLFGSGSWPRPSFSNKQQTLGKNILV
Another way to truncate an amino acid sequence to only those amino acids in the protein is to first
truncate the nucleotide sequence with indices from the seqshoworfs function. Remember that the
ORF for the human HEXA gene and the ORF for the mouse HEXA were both on the first reading
frame.
humanORFs = seqshoworfs(humanHEXA.Sequence)
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(1).Start(1):humanORFs(1).Stop(1)));
mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):mouseORFs(1).Stop(1)));
[GlobalScore2, GlobalAlignment2] = nwalign(humanPORF, mousePORF);
seqalignviewer(GlobalAlignment2);
Truncating a nucleotide sequence before converting it to an amino acid sequence gives the same
result as truncating the amino acid sequence after conversion. An alternative to working with
subsequences is to use a local alignment function with the nontruncated sequences.
Locally align the two amino acid sequences using the Smith-Waterman algorithm.
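A local alignment call has the same shape as the global one. The following sketch uses two short toy peptides rather than the HEXA proteins, so it can run without the example data; the variable names are hypothetical. With the real data, you would pass humanProtein and mouseProtein instead.

```matlab
% Requires Bioinformatics Toolbox. Toy peptides, not the HEXA proteins.
seqA = 'VSPAGMASGYD';
seqB = 'IPGKASYD';
[toyLocalScore, toyLocalAlignment] = swalign(seqA, seqB);
disp(toyLocalAlignment)   % three rows: seqA segment, match symbols, seqB segment
```

Because a local alignment reports only the best-matching segments, unrelated flanking regions do not lower the score the way they do in a global alignment.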
close all;
See Also
swalign | nwalign
View and Align Multiple Sequences
In this section...
“Overview of the Sequence Alignment App”
“Visualize Multiple Sequence Alignment”
“Adjust Sequence Alignments Manually”
“Rearrange Rows”
“Generate Phylogenetic Tree from Aligned Sequences”
gagaa = multialignread('aagag.aln')
2 View the aligned sequences in the Sequence Alignment app.
seqalignviewer(gagaa);
1 To better visualize the sequence alignments, you can zoom in by selecting Display > Zoom in.
Select this option multiple times until you achieve the zoom level you want.
2 Identify an area where you could improve the alignment.
3 Click a letter or a region. The selected region is the center block. You can then drag the
sequence(s) to the left or right of the center block.
4 To move a single letter (T in this example), click and drag the letter T (center block) to the right
to insert a gap.
6 You can also move multiple residues (a subsequence). Suppose you want to move a subsequence
to available gaps. First select the gap region that you want to fill in.
7 Drag the subsequence(s) from the right or left of the gap region into the gap area.
8 Suppose you want to remove one or more of the aligned sequences. First select the sequence(s)
to be removed. Then select Edit > Delete Sequences.
10 After the edit, you can export the aligned sequences or consensus sequence to a FASTA file or
MATLAB Workspace from the File menu.
Rearrange Rows
You can move the rows (sequences) up or down by one row. You can also move selected rows to the
top or bottom of the list.
Select Display > View Tree > Selected... to generate a tree from selected sequences.
A phylogenetic tree for the sequences is displayed in the Phylogenetic Tree app. For details on the
app, see “Using the Phylogenetic Tree App” on page 5-2.
See Also
seqalignviewer | Sequence Alignment | Sequence Viewer | Genomics Viewer
More About
• “Sequence Alignments” on page 1-7
• “Aligning Pairs of Sequences” on page 3-177
Analyzing Synonymous and Nonsynonymous Substitution Rates
This example shows how the analysis of synonymous and nonsynonymous mutations at the nucleotide
level can suggest patterns of molecular adaptation in the genome of HIV-1. This example is based on
the discussion of natural selection at the molecular level presented in Chapter 6 of "Introduction to
Computational Genomics. A Case Studies Approach" [1].
Introduction
The human immunodeficiency virus 1 (HIV-1) is the more geographically widespread of the two viral
strains that cause Acquired Immunodeficiency Syndrome (AIDS) in humans. Because the virus evolves
rapidly and constantly, there is currently neither a cure nor a vaccine against HIV infection. HIV
has a very high mutation rate that allows it to evade the response of our immune system as
well as the action of specific drugs. At the same time, however, the rapid evolution of HIV provides a
powerful mechanism that reveals important insights into its function and resistance to drugs. By
estimating the strength of selective pressures (positive and purifying selection) across various
regions of the viral genome, we can gain a general understanding of how the virus evolves. In
particular, we can determine which genes evolve in response to the action of the targeted immune
system and which genes are conserved because they are involved in some of the virus's essential
functions.
Nonsynonymous mutations to a DNA sequence cause a change in the translated amino acid sequence,
whereas synonymous mutations do not. The comparison between the number of nonsynonymous
mutations (dn or Ka), and the number of synonymous mutations (ds or Ks), can suggest whether, at
the molecular level, natural selection is acting to promote the fixation of advantageous mutations
(positive selection) or to remove deleterious mutations (purifying selection). In general, when positive
selection dominates, the Ka/Ks ratio is greater than 1; in this case, diversity at the amino acid level is
favored, likely due to the fitness advantage provided by the mutations. Conversely, when negative
(purifying) selection dominates, the Ka/Ks ratio is less than 1; in this case, most amino acid
changes are deleterious and, therefore, are selected against. When the positive and negative
selection forces balance each other, the Ka/Ks ratio is close to 1.
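The interpretation rule can be written as a small worked example. The rates below are hypothetical values chosen for illustration, and the tolerance used for "close to 1" is an arbitrary assumption, not a statistical test.

```matlab
% Classify hypothetical Ka/Ks ratios (illustrative only).
dn = [0.02 0.15 0.08];          % hypothetical nonsynonymous rates (Ka)
ds = [0.20 0.10 0.08];          % hypothetical synonymous rates (Ks)
ratio = dn ./ ds;               % approximately [0.1 1.5 1.0]
tol = 0.05;                     % arbitrary tolerance for "close to 1"
label = repmat({'balanced (ratio close to 1)'}, size(ratio));
label(ratio > 1 + tol) = {'positive selection'};
label(ratio < 1 - tol) = {'purifying selection'};
```

In practice, the variance estimates that dnds can also return are a better guide to whether a ratio differs meaningfully from 1.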
Download two genomic sequences of HIV-1 (GenBank® accession numbers AF033819 and M27323).
For each encoded gene we extract relevant information, such as nucleotide sequence, translated
sequence and gene product name.
hiv1(1) = getgenbank('AF033819');
hiv1(2) = getgenbank('M27323');
For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in
public repositories is frequently curated and updated; therefore the results of this example might be
slightly different when you use up-to-date datasets.
load hiv1.mat
genes1 = featureparse(hiv1(1),'feature','CDS','Sequence','true');
genes2 = featureparse(hiv1(2),'feature','CDS','Sequence','true');
Align the corresponding protein sequences in the two HIV-1 genomes and use the resulting alignment
as a guide to insert the appropriate gaps in the nucleotide sequences. Then calculate the Ka/Ks ratio
for each individual gene and plot the results.
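KaKs = zeros(1,numel(genes1));
for iCDS = 1:numel(genes1)
    % align aa sequences of corresponding genes
    [score,alignment] = nwalign(genes1(iCDS).translation,genes2(iCDS).translation);
    % use the protein alignment to insert gaps into the nucleotide sequences
    seq1 = seqinsertgaps(genes1(iCDS).Sequence,alignment(1,:));
    seq2 = seqinsertgaps(genes2(iCDS).Sequence,alignment(3,:));
    % calculate nonsynonymous and synonymous substitution rates
    [dn,ds] = dnds(seq1,seq2);
    KaKs(iCDS) = dn/ds;
end
KaKs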
KaKs =
Columns 1 through 7
Columns 8 through 9
0.5115 0.3338
All the considered genes, with the exception of TAT, have a total Ka/Ks less than 1. This is in
accordance with the fact that most protein-coding genes are considered to be under the effect of
purifying selection. Indeed, the majority of observed mutations are synonymous and do not affect the
integrity of the encoded proteins. As a result, the number of synonymous mutations generally exceeds
the number of nonsynonymous mutations. The case of TAT represents a well-known exception; at the
amino acid level, the TAT protein is one of the least conserved among the viral proteins.
Oftentimes, different regions of a single gene can be exposed to different selective pressures. In these
cases, calculating Ka/Ks over the entire length of the gene does not provide a detailed picture of the
evolutionary constraints associated with the gene. For example, the total Ka/Ks associated with the
ENV gene is 0.5115. However, the ENV gene encodes the envelope glycoprotein GP160, which in
turn is the precursor of two proteins: GP120 (residues 31-511 in AF033819) and GP41 (residues
512-856 in AF033819). GP120 is exposed on the surface of the viral envelope and performs the first
step of HIV infection; GP41 is non-covalently bonded to GP120 and is involved in the second step of
HIV infection. Thus, we can expect these two proteins to respond to different selective pressures, and
a global analysis on the entire ENV gene can obscure diversified behavior. For this reason, we
conduct a finer analysis by using sliding windows of different sizes.
Align ENV genes of the two genomes and measure the Ka/Ks ratio over sliding windows of size equal
to 5, 45, and 200 codons.
env = 8; % ORF number corresponding to gene ENV
[score,alignment] = nwalign(genes1(env).translation,genes2(env).translation);
env_1 = seqinsertgaps(genes1(env).Sequence,alignment(1,:));
env_2 = seqinsertgaps(genes2(env).Sequence,alignment(3,:));
The choice of the sliding window size can be problematic: windows that are too long (in this example,
200 codons) average across long regions of a single gene, thus hiding segments where Ka/Ks
behaves differently from its surroundings. Windows that are too short (in this example, 5 codons)
are likely to produce results that are very noisy and therefore not very meaningful. In the case of
the ENV gene, a sliding window of 45 codons seems to be appropriate. In the plot, although the
general trend is below the threshold of 1, we observe several peaks over the threshold of 1. These
regions appear to undergo positive selection that favors amino acid diversity, as it provides some
fitness advantage.
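The averaging effect of the window size can be seen on a synthetic signal. The sketch below is not a Ka/Ks calculation: it applies a plain moving average (movmean) to a hypothetical per-codon signal with a low background and one short elevated segment, using the three window sizes from this example. All values and variable names are assumptions for illustration.

```matlab
% Effect of window length on a localized peak (synthetic data).
signal = 0.2*ones(1,600);            % hypothetical background, below 1
signal(300:310) = 3;                 % hypothetical 11-codon elevated segment
windowSizes = [5 45 200];
peak = zeros(size(windowSizes));
for k = 1:numel(windowSizes)
    smoothed = movmean(signal, windowSizes(k));
    peak(k) = max(smoothed);         % highest windowed average
end
crossesThreshold = peak > 1;         % does any window average exceed 1?
```

Only the 5-codon window keeps the elevated segment above the threshold of 1 here; the 45- and 200-codon windows dilute it below 1. Real data also contains noise, which is what makes very short windows hard to interpret.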
Using Sliding Window Analyses for GAG, POL and ENV Genes
You can perform similar analyses on other genes that display a global Ka/Ks ratio less than 1.
Compute the global Ka/Ks ratio for the GAG, POL and ENV genes. Then repeat the calculation using a
sliding window.
% display the global Ka/Ks for the GAG, POL and ENV genes
KaKs(gene_index)
for i = 1:numel(gene_index)
ID = gene_index(i);
[score,alignment] = nwalign(genes1(ID).translation,genes2(ID).translation);
s1 = seqinsertgaps(genes1(ID).Sequence,alignment(1,:));
s2 = seqinsertgaps(genes2(ID).Sequence,alignment(3,:));
ans =
The GAG (Group-specific Antigen) gene provides the basic physical infrastructure of the virus. It
codes for p24 (the viral capsid), p6 and p7 (the nucleocapsid proteins), and p17 (a matrix protein).
Because this gene encodes many fundamental proteins that are structurally important for the
survival of the virus, the number of synonymous mutations exceeds the number of nonsynonymous
mutations (that is, Ka/Ks < 1). Thus, these proteins are expected to be constrained by purifying
selection to maintain viral infectivity.
The POL gene codes for viral enzymes, such as reverse transcriptase, integrase, and protease. These
enzymes are essential to the virus's survival and, therefore, the selective pressure to preserve their
function and structural integrity is quite high. Consequently, this gene appears to be under purifying
selection, and we observe Ka/Ks ratio values less than 1 for the majority of the gene length.
The ENV gene codes for the precursor to GP120 and GP41, proteins embedded in the viral envelope,
which enable the virus to attach to and fuse with target cells. GP120 initiates infection of a target
cell by binding to the CD4 receptor. As a consequence, GP120 has to maintain the mechanism of
recognition of the host cell and at the same time avoid detection by the immune system. These two
roles are carried out by different parts of the protein, as shown by the trend in the Ka/Ks ratio.
This viral protein is undergoing purifying selection (Ka/Ks < 1) and positive selection (Ka/Ks > 1)
in different regions. A similar trend is observed in GP41.
The glycoprotein GP120 binds to the CD4 receptor of target cells, particularly helper T cells.
This represents the first step of HIV infection and, therefore, GP120 was among the first proteins
studied with the intent of finding an HIV vaccine. It is interesting to determine which regions of
GP120 appear to undergo purifying selection, as indicators of protein regions that are functionally
or structurally important for the virus's survival and could potentially represent drug targets.
From the ENV genes, extract the sequences coding for GP120. Compute Ka/Ks over a sliding window of
45 codons. Plot the Ka/Ks trend and overlay the locations of four T cell epitopes for GP120.
figure
hold on
% plot Ka/Ks relative to the middle codon of the sliding window
plot(windowSize/2+(1:numel(dn120)),dn120./ds120)
plot(epiLoc,[1 1],'linewidth',5)
line([0 numel(dn120)+windowSize/2],[1 1],'LineStyle',':')
title('GP120, Ka / Ks and epitopes');
ylabel('Ka / Ks');
xlabel('sliding window (middle codon)');
Although the general trend of the Ka/Ks ratio is less than 1, there are some regions where the ratio is
greater than one, indicating that these regions are likely to be under positive selection. Interestingly,
the location of some of these regions corresponds to the presence of T cell epitopes, identified by
cellular methods. These segments display high amino acid variability because amino acid diversity in
these regions allows the virus to evade recognition by the host immune system. Thus, we can conclude
that the source of variability in these regions is likely to be the host immune response.
References
[1] Cristianini, N. and Hahn, M.W., "Introduction to Computational Genomics: A Case Studies
Approach", Cambridge University Press, 2007.
[2] Siebert, S.A., et al., "Natural Selection on the gag, pol, and env Genes of Human
Immunodeficiency Virus 1 (HIV-1)", Molecular Biology and Evolution, 12(5):803-813, 1995.
Investigating the Bird Flu Virus
This example shows how to calculate Ka/Ks ratios for eight genes in the H5N1 and H2N3 virus
genomes, and perform a phylogenetic analysis on the HA gene from H5N1 virus isolated from
chickens across Africa and Asia. For the phylogenetic analysis, you will reconstruct a neighbor-joining
tree and create a 3-D plot of sequence distances using multidimensional scaling. Finally, you will map
the geographic locations where each HA sequence was found on a regional map. Sequences used in
this example were selected from the bird flu case study on the Computational Genomics Website [1].
Note: The final section in this example requires the Mapping Toolbox™.
Introduction
There are three types of influenza virus: Type A, B and C. All influenza genomes are comprised of
eight segments or genes that code for polymerase B2 (PB2), polymerase B1 (PB1), polymerase A (PA),
hemagglutinin (HA), nucleoprotein (NP), neuraminidase (NA), matrix (M1), and non-structural (NS1)
proteins. Note: Type C virus has hemagglutinin-esterase (HE), a homolog to HA.
Of the three types of influenza, Type A has the potential to be the most devastating. It affects birds
(its natural reservoir), humans, and other mammals and has been the major cause of global influenza
epidemics. Type B affects only humans, causing local epidemics, and Type C does not tend to cause
serious illness.
Type A influenzas are further classified into different subtypes according to variations in the amino
acid sequences of HA (H1-16) and NA (N1-9) proteins. Both proteins are located on the outside of the
virus. HA attaches the virus to the host cell and then aids in fusing the virus into the
cell. NA clips the newly created virus from the host cell so it can move on to a healthy new cell.
Differences in amino acid composition within a protein and recombination of the various HA and NA
proteins contribute to the ability of Type A influenzas to jump host species (for example, from birds
to humans) and to their wide range of severity. Many new drugs are being designed to target HA and
NA proteins [2,3,4].
In 1997, the H5N1 subtype of the avian influenza virus, a Type A influenza virus, made an unexpected
jump to humans in Hong Kong, causing the deaths of six people. To control the rapidly spreading
disease, all poultry in Hong Kong was destroyed. Sequence analysis of the H5N1 virus is shown here
[2,4].
An investigation of the Ka/Ks ratios for each gene segment of the H5N1 virus will provide some
insight into how each is changing over time. Ka/Ks is the ratio of nonsynonymous changes to
synonymous changes in a sequence. For a more detailed explanation of Ka/Ks ratios, see “Analyzing
Synonymous and Nonsynonymous Substitution Rates” on page 3-55. To calculate Ka/Ks, you need a
copy of the gene from two time points. You can use H5N1 virus isolated from chickens in Hong Kong
in 1997 and 2001. For comparison, you can include H2N3 virus isolated from mallard ducks in
Alberta in 1977 and 1985 [1].
For the purpose of this example, sequence data is provided in four MATLAB® structures that were
created by genbankread.
load('birdflu.mat','chicken1997','chicken2001','mallard1977','mallard1985')
Data in public repositories is frequently curated and updated. You can retrieve the up-to-date
datasets by using the getgenbank function. Note that if data has indeed changed, the results of this
example might be slightly different when you use up-to-date datasets.
chicken1997 = arrayfun(@(x)getgenbank(x{:}),{chicken1997.Accession});
chicken2001 = arrayfun(@(x)getgenbank(x{:}),{chicken2001.Accession});
mallard1977 = arrayfun(@(x)getgenbank(x{:}),{mallard1977.Accession});
mallard1985 = arrayfun(@(x)getgenbank(x{:}),{mallard1985.Accession});
You can extract just the coding portion of the nucleotide sequences using the featureparse
function. The featureparse function returns a structure with fields containing information from the
Features section in a GenBank file, along with a Sequence field that contains just the coding
sequence.
for ii = 1:numel(chicken1997)
ntSeq97{ii} = featureparse(chicken1997(ii),'feature','cds','sequence',true);
ntSeq01{ii} = featureparse(chicken2001(ii),'feature','cds','sequence',true);
ntSeq77{ii} = featureparse(mallard1977(ii),'feature','cds','sequence',true);
ntSeq85{ii} = featureparse(mallard1985(ii),'feature','cds','sequence',true);
end
ntSeq97{1}
ans =
Location: '<1..>2273'
Indices: [1 2273]
UnknownFeatureBoundaries: 1
gene: 'PB2'
codon_start: '1'
product: 'PB2 protein'
protein_id: 'AAF02361.1'
db_xref: 'GI:6048850'
translation: 'RIKELRDLMSQSRTREILTKTTVDHMAIIKKYTSGRQEKNPALRMKWMMAMKYPITADKRIMEMIP
Sequence: 'agaataaaagaactaagagatttgatgtcgcaatctcgcacacgcgagatactgacaaaaaccact
Visual inspection of the sequence structures revealed some of the genes have splice variants
represented in the GenBank files. Because this analysis is only on PB2, PB1, PA, HA, NP, NA, M1, and
NS1 genes, you need to remove any splice variants.
ntSeq97{7}(1) = [];% M2
ntSeq97{8}(1) = [];% NS2
You need to align the nucleotide sequences to calculate the Ka/Ks ratio. Align the protein sequences
for each gene (available in the 'translation' field) using the nwalign function, then insert gaps
into the nucleotide sequences using seqinsertgaps. Use the function dnds to calculate nonsynonymous
and synonymous substitution rates for each of the eight genes in the virus genomes. If you are
interested in seeing the sequence alignments, set the 'verbose' option to true when using dnds.
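The reason for aligning proteins first is that each gap in the protein alignment must become a three-nucleotide gap, so that codons stay in register. The following base-MATLAB sketch shows that idea on made-up data. It illustrates the concept only and is not the seqinsertgaps implementation; the variable names are hypothetical.

```matlab
% Codon-aware gap insertion guided by an aligned protein (toy data).
alignedAA = 'MK-P';                      % hypothetical aligned protein row
nt = 'atgaaaccc';                        % coding sequence for M, K, P
codons = reshape(nt,3,[])';              % one codon per row
gappedNT = repmat('-', numel(alignedAA), 3);  % start with all-gap codons
gappedNT(alignedAA ~= '-', :) = codons;  % fill non-gap positions in order
gappedNT = reshape(gappedNT', 1, []);    % 'atgaaa---ccc'
```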
proteins = {'PB2','PB1','PA','HA','NP','NA','M1','NS1'};
H5N1 Virus
for ii = 1:numel(ntSeq97)
[sc,align] = nwalign(ntSeq97{ii}.translation,ntSeq01{ii}.translation,'alpha','aa');
ch97seq = seqinsertgaps(ntSeq97{ii}.Sequence,align(1,:));
ch01seq = seqinsertgaps(ntSeq01{ii}.Sequence,align(3,:));
[dn,ds] = dnds(ch97seq,ch01seq);
H5N1.(proteins{ii}) = dn/ds;
end
H2N3 Virus
for ii = 1:numel(ntSeq77)
[sc,align] = nwalign(ntSeq77{ii}.translation,ntSeq85{ii}.translation,'alpha','aa');
ch77seq = seqinsertgaps(ntSeq77{ii}.Sequence,align(1,:));
ch85seq = seqinsertgaps(ntSeq85{ii}.Sequence,align(3,:));
[dn,ds] = dnds(ch77seq,ch85seq);
H2N3.(proteins{ii}) = dn/ds;
end
H5N1
H2N3
H5N1 =
PB2: 0.0226
PB1: 0.0240
PA: 0.0307
HA: 0.0943
NP: 0.0517
NA: 0.1015
M1: 0.0460
NS1: 0.3010
H2N3 =
PB2: 0.0048
PB1: 0.0021
PA: 0.0089
HA: 0.0395
NP: 0.0071
NA: 0.0559
M1: 0
NS1: 0.1954
Note: Ka/Ks ratio results may vary from those shown in [1] due to sequence splice variants.
H5N1rates = cellfun(@(x)(H5N1.(x)),proteins);
H2N3rates = cellfun(@(x)(H2N3.(x)),proteins);
bar3([H2N3rates' H5N1rates']);
ax = gca;
ax.XTickLabel = {'H2N3','H5N1'};
ax.YTickLabel = proteins;
zlabel('Ka/Ks');
view(-115,16);
title('Ka/Ks Ratios for H5N1 (Chicken) and H2N3 (Mallard) Viruses');
NS1, HA, and NA have larger nonsynonymous-to-synonymous ratios than the other genes in
both H5N1 and H2N3. Protein sequence changes to these genes have been attributed to an increase
in H5N1 pathogenicity. In particular, changes to the HA gene may give the virus the ability to
transfer into species other than birds [2,3].
The H5N1 virus attaches to cells in the gastrointestinal tract of birds and the respiratory tract of
humans. Changes to the HA protein, which helps bind the virus to a healthy cell and facilitates its
incorporation into the cell, are what allow the virus to affect different organs in the same and
different species. This may provide it the ability to jump from birds to humans [2,3]. You can perform
a phylogenetic analysis of the HA protein from H5N1 virus isolated from chickens at different times
(years) in different regions of Asia and Africa to investigate their relationship to each other.
Load HA amino acid sequence data from 16 regions/times from the provided MAT-file birdflu.mat,
or retrieve the up-to-date sequence data from the NCBI repository using the getgenpept function.
load('birdflu.mat','HA')
HA = arrayfun(@(x)getgenpept(x{:}),{HA.Accession});
Create a new structure array containing fields corresponding to amino acid sequence (Sequence) and
source information (Header). You can extract source information from the HA using featureparse
then parse with regexp.
for ii = 1:numel(HA)
source = featureparse(HA(ii),'feature','source');
strain = regexp(source.strain,'A/[Cc]hicken/(\w+\s*\w*).*/(\d+)','tokens');
proteinHA(ii).Header = sprintf('%s_%s',char(strain{1}(1)),char(strain{1}(2)));
proteinHA(ii).Sequence = HA(ii).Sequence;
end
proteinHA(1)
ans =
Header: 'Nigeria_2006'
Sequence: 'mekivllfaivslvksdqicigyhannsteqvdtimeknvtvthaqdilekthngklcdldgvkplilrdcsvagwllgnpm
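To see what the regular expression in the loop captures, you can try it on standalone strings. The strain names below are hypothetical examples written in the usual influenza naming style; the actual strain fields in the data may differ.

```matlab
% Try the strain-parsing pattern on made-up strain names.
pattern = 'A/[Cc]hicken/(\w+\s*\w*).*/(\d+)';
toyStrains = {'A/chicken/Nigeria/641/2006','A/Chicken/Hong Kong/220/97'};
toyHeaders = cell(size(toyStrains));
for k = 1:numel(toyStrains)
    tok = regexp(toyStrains{k},pattern,'tokens');       % {{location, year}}
    toyHeaders{k} = sprintf('%s_%s',tok{1}{1},tok{1}{2});
end
% toyHeaders is {'Nigeria_2006','Hong Kong_97'}
```

The first token takes one or two words after the host name, the greedy .* skips any isolate number, and the second token takes the digits after the last slash.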
Align the HA amino acid sequences using multialign and visualize the alignment with
seqalignviewer.
alignHA = multialign(proteinHA);
seqalignviewer(alignHA);
Calculate the distances between sequences using seqpdist with the Jukes-Cantor method. Use
seqneighjoin to reconstruct a phylogenetic tree using the neighbor-joining method.
seqneighjoin returns a phytree object.
distHA = seqpdist(alignHA,'method','Jukes-Cantor','alpha','aa');
HA_NJtree = seqneighjoin(distHA,'equivar',alignHA);
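For intuition about the distance measure, the following base-MATLAB sketch computes the proportion of differing sites between two short aligned fragments and applies the standard Jukes-Cantor correction for a 20-letter alphabet, d = -(19/20) ln(1 - (20/19) p). The second fragment is a made-up variant, and the assumption that seqpdist applies this same form (including its handling of gaps) should be checked against the seqpdist documentation.

```matlab
% Jukes-Cantor distance for two aligned amino acid fragments (toy data).
fragA = 'MEKIVLLFAI';                     % first 10 residues of an HA sequence
fragB = 'MEKIVLLLAI';                     % hypothetical variant with one change
p = mean(fragA ~= fragB);                 % proportion of differing sites: 0.1
dJC = -(19/20)*log(1 - (20/19)*p);        % corrected distance, slightly above p
```

The correction accounts for multiple substitutions at the same site, so the corrected distance is never smaller than p.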
Use the view method associated with phytree objects to open the tree in the Phylogenetic Tree
app.
view(HA_NJtree);
Another way to visualize the relationship between sequences is to use multidimensional scaling
(MDS) with the distances calculated for the phylogenetic tree. This functionality is provided by the
cmdscale function in Statistics and Machine Learning Toolbox™.
[Y,eigvals] = cmdscale(distHA);
You can use the eigenvalues returned by cmdscale to help guide your decision of whether to use the
first two or three dimensions in your plot.
sigVecs = [1:3;eigvals(1:3)';eigvals(1:3)'/max(abs(eigvals))];
report = ['Dimension Eigenvalues Normalized' ...
sprintf('\n %d\t %1.4f %1.4f',sigVecs)];
display(report);
report =
2 0.0028 0.4462
3 0.0014 0.2209'
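To see where such eigenvalues come from, the following sketch carries out classical multidimensional scaling by hand in base MATLAB. It is an illustration of the technique, not the implementation of cmdscale; the four points and all variable names are hypothetical, and the distances are supplied as a square matrix.

```matlab
% Four toy points at the corners of a 3-by-4 rectangle (hypothetical data).
P = [0 0; 3 0; 0 4; 3 4];
n = size(P,1);

% Pairwise Euclidean distance matrix (stands in for the sequence distances).
D = zeros(n);
for a = 1:n
    for b = 1:n
        D(a,b) = norm(P(a,:) - P(b,:));
    end
end

% Double-center the squared distances, then eigendecompose.
J = eye(n) - ones(n)/n;
B = -0.5 * J * (D.^2) * J;
[V,L] = eig((B + B')/2);
[eigvals,order] = sort(diag(L),'descend');
V = V(:,order);

% Keep the dimensions with clearly nonzero eigenvalues.
k = 2;
Y = V(:,1:k) * diag(sqrt(eigvals(1:k)));

% Distances between the embedded points reproduce the input distances.
Dhat = zeros(n);
for a = 1:n
    for b = 1:n
        Dhat(a,b) = norm(Y(a,:) - Y(b,:));
    end
end
normalized = eigvals / max(abs(eigvals));
```

Because the toy points lie in a plane, only two eigenvalues are nonzero. With real sequence distances, the normalized eigenvalues indicate how much each additional dimension contributes.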
The first two dimensions represent a large portion of the data, but the third still contains information
that might help resolve clusters in the sequence data. You can create a three-dimensional scatter plot
using the plot3 function.
locations = {proteinHA(:).Header};
figure
plot3(Y(:,1),Y(:,2),Y(:,3),'ok');
text(Y(:,1)+0.002,Y(:,2),Y(:,3)+0.001,locations,'interpreter','none');
title('MDS Plot of HA Sequences');
view(-21,12);
Clusters appear to correspond to groupings in the phylogenetic tree. Find the sequences belonging to
each cluster using the subtree method of phytree. One of the required inputs of subtree is the node
number (number of leaves + branch number), which becomes the root node of the new subtree. In
this example, the cluster containing Hebei and Hong Kong in the MDS plot is equivalent to the
subtree whose root node is Branch 14, which is Node 30 (16 leaves + branch number 14).
cluster1 = get(subtree(HA_NJtree,30),'LeafNames');
cluster2 = get(subtree(HA_NJtree,21),'LeafNames');
cluster3 = get(subtree(HA_NJtree,19),'LeafNames');
[cl1,cl1_ind] = intersect(locations,cluster1);
[cl2,cl2_ind] = intersect(locations,cluster2);
[cl3,cl3_ind] = intersect(locations,cluster3);
[cl4,cl4_ind] = setdiff(locations,{cl1{:} cl2{:} cl3{:}});
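The node arithmetic and the index bookkeeping in the listing above can be tried on toy data. This is a minimal sketch with hypothetical labels; it does not use a phytree object.

```matlab
% Node numbering: leaves come first, then branches.
numLeaves = 16;
branchIdx = 14;
nodeIdx = numLeaves + branchIdx;   % node number passed to subtree

% Toy labels (hypothetical) standing in for locations and one cluster.
locations = {'A_2001','B_2002','C_2003','D_2004','E_2005'};
cluster1  = {'D_2004','B_2002'};

% intersect returns the matching labels and their positions in locations.
[cl1,cl1_ind] = intersect(locations,cluster1);

% setdiff returns the labels not assigned to the cluster, with positions.
[rest,rest_ind] = setdiff(locations,cl1);
```

The index outputs are what allow the same cluster membership to be reused later for plotting and for the map attributes.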
Change the color and marker symbols on the MDS plot to correspond to each cluster.
h = plot3(Y(cl1_ind,1),Y(cl1_ind,2),Y(cl1_ind,3),'^',...
Y(cl2_ind,1),Y(cl2_ind,2),Y(cl2_ind,3),'o',...
Y(cl3_ind,1),Y(cl3_ind,2),Y(cl3_ind,3),'d',...
Y(cl4_ind,1),Y(cl4_ind,2),Y(cl4_ind,3),'v');
numClusters = 4;
col = autumn(numClusters);
for i = 1:numClusters
h(i).MarkerFaceColor = col(i,:);
end
set(h(:),'MarkerEdgeColor','k');
text(Y(:,1)+0.002,Y(:,2),Y(:,3),locations,'interpreter','none');
title('MDS Plot of HA Sequences');
view(-21,12);
For more detailed information on using Ka/Ks ratios, phylogenetics, and MDS for sequence analysis,
see Cristianini and Hahn [5].
Displaying Geographic Regions of the H5N1 Virus on a Map of Africa and Asia
Using tools from Mapping Toolbox, you can plot the location where each virus was isolated on a map
of Africa and Asia. To do this, you need the latitude and longitude for each location. For information
on finding geospatial data on the internet, see “Find Geospatial Data Online” (Mapping Toolbox).
Latitude and longitude for the capital city of each geographic region where the viruses were isolated
are provided for this example.
Create a geostruct structure, regionHA, that contains the geographic information for each feature,
or sequence, to be displayed. A geostruct is required to have Geometry, Lat, and Lon fields that
specify the feature type, latitude and longitude. This information is used by mapping functions in
Mapping Toolbox to display geospatial data.
[regionHA(1:16).Geometry] = deal('Point');
[regionHA(:).Lat] = deal(9.10, 34.31, 15.31, 39.00, 39.00, 39.00, 55.26,...
15.56, 34.00, 33.14, 34.20, 23.00, 37.35, 44.00,...
22.11, 22.11);
[regionHA(:).Lon] = deal(7.10, 69.08, 32.35, 116.00, 116.00, 116.00,...
65.18, 105.48, 114.00, 131.36, 131.40, 113.00,...
127.00, 127.00, 114.14, 114.14);
A geostruct can also have attribute fields that contain additional information about each feature. Add
attribute fields Name and Cluster to the regionHA structure. The Cluster field contains the
sequence's cluster number, which you will use to identify the sequences' cluster membership.
[regionHA(:).Name] = deal(proteinHA.Header);
[regionHA(cl1_ind).Cluster] = deal(1);
[regionHA(cl2_ind).Cluster] = deal(2);
[regionHA(cl3_ind).Cluster] = deal(3);
[regionHA(cl4_ind).Cluster] = deal(4);
regionHA(1)
ans =
Geometry: 'Point'
Lat: 9.1000
Lon: 7.1000
Name: 'Nigeria_2006'
Cluster: 3
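The deal-based assignments used to build the geostruct work on any structure array. The following sketch uses a hypothetical three-element structure named pts with made-up coordinates, and needs only base MATLAB.

```matlab
% Create three point features (hypothetical coordinates).
[pts(1:3).Geometry] = deal('Point');

% deal with several inputs assigns one value per element.
[pts(:).Lat] = deal(9.10, 34.31, 15.31);
[pts(:).Lon] = deal(7.10, 69.08, 32.35);

% Assign a cluster number to a subset of elements by index.
idx = [1 3];
[pts(idx).Cluster] = deal(2);

disp(pts(3))
```

Elements that are not in idx keep an empty Cluster field until a value is assigned to them.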
Create a structure using the makesymbolspec function, which will contain marker and color
specifications for each marker to be displayed on the map. You will pass this structure to the
geoshow function. Symbol markers and colors are set to correspond with the clusters in MDS plot.
clusterSymbols = makesymbolspec('Point',...
{'Cluster',1,'Marker', '^'},...
{'Cluster',2,'Marker', 'o'},...
{'Cluster',3,'Marker', 'd'},...
{'Cluster',4,'Marker', 'v'},...
{'Cluster',[1 4],'MarkerFaceColor',autumn(4)},...
{'Default','MarkerSize', 6},...
{'Default','MarkerEdgeColor','k'});
Load the mapping information and use the geoshow function to plot virus locations on a map.
load coast
load topo
figure
fig = gcf;
fig.Renderer = 'zbuffer';
worldmap([-45 85],[0 160])
setm(gca,'mapprojection','robinson',...
'plabellocation',30,'mlabelparallel',-45,'mlabellocation',30)
plotm(lat, long)
geoshow(topo, topolegend, 'DisplayType', 'texturemap')
demcmap(topo)
brighten(.60)
geoshow(regionHA,'SymbolSpec',clusterSymbols);
title('Geographic Locations of HA Sequence in Africa and Asia')
Using the kmlwrite function from Mapping Toolbox, you can write the location and annotation
information for each sequence to a KML-formatted file. Google Earth displays geographic data from
KML files within its Earth browser. Mapping Toolbox's kmlwrite function translates a geostruct,
such as regionHA, into a KML-formatted file to be used by Google Earth. For more information on
kmlwrite, see “Exporting Vector Data to KML” (Mapping Toolbox).
You can further annotate each sequence with information from the Features section of the GenBank
file using the featureparse function. You can then use this information to populate the geostruct,
regionHA, and display it in table form as a description tag for each placemark in the Google Earth
browser. In a geostruct, the mandatory fields are Geometry, Lat, and Lon. All other fields are
considered to be attributes of the placemark.
for i = 1:numel(HA)
feats = featureparse(HA(i),'Feature','source');
regionHA(i).Strain = feats.strain;
if isfield(feats,'country')
regionHA(i).Country = feats.country;
else
regionHA(i).Country = 'N/A';
end
year = regexp(regionHA(i).Name,'\d+','match');
regionHA(i).Year = year{1};
% Create a link to GenPept record through the accession number
regionHA(i).AccessionNumber = ...
['<a href="http://www.ncbi.nlm.nih.gov/sites/entrez?db=Protein&cmd=search&term=',...
HA(i).Accession,'">',HA(i).Accession,'</a>'];
end
[regionHA.SequenceLength] = deal(HA.LocusSequenceLength);
Create an attribute structure using the makeattribspec function, which you will use to format the
description table for each marker. The attribute structure dictates the order and formatting of each
attribute. You can also use it to hide an attribute of the geostruct, regionHA, from the table.
attribStruct = makeattribspec(regionHA);
Remove the Name field and reorder the fields in the attribute structure.
attribStruct = rmfield(attribStruct,'Name');
attribStruct = orderfields(attribStruct,{'AccessionNumber','Strain',...
'SequenceLength','Country','Year','Cluster'});
regionHA = orderfields(regionHA,{'AccessionNumber','Strain',...
'SequenceLength','Country','Year','Cluster','Geometry','Lon','Lat',...
'Name'});
kmlDirectory = tempdir;
filename = fullfile(kmlDirectory,'HA_geographic_locations.kml');
kmlwrite(filename,regionHA,'Description',attribStruct,'Name',{regionHA.Strain},...
'Icon','http://maps.google.com/mapfiles/kml/shapes/arrow.png','iconscale',1.5);
You can display a KML file in a Google Earth browser [6]. Google Earth must be installed on the
system. On Windows® platforms, display the KML file with:
winopen(filename)
For UNIX and Mac users, display the KML file with the Google Earth launcher installed on your system.
For this example, the KML file was previously displayed using Google Earth Pro. The Google Earth
image was then saved using the Google Earth "File->Save Image" menu. This is how the data in your
KML file looks when loaded into Google Earth. To get this view, move around and zoom in on the
region over Asia.
Click a placemark to view information about the sequence. The accession number in each data table
is a hyperlink to the GenPept sequence file in the NCBI Protein Database.
Optionally, remove the new KML file from your KML output directory.
delete(filename)
close all
References
[1] https://computationalgenomics.blogs.bristol.ac.uk/case_studies/birdflu_demo
[2] Laver, W.G., Bischofberger, N. and Webster, R.G., "Disarming Flu Viruses", Scientific American,
280(1):78-87, 1999.
[3] Suzuki, Y. and Nei, M., "Origin and Evolution of Influenza Virus Hemagglutinin Genes",
Molecular Biology and Evolution, 19(4):501-9, 2002.
[4] Gambaryan, A., et al., "Evolution of the receptor binding phenotype of influenza A(H5) viruses",
Virology, 344(2):432-8, 2006.
[5] Cristianini, N. and Hahn, M.W., "Introduction to Computational Genomics: A Case Studies
Approach", Cambridge University Press, 2007.
[6] Google Earth images were acquired using Google Earth Pro. For more information about Google
Earth and Google Earth Pro, visit http://earth.google.com/
Exploring Primer Design
This example shows how to use the Bioinformatics Toolbox™ to find potential primers that can be
used for automated DNA sequencing.
Introduction
Primer design for PCR can be a daunting task. A good primer should usually meet the following
criteria:
This example uses MATLAB® and Bioinformatics Toolbox to find PCR primers for the human
hexosaminidase gene.
First load the hexosaminidase nucleotide sequence from the provided MAT-file
hexosaminidase.mat. The DNA sequence that you want to find primers for is in the Sequence field
of the loaded structure.
load('hexosaminidase.mat','humanHEXA')
sequence = humanHEXA.Sequence;
You can also use the getgenbank function to retrieve the sequence information from the NCBI data
repository and load it into MATLAB. The NCBI reference sequence for HEXA has accession number
NM_000520.
humanHEXA = getgenbank('NM_000520');
The oligoprop function is a useful tool to get properties of oligonucleotide DNA sequences. This
function calculates the GC content, molecular weight, melting temperature, and secondary structure
information. oligoprop returns a structure that has fields with the associated information. Use the
help command to see what other properties oligoprop returns.
nt = oligoprop('AAGCTCAAAAACGCGCGGTATTCGACTGGCGTGATCTATTTTATGCT')
nt =
GC: 44.6809
GCdelta: 0
Hairpins: [3x47 char]
Dimers: [9x47 char]
MolWeight: 1.4468e+04
MolWeightdelta: 0
Tm: [68.9128 79.7752 85.3393 69.6497 68.2474 75.8931]
Tmdelta: [0 0 0 0 0 0]
Thermo: [4x3 double]
Thermodelta: [4x3 double]
The goal is to create a list of all potential forward primers of length 20. You can do this either by
iterating over the entire sequence and taking subsequences at every possible position or by using a
matrix of indices. The following example shows how you can set a matrix of indices and then use it to
create all possible forward subsequences of length 20, in this case N-19 subsequences where N is the
length of the target hexosaminidase sequence. Then you can use the oligoprop function to get
properties for each of the potential primers in the list.
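N = length(sequence) % length of the target sequence
M = 20 % desired primer length
index = repmat((0:N-M)',1,M)+repmat(1:M,N-M+1,1);
fwdprimerlist = sequence(index);
% Assumed step (missing from this listing): get properties of each primer.
for i = N-M+1:-1:1 % reverse order preallocates the structure array
fwdprimerprops(i) = oligoprop(fwdprimerlist(i,:));
end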
N =
2437
M =
20
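The index-matrix construction is easier to follow on a short sequence. This sketch uses a hypothetical six-base sequence and a window length of 3, and needs only base MATLAB.

```matlab
seq = 'ACGTAC';          % toy sequence (hypothetical)
N = length(seq);
M = 3;                   % window length

% Row k holds the positions k, k+1, ..., k+M-1.
index = repmat((0:N-M)',1,M) + repmat(1:M,N-M+1,1);

% Indexing with the matrix returns one window per row.
windows = seq(index)
```

The result has N-M+1 rows, one for each possible start position, which matches the N-19 subsequences obtained for primers of length 20.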
After you have all potential forward primers, you can search for reverse primers in a similar fashion.
Reverse primers are found on the complementary strand. To obtain the complementary strand use the
seqcomplement function.
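comp_sequence = seqcomplement(sequence);
revprimerlist = seqreverse(comp_sequence(index));
% Assumed step (missing from this listing): get properties of each primer.
for i = N-M+1:-1:1, revprimerprops(i) = oligoprop(revprimerlist(i,:)); end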
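As an illustration of what complementing and reversing do to a single sequence, the following base MATLAB sketch builds a reverse complement with a lookup table. It does not call seqcomplement or seqreverse, and the six-base sequence is a made-up example.

```matlab
seq = 'ACGTTG';                 % toy sequence (hypothetical)

% Complement each base with a lookup: A<->T, C<->G.
bases = 'ACGT';
comps = 'TGCA';
[~,loc] = ismember(seq,bases);
comp = comps(loc);

% Reverse so that the result also reads 5'->3'.
revcomp = fliplr(comp)
```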
The GC content information for the primers is in a structure with the field GC. To eliminate all
potential primers that do not meet the criteria stated above (a GC content of 45% to 55%), you can
make a logical indexing vector that indicates which primers have GC content outside the acceptable
range. Extract the GC field from the structure and convert it to a numeric vector.
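fwdgc = [fwdprimerprops.GC]';
revgc = [revprimerprops.GC]';
bad_fwdprimers_gc = fwdgc < 45 | fwdgc > 55; % assumed from the 45-55% range above
bad_revprimers_gc = revgc < 45 | revgc > 55;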
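The GC percentage itself is a simple count. As a sketch in base MATLAB (without oligoprop), the following computes it for the 47-base oligonucleotide shown earlier and applies the 45% to 55% range; the variable names gc and bad_gc are illustrative.

```matlab
primer = 'AAGCTCAAAAACGCGCGGTATTCGACTGGCGTGATCTATTTTATGCT';

% Percentage of G and C bases.
isGC = (primer == 'G') | (primer == 'C');
gc = 100 * mean(isGC)

% Flag the oligonucleotide when its GC content falls outside 45-55%.
bad_gc = gc < 45 | gc > 55
```

The computed value agrees with the GC field reported by oligoprop for the same sequence (44.6809), so this oligonucleotide falls just below the acceptable range.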
The melting temperature is significant when you are designing PCR protocols. Create another logical
indexing vector to keep track of primers with bad melting temperatures. The melting temperatures
from oligoprop are estimated in a variety of ways (basic, salt-adjusted, nearest-neighbor). The
following example uses the nearest-neighbor estimates for melting temperatures with parameters
established by SantaLucia, Jr. [1]. These are stored in the fifth element of the field Tm returned by
oligoprop. The other elements of this field represent other methods to estimate the melting
temperature. You can also use the mean function to compute an average over all the estimates.
fwdtm = cell2mat({fwdprimerprops.Tm}');
revtm = cell2mat({revprimerprops.Tm}');
bad_fwdprimers_tm = fwdtm(:,5) < 50 | fwdtm(:,5) > 60;
bad_revprimers_tm = revtm(:,5) < 50 | revtm(:,5) > 60;
Self-dimerization and hairpin formation can prevent the primer from binding to the target sequence.
As above, you can create logical indexing vectors to indicate whether the potential primers do or do
not form self-dimers or hairpins.
bad_fwdprimers_dimers = ~cellfun('isempty',{fwdprimerprops.Dimers}');
bad_fwdprimers_hairpin = ~cellfun('isempty',{fwdprimerprops.Hairpins}');
bad_revprimers_dimers = ~cellfun('isempty',{revprimerprops.Dimers}');
bad_revprimers_hairpin = ~cellfun('isempty',{revprimerprops.Hairpins}');
A strong base pairing at the 3' end of the primer is helpful. Find all the primers that do not end in a G
or C. Remember that all the sequences in the lists are 5'->3'.
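One way to express this check is to test the last column of the primer list. The following is a minimal sketch on three hypothetical primers; the name bad_gcclamp is an assumption chosen to match the naming of the other indexing vectors.

```matlab
% Toy primers, one per row, all written 5'->3' (hypothetical).
primers = ['acgtacgtag';
           'ttgacctgaa';
           'ggatccatgc'];

% The 3' base is the last column; flag primers that do not end in g or c.
lastBase = lower(primers(:,end));
bad_gcclamp = ~ismember(lastBase,'gc')
```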
Primers that have stretches of repeated nucleotides can give poor PCR results. These are sequences
with low complexity. To eliminate primers with stretches of four or more repeated bases, use the
function regexp.
fwdrepeats = regexpi(cellstr(fwdprimerlist),'a{4,}|c{4,}|g{4,}|t{4,}','ONCE');
revrepeats = regexpi(cellstr(revprimerlist),'a{4,}|c{4,}|g{4,}|t{4,}','ONCE');
bad_fwdprimers_repeats = ~cellfun('isempty',fwdrepeats);
bad_revprimers_repeats = ~cellfun('isempty',revrepeats);
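The same regular expression can be tried on a few short sequences to see which ones it flags. The three primers below are hypothetical.

```matlab
% Toy primers, one per row (hypothetical).
primers = ['acgtacgtag';
           'ttaaaactga';
           'ggatcccctc'];

% 'once' returns the start of the first run of four or more identical
% bases, or an empty array when there is no such run.
hits = regexpi(cellstr(primers),'a{4,}|c{4,}|g{4,}|t{4,}','once');
bad_repeats = ~cellfun('isempty',hits)
```

Only the second and third primers contain a run of four identical bases, so only they are flagged.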
The rows of the original list of subsequences correspond to the base number where each
subsequence starts. You can use the logical indexing vectors collected so far and create a new list of
primers that satisfy all the criteria discussed above. The figure shows how the forward primers have
been filtered, where a value of 1 indicates a bad primer and a value of 0 indicates a good
primer.
good_fwdpos = find(all(~bad_fwdprimers,2));
good_fwdprimers = fwdprimerlist(good_fwdpos,:);
good_fwdprop = fwdprimerprops(good_fwdpos);
N_good_fwdprimers = numel(good_fwdprop)
good_revpos = find(all(~bad_revprimers,2));
good_revprimers = revprimerlist(good_revpos,:);
good_revprop = revprimerprops(good_revpos);
N_good_revprimers = numel(good_revprop)
figure
imagesc([bad_fwdprimers any(bad_fwdprimers,2)]);
title('Filtering candidate forward primers');
ylabel('Primer location');
xlabel('Criteria');
ax = gca;
ax.XTickLabel = char({'%GC','Tm','Dimers','Hairpin','GC clamp','Repeats','All'});
ax.XTickLabelRotation = 45;
colorbar
N_good_fwdprimers =
140
N_good_revprimers =
147
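The filtering step relies on a logical matrix with one column per criterion. The following sketch shows the idea on four hypothetical candidates and three made-up criteria; the order and number of columns are assumptions for illustration.

```matlab
% Rows are candidate primers, columns are criteria (1 = criterion failed).
bad_gc      = logical([0; 1; 0; 0]);
bad_tm      = logical([0; 0; 0; 1]);
bad_repeats = logical([0; 0; 0; 1]);
bad = [bad_gc, bad_tm, bad_repeats];

% A primer is kept only if every column is 0.
good_pos = find(all(~bad,2))

% A primer is rejected if any column is 1.
rejected = any(bad,2);
```

Keeping the criteria as separate columns makes it possible to see, for each rejected primer, which criterion caused the rejection.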
Cross dimerization can occur between the forward and reverse primer if they have a significant
amount of complementarity. The primers will not function properly if they dimerize with each other.
To check for dimerization, align every forward primer against every reverse primer, using the
swalign function, and keep the low-scoring pairs of primers. This information can be stored in a
matrix with rows representing forward primers and columns representing reverse primers. This
exhaustive calculation can be quite time-consuming. However, there is no point in performing this
calculation on primer pairs where the reverse primer is upstream of the forward primer. Therefore,
these primer pairs can be ignored. The image in the figure shows the pairwise scores before
thresholding; low scores (dark blue) represent primer pairs that do not dimerize.
scr_mat = [-1,-1,-1,1;-1,-1,1,-1;-1,1,-1,-1;1,-1,-1,-1;];
scr = zeros(N_good_fwdprimers,N_good_revprimers);
for i = 1:N_good_fwdprimers
for j = 1:N_good_revprimers
if good_fwdpos(i) < good_revpos(j)
scr(i,j) = swalign(good_fwdprimers(i,:), good_revprimers(j,:), ...
'SCORINGMATRIX',scr_mat,'GAPOPEN',5,'ALPHA','NT');
else
scr(i,j) = 13; % give a high score to ignore forward primers
% that are after reverse primers
end
end
end
figure
imagesc(scr)
title('Cross dimerization scores')
xlabel('Candidate reverse primers')
ylabel('Candidate forward primers')
colorbar
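The scoring matrix rewards complementary bases (A-T and C-G) and penalizes all other pairings, so a high local alignment score signals complementarity. The following base MATLAB sketch computes a local alignment score with that matrix. It is a simplified stand-in for swalign, assuming a linear gap penalty of 5, and the primer pairs are made up.

```matlab
% Scoring matrix from the example: rows/columns ordered A, C, G, T.
% Complementary pairs (A-T, C-G) score +1, everything else -1.
scr_mat = [-1,-1,-1,1; -1,-1,1,-1; -1,1,-1,-1; 1,-1,-1,-1];
gap = 5;                               % linear gap penalty (assumption)

pairsToScore = {'aaaa','tttt'; 'aaaa','aaaa'; 'acgt','ttgc'};
scores = zeros(size(pairsToScore,1),1);
for k = 1:size(pairsToScore,1)
    [~,a] = ismember(upper(pairsToScore{k,1}),'ACGT');
    [~,b] = ismember(upper(pairsToScore{k,2}),'ACGT');
    H = zeros(numel(a)+1,numel(b)+1);  % local alignment score table
    for i = 1:numel(a)
        for j = 1:numel(b)
            H(i+1,j+1) = max([0, ...
                H(i,j)   + scr_mat(a(i),b(j)), ... % pair a(i) with b(j)
                H(i,j+1) - gap, ...                 % gap in second primer
                H(i+1,j) - gap]);                   % gap in first primer
        end
    end
    scores(k) = max(H(:));
end
scores
```

Fully complementary sequences receive the highest score, identical sequences score zero, and partially complementary sequences fall in between.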
Low-scoring primer pairs are identified as logical one in an indicator matrix.
pairedprimers = scr<=3;
An alternative way to present this information is to look at all potential combinations of primers in the
sequence domain. Each dot in the plot represents a possible combination between the forward and
reverse primers after filtering out all those cases with potential cross dimerization.
[f,r] = find(pairedprimers);
figure
plot(good_revpos(r),good_fwdpos(f),'r.','markersize',10)
axis([1 N 1 N])
title('Primer selection graph')
xlabel('Reverse primer positions')
ylabel('Forward primer positions')
You can use the information calculated so far to find the best primer pairs that allow amplification of
the 220 bp region from position 880 to 1100. First, you find all pairs that can cover the required
region, taking into account the length of the primer. Then, you calculate the Euclidean distance of the
actual positions to the desired ones, and re-order the list starting with the closest distance.
hold on
plot(good_revpos(r(pairs)),good_fwdpos(f(pairs)),'b.','markersize',10)
plot([1100 1100],[1 880-M],'g')
plot([1100 N],[880-M 880-M],'g')
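The selection and ordering described above can be sketched on toy data. In the following example, the candidate positions are hypothetical, and the variable names regionStart, regionEnd, fwdPos, and revPos are assumptions introduced for illustration.

```matlab
M = 20;                         % primer length
regionStart = 880;  regionEnd = 1100;

% Toy candidate pairs (hypothetical start positions).
fwdPos = [100 850 855 900 850];
revPos = [1105 1150 1105 1300 1090];

% Keep pairs whose forward primer ends before the region and whose
% reverse primer lies beyond it.
pairs = find(fwdPos < regionStart-M & revPos > regionEnd);

% Squared Euclidean distance to the ideal positions, closest first.
d = (fwdPos(pairs)-(regionStart-M)).^2 + (revPos(pairs)-regionEnd).^2;
[~,order] = sort(d);
pairs = pairs(order)
```

Sorting by squared distance gives the same order as sorting by the Euclidean distance itself, so the square root can be skipped.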
Use the sprintf function to generate a report with the ten best pairs and associated information.
These primer pairs can then be verified experimentally. These primers can also be 'BLASTed' using
the blastncbi function to check specificity.
Primers = sprintf('Fwd/Rev Primers Start End %%GC mT Length\n\n');
for i = 1:10
fwd = f(pairs(i));
rev = r(pairs(i));
Primers = sprintf('%s%-21s%-6d%-6d%-4.4g%-4.4g\n%-21s%-6d%-6d%-4.4g%-7.4g%-6d\n\n', ...
Primers, good_fwdprimers(fwd,:),good_fwdpos(fwd),good_fwdpos(fwd)+M-1,good_fwdprop(fwd).GC,go
good_revprimers(rev,:),good_revpos(rev)+M-1,good_revpos(rev),good_revprop(rev).GC,go
good_revpos(rev) - good_fwdpos(fwd) );
end
disp(Primers)
Use the rebasecuts function to list all the restriction enzymes from the REBASE® database [2] that
will cut a primer. These restriction enzymes can be used in the design of cloning experiments. For
example, you can use this on the first pair of primers from the list of possible primers that you just
calculated.
fwdprimer = good_fwdprimers(f(pairs(1)),:)
fwdcutter = unique(rebasecuts(fwdprimer))
revprimer = good_revprimers(r(pairs(1)),:)
revcutter = unique(rebasecuts(revprimer))
fwdprimer =
'tacatctcgccattacctgc'
fwdcutter =
{'AbaSI' }
{'Acc36I'}
{'BfuAI' }
{'BmeDI' }
{'BspMI' }
{'BveI' }
{'FspEI' }
{'LpnPI' }
{'MspJI' }
{'RlaI' }
{'SetI' }
{'SgeI' }
{'SgrTI' }
{'YkrI' }
revprimer =
'tcaacctcatctcctccaag'
revcutter =
{'AbaSI' }
{'AspBHI'}
{'BmeDI' }
{'BsaXI' }
{'FspEI' }
{'MnlI' }
{'MspJI' }
{'RlaI' }
{'SetI' }
{'SgeI' }
{'SgrTI' }
{'YkrI' }
References
[1] SantaLucia, J. Jr., "A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor
thermodynamics", PNAS, 95(4):1460-5, 1998.
[2] Roberts, R.J., et al., "REBASE--restriction enzymes and DNA methyltransferases", Nucleic Acids
Research, 33:D230-2, 2005.
Identifying Over-Represented Regulatory Motifs
This example illustrates a simple approach to searching for potential regulatory motifs in a set of co-
expressed genomic sequences by identifying significantly over-represented ungapped words of fixed
length. The discussion is based on the case study presented in Chapter 10 of "Introduction to
Computational Genomics. A Case Studies Approach" [1].
Introduction
The circadian clock is the 24-hour cycle of physiological processes that synchronize with the
external day-night cycle. Most of the work on the circadian oscillator in plants has been carried out
using the model plant Arabidopsis thaliana. In this organism, the regulation of a series of genes that
need to be turned on or off at specific times of the day and night is accomplished by small regulatory
sequences found upstream of the genes in question. One such regulatory motif, AAAATATCT, also known
as the Evening Element (EE), has been identified in the promoter regions of circadian clock-regulated
genes that show peak expression in the evening [2].
We consider three sets of clock-regulated genes, clustered according to the time of the day when they
are maximally expressed: set 1 corresponds to 1 kb-long upstream regions of genes whose
expression peaks in the morning (8am-4pm); set 2 corresponds to 1 kb-long upstream regions of
genes whose expression peaks in the evening (4pm-12am); set 3 corresponds to 1 kb-long upstream
regions of genes whose expression peaks in the night (12am-8am). Because we are interested in a
regulatory motif in evening genes, set 2 represents our target set, while set 1 and set 3 are used as
background. In each set, the sequences and their respective reverse complements are concatenated
to each other, with individual sequences separated by a gap symbol (-).
load evemotifdemodata.mat;
N1 = numel(set1) * 2;
N2 = numel(set2) * 2;
N3 = numel(set3) * 2;
To determine which candidate motif is over-represented in a given target set with respect to the
background set, we identify all possible W-mers (words of length W) in both sets and compute their
frequency. A word is considered over-represented if its frequency in the target set is significantly
higher than the frequency in the background set. This difference is also called "margin".
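The word-counting idea can be sketched in base MATLAB on two short strings. This is an illustration of the margin computation only, not the findOverrepresentedWords function used in this example; the sequences are made up, and words spanning gap symbols are not handled.

```matlab
W = 3;                                   % word length
target     = 'ATCATCATC';                % toy target set (hypothetical)
background = 'TCATGGGGG';                % toy background set (hypothetical)

% All overlapping W-mers, one per row.
tw = target((0:numel(target)-W)' + (1:W));
bw = background((0:numel(background)-W)' + (1:W));

% Distinct words across both sets, with an index back to each occurrence.
[words,~,idx] = unique([tw; bw],'rows');
nT = size(tw,1);  nB = size(bw,1);
fromTarget = [true(nT,1); false(nB,1)];

% Relative frequency of each word in each set, then the margin.
fT = accumarray(idx(fromTarget),1,[size(words,1) 1]) / nT;
fB = accumarray(idx(~fromTarget),1,[size(words,1) 1]) / nB;
margin = fT - fB;

[topMargin,k] = max(margin);
topWord = words(k,:)
```

Sorting margin in descending order instead of taking only the maximum would give a ranked list of over-represented words.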
type findOverrepresentedWords
If we consider all words of length W = 9 that appear more frequently in the target set (upstream
region of genes highly expressed in the evening) with respect to the background set (upstream region
of genes highly expressed in the morning and night), we notice that the most over-represented word
is 'AAAATATCT', also known as the Evening Element (EE) motif.
W = 9;
ans =
{'AAAATATCT'}
{'AGATATTTT'}
{'CTCTCTCTC'}
{'GAGAGAGAG'}
{'AGAGAGAGA'}
{'TCTCTCTCT'}
{'AAATATCTT'}
{'AAGATATTT'}
{'AAAAATATC'}
{'GATATTTTT'}
ans =
1.0e-03 *
0.1439
0.1439
0.1140
0.1140
0.1074
0.1074
0.0713
0.0713
0.0695
0.0695
Besides the EE motif, other words of length W = 9 appear to be over-represented in the target set. In
particular, we notice the presence of repeats, i.e., words consisting of a single nucleotide or dimer
repeated for the entire word length, such as 'CTCTCTCTC'. This phenomenon is quite common in
genomic sequences and generally is associated with non-functional components. Because in this
context the repeats are unlikely to be biologically significant, we filter them out.
% === determine repeats
wordsN = numel(words);
r = zeros(wordsN,1);
for i = 1:wordsN
if (all(words{i}(1:2:end) == words{i}(1)) && ... % odd positions are the same
all(words{i}(2:2:end) == words{i}(2))) % even positions are the same
r(i) = 1;
end
end
r = logical(r);
EEMotif = motif{1}
EEMargin = margin(1)
motif =
{'AAAATATCT'}
{'AGATATTTT'}
{'AAATATCTT'}
{'AAGATATTT'}
{'AAAAATATC'}
{'GATATTTTT'}
{'AAATAAAAT'}
{'ATTTTATTT'}
{'TAAATAAAA'}
{'TTTTATTTA'}
margin =
1.0e-03 *
0.1439
0.1439
0.0713
0.0713
0.0695
0.0695
0.0656
0.0656
0.0600
0.0600
EEMotif =
'AAAATATCT'
EEMargin =
1.4393e-04
After removing the repeats, we observe that the EE motif ('AAAATATCT') and its reverse complement
('AGATATTTT') are at the top of the list. The other over-represented words are either simple variants
of the EE motif, such as 'AAATATCTT', 'AAAAATATC', 'AAATATCTC', or their reverse complements,
such as 'AAGATATTT', 'GATATTTTT', 'GAGATATTT'.
Various techniques can be used to assess the statistical significance of the margin computed for the
EE motif. For example, we can repeat the analysis using some control sequences and evaluate the
resulting margins with respect to the EE margin. Genomic regions of Arabidopsis thaliana that are
further away from the transcription start site are good candidates for this purpose. Alternatively, we
could randomly split and shuffle the sequences under consideration and use these as controls.
Another simple solution is to generate random sequences according to the nucleotide composition of
the three original sets of sequences, as shown below.
bases2 = basecount(s2);
bases3 = basecount(s3);
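Given base counts, a random control sequence with the same composition can be drawn by sampling each position independently. The following is a minimal base MATLAB sketch; the counts, the sequence length, and the variable names are hypothetical, and it is not necessarily how the controls in this example were generated.

```matlab
rng(1);                                  % fixed seed for reproducibility
counts = [400 100 100 400];              % toy A, C, G, T counts (hypothetical)
alphabet = 'ACGT';
n = 10000;                               % length of the control sequence

% Cumulative probabilities define one bin per base on [0,1].
edges = [0 cumsum(counts)/sum(counts)];
edges(end) = 1;

% Draw uniform numbers and map each one to the bin it falls in.
ctrlSeq = alphabet(discretize(rand(1,n),edges));

freqA = mean(ctrlSeq == 'A')
```

This preserves the single-nucleotide composition but not higher-order structure such as dinucleotide frequencies.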
The variable ctrlMargin holds the estimated margins of the top motifs for each of the 100 control
sequences generated as described above. The distribution of these margins can be approximated by
the extreme value distribution. We use the function gevfit from the Statistics and Machine Learning
Toolbox™ to estimate the parameters (shape, scale, and location) of the extreme value distribution
and we overlay a scaled version of its probability density function, computed using gevpdf, with the
histogram of the margins of the control sequences.
% === overlay
figure()
hold on;
hist(ctrlMargin, buckets);
h = findobj(gca,'Type','patch');
h.FaceColor = [.9 .9 .9];
plot(x, scaleFactor * y, 'r');
stem(EEMargin, 1, 'b');
xlabel('Margin');
ylabel('Number of sequences');
legend('Ctrl Margins', 'EVD pdf', 'EE Margin');
The control margins are the differences in frequency that we would expect to find when a word is
over-represented by chance alone. The margin relative to the EE motif is clearly significantly larger
than the control margins, and does not fit within the probability density curve of the random controls.
Because the EE margin is larger than all 100 control margins, we can conclude that the over-
representation of the EE motif in the target set is statistically significant and the p-value estimate is
less than 0.01.
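An empirical p-value can be estimated directly from the control margins. The following sketch uses made-up control values and a common convention that adds 1 to both numerator and denominator; this convention is an assumption and not necessarily the one used in this example.

```matlab
nCtrl = 100;
obsMargin = 1.4393e-04;                       % EE margin reported above
rng(0);
ctrlMargin = 0.5e-04 * rand(nCtrl,1);         % toy control margins (made up)

% Count controls at least as extreme as the observation; adding 1 to the
% numerator and denominator avoids reporting p = 0.
nExtreme = sum(ctrlMargin >= obsMargin);
pEst = (nExtreme + 1) / (nCtrl + 1)
```

With 100 controls and none exceeding the observed margin, the estimate is 1/101, which is below 0.01.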
If we repeat the search for over-represented words of length W = 6...11, we observe that all the top
motifs are either substrings (if W < 9) or superstrings (if W > 9) of the EE motif. How, then, do we
decide on the correct length of this motif? We can expect that the optimal length maximizes the
difference in frequency between the motif in the target set and the same motif in the background set.
However, in order to compare the margin across different lengths, the margin must be normalized to
account for the natural tendency of shorter words to occur more frequently. We perform this
normalization by dividing each margin by the margin corresponding to the most over-represented
word of identical length in a random set of sequences with a nucleotide composition similar to the
target set. For convenience, the top over-represented words for length W = 6...11 and their margins
are stored in the variables topMotif and topMargin. Similarly, the top over-represented words for
length W = 6...11 and their margins in a random set are stored in the variables rTopMotif and
rTopMargin.
% === top over-represented words, W = 6...11 in set 2 (evening)
topMotif
topMargin
% === plot
figure()
plot(6:11, score(6:11));
xlabel('Motif length');
ylabel('Normalized margin');
title('Optimal motif length');
hold on
line([bestLength bestLength], [0 bestScore], 'LineStyle', '-.')
topMotif =
{0x0 double }
{0x0 double }
{0x0 double }
{0x0 double }
{0x0 double }
{'AATATC' }
{'AATATCT' }
{'AAATATCT' }
{'AAAATATCT' }
{'AAAAATATCT' }
{'AAAAAATATCT'}
topMargin =
1.0e-03 *
NaN
NaN
NaN
NaN
NaN
0.3007
0.2607
0.2074
0.1439
0.0648
0.0424
rTopMotif =
{0x0 double }
{0x0 double }
{0x0 double }
{0x0 double }
{0x0 double }
{'ATTATA' }
{'TATAATA' }
{'TTATTAAA' }
{'GTTATTAAA' }
{'ATTATATATC' }
{'ATGTTATTATT'}
rTopMargin =
1.0e-03 *
NaN
NaN
NaN
NaN
NaN
0.5650
0.2374
0.0972
0.0495
0.0279
0.0183
By plotting the normalized margin versus the motif length, we find that length W = 9 is the most
informative in discriminating over-represented motifs in the target sequence (evening set) against the
background set (morning and night sets).
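The normalization can be reproduced from the margins listed above for W = 6...11. In this sketch the vectors are indexed 1 to 6 rather than by word length, and the variable lengths is introduced for illustration.

```matlab
lengths    = 6:11;
topMargin  = 1.0e-03 * [0.3007 0.2607 0.2074 0.1439 0.0648 0.0424];
rTopMargin = 1.0e-03 * [0.5650 0.2374 0.0972 0.0495 0.0279 0.0183];

% Normalize each margin by the margin of the top word in the random set.
score = topMargin ./ rTopMargin;

[bestScore,k] = max(score);
bestLength = lengths(k)
```

The raw margin decreases steadily with word length, but the normalized margin peaks at W = 9.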
Although the EE motif has been identified and experimentally validated as a regulatory motif for
genes whose expression peaks in the evening hours, it is not shared by all evening genes, nor is it
exclusive to these genes. We count the occurrences of the EE motif in the three sequence sets and
determine what proportion of genes in each set contain the motif.
EECount = zeros(3,1);
% === plot
figure()
bar(EEProp, 0.5);
ylabel('Proportion of genes containing EE Motif');
xlabel('Gene set');
title('Presence of EE Motif');
ylim([0 1])
ax = gca;
ax.XTickLabel = {'morning', 'evening', 'night'};
It appears as though about 9% of genes in set 1, 40% of genes in set 2, and 13% of genes in set 3
have the EE motif. Thus, not all genes in set 2 have the motif, but it is clearly enriched in this group.
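A proportion of this kind can be computed by testing each sequence for the motif or its reverse complement. The following sketch uses three hypothetical sequences that are much shorter than the 1 kb upstream regions in the example.

```matlab
motif   = 'AAAATATCT';
motifRC = 'AGATATTTT';                       % reverse complement

% Toy upstream regions (hypothetical).
seqs = {'GGAAAATATCTGG','CCCCGGGGCCCC','TTAGATATTTTCC'};

hasMotif = false(size(seqs));
for k = 1:numel(seqs)
    hasMotif(k) = ~isempty(strfind(seqs{k},motif)) || ...
                  ~isempty(strfind(seqs{k},motifRC));
end
EEProp = mean(hasMotif)
```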
Unlike many other functional motifs, the EE motif does not appear to accumulate at specific gene
locations in the set of sequences analyzed. After determining the location of each occurrence with
respect to the transcription start site (TSS), we observe a relatively uniform distribution of
occurrences across the upstream region of the genes considered, with the possible exception of the
middle region (between 400 and 500 bases upstream of the TSS).
References
[1] Cristianini, N. and Hahn, M.W., "Introduction to Computational Genomics: A Case Studies
Approach", Cambridge University Press, 2007.
[2] Harmer, S.L., et al., "Orchestrated Transcription of Key Pathways in Arabidopsis by the Circadian
Clock", Science, 290(5499):2110-3, 2000.
Predicting and Visualizing the Secondary Structure of RNA Sequences
This example illustrates how to use the rnafold and rnaplot functions to predict and plot the
secondary structure of an RNA sequence.
Introduction
RNA plays an important role in the cell, both as a carrier of genetic information (mRNA) and as a functional
element (tRNA, rRNA). Because the function of an RNA sequence is largely associated with its
structure, predicting the RNA structure from its sequence has become increasingly important.
Because base pairing and base stacking represent the majority of the free energy contribution to
folding, a good estimation of secondary structure can be very helpful not only in the interpretation of
the function and reactivity, but also in the analysis of the tertiary structure of the RNA molecule.
The secondary structure of an RNA sequence is determined by the interaction between its bases,
including hydrogen bonding and base stacking. One of the many methods for RNA secondary
structure prediction uses the nearest-neighbor model and minimizes the total free energy associated
with an RNA structure. The minimum free energy is estimated by summing individual energy
contributions from base pair stacking, hairpins, bulges, internal loops and multi-branch loops. The
energy contributions of these elements are sequence- and length-dependent and have been
experimentally determined [1]. The rnafold function uses the nearest-neighbor thermodynamic
model to predict the minimum free-energy secondary structure of an RNA sequence. More
specifically, the algorithm implemented in rnafold uses dynamic programming to compute the
energy contributions of all possible elementary substructures and then predicts the secondary
structure by considering the combination of elementary substructures whose total free energy is
minimum. In this computation, the contribution of coaxially stacked helices is not accounted for, and
the formation of pseudoknots (non-nested structural elements) is forbidden.
tRNAs are small molecules (73-93 nucleotides) that during translation transfer specific amino acids to
the growing polypeptide chain at the ribosomal site. Although at least one tRNA molecule exists for
each amino acid type, both secondary and tertiary structures are well conserved among the various
tRNA types, most likely because of the necessity of maintaining reliable interaction with the
ribosome. We consider the following tRNA-Phe sequence from Saccharomyces cerevisiae and predict
the minimum free-energy secondary structure using the function rnafold.
% === Predict secondary structure in bracket notation
phe_seq = 'GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCACCA';
phe_str = rnafold(phe_seq)
phe_str =
'(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....'
In the bracket notation, each dot represents an unpaired base, while a pair of equally nested, opening
and closing brackets represents a base pair. Alternative representations of RNA secondary structures
can be drawn using the function rnaplot. For example, the structure predicted above can be
displayed as a rooted tree, where leaf nodes correspond to unpaired residues and internal nodes
(except the root) correspond to base pairs. You can display the position and type of each residue by
clicking on the corresponding node.
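The bracket notation can be decoded with a simple stack. The following is a minimal sketch that uses only base MATLAB and no toolbox functions; the variable names partner and openPos are illustrative, and the structure string is copied from the rnafold output shown above.

```matlab
% Bracket notation copied from the rnafold output shown above
phe_str = '(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....';

n = numel(phe_str);
partner = zeros(1, n);   % partner(i) is the position paired with i (0 if unpaired)
openPos = zeros(1, n);   % stack of positions whose opening bracket is not yet closed
top = 0;
for i = 1:n
    if phe_str(i) == '('
        top = top + 1;           % push the opening position
        openPos(top) = i;
    elseif phe_str(i) == ')'
        j = openPos(top);        % pop the most recent opening position
        top = top - 1;
        partner(i) = j;
        partner(j) = i;
    end
end

numPairs = nnz(partner) / 2;     % each pair is recorded at both of its positions
unpaired = find(partner == 0);   % residues that appear as leaf nodes in the tree
fprintf('%d base pairs, %d unpaired residues\n', numPairs, numel(unpaired));
```

Because every closing bracket is matched with the most recently opened one, the pairs recovered this way are always properly nested.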
The tRNA secondary structure is commonly represented in a diagram plot and resembles a clover
leaf. It displays four base-paired stems (or "arms") and three loops. Each of the four stems has been
extensively studied and characterized: acceptor stem (positions 1-7 and 66-72), D-stem (positions
10-13 and 22-25), anticodon stem (positions 27-31 and 39-43) and T-stem (positions 49-53 and 61-65).
We can draw the tRNA secondary structure as a two-dimensional plot where each residue is identified
by a dot and the backbone and the hydrogen bonds are represented as lines between the dots. The
stems consist of consecutive stretches of base paired residues (blue dots), while the loops are formed
by unpaired residues (red dots).
% === Plot the secondary structure using the dot diagram representation
rnaplot(phe_str, 'seq', phe_seq, 'format', 'dot');
While all the stems are important for proper three-dimensional folding of the molecule and
successful interplay with the ribosome and tRNA synthetases, the acceptor stem and the anticodon stem
are particularly interesting because they include the attachment site and the anticodon triplet. The
attachment site (positions 74-76) occurs at the 3' end of the RNA chain and consists of the sequence
C-C-A in all amino acid acceptor stems. The anticodon triplet consists of three bases that pair with a
complementary codon in the messenger RNA. In the case of Phe-tRNA, the anticodon occupies positions
34-36. Read 5' to 3' these bases are G-A-A, so in the antiparallel orientation the anticodon A-A-G
pairs with the mRNA codon U-U-C, encoding the amino acid phenylalanine. We can
redraw the structure and highlight these regions in the acceptor stem and anticodon stem by using
the selection property:
aag_pos = 34:36;
cca_pos = 74:76;
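As a quick check of the positions quoted above, the following sketch uses base MATLAB indexing to extract the two regions and derive the matching codon. The final rnaplot call is guarded so that it runs only when the Bioinformatics Toolbox is on the path. Its 'format' and 'selection' arguments are our assumption about how the highlighted diagram could be produced, since the plotting command itself is not shown in the text.

```matlab
phe_seq = 'GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCACCA';
phe_str = '(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....';
aag_pos = 34:36;
cca_pos = 74:76;

anticodon  = phe_seq(aag_pos);   % bases at positions 34-36, read 5' to 3'
attachment = phe_seq(cca_pos);   % bases at the 3' end of the chain

% Complement each anticodon base, then reverse to read the codon 5' to 3'
bases    = 'ACGU';
partners = 'UGCA';
[~, loc] = ismember(anticodon, bases);
codon = fliplr(partners(loc));

fprintf('anticodon %s, attachment site %s, codon %s\n', anticodon, attachment, codon);

% Assumed plotting call; skipped when rnaplot is not available
if exist('rnaplot', 'file')
    rnaplot(phe_str, 'sequence', phe_seq, 'format', 'diagram', ...
        'selection', [aag_pos, cca_pos]);
end
```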
The segregation of the sequence into four separate stems is better appreciated by displaying the
structure as a graph plot. Each residue is represented on the abscissa and semi-elliptical lines connect
bases that pair with each other. The lack of pseudoknots in the secondary structure is reflected by the
absence of intersecting lines. This is expected in tRNA secondary structures and anticipated because
the dynamic programming method used does not allow pseudoknots.
Similar observations can be drawn by displaying the secondary structure as a circle, where each base
is represented by a dot on the circumference of a circle of arbitrary size, and bases that pair with
each other are connected by lines. The lines are visually clustered into four distinct groups, separated
by stretches of unpaired residues. We can hide the unpaired residues by using H.Unpaired, the
handle returned with the colorby property set to state.
As you can see, the outputs of the rnaplot function include a MATLAB® structure H consisting of
handles that can be used to change the aspect properties of various residue subsets. For example, if
you set the color scheme using the colorby property set to residue, the dots are colored according
to the residue type, and you can change their property using the appropriate handle.
ha =
XLim: [-1 1]
YLim: [-1 1.1000]
XScale: 'linear'
YScale: 'linear'
GridLineStyle: '-'
Position: [0.1124 0.1100 0.6703 0.8150]
Units: 'normalized'
H =
A: [1x1 Line]
C: [1x1 Line]
G: [1x1 Line]
U: [1x1 Line]
Selected: [0x1 Line]
Despite some differences in their primary sequences, tRNA molecules present a secondary structure
pattern that is well conserved across the three phylogenetic domains. Consider the structure of the
tRNA-Phe of one representative organism for each phylogenetic domain: Saccharomyces cerevisiae
for the Eukaryotes, Haloarcula marismortui for the Archaea, and Thermus thermophilus for the
Bacteria. Then predict and plot their secondary structures using the mountain plot representation.
yeast = 'GCGGACUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAGUUCGCACCA';
halma = 'GCCGCCUUAGCUCAGACUGGGAGAGCACUCGACUGAAGAUCGAGCUGUCCCCGGUUCAAAUCCGGGAGGCGGCACCA';
theth = 'GCCGAGGUAGCUCAGUUGGUAGAGCAUGCGACUGAAAAUCGCAGUGUCGGCGGUUCGAUUCCGCCCCUCGGCACCA';
yeast_str = rnafold(yeast);
theth_str = rnafold(theth);
halma_str = rnafold(halma);
The similarity among the resulting structures is striking, the only difference being one extra residue
in the D-loop of Haloarcula marismortui, displayed in the first flat slope in the mountain plot.
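The profile drawn in a mountain plot can be derived directly from the bracket notation. The sketch below assumes the usual definition, in which the height at a position is the number of brackets opened and not yet closed up to that position, so that unpaired stretches appear as flat segments. It uses only base MATLAB and the yeast tRNA-Phe structure predicted earlier; the variable names are illustrative.

```matlab
phe_str = '(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....';

% Height: +1 at every opening bracket, -1 at every closing bracket
height = cumsum((phe_str == '(') - (phe_str == ')'));

% Lengths of the flat segments, i.e. runs of consecutive unpaired residues
isUnpaired = (phe_str == '.');
d = diff([0, isUnpaired, 0]);
runStart = find(d == 1);
runStop  = find(d == -1) - 1;
runLen   = runStop - runStart + 1;

fprintf('maximum height: %d\n', max(height));
fprintf('unpaired run lengths: %s\n', mat2str(runLen));
```

An extra unpaired residue in a loop, such as the one reported for the D-loop of Haloarcula marismortui, lengthens the corresponding flat segment by one position.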
Besides the Watson-Crick base pairs (A-U, G-C), virtually every class of functional RNA presents G-U
wobble base pairs. G-U pairs have an array of distinctive chemical, structural and conformational
properties: they have high affinity for metal ions, they are almost thermodynamically as stable as
Watson-Crick base pairs, and they present conformational flexibility to different environments. The
wobble pair at the third position of the acceptor helix of tRNA is very highly conserved in almost all
organisms. This conservation suggests that the G-U pair possesses unique features that can hardly be
duplicated by other pairs. You can observe the base pair type distribution on the secondary structure
diagram by coloring the base pairs according to their type.
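To make the base pair type distribution concrete, the following sketch counts the A-U, G-C and G-U pairs in the yeast tRNA-Phe structure predicted earlier. It relies only on base MATLAB; the stack-based bracket matching and all variable names are our own illustration rather than part of the toolbox.

```matlab
phe_seq = 'GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCACCA';
phe_str = '(((((((..((((........)))).(((((.......))))).....(((((.......))))))))))))....';

% Match brackets with a stack to list the paired positions
n = numel(phe_str);
openPos = zeros(1, n);
top = 0;
pairs = zeros(0, 2);
for i = 1:n
    if phe_str(i) == '('
        top = top + 1;
        openPos(top) = i;
    elseif phe_str(i) == ')'
        pairs(end+1, :) = [openPos(top), i]; %#ok<SAGROW>
        top = top - 1;
    end
end

% Two-letter label for each pair, sorted so that 'UA' and 'AU' coincide
left   = reshape(phe_seq(pairs(:, 1)), [], 1);
right  = reshape(phe_seq(pairs(:, 2)), [], 1);
labels = sort([left, right], 2);

isAU = labels(:, 1) == 'A' & labels(:, 2) == 'U';
isGC = labels(:, 1) == 'C' & labels(:, 2) == 'G';
isGU = labels(:, 1) == 'G' & labels(:, 2) == 'U';

fprintf('A-U: %d, G-C: %d, G-U: %d\n', sum(isAU), sum(isGC), sum(isGU));
wobblePair = pairs(isGU, :);   % positions of the G-U wobble pair
```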
References
[1] Mathews, D., Sabina, J., Zuker, M., and Turner, D. "Expanded sequence dependence of
thermodynamic parameters improves prediction of RNA secondary structure", Journal of Molecular
Biology, 288(5):911-40, 1999.
Using HMMs for Profile Analysis of a Protein Family
This example shows how HMM profiles are used to characterize protein families. Profile analysis is a
key tool in bioinformatics. The common pairwise comparison methods are usually not sensitive and
specific enough for analyzing distantly related sequences. In contrast, Hidden Markov Model (HMM)
profiles provide a better alternative to relate a query sequence to a statistical description of a family
of sequences. HMM profiles use a position-specific scoring system to capture information about the
degree of conservation at various positions in the multiple alignment of these sequences. HMM
profile analysis can be used for multiple sequence alignment, for database searching, to analyze
sequence composition and pattern segmentation, and to predict protein structure and locate genes by
predicting open reading frames.
Start this example with an already built HMM of a protein family. Retrieve the model for the well-
known 7-fold transmembrane receptor from the Sanger Institute database. The PFAM key number is
PF00002. Also retrieve the pre-aligned sequences used to train this model. More information about
the PFAM database can be found at http://pfam.xfam.org/.
hmm_7tm = gethmmprof(2);
seed_seqs = gethmmalignment(2,'type','seed');
For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in
public repositories is frequently curated and updated; therefore the results of this example might be
slightly different when you use up-to-date datasets.
load('gpcrfam.mat','hmm_7tm','seed_seqs')
Models and alignments can also be stored in files and parsed later directly from those files using the
pfamhmmread, fastaread and multialignread functions.
Display the names and contents of the first three loaded sequences using the seqdisp command.
seqdisp(seed_seqs([1 2 3]),'row',70)
ans =
'>VIPR2_HUMAN/123-371 '
' 1 YILVKAIYTL GYSVS.LMSL ATGSIILCLF .RKLHCTR.N YIHLNLFLSF ILRAISVLVK .DDVLYSSS.'
' 71 GTLHCPD... .......... .......... ....QPSSW. ..V.GCKLSL VFLQYCIMAN FFWLLVEGLY'
'141 LHTLLVA... ...MLPP.RR CFLAYLLIGW GLPTVCIGAW TAAR...... .........L YLED......'
'211 ......TGC. WDTN.DHSVP W....WVIRI PILISIIVNF VLFISIIRIL LQKLT..... .SPDVGGNDQ'
'281 SQY....... .......... .......... ....KRLAKS TLLLIPLFGV HYMV..FAVF PISI...S.S'
'351 KYQILFELCL GSF....QGL VV '
' '
'>VIPR_CARAU/100-348 '
' 1 FRSVKIGYTI GHSVS.LISL TTAIVILCMS .RKLHCTR.N YIHMHLFVSF ILKAIAVFVK .DAVLYDVIQ'
' 71 ESDNCS.... .......... .......... .....TASV. ....GCKAVI VFFQYCIMAS FFWLLVEGLY'
'141 LHALLAVS.. ...FFSE.RK YFWWYILIGW GGPTIFIMAW SFAK...... .........A YFND......'
'211 ......VGC. WDIIENSDLF W....WIIKT PILASILMNF ILFICIIRIL RQKIN..... .CPDIGRNES'
'281 NQY....... .......... .......... ....SRLAKS TLLLIPLFGI NFII..FAFI PENI...K.T'
'351 ELRLVFDLIL GSF....QGF VV '
' '
'>VIPR1_RAT/140-386 '
' 1 YNTVKTGYTI GYSLS.LASL LVAMAILSLF .RKLHCTR.N YIHMHLFMSF ILRATAVFIK .DMALFNSG.'
' 71 EIDHCS.... .......... .......... .....EASV. ....GCKAAV VFFQYCVMAN FFWLLVEGLY'
'141 LYTLLAVS.. ...FFSE.RK YFWGYILIGW GVPSVFITIW TVVR...... .........I YFED......'
'211 ......FGC. WDTI.INSSL W....WIIKA PILLSILVNF VLFICIIRIL VQKLR..... .PPDIGKNDS'
'281 SPY....... .......... .......... ....SRLAKS TLLLIPLFGI HYVM..FAFF PDNF...K.A'
'351 QVKMVFELVV GSF....QGF VV '
More information regarding how to store the profile HMM information in a MATLAB® structure is
found in the help for hmmprofstruct.
To test the profile HMM alignment tool you can re-align the sequences from the multiple alignment to
the HMM model. First erase the periods in sequences used to format the downloaded aligned
sequences. Doing this removes the alignment information from the sequences.
seqs = strrep({seed_seqs.Sequence},'.','');
names = {seed_seqs.Header};
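The variables aligned_seqs and scores that appear in the following commands are not defined in the text. The sketch below shows one way they might be obtained, assuming that hmmprofalign returns the score and the aligned sequence as its first two outputs, as in the three-output call used later in this example. The loop is guarded so that it runs only when the toolbox and the gpcrfam.mat file are available; the first lines illustrate the period-stripping step on toy strings.

```matlab
% Toy illustration of removing alignment periods from a cell array of strings
demo = strrep({'YIL.VK', '..GT'}, '.', '');
disp(demo)

% Assumed reconstruction of the alignment step; skipped without the toolbox
if exist('hmmprofalign', 'file') && exist('gpcrfam.mat', 'file')
    load('gpcrfam.mat', 'hmm_7tm', 'seed_seqs')
    seqs  = strrep({seed_seqs.Sequence}, '.', '');
    names = {seed_seqs.Header};
    scores = zeros(numel(seqs), 1);
    aligned_seqs = cell(numel(seqs), 1);
    for sn = 1:numel(seqs)
        % align each unaligned sequence to the profile HMM
        [scores(sn), aligned_seqs{sn}] = hmmprofalign(hmm_7tm, seqs{sn});
    end
end
```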
Next, send the results to the Web Browser to better explore the new multiple alignment. Columns
marked with * at the bottom indicate when the model was in a "match" or "delete" state.
hmmprofmerge(aligned_seqs,names,scores)
You can also explore the alignment from the command window; the hmmprofmerge function with one
output argument places the aligned sequences into a char array.
str = hmmprofmerge(aligned_seqs);
str(1:10,1:80)
ans =
'YILVKAIYTLGYSVS.LMSLATGSIILCLF.RKLHCTR.NYIHLNLFLSFILRAISVLVK.DDVLYSSSG-TLH......'
'FRSVKIGYTIGHSVS.LISLTTAIVILCMS.RKLHCTR.NYIHMHLFVSFILKAIAVFVK.DAVLYDVIQESDN......'
'YNTVKTGYTIGYSLS.LASLLVAMAILSLF.RKLHCTR.NYIHMHLFMSFILRATAVFIK.DMALFNSG-EIDH......'
'FGAIKTGYTIGHSLS.LISLTAAMIILCIF.RKLHCTR.NYIHMHLFMSFIMRAIAVFIK.DIVLFESG-ESDH......'
'YLSVKALYTVGYSTS.LVTLTTAMVILCRF.RKLHCTR.NFIHMNLFVSFMLRAISVFIK.DWILYAEQD-SSH......'
'FSTVKIIYTTGHSIS.IVALCVAIAILVAL.RRLHCPR.NYIHTQLFATFILKASAVFLK.DAAIFQGDS-TDH......'
'LSTLKQLYTAGYATS.LISLITAVIIFTCF.RKFHCTR.NYIHINLFVSFILRATAVFIK.DAVLFSDET-QNH......'
'FDRLGMIYTVGYSVS.LASLTVAVLILAYF.RRLHCTR.NYIHMHLFLSFMLRAVSIFVK.DAVLYSGATLDEA......'
'FERLYVMYTVGYSIS.FGSLAVAILIIGYF.RRLHCTR.NYIHMHLFVSFMLRATSIFVK.DRVVHAHIGVKEL......'
'ALNLFYLTIIGHGLS.IASLLISLGIFFYF.KSLSCQR.ITLHKNLFFSFVCNSVVTIIH.LTAVANNQALVAT......'
Having a profile HMM which describes this family has several advantages over plain sequence
comparison. Suppose that you have a new sequence that you want to relate to the 7-transmembrane
receptor family. For this example, get a protein sequence from NCBI and extract the
amino acid sequence.
mousegpcr = getgenpept('NP_783573');
Bai3 = mousegpcr.Sequence;
seqdisp(Bai3,'row',70)
First, using local alignment compare the new sequence to one of the sequences in the multiple
alignment. For instance use the first sequence, in this case the human protein 'VIPR2'. The Smith-
Waterman algorithm (swalign) can make use of scoring matrices. Scoring matrices can capture the
probability of substitution of symbols. The sequences in this example are known to be only distantly
related, so BLOSUM30 is a good choice for the scoring matrix.
VIPR2 = seqs{1};
[sc_aa_affine, alignment] = swalign(Bai3,VIPR2,'ScoringMatrix',...
'blosum30','gapopen',5,'extendgap',3,'showscore',true);
sc_aa_affine
sc_aa_affine =
69.6000
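The idea behind local alignment scoring can be illustrated with a toy version of the Smith-Waterman recursion. This sketch is not the swalign implementation: it assumes a simple match/mismatch score instead of BLOSUM30 and a linear gap penalty instead of the affine penalties used above, and the sequences and parameter values are made up for illustration.

```matlab
a = 'GATTACA';
b = 'TTAC';
matchScore    = 2;    % assumed toy scoring values
mismatchScore = -1;
gapPenalty    = 2;

% F(i+1, j+1) holds the best score of a local alignment ending at a(i), b(j)
F = zeros(numel(a) + 1, numel(b) + 1);
for i = 1:numel(a)
    for j = 1:numel(b)
        if a(i) == b(j)
            s = matchScore;
        else
            s = mismatchScore;
        end
        F(i+1, j+1) = max([0, ...
            F(i, j) + s, ...              % align a(i) with b(j)
            F(i, j+1) - gapPenalty, ...   % gap in b
            F(i+1, j) - gapPenalty]);     % gap in a
    end
end

% The best local alignment ends at the largest entry of the scoring space
[bestScore, idx] = max(F(:));
[iEnd, jEnd] = ind2sub(size(F), idx);
fprintf('best local score %d, ending at a(%d) and b(%d)\n', bestScore, iEnd - 1, jEnd - 1);
```

The matrix F plays the role of the scoring space displayed by the showscore option: high-scoring diagonals indicate locally similar regions.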
Judging from the scoring space, the two sequences appear to be related. However, this relationship
could not be inferred from a dot plot.
Bai3_aligned_region = strrep(alignment(1,:),'-','');
seqdotplot(VIPR2,Bai3_aligned_region,7,2)
ylabel('VIPR2'); xlabel('Bai3');
Is either of these two examples enough evidence to affirm that these sequences are related? One way
to test this is to randomly create a fake sequence with the same distribution of amino acids and see
how it aligns to the family. Notice that the score of the local alignment between the fake sequence
and the VIPR2 protein is not significantly lower than the score of the alignment between the Bai3 and
VIPR2 proteins. To ensure reproducibility of the results of this example, we reset the global random
generator.
rng(0,'twister');
fakeSeq = randseq(1000,'FROMSTRUCTURE',aacount(VIPR2));
sc_fk_affine = swalign(fakeSeq,VIPR2,'ScoringMatrix','blosum30',...
'gapopen',5,'extendgap',3,'showscore',true)
sc_fk_affine =
60.4000
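Generating a random sequence with a given amino acid composition amounts to sampling from a discrete distribution. The following sketch shows the idea in base MATLAB using inverse-CDF sampling. It is not how randseq is implemented, and the source string is only the first row of the VIPR2 display above with the periods removed.

```matlab
rng(0, 'twister');   % same seeding call as in the text

% First row of the VIPR2 display above, periods removed
src = 'YILVKAIYTLGYSVSLMSLATGSIILCLFRKLHCTRNYIHLNLFLSFILRAISVLVK';

% Observed composition of the source sequence
alphabet = unique(src);
counts = zeros(1, numel(alphabet));
for k = 1:numel(alphabet)
    counts(k) = sum(src == alphabet(k));
end
p = counts / sum(counts);

% Inverse-CDF sampling: pick the first bin whose upper edge is >= r
edges = cumsum(p);
edges(end) = 1;                           % guard against round-off
n = 1000;
r = rand(n, 1);
idx = sum(bsxfun(@gt, r, edges), 2) + 1;
fakeSeq = reshape(alphabet(idx), 1, []);

disp(fakeSeq(1:60))
```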
In contrast, when you align both sequences to the family using the trained profile HMM, the score of
aligning the target sequence to the family profile is significantly larger than the score of aligning the
fake sequence.
sc_aa_hmm = hmmprofalign(hmm_7tm,Bai3)
sc_fk_hmm = hmmprofalign(hmm_7tm,fakeSeq)
sc_aa_hmm =
214.5286
sc_fk_hmm =
-49.1624
Similarly to the swalign alignment function, when you use profile alignments you can visualize the
scoring space using the showscore option to the hmmprofalign function.
hmmprofalign(hmm_7tm,Bai3,'showscore',true);
title('log-odds score for best path: Bai3');
hmmprofalign(hmm_7tm,fakeSeq,'showscore',true);
title('log-odds score for best path: fake sequence');
[sc_aa_hmm,align,ptrs] = hmmprofalign(hmm_7tm,Bai3);
Bai3_hmmaligned_region = Bai3(min(ptrs):max(ptrs));
hmmprofalign(hmm_7tm,Bai3_hmmaligned_region,'showscore',true);
title('log-odds score for best path: Bai3 aligned globally');
naa = numel(Bai3_hmmaligned_region);
repeats = randseq(1000,'FROMSTRUCTURE',aacount(Bai3)); %artificial example
repeats(200+(1:naa)) = Bai3_hmmaligned_region;
repeats(500+(1:naa)) = Bai3_hmmaligned_region;
repeats(700+(1:naa)) = Bai3_hmmaligned_region;
hmmprofalign(hmm_7tm,repeats,'showscore',true);
title('log-odds score for best path: Bai3 tandem repeats');
In MATLAB®, you can search for fragment domains by manually activating the B->M and M->E
transition probabilities of the HMM model.
hmm_7tm_f = hmm_7tm;
hmm_7tm_f.BeginX(3:end)=.002;
hmm_7tm_f.MatchX(1:end-1,4)=.002;
Create a random sequence, or fragment model, with a small insertion of the Bai3 protein:
fragment = randseq(1000,'FROMSTRUCTURE',aacount(Bai3));
fragment(501:550) = Bai3_hmmaligned_region(101:150);
Try aligning the random sequence with the inserted peptide to both models, the global and fragment
model:
hmmprofalign(hmm_7tm,fragment,'showscore',true);
title('log-odds score for best path: PF00002 global ');
hmmprofalign(hmm_7tm_f,fragment,'showscore',true);
title('log-odds score for best path: PF00002 fragment domains');
The function showhmmprof is an interactive tool to explore the profile HMM. Try right and left mouse
clicks over the model figures. There are three plots for each model: (1) the symbol emission
probabilities in the Match states, (2) the symbol emission probabilities in the Insert states, and (3)
the Transition probabilities.
showhmmprof(hmm_7tm,'scale','logodds')
An alternative method to explore a profile HMM is by creating a sequence logo from the multiple
alignment. A sequence logo displays the frequency of the symbols (bases or amino acid residues) found
at each position within a given region, usually for a binding site. Using the hmm_7tm sequences,
consider the portion of the
Parathyroid hormone-related peptide receptor (precursor) found at the N-terminus of the
PTRR_Human sequence. The seqlogo function allows a quick visual comparison of how well this region is
conserved across the 7tm family.
seqlogo(str,'startat',1,'endat',20,'alphabet','AA')
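The degree of conservation that a sequence logo conveys can be approximated by the information content of each alignment column. The sketch below computes it in base MATLAB for the first ten columns of the ten aligned sequences displayed earlier. It assumes a uniform background over 20 amino acids and applies no small-sample correction, so the values are only indicative of what seqlogo displays.

```matlab
% First ten columns of the ten aligned sequences shown above
aln = ['YILVKAIYTL'; 'FRSVKIGYTI'; 'YNTVKTGYTI'; 'FGAIKTGYTI'; 'YLSVKALYTV'; ...
       'FSTVKIIYTT'; 'LSTLKQLYTA'; 'FDRLGMIYTV'; 'FERLYVMYTV'; 'ALNLFYLTII'];

alphabet = 'ARNDCQEGHILKMFPSTWYV';
[nSeq, nCol] = size(aln);
info = zeros(1, nCol);
for c = 1:nCol
    % relative frequency of each amino acid in column c
    f = zeros(1, numel(alphabet));
    for k = 1:numel(alphabet)
        f(k) = sum(aln(:, c) == alphabet(k)) / nSeq;
    end
    f = f(f > 0);                                       % 0*log2(0) is taken as 0
    info(c) = log2(numel(alphabet)) + sum(f .* log2(f)); % bits
end

[~, best] = max(info);
fprintf('most conserved column: %d (%.2f bits)\n', best, info(best));
```

Columns 8 and 9, where nine of the ten sequences share the same residue, carry the most information in this toy alignment.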
Profile Estimation
Profile HMMs can also be estimated from a multiple alignment. As new sequences related to the
family are found, it is possible to re-estimate the model parameters.
hmm_7tm_new = hmmprofestimate(hmm_7tm,str)
hmm_7tm_new =
Name: '7tm_2'
PfamAccessionNumber: 'PF00002.19'
ModelDescription: '7 transmembrane receptor (Secretin family)'
ModelLength: 243
Alphabet: 'AA'
MatchEmission: [243x20 double]
InsertEmission: [243x20 double]
NullEmission: [0.0768 0.0418 0.0396 0.0305 0.0201 0.0378 ... ]
BeginX: [244x1 double]
MatchX: [242x4 double]
If your sequences are not pre-aligned, you can use the multialign function before
estimating a new HMM profile. It is possible to refine the HMM profile by re-aligning the sequences
to the model and re-estimating the model iteratively until you converge to a locally optimal model.
aligned_seqs = multialign(seqs);
hmm_7tm_ma = hmmprofestimate(hmmprofstruct(270),aligned_seqs)
showhmmprof(hmm_7tm_ma,'scale','logodds')
close; close; % close insertion emission prob. and transition prob.
hmm_7tm_ma =
ModelLength: 270
Alphabet: 'AA'
MatchEmission: [270x20 double]
InsertEmission: [270x20 double]
NullEmission: [0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 ... ]
BeginX: [271x1 double]
MatchX: [269x4 double]
InsertX: [269x2 double]
DeleteX: [269x2 double]
FlankingInsertX: [2x2 double]
LoopX: [2x2 double]
NullX: [2x1 double]
str = hmmprofmerge(aligned_seqs);
str(1:10,1:80)
ans =
'YILVKAIYTLGYSVSLMSLATGSIILCLF.RKLHCTRNYIHLNLFLSFILRAISVLVKDDVLYSS---SGTLHCP-....'
'FRSVKIGYTIGHSVSLISLTTAIVILCMS.RKLHCTRNYIHMHLFVSFILKAIAVFVKDAVLYDVIQ--ESDNCS-....'
'YNTVKTGYTIGYSLSLASLLVAMAILSLF.RKLHCTRNYIHMHLFMSFILRATAVFIKDMALFNS---GEIDHCS-....'
'FGAIKTGYTIGHSLSLISLTAAMIILCIF.RKLHCTRNYIHMHLFMSFIMRAIAVFIKDIVLFES---GESDHCH-....'
'YLSVKALYTVGYSTSLVTLTTAMVILCRF.RKLHCTRNFIHMNLFVSFMLRAISVFIKDWILYAE---QDSSHCF-....'
'FSTVKIIYTTGHSISIVALCVAIAILVAL.RRLHCPRNYIHTQLFATFILKASAVFLKDAAIFQG---DSTDHCS-....'
'LSTLKQLYTAGYATSLISLITAVIIFTCF.RKFHCTRNYIHINLFVSFILRATAVFIKDAVLFSD---ETQNHCL-....'
'FDRLGMIYTVGYSVSLASLTVAVLILAYF.RRLHCTRNYIHMHLFLSFMLRAVSIFVKDAVLYSGATLDEAERLTE....'
'FERLYVMYTVGYSISFGSLAVAILIIGYF.RRLHCTRNYIHMHLFVSFMLRATSIFVKDRVVHAHIGVKELESLIM....'
'ALNLFYLTIIGHGLSIASLLISLGIFFYF.KSLSCQRITLHKNLFFSFVCNSVVTIIHLTAVANNQALVATNP---....'
hmmprofmerge(aligned_seqs,names,scores)
Predicting Protein Secondary Structure Using a Neural Network
This example shows a secondary structure prediction method that uses a feed-forward neural
network and the functionality available with the Deep Learning Toolbox™.
It is a simplified example intended to illustrate the steps for setting up a neural network with the
purpose of predicting secondary structure of proteins. Its configuration and training methods are not
meant to be necessarily the best solution for the problem at hand.
Introduction
Neural network models attempt to simulate the information processing that occurs in the brain and
are widely used in a variety of applications, including automated pattern recognition.
The Rost-Sander data set [1] consists of proteins whose structures span a relatively wide range of
domain types, composition and length. The file RostSanderDataset.mat contains a subset of this
data set, where the structural assignment of every residue is reported for each protein sequence.
load RostSanderDataset.mat
N = numel(allSeq);
id =
'1CSE-ICOMPLEX(SERINEPROTEINASE-INHIBITOR)03-JU'
seq =
'KSFPEVVGKTVDQAREYFTLHYPQYNVYFLPEGSPVTLDLRYNRVRVFYNPGTNVVNHVPHVG'
str =
'CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC'
In this example, you will build a neural network to learn the structural state (helix, sheet or coil) of
each residue in a given protein, based on the structural patterns observed during a training phase.
Due to the random nature of some steps in the following approach, numeric results might be slightly
different every time the network is trained or a prediction is simulated. To ensure reproducibility of
the results, we reset the global random generator to a saved state included in the loaded file, as
shown below:
rng(savedState);
For the current problem we define a neural network with one input layer, one hidden layer and one
output layer. The input layer encodes a sliding window in each input amino acid sequence, and a
prediction is made on the structural state of the central residue in the window. We choose a window
of size 17 based on the statistical correlation found between the secondary structure of a given
residue position and the eight residues on either side of the prediction point [2]. Each window
position is encoded using a binary array of size 20, having one element for each amino acid type. In
each group of 20 inputs, the element corresponding to the amino acid type in the given position is set
to 1, while all other inputs are set to 0. Thus, the input layer consists of R = 17x20 input units, i.e. 17
groups of 20 inputs each.
In the following code, we first determine for each protein sequence all the possible subsequences
corresponding to a sliding window of size W by creating a Hankel matrix, where the ith column
represents the subsequence starting at the ith position in the original sequence. Then for each
position in the window, we create an array of size 20, and we set the jth element to 1 if the residue in
the given position has a numeric representation equal to j.
W = 17; % sliding window size
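The sliding-window encoding described above can be tried on a toy example. The sketch below assumes a window of size 3 and a short made-up sequence already expressed as integers between 1 and 20; it builds the Hankel matrix of windows, fills the binary input matrix position by position, and checks the result against an equivalent vectorized form based on Kronecker products.

```matlab
W = 3;                    % toy window size (the text uses W = 17)
seq = [1 5 20 3 7];       % toy sequence in integer code, values in 1..20

% Column k of win is the window that starts at position k of seq
win = hankel(seq(1:W), seq(W:end));
K = size(win, 2);

% One group of 20 binary inputs per window position
P = zeros(20*W, K);
for k = 1:K
    index = 20*(0:W-1)' + win(:, k);   % row to switch on in each group
    P(index, k) = 1;
end

% Equivalent vectorized form using Kronecker products
P2 = kron(win, ones(20, 1)) == kron(ones(size(win)), (1:20)');

disp(win)
fprintf('input matrix is %d-by-%d, %d ones per column\n', size(P, 1), size(P, 2), sum(P(:, 1)));
```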
The output layer of our neural network consists of three units, one for each of the considered
structural states (or classes), which are encoded using a binary scheme. To create the target matrix
for the neural network, we first obtain, from the data, the structural assignments of all possible
subsequences corresponding to the sliding window. Then we consider the central position in each
window and transform the corresponding structural assignment using the following binary encoding:
1 0 0 for coil, 0 1 0 for sheet, 0 0 1 for helix.
cr = ceil(W/2); % central residue position
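The target encoding can be sketched in the same way. The example below assumes a made-up structure string and a toy window of size 5, takes the structural state of the central residue of each window, and converts it to the binary scheme given above (1 0 0 for coil, 0 1 0 for sheet, 0 0 1 for helix). The variable names are illustrative.

```matlab
W = 5;                        % toy window size (the text uses W = 17)
str = 'CCEEEHHHC';            % toy structural assignment (C coil, E sheet, H helix)

cr = ceil(W/2);               % central residue position within a window
K = numel(str) - W + 1;       % number of sliding windows
centers = str(cr : cr + K - 1);   % state of the central residue of each window

% Class index per window: 1 = coil, 2 = sheet, 3 = helix
[~, cls] = ismember(centers, 'CEH');

% One column per window with a single 1 in the row of its class
T = zeros(3, K);
T(sub2ind(size(T), cls, 1:K)) = 1;
disp(T)
```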
You can perform the binarization of the input and target matrix described in the two steps above in a
more concise way by executing the following equivalent code:
% === concise binarization of the inputs and targets
for i = 1:N
seq = double(allSeq(i).Sequence);
win = hankel(seq(1:W),seq(W:end)); % concurrent inputs (sliding windows)
% ... binarization of allSeq(i).P and allSeq(i).T ...
end
Once we define the input and target matrices for each sequence, we create an input matrix, P, and
target matrix, T, representing the encoding for all the sequences fed into the network.
% === construct input and target matrices
P = double([allSeq.P]); % input matrix
T = double([allSeq.T]); % target matrix
The problem of secondary structure prediction can be thought of as a pattern recognition problem,
where the network is trained to recognize the structural state of the central residue most likely to
occur when specific residues in the given sliding window are observed. We create a pattern
recognition neural network using the input and target matrices defined above and specifying a
hidden layer of size 3.
hsize = 3;
net = patternnet(hsize);
net.layers{1} % hidden layer
net.layers{2} % output layer
ans =
name: 'Hidden'
dimensions: 3
distanceFcn: (none)
distanceParam: (none)
distances: []
initFcn: 'initnw'
netInputFcn: 'netsum'
netInputParam: (none)
positions: []
range: [3x2 double]
size: 3
topologyFcn: (none)
transferFcn: 'tansig'
transferParam: (none)
userdata: (your custom info)
ans =
name: 'Output'
dimensions: 0
distanceFcn: (none)
distanceParam: (none)
distances: []
initFcn: 'initnw'
netInputFcn: 'netsum'
netInputParam: (none)
positions: []
range: []
size: 0
topologyFcn: (none)
transferFcn: 'softmax'
transferParam: (none)
userdata: (your custom info)
The pattern recognition network uses the default Scaled Conjugate Gradient algorithm for training,
but other algorithms are available (see the Deep Learning Toolbox documentation for a list of
available functions). At each training cycle, the training sequences are presented to the network
through the sliding window defined above, one residue at a time. Each hidden unit transforms the
signals received from the input layer by using a sigmoid transfer function to produce a bounded output
signal that saturates toward one of two extreme values, simulating the firing of a neuron [2]. The
hidden layer listed above uses tansig, whose output lies between -1 and 1, whereas logsig produces
values between 0 and 1. Weights are adjusted
so that the error between the observed output from each unit and the desired output specified by the
target matrix is minimized.
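The behavior of the two sigmoid transfer functions can be checked numerically. The sketch below writes out their standard formulas in base MATLAB rather than calling the toolbox functions logsig and tansig, so it only assumes that those functions follow the usual definitions.

```matlab
x = linspace(-6, 6, 13);                  % sample net inputs, including 0

logsigOut = 1 ./ (1 + exp(-x));           % logistic sigmoid, values in (0, 1)
tansigOut = 2 ./ (1 + exp(-2*x)) - 1;     % hyperbolic tangent sigmoid, values in (-1, 1)

fprintf('logsig range: [%.4f, %.4f]\n', min(logsigOut), max(logsigOut));
fprintf('tansig range: [%.4f, %.4f]\n', min(tansigOut), max(tansigOut));
```

Both functions saturate for large positive or negative inputs, which is what lets a unit behave like a neuron that is either firing or silent.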
During training, the training tool window opens and displays the progress. Training details such as
the algorithm, the performance criteria, the type of error considered, etc. are shown.
Use the function view to generate a graphical view of the neural network.
view(net)
One common problem that occurs during neural network training is data overfitting, where the
network tends to memorize the training examples without learning how to generalize to new
situations. The default method for improving generalization is called early stopping and consists in
dividing the available training data set into three subsets: (i) the training set, which is used for
computing the gradient and updating the network weights and biases; (ii) the validation set, whose
error is monitored during the training process because it tends to increase when data is overfitted;
and (iii) the test set, whose error can be used to assess the quality of the division of the data set.
When using the function train, by default, the data is randomly divided according to the ratios in
net.divideParam (70% of the samples for training, 15% for validation, and 15% for testing with the
current dividerand defaults), but other types of
partitioning can be applied by specifying the property net.divideFcn (default dividerand). The
structural composition of the residues in the three subsets is comparable, as seen from the following
survey:
[i,j] = find(T(:,tr.trainInd));
Ctrain = sum(i == 1)/length(i);
Etrain = sum(i == 2)/length(i);
Htrain = sum(i == 3)/length(i);
[i,j] = find(T(:,tr.valInd));
Cval = sum(i == 1)/length(i);
Eval = sum(i == 2)/length(i);
Hval = sum(i == 3)/length(i);
[i,j] = find(T(:,tr.testInd));
Ctest = sum(i == 1)/length(i);
Etest = sum(i == 2)/length(i);
Htest = sum(i == 3)/length(i);
figure()
pie([Ctrain; Etrain; Htrain]);
title('Structural assignments in training data set');
legend('C', 'E', 'H')
figure()
pie([Cval; Eval; Hval]);
title('Structural assignments in validation data set');
legend('C', 'E', 'H')
figure()
pie([Ctest; Etest; Htest]);
title('Structural assignments in testing data set ');
legend('C', 'E', 'H')
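The random division and the class survey can be reproduced on toy data without the toolbox. In the sketch below the target matrix is randomly generated and the split ratios are illustrative assumptions; the class fractions are computed from row sums, which gives the same result as the find-based counts above.

```matlab
rng(0, 'twister');
Q = 200;                                 % toy number of samples (columns of T)
labels = randi(3, 1, Q);                 % made-up class labels, 1..3
T = zeros(3, Q);
T(sub2ind(size(T), labels, 1:Q)) = 1;    % binary target matrix

trainRatio = 0.70;                       % illustrative ratios (assumption)
valRatio   = 0.15;

% Random partition of the column indices into three disjoint subsets
order  = randperm(Q);
nTrain = round(trainRatio * Q);
nVal   = round(valRatio * Q);
trainInd = order(1:nTrain);
valInd   = order(nTrain+1 : nTrain+nVal);
testInd  = order(nTrain+nVal+1 : end);

% Fraction of each class (rows: C, E, H) in each subset
fracTrain = sum(T(:, trainInd), 2) / numel(trainInd);
fracVal   = sum(T(:, valInd), 2)   / numel(valInd);
fracTest  = sum(T(:, testInd), 2)  / numel(testInd);
disp([fracTrain, fracVal, fracTest])
```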
The function plotperform displays the trends of the training, validation, and test errors as training
iterations pass.
figure()
plotperform(tr)
The training process stops when one of several conditions (see net.trainParam) is met. For
example, in the training considered, the training process stops when the validation error increases
for a specified number of iterations (6) or the maximum number of allowed iterations is reached
(1000).
To analyze the network response, we examine the confusion matrix by considering the outputs of the
trained network and comparing them to the expected results (targets).
O = sim(net,P);
figure()
plotconfusion(T,O);
The diagonal cells show the number of residue positions that were correctly classified for each
structural class. The off-diagonal cells show the number of residue positions that were misclassified
(e.g. helical positions predicted as coiled positions). Both the number of observations and the
percentage of the total number
of observations are shown in each cell. The column on the far right of the plot shows the percentages
of all the examples predicted to belong to each class that are correctly and incorrectly classified.
These metrics are often called the precision (or positive predictive value) and false discovery rate,
respectively. The row at the bottom of the plot shows the percentages of all the examples belonging to
each class that are correctly and incorrectly classified. These metrics are often called the recall (or
true positive rate) and false negative rate, respectively. The cell in the bottom right of the plot shows
the overall accuracy.
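The quantities reported in the confusion plot can be computed by hand. The following sketch uses made-up targets and network outputs for ten residues and only base MATLAB functions. It follows the layout described above, with predicted classes along the rows and target classes along the columns.

```matlab
% Toy class labels for ten residues (1 = coil, 2 = sheet, 3 = helix)
targets = [1 1 1 2 2 3 3 3 3 1];
preds   = [1 1 2 2 2 3 3 1 3 1];
Q = numel(targets);

% Binary target matrix and made-up network outputs, one column per residue
T = zeros(3, Q);
T(sub2ind(size(T), targets, 1:Q)) = 1;
O = 0.1 * ones(3, Q);
O(sub2ind(size(O), preds, 1:Q)) = 0.8;

% Assign each residue to the class with the largest output
[~, tcls] = max(T, [], 1);
[~, ocls] = max(O, [], 1);

% C(i, j): number of residues predicted as class i whose target is class j
C = accumarray([ocls(:), tcls(:)], 1, [3 3]);

precision = diag(C) ./ sum(C, 2);     % far-right column of the plot
recall    = diag(C)' ./ sum(C, 1);    % bottom row of the plot
accuracy  = trace(C) / Q;             % bottom-right cell

disp(C)
fprintf('overall accuracy: %.2f\n', accuracy);
```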
We can also consider the Receiver Operating Characteristic (ROC) curve, a plot of the true positive
rate (sensitivity) versus the false positive rate (1 - specificity).
figure()
plotroc(T,O);
The neural network that we have defined is relatively simple. To achieve some improvements in the
prediction accuracy we could try one of the following:
• Increase the number of training vectors. Increasing the number of sequences dedicated to
training requires a larger curated database of protein structures, with an appropriate distribution
of coiled, helical and sheet elements.
• Increase the number of input values. Increasing the window size or adding more relevant
information, such as biochemical properties of the amino acids, are valid options.
• Use a different training algorithm. Various algorithms differ in memory and speed requirements.
For example, the Scaled Conjugate Gradient algorithm is relatively slow but memory efficient,
while the Levenberg-Marquardt is faster but more demanding in terms of memory.
• Increase the number of hidden neurons. By adding more hidden units we generally obtain a more
sophisticated network with the potential for better performance, but we must be careful not to
overfit the data.
We can specify more hidden layers or increased hidden layer size when the pattern recognition
network is created, as shown below:
hsize = [3 4 2];
net3 = patternnet(hsize);
hsize = 20;
net20 = patternnet(hsize);
We can also assign the network initial weights to random values in the range -0.1 to 0.1 as suggested
by the study reported in [2] by setting the net20.IW and net20.LW properties as follows:
% === assign random values in the range -.1 and .1 to the weights
net20.IW{1} = -.1 + (.1 + .1) .* rand(size(net20.IW{1}));
net20.LW{2} = -.1 + (.1 + .1) .* rand(size(net20.LW{2}));
In general, larger networks (with 20 or more hidden units) achieve better accuracy on the protein
training set, but worse prediction accuracy on data not used for training. Because a 20-hidden-unit network
involves almost 7,000 weights and biases, the network is generally able to fit the training set closely
but loses the ability to generalize. The compromise between intensive training and prediction
accuracy is one of the fundamental limitations of neural networks.
net20 = train(net20,P,T);
O20 = sim(net20,P);
numWeightsAndBiases = length(getx(net20))
numWeightsAndBiases =
6883
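The reported count can be checked by hand. The sketch below assumes a fully connected network with a single hidden layer and 340 inputs (a sliding window of 17 residues with 20 inputs per residue). The input size is an assumption here, since the input encoding is defined outside this passage.

```matlab
% Assumed architecture: nIn inputs, one hidden layer of nHid units, nOut outputs
nIn  = 17*20;   % assumption: window of 17 residues, 20 inputs per residue
nHid = 20;      % hidden units, as in net20
nOut = 3;       % structural states C, E, H

% Each layer contributes one weight per connection plus one bias per unit
nParams = (nIn*nHid + nHid) + (nHid*nOut + nOut)
```

Under these assumptions the total matches the 6883 weights and biases reported by getx, almost all of them in the input-to-hidden layer.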
You can display the confusion matrices for training, validation and test subsets by clicking on the
corresponding button in the training tool window.
You can evaluate structure predictions in detail by calculating prediction quality indices [3], which
indicate how well a particular state is predicted and whether overprediction or underprediction has
occurred. We define the index pcObs(S) for state S (S = {C, E, H}) as the number of residues
correctly predicted in state S, divided by the number of residues observed in state S. Similarly, we
define the index pcPred(S) for state S as the number of residues correctly predicted in state S,
divided by the number of residues predicted in state S.
[i,j] = find(compet(O));
[u,v] = find(T);
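The two indices defined above can be sketched in base MATLAB as follows. The target matrix T and output matrix O here are small hypothetical examples (three states by six residues), not the data of this example, and the winning state is picked with max rather than compet.

```matlab
% Hypothetical targets T and outputs O: 3 states (rows) x 6 residues (columns)
T = [1 1 0 0 0 0;     % state C
     0 0 1 1 0 0;     % state E
     0 0 0 0 1 1];    % state H
O = [0.8 0.2 0.1 0.3 0.2 0.1;
     0.1 0.7 0.8 0.6 0.1 0.5;
     0.1 0.1 0.1 0.1 0.7 0.4];

[~, pred] = max(O, [], 1);   % predicted state per residue (winner takes all)
[~, obs]  = max(T, [], 1);   % observed state per residue

nStates = size(T,1);
pcObs   = zeros(1,nStates);
pcPred  = zeros(1,nStates);
for s = 1:nStates
    nCorrect  = sum(pred == s & obs == s);   % correctly predicted in state s
    pcObs(s)  = nCorrect / sum(obs == s);    % relative to residues observed in s
    pcPred(s) = nCorrect / sum(pred == s);   % relative to residues predicted in s
end
```

In this toy case state E is overpredicted: every observed E residue is found (pcObs is 1), but only half of the residues predicted as E are correct (pcPred is 0.5).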
These quality indices are useful for the interpretation of the prediction accuracy. In fact, in cases
where the prediction technique tends to overpredict/underpredict a given state, a high/low prediction
accuracy might just be an artifact and does not provide a measure of quality for the technique itself.
Conclusions
The method presented here predicts the structural state of a given protein residue based on the
structural state of its neighbors. However, there are further constraints when predicting the content
of structural elements in a protein, such as the minimum length of each structural element.
Specifically, a helix is assigned to any group of four or more contiguous residues, and a sheet is
assigned to any group of two or more contiguous residues. To incorporate this type of information, an
additional network can be created so that the first network predicts the structural state from the
amino acid sequence, and the second network predicts the structural element from the structural
state.
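The minimum-length constraints can also be illustrated with a simple post-processing filter. This is only a sketch of the constraint itself, not the second network suggested above. The prediction string and the choice to relabel short runs as coil are assumptions made for illustration.

```matlab
pred = 'CCHHCCHHHHHCECEEC';          % hypothetical per-residue prediction

minLen   = struct('H', 4, 'E', 2);   % minimum run length for helix and sheet
filtered = pred;
states   = fieldnames(minLen);
for k = 1:numel(states)
    s = states{k};
    % Locate every maximal run of state s
    [startIdx, endIdx] = regexp(pred, [s '+']);
    for r = 1:numel(startIdx)
        if endIdx(r) - startIdx(r) + 1 < minLen.(s)
            filtered(startIdx(r):endIdx(r)) = 'C';   % too short: treat as coil
        end
    end
end
filtered
```

Here the two-residue helix and the isolated sheet residue are relabeled, while the five-residue helix and the two-residue sheet are kept.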
References
[1] Rost, B., and Sander, C., "Prediction of protein secondary structure at better than 70% accuracy",
Journal of Molecular Biology, 232(2):584-99, 1993.
[2] Holley, L.H. and Karplus, M., "Protein secondary structure prediction with a neural network",
PNAS, 86(1):152-6, 1989.
[3] Kabsch, W., and Sander, C., "How good are predictions of protein secondary structure?", FEBS
Letters, 155(2):179-82, 1983.
This example shows how to display, inspect and annotate the three-dimensional structure of
molecules. This example performs a three-dimensional superposition of the structures of two related
proteins.
Introduction
Ubiquitin is a small protein of approximately 76 amino acids, found in all eukaryotic cells and very
well conserved among species. Through post-translational modification of a variety of proteins,
ubiquitin is involved in many diverse biological processes, including protein degradation, protein
trafficking, DNA repair, gene regulation, etc. Because of its ubiquitous presence in cells and its
involvement in many fundamental processes, ubiquitin has been the focus of extensive research at
the sequence, structural, and functional level.
You can view the three-dimensional structure of ubiquitin by downloading the crystal structure file
from the PDB database and then displaying it using the molviewer function. By default, the protein
structure is rendered such that each atom is represented by a ball and each bond is represented by a
stick. You can change the mode of rendering by selecting display options below the figure. You can
also rotate and manipulate the structure by click-dragging the protein or by entering Rasmol
commands in the Scripting Console.
In this example, we will explore the structural characteristics of ubiquitin through combinations of
Rasmol commands passed to the evalrasmolscript function. However, you can perform the same
analysis by using the Molecule Viewer window. The information for the ubiquitin protein is provided
in the MAT-file ubilikedata.mat.
load('ubilikedata.mat','ubi')
Alternatively, you can use the getpdb function to retrieve the protein information from the PDB
repository and load it into MATLAB®. Note that data in public repositories is frequently curated and
updated; therefore the results of this example might be slightly different when you use up-to-date
datasets.
ubi = getpdb('1ubi');
h1 = molviewer(ubi);
Visualizing the Three-Dimensional Structure of a Molecule
We can look at the ubiquitin fold by using the "cartoon" rendering, which clearly displays the
secondary structure elements. We restrict our selection to the protein, since we are not interested in
displaying other heterogeneous particles, such as water molecules.
% Display the molecule as cartoon and color the atoms according to their
% secondary structure assignment. Then remove other atoms and bonds.
evalrasmolscript(h1, ['spacefill off; wireframe off; ' ...
'restrict protein; cartoon on; color structure; ' ...
'center selected;']);
The ubiquitin fold consists of five antiparallel beta strands, one alpha helix, a small 3-10 helix, and
several turns and loops. The fold resembles a small barrel, with the beta sheet forming one side and
the alpha helix forming the other side of the barrel. The bottom part is closed by the 3-10 helix. We
can better appreciate the compact, globular fold of ubiquitin by spinning the structure 360 degrees
and by zooming in and out using the "move" command.
The compactness and high stability of the ubiquitin fold are related to the spatial distribution of
hydrophobic and hydrophilic amino acids in the folded state. We can look at the distribution of
charged amino acids by selecting positively and negatively charged residues and then by rendering
these atoms with different colors (red and blue respectively). We can also render water molecules as
white to see their relationship to the charged residues.
evalrasmolscript(h1, ['select protein; color gray; ' ...
'select positive; color red; spacefill 300; ' ...
'select negative; color blue; spacefill 300; ' ...
'select HOH; color white; spacefill 100;']); % water atoms
The charged amino acids are located primarily on the surface exposed to the solvent, where they
interact with the water molecules. In particular, we notice that the charge distribution is not uniform
across the sides of the ubiquitin barrel. In fact, the side with the alpha helix appears to be more
crowded with charged amino acids than the side containing the beta strands.
We can perform a similar analysis by looking at the spatial distribution of some hydrophobic amino
acids, such as Alanine, Isoleucine, Valine, Leucine and Methionine. You can also use the Rasmol label
"hydrophobic" to select all hydrophobic residues.
Unlike the charged amino acids above, the hydrophobic amino acids are located primarily in the
interior of the barrel. This gives high stability to the ubiquitin fold, since hydrophobic amino acids are
shielded from the solvent, making the protein structure compact and tight.
Ubiquitin displays a tight fold with one alpha helix traversing one side of the small barrel. The length
of this alpha helix presents some variation among the representatives of the ubiquitin-like protein
family. We can determine the actual size of the helix either by double clicking on the relevant atoms
or by using MATLAB® and Rasmol commands as follows.
initHelixRes = 23
endHelixRes = 34
'select Lys and *.ca; spacefill 300; labels on; ' ...
... % label alpha carbons
'select 72-76; wireframe 100; color cyan; ']);
% select C-terminal tail
Several studies have shown that different roles are played by polyubiquitins when the molecules are
linked together through different Lysines. For example, Lys(11)-, Lys(29)-, and Lys(48)-linked
polyubiquitins target proteins for the proteasome (i.e., for degradation). In contrast, Lys(6)- and
Lys(63)-linked polyubiquitins are associated with reversible modifications, such as protein trafficking
control.
The crystal structure of a diubiquitin chain consisting of two moieties is represented in the PDB
record 1aar. We can view and label an actual isopeptide bond between the C-terminal tail of one
ubiquitin (labeled as chain A), and Lys(48) of the other ubiquitin (labeled as chain B).
Retrieve the protein 1aar from PDB or load the data from the MAT-file.
aar = getpdb('1aar');
load('ubilikedata.mat','aar')
h2 = molviewer(aar);
There is a surprisingly diverse family of ubiquitin-like proteins that display significant structural
similarity to ubiquitin. One of these proteins is SUMO (Small Ubiquitin-like MOdifier), a small protein
involved in a wide spectrum of post-translational modifications, such as transcriptional regulation,
nuclear-cytosolic transport, and protein stability. Similar to ubiquitination, the covalent attachment
and detachment of SUMO occur via a cascade of enzymatic actions. Despite the structural and
operational similarities between ubiquitin and SUMO, these two proteins display quite limited
sequence similarity, as can be seen from their global sequence alignment.
Retrieve the protein SUMO from PDB or load the data from the MAT-file.
sumo = getpdb('1wm2');
load('ubilikedata.mat','sumo')
score = -3.3333
In order to better appreciate the structural similarity between ubiquitin and SUMO, perform a three-
dimensional superposition of the two structures. Using the pdbsuperpose function, we compute and
apply a linear transformation (translation, reflection, orthogonal rotation, and scaling) such that the
atoms of one structure best conform to the atoms of the other structure.
pdbsuperpose(ubi, sumo);
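To illustrate the idea of superposing two coordinate sets, here is a minimal sketch of a rigid-body fit (rotation and translation only) using the Kabsch algorithm in base MATLAB. It is not the implementation of pdbsuperpose, it omits the scaling and reflection mentioned above, and the coordinates A and B are hypothetical.

```matlab
% Hypothetical coordinates: rows are atoms, columns are x, y, z
A = [0 0 0; 1 0 0; 0 2 0; 0 0 3; 1 1 1];
theta = pi/6;
Rtrue = [cos(theta) -sin(theta) 0; sin(theta) cos(theta) 0; 0 0 1];
B = A*Rtrue' + repmat([5 -2 1], size(A,1), 1);   % rotated and shifted copy of A

% Center both coordinate sets on their centroids
cA = mean(A,1);
cB = mean(B,1);
A0 = A - repmat(cA, size(A,1), 1);
B0 = B - repmat(cB, size(B,1), 1);

% Optimal rotation from the SVD of the covariance matrix (Kabsch algorithm)
[U,~,V] = svd(B0' * A0);
d = sign(det(V*U'));            % guard against an improper rotation
R = V * diag([1 1 d]) * U';

% Apply the fitted transformation to B and measure the residual
Bfit = B0*R' + repmat(cA, size(B,1), 1);
rmsd = sqrt(mean(sum((Bfit - A).^2, 2)))
```

Because B is an exact rotated and translated copy of A, the residual is zero up to rounding. For two different proteins the residual would measure how well the matched atoms coincide.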
By selecting the appropriate option button in the Models section of the Molecule Viewer window, we
can view the ubiquitin structure (Model = 1) and the SUMO-2 structure (Model = 2) separately or we
can look at them superposed (Model = All). When both models are actively displayed, the structural
similarity between the two folds is striking.
The conservation of the structural fold in the absence of a significant sequence similarity could point
to the occurrence of convergent evolution for these two proteins. However, some of the mechanisms
in ubiquitination and sumoylation have analogies that are not fold-related and could suggest some
deeper, perhaps distant, relationship. More importantly, the fact that the spectrum of functions
performed by ubiquitin and SUMO-2 is so widespread suggests that the high stability and
compactness of the ubiquitin-like superfold might be the reason behind its conservation.
close all;
Calculating and Visualizing Sequence Statistics
This example shows how to use basic sequence manipulation techniques and compute some useful
sequence statistics. It also illustrates how to look for coding regions (such as proteins) and pursue
further analysis of them.
In this example you will explore the DNA sequence of the human mitochondrial genome. Mitochondria are
structures, called organelles, that are found in the cytoplasm of the cell, with hundreds to thousands
per cell. Mitochondria are generally the major energy production center in eukaryotes; they help to
degrade fats and sugars.
The consensus sequence of the human mitochondrial genome has accession number NC_012920. You
can use the getgenbank function to get the latest annotated sequence from GenBank® into the MATLAB®
workspace.
mitochondria_gbk = getgenbank('NC_012920');
For your convenience, a previously downloaded sequence is included in a MAT-file. Note that data in
public repositories is frequently curated and updated; therefore the results of this example might be
slightly different when you use up-to-date datasets.
load mitochondria
Copy just the DNA sequence to a new variable mitochondria. You can access parts of the DNA
sequence by using regular MATLAB indexing commands.
mitochondria = mitochondria_gbk.Sequence;
mitochondria_length = length(mitochondria)
first_300_bases = seqdisp(mitochondria(1:300))
mitochondria_length =
16569
first_300_bases =
You can look at the composition of the nucleotides with the ntdensity function.
figure
ntdensity(mitochondria)
This shows that the mitochondrial genome is A-T rich. The GC content is sometimes used to classify
organisms in taxonomy; it may vary between species from ~30% up to ~70%. Measuring GC
content is also useful for identifying genes and for estimating the annealing temperature of a DNA
sequence.
Now, you will use some of the sequence statistics functions in the Bioinformatics Toolbox™ to look at
various properties of the human mitochondrial genome. You can count the number of bases of the
whole sequence using the basecount function.
bases = basecount(mitochondria)
bases =
A: 5124
C: 5181
G: 2169
T: 4094
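As a short worked example, the GC content can be computed directly from the counts listed by basecount. The sketch below copies those four numbers into a structure by hand rather than calling any toolbox function.

```matlab
% Base counts as listed by basecount for the human mitochondrial genome
bases = struct('A', 5124, 'C', 5181, 'G', 2169, 'T', 4094);

total     = bases.A + bases.C + bases.G + bases.T;
gcContent = (bases.G + bases.C) / total;   % fraction of G and C
atContent = (bases.A + bases.T) / total;   % fraction of A and T
fprintf('GC content: %.1f%%, AT content: %.1f%%\n', 100*gcContent, 100*atContent);
```

The GC fraction comes out at roughly 44%, in line with the A-T richness noted earlier. The four counts sum to 16568, one less than the sequence length of 16569, which suggests that one position holds a symbol other than A, C, G, or T.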
The counts returned by basecount are for the 5'-3' strand. You can look at the reverse complement strand using the
seqrcomplement function.
compBases = basecount(seqrcomplement(mitochondria))
compBases =
A: 4094
C: 2169
G: 5181
T: 5124
As expected, the base counts on the reverse complement strand are complementary to the counts on
the 5'-3' strand.
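The relationship between the two sets of counts can be reproduced on a short, hypothetical sequence using only base MATLAB. This is a sketch of the idea, not a replacement for seqrcomplement, and it handles only the four unambiguous uppercase bases.

```matlab
seq = 'ACCGTTTAAC';                 % hypothetical short sequence

% Map each base to its complement, then reverse the order
comp = seq;
comp(seq == 'A') = 'T';
comp(seq == 'T') = 'A';
comp(seq == 'C') = 'G';
comp(seq == 'G') = 'C';
rc = fliplr(comp);

% Count A, C, G, T on each strand
countBases = @(s) [sum(s == 'A') sum(s == 'C') sum(s == 'G') sum(s == 'T')];
fwd = countBases(seq)
rev = countBases(rc)
```

The A and T counts swap, as do the C and G counts, so rev is fwd read in reverse order. The same pattern appears in the bases and compBases outputs.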
You can use the chart option to basecount to display a pie chart of the distribution of the bases.
figure
basecount(mitochondria,'chart','pie');
title('Distribution of Nucleotide Bases for Human Mitochondrial Genome');
Now look at the dimers in the sequence and display the information in a bar chart using
dimercount.
figure
dimers = dimercount(mitochondria,'chart','bar')
title('Mitochondrial Genome Dimer Histogram');
dimers =
AA: 1604
AC: 1495
AG: 795
AT: 1230
CA: 1534
CC: 1771
CG: 435
CT: 1440
GA: 613
GC: 711
GG: 425
GT: 419
TA: 1373
TC: 1204
TG: 513
TT: 1004
In a nucleotide sequence, an obvious thing to look for is whether there are any open reading frames (ORFs). An ORF
is any sequence of DNA or RNA that can potentially be translated into a protein. The function
seqshoworfs can be used to visualize ORFs in a sequence.
Note: In the HTML tutorial only the first page of the output is shown; however, when running the
example you will be able to inspect the complete mitochondrial genome using the scrollbar on the
figure.
seqshoworfs(mitochondria);
If you compare this output to the genes shown on the NCBI page there seem to be slightly fewer
ORFs, and hence fewer genes, than expected.
Vertebrate mitochondria do not use the Standard genetic code, so some codons have a different
meaning in mitochondrial genomes. For more information about using different genetic codes in
MATLAB see the help for the function geneticcode. The GeneticCode option to the seqshoworfs
function allows you to look at the ORFs again but this time with the vertebrate mitochondrial genetic
code.
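To see why the choice of genetic code changes the ORFs that are found, consider the stop codons. In the Standard code the stops are TAA, TAG, and TGA, whereas in the Vertebrate Mitochondrial code TGA encodes tryptophan and AGA and AGG act as stops. The following base MATLAB sketch scans a hypothetical sequence in its first frame under both sets of stop codons. It does not call seqshoworfs.

```matlab
seq = 'ATGTGACCATGAAAATAA';   % hypothetical sequence, read in frame 1

stopStandard = {'TAA','TAG','TGA'};          % Standard code
stopVertMito = {'TAA','TAG','AGA','AGG'};    % Vertebrate Mitochondrial code

codons = cellstr(reshape(seq, 3, [])');      % one codon per cell
firstStopStd  = find(ismember(codons, stopStandard), 1)
firstStopMito = find(ismember(codons, stopVertMito), 1)
```

Under the Standard code the frame is terminated at the second codon (TGA), while under the Vertebrate Mitochondrial code it stays open until the final TAA. This is how ORFs can appear longer when the appropriate code is used.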
In the human mitochondrial DNA sequence some genes are also started by alternative start codons
[1]. Use the AlternativeStartCodons option to the seqshoworfs function to search also for
these ORFs.
Notice that there are now two much larger ORFs on the third reading frame: One starting at position
4470 and the other starting at 5904. These correspond to the ND2 (NADH dehydrogenase subunit 2)
and COX1 (cytochrome c oxidase subunit I) genes.
orfs = seqshoworfs(mitochondria,'GeneticCode','Vertebrate Mitochondrial',...
'AlternativeStartCodons',true)
orfs =
Start
Stop
You can also look at all the features that have been annotated to the human mitochondrial genome.
Explore the complete GenBank entry mitochondria_gbk with the featureparse function.
In particular, you can explore the annotated coding sequences (CDS) and compare them with the ORFs
previously found. Use the Sequence option to the featureparse function to extract, when possible,
the DNA sequence corresponding to each feature. The featureparse function will complement the
pieces of the source sequence when appropriate.
features = featureparse(mitochondria_gbk,'Sequence',true)
coding_sequences = features.CDS;
coding_sequences_id = sprintf('%s ',coding_sequences.gene)
features =
coding_sequences_id =
'ND1 ND2 COX1 COX2 ATP8 ATP6 COX3 ND3 ND4L ND4 ND5 ND6 CYTB '
ND2CDS =
Location: '4470..5511'
Indices: [4470 5511]
gene: 'ND2'
gene_synonym: 'MTND2'
note: 'TAA stop codon is completed by the addition of 3' A residues to the mRNA'
codon_start: '1'
transl_except: '(pos:5511,aa:TERM)'
transl_table: '2'
product: 'NADH dehydrogenase subunit 2'
protein_id: 'YP_003024027.1'
db_xref: {'GI:251831108' 'GeneID:4536' 'HGNC:7456' 'MIM:516001'}
translation: 'MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVLTKKMNPRSTEAAIKYFLTQATASMILLMAILFN
Sequence: 'attaatcccctggcccaacccgtcatctactctaccatctttgcaggcacactcatcacagcgctaagctcgcactg
COX1CDS =
Location: '5904..7445'
Indices: [5904 7445]
gene: 'COX1'
gene_synonym: 'COI; MTCO1'
note: 'cytochrome c oxidase I'
codon_start: '1'
transl_except: []
transl_table: '2'
product: 'cytochrome c oxidase subunit I'
protein_id: 'YP_003024028.1'
db_xref: {'GI:251831109' 'GeneID:4512' 'HGNC:7419' 'MIM:516030'}
translation: 'MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMVMPIMIGG
Sequence: 'atgttcgccgaccgttgactattctctacaaaccacaaagacattggaacactatacctattattcggcgcatgagc
Create a map indicating all the features found in this GenBank entry using the featureview
function.
[h,l] = featureview(mitochondria_gbk,{'CDS','tRNA','rRNA','D_loop'},...
[2 1 2 2 2],'Fontsize',9);
legend(h,l,'interpreter','none');
title('Homo sapiens mitochondrion, complete genome')
You can translate the DNA sequences that code for the ND2 and COX1 proteins by using the nt2aa
function. Again the GeneticCode option must be used to specify the vertebrate mitochondrial
genetic code.
You can get a more complete picture of the amino acid content with aacount.
figure
subplot(2,1,1)
ND2aaCount = aacount(ND2,'chart','bar');
title('Histogram of Amino Acid Count for the ND2 Protein');
subplot(2,1,2)
COX1aaCount = aacount(COX1,'chart','bar');
title('Histogram of Amino Acid Count for the COX1 Protein');
Notice the high leucine, threonine and isoleucine content and also the lack of cysteine or aspartic
acid.
You can use the atomiccomp and molweight functions to calculate more properties of the ND2
protein.
ND2AtomicComp = atomiccomp(ND2)
ND2MolWeight = molweight(ND2)
ND2AtomicComp =
C: 1818
H: 2882
N: 420
O: 471
S: 25
ND2MolWeight =
3.8960e+04
For further investigation of the properties of the ND2 protein, use proteinplot. This is a graphical
user interface (GUI) that allows you to easily create plots of various properties, such as
hydrophobicity, of a protein sequence. Click on the "Edit" menu to create new properties, to modify
existing property values, or to adjust the smoothing parameters. Click on the "Help" menu in the GUI
for more information on how to use the tool.
proteinplot(ND2)
You can also programmatically create plots of various properties of the sequence using
proteinpropplot.
figure
proteinpropplot(ND2,'PropertyTitle','Parallel beta strand')
Calculating the Codon Frequency using all the Genes in the Human Mitochondrial Genome
The codoncount function counts the number of occurrences of each codon in the sequence and
displays a formatted table of the result.
codoncount(ND2CDS)
Notice that in the ND2 gene there are more CTA, ATC and ACC codons than others. You can check
what amino acids these codons get translated into using the nt2aa and aminolookup functions.
CTA_aa = aminolookup('code',nt2aa('CTA'))
ATC_aa = aminolookup('code',nt2aa('ATC'))
ACC_aa = aminolookup('code',nt2aa('ACC'))
CTA_aa =
'Leu Leucine
'
ATC_aa =
'Ile Isoleucine
'
ACC_aa =
'Thr Threonine
'
To calculate the codon frequency for all the genes you can concatenate them into a single sequence
before using the function codoncount. You need to ensure that the codons are complete (three
nucleotides each) so that the reading frame of the sequence is not lost in the concatenation.
numCDS = numel(coding_sequences);
CDS = cell(numCDS,1);
for i = 1:numCDS
seq = coding_sequences(i).Sequence;
CDS{i} = seq(1:3*floor(length(seq)/3));
end
allCDS = [CDS{:}];
codoncount(allCDS)
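The trimming and counting steps can be sketched on a toy sequence with base MATLAB alone. The sequence below is hypothetical, and the tally uses unique and accumarray instead of codoncount, so it lists only the codons that actually occur.

```matlab
seq = 'CTAATCACCCTACTAAT';     % hypothetical sequence; length is not a multiple of 3

seq    = seq(1:3*floor(length(seq)/3));      % drop the incomplete trailing codon
codons = cellstr(reshape(seq, 3, [])');      % one codon per cell
[uniqueCodons, ~, idx] = unique(codons);     % sorted list of distinct codons
counts = accumarray(idx(:), 1);              % number of occurrences of each codon
```

Without the trimming step the two leftover nucleotides would shift the reading frame of whatever sequence is concatenated next.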
Use the figure option to the codoncount function to show a heat map with the codon frequency.
Use the geneticcode option to overlay a grid on the figure that groups the synonymous codons
according to the Vertebrate Mitochondrial genetic code. Observe the particular bias of Leucine
(codons 'CTN').
figure
count = codoncount(allCDS,'figure',true,'geneticcode','Vertebrate Mitochondrial');
title('Human Mitochondrial Genome Codon Frequency')
close all
References
[1] Barrell, B.G., Bankier, A.T. and Drouin, J., "A different genetic code in human mitochondria",
Nature, 282(5735):189-94, 1979.
Aligning Pairs of Sequences
This example shows how to extract some sequences from GenBank®, find open reading frames
(ORFs), and then align the sequences using global and local alignment algorithms.
One of the many fascinating sections of the NCBI web site is the Genes and diseases section. This
section provides a comprehensive introduction to medical genetics.
In this example you will be looking at genes associated with Tay-Sachs Disease. Tay-Sachs is an
autosomal recessive disease caused by mutations in both alleles of a gene (HEXA, which codes for the
alpha subunit of hexosaminidase A) on chromosome 15.
The NCBI reference sequence for HEXA has accession number NM_000520. You can use the
getgenbank function to retrieve the sequence information from the NCBI data repository and load it
into MATLAB®.
humanHEXA = getgenbank('NM_000520');
By doing a BLAST search or by searching in the mouse genome you can find an orthologous gene.
AK080777 is the accession number for a mouse hexosaminidase A gene.
mouseHEXA = getgenbank('AK080777');
For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in
public repositories is frequently curated and updated; therefore the results of this example might be
slightly different when you use up-to-date datasets.
load('hexosaminidase.mat','humanHEXA','mouseHEXA')
You can use the function seqshoworfs to look for ORFs in the sequence for the human HEXA gene.
Notice that the longest ORF is on the first reading frame. The output value in the variable
humanORFs is a structure giving the position of the start and stop codons for all the ORFs on each
reading frame.
humanORFs = seqshoworfs(humanHEXA.Sequence)
Now look at the ORFs in the mouse HEXA gene. In this case the ORF is also on the first frame.
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
The first step is to use global sequence alignment to look for similarities between these sequences.
You could look at the alignment between the nucleotide sequences, but it is generally more
instructive to look at the alignment between the protein sequences; in this example we know that the
sequences are coding sequences. Use the nt2aa function to convert the nucleotide sequences into
the corresponding amino acid sequences. Observe that the HEXA gene occurs in the first frame for
both sequences; otherwise you would need to use the input argument Frame to specify an alternative coding
frame.
humanProtein = nt2aa(humanHEXA.Sequence);
mouseProtein = nt2aa(mouseHEXA.Sequence);
One of the easiest ways to look for similarity between sequences is with a dot plot.
seqdotplot(mouseProtein,humanProtein)
Warning: Match matrix has more points than available screen pixels.
Scaling image by factors of 1 in X and 2 in Y.
With the default settings, the dot plot is a little difficult to interpret, so you can try a slightly more
stringent dot plot.
seqdotplot(mouseProtein,humanProtein,4,3)
Warning: Match matrix has more points than available screen pixels.
Scaling image by factors of 1 in X and 2 in Y.
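The effect of the two extra arguments (a window size of 4 and a minimum of 3 matches) can be sketched in base MATLAB. This illustrates the filtering idea on two short hypothetical sequences. The internal implementation of seqdotplot may differ.

```matlab
s1 = 'MKTAYIAKQR';     % hypothetical sequences differing at one position
s2 = 'MKTAHIAKQR';
window = 4;
number = 3;

% Raw match matrix: entry (i,j) is 1 when s1(i) equals s2(j)
matches = double(bsxfun(@eq, s1(:), s2(:)'));

% Count the matches along each diagonal stretch of length "window"
diagCounts = conv2(matches, eye(window), 'valid');

% Keep only the windows that contain at least "number" matches
filtered = diagCounts >= number;
```

The raw matrix contains a few isolated off-diagonal matches (repeated K and A residues), but after filtering only the main diagonal survives. This is why a more stringent dot plot is easier to read.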
The diagonal line indicates that there is probably a good alignment so you can now take a look at the
global alignment using the function nwalign which uses the Needleman-Wunsch algorithm.
[score, globalAlignment] = nwalign(humanProtein,mouseProtein)
score = 634.3333
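For readers unfamiliar with the Needleman-Wunsch algorithm, the following is a minimal sketch of its scoring recursion in base MATLAB. It assumes a toy scoring scheme (match +1, mismatch -1, linear gap penalty 1) and two hypothetical sequences. nwalign itself uses a scoring matrix such as BLOSUM50 and its own gap settings, so the scores are not comparable.

```matlab
a = 'ACGT';                 % hypothetical sequences
b = 'AGT';
matchScore = 1;             % assumed toy scoring scheme
mismatchScore = -1;
gapPenalty = 1;

n = length(a);
m = length(b);
F = zeros(n+1, m+1);
F(:,1) = -gapPenalty * (0:n)';      % prefix of a aligned against gaps only
F(1,:) = -gapPenalty * (0:m);       % prefix of b aligned against gaps only

for i = 1:n
    for j = 1:m
        if a(i) == b(j)
            s = matchScore;
        else
            s = mismatchScore;
        end
        F(i+1,j+1) = max([F(i,j) + s, ...             % align a(i) with b(j)
                          F(i,j+1) - gapPenalty, ...  % a(i) against a gap
                          F(i+1,j) - gapPenalty]);    % b(j) against a gap
    end
end
score = F(end,end)
```

The optimal global alignment here pairs ACGT with A-GT: three matches and one gap, for a score of 2. Because every position of both sequences must be accounted for, gaps at the ends are penalized as well.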
The alignment is very good except for the terminal segments. For instance, notice the sparse matched
pairs in the first positions. This occurs because a global alignment attempts to force the matching all
the way to the ends, and there is a point where the penalty for opening new gaps is comparable to the
score of matching residues. In some cases it is desirable to remove the gap penalty added at the ends
of a global alignment; this allows you to better match this pair of sequences. This technique is
commonly known as 'semi-global' alignment or 'glocal' alignment.
score = 1.0413e+03
Another way to refine your alignment is by using only the protein sequences. Notice that the aligned
region is delimited by start ( M-methionine ) and stop ( * ) amino acids in the sequences. If the
sequence is shortened so that only the translated regions are considered, then it seems likely that you
will get a better alignment. Use the find command to look for the index of the start amino acid in
each sequence:
humanStart = 70
mouseStart = 11
Similarly, use the find command to look for the index of the first stop occurring after the start of the
translation. Special care needs to be taken because there is also a stop at the very beginning of the
humanProtein sequence.
humanStop = 599
mouseStop = 539
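The two find steps can be sketched as follows on a short, hypothetical translation that, like humanProtein, contains a stop before the first methionine. The variable names are illustrative.

```matlab
protein = '*LK*PMAQWT*RM*';    % hypothetical translation; '*' marks a stop

startIdx = find(protein == 'M', 1);                % first methionine
% Search for the stop only from the start position onward, then convert
% the relative index back to an index into the full sequence
stopIdx  = find(protein(startIdx:end) == '*', 1) + startIdx - 1;
trimmed  = protein(startIdx:stopIdx)
```

Restricting the search to protein(startIdx:end) is what skips the stops that occur before the start of translation.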
humanSeq = humanProtein(humanStart:humanStop);
humanSeqFormatted = seqdisp(humanSeq)
mouseSeq = mouseProtein(mouseStart:mouseStop);
mouseSeqFormatted = seqdisp(mouseSeq)
score = 1.0423e+03
Open reading frame information is also available from the output of the seqshoworfs command, but
the indices are based on the nucleotide sequences. Use these indices to trim the original nucleotide
sequences and then translate them to amino acids.
humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(1).Start(1):humanORFs(1).Stop(1)));
mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):mouseORFs(1).Stop(1)));
[score, ORFAlignment] = nwalign(humanPORF,mousePORF)
score = 1042
Alternatively, you can use the coding region information (CDS) from the GenBank data structure to
find the coding region of the genes.
idx = humanHEXA.CDS.indices;
humanCodingRegion = humanHEXA.Sequence(idx(1):idx(2));
idx = mouseHEXA.CDS.indices;
mouseCodingRegion = mouseHEXA.Sequence(idx(1):idx(2));
You can also get the translation of the coding regions from this structure.
humanTranslatedRegion = humanHEXA.CDS.translation;
mouseTranslatedRegion = mouseHEXA.CDS.translation;
Local Alignment
Instead of truncating the sequences to look for better alignment, an alternative approach is to use a
local alignment. The function swalign performs local alignment using the Smith-Waterman
algorithm. This shows a very good alignment for the whole coding region and reasonable similarity
for a few residues beyond both ends of the gene.
score = 1057
All the sequence alignment functions provided in MATLAB can be customized. For example, by
modifying the rows and columns of a scoring matrix you can align sequences by complement and not
by identity. In this case you can reorder the NUC44 scoring matrix; a positive score is given for
complements while a negative score is given otherwise. The first 30 nucleotides from the mouse
HEXA gene will be aligned to its complement.
4 3 2 1 6 5 8 7 9 10 14 13 12 11 15
Mc = M(:,map)
Mc = 15×15
-4 -4 -4 5 -4 1 1 -4 -4 1 -1 -1 -1 -4 -2
-4 -4 5 -4 1 -4 1 -4 1 -4 -1 -1 -4 -1 -2
-4 5 -4 -4 -4 1 -4 1 1 -4 -1 -4 -1 -1 -2
5 -4 -4 -4 1 -4 -4 1 -4 1 -4 -1 -1 -1 -2
-4 1 -4 1 -4 -1 -2 -2 -2 -2 -1 -3 -1 -3 -1
1 -4 1 -4 -1 -4 -2 -2 -2 -2 -3 -1 -3 -1 -1
1 1 -4 -4 -2 -2 -4 -1 -2 -2 -3 -3 -1 -1 -1
-4 -4 1 1 -2 -2 -1 -4 -2 -2 -1 -1 -3 -3 -1
-4 1 1 -4 -2 -2 -2 -2 -1 -4 -1 -3 -3 -1 -1
1 -4 -4 1 -2 -2 -2 -2 -4 -1 -3 -1 -1 -3 -1
⋮
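The column reordering can be illustrated on a reduced four-letter alphabet. The sketch below builds a small hypothetical match/mismatch matrix (+5 and -4, values chosen for illustration) instead of loading NUC44, and scores a sequence against its complement position by position rather than running an alignment.

```matlab
alphabet = 'ACGT';
M   = 5*eye(4) - 4*(1 - eye(4));   % +5 for identical bases, -4 otherwise
map = [4 3 2 1];                   % A<->T, C<->G
Mc  = M(:, map);                   % reorder columns so that complements score +5

seq = 'ACGTTGCA';                  % hypothetical sequence
[~, si] = ismember(seq, alphabet); % index of each base in the alphabet
ci   = map(si);                    % index of each complementary base
comp = alphabet(ci);               % complement of seq, not reversed

% Ungapped, position-by-position score of seq against its complement
score = sum(Mc(sub2ind(size(Mc), si, ci)))
```

Every position pairs a base with its complement and therefore scores +5, giving 40 for eight nucleotides. By the same reasoning, 30 nucleotides at 5 points each would give the score of 150 reported for the mouse HEXA fragment.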
score = 150
close all;
Assessing the Significance of an Alignment
This example shows a method that can be used to investigate the significance of sequence
alignments. The number of identities or positives in an alignment is not a clear indicator of a
significant alignment. A permutation of a sequence from an alignment will have similar percentages
of positives and identities when aligned against the original sequence. The score from an alignment is
a better indicator of the significance of an alignment. This example uses the same Tay-Sachs disease
related genes and proteins analyzed in “Aligning Pairs of Sequences” on page 3-177.
In this example, you will work directly with protein data so use getgenpept instead of getgenbank
to download the data from the NCBI site. First read the human protein information into MATLAB®.
humanProtein = getgenpept('NP_000511');
Results from a BLASTX search performed with this sequence showed that a Drosophila protein,
GenPept accession number AAM29423, has some similarity to the human HEXA sequence. Use
getgenpept to download this sequence.
flyProtein = getgenpept('AAM29423');
For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in
public repositories is frequently curated and updated; therefore the results of this example might be
slightly different when you use up-to-date datasets.
load('flyandhumanproteins.mat','humanProtein','flyProtein')
seqdisp(humanProtein)
seqdisp(flyProtein)
The first thing to do is to use seqdotplot to see if there are any areas that are clearly aligned. This
doesn't show any obvious alignments, but there are some areas of interest.
seqdotplot(humanProtein,flyProtein,3,2)
title('Dot Plot of Two HexA-like Proteins');
ylabel('Human Protein');xlabel('Drosophila Protein');
Notice that there are a few diagonal stretches in the dot plot. This is not particularly good evidence of
a significant global alignment, but you can try a global alignment using the function nwalign. The
BLOSUM50 scoring matrix is used by default.
[sc50,globAlig50] = nwalign(humanProtein,flyProtein)
fprintf('Score = %g \n',sc50)
sc50 =
49.6667
globAlig50 =
'MT-S-S--R----LW----F--SLL-----LA-A-AF--A-GR------ATAL-WP----W--P-QN---FQT-----SDQR--Y-------
'|: : | | | | ::| :: | |: | | |::: | | :| ::: | :| :
'MSLAVSLRRALLVLLTGAIFILTVLYWNQGVTKAQAYNEALERPHSHHDASGFPIPVEKSWTYKCENDRCMRVGHHGKSAKRVSFISCSMTC
Score = 49.6667
The sequence similarity is fairly low, so BLOSUM30 might be a more appropriate scoring matrix.
[sc30,globAlig30] = nwalign(humanProtein,flyProtein,'scoringmatrix','blosum30')
fprintf('Score = %g \n',sc30)
sc30 =
82
globAlig30 =
'MT-S-S--R-----L-W--F--S-LL----L--AAAF--A-GR------ATAL-WP----W--P-QN-F--QT-----SDQR--Y-------
'|: : | | | : | : |: : |:|: | | |::: | | : :| :: |::| :
'MSLAVSLRRALLVLLTGAIFILTVLYWNQGVTKAQAYNEALERPHSHHDASGFPIPVEKSWTYKCENDRCMRVGHHGKSAKRVSFISCSMTC
Score = 82
This gives an alignment that has some areas of fairly strong similarity, but is this alignment
statistically significant? One way to investigate whether this score is significant is to use Monte Carlo
techniques. Given that the fly sequence was found using a BLAST search, there is some evidence that
there is similarity between the two sequences. It is reasonable to expect the score for this alignment
to be higher than the scores obtained from aligning random sequences of amino acids to the protein.
To assess if the score is significant the first step is to make some random sequences that are similar
to that of the fly protein. One way to do this is to take random permutations of the fly sequence. This
can be done with the randperm function. Then calculate the global alignment of these random
sequences against the human protein and look at the statistical significance of the scores.
Initialize the state of the default random number generators to ensure that the figures and results
generated match the ones in the HTML version of this example.
rng(0,'twister')
n = 50;
globalscores = zeros(n,1);
flyLen = length(flyProtein.Sequence);
for i = 1:n
perm = randperm(flyLen);
permutedSequence = flyProtein.Sequence(perm);
globalscores(i) = nwalign(humanProtein,permutedSequence,'scoringmatrix','blosum30');
end
Now plot the scores as a bar chart. Note that because you are using randomly generated sequences,
the plot changes if the random number generator is initialized differently.
figure
buckets = ceil(n/5);
hist(globalscores,buckets)
hold on;
stem(sc30,1,'k')
title('Determining Alignment Significance using Monte Carlo Techniques');
xlabel('Score'); ylabel('Number of Sequences');
The scores of the alignments to the random sequences can be approximated by the type 1 extreme
value distribution. Use the evfit function from the Statistics and Machine Learning Toolbox™ to
estimate the parameters of this distribution.
parmhat = evfit(globalscores)
parmhat =
-31.7597 6.6440
x = min(globalscores):max([globalscores;sc30]);
y = evpdf(x,parmhat(1),parmhat(2));
[v, c] = hist(globalscores,buckets);
binWidth = c(2) - c(1);
scaleFactor = n*binWidth;
plot(x,scaleFactor*y,'r');
hold off;
From this plot you can see that the global alignment (globAlig30) is clearly statistically significant.
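As a numerical complement to the plot, you can estimate an empirical p-value directly from the Monte Carlo scores. The following is a minimal sketch, not part of the original example: the variable names nullScores and observedScore and the toy values are assumptions standing in for globalscores and sc30, and the add-one correction is one common convention rather than the only choice.

```matlab
% Toy stand-ins for the variables in the example; replace them with the
% real globalscores vector and sc30 value when running the full workflow.
nullScores = [-40; -35; -31; -28; -25; -22; -20; -18; -15; -10];
observedScore = 82;

% Count how many null scores are at least as large as the observed score.
numAtLeast = sum(nullScores >= observedScore);

% The add-one correction keeps the estimate away from exactly zero.
pEmpirical = (numAtLeast + 1) / (numel(nullScores) + 1);
fprintf('Empirical p-value estimate: %.4f\n', pEmpirical)
```

With only a handful of permutations the estimate cannot go below 1/(n+1), so its resolution is limited by the number of random sequences you generate.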
On the FlyBase website you can search for all Drosophila beta-N-acetylhexosaminidase genes. The gene
that you have been looking at so far is referenced as CG8824. Now take a look at another
similar gene, for instance Hexo1.
flyHexo1 = getgenpept('AAL28566');
The fly Hexo1 amino acid sequence is also provided in the MAT-file flyandhumanproteins.mat.
load('flyandhumanproteins.mat','flyHexo1')
seqdisp(humanProtein)
Repeat the process of generating a global alignment and then using random permutations of the
amino acids to estimate the significance of the global alignment.
[Hexo1score,Hexo1Alignment] = nwalign(humanProtein,flyHexo1,'scoringmatrix','blosum30')
fprintf('Score = %g \n',Hexo1score)
Hexo1globalscores = zeros(n,1);
flyLen = length(flyHexo1.Sequence);
for i = 1:n
perm = randperm(flyLen);
permutedSequence = flyHexo1.Sequence(perm);
Hexo1globalscores(i) = nwalign(humanProtein,permutedSequence,'scoringmatrix','blosum30');
end
Hexo1score =
-72.2000
Hexo1Alignment =
'MTSSRL-WFSLLLAAAFA-GRATALWPWPQNFQTSDQRYVLYPNNFQFQYDVSSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTL
'|: :| | :::: : ::: :| ::: : :|| | | : : : ||::::: | | :: |:: | : : | :: | :|
'MALVKLNTFHWHITDSHSFPLEVKKRPELHKLGAYSQRQV-Y--T-R-R-DVAEVVEYG-RV--RGI-RVMP-EF-D-A-PAHVGEGWQH--
Score = -72.2
Plot the scores, calculate the parameters of the distribution and overlay the PDF on the bar chart.
figure
buckets = ceil(n/5);
hist(Hexo1globalscores,buckets)
title('Determining Alignment Significance using Monte Carlo Techniques');
xlabel('Score');
ylabel('Number of Sequences');
hold on;
stem(Hexo1score,1,'c')
parmhat = evfit(Hexo1globalscores)
x = min(Hexo1globalscores):max([Hexo1globalscores;Hexo1score]);
y = evpdf(x,parmhat(1),parmhat(2));
[v, c] = hist(Hexo1globalscores,buckets);
binWidth = c(2) - c(1);
scaleFactor = n*binWidth;
plot(x,scaleFactor*y,'r');
hold off;
parmhat =
-70.6926 7.0619
In this case it appears that the alignment is not statistically significant. Higher scoring alignments
can easily be generated from a random permutation of the amino acids in the sequence. You can
calculate an approximate p-value from the estimated extreme value CDF. However, far more than 50
random permutations are needed to get a reliable estimate of the extreme value pdf parameters from
which to calculate a reasonably accurate p-value.
p = 1 - evcdf(Hexo1score,parmhat(1),parmhat(2))
p =
0.4458
One thing to notice is that the lengths of the two sequences are very different. The human HEXA
protein is 529 residues long and the fly Hexo1 protein is only 383 residues in length. When you try to
align these two sequences globally, this difference in length means that a large number of gaps have to
be introduced into the alignment. As a result, the significance of the scores depends heavily
on the GAPOPEN and EXTENDGAP parameters. (See the help for nwalign for more details.)
Instead of using global alignment, in this case a better approach might be to look at the local
alignment between the two sequences.
You will now repeat the process of estimating the significance of an alignment, this time using local
alignment and a slightly different method of generating the random sequences. Instead of simply
permuting the letters in the sequence, an alternative is to draw a sequence from a multinomial
distribution that is estimated from the fly protein sequence. You can do this using the aacount and
randseq functions; the first estimates the amino acid frequencies of the query sequence and the
latter randomly creates new sequences based on this distribution.
[lscore,locAlig] = swalign(humanProtein,flyHexo1,'scoringmatrix','blosum30')
fprintf('Score = %g \n',lscore)
localscores = zeros(n,1);
aas = aacount(flyHexo1);
for i = 1:n
randProtein = randseq(flyLen,'FROMSTRUCTURE',aas);
localscores(i) = swalign(humanProtein,randProtein,'scoringmatrix','blosum30');
end
lscore =
152
locAlig =
'MAYNKLNVFHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEYARLRGIRVLAEFDTPGHT-LSWG-PGIPGLL-TPCYSGS
'||: |||:||||::|::|||:| |||:: |:|:: ::|| :||:||:||:|:|||||: |||:|:|: :| ::::::: :: :::
'MALVKLNTFHWHITDSHSFPLEVKKRPELHKLGAYSQR-QVYTRRDVAEVVEYGRVRGIRVMPEFDAPAHVGEGWQHKNMTACFNAQPWKSF
Score = 152
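To make the multinomial sampling step described above more concrete, here is a minimal sketch that uses only base MATLAB functions instead of aacount and randseq. It is an illustration under stated assumptions, not the toolbox implementation: refSeq is a short prefix of the fly sequence as shown in the alignment above, and the names alphabet, probs, and randomSeq are introduced only for this sketch.

```matlab
% Short stand-in for flyHexo1.Sequence (prefix taken from the alignment above).
refSeq = 'MALVKLNTFHWHITDSHSFPLEVKKRPELHKLGAYSQR';

% Estimate residue frequencies from the reference sequence.
alphabet = unique(refSeq);                       % residues that occur
counts = arrayfun(@(a) sum(refSeq == a), alphabet);
probs = counts / sum(counts);                    % estimated probabilities

% Draw a new sequence of the same length from those frequencies.
rng(0,'twister')                                 % reproducible draw
edges = [0 cumsum(probs)];
edges(end) = 1;                                  % guard against round-off
idx = discretize(rand(1,numel(refSeq)), edges);  % bin index per residue
randomSeq = alphabet(idx);
```

Unlike a permutation, a sequence drawn this way does not preserve the exact residue counts of the original, only their expected proportions.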
Plot the scores, calculate the parameters of the distribution and overlay the PDF on the bar chart.
figure
hist(localscores,buckets)
title('Determining Alignment Significance using Monte Carlo Techniques');
xlabel('Score');
ylabel('Number of Sequences');
hold on;
stem(lscore,1,'r')
parmhat = evfit(localscores)
x = min(localscores):max([localscores;lscore]);
y = evpdf(x,parmhat(1),parmhat(2));
[v, c] = hist(localscores,buckets);
binWidth = c(2) - c(1);
scaleFactor = n*binWidth;
plot(x,scaleFactor*y,'r');
hold off;
parmhat =
40.8331 3.9312
You might like to experiment to see if there are significant differences in the distribution of scores
generated with randperm and randseq.
With the local alignment it appears that the alignment is statistically significant. In fact, looking at
the local alignment shows a very good alignment for the full length of the Hexo1 sequence.
close all;
This example shows how to handle scoring matrices with the sequence alignment tools. The example
uses proteins associated with retinoblastoma, a disease caused by a tumor that develops from the
immature retina.
More information on retinoblastoma can be found at the Genes and diseases section of the NCBI web
site.
The "BLink" link on this page shows related sequences in different organisms. These links can change
frequently, so for this example you can load a set of previously saved data from a MAT-file.
load retinoblastoma
You can also use the getgenpept function to retrieve the sequence information from the NCBI data
repository and load it into MATLAB.
human = getgenpept('AAA69808','SequenceOnly',true);
chicken = getgenpept('NP_989750','SequenceOnly',true);
trout = getgenpept('AAD13390','SequenceOnly',true);
xenopus = getgenpept('A44879','SequenceOnly',true);
One approach to study the relationship between two of these proteins, here human and chicken, is to
use a global alignment with the nwalign function.
[sc,hvc] = nwalign(human,chicken)
sc = 1.4543e+03
In this alignment the function used the default scoring matrix, BLOSUM50. Different scoring matrices
can give different alignments. How can you find the best alignment? One approach is to try different
scoring matrices and look for the highest score. When the score from the alignment functions is in
the same scale (in this case, bits) you can compare different alignments to see which gives the
highest score.
This example uses the PAM family of matrices, though the approach used could also be used with the
BLOSUM family of scoring matrices. The PAM family of matrices in the Bioinformatics Toolbox™
consists of 50 matrices, PAM10, PAM20,..., PAM490, PAM500.
Take the human and chicken sequences and align them with each member of the PAM
family and then look for the highest score.
score = zeros(1,50);
fprintf('Trying different PAM matrices ')
Using Scoring Matrices to Measure Evolutionary Distance
..................................................
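The loop that prints the row of dots is not shown in the listing above. A minimal sketch of what such a loop could look like follows. It assumes that pam returns the scoring matrix together with an information structure that has a Scale field, and that passing this value through the 'Scale' option of nwalign puts the scores on a common scale; the variable names pamNumber, matrix, and info are introduced only for this sketch.

```matlab
% Load the previously saved sequences (human, chicken, and others).
load retinoblastoma

score = zeros(1,50);
for step = 1:50
    pamNumber = step*10;                 % PAM10, PAM20, ..., PAM500
    [matrix,info] = pam(pamNumber);      % scoring matrix and its metadata
    % Pass the matrix scale factor so that scores are comparable.
    score(step) = nwalign(human,chicken, ...
        'ScoringMatrix',matrix,'Scale',info.Scale);
    fprintf('.')                         % progress indicator
end
fprintf('\n')
```

The same loop can be reused for other pairs of sequences by changing the second argument of nwalign and the name of the result vector.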
You can use the plot function to create a graph of the results.
x = 10:10:500;
plot(x,score)
legend('Human vs. Chicken');
title('Global Alignment Scores for Different PAM Scoring Matrices');
xlabel('PAM matrix');ylabel('Score (bits)');
You can use max with two outputs to find the highest score and the index in the results vector where
the highest value occurred. In this case the highest score occurred with the third matrix, that is
PAM30.
bestScore = 2.2605e+03
idx = 3
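The call that produces these two values is not shown. As a small illustration of using max with two outputs, the sketch below uses a toy score curve (an assumption, not the real alignment scores) that peaks at the third entry, and maps the index back to a PAM number through the x vector.

```matlab
x = 10:10:500;                    % PAM numbers, as in the plot above
score = -(x - 30).^2;             % toy curve that peaks at PAM30

[bestScore,idx] = max(score);     % highest value and its position
bestPAM = x(idx);                 % translate the index to a PAM number
fprintf('Best score %g with PAM%d\n',bestScore,bestPAM)
```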
xenopusScore = zeros(1,50);
troutScore = zeros(1,50);
fprintf('Trying different PAM matrices ')
..................................................
You can use the command hold on to tell MATLAB® to add new plots to the existing figure. Once
you have finished doing this you must remember to disable this feature by using hold off.
hold on
plot(x,xenopusScore,'g')
plot(x,troutScore,'r')
legend({'Human vs. Chicken','Human vs. Xenopus','Human vs. Trout'});box on
title('Global Alignment Scores for Different PAM Scoring Matrices');
xlabel('PAM matrix');ylabel('Score (bits)');
hold off
You will see that different matrices give the highest scores for the different organisms. For human
and xenopus, the best score is with PAM40 and for human and trout the best score is PAM50.
bestXScore = 1607
Xidx = 4
bestTScore = 1484
Tidx = 5
The PAM scoring matrix giving the best alignment for two sequences is an indicator of the relative
evolutionary interval since the organisms diverged: The smaller the PAM number, the more closely
related the organisms. Since organisms, and protein families across organisms, evolve at widely
varying rates, there is no simple correlation between PAM distance and evolutionary time. However,
for an analysis of a specific protein family across multiple species, the corresponding PAM matrices
will provide a relative evolutionary distance between the species and allow accurate phylogenetic
mapping. In this example, the results indicate that the human sequence is more closely related to the
chicken sequence than to the frog sequence, which in turn is more closely related than the trout
sequence.
This example shows the interoperability between MATLAB® and Bioperl - passing arguments from
MATLAB to Perl scripts and pulling BLAST search data back to MATLAB.
NOTE: Perl and the Bioperl modules must be installed to run the Perl scripts in this example. Since
version 1.4, Bioperl modules have a warnings.pm dependency requiring at least version 5.6 of Perl. If
you have difficulty running the Perl scripts, make sure your PERL5LIB environment variable includes
the path to your Bioperl installation or try running from the Bioperl installation directory. See the
links at https://www.perl.com and https://bioperl.org/ for current release files and complete
installation instructions.
Introduction
Gleevec™ (STI571 or imatinib mesylate) was the first approved drug to specifically turn off the signal
of a known cancer-causing protein. Initially approved to treat chronic myelogenous leukemia (CML),
it is also effective for treatment of gastrointestinal stromal tumors (GIST).
Research has identified several gene targets for Gleevec including: Proto-oncogene tyrosine-protein
kinase ABL1 (NP_009297), Proto-oncogene tyrosine-protein kinase Kit (NP_000213), and Platelet-
derived growth factor receptor alpha precursor (NP_006197).
target_ABL1 = 'NP_009297';
target_Kit = 'NP_000213';
target_PDGFRA = 'NP_006197';
You can load the sequence information for these proteins from local GenPept text files using
genpeptread.
Alternatively, you can obtain protein information directly from the online GenPept database
maintained by the National Center for Biotechnology Information (NCBI).
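The retrieval commands are not shown above. One way to fetch the sequences online is sketched below, using getgenpept with the 'SequenceOnly' option as in the retinoblastoma example earlier in this chapter. The variable names match those passed to whos in the next step; the calls need network access, and the returned sequences can change as the database is updated.

```matlab
% Accession numbers defined earlier in this example.
target_ABL1 = 'NP_009297';
target_Kit = 'NP_000213';
target_PDGFRA = 'NP_006197';

% 'SequenceOnly',true returns just the amino acid sequence as text.
ABL1_seq = getgenpept(target_ABL1,'SequenceOnly',true);
Kit_seq = getgenpept(target_Kit,'SequenceOnly',true);
PDGFRA_seq = getgenpept(target_PDGFRA,'SequenceOnly',true);
```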
The MATLAB whos command gives information about the size of these sequences.
whos ABL1_seq
whos Kit_seq
whos PDGFRA_seq
Calling Bioperl Functions from MATLAB
From MATLAB, you can harness existing Bioperl modules to run a BLAST search on these sequences.
MW_BLAST.pl is a Perl program based on the RemoteBlast Bioperl module. It reads sequences from
FASTA files, so start by creating a FASTA file for each sequence.
Warning: ABL1.fa already exists. The data will be appended to the file.
Warning: Kit.fa already exists. The data will be appended to the file.
Warning: PDGFRA.fa already exists. The data will be appended to the file.
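The warnings above appear because the FASTA files already existed when the example was run, so the data was appended. A minimal sketch of writing a FASTA file without appending follows. It assumes the fastawrite(File,Header,Sequence) calling form; the placeholder sequence, the header text, and the file name ABL1_sketch.fa are hypothetical and are used only so that the sketch runs on its own.

```matlab
% Placeholder so the sketch is self-contained; in the example this would be
% the ABL1_seq variable. The letters below are NOT the real ABL1 sequence.
ABL1_seq = 'MALVKLNTFHWHITDSHSF';

fastaFile = fullfile(tempdir,'ABL1_sketch.fa');
if exist(fastaFile,'file')
    delete(fastaFile)             % avoid appending to an older copy
end

fastawrite(fastaFile,'ABL1 (NP_009297)',ABL1_seq);
check = fastaread(fastaFile);     % read the file back to confirm its contents
```

Deleting an existing file first is a design choice for repeatable runs; skip that step if you intend to collect several sequences in one file.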
BLAST searches can take a long time to return results, and the Perl program MW_BLAST includes a
repeating sleep state to await the report. Sample results have been included with this example, but if
you want to try running the BLAST search with the three sequences, uncomment the following
commands. MW_BLAST.pl will save the BLAST results in three files on your disk, ABL1.out, Kit.out
and PDGFRA.out. The process can take 15 minutes or more.
% try
% perl('MW_BLAST.pl','blastp','pdb','1e-10','ABL1.fa','Kit.fa','PDGFRA.fa');
% catch
% error(message('bioinfo:bioperldemo:PerlError'))
% end
type MW_BLAST.pl
#!/usr/bin/perl -w
use Bio::Tools::Run::RemoteBlast;
use strict;
use 5.006;
The next step is to parse the output reports and find scores >= 100. You can then identify hits found
by more than one protein for further research, possibly identifying new targets for drug therapy.
try
protein_list = perl('MW_parse.pl', which('ABL1.out'), which('Kit.out'), which('PDGFRA.out'))
catch
error(message('bioinfo:bioperldemo:PerlError'))
end
protein_list =
'
/home/Data/ABL1.out
1OPL, 2584, 0.0, Chain A, Structural Basis For The Auto-Inhibition Of C-Abl...
1FMK, 923, 1e-100, Crystal Structure Of Human Tyrosine-Protein Kinase C-Src p...
1QCF, 919, 1e-100, Chain A, Crystal Structure Of Hck In Complex With A Src Fa...
1KSW, 916, 1e-100, Chain A, Structure Of Human C-Src Tyrosine Kinase (Thr338g...
1AD5, 883, 6e-96, Chain A, Src Family Kinase Hck-Amp-Pnp Complex pdb|1AD5|B ...
2ABL, 866, 5e-94, Sh3-Sh2 Domain Fragment Of Human Bcr-Abl Tyrosine Kinase
3LCK, 666, 9e-71, The Kinase Domain Of Human Lymphocyte Kinase (Lck), Activa...
1QPE, 666, 9e-71, Chain A, Structural Analysis Of The Lymphocyte-Specific Ki...
1QPD, 656, 1e-69, Chain A, Structural Analysis Of The Lymphocyte-Specific Ki...
1K2P, 620, 2e-65, Chain A, Crystal Structure Of Bruton's Tyrosine Kinase Dom...
1BYG, 592, 3e-62, Chain A, Kinase Domain Of Human C-Terminal Src Kinase (Csk...
1M7N, 561, 1e-58, Chain A, Crystal Structure Of Unactivated Apo Insulin-Like...
1JQH, 560, 2e-58, Chain A, Igf-1 Receptor Kinase Domain pdb|1JQH|B Chain B, ...
1P4O, 560, 2e-58, Chain A, Structure Of Apo Unactivated Igf-1r Kinase Domain...
1K3A, 553, 1e-57, Chain A, Structure Of The Insulin-Like Growth Factor 1 Rec...
1GJO, 550, 2e-57, Chain A, The Fgfr2 Tyrosine Kinase Domain
1FVR, 540, 3e-56, Chain A, Tie2 Kinase Domain pdb|1FVR|B Chain B, Tie2 Kinas...
1AB2, 528, 9e-55, Proto-Oncogene Tyrosine Kinase (E.C.2.7.1.112) (Src Homolo...
1IRK, 525, 2e-54, Insulin Receptor (Tyrosine Kinase Domain) Mutant With Cys ...
1I44, 523, 3e-54, Chain A, Crystallographic Studies Of An Activation Loop Mu...
1IR3, 522, 4e-54, Chain A, Phosphorylated Insulin Receptor Tyrosine Kinase I...
1FGK, 522, 4e-54, Chain A, Crystal Structure Of The Tyrosine Kinase Domain O...
1P14, 521, 6e-54, Chain A, Crystal Structure Of A Catalytic-Loop Mutant Of T...
1M14, 496, 4e-51, Chain A, Tyrosine Kinase Domain From Epidermal Growth Fact...
1PKG, 496, 4e-51, Chain A, Structure Of A C-Kit Kinase Product Complex pdb|1...
1VR2, 463, 3e-47, Chain A, Human Vascular Endothelial Growth Factor Receptor...
1JU5, 330, 8e-32, Chain C, Ternary Complex Of An Crk Sh2 Domain, Crk-Derived...
1BBZ, 317, 3e-30, Chain A, Crystal Structure Of The Abl-Sh3 Domain Complexed...
1AWO, 303, 1e-28, The Solution Nmr Structure Of Abl Sh3 And Its Relationship...
1BBZ, 303, 1e-28, Chain E, Crystal Structure Of The Abl-Sh3 Domain Complexed...
1G83, 287, 8e-27, Chain A, Crystal Structure Of Fyn Sh3-Sh2 pdb|1G83|B Chain...
1LCK, 270, 7e-25, Chain A, Sh3-Sh2 Domain Fragment Of Human P56-Lck Tyrosine...
1MUO, 233, 1e-20, Chain A, Crystal Structure Of Aurora-2, An Oncogenic Serin...
1GRI, 232, 2e-20, Chain A, Grb2 pdb|1GRI|B Chain B, Grb2
1A9U, 220, 4e-19, The Complex Structure Of The Map Kinase P38SB203580 pdb|1B...
1BMK, 213, 3e-18, Chain A, The Complex Structure Of The Map Kinase P38SB2186...
1IAN, 209, 8e-18, Human P38 Map Kinase Inhibitor Complex
1GZ8, 208, 1e-17, Chain A, Human Cyclin Dependent Kinase 2 Complexed With Th...
1OVE, 208, 1e-17, Chain A, The Structure Of P38 Alpha In Complex With A Dihy...
1OIT, 207, 1e-17, Chain A, Imidazopyridines: A Potent And Selective Class Of...
1B38, 206, 2e-17, Chain A, Human Cyclin-Dependent Kinase 2 pdb|1B39|A Chain ...
1OGU, 206, 2e-17, Chain A, Structure Of Human Thr160-Phospho Cdk2CYCLIN A CO...
1E9H, 206, 2e-17, Chain A, Thr 160 Phosphorylated Cdk2 - Human Cyclin A3 Com...
1JST, 206, 2e-17, Chain A, Phosphorylated Cyclin-Dependent Kinase-2 Bound To...
1WFC, 206, 2e-17, Structure Of Apo, Unphosphorylated, P38 Mitogen Activated ...
1QMZ, 206, 2e-17, Chain A, Phosphorylated Cdk2-Cyclyin A-Substrate Peptide C...
1DI8, 206, 2e-17, Chain A, The Structure Of Cyclin-Dependent Kinase 2 (Cdk2)...
1H1P, 206, 2e-17, Chain A, Structure Of Human Thr160-Phospho Cdk2CYCLIN A CO...
1DI9, 205, 2e-17, Chain A, The Structure Of P38 Mitogen-Activated Protein Ki...
1H4L, 202, 5e-17, Chain A, Structure And Regulation Of The Cdk5-P25(Nck5a) C...
---------------------------------------------------
/home/Data/Kit.out
1PKG, 974, 1e-106, Chain A, Structure Of A C-Kit Kinase Product Complex pdb|1...
1VR2, 805, 6e-87, Chain A, Human Vascular Endothelial Growth Factor Receptor...
1GJO, 730, 3e-78, Chain A, The Fgfr2 Tyrosine Kinase Domain
1FGK, 700, 8e-75, Chain A, Crystal Structure Of The Tyrosine Kinase Domain O...
1OPL, 410, 4e-41, Chain A, Structural Basis For The Auto-Inhibition Of C-Abl...
1FVR, 405, 1e-40, Chain A, Tie2 Kinase Domain pdb|1FVR|B Chain B, Tie2 Kinas...
1M7N, 383, 5e-38, Chain A, Crystal Structure Of Unactivated Apo Insulin-Like...
1P4O, 383, 5e-38, Chain A, Structure Of Apo Unactivated Igf-1r Kinase Domain...
1JQH, 381, 8e-38, Chain A, Igf-1 Receptor Kinase Domain pdb|1JQH|B Chain B, ...
1QCF, 377, 2e-37, Chain A, Crystal Structure Of Hck In Complex With A Src Fa...
1K3A, 371, 1e-36, Chain A, Structure Of The Insulin-Like Growth Factor 1 Rec...
/home/Data/PDGFRA.out
1PKG, 625, 5e-66, Chain A, Structure Of A C-Kit Kinase Product Complex pdb|1...
1VR2, 550, 2e-57, Chain A, Human Vascular Endothelial Growth Factor Receptor...
1FGI, 500, 1e-51, Chain A, Crystal Structure Of The Tyrosine Kinase Domain O...
1GJO, 492, 1e-50, Chain A, The Fgfr2 Tyrosine Kinase Domain
1FVR, 419, 4e-42, Chain A, Tie2 Kinase Domain pdb|1FVR|B Chain B, Tie2 Kinas...
1QCF, 380, 1e-37, Chain A, Crystal Structure Of Hck In Complex With A Src Fa...
1QPE, 364, 9e-36, Chain A, Structural Analysis Of The Lymphocyte-Specific Ki...
1QPD, 364, 9e-36, Chain A, Structural Analysis Of The Lymphocyte-Specific Ki...
3LCK, 360, 2e-35, The Kinase Domain Of Human Lymphocyte Kinase (Lck), Activa...
1OPL, 358, 4e-35, Chain A, Structural Basis For The Auto-Inhibition Of C-Abl...
1FMK, 354, 1e-34, Crystal Structure Of Human Tyrosine-Protein Kinase C-Src p...
1KSW, 353, 2e-34, Chain A, Structure Of Human C-Src Tyrosine Kinase (Thr338g...
1AD5, 353, 2e-34, Chain A, Src Family Kinase Hck-Amp-Pnp Complex pdb|1AD5|B ...
1BYG, 352, 2e-34, Chain A, Kinase Domain Of Human C-Terminal Src Kinase (Csk...
1I44, 351, 3e-34, Chain A, Crystallographic Studies Of An Activation Loop Mu...
1IRK, 350, 4e-34, Insulin Receptor (Tyrosine Kinase Domain) Mutant With Cys ...
1M7N, 349, 5e-34, Chain A, Crystal Structure Of Unactivated Apo Insulin-Like...
1JQH, 349, 5e-34, Chain A, Igf-1 Receptor Kinase Domain pdb|1JQH|B Chain B, ...
1P4O, 349, 5e-34, Chain A, Structure Of Apo Unactivated Igf-1r Kinase Domain...
1P14, 344, 2e-33, Chain A, Crystal Structure Of A Catalytic-Loop Mutant Of T...
1IR3, 343, 2e-33, Chain A, Phosphorylated Insulin Receptor Tyrosine Kinase I...
1K3A, 338, 9e-33, Chain A, Structure Of The Insulin-Like Growth Factor 1 Rec...
1M14, 332, 4e-32, Chain A, Tyrosine Kinase Domain From Epidermal Growth Fact...
1K2P, 315, 4e-30, Chain A, Crystal Structure Of Bruton's Tyrosine Kinase Dom...
1PME, 167, 6e-13, Structure Of Penta Mutant Human Erk2 Map Kinase Complexed ...
1JOW, 155, 1e-11, Chain B, Crystal Structure Of A Complex Of Human Cdk6 And ...
1BI8, 155, 1e-11, Chain A, Mechanism Of G1 Cyclin Dependent Kinase Inhibitio...
1F3M, 150, 6e-11, Chain C, Crystal Structure Of Human SerineTHREONINE KINASE...
'
type MW_parse.pl
#!/usr/bin/perl
use Bio::SearchIO;
use strict;
use 5.006;
# A sample BLAST parsing program based on the SearchIO.pm Bioperl module. Takes
# a list of BLAST report files and prints a list of the top hits from each
# report based on an arbitrary minimum score.
# Set a cutoff value for the raw score; the walkthrough above uses scores >= 100.
my $min_score = 100;
# Take each report name and print information about the top hits.
my $seq_count = 0;
while ( defined($ARGV[0])) {
    my $breport = shift @ARGV;
    print "\n$breport\n";
    my $in = new Bio::SearchIO(-format => 'blast',
                               -file   => $breport);
    my $num_hit = 0;
    my $short_desc;
    while ( my $result = $in->next_result) {
        while ( my $curr_hit = $result->next_hit ) {
            if ( $curr_hit->raw_score >= $min_score ) {
                if (length($curr_hit->description) >= 60) {
                    $short_desc = substr($curr_hit->description, 0, 58)."...";
                } else {
                    $short_desc = $curr_hit->description;
                }
                print $curr_hit->accession, ", ",
                      $curr_hit->raw_score, ", ",
                      $curr_hit->significance, ", ",
                      $short_desc, "\n";
            }
            $num_hit++;
        }
    }
    $seq_count++;
}
If you are running on Windows®, it is also possible to call MATLAB functions from Perl. You can
launch MATLAB in an Automation Server mode by using the /Automation switch in the MATLAB
startup command (e.g. D:\applications\matlab7x\bin\matlab.exe /Automation).
Here's a script to illustrate the process of launching an automation server, calling MATLAB functions
and passing variables between Perl and MATLAB.
type MATLAB_from_Perl.pl
#!/usr/bin/perl -w
use Win32::OLE;
use Win32::OLE::Variant;
sub send_to_matlab
{ my ($call, @command) = @_;
my $status = 0;
print "\n>> $call( @command )\n";
$result = $matlabApp->Invoke($call, @command);
if (defined($result))
{ unless ($result =~ s/^.\?{3}/Error:/)
{ print "$result\n" unless ($result eq "");
}
else
{ print "$result\n";
$status = -1;
}
}
return $status;
}
$m2Real = Variant(VT_ARRAY|VT_R8|VT_BYREF,4,4);
$m2Imag = Variant(VT_ARRAY|VT_R8|VT_BYREF,4,4);
print "\n>> GetFullMatrix( 'magicArray', 'base', ",'$m2Real, $m2Imag'," )\n";
$matlabApp->GetFullMatrix('magicArray', 'base', $m2Real, $m2Imag);
MATLAB offers additional tools for protein analysis and further research with these proteins. For
example, to access the sequences and run a full Smith-Waterman alignment on the tyrosine kinase
domain of the human insulin receptor (pdb 1IRK) and the kinase domain of the human lymphocyte
kinase (pdb 3LCK), load the sequence data:
IRK = pdbread('pdb1irk.ent');
LCK = pdbread('pdb3lck.ent');
Now perform a local alignment with the Smith-Waterman algorithm. MATLAB uses BLOSUM 50 as
the default scoring matrix for AA strings with a gap penalty of 8. Of course, you can change any of
these parameters.
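The alignment call itself is not shown. A sketch of how it could be written follows. It assumes that the two PDB files are available locally and that each structure returned by pdbread stores its chains in a Sequence field whose elements have their own Sequence field; using the first chain of each structure is a choice made only for this sketch. The scoring matrix and gap penalty are written out explicitly to mirror the defaults named above.

```matlab
% Read the two PDB structures used in the example.
IRK = pdbread('pdb1irk.ent');
LCK = pdbread('pdb3lck.ent');

% Use the first chain of each structure (an assumption of this sketch).
irkSeq = IRK.Sequence(1).Sequence;
lckSeq = LCK.Sequence(1).Sequence;

% Local alignment with the scoring matrix and gap penalty spelled out.
[score,alignment] = swalign(irkSeq,lckSeq, ...
    'ScoringMatrix','BLOSUM50','GapOpen',8);
```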
MATLAB and the Bioinformatics Toolbox™ offer additional tools for investigating nucleotide and
amino acid sequences. For example, pdbdistplot displays the distances between atoms and amino
acids in a PDB structure, while ramachandran generates a plot of the torsion angle PHI and the
torsion angle PSI of the protein sequence. The toolbox function proteinplot provides a graphical
user interface (GUI) to easily import sequences and plot various properties such as hydrophobicity.
This example shows how to programmatically search and retrieve data from NCBI's Entrez databases
using NCBI's Entrez Utilities (E-Utilities).
E-Utilities (eUtils) are server-side programs (for example, ESearch, ESummary, and EFetch) developed
and maintained by NCBI for searching and retrieving data from most Entrez databases. You access the
tools via URLs with a strict syntax: a specific base URL, a call to the eUtil's script, and its associated
parameters. For more details on eUtils, see E-Utilities Help.
In this example, we consider the genes sequenced from the H5N1 virus, isolated in 1997 from a
chicken in Hong Kong as a starting point for our analysis. This particular virus jumped from chickens
to humans, killing six people before the spread of the disease was brought under control by
destroying all poultry in Hong Kong [1]. You can use ESearch to find the sequence data needed for
the analysis. ESearch requires a database (db) and a search term (term) as input. Optionally, you can
request that ESearch store your search results on the NCBI history server through the usehistory
parameter.
baseURL = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
eutil = 'esearch.fcgi?';
dbParam = 'db=nuccore';
termParam = '&term=A/chicken/Hong+Kong/915/97+OR+A/chicken/Hong+Kong/915/1997';
usehistoryParam = '&usehistory=y';
esearchURL = [baseURL, eutil, dbParam, termParam, usehistoryParam]
esearchURL =
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=A/chicken/Hong+Kong/915
The term parameter can be any valid Entrez query. Note that there cannot be any spaces in the URL,
so parameters are separated by '&' and any spaces in a query term need to be replaced with '+' (e.g.
'Hong+Kong').
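The space-replacement rule just described can be applied programmatically rather than by hand. The sketch below builds the same ESearch URL from a query written with ordinary spaces; the variable name query is introduced only for this illustration.

```matlab
baseURL = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
query = 'A/chicken/Hong Kong/915/97 OR A/chicken/Hong Kong/915/1997';

% Replace each space with '+' so the term is safe to place in the URL.
termParam = ['&term=' strrep(query,' ','+')];

% Parameters after the first are joined with '&'.
esearchURL = [baseURL 'esearch.fcgi?' 'db=nuccore' termParam '&usehistory=y'];
disp(esearchURL)
```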
You can use webread to send the URL and return the results from ESearch as a character array.
searchReport = webread(esearchURL)
searchReport =
Accessing NCBI Entrez Databases with E-Utilities
<Id>6048903</Id>
<Id>6048875</Id>
<Id>6048849</Id>
<Id>6048829</Id>
<Id>6048770</Id>
<Id>3421265</Id>
</IdList><TranslationSet/><TranslationStack> <TermSet> <Term>A/chicken/Hong[All Fields]</Ter
ESearch returns the search results in XML. The report contains information about the query
performed, which database was searched, and the UIDs (unique IDs) of the records that match the query.
If you use the history server, the report contains two additional IDs, WebEnv and query_key, for
accessing the results. WebEnv is the location of the results on the server, and query_key is a number
indexing the queries performed. Because WebEnv and query_key are query dependent, they
change every time the search is executed. Either the UIDs or WebEnv and query_key can be parsed
out of the XML report and then passed to other eUtils. You can use regexp to do the parsing and store
the tokens in a structure with field names WebEnv and QueryKey.
ncbi = regexp(searchReport,...
'<QueryKey>(?<QueryKey>\w+)</QueryKey>\s*<WebEnv>(?<WebEnv>\S+)</WebEnv>',...
'names')
ncbi =
QueryKey: '1'
WebEnv: 'NCID_1_3777459_130.14.22.215_9001_1464976330_1306835914_0Me...'
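If you prefer to work with the UIDs instead of the history server keys, the same regexp approach applies. The sketch below parses a small hand-written fragment shaped like the ESearch report; the fragment reuses three of the UIDs shown in the output above, and the variable names sampleReport and uids are assumptions made for this illustration.

```matlab
% Hand-written fragment in the shape of an ESearch report.
sampleReport = ['<IdList>' newline ...
    '<Id>6048903</Id>' newline ...
    '<Id>6048875</Id>' newline ...
    '<Id>3421265</Id>' newline ...
    '</IdList>'];

% Each match yields a one-element cell; flatten them into one cell array.
tokens = regexp(sampleReport,'<Id>(\d+)</Id>','tokens');
uids = [tokens{:}];
numUIDs = numel(uids);
```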
To get a quick overview of sequences that matched the query you can use ESummary. ESummary
retrieves a brief summary, or Document Summary (DocSum), for each record. ESummary requires an
input of which database to access and which records to retrieve, identified either by a list of UIDs
passed through id parameter or by the WebEnv and query_key parameters. ESummary returns a
report in XML that contains the summary information for each record. Use websave with ESummary
to perform the record summary retrieval and write out the XML report to a file.
tmpDirectory = tempdir;
summaryFname = fullfile(tmpDirectory,'summaryReport.xml');
websave(summaryFname, [baseURL...
'esummary.fcgi?db=nuccore&WebEnv=',ncbi.WebEnv,...
'&query_key=',ncbi.QueryKey]);
You can create an XSL stylesheet to view information from the ESummary XML report in a web
browser. For more information on writing XSL stylesheets, see W3C® XSL. An XSL stylesheet was
created for this example to view the sequence summary information and provide links to the full
GenBank® files. You can use xslt to view the XML report in a web browser from MATLAB®.
xslt(summaryFname,'genbankSummary.xsl','-web');
To perform the sequence analysis, you need to get the full GenBank record for each sequence. EFetch
retrieves full records from Entrez databases. EFetch requires an input of a database and a list of
UIDs or WebEnv and query_key. Additionally, EFetch can return the output in different formats. You
can specify which output format (for example, GenBank (gb) or FASTA) and file format (for example,
text, ASN.1, or XML) you want through the rettype and retmode parameters, respectively. For this
query, rettype equals gb for the GenBank file format and retmode equals text. You can use
genbankread directly with the EFetch URL to retrieve all the GenBank records and read them into a
structure array. This structure can then be used as input to seqviewer to visualize the sequences.
ch97struct = genbankread([baseURL...
'efetch.fcgi?db=nuccore&rettype=gb&retmode=text&WebEnv=',ncbi.WebEnv,...
'&query_key=',ncbi.QueryKey]);
seqviewer(ch97struct)
It might be useful to have PubMed articles related to these gene records. ELink provides this
functionality. It finds associations between records within or between databases. You can give ELink
the query_key and WebEnv IDs from above and tell it to find records in the PubMed database (db
parameter) associated with your records from the Nucleotide (nuccore) database (dbfrom
parameter). ELink returns an XML report with the UIDs for the records in PubMed. These UIDs can
be parsed out of the report and passed to other eUtils (e.g., ESummary). Use the stylesheet created
for viewing ESummary reports to view the results of ELink.
elinkReport = webread([baseURL...
'elink.fcgi?dbfrom=nuccore&db=pubmed&WebEnv=', ncbi.WebEnv,...
'&query_key=',ncbi.QueryKey]);
pubmedIDs = regexp(elinkReport,'<Link>\s+<Id>(\w*)</Id>\s+</Link>','tokens');
NumberOfArticles = numel(pubmedIDs)
% Put PubMed UIDs into a string that can be read by EPost URL.
pubmed_str = [];
for ii = 1:NumberOfArticles
    pubmed_str = sprintf([pubmed_str '%s,'],char(pubmedIDs{ii}));
end
NumberOfArticles =
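The token extraction and the comma-separated ID string can also be produced without a loop. The following sketch runs the same regular expression on a small hand-written report. The XML fragment and the UIDs in it are hypothetical and stand in for the text that ELink returns.

```matlab
% Hypothetical fragment of an ELink report, for illustration only.
sampleReport = sprintf(['<LinkSetDb>\n' ...
    '<Link>\n<Id>11111</Id>\n</Link>\n' ...
    '<Link>\n<Id>22222</Id>\n</Link>\n' ...
    '</LinkSetDb>']);

% 'tokens' returns one 1-by-1 cell per match, nested in an outer cell array.
tokens = regexp(sampleReport,'<Link>\s+<Id>(\w*)</Id>\s+</Link>','tokens');

% Flatten the nested cells into a cell array of character vectors.
ids = [tokens{:}];

% Join the UIDs with commas; no trailing comma needs to be stripped.
idList = strjoin(ids, ',');
```

Because strjoin places separators only between elements, idList can be appended to a URL directly, whereas the loop version needs pubmed_str(1:end-1) to drop the final comma.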
You can use EPost to post UIDs to the history server. It returns an XML report with the query_key
and WebEnv IDs pointing to their location on the history server. Again, these can be parsed out of the
report and used with other eUtils calls.
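As a rough illustration of the parsing step, the sketch below pulls the two keys out of a hand-written report using named tokens. The XML text and the WebEnv value are hypothetical. In a live session the report would instead come from an EPost request, for example one built from baseURL, 'epost.fcgi?db=pubmed&id=', and the comma-separated UID string.

```matlab
% Hypothetical EPost report, for illustration only.
sampleReport = sprintf(['<ePostResult>\n' ...
    '  <QueryKey>1</QueryKey>\n' ...
    '  <WebEnv>NCID_EXAMPLE_0000</WebEnv>\n' ...
    '</ePostResult>']);

% Named tokens return a structure with QueryKey and WebEnv fields.
keys = regexp(sampleReport, ...
    '<QueryKey>(?<QueryKey>\w+)</QueryKey>\s*<WebEnv>(?<WebEnv>[^<]+)</WebEnv>', ...
    'names');
```

The resulting structure has the same two fields as the ncbi structure used earlier, so it can be dropped into the same URL-building pattern.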
epostKeys =
QueryKey: '1'
WebEnv: 'NCID_1_3778415_130.14.22.215_9001_1464976335_906725031_0Met...'
ELink can do "within" database searches. For example, you can query for a nucleotide sequence
within the Nucleotide (nuccore) database to find similar sequences, essentially performing a BLAST
search. For "within" database searches, ELink returns an XML report containing the related records,
along with a score ranking each record's relationship to the query record. From the above PubMed
search, you might be interested in finding all articles related to those articles in PubMed. This is easy
to do with ELink. To do a "within" database search, set both db and dbfrom to pubmed. You can use the
query_key and WebEnv from the EPost call.
pm2pmReport = webread([baseURL...
'elink.fcgi?dbfrom=pubmed&db=pubmed&query_key=',epostKeys.QueryKey,...
'&WebEnv=',epostKeys.WebEnv]);
% Keep each related PubMed UID once, so that the count and the ID list agree.
pubmedIDs = unique(regexp(pm2pmReport,'(?<=<Id>)\w*(?=</Id>)','match'));
NumberOfArticles = numel(pubmedIDs)
pubmed_str = [];
for ii = 1:NumberOfArticles
    pubmed_str = sprintf([pubmed_str '%s,'],char(pubmedIDs{ii}));
end
NumberOfArticles =
526
Use websave with EFetch to retrieve full abstracts for the articles and write out the returned XML
report to a file. An XSL stylesheet is provided with this example for viewing the results of the EFetch
query. The XML report can be transformed using the stylesheet and opened in a Web browser from
MATLAB using xslt.
fullFname = fullfile(tmpDirectory,'H5N1_relatedArticles.xml');
websave(fullFname, [baseURL 'efetch.fcgi?db=pubmed&retmode=xml&id=',...
pubmed_str(1:end-1)]);
xslt(fullFname,'pubmedFullReport.xsl','-web');
To see what other Entrez databases contain information about the H5N1 virus, use EGQuery.
EGQuery performs a text search across all available Entrez databases and returns the number of hits
in each. EGQuery accepts any valid Entrez text query as input through the term parameter.
entrezSearch = webread([baseURL,'egquery.fcgi?term=H5N1+AND+virus']);
entrezResults = regexp(entrezSearch,...
'<DbName>(?<DB>\w+\s*\w*)</DbName>.*?(<Count>)(?<Count>\d+)</Count>',...
'names');
entrezDBs = {entrezResults(:).DB};
dbCounts = str2double({entrezResults(:).Count});
entrezDBs = entrezDBs(logical(dbCounts)); % remove databases with no records
[dbCounts,sortInd] = sort(dbCounts(logical(dbCounts)));
entrezDBs = entrezDBs(sortInd);
numDBs = numel(entrezDBs);
barh(log10(dbCounts));
ylim([.5 numDBs+.5])
ax = gca;
ax.YTick = 1:numDBs;
ax.YTickLabel = entrezDBs;
xlabel('Log(Number of Records)');
title('Number of H5N1 Related-Records Per Entrez Database');
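The filtering and sorting steps above keep the database names aligned with their counts. The following sketch repeats that logic on a few made-up values so that the intermediate results are easy to inspect. The database names and counts are hypothetical.

```matlab
% Hypothetical database names and hit counts, for illustration only.
dbNames = {'pubmed','nuccore','protein','books'};
counts  = [120 0 4500 7];

% Drop databases with no records.
keep = counts > 0;
dbNames = dbNames(keep);

% Sort ascending and reorder the names with the same permutation.
% barh draws the first element at the bottom, so the largest count ends up on top.
[counts, order] = sort(counts(keep));
dbNames = dbNames(order);
```

Reusing the permutation returned by sort is what keeps each label attached to the correct bar.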
References
[1] Cristianini, N. and Hahn, M.W. "Introduction to Computational Genomics: A Case Studies
Approach", Cambridge University Press, 2007.
4
Microarray Analysis
In MATLAB, you can represent all the previous data and information in an ExpressionSet object,
which typically contains the following objects:
• One ExptData object containing expression values from a microarray experiment in one or more
DataMatrix objects
• One MetaData object containing sample metadata in two dataset arrays
• One MetaData object containing feature metadata in two dataset arrays
• One MIAME object containing experiment descriptions
The following graphic illustrates a typical ExpressionSet object and its component objects.
Managing Gene Expression Data in Objects
Each element (DataMatrix object) in the ExpressionSet object has an element name. Also, there is
always one DataMatrix object whose element name is Expressions.
An ExpressionSet object lets you store, manage, and subset the data from a microarray gene
expression experiment. An ExpressionSet object includes properties and methods that let you access,
retrieve, and change data, metadata, and other information about the microarray experiment. These
properties and methods are useful to view and analyze the data. For a list of the properties and
methods, see ExpressionSet class.
To learn more about constructing and using objects for microarray gene expression data and
information, see:
Representing Expression Data Values in DataMatrix Objects
The object constructor function, DataMatrix, lets you create a DataMatrix object to encapsulate
data and metadata (row and column names) from a microarray experiment. A DataMatrix object
stores experimental data in a matrix, with rows typically corresponding to gene names or probe
identifiers, and columns typically corresponding to sample identifiers. A DataMatrix object also stores
metadata, including the gene names or probe identifiers (as the row names) and sample identifiers
(as the column names).
You can reference microarray expression values in a DataMatrix object the same way you reference
data in a MATLAB array, that is, by using linear or logical indexing. Alternatively, you can reference this
experimental data by gene (probe) identifiers and sample identifiers. Indexing by these identifiers lets
you quickly and conveniently access subsets of the data without having to maintain additional index
arrays.
Many MATLAB operators and arithmetic functions are available to DataMatrix objects by means of
methods. These methods let you modify, combine, compare, analyze, plot, and access information
from DataMatrix objects. Additionally, you can easily extend the functionality by using general
element-wise functions, dmarrayfun and dmbsxfun, and by manually accessing the properties of a
DataMatrix object.
Note For tables describing the properties and methods of a DataMatrix object, see the DataMatrix
object reference page.
load filteredyeastdata
2 Create variables to contain a subset of the data, specifically the first five rows and first four
columns of the yeastvalues matrix, the genes cell array, and the times vector.
yeastvalues = yeastvalues(1:5,1:4);
genes = genes(1:5,:);
times = times(1:4);
3 Import the microarray object package so that the DataMatrix constructor function will be
available.
import bioma.data.*
4 Use the DataMatrix constructor function to create a small DataMatrix object from the gene
expression data.
dmo = DataMatrix(yeastvalues,genes,times)
dmo =
1 Use the get method to display the properties of the DataMatrix object, dmo.
get(dmo)
Name: ''
RowNames: {5x1 cell}
ColNames: {' 0' ' 9.5' '11.5' '13.5'}
NRows: 5
NCols: 4
NDims: 2
ElementClass: 'double'
2 Use the set method to specify a name for the DataMatrix object, dmo.
dmo = set(dmo,'Name','MyDMObject');
3 Use the get method again to display the properties of the DataMatrix object, dmo.
get(dmo)
Name: 'MyDMObject'
RowNames: {5x1 cell}
ColNames: {' 0' ' 9.5' '11.5' '13.5'}
NRows: 5
NCols: 4
NDims: 2
ElementClass: 'double'
Note For a description of all properties of a DataMatrix object, see the DataMatrix object reference
page.
• Parenthesis ( ) indexing
• Dot . indexing
Parentheses () Indexing
Use parenthesis indexing to extract a subset of the data in dmo and assign it to a new DataMatrix
object dmo2:
dmo2 = dmo(1:5,2:3)
dmo2 =
9.5 11.5
SS DNA 1.699 -0.026
YAL003W 0.146 -0.129
YAL012W 0.175 0.467
YAL026C 0.796 0.384
YAL034C 0.487 -0.184
Use parenthesis indexing to extract a subset of the data using row names and column names, and
assign it to a new DataMatrix object dmo3:
dmo3 =
11.5
SS DNA -0.026
YAL012W 0.467
YAL034C -0.184
Note If you use a cell array of row names or column names to index into a DataMatrix object, the
names in the cell array must be unique, even if the row names or column names within the
DataMatrix object are not.
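Indexing by name can be sketched on a small, self-contained DataMatrix object. The data and the row and column names below are hypothetical, and the sketch assumes that the Bioinformatics Toolbox is installed.

```matlab
import bioma.data.*

% Hypothetical 3-by-3 data with made-up row and column names.
dm = DataMatrix(magic(3), {'g1','g2','g3'}, {'s1','s2','s3'});

% Select rows g1 and g3 in column s2 by name.
sub = dm({'g1','g3'}, 's2');

% Convert the subset to a plain numeric array.
vals = double(sub);
```

A call of the same form on dmo, using the three row names and the column name that appear in the dmo3 output, would be one way to obtain that subset.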
Use parenthesis indexing to assign new data to a subset of the elements in dmo2:
9.5 11.5
SS DNA 1.7 -0.03
YAL003W 0.15 -0.13
YAL012W 0.175 0.467
YAL026C 0.796 0.384
YAL034C 0.487 -0.184
9.5 11.5
YAL012W 0.175 0.467
YAL026C 0.796 0.384
YAL034C 0.487 -0.184
Dot . Indexing
Note In the following examples, notice that when using dot indexing with DataMatrix objects, you
specify all rows or all columns using a colon within single quotation marks, (':').
Use dot indexing to extract the data from the 11.5 column only of dmo:
timeValues = dmo.(':')('11.5')
timeValues =
-0.0260
-0.1290
0.4670
0.3840
-0.1840
Use dot indexing to assign new data to a subset of the elements in dmo:
dmo.(1:2)(':') = 7
dmo =
dmo.YAL034C = []
dmo =
dmo.(':')(2:3)=[]
dmo =
0 13.5
SS DNA 7 7
YAL003W 7 7
YAL012W 0.157 -0.379
YAL026C 0.246 0.981
Representing Expression Data Values in ExptData Objects
The following illustrates a small DataMatrix object containing expression values from three samples
(columns) and seven features (rows):
A B C
100001_at 2.26 20.14 31.66
100002_at 158.86 236.25 206.27
100003_at 68.11 105.45 82.92
100004_at 74.32 96.68 84.87
100005_at 75.05 53.17 57.94
100006_at 80.36 42.89 77.21
100007_at 216.64 191.32 219.48
An ExptData object lets you store, manage, and subset the data values from a microarray experiment.
An ExptData object includes properties and methods that let you access, retrieve, and change data
values from a microarray experiment. These properties and methods are useful to view and analyze
the data. For a list of the properties and methods, see ExptData class.
1 Import the bioma.data package so that the DataMatrix and ExptData constructor functions
are available.
import bioma.data.*
2 Use the DataMatrix constructor function to create a DataMatrix object from the gene
expression data in the mouseExprsData.txt file. This file contains a table of expression values
and metadata (sample and feature names) from a microarray experiment done using the
Affymetrix MGU74Av2 GeneChip array. There are 26 sample names (A through Z), and 500
feature names (probe set names).
EDObj = ExptData(dmObj);
Experiment Data:
500 features, 26 samples
1 elements
Element names: Elmt1
Note For complete information on constructing ExptData objects, see ExptData class.
For example, to determine the number of elements (DataMatrix objects) in an ExptData object:
EDObj.NElements
ans =
Note Property names are case sensitive. For a list and description of all properties of an ExptData
object, see ExptData class.
or
methodname(objectname)
Columns 1 through 9
'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' ...
size(EDObj)
ans =
500 26
Note For a complete list of methods of an ExptData object, see ExptData class.
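The property and method access patterns shown above can be tried on a small ExptData object built from made-up values. This is only a sketch. The data and names are hypothetical, and the calls used are limited to those that appear in this section (the ExptData constructor, the NElements property, and size).

```matlab
import bioma.data.*

% Hypothetical 3 features by 3 samples.
dm = DataMatrix(magic(3), {'f1','f2','f3'}, {'A','B','C'});

% Wrap the DataMatrix object in an ExptData object.
ed = ExptData(dm);

% Property access: number of elements (DataMatrix objects).
nElements = ed.NElements;

% Method access: size returns [number of features, number of samples].
sz = size(ed);
```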
References
[1] Hovatta, I., Tennant, R.S., Helton, R., et al. (2005). Glyoxalase 1 and glutathione reductase 1
regulate anxiety in mice. Nature 438, 662–666.
• Values dataset array — A dataset array containing the measured value of each variable per
sample or feature. In this dataset array, the columns correspond to variables and rows correspond
to either samples or features. The number and names of the columns in this dataset array must
match the number and names of the rows in the Descriptions dataset array. If this dataset array
contains sample metadata, then the number and names of the rows (samples) must match the
number and names of the columns in the DataMatrix objects in the same ExpressionSet object. If
this dataset array contains feature metadata, then the number and names of the rows (features)
must match the number and names of the rows in the DataMatrix objects in the same
ExpressionSet object.
• Descriptions dataset array — A dataset array containing a list of the variable names and their
descriptions. In this dataset array, each row corresponds to a variable. The row names are the
variable names, and a column, named VariableDescription, contains a description of the
variable. The number and names of the rows in the Descriptions dataset array must match the
number and names of the columns in the Values dataset array.
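The relationship between the two dataset arrays can be sketched with a tiny example. The sample names, values, and descriptions below are hypothetical. The sketch assumes that the Bioinformatics Toolbox and the dataset class from Statistics and Machine Learning Toolbox are available, and it passes the descriptions as a cell array, one entry per variable, in the same order as the columns of the Values dataset array.

```matlab
import bioma.data.*

% Values dataset array: rows are samples, columns are variables (hypothetical data).
vals = dataset({[8; 8; 9], 'Age'}, ...
               {{'Male'; 'Male'; 'Female'}, 'Gender'}, ...
               'ObsNames', {'A'; 'B'; 'C'});

% One description per variable, in column order.
md = MetaData(vals, {'Age in weeks', 'Gender of the mouse'});

% The number of variables equals the number of columns in the Values dataset array.
nVars = md.NVariables;

% The Descriptions dataset array has one row per variable.
desc = variableDesc(md);
```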
The following illustrates a dataset array containing the measured value of each variable per sample
or feature:
Gender Age Type Strain Source
A 'Male' 8 'Wild type' '129S6/SvEvTac' 'amygdala'
B 'Male' 8 'Wild type' '129S6/SvEvTac' 'amygdala'
C 'Male' 8 'Wild type' '129S6/SvEvTac' 'amygdala'
D 'Male' 8 'Wild type' 'A/J ' 'amygdala'
E 'Male' 8 'Wild type' 'A/J ' 'amygdala'
F 'Male' 8 'Wild type' 'C57BL/6J ' 'amygdala'
The following illustrates a dataset array containing a list of the variable names and their descriptions:
VariableDescription
id 'Sample identifier'
Gender 'Gender of the mouse in study'
Age 'The number of weeks since mouse birth'
Type 'Genetic characters'
Strain 'The mouse strain'
Source 'The tissue source for RNA collection'
Representing Sample and Feature Metadata in MetaData Objects
A MetaData object lets you store, manage, and subset the metadata from a microarray experiment. A
MetaData object includes properties and methods that let you access, retrieve, and change metadata
from a microarray experiment. These properties and methods are useful to view and analyze the
metadata. For a list of the properties and methods, see MetaData class.
1 Import the bioma.data package so that the MetaData constructor function is available.
import bioma.data.*
2 Load some sample data, which includes Fisher’s iris data of 5 measurements on a sample of 150
irises.
load fisheriris
3 Create a dataset array from some of Fisher's iris data. The dataset array will contain 750
measured values, one for each of 150 samples (iris replicates) at five variables (species, SL, SW,
PL, PW). In this dataset array, the rows correspond to samples, and the columns correspond to
variables.
irisVarDesc =
VariableDescription
species 'Iris species'
SL 'Sepal Length'
SW 'Sepal Width'
PL 'Petal Length'
PW 'Petal Width'
5 Create a MetaData object from the two dataset arrays.
import bioma.data.*
2 View the mouseSampleData.txt file included with the Bioinformatics Toolbox software.
Note that this text file contains two tables. One table contains 130 measured values, one for each
of 26 samples (A through Z) for each of five variables (Gender, Age, Type, Strain, and Source). In this
table, the rows correspond to samples, and the columns correspond to variables. The second
table has lines prefaced by the # symbol. It contains five rows, one for each of the five
variables: Gender, Age, Type, Strain, and Source. The first column contains the variable name.
The second column has a column header of VariableDescription and contains a description
of the variable.
Sample Names:
A, B, ...,Z (26 total)
Variable Names and Meta Information:
VariableDescription
Gender ' Gender of the mouse in study'
Age ' The number of weeks since mouse birth'
Type ' Genetic characters'
Strain ' The mouse strain'
Source ' The tissue source for RNA collection'
objectname.propertyname
MDObj2.NVariables
ans =
objectname.propertyname = propertyvalue
Note Property names are case sensitive. For a list and description of all properties of a MetaData
object, see MetaData class.
objectname.methodname
or
methodname(objectname)
For example, to access the dataset array in a MetaData object that contains the variable values:
MDObj2.variableValues;
To access the dataset array of a MetaData object that contains the variable descriptions:
variableDesc(MDObj2)
ans =
VariableDescription
Gender ' Gender of the mouse in study'
Age ' The number of weeks since mouse birth'
Type ' Genetic characters'
Strain ' The mouse strain'
Source ' The tissue source for RNA collection'
Note For a complete list of methods of a MetaData object, see MetaData class.
• Experiment design
• Microarrays used
• Samples used
• Sample preparation and labeling
• Hybridization procedures and parameters
• Normalization controls
• Preprocessing information
• Data processing specifications
A MIAME object includes properties and methods that let you access, retrieve, and change
experiment information related to a microarray experiment. These properties and methods are useful
to view and analyze the information. For a list of the properties and methods, see MIAME class.
1 Import the bioma.data package so that the MIAME constructor function is available.
import bioma.data.*
2 Use the getgeodata function to return a MATLAB structure containing Gene Expression
Omnibus (GEO) Series data related to accession number GSE4616.
geoStruct = getgeodata('GSE4616')
geoStruct =
MIAMEObj1 = MIAME(geoStruct);
Representing Experiment Information in a MIAME Object
MIAMEObj1 =
Experiment Description:
Author name: Mika,,Silvennoinen
Riikka,,Kivelä
Maarit,,Lehti
Anna-Maria,,Touvras
Jyrki,,Komulainen
Veikko,,Vihko
Heikki,,Kainulainen
Laboratory: LIKES - Research Center
Contact information: Mika,,Silvennoinen
URL:
PubMedIDs: 17003243
Abstract: A 90 word abstract is available. Use the Abstract property.
Experiment Design: A 234 word summary is available. Use the ExptDesign property.
Other notes:
[1x80 char]
import bioma.data.*
2 Use the MIAME constructor function to create a MIAME object using individual properties.
MIAMEObj2 = MIAME('investigator', 'Jane Researcher',...
'lab', 'One Bioinformatics Laboratory',...
'contact', '[email protected]',...
'url', 'www.lab.not.exist',...
'title', 'Normal vs. Diseased Experiment',...
'abstract', 'Example of using expression data',...
'other', {'Notes:Created from a text file.'});
MIAMEObj2 =
Experiment Description:
Author name: Jane Researcher
Laboratory: One Bioinformatics Laboratory
Contact information: [email protected]
URL: www.lab.not.exist
PubMedIDs:
Abstract: A 4 word abstract is available. Use the Abstract property.
No experiment design summary available.
Other notes:
'Notes:Created from a text file.'
objectname.propertyname
For example, to retrieve the PubMed identifier of publications related to a MIAME object:
MIAMEObj1.PubMedID
ans =
17003243
objectname.propertyname = propertyvalue
Note Property names are case sensitive. For a list and description of all properties of a MIAME
object, see MIAME class.
objectname.methodname
or
methodname(objectname)
MIAMEObj1.isempty
ans =
Note For a complete list of methods of a MIAME object, see MIAME class.
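The three access patterns (reading a property, setting a property, and calling a method) can be sketched on a small MIAME object. The property values are hypothetical, and only the Abstract property and the isempty method, both mentioned in this section, are used.

```matlab
import bioma.data.*

% Create a MIAME object from individual properties (hypothetical values).
m = MIAME('investigator', 'Jane Researcher', ...
          'title', 'Normal vs. Diseased Experiment', ...
          'abstract', 'Example of using expression data');

% Read a property.
originalAbstract = m.Abstract;

% Set a property.
m.Abstract = 'Updated abstract text';

% Call a method; the object holds information, so it is not empty.
tf = isempty(m);
```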
Representing All Data in an ExpressionSet Object
• One ExptData object containing expression values from a microarray experiment in one or more
DataMatrix objects
• One MetaData object containing sample metadata in two dataset arrays
• One MetaData object containing feature metadata in two dataset arrays
• One MIAME object containing experiment descriptions
The following graphic illustrates a typical ExpressionSet object and its component objects.
Each element (DataMatrix object) in the ExpressionSet object has an element name. Also, there is
always one DataMatrix object whose element name is Expressions.
An ExpressionSet object lets you store, manage, and subset the data from a microarray gene
expression experiment. An ExpressionSet object includes properties and methods that let you access,
retrieve, and change data, metadata, and other information about the microarray experiment. These
properties and methods are useful to view and analyze the data. For a list of the properties and
methods, see ExpressionSet class.
Note The following procedure assumes you have executed the example code in the previous sections:
1 Import the bioma package so that the ExpressionSet constructor function is available.
import bioma.*
2 Construct an ExpressionSet object from EDObj, an ExptData object, MDObj2, a MetaData object
containing sample variable information, and MIAMEObj, a MIAME object.
ESObj
ExpressionSet
Experiment Data: 500 features, 26 samples
Element names: Expressions
Sample Data:
Sample names: A, B, ...,Z (26 total)
Sample variable names and meta information:
Gender: Gender of the mouse in study
Age: The number of weeks since mouse birth
Type: Genetic characters
Strain: The mouse strain
Source: The tissue source for RNA collection
Feature Data: none
Experiment Information: use 'exptInfo(obj)'
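For a self-contained illustration, the sketch below assembles a very small ExpressionSet object from made-up components. It assumes that the constructor accepts sample metadata through an 'SData' argument, which should be checked against the ExpressionSet class reference. All data and names are hypothetical, and the sample names in the MetaData object are chosen to match the column names of the DataMatrix object.

```matlab
import bioma.*
import bioma.data.*

% Expression values: 3 features by 3 samples (hypothetical).
dm = DataMatrix(magic(3), {'f1','f2','f3'}, {'A','B','C'});
ed = ExptData(dm);

% Sample metadata: one variable, with row names matching the sample names.
ds = dataset({[8; 8; 9], 'Age'}, 'ObsNames', {'A'; 'B'; 'C'});
md = MetaData(ds, {'Age in weeks'});

% Assemble the ExpressionSet object ('SData' is an assumed argument name).
es = ExpressionSet(ed, 'SData', md);

nSamples = es.NSamples;          % property access
varNames = sampleVarNames(es);   % method access
```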
objectname.propertyname
ESObj.NSamples
ans =
26
Note Property names are case sensitive. For a list and description of all properties of an
ExpressionSet object, see ExpressionSet class.
objectname.methodname
or
methodname(objectname)
For example, to retrieve the sample variable names from an ExpressionSet object:
ESObj.sampleVarNames
ans =
ans =
Experiment description
Author name: Mika,,Silvennoinen
Riikka,,Kivelä
Maarit,,Lehti
Anna-Maria,,Touvras
Jyrki,,Komulainen
Veikko,,Vihko
Heikki,,Kainulainen
Laboratory: XYZ Lab
Contact information: Mika,,Silvennoinen
URL:
PubMedIDs: 17003243
Abstract: A 90 word abstract is available Use the Abstract property.
Experiment Design: A 234 word summary is available Use the ExptDesign property.
Other notes:
[1x80 char]
Note For a complete list of methods of an ExpressionSet object, see ExpressionSet class.
Analyzing Illumina Bead Summary Gene Expression Data
This example shows how to analyze Illumina® BeadChip gene expression summary data using
MATLAB® and Bioinformatics Toolbox™ functions.
Introduction
This example shows how to import and analyze gene expression data from the Illumina BeadChip
microarray platform. The data set in the example is from the study of gene expression profiles of
human spermatogenesis by Platts et al. [1]. The expression levels were measured on Illumina Sentrix
Human 6 (WG6) BeadChips.
The data from a microarray gene expression experiment generally consists of four components:
experiment data values, sample information, feature annotations, and information about the
experiment. This example uses microarray data, constructs each of the four components, assembles
them into an ExpressionSet object, and finds the differentially expressed genes. For more
examples about the ExpressionSet class, see “Working with Objects for Microarray Experiment
Data” on page 4-203.
Samples were hybridized on three Illumina Sentrix Human 6 (WG6) BeadChips. Retrieve the GEO
Series data GSE6967 using the getgeodata function.
TNGEOData = getgeodata('GSE6967')
TNGEOData =
The TNGEOData structure contains Header and Data fields. The Header field has two fields, Series
and Samples, containing a description of the experiment and of the samples, respectively. The Data
field contains a DataMatrix of normalized summary expression levels from the experiment.
nSamples =
13
ans =
For simplicity, extract the sample labels from the sample titles.
sampleLabels = cellfun(@(x) char(regexp(x, '\w\d+', 'match')),...
TNGEOData.Header.Samples.title, 'UniformOutput',false)
sampleLabels =
Columns 1 through 7
Columns 8 through 13
Download the supplementary file GSE6967_RAW.tar and extract its contents to access the 13 text files
produced by the BeadStudio software, which contain the raw, non-normalized bead summary values.
The raw data files are named with their GSM accession numbers. For this example, construct the file
names of the text data files from the accession numbers.
rawDataFiles = cell(1,nSamples);
for i = 1:nSamples
    rawDataFiles{i} = [TNGEOData.Header.Samples.geo_accession{i} '.txt'];
end
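The same file names can be built without a loop, because strcat operates element-wise on a cell array of character vectors. The accession numbers below are hypothetical placeholders for the contents of TNGEOData.Header.Samples.geo_accession.

```matlab
% Hypothetical GSM accession numbers, for illustration only.
accessions = {'GSM000001','GSM000002','GSM000003'};

% Append the extension to every accession number at once.
rawDataFiles = strcat(accessions, '.txt');
```

The result is a cell array of the same size as the input, so later indexing such as rawDataFiles{i} works unchanged.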
Set the variable rawDataPath to the path and directory to which you extracted the data files.
rawDataPath = 'C:\Examples\illuminagedemo\GSE6967';
Use the ilmnbsread function to read the first of the summary files and store the content in a
structure.
rawData = ilmnbsread(fullfile(rawDataPath, rawDataFiles{1}))
rawData =
rawData.ColumnNames'
ans =
{'MIN_Signal-1412091085_A' }
{'AVG_Signal-1412091085_A' }
{'MAX_Signal-1412091085_A' }
{'NARRAYS-1412091085_A' }
{'ARRAY_STDEV-1412091085_A'}
{'BEAD_STDEV-1412091085_A' }
{'Avg_NBEADS-1412091085_A' }
{'Detection-1412091085_A' }
nTargets = size(rawData.Data, 1)
nTargets =
47293
Read the non-normalized expression values (AVG_Signal), the detection confidence limits, and the
Sentrix chip IDs from the summary data files. The gene expression values are identified with Illumina
probe target IDs. You can specify the columns to read from the data file.
for i = 1:nSamples
    rawData = ilmnbsread(fullfile(rawDataPath, rawDataFiles{i}),...
        'COLUMNS', {'AVG_Signal', 'Detection'});
    chipIDs(i) = regexp(rawData.ColumnNames(1), '\d*', 'match', 'once');
    rawMatrix(:, i) = rawData.Data(:, 1);
    detectionConf(:,i) = rawData.Data(:,2);
end
There are three Sentrix BeadChips used in the experiment. Inspect the Illumina Sentrix BeadChip IDs
in chipIDs to determine the number of samples hybridized on each chip.
summary(chipIDs)
samplesChip1 = sampleLabels(chipIDs=='1412091085')
samplesChip2 = sampleLabels(chipIDs=='1412091086')
samplesChip3 = sampleLabels(chipIDs=='1477791158')
samplesChip1 =
samplesChip2 =
samplesChip3 =
Six samples (T2, T1, T6, T4, T8, and N11) were hybridized to six arrays on the first chip, four samples
(T3, T7, T5, and N6) on the second chip, and three samples (N12, N5, and N1) on the third chip.
Use a boxplot to view the raw expression levels of each sample in the experiment.
logRawExprs = log2(rawMatrix);
The difference in intensities between samples on the same chip and samples on different chips does
not seem too large. The first BeadChip, containing samples T2, T1, T6, T4, T8 and N11, seems to be
slightly more variable than others.
Using MA and XY plots to do a pairwise comparison of the arrays on a BeadChip can be informative.
On an MA plot, the average (A) of the expression levels of two arrays is plotted on the x axis, and
the difference (M) in the measurements on the y axis. An XY plot is a scatter plot of one array against
another. This example uses the helper function maxyplot to create MAXY plots for a pairwise
comparison of the three arrays on the first chip hybridized with teratozoospermic samples (T2, T1
and T6).
Note: You can also use the mairplot function to create the MA or IR (Intensity/Ratio) plots for
comparison of specific arrays.
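The M and A values for a pair of arrays can be computed directly. The sketch below assumes the common convention that both quantities are calculated on log2-transformed intensities. The intensity values are hypothetical.

```matlab
% Hypothetical raw intensities for the same four probes on two arrays.
x = [64 256  512 1024];   % array 1
y = [32 256 1024 1024];   % array 2

% M: difference of the log2 intensities (log ratio), plotted on the y axis.
M = log2(x) - log2(y);

% A: average of the log2 intensities, plotted on the x axis.
A = (log2(x) + log2(y)) / 2;
```

Probes with similar intensities on both arrays have M near zero, so systematic departures from the M = 0 line across the range of A suggest that normalization is needed.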
inspectIdx = 1:3;
maxyplot(rawMatrix, inspectIdx)
sgtitle('Before Normalization')
In a MAXY plot, the MA plots for all pairwise comparisons are in the upper-right corner. In the
lower-left corner are the XY plots of the comparisons. The MAXY plot shows that two of the arrays, T1
and T2, are quite similar to each other but differ from the third array, T6.
The expression box plots and MAXY plots reveal that there are differences in expression levels within
chips and between chips; hence, the data requires normalization. Use the quantilenorm function to
apply quantile normalization to the raw data.
Note: You can also try invariant set normalization using the mainvarsetnorm function.
normExprs = rawMatrix;
normExprs(:, :) = quantilenorm(rawMatrix.(':')(':'));
log2NormExprs = log2(normExprs);
figure;
maboxplot(log2NormExprs, 'ORIENTATION', 'horizontal')
ylabel('Sample Labels')
xlabel('log2(Expression Value)')
title('After Quantile Normalization')
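The idea behind quantile normalization can be sketched in a few lines of base MATLAB. This is a simplified illustration on made-up numbers, not the implementation used by quantilenorm. In particular, it gives no special treatment to ties or missing values.

```matlab
% Hypothetical expression matrix: 4 probes (rows) by 3 arrays (columns).
X = [5 4 3;
     2 1 4;
     3 4 6;
     4 2 8];

% Sort each column and remember where each value came from.
[sortedX, order] = sort(X, 1);

% Reference distribution: the mean of each rank across arrays.
rankMeans = mean(sortedX, 2);

% Put the reference values back in the original probe order of each array.
Xn = zeros(size(X));
for j = 1:size(X, 2)
    Xn(order(:, j), j) = rankMeans;
end
```

After this step every column of Xn contains the same set of values, so the arrays share one empirical distribution while each probe keeps its rank within its array.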
Display and inspect the MAXY plot of the three arrays (T2, T1 and T6) on the first chip after the
normalization.
maxyplot(normExprs, inspectIdx)
sgtitle('After Quantile Normalization')
Many of the genes in this study are not expressed, or have only small variability across the samples.
First, you can remove genes with very low absolute expression values by using genelowvalfilter.
Second, filter out genes with a small variance across samples using genevarfilter.
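The logic of the two filters can be illustrated in base MATLAB with fixed thresholds. This sketch is not a substitute for genelowvalfilter and genevarfilter, which can derive their cutoffs differently (for example from percentiles). The matrix and both thresholds are hypothetical.

```matlab
% Hypothetical log2 expression matrix: 5 genes (rows) by 4 samples (columns).
X = [2.0  2.1  1.9  2.0;    % low values, low variance
     9.0  9.1  9.0  8.9;    % high values, low variance
     7.0 10.0  6.0 11.0;    % high values, high variance
     1.0  1.2  0.9  1.1;    % low values, low variance
     8.0 12.0  7.5 12.5];   % high values, high variance

minValue    = 5;   % hypothetical absolute-expression cutoff
minVariance = 1;   % hypothetical variance cutoff

% Keep genes that reach the expression cutoff in at least one sample.
keepValue = max(X, [], 2) >= minValue;

% Keep genes whose variance across samples reaches the variance cutoff.
keepVariance = var(X, 0, 2) >= minVariance;

% Apply both filters.
keep = keepValue & keepVariance;
Xfiltered = X(keep, :);
```

Only the third and fifth genes pass both filters here, which mirrors the intent of removing genes that are not expressed or that barely change across samples.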
Microarray manufacturers usually provide annotations of a collection of features for each type of
chip. The chip annotation files contain metadata such as the gene name, symbol, NCBI accession
number, chromosome location and pathway information. Before assembling an ExpressionSet
object for the experiment, obtain the annotations about the features or probes on the BeadChip. You
can download the Human_WG-6.csv annotation file for Sentrix Human 6 (WG6) BeadChips from the
Support page at the Illumina web site and save the file locally. Read the annotation file into MATLAB
as a dataset array. Set the variable annotPath to the path and directory to which you downloaded
the annotation file.
annotPath = fullfile('C:\Examples\illuminagedemo\Annotation');
WG6Annot = dataset('xlsfile', fullfile(annotPath, 'Human_WG-6.csv'));
get(WG6Annot)
Description: ''
VarDescription: {}
Units: {}
DimNames: {'Observations' 'Variables'}
UserData: []
ObsNames: {}
VarNames: {1×13 cell}
fDataVariables =
Columns 1 through 5
Columns 6 through 10
Columns 11 through 13
ans =
47296
Because the expression data in this example is only a small set of the full expression values, you will
work with only the features in the DataMatrix object log2NormExprs. Find the matching features
in log2NormExprs and WG6Annot.Target.
[commTargets, fI, WGI] = intersect(rownames(log2NormExprs), WG6Annot.Target);
You can store the preprocessed expression values and detection limits of the annotated probe targets
as an ExptData object.
fNames = rownames(log2NormExprs);
TNExptData = bioma.data.ExptData({log2NormExprs(fI, :), 'ExprsValues'},...
{detectionConf(fI, :), 'DetectionConfidences'})
TNExptData =
Experiment Data:
42313 features, 13 samples
2 elements
Element names: ExprsValues, DetectionConfidences
The sample data in the Header.Samples field of the TNGEOData structure can be overwhelming and
difficult to navigate. From the data in the Header.Samples field, you can gather the essential
information about the samples, such as the sample titles and GEO sample accession numbers, and
store the sample data as a MetaData object.
sampleChars =
Columns 1 through 4
Columns 5 through 8
Columns 9 through 13
Create a dataset array to store the sample data you just extracted.
sampleDS =
Store the sample metadata as an object of the MetaData class, including a short description for each
variable.
TNSData = bioma.data.MetaData(sampleDS,...
{'Sample GEO accession number',...
'Spermic type',...
'Fertility characteristics'})
TNSData =
Sample Names:
T2, T1, ...,N1 (13 total)
Variable Names and Meta Information:
VariableDescription
GSM {'Sample GEO accession number'}
Type {'Spermic type' }
Characteristics {'Fertility characteristics' }
The collection of feature metadata for Sentrix Human 6 BeadChips is large and diverse. Select
information about features that are unique to the experiment and save the information as a
MetaData object. Extract annotations of interest, for example, Accession and Symbol.
Create a MetaData object for the feature annotation information with brief descriptions about the
two variables of the metadata.
WG6FData =
Sample Names:
GI_10047089-S, GI_10047091-S, ...,hmm9988-S (42313 total)
Variable Names and Meta Information:
VariableDescription
Accession {'Accession number of probe target'}
Symbol {'Gene Symbol of probe target' }
Most of the experiment descriptions in the Header.Series field of the TNGEOData structure can be
reorganized and stored as a MIAME object, which you will use to assemble the ExpressionSet object
for the experiment.
TNExptInfo = bioma.data.MIAME(TNGEOData)
TNExptInfo =
Experiment Description:
Author name: Adrian,E,Platts
David,J,Dix
Hector,E,Chemes
Kary,E,Thompson
Robert,,Goodrich
John,C,Rockett
Vanesa,Y,Rawe
Silvina,,Quintana
Michael,P,Diamond
Lillian,F,Strader
Stephen,A,Krawetz
Laboratory: Wayne State University
Contact information: Stephen,A,Krawetz
URL: http://compbio.med.wayne.edu
PubMedIDs: 17327269
Abstract: A 82 word abstract is available. Use the Abstract property.
Experiment Design: A 61 word summary is available. Use the ExptDesign property.
Other notes:
{'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE6nnn/GSE6967/suppl/GSE6967_RAW.tar'}
Now that you've created all the components, you can create an object of the ExpressionSet class to
store the expression values, sample information, chip feature annotations and description information
about this experiment.
TNExprSet =
ExpressionSet
Experiment Data: 42313 features, 13 samples
Element names: Expressions, DetectionConfidences
Sample Data:
Sample names: T2, T1, ...,N1 (13 total)
Sample variable names and meta information:
GSM: Sample GEO accession number
Type: Spermic type
Characteristics: Fertility characteristics
Feature Data:
Feature names: GI_10047089-S, GI_10047091-S, ...,hmm9988-S (42313 total)
Feature variable names and meta information:
Accession: Accession number of probe target
Symbol: Gene Symbol of probe target
Experiment Information: use 'exptInfo(obj)'
You can save an object of ExpressionSet class as a MAT file for further data analysis.
Load the experiment data saved from the previous section. You will use this data to find differentially
expressed genes between the teratozoospermia and normal samples.
load TNGeneExprSet
TNSamples = sampleNames(TNExprSet);
Tz = strncmp(TNSamples, 'T', 1);
Ns = strncmp(TNSamples, 'N', 1);
nTz = sum(Tz)
nNs = sum(Ns)
TNExprs = expressions(TNExprSet);
TzData = TNExprs(:,Tz);
NsData = TNExprs(:,Ns);
meanTzData = mean(TzData,2);
meanNsData = mean(NsData,2);
groupLabels = [TNSamples(Tz), TNSamples(Ns)];
nTz =
nNs =
Perform a permutation t-test using the mattest function to permute the columns of the gene
expression data matrix of TzData and NsData. Note: Depending on the sample size, it may not be
feasible to consider all possible permutations. Usually, a random subset of permutations is
considered in the case of a large sample size.
Use the MATLAB nchoosek function to determine the number of
all possible permutations of the samples in this example.
nPerms =
1287
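The count above can be reproduced with a short sketch. The group sizes of 8 and 5 used here are an assumption (the values of nTz and nNs are not shown above); they are chosen only because they sum to the 13 samples and give the reported count.

```matlab
% Hypothetical group sizes: an assumption consistent with 13 samples in total.
nTzAssumed = 8;
nNsAssumed = 5;

% Number of distinct ways to assign the samples to the two groups.
nPerms = nchoosek(nTzAssumed + nNsAssumed, nTzAssumed);
fprintf('%d distinct group assignments\n', nPerms);
```

Because nchoosek(n,k) equals nchoosek(n,n-k), choosing either group gives the same count.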
Use the PERMUTE option of the mattest function to compute the p-values of all the permutations.
You can also compute the differential score from the p-values using the following anonymous function
[1].
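The anonymous function itself is not reproduced here. The following is a minimal sketch that assumes the differential score is -10 times the sign of the mean difference times log10 of the p-value; this definition is an assumption, and only the call form diffscore(p, meanTest, meanRef) is taken from this example. Under this assumption, a score of 20 corresponds to a p-value of 0.01. The input values are made up for illustration.

```matlab
% Assumed definition of the differential score (not shown in the text above).
diffscore = @(p, avgTest, avgRef) -10 * sign(avgTest - avgRef) .* log10(p);

% Toy inputs: three genes with p-values and group means.
p      = [0.01; 0.01; 0.5];
avgTz  = [9; 7; 8];
avgNs  = [8; 8; 8.1];

scores = diffscore(p, avgTz, avgNs);   % positive: higher in Tz, negative: lower in Tz
disp(scores)
```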
Determine the number of genes considered to have a differential score greater than 20. Note: You
may get a different number of genes due to the permutation test outcome.
up =
3741
down =
3033
In multiple hypothesis testing, where the null hypotheses of thousands of genes are tested
simultaneously, each test has a specific false positive rate, or a false discovery rate (FDR) [2]. Estimate the
FDR and q-values for each test using the mafdr function.
figure;
[pFDR, qValues] = mafdr(pValues, 'showplot', true);
diffScoresFDRQ = diffscore(qValues, meanTzData, meanNsData);
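To see how a multiple-testing adjustment reshapes a list of p-values, the following sketch computes Benjamini-Hochberg adjusted p-values in base MATLAB. This is a different and simpler procedure than the Storey q-value estimate that mafdr uses by default, and the p-values are made up for illustration.

```matlab
% Toy p-values (hypothetical, for illustration only).
p = [0.001 0.008 0.039 0.041 0.042 0.06 0.074 0.205 0.212 0.216];
m = numel(p);

% Benjamini-Hochberg step-up adjustment.
[pSorted, order] = sort(p);
adj = pSorted .* m ./ (1:m);            % scale each p-value by m/rank
adj = min(1, cummin(adj, 'reverse'));   % enforce monotonicity from the largest rank down

% Return the adjusted values in the original order.
q = zeros(1, m);
q(order) = adj;

nSignificant = sum(q <= 0.05);
```

The adjusted values are never smaller than the raw p-values, so fewer tests pass a given cutoff after adjustment.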
Determine the number of genes with an absolute differential score of at least 20. Note: You may
get a different number of genes due to the permutation test and the bootstrap outcomes.
sum(abs(diffScoresFDRQ)>=20)
ans =
3122
Note: From the volcano plot UI, you can interactively change the p-value cutoff and fold-change limit,
and export differentially expressed genes.
nDiffGenes = numel(diffStruct.GeneLabels)
nDiffGenes =
451
Get the list of up-regulated genes for the Tz samples compared to the Ns samples.
nUpGenes =
223
Get the list of down-regulated genes for the Tz samples compared to the Ns samples.
nDownGenes =
228
You can get the subset of experiment data containing only the differentially expressed genes.
Principal component analysis (PCA) on differentially expressed genes shows linear separability of the
Tz samples from the Ns samples.
PCAScore = pca(TNDiffExprSet.expressions);
figure;
plot(PCAScore(:,1), PCAScore(:,6), 's', 'MarkerSize',10, 'MarkerFaceColor','g');
hold on
text(PCAScore(:,1)+0.02, PCAScore(:,6), TNDiffExprSet.sampleNames)
plot([0,0], [-0.5 0.5], '--r')
ax = gca;
ax.XTick = [];
ax.YTick = [];
ax.YTickLabel = [];
title('PCA Mapping')
xlabel('Principal Component 1')
ylabel('Principal Component 6')
You can also use the interactive tool created by the mapcaplot function to perform principal
component analysis.
mapcaplot((TNDiffExprSet.expressions)')
Perform unsupervised hierarchical clustering of the significant gene profiles from the Tz and Ns
groups using correlation as the distance metric to cluster the samples.
sampleDist = pdist(TNDiffExprSet.expressions','correlation');
sampleLink = linkage(sampleDist);
figure;
dendrogram(sampleLink, 'labels', TNDiffExprSet.sampleNames,'ColorThreshold',0.5)
ax = gca;
ax.YTick = [];
ax.Box = 'on';
title('Hierarchical Sample Clustering')
Use the clustergram function to create the hierarchical clustering of differentially expressed genes,
and apply the colormap redbluecmap to the clustergram.
cmap = redbluecmap(9);
cg = clustergram(TNDiffExprSet.expressions,'Colormap',cmap,'Standardize',2);
addTitle(cg,'Hierarchical Gene Clustering')
Clustering of the most differentially abundant transcripts clearly partitions teratozoospermic (Tz) and
normospermic (Ns) spermatozoal RNAs.
References
[1] Platts, A.E., et al., "Success and failure in human spermatogenesis as revealed by
teratozoospermic RNAs", Human Molecular Genetics, 16(7):763-73, 2007.
[2] Storey, J.D. and Tibshirani, R., "Statistical significance for genomewide studies", PNAS,
100(16):9440-5, 2003.
This example shows how to detect DNA copy number alterations in genome-wide array-based
comparative genomic hybridization (CGH) data.
Introduction
Copy number changes or alterations are a form of genetic variation in the human genome [1]. DNA
copy number alterations (CNAs) have been linked to the development and progression of cancer and
many other diseases.
DNA microarray based comparative genomic hybridization (CGH) is a technique that allows simultaneous
monitoring of the copy number of thousands of genes throughout the genome [2,3]. In this technique,
DNA fragments or "clones" from a test sample and a reference sample differentially labeled with dyes
(typically, Cy3 and Cy5) are hybridized to mapped DNA microarrays and imaged. Copy number
alterations are related to the Cy3 and Cy5 fluorescence intensity ratio of the targets hybridized to
each probe on a microarray. Clones with normalized test intensities significantly greater than
reference intensities indicate copy number gains in the test sample at those positions. Similarly,
significantly lower intensities in the test sample are signs of copy number loss. BAC (bacterial
artificial chromosome) clone based CGH arrays have a resolution on the order of one million base
pairs (1 Mb) [3]. Oligonucleotide and cDNA arrays provide a higher resolution of 50-100 kb [2].
Array CGH log2-based intensity ratios provide useful information about genome-wide CNAs. In
humans, the normal DNA copy number is two for all the autosomes. In an ideal situation, the normal
clones would correspond to a log2 ratio of zero. The log2 intensity ratio of a single copy loss would
be -1, and that of a single copy gain would be 0.58. The goal is to effectively identify locations of gains or
losses of DNA copy number.
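The ideal ratios quoted above follow directly from dividing the test copy number by the normal copy number of two. A minimal sketch (variable names are illustrative):

```matlab
% Ideal log2 ratios for 1, 2, 3 and 4 copies against a diploid reference.
copies = [1 2 3 4];
log2Ratio = log2(copies / 2);
disp([copies; log2Ratio])
```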
The data in this example is the Coriell cell line BAC array CGH data analyzed by Snijders et al. (2001).
The Coriell cell line data is widely regarded as a "gold standard" data set. You can download this data
of normalized log2-based intensity ratios and the supplemental table of known karyotypes from
https://www.nature.com/articles/ng754#supplementary-information. You will compare these
cytogenetically mapped alterations with the locations of gains or losses identified with various functions
of MATLAB and its toolboxes.
For this example, the Coriell cell line data are provided in a MAT file. The data file
coriell_baccgh.mat contains coriell_data, a structure containing the normalized average of
the log2-based test to reference intensity ratios of 15 fibroblast cell lines and their genomic positions.
The BAC targets are ordered by genome position beginning at 1p and ending at Xq.
load coriell_baccgh
coriell_data
coriell_data =
Detecting DNA Copy Number Alteration in Array-Based CGH Data
You can plot the genome wide log2-based test/reference intensity ratios of DNA clones. In this
example, you will display the log2 intensity ratios for cell line GM03576 for chromosomes 1 through
23.
sample =
To label chromosomes and draw the chromosome borders, you need to find the number of data points
in each chromosome.
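One way to obtain these counts is sketched below on a toy chromosome vector. The variable names chr_data_len, chr_nums, x_vbar, y_vbar and x_label match those used in the plotting code, but how the example computes them is not shown here, so this computation is an assumption. It relies on the data points being ordered by chromosome.

```matlab
% Toy chromosome assignment for nine data points, ordered by chromosome.
chromosome = [1 1 1 2 2 3 3 3 3]';
nChr = max(chromosome);

chr_data_len = accumarray(chromosome, 1)';   % number of data points per chromosome
chr_nums = cumsum(chr_data_len);             % index of the last data point of each chromosome

% Vertical border lines between chromosomes (one column per border).
x_vbar = repmat(chr_nums(1:end-1), 2, 1);
y_vbar = repmat([-2; 2], 1, nChr - 1);

% Label positions near the middle of each chromosome.
x_label = chr_nums - ceil(chr_data_len / 2);
```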
% Label the autosomes with their chromosome numbers, and the sex chromosome
% with X.
x_label = chr_nums - ceil(chr_data_len/2);
y_label = zeros(1, length(x_label)) - 1.6;
chr_labels = num2str((1:1:23)');
chr_labels = cellstr(chr_labels);
chr_labels{end} = 'X';
figure
hold on
h_ratio = plot(coriell_data.Log2Ratio(:,sample), '.');
h_vbar = line(x_vbar, y_vbar, 'color', [0.8 0.8 0.8]);
h_text = text(x_label, y_label, chr_labels,...
'fontsize', 8, 'HorizontalAlignment', 'center');
h_axis = h_ratio.Parent;
h_axis.XTick = [];
h_axis.YGrid = 'on';
h_axis.Box = 'on';
xlim([0 chr_nums(23)])
ylim([-1.5 1.5])
title(coriell_data.Sample{sample})
xlabel({'', 'Chromosome'})
ylabel('Log2(T/R)')
hold off
In the plot, borders between chromosomes are indicated by grey vertical bars. The plot indicates that
the GM03576 cell line is trisomic for chromosomes 2 and 21 [3].
You can also plot the profile of each chromosome in a genome. In this example, you will display the
log2 intensity ratios for each chromosome in cell line GM05296 individually.
hp = plot(chr_y, '.');
line([0, chr_data_len(c)], [0,0], 'color', 'r');
h_axis = hp.Parent;
h_axis.XTick = [];
h_axis.Box = 'on';
xlim([0 chr_data_len(c)])
ylim([-1.5 1.5])
xlabel(['chr ' chr_labels{c}], 'FontSize', 8)
end
sgtitle('GM05296');
The plot indicates the GM05296 cell line has a partial trisomy at chromosome 10 and a partial
monosomy at chromosome 11.
Observe that the gains and losses of copy number are discrete. These alterations occur in contiguous
regions of a chromosome that cover several clones to an entire chromosome.
The array-based CGH data can be quite noisy. Therefore, accurate identification of chromosome
regions of equal copy number that accounts for the noise in the data requires robust computational
methods. In the rest of this example, you will work with the data of chromosomes 9, 10 and 11 of the
GM05296 cell line.
A simple approach to perform high-level smoothing is to use a nonparametric filter. The function
mslowess implements a linear fit to samples within a shifting window. In this example, you use a SPAN
of 15 samples.
for iloop = 1:length(GM05296_Data)
idx = coriell_data.Chromosome == GM05296_Data(iloop).Chromosome;
chr_x = coriell_data.GenomicPosition(idx);
chr_y = coriell_data.Log2Ratio(idx, sample);
% Smoother
GM05296_Data(iloop).SmoothedRatio = ...
mslowess(GM05296_Data(iloop).GenomicPosition,...
GM05296_Data(iloop).Log2Ratio,...
'SPAN',15);
To better visualize and later validate the locations of copy number changes, you need cytoband
information. Read the human cytoband information from the hs_cytoBand.txt data file using the
cytobandread function. It returns a structure of human cytoband information [4].
hs_cytobands = cytobandread('hs_cytoBand.txt')
hs_cytobands =
You can inspect the data by plotting the log2-based ratios, the smoothed ratios and the derivative of
the smoothed ratios together. You can also display the centromere position of a chromosome in the
data plots. The magenta vertical bar marks the centromere of the chromosome.
Detecting Change-Points
Derivatives of the smoothed ratio that exceed a certain threshold usually indicate substantial changes,
which appear as large peaks, and provide estimates of the change-point indices. For this example, you will
select a threshold of 0.1.
thrd = 0.1;
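The thresholding idea can be sketched on a toy smoothed profile. The use of diff as the derivative and the convention of reporting the first index of the new segment are assumptions, since the example's own derivative code is not shown here.

```matlab
% Threshold on the derivative, as selected in this example.
thrdSketch = 0.1;

% Toy smoothed log2 ratios with one gained region (hypothetical values).
smoothed = [0 0.01 0 0.5 0.51 0.5 0.02 0]';

% First difference as a simple derivative estimate (assumption).
deriv = diff(smoothed);

% Indices where the jump exceeds the threshold; +1 gives the first point
% of the new segment.
changeIdx = find(abs(deriv) > thrdSketch) + 1;
```

On real smoothed data a single change can spread over several consecutive points, so neighboring indices above the threshold may need to be merged before they are used as segment boundaries.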
Gaussian Mixture (GM) or Expectation-Maximization (EM) clustering can provide fine adjustments to
the change-point indices [5]. The convergence to statistically optimal change-point indices can be
facilitated by surrounding each index with an equal-length set of adjacent indices. Thus each edge is
associated with left and right distributions. The GM clustering learns the maximum-likelihood
parameters of the two distributions. It then optimally adjusts the indices given the learned
parameters.
You can set the length for the set of adjacent positions distributed around the change-point indices.
For this example, you will select a length of 5. You can also inspect each change-point by plotting its
GM clusters. In this example, you will plot the GM clusters for the Chromosome 10 data.
len = 5;
for iloop = 1:length(GM05296_Data)
seg_num = numel(GM05296_Data(iloop).SegIndex) - 1;
if seg_num > 1
% Plot the data points in chromosome 10 data
if GM05296_Data(iloop).Chromosome == 10
figure
hold on;
plot(GM05296_Data(iloop).GenomicPosition,...
GM05296_Data(iloop).Log2Ratio, '.')
ylim([-0.5, 1])
xlabel('Genomic Position')
ylabel('Log2(T/R)')
title(sprintf('Chromosome %d - GM05296', ...
GM05296_Data(iloop).Chromosome))
end
segidx = GM05296_Data(iloop).SegIndex;
segidx_emadj = GM05296_Data(iloop).SegIndex;
% Select initial guess for the cluster index for each point.
gmpart = (gmy > (min(gmy) + range(gmy)/2)) + 1;
Once you determine the optimal change-point indices, you also need to determine if each segment
represents a statistically significant change in DNA copy number. You will perform permutation
t-tests to assess the significance of the segments identified. A segment includes all the data points from
one change-point to the next change-point or the chromosome end. In this example, you will perform
10,000 permutations of the data points on two consecutive segments along the chromosome at the
significance level of 0.01.
alpha = 0.01;
for iloop = 1:length(GM05296_Data)
seg_num = numel(GM05296_Data(iloop).SegIndex) - 1;
seg_index = GM05296_Data(iloop).SegIndex;
if seg_num > 1
ppvals = zeros(seg_num+1, 1);
for sloop = 1:seg_num-1
seg1idx = seg_index(sloop):seg_index(sloop+1)-1;
if sloop== seg_num-1
seg2idx = seg_index(sloop+1):(seg_index(sloop+2));
else
seg2idx = seg_index(sloop+1):(seg_index(sloop+2)-1);
end
seg1 = GM05296_Data(iloop).SmoothedRatio(seg1idx);
seg2 = GM05296_Data(iloop).SmoothedRatio(seg2idx);
n1 = numel(seg1);
n2 = numel(seg2);
N = n1+n2;
segs = [seg1;seg2];
% Observed difference between the two segment means
t_obs = mean(seg1) - mean(seg2);
% Permutation test
iter = 10000;
t_perm = zeros(iter,1);
for i = 1:iter
randseg = segs(randperm(N));
t_perm(i) = abs(mean(randseg(1:n1))-mean(randseg(n1+1:N)));
end
ppvals(sloop+1) = sum(t_perm >= abs(t_obs))/iter;
end
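The core of the permutation test can be tried in isolation on two toy segments. This is a self-contained sketch with made-up data; the variable names ending in Toy are hypothetical, and the test statistic is the absolute difference in segment means, as in the loop above.

```matlab
rng(0);                       % reproducible permutations
alphaToy = 0.01;              % significance level used in this example

% Two toy segments of smoothed ratios: one near 0, one near 0.5.
seg1Toy = linspace(-0.05, 0.05, 20)';
seg2Toy = 0.5 + linspace(-0.05, 0.05, 20)';

n1Toy = numel(seg1Toy);
NToy = n1Toy + numel(seg2Toy);
segsToy = [seg1Toy; seg2Toy];

% Observed difference in means.
tObsToy = mean(seg1Toy) - mean(seg2Toy);

% Shuffle the pooled points and recompute the difference each time.
iterToy = 10000;
tPermToy = zeros(iterToy, 1);
for k = 1:iterToy
    shuffled = segsToy(randperm(NToy));
    tPermToy(k) = abs(mean(shuffled(1:n1Toy)) - mean(shuffled(n1Toy+1:NToy)));
end

% Fraction of permutations at least as extreme as the observation.
pvalToy = sum(tPermToy >= abs(tObsToy)) / iterToy;
isSignificant = pvalToy < alphaToy;
```

A small p-value means the two segments are unlikely to share one mean, so the change-point between them is kept.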
Cytogenetic study indicates cell line GM05296 has a trisomy at 10q21-10q24 and a monosomy at
11p12-11p13 [3]. Plot the segment means of the three chromosomes over the original data with bold
red lines, and add the chromosome ideograms to the plots using the chromosomeplot function. Note
that the genomic positions in the Coriell cell line data set are in kilobase pairs. Therefore, you will
need to convert the cytoband data from bp to kb when adding the ideograms to the plot.
for iloop = 1:length(GM05296_Data)
figure;
seg_num = numel(GM05296_Data(iloop).SegIndex) - 1;
seg_mean = ones(seg_num,1);
chr_num = GM05296_Data(iloop).Chromosome;
for jloop = 2:seg_num+1
idx = GM05296_Data(iloop).SegIndex(jloop-1):GM05296_Data(iloop).SegIndex(jloop);
seg_mean(idx) = mean(GM05296_Data(iloop).Log2Ratio(idx));
line(GM05296_Data(iloop).GenomicPosition(idx), seg_mean(idx),...
'color', 'r', 'linewidth', 3);
end
line(GM05296_Data(iloop).GenomicPosition, GM05296_Data(iloop).Log2Ratio,...
'linestyle', 'none', 'Marker', '.');
line([acen_pos(chr_num), acen_pos(chr_num)], [-1, 1],...
'linewidth', 2,...
'color', 'm',...
'linestyle', '-.');
ylabel('Log2(T/R)')
ax = gca;
ax.Box = 'on';
ylim([-1, 1])
title(sprintf('Chromosome %d - GM05296', chr_num));
chromosomeplot(hs_cytobands, chr_num, 'addtoplot', gca, 'unit', 2)
end
As shown in the plots, no copy number alterations were found on chromosome 9, a copy
number gain spans from 10q21 to 10q24, and a copy number loss region spans from 11p12 to 11p13. The
CNAs found match the known results in cell line GM05296 determined by cytogenetic analysis.
You can also display the CNAs of the GM05296 cell line aligned to the chromosome ideogram summary
view using the chromosomeplot function. Determine the genomic positions for the CNAs on
chromosomes 10 and 11.
chr10_idx = GM05296_Data(2).SegIndex(2):GM05296_Data(2).SegIndex(3)-1;
chr10_cna_start = GM05296_Data(2).GenomicPosition(chr10_idx(1))*1000;
chr10_cna_end = GM05296_Data(2).GenomicPosition(chr10_idx(end))*1000;
chr11_idx = GM05296_Data(3).SegIndex(2):GM05296_Data(3).SegIndex(3)-1;
chr11_cna_start = GM05296_Data(3).GenomicPosition(chr11_idx(1))*1000;
chr11_cna_end = GM05296_Data(3).GenomicPosition(chr11_idx(end))*1000;
Create a structure containing the copy number alteration data from the GM05296 cell line data
according to the input requirements of the chromosomeplot function.
cna_struct =
This example shows how MATLAB and its toolboxes provide tools for the analysis and visualization of
copy-number alterations in array-based CGH data.
References
[1] Redon, R., et al., "Global variation in copy number in the human genome", Nature,
444(7118):444-54, 2006.
[2] Pinkel, D., et al., "High resolution analysis of DNA copy number variations using comparative
genomic hybridization to microarrays", Nature Genetics, 20(2):207-11, 1998.
[3] Snijders, A.M., et al., "Assembly of microarrays for genome-wide measurement of DNA copy
number", Nature Genetics, 29(3):263-4, 2001.
[5] Myers, C.L., et al., "Accurate detection of aneuploidies in array CGH and gene expression
microarray data", Bioinformatics, 20(18):3533-43, 2004.
This example shows how to use a Bayesian hidden Markov model (HMM) technique to identify copy
number alteration in array-based comparative genomic hybridization (CGH) data.
Introduction
Array-based CGH is a high-throughput technique to measure DNA copy number change across the
genome. The DNA fragments or "clones" of test and reference samples are hybridized to mapped
array fragments. Log2 intensity ratios of test to reference provide useful information about
genome-wide profiles in copy number. In an ideal situation, the log2 ratio of normal (copy-neutral)
clones is log2(2/2) = 0, that of single copy losses is log2(1/2) = -1, and that of single copy gains is
log2(3/2) = 0.58. Multiple copy gains or amplifications would have values of log2(4/2), log2(5/2), and
so on. Loss of both copies, or a deletion, would correspond to a value of -Inf. In real applications, even after accounting for
measurement error, the log2 ratios differ considerably from the theoretical values. The ratios
typically shrink towards zero. One main factor is contamination of the tumor samples with normal
cells. There is also a dependence between the intensity ratios of neighboring clones. These issues
necessitate the use of efficient statistical algorithms for characterizing the genomic profiles.
Bayesian HMM
Guha et al. (2006) proposed a Bayesian statistical approach based on a hidden Markov model
(HMM) for analyzing array CGH data. The hidden Markov model accounts for the dependence
between neighboring clones. The intensity ratios are generated by hidden copy number states.
Bayesian learning is used to identify genome-wide changes in copy number from the data. Posterior
inferences are made about the copy number gains and losses.
In this Bayesian HMM algorithm, there are four states, defined as copy number loss state (1), copy
number neutral state (2), single copy gain state (3), and amplification (multiple gain) state (4). A copy
number state is associated with each clone. The normalized log2 ratios are assumed to be distributed
as
Analyzing Array-Based CGH Data Using Bayesian Hidden Markov Modeling
sample for the chromosome parameters. The starting values of the parameters are generated from
the priors. The generated copy number states represent draws from the marginal posterior of
interest. For each MCMC draw, the generated states are inspected and classified as focal aberrations,
transition points, amplifications, outliers and whole chromosomal changes.
In this example, you will apply the Bayesian HMM algorithm to analyze the array CGH profiles of
some pancreatic cancer samples [2].
The data in this example is the array CGH profiles of 24 pancreatic adenocarcinoma cell lines and 13
primary tumor specimens from Aguirre et al. (2004). Labeled DNA fragments were hybridized to
Agilent® human cDNA microarrays containing 14,160 cDNA clones. About 9,420 clones have unique
map positions with a median interval between mapped elements of 100.1 kb. More details of the data
and experiment can be found in [2]. For convenience, the data has already been normalized and the
log2 based intensity ratios are provided by the MAT file pancrea_oligocgh.mat.
You will apply the Bayesian HMM algorithm to analyze chromosome 12 of sample 6 of the pancreatic
adenocarcinoma data, and compare the results with the segments found by the circular binary
segmentation (CBS) algorithm of Olshen et al. (2004).
Load the MAT file containing the log2 intensity ratios and mapped genomic positions of the 37
samples.
load pancrea_oligocgh
pancrea_data
pancrea_data =
sampleIndex = 6;
chromID = 12;
sample = pancrea_data.Sample{sampleIndex}
sample =
'PA.C.Dan.G'
Load and plot the log2 ratio data of chromosome 12 from sample PA.C.Dan.G.
Y = pancrea_data.Log2Ratio(idx, sampleIndex);
N = numel(Y)
N =
437
You can start the analysis by performing chromosomal segmentation using the CBS algorithm [3],
which is implemented in the cghcbs function. The process will take several seconds. You can view
the plot of the segment means over the original data by specifying the SHOWPLOT parameter. Note:
You can type doc cghcbs for more details on this function.
PS = cghcbs(pancrea_data, 'SampleInd', sampleIndex, ...
'Chromosome', chromID, 'ShowPlot', chromID);
ylim(gca, ylims)
As shown in the figure, the CBS procedure declared the set of high intensity ratios as two separate
segments. The CBS procedure also found a region with copy number losses.
Initializing Parameters
The Bayesian HMM approach uses a Metropolis-within-Gibbs algorithm to generate posterior samples
of the parameters [1]. The model parameters are grouped into four blocks. The algorithm iteratively
generates each of the four blocks conditional on the remaining blocks and the data.
To analyze the data with the Bayesian HMM algorithm, you need to initialize the parameters. More
details on prior parameters can be found in references [1] and [4].
Initialize the state of the random number generator to ensure that the figures generated by these
commands match the figures in the HTML version of this example.
rng('default');
NS = 4;
NMC = 100;
Determine the hyperparameters of the prior distributions for the four states.
Set the parameter epsilon, which determines the constraints on the means.
eps = 0.1;
Guha et al. (2006) assume that the inverses of the prior error variances (sigma^2) follow gamma
distributions, with upper bounds of 0.41 on sigma for states 1, 2 and 3. Set the parameters and bounds
for the gamma distributions for each state.
sg_alpha = [1 1 1 1];
sg_beta = [1, 1, 1, 1];
sg_bounds = [0.41 0.41 0.41 1];
Define a variable states to store the copy number state sequences of the clones for each MCMC
iteration.
Define a variable st_counts to hold the state transition counts for each copy number state.
iloop = 1;
Determine sigmas for the four states by sampling from the gamma distribution with prior
parameters alpha and beta.
Determine means for the four states by sampling from the truncated normal distribution between the
lower and upper bounds of the means. Note: The lower bound of the fourth state is determined by the
third state.
Assume independent Dirichlet priors for the rows of the stochastic 4x4 transition probability matrix
[1]. Generate the stochastic prior transition matrix A from the Dirichlet distributions.
a = ones(NS, NS);
A = acghhmmsample('dirichlet', a, NS);
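The internals of the acghhmmsample helper are not shown here. As a stand-in for the special case used above, where every Dirichlet parameter is 1, the following sketch draws each row of a transition matrix by normalizing independent Exp(1) variables, which are Gamma(1,1) draws. The names NSsketch and Asketch are hypothetical.

```matlab
rng(0);                          % reproducible draw
NSsketch = 4;                    % number of copy number states

% Exp(1) draws by inversion; rand returns values in the open interval (0,1).
G = -log(rand(NSsketch, NSsketch));

% Normalizing each row gives a Dirichlet(1,1,1,1) sample per row.
Asketch = G ./ sum(G, 2);
disp(Asketch)
```

Each row is nonnegative and sums to one, so Asketch is a valid stochastic matrix. For Dirichlet parameters other than 1, general gamma draws would be needed instead of the exponential shortcut.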
The transition matrix has a unique stationary distribution. The stationary distribution PI is a left
eigenvector of the transition matrix associated with the eigenvalue 1.
Pi = PI(A, NS);
B = zeros(NS, N);
for i = 1:NS
B(i,:) = normpdf(Y, mus(i,iloop), sigmas(i,iloop));
end
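The stationary distribution described above can be computed in base MATLAB from the eigen-decomposition of the transposed transition matrix. This is a minimal sketch on a small made-up matrix, not the helper used in this example; Atoy and piToy are hypothetical names.

```matlab
% Toy 3-state transition matrix (rows sum to 1).
Atoy = [0.90 0.10 0.00;
        0.05 0.90 0.05;
        0.00 0.10 0.90];

% Left eigenvectors of Atoy are right eigenvectors of its transpose.
[V, D] = eig(Atoy.');

% Pick the eigenvector whose eigenvalue is closest to 1.
[~, k] = min(abs(diag(D) - 1));
piToy = real(V(:, k)).';

% Rescale so the entries sum to 1 (this also fixes the sign).
piToy = piToy / sum(piToy);
disp(piToy)
```

Multiplying piToy by Atoy returns piToy, which is the defining property of a stationary distribution.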
Decode initial hidden states of the clones using a stochastic forward-backward algorithm [4].
For each MCMC iteration, the four blocks of parameters are generated as follows [1]: update block
B1 using a Metropolis-Hastings step to generate the transition matrix, update block B2 by generating the copy
number states using a stochastic forward propagate, backward sampling algorithm, update block B3
by computing the mus, and update block B4 to generate sigmas.
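The forward propagate, backward sampling step for block B2 can be sketched in base MATLAB as follows. This is an illustration on a made-up two-state problem, not the acghhmmfb helper used in this example, whose internals are not shown here. All names ending in Toy are hypothetical.

```matlab
rng(0);
Ytoy   = [-1.0 -1.1 0.6 0.5];        % toy log2 ratios
muToy  = [-1; 0.58];                 % state means: loss, single copy gain
sdToy  = [0.1; 0.1];                 % state standard deviations
Atoy   = [0.9 0.1; 0.1 0.9];         % transition matrix
piToy  = [0.5; 0.5];                 % initial state distribution
K = numel(muToy);  T = numel(Ytoy);

% Emission probabilities: normal density of each observation under each state.
Btoy = zeros(K, T);
for k = 1:K
    Btoy(k, :) = exp(-(Ytoy - muToy(k)).^2 / (2*sdToy(k)^2)) / (sdToy(k)*sqrt(2*pi));
end

% Forward pass: filtered state probabilities, normalized at each step.
F = zeros(K, T);
F(:, 1) = piToy .* Btoy(:, 1);
F(:, 1) = F(:, 1) / sum(F(:, 1));
for t = 2:T
    F(:, t) = (Atoy.' * F(:, t-1)) .* Btoy(:, t);
    F(:, t) = F(:, t) / sum(F(:, t));
end

% Backward pass: sample the last state, then each earlier state given its successor.
sToy = zeros(1, T);
c = cumsum(F(:, T));  c(end) = 1;
sToy(T) = find(rand <= c, 1);
for t = T-1:-1:1
    w = F(:, t) .* Atoy(:, sToy(t+1));
    c = cumsum(w / sum(w));  c(end) = 1;
    sToy(t) = find(rand <= c, 1);
end
disp(sToy)
```

Because the toy observations sit very close to the state means, the sampled path here is effectively deterministic. With noisier data, repeated calls give different paths, and those draws are what the MCMC iterations accumulate.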
% Updating block B1
% Generate the transition matrix from the Dirichlet distributions
C = acghhmmsample('dirichlet', st_counts + 1, NS);
% Updating block B2
% Generate copy number states using Forward propagate, backward sampling [4].
states(:, iloop) = acghhmmfb(Pi, A, B);
figure;
for j = 1:NS
subplot(2,2,j)
ksdensity(mus(j,:));
title(sprintf('State %d', j))
end
sgtitle('Distribution of Mu of States');
hold off;
figure;
for j = 1:NS
subplot(2,2,j)
ksdensity(sigmas(j,:));
title(sprintf('State %d', j))
end
sgtitle('Distribution of Sigma of States');
hold off;
Posterior Inference
Draw a state label for each clone from the MCMC sampling and compute the posterior probabilities
of each state.
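A compact way to express this step is sketched below on a toy matrix of sampled states, with one row per clone and one column per MCMC draw. The names statesToy, stateProbToy and cloneStatesToy are hypothetical, and ties are resolved by taking the lowest-numbered state, which is an assumption rather than the rule used in this example.

```matlab
% Toy MCMC output: 3 clones (rows) by 4 draws (columns).
statesToy = [1 1 1 2;
             2 2 2 2;
             3 3 4 3];
NStoy = 4;
[nClones, nDraws] = size(statesToy);

% Posterior probability of each state for each clone.
stateProbToy = zeros(nClones, NStoy);
for k = 1:NStoy
    stateProbToy(:, k) = sum(statesToy == k, 2) / nDraws;
end

% Label each clone with its most probable state.
[~, cloneStatesToy] = max(stateProbToy, [], 2);
disp(stateProbToy)
```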
end
clone_states = clone_states';
Plot the state label for each clone on chromosome 12 of sample PA.C.Dan.G.
figure;
leg = zeros(1,4);
for i = 1:N
if clone_states(i) == 1
leg(1) = plot(i,Y(i),'v', 'MarkerFaceColor', [1 0.2 0.2],...
'MarkerEdgeColor', 'none');
elseif clone_states(i) == 2
leg(2) = plot(i,Y(i),'o', 'Color', [0.4 0.4 0.4]);
elseif clone_states(i) == 3
leg(3) = plot(i,Y(i),'^', 'MarkerFaceColor', [0.2 1 0.2],...
'MarkerEdgeColor', 'none');
elseif clone_states(i) == 4
leg(4) = plot(i, Y(i), '^', 'MarkerFaceColor', [0.2 0.2 1],...
'MarkerEdgeColor', 'none');
end
hold on;
end
ylim(gca, ylims)
legend(leg, 'State 1', 'State 2','State 3','State 4')
xlabel('Index')
ylabel('Log2(ratio)')
title('State Label')
hold off
For each MCMC draw, the generated states can be classified as focal aberrations, transition points,
amplifications, outliers and whole chromosomal changes [1]. In this example, you will find the high-
level amplifications, transition points and outliers on chromosome 12 of sample PA.C.Dan.G.
A clone with state = 4 is considered a high-level amplification [1]. Find high-level amplifications.
high_lvl_amp_idx = find(clone_states == 4);
A transition point is associated with large-scale regions of gains and losses and is declared when the
width of the altered region exceeds 5 megabase pairs [1]. Find transition points.
region_lim = 5e6;
focalabr_idx=[1;find(diff(clone_states)~=0);N];
istranspoint = false(length(focalabr_idx), 1);
for i = 1:length(focalabr_idx)-1
region_x = X(focalabr_idx(i+1)) - X(focalabr_idx(i));
istranspoint(i+1) = region_x > region_lim;
end
trans_idx = focalabr_idx(istranspoint);
% Remove adjacent trans_idx that have the same states.
hasadjacentstate = [diff(clone_states(trans_idx))==0; true];
trans_idx = trans_idx(~hasadjacentstate)
focalabr_idx = focalabr_idx(~istranspoint);
focalabr_idx = focalabr_idx(2:end-1);
trans_idx =
107
135
323
An outlier for gains is a focal aberration whose z-score is greater than 2, while an outlier for
losses has a z-score less than -2 [1].
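The z-score rule can be illustrated with a few made-up values. The candidate indices, the state mean and the state standard deviation below are hypothetical; in this example they come from the focal aberrations and the posterior samples.

```matlab
% Toy log2 ratios and candidate focal aberrations (hypothetical values).
Ytoy = [0.1 0.9 0.55 0.62 1.4]';
focalToy = [2; 3; 5];

% Hypothetical mean and standard deviation of the single copy gain state.
muGain = 0.58;
sigmaGain = 0.2;

% Keep the candidates whose z-score exceeds 2.
z = (Ytoy(focalToy) - muGain) / sigmaGain;
outlierGainToy = focalToy(z > 2);
disp(outlierGainToy)
```

The rule for losses is the mirror image: use the loss state's mean and standard deviation and keep candidates with z less than -2.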
[F,Xi] = ksdensity(sigmas(1,:));
[dummy, idx] = max(F);
sigma_1 = Xi(idx);
outlier_loss_idx = outlier_loss_idx((Y(outlier_loss_idx) - mu_1)/sigma_1 < -2)
end
outlier_loss_idx =
[F,Xi] = ksdensity(sigmas(3,:));
[dummy, idx] = max(F);
sigma_1 = Xi(idx);
outlier_gain_idx = outlier_gain_idx((Y( outlier_gain_idx) - mu_1)/sigma_1 > 2)
end
outlier_gain_idx =
Add the classified labels to the intensity ratio plot of chromosome 12 of sample PA.C.Dan.G. Plot the
segment means from the CBS procedure for comparison.
figure;
hl1 = plot(X, Y, '.', 'color', [0.4 0.4 0.4]);
hold on;
if ~isempty(high_lvl_amp_idx)
hl2 = line(X(high_lvl_amp_idx), Y(high_lvl_amp_idx),...
'LineStyle', 'none',...
'Marker', '^',...
'MarkerFaceColor', [0.2 0.2 1],...
'MarkerEdgeColor', 'none');
end
if ~isempty(trans_idx)
for i = 1:numel(trans_idx)
hl3 = line(ones(1,2)*X(trans_idx(i)), [-3.5, 3.5],...
'LineStyle', '--',...
'Color', [1 0.6 0.2]);
end
end
if ~isempty(outlier_gain_idx)
line(X(outlier_gain_idx), Y(outlier_gain_idx),...
'LineStyle', 'none',...
'Marker', 'v',...
'MarkerFaceColor', [1 0 0],...
'MarkerEdgeColor', 'none');
end
if ~isempty(outlier_loss_idx)
hl4 = line(X(outlier_loss_idx), Y(outlier_loss_idx),...
'LineStyle', 'none',...
'Marker', 'v',...
'MarkerFaceColor', [1 0 0],...
'MarkerEdgeColor', 'none');
end
The Bayesian HMM algorithm found 3 transition points, indicated by the broken vertical lines in the
plot. The Bayesian HMM algorithm identified two high-level amplified regions, marked by blue
up-triangles in the plot. The two high-level amplified regions correspond to the two minimal common
regions (MCRs) [2] on chromosome 12, associated with copy number gains as explained by Aguirre et
al. (2004). The Bayesian HMM declared the first set of high intensity ratios as a single region of
high-level amplification. In comparison, the CBS procedure failed to detect the second MCR and
segmented the first MCR into two regions. No outlier was detected in this example.
References
[1] Guha, S., Li, Y. and Neuberg, D., "Bayesian hidden Markov modeling of array CGH data", Journal
of the American Statistical Association, 103(482):485-497, 2008.
[2] Aguirre, A.J., et al., "High-resolution characterization of the pancreatic adenocarcinoma genome",
PNAS, 101(24):9067-72, 2004.
[3] Olshen, A.B., et al., "Circular binary segmentation for the analysis of array-based DNA copy
number data", Biostatistics, 5(4):557-72, 2004.
[4] Shah, S.P., et al., "Integrating copy number polymorphisms into array CGH analysis using a robust
HMM", Bioinformatics, 22(14):e431-e439, 2006.
Visualizing Microarray Data
This example shows various ways to explore and visualize raw microarray data. The example uses
microarray data from a study of gene expression in mouse brains [1].
Brown, V.M., et al. [1] used microarrays to explore the gene expression patterns in the brain of a
mouse in which a pharmacological model of Parkinson's disease (PD) was induced using
methamphetamine. The raw data for this experiment is available from the Gene Expression Omnibus
website under the accession number GSE30 [1].
The file mouse_h3pd.gpr contains the data for one of the microarrays used in the study, specifically
from a sample collected from voxel H3 of the brain in a Parkinson's Disease (PD) model mouse. The
file uses the GenePix® GPR file format. The voxel sample was labeled with Cy3 (green) and the
control (RNA from a total, not voxelated, normal mouse brain) was labeled with Cy5.
GPR formatted files provide a large amount of information about the array, including the mean,
median, and standard deviation of the foreground and background intensities of each spot at the
635 nm wavelength (the red, Cy5 channel) and the 532 nm wavelength (the green, Cy3 channel).
The command gprread reads the data from the file into a structure.
pd = gprread('mouse_h3pd.gpr')
pd =
You can access the fields of a structure using dot notation. For example, access the first ten column
names.
pd.ColumnNames(1:10)
ans =
{'X' }
{'Y' }
{'Dia.' }
{'F635 Median' }
{'F635 Mean' }
{'F635 SD' }
{'B635 Median' }
{'B635 Mean' }
{'B635 SD' }
{'% > B635+1SD'}
pd.Names(1:10)
ans =
{'AA467053'}
{'AA388323'}
{'AA387625'}
{'AA474342'}
{'Myo1b' }
{'AA473123'}
{'AA387579'}
{'AA387314'}
{'AA467571'}
{0x0 char }
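The column names can also be used to pull a single column out of the structure. The following is a minimal sketch, assuming that the numeric table is held in the Data field and that its columns line up with ColumnNames; the variable names col and f635 are introduced here for illustration.

```matlab
pd = gprread('mouse_h3pd.gpr');               % example file used in this section
col = strcmp(pd.ColumnNames, 'F635 Median');  % logical mask over the column names
f635 = pd.Data(:, col);                       % one value per spot on the array
```

This is the kind of lookup that magetfield performs for you, so in practice the helper function is usually more convenient.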
The maimage command can take the microarray data structure and create a pseudocolor image of
the data arranged in the same order as the spots on the array, i.e., a spatial plot of the microarray.
The "F635 Median" field shows the median pixel values for the foreground of the red (Cy5) channel.
figure
maimage(pd,'F635 Median','title',{'Parkinson''s Model','Foreground Median Pixels','Red Channel'})
The "F532 Median" field corresponds to the foreground of the green (Cy3) channel.
figure
maimage(pd,'F532 Median','title',{'Parkinson''s Model','Foreground Median Pixels','Green Channel'})
The "B635 Median" field shows the median values for the background of the red channel. Notice the
very high background levels down the right side of the array.
figure
maimage(pd,'B635 Median','title',{'Parkinson''s Model','Background Median Pixels','Red Channel'})
The "B532 Median" field shows the median values for the background of the green channel.
figure
maimage(pd,'B532 Median','title',{'Parkinson''s Model','Background Median Pixels','Green Channel'})
You can now consider the data obtained for the same brain voxel in an untreated control mouse. In
this case, the voxel sample was labeled with Cy3, and the control (RNA from a total, not voxelated
brain) was labeled with Cy5.
wt = gprread('mouse_h3wt.gpr')
wt =
Use maimage to show pseudocolor images of the foreground and background corresponding to the
untreated mouse. The subplot command can be used to combine the plots.
figure
subplot(2,2,1);
maimage(wt,'F635 Median','title',{'Foreground','(Red)'})
subplot(2,2,2);
maimage(wt,'F532 Median','title',{'Foreground','(Green)'})
subplot(2,2,3);
maimage(wt,'B635 Median','title',{'Background','(Red)'})
subplot(2,2,4);
maimage(wt,'B532 Median','title',{'Background','(Green)'})
If you look at the scale for the background images, you will notice that the background levels are
much higher than those for the PD mouse and there appears to be something nonrandom affecting
the background of the Cy3 channel of this slide. Changing the colormap can sometimes provide more
insight into what is going on in pseudocolor plots. For more control over the color, try the
colormapeditor function. You can also right-click on the colorbar to bring up various options for
modifying the colormap of the plot including interactive colormap shifting.
colormap hot
The maimage command is a simple way to quickly create pseudocolor images of microarray data.
However, sometimes it is convenient to create customizable plots using the imagesc command, as
shown below.
Use magetfield to extract data for the B532 median field and the Indices field to index into the
Data. You can bound the intensities of the background plot to give more contrast in the image.
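The statements that create b532Data and maskedData are not shown in this listing. A minimal sketch of what they might look like follows; the bounds of 500 and 2000 are assumptions chosen for illustration, not values confirmed by this text.

```matlab
wt = gprread('mouse_h3wt.gpr');            % not needed if wt is already loaded
b532Data = magetfield(wt, 'B532 Median');  % background medians, green channel

lowerBound = 500;    % assumed lower limit (illustrative)
upperBound = 2000;   % assumed upper limit (illustrative)

maskedData = b532Data;
maskedData(b532Data < lowerBound) = lowerBound;  % clip very low values
maskedData(b532Data > upperBound) = upperBound;  % clip very high values
```

Clipping both tails narrows the color range, so variation in the middle of the distribution becomes easier to see.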
figure
subplot(1,2,1);
imagesc(b532Data(wt.Indices))
axis image
colorbar
title('B532, WT')
subplot(1,2,2);
imagesc(maskedData(wt.Indices))
axis image
colorbar
title('Enhanced B532, WT')
The maboxplot function can be used to look at the distribution of data in each of the blocks.
figure
subplot(2,1,1)
maboxplot(pd,'F532 Median','title','Parkinson''s Disease Model Mouse')
subplot(2,1,2)
maboxplot(pd,'B532 Median','title','Parkinson''s Disease Model Mouse')
figure
subplot(2,1,1)
maboxplot(wt,'F532 Median','title','Untreated Mouse')
subplot(2,1,2)
maboxplot(wt,'B532 Median','title','Untreated Mouse')
From the box plots you can clearly see the spatial effects in the background intensities. Blocks
1, 3, 5, and 7 are on the left side of the arrays, and blocks 2, 4, 6, and 8 are on the right
side.
There are two columns in the microarray data structure labeled "F635 Median - B635" and "F532
Median - B532". These columns are the differences between the median foreground and the median
background for the 635 nm channel and 532 nm channel respectively. These give a measure of the
actual expression levels. The spatial effect is less noticeable in these plots.
figure
subplot(2,1,1)
maboxplot(pd,'F635 Median - B635','title','Parkinson''s Disease Model Mouse')
subplot(2,1,2)
maboxplot(pd,'F532 Median - B532','title','Parkinson''s Disease Model Mouse')
figure
subplot(2,1,1)
maboxplot(wt,'F635 Median - B635','title','Untreated Mouse')
subplot(2,1,2)
maboxplot(wt,'F532 Median - B532','title','Untreated Mouse')
Rather than work with the data in the larger structure, it is often easier to extract the data into
separate variables.
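The extraction step is not shown in this listing. A minimal sketch, assuming the background-corrected columns described earlier are the ones of interest:

```matlab
pd = gprread('mouse_h3pd.gpr');                  % not needed if pd is already loaded
cy5Data = magetfield(pd, 'F635 Median - B635');  % red channel (control)
cy3Data = magetfield(pd, 'F532 Median - B532');  % green channel (voxel H3)
```

Both vectors hold one value per spot, in the same order as pd.Names.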
A simple way to compare the two channels is with a loglog plot. The function maloglog is used to
do this. Points that are above the diagonal in this plot correspond to genes that have higher
expression levels in the H3 voxel than in the brain as a whole.
figure
maloglog(cy5Data,cy3Data)
title('Loglog Scatter Plot of PD Model');
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel H3)');
Notice how the loglog function gives some warnings about negative and zero elements. This is
because some of the values in the 'F635 Median - B635' and 'F532 Median - B532' columns are zero
or less than zero. Spots where this happened might be bad spots or spots that failed to hybridize.
Similarly, spots with positive, but very small, differences between foreground and background are
also considered bad spots. These warnings can be disabled using the warning command.
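The warning commands themselves are not shown in this listing. The sketch below illustrates the general save, disable, and restore pattern using base MATLAB only; the data are hypothetical, and loglog stands in for maloglog so that the snippet runs on its own. Turning off all warnings is a blunt choice made for brevity. A narrower alternative is to pass a specific warning identifier (for example, one obtained from lastwarn) to warning('off', ...).

```matlab
x = [-5 0 10 100 1000];   % hypothetical intensities, including bad values
y = [ 3 0 20 150  900];

warnState = warning;      % save the current warning settings
warning('off', 'all');    % silence every warning (blunt, for illustration)

figure('Visible', 'off');
loglog(x, y, '.');        % nonpositive values cannot be shown on log axes
drawnow;                  % render before restoring the warning settings

warning(warnState);       % restore the saved settings
```

Restoring the saved state afterward keeps the change from hiding unrelated warnings later in the session.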
figure
maloglog(cy5Data,cy3Data)
title('Loglog Scatter Plot of PD Model');
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel H3)');
An alternative to simply ignoring or disabling the warnings is to remove the bad spots from the data
set. This can be done by finding points where either the red or the green channel has a value less
than or equal to a threshold value, for example 10.
threshold = 10;
badPoints = (cy5Data <= threshold) | (cy3Data <= threshold);
You can then remove these points and redraw the loglog plot.
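The removal and replotting commands are not shown in this listing. A minimal sketch, repeating the earlier steps so that it stands alone:

```matlab
pd = gprread('mouse_h3pd.gpr');
cy5Data = magetfield(pd, 'F635 Median - B635');
cy3Data = magetfield(pd, 'F532 Median - B532');

threshold = 10;
badPoints = (cy5Data <= threshold) | (cy3Data <= threshold);

cy5Data(badPoints) = [];   % drop bad spots from the red channel
cy3Data(badPoints) = [];   % drop the same spots from the green channel

figure
maloglog(cy5Data, cy3Data)
title('Refined Loglog Scatter Plot of PD Model');
```

Because the vectors are now shorter than pd.Names, any labels passed to later plots must be indexed with ~badPoints so that they still match the data.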
The distribution plot can be annotated by labeling the various points with the corresponding genes.
figure
maloglog(cy5Data,cy3Data,'labels',pd.Names(~badPoints),'factorlines',2)
title('Loglog Scatter Plot of PD Model');
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel H3)');
Try using the mouse to click on some of the outlier points. You will see the gene name associated with
the point. Most of the outliers are below the y = x line. In fact most of the points are below this line.
Ideally the points should be evenly distributed on either side of this line. In order for this to happen,
the points need to be normalized. You can use the manorm function to perform global mean
normalization.
normcy5 = manorm(cy5Data);
normcy3 = manorm(cy3Data);
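As a rough illustration of the idea, the sketch below scales a hypothetical intensity vector by its mean. It assumes that global mean normalization amounts to dividing each column by its mean; treat it as a conceptual stand-in rather than the implementation of manorm.

```matlab
x = [10; 20; 30; 60];   % hypothetical intensities for one channel
xNorm = x ./ mean(x);   % after scaling, the channel has a mean of 1
```

When both channels are scaled this way they share the same mean, which is why the points move toward the y = x line.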
If you plot the normalized data you will see that the points are more evenly distributed about the y =
x line.
figure
maloglog(normcy5,normcy3,'labels',pd.Names(~badPoints),'factorlines',2)
title('Normalized Loglog Scatter Plot of PD Model');
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel H3)');
You will recall that the background of the chips was not uniform. You can use print-tip (block)
normalization to normalize each block separately. The function manorm will perform block
normalization automatically if block information is available in the microarray data structure.
Instead of removing negative points or points below the threshold, you can set them to NaN. This does
not change the size or shape of the data, but NaN points will not be displayed on plots.
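The commands that create bn_cy5Data and bn_cy3Data are not shown in this listing. A minimal sketch, assuming that manorm is called with the structure and a field name so that it can use the block information, and assuming a cutoff of zero for the values replaced by NaN:

```matlab
pd = gprread('mouse_h3pd.gpr');
bn_cy5Data = manorm(pd, 'F635 Median - B635');  % block-normalized red channel
bn_cy3Data = manorm(pd, 'F532 Median - B532');  % block-normalized green channel

bn_cy5Data(bn_cy5Data <= 0) = NaN;  % hide nonpositive values (assumed cutoff)
bn_cy3Data(bn_cy3Data <= 0) = NaN;
```

Because no elements are deleted, the full pd.Names list can be passed as labels without further indexing.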
figure
maloglog(bn_cy5Data,bn_cy3Data,'labels',pd.Names,'factorlines',2)
title('Refined, Normalized Loglog Scatter Plot of PD Model');
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel H3)');
The function mairplot is used to create an Intensity vs. Ratio plot for the normalized data. If the
name-value pair 'PlotOnly' is set to false, you can explore the data interactively, for example by
selecting points to see the names of the associated genes, normalizing the data, highlighting gene
names in the up-regulated or down-regulated lists, or changing the values of the factor lines.
mairplot(normcy5,normcy3,'labels',pd.Names(~badPoints),'PlotOnly',true,...
'title','Intensity vs. Ratio of PD Model');
You can use the Normalize option of mairplot to perform Lowess normalization on the data.
mairplot(normcy5,normcy3,'labels',pd.Names(~badPoints),'PlotOnly',true,...
'Normalize',true,'title', 'Intensity vs. Ratio of PD Model (Normalized)');
References
[1] Brown, V.M., et al., "Multiplex three dimensional brain gene expression mapping in a mouse model
of Parkinson's disease", Genome Research, 12(6):868-84, 2002.
Gene Expression Profile Analysis
This example shows a number of ways to look for patterns in gene expression profiles.
This example uses data from the microarray study of gene expression in yeast published by DeRisi et
al. (1997) [1]. The authors used DNA microarrays to study temporal gene expression of almost all
genes in Saccharomyces cerevisiae during the metabolic shift from fermentation to respiration.
Expression levels were measured at seven time points during the diauxic shift. The full data set can
be downloaded from the Gene Expression Omnibus website,
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28.
The MAT-file yeastdata.mat contains the expression values (log2 of ratio of CH2DN_MEAN and
CH1DN_MEAN) from the seven time steps in the experiment, the names of the genes, and an array of
the times at which the expression levels were measured.
load yeastdata.mat
To get an idea of the size of the data you can use numel(genes) to show how many genes are
included in the data set.
numel(genes)
ans =
6400
You can access the gene names associated with the experiment by indexing the variable genes, a
cell array of gene names. For example, the 15th element in genes is YAL054C. This
indicates that the 15th row of the variable yeastvalues contains expression levels for YAL054C.
genes{15}
ans =
'YAL054C'
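The lookup also works in the other direction. The following minimal sketch finds the row for a known ORF name; the variable names idx and profile are introduced here for illustration.

```matlab
load yeastdata.mat                        % provides genes, yeastvalues, times
idx = find(strcmp(genes, 'YAL054C'), 1);  % first row whose name matches
profile = yeastvalues(idx, :);            % expression values across the time points
```

Searching by name avoids hard-coding row numbers, which change once genes are filtered out of the data set.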
A simple plot can be used to show the expression profile for this ORF.
plot(times, yeastvalues(15,:))
xlabel('Time (Hours)');
ylabel('Log2 Relative Expression Level');
You can also plot the actual expression ratios, rather than the log2-transformed values.
plot(times, 2.^yeastvalues(15,:))
xlabel('Time (Hours)');
ylabel('Relative Expression Level');
The gene associated with this ORF, ACS1, appears to be strongly up-regulated during the diauxic
shift. You can compare the expression of this gene to the expression of other genes by plotting
multiple lines on the same figure.
hold on
plot(times, 2.^yeastvalues(16:26,:)')
xlabel('Time (Hours)');
ylabel('Relative Expression Level');
title('Profile Expression Levels');
Typically, a gene expression dataset includes information corresponding to genes that do not show
any interesting changes during the experiment. To make it easier to find the interesting genes, you
can reduce the size of the data set to some subset that contains only the most significant genes.
If you look through the gene list, you will see several spots marked as 'EMPTY'. These are empty
spots on the array, and while they might have data associated with them, for the purposes of this
example, you can consider these points to be noise. These points can be found using the strcmp
function and removed from the data set with indexing commands.
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
numel(genes)
ans =
6314
There are also several places in the dataset where the expression level is marked as NaN. This
indicates that no data was collected for this spot at the particular time step. One approach to dealing
with these missing values would be to impute them using the mean or median of data for the
particular gene over time. This example uses a less rigorous approach of simply throwing away the
data for any genes where one or more expression levels were not measured. The function isnan is used
to identify the genes with missing data, and indexing commands are used to remove them.
nanIndices = any(isnan(yeastvalues),2);
yeastvalues(nanIndices,:) = [];
genes(nanIndices) = [];
numel(genes)
ans =
6276
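For comparison, the imputation approach mentioned earlier can be sketched on a small hypothetical matrix. This is an illustration only; the example itself discards the incomplete rows instead.

```matlab
% Hypothetical expression matrix: rows are genes, columns are time points.
vals = [1 NaN 3; 4 5 6; NaN NaN 2];

rowMeans = mean(vals, 2, 'omitnan');   % per-gene mean over the measured points
missing = isnan(vals);                 % logical mask of missing entries
[r, ~] = find(missing);                % row of each missing entry, column-major

filled = vals;
filled(missing) = rowMeans(r);         % replace each gap with its gene's mean
```

Both find and logical indexing walk the matrix in column-major order, so each missing entry receives the mean of its own row.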
If you were to plot the expression profiles of all the remaining genes, you would see that most
profiles are flat and not significantly different from the others. This flat data is obviously of use as it
indicates that the genes associated with these profiles are not significantly affected by the diauxic
shift; however, in this example, you are interested in the genes with large changes in expression
accompanying the diauxic shift. You can use filtering functions in the Bioinformatics Toolbox™ to
remove genes with various types of profiles that do not provide useful information about genes
affected by the metabolic change.
You can use the genevarfilter function to filter out genes with small variance over time. The
function returns a logical array (i.e., a mask) of the same size as the variable genes with ones
corresponding to rows of yeastvalues with variance greater than the 10th percentile and zeros
corresponding to those below the threshold. You can use the mask to index into the values and
remove the filtered genes.
mask = genevarfilter(yeastvalues);
yeastvalues = yeastvalues(mask,:);
genes = genes(mask);
numel(genes)
ans =
5648
The function genelowvalfilter removes genes that have very low absolute expression values.
Note that these filter functions can also automatically calculate the filtered data and names, so it is
not necessary to index the original data using the mask.
[mask,yeastvalues,genes] = genelowvalfilter(yeastvalues,genes,'absval',log2(3));
numel(genes)
ans =
822
Finally, you can use the function geneentropyfilter to remove genes whose profiles have low
entropy, for example entropy levels in the 15th percentile of the data.
[mask,yeastvalues,genes] = geneentropyfilter(yeastvalues,genes,'prctile',15);
numel(genes)
ans =
614
Cluster Analysis
Now that you have a manageable list of genes, you can look for relationships between the profiles
using some different clustering techniques from the Statistics and Machine Learning Toolbox™. For
hierarchical clustering, the function pdist calculates the pairwise distances between profiles and
linkage creates the hierarchical cluster tree.
corrDist = pdist(yeastvalues,'corr');
clusterTree = linkage(corrDist,'average');
The cluster function calculates the clusters based on either a cutoff distance or a maximum number
of clusters. In this case, the maxclust option is used to identify 16 distinct clusters.
clusters = cluster(clusterTree,'maxclust',16);
The profiles of the genes in these clusters can be plotted together using a simple loop and the
subplot command.
figure
for c = 1:16
subplot(4,4,c);
plot(times,yeastvalues((clusters == c),:)');
axis tight
end
sgtitle('Hierarchical Clustering of Profiles');
The Statistics and Machine Learning Toolbox also has a K-means clustering function. Again, sixteen
clusters are found, but because the algorithm is different these will not necessarily be the same
clusters as those found by hierarchical clustering.
Initialize the state of the random number generator to ensure that the figures generated by these
commands match the figures in the HTML version of this example.
rng('default');
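The clustering call is not shown in this listing. The sketch below runs kmeans on small synthetic profiles so that it stands alone; the correlation distance and the five replicates are assumptions, and the variable profiles is hypothetical. For the yeast data, the analogous call would pass yeastvalues and request 16 clusters.

```matlab
rng('default');                                 % reproducible starting points
t = 1:7;                                        % seven time points
profiles = [randn(20,7) + t; randn(20,7) - t];  % 20 rising and 20 falling profiles

% cidx holds the cluster index of each profile, ctrs the cluster centroids.
[cidx, ctrs] = kmeans(profiles, 2, ...
    'Distance', 'correlation', 'Replicates', 5);
```

The second output, ctrs, has one row per cluster, which is what a centroid plot loops over.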
Instead of plotting all the profiles, you can plot just the centroids.
figure
for c = 1:16
subplot(4,4,c);
plot(times,ctrs(c,:)');
axis tight
axis off
end
sgtitle('K-Means Clustering of Profiles');
You can use the clustergram function to create a heat map of the expression levels and a
dendrogram from the output of the hierarchical clustering.
cgObj = clustergram(yeastvalues(:,2:end),'RowLabels',genes,'ColumnLabels',times(2:end));
h = mapcaplot(yeastvalues,genes);
Notice that the scatter plot of the scores of the first two principal components shows that there are
two distinct regions. This is not unexpected as the filtering process removed many of the genes with
low variance or low information. These genes would have appeared in the middle of the scatter plot.
If you want to look at the values of the principal components, the pca function in the Statistics and
Machine Learning Toolbox is used to calculate the principal components of a data set.
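The pca call is not shown in this listing. A minimal sketch follows, using a random matrix as a hypothetical stand-in for yeastvalues so that it runs on its own; the output names pc, zscores, and pcvars match those used in the surrounding text.

```matlab
rng('default');
X = randn(50, 7);                   % hypothetical data: 50 profiles, 7 time points

[pc, zscores, pcvars] = pca(X);     % coefficients, scores, component variances
pctExplained = pcvars ./ sum(pcvars) * 100;   % percent of variance per component
```

For the example data, X would be replaced by yeastvalues.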
The first output, pc, is a matrix of the principal components of the yeastvalues data. The first
column of the matrix is the first principal component, the second column is the second principal
component, and so on. The second output, zscores, consists of the principal component scores, i.e.,
a representation of yeastvalues in the principal component space. The third output, pcvars, contains
the principal component variances, which give a measure of how much of the variance of the data is
accounted for by each of the principal components.
It is clear that the first principal component accounts for a majority of the variance in the model. You
can compute the exact percentage of the variance accounted for by each component as shown below.
pcvars./sum(pcvars) * 100
ans =
79.8316
9.5858
4.0781
2.6486
2.1723
0.9747
0.7089
This means that almost 90% of the variance is accounted for by the first two principal components.
You can use the cumsum command to see the cumulative sum of the variances.
cumsum(pcvars./sum(pcvars) * 100)
ans =
79.8316
89.4174
93.4955
96.1441
98.3164
99.2911
100.0000
If you want to have more control over the plotting of the principal components, you can use the
scatter function.
figure
scatter(zscores(:,1),zscores(:,2));
xlabel('First Principal Component');
ylabel('Second Principal Component');
title('Principal Component Scatter Plot');
An alternative way to create a scatter plot is with the function gscatter from the Statistics and
Machine Learning Toolbox. gscatter creates a grouped scatter plot where points from each group
have a different color or marker. You can use clusterdata, or any other clustering function, to
group the points.
figure
pcclusters = clusterdata(zscores(:,1:2),'maxclust',8,'linkage','av');
gscatter(zscores(:,1),zscores(:,2),pcclusters,hsv(8))
xlabel('First Principal Component');
ylabel('Second Principal Component');
title('Principal Component Scatter Plot with Colored Clusters');
Self-Organizing Maps
If you have the Deep Learning Toolbox™, you can use a self-organizing map (SOM) to cluster the
data.
The selforgmap function creates a new SOM network object. This example will generate a SOM
using the first two principal components.
P = zscores(:,1:2)';
net = selforgmap([4 4]);
net = train(net,P);
Use plotsom to display the network over a scatter plot of the data. Note that the SOM algorithm
uses random starting points so the results will vary from run to run.
figure
plot(P(1,:),P(2,:),'.g','markersize',20)
hold on
plotsom(net.iw{1,1},net.layers{1}.distances)
hold off
You can assign clusters using the SOM by finding the nearest node to each point in the data set.
distances = dist(P',net.IW{1}');
[d,cndx] = min(distances,[],2); % cndx contains the cluster index
figure
gscatter(P(1,:),P(2,:),cndx,hsv(numel(unique(cndx)))); legend off;
hold on
plotsom(net.iw{1,1},net.layers{1}.distances);
hold off
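The nearest-node assignment can be sketched in base MATLAB as well. The points P and node weights W below are hypothetical, and the distance is computed explicitly rather than with dist, so the snippet does not depend on a trained network.

```matlab
P = [0 0.1 5 5.2; 0 0.2 5 4.9];   % 2-by-N data, one column per point
W = [0 0; 5 5];                   % K-by-2 node weights, one row per node

dx = P(1,:)' - W(:,1)';           % N-by-K differences in the first coordinate
dy = P(2,:)' - W(:,2)';           % N-by-K differences in the second coordinate
D = sqrt(dx.^2 + dy.^2);          % Euclidean distance from each point to each node

[~, cndx] = min(D, [], 2);        % index of the closest node for each point
```

Each point is assigned to the node with the smallest distance, which is the same rule the example applies to the SOM weights.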
close('all');
delete(cgObj);
delete(h);
References
[1] DeRisi, J.L., Iyer, V.R. and Brown, P.O., "Exploring the metabolic and genetic control of gene
expression on a genomic scale", Science, 278(5338):680-6, 1997.
Working with Affymetrix Data
This example shows how to use the functions in the Bioinformatics Toolbox™ for working with
Affymetrix® GeneChip® data.
The function affyread can read four types of Affymetrix data files: DAT files, which
contain raw image data; CEL files, which contain information about the intensity values of the
individual probes; CHP files, which contain information about probe sets; and EXP files, which contain
information about experimental conditions and protocols. affyread can also read CDF and GIN
library files. The CDF file contains information about which probes belong to which probe set, and the
GIN file contains information about the probe sets, such as the gene name with which the probe set is
associated. To learn more about the actual files, you can download sample data files from the
Affymetrix Support Site. Most of the data sets are stored in DTT archives. To extract the DAT, CEL,
and CHP files, you will need to install the Data Transfer Tool.
For this example, you will need some sample data files (DAT, CEL, CHP) from the E. coli Antisense
Genome Array. Download these from Demo_Data_E-coli-antisense.zip. Extract the data files from the
DTT archive using the Data Transfer Tool. Set the variable exampleDataDir to the name of the path
and directory to which you extracted the sample data files.
exampleDataDir = 'C:\Examples\affydemo\data';
In addition to the data files, you will also need Ecoli_ASv2.CDF and Ecoli_ASv2.GIN, the library files
for the E. coli Antisense Genome Array. You may already have these files if you have any Affymetrix
GeneChip software installed on your machine. If not, get the library files by downloading and
unzipping the E. coli Antisense Genome Array zip file.
Note that you will have to register in order to access the library files.
You only have to unzip the files; you do not have to run the Setup.exe file in the archive.
Set the variable libDir to the name of the path and directory to which you extracted the library
files.
libDir = 'C:\Examples\affydemo\libfiles';
The raw image data from the chip scanner is saved in the DAT file. If you use affyread to read a
DAT file you will see that it creates a MATLAB® structure.
datStruct = affyread(fullfile(exampleDataDir,'Ecoli-antisense-121502.dat'))
datStruct =
Name: 'Ecoli-antisense-121502.dat'
DataPath: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\data'
LibPath: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\data'
FullPathName: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\data\Ecoli-an
ChipType: 'Ecoli_ASv2'
NumPixelsPerRow: 4733
NumRows: 4733
MinData: 0
MaxData: 46108
PixelSize: 3
CellMargin: 2
ScanSpeed: 17
ScanDate: '13-Aug-0001 11:31:58'
ScannerID: ''
UpperLeftX: 231
UpperLeftY: 235
UpperRightX: 4492
UpperRightY: 253
LowerLeftX: 220
LowerLeftY: 4501
LowerRightX: 4482
LowerRightY: 4519
ServerName: ''
Image: [4733x4733 uint16]
You can access fields of the structure using the dot notation.
datStruct.NumRows
ans =
4733
datFigure = figure;
imagesc(datStruct.Image);
You can change the colormap from the default jet to another using the colormap command.
colormap pink
You can zoom in on a particular area by using the Zoom In tool with the mouse, or by using the axis
command. Notice that this stretches the y-axis.
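The axis call for this step is not shown in this listing. The sketch below applies limits to a hypothetical random image so that it runs without the DAT file; the limits are the ones that appear in the next step of this example.

```matlab
figure('Visible', 'off');
imagesc(rand(700, 3000));    % stand-in image (hypothetical, not chip data)
axis([1900 2800 160 650]);   % [xmin xmax ymin ymax] of the region to show
lims = axis;                 % read the limits back from the current axes
```

Setting the limits alone leaves the axes box unchanged, so a region that is wider than it is tall appears stretched vertically.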
You can use the axis image command to set the correct aspect ratio.
axis image
axis([1900 2800 160 650])
The information about each probe on the chip is extracted from the image data by the Affymetrix
image analysis software. The information is stored in the CEL file. affyread reads a CEL file into a
structure. Notice that many of the fields are the same as those in the DAT structure.
celStruct = affyread(fullfile(exampleDataDir,'Ecoli-antisense-121502.CEL'))
celStruct =
Name: 'Ecoli-antisense-121502.CEL'
DataPath: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\data'
LibPath: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\data'
FullPathName: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\data\Ecoli-a
ChipType: 'Ecoli_ASv2'
Date: '01-Feb-2013 11:55:24'
FileVersion: 3
Algorithm: 'Percentile'
AlgParams: 'Percentile:75;CellMargin:2;OutlierHigh:1.500;OutlierLow:1.004'
NumAlgParams: 4
CellMargin: 2
Rows: 544
Cols: 544
NumMasked: 0
NumOutliers: 115
NumProbes: 295936
UpperLeftX: 231
UpperLeftY: 235
UpperRightX: 4492
UpperRightY: 253
LowerLeftX: 220
LowerLeftY: 4501
LowerRightX: 4482
LowerRightY: 4519
ProbeColumnNames: {8x1 cell}
Probes: [295936x8 single]
The CEL file contains information about where each probe is on the chip and also the intensity values
for the probe. You can use the maimage function to display the chip.
celFigure = figure;
maimage(celStruct)
If you compare the image created from the CEL file and the image created from the DAT file, you will
notice that the CEL image is lower resolution. This is because there is only one pixel per probe in this
image, whereas the DAT file image has many pixels per probe.
The structures created by affyread can be very large. It is a good idea to clear them from memory
once they are no longer needed.
clear datStruct
close(datFigure); close(celFigure);
The Probes field of the CEL structure contains information about the individual probes. There are
eight values per probe. The names of these values are stored in the ProbeColumnNames field of the
structure.
celStruct.ProbeColumnNames
ans =
{'PosX' }
{'PosY' }
{'Intensity'}
{'StdDev' }
{'Pixels' }
{'Outlier' }
{'Masked' }
{'ProbeType'}
So if you look at one row of the Probes field of the CEL structure you will see eight values
corresponding to the X position, Y position, intensity, and so forth.
celStruct.Probes(1:10,:)
ans =
1.0e+04 *
Columns 1 through 7
Column 8
0.0001
0.0001
0.0001
0.0001
0.0001
0.0001
0.0001
0.0001
0.0001
0.0001
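Because the column names are stored next to the data, a column can be selected by name rather than by position. The sketch below uses a small hypothetical structure, celLike, that mimics only the two relevant fields so that it runs without the CEL file.

```matlab
% Hypothetical stand-in with the same field layout as the CEL structure.
celLike.ProbeColumnNames = {'PosX'; 'PosY'; 'Intensity'};
celLike.Probes = single([0 0 120; 1 0 95; 2 0 143]);

col = strcmp(celLike.ProbeColumnNames, 'Intensity');  % mask for the wanted column
intensity = celLike.Probes(:, col);                   % one intensity per probe
```

The same two lines should apply to celStruct, given the field names shown in the output above.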
The CHP file contains the results of the experiment. These include the average signal measures for
each probe set as determined by the Affymetrix software and information about which probe sets are
called as present, absent or marginal and the p-values for these calls.
chpStruct = affyread(fullfile(exampleDataDir,'Ecoli-antisense-121502.CHP'),libDir)
chpStruct =
Name: 'Ecoli-antisense-121502.CHP'
DataPath: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\data'
LibPath: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\libfiles'
FullPathName: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\data\Ecoli-an
ChipType: 'Ecoli_ASv2'
AssayType: 'Expression'
Date: '01-Feb-2013 11:55:24'
CellFile: 'c:\documents and settings\bkolou\desktop\demo_data_e-coli-antisense\Ecoli-a
Algorithm: 'ExpressionStat'
AlgVersion: '5.0'
NumAlgParams: 13
AlgParams: 'SFGene=All SF=5.578290 NF=1.000000 TGT=500 Perturbation=1.1 Gamma2L=0.006 G
NumChipSummary: 3
ChipSummary: 'RawQ=1.62 Noise=Avg:1.33,Stdev:0.20,Max:1.7,Min:1.0 Background=Avg:42.81,St
BackgroundZones: [1x1 struct]
Rows: 544
Cols: 544
NumProbeSets: 7312
NumQCProbeSets: 0
ProbeSets: [7312x1 struct]
The ProbeSets field contains information about the probe sets. This includes some library
information, such as the ID and the type of probe set, and also results information such as the
calculated signal value and the Present/Absent/Marginal call information. The call is given in the
Detection field of the ProbeSets structure. The 'argG_b3172_at' probe set is called as being
'Present'.
chpStruct.ProbeSets(5213)
ans =
Name: 'argG_b3172_at'
ProbeSetType: 'Expression'
CompDataExists: 0
NumPairs: 15
NumPairsUsed: 15
Signal: 127.6070
Detection: 'Present'
DetectionPValue: 0.0134
CommonPairs: []
SignalLogRatio: []
SignalLogRatioLow: []
SignalLogRatioHigh: []
Change: []
ChangePValue: []
chpStruct.ProbeSets(5216)
ans =
Name: 'IG_2069_3319273_3319712_rev_at'
ProbeSetType: 'Expression'
CompDataExists: 0
NumPairs: 15
NumPairsUsed: 15
Signal: 35.0037
Detection: 'Absent'
DetectionPValue: 0.2661
CommonPairs: []
SignalLogRatio: []
SignalLogRatioLow: []
SignalLogRatioHigh: []
Change: []
ChangePValue: []
chpStruct.ProbeSets(5215)
ans =
Name: 'yhbX_b3173_at'
ProbeSetType: 'Expression'
CompDataExists: 0
NumPairs: 15
NumPairsUsed: 15
Signal: 147.7237
Detection: 'Marginal'
DetectionPValue: 0.0559
CommonPairs: []
SignalLogRatio: []
SignalLogRatioLow: []
SignalLogRatioHigh: []
Change: []
ChangePValue: []
You can calculate how many probe sets are called as being 'Present',
numPresent = sum(strcmp('Present',{chpStruct.ProbeSets.Detection}))
numPresent =
4605
'Absent',
numAbsent = sum(strcmp('Absent',{chpStruct.ProbeSets.Detection}))
numAbsent =
2524
and 'Marginal'.
numMarginal = sum(strcmp('Marginal',{chpStruct.ProbeSets.Detection}))
numMarginal =
183
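With the data in this example, the three counts printed above add up to 7312, which matches the NumProbeSets value. The three separate calls can also be folded into a single pass. The following is a minimal sketch on a toy cell array that stands in for {chpStruct.ProbeSets.Detection}; the variable names detection, calls, and counts are illustrative and do not come from the example.

```matlab
% Toy stand-in for {chpStruct.ProbeSets.Detection} (illustrative values only).
detection = {'Present','Absent','Present','Marginal','Present','Absent'};

% unique returns the distinct calls in sorted order and, for every entry,
% the index of its call in that sorted list.
[calls, ~, idx] = unique(detection);

% Count how many entries fall into each call.
counts = accumarray(idx(:), 1);

for k = 1:numel(calls)
    fprintf('%-8s %d\n', calls{k}, counts(k));
end
```

To apply this to the CHP data, replace the toy cell array with {chpStruct.ProbeSets.Detection}.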
maboxplot will display a box plot of the log2 signal values for all probe sets.
maboxplot(chpStruct,'Signal','title',chpStruct.Name)
The CHP file gives summary information about probe sets, but if you want more detailed information
about how the individual probes in a probe set behave, you need to connect the probe information in
the CEL file to the corresponding probe sets. The information needed to make this connection is stored
in the CDF library file associated with a chip type. The CDF files are typically stored in a central library directory.
cdfStruct = affyread('Ecoli_ASv2.cdf',libDir)
cdfStruct =
Name: 'Ecoli_ASv2.cdf'
ChipType: 'Ecoli_ASv2'
LibPath: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\libfiles'
FullPathName: 'I:\qe\test_data\Bioinformatics_Toolbox\v000\demoData\affydemo\libfiles\
Date: '04-Feb-2013 11:14:01'
Rows: 544
Cols: 544
NumProbeSets: 7312
NumQCProbeSets: 13
ProbeSetColumnNames: {6x1 cell}
ProbeSets: [7325x1 struct]
Most of the information in the file is about the probe sets. In this example there are 7312 regular
probe sets and 13 QC probe sets. The ProbeSets field of the structure is a 7325x1 array of
structures.
cdfStruct.ProbeSets
ans =
Name
ProbeSetType
CompDataExists
NumPairs
NumQCProbes
QCType
GroupNames
ProbePairs
A probe set record contains information about the name, type and number of probe pairs in the probe
set.
probeSetIndex = 5213;
cdfStruct.ProbeSets(probeSetIndex)
ans =
Name: 'argG_b3172_at'
ProbeSetType: 'Expression'
CompDataExists: 0
NumPairs: 15
NumQCProbes: 0
QCType: 0
GroupNames: {'argG_b3172_at'}
ProbePairs: [15x6 int32]
The information about where the probes for a probe set are on the chip is stored in the ProbePairs
field. This is a matrix with one row for each probe pair and six columns. The information in the
columns corresponds to the ProbeSetColumnNames of the CDF structure.
cdfStruct.ProbeSetColumnNames
cdfStruct.ProbeSets(probeSetIndex).ProbePairs
ans =
{'GroupNumber'}
{'Direction' }
{'PMPosX' }
{'PMPosY' }
{'MMPosX' }
{'MMPosY' }
ans =
The first column shows the probe group number. The second column shows the probe direction. The
group number is always 1 for expression arrays. Direction 1 corresponds to 'sense' and 2 corresponds
to 'anti-sense'. The remaining columns give the X and Y coordinates of the PM and MM probes on the
chip. You can use these coordinates to find the index of a probe in the celStruct.
PMX = cdfStruct.ProbeSets(probeSetIndex).ProbePairs(1,3);
PMY = cdfStruct.ProbeSets(probeSetIndex).ProbePairs(1,4);
theProbe = find((celStruct.Probes(:,1) == PMX) & ...
(celStruct.Probes(:,2) == PMY))
theProbe =
96719
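The index found by the search can be cross-checked with simple arithmetic. The sketch below assumes, and this is not stated in the text, that the rows of celStruct.Probes follow the chip layout row by row and that the X and Y coordinates are zero-based. The coordinates 430 and 177 are the PM position printed for the first probe pair in the probesetvalues listing in this section, and 544 is the Cols value of the chip.

```matlab
% Values taken from the listings in this section.
PMX = 430;      % PM X coordinate of the first probe pair
PMY = 177;      % PM Y coordinate of the first probe pair
nCols = 544;    % number of columns on the chip (Cols field)

% Assumption: probes are stored row by row, with zero-based coordinates,
% so the one-based row index in celStruct.Probes is Y*Cols + X + 1.
theProbeDirect = PMY*nCols + PMX + 1
```

Under that assumption the computed index agrees with the value returned by find. The search shown above remains the safer approach because it does not depend on the storage order.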
You can then extract all the information about this probe from the CEL structure.
celStruct.Probes(theProbe,:)
ans =
Columns 1 through 7
Column 8
1.0000
If you want to do this lookup for all probes, you can use the function probelibraryinfo. This
creates a matrix with one row per probe and three columns. The first column is the index of the probe
set to which the probe belongs. The second column contains the probe pair index and the third
column indicates if the probe is a perfect match (1) or mismatch (-1) probe. Notice that the probe
pair index is 1-based.
probeinfo = probelibraryinfo(celStruct,cdfStruct);
probeinfo(theProbe,:)
ans =
5213 1 1
The function probesetvalues does the reverse of this lookup and creates a matrix of information
from the CEL and CDF structures containing all the information about a given probe set. This matrix
has 20 columns corresponding to ProbeSetNumber, ProbePairNumber, UseProbePair,
Background, PMPosX, PMPosY, PMIntensity, PMStdDev, PMPixels, PMOutlier, PMMasked,
MMPosX, MMPosY, MMIntensity, MMStdDev, MMPixels, MMOutlier, MMMasked, Group, and
Direction.
probeName = cdfStruct.ProbeSets(probeSetIndex).Name;
psvals = probesetvalues(celStruct,cdfStruct,probeName);
sprintf( ['%4d %2d %d %d PM: %3d %3d %5.1f %5.1f %2d %d %d',...
' MM: %3d %3d %5.1f %5.1f %2d %d %d %d %d\n'],psvals')
ans =
'5212 0 0 4.543512e+01 PM: 430 177 169.0 35.4 25 0 0 MM: 430 178 163.5 24.1 30 0 0 1 2
5212 1 0 4.545356e+01 PM: 431 177 127.3 21.8 30 0 0 MM: 431 178 100.3 14.6 36 0 0 1 2
5212 2 0 4.547230e+01 PM: 432 177 127.0 23.7 30 0 0 MM: 432 178 175.0 28.6 36 0 0 1 2
5212 3 0 4.549129e+01 PM: 433 177 133.3 25.9 36 0 0 MM: 433 178 94.0 22.7 30 0 0 1 2
5212 4 0 4.551051e+01 PM: 434 177 212.3 43.3 36 0 0 MM: 434 178 171.8 36.5 30 0 0 1 2
5212 5 0 4.552995e+01 PM: 435 177 149.5 27.5 36 0 0 MM: 435 178 154.0 30.3 30 0 0 1 2
5212 6 0 4.554958e+01 PM: 436 177 50.3 11.2 30 0 0 MM: 436 178 46.0 9.8 25 0 0 1 2
5212 7 0 4.556938e+01 PM: 437 177 152.5 37.7 36 0 0 MM: 437 178 107.0 21.0 36 0 0 1 2
5212 8 0 4.558934e+01 PM: 438 177 164.5 31.2 36 0 0 MM: 438 178 97.3 21.9 36 0 0 1 2
5212 9 0 4.560939e+01 PM: 439 177 126.0 23.4 36 0 0 MM: 439 178 121.3 25.3 36 0 0 1 2
5212 10 0 4.562955e+01 PM: 440 177 54.0 11.2 36 0 0 MM: 440 178 54.0 12.9 36 0 0 1 2
5212 11 0 4.564975e+01 PM: 441 177 83.3 17.4 36 0 0 MM: 441 178 62.3 12.5 36 0 0 1 2
5212 12 0 4.566998e+01 PM: 442 177 95.5 17.1 30 0 0 MM: 442 178 84.0 18.6 30 0 0 1 2
5212 13 0 4.569022e+01 PM: 443 177 110.0 19.6 36 0 0 MM: 443 178 92.5 22.0 36 0 0 1 2
5212 14 0 4.571042e+01 PM: 444 177 251.0 46.0 36 0 0 MM: 444 178 111.8 20.7 36 0 0 1 2
'
You can extract the intensity values from the matrix and look at some of the statistics of the data.
pmIntensity = psvals(:,7);
mmIntensity = psvals(:,14);
boxplot([pmIntensity,mmIntensity],'labels',{'Perfect Match','Mismatch'})
title(sprintf('Boxplot of raw intensity values for probe set %s',...
probeName),'interpreter','none')
% Use interpreter none to prevent the TeX interpreter treating the _ as
% subscript.
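Besides the box plot, a few summary numbers can make the relationship between the PM and MM probes concrete. The following sketch copies the 15 PM and MM intensities from the probesetvalues listing above so that it runs on its own; the variable names pm, mm, d, nBrightMM, and medianDiff are illustrative.

```matlab
% PM and MM intensities copied from the probesetvalues listing above.
pm = [169.0 127.3 127.0 133.3 212.3 149.5 50.3 152.5 164.5 126.0 ...
      54.0 83.3 95.5 110.0 251.0]';
mm = [163.5 100.3 175.0 94.0 171.8 154.0 46.0 107.0 97.3 121.3 ...
      54.0 62.3 84.0 92.5 111.8]';

d = pm - mm;                 % per-pair difference, PM minus MM
nBrightMM = sum(mm > pm);    % pairs in which the mismatch probe is brighter
medianDiff = median(d);      % typical PM minus MM difference

fprintf('MM brighter than PM in %d of %d pairs; median PM-MM = %.1f\n', ...
    nBrightMM, numel(pm), medianDiff);
```

With the structures from this example in the workspace, pm and mm correspond to pmIntensity and mmIntensity.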
Now that you have the intensity values for the probes, you can plot the values for the perfect match
and mismatch probes.
figure
plot(pmIntensity,'b'); hold on
plot(mmIntensity,'r'); hold off
title(sprintf('Probe intensity values for probe set %s',...
probeName),'interpreter','none')
Alternatively, you can use the function probesetplot to create this plot directly from the CEL and
CDF structures. The showstats option adds the mean, and lines for +/- one standard deviation for
both the perfect match and the mismatch probes to the plot.
probesetplot(celStruct,cdfStruct,probeName,'showstats',true);
The Affymetrix probe set IDs are not particularly descriptive. The mapping between the probe set IDs
and the gene IDs is stored in the GIN library file. This is a text file so you can open it in an editor and
browse through the file, or you can use affyread to read the information into a structure.
ginStruct = affyread('Ecoli_ASv2.GIN',libDir)
ginStruct =
Name: 'Ecoli_ASv2'
Version: 2
ProbeSetName: {7312x1 cell}
ID: {7312x1 cell}
Description: {7312x1 cell}
SourceNames: {2x1 cell}
SourceURL: {2x1 cell}
SourceID: [7312x1 double]
You can search through the structure for a particular probe set. Alternatively, you can use the
function probesetlookup to find information about the gene for a probe set.
info = probesetlookup(cdfStruct,probeName)
info =
Identifier: '3315278'
ProbeSetName: 'argG_b3172_at'
CDFIndex: 5213
GINIndex: 3074
Description: '/start=3316278 /end=3317621 /direction=+ /description=argininosuccinate synthe
Source: 'NCBI EColi Genome'
SourceURL: 'http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/altvik?gi=115&db=g&from=3315278'
The function probesetlink will link out to the NetAffx™ Web site to show the actual sequences
used for the probes. Note that you will need to be a registered user of NetAffx to access this
information.
probesetlink(cdfStruct,probeName);
Preprocessing Affymetrix Microarray Data at the Probe Level
This example shows how to use MATLAB® and Bioinformatics Toolbox™ for preprocessing
Affymetrix® oligonucleotide microarray probe-level data with two preprocessing techniques, Robust
Multi-array Average (RMA) and GC Robust Multi-array Average (GCRMA).
Introduction
With Affymetrix oligonucleotide microarray platforms, gene expression is measured using probe sets
consisting of 11 to 20 perfect match (PM) probes (25 nucleotides in length) complementary to target
mRNA sequences. Each probe set also has the same number of mismatch (MM) probes, in which the
13th nucleotide has been changed to its complement. The PM probes are designed for gene specific
hybridization. The control MM probe measurements are thought to comprise most of the background
non-specific binding, such as cross-hybridization. A PM probe and its corresponding MM probe are
referred to as a probe pair.
The measured probe intensities and locations from a hybridized microarray are stored in a CEL file.
For each Affymetrix microarray platform, the information relating probe pairs to probe set IDs, and to
locations on the array is stored in a CDF library file. The probe sequence information is stored in a
sequence file (FASTA or tab-separated format).
In general, preprocessing Affymetrix probe-level expression data consists of three steps: background
adjustment, normalization, and summarization at the probe set level as a measure of the expression
level of corresponding mRNA. Many methods exist for the statistical procedures of these three steps.
Two popular techniques, RMA (Irizarry et al., 2003) and GCRMA (Wu et al., 2004), are used in this
example.
Note: This example shows the RMA and GCRMA preprocessing procedures to compute expression
values from input CEL files in step-by-step detail, using several functions. You can also complete the
same RMA or GCRMA techniques in one function call by using the Bioinformatics Toolbox affyrma
or affygcrma functions, respectively.
Importing Data
The CNS experiment was conducted using the Affymetrix HuGeneFL GeneChip® array, and the data
were stored in CEL files. Information related to each probe is contained in the Affymetrix Hu6800
CDF library file.
If you don't already have the Hu6800 CDF library file, download the HuGeneFL Genome Array library
zip file. Extract the Hu6800.CDF file into a directory, such as C:\Examples\affypreprocessdemo
\libfiles. Note: You will have to register in order to access the library files, but you do not have to
run the setup.exe file in the archive.
The CNS dataset (CEL files) is available here. To complete this example, download the CEL files of
the CNS dataset into a directory, such as C:\Examples\affypreprocessdemo\data. Unzip the
CEL file archives. Note: This dataset contains more CEL files than are needed for this example.
CNS_DataA_Sample_CEL.txt, a file provided with Bioinformatics Toolbox, contains a list of the 42 CEL
filenames used for this example, and the samples (10 medulloblastomas, 10 rhabdoid, 10 malignant
glioma, 8 supratentorial PNETS, and 4 normal human cerebella) to which they belong. Load this data
into two MATLAB variables.
fid = fopen('CNS_DataA_Sample_CEL.txt','r');
ftext = textscan(fid,'%q%q');
fclose(fid);
samples = ftext{1};
cels = ftext{2};
Set the variables celPath and libPath to the paths of the CEL files and library directories.
celPath = 'C:\Examples\affypreprocessdemo\data';
libPath = 'C:\Examples\affypreprocessdemo\libfiles';
Rename the CEL files so that each file name starts with the MG number that follows the underscore "_"
in the original file name. For instance, GSM1688666_MG1999060202AA.CEL is renamed to
MG1999060202AA.CEL. You do not need to run this code if the file names are already in the required
format.
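A = dir(fullfile(celPath,'*.cel'));
fileNames = string({A.name});
for iFile = 1:numel(A)
    % Skip files without an underscore; they are already in the required format.
    if ~contains(fileNames(iFile),"_")
        continue
    end
    % Keep only the part of the name that follows the first underscore.
    newName = fullfile(celPath,extractAfter(fileNames(iFile),"_"));
    movefile(fullfile(celPath,fileNames(iFile)),newName);
end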
The function celintensityread can read multiple CEL files and access a CDF library file. It returns
a MATLAB structure containing the probe information and probe intensities. The matrices of PM and
MM intensities from multiple CEL files are stored in the PMIntensities and MMIntensities
fields. In each probe intensity matrix, the column indices correspond to the order in which the CEL
files were read, and each row corresponds to a probe. Create a MATLAB structure of PM and MM
probe intensities by loading data from the CEL files from the directory where the CEL files are
stored, and pass in the path to where you stored the CDF library file. (Note: celintensityread will
report the progress to the MATLAB command window. You can turn the progress report off by setting
the input parameter VERBOSE to false.)
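The call that produces the structure described here is not reproduced in this listing. A minimal sketch of such a call follows, assuming the variables cels, celPath, and libPath defined earlier in this example. The 'pmonly' option is set to false on the assumption that MM intensities are wanted as well, because they are needed for the GCRMA steps later on.

```matlab
% Assumes cels, celPath, and libPath as defined earlier in this example.
% 'pmonly', false asks for the MM intensities in addition to the PM
% intensities.
probeData = celintensityread(cels, 'Hu6800.CDF', ...
    'celpath', celPath, 'cdfpath', libPath, 'pmonly', false)
```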
probeData =
CDFName: 'Hu6800.CDF'
CELNames: {1×42 cell}
NumChips: 42
NumProbeSets: 7129
NumProbes: 140983
ProbeSetIDs: {7129×1 cell}
ProbeIndices: [140983×1 uint8]
GroupNumbers: [140983×1 uint8]
PMIntensities: [140983×42 single]
MMIntensities: [140983×42 single]
nSamples = probeData.NumChips
nSamples =
42
nProbeSets = probeData.NumProbeSets
nProbeSets =
7129
nProbes = probeData.NumProbes
nProbes =
140983
To perform GCRMA preprocessing, the probe sequence information of the HuGeneFL array is also
required. The Affymetrix support site provides probe sequence information for most of the available
arrays, either as FASTA formatted or tab-delimited files. This example assumes you have the
HuGeneFL_probe_tab file in the library files directory. Use the function affyprobeseqread to parse
the sequence file and return the probe sequences in an nProbes x 25 matrix of integers that
represents the PM probe sequence bases, with rows corresponding to the probes on the chip and
columns corresponding to the base positions of the 25-mer.
S = affyprobeseqread('HuGeneFL_probe_tab', 'Hu6800.CDF',...
'seqpath', libPath, 'cdfpath', libPath, 'seqonly', true)
S =
The RMA procedure uses only PM probe intensities for background adjustment (Irizarry et al., 2003),
while GCRMA adjusts background using probe sequence information and MM control probe
intensities to estimate non-specific binding (Wu et al., 2004). In both RMA and GCRMA, background
adjustment is followed by quantile normalization (Bolstad et al., 2003) and median polish
summarization (Irizarry et al., 2003) of PM intensities.
The RMA background adjustment method corrects PM probe intensities chip by chip. The PM probe
intensities are modeled as the sum of a normal noise component and an exponential signal
component. Use rmabackadj to background adjust the PM intensities in the CNS data. You can
inspect the intensity distribution histogram and the estimated background adjustment of a specific
chip by setting the input parameter SHOWPLOT to the column index of the chip.
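The normalization step later in this example operates on a variable named pms_bg, whose assignment is not shown in this listing. A minimal sketch of a call that would create it follows, assuming probeData was returned by celintensityread; the choice of chip 1 for the plot is arbitrary.

```matlab
% Assumes probeData was returned by celintensityread.
% 'showplot', 1 plots the intensity histogram and the estimated
% background adjustment for the first chip (arbitrary choice).
pms_bg = rmabackadj(probeData.PMIntensities, 'showplot', 1);
```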
Several nonlinear normalization methods have been successfully applied to Affymetrix microarray
data. The RMA procedure normalizes the probe-level data with a quantile normalization method. Use
quantilenorm to normalize the background adjusted PM intensities in the CNS data. Note: If you
are interested in a rank-invariant set normalization method, use the affyinvarsetnorm function
instead.
pms_bgnorm = quantilenorm(pms_bg);
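To make the idea behind quantile normalization concrete, the following sketch applies the basic procedure to a small toy matrix using only core MATLAB functions. It illustrates the technique and is not the quantilenorm implementation; in particular it ignores ties and missing values. The names X, Xn, order, and rankMeans are illustrative.

```matlab
% Toy matrix: rows are probes, columns are chips (illustrative values only).
X = [5 4 3;
     2 1 4;
     3 7 6;
     4 2 8];

% Sort each column and remember where every value came from.
[Xs, order] = sort(X, 1);

% Mean of the k-th smallest values across chips, for every rank k.
rankMeans = mean(Xs, 2);

% Give each probe the mean that belongs to its rank within its chip.
Xn = zeros(size(X));
for j = 1:size(X, 2)
    Xn(order(:, j), j) = rankMeans;
end
```

After this step every column of Xn contains the same set of values, so all chips share one intensity distribution while each probe keeps its rank within its chip.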
The GCRMA procedure adjusts for optical noise and non-specific binding (NSB) taking into account
the effect of the stronger bonding of G/C pairs (Naef et al., 2003, Wu et al., 2004). GCRMA uses probe
sequence information to estimate probe affinities for computing non-specific binding. The probe
affinity is modeled as a sum of the position-dependent base effects. Usually, the probe affinities are
estimated from the MM intensities of an NSB experiment. If NSB data is not available, the probe
affinities can still be estimated from sequence information and MM probe intensities normalized by
the probe set median intensity (Naef et al., 2003).
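The statement that the probe affinity is a sum of position-dependent base effects can be written out as follows. The symbols are chosen here for illustration and are not taken from the example: \alpha is the affinity of a probe, b_k is the base at position k of the 25-mer, and \mu_{b,k} is the effect of base b at position k.

```latex
\alpha \;=\; \sum_{k=1}^{25} \; \sum_{b \in \{A,C,G,T\}} \mu_{b,k}\, \mathbf{1}[\,b_k = b\,]
```

Each position contributes exactly one term, the effect of the base that is actually present there. The base effects \mu_{b,k} are the quantities shown in the affinity plot described next.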
For the CNS dataset, use the data from the microarray hybridized with the normal cerebella sample
(Brain_Ncer_1) to compute the probe affinities for the HuGeneFL array. Use affyprobeaffinities
to estimate the probe affinities of an Affymetrix microarray. Use the SHOWPLOT input parameter to
inspect a plot showing the effects of base A, C, G, and T at the 25 positions.
figure
idx = find(strcmpi('Brain_Ncer_1', samples));
[pmAlpha, mmAlpha] = affyprobeaffinities(S.SequenceMatrix,...
probeData.MMIntensities(:, idx), 'showplot', true);
Note: There are 496 probes on a HuGeneFL array that do not have sequence information; the
affinities for these probes are NaN.
With the probe affinities available, the amount of NSB can be estimated by fitting a LOWESS curve
through MM probe intensities vs. MM probe affinities. The function gcrmabackadj performs optical
and NSB corrections. The input parameter SHOWPLOT shows a plot of the optical noise adjusted MM
intensities against their affinities, and the smooth fit of a specified chip. You can compute the
background intensities with one of two estimation methods, Maximum Likelihood Estimate (MLE) and
Empirical-Bayes (EB), which computes the posterior mean of specific binding given prior observed
intensities. Here you will background adjust four arrays using both estimation methods. (Note:
gcrmabackadj will report the progress to the MATLAB command window. You can turn the progress
report off by setting the input parameter VERBOSE to false.)
Background adjust the first four chips using GCRMA-MLE method, and inspect the plot of intensity
vs. affinity for data from the third array.
% 'showplot', 3 plots intensity vs. affinity for the third chip.
pms_MLE_bg = gcrmabackadj(probeData.PMIntensities(:,1:4),...
probeData.MMIntensities(:, 1:4),...
pmAlpha, mmAlpha, 'showplot', 3);
Background adjust the first four chips using the GCRMA-EB method. Processing with this method is
more computationally intensive and will take longer.
pms_EB_bg = gcrmabackadj(probeData.PMIntensities(:,1:4),...
probeData.MMIntensities(:, 1:4),...
pmAlpha, mmAlpha, 'method', 'EB');
You can continue the preprocessing with the quantilenorm and rmasummary functions, or use the
gcrma function to do everything. The gcrma function performs background adjustment and returns
expression measures of background adjusted PM probe intensities using the same normalization and
summarization methods as RMA. You can also pass in the sequence matrix instead of affinities. The
function will automatically compute the affinities in this case. (Note: gcrma will report the progress
to the MATLAB command window. You can turn the progress report off by setting the input parameter
VERBOSE to false.)
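No gcrma call appears in this listing. The following is a hedged sketch of one, assuming probeData, pmAlpha, and mmAlpha from the earlier steps. The output name expr_gcrma is hypothetical, and the restriction to the first four chips is only there to keep the run short.

```matlab
% Assumes probeData, pmAlpha, and mmAlpha from the earlier steps.
% expr_gcrma is a hypothetical name; only the first four chips are used.
expr_gcrma = gcrma(probeData.PMIntensities(:, 1:4), ...
    probeData.MMIntensities(:, 1:4), probeData.ProbeIndices, ...
    pmAlpha, mmAlpha, 'verbose', false);
```

To let gcrma compute the affinities itself, pass S.SequenceMatrix in place of the two affinity arguments.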
Use boxplots to inspect the PM intensity distributions of the first four chips with three background
adjustment procedures.
figure
subplot(4,1,1)
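The listing above stops after creating the first of four panels. The continuation below is a sketch rather than the original code. It follows the maboxplot call pattern used for the normalized data later in this example and assumes that pms_bg holds the RMA background adjusted intensities and that pms_MLE_bg and pms_EB_bg hold the two GCRMA results. Using the raw intensities for the first panel, and the panel titles, are assumptions.

```matlab
% Continues the figure and subplot(4,1,1) call above.
% Panel 1: raw PM intensities (assumed content of the first panel).
maboxplot(log2(probeData.PMIntensities(:, 1:4)), samples(1:4),...
    'title','Raw Intensity', 'orientation', 'horizontal')
subplot(4,1,2)
maboxplot(log2(pms_bg(:, 1:4)), samples(1:4),...
    'title','RMA Background Adjusted Intensity',...
    'orientation', 'horizontal')
subplot(4,1,3)
maboxplot(log2(pms_MLE_bg), samples(1:4),...
    'title','GCRMA-MLE Background Adjusted Intensity',...
    'orientation', 'horizontal')
subplot(4,1,4)
maboxplot(log2(pms_EB_bg), samples(1:4),...
    'title','GCRMA-EB Background Adjusted Intensity',...
    'orientation', 'horizontal')
```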
Use boxplots to inspect the background corrected and normalized PM intensity distributions of the
first four chips with three background adjustment procedures.
pms_MLE_bgnorm = quantilenorm(pms_MLE_bg);
pms_EB_bgnorm = quantilenorm(pms_EB_bg);
figure
subplot(3,1,1)
maboxplot(log2(pms_bgnorm(:, 1:4)), samples(1:4),...
'title','Normalized RMA Background Adjusted Intensity',...
'orientation', 'horizontal')
subplot(3,1,2)
maboxplot(log2(pms_MLE_bgnorm), samples(1:4),...
'title','Normalized GCRMA-MLE Background Adjusted Intensity',...
'orientation', 'horizontal')
subplot(3,1,3)
maboxplot(log2(pms_EB_bgnorm), samples(1:4),...
'title','Normalized GCRMA-EB Background Adjusted Intensity',...
'orientation', 'horizontal')
Final Remarks
You can import data from CEL files and perform all three preprocessing steps of the RMA and
GCRMA techniques shown in this example by using the affyrma and affygcrma functions,
respectively.
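As a rough illustration of the one-call route, the sketch below assumes the variables cels, celPath, and libPath defined earlier and the HuGeneFL_probe_tab sequence file in the library directory. The output names exprRMA and exprGCRMA are hypothetical.

```matlab
% Assumes cels, celPath, and libPath as defined earlier in this example,
% and that HuGeneFL_probe_tab is stored in the library directory.
exprRMA = affyrma(cels, 'Hu6800.CDF', ...
    'celpath', celPath, 'cdfpath', libPath);

exprGCRMA = affygcrma(cels, 'Hu6800.CDF', 'HuGeneFL_probe_tab', ...
    'celpath', celPath, 'cdfpath', libPath, 'seqpath', libPath);
```

Each call reads the CEL files and performs background adjustment, normalization, and summarization in one pass, so none of the intermediate matrices from the step-by-step procedure are needed.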
References
[1] Pomeroy, S.L., et al., "Prediction of central nervous system embryonal tumour outcome based on
gene expression", Nature, 415(6870):436-42, 2002.
[2] Irizarry, R.A., et al., "Exploration, normalization, and summaries of high density oligonucleotide
array probe level data", Biostatistics, 4(2):249-64, 2003.
[3] Wu, Z., et al., "A model based background adjustment for oligonucleotide expression arrays",
Journal of the American Statistical Association, 99(468):909-17, 2004.
[4] Bolstad, B.M., et al., "A comparison of normalization methods for high density oligonucleotide
array data based on variance and bias", Bioinformatics, 19(2):185-93, 2003.
[5] Naef, F., and Magnasco, M.O., "Solving the riddle of the bright mismatches: labeling and effective
binding in oligonucleotide arrays", Physical Review E, Statistical, Nonlinear, and Soft Matter Physics,
68(1 Pt 1):011906, 2003.
Analyzing Affymetrix SNP Arrays for DNA Copy Number Variants
This example shows how to study DNA copy number variants by preprocessing and analyzing data
from the Affymetrix® GeneChip® Human Mapping 100k array.
Introduction
A copy number variant (CNV) is defined as a chromosomal segment that is 1kb or larger in length,
whose copy number varies in comparison to a reference genome. CNV is one of the hallmarks of
genetic instability common to most human cancers. When studying cancers, an important goal is to
quickly and precisely identify copy number amplifications and deletions, and to assess their
frequencies at the genome level. Recently, single nucleotide polymorphism (SNP) arrays have been
used to detect and quantify genome-wide copy number alterations with high resolution. SNP array
approaches also provide genotype information. For example, they can reveal loss of heterozygosity
(LOH), which can provide supporting evidence for the presence of a deletion.
The Affymetrix GeneChip Mapping Array Set is a popular platform for high-throughput SNP
genotyping and CNV detection. In this example, we use a publicly available data set from the
Affymetrix 100K SNP array that interrogates over 100,000 SNP sites. You will import and preprocess
the probe level data, estimate the raw signal ratios of the samples compared to references, and then
infer copy numbers at each SNP locus after segmentation.
Data
Zhao et al. studied genome-wide copy number alterations of human lung carcinoma cell lines and
primary tumors [1]. The samples were hybridized to Affymetrix 100K SNP arrays, each containing
115,593 mapped SNP loci. For this example, you will analyze data from 24 small cell lung carcinoma
(SCLC) samples, of which 19 were primary tumor samples and 5 were cell line samples.
For each sample, SNPs were genotyped with two different arrays, Early Access 50KXba and Early
Access 50KHind, in parallel. In brief, two aliquots of DNA samples were first digested with an XbaI or
HindIII restriction enzyme, respectively. The digested DNA was ligated to an adaptor before
subsequent polymerase chain reaction (PCR) amplification. Four PCR reactions were set up for each
XbaI or HindIII adaptor-ligated DNA sample. The PCR products from the four reactions were pooled,
concentrated, and fragmented to a size range of 250 to 2,000 bp. Fragmented PCR products were
then labeled, denatured, and hybridized to the arrays.
For this example, you will work with data from the EA 50KXba array. To analyze the data from the EA
50KHind array, repeat the same steps. The SNP array data are stored in CEL files, with each CEL file
containing data from one array.
Note: High density SNP microarray data analysis requires extended amounts of memory from the
operating system; if you receive "Out of memory" errors when running this example, try increasing
the virtual memory (or swap space) of your operating system or try setting the 3GB switch (32-bit
Windows® XP only). For details, see “Resolve “Out of Memory” Errors”.
This example uses the 50KXba and 50KHind SNP array data sets (not included in the toolbox) from
the Meyerson Laboratory at the Dana-Farber Cancer Institute. You may use any other dataset to
perform similar analyses.
The CDF library files used for these two arrays are CentXbaAv2.cdf and CentHindAv2.cdf. You
can obtain these files from the Affymetrix website.
Set the variable Xba_celPath to the path of the location where you stored the Xba array CEL files, and
the variable libPath to the path of the location of the CDF library file for the EA 50KXba SNP
array. (These files are not distributed with Bioinformatics Toolbox™.)
Xba_celPath = 'C:\Examples\affysnpcnvdemo\Xba_array';
libPath = 'C:\Examples\affysnpcnvdemo\LibFiles';
SCLC_Sample_CEL.txt, a file provided with the Bioinformatics Toolbox™ software, contains a list of
the 24 CEL file names used for this example, and the samples (5 SCLC cell lines and 19 primary
tumors) to which they belong. Load this data into two MATLAB® variables.
fid = fopen('SCLC_Sample_CEL.txt','r');
ftext = textscan(fid, '%q%q');
fclose(fid);
samples = ftext{1};
cels = ftext{2};
nSample = numel(samples)
nSample =
24
The Affymetrix 50KXba SNP array has a density up to 50K SNP sites. Each SNP on the array is
represented by a collection of probe quartets. A probe quartet consists of a set of probe pairs for both
alleles (A and B) and for both forward and reverse strands (antisense and sense) for the SNP. Each
probe pair consists of a perfect match (PM) probe and a mismatch (MM) probe. The Bioinformatics
Toolbox software provides functions to access the probe-level data.
The function affyread reads the CEL files and the CDF library files for Affymetrix SNP arrays.
Read the sixth CEL file of the EA 50KXba data into a MATLAB structure.
s_cel = affyread(fullfile(Xba_celPath, [cels{6} '.CEL']))
s_cel =
Name: 'S0168T.CEL'
DataPath: 'C:\Examples\affysnpcnvdemo\Xba_array'
LibPath: 'C:\Examples\affysnpcnvdemo\Xba_array'
FullPathName: 'C:\Examples\affysnpcnvdemo\Xba_array\S0168T.CEL'
ChipType: 'CentXbaAv2'
Date: '01-Feb-2013 11:54:13'
FileVersion: 3
Algorithm: 'Percentile'
AlgParams: 'Percentile:75;CellMargin:2;OutlierHigh:1.500;OutlierLow:1.004;AlgVersion:6
NumAlgParams: 16
CellMargin: 2
Rows: 1600
Cols: 1600
NumMasked: 0
NumOutliers: 12478
NumProbes: 2560000
UpperLeftX: 222
UpperLeftY: 236
UpperRightX: 8410
UpperRightY: 219
LowerLeftX: 252
LowerLeftY: 8426
LowerRightX: 8440
LowerRightY: 8410
ProbeColumnNames: {8×1 cell}
Probes: [2560000×8 single]
Read the CDF library file for the EA 50KXba array into a MATLAB structure.
s_cdf = affyread('CentXbaAv2.cdf', libPath)
s_cdf =
Name: 'CentXbaAv2.cdf'
ChipType: 'CentXbaAv2'
LibPath: 'C:\Examples\affysnpcnvdemo\LibFiles'
FullPathName: 'C:\Examples\affysnpcnvdemo\LibFiles\CentXbaAv2.cdf'
Date: '01-Feb-2013 11:54:12'
Rows: 1600
Cols: 1600
NumProbeSets: 63434
NumQCProbeSets: 9
ProbeSetColumnNames: {6×1 cell}
ProbeSets: [63443×1 struct]
You can inspect the overall quality of the array by viewing the probe-level intensity data using the
function maimage.
maimage(s_cel)
The affysnpquartets function creates a table of probe quartets for a SNP. On Affymetrix 100K SNP
arrays, a probe quartet contains 20 probe pairs. For example, to get detailed information on probe set
number 6540, you can type the following commands:
ps_id = 6540;
ps_qt = affysnpquartets(s_cel, s_cdf, ps_id)
ps_qt =
ProbeSet: '2685329'
AlleleA: 'A'
AlleleB: 'G'
Quartet: [1×6 struct]
You can also view the heat map of the intensities of the PM and MM probe pairs of a SNP probe
quartet using the probesetplot function. Click the Insert Colorbar button to show the color scale
of the heat map.
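The command for this view is not reproduced in this listing. A hedged sketch follows, assuming s_cel, s_cdf, and ps_id as defined above. The 'imageonly' option, which is assumed here to restrict the display to the heat map, should be checked against the probesetplot reference page before relying on it.

```matlab
% Assumes s_cel, s_cdf, and ps_id as defined above.
% 'imageonly' is assumed to display only the heat map of the probe pairs.
probesetplot(s_cel, s_cdf, ps_id, 'imageonly', true);
```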
In this view, the 20 probe pairs are ordered from left to right. The first two rows (10 probe pairs)
correspond to allele A, and the last two rows (10 probe pairs) correspond to allele B. For each allele,
the left 5 probe pairs correspond to the sense strand (-), while the right 5 probe pairs correspond to
the antisense (+) strand.
You will use the celintensityread function to read all 24 CEL files. The celintensityread
function returns a structure containing the matrices of PM and MM (optional) intensities for the
probes and their group numbers. In each probe intensity matrix, the column indices correspond to
the order in which the CEL files were read, and each row corresponds to a probe. For copy number
(CN) analysis, only PM intensities are needed.
Import the probe intensity data of all EA 50KXba arrays into a MATLAB structure.
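The import command itself is not reproduced in this listing. A minimal sketch follows, assuming cels, Xba_celPath, and libPath as defined earlier. Appending the .CEL extension mirrors the affyread call used for the sixth CEL file and is an assumption, as is the helper variable celFiles. Leaving 'pmonly' at its default is intended to return only the PM intensities, which is all that copy number analysis needs.

```matlab
% Assumes cels, Xba_celPath, and libPath as defined earlier.
% celFiles is a hypothetical helper variable; the .CEL extension is
% appended in the same way as in the affyread call above.
celFiles = strcat(cels, '.CEL');
XbaData = celintensityread(celFiles, 'CentXbaAv2.cdf', ...
    'celpath', Xba_celPath, 'cdfpath', libPath)
```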
XbaData =
CDFName: 'CentXbaAv2.cdf'
CELNames: {1×24 cell}
NumChips: 24
NumProbeSets: 63434
NumProbes: 1268480
ProbeSetIDs: {63434×1 cell}
ProbeIndices: [1268480×1 uint8]
GroupNumbers: [1268480×1 uint8]
PMIntensities: [1268480×24 single]
Affymetrix Early Access arrays are the same as the current commercial Mapping 100K arrays, with
the exception that some of the probes are masked out. The data obtained from Affymetrix EA 100K SNP
arrays can be converted to Mapping 100K arrays by filtering out the rejected SNP probe IDs on the Early
Access array and converting the SNP IDs to Mapping 100K SNP IDs. The SNP IDs for EA 50KXba and
50KHind arrays and their corresponding SNP IDs on Mapping 50KXba and 50KHind arrays are
provided in two MAT files shipped with the Bioinformatics Toolbox software, Mapping50K_Xba_V_EA
and Mapping50K_Hind_V_EA, respectively.
load Mapping50K_Xba_V_EA
The helper function affysnpemconvert converts the data to Mapping 50KXba data.
XbaData = affysnpemconvert(XbaData, EA50K_Xba_SNPID, Mapping50K_Xba_SNPID)
XbaData =
CDFName: 'CentXbaAv2.cdf'
CELNames: {1×24 cell}
NumChips: 24
NumProbeSets: 58960
NumProbes: 1179200
ProbeSetIDs: {58960×1 cell}
ProbeIndices: [1179200×1 uint8]
You can view the density plots of the log-transformed PM intensity distribution across the 24 samples
before preprocessing.
f=zeros(nSample, 100);
xi = zeros(nSample, 100);
for i = 1:nSample
[f(i,:),xi(i,:)] = ksdensity(log2(XbaData.PMIntensities(:,i)));
end
figure;
plot(xi', f')
xlabel('log2(PM)')
ylabel('Density')
title('Density Plot')
hold on
[f,xi] = ksdensity(log2(XbaData.PMIntensities(:,1)));
plot(xi', f', '--r', 'Linewidth', 3)
hold off
Note: You can also use the RMA or GCRMA procedures for background correction. The RMA
procedure estimates the background by a mixture model where the background signals are assumed
to be normally distributed and the true signals are exponentially distributed, while the GCRMA
process consists of optical background correction and probe-sequence based background adjustment.
For more information on how to use the RMA and GCRMA procedures, see “Preprocessing Affymetrix
Microarray Data at the Probe Level” on page 4-131.
Probe-Level Summarization
By using the GroupNumbers field data from the structure XbaData, you can extract the intensities
for allele A and allele B for each probe. Use the function affysnpintensitysplit to split the
probe intensities matrix PMIntensities into two single-precision matrices, PMAIntensities and
PMBIntensities, for allele A and allele B probes respectively. The number of probes in each matrix
is the maximum number of probes for each allele.
XbaData = affysnpintensitysplit(XbaData)
XbaData =
CDFName: 'CentXbaAv2.cdf'
For total copy number analysis, a simplification is to ignore the allele A and allele B sequences and
their strand information and, instead, combine the PM intensities for allele A and allele B of each
probe pair.
PM_Xba = XbaData.PMAIntensities + XbaData.PMBIntensities;
For a particular SNP, we now have K (K=5 for Affymetrix Mapping 100K arrays) added signals, each
signal being a measure of the same thing - the total CN. However, each of the K signals has slightly
different sequences, so their hybridization efficiency might differ. You can use RMA summarization
methods to sum up allele probe intensities for each SNP probe set.
PM_Xba = rmasummary(XbaData.ProbeIndices, PM_Xba);
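RMA summarization is based on median polish, as noted in the preprocessing example earlier in this chapter. The sketch below runs a simplified median polish on a toy matrix using only core MATLAB functions, to show how probe effects are separated from chip-level signals. It is not the rmasummary implementation; the fixed number of sweeps and all variable names are assumptions made for illustration.

```matlab
% Toy log2 intensities: 5 probes (rows) of one probe set on 3 chips (columns).
Y = log2([120 240 480;
           60 130 250;
          200 390 820;
           90 180 350;
          150 310 600]);

R = Y;                            % residuals, updated in place
rowEff = zeros(size(Y, 1), 1);    % probe effects
colEff = zeros(1, size(Y, 2));    % chip effects

for iter = 1:10                   % fixed number of sweeps (assumption)
    rm = median(R, 2);            % sweep out row medians
    R = R - rm;
    rowEff = rowEff + rm;
    cm = median(R, 1);            % sweep out column medians
    R = R - cm;
    colEff = colEff + cm;
end

% One summary value per chip: chip effect plus the typical probe level.
chipSummary = colEff + median(rowEff)
```

Because medians are used instead of means, a single outlying probe has little influence on the chip-level summary.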
Affymetrix provides CSV-formatted annotation files for their SNP arrays. You can download the
annotation files for Mapping 100K arrays from
https://www.thermofisher.com/us/en/home/life-science/microarray-analysis/microarray-data-analysis/genechip-array-annotation-files.html.
For this example, download and unzip the annotation file for the Mapping 50KXba array,
Mapping50K_Xba240.na29.annot.csv. The SNP probe information of the Mapping 50KXba array
can be read from this annotation file. Set the variable annoPath to the path of the location where
you saved the annotation file.
annoPath = 'C:\Examples\affysnpcnvdemo\AnnotFiles';
The function affysnpannotread reads the annotation file and returns a structure containing SNP
chromosome information, chromosomal positions, sequences and PCR fragment length information
ordered by probe set IDs from the second input variable.
annoFile = fullfile(annoPath, 'Mapping50K_Xba240.na29.annot.csv');
annot_Xba = affysnpannotread(annoFile, XbaData.ProbeSetIDs)
annot_Xba =
Raw CN Estimation
The relative copy number at a SNP between two samples is estimated based on the log2 ratio of the
normalized signals. The averaged normalized signals from normal samples are used as the global
reference. The preprocessed reference mean log-transformed signals for the Mapping 50KXba array
and the 50KHind array are provided in the MAT-files SCLC_Normal_Xba and SCLC_Normal_Hind,
respectively.
load SCLC_Normal_Xba
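The log2-ratio estimate described above can be sketched on toy numbers. The variables sampleSignal and refMeanLog2 and all their values are hypothetical and are not the contents of the MAT-file. The sketch assumes only that the reference is already a mean of log2-transformed signals from normal samples and that normal samples carry two copies.

```matlab
% Toy normalized signals: 4 SNPs (rows) by 2 tumor samples (columns).
% All values are made up for illustration.
sampleSignal = [ 800 1600;
                1200  600;
                2000 2000;
                 500 1500];

% Hypothetical reference: mean log2 signal of normal samples for each SNP.
refMeanLog2 = log2([800; 1200; 1000; 1000]);

% Raw CN estimate as a log2 ratio: log2(sample) - mean log2(normal).
log2Ratio = log2(sampleSignal) - refMeanLog2;

% Convert back to an approximate copy number, assuming 2 copies in normals.
rawCN = 2 * 2.^log2Ratio;
```

A log2 ratio of 0 corresponds to the reference copy number, +1 to a doubling, and -1 to a halving.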
SNP probes with a missing chromosome number, genomic position, or fragment length in the
annotation file do not have enough information for further CN analysis. Also, Y chromosomes are
usually ignored in CN analysis. Filter out these SNP probes.
u_chr = unique(chr_sort);
gpsidx = zeros(length(gpos_sort),1);
for i = 1:length(u_chr)
uidx = find(chr_sort == u_chr(i));
gp_s = gpos_sort(uidx);
[gp_ss, ssidx] = sort(gp_s);
s_res = uidx(ssidx);
gpsidx(uidx) = s_res;
end
gpos_ssort = gpos_sort(gpsidx);
log2Ratio_ssort = log2Ratio_sort(gpsidx, :);
probesetids_ssort = probesetids_sort(gpsidx);
fragmentlen_ssort = fragmentlen_sort(gpsidx);
accession_ssort = accession_sort(gpsidx);
In the analysis, systematic effects from the PCR process should be taken into account. For example,
longer fragments usually result in less PCR amplification, which leads to less material to hybridize
and weaker signals. You can see this effect by plotting the raw CNs against fragment length.
figure;
plot(fragmentlen_ssort, log2Ratio_ssort(:, 1), '.')
hold on
plot([0 2200], [0 0], '-.g')
xlim([0 2200])
ylim([-5 5])
xlabel('Fragment Length')
ylabel('log2(Ratio)')
title('Pre PCR fragment length normalization')
Nannya et al., 2005 introduced a robust linear model to estimate and remove this effect. For this
example, use the malowess function for PCR fragment length normalization for sample 1. Then
display the smooth fit curve.
smoothfit = malowess(fragmentlen_ssort,log2Ratio_ssort(:,1));
hold on
plot(fragmentlen_ssort, smoothfit, 'r+')
hold off
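Once a smooth trend is available, normalization amounts to subtracting the fitted trend from the raw log2 ratios. The following is a minimal sketch on synthetic data. It uses a straight-line fit from polyfit as a stand-in for the lowess curve that malowess returns, and every variable name and value in it is hypothetical.

```matlab
rng(0);                                   % reproducible synthetic data
fragLen = linspace(200, 2000, 500)';      % hypothetical fragment lengths (bp)
trend   = -0.0005 * (fragLen - 1000);     % made-up length-dependent bias
rawLog2 = trend + 0.1 * randn(500, 1);    % raw log2 ratios = bias + noise

% Stand-in for the smooth fit: first-order polynomial in fragment length.
% (The example in the text uses malowess instead.)
coeffs    = polyfit(fragLen, rawLog2, 1);
smoothFit = polyval(coeffs, fragLen);

% Normalize by removing the fitted fragment-length effect.
normLog2 = rawLog2 - smoothFit;

% After normalization the slope against fragment length is close to zero.
residualSlope = polyfit(fragLen, normLog2, 1);
```

The same subtraction applies when the trend comes from a lowess fit; only the way smoothFit is obtained changes.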
figure;
plot(fragmentlen_ssort, log2Ratio_norm, '.');
hold on
plot([0 2200], [0 0], '-.g')
xlim([0 2200])
ylim([-5 5])
xlabel('Fragment Length')
ylabel('log2(Ratio)')
title('Post PCR fragment length normalization')
hold off
You can normalize PCR fragment length effect for all the samples using the malowess function.
Again, you can repeat the previous steps for the 50KHind array data.
CN Genome Profile
Load a MAT-file containing the preprocessed and normalized CN data from both the 50KXba arrays
and 50KHind arrays.
load SCLC_CN_Data
You can now plot the whole-genome profile of total CNs. For example, plot the whole-genome profile
for sample 1 (CL_H524) using a helper function plotcngenomeprofile.
plotcngenomeprofile(SCLC_CN.GenomicPosition,SCLC_CN.Log2Ratio(:, 1),...
SCLC_CN.Chromosome, 1:23, SCLC_CN.Sample{1})
You can also plot each chromosome CN profile in a subplot. For example, plot each chromosome CN
profile for sample 12 (PT_S0177T):
plotcngenomeprofile(SCLC_CN.GenomicPosition,SCLC_CN.Log2Ratio(:, 12),...
SCLC_CN.Chromosome, 1:23, SCLC_CN.Sample{12}, 'S')
In the Zhao et al., 2005 study, a high-level amplification was observed in the q12.2-q12.3 region on
chromosome 8 for SCLC samples. You can perform CBS segmentation on chromosome 8 for sample
PT_S0177T.
ps =
Sample: 'PT_S0177T'
SegmentData: [1×1 struct]
startbp =
62089326
62182830
128769526
endbp =
62182830
62729651
129006828
You can also get cytoband information for the CNVs. The function cytobandread returns cytoband
information in a structure.
ct = cytobandread('hs_cytoBand.txt')
ct =
cn_cytobands = cell(length(cnv),1);
for i = 1:length(cnv)
istart = find(ct.BandStartBPs <= startbp(i) & ct.BandEndBPs >= startbp(i) & strcmp(ct.ChromLa
iend = find(ct.BandStartBPs <= endbp(i) & ct.BandEndBPs >= endbp(i) & strcmpi(ct.ChromLabels,
if strcmpi(ct.BandLabels{istart}, ct.BandLabels{iend})
cn_cytobands{i} = ['8' ct.BandLabels{istart}];
else
cn_cytobands{i} = ['8' ct.BandLabels{istart} '-' '8' ct.BandLabels{iend}];
end
end
Create a report displaying the start positions, end positions and size of the CNVs.
Among the three regions of amplification, the 8q12-13 region has been confirmed by interphase FISH
analysis (Zhao et al., 2005).
You can also visualize the fraction of samples with copy number amplifications of at least three copies
(red), and copy number losses to less than 1.5 copies (blue), across all SNPs for all SCLC samples.
The function cghfreqplot displays the frequency of copy number alterations across multiple samples.
To better visualize the data, plot only the SNPs with gain or loss frequency over 25%.
gainThrd = log2(3/2);
lossThrd = log2(1.5/2);
cghfreqplot(SCLC_CN, 'Color', [1 0 0; 0 0 1],...
'Threshold', [gainThrd, lossThrd], 'cutoff', 0.25)
title('SCLC Summary Plot')
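The thresholds above are log2 ratios relative to two copies: log2(3/2) is about 0.585 and log2(1.5/2) is about -0.415. The following sketch shows how a gain or loss frequency per SNP could be computed by hand on a toy matrix. The matrix toyLog2 is made up, and the use of >= and <= at the thresholds is an assumption of this sketch, not a statement about how cghfreqplot applies them.

```matlab
gainThrd = log2(3/2);      % about  0.585: three copies relative to two
lossThrd = log2(1.5/2);    % about -0.415: 1.5 copies relative to two

% Toy log2 ratios: 3 SNPs (rows) by 4 samples (columns), values made up.
toyLog2 = [ 0.9  0.7  0.1  0.6;
           -0.6 -0.5 -0.1  0.0;
            0.0  0.1 -0.1  0.2];

gainFreq = mean(toyLog2 >= gainThrd, 2);   % fraction of samples with a gain
lossFreq = mean(toyLog2 <= lossThrd, 2);   % fraction of samples with a loss

% Keep SNPs whose gain or loss frequency exceeds the 25% cutoff.
keep = (gainFreq > 0.25) | (lossFreq > 0.25);
```

In this toy matrix, the first SNP is retained for its gains, the second for its losses, and the third is dropped.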
References
[1] Zhao, X., et al., "Homozygous deletions and chromosome amplifications in human lung carcinomas
revealed by single nucleotide polymorphism array analysis", Cancer Research, 65(13):5561-70, 2005.
[2] Nannya, Y., et al., "A robust algorithm for copy number detection using high-density
oligonucleotide single nucleotide polymorphism genotyping arrays", Cancer Research, 65(14):6071-8,
2005.
This example shows how to retrieve gene expression data series (GSE) from the NCBI Gene
Expression Omnibus (GEO) and perform basic analysis on the expression profiles.
Introduction
The NCBI Gene Expression Omnibus (GEO) is the largest public repository of high-throughput
microarray experimental data. GEO data have four entity types: GEO Platform (GPL), GEO
Sample (GSM), GEO Series (GSE), and curated GEO DataSet (GDS).
A Platform record describes the list of elements on the array in the experiment (e.g., cDNAs,
oligonucleotide probesets). Each Platform record is assigned a unique and stable GEO accession
number (GPLxxx).
A Sample record describes the conditions under which an individual Sample was handled, the
manipulations it underwent, and the abundance measurement of each element derived from it. Each
Sample record is assigned a unique and stable GEO accession number (GSMxxx).
A Series record defines a group of related Samples and provides a focal point and description of the
whole study. Series records may also contain tables describing extracted data, summary conclusions,
or analyses. Each Series record is assigned a unique GEO accession number (GSExxx).
More information about GEO can be found in GEO Overview. Bioinformatics Toolbox™ provides
functions that can retrieve and parse GEO format data files. GSE, GSM, GDS and GPL data can be
retrieved by using the getgeodata function. The getgeodata function can also save the retrieved
data in a text file. GEO Series records are available in SOFT format files and in tab-delimited text
format files. The function geoseriesread reads the GEO Series text format file. The geosoftread
function reads the usually quite large SOFT format files.
In this example, you will retrieve the GSE5847 data set from the GEO database, and perform statistical
testing on the data. GEO Series GSE5847 contains experimental data from a gene expression study of
tumor stroma and epithelium cells from 15 inflammatory breast cancer (IBC) cases and 35 non-
inflammatory breast cancer cases (Boersma et al. 2008).
The function getgeodata returns a structure containing data retrieved from the GEO database. You
can also save the returned data in its original format to your local file system for use in subsequent
MATLAB® sessions. Note that data in public repositories is frequently curated and updated;
therefore the results of this example might be slightly different when you use up-to-date datasets.
Use the geoseriesread function to parse the previously downloaded GSE text format file.
gseData = geoseriesread('GSE5847.txt')
gseData =
Working with GEO Series Data
The structure returned contains a Header field that stores the metadata of the Series data, and a
Data field that stores the Series matrix data.
The GSE5847 matrix data in the Data field are stored as a DataMatrix object. A DataMatrix object is
a data structure like a MATLAB 2-D array, but with additional metadata of row names and column
names. The properties of a DataMatrix can be accessed like those of other MATLAB objects.
get(gseData.Data)
Name: ''
RowNames: {22283x1 cell}
ColNames: {1x95 cell}
NRows: 22283
NCols: 95
NDims: 2
ElementClass: 'double'
The row names are the identifiers of the probe sets on the array; the column names are the GEO
Sample accession numbers.
gseData.Data(1:5, 1:5)
ans =
The Series metadata are stored in the Header field. The structure contains Series information in the
Header.Series field, and sample information in the Header.Samples field.
gseData.Header
ans =
The Series field contains the title of the experiment and the microarray GEO Platform ID.
gseData.Header.Series
ans =
gseData.Header.Samples
ans =
The data_processing field contains the information of the preprocessing methods, in this case the
Robust Multi-array Average (RMA) procedure.
gseData.Header.Samples.data_processing(1)
ans =
{'RMA'}
sampleSources = unique(gseData.Header.Samples.source_name_ch1);
sampleSources{:}
ans =
ans =
gseData.Header.Samples.characteristics_ch1(:,1)
ans =
{'IBC' }
{'Deceased'}
Determine the IBC and non-IBC labels for the samples from the
Header.Samples.characteristics_ch1 field as group labels.
sampleGrp = gseData.Header.Samples.characteristics_ch1(1,:);
The Series metadata identify the array platform as GPL96, which is an Affymetrix® GeneChip®
Human Genome U133 array set HG-U133A. Retrieve the GPL96 SOFT format file from GEO using the
getgeodata function. For example, assume you used the getgeodata function to retrieve the
GPL96 Platform file and saved it to a file, such as GPL96.txt. Use the geosoftread function to
parse this SOFT format file.
gplData = geosoftread('GPL96.txt')
gplData =
Scope: 'PLATFORM'
Accession: 'GPL96'
Header: [1x1 struct]
ColumnDescriptions: {16x1 cell}
ColumnNames: {16x1 cell}
Data: {22283x16 cell}
The ColumnNames field of the gplData structure contains names of the columns for the data:
gplData.ColumnNames
ans =
{'ID' }
{'GB_ACC' }
{'SPOT_ID' }
{'Species Scientific Name' }
{'Annotation Date' }
{'Sequence Type' }
{'Sequence Source' }
{'Target Description' }
{'Representative Public ID' }
{'Gene Title' }
{'Gene Symbol' }
{'ENTREZ_GENE_ID' }
{'RefSeq Transcript ID' }
{'Gene Ontology Biological Process'}
{'Gene Ontology Cellular Component'}
{'Gene Ontology Molecular Function'}
You can get the probe set IDs and gene symbols for the probe sets of platform GPL96.
Use gene symbols to label the genes in the DataMatrix object gseData.Data. Be aware that the
probe set IDs from the platform file may not be in the same order as in gseData.Data. In this
example they are in the same order.
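When the two orders differ, the platform annotation can be aligned to the expression matrix rows before relabeling. The following is a minimal sketch with base MATLAB ismember. The three IDs and symbols are illustrative values, and the variable names dataIDs, platformIDs, and platformSymbols are hypothetical stand-ins for the row names of gseData.Data and the corresponding columns of gplData.Data.

```matlab
% Row names of the expression matrix (order used by the Series data).
dataIDs = {'1007_s_at'; '1053_at'; '117_at'};

% Platform annotation in a different, hypothetical order.
platformIDs     = {'117_at'; '1007_s_at'; '1053_at'};
platformSymbols = {'HSPA6'; 'DDR1'; 'RFC2'};

% loc(k) is the row of platformIDs that matches dataIDs{k}.
[found, loc] = ismember(dataIDs, platformIDs);
assert(all(found), 'Some probe set IDs are missing from the platform file.');

% Gene symbols reordered to follow the expression matrix rows.
alignedSymbols = platformSymbols(loc);
```

Checking found before indexing avoids silently mislabeling rows when a probe set ID is absent from the platform file.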
Display the first five rows and five columns of the expression data with row names as gene symbols.
gseData.Data(1:5, 1:5)
ans =
Bioinformatics Toolbox and Statistics and Machine Learning Toolbox™ offer a wide spectrum of
analysis and visualization tools for microarray data analysis. However, because it is not our main goal
to explain the analysis methods in this example, you will apply only a few of the functions to the gene
expression profile from stromal cells. For more elaborate examples about feature selection tools
available, see “Select Features for Classifying High-Dimensional Data”.
The experiment was performed on IBC and non-IBC samples derived from stromal cells and epithelial
cells. In this example, you will work with the gene expression profile from stromal cells. Determine
the sample indices for the stromal cell type from the
gseData.Header.Samples.source_name_ch1 field.
nStroma = sum(stromaIdx)
nStroma =
47
nStromaIBC =
13
nStromaNonIBC =
34
You can also label the columns in stromaData with the group labels:
Display the histograms of the normalized gene expression measurements of specified genes. The
x-axes represent the normalized expression level. For example, inspect the distribution of the gene
expression values of these genes.
fID = 331:339;
figure;
for i = 1:length(fID)
subplot(3,3,i);
bar(edges, histStroma(:,i), 'histc')
xlim([-3 3])
if i <= length(fID)-3
ax = gca;
ax.XTickLabel = [];
end
title(sprintf('gene%d - %s', fID(i), stromaData.RowNames{fID(i)}))
end
sgtitle('Gene Expression Value Distributions')
The gene expression profile was assessed using the Affymetrix GeneChip with more than 22,000 features
on a small number of samples (~100). Among the 47 tumor stromal samples, there are 13 IBC and 34
non-IBC. However, not all the genes are differentially expressed between IBC and non-IBC tumors.
Statistical tests are needed to identify a gene expression signature that distinguishes IBC from non-IBC
stromal samples.
Use genevarfilter to filter out genes with a small variance across samples.
stromaData.NRows
ans =
20055
Compute a t-statistic for each gene and use the p-values to find genes that are significantly
differentially expressed between the IBC and non-IBC groups. The p-values are estimated by permuting
the samples (1,000 times for this example).
rng default
[pvalues, tscores] = mattest(stromaData(:, 'IBC'), stromaData(:, 'non-IBC'),...
'Showhist', true, 'showplot', true, 'permute', 1000);
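The idea behind a permutation-based p-value can be sketched for a single gene in base MATLAB. This is not how mattest is implemented. It is a simplified illustration that assumes an unequal-variance two-sample t-score and a two-sided comparison, and the expression values in groupA and groupB are made up.

```matlab
rng default                                   % reproducible permutations
groupA = [5.1 4.8 5.6 5.3 5.0 5.4];           % made-up expression values
groupB = [4.2 4.5 4.0 4.6 4.3 4.1 4.4];
pooled = [groupA groupB];
nA = numel(groupA);
nPerm = 1000;

% Two-sample t-score with unequal variances (an assumption of this sketch).
tscore = @(a, b) (mean(a) - mean(b)) / sqrt(var(a)/numel(a) + var(b)/numel(b));
tObs = tscore(groupA, groupB);

% Recompute the score after shuffling the group labels.
tPerm = zeros(nPerm, 1);
for k = 1:nPerm
    idx = randperm(numel(pooled));
    tPerm(k) = tscore(pooled(idx(1:nA)), pooled(idx(nA+1:end)));
end

% Two-sided permutation p-value: how often a shuffled score is as extreme.
pValue = mean(abs(tPerm) >= abs(tObs));
```

Because the p-value is a fraction of permutations, its resolution is limited to 1/nPerm, which is one reason to choose the number of permutations with the desired significance level in mind.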
ans =
52
ans =
p-values t-scores
INPP5E 2.3318e-05 5.0389
ARFRP1 /// IGLJ3 2.7575e-05 4.9753
USP46 3.4336e-05 -4.9054
GOLGB1 4.7706e-05 -4.7928
TTC3 0.00010695 -4.5053
THUMPD1 0.00013164 -4.4317
0.00016042 4.3656
MAGED2 0.00017042 -4.3444
DNAJB9 0.0001782 -4.3266
KIF1C 0.00022122 4.2504
0.00022237 -4.2482
DZIP3 0.00022414 -4.2454
COPB1 0.00023199 -4.2332
PSD3 0.00024649 -4.2138
PLEKHA4 0.00026505 4.186
DNAJB9 0.0002767 -4.1708
CNPY2 0.0002801 -4.1672
USP9X 0.00028442 -4.1619
SEC22B 0.00030146 -4.1392
GFER 0.00030506 -4.1352
References
[1] Boersma, B.J., Reimers, M., Yi, M., Ludwig, J.A., et al. "A stromal gene signature associated with
inflammatory breast cancer", International Journal of Cancer, 122(6):1324-32, 2008.
Identifying Biomolecular Subgroups Using Attractor Metagenes
This example shows workflows for the analysis of gene expression data with the attractor metagene
algorithm. Gene expression data is available for many model organisms and disease conditions. This
example shows how to use the metafeatures function to explore biomolecular phenotypes in breast
cancer.
The Cancer Genome Atlas (TCGA) includes several kinds of data across multiple cancer indications.
TCGA includes measurements of gene expression, protein expression, clinical outcomes, and more. In
this example, you explore breast cancer gene expression.
Researchers collected tumor samples, and used Agilent G4502A microarrays to measure their gene
expression. In this example you use the Level-3 expression data, which has been post-processed from
the original measurements into the expression calls. Data was retrieved May 20, 2014.
Load the data into MATLAB®. The MAT-file TCGA_Breast_Gene_Expression.mat contains gene
expression data of 17814 genes for 590 different patients. The expression data is stored in the
variable geneExpression. The gene names are stored in the variable geneNames.
load TCGA_Breast_Gene_Expression
To see the organization of the data, check the number of genes and samples in this data set.
size(geneExpression)
ans = 1×2
17814 590
geneNames is a cell array of the gene names. You can access the entries using MATLAB cell array
indexing:
geneNames{655}
ans =
'EGFR'
This cell array indicates that the 655th row of the variable geneExpression contains expression
measurements for Epidermal Growth Factor Receptor (EGFR).
The attractor metagene algorithm was developed as part of the DREAM 8 challenge to develop
prognostic biomarkers for breast cancer survival. The attractor metagene approach discovers and
quantifies underlying biomolecular events. These events reduce the dimensionality of the gene
expression data, and also allow for subtype classification and investigation of regulatory machinery
[1].
A metagene is defined as any weighted sum of gene expression. Suppose you have a collection of co-
expressed genes. You can create a metagene by averaging the expression levels of the genes in the
collection.
There is the potential to refine your understanding of the gene expression captured in this metagene.
Suppose you create a set of weights that quantify the similarity between the genes in your collection
and the metagene. Genes that are more similar to the metagene receive larger weights, while genes
that are less similar receive smaller weights. Using these new weights, you can form a new metagene
that is a weighted average of gene expression. The new metagene better captures a biomolecular
event that governs some element of gene regulation in the expression data.
This procedure forms the core of the attractor metagene algorithm. Form a metagene using some
current estimate of the weights, then update the weights based on a measure of similarity. Attractor
metagenes are defined as the attracting fixed points of this iterative process.
The algorithm exists within the broad family of unsupervised machine learning algorithms. Related
algorithms include principal component analysis, various clustering algorithms (especially fuzzy c-
means), non-negative matrix factorization, and others. The main advantage of the metagene approach
is that the results of the algorithm tend to be more clearly linked with a phenotype defined by gene
expression.
Concretely, in the ith iteration of the algorithm, you have a vector of weights, $W_i$, of size 1-by-number
of genes. The estimate of the metagene during the ith iteration is:
$M_i = W_i G$
where $G$ is the number-of-genes by number-of-samples gene expression matrix. To update the weights:
$W_{j,i+1} = J(M_i, G_j)$
where $G_j$ is the jth row of $G$ (the expression of gene j across samples) and
$J(M_i, G_j) = \mathrm{MI}(M_i, G_j)^{\alpha}$
if the correlation between $M_i$ and $G_j$ is greater than 0. MI is the mutual information between $M_i$ and
$G_j$. The function metafeatures uses the B-spline estimator of mutual information described in [3].
If, instead, the correlation between $M_i$ and $G_j$ is less than or equal to 0, then:
$J(M_i, G_j) = 0$
The weights are all greater than or equal to zero. Because mutual information is scale invariant, you
can normalize the weights in whatever way you choose. Here, they are normalized so their sum is 1.
The algorithm is initialized by either random or user-selected weights. It proceeds until the change in
Mi between iterations is small, or a prespecified number of iterations is exhausted.
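The iteration described above can be sketched on synthetic data in base MATLAB. This is an illustration only and not the implementation in metafeatures: the similarity measure here is the Pearson correlation raised to the power alpha, swapped in for the B-spline mutual information estimator, and the data, sizes, and tolerance are all made-up assumptions.

```matlab
rng(1);                                    % reproducible toy data
nGenes = 30; nSamples = 40; alpha = 5;
signal = randn(1, nSamples);               % hidden co-expression signal
G = randn(nGenes, nSamples);               % genes-by-samples matrix
G(1:5, :) = G(1:5, :) + 3 * signal;        % genes 1-5 share the signal

W = zeros(1, nGenes); W(1) = 1;            % seed the weights on gene 1
Gc = G - mean(G, 2);                       % center each gene
Gn = Gc ./ sqrt(sum(Gc.^2, 2));            % scale each row to unit norm

for iter = 1:100
    M  = W * G;                            % metagene estimate M_i = W_i * G
    Mc = M - mean(M);
    r  = (Gn * Mc') / norm(Mc);            % correlation of each gene with M
    Wnew = max(r', 0) .^ alpha;            % stand-in for J: zero if r <= 0
    Wnew = Wnew / sum(Wnew);               % normalize weights to sum to 1
    converged = norm(Wnew * G - M) < 1e-7; % small change in the metagene
    W = Wnew;
    if converged, break; end
end
[~, topGenes] = sort(W, 'descend');        % genes ranked by final weight
```

Seeding the weights on a single gene mirrors the idea of user-selected starting weights: the iteration is pulled toward the group of genes that are co-expressed with the seed.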
The data has several NaN values. To check how many, sum over an indicator returned by isnan.
sum(sum(isnan(geneExpression)))
ans = 1695
Out of the approximately 10 million entries of geneExpression, there are 1695 missing entries.
Before proceeding you will need to deal with these missing entries.
There are several ways to impute these missing values. You can use a simple method called k-nearest
neighbor imputation supplied by the Bioinformatics Toolbox™. K-nearest neighbor imputation
works by replacing missing data with the corresponding value from a weighted average of the k
nearest columns to the column with the missing data.
Use k = 3, and replace the current value of geneExpression with one that has no NaN values.
geneExpression = knnimpute(geneExpression,3);
sum(sum(isnan(geneExpression)))
ans = 0
For more information about knnimpute, see the Bioinformatics Toolbox documentation.
doc knnimpute
The function metafeatures uses the attractor metagene algorithm to identify motifs of gene
regulation.
Set up an options structure. In this case, set the display to provide information about the
algorithm at each iteration.
opts = struct('Display','iter');
metafeatures also allows for specifying start values. You can seed the starting weights to
emphasize genes that you are interested in. There are three common drivers of breast cancer: ERBB2
(also called HER2), estrogen, and progesterone.
Set the weight for each of these genes to 1 in three different columns of startValues. Each column
corresponds to initial values for a different replicate. strcmp compares the genes of interest and the
list of genes in the data set. find returns the index in the list of the gene.
erbb = find(strcmp('ERBB2',geneNames));
estrogen = find(strcmp('ESR1',geneNames));
progestrone = find(strcmp('PGR',geneNames));
startValues = zeros(size(geneExpression,1),3);
startValues(erbb,1) = 1;
startValues(estrogen,2) = 1;
startValues(progestrone,3) = 1;
Call metafeatures with the imputed data set. The second argument, geneNames is the list of all the
genes in the data set. Supplying the gene names is not required. However, the gene names can allow
exploration of the highly ranked genes that are returned by the algorithm to get insights into the
biomolecular event described by the metagene.
[meta, weights, genes_sorted] = metafeatures(geneExpression,geneNames,'start',startValues,'option
1 58 4.81e-05 8632
1 59 2.85e-05 8632
1 60 1.97e-05 8632
1 61 3.05e-05 8632
1 62 1.41e-05 8632
1 63 1.02e-05 8632
1 64 7.89e-06 8632
1 65 9.34e-06 8632
1 66 2.07e-05 8632
1 67 1.52e-05 8632
1 68 2.26e-05 8632
1 69 1.55e-05 8632
1 70 2.24e-05 8632
1 71 1.75e-05 8632
1 72 2.01e-05 8632
1 73 6.47e-06 8632
1 74 1.62e-05 8632
1 75 2.23e-05 8632
1 76 1.93e-05 8632
1 77 1.71e-05 8632
1 78 6.94e-06 8632
1 79 3.21e-06 8632
1 80 1.58e-05 8632
1 81 2.02e-05 8632
1 82 1.99e-05 8632
1 83 2.12e-05 8632
1 84 1.79e-05 8632
1 85 1.60e-05 8632
1 86 1.78e-05 8632
1 87 1.87e-05 8632
1 88 1.66e-05 8632
1 89 5.98e-06 8632
1 90 1.26e-05 8632
1 91 2.14e-05 8632
1 92 1.82e-05 8632
1 93 6.97e-06 8632
1 94 1.04e-05 8632
1 95 2.13e-05 8632
1 96 6.39e-06 8632
1 97 1.75e-05 8632
1 98 2.37e-05 8632
1 99 2.01e-05 8632
1 100 1.98e-05 8632
2 1 1.93e+01 9893
2 2 6.04e+00 9885
2 3 3.80e+00 9883
2 4 2.53e+00 9886
2 5 1.73e+00 9881
2 6 1.13e+00 9873
2 7 7.19e-01 9869
2 8 4.63e-01 9866
2 9 3.08e-01 9870
2 10 2.13e-01 9874
2 11 1.54e-01 9872
2 12 1.15e-01 9874
2 13 8.72e-02 9874
2 14 6.68e-02 9874
2 15 5.14e-02 9874
2 16 3.97e-02 9875
2 17 3.07e-02 9875
2 18 2.37e-02 9873
2 19 1.84e-02 9871
2 20 1.42e-02 9871
2 21 1.10e-02 9871
2 22 8.54e-03 9872
2 23 6.62e-03 9872
2 24 5.05e-03 9872
2 25 4.01e-03 9872
2 26 3.09e-03 9872
2 27 2.38e-03 9872
2 28 1.85e-03 9872
2 29 1.43e-03 9872
2 30 1.09e-03 9872
2 31 8.46e-04 9872
2 32 6.73e-04 9872
2 33 5.10e-04 9872
2 34 3.81e-04 9872
2 35 2.98e-04 9872
2 36 2.46e-04 9872
2 37 1.51e-04 9872
2 38 1.63e-04 9872
2 39 1.15e-04 9872
2 40 7.11e-05 9872
2 41 1.18e-04 9872
2 42 7.28e-05 9872
2 43 1.89e-05 9872
2 44 4.24e-05 9872
2 45 1.60e-05 9872
2 46 6.75e-06 9872
2 47 4.81e-05 9872
2 48 2.47e-05 9872
2 49 1.04e-05 9872
2 50 7.46e-06 9872
2 51 9.31e-06 9872
2 52 5.25e-06 9872
2 53 3.89e-05 9872
2 54 9.38e-06 9872
2 55 3.33e-05 9872
2 56 1.48e-05 9872
2 57 2.45e-05 9872
2 58 2.58e-05 9872
2 59 1.00e-05 9872
2 60 1.86e-05 9872
2 61 5.87e-05 9872
2 62 2.97e-05 9872
2 63 1.07e-05 9872
2 64 8.84e-06 9872
2 65 8.29e-06 9872
2 66 1.58e-05 9872
2 67 1.48e-05 9872
2 68 5.00e-06 9872
2 69 2.74e-05 9872
2 70 1.20e-05 9872
2 71 2.91e-05 9872
2 72 9.45e-06 9872
2 73 1.75e-05 9872
2 74 1.56e-05 9872
2 75 6.56e-06 9872
2 76 1.79e-05 9872
2 77 2.67e-05 9872
2 78 5.55e-05 9872
2 79 2.55e-05 9872
2 80 1.03e-05 9872
2 81 2.74e-05 9872
2 82 2.04e-05 9872
2 83 1.00e-05 9872
2 84 1.11e-05 9872
2 85 9.83e-06 9872
2 86 2.71e-05 9872
2 87 1.42e-05 9872
2 88 1.28e-05 9872
2 89 2.24e-05 9872
2 90 4.58e-05 9872
2 91 3.36e-05 9872
2 92 9.74e-06 9872
2 93 1.06e-05 9872
2 94 1.50e-05 9872
2 95 5.05e-05 9872
2 96 1.12e-05 9872
2 97 2.52e-05 9872
2 98 9.77e-06 9872
2 99 6.10e-06 9872
2 100 2.97e-05 9872
3 1 3.75e+00 9963
3 2 1.08e+00 9966
3 3 4.29e-01 9959
3 4 1.87e-01 9961
3 5 8.45e-02 9958
3 6 3.88e-02 9957
3 7 1.80e-02 9956
3 8 8.36e-03 9956
3 9 3.89e-03 9956
3 10 1.78e-03 9956
3 11 8.68e-04 9956
3 12 3.96e-04 9956
3 13 1.89e-04 9956
3 14 8.92e-05 9956
3 15 4.25e-05 9956
3 16 1.16e-05 9956
3 17 1.57e-05 9956
3 18 1.67e-05 9956
3 19 1.59e-05 9956
3 20 1.07e-05 9956
3 21 9.21e-06 9956
3 22 1.59e-05 9956
3 23 6.23e-06 9956
3 24 8.68e-06 9956
3 25 1.56e-05 9956
3 26 1.55e-05 9956
3 27 9.65e-06 9956
3 28 9.74e-06 9956
3 29 9.75e-06 9956
3 30 9.75e-06 9956
3 31 9.84e-06 9956
3 32 1.49e-05 9956
3 33 1.05e-05 9956
3 34 1.43e-05 9956
3 35 2.14e-05 9956
3 36 6.64e-06 9956
3 37 1.54e-06 9956
3 38 2.23e-06 9956
3 39 2.98e-06 9956
3 40 8.89e-06 9956
3 41 1.60e-05 9956
3 42 1.06e-05 9956
3 43 9.08e-06 9956
3 44 1.60e-05 9956
3 45 6.74e-06 9956
3 46 8.71e-06 9956
3 47 9.45e-06 9956
3 48 1.48e-05 9956
3 49 1.05e-05 9956
3 50 1.43e-05 9956
3 51 2.15e-05 9956
3 52 6.64e-06 9956
3 53 1.41e-06 9956
3 54 1.98e-06 9956
3 55 2.58e-06 9956
3 56 9.07e-06 9956
3 57 6.54e-06 9956
3 58 5.44e-06 9956
3 59 4.36e-06 9956
3 60 8.43e-06 9956
3 61 1.08e-05 9956
3 62 9.73e-06 9956
3 63 9.72e-06 9956
3 64 9.72e-06 9956
3 65 9.75e-06 9956
3 66 9.78e-06 9956
3 67 9.82e-06 9956
3 68 1.50e-05 9956
3 69 1.02e-05 9956
3 70 1.34e-05 9956
3 71 2.08e-05 9956
3 72 1.30e-05 9956
3 73 2.11e-05 9956
3 74 1.56e-05 9956
3 75 9.45e-06 9956
3 76 1.48e-05 9956
3 77 1.11e-05 9956
3 78 8.97e-06 9956
3 79 1.31e-05 9956
3 80 2.19e-05 9956
3 81 8.99e-06 9956
3 82 1.60e-05 9956
3 83 7.51e-06 9956
3 84 6.78e-06 9956
3 85 7.51e-06 9956
3 86 1.10e-05 9956
3 87 1.39e-05 9956
3 88 6.38e-06 9956
3 89 6.05e-06 9956
3 90 4.66e-06 9956
3 91 7.28e-06 9956
3 92 7.98e-06 9956
3 93 1.15e-05 9956
3 94 8.72e-06 9956
3 95 1.56e-05 9956
3 96 1.82e-05 9956
3 97 1.23e-05 9956
3 98 6.69e-06 9956
3 99 1.63e-06 9956
3 100 1.15e-06 9956
The variable meta has the value of the three metagenes discovered for each sample. You can plot the
three metagenes to gain insight into the nature of gene regulation across different phenotypes of
breast cancer.
plot3(meta(1,:),meta(2,:),meta(3,:),'o')
xlabel('ERBB2 metagene')
ylabel('Estrogen metagene')
zlabel('Progesterone metagene')
In the plot, there is a group of points bunched together with low values for all three metagenes.
Based on mRNA levels, the expectation is that these points are associated with tumor samples that are
triple-negative or basal type.
There is also a group of points that have high estrogen receptor metagene expression. This group
spans both high and low progesterone metagene expression. There are no points with high
progesterone metagene expression and low estrogen metagene expression. This finding is consistent
with the observation that ER-/PR+ breast cancers are extremely rare [2].
The remaining points are the ERBB2 positive cancers. They have less representation in this data set
than the hormone-driven and triple-negative cancers. There are also no firmly established
relationships between hormone receptor expression and ERBB2 status.
To develop a better understanding of the gene regulation captured by the metagenes, take a closer
look at the metagene discovered by initializing the estrogen receptor to have weight 1. You can list
the top ten genes contributing to this metagene, which is the second metagene discovered.
genes_sorted(1:10,2)
This metagene captures the biomolecular event associated with the transition to estrogen-driven
breast cancer. The four top-ranked genes listed are:
Transcriptional changes in each of these genes are implicated in estrogen-driven breast cancer. The
three genes other than ESR1 are known to be coexpressed with ESR1. Identification of these genes
illustrates the power of the attractor metagene algorithm to link gene expression with phenotypes.
Similar versions of the estrogen metagene and the ERBB2 metagene are described in [1]. The
ordering of the gene contributions differs slightly between this analysis and [1] because a different
breast cancer data set was used. Variations in the weights are to be expected, but the ordering of the
genes by weight is roughly the same. Specifically, the genes with the top 10 weights are mostly the
same between this version and the version described in [1]. Similarly, there is significant overlap
between the genes with the top 100 weights.
Genes can contribute to multiple metagenes. In this sense, the attractor metagene algorithm is a
"soft" clustering technique. In this example, finding metagenes in breast cancer data, there is overlap
in the sets of genes that have larger contribution weights to the estrogen and progesterone
metagenes.
The column sum of the elevated_weights is the total number of elevated weights in each of the
three metagenes.
sum(elevated_weights)
ans = 1×3
19 96 27
Of the 96 elevated weights for the estrogen metagene, and the 27 for the progesterone metagene,
there are 22 elevated weights that are in both sets.
sum(elevated_weights(:,2) & elevated_weights(:,3))
ans = 22
However, there is no overlap between the ERBB2 metagene and the estrogen metagene:
sum(elevated_weights(:,1) & elevated_weights(:,2))
ans = 0
as well as no overlap between the ERBB2 metagene and the progesterone metagene:
sum(elevated_weights(:,1) & elevated_weights(:,3))
ans = 0
In the similarity metric of the algorithm, the parameter alpha controls the degree of nonlinearity. As
alpha is increased, the number of metagenes tends to increase. The default alpha is 5, because this
value worked well in [1], but for different data sets or use cases, you might need to adjust alpha.
To illustrate the effects of alpha, if alpha is 1 in the breast cancer analysis, then the progesterone and
estrogen metagenes are not distinct.
[meta_alpha_1, weights_alpha_1, genes_sorted_alpha_1] = ...
metafeatures(geneExpression,geneNames,'start',startValues,'alpha',1);
In this case, only two metagenes are returned, despite the fact that we ran the algorithm three times.
size(meta_alpha_1)
ans = 1×2
2 590
This happens because, by default, metafeatures returns only the unique metagenes. The
initialization with the weight for ESR1 set to 1 and the initialization with the weight for PGR set to 1
both converge to metagenes that are effectively the same.
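To make the idea of "effectively the same" concrete, the following sketch removes near-duplicate rows from a small made-up matrix by comparing their correlation against a tolerance. This is an illustration only: the matrix M, the tolerance tol, and the correlation criterion are assumptions, and this is not necessarily the test that metafeatures applies internally.

```matlab
% Hypothetical converged metagenes, one per row (not output of metafeatures).
M = [1.0 2.0 3.0 4.0
     1.0 2.0 3.0 4.0001   % effectively the same as row 1
     4.0 1.0 3.0 2.0];

R = corrcoef(M');          % pairwise correlation between the rows of M
tol = 1e-6;                % assumed tolerance for "effectively the same"
keep = true(size(M,1),1);
for i = 2:size(M,1)
    % Drop row i if it matches an earlier row that was kept.
    if any(R(i,1:i-1) > 1 - tol & keep(1:i-1)')
        keep(i) = false;
    end
end
uniqueM = M(keep,:);
size(uniqueM)
```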
References
[1] Cheng, Wei-Yi, Tai-Hsien Ou Yang, and Dimitris Anastassiou. "Biomolecular events in cancer
revealed by attractor metagenes." PLoS computational biology 9.2 (2013): e1002920.
[2] Hefti, Marco M., et al. "Estrogen receptor negative/progesterone receptor positive breast cancer is
not a reproducible subtype." Breast Cancer Research 15.4 (2013): R68.
[3] Daub, Carsten O., et al. "Estimating mutual information using B-spline functions - an improved
similarity measure for analysing gene expression data." BMC bioinformatics 5.1 (2004): 118.
Working with the Clustergram Function
The clustergram function creates a heat map with dendrograms to show hierarchical clustering of
data. These types of heat maps have become a standard visualization method for microarray data
since first applied by Eisen et al. [1]. This example illustrates some of the options of the
clustergram function. The example uses data from the van't Veer et al. breast cancer microarray
study [2].
Importing Data
A study by van't Veer et al. investigated whether tumor ability for metastasis is obtained later in
development or inherent in the initial gene expression signature [2]. The study analyzed tumor
samples from 117 young breast cancer patients, of whom 78 were sporadic lymph-node-negative. The
gene expression profiles of these 78 patients were searched for prognostic signatures. Of the 78
patients, 44 exhibited non-recurrences within five years of surgical treatment while 34 had
recurrences. Samples were hybridized to Agilent® two-color oligonucleotide microarrays
representing approximately 25,000 human genes. The authors selected 4,918 significant genes that
had at least a two-fold differential expression relative to the reference and a p-value for being
expressed < 0.01 in at least 3 samples. By using supervised classification, the authors identified a
poor prognosis gene expression signature of 231 genes [2].
A subset of the preprocessed gene expression data from [2] is provided in the
bc_train_filtered.mat MAT-file. Samples for 78 lymph-node-negative patients are included, each
one containing the gene expression values for the 4,918 significant genes. Gene expression values
have already been preprocessed, by normalization and background subtraction, as described in [2].
load bc_train_filtered
bcTrainData
bcTrainData =
The list of 231 genes in the prognosis profile proposed by van't Veer et al. is also provided in the
bc_proggenes231.mat MAT-file. Genes are ordered according to their correlation coefficient with
the prognostic groups.
load bc_proggenes231
For this example, you will work with the 35 most positively correlated genes and the 35 most negatively
correlated genes.
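Because the list is ordered by correlation coefficient, picking the two extremes amounts to indexing both ends of the list. The following is a minimal sketch with a made-up identifier vector; geneIDs is hypothetical, and the sketch assumes the list runs from the most positive to the most negative correlation.

```matlab
% Hypothetical stand-in for an ordered list of 231 prognosis genes.
nGenes  = 231;
geneIDs = (1:nGenes)';

% Assumption: the list is sorted from most positive to most negative correlation.
k = 35;
selected  = [1:k, nGenes-k+1:nGenes];   % first 35 and last 35 positions
subsetIDs = geneIDs(selected);
numel(subsetIDs)
```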
Clustering
You will use the clustergram function to perform hierarchical clustering and generate a heat map
and dendrogram of the data. The simplest form of clustergram clusters the rows or columns of a
data set using the Euclidean distance metric and average linkage. In this example, you will cluster the
samples (columns) only.
The matrix of gene expression data, progValues, contains some missing data. These are marked as
NaN. You need to provide an imputation function name or function handle to impute values for
missing data. In this example, you will use the k-nearest neighbors imputation procedure
implemented in the function knnimpute.
The dendrogram at the top of the heat map shows the clustering of samples. The missing data are
shown in the heat map in gray. The data has been standardized across all samples for each gene, so
that the mean is 0 and the standard deviation is 1.
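The standardization described above can be illustrated with basic array operations. The following is a minimal sketch on a small made-up matrix (X is hypothetical) that scales each row to mean 0 and standard deviation 1 while ignoring missing values. It shows the arithmetic only and does not claim to reproduce the internal steps of clustergram.

```matlab
% Hypothetical expression matrix: rows are genes, columns are samples.
X = [ 1  2  3 NaN
     10 20 30  40];

mu    = mean(X, 2, 'omitnan');       % per-gene mean, skipping NaN
sigma = std(X, 0, 2, 'omitnan');     % per-gene standard deviation, skipping NaN
Z     = (X - mu) ./ sigma            % NaN entries stay NaN
```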
You can determine and change properties of a clustergram object. For example, you can find out
which distance metric was used in the clustering.
cg_s.ColumnPDist
ans =
{'Euclidean'}
Then you can change the distance metric for the columns to correlation.
cg_s.ColumnPDist = 'correlation';
By changing the distance metric from Euclidean to correlation, the tumor samples are clearly
clustered into a good prognosis group and a poor prognosis group.
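One way to see why the choice of metric matters: Euclidean distance is sensitive to the amplitude of a profile, whereas correlation distance, commonly defined as one minus the sample correlation, depends only on its shape. The following sketch uses a small made-up matrix (X is hypothetical) to compare the two.

```matlab
% Hypothetical expression profiles: rows are genes, columns are samples.
% Sample 2 has the same pattern as sample 1 at twice the amplitude;
% sample 3 has the opposite pattern.
X = [ 1  2 -1
     -1 -2  1
      2  4 -2
     -2 -4  2];

dEuc12 = norm(X(:,1) - X(:,2));   % Euclidean distance between samples 1 and 2
dEuc13 = norm(X(:,1) - X(:,3));   % Euclidean distance between samples 1 and 3

R    = corrcoef(X);               % correlation between columns
dCor = 1 - R                      % correlation distance between samples
```

Samples 1 and 2 are separated under the Euclidean metric but have a correlation distance of zero, so a correlation-based clustering groups them together.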
To see all the properties of the clustergram, simply use the get method.
get(cg_s)
Cluster: 'ROW'
RowPDist: {'Euclidean'}
ColumnPDist: {'correlation'}
Linkage: {'Average'}
Dendrogram: {}
OptimalLeafOrder: 1
LogTrans: 0
DisplayRatio: [0.2000 0.2000]
RowGroupMarker: []
ColumnGroupMarker: []
ShowDendrogram: 'on'
Standardize: 'NONE'
Symmetric: 1
DisplayRange: 3
Next, you will cluster both the rows and the columns of the data to produce a heat map with two
dendrograms. In this example, the left dendrogram shows the clustering of the genes (rows), and the
top dendrogram shows the clustering of the samples (columns).
You can also change the dendrogram option to differentiate clusters of genes and clusters of samples
with distances 1 unit apart.
cg.Dendrogram = 1;
You can zoom in, zoom out and pan the heat map by selecting the corresponding toolbar buttons or
menu items.
Click the Data Cursor button or select Tools > Data Cursor to activate Data Cursor Mode. In
this mode, click the heat map to display a data tip showing the expression value, the gene label, and
the sample label of the current data point. You can click-drag the data tip to other data points in the
heat map. To delete the data tip, right-click, then select Delete Current Datatip from the context
menu.
Click the Insert Colorbar button to show the color scale of the heat map.
To interact with the dendrogram, be sure that Data Cursor Mode is deactivated (click the Data
Cursor button again). Move the mouse over the dendrogram. When the mouse is over a branch node,
a red marker appears and the branch is highlighted.
Click and hold the red marker to display a data tip with the group number and the number of nodes
in the group. If the space is available, it also displays the labels for the nodes. For example, mouse
over and click on a dendrogram clustering group of the samples.
Right-click the red marker to display a context menu. From the context menu you can change the
dendrogram color for the selected group, print the group to a separate Figure window, copy the group
to a new Clustergram window, export it as a clustergram object to the MATLAB® Workspace, or
export the clustering group information as a structure to the MATLAB® Workspace.
For example, select group 55 from the gene clustering dendrogram, and export it to the MATLAB®
Workspace by right-clicking then selecting Export Group to Workspace. You can view the
dendrograms and heat map for this clustergram object in a new Clustergram window by using the
view method.
The default color scheme is the red-green color scale that is widely used in microarray data analysis.
In this example, a different color scheme may be more useful. The colormap option allows you to
specify an alternate colormap.
cg.Colormap = redbluecmap;
cg.DisplayRange = 2;
The clustergram function also lets you add color markers and text labels to annotate specific
regions of rows or columns. For example, to denote specific dendrogram groups of genes and groups
of samples, create structure arrays to specify the annotations for each dimension.
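A sketch of what such annotation structures can look like follows. The field names GroupNumber, Annotation, and Color follow the documented format of the RowGroupMarker and ColumnGroupMarker properties; the group numbers, labels, and colors here are placeholders, so read the real group numbers from the dendrogram data tips of your own clustergram.

```matlab
% Placeholder group numbers and labels; substitute values taken from
% the dendrogram data tips of your own clustergram.
gene_markers = struct('GroupNumber', {26, 29}, ...
                      'Annotation',  {'Gene group A', 'Gene group B'}, ...
                      'Color',       {'b', 'm'});
sample_markers = struct('GroupNumber', {59, 65}, ...
                        'Annotation',  {'Sample group 1', 'Sample group 2'}, ...
                        'Color',       {[1 1 0], [0.6 0.6 1]});

% With a clustergram object cg in the workspace, the markers would be applied as:
% cg.RowGroupMarker    = gene_markers;
% cg.ColumnGroupMarker = sample_markers;
```

Passing cell arrays to struct creates one structure element per cell, so each element describes one dendrogram group.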
In this example, you will perform hierarchical clustering for almost 5,000 genes of the filtered data
[2].
cg_all = clustergram(bcTrainData.Log10Ratio,...
'RowLabels', bcTrainData.Accession,...
'ColumnLabels', bcTrainData.Samples,...
'RowPdist', 'correlation',...
'ColumnPdist', 'correlation',...
'Displayrange', 0.6,...
'Standardize', 3,...
'ImputeFun', @knnimpute)
Tip: When working with large data sets, MATLAB® can run out of memory during the clustering
computation. You can convert double precision data to single precision using the single function.
Note that the gene expression data in bcTrainData are already single precision.
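As a rough illustration of the memory saving, the following sketch converts a made-up double-precision matrix (Xd is hypothetical) to single precision and compares the storage that each variable needs.

```matlab
Xd = rand(1000, 78);        % hypothetical double-precision expression matrix
Xs = single(Xd);            % same values stored at single precision

infoD = whos('Xd');
infoS = whos('Xs');
ratio = infoD.bytes / infoS.bytes   % double uses 8 bytes per element, single uses 4
```

The trade-off is precision: single keeps roughly 7 significant decimal digits, which is usually ample for log-ratio expression values.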
You can resize a clustergram window like any other MATLAB® Figure window by click-dragging the
edge of the window.
If you want even more control over the clustering, you can use the clustering functions in the
Statistics and Machine Learning Toolbox™ directly. For examples of how to do this, see “Gene
Expression Profile Analysis” on page 4-95.
References
[1] Eisen, M. B., et al., "Cluster analysis and display of genome-wide expression patterns", PNAS,
95(25):14863-8, 1998.
[2] van't Veer, L., et al., "Gene expression profiling predicts clinical outcome of breast cancer",
Nature, 415(6871):530-6, 2002.
Working with Objects for Microarray Experiment Data
This example shows how to create and manipulate MATLAB® containers designed for storing data
from a microarray experiment.
Microarray experimental data are very complex, usually consisting of data and information from a
number of different sources. Storing and managing the large and complex data sets in a coherent
manner is a challenge. Bioinformatics Toolbox™ provides a set of objects to represent the different
pieces of data from a microarray experiment.
The ExpressionSet class is a single, convenient data structure for storing and managing different
types of data from a microarray gene expression experiment.
An ExpressionSet object consists of these four components that are common to all microarray gene
expression experiments:
• Experiment data: Expression values from microarray experiments. These data are stored as an
instance of the ExptData class.
• Sample information: The metadata describing the samples in the experiment. The sample metadata
are stored as an instance of the MetaData class.
• Array feature annotations: The annotations about the features or probes on the array used in the
experiment. The annotations can be stored as an instance of the MetaData class.
• Experiment descriptions: Information to describe the experiment methods and conditions. The
information can be stored as an instance of the MIAME class.
The ExpressionSet class coordinates and validates these data components. The class provides
methods for retrieving and setting the data stored in an ExpressionSet object. An ExpressionSet
object also behaves like many other MATLAB data structures that can be subsetted and copied.
Experiment Data
In a microarray gene expression experiment, the measured expression values for each feature per
sample can be represented as a two-dimensional matrix. The matrix has F rows and S columns, where
F is the number of features on the array, and S is the number of samples on which the expression
values were measured. A DataMatrix object is a two-dimensional matrix that you can index by row
and column numbers, logical vectors, or row and column names.
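A small DataMatrix object for experimenting with this kind of indexing can be created from random numbers, as in the following sketch. It requires Bioinformatics Toolbox. The prefixes 'Feature' and 'Sample' and the 5-by-4 size are assumptions chosen to match the names and dimensions that appear in the output below.

```matlab
rng('default');   % reset the random number generator so the values are reproducible

% A single character vector given for RowNames or ColNames is used as a
% prefix, producing Feature1..Feature5 and Sample1..Sample4.
dm = bioma.data.DataMatrix(rand(5,4), 'RowNames', 'Feature', 'ColNames', 'Sample')
```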
dm =
The function size returns the number of rows and columns in a DataMatrix object.
size(dm)
ans =
5 4
You can index into a DataMatrix object like other MATLAB numeric arrays by using row and column
numbers. For example, you can access the elements at rows 1 and 2, column 3.
dm(1:2, 3)
ans =
Sample3
Feature1 0.15761
Feature2 0.97059
You can also index into a DataMatrix object by using its row and column names. Reassign the
elements in rows 2 and 3, columns 1 and 4 to different values.
dm({'Feature2', 'Feature3'}, {'Sample1', 'Sample4'}) = [2, 3; 4, 5]
dm =
The gene expression data used in this example is a small set of data from a microarray experiment
profiling adult mouse gene expression patterns in common strains on the Affymetrix® MG-U74Av2
array [1].
Read the expression values from the tab-formatted file mouseExprsData.txt into the MATLAB
Workspace as a DataMatrix object.
exprsData = bioma.data.DataMatrix('file', 'mouseExprsData.txt');
class(exprsData)
ans =
'bioma.data.DataMatrix'
get(exprsData)
Name: 'mouseExprsData'
RowNames: {500x1 cell}
ColNames: {1x26 cell}
NRows: 500
NCols: 26
NDims: 2
ElementClass: 'double'
colnames(exprsData)
ans =
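Columns 1 through 8
{'A'} {'B'} {'C'} {'D'} {'E'} {'F'} {'G'} {'H'}
Columns 9 through 16
{'I'} {'J'} {'K'} {'L'} {'M'} {'N'} {'O'} {'P'}
Columns 17 through 24
{'Q'} {'R'} {'S'} {'T'} {'U'} {'V'} {'W'} {'X'}
Columns 25 through 26
{'Y'} {'Z'}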
exprsData(1:10, 1:5)
ans =
                    A        B        C        D        E
100001_at        2.26    20.14    31.66    14.58    16.04
100002_at      158.86   236.25   206.27   388.71   388.09
100003_at       68.11   105.45    82.92     82.9    60.38
100004_at       74.32    96.68    84.87    72.26    98.38
100005_at       75.05    53.17    57.94    60.06    63.91
100006_at       80.36    42.89    77.21    77.24    40.31
100007_at      216.64   191.32   219.48   237.28   298.18
100009_r_at    3806.7     1425   2468.5   2172.7   2237.2
100010_at         NaN      NaN      NaN     7.18    22.37
100011_at       81.72    72.27   127.61    91.01    98.13
exprsData_log2 = log2(exprsData);
exprsData_log2(1:10, 1:5)
ans =
                    A        B        C        D        E
100001_at      1.1763    4.332   4.9846   3.8659   4.0036
100002_at      7.3116   7.8842   7.6884   8.6026   8.6002
100003_at      6.0898   6.7204   6.3736   6.3733    5.916
100004_at      6.2157   6.5951   6.4072   6.1751   6.6203
100005_at      6.2298   5.7325   5.8565   5.9083    5.998
100006_at      6.3284   5.4226   6.2707   6.2713   5.3331
100007_at      7.7592   7.5798   7.7779   7.8904     8.22
100009_r_at    11.894   10.477   11.269   11.085   11.127
100010_at         NaN      NaN      NaN    2.844   4.4835
100011_at      6.3526   6.1753   6.9956    6.508   6.6166
In a microarray experiment, the data set often contains one or more matrices that have the same
number of rows and columns and identical row names and column names. The ExptData class is
designed to contain and coordinate one or more data matrices that have identical row and column
names and the same dimensions. The data values are stored as DataMatrix objects. Each
DataMatrix object is an element of an ExptData object. The ExptData class is responsible for data
validation and coordination between these DataMatrix objects.
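The following sketch shows how two matching DataMatrix objects might be combined into one ExptData object. It requires Bioinformatics Toolbox, and the small matrices and names (natural, logged, f1, A, and so on) are made up for illustration. In this example you would pass the exprsData and exprsData_log2 objects instead.

```matlab
% Small hypothetical matrices with identical row and column names.
natural = bioma.data.DataMatrix([2 4; 8 16; 32 64], {'f1','f2','f3'}, {'A','B'});
logged  = log2(natural);     % element-wise log2 keeps the row and column names

% Each DataMatrix becomes one named element of the ExptData object.
ed = bioma.data.ExptData(natural, logged, ...
        'ElementNames', {'naturalExprs', 'log2Exprs'})
```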
Store the natural-scale expression values and the log2-transformed expression values as separate
elements of an ExptData object.
mouseExptData =
Experiment Data:
500 features, 26 samples
2 elements
Element names: naturalExprs, log2Exprs
exprsData2 = mouseExptData('log2Exprs');
get(exprsData2)
Sample Metadata
The metadata about the samples in a microarray experiment can be represented as a table with S
rows and V columns, where S is the number of samples, and V is the number of variables. The
contents of the table are the values of each variable for each sample. For example, the file
mouseSampleData.txt contains such a table. The description of each sample variable is marked by
a # symbol.
The MetaData class is designed for storing and manipulating variable values and their metadata in a
coordinated fashion. You can read the mouseSampleData.txt file into MATLAB as a MetaData
object.
sData = bioma.data.MetaData('file', 'mouseSampleData.txt', 'vardescchar', '#')
sData =
Sample Names:
A, B, ...,Z (26 total)
Variable Names and Meta Information:
VariableDescription
Gender {' Gender of the mouse in study' }
Age {' The number of weeks since mouse birth'}
Type {' Genetic characters' }
Strain {' The mouse strain' }
Source {' The tissue source for RNA collection' }
The properties of the MetaData class provide information about the samples and variables.
numSamples = sData.NSamples
numVariables = sData.NVariables
numSamples =
26
numVariables =
The variable values and the variable descriptions for the samples are stored as two dataset arrays
in a MetaData object. The MetaData class provides access methods to the variable values and the
meta information describing the variables.
sData.variableValues
ans =
Source
A {'amygdala' }
B {'amygdala' }
C {'amygdala' }
D {'amygdala' }
E {'amygdala' }
F {'amygdala' }
G {'amygdala' }
H {'cingulate cortex'}
I {'cingulate cortex'}
J {'cingulate cortex'}
K {'cingulate cortex'}
L {'cingulate cortex'}
M {'cingulate cortex'}
N {'cingulate cortex'}
O {'hippocampus' }
P {'hippocampus' }
Q {'hippocampus' }
R {'hippocampus' }
S {'hippocampus' }
T {'hippocampus' }
U {'hypothalamus' }
V {'hypothalamus' }
W {'hypothalamus' }
X {'hypothalamus' }
Y {'hypothalamus' }
Z {'hypothalamus' }
summary(sData.variableValues)
The sampleNames and variableNames methods are convenient ways to access the names of
samples and variables. Retrieve the variable names of the sData object.
variableNames(sData)
ans =
You can retrieve the meta information about the variables describing the samples using the
variableDesc method. In this example, it contains only the descriptions about the variables.
variableDesc(sData)
ans =
VariableDescription
Gender {' Gender of the mouse in study' }
Age {' The number of weeks since mouse birth'}
Type {' Genetic characters' }
Strain {' The mouse strain' }
Source {' The tissue source for RNA collection' }
You can subset the sample data sData object using numerical indexing.
sData(3:6, :)
ans =
Sample Names:
C, D, ...,F (4 total)
Variable Names and Meta Information:
VariableDescription
Gender {' Gender of the mouse in study' }
Age {' The number of weeks since mouse birth'}
Type {' Genetic characters' }
Strain {' The mouse strain' }
Source {' The tissue source for RNA collection' }
You can display the mouse strain of specific samples by using numerical indexing.
sData.Strain([2 14])
ans =
{'129S6/SvEvTac'}
{'C57BL/6J' }
Note that the row names in sData and the column names in exprsData are the same. This is an
important relationship between the expression data and the sample data in the same experiment.
all(ismember(sampleNames(sData), colnames(exprsData)))
ans =
logical
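Note that ismember tests membership only, so it does not detect a difference in order. The following sketch uses made-up name lists (all variable names are hypothetical) to show a membership check alongside a stricter order check.

```matlab
sampleIDs     = {'A'; 'B'; 'C'};   % hypothetical row names of the sample metadata
exprsCols     = {'A', 'B', 'C'};   % hypothetical column names of the expression data
exprsShuffled = {'C', 'A', 'B'};   % same names in a different order

% Same set of names: every name is present and the counts agree.
sameSet = all(ismember(sampleIDs, exprsShuffled)) && ...
          numel(sampleIDs) == numel(exprsShuffled);

% Same names in the same order (compare as column vectors).
sameOrder         = isequal(sampleIDs(:), exprsCols(:));
sameOrderShuffled = isequal(sampleIDs(:), exprsShuffled(:));
```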
The metadata about the features or probe set on an array can be very large and diverse. The chip
manufacturers usually provide a specific annotation file for the features of each type of array. The
metadata can be stored as a MetaData object for a specific experiment. In this example, the
annotation file for the MG-U74Av2 array can be downloaded from the Affymetrix web site. You will
need to convert the file from CSV to XLSX format using a spreadsheet software application.
Read the entire file into MATLAB as a dataset array. Alternatively, you can use the Range option in
the dataset constructor. Any blank spaces in the variable names are removed to make them valid
MATLAB variable names. A warning is displayed each time this happens.
mgU74Av2 = table2dataset(readtable('MG_U74Av2_annot.xlsx'));
Warning: Column headers from the file were modified to make them valid MATLAB
identifiers before creating variable names for the table. The original column
headers are saved in the VariableDescriptions property.
Set 'VariableNamingRule' to 'preserve' to use the original column headers as
table variable names.
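To preview how a column header might be turned into a valid identifier, you can call matlab.lang.makeValidName on the header text. In the following sketch the header strings are assumed examples of what such an annotation file can contain, and the sketch assumes that the renaming behaves like makeValidName. It is not a statement of how readtable is implemented.

```matlab
% Assumed examples of column headers containing spaces and parentheses.
original = {'Probe Set ID', 'Gene Symbol', 'Transcript ID(Array Design)'};

% Spaces are removed and characters that are not allowed in identifiers
% are replaced with underscores.
valid = matlab.lang.makeValidName(original)
```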
get(mgU74Av2)
Description: ''
VarDescription: {1x43 cell}
Units: {}
DimNames: {'Row' 'Variables'}
UserData: []
ObsNames: {}
VarNames: {1x43 cell}
ans =
12488
Retrieve the names of variables describing the features on the array and view the first 20 variable
names.
fDataVariables = get(mgU74Av2, 'VarNames');
fDataVariables(1:20)'
ans =
{'ProbeSetID' }
{'GeneChipArray' }
{'SpeciesScientificName' }
{'AnnotationDate' }
{'SequenceType' }
{'SequenceSource' }
{'TranscriptID_ArrayDesign_'}
{'TargetDescription' }
{'RepresentativePublicID' }
{'ArchivalUniGeneCluster' }
{'UniGeneID' }
{'GenomeVersion' }
{'Alignments' }
{'GeneTitle' }
{'GeneSymbol' }
{'ChromosomalLocation' }
{'UnigeneClusterType' }
{'Ensembl' }
{'EntrezGene' }
{'SwissProt' }
Set the ObsNames property to the probe set IDs, so that you can access individual gene annotations
by indexing with probe set IDs.
mgU74Av2 = set(mgU74Av2,'ObsNames',mgU74Av2.ProbeSetID);
mgU74Av2('100709_at',{'GeneSymbol','ChromosomalLocation'})
ans =
GeneSymbol ChromosomalLocation
100709_at {'Tpbpa'} {'chr13 B2|13 36.0 cM'}
In some cases, it is useful to extract specific annotations that are relevant to the analysis. Extract
annotations for GeneTitle, GeneSymbol, ChromosomalLocation, and Pathway relative to the
features in exprsData.
mgU74Av2 = mgU74Av2(:,{'GeneTitle',...
'GeneSymbol',...
'ChromosomalLocation',...
'Pathway'});
mgU74Av2 = mgU74Av2(rownames(exprsData),:);
get(mgU74Av2)
Description: ''
VarDescription: {1x4 cell}
Units: {}
DimNames: {'Row' 'Variables'}
UserData: []
ObsNames: {500x1 cell}
VarNames: {1x4 cell}
You can store the feature annotation dataset array as an instance of the MetaData class.
fData = bioma.data.MetaData(mgU74Av2)
fData =
Sample Names:
100001_at, 100002_at, ...,100717_at (500 total)
Variable Names and Meta Information:
VariableDescription
GeneTitle {'NA'}
GeneSymbol {'NA'}
ChromosomalLocation {'NA'}
Pathway {'NA'}
Notice that there are no descriptions for the feature variables in the fData MetaData object. You
can add descriptions about the variables in fData using the variableDesc method.
fData = variableDesc(fData, {'Gene title of a probe set',...
'Probe set gene symbol',...
'Probe set chromosomal locations',...
'The pathway the genes involved in'})
fData =
Sample Names:
100001_at, 100002_at, ...,100717_at (500 total)
Variable Names and Meta Information:
VariableDescription
GeneTitle {'Gene title of a probe set' }
GeneSymbol {'Probe set gene symbol' }
ChromosomalLocation {'Probe set chromosomal locations' }
Pathway {'The pathway the genes involved in'}
Experiment Information
The MIAME class is a flexible data container designed for a collection of basic descriptions about a
microarray experiment, such as investigators, laboratories, and array designs. The MIAME class
loosely follows the Minimum Information About a Microarray Experiment (MIAME) specification [2].
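A MIAME object can be built from name-value pairs. The following is a minimal sketch that requires Bioinformatics Toolbox. The investigator, laboratory, and note text are taken from the output below, while the abstract text is a placeholder invented for illustration.

```matlab
% The abstract text is a made-up placeholder.
expDesc = bioma.data.MIAME('Investigator', 'Jane OneName', ...
                           'Lab',          'Bioinformatics Laboratory', ...
                           'Abstract',     'An example of microarray objects.', ...
                           'Other',        {'Notes: Created from a text files.'})
```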
expDesc =
Experiment Description:
Author name: Jane OneName
Laboratory: Bioinformatics Laboratory
Contact information:
URL:
PubMedIDs:
Abstract: A 5 word abstract is available. Use the Abstract property.
No experiment design summary available.
Other notes:
{'Notes: Created from a text files.'}
Another way to create a MIAME object is from GEO series data. The MIAME class will populate the
corresponding properties from the GEO series structure. The information associated with the gene
profile experiment in this example is available from the GEO database under the accession number
GSE3327 [1]. Retrieve the GEO Series data using the getgeodata function.
getgeodata('GSE3327', 'ToFile', 'GSE3327.txt');
geoSeries =
exptGSE3327 =
Experiment Description:
Author name: Iiris,,Hovatta
David,J,Lockhart
Carrolee,,Barlow
Laboratory: The Salk Institute for Biological Studies
Contact information: Carrolee,,Barlow
URL:
PubMedIDs: 16244648
Abstract: A 14 word abstract is available. Use the Abstract property.
Experiment Design: A 8 word summary is available. Use the ExptDesign property.
Other notes:
{'ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/series/GSE3327/GSE3327_RAW.tar'}
abstract = exptGSE3327.Abstract
pubmedID = exptGSE3327.PubMedID
abstract =
pubmedID =
'16244648'
The ExpressionSet class is designed specifically for microarray gene expression experiment data.
Assemble an ExpressionSet object for the example mouse gene expression experiment from the
different data objects you just created.
exptSet =
ExpressionSet
Experiment Data: 500 features, 26 samples
Element names: Expressions
Sample Data:
Sample names: A, B, ...,Z (26 total)
Sample variable names and meta information:
Gender: Gender of the mouse in study
Age: The number of weeks since mouse birth
Type: Genetic characters
Strain: The mouse strain
Source: The tissue source for RNA collection
Feature Data:
Feature names: 100001_at, 100002_at, ...,100717_at (500 total)
Feature variable names and meta information:
GeneTitle: Gene title of a probe set
GeneSymbol: Probe set gene symbol
ChromosomalLocation: Probe set chromosomal locations
Pathway: The pathway the genes involved in
Experiment Information: use 'exptInfo(obj)'
You can also create an ExpressionSet object with only the expression values in a DataMatrix or a
numeric matrix.
miniExprSet = bioma.ExpressionSet(exprsData)
miniExprSet =
ExpressionSet
Experiment Data: 500 features, 26 samples
Element names: Expressions
Sample Data: none
Feature Data: none
Experiment Information: none
The data objects for a microarray experiment can be saved as MAT files. Save the ExpressionSet
object exptSet to a MAT file named mouseExpressionSet.mat.
save mouseExpressionSet exptSet
ans =
{'Expressions'}
exptSet.NSamples
ans =
26
exptSet.NFeatures
ans =
500
A number of methods are available to access and update data stored in an ExpressionSet object.
You can access the columns of the sample data using dot notation.
exptSet.Strain(1:5)
ans =
{'129S6/SvEvTac'}
{'129S6/SvEvTac'}
{'129S6/SvEvTac'}
{'A/J ' }
{'A/J ' }
Retrieve the feature names using the featureNames method. In this example, the feature names are
the probe set identifiers on the array.
featureNames(exptSet, 1:5)
ans =
{'100001_at'}
{'100002_at'}
{'100003_at'}
{'100004_at'}
{'100005_at'}
The unique identifier of the samples can be accessed via the sampleNames method.
exptSet.sampleNames(1:5)
ans =
The sampleVarNames method lists the variable names in the sample data.
exptSet.sampleVarNames
ans =
sDataset = sampleVarValues(exptSet)
sDataset =
Source
A {'amygdala' }
B {'amygdala' }
C {'amygdala' }
D {'amygdala' }
E {'amygdala' }
F {'amygdala' }
G {'amygdala' }
H {'cingulate cortex'}
I {'cingulate cortex'}
J {'cingulate cortex'}
K {'cingulate cortex'}
L {'cingulate cortex'}
M {'cingulate cortex'}
N {'cingulate cortex'}
O {'hippocampus' }
P {'hippocampus' }
Q {'hippocampus' }
R {'hippocampus' }
S {'hippocampus' }
T {'hippocampus' }
U {'hypothalamus' }
V {'hypothalamus' }
W {'hypothalamus' }
X {'hypothalamus' }
Y {'hypothalamus' }
Z {'hypothalamus' }
Retrieve the ExptData object containing expression values. There may be more than one
DataMatrix object with identical dimensions in an ExptData object. In an ExpressionSet object,
there is always an element DataMatrix object named Expressions containing the expression
matrix.
exptDS = exptData(exptSet)
exptDS =
Experiment Data:
500 features, 26 samples
1 elements
Element names: Expressions
dMatrix = expressions(exptSet);
The returned expression DataMatrix should be identical to the exprsData DataMatrix object that
you created earlier.
get(dMatrix)
Name: 'mouseExprsData'
RowNames: {500x1 cell}
ColNames: {1x26 cell}
NRows: 500
NCols: 26
NDims: 2
ElementClass: 'double'
exptSet.pubMedID
ans =
'16244648'
You can subset an ExpressionSet object so that you can focus on the samples and features of
interest. The first indexing argument subsets the features and the second argument subsets the
samples.
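The indexing order can be tried on a throwaway object first. The following sketch requires Bioinformatics Toolbox and builds an ExpressionSet from a made-up numeric matrix. The variable names and values are hypothetical.

```matlab
X  = magic(4);                  % hypothetical data: 4 features by 4 samples
es = bioma.ExpressionSet(X);    % ExpressionSet holding only expression values

sub = es(1:2, [1 3]);           % first index selects features, second selects samples
size(sub)
```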
Create a new ExpressionSet object consisting of the first five features and the samples named A, B,
and C.
mySet = exptSet(1:5, {'A', 'B', 'C'})
mySet =
ExpressionSet
Experiment Data: 5 features, 3 samples
Element names: Expressions
Sample Data:
Sample names: A, B, C
Sample variable names and meta information:
Gender: Gender of the mouse in study
Age: The number of weeks since mouse birth
Type: Genetic characters
Strain: The mouse strain
Source: The tissue source for RNA collection
Feature Data:
Feature names: 100001_at, 100002_at, ...,100005_at (5 total)
Feature variable names and meta information:
GeneTitle: Gene title of a probe set
GeneSymbol: Probe set gene symbol
ChromosomalLocation: Probe set chromosomal locations
Pathway: The pathway the genes involved in
Experiment Information: use 'exptInfo(obj)'
size(mySet)
ans =
5 3
featureNames(mySet)
ans =
{'100001_at'}
{'100002_at'}
{'100003_at'}
{'100004_at'}
{'100005_at'}
sampleNames(mySet)
ans =
You can also create a subset consisting of only the samples from hippocampus tissues.
hippocampusSet = exptSet(:, nominal(exptSet.Source)== 'hippocampus')
hippocampusSet =
ExpressionSet
Experiment Data: 500 features, 6 samples
Element names: Expressions
Sample Data:
Sample names: O, P, ...,T (6 total)
Sample variable names and meta information:
Gender: Gender of the mouse in study
Age: The number of weeks since mouse birth
Type: Genetic characters
Strain: The mouse strain
Source: The tissue source for RNA collection
Feature Data:
Feature names: 100001_at, 100002_at, ...,100717_at (500 total)
Feature variable names and meta information:
GeneTitle: Gene title of a probe set
GeneSymbol: Probe set gene symbol
ChromosomalLocation: Probe set chromosomal locations
Pathway: The pathway the genes involved in
Experiment Information: use 'exptInfo(obj)'
hippocampusSet.Source
ans =
{'hippocampus'}
{'hippocampus'}
{'hippocampus'}
{'hippocampus'}
{'hippocampus'}
{'hippocampus'}
hippocampusExprs = expressions(hippocampusSet);
get(hippocampusExprs)
Name: 'mouseExprsData'
RowNames: {500x1 cell}
ColNames: {'O' 'P' 'Q' 'R' 'S' 'T'}
NRows: 500
NCols: 6
NDims: 2
ElementClass: 'double'
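The tissue-based subset relies on a logical mask over the samples. The following sketch builds such a mask with strcmp on a made-up cell array (source is hypothetical). Passing a mask built this way as the second index of an ExpressionSet is an assumed alternative to the nominal comparison used above.

```matlab
% Hypothetical tissue labels, one per sample.
source = {'amygdala'; 'hippocampus'; 'hippocampus'; 'hypothalamus'};

isHippo  = strcmp(source, 'hippocampus');   % true where the label matches exactly
hippoIdx = find(isHippo)'                   % positions of the matching samples
```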
References
[1] Hovatta, I., et al., "Glyoxalase 1 and glutathione reductase 1 regulate anxiety in mice", Nature,
438(7068):662-6, 2005.
[2] Brazma, A., et al., "Minimum information about a microarray experiment (MIAME) - toward
standards for microarray data", Nat. Genet. 29(4):365-371, 2001.
5
Phylogenetic Analysis
The Phylogenetic Tree app can read data from Newick and ClustalW tree formatted files.
This procedure uses the phylogenetic tree data stored in the file pf00002.tree as an example. The
data was retrieved from the protein family (PFAM) Web database and saved to a file using the
accession number PF00002 and the function gethmmtree.
1 Create a phytree object. For example, to create a phytree object from tree data in the file
pf00002.tree, type
tr = phytreeread('pf00002.tree')
phytreeviewer(tr)
Using the Phylogenetic Tree App
File Menu
The File menu includes the standard commands for opening and closing a file, and it includes
commands to use phytree object data from the MATLAB Workspace. The File menu commands are
shown below.
Use the New Viewer command to open tree data from a file into a second Phylogenetic Tree viewer.
• MATLAB Workspace — Select the Import from Workspace option, and then select a
phytree object from the list.
• File — Select the Open phylogenetic tree file option, click the Browse button, select a
directory, select a file with the extension .tree, and then click Open. The toolbox uses the
file extension .tree for Newick-formatted files, but you can use any Newick-formatted file
with any extension.
A second Phylogenetic Tree viewer opens with tree data from the selected file.
Open Command
Use the Open command to read tree data from a Newick-formatted file and display that data in the
app.
The app replaces the current tree data with data from the selected file.
Use the Import from Workspace command to read tree data from a phytree object in the MATLAB
Workspace and display the data using the app.
The app replaces the current tree data with data from the selected object.
There may be times when you make changes that you would like to undo. The Phylogenetic Tree
app does not have an undo command, but you can get back to the original tree you started viewing
with the Open Original in New Viewer command.
Save As Command
After you create a phytree object or prune a tree from existing data, you can save the resulting tree
in a Newick-formatted file. The sequence data used to create the phytree object is not saved with
the tree.
The app saves tree data without the deleted branches, and it saves changes to branch and leaf
names. Formatting changes such as branch rotations, collapsed branches, and zoom settings are
not saved in the file.
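The same save can also be scripted from the command line. The following is a minimal sketch, assuming a toy tree typed as Newick text and a placeholder file name (demo_tree.tree in the temporary folder); neither comes from the example data.

```matlab
% Build a small tree from Newick text (hypothetical toy data).
tr = phytreeread('((A:1,B:2):0.5,C:3);');

% Write the tree to a Newick-formatted file, then read it back.
outFile = fullfile(tempdir,'demo_tree.tree');   % placeholder file name
phytreewrite(outFile,tr);
tr2 = phytreeread(outFile);

% The round trip should preserve the leaves of the tree.
leafNames = get(tr2,'LeafNames');
numLeaves = get(tr2,'NumLeaves');
```

As with Save As in the app, only the tree is written; any sequence data used to build it must be saved separately.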
Because some of the Phylogenetic Tree viewer commands cannot be undone (for example, the Prune
command), you might want to make a copy of your tree before trying a command. At other times, you
might want to compare two views of the same tree, and copying a tree to a new tool window allows
you to make changes to both tree views independently.
1 Select File > Export to New Viewer, and then select either With Hidden Nodes or Only
Displayed.
The Phylogenetic Tree app can open Newick-formatted files with tree data. However, it does not
create a phytree object in the MATLAB Workspace. If you want to programmatically explore
phylogenetic trees, you need to use the Export to Workspace command.
1 Select File > Export to Workspace, and then select either With Hidden Nodes or Only
Displayed.
3 Click OK.
After you have explored the relationships between branches and leaves in your tree, you can copy the
tree to a MATLAB Figure window. Using a Figure window lets you use all the features for annotating,
changing font characteristics, and getting your figure ready for publication. Also, from the Figure
window, you can save an image of the tree as it was displayed in the Phylogenetic Tree app.
1 From the File menu, select Print to Figure, and then select either With Hidden Nodes or
Only Displayed.
'radial'
Tip This rendering type hides the significance of the root node
and emphasizes clusters, thereby making it useful for visually
assessing clusters and detecting outliers.
'equaldaylight'
Tip This rendering type hides the significance of the root node
and emphasizes clusters, thereby making it useful for visually
assessing clusters and detecting outliers.
3 Select the Display Labels you want on your figure. You can select from all to none of the
options.
When you print from the Phylogenetic Tree app or a MATLAB Figure window (with a tree published
from the viewer), you can specify setup options for printing a tree.
The Print Preview window opens, which you can use to select page formatting options.
2 Select the page formatting options and values you want, and then click Print.
Print Command
Use the Print command to make a copy of your phylogenetic tree after you use the Print Preview
command to select formatting options.
Tools Menu
Use the Tools menu to:
The Tools menu and toolbar contain most of the commands specific to trees and phylogenetic
analysis. Use these commands and modes to edit and format your tree interactively. The Tools menu
commands are:
Inspect Mode
Viewing a phylogenetic tree in the Phylogenetic Tree app provides a rough idea of how closely
related two sequences are. However, to see exactly how closely related two sequences are, measure
the distance of the path between them. Use the Inspect command to display and measure the path
between two sequences.
1 Select Tools > Inspect, or from the toolbar, click the Inspect Tool Mode icon.
The app highlights the path between the two nodes and displays the path length in the pop-up
window. The path length is the patristic distance calculated by the seqpdist function.
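Path lengths can also be computed outside the app with the pdist method of a phytree object. The following is a minimal sketch on a hypothetical three-leaf tree; the leaf names and branch lengths are made up for illustration.

```matlab
% Toy tree (hypothetical): A and B share a parent, C joins at the root.
tr = phytreeread('((A:1,B:2):0.5,C:3);');

% Pairwise patristic distances between leaves as a square matrix.
D = pdist(tr,'Squareform',true);

% Look up the leaves by name rather than assuming their order.
names = get(tr,'LeafNames');
iA = find(strcmp(names,'A'));
iB = find(strcmp(names,'B'));
dAB = D(iA,iB);   % path A -> parent -> B, that is 1 + 2
```

The distance between two leaves is the sum of the branch lengths along the path that connects them, which is what the Inspect mode displays.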
Some trees have thousands of leaf and branch nodes. Displaying all the nodes can create an
unreadable tree diagram. By collapsing some branches, you can better see the relationships between
the remaining nodes.
1 Select Tools > Collapse/Expand, or from the toolbar, click the Collapse/Expand Branch Mode
icon.
The paths, branch nodes, and leaf nodes below the selected branch appear in gray, indicating you
selected them to collapse (hide from view).
The app hides the display of paths, branch nodes, and leaf nodes below the selected branch.
However, it does not remove the data.
Tip After collapsing nodes, you can redraw the tree by selecting Tools > Fit to Window.
A phylogenetic tree is initially created by pairing the two most similar sequences and then adding the
remaining sequences in a decreasing order of similarity. You can rotate branches to emphasize the
direction of evolution.
1 Select Tools > Rotate Branch, or from the toolbar, click the Rotate Branch Mode icon.
The branch and leaf nodes below the selected branch node rotate 180 degrees around the branch
node.
4 To undo the rotation, simply click the branch node again.
The Phylogenetic Tree app takes the node names from a phytree object and creates numbered
branch names starting with Branch 1. You can edit any of the leaf or branch names.
1 Select Tools > Rename, or from the toolbar, click the Rename Leaf/Branch Mode icon.
4 To accept your changes and close the text box, click outside of the text box. To save your
changes, select File > Save As.
Your tree can contain leaves that are far outside the phylogeny, or it can have duplicate leaves that
you want to remove.
1 Select Tools > Prune, or from the toolbar, click the Prune (delete) Leaf/Branch Mode icon.
For a leaf node, the branch line connected to the leaf appears in gray. For a branch node, the
branch lines below the node appear in gray.
Note If you delete nodes (branches or leaves), you cannot undo the changes. The Phylogenetic
Tree app does not have an Undo command.
3 Click the branch or leaf node.
The tool removes the branch from the figure and rearranges the other nodes to balance the tree
structure. It does not recalculate the phylogeny.
Tip After pruning nodes, you can redraw the tree by selecting Tools > Fit to Window.
The Zoom and Pan commands are the standard controls for resizing and moving the screen in any
MATLAB Figure window.
1 Select Tools > Zoom In, or from the toolbar, click the Zoom In icon.
The app activates zoom in mode and changes the cursor to a magnifying glass.
2 Place the cursor over the section of the tree diagram you want to enlarge and then click.
3 From the toolbar, click the Pan icon.
4 Move the cursor over the tree diagram, left-click, and drag the diagram to the location you want
to view.
Tip After zooming and panning, you can reset the tree to its original view by selecting Tools >
Reset View.
Select Submenu
Select a single branch or leaf node by clicking it. Select multiple branch or leaf nodes by Shift-
clicking the nodes, or click-dragging to draw a box around nodes.
Use the Select submenu to select specific branch and leaf nodes based on different criteria.
• Select By Distance — Displays a slider bar at the top of the window, which you slide to specify a
distance threshold. Nodes whose distance from the selected node is below this threshold appear
in red. Nodes whose distance from the selected node is above this threshold appear in blue.
• Select Common Ancestor — For all selected nodes, highlights the closest common ancestor
branch node in red.
• Select Leaves — If one or more nodes are selected, highlights the nodes that are leaf nodes in
red. If no nodes are selected, highlights all leaf nodes in red.
• Propagate Selection — For all selected nodes, highlights the descendant nodes in red.
• Swap Selection — Clears all selected nodes and selects all deselected nodes.
After selecting nodes using one of the previous commands, hide and show the nodes using the
following commands:
• Collapse Selected
• Expand Selected
• Expand All
Clear all selected nodes by clicking anywhere else in the Phylogenetic Tree app.
Phylogenetic trees can have thousands of leaves and branches, and finding a specific node can be
difficult. Use the Find Leaf/Branch command to locate a node using its name or part of its name.
2 In the Regular Expression to match box, enter a name or partial name of a branch or leaf
node.
3 Click OK.
The branch or leaf nodes that match the expression appear in red.
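A comparable search can be run from the command line with the getbyname method of a phytree object, which matches node names against a regular expression regardless of letter case. The following is a minimal sketch on a hypothetical three-leaf tree; the branch lengths are made up.

```matlab
% Toy tree (hypothetical branch lengths).
tr = phytreeread('((Chimp_Verus:1,Chimp_Troglodytes:1):1,European_Human:2);');

% Logical vector with one entry per node (leaves and branches).
sel = getbyname(tr,'chimp');

% Names of the nodes that matched the expression.
nodeNames = get(tr,'NodeNames');
matched = nodeNames(sel);
```

The logical vector can then be passed to other phytree methods, for example to prune or to select the matching nodes.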
After selecting nodes using the Find Leaf/Branch command, you can hide and show the nodes using
the following commands:
• Collapse Selected
• Expand Selected
• Expand All
When you select nodes, either manually or using the previous commands, you can then collapse them
by selecting Tools > Collapse Selected.
The data for branches and leaves that you hide using the Collapse/Expand or Collapse Selected
command are not removed from the tree. You can display selected or all hidden data using the
Expand Selected or Expand All command.
After you hide nodes with the collapse commands, or delete nodes with the Prune command, there
can be extra space in the tree diagram. Use the Fit to Window command to redraw the tree diagram
to fill the entire Figure window.
Use the Reset View command to remove formatting changes such as collapsed branches and zooms.
Options Submenu
Use the Options command to select the behavior for the zoom and pan modes.
Window Menu
This section illustrates how to switch to any open window.
The Window menu is standard on MATLAB interfaces and Figure windows. Use this menu to select
any opened window.
Help Menu
Use the Help menu to select quick links to the Bioinformatics Toolbox documentation for
phylogenetic analysis functions, tutorials, and the phytreeviewer reference.
Building a Phylogenetic Tree for the Hominidae Species
This example shows how to construct phylogenetic trees from mtDNA sequences for the Hominidae
taxa (also known as pongidae). This family embraces the gorillas, chimpanzees, orangutans and
humans.
Introduction
The mitochondrial D-loop is one of the fastest mutating sequence regions in animal DNA, and
therefore, is often used to compare closely related organisms. The origin of modern man is a highly
debated issue that has been addressed by using mtDNA sequences. The limited genetic variability of
human mtDNA has been explained in terms of a recent common genetic ancestry, thus implying that
all modern-population mtDNAs likely originated from a single woman who lived in Africa less than
200,000 years ago.
This example uses mitochondrial D-loop sequences isolated for different hominidae species with the
following GenBank Accession numbers.
You can use the getgenbank function inside a for-loop to retrieve the sequences from the NCBI data
repository and load them into MATLAB®.
For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in
public repositories is frequently curated and updated; therefore, the results of this example might be
slightly different when you use up-to-date sequences.
load('primates.mat')
Compute pairwise distances using the 'Jukes-Cantor' formula and build the phylogenetic tree with the
'UPGMA' distance method. Because the sequences are not pre-aligned, seqpdist performs a pairwise
alignment before computing the distances.
distances = seqpdist(primates,'Method','Jukes-Cantor','Alpha','DNA');
UPGMAtree = seqlinkage(distances,'UPGMA',primates)
h = plot(UPGMAtree,'orient','top');
title('UPGMA Distance Tree of Primates using Jukes-Cantor model');
ylabel('Evolutionary distance')
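For reference, the Jukes-Cantor correction for nucleotides converts the observed proportion of differing sites, p, into an estimated distance d = -(3/4) ln(1 - (4/3) p). The following sketch evaluates the formula by hand on two hypothetical, already aligned, gap-free sequences. It illustrates the formula only and does not reproduce the alignment and gap handling that seqpdist performs.

```matlab
% Two toy aligned sequences (hypothetical data, equal length, no gaps).
s1 = 'ACGTACGTAC';
s2 = 'ACGTTCGTAA';

% Proportion of sites at which the sequences differ (p-distance).
p = mean(s1 ~= s2);

% Jukes-Cantor corrected distance for nucleotides.
d = -3/4 * log(1 - 4/3 * p);
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.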
Alternate tree topologies are important to consider when analyzing homologous sequences between
species. A neighbor-joining tree can be built using the seqneighjoin function. Neighbor-joining
trees use the pairwise distance calculated above to construct the tree. This method performs
clustering using the minimum evolution method.
NJtree = seqneighjoin(distances,'equivar',primates)
h = plot(NJtree,'orient','top');
title('Neighbor-Joining Distance Tree of Primates using Jukes-Cantor model');
ylabel('Evolutionary distance')
Notice that different phylogenetic reconstruction methods result in different tree topologies. The
neighbor-joining tree groups Chimp Vellerosus in a clade with the gorillas, whereas the UPGMA tree
groups it near chimps and orangutans. The getcanonical function can be used to compare these
isomorphic trees.
sametree = isequal(getcanonical(UPGMAtree),getcanonical(NJtree))
sametree =
logical
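To see what a canonical-form comparison detects, consider a minimal sketch on hypothetical three-leaf trees. Two of the trees differ only in the order in which leaves and branches are written, and the third has a different topology.

```matlab
% Toy trees (hypothetical). t1 and t2 describe the same topology with the
% leaves and branches listed in a different order; t3 groups A with C.
t1 = phytreeread('((A:1,B:1):1,C:2);');
t2 = phytreeread('(C:2,(B:1,A:1):1);');
t3 = phytreeread('((A:1,C:1):1,B:2);');

% Canonical pointers do not depend on the order of leaves and branches.
sameOrderFree = isequal(getcanonical(t1),getcanonical(t2));
sameAsThird   = isequal(getcanonical(t1),getcanonical(t3));
```

Comparing the canonical pointers tests topology only; branch lengths are not part of this comparison.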
You can explore the phylogenetic tree by considering the nodes (leaves and branches) within a given
patristic distance from the 'European Human' entry and reduce the tree to the sub-branches of
interest by pruning away non-relevant nodes.
names = get(UPGMAtree,'LeafNames')
[h_all,h_leaves] = select(UPGMAtree,'reference',3,'criteria','distance','threshold',0.3);
subtree_names = names(h_leaves)
leaves_to_prune = ~h_leaves;
pruned_tree = prune(UPGMAtree,leaves_to_prune)
h = plot(pruned_tree,'orient','top');
title('Pruned UPGMA Distance Tree of Primates using Jukes-Cantor model');
ylabel('Evolutionary distance')
names =
{'German_Neanderthal' }
{'Russian_Neanderthal' }
{'European_Human' }
{'Chimp_Troglodytes' }
{'Chimp_Schweinfurthii' }
{'Chimp_Verus' }
{'Chimp_Vellerosus' }
{'Puti_Orangutan' }
{'Jari_Orangutan' }
{'Mountain_Gorilla_Rwanda'}
{'Eastern_Lowland_Gorilla'}
{'Western_Lowland_Gorilla'}
subtree_names =
{'German_Neanderthal' }
{'Russian_Neanderthal' }
{'European_Human' }
{'Chimp_Troglodytes' }
{'Chimp_Schweinfurthii'}
{'Chimp_Verus' }
With view you can further explore/edit the phylogenetic tree using an interactive tool. See also
phytreeviewer.
view(UPGMAtree,h_leaves)
References
[1] Ovchinnikov, I.V., et al., "Molecular analysis of Neanderthal DNA from the northern Caucasus",
Nature, 404(6777):490-3, 2000.
[2] Sajantila, A., et al., "Genes and languages in Europe: an analysis of mitochondrial lineages",
Genome Research, 5(1):42-52, 1995.
[3] Krings, M., et al., "Neandertal DNA sequences and the origin of modern humans", Cell,
90(1):19-30, 1997.
[4] Jensen-Seaman, M.I. and Kidd, K.K., "Mitochondrial DNA variation and biogeography of eastern
gorillas", Molecular Ecology, 10(9):2241-7, 2001.
Analyzing the Origin of the Human Immunodeficiency Virus
This example shows how to construct phylogenetic trees from multiple strains of the HIV and SIV
viruses.
Introduction
Mutations accumulate in the genomes of pathogens, in this case the human/simian immunodeficiency
virus, during the spread of an infection. This information can be used to study the history of
transmission events, and also as evidence for the origins of the different viral strains.
There are two characterized strains of human AIDS viruses: type 1 (HIV-1) and type 2 (HIV-2). Both
strains represent cross-species infections. The primate reservoir of HIV-2 has been clearly identified
as the sooty mangabey (Cercocebus atys). The origin of HIV-1 is believed to be the common
chimpanzee (Pan troglodytes).
In this example, the variations in the three longest coding regions from sixteen different isolated
strains of the human and simian immunodeficiency viruses are used to construct a phylogenetic tree.
The sequences for these virus strains can be retrieved from GenBank® using their accession
numbers. The three coding regions of interest, the gag protein, the pol polyprotein and the envelope
polyprotein precursor, can then be extracted from the sequences using the CDS information in the
GenBank records.
numViruses = size(data,1)
numViruses =
16
You can use the getgenbank function to copy the data from GenBank into a structure in MATLAB®.
The SearchURL field of the structure contains the address of the actual GenBank record. You can
browse this record using the web command.
acc_num = data{1,2};
lentivirus = getgenbank(acc_num);
web(lentivirus(1).SearchURL)
Retrieve the sequence information from the NCBI GenBank database for the rest of the accession
numbers.
For your convenience, previously downloaded sequences are included in a MAT-file. Note that data in
public repositories is frequently curated and updated; therefore the results of this example might be
slightly different when you use up-to-date datasets.
load('lentivirus.mat')
Extract CDS for the GAG, POL, and ENV coding regions. Then extract the nucleotide sequences using
the CDS pointers.
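The extraction step amounts to indexing the nucleotide sequence with the start and stop positions of each coding region and, where protein sequences are needed, translating the result. The following is a minimal sketch on a toy sequence; the start and stop positions are hypothetical stand-ins for the CDS pointers stored in a GenBank record.

```matlab
% Toy nucleotide sequence and hypothetical CDS boundaries (1-based, inclusive).
seq = 'GGATGGCTTAAGG';
cdsStart = 3;
cdsStop  = 11;

% Extract the coding region using the pointers.
cdsSeq = seq(cdsStart:cdsStop);

% Translate the coding region to amino acids ('*' marks a stop codon).
aaSeq = nt2aa(cdsSeq);
```

Real records can contain coding regions that are split into several segments, so production code should not assume a single start and stop pair.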
The seqpdist and seqlinkage commands are used to construct a phylogenetic tree for the GAG
coding region using the 'Tajima-Nei' method to measure the distance between the sequences and the
unweighted pair group method using arithmetic averages, or 'UPGMA' method, for the hierarchical
clustering. The 'Tajima-Nei' method is defined only for nucleotides; therefore, nucleotide sequences
are used rather than the translated amino acid sequences. The distance calculation is
computationally intensive and can take several minutes.
gagd = seqpdist(gag,'method','Tajima-Nei','Alphabet','NT','indel','pair');
gagtree = seqlinkage(gagd,'UPGMA',data(:,1))
plot(gagtree,'type','angular');
title('Immunodeficiency virus (GAG protein)')
Next construct a phylogenetic tree for the POL polyproteins using the 'Jukes-Cantor' method to
measure distance between sequences and the weighted pair group method using arithmetic
averages, or 'WPGMA' method, for the hierarchical clustering. The 'Jukes-Cantor' method is also
defined for amino acid sequences. Because these sequences are significantly shorter than the
corresponding nucleotide sequences, the pairwise distances are calculated significantly faster.
Calculate the distance and linkage, and then generate the tree.
pold = seqpdist(aapol,'method','Jukes-Cantor','indel','pair');
poltree = seqlinkage(pold,'WPGMA',data(:,1))
plot(poltree,'type','angular');
title('Immunodeficiency virus (POL polyprotein)')
Construct a phylogenetic tree for the ENV polyproteins using the normalized pairwise alignment
scores as distances between sequences and the 'UPGMA' method for hierarchical clustering.
envd = seqpdist(aaenv,'method','Alignment','indel','score',...
'ScoringMatrix','Blosum62');
envtree = seqlinkage(envd,'UPGMA',data(:,1))
plot(envtree,'type','angular');
title('Immunodeficiency virus (ENV polyprotein)')
The three trees are similar, but there are some interesting differences. For example, in the POL tree,
the 'SIVmnd5440 Mandrillus sphinx' sequence is placed close to the HIV-1 strains, but in the ENV
tree it is shown as being very distant from the HIV-1 sequences. Given that the three trees show slightly
different results, a consensus tree using all three regions may give better general information about
the complete viruses. A consensus tree can be built using a weighted average of the three trees.
Note that different metrics were used in the calculation of the pairwise distances. This could bias the
consensus tree. You may wish to recalculate the distances for the three regions using the same metric
to get an unbiased tree.
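One way to form such a weighted average is to combine the pairwise-distance vectors of the three regions element by element, using weights that sum to one. The following is a minimal sketch with hypothetical distance values and hypothetical weights; in the example, the three vectors would come from seqpdist, which by default returns the distances for a given set of sequences in the same pairwise order.

```matlab
% Toy pairwise-distance vectors for three regions (hypothetical values).
d1 = [0.10 0.40 0.35];
d2 = [0.12 0.38 0.30];
d3 = [0.20 0.50 0.45];

% Hypothetical relative weights, normalized so that they sum to 1.
w = [1 1 2];
w = w / sum(w);

% Element-wise weighted average of the three distance vectors.
dist = w(1)*d1 + w(2)*d2 + w(3)*d3;
```

How the weights are chosen is a modeling decision. Equal weights treat the three regions alike, whereas unequal weights let one region dominate the consensus.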
tree_hiv = seqlinkage(dist,'average',data(:,1));
plot(tree_hiv,'type','angular');
title('Immunodeficiency virus (Weighted tree)')
The phylogenetic tree resulting from our analysis illustrates the presence of two clusters and some
other isolated strains. The most compact cluster includes all the HIV-2 samples; at the top branch of
this cluster we observe the sooty mangabey, which has been identified as the origin of this lentivirus
in humans. The cluster containing the HIV-1 strain, however, is not as compact as the HIV-2 cluster.
From the tree it appears that the chimpanzee is the source of HIV-1; however, the origin of the cross-
species transmission to humans is still a matter of debate amongst HIV researchers.
% Add annotations
annotation(gcf,'textarrow',[0.29 0.31],[0.36 0.28],'Color',[1 0.5 0],...
'String',{'Possible HIV type 1 origin'},'TextColor',[1 0.5 0]);
References:
[1] Gao, F., et al., "Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes", Nature,
397(6718):436-41, 1999.
[2] Kestler, H.W., et al., "Comparison of simian immunodeficiency virus isolates", Nature,
331(6157):619-22, 1988.
[3] Alizon, M., et al., "Genetic variability of the AIDS virus: nucleotide sequence analysis of two
isolates from African patients", Cell, 46(1):63-74, 1986.
This example shows how to generate bootstrap replicates of DNA sequences. The data generated by
bootstrapping is used to estimate the confidence of the branches in a phylogenetic tree.
Introduction
Bootstrap, jackknife, and permutation tests are common tests used in phylogenetics to estimate the
significance of the branches of a tree. This process can be very time consuming because of the large
number of samples that have to be taken in order to have an accurate confidence estimate. The more
times the data are sampled the better the analysis. A cluster of computers can shorten the time
needed for this analysis by distributing the work to several machines and recombining the data.
This example uses 12 pre-aligned sequences isolated from different hominidae species and stored in a
FASTA-formatted file. A phylogenetic tree is constructed by using the UPGMA method with pairwise
distances. More specifically, the seqpdist function computes the pairwise distances among the
considered sequences and then the function seqlinkage builds the tree and returns the data in a
phytree object. You can use the phytreeviewer function to visualize and explore the tree.
primates = fastaread('primatesaligned.fa');
num_seqs = length(primates)
num_seqs = 12
orig_primates_dist = seqpdist(primates);
orig_primates_tree = seqlinkage(orig_primates_dist,'average',primates);
phytreeviewer(orig_primates_tree);
Bootstrapping Phylogenetic Trees
A bootstrap replicate is a shuffled representation of the DNA sequence data. To make a bootstrap
replicate of the primates data, bases are sampled randomly from the sequences with replacement and
concatenated to make new sequences. The same number of bases as the original multiple alignment
is used in this analysis, and then gaps are removed to force a new pairwise alignment. The function
randsample samples the data with replacement. This function can also sample the data randomly
without replacement to perform jackknife analysis. For this analysis, 100 bootstrap replicates for
each sequence are created.
num_boots = 100;
seq_len = length(primates(1).Sequence);
boots = cell(num_boots,1);
for n = 1:num_boots
    reorder_index = randsample(seq_len,seq_len,true);
    for i = num_seqs:-1:1 %reverse order to preallocate memory
        bootseq(i).Header = primates(i).Header;
        bootseq(i).Sequence = strrep(primates(i).Sequence(reorder_index),'-','');
    end
    boots{n} = bootseq;
end
Determining the distances between DNA sequences for a large data set and building the phylogenetic
trees can be time-consuming. Distributing these calculations over several machines/cores decreases
the computation time. This example assumes that you have already started a MATLAB® pool with
additional parallel resources. For information about setting up and selecting parallel configurations,
see "Programming with User Configurations" in the Parallel Computing Toolbox™ documentation. If
you do not have the Parallel Computing Toolbox™, the following PARFOR loop executes sequentially
without any further modification.
fun = @(x) seqlinkage(x,'average',{primates.Header});
boot_trees = cell(num_boots,1);
parpool('local');
parfor (n = 1:num_boots)
    dist_tmp = seqpdist(boots{n});
    boot_trees{n} = fun(dist_tmp);
end
delete(gcp('nocreate'));
The topology of every bootstrap tree is compared with that of the original tree. Any interior branch
that gives the same partition of species is counted. Since branches may be ordered differently among
different trees but still represent the same partition of species, it is necessary to get the canonical
form for each subtree before comparison. The first step is to get the canonical subtrees of the original
tree using the subtree and getcanonical methods from the Bioinformatics Toolbox™.
for i = num_seqs-1:-1:1 % for every branch, reverse order to preallocate
    branch_pointer = i + num_seqs;
    sub_tree = subtree(orig_primates_tree,branch_pointer);
    orig_pointers{i} = getcanonical(sub_tree);
    orig_species{i} = sort(get(sub_tree,'LeafNames'));
end
Now you can get the canonical subtrees for all the branches of every bootstrap tree.
for j = num_boots:-1:1
    for i = num_seqs-1:-1:1 % for every branch
        branch_ptr = i + num_seqs;
        sub_tree = subtree(boot_trees{j},branch_ptr);
        clusters_pointers{i,j} = getcanonical(sub_tree);
        clusters_species{i,j} = sort(get(sub_tree,'LeafNames'));
    end
end
For each subtree in the original tree, you can count how many times it appears within the bootstrap
subtrees. To be considered as similar, they must have the same topology and span the same species.
count = zeros(num_seqs-1,1);
for i = 1 : num_seqs -1 % for every branch
    for j = 1 : num_boots * (num_seqs-1)
        if isequal(orig_pointers{i},clusters_pointers{j})
            if isequal(orig_species{i},clusters_species{j})
                count(i) = count(i) + 1;
            end
        end
    end
end
Pc = count ./ num_boots % confidence probability (Pc)
Pc = 11×1
1.0000
1.0000
0.9900
0.9900
0.5400
0.5400
1.0000
0.4300
0.3900
0.3900
⋮
The confidence information associated with each branch node can be stored within the tree by
annotating the node names. Thus, you can create a new tree, equivalent to the original primates tree,
and annotate the branch names to include the confidence levels computed above. phytreeviewer
displays this data in datatips when the mouse is hovered over the internal nodes of the tree.
[ptrs,dist,names] = get(orig_primates_tree,'POINTERS','DISTANCES','NODENAMES');
tr = phytree(ptrs,dist,names)
You can select the branch nodes with a confidence level greater than a given threshold, for example
0.9, and view these corresponding nodes in the Phylogenetic Tree app. You can also select these
branch nodes interactively within the app.
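The annotation and selection steps can be sketched as follows on a hypothetical four-leaf tree. The tree, the confidence values in Pc, and the label text are all made up for illustration; in the example, Pc holds one bootstrap confidence value per branch of the original primates tree.

```matlab
% Toy tree with four leaves and therefore three branch nodes (hypothetical).
tr0 = phytreeread('((A:1,B:1):1,(C:1,D:1):1);');
numLeaves = get(tr0,'NumLeaves');
Pc = [0.95; 0.60; 1.00];   % hypothetical confidence values, one per branch

% Append the confidence level to the name of each branch node.
[ptrs,dist,names] = get(tr0,'POINTERS','DISTANCES','NODENAMES');
for i = 1:numel(Pc)
    k = i + numLeaves;     % branch nodes are stored after the leaves
    names{k} = sprintf('%s, confidence: %g %%',names{k},100*Pc(i));
end
tr = phytree(ptrs,dist,names);

% Node indices of the branches whose confidence exceeds the threshold.
highConf = find(Pc > 0.9) + numLeaves;
% view(tr,highConf)   % uncomment to open the app with these nodes selected
```

The offset by the number of leaves is needed because a phytree object numbers its leaves first and its branch nodes afterward.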
References
[2] Nei, M. and Kumar, S., "Molecular Evolution and Phylogenetics", Oxford University Press. Chapter
4, 2000.
6
This example shows how to improve the quality of raw mass spectrometry data. In particular, this
example illustrates the typical steps for preprocessing protein surface-enhanced laser
desorption/ionization-time of flight mass spectra (SELDI-TOF).
Mass spectrometry data can be stored in different formats. If the data is stored in text files with two
columns (the mass/charge (M/Z) ratios and the corresponding intensity values), you can use one of
the following MATLAB® I/O functions: importdata, dlmread, or textscan. Alternatively, if the
data is stored in JCAMP-DX formatted files, you can use the function jcampread. If the data is
contained in a spreadsheet of an Excel® workbook, you can use the function xlsread. If the data is
stored in mzXML formatted files, you can use the function mzxmlread, and finally, if the data is
stored in SPC formatted files, you can use tgspcread.
This example uses spectrograms taken from one of the low-resolution ovarian cancer NCI/FDA data
sets from the FDA-NCI Clinical Proteomics Program Databank. These spectra were generated using
the WCX2 protein-binding chip, two with manual sample handling and two with a robotic sample
dispenser/processor.
sample = importdata('mspec01.csv')
sample =
The M/Z ratios are in the first column of the data field and the ion intensities are in the second.
MZ = sample.data(:,1);
Y = sample.data(:,2);
For better manipulation of the data, you can load multiple spectrograms and concatenate them into a
single matrix. Use the dlmread function to read comma separated value files. Note: This example
assumes that the M/Z ratios are the same for the four files. For data sets with different M/Z ratios,
use msresample to create a uniform M/Z vector.
files = {'mspec01.csv','mspec02.csv','mspec03.csv','mspec04.csv'};
for i = 1:4
    Y(:,i) = dlmread(files{i},',',1,1); % skips the header row and the first (M/Z) column
end
Preprocessing Raw Mass Spectrometry Data
Resampling mass spectrometry data has several advantages. It homogenizes the mass/charge (M/Z)
vector, allowing you to compare different spectra under the same reference and at the same
resolution. In high-resolution data sets, the large size of the files leads to computationally intensive
algorithms. However, high-resolution spectra can be redundant. By resampling, you can decimate the
signal into a more manageable M/Z vector, preserving the information content of the spectra. The
msresample function allows you to select a new M/Z vector and also applies an antialias filter that
prevents high-frequency noise from folding into lower frequencies.
Load a high-resolution spectrum taken from the high-resolution ovarian cancer NCI/FDA data set. For
convenience, the spectrum is included in a MAT-formatted file.
load sample_hi_res
numel(MZ_hi_res)
ans =
355760
Down-sample the spectra to 10,000 M/Z points between 2,000 and 11,000. Use the SHOWPLOT
property to create a customized plot that lets you follow and assess the quality of the preprocessing
action.
[MZH,YH] = msresample(MZ_hi_res,Y_hi_res,10000,'RANGE',[2000 11000],'SHOWPLOT',true);
6 Mass Spectrometry and Bioanalytics
Zooming into a reduced region reveals the detail of the down-sampling procedure.
Baseline Correction
Mass spectrometry data usually show a varying baseline caused by the chemical noise in the matrix
or by ion overloading. The msbackadj function estimates a low-frequency baseline, which is hidden
among high-frequency noise and signal peaks. It then subtracts the baseline from the spectrogram.
Adjust the baseline of the set of spectrograms and show only the second one and its estimated
background.
YB = msbackadj(MZ,Y,'WINDOWSIZE',500,'QUANTILE',0.20,'SHOWPLOT',2);
Miscalibration of the mass spectrometer leads to variations of the relationship between the observed
M/Z vector and the true time-of-flight of the ions. Therefore, systematic shifts can appear in repeated
experiments. When a known profile of peaks is expected in the spectrogram, you can use the function
msalign to standardize the M/Z values.
To align the spectrograms, provide a set of M/Z values where reference peaks are expected to appear.
You can also define a vector with relative weights that is used by the aligning algorithm to emphasize
peaks with small area.
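The following is a minimal sketch of how the reference peaks and weights might be set up, using synthetic spectra so that it runs without the example data. The peak locations in P, the weights in W, the shifts, and the Gaussian peak shape are all hypothetical choices, not the values used for the ovarian cancer spectra.

```matlab
% Synthetic M/Z axis and hypothetical reference peak locations and weights.
MZ = (2000:0.5:11000)';
P = [4000 6000 9000];   % hypothetical reference peaks (M/Z)
W = [60 100 60];        % hypothetical relative weights

% Two synthetic spectra whose peaks are shifted by a few M/Z units.
shifts = [-6 4];
peak = @(c) exp(-((MZ - c)/10).^2);
Y = zeros(numel(MZ),numel(shifts));
for k = 1:numel(shifts)
    for c = P
        Y(:,k) = Y(:,k) + 100*peak(c + shifts(k));
    end
end

% Align both spectra to the reference peaks.
YA = msalign(MZ,Y,P,'WEIGHTS',W);
```

A larger weight makes the corresponding reference peak count more in the alignment, which helps when that peak is small compared with its neighbors.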
Display a heat map to observe the alignment of the spectra before and after applying the alignment
algorithm.
msheatmap(MZ,YB,'MARKERS',P,'RANGE',[3000 10000])
title('Before Alignment')
YA = msalign(MZ,YB,P,'WEIGHTS',W);
msheatmap(MZ,YA,'MARKERS',P,'RANGE',[3000 10000])
title('After Alignment')
Normalization
In repeated experiments, it is common to find systematic differences in the total amount of desorbed
and ionized proteins. The msnorm function implements several variations of typical normalization (or
standardization) methods.
For example, one of many methods to standardize the values of the spectrograms is to rescale the
maximum intensity of every signal to a specific value, for instance 100. It is also possible to ignore
problematic regions; for example, in serum samples you might want to ignore the low-mass region
(M/Z < 1000 Da.).
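The arithmetic behind this kind of rescaling can be sketched in plain MATLAB on hypothetical data. This illustrates the idea only and is not the implementation of msnorm, which offers this behavior through options such as 'Limits' and 'Max'.

```matlab
% Toy data (hypothetical): four M/Z points, two spectra as columns.
MZ = [500; 1500; 3000; 6000];
Y  = [900 400; 20 10; 50 40; 10 5];

% Ignore the low-mass region when looking for the maximum.
inRange = MZ >= 1000;
colMax = max(Y(inRange,:),[],1);

% Rescale every spectrum so that its in-range maximum equals 100.
YN = 100 * Y ./ colMax;
```

Intensities outside the selected range are rescaled by the same factor, so they can still exceed 100 after normalization.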
The msnorm function can also standardize by using the area under the curve (AUC) and then rescale
the spectrograms to have relative intensities below 100.
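The idea of AUC-based standardization can be sketched in plain MATLAB on hypothetical data. This illustrates the arithmetic only and is not the implementation of msnorm; in particular, using the median area as the common target is an assumption of this sketch.

```matlab
% Toy data (hypothetical): five M/Z points, two spectra as columns.
MZ = (1000:1000:5000)';
Y  = [0 0; 10 20; 20 40; 10 20; 0 0];

% Area under each spectrum, computed with the trapezoidal rule.
auc = trapz(MZ,Y);

% Scale every spectrum to a common area (here, the median area).
YN = Y ./ auc * median(auc);

% Rescale all spectra together so that the largest intensity is 100.
YN = 100 * YN / max(YN(:));
```

After this step the spectra have equal areas, and all relative intensities are at most 100.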
Standardized spectra usually contain a mixture of noise and signal. Some applications require
denoising of the spectrograms to improve the validity and precision of the observed mass/charge
values of the peaks in the spectra. For the same reason, denoising also improves further peak
detection algorithms. However, it is important to preserve the sharpness (or high-frequency
components) of the peak as much as possible. For this, you can use Lowess smoothing (mslowess)
and polynomial filters (mssgolay).
YS = mssgolay(MZ,YN2,'SPAN',35,'SHOWPLOT',3);
Zooming into a reduced region reveals the detail of the smoothing algorithm.
A simple approach to find putative peaks is to look at the first derivative of the smoothed signal, then
filter some of these locations to avoid small ion-intensity peaks.
P1 = mspeaks(MZ,YS,'DENOISING',false,'HEIGHTFILTER',2,'SHOWPLOT',1)
P1 =
{164x2 double}
{171x2 double}
{169x2 double}
{147x2 double}
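The first-derivative idea can be illustrated with core MATLAB on a toy signal. This is a minimal sketch and not the algorithm inside mspeaks; the signal and the height threshold of 2 are arbitrary assumptions.

```matlab
% Toy signal (hypothetical): x plays the role of M/Z, y of smoothed intensity
xDemo = (1:9)';
yDemo = [0 1 4 1 0 0.5 0.2 3 1]';

% A local maximum is where the derivative changes from positive to non-positive
d      = diff(yDemo);
isPeak = [false; d(1:end-1) > 0 & d(2:end) <= 0; false];

% Height filter: discard small peaks
isPeak = isPeak & yDemo > 2;
peaksDemo = [xDemo(isPeak) yDemo(isPeak)];
```

The small bump at x = 6 is a local maximum but is removed by the height filter, so peaksDemo holds only the peaks at x = 3 and x = 8.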
The mspeaks function can also estimate the noise using wavelet denoising. This method is generally
more robust, because peak detection can be achieved directly over noisy spectra. The algorithm will
adapt to varying noise conditions of the signal, and peaks can be resolved even if low resolution or
oversegmentation exists.
P2 = mspeaks(MZ,YN2,'BASE',12,'MULTIPLIER',10,'HEIGHTFILTER',1,'SHOWPLOT',1)
P2 =
{322x2 double}
{370x2 double}
{324x2 double}
{295x2 double}
P3 =
{81x2 double}
{93x2 double}
{57x2 double}
{53x2 double}
Peaks corresponding to similar compounds may still be reported with slight mass/charge differences
or drifts. Assuming that the four spectrograms correspond to comparable biological/chemical
samples, it might be useful to compare peaks from different spectra, which requires peak binning
(a.k.a. peak coalescing). The crucial task in data binning is to create a common mass/charge
reference vector (or bins). Ideally, bins should collect one peak from each signal and should avoid
collecting multiple relevant peaks from the same signal into the same bin.
This example uses hierarchical clustering to calculate a common mass/charge reference vector. The
approach is sufficient when using low-resolution spectra; however, for high-resolution spectra or for
data sets with many spectrograms, the function mspalign provides other scalable methods to
estimate a common mass/charge reference and perform data binning.
Put all the peaks into a single array and construct a vector with the spectrogram index for each peak.
allPeaks = cell2mat(P3);
numPeaks = cellfun(@(x) size(x,1),P3); % number of rows (peaks) in each peak list
Sidx = accumarray(cumsum(numPeaks),1);
Sidx = cumsum(Sidx)-Sidx;
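To see what the index construction above produces, here is the same sequence of calls on a small made-up cell array (P3demo is hypothetical). The resulting index is zero-based: peaks from the first spectrogram get 0, peaks from the second get 1, and so on.

```matlab
% Hypothetical peak lists [M/Z, intensity] for three spectrograms
P3demo = {[100 5; 200 7]; [101 4; 199 6; 300 2]; [102 9]};

allPeaksDemo = cell2mat(P3demo);                    % 6-by-2 stacked peaks
numPeaksDemo = cellfun(@(x) size(x,1),P3demo);      % [2; 3; 1]
SidxDemo = accumarray(cumsum(numPeaksDemo),1);      % 1 marks the last peak of each list
SidxDemo = cumsum(SidxDemo) - SidxDemo;             % [0; 0; 1; 1; 1; 2]
```

Counting rows with size(x,1) rather than length(x) keeps the count correct for a list with a single peak, where length would return 2 (the number of columns).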
Create a custom distance function that penalizes clusters containing peaks from the same
spectrogram, then perform hierarchical clustering.
distfun = @(x,y) (x(:,1)-y(:,1)).^2 + (x(:,2)==y(:,2))*10^6;
tree = linkage(pdist([allPeaks(:,1),Sidx],distfun));
clusters = cluster(tree,'CUTOFF',75,'CRITERION','Distance');
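The effect of the penalty term is easiest to see by evaluating the distance function directly on a few hand-made points. The inputs below are hypothetical; each row is [M/Z, spectrogram index], matching the layout passed to pdist.

```matlab
distfunDemo = @(x,y) (x(:,1)-y(:,1)).^2 + (x(:,2)==y(:,2))*10^6;

xi = [7500 1];                  % one peak from spectrogram 1
xj = [7502 2;                   % nearby peak from a different spectrogram
      7502 1];                  % nearby peak from the same spectrogram
d  = distfunDemo(xi,xj);        % [4; 1000004]
```

Both candidate peaks are 2 M/Z units away, but the one from the same spectrogram is pushed far beyond any reasonable cutoff, so the clustering avoids placing it in the same bin.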
The common mass/charge reference vector (CMZ) is found by calculating the centroids for each
cluster.
CMZ = accumarray(clusters,prod(allPeaks,2))./accumarray(clusters,allPeaks(:,2));
PR = accumarray(clusters,allPeaks(:,2),[],@max);
[CMZ,h] = sort(CMZ);
PR = PR(h);
figure
hold on
box on
plot([CMZ CMZ],[-10 100],'-k')
plot(MZ,YN2)
axis([7200 8500 -10 100])
xlabel('Mass/Charge (M/Z)')
ylabel('Relative Intensity')
title('Common Mass/Charge (M/Z) Locations found by Clustering')
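The centroid computation used for CMZ is an intensity-weighted mean of the M/Z values in each cluster. A small worked example with made-up data (clustersDemo and peaksDemo are hypothetical) shows the arithmetic:

```matlab
clustersDemo = [1; 1; 2; 2; 2];
peaksDemo    = [100 1; 102 3; 200 2; 201 2; 202 4];   % [M/Z, intensity]

% sum(M/Z .* intensity) per cluster divided by sum(intensity) per cluster
cmzDemo = accumarray(clustersDemo,prod(peaksDemo,2)) ./ ...
          accumarray(clustersDemo,peaksDemo(:,2));    % [101.5; 201.25]

% Largest intensity in each cluster, as in PR
prDemo = accumarray(clustersDemo,peaksDemo(:,2),[],@max);   % [3; 4]
```

Because the weights are intensities, each centroid is pulled toward the most intense peak in its cluster.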
The samplealign function allows you to use a dynamic programming algorithm to assign the
observed peaks in each spectrogram to the common mass/charge reference vector (CMZ).
When simpler binning approaches are used, such as rounding the mass/charge values or using
nearest neighbor quantization to the CMZ vector, the same peak from different spectra may be assigned
to different bins due to the small drifts that still exist. To circumvent this problem, the bin size can be
increased at the cost of mass spectrometry peak resolution. By using dynamic programming
binning, you preserve the resolution while minimizing the problem of assigning similar compounds
from different spectrograms to different peak locations.
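The drawback of nearest neighbor quantization can be reproduced with a tiny made-up case. The reference values and peak locations below are hypothetical and chosen so that a drift of 0.2 M/Z units straddles a bin boundary.

```matlab
cmzDemo = [7560; 7570; 7580];     % hypothetical reference vector
peakMZ  = [7564.9; 7565.1];       % same compound seen in two spectra, small drift

% Nearest neighbor rule: pick the closest reference value for every peak
[~,bin] = min(abs(peakMZ - cmzDemo'),[],2);   % [1; 2]
```

The two observations of the same compound land in different bins, which is the situation the dynamic programming assignment is meant to reduce.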
PA = nan(numel(CMZ),4);
for i = 1:4
[j,k] = samplealign([CMZ PR],P3{i},'BAND',15,'WEIGHTS',[1 .1]);
PA(j,i) = P3{i}(k,2);
end
figure
hold on
box on
plot([CMZ CMZ],[-10 100],':k')
plot(MZ,YN2)
plot(CMZ,PA,'o')
axis([7200 8500 -10 100])
xlabel('Mass/Charge (M/Z)')
ylabel('Relative Intensity')
title('Peaks Aligned to the Common Mass/Charge (M/Z) Reference')
Use msviewer to inspect the preprocessed spectrograms on a given range (for example, between
values 7600 and 8200).
r1 = 7600;
r2 = 8200;
range = MZ > r1 & MZ < r2;
rangeMarkers = CMZ(CMZ > r1 & CMZ < r2);
msviewer(MZ(range),YN2(range,:),'MARKERS',rangeMarkers,'GROUP',1:4)
See Also
mssgolay | msnorm | msalign | msheatmap | msbackadj | msresample | mspeaks | msviewer
Related Examples
• “Batch Processing of Spectra Using Sequential and Parallel Computing” on page 6-79
• “Visualizing and Preprocessing Hyphenated Mass Spectrometry Data Sets for Metabolite and
Protein/Peptide Profiling” on page 6-19
• “Identifying Significant Features and Classifying Protein Profiles” on page 6-38
Visualizing and Preprocessing Hyphenated Mass Spectrometry Data Sets for Metabolite and Protein/Peptide Profiling
This example shows how to manipulate, preprocess and visualize data from Liquid Chromatography
coupled with Mass Spectrometry (LC/MS). These large and high dimensional data sets are extensively
utilized in proteomics and metabolomics research. Visualizing complex peptide or metabolite
mixtures provides an intuitive method to evaluate the sample quality. In addition, methodical
correction and preprocessing can lead to automated high throughput analysis of samples allowing
accurate identification of significant metabolites and specific peptide features in a biological sample.
Introduction
In a typical hyphenated mass spectrometry experiment, proteins and metabolites are harvested from
cells, tissues, or body fluids, dissolved and denatured in solution, and enzymatically digested into
mixtures. These mixtures are then separated either by High Performance Liquid Chromatography
(HPLC), capillary electrophoresis (CE), or gas chromatography (GC) and coupled to a mass-
spectrometry identification method, such as Electrospray Ionization Mass Spectrometry (ESI-MS),
matrix assisted ionization (MALDI or SELDI TOF-MS), or tandem mass spectrometry (MS/MS).
For this example, we use a test sample LC-ESI-MS data set with a seven protein mix. The data in this
example (7MIX_STD_110802_1) is from the Sashimi Data Repository. The data set is not distributed
with MATLAB®. To complete this example, you must download the data set into a local directory or
your own repository. Alternatively, you can try other data sets available in other public databases for
protein expression data such as the Peptide Atlas at the Institute for Systems Biology [1].
Most of the current mass spectrometers can translate or save the acquisition data using the mzXML
schema. This standard is an XML (eXtensible Markup Language)-based common file format developed
by the Sashimi project to address the challenges involved in representing data sets from different
manufacturers and from different experimental setups into a common and expandable schema.
mzXML files used in hyphenated mass spectrometry are usually very large. The MZXMLINFO function
allows you to obtain basic information about the dataset without reading it into memory. For example,
you can retrieve the number of scans, the range of the retention time, and the number of tandem MS
instruments (levels).
info = mzxmlinfo('7MIX_STD_110802_1.mzXML','NUMOFLEVELS',true)
info =
Filename: '7MIX_STD_110802_1.mzXML'
FileModDate: '01-Feb-2013 11:54:30'
FileSize: 26789612
NumberOfScans: 7161
StartTime: 'PT0.00683333S'
EndTime: 'PT200.036S'
DataProcessingIntensityCutoff: 'N/A'
DataProcessingCentroided: 'true'
DataProcessingDeisotoped: 'N/A'
DataProcessingChargeDeconvoluted: 'N/A'
DataProcessingSpotIntegration: 'N/A'
NumberOfMSLevels: 2
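The StartTime and EndTime fields are returned as duration strings rather than numbers. A minimal way to convert them, assuming they always use the seconds-only form shown above (durations with minute or hour designators would need a more general parser), is:

```matlab
% Strings copied from the output above
startSec = sscanf('PT0.00683333S','PT%fS');   % 0.00683333
endSec   = sscanf('PT200.036S','PT%fS');      % 200.036
runSpan  = endSec - startSec;                 % span covered by the scans
```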
The MZXMLREAD function reads the XML document into a MATLAB structure. The fields scan and
index are placed at the first level of the output structure for improved access to the spectral data.
The remainder of the mzXML document tree is parsed according to the schema specifications. This
LC/MS data set contains 7161 scans with two MS levels. For this example you will use only the first
level scans. Second level spectra are usually used for peptide/protein identification, and come at a
later stage in some types of workflow analyses. MZXMLREAD can filter the desired scans without
loading all the dataset into memory:
mzXML_struct = mzxmlread('7MIX_STD_110802_1.mzXML','LEVEL',1)
mzXML_struct =
If you receive any errors related to memory or Java heap space during the loading, try increasing your
Java heap space.
More detailed information pertaining to the mass spectrometer and the experimental conditions is
found in the field msRun.
mzXML_struct.mzXML.msRun
ans =
scanCount: 7161
startTime: "PT0.00683333S"
endTime: "PT200.036S"
parentFile: [1×1 struct]
msInstrument: [1×1 struct]
dataProcessing: [1×1 struct]
To facilitate the handling of the data, the MZXML2PEAKS function extracts the list of peaks from each
scan into a cell array (peaks) and their respective retention time into a column vector (time). You
can extract the spectra of a certain level by setting the LEVEL input parameter.
[peaks,time] = mzxml2peaks(mzXML_struct);
numScans = numel(peaks)
numScans =
2387
The MSDOTPLOT function creates an overview display of the most intense peaks in the entire data set.
In this case, we visualize only the most intense 5% ion intensity peaks by setting the input parameter
QUANTILE to 0.95.
h = msdotplot(peaks,time,'quantile',.95);
title('5 Percent Overall Most Intense Peaks')
You can also filter the peaks individually for each scan using a percentile of the base peak intensity.
The base peak is the most intense peak found in each scan [2]. This parameter is given automatically
by most of the spectrometers. This operation requires querying the mzXML structure to obtain
the base peak information. Note that you could also find the base peak intensity by iterating the MAX
function over the peak list.
basePeakInt = [mzXML_struct.scan.basePeakIntensity]';
peaks_fil = cell(numScans,1);
for i = 1:numScans
h = peaks{i}(:,2) > (basePeakInt(i).*0.75);
peaks_fil{i} = peaks{i}(h,:);
end
whos('basePeakInt','level_1','peaks','peaks_fil')
msdotplot(peaks_fil,time)
title('Peaks Above (0.75 x Base Peak Intensity) for Each Scan')
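The alternative mentioned above, finding the base peak intensity by applying MAX to every peak list, can be sketched as follows. The cell array peaksDemo is a made-up stand-in for peaks, with [M/Z, intensity] rows.

```matlab
% Hypothetical peak lists for two scans
peaksDemo = {[400 10; 500 80; 600 30]; [450 5; 700 20]};

% Base peak intensity per scan, computed from the data itself
basePeakDemo = cellfun(@(p) max(p(:,2)),peaksDemo);   % [80; 20]

% Same 75% filter as in the loop above
filtDemo = cell(size(peaksDemo));
for i = 1:numel(peaksDemo)
    keep = peaksDemo{i}(:,2) > 0.75*basePeakDemo(i);
    filtDemo{i} = peaksDemo{i}(keep,:);
end
```

This avoids relying on the basePeakIntensity field, which may be absent from files written by some converters.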
You can customize a 3-D overview of the filtered peaks using the STEM3 function. The STEM3 function
requires the data as three vectors whose elements form the triplets (the retention time, the
mass/charge, and the intensity value) that represent every stem.
peaks_3D = cell(numScans,1);
for i = 1:numScans
peaks_3D{i}(:,[2 3]) = peaks_fil{i};
peaks_3D{i}(:,1) = time(i);
end
peaks_3D = cell2mat(peaks_3D);
figure
stem3(peaks_3D(:,1),peaks_3D(:,2),peaks_3D(:,3),'marker','none')
axis([0 12000 400 1500 0 1e9])
view(60,60)
xlabel('Retention Time (seconds)')
ylabel('Mass/Charge (M/Z)')
zlabel('Relative Ion Intensity')
title('Peaks Above (0.75 x Base Peak Intensity) for Each Scan')
You can plot colored stems using the PATCH function. For every triplet in peaks_3D, interleave a new
triplet with the intensity value set to zero. Then create a color vector dependent on the intensity of
the stem. A logarithmic transformation enhances the dynamic range of the colormap. For the
interleaved triplets assign a NaN, so that the PATCH function does not draw lines connecting contiguous
stems.
peaks_patch = sortrows(repmat(peaks_3D,2,1));
peaks_patch(2:2:end,3) = 0;
col_vec = log(peaks_patch(:,3));
col_vec(2:2:end) = NaN;
figure
patch(peaks_patch(:,1),peaks_patch(:,2),peaks_patch(:,3),col_vec,...
'edgeColor','flat','markeredgecolor','flat','Marker','x','MarkerSize',3);
axis([0 12000 400 1500 0 1e9])
view(90,90)
xlabel('Retention Time (seconds)')
ylabel('Mass/Charge (M/Z)')
zlabel('Relative Ion Intensity')
title('Peaks Above (0.75 x Base Peak Intensity) for Each Scan')
view(40,40)
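The interleaving step used for the colored stems is compact, so it can help to trace it on two made-up triplets (p3Demo is hypothetical):

```matlab
p3Demo = [1 400 50; 2 410 20];          % [time, M/Z, intensity]

% Duplicate every triplet and sort so that the copies are adjacent
ppDemo = sortrows(repmat(p3Demo,2,1));
ppDemo(2:2:end,3) = 0;                  % second copy becomes the foot of the stem

% Color by log intensity; NaN at the feet breaks the line between stems
colDemo = log(ppDemo(:,3));
colDemo(2:2:end) = NaN;
```

Each stem is drawn from its top (odd rows) to its foot (even rows), and the NaN color on the foot row is what keeps PATCH from drawing a segment to the next stem.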
Common techniques in the industry work with peak information (a.k.a. centroided data) instead of
raw signals. This may save memory, but some important details are not visible, especially when it is
necessary to inspect samples with complex mixtures. To further analyze this data set, we can create a
common grid in the mass/charge dimension. Since not all of the scans have enough information to
reconstruct the original signal, we use a peak preserving resampling method. By choosing the
appropriate parameters for the MSPPRESAMPLE function, you can ensure that the resolution of the
spectra is not lost, and that the maximum values of the peaks correlate to the original peak
information.
[MZ,Y] = msppresample(peaks,5000);
whos('MZ','Y')
With this matrix of ion intensities, Y, you can create a colored heat map. The MSHEATMAP function
automatically adjusts the colorbar utilized to show the statistically significant peaks with hot colors
and the noisy peaks with cold colors. The algorithm is based on clustering significant peaks and noisy
peaks by estimating a mixture of Gaussians with an Expectation-Maximization approach. Additionally,
you can use the MIDPOINT input parameter to select an arbitrary threshold to separate noisy peaks
from significant peaks, or you can interactively shift the colormap to hide or unhide peaks. When
working with heat maps, it is common to display the logarithm of the ion intensities, which enhances
the dynamic range of the colormap.
You can zoom to small regions of interest to observe the data, either interactively or
programmatically using the AXIS function. We observe some regions with high relative ion intensity.
These represent peptides in the biological sample.
You can overlay the original peak information of the LC/MS data set. This lets you evaluate the
performance of the peak-preserving resampling operation. You can use the returned handle to hide/
unhide the dots.
dp1 = msdotplot(peaks,time);
The two dimensional peaks appear to be noisy or they do not show a compact shape in contiguous
spectra. This is a common problem for many mass spectrometers. Random fluctuations of the mass/
charge value extracted from peaks of replicate profiles are reported to range from 0.1% to 0.3% [3].
Such variability can be caused by several factors, e.g. poor calibration of the detector, low signal-to-
noise ratio, or problems in the peak extraction algorithms. The MSPALIGN function implements
advanced data binning algorithms that synchronize all the spectra in a data set to a common mass/
charge grid (CMZ). CMZ can be chosen arbitrarily or it can be estimated after analyzing the data
[2,4,5]. The peak matching procedure can use either a nearest neighbor rule or a dynamic
programming alignment.
Repeat the visualization process with the aligned peaks: perform peak preserving resampling, create
a heat map, overlay the aligned peak information, and zoom into the same region of interest as
before. When the spectrum is re-calibrated, it is possible to distinguish the isotope patterns of some
of the peptides.
[MZ_A,Y_A] = msppresample(peaks_CMZ,5000);
fh2 = msheatmap(MZ_A,time,log(Y_A),'resolution',.10,'range',[500 900]);
title('LC/MS Data Set with the Mass/Charge Calibrated to a CMZ')
dp2 = msdotplot(peaks_CMZ,time);
axis([527 539 385 496])
MSPALIGN computes a single CMZ for the whole LC/MS data set. This may not be the ideal case for
samples with more complex mixtures of peptides and/or metabolites than the data set utilized in this
example. In the case of complex mixtures, you can align each spectrum to a local set of spectra that
contain only informative peaks (high intensity) with similar retention times. Otherwise, the calibration
process in regions with small peaks (low intensity) can be biased by other peaks that share similar
mass/charge values but are at different retention times. To perform a finer calibration, you can
employ the SAMPLEALIGN function. This function is a generalization of the Constrained Dynamic
Time Warping (CDTW) algorithms commonly utilized in speech processing [6]. In the following for
loop, we maintain a buffer with the intensities of the previous aligned spectra (LAI). The ion
intensities of the spectra are scaled with the anonymous function SF (inside SAMPLEALIGN) to reduce
the distance between large peaks. SAMPLEALIGN reduces the overall distance of all matched points
and introduces gaps as necessary. In this case we use a finer MZ vector (FMZ), such that we preserve
the correct value of the mass/charge of the peaks as much as possible. Note: this may take some time,
as the CDTW algorithm is executed 2,387 times.
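To make the role of the intensity scaling concrete, the sketch below defines a saturating scaling function and a distance function of the kind described above, and evaluates them on hand-made points. The names sfDemo and dfDemo, the saturation constant 5e7, and the test points are illustrative assumptions, not values confirmed by this text; they only play the role that SF and DF play in the loop.

```matlab
% Assumed saturating scale: large intensities are compressed toward 1
sfDemo = @(x) 1 - exp(-x./5e7);

% Assumed distance between rows of R and S, each row being [M/Z, intensity]
dfDemo = @(R,S) sqrt((sfDemo(R(:,2)) - sfDemo(S(:,2))).^2 + (R(:,1) - S(:,1)).^2);

% Two pairs with the same M/Z offset (0.1) but very different intensity gaps
R = [600.0 1e7; 600.0 1e9];
S = [600.1 2e7; 600.1 2e9];
d = dfDemo(R,S);
```

For the pair of very large peaks, the scaled intensities are both close to 1, so the distance is dominated by the M/Z offset. This is the sense in which scaling reduces the distance between large peaks.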
peaks_FMZ = cell(numScans,1);
for i = 1:numScans
% show progress
if ~rem(i,250)
fprintf(' %d...',i);
end
% align peaks in current scan to LAI
[k,j] = samplealign([FMZ,LAI],double(peaks{i}),'band',1.5,'gap',[0,2],'dist',DF);
% updating the LAI buffer
LAI = LAI*.25;
LAI(k) = LAI(k) + peaks{i}(j,2);
% save the alignment
peaks_FMZ{i} = [FMZ(k) peaks{i}(j,2)];
end
As a final step to improve the image, you can apply a Gaussian filter in the chromatographic direction
to smooth the whole data set.
Gpulse = exp(-.1*(-10:10).^2)./sum(exp(-.1*(-10:10).^2));
YF = convn(Y_B,Gpulse,'same');
fh4 = msheatmap(MZ_B,time,log(YF),'resolution',.10,'range',[500 900]);
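The behavior of the Gaussian kernel can be checked on a toy matrix before applying it to the full data set. In this sketch yDemo is a made-up matrix with M/Z values in rows and scans in columns, as in the intensity matrices above.

```matlab
% Same kernel as above: a 1-by-21 row vector that sums to 1
gDemo = exp(-.1*(-10:10).^2) ./ sum(exp(-.1*(-10:10).^2));

% Single spike in one M/Z row
yDemo = zeros(3,41);
yDemo(2,21) = 1;

% A row-vector kernel smooths along columns, that is, along the scan direction only
yfDemo = convn(yDemo,gDemo,'same');
```

The spike is spread over neighboring scans in the same row, its total intensity is preserved because the kernel is normalized, and the other M/Z rows are untouched.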
You can link the axes of two heat maps, to interactively or programmatically compare regions
between two data sets. In this case we compare the original and the final enhanced LC/MS matrices.
linkaxes(findobj([fh1 fh4],'Tag','MSHeatMap'))
axis([521 538 266 617])
Once the LC/MS data set is smoothed and resampled into a regular grid, it is possible to extract the
most informative spectra by looking at the local maxima of the Total Ion Chromatogram (TIC). The
TIC is computed by combining the ion intensities of each scan, that is, of each column of YF. The code
below uses the column mean, which differs from the column sum only by a constant factor. Then, use the
MSPEAKS function to find the retention time values for extracting selected subsets of spectra.
TIC = mean(YF);
pt = mspeaks(time,TIC','multiplier',10,'overseg',100,'showplot',true);
title('Extracted peaks from the Total Ion Chromatogram (TIC)')
pt(pt(:,1)>4000,:) = []; % remove spectra above 4000 seconds
numPeaks = size(pt,1)
numPeaks =
22
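A toy matrix makes the TIC computation explicit. Here yDemo is hypothetical, with M/Z values in rows and scans in columns, the same orientation as YF.

```matlab
yDemo = [1  5 2;
         3 10 1;
         0  5 1];

ticSum  = sum(yDemo,1);     % total ion intensity per scan: [4 20 4]
ticMean = mean(yDemo,1);    % same profile divided by the number of M/Z values

[~,apexSum]  = max(ticSum);
[~,apexMean] = max(ticMean);
```

Both versions peak at the same scan, so the choice between sum and mean does not change which retention times are reported as local maxima, although it does change the absolute height of the profile.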
figure;
hold on
box on
plot3(repmat(MZ_B,1,numPeaks),repmat(1:numPeaks,numel(MZ_B),1),xSpec)
view(20,85)
ax = gca;
ax.YTick = 1:numPeaks;
ax.YTickLabel = num2str(time(xRows));
axis([500 900 0 numPeaks 0 1e8])
xlabel('Mass/Charge (M/Z)')
ylabel('Time')
zlabel('Relative Ion Intensity')
title('Extracted Spectra Subset')
Overlay markers for the extracted spectra over the enhanced heatmap.
linkaxes(findobj(fh4,'Tag','MSHeatMap'),'off')
figure(fh4)
hold on
for i = 1:numPeaks
plot([400 1500],xRows([i i]),'m')
end
axis([500 900 100 925])
dp4.Visible = 'off';
title('Final Enhanced LC/MS Data Set with Extracted Spectra Marked')
References
[1] Desiere, F. et al., "The Peptide Atlas Project", Nucleic Acids Research, 34:D655-8, 2006.
[2] Purvine, S., Kolker, N., and Kolker, E., "Spectral Quality Assessment for High-Throughput Tandem
Mass Spectrometry Proteomics", OMICS: A Journal of Integrative Biology, 8(3):255-65, 2004.
[3] Kazmi, A.S., et al., "Alignment of high resolution mass spectra: Development of a heuristic
approach for metabolomics", Metabolomics, 2(2):75-83, 2006.
[4] Jeffries, N., "Algorithms for alignment of mass spectrometry proteomic data", Bioinformatics,
21(14):3066-3073, 2005.
[5] Yu, W., et al., "Multiple peak alignment in sequential data analysis: A scale-space based approach",
IEEE®/ACM Trans. Computational Biology and Bioinformatics, 3(3):208-219, 2006.
[6] Sakoe, H. and Chiba, S., "Dynamic programming algorithm optimization for spoken word
recognition", IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-26(1):43-9, 1978.
Identifying Significant Features and Classifying Protein Profiles
This example shows how to classify mass spectrometry data and use some statistical tools to look for
potential disease markers and proteomic pattern diagnostics.
Introduction
Serum proteomic pattern diagnostics can be used to differentiate samples from patients with and
without disease. Profile patterns are generated using surface-enhanced laser desorption and
ionization (SELDI) protein mass spectrometry. This technology has the potential to improve clinical
diagnostics tests for cancer pathologies. The goal is to select a reduced set of measurements or
"features" that can be used to distinguish between cancer and control patients. These features will be
ion intensity levels at specific mass/charge values.
Preprocess Data
The ovarian cancer data set in this example is from the FDA-NCI Clinical Proteomics Program
Databank. The data set was generated using the WCX2 protein array. The data set includes 95
controls and 121 ovarian cancers. For a detailed description of this data set, see [1] and [4].
This example assumes that you already have the preprocessed data
OvarianCancerQAQCdataset.mat. However, if you do not have the data file, you can recreate it by
following the steps in the example “Batch Processing of Spectra Using Sequential and Parallel
Computing” on page 6-79.
NumberCancerDatasets = numel(filesCancer);
fprintf('Found %d Cancer mass-spectrograms.\n',NumberCancerDatasets)
filesNormal = dir([repositoryN '*.txt']);
NumberNormalDatasets = numel(filesNormal);
fprintf('Found %d Control mass-spectrograms.\n',NumberNormalDatasets)
[MZ,Y] = msbatchprocessing(repository,files);
grp = [repmat({'Cancer'},size(filesCancer));...
repmat({'Normal'},size(filesNormal))];
The preprocessing steps from the script and example listed above are intended to illustrate a
representative set of possible pre-processing procedures. Using different steps or parameters may
lead to different and possibly improved results of this example.
Load Data
Once you have the preprocessed data, you can load it into MATLAB.
load OvarianCancerQAQCdataset
whos
There are three variables: MZ, Y, grp. MZ is the mass/charge vector, Y is the intensity values for all
216 patients (control and cancer), and grp holds the index information as to which of these samples
represent cancer patients and which ones represent normal patients.
Initialize some variables that will be used throughout the example.
You can plot some data sets into a figure window to visually compare profiles from the two groups; in
this example five spectrograms from cancer patients (blue) and five from control patients (green) are
displayed.
Zooming in on the region from 8500 to 8700 M/Z shows some peaks that might be useful for
classifying the data.
axis([8450,8700,-1,7])
Another way to visualize the whole data set is to look at the group average signal for the control and
cancer samples. You can plot the group average and the envelopes of each group.
Observe that no single feature appears to discriminate the two groups perfectly.
A simple approach for finding significant features is to assume that each M/Z value is independent
and compute a two-sample t-test. rankfeatures returns an index to the most significant M/Z values,
for instance 100 indices ranked by the absolute value of the test statistic. This feature selection
method is also known as a filtering method, where the learning algorithm is not involved in how the
features are selected.
[feat,stat] = rankfeatures(Y,grp,'CRITERION','ttest','NUMBER',100);
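To show what ranking by the absolute test statistic means, the following sketch computes a two-sample t statistic with pooled variance for each row of a toy matrix, using only core MATLAB. It is assumed, not verified here, that this matches the statistic behind the 'ttest' criterion; yDemo and isA are made-up data with features in rows and samples in columns.

```matlab
yDemo = [1 2 3 7 8 9;       % feature 1: the two groups are well separated
         4 5 6 5 4 6];      % feature 2: the two groups overlap completely
isA   = logical([1 1 1 0 0 0]);

A  = yDemo(:,isA);   B  = yDemo(:,~isA);
nA = size(A,2);      nB = size(B,2);

% Pooled standard deviation and absolute t statistic per feature
sp   = sqrt(((nA-1)*var(A,0,2) + (nB-1)*var(B,0,2)) / (nA+nB-2));
tAbs = abs(mean(A,2) - mean(B,2)) ./ (sp*sqrt(1/nA + 1/nB));

% Rank features from most to least discriminative
[~,order] = sort(tAbs,'descend');
```

Feature 1 receives a large statistic and is ranked first, while feature 2 has identical group means and a statistic of zero.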
The first output of rankfeatures can be used to extract the M/Z values of the significant features.
sig_Masses = MZ(feat);
sig_Masses(1:7)' %display the first seven
ans =
1.0e+03 *
The second output of rankfeatures is a vector with the absolute value of the test statistic. You can
plot it over the spectra using yyaxis.
figure;
yyaxis left
plot(MZ, [mean_N mean_C]);
ylim([-1,20])
xlim([7950,8300])
title('Significant M/Z Values')
xlabel(xAxisLabel);
ylabel(yAxisLabel);
yyaxis right
plot(MZ,stat);
ylim([-1,22])
ylabel('Test Statistic');
Notice that there are significant regions at high M/Z values but low intensity (~8100 Da.). Other
approaches to measure class separability are available in rankfeatures, such as entropy based,
Bhattacharyya, or the area under the empirical receiver operating characteristic (ROC) curve.
Now that you have identified some significant features, you can use this information to classify the
cancer and normal samples. Due to the small number of samples, you can run a cross-validation using
the 20% holdout to have a better estimation of the classifier performance. cvpartition allows you
to set the training and test indices for different types of system evaluation methods, such as hold-out,
K-fold and Leave-M-Out.
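A 20% holdout partition can be sketched on a small made-up label vector as follows. The example requires Statistics and Machine Learning Toolbox for cvpartition; grpDemo is hypothetical and stands in for grp.

```matlab
% Hypothetical labels: 10 cancer and 10 normal samples
grpDemo = [repmat({'Cancer'},10,1); repmat({'Normal'},10,1)];

% Holdout partition, stratified by group, with 20% of samples for testing
cvDemo = cvpartition(grpDemo,'HoldOut',0.20);

trainIdx = training(cvDemo);    % logical index of training samples
testIdx  = test(cvDemo);        % logical index of test samples
```

Every sample falls in exactly one of the two subsets, and calling repartition(cvDemo) draws a new random split with the same proportions.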
cv =
Observe that features are selected only from the training subset and the validation is performed with
the test subset. classperf allows you to keep track of multiple validations.
After the loop you can assess the performance of the overall blind classification using any of the
properties in the CP object, such as the error rate, sensitivity, specificity, and others.
cp_lda1
Label: ''
Description: ''
ClassLabels: {2x1 cell}
GroundTruth: [216x1 double]
NumberOfObservations: 216
ControlClasses: 2
TargetClasses: 1
ValidationCounter: 10
SampleDistribution: [216x1 double]
ErrorDistribution: [216x1 double]
SampleDistributionByClass: [2x1 double]
ErrorDistributionByClass: [2x1 double]
CountingMatrix: [3x2 double]
CorrectRate: 0.8488
ErrorRate: 0.1512
LastCorrectRate: 0.8837
LastErrorRate: 0.1163
InconclusiveRate: 0
ClassifiedRate: 1
Sensitivity: 0.8208
Specificity: 0.8842
PositivePredictiveValue: 0.8995
NegativePredictiveValue: 0.7962
PositiveLikelihood: 7.0890
NegativeLikelihood: 0.2026
Prevalence: 0.5581
DiagnosticTable: [2x2 double]
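Several of the reported properties are related by simple formulas. As a quick consistency check, the likelihood ratios can be recomputed from the sensitivity and specificity printed above, using the standard definitions (assumed here to be the ones classperf applies):

```matlab
sens = 0.8208;                   % Sensitivity, copied from the output above
spec = 0.8842;                   % Specificity, copied from the output above

posLR = sens / (1 - spec);       % positive likelihood ratio
negLR = (1 - sens) / spec;       % negative likelihood ratio
```

The results agree with the displayed PositiveLikelihood and NegativeLikelihood up to the rounding of the four-digit inputs.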
This naive approach for feature selection can be improved by eliminating some features based on the
regional information. For example, 'NWEIGHT' in rankfeatures outweighs the test statistic of
neighboring M/Z features such that other significant M/Z values can be incorporated into the subset
of selected features.
ans =
0.9023
Lilien et al. presented in [2] an algorithm to reduce the data dimensionality that uses principal
component analysis (PCA); LDA is then used to classify the groups. In this example, 2000 of the most
significant features in the M/Z space are mapped to the 150 principal components.
ans =
0.9814
Feature selection can also be reinforced by classification; this approach is usually referred to as a
wrapper selection method. Randomized search for feature selection generates random subsets of
features and assesses their quality independently with the learning algorithm. Later, it selects a pool
of the most frequent good features. Li et al. in [3] apply this concept to the analysis of protein
expression patterns. The randfeatures function allows you to search a subset of features using LDA
or a k-nearest neighbor classifier over randomized subsets of features.
Note: the following code is computationally intensive, so it has been disabled in the example.
Also, for better results you should increase the pool size and the stringency of the classifier from the
default values in randfeatures. Type help randfeatures for more information.
Assess the Quality of the Selected Features with the Evaluation Set
The first output from randfeatures is an ordered list of indices of MZ values. The first item occurs
most frequently in the subsets where good classification was achieved. The second output is the
actual counts of the number of times each value was selected. You can use hist to look at this
distribution.
figure;
hist(fCount,max(fCount)+1);
You will see that most values appear at most once in a selected subset. Zooming in gives a better idea
of the details for the more frequently selected values.
axis([0 80 0 100])
Only a few values were selected more than 10 times. You can visualize these by using a stem plot to
show the most frequently selected features.
These features appear to clump together in several groups. You can investigate further how many of
the features are significant by running the following experiment. The most frequently selected
feature is used to classify the data, then the two most frequently selected features are used and so on
until all the features that were selected more than 10 times are used. You can then see if adding more
features improves the classifier.
nSig = sum(fCount>10);
cp_rndfeat = zeros(20,nSig);
for i = 1:nSig
for j = 1:20
cv = repartition(cv);
P = pca(Y(feat(1:i),training(cv))');
x = Y(feat(1:i),:)' * P;
c = classify(x(test(cv),:),x(training(cv),:),grp(training(cv)));
cp = classperf(grp,c,test(cv));
cp_rndfeat(j,i) = cp.CorrectRate; % average correct classification rate
end
end
figure
plot(1:nSig, [max(cp_rndfeat);mean(cp_rndfeat)]);
legend({'Best CorrectRate','Mean CorrectRate'},'Location', 'SouthEast')
From this graph you can see that for as few as three features it is sometimes possible to get perfect
classification. You will also notice that the maximum of the mean correct rate occurs for a small
number of features and then gradually decreases.
You can now visualize the features that give the best average classification. You can see that these
actually correspond to only three peaks in the data.
There are many classification tools in MATLAB® that you can also use to analyze proteomic data.
Among them are support vector machines (fitcsvm), k-nearest neighbors (fitcknn), neural
networks (Deep Learning Toolbox™), and classification trees (fitctree). For feature selection, you
can also use sequential subset feature selection (sequentialfs) or optimize the randomized search
methods by using a genetic algorithm (Global Optimization Toolbox). For example, see “Genetic
Algorithm Search for Features in Mass Spectrometry Data” on page 6-71.
References
[1] Conrads, T P, V A Fusaro, S Ross, D Johann, V Rajapakse, B A Hitt, S M Steinberg, et al. “High-
Resolution Serum Proteomic Features for Ovarian Cancer Detection.” Endocrine-Related
Cancer, June 2004, 163–78.
[2] Lilien, Ryan H., Hany Farid, and Bruce R. Donald. “Probabilistic Disease Classification of
Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum.” Journal of
Computational Biology 10, no. 6 (December 2003): 925–46.
[3] Li, L., D. M. Umbach, P. Terry, and J. A. Taylor. “Application of the GA/KNN Method to SELDI
Proteomics Data.” Bioinformatics 20, no. 10 (July 1, 2004): 1638–40.
[4] Petricoin, Emanuel F, Ali M Ardekani, Ben A Hitt, Peter J Levine, Vincent A Fusaro, Seth M
Steinberg, Gordon B Mills, et al. “Use of Proteomic Patterns in Serum to Identify Ovarian
Cancer.” The Lancet 359, no. 9306 (February 2002): 572–77.
See Also
msnorm | rankfeatures | classperf
Related Examples
• “Batch Processing of Spectra Using Sequential and Parallel Computing” on page 6-79
Differential Analysis of Complex Protein and Metabolite Mixtures using Liquid Chromatography/Mass Spectrometry (LC/MS)
This example shows how the SAMPLEALIGN function can correct nonlinear warping in the
chromatographic dimension of hyphenated mass spectrometry data sets without the need for full
identification of the sample compounds and/or the use of internal standards. By correcting such
warping between a pair (or set) of biologically related samples, differential analysis is enhanced and
can be automated.
Introduction
The use of complex peptide and metabolite mixtures in LC/MS requires label-free alignment
procedures. The analysis of this type of data requires searching for statistically significant differences
between biologically related data sets, without the need for a full identification of all the compounds
in the sample (either peptides/proteins or metabolites). Comparing compounds requires alignment in
two dimensions, the mass-charge dimension and the retention time dimension [1]. In the examples
“Preprocessing Raw Mass Spectrometry Data” on page 6-2 and “Visualizing and Preprocessing
Hyphenated Mass Spectrometry Data Sets for Metabolite and Protein/Peptide Profiling” on page 6-19,
you can learn how to use the MSALIGN, MSPALIGN, and SAMPLEALIGN functions to warp or calibrate
different types of anomalies in the mass/charge dimension. In this example, you will learn how to use
the SAMPLEALIGN function to also correct the nonlinear and unpredictable variations in the retention
time dimension.
While it is possible to implement alternative methods for aligning retention times, other approaches
typically require identification of compounds, which is not always feasible, or manual manipulations
that thwart attempts to automate for high throughput data analysis.
This example uses two samples in PAe000153 and PAe000155 available from Peptide Atlas [2]. The
samples are LC-ESI-MS scans of four salt protein fractions from Saccharomyces cerevisiae, each
containing more than 1000 peptides. The yeast samples were treated with different chemicals (glycine
and serine) to obtain two biologically diverse samples. Time alignment of these two data sets is
one of the most challenging cases reported in [3]. The data sets are not distributed with MATLAB®.
You must download the data sets to a local directory or your own repository. Alternatively, you can try
other data sets available in public databases for protein data, such as the Sashimi Data Repository. If you
receive any errors related to memory or Java heap space, try increasing your Java heap space.
LC/MS data analysis requires large amounts of memory from the operating
system; if you receive "Out of memory" errors when running this example, try increasing the
virtual memory (or swap space) of your operating system or try setting the 3GB switch (32-bit
Windows® XP only). For details, see “Resolve “Out of Memory” Errors”.
Read and extract the lists of peaks from the XML files containing the intensity data for the sample
treated with serine and the sample treated with glycine.
ser = mzxmlread('005_1.mzXML')
[ps,ts] = mzxml2peaks(ser,'level',1);
gly = mzxmlread('005a.mzXML')
[pg,tg] = mzxml2peaks(gly,'level',1);
ser =
gly =
Use the MSPPRESAMPLE function to resample the data sets while preserving the peak heights and
locations in the mass/charge dimension. Both data sets are resampled to a common grid with
5,000 mass/charge values. A common grid is desirable for comparative visualization and for
differential analysis.
[MZs,Ys] = msppresample(ps,5000);
[MZg,Yg] = msppresample(pg,5000);
Use the MSHEATMAP function to visualize both samples. When working with heat maps it is a common
technique to display the logarithm of the ion intensities, which enhances the dynamic range of the
colormap.
fh1 = msheatmap(MZs,ts,log(Ys),'resolution',0.15);
title('Serine Treatment')
fh2 = msheatmap(MZg,tg,log(Yg),'resolution',0.15);
title('Glycine Treatment')
Notice that you can visualize the data sets separately; however, the time vectors have different sizes, and
therefore the heat maps have different numbers of rows (that is, Ys and Yg have different numbers of
columns). Moreover, the sampling rate is not constant and the shift between the time vectors is not
linear.
whos('Ys','Yg','ts','tg')
figure
plot(1:numel(ts),ts,1:numel(tg),tg)
legend('Serine','Glycine','Location','NorthWest')
title('Time Vectors')
xlabel('Spectrum Index')
ylabel('Retention Time (seconds)')
To observe the same region of interest in both data sets, you need to calculate the appropriate row
indices in each matrix. For example, to inspect the peptide peaks in the 480-520 Da MZ range and
1550-1900 seconds retention time range, you need to find the closest matches for this range in the
time vectors and then zoom in on each figure:
ind_ser = samplealign(ts,[1550;1900]);
figure(fh1);
axis([480 520 ind_ser'])
ind_gly = samplealign(tg,[1550;1900]);
figure(fh2);
axis([480 520 ind_gly'])
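The two lookups above amount to finding, for each limit of the retention-time window, the index of the closest entry in an irregularly sampled time vector. The following is a minimal sketch of that idea in base MATLAB, assuming a short synthetic time vector t that stands in for ts or tg; it is only meant to illustrate the lookup and is not a replacement for SAMPLEALIGN.

```matlab
% Synthetic, irregularly sampled retention times (seconds); stands in for ts
t = [1500; 1549; 1552; 1700; 1895; 1903; 2000];
limits = [1550; 1900];                 % retention-time window of interest

% For each limit, pick the index of the closest entry in t.
% t is N-by-1 and limits' is 1-by-2, so the difference is N-by-2.
[~,ind] = min(abs(t - limits'),[],1);
ind = ind(:)                           % column vector of row indices
```

The resulting indices can be used as the row limits of the heat map axes, in the same way as ind_ser and ind_gly.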
Even though you zoomed in on the same range, you can still observe that the top-right peptide in the
axes is shifted in the retention time dimension. In the sample treated with serine, the center of this
peak appears to occur at approximately 1595 seconds, while in the sample treated with glycine the
putative same peptide occurs at approximately 1630 seconds. This shift prevents an accurate
comparative analysis, even if you resample the data sets to the same time vector. In addition to the
shift in the retention time, the data set seems to be improperly calibrated in the mass/charge
dimension, because the peaks do not have a compact shape in contiguous spectra.
Before correcting the retention time, you can enhance the samples using an approach similar to the
one described in the example “Visualizing and Preprocessing Hyphenated Mass Spectrometry Data
Sets for Metabolite and Protein/Peptide Profiling” on page 6-19. For brevity, we only display the
MATLAB code without any further explanation:
SF = @(x) 1-exp(-x./5e7); % scaling function
DF = @(R,S) sqrt((SF(R(:,2))-SF(S(:,2))).^2 + (R(:,1)-S(:,1)).^2);
CMZ = (315:.10:680)'; % Common Mass/Charge Vector with a finer grid
end
% Peak-preserving resample
[MZs,Ys] = msppresample(psa,5000);
[MZg,Yg] = msppresample(pga,5000);
% Gaussian Filtering
Gpulse = exp(-.5*(-10:10).^2)./sum(exp(-.5*(-10:10).^2)); % normalized so the kernel sums to 1
Ysf = convn(Ys,Gpulse,'same');
Ygf = convn(Yg,Gpulse,'same');
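A smoothing kernel is usually normalized so that its coefficients sum to 1, which keeps the total ion intensity of an isolated peak unchanged after filtering. The sketch below illustrates this with a small synthetic intensity matrix; the kernel width and the matrix Y are assumptions chosen only for illustration.

```matlab
x = -10:10;                    % kernel support (21 taps)
g = exp(-0.5*x.^2);            % unnormalized Gaussian (illustrative width)
Gpulse = g./sum(g);            % normalize so the taps sum to 1

% Synthetic intensity matrix: rows are m/z bins, columns are spectra
Y = zeros(5,41);
Y(3,21) = 100;                 % a single isolated peak

% A 1-by-21 kernel smooths along the second dimension (across spectra)
Yf = convn(Y,Gpulse,'same');

totalBefore = sum(Y(:));
totalAfter  = sum(Yf(:));      % matches totalBefore up to rounding error
```

Because the kernel is a row vector, only neighboring columns are mixed; the rows that contain no signal stay at zero.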
% Visualization
fh3 = msheatmap(MZs,ts,log(Ysf),'resolution',0.15);
title('Serine Treatment (Enhanced)')
axis([480 520 ind_ser'])
fh4 = msheatmap(MZg,tg,log(Ygf),'resolution',0.15);
title('Glycine Treatment (Enhanced)')
axis([480 520 ind_gly'])
Chromatographic Alignment
At this point, you have calibrated the mass/charge dimension and smoothed the two LC/MS data sets, but you are
still unable to perform a differential analysis because the data sets have a small misalignment along
the retention time axis.
You can use SAMPLEALIGN to correct the drift in the chromatographic domain. First, inspect the data
and look for the worst-case shift; this helps you estimate the BAND constraint. By
panning over both heat maps you can observe that common peptide peaks are not shifted more than
50 seconds. Use the input argument SHOWCONSTRAINTS to display the constraint space for the time
warping operation and assess whether the Dynamic Programming (DP) algorithm can handle this problem
size. In this case you have fewer than 12,000 nodes. When you omit the output arguments, SAMPLEALIGN
displays only the constraints without running the DP algorithm. Also note that the input signals are
the filtered and enhanced data sets, but these have been upsampled to 5,000 MZ values, and using
all of them is computationally very demanding. Therefore, use the function MSPALIGN to obtain a
reduced list of mass/charge values (RMZ) indicating where the most intense peaks are, then use the
SAMPLEALIGN function again to find the indices of MZs (or MZg) that best match the reduced mass/charge
vector:
RMZ = mspalign([ps;pg])';
idx = samplealign(MZs,RMZ,'width',1); % with these input parameters this
% operation is equivalent to find the
% nearest neighbor for each RMZ in
% MZs.
SAMPLEALIGN uses the Euclidean distance by default to score matched pairs of samples. In LC/MS
data sets each sample corresponds to a spectrum at a given time; therefore, the cross-correlation
between each pair of matched spectra provides a better distance metric. SAMPLEALIGN allows you to
define your own metric to calculate the distance between spectra. It is also possible to envision a
metric that rewards spectra pairs that match high ion intensity peaks more than pairs that match low ion
intensity noisy peaks. Use the input argument WEIGHT to remove the first column from the inputs,
which represents the retention time, so the scoring metric between spectra is based only on the ion
intensities.
cc = @(Xu,Yu) (mean(Xu.*Yu,2).^2)./mean(Xu.*Xu,2)./mean(Yu.*Yu,2);
ub = @(X) bsxfun(@minus,X,mean(X,2));
df = @(x,y) 1-cc(ub(x),ub(y));
fh5 = figure;
plot(ts(i),tg(j)); grid
title('Warp Function')
xlabel('Retention Time in Data Set with Serine')
ylabel('Retention Time in Data Set with Glycine')
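To get a feel for the correlation-based distance df defined above, you can evaluate it on a few synthetic spectra. The sketch below repeats the three handle definitions so that it runs on its own; the row vectors a, b, and c are made-up spectra used only for illustration.

```matlab
cc = @(Xu,Yu) (mean(Xu.*Yu,2).^2)./mean(Xu.*Xu,2)./mean(Yu.*Yu,2);
ub = @(X) bsxfun(@minus,X,mean(X,2));
df = @(x,y) 1-cc(ub(x),ub(y));

a = [0 1 5 1 0 0 2 0];         % synthetic spectrum (one row per spectrum)
b = 3*a + 10;                  % same pattern, different gain and offset
c = [2 0 0 0 1 5 1 0];         % peaks at different positions

d_ab = df(a,b)                 % close to 0: the peak patterns match
d_ac = df(a,c)                 % much larger: the peak patterns differ
```

Because the metric removes the mean and uses the squared correlation, it is insensitive to gain and offset differences between spectra. A side effect of squaring is that strongly anti-correlated spectra would also receive a small distance.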
Because it is expected that the real shift function between both data sets is continuous, you can
smooth the observed shifts or regress a continuous model to create a warp model between the two
time axes. In this example, we simply regress the drifting to a polynomial function:
[p,p_struct,mu] = polyfit((ts(i)+tg(j))/2,ts(i)-tg(j),20);
sf = @(z) polyval(p,(z-mu(1))./mu(2));
figure(fh6)
plot(tg,sf(tg),'r')
legend('Observed shifts','Estimated shift curve','Location','SouthEast')
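The third output of polyfit, mu, holds the mean and standard deviation of the abscissa. When you request it, the polynomial is fitted to centered and scaled values, so the same transformation must be applied before calling polyval. The sketch below shows the pattern on a synthetic drift curve; the time vector, the drift shape, and the polynomial degree are assumptions chosen for illustration.

```matlab
t = (1000:50:3000)';                       % synthetic retention times (s)
shift = 20 + 15*sin((t-1000)/2000*pi);     % synthetic smooth drift (s)

% mu = [mean(t); std(t)] is used to center and scale t before fitting
[p,~,mu] = polyfit(t,shift,5);
sf = @(z) polyval(p,(z-mu(1))./mu(2));     % apply the same scaling

maxErr = max(abs(sf(t) - shift))           % worst-case fit error in seconds
```

Centering and scaling improves the numerical conditioning of the fit, which matters for retention times in the thousands of seconds. A very high polynomial degree can still be poorly conditioned or oscillate between samples, so it is worth checking the fitted curve against the observed shifts.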
Comparative Analysis
To carry out a comparative analysis, resample the LC/MS data sets to a common time vector. When
resampling we use the estimated shift function to correct the retention time dimension. In this
example, each spectrum in both data sets is shifted half the distance of the shift function. In the case
of multiple alignment of data sets, it is possible to calculate the pairwise shift functions between all
data sets, and then apply the corrections to a common reference in such a way that the overall shifts
are minimized [4].
t = (max(min(tg),min(ts)):mean(diff(tg)):min(max(tg),max(ts)))';
% Visualization
fh7 = msheatmap(MZs,t,log(interp1q(ts,Ysf',t+sf(t)/2)'),'resolution',0.15);
title('Serine Treatment (Enhanced & Aligned)')
fh8 = msheatmap(MZg,t,log(interp1q(tg,Ygf',t-sf(t)/2)'),'resolution',0.15);
title('Glycine Treatment (Enhanced & Aligned)')
To interactively or programmatically compare regions between two enhanced and time-aligned data
sets, you can link the axes of two heat maps. Because the heat maps now use a regularly spaced time
vector, you can zoom in by using the AXIS function without having to search for the appropriate row
indices of the matrices.
linkaxes(findobj([fh7 fh8],'Tag','MSHeatMap'))
axis([480 520 1550 1900])
References
[1] Listgarten, J. and Emili, A., "Statistical and computational methods for comparative proteomic
profiling using liquid chromatography tandem mass spectrometry", Molecular & Cellular Proteomics,
4(4):419-34, 2005.
[2] Desiere, F., et al., "The Peptide Atlas Project", Nucleic Acids Research, 34:D655-8, 2006.
[3] Prince, J.T. and Marcotte, E.M., "Chromatographic alignment of ESI-LC-MS proteomics data sets
by ordered bijective interpolated warping", Analytical Chemistry, 78(17):6140-52, 2006.
[4] Andrade L. and Manolakos E.S., "Automatic Estimation of Mobility Shift Coefficients in DNA
Chromatograms", Neural Networks for Signal Processing Proceedings, 91-100, 2003.
Genetic Algorithm Search for Features in Mass Spectrometry Data
This example shows how to use the Global Optimization Toolbox with the Bioinformatics
Toolbox™ to optimize the search for features to classify mass spectrometry (SELDI) data.
Introduction
Genetic algorithms optimize search results for problems with large data sets. You can use the
MATLAB® genetic algorithm function to solve these problems in Bioinformatics. Genetic algorithms
have been applied to phylogenetic tree building, gene expression and mass spectrometry data
analysis, and many other areas of Bioinformatics that have large and computationally expensive
problems. This example searches for optimal features (peaks) in mass spectrometry data. We will look
for specific peaks in the data that distinguish cancer patients from control patients.
First familiarize yourself with the Global Optimization Toolbox. The documentation describes how a
genetic algorithm works and how to use it in MATLAB. To access the documentation, use the doc
command.
doc ga
The original data in this example is from the FDA-NCI Clinical Proteomics Program Databank. It is a
collection of samples from 121 ovarian cancer patients and 95 control patients. For a detailed
description of this data set, see [1] and [2].
This example assumes that you already have the preprocessed data
OvarianCancerQAQCdataset.mat. However, if you do not have the data file, you can recreate it by
following the steps in the example “Batch Processing of Spectra Using Sequential and Parallel
Computing” on page 6-79.
[MZ,Y] = msbatchprocessing(repository,files);
grp = [repmat({'Cancer'},size(filesCancer));...
repmat({'Normal'},size(filesNormal))];
Once you have the preprocessed data, you can load it into MATLAB.
load OvarianCancerQAQCdataset
whos
There are three variables: MZ, Y, grp. MZ is the mass/charge vector, Y is the intensity values for all
216 patients (control and cancer), and grp holds the index information as to which of these samples
represent cancer patients and which ones represent normal patients. To visualize this data, see the
example “Identifying Significant Features and Classifying Protein Profiles” on page 6-38.
A genetic algorithm requires an objective function, also known as the fitness function, which
describes the phenomenon that we want to optimize. In this example, the genetic algorithm
machinery tests small subsets of M/Z values using the fitness function and then determines which
M/Z values get passed on to or removed from each subsequent generation. The fitness function
biogafit is passed to the genetic algorithm solver using a function handle. In this example, biogafit
maximizes the separability of two classes by using a linear combination of 1) the a-posteriori
probability and 2) the empirical error rate of a linear classifier (classify). You can create your own
fitness function to try different classifiers or alternative methods for assessing the performance of the
classifiers.
type biogafit
function classPerformance = biogafit(thePopulation,Y,id)
% Fitness function: lower values indicate better class separation
thePopulation = round(thePopulation);
try
    [c,~,p] = classify(Y(thePopulation,:)',Y(thePopulation,:)',double(id),'linear');
    cp = classperf(id,c);
    classPerformance = 100*cp.ErrorRate + 1 - mean(max(p,[],2));
catch
    % In case pooled covariance matrix is not positive definite we try a
    % naive-Bayes classifier:
    try
        [c,~,p] = classify(Y(thePopulation,:)',Y(thePopulation,:)',double(id),'diaglinear');
        cp = classperf(id,c);
        classPerformance = 100*cp.ErrorRate + 1 - mean(max(p,[],2));
    catch
        classPerformance = Inf;
    end
end
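As an illustration of writing your own fitness function, the following sketch scores a candidate feature subset by the error rate of a simple nearest-centroid classifier, using only base MATLAB. The handle name nearestCentroidFit, the helper handles, and the synthetic data are hypothetical and are not part of the toolbox or of this example's files. The sketch assumes that Y stores features in rows and samples in columns, and that id is a logical row vector that is true for one of the two classes.

```matlab
% Squared distance of each column of X to the centroid c (column vector)
sqdist = @(X,c) sum((X - c).^2,1);

% True where a sample is closer to the centroid of the id==true class
isClassA = @(X,id) sqdist(X,mean(X(:,id),2)) < sqdist(X,mean(X(:,~id),2));

% Hypothetical fitness: fraction of misclassified samples (lower is better)
nearestCentroidFit = @(idx,Y,id) mean(isClassA(Y(round(idx),:),id) ~= id);

% Synthetic data: 4 features (rows), 6 samples (columns)
Y  = [1 1 1 9 9 9;     % feature 1 separates the two classes
      5 4 6 5 4 6;     % features 2 to 4 carry no class information
      2 3 2 3 2 3;
      7 7 7 7 7 7];
id = logical([1 1 1 0 0 0]);

err_good = nearestCentroidFit(1,Y,id)       % informative feature
err_bad  = nearestCentroidFit([3 4],Y,id)   % uninformative features
```

Because ga minimizes the fitness value, a handle like this could be supplied in the same form as biogafit, for example as {nearestCentroidFit,Y,id}. Note that this sketch evaluates the error on the training samples themselves, as biogafit does, so it tends to be optimistic.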
Users can change how the optimization is performed by the genetic algorithm by creating custom
functions for crossover, fitness scaling, mutation, selection, and population creation. In this example
you will use the biogacreate function written for this example to create initial random data points
from the mass spectrometry data. The function header requires specific input parameters as specified
by the GA documentation. There is a default creation function in the toolbox for creating initial
populations of data points.
type biogacreate
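function pop = biogacreate(GenomeLength,~,options,Y,id)
% Creation function: seed the initial population with highly ranked features
pop = zeros(options.PopulationSize,GenomeLength);
npop = numel(pop);
ranked_features = rankfeatures(Y,id,'NumberOfIndices',npop,'NWeighting',.5);
pop(:) = randsample(ranked_features,npop);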
The GA function uses an options structure to hold the algorithm parameters that it uses when
performing a minimization with a genetic algorithm. The optimoptions function will create this
options structure. For the purposes of this example, the genetic algorithm will run only for 50
generations. However, you may set 'Generations' to a larger value.
options = optimoptions('ga','CreationFcn',{@biogacreate,Y,id},...
'PopulationSize',120,...
'Generations',50,...
'Display','iter')
options =
ga options:
Set properties:
CreationFcn: {1x3 cell}
Display: 'iter'
MaxGenerations: 50
PopulationSize: 120
Default properties:
ConstraintTolerance: 1.0000e-03
CrossoverFcn: []
CrossoverFraction: 0.8000
EliteCount: '0.05*PopulationSize'
FitnessLimit: -Inf
FitnessScalingFcn: @fitscalingrank
FunctionTolerance: 1.0000e-06
HybridFcn: []
InitialPopulationMatrix: []
InitialPopulationRange: []
InitialScoresMatrix: []
MaxStallGenerations: 50
MaxStallTime: Inf
MaxTime: Inf
MutationFcn: []
NonlinearConstraintAlgorithm: 'auglag'
OutputFcn: []
PlotFcn: []
PopulationType: 'doubleVector'
SelectionFcn: []
UseParallel: 0
UseVectorized: 0
Use ga to start the genetic algorithm function. A population of 120 individuals, each a set of 12
candidate features, will evolve over 50 generations. Selection, crossover, and mutation events
generate a new population in every generation.
% initialize the random generators to the same state used to generate the
% published example
rng('default')
nVars = 12; % set the number of desired features
FitnessFcn = {@biogafit,Y,id}; % set the fitness function
feat = ga(FitnessFcn,nVars,options); % call the Genetic Algorithm
feat = round(feat);
Significant_Masses = MZ(feat)
cp = classperf(classify(Y(feat,:)',Y(feat,:)',id),id);
cp.CorrectRate
Options:
CreationFcn: @biogacreate
CrossoverFcn: @crossoverscattered
SelectionFcn: @selectionstochunif
MutationFcn: @mutationgaussian
Significant_Masses =
1.0e+03 *
7.6861
7.9234
8.9834
8.6171
7.1808
7.3057
8.1132
8.5241
7.0527
7.7600
7.7442
7.7245
ans =
To visualize which features have been selected by the genetic algorithm, the data is plotted with peak
positions marked with red vertical lines.
Observe the interesting peak around 8100 Da., which seems to be shifted to the right on healthy
samples.
References
[1] Conrads, T P, V A Fusaro, S Ross, D Johann, V Rajapakse, B A Hitt, S M Steinberg, et al. “High-
Resolution Serum Proteomic Features for Ovarian Cancer Detection.” Endocrine-Related
Cancer, June 2004, 163–78.
[2] Petricoin, Emanuel F, Ali M Ardekani, Ben A Hitt, Peter J Levine, Vincent A Fusaro, Seth M
Steinberg, Gordon B Mills, et al. “Use of Proteomic Patterns in Serum to Identify Ovarian
Cancer.” The Lancet 359, no. 9306 (February 2002): 572–77.
See Also
msnorm
Related Examples
• “Batch Processing of Spectra Using Sequential and Parallel Computing” on page 6-79
• “Identifying Significant Features and Classifying Protein Profiles” on page 6-38
Batch Processing of Spectra Using Sequential and Parallel Computing
This example shows how you can use a single computer, a multicore computer, or a cluster of
computers to preprocess a large set of mass spectrometry signals. Note: Parallel Computing
Toolbox™ and MATLAB® Parallel Server™ are required for the last part of this example.
Introduction
This example shows the required steps to set up a batch operation over a group of mass spectra
contained in one or more directories. You can achieve this sequentially, or in parallel using either a
multicore computer or a cluster of computers. Batch processing adapts to the single-program
multiple-data (SPMD) parallel computing model, and it is best suited for Parallel Computing
Toolbox™ and MATLAB® Parallel Server™.
This example assumes that you have downloaded and uncompressed the data sets into your
repository. Ideally, you should place the data sets in a network drive. If the workers all have access to
the same drives on the network, they can access needed data that reside on these shared resources.
This is the preferred method for sharing data, as it minimizes network traffic.
First, get the name and full path to the data repository. Two strings are defined: the path from the
local computer to the repository and the path required by the cluster computers to access the same
directory. Change both according to your network configuration.
local_repository = 'C:/Examples/MassSpecRepository/OvarianCD_PostQAQC/';
worker_repository = 'L:/Examples/MassSpecRepository/OvarianCD_PostQAQC/';
For this particular example, the files are stored in two subdirectories: 'Normal' and 'Cancer'. You can
create lists containing the files to process using the DIR command.
cancerFiles =
name
folder
date
bytes
isdir
datenum
normalFiles =
name
folder
date
bytes
isdir
datenum
N =
216
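The listing step can be sketched as follows. To keep the sketch runnable anywhere, it builds a small temporary repository with empty files; the directory layout mirrors the 'Cancer' and 'Normal' subdirectories described above, while the file names and the '*.txt' pattern are assumptions made for illustration.

```matlab
% Build a small throwaway repository so the sketch runs anywhere
repo = tempname;
mkdir(fullfile(repo,'Cancer'));
mkdir(fullfile(repo,'Normal'));
for name = {'Cancer/a.txt','Cancer/b.txt','Normal/c.txt'}
    fclose(fopen(fullfile(repo,name{1}),'w'));   % create an empty file
end

% List the files in each subdirectory (the '*.txt' pattern is an assumption)
cancerFiles = dir(fullfile(repo,'Cancer','*.txt'));
normalFiles = dir(fullfile(repo,'Normal','*.txt'));

% Paths relative to the repository, cancer samples first
files = [strcat('Cancer/',{cancerFiles.name}), ...
         strcat('Normal/',{normalFiles.name})]';
N = numel(files)               % total number of files to process

rmdir(repo,'s');               % clean up the temporary directory
```

Storing paths relative to the repository lets the same file list be combined with either the local or the worker repository path.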
Before attempting to process all the files in parallel, you need to test your algorithms locally with a
for loop.
Write a function with the sequential set of instructions that need to be applied to every data set. The
input arguments are the path to the data (depending on how the machine that will actually do the
work sees them) and the list of files to process. The output arguments are the preprocessed signals
and the M/Z vector. Because the M/Z vector is the same for every spectrogram after the
preprocessing, you need to store it only once. For example:
type msbatchprocessing
K = numel(files);
Y = zeros(15000,K); % need to preset the size of Y for memory performance
MZ = zeros(15000,1);
parfor k = 1:K
    % read the two-column text file with mass-charge and intensity values
    fid = fopen(file,'r');
    ftext = textscan(fid, '%f%f');
    fclose(fid);
    signal = ftext{1};
    intensity = ftext{2};
    % the mass/charge vector is the same for all spectra after the resample
    if k==1
        MZ(:,k) = mz;
    end
end
Note the intentional use of PARFOR instead of FOR inside the function MSBATCHPROCESSING. Batch
processing is generally implemented by tasks that are independent between iterations. In such a case,
you can replace the FOR statement with PARFOR without any other change, creating a sequence of MATLAB®
statements (or program) that can run seamlessly on a sequential computer, a multicore computer, or
a cluster of computers, without having to modify it. In this case, the loop executes sequentially
because you have not created a Parallel Pool (assuming that in the Parallel Computing Toolbox™
Preferences the checkbox for automatically creating a Parallel Pool is not checked; otherwise
MATLAB executes the loop in parallel anyway). For the purposes of this example, only 20 spectrograms are
preprocessed and stored in the Y matrix. You can measure the amount of time MATLAB® takes to
complete the loop using the TIC and TOC commands.
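The loop structure that makes this possible can be sketched with synthetic data. In the following minimal example, each iteration writes only its own column of a preallocated matrix, so the iterations are independent and the loop is valid as either FOR or PARFOR. The signal generated inside the loop is a stand-in for the real preprocessing steps, and the sizes are arbitrary.

```matlab
K = 8;                         % number of synthetic "spectra"
n = 101;                       % samples per spectrum
Y = zeros(n,K);                % preallocate; Y is sliced by column below

parfor k = 1:K
    x = linspace(0,1,n)';      % temporary variable, private to each iteration
    % each iteration writes only column k, so iterations are independent
    Y(:,k) = k*exp(-(x-0.5).^2/0.01);
end

peakHeights = max(Y,[],1)      % one peak height per column
```

If no parallel pool is available, MATLAB runs the PARFOR loop like an ordinary FOR loop, so the same code produces the same result on a single machine.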
tic
repository = local_repository;
K = 20; % change to N to do all
[MZ,Y] = msbatchprocessing(repository,files(1:K));
If you have Parallel Computing Toolbox™, you can use local workers to parallelize the loop iterations.
For example, if your local machine has four cores, you can start a Parallel Pool with four workers
using the default 'local' cluster profile:
POOL = parpool('local',4);
tic
repository = local_repository;
K = 20; % change to N to do all
[MZ,Y] = msbatchprocessing(repository,files(1:K));
delete(POOL)
If you have Parallel Computing Toolbox™ and MATLAB® Parallel Server™ you can also distribute the
loop iterations to a larger number of computers. In this example, the cluster profile
'compbio_config_01' links to 6 workers. For information about setting up and selecting parallel
configurations, see "Cluster Profiles and Computation Scaling" in the Parallel Computing Toolbox™
documentation.
Note that if you have written your own batch processing function, you should include it in the
respective cluster profile by using the Cluster Profile Manager. This will ensure that MATLAB®
properly transmits your new function to the workers. You access the Cluster Profile Manager using
the Parallel pull-down menu on the MATLAB® desktop.
POOL = parpool('compbio_config_01',6);
tic
repository = worker_repository;
K = 20; % change to N to do all
[MZ,Y] = msbatchprocessing(repository,files(1:K));
delete(POOL)
The execution schemes described above all operate synchronously, that is, they block the MATLAB®
command line until their execution is completed. If you want to start a batch process job and get
access to the command line while the computations run asynchronously (async), you can manually
distribute the parallel tasks and collect the results later. This example uses the same cluster profile as
before.
Create one job with one task (MSBATCHPROCESSING). The task runs on one of the workers, and its
internal PARFOR loop is distributed among all the available workers in the parallel configuration.
Note that if N (number of spectrograms) is much larger than the number of available workers in your
parallel configuration, Parallel Computing Toolbox™ automatically balances the work load, even if
you have a heterogeneous cluster.
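One way to set up such a job is sketched below. This is an assumption about how the objects JOB and TASK could be created, not necessarily the code used to produce the published results. It requires Parallel Computing Toolbox™ and a valid cluster profile; 'compbio_config_01' is the profile name used in this example, and repository and files are the variables defined earlier.

```matlab
% Sketch only: needs Parallel Computing Toolbox and an existing cluster profile
CLUSTER = parcluster('compbio_config_01');

% A communicating job of type 'pool' runs the task on one worker and makes
% the remaining workers available to the PARFOR loop inside the task
JOB = createCommunicatingJob(CLUSTER,'Type','pool','NumWorkersRange',[6 6]);

% One task: call msbatchprocessing with two outputs (MZ and Y)
TASK = createTask(JOB,@msbatchprocessing,2,{repository,files});
```

The job does not start running until you submit it.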
submit(JOB)
When the job is submitted, your local MATLAB® prompt returns immediately. Your parallel job starts
once the parallel resources become available. Meanwhile, you can monitor your parallel job by
inspecting the TASK or JOB objects. Use the WAIT method to programmatically wait until your task
finishes:
wait(TASK)
TASK.OutputArguments
ans =
MZ = TASK.OutputArguments{1};
Y = TASK.OutputArguments{2};
delete(JOB) % done retrieving the results
disp(sprintf('Parallel time (asynchronous) with 6 remote workers for %d spectrograms: %f seconds'
Parallel time (asynchronous) with 6 remote workers for 216 spectrograms: 68.368132 seconds
Postprocessing
After collecting all the data, you can use it locally. For example, you can apply group normalization:
Y = msnorm(MZ,Y,'QUANTILE',0.5,'LIMITS',[3500 11000],'MAX',50);
Create a grouping vector with the type of each spectrogram, as well as indexing vectors. This
labelling aids further analysis of the data set.
grp = [repmat({'Cancer'},size(cancerFiles));...
repmat({'Normal'},size(normalFiles))];
cancerIdx = find(strcmp(grp,'Cancer'));
numel(cancerIdx) % number of files in the "Cancer" subdirectory
ans =
121
normalIdx = find(strcmp(grp,'Normal'));
numel(normalIdx) % number of files in the "Normal" subdirectory
ans =
95
Once the data is labelled, you can display some spectrograms of each class using a different color
(the first five of each group in this example).
h = plot(MZ,Y(:,cancerIdx(1:5)),'b',MZ,Y(:,normalIdx(1:5)),'r');
axis([7650 8200 -2 50])
xlabel('Mass/Charge (M/Z)');ylabel('Relative Intensity')
legend(h([1 end]),{'Ovarian Cancer','Control'})
title('Region of the pre-processed spectrograms')
Save the preprocessed data set, because it will be used in the examples “Identifying Significant
Features and Classifying Protein Profiles” on page 6-38 and “Genetic Algorithm Search for Features
in Mass Spectrometry Data” on page 6-71.
Disclaimer
TIC - TOC timing is presented here as an example. The sequential and parallel execution time will
vary depending on the hardware you use.
References
[1] Conrads, T P, V A Fusaro, S Ross, D Johann, V Rajapakse, B A Hitt, S M Steinberg, et al. “High-
Resolution Serum Proteomic Features for Ovarian Cancer Detection.” Endocrine-Related
Cancer, June 2004, 163–78.
[2] Petricoin, Emanuel F, Ali M Ardekani, Ben A Hitt, Peter J Levine, Vincent A Fusaro, Seth M
Steinberg, Gordon B Mills, et al. “Use of Proteomic Patterns in Serum to Identify Ovarian
Cancer.” The Lancet 359, no. 9306 (February 2002): 572–77.
See Also
msnorm | msresample | msbackadj | mslowess | msalign