2016 Article 880
Abstract
Background: Recent advances in transcriptome sequencing have enabled the discovery of thousands of long
non-coding RNAs (lncRNAs) across many species. Though several lncRNAs have been shown to play important roles
in diverse biological processes, the functions and mechanisms of most lncRNAs remain unknown. Two significant
obstacles lie between transcriptome sequencing and functional characterization of lncRNAs: identifying truly non-coding
genes from de novo reconstructed transcriptomes, and prioritizing the hundreds of resulting putative lncRNAs
for downstream experimental interrogation.
Results: We present slncky, a lncRNA discovery tool that produces a high-quality set of lncRNAs from RNA-sequencing
data and further uses evolutionary constraint to prioritize lncRNAs that are likely to be functionally important. Our
automated filtering pipeline is comparable to manual curation efforts and more sensitive than previously published
computational approaches. Furthermore, we developed a sensitive alignment pipeline for aligning lncRNA loci and
propose new evolutionary metrics relevant for analyzing sequence and transcript evolution. Our analysis reveals that
evolutionary selection acts in several distinct patterns, and uncovers two notable classes of intergenic lncRNAs: one
showing strong purifying selection on RNA sequence and another where constraint is restricted to the regulation but
not the sequence of the transcript.
Conclusion: Our results highlight that lncRNAs are not a homogenous class of molecules but rather a mixture
of multiple functional classes with distinct biological mechanisms and/or roles. Our novel comparative methods for
lncRNAs reveal 233 constrained lncRNAs out of tens of thousands of currently annotated transcripts, which we make
available through the slncky Evolution Browser.
Keywords: Long non-coding RNAs, Evolution, Comparative genomics, Molecular evolution, Annotation,
LincRNA, RNA-seq, Transcriptome
© 2016 Chen et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Chen et al. Genome Biology (2016) 17:19 Page 2 of 17
Yet, this approach is limited in both sensitivity and specificity: (1) it incorrectly classifies bona fide lncRNAs as
protein-coding simply because they are conserved; and (2) it incorrectly classifies transcripts as lncRNAs when
they are actually extended untranslated regions (UTRs) of coding genes, pseudogenes, or members of
lineage-specific protein-coding gene family expansions, such as zinc finger proteins or olfactory genes. Previous
lncRNA cataloging efforts have addressed these issues by incorporating additional filtering criteria along with
extensive manual curation to define meaningful lncRNA catalogs [12, 13, 15] or by including specialized libraries
that better capture transcript boundaries [14, 16]. While these approaches have proven to be extremely valuable,
they remain extremely labor-intensive and time-consuming, even for experienced users.

To address this challenge, we developed slncky, a method and accessible software package that enables robust and
rapid identification of high-confidence lncRNA catalogs directly from RNA-Seq transcript assemblies without
reliance on evolutionary measures of coding potential. slncky goes through several key steps to accurately
separate lncRNAs from coding genes, pseudogenes, and assembly artifacts, while also identifying novel proteins
including small peptides. This approach yields a high-confidence lncRNA catalog. Indeed, when applied to mouse
embryonic stem cells, slncky accurately identifies virtually all well-characterized lncRNAs and performs as well
as previous manually curated catalogs.

Comparative analysis remains an important approach to assess potential function of a lncRNA without requiring
additional experimental efforts. Despite its importance, identifying conservation of lncRNAs remains a challenge.
To address this need, slncky incorporates a comparative analysis pipeline specially designed for the study of
RNA evolution.

Here we demonstrate the utility of slncky by applying it to a comparative study of the embryonic stem (ES) cell
transcriptome across human, mouse, rat, chimpanzee, and bonobo, and to previously defined datasets consisting of
>700 RNA-Seq experiments across human and mouse. When applying slncky to these datasets, we discover hundreds of
conserved lncRNAs. Furthermore, our metrics for evaluating transcript evolution show that there are clear
evolutionary properties that divide lncRNAs into separate classes that display distinct patterns of selective
pressure. In particular, we identify two notable classes of ‘intergenic’ ancestral lncRNAs (‘lincRNAs’): one
showing strong purifying selection on the RNA sequence and another showing only conservation of the act of
transcription but with little conservation of the transcript produced. These results highlight that lncRNAs are
not a homogenous class of molecules but are likely a mixture of multiple functional classes that may reflect
distinct biological mechanisms and/or roles.

Results and Discussion

slncky: a software package to identify long non-coding RNAs

To develop a simple and accessible method to identify lncRNAs directly from RNA-Seq transcript assemblies, we
created slncky, a method that enables rapid identification of high-confidence lncRNA catalogs directly from an
RNA-Seq dataset.

Determining a set of lncRNAs from reconstructed annotations involves several steps to ensure that transcripts
represent complete transcriptional units and that they are unlikely to encode a protein. Current methods for
defining coding potential rely on codon substitution models, such as PhyloCSF [17] and RNAcode [18], which fail in
three important cases: (1) they often incorrectly classify non-coding RNAs as protein-coding – including TUG1,
MALAT1, and XIST – merely because they are conserved; (2) they fail to identify lineage-specific proteins as
coding; and (3) they erroneously identify non-coding elements (for example, UTR fragments, intronic reads) as
lncRNAs. Rather than using codon substitution models, slncky implements a set of sensitive filtering steps to
exclude fragment assemblies, UTR extensions, gene duplications, and pseudogenes, which are often mischaracterized
as lncRNAs, while retaining bona fide lncRNA transcripts that would otherwise be excluded simply because they
are highly conserved.

To achieve this goal, slncky carries out the following steps (Fig. 1a): (1) slncky removes any transcript that
overlaps (on the same strand) any portion of an annotated protein-coding gene in the same species; (2) slncky
leverages the conservation of coding genes and uses annotations in related species to further exclude
unannotated protein-coding genes, or incomplete transcripts that align to UTR sequences (Methods); and (3) to
remove poorly annotated members of species-specific protein-coding gene expansions, slncky aligns all identified
transcripts to each other and removes any transcript that shares significant homology with another non-coding
transcript (Methods). The result is a filtered set of transcripts that retains conserved, non-coding transcripts
that may score highly for coding potential, while excluding up to approximately 25 % of coding or pseudogenic
transcripts normally identified as lncRNAs by traditional approaches.

After removing reconstructions that are likely gene fragments, pseudogenes, or members of gene family expansions,
slncky searches for novel or previously unannotated coding genes, using a method that is less confounded by
evolutionary conservation than codon substitution models. Specifically, slncky uses a sensitive
[Figure 1. Panel A: slncky filtering pipeline. Step 1: remove transcripts that overlap annotated coding genes (mouse reconstructions). Step 2: remove transcripts that align to syntenic coding genes (mouse reconstructions against human). Step 4: final set of lncRNAs. Plot panels: frequency against log10(RNAcode P-value), with Cyrano, Malat1, and Tug1 labeled. Panels D and E: transcript counts for GENCODE V19, V7, and V12 (categories: gene duplications, coding gene orthologs, lncRNAs), and well-characterized lncRNAs identified by slncky, Necsulea et al., and Washietl et al.]
alignment pipeline to find orthologous transcripts (Methods) and analyzes all possible open reading frames
(ORFs) (that is, sequences containing a start codon and a stop codon and encoding at least 10 amino acids) that
are present in both species. For each ORF, slncky computes the ratio of non-synonymous to synonymous mutations
(dN/dS) and excludes all annotations with a significant dN/dS ratio (Methods). By requiring the presence of a
conserved ORF that is transcribed in multiple species, and by computing the dN/dS ratio across the entire ORF
alignment, slncky is more specific than conventional coding-potential scoring software, which reports all
high-scoring segments within an alignment.

Having developed a method to identify lncRNAs directly from RNA-Seq data, we sought to characterize its
sensitivity and specificity by comparing lncRNAs identified by slncky to the well-studied set of lncRNAs
expressed in mouse embryonic stem (ES) cells [11]. To do this, we generated RNA-Seq libraries from pluripotent
cells obtained from three different mouse strains cultured using previously described growing conditions
[19, 20] and used de novo reconstruction to build transcript models (Methods, Additional file 1: Table S1). We
then applied slncky to define a set of 408 lncRNAs (Methods, Additional file 1: Figure S1). Our analysis also
identified four transcripts – Apela, Tunar, 1500011K16Rik (LINC00116), and BC094334 (LINC00094) – that contain
conserved ORFs with high coding potential (Additional file 1: Figure S2A and 2B).

Several lines of evidence indicate that our identified set represents bona fide lncRNAs: (1) slncky recovered
all of the 20 functionally characterized lncRNAs that are expressed in the pluripotent state (Additional file 2),
demonstrating that our stringent approach is still sensitive; (2) our identified lncRNAs contain chromatin
modifications of active RNA Polymerase II transcription (K4-K36), exhibiting similar levels as our previous ES
catalogs (approximately 70 %) [11, 21]; (3) lncRNAs identified by slncky have significantly lower evolutionary
coding potential scores than protein-coding genes (P = 1.3 × 10⁻⁶, t-test) (Fig. 1b); (4) slncky does not filter
out known conserved lncRNAs, such as Malat1, Tug1, and Miat, that are often excluded due to significant
coding-potential scores (Additional file 1: Figure S2C); and (5) our set of lncRNAs has a significantly lower
ribosome release score (RRS) [22], a measure that accurately predicts coding potential from ribosome profiling
data, than protein-coding genes (73-fold, P < 2.2 × 10⁻¹⁶, t-test) (Fig. 1c).

Together, these results demonstrate that slncky provides a simple and robust strategy for identifying lncRNAs
from a de novo transcriptome. Rather than requiring many user-defined parameters, slncky learns filtering
parameters directly from the data, making it useful across many different species, including non-model organisms
(Methods).

slncky provides greater sensitivity and specificity than previous lncRNA catalogs

To verify the scalability and overall utility of slncky for defining lncRNAs across multiple datasets in
different species, we ran slncky on GENCODE’s latest comprehensive gene annotation set (V19) totaling 189,020
transcripts, of which 16,482 are annotated as lncRNAs that do not overlap a coding gene [15]. GENCODE is an ideal
test case because it represents the current gold standard lncRNA-annotation set, primarily because much of its
content undergoes extensive manual curation. Applying slncky, we identified 14,722 human lncRNA genes.
Importantly, these include >90 % of the lncRNAs identified by GENCODE, with only 136 (0.9 %) annotated human
protein-coding genes and 83 (0.6 %) annotated pseudogenes identified as lncRNAs. Transcripts that are annotated
as lncRNAs by GENCODE but not by slncky include 1,735 (12 %) transcripts that are part of a cluster of duplicated
genes, of which 123 (1 %) aligned to a known zinc finger protein or olfactory gene. An additional 181 (1 %)
transcripts were excluded because they aligned significantly to an orthologous protein-coding gene in mouse
(Fig. 1d).

We then compared our filtering strategy with two previously published large-scale comparative studies that were
based on GENCODE annotations [23, 24]. For the set of lncRNAs defined by Washietl et al. [24], slncky was able to
remove 9.6 % (156) of the annotations that were likely a result of gene duplications and 1.2 % (19) that aligned
significantly to a mouse coding transcript. In contrast, slncky only removed a handful of transcripts (<0.1 %)
from the Necsulea et al. dataset [23]. Importantly,
slncky was much more sensitive as it identified virtually all well-characterized lncRNAs (20/21, Methods)
compared to only 20 % (4/21) by these previous reports (Fig. 1e). Finally, we compared slncky to a recently
published pipeline for filtering reconstructed transcripts from RNA-Seq data, called PLAR (Hezroni et al. [14]).
We found that slncky and PLAR performed comparably in removing coding gene orthologs and gene duplications, but
slncky remained more sensitive in recovering well-characterized transcripts (33/36 recovered by slncky compared
to 27/36 by PLAR) (Additional file 1: Figure S3).

Together, our results highlight the power of slncky for identifying a high-confidence set of lncRNAs by excluding
known artifacts that are often mistaken for lncRNAs. Furthermore, our results demonstrate that slncky performs
as well as manual curation for defining bona fide lncRNAs and can even identify the challenging cases that are
often missed by curation efforts.

slncky enables detailed studies of lncRNA evolution

Having developed a method to define high-quality lncRNAs, we sought to study the evolutionary properties of
lncRNAs. While comparative genomics has provided important insights for studying proteins, enhancers, and
promoters [25–30], relatively little has been done to study the evolution of lncRNAs. One of the main challenges
is that lncRNAs diverge rapidly, accumulating both base nucleotide substitutions and insertion/deletion (indel)
events. Both of these properties render lncRNAs difficult to align with conventional aligners and phylogenetic
approaches.

To enable evolutionary analysis of lncRNAs, we implemented a computationally efficient and sensitive strategy to
align lncRNAs and characterize their sequence and transcript evolution (Fig. 2a, Methods). To this end, slncky
identifies the syntenic genomic region for a lncRNA in the orthologous species. If a transcript exists in a
syntenic region, slncky aligns the two regions using a sensitive seed-based local pairwise aligner [31]. To avoid
the possibility of spurious matches, slncky scores each alignment relative to a set of random intergenic regions
from the orthologous genome and only keeps alignments that score higher than 95 % of the random intergenic
sequences (Methods).

Next, slncky characterizes sequence and transcript conservation properties of orthologous lncRNAs. slncky
calculates four metrics: (1) a ‘transcript-genome identity’ (TGI) score, defined as the percent of lncRNA base
pairs that align and are identical to a syntenic genomic locus, to characterize how well the transcript sequence
is conserved across the two species; (2) a ‘transcript-transcript identity’ (TTI) score, defined as the percent
of identical, aligning base pairs found in the transcribed, exonic regions of both lncRNAs, to characterize how
much of the transcript is transcribed in both species; (3) a ‘splice site conservation’ (SSC) score, defined as
the percent of splice sites that are conserved across both lncRNAs, to characterize conservation of transcript
structure; and (4) an ‘insertion/deletion rate’, defined as the log2 rate of insertion/deletion events in exonic
regions relative to intronic regions, to provide an alternative measure of sequence conservation (Fig. 2a).

We tested the performance of slncky’s orthology-finding step by reanalyzing previous studies of lncRNA
conservation across mammals [24] and vertebrates [14, 16, 23] (Methods). Our approach of aligning the two
syntenic loci rather than just the transcripts increases slncky’s sensitivity with very little drop in
specificity. In mammals, slncky successfully identified the vast majority (>95 %, 1,466/1,521 lncRNAs) of the
previously reported orthologous lncRNAs while also finding an additional 121 pairs (8.0 %) of homologous
human-mouse lncRNAs that were previously reported as species-specific (Methods). Similarly, in vertebrates, a
four-fold greater evolutionary distance, slncky was able to recover 26 of 29 (90 %) of the previously defined
ancestral lncRNAs; the alignments for the remaining three, although found, are indistinguishable from alignments
that can be randomly found across syntenic loci and do not pass our significance threshold (Methods).
Furthermore, slncky identified an additional three pairs of vertebrate-conserved lncRNAs.

Together, these results demonstrate that slncky provides an efficient, sensitive, and accessible method for
detecting and characterizing orthologous lncRNAs across any pair of species, providing an important tool for
studying lncRNA evolution or for prioritizing lncRNAs based on evolutionary conservation.

Evolutionary analysis reveals multiple lncRNA classes characterized by distinct signatures

Initial work by us and others incorporating expression data across species showed that the expression of lncRNAs
is often poorly conserved – with the rate of transcript expression loss occurring faster than loss of its
genomic sequence identity across species [23, 24]. While these results provided important insights into the
evolution of lncRNAs, these analyses did not fully explore the properties of the conserved lncRNAs. Having
developed a method to comprehensively identify and align lncRNAs across species, we sought to further understand
the evolutionary properties of lncRNAs. To do this, we generated RNA-Seq data from ES cells derived from three
mouse strains (129SvEv, NOD, and castaneus), rat, and human (Methods). We added published RNA-Seq data for
chimpanzee and bonobo iPS cells [32] (Additional file 1: Table S1). The gene expression between species shows a
similarly high correlation to that previously observed for matched tissues across
[Figure 2, panels a to c. Panel labels: alignment to genome; sequence identity (%), 0 to 100; gene expression, scale −3 to 1.5; samples rm1, nod, and cast (naïve and primed) and rat (naïve); species mouse, rat, chimp, and human.]
Fig. 2 slncky’s orthology pipeline discovers a small set of pluripotent lncRNAs conserved across mammals. a Schematic of slncky’s orthology
pipeline and metrics for measuring sequence and transcript evolution. b Top: Sequence identity of each lncRNA locus when aligned to the syntenic
region of every other species. In the species of origin, sequence identity is 100 % (red); if no syntenic region exists, sequence identity is set at 0 %
(blue). Bottom: expression level of every lncRNA locus across studied species. Heatmap colors represent globally-scaled log10(FPKM) values with
log10(0) set to −3. log10(FPKM) values were floored at −3 (blue) and capped at 1.5 (red). The majority of lncRNAs are alignable to syntenic regions of other
species but not expressed. c Number of lncRNAs found within each species and at each ancestral node (inferred by parsimony). Substitutions per
100 bp are given for each branch. Conservation of lncRNA transcription falls off dramatically even between the closely related species human
and chimpanzee
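The ancestral-node counts in panel c are described only as "inferred by parsimony". As a rough illustration, the sketch below runs the bottom-up pass of Fitch parsimony on a presence/absence character (is the lncRNA expressed in a species or not). The choice of Fitch parsimony, the four-leaf tree, and the name `fitch_states` are assumptions made for this example; the source does not state which parsimony variant or tree encoding slncky's analysis used.

```python
def fitch_states(tree, expressed):
    """Bottom-up Fitch pass for one presence/absence character.

    `tree` is either a leaf name (str) or a (left, right) tuple of subtrees.
    `expressed` maps each leaf name to True/False.
    Returns a dict mapping every internal node to its set of candidate states.
    Hypothetical helper for illustration; not part of slncky.
    """
    states = {}

    def visit(node):
        if isinstance(node, str):
            return {expressed[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        # Fitch rule: take the intersection if non-empty, otherwise the union.
        states[node] = shared if shared else left | right
        return states[node]

    visit(tree)
    return states


# Illustrative species tree (an assumption, not taken from the paper's figure).
TREE = (("mouse", "rat"), ("human", "chimp"))
```

Under this sketch a lncRNA would be counted at an ancestral node only when that node's state set is exactly `{True}`; a set of `{True, False}` is ambiguous after the bottom-up pass and would need a top-down pass or a tie-breaking rule to resolve.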
species (Additional file 1: Figure S4), highlighting the suitability of this set for comparative analysis.

Applying slncky, we identified 408 mouse, 492 rat, 407 chimpanzee, and 413 human lncRNAs (Additional file 1:
Figure S1, Additional file 3). We found that lncRNAs are generally expressed only in a single species, despite
the fact that most lncRNA loci can be aligned across species (Fig. 2b). In all, we found 73 (18 %) lncRNAs that
are expressed in pluripotent cells across all mammals and were likely present prior to the divergence between
rodents and primates (Fig. 2c, Additional file 4).

Like previous catalogs, our lncRNAs fall into different classes: miRNA host genes, snoRNA host genes, divergently
expressed lncRNAs that are transcribed in the opposite orientation of a coding gene with which they share a
promoter (Methods), and a remaining set of ‘intergenic’ lncRNAs (lincRNAs). Interestingly, we found that these
classes have distinct patterns of sequence and transcript evolution.

These classes exhibit modest, but distinct, differences in transcript-genome identity (TGI), and striking
differences in transcript-transcript identity (TTI) (Fig. 3a). While the loci of miRNA host genes can readily be
aligned between species (that is, have similar TGI), their transcript structures have diverged tremendously,
with 8.5 % median TTI across human and mouse. lncRNAs divergently transcribed within 500 base pairs of a coding
gene have also diverged rapidly in TTI, except for sequence transcribed near the promoter. For these genes, TTI
is generally confined to the first exon. snoRNA host transcripts are very well conserved in both sequence and
transcript structure, though we find an excess of indel
Fig. 3 Metrics of sequence and transcript evolution reveal distinct classes of lncRNAs. a Left: Schematic representing alignment signatures found
for miRNA host, divergent, snoRNA host, and intergenic lncRNAs. Alignments of identical base pairs transcribed in both species (that is, transcript-transcript
identity) are shown in light red, while alignments of identical base pairs transcribed only in the top species (that is, transcript-genome identity) are shown in light
blue. Right: Median transcript-transcript identity (TTI) (dotted lines) and transcript-genome identity (TGI) (solid lines) from mouse-human alignments of the first three
exons of miRNA host (orange), divergent (blue), snoRNA host (purple), and intergenic (green) lncRNAs. Each class of lncRNAs displays distinct patterns of
TTI. b Boxplots of TGI and TTI, barplot of splice site conservation, and boxplot of insertion/deletion rate (IDR). c Number of lncRNAs in each class in mouse
(left), human (middle), and conserved across all studied species (right). Each lncRNA class has its own turnover rate, with miRNA and snoRNA host
genes highly conserved in transcription across mammals, and divergent and intergenic lncRNAs evolving much faster.
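The distinction between the two identity measures in panel a can be made concrete with a small sketch. It assumes the pairwise alignment is available as a list of columns, each recording the two aligned bases and whether each position is exonic in its own species, and it assumes both scores are normalized by the number of exonic bases in the query transcript. The column layout, the normalization, and the name `identity_scores` are illustrative assumptions, not slncky's actual implementation.

```python
def identity_scores(columns):
    """Return (TGI, TTI) as percentages for one query transcript.

    Each column is (query_base, target_base, query_exonic, target_exonic),
    with "-" marking an alignment gap. Hypothetical layout for illustration.
    """
    # Only transcribed (exonic) query bases count toward either score.
    exonic = [c for c in columns if c[2] and c[0] != "-"]
    if not exonic:
        return 0.0, 0.0
    # A gap in the target can never equal a query base, so gaps count as mismatches.
    identical = [c for c in exonic if c[0].upper() == c[1].upper()]
    # TGI: identical bases regardless of whether the target position is transcribed.
    tgi = 100.0 * len(identical) / len(exonic)
    # TTI: identical bases that are also exonic in the target species.
    tti = 100.0 * sum(1 for c in identical if c[3]) / len(exonic)
    return tgi, tti
```

By construction TTI can never exceed TGI for the same transcript, which matches the pattern described for miRNA host genes: loci that align well to the genome but share little transcribed sequence.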
events in exons (1.2-fold more) as compared to introns (Fig. 3b). Finally, intergenic lncRNAs (lincRNAs) also
have conserved transcript structure but a 1.5-fold reduction in exonic indel events compared to snoRNA hosts
(Fig. 3b), despite comparable intronic indel rates (Additional file 1: Figure S5), suggesting that they undergo
different selective pressure than host genes. Most of the pluripotent-expressed, well-characterized lncRNAs are
found in this class of lincRNAs, which displays high TTI and splice site conservation (SSC). Two notable
exceptions to the class of lincRNAs are FIRRE and TSIX, which have very poor TTI (5 % and 0.1 %, respectively).
Both lincRNAs have been previously reported as ‘conserved in synteny’ only [14, 33], possibly indicating that
they may belong to a different class of lincRNAs. In addition to distinct differences in conservation of
transcript structure, we found that the turnover of transcription differs across lncRNA classes: the majority of
miRNA host and snoRNA host genes show conserved transcription across mammals (95 % and 87 %, respectively),
whereas only a small percentage of divergent and intergenic genes show conserved transcription (22 % and 7 %,
respectively, Fig. 3c).

We note that some lncRNAs have been proposed to have dual functions and our evolutionary metrics allow us to
further explore this possibility. For example, GAS5 is a known snoRNA host gene and has also been
reported to function as an RNA gene [34]. Interestingly, we found that GAS5 has the typical signature of a snoRNA
host, with higher indel rates at exons relative to its intronic regions (1.4-fold higher) (Fig. 3b, Additional
file 4), suggesting that GAS5, if truly functional as a non-coding gene, likely acts through a different
mechanism than other intergenic lncRNAs.

We further note that these distinct signatures of evolution are robust enough to identify incorrectly annotated
transcripts. For example, based on current annotations, LINC-PINT is an ‘intergenic’ lncRNA as the closest
annotated coding gene, MKLN1, begins approximately 184 kb downstream [35]. However, its transcriptional
conservation pattern is typical of a divergent transcript, with transcriptional identity confined only to its
first exon. Closer inspection of expression data from our own and other tissues [36] revealed that, in fact, an
unannotated, alternative transcriptional start site of MKLN1 begins less than 200 base pairs downstream,
consistent with LINC-PINT’s divergent alignment profile (Additional file 1: Figure S6).

We next sought to extend our evolutionary analysis to larger catalogs of mouse and human lncRNAs
[15, 23, 24, 37]. Altogether, we searched for candidate orthologs across 251,786 human and 25,335 mouse
transcripts corresponding to 56,280 and 15,508 unique lncRNA loci (Fig. 4a) using default parameters of slncky.
miRNA hosts, divergent lncRNAs, and snoRNA host genes show the same distinct evolutionary patterns that we
observed in pluripotent cells (Fig. 4b and c). Additionally, we found that miRNA hosts that harbor miRNAs inside
exonic regions (for example, H19 [38]) show a distinct conservation pattern reminiscent of lincRNAs (high TTI
and SSC), but without indel-constrained exons (Additional file 1: Figure S7), consistent with the functional
importance of their exonic sequence.

In contrast to our previous analysis in matched pluripotent cells, we found that the majority of the 1,861
candidate orthologous intergenic lncRNAs identified from syntenic locations in human and mouse have low TTI
(<30 %) and no conserved splice sites (approximately 61 %). Several lines of evidence suggest that the majority
of these poorly aligning pairs may not be true orthologs but instead may be transcripts at syntenic loci in
different cell types or transcriptional noise. First, applying our orthology-finding pipeline to randomly
shuffled transcripts resulted in a similar proportion of syntenic transcripts with low TTI and zero conserved
splice sites (Fig. 4d). Second, though poor alignment metrics could be the result of incomplete reconstructions
of lowly expressed lincRNAs, when we performed a similar analysis on an FPKM-matched set of reconstructed coding
transcripts, orthologous pairs had both high TTI and high SSC (Additional file 1: Figure S8A). Third,
incorporating human and mouse expression data and limiting the orthology search to only lncRNAs expressed in
matched tissues drastically reduced the number of poorly aligning lncRNAs (Additional file 1: Figure S8B).

Taken together, we conclude that the majority of syntenic pairs we find are unrelated transcripts that have been
annotated independently in human and mouse, perhaps in very different cell types, and which have no ancestral
relationship. It is notable, however, that we found 39 pairs of human-mouse candidate orthologs that have low
TTI, yet have at least one conserved splice site. This is surprising, because under the null hypothesis that
this set of orthologs occupies syntenic loci mostly by chance, we expect no pairs of orthologs to have an
orthologous (conserved) donor/acceptor site (Methods). These 39 transcripts are reminiscent of lincRNA FIRRE,
which has similarly low TTI but has one conserved splice site (out of 12). The fact that a set of lincRNAs is
likely ancestral but has exonic sequence that has diverged rapidly points to a different class of lincRNAs with
very low purifying selective pressure on most of the transcribed bases.

To investigate whether there are (at least) two distinct classes of lincRNAs, we first sought to reduce the
number of possible spurious lincRNA orthologous pairs by either requiring transcript-transcript identity >60 %,
which controls the false discovery rate at 10 % (Additional file 1: Figure S8C), or by requiring at least one
conserved splice site. We excluded the eight intergenic transcripts that contain a conserved ORF between human
and mouse with a significant dN/dS ratio and significant coding potential score because they may encode small
proteins (Additional file 1: Table S2). Using these criteria, we found 232 pairs of human-mouse lincRNA orthologs
with a conservation profile similar to that found in the pluripotent analysis (Additional file 1: Figure S9),
but with a bimodal TTI distribution (Fig. 4e). Modeling the TTI distribution as two Gaussians, we find 186
(80.1 %) lincRNAs with high TTI (mean 65.5 % ± 7.1 %) and 46 (19.8 %) with low TTI (mean 15.6 % ± 11.7 %). This
further suggests that selection may operate in two distinct ways: for the majority of lincRNAs, it acts on the
full RNA transcript, preserving the transcript sequence, while for a small subset of lincRNAs, the lincRNA
sequence may be under positive selection, or perhaps only the act of transcription may be under selective
constraint. With the goal of aiding in the study of these human-mouse conserved lincRNAs, we built an easily
accessible application available at https://scripts.mit.edu/~jjenny as a resource for visually exploring the
alignment and conservation properties of these lincRNAs.

Finally, we sought to understand properties of lincRNAs that explain their conservation or rapid turnover by
investigating promoter conservation (Fig. 5). Within our pluripotent-expressed lincRNAs (Fig. 5a), we found that
mammalian-conserved lincRNA promoters have
[Figure 4, panels a to e. Axis labels: frequency; identity (%) by exon number (1 to 3); counts; transcript-genome identity; transcript-transcript identity. Source label: UCSC, 7,859 (1.5 %) and 6,074 (2.6 %). Example loci: LOC728743, AK022898.]
Fig. 4 Combined catalog search of lncRNA orthologs recapitulates distinct lncRNA classes. a Existing lncRNA catalogs were combined for large-scale
search of lncRNA orthologs between human (left) and mouse (right). Barplot shows number of transcripts contributed from each source. b Mean
transcript-transcript (TTI) (solid line) and transcript-genome identity (TGI) (dotted line) across first three exons of miRNA host (orange), divergent (blue),
and snoRNA host (purple) genes, recapitulating signatures from smaller search of lncRNA orthologs expressed in pluripotent cells. Error bars represent
standard error of the mean. c Boxplots of TGI and TTI, barplot of splice site conservation (SSC), and boxplot of indel rate (IDR) of each lncRNA class.
Insertion/deletion rates were not calculated for miRNA host lncRNAs because too few exonic segments aligned to estimate the rate accurately.
A two-sample t-test was used to test for significance in all figures, except that a one-sample t-test was used to test whether the mean indel rate deviates significantly
from 0. ** denotes P <0.01 and * denotes P <0.05. d Histograms of TTI (left) and SSC (right) of candidate intergenic lncRNA (lincRNA) orthologs (solid bars)
from the combined search compared to shuffled transcripts (hashed bars). Even among alignments of shuffled transcripts, we found many poorly aligning
orthologs, suggesting that such hits are artifacts of the large number of initial transcripts. e Binned scatterplot of TTI and SSC of filtered lincRNA
orthologs (TTI >60 % or SSC >0), enriched for true ancestral orthologs. Distribution of filtered TTI is shown on top, along x-axis. Overlaid
on scatterplot are data points for lncRNA orthologs found in analysis of pluripotent cells
conservation scores comparable to protein coding genes, consistent with previous reports [11, 12], while species-specific lincRNA promoters are indistinguishable from neutral evolution of random intergenic genomic sequence. Conservation also extends to the promoter structure, as we found clear enrichment for CpG islands in conserved lincRNAs, despite comparable CG content (approximately 48 %) to that of species-specific lincRNA promoters, further suggesting strong selection on their transcriptional control. In contrast, we found that conservation is negatively correlated with repeat content in lincRNA promoters, and that a significant fraction (30.6 %, P = 1.65 × 10⁻³, Fisher’s exact test) of species-specific lincRNA promoters contain a species-specific endoretroviral K (ERVK) repeat element that appears to be driving transcription. This repeat element is enriched only in promoters of lincRNAs expressed in pluripotent and testis cells (Additional file 1: Table S3), consistent with previous observations that repeat elements are transcribed in ES and germline tissues and silenced in differentiated tissues. We observe that for 60.7 % of rodent-specific lincRNAs (that is, mouse or mouse and rat expressed lincRNAs), the time of ERVK integration on the evolutionary tree corresponds exactly with the evolutionary pattern of lincRNA transcription, providing strong evidence that the ERVK element is a primary driver for the origin of the lincRNA. We found corroborating trends of promoter conservation when examining the larger set of lincRNAs from our combined set of annotations (Fig. 5b). Importantly, we found no statistical difference in promoter conservation between high and low TTI lincRNA orthologs, suggesting selection for transcription even with poorly aligning orthologs.
Together, these results highlight the power of evolutionary analysis to identify distinct functional classes of lncRNAs and to reveal distinct features of these classes. In particular, we found 232 intergenic lncRNAs that
[Figure 5 image omitted. Panels: A, pluripotent-expressed lincRNA promoters; B, combined lincRNA promoters. Recoverable axis labels: Conservation (SiPhy); Repeat content (%). Category labels: 129SvEv, cast, Mouse, Rat, Human, Coding; Mouse, Human (low TTI), Human (high TTI).]
Fig. 5 Conserved lncRNA promoters display strong selection for transcriptional control. a In each plot, each bar from left to right represents
lncRNAs from pluripotent analysis that increase in conservation: 129SvEv-specific, cast-specific, expressed across all mouse subspecies, expressed in
mouse and rat, expressed in all mammalian species, and finally, expressed coding genes. Top: Promoter conservation in SiPhy scores (0 represents neutral
evolution). Middle: Percent of promoters harboring CpG island. Bottom: Percent of promoter base pairs that belong to repeat element. b Same promoter
metrics as A for mouse-specific and human-mouse conserved lncRNAs from combined lncRNA catalogs, and coding genes. Human-mouse conserved
orthologs are split between those with low TTI and high TTI. *** denotes P <0.001; ** denotes P <0.01; * denotes P <0.05 (t-test)
appear to be under selective constraint and may play important roles in biology. We note that the majority of lncRNAs appear to be species-specific, raising questions about whether most of these transcripts are simply byproducts of transcription, with no important biological function. Alternatively, these lncRNA functions may be highly redundant or easily replaceable, in which case evolutionary turnover could be explained by a stochastic evolutionary process where redundant lincRNAs are fixed randomly along the evolutionary tree.

Conclusion
While interest in lncRNAs has exploded, there is still relatively little known about the functions of lncRNAs and much skepticism about what this large number of transcripts means. The main challenge is that the number of functionally characterized lncRNAs remains a tiny fraction of the total number of lncRNAs that have been annotated. The significant effort required for functional characterization of a single lncRNA compared to its annotation has impeded the functional characterization of the large catalogs of lncRNAs. Accordingly, liberal cataloging efforts have led to a plethora of transcripts defined as lncRNAs that are rarely transcribed or artifacts of transcript assembly, thereby preventing experimental progress. slncky provides an important and conservative approach for defining lncRNAs that enriches for bona fide lncRNAs. While slncky will not necessarily capture every
single lncRNA nor will it provide the longest list of possible lncRNAs, it provides a method to define high confidence annotation of lncRNAs from any RNA-Seq dataset. This approach will enable meaningful experimental characterization of lncRNAs, making it easier to reconcile the large numbers of defined lncRNAs with the functional roles of these lncRNAs, and providing a consistent standard for evaluating bona fide lncRNAs.
Evolutionary conservation has long been a confusing feature of lncRNAs. While it is clear that lncRNAs are enriched for conserved sequences, their high levels of sequence divergence make them a challenge to study. While most lncRNAs do not appear to be conserved across mammals, it is currently unclear whether these lineage-specific lncRNAs play important roles in lineage-specific biology. It is possible that many lncRNAs have ‘functional orthologs’: genes with similar function but no ancestral relationship. Importantly, evidence of functional orthology was recently reported for XIST. Although XIST is not found in marsupials, an opossum lncRNA called RSX was shown to have similar function. While RSX is capable of silencing the X chromosome in mouse, it shares no ancestral relationship with XIST [39]. We note that functional orthology cannot be studied with the methods presented here and future work will be needed to explore how many lncRNAs might play such lineage-specific roles or to what extent non-homologous lncRNAs carry similar function.
We demonstrated that lncRNAs can be categorized into distinct sets based on their evolutionary properties. Most notably, we found two sets of conserved intergenic lncRNAs: one that shows signs of purifying selection at the sequence level, and one that shows selection only for transcription. It will be fascinating to determine whether these two sets of lincRNAs also correlate with functional differences. While we defined classes based on conservation, there are likely many other classes of lncRNAs that cannot be defined by conservation alone. We anticipate that as more cell types and tissues are explored, these annotation and evolutionary approaches will be even more valuable and enable more detailed studies of lncRNA biology.

Next, slncky searches for gene duplication events (for example, zinc finger protein or olfactory gene expansions) by aligning each transcript to every other putative lncRNA transcript using lastz with default parameters [31]. slncky then aligns each transcript to shuffled intergenic regions to find a null distribution of alignment scores, repeating this procedure 200 times in order to estimate an empirical P-value. Any alignment with a P-value lower than 0.05 is considered significant. Sets of putative lncRNA transcripts that share significant homology are then merged, creating larger “duplication clusters”. These transcripts do not necessarily share similarity to a protein-coding gene, though slncky will check and report homology to known ZFPs and olfactory genes. With its default parameters, which we used in all analyses reported (--min_cluster_size 2), slncky notes and removes any duplication cluster containing two or more transcripts.
Finally, slncky removes any transcript that aligns to a syntenic coding gene in another species. (Human and mouse annotations are provided, though users can define their own.) First, slncky learns a positive distribution by aligning all the transcripts removed in the first filtering step, which we know overlap coding genes, to their syntenic coding gene. slncky builds an empirical positive score distribution from these alignments. To align genes, slncky first uses liftOver (--minMatch = 0.1) [40] to determine the syntenic loci in the comparing genome and lastz [31] to perform the alignment across the syntenic region. Using the empirical distribution, slncky learns an exonic identity threshold that has an empirical P value of 0.05. slncky repeats the alignment procedure on the putative lncRNAs to syntenic coding genes and filters out any transcripts that align at a higher score than this threshold, even if alignments occur only in UTR or intronic regions. In this way, slncky removes unannotated coding genes and pseudogenes, as well as UTR or intronic fragments from incomplete transcript assemblies. To reduce computational cost, whenever more than 250 coding-overlapping genes were filtered out from the first step, only a random subset of 250 transcripts is used to build the positive distribution.
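The two empirical cutoffs used in these filtering steps can be illustrated with a short sketch. This is not slncky's source code: the function names, the add-one pseudocount in the P value, and the reading of the "empirical P value of 0.05" threshold as the 5th percentile of the positive score distribution are all assumptions made for illustration.

```python
import random


def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as the observed score.

    The add-one pseudocount is an assumption of this sketch; it keeps the
    estimate above zero when only a few hundred shuffles are available.
    """
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


def learn_identity_threshold(positive_scores, alpha=0.05, max_n=250, seed=0):
    """Identity value below which a fraction `alpha` of positive scores fall.

    If more than `max_n` scores are supplied, a random subset of `max_n`
    is used, mirroring the 250-transcript cap described in the text.
    """
    scores = list(positive_scores)
    if not scores:
        raise ValueError("need at least one positive score")
    if len(scores) > max_n:
        scores = random.Random(seed).sample(scores, max_n)
    scores.sort()
    return scores[int(alpha * len(scores))]


def looks_coding(identity, threshold):
    """A putative lncRNA is removed when it aligns above the threshold."""
    return identity > threshold


# Toy data (hypothetical): 200 null scores from shuffled intergenic regions.
null = list(range(200))
p = empirical_p_value(195, null)  # five null scores are >= 195
threshold = learn_identity_threshold(range(100))
```

With 200 shuffles and the pseudocount, the smallest attainable P value in this sketch is 1/201, which is comfortably below the 0.05 cutoff quoted above.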
(dN/dS ratio). We calculated an empirical P value for each dN/dS ratio by aligning 50,000 random intergenic regions and repeating the ORF finding procedure. Because the distribution of dN/dS ratio is dependent on ORF length (Additional file 1: Figure S2), we binned ORF lengths by 5 base pair windows and assigned an empirical P value if we had at least 100 random ORFs within that bin. P values were corrected for multiple hypothesis testing. For long ORFs, for which fewer than 100 length-matched random ORFs existed, we kept all alignments with dN/dS ratios <1.

A sensitive method for aligning orthologous lncRNAs
In searching for conserved lncRNA orthologs, slncky first defines the syntenic region of the comparing genome with liftOver (-minMatch = 0.1 -multiple = Y) [40]. If a non-coding transcript exists in the syntenic region, slncky then aligns the area 150,000 base pairs upstream to 150,000 base pairs downstream of the two syntenic regions. We chose 150,000 base pairs as a general heuristic that is likely to include an easily alignable coding transcript up- and downstream of the lncRNA, which helps lastz to find a positively scoring alignment. Importantly, we also found that lncRNAs could only be aligned with a reduced gap-open penalty (--gap = 25,040) because of many small insertions that appear to be well-tolerated by lncRNA transcripts.
To ensure we are not reporting alignments that may occur at random (driven mostly by repetitive elements), we align each lncRNA to shuffled intergenic regions to establish a null distribution and determine the empirical 5 % threshold for determining significant alignment scores. Because of our inclusion of flanking regions, it is possible to have a significant alignment in which only the flanking regions align but not the lncRNA transcripts. slncky reports these transcripts since it is possible that they are ‘syntologs’ and carry out orthologous functions but have evolved to a point where they no longer align.

Data collection
Pluripotent cell lines and growth conditions
Naïve 2i/LIF media for mouse and rat (rodent) naïve pluripotent cells was assembled as follows: 500 mL of N2B27 media was generated by including: 240 mL DMEM/F12 (Biological Industries – custom-made), 240 mL Neurobasal (Invitrogen; 21103), 5 mL N2 supplement (Invitrogen; 17502048), 5 mL B27 supplement (Invitrogen; 17504044), 1 mM glutamine (Invitrogen), 1 % non-essential amino acids (Invitrogen), 0.1 mM β-mercaptoethanol (Sigma), penicillin-streptomycin (Invitrogen), and 5 mg/mL BSA (Sigma). Naïve conditions for murine embryonic stem cells (ESCs) included 10 μg recombinant human LIF (Peprotech) and small-molecule inhibitors CHIR99021 (CH, 1 μM - Axon Medchem) and PD0325901 (PD, 0.75 μM - TOCRIS), referred to as naïve 2i/LIF conditions. Naïve rodent cells were expanded on fibronectin coated plates (Sigma Aldrich). Primed (EpiSC) N2B27 media for murine and rat cells (EpiSCs) contained 8 ng/mL recombinant human bFGF (Peprotech Asia), 20 ng/mL recombinant human Activin (Peprotech), and 1 % Knockout serum replacement (KSR - Invitrogen). Primed rodent cells were expanded on matrigel (BD Biosciences).
The 129SvEv (Taconic Farms) male primed epiblast stem cell (EpiSC) line was derived from E6.5 embryos as previously described [41]. 129SvEv naïve ESCs were derived from E3.5 blastocysts. NOD naïve ESC and primed EpiSC lines were previously derived from embryos and described in [42]. The castaneous ESC line was derived from E3.5 in naïve 2i/LIF conditions and rendered into a primed cell line by passaging over eight times into primed conditions [43, 44]. Rat naïve iPSC lines were previously described in [44]. Briefly, rat tail tip derived fibroblasts were infected with a DOX inducible STEMCA-OKSM lentiviral reprogramming vector and M2rtTa lentivirus in 2i/LIF conditions. Established cell lines were maintained on irradiated MEF cells in 2i/LIF independent of DOX. Simultaneously, primed rat pluripotent cells were generated by transferring the rat naïve iPSC cells into primed EpiSC medium for more than eight passages before analysis was conducted. Naïve human C1 iPSC lines were derived and expanded on irradiated DR4 feeder cells as previously described [19].

RNA-Sequencing
RNA-Seq libraries were prepared as described in [45]. Briefly, 10 μg of total RNA was polyA selected twice using Oligo(dT)25 beads (Life Technologies) and NEB oligo(dT) binding buffer. PolyA-selected RNA was fragmented, repaired, and cleaned using Zymo RNA concentrator-5 kit. A total of 30 ng of polyA-selected RNA per sample was used to make RNA-Seq libraries. An adapter was ligated to RNA, RNA was reverse transcribed, and a second adapter was ligated on cDNA. Illumina indexes were introduced during nine cycles of PCR using NEB Q5 Master Mix. Samples were sequenced 100-index-100 on HiSeq2500.

Filtering
Filtering pluripotent lncRNAs from four mammalian species
Transcripts were reconstructed from RNA-Sequencing data using Scripture (v3.1, --coverage = 0.2) [11] and multi-exonic transcripts were filtered using slncky with default parameters. Annotations of coding genes were downloaded from UCSC (‘coding’ genes from track UCSC Genes, table kgTxInfo) [46] and RefSeq [47]. Mapped coding genes were downloaded from UCSC Transmap database (track UCSC Genes, table transMapAlnUcscGenes) [46]. For the mouse genome, we also included any blat-aligned human coding gene (track UCSC Genes, table blastHg18KG) [46]. As expected, the majority of reconstructed transcripts overlapped an
annotated coding or mapped coding gene at >95 % (Additional file 1: Figure S2). In the next step, slncky aligned each putative lncRNA to every other putative lncRNA to detect duplications of species-specific gene families. Across mouse, rat, and human transcriptomes, we found large clusters (15+ genes) of transcripts sharing significant sequence similarity with each other that also aligned to either zinc finger proteins or olfactory proteins. For unclear reasons, but likely due to the draft status of the assembly, which results in collapsed repetitive sequence, we did not find any large clusters of duplicated genes in the chimpanzee genome, and instead found five small clusters of paralogs (Additional file 1: Figure S1). Finally, slncky aligned the remaining transcripts to syntenic coding genes. For mouse and chimp transcripts, we aligned to syntenic human coding genes and for rat and human transcripts, we aligned to syntenic mouse coding genes. The learned transcript similarity threshold for each pair of comparing species varied as a function of distance between species: the empirical threshold for calling a significant human-chimp alignment was 29.8 % sequence similarity while for human-mouse alignments it was approximately 14 % (Additional file 1: Figure S1).

multiple sequence alignments of 29 vertebrate genomes from the mouse perspective [29].

Ribosome release scores
Ribosome profiling data of mouse ES cells (E14) were downloaded from [51] (GSE30839). Ribosome release scores (RRS) were calculated as described in [22] using the RRS Program provided by the Guttman Lab.

Functionally characterized lncRNAs
To test the sensitivity of lncRNA filtering pipelines, we derived a list of well-characterized lncRNAs. To do this, we first took the intersection of annotated non-coding transcripts from UCSC [46], RefSeq [47], and GENCODE [52]. We then removed any lncRNA with a generically assigned name (for example, LINC00028 or LOC728716) as well as generically named snoRNA and miRNA host genes (for example, SNHG8 or MIR4697HG). Finally, we performed a literature search on the remaining lncRNAs, and kept only those that were specifically experimentally interrogated rather than reported from a large-scale screen. This list of well-characterized lncRNAs is available in Additional file 2.
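The automated part of this curation (intersecting the three catalogs and dropping generically named genes) might look like the sketch below. The regular expression is an assumption built only from the four example names quoted in the text, intersecting by gene symbol is likewise an assumption, and the function names are hypothetical. The final literature search is manual and is not modeled.

```python
import re

# Hypothetical patterns derived from the examples in the text
# (LINC00028, LOC728716, SNHG8, MIR4697HG); real catalogs may need more.
GENERIC_NAME = re.compile(r"^(LINC\d+|LOC\d+|SNHG\d+|MIR[A-Z0-9-]+HG)$")


def has_generic_name(symbol):
    """True if the gene symbol looks like an automatically assigned name."""
    return bool(GENERIC_NAME.match(symbol.upper()))


def named_candidates(ucsc, refseq, gencode):
    """Symbols present in all three catalogs that carry a non-generic name."""
    shared = set(ucsc) & set(refseq) & set(gencode)
    return sorted(s for s in shared if not has_generic_name(s))


# Toy catalogs (hypothetical contents, for illustration only).
candidates = named_candidates(
    ["XIST", "LINC00028", "H19"],
    ["XIST", "LINC00028", "SNHG8"],
    ["XIST", "LINC00028", "H19"],
)
```

The surviving symbols would then be checked by hand against the literature, as described above.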
(--minMatch 0.01, --pad 500000) to search for human-zebrafish and mouse-zebrafish lncRNA orthologs. Note that in both analyses, lncRNA annotations were not filtered by slncky’s filtering pipeline prior to the ortholog search so that our results could be directly comparable with the original publication.

Annotating orthologous lncRNAs in pluripotent mammalian cells
We applied slncky to our pluripotent RNA-Seq data to conduct an evolutionary analysis of lncRNAs across multiple mammalian species. We first searched for orthologous lncRNAs in a pairwise manner between every possible pair of species. Because the reconstruction software we used does not report lowly expressed transcripts that do not pass a significance threshold, and because we removed single-exon transcripts in our filtering step, we devised a method to rescue orthologous transcripts that may have been removed in those steps. For each lncRNA, if no orthologous lncRNA was detected by slncky, we went back to the original RNA-Seq data and forced reconstruction of lowly-expressed and/or single-exon transcripts in the syntenic region. We then re-aligned the lncRNA with these newly reconstructed transcripts and added the transcript to our lncRNA set when a significant alignment was found. We kept only pairs of conserved lncRNAs where a significant alignment was found in both reciprocal searches (for example, mouse-to-human and human-to-mouse).
Next, given pairs of lncRNA orthologs across all species, we created ortholog groups by greedily linking ortholog pairs. For example, given pairs {A,B} and {B,C}, we assigned {A,B,C} to one orthologous group, even if pairing {A,C} did not exist. Finally, we used Fitch’s algorithm [55] to recursively reconstruct the most parsimonious presence/absence phylogenetic tree for each lncRNA and determine the last common ancestor (LCA) in which each lncRNA appeared. In the event a single LCA could not be determined by parsimony, we chose the most recent ancestor as the LCA in order to have conservative conservation estimates. For example, if a lncRNA was found in mouse and rat, but missing in human and chimp, we assigned the LCA to be at the rodent root, rather than at the mammalian root with a loss event at primates.

Annotating matched low expression coding genes
We tested our ability to detect conservation of lowly expressed transcripts by using our pipeline to reconstruct lowly-expressed coding genes known to be conserved across our tested species. We binned the set of intergenic lncRNAs by increments of 0.1 log10(FPKM), and sampled a set of 162 coding genes that matched in log10(FPKM) distribution in mouse ES cells. We then applied slncky’s orthology-finding module to the de novo reconstructions of these coding genes from our generated RNA-Seq data. Repeating the same analysis as described above, we assigned the last common ancestor (LCA) of each coding gene. We were able to correctly assign the human-mouse ancestor as the LCA for 134 of 162 (83 %) coding genes, providing confidence that we are able to sensitively detect orthologs of lncRNAs, even though they are lowly expressed.

Combined catalog analysis
We downloaded human and mouse lncRNA annotations, where they existed, from RefSeq [47, 23], UCSC [46], GENCODE (v19 and vM1) [52, 12], and MiTranscriptome [36]. We filtered lncRNAs and searched for orthologs using slncky with default parameters. For overlapping isoforms that belong to the same gene, we chose one canonical ortholog pair that had the highest number of conserved splice sites and/or highest transcript-transcript identity. miRNA host and snoRNA host genes were annotated using Ensembl annotations of miRNAs and snoRNAs [56]. Divergent genes were annotated based on distance and orientation of the closest UCSC or RefSeq-annotated coding gene. Orthologous lncRNAs were classified as a miRNA host, divergent, or snoRNA host if the transcript was annotated as such in both species. All other lncRNAs were classified as intergenic.
An orthology search was conducted on shuffled transcripts by collapsing overlapping isoforms to a canonical gene as described above, and shuffling to an intergenic location (that is, not overlapping an annotated coding gene) using the shuffleBed utility [57]. We then carried out the orthology search and alignment exactly as described for lncRNAs. To empirically estimate the expected number of conserved splice sites across shuffled orthologs, we took each pair of true lncRNA orthologs and reshuffled splice sites within the loci such that they were correctly located at donor/acceptor sites (GT, AG), and re-evaluated the number of conserved splice sites.
We used distributions resulting from our shuffled orthology search to filter and remove spurious hits from our set of candidate lincRNA orthologs. We then fitted two Gaussians to the resulting transcript-transcript identity using the mixtools package for R and default parameters [58]. Convergence was reached after 31 iterations of EM and the final log-likelihood was 146.64. Each ortholog pair was assigned to a Gaussian based on a posterior probability cutoff of 50 %.

Promoter properties
We defined promoters to be the 500 base pairs upstream of the lincRNA’s transcription start site (TSS). We calculated several genomic properties of this region as follows:
SiPhy
We calculated average SiPhy score across promoter region as previously described [59] using 29 mammals’ alignment from mouse perspective [29].

CpG islands
For the analysis of CpG islands, we used annotations provided by the UCSC Genome Browser (assembly mm9, track CpG Islands, table cpgIslandExt).

Repeat elements
We intersected promoter regions with annotations from RepeatMasker [60] and calculated the number of base pairs of a lincRNA promoter belonging to a repeat element as well as percentage of lincRNA promoters harboring each class of repeat element. We then repeated this analysis with random intergenic regions, matched in size and GC content. To find statistically significant deviations in repeat content, we used Fisher’s exact test to compare the proportion of species-specific lincRNA promoters containing each repeat element to the proportion of random, GC-matched intergenic regions containing the same element. We reported any repeat element that deviated from random, intergenic regions with a P value <0.005 (corrected for number of repeat types we tested).

Data availability
Raw and processed RNA-Seq data are available under GEO accession GSE64818: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64818
A database of conserved lncRNAs discovered in this analysis is available at https://scripts.mit.edu/~jjenny

Software availability
slncky (http://slncky.github.io) was developed in Python 2.0 and is freely available as source code distributed under the MIT License. slncky was tested on Linux and Mac OS X. The version used in this manuscript is available from DOI: 10.5281/zenodo.44628 (https://zenodo.org/badge/latestdoi/19958/slncky/slncky).

Additional files
Additional file 1: Supplementary figures and tables. (PDF 3.85 MB)
Additional file 2: Curated list of "well-characterized lncRNAs". (XLSX 52 kb)
Additional file 3: Bed file of lncRNAs discovered from mouse (mm9), human (hg19), chimp/bonobo (panTro4), and rat (rn5). (XLSX 229 kb)
Additional file 4: Excel file of evolutionary metrics of all lncRNAs found to be conserved to the human/chimp/rat/mouse ancestor. (XLSX 19 kb)

Abbreviations
ES: embryonic stem; ESC: embryonic stem cell; FPKM: fragments per kilobase of transcript per million reads mapped; ORF: open reading frame; RNA-Seq: RNA-Sequencing; RRS: ribosomal release score; lincRNA: long intergenic non-coding RNA; lncRNA: long non-coding RNA; SSC: splice site conservation; TGI: transcript-genome identity; TTI: transcript-transcript identity; UTR: untranslated region.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
JC participated in the design and coordination of the study, carried out all computational analysis and software development of slncky and slncky Evolutionary Browser, and wrote the manuscript. AS carried out RNA-Sequencing. XZ and SK participated in development of supporting software. IM and JH participated in deriving cell lines. M Guttman and AR participated in writing the manuscript. MG conceived of the study, participated in its design and coordination, and wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements
We thank Leslie Gaffney for artwork and advice on figures. JC was supported by an NHGRI training grant and by the Jan and Ruby Krouwer Fellowship Fund. MG was supported by DARPA grants D12AP00004 and D13AP00074. AR and MG were also supported by the CEGS 1P50HG006193. AR is supported by the Howard Hughes Medical Institute. JHH is supported by Ilana and Pascal Mantoux; the New York Stem Cell Foundation and is a New York Stem Cell Foundation - Robertson Investigator. We thank the Garber, Lander, and Regev laboratory members for helpful discussions.

Author details
1Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA. 2Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02140, USA. 3Division of Biology and Biological Engineering, California Institute of Technology, Cambridge, MA 02140, USA. 4Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01655, USA. 5Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 76100, Israel. 6Howard Hughes Medical Institute, Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02140, USA. 7Program in Molecular Biology, University of Massachusetts Medical School, Worcester, MA 01655, USA.

Received: 26 October 2015 Accepted: 14 January 2016

References
1. Ballabio A, Sebastio G, Carrozzo R, Parenti G, Piccirillo A, Persico MG, et al. Deletions of the steroid sulphatase gene in “classical” X-linked ichthyosis and in X-linked ichthyosis associated with Kallmann syndrome. Hum Genet. 1987;77:338–41.
2. Greider CW, Blackburn EH. A telomeric sequence in the RNA of Tetrahymena telomerase required for telomere repeat synthesis. Nature. 1989;337:331–7.
3. Loewer S, Cabili MN, Guttman M, Loh Y-H, Thomas K, Park IH, et al. Large intergenic non-coding RNA-RoR modulates reprogramming of human induced pluripotent stem cells. Nat Genet. 2010;42:1113–7.
4. Carpenter S, Aiello D, Atianand MK, Ricci EP, Gandhi P, Hall LL, et al. A long noncoding RNA mediates both activation and repression of immune response genes. Science. 2013;341:789–92.
5. Willingham AT, Orth AP, Batalov S, Peters EC, Wen BG, Aza-Blanc P, et al. A strategy for probing the function of noncoding RNAs finds a repressor of NFAT. Science. 2005;309:1570–3.
6. Guttman M, Donaghey J, Carey BW, Garber M, Grenier JK, Munson G, et al. lincRNAs act in the circuitry controlling pluripotency and differentiation. Nature. 2011;477:295–300.
7. Flockhart RJ, Webster DE, Qu K, Mascarenhas N, Kovalski J, Kretz M, et al. BRAFV600E remodels the melanocyte transcriptome and induces BANCR to regulate melanoma cell migration. Genome Res. 2012;22:1006–14.
8. Guan Y, Kuo W-L, Stilwell JL, Takano H, Lapuk AV, Fridlyand J, et al. 32. Marchetto MCN, Narvaiza I, Denli AM, Benner C, Lazzarini TA, Nathanson JL, et al.
Amplification of PVT1 contributes to the pathophysiology of ovarian and Differential L1 regulation in pluripotent stem cells of humans and apes.
breast cancer. Clin Cancer Res. 2007;13:5745–55. Nature. 2013;503:525–9.
9. Prensner JR, Iyer MK, Balbin OA, Dhanasekaran SM, Cao Q, Brenner JC, et al. 33. Hacisuleyman E, Goff LA, Trapnell C, Williams A, Henao-Mejia J, Sun L, et al.
Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, Topological organization of multichromosomal regions by the long
an unannotated lincRNA implicated in disease progression. Nat Biotechnol. intergenic noncoding RNA Firre. Nature Publishing Group.
2011;29:742–9. 2014;21:198–206.
10. Ellis BC, Molloy PL, Graham LD. CRNDE: A long non-coding RNA involved in 34. Smith CM, Steitz JA. Classification of gas5 as a multi-small-nucleolar-RNA
CanceR, Neurobiology, and DEvelopment. Front Genet. 2012;3:270. (snoRNA) host gene and a member of the 5’-terminal oligopyrimidine
11. Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, et al. gene family reveals common features of snoRNA host genes. Mol Cell
Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the Biol. 1998;18:6897–909.
conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010;28:503–10. 35. Sauvageau M, Goff LA, Lodato S, Bonev B, Groff AF, Gerhardinger C, et al.
12. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, et al. Multiple knockout mouse models reveal lincRNAs are required for life and
Integrative annotation of human large intergenic noncoding RNAs reveals brain development. Elife. 2013;2, e01749.
global properties and specific subclasses. Genes Dev. 2011;25:1915–27. 36. Merkin J, Russell C, Chen P, Burge CB. Evolutionary dynamics of
13. Pauli A, Valen E, Lin MF, Garber M, Vastenhouw NL, Levin JZ, et al. gene and isoform regulation in mammalian tissues.
Systematic identification of long noncoding RNAs expressed during Science. 2012;338:1593–99.
zebrafish embryogenesis. Genome Res. 2012;22:577–91. 37. Iyer MK, Niknafs YS, Malik R, Singhal U, Sahu A, Hosono Y, et al.
14. Hezroni H, Koppstein D, Schwartz MG, Avrutin A, Bartel DP, Ulitsky I. The landscape of long noncoding RNAs in the human transcriptome.
Principles of long noncoding RNA evolution derived from direct Nat Genet. 2015;47:199–208.
comparison of transcriptomes in 17 species. Cell Rep. 2015;11:1110–22. 38. Brannan CI, Dees EC, Ingram RS, Tilghman SM. The product of the H19 gene
15. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, et al. may function as an RNA. Mol Cell Biol. 1990;10:28–36.
The GENCODE v7 catalog of human long noncoding RNAs: Analysis 39. Grant J, Mahadevaiah SK, Khil P, Sangrithi MN, Royo H, Duckworth J, et al.
of their gene structure, evolution, and expression. Genome Res. Rsx is a metatherian RNA with Xist-like properties in X-chromosome
2012;22:1775–89. inactivation. Nature. 2012;487:254–8.
16. Ulitsky I, Shkumatava A, Jan CH, Sive H, Bartel DP. Conserved function of 40. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, et al.
lincRNAs in vertebrate embryonic development despite rapid sequence The UCSC Genome Browser Database: update 2006. Nucleic Acids Res.
evolution. Cell. 2011;147:1537–50. 2006;34(Database issue):D590–8.
17. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to 41. Tesar PJ, Chenoweth JG, Brook FA, Davies TJ, Evans EP, Mack DL, et al.
distinguish protein coding and non-coding regions. Bioinformatics. New cell lines from mouse epiblast share defining features with human
2011;27:i275–82. embryonic stem cells. Nature. 2007;448:196–9.
18. Washietl S, Findeiss S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, et al. 42. Mikkelsen TS, Hanna J, Zhang X, Ku M, Wernig M, Schorderet P, et al.
RNAcode: robust discrimination of coding and noncoding regions in Dissecting direct reprogramming through integrative genomic analysis.
comparative sequence data. RNA. 2011;17:578–94. Nature. 2008;454:49–55.
19. Hanna J, Cheng AW, Saha K, Kim J, Lengner CJ, Soldner F, et al. Human 43. Guo G, Yang J, Nichols J, Hall JS, Eyres I, Mansfield W, et al. Klf4 reverts
embryonic stem cells with biological and epigenetic characteristics similar developmentally programmed restriction of ground state pluripotency.
to those of mouse ESCs. Proc Natl Acad Sci. 2010;107:9222–7. Development. 2009;136:1063–9.
20. Gafni O, Weinberger L, Mansour AA, Manor YS, Chomsky E, Ben-Yosef D, et al. 44. Hanna J, Markoulaki S, Mitalipova M, Cheng AW, Cassady JP, Staerk J, et al.
Derivation of novel human ground state naive pluripotent stem cells. Metastable pluripotent states in NOD-mouse-derived ESCs. Cell Stem Cell.
Nature. 2013;504:282–6. 2009;4:513–24.
21. Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, et al. 45. Shishkin AA, Giannoukos G, Kucukural A, Ciulla D, Busby M, Surka C, et al.
Chromatin signature reveals over a thousand highly conserved large Simultaneous generation of many RNA-seq libraries in a single reaction. Nat
non-coding RNAs in mammals. Nature. 2009;458:223–7. Methods. 2015;12:323–5.
22. Guttman M, Russell P, Ingolia NT, Weissman JS, Lander ES. Ribosome profiling 46. Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, et al.
provides evidence that large noncoding RNAs do not encode proteins. The UCSC Genome Browser database: 2014 update. Nucleic Acids Res.
Cell. 2013;154:240–51. 2014;42(Database issue):D764–70.
23. Necsulea A, Soumillon M, Warnefors M, Liechti A, Daish T, Zeller U, et al. 47. Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O,
The evolution of lncRNA repertoires and expression patterns in tetrapods. et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids
Nature. 2014;505(7485):635–40. Res. 2014;42(Database issue):D756–63.
24. Washietl S, Kellis M, Garber M. Evolutionary dynamics and tissue 48. Xiao S, Xie D, Cao X, Yu P, Xing X, Chen C-C, et al. Comparative epigenomic
specificity of human long noncoding RNAs in six mammals. annotation of regulatory DNA. Cell. 2012;149:1381–92.
Genome Res. 2014;24(4):616–28. 49. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient
25. Bafna V, Huson DH. The conserved exon method for gene finding. alignment of short DNA sequences to the human genome. Genome Biol.
Proc Int Conf Intell Syst Mol Biol. 2000;8:3–12. 2009;10:R25.
26. Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES. Human and mouse 50. Garber M, Yosef N, Goren A, Raychowdhury R, Thielke A, Guttman M, et al.
gene structure: comparative analysis and application to exon prediction. A high-throughput chromatin immunoprecipitation approach
Genome Res. 2000;10:950–8. reveals principles of dynamic gene regulation in mammals.
27. Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene Mol Cell. 2012;47:810–22.
structure prediction. Bioinformatics. 2001;17 Suppl 1:S140–8. 51. Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic
28. Pachter L, Alexandersson M, Cawley S. Applications of generalized pair stem cells reveals the complexity and dynamics of mammalian proteomes.
hidden Markov models to alignment and gene finding problems. J Comput Cell. 2011;147:789–802.
Biol. 2002;9:389–99. 52. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al.
29. Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, et al. GENCODE: the reference human genome annotation for The ENCODE Project.
A high-resolution map of human evolutionary constraint using 29 Genome Res. 2012;22:1760–74.
mammals. Nature. 2011;478:476–82. 53. Brawand D, Soumillon M, Necsulea A, Julien P, Csárdi G, Harrigan P, et al.
30. Wenger AM, Clarke SL, Guturu H, Chen J, Schaar BT, McLean CY, et al. The evolution of gene expression levels in mammalian organs. Nature.
PRISM offers a comprehensive genomic approach to transcription factor 2011;478:343–8.
function prediction. Genome Res. 2013;23:889–904. 54. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H,
31. Harris RS. Improved Pairwise Alignment of Genomic DNA. Ph.D. Thesis, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression
The Pennsylvania State University; 2007. Retrieved from http://www.bx.psu. analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc
edu/~rsharris/rsharris_phd_thesis_2007.pdf. 2012, 7:562–578.
Chen et al. Genome Biology (2016) 17:19 Page 17 of 17
55. Fitch WM. Toward Defining the Course of Evolution: Minimum Change for a
Specific Tree Topology. Syst Zool. 1971;20:406–16.
56. Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014.
Nucleic Acids Res. 2014;42(Database issue):D749–55.
57. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing
genomic features. Bioinformatics. 2010;26:841–2.
58. Benaglia T, Chauveau D, Hunter D, Young D. mixtools: An r package for
analyzing finite mixture models. J Stat Softw. 2009;32:1–29.
59. Garber M, Guttman M, Clamp M, Zody MC, Friedman N, Xie X.
Identifying novel constrained elements by exploiting biased substitution
patterns. Bioinformatics. 2009;25:i54–62.
60. Smit AFA, Hubley R, Green P: RepeatMasker. Available at: http://www.
repeatmasker.org. [Accessed 9 April 2013].