Identification of Recurring Tumor Specific Somatic Mutations in AML by Transcriptome Seq Greif2011
ORIGINAL ARTICLE
Genetic lesions are crucial for cancer initiation. Recently, whole genome sequencing, using next generation technology, was used as a systematic approach to identify mutations in genomes of various types of tumors including melanoma, lung and breast cancer, as well as acute myeloid leukemia (AML). Here, we identify tumor-specific somatic mutations by sequencing transcriptionally active genes. Mutations were detected by comparing the transcriptome sequence of an AML sample with the corresponding remission sample. Using this approach, we found five non-synonymous mutations specific to the tumor sample. They include a nonsense mutation affecting the RUNX1 gene, which is a known mutational target in AML, and a missense mutation in the putative tumor suppressor gene TLE4, which encodes a RUNX1 interacting protein. Another missense mutation was identified in SHKBP1, which acts downstream of FLT3, a receptor tyrosine kinase mutated in about 30% of AML cases. The frequency of mutations in TLE4 and SHKBP1 in 95 cytogenetically normal AML patients was 2%. Our study demonstrates that whole transcriptome sequencing leads to the rapid detection of recurring point mutations in the coding regions of genes relevant to malignant transformation.
Leukemia (2011) 25, 821–827; doi:10.1038/leu.2011.19; published online 22 February 2011
Keywords: acute myeloid leukemia; point mutations; TLE4; SHKBP1; RUNX1; transcriptome sequencing

Correspondence: Professor SK Bohlander, Department of Medicine III, University of Munich, Marchioninistr 15, Klinikum Grosshadern, Munich, Bavaria 81377, Germany.
E-mail: [email protected]
5 These authors contributed equally to this work.
6 Joint senior and corresponding authors.
Received 17 October 2010; revised 28 November 2010; accepted 10 December 2010; published online 22 February 2011

Introduction
Acute myeloid leukemia (AML) is the most frequent hematological malignancy in adults, with an annual incidence of 3 to 4 cases per 100 000 individuals. Despite the increasing knowledge about the molecular pathology of AML, the prognosis remains poor, with a 5-year survival of only 25–30%.
Chromosomal aberrations in tumor cells are found in approximately half of the AML patients, whereas the other half of the patients have a normal karyotype (cytogenetically normal-AML).1 Even though a growing number of submicroscopic genetic lesions are being identified in AML, about 25% of cytogenetically normal-AML patients do not carry any of the currently known mutations. The list of frequently affected genes includes the receptor tyrosine kinase FLT3, the transcription factor CEBPA, the human trithorax homolog and histone methyltransferase MLL and nucleophosmin (NPM1).2–7
So far, most of the genes that were found mutated in AML were found through a candidate gene approach, because of their involvement in translocations or in hematopoietic differentiation. For example, CEBPA knockout mice show a block in myeloid differentiation, and both MLL and NPM1 were initially found to be involved in fusion genes that resulted from chromosomal translocations in leukemia patients.5–9
With the advent of next generation sequencing technology, the unbiased detection of tumor-specific somatic mutations became possible.10–15 Sequence analysis of an AML genome resulted in the identification of recurring mutations in the gene IDH1, encoding the enzyme isocitrate dehydrogenase 1.11 Metabolite screening of AML samples revealed that the related enzyme IDH2 is another mutational target.16 Despite its technical feasibility, whole genome sequencing is still cost intensive, and therefore several alternative approaches of targeted sequencing have been proposed, like the sequencing of coding regions. Although the size of a diploid human genome is about 6 Gbp, the transcriptome, as defined by the combined length of all mRNAs in a cell, is only 0.6 Gbp in size. This figure is based on the estimate that a cell contains about 300 000 transcripts, with an average length of 2000 bases.17,18 Sequencing of only a few gigabases of the transcriptome should allow mutation detection in a large proportion of transcribed genes. Here we report that sequencing of an AML tumor and the corresponding remission transcriptome allowed us to analyze approximately 10 000 genes and to identify five tumor-specific somatic mutations.

Materials and methods

Case information
A diagnostic bone marrow sample was collected from a 69-year-old patient, diagnosed with AML M1 in May 2008. The patient was included in the AML Cooperative Group clinical trial, and informed consent and ethical approval for scientific use of the sample including genetic studies were obtained. After induction therapy using the sequential high-dose cytosine arabinoside and mitoxantrone (S-HAM) protocol, complete remission was achieved. After leukocyte recovery in July 2008, a remission sample from peripheral blood was taken.

Sample preparation
Approximately 50 × 10^6 cells from each sample were used for mRNA extraction using Trizol (Invitrogen, Carlsbad, CA, USA).
AML transcriptome sequencing
PA Greif et al
822
The sequencing library was prepared using the mRNA-Seq sample preparation kit (Illumina, San Diego, CA, USA). In brief, mRNA was selected using oligo-dT beads (Dynabeads, Invitrogen). The mRNA was then fragmented using metal ion hydrolysis and reverse transcribed using random hexamer primers. Subsequent steps included end repair, adapter ligation, size selection and polymerase chain reaction enrichment.

Sequence alignment
Short-read alignment and consensus assembly were performed using the BWA (v.0.5.5) sequence-alignment program,19 with the default parameters and interactive trimming of low quality bases at the end of reads (cut-off quality value q = 15). We used an expanded reference sequence comprising the human genome assembly (build NCBI36/hg18) and all annotated splice sites extracted from the University of California Santa Cruz (UCSC) genome browser known gene track. In total, we generated 127 115 919 paired-end reads of 36 bp length for the AML sample, of which 95.08% aligned to the reference sequence, and 187 782 678 paired-end reads for the remission sample, with 82% aligning to the reference. Read mapping, subsequent assembly and variant calling were performed using the resequencing software packages BWA and SAMtools.19,20 During alignment, 31.27 and 39.81% apparently duplicated reads were removed from the AML and remission sample, respectively.

Distribution of reads across exonic and non-exonic regions
To determine the success of the RNA library preparation, we calculated the percentage of reads matching to known exons from the UCSC genome browser. For the AML sample, ~63% of reads aligned to exons, ~28.5% to introns and ~7.5% to intergenic regions, whereas for the remission sample, ~73.5% of reads aligned to exons, ~20.5% to introns and ~6% to intergenic regions (Figure 1b). The relatively high proportion of intronic reads may stem from unspliced mRNAs. Variable proportions of intronic and exonic reads were observed between different preparations from the same samples, indicating that minor differences in RNA concentration and quality might strongly influence the competitive binding of shorter spliced and longer incompletely spliced mRNAs to oligo-dT beads. The values varied between the different chromosomes, and the number of reads mapping to exons was correlated with overall gene density on the chromosome (Supplementary Figure S1).

Expression analysis
Expression values were calculated as RPKM (reads per kilobase of gene model per million mapped reads).21 In brief, the number of uniquely mapping reads (BWA mapping quality >0, ~75 to 85% of reads for both samples) for each gene was counted and then normalized by gene length and the total number of reads generated in the experiment. As the reference set,
Figure 1 (a) Histograms of the sequence coverage in a non-redundant gene set based on the Ensembl annotation (35 876 genes) for genes detected
in both samples (left), the acute myeloid leukemia (AML, middle) and remission (right) samples. Minimum sequence coverage is plotted on the
x-axis and number of genes is plotted on the y-axis. We sequenced 10 152 genes with an average coverage of 7 or greater, 6989 genes with an
average coverage of 20 or greater and 5535 genes with an average coverage of at least 30 in both samples (left). The result obtained from the AML
sample was 11 293 genes with an average coverage of 7 or greater, 7878 genes with an average coverage of 20 or greater and 6326 genes with an
average coverage of 30 or greater (middle). The sequencing of the remission yielded 11 906 genes with an average coverage of 7 or greater, 8805
genes with an average coverage of 20 or greater and 7446 genes at an average coverage of at least 30 (right). The high proportion of genes detected
in both samples indicates a good comparability of expression profiles. (b) Two pie charts showing the percentage of reads from the AML (left) and
remission samples (right) that map to exons, introns or intergenic regions (see also Supplementary Figure S1).
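The gene counts quoted in the Figure 1a caption amount to thresholding the per-gene average coverage in each sample and in both samples together. The following is a minimal sketch of that counting step, not the authors' code: the function name, the dictionary input format and the toy coverage values are assumptions made for illustration only.

```python
from typing import Dict, Iterable


def count_genes_at_coverage(
    tumor: Dict[str, float],
    remission: Dict[str, float],
    thresholds: Iterable[int] = (7, 20, 30),
) -> Dict[int, int]:
    """Count genes whose average coverage reaches each threshold in BOTH samples."""
    shared = tumor.keys() & remission.keys()
    return {
        t: sum(1 for gene in shared if tumor[gene] >= t and remission[gene] >= t)
        for t in thresholds
    }


# Hypothetical average coverage per gene (illustrative values, not from the paper).
tumor_cov = {"RUNX1": 35.0, "TLE4": 22.0, "SHKBP1": 8.0, "XPO7": 3.0}
remission_cov = {"RUNX1": 40.0, "TLE4": 12.0, "SHKBP1": 9.5, "XPO7": 50.0}

counts = count_genes_at_coverage(tumor_cov, remission_cov)
print(counts)
```

Passing the same dictionary for both arguments gives the single-sample counts shown in the middle and right panels.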
Leukemia
we used a non-redundant gene set based on the Ensembl gene annotations by merging all annotated transcripts from the same gene into a single ‘maximum coding sequence’. This set contained 35 876 genes. Exonic regions that were shared by two or more different genes (for instance sense and anti-sense transcripts or non-coding RNAs within exons) were excluded and not used for RPKM calculation, as reads from these regions cannot be unambiguously assigned to single genes.

Spearman’s rank correlation coefficient
Spearman’s rank correlation coefficient was calculated from the log2 RPKM values of the tumor and remission sample, using the R package for statistical computing.

SNP calling
Variant calling was performed using the SAMtools package (v.0.1.5c).20 For the variant filter of SAMtools, we used the following settings: minimum read depth = 3; maximum read depth = 9999; minimum root mean square mapping quality for single-nucleotide polymorphisms (SNPs) = 25; minimum mapping quality of gaps = 10; minimum indel score for filtering = 25; window size around potential indels = 10; window size for filtering dense SNPs = 10; maximum number of SNPs allowed in window = 2.
Subsequently, we applied additional filters. We required each putative SNP to have (i) a median quality value of the variant bases of at least 20, (ii) that at least 15% of all reads covering the position show the variant allele and (iii) that at least 10% of reads showing the variant allele are from opposite strands.
Functional analysis of SNPs was performed with custom Perl scripts using data sets from Ensembl and the UCSC genome browser. Known SNP locations, Ensembl and known gene annotations were used as provided by the UCSC genome browser.

Results

To demonstrate the feasibility of this approach, we selected an AML sample (bone marrow aspirate) and a corresponding remission sample (peripheral blood) for transcriptome sequencing. The patient, a 69-year-old female, presented with de novo AML, with blood counts and bone marrow morphology being consistent with the diagnosis of AML without maturation according to the French-American-British classification (FAB AML M1). After induction therapy, complete remission was achieved. One year after initial diagnosis, the patient relapsed and received an allogeneic bone marrow transplant.
Conventional cytogenetic analysis revealed a normal female karyotype (46, XX[20]). An internal tandem duplication of FLT3, an NPM1 mutation and a partial tandem duplication in the MLL gene were excluded in a routine diagnostic screen. We further investigated whether the tumor sample contained somatic copy number variations using the HumanOmni1-Quad chip (Illumina), containing probes for approximately 1 million loci. We found no evidence of somatic loss-of-heterozygosity, indicating the presence of a normal diploid genome. A total of 29 copy number changes were present in both the tumor and remission sample. We compared the copy number variations with those contained in the database of genomic variants and 1600 controls from a population-based study. All the copy number variations were present at least once in these cohorts.
We sequenced 4.35 and 5.54 Gbp of the tumor and remission sample, respectively, on an Illumina GA IIx sequencer (Illumina). We used the NCBI36/hg18 genome assembly as reference sequence and compiled a non-redundant mRNA set from the Ensembl transcripts database, resulting in a set of 35 876 genes. Read mapping to the reference genome was performed with the BWA software.19 Approximately 95 and 82% of the reads mapped to the reference, of which 63 and 74% mapped to exonic sequences in the tumor and remission sample, respectively (Figure 1b, Supplementary Figure S1).
The average sequence read depth for every gene was first calculated to obtain the number of genes suitable for mutation detection. The read depth per gene ranged from 0 to over 1000. A total of 10 152 genes had an average read depth of at least sevenfold and 6989 genes had an average read depth of 20 or greater in both samples. These numbers were only slightly higher when the tumor and remission samples were analyzed individually, indicating that the gene expression pattern was comparable even though the tumor sample was a bone marrow aspirate with more than 90% blasts, whereas the remission sample was from peripheral blood with a normal white blood cell count (Figure 1a). The comparability was supported by a high correlation of the gene expression levels between the samples, as shown by a Spearman rank correlation coefficient of 0.82 (Figure 2a).
Single-nucleotide variants (SNVs) were called with the SAMtools software package,20 using mainly the default parameters and custom filters applied at later stages. To achieve a low false-positive rate, we required a minimum read depth of 7 in both samples. We set this threshold because there is a detection rate of approximately 70% at this read depth.22 For the same purpose, we quality filtered the SNV set of the tumor sample, but used an unfiltered set of the remission sample for comparison (Figure 2b).
Quality filtering in the tumor resulted in a set of 8978 SNVs in coding regions. This compares favorably with approximately 20 000 SNVs that can be found in the entire coding sequence using exome sequencing.23 In the next step, we excluded all coding SNVs that were present in the dbSNP database version 130 or in the exomes of 8 HapMap samples. The remaining 926 sites contained 612 SNVs that led to an amino acid substitution or disrupted canonical splice sites. These 612 SNVs were then compared with the unfiltered calls of the remission sample at these 612 positions. We excluded all positions with any indication that the same SNV was also present in the remission sample.
This strategy resulted in the identification of 11 candidate SNVs unique to the tumor sample. Capillary sequencing of genomic DNA from both the tumor and the remission sample confirmed five SNVs, which affected the genes RUNX1, TLE4, SHKBP1, XPO7 and RRP8 (Table 1, Figure 3). Two SNVs were false positives, with the same heterozygous SNVs also being present in the genomic DNA of the remission sample; four SNVs could not be confirmed in the AML sample.
RUNX1 (AML1) carried a heterozygous stop mutation in the Runt domain. RUNX1 is the fusion partner of RUNX1T1 (eight twenty one (ETO)) in the recurring t(8;21)(q22;q22) translocation present in 8–13% of de novo AML cases.24 In addition, point mutations in RUNX1 have recently been described in AML, in particular AML secondary to myelodysplastic syndrome, radiation exposure or chemotherapy, at a frequency of 8–10%.25
TLE4 carried a missense mutation at position 511 (N511S). TLE4 is located on chromosome 9 band q34, which is frequently deleted in AML with t(8;21) translocations, and is therefore a
described in AML and myelodysplastic syndrome.28 Thus, it is
likely that SB1 mutations affect FLT3 signaling. SHKBP1
overexpression in cell lines has antiapoptotic effects.29
The fourth and fifth AML-specific mutations were missense
mutations in XPO7 (a member of the importin beta superfamily)
and RRP8 (a methyltransferase, possibly involved in ribosomal
RNA processing).
Although recurring mutations in RUNX1 are known to occur
in AML, mutations in TLE4 or SHKBP1 have not been described
before. We therefore screened the complete coding sequence of
TLE4 and SHKBP1, as well as of RUNX1 in 95 cytogenetically
normal-AML patients by capillary sequencing of genomic DNA
(Table 2). As expected, we found several patients with RUNX1
mutation (9/95; 9.5%): nine missense mutations (two patients
with two mutations each), one nonsense mutation and a 5 bp
insertion. We also discovered two missense mutations in TLE4
and two missense mutations in SHKBP1 (Table 2), strongly
suggesting that both TLE4 and SHKBP1 are mutational targets in
AML at a frequency of about 2%. Mutations in TLE4, SHKBP1
and RUNX1 were mutually exclusive in the cohort of 95
cytogenetically normal-AML patients. TLE4 mutations were
found in patients with mutations in NPM1 and C/EBPA, whereas
SHKBP1 mutations were found in combination with mutations
in NPM1 and FLT3 (Table 2).
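The selection of tumor-specific SNVs described in the Methods and Results (custom quality filters on the tumor calls, then removal of known variants and of anything also seen in the remission sample) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' Perl implementation: the `SnvCall` record, the function names and all toy coordinates are hypothetical, the strand criterion is read here as "the less frequent strand carries at least 10% of the variant reads", and the functional annotation step (keeping only amino acid or splice-site changes) is left out.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Set, Tuple

Site = Tuple[str, int, str]  # (chromosome, position, variant base)


@dataclass(frozen=True)
class SnvCall:
    chrom: str
    pos: int
    alt: str
    depth: int                      # all reads covering the position
    variant_quals: Tuple[int, ...]  # base qualities of reads showing the variant
    variant_fwd: int                # variant reads on the forward strand
    variant_rev: int                # variant reads on the reverse strand


def passes_quality_filters(call: SnvCall, min_depth: int = 7) -> bool:
    """Apply the depth threshold and the three custom filters from the text."""
    n_var = call.variant_fwd + call.variant_rev
    if call.depth < min_depth or n_var == 0:
        return False
    if median(call.variant_quals) < 20:        # (i) median variant base quality
        return False
    if n_var / call.depth < 0.15:              # (ii) variant allele fraction
        return False
    minor_strand = min(call.variant_fwd, call.variant_rev)
    return minor_strand / n_var >= 0.10        # (iii) assumed reading of strand rule


def tumor_specific(
    tumor_calls: List[SnvCall],
    remission_sites: Set[Site],              # unfiltered remission calls
    known_sites: Set[Site],                  # e.g. dbSNP or HapMap exome variants
    remission_depth: Dict[Tuple[str, int], int],
    min_depth: int = 7,
) -> List[SnvCall]:
    """Keep filtered tumor SNVs that are neither known nor seen in remission."""
    kept = []
    for call in tumor_calls:
        site = (call.chrom, call.pos, call.alt)
        if not passes_quality_filters(call, min_depth):
            continue
        if remission_depth.get((call.chrom, call.pos), 0) < min_depth:
            continue
        if site in known_sites or site in remission_sites:
            continue
        kept.append(call)
    return kept


# Toy input with made-up coordinates.
calls = [
    SnvCall("chr21", 100, "T", 20, (30, 30, 25, 28), 2, 2),  # passes everything
    SnvCall("chr9", 200, "G", 40, (30, 30, 30, 30), 4, 0),   # allele fraction too low
    SnvCall("chr19", 300, "A", 10, (35, 35, 35), 2, 1),      # also seen in remission
    SnvCall("chr8", 400, "C", 12, (30, 30, 30), 2, 1),       # known variant
]
depths = {(c.chrom, c.pos): 10 for c in calls}
result = tumor_specific(
    calls,
    remission_sites={("chr19", 300, "A")},
    known_sites={("chr8", 400, "C")},
    remission_depth=depths,
)
print([(c.chrom, c.pos, c.alt) for c in result])
```

Comparing against the unfiltered remission calls is deliberately conservative: any hint of the variant in the remission sample removes the candidate, which trades sensitivity for a lower false-positive rate.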
Discussion
Table 1 Confirmed tumor-specific somatic mutations identified by transcriptome sequencing
Gene | Position (hg18) | Reference genotype | Tumor genotype | Amino acid | Ensembl protein | Read depth tumor | Read depth remission
Figure 3 Sequencing of genomic DNA from the patient confirms five point mutations detected by whole transcriptome sequencing. The genes affected are RUNX1 (a), TLE4 (b), SHKBP1 (c), XPO7 (d) and RRP8 (e). Chromatograms show sequences from AML (upper panels) and remission (lower panels) for each gene.
Apparently, many subtle genetic changes may contribute to the disease through multiple interactions.
Although analysis of the two AML genomes required the sequencing of over 120 Gbp for each patient and resulted in the detection of 10 to 12 tumor-specific mutations in the gene coding regions in each case,10,11 our analysis of an AML transcriptome required only the sequencing of 10 Gbp and resulted in the identification of five tumor-specific mutations in gene coding regions. Thus, our findings demonstrate that whole transcriptome sequencing might be an order of magnitude faster and more cost effective than whole genome sequencing for the detection of point mutations in coding
Table 2 Mutations identified by capillary sequencing of 95 CN-AML patients
Gene | Position (hg18) | Reference genotype | Tumor genotype | Amino acid | Ensembl protein | Number of pat. (ID#) | Additional mutations
Conflict of interest
Acknowledgements
7 Falini B, Mecucci C, Tiacci E, Alcalay M, Rosati R, Pasqualucci L et al. Cytoplasmic nucleophosmin in acute myelogenous leukemia with a normal karyotype. N Engl J Med 2005; 352: 254–266.
8 Zhang DE, Zhang P, Wang ND, Hetherington CJ, Darlington GJ, Tenen DG. Absence of granulocyte colony-stimulating factor signaling and neutrophil development in CCAAT enhancer binding protein alpha-deficient mice. Proc Natl Acad Sci USA 1997; 94: 569–574.
9 Caligiuri MA, Schichman SA, Strout MP, Mrozek K, Baer MR, Frankel SR et al. Molecular rearrangement of the ALL-1 gene in acute myeloid leukemia without cytogenetic evidence of 11q23 chromosomal translocations. Cancer Res 1994; 54: 370–373.
10 Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008; 456: 66–72.
11 Mardis ER, Ding L, Dooling DJ, Larson DE, McLellan MD, Chen K et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. N Engl J Med 2009; 361: 1058–1066.
12 Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 2010; 463: 191–196.
13 Pleasance ED, Stephens PJ, O’Meara S, McBride DJ, Meynert A, Jones D et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 2010; 463: 184–190.
14 Stephens PJ, McBride DJ, Lin ML, Varela I, Pleasance ED, Simpson JT et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 2009; 462: 1005–1010.
15 Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature 2009; 458: 719–724.
16 Gross S, Cairns RA, Minden MD, Driggers EM, Bittinger MA, Jang HG et al. Cancer-associated metabolite 2-hydroxyglutarate accumulates in acute myelogenous leukemia with isocitrate dehydrogenase 1 and 2 mutations. J Exp Med 2010; 207: 339–344.
17 Hurowitz EH, Drori I, Stodden VC, Donoho DL, Brown PO. Virtual Northern analysis of the human genome. PLoS ONE 2007; 2: e460.
18 Velculescu VE, Madden SL, Zhang L, Lash AE, Yu J, Rago C et al. Analysis of human transcriptomes. Nat Genet 1999; 23: 387–388.
19 Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009; 25: 1754–1760.
20 Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009; 25: 2078–2079.
21 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008; 5: 621–628.
22 Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008; 18: 1851–1858.
23 Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM et al. Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet 2010; 42: 30–35.
24 Peterson LF, Zhang DE. The 8;21 translocation in leukemogenesis. Oncogene 2004; 23: 4255–4262.
25 Osato M. Point mutations in the RUNX1/AML1 gene: another actor in RUNX leukemia. Oncogene 2004; 23: 4284–4296.
26 Dayyani F, Wang J, Yeh JR, Ahn EY, Tobey E, Zhang DE et al. Loss of TLE1 and TLE4 from the del(9q) commonly deleted region in AML cooperates with AML1-ETO to affect myeloid cell proliferation and survival. Blood 2008; 111: 4338–4347.
27 Borinstein SC, Hyatt MA, Sykes VW, Straub RE, Lipkowitz S, Boulter J et al. SETA is a multifunctional adapter protein with three SH3 domains that binds Grb2, Cbl, and the novel SB1 proteins. Cell Signal 2000; 12: 769–779.
28 Reindl C, Quentmeier H, Petropoulos K, Greif PA, Benthaus T, Argiropoulos B et al. CBL exon 8/9 mutants activate the FLT3 pathway and cluster in core binding factor/11q deletion acute myeloid leukemia/myelodysplastic syndrome subtypes. Clin Cancer Res 2009; 15: 2238–2247.
29 Liu JP, Liu NS, Yuan HY, Guo Q, Lu H, Li YY. Human homologue of SETA binding protein 1 interacts with cathepsin B and participates in TNF-Induced apoptosis in ovarian cancer cells. Mol Cell Biochem 2006; 292: 189–195.
30 Mayr C, Bartel DP. Widespread shortening of 3′UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 2009; 138: 673–684.