ScRNA Seq Course
Contents
5 Introduction to R/Bioconductor
5.1 Installing packages
5.2 Installation instructions:
5.3 Data-types/classes
5.4 Basic data structures
5.5 More information
5.6 Data Types
6 Tabula Muris
6.1 Introduction
6.2 Downloading the data
6.3 Reading the data (Smartseq2)
6.4 Building a scater object
6.5 Reading the data (10X)
6.6 Building a scater object
6.7 Advanced Exercise
9 Seurat
9.1 Seurat object class
9.2 Expression QC
9.3 Normalization
9.4 Highly variable genes
9.5 Dealing with confounders
9.6 Linear dimensionality reduction
9.7 Significant PCs
9.8 Clustering cells
9.9 Marker genes
9.10 sessionInfo()
12 Resources
12.1 scRNA-seq protocols
12.2 External RNA Control Consortium (ERCC)
12.3 scRNA-seq analysis tools
12.4 scRNA-seq public datasets
Chapter 1
About the course
Today it is possible to obtain genome-wide transcriptome data from single cells using high-throughput
sequencing (scRNA-seq). The main advantage of scRNA-seq is that the cellular resolution and the genome-wide
scope make it possible to address issues that are intractable using other methods, e.g. bulk RNA-seq
or single-cell RT-qPCR. However, to analyze scRNA-seq data, novel methods are required and some of the
underlying assumptions for the methods developed for bulk RNA-seq experiments are no longer valid.
In this course we will discuss some of the questions that can be addressed using scRNA-seq as well as the
available computational and statistical methods. The course is taught through the University of
Cambridge Bioinformatics training unit, but the material found on these pages is meant to be used by
anyone interested in learning about computational analysis of scRNA-seq data. The course is taught twice
per year and the material here is updated prior to each event.
The number of computational tools is increasing rapidly and we are doing our best to keep up to date with
what is available. One of the main constraints for this course is that we would like to use tools that are
implemented in R and that run reasonably fast. Moreover, we will also confess to being somewhat biased
towards methods that have been developed either by us or by our friends and colleagues.
1.1 Video
This video was recorded in November 2017; at that time the course contained fewer chapters than the current
version.
1.2 Registration
Please follow this link and register for the “Analysis of single cell RNA-seq data” course:
http://training.csx.cam.ac.uk/bioinformatics/search
1.3 GitHub
https://github.com/hemberg-lab/scRNA.seq.course
1.6 License
All of the course material is licensed under GPL-3. Anyone is welcome to go through the material in order
to learn about analysis of scRNA-seq data. If you plan to use the material for your own teaching, we would
appreciate it if you would tell us about it, in addition to providing a suitable citation.
1.7 Prerequisites
The course is intended for those who have basic familiarity with Unix and the R scripting language.
We will also assume that you are familiar with mapping and analysing bulk RNA-seq data as well as with
the commonly available computational tools.
We recommend attending the Introduction to RNA-seq and ChIP-seq data analysis or the Analysis of high-
throughput sequencing data with Bioconductor before attending this course.
1.8 Contact
If you have any comments, questions or suggestions about the material, please contact Vladimir Kiselev.
Chapter 2
Introduction to single-cell RNA-seq
• A major breakthrough (replaced microarrays) in the late 2000s that has been widely used since
• Measures the average expression level for each gene across a large population of input cells
• Useful for comparative transcriptomics, e.g. samples of the same tissue from different species
• Useful for quantifying expression signatures from ensembles, e.g. in disease studies
• Insufficient for studying heterogeneous systems, e.g. early development studies, complex tissues
(brain)
• Does not provide insights into the stochastic nature of gene expression
2.2 scRNA-seq
2.3 Workflow
Overall, experimental scRNA-seq protocols are similar to the methods used for bulk RNA-seq. We will be
discussing some of the most common approaches in the next chapter.
2.5 Challenges
The main difference between bulk and single cell RNA-seq is that each sequencing library represents a single
cell, instead of a population of cells. Therefore, significant attention has to be paid to comparison of the
results from different cells (sequencing libraries). The main sources of discrepancy between the libraries are:
• Amplification (up to 1 million fold)
• Gene ‘dropouts’ in which a gene is observed at a moderate expression level in one cell but is not
detected in another cell (Kharchenko et al., 2014).
Figure 2.3: Moore’s law in single cell transcriptomics (image taken from Svensson et al.,
https://arxiv.org/abs/1704.01379)
In both cases the discrepancies are introduced due to low starting amounts of transcripts since the RNA
comes from one cell only. Improving the transcript capture efficiency and reducing the amplification bias
are currently active areas of research. However, as we shall see in this course, it is possible to alleviate some
of these issues through proper normalization and corrections.
2.6 Experimental methods
Development of new methods and protocols for scRNA-seq is currently a very active area of research, and
several protocols have been published over the last few years. A non-comprehensive list includes:
• CEL-seq (Hashimshony et al., 2012)
• CEL-seq2 (Hashimshony et al., 2016)
• Drop-seq (Macosko et al., 2015)
• InDrop-seq (Klein et al., 2015)
• MARS-seq (Jaitin et al., 2014)
• SCRB-seq (Soumillon et al., 2014)
• Seq-well (Gierahn et al., 2017)
• Smart-seq (Picelli et al., 2014)
• Smart-seq2 (Picelli et al., 2014)
• SMARTer
• STRT-seq (Islam et al., 2013)
The methods can be categorized in different ways, but the two most important aspects are quantification
and capture.
For quantification, there are two types, full-length and tag-based. The former tries to achieve a uniform
read coverage of each transcript. By contrast, tag-based protocols only capture either the 5’- or 3’-end of
each RNA. The choice of quantification method has important implications for what types of analyses the
data can be used for. In theory, full-length protocols should provide an even coverage of transcripts, but as
we shall see, there are often biases in the coverage. The main advantage of tag-based protocols is that they
can be combined with unique molecular identifiers (UMIs), which can help improve the quantification (see
chapter 4.6). On the other hand, being restricted to one end of the transcript may reduce the mappability
and it also makes it harder to distinguish different isoforms (Archer et al., 2016).
Figure 2.5: Image of a 96-well Fluidigm C1 chip (image taken from Fluidigm)
The strategy used for capture determines throughput, how the cells can be selected, and what kind of
additional information besides the sequencing can be obtained. The three most widely used options are
microwell-, microfluidic- and droplet-based.
For well-based platforms, cells are isolated using, for example, a pipette or laser capture and placed in
microfluidic wells. One advantage of well-based methods is that they can be combined with fluorescence-activated
cell sorting (FACS), making it possible to select cells based on surface markers. This strategy is thus very
useful for situations when one wants to isolate a specific subset of cells for sequencing. Another advantage is
that one can take pictures of the cells. The image provides an additional modality and a particularly useful
application is to identify wells containing damaged cells or doublets. The main drawback of these methods is
that they are often low-throughput and the amount of work required per cell may be considerable.
Microfluidic platforms, such as Fluidigm’s C1, provide a more integrated system for capturing cells and for
carrying out the reactions necessary for the library preparations. Thus, they provide a higher throughput
than microwell-based platforms. Typically, only around 10% of cells are captured in a microfluidic platform,
and thus they are not appropriate if one is dealing with rare cell types or very small amounts of input.
Moreover, the chip is relatively expensive, but since reactions can be carried out in a smaller volume, money
can be saved on reagents.
The idea behind droplet based methods is to encapsulate each individual cell inside a nanoliter droplet
together with a bead. The bead is loaded with the enzymes required to construct the library. In particular,
each bead contains a unique barcode which is attached to all of the reads originating from that cell. Thus,
all of the droplets can be pooled, sequenced together and the reads can subsequently be assigned to the cell
of origin based on the barcodes. Droplet platforms typically have the highest throughput since the library
preparation costs are on the order of 0.05 USD/cell. Instead, sequencing costs often become the limiting
factor, and in a typical experiment the coverage is low, with only a few thousand different transcripts detected.
Figure 2.6: Schematic overview of the Drop-seq method (image taken from Macosko et al.)
2.7 What platform to use for my experiment?
The most suitable platform depends on the biological question at hand. For example, if one is interested
in characterizing the composition of a tissue, then a droplet-based method which will allow a very large
number of cells to be captured is likely to be the most appropriate. On the other hand, if one is interested
in characterizing a rare cell population for which there is a known surface marker, then it is probably best
to enrich using FACS and then sequence a smaller number of cells.
Clearly, full-length transcript quantification will be more appropriate if one is interested in studying different
isoforms since tagged protocols are much more limited. By contrast, UMIs can only be used with tagged
protocols and they can facilitate gene-level quantification.
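As a rough illustration of why UMIs help gene-level quantification, the sketch below collapses reads that share the same cell barcode, gene and UMI into a single molecule before counting. The three-column input layout and the function name `count_umis` are assumptions made for this example; they are not the output format of any particular tool.

```shell
#!/usr/bin/env bash
# Toy input: one line per mapped read with cell barcode, gene and UMI
# (hypothetical layout chosen for this sketch).
reads_table=$(printf '%s\n' \
  'CELL1 GeneA AAAT' \
  'CELL1 GeneA AAAT' \
  'CELL1 GeneA AAAT' \
  'CELL1 GeneA CCGT' \
  'CELL1 GeneB GGTA' \
  'CELL2 GeneA AAAT')

# count_umis: collapse reads sharing (cell, gene, UMI) into one molecule,
# then count molecules per (cell, gene).
count_umis() {
  sort -u | awk '{ n[$1 " " $2]++ } END { for (k in n) print k, n[k] }' | sort
}

# Plain read counts per (cell, gene), for comparison.
read_counts=$(printf '%s\n' "$reads_table" |
  awk '{ n[$1 " " $2]++ } END { for (k in n) print k, n[k] }' | sort)
umi_counts=$(printf '%s\n' "$reads_table" | count_umis)

echo "Reads per cell and gene:";     echo "$read_counts"
echo "Molecules per cell and gene:"; echo "$umi_counts"
```

In this toy table GeneA in CELL1 is supported by four reads but only two distinct UMIs, so amplification duplicates no longer inflate its count. Real pipelines usually also tolerate sequencing errors in the UMI, which this sketch does not attempt.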
Two recent studies from the Enard group (Ziegenhain et al., 2017) and the Teichmann group (Svensson
et al., 2017) have compared several different protocols. In their study, Ziegenhain et al compared five
different protocols on the same sample of mouse embryonic stem cells (mESCs). By controlling for the
number of cells as well as the sequencing depth, the authors were able to directly compare the sensitivity,
noise-levels and costs of the different protocols. One example of their conclusions is illustrated in the figure
below which shows the number of genes detected (for a given detection threshold) for the different methods.
As you can see, there is almost a two-fold difference between Drop-seq and Smart-seq2, suggesting that the
choice of protocol can have a major impact on the study.
Svensson et al take a different approach by using synthetic transcripts (spike-ins, more about these later)
with known concentrations to measure the accuracy and sensitivity of different protocols. Comparing a wide
range of studies, they also reported substantial differences between the protocols.
As protocols are developed and computational methods for quantifying the technical noise are improved, it
is likely that future studies will help us gain further insights regarding the strengths of the different methods.
These comparative studies are helpful not only for helping researchers decide which protocol to use, but also
for developing new methods as the benchmarking makes it possible to determine what strategies are the
most useful ones.
Chapter 3
Processing raw scRNA-seq data
3.1 FastQC
Once you’ve obtained your single-cell RNA-seq data, the first thing you need to do with it is check the
quality of the reads you have sequenced. For this task, today we will be using a tool called FastQC. FastQC
is a quality control tool for sequencing data, which can be used for both bulk and single-cell RNA-seq data.
FastQC takes sequencing data as input and returns a report on read quality. Copy and paste this link into
your browser to visit the FastQC website:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
This website contains links to download and install FastQC and documentation on the reports produced.
Fortunately we have already installed FastQC for you today, so instead we will take a look at the documenta-
tion. Scroll down the webpage to ‘Example Reports’ and click ‘Good Illumina Data’. This gives an example
of what an ideal report should look like for high quality Illumina reads data.
Now let’s make a FastQC report ourselves.
Today we will be performing our analysis using a single cell from an mESC dataset produced by
Kolodziejczyk et al. (2015). The cells were sequenced using the SMART-seq2 library preparation protocol
and the reads are paired end. The files are located in Share. Let’s take a look at them:
less Share/ERR522959_1.fastq
less Share/ERR522959_2.fastq
Task 1: Try to work out what command you should use to produce the FastQC report. Hint: Try executing
fastqc -h
This command will tell you what options are available to pass to FastQC. Feel free to ask for help if you get
stuck! If you are successful, you should generate a .zip and a .html file for both the forwards and the reverse
reads files. Once you have been successful, feel free to have a go at the next section.
If you haven’t done so already, generate the FastQC report using the commands below:
mkdir fastqc_results
fastqc -o fastqc_results Share/ERR522959_1.fastq Share/ERR522959_2.fastq
Once the command has finished executing, you should have a total of four files: one zip file for each of the
paired end reads, and one html file for each of the paired end reads. The report is in the html file. To view
it, we will need to get it off AWS and onto your computer using either FileZilla or scp. Ask an instructor if
you are having difficulties.
Once the file is on your computer, click on it. Your FastQC report should open. Have a look through the file.
Remember to look at both the forwards and the reverse end read reports! How good quality are the reads?
Is there anything we should be concerned about? How might we address those concerns?
Fortunately there is software available for read trimming. Today we will be using Trim Galore!. Trim Galore!
is a wrapper for the read trimming software cutadapt.
Read trimming software can be used to trim sequencing adapters and/or low-quality bases from the ends of
reads. Given that we noticed some adapter contamination in our FastQC report, it is a good idea to
trim adapters from our data.
Task 2: What type of adapters were used in our data? Hint: Look at the FastQC report ‘Adapter Content’
plot.
Now let’s try to use Trim Galore! to remove those problematic adapters. It’s a good idea to check read
quality again after trimming, so after you have trimmed your reads you should use FastQC to produce
another report.
Task 3: Work out the command you should use to trim the adapters from our data. Hint 1: You can use
trim_galore -h
to find out what options you can pass to Trim Galore!. Hint 2: Read through the output of the above
command carefully. The adapter used in this experiment is quite common. Do you need to know the actual
sequence of the adapter to remove it?
Task 4: Produce a FastQC report for your trimmed reads files. Is the adapter contamination gone?
Once you think you have successfully trimmed your reads and have confirmed this by checking the FastQC
report, feel free to check your results using the next section.
3.2.1 Solution
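You can use the command(s) below to trim the Nextera sequencing adapters. Because the reads are paired end,
the --paired option is used so that Trim Galore! processes the two files together and keeps the forward and
reverse reads in sync:
mkdir fastqc_trimmed_results
trim_galore --nextera --paired -o fastqc_trimmed_results Share/ERR522959_1.fastq Share/ERR522959_2.fastq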
Remember to generate new FastQC reports for your trimmed reads files! FastQC should now show that your
reads pass the ‘Adapter Content’ plot. Feel free to ask one of the instructors if you have any questions.
Congratulations! You have now generated read quality reports and performed adapter trimming. In the
next lab, we will use STAR and Kallisto to align our trimmed and quality-checked reads to a reference
transcriptome.
3.3 File formats
3.3.1 FastQ
FastQ is the most raw form of scRNASeq data you will encounter. All scRNASeq protocols are sequenced
with paired-end sequencing. Barcode sequences may occur in one or both reads depending on the protocol
employed. However, protocols using unique molecular identifiers (UMIs) will generally contain one read with
the cell and UMI barcodes plus adapters but without any transcript sequence. Thus reads will be mapped
as if they are single-end sequenced despite actually being paired end.
FastQ files have the format:
@ReadID
READ SEQUENCE
+
SEQUENCING QUALITY SCORES
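The four-line record layout can be checked from the command line. The following is a minimal sketch that writes a two-read toy file and counts its records, assuming unwrapped records (exactly four lines per read, with the read ID line starting with `@`); the function name `fastq_summary` and the reads themselves are made up for this example.

```shell
#!/usr/bin/env bash
# Write a two-read toy FastQ file (made-up reads, for illustration only).
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'TTGCA'      '+' 'II#II' > "$fq"

# fastq_summary: print the number of records and how many checks failed.
# Assumes unwrapped records, i.e. exactly four lines per read.
fastq_summary() {
  awk '
    NR % 4 == 1 { if (substr($0, 1, 1) != "@") bad++ }   # read ID line
    NR % 4 == 2 { seqlen = length($0) }                   # read sequence
    NR % 4 == 3 { if (substr($0, 1, 1) != "+") bad++ }   # separator line
    NR % 4 == 0 { if (length($0) != seqlen) bad++; n++ }  # quality string
    END { printf "records=%d malformed=%d\n", n, bad }
  ' "$1"
}

summary=$(fastq_summary "$fq")
echo "$summary"
rm -f "$fq"
```

The same function can be pointed at a real file, e.g. `fastq_summary Share/ERR522959_1.fastq`; for a well-formed file the record count equals the line count reported by `wc -l` divided by four.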
3.3.2 BAM
The BAM file format stores mapped reads in a standard and efficient manner. The human-readable version is
called a SAM file, while the BAM file is the highly compressed version. BAM/SAM files contain a header,
which typically includes information on the sample preparation, sequencing and mapping, and a tab-separated
row for each individual alignment of each read.
Alignment rows employ a standard format with the following columns:
(1) QNAME : read name (generally will include UMI barcode if applicable)
(2) FLAG : number tag indicating the “type” of alignment; each bit of the number encodes one property of the alignment
(3) RNAME : reference sequence name (i.e. chromosome read is mapped to).
(4) POS : leftmost mapping position
(5) MAPQ : Mapping quality
(6) CIGAR : string indicating the matching/mismatching parts of the read (may include soft-clipping).
(7) RNEXT : reference name of the mate/next read
(8) PNEXT : POS for mate/next read
(9) TLEN : Template length (length of reference region the read is mapped to)
(10) SEQ : read sequence
(11) QUAL : read quality
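Because FLAG is a bitwise field, individual properties can be tested with integer arithmetic. The sketch below tallies toy alignment rows by two bits of the FLAG column: 4 (read unmapped) and 256 (secondary alignment). The toy rows contain only QNAME and FLAG, and the function name `flag_summary` is hypothetical; on real data the input would be the text produced by `samtools view`.

```shell
#!/usr/bin/env bash
# Toy SAM records (tab-separated, made up for illustration); only the first
# two columns matter here: QNAME and FLAG.
sam=$(printf '%s\t%s\n' \
  'readA' 0 \
  'readB' 16 \
  'readC' 4 \
  'readA' 256 \
  'readD' 272)

# flag_summary: count unmapped, secondary and primary mapped alignments.
# Bits used: 4 = read unmapped, 256 = secondary alignment.
flag_summary() {
  awk -F '\t' '
    /^@/ { next }                                  # skip header lines
    {
      unmapped  = int($2 / 4)   % 2                # test bit 4
      secondary = int($2 / 256) % 2                # test bit 256
      if (unmapped)       u++
      else if (secondary) s++
      else                p++
    }
    END { printf "primary=%d secondary=%d unmapped=%d\n", p, s, u }
  '
}

summary=$(printf '%s\n' "$sam" | flag_summary)
echo "$summary"
```

Integer division followed by `% 2` is used instead of a bitwise AND so that the sketch does not depend on awk extensions. A flag of 272 is 256 + 16, i.e. a secondary alignment on the reverse strand, so it is counted as secondary.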
BAM/SAM files can be converted to the other format using ‘samtools’:
samtools view -S -b file.sam > file.bam
samtools view -h file.bam > file.sam
Some sequencing facilities will automatically map your reads to a standard genome and deliver either
BAM or CRAM formatted files. Generally they will not have included ERCC sequences in the genome,
thus no ERCC reads will be mapped in the BAM/CRAM file. To quantify ERCCs (or any other genetic
alterations), or if you just want to use a different alignment algorithm than whatever is in the generic pipeline
(often outdated), you will need to convert the BAM/CRAM files back to FastQs.
BAM files can be converted to FastQ using bedtools. To ensure a single copy for multi-mapping reads first
sort by read name and remove secondary alignments using samtools. Picard also contains a method for
converting BAM to FastQ files.
# sort reads by name
samtools sort -n original.bam -o sorted_by_name.bam
# remove secondary alignments
samtools view -b -F 256 sorted_by_name.bam -o primary_alignment_only.bam
# convert to fastq
bedtools bamtofastq -i primary_alignment_only.bam -fq read1.fq -fq2 read2.fq
3.3.3 CRAM
CRAM files are similar to BAM files, except that the header contains information about the reference genome
used in the mapping. This allows the bases in each read that are identical to the reference to
be further compressed. CRAM also supports some lossy data compression approaches to further optimize
storage compared to BAMs. CRAMs are mainly used by the Sanger/EBI sequencing facility.
CRAM and BAM files can be interchanged using the latest version of samtools (>=v1.0). However, this
conversion may require downloading the reference genome into cache, so we recommend setting a specific
cache location prior to doing this. Alternatively, you may pre-download the correct reference, identified either
from metadata in the header of the CRAM file or by talking to whoever generated the CRAM, and specify
that file using ‘-T’:
export REF_CACHE=/path_to/cache_directory_for_reference_genome
samtools view -b -h -T reference_genome.fasta file.cram -o file.bam
samtools view -C -h -T reference_genome.fasta file.bam -o file.cram
At times it may be useful to manually inspect files, for example to check the metadata in headers to confirm
that the files are from the correct sample. ‘less’ and ‘more’ can be used to inspect any text files from the
command line. By “pipe-ing” the output of samtools view into these commands using ‘|’ we can check each of
these file types without having to save multiple copies of each file.
less file.txt
more file.txt
# counts the number of lines in file.txt
wc -l file.txt
samtools view -h file.[cram/bam] | more
# counts the number of lines in the samtools output
samtools view -h file.[cram/bam] | wc -l
Exercises
You have been provided with a small cram file: EXAMPLE.cram
Task 1: How was this file aligned? What software was used? What was used as the genome? (Hint: check
the header)
Task 2: How many reads are unmapped/mapped? How many total reads are there? How many secondary
alignments are present? (Hint: use the FLAG)
Task 3: Convert the CRAM into two Fastq files. Did you get exactly one copy of each read? (name these
files “10cells_read1.fastq” “10cells_read2.fastq”)
If you get stuck, help information for each piece of software can be displayed by running the command
“naked”, i.e. without any arguments, e.g. ‘samtools view’, ‘bedtools’
Answer
To map your reads you will also need the reference genome and in many cases the genome annotation file (in
either GTF or GFF format). These can be downloaded for model organisms from any of the main genomics
databases: Ensembl, NCBI, or UCSC Genome Browser.
GTF files contain annotations of genes, transcripts, and exons. They must contain:
(1) seqname : chromosome/scaffold
(2) source : where this annotation came from
(3) feature : what kind of feature is this? (e.g. gene, transcript, exon)
(4) start : start position (bp)
(5) end : end position (bp)
(6) score : a number
(7) strand : + (forward) or - (reverse)
(8) frame : if CDS, indicates which base is the first base of the first codon (0 = first base, 1 = second base, etc.)
(9) attribute : semicolon-separated list of tag-value pairs of extra information (e.g. names/IDs, biotype)
Empty fields are marked with “.”
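To make the column layout concrete, here is a small sketch that builds a toy GTF and summarises it with awk: it counts rows per feature type (column 3) and sums exon lengths per gene using the `gene_id` tag in the attribute column. All coordinates, IDs and the source name `toy` are invented for the example.

```shell
#!/usr/bin/env bash
# Toy GTF (made-up coordinates and IDs, for illustration only).
gtf=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 toy gene       100 900 . + . 'gene_id "G1"; gene_name "Abc";' \
  chr1 toy transcript 100 900 . + . 'gene_id "G1"; transcript_id "T1";' \
  chr1 toy exon       100 300 . + . 'gene_id "G1"; transcript_id "T1";' \
  chr1 toy exon       600 900 . + . 'gene_id "G1"; transcript_id "T1";' \
  chr2 toy gene       50  80  . - . 'gene_id "G2"; gene_name "Xyz";')

# Count rows per feature type (column 3), skipping comment lines.
feature_counts=$(printf '%s\n' "$gtf" |
  awk -F '\t' '$0 !~ /^#/ { n[$3]++ } END { for (f in n) print f, n[f] }' | sort)

# Summed exon length per gene; GTF coordinates are 1-based and inclusive,
# so a feature spans end - start + 1 bases.
exon_length=$(printf '%s\n' "$gtf" |
  awk -F '\t' '$3 == "exon" {
      match($9, /gene_id "[^"]+"/)                 # locate the gene_id tag
      id = substr($9, RSTART + 9, RLENGTH - 10)    # text between the quotes
      len[id] += $5 - $4 + 1
    } END { for (g in len) print g, len[g] }' | sort)

echo "$feature_counts"
echo "$exon_length"
```

The two toy exons span 201 and 301 bases, giving 502 for gene G1. Overlapping exons from different transcripts would be double-counted by this simple sum, so it is only a starting point for real annotations.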
In our experience Ensembl is the easiest of these to use, and has the largest set of annotations. NCBI tends
to be more strict in including only high confidence gene annotations, whereas UCSC contains multiple
geneset annotations that use different criteria.
If your experimental system includes non-standard sequences, these must be added to both the genome fasta
and gtf to quantify their expression. Most commonly this is done for the ERCC spike-ins, although the same
must be done for CRISPR-related sequences or other overexpression/reporter constructs.
For maximum utility/flexibility we recommend creating complete and detailed entries for any non-standard
sequences added.
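Once FASTA and GTF entries for the extra sequences exist, adding them is a matter of concatenation. The sketch below uses tiny stand-in files in a scratch directory; every file name and sequence in it is a placeholder, and the final consistency check (each sequence name in the GTF also appears in the FASTA) is a suggested safeguard rather than a required step.

```shell
#!/usr/bin/env bash
# Work in a scratch directory with tiny stand-in files; all file names and
# contents here are placeholders for this sketch.
workdir=$(mktemp -d)
printf '>chr1\nACGTACGT\n'       > "$workdir/genome.fa"
printf '>ERCC-00002\nTTGGCCAA\n' > "$workdir/ERCC_Controls.fa"
printf 'chr1\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "G1";\n' > "$workdir/genome.gtf"
printf 'ERCC-00002\tERCC\texon\t1\t8\t.\t+\t.\tgene_id "ERCC-00002";\n' \
  > "$workdir/ERCC_Controls.gtf"

# Append spike-in sequences and annotations to copies of the originals,
# leaving the downloaded reference files untouched.
cat "$workdir/genome.fa"  "$workdir/ERCC_Controls.fa"  > "$workdir/genome_with_ercc.fa"
cat "$workdir/genome.gtf" "$workdir/ERCC_Controls.gtf" > "$workdir/genome_with_ercc.gtf"

# Sanity check: every sequence name used in the GTF must exist in the FASTA.
grep '^>' "$workdir/genome_with_ercc.fa" | sed 's/^>//; s/[[:space:]].*//' |
  sort -u > "$workdir/fasta_names.txt"
cut -f 1 "$workdir/genome_with_ercc.gtf" | grep -v '^#' |
  sort -u > "$workdir/gtf_names.txt"
missing=$(comm -13 "$workdir/fasta_names.txt" "$workdir/gtf_names.txt")
if [ -z "$missing" ]; then check="ok"; else check="missing: $missing"; fi

n_seqs=$(grep -c '^>' "$workdir/genome_with_ercc.fa")
echo "sequences=$n_seqs check=$check"
rm -rf "$workdir"
```

A mismatch between names (for example `1` in the GTF but `chr1` in the FASTA) is a common reason for features receiving no counts, which is why the check compares the first GTF column against the FASTA headers.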
There is no standardized way to do this. So below is our custom perl script for creating a gtf and fasta
file for ERCCs which can be appended to the genome. You may also need to alter a gtf file to deal with
repetitive elements in introns when/if you want to quantify intronic reads. Any scripting language or even
‘awk’ and/or some text editors can be used to do this relatively efficiently, but they are beyond the scope of
this course.
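# Converts the Annotation file from
# https://www.thermofisher.com/order/catalog/product/4456740 to
# gtf and fasta files that can be added to existing genome fasta & gtf files.
use strict;
use warnings;
my @FASTAlines = ();
my @GTFlines = ();
open (my $ifh, "<", "ERCC_Controls_Annotation.txt") or die $!;
<$ifh>; # skip header
while (<$ifh>) {
    chomp;
    my @record = split(/\t/);
    my $sequence = $record[4];
    $sequence =~ s/\s+//g; # get rid of any preceding/trailing white space
    $sequence = $sequence."NNNN";
    my $name = $record[0];
    my $genbank = $record[1];
    push(@FASTAlines, ">$name\n$sequence\n");
    # GTF is 1 indexed; spike-ins are annotated on the + strand.
    # Assumption: the attribute layout and the end coordinate (full FASTA
    # entry, including the NNNN padding) are one possible choice.
    my $end = length($sequence);
    my $attr = "gene_id \"$name-$genbank\"; transcript_id \"$name-$genbank\";";
    foreach my $feature ("gene", "transcript", "exon") {
        push(@GTFlines, join("\t", $name, "ERCC", $feature, 1, $end, ".", "+", ".", $attr)."\n");
    }
}
close($ifh);
# Write output
open(my $ofh, ">", "ERCC_Controls.fa") or die $!;
foreach my $line (@FASTAlines) {
    print $ofh $line;
}
close($ofh);
open($ofh, ">", "ERCC_Controls.gtf") or die $!;
foreach my $line (@GTFlines) {
    print $ofh $line;
}
close($ofh);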
3.4 Demultiplexing
Demultiplexing is done differently depending on the protocol used and on the particular pipeline you are
using. The most flexible demultiplexing pipeline we are aware of is zUMIs, which can be used to
demultiplex and map most UMI-based protocols. For Smartseq2 or other paired-end full transcript protocols
the data will usually already be demultiplexed. Public repositories such as GEO or ArrayExpress require
small-scale/plate-based scRNASeq data to be demultiplexed prior to upload, and many sequencing facilities
will automatically demultiplex data before returning it to you. If you aren’t using a published pipeline and
the data was not demultiplexed by the sequencing facility you will have to demultiplex it yourself. This
usually requires writing a custom script since barcodes may be of different lengths and at different locations in
the reads depending on the protocols used.
For all data types, “demultiplexing” involves identifying and removing the cell-barcode sequence from one or
both reads. If the expected cell-barcodes are known ahead of time, i.e. the data is from a PCR-plate-based
protocol, all that is necessary is to compare each cell-barcode to the expected barcodes and assign the
associated reads to the closest cell-barcode (with maximum mismatches of 1 or 2 depending on the design of
the cell-barcodes). These data are generally demultiplexed prior to mapping as an easy way of parallelizing
the mapping step.
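The "closest expected barcode" rule described above can be sketched as a Hamming-distance comparison. In the example below the barcode list is invented, the function name `assign_barcode` is hypothetical, and all barcodes are assumed to have the same length; it is meant to show the logic, not to replace a tested demultiplexing tool.

```shell
#!/usr/bin/env bash
# Expected cell barcodes (made up for this sketch; a real plate design
# would supply its own list).
expected="AAAACCCC GGGGTTTT ACGTACGT"

# assign_barcode: print the expected barcode closest to the observed one,
# "NA" if none is within max_mm mismatches, or "ambiguous" if several tie.
# Usage: assign_barcode OBSERVED MAX_MISMATCHES
assign_barcode() {
  observed=$1; max_mm=$2
  printf '%s\n' $expected | awk -v obs="$observed" -v max="$max_mm" '
    {
      d = 0                                   # Hamming distance to obs
      for (i = 1; i <= length(obs); i++)
        if (substr(obs, i, 1) != substr($0, i, 1)) d++
      if (best == "" || d < bestd) { best = $0; bestd = d; ties = 1 }
      else if (d == bestd) ties++
    }
    END {
      if (bestd > max)   print "NA"
      else if (ties > 1) print "ambiguous"
      else               print best
    }'
}

exact=$(assign_barcode AAAACCCC 1)      # exact match
one_off=$(assign_barcode GGGGTTTA 1)    # one mismatch from GGGGTTTT
far=$(assign_barcode TTTTTTTT 1)        # too far from every barcode
echo "$exact $one_off $far"
```

Reporting ties as "ambiguous" rather than picking one barcode mirrors the "Ambiguous" category in the script output shown in this section; whether 1 or 2 mismatches are safe depends on how far apart the designed barcodes are from each other.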
We have publicly available perl scripts capable of demultiplexing any scRNASeq data with a single
cell-barcode, with or without UMIs, for plate-based protocols. These can be used as follows:
perl 1_Flexible_UMI_Demultiplexing.pl 10cells_read1.fq 10cells_read2.fq "C12U8" 10cells_barcodes.txt 2 E
##
## Doesn't match any cell: 0
## Ambiguous: 0
## Exact Matches: 400
## Contain mismatches: 0
## Input Reads: 400
## Output Reads: 400
## Barcode Structure: 12 bp CellID followed by 8 bp UMI
perl 1_Flexible_FullTranscript_Demultiplexing.pl 10cells_read1.fq 10cells_read2.fq "start" 12 10cells_ba
##
## Doesn't match any cell: 0
## Ambiguous: 0
For UMI-containing data, demultiplexing includes attaching the UMI code to the read name of the gene-body
containing read. If the data are from a droplet-based protocol or SeqWell where the number of expected
barcodes is much higher than the expected number of cells, then usually the cell-barcode will also be attached
to the read name to avoid generating a very large number of files. In these cases, demultiplexing will happen
during the quantification step to facilitate the identification of cell-barcodes which correspond to intact cells
rather than background noise.
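Attaching the barcode and UMI to the read name can be illustrated with `paste` and awk. The sketch assumes a layout like the "12 bp CellID followed by 8 bp UMI" structure shown above, with the barcode in read 1 and the transcript in read 2; that assignment of reads, the `_CELL_UMI` naming suffix and the function name `tag_reads` are assumptions of this example and differ between protocols and tools.

```shell
#!/usr/bin/env bash
# Toy paired reads (made up): read 1 holds 12 bp cell barcode + 8 bp UMI,
# read 2 holds the transcript sequence. Which read carries the barcode
# depends on the protocol; this layout is an assumption of the sketch.
r1=$(mktemp); r2=$(mktemp)
printf '%s\n' '@read1' 'AAAACCCCGGGGTTTTACGT' '+' 'IIIIIIIIIIIIIIIIIIII' > "$r1"
printf '%s\n' '@read1' 'GATTACAGATTACA'       '+' 'IIIIIIIIIIIIII'       > "$r2"

# tag_reads: copy read 2, appending _<cell barcode>_<UMI> to each read name.
# paste joins the two files line by line; this assumes both files list the
# reads in the same order and contain no tab characters.
tag_reads() {
  paste "$1" "$2" | awk -F '\t' -v cell_len=12 -v umi_len=8 '
    NR % 4 == 1 { name = $2 }                         # read name from read 2
    NR % 4 == 2 {                                     # sequence lines
      cell = substr($1, 1, cell_len)
      umi  = substr($1, cell_len + 1, umi_len)
      print name "_" cell "_" umi
      print $2
    }
    NR % 4 == 3 { print "+" }
    NR % 4 == 0 { print $2 }                          # qualities of read 2
  '
}

tagged=$(tag_reads "$r1" "$r2")
echo "$tagged"
rm -f "$r1" "$r2"
```

After this step the transcript read can be mapped as single-end data while the cell barcode and UMI travel with it in the read name, which is what allows demultiplexing to be postponed until quantification.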
For droplet-based methods only a fraction of droplets contain both beads and an intact cell. However, biology
experiments are messy and some RNA will leak out of dead/damaged cells. So droplets without an intact
cell are likely to capture a small amount of the ambient RNA, which will end up in the sequencing library
and contribute reads to the final sequencing output. The variation in droplet size, amplification efficiency,
and sequencing will lead both “background” and real cells to have a wide range of library sizes. Various
approaches have been used to try to distinguish those cell barcodes which correspond to real cells.
Most methods use the total molecules (could be applied to total reads) per barcode and try to find a “break
point” between bigger libraries which are cells + some background and smaller libraries assumed to be purely
background. Let’s load some example simulated data which contain both large and small cells:
umi_per_barcode <- read.table("droplet_id_example_per_barcode.txt.gz")
truth <- read.delim("droplet_id_example_truth.gz", sep=",")
Exercise How many unique barcodes were detected? How many true cells are present in the data? To
simplify calculations for this section exclude all barcodes with fewer than 10 total molecules.
Answer
One approach is to look for the inflection point where the total molecules per barcode suddenly drops:
barcode_rank <- rank(-umi_per_barcode[,2])
plot(barcode_rank, umi_per_barcode[,2], xlim=c(1,8000))
[Figure: umi_per_barcode[,2] plotted against barcode_rank]
Here we can see a roughly exponential curve of library sizes, so to make things simpler let’s log-transform
them.
log_lib_size <- log10(umi_per_barcode[,2])
plot(barcode_rank, log_lib_size, xlim=c(1,8000))
[Figure: log_lib_size plotted against barcode_rank]
That’s better; the “knee” in the distribution is much more pronounced. We could manually estimate where the
“knee” is, but it is much more reproducible to identify this point algorithmically.
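# inflection point
o <- order(barcode_rank)
log_lib_size <- log_lib_size[o]
barcode_rank <- barcode_rank[o]
# Assumption: take the inflection point as the steepest drop in log library
# size, ignoring the 100 largest barcodes.
rawdiff <- diff(log_lib_size) / diff(barcode_rank)
inflection <- which.min(rawdiff[100:length(rawdiff)]) + 99
plot(barcode_rank, log_lib_size, xlim=c(1,8000))
abline(v=inflection, col="red", lwd=2)
[Figure: log_lib_size plotted against barcode_rank, with the inflection point marked]
threshold <- 10^log_lib_size[inflection]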
Another is to fit a mixture model and find where the higher and lower distributions intersect. However, data
may not fit the assumed distributions very well:
set.seed(-92497)
# mixture model
require("mixtools")
# fit a two-component normal mixture to the log library sizes
mix <- normalmixEM(log_lib_size)
## number of iterations= 43
plot(mix, which=2, xlab2="log(mol per cell)")
[Figure: Density Curves, Density plotted against log(mol per cell)]
Exercise Identify cells using this split point and calculate the TPR and Recall.
Answer
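The two metrics asked for in the exercises of this section can be computed along the following lines. This is a minimal sketch on made-up barcodes, not the course answer: it assumes that “TPR” means the fraction of called cells that are true cells (more commonly called precision) and that “Recall” means the fraction of true cells that were called. The vectors called and true_cells are hypothetical.

```r
# Hypothetical barcodes: 'called' passed the threshold, 'true_cells' is the ground truth
called <- c("AAAC", "AAAG", "AACT", "AAGT")
true_cells <- c("AAAC", "AAAG", "AACT", "ACGT", "AGGT")

n_correct <- sum(called %in% true_cells)
# fraction of called cells that are real (assumed meaning of "TPR" here)
TPR <- n_correct / length(called)
# fraction of real cells that were called
Recall <- n_correct / length(true_cells)
c(TPR = TPR, Recall = Recall)
```

With the real data, called would be the barcodes whose total molecules exceed the chosen threshold and true_cells would come from the truth table.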
A third, used by CellRanger, assumes a ~10-fold range of library sizes for real cells and estimates this range
using the expected number of cells.
n_cells <- length(truth[,1])
# CellRanger
totals <- umi_per_barcode[,2]
totals <- sort(totals, decreasing = TRUE)
# 99th percentile of top n_cells divided by 10
thresh = totals[round(0.01*n_cells)]/10
plot(totals, xlim=c(1,8000))
abline(h=thresh, col="red", lwd=2)
Exercise Identify cells using this threshold and calculate the TPR and Recall.
Answer
Below we have provided code showing how this method (emptyDrops, from the DropletUtils package) is currently
run. (We will update this page when the method is officially released.)
require("Matrix")
raw.counts <- readRDS("droplet_id_example.rds")
require("DropletUtils")
# emptyDrops
set.seed(100)
e.out <- emptyDrops(raw.counts)
is.cell <- e.out$FDR <= 0.01
sum(is.cell, na.rm=TRUE)
plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
xlab="Total UMI count", ylab="-Log Probability")
Figure 3.1: Figure 1: Diagram of how STAR performs alignments, taken from Dobin et al.
Now we have trimmed our reads and established that they are of good quality, we would like to map them
to a reference genome. This process is known as alignment. Some form of alignment is generally required if
we want to quantify gene expression or find genes which are differentially expressed between samples.
Many tools have been developed for read alignment, but today we will focus on two. The first tool we will
consider is STAR (Dobin et al.). For each read in our reads data, STAR tries to find the longest possible sequence
which matches one or more sequences in the reference genome. For example, in the figure below, we have a
read (blue) which spans two exons and an alternative splicing junction (purple). STAR finds that the first
part of the read is the same as the sequence of the first exon, whilst the second part of the read matches the
sequence in the second exon. Because STAR is able to recognise splicing events in this way, it is described
as a ‘splice aware’ aligner.
Usually STAR aligns reads to a reference genome, potentially allowing it to detect novel splicing events or
chromosomal rearrangements. However, one issue with STAR is that it needs a lot of RAM, especially if your
reference genome is large (e.g. mouse and human). To speed up our analysis today, we will use STAR to align
reads to a reference transcriptome of 2000 transcripts. Note that this is NOT normal or recommended
practice; we only do it here for reasons of time. We recommend that normally you should align to a reference
genome.
Two steps are required to perform STAR alignment. In the first step, the user provides STAR with reference
genome sequences (FASTA) and annotations (GTF), which STAR uses to create a genome index. In the
second step, STAR maps the user’s reads data to the genome index.
Let’s create the index now. Remember, for reasons of time we are aligning to a transcriptome rather
than a genome today, meaning we only need to provide STAR with the sequences of the transcripts we
will be aligning reads to. You can obtain transcriptomes for many model organisms from Ensembl
(https://www.ensembl.org/info/data/ftp/index.html).
Task 2: What does each of the options we used do? Hint: Use the STAR manual to help you
(https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf).
Task 3: How would the command we used in Task 1 be different if we were aligning to the genome rather
than the transcriptome?
Now that we have created the index, we can perform the mapping step.
Task 4: Try to work out what command you should use to map our trimmed reads (from ERR522959) to the
index you created. Use the STAR manual to help you. Once you think you know the answer, check whether
it matches the solution in the next section and execute the alignment.
Task 5: Try to understand the output of your alignment. Talk to one of the instructors if you need help!
You can use the following commands to perform the mapping step:
mkdir results
mkdir results/STAR
A k-mer is a sequence of length k derived from a read. For example, imagine we have a read with the
sequence ATCCCGGGTTAT and we want to make 7-mers from it. To do this, we would find the first 7-mer
by counting the first seven bases of the read. We would find the second 7-mer by moving one base along,
then counting the next seven bases. Figure 2 shows all the 7-mers that could be derived from our read:
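The sliding-window procedure described above can be sketched in a few lines of base R. The variable names are arbitrary; only the read sequence and the value of k come from the text.

```r
read <- "ATCCCGGGTTAT"
k <- 7
# every start position from which a full k-mer still fits inside the read
starts <- seq_len(nchar(read) - k + 1)
# extract the k bases beginning at each start position
kmers <- substring(read, starts, starts + k - 1)
kmers
```

A read of length L yields L - k + 1 k-mers, so this 12-base read gives six 7-mers.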
Kallisto has a specially designed mode for pseudo-aligning reads from single-cell RNA-seq experiments. Unlike
STAR, Kallisto pseudo-aligns to a reference transcriptome rather than a reference genome. This means
Kallisto maps reads to splice isoforms rather than genes. Mapping reads to isoforms rather than genes is
especially challenging for single-cell RNA-seq for the following reasons:
• Single-cell RNA-seq is lower coverage than bulk RNA-seq, meaning the total amount of information
available from reads is reduced.
• Many single-cell RNA-seq protocols have 3’ coverage bias, meaning if two isoforms differ only at their
5’ end, it might not be possible to work out which isoform the read came from.
• Some single-cell RNA-seq protocols have short read lengths, which can also mean it is not possible to
work out which isoform the read came from.
Kallisto’s pseudo mode takes a slightly different approach to pseudo-alignment. Instead of aligning to
isoforms, Kallisto aligns to equivalence classes. Essentially, this means if a read maps to multiple isoforms,
Kallisto records the read as mapping to an equivalence class containing all the isoforms it maps to. Equivalence
class counts can then be used in downstream analysis such as clustering, in place of gene or isoform
expression estimates. Figure 3 shows a diagram which helps explain this.
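The idea of an equivalence class can be illustrated with a toy example in R. The reads, isoform names and compatibility sets below are invented for illustration; this is a sketch of the concept, not of how Kallisto computes the classes internally.

```r
# Hypothetical reads and the isoforms each one is compatible with
compatible <- list(
  read1 = c("isoA", "isoB"),
  read2 = c("isoB", "isoA"),
  read3 = c("isoB"),
  read4 = c("isoA", "isoB", "isoC"),
  read5 = c("isoB")
)
# An equivalence class is the set of isoforms a read is compatible with,
# so reads with the same set (in any order) share a label
ec_label <- sapply(compatible, function(iso) paste(sort(iso), collapse = ","))
# Count the reads falling into each equivalence class
ec_counts <- table(ec_label)
ec_counts
```

Here read1 and read2 cannot be assigned to a single isoform, but both are counted once in the class containing isoA and isoB.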
Today we will just perform pseudo-alignment with one cell, but Kallisto is also capable of pseudo-aligning
multiple cells simultaneously and using information from UMIs. See https://pachterlab.github.io/kallisto/manual
for details.
As for STAR, you will need to produce an index for Kallisto before the pseudo-alignment step.
Task 6: Use the command below to produce the Kallisto index. Use the Kallisto manual
(https://pachterlab.github.io/kallisto/manual) to work out what the options do in this command.
mkdir indices/Kallisto
kallisto index -i indices/Kallisto/transcripts.idx Share/2000_reference.transcripts.fa
Task 7: Use the Kallisto manual to work out what command to use to perform pseudo-alignment. Once
you think you know the answer, check whether it matches the solution in the next section and execute the
pseudo-alignment.
Figure 3.3: Figure 3: A diagram explaining Kallisto’s Equivalence Classes, taken from
https://pachterlab.github.io/kallisto/singlecell.html.
The command above should produce 4 files - matrix.cells, matrix.ec, matrix.tsv and run_info.json.
• matrix.cells contains a list of cell IDs. As we only used one cell, this file should just contain
“ERR522959”
• matrix.ec contains information about the equivalence classes used. The first number in each row is the
equivalence class ID. The second number(s) correspond to the transcript ID(s) in that equivalence class.
For example “10 1,2,3” would mean that equivalence class 10 contains transcript IDs 1, 2 and 3. The ID
numbers correspond to the order in which the transcripts appear in 2000_reference.transcripts.fa. Zero indexing
is used, meaning transcript IDs 1, 2 and 3 correspond to the second, third and fourth transcripts in
2000_reference.transcripts.fa.
• matrix.tsv contains information about how many reads in each cell map to each equivalence class. The
first number is the equivalence class ID, as defined in matrix.ec. The second number is the cell ID,
where the cell ID corresponds to the order that the cell came in the matrix.cells file. The third number
is the number of reads which fall into that equivalence class. For example, “5 1 3” means that 3 reads
from cell 1 map to equivalence class 5. Note that zero indexing is used, so cell 1 corresponds to the
second line of matrix.cells.
• run_info.json contains information about how Kallisto was executed and can be ignored.
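As a rough illustration of the zero indexing described above, the sketch below parses toy lines laid out like matrix.ec and matrix.tsv. The file contents are invented and the columns are assumed to be tab-separated; only the meaning of each column is taken from the text.

```r
# Toy contents in the layout described above (not real Kallisto output)
ec_lines <- c("0\t0", "1\t1", "2\t0,1", "3\t1,2,3")
tsv_lines <- c("0\t0\t12", "2\t0\t3", "3\t0\t7")
cell_ids <- c("ERR522959")   # stands in for the lines of matrix.cells

ec <- read.table(text = ec_lines, sep = "\t",
                 col.names = c("ec_id", "transcripts"),
                 colClasses = c("integer", "character"))
counts <- read.table(text = tsv_lines, sep = "\t",
                     col.names = c("ec_id", "cell", "n_reads"))

# IDs in the files are zero-indexed but R is one-indexed, so add 1 before indexing
counts$cell_name <- cell_ids[counts$cell + 1]
# transcripts in equivalence class 3, as one-based positions in the FASTA file
ec3 <- as.integer(strsplit(ec$transcripts[ec$ec_id == 3], ",")[[1]]) + 1
ec3
```

Equivalence class 3 contains transcript IDs 1, 2 and 3, which correspond to the second, third and fourth transcripts in the reference file.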
Chapter 4
Construction of expression matrix
Many analyses of scRNA-seq data take as their starting point an expression matrix. By convention,
each row of the expression matrix represents a gene and each column represents a cell (although some authors
use the transpose). Each entry represents the expression level of a particular gene in a given cell. The units
in which the expression is measured depend on the protocol and the normalization strategy used.
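A toy expression matrix in this layout might look as follows. The counts are simulated and the gene and cell names are made up; the sketch only illustrates the genes-by-cells convention and its transpose.

```r
set.seed(1)
# toy counts: 4 genes (rows) x 3 cells (columns)
expr <- matrix(rpois(12, lambda = 5), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))
expr["gene2", "cell3"]   # expression of one gene in one cell
colSums(expr)            # total counts per cell
t(expr)                  # the transposed (cells x genes) layout used by some authors
```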
4.1 Reads QC
The output from a scRNA-seq experiment is a large collection of cDNA reads. The first step is to ensure
that the reads are of high quality. The quality control can be performed by using standard tools, such as
FastQC or Kraken.
Assuming that our reads are in experiment.bam, we run FastQC as
$<path_to_fastQC>/fastqc experiment.bam
Below is an example of the output from FastQC for a dataset of 125 bp reads. The plot reveals a technical
error which resulted in a couple of bases failing to be read correctly in the centre of the read. However,
since the rest of the read was of high quality this error will most likely have a negligible effect on mapping
efficiency.
Additionally, it is often helpful to visualize the data using the Integrative Genomics Viewer (IGV) or
SeqMonk.
After trimming low quality bases from the reads, the remaining sequences can be mapped to a reference
genome. Again, there is no need for a special purpose method for this, so we can use the STAR or the
TopHat aligner. For large full-transcript datasets from well annotated organisms (e.g. mouse, human)
pseudo-alignment methods (e.g. Kallisto, Salmon) may out-perform conventional alignment. For drop-seq based
datasets with tens or hundreds of thousands of reads, pseudo-aligners become more appealing since their
run-time can be several orders of magnitude less than that of traditional aligners.
An example of how to map paired-end reads (reads1.fq.gz and reads2.fq.gz) using STAR is
$<path_to_STAR>/STAR --runThreadN 1 --runMode alignReads \
--readFilesIn reads1.fq.gz reads2.fq.gz --readFilesCommand zcat --genomeDir <path> \
--parametersFiles FileOfMoreParameters.txt --outFileNamePrefix <outpath>/output
Note, if spike-ins are used, the reference sequence should be augmented with the DNA sequence of the
spike-in molecules prior to mapping.
Note, when UMIs are used, their barcodes should be removed from the read sequence. A common practice
is to add the barcode to the read name.
Once the reads for each cell have been mapped to the reference genome, we need to make sure that a sufficient
number of reads from each cell could be mapped to the reference genome. In our experience, the fraction of
mappable reads for mouse or human cells is 60-70%. However, this result may vary depending on protocol,
read length and settings for the read alignment. As a general rule, we expect all cells to have a similar
fraction of mapped reads, so any outliers should be inspected and possibly removed. A low proportion of
mappable reads usually indicates contamination.
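One way to look for such outliers is to compare each cell’s mapped fraction with the rest of the dataset. The sketch below uses invented read totals and flags cells more than 3 median absolute deviations below the median; both the numbers and the cutoff are assumptions made for illustration, not a recommendation from this course.

```r
# hypothetical per-cell read totals
total_reads  <- c(cellA = 1.0e6, cellB = 1.2e6, cellC = 0.9e6, cellD = 1.1e6, cellE = 1.0e6)
mapped_reads <- c(cellA = 6.6e5, cellB = 7.9e5, cellC = 6.1e5, cellD = 2.2e5, cellE = 6.8e5)
frac_mapped <- mapped_reads / total_reads
# flag cells whose mapped fraction is far below that of the other cells
# (3 MADs below the median is an arbitrary choice of cutoff)
low <- frac_mapped < median(frac_mapped) - 3 * mad(frac_mapped)
names(frac_mapped)[low]
```

Flagged cells should be inspected rather than removed automatically, since the expected fraction depends on the protocol and alignment settings.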
Note: Salmon produces estimated read counts and estimated transcripts per million (TPM). In our experience
the latter over-corrects the expression of long genes for scRNASeq, so we recommend using read counts.
The histogram below shows the total number of reads mapped to each cell for an scRNA-seq experiment.
Each bar represents one cell, and they have been sorted in ascending order by the total number of reads
per cell. The three red arrows indicate cells that are outliers in terms of their coverage and they should
be removed from further analysis. The two yellow arrows point to cells with a surprisingly large number
of unmapped reads. In this example we kept the cells during the alignment QC step, but they were later
removed during cell QC due to a high proportion of ribosomal RNA reads.
Figure 4.2: Example of the total number of reads mapped to each cell.
4.4 Mapping QC
After mapping the raw sequencing reads to the genome we need to evaluate the quality of the mapping. There
are many ways to measure the mapping quality, including: the number of reads mapping to rRNA/tRNAs, the
proportion of uniquely mapping reads, reads mapping across splice junctions, and read depth along the transcripts.
Methods developed for bulk RNA-seq, such as RSeQC, are applicable to single-cell data.
However the expected results will depend on the experimental protocol, e.g. many scRNA-seq methods use
poly-A selection to avoid sequencing rRNAs which results in a 3’ bias in the read coverage across the genes
(aka gene body coverage). The figure below shows this 3’ bias as well as three cells which were outliers and
removed from the dataset:
The next step is to quantify the expression level of each gene for each cell. For mRNA data, we can use one
of the tools which has been developed for bulk RNA-seq data, e.g. HTSeq or featureCounts:
# include multimapping
<featureCounts_path>/featureCounts -O -M -Q 30 -p -a genome.gtf -o outputfile input.bam
# exclude multimapping
<featureCounts_path>/featureCounts -Q 30 -p -a genome.gtf -o outputfile input.bam
Unique molecular identifiers (UMIs) make it possible to count the absolute number of molecules and they
have proven popular for scRNA-seq. We will discuss how UMIs can be processed in the next chapter.
Thanks to Andreas Buness from EMBL Monterotondo for collaboration on this section.
4.6.1 Introduction
Unique Molecular Identifiers are short (4-10bp) random barcodes added to transcripts during reverse-
transcription. They enable sequencing reads to be assigned to individual transcript molecules and thus the
removal of amplification noise and biases from scRNASeq data.
When sequencing UMI containing data, techniques are used to specifically sequence only the end of the
transcript containing the UMI (usually the 3’ end).
Figure 4.5: UMI sequencing reads, red lightning bolts represent different fragmentation locations
Since the number of unique barcodes ($4^N$, where $N$ is the length of the UMI) is much smaller than the total
number of molecules per cell ($\sim 10^6$), each barcode will typically be assigned to multiple transcripts. Hence,
to identify unique molecules both barcode and mapping location (transcript) must be used. The first step
is to map UMI reads, for which we recommend using STAR since it is fast and outputs good quality
BAM-alignments. Moreover, mapping locations can be useful for e.g. identifying poorly-annotated 3’ UTRs of
transcripts.
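The size of the UMI barcode space can be checked with a short calculation. The sketch below uses the UMI lengths quoted in this section (4-10 bp) and takes one million molecules per cell as a round figure for comparison.

```r
umi_length <- 4:10
n_barcodes <- 4^umi_length     # number of distinct UMI sequences of each length
molecules_per_cell <- 1e6      # order of magnitude quoted above
data.frame(umi_length, n_barcodes,
           fewer_than_molecules = n_barcodes < molecules_per_cell)
```

Only at the longest length considered here (10 bp) does the number of possible barcodes reach about a million, so for shorter UMIs the same barcode must be shared by many molecules in a cell.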
UMI-sequencing typically consists of paired-end reads where one read from each pair captures the cell and
UMI barcodes while the other read consists of exonic sequence from the transcript (Figure 4.5). Note that
trimming and/or filtering to remove reads containing poly-A sequence is recommended to avoid errors due to
these reads mapping to genes/transcripts with internal poly-A/poly-T sequences.
After processing the reads from a UMI experiment, the following conventions are often used:
1. The UMI is added to the read name of the other paired read.
• For extremely large, shallow datasets, the cell barcode may be added to the read name as well to
reduce the number of files.
In theory, every unique UMI-transcript pair should represent all reads originating from a single RNA molecule.
However, in practice this is frequently not the case and the most common reasons are:
1. Different UMIs do not necessarily mean different molecules: PCR or sequencing errors can alter the UMI
sequence.
2. Different transcripts do not necessarily mean different molecules: mapping errors or multi-mapping reads
can assign reads from one molecule to several transcripts.
3. The same UMI does not necessarily mean the same molecule: because the number of barcodes is limited,
different molecules of the same transcript can receive the same UMI.
How to best account for errors in UMIs remains an active area of research. The best approaches that we are
aware of for resolving the issues mentioned above are:
1. UMI-tools’ directional-adjacency method implements a procedure which considers both the number of
mismatches and the relative frequency of similar UMIs to identify likely PCR/sequencing errors.
2. Currently an open question. The problem may be mitigated by removing UMIs with few reads to
support their association with a particular transcript, or by removing all multi-mapping reads.
3. Simple saturation (aka “collision probability”) correction proposed by Grün, Kester and van Oudenaarden
(2014) to estimate the true number of molecules $M$:
$$M \approx -N \log\left(1 - \frac{n}{N}\right)$$
where $N$ = total number of unique UMI barcodes and $n$ = number of observed barcodes.
An important caveat of this method is that it assumes that all UMIs are equally frequent. In most cases this
is incorrect, since there is often a bias related to the GC content.
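The saturation correction above is simple to apply in R. The function name and the example counts below are hypothetical; the sketch takes N to be 4 raised to the UMI length and, like the formula itself, assumes all UMIs are equally frequent.

```r
# M = -N * log(1 - n/N); N = number of possible UMIs, n = number of observed UMIs
correct_umi_collisions <- function(n_observed, umi_length) {
  N <- 4^umi_length
  # the estimate is undefined once every possible barcode has been observed
  stopifnot(all(n_observed < N))
  -N * log(1 - n_observed / N)
}
# hypothetical observed UMI counts for one transcript with a 5 bp UMI (N = 1024)
estimated <- correct_umi_collisions(c(10, 100, 500), umi_length = 5)
estimated
```

The correction is negligible when few barcodes are observed and grows as the observed count approaches the number of possible barcodes.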
Determining how to best process and use UMIs is currently an active area of research in the bioinformatics
community. We are aware of several methods that have recently been developed, including:
• UMI-tools
• PoissonUMIs
• zUMIs
• dropEst
Current UMI platforms (DropSeq, InDrop, ICell8) exhibit low and highly variable capture efficiency as shown
in the figure below.
This variability can introduce strong biases and it needs to be considered in downstream analysis. Recent
analyses often pool cells/genes together based on cell-type or biological pathway to increase the power.
Robust statistical analysis of these data is still an open research question and it remains to be determined
how best to adjust for biases.
Exercise 1 We have provided you with UMI counts and read counts from induced pluripotent stem cells
generated from three different individuals (Tung et al., 2017) (see: Chapter 7.1 for details of this dataset).
umi_counts <- read.table("tung/molecules.txt", sep = "\t")
read_counts <- read.table("tung/reads.txt", sep = "\t")
Introduction to R/Bioconductor
5.1.1 CRAN
The Comprehensive R Archive Network (CRAN) is the biggest archive of R packages. There are few requirements
for uploading packages besides building and installing successfully, hence documentation and support
are often minimal and figuring out how to use these packages can be a challenge in itself. CRAN is the default
repository R will search to find packages to install:
install.packages("devtools")
require("devtools")
5.1.2 Github
Github isn’t specific to R; any code of any type in any state can be uploaded. There is no guarantee a package
uploaded to Github will even install, never mind do what it claims to do. R packages can be downloaded and
installed directly from Github using the “devtools” package installed above.
devtools::install_github("tallulandrews/M3Drop")
Github is built on the git version control system, which stores multiple versions of any package. By default the
most recent “master” version of the package is installed. If you want an older version or the development branch
this can be specified using the “ref” parameter:
# different branch
devtools::install_github("tallulandrews/M3D", ref="nbumi")
# previous commit
devtools::install_github("tallulandrews/M3Drop", ref="434d2da28254acc8de4940c1dc3907ac72973135")
Note: make sure you re-install the M3Drop master branch for later in the course.
5.1.3 Bioconductor
Bioconductor is a repository of R-packages specifically for biological analyses. It has the strictest require-
ments for submission, including installation on every platform and full documentation with a tutorial (called
a vignette) explaining how the package should be used. Bioconductor also encourages utilization of standard
data structures/classes and coding style/naming conventions, so that, in theory, packages and analyses can
be combined into large pipelines or workflows.
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
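Note: in some situations it is necessary to substitute “http://” for “https://” in the above depending on the
security features of your internet connection/network. From Bioconductor 3.8 onwards, biocLite has been
replaced by the BiocManager package (BiocManager::install("edgeR")).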
Bioconductor also requires creators to support their packages and has a regular 6-month release schedule.
Make sure you are using the most recent release of Bioconductor before trying to install packages for the
course.
source("https://bioconductor.org/biocLite.R")
biocLite("BiocUpgrade")
5.1.4 Source
The final way to install packages is directly from source. In this case you have to download a fully built
source code file, usually packagename.tar.gz, or clone the github repository and rebuild the package yourself.
Generally this will only be done if you want to edit a package yourself, or if for some reason the former
methods have failed.
install.packages("M3Drop_3.05.00.tar.gz", type="source")
All the packages necessary for this course are available here. Starting from “RUN Rscript -e
"install.packages('devtools')"”, run each of the commands (minus “RUN”) on the command line, or start an R
session and run each of the commands within the quotation marks. Note that the ordering of the installation is
important in some cases, so make sure you run them in order from top to bottom.
5.3 Data-types/classes
R is a high level language so the underlying data-type is generally not important. The exception is if you are
accessing R data directly using another language such as C, but that is beyond the scope of this course.
Instead we will consider the basic data classes: numeric, integer, logical, and character, and the higher level
data class called “factor”. You can check what class your data is using the “class()” function.
Aside: R can also store data as “complex” for complex numbers but generally this isn’t relevant for biological
analyses.
5.3.1 Numeric
The “numeric” class is the default class for storing any numeric data - integers, decimal numbers, numbers
in scientific notation, etc…
x = 1.141
class(x)
## [1] "numeric"
y = 42
class(y)
## [1] "numeric"
z = 6.02e23
class(z)
## [1] "numeric"
Here we see that even though R has an “integer” class and 42 could be stored more efficiently as an integer
the default is to store it as “numeric”. If we want 42 to be stored as an integer we must “coerce” it to that
class:
y = as.integer(42)
class(y)
## [1] "integer"
Coercion will force R to store data as a particular class. If our data is incompatible with that class R will
still do it, but the data will be converted to NAs and a warning will be given:
as.numeric("H")
## [1] NA
5.3.2 Character/String
The “character” class stores all kinds of text data. Programming convention calls data containing multiple
letters a “string”, thus most R functions which act on character data will refer to the data as “strings” and
will often have “str” or “string” in their names. Strings are identified by being flanked by double quotation
marks, whereas variable/function names are not:
x = 5
a = "x" # character "x"
a
## [1] "x"
b = x # variable x
b
## [1] 5
In addition to standard alphanumeric characters, strings can also store various special characters. Special
characters are identified using a backslash followed by a single character; the most relevant are the special
characters for tab (\t) and new line (\n). To demonstrate these special characters let’s concatenate (cat)
together two strings with these characters separating (sep) them:
cat("Hello", "World", sep= " ")
## Hello World
cat("Hello", "World", sep= "\t")
## Hello    World
cat("Hello", "World", sep= "\n")
## Hello
## World
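Note that special characters work differently in different functions. For instance the paste function joins
strings in the same way as cat, but it returns the result as a character string; when that string is printed,
special characters are shown in their escaped form rather than being interpreted.
paste("Hello", "World", sep= "\t")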
## [1] "Hello\tWorld"
paste("Hello", "World", sep= "\n")
## [1] "Hello\nWorld"
Single or double backslash is also used as an escape character to turn off special characters or allow quotation
marks to be included in strings:
cat("This \"string\" contains quotation marks.")
Special characters are generally only used in pattern matching, and reading/writing data to files. For instance
this is how you would read a tab-separated file into R.
dat = read.delim("file.tsv", sep="\t")
Another special type of character data are colours. Colours can be specified in three main ways: by name
from those available, by red, green, blue values using the rgb function, and by hue (colour), saturation (colour
vs white) and value (colour/white vs black) using the hsv function. By default rgb and hsv expect three
values in 0-1 with an optional fourth value for transparency. Alternatively, sets of predetermined colours
with useful properties can be loaded from many different packages with RColorBrewer being one of the most
popular.
reds = c("red", rgb(1,0,0), hsv(0, 1, 1))
reds
5.3.3 Logical
The logical class stores boolean truth values, i.e. TRUE and FALSE. It is used for storing the results
of logical operations, and the conditions of conditional statements will be coerced to this class. Most other
data-types can be coerced to boolean without triggering (or “throwing”) error messages, which may cause
unexpected behaviour.
x = TRUE
class(x)
## [1] "logical"
y = "T"
as.logical(y)
## [1] TRUE
z = 5
as.logical(z)
## [1] TRUE
x = FALSE
class(x)
## [1] "logical"
y = "F"
as.logical(y)
## [1] FALSE
z = 0
as.logical(z)
## [1] FALSE
Exercise 1 Experiment with other character and numeric values. Which are coerced to TRUE or FALSE?
Which are coerced to neither? Do you ever throw a warning/error message?
5.3.4 Factors
String/Character data is very memory inefficient to store; each letter generally requires the same amount of
memory as any integer. Thus when storing a vector of strings with repeated elements it is more efficient to assign
each element to an integer and store the vector as integers and an additional string-to-integer association
table. Thus, by default R (before version 4.0.0) will read in text columns of a data table as factors.
str_vector = c("Apple", "Apple", "Banana", "Banana", "Banana", "Carrot", "Carrot", "Apple", "Banana")
factored_vector = factor(str_vector)
factored_vector
## [1] Apple Apple Banana Banana Banana Carrot Carrot Apple Banana
## Levels: Apple Banana Carrot
as.numeric(factored_vector)
## [1] 1 1 2 2 2 3 3 1 2
The double nature of factors can cause some unintuitive behaviour. E.g. joining two factors together will
convert them to the numeric form and the original strings will be lost (from R 4.1.0 onwards, c() instead
returns a combined factor).
c(factored_vector, factored_vector)
## [1] 1 1 2 2 2 3 3 1 2 1 1 2 2 2 3 3 1 2
Likewise, if due to formatting issues numeric data is mistakenly interpreted as strings, then you must convert
the factor back to strings before coercing to numeric values:
x = c("20", "25", "23", "38", "20", "40", "25", "30")
x = factor(x)
as.numeric(x)
## [1] 1 3 2 5 1 6 3 4
as.numeric(as.character(x))
## [1] 20 25 23 38 20 40 25 30
To make R read text as character data instead of factors set the global option stringsAsFactors=FALSE.
In R versions before 4.0.0 this must be done at the start of each R session; from 4.0.0 onwards it is the default.
options(stringsAsFactors=FALSE)
Exercise How would you use factors to create a vector of colours for an arbitrarily long vector of fruits like
str_vector above? Answer
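One possible approach, sketched below, uses the integer codes of the factor to index a vector holding one colour per level. The colours and the name palette_by_level are arbitrary choices made for this illustration, and this is not necessarily the intended answer.

```r
str_vector <- c("Apple", "Apple", "Banana", "Banana", "Banana", "Carrot", "Carrot", "Apple", "Banana")
fruit <- factor(str_vector)
# one colour per level, in the same order as levels(fruit)
palette_by_level <- c("red", "yellow", "orange")
# the integer code of each element picks out the colour for its level
colour_vector <- palette_by_level[as.integer(fruit)]
colour_vector
```

Because the lookup works on levels rather than on individual elements, the same code handles a vector of any length as long as the palette has one entry per level.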
We recommend checking your data is of the correct class after reading from files:
x = 1.4
is.numeric(x)
## [1] TRUE
is.character(x)
## [1] FALSE
is.logical(x)
## [1] FALSE
is.factor(x)
## [1] FALSE
So far we have only looked at single values and vectors. Vectors are the simplest data structure in R. They
are a 1-dimensional array of data all of the same type. If the input when creating a vector is of different
types it will be coerced to the data-type that is most consistent with the data.
x = c("Hello", 5, TRUE)
x
## [1] "Hello" "5"     "TRUE"
class(x)
## [1] "character"
Here we tried to put character, numeric and logical data into a single vector so all the values were coerced
to character data.
A matrix is the two-dimensional version of a vector; it also requires all data to be of the same type. If we
combine a character vector and a numeric vector into a matrix, all the data will be coerced to characters:
x = c("A", "B", "C")
y = c(1, 2, 3)
class(x)
## [1] "character"
class(y)
## [1] "numeric"
m = cbind(x, y)
m
## x y
## [1,] "A" "1"
## [2,] "B" "2"
## [3,] "C" "3"
The quotation marks indicate that the numeric vector has been coerced to characters. Alternatively, to store
data with columns of different data-types we can use a dataframe.
z = data.frame(x, y)
z
## x y
## 1 A 1
## 2 B 2
## 3 C 3
class(z[,1])
## [1] "character"
class(z[,2])
## [1] "numeric"
If you have set stringsAsFactors=FALSE as above (or are using R 4.0.0 or later, where this is the default) you
will find the first column remains characters; otherwise it will be automatically converted to a factor.
options(stringsAsFactors=TRUE)
z = data.frame(x, y)
class(z[,1])
## [1] "factor"
Another difference between matrices and dataframes is the ability to select columns using the $ operator:
m$x # throws an error
z$x # ok
The final basic data structure is the list. Lists allow data of different types and different lengths to be
stored in a single object. Each element of a list can be any other R object : data of any type, any data
structure, even other lists or functions.
l = list(m, z)
ll = list(sublist=l, a_matrix=m, numeric_value=42, this_string="Hello World", even_a_function=cbind)
ll
## $sublist
## $sublist[[1]]
## x y
## [1,] "A" "1"
## [2,] "B" "2"
## [3,] "C" "3"
##
## $sublist[[2]]
## x y
## 1 A 1
## 2 B 2
## 3 C 3
##
##
## $a_matrix
## x y
## [1,] "A" "1"
## [2,] "B" "2"
## [3,] "C" "3"
##
## $numeric_value
## [1] 42
##
## $this_string
## [1] "Hello World"
##
## $even_a_function
## function (..., deparse.level = 1)
## .Internal(cbind(deparse.level, ...))
## <bytecode: 0x55e4ded2f378>
## <environment: namespace:base>
Lists are most commonly used when returning a large number of results from a function that do not fit into
any of the previous data structures.
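Elements of a list can be retrieved by name or by position. The short sketch below builds a small list of its own (the element names are made up) to show the common access patterns.

```r
ll <- list(a_matrix = cbind(x = c("A", "B"), y = c("1", "2")),
           numeric_value = 42,
           sublist = list(1:3, "Hello World"))
ll$numeric_value               # access by name with $
ll[["a_matrix"]][2, "x"]       # access by name with [[ ]], then index into the matrix
ll$sublist[[2]]                # element of a nested list, by position
class(ll["numeric_value"])     # single brackets return a (shorter) list, not the element
```

Double brackets or $ return the stored object itself, whereas single brackets return a list containing it.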
You can get more information about any R commands relevant to these datatypes by typing ?function
in an interactive session.
Tidy data is a concept largely defined by Hadley Wickham (Wickham, 2014). Tidy data has the following
three characteristics:
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
Here is an example of some tidy data:
## Students Subject Years Score
## 1 Mark Maths 1 5
## 2 Jane Biology 2 6
## 3 Mohammed Physics 3 4
## 4 Tom Maths 2 7
## 5 Celia Computing 3 9
Here is an example of some untidy data:
## Students Sport Category Counts
## 1 Matt Tennis Wins 0
## 2 Matt Tennis Losses 1
## 3 Ellie Rugby Wins 3
## 4 Ellie Rugby Losses 2
## 5 Tim Football Wins 1
## 6 Tim Football Losses 4
## 7 Louise Swimming Wins 2
## 8 Louise Swimming Losses 2
## 9 Kelly Running Wins 5
## 10 Kelly Running Losses 1
Task 1: In what ways is the untidy data not tidy? How could we make the untidy data tidy?
Tidy data is generally easier to work with than untidy data, especially if you are working with packages such
as ggplot. Fortunately, packages are available to make untidy data tidy. Today we will explore a few of the
functions available in the tidyr package which can be used to make untidy data tidy. If you are interested in
finding out more about tidying data, we recommend reading “R for Data Science”, by Garrett Grolemund
and Hadley Wickham. An electronic copy is available here: http://r4ds.had.co.nz/
The untidy data above is untidy because two variables (Wins and Losses) are stored in one column
(Category). This is a common way in which data can be untidy. To tidy this data, we need to make
Wins and Losses into columns, and store the values in Counts in these columns. Fortunately, there is a
function from the tidyverse packages to perform this operation. The function is called spread, and it takes
two arguments, key and value. You should pass the name of the column which contains multiple variables to
key, and pass the name of the column which contains values from multiple variables to value. For example:
library(tidyverse)
sports<-data.frame(Students=c("Matt", "Matt", "Ellie", "Ellie", "Tim", "Tim", "Louise", "Louise", "Kelly
sports
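The definition of sports above is cut off, and the call to spread itself is not shown. As a minimal sketch, the data frame can be rebuilt from the untidy table printed earlier (the values below are copied from that printed table, which is an assumption about the original code) and then passed to spread() as described. The name tidy_sports is introduced here purely for illustration. In more recent versions of tidyr, pivot_wider() is the recommended successor to spread().

```r
library(tidyr)

# Rebuilt from the printed untidy table (an assumption: the original
# definition is truncated in the text above).
sports <- data.frame(
  Students = rep(c("Matt", "Ellie", "Tim", "Louise", "Kelly"), each = 2),
  Sport    = rep(c("Tennis", "Rugby", "Football", "Swimming", "Running"), each = 2),
  Category = rep(c("Wins", "Losses"), times = 5),
  Counts   = c(0, 1, 3, 2, 1, 4, 2, 2, 5, 1)
)

# key:   the column that holds several variables (Wins / Losses)
# value: the column that holds the values of those variables
tidy_sports <- spread(sports, key = Category, value = Counts)
tidy_sports
```

Each student now occupies a single row, with Wins and Losses as separate columns.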
The other common way in which data can be untidy is if the columns are values instead of variables. For
example, the dataframe below shows the percentages some students got in tests they did in May and June.
The data is untidy because the columns May and June are values, not variables.
percentages<-data.frame(student=c("Alejandro", "Pietro", "Jane"), "May"=c(90,12,45), "June"=c(80,30,100)
Fortunately, there is a function in the tidyverse packages to deal with this problem too. gather() takes the
names of the columns which are values, the key and the value as arguments. This time, the key is the name
of the variable with values as column names, and the value is the name of the variable with values spread
over multiple columns. Ie:
gather(percentages, "May", "June", key="Month", value = "Percentage")
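For reference, here is a self-contained sketch of the same operation, with the data frame definition closed off so that it runs on its own. The name tidy_percentages is introduced here for illustration only. In more recent versions of tidyr, pivot_longer() is the recommended successor to gather().

```r
library(tidyr)

# Same data as above, with the closing parenthesis added.
percentages <- data.frame(
  student = c("Alejandro", "Pietro", "Jane"),
  May     = c(90, 12, 45),
  June    = c(80, 30, 100)
)

# "May" and "June" are the columns whose names are really values of Month;
# their contents are collected into a single Percentage column.
tidy_percentages <- gather(percentages, "May", "June",
                           key = "Month", value = "Percentage")
tidy_percentages
```

The result has one row per student per month, so three students and two months give six rows.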
your data is stored in a tidy format. Fortunately, the data structures we commonly use to facilitate single-cell
RNA-seq analysis usually encourage you to store your data in a tidy manner.
If you google ‘rich data’, you will find lots of different definitions for this term. In this course, we will use
‘rich data’ to mean data which is generated by combining information from multiple sources. For example,
you could make rich data by creating an object in R which contains a matrix of gene expression values
across the cells in your single-cell RNA-seq experiment, but also information about how the experiment was
performed. Objects of the SingleCellExperiment class, which we will discuss below, are an example of rich
data.
5.7.1 Bioconductor
From Wikipedia: Bioconductor is a free, open source and open development software project for the analysis
and comprehension of genomic data generated by wet lab experiments in molecular biology. Bioconductor
is based primarily on the statistical R programming language, but does contain contributions in other
programming languages. It has two releases each year that follow the semiannual releases of R. At any one
time there is a release version, which corresponds to the released version of R, and a development version,
which corresponds to the development version of R. Most users will find the release version appropriate for
their needs.
We strongly recommend that all newcomers, and even experienced high-throughput data analysts, use
well-developed and well-maintained Bioconductor methods and classes.
SingleCellExperiment (SCE) is an S4 class for storing data from single-cell experiments. This includes
specialized methods to store and retrieve spike-in information, dimensionality reduction coordinates and size
factors for each cell, along with the usual metadata for genes and libraries.
In practice, an object of this class can be created using its constructor:
library(SingleCellExperiment)
counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
rownames(counts) <- paste("gene", 1:10, sep = "")
colnames(counts) <- paste("cell", 1:10, sep = "")
sce <- SingleCellExperiment(
assays = list(counts = counts),
rowData = data.frame(gene_names = paste("gene_name", 1:10, sep = "")),
colData = data.frame(cell_names = paste("cell_name", 1:10, sep = ""))
)
sce
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(1): counts
## rownames(10): gene1 gene2 ... gene9 gene10
## rowData names(1): gene_names
## colnames(10): cell1 cell2 ... cell9 cell10
## colData names(1): cell_names
## reducedDimNames(0):
## spikeNames(0):
In the SingleCellExperiment, users can assign arbitrary names to entries of assays. To assist interoperability
between packages, the authors provide some suggestions for what the names should be for particular types
of data:
• counts: Raw count data, e.g., number of reads or transcripts for a particular gene.
• normcounts: Normalized values on the same scale as the original counts. For example, counts divided
by cell-specific size factors that are centred at unity.
• logcounts: Log-transformed counts or count-like values. In most cases, this will be defined as
log-transformed normcounts, e.g., using log base 2 and a pseudo-count of 1.
• cpm: Counts-per-million. This is the read count for each gene in each cell, divided by the library size
of each cell in millions.
• tpm: Transcripts-per-million. This is the number of transcripts for each gene in each cell, divided by
the total number of transcripts in that cell (in millions).
Each of these suggested names has an appropriate getter/setter method for convenient manipulation of the
SingleCellExperiment. For example, we can take the (very specifically named) counts slot, transform it
and assign it to normcounts instead. Note that, following the conventions above, log-transformed values
such as these would normally be stored in logcounts; normcounts is used here only to demonstrate the setter:
normcounts(sce) <- log2(counts(sce) + 1)
sce
## class: SingleCellExperiment
## dim: 10 10
## metadata(0):
## assays(2): counts normcounts
## rownames(10): gene1 gene2 ... gene9 gene10
## rowData names(1): gene_names
## colnames(10): cell1 cell2 ... cell9 cell10
## colData names(1): cell_names
## reducedDimNames(0):
## spikeNames(0):
dim(normcounts(sce))
## [1] 10 10
head(normcounts(sce))
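The definitions of normcounts, logcounts and cpm given above can be sketched in base R on a small simulated matrix. This is an illustration under stated assumptions rather than the course's own normalisation. In particular, the size factors here are simply library sizes rescaled to have mean 1, and the names cpm_mat, norm_mat and log_mat are hypothetical.

```r
set.seed(1)
counts <- matrix(rpois(100, lambda = 10), ncol = 10, nrow = 10)
rownames(counts) <- paste("gene", 1:10, sep = "")
colnames(counts) <- paste("cell", 1:10, sep = "")

# Library size of each cell: the total count in each column.
lib_size <- colSums(counts)

# cpm: counts divided by the library size of each cell, in millions.
cpm_mat <- t(t(counts) / lib_size) * 1e6

# normcounts: counts divided by size factors centred at unity
# (assumption: library sizes rescaled so that their mean is 1).
size_factors <- lib_size / mean(lib_size)
norm_mat <- t(t(counts) / size_factors)

# logcounts: log2 of normcounts with a pseudo-count of 1.
log_mat <- log2(norm_mat + 1)

colSums(cpm_mat)  # every column sums to one million
```

With a SingleCellExperiment object, matrices like these could then be stored using the corresponding setters, for example logcounts(sce) <- log_mat.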
scater is an R package for single-cell RNA-seq analysis (McCarthy et al., 2017). The package contains several
useful methods for quality control, visualisation and pre-processing of data prior to further downstream
analysis.
scater features the following functionality:
• Automated computation of QC metrics
• Transcript quantification from read data with pseudo-alignment
• Data format standardisation
• Rich visualizations for exploratory analysis
• Seamless integration into the Bioconductor universe
• Simple normalisation methods
We highly recommend using scater for all single-cell RNA-seq analyses; scater is the basis of the first
part of the course.
As illustrated in the figure below, scater will help you with quality control, filtering and normalization
of your expression matrix following mapping and alignment. Keep in mind that this figure represents the
original version of scater where an SCESet class was used. In the newest version this figure is still correct,
except that SCESet can be substituted with the SingleCellExperiment class.
ggplot2 is an R package designed by Hadley Wickham which facilitates data plotting. In this lab, we will
touch briefly on some of the features of the package. If you would like to learn more about how to use
ggplot2, we would recommend reading “ggplot2: Elegant Graphics for Data Analysis”, by Hadley Wickham.
The aes function specifies how variables in your dataframe map to features on your plot. To understand
how this works, let’s look at an example:
library(ggplot2)
library(tidyverse)
set.seed(1)
counts <- as.data.frame(matrix(rpois(100, lambda = 10), ncol=10, nrow=10))
Gene_ids <- paste("gene", 1:10, sep = "")
colnames(counts) <- paste("cell", 1:10, sep = "")
counts<-data.frame(Gene_ids, counts)
counts
## Gene_ids cell1 cell2 cell3 cell4 cell5 cell6 cell7 cell8 cell9 cell10
## 1 gene1 8 8 3 5 5 9 11 9 13 6
## 2 gene2 10 2 11 13 12 12 7 13 12 15
## 3 gene3 7 8 13 8 9 9 9 5 15 12
## 4 gene4 11 10 7 13 12 12 12 8 11 12
## 5 gene5 14 7 8 9 11 10 13 13 5 11
## 6 gene6 12 12 11 15 8 7 10 9 10 15
## 7 gene7 11 11 14 11 11 5 9 13 13 7
## 8 gene8 9 12 9 8 6 14 7 12 12 10
## 9 gene9 14 12 11 7 10 10 8 14 7 10
## 10 gene10 11 10 9 7 11 16 8 7 7 4
ggplot(data = counts, mapping = aes(x = cell1, y = cell2))
[Plot: empty axes, with cell1 on the x axis and cell2 on the y axis]
Let’s take a closer look at the final command, ggplot(data = counts, mapping = aes(x = cell1, y =
cell2)). ggplot() initialises a ggplot object and takes the arguments data and mapping. We pass our
dataframe of counts to data and use the aes() function to specify that we would like to use the variable
cell1 as our x variable and the variable cell2 as our y variable.
Task 1: Modify the command above to initialise a ggplot object where cell10 is the x variable and cell8 is
the y variable.
Clearly, the plots we have just created are not very informative because no data is displayed on them. To
display data, we will need to use geoms.
5.8.4 Geoms
We can use geoms to specify how we would like data to be displayed on our graphs. For example, our choice
of geom could specify that we would like our data to be displayed as a scatterplot, a barplot or a boxplot.
Let’s see how our graph would look as a scatterplot.
ggplot(data = counts, mapping = aes(x = cell1, y = cell2)) + geom_point()
[Plot: scatterplot of cell2 (y axis) against cell1 (x axis)]
Now we can see that there doesn’t seem to be any correlation between gene expression in cell1 and cell2.
Given we generated counts randomly, this isn’t too surprising.
Task 2: Modify the command above to create a line plot. Hint: execute ?ggplot and scroll down the help
page. At the bottom is a link to the ggplot package index. Scroll through the index until you find the geom
options.
So far we’ve been considering the gene counts from 2 of the cells in our dataframe. But there are actually
10 cells in our dataframe and it would be nice to compare all of them. What if we wanted to plot data from
all 10 cells at the same time?
At the moment we can’t do this because we are treating each individual cell as a variable and assigning that
variable to either the x or the y axis. We could create a 10 dimensional graph to plot data from all 10 cells
on, but this is a) not possible to do with ggplot and b) not very easy to interpret. What we could do instead
is to tidy our data so that we had one variable representing cell ID and another variable representing gene
counts, and plot those against each other. In code, this would look like:
counts<-gather(counts, colnames(counts)[2:11], key = 'Cell_ID', value='Counts')
head(counts)
Essentially, the problem before was that our data was not tidy because one variable (Cell_ID) was spread
over multiple columns. Now that we’ve fixed this problem, it is much easier for us to plot data from all 10
cells on one graph.
ggplot(counts,aes(x=Cell_ID, y=Counts)) + geom_boxplot()
[Plot: boxplots of Counts (y axis) for each Cell_ID (x axis), cell1 to cell10]
Task 3: Use the updated counts dataframe to plot a barplot with Cell_ID as the x variable and Counts as
the y variable. Hint: you may find it helpful to read ?geom_bar.
Task 4: Use the updated counts dataframe to plot a scatterplot with Gene_ids as the x variable and Counts
as the y variable.
A common method for visualising gene expression data is with a heatmap. Here we will use the R package
pheatmap to perform this analysis with some gene expression data we will name test.
library(pheatmap)
set.seed(2)
test = matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4
colnames(test) = paste("Cell", 1:10, sep = "")
rownames(test) = paste("Gene", 1:20, sep = "")
pheatmap(test)
[Heatmap: Gene1 to Gene20 in rows and Cell1 to Cell10 in columns, with row and column dendrograms; colour scale from about −2 to 6]
Let’s take a moment to work out what this graphic is showing us. Each row represents a gene and each
column represents a cell. How highly expressed each gene is in each cell is represented by the colour of the
corresponding box. For example, we can tell from this plot that Gene18 is highly expressed in Cell10 but
lowly expressed in Cell1.
This plot also gives us information on the results of a clustering algorithm. In general, clustering algorithms
aim to split datapoints (e.g. cells) into groups whose members are more similar to one another than they are
to the rest of the datapoints. The trees drawn on the top and left-hand sides of the graph are the results of
clustering algorithms and enable us to see, for example, that cells 4, 8, 2, 6 and 10 are more similar to one
another than they are to cells 7, 3, 5, 1 and 9. The tree on the left-hand side of the graph represents the
results of a clustering algorithm applied to the genes in our dataset.
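To make the idea of the tree more concrete, the sketch below clusters the cells of the same test matrix with base R and cuts the resulting tree into two groups. It assumes Euclidean distance and complete linkage, which to our understanding are the pheatmap defaults; the names cell_tree and cell_groups are introduced here for illustration.

```r
set.seed(2)
test <- matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] <- test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] <- test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] <- test[15:20, seq(2, 10, 2)] + 4
colnames(test) <- paste("Cell", 1:10, sep = "")
rownames(test) <- paste("Gene", 1:20, sep = "")

# Cluster the cells (columns): dist() works on rows, so transpose first.
cell_tree <- hclust(dist(t(test)), method = "complete")

# Cut the tree into two groups and list the cells in each group.
cell_groups <- cutree(cell_tree, k = 2)
split(names(cell_groups), cell_groups)
```

Cutting the tree at a chosen number of groups is one way of turning the full dendrogram into a small, interpretable set of clusters.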
If we look closely at the trees, we can see that eventually they have the same number of branches as there
are cells and genes. In other words, the total number of cell clusters is the same as the total number of
cells, and the total number of gene clusters is the same as the total number of genes. Clearly, this is not
very informative, and will become impractical when we are looking at more than 10 cells and 20 genes.
Fortunately, we can set the number of clusters we see on the plot. Let’s try setting the number of gene
clusters to 2:
pheatmap(test, kmeans_k = 2)
[Heatmap: genes grouped into two k-means clusters (Cluster 1, size 8; Cluster 2, size 12) across Cell1 to Cell10]
Now we can see that the genes fall into two clusters - a cluster of 8 genes which are upregulated in cells 2,
10, 6, 4 and 8 relative to the other cells and a cluster of 12 genes which are downregulated in cells 2, 10, 6,
4 and 8 relative to the other cells.
Task 5: Try setting the number of clusters to 3. Which number of clusters do you think is more informative?
Principal component analysis (PCA) is a statistical procedure that uses a transformation to convert a set
of observations into a set of values of linearly uncorrelated variables called principal components. The
transformation is carried out so that the first principal component accounts for as much of the variability in
the data as possible, and each following principal component accounts for the greatest amount of variance
possible under the constraint that it must be orthogonal to the previous components.
PCA plots are a good way to get an overview of your data, and can sometimes help identify confounders
which explain a high amount of the variability in your data. We will investigate how we can use PCA plots
in single-cell RNA-seq analysis in more depth in a future lab; here the aim is to give you an overview of what
PCA plots are and how they are generated.
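The two properties described above (decreasing variance explained and orthogonal components) can be checked directly with base R. The following is a small sketch using prcomp() on the same simulated test matrix as in the heatmap example; the names pca, var_explained and orth are introduced here for illustration.

```r
set.seed(2)
test <- matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] <- test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] <- test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] <- test[15:20, seq(2, 10, 2)] + 4
colnames(test) <- paste("Cell", 1:10, sep = "")
rownames(test) <- paste("Gene", 1:20, sep = "")

pca <- prcomp(test)

# Proportion of the total variance explained by each principal component;
# the components are ordered from most to least variance explained.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 4)

# The loading vectors are orthonormal, so t(R) %*% R is the identity matrix.
orth <- crossprod(pca$rotation)
max(abs(orth - diag(ncol(test))))  # numerically zero
```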
Let’s make a PCA plot for our test data. We can use the ggfortify package to let ggplot know how to
interpret principal components.
library(ggfortify)
Principle_Components<-prcomp(test)
autoplot(Principle_Components, label=TRUE)
[PCA plot: genes Gene1 to Gene20 labelled, with PC1 (78.19%) on the x axis and PC2 (7.01%) on the y axis]
Task 6: Compare your clusters to the pheatmap clusters. Are they related? (Hint: have a look at the gene
tree for the first pheatmap we plotted)
Task 7: Produce a heatmap and PCA plot for counts (below):
set.seed(1)
counts <- as.data.frame(matrix(rpois(100, lambda = 10), ncol=10, nrow=10))
rownames(counts) <- paste("gene", 1:10, sep = "")
colnames(counts) <- paste("cell", 1:10, sep = "")
Chapter 6
Tabula Muris
6.1 Introduction
To give you hands-on experience analyzing a single-cell RNA-seq dataset from start to finish, we will use
data from the Tabula Muris initial release as an example. The Tabula Muris is an international
collaboration with the aim of profiling every cell type in the mouse using a standardized method. They
combine high-throughput but low-coverage 10X data with lower-throughput but high-coverage FACS-sorted
cells + Smartseq2.
The initial release of the data (20 Dec 2017) contains almost 100,000 cells across 20 different tissues/organs.
You have been assigned one of these tissues as an example to work on over this course, and on Friday each
person will have 3 minutes to present the result for their tissue.
Note if you download the data by hand you should unzip & rename the files as above before continuing.
You should now have two folders, “FACS” and “droplet”, and one annotation and metadata file for each.
To inspect these files you can use the head command to see the top few lines of the text files (Press “q” to exit):
head -n 10 droplet_metadata.csv
## channel,mouse.id,tissue,subtissue,mouse.sex
## 10X_P4_0,3-M-8,Tongue,NA,M
## 10X_P4_1,3-M-9,Tongue,NA,M
## 10X_P4_2,3-M-8/9,Liver,hepatocytes,M
## 10X_P4_3,3-M-8,Bladder,NA,M
## 10X_P4_4,3-M-9,Bladder,NA,M
## 10X_P4_5,3-M-8,Kidney,NA,M
## 10X_P4_6,3-M-9,Kidney,NA,M
## 10X_P4_7,3-M-8,Spleen,NA,M
## 10X_P7_0,3-F-56,Liver,NA,F
You can also check the number of rows in each file using:
wc -l droplet_annotation.csv
## 54838 droplet_annotation.csv
Exercise How many cells do we have annotations for from FACS? from 10X?
Answer FACS : 54,838 cells Droplet : 42,193 cells
We can now read in the relevant count matrix from the comma-separated file. Then inspect the resulting
dataframe:
dat = read.delim("FACS/Kidney-counts.csv", sep=",", header=TRUE)
dat[1:5,1:5]
## X A14.MAA000545.3_8_M.1.1 E1.MAA000545.3_8_M.1.1
## 1 0610005C13Rik 0 0
## 2 0610007C21Rik 1 0
## 3 0610007L01Rik 0 0
## 4 0610007N19Rik 0 0
## 5 0610007P08Rik 0 0
## M4.MAA000545.3_8_M.1.1 O21.MAA000545.3_8_M.1.1
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
We can see that the first column in the dataframe is the gene names, so first we move these to the rownames
so we have a numeric matrix:
dim(dat)
rownames(dat)[grep("^ERCC-", rownames(dat))]
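The step described above, moving the gene names from the first column into the row names, might be sketched on a toy data frame as follows. This is an illustration only: the object toy, its cell names and its counts are made up, and only the overall layout (gene names in a first column called X) follows the printed output.

```r
# Toy stand-in for `dat`: gene names in the first column, counts after it.
toy <- data.frame(
  X     = c("0610005C13Rik", "0610007C21Rik", "ERCC-00002"),
  cellA = c(0, 1, 15),
  cellB = c(0, 0, 22),
  stringsAsFactors = FALSE
)

rownames(toy) <- toy[, 1]  # gene names become the row names
toy <- toy[, -1]           # drop the now redundant first column

dim(toy)

# Spike-in rows can then be found by their "ERCC-" prefix.
rownames(toy)[grep("^ERCC-", rownames(toy))]
```

After this step every remaining column is numeric, so the data frame can be converted to a matrix with as.matrix() if needed.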
## Plate
## Mouse B001717 B002775 MAA000545 MAA000752 MAA000801 MAA000922
## 3_10_M 0 0 0 104 0 0
## 3_11_M 0 0 0 0 196 0
## 3_38_F 237 0 0 0 0 0
## 3_39_F 0 169 0 0 0 0
## 3_8_M 0 0 82 0 0 0
## 3_9_M 0 0 0 0 0 77
Lastly we will read the computationally inferred cell-type annotations and match them to the cells in our
expression matrix:
ann <- read.table("FACS_annotations.csv", sep=",", header=TRUE)
ann <- ann[match(cellIDs, ann[,1]),]
celltype <- ann[,3]
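The role of match() in the code above is easiest to see on a small example. In this sketch the cell IDs, the annotation table and its column layout are all hypothetical; the point is only that match() returns, for each cell in the expression matrix, the row of the annotation table that describes it.

```r
# Hypothetical cell IDs, in the order of the expression matrix columns.
cellIDs <- c("cell_C", "cell_A", "cell_B")

# Hypothetical annotation table, in a different order and with an extra cell.
ann <- data.frame(
  cell   = c("cell_A", "cell_B", "cell_C", "cell_D"),
  tissue = c("Kidney", "Kidney", "Kidney", "Kidney"),
  type   = c("leukocyte", "endothelial cell", "kidney cell", "fibroblast"),
  stringsAsFactors = FALSE
)

# For each cell ID, find its row in the annotation table
# (NA would be returned for a cell without an annotation).
idx <- match(cellIDs, ann[, 1])
ann_matched <- ann[idx, ]
celltype <- ann_matched[, 3]
celltype
```

After matching, the rows of the annotation table are in the same order as the cells, so the two can safely be combined.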
Finally, if the dataset contains spike-ins we store a hidden variable in the SingleCellExperiment object to track
them:
isSpike(sceset, "ERCC") <- grepl("ERCC-", rownames(sceset))
##
## Attaching package: 'Matrix'
Now we will add the appropriate row and column names. However, if you inspect the read cellbarcodes you
will see that they are just the barcode sequence associated with each cell. This is a problem since each batch
of 10X data uses the same pool of barcodes so if we need to combine data from multiple 10X batches the
cellbarcodes will not be unique. Hence we will attach the batch ID to each cell barcode:
head(cellbarcodes)
## V1
## 1 AAACCTGAGATGCCAG-1
## 2 AAACCTGAGTGTCCAT-1
## 3 AAACCTGCAAGGCTCC-1
## 4 AAACCTGTCCTTGCCA-1
## 5 AAACGGGAGCTGAACG-1
## 6 AAACGGGCAGGACCCT-1
rownames(molecules) <- genenames[,1]
colnames(molecules) <- paste("10X_P4_5", cellbarcodes[,1], sep="_")
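The reason for attaching the batch ID can be illustrated with a couple of made-up barcode vectors. The batch names follow the channel names in the metadata, but which barcodes belong to which batch is an assumption made purely for this sketch.

```r
# The same barcode can occur in different 10X batches (made-up example).
barcodes_batch1 <- c("AAACCTGAGATGCCAG-1", "AAACCTGAGTGTCCAT-1")
barcodes_batch2 <- c("AAACCTGAGATGCCAG-1", "AAACGGGAGCTGAACG-1")

# Without a prefix, the combined IDs contain a duplicate.
anyDuplicated(c(barcodes_batch1, barcodes_batch2)) > 0

# Prefixing each barcode with its batch ID makes the combined IDs unique.
ids <- c(
  paste("10X_P4_5", barcodes_batch1, sep = "_"),
  paste("10X_P4_6", barcodes_batch2, sep = "_")
)
anyDuplicated(ids) == 0
```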
Now let’s get the metadata and computational annotations for this data:
meta <- read.delim("droplet_metadata.csv", sep=",", header=TRUE)
head(meta)
Here we can see that we need to use “10X_P4_5” to find the metadata for this batch. Also note that the
format of the mouse ID is different in this metadata table, with hyphens instead of underscores and with the
sex in the middle of the ID. From checking the methods section of the accompanying paper we know that
the same 8 mice were used for both droplet and plate-based techniques, so we need to fix the mouse IDs to
be consistent with those used in the FACS experiments.
meta[meta$channel == "10X_P4_5",]
Note: depending on the tissue you have been assigned you may have 10X data from mixed samples :
e.g. mouse id = 3-M-5/6. You should still reformat these to be consistent but they will not match mouse ids
from the FACS data which may affect your downstream analysis. If the mice weren’t from an inbred strain
it would be possible to assign individual cells to a specific mouse using exonic-SNPs but that is beyond the
scope of this course.
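One possible way to perform the reformatting described above is a regular-expression substitution. The sketch below is not necessarily how the course code does it; the vector droplet_ids and the pattern are assumptions based on the ID formats visible in the printed metadata (e.g. 3-M-8) and in the FACS cell names (e.g. 3_8_M).

```r
# Mouse IDs as they appear in the droplet metadata.
droplet_ids <- c("3-M-8", "3-M-9", "3-F-56", "3-M-8/9")

# Reorder "<prefix>-<sex>-<number>" into "<prefix>_<number>_<sex>",
# the format seen in the FACS cell names.
facs_style <- sub("^([0-9]+)-([MF])-(.+)$", "\\1_\\3_\\2", droplet_ids)
facs_style
```

Mixed samples such as 3-M-8/9 are reformatted in the same way, but as noted above they will not match any single mouse ID from the FACS data.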
ann <- read.delim("droplet_annotation.csv", sep=",", header=TRUE)
head(ann)
Exercise Repeat the above for the other 10X batches for your tissue.
Answer
Now that we have read the 10X data in multiple batches we need to combine them into a single
SingleCellExperiment object. First we will check that the gene names are the same and in the same order
across all batches:
identical(rownames(molecules1), rownames(molecules2))
## [1] TRUE
identical(rownames(molecules1), rownames(molecules3))
## [1] TRUE
Now we’ll check that there aren’t any repeated cellIDs:
sum(colnames(molecules1) %in% colnames(molecules2))
## [1] 0
sum(colnames(molecules1) %in% colnames(molecules3))
## [1] 0
sum(colnames(molecules2) %in% colnames(molecules3))
## [1] 0
Everything is ok, so we can go ahead and combine them:
all_molecules <- cbind(molecules1, molecules2, molecules3)
all_cell_anns <- as.data.frame(rbind(cell_anns1, cell_anns2, cell_anns3))
all_cell_anns$batch <- rep(c("10X_P4_5", "10X_P4_6","10X_P7_5"), times = c(nrow(cell_anns1), nrow(cell_a
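The checks and the combination step can be sketched end to end on two small made-up batches. All object contents and the labels batch1 and batch2 are hypothetical; the pattern of checking gene names, checking for repeated cell IDs, binding the columns and building one batch label per cell is the same as above.

```r
# Two made-up batches sharing the same genes in the same order.
genes <- c("geneA", "geneB", "geneC")
molecules1 <- matrix(1:6, nrow = 3,
                     dimnames = list(genes, c("b1_cell1", "b1_cell2")))
molecules2 <- matrix(7:15, nrow = 3,
                     dimnames = list(genes, c("b2_cell1", "b2_cell2", "b2_cell3")))

# Only combine if the genes agree and no cell ID is repeated.
stopifnot(identical(rownames(molecules1), rownames(molecules2)))
stopifnot(sum(colnames(molecules1) %in% colnames(molecules2)) == 0)

all_molecules <- cbind(molecules1, molecules2)

# One batch label per cell: each label is repeated once per cell in its batch.
batch <- rep(c("batch1", "batch2"),
             times = c(ncol(molecules1), ncol(molecules2)))
table(batch)
```

Using the number of cells in each batch as the times argument keeps the labels aligned with the columns of the combined matrix.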
Since this is 10X data it will not contain spike-ins, so we just save the data:
saveRDS(sceset, "kidney_droplet.rds")
7.1.1 Introduction
Once gene expression has been quantified it is summarized as an expression matrix where each row
corresponds to a gene (or transcript) and each column corresponds to a single cell. This matrix should be
examined to remove poor quality cells which were not detected in either read QC or mapping QC steps.
Failure to remove low quality cells at this stage may add technical noise which has the potential to obscure
the biological signals of interest in the downstream analysis.
Since there is currently no standard method for performing scRNASeq the expected values for the various
QC measures that will be presented here can vary substantially from experiment to experiment. Thus, to
perform QC we will be looking for cells which are outliers with respect to the rest of the dataset rather than
comparing to independent quality standards. Consequently, care should be taken when comparing quality
metrics across datasets collected using different protocols.
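One common way to make "outlier with respect to the rest of the dataset" concrete is a threshold based on the median absolute deviation (MAD). This is not the approach used in the code later in this chapter, which relies on fixed thresholds read off histograms; it is a sketch of an alternative, and the toy values and the cutoff of 3 MADs are assumptions.

```r
# Made-up total counts per cell; the last cell has far fewer molecules.
total_counts <- c(48000, 52000, 50500, 49500, 51000, 47000, 53000, 2000)

log_counts <- log10(total_counts)
centre <- median(log_counts)
spread_mad <- mad(log_counts)  # scaled median absolute deviation

# Flag cells that are more than 3 MADs below the median.
is_low <- log_counts < centre - 3 * spread_mad
which(is_low)
```

Because the threshold is derived from the dataset itself, it adapts to the protocol and sequencing depth rather than relying on an external standard.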
To illustrate cell QC, we consider a dataset of induced pluripotent stem cells generated from three different
individuals (Tung et al., 2017) in Yoav Gilad’s lab at the University of Chicago. The experiments were carried
out on the Fluidigm C1 platform and to facilitate the quantification both unique molecular identifiers (UMIs)
and ERCC spike-ins were used. The data files are located in the tung folder in your working directory. These
files are the copies of the original files made on the 15/03/16. We will use these copies for reproducibility
purposes.
library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
The data consists of 3 individuals and 3 replicates and therefore has 9 batches in total.
We standardize the analysis by using both SingleCellExperiment (SCE) and scater packages. First, create
the SCE object:
umi <- SingleCellExperiment(
assays = list(counts = as.matrix(molecules)),
colData = anno
)
Define control features (genes) - ERCC spike-ins and mitochondrial genes (provided by the authors):
isSpike(umi, "ERCC") <- grepl("^ERCC-", rownames(umi))
isSpike(umi, "MT") <- rownames(umi) %in%
c("ENSG00000198899", "ENSG00000198727", "ENSG00000198888",
"ENSG00000198886", "ENSG00000212907", "ENSG00000198786",
"ENSG00000198695", "ENSG00000198712", "ENSG00000198804",
"ENSG00000198763", "ENSG00000228253", "ENSG00000198938",
"ENSG00000198840")
[Figure: histogram of umi$total_counts (y axis: Frequency)]
7.1.3 Cell QC
Next we consider the total number of RNA molecules detected per sample (if we were using read counts
rather than UMI counts this would be the total number of reads). Wells with few reads/molecules are likely
to have been broken or failed to capture a cell, and should thus be removed.
hist(
umi$total_counts,
breaks = 100
)
abline(v = 25000, col = "red")
Exercise 1
2. What distribution do you expect that the total number of molecules for each cell should follow?
Our answer
## filter_by_total_counts
## FALSE TRUE
## 46 818
[Figure: histogram of umi$total_features (y axis: Frequency)]
In addition to ensuring sufficient sequencing depth for each sample, we also want to make sure that the reads
are distributed across the transcriptome. Thus, we count the total number of unique genes detected in each
sample.
hist(
umi$total_features,
breaks = 100
)
abline(v = 7000, col = "red")
From the plot we conclude that most cells have between 7,000-10,000 detected genes, which is normal for
high-depth scRNA-seq. However, this varies by experimental protocol and sequencing depth. For example,
droplet-based methods or samples with lower sequencing-depth typically detect fewer genes per cell. The
most notable feature in the above plot is the “heavy tail” on the left hand side of the distribution. If
detection rates were equal across the cells then the distribution should be approximately normal. Thus we
remove those cells in the tail of the distribution (fewer than 7,000 detected genes).
Exercise 2
Our answer
## filter_by_expr_features
## FALSE TRUE
## 116 748
[Figure: pct_counts_MT (y axis), coloured by batch (NA19098.r1 to NA19239.r3)]
Another measure of cell quality is the ratio between ERCC spike-in RNAs and endogenous RNAs. This ratio
can be used to estimate the total amount of RNA in the captured cells. Cells with a high level of spike-in
RNAs had low starting amounts of RNA, likely due to the cell being dead or stressed which may result in
the RNA being degraded.
plotPhenoData(
umi,
aes_string(
x = "total_features",
y = "pct_counts_MT",
colour = "batch"
)
)
plotPhenoData(
umi,
aes_string(
x = "total_features",
y = "pct_counts_ERCC",
colour = "batch"
)
)
The above analysis shows that the majority of the cells from the NA19098.r2 batch have a very high ERCC/Endo
ratio. Indeed, it has been shown by the authors that this batch contains cells of smaller size.
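To show what a quantity such as pct_counts_ERCC measures, the sketch below computes the percentage of spike-in counts per cell by hand for a tiny made-up count matrix. The matrix, the gene names and the variable names are hypothetical, and this is not scater's own implementation.

```r
# Made-up counts: two endogenous genes and two ERCC spike-ins, three cells.
counts <- matrix(
  c(90, 80, 20,
    60, 70, 10,
     5,  5, 40,
     5,  5, 30),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "ERCC-00002", "ERCC-00003"),
                  c("cell1", "cell2", "cell3"))
)

is_ercc <- grepl("^ERCC-", rownames(counts))

# Percentage of each cell's counts that come from spike-ins.
pct_ercc <- 100 * colSums(counts[is_ercc, , drop = FALSE]) / colSums(counts)
round(pct_ercc, 1)
```

In this toy example cell3 has little endogenous RNA relative to the spike-ins, so its spike-in percentage is much higher than that of the other two cells.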
[Figure: pct_counts_ERCC (y axis) against total_features (x axis), coloured by batch (NA19098.r1 to NA19239.r3)]
Exercise 3
Create filters for removing batch NA19098.r2 and cells with high expression of mitochondrial genes (>10%
of total counts in a cell).
Our answer
## filter_by_ERCC
## FALSE TRUE
## 96 768
## filter_by_MT
## FALSE TRUE
## 31 833
Exercise 4
What would you expect to see in the ERCC vs counts plot if you were examining a dataset containing cells
of different sizes (e.g. normal & senescent cells)?
Answer
You would expect to see a group corresponding to the smaller cells (normal) with a higher fraction of ERCC
reads than a separate group corresponding to the larger cells (senescent).
7.1.4.1 Manual
##
## FALSE TRUE
## 207 657
7.1.4.2 Automatic
Another option available in scater is to conduct PCA on a set of QC metrics and then use automatic outlier
detection to identify potentially problematic cells.
By default, the following metrics are used for PCA-based outlier detection:
• pct_counts_top_100_features
• total_features
• pct_counts_feature_controls
• n_detected_feature_controls
• log10_counts_endogenous_features
• log10_counts_feature_controls
scater first creates a matrix where the rows represent cells and the columns represent the different QC
metrics. Here, the PCA plot provides a 2D representation of cells ordered by their quality metrics. The
outliers are then detected using methods from the mvoutlier package.
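The idea can be illustrated in base R. The sketch below is not what scater or mvoutlier do internally: it swaps in a plain Mahalanobis-distance cut-off on the first two principal components, and the QC metrics are simulated, so treat it only as an illustration of "PCA on a cells-by-metrics matrix, then flag unusual cells".

```r
set.seed(1)
n_cells <- 100

# Simulated cells-by-metrics matrix (rows = cells, columns = QC metrics)
qc_metrics <- cbind(
  total_features      = rnorm(n_cells, mean = 8000, sd = 500),
  log10_counts        = rnorm(n_cells, mean = 4.8, sd = 0.1),
  pct_counts_controls = rnorm(n_cells, mean = 5, sd = 1)
)

# Make the last three cells look like poor-quality cells
qc_metrics[98:100, "total_features"] <- c(3000, 2500, 2000)
qc_metrics[98:100, "pct_counts_controls"] <- c(25, 30, 35)

# PCA on the scaled QC metrics; keep the first two components
pca <- prcomp(qc_metrics, center = TRUE, scale. = TRUE)
pcs <- pca$x[, 1:2]

# Squared Mahalanobis distance of each cell from the centre of the PC cloud
d2 <- mahalanobis(pcs, center = colMeans(pcs), cov = cov(pcs))

# Flag cells beyond the 97.5% quantile of a chi-squared distribution (2 df)
outlier <- d2 > qchisq(0.975, df = 2)
which(outlier)
```

A classical covariance estimate is itself affected by the outliers, which is one reason robust methods such as those in mvoutlier are preferable in practice.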
umi <- plotPCA(
umi,
size_by = "total_features",
shape_by = "use",
pca_data_input = "pdata",
detect_outliers = TRUE,
return_SCE = TRUE
)
table(umi$outlier)
##
## FALSE TRUE
## 819 45
[Figure: PCA of QC metrics (Component 1: 78% variance, Component 2: 21% variance); point size by total_features, shape by use, colour by outlier]
Figure 7.5: PCA plot used for automatic detection of cell outliers
Exercise 5
Compare the default, automatic and manual cell filters. Plot a Venn diagram of the outlier cells from these
filterings.
Hint: Use vennCounts and vennDiagram functions from the limma package to make a Venn diagram.
Answer
##
## Attaching package: 'limma'
## The following object is masked from 'package:scater':
##
## plotMDS
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
[Figure: Venn diagram of cells removed by each filter: Automatic only 0, both 45, Manual only 162, neither 657]
Figure 7.6: Comparison of the default, automatic and manual cell filters
In addition to removing cells with poor quality, it is usually a good idea to exclude genes where we suspect
that technical artefacts may have skewed the results. Moreover, inspection of the gene expression profiles
may provide insights about how the experimental procedures could be improved.
It is often instructive to consider the number of reads consumed by the top 50 expressed genes.
plotQC(umi, type = "highest-expression")
The distributions are relatively flat, indicating (but not guaranteeing!) good coverage of the full transcriptome
of these cells. However, there are several spike-ins in the top 15 genes, which suggests a greater dilution of
the spike-ins may be preferable if the experiment is to be repeated.
It is typically a good idea to remove genes whose expression level is considered “undetectable”. We define
a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering
read counts rather than UMI counts a reasonable threshold is to require at least five reads in at least two
cells. However, in both cases the threshold strongly depends on the sequencing depth. It is important to
keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor
quality cells (note colData(umi)$use filter applied to the umi dataset).
filter_genes <- apply(
counts(umi[ , colData(umi)$use]),
1,
function(x) length(x[x > 1]) >= 2
)
rowData(umi)$use <- filter_genes
table(filter_genes)
## filter_genes
## FALSE TRUE
## 4660 14066
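The detectability rule above can be checked on a tiny made-up count matrix. The sketch below uses rowSums() on a logical matrix as an alternative to the apply() call and confirms that both give the same answer; the gene and cell names are hypothetical.

```r
# Toy UMI count matrix: rows are genes, columns are cells
counts_mat <- matrix(
  c(0, 0, 0, 0,
    2, 3, 0, 0,
    1, 1, 1, 1,
    5, 0, 0, 0,
    0, 2, 2, 4),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)

# A gene is detectable if at least two cells contain more than 1 transcript
filter_genes <- rowSums(counts_mat > 1) >= 2

# Same rule written with apply(), as in the chapter
filter_apply <- apply(counts_mat, 1, function(x) length(x[x > 1]) >= 2)

filter_genes
all(filter_genes == filter_apply)
```

Note that gene3 is expressed in every cell but never above 1 transcript, so it fails the filter; gene4 is highly expressed but only in a single cell.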
Depending on the cell-type, protocol and sequencing depth, other cut-offs may be appropriate.
Dimensions of the QCed dataset (do not forget about the gene filter we defined above):
[Figure: for each of the 50 most highly expressed genes (y-axis: gene ID), the distribution across cells of the percentage of total counts (x-axis: % of total counts, 0 to 5), coloured by total_features, with feature-control status indicated]
Figure 7.7: Number of total counts consumed by the top 50 expressed genes
dim(umi[rowData(umi)$use, colData(umi)$use])
Perform exactly the same QC analysis with read counts of the same Blischak data. Use the tung/reads.txt
file to load the reads. Once you have finished, please compare your results to ours (next chapter).
7.1.9 sessionInfo()
library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
head(reads[ , 1:3])
## ENSG00000187583 0 0 0
## ENSG00000187642 0 0 0
head(anno)
table(filter_by_total_counts)
## filter_by_total_counts
## FALSE TRUE
## 180 684
hist(
reads$total_features,
breaks = 100
)
abline(v = 7000, col = "red")
[Figure: Histogram of reads$total_counts (x-axis: reads$total_counts, y-axis: Frequency)]
[Figure: Histogram of reads$total_features (x-axis: reads$total_features, y-axis: Frequency)]
[Figure: pct_counts_MT per cell, coloured by batch]
table(filter_by_expr_features)
## filter_by_expr_features
## FALSE TRUE
## 116 748
plotPhenoData(
reads,
aes_string(
x = "total_features",
y = "pct_counts_MT",
colour = "batch"
)
)
plotPhenoData(
reads,
aes_string(
x = "total_features",
y = "pct_counts_ERCC",
colour = "batch"
)
)
[Figure: pct_counts_ERCC per cell, coloured by batch]
filter_by_ERCC <-
reads$batch != "NA19098.r2" & reads$pct_counts_ERCC < 25
table(filter_by_ERCC)
## filter_by_ERCC
## FALSE TRUE
## 103 761
filter_by_MT <- reads$pct_counts_MT < 30
table(filter_by_MT)
## filter_by_MT
## FALSE TRUE
## 18 846
reads$use <- (
# sufficient features (genes)
filter_by_expr_features &
# sufficient molecules counted
filter_by_total_counts &
# sufficient endogenous RNA
filter_by_ERCC &
# remove cells with unusual number of reads in MT genes
filter_by_MT
)
[Figure: PCA of QC metrics (y-axis: Component 2, 14% variance); point size by total_features, shape by use, colour by outlier]
Figure 7.12: PCA plot used for automatic detection of cell outliers
table(reads$use)
##
## FALSE TRUE
## 258 606
reads <- plotPCA(
reads,
size_by = "total_features",
shape_by = "use",
pca_data_input = "pdata",
detect_outliers = TRUE,
return_SCE = TRUE
)
table(reads$outlier)
##
## FALSE TRUE
## 756 108
library(limma)
##
## Attaching package: 'limma'
## The following object is masked from 'package:scater':
##
## plotMDS
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
auto <- colnames(reads)[reads$outlier]
man <- colnames(reads)[!reads$use]
venn.diag <- vennCounts(
cbind(colnames(reads) %in% auto,
colnames(reads) %in% man)
)
vennDiagram(
venn.diag,
names = c("Automatic", "Manual"),
circle.col = c("blue", "green")
)
[Figure: Venn diagram of cells removed by each filter: Automatic only 0, both 108, Manual only 150, neither 606]
Figure 7.13: Comparison of the default, automatic and manual cell filters
plotQC(reads, type = "highest-expression")
table(filter_genes)
## filter_genes
## FALSE TRUE
## 2664 16062
dim(reads[rowData(reads)$use, colData(reads)$use])
[Figure: for each of the 50 most highly expressed genes (y-axis: gene ID), the distribution across cells of the percentage of total counts (x-axis: % of total counts, 0 to 15), coloured by total_features, with feature-control status indicated]
Figure 7.14: Number of total counts consumed by the top 50 expressed genes
By comparing Figure 7.6 and Figure 7.13, it is clear that the reads-based filtering removed more cells than
the UMI-based analysis. If you go back and compare the results you should be able to conclude that the
ERCC and MT filters are stricter for the reads-based analysis.
sessionInfo()
7.3.1 Introduction
In this chapter we will continue to work with the filtered Tung dataset produced in the previous chapter.
We will explore different ways of visualizing the data to allow you to assess what happened to the expression
matrix after the quality control step. The scater package provides several very useful functions to simplify
visualisation.
One important aspect of single-cell RNA-seq is to control for batch effects. Batch effects are technical
artefacts that are added to the samples during handling. For example, if two sets of samples were prepared
in different labs or even on different days in the same lab, then we may observe greater similarities between
the samples that were handled together. In the worst case scenario, batch effects may be mistaken for true
biological variation. The Tung data allows us to explore these issues in a controlled manner since some of
the salient aspects of how the samples were handled have been recorded. Ideally, we expect to see batches
from the same individual grouping together and distinct groups corresponding to each individual.
library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[rowData(umi)$use, colData(umi)$use]
endog_genes <- !rowData(umi.qc)$is_feature_control
The easiest way to get an overview of the data is to transform it using principal component analysis and then
visualize the first two principal components.
Principal component analysis (PCA) is a statistical procedure that uses a transformation to convert a set of
observations into a set of values of linearly uncorrelated variables called principal components (PCs). The
number of principal components is less than or equal to the number of original variables.
Mathematically, the PCs correspond to the eigenvectors of the covariance matrix. The eigenvectors are
sorted by eigenvalue so that the first principal component accounts for as much of the variability in the data
as possible, and each succeeding component in turn has the highest variance possible under the constraint
that it is orthogonal to the preceding components (the figure below is taken from here).
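The relationship between PCs and the eigenvectors of the covariance matrix can be verified numerically. The following sketch uses simulated data (not the Tung dataset) and compares a hand-rolled eigen-decomposition with the result of prcomp(); the variable names are illustrative.

```r
set.seed(42)

# Simulated data: 200 observations of 3 variables, two of them correlated
x <- matrix(rnorm(200 * 3), ncol = 3)
x[, 2] <- x[, 1] + 0.5 * x[, 2]

# Centre the variables, then decompose the covariance matrix
x_centered <- scale(x, center = TRUE, scale = FALSE)
eig <- eigen(cov(x_centered))   # eigenvalues are returned in decreasing order

# Project the data onto the eigenvectors to obtain the PC scores
scores <- x_centered %*% eig$vectors

# Fraction of the total variance accounted for by each PC
var_explained <- eig$values / sum(eig$values)
var_explained

# The eigenvalues equal the PC variances reported by prcomp()
pca <- prcomp(x)
all.equal(eig$values, pca$sdev^2)

# The PC scores are mutually uncorrelated (off-diagonal covariances are ~0)
round(cov(scores), 10)
```

The scores agree with pca$x up to the sign of each column, since an eigenvector is only defined up to a factor of -1.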
7.3.2.1 Before QC
Without log-transformation:
plotPCA(
umi[endog_genes, ],
exprs_values = "counts",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual"
)
With log-transformation:
plotPCA(
umi[endog_genes, ],
exprs_values = "logcounts_raw",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual"
)
[Figure: PCA plot (Component 1: 76% variance, Component 2: 4% variance), coloured by batch, sized by total_features, shaped by individual]
Clearly log-transformation is beneficial for our data: it reduces the variance on the first principal component
and already separates some biological effects. Moreover, it makes the distribution of the expression values
more normal. In the following analysis and chapters we will be using log-transformed raw counts by default.
However, note that just a log-transformation is not enough to account for different technical
factors between the cells (e.g. sequencing depth). Therefore, please do not use logcounts_raw
for your downstream analysis. Instead, as a minimum, use the logcounts slot of the
SingleCellExperiment object, which is not just log-transformed but also normalised by library
size (e.g. CPM normalisation). In the course we use logcounts_raw only for demonstration
purposes!
7.3.2.2 After QC
plotPCA(
umi.qc[endog_genes, ],
exprs_values = "logcounts_raw",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual"
)
Comparing Figure 7.17 and Figure 7.18, it is clear that after quality control the NA19098.r2 cells no longer
form a group of outliers.
By default only the top 500 most variable genes are used by scater to calculate the PCA. This can be adjusted
by changing the ntop argument.
Exercise 1 How do the PCA plots change when all 14,214 genes are used? Or when only the top 50 genes
are used? Why does the fraction of variance accounted for by the first PC change so dramatically?
Hint Use ntop argument of the plotPCA function.
Our answer
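To build intuition for why the choice of ntop matters, here is a small simulation in base R. It does not call scater; the helper pc1_fraction() is a hypothetical function that mimics the idea of keeping only the most variable genes before running PCA, and the data are simulated with 20 "signal" genes that differ between two groups of cells.

```r
set.seed(7)
n_cells <- 60
n_genes <- 500
group <- rep(c(0, 1), each = n_cells / 2)

# Pure noise everywhere, plus a group effect in the first 20 genes
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)
expr[1:20, ] <- expr[1:20, ] + matrix(rep(3 * group, each = 20), nrow = 20)

# Fraction of variance on PC1 when only the `ntop` most variable genes are used
pc1_fraction <- function(expr, ntop) {
  gene_var <- apply(expr, 1, var)
  keep <- order(gene_var, decreasing = TRUE)[seq_len(min(ntop, nrow(expr)))]
  pca <- prcomp(t(expr[keep, , drop = FALSE]))
  pca$sdev[1]^2 / sum(pca$sdev^2)
}

frac_top20 <- pc1_fraction(expr, ntop = 20)
frac_all   <- pc1_fraction(expr, ntop = n_genes)
c(top20 = frac_top20, all = frac_all)
```

With few, highly variable genes most of the retained variance lies along the group difference, so PC1 explains a large fraction; adding hundreds of noise genes inflates the total variance and dilutes that fraction.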
[Figure: PCA plot (Component 1: 43% variance, Component 2: 6% variance), coloured by batch, sized by total_features, shaped by individual]
If your answers are different please compare your code with ours (you need to search for this exercise in the
opened file).
An alternative to PCA for visualizing scRNASeq data is a tSNE plot. tSNE (t-Distributed Stochastic
Neighbor Embedding) combines dimensionality reduction (e.g. PCA) with random walks on the
nearest-neighbour network to map high dimensional data (i.e. our 14,214 dimensional expression matrix) to a
2-dimensional space while preserving local distances between cells. In contrast with PCA, tSNE is a stochastic
algorithm, which means running the method multiple times on the same dataset will result in different plots.
Due to the non-linear and stochastic nature of the algorithm, tSNE plots are more difficult to interpret
intuitively. To ensure reproducibility, we fix the “seed” of the random-number generator in the code below so
that we always get the same plot.
7.3.3.1 Before QC
plotTSNE(
umi[endog_genes, ],
exprs_values = "logcounts_raw",
perplexity = 130,
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
rand_seed = 123456
)
[Figure: PCA plot (Component 1: 17% variance, Component 2: 7% variance), coloured by batch, sized by total_features, shaped by individual]
7.3.3.2 After QC
plotTSNE(
umi.qc[endog_genes, ],
exprs_values = "logcounts_raw",
perplexity = 130,
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
rand_seed = 123456
)
Interpreting PCA and tSNE plots is often challenging; tSNE plots in particular are less intuitive because of the
stochastic and non-linear nature of the algorithm. However, in this case it is clear that they provide a similar
picture of the data. Comparing Figure 7.21 and 7.22, it is again clear that the samples from NA19098.r2 are no
longer outliers after the QC filtering.
Furthermore tSNE requires you to provide a value of perplexity which reflects the number of neighbours
used to build the nearest-neighbour network; a high value creates a dense network which clumps cells together
while a low value makes the network more sparse allowing groups of cells to separate from each other. scater
uses a default perplexity of the total number of cells divided by five (rounded down).
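As a quick arithmetic check of that default, the rule described above can be written as a one-line helper. The function name default_perplexity() is hypothetical and simply restates the rule from the text; it is not a scater function.

```r
# Default perplexity as described above: number of cells divided by five, rounded down
default_perplexity <- function(n_cells) floor(n_cells / 5)

# 657 cells pass the manual UMI filter in this chapter
default_perplexity(657)
```

For 657 cells this gives 131, close to the value of 130 used in the plotTSNE() calls in this chapter.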
[Figure: PCA plot (Component 1: 15% variance, Component 2: 2% variance), coloured by batch, sized by total_features, shaped by individual]
You can read more about the pitfalls of using tSNE here.
Exercise 2 How do the tSNE plots change when a perplexity of 10 or 200 is used? How does the choice of
perplexity affect the interpretation of the results?
Our answer
Perform the same analysis with read counts of the Blischak data. Use tung/reads.rds file to load the reads
SCE object. Once you have finished please compare your results to ours (next chapter).
7.3.5 sessionInfo()
[Figure: PCA plot (Component 1: 25% variance, Component 2: 13% variance), coloured by batch, sized by total_features, shaped by individual]
[Figure: tSNE plot (Dimension 1 vs. Dimension 2) including batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
[Figure: tSNE plot (Dimension 1 vs. Dimension 2) without batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
library(scater)
options(stringsAsFactors = FALSE)
reads <- readRDS("tung/reads.rds")
reads.qc <- reads[rowData(reads)$use, colData(reads)$use]
endog_genes <- !rowData(reads.qc)$is_feature_control
plotPCA(
reads[endog_genes, ],
exprs_values = "counts",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual"
)
plotPCA(
reads[endog_genes, ],
exprs_values = "logcounts_raw",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual"
)
plotPCA(
reads.qc[endog_genes, ],
exprs_values = "logcounts_raw",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual"
)
[Figure: tSNE plot (Dimension 1 vs. Dimension 2) without batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
plotTSNE(
reads[endog_genes, ],
exprs_values = "logcounts_raw",
perplexity = 130,
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
rand_seed = 123456
)
plotTSNE(
reads.qc[endog_genes, ],
exprs_values = "logcounts_raw",
perplexity = 130,
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
rand_seed = 123456
)
sessionInfo()
[Figure: tSNE plot (Dimension 1 vs. Dimension 2) without batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
[Figure: PCA plot (Component 1: 60% variance, Component 2: 3% variance) including batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
[Figure: PCA plot (Component 1: 18% variance, Component 2: 2% variance) including batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
7.5.1 Introduction
There are many potential confounders, artifacts and biases in scRNA-seq data. One of the main
challenges in analyzing scRNA-seq data stems from the fact that it is difficult to carry out a true technical
replicate (why?) to distinguish biological and technical variability. In the previous chapters we considered
batch effects and in this chapter we will continue to explore how experimental artifacts can be identified
and removed. We will continue using the scater package since it provides a set of methods specifically
for quality control of experimental and explanatory variables. Moreover, we will continue to work with the
Blischak data that was used in the previous chapter.
library(scater, quietly = TRUE)
options(stringsAsFactors = FALSE)
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[rowData(umi)$use, colData(umi)$use]
endog_genes <- !rowData(umi.qc)$is_feature_control
[Figure: PCA plot (Component 1: 6% variance, Component 2: 2% variance), coloured by batch, sized by total_features, shaped by individual]
The umi.qc dataset contains filtered cells and genes. Our next step is to explore technical drivers of variability
in the data to inform data normalisation before downstream analysis.
Let’s first look again at the PCA plot of the QCed dataset:
plotPCA(
umi.qc[endog_genes, ],
exprs_values = "logcounts_raw",
colour_by = "batch",
size_by = "total_features"
)
scater allows one to identify principal components that correlate with experimental and QC variables of
interest (it ranks principal components by R² from a linear model regressing PC value against the variable
of interest).
Let’s test whether some of the variables correlate with any of the PCs.
plotQC(
umi.qc[endog_genes, ],
type = "find-pcs",
exprs_values = "logcounts_raw",
variable = "total_features"
)
[Figure: tSNE plot (Dimension 1 vs. Dimension 2) including batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
Indeed, we can see that PC1 can be almost completely explained by the number of detected genes. In fact,
it was also visible on the PCA plot above. This is a well-known issue in scRNA-seq and was described here.
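The ranking of PCs by R² can be reproduced by hand. The sketch below is a base-R illustration on simulated data (it does not use scater): a quarter of the genes are made to track a simulated total_features value, each of the first ten PCs is regressed on that variable with lm(), and the PCs are sorted by R².

```r
set.seed(3)
n_cells <- 80

# Simulated per-cell covariate and a genes-by-cells expression matrix
total_features <- runif(n_cells, 7000, 11000)
expr <- matrix(rnorm(200 * n_cells), nrow = 200)

# Let the first 50 genes track the (standardised) number of detected genes
signal <- 2 * scale(total_features)[, 1]
expr[1:50, ] <- expr[1:50, ] + matrix(rep(signal, each = 50), nrow = 50)

# PCA with cells as observations
pca <- prcomp(t(expr))

# R-squared from regressing each of the first 10 PCs on the covariate
r2 <- apply(pca$x[, 1:10], 2, function(pc) {
  summary(lm(pc ~ total_features))$r.squared
})

# PCs ranked by how well the covariate explains them
sort(r2, decreasing = TRUE)
```

In this simulation PC1 comes out on top by construction; in real data the top-ranked component need not be the first one.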
scater can also compute the marginal R² for each variable when fitting a linear model regressing expression
values for each gene against just that variable, and display a density plot of the gene-wise marginal R² values
for the variables.
plotQC(
umi.qc[endog_genes, ],
type = "expl",
exprs_values = "logcounts_raw",
variables = c(
"total_features",
"total_counts",
"batch",
"individual",
"pct_counts_ERCC",
"pct_counts_MT"
)
)
[Figure: tSNE plot (Dimension 1 vs. Dimension 2) without batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
This analysis indicates that the number of detected genes (again) and also the sequencing depth (number
of counts) have substantial explanatory power for many genes, so these variables are good candidates for
conditioning out in a normalisation step, or including in downstream statistical models. Expression of ERCCs
also appears to be an important explanatory variable and one notable feature of the above plot is that batch
explains more than individual. What does that tell us about the technical and biological variability of the
data?
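The gene-wise marginal R² values behind such a density plot can be computed directly. The following is a simplified base-R sketch on simulated data, not scater's implementation: marginal_r2() is a hypothetical helper that fits one linear model per gene against a single variable, and half of the simulated genes are given a batch effect.

```r
set.seed(11)
n_cells <- 90

# Simulated explanatory variables
batch <- factor(rep(c("b1", "b2", "b3"), each = 30))
total_counts <- rnorm(n_cells, mean = 50000, sd = 8000)

# Simulated expression; the first 50 of 100 genes are shifted by batch
expr <- matrix(rnorm(100 * n_cells), nrow = 100)
batch_shift <- c(b1 = 0, b2 = 1.5, b3 = -1.5)[as.character(batch)]
expr[1:50, ] <- sweep(expr[1:50, ], 2, batch_shift, `+`)

# Marginal R-squared: one linear model per gene, one variable at a time
marginal_r2 <- function(expr, covariate) {
  apply(expr, 1, function(y) summary(lm(y ~ covariate))$r.squared)
}

r2_batch  <- marginal_r2(expr, batch)
r2_counts <- marginal_r2(expr, total_counts)

# Compare the two distributions; plot(density(r2_batch)) would show the shape
summary(r2_batch)
summary(r2_counts)
```

A variable whose R² distribution is shifted to the right explains a substantial share of the variance for many genes, which is what makes it a candidate for correction.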
In addition to correcting for batch, there are other factors that one may want to compensate for. As with
batch correction, these adjustments require extrinsic information. One popular method is scLVM which
allows you to identify and subtract the effect from processes such as cell-cycle or apoptosis.
In addition, protocols may differ in terms of their coverage of each transcript, their bias based on the average
content of A/T nucleotides, or their ability to capture short transcripts. Ideally, we would like to compensate
for all of these differences and biases.
7.5.5 Exercise
Perform the same analysis with read counts of the Blischak data. Use the tung/reads.rds file to load the reads
SCE object. Once you have finished, please compare your results to ours (next chapter).
[Figure: tSNE plot (Dimension 1 vs. Dimension 2) without batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
7.5.6 sessionInfo()
[Figure: tSNE plot (Dimension 1 vs. Dimension 2) without batch NA19098.r2, coloured by batch, sized by total_features, shaped by individual]
[Figure: PCA plot (Component 1: 17% variance, Component 2: 7% variance), coloured by batch, sized by total_features]
[Figure: principal component value versus total_features for components 1, 2, 3, 7, 8 and 9 (R-squared 0.93, 0.031, 0.0046, 0.0046, 0.0027 and 0.0023)]
[Figure: density of gene-wise marginal R-squared for total_features, total_counts, batch, pct_counts_ERCC, individual and pct_counts_MT]
[Figure: PCA plot (Component 1: 6% variance, Component 2: 2% variance), coloured by batch, sized by total_features]
[Figure: principal component value versus total_features for components 1, 3, 2, 5, 6 and 4 (R-squared 0.75, 0.040, 0.016, 0.014, 0.0049 and 0.0038)]
[Figure: density of gene-wise marginal R-squared for batch, total_features, pct_counts_ERCC, total_counts, individual and pct_counts_MT]
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] knitr_1.19 scater_1.6.2
## [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1
## [5] DelayedArray_0.4.1 matrixStats_0.53.0
## [7] GenomicRanges_1.30.1 GenomeInfoDb_1.14.0
## [9] IRanges_2.12.0 S4Vectors_0.16.0
## [11] ggplot2_2.2.1 Biobase_2.38.0
## [13] BiocGenerics_0.24.0
##
## loaded via a namespace (and not attached):
## [1] viridis_0.5.0 httr_1.3.1 edgeR_3.20.8
## [4] bit64_0.9-7 viridisLite_0.3.0 shiny_1.0.5
## [7] assertthat_0.2.0 blob_1.1.0 vipor_0.4.5
## [10] GenomeInfoDbData_1.0.0 yaml_2.1.16 progress_1.1.2
## [13] pillar_1.1.0 RSQLite_2.0 backports_1.1.2
## [16] lattice_0.20-34 glue_1.2.0 limma_3.34.8
## [19] digest_0.6.15 XVector_0.18.0 colorspace_1.3-2
## [22] cowplot_0.9.2 htmltools_0.3.6 httpuv_1.3.5
## [25] Matrix_1.2-7.1 plyr_1.8.4 XML_3.98-1.9
## [28] pkgconfig_2.0.1 biomaRt_2.34.2 bookdown_0.6
7.7.1 Introduction
In the previous chapter we identified important confounding factors and explanatory variables. scater
allows one to account for these variables in subsequent statistical models or to condition them out using
normaliseExprs(), if so desired. This can be done by providing a design matrix to normaliseExprs(). We
are not covering this topic here, but you can try to do it yourself as an exercise.
Instead we will explore how simple size-factor normalisations correcting for library size can remove the effects
of some of the confounders and explanatory variables.
Library sizes vary because scRNA-seq data is often sequenced on highly multiplexed platforms, so the total
number of reads derived from each cell may differ substantially. Some quantification methods (e.g. Cufflinks,
RSEM) incorporate library size when determining gene expression estimates and thus do not require this
normalization. However, if another quantification method was used then library size must be corrected for by
multiplying or dividing each column of the expression matrix by a “normalization factor”, which is an estimate
of the library size relative to the other cells. Many methods to correct for library size have been developed
for bulk RNA-seq and can equally be applied to scRNA-seq (e.g. UQ, SF, CPM, RPKM, FPKM, TPM).
7.7.3 Normalisations
7.7.3.1 CPM
The simplest way to normalize this data is to convert it to counts per million (CPM) by dividing each column
by its total then multiplying by 1,000,000. Note that spike-ins should be excluded from the calculation of
total expression in order to correct for total cell RNA content, therefore we will only use endogenous genes.
Example of a CPM function in R:
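calc_cpm <-
function (expr_mat, spikes = NULL)
{
    # Column totals use endogenous genes only; accept indices or a logical mask.
    if (is.logical(spikes)) spikes <- which(spikes)
    endog <- if (length(spikes) == 0) expr_mat else expr_mat[-spikes, , drop = FALSE]
    norm_factor <- colSums(endog)
    # Multiply inside return() so the scaling to one million is applied.
    return(t(t(expr_mat)/norm_factor) * 10^6)
}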
One potential drawback of CPM arises if your sample contains genes that are both very highly expressed and
differentially expressed across the cells. In this case, the total number of molecules in the cell may depend
on whether such genes are on/off in the cell, and normalizing by total molecules may hide the differential
expression of those genes and/or falsely create differential expression for the remaining genes.
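The effect can be seen on a small made-up matrix. The sketch below is only an illustration with hypothetical gene names and counts: three genes have identical counts in both cells, while a fourth gene is switched on at a very high level in one cell only.

```r
# Toy counts: 4 genes x 2 cells (hypothetical values).
# g1-g3 are identical in both cells; g4 is highly expressed in cell2 only.
counts <- matrix(
  c(100, 100, 100,   0,
    100, 100, 100, 700),
  nrow = 4,
  dimnames = list(paste0("g", 1:4), c("cell1", "cell2"))
)

# CPM: divide each column by its total, then multiply by one million.
cpm <- t(t(counts) / colSums(counts)) * 1e6

# g1 has the same raw count in both cells, yet its CPM differs.
print(cpm["g1", ])
fold_change_g1 <- unname(cpm["g1", "cell1"] / cpm["g1", "cell2"])
print(fold_change_g1)
```

Although g1 has the same raw count in both cells, its CPM is about 3.3 times lower in cell2, purely because g4 inflates that cell's total.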
Note RPKM, FPKM and TPM are variants on CPM which further adjust counts by the length of the
respective gene/transcript.
To deal with this potential drawback of CPM, several other measures were devised.
The size factor (SF) was proposed and popularized by DESeq (Anders and Huber, 2010). First the
geometric mean of each gene across all cells is calculated. The size factor for each cell is the median across
genes of the ratio of the expression to the gene’s geometric mean. A drawback to this method is that, since
it uses the geometric mean, only genes with non-zero expression values across all cells can be used in its
calculation, making it inadvisable for large low-depth scRNASeq experiments. edgeR & scater call this
method RLE for “relative log expression”. Example of a SF function in R:
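calc_sf <-
function (expr_mat, spikes = NULL)
{
    # Drop spike-in rows only if some were given (indices or a logical mask).
    if (is.logical(spikes)) spikes <- which(spikes)
    endog <- if (length(spikes) == 0) expr_mat else expr_mat[-spikes, , drop = FALSE]
    # Geometric mean of each gene; zero whenever the gene has any zero count.
    geomeans <- exp(rowMeans(log(endog)))
    SF <- function(cnts) {
        median((cnts/geomeans)[(is.finite(geomeans) & geomeans > 0)])
    }
    norm_factor <- apply(endog, 2, SF)
    return(t(t(expr_mat)/norm_factor))
}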
7.7.3.3 UQ
The upper quartile (UQ) was proposed by Bullard et al. (2010). Here each column is divided by the 75%
quantile of the counts for each library. Often the calculated quantile is scaled by the median across cells to
keep the absolute level of expression relatively consistent. A drawback to this method is that for low-depth
scRNASeq experiments the large number of undetected genes may result in the 75% quantile being zero (or
close to it). This limitation can be overcome by generalizing the idea and using a higher quantile (e.g. the
99% quantile is the default in scater) or by excluding zeros prior to calculating the 75% quantile. Example
of a UQ function in R:
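calc_uq <-
function (expr_mat, spikes = NULL)
{
    # 75% quantile of the non-zero counts in one cell.
    UQ <- function(x) {
        quantile(x[x > 0], 0.75)
    }
    # Drop spike-in rows only if some were given (indices or a logical mask).
    if (is.logical(spikes)) spikes <- which(spikes)
    endog <- if (length(spikes) == 0) expr_mat else expr_mat[-spikes, , drop = FALSE]
    uq <- unlist(apply(endog, 2, UQ))
    # Scale by the median so the overall expression level stays comparable.
    norm_factor <- uq/median(uq)
    return(t(t(expr_mat)/norm_factor))
}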
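To see why the plain 75% quantile can fail on sparse data, consider a single hypothetical cell in which most genes are undetected. The values below are made up for illustration; only base R's `quantile()` is used.

```r
# Hypothetical sparse cell: 20 genes, only 4 of them detected.
cell <- c(rep(0, 16), 1, 2, 5, 12)

uq_all     <- unname(quantile(cell, 0.75))            # zeros included
uq_nonzero <- unname(quantile(cell[cell > 0], 0.75))  # zeros excluded
q99_all    <- unname(quantile(cell, 0.99))            # higher quantile, zeros included

print(c(uq_all = uq_all, uq_nonzero = uq_nonzero, q99_all = q99_all))
```

With zeros included the upper quartile is 0, so it cannot be used as a divisor; excluding zeros or moving to a higher quantile both give a usable positive value.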
7.7.3.4 TMM
Another method, TMM, is the weighted trimmed mean of M-values (to the reference) proposed
by Robinson and Oshlack (2010). The M-values in question are the gene-wise log2 fold changes between
individual cells. One cell is used as the reference, then the M-values for each other cell are calculated relative
to this reference. These values are then trimmed by removing the top and bottom ~30%, and the average of
the remaining values is calculated by weighting them to account for the effect of the log scale on variance.
Each non-reference cell is multiplied by the calculated factor. Two potential issues with this method are
insufficient non-zero genes left after trimming, and the assumption that most genes are not differentially
expressed.
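The core idea can be sketched in a few lines of base R. The function below, `simple_tmm`, is a hypothetical, deliberately simplified version: it takes an unweighted trimmed mean of the M-values and omits the precision weights and the additional trimming on average expression that edgeR applies, so it should not be read as edgeR's implementation.

```r
# Simplified, unweighted sketch of the trimmed-mean-of-M-values idea.
# NOT edgeR's calcNormFactors(): no precision weights, no trimming on A-values.
simple_tmm <- function(expr_mat, ref = 1, trim = 0.3) {
  # Work on library-size-scaled proportions.
  props <- t(t(expr_mat) / colSums(expr_mat))
  apply(props, 2, function(p) {
    ok <- p > 0 & props[, ref] > 0       # M-values need non-zero counts
    m <- log2(p[ok] / props[ok, ref])    # gene-wise log2 fold changes to the reference
    2^mean(m, trim = trim)               # drop the top and bottom `trim` fraction
  })
}

# Hypothetical data: cell2 is sequenced twice as deeply as cell1,
# and additionally has one very highly expressed gene (the last one).
ref_counts <- seq(10, 100, by = 10)
cell2 <- 2 * ref_counts
cell2[10] <- 1300
expr <- cbind(cell1 = ref_counts, cell2 = cell2)

tmm_factors <- simple_tmm(expr)
print(tmm_factors)
```

Here the library sizes differ four-fold because of the single outlier gene, but the trimmed mean ignores it: the factor is 1 for the reference and 0.5 for cell2, and dividing each cell's proportions by its factor makes the nine unchanged genes agree between the two cells.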
7.7.3.5 scran
The scran package implements a variant on CPM specialized for single-cell data (L. Lun et al., 2016). Briefly,
this method deals with the problem of very large numbers of zero values per cell by pooling cells together
and calculating a normalization factor (similar to CPM) for the sum of each pool. Since each cell is found in
many different pools, cell-specific factors can be deconvoluted from the collection of pool-specific factors
using linear algebra.
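The deconvolution step can be illustrated with a toy linear system. The sketch below assumes, purely for illustration, that each pool's factor is exactly the sum of the factors of its member cells; the pool layout, the helper `make_pools` and the factor values are hypothetical, and scran's actual procedure is more involved.

```r
# Toy deconvolution: recover per-cell factors from pool-level factors.
n_cells <- 6
true_sf <- c(0.5, 0.8, 1.0, 1.2, 1.5, 2.0)   # hypothetical per-cell factors

# Build sliding pools of `size` consecutive cells, wrapping around the end.
# Returns a pools x cells indicator matrix.
make_pools <- function(n, size) {
  t(sapply(seq_len(n), function(i) {
    members <- ((i - 1 + 0:(size - 1)) %% n) + 1
    as.numeric(seq_len(n) %in% members)
  }))
}

# Each cell appears in several pools of two different sizes.
A <- rbind(make_pools(n_cells, 3), make_pools(n_cells, 4))
pool_sf <- as.vector(A %*% true_sf)          # "observed" pool-level factors

# Solve the overdetermined system A %*% sf = pool_sf by least squares.
est_sf <- qr.solve(A, pool_sf)
print(round(est_sf, 3))
```

Because every cell contributes to many pools, the 12 pool-level equations determine the 6 cell-level factors, and the least-squares solution recovers the values used to build the pools.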
7.7.3.6 Downsampling
A final way to correct for library size is to downsample the expression matrix so that each cell has
approximately the same total number of molecules. The benefit of this method is that zero values will be
introduced by the downsampling, thus eliminating any biases due to differing numbers of detected genes.
However, the major drawback is that the process is not deterministic, so each time the downsampling is run
the resulting expression matrix is slightly different. Thus, analyses must often be run on multiple
downsamplings to ensure results are robust. Example of a downsampling function in R:
Down_Sample_Matrix <-
function (expr_mat)
{
    # Every cell is downsampled towards the smallest library size.
    min_lib_size <- min(colSums(expr_mat))
    down_sample <- function(x) {
        # Probability of keeping each molecule in this cell.
        prob <- min_lib_size/sum(x)
        return(unlist(lapply(x, function(y) {
            rbinom(1, y, prob)
        })))
    }
    down_sampled_mat <- apply(expr_mat, 2, down_sample)
    return(down_sampled_mat)
}
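The non-determinism mentioned above, and one way to make a run reproducible, can be shown on simulated counts. This is a minimal sketch: `downsample` is a compact, hypothetical re-statement of the same binomial idea, and the count matrix is made up.

```r
# Binomial downsampling of every column towards the smallest library size.
downsample <- function(expr_mat) {
  target <- min(colSums(expr_mat))
  apply(expr_mat, 2, function(x) rbinom(length(x), x, target / sum(x)))
}

# Hypothetical counts: 50 genes x 3 cells with very different depths.
set.seed(42)
counts <- cbind(
  shallow = rpois(50, 5),
  medium  = rpois(50, 20),
  deep    = rpois(50, 80)
)

set.seed(1); run1 <- downsample(counts)
set.seed(1); run2 <- downsample(counts)   # same seed: identical result
run3 <- downsample(counts)                # RNG has moved on: may differ

print(identical(run1, run2))
print(colSums(counts))
print(colSums(run1))   # totals are now similar, but not exactly equal
```

Fixing the seed makes a single downsampling reproducible, but it does not remove the sampling variability, which is why repeating the analysis over several downsamplings is still advisable.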
7.7.4 Effectiveness
To compare the effectiveness of different normalization methods we will use visual inspection of PCA plots and
calculation of cell-wise relative log expression via scater’s plotRLE() function. Namely, cells with many
(few) reads have higher (lower) than median expression for most genes, resulting in a positive (negative) RLE
across the cell, whereas normalized cells have an RLE close to zero. Example of a RLE function in R:
calc_cell_RLE <-
function (expr_mat, spikes = NULL)
{
    # Log2 ratio of each cell to the gene's median; NA if the median is zero.
    RLE_gene <- function(x) {
        if (median(unlist(x)) > 0) {
            log((x + 1)/(median(unlist(x)) + 1))/log(2)
        }
        else {
            rep(NA, times = length(x))
        }
    }
    # Accept spike-ins as row indices or as a logical mask.
    if (is.logical(spikes)) spikes <- which(spikes)
    if (length(spikes) > 0) {
        RLE_matrix <- t(apply(expr_mat[-spikes, , drop = FALSE], 1, RLE_gene))
    }
    else {
        RLE_matrix <- t(apply(expr_mat, 1, RLE_gene))
    }
    # Median over genes gives one RLE value per cell.
    cell_RLE <- apply(RLE_matrix, 2, median, na.rm = TRUE)
    return(cell_RLE)
}
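A small worked example shows how the cell-wise RLE behaves before and after library-size scaling. The helper `cell_rle` below is a compact, hypothetical re-statement of the same calculation, and the three cells are made up so that they differ only in sequencing depth.

```r
# Cell-wise RLE: median over genes of log2((count + 1) / (gene median + 1)).
cell_rle <- function(expr_mat) {
  gene_median <- apply(expr_mat, 1, median)
  keep <- gene_median > 0                      # skip genes with zero median
  rle <- log2((expr_mat[keep, , drop = FALSE] + 1) / (gene_median[keep] + 1))
  apply(rle, 2, median)
}

# Hypothetical cells that differ only in depth (half, reference, double).
base <- c(10, 20, 40, 80, 160)
counts <- cbind(half = base / 2, ref = base, double = base * 2)

raw_rle <- cell_rle(counts)

# Scale every cell to the same total, then recompute.
scaled <- t(t(counts) / colSums(counts)) * mean(colSums(counts))
norm_rle <- cell_rle(scaled)

print(round(raw_rle, 3))
print(round(norm_rle, 3))
```

Before scaling, the shallow cell has a negative RLE and the deep cell a positive one; after scaling to a common total all three are at zero, which is the pattern to look for in the RLE plots.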
Note The RLE, TMM, and UQ size-factor methods were developed for bulk RNA-seq data and, depending
on the experimental context, may not be appropriate for single-cell RNA-seq data, as their underlying
assumptions may be problematically violated.
Note scater acts as a wrapper for the calcNormFactors function from edgeR which implements several
library size normalization methods making it easy to apply any of these methods to our data.
Note edgeR makes extra adjustments to some of the normalization methods which may result in
somewhat different results than if the original methods are followed exactly, e.g. edgeR’s and scater’s
“RLE” method which is based on the “size factor” used by DESeq may give different results to the
estimateSizeFactorsForMatrix method in the DESeq/DESeq2 packages. In addition, some versions of
edgeR will not calculate the normalization factors correctly unless lib.size is set at 1 for all cells.
Note For CPM normalisation we use scater’s calculateCPM() function. For RLE, UQ and TMM we
use scater’s normaliseExprs() function. For scran we use the scran package to calculate size factors (it
also operates on the SingleCellExperiment class) and scater’s normalize() to normalise the data. All these
normalization functions save the results to the logcounts slot of the SCE object. For downsampling we
use our own functions shown above.
7.8.1 Raw
plotPCA(
    umi.qc[endog_genes, ],
    exprs_values = "logcounts_raw",
    colour_by = "batch",
    size_by = "total_features",
    shape_by = "individual"
)
7.8.2 CPM
Figure 7.39: PCA plot of the tung data after CPM normalisation
[Plot: cell-wise relative log expression per sample, Raw vs CPM, coloured by batch]
Figure 7.41: PCA plot of the tung data after RLE normalisation
[Plot: cell-wise relative log expression per sample, Raw vs RLE, coloured by batch]
7.8.4 Upperquantile
[Plot: PCA of the tung data after UQ normalisation, coloured by batch, sized by total_features, shaped by individual]
plotRLE(
umi.qc[endog_genes, ],
exprs_mats = list(Raw = "logcounts_raw", UQ = "logcounts"),
exprs_logged = c(TRUE, TRUE),
colour_by = "batch"
)
7.8.5 TMM
[Plot: cell-wise relative log expression per sample, Raw vs UQ, coloured by batch]
size_by = "total_features",
shape_by = "individual"
)
plotRLE(
umi.qc[endog_genes, ],
exprs_mats = list(Raw = "logcounts_raw", TMM = "logcounts"),
exprs_logged = c(TRUE, TRUE),
colour_by = "batch"
)
7.8.6 scran
Figure 7.45: PCA plot of the tung data after TMM normalisation
[Plot: cell-wise relative log expression per sample, Raw vs TMM, coloured by batch]
Figure 7.47: PCA plot of the tung data after LSF normalisation
)
plotRLE(
umi.qc[endog_genes, ],
exprs_mats = list(Raw = "logcounts_raw", scran = "logcounts"),
exprs_logged = c(TRUE, TRUE),
colour_by = "batch"
)
scran sometimes calculates negative or zero size factors. These will completely distort the normalized
expression matrix. We can check the size factors scran has computed like so:
summary(sizeFactors(umi.qc))
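Beyond the summary, it can help to list exactly which cells are affected. The sketch below uses a made-up named vector `sf` as a stand-in for the result of `sizeFactors(umi.qc)`; the cell names and values are hypothetical.

```r
# `sf` stands in for sizeFactors(umi.qc); the values here are hypothetical.
sf <- c(cellA = 1.2, cellB = 0.7, cellC = -0.1, cellD = 0, cellE = 2.3)

# Size factors must be finite and strictly positive to be usable.
bad <- names(sf)[!is.finite(sf) | sf <= 0]
print(bad)

# One option is to set the offending cells aside before normalising.
usable <- sf[setdiff(names(sf), bad)]
print(usable)
```

Whether to drop such cells or recompute the size factors with different settings depends on how many cells are affected, so this check is only a first diagnostic.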
7.8.7 Downsampling
[Plot: cell-wise relative log expression per sample, Raw vs scran, coloured by batch]
colour_by = "batch",
size_by = "total_features",
shape_by = "individual"
)
plotRLE(
umi.qc[endog_genes, ],
exprs_mats = list(Raw = "logcounts_raw", DownSample = "logcounts"),
exprs_logged = c(TRUE, TRUE),
colour_by = "batch"
)
Some methods combine library size and fragment/gene length normalization, such as:
• RPKM - Reads Per Kilobase Million (for single-end sequencing)
• FPKM - Fragments Per Kilobase Million (same as RPKM but for paired-end sequencing, makes sure
that paired ends mapped to the same fragment are not counted twice)
• TPM - Transcripts Per Million (same as RPKM, but the order of normalizations is reversed
- length first and sequencing depth second)
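The difference in the order of operations can be made concrete with a tiny example. The counts and gene lengths below are hypothetical, and the calculation is a plain base-R sketch rather than the output of any quantification software.

```r
# Hypothetical counts (3 genes x 2 cells) and gene lengths in kilobases.
counts <- matrix(c(10, 20, 30,
                   20, 40, 120),
                 nrow = 3,
                 dimnames = list(c("gA", "gB", "gC"), c("cell1", "cell2")))
len_kb <- c(gA = 1, gB = 2, gC = 3)

# RPKM/FPKM: depth first (per million reads), then length (per kilobase).
rpkm <- t(t(counts) / colSums(counts)) * 1e6 / len_kb

# TPM: length first (reads per kilobase), then depth (per million).
rpk <- counts / len_kb
tpm <- t(t(rpk) / colSums(rpk)) * 1e6

print(colSums(rpkm))  # differs between cells
print(colSums(tpm))   # one million in every cell
```

Within a cell the two measures are proportional to each other, but only TPM sums to the same total in every cell, which is what makes TPM values more directly comparable across cells.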
These methods are not applicable to our dataset since the end of the transcript which contains the UMI
was preferentially sequenced. Furthermore, in general these should only be calculated using appropriate
quantification software from aligned BAM files, not from read counts, since often only a portion of the
entire gene/transcript is sequenced, not the entire length. If in doubt, check for a relationship between
gene/transcript length and expression level.
[Plot: PCA of the tung data after downsampling, coloured by batch, sized by total_features, shaped by individual]
[Plot: cell-wise relative log expression per sample, Raw vs DownSample, coloured by batch]
# If you have mouse data, change the arguments based on this example:
# getBMFeatureAnnos(
# object,
# filters = "ensembl_transcript_id",
# attributes = c(
# "ensembl_transcript_id",
# "ensembl_gene_id",
# "mgi_symbol",
# "chromosome_name",
# "transcript_biotype",
# "transcript_start",
# "transcript_end",
# "transcript_count"
# ),
# feature_symbol = "mgi_symbol",
# feature_id = "ensembl_gene_id",
# biomart = "ENSEMBL_MART_ENSEMBL",
# dataset = "mmusculus_gene_ensembl",
# host = "www.ensembl.org"
# )
Some of the genes were not annotated, therefore we filter them out:
umi.qc.ann <- umi.qc[!is.na(rowData(umi.qc)$ensembl_gene_id), ]
Now we compute the total gene length in Kilobases by using the end_position and start_position fields:
eff_length <-
abs(rowData(umi.qc.ann)$end_position - rowData(umi.qc.ann)$start_position) / 1000
plot(eff_length, rowMeans(counts(umi.qc.ann)))
There is no relationship between gene length and mean expression, so FPKMs & TPMs are not appropriate
for this dataset.
plotPCA(
umi.qc.ann,
exprs_values = "tpm",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual"
)
Note The PCA looks for differences between cells. Gene length is the same across cells for each gene thus
FPKM is almost identical to the CPM plot (it is just rotated) since it performs CPM first then normalizes
gene length. Whereas, TPM is different because it weights genes by their length before performing CPM.
7.8.9 Exercise
Perform the same analysis with read counts of the tung data. Use tung/reads.rds file to load the reads
SCE object. Once you have finished please compare your results to ours (next chapter).
7.8.10 sessionInfo()
[Plots: PCA and cell-wise relative log expression (RLE) plots of the tung data for raw, CPM, RLE, UQ, TMM, scran (LSF) and downsampled values]
Figure 7.52: PCA plot of the tung data after CPM normalisation
Figure 7.54: PCA plot of the tung data after RLE normalisation
Figure 7.58: PCA plot of the tung data after TMM normalisation
Figure 7.60: PCA plot of the tung data after LSF normalisation
7.10.1 Introduction
In the previous chapter we normalized for library size, effectively removing it as a confounder. Now we
will consider removing other, less well defined confounders from our data. Technical confounders (aka batch
effects) can arise from differences in reagents, isolation methods, the lab/experimenter who performed the
experiment, and even the day/time at which the experiment was performed. Accounting for technical confounders,
and batch effects in particular, is a large topic that also involves principles of experimental design. Here we
address approaches that can be taken to account for confounders when the experimental design is appropriate.
Fundamentally, accounting for technical confounders involves identifying and, ideally, removing sources of
variation in the expression data that are not related to (i.e. are confounding) the biological signal of interest.
Various approaches exist, some of which use spike-in or housekeeping genes, and some of which use endogenous
genes.
The use of spike-ins as control genes is appealing, since the same amount of ERCC (or other) spike-in was
added to each cell in our experiment. In principle, all the variability we observe for these genes is due to
technical noise, whereas endogenous genes are affected by both technical noise and biological variability.
Technical noise can be removed by fitting a model to the spike-ins and “subtracting” this from the endogenous
genes. There are several methods available based on this premise (e.g. BASiCS, scLVM, RUVg), each using
different noise models and different fitting procedures. Alternatively, one can identify genes which exhibit
significant variation beyond technical noise (e.g. Distance to median, Highly variable genes). However, there
are issues with the use of spike-ins for normalisation (particularly ERCCs, derived from bacterial sequences),
including that their variability can, for various reasons, actually be higher than that of endogenous genes.
Given the issues with using spike-ins, better results can often be obtained by using endogenous genes instead.
Where we have a large number of endogenous genes that, on average, do not vary systematically between
cells and where we expect technical effects to affect a large number of genes (a very common and reasonable
assumption), then such methods (for example, the RUVs method) can perform well.
We explore both general approaches below.
library(scRNA.seq.funcs)
library(RUVSeq)
library(scater)
library(SingleCellExperiment)
library(scran)
library(kBET)
library(sva) # Combat
library(edgeR)
set.seed(1234567)
options(stringsAsFactors = FALSE)
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[rowData(umi)$use, colData(umi)$use]
endog_genes <- !rowData(umi.qc)$is_feature_control
erccs <- rowData(umi.qc)$is_feature_control
Factors contributing to technical noise frequently appear as “batch effects”, where cells processed on different
days or by different technicians systematically vary from one another. Removing technical noise and correcting
for batch effects can frequently be performed using the same tool or slight variants on it. We will be
considering Remove Unwanted Variation (RUVSeq). Briefly, RUVSeq works as follows. For n samples
and J genes, consider the following generalized linear model (GLM), where the RNA-Seq read counts are
regressed on both the known covariates of interest and unknown factors of unwanted variation:
$$\log E[Y \mid W, X, O] = W\alpha + X\beta + O$$
Here, Y is the n × J matrix of observed gene-level read counts, W is an n × k matrix corresponding to the
factors of “unwanted variation”, X is the matrix of known covariates of interest, and O is an n × J matrix of
offsets that can either be set to zero or estimated with some other normalization procedure (such as
upper-quartile normalization). The simultaneous estimation of W, α, β, and k is infeasible. Instead, for a
given k, the following three approaches to estimate the factors of unwanted variation W are used:
• RUVg uses negative control genes (e.g. ERCCs), assumed to have constant expression across samples;
• RUVs uses centered (technical) replicate/negative control samples for which the covariates of interest
are constant;
• RUVr uses residuals, e.g., from a first-pass GLM regression of the counts on the covariates of interest.
We will concentrate on the first two approaches.
7.10.2.1 RUVg
7.10.2.2 RUVs
7.10.3 Combat
If you have an experiment with a balanced design, Combat can be used to eliminate batch effects while
preserving biological effects by specifying the biological effects using the mod parameter. However, the Tung
data contains multiple experimental replicates rather than a balanced design, so using mod1 to preserve
biological variability will result in an error.
combat_data <- logcounts(umi.qc)
mod_data <- as.data.frame(t(combat_data))
# Basic batch removal
mod0 = model.matrix(~ 1, data = mod_data)
# Preserve biological variability
mod1 = model.matrix(~ umi.qc$individual, data = mod_data)
# adjust for total genes detected
mod2 = model.matrix(~ umi.qc$total_features, data = mod_data)
assay(umi.qc, "combat") <- ComBat(
dat = t(mod_data),
batch = factor(umi.qc$batch),
mod = mod0,
par.prior = TRUE,
prior.plots = FALSE
)
7.10.4 mnnCorrect
mnnCorrect (Haghverdi et al., 2017) assumes that each batch shares at least one biological condition with
each other batch. Thus it works well for a variety of balanced experimental designs. However, the Tung data
contains multiple replicates for each individual rather than balanced batches, thus we will normalize each
individual separately. Note that this will remove batch effects between batches within the same individual
but not the batch effects between batches in different individuals, due to the confounded experimental design.
Thus we will merge a replicate from each individual to form three batches.
do_mnn <- function(data.qc) {
batch1 <- logcounts(data.qc[, data.qc$replicate == "r1"])
batch2 <- logcounts(data.qc[, data.qc$replicate == "r2"])
batch3 <- logcounts(data.qc[, data.qc$replicate == "r3"])
if (ncol(batch2) > 0) {
x = mnnCorrect(
batch1, batch2, batch3,
k = 20,
sigma = 0.1,
cos.norm.in = TRUE,
svd.dim = 2
)
res1 <- x$corrected[[1]]
res2 <- x$corrected[[2]]
res3 <- x$corrected[[3]]
dimnames(res1) <- dimnames(batch1)
dimnames(res2) <- dimnames(batch2)
7.10.5 GLM
A general linear model is a simpler version of Combat. It can correct for batches while preserving biological
effects if you have a balanced design. In a confounded/replicate design biological effects will not be
fit/preserved. Similar to mnnCorrect we could remove batch effects from each individual separately in order
to preserve biological (and technical) variance between individuals. For demonstration purposes we will
naively correct all confounded batch effects:
glm_fun <- function(g, batch, indi) {
model <- glm(g ~ batch + indi)
model$coef[1] <- 0 # replace intercept with 0 to preserve reference batch.
return(model$coef)
}
effects <- apply(
logcounts(umi.qc),
1,
glm_fun,
batch = umi.qc$batch,
indi = umi.qc$individual
)
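How fitted batch coefficients can then be removed from the data is easiest to see for a single gene in a balanced design. The sketch below is an illustration on simulated data, not the course's own correction step: the design, effect sizes and variable names are hypothetical, and `lm()` is used in place of `glm()` (the two give the same fit for a Gaussian model).

```r
set.seed(1)
# Hypothetical balanced design: 2 individuals x 2 batches, 25 cells each.
design <- expand.grid(rep = 1:25, batch = c("b1", "b2"), indi = c("i1", "i2"))
batch <- factor(design$batch)
indi  <- factor(design$indi)

# One simulated gene: individual effect of 2, batch effect of 3, plus noise.
g <- 5 + 2 * (indi == "i2") + 3 * (batch == "b2") + rnorm(nrow(design), sd = 0.5)

fit <- lm(g ~ batch + indi)

# Batch columns of the model matrix times their fitted coefficients
# give the estimated batch contribution for every cell.
X <- model.matrix(fit)
batch_cols <- grep("^batch", colnames(X))
batch_effect <- as.vector(X[, batch_cols, drop = FALSE] %*% coef(fit)[batch_cols])
g_corrected <- g - batch_effect

print(tapply(g_corrected, batch, mean))  # batch means now agree
print(tapply(g_corrected, indi, mean))   # individual difference is kept
```

Because individual is included in the model and the design is balanced, only the batch term is subtracted and the difference between individuals survives; in a confounded design the two terms cannot be separated in this way.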
Exercise 2
Perform GLM correction for each individual separately. Store the final corrected matrix in the glm_indi
slot.
A key question when considering the different methods for removing confounders is how to quantitatively
determine which one is the most effective. The main reason why comparisons are challenging is that
it is often difficult to know what corresponds to technical confounders and what is interesting biological
variability. Here, we consider three different metrics which are all reasonable based on our knowledge of
the experimental design. Depending on the biological question that you wish to address, it is important to
choose a metric that allows you to evaluate the confounders that are likely to be the biggest concern for the
given situation.
7.10.6.1 Effectiveness 1
We evaluate the effectiveness of the normalization by inspecting the PCA plot where colour corresponds to
the technical replicates and shape corresponds to different biological samples (individuals). Separation of
biological samples and interspersed batches indicates that technical variation has been removed. We always
use log2-cpm normalized data to match the assumptions of PCA.
for(n in assayNames(umi.qc)) {
print(
plotPCA(
umi.qc[endog_genes, ],
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
exprs_values = n
) +
ggtitle(n)
)
}
[Figures: PCA plots for each assay (counts, logcounts_raw, logcounts, ruvg1, ruvg10, ruvs1, ruvs10, combat, combat_tf, mnn, glm, glm_indi), coloured by batch, sized by total_features and shaped by individual]
Exercise 3
Consider different ks for RUV normalizations. Which gives the best results?
7.10.6.2 Effectiveness 2
We can also examine the effectiveness of correction using the relative log expression (RLE) across cells to
confirm technical noise has been removed from the dataset. Note that RLE only evaluates whether the numbers of
genes higher and lower than average are equal for each cell, i.e. systematic technical effects. Random technical
noise between batches may not be detected by RLE.
res <- list()
for(n in assayNames(umi.qc)) {
res[[n]] <- suppressWarnings(calc_cell_RLE(assay(umi.qc, n), erccs))
}
par(mar=c(6,4,1,1))
boxplot(res, las=2)
[Figure: boxplots of cell-wise RLE for each assay (counts, logcounts_raw, logcounts, ruvg1, ruvg10, ruvs1, ruvs10, combat, combat_tf, mnn, glm, glm_indi)]
7.10.6.3 Effectiveness 3
We can repeat the analysis from Chapter 12 to check whether batch effects have been removed.
for(n in assayNames(umi.qc)) {
    print(
        plotQC(
            umi.qc[endog_genes, ],
            type = "expl",
            exprs_values = n,
            variables = c(
                "total_features",
                "total_counts",
                "batch",
                "individual",
                "pct_counts_ERCC",
                "pct_counts_MT"
            )
        ) +
        ggtitle(n)
    )
}
[Figure: density plot of the variance explained by each variable for the counts assay]
[Figures: density plots of the variance explained by batch, individual, total_features, total_counts, pct_counts_ERCC and pct_counts_MT for each remaining assay (logcounts_raw, logcounts, ruvg1, ruvg10, ruvs1, ruvs10, combat, combat_tf, mnn, glm, glm_indi)]
Exercise 4
Perform the above analysis for each normalization/batch correction method. Which method(s) are most/least
effective? Why is the variance accounted for by batch never lower than the variance accounted for by
individual?
7.10.6.4 Effectiveness 4
Another method to check the efficacy of batch-effect correction is to consider the intermingling of points
from different batches in local subsamples of the data. If there are no batch effects then the proportion of cells
from each batch in any local region should be equal to the global proportion of cells in each batch.
kBET (Büttner et al., 2017) takes kNN networks around random cells and tests the number of cells from each
batch against a binomial distribution. The rejection rate of these tests indicates the severity of batch-effects
still present in the data (high rejection rate = strong batch effects). kBET assumes each batch contains the
same complement of biological groups, thus it can only be applied to the entire dataset if a perfectly balanced
design has been used. However, kBET can also be applied to replicate-data if it is applied to each biological
group separately. In the case of the Tung data, we will apply kBET to each individual independently to check
for residual batch effects. However, this method will not identify residual batch-effects which are confounded
with biological conditions. In addition, kBET does not determine if biological signal has been preserved.
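The local-versus-global comparison described above can be sketched in base R. This is only an illustration of the idea, not the kBET package: it assumes two batches, uses an exact binomial test on each neighbourhood, and the helper `rejection_rate_sketch` and the simulated coordinates are hypothetical.

```r
# Minimal sketch of the idea behind kBET with two batches (not the kBET package).
set.seed(42)
n_per_batch <- 100
batch <- rep(c("b1", "b2"), each = n_per_batch)

# Hypothetical helper: fraction of local neighbourhoods whose batch composition
# differs significantly from the global composition.
rejection_rate_sketch <- function(coords, batch, k = 25, n_tests = 50, alpha = 0.05) {
    d <- as.matrix(dist(coords))              # cell-cell Euclidean distances
    global_p <- mean(batch == "b1")           # global proportion of batch b1
    centres <- sample(nrow(coords), n_tests)  # random cells to test around
    pvals <- sapply(centres, function(i) {
        nn <- order(d[i, ])[1:k]              # k nearest cells (including i itself)
        binom.test(sum(batch[nn] == "b1"), k, p = global_p)$p.value
    })
    mean(pvals < alpha)
}

# Well-mixed batches: both drawn from the same distribution
mixed <- matrix(rnorm(2 * n_per_batch * 2), ncol = 2)
# Strong batch effect: the second batch is shifted in the first dimension
shifted <- mixed
shifted[batch == "b2", 1] <- shifted[batch == "b2", 1] + 5

rate_mixed <- rejection_rate_sketch(mixed, batch)
rate_shifted <- rejection_rate_sketch(shifted, batch)
```

In this simulation `rate_shifted` should be much larger than `rate_mixed`, in line with the interpretation that a high rejection rate indicates strong batch effects.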
compare_kBET_results <- function(sce){
indiv <- unique(sce$individual)
norms <- assayNames(sce) # Get all normalizations
results <- list()
for (i in indiv){
for (j in norms){
tmp <- kBET(
df = t(assay(sce[,sce$individual== i], j)),
batch = sce$batch[sce$individual==i],
heuristic = TRUE,
verbose = FALSE,
addTest = FALSE,
plot = FALSE)
results[[i]][[j]] <- tmp$summary$kBET.observed[1]
}
}
return(as.data.frame(results))
}
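eff_debatching <- compare_kBET_results(umi.qc)
require("reshape2")
require("RColorBrewer")
# Plot results
dod <- melt(as.matrix(eff_debatching), value.name = "kBET")
colnames(dod)[1:2] <- c("Normalisation", "Individual")
# NOTE: the plotting call is truncated in the source; the geom_tile() heatmap
# below is an assumed reconstruction based on the figure, not the original code.
ggplot(dod, aes(Normalisation, Individual, fill = kBET)) +
    geom_tile() +
    theme(
        axis.text.x = element_text(
            angle = 45,
            hjust = 1
        )
    ) +
    ggtitle("Effect of batch regression methods per individual")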
[Figure: heatmap of kBET rejection rates (colour scale 0.00 to 1.00); x-axis: Normalisation; y-axis: Individual (NA19098, NA19101, NA19239)]
Exercise 5
Why do the raw counts appear to have little batch effects?
Perform the same analysis with read counts of the tung data. Use tung/reads.rds file to load the reads SCE
object. Once you have finished please compare your results to ours (next chapter). Additionally, experiment
with other combinations of normalizations and compare the results.
7.10.8 sessionInfo()
library(scRNA.seq.funcs)
library(RUVSeq)
library(scater)
library(SingleCellExperiment)
library(scran)
library(kBET)
library(sva) # Combat
library(edgeR)
set.seed(1234567)
options(stringsAsFactors = FALSE)
reads <- readRDS("tung/reads.rds")
reads.qc <- reads[rowData(reads)$use, colData(reads)$use]
endog_genes <- !rowData(reads.qc)$is_feature_control
erccs <- rowData(reads.qc)$is_feature_control
Exercise 1
if (ncol(batch2) > 0) {
x = mnnCorrect(
batch1, batch2, batch3,
k = 20,
sigma = 0.1,
cos.norm.in = TRUE,
svd.dim = 2
)
res1 <- x$corrected[[1]]
res2 <- x$corrected[[2]]
res3 <- x$corrected[[3]]
dimnames(res1) <- dimnames(batch1)
dimnames(res2) <- dimnames(batch2)
dimnames(res3) <- dimnames(batch3)
return(cbind(res1, res2, res3))
} else {
x = mnnCorrect(
batch1, batch3,
k = 20,
sigma = 0.1,
cos.norm.in = TRUE,
svd.dim = 2
)
Exercise 2
for(n in assayNames(reads.qc)) {
print(
plotPCA(
reads.qc[endog_genes, ],
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
exprs_values = n
) +
ggtitle(n)
)
}
[Figures: PCA plots of the reads data (Component 1 vs Component 2) for each assay (counts, logcounts_raw, logcounts, ruvg1, ruvg10, ruvs1, ruvs10, combat, combat_tf, mnn, glm, glm_indi); points coloured by batch, sized by total_features and shaped by individual]
[Figure: boxplots of per-cell RLE of the reads data for each assay; y-axis from -1.0 to 1.0]
for(n in assayNames(reads.qc)) {
print(
plotQC(
reads.qc[endog_genes, ],
type = "expl",
exprs_values = n,
variables = c(
"total_features",
"total_counts",
"batch",
"individual",
"pct_counts_ERCC",
"pct_counts_MT"
)
) +
ggtitle(n)
)
}
[Figures: density plots of explained variance of the reads data for each assay (counts, logcounts_raw, logcounts, ruvg1, ruvg10, ruvs1, ruvs10, combat, combat_tf, mnn, glm, glm_indi); variables shown: total_features, total_counts, batch, individual, pct_counts_ERCC, pct_counts_MT]
require("reshape2")
require("RColorBrewer")
# Plot results
dod <- melt(as.matrix(eff_debatching), value.name = "kBET")
colnames(dod)[1:2] <- c("Normalisation", "Individual")
[Figure: heatmap of kBET rejection rates for the reads data (colour scale 0.00 to 1.00); x-axis: Normalisation; y-axis: Individual (NA19098, NA19101, NA19239)]
Biological Analysis
8.1.1 Introduction
One of the most promising applications of scRNA-seq is de novo discovery and annotation of cell-types
based on transcription profiles. Computationally, this is a hard problem as it amounts to unsupervised
clustering. That is, we need to identify groups of cells based on the similarities of the transcriptomes
without any prior knowledge of the labels. Moreover, in most situations we do not even know the number of
clusters a priori. The problem is made even more challenging due to the high level of noise (both technical
and biological) and the large number of dimensions (i.e. genes).
When working with large datasets, it can often be beneficial to apply some sort of dimensionality reduction
method. By projecting the data onto a lower-dimensional sub-space, one is often able to significantly reduce
the amount of noise. An additional benefit is that it is typically much easier to visualize the data in a 2 or
3-dimensional subspace. We have already discussed PCA (chapter 7.3.2) and t-SNE (chapter 7.3.3).
Unsupervised clustering is useful in many different applications and it has been widely studied in machine
learning. Some of the most popular approaches are hierarchical clustering, k-means clustering and
graph-based clustering.
In hierarchical clustering, one can use either a bottom-up or a top-down approach. In the former case, each
cell is initially assigned to its own cluster and pairs of clusters are subsequently merged to create a hierarchy.
With a top-down strategy, one instead starts with all observations in one cluster and then recursively splits
each cluster to form a hierarchy. One of the advantages of this strategy is that the method is deterministic.
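The bottom-up approach can be tried on simulated data with base R's `hclust` and `cutree`. The toy matrix, the group means and the choice of average linkage are assumptions made for this sketch.

```r
# Bottom-up (agglomerative) hierarchical clustering on simulated data
set.seed(1)
# 30 "cells" x 5 "genes": three groups of 10 cells with shifted means
toy <- rbind(
    matrix(rnorm(50, mean = 0), ncol = 5),
    matrix(rnorm(50, mean = 4), ncol = 5),
    matrix(rnorm(50, mean = 8), ncol = 5)
)
true_group <- rep(1:3, each = 10)

hc <- hclust(dist(toy), method = "average")  # merge the closest clusters step by step
clusters <- cutree(hc, k = 3)                # cut the tree into 3 clusters
table(true_group, clusters)                  # compare with the simulated groups
plot(hc)                                     # dendrogram of the hierarchy
```

The dendrogram shows the full hierarchy, and `cutree` turns it into a flat partition for any chosen number of clusters.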
8.1.3.2 k-means
In k-means clustering, the goal is to partition N cells into k different clusters. In an iterative manner, cluster
centers are computed and each cell is assigned to its nearest center.
Most methods for scRNA-seq analysis include a k-means step at some point.
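As a small illustration, base R's `kmeans` can be applied to two simulated groups of points. The data and the use of `nstart = 10` (several random starts) are choices made for this sketch, not part of any scRNA-seq method discussed here.

```r
# k-means on two simulated, well-separated groups of 50 points each
set.seed(1)
toy <- rbind(
    matrix(rnorm(100, mean = 0), ncol = 2),
    matrix(rnorm(100, mean = 6), ncol = 2)
)
# nstart = 10 repeats the random initialisation and keeps the best solution
km <- kmeans(toy, centers = 2, nstart = 10)
table(km$cluster)  # cluster sizes
km$centers         # estimated cluster centers
```

Because the starting centers are random, different runs can give different partitions on less clearly separated data.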
Over the last two decades there has been a lot of interest in analyzing networks in various domains. One
goal is to identify groups or modules of nodes in a network.
Some of these methods can be applied to scRNA-seq data by building a graph where each node represents
a cell. Note that constructing the graph and assigning weights to the edges is not trivial. One advantage
of graph-based methods is that some of them are very efficient and can be applied to networks containing
millions of nodes.
8.1.5.1 SINCERA
8.1.5.2 pcaReduce
pcaReduce (Žurauskienė and Yau, 2016) combines PCA, k-means and “iterative” hierarchical clustering.
Starting from a large number of clusters, pcaReduce iteratively merges similar clusters; after each merging
event it removes the principal component explaining the least variance in the data.
8.1.5.3 SC3
• SC3 (Kiselev et al., 2017) is based on PCA and spectral dimensionality reductions
• Utilises k-means
• Additionally performs consensus clustering
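The consensus step can be illustrated with a small base R sketch. This is not SC3's implementation, which combines several distance measures and transformations. Here we simply repeat k-means on simulated data and record how often each pair of cells lands in the same cluster. The toy data, the number of runs and the final complete-linkage step are assumptions.

```r
# Sketch of consensus clustering from repeated k-means runs (not SC3 itself)
set.seed(1)
toy <- rbind(
    matrix(rnorm(40, mean = 0), ncol = 2),
    matrix(rnorm(40, mean = 6), ncol = 2)
)
n <- nrow(toy)
n_runs <- 20
consensus <- matrix(0, n, n)
for (r in seq_len(n_runs)) {
    cl <- kmeans(toy, centers = 2)$cluster
    # outer(...) is TRUE where two cells share a cluster in this run
    consensus <- consensus + outer(cl, cl, "==")
}
consensus <- consensus / n_runs  # fraction of runs in which two cells co-cluster
# Final labels: hierarchical clustering on the consensus matrix
final <- cutree(hclust(as.dist(1 - consensus), method = "complete"), k = 2)
```

Averaging over runs makes the final labels depend less on any single random initialisation.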
8.1.5.5 SNN-Cliq
SNN-Cliq (Xu and Su, 2015) is a graph-based method. First the method identifies the k-nearest-neighbours
of each cell according to the distance measure. This is used to calculate the number of Shared Nearest
Neighbours (SNN) between each pair of cells. A graph is built by placing an edge between two cells if they
have at least one SNN. Clusters are defined as groups of cells with many edges between them using a “clique”
method. SNN-Cliq requires several parameters to be defined manually.
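The graph construction described above can be sketched in base R. This covers only the counting of shared nearest neighbours on simulated data. It is not the SNN-Cliq code, and the clique-finding step is omitted. The toy data and `k = 3` are assumptions.

```r
# Sketch of a shared-nearest-neighbour (SNN) graph on simulated data
set.seed(1)
toy <- rbind(
    matrix(rnorm(20, mean = 0), ncol = 2),
    matrix(rnorm(20, mean = 10), ncol = 2)
)
k <- 3
n <- nrow(toy)
d <- as.matrix(dist(toy))
diag(d) <- Inf  # a cell is not its own neighbour
# knn[i, ] holds the indices of the k nearest neighbours of cell i
knn <- t(apply(d, 1, function(row) order(row)[1:k]))
# is_nn[i, j] is TRUE if cell j is among the k nearest neighbours of cell i
is_nn <- matrix(FALSE, n, n)
is_nn[cbind(rep(1:n, times = k), as.vector(knn))] <- TRUE
# snn[i, j] = number of neighbours shared by cells i and j
snn <- (is_nn * 1) %*% t(is_nn * 1)
diag(snn) <- 0
# Edge between two cells if they have at least one shared neighbour
adjacency <- snn >= 1
```

In this toy example the two groups are far apart, so no edges connect them and each group forms its own connected part of the graph.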
Seurat clustering is based on a community detection approach similar to SNN-Cliq and to one previously
proposed for analyzing CyTOF data (Levine et al., 2015). Since Seurat has become more like an all-in-one
tool for scRNA-seq data analysis we dedicate a separate chapter to discuss it in more detail (chapter 9).
To compare two sets of clustering labels we can use the adjusted Rand index, a measure of the
similarity between two data clusterings. The adjusted Rand index is at most 1, where 1
means that the two clusterings are identical and 0 means the level of similarity expected by chance; negative
values indicate less agreement than expected by chance.
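For intuition, the adjusted Rand index can be computed by hand from the contingency table of the two labelings. The helper `ari_sketch` below is a hypothetical illustration of the standard formula. In the rest of this chapter we use `adjustedRandIndex` from the mclust package instead.

```r
# Hypothetical helper: adjusted Rand index from a contingency table
ari_sketch <- function(a, b) {
    tab <- table(a, b)
    n <- sum(tab)
    n_pairs <- function(x) sum(choose(x, 2))
    index <- n_pairs(tab)  # pairs of cells grouped together in both labelings
    expected <- n_pairs(rowSums(tab)) * n_pairs(colSums(tab)) / choose(n, 2)
    maximum <- (n_pairs(rowSums(tab)) + n_pairs(colSums(tab))) / 2
    (index - expected) / (maximum - expected)
}

labels_a <- c(1, 1, 1, 2, 2, 2)
labels_b <- c("x", "x", "x", "y", "y", "y")  # same partition, different names
labels_c <- c(1, 2, 1, 2, 1, 2)              # partition unrelated to labels_a

ari_same <- ari_sketch(labels_a, labels_b)  # identical partitions
ari_diff <- ari_sketch(labels_a, labels_c)  # agreement below chance level
```

The index depends only on which cells are grouped together, not on the label names, so relabelled but identical partitions score 1.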
library(pcaMethods)
library(pcaReduce)
library(SC3)
library(scater)
library(SingleCellExperiment)
library(pheatmap)
library(mclust)
set.seed(1234567)
To illustrate clustering of scRNA-seq data, we consider the Deng dataset of cells from developing mouse
embryo (Deng et al., 2014). We have preprocessed the dataset and created a SingleCellExperiment object
in advance. We have also annotated the cells with the cell types identified in the original publication (it is
the cell_type2 column in the colData slot).
## class: SingleCellExperiment
## dim: 22431 268
## metadata(0):
## assays(2): counts logcounts
## rownames(22431): Hvcn1 Gbp7 ... Sox5 Alg11
## rowData names(10): feature_symbol is_feature_control ...
## total_counts log10_total_counts
##
## 16cell 4cell 8cell early2cell earlyblast late2cell
## 50 14 37 8 43 10
## lateblast mid2cell midblast zy
## 30 12 60 4
A simple PCA analysis already separates some strong cell types and provides some insights in the data
structure:
plotPCA(deng, colour_by = "cell_type2")
[Figure: PCA plot of the Deng data coloured by cell_type2; Component 1: 33% variance, Component 2: 16% variance]
As you can see, the early cell types separate quite well, but the three blastocyst timepoints are more difficult
to distinguish.
8.2.2 SC3
Let’s run SC3 clustering on the Deng data. An advantage of SC3 is that it can directly ingest a
SingleCellExperiment object.
Now let’s imagine we do not know the number of clusters k (cell types). SC3 can estimate the number of clusters
for you:
deng <- sc3_estimate_k(deng)
## Estimating k...
metadata(deng)$sc3$k_estimation
## [1] 6
Interestingly, the number of cell types predicted by SC3 is smaller than in the original data annotation.
However, if we combine the early, mid and late stages of each cell type, we have exactly 6 cell types. We
store the merged cell types in the cell_type1 column of the colData slot:
plotPCA(deng, colour_by = "cell_type1")
[Figure: PCA plot of the Deng data coloured by cell_type1 (16cell, 2cell, 4cell, 8cell, blast, zygote); Component 1: 33% variance, Component 2: 16% variance]
Now we are ready to run SC3 (we also ask it to calculate biological properties of the clusters):
deng <- sc3(deng, ks = 10, biology = TRUE)
The SC3 result consists of several different outputs (please see Kiselev et al., 2017 and the SC3 vignette for
more details). Here we show some of them:
Consensus matrix:
sc3_plot_consensus(deng, k = 10, show_pdata = "cell_type2")
[Figure: SC3 consensus matrix for k = 10 (colour scale 0 to 1), annotated by cell_type2]
Silhouette plot:
sc3_plot_silhouette(deng, k = 10)
[Figure: silhouette plot for k = 10; each row gives cluster : number of cells | average silhouette width]
1 : 73 | 0.88
2 : 28 | 0.87
3 : 22 | 0.99
4 : 14 | 1.00
5 : 13 | 0.62
6 : 15 | 0.67
7 : 41 | 0.72
8 : 58 | 0.51
9 : 2 | 0.97
10 : 2 | 0.14
[Figure: heatmap of the expression matrix (colour scale 0 to 12), annotated by cell_type2]
[Figure: heatmap of gene expression (colour scale 0 to 15) with gene names as row labels, annotated by cell_type2 and SC3 cluster]
[Figure: PCA plot of the Deng data coloured by sc3_10_clusters; Component 1: 33% variance, Component 2: 16% variance]
Compare the results of SC3 clustering with the original publication cell type labels:
adjustedRandIndex(colData(deng)$cell_type2, colData(deng)$sc3_10_clusters)
## [1] 0.7705208
Note: Due to direct calculation of distances, SC3 becomes very slow when the number of cells is > 5000. For
large datasets containing up to 10^5 cells we recommend using Seurat (see chapter 9).
• Exercise 1: Run SC3 for k from 8 to 12 and explore different clustering solutions in your web browser.
• Exercise 2: Which clusters are the most stable when k is changed from 8 to 12? (Look at the
“Stability” tab)
• Exercise 3: Check out differentially expressed genes and marker genes for the obtained clusterings.
Please use k = 10.
• Exercise 4: Change the marker genes threshold (the default is 0.85). Does SC3 find more marker
genes?
8.2.3 pcaReduce
pcaReduce operates directly on the expression matrix. It is recommended to use a gene filter and log
transformation before running pcaReduce. We will use the default SC3 gene filter (note that the exprs slot
of a scater object is log-transformed by default).
# use the same gene filter as in SC3
input <- logcounts(deng[rowData(deng)$sc3_gene_filter, ])
There are several parameters used by pcaReduce:
• nbt defines the number of pcaReduce runs (it is stochastic and may have different solutions after different runs)
• q defines the number of dimensions to start clustering with. The output will contain partitions for all k from 2 to q+1.
• method defines the method used for clustering: S to perform sampling-based merging, M to perform merging based on largest probability.
We will run pcaReduce 1 time:
# run pcaReduce 1 time, creating partitions for all k from 2 to q + 1 = 31 clusters
pca.red <- PCAreduce(t(input), nbt = 1, q = 30, method = 'S')[[1]]
[Figure: PCA plot of the Deng data coloured by pcaReduce cluster (clusters 1 to 10); Component 1: 33% variance, Component 2: 16% variance]
Exercise 5: Run pcaReduce for k = 2 and plot a similar PCA plot. Does it look good?
Hint: When running pcaReduce for different ks you do not need to rerun PCAreduce function, just use
already calculated pca.red object.
Our solution:
Exercise 6: Compare the results between pcaReduce and the original publication cell types for k = 10.
[Figure: PCA plot of the Deng data coloured by pcaReduce cluster for k = 2; Component 1: 33% variance, Component 2: 16% variance]
[Figure: tSNE map of the Deng data; Dimension 1 vs Dimension 2]
Our solution:
## [1] 0.4216031
The tSNE plots that we saw before (7.3.3), when we used the scater package, are made using the Rtsne and
ggplot2 packages. Here we will do the same:
deng <- plotTSNE(deng, rand_seed = 1, return_SCE = TRUE)
Note that all points on the plot above are black. This is different from what we saw before, when the cells
were coloured based on the annotation. Here we do not have any annotation and all cells come from the
same batch, therefore all dots are black.
Now we are going to apply the k-means clustering algorithm to the cloud of points on the tSNE map. How many
groups do you see in the cloud?
We will start with k = 8:
# use the reducedDim() accessor and the full name of the kmeans() result element
colData(deng)$tSNE_kmeans <- as.character(kmeans(reducedDim(deng, "TSNE"), centers = 8)$cluster)
plotTSNE(deng, rand_seed = 1, colour_by = "tSNE_kmeans")
Figure 8.8: tSNE map (Dimension 1 vs Dimension 2) of the Deng data with 8 colored clusters, identified by the k-means clustering
algorithm
Our solution:
## [1] 0.3701639
As you may have noticed, both pcaReduce and tSNE+kmeans are stochastic and give different results every
time they are run. To get a better overview of the solutions, we need to run the methods multiple times.
SC3 is also stochastic, but thanks to the consensus step, it is more robust and less likely to produce different
outcomes.
8.2.5 SNN-Cliq
Here we run SNN-Cliq with the default parameters provided in the author’s example:
distan <- "euclidean"
par.k <- 3
par.r <- 0.7
par.m <- 0.5
# construct a graph
scRNA.seq.funcs::SNN(
data = t(input),
outfile = "snn-cliq.txt",
k = par.k,
distance = distan
)
# find clusters in the graph
snn.res <-
system(
paste0(
"python utils/Cliq.py ",
"-i snn-cliq.txt ",
"-o res-snn-cliq.txt ",
"-r ", par.r,
" -m ", par.m
),
intern = TRUE
)
cat(paste(snn.res, collapse = "\n"))
[Figure: PCA plot of the Deng data coloured by SNNCliq cluster (clusters -1 and 1 to 29); Component 1: 33% variance, Component 2: 16% variance]
Exercise 9: Compare the results between SNN-Cliq and the original publication cell types.
Our solution:
## [1] 0.2629731
8.2.6 SINCERA
As mentioned in the previous chapter SINCERA is based on hierarchical clustering. One important thing
to keep in mind is that it performs a gene-level z-score transformation before doing clustering:
# perform gene-by-gene per-sample z-score transformation
dat <- apply(input, 1, function(y) scRNA.seq.funcs::z.transform.helper(y))
# hierarchical clustering
dd <- as.dist((1 - cor(t(dat), method = "pearson"))/2)
hc <- hclust(dd, method = "average")
If the number of clusters is not known, SINCERA can identify k as the minimum height of the hierarchical
tree that generates no more than a specified number of singleton clusters (clusters containing only 1 cell):
num.singleton <- 0
kk <- 1
for (i in 2:dim(dat)[2]) {
clusters <- cutree(hc, k = i)
clustersizes <- as.data.frame(table(clusters))
singleton.clusters <- which(clustersizes$Freq < 2)
if (length(singleton.clusters) <= num.singleton) {
kk <- i
} else {
break;
}
}
cat(kk)
## 6
Exercise 10: Compare the results between SINCERA and the original publication cell types.
Our solution:
## [1] 0.3823537
Exercise 11: Is using the singleton cluster criteria for finding k a good idea?
8.2.7 sessionInfo()
library(scRNA.seq.funcs)
library(matrixStats)
library(M3Drop)
library(RColorBrewer)
library(SingleCellExperiment)
set.seed(1)
Single-cell RNASeq is capable of measuring the expression of many thousands of genes in every cell. However,
in most situations only a portion of those will show a response to the biological condition of interest,
e.g. differences in cell-type, drivers of differentiation, or responses to an environmental stimulus. Most genes
detected in a scRNASeq experiment will only be detected at different levels due to technical noise. One
consequence of this is that technical noise and batch effects can obscure the biological signal of interest.
Thus, it is often advantageous to perform feature selection to remove those genes which only exhibit technical
noise from downstream analysis. Not only does this generally increase the signal:noise ratio in the data; it
also reduces the computational complexity of analyses, by reducing the total amount of data to be processed.
For scRNASeq data, we will be focusing on unsupervised methods of feature selection which don’t require
any a priori information, such as cell-type labels or biological group, since they are not available, or may be
unreliable, for many experiments. In contrast, differential expression (chapter 8.6) can be considered a form
of supervised feature selection since it uses the known biological label of each sample to identify features
(i.e. genes) which are expressed at different levels across groups.
For this section we will continue working with the Deng data.
deng <- readRDS("deng/deng-reads.rds")
cellLabels <- colData(deng)$cell_type2
This data can be QCed and normalized for library size using M3Drop, which removes cells with few detected
genes, removes undetected genes, and converts raw counts to CPM.
deng_list <- M3DropCleanData(
counts(deng),
labels = cellLabels,
min_detected_genes = 100,
is.counts = TRUE
)
expr_matrix <- deng_list$data # Normalized & filtered expression matrix
celltype_labs <- factor(deng_list$labels) # filtered cell-type labels
cell_colors <- brewer.pal(max(3,length(unique(celltype_labs))), "Set3")
Exercise 1: How many cells & genes have been removed by this filtering?
There are two main approaches to unsupervised feature selection. The first is to identify genes which behave
differently from a null model describing just the technical noise expected in the dataset.
If the dataset contains spike-in RNAs they can be used to directly model technical noise. However,
measurements of spike-ins may not experience the same technical noise as endogenous transcripts (Svensson et al.,
2017). In addition, scRNASeq experiments often contain only a small number of spike-ins which reduces our
confidence in fitted model parameters.
The first method proposed to identify features in scRNASeq datasets was to identify highly variable genes
(HVG). HVG assumes that if genes have large differences in expression across cells some of those differences
are due to biological difference between the cells rather than technical noise. However, because of the nature
of count data, there is a positive relationship between the mean expression of a gene and the variance in the
read counts across cells. This relationship must be corrected for to properly identify HVGs.
Exercise 2: Use the functions rowMeans and rowVars to plot the relationship between mean expression
and variance for all genes in this dataset. (Hint: use log = "xy" to plot on a log-scale).
[Figure: variance against mean expression for all genes, both on a log scale]
A popular method to correct for the relationship between variance and mean expression was proposed by
Brennecke et al. To use the Brennecke method, we first normalize for library size then calculate the mean
and the squared coefficient of variation (variance divided by the squared mean expression). A quadratic curve
is fit to the relationship between these two variables for the ERCC spike-ins, and then a chi-square test is
used to find genes significantly above the curve. This method is included in the M3Drop package as the
BrenneckeGetVariableGenes function. However, this dataset does not contain spike-ins so
we will use the entire dataset to estimate the technical noise.
In the figure below the red curve is the fitted technical noise model and the dashed line is the 95% CI. Pink
dots are the genes with significant biological variability after multiple-testing correction.
Brennecke_HVG <- BrenneckeGetVariableGenes(
expr_matrix,
fdr = 0.01,
minBiolDisp = 0.5
)
[Figure: squared coefficient of variation (CV^2) against mean expression, with the fitted technical noise curve]
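The two quantities used by this approach, the mean and the squared coefficient of variation, can be computed on simulated counts as in the sketch below. It does not reproduce the Brennecke curve fit or test. The negative binomial simulation, its dispersion (`size = 2`) and all object names are assumptions made for illustration.

```r
# Sketch: mean and squared coefficient of variation (CV^2) on simulated counts
set.seed(1)
n_genes <- 500
n_cells <- 100
gene_means <- 10^runif(n_genes, min = 0, max = 3)
# Simulated negative binomial counts, genes in rows and cells in columns
counts_sim <- matrix(
    rnbinom(n_genes * n_cells, mu = rep(gene_means, times = n_cells), size = 2),
    nrow = n_genes
)
# Library-size normalisation: scale each cell to the mean library size
lib_size <- colSums(counts_sim)
norm_sim <- t(t(counts_sim) / lib_size) * mean(lib_size)
mu <- rowMeans(norm_sim)
cv2 <- apply(norm_sim, 1, var) / mu^2  # variance divided by squared mean
plot(mu, cv2, log = "xy", xlab = "Mean expression", ylab = "CV^2")
```

In this simulation CV^2 falls as the mean increases, which is why variability has to be judged relative to a fitted mean-dependent trend rather than by a fixed threshold.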
An alternative to finding HVGs is to identify genes with unexpectedly high numbers of zeros. The frequency
of zeros, known as the “dropout rate”, is very closely related to expression level in scRNASeq data. Zeros
are the dominant feature of single-cell RNASeq data, typically accounting for over half of the entries in the
final expression matrix. These zeros predominantly result from the failure of mRNAs to be reverse
transcribed (Andrews and Hemberg, 2016). Reverse transcription is an enzymatic reaction and thus can be modelled
using the Michaelis-Menten equation:
$$P_{dropout} = 1 - \frac{S}{K + S}$$
where S is the mRNA concentration in the cell (we will estimate this as average expression) and K is the
Michaelis-Menten constant.
Because the Michaelis-Menten equation is a convex non-linear function, genes which are differentially
expressed across two or more populations of cells in our dataset will be shifted up/right of the Michaelis-Menten
model (see Figure below).
K <- 49
S_sim <- 10^seq(from = -3, to = 4, by = 0.05) # range of expression values
MM <- 1 - S_sim / (K + S_sim)
plot(
S_sim,
MM,
type = "l",
lwd = 3,
xlab = "Expression",
ylab = "Dropout Rate",
xlim = c(1,1000)
)
S1 <- 10
P1 <- 1 - S1 / (K + S1) # Expression & dropouts for cells in condition 1
S2 <- 750
P2 <- 1 - S2 / (K + S2) # Expression & dropouts for cells in condition 2
points(
c(S1, S2),
c(P1, P2),
pch = 16,
col = "grey85",
cex = 3
)
mix <- 0.5 # proportion of cells in condition 1
points(
S1 * mix + S2 * (1 - mix),
P1 * mix + P2 * (1 - mix),
pch = 16,
col = "grey35",
cex = 3
)
[Figure: Michaelis-Menten curve of Dropout Rate against Expression, with the two conditions (light grey points) and their mixture (dark grey point)]
Note: add log="x" to the plot call above to see how this looks on the log scale, which is used in M3Drop
figures.
Exercise 3: Produce the same plot as above with different expression levels (S1 & S2) and/or mixtures
(mix).
We use M3Drop to identify significant outliers to the right of the MM curve. We also apply 1% FDR multiple
testing correction:
M3Drop_genes <- M3DropFeatureSelection(
expr_matrix,
mt_method = "fdr",
mt_threshold = 0.01
)
[Figure: Dropout Rate against log10(expression) with the fitted Michaelis-Menten curve (K = 9.153)]
M3Drop_genes <- M3Drop_genes$Gene
An alternative method is contained in the M3Drop package that is tailored specifically for UMI-tagged data
which generally contains many zeros resulting from low sequencing coverage in addition to those resulting
from insufficient reverse-transcription. This model is the Depth-Adjusted Negative Binomial (DANB). This
method describes each expression observation as a negative binomial model with a mean related to both
the mean expression of the respective gene and the sequencing depth of the respective cell, and a variance
related to the mean-expression of the gene.
Unlike the Michaelis-Menten and HVG methods there isn’t a reliable statistical test for features selected by
this model, so we will consider the top 1500 genes instead.
A completely different approach to feature selection is to use gene-gene correlations. This method is based on
the idea that multiple genes will be differentially expressed between different cell-types or cell-states. Genes
which are expressed in the same cell-population will be positively correlated with each other, whereas genes
expressed in different cell-populations will be negatively correlated with each other. Thus important genes
can be identified by the magnitude of their correlation with other genes.
The limitation of this method is that it assumes technical noise is random and independent for each cell,
thus shouldn’t produce gene-gene correlations, but this assumption is violated by batch effects which are
generally systematic between different experimental batches and will produce gene-gene correlations. As a
result it is more appropriate to take the top few thousand genes as ranked by gene-gene correlation than
consider the significance of the correlations.
cor_mat <- cor(t(expr_matrix), method = "spearman") # Gene-gene correlations
diag(cor_mat) <- rep(0, times = nrow(expr_matrix))
score <- apply(cor_mat, 1, function(x) {max(abs(x))}) #Correlation of highest magnitude
names(score) <- rownames(expr_matrix);
score <- score[order(-score)]
Cor_genes <- names(score[1:1500])
Lastly, another common method for feature selection in scRNASeq data is to use PCA loadings. Genes with
high PCA loadings are likely to be highly variable and correlated with many other variable genes, thus may
be relevant to the underlying biology. However, as with gene-gene correlations PCA loadings tend to be
susceptible to detecting systematic variation due to batch effects; thus it is recommended to plot the PCA
results to determine those components corresponding to the biological variation rather than batch effects.
# PCA is typically performed on log-transformed expression data
pca <- prcomp(log(expr_matrix + 1) / log(2))
# plot projection
plot(
    pca$rotation[,1],
    pca$rotation[,2],
    pch = 16,
    col = cell_colors[as.factor(celltype_labs)]
)
[Figure: scatter plot of pca$rotation[, 1] (x-axis) against pca$rotation[, 2] (y-axis), coloured by cell type.]
# calculate loadings for components 1 and 2
score <- rowSums(abs(pca$x[,c(1,2)]))
names(score) <- rownames(expr_matrix)
score <- score[order(-score)]
PCA_genes <- names(score[1:1500])
Exercise 4 Consider the top 5 principal components. Which appear to be most biologically relevant? How
does the top 1,500 features change if you consider the loadings for those components?
We can check whether the identified features really do represent genes differentially expressed between cell-
types in this dataset.
M3DropExpressionHeatmap(
    M3Drop_genes,
    expr_matrix,
    cell_labels = celltype_labs
)
[Figure: heatmap from M3DropExpressionHeatmap showing Z-scores (colour scale −2 to 2) of the M3Drop genes (rows) across cells (columns), with cells annotated by cell type (zy, early2cell, mid2cell, late2cell, 4cell, 8cell, 16cell, earlyblast, midblast, lateblast).]
We can also consider how consistent each feature selection method is with the others using the Jaccard Index:
J <- sum(M3Drop_genes %in% HVG_genes)/length(unique(c(M3Drop_genes, HVG_genes)))
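To compare more than two methods at once, the same index can be wrapped in a small helper. This is a minimal sketch: `jaccard`, `jaccard_matrix` and the toy gene sets are hypothetical names introduced for illustration, and in practice the list would hold vectors such as `M3Drop_genes` and `HVG_genes`.

```r
# Jaccard index between two character vectors of gene names
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Pairwise Jaccard matrix for a named list of gene sets
jaccard_matrix <- function(gene_sets) {
  n <- length(gene_sets)
  out <- matrix(1, n, n, dimnames = list(names(gene_sets), names(gene_sets)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- jaccard(gene_sets[[i]], gene_sets[[j]])
    }
  }
  out
}

# Hypothetical toy gene sets standing in for the output of each method
toy_sets <- list(
  A = c("g1", "g2", "g3", "g4"),
  B = c("g3", "g4", "g5"),
  C = c("g6", "g7")
)
J_mat <- jaccard_matrix(toy_sets)
```

A value of 1 means two methods selected exactly the same genes and 0 means they share none.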
Exercise 5
Plot the expression of the features for each of the other methods. Which appear to be differentially expressed?
How consistent are the different methods for this dataset?
8.3.4 sessionInfo()
library(SingleCellExperiment)
library(TSCAN)
library(M3Drop)
library(monocle)
library(destiny)
library(SLICER)
library(ouija)
library(scater)
library(ggplot2)
library(ggthemes)
library(ggbeeswarm)
library(corrplot)
set.seed(1)
In many situations, one is studying a process where cells change continuously. This includes, for example,
many differentiation processes taking place during development: following a stimulus, cells will change from
one cell-type to another. Ideally, we would like to monitor the expression levels of an individual cell over
time. Unfortunately, such monitoring is not possible with scRNA-seq since the cell is lysed (destroyed) when
the RNA is extracted.
Instead, we must sample at multiple time-points and obtain snapshots of the gene expression profiles. Since
some of the cells will proceed faster along the differentiation than others, each snapshot may contain cells
at varying points along the developmental progression. We use statistical methods to order the cells along
one or more trajectories which represent the underlying developmental trajectories; this ordering is referred
to as “pseudotime”.
In this chapter we will consider five different tools: Monocle, TSCAN, destiny, SLICER and ouija for ordering
cells according to their pseudotime development. To illustrate the methods we will be using a dataset on
mouse embryonic development (Deng et al., 2014). The dataset consists of 268 cells from 10 different time-
points of early mouse development. In this case, there is no need for pseudotime alignment since the cell
labels provide information about the development trajectory. Thus, the labels allow us to establish a ground
truth so that we can evaluate and compare the different methods.
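One simple way to turn this ground truth into a number is to correlate each inferred pseudotime with the ordered timepoint labels. The sketch below is an assumption about how such a comparison could be done, not a method used in the course: `pseudotime_agreement` is a hypothetical helper and the toy vectors are made up. It uses the absolute Spearman correlation so that a reversed ordering is not penalised, and it drops cells without a pseudotime value.

```r
# Agreement between an inferred pseudotime and ordered timepoint labels.
# Cells with missing pseudotime are dropped before computing the correlation.
pseudotime_agreement <- function(pseudotime, labels) {
  stopifnot(is.factor(labels), length(pseudotime) == length(labels))
  keep <- !is.na(pseudotime)
  abs(cor(pseudotime[keep], as.integer(labels[keep]), method = "spearman"))
}

# Hypothetical toy example: 3 ordered timepoints, 9 cells
toy_labels <- factor(rep(c("early", "mid", "late"), each = 3),
                     levels = c("early", "mid", "late"))
toy_forward  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
toy_reversed <- rev(toy_forward)
toy_missing  <- c(1, NA, 3, 4, 5, NA, 7, 8, 9)

agree_forward  <- pseudotime_agreement(toy_forward, toy_labels)
agree_reversed <- pseudotime_agreement(toy_reversed, toy_labels)
agree_missing  <- pseudotime_agreement(toy_missing, toy_labels)
```

With the Deng data the labels would be `deng_SCE$cell_type2`, whose factor levels are already in developmental order.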
A recent review by Cannoodt et al provides a detailed summary of the various computational methods for
trajectory inference from single-cell transcriptomics (Cannoodt et al., 2016). They discuss several tools, but
unfortunately for our purposes many of these tools do not have complete or well-maintained implementations,
and/or are not implemented in R.
Cannoodt et al cover:
• SCUBA - Matlab implementation
Figure 8.10: Descriptions of trajectory inference methods for single-cell transcriptomics data (Fig. 2 from
Cannoodt et al, 2016).
Unfortunately only two tools discussed (Monocle and TSCAN) meet the gold standard of open-source
software hosted in a reputable repository.
The following figures from the paper summarise some of the features of the various tools.
Let us take a first look at the Deng data, without yet applying sophisticated pseudotime methods. As the
plot below shows, simple PCA does a very good job of displaying the structure in these data. It is only
once we reach the blast cell types (“earlyblast”, “midblast”, “lateblast”) that PCA struggles to separate the
distinct cell types.
deng_SCE <- readRDS("deng/deng-reads.rds")
deng_SCE$cell_type2 <- factor(
    deng_SCE$cell_type2,
    levels = c("zy", "early2cell", "mid2cell", "late2cell",
               "4cell", "8cell", "16cell", "earlyblast",
               "midblast", "lateblast")
)
cellLabels <- deng_SCE$cell_type2
deng <- counts(deng_SCE)
colnames(deng) <- cellLabels
deng_SCE <- plotPCA(deng_SCE, colour_by = "cell_type2",
                    return_SCE = TRUE)
8.4. PSEUDOTIME ANALYSIS 219
Figure 8.11: Characterization of trajectory inference methods for single-cell transcriptomics data (Fig. 3
from Cannoodt et al, 2016).
[Figure: PCA of the Deng data; x-axis: Component 1 (33% variance), y-axis: Component 2 (16% variance), coloured by cell_type2.]
PCA, here, provides a useful baseline for assessing different pseudotime methods. For a very naive pseudotime
we can just take the co-ordinates of the first principal component.
deng_SCE$PC1 <- reducedDim(deng_SCE, "PCA")[,1]
ggplot(as.data.frame(colData(deng_SCE)), aes(x = PC1, y = cell_type2,
                                             colour = cell_type2)) +
    geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("First principal component") + ylab("Timepoint") +
    ggtitle("Cells ordered by first principal component")
[Figure: cells ordered by first principal component; x-axis: First principal component, y-axis: Timepoint, coloured by cell_type2.]
As the plot above shows, PC1 struggles to correctly order cells early and late in the developmental timecourse,
but overall does a relatively good job of ordering cells by developmental time.
8.4.2 TSCAN
TSCAN combines clustering with pseudotime analysis. First it clusters the cells using mclust, which is
based on a mixture of normal distributions. Then it builds a minimum spanning tree to connect the clusters.
The branch of this tree that connects the largest number of clusters is the main branch which is used to
determine pseudotime.
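The cluster-then-tree idea can be sketched in base R on simulated data. This is not TSCAN's implementation: k-means is swapped in for mclust, the toy cells are made up, and `mst_edges`, `tree_path_from` and `main_branch` are hypothetical helpers. Cells in clusters that are not on the main branch receive no pseudotime in this sketch.

```r
set.seed(1)

# Hypothetical toy data: 4 groups of 25 cells placed along a line in 2-D
n_groups <- 4
centres_true <- cbind(seq(0, 30, length.out = n_groups), 0)
cells <- do.call(rbind, lapply(seq_len(n_groups), function(g) {
  cbind(rnorm(25, centres_true[g, 1], 0.5), rnorm(25, centres_true[g, 2], 0.5))
}))

# Step 1: cluster the cells (k-means here; TSCAN itself uses mclust)
km <- kmeans(cells, centers = n_groups, nstart = 20)
centres <- km$centers

# Step 2: minimum spanning tree over the cluster centres (Prim's algorithm)
mst_edges <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (e in seq_len(n - 1)) {
    cand <- d
    cand[!in_tree, ] <- Inf   # an edge must start inside the tree
    cand[, in_tree] <- Inf    # and end outside it
    idx <- which(cand == min(cand), arr.ind = TRUE)[1, ]
    edges[e, ] <- idx
    in_tree[idx[2]] <- TRUE
  }
  edges
}
edges <- mst_edges(as.matrix(dist(centres)))

# Step 3: the path through the tree that visits the most clusters
tree_path_from <- function(edges, start, n) {
  # breadth-first search recording the parent and depth of every node
  parent <- rep(NA_integer_, n)
  depth <- rep(NA_integer_, n)
  depth[start] <- 0L
  queue <- start
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    nb <- c(edges[edges[, 1] == v, 2], edges[edges[, 2] == v, 1])
    for (w in nb[is.na(depth[nb])]) {
      depth[w] <- depth[v] + 1L
      parent[w] <- v
      queue <- c(queue, w)
    }
  }
  list(parent = parent, depth = depth)
}
main_branch <- function(edges, n) {
  a <- which.max(tree_path_from(edges, 1, n)$depth)  # one end of the longest path
  bfs <- tree_path_from(edges, a, n)
  b <- which.max(bfs$depth)                           # the other end
  path <- b
  while (!is.na(bfs$parent[path[1]])) path <- c(bfs$parent[path[1]], path)
  path
}
path <- main_branch(edges, n_groups)

# Step 4: order cells by the position of their cluster on the main branch, then
# by their projection onto the direction between the neighbouring centres
cluster_rank <- match(km$cluster, path)
nxt <- path[pmin(seq_along(path) + 1, length(path))]
prv <- path[pmax(seq_along(path) - 1, 1)]
direction <- centres[nxt, , drop = FALSE] - centres[prv, , drop = FALSE]
within <- rowSums((cells - centres[km$cluster, ]) * direction[cluster_rank, ])
ord <- order(cluster_rank, within, na.last = NA)  # off-branch cells are dropped
pseudotime <- rep(NA_integer_, nrow(cells))
pseudotime[ord] <- seq_along(ord)
```

The direction of the resulting ordering is arbitrary, since the tree has no notion of which end of the main branch is the start.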
[Figure: TSCAN ordering of cells in PCA space; y-axis: PCA_dimension_2, cells labelled by index and coloured by State (1 to 10).]
Frustratingly, TSCAN only provides pseudotime values for 221 of 268 cells, silently returning missing values
for non-assigned cells.
[Figure: cells ordered by TSCAN pseudotime; y-axis: Timepoint, coloured by cell_type2.]
TSCAN gets the development trajectory the “wrong way around”, in the sense that later pseudotime values
correspond to early timepoints and vice versa. This is not inherently a problem (it is easy enough to reverse
the ordering to get the intuitive interpretation of pseudotime), but overall it would be a stretch to suggest
that TSCAN performs better than PCA on this dataset. (As it is a PCA-based method, perhaps this is not
entirely surprising.)
8.4.3 monocle
Monocle skips the clustering stage of TSCAN and directly builds a minimum spanning tree on a reduced
dimension representation of the cells to connect all cells. Monocle then identifies the longest path in this tree
to determine pseudotime. If the data contains diverging trajectories (i.e. one cell type differentiates into two
different cell-types), monocle can identify these. Each of the resulting forked paths is defined as a separate
cell state.
Unfortunately, Monocle does not work when all the genes are used, so we must carry out feature selection.
First, we use M3Drop:
m3dGenes <- as.character(
M3DropFeatureSelection(deng)$Gene
)
[Figure: M3Drop Michaelis-Menten fit (K = 102.056); x-axis: log10(expression), y-axis: Dropout Rate.]
d <- deng[which(rownames(deng) %in% m3dGenes), ]
d <- d[!duplicated(rownames(d)), ]
[Figure: monocle cell trajectory; y-axis: Component 2, cells coloured by State (1, 2, 3).]
We can again compare the inferred pseudotime to the known sampling timepoints.
deng_SCE$pseudotime_monocle <- pseudotime_monocle$pseudotime
ggplot(as.data.frame(colData(deng_SCE)),
       aes(x = pseudotime_monocle,
           y = cell_type2, colour = cell_type2)) +
    geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("monocle pseudotime") + ylab("Timepoint") +
    ggtitle("Cells ordered by monocle pseudotime")
[Figure: cells ordered by monocle pseudotime; x-axis: monocle pseudotime, y-axis: Timepoint, coloured by cell_type2.]
Monocle - at least with its default settings - performs poorly on these data. The “late2cell” group is
completely separated from the “zy”, “early2cell” and “mid2cell” cells (though these are correctly ordered),
and there is no separation at all of “4cell”, “8cell”, “16cell” or any blast cell groups.
Diffusion maps were introduced by Ronald Coifman and Stephane Lafon, and the underlying idea is to
assume that the data are samples from a diffusion process. The method infers the low-dimensional manifold
by estimating the eigenvalues and eigenvectors for the diffusion operator related to the data.
Haghverdi et al have applied the diffusion maps concept to the analysis of single-cell RNA-seq data to create
an R package called destiny.
We will take the rank order of cells in the first diffusion map component as “diffusion map pseudotime”
here.
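To make the idea concrete, the following base R sketch builds a small diffusion map by hand on simulated data. It is an illustration under stated assumptions, not what destiny does internally: the toy trajectory, the fixed kernel width `sigma` and all variable names are hypothetical.

```r
set.seed(1)

# Hypothetical toy data: 60 cells progressing along a noisy curve in 3 dimensions
n_cells <- 60
true_time <- sort(runif(n_cells))
toy <- cbind(true_time, sin(2 * true_time), true_time^2) +
  matrix(rnorm(n_cells * 3, sd = 0.01), ncol = 3)

# Gaussian kernel on pairwise squared distances; sigma is an arbitrary choice
d2 <- as.matrix(dist(toy))^2
sigma <- 0.2
K <- exp(-d2 / (2 * sigma^2))

# Symmetric normalisation: D^{-1/2} K D^{-1/2} has the same eigenvalues as the
# row-normalised transition matrix D^{-1} K, but is symmetric
deg <- rowSums(K)
S <- K / sqrt(outer(deg, deg))
eig <- eigen(S, symmetric = TRUE)

# Convert to right eigenvectors of the transition matrix. The first is constant
# (eigenvalue 1), so the first informative diffusion component is the second.
psi <- eig$vectors / sqrt(deg)
dc1 <- psi[, 2]

# Rank order along the first diffusion component, used as a pseudotime
pseudotime_dm <- rank(dc1)
```

The sign of an eigenvector is arbitrary, so the ordering may come out reversed relative to the true progression.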
deng <- logcounts(deng_SCE)
colnames(deng) <- cellLabels
dm <- DiffusionMap(t(deng))
theme_classic()
[Figure: diffusion map of the Deng data; y-axis: Diffusion component 2, coloured by Timepoint.]
[Figure: cells ordered by diffusion map pseudotime; x-axis: Diffusion map pseudotime (first diffusion map component), y-axis: Timepoint, coloured by cell_type2.]
Like the other methods, using the first diffusion map component from destiny as pseudotime does a good
job at ordering the early time-points (if we take high values as “earlier” in development), but it is unable
to distinguish the later ones.
Exercise 2 Do you get a better resolution between the later time points by considering additional eigenvectors?
Exercise 3 How does the ordering change if you only use the genes identified by M3Drop?
8.4.5 SLICER
The SLICER method is an algorithm for constructing trajectories that describe gene expression changes
during a sequential biological process, just as Monocle and TSCAN are. SLICER is designed to capture
highly nonlinear gene expression changes, automatically select genes related to the process, and detect
multiple branch and loop features in the trajectory (Welch et al., 2016). The SLICER R package is available
from its GitHub repository and can be installed from there using the devtools package.
We use the select_genes function in SLICER to automatically select the genes to use in building the cell
trajectory. The function uses “neighbourhood variance” to identify genes that vary smoothly, rather than
fluctuating randomly, across the set of cells. Following this, we determine which value of “k” (number of
nearest neighbours) yields an embedding that most resembles a trajectory. Then we estimate the locally
linear embedding of the cells.
library("lle")
slicer_genes <- select_genes(t(deng))
k <- select_k(t(deng[slicer_genes,]), kmin = 30, kmax=60)
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
## finding neighbours
## calculating weights
## computing coordinates
slicer_traj_lle <- lle(t(deng[slicer_genes,]), m = 2, k)$Y
## finding neighbours
## calculating weights
## computing coordinates
reducedDim(deng_SCE, "LLE") <- slicer_traj_lle
plotReducedDim(deng_SCE, use_dimred = "LLE", colour_by = "cell_type2") +
xlab("LLE component 1") + ylab("LLE component 2") +
ggtitle("Locally linear embedding of cells from SLICER")
[Figure: locally linear embedding of cells from SLICER; x-axis: LLE component 1, y-axis: LLE component 2, coloured by cell_type2.]
With the locally linear embedding computed we can construct a k-nearest neighbour graph that is fully
connected. The plot displays a (yellow) circle for each cell, with the cell ID number overlaid in blue. The
graph shown is computed using 10 nearest neighbours. SLICER appears to detect one major
trajectory with one branch.
slicer_traj_graph <- conn_knn_graph(slicer_traj_lle, 10)
plot(slicer_traj_graph, main = "Fully connected kNN graph from SLICER")
[Figure: fully connected kNN graph from SLICER, with each cell drawn as a circle labelled by its cell ID.]
From this graph we can identify “extreme” cells that are candidates for start/end cells in the trajectory.
[Figure: SLICER embedding with cells 109 and 229 labelled; x-axis: Manifold Dim 1, y-axis: Manifold Dim 2.]
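# `ends` holds the extreme cells identified above; the call that creates it is
# missing from this listing (in SLICER this is presumably find_extreme_cells()).
start <- ends[1]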
Having defined a start cell we can order the cells in the estimated pseudotime.
pseudotime_order_slicer <- cell_order(slicer_traj_graph, start)
branches <- assign_branches(slicer_traj_graph, start)
pseudotime_slicer <-
    data.frame(
        Timepoint = cellLabels,
        pseudotime = NA,
        State = branches
    )
pseudotime_slicer$pseudotime[pseudotime_order_slicer] <-
    seq_along(pseudotime_order_slicer)
deng_SCE$pseudotime_slicer <- pseudotime_slicer$pseudotime
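The general idea of ordering cells from a start cell on a nearest-neighbour graph can be sketched in base R as shortest-path (geodesic) distances. This is an illustration of the concept on simulated data and should not be read as SLICER's own code: `knn_adjacency`, `geodesic_from` and the toy embedding are hypothetical.

```r
set.seed(1)

# Hypothetical toy embedding: 50 cells along a half-circle, stored in true order
n_cells <- 50
angle <- seq(0, pi, length.out = n_cells)
embedding <- cbind(cos(angle), sin(angle)) +
  matrix(rnorm(n_cells * 2, sd = 0.005), ncol = 2)

# Symmetric kNN graph stored as a weighted adjacency matrix (Inf = no edge)
knn_adjacency <- function(x, k) {
  d <- as.matrix(dist(x))
  adj <- matrix(Inf, nrow(d), ncol(d))
  for (i in seq_len(nrow(d))) {
    nn <- order(d[i, ])[2:(k + 1)]   # skip the cell itself
    adj[i, nn] <- d[i, nn]
    adj[nn, i] <- d[nn, i]
  }
  adj
}

# Dijkstra's algorithm: shortest-path distance from `start` to every cell
geodesic_from <- function(adj, start) {
  n <- nrow(adj)
  dist_to <- rep(Inf, n)
  dist_to[start] <- 0
  done <- rep(FALSE, n)
  for (step in seq_len(n)) {
    cand <- ifelse(done, Inf, dist_to)
    v <- which.min(cand)
    if (!is.finite(cand[v])) break   # remaining cells are unreachable
    done[v] <- TRUE
    dist_to <- pmin(dist_to, dist_to[v] + adj[v, ])
  }
  dist_to
}

adj <- knn_adjacency(embedding, k = 5)
geo <- geodesic_from(adj, start = 1)
cell_ordering <- order(geo)   # cells sorted by distance from the start cell
```

Because distances are measured along the graph rather than straight through the embedding, cells at the far end of a curved trajectory are still placed last.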
We can again compare the inferred pseudotime to the known sampling timepoints. SLICER does not provide
a pseudotime value per se, just an ordering of cells.
ggplot(as.data.frame(colData(deng_SCE)),
       aes(x = pseudotime_slicer,
           y = cell_type2, colour = cell_type2)) +
    geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("SLICER pseudotime (cell ordering)") +
    ylab("Timepoint")
[Figure: cells ordered by SLICER pseudotime; x-axis: SLICER pseudotime (cell ordering), y-axis: Timepoint, coloured by cell_type2.]
Like the previous method, SLICER here provides a good ordering for the early time points. It places “16cell”
cells before “8cell” cells, but provides better ordering for blast cells than many of the earlier methods.
Exercise 4 How do the results change for different k? (e.g. k = 5) What about changing the number of
nearest neighbours in the call to conn_knn_graph?
Exercise 5 How does the ordering change if you use a different set of genes from those chosen by SLICER
(e.g. the genes identified by M3Drop)?
8.4.6 Ouija
• Early timepoints: Dazl, Rnf17, Sycp3, Nanog, Pou5f1, Fgf8, Egfr, Bmp5, Bmp15
• Mid timepoints: Zscan4b, Foxa1, Prdm14, Sox21
• Late timepoints: Creb3, Gpx4, Krt8, Elf5, Eomes, Cdx2, Tdgf1, Gdf3
With Ouija we can model genes as either exhibiting monotonic up or down regulation (known as switch-
like behaviour), or transient behaviour where the gene briefly peaks. By default, Ouija assumes all genes
exhibit switch-like behaviour (the authors assure us not to worry if we get it wrong - the noise model means
incorrectly specifying a transient gene as switch-like has minimal effect).
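The two kinds of behaviour can be pictured with simple mean functions over pseudotime. The forms below are assumptions chosen for illustration (a sigmoid for switch-like genes and a bell-shaped curve for transient genes); the text does not give Ouija's exact parameterisation, and `switch_mean`, `transient_mean` and all parameter values are hypothetical.

```r
# Switch-like (sigmoidal) mean expression over pseudotime t in [0, 1].
# k > 0 gives up-regulation, k < 0 down-regulation; t0 is the switch time.
switch_mean <- function(t, mu0, k, t0) {
  2 * mu0 / (1 + exp(-k * (t - t0)))
}

# Transient mean expression: a bell-shaped peak at time p, with width set by b
transient_mean <- function(t, mu0, p, b) {
  2 * mu0 * exp(-b * (t - p)^2)
}

t_grid <- seq(0, 1, length.out = 101)
up        <- switch_mean(t_grid, mu0 = 5, k = 20, t0 = 0.5)
down      <- switch_mean(t_grid, mu0 = 5, k = -20, t0 = 0.5)
transient <- transient_mean(t_grid, mu0 = 5, p = 0.5, b = 50)
```

Plotting the three vectors against `t_grid` (for example with `matplot`) shows why a transient gene modelled as switch-like is only partly misdescribed: one flank of the peak still looks like a switch.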
Here we can “cheat” a little and check that our selected marker genes do actually identify different timepoints
of the differentiation process.
ouija_markers_down <- c("Dazl", "Rnf17", "Sycp3", "Fgf8",
"Egfr", "Bmp5", "Bmp15", "Pou5f1")
ouija_markers_up <- c("Creb3", "Gpx4", "Krt8", "Elf5", "Cdx2",
"Tdgf1", "Gdf3", "Eomes")
ouija_markers_transient <- c("Zscan4b", "Foxa1", "Prdm14", "Sox21")
ouija_markers <- c(ouija_markers_down, ouija_markers_up,
ouija_markers_transient)
plotExpression(deng_SCE, ouija_markers, x = "cell_type2", colour_by = "cell_type2") +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
[Figure: expression (logcounts) of the marker genes against cell_type2, one panel per gene (recovered panel titles: Dazl, Rnf17, Sycp3, Fgf8, Egfr, Bmp5, Bmp15, Pou5f1, Creb3, Gpx4, Cdx2, Tdgf1, Gdf3, Eomes, Zscan4b, Foxa1, Prdm14, Sox21), coloured by cell_type2.]
In order to fit the pseudotimes we simply call ouija, passing in the expected response types. Note that if
no response types are provided then they are all assumed to be switch-like by default, which we will do
here. The input to Ouija can be a cell-by-gene matrix of non-negative expression values, an ExpressionSet
object or, happily, the logcounts values selected from a SingleCellExperiment object.
We can apply prior information about whether genes are up- or down-regulated across the differentiation
process, and also provide prior information about when the switch in expression or a peak in expression is
likely to occur.
• Hamiltonian Monte Carlo (HMC) - full MCMC inference where gradient information of the log-posterior
is used to “guide” the random walk through the parameter space, or
• Automatic Differentiation Variational Bayes (ADVI or simply VI) - approximate inference where the
KL divergence to an approximate distribution is minimised.
In general, HMC will provide more accurate inference with approximately correct posterior variance for all
parameters. However, VB is orders of magnitude quicker than HMC and while it may underestimate posterior
variance, the Ouija authors suggest that anecdotally it often performs as well as HMC for discovering posterior
pseudotimes.
To help the Ouija model, we provide it with prior information about the strength of switches for up- and
down-regulated genes. By setting switch strength to -10 for down-regulated genes and 10 for up-regulated
genes with a prior strength standard deviation of 0.5 we are telling the model that we are confident about
the expected behaviour of these genes across the differentiation process.
options(mc.cores = parallel::detectCores())
response_type <- c(rep("switch", length(ouija_markers_down) +
length(ouija_markers_up)),
rep("transient", length(ouija_markers_transient)))
switch_strengths <- c(rep(-10, length(ouija_markers_down)),
rep(10, length(ouija_markers_up)))
switch_strength_sd <- c(rep(0.5, length(ouija_markers_down)),
rep(0.5, length(ouija_markers_up)))
garbage <- capture.output(
oui_vb <- ouija(deng_SCE[ouija_markers,],
single_cell_experiment_assay = "logcounts",
response_type = response_type,
switch_strengths = switch_strengths,
switch_strength_sd = switch_strength_sd,
inference_type = "vb")
)
print(oui_vb)
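To give some intuition for what a switch strength of -10 or 10 means, the sketch below evaluates a logistic (sigmoid) mean function at a few pseudotimes. The functional form, the helper name `switch_mean` and the parameter names `mu0`, `k` and `t0` are assumptions made for illustration; they are not taken from the Ouija source code.

```r
# Illustrative sigmoid mean function: k is the switch strength,
# t0 the switch time and mu0 the expression level at t = t0.
switch_mean <- function(t, mu0 = 1, k = 10, t0 = 0.5) {
  2 * mu0 / (1 + exp(-k * (t - t0)))
}

t <- seq(0, 1, by = 0.25)
up   <- switch_mean(t, k = 10)    # expression rises over pseudotime
down <- switch_mean(t, k = -10)   # expression falls over pseudotime
round(rbind(t = t, up = up, down = down), 3)
```

A larger absolute value of `k` gives a sharper switch, and the sign of `k` sets the direction, which is why a tight prior around -10 or 10 encodes confidence about both.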
We can plot the gene expression over pseudotime along with the maximum a posteriori (MAP) estimates of
the mean function (the sigmoid or Gaussian transient function) using the plot_expression function.
plot_expression(oui_vb)
[Figure: normalised log expression (y-axis) against Ouija pseudotime (x-axis, 0 to 1), one panel per marker gene.]
We can also visualise when in the trajectory gene regulation behaviour occurs, either in the form of
the switch time or the peak time (for switch-like or transient genes, respectively), using the
plot_switch_times and plot_peak_times functions:
plot_switch_times(oui_vb)
[Figure: switch point (x-axis, 0 to 1) for each switch-like gene (y-axis), coloured by regulation (−10 to 10).]
plot_peak_times(oui_vb)
[Figure: peak times of the transient genes (Sox21, Prdm14, Foxa1, Zscan4b).]
[Figure: cells by pseudotime order, coloured by P(i>j), with cell classification.]
ggplot(as.data.frame(colData(deng_SCE)),
aes(x = pseudotime_ouija,
y = cell_type2, colour = cell_type2)) +
geom_quasirandom(groupOnX = FALSE) +
scale_color_tableau() + theme_classic() +
xlab("Ouija pseudotime") +
ylab("Timepoint") +
theme_classic()
[Figure: Ouija pseudotime on the x-axis against timepoint on the y-axis, coloured by cell_type2.]
Ouija does quite well in the ordering of the cells here, although it can be sensitive to the choice of marker
genes and prior information supplied. How do the results change if you select different marker genes or
change the priors?
Ouija identifies four metastable states here, which we might annotate as “zygote/2cell”, “4/8/16 cell”, “blast1”
and “blast2”.
ggplot(as.data.frame(colData(deng_SCE)),
aes(x = as.factor(ouija_cell_class),
y = pseudotime_ouija, colour = cell_type2)) +
geom_boxplot() +
coord_flip() +
scale_color_tableau() + theme_classic() +
xlab("Ouija cell classification") +
ylab("Ouija pseudotime") +
theme_classic()
[Figure: boxplots of Ouija pseudotime for each Ouija cell classification, coloured by cell_type2.]
A common analysis is to work out the regulation orderings of genes. For example, is gene A upregulated
before gene B? Does gene C peak before the downregulation of gene D? Ouija answers these questions in
terms of a Bayesian hypothesis test of whether the difference in regulation timing (either switch time or peak
time) is significantly different to 0. This is collated using the gene_regulation function.
gene_regs <- gene_regulation(oui_vb)
head(gene_regs)
## # A tibble: 6 x 7
## # Groups: label, gene_A [6]
## label gene_A gene_B mean_difference lower_95 upper_95 significant
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
## 1 Bmp15 - Cdx2 Bmp15 Cdx2 -0.0434 -0.0912 0.0110 F
## 2 Bmp15 - Cre~ Bmp15 Creb3 0.278 0.220 0.330 T
## 3 Bmp15 - Elf5 Bmp15 Elf5 -0.656 -0.688 -0.618 T
## 4 Bmp15 - Eom~ Bmp15 Eomes 0.0766 0.00433 0.153 T
## 5 Bmp15 - Fox~ Bmp15 Foxa1 -0.0266 -0.0611 0.00287 F
## 6 Bmp15 - Gdf3 Bmp15 Gdf3 0.0704 0.00963 0.134 T
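The idea behind such a test can be sketched with simulated posterior draws. Everything below (the number of draws, the hypothetical switch times `t0_A` and `t0_B`, and the use of a central 95% interval) is an assumption for illustration and not necessarily how gene_regulation is implemented.

```r
set.seed(1)
# Hypothetical posterior draws of the switch times of two genes
t0_A <- rnorm(4000, mean = 0.30, sd = 0.02)
t0_B <- rnorm(4000, mean = 0.60, sd = 0.03)

# Posterior of the difference in regulation timing
diff_AB <- t0_A - t0_B
ci <- quantile(diff_AB, probs = c(0.025, 0.975))

# Call the difference "significant" if the 95% interval excludes 0
significant <- ci[[1]] > 0 || ci[[2]] < 0
c(mean_difference = mean(diff_AB), lower_95 = ci[[1]], upper_95 = ci[[2]])
significant
```

A negative mean difference here would mean that gene A switches before gene B.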
What conclusions can you draw from the gene regulation output from Ouija?
If you have time, you might try the HMC inference method and see if that changes the Ouija results in any
way.
How do the trajectories inferred by TSCAN, Monocle, Diffusion Map, SLICER and Ouija compare?
TSCAN and Diffusion Map methods get the trajectory the “wrong way round”, so we’ll adjust that for these
comparisons.
df_pseudotime <- as.data.frame(
colData(deng_SCE)[, grep("pseudotime", colnames(colData(deng_SCE)))]
)
colnames(df_pseudotime) <- gsub("pseudotime_", "",
colnames(df_pseudotime))
df_pseudotime$PC1 <- deng_SCE$PC1
df_pseudotime$order_tscan <- -df_pseudotime$order_tscan
df_pseudotime$diffusionmap <- -df_pseudotime$diffusionmap
[Figure: correlation plot of the pseudotime orderings (recovered labels: ouija, PC1, order_tscan, monocle).]
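A comparison of this kind boils down to a correlation matrix between the columns of the pseudotime data frame. The sketch below uses a simulated data frame (`toy`, with made-up columns `method_a`, `method_b` and `method_c`) in place of df_pseudotime, and rank-based (Spearman) correlation is our choice here rather than something prescribed by any of the packages.

```r
set.seed(1)
# Toy stand-in for df_pseudotime: three orderings of 50 cells
true_time <- seq_len(50)
toy <- data.frame(
  method_a = true_time + rnorm(50, sd = 3),    # close to the truth
  method_b = -true_time + rnorm(50, sd = 3),   # reversed trajectory
  method_c = sample(true_time)                 # unrelated ordering
)
# Flip the reversed trajectory so that all methods run in the same direction
toy$method_b <- -toy$method_b

# Rank-based correlation only uses the ordering of the cells
round(cor(toy, method = "spearman"), 2)
```

The resulting matrix can then be handed to a plotting function to produce a figure like the one above.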
Each package also enables the visualization of expression through pseudotime. Following individual genes
is very helpful for identifying genes that play an important role in the differentiation process. We illustrate
the procedure using the Rhoa gene.
We have added the pseudotime values computed with all methods here to the colData slot of an SCE object.
Having done that, the full plotting capabilities of the scater package can be used to investigate relationships
between gene expression, cell populations and pseudotime. This is particularly useful for the packages such
as SLICER that do not provide plotting functions.
Principal components
plotExpression(deng_SCE, "Rhoa", x = "PC1",
colour_by = "cell_type2", show_violin = FALSE,
show_smooth = TRUE)
[Figure: Rhoa expression (logcounts) against PC1, coloured by cell_type2.]
TSCAN
plotExpression(deng_SCE, "Rhoa", x = "pseudotime_order_tscan",
colour_by = "cell_type2", show_violin = FALSE,
show_smooth = TRUE)
[Figure: Rhoa expression (logcounts) against pseudotime_order_tscan, coloured by cell_type2.]
Monocle
plotExpression(deng_SCE, "Rhoa", x = "pseudotime_monocle",
colour_by = "cell_type2", show_violin = FALSE,
show_smooth = TRUE)
[Figure: Rhoa expression (logcounts) against pseudotime_monocle, coloured by cell_type2.]
Diffusion Map
plotExpression(deng_SCE, "Rhoa", x = "pseudotime_diffusionmap",
colour_by = "cell_type2", show_violin = FALSE,
show_smooth = TRUE)
[Figure: Rhoa expression (logcounts) against pseudotime_diffusionmap, coloured by cell_type2.]
SLICER
plotExpression(deng_SCE, "Rhoa", x = "pseudotime_slicer",
colour_by = "cell_type2", show_violin = FALSE,
show_smooth = TRUE)
[Figure: Rhoa expression (logcounts) against pseudotime_slicer, coloured by cell_type2.]
Ouija
plotExpression(deng_SCE, "Rhoa", x = "pseudotime_ouija",
colour_by = "cell_type2", show_violin = FALSE,
show_smooth = TRUE)
[Figure: Rhoa expression (logcounts) against pseudotime_ouija, coloured by cell_type2.]
How many of these methods outperform the naive approach of using the first principal component to represent
pseudotime for these data?
Exercise 7: Repeat the exercise using a subset of the genes, e.g. the set of highly variable genes that can
be obtained using Brennecke_getVariableGenes()
8.4.9 sessionInfo()
##
## other attached packages:
## [1] bindrcpp_0.2 rstan_2.17.3
## [3] StanHeaders_2.17.2 lle_1.1
## [5] snowfall_1.84-6.1 snow_0.4-2
## [7] MASS_7.3-45 scatterplot3d_0.3-40
## [9] corrplot_0.84 ggbeeswarm_0.6.0
## [11] ggthemes_3.4.0 scater_1.6.2
## [13] ouija_0.99.0 Rcpp_0.12.15
## [15] SLICER_0.2.0 destiny_2.6.1
## [17] monocle_2.6.1 DDRTree_0.1.5
## [19] irlba_2.3.2 VGAM_1.0-4
## [21] ggplot2_2.2.1 Matrix_1.2-7.1
## [23] M3Drop_3.05.00 numDeriv_2016.8-1
## [25] TSCAN_1.16.0 SingleCellExperiment_1.0.0
## [27] SummarizedExperiment_1.8.1 DelayedArray_0.4.1
## [29] matrixStats_0.53.0 Biobase_2.38.0
## [31] GenomicRanges_1.30.1 GenomeInfoDb_1.14.0
## [33] IRanges_2.12.0 S4Vectors_0.16.0
## [35] BiocGenerics_0.24.0 knitr_1.19
##
## loaded via a namespace (and not attached):
## [1] utf8_1.1.3 shinydashboard_0.6.1 R.utils_2.6.0
## [4] tidyselect_0.2.3 lme4_1.1-15 RSQLite_2.0
## [7] AnnotationDbi_1.40.0 htmlwidgets_1.0 grid_3.4.3
## [10] combinat_0.0-8 Rtsne_0.13 munsell_0.4.3
## [13] codetools_0.2-15 statmod_1.4.30 colorspace_1.3-2
## [16] fastICA_1.2-1 rstudioapi_0.7 robustbase_0.92-8
## [19] vcd_1.4-4 tensor_1.5 VIM_4.7.0
## [22] TTR_0.23-3 labeling_0.3 slam_0.1-42
## [25] splancs_2.01-40 tximport_1.6.0 bbmle_1.0.20
## [28] GenomeInfoDbData_1.0.0 polyclip_1.6-1 bit64_0.9-7
## [31] pheatmap_1.0.8 rhdf5_2.22.0 rprojroot_1.3-2
## [34] coda_0.19-1 xfun_0.1 R6_2.2.2
## [37] RcppEigen_0.3.3.3.1 locfit_1.5-9.1 bitops_1.0-6
## [40] spatstat.utils_1.8-0 assertthat_0.2.0 scales_0.5.0
## [43] nnet_7.3-12 beeswarm_0.2.3 gtable_0.2.0
## [46] goftest_1.1-1 rlang_0.1.6 MatrixModels_0.4-1
## [49] lazyeval_0.2.1 acepack_1.4.1 checkmate_1.8.5
## [52] inline_0.3.14 yaml_2.1.16 reshape2_1.4.3
## [55] abind_1.4-5 backports_1.1.2 httpuv_1.3.5
## [58] Hmisc_4.1-1 tensorA_0.36 tools_3.4.3
## [61] bookdown_0.6 cubature_1.3-11 gplots_3.0.1
## [64] RColorBrewer_1.1-2 proxy_0.4-21 MCMCglmm_2.25
## [67] plyr_1.8.4 progress_1.1.2 base64enc_0.1-3
## [70] zlibbioc_1.24.0 purrr_0.2.4 RCurl_1.95-4.10
## [73] densityClust_0.3 prettyunits_1.0.2 rpart_4.1-10
## [76] alphahull_2.1 deldir_0.1-14 reldist_1.6-6
## [79] viridis_0.5.0 cowplot_0.9.2 zoo_1.8-1
## [82] ggrepel_0.7.0 cluster_2.0.6 magrittr_1.5
## [85] data.table_1.10.4-3 SparseM_1.77 lmtest_0.9-35
## [88] RANN_2.5.1 mime_0.5 evaluate_0.10.1
## [91] xtable_1.8-2 XML_3.98-1.9 smoother_1.1
## [94] pbkrtest_0.4-7 mclust_5.4 gridExtra_2.3
8.5 Imputation
library(scImpute)
library(SC3)
library(scater)
library(SingleCellExperiment)
library(mclust)
set.seed(1234567)
As discussed previously, one of the main challenges when analyzing scRNA-seq data is the presence of zeros,
or dropouts. The dropouts are assumed to have arisen for three possible reasons:
• The gene was not expressed in the cell and hence there are no transcripts to sequence
• The gene was expressed, but for some reason the transcripts were lost somewhere prior to sequencing
• The gene was expressed and transcripts were captured and turned into cDNA, but the sequencing
depth was not sufficient to produce any reads.
Thus, dropouts could be the result of experimental shortcomings, and if this is the case then we would like to
provide computational corrections. One possible solution is to impute the dropouts in the expression matrix.
To be able to impute gene expression values, one must have an underlying model. However, since we do
not know which dropout events are technical artefacts and which correspond to the transcript being truly
absent, imputation is a difficult challenge.
To the best of our knowledge, there are currently two different imputation methods available: MAGIC (van
Dijk et al., 2017) and scImpute (Li and Li, 2017). MAGIC is only available for Python or Matlab, but we
will run it from within R.
8.5.1 scImpute
To test scImpute, we use the default parameters and we apply it to the Deng dataset that we have worked
with before. scImpute takes a .csv or .txt file as an input:
deng <- readRDS("deng/deng-reads.rds")
write.csv(counts(deng), "deng.csv")
scimpute(
count_path = "deng.csv",
infile = "csv",
outfile = "txt",
out_dir = "./",
Kcluster = 10,
ncores = 2
)
plotPCA(
res,
colour_by = "cell_type2"
)
[Figure: PCA plot (Component 1: 56% variance, Component 2: 20% variance), coloured by cell_type2.]
Compare this result to the original data in Chapter 8.2. What are the most significant differences?
To evaluate the impact of the imputation, we use SC3 to cluster the imputed matrix
res <- sc3_estimate_k(res)
metadata(res)$sc3$k_estimation
## [1] 6
res <- sc3(res, ks = 10, gene_filter = FALSE)
adjustedRandIndex(colData(deng)$cell_type2, colData(res)$sc3_10_clusters)
## [1] 0.4687332
plotPCA(
res,
colour_by = "sc3_10_clusters"
)
[Figure: PCA plot (Component 1: 56% variance, Component 2: 20% variance), coloured by sc3_10_clusters.]
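The adjusted Rand index (ARI) used above measures the agreement between two partitions of the same cells, corrected for the agreement expected by chance. As a minimal sketch of how it is computed from the contingency table, the helper below (`adjusted_rand`, a name we made up) uses base R only; it is meant to illustrate the formula and has not been checked against the mclust implementation.

```r
# Adjusted Rand index from two label vectors (base R only)
adjusted_rand <- function(x, y) {
  tab <- table(x, y)
  n_pairs <- function(n) n * (n - 1) / 2     # number of pairs, choose(n, 2)
  sum_cells <- sum(n_pairs(tab))
  sum_rows  <- sum(n_pairs(rowSums(tab)))
  sum_cols  <- sum(n_pairs(colSums(tab)))
  expected  <- sum_rows * sum_cols / n_pairs(length(x))
  maximum   <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

truth   <- c("a", "a", "a", "b", "b", "b")
perfect <- c(2, 2, 2, 1, 1, 1)   # same partition, different label names
mixed   <- c(1, 1, 2, 2, 3, 3)   # partly agrees with the truth

adjusted_rand(truth, perfect)    # identical partitions give 1
adjusted_rand(truth, mixed)
```

Because only the partition matters, the label names themselves (cell types versus cluster numbers) do not need to match.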
Exercise: Based on the PCA and the clustering results, do you think that imputation using scImpute is a
good idea for the Deng dataset?
8.5.2 MAGIC
[Figure: PCA plot (Component 1: 79% variance, Component 2: 18% variance), coloured by cell_type2.]
Compare this result to the original data in Chapter 8.2. What are the most significant differences?
To evaluate the impact of the imputation, we use SC3 to cluster the imputed matrix
res <- sc3_estimate_k(res)
metadata(res)$sc3$k_estimation
## [1] 4
res <- sc3(res, ks = 10, gene_filter = FALSE)
adjustedRandIndex(colData(deng)$cell_type2, colData(res)$sc3_10_clusters)
## [1] 0.3752866
plotPCA(
res,
colour_by = "sc3_10_clusters"
)
[Figure: PCA plot (Component 1: 79% variance, Component 2: 18% variance), coloured by sc3_10_clusters.]
Exercise: What is the difference between scImpute and MAGIC based on the PCA and clustering analysis?
Which one do you think is best to use?
8.5.3 sessionInfo()
One of the most common types of analyses when working with bulk RNA-seq data is to identify differentially
expressed genes. By comparing the genes that change between two conditions, e.g. mutant and wild-type or
stimulated and unstimulated, it is possible to characterize the molecular mechanisms underlying the change.
Several different methods, e.g. DESeq2 and edgeR, have been developed for bulk RNA-seq. Moreover, there
are also extensive datasets available where the RNA-seq data has been validated using RT-qPCR. These data
can be used to benchmark DE finding algorithms and the available evidence suggests that the algorithms
are performing quite well.
In contrast to bulk RNA-seq, in scRNA-seq we usually do not have a defined set of experimental conditions.
Instead, as was shown in a previous chapter (8.2) we can identify the cell groups by using an unsupervised
clustering approach. Once the groups have been identified one can find differentially expressed genes either
by comparing the differences in variance between the groups (like the Kruskal-Wallis test implemented in
SC3), or by comparing gene expression between clusters in a pairwise manner. In the following chapter we
will mainly consider tools developed for pairwise comparisons.
Unlike bulk RNA-seq, we generally have a large number of samples (i.e. cells) for each group we are comparing
in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each
group to identify differences between groups rather than only comparing estimates of mean-expression as is
standard for bulk RNASeq.
There are two main approaches to comparing distributions. Firstly, we can use existing statistical
models/distributions and fit the same type of model to the expression in each group, then test for differences in
the parameters for each model, or test whether the model fits better if a particular parameter is allowed to
be different according to group. For instance, in Chapter 7.10 we used edgeR to test whether allowing mean
expression to be different in different batches significantly improved the fit of a negative binomial model of
the data.
Alternatively, we can use a non-parametric test which does not assume that expression values follow any
particular distribution, e.g. the Kolmogorov-Smirnov test (KS-test). Non-parametric tests generally convert
observed expression values to ranks and test whether the distribution of ranks for one group is significantly
different from the distribution of ranks for the other group. However, some non-parametric methods fail in
the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq
expression data. Moreover, if the conditions for a parametric test hold, then it will typically be more powerful
than a non-parametric test.
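The difference between comparing means and comparing whole distributions can be sketched on simulated data. The two groups below are made up so that they share the same mean but differ in shape; the second part adds many zeros to mimic dropouts. Whether ks.test warns about ties depends on the R version, so the warning is suppressed here.

```r
set.seed(1)
# Two groups with the same mean (5) but different shapes
group_a <- rnorm(500, mean = 5, sd = 1)
group_b <- c(rnorm(250, mean = 3, sd = 0.5),
             rnorm(250, mean = 7, sd = 0.5))

# A test on the means has little to detect here ...
p_t <- t.test(group_a, group_b)$p.value
# ... whereas the KS-test compares the whole distributions
p_ks <- ks.test(group_a, group_b)$p.value
c(t_test = p_t, ks_test = p_ks)

# Tied values: many zeros in both groups, as with dropouts
zeros_a <- c(rep(0, 300), rpois(200, lambda = 5))
zeros_b <- c(rep(0, 300), rpois(200, lambda = 8))
p_ks_ties <- suppressWarnings(ks.test(zeros_a, zeros_b)$p.value)
p_ks_ties
```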
The most common model of RNASeq data is the negative binomial model:
set.seed(1)
hist(
rnbinom(
1000,
mu = 10,
size = 100),
col = "grey50",
xlab = "Read Counts",
main = "Negative Binomial"
)
Figure 8.12: Negative Binomial distribution of read counts for a single gene across 1000 cells
Mean: $\mu = mu$
Variance: $\sigma^2 = mu + mu^2/size$
It is parameterized by the mean expression (mu) and the dispersion (size), which is inversely related to the
variance. The negative binomial model fits bulk RNA-seq data very well and it is used for most statistical
methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts
obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al.
2011).
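The mean and variance formulas can be checked empirically by simulating a large number of counts. This is only a sanity check under the same parameter values as in the histogram above; the sample size of 1e5 is our choice.

```r
set.seed(1)
mu   <- 10
size <- 100
x <- rnbinom(1e5, mu = mu, size = size)

# Empirical moments against the formulas above
c(empirical_mean = mean(x), theoretical_mean = mu)
c(empirical_var = var(x), theoretical_var = mu + mu^2 / size)

# A smaller size gives more overdispersion for the same mean
x_over <- rnbinom(1e5, mu = mu, size = 1)
c(empirical_var = var(x_over), theoretical_var = mu + mu^2 / 1)
```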
However, a raw negative binomial model does not fit full-length transcript data as well due to the high
dropout rates relative to the non-zero read counts. For this type of data a variety of zero-inflated negative
binomial models have been proposed (e.g. MAST, SCDE).
d <- 0.5;
counts <- rnbinom(
1000,
mu = 10,
size = 100
)
counts[runif(1000) < d] <- 0
hist(
counts,
col = "grey50",
xlab = "Read Counts",
main = "Zero-inflated NB"
)
[Figure: histogram titled "Zero-inflated NB"; x-axis: Read Counts, y-axis: Frequency.]
Mean: $\mu = mu \cdot (1 - d)$
Variance: $\sigma^2 = mu \cdot (1 - d) \cdot (1 + d \cdot mu + mu/size)$
These models introduce a new parameter d, for the dropout rate, to the negative binomial model. As we
saw in Chapter 19, the dropout rate of a gene is strongly correlated with the mean expression of the gene.
Different zero-inflated negative binomial models use different relationships between mu and d and some may
fit µ and d to the expression of each gene independently.
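As with the plain negative binomial, the moments of the zero-inflated model can be checked by simulation. The sketch below reuses the parameter values from the example above (mu = 10, size = 100, d = 0.5) with a larger sample size of our choosing, and treats mu in the formulas as the mean of the underlying negative binomial.

```r
set.seed(1)
mu <- 10; size <- 100; d <- 0.5
n <- 1e5
x <- rnbinom(n, mu = mu, size = size)
x[runif(n) < d] <- 0            # zero out a fraction d of the values

theoretical_mean <- mu * (1 - d)
theoretical_var  <- mu * (1 - d) * (1 + d * mu + mu / size)
c(empirical_mean = mean(x), theoretical_mean = theoretical_mean)
c(empirical_var = var(x), theoretical_var = theoretical_var)
```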
Finally, several methods use a Poisson-Beta distribution which is based on a mechanistic model of
transcriptional bursting. There is strong experimental support for this model (Kim and Marioni, 2013) and it
provides a good fit to scRNA-seq data, but it is less easy to use than the negative binomial models and
there are far fewer existing methods upon which to build.
a <- 0.1
b <- 0.1
g <- 100
lambdas <- rbeta(1000, a, b)
counts <- sapply(g*lambdas, function(l) {rpois(1, lambda = l)})
hist(
counts,
col = "grey50",
xlab = "Read Counts",
main = "Poisson-Beta"
)
[Figure: histogram titled "Poisson-Beta"; x-axis: Read Counts, y-axis: Frequency.]
Mean: $\mu = g \cdot a/(a + b)$
Variance: $\sigma^2 = g \cdot a/(a + b) + g^2 \cdot a \cdot b/((a + b + 1) \cdot (a + b)^2)$
This model uses three parameters: a the rate of activation of transcription; b the rate of inhibition of
transcription; and g the rate of transcript production while transcription is active at the locus. Differential
expression methods may test each of the parameters for differences across groups or only one (often g).
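The moments of the Poisson-Beta model can also be checked by simulation. The sketch below reuses a = b = 0.1 and g = 100 from the example above, with a sample size of our choosing. It splits the variance of the counts into the part contributed by the Beta-distributed rate and the extra Poisson sampling noise, following the law of total variance.

```r
set.seed(1)
a <- 0.1; b <- 0.1; g <- 100
n <- 1e5
lambdas <- rbeta(n, a, b)
x <- rpois(n, lambda = g * lambdas)   # rpois is vectorised over lambda

mean_theory <- g * a / (a + b)
# Variance of the scaled Beta part, g * lambda
var_rate <- g^2 * a * b / ((a + b + 1) * (a + b)^2)
# Law of total variance: Poisson noise adds the mean on top of that
var_counts <- mean_theory + var_rate

c(empirical_mean = mean(x), theoretical_mean = mean_theory)
c(empirical_var = var(x), rate_only = var_rate, with_poisson = var_counts)
```

For these parameter values the rate term dominates, so the Poisson contribution is small but not zero.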
All of these models may be further expanded to explicitly account for other sources of gene expression
differences such as batch-effect or library depth depending on the particular DE algorithm.
Exercise: Vary the parameters of each distribution to explore how they affect the distribution of gene
expression. How similar are the Poisson-Beta and Negative Binomial models?
library(scRNA.seq.funcs)
library(edgeR)
library(monocle)
library(MAST)
library(ROCR)
set.seed(1)
8.7.1 Introduction
To test different single-cell differential expression methods we will be using the Blischak dataset from Chapters
7-17. For this experiment bulk RNA-seq data for each cell-line was generated in addition to single-cell data.
We will use the differentially expressed genes identified using standard methods on the respective bulk data as
the ground truth for evaluating the accuracy of each single-cell method. To save time we have pre-computed
these for you. You can run the commands below to load these data.
DE <- read.table("tung/TPs.txt")
notDE <- read.table("tung/TNs.txt")
GroundTruth <- list(
DE = as.character(unlist(DE)),
notDE = as.character(unlist(notDE))
)
This ground truth has been produced for the comparison of individual NA19101 to NA19239. Now load the
respective single-cell data:
molecules <- read.table("tung/molecules.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
keep <- anno[,1] == "NA19101" | anno[,1] == "NA19239"
data <- molecules[,keep]
group <- anno[keep,1]
batch <- anno[keep,4]
# remove genes that aren't expressed in at least 6 cells
gkeep <- rowSums(data > 0) > 5;
counts <- data[gkeep,]
# Library size normalization
lib_size = colSums(counts)
norm <- t(t(counts)/lib_size * median(lib_size))
# Variant of CPM for datasets with library sizes of fewer than 1 mil molecules
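The double transpose in the normalization step can be hard to read, so here is the same idiom on a toy matrix (all `toy_` names are made up). R divides a matrix by a vector column by column, so the counts are transposed to put cells in rows, divided by the library sizes, and transposed back.

```r
# Toy count matrix: 3 genes (rows) by 4 cells (columns)
toy_counts <- matrix(
  c(10, 0, 5,
    20, 2, 8,
     5, 1, 4,
    40, 4, 16),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4))
)
toy_lib_size <- colSums(toy_counts)

# Divide each cell (column) by its library size, then rescale to the median
toy_norm <- t(t(toy_counts) / toy_lib_size * median(toy_lib_size))

colSums(toy_norm)   # every cell now has the same total
```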
Now we will compare various single-cell DE methods. Note that we will only be running methods which are
available as R-packages and run relatively quickly.
The types of test that are easiest to work with are non-parametric ones. The most commonly used non-
parametric test is the Kolmogorov-Smirnov test (KS-test) and we can use it to compare the distributions for
each gene in the two individuals.
The KS-test quantifies the distance between the empirical cumulative distributions of the expression of
each gene in each of the two populations. It is sensitive to changes in mean expression and changes in
variability. However, it assumes the data are continuous and may perform poorly when the data contain a large
number of identical values (e.g. zeros). Another issue with the KS-test is that it can be very sensitive for
large sample sizes, and thus a difference may come out as significant even though its magnitude is very
small.
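As a minimal sketch of running the KS-test gene by gene, the example below uses simulated data rather than the tung dataset. All `toy_` names are hypothetical, and the FDR correction and the 5% threshold are choices made for this illustration.

```r
set.seed(1)
# Toy data: 20 genes by 60 cells; the first 5 genes differ between groups
toy_group <- rep(c("A", "B"), each = 30)
toy_expr <- matrix(rnorm(20 * 60), nrow = 20,
                   dimnames = list(paste0("gene", 1:20), NULL))
toy_expr[1:5, toy_group == "B"] <- toy_expr[1:5, toy_group == "B"] + 2

# KS-test for every gene (row), keeping only the p-value
toy_pvals <- apply(toy_expr, 1, function(x) {
  ks.test(x[toy_group == "A"], x[toy_group == "B"])$p.value
})
# Multiple testing correction, then call genes at a 5% threshold
toy_pvals <- p.adjust(toy_pvals, method = "fdr")
toy_sig <- names(toy_pvals)[toy_pvals < 0.05]
toy_sig
```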
[Figure: Illustration of the two-sample Kolmogorov–Smirnov statistic. Red and blue lines each correspond
to an empirical distribution function, and the black arrow is the two-sample KS statistic.]
This code “applies” the function to each row (specified by 1) of the expression matrix, data. In the
function we are returning just the p.value from the ks.test output. We can now consider how many of the
ground truth positive and negative DE genes are detected by the KS-test:
## [1] 5095
# Number of KS-DE genes
sum(GroundTruth$DE %in% sigDE)
## [1] 792
## [1] 3190
# Number of KS-DE genes that are truly not-DE
As you can see, many more of our ground truth negative genes were identified as DE by the KS-test (false
positives) than ground truth positive genes (true positives). However, this may be due to the larger number
of notDE genes, so we typically normalize these counts as the true positive rate (TPR), TP/(TP + FN),
and the false positive rate (FPR), FP/(FP + TN).
tp <- sum(GroundTruth$DE %in% sigDE)
fp <- sum(GroundTruth$notDE %in% sigDE)
tn <- sum(GroundTruth$notDE %in% names(pVals)[pVals >= 0.05])
fn <- sum(GroundTruth$DE %in% names(pVals)[pVals >= 0.05])
tpr <- tp/(tp + fn)
fpr <- fp/(fp + tn)
cat(c(tpr, fpr))
## 0.7346939 0.2944706
Now we can see the TPR is much higher than the FPR indicating the KS test is identifying DE genes.
So far we’ve only evaluated the performance at a single significance threshold. Often it is informative to
vary the threshold and evaluate performance across a range of values. This is then plotted as a
receiver-operating-characteristic curve (ROC) and a general accuracy statistic can be calculated as the area
under this curve (AUC). We will use the ROCR package to facilitate this plotting.
# Only consider genes for which we know the ground truth
pVals <- pVals[names(pVals) %in% GroundTruth$DE |
names(pVals) %in% GroundTruth$notDE]
truth <- rep(1, times = length(pVals));
truth[names(pVals) %in% GroundTruth$DE] = 0;
pred <- ROCR::prediction(pVals, truth)
perf <- ROCR::performance(pred, "tpr", "fpr")
ROCR::plot(perf)
## [1] 0.7954796
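The AUC has a simple interpretation: it is the probability that a randomly chosen member of the positive class receives a higher score than a randomly chosen member of the negative class. The sketch below computes it from ranks in base R (the Mann-Whitney formulation) on made-up values. The helper `auc_from_ranks` is hypothetical and is not how ROCR computes the statistic internally.

```r
# AUC computed from ranks (Mann-Whitney formulation), base R only.
# scores: predicted values; labels: 1 for the positive class, 0 otherwise.
auc_from_ranks <- function(scores, labels) {
  r <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# As in the code above, the positive class (1) is "not DE", so a
# larger p-value counts as evidence for the positive class.
toy_p     <- c(0.001, 0.02, 0.03, 0.40, 0.60, 0.90)
toy_truth <- c(0,     0,    1,    0,    1,    1)
toy_auc <- auc_from_ranks(toy_p, toy_truth)
toy_auc   # 8 of the 9 positive/negative pairs are ordered correctly
```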
Finally to facilitate the comparisons of other DE methods let’s put this code into a function so we don’t
need to repeat it:
DE_Quality_AUC <- function(pVals) {
pVals <- pVals[names(pVals) %in% GroundTruth$DE |
names(pVals) %in% GroundTruth$notDE]
truth <- rep(1, times = length(pVals));
truth[names(pVals) %in% GroundTruth$DE] = 0;
pred <- ROCR::prediction(pVals, truth)
perf <- ROCR::performance(pred, "tpr", "fpr")
ROCR::plot(perf)
aucObj <- ROCR::performance(pred, "auc")
return(aucObj@y.values[[1]])
}
[Figure: ROC curve (y-axis: True positive rate).]
The Wilcoxon rank-sum test is another non-parametric test, but it tests specifically whether values in one
group are greater/less than the values in the other group. Thus it is often considered a test for a difference
in median expression between two groups, whereas the KS-test is sensitive to any change in the distribution
of expression values.
pVals <- apply(
norm, 1, function(x) {
wilcox.test(
x[group == "NA19101"],
x[group == "NA19239"]
)$p.value
}
)
# multiple testing correction
pVals <- p.adjust(pVals, method = "fdr")
DE_Quality_AUC(pVals)
## [1] 0.8320326
8.7.4 edgeR
We’ve already used edgeR for differential expression in Chapter 7.10. edgeR is based on a negative
binomial model of gene expression and uses a generalized linear model (GLM) framework, which enables us to
include other factors such as batch in the model.
dge <- DGEList(
counts = counts,
[Figure: ROC curve (y-axis: True positive rate).]
## [1] 0.8466764
8.7.5 Monocle
Monocle can use several different models for DE. For count data it recommends the Negative Binomial
model (negbinomial.size). For normalized data it recommends log-transforming it then using a normal
distribution (gaussianff). Similar to edgeR this method uses a GLM framework so in theory can account
for batches, however in practice the model fails for this dataset if batches are included.
pd <- data.frame(group = group, batch = batch)
rownames(pd) <- colnames(counts)
pd <- new("AnnotatedDataFrame", data = pd)
[Figure: ROC curve (y-axis: True positive rate).]
phenoData = pd,
expressionFamily = negbinomial.size()
)
Obj <- estimateSizeFactors(Obj)
Obj <- estimateDispersions(Obj)
res <- differentialGeneTest(Obj, fullModelFormulaStr = "~group")
## [1] 0.8252662
Exercise: Compare the results using the negative binomial model on counts and those from using the
normal/gaussian model (gaussianff()) on log-transformed normalized counts.
Answer:
## [1] 0.7357829
8.7.6 MAST
MAST is based on a zero-inflated negative binomial model. It tests for differential expression using a
hurdle model to combine tests of discrete (0 vs not zero) and continuous (non-zero values) aspects of gene
expression. Again this uses a linear modelling framework to enable complex models to be considered.
log_counts <- log(counts + 1) / log(2)
fData <- data.frame(names = rownames(log_counts))
rownames(fData) <- rownames(log_counts);
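The hurdle idea can be sketched for a single gene in base R: one test for whether the gene is detected at all, one for its level among the cells where it is detected, and a combination of the two. This is a simplified illustration on simulated data, not MAST's code; the Gaussian model for the non-zero log-expression values, the likelihood-ratio tests and all `toy_` names are assumptions of this sketch.

```r
set.seed(1)
# Toy log-expression for one gene in two groups of 100 cells:
# group B is detected more often and at a higher level when detected
toy_group <- factor(rep(c("A", "B"), each = 100))
detected  <- rbinom(200, 1, prob = ifelse(toy_group == "A", 0.3, 0.7))
toy_expr  <- detected * rnorm(200, mean = ifelse(toy_group == "A", 3, 4))

# Discrete part: is the gene detected at all? (logistic regression)
fit_d1 <- glm(detected ~ toy_group, family = binomial)
fit_d0 <- glm(detected ~ 1, family = binomial)
lrt_discrete <- fit_d0$deviance - fit_d1$deviance

# Continuous part: expression level among detected cells (Gaussian model)
pos <- detected == 1
fit_c1 <- lm(toy_expr[pos] ~ toy_group[pos])
fit_c0 <- lm(toy_expr[pos] ~ 1)
lrt_continuous <- as.numeric(2 * (logLik(fit_c1) - logLik(fit_c0)))

# Combine the two likelihood-ratio statistics (one parameter each)
p_hurdle <- pchisq(lrt_discrete + lrt_continuous, df = 2, lower.tail = FALSE)
p_hurdle
```

Splitting the test this way lets a gene be called DE because it is switched on in more cells, because it is expressed at a different level when on, or both.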
[Figures: ROC curves (y-axis: True positive rate).]
## [1] 0.8284046
These methods are too slow to run today but we encourage you to try them out on your own:
8.7.8 BPSC
BPSC uses the Poisson-Beta model of single-cell gene expression, which we discussed in the previous
chapter, and combines it with generalized linear models which we’ve already encountered when using
edgeR. BPSC performs comparisons of one or more groups to a reference group (“control”) and can include
other factors such as batches in the model.
library(BPSC)
bpsc_data <- norm[,batch=="NA19101.r1" | batch=="NA19239.r1"]
bpsc_group = group[batch=="NA19101.r1" | batch=="NA19239.r1"]
8.7.9 SCDE
SCDE is the first single-cell specific DE method. It fits a zero-inflated negative binomial model to
expression data using Bayesian statistics. The usage below tests for differences in mean expression of
individual genes across groups but recent versions include methods to test for differences in mean
expression or dispersion of groups of genes, usually representing a pathway.
library(scde)
cnts <- apply(
counts,
2,
function(x) {
storage.mode(x) <- 'integer'
return(x)
}
)
names(group) <- 1:length(group)
colnames(cnts) <- 1:length(group)
o.ifm <- scde::scde.error.models(
counts = cnts,
groups = group,
n.cores = 1,
threshold.segmentation = TRUE,
save.crossfit.plots = FALSE,
save.model.plots = FALSE,
verbose = 0,
min.size.entries = 2
)
priors <- scde::scde.expression.prior(
models = o.ifm,
counts = cnts,
length.out = 400,
show.plot = FALSE
)
resSCDE <- scde::scde.expression.difference(
o.ifm,
cnts,
priors,
groups = group,
n.randomizations = 100,
n.cores = 1,
verbose = 0
)
# Convert Z-scores into 2-tailed p-values
pVals <- pnorm(abs(resSCDE$cZ), lower.tail = FALSE) * 2
DE_Quality_AUC(pVals)
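The Z-score to p-value conversion used above can be checked on its own. The following is a minimal sketch on made-up Z-scores (the values in `z` are hypothetical and not SCDE output); it only illustrates that doubling the upper tail of `|z|` gives a two-tailed p-value.

```r
# Toy Z-scores (hypothetical values, not produced by scde)
z <- c(-2.5, 0, 1.96, 3.2)

# Two-tailed p-value: probability of a standard normal value
# at least as extreme as |z| in either direction
p_two_tailed <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Equivalent formulation using the lower tail of -|z|
p_check <- 2 * pnorm(-abs(z))

print(signif(p_two_tailed, 3))
```

Note that the sign of the Z-score (the direction of the change) is discarded by this conversion.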
8.7.10 sessionInfo()
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 RColorBrewer_1.1-2 rprojroot_1.3-2
## [4] tools_3.4.3 backports_1.1.2 R6_2.2.2
## [7] KernSmooth_2.23-15 hypergeo_1.2-13 lazyeval_0.2.1
## [10] colorspace_1.3-2 gridExtra_2.3 moments_0.14
## [13] compiler_3.4.3 orthopolynom_1.0-5 bookdown_0.6
## [16] slam_0.1-42 caTools_1.17.1 scales_0.5.0
## [19] stringr_1.2.0 digest_0.6.15 rmarkdown_1.8
## [22] XVector_0.18.0 pkgconfig_2.0.1 htmltools_0.3.6
## [25] rlang_0.1.6 FNN_1.1 bindr_0.1
## [28] combinat_0.0-8 gtools_3.5.0 dplyr_0.7.4
## [31] RCurl_1.95-4.10 magrittr_1.5 GenomeInfoDbData_1.0.0
## [34] Rcpp_0.12.15 munsell_0.4.3 abind_1.4-5
## [37] viridis_0.5.0 stringi_1.1.6 yaml_2.1.16
## [40] MASS_7.3-45 zlibbioc_1.24.0 Rtsne_0.13
## [43] plyr_1.8.4 grid_3.4.3 gdata_2.18.0
## [46] ggrepel_0.7.0 contfrac_1.1-11 lattice_0.20-34
## [49] locfit_1.5-9.1 pillar_1.1.0 igraph_1.1.2
## [52] reshape2_1.4.3 glue_1.2.0 evaluate_0.10.1
## [55] data.table_1.10.4-3 deSolve_1.20 gtable_0.2.0
## [58] RANN_2.5.1 assertthat_0.2.0 xfun_0.1
## [61] qlcMatrix_0.9.5 HSMMSingleCell_0.112.0 viridisLite_0.3.0
## [64] tibble_1.4.2 pheatmap_1.0.8 elliptic_1.3-7
## [67] bindrcpp_0.2 cluster_2.0.6 fastICA_1.2-1
## [70] densityClust_0.3 statmod_1.4.30
library(scater)
library(SingleCellExperiment)
8.8.1 Introduction
As more and more scRNA-seq datasets become available, carrying out comparisons between
them is key. There are two main approaches to comparing scRNA-seq datasets. The first approach is
“label-centric”, which focuses on identifying equivalent cell-types/states across datasets by
comparing individual cells or groups of cells. The other approach is “cross-dataset normalization”, which
attempts to computationally remove experiment-specific technical/biological effects so that data from
multiple experiments can be combined and jointly analyzed.
The label-centric approach can be used with datasets that have high-confidence cell annotations, e.g. the
Human Cell Atlas (HCA) (Regev et al., 2017) or the Tabula Muris (?) once they are completed, to project cells
or clusters from a new sample onto this reference to consider tissue composition and/or identify cells with
novel/unknown identity. Conceptually, such projections are similar to the popular BLAST method
(Altschul et al., 1990), which makes it possible to quickly find the closest match in a database for a newly
identified nucleotide or amino acid sequence. The label-centric approach can also be used to compare
datasets of similar biological origin collected by different labs to ensure that the annotation and the
analysis are consistent.
Figure 8.20: Label-centric dataset comparison can be used to compare the annotations of two different
samples.
Figure 8.21: Label-centric dataset comparison can project cells from a new experiment onto an annotated
reference.
The cross-dataset normalization approach can also be used to compare datasets of similar biological origin.
Unlike the label-centric approach, it enables the joint analysis of multiple datasets to facilitate the
identification of rare cell-types which may be too sparsely sampled in each individual dataset to be reliably
detected. However, cross-dataset normalization is not applicable to very large and diverse references since
it assumes a significant portion of the biological variability in each of the datasets overlaps with others.
8.8.2 Datasets
We will be running these methods on two human pancreas datasets: (Muraro et al., 2016) and (Segerstolpe
et al., 2016). Since the pancreas has been widely studied, these datasets are well annotated.
muraro <- readRDS("pancreas/muraro.rds")
segerstolpe <- readRDS("pancreas/segerstolpe.rds")
This data has already been formatted for scmap. Cell type labels must be stored in the cell_type1
column of the colData slots, and gene ids that are consistent across both datasets must be stored in the
feature_symbol column of the rowData slots.
First, let’s check that our gene ids match across both datasets:
sum(rowData(muraro)$feature_symbol %in% rowData(segerstolpe)$feature_symbol)/nrow(muraro)
## [1] 0.9599519
sum(rowData(segerstolpe)$feature_symbol %in% rowData(muraro)$feature_symbol)/nrow(segerstolpe)
## [1] 0.719334
Here we can see that 96% of the genes present in muraro match genes in segerstolpe and 72% of genes in
segerstolpe match genes in muraro. This is as expected because the segerstolpe dataset was more
deeply sequenced than the muraro dataset. However, it highlights some of the difficulties in comparing
scRNA-seq datasets.
We can confirm this by checking the overall size of these two datasets.
dim(muraro)
dim(segerstolpe)
In addition, we can check the cell-type annotations for each of these datasets using the commands below:
summary(factor(colData(muraro)$cell_type1))
summary(factor(colData(segerstolpe)$cell_type1))
Here we can see that even though both datasets considered the same biological tissue, they have been
annotated with slightly different sets of cell-types. If you are familiar with pancreas biology
you might recognize that the pancreatic stellate cells (PSCs) in segerstolpe are a type of mesenchymal stem
cell which would fall under the “mesenchymal” type in muraro. However, it isn’t clear whether these two
annotations should be considered synonymous or not. We can use label-centric comparison methods to
determine if these two cell-type annotations are indeed equivalent.
Alternatively, we might be interested in understanding the function of those cells that were “unclassified
endocrine” or were deemed too poor quality (“not applicable”) for the original clustering in each dataset by
leveraging information across datasets. We could either attempt to infer which of the existing annotations
they most likely belong to using label-centric approaches, or try to uncover a novel cell-type
among them (or a sub-type within the existing annotations) using cross-dataset normalization.
To simplify our demonstration analyses we will remove the small classes of unassigned cells, and the poor
quality cells. We will retain the “unclassified endocrine” to see if any of these methods can elucidate what
cell-type they belong to.
segerstolpe <- segerstolpe[,colData(segerstolpe)$cell_type1 != "unclassified"]
segerstolpe <- segerstolpe[,colData(segerstolpe)$cell_type1 != "not applicable"]
muraro <- muraro[,colData(muraro)$cell_type1 != "unclear"]
library(scmap)
set.seed(1234567)
We recently developed scmap (Kiselev and Hemberg, 2017) - a method for projecting cells from a
scRNA-seq experiment onto the cell-types identified in other experiments. Additionally, a cloud version of
scmap can be run for free, without restrictions, from http://www.hemberg-lab.cloud/scmap.
Once we have a SingleCellExperiment object we can run scmap. First we have to build the “index” of
our reference clusters. Since we want to know whether PSCs and mesenchymal cells are synonymous we
will project each dataset to the other so we will build an index for each dataset. This requires first selecting
the most informative features for the reference dataset.
muraro <- selectFeatures(muraro, suppress_plot = FALSE)
Genes highlighted in red will be used in the further analysis (projection).
segerstolpe <- selectFeatures(segerstolpe, suppress_plot = FALSE)
From the y-axis of these plots we can see that scmap uses a dropout-based feature selection
method.
You may want to adjust your features using the setFeatures function if features are too heavily
concentrated in only a few cell-types. In this case the dropout-based features look good so we
will just use them.
Exercise Using the rowData of each dataset how many genes were selected as features in both datasets?
What does this tell you about these datasets?
Answer
8.8.3.2 Projecting
scmap computes the distance from each cell to each cell-type in the reference index, then applies an
empirically derived threshold to determine which cells are assigned to the closest reference cell-type and
which are unassigned. To account for differences in sequencing depth, distance is calculated using both the
Spearman correlation and the cosine distance, and only cells with a consistent assignment under both
measures are returned as assigned.
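The assignment rule described above can be sketched in base R on a toy reference. This is not the scmap implementation: the centroid values, the helper names `cosine_sim` and `assign_cell`, and the threshold of 0.7 are all assumptions made for illustration.

```r
# Toy reference: expression of 6 genes in 3 cell types (hypothetical values)
centroids <- cbind(
  alpha = c(9, 8, 1, 0, 1, 0),
  beta  = c(1, 0, 9, 7, 0, 1),
  delta = c(0, 1, 1, 0, 8, 9)
)
rownames(centroids) <- paste0("gene", 1:6)

# Cosine similarity between two expression vectors
cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Assign a cell only if both similarity measures pick the same cell type
# and both similarities reach the threshold (0.7 is an assumed value)
assign_cell <- function(cell, centroids, threshold = 0.7) {
  spearman <- apply(centroids, 2, function(ct) cor(cell, ct, method = "spearman"))
  cosine   <- apply(centroids, 2, function(ct) cosine_sim(cell, ct))
  best_s <- names(which.max(spearman))
  best_c <- names(which.max(cosine))
  if (best_s == best_c && min(spearman[best_s], cosine[best_c]) >= threshold) {
    best_s
  } else {
    "unassigned"
  }
}

deep_alpha <- c(90, 70, 12, 3, 8, 1)  # alpha-like cell sequenced roughly ten times deeper
ambiguous  <- c(9, 1, 9, 1, 9, 1)     # no clear match to any single cell type

print(assign_cell(deep_alpha, centroids))
print(assign_cell(ambiguous, centroids))
```

Both measures depend only on the relative pattern of expression, so the ten-fold difference in depth does not prevent the first cell from matching the alpha centroid, while the second cell falls below the threshold and stays unassigned.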
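We will project the segerstolpe dataset onto the muraro dataset, and the muraro dataset onto segerstolpe:
seger_to_muraro <- scmapCluster(
projection = segerstolpe,
index_list = list(
muraro = metadata(muraro)$scmap_cluster_index
)
)
# Reverse projection; assumes the cluster index for segerstolpe has also been built
muraro_to_seger <- scmapCluster(
projection = muraro,
index_list = list(
seger = metadata(segerstolpe)$scmap_cluster_index
)
)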
Note that in each case we are projecting to a single dataset, but this could be extended to any number
of datasets for which we have computed indices.
Now let’s compare the original cell-type labels with the projected labels:
table(colData(muraro)$cell_type1, muraro_to_seger$scmap_cluster_labs)
##
## acinar alpha beta co-expression delta ductal endothelial
## acinar 211 0 0 0 0 0 0
## alpha 1 763 0 18 0 2 0
## beta 2 1 397 7 2 2 0
## delta 0 0 2 1 173 0 0
## ductal 7 0 0 0 0 208 0
## endothelial 0 0 0 0 0 0 15
## epsilon 0 0 0 0 0 0 0
## gamma 2 0 0 0 0 0 0
## mesenchymal 0 0 0 0 0 1 0
##
## epsilon gamma MHC class II PSC unassigned
## acinar 0 0 0 0 8
## alpha 0 2 0 0 26
## beta 0 5 1 2 29
## delta 0 0 0 0 17
## ductal 0 0 5 3 22
## endothelial 0 0 0 1 5
## epsilon 3 0 0 0 0
## gamma 0 95 0 0 4
## mesenchymal 0 0 0 77 2
Here we can see that cell-types do map to their equivalents in segerstolpe, and importantly we see that all
but one of the “mesenchymal” cells were assigned to the “PSC” class.
table(colData(segerstolpe)$cell_type1, seger_to_muraro$scmap_cluster_labs)
##
## acinar alpha beta delta ductal endothelial
## acinar 181 0 0 0 4 0
## alpha 0 869 1 0 0 0
## beta 0 0 260 0 0 0
## co-expression 0 7 31 0 0 0
## delta 0 0 1 111 0 0
## ductal 0 0 0 0 383 0
## endothelial 0 0 0 0 0 14
## epsilon 0 0 0 0 0 0
## gamma 0 2 0 0 0 0
## mast 0 0 0 0 0 0
## MHC class II 0 0 0 0 0 0
## PSC 0 0 1 0 0 0
## unclassified endocrine 0 0 0 0 0 0
##
## epsilon gamma mesenchymal unassigned
## acinar 0 0 0 0
## alpha 0 0 0 16
## beta 0 0 0 10
## co-expression 0 0 0 1
## delta 0 0 0 2
## ductal 0 0 0 3
## endothelial 0 0 0 2
## epsilon 6 0 0 1
## gamma 0 192 0 3
## mast 0 0 0 7
## MHC class II 0 0 0 5
## PSC 0 0 53 0
## unclassified endocrine 0 0 0 41
Again we see cell-types match each other and that all but one of the “PSCs” match the “mesenchymal”
cells providing strong evidence that these two annotations should be considered synonymous.
We can also visualize these tables using a Sankey diagram:
plot(getSankey(colData(muraro)$cell_type1, muraro_to_seger$scmap_cluster_labs[,1], plot_height=400))
Exercise How many of the previously unclassified cells would we be able to assign to cell-types using
scmap?
Answer
scmap can also project each cell in one dataset to its approximate closest neighbouring cell in the reference
dataset. This uses a highly optimized search algorithm allowing it to be scaled to very large references (in
theory from 100,000 to millions of cells). However, this process is stochastic so we must fix the random seed
to ensure we can reproduce our results.
We have already performed feature selection for this dataset so we can go straight to building the index.
set.seed(193047)
segerstolpe <- indexCell(segerstolpe)
## Parameter M was not provided, will use M = n_features / 10 (if n_features <= 1000), where n_features
## Parameter k was not provided, will use k = sqrt(number_of_cells)
muraro <- indexCell(muraro)
## Parameter M was not provided, will use M = n_features / 10 (if n_features <= 1000), where n_features
## Parameter k was not provided, will use k = sqrt(number_of_cells)
In this case the index is a series of clusterings of each cell using different sets of features. The
parameters k and M are the number of clusters and the number of features used in each of these
subclusterings. New cells are assigned to the nearest cluster in each subclustering to generate a unique
pattern of cluster assignments. We then find the cell in the reference dataset with the same or most similar
pattern of cluster assignments.
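The idea of an index made of several subclusterings can be sketched with plain k-means on simulated data. This is a simplified illustration under stated assumptions, not scmap's actual search algorithm: the simulated `reference`, the helper `query_pattern`, and the choice of M = 5 features and k = 3 clusters per subclustering are all hypothetical.

```r
set.seed(1)
n_features <- 20
M <- 5  # features used in each subclustering (assumed value)
k <- 3  # clusters in each subclustering (assumed value)

# Toy reference: three groups of 10 cells with different mean expression
group <- rep(1:3, each = 10)
reference <- sapply(group, function(g) rnorm(n_features, mean = 5 * g))

# Split the features into consecutive chunks of M features
chunks <- split(seq_len(n_features), ceiling(seq_len(n_features) / M))

# Index: one k-means clustering of the reference cells per feature chunk
index <- lapply(chunks, function(rows) {
  kmeans(t(reference[rows, ]), centers = k, nstart = 50)
})

# Pattern of cluster assignments for every reference cell (cells x chunks)
ref_pattern <- sapply(index, function(km) km$cluster)

# Assign a new cell to the nearest cluster centre in each subclustering
query_pattern <- function(cell) {
  mapply(function(rows, km) {
    d <- colSums((t(km$centers) - cell[rows])^2)
    unname(which.min(d))
  }, chunks, index)
}

# A query cell that is a noisy copy of reference cell 12
query <- reference[, 12] + rnorm(n_features, sd = 0.1)

# Reference cells sharing the largest number of cluster assignments with the query
matches <- rowSums(sweep(ref_pattern, 2, query_pattern(query), "=="))
nearest <- which(matches == max(matches))
print(nearest)
```

Comparing short patterns of cluster labels is much cheaper than computing distances to every reference cell, which is what makes this kind of lookup approximate but fast.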
We can examine the cluster assignment patterns for the reference datasets using:
metadata(muraro)$scmap_cell_index$subclusters[1:5,1:5]
8.8.5 MetaNeighbor
MetaNeighbor is specifically designed to ask whether cell-type labels are consistent across datasets. It
comes in two versions. The first is a fully supervised method which assumes cell-types are known in all
datasets and calculates how “good” those cell-type labels are. (The precise meaning of “good” will be
described below). Alternatively, MetaNeighbor can estimate how similar all cell-types are to each other
both within and across datasets. We will only be using the unsupervised version as it has much more
general applicability and its results are easier to interpret.
MetaNeighbor compares cell-types across datasets by building a cell-cell Spearman correlation network.
The method then tries to predict the label of each cell through weighted “votes” of its nearest-neighbours.
It then scores the overall similarity between two clusters as the AUROC for assigning cells of typeA to typeB
based on these weighted votes. An AUROC of 1 would indicate all the cells of typeA were assigned to typeB
before any other cells were, and an AUROC of 0.5 is what you would get if cells were being randomly
assigned.
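To make the AUROC interpretation concrete, here is a small sketch that computes it from a vector of vote scores using the rank-sum (Mann-Whitney) formulation. The function name `auroc` and the toy `votes` are assumptions for illustration and are not part of the MetaNeighbor script.

```r
# AUROC from scores via the rank-sum (Mann-Whitney) formulation:
# the probability that a random target cell outranks a random non-target cell
auroc <- function(scores, is_target) {
  r <- rank(scores)            # ties receive average ranks
  n_pos <- sum(is_target)
  n_neg <- sum(!is_target)
  (sum(r[is_target]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical weighted votes for six cells
votes <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)

# All target cells receive the highest votes
perfect <- auroc(votes, c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))

# Votes carry no information: every cell has the same score
uninformative <- auroc(rep(0.5, 6), c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))

print(c(perfect = perfect, uninformative = uninformative))
```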
MetaNeighbor is just a couple of R functions rather than a complete package, so we have to load them using source:
source("2017-08-28-runMN-US.R")
MetaNeighbor requires all datasets to be combined into a single expression matrix prior to running:
is.common <- rowData(muraro)$feature_symbol %in% rowData(segerstolpe)$feature_symbol
muraro <- muraro[is.common,]
segerstolpe <- segerstolpe[match(rowData(muraro)$feature_symbol, rowData(segerstolpe)$feature_symbol),]
rownames(segerstolpe) <- rowData(segerstolpe)$feature_symbol
rownames(muraro) <- rowData(muraro)$feature_symbol
identical(rownames(segerstolpe), rownames(muraro))
## [1] TRUE
combined_logcounts <- cbind(logcounts(muraro), logcounts(segerstolpe))
dataset_labels <- rep(c("m", "s"), times=c(ncol(muraro), ncol(segerstolpe)))
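# as.character() guards against factor labels being combined as integer codes
cell_type_labels <- c(as.character(colData(muraro)$cell_type1), as.character(colData(segerstolpe)$cell_type1))
Since MetaNeighbor is much slower than scmap, we will downsample these datasets.
# NOTE: `pheno` (per-cell metadata with a Celltype column) is assumed to have been created beforehand
subset <- sample(1:nrow(pheno), 2000)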
combined_logcounts <- combined_logcounts[,subset]
pheno <- pheno[subset,]
cell_type_labels <- cell_type_labels[subset]
dataset_labels <- dataset_labels[subset]
Now we are ready to run MetaNeighbor. First we will run the unsupervised version that will let us see
which cell-types are most similar across the two datasets.
unsup <- run_MetaNeighbor_US(var.genes, combined_logcounts, unique(pheno$Celltype), pheno)
heatmap(unsup)
8.8.6 mnnCorrect
mnnCorrect corrects datasets to facilitate joint analysis. In order to account for differences in composition
between two replicates or two different experiments, it first matches individual cells across experiments to
find the overlapping biological structure. Using that overlap it learns which dimensions of expression
correspond to the biological state and which dimensions correspond to batch/experiment effect; mnnCorrect
assumes these dimensions are orthogonal to each other in high dimensional expression space. Finally it
removes the batch/experiment effects from the entire expression matrix to return the corrected matrix.
To match individual cells to each other across datasets, mnnCorrect uses the cosine distance to avoid
library-size effects, then identifies mutual nearest neighbours (k determines the neighbourhood size) across
datasets. Only overlapping biological groups should have mutual nearest neighbours (see panel b below).
However, this assumes that k is set to approximately the size of the smallest biological group in the
datasets; a k that is too low will identify too few mutual nearest-neighbour pairs to get a good
estimate of the batch effect we want to remove.
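The mutual nearest neighbour search can be sketched in a few lines of base R. This is an illustration of the concept only, not the mnnCorrect code: the helper names `cosine_norm` and `find_mnn` and the two toy batches are hypothetical.

```r
# Scale every cell (column) to unit length so that library size drops out
cosine_norm <- function(x) sweep(x, 2, sqrt(colSums(x^2)), "/")

# Return the (batch1 cell, batch2 cell) pairs that are among each other's
# k nearest neighbours by cosine distance
find_mnn <- function(batch1, batch2, k = 1) {
  d <- 1 - t(cosine_norm(batch1)) %*% cosine_norm(batch2)  # n1 x n2 distances
  # TRUE where the batch2 cell is among the k nearest of the batch1 cell
  nn_1_to_2 <- t(apply(d, 1, function(row) rank(row, ties.method = "first") <= k))
  # TRUE where the batch1 cell is among the k nearest of the batch2 cell
  nn_2_to_1 <- apply(d, 2, function(col) rank(col, ties.method = "first") <= k)
  which(nn_1_to_2 & nn_2_to_1, arr.ind = TRUE)
}

# Toy batches over 3 genes: batch1 holds cell types A and B,
# batch2 holds A and C at roughly three times the sequencing depth
batch1 <- cbind(A1 = c(10, 1, 1), A2 = c(9, 2, 1), B1 = c(1, 10, 1))
batch2 <- cbind(A3 = c(30, 4, 5), C1 = c(3, 4, 30), C2 = c(4, 3, 28))

pairs <- find_mnn(batch1, batch2, k = 1)
print(pairs)
```

In this toy example only the shared cell type A yields a mutual pair; cells of types B and C each have a nearest neighbour in the other batch, but the relationship is not mutual.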
Learning the biological/technical effects is done with either singular value decomposition, similar to RUV
which we encountered in the batch-correction section, or with principal component analysis with the optimized
irlba package, which should be faster than SVD. The parameter svd.dim specifies how many dimensions
should be kept to summarize the biological structure of the data; we will set it to three as we found three
major groups using MetaNeighbor above. These estimates may be further adjusted by smoothing (sigma)
and/or variance adjustment (var.adj).
mnnCorrect also assumes you’ve already subset your expression matrices so that they contain identical
genes in the same order. Fortunately, we have already done this for our datasets when we set up our data
for MetaNeighbor.
Figure 8.23: mnnCorrect batch/dataset effect correction. From Haghverdi et al. 2017
require("scran")
First let’s check that we found a sufficient number of mnn pairs. mnnCorrect returns a list of data frames
with the mnn pairs for each dataset.
dim(corrected$pairs[[1]]) # muraro -> others
## [1] 0 3
dim(corrected$pairs[[2]]) # seger -> others
## [1] 2533 3
The first and second columns contain the cell column IDs and the third column contains a number
indicating which dataset/batch the column 2 cell belongs to. In our case, we are only comparing two
datasets so all the mnn pairs have been assigned to the second table and the third column contains only
ones.
head(corrected$pairs[[2]])
mnnCorrect found 2533 sets of mutual nearest-neighbours between n_unique_seger segerstolpe cells and
n_unique_muraro muraro cells. This should be a sufficient number of pairs but the low number of unique
cells in each dataset suggests we might not have captured the full biological signal in each dataset.
Exercise Which cell-types had mnns across these datasets? Should we increase/decrease k?
Answer
Now we could create a combined dataset to jointly analyse these data. However, the corrected data is no
longer counts and will usually contain negative expression values, so some analysis tools may no longer be
appropriate. For simplicity let’s just plot a joint t-SNE.
require("Rtsne")
The Seurat package contains another correction method for combining multiple datasets, called CCA.
However, unlike mnnCorrect it doesn’t correct the expression matrix itself directly. Instead Seurat finds a
lower dimensional subspace for each dataset then corrects these subspaces. Also different from mnnCorrect,
Seurat only combines a single pair of datasets at a time.
Seurat uses gene-gene correlations to identify the biological structure in the dataset with a method called
canonical correlation analysis (CCA). Seurat learns the shared structure of the gene-gene correlations and
then evaluates how well each cell fits this structure. Cells which are much better described by a
dataset-specific dimensionality reduction method than by the shared correlation structure are assumed to
represent dataset-specific cell-types/states and are discarded before aligning the two datasets. Finally the
two datasets are aligned using ‘warping’ algorithms which normalize the low-dimensional representations of
each dataset in a way that is robust to differences in population density.
Note: because Seurat uses a lot of library space you will have to restart your R-session to load it, and
the plots/output won’t be automatically generated on this page.
Reload the data:
muraro <- readRDS("pancreas/muraro.rds")
segerstolpe <- readRDS("pancreas/segerstolpe.rds")
segerstolpe <- segerstolpe[,colData(segerstolpe)$cell_type1 != "unclassified"]
segerstolpe <- segerstolpe[,colData(segerstolpe)$cell_type1 != "not applicable"]
muraro <- muraro[,colData(muraro)$cell_type1 != "unclear"]
is.common <- rowData(muraro)$feature_symbol %in% rowData(segerstolpe)$feature_symbol
muraro <- muraro[is.common,]
segerstolpe <- segerstolpe[match(rowData(muraro)$feature_symbol, rowData(segerstolpe)$feature_symbol),]
Next we must normalize, scale and identify highly variable genes for each dataset:
muraro_seurat <- NormalizeData(object=muraro_seurat)
muraro_seurat <- ScaleData(object=muraro_seurat)
muraro_seurat <- FindVariableGenes(object=muraro_seurat, do.plot=TRUE)
Even though Seurat corrects for the relationship between dispersion and mean expression, it doesn’t use the
corrected value when ranking features. Compare the results of the commands below with the results in the
plots above:
head(muraro_seurat@hvg.info, 50)
head(seger_seurat@hvg.info, 50)
But we will follow their example and use the top 2000 most dispersed genes, without correcting
for mean expression, from each dataset anyway.
gene.use <- union(rownames(x = head(x = muraro_seurat@hvg.info, n = 2000)),
rownames(x = head(x = seger_seurat@hvg.info, n = 2000)))
Exercise Find the features we would use if we selected the top 2000 most dispersed after scaling by mean.
(Hint: consider the order function)
Answer
Now we will run CCA to find the shared correlation structure for these two datasets:
Note to speed up the calculations we will be using only the top 5 dimensions but ideally you would
consider many more and then select the top most informative ones using DimHeatmap.
merged_seurat <- RunCCA(object=muraro_seurat, object2=seger_seurat, genes.use=gene.use, add.cell.id1="m"
DimPlot(object = merged_seurat, reduction.use = "cca", group.by = "dataset", pt.size = 0.5) # Before cor
To identify dataset specific cell-types we compare how well cells are ‘explained’ by CCA vs dataset-specific
principal component analysis.
merged_seurat <- CalcVarExpRatio(object = merged_seurat, reduction.type = "pca", grouping.var = "dataset
merged.all <- merged_seurat
merged_seurat <- SubsetData(object=merged_seurat, subset.name="var.ratio.pca", accept.low = 0.5) # CCA >
merged.discard <- SubsetData(object=merged.all, subset.name="var.ratio.pca", accept.high = 0.5)
Here we can see that despite both datasets containing endothelial cells, almost all of them have been
discarded as “dataset-specific”. Now we can align the datasets:
merged_seurat <- AlignSubspace(object = merged_seurat, reduction.type = "cca", grouping.var = "dataset",
DimPlot(object = merged_seurat, reduction.use = "cca.aligned", group.by = "dataset", pt.size = 0.5) # Af
Exercise Compare the results if you use the features selected after scaling dispersions.
Answer
Advanced Exercise Use the clustering methods we previously covered on the combined datasets. Do you
identify any novel cell-types?
8.8.8 sessionInfo()
library(scfind)
library(SingleCellExperiment)
set.seed(1234567)
8.9.1 About
scfind is a tool that allows one to search single cell RNA-Seq collections (Atlas) using lists of genes,
e.g. searching for cells and cell-types where a specific set of genes are expressed. scfind is a Bioconductor
package. A cloud implementation of scfind with a large collection of datasets is available on our website.
8.9.2 Dataset
We will run scfind on the same human pancreas dataset as in the previous chapter. scfind also operates
on SingleCellExperiment class:
muraro <- readRDS("pancreas/muraro.rds")
Figure 8.29: scfind can be used to search large collection of scRNA-seq data by a list of gene IDs.
The gene index contains, for each gene, the indexes of the cells where it is expressed. This is similar to
sparsification of the expression matrix. In addition, the index is compressed in a way that allows it to be
accessed very quickly. We estimated that one can achieve a 5-10 fold compression factor with this method.
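The structure of such a gene index can be illustrated with a tiny inverted index in base R. This sketch leaves out the compression step entirely and is not how scfind stores its index; the toy matrix `expr` and the object name `gene_index` are hypothetical.

```r
# Toy expression matrix: 3 genes x 4 cells (hypothetical counts)
expr <- matrix(
  c(5, 0, 0, 2,
    0, 0, 3, 0,
    1, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GCG", "INS", "SST"), paste0("cell", 1:4))
)

# Inverted index: for each gene, the indexes of the cells where it is expressed
gene_index <- lapply(rownames(expr), function(g) unname(which(expr[g, ] > 0)))
names(gene_index) <- rownames(expr)

# Cells in which every gene of a query list is expressed
query <- c("GCG", "SST")
co_expressing <- Reduce(intersect, gene_index[query])

print(gene_index)
print(co_expressing)
```

Answering a query then only requires intersecting a few short integer vectors rather than scanning the full expression matrix.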
By default the cell_type1 column of the colData slot of the SingleCellExperiment object is used to
define cell types, however it can also be defined manually using the cell_type_column argument of the
buildCellTypeIndex function (check ?buildCellTypeIndex).
Now let’s define lists of cell type specific marker genes. We will use the marker genes identified in the
original publication, namely in Figure 1:
# these genes are taken from fig. 1
muraro_alpha <- c("GCG", "LOXL4", "PLCE1", "IRX2", "GC", "KLHL41",
"CRYBA2", "TTR", "TM4SF4", "RGS4")
muraro_beta <- c("INS", "IAPP", "MAFA", "NPTX2", "DLK1", "ADCYAP1",
"PFKFB2", "PDX1", "TGFBR3", "SYT13")
muraro_gamma <- c("PPY", "SERTM1", "CARTPT", "SLITRK6", "ETV1",
"THSD7A", "AQP3", "ENTPD2", "PTGFR", "CHN2")
muraro_delta <- c("SST", "PRG4", "LEPR", "RBP4", "BCHE", "HHEX",
"FRZB", "PCSK1", "RGS2", "GABRG2")
The findCell function returns a list of p-values corresponding to all cell types in a given dataset. It also
outputs a list of cells in which genes from the given gene list are co-expressed. We will run it on all lists of
marker genes defined above:
res <- findCell(cellIndex, muraro_alpha)
barplot(-log10(res$p_values), ylab = "-log10(pval)", las = 2)
[Figure: barplot of -log10(pval) per cell type for the alpha gene list]
head(res$common_exprs_cells)
## cell_id cell_type
## 1 1 alpha
## 2 3 alpha
## 3 7 alpha
## 4 9 alpha
## 5 15 alpha
## 6 20 alpha
Exercise 1
Perform a search by beta, delta and gamma gene lists and explore the results.
[Figure: barplot of -log10(pval) per cell type for the beta gene list]
[Output: the first six matching cells have cell_id 71, 72, 92, 96, 98 and 102; five are annotated as beta and one as alpha]
[Figure: barplot of -log10(pval) per cell type for the delta gene list]
## cell_id cell_type
## 1 40 delta
## 2 212 delta
## 3 225 delta
## 4 253 delta
## 5 330 delta
## 6 400 delta
[Figure: barplot of -log10(pval) per cell type for the gamma gene list]
## cell_id cell_type
## 1 53 alpha
## 2 102 beta
## 3 255 gamma
## 4 305 gamma
## 5 525 gamma
## 6 662 gamma
Exercise 2
Load the segerstolpe dataset and search it using the alpha, beta, delta and gamma gene lists identified in
the muraro dataset.
[Figure: barplot of -log10(pval) per cell type for the alpha gene list]
## cell_id cell_type
## 1 18 alpha
## 2 20 alpha
## 3 24 alpha
## 4 32 alpha
## 5 43 alpha
## 6 48 alpha
[Figure: barplot of -log10(pval) per cell type for the beta gene list]
## cell_id cell_type
## 1 15 co-expression
## 2 58 beta
## 3 300 beta
## 4 390 co-expression
## 5 504 co-expression
## 6 506 beta
[Figure: barplot of -log10(pval) per cell type for the delta gene list]
## cell_id cell_type
## 1 170 delta
## 2 715 delta
## 3 1039 co-expression
## 4 1133 delta
## 5 1719 delta
## 6 1721 delta
[Figure: barplot of -log10(pval) per cell type for the gamma gene list]
## cell_id cell_type
## 1 47 gamma
## 2 458 gamma
## 3 476 gamma
## 4 600 gamma
## 5 606 gamma
## 6 622 gamma
8.9.6 sessionInfo()
sessionInfo()
Seurat
Seurat was originally developed as a clustering tool for scRNA-seq data; however, in the last few years the
focus of the package has become less specific and at the moment Seurat is a popular R package that can
perform QC, analysis, and exploration of scRNA-seq data, i.e. many of the tasks covered in this course.
Although the authors provide several tutorials, here we provide a brief overview by following an example
created by the authors of Seurat (2,800 Peripheral Blood Mononuclear Cells). We mostly use default
values in various function calls; for more details please consult the documentation and the authors. For
the purposes of this course we will use a small Deng dataset described in the previous chapters:
deng <- readRDS("deng/deng-reads.rds")
Note Thanks to the community detection approach used in Seurat clustering, it allows one to work on
datasets containing up to 10^5 cells. We recommend using Seurat for datasets with more than 5000 cells.
For smaller datasets a good alternative is SC3.
9.1 Seurat object class
Seurat does not integrate the SingleCellExperiment Bioconductor class described above, but instead
introduces its own object class - seurat. All calculations in this chapter are performed on an object of this
class. To begin the analysis we first need to initialize the object with the raw (non-normalized) data. We
will keep all genes expressed in >= 3 cells and all cells with at least 200 detected genes:
library(SingleCellExperiment)
library(Seurat)
library(mclust)
library(dplyr)
seuset <- CreateSeuratObject(
raw.data = counts(deng),
min.cells = 3,
min.genes = 200
)
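The effect of the two thresholds can be sketched on a small matrix in base R. This is only an illustration of the filtering idea with toy thresholds of 2; the order of operations and the exact inequality used inside CreateSeuratObject may differ, and the names `counts_toy`, `min_cells` and `min_genes` are hypothetical.

```r
# Toy count matrix: 4 genes x 5 cells (hypothetical values)
counts_toy <- matrix(
  c(1, 2, 0, 3, 0,
    0, 0, 0, 1, 0,
    4, 1, 1, 2, 0,
    0, 3, 0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("g", 1:4), paste0("c", 1:5))
)

min_cells <- 2  # keep genes detected in at least this many cells
min_genes <- 2  # keep cells with at least this many detected genes

# Filter genes by the number of cells in which they are detected
keep_genes <- rowSums(counts_toy > 0) >= min_cells
filtered <- counts_toy[keep_genes, , drop = FALSE]

# Filter cells by the number of remaining genes they express
keep_cells <- colSums(filtered > 0) >= min_genes
filtered <- filtered[, keep_cells, drop = FALSE]

print(filtered)
```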
9.2 Expression QC
Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. We can
visualize gene and molecule counts and plot their relationship:
VlnPlot(
object = seuset,
features.plot = c("nGene", "nUMI"),
nCol = 2
)
[Figure: violin plots of nGene and nUMI]
GenePlot(
object = seuset,
gene1 = "nUMI",
gene2 = "nGene"
)
[Figure: scatter plot of nGene against nUMI; plot title: 0.58]
Now we will exclude cells with a clear outlier number of read counts:
seuset <- FilterCells(
object = seuset,
subset.names = c("nUMI"),
high.thresholds = c(2e7)
)
9.3 Normalization
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we
employ a global-scaling normalization method LogNormalize that normalizes the gene expression
measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and
log-transforms the result:
seuset <- NormalizeData(
object = seuset,
normalization.method = "LogNormalize",
scale.factor = 10000
)
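The arithmetic behind this kind of global-scaling normalization can be written out in base R. The sketch below assumes the log transform is a natural-log `log1p`; the function name `log_normalize` and the toy counts are hypothetical and this is not Seurat's own code.

```r
# Global-scaling normalization: divide each cell by its total counts,
# multiply by a scale factor, then apply log(1 + x)
log_normalize <- function(counts, scale_factor = 10000) {
  log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
}

# Two toy cells with the same composition; cell2 is sequenced twice as deeply
toy_counts <- matrix(
  c(10, 30, 60,
    20, 60, 120),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), c("cell1", "cell2"))
)

norm <- log_normalize(toy_counts)
print(norm)
```

Because both cells have the same relative composition, their normalized profiles are identical despite the two-fold difference in depth.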
9.4 Highly variable genes
Seurat calculates highly variable genes and focuses on these for downstream analysis. FindVariableGenes
calculates the average expression and dispersion for each gene, places these genes into bins, and then
calculates a z-score for dispersion within each bin. This helps control for the relationship between
variability and average expression:
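The binning and z-scoring step can be sketched on simulated data. This is an illustration of the idea only: the simulated `avg_expr` and `dispersion` values, the use of 10 equal-width bins and the variable names are assumptions, and FindVariableGenes may compute its statistics differently.

```r
set.seed(42)
n_genes <- 200

# Simulated genes whose dispersion increases with average expression
avg_expr <- runif(n_genes, min = 0, max = 5)
dispersion <- 0.5 * avg_expr + rnorm(n_genes, sd = 0.3)

# Place genes into equal-width bins of average expression
bins <- cut(avg_expr, breaks = 10)

# z-score the dispersion within each bin
z <- ave(dispersion, bins, FUN = function(d) (d - mean(d)) / sd(d))

# The raw dispersion tracks average expression closely; the z-score much less so
print(c(raw = cor(avg_expr, dispersion), scaled = cor(avg_expr, z)))

# Genes ranked by scaled dispersion
top_genes <- head(order(z, decreasing = TRUE), 10)
```

Ranking by `z` rather than by raw dispersion avoids selecting genes simply because they are highly expressed.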
[Figure: dispersion plot from FindVariableGenes; y-axis: Dispersion; individual genes are labelled by name]
Epb4.2
1600016N20Rik
Tshz3
Gm813
B230206F22Rik
Cacna2d3
Dfna5
Tmem176b
Map2k5
CcinKlf7
Tigd2
C86695
Kctd21
Atp2c2Lrg1
4930506M07Rik
Zglp1
Gm3414
Camk1g
Mbnl1
Lama4
2810417H13Rik
HpgdsGm428
Ogn
Fxyd5
2810030E01Rik
Cndp1
Mfsd6
Pde1b
1700013H16Rik
Akap5 Fbxo43
Stard13
Plcg2
Plac8
Ccno
Fbxw18
Sytl4
St6gal1
Iqgap2
Oog2
Abca1
Gm8801
Il33
Zfp493
Zfp174
Isoc2b
AI646023
Zfp697 Phf15
Slc30a3
Wfs1
Tbrg3
Kif13a
Agrn
Foxred2Tfpi
Kif21b
Tdpoz1
Fbxw16Egfr
Trp53i11
Zbtb16
Ttc30b
Ugt2b36
Srgn
1110028C15Rik
Lcorl
D11Wsu47e
Pygm
Phlpp1
Zscan22
Ehd3
Dusp22 Gdi1
Armcx5
Arg2
Strn
Fam116a
Tsga14
Tbc1d22a
Aldh1l1
Dsc2
Plk2
Kif17
Arhgap28
Svip
Lpar6
Slc43a1
Vsig10
Rlbp1
Tbpl2 Lama1
Abcd4
Rwdd2b
Ston2
Lpin2
Pcdh9
Itpr2
Klhl28
Ehbp1
Bmp2k
Plxnc1 Pak1
Yy2
Tcn2
Zfp942
Ica1
Tep1
Lgals9
Arsk
Dennd1b
Umodl1 Nid2
Tle6 Mbnl3
Slc37a2
Zfp934
Ccdc163
Cdc37l1
G2e3Ddit4l
Rab17
Fxyd6
Txndc16
Wdr20a
Hand1 Wdr76
Utp14b
Ficd
Abca3
Sap30 Gga3 Anxa6
Foxo1
Golga5Ttn
Taf3Ipo8
Neu1
Pros1 Pecam1
Mdm4 Gjb5
Zfp622
Slc5a11
Apc
Mfge8 Cldn7 Pabpn1
Slc20a1
GlrxAnxa7
Myh9Uap1 Sp110
Tbc1d15
Tceb3
Gpbp1 Spic
Ptgr1 Bhmt
Chka Trim43a
Slc18a1
Lrrc19
Eml1
Speer5−ps1
Phka1
Efemp1
Gpr56
Il31ra
Usp−ps
Btn2a2
Itga8
Rasl12
Fam111a
Gm14023 Sept6
AI414108
Ggh
Pdia2 Tdrd1
Afap1l2
Zdhhc2
Yod1
Zfp90
Plagl1
Zfp760
Arhgap31
Plekhg6
Calcrl
Nr0b1
AA415398
Gpr50 Pld2
Gm9125
Rhbdd1
Zfp750
Zscan4d
Epm2aZfp39
Ap1g2
Zscan4c
Slc5a8
Cd81
Adcyap1r1
Zfp182 Afap1
Fzd7
Pdk1 Usp2
D1Ertd622e
Pou4f1
Nfil3
Agtpbp1
Zfp957
Nudt7Fyn
Wnt3a
Ptprr
Zc3h7b
Ampd2
Trpm6
Slc4a8
Vmac
Serpinb6b
Tktl2
Sh3bgrl
Smc1b
F2r
A730082K24Rik
Reg3b
Prap1
Wdpcp Bmp4
Taf9b
Slc22a23
Gna14
Ninj2
Hapln4 Icosl
Mtm1
Rundc3b Kit
Zbtb40
Bnc2
Erbb2
Slc38a1
Maged1
Blnk
Plekhb2
Rapgef5
Phf1
Itpr1
3110003A17Rik
Zfp62
Tcf7Mro
Mme
Epha3
B3gnt1
Zmym6
Tceal8
Abcg1
Zfp420
Gdpd5
Ephb2
Cerk
Ccr9
Pwwp2b
5330426P16Rik Cd38
Sugp2
Mycbpap Laptm4b
Rimkla
Slc30a1
Snx14
Farp1Stk31
Msh3
Eml4
Tmx1
Leprot
Neat1
Tmem229b
Ncoa2 Asxl1
Sh3bgrl2
Camk2g
Gpr161 Ly6a
Fmnl2
Slc24a5
Col4a1
Gcdh
Ooep
Slc16a6
Epha4
Rnf185Crip1Pard3
Fam178a
Mll2
Tuft1 Trap1a
Baz2b Topors
Nid1
Parp8
Sox15
Tmem62
Chek1 Pan3
Nup214
Rbbp6
Map1lc3aId3
Ei24 Grn
2410004A20Rik
Iqgap1 Pttg1ip
Tubb5 Mier3 Psat1 Ctsl Gja1
GcshSdc4
Nab2
Ppp1r3f
Foxp1
Slamf9
BC030476
Efcab7
2010002M12Rik
Ccdc30
Rab6b
Gk2
Afap1l1
Ufsp1
Fam63b
Ccdc68
Slc7a3
9030224M15Rik
Mafb
Tspan7
Ptgs1
Gpr85
Zfp712
6720456H20Rik
Heatr8
Map3k15
Msln
Fgd3
Papss2
Ppapdc2
Vamp2
Zdhhc15
H2−M3
Emb
Fsd2
Gabra1
Slco3a1
H2−Ke6
Muc13
Sema3e
Prkch
Gm1564
St18
Rtn2
Cdx1
Abhd10
Fam49a
4930503L19Rik
Slc12a8
Arhgap22
Kcnrg
Elmo1Crygc
1700010I14Rik
1700019M22Rik
Acsm2
Rnase4
Ankrd43Wdr69
6230427J02Rik
Chi3l1
Nap1l3 Sh3tc2
Pfkfb4
Dnahc9
Mslnl
4933411G11Rik
Ap3b2Tpk1
Tnfaip8l2
Gm10696
Gnrh1
1600025M17Rik
9430083A17Rik
Cyp2b23
Slc16a9
Mmp12
Prkd2
Glt28d2
Flt3 Cables2
Gng3
Phka2Zc3h6
Lrrc8e
Atp10b
Xk
Arrdc3
Pip5k1b
Zpbp2
Zfp185
Zfp940
Klhl30
Aoah Slc7a11
Aldh3b2
Zscan4f Miox
1700021K19Rik
6530402F18Rik Mfsd2b
Fam125b
Lmx1a
Atg9b
0610010F05Rik
Gm17821
Smyd1
Cyp4f39
Zfp658
0610009B14Rik
Pla2g2f
LOC100503167
9030617O03Rik
Sybu
Gbp4
Ltbp2
Cdh20
Mbd2
A830007P12Rik
Kdelc1
Pla2g4b
Ttll9
Nhedc2
Dync1i1
Wfdc3
Lrrc8c
Eml6
Zfp189
Arhgap15
Pabpn1l Ccdc15
Sirpa
C630043F03Rik
Rcbtb2
Pdk2
Ifngr1
Epb4.1l4a
Mfsd7a
Tmem161a
Pkp3
Snx24Egf
Ptprc
C630004H02Rik
Fam83b
Zbtb7c
Gm4956
Rnf219Rgl2
Ccl9
Tmem139
Rab15
2610034M16Rik
Zar1
Naif1
Kbtbd13
Saa3
Bcas1
Clu
2310044G17Rik
P2rx5
Nanos2
Dhrs3
Gm5868Oas1f
Slc25a35
Krba1
Adamtsl1
Tesk1
Xpnpep2
Phf19
Prom2
Slc25a24
Fam154b
Ptpn14
Mamdc2
H2−T24
Frmd5
Zfp941
Apbb3
Gapt
Gm12794
Chst7
Tiam2
Tmem106b
Specc1
Zfp248
Ank2
Sh3kbp1
Tifab
BC080695
Zfp938
Zscan12
Spata22
Plp1
1300002E11RikFam160a1
Dnlz
Rbl2
Nsun3
Mfsd9
Hoxb9
Cd109
Tmprss11a Hif3a
Zfp192
Gstt1
Ccdc155
Ampd3
Kcnv2
Colec12
Mst1
RhdNlrp2
Arhgef4
Proca1
4930444A02Rik
Nefl Usp31
Snhg11
Ddit4
Arl4d
Zfp53
Necab2
Ceacam20
Ikzf1
Slc16a5
Tmem177
Insm1Irf6
Otx2
Ephb6
Insr
Procr
Ptpre
Tex11
Tgfbr3
Smad1
Stxbp6
A730011L01Rik
Slc25a37
Nnmt
Zfp353
Ripk2
Aox4
Galnt7
Gpr125
Sepn1
Coq10a
Zfp418 Sil1
Reep5
Mbd1
Cd164l2
Dhrs4
Slc25a48
AmphRarb
X99384
Upk3a Ercc5
Prickle1
Mtap1b
Rcn1
Mettl21a
Lman2l
Npc1l1
Bcar3
E130311K13Rik
Gnb5
Ccnjl
Mocos
Ucp2
Hspa13
Ctnna3
Ebi3
Magea10 Dse
Plb1
Pfkfb1
Acsm3
Tshz1
Klhl26
Triml1
Myo1a
9130024F11Rik
Serpinb1c
Cald1
Apcdd1
Cd44
Col10a1
Serpina3m Wnt7b
Rb1
Asb3
Unc5c
Mcf2
Fam120b
Fgfrl1
Garnl3
Pla2g4f
Snx7
Abo Pars2
Sycp2
Zfp108
Lrrc33
Gt(ROSA)26Sor
Sh2d4a
Slc2a12 Heatr5b
Srl
Zdhhc17
Wdr13
A130022J15Rik
Sohlh2
Col14a1 Pus7l
Sergef
Lepre1
B630005N14Rik
BC017158
Prl3b1
Tmem173
Cpne4
Gcm1 Tubgcp4
Prr23a Usp30
E2f2
Calr4
Ern1
Cbx2
Bhlhe40
Reps2
4933406J08Rik
Abcb1a
Rph3a
Cdo1
Manba
Cep97 Casp2
Iqca
Slc37a4
Zfp719Aplf
Zfp791
Anpep
Zan
Alg6
Acp2
Cpeb1
Rassf5
Zdhhc24
Traf6Bnc1
Fbxw4
Pcyox1
Zfp735
Gm16039
9130008F23Rik
Cabp5
Zfp266
Dcst1Ttc8
Zfp157
Dtx3l
Lphn2 Stim1
Rdh11
Chdh
Serpinb1a
Usp11
5730455P16Rik
Dpp4 Wdr72
Chn1
Rft1
Dok2
Taok3
Maoa
2310046K01Rik
Hmgcs2 Faah
Capn2
Acsl6
Extl3
Itfg1
Suv420h2
Coro2a
Zfp616
Plxna3
Timp3 Gda
AU021092
Nhsl1
Flrt3
Plxna4Pqlc3
Zfp703
Meis1
Hnf4a
Sh2d3c
Fdxr
Ccm2
Cdk20
Hgsnat
Fa2h
Lmo1
Gm12789Ssbp2
Scml2
Prss35
3110057O12RikAspa
Dapk1
C2cd2
Lad1
Casr
2310022B05Rik
Dock8
Hyal1
AW554918
Cdh2
Lipe
Ccdc81
Tmem184b
Nat1 Frk
Recql
Tob1
Mesp2
Chpf
Bmp5
Tchh Hal
Senp8
Ninj1
Cd82
Dennd2c
Rtkn
Mtmr1Tekt2
Pdzd2
Stambp
Slc9a6
Atp12a
Tub Lif
Bambi
Fam135a
Hbegf
Lipo1
Ckap4
Rnf32 Sp6
Zfp566
Lonrf3
Kif5c
Pon2 T
Tnfaip8
Top3b
Tex15
Nupr1
Eno3
Dusp4
C1galt1
Nod1 Fgfbp1
Rbm44
Enpp3
Slc6a9
Prkaa2
Tpcn2
Nell1
Galns
Lrrc15
Angel1
Zfp111
Myl9
RelTmem37
Slc38a2
Gm7534
Myo1g Mcc
Prex1 Fbxw7
Cilp
BC017643
Arg1 Nr4a1
Mycn
S100a14
Gstm1
Tmem18
Zfp667
Antxr2
1700001L05Rik
Fam198b
Tmem47
Sphk1 Eefsec
Rwdd2a
Gas2
Gm16516
Cnksr1Fgf1 Pja2
Mapk8ip3
Gpr19
Telo2
Cblb
Tecpr2
Bnip2
Oasl2
Spon1
Tpst1
Isx Elmo2 Arntl
Man2a1
Lrrc8d
Tmem129
Trappc2
Tdp1
Tmem171
Timp1
Plac1
Fank1
Alg2
Ano9
Pou2f1
Ctns
Hipk2
Ptk6
Reep2
Qsox1
Abhd14a
Atp4b
Arid5a
Fbxl17
Lysmd3
Hsd3b7
Srxn1
Gpa33
B4galt4
Agbl2
Trim62
Slc6a13
Ttll13
op2b
4933411K20Rik
Fam35a
Tmem45b
Usp46
Nrp1
Rassf4
Nbeal2 Atp9b
Asph
Dcaf12l1EdaStk3
4930432K21Rik
Snx8
Dis3l
Fchsd2
Eps8
Shroom2
Myo1e
Aim1
Jag1
Irf1
Gulp1
Gripap1
Camk1 Ormdl1
Slc29a2
Aldh18a1
Trps1
Syt9
Lpin3
Casp7
Pak6
Qser1
Lrrc28
Anxa3
Clmn
Lrrc2
Hs3st3b1
Ccdc88a
Appl2
Herpud1
Slc19a1Il2rg
Abhd14b
Vgll3
Als2cr4 Cyp51
Grb10
Galm
P2rx3
Mylip
Clcn5
Slc35b3
Trpc6
Lmf1
Pigg
S100a16
Sult6b1
Etohd2 Zfp715
Trdmt1
Zfp870
Aldoc
BC051019
Mmp15
Nadsyn1 Cyld
Ccdc138
Atp2a3
Far2
A230073K19Rik Zfp26
Cdk14
Wdr44
Mgat3
Lyn
Cep135 Frmd4b
Glipr1
Rp2h
Obfc1
Fgf4
Morc1
Cast
Tanc2
Anxa4
Dusp1
Trp53bp2
Car4
Gjb4
Cpt2
Slc13a5
Spz1
2310008H04Rik Hps3
Rfx4
Pdzk1ip1
Rap2c
Spire1
Extl1
Tank
Zfp947
Mphosph9
Zdhhc21Pif1
Pfkp
Msc
Atxn1
Cebpa
Rapgef1
Gpr171
Mbnl2
Pde3b
Prom1 Usp40
Ildr1
Prkag2
Cep110
Adat1
Zfp507 Pvrl3
Dennd4a
Slc19a2
Dpep2
Ttc26Tuba3a
Mtrf1
Pip4k2a
Arrb1
Ryr2
Gspt2
Tbc1d24 Llgl2
Dnalc4
1110003E01Rik
Plat
Rin3
Sclt1
Traf3ip1
Plag1
Dok1
Hnmt Mllt1
Hyal2
P2rx4
Pias4
Mettl15 Acpp
Sap130
Fgf10
Por Igsf3
Pml
Cwf19l1
Mtap
Zfp35
Pck2
Rassf1
Gyk C77370
Ppp1r3b
Tmem56
Fam69aPoglut1
Golm1
Fam117b
Dsg2
Pgm2
Pdzk1
Zfp51
S100a10
Slc25a15
Pdcd4
Tpp1
Slc39a6
Phospho2
Zcchc2
Tmco3
Mtus2
Rhbdl2
Dnajc22
Antxr1
D330012F22Rik Rint1
Hsd17b6
Me2
Galnt2
Coro7
Ppic
Apbb2
Nr1h3
Cyr61
8430410K20Rik
Rgs19 Dgat2
Nsmaf
Card6
Alg10b
Lpl
Cenpb Cd24a
C030046E11Rik Lifr
Spice1
Elf4
Stk30PigaAgpat4
Zfp472
Trim23
Mbp
Scai
Arhgap17
Dock1
Slc4a7
Nsun6
Stard4
Dzip3
Pla2g4e
Diap2
Tmem106a
LipgBaat1
Cdk5rap2
Rps6ka6
3830406C13Rik
Stard5
Sh3bp5
BC017612
Aldh1b1 Lct
Mks1
Sec16b
Meg3
Mapkapk2
Mblac2
Sh3gl3
Tgfb3 Cant1
Trappc9
Gpr160
Slc33a1
Fam59a
Wrn
Pigz
Parva
Elf5Asz1 Cthrc1
Brf1
Mapk3
Fech
Tmem164
Ptpn9
Rab8b
Akap2Harbi1
Idua
Gsta1
Atp8a1
Kdelc2 Pltp
A4galt
Atp11a
Smyd4
Tbx15
Mn1 Dcbld2
Cpped1Rbl1
Endod1
Agl Esyt1
Tnrc6b
Fgd6
Zfp60 Ccdc50
Tmco4
Tmem43
Fuca2 Tubb6
Socs7
Tnfrsf9
Cr1l Flcn
Rnf122 Lamc1
Tsc22d2
Slc25a25
Cgnl1
Smarcal1
Cryab
Gata4
Reep1
Foxr1
Foxm1
Ptprf
1600029D21Rik
Gm16515
Psrc1
Stau1
Smurf2
Lima1
Fat1
Sgpl1
Slc39a10
Errfi1
Fbxo3
Retsat
Klf6
Dcaf8
Pias1
Oat
Naga
Tmem184c
Fam193a
Rbfox2
Slc2a8
Cep55 Plbd1
Vcl
Csrnp2
Lta4h
Ell2
Slc16a1
Dsp
Gde1
Wasl
Ubp1
Mgst3
Jhdm1d
Ythdf3
Naa30
Zfp592
Zfp217
Cpn1
Kcnk5
Zfp445
Nedd4
NfrkbB2m
Osbpl8
Atf2
Csf3r
Txndc12
Ralbp1
Crk
Socs3
Gss
Arfgef1
Ugp2
Lmo7 Gprc5a
Esco1
Alkbh5
Tjp2 Kpna3
Gm13154
Atp6ap2
Lamp2
Sirt1
Lsm14a
Acadvl
Aco1
Slc12a2
Slc4a11
Cldn6
Scd1 Mdh1
Acaa2
Gsta4
Idh1
Ccdc117
Atp6ap1
Usp38
Clp1
Nin
Usp34 Clic4
Ankrd10
Agpat9
Pdp2
Stag2
Cdh1
Slc34a2
Slc39a4Cited1
Wapal
Crnkl1
Fdft1 Ctsc
Dstn
Rb1cc1
Trim43c
Fbxl20
Impdh2
Clpx
Idh3a
Hsp90b1
Arpc5Wdr1
Pank3
Myl12b
2610019F03Rik
Ypel2
GldcOdc1Ppm1a Pdxk Sat1Aldoa
Lrpap1 Isyna1
1700052N19Rik
Fahd2a
Ier5
Tspan10
Slc4a5 Ier3
Dip2a Hmox1
Cpsf4l
9330020H09Rik
Anks4b
Fam13c Crtc3
Dapp1
Krt84
Irak4
Zfp867
Enho Zfp236
Fgd4
Dpp8
Ptprk
Rnf169
Tcf23
Klhl13
Itga9
Naprt1 Serinc2
Cpe
Slc38a6
Nuak1
Fam83g
Dhx34
Ophn1 Cbr2
Aim1l
Ipp
Zbtb1
Slc35a1
Elmo3
Apbb1
Zfp708
Sirt4
Prkd1 Fbln1
Tk2
Midn
Ptgfrn
Tcp11l2
Sdcbp2 Acss1
Zfp746
Frat1
Zrsr1
Grhl1
Mab21l3
Capn5
Mxd1Narg2
Parp6
Arl6ip5 Arhgap12
Fam107b Cryz
Cdkn1a
Lhfpl2
Ifitm2
Dcaf5 Crot Tmem231
Rab14Acsl4
Smpd1
0.4
Eno4
Npy
Cercam
Unc80
Dusp27
Fam110c
Neurl1bMtmr9
Tshz2
6030446N20Rik
Tbc1d10c
Gal3st1
Myom3Trim3
Cxcl16
Fam189a2
2810422O20Rik
Dgkb
Mtcp1
Sec31b
Nostrin
Padi4Dock4
Zfp780b
Stim2
Eml2
1700080O16Rik
Prune2
Zfp13
C430049B03Rik
Pdlim4
Kcnu1
Olfm2
Mia2
Mtap6
Efemp2 Rec8
Ccnd2
Fmnl1
Icam2
9430020K01Rik
Gstcd
Plekha2
Akna
1700049G17Rik
Dusp9
B3galnt1
A730017L22RikFam73b
Gpr126
Capn11
Nwd1
Olfr976
Sema3b
Slc2a9
Zfp12
Fbxo41
Napsa
Sox6 Patl2
Pth1r
Nudt12
Lhpp
Carns1
Aifm2
Necab3
Tmem81
Tmem125
Sepp1Ablim2
Amigo3
Gstm7
C330024D21Rik
Zcchc24
Evi2bNfe2
Zfp85−rs1
Zfp119bF3
Ddx60
Whrn
Fbxo25
2510009E07Rik
Thsd7a
2700081O15Rik
Gpr20 Limch1
Nek4
Prrg4
Nlgn1
BC061212
Slc11a1
Lamc2
2810001G20Rik
Aass
Nr2f2
AI427809
Anxa5
E330016A19Rik
Arrdc4
ManealWwox
Rnf207
Lhfpl5
Klk7
Il4ra
Dopey2
Etnk2
Aldh1l2
Itga11
Dpysl2
Armc9
Entpd5
Tomm40l
Ppp1r9a
Gpatch3
Hcfc2
Psg17
Accn3
Slc22a17
Mdfic
Zfp59
Aloxe3
Steap1
Mgmt
5330411J11Rik
AI854703
Zfp709
Eno2
S1pr1
4930529M08Rik
1700001L19Rik
Chac2
Tmem149
Fkbp10Vti1a
Lrch1
Clvs2
Zfp563
2500004C02Rik
Ncs1
Cd74
Zfp810
Tmem40
Gm12824
2210013O21Rik
Arhgap23
D130043K22Rik
Casq2 Pcx
Rasd1
Stxbp1
Efnb1Sh3yl1
Ffar3
4930452B06Rik
2010300C02Rik
Cyp4f13 Fpgt
Prr18
Eda2r
Pcdh15
Dock3
2310047M10Rik
Zfp454
Dgkq
E330033B04Rik
Adck3
Tlr2
Adhfe1
Zeb1
Cd209eCgrrf1
Pik3ip1 Pvr
Acsl5
9130011E15Rik
Nrp2
Gm13083
Tmem54Mapk1ip1
Efna3
Chn2
AK129341
Spata1
Cmtm8Sertad3
Ftsjd1 Fhl2
Smad3
Zkscan5
Ccnyl1
Acsl1
Magi3
Gcnt2
Cyb5d2
Abcg2
Slc41a1
Rps6kc1
Ndrg2
Pdgfrb
1700007K13Rik
1600014C10Rik
Bicc1
Slc9a5
Ipcef1
Map3k12 Phkb
Pnpla7
Inpp4a
Apoa1
Mbd5
Fam72a
Zfp846
2200002K05Rik
1500031L02Rik
Zscan18
Sec14l1
Diras2
Gpx3
Psd3
Snhg8
Mei1
Plekhg2
Fam134b Snai3
Cecr5 Uty
Cpm
Pnliprp2 Vps18
Ttc17
Sept12
Zfp738
Ngly1
Rfesd
Madd
Mib2
1810030O07Rik
Ttc7 Pias3
Zfp438
Mid1ip1 Psd4
Mgll
BC057079
Stx6
1110038D17Rik
Mtss1l
Pgpep1
Ccdc164
Mospd2
Fbxw26
Zscan4a Nav3
Crat
Vtcn1
Cyp20a1
Zfp933
Samhd1
Klhdc5
Gm5 Ldlr
Ccdc45
Pou2f3
Osbpl3 Zranb3
Msrb3
Ovca2
Mobkl2a
Snap91
Bmf Fbxw5
Pggt1b
Zfp866
Ccdc71
Zfp597
Unc93a
Mapre2
Atp10d
Ldoc1
Apom
Mest
Asap1
Inpp5e
Dak
Farp2
5430416N02Rik
Foxj2
Clstn3
Lrrc46
Fundc2
Cd28
Gm9766
Setmar Zfp386
AW146154
Ube1y1
Ifitm3
Klf15
HlfFgr
Snx25
Zfp334 Elmod3
Efna5
AI987944
D930016D06Rik
Dync2li1
Samd10
Dock2 Fos
E130303B06Rik
Ramp1 Ndst1
Nnt
Plagl2
Dnajc16
Bcl2l11
Rab43
App
Qpctl
Unc13b
Spopl
Fbxw13
Wbp5 Ptar1
Dab2ip
Lrig1
BC067068
Rufy2
Tirap
Ccdc73
Ssh3
Lmo2
Nab1
1300010F03Rik
2310079F23Rik
Tacr3
Muc1
Gfod2
Hook1
1700042G15Rik
Phf7
Pdha2Nanos1
Prokr2
2310046O06Rik
Gsta3
Spin2
PirGuf1
E2f3
Fes
Sh3glb2
Nup210
Stoml1
Nudt10
Edaradd
C1qtnf2
3110052M02Rik
Tmem88 Abcb5
Kdelr3
Pin1−ps1
Slc1a4Src
Hsf5 Iqcc
Rftn2 Fbxl12
Fam160a2
Adcy5
Pcmtd2 Adal
Abcc1
Tmem126a
Epm2aip1
Slc30a2
Park2
Trpv5
D930015E06Rik
Orai2
Fam110b
Pcbp4
Ldhd
Zfp512
Daglb
Cdc42ep2
4831426I19Rik
Ccng2
Zfp583
Mmp19Lace1
Fam164a
Dennd3
Ankrd13b
4833442J19Rik
Slc17a5
Slc25a43
Gm3143
Map3k9
Syce1
Zmat4
Mageh1Rxrg
Ccdc13
Slc46a3
Gramd1c
Cd99l2
Slc10a4Rsl1
Tnfaip6
3110043O21Rik
Cacnb4
Man2b2
Txndc2
Tmem116
BC021891
Car5b
Hdac9
Ppm1eLrp1
Gpam
4930558C23Rik
Vax2os2
Kif21a
Prkd3
1600021P15Rik
Chga
Rnf144b
D10Jhu81e
Tap1
1500002O20Rik
Tarbp1
Abca2
B230120H23Rik
Gm13109
Cacna1d
Agxt2l1
Hmgcll1
Arrdc5
Rpph1 Ptcd1
Adam15
Csrnp3 Lime1
Arhgap44
Tusc3 Gga1
Stat6
Sdc1
Akap8l
Srcrb4d
Mpzl1
Dcc
Gorab
Esyt3
Tctn3
Slc25a40
Prss32
Zbtb26
2310067B10Rik Ctu2
Mettl22
Mtap7d2
Trim26
Pigh Myo10
Rnf17
Slc35e3
ClcnkaNav1
Hcrt
Rnf31
Sertad2
Ppp1r14d
Vars2
Clec10a
Snx15
2310030G06Rik
Aldh3a1 Gtpbp5
Zfp101Dalrd3
Elk4
Ubxn2b
Ggct
Ccnj
Ext1
D16Ertd472e
4632427E13Rik
Dpep1
Stbd1
Amy1
Slc26a4
Tmem17 Jmjd4
Slc26a11
Dis3l2
Akap11 Jmjd7
Tmem101
E130309D14Rik
Plekha7
Abca5
4933407H18Rik
Coro2b
Ap3m2 Spag1
Rnf141 Hexim1
Casp9
Dync2h1
Ube2g2
Slc7a7
Lrdd
Hemk1
Slc16a13
Tmed3
Sorbs3
Dnmbp
Ccdc17
Stat1
Dusp3
Gm4745 Ptprg
Lmnb2
Gfra3
Shisa7
Dapk2
Filip1l
Nlrp4e
McamPxdn
P2rx7
Cxcr7 Sirt3
Zfp612
1300018I17Rik
Gm13119 Lrrc50
Fbxo31
Rpusd3
Ccdc94
Zfp758
Ccdc22
E2f7
Gpr3
Ankzf1
Sardh
Arfgap3
Ifitm1 Efhd1
Fry
Pacrgl
Bcdin3d
D5Ertd579e
Tex21 Gnai1
Zfp120
Rad51c
Mycl1
Sarm1
Gtpbp2
Zfp605
Gopc
Abcc4
Pax6
Mgat1
Tapbp
Fbxw21
Nkain1
Otub2
Pick1
Oscp1
Sh3bp1
St3gal5
Fez1
Acap3
Tctn2 Ado
Arl15
Mfsd8Alg12
Tab1
Fancc
Arhgef26
Cml2
Fbxo7Dnajc24
Tcf7l2
Rtn4ip1
Prdm14 Pla2g6
Vps13c
Vrk3
MlxiplHlcs
Dohh
Cdk18
Stx1a
Inpp5b
Impad1
Cd55
Qsox2
AW549877
Nmnat1
2310004I24Rik
Nek3
Ngfrap1 Tmeff1
FkrpIck
Kdm4b
Lama3
Gm12169
Zbtb33
Pgap1
2610301G19Rik
Zfp628
Camta2
Kdm5d
Cbln1 Tpra1
Pgbd5
Agap1
Dtwd1
Zfp212
Mast4Adat2
Ttll1
Ciz1
Zbtb2
Cnksr3
Klhl20
Ankra2
Rab30
D9Ertd402e
Apol7b
0610040J01Rik
Lcn10
Vat1l Fktn
Sez6l2
Thap6 Wdr6
Sbf1
Pgm2l1
Acss2
Foxh1
Slc29a3
Malt1
Zxdc
Actr5
Prr5
Xrcc4
Lrrn4 Ttll6
Zfp819
Zgpat
C1galt1c1
2900056M20Rik Zfp952 Hhex
Zfp955a
Slc9a8
Fam117a
Sbf2
Gpr172b
Armc7
Notch1
Atxn7l1
Samm50
Iqub
Tarsl2
Cd22 Itpr3
Triml2
Neo1
Gm12942
Ttll3
Zfp874a
Auh
Bscl2
2610002D18Rik
Kifap3
Tmco7
B4galt5
Fanca
Csad
Rnpepl1
Zfp68
Zfp446
Rbks Gsn
Prcp
C2
Mll5
Itm2b
Slc25a14
Ythdc2
Galk2
Htra1
Nhlrc2
Golph3l
Pik3r3Serhl
Flrt1
Il34 Mllt6
Plod3
Tcirg1
Kank1
Chst12
Rpgr
Pskh1
Dram1 Iffo2
Fgfr4
Acer1
Mib1
Chic1
Sdccag8 DmknRfx7
Baiap2l1
Cidea
Sema5a
Zfyve16
LckFoxa1
Chit1
Tcfap2a
Krt7
Mpdz
Lhx8
Celsr1 Pqlc2
Fhod1
Zfp617
Sft2d2
Tia1
Ppox
Susd2
Lysmd4
Hspb8
Nup62cl
B3galtl
C430048L16Rik
Ndst2
Kcnk2 Slc10a3
BC023829
Trh
Klhl25 Med12
BC013529
Akr1e1
Dvl1
Sepsecs
Ube2cbp
Tbc1d8
Tmcc3
Gsdmd
Rhod Cyth3
Tmub1 Mlh3
Ipo13
Alg9
Plscr1
Ube2h
Cep72
Hesx1 Col4a2
Pex13
Efna1
Tom1l1
Nfxl1
Nisch
Pdik1l
Cdc42bpaMpi
Fam13b
Nat6
Bicd2
Hook3
Car9
Ercc8
Stxbp3a
Gusb
Pls1
Wdr7
Pxt1
Zfp87
Elavl3
Rab3b Tulp3
Zfp36l1
Zp2
Mpg Nav2
Stag3
Ccdc42
Acp6 Mkl1
Slc35b4
Dusp16
Rhobtb2
MsraTtc19
Zfp459
Hyls1
Slc35f5
Cln6
Mpnd
Tcam1
Ciita
Aif1l
Sh3rf3
Zfp81
Ddx26b
Ccdc134
Rab13
Exph5
Kank3 Grk5Vps45
Pja1
Aste1Prkcc
Pex5
Socs2
Klraq1
Cep350
Mto1
Chd9
Palb2
Ppargc1b
Ikzf5
Ccdc32
Pfkfb2
Rbbp9
Rhbdd3
Entpd2
Opalin Gab2
Thap3
Ulk4
Tmcc1
Spata13
Nfkbia
Tspyl4
Gpx2
Lsm14b
Ctbs
Zfp2
Phldb2
Catsperg1
Osbpl5 Neb Vill
Ext2
Exoc4
Gtdc1
Acrbp
PrtgEnc1
Atp6v0a2
Vgll4
Tmem109
Hsdl1
Rufy1
Fbp1
Zfp874b
4930579G24Rik
Jam3
Phc3
Sesn3
Slc43a2
Arntl2
Tspan15
Camk1d Tbce
Btg2
Psmb9
Neto2
Tcta
FlnaDysf
Rabep2
Nupl2
Amacr
Cntln
NbasDdx28
Trip10
Zfp948
Tada2b Mogs
Zfp871
Zbtb41
Lats2
Reep4
Ccdc92
Fnip2
Pgam2 Tomm40
Adam19
Cln5
Bin3
Fanci
Ube2j1
Pask
Pex26
Lxn Fzd5
Gns
Tmem2
Chmp6
Fermt3 Ripk1
Txnl4b
Pxn
Mvk
Rnf128
Fancb
F2rl1
Phf20l1
Zfp160
Cep250
Ubox5
Rpp25Ttpal
Cog6
Btbd9Dtx2
Erbb3
Dgat1
Dmc1
Cst7
Slc16a10
Mixl1Mical1
Dmrtb1
Otud7b Spg7
Nhedc1
Smug1
Pacs1
Pacs2
Nrip1
Gmeb2
Gabrd Arap1
Kctd18
Mthfd1l
Htr5b Ctdsp1
Angptl4
Syngr3
Myo5a
Dnajc17
Tmem19
Zfp770
Pex6
Prss36
Nceh1 Ttf2
Lrp11
Acacb
Vps13a
Prmt10
2310057M21Rik
Plce1 Cnnm3
Atp7a
Zfp113
Erlec1
Zfp944
Anks3
Vps37b
Gpatch2
Tie1
Ctbp1 Dusp6
Sri
Ddhd1 Klf16
Dag1
Zfp367
9530008L14Rik
3110001I22Rik
Bcor Ly6e
Atp8b1 Mepce
Fam53c
Yes1
Thop1
Kdm2aCwc27
Prmt6
D030056L22Rik
Fbxw8 Ubtfl1Atr
Ceacam10
Ube2l6
Zc3h7a
Prodh Slc2a1
Gltpd1
Piwil2
Dnajb14
Fbxo28
BC017647
S100pbp
Slc28a3
Fdx1
Usp53
Pi4kb
Rhbdd2
Zfp748
Ggcx
Snrnp35
Rab11fip4
Fbxl16
Zfp943
Klhl18
Egfl7 E2f6
Chd2
Men1
Mdk
Itgae
Smoc1
Plaur P Lama5
Crbn
Nfe2l2
Efr3aGlg1
Fundc1
2310040G24Rik
Cdkal1
Zfp646 Itga3
Sall1Ocln
Ccbl2
H2−K1
Mtmr7
Mettl9Stac2
4930583H14Rik
Fbxo34
Nsdhl
de12
Xbp1 Taf5
Tmem9
Ddx3y
Mov10
Oxct1
Bhmt2
Stx7
Erp44
Serpinb6c Sfi1
Bag5
Swap70
Fxyd4
Abhd4 Fnbp4
Smagp
Zfp820
Slc38a4
Lpp Rbm27
BC016423
Nipal1
Tor1b
Plcd1
Cept1
Cpne3
Mpzl2
Add1 Gtf2a1
Hipk1
C130026I21Rik
Snupn
Cbll1
Serinc1
Fcgr3
Cdk12
Myc Tpi1
Raf1
Depdc7 Gulo
Ganab Guca1a
Dusp14
Havcr1
Mki67
Asah1
Zfand5Pdpn
Zfand2a
H2afy Msh6
Ass1
Sun1Cth
Igf2r
Plin2Ccna2
Cap1
Spns1
Ypel5 Tet1
Wdr45 Ptges
Wdr43Cul5Tmem41b
Pramel7
Pmm1 Snx2 Wtap
Mt1
Tmem252700023E23Rik
Gpc4
1500001M20Rik
Pde4d
Sema4g
Kctd4
Neil3
2310042E22RikCpne9
3110045C21Rik
Slc22a4
Rnf215
Gstm6
Fut10
Adamtsl4
Lass4
1700034H15Rik
Tmem86a
Zfhx2
Rph3al
Slc10a6
Nckipsd
Shf
Mtap7d1
Ncald
Hps1
Zfp52
Ccdc38
Rgl1
Tmem150c
Chrna10
Slc26a6Asb1
Lancl2
Lamb2
Itga1
2210009G21Rik
Chpf2
Traf3ip2
Pkd1l3
Scube2
Masp2
Bmpr1b
Tmtc1
Nrap
Bmp1
Fam83a
1700019G17Rik
Mov10l1
Gm889
Clcnkb
Chst11
Zfp109
Rin1
Car8
Ak8
Elk3
D330045A20Rik
Akr1c14
6330408A02Rik
Tm7sf2Tpo
Zcchc16
Tmem110
Dpysl4
Hdac5
Pfkm
B4galt7
Cdc14a
Ppfia4
Hydin
Gadd45g
Mum1l1
Thbs1
Dnajc6
Sgms2
Bbs1
Myo7a
Etv1
Cdkl4
Adarb1 Pcyt1b
Map3k6
Mobkl2b
Hus1b Tec
Otx1
Ly6f
Pfn4
Prps2
1110012J17Rik
Atp2b2
Pstpip2
Cpne5
Snph
Fam159b
Mmp11Chd3
2310030N02Rik
Tnfrsf22
Tspan17
Sec24d
Cyp4f16
Gadd45a
Phlda3
Ap4m1
Cct8l1
Cdon
Mapk11
Pcsk5
Acyp2 Tex9
Ssfa2
Kptn
Ubtd1
Ogdhl
Rnasel
Rhoh
Sept9
Gpr1
BC020402 Dus4l
Ankrd26
Tmem132c
Fzd2
Prox1
Gm10857
Crygd
C430002E04Rik
Zkscan3
Rin2
4930534B04Rik
Glt25d2
Tdrd7
BC026585
Cdca7l
Bex1
Ccdc104
2610021K21Rik
E330012B07Rik
Gm4981
Sec14l2
Serinc5
Ppp1r16b
Wbp2nl
Adora1Pdp1
Ms4a1
AI450353
Nrarp
Zfp422
7420461P10Rik
Adck2
Gramd1a
Upk3b Maml1
Fgd1
Sfxn5
Trim35
Dhrs11
Pacsin1Plk5
Creb3l2
4933437F05Rik
Htra4
Dmrt1
Slc25a41
Camta1
Tmem135
L1camCyb5rl
Pea15a
Zfp862
Dnahc7b
H2−Q10
Rab40b
Zfp119a
H6pd
Zfp677
Ralgps1
Lass3
Rimbp3 Nxt2
Ss18l1
Rgmb
B230118H07Rik
Chrna9
Tas1r1
Scel
Spata2L
Fam132b
H2−Bl
Dnajc28
9130017N09Rik
Gpat2
1700030K09Rik
2610015P09Rik
Fam169b
2410131K14Rik
Tbl1x
Lpxn
AA792892Dnahc2
Zfy2Ppp3ca
Itih5
Ppm1h
Rbak
Dpy19l3 Pcca
Polg2
Ikbkb
Efcab10
Pipox
Jakmip1
Ly6g6c
Spire2 Alg1
Mettl8
Mlycd
Ahsa2
Aspg
Zfp747
Zfp105
Ak3
9230110C19Rik
Hs3st1
4930588N13Rik Zfp346
Mbl2
BB557941
Gpr180
Postn
Tmem115
A730085A09Rik
Osgin1
Cyfip2
Lrrc34
Papss1
Scand1
1500015A07Rik
Mfsd2a Mpped2
Zfp759
Agphd1
Tceanc
Gm239
Ctxn1
D1Pas1
Plscr3
Cyb5r2
Wipi1
Gm1141 Tert
Bckdhb
Sec1
BC030336
Zcchc4
Abhd3
9130019O22Rik
Eif2ak2
4933427D14Rik
Nckap1l
Rab25Zfp82
Efnb3
Lect1
Hdac10
Kndc1Acox3
2310003L22Rik
Mpp3
Olfr376
Armc5
4930563E22Rik
Rap1gap2
Upk1a Ric3
Ube2u
Lrp12
Tecpr1
Rsad1
2810021B07Rik
Gm16381
Fn3krp
Tmem136
Mctp2
Gbp7
Sema6a Zbtb3
Rbm12b
Rps6ka3
Lphn1
Gpc3
Olfm1
Fam149b
Abca16
9930012K11Rik
Spsb1
Mogat2
Ctsll3
Bpil2
4931414P19Rik
Fancf
Mb
Fhit Ptrf Adck4
4933421E11Rik
Tmtc2 Tmem126b
Sema4d
Atp6ap1l
C230091D08Rik
Gstm4
4931408C20Rik
2610002M06Rik Cask
Atp1a3
Large
Zhx3
Micall1Nars2
Kif3b
Pigc
Cpxm1
Nfatc4
Zfp354a Shbg
Entpd3
Mfsd7b
Mapk7
5730494M16Rik
Bend7
Igsf9
Kctd12
Arhgef5
Slc35d2 Gcc1
Trafd1
Arhgef1
C8b
Ankrd42
Cdr2l
Cachd1
Tmem72 Man2c1
Zfp692
Mapkbp1
2210019I11Rik Lrrc45
Xylt2
Myof
B9d2
Enpp4
Fam161a
Shc3
Cyb561
Bcl9
4930430F08RikSfrp1
Zfp58
Alas2
Mocs1
Stambpl1
Fam46a
Parp14
Cd200
Rcbtb1
Pdlim3
Zscan20
Papolb
Glis1
Gramd1b
AI480653
Dock11
Zfp369
Gpr155 Rab28
Zkscan6
Mier2
Rhof
Trim32
Lypla2
Pyroxd1
Cryl1
Rab33b
Pdk3
Pbxip1
Zmynd11Arhgap9
2310057J16Rik
Ripk4
Dkk1
Tex14
Eif5a2
L3mbtl3
BC048403
Slc2a4
Inpp1
Gpr113
Abca7
S1pr3
Fzd4 Ctps2
Cbr4
Mertk
Gmpr2
Pank4
Dhrs13
1190002N15RikOsgin2
Ip6k2
Akap7
Napepld
Gm5134
Parp9 Znrf1
Map2k7
Foxo3 Exd2
Prrg2
Zfp772
Xrcc2
Lrrtm2 Ovol2
Bcorl1
Il17rd
Smad7
Stx18
Inpp5f
Tmem186
Sdr39u1
Zdhhc16
Rufy3
Fsd1l
Tnip2
Tead2
Ganc
Pign
Rab23
Eps8l2
Anxa9
Dph1
Wdr90
Ubash3b
Cldn3
Ankrd52 Id1
Donson
Slc30a9
Got1l1
Slc15a4
Sc5d
TbcdRit1
Vezt
Homer1
Nlrx1
Acsf3
Wnt10bZfp865
Xdh Greb1l
Klhl7
Tbc1d10a
Lyrm1
Bbs2
2410002I01Rik
Tbc1d12
Gm4858 Btd
Itgb3bp
Npr3 Myd88 Mep1b
Pnkp
Sik1
Srek1ip1
Fancl
Ift88
Praf2
Txndc15
Wwc1
Fam193b
Ptger4
Lrrc68 Ppip5k1
Mtbp
Tuba3b
LOC100303645Hps4
Zfp426
Cep152 Leng1
Plk1s1
Coq10b
Dcaf15
Phlpp2
Senp7
Cul7 Phf10
Ikbip
Sec61a2
Slc38a10
P2ry14Msrb2
Lrp4
Ets2 Ttll5
Immp1l
Pptc7
Parp4
Trmt12
Rab31
Nsl1
Mtfp1
Ralgapa2
Exog
Ctu1
Aldh3a2
Il1rn
Mterfd3
Sh3pxd2a
Cd47 Rxrb
Apex2
Acer3
Lrrc14
Znf512b
Gins3
Tpmt
Cntrob
Zfp184
Tmem180
Ssh2
Sh3bp4Tmem138
Anks1
Zkscan17
Gnb2
Galnt3
Nlrp12
Osr2
Runx1
1700018B24Rik
Aknad1
Cyp2t4
Zfp275 Prkx
Mllt11
Pcnxl3
Gm13128
Nr1d1
Fam195b
Gm3776 Inf2
Ttbk2
Myo1d
Zfp213
Folr2
Jmjd5 Txnrd2
Rnf146
Hagh
Ablim1
Efha1
Spink2
Ppm1d
Lin28b
Phf16
Pfn2Fem1a
Ulk2
Clstn1
Zfp532
Poli
Rbm41
Grip1Zfp623
Zfp345
Fam129b
Lepr
2310014H01Rik
Batf
Oplah Rhou
Grk4
Pacsin3
Zfp790
6720489N17Rik
Manea Nprl3
Sfxn4
Atl1
Tspan9
Zfp46
Cldn9
Srebf2
Ift140
Gba2
Slc6a15
Ptprn
Eya1
Pramef6
Ephx1
Bphl
Gipc2
Fabp9
St3gal4
Rilpl2
Pgc
A630072M18Rik
Gal3st2
Gm4975
Tsnaxip1
Rcan3
Lysmd1
Fbxo44
Rps6ka4
Nefh
Tmem51
1700086O06Rik Bik
2810046L04Rik
5730559C18Rik
LOC100503652
Xkr9
Esr2 Elp4
Dip2c
Il6st
Tcfl5
2310035K24Rik
Wasf1
Lrrc51
4930422G04Rik
Rab32
2610002J02Rik
Abtb2
Fahd1
Itgb7
Mapkapk3
Recql4
Atp1b2
Gm7102 Derl3 Gpsm2
Snapc1
S1pr2
Fam57b
Acer2
Pvt1
Arid3b
Lims2
Pla2g5
Nek7 Tbcc
Carm1
Klhl10
Arrdc2
Cubn
Rab27a
Map3k1
Il10rb
Srd5a3
Pter
A930004D18RikLpin1
Cdh4
Clec16a
Tnik
Col6a3
Sccpdh
Ehd2 P4htm
Ifi27l1
Zscan4bTicam1
Fam178b
Zar1l
Abcg4
Ankrd56
Dleu2
D830031N03Rik
Rabgap1l
Trp73
Zfp72
4930412O13Rik Ptov1
Eme2
Socs5
Igsf11
Ankrd57
Fetub
Fam92a
Sun2Trpc2
Gprasp1
Phyhipl
Higd1a
Gpx7
Mybl1
Lonrf1
2810453I06Rik
Tppp3
Vps37c
Col7a1Pomt1
Lpcat1
Nudt13
Vim
D4Bwg0951e
A830010M20Rik
Slc6a12
Adprhl1
Zfp541
Crebl2
Sort1
Zic3
Oxsm
1700008J07Rik
Them5
2610020H08Rik
Rps6ka2
Zfp689 Trabd
Ncf2
Hpcal1
Gpr4
Calml4Cct6b
Dok4
Cfc1
Rad51l1 Terf2ip
Vps36
Glis2
Tapt1
Zfat
X
Ccnd1
Xpc
Ranbp17
Cep68
Pomt2
Hps6
Acsbg2
Crybg3
AstlB9d1
Adap2
2310033E01Rik
Slc27a3
Tcf7l1
Cc2d1a
5730455O13Rik
Gstm5
5730528L13Rik
1110008J03Rik
Ndrg4
Sgsm2
Mxd4 Hip1r
2610029I01Rik
Cntn4
Efna4
Pcgf2
AI467606Ddx11
Fbxl4
Ppp1r9b Pola2
Adam9
Rarg
Amt
Med18
Dopey1
Tsen54
Rab4a
1110020G09Rik
Gsdma3
Zbtb42
Maml3
6720401G13Rik
Heg1
Alpk3
1700123I01Rik
Dnahc6
Bach2
4930558J18Rik
Tmem117Rab4b
Tnfrsf8Ocm
Gcnt4
Zfp235
Ttc25
Dqx1
Mrc1
Acan
Bhlhb9
Card10
Abcc8
Fan1
Accs
Trp53rk
Dnaja4
Ahsg
Ahr
Zfp273
Zhx1
Nphp3
Ncam1
Arhgef2Tm2d1
Nt5mMitf
Ppl
Cnp
Rnf123
Prkab2
4921531C22Rik
Ppm1j
Tmem205
Sgce
2010015L04Rik
Tmem146
Thrsp
Wdr25Gbe1
Pde7a
Il1r1
Rgma
Depdc1b
Tcte2
Gm17751
Psen2
B3gnt8
Zbtb25
Zfp282
Atcay
Dcun1d4
Ccdc36
1810010H24Rik
Tspan13
Cdcp1
Ak1
1700023I07Rik
Tnfrsf25
Iqsec2
Gabbr1
Gm6981
Shroom4
Zfp93
Gna15
Ccr4
Ttc30a1
Gm5113
Dock10 Wrnip1
Tpst2
Ift27
Tbc1d9b
Med16
Uaca
TefJosd2
Crem
Slc12a4
4933430I17Rik
Osbpl10
Zfp449
Stat5bTcf4
Nipal3
Ttc39cWls
Allc
Paqr4
Tbc1d2b
Rps6ka5
Shkbp1
Pla1a
Abcc10
B4galnt2
Hcn3
D3Ertd254e
Tmod2
Nfkbiz
Abcb1b Scyl3
Lrig2
Agap3
Itpripl1
Gm13040|Gm13057
Arhgap30Mex3a
Prpf40b
B930041F14Rik
Zfp94
Tbc1d5
Lrrc23
E230016K23Rik Cxxc5
Pde6d
Eepd1Eid2
Zfp251
Tsx
Gstm2
Slc44a3
9430091E24Rik
D930020B18Rik
2310047B19Rik
Dennd1a
Evpl
Apol7b|Apol7e
D3Ertd751e Vav2
Zdbf2
Jrk
Bivm
Pdlim2
Hey1
Esam
E030024N20Rik
Ulk3
Gm13125Itgb4 Dhtkd1
Dhx35
Pzp
Exd1
Zfp329
Pofut1
Trim11
Zer1
Usp20Ptprj
Zp1
Fabp4
Nxn
Cdk6
Acad12
Prss38Ccdc88c
Lemd2
Megf8
Crabp2
Macrod2
Osta
Ccdc120Zfp524
Serpini1
Megf6
Tns4
Msh5
Ikzf4
Zmat3
Csgalnact2
Prr19
Actn3
Fgf3
Slc26a9
B4galt2 Cpd
AI662270
Dcun1d2
Gen1
Klhl15
Mthfsd
Rundc1 Iqcg
Pcnx
Rbfa
Stat2
Taco1
Mcoln1
Zfp74
Scfd2
Gse1
Clic3
Troap
Kctd13
Fkbpl
Ppap2c
Fdxacb1
Tor3a
Stxbp5l
Rgs10Slc25a1
Slc7a8
D330028D13Rik
Gfpt2
Bbs7
Lmln
Tmem184a Bcar1
D430020J02Rik Acads
Lats1
Gucy2eAcy1
Adcy9
1700037H04Rik
Fstl3
Lmbrd2
Trove2
Ankrd54
AA960436
Dem1
Phtf1
Relb Zbtb5
Rap2b
Ppp3cb
Rasgrp4
BC002230
Asb7
Ppp2r2c
Plcg1Cux2
Ccdc21Mlxip
Elf1
Pafah2
Clock
Zfp609
Atl3
Kalrn
Aasdh Aldh4a1
Tagap1
Zfp397
Pnldc1
Cd59a
lr3c
Epb4.9
Lin52
Xlr3b
Cdc42ep1
1110014N23Rik
Atg4c
Rbms1
P
Ift172 Zfp64
rmt1
Tube1
6530401N04Rik
Uhrf2
1600014K23Rik
4930503E14Rik Cacnb3
Urgcp
Acsbg1
Slc35c1
BC048355
Cry2
Synrg
Abhd5
Narf
Tcf12
Slc35a5
Phactr4
Acvr2a
Uckl1
Pot1b
Ttll10 Ptplb
Ercc6
Rpusd1
Nutf2
Zswim7
B4galnt1
Bbs10
Ammecr1 Pdss1
Grb7
Lgals4
Vpreb3
Peli2
Gfod1 Tusc2
Mars2
Depdc5
Dock6
Mir17hgYipf2
Wwp1
Rinl
Cdkn2d
Loh12cr1
Sharpin
Slc44a1
Zfp873
Tmem63b
1110032A03Rik
Hspa4l Hddc2
Arl3
Hace1
Stat5a
Eif2c4
Zdhhc13
Cdc42ep5
Rasip1Herc3
Tdpoz4
Vav3
Sestd1
Myh11
Myo5c Pkd1
Rpgrip1l
Mmp1b
Sox17
Hs6st2
Zfp946
Lysmd2
Tead3
Krt19 Rnf103
Prmt2
Irf5
Ldoc1l
Zcchc14
Tnfrsf21Pdss2
Slco2a1
Exoc6b
Phf21a
F630110N24RikZfp27
2310021P13Rik
Nfkbie
D2hgdh
Tbc1d2
Cbfa2t3 Vrk2Atrip
Lnx2
Gm608
Nmral1
Jmjd8
Pigb Eri2
Smcr7
Szt2
Sh3bp5l
Anxa11
Wdr89
Ppp1r13b
Btbd19 Tyw3
Cdyl
Akap9
Fbxl18
Insig2
Skiv2l
Mppe1 Lztr1
Slc1a5
Fbrs
Atg4d
Setdb2
Manbal
Dolk
Rpusd2
Cog5
Pex11a
Bst2
Tbc1d22b
1700030J22Rik RhogMast3
Zfp956
Rpgrip1
Gclc
Cep78
Apoa1bp
Tle1
Ergic1
Fam3a
Bbs9
Arl8a
Nqo2
Cyhr1
D230037D09Rik
2310015A10RikAqp11
Plek2
BC046404
Psma8Msi2 Smtn
Zbtb9
Hormad1
Zbtb37 Gk5
Fam172a
Sec14l4
Glb1l
Popdc3 Fam160b1
Ighmbp2
Otud3
Zfp61
4930455F23Rik
Uhmk1Hvcn1
Ptprs
Rnf170
Pdgfb
Zc3h10
Srgap1 Cirbp
Ddit3
Pard6a
Dnajb2
Smyd2
Map3k2
Gadd45b
Oaf
5930434B04Rik
Fam184a
Dbndd1
Dnajc18
Zbtb46
Echdc3
Fam55d
Enox1 Traf3
Chmp4c Kng1
Usp29
Galnt11
Gm129
Gm8994
Wbp1Ctrl
Mre11a
Nkrf
Irak1
Zbtb48
Fam73a
Tdrd9
Trio
Zfp518b
Ptpn22 Cdx2
2700050L05Rik
Recql5
Lman1
Fem1c
Adck5
Tubg2
Auts2
Fto
Lactb Tmem194
Dpy19l1
Klf8
Usp42
Ovol1
Zfp263
Ahcyl2
Snai1
Fam154a
Ccny
1810008A18Rik
Zswim4
Acad10
Pik3r2 Wipf1
Ctdp1
Ppcs
Pcgf3
Gipc1
Cdc42ep3
Nthl1
Taz
Dnajc30
Nans
Decr1
Clcn2
Naaa
Zbed6
Zfp260
Med30
Scarb1
Zscan29
Slc26a2
Fam53b
D0H4S114
Ankrd45
Ddx58 Mrm1 Abi2
Helb
Rbm45
Fam160b2
Nmt2
Lipt1
Capn1
Gm5918
Hsd17b7
Dgkk
Plcd4 Pxmp4
C
Plekhm2
Bmpr2 Cnpy4
Lass6
Ly96
Clcn4−2ds1 Aagab
Ccdc14 P4ha2
Zranb1
Irgq
Trub2
Atp6v0a4
Mcm8
Yeats2 Elovl1
Ralgps2
Aldh6a1
E130012A19Rik
Gata2
Gmcl1
Dtnb
Acadsb Pnpo
Tesk2
Fam165b
Myo9b
Ccdc66
Zfp229
Zyg11a
Zfp788
Ccdc57
Ddb2
Slc20a2 Kif18b
Zrsr2
Dedd2
Vma21
Klrg2
Mnt
Ddx59
Limd1
Hrsp12
Ids
Sac3d1
Pts
Grina
Frmd6
Sh2b3
Lmbr1
Arl5b Patz1
Il6ra
Atrn
Cdk19Cphx Kbtbd4
Gphn
Mpp7
Epc1
Pramef12
Zfp939 Elf2 Zfp599
Tubgcp2
Rnf34
2010106G01Rik
T Ccng1
Slc25a19
Entpd6
1810074P20Rik
Ppp2r1b
Pgm1
Ambra1
Pppde1
2010317E24Rik
Wdr47 P4ha1
Naa11
Fkbp11Stk35
Lin54
Fam102a
Tbc1d9
Sertad1
Pus10
Snx9
Pik3r1
Shroom3
Pygb Cox10
Impdh1
E130306D19Rik
Plekhh1
Rab3gap2
Specc1l Abhd6
Fam179b
Sqle Ldha
Gpatch8
Wdr81
Fam129a
Csrp2bp
Efnb2
Casc3
Kctd3
span14
Cat
Smpdl3a Gcat
Psen1
Rpap1
Runx1t1
Orc1
Rev3l
Cep57l1
Pikfyve Lgmn
Esrp1
Ggt1
Ube2g1
Ezr Golph3
Wwc2
Phf20
Cdk2ap1
S100a11
Arhgef16
Brwd3
Usp24
Epb4.1
Ccrn4l Larp1b
Pcyt1a
Pepd Ltn1
Prmt7
Zfand6
Dlst
Stmn3Pitrm1
Rnf13
Polr1e
Ckap2l
Hmgxb3 Pom121
Fkbp6
Cyb5r1
Naa40
Casc4
Gnl3l Atmin Ktn1
5031439G07Rik
Cyb5r3Ogt Nt5c3l
HexaSlc12a7
Serpinb6a
Scd2
Tmsb10
Ech1
Ppp4r2
Agtrap
Pias2
L2hgdh
Phf13
Mta2Aacs
L1td1 Bod1l
Tmem50b
Lrrc42
Aldh9a1
Usp15
Wipi2
Ddah1
Glud1
2700029M09Rik
Hk2
Sdhd
Trp53
Casp8ap2
Rab3il1 Fdps
Akap12
Slbp
Gjb3F11r
Cecr2 Zc3hc1
Pkm2
Rragd
Setx
Cyp2s1
Tet2
Ankrd17 Hspa5
Ivns1abp
CtsbCstb
Hsph1 Ctsa
Bcat1
Crip2
Actn4 Mat2a
Nolc1 Prdx1Calm1
Tmem104
Nat2
Ms4a6d
Gcm2
Caln1
Pmepa1
Zdhhc8
Fam167a
Oasl1
Otog
Syt11
2410066E13Rik
A530064D06Rik
Nlrc4
Gnb4
Lrrc8b
Camk2b
Cmya5
Inppl1
Mef2a
Erc2
Slc7a1
Arhgap4
Mmp17
Mpp4
Kera
Trpm4
Zdhhc9
March8
Hecw1
Upk1b
Fbxo10
Ppp2r3a
Gpnmb
Gramd4
Hspa1a
Gm5156
Pmfbp1
Ecm2
Tmem206
2610301B20Rik
Ankdd1b
Cdk3−ps
H2−DMa
Srgap3Igf2
9130023H24Rik
Mtap9
1810048J11Rik
Mfap4
Fam53a
Adamts14 Abhd16a
Pramef17
Wdr95
1810063B07RikHdac4
Kank2
Nfat5
4931409K22Rik
Npepl1
Plek
4931406H21Rik
Thbd Atf7
Slc35d1
Dbndd2
Mettl7a1
Rs1
A230072C01Rik
Gm10033
Mcm9
Tmem189
Wdr60
Efcab2
Thap2
Apbb1ip
Lmod3
Asgr2
Nudt11
Slc26a1
Col5a1
Nr2e1
[Figure: plot of variable genes, with individual genes labelled by name; x axis: Average expression]
length(x = seuset@var.genes)
## [1] 6127
We are not entirely sure what is going on in the lower left hand corner of the plot above. A similar feature
can be found in the Satija lab tutorial, so we do not believe that it is due to an error in how we used the
method.
## [1] "Scaling data matrix"
Next we perform PCA on the scaled data. By default, the genes in object@var.genes are used as input,
but a different gene set can be supplied using pc.genes. Running dimensionality reduction on highly variable
genes can improve performance. However, with some types of data (e.g. UMI data), particularly after regressing
out technical variables, PCA returns similar (albeit slower) results when run on much larger subsets of genes,
including the whole transcriptome.
seuset <- RunPCA(
object = seuset,
pc.genes = seuset@var.genes,
do.print = TRUE,
pcs.print = 1:5,
genes.print = 5
)
## [1] "PC1"
## [1] "Gm10436" "Zbed3" "Gm13023" "Oog1" "C86187"
## [1] ""
## [1] "Fbp2" "Fam96a" "Cstb" "Lrpap1" "Ctsd"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "Gsta4" "Id2" "Ptgr1" "AA467197" "Myh9"
## [1] ""
## [1] "Gm11517" "Obox6" "Pdxk" "Map1lc3a" "Cited1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "Psrc1" "Ninj2" "Gja4" "Tdrd12" "Wdr76"
## [1] ""
## [1] "Efnb2" "Gm9125" "Pabpn1" "Mad2l1bp"
## [5] "1600025M17Rik"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "Upp1" "Tdgf1" "Baz2b" "Rnd3" "Col4a1"
## [1] ""
## [1] "Rragd" "Ppfibp2" "Smpdl3a" "Cldn4" "Amotl2"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "Snhg8" "Trappc2" "Acsm2" "Angptl2" "Nlgn1"
## [1] ""
## [1] "Akap1" "Stub1" "Apoe" "Scand1" "Hjurp"
## [1] ""
## [1] ""
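As a rough, package-independent illustration of the point about gene subsets, the base R sketch below runs PCA on a simulated expression matrix twice: once on a small set of high-variance genes and once on all genes. The simulated data, the variance-based gene selection and all variable names are assumptions made for illustration; this is not how Seurat selects variable genes, nor what RunPCA does internally.

```r
set.seed(1)

# Simulated log-expression matrix: 200 genes (rows) x 60 cells (columns).
n_genes <- 200
n_cells <- 60
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("cell", seq_len(n_cells))))

# Two hypothetical cell groups that differ in the first 20 genes only.
group <- rep(c("A", "B"), each = n_cells / 2)
expr[1:20, group == "B"] <- expr[1:20, group == "B"] + 3

# Use the 20 genes with the highest variance as a stand-in for var.genes.
gene_var <- apply(expr, 1, var)
hvg <- names(sort(gene_var, decreasing = TRUE))[1:20]

# prcomp expects observations (cells) in rows, so transpose the matrix.
pca_hvg <- prcomp(t(expr[hvg, ]), center = TRUE, scale. = TRUE)
pca_all <- prcomp(t(expr), center = TRUE, scale. = TRUE)

# Difference between the group means on PC1 for each gene set.
sep_hvg <- abs(diff(tapply(pca_hvg$x[, 1], group, mean)))
sep_all <- abs(diff(tapply(pca_all$x[, 1], group, mean)))
```

Comparing sep_hvg and sep_all gives a feel for how much of the group structure PC1 captures with each gene set; on real data the comparison would of course depend on the dataset.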
Seurat provides several useful ways of visualizing both cells and genes that define the PCA:
PrintPCA(object = seuset, pcs.print = 1:5, genes.print = 5, use.full = FALSE)
## [1] "PC1"
## [1] "Gm10436" "Zbed3" "Gm13023" "Oog1" "C86187"
## [1] ""
## [1] "Fbp2" "Fam96a" "Cstb" "Lrpap1" "Ctsd"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "Gsta4" "Id2" "Ptgr1" "AA467197" "Myh9"
## [1] ""
## [1] "Gm11517" "Obox6" "Pdxk" "Map1lc3a" "Cited1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "Psrc1" "Ninj2" "Gja4" "Tdrd12" "Wdr76"
## [1] ""
## [1] "Efnb2" "Gm9125" "Pabpn1" "Mad2l1bp"
## [5] "1600025M17Rik"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "Upp1" "Tdgf1" "Baz2b" "Rnd3" "Col4a1"
## [1] ""
## [1] "Rragd" "Ppfibp2" "Smpdl3a" "Cldn4" "Amotl2"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "Snhg8" "Trappc2" "Acsm2" "Angptl2" "Nlgn1"
## [1] ""
## [1] "Akap1" "Stub1" "Apoe" "Scand1" "Hjurp"
## [1] ""
## [1] ""
VizPCA(object = seuset, pcs.use = 1:2)
[Figure: VizPCA output showing the genes most strongly associated with PC1 and PC2]
PCAPlot(object = seuset, dim.1 = 1, dim.2 = 2)
[Figure: PCAPlot output; cells plotted on PC1 and PC2, all labelled SeuratProject]
PCHeatmap(
object = seuset,
pc.use = 1:6,
cells.use = 500,
do.balanced = TRUE,
label.columns = FALSE,
use.full = FALSE
)
[Figure: PCHeatmap output; one heatmap panel per component, PC 1 to PC 6, with rows labelled by gene]
The JackStrawPlot function provides a visualization tool for comparing the distribution of p-values for
each PC with a uniform distribution (dashed line). Significant PCs will show a strong enrichment of genes
with low p-values (solid curve above the dashed line). In this case it appears that PCs 1-8 are significant.
[Figure: JackStrawPlot output; x axis: Empirical, y axis: Theoretical [runif(1000)]; visible panel labels include PC7 5.07e−36, PC8 7.82e−21 and PC9 0.0736]
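The comparison against a uniform distribution can be illustrated without Seurat. The sketch below simulates per-gene p-values for a component with no signal and for a component enriched for low p-values, then compares each to the uniform distribution. The simulated p-values and the use of a Kolmogorov-Smirnov test are assumptions chosen for illustration; they are a stand-in for, not a reproduction of, the JackStraw procedure.

```r
set.seed(42)

# Hypothetical per-gene p-values for two PCs (simulated, not JackStraw output).
p_null <- runif(1000)            # a PC with no real signal
p_signal <- rbeta(1000, 0.3, 1)  # a PC enriched for small p-values

# Fraction of genes below a small threshold; roughly 0.05 under uniformity.
frac_below <- function(p, threshold = 0.05) mean(p < threshold)

# Kolmogorov-Smirnov test against the uniform distribution, used here as a
# simple stand-in for the visual comparison with the dashed line.
ks_null <- ks.test(p_null, "punif")
ks_signal <- ks.test(p_signal, "punif")

frac_below(p_null)
frac_below(p_signal)
ks_signal$p.value
```

A component whose p-values pile up near zero, like p_signal here, corresponds to a solid curve well above the dashed line in the plot.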
A more ad hoc method for determining which PCs to use is to look at a plot of the standard deviations of
the principal components and draw your cutoff where there is a clear elbow in the graph. This can be done
with PCElbowPlot. In this example, it looks like the elbow would fall around PC 5.
PCElbowPlot(object = seuset)
[Figure: PCElbowPlot output; x axis: PC, y axis: Standard Deviation of PC]
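To make the elbow idea concrete, the following base R sketch simulates data driven by three strong factors, computes the standard deviation of each principal component with prcomp, and locates the largest drop between consecutive components. The simulated data and the "largest drop" rule are assumptions for illustration only; in practice the elbow is usually judged by eye, and this is not what PCElbowPlot computes beyond the plotted standard deviations.

```r
set.seed(7)

# Simulated data: 100 cells x 50 genes driven by 3 strong underlying factors.
n_cells <- 100
scores <- matrix(rnorm(n_cells * 3, sd = 5), ncol = 3)
loadings <- matrix(rnorm(3 * 50), nrow = 3)
x <- scores %*% loadings + matrix(rnorm(n_cells * 50), ncol = 50)

pca <- prcomp(x, center = TRUE, scale. = FALSE)
sdev <- pca$sdev[1:20]

# Simple heuristic: the elbow is the PC after which the standard deviation
# drops the most.
drops <- -diff(sdev)
elbow <- which.max(drops)

plot(seq_along(sdev), sdev, xlab = "PC", ylab = "Standard Deviation of PC")
abline(v = elbow + 0.5, lty = 2)  # dashed line just after the elbow
```

On real data the drop is rarely this sharp, which is why the cutoff remains a judgment call.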
Seurat implements a graph-based clustering approach. Distances between the cells are calculated based
on previously identified PCs. Seurat's approach was heavily inspired by recent manuscripts which applied
graph-based clustering approaches to scRNA-seq data (SNN-Cliq; Xu and Su, 2015) and CyTOF data
(PhenoGraph; Levine et al., 2015). Briefly, these methods embed cells in a graph structure, for example a
K-nearest neighbor (KNN) graph, with edges drawn between cells with similar gene expression patterns,
and then attempt to partition this graph into highly interconnected quasi-cliques or communities. As in
PhenoGraph, we first construct a KNN graph based on the Euclidean distance in PCA space, and refine the
edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard
distance). To cluster the cells, we apply modularity optimization techniques (SLM; Blondel et al., 2008)
to iteratively group cells together, with the goal of optimizing the standard modularity function.
The FindClusters function implements the procedure, and contains a resolution parameter that sets the
granularity of the downstream clustering, with increased values leading to a greater number of clusters.
We find that setting this parameter between 0.6 and 1.2 typically returns good results for single cell datasets
of around 3,000 cells. Optimal resolution often increases for larger datasets. The clusters are saved in the
object@ident slot.
seuset <- FindClusters(
object = seuset,
reduction.type = "pca",
dims.use = 1:8,
resolution = 1.0,
print.output = 0,
save.SNN = TRUE
)
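The graph construction described above (KNN graph in PCA space, then edge weights from the overlap of local neighbourhoods) can be sketched in base R. The simulated PCA coordinates, the choice of k = 5 and the inclusion of each cell in its own neighbourhood are assumptions for illustration; the sketch computes a Jaccard index (a similarity) and does not reproduce Seurat's internal implementation or the subsequent modularity optimization.

```r
set.seed(3)

# Simulated PCA coordinates: 2 well-separated groups of 15 cells, 5 PCs each.
pcs <- rbind(matrix(rnorm(15 * 5, mean = 0), ncol = 5),
             matrix(rnorm(15 * 5, mean = 8), ncol = 5))
rownames(pcs) <- paste0("cell", seq_len(nrow(pcs)))

k <- 5
d <- as.matrix(dist(pcs))  # Euclidean distances in PCA space

# For each cell, the indices of its k nearest neighbours (excluding itself,
# which always comes first because its distance to itself is zero).
knn <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))

# Jaccard index between the neighbourhoods of every pair of cells.
n <- nrow(pcs)
snn <- matrix(0, n, n, dimnames = list(rownames(pcs), rownames(pcs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    a <- c(i, knn[i, ])  # neighbourhood of cell i, including the cell itself
    b <- c(j, knn[j, ])
    snn[i, j] <- length(intersect(a, b)) / length(union(a, b))
  }
}
```

The resulting snn matrix could then be passed to a community detection method to obtain clusters; that step is left out here to keep the sketch free of extra packages.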
A useful feature in Seurat is the ability to recall the parameters that were used in the latest function calls
for commonly used functions. For FindClusters, there is the function PrintFindClustersParams to print
a nicely formatted summary of the parameters that were chosen:
PrintFindClustersParams(object = seuset)
We can look at the clustering results and compare them to the original cell labels:
table(seuset@ident)
##
## 0 1 2 3
## 85 75 59 34
adjustedRandIndex(colData(deng)[seuset@cell.names, ]$cell_type2, seuset@ident)
## [1] 0.3981315
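To show what the adjusted Rand index measures, here is a small base R sketch that computes it from a contingency table of two labelings. The function name adjusted_rand and the toy labels are hypothetical and chosen for illustration; in the analysis above the value comes from adjustedRandIndex in the mclust package.

```r
# Adjusted Rand index computed from the contingency table of two labelings.
adjusted_rand <- function(x, y) {
  tab <- table(x, y)
  choose2 <- function(n) n * (n - 1) / 2
  sum_cells <- sum(choose2(tab))             # pairs grouped together in both
  sum_rows <- sum(choose2(rowSums(tab)))     # pairs grouped together in x
  sum_cols <- sum(choose2(colSums(tab)))     # pairs grouped together in y
  total <- choose2(sum(tab))                 # all pairs of items
  expected <- sum_rows * sum_cols / total    # agreement expected by chance
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

truth <- c("a", "a", "a", "b", "b", "b")
same  <- c(2, 2, 2, 1, 1, 1)  # identical partition, different label names
mixed <- c(1, 1, 2, 2, 3, 3)  # partially agreeing partition

adjusted_rand(truth, same)   # identical partitions give 1
adjusted_rand(truth, mixed)
```

Because the index only depends on which items are grouped together, renaming the cluster labels does not change it, which is why cluster numbers from Seurat can be compared directly with named cell types.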
Seurat also uses a tSNE plot to visualise clustering results. As input to the tSNE, we suggest using the
same PCs as input to the clustering analysis, although computing the tSNE based on scaled gene
expression is also supported using the genes.use argument.
seuset <- RunTSNE(
object = seuset,
dims.use = 1:8,
do.fast = TRUE
)
TSNEPlot(object = seuset)
[Figure: TSNEPlot output; x axis: tSNE_1, y axis: tSNE_2; cells coloured by cluster (0 to 3)]
Seurat can help you find markers that define clusters via differential expression. By default, it identifies
positive and negative markers of a single cluster, compared to all other cells. You can test groups of
clusters vs. each other, or against all cells. For example, to find marker genes for cluster 2 we can run:
markers2 <- FindMarkers(seuset, 2)
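The one-cluster-versus-all-other-cells comparison can be sketched in base R on simulated data. The Wilcoxon rank-sum test is used here as a simple stand-in and is an assumption; it is not necessarily the test that FindMarkers applies by default in this Seurat version, and the simulated matrix, cluster labels and column names are hypothetical.

```r
set.seed(11)

# Simulated log-expression: 50 genes x 40 cells, with cluster labels 0 to 3.
expr <- matrix(rnorm(50 * 40), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("cell", 1:40)))
cluster <- rep(0:3, each = 10)

# Make gene1 and gene2 higher in cluster 2 (hypothetical marker genes).
expr[c("gene1", "gene2"), cluster == 2] <-
  expr[c("gene1", "gene2"), cluster == 2] + 4

in_cluster <- cluster == 2

# One-vs-rest Wilcoxon rank-sum test for every gene.
p_val <- apply(expr, 1, function(g) {
  wilcox.test(g[in_cluster], g[!in_cluster])$p.value
})

# Positive values mean higher expression in cluster 2 than elsewhere.
diff_mean <- rowMeans(expr[, in_cluster]) - rowMeans(expr[, !in_cluster])

markers <- data.frame(p_val = p_val,
                      diff_mean = diff_mean,
                      p_adj = p.adjust(p_val, method = "bonferroni"),
                      row.names = rownames(expr))
markers <- markers[order(markers$p_val), ]
head(markers)
```

Genes with a negative diff_mean and a small p-value would be negative markers of the cluster, matching the description of positive and negative markers above.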
[Figure: expression of Serpinh1 and Uhrf1 by cluster; x axis: Identity (0 to 3)]
FeaturePlot(
seuset,
head(rownames(markers2)),
cols.use = c("lightgrey", "blue"),
nCol = 3
)
[Figure: FeaturePlot output; six tSNE panels, one per marker gene, y axis: tSNE_2]
FindAllMarkers automates this process and finds markers for all clusters:
markers <- FindAllMarkers(
object = seuset,
only.pos = TRUE,
min.pct = 0.25,
thresh.use = 0.25
)
DoHeatmap generates an expression heatmap for given cells and genes. In this case, we are plotting the top
10 markers (or all markers if fewer than 10) for each cluster:
top10 <- markers %>% group_by(cluster) %>% top_n(10, avg_logFC)
DoHeatmap(
object = seuset,
genes.use = top10$gene,
slim.col.label = TRUE,
remove.key = TRUE
)
[Figure: DoHeatmap output; top marker genes (rows) across cells grouped by cluster 0 to 3 (columns)]
Exercise: Compare marker genes provided by Seurat and SC3.
9.10 sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] bindrcpp_0.2 dplyr_0.7.4
## [3] mclust_5.4 Seurat_2.2.0
Chapter 10
“Ideal” scRNAseq pipeline (as of Oct 2017)
Figure 10.1: Appropriate approaches to batch effects in scRNASeq. Red arrows indicate batch effects which
are (pale) or are not (vibrant) correctable through batch-correction.
10.3 Preparing expression matrix
• Quantification
– Small datasets, no UMIs: featureCounts
– Large datasets, no UMIs: Salmon, kallisto
– UMI datasets: UMI-tools + featureCounts
Advanced exercises
For the final part of the course we would like you to work on more open ended problems. The goal is to
carry out the type of analyses that you would be doing for an actual research project.
Participants who have their own dataset that they are interested in should feel free to work with it.
For other participants we recommend downloading a dataset from the conquer resource (consistent
quantification of external RNA-seq data). conquer uses Salmon to quantify the transcript abundances in a
given sample. For a given organism, the fasta files containing cDNA and ncRNA sequences from Ensembl
are complemented with ERCC spike-in sequences, and a Salmon quasi-mapping index is built for the entire
catalog. Then Salmon is run to estimate the abundance of each transcript. The abundances estimated by
Salmon are summarised and provided to the user in the form of a MultiAssayExperiment object. This
object can be downloaded via the buttons in the MultiAssayExperiment column. The provided
MultiAssayExperiment object contains two “experiments”, corresponding to the gene-level and
transcript-level values.
The gene-level experiment contains four “assays”:
• TPM
• count
• count_lstpm (count-scale length-scaled TPMs)
• avetxlength (the average transcript length, which can be used as offsets in count models based on the
count assay, see here).
The transcript-level experiment contains three “assays”:
• TPM
• count
• efflength (the effective length estimated by Salmon)
The MultiAssayExperiment also contains the phenotypic data (in the colData slot), as well as some
metadata for the data set (the genome, the organism and the Salmon index that was used for the
quantification).
Here we will show you how to create an SCE from a MultiAssayExperiment object. For example, if you
download the Shalek2013 dataset you will be able to create an SCE using the following code:
library(MultiAssayExperiment)
library(SummarizedExperiment)
library(scater)
d <- readRDS("~/Desktop/GSE41265.rds")
cts <- assays(experiments(d)[["gene"]])[["count_lstpm"]]
tpms <- assays(experiments(d)[["gene"]])[["TPM"]]
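The snippet above stops after extracting the two assay matrices. A minimal sketch of the remaining step is given below, assuming that the sample annotations are taken from colData(d) and that the SingleCellExperiment constructor from the SingleCellExperiment package is used to assemble the object. The helper name mae_to_sce and the assay names countData and tpmData are illustrative choices, not requirements.

```r
library(MultiAssayExperiment)
library(SummarizedExperiment)
library(SingleCellExperiment)

# Hypothetical helper: build an SCE from a conquer-style MultiAssayExperiment.
mae_to_sce <- function(d) {
  gene <- experiments(d)[["gene"]]
  cts <- assays(gene)[["count_lstpm"]]
  tpms <- assays(gene)[["TPM"]]
  phn <- colData(d)  # phenotypic data stored on the MultiAssayExperiment
  SingleCellExperiment(
    assays = list(countData = cts, tpmData = tpms),
    # Reorder the annotations so that rows match the columns of the assays.
    colData = phn[colnames(cts), , drop = FALSE]
  )
}

# Usage, assuming the file from the snippet above has been downloaded:
# d <- readRDS("~/Desktop/GSE41265.rds")
# sce <- mae_to_sce(d)
```

Wrapping the steps in a function makes it easy to apply the same conversion to several conquer datasets, for example before attempting to merge them.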
You can also see that several different QC metrics have already been pre-calculated on the conquer website.
Here are some suggestions for questions that you can explore:
• There are two mESC datasets from different labs (i.e. Xue and Kumar). Can you merge them and
remove the batch effects?
• Clustering and pseudotime analysis look for different patterns among cells. How might you tell which
is more appropriate for your dataset?
• One of the main challenges in hard clustering is to identify the appropriate value for k. Can you
use one or more of the clustering tools to explore the different hierarchies available? What are good
mathematical and/or biological criteria for determining k?
• The choice of normalization strategy matters, but how do you determine which is the best method?
Explore the effect of different normalizations on downstream analyses.
• scRNA-seq datasets are high-dimensional, and most dimensions (i.e. genes) are not informative.
Consequently, dimensionality reduction and feature selection are important when analyzing and
visualizing the data. Consider the effect of different feature selection methods and dimensionality reduction
on clustering and/or pseudotime inference.
• One of the main challenges after clustering cells is to interpret the biological relevance of the
subpopulations. One approach is to identify gene ontology terms that are enriched for the set of marker
genes. Identify marker genes (e.g. using SC3 or M3Drop) and explore the ontology terms using gProfiler,
WebGestalt or DAVID.
• Similarly, when ordering cells according to pseudotime we would like to understand what underlying
cellular processes are changing over time. Identify a set of changing genes from the aligned cells and
use ontology terms to characterize them.
Chapter 12
Resources
Bibliography
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search
tool. Journal of Molecular Biology, 215(3):403–410.
Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biol,
11(10):R106.
Archer, N., Walsh, M. D., Shahrezaei, V., and Hebenstreit, D. (2016). Modeling enzyme processivity reveals
that RNA-seq libraries are biased in characteristic and correctable ways. Cell Systems, 3(5):467–479.e12.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in
large networks. J. Stat. Mech., 2008(10):P10008.
Bray, N. L., Pimentel, H., Melsted, P., and Pachter, L. (2016). Near-optimal probabilistic RNA-seq
quantification. Nat Biotechnol, 34(5):525–527.
Bullard, J. H., Purdom, E., Hansen, K. D., and Dudoit, S. (2010). Evaluation of statistical methods for
normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics, 11(1):94.
Buttner, M., Miao, Z., Wolf, A., Teichmann, S. A., and Theis, F. J. (2017). Assessment of batch-correction
methods for scRNA-seq data with a new test metric. bioRxiv, page 200345.
Cannoodt, R., Saelens, W., and Saeys, Y. (2016). Computational methods for trajectory inference from
single-cell transcriptomics. Eur. J. Immunol., 46(11):2496–2506.
Deng, Q., Ramskold, D., Reinius, B., and Sandberg, R. (2014). Single-cell RNA-seq reveals dynamic, random
monoallelic gene expression in mammalian cells. Science, 343(6167):193–196.
Gierahn, T. M., Wadsworth, 2nd, M. H., Hughes, T. K., Bryson, B. D., Butler, A., Satija, R., Fortune, S.,
Love, J. C., and Shalek, A. K. (2017). Seq-Well: Portable, low-cost RNA sequencing of single cells at high
throughput. Nat. Methods, 14(4):395–398.
Guo, M., Wang, H., Potter, S. S., Whitsett, J. A., and Xu, Y. (2015). SINCERA: A pipeline for single-cell
RNA-seq profiling analysis. PLoS Comput Biol, 11(11):e1004575.
Haghverdi, L., Lun, A. T. L., Morgan, M. D., and Marioni, J. C. (2017). Correcting batch effects in single-cell
RNA sequencing data by matching mutual nearest neighbours. bioRxiv, page 165118.
Hashimshony, T., Senderovich, N., Avital, G., Klochendler, A., de Leeuw, Y., Anavy, L., Gennert, D., Li,
S., Livak, K. J., Rozenblatt-Rosen, O., Dor, Y., Regev, A., and Yanai, I. (2016). CEL-seq2: Sensitive
highly-multiplexed single-cell RNA-seq. Genome Biol, 17(1).
Hashimshony, T., Wagner, F., Sher, N., and Yanai, I. (2012). CEL-seq: Single-cell RNA-seq by multiplexed
linear amplification. Cell Reports, 2(3):666–673.
Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., Lönnerberg, P., and Linnarsson, S.
(2013). Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Meth, 11(2):163–166.
Jaitin, D. A., Kenigsberg, E., Keren-Shaul, H., Elefant, N., Paul, F., Zaretsky, I., Mildner, A., Cohen,
N., Jung, S., Tanay, A., and Amit, I. (2014). Massively parallel single-cell RNA-seq for marker-free
decomposition of tissues into cell types. Science, 343(6172):776–779.
Kharchenko, P. V., Silberstein, L., and Scadden, D. T. (2014). Bayesian approach to single-cell differential
expression analysis. Nat Meth, 11(7):740–742.
Kiselev, V. Y. and Hemberg, M. (2017). scmap - a tool for unsupervised projection of single cell RNA-seq
data. bioRxiv, page 150292.
Kiselev, V. Y., Kirschner, K., Schaub, M. T., Andrews, T., Yiu, A., Chandra, T., Natarajan, K. N., Reik, W.,
Barahona, M., Green, A. R., and Hemberg, M. (2017). SC3: Consensus clustering of single-cell RNA-seq
data. Nat Meth, 14(5):483–486.
Klein, A., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz, D., and
Kirschner, M. (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.
Cell, 161(5):1187–1201.
Kolodziejczyk, A., Kim, J. K., Svensson, V., Marioni, J., and Teichmann, S. (2015). The technology and
biology of single-cell RNA sequencing. Molecular Cell, 58(4):610–620.
Lun, A. T. L., Bach, K., and Marioni, J. C. (2016). Pooling across cells to normalize single-cell RNA
sequencing data with many zero counts. Genome Biol, 17(1).
Levine, J., Simonds, E., Bendall, S., Davis, K., Amir, E.-a., Tadmor, M., Litvin, O., Fienberg, H., Jager, A.,
Zunder, E., Finck, R., Gedman, A., Radtke, I., Downing, J., Pe’er, D., and Nolan, G. (2015). Data-driven
phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell,
162(1):184–197.
Li, W. V. and Li, J. J. (2017). scImpute: Accurate and robust imputation for single cell RNA-Seq data.
bioRxiv, page 141598.
Macosko, E., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.,
Kamitaki, N., Martersteck, E., Trombetta, J., Weitz, D., Sanes, J., Shalek, A., Regev, A., and McCarroll, S.
(2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell,
161(5):1202–1214.
McCarthy, D. J., Campbell, K. R., Lun, A. T. L., and Wills, Q. F. (2017). Scater: Pre-processing, quality
control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics, page btw777.
Muraro, M., Dharmadhikari, G., Grün, D., Groen, N., Dielen, T., Jansen, E., van Gurp, L., Engelse, M.,
Carlotti, F., de Koning, E., and van Oudenaarden, A. (2016). A single-cell transcriptome atlas of the
human pancreas. Cell Systems, 3(4):385–394.e3.
Picelli, S., Björklund, Å. K., Faridani, O. R., Sagasser, S., Winberg, G., and Sandberg, R. (2013). Smart-seq2
for sensitive full-length transcriptome profiling in single cells. Nat Meth, 10(11):1096–1098.
Picelli, S., Faridani, O. R., Björklund, Å. K., Winberg, G., Sagasser, S., and Sandberg, R. (2014). Full-length
RNA-seq from single cells using Smart-seq2. Nat. Protoc., 9(1):171–181.
Regev, A., Teichmann, S., Lander, E. S., Amit, I., Benoist, C., Birney, E., Bodenmiller, B., Campbell, P.,
Carninci, P., Clatworthy, M., Clevers, H., Deplancke, B., Dunham, I., Eberwine, J., Eils, R., Enard, W.,
Farmer, A., Fugger, L., Göttgens, B., Hacohen, N., Haniffa, M., Hemberg, M., Kim, S. K., Klenerman, P.,
Kriegstein, A., Lein, E., Linnarsson, S., Lundeberg, J., Majumder, P., Marioni, J., Merad, M., Mhlanga,
M., Nawijn, M., Netea, M., Nolan, G., Pe’er, D., Philippakis, A., Ponting, C. P., Quake, S. R., Reik, W.,
Rozenblatt-Rosen, O., Sanes, J. R., Satija, R., Schumacher, T., Shalek, A. K., Shapiro, E., Sharma, P.,
Shin, J., Stegle, O., Stratton, M., Stubbington, M. J. T., van Oudenaarden, A., Wagner, A., Watt, F. M.,
Weissman, J. S., Wold, B., Xavier, R. J., Yosef, N., and Human Cell Atlas (2017). The human cell atlas.
bioRxiv, page 121202.
Robinson, M. D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis
of RNA-seq data. Genome Biol, 11(3):R25.
Segerstolpe, Å., Palasantza, A., Eliasson, P., Andersson, E.-M., Andréasson, A.-C., Sun, X., Picelli, S.,
Sabirsh, A., Clausen, M., Bjursell, M. K., Smith, D., Kasper, M., Ämmälä, C., and Sandberg, R. (2016).
Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metabolism,
24(4):593–607.
Soumillon, M., Cacchiarelli, D., Semrau, S., van Oudenaarden, A., and Mikkelsen, T. S. (2014).
Characterization of directed differentiation by high-throughput single-cell RNA-Seq. bioRxiv, page 003236.
Stegle, O., Teichmann, S. A., and Marioni, J. C. (2015). Computational and analytical challenges in single-
cell transcriptomics. Nat Rev Genet, 16(3):133–145.
Svensson, V., Natarajan, K. N., Ly, L.-H., Miragaia, R. J., Labalette, C., Macaulay, I. C., Cvejic, A., and
Teichmann, S. A. (2017). Power analysis of single-cell RNA-sequencing experiments. Nat Meth,
14(4):381–387.
Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B. B.,
Siddiqui, A., Lao, K., and Surani, M. A. (2009). mRNA-seq whole-transcriptome analysis of a single cell.
Nat Meth, 6(5):377–382.
Tung, P.-Y., Blischak, J. D., Hsiao, C. J., Knowles, D. A., Burnett, J. E., Pritchard, J. K., and Gilad, Y.
(2017). Batch effects and the effective design of single-cell gene expression studies. Sci. Rep., 7:39921.
van Dijk, D., Nainys, J., Sharma, R., Kathail, P., Carr, A. J., Moon, K. R., Mazutis, L., Wolf, G.,
Krishnaswamy, S., and Pe’er, D. (2017). MAGIC: A diffusion-based imputation method reveals gene-gene
interactions in single-cell RNA-sequencing data. bioRxiv, page 111591.
Welch, J. D., Hartemink, A. J., and Prins, J. F. (2016). SLICER: Inferring branched, nonlinear cellular
trajectories from single cell RNA-seq data. Genome Biol, 17(1).