Co-expression
doi: 10.1093/bib/bbw139
Advance Access Publication Date: 10 January 2017
Paper
Abstract
Gene co-expression networks can be used to associate genes of unknown function with biological processes, to prioritize
candidate disease genes or to discern transcriptional regulatory programmes. With recent advances in transcriptomics and
next-generation sequencing, co-expression networks constructed from RNA sequencing data also enable the inference of
functions and disease associations for non-coding genes and splice variants. Although gene co-expression networks typically
do not provide information about causality, emerging methods for differential co-expression analysis are enabling the
identification of regulatory genes underlying various phenotypes. Here, we introduce and guide researchers through a (differential)
co-expression analysis. We provide an overview of methods and tools used to create and analyse co-expression
networks constructed from gene expression data, and we explain how these can be used to identify genes with a regulatory
role in disease. Furthermore, we discuss the integration of other data types with co-expression networks and offer future
perspectives of co-expression analysis.
Key words: transcriptomics; functional genomics; disease gene prediction; next-generation sequencing; network analysis
Sipko van Dam is a researcher at the Department of Genetics, UMC Groningen. He carried out his doctoral work at the University of Liverpool creating and
analysing a co-expression network constructed from public RNA-seq data.
Urmo Võsa is a researcher in the Department of Genetics, UMC Groningen. His main interests lie in the genetics of gene expression and integration of data
from different layers of genomic complexity to untangle the causes of complex diseases.
Adriaan van der Graaf is a master’s student of Molecular Biology and Biotechnology at the Department of Genetics, UMC Groningen, focusing on novel
statistical techniques in the analysis of expression data.
Lude Franke is an associate professor at the Department of Genetics at the University Medical Centre Groningen. He is a statistical geneticist, working on
analysing data on the genetics of complex and autoimmune diseases (e.g. celiac disease).
João Pedro de Magalhães is a reader at the University of Liverpool where he leads the Integrative Genomics of Ageing Group (http://pcwww.liv.ac.uk/aging/).
The group’s research integrates different strategies but its focal point is developing and applying experimental and computational methods that help
bridge the gap between genotype and phenotype, and help decipher the human genome and how it regulates complex processes like ageing.
Submitted: 12 September 2016; Received (in revised form): 1 December 2016
© The Author 2017. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
576 | van Dam et al.
Gene co-expression networks can be used for various purposes, including candidate disease gene prioritization,
functional gene annotation (Figure 1) and the identification of regulatory genes. However, co-expression networks are
effectively only able to identify correlations; they indicate which genes are active simultaneously, which often
indicates they are active in the same biological processes, but do not normally confer information about causality or
distinguish between regulatory and regulated genes. An increasingly used method that goes beyond traditional
co-expression networks is differential co-expression analysis [4–7]. This approach identifies genes with varying
co-expression partners under different conditions, such as disease states [4, 8–10], tissue types [11] and
developmental stages [12], because these genes are more likely to be regulators that underlie phenotypic differences.
The regulatory roles of such genes can be further investigated by integrating data types such as protein–protein
interactions, methylome data, interactions between transcription factors (TFs) and their targets, and with sequence
motif analysis of co-expressed genes [13–15]. This aids in the identification of regulatory elements such as TFs,
expression quantitative trait loci (eQTLs) and methylation patterns that affect the expression and composition of
co-expression modules.
Gene expression and regulation can be highly tissue-specific, and most disease-related genes have tissue-specific
expression abnormalities [16, 17]. The increased availability of expression data for multiple tissues has allowed for
differential co-expression analysis, which can identify both tissue-specific signatures and shared co-expression
signatures [11]. These tissue-specific signatures can be disrupted in tissue-specific diseases and would not be detected
in analyses aggregating multiple tissues. Even when no sample classification is available, subpopulation-specific
modules can be resolved, an approach that has been particularly successful in classifying different cancer subtypes to
provide prognostic markers [18–20]. Differential co-expression analysis is also useful for analysing data sets in which
the subpopulations are unknown, e.g. large-scale single-cell RNA-seq data [5, 12]. While differential co-expression
methods are sensitive to noise [21], they are becoming more effective with the increase in RNA-seq data quantity and
quality. RNA-seq further permits co-expression analysis to focus on splice variants and non-coding RNAs.
In this review, we provide an introduction and overview of what constitutes a co-expression network, followed by a guide
of the different steps in co-expression analysis using RNA-seq data. We then describe commonly used and newly emerging
methods and tools for co-expression analysis, with a focus on differential co-expression analysis to identify regulatory
genes that underlie disease. We conclude with a discussion of the integration of co-expression networks with other types
of data, to e.g. infer regulatory processes, and with future prospects and remaining challenges in the field.
Co-expression networks
A co-expression network identifies which genes have a tendency to show a coordinated expression pattern across a group
of samples. This co-expression network can be represented as a gene–gene similarity matrix, which can be used in
downstream analyses (Figure 1). Canonical co-expression network construction and analyses can be described with the
following three steps.
In the first step, individual relationships between genes are defined based on correlation measures or mutual
information [22–24] between each pair of genes. These relationships describe the similarity between expression patterns
of the gene pair across all the samples. Different measures of correlation have been used to construct networks,
including Pearson’s or Spearman’s correlations [25, 26]. Alternatively, least absolute error regression [27] or a
Bayesian approach [28] can be used to construct a co-expression network. The latter two have the added benefit that they
can be used to identify causal links and have been explained elsewhere [29]. For a discussion of other types of
similarity measures, we refer to [30]. Many of these similarity metrics can also be used to construct protein–protein
Gene co-expression analysis | 577
interaction networks, which were compared using cancer data in [31].
In the second step, co-expression associations are used to construct a network where each node represents a gene and
each edge represents the presence and the strength of the co-expression relationship (Figure 1) [32].
In the third step, modules (groups of co-expressed genes) are identified using one of several available clustering
techniques. Clustering in co-expression analyses is used to group genes with similar expression patterns across multiple
samples to
gene pairs is binary, i.e. either 0 or 1, and genes are either connected or unconnected. An un-weighted network can be
created from a weighted network by, for example, considering all genes with a correlation above a certain threshold to
be connected and all others unconnected. We focus on weighted networks in this review because (to date) they have
produced more robust results than un-weighted networks [40].
Microarrays versus RNA-seq data
Co-expression networks can be constructed from gene expres-
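As an illustration of the three construction steps described above (pairwise similarity, network, modules), the following is a minimal sketch rather than the procedure of any particular tool. It assumes Pearson correlation as the similarity measure, a soft-threshold exponent `power` to obtain edge weights (a hypothetical value chosen for the toy data), a hard `cutoff` to derive an un-weighted network, and connected components as a simplified stand-in for a clustering algorithm. The gene names and expression values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def weighted_network(expr, power=6):
    """Steps 1 and 2: correlate every gene pair, then weight edges as |r| ** power.

    `power` is an assumed soft-threshold exponent, not a recommended setting.
    """
    return {
        (g1, g2): abs(pearson(expr[g1], expr[g2])) ** power
        for g1, g2 in combinations(sorted(expr), 2)
    }


def unweighted_network(weights, cutoff=0.5):
    """Binarise the weighted network: a pair is connected if its weight >= cutoff."""
    return {pair for pair, w in weights.items() if w >= cutoff}


def modules(genes, edges):
    """Step 3 (simplified): modules as connected components of the binary network."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


# Invented toy data: gene -> expression across five samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
    "geneD": [10.0, 2.5, 8.1, 3.9, 6.0],
}
weights = weighted_network(expr)
edges = unweighted_network(weights)
found = modules(expr, edges)
print(found)
```

In this toy data, geneA and geneB rise together and geneC and geneD share a second pattern, so two modules are recovered. Real analyses would typically replace the connected-components step with hierarchical or other clustering on the weighted matrix.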
map to the exons of isoform B, all reads mapping to exon X are then assigned to isoform B, resulting in isoform A being
considered as not expressed at all. Although this elegant solution was validated using simulations, no experimental
validation was conducted.
The most common way of constructing RNA-seq-based co-expression networks is to merge all overlapping gene isoforms in the
RNA-seq data analysis and then construct the network at the gene level. This approach, however, loses information about
different transcripts encoded by the same gene. Alternatively,
co-expression studies have used a cut-off of 10 million reads per sample [2, 21, 60]. Co-expression networks constructed
using this cut-off have been suggested to have a similar quality to microarray-based co-expression networks if
constructed from the same number of samples [21], but decreasing in quality with fewer reads. The percentage of mapped
reads is another frequently considered cut-off in which samples with <70% or 80% of the reads mapping to the genome are
removed. Giorgi et al. demonstrated, using 65 Arabidopsis thaliana samples with 12 million reads but applying only a 30%
mapping cut-off
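The read-depth and mapping-rate cut-offs mentioned above can be applied as a simple per-sample filter. The sketch below assumes the 10 million read and 70% mapping thresholds quoted in the text as defaults; the function name, the dictionary layout and the sample statistics are hypothetical and only serve to illustrate the idea.

```python
def filter_samples(samples, min_reads=10_000_000, min_mapped_fraction=0.70):
    """Return the names of samples that pass both quality cut-offs.

    A sample is kept if it has at least `min_reads` reads in total and at
    least `min_mapped_fraction` of those reads map to the genome.
    """
    kept = []
    for name, stats in samples.items():
        mapped_fraction = stats["mapped_reads"] / stats["total_reads"]
        if stats["total_reads"] >= min_reads and mapped_fraction >= min_mapped_fraction:
            kept.append(name)
    return sorted(kept)


# Invented sample statistics for illustration.
samples = {
    "sample_1": {"total_reads": 25_000_000, "mapped_reads": 22_000_000},  # 88% mapped
    "sample_2": {"total_reads": 8_000_000, "mapped_reads": 7_500_000},    # too few reads
    "sample_3": {"total_reads": 15_000_000, "mapped_reads": 9_000_000},   # 60% mapped
}
kept = filter_samples(samples)
print(kept)
```

Loosening `min_mapped_fraction` (for example to the 30% cut-off used in the study cited above) would retain `sample_3` as well, which shows how strongly the choice of threshold determines the final sample set.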
Quality control
FastQC [81] (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
    A tool that uses .fastq, .bam or .sam files to identify and highlight potential issues in the data, such as low base quality scores, low sequence quality and GC content biases.
    + Can be used either with or without user interface.
    − Uses only the first 200 000 sequences in the file.
RSeQC [82] (http://rseqc.sourceforge.net/)
    + A tool with a wider range of quality control measures than FastQC.
    + Can also be used on mapped data to obtain information on metrics such as the preva-
Table 1. Continued
Tool/method    Description, strengths (+) and limitations (−)
    after dividing the expression of each gene by the geometric mean of the given gene across all samples. This differs from the normalization implemented in the DESeq2 differential expression analysis.
    Implemented in the DESeq2 R package.
Correction for batch effects
Limma-removeBatchEffect [97]
    A method which uses linear models to correct for batch effects.
Svaseq [98] (https://github.com/jtleek/svaseq)
    This method estimates biases based on genes that have no phenotypic expression effects, which are then used for correction of the data.
Genie3 [113]
    A tool that incorporates TF information to construct a regulatory network by determining the TF expression pattern that best explains the expression of each of their target genes.
    + Creates directional networks.
    − Requires TF information.
CoRegNet [114]
    A tool that identifies co-operative regulators of genes from different data types.
cMonkey [115]
    Calculates joint bicluster membership probability from different data types by identifying groups of genes that group together in multiple data types.
Visualization
types [5] or species [132, 133]. Below, we provide an overview of commonly used and newly emerging methods and tools,
separated into two categories: (1) approaches that identify differential co-expression between predefined sample groups
(such as conditions, time points or tissue types) and (2) approaches that do not require prior knowledge about sample
groups and use an algorithm that identifies co-expression clusters in a priori unknown subpopulations of the samples.
Differential co-expression analysis between sample groups
Most differential co-expression analyses rely on differential clustering; they identify clusters that contain different
genes or behave differently under changing conditions or phenotypes. The most frequently used programs for differential
clustering analysis, which have also been compared with other programs, are WGCNA [54], DICER [4] and DiffCoEx [100],
all of which first identify modules co-expressed across the full set of study samples. These co-expressed modules can
then be correlated to predefined sample subpopulations representing, for example, disease status or tissue type.
WGCNA determines the activity and importance of each module in each subpopulation of samples (Figure 3A and 3C). For each
module, an eigengene is calculated, which is the vector that best describes the expression behaviour (in a linear
fashion) of all genes within this module in the samples included in the analysis. It then prioritizes which genes in
these modules are likely to underlie the phenotype associated with the module by identifying either genes behaving
similarly to the eigengene of the module or those genes that are intra-modular hub genes (these tend to coincide). By
design, DICER is tailored to identify module pairs that correlate differently between sample groups, e.g. modules that
form one large interconnected module in one group compared with several smaller modules in another (Figure 3D). DICER may
be particularly useful for time series experiments in which co-expression changes are gradual, e.g. cell cycle series
experiments, where modules are specific to a particular phase and co-expressed in transitions between phases.
DiffCoEx focuses on modules that are differentially co-expressed with the same sets of genes. The most extreme case of
this behaviour is sets of genes that ‘hop’ from one set of correlated genes to another in a coordinated manner
(Figure 3E). In this case, DiffCoEx would cluster ‘hopping’ genes in a similar manner. DINGO is a more recent tool that
works similarly to DiffCoEx by grouping genes based on how differently they behave in a particular subset of samples
(representing e.g. a particular condition) from the baseline co-expression determined from all samples [102]. These are
the most likely genes to explain different phenotypes that are associated with the two different networks. Each of the
methods detects specific module changes by design, but they can also detect modular changes that they were not
specifically designed for and may outperform other tools in the identification of these changes [130].
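Two of the ideas in this section can be sketched in a few lines: a module eigengene and a between-group comparison of correlations. The sketch below is not the WGCNA or DiffCoEx implementation. It assumes the eigengene is the first principal component (across samples) of the standardized module genes, estimated here by power iteration, and it assumes a plain difference of Pearson correlations between two predefined sample groups as the measure of differential co-expression. All names and expression values are invented.

```python
from math import sqrt


def zscore(values):
    """Centre and scale one expression profile (sample standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    return sum(a * b for a, b in zip(zscore(x), zscore(y))) / (len(x) - 1)


def eigengene(profiles, iterations=200):
    """First principal component over samples of the z-scored module genes."""
    z = [zscore(p) for p in profiles]
    n = len(z[0])
    # Sample-by-sample cross-product matrix of the standardized profiles.
    cross = [[sum(g[i] * g[j] for g in z) for j in range(n)] for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        vec = [sum(cross[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Orient the eigengene so that it follows the average module expression.
    mean_profile = [sum(g[i] for g in z) / len(z) for i in range(n)]
    if sum(a * b for a, b in zip(vec, mean_profile)) < 0:
        vec = [-v for v in vec]
    return vec


def differential_correlation(expr, group_a, group_b):
    """For every gene pair: correlation in group A minus correlation in group B."""
    genes = sorted(expr)
    diff = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r_a = pearson([expr[g1][s] for s in group_a], [expr[g2][s] for s in group_a])
            r_b = pearson([expr[g1][s] for s in group_b], [expr[g2][s] for s in group_b])
            diff[(g1, g2)] = r_a - r_b
    return diff


# Invented toy data: samples 0-3 form group A, samples 4-7 form group B.
expr = {
    "geneX": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "geneY": [2.0, 4.0, 6.0, 8.0, 8.0, 6.0, 4.0, 2.0],
    "geneZ": [1.0, 3.0, 2.0, 5.0, 1.0, 3.0, 2.0, 5.0],
}
group_a, group_b = [0, 1, 2, 3], [4, 5, 6, 7]

module_eigengene = eigengene([expr["geneX"], expr["geneZ"]])
membership = {g: pearson(expr[g], module_eigengene) for g in ("geneX", "geneZ")}
diff = differential_correlation(expr, group_a, group_b)
print(membership)
print(diff)
```

Genes whose profile correlates strongly with the eigengene correspond to the candidates "behaving similarly to the eigengene" described above. In the toy data, geneY switches from positive to negative correlation with geneX between the groups, which is the kind of change a differential co-expression method is designed to flag, whereas the geneX and geneZ relationship is unchanged.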
A number of studies have used differential co-expression network analyses to identify networks unique to specific
tissues [11] or disease states [134]. The rapid increase in publicly available RNA-seq data and projects such as GTEx and
ENCODE, which generate large-scale RNA-seq profiles, has enabled co-expression analysis within and across different
tissues [11, 15]. The GTEx project collects and provides expression data from multiple human tissues for the study of
gene expression, regulation and their relationship to genetic variation [135]. In a study comparing RNA-seq data from 35
tissues from the GTEx data set, a tissue hierarchy was constructed based on the average gene expression in each tissue.
Related tissues, such as those from different brain regions, clustered together. This hierarchy was used to construct a
single combined co-expression network derived from the tissue-specific co-expression networks (a meta-network). It was
then shown that in tissue-specific networks, TFs with functions specific to that tissue tend to be highly expressed
together with tissue-specific genes. These genes tend to form a stronger connection with each other than with other
genes, but remain at the periphery of the network (thus having low centrality), while the tissue-specific TFs become
more central to that module [11]. Thus, tissue-specific TFs could be uncovered by identifying modules with increased
co-expression strength in tissue-specific networks (Figure 3A and 3C) and by pinpointing the central hubs of these
modules. In contrast, genes that are not TFs but are tissue-specific should be detectable by identifying genes that are
at the periphery in these modules (Figure 3B). Moreover, some TFs have different roles in different tissues. These TFs
would be expected to be hub genes that are central to one module under one condition and central to another module in
another condition.
Differentially connected genes are those with different co-expression partners between two sample groups. These genes
appear to play a regulatory part in the difference in the phenotype observed between two groups (Figure 3D) [8–10]. For
example, one study compared co-expression in mutant cattle with increased muscle growth with co-expression in
non-mutants, using a method similar to DiffCoEx. By identifying the most differentially expressed genes and TFs showing
the highest differential connection to these genes [10] (Figure 3D), the TF containing the causal mutation (myostatin)
was identified. Interestingly, the Mstn gene, which encodes this TF, hardly changed in expression itself, providing an
example of how differential co-expression analysis can uncover biologically important findings not revealed by
differential expression analysis alone.
Not all methods construct a co-expression network to assess differential expression. GSNCA [103] can be used to identify
differentially co-expressed gene sets, which have to be defined a priori, between two sample groups. In the first step
this method determines weight vectors for each sample group, from a correlation network. These weight vectors represent
the cross-
correlation of each gene with all the other genes, effectively summarizing a correlation matrix into a single vector,
describing a weight for each gene. These weights for the genes representing a certain gene set are then compared between
two sample groups, to determine whether the gene set is differentially co-expressed.
Generalized Singular Value Decomposition (GSVD)
Generalized Singular Value Decomposition (GSVD) is a unique type of differential co-expression analysis that relies on
spectral
common in cancer, where different mutations can lead to different alterations in co-expression patterns but a similar
phenotype [7]. Biclustering allows researchers to disentangle the mechanisms in the cases where predefining biologically
relevant sample groups is difficult. For this purpose, biclustering is more effective than other co-expression analysis
methods [7].
Cheng et al. were first to use biclustering in co-expression analysis [141], followed by the development and application
of many more biclustering approaches (reviewed by Pontes et al. [106]). The choice of biclustering method depends on the
num-
has been argued to perform better than DiffCoEx and CoXpress [4] based on functional enrichment analysis of
differentially expressed modules. HO-GSVD outperformed WGCNA and DiffCoEx based on its ability to detect clusters in
simulated data [136]. Although biclustering is a powerful approach, it does not necessarily perform better than other
network analysis methods such as WGCNA, as shown by a comparison using different tools on simulated data [144]. However,
as discussed earlier, biclustering can be performed without the need for prior sample group classification.
types of data can help to prioritize genes that may underlie a phenotype. This can be achieved, for example, using
information describing which genes are TFs, as is done for regulatory predictions by GENIE3 [113]. However, a focus on
TFs is rarely sufficient, and integration of multiple data types is often required to increase the accuracy and
usefulness of the resulting networks [13, 147].
TF binding site analysis
Genome-wide transcription factor binding site (TFBS) analysis
sometimes functionally connected with the processes or pathways associated with the corresponding disease. Good examples
of this are IFN (interferon)-α and complement pathways in which several genes were under trans-regulation of a systemic
lupus erythematosus-associated variant, possibly via cis-regulation of IKZF1 [155]. The integration of regulatory genetic
variant information into co-expression network analysis, with cis-eQTLs used as causal anchors, identified TYROBP as the
most likely causal factor in late-onset Alzheimer disease patients, a finding supported by the observation that
mutations in this gene are known to cause Nasu-Hakola disease [128]. Lastly, copy number variation can affect gene
expression levels, and including such information may help identify and/or explain alterations in co-expression network
structures present in diseases or traits [138].
Overall, integration of multiple data types increases the accuracy of the resulting predictions [13, 147]. For example,
modules unique to different subtypes of cancer were identified by integrating tumour genome sequences with gene networks
[166], and these modules may be useful for prognosis and identification of putative targets for personalized
medicine-based treatments. A number of tools, described earlier in this review, can be used for differential
co-expression analysis, but can also be applied to other data types. In the initial DINGO publication, the authors
conducted a combined analysis on mRNA expression, DNA copy number variation and methylation data. By overlaying the
differential networks of each data type and identifying edges present in all of them, a number of genes from the PI3K
pathway were identified as important players in glioblastoma multiforme patients [102]. This pathway is an
already-established therapeutic target, supporting the notion that this is an effective approach for identifying
relevant targets for disease studies [167]. A recently published tool, CoRegNet, allows the integration of different
types of data in a co-expression analysis by identifying co-operative regulators of genes from different data types
[114]. Another established approach, cMonkey, achieves similar data integration by calculating the joint bicluster
membership probability from different data types by identifying groups of genes that group together in multiple data
types [115].
Future prospects
In recent years, differential co-expression analyses have been increasingly used to analyse large data sets. This may be
attributed to the decreased costs of large-scale gene expression profiling, in particular RNA-seq, to increased sample
sizes, and to the greater availability of tissue-specific data from perturbation experiments, which are required for
fruitful differential co-expression analyses [103, 168]. Likewise, biclustering algorithms have benefitted from larger
sample sizes and higher data quality, as shown by the identification of co-expressed modules unique to cancer subtypes
[18, 20]. The usefulness of biclustering on single-cell RNA-seq data has been demonstrated by the classification of
different cell types and by the identification of clusters of genes uniquely co-expressed in specific cell types [5]. We
expect these approaches to be more widely applied in the future, as they benefit from an increase in RNA-seq data
quantity and quality, which will allow for more accurate identification of tissue-specific and cell-type-specific
disease-related modules and regulators.
Large-scale single-cell sequencing technology is increasingly used and the first co-expression studies using such
techniques have uncovered cell-type-specific co-expression modules that
would have gone undetected in multi-cell-type co-expression analyses [5, 12]. Because the latter represent the
aggregated signals of multiple cell types, they usually cannot detect alterations in cell subpopulations between
different experimental groups. This is supported by the observation that the expression of cell cycle genes associated
with ageing decreased in the analysis of non-cell-type-specific data [169]. However, data from single-cell experiments
revealed that this observation was caused by a decreased proportion of the G1/S cells that highly express cell cycle
genes rather than by altered expression across
data, and the tools that exist mostly integrate only two layers of omics data [177]. Integrated network analyses come
with additional mathematical challenges, and best practices are far from established. Further research on this topic is
of great interest to the research community, as it will allow a better understanding of regulatory mechanisms that can
explain co-expression patterns and disease mechanisms. A better understanding of these disease mechanisms and
corresponding co-expression patterns will facilitate the identification of appropriate targets for intervention studies.
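The cell-composition effect described above for cell cycle genes can be made concrete with a small worked example. The sketch assumes that bulk expression is simply the proportion-weighted sum of per-cell-type expression; the proportions and expression values are invented, and the per-cell expression is deliberately kept identical in both conditions.

```python
def bulk_expression(proportions, per_cell_expression):
    """Aggregate (bulk) signal as the proportion-weighted sum over cell types."""
    return sum(proportions[c] * per_cell_expression[c] for c in proportions)


# Invented values: per-cell expression of a cell cycle gene does NOT change with age.
per_cell = {"G1/S": 10.0, "other": 1.0}

# Only the fraction of G1/S cells differs between the two conditions.
young = {"G1/S": 0.30, "other": 0.70}
old = {"G1/S": 0.10, "other": 0.90}

bulk_young = bulk_expression(young, per_cell)  # 0.30 * 10 + 0.70 * 1
bulk_old = bulk_expression(old, per_cell)      # 0.10 * 10 + 0.90 * 1
print(bulk_young, bulk_old)
```

The bulk signal roughly halves even though no individual cell changed its expression, which is why aggregated data alone cannot separate a shift in cell proportions from a genuine change in regulation.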
ChIP-chip
This method identifies TFBSs by immunoprecipitation of the TF together with bound DNA fragments (chromatin
immunoprecipitation, ChIP). A DNA microarray is subsequently used to identify the sequences where the corresponding TF
is bound.
ChIP-seq
This method uses the same approach as ChIP-chip, but uses high-throughput DNA sequencing rather than a microarray to
identify TFBSs.
Clustering
Module
A group of co-expressed genes that form a sub-network in the larger network, usually defined by applying clustering
algorithms on a co-expression network or directly on expression profiles.
Mutual information
The measure of dependence between two otherwise unrelated variables.
Network robustness
A measure of how resistant a network is to the removal of sin-
13. Glass K, Huttenhower C, Quackenbush J, et al. Passing messages between biological networks to refine predicted interactions. PLoS One 2013;8:e64832.
14. De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nat Rev Microbiol 2010;8:717–29.
15. Yue F, Cheng Y, Breschi A, et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 2014;515:355–64.
16. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.
neuronal activity-dependent genes in autism. Nat Commun 2014;5:5748.
35. de Magalhaes JP, Finch CE, Janssens G. Next-generation sequencing in aging research: emerging applications, problems, pitfalls and possible solutions. Ageing Res Rev 2010;9:315–23.
36. Chen J, Bardes EE, Aronow BJ, et al. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res 2009;37:W305–11.
37. Mason MJ, Fan G, Plath K, et al. Signed weighted gene co-
55. Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol 2016;17:13.
56. Li J, Bushel PR. EPIG-Seq: extracting patterns and identifying co-expressed genes from RNA-Seq data. BMC Genomics 2016;17:255.
57. Bacher R, Kendziorski C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol 2016;17:63.
58. Dillies MA, Rau A, Aubert J, et al. A comprehensive evaluation of normalization methods for Illumina high-
76. Zhao W, Langfelder P, Fuller T, et al. Weighted gene coexpression network analysis: state of the art. J Biopharm Stat 2010;20:281–300.
77. Rodius S, Androsova G, Gotz L, et al. Analysis of the dynamic co-expression network of heart regeneration in the zebrafish. Sci Rep 2016;6:26822.
78. Gaiteri C, Ding Y, French B, et al. Beyond modules and hubs: the potential of gene coexpression networks for investigating molecular mechanisms of complex brain disorders. Genes Brain Behav 2014;13:13–24.
98. Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res 2014;42.
99. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007;8:118–27.
100. Tesson BM, Breitling R, Jansen RC. DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules. BMC Bioinformatics 2010;11:497.
101. Watson M. CoXpress: differential co-expression in gene expression data. BMC Bioinformatics 2006;7:509.
119. Zimmermann P, Hirsch-Hoffmann M, Hennig L, et al. GENEVESTIGATOR. Arabidopsis microarray database and analysis toolbox. Plant Physiol 2004;136:2621–32.
120. Greene CS, Krishnan A, Wong AK, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet 2015;47:569–76.
121. Singer GAC, Lloyd AT, Huminiecki LB, et al. Clusters of co-expressed genes in mammalian genomes are conserved by natural selection. Mol Biol Evol 2005;22:767–75.
122. Segal E, Friedman N, Koller D, et al. A module map showing
139. Macosko EZ, Basu A, Satija R, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 2015;161:1202–14.
140. Klein AM, Mazutis L, Akartuna I, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 2015;161:1187–201.
141. Cheng Y, Church GM. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 2000;8:93–103.
142. Oghabian A, Kilpinen S, Hautaniemi S, et al. Biclustering methods: biological relevance and application in gene ex-
159. John B, Enright AJ, Aravin A, et al. Human MicroRNA targets. PLoS Biol 2004;2:e363.
160. Vlachos IS, Paraskevopoulou MD, Karagkouni D, et al. DIANA-TarBase v7.0: indexing more than half a million experimentally supported miRNA:mRNA interactions. Nucleic Acids Res 2015;43:D153–9.
161. Chou CH, Chang NW, Shrestha S, et al. miRTarBase 2016: updates to the experimentally validated miRNA-target interactions database. Nucleic Acids Res 2016;44:D239–47.
162. Kozomara A, Griffiths-Jones S. miRBase: annotating high