Report
1. ABSTRACT
Cancer is often accompanied by other complications, one
example being the homeostasis problems seen in affected
patients. The pathogenesis of these problems is still poorly
understood, and further characterization and analysis are
needed. The study by Isella C. et al. (2022) reported a
correlation between the MET oncogene, with which these
homeostasis problems are strongly associated, and coagulation
factor XII (F12).
Starting from the same data used in Isella C.’s project, a
general analysis of the samples and of the associated genes
was carried out, together with a comparison of the different
methods that can be applied to them. External software such
as STRINGdb was also used for further data retrieval and
processing.
2. INTRODUCTION
2.1. COLORECTAL CANCER (CRC)
CRC is a type of cancer characterized by the abnormal
growth of cells in the colon, the first and longest part of
the large intestine. Although it can occur at any age, it is
most frequent in older individuals. It usually starts with the
formation of small lumps of cells, called “polyps”, that can
be easily identified inside the colon. Polyps do not cause
symptoms and are not directly dangerous, but they have the
potential to become cancerous, and their removal is
associated with a better outcome against the cancer itself.
Figure 1, Colon section of the Large Intestine
Some of the symptoms that may arise and that can be related
to CRC are:
Change in bowel habits: frequent diarrhoea or constipation
Rectal bleeding
Belly discomfort: cramps, gas or general pain
Weakness
Loss of weight
Incomplete bowel emptying: feeling that the bowel is not completely empty after a bowel movement
As stated, one of the main risk factors associated with CRC
is old age. Other risk factors are:
Race: the disease appears to be linked to race, with a higher risk reported for Black people in the US
Family and personal history
Inflammatory bowel diseases, such as Crohn’s disease or ulcerative colitis
Inherited disease
Low-fiber, high-fat diet: the typical Western diet; links with processed and red meat have also been reported
Absence of exercise
Obesity
Diabetes
Smoking
Alcohol consumption
2.2. GSE52060
The dataset presented here is the one used in the
aforementioned study. It consists of 46 samples taken from
patients with Colorectal Cancer (CRC): from each patient, one
sample was taken from the neoplastic tissue and one from the
normal mucosal tissue. These two groups later become the main
factors for the comparison of methodologies and for the
analysis.
The GSE file available on GEO (Gene Expression Omnibus) was
fetched and loaded into RStudio. The phenotype data was
extracted from it and a metadata dataframe was built, mainly
to understand additional information about the samples and
how they are grouped, and to find potential variables to use
as factors in later stages of the analysis. The metadata
shows 23 samples of Normal mucosa and 23 of Neoplastic
tissue, for a total of 46 analysable samples.
Figure 2, Metadata dataframe obtained from the GSE52060 file
3. METHODS
3.1. EXPRESSION MATRIX RETRIEVAL
After loading the GSE file and building the metadata
dataframe, the next step was to obtain the expression matrix
from the file, which required only a few commands. The
expression values were then drawn as boxplots to compare the
data before and after a log2 transformation.
Figure 3, Boxplot of pre-normalized GSE52060 expression data
Figure 4, Boxplot of pre-normalized GSE52060 expression data
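The before-and-after comparison can be illustrated with a minimal sketch that applies a log2 transformation to the intensities of one sample and computes the five values a boxplot draws. It is written in Python purely for illustration (the analysis in this report was run in R), and the intensity values and the helper name five_number_summary are hypothetical.

```python
import math
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical raw intensities for a single sample (not taken from GSE52060).
raw = [64.0, 128.0, 256.0, 512.0, 4096.0]

# log2 transform; an offset would be needed first if zeros or negatives occurred.
logged = [math.log2(v) for v in raw]

raw_summary = five_number_summary(raw)
log_summary = five_number_summary(logged)
print("raw :", raw_summary)
print("log2:", log_summary)
```

On the raw scale the largest value dominates the range, while on the log2 scale the values are spread far more evenly, which is why the two boxplots look so different.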
3.2. PRINCIPAL COMPONENT ANALYSIS (PCA)
PCA is a dimensionality-reduction method that summarizes a
large dataset with a smaller set of variables while retaining
most of the important information, patterns and trends. The
reduction comes at the cost of some accuracy, but in the
context of this project it is very useful for identifying
homogeneous subgroups of genes with similar expression
profiles, or samples that show similar trends.
PCA was performed with the prcomp command, after running a
t-test on the dataframe to prioritize genes and select
features. A summary and a screeplot were then produced,
together with other graphs drawn with the autoplot function.
Colours were assigned according to column 8 of the
metadata_df variable, “source_name_ch1”, which records which
of the two tissues each sample came from.
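As a rough illustration of what a principal component is, the sketch below centres a tiny data matrix, builds its covariance matrix and finds the first component by power iteration. This is not the prcomp call used in the report: it is a Python toy under stated assumptions, with hypothetical data and helper names, and it recovers only the first component.

```python
import math

def centre(rows):
    """Subtract each column's mean, as prcomp does by default."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already centred rows (samples x features)."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(v * sum(c * w for c, w in zip(row, vec))
                     for v, row in zip(vec, cov))
    return vec, eigenvalue

# Hypothetical toy data: 4 samples (rows) x 2 probes (columns).
samples = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
centred = centre(samples)
cov = covariance(centred)
pc1, variance_pc1 = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance_pc1 / total_variance
# Scores: coordinates of each sample along the first component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centred]
print("PC1 loadings:", pc1)
print("explained variance ratio:", round(explained, 4))
```

Because the two toy probes are almost perfectly correlated, nearly all the variance falls on the first component, which is the situation a screeplot makes visible.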
3.3.1. CLUSTERING
Clustering is an unsupervised machine learning methodology:
no class values hinting at an a priori grouping of the data
are given. It focuses on finding groups of similar items in a
dataset, and it is commonly applied in everyday life, for
example when grouping clothes, objects, videogames, tools or
documents, or when identifying possible marketing targets.
A clustering algorithm relies on a distance function. The
quality of a clustering depends on the data, the distance
function and the algorithm itself, and in this case a
trade-off between accuracy/quality and computational
feasibility has to be considered.
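The role of the distance function can be seen in a minimal k-means sketch: points are assigned to the nearest centroid, and the centroids are then recomputed until the assignments stop changing. The code is a Python illustration with hypothetical points; the fixed starting centroids are an assumption made for reproducibility and do not reflect how the clustering in this report was initialised.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then update centroids."""
    labels = []
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda k: euclidean(p, centroids[k]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Hypothetical 2-D points forming two visibly separate groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
# Deterministic start: the first and last point serve as initial centroids.
labels, centroids = k_means(points, [list(points[0]), list(points[-1])])
print(labels)
```

Swapping euclidean for another distance function changes which points end up together, which is the dependence on the distance described above.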
3.4. HEATMAP
A heatmap is a two-dimensional data visualization technique
in which values are represented by a gradient of colours,
usually with darker colours for larger values. Here its main
aim is to show how the samples correlate with each other, so
that these relationships can be exploited in later steps.
Like other common methodologies, it is used in many different
fields, from criminology to gene-expression analysis, and it
comes in many variants depending on the type of study.
Before producing the heatmap, a distance matrix was created
from the data of the previous step, and a colour palette was
set for the final plot. The pheatmap function was then used
to draw the heatmap from this distance matrix, together with
the distance_mat variable already obtained in the
Hierarchical Clustering step and the colour palette.
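A sample-to-sample distance matrix of the kind a heatmap displays can be sketched as follows. The Python code is illustrative only: the expression profiles, the sample names and the character ramp that stands in for a colour palette are all hypothetical, and the report itself used pheatmap in R.

```python
import math

def distance_matrix(samples):
    """Symmetric matrix of pairwise Euclidean distances between samples."""
    n = len(samples)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(samples[i], samples[j])
            mat[i][j] = mat[j][i] = d
    return mat

def shade(value, max_value, ramp=" .:*#"):
    """Map a distance to a character; denser characters mean larger distances."""
    index = round(value / max_value * (len(ramp) - 1)) if max_value else 0
    return ramp[index]

# Hypothetical expression profiles: two normal-like and two tumour-like samples.
samples = {"N1": [1.0, 2.0, 1.0], "N2": [1.0, 2.0, 2.0],
           "T1": [5.0, 6.0, 5.0], "T2": [5.0, 6.0, 6.0]}
names = list(samples)
dist = distance_matrix([samples[n] for n in names])
largest = max(max(row) for row in dist)
for name, row in zip(names, dist):
    print(name, "".join(shade(d, largest) for d in row))
```

Samples of the same group produce light blocks along the diagonal and dense blocks elsewhere, the same block structure one looks for in the real heatmap.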
Figure 8, Plot showing the relevance of the first samples of the
dataframe to the quality of the model; only the first samples are
relevant for the model
Figure 9, Plot showing which probes have the highest
relevance/importance in the model, together with how much the quality
of the model would drop if that probe were modified
3.6. LINEAR DISCRIMINANT ANALYSIS (LDA)
LDA is a supervised machine learning approach that addresses
classification problems, including multi-class ones, by
separating the classes through dimensionality reduction. It
is especially useful for optimizing machine learning models.
This technique makes predictions by using Bayes’ theorem to
calculate the probability that an observation belongs to each
class, while also modelling the data distribution. It
maximizes the between-class distance while minimizing the
within-class one by identifying a linear combination of
features that separates two or more classes of objects. The
data are projected onto a lower-dimensional space (a single
axis when there are two classes), which makes classification
easier and keeps the method versatile for multi-class data.
To evaluate the classifier obtained through LDA, the ROC and
CARET methodologies were used. First, the factors "tissue
type: adjacent normal mucosa (N)" and "tissue type:
neoplastic tissue (T)" were created. T-tests were then run on
the expression dataframe with the rowttests function, using
these factors; probes with a p-value lower than 0.05 were
extracted and a new dataframe was built. A column called
“AFFECTED” was added to this df, indicating whether or not
the sample derives from the neoplastic tissue.
The model was then trained with the lda function, taking as
input parameters:
AFFECTED (as formula): identifies the variable to predict, i.e. whether a sample is neoplastic tissue or normal mucosa
The dataframe created before, which contains the factors of interest
The prior probability of each class
The subset of elements to train on
After these steps, the mod.values were predicted with the
predict function, taking as input the fitted model and the
dataframe’s training subset.
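The idea of maximizing between-class distance relative to within-class scatter can be sketched for two classes and two features. The Python code below computes Fisher's discriminant direction and a midpoint cut-off under the assumption of equal priors. It is a simplified illustration with hypothetical data and helper names, not the lda call used in the report.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, mean):
    """2x2 within-class scatter matrix: sum of outer products of deviations."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fit_lda(class0, class1):
    """Fisher discriminant for two classes and two features, equal priors assumed."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Discriminant direction: inverse within-class scatter times the mean difference.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # With equal priors the cut-off is the midpoint of the projected class means.
    threshold = sum(wi * (a + b) / 2 for wi, a, b in zip(w, m0, m1))
    return w, threshold

def predict(w, threshold, row):
    """Return 1 (class1) if the projection exceeds the cut-off, else 0 (class0)."""
    return int(sum(wi * x for wi, x in zip(w, row)) > threshold)

# Hypothetical expression values of two probes; not taken from GSE52060.
normal = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
tumour = [[4.0, 5.0], [4.5, 4.8], [5.0, 5.5], [4.2, 5.2]]
w, threshold = fit_lda(normal, tumour)
predictions = [predict(w, threshold, r) for r in normal + tumour]
print(predictions)
```

Each sample is reduced to a single number, its projection onto w, so the two-class problem becomes a comparison against one threshold on one axis.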
3.6.1. CARET
After completing the first part of the LDA phase, the Caret
methodology was implemented. The package provides a wide
range of functions for data preparation, modelling and
evaluation, with the objective of streamlining the building
and evaluation of predictive models.
CARET works by holding out specific samples and fitting the
model on the remaining ones; the average performance across
all the hold-out predictions is then calculated and the
optimal parameter set is determined. To apply CARET, a
control object and a metric needed to be defined:
1. Control built by the trainControl function using as
input:
a. Method: “cv”, setting the resampling method to
cross-validation
b. Number: “10”, indicating 10-fold cross-validation
c. Repeats: “NA”, the number of repetitions, which applies
only to repeated cross-validation
2. Metric set as “Accuracy”
The two models were then trained with the train function,
using the lda and rf methods and the established metric in
both cases. The results were collected with the resamples
function and plotted with ggplot.
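The hold-out logic behind 10-fold cross-validation can be sketched in a few lines. The Python code below is an illustration only: the nearest-centroid classifier is a stand-in for lda or rf, the data are hypothetical, and the fold assignment is deliberately deterministic, which is not how trainControl builds its folds.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k interleaved folds (no shuffling)."""
    return [list(range(i, n_samples, k)) for i in range(k)]

def nearest_centroid_fit(rows, labels):
    """Stand-in classifier: one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

def nearest_centroid_predict(centroids, row):
    """Label of the centroid closest to row (squared Euclidean distance)."""
    return min(centroids,
               key=lambda lab: sum((x - c) ** 2 for x, c in zip(row, centroids[lab])))

def cross_validate(rows, labels, k):
    """Mean accuracy over k hold-out folds."""
    accuracies = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = nearest_centroid_fit(train_rows, train_labels)
        hits = sum(nearest_centroid_predict(model, rows[i]) == labels[i] for i in fold)
        accuracies.append(hits / len(fold))
    return sum(accuracies) / len(accuracies)

# Hypothetical data: 10 "N" and 10 "T" samples described by a single feature.
rows = [[float(v)] for v in range(10)] + [[float(v)] for v in range(20, 30)]
labels = ["N"] * 10 + ["T"] * 10
folds = k_fold_indices(len(rows), 10)
accuracy = cross_validate(rows, labels, k=10)
print(accuracy)
```

Every sample is held out exactly once, so the averaged accuracy is computed only on predictions for data the model did not see during fitting.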
Figure 11, Plot showing the trend of the quality of the model depending on
the chosen regularization parameter
3.8. RSCUDO
After the implementation of LASSO, the next algorithm used in
the project was RSCUDO. It is a rank-based method: for each
sample it derives a signature from its most up- and
down-regulated genes, and it then builds a network in which
samples with similar signatures are connected, so that
subgroups emerge as clusters. It is mainly used in the
biological context, specifically on transcriptomic and
genomic data.
In this project the phase starts by factorizing the tissue
conditions once again and creating a train group and a test
group. The model was then trained with the scudoTrain
function, the signatures were checked, and different networks
were created. The clusters were differentiated by the subset
of data used in the conditions; a classification procedure,
based on the factorized in_training data, was carried out
only on the final graph.
Figure 12, Example of a network built with the RSCUDO package; the two
main clusters can be seen, together with some elements "on the line"
between them
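The notion of a sample signature can be illustrated with a toy example that ranks the genes of each sample and keeps the most and least expressed ones. The Python code below is a sketch under stated assumptions: gene names and values are hypothetical, and the Jaccard overlap used to compare signatures is a deliberate simplification, not the distance that the RSCUDO package computes.

```python
def signature(profile, n_top, n_bottom):
    """Return the sets of most and least expressed genes in one sample."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    return set(ranked[:n_top]), set(ranked[-n_bottom:])

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

def signature_similarity(sig_a, sig_b):
    """Average overlap of the up- and down-signatures (simplified measure)."""
    return (jaccard(sig_a[0], sig_b[0]) + jaccard(sig_a[1], sig_b[1])) / 2

# Hypothetical expression profiles for three samples over six genes.
s1 = {"g1": 9, "g2": 8, "g3": 5, "g4": 4, "g5": 1, "g6": 0}
s2 = {"g1": 8, "g2": 9, "g3": 4, "g4": 5, "g5": 0, "g6": 1}
s3 = {"g1": 0, "g2": 1, "g3": 5, "g4": 4, "g5": 9, "g6": 8}

sigs = {name: signature(p, n_top=2, n_bottom=2)
        for name, p in {"s1": s1, "s2": s2, "s3": s3}.items()}

# Connect two samples when their signatures are similar enough (threshold assumed).
edges = [(a, b) for a in sigs for b in sigs
         if a < b and signature_similarity(sigs[a], sigs[b]) >= 0.5]
print(edges)
```

Samples s1 and s2 share the same extreme genes and are linked, while s3 has the opposite pattern and stays isolated, which mirrors how clusters and solitary nodes appear in the network.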
3.9. PATHFINDR
Once all the intended clustering/classification algorithms
had been applied to the starting dataset, pathfindR was used
to study the enrichment of the genes in the dataset. Before
any other operation, a filtered list of genes (p-value <
0.01) was extracted from the df; it is used later in the
manual enrichment carried out with the online software
EnrichR (see 3.10.).
After the extraction, the pathfindR workflow begins. A model
matrix based on the tissue conditions was built first,
followed by a model fit and the construction of a contrast
matrix. The contrasts were then analysed with limma, which
gave the list of probes of interest, further filtered by
p-value (< 0.01).
All the operations so far served to build the final
dataframe used for the actual pathfindR run. Before that, the
results obtained so far had to be merged with another df that
pairs each probe ID with its gene symbol, obtained through
Ensembl and biomaRt.
After the merge, the databases for the run were chosen: KEGG,
GO-BP and Reactome. pathfindR was then applied through the
run_pathfindR function for each of the selected databases,
and a plot was created for each case.
3.10. ENRICHMENT
After obtaining the df with the ILLUMINA probes, a new empty
dataframe was built. The logFC and the p-value of each probe
were taken from the previous analysis and added to the new
structure. A column with the gene symbol of each probe was
also created using the “illuminaHumanv3.db” and
“AnnotationDbi” libraries and the mapIds function, which maps
each probe ID to its gene symbol using the given database.
The inputs were:
Database: “illuminaHumanv3.db”
Keys: “probe_IDs”, the list of elements to convert, in this case the column containing the probe IDs
Column: “SYMBOL”, the type of annotation to retrieve from the db
Keytype: “PROBEID”, the type of identifier supplied as keys
At the end of the mapping phase, a df with both the probe IDs
and the gene symbols as columns was obtained. It was merged
with a modified, t-tested df derived from the starting data
and metadata, giving a new structure in which the IDs appear
together with statistics such as p-value, logFC and dm.
From that point the objective is to retrieve the gene symbols
with p-value < 0.05, so further filtering is needed. Once
this is done, the list of gene symbols is extracted from the
dataframe and uploaded to the online software EnrichR,
available at the maayanlab.cloud website. After the list was
submitted, the main categories considered for the enrichment
were: Transcription, Pathways, Ontologies and
Diseases/Drugs.
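The filtering step, and the kind of over-representation test that enrichment tools commonly rely on, can be sketched as follows. The Python code is illustrative: gene names, p-values, the gene set and the background size are hypothetical, and the hypergeometric test shown here is a generic choice, since the report does not state how EnrichR scores its terms.

```python
from math import comb

def significant_symbols(rows, alpha=0.05):
    """Unique gene symbols with p-value below alpha, in first-seen order."""
    seen = []
    for symbol, p_value in rows:
        if p_value < alpha and symbol not in seen:
            seen.append(symbol)
    return seen

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` genes are taken without replacement from
    `population` genes, of which `successes` belong to the gene set."""
    upper = min(successes, draws)
    total = comb(population, draws)
    return sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(k, upper + 1)) / total

# Hypothetical (symbol, p-value) pairs, with one duplicated symbol.
rows = [("GENE_A", 0.001), ("GENE_B", 0.20), ("GENE_A", 0.03),
        ("GENE_C", 0.04), ("GENE_D", 0.010), ("GENE_E", 0.002)]
query = significant_symbols(rows)

# Hypothetical annotated gene set and background of 20 genes.
gene_set = {"GENE_A", "GENE_C", "GENE_D", "GENE_E", "GENE_Z"}
background_size = 20
overlap = len(gene_set & set(query))
p_enrich = hypergeom_upper_tail(overlap, background_size, len(gene_set), len(query))
print(query, overlap, p_enrich)
```

Removing duplicated symbols before the test matters, because repeated entries would otherwise inflate the size of the query list.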
3.11. STRING DB
As the final step of the project, the list of gene symbols
previously used in the enrichment analysis was sorted by
ascending p-value; the top 150 rows were then selected
and translated to their respective UNIPROT IDs in order
to submit them to STRINGdb analysis and find possible
relationships and insights for future studies. The
software produces highly informative networks from the
provided list of IDs and from information coming from
many different databases, with the possibility of
choosing the organism of interest.
In the end a network of 133 elements was built,
showing how each node (protein) interacts with its
neighbours, how they are related, and whether there
are any solitary nodes.
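The selection step can be sketched as a sort by ascending p-value followed by a lookup of each symbol. The Python code below is an illustration only: the gene symbols, p-values and accession placeholders are hypothetical, and a cut-off of 3 replaces the 150 rows used in the report to keep the example short.

```python
def top_by_p_value(rows, n):
    """Sort (symbol, p-value) pairs by ascending p-value and keep the first n."""
    return sorted(rows, key=lambda row: row[1])[:n]

# Hypothetical (symbol, p-value) pairs.
rows = [("GENE_A", 0.03), ("GENE_B", 0.001), ("GENE_C", 0.2), ("GENE_D", 0.01)]

# Hypothetical symbol-to-accession lookup; the placeholders are not real IDs.
id_map = {"GENE_A": "ACC_A", "GENE_B": "ACC_B"}

top = top_by_p_value(rows, n=3)
# Symbols without a known accession are skipped rather than guessed.
accessions = [id_map[symbol] for symbol, _ in top if symbol in id_map]
print(top)
print(accessions)
```

Symbols that cannot be translated drop out of the list, so the number of submitted identifiers can be smaller than the number of selected rows.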
4. RESULTS
4.1. PCA
Following the procedures described in the Principal Component
Analysis step (see Methods), plots of the sample clustering
and distribution were produced. The distribution plot showed
no errors or unusual elements. The PCA plot, on the other
hand, presented a single issue.
Figure 13, Plot from the PCA phase; two almost perfectly separated
groups can be seen, except for two datapoints in the bottom part
of the graph
Figure 16, K-means plot showing the outline of the clustering
groups; no overlap is visible this time
4.3. HEATMAP
After obtaining the modified distance matrix using the
already available data from the clustering phase, an
4.4. LDA
During the procedures of the Linear Discriminant Analysis
(see 3.6.), one of the first plots produced concerned the
quality evaluation of the LDA algorithm in both the training
and test groups.
Figure 20, LDA quality evaluation in both [UP] Normal Mucosa group and
[DOWN] Neoplastic group, the X axis represents the division point
between the two groups.
Figure 22, ROC curve and representations of the AUC of the LDA
methodology
Following the AUC calculation, the next phase of the
evaluation compares the models used (RF and LDA) in terms of
accuracy. Using model prediction functions, the following
results on their performance were obtained (see Figure 23):
Figure 25, Plot showing the lambda value that minimizes the binomial
deviance, obtained in the LASSO analysis
Figure 27, Comparison between the three methodologies used for
clustering/classification; with the LASSO model the classification
accuracy decreases considerably in comparison to RF or LDA
4.6. RSCUDO
After the LASSO classification procedure, the RSCUDO
methodology was applied. The first results showed the tissue
groups forming two main clusters, with some sparse
datapoints, a few of which were still connected to each other
while detached from the main clusters. Figure 28 shows this;
it also makes clear that some nodes seem to “trespass”
between the domains of the two key clusters. These are most
likely outliers assigned to a group with which they are not
actually associated. The next step looked at the signatures
of the tissue groups:
Figure 29, Clustering plot showing how the RSCUDO algorithm decomposes
and recognises the solitary nodes already seen in the previous
graph; the subgroups it identifies are visible here
4.7. ENRICHMENT
The gene list was then submitted for enrichment to the online
platform EnrichR, as specified in Methods section 3.10. Of
the ~13300 genes, only about 9000 had a description in the
online software, and about 4000 were enriched in the end. The
results of this step showed associations with:
Translation
Influenza infection
RNA and ribosome modification, formation and binding
Diamond-Blackfan anaemia
Top 500 genes downregulated in COVID-19
Glycosylation-related disorders
Bladder, ovarian and testicular carcinoma
Other types of anaemia
Presence of bulk tissue in the kidney
This is not quite what we were expecting but, as stated
before, these results may be due to the heavy filtering
imposed on the genes and to the presence of duplicates among
them. With further research, some of the enriched terms
turned out to be related to the CRC context, for example:
Colorectal carcinoma
Colorectal neoplasm
Acute myeloblastic leukemia
Carcinoembryonic antigen
Muir-Torre syndrome
High association to the tissue of intestine and uterus
Gut microbiota beta diversity
Colon cancer association and intestine epithelial cells
CL34, HT115, SNU16 and SNUC1 cell lines of the large intestine
Presence of bulk tissue in the colon
Association with HCT116 cell line
4.8. PATHFINDR
The PATHFINDR algorithm provided supplementary insights into
the enrichment of the genes used before, and its results were
very similar to those obtained with EnrichR. An example is
shown in Figure 31, where the plot shows a strong
relationship between the submitted gene list and ribosomal
functions, modifications and structural features, which also
appeared in the previous phase.
Figure 31, Enrichment plot of the results from the KEGG database used
in PATHFINDR; the results are highly consistent with those found in
the previous phase using the EnrichR online tool
4.9. STRING DB
Once the gene list from the merged dataframe was obtained,
the first result of this step was the list of genes
translated into a format suitable for STRING DB, namely the
protein name corresponding to each gene. The completed list
was then submitted to the online software, specifying the
organism of interest (Homo sapiens). In the end a network
was generated showing how the elements are connected: