Document (26) - Copy 2

The document presents a major project on the analysis of mitochondrial data using R Studio, focusing on Principal Component Analysis (PCA) to reduce dimensionality and visualize biological data. It outlines the methodology for conducting PCA, interpreting results, and the significance of the analysis in population genetics, functional genomics, and clinical applications. The project emphasizes the importance of identifying disease-gene associations and prioritizing research targets through data visualization techniques.


MAJOR PROJECT - ANALYSIS OF MITOCHONDRIAL DATA USING R STUDIO

SUBMITTED BY: BHUVAN NAKRA (UEM211071), BE BIOTECHNOLOGY, 8TH SEM
SUBMITTED TO: DR. MINAKSHI GARG

ACKNOWLEDGEMENT

I would like to express my heartfelt gratitude to my esteemed guide, Dr. Minakshi Garg, for her
unwavering support, guidance, and encouragement throughout the course of my project,
ANALYSIS OF MITOCHONDRIAL DATA USING R STUDIO. Her vast knowledge and expertise
have been invaluable in shaping the direction and scope of this research. She has provided me
with insightful suggestions, critical feedback, and constructive advice at every stage of the
project, ensuring its successful completion. Her dedication and meticulous attention to detail
inspired me to approach the project with the same rigor and commitment.

I am especially thankful for the time and effort Dr. Minakshi Garg devoted to mentoring me,
despite her busy schedule. Her ability to explain complex concepts in a simplified manner and
her enthusiasm for teaching have been a source of immense motivation. She not only guided me
technically but also instilled in me the importance of discipline, perseverance, and critical
thinking, which have significantly contributed to my growth as a student and a learner.

I also want to acknowledge her constant encouragement, which played a pivotal role in
overcoming challenges during the project. Her guidance extended beyond academics, providing
a supportive and collaborative environment that encouraged creativity and innovation. This
project would not have reached its current level of success without her continuous mentorship. I
feel privileged to have had the opportunity to work under her guidance, and I will always be
grateful for her invaluable contribution to this endeavour.

PRINCIPAL COMPONENT ANALYSIS (PCA) BIPLOT IN R STUDIO

PRINCIPAL COMPONENT ANALYSIS (PCA) IS A DIMENSIONALITY
REDUCTION METHOD THAT IS OFTEN USED ON LARGE DATASETS.
IT TRANSFORMS A LARGE SET OF VARIABLES INTO A SMALLER ONE
THAT STILL CONTAINS MOST OF THE INFORMATION IN THE
ORIGINAL DATASET.

USE IN ANALYSIS OF BIOLOGICAL DATA :

- BIOLOGICAL DATA SUCH AS GENE EXPRESSION, METABOLOMICS,
  SNPs OR PROTEOMICS OFTEN INVOLVE HUNDREDS OF VARIABLES.
- PCA REVEALS PATTERNS SUCH AS SAMPLE CLUSTERING AND TRENDS.
- PCA VISUALIZES VARIATION ACROSS SAMPLES IN A
  TWO-DIMENSIONAL OR THREE-DIMENSIONAL SPACE.

HOW PCA CONSTRUCTS THE PRINCIPAL COMPONENTS

THERE ARE AS MANY PRINCIPAL COMPONENTS AS THERE ARE
VARIABLES IN THE DATA. THE COMPONENTS ARE CONSTRUCTED SO
THAT THE FIRST PRINCIPAL COMPONENT ACCOUNTS FOR THE LARGEST
POSSIBLE VARIANCE IN THE DATASET, AND EACH FOLLOWING
COMPONENT ACCOUNTS FOR THE LARGEST REMAINING VARIANCE
WHILE BEING UNCORRELATED WITH THE COMPONENTS BEFORE IT.
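
A minimal sketch of this property in base R. The data are simulated and the column names are invented for illustration; they are not the project's dataset.

```r
# Simulated data: 50 samples, 4 numeric variables (illustrative only).
set.seed(1)
x1 <- rnorm(50)
sim <- data.frame(
  a = x1,
  b = x1 + rnorm(50, sd = 0.3),  # deliberately correlated with a
  c = rnorm(50),
  d = rnorm(50)
)

pca <- prcomp(sim, scale. = TRUE)

# One principal component per input variable.
n_components <- ncol(pca$rotation)

# Variance captured by each component, in decreasing order.
variances <- pca$sdev^2

print(n_components)
print(round(variances, 3))
```

Because the variables are scaled to unit variance, the component variances add up to the number of variables, so each value can be read as "how many variables' worth" of variation that component carries.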

COMPONENTS OF A PCA BIPLOT :

A PCA BIPLOT CONSISTS OF TWO MAIN ELEMENTS:

SAMPLES (POINTS): show how individual observations (for example
tissue samples, individuals, species) group in PC space
VARIABLES (ARROWS): show how the original variables (for example
nucleotide variability, heteroplasmy frequency and
conservation scores) contribute to the principal components
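
As a rough illustration of where these two elements come from, the sketch below runs prcomp() on simulated data and pulls out the sample scores (the points) and the variable loadings (the arrows). The column names are placeholders, not the columns of the project spreadsheet.

```r
# Simulated stand-in data; column names are hypothetical.
set.seed(2)
sim <- data.frame(
  hf = runif(30),
  variability = runif(30),
  conservation = rnorm(30)
)

pca <- prcomp(sim, scale. = TRUE)

scores <- pca$x           # one row per sample: the points of the biplot
loadings <- pca$rotation  # one row per variable: the arrows of the biplot

print(dim(scores))
print(dim(loadings))
```

biplot() draws both matrices on the same axes, rescaling the arrows so that they fit the range of the points.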

STEPS IN PCA :

1. LOADING THE ESSENTIAL LIBRARIES
library(readxl)   # provides read_excel()
library(ggplot2)

2. IMPORTING DATA
my_data <- read_excel("C:/Users/bhuva/Downloads/1-da.xlsx")

3. EXTRACTING NUMERIC DATA
# keep only the numeric columns, since PCA needs numeric input
numeric_data <- my_data[, sapply(my_data, is.numeric)]

4. DATA INSPECTION AND CLEANING
str(numeric_data)
summary(numeric_data)
numeric_data <- na.omit(numeric_data)   # drop rows with missing values

5. PERFORMING PCA
# scale. = TRUE standardises every variable to unit variance
pca_result <- prcomp(numeric_data, scale. = TRUE)
summary(pca_result)

6. GENERATING THE BIPLOT
biplot(pca_result, col = c("magenta", "blue"))
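
One practical caveat: prcomp() with scale. = TRUE stops with an error if any column is constant, because a zero-variance column cannot be rescaled. The sketch below shows one possible guard, assuming a small made-up data frame in place of numeric_data; the column names and values are hypothetical.

```r
# Hypothetical stand-in for numeric_data; `flag` is a constant column.
numeric_data <- data.frame(
  hf = c(0.1, 0.4, 0.2, 0.9, 0.5),
  score = c(1.2, 0.3, 2.1, 0.7, 1.5),
  flag = c(1, 1, 1, 1, 1)
)

# Standard deviation of every column; constant columns give 0.
col_sd <- sapply(numeric_data, sd)

# Keep only columns that actually vary, then run the PCA as before.
usable <- numeric_data[, col_sd > 0, drop = FALSE]
pca_result <- prcomp(usable, scale. = TRUE)

kept <- names(usable)
print(kept)
```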

RESULT

Importance of components:
                          PC1     PC2     PC3      PC4
Standard deviation     1.3848  1.1884  0.6624  0.48087
Proportion of Variance 0.4794  0.3531  0.1097  0.05781
Cumulative Proportion  0.4794  0.8325  0.9422  1.00000
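
The proportion rows follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. A short check using the standard deviations reported above:

```r
# Standard deviations copied from the summary() output above.
sdev <- c(PC1 = 1.3848, PC2 = 1.1884, PC3 = 0.6624, PC4 = 0.48087)

# Share of total variance per component, and the running total.
prop <- sdev^2 / sum(sdev^2)
cum <- cumsum(prop)

print(round(rbind(Proportion = prop, Cumulative = cum), 4))
```

The running total for PC2 is the figure to quote when describing how much of the variance a two-dimensional biplot shows.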

INTERPRETATION

PC1 AND PC2 TOGETHER EXPLAIN ABOUT 83 % OF THE TOTAL
VARIANCE (CUMULATIVE PROPORTION 0.8325), WHICH IS QUITE GOOD

PC1: STRONGLY INFLUENCED BY HF (VARIANT ALLELE FREQUENCY)
AND NUCLEOTIDE VARIABILITY
PC2: INFLUENCED BY "PHASTCONS20WAY" AND "PHYLOP20WAY",
i.e. EVOLUTIONARY CONSERVATION

VARIANTS WITH HIGH CONSERVATION SCORES CLUSTER ON THE
POSITIVE Y-AXIS (PC2)

RESULTS
- HIGH CONSERVATION + LOW VARIABILITY:
  DISEASE-CAUSING VARIANTS
- LOW CONSERVATION + HIGH VARIABILITY:
  BENIGN OR TOLERATED POLYMORPHISM
- HIGH CONSERVATION + HIGH VARIABILITY:
  HOTSPOTS OR POPULATION-SPECIFIC FUNCTIONAL VARIANTS
- LOW CONSERVATION + LOW VARIABILITY:
  RARE NEUTRAL VARIANTS
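
The four categories can be written as a simple rule. The sketch below is only an illustration: the function name, the 0-to-1 scale and the 0.5 cut-offs are assumptions, since the report does not state the thresholds that separate "high" from "low".

```r
# Hypothetical helper: cut-offs of 0.5 are arbitrary placeholders.
classify_variant <- function(conservation, variability,
                             cons_cut = 0.5, var_cut = 0.5) {
  high_cons <- conservation >= cons_cut
  high_var <- variability >= var_cut
  ifelse(high_cons & !high_var, "disease-causing",
  ifelse(!high_cons & high_var, "benign or tolerated polymorphism",
  ifelse(high_cons & high_var, "hotspot or population-specific functional",
         "rare neutral")))
}

# Four made-up variants, one per category.
labels <- classify_variant(
  conservation = c(0.9, 0.1, 0.8, 0.2),
  variability  = c(0.1, 0.9, 0.7, 0.3)
)
print(labels)
```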

INTRODUCTION :

Mitochondrial DNA (mtDNA) is a maternally inherited, circular
molecule of about 16.6 kb (16,569 bp) and, unlike the nuclear
genome, has no introns.
Mitochondrial DNA contains 37 genes, all of which are essential
for normal mitochondrial function.
- 13 of these genes provide instructions for making enzymes
  involved in oxidative phosphorylation.
- The remaining 24 genes provide instructions for making molecules
  called transfer RNA (tRNA) and ribosomal RNA (rRNA), which
  are chemical cousins of DNA.
Although it codes for a small number of genes, mutations in
mtDNA are common.

SIGNIFICANCE OF PCA ANALYSIS :

1. POPULATION GENETICS OR PHYLOGENY
WE COMPILE VARIANTS FROM DIFFERENT
INDIVIDUALS AND USE VARIANT FREQUENCIES
OR CONSERVATION SCORES TO COMPARE
GENETIC PATTERNS ACROSS SAMPLES

2. FUNCTIONAL GENOMICS
PCA CAN HELP INTERPRET FUNCTIONAL IMPACT
SCORES (CONSERVATION SCORES) ACROSS
VARIANTS

3. ENVIRONMENTAL MICROBIOLOGY
CLUSTER SOIL / OIL SAMPLES BASED ON
MICROBIAL ABUNDANCE AND ENVIRONMENTAL
FACTORS (PH, TEMPERATURE)

4. DIFFERENTIATION OF VARIANT ALLELES
DIFFERENTIATING HIGH-RISK VARIANTS FROM
BENIGN ONES

5. QUALITY CONTROL
PCA CAN DETECT BATCH EFFECTS, OUTLIERS OR
TECHNICAL ARTIFACTS IN HIGH-THROUGHPUT DATA
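
As an illustration of the quality-control use, the sketch below plants one artificial outlier in simulated data and flags samples whose PC1 score lies more than three standard deviations from the mean. The data, the planted outlier and the cut-off of 3 are all assumptions made for the example, not part of the project's analysis.

```r
# Simulated data: 40 samples, 5 variables (illustrative only).
set.seed(3)
sim <- matrix(rnorm(40 * 5), nrow = 40)
sim[40, ] <- sim[40, ] + 8   # sample 40 is shifted on every variable

pca <- prcomp(sim, scale. = TRUE)

# Standardise the PC1 scores and flag samples beyond 3 SD.
pc1_z <- as.numeric(scale(pca$x[, 1]))
outliers <- which(abs(pc1_z) > 3)

print(outliers)
```

The absolute value is used because the sign of a principal component is arbitrary.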

TOP DISEASE COUNTS PER LOCUS :

STEPS :
1. LOAD THE REQUIRED LIBRARIES
library(readxl)
library(dplyr)
library(ggplot2)

2. READ THE EXCEL FILE
data <- read_excel("C:/Users/bhuva/Downloads/1-da.xlsx")

3. FILTER ROWS WITH NON-MISSING VALUES
filtered_data <- data %>%
  filter(!is.na(ClinVar), !is.na(Locus))

4. FIND TOP 20 DISEASES
# top_n() keeps ties, so more than 20 names can be returned
top_diseases <- filtered_data %>%
  count(ClinVar, sort = TRUE) %>%
  top_n(20, n) %>%
  pull(ClinVar)

5. FILTER DATASET TO INCLUDE ONLY TOP DISEASES
top_filtered <- filtered_data %>%
  filter(ClinVar %in% top_diseases)

6. GROUPING OF DATA
# one row per (Locus, ClinVar) pair with its number of variants
grouped_data <- top_filtered %>%
  group_by(Locus, ClinVar) %>%
  summarise(Count = n(), .groups = "drop")

7. CREATE THE BAR PLOT
ggplot(grouped_data, aes(x = Locus, y = Count, fill = ClinVar)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Top 20 Diseases per Locus",
       x = "Locus", y = "Count",
       fill = "Disease (ClinVar)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
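
To check the counting logic without the spreadsheet, the same filter, rank and group steps can be reproduced in base R on a toy table. The rows below are invented placeholders (locus_A, disease_X and so on), and only the single most frequent disease is kept instead of the top 20.

```r
# Toy data with the same two column names; values are placeholders.
toy <- data.frame(
  Locus   = c("locus_A", "locus_A", "locus_B", "locus_B", "locus_B", NA),
  ClinVar = c("disease_X", "disease_X", "disease_X", "disease_Y", NA, "disease_Y")
)

# Step 3: drop rows where either column is missing.
complete <- toy[!is.na(toy$ClinVar) & !is.na(toy$Locus), ]

# Step 4: rank diseases by frequency and keep the top one.
disease_totals <- sort(table(complete$ClinVar), decreasing = TRUE)
top_names <- names(disease_totals)[1]

# Steps 5 and 6: keep only the top disease, then count per locus.
top_rows <- complete[complete$ClinVar %in% top_names, ]
grouped <- as.data.frame(
  table(Locus = top_rows$Locus, ClinVar = top_rows$ClinVar),
  responseName = "Count"
)
print(grouped)
```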

SIGNIFICANCE OF THIS ANALYSIS

1. Disease-Gene Association Mapping
Helps establish clear relationships between specific
genomic loci and diseases

2. Prioritization of Research Targets
Identifies "hotspot" loci associated with multiple diseases
Helps prioritize genes/loci for functional studies
Reveals pleiotropic effects where single loci influence
multiple phenotypes

3. Clinical Applications
Improves genetic testing panels by identifying the most
clinically relevant loci

4. Biological Pathway Analysis
When combined with pathway databases, reveals disease
networks showing how different diseases may share
common biological pathways

TOP VARIANTS BY PATHOGENICITY

STEPS :
1. LOAD THE REQUIRED LIBRARIES
library(readxl)
library(dplyr)
library(ggplot2)

2. READ THE DATA
data <- read_excel("C:/Users/bhuva/Downloads/1-da.xlsx")

3. COUNT VARIANT ALLELE OCCURRENCES
variant_counts <- data %>%
  count(`Variant Allele`, sort = TRUE)

4. GET TOP 10 VARIANT ALLELES
# top_n() keeps ties, so more than 10 rows can be returned
top_10 <- variant_counts %>%
  top_n(10, n)

5. FILTER ORIGINAL DATA FOR ONLY THESE TOP VARIANTS
filtered_data <- data %>%
  filter(`Variant Allele` %in% top_10$`Variant Allele`)

6. RECOUNT WITH PATHOGENICITY FOR PLOTTING
plot_data <- filtered_data %>%
  count(`Variant Allele`, Pathogenicity)

7. PLOT THE BAR GRAPH
# FUN = sum orders the bars by total count across pathogenicity classes
ggplot(plot_data, aes(x = reorder(`Variant Allele`, n, FUN = sum),
                      y = n, fill = Pathogenicity)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Variants by Pathogenicity",
    x = "Variant Allele",
    y = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10))
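
A note on bar order: when an allele has one row per pathogenicity class, reorder() with its default summary (the mean) sorts by the average per-class count rather than by the stacked total. The base R sketch below shows the difference on invented counts; the column names are placeholders.

```r
# Toy counts: one row per allele and pathogenicity class (values invented).
plot_data <- data.frame(
  allele = c("A", "A", "G", "T", "T"),
  pathogenicity = c("benign", "pathogenic", "benign", "benign", "pathogenic"),
  n = c(5, 5, 8, 1, 2)
)

# Default: levels ordered by the mean of n within each allele.
by_mean <- levels(reorder(plot_data$allele, plot_data$n))

# FUN = sum: levels ordered by the total of n within each allele.
by_sum <- levels(reorder(plot_data$allele, plot_data$n, FUN = sum))

print(by_mean)
print(by_sum)
```

Here allele A has the largest total (10) but a smaller mean (5) than G (8), so the two orderings place A and G differently.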

SIGNIFICANCE OF THIS ANALYSIS

1. Prioritizing Clinically Relevant Variants
Researchers can focus on the variants most likely to cause
disease
Streamlines genetic testing and diagnosis workflows

2. Understanding Disease Mechanisms
Reveals common mutation patterns in specific diseases
Contributes to knowledge of genotype-phenotype
relationships

3. Supporting Personalized Medicine
Tailored treatment strategies for patients (for example, in
cancer diagnosis)
Better predictions of disease risk or drug response

4. Contribution to Public Databases
Our analysis can contribute to public databases such
as ClinVar, gnomAD or COSMIC
