11/7/24, 4:45 PM HW 4

HW 4
Victor Wei, Yutong Wang, Erica Hwang
3/13/2022

Q1
library(tidyverse)   # read_csv(), mutate(), filter(), arrange(), %>%
leukemia_data <- read_csv("leukemia_data.csv")

## New names:
## * FCGRT -> FCGRT...2
## * FCGRT -> FCGRT...3
## * PGK1 -> PGK1...6
## * GUSBP11 -> GUSBP11...19
## * VDAC1 -> VDAC1...21
## * ...

## Rows: 327 Columns: 3142


## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Type
## dbl (3141): FCGRT...2, FCGRT...3, 31444_s_at, TMSB10, PGK1...6, EIF3K, 31503...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

1a.
leukemia_data <- leukemia_data %>% mutate(Type=as.factor(Type))

Print the number of patients with each leukemia sub-type by using table().


table(leukemia_data$Type)

##
##    BCR-ABL   E2A-PBX1 Hyperdip50        MLL     OTHERS      T-ALL   TEL-AML1
##         15         27         64         20         79         43         79

Based on the table, “BCR-ABL” is the least common leukemia sub-type, with 15 of the 327 patients.
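
The least common sub-type can also be read off programmatically. The sketch below re-enters the counts printed above as a named vector; the names `type_counts`, `least_common`, and `total_patients` are illustrative and not part of the original analysis. On the real data, the same calls should apply directly to the result of table(leukemia_data$Type).

```r
# Counts copied from the table() output above; `type_counts` is an illustrative name.
type_counts <- c("BCR-ABL" = 15, "E2A-PBX1" = 27, "Hyperdip50" = 64, "MLL" = 20,
                 "OTHERS" = 79, "T-ALL" = 43, "TEL-AML1" = 79)

least_common   <- names(type_counts)[which.min(type_counts)]  # sub-type with the fewest patients
total_patients <- sum(type_counts)                            # should match the number of rows

least_common
total_patients
round(type_counts / total_patients, 3)                        # share of patients per sub-type
```

Note that which.min() returns only the first minimum, so ties (such as OTHERS and TEL-AML1 at the other end of the table) would need which(type_counts == max(type_counts)) or similar.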

1b.
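We run PCA on the gene expression columns (every column except Type), centering each gene and scaling it to unit variance:

PCA <- prcomp(leukemia_data[,-c(1)],center=TRUE,scale.=TRUE)  # the argument is `scale.`, with a trailing dot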

Plot the PVE of each PC and the cumulative PVE.

PVE <- PCA$sdev^2 / sum(PCA$sdev^2)


plot(PVE,xlab="Principal Component",ylab="Proportion of Variance Explained ",type='b')


plot(cumsum(PVE),xlab="Principal Component",
     ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b')
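
As a sanity check on the PVE formula, the proportions should sum to 1 and agree with what summary() reports for a prcomp object. The sketch below uses the built-in USArrests data rather than the leukemia data, and the names `pca_demo`, `pve_demo`, and `pve_summary` are illustrative.

```r
# Sketch on the built-in USArrests data, not the leukemia data.
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# PVE of each PC: its variance divided by the total variance.
pve_demo <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)

# summary() reports the same quantity, rounded to 5 decimal places.
pve_summary <- summary(pca_demo)$importance["Proportion of Variance", ]

sum(pve_demo)                     # the PVEs sum to 1
max(abs(pve_demo - pve_summary))  # any difference is rounding only
```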


The cumulative PVE is 0.8991 with 200 PCs and first exceeds 0.90 with 201 PCs (0.9002), so we need 201 PCs to explain 90% of the total variation in the data.

cumsum(PVE)[200:205]

## [1] 0.8990880 0.9002490 0.9013977 0.9025446 0.9036898 0.9048285
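
Instead of printing a window of the cumulative sum and reading off the crossing point, the index can be computed directly. This is a minimal sketch on the built-in USArrests data with illustrative names (`pca_demo`, `cum_pve`, `n_pcs`); given the values printed above, which(cumsum(PVE) >= 0.90)[1] on the leukemia PCA should return 201.

```r
# Sketch on the built-in USArrests data, not the leukemia data.
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first k PCs.
cum_pve <- cumsum(pca_demo$sdev^2) / sum(pca_demo$sdev^2)

# Smallest number of PCs whose cumulative PVE reaches the 90% threshold.
n_pcs <- which(cum_pve >= 0.90)[1]

n_pcs
cum_pve
```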

1c.
Generate a scatter plot using plot():


colors <- rainbow(7)                        # one color per sub-type
plot_colors <- colors[leukemia_data$Type]   # the factor's integer codes index into colors
colors_df <- data.frame(color=plot_colors,type=leukemia_data$Type)
plot(PCA$x[,1],PCA$x[,2],col=plot_colors,cex=0.5,xlab="PC1",ylab="PC2")

# biplot() takes two colors only: one for the observations, one for the gene loadings
biplot(PCA, scale=0,col=c("grey40","firebrick"),cex=0.5)


Then we add a legend that maps each color to its leukemia sub-type.

colors_df_u <- unique(colors_df)   # one row per (color, sub-type) pair
plot(PCA$x[,1],PCA$x[,2],col=plot_colors,cex=0.5,xlab="PC1",ylab="PC2")
# pch = 1 matches the default plotting symbol used by plot() above
legend(x = "topleft", legend = colors_df_u$type,
       col = colors_df_u$color, pch = 1)


In this plot, “T-ALL” is the sub-type most clearly separated from the others, and the separation is along the PC2 axis.
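
The color-and-legend step can also be driven by the factor levels, which keeps legend labels and colors aligned without building a separate lookup data frame. The sketch below uses the built-in iris data, with Species standing in for Type; `pca_demo`, `group`, `palette_demo`, and `point_cols` are illustrative names.

```r
# Sketch on the built-in iris data; Species plays the role of Type.
pca_demo <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

group        <- iris$Species
palette_demo <- rainbow(nlevels(group))         # one color per factor level
point_cols   <- palette_demo[as.integer(group)] # color of each observation

plot(pca_demo$x[, 1], pca_demo$x[, 2], col = point_cols, pch = 1, cex = 0.5,
     xlab = "PC1", ylab = "PC2")
# Legend built from the factor levels, in the same order as the palette.
legend("topleft", legend = levels(group), col = palette_demo, pch = 1)
```

Because both the palette and the legend are indexed by level order, the mapping stays correct even if the rows of the data are reordered.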

To find genes with highest absolute loadings for PC1:

pc1_l <- as.data.frame(PCA$rotation[,1])   # PC1 loadings, one row per gene
names(pc1_l) <- c("loading")
pc1_l$gene <- row.names(pc1_l)
pc1_l$abs_loading <- abs(pc1_l$loading)
pc1_l %>% arrange(desc(abs_loading)) %>% head()   # six largest absolute loadings


##            loading   gene abs_loading
## SEMA3F -0.04517148 SEMA3F  0.04517148
## CCT2    0.04323818   CCT2  0.04323818
## LDHB    0.04231619   LDHB  0.04231619
## COX6C   0.04183480  COX6C  0.04183480
## SNRPD2  0.04179822 SNRPD2  0.04179822
## ELK3   -0.04155821   ELK3  0.04155821
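
These loadings look small because each column of the rotation matrix has unit length: if all 3141 genes contributed equally to PC1, each would have an absolute loading of 1/sqrt(3141), roughly 0.018, so the top genes are about 2.5 times that baseline. The sketch below illustrates the same ranking and the unit-length property on the built-in USArrests data; `pca_demo`, `pc1`, and `top` are illustrative names.

```r
# Sketch on the built-in USArrests data, not the leukemia data.
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

pc1 <- pca_demo$rotation[, 1]             # PC1 loadings, named by variable
top <- sort(abs(pc1), decreasing = TRUE)  # rank variables by absolute loading

head(top, 3)            # variables contributing most to PC1
sum(pc1^2)              # squared loadings sum to 1
1 / sqrt(length(pc1))   # equal-contribution baseline for comparison
```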

1d.
plot(PCA$x[,1],PCA$x[,3],col=plot_colors,cex=0.5,xlab="PC1",ylab="PC3")
# pch = 1 matches the default plotting symbol used by plot() above
legend( x = "topleft",
        legend = colors_df_u$type,
        col = colors_df_u$color, pch = 1)


Yes. Projecting the data onto the first and third principal components, instead of the first and second, shows that PC3 discriminates between the leukemia types better than PC2 does.

1e.
Using filter(), we subset the data to the three sub-types “T-ALL”, “TEL-AML1”, and “Hyperdip50” (43 + 79 + 64 = 186 patients).

subsetl <- leukemia_data %>% filter(Type %in% c("T-ALL","TEL-AML1","Hyperdip50"))

We then scale the genes, compute the Euclidean distance matrix between patients in the subset, and apply complete-linkage hierarchical clustering.


scaledl <- scale(subsetl[,-1])    # drop Type, then center and scale each gene
distancel <- dist(scaledl)        # Euclidean distance is the default
fit.complete <- hclust(distancel, method="complete")
plot(fit.complete, hang=-1, cex=0.8, main="Complete Linkage Clustering")
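
One way to judge how well the clustering recovers the known sub-types is to cut the tree into three groups and cross-tabulate the cluster labels against the true labels. This is a sketch on the built-in iris data, with Species standing in for Type; `x_demo`, `fit_demo`, and `clusters` are illustrative names. On the leukemia subset, the analogous call would presumably be table(cutree(fit.complete, k = 3), droplevels(subsetl$Type)).

```r
# Sketch on the built-in iris data; Species plays the role of Type.
x_demo   <- scale(iris[, 1:4])                        # center and scale each feature
fit_demo <- hclust(dist(x_demo), method = "complete") # complete linkage on Euclidean distances

# Cut the dendrogram into 3 clusters and compare with the known groups.
clusters <- cutree(fit_demo, k = 3)
table(cluster = clusters, species = iris$Species)
```

A table with one dominant cell per row and column indicates that the clusters line up with the known groups; cluster numbers themselves are arbitrary.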

Create a dendrogram from the hierarchical clustering result, coloring branches and labels by cutting the tree into 3 groups:


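library(dendextend)   # provides set() for dendrogram styling

dendrogram <- scaledl %>%
  dist %>%
  hclust %>%
  as.dendrogram %>%
  set("labels_col", value = c("skyblue", "orange", "grey"), k=3) %>%
  set("branches_k_color", value = c("skyblue", "orange", "grey"), k = 3) %>%
  set("labels_cex", 0.3)
# Plot separately so that `dendrogram` holds the tree, not the NULL returned by plot()
plot(dendrogram, horiz=TRUE, axes=FALSE)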

The same plot, with branches and labels colored by cutting the tree into 5 groups:


dendrogram <- scaledl %>%
  dist %>%
  hclust %>%
  as.dendrogram %>%
  set("labels_col", k=5) %>%
  set("branches_k_color", k = 5) %>%
  set("labels_cex", 0.3)
# Plot separately so that `dendrogram` holds the tree, not the NULL returned by plot()
plot(dendrogram, horiz=TRUE, axes=FALSE)
