Gene and Sample Clustering
Clustering genes and samples is a common analysis method in genomics, typically used to identify
patterns in gene expression data. This approach helps group genes with similar expression profiles
across various samples, or vice versa. Here’s a breakdown of the steps and techniques involved:
1. Data Preparation
Normalize the data to make expression levels comparable across samples and genes.
Filter genes or samples if necessary to remove noise or irrelevant entries.
2. Choose a Distance Metric
Pick a measure of similarity between expression profiles, such as Euclidean distance or a correlation-based distance.
3. Apply a Clustering Algorithm
Common choices include hierarchical clustering and k-means, applied to genes, samples, or both.
4. Visualize Clustering
Heatmaps: Often used to show the clustered data. Genes (rows) and samples (columns) are
reordered based on clustering, highlighting co-expressed gene groups or similar samples.
Dendrograms: Generated from hierarchical clustering to visualize how genes or samples are
merged at different similarity levels.
5. Interpret and Validate Results
Analyze clusters to identify meaningful biological patterns, such as groups of co-expressed genes or samples with similar phenotypes.
Biological Enrichment Analysis: Run analyses (e.g., Gene Ontology or pathway enrichment) to identify biological processes enriched in gene clusters.
Robustness Checks: Cross-validation techniques or bootstrapping can assess the stability of clusters.
A minimal end-to-end sketch of these steps is shown below.
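To tie the steps together, here is a minimal end-to-end sketch in Python. It assumes a hypothetical genes × samples expression matrix (the data here is randomly generated for illustration) and uses `scipy`/`seaborn`; `seaborn.clustermap` hierarchically clusters both rows and columns and draws the clustered heatmap with dendrograms attached:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical expression matrix: rows = genes, columns = samples
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(50, 6)),
                    index=[f"gene_{i}" for i in range(50)],
                    columns=[f"sample_{j}" for j in range(6)])

# Step 1: z-score each gene across samples so profiles are comparable
expr_z = expr.sub(expr.mean(axis=1), axis=0).div(expr.std(axis=1), axis=0)

# Steps 2-4: hierarchical clustering of rows and columns
# (Euclidean distance, average linkage) plus the clustered heatmap
sns.clustermap(expr_z, metric="euclidean", method="average")
plt.show()
```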
Tools and Libraries
In Python, `scipy.cluster.hierarchy`, scikit-learn, and seaborn (for clustered heatmaps) cover most of these steps; in R, `dist`, `hclust`, and `pheatmap` are common choices.
Would you like more details on a specific clustering algorithm or visualization method?
Euclidean distance
Euclidean distance is one of the most commonly used distance metrics in clustering and other machine
learning tasks. It calculates the straight-line distance between two points in Euclidean space, making it
intuitive for measuring similarity or dissimilarity between two data points (such as genes or samples in a
dataset).
Formula
For two points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$, the Euclidean distance $d(p, q)$ is given by:

$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
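As a quick illustration, this is the formula translated directly into Python (a bare-bones sketch; in practice NumPy or SciPy routines are used instead):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0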
Application in Clustering
1. Interpretation: A smaller Euclidean distance between two points indicates higher similarity, while
a larger distance suggests greater dissimilarity.
2. Hierarchical Clustering: Euclidean distance is often used with hierarchical clustering to measure
the distance between individual data points, which are then used to calculate linkage distances
between clusters.
3. K-means Clustering: Euclidean distance is commonly used to assign points to the nearest cluster center. The algorithm iteratively minimizes the sum of squared Euclidean distances between points and their assigned cluster centers (see the sketch following this list).
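To make point 3 concrete, here is a minimal scikit-learn sketch (the data values are made up for illustration): `KMeans` assigns each point to its nearest center in Euclidean distance and minimizes the within-cluster sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups along the x-axis (illustrative values)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # final cluster centers
```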
Properties
Sensitive to scale: Because it's based on absolute differences, the scale of each feature can significantly impact the distance. It's often necessary to normalize or standardize data before applying Euclidean distance (a short demonstration follows this list).
High-dimensional data: In high-dimensional spaces, Euclidean distances between points become
more similar, leading to the “curse of dimensionality.” Dimensionality reduction methods, such as
PCA, are often applied to alleviate this.
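The scale-sensitivity point above is easy to demonstrate (a small sketch with made-up values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 1100.0],
              [1.5, 5000.0]])

# Raw distance is dominated by the large-scale second feature
print(np.linalg.norm(X[0] - X[1]))   # ~100.0

# After z-scoring each column, both features contribute comparably
Xs = StandardScaler().fit_transform(X)
print(np.linalg.norm(Xs[0] - Xs[1]))
```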
Example Calculation
To calculate the Euclidean distance between two points (such as expression levels of two genes across samples), take, for instance:

$p = (1, 2, 3), \quad q = (4, 6, 3)$

Then,

$d(p, q) = \sqrt{(4-1)^2 + (6-2)^2 + (3-3)^2} = \sqrt{9 + 16 + 0} = 5$
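The same calculation in NumPy (the norm of the difference vector is the Euclidean distance):

```python
import numpy as np

p = np.array([1, 2, 3])
q = np.array([4, 6, 3])

# Euclidean distance = length of the difference vector
print(np.linalg.norm(p - q))  # 5.0
```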
If you’re looking into implementing this in code or using it in a specific clustering method, I can provide
more detailed steps for that too!
dendrogram
A dendrogram is a tree-like diagram that is commonly used to represent the hierarchical relationships
between objects, such as genes or samples, in a dataset. It is a visualization of hierarchical clustering,
which organizes data into nested clusters based on similarity.
1. Data Pairing: Hierarchical clustering algorithms start by treating each data point as its own cluster.
2. Calculate Distance: At each step, the algorithm calculates the distance between each pair of
clusters using a specified distance metric (such as Euclidean distance) and a linkage criterion.
3. Merge Clusters: The two closest clusters are merged to form a new cluster.
4. Repeat Steps: This process continues until all points are merged into a single cluster.
The result is a nested clustering structure that can be represented as a dendrogram, where:
Leaves represent the individual data points (e.g., genes or samples).
Branches represent merges between clusters.
The height at which two branches join reflects the distance at which those clusters were merged.
Linkage Criteria
The choice of linkage criterion affects how clusters are merged and how the dendrogram is shaped.
Common linkage methods include:
Single linkage: the distance between the closest members of two clusters.
Complete linkage: the distance between the farthest members.
Average linkage: the mean pairwise distance between members of the two clusters.
Ward's method: merges the pair of clusters that least increases total within-cluster variance.
Uses of Dendrograms
Identifying Clusters: By "cutting" the dendrogram at a certain height, you can define clusters
based on the similarity threshold. Everything below the cut forms a cluster.
Visualizing Gene Expression: In gene expression data, dendrograms can show groups of co-
expressed genes or similar samples, helping identify biologically relevant patterns.
Understanding Data Structure: Dendrograms give insights into the natural grouping and
hierarchical structure of the data.
Dendrogram in Practice
1. In Python: Use `scipy.cluster.hierarchy` with `linkage` and `dendrogram`:

```python
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Sample data
data = [[1, 2], [2, 3], [3, 4], [5, 8], [8, 8]]

# Build the hierarchy (Ward linkage on Euclidean distances)
Z = linkage(data, method='ward')

# Draw the dendrogram
dendrogram(Z)
plt.show()
```
2. In R: Use the `hclust` function with `dist` for distance calculation, and `plot` for visualization.
Interpretation Tips
Cutting the Dendrogram: Setting a threshold and cutting the dendrogram at a certain height lets you define clusters. Everything below that cut forms clusters of similar items (see the snippet after these tips).
Cluster Hierarchies: The hierarchical nature lets you see both small and large clusters. Smaller
branches represent tight clusters, while larger branches represent more generalized groupings.
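To make the "cutting" step concrete, here is a short sketch using SciPy's `fcluster` on the same sample data as above (the cut height `t=4` is an illustrative choice):

```python
from scipy.cluster.hierarchy import linkage, fcluster

data = [[1, 2], [2, 3], [3, 4], [5, 8], [8, 8]]
Z = linkage(data, method='ward')

# Cut the tree at height t=4: merges below that distance collapse
# into flat clusters, one label per original data point
labels = fcluster(Z, t=4, criterion='distance')
print(labels)
```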
Let me know if you'd like specific help with code, or interpreting results!
Why is the Z-test used for sample sizes greater than 30 and the t-test for sample sizes less than 30?
The Z-test and t-test are both statistical methods used to test hypotheses about population means, but
they differ in terms of the assumptions they make about the data, particularly in relation to sample size
and population variance.
The Z-test is generally used when the sample size is large (typically n > 30), based on the following
assumptions:
Normality: According to the Central Limit Theorem, the distribution of sample means
approaches a normal distribution as the sample size grows, even if the data itself is not normally
distributed. For sample sizes above 30, this approximation is usually sufficient.
Population Standard Deviation Known: The Z-test assumes that the population standard
deviation (σ ) is known or can be accurately estimated. This is practical with larger samples, where
the sample standard deviation can approximate the population standard deviation well.
With larger sample sizes, the Z-test becomes a powerful test due to these assumptions. The large
sample size reduces the influence of individual data points, creating a robust estimate of the population
mean.
The t-test is typically used when the sample size is small (n < 30) because of these reasons:
Unknown Population Standard Deviation: In smaller samples, it’s less likely that the population
standard deviation is known. Instead, the t-test uses the sample standard deviation (s) to estimate
the population standard deviation, introducing more variability.
T-Distribution: For smaller samples, the standardized test statistic (computed with s in place of σ) follows a t-distribution rather than a normal distribution. The t-distribution has "fatter tails," which account for the increased variability that comes from estimating the standard deviation. It converges to the normal distribution as the sample size increases, which is why the t-test is the appropriate choice for smaller samples (see the formulas below).
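The two test statistics make the contrast explicit; the only difference is whether the known σ or the estimated s appears in the denominator (standard one-sample formulation):

$z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \qquad t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}$

Estimating s adds uncertainty, and the t-distribution's heavier tails (with n − 1 degrees of freedom) absorb exactly that extra variability.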
In short:
Z-Test: Use if n > 30 and the population standard deviation (σ) is known or can be accurately estimated.
T-Test: Use if n < 30 or the population standard deviation is unknown and must be estimated from the sample.
In cases where the sample size is large, the difference between the Z-test and t-test becomes negligible,
and both yield similar results because the t-distribution approximates the normal distribution as n
grows. This is why the Z-test is preferred for larger samples where normality can be assumed.
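As a quick check in code (a sketch with simulated data; `ttest_1samp` is from SciPy and `ztest` from statsmodels, which estimates σ from the sample, a reasonable choice here since n > 30):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

# Simulated sample, n = 40, true mean 5.2
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.2, scale=1.0, size=40)

# One-sample t-test of H0: mean == 5 (uses the sample std s)
t_stat, t_p = stats.ttest_1samp(sample, popmean=5)

# Z-test of the same hypothesis
z_stat, z_p = ztest(sample, value=5)

print(t_stat, t_p)
print(z_stat, z_p)  # with n = 40, the two tests agree closely
```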