BT 3041: Analysis and Interpretation of Biological Data

Uploaded by

Muhamed zameel
Copyright: © All Rights Reserved

Data

BT 3041: Analysis and Interpretation of Biological Data
Outline
• Types of data
• Data visualisation
Types of Data
• Data Set is a collection of Data Objects
• Data Object: also called
– Record, point, vector, pattern, event, case, sample, observation, entity
• Data object is represented by a set of Attributes
• Attribute: a property or a characteristic of an object
Types of attributes
• Qualitative
– Nominal: just a name.
• E.g. PIN codes, IDs, eye color
– Ordinal: information to order objects.
• E.g. {good, better, best}
• Quantitative
– Numerical value exists
• E.g. temperature, pH
Quantitative attributes
• Discrete
– Binary
• Continuous
Characteristics of Data sets
• Dimensionality
– Curse of dimensionality
– Dimensionality reduction
• Sparsity
– Only a small fraction of attributes are non-zero
• Resolution
– Converting continuous quantities to discrete ones
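
As a rough illustration of sparsity and resolution, the sketch below measures the fraction of non-zero entries in a small data matrix and converts continuous values to discrete bins. The matrix, the pH readings, the cut points and the function names are all hypothetical and chosen only for this example.

```python
from bisect import bisect_right


def nonzero_fraction(rows):
    """Fraction of entries in a data matrix that are non-zero."""
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for value in row if value != 0)
    return nonzero / total


def discretize(value, cut_points):
    """Return the index of the bin that a continuous value falls in.

    cut_points must be sorted; a value equal to a cut point goes to the
    upper bin.
    """
    return bisect_right(cut_points, value)


# Hypothetical 3 x 5 data matrix (3 objects, 5 attributes).
matrix = [
    [0, 0, 3, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 2],
]
fraction_nonzero = nonzero_fraction(matrix)  # 3 of the 15 entries are non-zero

# Hypothetical pH readings reduced to three classes (assumed cut points 6.5 and 7.5).
ph_values = [2.5, 6.8, 7.0, 7.4, 9.1]
ph_bins = [discretize(ph, [6.5, 7.5]) for ph in ph_values]
```

A lower resolution (fewer cut points) loses detail but can make patterns easier to see; the cut points here are an arbitrary choice.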
Types of data sets
• Record Data
• Graph/network Data
• Ordered data
• Spatial, image and multimedia data
Record data:
• Same number of attributes for every record
• Data matrix

Document data (term counts per document):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Market basket data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
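
One possible way to turn the market basket transactions above into record data with the same number of attributes per record is a binary item matrix. The sketch below uses the five transactions from the slide; the variable names and the choice of alphabetical column order are assumptions made for this example.

```python
# Transactions from the market basket table, keyed by transaction ID (TID).
transactions = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beer", "Bread"],
    3: ["Beer", "Coke", "Diaper", "Milk"],
    4: ["Beer", "Bread", "Diaper", "Milk"],
    5: ["Coke", "Diaper", "Milk"],
}

# One attribute (column) per distinct item, in alphabetical order.
items = sorted({item for basket in transactions.values() for item in basket})

# Each record gets a 1 where the item is in the basket and a 0 otherwise.
binary_matrix = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

# Column sums give how many baskets contain each item.
item_counts = {
    item: sum(row[col] for row in binary_matrix.values())
    for col, item in enumerate(items)
}
```

Every row now has exactly one value per item, so the transactions can be treated as a data matrix.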
Graph/Network data
• Examples:
– Internet
– Signaling networks
– Molecular structures
Ordered Data
• Video data
• Time series data
• Molecular sequence data


Spatial/image data
• Examples:
– Maps
– Images

http://imgarcade.com/1/india-temperature-map/
Basic Statistical Descriptions of Data
• Motivation
– To better understand the data: central tendency and spread
• Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.

(Han, Kamber, Pei 2013)


Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
– Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
– Population mean: $\mu = \frac{\sum x}{N}$
– Note: n is sample size and N is population size.
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: chopping extreme values
• Median:
– Middle value if odd number of values, or average
of the middle two values otherwise
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
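
The measures of central tendency above can be sketched with Python's standard `statistics` module. The data values, the weights and the helper names `weighted_mean` and `trimmed_mean` are hypothetical, picked only to illustrate the definitions.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and the k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


# Hypothetical sample with one extreme value (110).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

data_mean = mean(data)             # pulled upwards by the extreme value
data_median = median(data)         # average of the two middle values (even count)
data_modes = multimode(data)       # two most frequent values, so the data are bimodal
data_trimmed = trimmed_mean(data, 1)

# Hypothetical weights: the first value counts three times as much as the others.
example_weighted = weighted_mean([1, 2, 3], [3, 1, 1])
```

The trimmed mean sits between the mean and the median here because only the single most extreme value on each side is removed.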
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively skewed and negatively skewed data

[Figure panels: positively skewed, negatively skewed]

February 3, 2023 Data Mining: Concepts and Techniques
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles:
• Q1 (25th percentile): The first quartile (Q1) is defined as the
middle number between the smallest number and the
median of the data set.
• The second quartile (Q2) is the median of the data.
• Q3 (75th percentile): The third quartile (Q3) is the middle
value between the median and the highest value of the
data set.
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
Boxplot
– Boxplot: ends of the box are the quartiles; median
is marked; add whiskers, and plot outliers
individually
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
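
A minimal sketch of the five-number summary and the 1.5 × IQR outlier rule follows. It assumes the quartile convention described above (Q1 and Q3 as the medians of the lower and upper halves of the sorted data); other software may use slightly different quartile conventions. The data and function names are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Q1 and Q3 are taken as the medians of the lower and upper halves of the
    sorted data; for an odd number of values the middle value is excluded
    from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + n % 2:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def iqr_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical sample with one extreme value.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
```

In a boxplot of this sample, the box would span Q1 to Q3 and the value flagged by `iqr_outliers` would be drawn as an individual point.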
Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is
Inter Quartile Range (IQR)
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended
to Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually
Properties of Normal Distribution Curve

• The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of the measurements
– From μ–3σ to μ+3σ: contains about 99.7% of the measurements
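
The three percentages above can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2); the sketch below evaluates it with the standard library. The function name `normal_coverage` is just a label chosen for this example.

```python
from math import erf, sqrt


def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))


# Coverage for 1, 2 and 3 standard deviations, as fractions between 0 and 1.
coverage = {k: normal_coverage(k) for k in (1, 2, 3)}
```

Rounded to the precision used on the slide, the three values correspond to about 68%, 95% and 99.7%.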

Boxplot for normal distribution
Visualization of Data Dispersion: 3-D Boxplots

Variance
• Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
– Standard deviation σ is the square root of the variance σ²
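
The two forms of the population variance, the mean squared deviation and the mean of squares minus the squared mean, can be compared in a short sketch. The second form needs only running sums, which is presumably what "scalable computation" refers to. The data and function names are hypothetical.

```python
from math import sqrt


def population_variance(values):
    """Mean squared deviation from the population mean (two passes over the data)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_one_pass(values):
    """Mean of squares minus squared mean, using running sums only."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    mu = total / n
    return total_sq / n - mu ** 2


# Hypothetical population of eight values.
population = [2, 4, 4, 4, 5, 5, 7, 9]
var_two_pass = population_variance(population)
var_one_pass = population_variance_one_pass(population)
std_dev = sqrt(var_two_pass)
```

The one-pass form is convenient for streaming data, although it can lose precision in floating point when the mean is large relative to the spread.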


Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary
• Histogram: x-axis shows values, y-axis shows frequencies
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
• Scatter plot: each pair of values is a pair of coordinates, plotted as a point in the plane
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

[Figure: histogram with x-axis ticks from 10000 to 90000 and frequency axis from 0 to 40]
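
A minimal sketch of how the tabulated frequencies behind a histogram can be computed, assuming equal-width, adjacent, non-overlapping bins (so bar height and bar area are proportional). The values, the bin width and the function name are hypothetical.

```python
def histogram_counts(values, low, width, n_bins):
    """Count values in n_bins adjacent bins [low + i*width, low + (i+1)*width).

    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical measurements binned into five intervals of width 20000.
values = [12000, 15000, 31000, 35000, 38000, 52000, 71000, 74000, 95000]
counts = histogram_counts(values, low=0, width=20000, n_bins=5)

# Proportion of cases falling into each bin.
proportions = [c / len(values) for c in counts]
```

With bins of unequal width, each count would have to be divided by its bin width so that the bar area, not the height, represents the value.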
Histograms Often Tell More than Boxplots
• The two histograms shown in the figure may have the same boxplot representation
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
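
The pairing of each sorted value with its fraction can be sketched as below. The slide does not say how the fraction is computed; the sketch assumes the common convention f_i = (i − 0.5)/n, which is an assumption and not taken from the slide. The data and function name are hypothetical.

```python
def quantile_pairs(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n.

    The (i - 0.5) / n convention is an assumption; other conventions exist.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical measurements; each pair is (f_i, x_i), ready to plot f on the
# x-axis and the data value on the y-axis.
pairs = quantile_pairs([3.1, 1.2, 4.8, 2.5])
```

Plotting these pairs shows every data point, so unusual values appear as jumps at either end of the curve.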

Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Data Visualization
• Why data visualization?
– Gain insight into an information space by mapping
data onto graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities,
relationships among data
– Help find interesting regions and suitable parameters
for further quantitative analysis

VISUALIZING IN LOW DIMENSIONS
Pixel-Oriented Visualization Techniques
• 1D: solar spectrum
• 2D: microarray data
3D visualization

Interactive visualization of three-dimensional biological models in a CAVE. http://www.jvrb.org/past-issues/6.2009/2257
Histogram
• Ex: Iris data
• 3 kinds of Iris flowers
• 4 attributes: petal length/width, sepal length/width
• 50 samples per flower type
Iris data: histograms of individual attributes
Pareto Chart
• A Pareto chart is a graph that indicates:
– the frequency of causes of a certain phenomenon,
– the cumulative impact of the causes.
• Pareto charts help identify which causes to prioritize in order to obtain the greatest overall improvement.
• They consist of a line graph and a bar graph
Pareto histogram
• Frequencies are sorted
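
The numbers behind a Pareto chart, sorted frequencies for the bars and cumulative percentages for the line, can be sketched as follows. The list of causes is hypothetical, as is the function name.

```python
from collections import Counter


def pareto_table(observations):
    """Return (cause, count, cumulative percent) rows sorted by descending count."""
    counts = Counter(observations).most_common()
    total = sum(count for _, count in counts)
    table = []
    running = 0
    for cause, count in counts:
        running += count
        table.append((cause, count, 100 * running / total))
    return table


# Hypothetical log of causes for ten failed experiments.
failures = ["contamination"] * 5 + ["pipetting"] * 3 + ["reagent"] * 2
table = pareto_table(failures)
```

The counts would be drawn as bars and the cumulative percentages as the line; causes at the top of the table are the ones to address first.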
Scatter plots
Correlation from scatter plots

Pearson Correlation Coefficient
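
The slide names the Pearson correlation coefficient without giving its formula. The standard definition is the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. A minimal sketch with hypothetical data and function name:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical bivariate data: one perfectly increasing and one perfectly
# decreasing linear relationship.
xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])
```

Values near +1 or −1 correspond to scatter plots whose points lie close to a straight line; values near 0 indicate no linear relationship.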


Scatter plot in 3D
VISUALIZING SPATIO-TEMPORAL
DATA
Contour plot
Surface Plot
Vector field plots
• 2D: a vector is depicted at every point
• 3D: a vector is depicted at every point

VISUALIZING IN HIGH DIMENSIONS
Parallel Coordinates plot
Star Coordinates

Sw – sepal width
Sl – sepal length
Pl – petal length
Pw – petal width
[Figure axes: sw, sl, pl, pw]
Chernoff faces

SL = size of face
SW = forehead/jaw ratio
PL = shape of forehead
PW = shape of jaw
A more intricate use of Chernoff faces
Multiscale visualization

(Shimabukuro et al 2004)
Circle segment display
CIRCOS

http://jura.wi.mit.edu/bio/education/hot_topics/Circos/Circos.pdf
http://circos.ca/intro/published_images/
Cone trees
• Display hierarchical data as cones
• Root node = apex; children = around the base
• Nodes are transparent so that you see the
nodes in the background
• Cones lower in the hierarchy are progressively
smaller
• If you click on a node, the node and the entire
path from the root node are highlighted
A Cone Tree
Another cone tree!
