BT 3041: Analysis and Interpretation of Biological Data
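Document data (term-frequency vectors)
• Terms in the column headers include: season, coach, game, score, team, lost, play, win
Document 1: 3 0 5 0 2 6 0 2 0 2
Document 2: 0 7 0 2 1 0 0 3 0 0
Document 3: 0 1 0 0 1 2 2 0 3 0
Transaction data (TID: items)
2: Beer, Bread
3: Beer, Coke, Diaper, Milk
4: Beer, Bread, Diaper, Milk
5: Coke, Diaper, Milk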
Graph/Network data
• Examples:
– Internet
– Signaling networks
– Molecular structures
Ordered Data
• Video data
http://imgarcade.com/1/india-temperature-map/
Basic Statistical Descriptions of Data
• Motivation
– To better understand the data: central tendency and spread
• Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
• Median:
– Middle value if odd number of values, or average
of the middle two values otherwise
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
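
The median and mode definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names and the sample values are made up for the example.

```python
from collections import Counter

def median(values):
    """Middle value for an odd count; mean of the two middle values otherwise."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values tied for the highest frequency.

    One returned value means unimodal, two bimodal, three trimodal.
    """
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical sample, chosen only to show an even count and two modes.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(median(sample))  # 54.0 (mean of the two middle values, 52 and 56)
print(modes(sample))   # [52, 70] (bimodal)
```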
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively
skewed and negatively skewed data
February 3, 2023 Data Mining: Concepts and Techniques
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles:
• Q1 (25th percentile): The first quartile (Q1) is defined as the
middle number between the smallest number and the
median of the data set.
• The second quartile (Q2) is the median of the data.
• Q3 (75th percentile): The third quartile (Q3) is the middle
value between the median and the highest value of the
data set.
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
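
As a sketch of the quartile definitions above, the following computes a five-number summary by taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data. Excluding the median itself from both halves when the count is odd is one common convention and an assumption here; other tools interpolate percentiles and can give slightly different quartiles. The function name and data are illustrative only.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical data set with 11 values.
data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
minimum, q1, q2, q3, maximum = five_number_summary(data)
print(minimum, q1, q2, q3, maximum)  # 6 15 40 43 49
print("IQR =", q3 - q1)              # IQR = 28
```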
Boxplot
– Boxplot: ends of the box are the quartiles; median
is marked; add whiskers, and plot outliers
individually
– Outlier: usually, a value more than 1.5 x IQR
above Q3 or below Q1
Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is
Inter Quartile Range (IQR)
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended
to Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually
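
The quantities a boxplot draws can be computed directly. The sketch below assumes the common 1.5 x IQR outlier threshold and the median-of-halves quartile rule; it also ends each whisker at the most extreme value that is not an outlier, which is a frequent plotting convention rather than something the slide specifies. The function name, the parameter `k`, and the data are hypothetical.

```python
from statistics import median

def boxplot_stats(values, k=1.5):
    """Quartiles, whisker ends and outliers for a boxplot (fences at k * IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlier values
        "outliers": outliers,                 # plotted individually
    }

# Hypothetical data with one unusually large value.
stats = boxplot_stats([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 120])
print(stats["q1"], stats["median"], stats["q3"])  # 25.5 40.5 45.0
print(stats["whiskers"], stats["outliers"])       # (6, 49) [120]
```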
Properties of Normal Distribution Curve
Boxplot for normal distribution
Visualization of Data Dispersion: 3-D Boxplots
Variance
• Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
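
To illustrate why the second form of the population variance is called scalable, the sketch below computes it both ways: from deviations about the mean (two passes over the data) and from running sums of x and x² (one pass). The function names and data are made up for the example, and the one-pass form can lose floating-point precision when the mean is large relative to the spread.

```python
import math

def variance_two_pass(values):
    """Population variance from squared deviations about the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def variance_one_pass(values):
    """Population variance from running sums: mean of squares minus squared mean."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:  # a single scan, so it also works on streamed data
        n += 1
        total += x
        total_sq += x * x
    mu = total / n
    return total_sq / n - mu ** 2

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_two_pass(data))             # 4.0
print(variance_one_pass(data))             # 4.0
print(math.sqrt(variance_two_pass(data)))  # 2.0 (population standard deviation)
```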
Histogram Analysis
• Histogram: Graph display of tabulated
frequencies, shown as bars
• It shows what proportion of cases fall
into each of several categories
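
The tabulation behind a histogram can be sketched as follows, assuming equal-width bins over a chosen range (a choice made for this example; the slide does not fix a binning rule). The function name, parameters and data are hypothetical.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals.

    Intervals are closed on the left; the last one also includes `high`.
    Values outside [low, high] are ignored.
    """
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

# Hypothetical measurements, binned into five intervals of width 2.
data = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]
counts = histogram_counts(data, low=0, high=10, bins=5)
proportions = [c / len(data) for c in counts]
print(counts)       # [1, 3, 1, 3, 2]
print(proportions)  # [0.1, 0.3, 0.1, 0.3, 0.2]
```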
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data values xi sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the
value xi
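
A quantile plot needs one (fi, xi) pair per observation. The sketch below assumes the common plotting position fi = (i - 0.5)/n for the i-th smallest of n values; the slide does not give the formula, so treat that choice, the function name and the data as assumptions.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / n for sorted x_i."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical data; plotting f_i on the x-axis against x_i gives the quantile plot.
points = quantile_plot_points([40, 10, 30, 20])
print(points)  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```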
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Data Visualization
• Why data visualization?
– Gain insight into an information space by mapping
data onto graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities,
relationships among data
– Help find interesting regions and suitable parameters
for further quantitative analysis
VISUALIZING IN LOW DIMENSIONS
Pixel-Oriented Visualization Techniques
• 1D
Solar spectrum
• 2D
Microarray data
3D visualization
Frequencies are sorted
Scatter plots
Correlation from scatter plots
2D
3D
Sw – sepal width
Sl – sepal length
Pl – petal length
Pw – petal width
Chernoff faces
SL = size of face
SW = forehead/jaw ratio
PL = shape of forehead
PW = shape of jaw
A more intricate use of Chernoff faces
Multiscale visualization
(Shimabukuro et al 2004)
Circle segment display
CIRCOS
http://jura.wi.mit.edu/bio/education/hot_topics/Circos/Circos.pdf
http://circos.ca/intro/published_images/
Cone trees
• Display hierarchical data as cones
• Root node = apex; children = around the base
• Nodes are transparent so that you see the
nodes in the background
• Cones lower in the hierarchy are progressively
smaller
• If you click on a node, the node and the entire
path from the root node are highlighted
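
The click behaviour described above amounts to walking from the selected node up to the root. A minimal sketch, assuming the hierarchy is stored as a child-to-parent mapping; the node names and function names are hypothetical and no rendering is attempted.

```python
# Hypothetical hierarchy: each node maps to its parent; the root maps to None.
parent = {
    "root": None,
    "A": "root",
    "B": "root",
    "A1": "A",
    "A2": "A",
    "B1": "B",
}

def path_from_root(node, parent):
    """Nodes to highlight when `node` is clicked: the root-to-node path."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def children_of(node, parent):
    """Children of `node`, i.e. the nodes placed around the base of its cone."""
    return sorted(child for child, p in parent.items() if p == node)

print(path_from_root("A2", parent))  # ['root', 'A', 'A2']
print(children_of("root", parent))   # ['A', 'B']
```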
A Cone Tree
Another cone tree!