Data Exploration
▶ Numeric or Continuous
▶ Examples: temperature in Celsius, Fahrenheit
▶ Allows mathematical and logical operations
▶ Special types: Integer, Ratio
▶ Categorical or Nominal
▶ Examples: color of iris, temperature as hot/mild/cold
▶ Limited to logical operations
▶ Special type: Ordered nominal
▶ Data type conversion possible, but may lead to information loss
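A minimal sketch of the information loss mentioned above, converting numeric temperatures to ordered nominal labels. The helper `to_category` and its 10 and 25 degree cut-offs are illustrative assumptions, not taken from the slides.

```python
# Hypothetical helper: bin a Celsius reading into an ordered nominal label.
# The cut-offs (10 and 25 degrees) are assumptions chosen for illustration.
def to_category(celsius: float) -> str:
    if celsius < 10:
        return "cold"
    if celsius < 25:
        return "mild"
    return "hot"

readings = [3.5, 12.0, 18.4, 24.9, 31.2]   # numeric: arithmetic is meaningful
labels = [to_category(c) for c in readings]
print(labels)  # ['cold', 'mild', 'mild', 'mild', 'hot']

# The numeric values support a mean; the labels only support comparison and counting.
mean_celsius = sum(readings) / len(readings)

# Five distinct readings collapse into three labels, so the conversion cannot be undone.
distinct_before = len(set(readings))
distinct_after = len(set(labels))
print(distinct_before, distinct_after)  # 5 3
```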
Descriptive Statistics: Measures of Central Tendency
▶ Mean
▶ Arithmetic average of all observations
▶ Sensitive to outliers
▶ Median
▶ Middle value in sorted list of observations
▶ Less affected by outliers
▶ Mode
▶ Most frequently occurring value
▶ Useful for categorical data
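The three measures above can be compared on a small made-up sample using Python's standard `statistics` module. The data values are invented for illustration; the point is how a single outlier shifts the mean far more than the median.

```python
import statistics

values = [2, 3, 3, 4, 5]          # toy sample, invented for illustration
with_outlier = values + [100]     # same sample plus one extreme value

mean_clean = statistics.mean(values)            # 3.4
mean_outlier = statistics.mean(with_outlier)    # 19.5

median_clean = statistics.median(values)            # 3
median_outlier = statistics.median(with_outlier)    # 3.5

# The mode also works on categorical data, where mean and median are undefined.
mode_color = statistics.mode(["red", "blue", "red", "green"])  # 'red'

print(mean_clean, mean_outlier)
print(median_clean, median_outlier)
print(mode_color)
```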
Descriptive Statistics: Measures of Spread
▶ Range
▶ Difference between maximum and minimum values
▶ Simple but sensitive to outliers
▶ Variance
▶ Average of squared deviations from the mean
▶ Formula: $s^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
▶ Standard Deviation
▶ Square root of variance
▶ Same unit as the original data
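As a worked sketch of the spread measures above, the following computes range, variance, and standard deviation on an invented sample. It divides by N, matching the variance definition on this slide; `statistics.pvariance` is used only as a cross-check. Note that the sample variance (dividing by N − 1, `statistics.variance`) would give a different number.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # toy sample, invented for illustration

# Range: difference between maximum and minimum.
value_range = max(data) - min(data)            # 7

# Variance: average of squared deviations from the mean (divide by N).
mu = sum(data) / len(data)                     # 5.0
variance = sum((x - mu) ** 2 for x in data) / len(data)   # 4.0

# Standard deviation: square root of the variance, in the data's own unit.
std_dev = math.sqrt(variance)                  # 2.0

# Cross-check against the standard library's population variance.
variance_check = statistics.pvariance(data)

print(value_range, variance, std_dev)
```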
Correlation
Univariate Visualization Techniques
▶ Histogram
▶ Shows distribution of data
▶ Reveals central location, range, and shape of distribution
▶ Quartile (Box-Whisker) Plot
▶ Displays quartiles, median, and outliers
▶ Allows comparison of multiple attributes
▶ Distribution Chart
▶ Visualizes normal distribution function
▶ Assumes data follows normal distribution
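The numbers behind a histogram and a box-whisker plot can be sketched without a plotting library. The sample, the bin width of 2, the "inclusive" quartile method, and the 1.5 × IQR outlier rule are all assumptions made for this illustration; plotting tools may use different defaults.

```python
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # toy sample, invented for illustration

# Histogram: count observations per fixed-width bin [start + i*width, start + (i+1)*width).
start, width = 1, 2
n_bins = (max(data) - start) // width + 1
counts = [0] * n_bins
for x in data:
    counts[(x - start) // width] += 1

for i, c in enumerate(counts):
    lo = start + i * width
    print(f"[{lo}, {lo + width}) {'#' * c}")

# Box-whisker plot: quartiles, median, and outliers.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, outliers)  # 2.25 3.0 4.0 [9]
```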
Multivariate Visualization Techniques
▶ Scatterplot
▶ Shows relationship between two attributes
▶ Reveals correlations, patterns, clusters, and outliers
▶ Scatter Matrix
▶ Compares all combinations of attributes
▶ Useful for datasets with multiple attributes
▶ Bubble Chart
▶ Variation of scatterplot with additional dimension (bubble size)
▶ Density Chart
▶ Variation of scatterplot with background color as an additional dimension
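A scatter matrix compares every pair of attributes visually. A numeric companion, sketched here, is to compute a correlation coefficient for each pair. The Pearson coefficient is this example's choice rather than something the slides prescribe, and the attribute names and values in `table` are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy dataset, invented for illustration.
table = {
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "width":  [2.0, 4.0, 6.0, 8.0, 10.0],   # grows with length
    "height": [5.0, 4.0, 3.0, 2.0, 1.0],    # shrinks as length grows
}

# One coefficient per attribute pair, mirroring the panels of a scatter matrix.
pairwise = {
    (a, b): pearson(table[a], table[b])
    for a, b in combinations(table, 2)
}

for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Values near +1 or −1 correspond to scatterplot panels where the points fall close to a straight line; values near 0 suggest no linear relationship, although nonlinear patterns and clusters can still be visible in the plot.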
High-Dimensional Visualization Techniques
▶ Parallel Chart
▶ Projects multi-dimensional data into 2D space
▶ Attributes arranged along the x-axis, attribute values on the y-axis
▶ Deviation Chart
▶ Similar to parallel chart
▶ Shows mean and standard deviation for each class
▶ Andrews Curves
▶ Represents each data point as a curve defined by a Fourier series, with the attribute values as coefficients
▶ Useful for identifying outliers and patterns
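A minimal sketch of how an Andrews curve is computed, assuming the usual definition f(t) = x1/√2 + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... for t between −π and π. The function name `andrews_curve` and the sample point are invented for this example.

```python
import math

def andrews_curve(point, t):
    """Evaluate the Andrews curve of one data point at angle t.

    The attribute values act as Fourier coefficients:
    x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ...
    """
    total = point[0] / math.sqrt(2)
    for k, value in enumerate(point[1:], start=1):
        harmonic = (k + 1) // 2            # 1, 1, 2, 2, 3, 3, ...
        if k % 2 == 1:
            total += value * math.sin(harmonic * t)
        else:
            total += value * math.cos(harmonic * t)
    return total

# One data point with four attributes (values invented for illustration).
point = [1.0, 2.0, 3.0, 4.0]

# Sample the curve at five evenly spaced angles from -pi to pi.
ts = [-math.pi + i * (2 * math.pi / 4) for i in range(5)]
curve = [andrews_curve(point, t) for t in ts]

for t, y in zip(ts, curve):
    print(f"t = {t:+.2f}  f(t) = {y:+.3f}")
```

Plotting one such curve per data point, colored by class, gives the Andrews plot: points with similar attribute values produce similar curves, so outliers tend to stand apart from the bundle.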
Roadmap for Data Exploration