Exploratory Spatial Data Analysis
This document discusses exploratory spatial data analysis (ESDA) techniques and descriptive statistics that can be used to analyze and visualize spatial data. Key techniques include choropleth maps, histograms, measures of central tendency, outlier detection, and bivariate analyses such as scatter plots. ESDA aims to describe, visualize, and examine relationships in spatial data through both graphical and numerical methods.
Spatial Data Analysis
Lecture-04: Exploratory Spatial Data Analysis (ESDA)

Exploratory Spatial Data Analysis (ESDA) is a collection of visual and numerical methods used to analyze spatial data through
• (a) classical non-spatial ESDA
• (b) non-classical and advanced spatial ESDA (identifying spatial interactions, relationships and patterns)
ESDA methods and tools are used to
• Describe and summarize spatial data distributions
• Visualize spatial distributions
• Examine spatial autocorrelation (i.e., trace spatial relationships and associations)
• Detect spatial outliers
• Locate clusters
• Identify hot or cold spots

Descriptive Statistics
Descriptive statistics is a set of statistical procedures that summarize the essential characteristics of a distribution by calculating or plotting:
• Frequency distribution
• Center, spread and shape (mean, median and standard deviation)
• Standard error
• Percentiles and quartiles
• Outliers
• Boxplot graph
• Normal QQ plot

Other Statistics
• Inferential statistics is the branch of statistics that analyzes samples to draw conclusions about an entire population.
• Spatial statistics employs statistical methods to analyze spatial data, quantify a spatial process, discover hidden patterns or unexpected trends, and model these data in a geographic context.

Why Use Spatial Statistics?
• Centrographic measures
• Analyze spatial patterns
• Identify spatial autocorrelation, hot spots and outliers
• Perform spatial clustering
• Model spatial relationships
• Analyze spatially continuous variables

ESDA Tools and Descriptive Statistics for Visualizing Spatial Data
• Simple ESDA tools and descriptive statistics for visualizing spatial data (univariate data)
• ESDA tools and descriptive statistics for analyzing two or more variables (bivariate analysis)

ESDA Techniques and Descriptive Statistics
The most common ESDA techniques and descriptive statistics for analyzing univariate data (only one variable of the dataset is analyzed at a time) include:
• Choropleth maps
• Frequency distributions and histograms
• Measures of the center, spread and shape of a distribution
• Percentiles and quartiles
• Outlier detection
• Boxplots
• Normal QQ plot

Choropleth Maps
• Choropleth maps are thematic maps in which areas are rendered according to the values of the variable displayed.
• Choropleth maps are used to obtain a graphical perspective of the spatial distribution of the values of a specific variable across the study area.
• There are two main categories of variables displayed in choropleth maps:
• (a) spatially extensive variables (e.g., counts such as total population)
• (b) spatially intensive variables (e.g., rates, densities or ratios)

Frequency Distribution and Histograms
• A frequency distribution table is a table that stores the categories (also called "bins"), the frequency, the relative frequency and the cumulative relative frequency of a single continuous interval variable.
• The frequency of a particular category or value (also called an "observation") of a variable is the number of times the category or value appears in the dataset.
• Relative frequency is the proportion (%) of the observations that belong to a category.

Frequency Distribution and Histograms
• The cumulative relative frequency of each row is the sum of the relative frequency of that row and of all rows above it.

Histogram
• A frequency distribution histogram presents the bins on the x-axis and the frequencies (or relative frequencies) of a single continuous interval variable on the y-axis.
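To make the frequency distribution table concrete, here is a minimal Python sketch; the income variable, its values and the choice of eight bins are all invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
income = pd.Series(rng.normal(loc=30000, scale=8000, size=200))  # hypothetical variable

# Cut the continuous variable into 8 equal-width bins and count observations per bin.
binned = pd.cut(income, bins=8)
table = binned.value_counts().sort_index().to_frame("frequency")
table["relative_frequency"] = table["frequency"] / table["frequency"].sum()
table["cumulative_relative_frequency"] = table["relative_frequency"].cumsum()
print(table)
```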
• A probability density histogram is defined so that
• (a) the area of each box equals the relative frequency (probability) of the corresponding bin, and
• (b) the total area of the histogram equals 1.
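The density property can be checked numerically. A short sketch with a made-up sample, using numpy's density-scaled histogram:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # made-up sample

# density=True scales the bar heights so that the histogram integrates to 1.
heights, edges = np.histogram(data, bins=20, density=True)
total_area = np.sum(heights * np.diff(edges))
print(total_area)  # ~1.0
```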
Use of Frequency Distribution & Histogram
Frequency distribution tables and histograms are used to analyze how the values of the studied variable are distributed across the various categories. The histogram can also be used to judge whether the distribution is normal. Additionally, it can display the shape of a distribution and help examine the distribution's statistical properties (e.g., mean value, skewness, kurtosis).

Measures of Center
• Measures of central tendency provide information about where the center of a distribution is located. The most commonly used measures of center for numerical data are the mean and the median.
• The mean is the simple arithmetic average: the sum of the values of a variable divided by the number of observations.
• The median is the value that divides the scores, sorted from smallest to largest, in half.

Mean
The mean of a sample of n observations is
x̄ = Σxᵢ / n

Measures of Shape
• Measures of shape describe how values (e.g., frequencies) are distributed across the intervals (bins) and are measured by skewness and kurtosis.
• Shapes are either
• Symmetrical
• Asymmetrical
• Skewness is a measure of the asymmetry of a distribution around its mean.
• Kurtosis, from the graphical inspection perspective, is the degree of peakedness or flatness of a distribution.
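A minimal sketch of the center and shape measures, computed with numpy and scipy on an invented, deliberately right-skewed sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
values = rng.exponential(scale=2.0, size=500)  # made-up, right-skewed sample

print("mean:    ", np.mean(values))
print("median:  ", np.median(values))
print("skewness:", stats.skew(values))      # > 0 indicates a right (positive) skew
print("kurtosis:", stats.kurtosis(values))  # excess kurtosis; 0 for a normal curve
```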
Measures of Spread/Variability – Variation
• Measures of spread (also called measures of variability, variation, diversity or dispersion) of a dataset provide information about how much the values of a variable differ among themselves and in relation to the mean. The most common measures are as follows:
• Range
• Deviation from the mean
• Variance
• Standard deviation
• Standard distance
• Percentiles and quartiles

Range
• The range is the difference between the largest and smallest values of the variable studied.
• The greater the range, the more variation in the variable's values, which might also reveal potential outliers.

Deviation
• Deviation from the mean is the subtraction of the mean from each score.

Sample Variance
• Sample variance is the sum of the squared deviations from the mean divided by n − 1:
s² = Σ(xᵢ − x̄)² / (n − 1)
• Large values of s² (variance) reveal great variation in the data, indicating that many observations have scores far away from the mean.

Standard Deviation
• Standard deviation is the square root of the variance.
• In terms of z-scores, a positive z-score indicates how many standard deviations an observation lies above the mean, and a negative z-score indicates how many standard deviations it lies below the mean.
• The standard deviation is used to estimate how many objects in the sample lie far away from the mean, with reference to the z-score.

Percentile
• A percentile is a value in a ranked data distribution below which a given percentage of observations falls. Every distribution has 100 percentiles.
• Percentiles are used to compare a value in relation to how many values, as a percentage of the total, are smaller or larger.

Quartile
• The quartiles are the 25th, 50th and 75th percentiles, called the "lower quartile" (Q1), the "median" and the "upper quartile" (Q3) respectively.

Interquartile Range
• The interquartile range (IQR) is obtained by subtracting the lower quartile from the upper quartile: IQR = Q3 − Q1.

Quantile
Quantiles are equal-sized, adjacent subgroups that divide a distribution. Quantiles are often used to divide probability distributions into areas of equal probability. In fact, percentiles are quantiles that divide a distribution into 100 subgroups. GIS software uses quantiles to color and symbolize spatial entities when there are many different values.

Outliers
• Outliers are the most extreme scores of a variable.
• They should be traced for three main reasons:
• Outliers might be wrong measurements
• Outliers tend to distort many statistical results
• Outliers might hide (or reveal) unusual but genuine phenomena worth further investigation
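The sketch below computes the spread measures just listed on made-up data and applies the common 1.5 × IQR rule (one customary way to flag potential outliers; the slides do not prescribe a specific rule):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.append(rng.normal(50, 10, 300), [120, 135])  # invented data plus two extreme scores

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

print("range:             ", x.max() - x.min())
print("sample variance:   ", np.var(x, ddof=1))   # ddof=1 gives the n-1 denominator
print("standard deviation:", np.std(x, ddof=1))
print("Q1, median, Q3:    ", q1, median, q3)

# 1.5*IQR fences: observations outside them are flagged as potential outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("potential outliers:", x[(x < low) | (x > high)])
```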
Boxplot
• A boxplot is a graphical representation of the key descriptive statistics of a distribution.
• It depicts the median, the spread (in terms of percentiles) and the presence of outliers.
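A minimal matplotlib boxplot of the same kind of invented data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = np.append(rng.normal(50, 10, 300), [120, 135])  # invented data with two outliers

fig, ax = plt.subplots()
ax.boxplot(x)  # box: Q1-Q3, line: median, whiskers: 1.5*IQR, points: outliers
ax.set_ylabel("value")
ax.set_title("Boxplot of a hypothetical variable")
plt.show()
```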
Normal QQ Plot
• The normal QQ plot is a graphical technique that plots the quantiles of the data against the quantiles of a theoretical normal distribution.
• If the data are normally distributed, the points fall approximately on the theoretical 45-degree reference line; systematic departures from this line indicate non-normality.
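scipy offers a ready-made normal QQ plot via stats.probplot; a short sketch with an invented sample:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(loc=10, scale=2, size=300)  # made-up, roughly normal sample

# probplot orders the data, plots it against theoretical normal quantiles
# and adds a least-squares reference line.
stats.probplot(data, dist="norm", plot=plt)
plt.title("Normal QQ plot")
plt.show()
```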
ESDA Tools and Descriptive Statistics for Analyzing Two or More Variables (Bivariate Analysis)
• Spatial analysis often focuses on two different variables simultaneously. This type of analysis is called "bivariate," and the dataset used is called a "bivariate dataset." The study of more than two variables, along with the dataset used, is called "multivariate."

ESDA Techniques for Bivariate Data
• The most common ESDA techniques and descriptive statistics for analyzing bivariate data include
• Scatter plot
• Scatter plot matrix
• Covariance and variance–covariance matrix
• Correlation coefficient
• Pairwise correlation
• General QQ plot

Scatter Plot & Scatter Plot Matrix
• A scatter plot displays the values of two variables as a set of point coordinates.
• A scatter plot matrix depicts all possible pairs of scatter plots when more than two variables are available.
• The visual inspection of all pair combinations facilitates
• (a) locating variables with high or no association,
• (b) identifying the relationship type (i.e., linear or nonlinear),
• (c) spotting outlying points.
• The more closely the points cluster around a line, the stronger their linear correlation; the more scattered the pattern, the weaker the linear relationship.
• A data point lying far away from the rest is likely to be an outlier.
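pandas provides a scatter plot matrix out of the box. In this sketch the three variables are invented, and one pair is constructed to be linearly related:

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
df = pd.DataFrame({"income": rng.normal(30, 5, 200)})
df["rent"] = 0.4 * df["income"] + rng.normal(0, 1, 200)  # linearly related to income
df["age"] = rng.uniform(18, 80, 200)                     # unrelated variable

# Each off-diagonal panel is the scatter plot of one variable pair;
# the diagonal shows each variable's histogram.
scatter_matrix(df, diagonal="hist", figsize=(7, 7))
plt.show()
```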
Covariance and Variance–Covariance Matrix
• Covariance is a measure of the extent to which two variables vary together (i.e., change in the same linear direction). The sample covariance Cov(X, Y) is calculated as:
Cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
• If the covariance is positive, the variables change in the same direction; if it is negative, they change in opposite directions (one increases while the other decreases). Zero covariance indicates no linear correlation between the variables.
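A small sketch computing the sample covariance both directly from the formula above and via numpy's np.cov, on invented data:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)  # made-up variable that co-varies with x

# Direct application of the sample covariance formula (n - 1 denominator).
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the 2x2 variance-covariance matrix; the off-diagonal is Cov(X, Y).
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)  # the two values agree
```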
Covariance and Variance–Covariance Matrix
• The variance–covariance matrix is used in many statistical procedures to estimate the parameters of a statistical model, such as the eigenvectors and eigenvalues used in principal component analysis.
• It is also used in the calculation of correlation coefficients.
• Covariance and the variance–covariance matrix are descriptive statistics and are widely used in many spatial statistical approaches.
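To illustrate these two uses, the following sketch builds the variance–covariance matrix of three invented variables, derives the correlation matrix from it, and extracts the eigenvalues and eigenvectors on which principal component analysis relies:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.multivariate_normal(
    mean=[0, 0, 0],
    cov=[[4, 2, 0], [2, 3, 1], [0, 1, 2]],  # made-up population covariance
    size=500,
)

C = np.cov(data, rowvar=False)  # 3x3 sample variance-covariance matrix

# Correlation coefficients from the covariance matrix: r_ij = C_ij / (s_i * s_j).
s = np.sqrt(np.diag(C))
R = C / np.outer(s, s)

# Eigenvalues/eigenvectors of C: the basis of principal component analysis.
eigvals, eigvecs = np.linalg.eigh(C)

print(C, R, eigvals, sep="\n")
```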
Correlation Coefficient
• The correlation coefficient r(X, Y) analyzes how two variables (X, Y) are linearly related. Among the correlation coefficient metrics available, the most widely used is Pearson's correlation coefficient (also called the Pearson product-moment correlation).

Pairwise Correlation
• Pairwise correlation is the calculation of the correlation coefficients for all pairs of variables.
• It is used to identify potential linear relationships quickly.

General QQ Plot
• A general QQ plot depicts the quantiles of one variable against the quantiles of another variable.
• This plot can be used to assess similarities in the distributions of two variables.
• If the two variables have identical distributions, the points lie on the 45-degree reference line; if they do not, their distributions differ.
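A brief sketch of pairwise correlation and a general QQ plot, with invented variables (one pair correlated by construction):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
df = pd.DataFrame({"density": rng.normal(100, 20, 150)})
df["noise_level"] = 0.5 * df["density"] + rng.normal(0, 5, 150)  # correlated by construction
df["green_space"] = rng.uniform(0, 1, 150)                       # unrelated variable

# Pairwise Pearson correlation coefficients for every pair of variables.
print(df.corr(method="pearson"))

# General QQ plot: quantiles of one variable against quantiles of another.
q = np.linspace(0, 100, 101)
plt.plot(np.percentile(df["density"], q), np.percentile(df["noise_level"], q), "o")
plt.axline((0, 0), slope=1, linestyle="--")  # 45-degree reference line
plt.xlabel("density quantiles")
plt.ylabel("noise_level quantiles")
plt.show()
```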
Rescaling Data
• Rescaling is the mathematical process of changing the values of a variable to a new range.
• By rescaling data, the spread and the values of the data change, but the shape of the distribution and the relative attributes of the curve remain unchanged.
• Rescaling allows comparison of descriptive statistics across variables, comparison of highest and lowest values, and the construction of indices.
• It is especially useful for multivariate datasets.

Methods for Rescaling
• Normalize: A typical method of creating common boundaries is min–max normalization, which rescales values to the [0, 1] range:
x′ = (x − xmin) / (xmax − xmin)
• Adjust: Another way of rescaling data is to divide a variable by a specific value (or multiply it, assigning weights).
• Adjustments can be expressed in many other ways, depending on the problem studied and the research question/hypothesis tested.

Questions
• What are parametric and nonparametric methods and tests?
• What is a test of significance?
• What is the null hypothesis?
• What is a p-value?
• What is a z-score?
• What is a confidence interval?
• What is the standard error of the mean?
• What is so important about the normal distribution?
• How can we identify if a distribution is normal?
What Are Parametric and Nonparametric Methods and Tests?
• Parametric methods and tests are statistical methods that use parameter estimates for statistical inference.
• They assume that the sample is drawn from some known distribution (not necessarily normal) that obeys specific rules. They belong to inferential statistics.
• Statistical methods used when a normal distribution (or another type of probability distribution) is not assumed are called "nonparametric."

Confidence Interval
• A confidence interval is an interval estimate of a population parameter.
• In other words, a confidence interval is a range of values that is likely to contain the true population parameter value.
• A confidence interval is calculated once a confidence level is defined, usually 95% or 99%.
• A confidence level of 95% reflects a significance level of 5%.

Standard Error
• The standard error of a statistic is the standard deviation of its sampling distribution.
• The standard error reveals how far the sample statistic is likely to deviate from the actual population statistic.

Standard Error
• Low values of the standard error of the mean indicate more precise estimates of the population mean.
• The larger the sample, the smaller the standard error. This is rational: the more objects we have, the closer our approximation will be to the real values.

Significance Tests, Hypothesis, p-Value and z-Score
• A test of significance is the process of rejecting, or failing to reject, a hypothesis based on sample data.
• The p-value is the probability of obtaining the observed (or more extreme) results of a sample statistic (test statistic) if the null hypothesis is assumed to be true. It is calculated based on the z-score.

Z-Score
• The z-score (also called the z-value) expresses distance as the number of standard deviations between an observation (for hypothesis testing, calculated by a formula specific to the statistical test) and the mean. For samples it is calculated as:
z = (x − x̄) / s

Significance Level
• The significance level α is a cutoff value used to reject, or fail to reject, the null hypothesis. It is a user-defined probability, usually taking values such as α = 0.05, 0.01 or 0.001 (the 5%, 1% and 0.1% probability levels).
• The smaller the p-value, the more statistically significant the results.

How Can We Identify If a Distribution Is Normal?
• Create a histogram and superimpose a normal curve.
• Calculate the skewness and kurtosis of the distribution.
• Create a normal QQ plot.

What to Do When the Distribution Is Not Normal
• Use nonparametric statistics.
• Apply a variable transformation. An efficient way to deal with a non-normal distribution is to transform it (if possible) into a normal distribution.
• Check the sample size.
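A closing sketch (with an invented sample) pulling the inferential pieces together: the standard error of the mean, a 95% confidence interval under a normal approximation, a z-score for a single observation, and a numerical normality check. The Shapiro–Wilk test is not named in the slides; it is used here as one common normality test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
sample = rng.normal(loc=100, scale=15, size=200)  # made-up sample

mean = sample.mean()
sd = sample.std(ddof=1)
se = sd / np.sqrt(len(sample))  # standard error of the mean

# 95% confidence interval for the population mean (normal approximation, z = 1.96).
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se
print(f"mean={mean:.2f}, SE={se:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")

# z-score of a single observation: distance from the mean in standard deviations.
x = 130
print("z-score of 130:", (x - mean) / sd)

# Normality checks: skewness/kurtosis plus a Shapiro-Wilk test
# (a p-value above alpha = 0.05 means we fail to reject normality).
stat, p = stats.shapiro(sample)
print("skewness:", stats.skew(sample), "kurtosis:", stats.kurtosis(sample))
print("Shapiro-Wilk p-value:", p)
```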