
Advanced Research Methodology (HVX8001)

Analysis of Data Using SPSS – Advanced Level

Dr. Md Firoz Khan
Department of Chemistry, Faculty of Science, University of Malaya
HP: 0162645381
Outline: The List of Exercises
Exercise: I
SPSS: Demonstration with an example Database for plotting
Exercise: II
Demonstration: Summary Descriptive Statistics
Exercise: III
Demonstration: Correlations analysis, paired t-test, ANOVA
Exercise: IV
Demonstration: Cluster Analysis
Exercise: V
Demonstration: Multiple regression model
Exercise: VI
Demonstration: PCA procedure
Exercise: VII
Demonstration (PCR): Dummy Data
Exercise: VIII
Demonstration: PCA-APCS
Flow of the data analysis using SPSS

Input data → Preprocessing → Data analysis procedures (initial and multivariate) → Output

Preprocessing includes:
- Removal of outliers
- Correction of missing data
- Replacing data below the detection limit with an appropriate value
- Conversion of data dimensions or normalization, if appropriate
Data analysis by SPSS: Exploratory Data Analysis

Initial analysis:
- Basic summary statistics (mean, median, std, etc.)
- Correlation analysis, paired t-test, ANOVA, etc.
- Time-series analysis

Multivariate analysis:
- CA: cluster analysis
- PCA/APCS: principal component analysis/absolute principal component scores
- MLR: multiple linear regression
- PCR: principal component regression
- PLS: partial least squares
A typical research framework and statistical input

- Air pollution monitoring: assessment of the MM power plant (PM2.5); experimental set-up
- Chemical analysis (trace metals, ionic and carbon compositions); biological monitoring (lung function performance)
- Database
- Statistical analysis and air pollution modeling: descriptive statistics, correlation, t-test, ANOVA, p value, cluster analysis, regression; PCA-APCS, PMF, CMB
- Health risk assessment (HRA); toxicity tests (cytotoxicity and DNA damage)
- Validation of the emission sources by bivariate rose plot / Potential Source Contribution Function (PSCF) / Concentration Weighted Trajectory (CWT) / HYSPLIT density model / wind vector by GrADS
- Establishment of appropriate emission sources (hotspots); strategic mitigation plan for the stakeholder (TNBR)
- Research output and impact
Exercise: I

SPSS: Demonstration with an example database for plotting
- Basics of statistics: practice from the previous lecture
- Practice with dummy data
- Prepare plots in SPSS
95% of values lie within 1.96 SDs of the mean: from mean − (1.96 × SD) to mean + (1.96 × SD).
For IQ scores with mean 100 and SD 15.3:
100 − (1.96 × 15.3) ≈ 70 and 100 + (1.96 × 15.3) ≈ 130, so P(score > 130) = 0.025.
95% of people have an IQ between 70 and 130.
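The same limits and tail probability can be cross-checked in SPSS syntax with the built-in CDF.NORMAL function. A minimal sketch, assuming the mean of 100 and SD of 15.3 from the example above (the single dummy case exists only so COMPUTE has a row to work on):

* Tail probability and 95% limits for IQ ~ N(100, 15.3).
DATA LIST FREE / id.
BEGIN DATA
1
END DATA.
COMPUTE lower = 100 - 1.96*15.3.
COMPUTE upper = 100 + 1.96*15.3.
COMPUTE p_above = 1 - CDF.NORMAL(130, 100, 15.3).
EXECUTE.
LIST lower upper p_above.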
Example use of lognormal distribution in our published work
Shape of Data

◼ Shape of data is measured by:
◼ Skewness
◼ Kurtosis
Skewness
◼ Measures asymmetry of data
◼ Positive or right skewed: Longer right tail
◼ Negative or left skewed: Longer left tail
Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,

$$\mathrm{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}}$$
Kurtosis
◼ Measures the peakedness of the distribution of data. The (excess) kurtosis of a normal distribution is 0.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,

$$\mathrm{Kurtosis} = \frac{n\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^2} - 3$$
• Positive or right skewed: longer right tail
• Negative or left skewed: longer left tail
• Large kurtosis → peaked distribution
• Low kurtosis → 'flatter' distribution
• For roughly normal data, skewness lies between −1 and +1 and kurtosis between −3 and +3
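In SPSS, these shape statistics come directly from the DESCRIPTIVES command. A minimal syntax sketch, assuming a hypothetical variable pm25 in the active dataset:

* Summary statistics, including skewness and kurtosis with their standard errors.
DESCRIPTIVES VARIABLES=pm25
  /STATISTICS=MEAN STDDEV MIN MAX SKEWNESS KURTOSIS.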
Exercise: II

Demonstration: Summary Descriptive Statistics
- Practice the basic statistics using dummy data
Correlation
◼ Strength and direction of the relationship
between variables
◼ Scattergrams

[Scatterplots: positive correlation, negative correlation, no correlation]

Example use of correlation plots: Khan et al. (2017), JGR

Linearity of the r value:
r > 0: linear, positive
r < 0: linear, negative
r = 0: no linear relationship
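A minimal SPSS syntax sketch for a Pearson correlation matrix, assuming hypothetical variables pm25, so4, and no3:

* Pearson correlations with two-tailed significance tests.
CORRELATIONS
  /VARIABLES=pm25 so4 no3
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.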
Exercise: III

Demonstration: Correlations analysis, paired t-test, ANOVA
- Practice correlation analysis with dummy data
- Paired t-test
- ANOVA test
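Both tests are single commands in SPSS syntax. A sketch assuming hypothetical paired measurements pre and post, and pm25 measured across a hypothetical grouping variable site:

* Paired t-test on two repeated measurements.
T-TEST PAIRS=pre WITH post (PAIRED).

* One-way ANOVA: does mean pm25 differ between sites?
ONEWAY pm25 BY site
  /STATISTICS=DESCRIPTIVES
  /POSTHOC=TUKEY ALPHA(0.05).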
Cluster Analysis (CA)
◼ Unsupervised pattern recognition
◼ Could involve: hierarchical clustering & non-
hierarchical clustering
◼ Dimensionality not reduced like PCA
◼ Generally views objects as points in n-
dimensional measurement space
◼ Objects aggregated step-wise according to the
similarity of their features
◼ Searches for the distance between objects in the
measurement space
◼ Developed primarily by biologists to determine
similarities between organisms
CA
Hierarchical cluster analysis (HCA), whose primary purpose is to assemble objects based on the characteristics they possess, was performed in this study using Ward's method with Euclidean distance as the measure of similarity. This most common technique produces a number of clusters that can be presented in a chart called a 'dendrogram', also known as a hierarchical tree.

A number of common numerical measures of similarity are available:
◼Correlation
◼Mahalanobis distance
◼Manhattan distance
◼Euclidean distance (most common)
◼Chebyshev distance
◼Minkowski distance (unifies Euclidean, Manhattan and Chebyshev distances)
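A minimal syntax sketch of the HCA described above, using Ward's method (which SPSS pairs with squared Euclidean distance) on hypothetical standardized variables zso4, zno3, and zpb:

* Hierarchical cluster analysis: Ward's method with (squared) Euclidean distance.
CLUSTER zso4 zno3 zpb
  /METHOD=WARD
  /MEASURE=SEUCLID
  /PRINT=SCHEDULE
  /PLOT=DENDROGRAM.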
Exercise: IV

Demonstration: Cluster Analysis
General Linear Model
◼ Linear regression is actually a form of the
General Linear Model where the parameters
are b, the slope of the line, and a, the
intercept.
y = bx + a + ε
◼ A general linear model is any model that describes the data as a linear combination of predictor variables
An example use of the Linear Model
[Khan et al. 2015]
Multiple regression
◼ Multiple regression is used to determine the effect of a
number of independent variables, x1, x2, x3 etc., on a
single dependent variable, y
◼ The different x variables are combined in a linear way
and each has its own regression coefficient:

y = b0 + b1x1 + b2x2 + … + bnxn + ε

◼ The b parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y.
◼ i.e. the amount of variance in y that is accounted for by
each x variable after all the other x variables have been
accounted for
Multiple Linear Regression
• Regression refers to the value of a response variable as a
function of the value of an explanatory variable.
• A regression model is a function that describes the
relationship between response and explanatory variables.
• Commonly referred to as the predictor-predictand method in earth/environmental sciences.
• A simple linear regression has one explanatory variable and
the regression line is straight.
• The linear relationship of variable Y and X can be written as
in the following regression model form
Y= b0 + b1X + e
where 'Y' is the response variable, 'X' is the explanatory variable, 'e' is the residual (error), and b0 and b1 are two parameters: b0 is the intercept and b1 is the slope of the straight line Y = b0 + b1X.
• By 'linear', we are referring to the parameters, not the variables.
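Fitting Y = b0 + b1X is a single REGRESSION command in SPSS. A minimal sketch with hypothetical variables y and x:

* Simple linear regression: least-squares estimates of the intercept and slope.
REGRESSION
  /STATISTICS=COEFF R ANOVA
  /DEPENDENT y
  /METHOD=ENTER x.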
Multiple Linear Regression
❖ Response variable is normally distributed.
❖ Relationship between the two variables is
linear.
❖ Observations of response variable are
independent.
❖ Residual error is normally distributed with
mean 0 and constant standard deviation.
◼ Y is expressed as a function of X
(deterministic portion, Ŷ ) plus the random
errors εi which should sum to 0.
◼ There are two parameters that need to be estimated:
◼ β0 – the intercept; β1 – the slope.
◼ Method: least squares method (LSM) – minimizes the sum of squared errors.

$$SSE = \sum_{i}\left(Y_i - \hat{Y}_i\right)^2 = \sum_{i}\left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2$$

• Involves solving sets of simultaneous equations (linear algebra)
Exercise: V

Simple example use of the MLR model:
Y = A1·X1 + A2·X2 + A3·X3 + … + An·Xn + C
[measured PM10 (μg m-3)] = A1 × [measured NOx (μg m-3)] + A2 × [measured sulphate (μg m-3)] + C (μg m-3). [Stedman et al. 2001]

Demonstration: Multiple regression model
- A simple linear regression model
Output of MLR model

Coefficients(a)
Model          B        Std. Error   Beta    t        Sig.
1 (Constant)   14.427   1.124                12.839   .000
  SO4          1.313    .174         .341    7.549    .000
  NO3          1.908    .359         .240    5.311    .000
a. Dependent Variable: Mass

Thus, the reconstructed MLR model:
[measured PM10 (μg m-3)] = 1.908 × [measured nitrate (μg m-3)] + 1.313 × [measured sulphate (μg m-3)] + 14.427 (μg m-3). [cf. Stedman et al. 2001]
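The table above can be reproduced with an ENTER-method regression. A sketch assuming the demo variables are named Mass, SO4, and NO3 as in the output:

* Multiple linear regression: Mass regressed on SO4 and NO3.
REGRESSION
  /STATISTICS=COEFF R ANOVA
  /DEPENDENT Mass
  /METHOD=ENTER SO4 NO3.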
An example multiple linear regression model

Practice with Dummy mass closure data


P values
◼ P value = the probability of obtaining the observed result (or one more extreme) by chance
◼ i.e. when the null hypothesis is true

◼ α level is set a priori (usually 0.05)

◼ If p < α, we reject the null hypothesis and accept the experimental hypothesis
◼ We are 95% confident that our experimental effect is genuine
◼ If, however, p > α, we fail to reject the null hypothesis
When to use a non-parametric method and when not to:
- Visually normal: use parametric
- Moderately skewed: use parametric
- Severely skewed: use non-parametric
- Outliers: use non-parametric
- Uniformly distributed: use non-parametric
Data Reduction using SPSS
[To be demonstrated in the Advanced Lecture for MSc and PhD students]
Basics of multivariate modeling

Receptor modeling in environmental forensics involves the inference of sources and their contributions through the analysis of chemical data from the ambient environment.

The objectives are to determine:
➢ the number of chemical fingerprints in the system;
➢ the chemical composition of each fingerprint;
➢ the contribution of each fingerprint in each sample.
Multivariate Receptor Modeling

1. Positive Matrix Factorization Model for environmental data analyses
https://www.epa.gov/air-research/positive-matrix-factorization-model-environmental-data-analyses

2. Chemical Mass Balance (CMB) Model
https://www3.epa.gov/scram001/receptor_cmb.htm

3. Unmix 6.0 Model for environmental data analyses
https://www.epa.gov/air-research/unmix-60-model-environmental-data-analyses

4. Principal Component Analysis/Absolute Principal Component Scores (PCA/APCS)
http://www.sciencedirect.com/science/article/pii/0004698185901325
Data mining: conversion of large data sets into smaller groups

Widely used models:
◼ PCA/absolute principal component scores (APCS): a simplified model; weighted APCS deals with the 'zero score' problem but lacks a non-negativity requirement
◼ Positive Matrix Factorization (PMF): a complicated and robust model; lower uncertainty; stops producing zero factor scores; requires component loadings and scores to be non-negative; capable of identifying sources without any prior knowledge of the sources

Other available models:
◼ EPA's Chemical Mass Balance (CMB)
◼ Unmix
◼ Artificial Neural Networks source receptor modelling
Principal Component Analysis (PCA)

❑ It is a way of identifying patterns in data, and expressing


the data in such a way as to highlight their similarities
and differences.

❑ Principal component analysis (PCA) is also a technique


used to emphasize variation and bring out strong
patterns in a dataset. It's often used to make data easy
to explore and visualize.

Objectives of PCA
a) To transform an original set of variables into a new
set of uncorrelated variables called principal
components
b) To rank components in order of the amount of
variance that they account for
c) To see if the first few components account for most
of the variation in the original data
d) If (c) is true, then to make use of a smaller number
of transformed variables
e) If (c) is true, subsequent data analysis can be
simplified because the data set is smaller
f) To seek an underlying meaning of the first few
components (must be approached with care)
PCA/MLRA

The receptor model is addressed with the following formula:

$$Z_{ij} = \sum_{k=1}^{p} g_{ik} f_{kj} + e_{ij}$$

where $Z_{ij}$ is the normalized data, $g_{ik}$ the source contribution, $f_{kj}$ the source profile, and $e_{ij}$ the measurement error.
[Figure: the data matrix is decomposed into source contributions and source profiles]
Factor loading using the PCA procedure

❑ A large set of data was used
❑ Four small groups were obtained
❑ Variables are highly correlated within their respective group
❑ The least correlation is observed among the groups
❑ Each group indicates similar properties, nature, sources, etc.
PCA

◼ The first PC (PC1) is the best fit straight line in the multi-
dimensional space, the scores represent the distance along the
line and the loadings the angle (direction) of the straight line
◼ PC1 explains the largest amount of data variance & subsequent
PCs explain decreasing amounts of data variance
◼ The lower the PC number, the greater the signal and the lower the noise
◼ Each PC describes a portion of the data, so all PCs add up to 100%
◼ If data reduction is good, you need fewer PCs to explain all the relevant data
◼ PC plots can simplify large or difficult datasets & show the main
trends and are easier to visualize than tables of numbers
Preparation of database
Common problems:
◼ Systematic bias: analysis by different labs or different methods
◼ Presence of data below the detection limit (DL)
◼ Presence of coelution (non-target analytes that elute at the same time as a target analyte)
◼ Data entry errors; identify outliers
◼ Noisy data
◼ Missing data
◼ Exclude variables if missing > 50%

Preparation of database (continued)

- Replace data below the DL with DL/2
- Replace missing data with the average value of nearby data, or simply the average of the variable concentration
- Normalize the data or convert it to a unitless/zero-centered mean
- Ensure an adequate number of data points and variables
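These replacement rules translate directly into syntax. A minimal sketch, assuming a hypothetical analyte pb with a detection limit of 0.05:

* Replace values below the detection limit (DL = 0.05) with DL/2.
IF (pb < 0.05) pb = 0.05/2.
EXECUTE.

* Replace remaining missing values with the series mean.
RMV /pb_filled=SMEAN(pb).

* Standardize to zero mean and unit SD; /SAVE creates the z-score variable Zpb_filled.
DESCRIPTIVES VARIABLES=pb_filled /SAVE.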
Adequate number of data points

◼ The number of data points must be greater than the number of variables
◼ The number of data points should be at least 5 times the number of variables
◼ N ≥ 100 samples (P. K. Hopke)
◼ N > (30 + p + 3)/2 (Henry et al. 1984)
◼ N = 50 (source unknown)
◼ N = 30 (magic number!)
◼ Suitability test (KMO and Bartlett's test): our suggestion!
Optimization of factor number

◼ Eigenvalue > 1
◼ Variance (%) ≈ 10 or greater
◼ Interpretable factor profiles
◼ At least one variable should respond significantly to each factor
◼ Exclude a variable if it does not respond to any factor!
Exercise: VI

Activities: PCA procedure
- Follow the example data and use it in PCA to reduce the data into small groups, with the least correlation among the groups

Demonstration: PCA procedure

PCA – PCR – APCS – MLR, step by step
Step 1: Get Data
◼ Suitable data (N)
◼ Missing values
Step 2: Normalize the Data in Excel
Step 3: Upload the Normalized Data into SPSS (File > Open)
Step 4: Make Sure the Data Are Numeric
Step 5: Test the Suitability of the Data
◼ KMO and Bartlett's test
Step 6: Check the KMO Value in the Output File
Step 7: Run PCA on the Normalized Data
◼ Check all the important information one by one
◼ Select the covariance method and Varimax rotation
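Steps 5-7 can also be run as a single FACTOR command. A minimal syntax sketch, assuming hypothetical standardized species zso4, zno3, znh4, and zpb:

* PCA with KMO/Bartlett test, eigenvalue > 1 extraction, Varimax rotation,
* and regression-method factor scores saved as FAC1_1, FAC2_1, ...
FACTOR
  /VARIABLES=zso4 zno3 znh4 zpb
  /PRINT=INITIAL KMO EXTRACTION ROTATION
  /CRITERIA=MINEIGEN(1)
  /EXTRACTION=PC
  /ROTATION=VARIMAX
  /SAVE=REG(ALL).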
PCA Results
- Eigenvalue > 1: 7 components!

PCA Results – Unrotated Factor Loading
PCA Results – Rotated Factor Loading (the important information)
Step 8: Explanation of Factor Loading
◼ Factor loading > 0.7
◼ Explain based on the significant variables
◼ Refer to published papers to explain the sources – this needs a lot of reading
Step 9: Copy and Paste the Factor Scores into an Excel Sheet
Principal component regression (PCR)
Principal component regression (PCR) is a combination of PCA (principal component analysis) and OLS (ordinary least squares) regression. PCR is one of the best approaches to studying the statistical relationships between air pollutants and meteorological factors. PCR can reduce the multicollinearity in the data set, because multicollinearity among the independent variables produces invalid results in terms of the model's predictions and the determination of the significant independent variables. Factors with eigenvalues greater than 1.0 are chosen, as such factors are considered significant, in order to fully understand the correlation relationships between the variables. The significant factors, consisting of the independent variables obtained from the PCA, are then regressed against the dependent variables using OLS regression.
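A minimal sketch of this PCR workflow in syntax form: save the component scores, then regress the dependent variable on them. The names are hypothetical (standardized predictors ztemp, zrh, zws; dependent o3), and the sketch assumes two factors emerge:

* Stage 1: PCA on the standardized predictors; save factor scores (FAC1_1, FAC2_1, ...).
FACTOR
  /VARIABLES=ztemp zrh zws
  /CRITERIA=MINEIGEN(1)
  /EXTRACTION=PC
  /ROTATION=VARIMAX
  /SAVE=REG(ALL).

* Stage 2: OLS regression on the saved scores; the scores are mutually
* uncorrelated, so the multicollinearity problem disappears.
REGRESSION
  /STATISTICS=COEFF R ANOVA
  /DEPENDENT o3
  /METHOD=ENTER FAC1_1 FAC2_1.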
Exercise: VII

Demonstration (PCR): Dummy Data


Limitation of PCR

- Input data: normalization
- PCA: factor loadings and factor scores for PC1, PC2, PC3, …
- Rotation by Varimax to obtain meaningful PCs
- MLR: regress PC1, PC2, PC3, … against a dependent variable
- Limitation: negative mass concentrations can appear (unrealistic)

Corrections for PCA (APCS):
- Introduce an artificial sample with zero concentration for all variables
- Execute PCA and calculate the APCS for each PC
- Regress the APCS against the dependent variable
- Determine the contribution of each PC with less uncertainty
APCS-MLR Step by Step

Step 10: Prepare a New Raw Data Set, Adding a Zero Sample at the End of the Rows
Step 11: Normalize the Zero Sample
- z = (X − mean)/SD
- Use '$' to fix the cells holding the average and standard deviation, e.g. paste the formula = (H3-H$632)/H$633
Step 12: Run PCA for the Second Time
Exercise: VIII

Demonstration: PCA-APCS
Step 13: Copy and Paste the Factor Scores (Zero Sample) into an Excel Sheet, as in Step 9
Step 14: Subtract the Factor Score for the Zero Sample (Step 13) from Each Sample in Step 9
◼ The revised factor scores are recognized here as the APCS (Step 9 − Step 13)
◼ Factor score minus the 'zero factor score' = APCS
Step 15: Run MLR Using PM2.5 Mass as the Dependent Variable and Each APCS as an Independent Variable
Step 16: Convert the APCS into Factor Mass by Multiplying by the Respective Regression Coefficients
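Steps 14-16 reduce to simple arithmetic once the zero-sample score and the regression coefficient are known. A sketch in syntax, using hypothetical values (a zero-sample score of -2.41 on factor 1 and a regression coefficient B1 = 3.85):

* Step 14: APCS = factor score minus the zero-sample factor score.
COMPUTE apcs1 = FAC1_1 - (-2.41).
* Step 16: factor mass = APCS multiplied by its regression coefficient (Step 15).
COMPUTE mass_f1 = 3.85 * apcs1.
EXECUTE.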
Conversion of APCS into Mass Concentration
- APCS × regression coefficient (B column)
- Delete 'negative mass' values from the data set

[Figure: correlation of input and predicted PM2.5 mass]
[Figure: % distribution of PM2.5 mass contributed by F1, F2, F3, F4, F5, and F6]
Assignment:
A review on current perspectives of principal component analysis followed by
an absolute principal component analysis in environmental application

Thank you for your attendance

For any further inquiries, please contact me:
[email protected], [email protected]
Acknowledgement

www.utsc.utoronto.ca/~phanira/WebResearchMethods/
https://www.nemoursresearch.org/open/StatClass/January200
https://www.stat.auckland.ac.nz/~balemi/Multivariate
