PMFDOC2

Contents

Introduction
Special enhancements of the factor analytic models
Modelling of "errors"
Installation of PMF2 and/or PMF3
The general working of PMF2 and PMF3
The .INI file for controlling PMFx
    Maintaining your .INI files
    General run control items in the .INI file
        "Monitor"
        "Version of PMFx", compatibility
        "Dimensions"
        "Repeats"
        "FPEAK"
        "Robust mode"
        "Outlier threshold distance"
        "Codes C1, C2, C3 and Errormodel (EM)"
        Pseudorandom "Seed"
        "Stabilizer" and "Accelerator"
        Iteration control table for 3 levels of limit repulsion
        Optional parameters
Input/Output of arrays
    Specifying file properties in the .INI file
    Using formats
    Control of details of reading and writing
    Specifying initial values of factors
    Array headers
    Arranging arrays in files
        Layout of factor matrices in files
        Reading/writing three-way data blocks
        Special read layout for X and X_std-dev
Normalization of factor matrices
Technical details
    Marking true and false
    Unwanted characters in data files
    Avoiding too large files
    Seldomly needed details
        Systematic naming of files
        Controlling pseudorandom numbers
Advanced options for PMF2 and PMF3
    The key matrices Gkey, ..., Ckey
    PMFx generates synthetic data arrays
Specifying standard deviations for the array X. Significance of the Q value. Outliers.
    The array of standard deviations
    Specifying the X_std-dev
    Robustness
    Missing values
    Judging the values obtained for Q
The problem of rotations; error estimates of results
    Using FPEAK to control rotations in PMF2
    Rotational matrices, rotational freedom, and error estimates of results
    The result matrices G_std-dev and F_std-dev
Miscellaneous details
    Locations of error messages
    Explained Variation EV
    The covariance matrix of A, B, and C
    Observing the process of iteration
    Using PMF3 for solving 2-way problems
    Avoiding degenerate factorizations with PMF3
Efficiency considerations
Fortran errors, weak points of PMF programs
    Errata
Requesting support and updates
Disclaimer
Introduction
These instructions are for the two programs PMF2 and PMF3. The reason for combined instructions is
that the programs have similar control structures and similar usage patterns. The notation PMFx means
either PMF2 or PMF3. We try to reserve the word "matrix" for two-dimensional arrays, and
correspondingly the word "block" for 3-way arrays.
Before reading these instructions, you should become familiar with the information in the section
"Getting started with PMF” in part 1 of PMF User’s Guide.
The program PMF2.EXE solves approximately (in the Least Squares sense) the matrix equation
X = GF (1)
where X is known and G and F are unknown. Writing in component form and showing the residual
matrix explicitly we get
$x_{ij} = \sum_{h=1}^{p} g_{ih} f_{hj} + e_{ij}$    (2)
where std-dev(X) = std-dev(E) = S and where some or all elements of G and F are required to be non-
negative. There are p “factors” in this model. One column of G and the corresponding row of F
represent one factor. They correspond to the ‘scores’ and ‘loadings’ of the customary factor analysis.
When the residual matrix E is defined by
X = G F + E, (3)
the task of PMF2 may be expressed as minimizing the sum of squares
$Q = \sum_{i=1}^{n} \sum_{j=1}^{m} (e_{ij} / s_{ij})^{2}$    (4)
In the robust mode, this expression is modified so that the sij are dynamically readjusted (= iterative
reweighting). We also refer to the value of Q as "chi2", but this is not quite correct: strictly speaking,
Q is not distributed according to the chi-squared distribution, although in many cases the distribution
approximates chi2.
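As a small concrete illustration (not part of PMFx), equation (4) can be evaluated in a few lines of Python; the array names X, Y and S below are only assumptions made for this example.

    import numpy as np

    # Hypothetical small example: X is the data, Y = G F is the fit,
    # S holds the user-specified standard deviations s_ij.
    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    Y = np.array([[1.1, 1.9], [2.8, 4.2]])
    S = np.array([[0.1, 0.1], [0.2, 0.2]])

    E = X - Y                    # residual matrix, equation (3)
    Q = np.sum((E / S) ** 2)     # object function Q, equation (4)
    print(Q)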
The “PARAFAC” model solved by the program PMF3 is best described in the following component
form,
$x_{ijk} = \sum_{h=1}^{p} a_{ih} b_{jh} c_{kh} + e_{ijk}$    (5)
Again there are p factors in the model, but in this case there are three entities (columns of A,B,C)
forming one factor. And again the model is solved as a non-negatively constrained weighted
(iteratively reweighted) Least Squares task.
Notation: we denote the array of observed data by the symbol X both in PMF2 and in PMF3. Anticipating
later needs, we denote the “fit” array by Y so that equation (3) may be written as X=Y+E, where the
matrix Y=GF, and equation (5) is also X=Y+E, but now the block Y may be written symbolically as
Y=ABC.
Modelling of "errors"
Sometimes the residuals eij are caused by the errors of a measuring process in a straightforward
manner. Then it may be possible to specify the standard deviations sij for measured data xij before
running PMFx. This is the "easy" case. If the errors are mainly caused by weighing the samples, then
we may have this easy situation.
It is more usual that the randomness is inherent in the process to be studied. Then it is not possible to
deduce a reliable "error estimate" for an isolated measured value. An example is given by Poisson-
distributed data, e.g. a set of low-intensity spectra measured by counting the radiation quanta, or an
ecological study where a number of species are counted. If the count "2" has been observed, then the
standard deviation for this value could be less than one unit or it could be over two units. The
expectation values are needed in order to estimate the standard deviations or "error estimates". The
unknown "true array" which is to be represented by the factor model approximates the expectation
values or the "true values". We cannot solve the problem if we don't know the solution!
The programs PMFx solve this dilemma in the following iterative way: first rough approximations for
the standard deviations sij are formed and an initial solution of the problem is computed. This initial
factorization gives approximate expectation values (or other similar parameters) for the random vari-
ables. Then the program is able to compute better approximations for the sij, then a better factorization,
and so on. The computations in PMFx are iterative in any case, thus the need to iterate the error model
does not add significantly to the computational workload.
Another example of this kind is lognormally distributed data, which are common in environmental
studies. Each observed value xij may be regarded as a sample from a unique lognormal distribution.
The geometrical mean values $\mu_{ij} = \sum_{h=1}^{p} g_{ih} f_{hj}$ of these distributions obey the factor model, thus

$x_{ij} = \mu_{ij} + e_{ij} = \sum_{h=1}^{p} g_{ih} f_{hj} + e_{ij}$.    (6)
We view this as a maximum likelihood (ML) problem: the factor model should be determined so as to
maximize the likelihood of the observed array X. It is possible to determine a Least Squares (LS)
problem which is equivalent to the ML problem in the sense that the two problems have the same
solution. When PMFx solves this equivalent LS problem then it in fact solves the ML problem and
computes ML estimates of the original factorization problem. Equivalent LS problems for the ML
solution of Poisson distributed and lognormally distributed factor models have been programmed in
PMFx. The equivalence requires that PMFx computes iteratively the std-dev values sij so that the LS
problem based on these sij is equivalent to the original ML problem. The equations for these std-values
are given in the section describing the error models. Deriving these equations is by no means trivial,
thus the end user may just take them as given.
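The following Python sketch outlines the idea of such iterated error modelling. It is only a conceptual illustration, not the PMFx algorithm: the inner fit shown here is a generic weighted non-negative multiplicative-update factorization used as a stand-in, and the "5 % plus constant" error model is merely an example.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.gamma(shape=2.0, scale=1.0, size=(30, 8))   # toy positive data

    def weighted_nnls_fit(X, S, p, steps=200):
        """Stand-in for the factorization step: weighted non-negative
        least squares via multiplicative updates (NOT the PMF algorithm)."""
        n, m = X.shape
        W = 1.0 / S**2
        G = rng.random((n, p))
        F = rng.random((p, m))
        for _ in range(steps):
            Y = G @ F
            G *= ((W * X) @ F.T) / np.maximum((W * Y) @ F.T, 1e-12)
            Y = G @ F
            F *= (G.T @ (W * X)) / np.maximum(G.T @ (W * Y), 1e-12)
        return G, F

    # Outer loop: refine the error model together with the fit.
    S = 0.05 * X + 0.1                       # rough initial std-devs
    for _ in range(5):
        G, F = weighted_nnls_fit(X, S, p=2)
        Y = G @ F
        S = 0.05 * np.maximum(X, Y) + 0.1    # better std-devs, based on the fit
    print(np.sum(((X - Y) / S) ** 2))        # final Q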
Solving the lognormal model. First it should be remarked that the lognormal model is usually not
needed. In most situations occurring in environmental research, the errormodel code -14 is quite
adequate and has fewer problems than the lognormal model. The lognormal model should only be used
after carefully determining that the assumptions of the lognormal model are actually fulfilled by the
errors of all variables of the data set. (Note that it is not enough that the distribution of each variable
appears lognormal.) If the assumptions are met, the user has to specify the logarithm of the
"geometrical standard deviation" (log(GSD)) for each measured value (often the same value is good for
all measured points). This is sufficient if the distributions are really pure lognormals. Then the result is
indeed a ML estimator. But here is a catch: the value zero cannot occur in a lognormal distribution
unless the distribution is degenerated to zero. In practice zero values often occur, e.g. because of
concentrations below the detection limit. These values make the factorization unreliable if they are
processed under the assumption of pure lognormality. Then it is necessary to modify the assumption of
lognormality. PMFx offers the option to assume that there is an additional normal error superimposed
onto the lognormal distribution. The detection limit of the measured values might be used as this
additional error. The statistical properties of this model are not fully known; the solution is probably a
good approximation of an ML solution, but it has not been proved to be strictly ML.
The expectation of the Q value does not equal its usual value (the degrees of freedom) in lognormal
models. PMFx estimates the increase of the expectation value of Q and writes it in the .log file.
- are array headers useful or a nuisance in the output: they might help the human user, but if the data are to be input to a spreadsheet, say, the headers may be a nuisance. Arrange for writing or no writing of headers.
- how many different files are needed for input and for output. Is it desired that old information in output files is conserved, or should old information be deleted? Arrange file designations accordingly. The only limitation is that any single file must be either for input only, or for output only. Otherwise one may configure input and output flexibly, e.g. the titles might be written to one file, and the arrays to another. In fact, each step a to d, A to C for each array in stages 2 and 4 is individually controlled.
For unrelated tasks one should prepare separate .INI files. Under the control of one .INI file one may
process a sequence of related tasks having common dimensions and common control information. It is
possible to have some arrays input only once, to be used repeatedly in the sequence, and to have other
arrays read or generated individually for each task in the sequence.
“Dimensions”
Defaults for PMF2: 40, 20, 4
Defaults for PMF3: 0, 0, 0, 0
The PMF2 default values are for the Gauss-Exponential test case. For other cases, you will have to
change them. “Rows” and “Columns” are the dimensions n and m of the matrix X to be factorized.
“Factors” is the number p of factors to be used by PMFx. Even one factor (p=1) may be a meaningful
selection in some cases.
PMF3: the 3-dimensional data block “X” is visualized as a book: there are rows and columns on pages.
The data is read and written page by page. The dimensions are: numbers of rows, columns, pages, and
factors.
“Repeats”
Default: 1
PMF programs are able to compute several cases in one run, provided that they are controlled by a
single .INI file. The .INI file is only read once, and the data files are also only opened once. Based on
these, PMFx may process a number of similar cases; their number is given by “Repeats”. The default
(Repeats=1) computes a single case. Different aspects may be varied between the repeated cases.
Typical examples: different data sets may be analyzed. Or, one data set may be analyzed with different
std-dev values or starting from different initial factor values. During the repeated runs, those lines of
the I/O control table (see below) are performed again (= reading again, or forming random values
again) whose (R) code has the value T or true.
“FPEAK”
(PMF2 only) Default: 0.0
Fpeak exerts control on the rotational state of the solution. The default prefers a "central" rotation.
“Robust mode”
Default: True
ROBUST is T (true) or F (false). Use T unless you know that the errors in your data are approximately
normally distributed and that there are no outliers and no non-representative values in your data.
Pseudorandom “Seed”
The program may generate different sequences of pseudorandom numbers, depending on the chosen
Seed value. There is no limit to the lengths of these sequences. Repeated analyses may be performed so
that the generation of initial values is repeated. Then the new values are derived by continuing the
sequence that was originally initiated by the Seed value.
the values in order to find a fast convergence. With typical .INI values the end test works as follows:
Each stage is ended if there have been 4 (=Ministeps_required) consecutive steps where the absolute
value of the change of Q (“chi2”) value was less than 0.1 (=Chi2_test) on each step. Also, the stage
ends if the maximum allowed cumulative step count is exceeded for the stage (=Max_cumul_count). In
order to study if the iteration has really converged to a true local minimum, one could run with a small
third value for the Chi2_test, e.g. with 0.01. Large problems require larger values for Chi2_test!
The first two values for “lims” do influence the route which the minimization process follows in the
many-dimensional space. If there are several local solutions to your problem, then changing the first
(and perhaps also the second) lims value may result in a change of the result from one local optimum
into another. It is tentatively suggested that large values of lims should be avoided in such cases.
The last (third) lims value affects how near to zero the constrained components of the solution may get.
Possible values are from 0.1 to 0.001, say. Smaller values allow the results to go nearer to zero.
Optional parameters
Near the end of the .INI file there is a place for optional parameters or "special information". This
arrangement provides flexibility: the format of the .INI files may be kept unchanged even when new
features are introduced. The use of many of the optional parameters is explained in detail
elsewhere in the User's Guide. The following optional parameters are available now:
sortfactorsg Before output, PMF2 orders the factors in a systematic order, based on values EV(G)
sortfactorsf PMF2 bases the ordering of factors on EV(F), in effect ordering the F factors.
Do not use both sort options (g and f) simultaneously!
sortfactorsa, sortfactorsb, sortfactorsc Similar sort options for PMF3, based on columns of
A, B, or C, respectively.
normfactorsac PMF3 normalizes the average values of A and C factors to unity.
missingneg r All negative values in X represent missing data. PMFx increases their specified std-dev
values internally by the factor r. Typically use r=10.0 to r=100.0.
BDLneg r1 r2 (r1<0.0, r2>1.0) PMFx interprets negative values below r1 as missing values (r2 gives
the desired increase of their std-dev values). Negative values between zero and r1
represent BDL indicators.
minreg r (PMF2 and PMF3) Normally both the logarithmic penalty and the regularization are
controlled by "lims". The optional parameter "minreg r" controls the regularization
so that regularization strength is max(lims, r).
outlimits αp αn The two decimal values αp and αn are used as two different outlier threshold
distances, separately for positive and negative residuals. (Then the common
outlier threshold distance α, specified earlier in the .ini file, is not used at all.)
goodstart use this parameter whenever the initial solution of PMF2 is better than a set of
random values. If goodstart is not used, the program may “forget” the initial
solution because then it proceeds very cautiously during the initial steps.
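As a purely hypothetical illustration (the exact layout of this part of the .INI file is not shown here; see PMFxDEF.INI and the examples distributed with the programs), a few optional-parameter lines might look like the following, one parameter per line:

    sortfactorsg
    missingneg 20.0
    goodstart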
Input/Output of arrays
Specifying file properties in the .INI file
The programs PMF2.EXE and PMF3.EXE may read from and write to several files, as specified by the
user in the .INI file. Code numbers 30 to 39 have been reserved for input and output files. In addition,
the input code 4 denotes the PMFx.INI file itself. Similarly, the output code 24 denotes the output file
PMFx.LOG . Properties of the files 4 and 24 cannot be changed by the user.
Properties of the files 30 to 39 (in increasing numerical order!) are set by the .INI file:
1. each file is designated either input (T) or output (F).
2. by using the “file opening status” attribute of F90, one may specify how each file is opened.
The alternatives and their requirements are (see any Fortran 90 handbook):
NEW a file with the specified name must not already exist on the disk. A new file is created
(only for output files, of course)
OLD a file with the specified name MUST already exist on the disk. If output, the old file is
opened so that new output will be written at the end of the file. If input, the old file is opened
for reading from the beginning of the file. For input files, OLD should always be used!
UNKNOWN a combination of NEW and OLD: works both ways. Not for input, good for
output
REPLACE (only for output files): If there is an old file with the same name on the disk, it is
deleted. The results are written into a new file! This is good if previous results are of no
permanent value, and also if the previous results have already been copied to another file or
directory.
3. the maximum length of lines in each file is specified. If the matrix F is stored straight, it may
need long lines. So may the data array X and the std-dev arrays T,U,V.
WARNING: depending on the properties of the compiler, there may be no error message if
your input file contains long records (2500 characters, say) and you specify a shorter record
(2000, say). It is possible that the extra characters (500) are simply chopped off the input
records, and you cannot read the whole array correctly!
4. a name for each file is specified. If needed, you may also specify the path as part of the name.
By using the character $ as one of the characters of a file name, you request file name
substitution from command line. Assume that PMFx has been started by the command “PMFx
mypmf monday”, and that there is a file name “my$.dat” in the mypmf.ini file. Then name
substitution will change the name into “mymonday.dat”. This technique makes it possible to
use one copy of the .INI file although several input or output files are needed at different times.
File name substitution may not be available on some platforms.
In the default PMFxDEF.INI the definitions are such that files 30 to 33 are reserved for input and 34 to
39 for output. You are free to change this distinction, e.g. having 3 files for input and 7 for output. But
we suggest that the smaller numbers should be reserved for input and the larger for output.
Using formats
There are 10 formats specified in the .INI file, with code numbers 50 to 59 (in increasing numerical or-
der!). You may change them, but it requires knowledge of the FORTRAN90 language format system
(essentially the same as in FORTRAN77). Each row of an array (or column, if transposed layout) is
input by a separate READ statement or written by a separate WRITE statement. Thus the format
should cover reading/writing of one row (not the whole array). Each new row will start using the
format from beginning. For input it is often practical to use list-directed (=”free-form”) input. This is
achieved by using the code number FMT=0 instead of the format numbers 50 to 59. — If there is an
error in your format specification, the error is only found when the program attempts to use the format,
not earlier.
If there are quantities of widely differing magnitudes in different columns of the arrays, then write the
results in E or G formats, e.g. E13.5E2, instead of F formats, such as F9.4. (The formats in
PMFxDEF.INI are examples!) Otherwise zeroes appearing in the output may mislead you when the
true value nevertheless could be significantly non-zero. In air pollution studies, the concentrations of
lead are typically such small quantities; for Si there are large values.
If FIL=0 for ...            it causes the following operation instead of reading the array:

X                           the data array X is simulated.
X_std-dev /T, /U, or /V     the array in question (T, U, or V) is set equal to the corresponding code C1, C2, or C3. See the section "Specifying the X_std-dev" for more details, e.g. correspondence analysis.
G and/or F                  random initial values are generated into G and/or F. Similarly for A, B, and C.
Gkey and/or Fkey            zero values are generated into the key matrices: all components of G and/or F are constrained to positive values. Similarly for Akey, Bkey, and Ckey.
rotcom                      zero values are generated into the "rotcom" matrix: no rotations are commanded.
If several arrays are to be read from one file, they must be present in the file in the same order as they
appear in the array list in the .INI file and in the following tables. If something is read from the .ini file
(FIL=4), those values must be at the end of the .ini file.
The marker FMT indicates format. FMT=0 means list-directed input or output. For input, FMT=0 is
usually the best choice. The formats used for arrays "xkey" (x means G, F, A, B, or C) should be based
on the integer conversion, e.g. I3. The F format is not possible for xkey. The FMT codes 50 to 59
specify format strings stored earlier in the .ini file.
Below the marker (T) one specifies if the array should be read or written “as is” (“straight”) (T)=F, or
transposed, (T)=T. For looking at results on the screen it may be good to have F transposed and G
straight. For input arrays, use (T)=F or (T)=T depending on the layout of data in the original file.
For each array, there are first the specifications for reading the header: (FIL (R) FMT). The code (R)
indicates whether the header should be read again during repeated tasks, (R)=T. The header of X is
often an important bookkeeping tool.
The second group is for writing the header: (FIL FMT). If there is no reading but only writing of a
header, then the default header is written. The default header is the last item on the line, you may
change it.
The third group is for reading the array. The codes (R) and (C) control repeated tasks: If both are false,
then the originally read array is also used in repeated tasks. If (R)=T, then another reading (or
generating random values if FIL=0) happens when repeating the task. And if (C)=T, then the computed
factor matrix (G, F, A, B, or C) is used as the starting value for the next task. If there are no repetitions,
then the codes (R) and (C) have no significance. (C) is short for "Chain".
The last group (FIL FMT (T)) controls the writing of arrays. For input arrays, this ‘echo’ writing may
be meaningful for documentation (“what was being analyzed”) or for error finding: if there is a read
error when reading X, say, then by looking at the echo output of X, one may see where the good values
end. The “unread” values contain the characters -9! But normally it is not necessary to write the input
arrays, then their output FIL codes may be zero.
The reasons for using random values. Use random starting values of factors in order to
1. suppress any personal preferences towards a certain solution. This maximizes the objectivity of the analysis.
2. generate many different starting points for repeating the same computation. This attempts to find out if there are local minima of the Q function.

Drawbacks connected with random starting values of factors:
1. The factors appear in random order. This may be partly corrected by using one of the optional parameters "sortfactorsx" where x stands for g, f, a, b, or c.
2. The iteration may need more steps when started from a “poor” starting point.
Generally one should use random values for such analyses which are not part of a sequence or group.
Also one should investigate a few samples from each group of analyses by re-running several times
with different random values. This requires having FIL=0, (R)=true for factors in a repeated analysis.
When analysing a set of similar arrays, there are two alternatives to random starting:
1. Take the result file(s) (G and F) from one typical analysis and read them (FIL>1) as starting
values when doing all the other analyses. This is fast and creates the factors in the same order.
2. Form artificial “skeleton factors” where a few key variables have non-zero values. Use these
arrays as starting values. This is slower but one may create the factors in any order one likes.
Array headers
The headers are intended for two different purposes: help-like documentation and bookkeeping. The
default headers document what is the meaning of any matrix or array, read or written. Variable headers
(input from data files) identify individual measured data sets from each other. If you wish, you may
work without any headers. Then you must set all the FIL codes for header input and output to zero in
the .INI file. Your data must be pure arrays, without any text lines. This may, however, create
confusion: what is the meaning of all the output arrays?
You might wish to include headers for all output. Then the FIL codes for header output should be the
same as the FIL codes for the output of the corresponding arrays. If you wish to direct an array to
another file, then you will have to change two things: the output FIL code of the array, and the output
FIL code of the header. In certain applications, it would be good to have application-oriented headers
for output arrays. This is achieved by editing the default headers of arrays in the .INI file.
Timing of header operations. The headers are read and/or written immediately before the time when
the array itself might be read and/or written. The header operations may take place even if the array
itself is not read and/or not written.
Exception: if the array X is simulated (not input), then header operations for X take place immediately
before X is written, i.e. after all input of other arrays. Then also the header operations for S take place
after writing X but before writing S.
The programs PMFx store only a maximum of 40 characters of a header. By default the program
replaces the default header (as specified in your .INI file) by the header that was read from the data file.
It is possible, however, to keep the tail part of the default header intact: if there are the characters ".."
(two periods) in the default header, they act as a separator: only the characters before .. are replaced by
the corresponding characters from the data file. Thus you may have a mixture of input header and
default header in the output.
Headers of input arrays (mainly for array X) are useful for bookkeeping purposes. Be systematic: if
there is a header in the data file, then you must read it by specifying the correct code for the header
input of X, and vice versa. If the header is read with “A” format (as in PMF2DEF.INI) then (max 40
characters from) one whole row in the file constitutes the header. The only requirement is that there
must be at least one non-white character on the row: the program skips all-white rows until it finds a
non-white row. If the header is read without format (FMT=0) then the header should be surrounded by
'apostrophes' or by "quotation marks".
If you forget to insert a header into your input file, then (part of) the first row of the array will be used
as a header. There will probably be an error message when the array ends too soon (because the first
row was lost to the header reading process).
If you do have a header in the file, but forget to read it (FIL=0 for header input) then there will be an
error message when the program attempts to read the first row of the array but finds letters which are
part of the header. If the header is all numerical characters, there is no error message. One should avoid
using such all-numeric headers.
It is good practice to write the header of X to one of the output files even if X itself is normally not
written. If there are any problems in reading X then it is useful to write the X array to a temporary file
(such as file 38 in PMF2DEF.INI) by setting the FIL code for array output of X correctly (e.g. FIL=38).
Then one may inspect which values have been correctly read and which have remained unread (unread
values are = -9.999 or some such special value). Also one should check if the data are correctly divided
onto rows and columns of X, e.g. are the elements (2,1) (row=2, column=1) and (1,2) correct in X.
One should keep an eye on the Q values of the tasks. They are included in the file PMFx.LOG. But it is
also possible to print the Q value as part of the header of the computed G, F, rotmat, or any other result
array. This is achieved so that you insert the two characters “Q=” near the end of the default header of
the desired array in your .INI file (also remember to turn on writing of that header, otherwise the
characters "Q=" have no effect). When PMFx writes the header, it replaces 12 characters of the header
after Q= by the current value of Q. In this operation the header is allowed to grow longer than the
normal limit of 40 characters.
The “folded straight” layout is necessary if the length of lines in the file is limited. This example shows
the situation if only 6 matrix elements fit onto one line (each letter denotes one element of a row of the matrix):

    aaaaaa
    aaa
    bbbbbb
    bbb
    cccccc
    ccc

- This layout may be input without any special concern.
- It is output by PMFx if a format is used which specifies 6 elements on a line.
- This layout may be difficult to import to spreadsheets or to matlab.

The transposed layout is often practical for the F matrix, especially when exporting results to
spreadsheets or to matlab. Then one has to set the transpose indicator for the matrix as (T)=T (True).

Binary files, e.g. matlab .mat files, cannot be read by PMFx. By using the matlab command “save”
with the switch “-ascii” it is possible to export data from matlab in text form so that PMFx may read
them.
In this example the dimensions of the model are n=5, m=4. Each of the letters D, X, and S denotes one value in the file.
We would like to read all the values X into the 5x4 data matrix X, and all the values S into the 5x4
matrix X_std-dev/T. It would be rather laborious to extract these values by using a text editor. In this
example, the values D are “something else”, a nuisance.
There are four integer parameters near the end of the .ini file for controlling the special reading of
matrix X. The first parameter indicates the size of the data block to be processed when reading X.
(The value 0 indicates no special reading). In this case, we have 48 data values. The second parameter
indicates the position of the first value x11 within the data block. In this example, x11 is the fifth value
in the block. The third parameter indicates the “horizontal” distance from x11 to x12 to x13 ,... = two
units. The distance must be the same everywhere in the data block. Similarly, the fourth parameter
indicates the “vertical” distance between consecutive rows of X in the data block, (=the distance from
x11 to x21 to x31 to x41,... ) = nine units. The parameters for special reading of X are thus: 48 5 2 9.
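The indexing implied by these four parameters can be illustrated with a short Python sketch (an illustration only, not part of PMFx; the stand-in values 1...48 just mark positions in the block):

    import numpy as np

    # The whole block, read as one long "row" of 48 values.
    flat = np.arange(1, 49, dtype=float)      # stand-in values 1..48

    blocksize, first, hstep, vstep = 48, 5, 2, 9
    n, m = 5, 4                               # dimensions of X in this example

    X = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            # positions are counted from 1 in the description, hence "first - 1"
            X[i, j] = flat[(first - 1) + i * vstep + j * hstep]
    print(X)    # x11 is the 5th value, x12 the 7th, x21 the 14th, ...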
Similarly, the parameters for reading X_std-dev/T are 48 6 2 9. But in order to read T, there must be
another block of data in the file. After the reading of X, the data block of X is no longer available when
starting to read T. If the values of X_std-dev/T are interspersed with values of X (as in this example)
there must be a second copy of the same block in the file. This is easily done with a text editor.
The special reading utilizes the user-defined formats, just as all other readings made by PMF2 do. However, all
the data values in a block are input in one operation, as one very long “row”. Thus the division of
values onto lines in the block need not correspond to the rows or columns of the matrix to be read. The
“Transpose” indicator has no meaning for the special reading of a matrix.
We imagine that the special reading will mostly be used so that FIL codes for the matrices U and V are
=0. This need not be so, however. But there is the restriction that only one set of parameters is
available for controlling the reading of all the matrices T, U, and V. Thus all matrices must have the
same layout in their respective blocks.

Normalization of factor matrices

In PMF3, the default normalization is fixed: the sum of absolute values of elements in each column of
factor matrices B and C is normalized to unit value. This corresponds to the sixth normalization option
for PMF2. There is a new alternative normalization in PMF3: the optional parameter "normfactorsac"
commands PMF to normalize A and C factors (instead of B and C). Furthermore, the average absolute
values are normalized to unity (instead of sums of absolute values). This option is recommended for
environmental studies where B mode corresponds to element concentrations.

Normalization is omitted by PMF3 for such factors where any element in any of the factor matrices A,
B, or C is bound to a non-zero value (xkey<-1), as follows: no specific normalization is performed. The
norms of such factors remain at those values which happen to be produced during the computation.
The user should assume responsibility of producing well-defined norms by binding at least one
significant element in exactly two modes (of each such factor) to non-zero values.
PMF2. The seven alternatives for normalization in PMF2 are:
1. no normalization
2. maximum abs. value in each column of G (= in each G factor) equals unity
3. sum of absolute values of elements in each G factor equals unity
4. mean value of absolute values of elements in each G factor equals unity
5. maximum abs. value in each row of F (= in each F factor) equals unity
6. sum of absolute values of elements in each F factor equals unity
7. mean of absolute values of elements in each F factor equals unity
The normalization affects the output of the following five matrices: G, F, std-dev(G), std-dev(F), and rotmat.
Only one of these choices may be selected. Immediately before output, PMF2 determines p coefficients
which are needed for achieving the desired normalization. Then it divides columns of G, and multiplies
rows of F with these coefficients. Thus the product GF does not change when the normalization is per-
formed.
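As an illustration of this bookkeeping (a sketch in Python, not the PMF2 code), normalization option 3 of the list above could be written as follows:

    import numpy as np

    def normalize_option3(G, F):
        """Option 3: the sum of absolute values in each G factor (each
        column of G) equals unity; the product G F is left unchanged."""
        coeff = np.sum(np.abs(G), axis=0)     # p coefficients, one per factor
        G_new = G / coeff[None, :]            # divide columns of G
        F_new = F * coeff[:, None]            # multiply rows of F
        return G_new, F_new

    # Sanity check: the fit G F is identical before and after.
    G = np.random.rand(6, 2)
    F = np.random.rand(2, 4)
    Gn, Fn = normalize_option3(G, F)
    assert np.allclose(G @ F, Gn @ Fn)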
After normalizing (and writing) G and F, PMF2 performs necessary scalings for the other three
matrices so that their new values are correct with respect to the normalized matrices G and F, and
writes these scaled matrices according to the instructions in the .ini file.
Normalizing the factors does not influence factorization computations in any way. The 'chi-2' value Q
is not changed.—The command for normalization is near the end of the .ini file of PMF2.
Technical details
Marking true and false
As written by the Fortran system, true and false are represented by “T” and “F”. Such tables are
visually hard to read. It is suggested that if you have to change these default tables, you write the
changed values with two or more letters, e.g. as "TR" and "fa". This makes it easier for you to visually
spot changed entries in the tables. The Fortran system is only interested in the first letter which must be
one of "f F t T".
If SEED=k>0 (k is an integer!) then a simple portable random-number generator is used; it is the same
on all platforms. Different values of k select different positions of the sequence for starting (the
integers k=1, 2, ... do not select first, second, etc. element for starting, but “randomly chosen” locations
within the full sequence).
The parameter “Initially skipped” indicates how many first elements of the pseudorandom sequence
are to be skipped when PMFx is started. The skipping is performed after choosing the starting position
according to SEED. (This parameter is not useful and may be dropped from future versions.)
Specifying standard deviations for the array X. Significance of the Q value. Outliers.
The array of standard deviations
The wording and the equations in this part are written for PMF2. Similar equations are also valid for
PMF3. In theoretical descriptions of PMF, the array of standard deviations is often denoted by S. In
connection with the programs, we also denote it by X_std-dev.
The significance of the array S (standard deviations for elements of the array X) has been discussed in
more detail in the references. This array is the means for communicating problem-specific a priori in-
formation to PMFx. Generally one should attempt to specify the std-dev values so that they indicate the
agreement which is expected between the (bilinear or trilinear) model and the data array. A few
unrealistically small std-dev values (e.g. smaller than the data values by a factor of 10000) may lead to
useless results or even to breakdown of the computation because of apparent matrix singularity. Often
it might be simple to specify the std-dev values as a fixed percentage of data values. This must be
avoided, however, if there may be (near-)zero values in your data. Sometimes the std-dev values can
only be computed when there exists a (preliminary) fit to the data. When the fit converges towards a
good solution, then also the computed std-dev values converge towards a good model. An example is
when the data obey Poisson distribution and the counts are low: the measured data as such don’t give
satisfactory estimates for S. Another example is environmental data where outliers are common: std-
dev values are best based on the larger one of the measured and fitted values, thus lessening the effect
of both kinds of outliers: contamination and loss.
Sometimes one has hardly any idea of std-dev values for X. How is one to use PMFx then? A simple
rule of thumb might be applied, such as: std-dev(xij) = 5% of xij plus two units of the least significant
digit reported for xij. (In cases where all the rows and all the columns of X are of same magnitude, this
rule may be implemented very simply by using the ad-hoc equation for X_std-dev in PMFx.) If xij =
123, then this rule gives 6.15 + 2 = 8.15. If xij = 0.006, we get std-dev(0.006) = 0.0023. This is simple
but it is better than nothing (or better than assuming that all values are of the same absolute accuracy,
as is inherent in the classical PCA without autoscaling). If rows and columns are of different order of
magnitude then the simplest ad-hoc equation is not useful; one must model the std-dev values by using
the three coefficient arrays T,U, and V.
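This rule of thumb is easily written as a small Python function (an illustration; the 5 % and the "two least significant digits" are just the example values from the text):

    import numpy as np

    def rule_of_thumb_stddev(X, lsd):
        """std-dev(x_ij) = 5 % of |x_ij| plus two units of the least
        significant digit (lsd) reported for the data."""
        return 0.05 * np.abs(X) + 2.0 * lsd

    print(rule_of_thumb_stddev(np.array([123.0]), 1.0))    # -> [8.15]
    print(rule_of_thumb_stddev(np.array([0.006]), 0.001))  # -> [0.0023]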
Either the array S is computed before the fit is started (as with EM=-12), or else approximations for S are computed iteratively during the fit. In those cases where S is
based on the fitted array Y (for definition of Y, see Introduction), the values max(yij,xij) are utilized
instead of the fitted values yij during the first few iteration steps when the fit is still not well defined.
This is true for the EM values -10, -13, and -14.
A complete list of the alternatives for computing S=X_std-dev follows. The two first alternatives are
the simplest. Preferably one should first learn to use the program by using these simple techniques;
sometimes one may simply assemble the array with a text editor so that for each column (row) there is
only one value which is repeated for the whole column (row) (see the remark at the end of next
section). In other cases, one has to compute the array in a preprocessing program.
EM=-12, C1=t, C2=0,C3=0, FIL(T)=0, FIL(U)=0, FIL(V)=0. The value t of C1 is used as the std-
dev value for all elements of the array X. This value is not changed during the iteration.
EM=-12, C1=0,C2=0,C3=0, FIL(T)>1, FIL(U)=0, FIL(V)=0. The std-dev array is read from a file
(as array T) and used as such, not changed during the iteration. Repeated reading of std-dev is
controlled by the repeat-code (R) of X_std-dev /T.
EM=-10. Lognormal distributions. It is assumed that each data value xij comes from a lognormal
distribution with geometric mean equal to the fitted value yij and log(geometric-standard-
deviation) = vij. It is further assumed that there is "measurement error" having std-dev=tij in
each measured value xij. The factors G and F (or A,B,C) are determined so that Y=GF (or
Y=ABC) maximizes the likelihood of X, given the arrays T and V. This implies that PMFx
computes sij as sij = tij2 + cvij2 yij ( yij + xij ) , or sijk = tijk
2
+ cvijk2 yijk ( yijk + xijk ) if
PMF3. (c =0.5). The computed G and F (or A,B,C) are Maximum Likelihood (ML) estimates
of the true unknown factors.— If all the log(GSD) values vij are equal, then the coefficient C3
may be used instead of the array V. Similarly C1 may be used instead of the array T if all the
absolute errors have the same std-dev. The array T is intended for representing typical
measurement error, which may cause zero or negative values to appear although the genuine
lognormal distribution is always strictly positive. If there is no such "lab error" in the data, then
T is not needed and one sets C1=0. — The coefficient C2 and the array U are not used when
EM=-10.— In log-normal modelling the expectation value of the Q as computed by PMFx
increases from the usual degrees-of-freedom value (approximately the number of elements in
the data array). The coefficient of increase is written by PMFx to the .log file.
EM=-11. Correspondence analysis. It is assumed that each data value xij comes from the Poisson
distribution with a parameter µij. Factors are determined so that they represent the array µ:
µ=GF (PMF2) or µ=ABC (PMF3). During the iteration, PMF2 computes sij as
$s_{ij} = \sqrt{\max(\mu_{ij},\, 0.1)}$ (PMF3: $s_{ijk} = \sqrt{\max(\mu_{ijk},\, 0.1)}$), resulting in maximum likelihood
(ML) estimation for the array µ. The "max" protects against ever having sij =0. FIL codes for
T,U,V should be =0, and C1=0, C2=0, and C3=0.
EM=-12 The values sij (sijk) are computed according to equation (7), (8) or (9) by using the
coefficients C1,C2,C3 and/or arrays T,U,V and the data array X. This is done before the
iteration is started. The std-dev values are not changed during the iteration. This is the default.
EM=-13. The values sij (sijk) are computed by using the coefficients C1,C2,C3 and/or arrays T,U,V
and the array Y of fitted values. This is done iteratively during computations.
EM=-14. The values sij (sijk) are computed according to equation (10) by using the coefficients
C1,C2,C3 and/or arrays T,U,V, the array X, and the array Y of fitted values. For each element
(ij), the computation is based on the larger of the values xij and yij. This is done repeatedly
during the iterations.
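For concreteness, the EM=-10 expression given above can be written as a small Python function (an illustration only; X, Y, T and V stand for the data, fit, absolute-error and log(GSD) arrays):

    import numpy as np

    def stddev_em10(X, Y, T, V, c=0.5):
        """s_ij of the lognormal error model EM=-10:
        sqrt(t_ij**2 + c * v_ij**2 * y_ij * (y_ij + x_ij))."""
        return np.sqrt(T**2 + c * V**2 * Y * (Y + X))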
The option EM=-14 is recommended for general-purpose environmental work. The reason for
recommending this alternative instead of the “standard” (EM=-12) is as follows: The (EM=-12)
computes the std-dev essentially as a percentage of the observed value. For a large value (a possible
contamination-type outlier) a large std-dev is obtained, thus a contaminated value never gets an unduly
large weight. But for a small observed value the standard technique computes a small std-dev value; if
the small value is in fact a loss-type outlier then it gets too large a weight because the std-dev is based
on the near-zero erroneous value. In such a case the fitted yij is significantly larger than xij. The alter-
native EM=-14 avoids generating too small std-dev values by taking the larger of yij and xij as the basis
for std-dev. — For good (non-outlier) observed values the observed and fitted values agree and there is
little difference between selecting one or the other. Thus the alternative EM=-14 should never be a bad
choice, except that the option EM=-12 runs faster than the other ones. —The alternative EM=-10
(lognormal) is also a good choice for environmental data. However, it has been found that the
options EM=-10 and EM=-14 are practically similar unless the size of residuals is very large in
comparison to the size of data. Thus EM=-10 is only needed for such data where the parameter C3 is
larger than 0.3, say.
Robustness
We recommend that you generally run PMFx in the robust mode, at least in environmental research.
This guarantees that occasional outliers do not ruin your results. Only in exceptional cases is it possible
to know that there are no non-representative data (no ‘outliers’) in the array. In robust mode it is not
quite so essential to specify increased std-dev for unreliable values but if you suspect some points, then
you should increase their std-dev even in robust mode. It is important to note that using the robust
mode does not protect against the harmful effects caused by using extremely small std-dev values for
some data points.
The combination of robust mode and systematic gross under-estimation of std-dev for many points
makes the program slow! Try to provide realistic std-dev even in robust mode!
Missing values
Missing values occur in real data. In PMF they are best handled so that the user specifies large std-dev
for them. However, in order to avoid non-physical factor values one should avoid too large std-dev values
sij which satisfy the inequalities sij >> xik and sij >> xhj for all k and all h.
By using the optional parameter "missingneg r" (r is a decimal value) one commands PMF to decrease
the significance of all negative entries of the array X. This is realized so that PMFx increases internally
their std-dev values by the factor r and uses the value zero as the data value. With a suitable
r (from 10 to 100) the effect of these negative xik values on the factors becomes negligible. This
option cannot be used if there are "good" negative values in X. This technique should be regarded as a
"trick"; it should only be applied with caution.
The optional parameter "BDLneg r1 r2" combines missing value handling (as presented above) with
Below-Detection-Value handling. If a data value xik is more negative than r1, it is treated as missing
and its std-dev is increased internally by the factor r2. The value zero is used in the least squares fit,
instead of xik. If, however, r1< xik <0, then it is assumed that |xik| is the Detection Limit (DL) for a
Below- Detection-Limit measurement. The corresponding std-dev value sik is computed normally but
dynamic weighting is applied in order to achieve correct BDL handling.
Sometimes missing values are represented in a spreadsheet table by empty cells. Such a table should
not be written in plain text form, because then the information about empty cells is lost. However, the table may be
written in comma-separated (.csv) form. Then an empty cell is indicated by two consecutive commas.
Whenever PMFx encounters two consecutive commas while reading with the usual format code =0, it
inserts the special value -999.9 in the empty slot in the data array. By suitable use of the commands
"missingneg r" or "BDLneg r1 r2", this special value may be interpreted as a missing value indicator.
$(e_{ij} / s_{ij})^{2} = \alpha^{2} d^{2}$.
The robustized or “downweighted” Q is increased only by the amount $\alpha^{2} d$. This corresponds to the
apparent increase of sij by the factor $\sqrt{d}$. This increase of sij is performed by PMFx dynamically
during the iteration. The heuristic behind this downweighting is that the downweighted point (ij) exerts
the same influence on the fit as an observed value with d=1, i.e. an observed value at the outlier
threshold distance from the fitted value.
In the presence of outliers (even in the robust mode) it may be difficult to know if the observed value
of Q is “normal” or too large. It may be more helpful to investigate the distribution of scaled residuals
(eij / sij). If the vast majority of these values obey |eij / sij| < 2.0, then the increase of Q value is
probably due to the outliers.
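Such a check might look as follows in Python (an illustration; X, Y and S are the data, the fit and the std-dev arrays, read back from the PMFx output files by whatever means you prefer):

    import numpy as np

    def scaled_residual_summary(X, Y, S):
        r = (X - Y) / S
        print("fraction of scaled residuals with |e/s| < 2:", np.mean(np.abs(r) < 2.0))
        print("largest |e/s|:", np.max(np.abs(r)))
        return r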
If you observe a large Q or chi2, you should try to find a cure from the list in the sidebar.
There are three alternatives for writing the residual array to the file. In the third alternative the
residuals are scaled by the apparently increased sij values, thus the third array consists of the values
PMF2. The uncertainty of the factors G and F is conceptually attributed to two reasons:
- individual randomness of the values, caused by individual errors (noise) in the observed data. This uncertainty is estimated by G_std-dev and F_std-dev.
- rotational uncertainty, often limited by the shape of factors and the non-negativity constraints. The rotational uncertainty is estimated by the matrix “rotmat”.
The rotational state of the result of PMF2 may be controlled by a matrix called “rotcom”, “a matrix of
rotation commands”. The matrix rotcom should be all zeroes if/when special rotations are not
needed. Otherwise rotcom is a pxp matrix (often mostly zeroes) commanding additions and
subtractions of factors according to the following table, where g#j denotes the jth column of G and fi# the ith row of F:

rotcomij > 0    g#j is added to g#i           fi# is subtracted from fj#
rotcomij < 0    g#j is subtracted from g#i    fi# is added to fj#
The matrix rotcom has not proved to be a practical tool, and it has probably never been used
successfully in real-life PMF2 applications. It will probably be deleted from the program PMF2,
but it is still explained here as it is part of the current program. Using it is not recommended any
more. Instead we recommend using FPEAK and/or pulling factor elements to zero or non-zero
values.
The scale of rotcom values corresponds to the scale of FPEAK values, the same value should have an
effect of comparable size in rotcom and in FPEAK. Good first trial values are between 0.1 and 1. Note,
however, that one should normally use either FPEAK or rotcom, but not both.
Using FPEAK (or rotcom) is not equivalent to making rotations a posteriori, after running PMF2.
When FPEAK (rotcom) influences PMF2, it changes the rotational state of the solution, but it may also
change the solution itself (causing an increase of the Q value). Thus FPEAK (rotcom) may succeed in
producing a rotation in a case where rotations appear impossible if attempted after running PMF2.

Together with G_std-dev and F_std-dev (see below), “rotmat” is the basis for estimating the uniqueness
and accuracy of the computed factor values. “rotmat” indicates the rotational uncertainty. The values
rotmat(i,j) and rotcom(i,j) are related to the same rotation. “rotmat” is a pxp matrix of standard
deviations of rotational coefficients: rotmat(i,j) (row=i, col=j) indicates the uncertainty or freedom of
the rotation where simultaneously g#j is added to or subtracted from g#i and fi# is correspondingly
subtracted from or added to fj#.
The final (third) value of “lims” influences the values of rotmat directly and indirectly. With a large
final “lims” all elements of rotmat are of medium size, because limit repulsion is soft but extends to all
components. All rotations are then in effect determined by limit repulsions, but because no elements
are near to the boundaries, the repulsions are “soft”.
If the final “lims” is small the distinction of locked and free rotations becomes apparent. Some factor
elements tend to approach near to the limits, and they become “stiff” with respect to rotations (there is
no distinction between rotations towards the limits and away from the limits). Then the corresponding
element of rotmat becomes small, indicating a locked rotation. On the other hand, some factors stay
away from the limits, and the rotmat elements for their rotations increase, indicating free rotations and
non-unique factor values.
The rotmat values should only be interpreted qualitatively. It would be tempting to assume that a value
of 0.2 in rotmat would mean that 20 % of one factor can be added/subtracted from another. This is not
true, unfortunately. If one is solving many similar problems, then one might get a feeling about the
quantitative interpretation, too. As an example, in one case the value 0.2 occurred in rotmat when it
was possible to rotate 2% of one factor into another.—Utilization of rotmat values is only possible if
correct std-dev values have been specified for measured data.
If all elements in the ith row of rotmat are “small”, then one knows that the ith G factor is uniquely
determined without “any” rotational uncertainty. Similarly, a column of small elements in rotmat
indicates a uniquely determined F factor.
Forced rotations also affect the rotmat values: if FPEAK(rotcom) is used to press some factors tightly
against the wall, then there remains less freedom of further rotation, and the corresponding rotmat
values decrease. See the example GE2.INI.
Forcing rotations with rotcom may be surprisingly tricky: when one sets up a certain “pull” for one
factor, it may happen that other rotations also occur at the same time, if they are free. It is as if one were cutting
a slippery piece of food with a dull knife: the piece rotates in all possible ways in order to avoid being
mutilated. It is also possible that large values in rotcom change the order of factors in the factor
matrices. The risk of this is minimized if one uses a “good” starting point for a factorization where
rotcom has large values; a good starting point tends to freeze the identity of factors.
Quantitative estimation of rotational freedom has not yet been attempted in a systematic manner. It is
not possible by simply looking at rotmat elements. It should be possible by systematically trying
different rotcom elements and by observing how much rotation is possible without increasing the Q too
much. The author would be interested in joint research along these lines!
Miscellaneous details
Locations of error messages
During their operation, PMF programs write log information on the screen and also to the end of the
file PMFx.LOG. They also write error messages to both places. However, the error messages created
by the Fortran system (e.g. complaints about data/format errors) are only written to the screen! Every
now and then, one should delete the file PMFx.LOG; otherwise it would occupy too much disk
space.
Explained Variation EV
(PMF2 only)
The original definition of EV came from the papers of Paatero and Tapper, 1994, and Juntto and
Paatero, 1994. There are several problems with this definition, especially if negative factor values are
allowed to occur. The current recommended definition for EV was introduced for PMF2 v.4.1.
The quantity EV is dimensionless; it summarizes how important each factor element is in explaining
one row or column of the observed matrix. The values of EV range from 0.0 to 1.0, from no explaining
to complete explanation. This value is not quadratic: if measured data is explained so that the residuals
are typically 10 % of data, then the corresponding EV value for the explanation is 0.9. The customary
“explained variance” and the original Paatero-Tapper-Juntto definition of EV are quadratic quantities;
they would indicate 0.99 in the same situation!
The new definition considers that all of X is explained jointly by the p factors and by the residual, as if
the residual would be an extra (p+1st) factor. Taken together, these p+1 “factors” by definition explain
100% of X. The equations for the explained variation of the factor G are
EV(G)_{ik} \;=\;
\frac{\displaystyle\sum_{j=1}^{m} \bigl|g_{ik} f_{kj}\bigr| / s_{ij}}
{\displaystyle\sum_{j=1}^{m} \Bigl( \sum_{h=1}^{p} \bigl|g_{ih} f_{hj}\bigr| + \bigl|e_{ij}\bigr| \Bigr) / s_{ij}}
\qquad (\text{for } k = 1,\dots,p) \qquad (11)

EV(G)_{ik} \;=\;
\frac{\displaystyle\sum_{j=1}^{m} \bigl|e_{ij}\bigr| / s_{ij}}
{\displaystyle\sum_{j=1}^{m} \Bigl( \sum_{h=1}^{p} \bigl|g_{ih} f_{hj}\bigr| + \bigl|e_{ij}\bigr| \Bigr) / s_{ij}}
\qquad (\text{for } k = p+1) \qquad (12)
Equation (11) defines how much each element gik (k=1,...,p) of the factor G explains of the ith row of
the matrix X. Equation (12) defines similarly how much the residual values e_ij explain of the ith row.
By definition, the sum of these p+1 values equals unity.
The contributions of the different points x_ij (j=1,...,m) are scaled by the values s_ij = X_std-dev_ij. These
are the original user-defined standard deviations for x_ij, not the adjusted values generated during a
robust fit. If there is a huge outlier on the ith row, it may drastically reduce the EV values for that row,
even if it does not influence the fit significantly in the robust mode. If one wishes to compute such EV
values without the influence of the large outlier, then one must manually increase the std-dev value for
that outlier.
The values EV(G)_ik are represented as a matrix of dimensions n by p+1, i.e. of the same shape as G but
with one extra column added for representing the “explanation by residual” (in fact telling what part of
X is unexplained by the p factors).
Similar equations are also defined for the F side. The matrix EV(F) is a (p+1) by m matrix, where the
extra row (p+1) corresponds to the residual.
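For readers who want to verify or post-process EV values outside PMF2, the following numpy sketch shows one way to compute EV(G) directly from equations (11) and (12). It is an illustration only, assuming that the factor matrices G and F, the data X and the original X_std-dev array S are already available as arrays; the function name explained_variation_G is made up for the example and this is not the code used inside PMF2.

    import numpy as np

    def explained_variation_G(G, F, X, S):
        # G: n-by-p, F: p-by-m, X: n-by-m data, S: n-by-m original X_std-dev values.
        # Returns an n-by-(p+1) matrix; the last column is "explained by residual".
        n, p = G.shape
        E = X - G @ F                                    # residuals e_ij
        # per-point terms |g_ik f_kj| / s_ij, shape n-by-p-by-m
        terms = np.abs(G[:, :, None] * F[None, :, :]) / S[:, None, :]
        resid = np.abs(E) / S                            # |e_ij| / s_ij
        denom = (terms.sum(axis=1) + resid).sum(axis=1)  # denominator of (11)-(12), per row i
        EV = np.empty((n, p + 1))
        EV[:, :p] = terms.sum(axis=2) / denom[:, None]   # equation (11)
        EV[:, p] = resid.sum(axis=1) / denom             # equation (12)
        return EV                                        # each row sums to 1 by construction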
In source apportionment studies, the element number j of the (p+1st) row of EV(F) indicates how much
of the variable number j remains unexplained. Care is needed in interpreting these values. As an
example, if the variable number j is purely random, so that the jth column of X contains uniformly
distributed values between 0 and 2, and if the average value (=1) is fitted to all elements in the jth
column of X, then EV(F) will indicate that only 33 % of the variable j remains unexplained. Usually
one does not consider that a variable is “explained” if only the average value is fitted but all the
variation around the average value remains unexplained. Thus the following rule should be observed:
Whenever a value on the (p+1st) row of EV(F) exceeds 0.25, one should consider that the variable in
question is practically “not explained”. Then it is crucial to make sure that the error coefficients (C1
and C3) for this variable are large enough so that the average absolute value of the scaled residuals of
this variable does not exceed the value 1. It is advisable to scale such variables down (= increase C1
and/or C3) so that their scaled residuals are mainly between –1 and +1.
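As a sanity check of the 33 % figure quoted above (assuming, for simplicity, constant std-dev values s_ij within the jth column and that the fitted value 1 comes from a single factor):

\[
\mathrm{E}\,|x_{ij}-1| = \tfrac12 \quad (x_{ij}\sim U(0,2)),
\qquad
EV(F)_{p+1,\,j}\;\approx\;
\frac{\sum_{i=1}^{n} |e_{ij}|/s_{ij}}{\sum_{i=1}^{n} \bigl(1+|e_{ij}|\bigr)/s_{ij}}
\;\approx\;\frac{n\cdot\tfrac12}{n\cdot\tfrac32}\;=\;\tfrac13\;\approx\;33\,\%.
\]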
not reach this smooth phase in some ten or fifteen steps, it may indicate that the first penalty weight
factor (the first “lims” value) is much too small. Run again with a larger first lims value. However,
sometimes it may be good to try with a smaller first penalty factor.
The penalty function values are output on each step. When the algorithm nears convergence, both the
chi2 and the penalty values stop changing.
In PMF3, “RelSt” values illustrate the "raw" Relative Step as computed for factor elements. The
leftmost value is the largest relative decrease of any positively constrained factor element. The value
1.0 would indicate that some factor element attempts to decrease down to zero. Such a large step is
truncated, of course. The rightmost value is the mean relative change of unconstrained factor elements.
These values may occasionally be > 1.0 without causing problems. The value “Trim” shows how much
the originally computed “raw” step is shortened (or lengthened) when the step is actually made. The
reason for shortening the step is either an attempted violation of constraints, or a failure to (optimally)
decrease the sum of chi2 and penalty. The latter is caused by the non-linear nature of the problem and
becomes apparent if long steps are tried; it is indicated by the flag “>”.
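The following toy sketch (Python, not the actual PMF3 code) illustrates the idea behind “RelSt” and “Trim”: the largest relative decrease of the positively constrained elements is measured, and if it reaches 1.0 (an attempt to reach zero), the whole step is shortened. The margin value is a made-up example; the rule actually used inside PMF3 may differ.

    import numpy as np

    def trim_step(g, raw_step, margin=0.9):
        # g        : current positive values of the constrained factor elements
        # raw_step : the proposed ("raw") change for those elements
        # margin   : keep at most this fraction of the distance to zero (assumed)
        rel_decrease = np.where(raw_step < 0.0, -raw_step / g, 0.0)
        relst = rel_decrease.max()                        # the leftmost "RelSt" value
        trim = margin / relst if relst >= 1.0 else 1.0    # the "Trim" factor
        return trim * raw_step, relst, trim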
When a peaking rotation is requested, the PMF2 algorithm sometimes seems to proceed “by stages”:
the chi2 changes, then stays approximately constant during several steps, changes again, and so on.
Current experience indicates that if there is practically no change of chi2 during 10 or 20 steps, then
one may be confident that the true convergence has been reached. Note, however, that there are local
optima for some problems. In order to ascertain that the global minimum has been found, one has to
rerun the analysis ten or twenty times, starting from different pseudorandom values.
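A hypothetical way to organize such repeated runs is sketched below. The helper run_pmf_with_seed is a placeholder: it stands for whatever mechanism you use to start PMFx from a different pseudorandom starting point (for example by editing the .INI file between runs) and to recover the final Q value of that run.

    def check_global_minimum(seeds, run_pmf_with_seed, tolerance=1.0):
        # seeds            : pseudorandom seeds to try, e.g. range(1, 21)
        # run_pmf_with_seed: placeholder callable returning the final Q of one run
        # tolerance        : Q values within this distance of the best are treated
        #                    as the same minimum (an arbitrary assumption)
        q_values = [run_pmf_with_seed(seed) for seed in seeds]
        best = min(q_values)
        n_best = sum(1 for q in q_values if q <= best + tolerance)
        print(f"best Q = {best:.2f}, reached in {n_best} of {len(q_values)} runs")
        return q_values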
The algorithm used by PMF3 may sometimes show the message “matrix is singular” followed by a
code number. This is not an error. These messages need not be reported to the author. However, if
there are tens of them, so that the program is unable to proceed, then one should report the problem.
With PMF3, degenerate factorizations may be avoided by using a non-negligible regularization during
all three stages of iteration. Normally, using smaller and smaller "lims" parameter values decreases
both regularization and the logarithmic penalty that is needed for implementing non-negativity
constraints. The optional parameter setting "minreg r" controls the regularization so that the strength
of regularization is = max(lims,r). Thus minreg defines the smallest regularization to be used for all
levels of lims. In order to avoid degeneracy, use lims as usual but introduce the optional parameter as
"minreg 1.0", say. Then experiment with the numerical value of minreg, in order to find the smallest
value that still prevents degeneracy. Use this smallest value in order to minimize the distortion that is
caused by the regularization.
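A minimal sketch of what "minreg r" does to the regularization strength (the lims values below are arbitrary examples, not the program defaults):

    def regularization_strength(lims, minreg=0.0):
        # effective regularization for one iteration stage: max(lims, minreg)
        return max(lims, minreg)

    for lims in (10.0, 1.0, 0.1):              # example three-stage lims values
        print(lims, regularization_strength(lims, minreg=1.0))
    # with "minreg 1.0" the regularization never falls below 1.0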
Efficiency considerations
PMF computations of very large models tend to be time consuming. Thus it pays to arrange them in an
efficient way. The following points should be considered.
Convergence criteria. In the default setup convergence criteria are tight for all three penalty levels.
This setup is only intended for a safe start, not for permanent use. Different arrays have very different
convergence characteristics. The “activity” of non-negativity constraints is the most important
question; if many factor components try to go negative then the case is “difficult”. Thus it is impossible
to give specific instructions about good parameter values. Instead, we try to point out things to watch.
If the first two stages have tight convergence limits then the program finds two precise solutions which
are then immediately discarded when the computation proceeds to the next stage. This precision is
wasted. In most cases one may set the chi2 limit of the first stage to something like 5 units, and it may
be sufficient to require 2 small steps. On the other hand, the iteration often proceeds more slowly with
the smaller penalty values. Thus it may be undesirable to end the first stage(s) too early. Sometimes the
iteration may get stuck in a local minimum more easily if the first stage ends early. It is good to first
compute with tight limits and then relax them. Use the relaxed limits if they result in faster computing.
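To make the roles of the chi2 limit and the number of small steps concrete, here is one possible reading of such a convergence test (an illustration only; the exact rule implemented in PMF2/PMF3 may differ):

    def stage_converged(q_history, chi2_limit, n_small_steps):
        # q_history     : Q (chi2) values from successive iteration steps of one stage
        # chi2_limit    : a step counts as "small" if Q changes by less than this
        # n_small_steps : how many consecutive small steps end the stage
        if len(q_history) < n_small_steps + 1:
            return False
        recent = q_history[-(n_small_steps + 1):]
        changes = [abs(recent[i] - recent[i + 1]) for i in range(n_small_steps)]
        return all(c < chi2_limit for c in changes)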
The convergence limit for the final stage influences the accuracy of the result. Statistical considerations
seem to indicate that it is sufficient to reach a chi2 value which is within one unit from the true mini-
mum. Theoretically it is incorrect to use a solution having a chi2 value which is more than a few units
above the true minimum. Practical results may be quite good even if the distance to the true minimum
is somewhat larger.
Monitor level M. When testing the program, the .ini file, or the data, it is good to have M=1 or even
M=-1. In routine runs one might specify a larger M, perhaps M=5 or M=10. In this way the programs
may run a little faster because certain diagnostic computations are omitted.
Penalty coefficients “lims”. If there is one very weak (low-intensity) factor, then it may happen that it
is not visible at all when the penalty coefficient is too large. This was observed e.g. with the 4-factor
PMF3 example by Rob Ross. In such cases it is useless to run with a large first lims value, because the
factor structure changes when the penalty is decreased, and thus whatever is computed with a large
penalty is soon discarded.
The std-dev values specified for the array have an indirect influence on the convergence. If you specify
std-dev values that are too small by a factor of 2, then the resulting chi2 values are too large by a factor
of 4. Then the convergence criterion is also too strict and the computation takes longer than necessary.
In addition, the std-dev values computed for the factors are unrealistically small. If you notice that the
computed chi2 values are too large, you should find out the reason and apply at least one of the
remedies suggested in the section on specifying the standard deviations; see Part 1 of the User's Guide.
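The factor-of-4 effect follows directly from the form of the (non-robust) fit criterion, which sums squared scaled residuals:

\[
Q \;=\; \sum_{i}\sum_{j}\Bigl(\frac{e_{ij}}{s_{ij}}\Bigr)^{2},
\qquad
s_{ij}\;\rightarrow\;\tfrac12\,s_{ij}
\;\;\Longrightarrow\;\;
Q\;\rightarrow\;4\,Q .
\]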
Robust mode. Running in the robust mode is slower than in the non-robust mode. Having a larger
outlier threshold distance parameter (8.0 or 4.0) may be faster than a smaller value, such as 2.0. In
robust mode, specifying too small std-dev values may slow the convergence dramatically.
Matrix layout. In PMF2 it is slightly better to have the larger dimension in the vertical direction, i.e. the
number of rows should be larger than the number of columns. This difference is not significant. In
PMF3 it may be important to order the dimensions optimally. Denote the dimensions of the data block
as rows × columns × planes = m × n × q (m is the number of rows). Generally one should have m > n > q.
Computing the covariance matrix of factors. In PMF3, computing the full covariance matrix of all
factor elements is extremely slow and requires a lot of memory. Thus you should request this
computation only if you really need the result. Otherwise, let the FIL code for the covariance matrix be =0.
The computing environment. The programs are compiled/linked so that if the RAM memory is too
small, disk memory will be used for storing data segments that cannot fit in RAM. A shortage of
memory may cause the operating system to continually exchange data segments
between RAM memory and disk. This causes severe slowing down of the program, possibly to the
extent that no computations are possible. (Buy more memory or simplify your model!) In Windows a
memory shortage may be caused by having too many other tasks open simultaneously. Use a
Pentium or better if possible! And remember that it is easy to set up a “batch job” (start PMF runs in a
.BAT file) so that PMF runs overnight or over the weekend!
Fine-tuning the algorithms of PMF2 and PMF3. There are a large number of unpublished tuning
parameters that the end user might adjust in order to enhance the convergence rate of PMF2 or PMF3.
In this way one might obtain a speed improvement of 10% to 30%. However, changing these
parameters is risky because the programs might not converge properly in some cases. Their use is like
tuning up the family car: you might get to your destination a little faster, but the car is more difficult to
drive and the engine may stall if not handled properly. Thus this technique will only be useful in
situations where the program is used for analyzing tens or hundreds of similar large data sets. If this is
your situation, contact the author for details.
Errata
1. There is a typo in the Paatero-Tapper paper of 1994: on page 116, the element 3,3 of matrix X should be 1 instead of 0.
2. In a few places we have used the term “alternate regression”, which is wrong. The correct term is “Alternating Regression” (AR) or, better yet, “Alternating Least Squares” (ALS).
3. Currently the PC versions of PMFx are based on the Lahey Fortran 90 compiler version 4.00e and the Lahey-Fujitsu Fortran 95 compiler 5.50j. Errors caused by the compilers are explained in the PMF readme file.
Licensed users may freely download updated program versions and run them with their previous
authorization keys. The program files are packed by the program PKZIP (currently version 2.04),
possibly protected by a password which will be changed infrequently, perhaps once a year.
Access by ftp is through anonymous login to rock.helsinki.fi, directory /pub/misc/pmf.
Alternatively, you may connect through www to ftp://rock.helsinki.fi/pub/misc/pmf/ .
Disclaimer
The programs PMF2 and PMF3 are not guaranteed to be error-free. Also, using a Dos-extender
(PharLap in the current versions) is a complicated process and in rare cases can lead to problems.
The programs PMF2 and PMF3 are licensed on the condition that the end user in all circumstances
bears all risk of all possible damages and losses to his/her computer system and data files, related to the
use or attempted use of these programs.
So that this is not empty talk, consider whether your backup practices are sufficient. Never keep
important data on your hard disk without first making a backup copy on another disk or on a diskette.
It is not sufficient to back up onto another file or partition on the same physical disk!