SRS Output Checking Guidance
Release: v.1.1
Date: 05.09.2022
Document ID:
For queries relating to this document please contact the Statistical Support Team at
[email protected].
Table of Contents
1. Clearance types
2. General output guidance
3. ‘Safe’ and ‘unsafe’ outputs
4. Default SDC ‘rules of thumb’
5. File types
6. Frequency tables and other tables
   6.1 Low counts and zeros
      6.1.1 Example: suppression
      6.1.2 Example: rounding
      6.1.3 Example: reformatting
   6.2 Class disclosure
      6.2.1 Example: structural zeros
      6.2.2 Example: suppression
      6.2.3 Example: rounding
      6.2.4 Example: reformatting
   6.3 Secondary disclosure
      6.3.1 Example: re-calculating totals
      6.3.2 Example: secondary suppression
      6.3.3 Example: rounding
7. Dominance
   7.1 Example: reformatting
8. Statistics
   8.1 ‘Safe’ statistics
   8.2 ‘Unsafe’ statistics
      8.2.1 Mean
         8.2.1.1 Example: suppression
      8.2.2 Percentages
      8.2.3 Weighted counts
      8.2.4 Mode, minimums and maximums
         8.2.4.1 Example: suppression
         8.2.4.2 Example: rounding
      8.2.5 Medians, quartiles, deciles and percentiles
         8.2.5.1 Example: suppression
         8.2.5.2 Example: reformatting
      8.2.6 Ratios, including odds ratios
         8.2.6.1 Example: suppression
9. Graphs
   9.1 Line graphs
      9.1.1 Example: suppression
      9.1.2 Example: reformatting
   9.2 Scatter graphs
      9.2.1 Example: reformatting
   9.3 Bar charts and histograms
      9.3.1 Example: reformatting
      9.3.2 Example: suppression
   9.4 Boxplots
      9.4.1 Example: suppression of plots
      9.4.2 Example: reformatting plots
      9.4.3 Example: suppression of outliers
      9.4.4 Example: reformatting whiskers
   9.5 Violin plots
      9.5.1 Example: reformatting plots
10. Regressions and modelling
   10.1 Coefficients, margin plots and test statistics
      10.1.1 Example: saturated regression
   10.2 Residuals
11. Maps and spatial analysis
   11.1 Maps
      11.1.1 Example: reformatting
   11.2 Geographies
12. Code files
   12.1 Example: Hard-coded data in code
   12.2 Example: Disclosive comments in code
   12.3 Example: Data table in code
   12.4 Example: Overly specific code
1. Clearance types
Within the SRS, there are three types of clearance offered. Researchers must clearly state, within
their Output Request form and within their requesting email, which type of clearance they want for
their output. The three types are as follows:
• Pre-Publication (‘PrePub’) clearance: suitable for any file type. The files may only be shared
with researchers, sponsors and/or funders who are named on the project; the files must be
deleted once the project ends.
• Publication (‘Pub’) clearance: suitable for any file type that is publication ready. The files
may be shared beyond individuals named on the project; the files may be retained
indefinitely.
• Code clearance: only suitable for code files that do not contain any SRS data. The files may
be shared beyond individuals named on the project; the files may be retained indefinitely;
the disclaimer is less extensive than for Pub clearances.
In all cases, when the cleared output is sent to the researcher, the email will contain the appropriate
disclaimer. On receipt of the file(s), the researcher must add this disclaimer text to them.
The researcher may request an exemption to the SDC ‘rules of thumb’ for any output. Requests for
exemptions will be assessed by the Statistical Support staff on the basis of whether:
• The output is highly important (it is required to enable the project to provide its planned
research for the public good) and
• Any deviations from the SDC ‘rules of thumb’ do not result in meaningful disclosure beyond
that permitted by the Data Owner for the given project.
Please note that if an exemption is requested, all information necessary for applying the ‘rules of
thumb’ (e.g., underlying unweighted counts) must still be provided up front, as it would be if an
exemption were not being requested. A statement informing output checkers that you are requesting
an exemption to the SDC ‘rules of thumb’ must also be provided, together with a justification for the
exemption covering both of the points above. This should be provided in the
description section at the end of the Output Request form.
Operating the SRS using a PBOSDC system is only possible if researchers restrict their requests for
exemptions to the SDC ‘rules of thumb’ to very occasional instances involving highly important
outputs.
Following the PBOSDC system, there will very occasionally be instances where the particular context
and content of the output means that the output will require stricter SDC than the SDC ‘rules of
thumb’. If this occurs, the output checkers will clearly outline the disclosure risk that has been
identified and will assist the researcher to apply stricter SDC to appropriately manage this disclosure
risk.
3. ‘Safe’ and ‘unsafe’ outputs
The guidance in the following sections has been written using the concept of ‘safe’ and ‘unsafe’
outputs from the SDC literature. This classification system ensures that researchers’ time applying
SDC and output checkers’ time checking SDC are focused on the outputs which are most likely to
represent a disclosure risk. Under this system, the categories are defined as follows1:
• ‘Safe’ outputs: will be released unless SDC checks demonstrate some reason why they
should be held back or adjusted – the SDC ‘rules of thumb’ outlined below are fairly minimal
as they are designed to enable checks for these instances of potential risk.
The researcher should always provide the minimum information required (e.g., total count
of data subjects for regressions and models) but can generally expect the output to be
cleared with minimal or no further changes. I.e., the burden of proof is on the output
checker to provide reason(s) why the output cannot be released, contrary to normal
expectations for this type of output.
• ‘Unsafe’ outputs: will not be released unless the researcher can demonstrate, via
appropriate SDC and, where applicable, contextualising information, that the output meets
the detailed criteria for this type of output – the SDC ‘rules of thumb’ outlined below
represent these detailed criteria.
The researcher should always provide the minimum information required. However, the
output will not be cleared unless the researcher demonstrates, to the output checkers’
satisfaction, that the particular context and content of the output makes it non-disclosive.
I.e., the burden of proof is on the researcher to provide reason(s) why the output can be
released – generally, but not always, appropriate contextualising information (e.g., clear
variable labels, graph titles, etc.) and the SDC ‘rules of thumb’ will ensure that sufficient
reasons are provided.
1 This definition is adapted from:
Ritchie F. (2008) “Disclosure detection in research environments in practice”, in Work session on statistical
data confidentiality 2007; Eurostat; pp. 399-406.
Brandt M. et al. (2010) “Guidelines for the checking of output based on microdata research”, Final report of
ESSnet subgroup on output SDC.
2 This table is adapted from: Brandt M. et al. (2010) “Guidelines for the checking of output based on microdata
research”, Final report of ESSnet subgroup on output SDC.
Higher moments of distributions (including variance, covariance, kurtosis and skewness) – ‘Safe’
Graphs: pictorial representations of actual data – ‘Unsafe’
Correlation and regression analysis:
   Linear regression coefficients – ‘Safe’
   Non-linear regression coefficients – ‘Safe’
   Estimation residuals – ‘Unsafe’
   Summary and test statistics from estimates (R², χ², etc.) – ‘Safe’
Correlation coefficients ‘Safe’
a) There are dataset-specific SDC rule(s) set by the Data Owner (see the spreadsheet within the
SRS at Libraries$/SRS and SDC Guidance/SDC Guidance by Dataset for a full list of SDC rules
by dataset). In these instances, the dataset-specific SDC rule(s) would be used instead,
superseding the SRS’s SDC ‘rules of thumb’.
b) An exemption has been granted permitting the project to have custom SDC rules. In these
instances, the project-specific SDC rule(s) would be used instead, superseding both the SRS’s
SDC ‘rules of thumb’ and the dataset-specific SDC rule(s).
c) An exemption is requested as described in section 2.
If complex rules are used, particularly custom rules involving specific rounding requirements, SDC
checks often run smoothest when an additional copy of the output file(s) is provided for reference,
showing the data before SDC was applied. If this is done, please add an explanatory note in the
description section of the Output Request form to avoid confusion.
If an output contains any data that is non-SRS data, this must be clearly labelled – within the file
and/or via a note in the description section of the Output Request form. Non-SRS data does not
have to conform to the SRS’s SDC ‘rules of thumb’. However, depending on the circumstances,
written confirmation of Data Owner approval for the output may be required.
5. File types
We have the capacity to check a wide variety of file types. These include:
There are some file types which we do not have the capacity to fully check. These file types may not
be cleared for any outputs, regardless of their apparent content:
• Markdown files – with .html extension if in Hypertext Markup Language, .markdown, .md,
.markdn or .mdown extensions if in Markdown language, with .Rmd extension if in R or with
.Rnw extension if in LaTeX.
o Used to format the structure of webpages, for writing code documentation and for
dynamic reports, documents, presentations, dashboards, websites, etc. Combine
code, documentation, data and/or metadata into a single file.
o As each file is unique, due to the wide range of styles and types of information that
can be held within them, we cannot guarantee that we can exhaustively check these
files.
o Instead, the researcher should provide the information in one of the above
checkable formats – they may reconstitute them back into a markdown file once
outside the SRS, if they wish.
• R project files – with .Rproj extension.
o Saves an entire project, including data files, code and outputs.
o Used to store a project neatly in one place, improving workflow.
o As each project will contain different things, potentially including data and outputs,
we cannot guarantee that we can exhaustively check these files.
o Instead, the researcher should provide the information in constituent .R, .Rdata,
.Rda and/or .Rds files as these are checkable – they may reconstitute them back into
an .Rproj file once outside the SRS, if they wish.
• Shapefiles – with .shp, .shx or .dbf extensions or, more occasionally, with .prj, .sbn, .sbx, .fbn,
.fbx, .ain, .aih, .ixs, .mxs, .atx, .shp.xml, .cpg or .qix extensions.
o Digital vector storage formats and associated supporting files.
o Predominantly used for storing geographic data.
o We can open these types of files, but their structure makes them exceptionally
difficult to check, so we cannot guarantee that we can exhaustively check these files.
o Instead, the researcher should provide the data in one of the above checkable
formats – they may reformat this back into a shapefile once outside the SRS, if they
wish.
• JavaScript Object Notation files – with .json extension.
o A language-independent data storage file, in which the data is organised as a
hierarchical list.
o We can open this type of file, but its structure makes it exceptionally difficult to
check, so we cannot guarantee that we can exhaustively check these files.
o Instead, the researcher should provide the data in one of the above checkable
formats – they may reformat this back into a .json file once outside the SRS, if they
wish.
• R presentation files – with .RPres extension.
o Uses Markdown and R code to create HTML5 presentations.
o Like other files using Markdown, e.g. .Rmd files, we cannot guarantee that we can
exhaustively check these files.
o Instead, the researcher should provide the data in one of the above checkable
formats – they may reformat this back into an .RPres file once outside the SRS, if
they wish.
• R data files – with .Rdata, .Rda, .Rds extensions – that contain data in formats other than
dataframes/tables:
o A data storage file for R coding language.
o We can open this type of file, but unless it contains a single dataframe/table the
structure is generally too complex and not sufficiently human-readable. Therefore,
we cannot guarantee that we can exhaustively check these files when they contain
data in formats other than a single dataframe/table.
o Instead, the researcher should provide the data in one of the above checkable
formats.
Zeros are considered a disclosure risk, unless it is evident from the output request that they are
structural – see section 6.2.
Any unweighted counts below the threshold must be suppressed or otherwise removed, e.g., by
reformatting the table. Likewise, statistics relating to a group whose unweighted count is below the
threshold must also be suppressed. If any values are suppressed, secondary SDC must be applied to
prevent secondary disclosure of the suppressed values by differencing – see section 6.3.
If suppression is chosen as the SDC method, this would result in the following table:
Here ‘-’ has been chosen by the researcher to indicate suppression. Other common ways include ‘*’,
‘.’, ‘x’, ‘<10’ and ‘SUPP’. The exact symbols or letters chosen to indicate suppression are up to the
researcher. The only proviso is that they must not enable the reader to crack the suppression – e.g.,
suppressing counts of 1-9 with ‘<10’ but zeros with ‘-’ would not be acceptable. Some Data Owners
might have specific rules for which symbols to use, so please consider this when applying
suppression to your outputs.
In this instance, this initial (primary) suppression is not sufficient on its own to prevent disclosure.
The totals highlighted in yellow, in combination with the other values in those rows, enable
recalculation of the suppressed values in that row. For example, in the ‘None’ row, the total of 83
minus the other counts (12, 19, 22, 10 and 11) informs us that the suppressed value is nine. This is
termed secondary disclosure. To prevent this, secondary SDC should be applied – see section 6.3.
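Checks for this kind of differencing can be partly automated. The sketch below (in R, using a hypothetical table and invented column names; suppressed cells are stored as NA) flags rows in which exactly one cell has been suppressed, since in those rows the published total minus the remaining cells recovers the suppressed value.

# Minimal sketch with hypothetical data: suppressed cells are stored as NA.
counts <- data.frame(
  qualification = c("None", "School", "Degree"),
  a = c(12, 25, 14), b = c(19, 30, 18), c = c(NA, 22, NA),
  d = c(22, 18, 20), e = c(10, NA, 16), f = c(11, 20, NA),
  total = c(83, 130, 96)   # totals still include the suppressed cells
)

cells <- counts[, c("a", "b", "c", "d", "e", "f")]
n_suppressed <- rowSums(is.na(cells))

# Rows with exactly one suppressed cell are at risk: total minus the sum of the
# other cells reveals the suppressed count, so secondary SDC is needed.
counts[n_suppressed == 1, ]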
The rounding rule(s) are chosen by the researcher (unless specified under dataset- or project-specific
SDC rules). In this case, the researcher has chosen to round to the nearest 10. Here, the zeros are
permitted as they do not exclusively represent real zero counts, but rather any count from 0-4 (with
counts of 5-9 rounded up to 10, as indicated by the annotation below the table).
Variations on rounding are permitted – e.g., rounding all of the raw counts but not the totals or only
rounding the counts that are below threshold. However, it must be made clear to the output checker
and reader how the rounding has been applied, so that it is clear which counts are unrounded and
which are rounded.
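As an illustration only, the following R sketch rounds hypothetical counts to the nearest 10. Base R’s round() rounds halves to the nearest even number, so an explicit formula is used here so that counts of 5-9 round up to 10, matching the behaviour described above.

# Minimal sketch with hypothetical counts: round to the nearest 10, halves rounding up.
round_to_10 <- function(x) floor(x / 10 + 0.5) * 10

raw_counts <- c(9, 12, 19, 22, 4, 83)
round_to_10(raw_counts)   # 10 10 20 20 0 80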
The way that the table is reformatted is chosen by the researcher. Reformatting can include any of
the following:
The aim, as with all SDC, is to preserve as much information as possible. This is why the exact
method of reformatting is chosen by the researcher – they are best placed to know how to reformat
whilst ensuring that the results or key points of the analysis are retained.
6.2 Class disclosure
Class disclosure occurs when the reader can learn something new about every data subject
belonging to a particular group of the data. For this reason, empty cells (i.e., cells whose unweighted
counts are zero) and full cells (i.e., cells whose unweighted counts represent 100% of a group)
represent a class disclosure risk.
The exception to this is structural zeros, also called logical zeros. These are zeros which are present
due to the nature of the dataset or its collection – i.e., the only possible value of the group is zero –
see section 6.2.1 for an example. Structural zeros are permitted.
Unweighted counts of zero, excepting structural zeros, must be suppressed or otherwise removed,
e.g., by reformatting the table. Unweighted counts representing 100% of a group and statistics that
relate to 100% of a group may be permitted or may need to be suppressed, depending on the
amount of information gained from the class disclosure. If any values are suppressed, secondary SDC
must be applied to prevent secondary disclosure of the suppressed values by differencing – see
section 6.3. If dataset- and/or project-specific knowledge is required to know that a zero is
structural, the researcher should annotate their file and/or provide a note in the description section
of the Output Request form explaining this, to prevent confusion during SDC checking.
The zeros (highlighted in yellow to make them clearer) make logical sense. You would not expect
anyone younger than 16 years old to have gained secondary education qualifications and would not
expect anyone younger than 21 years old to have gained a higher education qualification. The zeros
are, therefore, structural rather than informative.
Due to their nature, structural zeros present no meaningful disclosure risk. In contrast, informative
zeros (zeros where counts are able to be a value other than zero) do present a disclosure risk. Hence,
structural zeros are permitted whereas informative zeros are considered a potential class disclosure
risk.
If dataset- and/or project-specific knowledge is required to know that a zero is structural, the
researcher should annotate their file and/or provide a note in the description section of the Output
Request form explaining this, to prevent confusion during SDC checking.
Non-responses are not necessarily considered structural zeros; the context of the output is required,
and requests will be considered on a case-by-case basis.
6.2.2 Example: suppression
It is quite common for frequency tables to contain informative zeros. For example:
This table contains several informative zeros (highlighted in yellow to make them clearer). These are
disclosive as they enable us to gain specific information about every individual in a group. E.g., we
can see that nobody whose highest qualification was school-level or less is in the 3rd or 4th income
quartile (i.e., earns above the median) and we can also see that nobody whose highest qualification
was post-graduate is in the 1st or 2nd income quartile (i.e., earns below the median). Therefore, SDC
should be applied to this table.
If suppression is chosen as the SDC method, this would result in the following table:
In this instance, the initial (primary) suppression is not sufficient on its own to prevent disclosure.
The totals highlighted in yellow, in combination with the other values in those rows, enable
recalculation of the suppressed values in that row. For example, in the ‘School’ row, the total of 82
minus the other counts (43, 39) is zero, which informs us that both of the suppressed values are
zero. This is termed secondary disclosure. To prevent this, secondary SDC should be applied – see
section 6.3.
As discussed in section 6.1.2, the rounding rule(s) are chosen by the researcher (unless specified
under dataset- or project-specific SDC rules). Also, variations on rounding are permitted – e.g.,
rounding all of the raw counts but not the totals or only rounding the counts that are below
threshold. However, it must be made clear to the output checker and reader how the rounding has
been applied, so that it is clear which counts are unrounded and which are rounded.
The way that the table is reformatted is chosen by the researcher. Reformatting can include any of
the following:
The aim, as with all SDC, is to preserve as much information as possible. This is why the exact
method of reformatting is chosen by the researcher – they are best placed to know how to
reformat whilst ensuring that the results or key points of the analysis are retained.
If differencing can occur, secondary SDC must be applied (i.e., suppression of additional aspects of
the output that are not disclosive in and of themselves, but which permit differencing). The
researcher should decide how to apply secondary SDC, as in most cases several options are available
and the choice between them will depend on which parts of the output are most important for use.
Special care should be taken to ensure that the whole output is checked for the possibility of
differencing, not just the table with the suppressed values. Secondary disclosure may occur due to a
combination of tables or the combination of tables and text which contains numbers. If the scope of
differencing in the output is extensive and/or complex, we strongly suggest rounding is used as the
secondary SDC technique as this method is the strongest protection against differencing and is
easier to comprehensively apply and check.
This table was created in section 6.1.1 and contains secondary disclosure. The totals highlighted in
yellow, in combination with the other values in those rows, enable recalculation of the suppressed
values in that row. For example, in the ‘None’ row, the total of 83 minus the other counts (12, 19,
22, 10 and 11) informs us that the suppressed value is nine.
Secondary SDC must be applied to this table to prevent the secondary disclosure. If recalculation of
totals is chosen as the secondary SDC method, the following table is produced:
As the totals have been recalculated and now only include the counts that were not suppressed, it is
now impossible to recalculate the suppressed counts using them. Note that all of the totals have
been recalculated, not just those that could be used to recalculate suppressed counts. Having a
consistent approach to how totals are calculated in a given table is important to prevent output
checker and reader confusion.
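A minimal R sketch of this approach is shown below, using a hypothetical table in which suppressed cells are stored as NA; the totals are then recalculated over the unsuppressed cells only.

# Minimal sketch with hypothetical data: recalculate totals over unsuppressed cells only,
# so they can no longer be combined with the other cells to recover suppressed values.
tab <- data.frame(
  a = c(12, 25), b = c(19, 30), c = c(NA, 22),
  d = c(22, 18), e = c(10, NA), f = c(11, 20)
)
tab$total <- rowSums(tab, na.rm = TRUE)   # excludes the suppressed (NA) cells
tab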
Here, additional cells have been suppressed, despite the fact that their counts were not below the
threshold. Exactly which additional cells are suppressed is chosen by the researcher (unless specified
under dataset- or project-specific SDC rules). This secondary suppression ensures that each row and
each column contains at least two suppressed values, preventing recalculation of the suppressed
values using the totals and other values in the row or column.
Note that the researcher may use the same or different symbols or letters to indicate primary versus
secondary suppression, provided that the choice does not enable the reader to crack the
suppression.
Additionally, note that secondary suppression only applies to counts and statistics (e.g., percentages,
means, ratios) that enable recalculation of counts and statistics that have undergone primary
suppression. Therefore, a table such as this:
Could have primary suppression like this (as statistics relating to counts below threshold must be
suppressed but the researcher chooses the symbols or letters that indicate suppression):
The researcher has chosen to carry out secondary SDC by suppressing the count associated with the
‘Black’ ethnicity group. This prevents recalculation of the suppressed count for the ‘Other’ ethnicity
group. To preserve maximum information, the researcher has chosen to suppress this count as ‘<15’
rather than less informative options such as ‘-’, ‘.’, ‘SUPP’, etc. However, the mean associated with
the ‘Black’ ethnicity group does not need to be suppressed. This is because a) it does not relate to a
group whose count is below the threshold and therefore it does not need to have primary
suppression applied to it and b) it cannot be used to recalculate the suppressed count or the
suppressed mean for the ‘Other’ ethnicity group and therefore it does not need to have secondary
suppression applied to it.
Finally, note that secondary suppression rapidly gets very complicated to both implement and check
if the output is extensive and/or complex, e.g., if it contains multiple inter-related tables or tables
with a hierarchical relationship to each other. In this instance, we strongly suggest not using
secondary suppression as the secondary SDC method and recommend using rounding instead.
7. Dominance
Dominance can exist as a potential disclosure risk when there is a particularly large unit of data
within a sample. There are two ways of classifying dominance within the SRS (other Trusted
Research Environments may use other definitions):
Employment       University 1        University 2        Secondary school 1   Secondary school 2
classification   N      Mean         N      Mean         N      Mean          N      Mean
Management       45     98,940       10     180,420      15     67,093        x      x
Teacher          158    45,302       45     46,349       92     39,023        15     38,730
Assistant        219    26,823       34     26,392       48     24,567        x      x
Other            80     24,509       18     26,781       26     23,201        x      x
‘x’ indicates suppression due to low counts.
The dominance in this table is not immediately obvious. It can be identified by looking at the mean
pay for ‘Management’ at University 2. The mean pay for this group is much higher than for
‘Management’ at University 1, despite University 2 having 35 fewer staff in this employment
classification than University 1. Therefore, even without looking at the record-level data, we know
that there must be some very high earners skewing the mean pay for this employment classification
at University 2. Topic-specific knowledge can help here: skewing of pay data is likely to be more
common and more extensive in smaller universities, as pay scales for upper management are much
higher than for departmental chairs. Any university (regardless of size) needs a certain number of
upper management positions, but the number of departmental chairs will generally be proportional
to the size of the university.
The best way to fix dominance of this nature is to reformat the output. For example:
Employment       Sampled universities   Sampled secondary schools   Whole sample
classification   N      Mean            N      Mean                 N      Mean
Management       55     113,755         23     75,424               78     102,452
Teacher          203    45,534          107    38,982               310    43,273
Assistant        253    26,765          52     25,058               305    26,474
Other            98     24,926          28     23,485               126    24,606
Here, the data has been pooled by type of educational institution, rather than reporting at an
individual educational institution level. This alleviates the dominance and therefore reduces the risk
of disclosure.
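A minimal sketch of this kind of pooling in R is shown below; the record-level data frame and its variable names are invented for illustration.

library(dplyr)

# Hypothetical record-level pay data (not real SRS data)
staff_pay <- data.frame(
  institution_type = c("University", "University", "Secondary school", "Secondary school"),
  employment_class = c("Teacher", "Teacher", "Teacher", "Assistant"),
  pay = c(45000, 46500, 39000, 24500)
)

# Pool by institution type rather than by individual institution before reporting
pooled <- staff_pay %>%
  group_by(institution_type, employment_class) %>%
  summarise(N = n(), mean_pay = mean(pay), .groups = "drop")
pooled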
8. Statistics
Various statistics may be requested for output, ranging from routine descriptive statistics (e.g.,
mean, median, mode, standard deviation, range, percentiles, minimums, maximums) to more
specialist statistics (e.g., variance, covariance, kurtosis, skewness, hypothesis testing, concentration
ratios, odds ratios). These should all be reported based on the threshold, as follows.
Note that in all cases the relevant count is the number of data subjects used to calculate the
statistic, not the number of data subjects in the group overall – sometimes information for some
variables is only available for a subset of the data subjects and this should be considered when
determining the underlying count for a given statistic.
Generally, these ‘safe’ statistics are not considered to be disclosive provided that the underlying
counts are a) stated and b) meet or exceed the threshold.
If the researcher wishes to output any other statistic that is not listed above but that they think
constitutes a ‘safe’ statistic due to its mathematical complexity, they must provide an explanation of
its ‘safety’. This information should be provided via annotating the output file and/or adding a note
in the description section of the Output Request form.
8.2.1 Mean
Means represent an increased disclosure risk as they are easily calculated: one simply needs the
value held by each individual in the group and the count of individuals in the group.
The same rules apply to means as other statistics: i.e., underlying counts must be a) provided and b)
meet or exceed the threshold.
Due to the simplicity of their calculation, means may be associated with secondary disclosure.
Outputs containing means should be thoroughly checked to exclude this possibility – see section 6.3
for more details.
These means, like all means, cannot be cleared without their underlying counts, which are:
This reveals that several of the means are calculated from groups of firms whose count is below the
threshold (highlighted in yellow to make them clearer). Where counts of a group are below
threshold, the statistic calculated from this group must be suppressed:
Additionally, checks should be made to ensure that these means cannot be used to difference
suppressed values elsewhere in the output and that other information in the output cannot be used
to difference these suppressed means – see section 6.3.
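As an illustration, the R sketch below (hypothetical data and an illustrative threshold) suppresses any group mean whose underlying unweighted count falls below the threshold, while keeping the counts so they can be checked.

library(dplyr)

threshold <- 10   # illustrative value; use the threshold that applies to your dataset

# Hypothetical group means with their underlying unweighted counts
means_tab <- data.frame(
  group = c("A", "B", "C"),
  n = c(25, 7, 40),
  mean_value = c(102.3, 98.1, 110.6)
)

# Suppress the mean where the underlying count is below the threshold
means_tab <- means_tab %>%
  mutate(mean_value = ifelse(n < threshold, NA, mean_value))
means_tab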
8.2.2 Percentages
Similarly to means, percentages represent an increased disclosure risk as they are easily calculated:
one simply needs the count of a subgroup (i.e., numerator) and the count of the group that
subgroup belongs to (i.e., denominator).
The same rules apply to percentages as other statistics: i.e., underlying counts must be a) provided
and b) meet or exceed the threshold. Note that ‘underlying counts’ refers to the numerator of the
percentage, not the denominator.
Due to the simplicity of their calculation, percentages are frequently associated with secondary
disclosure. Outputs containing percentages should be thoroughly checked to exclude this possibility
– see section 6.3 for more details.
The same rules apply to weighted counts as other statistics: i.e., underlying counts must be a)
provided and b) meet or exceed the threshold. Note that ‘underlying counts’ refers to unweighted
counts. Be aware that weighting in and of itself is not sufficiently protective of low counts as it is
often possible to reverse the weighting using a combination of reported methodology, weighted
counts and/or unweighted totals. These may be present in the output or elsewhere in the public
domain.
Weighted counts may also be associated with secondary disclosure. Outputs containing weighted
counts should be thoroughly checked to exclude this possibility – see section 6.3 for more details.
• When the value is held by at least threshold number of data subjects (e.g., a sample has a
modal depression score of 2 on a 5-point Likert scale, where the number of data subjects
that had a score of 2 is a) stated and b) meets or exceeds the threshold).
• When the value is structural as the variable’s range is limited (e.g., a minimum of zero and a
maximum of 100 for a variable that is reported as a percentage).
If the researcher wishes to clear modes, minimums and/or maximums, they must demonstrate why
the modes, minimums and/or maximums in the output are not disclosive. This information should be
provided via annotating the output file and/or adding a note in the description section of the Output
Request form.
8.2.4.1 Example: suppression
Modes, minimums and maximums are generally not able to be cleared except in the few
circumstances described in section 8.2.4. For example:
                             Minimum   Maximum
Age                          11        16
GCSE English score (%)       0         100
GCSE Mathematics score (%)   0         100
GCSE History score (%)       1         97
Here, the minimum and maximum for ‘GCSE English score’ and ‘GCSE Mathematics score’ may be
released – as the scores are percentage scores, a minimum of zero and maximum of 100 are
structural rather than informative, regardless of how many individuals hold them. However, the
minimum and maximum for ‘Age’ and ‘GCSE History score’ are not structural and therefore cannot
be cleared without the count of data subjects holding each value:
Some of the counts underlying the ‘Age’ and ‘GCSE History score’ minimums and maximums are below
the threshold (highlighted in yellow to make them clearer), so they must be suppressed as follows:
This table illustrates that, in the circumstances where modes, minimums and maximums are
permitted, they are often no longer useful from a research perspective. This is why it is generally
advised that minimums and maximums simply be avoided.
It should be noted that there are two different interpretations of this table: the minimum and
maximum for English and Maths could be considered structural, but for History the minimum and
maximum represent the exact scores that students achieved. When both are presented together, the
implication is that 0 and 100 were the minimum and maximum scores achieved by a certain number
of students, rather than the limits of the possible score range, and so they could be considered
disclosive, meaning that underlying counts and suppression might be needed. To avoid reader
confusion, the minimum and maximum should refer to the same concept, e.g., either the possible
scores in the GCSE or the scores students actually achieved, and this should be made clear in the
table description.
8.2.4.2 Example: rounding
If the mode, minimum and/or maximum are especially necessary for a project’s research goals, it is
sometimes possible to use rounding to enable them to meet one of the exceptions under which they
may be released. For example:
These statistics cannot be cleared without the count of data subjects holding each value, which are
as follows:
These counts are all below the threshold. Therefore, these statistics cannot be cleared. However, by
rounding the output, the counts held by some of these statistics now meet or exceed the threshold:
However, as shown above, sometimes even rounding is not sufficient to make this type of statistic
suitable for clearance – in this example, the count underlying the maximum (highlighted in yellow to
make it clearer) is still below the threshold even after rounding. Therefore, this statistic must be
suppressed:
• Median (aka quartile 2 (Q2) or 50th percentile) should be suppressed if the total count is less
than twice the threshold. As the median corresponds to half of the group, each half needs to
meet the threshold, so the total needs to be at least twice the threshold. Therefore, using a
threshold of 10, there would need to be at least 20 data subjects in the total group.
• Upper and lower quartiles (aka quartile 3 (Q3) and quartile 1 (Q1), or 75th and 25th percentiles)
should be suppressed if the count of the group is less than four times the threshold (as each
quartile corresponds to one quarter of the group). Therefore, using a threshold of 10, there
would need to be at least 40 people in the total group.
• Deciles/percentiles should be suppressed based on the count of the group that they
correspond to, e.g.:
o Deciles correspond to tenths of the group, so should be suppressed if the count of
the group is less than ten times the threshold (as each decile corresponds to one
tenth of the group).
o The 1st and 99th percentiles correspond to hundredths of the group, so should be
suppressed if the count of the group is less than one hundred times the threshold
(as each percentile corresponds to one hundredth of the group).
o Etc.
Note that these rules apply regardless of the format that the medians, quartiles, deciles and/or
percentiles are presented in. For examples of these statistics graphed, see section 9.4.
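These multiples-of-the-threshold rules are easy to check programmatically. The R sketch below uses an illustrative threshold of 10 and a hypothetical group size of 36, matching the ‘SEN’ row in the example that follows, where the median can be reported but the quartiles cannot.

threshold <- 10   # illustrative threshold
n_group   <- 36   # hypothetical group size

report_median    <- n_group >= 2 * threshold    # TRUE:  36 >= 20
report_quartiles <- n_group >= 4 * threshold    # FALSE: 36 <  40
report_deciles   <- n_group >= 10 * threshold   # FALSE: 36 < 100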
             Median   LQ   UQ   N
All          47       34   58   100
Male         47       20   68   46
FSM          52       30   63   65
SEN          46       19   62   36
SEN (EHCP)   22       14   49   11
The best way to handle the disclosure issues in this example is via suppression – i.e., suppressing the
median where the count is less than twice the threshold and suppressing quartiles where the count
is less than four times the threshold. For example:
             Median   LQ   UQ   N
All          47       34   58   100
Male         47       20   68   46
FSM          52       30   63   65
SEN          46       .    .    36
SEN (EHCP)   .        .    .    11
‘.’ indicates suppression due to low counts.
8.2.5.2 Example: reformatting
In some circumstances, reformatting is a suitable way to handle a disclosure risk caused by
percentiles. For example:
                Attendance (%)
                Median   1st percentile   99th percentile   N
Before scheme   86       32               100               971
After scheme    93       64               100               964
In this table, the 1st and 99th percentiles are present despite the count of each group being less than
one hundred times the threshold (highlighted in yellow to make them clearer). Therefore, it cannot
be cleared. However, as the count of the group is still quite high (and greater than twenty times the
threshold), reformatting the table to use vigintiles instead of percentiles would make it suitable for
clearance:
                Attendance (%)
                Median   5th centile   95th centile   N
Before scheme   86       40            97             971
After scheme    93       67            98             964
The same rules apply to ratios as other statistics: i.e., underlying counts must be a) provided and b)
meet or exceed the threshold. Note that for ratios there are several underlying counts that are
relevant and all numerators for these should be provided.
Due to the simplicity of their calculation, ratios may be associated with secondary disclosure.
Outputs containing ratios should be thoroughly checked to exclude this possibility – see section 6.3
for more details.
(Note: concentration ratios, such as the Herfindahl-Hirschman index, have much more complex
methodology than the ratios described in this section. Therefore, concentration ratios are
considered ‘safe’ statistics and are covered in section 8.1.)
          Smoker                         Non-smoker
          Cancer   No cancer   Total     Cancer   No cancer   Total
Lung      786      8,424       9,210     100      11,574      11,674
Liver     387      8,823       9,210     132      11,600      11,732
Bladder   39       288         327       6        217         223
The relative risk and the odds ratio for bladder cancer are derived from a count below threshold
(highlighted in yellow to make it clearer). Therefore, these two statistics must be suppressed:
9. Graphs
Graphic representations of data can present a high risk of disclosure, particularly if the methods of
data presentation are used to show the distributions of a value (e.g., histograms) or if bars, points or
lines relate to a single observation or a single data subject (e.g., scatter plots).
Graphic representations of data are subject to the same SDC methodology as tables. Therefore, all
graphic representations of data should be presented with their underlying unweighted counts. These
underlying counts must meet or exceed the threshold (except in cases where zeros are structural –
see section 6.2). A convenient way to do this is to provide the underlying counts in table(s) in
supplementary Excel file(s). If you do this, you should clearly indicate (e.g., in the Output Request
form) where to find the underlying counts for each graph, citing file names and sheets, pages, lines
and/or cells within files, as applicable.
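A minimal R sketch of producing the underlying counts for a graph and saving them as a supplementary file is shown below; the data frame, variable names and file name are invented for illustration, and a .csv file is used here for simplicity although an Excel file would serve equally well.

library(dplyr)

# Hypothetical record-level data (not real SRS data)
business_data <- data.frame(
  business_category = c("Retail", "Retail", "Transport", "Transport", "Transport"),
  year = c(2011, 2012, 2011, 2011, 2012)
)

# Unweighted counts underlying the graph, saved alongside the figure for output checking
underlying_counts <- count(business_data, business_category, year)
write.csv(underlying_counts, "underlying_counts_figure1.csv", row.names = FALSE)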
As the underlying counts for ‘Transport and communication’ for years 2011-2013 are below the
threshold, the line graph cannot be cleared. Similarly to the table that underlies it, there are a
number of ways to apply SDC to this line graph – for example, the low counts could be suppressed,
or the graph could be reformatted by combining business categories and/or by combining reporting
years. For more details, see section 6.1. Note that any SDC must be done in the graph as well as the
frequency table that shows the graph’s counts:
Note: low underlying counts prevent reporting of ‘Transport and communications’ for 2011-2013.
Some other instances where scatter graphs are not necessarily disclosive include:
• Scatter graphs of ‘safe’ statistics. E.g., a scatter plot of the distribution of kurtosis among
subsamples, where each subsample met the SDC ‘rule of thumb’ for reporting kurtosis.
• Scatter graphs of variables, particularly highly derived variables, that could not reasonably
be related back to underlying counts except by the researcher. E.g., a scatter plot of
predicted turnover by industry for the next 5 years, created using a model designed by the
researcher.
• Scatter graphs that use standardised or otherwise transformed statistics, e.g.:
o Where one or both axes are standardised, e.g., so that the mean is zero and the
standard deviation is one, or so that the minimum is zero and the maximum is 100.
o Where one or both axes are presented as ‘distance from…’, e.g., ‘distance below
median’ or ‘distance above mean’.
o Where the scale is suppressed on one or both axes.
However, note that standardisation and transformation do not, in and of themselves,
represent sufficient risk mitigation as mean, median, standard deviation, range, etc. may be
presented elsewhere in the output or in the public domain.
However, these and other scatter graphs may only be cleared as an exception.
3 For R users, ggplot2 3.3.0 and later has a bin scale option which may be of use when producing these
outputs.
Even though there are no extreme outliers in this graph, each point can be used to provide
information about a single data subject. Therefore, it is relatively easy to pick out an individual data
subject and attribute data to them. For example, the red circled points (from left to right) show us
that the data includes an individual aged 33 earning £42,000 per annum, an individual aged 35
earning £46,000 per annum and an individual aged 58 earning £28,000 per annum.
This demonstrates why the majority of scatter plots are not suitable for clearance. Instead, the data
could be reformatted – e.g., as a bar chart, histogram, line graph, frequency table and/or as
statistic(s) (e.g., mean). For example:
The same mitigating actions that apply to other methods of graphic representation also apply to bar
charts and histograms. Generally, bar charts and histograms4 are not considered to be disclosive
provided that the underlying counts are a) stated and b) meet or exceed the threshold (except in
cases where low counts are structural – see section 6.2).
Bars in either bar charts or histograms that relate to counts below threshold should be either
suppressed or reformatted (e.g., combining categories or rounding) to prevent disclosure.
4 For R users, see footnote 3 regarding the ggplot2 3.3.0 and later bin scale option, which may be of use when
creating histograms.
This histogram, like all graphs, cannot be cleared without its underlying counts, which are:
Age band (years)   Frequency
10-19              758
20-29              1,932
30-39              2,367
40-49              2,031
50-59              1,689
60-69              876
70-79              432
80-89              211
90-99              97
100-109            4
As the underlying counts for the age band ‘100-109’ are below the threshold, the histogram cannot
be cleared. Typically, the most suitable way to handle a disclosure issue like this is by reformatting
the histogram, combining some of the bars as follows:
As the ‘90-99’ and ‘100-109’ age bands have been combined, there are no longer any bars that
represent a group with a count below the threshold. Therefore, this histogram meets the SDC ‘rules of
thumb’.
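A minimal R sketch of this kind of reformatting is shown below, using simulated ages; widening the bins pools the sparse upper age bands so that each bar relates to a larger group (ggplot2 3.3.0 and later also offers binned scales, as noted in footnote 3).

library(ggplot2)

# Simulated ages for illustration (not real SRS data)
set.seed(42)
ages <- data.frame(age = sample(10:104, 500, replace = TRUE))

# Wider bins (20-year bands starting at 10) pool the sparse oldest age bands together
ggplot(ages, aes(x = age)) +
  geom_histogram(binwidth = 20, boundary = 10, closed = "left")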
Note: age bands above 99 years have been suppressed due to low underlying counts.
Suppression of bars should only be chosen when it does not risk confusing the reader, e.g., by
implying that the category holds a value of zero. It is strongly advised that a note accompany the
graph so that it is clear where data has been suppressed.
9.4 Boxplots
Boxplots typically have a box showing the upper and lower quartiles and a bar inside the box
showing the median. They also typically have whiskers, which may show the minimum and
maximum, a multiple of the interquartile range above and below the box, the 2nd and 98th
percentiles or other values. If the whiskers are not the minimum and maximum, any outliers beyond
the whiskers are typically shown as points. Sometimes the mean is shown as a point. Sometimes the
boxplot is notched, to indicate a confidence interval around the median. Sometimes, dot plots
(effectively single-axis scatter plots) are overlaid over boxplots.
In general:
• Boxplots should be suppressed if the count of the group is less than four times the threshold
(as boxplots show quartiles).
• The plot should clearly indicate what the whiskers (and any points) show.
• Whiskers should comply with the above guidance relating to minimums, maximums, deciles
and percentiles (as applicable). This means whiskers may need to be changed to a different
whisker type or suppressed entirely. If they are suppressed, this should be indicated to avoid
reader confusion.
• Outliers should be suppressed, as these represent values relating to single data subjects.
• The mean may be shown as a point.
• The boxplot may be notched to show confidence intervals.
• Any dot plots overlaying boxplots should be treated in the same way as scatter plots – i.e.,
they are generally not permitted. This applies even if the points are jittered (i.e., some
random noise has been introduced into the dot plot).
This set of boxplots, like all graphs, cannot be cleared without its underlying counts, which are:
As boxplots report quartiles and the count for the age group ‘50-64’ is less than four times the
threshold, this age group cannot be reported and must be suppressed:
In this instance, this suppression is not sufficient on its own to prevent disclosure. The other issues
with this set of boxplots are discussed in sections 9.4.3 and 9.4.4.
Note that if the suppressed boxplot(s) were necessary, reformatting might be a better option – see
section 9.4.2.
Here, the ‘35-49’ and ‘50-64’ age groups have been merged, creating a group with a count greater
than four times the threshold (see table in section 9.4.1). Note that other issues remain with this set
of boxplots that are discussed in sections 9.4.3 and 9.4.4.
Alternatively, the boxplots could have been reformatted into histograms with pooled categories,
e.g.:
Here, each pooled category has a count that meets or exceeds the threshold. Therefore, these
histograms meet the SDC ‘rules of thumb’.
9.4.3 Example: suppression of outliers
Generally, outliers are not permitted as they represent single data subjects. Therefore, they should
be suppressed. For example, using the set of boxplots from section 9.4.1:
Note that the y-axis scale has been changed to avoid implying the extent of the suppressed outliers.
However, in this instance, this is still not sufficient on its own to prevent disclosure. The other issue
with this set of boxplots is discussed in section 9.4.4.
Here, the whiskers have been reformatted so that they are defined as the 10th and 90th percentiles.
As both of the age groups have a count at least 10 times the threshold (see table in section 9.4.1), it
is fine to use these percentiles. However, note that if particularly small sample sizes are used, the
whiskers may have to be suppressed entirely, in which case, if a boxplot must be used, this should be
indicated via an annotation to avoid reader confusion.
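One way to produce a boxplot with 10th and 90th percentile whiskers and no outlier points in R is to pre-compute the five plotted values and pass them to geom_boxplot() with stat = "identity", as in the sketch below (simulated data and invented variable names).

library(dplyr)
library(ggplot2)

# Simulated data for illustration (not real SRS data)
set.seed(1)
dat <- data.frame(
  age_group = rep(c("16-34", "35-64"), each = 150),
  income = c(rnorm(150, 28000, 6000), rnorm(150, 34000, 8000))
)

# Pre-compute the five values to draw, with whiskers at the 10th and 90th percentiles
box_stats <- dat %>%
  group_by(age_group) %>%
  summarise(ymin   = quantile(income, 0.10),
            lower  = quantile(income, 0.25),
            middle = quantile(income, 0.50),
            upper  = quantile(income, 0.75),
            ymax   = quantile(income, 0.90),
            .groups = "drop")

# stat = "identity" draws the boxes from these values only, so no outlier points appear
ggplot(box_stats, aes(x = age_group, ymin = ymin, lower = lower, middle = middle,
                      upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity")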
As violin plots show the full distribution of the data, there is a much higher risk of disclosing
information relating to small counts or single data subjects than with a boxplot. The risk is higher if
the plotted data are ordinal numbers (e.g., scores on a scale) or integer values (e.g., age in whole
years) rather than continuous decimal numbers (e.g., income in decimal pounds). The risk is also
higher when the sample size is small compared to the y-axis range and scale (e.g., for a sample of 20
individuals, presenting ‘job satisfaction’ as a 5-point Likert scale (y-axis of 0-4 in integers) is less risky
than presenting ‘attendance over last calendar year’ as a decimal percentage (y-axis of 0.0 to
100.0)).
This violin plot, like all graphs, cannot be cleared (even by exception) without its underlying counts,
which are:
Loneliness score   N
1                  120
2                  90
3                  80
4                  160
5                  10
6                  9
7                  71
8                  60
9                  140
10                 40
Total              780
As violin plots report the whole distribution, the numerators are important (in contrast to boxplots
where the denominator is important). The underlying counts demonstrate that the ‘pinch point’
visible in the graph for a Loneliness score of 6 relates to a group with a count below threshold
(highlighted in yellow to make it clearer). Therefore, the violin plot cannot be cleared. Instead, the
data could be reported as a boxplot, as described in section 9.4.
Note that the only way to determine if a violin plot is disclosive is by looking at the underlying
counts. A ‘pinch point’ does not conclusively indicate disclosure, nor does the absence of ‘pinch
points’ conclusively indicate a lack of disclosure. For example:
This violin plot does not have any ‘pinch points’. However, the underlying counts are:
Loneliness score   N
1                  6
2                  9
3                  8
4                  6
5                  4
6                  9
7                  7
8                  6
9                  3
10                 8
Total              66
As with the previous violin plot, this violin plot cannot be cleared. This is because every Loneliness
score relates to a group with a count below threshold. Instead, the data could be reported as a
boxplot, as described in section 9.4.
10. Regressions and modelling
10.1 Coefficients, margin plots and test statistics
Regressions/models, including their margin plots and test statistics, must be based on degrees of
freedom (the number of observations minus the number of parameters being estimated) that are a)
stated and b) meet or exceed the threshold. Additionally, regressions/models must not be based on
a single unit (e.g., a time series of one business) and must not be saturated. A saturated
regression/model is one that reports all interactions of the variables; this can also be called a fully
interacted regression/model.
Care should be taken when submitting the unedited outputs of regressions/models carried out by
software such as STATA and SPSS, as often other statistics are automatically reported as part of the
code output. Some of these, such as kurtosis and skewness, are generally unproblematic – see
section 8.1. However, sometimes minimums, maximums and percentiles may be automatically
reported by the software program – these are considered on their own merits, following the rules in
section 8.2.4.
All possible combinations of the data have been reported in this regression, i.e.:
Therefore, this regression is saturated – i.e., it is overfitted because every possible combination of
states is represented by the coefficients. When a regression or model is saturated, the coefficients
exactly correspond with the means:
Mean log_income            grant_giver
                           No         Yes
survivor   Dead in 2015    10.85469   11.50369
           Alive in 2015   13.64514   15.70696
5 This example is kindly provided by Felix Ritchie, University of the West of England (UWE).
I.e.:
As the output of a saturated regression is, effectively, means, if the researcher wishes to have the
regression output cleared, they must provide information suitable for clearing means – see section
8.2.1. However, as saturated regressions are typically overfitted and therefore unlikely to have
analytical value, it is best to use a different analytical method instead in this circumstance.
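The correspondence between a saturated regression and cell means can be seen in the short R sketch below, which uses simulated data with the same two binary variables as the example above.

# Simulated data for illustration (not real SRS data)
set.seed(1)
dat <- data.frame(
  survivor    = factor(sample(c("Dead in 2015", "Alive in 2015"), 200, replace = TRUE)),
  grant_giver = factor(sample(c("No", "Yes"), 200, replace = TRUE))
)
dat$log_income <- 11 + 3 * (dat$survivor == "Alive in 2015") +
  1 * (dat$grant_giver == "Yes") + rnorm(200, sd = 0.5)

# A fully interacted ('saturated') model: one parameter per cell
saturated <- lm(log_income ~ survivor * grant_giver, data = dat)

# The fitted values of a saturated model are exactly the cell means
cell_means   <- tapply(dat$log_income, list(dat$survivor, dat$grant_giver), mean)
fitted_means <- tapply(fitted(saturated), list(dat$survivor, dat$grant_giver), mean)
all.equal(cell_means, fitted_means)   # TRUE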
10.2 Residuals
Residuals may be plotted or statistics may be calculated from them, e.g., standard deviation.
Statistics of residuals are treated using the same rules as if the statistics were of a group of data
subjects – see section 8. Plots of residuals are treated the same as any other scatter graphs – i.e., not
permitted except in a few circumstances – see section 9.2. If residuals must be plotted, it is
preferable to present only the line of best fit on the graph, or to provide a description of the
plot in words instead.
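As a sketch of the preferred presentation, the R code below (simulated data) plots only a fitted trend line through the residuals, rather than the individual residual points.

library(ggplot2)

# Simulated model for illustration (not real SRS data)
set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)
model <- lm(y ~ x, data = dat)

diag_dat <- data.frame(fitted = fitted(model), resid = resid(model))

# Plot only the line of best fit through the residuals; no individual points are drawn
ggplot(diag_dat, aes(x = fitted, y = resid)) +
  geom_smooth(method = "lm", formula = y ~ x)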
Generally, choropleth maps are not considered to be disclosive provided that the underlying counts
are a) stated and b) meet or exceed the threshold (except in cases where low counts are structural –
see section 6.2). Choropleth maps that do not meet this requirement should be reformatted to
prevent disclosure, e.g., by using a different geography level or choosing another method of visual
representation.
Data point maps may only be cleared as an exception, as described in section 2. If an exception of
this sort is requested, the researcher should provide clear details outlining what each point
represents. Care should be taken that instances of differencing or dominance cannot be inferred
through the map itself, by comparing the map to other files in the same output or by comparing the
map to other publicly available sources. The researcher should also refer to the guidance in section
11.2.
11.1.1 Example: reformatting
Data point maps are generally not permitted as each point typically relates to a single data subject.
For example:
Each point relates to a single person with disease X. Although it is difficult to pinpoint exact locations
of each case, it is still possible for the individuals to identify themselves and for other individuals to
narrow down identities through social networks, news reports and other publicly available
information. Therefore, this map represents a significant disclosure risk.
The map could be reformatted so that the data is reported as number of cases per geographic area
rather than as single points. For example:
The geographic level for reporting should be carefully chosen, based on SDC concerns as well as
research need.
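A minimal R sketch of this aggregation step is shown below; the case data and area codes are invented for illustration, and the resulting counts would then be joined to a boundary file to draw the choropleth.

library(dplyr)

# Hypothetical case data with an area code for each case (not real SRS data)
set.seed(1)
cases <- data.frame(
  area_code = sample(c("AREA_01", "AREA_02", "AREA_03"), 300, replace = TRUE)
)

# Report counts per geographic area rather than individual point locations
cases_by_area <- count(cases, area_code, name = "n_cases")
cases_by_area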
12. Code files
Code should be reviewed carefully for embedded disclosive data, i.e., low counts, school names, etc.
These are most commonly found in hard-coded data, researcher comments and code that discloses
information about a single record (e.g., by filtering it using overly specific terms). This approach
should be taken with all coding languages.
With code, there is the option of using the Code category of clearance. Therefore:
The SDC ‘rules of thumb’ outlined elsewhere in this document apply to data, regardless of whether it
is in code or in a more typical format. Therefore, use these rules when applying SDC to data in code.
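The hypothetical R sketch below illustrates the difference between code that embeds record-level values (which would be disclosive) and code that reads the data from a file and reports only aggregate summaries; the values and file name are invented and are not taken from the examples discussed below.

# Hypothetical illustration only; values and file names are invented.

# Disclosive: record-level values hard-coded into the script
# ages_of_interest <- c(17, 19, 19, 22)   # each value relates to a single data subject

# Preferable: read the data from a file held inside the SRS and keep record-level
# values out of the code and its comments
dat <- read.csv("project_data/sample_extract.csv")
summary(dat$age)                          # aggregate summaries only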
12.1 Example: Hard-coded data in code
This code contains hard-coded data:
The hard-coded data in rows 3-5 is disclosive – it contains record-level data, as each number/letter
in the lists in these rows relates to a single data subject.
The hard-coded data in rows 16-23 is disclosive – like the example above, it contains record-level
data, as each number in the lists in these rows relates to a single data subject.
• Row 36 specifies the data that the researcher is working on, which is called ‘hesa_sample’.
• Row 37 sorts the data by age.
• Row 38 filters the data to select rows where the ‘gender’ variable is not ‘other’ (therefore
removing rows where ‘gender’ is ‘other’).
• Row 39 contains a researcher comment, stating “Filtered out 4 rows”.
The comment in row 39 tells us that there are 4 people in the HESA sample who have ‘other’ gender.
As this count is below threshold, this comment is disclosive.
• Row 48 specifies the data that the researcher is working on, which is called ‘LFS_sample’.
• Row 49 counts the number of data subjects in ‘LFS_sample’ by the variables
‘employed_year’ and ‘ethnicity’.
• Row 50 groups the counts into a frequency table by ‘employed_year’.
• Rows 52-64 contain a researcher comment, which appears to be the frequency table created
by rows 48-50 copy-pasted into the code as a comment.
This particular example is not disclosive, as all of the counts in the frequency table exceed the
threshold. However, other similar instances of frequency tables in code may be disclosive,
depending on their counts and the threshold being used. Additionally, this output could only be
given Pre-Publication or Publication clearance, not Code clearance, as it contains SRS data. It is not
best practice for code files to include data. Therefore, we recommend that you save any data in a
separate log file instead of introducing data into your code as comments.
• Row 81 specifies the data that the researcher is working on, which is called ‘SIS’.
• Row 82 filters the data to select rows where the ‘comment’ variable is not ‘Please remove
my son James from this survey!!! Mrs. Sarah Chambers’ (therefore, removing rows where
‘comment’ is ‘Please remove my son James from this survey!!! Mrs. Sarah Chambers’).
The code in row 82 is overly specific – we know that this comment is present in the dataset and
therefore we gain information about a single data subject, including their name (James, probably
James Chambers), their relative’s name (Mrs. Sarah Chambers) and how that relative is related to
the data subject (James is her son). This is disclosive.