Safe Researcher Training (SRT)


1. The Five Safes

The main theoretical underpinning of the SRT is The Five Safes. The framework distributes data protection across five interconnected dimensions, recognising that no single mechanism can adequately protect against all disclosure risks. Each dimension addresses a different aspect of the data access pipeline:

  • Safe people: researchers must be appropriately vetted and trained.
  • Safe projects: research must have legitimate objectives and ethical approval.
  • Safe settings: the environment in which analysis occurs must be controlled and secure.
  • Safe data: data must be de-identified or aggregated to minimise re-identification risk.
  • Safe outputs: research results must be reviewed before release to prevent inadvertent disclosure.

The five dimensions need not be uniformly strong; they must collectively achieve an adequate level of overall safety. For open data, for example, high data safety can compensate for lower controls elsewhere, while for controlled access data, lower data safety may be acceptable where strong controls exist over people, projects, settings and outputs.

2. Statistical Disclosure Control (SDC)

This section covers the theory and practice of Statistical Disclosure Control (SDC) for researchers working with confidential data. SDC is the process of identifying and managing the risk that published research outputs might inadvertently reveal sensitive information about individuals. It therefore relates to the Safe outputs dimension of The Five Safes.

It is divided into two main parts:

  1. Basic SDC theory - using simple tables as examples
  2. Extending SDC to the research environment - beyond tables to graphs, regressions, and other outputs

2.1 What is SDC?

Statistical Disclosure Control means looking at data to try to identify possible risks, and then either removing the risky results or using statistical techniques to hide them. The fundamental process is:

  1. Look at the data
  2. Identify risk of re-identification
  3. Where a risk is found, remove the result or use statistical techniques to hide it

SDC is about being precautionary, but utility is important - the aim is to balance risk against the usefulness of the output, consistent with good research practice.

Note: There is a massive theoretical literature on SDC, most of which is irrelevant to researchers, because the basic idea is simple. Research outputs using confidential quantitative data are generally tabular summaries or regression outputs - the aim is to ensure these do not inadvertently reveal individual-level data.


2.2 The Example Dataset

To illustrate SDC principles, we use a small simulated survey of 150 patients taking part in an investigation into the relationship between diabetes, socio-economic variables, and the 'fragileX' gene (which is associated with diabetes).

Variable Description
id Random ID number
male Male y/n
age Age
white White y/n
fragilex Has fragileX gene
diabetic Diabetes diagnosed
education Highest qualifications
abc1 Socio-economic group
income Annual income from all sources (£)
i_quartile Income quartile 1 (lowest) to 4 (highest)
imputed_value Values were imputed y/n

Which variables might be sensitive? fragileX, diabetic, and income - you would not want someone knowing this about you.

Which might be used to identify someone? age, gender (male), white, education - these are identifying variables that help to pick out a respondent in the dataset.

Variables fall into two types:

  • Identifying variables: help to pick out a respondent in the dataset (age, gender, white, education)
  • Target variables: the things you would hope to find out about a person once you have identified them; in this case fragileX, diabetic, and income (also sensitive)

2.3 Why Small Numbers Are a Problem

2.3.1 Unique Observations (N=1)

Table: Existence of 'FragileX' Gene

Gender fragileX: No fragileX: Yes Total
Male 85 6 91
Female 58 1 59
Total 143 7 150

This table shows that there is just one female in the dataset with the fragileX gene. Any results that include both female gender and fragileX will refer to that one person. If you know who that person is, you might learn something about them even though the results don't explicitly split out gender and genes. If you were that person, would you be happy knowing your unique combination of variables had been made public?

2.3.2 Small Groups (N=2)

Table: Diabetes vs FragileX Gene

Diabetes diagnosed? fragileX: No fragileX: Yes Total
Yes 114 2 116
No 29 5 34
Total 143 7 150

This table shows there are two people with a diagnosis of diabetes and with the fragileX gene. If you were one of those two, when data was presented about the two of you, you could make inferences about the other because you know what information you provided. For example, if the average age of these two people is given as 62 and you know you gave your age as 54, it is easy to work out the other person must be 70. This is less sensitive than a unique value, because the disclosure is only made to the set of people in the group - but it is still a concern.

2.3.3 The "3 is the Magic Number" Rule

Scenario Average Income
N = 1 £40,000 - the person's exact income is revealed
N = 2 £35,000 - each person in the group can calculate the other's income
N = 3 £40,000 - no-one can say with certainty what anyone else earns (assuming no collusion)

When there are three or more people in a cell, no one can say with certainty what anyone else earns (assuming no collusion between respondents). This is the foundation of the minimum cell count threshold of 3 - and many organisations bump this threshold up to 10 to be extra safe.

From the notes: Assume you know the average salary of the people in the box is exactly £30,000. If your mate is in the box, everyone who knows him knows he earns £30,000. If there are two people, each one can calculate the other's income (e.g. if one knows his own income is £18,000, he can immediately tell the other must be earning £42,000). But when there are three or more people, no one can say with certainty what anyone else earns.
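
To make the arithmetic concrete, here is a minimal Python sketch of the inference, using the £30,000 / £18,000 figures from the notes (the function name is ours):

```python
# A minimal sketch of the inference above. The figures are the
# £30,000 / £18,000 example from the notes.

def infer_other(published_mean, group_size, my_value):
    group_total = published_mean * group_size
    remainder = group_total - my_value
    if group_size == 2:
        # The remainder IS the other person's exact value.
        return f"the other person's value is exactly £{remainder:,.0f}"
    # With three or more, the remainder is split among several people
    # in an unknown way - no certainty about any individual.
    return (f"£{remainder:,.0f} is split among {group_size - 1} "
            "people in an unknown way")

print(infer_other(30_000, 2, 18_000))  # exactly £42,000
print(infer_other(30_000, 3, 18_000))  # £72,000 split among 2 people
```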


2.4 Context-Sensitivity: Not All Small Numbers Are Problematic

Table: No. of Imputed Values in Dataset

Gender Imputed: No Imputed: Yes Total
Male 89 2 91
Female 58 1 59
Total 147 3 150

This table has small cells, but is it disclosive? What value do you gain by knowing that one of the female respondents had a value imputed? If anything, it emphasises that you shouldn't make any judgements about specific respondents based on the tables, because at least one observation has been modified.

Key lesson: SDC is context-sensitive, particularly under the principles-based approach. Knowing that one female respondent had a value imputed tells you nothing sensitive about her, and because the values are imputed rather than actual, the table is fine to take out of the environment. Small counts are not automatically a problem - always ask: what is being disclosed? (If sensitive information were attached to those imputed values, the answer would change, and the table would not be released.)


2.5 Class Disclosure

2.5.1 What is Class Disclosure?

Class disclosure occurs when publishing data reveals information about an entire group of people, rather than a specific individual. The zeros (or full cells) in a table are often the culprit: when you classify people into different groups, a zero or 100% cell tells you something definitive about everyone in that class.

Table: Income Distribution by Education Level

Highest Qualification Income Q1 (lowest) Q2 Q3 Q4 (highest) Total
Postgrad 1 1 8 18 28
Degree 2 6 14 17 39
College 8 18 16 3 45
School 13 9 0 0 22
None 13 3 0 0 16
Total 37 37 38 38 150

The zero cells tell us something definitive: no one with only school-level or no qualifications earns above the median income. If you know a respondent never went to college, you know with certainty they are earning below the median - that is class disclosure.

2.5.2 Class Disclosure is Highly Context-Sensitive

Consider these examples:

  • "All of the students aged 14+ said they had tried cannabis at least once" - Disclosive. Could effectively transform into a zero-count problem if turned into a table. Be more ambiguous in wording instead.
  • "No nurse in the survey earns over £30.50/hour" - May or may not be disclosive, depending on whether this is the maximum salary on the pay scale. Pay scales may be public information anyway.
  • "No-one in Shetland earns over £50,000/year" - Disclosive: this is a realistic salary threshold that signals something meaningful about the population (effectively suggesting 'poverty').
  • "No-one in Shetland earns over £5m/year" - Could be disclosive depending on the data, but generally less so - very few people earn this in any population.
  • "No-one in Shetland earns over £500m/year" - Not meaningfully disclosive: no one realistically makes that salary in a population, so it doesn't signal anything.

The point is that all three salary examples are formally class disclosures (all are 0% cells), but the practical harm varies with how informative the threshold is. Class disclosure is very context-sensitive - hard-and-fast rules are difficult to apply.

2.5.3 Structural Zeros

Not all zeros are disclosive. Structural zeros (or logical zeros) are values we would expect to be zero from the construction of the table.

Table: Education Level by Age Band

Highest Qualification 0-15 16-20 21-25 26-30 Total
Higher Education 0 0 165 148 313
Secondary Education 0 152 210 318 680
None 324 65 42 15 446

In this dataset, none of the respondents aged 20 or under has a degree, and none aged 15 or under has completed secondary education. These zeros are not disclosive: we would expect them, because in normal circumstances you cannot complete secondary education before 16 or finish a degree before 21. (If there were a child genius in the data, that would be an issue - an unexpected non-zero would stand out.)


2.6 Options for Fixing Disclosive Tables

When a table contains disclosive cells, researchers have several options. Using the income/education table as our example:

Option 1: Cell Suppression

Cell suppression means blanking out the offending cells (replacing them with - or a marker such as <3):

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad - - 8 18 26
Degree - 6 14 17 37
College 8 18 16 3 45
School 13 9 - - 22
None 13 3 - - 16
Total 34 36 38 38 146

Or using detail reduction (replacing with <3 to indicate some data exists, just not enough to show):

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad <3 <3 8 18 26
Degree <3 6 14 17 37
... ... ... ... ... ...

Important: Remove totals or recalculate them after suppression - if you leave the original totals unchanged, missing values can be recovered by subtraction. Always calculate totals after SDC cleaning, not before.

For example, suppose we suppress the low cells but keep the original row and column totals:

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad - - 8 18 28
Degree - 6 14 17 39
College 8 18 16 3 45
School 13 9 - - 22
None 13 3 - - 16
Total 37 37 38 38 150

An attacker can recover every suppressed value:

  1. Degree row: 39 - 6 - 14 - 17 = 2, so Degree Q1 = 2.
  2. Q1 column: 37 - 8 - 13 - 13 = 3, and we now know Degree Q1 = 2, so Postgrad Q1 = 3 - 2 = 1.
  3. Postgrad row: 28 - 8 - 18 = 2, and we now know Postgrad Q1 = 1, so Postgrad Q2 = 1.
  4. Similarly, School and None rows give School Q3 + Q4 = 0 and None Q3 + Q4 = 0, recovering the zeros.

The suppression has achieved nothing because the original totals act as simultaneous equations that can be solved.
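
The same recovery can be scripted in a few lines. This Python sketch just replays the subtraction steps above against the published values:

```python
# Replaying the recovery: suppressed cells fall out of the published
# row and column totals by simple subtraction.

# Published (unsuppressed) values from the income/education table.
degree_row_total, postgrad_row_total = 39, 28
school_row_total, none_row_total = 22, 16
q1_column_total = 37

# Step 1: the Degree row has only one suppressed cell.
degree_q1 = degree_row_total - (6 + 14 + 17)                 # = 2
# Step 2: the Q1 column now has only one unknown left.
postgrad_q1 = q1_column_total - (8 + 13 + 13) - degree_q1    # = 1
# Step 3: the Postgrad row now has only one unknown left.
postgrad_q2 = postgrad_row_total - (8 + 18) - postgrad_q1    # = 1
# Step 4: the School and None rows sum to zero over Q3/Q4, so
# (with non-negative counts) all four suppressed cells must be 0.
school_q3_plus_q4 = school_row_total - (13 + 9)              # = 0
none_q3_plus_q4 = none_row_total - (13 + 3)                  # = 0

print(degree_q1, postgrad_q1, postgrad_q2)                   # 2 1 1
```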

One thing to be careful of is consistency across tables: if you suppress a low cell in one table, are you also suppressing that information in other tables? Can the missing value be recovered by differencing across tables? This is a significant potential problem - even groups with teams of checkers have made this mistake in official publications.

Option 2: Rounding

Round all cell values to the nearest 5 (or 10):

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad 0 0 10 20 30
Degree 0 5 15 15 40
College 10 20 15 5 45
School 15 10 0 0 20
None 15 5 0 0 15
Total 35 35 40 40 150

Note that the rounded totals no longer equal the sums of the rounded cells (the Q1 column, for example, sums to 40 but its rounded total is 35) - this is a common side effect of rounding. Controlled rounding, which aims to minimise the effect on totals, is a specialist area you could seek advice on if presenting many similar tables.

You can also make the data into something less disclosive: ratios, growth rates, proportions (but limit decimal places), etc.
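
A quick Python sketch of nearest-5 rounding on the original table shows how the rounded cells and the rounded totals drift apart (the helper name is ours):

```python
# Nearest-5 rounding applied to the original income/education table.
# Rounded cell sums and rounded totals need not agree.

table = {
    "Postgrad": [1, 1, 8, 18],
    "Degree":   [2, 6, 14, 17],
    "College":  [8, 18, 16, 3],
    "School":   [13, 9, 0, 0],
    "None":     [13, 3, 0, 0],
}

def round5(x):
    return 5 * round(x / 5)

for qual, cells in table.items():
    rounded = [round5(c) for c in cells]
    print(qual, rounded,
          "sum of rounded cells:", sum(rounded),
          "rounded row total:", round5(sum(cells)))
# e.g. Degree -> [0, 5, 15, 15]: the cells sum to 35,
# but the rounded row total is 40
```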

Option 3: Redesign the Output (Recommended)

Merge categories so that no cell has fewer than the threshold:

Highest Qualification Q1 Q2 Q3 Q4 Total
Degree+ (Postgrad + Degree) 3 7 22 35 67
College 8 18 16 3 45
School 13 9 0 0 22
None 13 3 0 0 16
Total 37 37 38 38 150

Why we recommend this option:

  • Retains accuracy but not precision (whereas other methods retain precision but not accuracy)
  • Once you have redesigned your categories, you are likely to continue using those categories across all your tables, giving a coherent analytical picture
  • Focuses attention on the data itself, not just on solving the problem of a specific table
  • Note: in some cases (like this one with the zero cells for School/None), redesign alone doesn't solve all problems

Choosing the Right Option

  • The best option depends on the output - not all approaches will work all of the time
  • It depends on the message you want to present
  • You know what's important - you decide which SDC methods to use
  • The user support team can advise if needed, but cannot and will not make decisions for you

2.7 Primary Disclosure: Dominance

2.7.1 The Dominance Rule

Large numbers of observations are not always sufficient protection if one or two units contribute the bulk of the data. This is the dominance problem.

The dominance rule states that disclosure risk exists when:

  • The largest unit is more than 43.75% of the total, OR
  • The combined value of all units except the largest two is less than 12.5% of the largest unit

Both conditions are tested on raw contributor values - the individual values that sum to produce the published aggregate. A suspicious computed statistic (such as a mean far above its peers) is a flag to investigate, but the dominance check itself requires the individual contributor values.

Rule 1 example:

Sector N firms Total turnover (£000s)
Agriculture 18 7,560
Manufacturing 22 25,960
Retail 31 19,840
Utilities 14 118,300
Construction 25 12,750

The Utilities total is an order of magnitude above every other sector. Examining the underlying contributor data for Utilities:

Firm rank Turnover (£000s)
1 65,000
2 11,000
3 8,500
4 6,200
5 5,100
6 4,300
7 3,800
8 3,200
9 2,900
10 2,400
11 2,100
12 1,800
13 1,200
14 800
Total 118,300

Threshold: £118,300k × 0.4375 = £51,756k. Firm 1 (£65,000k) accounts for 65,000 / 118,300 ≈ 55% — Rule 1 is triggered.

Rule 2 example:

Sub-code N firms Largest firm (£m) 2nd largest (£m) Sum of remaining (£m) Total (£m)
21.10 16 340 180 32 552
21.20 19 95 82 410 587

Sub-code 21.10 fails Rule 1 immediately: 340/552 ≈ 61.6% > 43.75%. It also fails Rule 2: the firms outside the largest two contribute only £32m between them, well under 12.5% of the largest firm (£340m × 0.125 = £42.5m). The risk is concrete: the second-largest firm can subtract its own £180m from the published £552m to get £372m, and because the remaining firms account for at most £32m of that, it can place the largest firm's spend between £340m and £372m - an estimate within about 10% of the true value.

Sub-code 21.20 passes Rule 1 (95/587 ≈ 16%), so Rule 2 must be checked against 12.5% of the largest firm (£95m × 0.125 = £11.9m). The underlying contributor values:

Firm rank R&D (£m)
1 95
2 82
3 75
4 60
5 48
6 38
7 30
8 26
9 22
10 18
11 15
12 13
13 12
14 12
15 11
16 9
17 8
18 7
19 6

The firms ranked 3rd and below sum to £410m, far above the £11.9m threshold, so Rule 2 is not triggered. Even a researcher at firm #3 who subtracts their own spend from the published total (£587m − £75m = £512m) is left with 18 unknown contributions, and cannot closely estimate what any other firm spent.
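
Both checks are mechanical enough to script once you have the raw contributor values. A hedged Python sketch (the function name is ours), applied to the Utilities turnover figures from the Rule 1 example:

```python
# The two dominance checks as stated above, applied to raw
# contributor values (the Utilities turnover figures, in £000s).

def dominance_risk(values, rule1_share=0.4375, rule2_p=0.125):
    x = sorted(values, reverse=True)
    total = sum(x)
    rule1 = x[0] > rule1_share * total   # largest unit > 43.75% of total
    rule2 = sum(x[2:]) < rule2_p * x[0]  # rest sum < 12.5% of largest
    return rule1, rule2

utilities = [65000, 11000, 8500, 6200, 5100, 4300, 3800,
             3200, 2900, 2400, 2100, 1800, 1200, 800]
print(dominance_risk(utilities))  # (True, False): Rule 1 is triggered
```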

2.7.2 Why Dominance Creates Risk

The dominance rule exists because aggregates don't protect individuals when one (or two) contributors dominate the total. Computed statistics such as the mean, sum, or total therefore become essentially a proxy for that person's value, and the other contributors are just noise.

Working backwards from the mean:

Table: Income by Company and Qualification (Dominance Example)

Highest Qualification Company 1: Employee count Company 1: Mean income Company 2: Employee count Company 2: Mean income Overall Mean income
Degree 12 £92,412 30 £54,124 £65,063
College 26 £29,006 24 £28,614 £28,818
School/None 42 £18,332 33 £19,148 £18,691

If you know the mean income and the number of people in a group, you can calculate the group total:

Mean × Count = Total

For example, for Company 1 degree-holding employees, where mean income is £92,412 and employee count is 12: £92,412 x 12 = £1,108,944 (the total income for that group).

Now imagine you work at that company and hold a degree. You know your own income, and roughly who else is in the group (only 12 people). You can subtract known incomes from the total to narrow down - and when one person contributes ~44% of the total, that individual's income is barely disguised by the average. With only 12 people and one value that dwarfs the rest, it doesn't take much insider knowledge to effectively uncover that individual's exact salary.

Here you are doing indirectly what could be done directly with the underlying data: a suspiciously high mean is the symptom that prompts investigation of a dominance cause. For example, you might examine "Degree / Company 1" and find one underlying value is £485,322 - one individual accounting for approximately 44% of the group total. This exceeds the 43.75% dominance threshold.

Knowing the mean and median, one can mathematically calculate or closely estimate the size of any outlying values.
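
A short Python sketch of that calculation, using the Company 1 / Degree figures above:

```python
# The checker's calculation: recover the group total from the
# published mean, then test the largest underlying value against
# the 43.75% dominance threshold.

mean_income, count = 92_412, 12
group_total = mean_income * count        # £1,108,944

largest_value = 485_322                  # found in the underlying data
share = largest_value / group_total

print(f"group total: £{group_total:,}")
print(f"largest contributor: {share:.1%} of the total")  # 43.8%
print("dominance threshold breached:", share > 0.4375)   # True
```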

2.7.3 Dominance: The Aggregation Fix

One remedy for dominance is aggregation: combining groups so that the small dominant subgroup gets absorbed into a larger, more representative pool.

For example, if University 2 has only 10 managers at £180,420 average pay, combining University 1 (45 managers) with University 2 (10 managers) dilutes the influence of that extreme group. The 10 highly-paid managers are now part of a pool of 55 managers, and their extreme mean is tempered by the 45 managers from University 1.

"Diluting" the mean by combining groups: harder to reverse-engineer any outliers.

Benefits of this approach:

  • Increases sample sizes across all categories, making means more robust
  • Eliminates cells dominated by tiny subgroups
  • Harder for anyone to reverse-engineer outlier values from the published aggregate

2.7.4 Dealing with Dominance in Practice

The same techniques as for frequency tables can be used: redesign, suppress, round, etc.

However, dominance is:

  • Hard to check for and demonstrate - it is not as visually obvious as a small cell count
  • Very rare - unless you have a tiny number of observations or an incredibly odd outlier

The best protection is lots of observations - don't produce small cells. Also, be aware of your data: are there any egregious outliers? If so, why are you putting them in with other variables? You may be misrepresenting the data in any case. Once again, good statistics is entirely consistent with good SDC.


2.8 Primary Disclosure: Ranks, Maxima, and Minima

Putting things into ranks (quartiles, medians, maxima and minima) means putting things into cells.

Statistic Income Age
Minimum £8,351 50
Maximum £385,604 70
Mean £34,353 60
Median £11,446 59
N = 150

Key considerations:

  • Maxima and minima: Not always problematic, but assume they are until checked. Min and max could refer to individual people; if so, group and take averages instead.
  • Medians: Can refer to an individual, but unlikely for large groups. For our dataset with 150 observations, the median is fine. However, be careful: the class disclosure we saw earlier showed that some people are definitely below the median income, which combined with the median value is worrying.
  • Ranks: Knowing someone's rank can reveal information without knowing their exact value. This is another form of class disclosure.
  • Percentiles: In a dataset with only 150 observations, the 1st and 99th percentiles (which Stata's sum, detail command will show) will only have one or two people in them.

Income is the variable in this dataset that would cause problems in terms of publishing maxima and minima. For age, the max and min are at the extremes of possible values and the mean/median are uninformative - no problem there.


2.9 Secondary Disclosure: Disclosure by Differencing

This is the biggest problem in SDC, and it has no complete solution.

2.9.1 What is Secondary Disclosure?

Secondary disclosure occurs when a value becomes disclosive not because of what it says on its own, but because of its relationship to other published values. The classic form is disclosure by differencing: subtracting one table from another to reveal a protected cell.

Example:

Age bands Working class Middle class Total
50-54 21 11 32
55-59 25 11 36
60-64 28 12 40
65+ 31 11 42
Total 105 45 150

(All persons)

Age bands Working class Middle class Total
50-54 17 7 24
55-59 19 9 28
60-64 23 8 31
65+ 23 10 33
Total 82 34 116

(Non-diabetics only)

Each table is fine on its own. However, the 65+ row in the "all persons" table shows 42 people, while the "non-diabetics" subset shows only 33. Subtracting: 42 - 33 = 9 diabetics in the 65+ age group. If you produce a table for all persons, it's impossible to prove that no other table has been or will be produced that could breach confidentiality through disclosure by differencing. All we can do is be aware of the problem and try to avoid making it likely.
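
The attack itself is trivial to express - a Python sketch using the row totals from the two tables:

```python
# Disclosure by differencing: each table is safe on its own, but
# subtracting the row totals reveals the protected counts.

all_persons   = {"50-54": 32, "55-59": 36, "60-64": 40, "65+": 42}
non_diabetics = {"50-54": 24, "55-59": 28, "60-64": 31, "65+": 33}

diabetics = {band: all_persons[band] - non_diabetics[band]
             for band in all_persons}
print(diabetics)   # {'50-54': 8, '55-59': 8, '60-64': 9, '65+': 9}
```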

2.9.2 Secondary Suppression

When one cell is suppressed, sometimes other cells must also be suppressed to prevent the suppressed value from being calculated by subtraction. This is secondary suppression. SDC literature normally recommends this as the preferred solution, because it preserves the original totals, which matters most in government statistics.

For example, in a table of employment by sector and geography, if Yorkshire & the Humber has a suppressed value for Air Transport, and the row total and other values are known, you might be able to calculate the suppressed value. A second suppression elsewhere in the same row (e.g. also suppressing Warehousing) prevents this.

Recall our income/education example from Section 2.6, with cells below 3 suppressed but original totals kept:

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad - - 8 18 28
Degree - 6 14 17 39
College 8 18 16 3 45
School 13 9 - - 22
None 13 3 - - 16
Total 37 37 38 38 150

The Degree row has only one suppressed cell (Q1), so it can be recovered from the row total: 39 - 6 - 14 - 17 = 2. Applying secondary suppression to the Degree/Q2 cell (the 6) ensures every row and column has at least two unknowns, blocking straightforward recovery. The 6 is a natural choice here because it is the next smallest visible value in the Degree row, so suppressing it loses the least information. (Note one remaining weakness: the School and None rows each sum to zero across Q3 and Q4, so with non-negative counts the suppressed zeros are still recoverable - the class disclosure needs a different fix, such as merging rows.) In this larger table, only one extra cell needs suppressing. For the 2x2 tables in Sections 2.3.1 and 2.3.2, however, you would need to blank out almost every cell, at which point the table communicates nothing.

Choosing which cells to suppress secondarily is itself non-trivial: the minimum number needed to block recovery, the cells with the fewest observations, or the cells with least "importance" by some other criterion (e.g. economically insignificant firms in a business dataset). This is a specialist area, and researchers will often find they lose substantial information, which is why table redesign (Section 2.6, Option 3) is generally more practical.
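
A rough Python helper for the pattern check described above - it tests a necessary condition (no row or column with exactly one unknown), not a sufficient one, as the zero rows show:

```python
# After suppression, does any row or column still contain exactly
# one unknown? If so, that cell is recoverable from its total.

S = None  # suppressed cell marker

table = [
    [S,  S,  8, 18],   # Postgrad
    [S,  S, 14, 17],   # Degree (Q2 suppressed secondarily)
    [8, 18, 16,  3],   # College
    [13, 9,  S,  S],   # School
    [13, 3,  S,  S],   # None
]

def no_lone_unknowns(t):
    rows_ok = all(row.count(S) != 1 for row in t)
    cols_ok = all(col.count(S) != 1 for col in zip(*t))
    return rows_ok and cols_ok

print(no_lone_unknowns(table))  # True: no cell falls to one subtraction
```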


2.10 SDC and Statistical Quality

There is ideally no conflict between SDC and good research. The things to avoid for SDC purposes are the same things to avoid for good statistical analysis:

Things to avoid for SDC Also bad for research quality
Small numbers Low statistical power, unreliable estimates
Dominant observations / huge outliers Skew means, misrepresent the data
Very skewed distributions Mean is uninformative

Good SDC is consistent with good research output. The exception is in class disclosure and ranking, where being able to say something about a whole group of data subjects might be analytically valuable but also problematic from a disclosure perspective.

Remember: SDC applies at the point at which statistical results are going to be released - not when you're playing around with the data. You can explore your data freely; it's only the outputs you intend to publish or share that need SDC treatment.


2.11 Moving Beyond Tables: SDC in the Research Environment

So far we have focused on tables, because they are the most intuitive example. But researchers produce a wide range of outputs. The question is: do the threshold and dominance rules described above apply to them?

Consider the types of statistics researchers commonly produce:

  • Odds ratios, regression coefficients, residual plots, contiguity maps, growth rates
  • Graphs and charts
  • Scatter plots, box plots
  • Maps

For each of the stat types covered in the remaining sections, there are two questions worth asking: how do you spot a potential SDC issue, and what can you do about it? To illustrate this structure, consider two stat types already encountered.

Descriptive statistics

How to spot SDC:

  • Multi-way cross-tabulations raise attribution risk significantly
  • Median, min, and max values often represent a single observation
  • Relative frequencies (percentages) are problematic when totals are also shown
  • All cell counts must meet the threshold N

What to do about it:

  • Band or combine columns and rows
  • Average values across a small group to meet the threshold
  • Round values
  • Suppress cells; also suppress a second cell in the same row or column to prevent the original value being deduced from totals

Percentiles

How to spot SDC:

  • Unrounded values may represent the exact value for a single individual
  • At extreme percentiles (e.g. 1st, 99th) the underlying group may contain only one person
  • The median can refer to a single observation in a small dataset
  • Check that the count underlying each percentile meets the threshold N

What to do about it:

  • Round all percentile values (including the median) to the nearest hundred or thousand
  • Aggregate values into broad categories (e.g. income below £12,000)
  • Present the inter-quartile range rather than individual percentile points

2.12 Low Review vs. High Review Statistics

The central organising framework for SDC in the research environment is the distinction between Low Review Statistics (LRS) and High Review Statistics (HRS).

Low Review Statistics (LRS) High Review Statistics (HRS)
Disclosure risk Inherently low Inherently high
Action Publish after administrative checks Publish only once specific values checked
Examples Regression coefficients, modelled aggregates, non-linear combinations (e.g. estimated odds ratios, survival functions) Frequency tables, individual data points, linear combinations, calculated odds ratios or risk ratios, percentiles

The lion and rabbit metaphor: Think of managing a zoo with two kinds of animals - lions and rabbits. You have limited time. An angry rabbit can give you a nasty nip; a well-fed sleepy lion can be tickled behind the ears. But in general, you should spend your time watching the lions. The lions are the HRSs; the rabbits are the LRSs.

Why this distinction works:

Many outputs have no meaningful disclosure risk because of their functional form - that is, irrespective of the data used to generate them, there is no realistic way for anyone to unpack the statistic to find confidential information. An example is linear regression coefficients. We call these LRSs because we do not really need to check them in detail.

In contrast, some statistics (such as tables) have lots of potential for disclosure risk. We only publish these after ensuring they are non-disclosive in the specific case of that output.

In practice:

  • HRSs must be checked for negligible disclosure risk in that particular instance, SDC applied if necessary, and checked again before release
  • LRSs should just be released - we look at the type of output and say "yes, ready to go" (with some administrative checks)

What makes something an HRS vs LRS?

Low Review Statistics High Review Statistics
Modelled aggregates (e.g. coefficient estimates) Individual data points (e.g. regression residuals)
Non-linear combinations of the data (e.g. estimated odds ratios or survival functions) Linear combinations (e.g. tables, percentiles, calculated odds ratios or risk ratios)

2.13 Regression Output

A regression is a model for estimating a statistical relationship between two or more variables.

Example regression output:

Variable Estimate Standard Error Sig.
Intercept 0.372 0.003 0.00
Female -0.121 0.002 0.00
Age 16-29 (ref: 45-59) -0.465 0.003 0.17
Age 30-44 -0.181 0.003 0.00
Age 60-69 -0.055 0.002 0.00
Marital Status (Single) 0.593 0.001 0.00
Marital Status (Married/cohab, living apart) 0.845 0.003 0.00
Marital Status (Div/Wid) 1.032 0.002 0.00
N = 2,356; d.f. = 2,348

How to spot SDC:

  • Any category has fewer than N observations
  • Regression run on a single unit
  • Sequential regressions differ in cohort size by fewer than N observations, with significant parameter differences
  • Regression consists solely of categorical variables
  • Degrees of freedom below N, or N too small for the model to be genuine rather than effectively a table

What to do about it:

  • Check degrees of freedom are at least N
  • Check sequential regressions do not differ in observation counts by fewer than N
  • Ensure sufficient observations within and between model estimations
  • In practice, disclosure via regression is very rarely an issue
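
As a sketch only - assuming the regressors sit in a pandas DataFrame, with an illustrative threshold of 10 and function names of our own - the checks above might be scripted like this:

```python
# Illustrative pre-release checks for a regression output.
import pandas as pd

THRESHOLD = 10  # the environment's rule-of-thumb N

def regression_release_checks(X: pd.DataFrame, n_obs: int, n_params: int):
    issues = []
    # Degrees of freedom should be at least the threshold.
    if n_obs - n_params < THRESHOLD:
        issues.append("degrees of freedom below threshold")
    # Every categorical level should rest on enough observations.
    for col in X.select_dtypes(include="object"):
        small = X[col].value_counts()
        small = small[small < THRESHOLD]
        if not small.empty:
            issues.append(f"{col}: small categories {list(small.index)}")
    return issues or ["no issues flagged"]
```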

2.14 Residuals

Regression residuals are High Review Statistics because they are individual data points - a researcher could find themselves on a residual plot.

Residual plots should be treated with caution and may require SDC treatment before release.

How to spot SDC:

  • Each residual represents a single observation, breaching the threshold rule by default
  • Outliers are especially identifiable
  • If observations can be ordered along the x-axis, residuals can be attributed to specific individuals
  • Multiple residual plots from the same model make outliers easier to isolate

What to do about it:

  • Describe the shape or conclusion of the plot rather than releasing it
  • If a plot is needed: remove axis scales
  • If a plot is needed: use an x-axis variable difficult to observe outside the dataset (e.g. one generated during analysis)

2.15 Graphs and Charts

2.15.1 Line/Area Charts

Graphs are often produced without showing the underlying data. However, each data point on a graph must pass the same SDC checks as a cell in a table.

Key issue: Graphs are often missing the number of observations or underlying counts. Before exporting a graph from the research environment, you must check the data points.

Underlying counts example:

2011 2012 2013 2014 2015 2016
Primary 69 89 88 76 170 157
Manufacturing 2,764 2,149 1,570 1,756 3,863 3,850
Construction 395 377 480 418 382 410
Wholesale and Retail Trade 209 319 487 494 1,314 1,301
Transport and Communications 9* 8* 6* 35 84 111
Financial Intermediation 987 973 1,223 1,182 2,198 2,364
Other 83 97 137 142 409 398

*Below threshold - needs suppression

How to spot SDC:

  • Each data point is subject to the same checks as a cell in a frequency table
  • Low cell counts in the tails of distributions
  • Min/max values visible on axis scales
  • No supporting frequency table submitted (thresholds cannot be verified)
  • Some software stores raw data behind a graph

What to do about it:

  • Check the underlying frequency table before exporting
  • Band or aggregate categories where counts fall below threshold
  • Suppress low-count points; remove or cap tail values
  • Release as a fixed image (e.g. PNG or JPEG)

2.15.2 Scatter Plots

Scatter plots of individual data points are High Review Statistics - they may directly reveal individual observations. Require a separate frequency table to confirm adequate cell sizes.

How to spot SDC:

  • Each point typically represents one data subject, breaching the threshold rule by default
  • Individual attributes readable directly from the graph
  • Variables in original, untransformed form increase attribution risk

What to do about it:

  • Group data subjects so each plotted value represents at least N observations
  • Generate a heat map with regular bins; include only bins with count above N
  • Transform the underlying variables before plotting
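
One possible implementation of the heat-map suggestion, sketched with NumPy (names and the threshold of 10 are illustrative):

```python
# Bin two variables on a regular grid and suppress any bin whose
# count is positive but below the threshold.
import numpy as np

THRESHOLD = 10

def safe_heatmap(x, y, bins=10):
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Replace small non-zero bins with NaN so they plot as blank.
    masked = np.where((counts > 0) & (counts < THRESHOLD),
                      np.nan, counts)
    return masked, xedges, yedges
```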

2.15.3 Box Plots

Box plots display: minimum, maximum, 25th percentile, median, and 75th percentile. They may also show outliers. The minimum and maximum could refer to individual people; outliers shown as individual points are especially problematic. Solutions include grouping or averaging.

How to spot SDC:

  • Long-tailed data: outliers shown as individual points relate to individual data subjects
  • The 25th percentile, 75th percentile, and median could each refer to a single observation
  • Normally distributed data: whisker ends (min and max) may also relate to single observations

What to do about it:

  • Long-tailed data: group or average outliers
  • Normally distributed data: band min and max into a range (e.g. £7 to £8 per hour) meeting the threshold
  • Release only if summary values are demonstrably not attributable to single individuals

2.15.4 Maps

Maps can be problematic if they show exact point locations or small counts. Use a heatmap instead, so that small numbers of observations are not exposed as specific points. Specific point-location data is only safe to show if it is based on publicly available data.

How to spot SDC:

  • Each dot may represent a single observation
  • Observations with unusual characteristics (e.g. rare condition, specialist facility) are especially identifiable
  • Even imprecise positions can be pinpointed by geographic software
  • Combining location with other variables (age, gender, ethnicity) raises reidentification risk
  • No data point should represent fewer than N observations

What to do about it:

  • Convert to a choropleth map (colour indicates activity levels across areas, not individual locations)
  • Review any scale shown alongside the map, which could itself reveal small counts

2.16 Odds Ratios and Risk Ratios

  • Calculated odds/risk ratios (from a 2x2 table) - treat as a table - HRS
  • Estimated/modelled odds ratios (from logistic regression) - LRS

2.17 Code Files

2.17.1 Safe Code Files

Code files (scripts) can generally be cleared from the research environment, provided:

  • They contain no data (data import, data cleaning, and visualisation of changes are all fine)
  • Any hard-coded data has been removed

It is okay to leave clear code files in the Trusted Research Environment (TRE).

2.17.2 Problematic Code Files

A very common problem: researchers often insert frequency tables or record-level data in code files as comments. Hard-coded data values in a script are disclosive. Do not include record-level data in code files.


Summary: Key Principles

  1. SDC is context-sensitive - small numbers are not automatically a problem; ask what is being disclosed
  2. Minimum cell count = 3 (many organisations use 10 to be extra safe)
  3. Structural zeros are not disclosive - expected zeros are fine
  4. Redesign is the best fix - retains accuracy, produces consistent analysis, addresses root cause
  5. Always calculate totals after SDC cleaning, not before
  6. Secondary disclosure (differencing) has no complete solution - be aware and minimise the risk
  7. Good SDC = good statistics - small cells are bad for both confidentiality and reliability
  8. Use the LRS/HRS framework - concentrate checking effort on High Review Statistics; Low Review Statistics can be cleared administratively
  9. Dominance is rare - best protection is lots of observations; know your data
  10. SDC applies at the point of release - not during analysis; explore freely, check before you publish

3. Rules-Based vs Principles-Based SDC

Statistical Disclosure Control (SDC) in a Trusted Research Environment is not just a statistical discipline but an operational one. So far the principles of SDC have been covered; this section is about how those principles get translated into output clearance that is quick, safe, and doesn't cripple research. The central question is a choice of philosophy: do we govern output release through rules, or through principles?

The framing is deliberately practical. SDC, done naïvely, is a time-consuming process: researchers wait around for results, and support teams wade through piles of output. The two sides also bring different expertise: support staff are not usually as statistically deep as the researchers they serve, and researchers are not always attuned to disclosure risk. In some environments (typically Secure Use Files) there is no support team at all, and researchers self-review. Whatever the setup, the process has to work efficiently, or it doesn't work at all.

3.1 Models of output clearance

Everyone (researchers, support teams, data providers) wants the same three things from output clearance: speed, safety, and no unnecessary restriction on research. These pull against each other, and no clearance process gets all three for free. Splitting outputs into low-review and high-review items helps triage the easy cases, but it doesn't settle the hard question of what to do with the high-review ones. Should we apply hard rules, or soft ones? Two broad models sit on either side of that line: rules-based output SDC (RBOSDC) and principles-based output SDC (PBOSDC).

3.2 Rules-based output SDC

RBOSDC is the intuitive option. You define specific rules for what outputs are allowed, and you stick to them. The rules are, to some degree, arbitrary, but they have been developed over time and represent a reasonable compromise that both researchers and data holders can live with. The rules don't cover everything, and the standard advice is: if in doubt, talk to the support team.

Its appeal is obvious. Rules are simple, transparent, and in principle automatable. They can be written down and handed to a researcher on day one, and they comfort data providers, who like a clean answer. They are used at places like the Eurostat Safe Centre, NILS-RSU, and ADRC-NI. The line here is blunt: don't push the rules; if you hit a problem, work with the support team.

3.2.1 Why strict rules break down in research

The problem is that strict RBOSDC rarely survives contact with a real research environment. About half of SecUF facilities claim to operate this way, but applying it strictly outside fully automated job-processing systems is very difficult. The reason is fundamental: you can't specify, in advance, all the cases a genuine research project will throw at you. Every rule is trying to serve two masters (efficiency and safety), and sooner or later one of them will lose.

A "no exceptions" rule either draws justified complaints from researchers that it is needlessly restrictive, or (the flipside data holders tend to forget) is sometimes too loose for the actual risk in a particular table. What happens in practice is that organisations claiming to be rules-based quietly become "we set and follow rules, apart from the times when we don't." That hybrid is arguably worse than being openly principles-based: expectations aren't made explicit, and the system becomes more open to favouritism.

3.3 Principles-based output SDC

PBOSDC is the alternative, and it is the model used by most TREs (UKDS, the HMRC Datalab, ONS, and the ADRN outside NI). The move is to keep the operational savings of a simple yes/no system, but build in flexibility openly rather than covertly. It starts, like RBOSDC, by defining thresholds, but calls them rules of thumb, because it is going to be explicit about when they bend. In principle, any output is allowed. Because the rules are known to be flexible, they can be set more restrictively on paper than a typical RBOSDC limit: the rules of thumb focus on safety, and the flexibility handles the edge cases.

Support staff and researchers are both allowed to argue that a rule of thumb shouldn't apply in a specific case. A support team member might argue that class disclosure means a particular table can't be released no matter how large the counts are. A researcher might argue that a maximum value should be released because it's informative about the data but can't be informative about any individual respondent. Both are legitimate PBOSDC conversations.

The critical discipline is around when those conversations happen: not very often, only when the output is genuinely important, and only when it is non-disclosive. If every output becomes a negotiation, the process collapses under its own weight. This is why training matters: everyone needs to understand what "not very often" and "important" actually mean in this environment. And it is why trust matters: frustrated researchers and overburdened support teams do not produce good confidentiality practice.

The practical edge of this comes through clearly in the worked example the course uses: a cross-tab with cells of 7, 8 and 9 against a threshold of 10. The numbers are almost certainly non-disclosive, but PBOSDC is an operational and ad hoc criterion: non-disclosiveness alone is not sufficient reason to override the rule of thumb. The output has to be important too. The line has to be drawn somewhere, and it has been drawn at 10. If it weren't, researchers would chip away at every limit, and the efficiency that justified the whole system would disappear.

3.3.1 PBOSDC and the research community

PBOSDC is, in the end, a community process: it only works when everyone works together. Each environment and each data provider will have its own rules of thumb (the specifics come with the data), and if in doubt, the answer is always to check with the support team.

That community framing also cashes out in what "good output" actually looks like in a PBOSDC world. It is hard to specify exactly, but the test is to put yourself in the checker's position: clear labelling, visible frequencies, explanations where needed, no ambiguous axes or unexplained variables. A poorly presented output wastes both sides' time on basic questions, and every minute of that friction erodes the trust the model depends on.

3.3.2 PBOSDC in summary

PBOSDC, together with the low-review / high-review split, was designed specifically for research environments, and is aligned with how research actually gets done: flexible where it matters, restrictive where safety requires. The researcher's job is to learn what the support team expects, and to educate them in return where the research context demands it. Where there is no support team, the researcher takes on the self-review role directly. The most common mistake is producing tables with small cells while exploring the data; everyone makes those mistakes, and what matters is how they get handled.

The two models differ less in their rules than in how they handle the cases the rules don't fit. RBOSDC refuses the edge case; PBOSDC negotiates it, but only when the output is both important and non-disclosive, and only rarely enough that the process stays efficient. When the community holds that line, everyone benefits: researchers get results faster, support teams clear outputs, and data providers keep their confidence in confidentiality intact.

4. Safe People

We looked at Safe outputs in the form of SDC in the previous sections. We will now look at Safe people as the next most important component of The Five Safes when considering releasing data in an electronic environment.

4.1 Why do attitudes & perceptions matter?

The way data providers perceive people directly shapes how much data they are willing to release. For example, a provider who believes researchers mean well but occasionally slip up will invest in training and design flexible access; a provider who does not trust researchers at all will fall back on Public Use Files with much of the useful detail stripped out. Safe Researcher Training takes the middle position: users can be trusted, but need training, and their trustworthiness must be visible enough for data holders to extend that trust.

That visibility is earned through behaviour, which flows from attitudes (the Knowledge-Attitude-Behaviour framing applies, though attitudes are the lever that matters most). The typical failure mode is an attitude failure dressed up as an accident: a researcher who copies confidential files onto a USB, say, never meaningfully engaging with the risk. Exercises asking trainees to pick "safe" colleagues from a shortlist surface the same point: good and safe are not synonyms, obvious traits can mislead, and personal prejudice often does more of the sorting than the evidence warrants. None of this sits in a vacuum, since peers, seniority and power dynamics constantly reshape behaviour, which is why positive peer influence is one of the most effective safeguards available.

4.2 The research community

Safe people operate inside a community with four main actors (data providers, researchers, research users, and the support team), plus a fifth group too easily forgotten: those from whom the data was originally collected, whose interests underpin the whole arrangement. The parties' goals genuinely diverge. Data providers are "default closed", starting from no access and asking of every request whether something will go wrong and at what cost. Researchers and research users are "default open", assuming data should flow unless a specific restriction can be justified; researchers in particular tend to overestimate their trustworthiness and underweight the organisational work needed to make data available. Research users want digestible output on tight timetables and have little patience for confidentiality-driven delay.

Balancing these interests is the support team's core job. Sitting between providers and researchers, they carry provider concerns back to researchers and researcher needs (and the social value of the work) back to providers, and their accumulated experience unlocks future flexibility, since most data access happens because similar access has happened elsewhere, for a long time, without problems. Ignore provider concerns and their trust evaporates; address them credibly and providers can be persuaded to be remarkably open.

4.3 Summary

Attitudes shape behaviour, behaviour shapes perceived trustworthiness, and trusted researchers get systems designed around them rather than locked down against them. Not every positive trait is a "safe" one, peers exert more influence than the support team ever can, and the support team itself is an ally (the right place to take queries and suggestions), since co-operation across the community keeps data flowing to the research that needs it.

5. Breach of procedure vs. breach of confidentiality

We now know two key components of keeping data safe, but what happens when things still go wrong? When a researcher mishandles data in a Trusted Research Environment, the incident falls into one of two categories. A breach of procedure (BoP) is breaking the rules set by the data holder; a breach of confidentiality (BoC) is breaking the law. This section covers the line between them, and why, in practice, they must be handled together.

5.1 Procedures for using a Trusted Research Environment

A BoP is any violation of the rules a data holder sets for using their TRE. The overarching rule is that no data can leave without first being checked and cleared by staff. The everyday precautions are straightforward: access data only within the United Kingdom, make sure your screen cannot be overlooked, lock your screen when away from the computer, do not access or discuss data in public places, and switch off listening mode on any virtual or digital assistant. Each reduces the chance of a procedural slip becoming a BoC.

5.2 Data protection laws

A BoC is a violation of data protection law, which exists to support responsible research use, not to restrict it. Every law, in every jurisdiction, specifies four things: who can use what data, for how long, and for what purpose. Failure on any of those points is a BoC. Alongside the law, data providers typically impose their own requirements: training, certification, use of specified facilities, approved storage; failure there is a BoP. The same laws that carry penalties also build in a 'reasonableness' defence, allowing responsible researchers to work without fear of being caught out by an unlikely event.

5.3 Classifying incidents

Most incidents are procedural rather than legal. Even something serious, like leaving a laptop on a train, is only a breach of procedure until someone finds it, unlocks it and re-identifies an individual from the contents. The following scenarios make the split clearer:

Scenario Accidental or deliberate? Rules or law?
Researcher puts confidential files on a USB to give to a colleague Deliberate (misuse) Breach of procedure; becomes a breach of confidentiality depending on encryption and the colleague's permissions
Researcher leaves laptop on a train Accidental Breach of procedure (e.g. if told not to take the laptop off-site); only becomes a BoC if the finder unlocks unencrypted identifiable data
PhD student agrees to share logon and password with his supervisor Deliberate Prima facie breach of law (unauthorised access) unless a condition applies (e.g. the supervisor is also named on the project)
A user in a research lab asks another user to look at her code on screen Deliberate Breach of procedure, depending on the other user's permissions
Academic team stores confidential data on a cloud server, but encrypts it Deliberate Prima facie breach of law (EU regulations) unless the server is appropriately certified
Researcher using a remote lab takes a screenshot and sends it to the support team to query the data Deliberate Breach of procedure (data should not leave the environment by any means without being checked and cleared)

Only two of these (the shared password and the uncertified cloud server) are breaches of law on their face. The rest become law-breaking only under specific circumstances: the lost laptop, for example, only becomes a BoC if it is unencrypted and opened by someone who sees identifiable data. Not every problem carries legal consequences.

5.4 Types of problem - rules or law?

Which of the two should worry you more? Both, equally. Breaches of confidentiality are rare; breaches of procedure are not. There is no one-to-one relationship between them (a BoP may never produce a BoC, and a BoC can occur through an unanticipated error despite procedures being followed), but BoCs tend to emerge from BoPs. A data provider watching a pattern of procedural breaches can reasonably conclude that management is not working and close down access entirely, regardless of whether any law has been broken. Following procedure means no breach at all.

5.5 Consequences of breaches

The consequences of the two categories look different on paper but overlap heavily in practice.

A breach of confidentiality can mean prison and/or fines, and the responsibility is usually personal (you cannot hide behind your organisation). A breach of procedure brings financial and reputational damage, and potential restrictions on future access: temporary suspension, an indefinite ban, or one-to-one retraining.

Which bites hardest? The meaningful consequences researchers fear (loss of funding, reputation or a job) tend to flow from procedural breaches. If you ignore UKDA guidelines, you risk your ESRC funding whether or not a court is involved.

To make this concrete: a researcher using the Secure Access version of the Crime Survey for England and Wales is in a hurry and copies figures directly from the screen. The output has not gone through statistical disclosure control, the paper could be lost, a journalist could pick it up, and the entire project team could be suspended, all from a rule-break that carries no immediate legal penalty.

5.6 Support if things go wrong

All data protection laws (and most data providers) allow for a 'reasonableness' defence: you cannot be held liable for something unlikely that happened without careless or reckless contribution from you. In practice, this means working with the support team and following procedures. If you need to defend yourself, demonstrating that you did both puts the support team alongside you; if you cannot, they have every reason to distance themselves. The rules are not administrative burdens; they turn an incident into a shared problem rather than a personal liability.

5.7 Summary - understanding data access

The principles are not complicated once you have time to think them through. The Five Safes provide a handy structure for organised thinking. Mistakes and bad practices happen - and when they do, the answer is to work together to address them.

6. Assessment

6.1 Overview

  • The assessment is available online via the Learning Hub
  • It is up to you if and when you complete the assessment (no fixed deadline)
  • Results are marked automatically
  • You must pass to gain access to microdata
  • If you have completed the test but continue to receive reminders, please ignore them

Contact: ids.customer.support@ons.gov.uk


6.2 Assessment Structure

Section 1 - Procedure Scenarios (9 questions)

  • You are presented with 9 workplace scenarios (e.g. "I'm working in a secure lab - what do I do if I need to go to the toilet?")
  • For each scenario, rank the possible responses from most to least sensible

Section 2 - Statistical Disclosure Control (SDC) (10 questions)

  • You are shown 10 example statistical outputs (tables, etc.)
  • Multiple choice format (multiple answers may apply)
  • Apply the threshold of 10 when assessing disclosure risk
  • Identify risks such as: counts below 10, class disclosure, secondary disclosure
  • Some outputs carry no risk - do not be overly cautious
  • Not assessed on output titles - focus on disclosure risk only

6.3 Assessment Rules & Hints

Rule Detail
Open-book You may refer to your notes and the SDC Handbook
One sitting You must complete the assessment without interruption
No deadline Complete at your own pace - but sooner is encouraged
Account expiry Your Learning Hub account is deleted 3 months after activation (can be reactivated)
Pass mark 50% in each section
Time to allow Block approximately 45 minutes to 2 hours

Marking Guidance

  • The assessors are looking for sensible, reasonable answers
  • Some answers are right, some wrong, and some are simply better or worse than others
  • Negative marking applies to the SDC multiple-choice section - only select multiple answers if you are confident they all apply; do not guess
  • Long-answer questions are reviewed by an assessor in borderline cases

Results

  • Results are sent to the TRE team as pass/fail
  • If you fail on your first attempt, you will receive feedback to help you prepare for a resit

6.4 Preparing for the Assessment

  1. Read through the training slides, including the speaker notes
  2. Read the SDC Handbook - this is the key reference for the SDC section
  3. Call the support team if you have queries or want to discuss any topics - do this before taking the test

Note on research: Your detailed test answers are stored in a research database used for a follow-up study on the effect of training. If you wish to opt out of this, please email ids.customer.support@ons.gov.uk


6.5 Practice Questions

Section 1 - Procedure Scenarios

Question 1: When do we need to protect the confidentiality of data in the secure environment?

Rank the following responses from most sensible (1) to least sensible (4):

  • A. Only if the data are sensitive.
  • B. Unless the data are already in the public domain.
  • C. When data are deemed to be personal.
  • D. Regardless of what the data are about.

Answer: D, B, C, A

The correct ranking treats the Safe Settings principle as absolute: once data are inside the secure environment you behave as if everything is confidential, full stop. You don't get to make case-by-case judgements about which rows are "really" sensitive, or which variables are "really" personal - that is precisely the kind of researcher discretion the Five Safes framework is designed to remove.

  • D is the right answer. The confidentiality obligations of the secure environment attach to the environment, not to the perceived sensitivity of the data. You treat everything inside as confidential.
  • B is next best. It at least gestures at a recognised exception (public-domain data genuinely do not need protecting), but it is still wrong as a posture - the researcher shouldn't be the one deciding at the screen what counts as "already public", and even data derived from public sources can become sensitive when linked.
  • C is worse. "Personal data" is a narrower, legal-ish category, and adopting it as the trigger invites the researcher to decide that commercial, administrative, or aggregate data don't need the same care - they do, inside the TRE.
  • A is the worst. "Only if the data are sensitive" is exactly the judgement call the researcher is not qualified (or authorised) to make at the keyboard. It is the default attitude that leads to breaches.

Question 2: Once I have access to a dataset I can use it for...

Rank the following responses from most sensible (1) to least sensible (4):

  • A. Any 'good and proper' research purpose.
  • B. The exact research that was specified in my proposal.
  • C. A different research purpose with clearance from my support officer.
  • D. Any research purpose that is directly related to my original application.

Answer: C, B, D, A

Access is granted against a specific project (Safe Projects). The approval attaches to the proposal you submitted, not to you as a researcher and not to the dataset in general. If your research direction changes, the correct response is to go back through the approvals process - not to decide for yourself that the new direction is "close enough" or "also good".

  • C is the default correct answer - if the research purpose changes, you get clearance (an amendment) from the support officer. This covers real-world cases where projects evolve.
  • B is also legitimate and is explicitly noted in the training: it is what you actually signed up for, and it is what the data owner approved.
  • D sounds reasonable but is wrong as stated - "directly related" is the researcher's own judgement, not a cleared amendment. It slides from B into A without the safety check in C.
  • A is the clearest misuse: "any good and proper research purpose" is exactly the unilateral expansion the Safe Projects principle is designed to prevent. No researcher gets to redefine the scope of their own access.

Question 3: You are working on your research and have breached data confidentiality. Who faces the repercussions?

Rank the following responses from most affected (1) to least affected (4) - i.e. who bears the consequences of the breach, in order:

  • A. You - the researcher.
  • B. The data owner.
  • C. The support team.
  • D. Other researchers.

Answer: all of them - A, B, C, D (the question is a trick; every party listed faces some form of consequence)

The point of this question is that a confidentiality breach is never a private matter between the researcher and the data. Every party in the access chain is exposed, which is precisely why the Five Safes framework distributes responsibility across all of them.

  • A (You - the researcher): you face the most direct and personal consequences - loss of access, potential loss of employment, damage to your professional reputation and, in some jurisdictions, criminal liability. Rank first.
  • B (The data owner) bears the institutional and legal consequences - they are the data controller, they are answerable to the data subjects, and they face regulatory action. Rank second.
  • C (The support team) loses credibility with the data owner, has to run the incident-response process, and typically has to tighten controls across all users (making everyone else's life harder). Rank third.
  • D (Other researchers) suffer the knock-on effects: tightened rules, slower output clearance, and in the worst case the suspension of access to the dataset entirely. Least direct but very real - hence last, but still on the list.

The wrong answer is to pick any single option as if the others are unaffected. Breaches propagate; everyone in the chain pays some of the cost.


Question 4: Which is the most common reason behind breaches of procedure?

Rank the following causes from most common (1) to least common (4):

  • A. Mistakes or ignorance.
  • B. Laziness.
  • C. Malicious intent.
  • D. Dislike of procedures.

Answer: A, D, B, C

Reviews of TRE incidents consistently find that breaches are overwhelmingly accidental. Researchers by and large want to do the right thing - they are trained professionals who have been vetted, have signed legal agreements, and have reputations to protect. The breaches that actually happen are the ones where someone didn't realise a rule applied, forgot a step, or misunderstood what "non-disclosive" meant in context.

  • A (Mistakes or ignorance) is the dominant cause by a wide margin, and is the premise behind the whole SRT course: train people, and the breaches largely go away. Rank first.
  • D (Dislike of procedures) is plausibly next - some breaches come from researchers who understand the rule but find it annoying and cut a corner. This is a live failure mode and the one the "trust, monitor, punish" framework in Q5 is designed against.
  • B (Laziness) is related to D but narrower - the researcher knew the rule, wasn't ideologically against it, just couldn't be bothered. Rare in a population of people who went to the trouble of getting TRE access.
  • C (Malicious intent) is extremely rare. Safe People vetting is specifically designed to screen this out, and incident reviews consistently find that deliberate misuse accounts for a tiny fraction of cases. Rank last.

The wrong intuition is to reach for C because it feels like the scary case; in practice, the scary case is the everyday researcher who didn't know a rule applied.


Question 5: Which is the most effective way of encouraging positive behaviours when using confidential data?

Rank the following systems from most effective (1) to least effective (4):

  • A. A system that uses threats of punishment to generate good behaviour.
  • B. A system that uses monitoring to ensure good behaviour.
  • C. A system that uses trust in users' good behaviour.
  • D. All of the above.

Answer: D, C, B, A

The correct answer is D - the Safe People / Safe Settings regime works because all three mechanisms reinforce each other. Trust alone is naïve (it doesn't deter the rare bad actor and it gives honest researchers no visible backstop when they make a mistake); monitoring alone is adversarial and corrosive to the researcher-support relationship; punishment alone creates a compliance culture where people hide problems rather than report them. Combined, they produce the "trusted researcher" model: you are trusted, your outputs are checked, and the consequences of a deliberate breach are real - and every one of those facts is known to every party in the system.

  • D (All of the above) is the model the training is actually describing. Rank first.
  • C (Trust) is the next best single mechanism, and the one closest to the ethos of the TRE: you are a vetted professional, you are treated as one, and the system expects you to behave like one. Works well for the overwhelmingly honest majority - but provides no backstop.
  • B (Monitoring) is third. Useful as an audit mechanism and genuinely deters corner-cutting, but if used as the primary lever it signals distrust and changes researcher behaviour for the worse (people stop asking the support team questions in case the questions look bad on the record).
  • A (Threats of punishment) is last as a primary mechanism. Punishment is necessary as a backstop for the rare deliberate breach, but as the headline message it is counter-productive: it drives incidents underground, discourages self-reporting, and frames the researcher as a suspect rather than a collaborator.

The wrong instinct is to pick C alone (too permissive) or A alone (too punitive). The whole point of the Five Safes framework is that no single dimension carries the load.


Section 2 - Statistical Disclosure Control (SDC)

Question 1: Consider the following frequency table:

Highest qualification   C2DE   ABC1   Total
Degree+                   51     17      68
College                   30     13      43
School                    15      8      23
None                       9      7      16

Which of the following apply?

  • A. Three cells (9, 8, 7) fall below the threshold of 10, so the default action is to reject the table - suppress, redesign, or band the categories before release.
  • B. The cells are probably not meaningfully disclosive, so the table can simply be released without further discussion.
  • C. The table exhibits class disclosure because an entire qualification group sits below the threshold.
  • D. The row totals should be suppressed to prevent the small cells being recovered by subtraction.
  • E. The 10-threshold is a deliberate compromise: the support team holds a fast, consistent rule so most requests can be processed quickly, which in turn creates space for a genuine argument if the researcher can show these specific cells are important to the research.
  • F. No SDC issues - release as-is.

Answer: A and E

The counts of 9, 8 and 7 are all below the rule-of-thumb limit of 10, so the default action is to reject. They are probably non-disclosive - but "probably non-disclosive" is not by itself a good enough reason to ignore the threshold. The line has to be drawn somewhere, and it is drawn at 10: the support team applies a strict rough-and-ready rule so most requests can be processed quickly, which is precisely what creates space to scrutinise the genuinely important outputs. If these numbers really matter to the research, the researcher can make that case and the table may then be released; but the burden is on them. A short code sketch of the mechanical checks follows the notes below.

  • B is the over-confident reading that the slides explicitly warn against - being non-disclosive is not sufficient on its own.
  • C misuses the term "class disclosure" - no row is 0 or 100%.
  • D would not help anyway, and would degrade the output unnecessarily.
  • F ignores the threshold entirely.
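
To make the checks concrete, here is a minimal Python sketch of the two mechanical tests a checker applies to a table like this: cells below the threshold of 10, and class disclosure (a row where one category holds 0% or 100% of the observations). The data are the Question 1 table; the variable names are illustrative only.

```python
THRESHOLD = 10

# The Question 1 frequency table (totals omitted - they are derivable).
rows = {
    "Degree+": {"C2DE": 51, "ABC1": 17},
    "College": {"C2DE": 30, "ABC1": 13},
    "School":  {"C2DE": 15, "ABC1": 8},
    "None":    {"C2DE": 9,  "ABC1": 7},
}

# Check 1: any cell below the threshold means the default action is "reject".
small_cells = [(row, col, n)
               for row, cells in rows.items()
               for col, n in cells.items()
               if n < THRESHOLD]

# Check 2: class disclosure - a row where one category holds 0% or 100% of
# the observations, so knowing the group reveals the attribute.
class_rows = [row for row, cells in rows.items()
              if any(n == 0 or n == sum(cells.values())
                     for n in cells.values())]

print("Cells below threshold:", small_cells or "none")
print("Class disclosure rows:", class_rows or "none")
```

Run on this table, it flags the three small cells (9, 8, 7) and finds no class disclosure - matching A and E above, and confirming that C does not apply.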

Question 2: A researcher submits the following line chart of "Import intensity" for release. No frequency table is attached to the submission. The y-axis is labelled only "%", the legend shows four unnamed coloured lines, and no variable definitions are provided.

[Line chart: four unnamed lines labelled only by colour, y-axis "%", x-axis 1994-2016; the "Other" line starts at very low values in 1994 and 2000 before rising sharply from 2002 onwards.]

Which of the following apply?

  • A. The chart is safe to release because it is already aggregated and does not show individual data points.
  • B. Each data point on a chart is subject to the same threshold checks as a cell in a frequency table.
  • C. Because the chart is exported as a PNG/JPEG, any underlying small counts cannot be read off it, so no action is needed.
  • D. The submission is a poor output and should be returned to the researcher: the reviewer cannot tell what "import intensity" means, what the y-axis is a percentage of, what the lines represent, or - most importantly - what the underlying frequencies are. These are basic labelling questions the researcher should have answered before submission.
  • E. Without the underlying frequency table, the checker cannot verify that each point meets the threshold; the correct action is to reject the submission and request the counts rather than attempt threshold checks on the chart alone.
  • F. No SDC issues - release as-is.

Answer: B, D, E

This is a bad output. The reviewer doesn't know what "import intensity" is, what the y-axis is a percentage of, what the different lines represent, or - most importantly - where the frequencies are. All of that wastes the checker's time on basic questions the researcher should already have answered. Meanwhile the principle from 2.15.1 still applies: each data point on a chart must pass the same SDC checks as a cell in a table, so without the underlying counts the checker cannot verify thresholds at all. The right action is to reject and ask for properly labelled output with the frequency table attached; the sketch after the notes below shows the per-point check those counts would enable.

  • A is wrong - aggregation doesn't help if a category contains very few observations, and a time-series point is not "aggregated" in the protective sense.
  • C is wrong - the image format hides nothing about the disclosure risk, and some software embeds the raw data behind the graph. Exporting to PNG/JPEG is a protection against re-editing, not against low-count points.
  • F ignores both the labelling problems and the missing frequencies.
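
Once the counts are supplied, the per-point check is the same small-cell test applied to a table. A minimal sketch, assuming the researcher can tabulate the number of underlying observations behind each plotted point - the series names and counts below are invented for illustration:

```python
THRESHOLD = 10

# counts[series][year] = number of underlying observations behind that point.
# All figures here are hypothetical.
counts = {
    "Other":    {1994: 3,   2000: 6,   2002: 14,  2016: 120},
    "Series B": {1994: 250, 2000: 240, 2002: 235, 2016: 190},
}

# Every plotted point is subject to the same threshold as a table cell.
unsafe_points = [(series, year, n)
                 for series, by_year in counts.items()
                 for year, n in by_year.items()
                 if n < THRESHOLD]

for series, year, n in sorted(unsafe_points):
    print(f"({series}, {year}): only {n} observations - "
          f"fix before the chart can be cleared")
```

In this invented example the early points of the "Other" line would fail the check - exactly the pattern the chart's own shape hints at.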

Question 3: Consider the following output:

Frequencies for employment in the Creative Industries, Creative Occupations & Total Workforce, weighted counts
Years: 2014 - 2015
Coverage: All respondents currently employed or self-employed in main job, aged 23-69
Variable: NSSECORIGIN (derived using SMSOC10 & SMEARNER)
Unweighted counts supplied in separate sheet.

Industry                    NS-SEC                       2014     2015
Advertising and Marketing   Salariat                   66,088   62,734
                            Intermediate               27,667   41,940
                            Working Class              17,051   14,573
                            Unemployed / Never worked       *        *
Architecture                Salariat                   32,156   37,709
                            Intermediate               31,043   29,599
                            Working Class              10,295        *
                            Unemployed / Never worked       *        *

Which of the following apply?

  • A. The table is disclosive because the asterisks (*) reveal the existence of unemployed respondents in each industry.
  • B. Small numbers have been suppressed, totals are not shown, the description and variables are clear, and the underlying sample is large - this output can be cleared.
  • C. The output should be rejected because no row or column totals are shown - checkers cannot verify the cell values.
  • D. Unweighted counts being supplied separately is essential: without them the checker cannot confirm that each non-suppressed cell meets the threshold.
  • E. No SDC issues - release as-is.

Answer: B

A clean output: small cells suppressed with *, no totals (so the suppressions can't be recovered by subtraction), clear description of coverage and variable, large underlying sample.

  • A is wrong - the asterisks correctly hide the values; revealing only that some respondents may be "Unemployed / Never worked" in a given industry is not itself a threshold breach. The SDC rule is that each released cell must meet the threshold, not that the existence of the category must be hidden.
  • C is the opposite of correct - omitting totals is a feature here, because it prevents recovery of suppressed cells.
  • D is reasonable in spirit (the checker does need unweighted counts to verify thresholds) but the problem states they are supplied separately, so there is no outstanding issue.
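
A minimal sketch of the pattern this output gets right: primary suppression of small cells, with no totals released so the suppressed values cannot be recovered by subtraction. The threshold of 10 is this training's rule of thumb, the small count below is invented, and in practice the check is applied to the unweighted counts:

```python
THRESHOLD = 10

# One industry's rows; the small count is invented for illustration.
rows = [
    ("Salariat",                  66088),
    ("Intermediate",              27667),
    ("Working Class",             17051),
    ("Unemployed / Never worked", 4),
]

# Primary suppression: cells below the threshold become "*".
released = [(label, f"{n:,}" if n >= THRESHOLD else "*") for label, n in rows]

# Deliberately no total row: given a total, "*" = total - sum(visible cells).
for label, value in released:
    print(f"{label:<28}{value:>8}")
```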

Question 4: Consider the following descriptive statistics per school. No title or variable definition was supplied with the submission.

School               Mean     s.d.     P99
Wickham school       44,000    5.6   87,341
St Annes primary     32,000    1.3   66,124
Clydesdale infants   87,000   12.0   98,363
Pent cross school    45,000    4.3   68,112
Scarsdale            67,000    2.9   89,443
Elgin                38,000    2.4   68,111

Which of the following apply?

  • A. The P99 values are given to the exact unit (e.g. 87,341, 66,124) - if any school has fewer than ~100 pupils, the 99th percentile is effectively a single individual's score and should be rounded.
  • B. There is no sample size (N) per school, so the checker cannot verify that the threshold is met - the output should be returned to the researcher for clarification.
  • C. The output has no title, no variable definition, and no coverage information; these should be supplied before the checker can make a release decision.
  • D. The means are all rounded to the nearest 1,000, so the output is safe regardless of the underlying counts.
  • E. No SDC issues - release as-is.

Answer: A, B, C

  • A - P99 at small N is effectively a single individual's value; quoting it to the exact unit is especially risky.
  • B - The N per school is missing, so the checker cannot apply any threshold. Return for clarification.
  • C - Titles and variable definitions are not themselves an SDC issue (the assessment notes "you are not assessed on output titles"), but they are legitimately missing contextual information the checker needs to decide whether the statistic is disclosive in context. In practice the output would be returned.
  • D is wrong - rounding the mean does nothing to protect the unrounded P99.
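
To see why an exact P99 is so exposed at small N, a back-of-envelope check helps. This assumes the common "nearest rank" definition of a percentile, under which P99 is the observation at rank ceil(0.99 * N); statistical packages differ in detail, so treat the output as indicative:

```python
import math

# Nearest-rank percentile: with observations sorted ascending, P99 is the
# value at rank ceil(0.99 * N).
for n in (16, 24, 34, 45, 96, 110):
    rank = math.ceil(0.99 * n)
    note = " <- the single largest value" if rank == n else ""
    print(f"N = {n:>3}: P99 is observation {rank} of {n}{note}")
```

For every N below 100 the rank equals N itself, so the "percentile" is literally the top individual's score - which is why quoting it to the exact unit amounts to publishing one person's value.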

Question 5: Consider the same schools table as Question 4, now with N added and P99 rounded.

Maths scores for year 9, schools in the North East. Note: unweighted counts.

School               Mean     s.d.     P99     N
Wickham school       44,000    5.6   87,000   110
St Annes primary     32,000    1.3   66,000    96
Clydesdale infants   87,000   12.0   98,000    34
Pent cross school    45,000    4.3   68,000    24
Scarsdale            67,000    2.9   89,000    16
Elgin                38,000    2.4   68,000    45

Which of the following apply?

  • A. All rows meet the threshold of 10, so the table can be released.
  • B. The P99 cannot be released for Clydesdale infants (N = 34), Pent cross (N = 24), Scarsdale (N = 16), or Elgin (N = 45) - at these sample sizes the 99th percentile effectively refers to a single pupil's score.
  • C. The P99 column should be suppressed for the low-N schools, or replaced with a wider summary (e.g. inter-quartile range) that does not depend on a single extreme observation.
  • D. Means and standard deviations are fine to release for all schools - the disclosure issue is specifically with the extreme percentile.
  • E. Because "Scarsdale" has N = 16, which is above the threshold of 10, all of its statistics - including P99 - are safe.

Answer: B, C, D

The per-school N column now shows that four of the six schools have 45 or fewer pupils, so their P99 values are effectively single individuals' scores. The fix is to suppress P99 for those rows, or replace it with a less extreme statistic such as the inter-quartile range. Means and standard deviations are summary statistics over at least 16 observations and are not the disclosure risk here.

  • A is wrong - meeting the threshold on counts is not enough; extreme percentiles need a much larger N behind them.
  • E is wrong for the same reason - an N of 16 is fine for a count but far too small for an unmasked 99th percentile.
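
A sketch of the fix, assuming the inter-quartile range is the replacement statistic. The SAFE_N cut-off is illustrative only (chosen here so the decisions match the clearance above; real rules vary by TRE) and the IQR values are invented:

```python
# Illustrative cut-off below which an unmasked extreme percentile is not
# released; not an official rule.
SAFE_N = 50

# (name, mean, sd, p99, iqr, n) - the IQR values are invented.
schools = [
    ("Wickham school",     44000,  5.6, 87000, 9000, 110),
    ("St Annes primary",   32000,  1.3, 66000, 6200,  96),
    ("Clydesdale infants", 87000, 12.0, 98000, 8400,  34),
    ("Pent cross school",  45000,  4.3, 68000, 5100,  24),
    ("Scarsdale",          67000,  2.9, 89000, 7500,  16),
    ("Elgin",              38000,  2.4, 68000, 4700,  45),
]

for name, mean, sd, p99, iqr, n in schools:
    # Means and s.d.s summarise all N observations and are released as-is;
    # the extreme percentile is swapped for the IQR when N is small.
    tail = f"P99={p99:,}" if n >= SAFE_N else f"IQR={iqr:,} (P99 withheld)"
    print(f"{name:<20} mean={mean:>6,} sd={sd:>5} N={n:>3}  {tail}")
```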
