Safe Researcher Training (SRT)


1. The Five Safes

The main theoretical underpinning of the SRT is The Five Safes. The framework distributes data protection across five interconnected dimensions, recognising that no single mechanism can adequately protect against all disclosure risks. Each dimension addresses a different aspect of the data access pipeline:

  • Safe people: researchers must be appropriately vetted and trained.
  • Safe projects: research must have legitimate objectives and ethical approval.
  • Safe settings: the environment in which analysis occurs must be controlled and secure.
  • Safe data: data must be de-identified or aggregated to minimise re-identification risk.
  • Safe outputs: research results must be reviewed before release to prevent inadvertent disclosure.

The five dimensions need not be uniformly strong; they must collectively achieve an adequate level of overall safety. For open data, for example, high data safety can compensate for lower controls elsewhere, while for controlled access data, lower data safety may be acceptable where strong controls exist over people, projects, settings and outputs.

2. Statistical Disclosure Control (SDC)

This section covers the theory and practice of Statistical Disclosure Control (SDC) for researchers working with confidential data. SDC is the process of identifying and managing the risk that published research outputs might inadvertently reveal sensitive information about individuals. It therefore relates to the Safe outputs dimension of The Five Safes.

It is divided into two main parts:

  1. Basic SDC theory - using simple tables as examples
  2. Extending SDC to the research environment - beyond tables to graphs, regressions, and other outputs

2.1 What is SDC?

Statistical Disclosure Control means looking at data to try to identify possible risks, and then either removing the risky results or using statistical techniques to hide them. The fundamental process is:

  1. Look at the data
  2. Identify risk of re-identification
  3. Where a risk is found, remove the result or use statistical techniques to hide it

SDC is about being precautionary, but utility is important - the aim is to balance risk against the usefulness of the output, consistent with good research practice.

Note: There is a massive theoretical literature on SDC, most of which is irrelevant to researchers, because the basic idea is simple. Research outputs using confidential quantitative data are generally tabular summaries or regression outputs - the aim is to ensure these do not inadvertently reveal individual-level data.


2.2 The Example Dataset

To illustrate SDC principles, we use a small simulated survey of 150 patients taking part in an investigation into the relationship between diabetes, socio-economic variables, and the 'fragileX' gene (which is associated with diabetes).

Variable Description
id Random ID number
male Male y/n
age Age
white White y/n
fragilex Has fragileX gene
diabetic Diabetes diagnosed
education Highest qualifications
abc1 Socio-economic group
income Annual income from all sources (£)
i_quartile Income quartile 1 (lowest) to 4 (highest)
imputed_value Values were imputed y/n

Which variables might be sensitive? fragileX, diabetic, and income - you would not want someone knowing this about you.

Which might be used to identify someone? age, gender (male), white, education - these are identifying variables that help to pick out a respondent in the dataset.

Variables fall into two types:

  • Identifying variables: help to pick out a respondent in the dataset (age, gender, white, education)
  • Target variables: the things you would hope to find out about a person once you have identified them; in this case fragileX, diabetic, and income (also sensitive)

2.3 Why Small Numbers Are a Problem

2.3.1 Unique Observations (N=1)

Table: Existence of 'FragileX' Gene

Gender fragileX: No fragileX: Yes Total
Male 85 6 91
Female 58 1 59
Total 143 7 150

This table shows that there is just one female in the dataset with the fragileX gene. Any results that include both female gender and fragileX will refer to that one person. If you know who that person is, you might learn something about them even though the results don't explicitly split out gender and genes. If you were that person, would you be happy knowing your unique combination of variables had been made public?

2.3.2 Small Groups (N=2)

Table: Diabetes vs FragileX Gene

Diabetes diagnosed? fragileX: No fragileX: Yes Total
Yes 114 2 116
No 29 5 34
Total 143 7 150

This table shows there are two people with a diagnosis of diabetes and with the fragileX gene. If you were one of those two, when data was presented about the two of you, you could make inferences about the other because you know what information you provided. For example, if the average age of these two people is given as 62 and you know you gave your age as 54, it is easy to work out the other person must be 70. This is less sensitive than a unique value, because the disclosure is only made to the set of people in the group - but it is still a concern.

2.3.3 The "3 is the Magic Number" Rule

Scenario Average Income
N = 1 £40,000 - the person's exact income is revealed
N = 2 £35,000 - each person in the group can calculate the other's income
N = 3 £40,000 - no-one can say with certainty what anyone else earns (assuming no collusion)

When there are three or more people in a cell, no one can say with certainty what anyone else earns (assuming no collusion between respondents). This is the foundation of the minimum cell count threshold of 3 - and many organisations bump this threshold up to 10 to be extra safe.

From the notes: Assume you know the average salary of the people in the box is exactly £30,000. If your mate is in the box, everyone who knows him knows he earns £30,000. If there are two people, each one can calculate the other's income (e.g. if one knows his own income is £18,000, he can immediately tell the other must be earning £42,000). But when there are three or more people, no one can say with certainty what anyone else earns.
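
To make the arithmetic concrete, here is a minimal Python sketch of the inference, using the £30,000 / £18,000 figures from the notes (the function name is ours):

```python
# A minimal sketch of the inference above. The figures are the
# £30,000 / £18,000 example from the notes.

def infer_other(published_mean, group_size, my_value):
    group_total = published_mean * group_size
    remainder = group_total - my_value
    if group_size == 2:
        # The remainder IS the other person's exact value.
        return f"the other person's value is exactly £{remainder:,.0f}"
    # With three or more, the remainder is split among several people
    # in an unknown way - no certainty about any individual.
    return (f"£{remainder:,.0f} is split among {group_size - 1} "
            "people in an unknown way")

print(infer_other(30_000, 2, 18_000))  # exactly £42,000
print(infer_other(30_000, 3, 18_000))  # £72,000 split among 2 people
```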


2.4 Context-Sensitivity: Not All Small Numbers Are Problematic

Table: No. of Imputed Values in Dataset

Gender Imputed: No Imputed: Yes Total
Male 89 2 91
Female 58 1 59
Total 147 3 150

This table has small cells, but is it disclosive? What value do you gain by knowing that one of the female respondents had a value imputed? If anything, it emphasises that you shouldn't make any judgements about specific respondents based on the tables, because at least one observation has been modified.

Key lesson: SDC is context-sensitive, particularly under the principles-based approach. Knowing that one female respondent had a value imputed tells you nothing sensitive about her, and because the values are imputed rather than actual, the table is fine to take out of the environment. Small counts are not automatically a problem - always ask: what is being disclosed? (If sensitive information were attached to those imputed values, the answer would change, and the table would not be released.)


2.5 Class Disclosure

2.5.1 What is Class Disclosure?

Class disclosure occurs when publishing data reveals information about an entire group of people, rather than a specific individual. The zeros (or full cells) in a table are often the culprit: when you classify people into different groups, a zero or 100% cell tells you something definitive about everyone in that class.

Table: Income Distribution by Education Level

Highest Qualification Income Q1 (lowest) Q2 Q3 Q4 (highest) Total
Postgrad 1 1 8 18 28
Degree 2 6 14 17 39
College 8 18 16 3 45
School 13 9 0 0 22
None 13 3 0 0 16
Total 37 37 38 38 150

The zero cells tell us something definitive: no one with only school-level or no qualifications earns above the median income. If you know a respondent never went to college, you know with certainty they are earning below the median - that is class disclosure.

2.5.2 Class Disclosure is Highly Context-Sensitive

Consider these examples:

  • "All of the students aged 14+ said they had tried cannabis at least once" - Disclosive. Could effectively transform into a zero-count problem if turned into a table. Be more ambiguous in wording instead.
  • "No nurse in the survey earns over £30.50/hour" - May or may not be disclosive, depending on whether this is the maximum salary on the pay scale. Pay scales may be public information anyway.
  • "No-one in Shetland earns over £50,000/year" - Disclosive: this is a realistic salary threshold that signals something meaningful about the population (effectively suggesting 'poverty').
  • "No-one in Shetland earns over £5m/year" - Could be disclosive depending on the data, but generally less so - very few people earn this in any population.
  • "No-one in Shetland earns over £500m/year" - Not meaningfully disclosive: no one realistically makes that salary in a population, so it doesn't signal anything.

The point is that all three salary examples are formally class disclosures (all are 0% cells), but the practical harm varies with how informative the threshold is. Class disclosure is very context-sensitive - hard-and-fast rules are difficult to apply.

2.5.3 Structural Zeros

Not all zeros are disclosive. Structural zeros (or logical zeros) are values we would expect to be zero from the construction of the table.

Table: Education Level by Age Band

Highest Qualification 0-15 16-20 21-25 26-30 Total
Higher Education 0 0 165 148 313
Secondary Education 0 152 210 318 680
None 324 65 42 15 446

In this dataset, none of the respondents aged 20 or under has a degree, and none aged 15 or under has completed secondary education. These zeros are not disclosive: we would expect them, because in normal circumstances you cannot complete secondary education before 16 or finish a degree before 21. (If there were a child genius in the data, that would be an issue - an unexpected non-zero would stand out.)


2.6 Options for Fixing Disclosive Tables

When a table contains disclosive cells, researchers have several options. Using the income/education table as our example:

Option 1: Cell Suppression

Cell suppression means blanking out the offending cells (replacing them with - or a marker such as <3):

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad - - 8 18 26
Degree - 6 14 17 37
College 8 18 16 3 45
School 13 9 - - 22
None 13 3 - - 16
Total 34 36 38 38 146

Or using detail reduction (replacing with <3 to indicate some data exists, just not enough to show):

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad <3 <3 8 18 26
Degree <3 6 14 17 37
... ... ... ... ... ...

Important: Remove totals or recalculate them after suppression - if you leave the original totals unchanged, missing values can be recovered by subtraction. Always calculate totals after SDC cleaning, not before.

For example, suppose we suppress the low cells but keep the original row and column totals:

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad - - 8 18 28
Degree - 6 14 17 39
College 8 18 16 3 45
School 13 9 - - 22
None 13 3 - - 16
Total 37 37 38 38 150

An attacker can recover every suppressed value:

  1. Degree row: 39 - 6 - 14 - 17 = 2, so Degree Q1 = 2.
  2. Q1 column: 37 - 8 - 13 - 13 = 3, and we now know Degree Q1 = 2, so Postgrad Q1 = 3 - 2 = 1.
  3. Postgrad row: 28 - 8 - 18 = 2, and we now know Postgrad Q1 = 1, so Postgrad Q2 = 1.
  4. Similarly, School and None rows give School Q3 + Q4 = 0 and None Q3 + Q4 = 0, recovering the zeros.

The suppression has achieved nothing because the original totals act as simultaneous equations that can be solved.
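
The same recovery can be scripted in a few lines. This Python sketch just replays the subtraction steps above against the published values:

```python
# Replaying the recovery: suppressed cells fall out of the published
# row and column totals by simple subtraction.

# Published (unsuppressed) values from the income/education table.
degree_row_total, postgrad_row_total = 39, 28
school_row_total, none_row_total = 22, 16
q1_column_total = 37

# Step 1: the Degree row has only one suppressed cell.
degree_q1 = degree_row_total - (6 + 14 + 17)                 # = 2
# Step 2: the Q1 column now has only one unknown left.
postgrad_q1 = q1_column_total - (8 + 13 + 13) - degree_q1    # = 1
# Step 3: the Postgrad row now has only one unknown left.
postgrad_q2 = postgrad_row_total - (8 + 18) - postgrad_q1    # = 1
# Step 4: the School and None rows sum to zero over Q3/Q4, so
# (with non-negative counts) all four suppressed cells must be 0.
school_q3_plus_q4 = school_row_total - (13 + 9)              # = 0
none_q3_plus_q4 = none_row_total - (13 + 3)                  # = 0

print(degree_q1, postgrad_q1, postgrad_q2)                   # 2 1 1
```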

One thing to be careful of is consistency across tables: if you suppress a low cell in one table, are you also suppressing that information in other tables? Can the missing value be recovered by differencing across tables? This is a significant potential problem - even groups with teams of checkers have made this mistake in official publications.

Option 2: Rounding

Round all cell values to the nearest 5 (or 10):

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad 0 0 10 20 30
Degree 0 5 15 15 40
College 10 20 15 5 45
School 15 10 0 0 20
None 15 5 0 0 15
Total 35 35 40 40 150

Note that the rounded totals no longer equal the sums of the rounded cells (the Q1 column, for example, sums to 40 but its rounded total is 35) - this is a common side effect of rounding. Controlled rounding, which aims to minimise the effect on totals, is a specialist area you could seek advice on if presenting many similar tables.

You can also make the data into something less disclosive: ratios, growth rates, proportions (but limit decimal places), etc.
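
A quick Python sketch of nearest-5 rounding on the original table shows how the rounded cells and the rounded totals drift apart (the helper name is ours):

```python
# Nearest-5 rounding applied to the original income/education table.
# Rounded cell sums and rounded totals need not agree.

table = {
    "Postgrad": [1, 1, 8, 18],
    "Degree":   [2, 6, 14, 17],
    "College":  [8, 18, 16, 3],
    "School":   [13, 9, 0, 0],
    "None":     [13, 3, 0, 0],
}

def round5(x):
    return 5 * round(x / 5)

for qual, cells in table.items():
    rounded = [round5(c) for c in cells]
    print(qual, rounded,
          "sum of rounded cells:", sum(rounded),
          "rounded row total:", round5(sum(cells)))
# e.g. Degree -> [0, 5, 15, 15]: the cells sum to 35,
# but the rounded row total is 40
```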

Option 3: Redesign the Output (Recommended)

Merge categories so that no cell has fewer than the threshold:

Highest Qualification Q1 Q2 Q3 Q4 Total
Degree+ (Postgrad + Degree) 3 7 22 35 67
College 8 18 16 3 45
School 13 9 0 0 22
None 13 3 0 0 16
Total 37 37 38 38 150

Why we recommend this option:

  • Retains accuracy but not precision (whereas other methods retain precision but not accuracy)
  • Once you have redesigned your categories, you are likely to continue using those categories across all your tables, giving a coherent analytical picture
  • Focuses attention on the data itself, not just on solving the problem of a specific table
  • Note: in some cases (like this one with the zero cells for School/None), redesign alone doesn't solve all problems

Choosing the Right Option

  • The best option depends on the output - not all approaches will work all of the time
  • It depends on the message you want to present
  • You know what's important - you decide which SDC methods to use
  • The user support team can advise if needed, but cannot and will not make decisions for you

2.7 Primary Disclosure: Dominance

2.7.1 The Dominance Rule

Large numbers of observations are not always sufficient protection if one or two units contribute the bulk of the data. This is the dominance problem.

The dominance rule states that disclosure risk exists when:

  • The largest unit is more than 43.75% of the total, OR
  • The combined value of all units except the largest two is less than 12.5% of the largest unit

Both conditions are tested on raw contributor values - the individual values that sum to produce the published aggregate. A suspicious computed statistic (such as a mean far above its peers) is a flag to investigate, but the dominance check itself requires the individual contributor values.

Rule 1 example:

Sector N firms Total turnover (£000s)
Agriculture 18 7,560
Manufacturing 22 25,960
Retail 31 19,840
Utilities 14 118,300
Construction 25 12,750

The Utilities total is an order of magnitude above every other sector. Examining the underlying contributor data for Utilities:

Firm rank Turnover (£000s)
1 65,000
2 11,000
3 8,500
4 6,200
5 5,100
6 4,300
7 3,800
8 3,200
9 2,900
10 2,400
11 2,100
12 1,800
13 1,200
14 800
Total 118,300

Threshold: £118,300k × 0.4375 = £51,756k. Firm 1 (£65,000k) accounts for 65,000 / 118,300 ≈ 55% — Rule 1 is triggered.

Rule 2 example:

Sub-code N firms Largest firm (£m) 2nd largest (£m) Sum of remaining (£m) Total (£m)
21.10 16 340 180 32 552
21.20 19 95 82 410 587

Sub-code 21.10 fails Rule 1 immediately: 340/552 ≈ 61.6% > 43.75%. It also fails Rule 2: the firms outside the largest two contribute only £32m between them, well under 12.5% of the largest firm (£340m × 0.125 = £42.5m). The risk is concrete: the second-largest firm can subtract its own £180m from the published £552m to get £372m, and because the remaining firms account for at most £32m of that, it can place the largest firm's spend between £340m and £372m - an estimate within about 10% of the true value.

Sub-code 21.20 passes Rule 1 (95/587 ≈ 16%), so Rule 2 must be checked against 12.5% of the largest firm (£95m × 0.125 = £11.9m). The underlying contributor values:

Firm rank R&D (£m)
1 95
2 82
3 75
4 60
5 48
6 38
7 30
8 26
9 22
10 18
11 15
12 13
13 12
14 12
15 11
16 9
17 8
18 7
19 6

The firms ranked 3rd and below sum to £410m, far above the £11.9m threshold, so Rule 2 is not triggered. Even a researcher at firm #3 who subtracts their own spend from the published total (£587m − £75m = £512m) is left with 18 unknown contributions, and cannot closely estimate what any other firm spent.
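
Both checks are mechanical enough to script once you have the raw contributor values. A hedged Python sketch (the function name is ours), applied to the Utilities turnover figures from the Rule 1 example:

```python
# The two dominance checks as stated above, applied to raw
# contributor values (the Utilities turnover figures, in £000s).

def dominance_risk(values, rule1_share=0.4375, rule2_p=0.125):
    x = sorted(values, reverse=True)
    total = sum(x)
    rule1 = x[0] > rule1_share * total   # largest unit > 43.75% of total
    rule2 = sum(x[2:]) < rule2_p * x[0]  # rest sum < 12.5% of largest
    return rule1, rule2

utilities = [65000, 11000, 8500, 6200, 5100, 4300, 3800,
             3200, 2900, 2400, 2100, 1800, 1200, 800]
print(dominance_risk(utilities))  # (True, False): Rule 1 is triggered
```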

2.7.2 Why Dominance Creates Risk

The dominance rule exists because aggregates don't protect individuals when one (or two) contributors dominate the total. Computed statistics such as the mean, sum, or total therefore become essentially a proxy for that person's value, and the other contributors are just noise.

Working backwards from the mean:

Table: Income by Company and Qualification (Dominance Example)

Highest Qualification Company 1: Employee count Company 1: Mean income Company 2: Employee count Company 2: Mean income Overall Mean income
Degree 12 £92,412 30 £54,124 £65,063
College 26 £29,006 24 £28,614 £28,818
School/None 42 £18,332 33 £19,148 £18,691

If you know the mean income and the number of people in a group, you can calculate the group total:

Mean × Count = Total

For example, for Company 1 degree-holding employees, where mean income is £92,412 and employee count is 12: £92,412 x 12 = £1,108,944 (the total income for that group).

Now imagine you work at that company and hold a degree. You know your own income, and roughly who else is in the group (only 12 people). You can subtract known incomes from the total to narrow down - and when one person contributes ~44% of the total, that individual's income is barely disguised by the average. With only 12 people and one value that dwarfs the rest, it doesn't take much insider knowledge to effectively uncover that individual's exact salary.

Here you are doing indirectly what could be done directly with the underlying data: a suspiciously high mean is the symptom that prompts investigation of a dominance cause. For example, you might examine "Degree / Company 1" and find one underlying value is £485,322 - one individual accounting for approximately 44% of the group total. This exceeds the 43.75% dominance threshold.

Knowing the mean and median, one can mathematically calculate or closely estimate the size of any outlying values.
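
A short Python sketch of that calculation, using the Company 1 / Degree figures above:

```python
# The checker's calculation: recover the group total from the
# published mean, then test the largest underlying value against
# the 43.75% dominance threshold.

mean_income, count = 92_412, 12
group_total = mean_income * count        # £1,108,944

largest_value = 485_322                  # found in the underlying data
share = largest_value / group_total

print(f"group total: £{group_total:,}")
print(f"largest contributor: {share:.1%} of the total")  # 43.8%
print("dominance threshold breached:", share > 0.4375)   # True
```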

2.7.3 Dominance: The Aggregation Fix

One remedy for dominance is aggregation: combining groups so that the small dominant subgroup gets absorbed into a larger, more representative pool.

For example, if University 2 has only 10 managers at £180,420 average pay, combining University 1 (45 managers) with University 2 (10 managers) dilutes the influence of that extreme group. The 10 highly-paid managers are now part of a pool of 55 managers, and their extreme mean is tempered by the 45 managers from University 1.

"Diluting" the mean by combining groups: harder to reverse-engineer any outliers.

Benefits of this approach:

  • Increases sample sizes across all categories, making means more robust
  • Eliminates cells dominated by tiny subgroups
  • Harder for anyone to reverse-engineer outlier values from the published aggregate

2.7.4 Dealing with Dominance in Practice

The same techniques as for frequency tables can be used: redesign, suppress, round, etc.

However, dominance is:

  • Hard to check for and demonstrate - it is not as visually obvious as a small cell count
  • Very rare - unless you have a tiny number of observations or an incredibly odd outlier

The best protection is lots of observations - don't produce small cells. Also, be aware of your data: are there any egregious outliers? If so, why are you putting them in with other variables? You may be misrepresenting the data in any case. Once again, good statistics is entirely consistent with good SDC.


2.8 Primary Disclosure: Ranks, Maxima, and Minima

Putting things into ranks (quartiles, medians, maxima and minima) means putting things into cells.

Statistic Income Age
Minimum £8,351 50
Maximum £385,604 70
Mean £34,353 60
Median £11,446 59
N = 150

Key considerations:

  • Maxima and minima: Not always problematic, but assume they are until checked. Min and max could refer to individual people; if so, group and take averages instead.
  • Medians: Can refer to an individual, but unlikely for large groups. For our dataset with 150 observations, the median is fine. However, be careful: the class disclosure we saw earlier showed that some people are definitely below the median income, which combined with the median value is worrying.
  • Ranks: Knowing someone's rank can reveal information without knowing their exact value. This is another form of class disclosure.
  • Percentiles: In a dataset with only 150 observations, the 1st and 99th percentiles (which Stata's sum, detail command will show) will only have one or two people in them.

Income is the variable in this dataset that would cause problems in terms of publishing maxima and minima. For age, the max and min are at the extremes of possible values and the mean/median are uninformative - no problem there.


2.9 Secondary Disclosure: Disclosure by Differencing

This is the biggest problem in SDC, and it has no complete solution.

2.9.1 What is Secondary Disclosure?

Secondary disclosure occurs when a value becomes disclosive not because of what it says on its own, but because of its relationship to other published values. The classic form is disclosure by differencing: subtracting one table from another to reveal a protected cell.

Example:

Age bands Working class Middle class Total
50-54 21 11 32
55-59 25 11 36
60-64 28 12 40
65+ 31 11 42
Total 105 45 150

(All persons)

Age bands Working class Middle class Total
50-54 17 7 24
55-59 19 9 28
60-64 23 8 31
65+ 23 10 33
Total 82 34 116

(Non-diabetics only)

Each table is fine on its own. However, the 65+ row in the "all persons" table shows 42 people, while the "non-diabetics" subset shows only 33. Subtracting: 42 - 33 = 9 diabetics in the 65+ age group. If you produce a table for all persons, it's impossible to prove that no other table has been or will be produced that could breach confidentiality through disclosure by differencing. All we can do is be aware of the problem and try to avoid making it likely.
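
The attack itself is trivial to express - a Python sketch using the row totals from the two tables:

```python
# Disclosure by differencing: each table is safe on its own, but
# subtracting the row totals reveals the protected counts.

all_persons   = {"50-54": 32, "55-59": 36, "60-64": 40, "65+": 42}
non_diabetics = {"50-54": 24, "55-59": 28, "60-64": 31, "65+": 33}

diabetics = {band: all_persons[band] - non_diabetics[band]
             for band in all_persons}
print(diabetics)   # {'50-54': 8, '55-59': 8, '60-64': 9, '65+': 9}
```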

2.9.2 Secondary Suppression

When one cell is suppressed, sometimes other cells must also be suppressed to prevent the suppressed value from being calculated by subtraction. This is secondary suppression. SDC literature normally recommends this as the preferred solution, because it preserves the original totals, which matters most in government statistics.

For example, in a table of employment by sector and geography, if Yorkshire & the Humber has a suppressed value for Air Transport, and the row total and other values are known, you might be able to calculate the suppressed value. A second suppression elsewhere in the same row (e.g. also suppressing Warehousing) prevents this.

Recall our income/education example from Section 2.6, with cells below 3 suppressed but original totals kept:

Highest Qualification Q1 Q2 Q3 Q4 Total
Postgrad - - 8 18 28
Degree - 6 14 17 39
College 8 18 16 3 45
School 13 9 - - 22
None 13 3 - - 16
Total 37 37 38 38 150

The Degree row has only one suppressed cell (Q1), so it can be recovered from the row total: 39 - 6 - 14 - 17 = 2. Applying secondary suppression to the Degree/Q2 cell (the 6) ensures every row and column has at least two unknowns, blocking straightforward recovery. The 6 is a natural choice here because it is the next smallest visible value in the Degree row, so suppressing it loses the least information. (Note one remaining weakness: the School and None rows each sum to zero across Q3 and Q4, so with non-negative counts the suppressed zeros are still recoverable - the class disclosure needs a different fix, such as merging rows.) In this larger table, only one extra cell needs suppressing. For the 2x2 tables in Sections 2.3.1 and 2.3.2, however, you would need to blank out almost every cell, at which point the table communicates nothing.

Choosing which cells to suppress secondarily is itself non-trivial: the minimum number needed to block recovery, the cells with the fewest observations, or the cells with least "importance" by some other criterion (e.g. economically insignificant firms in a business dataset). This is a specialist area, and researchers will often find they lose substantial information, which is why table redesign (Section 2.6, Option 3) is generally more practical.
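
A rough Python helper for the pattern check described above - it tests a necessary condition (no row or column with exactly one unknown), not a sufficient one, as the zero rows show:

```python
# After suppression, does any row or column still contain exactly
# one unknown? If so, that cell is recoverable from its total.

S = None  # suppressed cell marker

table = [
    [S,  S,  8, 18],   # Postgrad
    [S,  S, 14, 17],   # Degree (Q2 suppressed secondarily)
    [8, 18, 16,  3],   # College
    [13, 9,  S,  S],   # School
    [13, 3,  S,  S],   # None
]

def no_lone_unknowns(t):
    rows_ok = all(row.count(S) != 1 for row in t)
    cols_ok = all(col.count(S) != 1 for col in zip(*t))
    return rows_ok and cols_ok

print(no_lone_unknowns(table))  # True: no cell falls to one subtraction
```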


2.10 SDC and Statistical Quality

There is ideally no conflict between SDC and good research. The things to avoid for SDC purposes are the same things to avoid for good statistical analysis:

Things to avoid for SDC Also bad for research quality
Small numbers Low statistical power, unreliable estimates
Dominant observations / huge outliers Skew means, misrepresent the data
Very skewed distributions Mean is uninformative

Good SDC is consistent with good research output. The exception is in class disclosure and ranking, where being able to say something about a whole group of data subjects might be analytically valuable but also problematic from a disclosure perspective.

Remember: SDC applies at the point at which statistical results are going to be released - not when you're playing around with the data. You can explore your data freely; it's only the outputs you intend to publish or share that need SDC treatment.


2.11 Moving Beyond Tables: SDC in the Research Environment

So far we have focused on tables, because they are the most intuitive example. But researchers produce a wide range of outputs. The question is: do the threshold and dominance rules described above apply to them?

Consider the types of statistics researchers commonly produce:

  • Odds ratios, regression coefficients, residual plots, contiguity maps, growth rates
  • Graphs and charts
  • Scatter plots, box plots
  • Maps

For each of the stat types covered in the remaining sections, there are two questions worth asking: how do you spot a potential SDC issue, and what can you do about it? To illustrate this structure, consider two stat types already encountered.

Descriptive statistics

How to spot SDC:

  • Multi-way cross-tabulations raise attribution risk significantly
  • Median, min, and max values often represent a single observation
  • Relative frequencies (percentages) are problematic when totals are also shown
  • All cell counts must meet the threshold N

What to do about it:

  • Band or combine columns and rows
  • Average values across a small group to meet the threshold
  • Round values
  • Suppress cells; also suppress a second cell in the same row or column to prevent the original value being deduced from totals

Percentiles

How to spot SDC:

  • Unrounded values may represent the exact value for a single individual
  • At extreme percentiles (e.g. 1st, 99th) the underlying group may contain only one person
  • The median can refer to a single observation in a small dataset
  • Check that the count underlying each percentile meets the threshold N

What to do about it:

  • Round all percentile values (including the median) to the nearest hundred or thousand
  • Aggregate values into broad categories (e.g. income below £12,000)
  • Present the inter-quartile range rather than individual percentile points

2.12 Low Review vs. High Review Statistics

The central organising framework for SDC in the research environment is the distinction between Low Review Statistics (LRS) and High Review Statistics (HRS).

Low Review Statistics (LRS) High Review Statistics (HRS)
Disclosure risk Inherently low Inherently high
Action Publish after administrative checks Publish only once specific values checked
Examples Regression coefficients, modelled aggregates, non-linear combinations (e.g. estimated odds ratios, survival functions) Frequency tables, individual data points, linear combinations, calculated odds ratios or risk ratios, percentiles

The lion and rabbit metaphor: Think of managing a zoo with two kinds of animals - lions and rabbits. You have limited time. An angry rabbit can give you a nasty nip; a well-fed sleepy lion can be tickled behind the ears. But in general, you should spend your time watching the lions. The lions are the HRSs; the rabbits are the LRSs.

Why this distinction works:

Many outputs have no meaningful disclosure risk because of their functional form - that is, irrespective of the data used to generate them, there is no realistic way for anyone to unpack the statistic to find confidential information. An example is linear regression coefficients. We call these LRSs because we do not really need to check them in detail.

In contrast, some statistics (such as tables) have lots of potential for disclosure risk. We only publish these after ensuring they are non-disclosive in the specific case of that output.

In practice:

  • HRSs must be checked for negligible disclosure risk in that particular instance, SDC applied if necessary, and checked again before release
  • LRSs should just be released - we look at the type of output and say "yes, ready to go" (with some administrative checks)

What makes something an HRS vs LRS?

Low Review Statistics High Review Statistics
Modelled aggregates (e.g. coefficient estimates) Individual data points (e.g. regression residuals)
Non-linear combinations of the data (e.g. estimated odds ratios or survival functions) Linear combinations (e.g. tables, percentiles, calculated odds ratios or risk ratios)

2.13 Regression Output

A regression is a model for estimating a statistical relationship between two or more variables.

Example regression output:

Variable Estimate Standard Error Sig.
Intercept 0.372 0.003 0.00
Female -0.121 0.002 0.00
Age 16-29 (ref: 45-59) -0.465 0.003 0.17
Age 30-44 -0.181 0.003 0.00
Age 60-69 -0.055 0.002 0.00
Marital Status (Single) 0.593 0.001 0.00
Marital Status (Married/cohab, living apart) 0.845 0.003 0.00
Marital Status (Div/Wid) 1.032 0.002 0.00
N = 2,356; d.f. = 2,348

How to spot SDC:

  • Any category has fewer than N observations
  • Regression run on a single unit
  • Sequential regressions differ in cohort size by fewer than N observations, with significant parameter differences
  • Regression consists solely of categorical variables
  • Degrees of freedom below N, or N too small for the model to be genuine rather than effectively a table

What to do about it:

  • Check degrees of freedom are at least N
  • Check sequential regressions do not differ in observation counts by fewer than N
  • Ensure sufficient observations within and between model estimations
  • In practice, disclosure via regression is very rarely an issue
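
As a sketch only - assuming the regressors sit in a pandas DataFrame, with an illustrative threshold of 10 and function names of our own - the checks above might be scripted like this:

```python
# Illustrative pre-release checks for a regression output.
import pandas as pd

THRESHOLD = 10  # the environment's rule-of-thumb N

def regression_release_checks(X: pd.DataFrame, n_obs: int, n_params: int):
    issues = []
    # Degrees of freedom should be at least the threshold.
    if n_obs - n_params < THRESHOLD:
        issues.append("degrees of freedom below threshold")
    # Every categorical level should rest on enough observations.
    for col in X.select_dtypes(include="object"):
        small = X[col].value_counts()
        small = small[small < THRESHOLD]
        if not small.empty:
            issues.append(f"{col}: small categories {list(small.index)}")
    return issues or ["no issues flagged"]
```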

2.14 Residuals

Regression residuals are High Review Statistics because they are individual data points - a researcher could find themselves on a residual plot.

Residual plots should be treated with caution and may require SDC treatment before release.

How to spot SDC:

  • Each residual represents a single observation, breaching the threshold rule by default
  • Outliers are especially identifiable
  • If observations can be ordered along the x-axis, residuals can be attributed to specific individuals
  • Multiple residual plots from the same model make outliers easier to isolate

What to do about it:

  • Describe the shape or conclusion of the plot rather than releasing it
  • If a plot is needed: remove axis scales
  • If a plot is needed: use an x-axis variable difficult to observe outside the dataset (e.g. one generated during analysis)

2.15 Graphs and Charts

2.15.1 Line/Area Charts

Graphs are often produced without showing the underlying data. However, each data point on a graph must pass the same SDC checks as a cell in a table.

Key issue: Graphs are often missing the number of observations or underlying counts. Before exporting a graph from the research environment, you must check the data points.

Underlying counts example:

2011 2012 2013 2014 2015 2016
Primary 69 89 88 76 170 157
Manufacturing 2,764 2,149 1,570 1,756 3,863 3,850
Construction 395 377 480 418 382 410
Wholesale and Retail Trade 209 319 487 494 1,314 1,301
Transport and Communications 9* 8* 6* 35 84 111
Financial Intermediation 987 973 1,223 1,182 2,198 2,364
Other 83 97 137 142 409 398

*Below threshold - needs suppression

How to spot SDC:

  • Each data point is subject to the same checks as a cell in a frequency table
  • Low cell counts in the tails of distributions
  • Min/max values visible on axis scales
  • No supporting frequency table submitted (thresholds cannot be verified)
  • Some software stores raw data behind a graph

What to do about it:

  • Check the underlying frequency table before exporting
  • Band or aggregate categories where counts fall below threshold
  • Suppress low-count points; remove or cap tail values
  • Release as a fixed image (e.g. PNG or JPEG)

2.15.2 Scatter Plots

Scatter plots of individual data points are High Review Statistics - they may directly reveal individual observations. Require a separate frequency table to confirm adequate cell sizes.

How to spot SDC:

  • Each point typically represents one data subject, breaching the threshold rule by default
  • Individual attributes readable directly from the graph
  • Variables in original, untransformed form increase attribution risk

What to do about it:

  • Group data subjects so each plotted value represents at least N observations
  • Generate a heat map with regular bins; include only bins with count above N
  • Transform the underlying variables before plotting
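
One possible implementation of the heat-map suggestion, sketched with NumPy (names and the threshold of 10 are illustrative):

```python
# Bin two variables on a regular grid and suppress any bin whose
# count is positive but below the threshold.
import numpy as np

THRESHOLD = 10

def safe_heatmap(x, y, bins=10):
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Replace small non-zero bins with NaN so they plot as blank.
    masked = np.where((counts > 0) & (counts < THRESHOLD),
                      np.nan, counts)
    return masked, xedges, yedges
```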

2.15.3 Box Plots

Box plots display: minimum, maximum, 25th percentile, median, and 75th percentile. They may also show outliers. The minimum and maximum could refer to individual people; outliers shown as individual points are especially problematic. Solutions include grouping or averaging.

How to spot SDC:

  • Long-tailed data: outliers shown as individual points relate to individual data subjects
  • The 25th percentile, 75th percentile, and median could each refer to a single observation
  • Normally distributed data: whisker ends (min and max) may also relate to single observations

What to do about it:

  • Long-tailed data: group or average outliers
  • Normally distributed data: band min and max into a range (e.g. £7 to £8 per hour) meeting the threshold
  • Release only if summary values are demonstrably not attributable to single individuals

2.15.4 Maps

Maps can be problematic if they show exact point locations or small counts. Use a heatmap instead, so that small numbers of observations are not exposed as specific points. Specific point-location data is only safe to show if it is based on publicly available data.

How to spot SDC:

  • Each dot may represent a single observation
  • Observations with unusual characteristics (e.g. rare condition, specialist facility) are especially identifiable
  • Even imprecise positions can be pinpointed by geographic software
  • Combining location with other variables (age, gender, ethnicity) raises reidentification risk
  • No data point should represent fewer than N observations

What to do about it:

  • Convert to a choropleth map (colour indicates activity levels across areas, not individual locations)
  • Review any scale shown alongside the map, which could itself reveal small counts

2.16 Odds Ratios and Risk Ratios

  • Calculated odds/risk ratios (from a 2x2 table) - treat as a table - HRS
  • Estimated/modelled odds ratios (from logistic regression) - LRS

2.17 Code Files

2.17.1 Safe Code Files

Code files (scripts) can generally be cleared from the research environment, provided:

  • They contain no data (data import, data cleaning, and visualisation of changes are all fine)
  • Any hard-coded data has been removed

It is okay to leave clear code files in the Trusted Research Environment (TRE).

2.17.2 Problematic Code Files

A very common problem: researchers often insert frequency tables or record-level data in code files as comments. Hard-coded data values in a script are disclosive. Do not include record-level data in code files.


Summary: Key Principles

  1. SDC is context-sensitive - small numbers are not automatically a problem; ask what is being disclosed
  2. Minimum cell count = 3 (many organisations use 10 to be extra safe)
  3. Structural zeros are not disclosive - expected zeros are fine
  4. Redesign is the best fix - retains accuracy, produces consistent analysis, addresses root cause
  5. Always calculate totals after SDC cleaning, not before
  6. Secondary disclosure (differencing) has no complete solution - be aware and minimise the risk
  7. Good SDC = good statistics - small cells are bad for both confidentiality and reliability
  8. Use the LRS/HRS framework - concentrate checking effort on High Review Statistics; Low Review Statistics can be cleared administratively
  9. Dominance is rare - best protection is lots of observations; know your data
  10. SDC applies at the point of release - not during analysis; explore freely, check before you publish

3. Rules-Based vs Principles-Based SDC

Statistical Disclosure Control (SDC) in a Trusted Research Environment is not just a statistical discipline but an operational one. So far the principles of SDC have been covered; this section is about how those principles get translated into output clearance that is quick, safe, and doesn't cripple research. The central question is a choice of philosophy: do we govern output release through rules, or through principles?

The framing is deliberately practical. SDC, done naïvely, is a time-consuming process: researchers wait around for results, and support teams wade through piles of output. The two sides also bring different expertise: support staff are not usually as statistically deep as the researchers they serve, and researchers are not always attuned to disclosure risk. In some environments (typically Secure Use Files) there is no support team at all, and researchers self-review. Whatever the setup, the process has to work efficiently, or it doesn't work at all.

3.1 Models of output clearance

Everyone (researchers, support teams, data providers) wants the same three things from output clearance: speed, safety, and no unnecessary restriction on research. These pull against each other, and no clearance process gets all three for free. Splitting outputs into low-review and high-review items helps triage the easy cases, but it doesn't settle the hard question of what to do with the high-review ones. Should we apply hard rules, or soft ones? Two broad models sit on either side of that line: rules-based output SDC (RBOSDC) and principles-based output SDC (PBOSDC).

3.2 Rules-based output SDC

RBOSDC is the intuitive option. You define specific rules for what outputs are allowed, and you stick to them. The rules are, to some degree, arbitrary, but they have been developed over time and represent a reasonable compromise that both researchers and data holders can live with. The rules don't cover everything, and the standard advice is: if in doubt, talk to the support team.

Its appeal is obvious. Rules are simple, transparent, and in principle automatable. They can be written down and handed to a researcher on day one, and they comfort data providers, who like a clean answer. They are used at places like the Eurostat Safe Centre, NILS-RSU, and ADRC-NI. The line here is blunt: don't push the rules; if you hit a problem, work with the support team.

3.2.1 Why strict rules break down in research

The problem is that strict RBOSDC rarely survives contact with a real research environment. About half of SecUF facilities claim to operate this way, but applying it strictly outside fully automated job-processing systems is very difficult. The reason is fundamental: you can't specify, in advance, all the cases a genuine research project will throw at you. Every rule is trying to serve two masters (efficiency and safety), and sooner or later one of them will lose.

A "no exceptions" rule either draws justified complaints from researchers that it is needlessly restrictive, or (the flipside data holders tend to forget) is sometimes too loose for the actual risk in a particular table. What happens in practice is that organisations claiming to be rules-based quietly become "we set and follow rules, apart from the times when we don't." That hybrid is arguably worse than being openly principles-based: expectations aren't made explicit, and the system becomes more open to favouritism.

3.3 Principles-based output SDC

PBOSDC is the alternative, and it is the model used by most TREs (UKDS, the HMRC Datalab, ONS, and the ADRN outside NI). The move is to keep the operational savings of a simple yes/no system, but build in flexibility openly rather than covertly. It starts, like RBOSDC, by defining thresholds, but calls them rules of thumb, because it is going to be explicit about when they bend. In principle, any output is allowed. Because the rules are known to be flexible, they can be set more restrictively on paper than a typical RBOSDC limit: the rules of thumb focus on safety, and the flexibility handles the edge cases.

Support staff and researchers are both allowed to argue that a rule of thumb shouldn't apply in a specific case. A support team member might argue that class disclosure means a particular table can't be released no matter how large the counts are. A researcher might argue that a maximum value should be released because it's informative about the data but can't be informative about any individual respondent. Both are legitimate PBOSDC conversations.

The critical discipline is around when those conversations happen: not very often, only when the output is genuinely important, and only when it is non-disclosive. If every output becomes a negotiation, the process collapses under its own weight. This is why training matters: everyone needs to understand what "not very often" and "important" actually mean in this environment. And it is why trust matters: frustrated researchers and overburdened support teams do not produce good confidentiality practice.

The practical edge of this comes through clearly in the worked example the course uses: a cross-tab with cells of 7, 8 and 9 against a threshold of 10. The numbers are almost certainly non-disclosive, but PBOSDC is an operational and ad hoc criterion: non-disclosiveness alone is not sufficient reason to override the rule of thumb. The output has to be important too. The line has to be drawn somewhere, and it has been drawn at 10. If it weren't, researchers would chip away at every limit, and the efficiency that justified the whole system would disappear.

3.3.1 PBOSDC and the research community

PBOSDC is, in the end, a community process: it only works when everyone works together. Each environment and each data provider will have its own rules of thumb (the specifics come with the data), and if in doubt, the answer is always to check with the support team.

That community framing also cashes out in what "good output" actually looks like in a PBOSDC world. It is hard to specify exactly, but the test is to put yourself in the checker's position: clear labelling, visible frequencies, explanations where needed, no ambiguous axes or unexplained variables. A poorly presented output wastes both sides' time on basic questions, and every minute of that friction erodes the trust the model depends on.

3.3.2 PBOSDC in summary

PBOSDC, together with the low-review / high-review split, was designed specifically for research environments, and is aligned with how research actually gets done: flexible where it matters, restrictive where safety requires. The researcher's job is to learn what the support team expects, and to educate them in return where the research context demands it. Where there is no support team, the researcher takes on the self-review role directly. The most common mistake is producing tables with small cells while exploring the data; everyone makes those mistakes, and what matters is how they get handled.

The two models differ less in their rules than in how they handle the cases the rules don't fit. RBOSDC refuses the edge case; PBOSDC negotiates it, but only when the output is both important and non-disclosive, and only rarely enough that the process stays efficient. When the community holds that line, everyone benefits: researchers get results faster, support teams clear outputs, and data providers keep their confidence in confidentiality intact.

4. Safe People

We looked at Safe outputs in the form of SDC in the previous sections. We will now look at Safe people as the next most important component of The Five Safes when considering releasing data in an electronic environment.

4.1 Why do attitudes & perceptions matter?

The way data providers perceive people directly shapes how much data they are willing to release. For example, a provider who believes researchers mean well but occasionally slip up will invest in training and design flexible access; a provider who does not trust researchers at all will fall back on Public Use Files with much of the useful detail stripped out. Safe Researcher Training takes the middle position: users can be trusted, but need training, and their trustworthiness must be visible enough for data holders to extend that trust.

That visibility is earned through behaviour, which flows from attitudes (the Knowledge-Attitude-Behaviour framing applies, though attitudes are the lever that matters most). The typical failure mode is an attitude failure dressed up as an accident: a researcher who copies confidential files onto a USB, say, never meaningfully engaging with the risk. Exercises asking trainees to pick "safe" colleagues from a shortlist surface the same point: good and safe are not synonyms, obvious traits can mislead, and personal prejudice often does more of the sorting than the evidence warrants. None of this sits in a vacuum, since peers, seniority and power dynamics constantly reshape behaviour, which is why positive peer influence is one of the most effective safeguards available.

4.2 The research community

Safe people operate inside a community with four main actors (data providers, researchers, research users, and the support team), plus a fifth group too easily forgotten: those from whom the data was originally collected, whose interests underpin the whole arrangement. The parties' goals genuinely diverge. Data providers are "default closed", starting from no access and asking of every request whether something will go wrong and at what cost. Researchers and research users are "default open", assuming data should flow unless a specific restriction can be justified; researchers in particular tend to overestimate their trustworthiness and underweight the organisational work needed to make data available. Research users want digestible output on tight timetables and have little patience for confidentiality-driven delay.

Balancing these interests is the support team's core job. Sitting between providers and researchers, they carry provider concerns back to researchers and researcher needs (and the social value of the work) back to providers, and their accumulated experience unlocks future flexibility, since most data access happens because similar access has happened elsewhere, for a long time, without problems. Ignore provider concerns and their trust evaporates; address them credibly and providers can be persuaded to be remarkably open.

4.3 Summary

Attitudes shape behaviour, behaviour shapes perceived trustworthiness, and trusted researchers get systems designed around them rather than locked down against them. Not every positive trait is a "safe" one, peers exert more influence than the support team ever can, and the support team itself is an ally (the right place to take queries and suggestions), since co-operation across the community keeps data flowing to the research that needs it.

5. Breach of procedure vs. breach of confidentiality

We now know two key components of keeping data safe, but what happens when things still go wrong? When a researcher mishandles data in a Trusted Research Environment, the incident falls into one of two categories. A breach of procedure (BoP) is breaking the rules set by the data holder; a breach of confidentiality (BoC) is breaking the law. This section covers the line between them, and why, in practice, they must be handled together.

5.1 Procedures for using a Trusted Research Environment

A BoP is any violation of the rules a data holder sets for using their TRE. The overarching rule is that no data can leave without first being checked and cleared by staff. The everyday precautions are straightforward: access data only within the United Kingdom, make sure your screen cannot be overlooked, lock your screen when away from the computer, do not access or discuss data in public places, and switch off listening mode on any virtual or digital assistant. Each reduces the chance of a procedural slip becoming a BoC.

5.2 Data protection laws

A BoC is a violation of data protection law, which exists to support responsible research use, not to restrict it. Every law, in every jurisdiction, specifies four things: who can use what data, for how long, and for what purpose. Failure on any of those points is a BoC. Alongside the law, data providers typically impose their own requirements: training, certification, use of specified facilities, approved storage; failure there is a BoP. The same laws that carry penalties also build in a 'reasonableness' defence, allowing responsible researchers to work without fear of being caught out by an unlikely event.

5.3 Classifying incidents

Most incidents are procedural rather than legal. Even something serious, like leaving a laptop on a train, is only a breach of procedure until someone finds it, unlocks it and re-identifies an individual from the contents. The following scenarios make the split clearer:

Scenario Accidental or deliberate? Rules or law?
Researcher puts confidential files on a USB to give to a colleague Deliberate (misuse) Breach of procedure; becomes a breach of confidentiality depending on encryption and the colleague's permissions
Researcher leaves laptop on a train Accidental Breach of procedure (e.g. if told not to take the laptop off-site); only becomes a BoC if the finder unlocks unencrypted identifiable data
PhD student agrees to share logon and password with his supervisor Deliberate Prima facie breach of law (unauthorised access) unless a condition applies (e.g. the supervisor is also named on the project)
A user in a research lab asks another user to look at her code on screen Deliberate Breach of procedure, depending on the other user's permissions
Academic team stores confidential data on a cloud server, but encrypts it Deliberate Prima facie breach of law (EU regulations) unless the server is appropriately certified
Researcher using a remote lab takes a screenshot and sends it to the support team to query the data Deliberate Breach of procedure (data should not leave the environment by any means without being checked and cleared)

Only two of these (the shared password and the uncertified cloud server) are breaches of law on their face. The rest become law-breaking only under specific circumstances: the lost laptop, for example, only becomes a BoC if it is unencrypted and opened by someone who sees identifiable data. Not every problem carries legal consequences.

5.4 Types of problem - rules or law?

Which of the two should worry you more? Both, equally. Breaches of confidentiality are rare; breaches of procedure are not. There is no one-to-one relationship between them (a BoP may never produce a BoC, and a BoC can occur through an unanticipated error despite procedures being followed), but BoCs tend to emerge from BoPs. A data provider watching a pattern of procedural breaches can reasonably conclude that management is not working and close down access entirely, regardless of whether any law has been broken. Following procedure means no breach at all.

5.5 Consequences of breaches

The consequences of the two categories look different on paper but overlap heavily in practice.

A breach of confidentiality can mean prison and/or fines, and the responsibility is usually personal (you cannot hide behind your organisation). A breach of procedure brings financial and reputational damage, and potential restrictions on future access: temporary suspension, an indefinite ban, or one-to-one retraining.

Which bites hardest? The meaningful consequences researchers fear (loss of funding, reputation or a job) tend to flow from procedural breaches. If you ignore UKDA guidelines, you risk your ESRC funding whether or not a court is involved.

To make this concrete: a researcher using the Secure Access version of the Crime Survey for England and Wales is in a hurry and copies figures directly from the screen. The output has not gone through statistical disclosure control, the paper could be lost, a journalist could pick it up, and the entire project team could be suspended, all from a rule-break that carries no immediate legal penalty.

5.6 Support if things go wrong

All data protection laws (and most data providers) allow for a 'reasonableness' defence: you cannot be held liable for something unlikely that happened without careless or reckless contribution from you. In practice, this means working with the support team and following procedures. If you need to defend yourself, demonstrating that you did both puts the support team alongside you; if you cannot, they have every reason to distance themselves. The rules are not administrative burdens; they turn an incident into a shared problem rather than a personal liability.

5.7 Summary - understanding data access

The principles are not complicated once you have time to think them through. The Five Safes provide a handy structure for organised thinking. Mistakes and bad practices happen - and when they do, the answer is to work together to address them.

6. Assessment

6.1 Overview

  • The assessment is available online via the Learning Hub
  • It is up to you if and when you complete the assessment (no fixed deadline)
  • Results are marked automatically
  • You must pass to gain access to microdata
  • If you have completed the test but continue to receive reminders, please ignore them

Contact: ids.customer.support@ons.gov.uk


6.2 Assessment Structure

Section 1 - Procedure Scenarios (9 questions)

  • You are presented with 9 workplace scenarios (e.g. "I'm working in a secure lab - what do I do if I need to go to the toilet?")
  • For each scenario, rank the possible responses from most to least sensible

Section 2 - Statistical Disclosure Control (SDC) (10 questions)

  • You are shown 10 example statistical outputs (tables, etc.)
  • Multiple choice format (multiple answers may apply)
  • Apply the threshold of 10 when assessing disclosure risk
  • Identify risks such as: counts below 10, class disclosure, secondary disclosure
  • Some outputs carry no risk - do not be overly cautious
  • Not assessed on output titles - focus on disclosure risk only

6.3 Assessment Rules & Hints

Rule Detail
Open-book You may refer to your notes and the SDC Handbook
One sitting You must complete the assessment without interruption
No deadline Complete at your own pace - but sooner is encouraged
Account expiry Your Learning Hub account is deleted 3 months after activation (can be reactivated)
Pass mark 50% in each section
Time to allow Block approximately 45 minutes to 2 hours

Marking Guidance

  • The assessors are looking for sensible, reasonable answers
  • Some answers are right, some wrong, and some are simply better or worse than others
  • Negative marking applies to the SDC multiple-choice section - only select multiple answers if you are confident they all apply; do not guess
  • Long-answer questions are reviewed by an assessor in borderline cases

Results

  • Results are sent to the TRE team as pass/fail
  • If you fail on your first attempt, you will receive feedback to help you prepare for a resit

6.4 Preparing for the Assessment

  1. Read through the training slides, including the speaker notes
  2. Read the SDC Handbook - this is the key reference for the SDC section
  3. Call the support team if you have queries or want to discuss any topics - do this before taking the test

Note on research: Your detailed test answers are stored in a research database used for a follow-up study on the effect of training. If you wish to opt out of this, please email ids.customer.support@ons.gov.uk


6.5 Practice Questions

Section 1 - Procedure Scenarios

Question 1: When do we need to protect the confidentiality of data in the secure environment?

Rank the following responses from most sensible (1) to least sensible (4):

  • A. Only if the data are sensitive.
  • B. Unless the data are already in the public domain.
  • C. When data are deemed to be personal.
  • D. Regardless of what the data are about.

Answer: D, B, C, A

The correct ranking treats the Safe Settings principle as absolute: once data are inside the secure environment you behave as if everything is confidential, full stop. You don't get to make case-by-case judgements about which rows are "really" sensitive, or which variables are "really" personal - that is precisely the kind of researcher discretion the Five Safes framework is designed to remove.

  • D is the right answer. The confidentiality obligations of the secure environment attach to the environment, not to the perceived sensitivity of the data. You treat everything inside as confidential.
  • B is next best. It at least gestures at a recognised exception (public-domain data genuinely do not need protecting), but it is still wrong as a posture - the researcher shouldn't be the one deciding at the screen what counts as "already public", and even data derived from public sources can become sensitive when linked.
  • C is worse. "Personal data" is a narrower, legal-ish category, and adopting it as the trigger invites the researcher to decide that commercial, administrative, or aggregate data don't need the same care - they do, inside the TRE.
  • A is the worst. "Only if the data are sensitive" is exactly the judgement call the researcher is not qualified (or authorised) to make at the keyboard. It is the default attitude that leads to breaches.

Question 2: Once I have access to a dataset I can use it for...

Rank the following responses from most sensible (1) to least sensible (4):

  • A. Any 'good and proper' research purpose.
  • B. The exact research that was specified in my proposal.
  • C. A different research purpose with clearance from my support officer.
  • D. Any research purpose that is directly related to my original application.

Answer: C, B, D, A

Access is granted against a specific project (Safe Projects). The approval attaches to the proposal you submitted, not to you as a researcher and not to the dataset in general. If your research direction changes, the correct response is to go back through the approvals process - not to decide for yourself that the new direction is "close enough" or "also good".

  • C is the default correct answer - if the research purpose changes, you get clearance (an amendment) from the support officer. This covers real-world cases where projects evolve.
  • B is also legitimate and is explicitly noted in the training: it is what you actually signed up for, and it is what the data owner approved.
  • D sounds reasonable but is wrong as stated - "directly related" is the researcher's own judgement, not a cleared amendment. It slides from B into A without the safety check in C.
  • A is the clearest misuse: "any good and proper research purpose" is exactly the unilateral expansion the Safe Projects principle is designed to prevent. No researcher gets to redefine the scope of their own access.

Question 3: You are working on your research and have breached data confidentiality. Who faces the repercussions?

Rank the following responses from most affected (1) to least affected (4) - i.e. who bears the consequences of the breach, in order:

  • A. You - the researcher.
  • B. The data owner.
  • C. The support team.
  • D. Other researchers.

Answer: all of them - A, B, C, D (the question is a trick; every party listed faces some form of consequence)

The point of this question is that a confidentiality breach is never a private matter between the researcher and the data. Every party in the access chain is exposed, which is precisely why the Five Safes framework distributes responsibility across all of them.

  • A (You - the researcher): you face the most direct and personal consequences - loss of access, potential loss of employment, damage to your professional reputation and, in some jurisdictions, criminal liability. Rank first.
  • B (The data owner) bears the institutional and legal consequences - they are the data controller, they are answerable to the data subjects, and they face regulatory action. Rank second.
  • C (The support team) loses credibility with the data owner, has to run the incident-response process, and typically has to tighten controls across all users (making everyone else's life harder). Rank third.
  • D (Other researchers) suffer the knock-on effects: tightened rules, slower output clearance, and in the worst case the suspension of access to the dataset entirely. Least direct but very real - hence last, but still on the list.

The wrong answer is to pick any single option as if the others are unaffected. Breaches propagate; everyone in the chain pays some of the cost.


Question 4: Which is the most common reason behind breaches of procedure?

Rank the following causes from most common (1) to least common (4):

  • A. Mistakes or ignorance.
  • B. Laziness.
  • C. Malicious intent.
  • D. Dislike of procedures.

Answer: A, D, B, C

Reviews of TRE incidents consistently find that breaches are overwhelmingly accidental. Researchers by and large want to do the right thing - they are trained professionals who have been vetted, have signed legal agreements, and have reputations to protect. The breaches that actually happen are the ones where someone didn't realise a rule applied, forgot a step, or misunderstood what "non-disclosive" meant in context.

  • A (Mistakes or ignorance) is the dominant cause by a wide margin, and is the premise behind the whole SRT course: train people, and the breaches largely go away. Rank first.
  • D (Dislike of procedures) is plausibly next - some breaches come from researchers who understand the rule but find it annoying and cut a corner. This is a live failure mode and the one the "trust, monitor, punish" framework in Q5 is designed against.
  • B (Laziness) is related to D but narrower - the researcher knew the rule, wasn't ideologically against it, just couldn't be bothered. Rare in a population of people who went to the trouble of getting TRE access.
  • C (Malicious intent) is extremely rare. Safe People vetting is specifically designed to screen this out, and incident reviews consistently find that deliberate misuse accounts for a tiny fraction of cases. Rank last.

The wrong intuition is to reach for C because it feels like the scary case; in practice, the scary case is the everyday researcher who didn't know a rule applied.


Question 5: Which is the most effective way of encouraging positive behaviours when using confidential data?

Rank the following systems from most effective (1) to least effective (4):

  • A. A system that uses threats of punishment to generate good behaviour.
  • B. A system that uses monitoring to ensure good behaviour.
  • C. A system that uses trust in users' good behaviour.
  • D. All of the above.

Answer: D, C, B, A

The correct answer is D - the Safe People / Safe Settings regime works because all three mechanisms reinforce each other. Trust alone is naïve (it doesn't deter the rare bad actor and it gives honest researchers no visible backstop when they make a mistake); monitoring alone is adversarial and corrosive to the researcher-support relationship; punishment alone creates a compliance culture where people hide problems rather than report them. Combined, they produce the "trusted researcher" model: you are trusted, your outputs are checked, and the consequences of a deliberate breach are real - and every one of those facts is known to every party in the system.

  • D (All of the above) is the model the training is actually describing. Rank first.
  • C (Trust) is the next best single mechanism, and the one closest to the ethos of the TRE: you are a vetted professional, you are treated as one, and the system expects you to behave like one. Works well for the overwhelmingly honest majority - but provides no backstop.
  • B (Monitoring) is third. Useful as an audit mechanism and genuinely deters corner-cutting, but if used as the primary lever it signals distrust and changes researcher behaviour for the worse (people stop asking the support team questions in case the questions look bad on the record).
  • A (Threats of punishment) is last as a primary mechanism. Punishment is necessary as a backstop for the rare deliberate breach, but as the headline message it is counter-productive: it drives incidents underground, discourages self-reporting, and frames the researcher as a suspect rather than a collaborator.

The wrong instinct is to pick C alone (too permissive) or A alone (too punitive). The whole point of the Five Safes framework is that no single dimension carries the load.


Section 2 - Statistical Disclosure Control (SDC)

Question 1: Consider the following frequency table:

Highest qualification   C2DE   ABC1   Total
Degree+                   51     17      68
College                   30     13      43
School                    15      8      23
None                       9      7      16

Which of the following apply?

  • A. Three cells (9, 8, 7) fall below the threshold of 10, so the default action is to reject the table - suppress, redesign, or band the categories before release.
  • B. The cells are probably not meaningfully disclosive, so the table can simply be released without further discussion.
  • C. The table exhibits class disclosure because an entire qualification group sits below the threshold.
  • D. The row totals should be suppressed to prevent the small cells being recovered by subtraction.
  • E. The 10-threshold is a deliberate compromise: the support team holds a fast, consistent rule so most requests can be processed quickly, which in turn creates space for a genuine argument if the researcher can show these specific cells are important to the research.
  • F. No SDC issues - release as-is.

Answer: A and E

The counts of 9, 8 and 7 are all below the rule-of-thumb limit of 10, so the default action is to reject. They are probably non-disclosive - but "probably non-disclosive" is not by itself a good enough reason to ignore the threshold. The line has to be drawn somewhere, and it is drawn at 10: the support team applies a strict rough-and-ready rule so most requests can be processed quickly, which is precisely what creates space to scrutinise the genuinely important outputs. If these numbers really matter to the research, the researcher can make that case and the table may then be released; but the burden is on them. A short code sketch of the mechanical checks follows the notes below.

  • B is the over-confident reading that the slides explicitly warn against - being non-disclosive is not sufficient on its own.
  • C misuses the term "class disclosure" - no row is 0 or 100%.
  • D would not help anyway, and would degrade the output unnecessarily.
  • F ignores the threshold entirely.
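
To make the checks concrete, here is a minimal Python sketch of the two mechanical tests a checker applies to a table like this: cells below the threshold of 10, and class disclosure (a row where one category holds 0% or 100% of the observations). The data are the Question 1 table; the variable names are illustrative only.

```python
THRESHOLD = 10

# The Question 1 frequency table (totals omitted - they are derivable).
rows = {
    "Degree+": {"C2DE": 51, "ABC1": 17},
    "College": {"C2DE": 30, "ABC1": 13},
    "School":  {"C2DE": 15, "ABC1": 8},
    "None":    {"C2DE": 9,  "ABC1": 7},
}

# Check 1: any cell below the threshold means the default action is "reject".
small_cells = [(row, col, n)
               for row, cells in rows.items()
               for col, n in cells.items()
               if n < THRESHOLD]

# Check 2: class disclosure - a row where one category holds 0% or 100% of
# the observations, so knowing the group reveals the attribute.
class_rows = [row for row, cells in rows.items()
              if any(n == 0 or n == sum(cells.values())
                     for n in cells.values())]

print("Cells below threshold:", small_cells or "none")
print("Class disclosure rows:", class_rows or "none")
```

Run on this table, it flags the three small cells (9, 8, 7) and finds no class disclosure - matching A and E above, and confirming that C does not apply.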

Question 2: A researcher submits the following line chart of "Import intensity" for release. No frequency table is attached to the submission. The y-axis is labelled only "%", the legend shows four unnamed coloured lines, and no variable definitions are provided.

[Line chart: four unnamed lines labelled only by colour, y-axis "%", x-axis 1994-2016; the "Other" line starts at very low values in 1994 and 2000 before rising sharply from 2002 onwards.]

Which of the following apply?

  • A. The chart is safe to release because it is already aggregated and does not show individual data points.
  • B. Each data point on a chart is subject to the same threshold checks as a cell in a frequency table.
  • C. Because the chart is exported as a PNG/JPEG, any underlying small counts cannot be read off it, so no action is needed.
  • D. The submission is a poor output and should be returned to the researcher: the reviewer cannot tell what "import intensity" means, what the y-axis is a percentage of, what the lines represent, or - most importantly - what the underlying frequencies are. These are basic labelling questions the researcher should have answered before submission.
  • E. Without the underlying frequency table, the checker cannot verify that each point meets the threshold; the correct action is to reject the submission and request the counts rather than attempt threshold checks on the chart alone.
  • F. No SDC issues - release as-is.

Answer: B, D, E

This is a bad output. The reviewer doesn't know what "import intensity" is, what the y-axis is a percentage of, what the different lines represent, or - most importantly - where the frequencies are. All of that wastes the checker's time on basic questions the researcher should already have answered. Meanwhile the principle from 2.15.1 still applies: each data point on a chart must pass the same SDC checks as a cell in a table, so without the underlying counts the checker cannot verify thresholds at all. The right action is to reject and ask for properly labelled output with the frequency table attached; the sketch after the notes below shows the per-point check those counts would enable.

  • A is wrong - aggregation doesn't help if a category contains very few observations, and a time-series point is not "aggregated" in the protective sense.
  • C is wrong - the image format hides nothing about the disclosure risk, and some software embeds the raw data behind the graph. Exporting to PNG/JPEG is a protection against re-editing, not against low-count points.
  • F ignores both the labelling problems and the missing frequencies.
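
Once the counts are supplied, the per-point check is the same small-cell test applied to a table. A minimal sketch, assuming the researcher can tabulate the number of underlying observations behind each plotted point - the series names and counts below are invented for illustration:

```python
THRESHOLD = 10

# counts[series][year] = number of underlying observations behind that point.
# All figures here are hypothetical.
counts = {
    "Other":    {1994: 3,   2000: 6,   2002: 14,  2016: 120},
    "Series B": {1994: 250, 2000: 240, 2002: 235, 2016: 190},
}

# Every plotted point is subject to the same threshold as a table cell.
unsafe_points = [(series, year, n)
                 for series, by_year in counts.items()
                 for year, n in by_year.items()
                 if n < THRESHOLD]

for series, year, n in sorted(unsafe_points):
    print(f"({series}, {year}): only {n} observations - "
          f"fix before the chart can be cleared")
```

In this invented example the early points of the "Other" line would fail the check - exactly the pattern the chart's own shape hints at.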

Question 3: Consider the following output:

Frequencies for employment in the Creative Industries, Creative Occupations & Total Workforce, weighted counts
Years: 2014 - 2015
Coverage: All respondents currently employed or self-employed in main job, aged 23-69
Variable: NSSECORIGIN (derived using SMSOC10 & SMEARNER)
Unweighted counts supplied in separate sheet.

Industry                    NS-SEC                       2014     2015
Advertising and Marketing   Salariat                   66,088   62,734
                            Intermediate               27,667   41,940
                            Working Class              17,051   14,573
                            Unemployed / Never worked       *        *
Architecture                Salariat                   32,156   37,709
                            Intermediate               31,043   29,599
                            Working Class              10,295        *
                            Unemployed / Never worked       *        *

Which of the following apply?

  • A. The table is disclosive because the asterisks (*) reveal the existence of unemployed respondents in each industry.
  • B. Small numbers have been suppressed, totals are not shown, the description and variables are clear, and the underlying sample is large - this output can be cleared.
  • C. The output should be rejected because no row or column totals are shown - checkers cannot verify the cell values.
  • D. Unweighted counts being supplied separately is essential: without them the checker cannot confirm that each non-suppressed cell meets the threshold.
  • E. No SDC issues - release as-is.

Answer: B

A clean output: small cells suppressed with *, no totals (so the suppressions can't be recovered by subtraction), clear description of coverage and variable, large underlying sample.

  • A is wrong - the asterisks correctly hide the values; revealing only that some respondents may be "Unemployed / Never worked" in a given industry is not itself a threshold breach. The SDC rule is that each released cell must meet the threshold, not that the existence of the category must be hidden.
  • C is the opposite of correct - omitting totals is a feature here, because it prevents recovery of suppressed cells.
  • D is reasonable in spirit (the checker does need unweighted counts to verify thresholds) but the problem states they are supplied separately, so there is no outstanding issue.
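
A minimal sketch of the pattern this output gets right: primary suppression of small cells, with no totals released so the suppressed values cannot be recovered by subtraction. The threshold of 10 is this training's rule of thumb, the small count below is invented, and in practice the check is applied to the unweighted counts:

```python
THRESHOLD = 10

# One industry's rows; the small count is invented for illustration.
rows = [
    ("Salariat",                  66088),
    ("Intermediate",              27667),
    ("Working Class",             17051),
    ("Unemployed / Never worked", 4),
]

# Primary suppression: cells below the threshold become "*".
released = [(label, f"{n:,}" if n >= THRESHOLD else "*") for label, n in rows]

# Deliberately no total row: given a total, "*" = total - sum(visible cells).
for label, value in released:
    print(f"{label:<28}{value:>8}")
```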

Question 4: Consider the following descriptive statistics per school. No title or variable definition was supplied with the submission.

School               Mean     s.d.     P99
Wickham school       44,000    5.6   87,341
St Annes primary     32,000    1.3   66,124
Clydesdale infants   87,000   12.0   98,363
Pent cross school    45,000    4.3   68,112
Scarsdale            67,000    2.9   89,443
Elgin                38,000    2.4   68,111

Which of the following apply?

  • A. The P99 values are given to the exact unit (e.g. 87,341, 66,124) - if any school has fewer than ~100 pupils, the 99th percentile is effectively a single individual's score and should be rounded.
  • B. There is no sample size (N) per school, so the checker cannot verify that the threshold is met - the output should be returned to the researcher for clarification.
  • C. The output has no title, no variable definition, and no coverage information; these should be supplied before the checker can make a release decision.
  • D. The means are all rounded to the nearest 1,000, so the output is safe regardless of the underlying counts.
  • E. No SDC issues - release as-is.

Answer: A, B, C

  • A - P99 at small N is effectively a single individual's value; quoting it to the exact unit is especially risky.
  • B - The N per school is missing, so the checker cannot apply any threshold. Return for clarification.
  • C - Titles and variable definitions are not themselves an SDC issue (the assessment notes "you are not assessed on output titles"), but they are legitimately missing contextual information the checker needs to decide whether the statistic is disclosive in context. In practice the output would be returned.
  • D is wrong - rounding the mean does nothing to protect the unrounded P99.
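
To see why an exact P99 is so exposed at small N, a back-of-envelope check helps. This assumes the common "nearest rank" definition of a percentile, under which P99 is the observation at rank ceil(0.99 * N); statistical packages differ in detail, so treat the output as indicative:

```python
import math

# Nearest-rank percentile: with observations sorted ascending, P99 is the
# value at rank ceil(0.99 * N).
for n in (16, 24, 34, 45, 96, 110):
    rank = math.ceil(0.99 * n)
    note = " <- the single largest value" if rank == n else ""
    print(f"N = {n:>3}: P99 is observation {rank} of {n}{note}")
```

For every N below 100 the rank equals N itself, so the "percentile" is literally the top individual's score - which is why quoting it to the exact unit amounts to publishing one person's value.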

Question 5: Consider the same schools table as Question 4, now with N added and P99 rounded.

Maths scores for year 9, schools in the North East. Note: unweighted counts.

School               Mean     s.d.     P99     N
Wickham school       44,000    5.6   87,000   110
St Annes primary     32,000    1.3   66,000    96
Clydesdale infants   87,000   12.0   98,000    34
Pent cross school    45,000    4.3   68,000    24
Scarsdale            67,000    2.9   89,000    16
Elgin                38,000    2.4   68,000    45

Which of the following apply?

  • A. All rows meet the threshold of 10, so the table can be released.
  • B. The P99 cannot be released for Clydesdale infants (N = 34), Pent cross (N = 24), Scarsdale (N = 16), or Elgin (N = 45) - at these sample sizes the 99th percentile effectively refers to a single pupil's score.
  • C. The P99 column should be suppressed for the low-N schools, or replaced with a wider summary (e.g. inter-quartile range) that does not depend on a single extreme observation.
  • D. Means and standard deviations are fine to release for all schools - the disclosure issue is specifically with the extreme percentile.
  • E. Because "Scarsdale" has N = 16, which is above the threshold of 10, all of its statistics - including P99 - are safe.

Answer: B, C, D

The per-school N column now shows that four of the six schools have 45 or fewer pupils, so their P99 values are effectively single individuals' scores. The fix is to suppress P99 for those rows, or replace it with a less extreme statistic such as the inter-quartile range. Means and standard deviations are summary statistics over at least 16 observations and are not the disclosure risk here.

  • A is wrong - meeting the threshold on counts is not enough; extreme percentiles need a much larger N behind them.
  • E is wrong for the same reason - an N of 16 is fine for a count but far too small for an unmasked 99th percentile.
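
A sketch of the fix, assuming the inter-quartile range is the replacement statistic. The SAFE_N cut-off is illustrative only (chosen here so the decisions match the clearance above; real rules vary by TRE) and the IQR values are invented:

```python
# Illustrative cut-off below which an unmasked extreme percentile is not
# released; not an official rule.
SAFE_N = 50

# (name, mean, sd, p99, iqr, n) - the IQR values are invented.
schools = [
    ("Wickham school",     44000,  5.6, 87000, 9000, 110),
    ("St Annes primary",   32000,  1.3, 66000, 6200,  96),
    ("Clydesdale infants", 87000, 12.0, 98000, 8400,  34),
    ("Pent cross school",  45000,  4.3, 68000, 5100,  24),
    ("Scarsdale",          67000,  2.9, 89000, 7500,  16),
    ("Elgin",              38000,  2.4, 68000, 4700,  45),
]

for name, mean, sd, p99, iqr, n in schools:
    # Means and s.d.s summarise all N observations and are released as-is;
    # the extreme percentile is swapped for the IQR when N is small.
    tail = f"P99={p99:,}" if n >= SAFE_N else f"IQR={iqr:,} (P99 withheld)"
    print(f"{name:<20} mean={mean:>6,} sd={sd:>5} N={n:>3}  {tail}")
```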
