This code shows the problem of post-selection inference following the review article Post-Selection Inference
library(tidyverse)
library(fixest)
Data generation process:
-
$X = (X_1, X_2, X_3)'$ is multi-variate normal with non-diagonal covaraince matrix -
$Y$ is generated independently of$X$ (true coefficients of 0)
dgp <- function(n = 100) {
X_vcov <- matrix(
3 * c(1, 0.2, 0.4, 0.2, 1, 0.01, 0.4, 0.01, 1),
nrow = 3, ncol = 3, byrow = TRUE
)
X <- mvtnorm::rmvnorm(n = n, mean = c(0, 0, 0), sigma = X_vcov)
y <- rnorm(n = n, mean = 1, sd = 3)
cbind(y, X) |>
as_tibble() |>
setNames(c("y", "X1", "X2", "X3"))
}
Estimation strategy:
- Do forward-step selection (start with intercept model) and add coefficients according to AIC-selection criteria
- Then estimate linear model selected by forward-step selection
forward_selection <- function(df) {
# forward_model <- lm(y ~ X1 + X2 + X3, data = df)
forward_model <- lm(y ~ 1, data = df)
final <- step(
forward_model,
direction = "forward",
scope = formula(y ~ X1 + X2 + X3),
trace = 0
)
return(final)
}
Run this simulation B times
sim <- function(i) {
df <- dgp(n = 100)
selection_model <- forward_selection(df)
selection_model |>
summary() |>
_$coefficients |>
as_tibble(rownames = "term") |>
setNames(c("term", "est", "se", "tvalue", "pvalue")) |>
mutate(iter = .env$i, .before = 1)
}
res <- map(1:2500, function(i) {
sim(i)
}) |>
list_rbind()
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
Conditional on $X_1$ being selected as one of the covariates, what is the
distribution of
res |>
filter(term == "X1") |>
ggplot() +
geom_histogram(aes(x = est)) +
labs(x = "Coefficient Estimate", y = NULL, title = "Histogram of Coefficient Estimates") +
theme_bw(base_size = 14)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now we are going to do sample-splitting to address this issue:
- Using half of the data, we will select the model using forward-selection
- Then, we will estimate the selected linear model using the other half of the data
sample_splitting_sim <- function(i) {
df <- dgp(n = 100)
train_idx <- seq_len(nrow(df)) |> sample(size = nrow(df) / 2, replace = FALSE)
train_idx <- train_idx[order(train_idx)]
test_idx <- setdiff(seq_len(nrow(df)), train_idx)
# Select covariates
selection_model <- forward_selection(df[train_idx, ])
# Fit model on test data
selection_model |>
update(data = df[test_idx, ]) |>
summary() |>
_$coefficients |>
as_tibble(rownames = "term") |>
setNames(c("term", "est", "se", "tvalue", "pvalue")) |>
mutate(iter = .env$i, .before = 1)
}
res_sample_split <- map(1:2500, function(i) {
sample_splitting_sim(i)
}) |>
list_rbind()
Conditional on $X_1$ being selected as one of the covariates, what is the
distribution of
res_sample_split |>
filter(term == "X1") |>
ggplot() +
geom_histogram(aes(x = est)) +
labs(x = "Coefficient Estimate", y = NULL, title = "Histogram of Coefficient Estimates (with Sample Splitting)") +
theme_bw(base_size = 14)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
While this helps restore normality of our estimated coefficient,
the main cost is larger confidence intervals since we are fitting the model
with
se <- res |>
filter(term == "X1") |>
pull(se) |>
mean()
se_sample_split <- res_sample_split |>
filter(term == "X1") |>
pull(se) |>
mean()
cat(sprintf(
"The average standard error in the original simulation is %0.3f. The average standard error in our sample-split estimate is %0.3f. This is %0.3f times larger.",
se, se_sample_split, se_sample_split / se
))
The average standard error in the original simulation is 0.175. The average standard error in our sample-split estimate is 0.255. This is 1.456 times larger.
Created on 2024-08-27 with reprex v2.1.0