
Talk:Cross-validation (statistics)



Claim about OLS' downward bias in the expected MSE

The article makes the following claim:

If the model is correctly specified, it can be shown under mild assumptions that the expected value of the MSE for the training set is (n − p − 1)/(n + p + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets).

The text cites Trippa et al. (2015) specifically about the bias factor (n − p − 1)/(n + p + 1). However, the paper does not seem to contain any discussion of this bias factor for OLS. Is there an algebraic proof available for OLS?
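For what it's worth, here is a heuristic sketch (not a rigorous proof) of where the factor could come from, assuming y = Xβ + ε with i.i.d. Gaussian errors, p predictors plus an intercept (so p + 1 columns in X), and validation points x* drawn from the same distribution as the training rows:

$$
\begin{aligned}
\mathbb{E}[\mathrm{MSE}_{\mathrm{train}}] &= \tfrac{1}{n}\,\mathbb{E}[\mathrm{RSS}] = \sigma^2\,\frac{n-p-1}{n} \quad \text{(exact, since } \mathrm{RSS}/\sigma^2 \sim \chi^2_{n-p-1}\text{)},\\
\mathbb{E}[\mathrm{MSE}_{\mathrm{val}}] &= \sigma^2\left(1 + \mathbb{E}\!\left[x_*^{\top}(X^{\top}X)^{-1}x_*\right]\right) \approx \sigma^2\left(1 + \frac{p+1}{n}\right) = \sigma^2\,\frac{n+p+1}{n},\\
\frac{\mathbb{E}[\mathrm{MSE}_{\mathrm{train}}]}{\mathbb{E}[\mathrm{MSE}_{\mathrm{val}}]} &\approx \frac{n-p-1}{n+p+1}.
\end{aligned}
$$

The first line is exact; the second uses the first-order approximation (XᵀX)⁻¹ ≈ Σ⁻¹/n with Σ = E[x xᵀ], so the stated ratio holds only approximately in this sketch. A rigorous argument would need the exact expectation of x*ᵀ(XᵀX)⁻¹x*.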

Based on a simple simulation, the claim seems to be true.

Simulation in R
# Draw n observations from the data-generating process:
# Y = 0.1 + 0.3*X + 0.4*Z + epsilon, with X, Z, epsilon i.i.d. standard normal.
draw_sample <- function(n) {
    X <- rnorm(n)
    Z <- rnorm(n)
    epsilon <- rnorm(n)

    data.frame(
        Y = .1 + .3 * X + .4 * Z + epsilon,
        X = X,
        Z = Z)
}

# Mean squared error of a fitted model's predictions on a data set.
mse <- function(model, data) {
    Y_hat <- predict(model, data)

    mean((data$Y - Y_hat)^2)
}

# Fit OLS on a fresh training set and return the training and validation MSEs.
draw_mse <- function(n_training, n_validation) {
    data <- draw_sample(n_training + n_validation)
    data_training <- data[1:n_training,]
    data_validation <- data[(n_training + 1):nrow(data),]

    model <- lm(Y ~ X + Z, data = data_training)

    c(mse(model, data_training),
      mse(model, data_validation))
}

# Repeat the experiment n_samples times; returns a 2 x n_samples matrix
# (row 1: training MSEs, row 2: validation MSEs).
simulate <- function(n_samples) {
    sapply(
        1:n_samples,
        function(x) {
            draw_mse(n_training = 50, n_validation = 50)
        })
}

x <- simulate(10000)
# Mean log ratio of training MSE to validation MSE across simulations.
mean(log(x[1,]) - log(x[2,]))

The resulting mean log ratio of the MSEs on the training set and the validation set is very close to the value implied by the article's formula, log((n − p − 1)/(n + p + 1)), computed below for the simulation's settings.
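The implied value can be computed directly in R for the settings used above:

n <- 50  # training-set size used in the simulation
p <- 2   # number of predictors (X and Z)
log((n - p - 1) / (n + p + 1))  # log(47/53), approximately -0.120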

chery (talk) 16:32, 17 June 2022 (UTC); edited 17:37, 17 June 2022 (UTC)[reply]

"Swap sampling"

Is there another paper describing this method? The cited paper doesn't even call it "swap sampling". 24.13.125.183 (talk) 00:57, 15 February 2023 (UTC)[reply]