Craig A. Rolling, Infoworks, Inc., crolling@infoworks-chicago.com

The Effects of Data Splitting on Prediction Error and its Estimation

Keywords: regression, prediction error, cross-validation

Abstract: An investigator building a linear regression model for prediction has two primary objectives: building a model that produces accurate predictions, and obtaining an accurate estimate of that model's predictive performance. "Data splitting" is one way many analysts attempt to achieve both objectives. Here, three common splitting strategies are discussed and evaluated on both simulated and real-world examples. The first strategy is not to split the data at all: all of the available data are used both to build the model and to compute a bias-corrected resubstitution estimator of predictive performance. The second strategy is to split the data into a modeling set, used to estimate the model, and a validation set, used to estimate the model's predictive performance. The third strategy is to use 10-fold cross-validation (CV). While cross-validation is usually seen as a way to estimate unconditional prediction accuracy (i.e., accuracy averaged over many training data sets), we show that it can also be interpreted as estimating the conditional accuracy of a certain "averaged" model. This puts CV on the same footing as the other two strategies. The performance of these strategies with regard to the two primary objectives above will be presented.
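The three strategies can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only, not the author's implementation. Several choices are assumptions that the abstract does not specify: Akaike's final prediction error (FPE) stands in for the "bias-corrected resubstitution estimator", the split is 50/50, the "averaged" model is taken to be the mean of the ten fold-wise coefficient vectors, and all function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients; X is expected to include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(X, y, beta):
    """Mean squared prediction error of coefficients beta on (X, y)."""
    return float(np.mean((y - X @ beta) ** 2))

def no_split(X, y):
    """Strategy 1: fit on all data; estimate error with a corrected in-sample RSS.
    Assumption: Akaike's FPE, (RSS / n) * (n + p) / (n - p), is the correction."""
    n, p = X.shape
    beta = fit_ols(X, y)
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss / n * (n + p) / (n - p)

def single_split(X, y, rng, frac_model=0.5):
    """Strategy 2: fit on a modeling set; estimate error on a validation set.
    Assumption: a random 50/50 split."""
    idx = rng.permutation(len(y))
    n_model = int(round(frac_model * len(y)))
    model_idx, valid_idx = idx[:n_model], idx[n_model:]
    beta = fit_ols(X[model_idx], y[model_idx])
    return beta, mse(X[valid_idx], y[valid_idx], beta)

def ten_fold_cv(X, y, rng, k=10):
    """Strategy 3: k-fold CV error estimate, paired with an 'averaged' model.
    Assumption: the averaged model is the mean of the k fold-wise fits."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    betas, sq_errs = [], []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        beta = fit_ols(X[train], y[train])
        betas.append(beta)
        sq_errs.append((y[held_out] - X[held_out] @ beta) ** 2)
    return np.mean(betas, axis=0), float(np.mean(np.concatenate(sq_errs)))

def simulate(n, rng, beta_true, sigma=1.0):
    """Hypothetical data: intercept plus standard normal predictors."""
    p = len(beta_true)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    return X, y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    X, y = simulate(200, rng, beta_true)
    # A large fresh sample approximates each fitted model's conditional error.
    X_new, y_new = simulate(100_000, rng, beta_true)
    results = {
        "no split (FPE)": no_split(X, y),
        "single split": single_split(X, y, rng),
        "10-fold CV": ten_fold_cv(X, y, rng),
    }
    for name, (beta, estimate) in results.items():
        print(f"{name}: estimated={estimate:.3f}, actual={mse(X_new, y_new, beta):.3f}")
```

For each strategy, the "actual" column speaks to the first objective (how well the resulting model predicts), and the gap between "estimated" and "actual" speaks to the second (how well that performance is estimated). A single simulated data set is only a spot check; a real comparison would repeat this over many data sets.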