Before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.

Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.

Three important classes of methods are:

  1. Subset selection
  2. Shrinkage
  3. Dimension reduction

6.1 Subset Selection


6.1.1 Best Subset Selection

To perform best subset selection, we fit a separate least squares regression for each possible combination of the $p$ predictors. That is, we fit all $p$ models that contain exactly one predictor, all ${p \choose 2}=p(p-1)/2$ models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best.
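To see where the count used in the next paragraph comes from, sum the number of models of each size over all sizes $k = 0, 1, \dots, p$; by the standard binomial identity,

$$\sum_{k=0}^{p} {p \choose k} = 2^p,$$

so best subset selection fits $2^p$ models in total, including the null model.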

The problem of selecting the best model from among the $2^p$ possibilities considered by best subset selection is not trivial; for instance, with $p = 20$ predictors there are $2^{20} = 1{,}048{,}576$ candidate models.

Best Subset Selection Algorithm

  1. Let $\cal{M}_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
  2. For $k=1,2,...,p$:
    1. Fit all ${p \choose k}$ models that contain exactly $k$ predictors.
    2. Pick the best among these ${p \choose k}$ models, and call it $\cal{M}_k$. Here, best is defined as having the smallest RSS, or equivalently the largest $R^2$.
  3. Select a single best model from among $\cal{M}_0,...,\cal{M}_p$ using the prediction error on a validation set, $C_p$ (AIC), BIC, or adjusted $R^2$, or by using cross-validation; a code sketch of steps 1 and 2 is given below.
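As a rough illustration of steps 1 and 2, here is a minimal Python sketch. The function name `best_subset_selection`, the NumPy-based least squares fit, and the returned data structure are illustrative assumptions, not prescribed by the text.

```python
import itertools

import numpy as np


def best_subset_selection(X, y):
    """Steps 1 and 2 of best subset selection (illustrative sketch).

    X is an (n, p) matrix of predictors and y an (n,) response vector.
    Returns a dict mapping subset size k to (best_subset, RSS), i.e. the
    model M_k for each k = 0, 1, ..., p.
    """
    n, p = X.shape
    best_by_size = {}

    # M_0: the null model has no predictors and predicts the sample mean.
    best_by_size[0] = ((), float(np.sum((y - y.mean()) ** 2)))

    # For each size k, fit all (p choose k) least squares models and keep
    # the one with the smallest RSS (equivalently, the largest R^2).
    for k in range(1, p + 1):
        best_rss, best_subset = np.inf, None
        for subset in itertools.combinations(range(p), k):
            # Design matrix: intercept column plus the chosen predictors.
            Xk = np.column_stack([np.ones(n), X[:, list(subset)]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_subset = rss, subset
        best_by_size[k] = (best_subset, best_rss)

    return best_by_size
```

For example, `best_subset_selection(X, y)[2]` returns the best two-predictor subset together with its RSS. Step 3, choosing among $\cal{M}_0,...,\cal{M}_p$, is deliberately left to the caller, since it requires a validation-based criterion ($C_p$, BIC, adjusted $R^2$, or cross-validation) rather than RSS.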

Example: Best subset selection