Before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.
Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.
- Prediction Accuracy
- If the model is linear, then $\hat{\beta}$ has low bias; however, its variance depends on $n$ and $p$.
- If $n \gg p$, then $\hat{\beta}$ has low variance too.
- If $n \approx p$, then $\hat{\beta}$ has high variance.
- If $n < p$, then $\hat{\beta}$ is not unique, which leads to $\infty$ variance.
- By modifying the fitting procedure through constraints (or regularization) or through shrinkage, the variance is significantly reduced while bias increases only slightly.
- Model Interpretability
- Often many of the $\beta_i$s are $0$ (i.e., the corresponding predictors are not associated with $Y$). Including these predictors in the model reduces interpretability.
- Least squares is extremely unlikely to yield any estimate $\hat{\beta}_i$ that is exactly $0$, so it never excludes irrelevant predictors on its own.
- The ideal model fitting procedure should remove irrelevant predictors by setting the corresponding $\hat{\beta}_i=0$.
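The variance claims above can be checked with a small simulation (an illustration added here, not from the text): with a fixed $n$, the empirical variance of an OLS coefficient stays small when $p \ll n$ but blows up as $p$ approaches $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 500

def coef_variance(p):
    # Empirical variance of the first OLS coefficient over many simulated fits,
    # with true beta = (1, 0, ..., 0) and unit-variance Gaussian noise
    estimates = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X[:, 0] + rng.normal(size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta_hat[0])
    return np.var(estimates)

v_small, v_large = coef_variance(2), coef_variance(45)
print(v_small, v_large)  # variance grows sharply as p approaches n
```

With $n=50$, the variance at $p=45$ is roughly an order of magnitude larger than at $p=2$, matching the $n \approx p$ case above.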
Three important classes of methods are:
- Subset Selection
- Identify a subset of the $p$ predictors believed to have non-zero $\beta_i$.
- Use least squares for the selected predictors to estimate $\beta$.
- Shrinkage (regularization)
- Use all $p$ predictors for model fitting.
- Estimated coefficients shrunk towards $0$ → This reduces variance.
- Some methods can shrink some $\beta_i$ to $0$ and simultaneously select predictors.
- Dimension Reduction
- Define new predictors $Z_j=a_{j1}X_1+\cdots+a_{jp}X_p$ for real constants $a_{j1},\dots,a_{jp}$ $(j=1,\dots,M)$, where $M<p$.
- Use $Z_1,...,Z_M$ as new predictors.
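The shrinkage idea can be previewed with the lasso, which the chapter develops later; this is a minimal sketch (using scikit-learn, an assumption of this illustration) showing how an $\ell_1$ penalty shrinks irrelevant coefficients exactly to $0$, performing variable selection as a by-product.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first two predictors are truly associated with y
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

fit = Lasso(alpha=0.5).fit(X, y)
print(fit.coef_)  # the eight irrelevant coefficients are shrunk to (near) 0
```

The relevant coefficients are also shrunk toward $0$ (a small increase in bias), which is exactly the bias-variance trade-off described above.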
6.1 Subset Selection
6.1.1 Best Subset Selection
To perform best subset selection, we fit a separate least squares regression for each possible combination of the $p$ predictors. That is, we fit all $p$ models that contain exactly one predictor, all ${p \choose 2}=p(p-1)/2$ models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best.
The problem of selecting the best model from among the $2^p$ possibilities considered by best subset selection is not trivial.
Best Subset Selection Algorithm
- Let $\cal{M}_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
- For $k=1,2,...,p$:
- Fit all ${p \choose k}$ models that contain exactly $k$ predictors.
- Pick the best among ${p \choose k}$ models, and call it $\cal{M}_k$. Here best is defined as having the smallest RSS, or equivalently largest $R^2$.
- Select a single best model from among $\cal{M}_0,...,\cal{M}_p$ using the estimated prediction error on a validation set, $C_p$ (AIC), BIC, adjusted $R^2$, or cross-validation.
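The algorithm above can be sketched directly with NumPy and `itertools` (a toy implementation for intuition; it enumerates all $2^p$ subsets, so it is only feasible for small $p$). Step 3 is shown here with adjusted $R^2$; any of the criteria listed above could be substituted.

```python
import itertools
import numpy as np

def rss_of_fit(X, y):
    # Least squares fit with an intercept column; returns residual sum of squares
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

def best_subset_selection(X, y):
    """For each size k, return the predictor subset M_k with the smallest RSS."""
    n, p = X.shape
    best = {0: ((), np.sum((y - y.mean()) ** 2))}  # M_0: the null model
    for k in range(1, p + 1):
        candidates = ((s, rss_of_fit(X[:, s], y))
                      for s in itertools.combinations(range(p), k))
        best[k] = min(candidates, key=lambda t: t[1])
    return best

def select_by_adjusted_r2(best, n, tss):
    # Step 3 (sketch): choose among M_0,...,M_p with adjusted R^2
    def adj_r2(k, rss):
        return 1 - (rss / (n - k - 1)) / (tss / (n - 1))
    return max(best, key=lambda k: adj_r2(k, best[k][1]))

# Usage on simulated data where only predictors 0 and 2 matter
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(size=100)
models = best_subset_selection(X, y)
k_star = select_by_adjusted_r2(models, n=100, tss=models[0][1])
print(models[2][0], k_star)  # best 2-predictor model recovers the true subset
```

Note that within a fixed size $k$, RSS (equivalently $R^2$) is a valid comparison criterion, but across sizes it always favors the largest model, which is why step 3 needs a penalized criterion or held-out data.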
Example: Best subset selection