4.1 An Overview of Classification

Examples of classification problems:

A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

Just as in the regression setting, in the classification setting we have a set of training observations $(x_1, y_1), \ldots, (x_n, y_n)$ that we can use to build a classifier. We want our classifier to perform well not only on the training data, but also on test observations that were not used to train the classifier.
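
The distinction between training and test performance can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy data, the `error_rate` helper, and the simple threshold rule (predict class 1 when `x` exceeds a cutoff) stand in for a real classifier and are not taken from the text.

```python
# Hypothetical toy data: one predictor x and a binary class label y (0 or 1).
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_train = [0, 0, 0, 1, 0, 1, 1, 1]
x_test = [2.5, 3.8, 4.2, 5.1, 6.5]
y_test = [0, 0, 1, 0, 1]


def error_rate(threshold, xs, ys):
    """Fraction of observations misclassified by the rule 'predict 1 if x > threshold'."""
    wrong = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
    return wrong / len(xs)


# "Training": pick the midpoint threshold with the lowest training error.
candidates = [(a + b) / 2 for a, b in zip(x_train, x_train[1:])]
threshold = min(candidates, key=lambda t: error_rate(t, x_train, y_train))

train_error = error_rate(threshold, x_train, y_train)
test_error = error_rate(threshold, x_test, y_test)
print(f"threshold={threshold}, train error={train_error}, test error={test_error}")
```

On this made-up data the test error is higher than the training error, which is why a classifier is judged on observations it was not trained on.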

4.2 Why Not Linear Regression?

There are at least two reasons not to perform classification using a regression method:

A regression method cannot accommodate a qualitative response with more than two classes.
A regression method will not provide meaningful estimates of $\Pr(Y \mid X)$, even with just two classes.

Thus, it is preferable to use a classification method that is truly suited to qualitative response values. Logistic regression is one such method, well suited to the case of a binary qualitative response.
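
The second reason can be seen in a small sketch: fitting a least-squares line to a response coded 0/1 and reading the fitted values as probabilities. The toy data and the `linear_pred` helper are hypothetical, chosen only to make the effect visible.

```python
# Hypothetical toy data: one predictor x and a binary response y coded 0/1.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Simple linear regression by least squares (closed-form slope and intercept).
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar


def linear_pred(x_new):
    """Fitted value of the linear model, read (problematically) as Pr(Y = 1 | X)."""
    return intercept + slope * x_new


for x_new in (-3.0, 4.5, 12.0):
    print(f"x = {x_new:5.1f}  ->  linear estimate = {linear_pred(x_new):.3f}")
```

For this data the fitted line drops below 0 at `x = -3` and rises above 1 at `x = 12`, so its outputs cannot all be interpreted as probabilities.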

4.3 Logistic Regression

4.3.1 The Logistic Model

In logistic regression, we use the logistic function:

$$ p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0 + \beta_1X}} $$

to avoid situations in which a linear model of the probability would predict $p(X) < 0$ for some values of $X$ and $p(X) > 1$ for others. The logistic function gives outputs strictly between 0 and 1 for all values of $X$.
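
As a quick check of this property, the logistic function can be evaluated directly. This is a minimal sketch: the function name `logistic` and the coefficient values `beta0 = -4.0` and `beta1 = 1.0` are assumptions picked for illustration, not estimates from any data set.

```python
import math


def logistic(x, beta0, beta1):
    """p(X) = e^(beta0 + beta1*X) / (1 + e^(beta0 + beta1*X))."""
    z = beta0 + beta1 * x
    # Use the algebraically equivalent form 1 / (1 + e^(-z)) when z >= 0
    # so that math.exp never overflows for large positive z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients, for illustration only.
beta0, beta1 = -4.0, 1.0

probs = [logistic(float(x), beta0, beta1) for x in range(-10, 21)]
for x in (-10, 0, 4, 8, 20):
    print(f"x = {x:3d}  ->  p(x) = {logistic(float(x), beta0, beta1):.6f}")
```

With a positive `beta1` the values increase with `x` and trace out an S-shaped curve, passing through 0.5 where `beta0 + beta1 * x = 0` (here at `x = 4`). In floating point, very extreme inputs can round to exactly 0 or 1 even though the mathematical function never reaches either bound.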