Chapter 6 Classification

6.1 Binary classification

Given an observation \(X\in \mathcal{X} \subseteq \mathbb{R}^d\), we want to assign a binary label (e.g. \(y\in\{0,1\}\)) to it. A classifier is a function that maps the feature space to a binary label: \(g: \mathcal{X} \mapsto \{0,1\}\). An observation/label pair is modelled as a pair of random variables \((X,y)\). Their joint distribution can be described by:

\[ \begin{aligned} && \mu(A)&=P(X\in A) \\ && \eta(x)&=P(y=1|X=x) \\ && 1-\eta(x)&=P(y=0|X=x) \\ \end{aligned} \]

We further have

\[ \begin{aligned} && q_0&=P(y=0) \\ && q_1&=P(y=1) \\ \end{aligned} \]

and the class-conditional distributions:

\[ \begin{equation} \begin{aligned} && P(X\in A|y=0)&, &P(X\in A |y=1) \\ \end{aligned} \tag{6.1} \end{equation} \]

The quality of a classifier \(g\) is measured by its risk:

\[ \begin{equation} \begin{aligned} && R(g)&=P(g(X) \ne y) \\ \end{aligned} \tag{6.2} \end{equation} \]

More generally we have a loss function \(\ell: \{0,1\} \times \{0,1\} \mapsto \mathbb{R}\) and denote

\[ \begin{equation} \begin{aligned} && R(g)&= \mathbb{E}\ell(g(X),y)\\ \end{aligned} \tag{6.3} \end{equation} \]

which is equivalent to (6.2) if we define \(\ell(g(X),y)= \mathbb{1}_{g(X) \ne y}\). In the binary case we have

\[ \begin{aligned} && R(g)&=P(g(X)\ne y)=P(g(X)=1,y=0)+P(g(X)=0,y=1) \\ && &=P(g(X)=1|y=0)q_0+P(g(X)=0|y=1)q_1 \\ \end{aligned} \]
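As a toy numerical illustration of this decomposition (the priors and conditional error rates below are made up for the example, not taken from any data):

```python
# Toy illustration of R(g) = P(g(X)=1 | y=0) q0 + P(g(X)=0 | y=1) q1.
# All numbers are made up for the example.
q0, q1 = 0.7, 0.3            # class priors P(y=0), P(y=1)
p_error_given_0 = 0.10       # P(g(X)=1 | y=0)
p_error_given_1 = 0.25       # P(g(X)=0 | y=1)

risk = p_error_given_0 * q0 + p_error_given_1 * q1
print(risk)                  # 0.10*0.7 + 0.25*0.3 = 0.145
```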

Theorem 6.1 (Bayes classifier) Let

\[ \begin{aligned} && g^*(X)&= \begin{cases} 1 & \text{if }\eta(X) \ge \frac{1}{2}\\ 0 & \text{otherwise.} \end{cases} \\ \end{aligned} \]

then \(g^*\) is optimal in the sense that for any classifier \(g\)

\[ \begin{aligned} && R(g)&\ge R(g^*) \\ \end{aligned} \]

We call \(g^*\) the Bayes classifier and \(R(g^*)\) the Bayes risk.

Proof. For any classifier \(g: \mathcal{X} \mapsto \{0,1\}\) we have \(R(g)=P(g(X)\ne y)\). Conditioning on \(X=x\) we have

\[ \begin{aligned} && P(g(X)\ne y|X=x)&=P(g(X)=1,y=0|X=x)+P(g(X)=0,y=1|X=x) \\ && &= \mathbb{1}_{g(x)=1}P(y=0|X=x) +\mathbb{1}_{g(x)=0}P(y=1|X=x) \\ && &= \mathbb{1}_{g(x)=1}(1-\eta(x)) +\mathbb{1}_{g(x)=0} \eta(x) \\ \end{aligned} \]

Now we want to show that \(R(g)-R(g^*) \ge 0\). Conditioning as before

\[ \begin{aligned} && P(g(X)\ne y|X=x)-P(g^*(X)\ne y|X=x)&=\eta(x) (\mathbb{1}_{g(x)=0}-\mathbb{1}_{g^*(x)=0}) + (1-\eta(x)) (\mathbb{1}_{g(x)=1}-\mathbb{1}_{g^*(x)=1}) \\ && &=\eta(x) (\mathbb{1}_{g(x)=0}-\mathbb{1}_{g^*(x)=0}) + (\eta(x)-1) (\mathbb{1}_{g(x)=0}-\mathbb{1}_{g^*(x)=0}) \\ && &= (2\eta(x)-1) (\mathbb{1}_{g(x)=0}-\mathbb{1}_{g^*(x)=0}) \\ \end{aligned} \]

where the second equality uses \(\mathbb{1}_{g(x)=1}=1-\mathbb{1}_{g(x)=0}\). Now consider the case \(\eta(x)\ge \frac{1}{2}\): then \(\mathbb{1}_{g^*(x)=0}=0\) and hence both \((2\eta(x)-1)\ge0\) and \((\mathbb{1}_{g(x)=0}-\mathbb{1}_{g^*(x)=0})\ge0\). In the other possible case, \(\eta(x)< \frac{1}{2}\), we have \(\mathbb{1}_{g^*(x)=0}=1\), so \((2\eta(x)-1)<0\) and \((\mathbb{1}_{g(x)=0}-\mathbb{1}_{g^*(x)=0})\le0\). In both cases the product is non-negative, and taking expectations over \(X\) yields \(R(g)-R(g^*) \ge 0\).

Above we conditioned on \(X=x\); taking expectations over \(X\) then gave \(R(g)-R(g^*) \ge 0\). More generally we can write the risk as

\[ \begin{equation} \begin{aligned} && R(g)&= \mathbb{E} \left[ \mathbb{1}_{g(X)=1}(1-\eta(X)) +\mathbb{1}_{g(X)=0} \eta(X) \right]\\ \end{aligned} \tag{6.4} \end{equation} \]

and for the Bayes risk therefore:

\[ \begin{equation} \begin{aligned} && R^*&=\mathbb{E} \left[ \mathbb{1}_{\eta(X)\ge \frac{1}{2}}(1-\eta(X)) +\mathbb{1}_{\eta(X)< \frac{1}{2}} \eta(X) \right] \\ && &= \mathbb{E} \min (\eta(X),1-\eta(X))\\ \Rightarrow&& R^*&\in\left[0, \frac{1}{2}\right] \\ \end{aligned} \tag{6.5} \end{equation} \]
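A minimal sketch of the Bayes classifier and a Monte Carlo approximation of the Bayes risk, assuming (purely for illustration) a one-dimensional \(X\sim N(0,1)\) and a hand-picked logistic \(\eta\):

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Hypothetical regression function eta(x) = P(y=1 | X=x), chosen for illustration.
    return 1.0 / (1.0 + np.exp(-3.0 * x))

def bayes_classifier(x):
    # g*(x) = 1 if eta(x) >= 1/2, else 0.
    return (eta(x) >= 0.5).astype(int)

# Monte Carlo approximation of R* = E[min(eta(X), 1 - eta(X))] with X ~ N(0,1).
X = rng.normal(size=100_000)
print(np.mean(np.minimum(eta(X), 1.0 - eta(X))))
```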

Of course, in practice \(\eta(X)\) is unknown and instead estimated through supervised learning using training data \(D_n=((X_1,y_1),...,(X_n,y_n))\). We denote a data-dependent classifier as

\[ \begin{equation} \begin{aligned} && g_n&=g_n(X,D_n)\in\{0,1\} \\ \end{aligned} \tag{6.6} \end{equation} \]

Our goal in classification is to find a classifier such that \(\mathbb{E}R(g_n)-R^*\) is small.

Definition 6.1 (Consistent classifier) A classification rule \(g_n\) is consistent if \(\lim_{n\rightarrow \infty} \mathbb{E}R(g_n)=R^*\).

6.2 Nearest Neighbour

Assume that \(\mathcal{X}\) is a metric space.

Definition 6.2 A metric space \((\mathcal{X},d)\) is a set \(\mathcal{X}\) equipped with a distance \(d: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}_+\) such that:

\[ \begin{aligned} && d(x,y)&\ge0 &\forall x,y \in \mathcal{X} \\ && d(x,x)&=0 \\ && d(x,y)&>0 &\text{if }x\ne y \\ && d(x,y)&=d(y,x) \\ && d(x,z)&\le d(x,y)+d(y,z) &\forall x,y,z \in \mathcal{X} \\ \end{aligned} \]

The last inequality is referred to as the triangle inequality.

Nearest Neighbour rules are based on how far apart points are under such a metric distance: classifiers of this type assign a label to a point based on the labels of its neighbours.

Definition 6.3 (Nearest Neighbour) Let \(d(x,y)\) be the metric distance on \(\mathcal{X}\). Then the Nearest Neighbour of any data point \(X_i\) is:

\[ \begin{aligned} && X_{(1)}(X_i)&= \arg\min_{X_j:\, j\ne i}d(X_i,X_j) \\ \end{aligned} \]

In general we can sort all data points by increasing distance to a query point \(X\): \((X_{(1)}(X),...,X_{(n)}(X))\).
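A minimal sketch of this ordering, assuming Euclidean distance on \(\mathbb{R}^d\) and NumPy arrays (the function name is mine):

```python
import numpy as np

def neighbour_order(x, X):
    """Indices of the rows of X sorted by Euclidean distance to x,
    i.e. the ordering X_(1)(x), ..., X_(n)(x)."""
    dists = np.linalg.norm(X - x, axis=1)   # d(x, X_j) for every j
    return np.argsort(dists)

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, -0.1]])
print(neighbour_order(np.array([0.1, 0.0]), X))   # [0 2 1]: X[0] is the nearest neighbour
```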

6.2.1 1NN

Definition 6.4 (1NN-classifier) The 1NN-classifier assigns to \(X\) the label of its nearest neighbour: \(g_n(X)=y_{(1)}(X)\). Its conditional probability of error given the training data can be written as \(R(g_n)=P(y_{(1)}(X)\ne y|D_n)= \mathbb{E} \left[P(y_{(1)}(X)\ne y|X,D_n)|D_n \right]\).

It can be shown that the distance between any point and its nearest neighbour is typically of order \(n^{-\frac{1}{d}}\). Hence, as \(n\rightarrow \infty\), we have \(d(X_{(1)}(X),X)\rightarrow0\). Consequently, for \(n\) sufficiently large

\[ \begin{aligned} && X_{(1)}(X)&\approx X \\ && \eta(X_{(1)}(X))&\approx \eta(X) \\ \end{aligned} \] Write \(\eta=\eta(X)\) for short. Then \(y \sim \text{Bern}(\eta)\) and, approximately, \(y_{(1)}(X)\sim \text{Bern}(\eta)\) independently of \(y\) given \(X\), so that

\[ \begin{aligned} && P(y_{(1)}(X)\ne y)&=P(y_{(1)}(X)=1, y=0)+P(y_{(1)}(X)=0, y=1) \\ && &\approx \eta(1-\eta)+\eta(1-\eta)=2\eta(1-\eta) \\ \end{aligned} \]
Theorem 6.2 For all distributions of \((X,y)\): \(\lim_{n\rightarrow \infty} \mathbb{E}R(g_n^{(\text{1NN})})=2 \mathbb{E} \left[ \eta(X)(1-\eta(X)) \right]\le2R^*(1-R^*)\)

The inequality can be derived as follows. Recall that \(R^*= \mathbb{E} \left[ \min(\eta,1-\eta) \right]\) and note that \(\eta(1-\eta)\le\min(\eta,1-\eta)\) for \(\eta\in[0,1]\). Hence \(R^*\le R^{(\text{1NN})}\le2R^*\), where \(R^{(\text{1NN})}\) denotes the asymptotic 1NN risk and the lower bound follows since \(2\eta(1-\eta)=2\min(\eta,1-\eta)\max(\eta,1-\eta)\ge\min(\eta,1-\eta)\). Let \(Z=\min(\eta,1-\eta)\); since \(z\mapsto z(1-z)\) is concave we can apply Jensen’s Inequality to derive the final result in Theorem 6.2.
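Spelled out, with \(Z=\min(\eta,1-\eta)\) (so that \(\eta(1-\eta)=Z(1-Z)\) and \(\mathbb{E}Z=R^*\)), the Jensen step reads:

\[ \begin{aligned} && 2\mathbb{E}\left[\eta(X)(1-\eta(X))\right]&=2\mathbb{E}\left[Z(1-Z)\right]\le 2\,\mathbb{E}[Z]\left(1-\mathbb{E}[Z]\right)=2R^*(1-R^*) \\ \end{aligned} \]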

6.2.2 KNN

Definition 6.5 Let \(k\) be an odd positive integer. The KNN classifier takes the majority vote among the labels of the \(k\) nearest neighbours of \(X\):

\[ \begin{aligned} && g_n&= \begin{cases} 1 & \text{if } \sum_{i=1}^{k}y_{(i)}(X)> \frac{k}{2} \\ 0& \text{otherwise.} \end{cases} \\ \end{aligned} \]
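A minimal sketch of this rule, assuming Euclidean distance, NumPy arrays and labels in \(\{0,1\}\) (the function name is mine, not a standard API); setting \(k=1\) recovers the 1NN classifier of Definition 6.4:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Majority vote among the labels of the k nearest neighbours of x.
    Labels are assumed to lie in {0, 1} and k to be odd."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbours
    return int(y_train[nearest].sum() > k / 2)

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [0.1, -0.2]])
y_train = np.array([0, 1, 1, 0])
print(knn_classify(np.array([0.8, 0.9]), X_train, y_train, k=3))   # 1
```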

One can show that the asymptotic risk of the KNN-classifier satisfies:

\[ \begin{equation} \begin{aligned} && R^{(\text{KNN})}&\le R^*+ \frac{1}{\sqrt{ke}} \\ \end{aligned} \tag{6.7} \end{equation} \]

6.3 Linear Classification

Definition 6.6 (Generalized linear classifier) Let \(\phi_1,...,\phi_k\) be feature mappings \(\phi_j:\mathcal{X}\mapsto \mathbb{R}\) and let \(\mathbf{w}\in \mathbb{R}^k\). Then a generalized linear classifier is defined as:

\[ \begin{aligned} && g(X)&= \begin{cases} 1 & \text{if } \sum_{j=1}^{k}w_j\phi_j(X)\ge0\\ 0& \text{otherwise.} \end{cases}\\ \end{aligned} \]
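A sketch of Definition 6.6, where the feature mappings \(\phi_j\) below (a constant, a coordinate and its square) are an arbitrary choice made for illustration:

```python
import numpy as np

def generalized_linear_classifier(x, w, feature_maps):
    """g(X) = 1 if sum_j w_j * phi_j(X) >= 0, else 0."""
    phi = np.array([phi_j(x) for phi_j in feature_maps])
    return int(w @ phi >= 0)

# Example feature maps: phi_1(x) = 1, phi_2(x) = x_1, phi_3(x) = x_1^2.
feature_maps = [lambda x: 1.0, lambda x: x[0], lambda x: x[0] ** 2]
w = np.array([-1.0, 0.0, 1.0])   # assigns label 1 whenever x_1^2 >= 1

print(generalized_linear_classifier(np.array([2.0]), w, feature_maps))   # 1
print(generalized_linear_classifier(np.array([0.5]), w, feature_maps))   # 0
```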

6.3.1 Perceptron

Assume that the data is linearly separable and that the labels \(y_i\) take values in \(\{-1,+1\}\). A linear classifier with coefficient vector \(\mathbf{w}\) makes a mistake on the pair \((X_i,y_i)\) if

\[ \begin{aligned} && (\mathbf{w}^TX_i)y_i&<0 \\ \end{aligned} \]

that is, when the predicted and the true label are of opposite sign. Let us therefore restate Definition 6.6 for this label encoding:

\[ \begin{equation} \begin{aligned} && g(X)&= \begin{cases} 1 & \text{if } \mathbf{w}^TX\ge0\\ -1& \text{if } \mathbf{w}^TX<0 \end{cases}\\ && g(X)&= \text{sign}(\mathbf{w}^TX) \\ \end{aligned} \tag{6.8} \end{equation} \]

Note that here \(X\) refers to a single observation of possibly many features, rather than a matrix \(\mathbf{X}\) containing all observations. Clearly, if the data is linearly separable, then there exists a vector of coefficients \(\mathbf{w}_*\ne\mathbf{0}\) such that

\[ \begin{aligned} && ({\mathbf{w}_*}^TX_i)y_i&>0 && \forall i \\ \end{aligned} \]

that is for \(\mathbf{w}_*\) the linear classifier commits no mistakes. It turns out that such a vector can be found in polynomial time.

Definition 6.7 (Perceptron algorithm) The perceptron processes a single data point \(X_t\) (in arbitrary order) at each step \(t\). Initialize with an arbitrary \(\mathbf{w}_0\), e.g. \(\mathbf{w}_0=\mathbf{0}\).

At each step \(t\):

  • if \(\text{sign}(\mathbf{w}_{t-1}^TX_t)=y_t\) then \(\mathbf{w}_t=\mathbf{w}_{t-1}\)
  • else \(\mathbf{w}_t=\mathbf{w}_{t-1}+y_tX_t\)

and proceed like this until all points are correctly classified, cycling through the whole data set multiple times if necessary. The final coefficient vector is then simply

\[ \begin{aligned} && \mathbf{\hat{w}}&=\mathbf{w}_0+ \sum_{t=1}^{T} \mathbb{1}_{y_t\ne\text{sign}(\mathbf{w}_{t-1}^TX_t)}y_tX_t \\ \end{aligned} \]
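A minimal sketch of the algorithm, assuming NumPy arrays, labels in \(\{-1,+1\}\) and linearly separable data (the function name and the epoch cap are mine):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Cycle through the data; whenever sign(w^T x_i) != y_i, update w <- w + y_i x_i.
    Assumes labels in {-1, +1}; stops once a full pass makes no mistakes."""
    w = np.zeros(X.shape[1])                 # w_0 = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:      # np.sign(0) = 0 also counts as a mistake here
                w = w + y_i * x_i
                mistakes += 1
        if mistakes == 0:                    # all points correctly classified
            return w
    return w                                 # reached only if no separating pass was found

# Tiny separable example.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w_hat = perceptron(X, y)
print(np.sign(X @ w_hat))                    # [ 1.  1. -1. -1.]
```

By Theorem 6.3 below, the update branch can fire at most \((\rho/\gamma)^2\) times in total when the data is separable.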

How well the perceptron algorithm works depends on how well-behaved the data is, in other words how easy it is to linearly separate the two classes. Formally, we would say that data is well-behaved in the context of linear separability if the margin of the linear classifier is large.

Definition 6.8 The margin of a linear classifier with coefficient vector \(\mathbf{\hat{w}}\) is simply

\[ \begin{aligned} && \gamma&= \min_i\left| \frac{\mathbf{\hat{w}}^TX_i}{||\mathbf{\hat{w}}||}\right| \\ \end{aligned} \] that is, the smallest distance of any data point \(X_i\) from the hyperplane \(\mathbf{\hat{w}}^TX=0\).
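A one-line sketch of this quantity, assuming a coefficient vector `w_hat` and a data matrix `X` as NumPy arrays (names are mine):

```python
import numpy as np

def margin(w_hat, X):
    """gamma = min_i |w_hat^T X_i| / ||w_hat||: the smallest distance of any
    data point from the hyperplane w_hat^T x = 0."""
    return np.min(np.abs(X @ w_hat)) / np.linalg.norm(w_hat)
```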

The following can be established for its time to convergence:

Theorem 6.3 (Novikoff) Let \(\gamma\) be the margin of the classifier and let \(\rho=\max_i||X_i||\). If the data is linearly separable, then the perceptron algorithm terminates after at most \(\left(\frac{\rho}{\gamma}\right)^2\) updates.

Proof. By assumption there exists a vector \(\mathbf{w}_*\) such that:

\[ \begin{aligned} && ({\mathbf{w}_*}^TX_i)y_i&>0 && \forall i \\ \end{aligned} \]

Without loss of generality assume that \(||\mathbf{w}_*||=1\). Then we also have that:

\[ \begin{aligned} && \gamma&\le\min_i\left| \mathbf{w}_*^TX_i\right| \\ \end{aligned} \]

Now the key idea is to look at the similarity between \(\mathbf{w}_*\) and \(\mathbf{w}_t\). As the perceptron converges to the optimal solution, the angle between these two vectors approaches zero and, equivalently, their inner product grows. In particular, whenever an update is performed,

\[ \begin{aligned} && \mathbf{w}_*^T\mathbf{w}_t&=\mathbf{w}_*^T\mathbf{w}_{t-1} + \mathbf{w}_*^TX_ty_t \\ && &\ge \mathbf{w}_*^T\mathbf{w}_{t-1} + \gamma \\ && &\ge t\gamma \\ \end{aligned} \]

where \(t\) is the number of updates performed so far. The first inequality follows from the fact that \(\mathbf{w}_*^TX_ty_t\) is at least as big as the margin (we also know it is positive since \(\mathbf{w}_*\) correctly classifies all points). The second inequality follows by induction: starting from \(\mathbf{w}_0=\mathbf{0}\), each of the \(t\) updates increases the inner product by at least \(\gamma\).

But the inner product can be large simply because the norm of \(\mathbf{w}_t\) grows. We therefore also bound this norm. We have

\[ \begin{aligned} && ||\mathbf{w}_t||^2&=||\mathbf{w}_{t-1}+X_ty_t||^2=||\mathbf{w}_{t-1}||^2 + ||X_t||^2+2y_t\mathbf{w}_{t-1}^TX_t \\ && &\le ||\mathbf{w}_{t-1}||^2 + \rho^2 \le t\rho^2 \\ \end{aligned} \]

where \(2y_t\mathbf{w}_{t-1}^TX_t\le0\) since an update is performed at step \(t\). The first inequality then follows from \(||X_t||^2\le\rho^2=(\max_i||X_i||)^2\). As before, the second inequality follows by induction: starting from \(\mathbf{w}_0=\mathbf{0}\), each of the \(t\) updates increases the squared norm by at most \(\rho^2\).

Then finally, by the Cauchy-Schwarz inequality,

\[ \begin{aligned} && 1 &\ge \frac{\mathbf{w}_*^T\mathbf{w}_t}{||\mathbf{w}_*||\,||\mathbf{w}_t||}\ge \frac{t\gamma}{\rho \sqrt{t}}= \sqrt{t} \frac{\gamma}{\rho} \\ \Rightarrow&& t&\le \left( \frac{\rho}{\gamma} \right)^2\\ \end{aligned} \]

While the perceptron algorithm clearly has some desirable properties, it only works when the data truly is linearly separable (otherwise it simply does not converge). When the data is not linearly separable, minimizing the empirical risk of a linear classifier is computationally hard. What we do in practice to avoid this is to “convexify” the problem.