Chapter 11 Outliers

The empirical mean is sensitive to outliers.

11.1 Trimmed mean estimator

One way of dealing with outliers is to simply remove them. For mean estimation, the corresponding estimator is called the trimmed mean estimator, \(m_n^{(k)}\). It discards the \(k\) largest and the \(k\) smallest values and averages the remaining \(n-2k\). We have

\[ \begin{equation} \begin{aligned} && \mathbb{E}m_n^{(k)}&= \mathbb{E} \frac{1}{n-2k} \sum_{i=1}^{n} \mathbb{1}_{x_i \notin \text{top/bottom } k} x_i\\ \end{aligned} \tag{11.1} \end{equation} \]

One can show that if \(k \approx \log( \frac{1}{\delta})\), then with probability at least \(1-\delta\),

\[ \begin{aligned} && |m_n^{(k)}-m|&\leq c \sqrt{ \frac{\sigma^2 \log( \frac{1}{\delta})}{n}} \\ \end{aligned} \]

where \(m\) is the true mean, \(\sigma^2\) is the variance of the data and \(c\) is a constant.
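The trimmed mean can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `trimmed_mean` and the toy data are assumptions made here, and the trimming is done by sorting and dropping the \(k\) smallest and \(k\) largest values.

```python
from typing import Sequence


def trimmed_mean(xs: Sequence[float], k: int) -> float:
    """Average of xs after discarding the k smallest and k largest values."""
    n = len(xs)
    if k < 0 or 2 * k >= n:
        raise ValueError("need k >= 0 and 2k < n")
    kept = sorted(xs)[k : n - k]  # drop the bottom k and the top k values
    return sum(kept) / (n - 2 * k)


# Hypothetical toy data with one gross outlier.
data = [1.0, 2.0, 3.0, 4.0, 1000.0]
print(sum(data) / len(data))  # plain empirical mean, dominated by the outlier
print(trimmed_mean(data, 1))  # mean of 2.0, 3.0, 4.0
```

With \(k=1\) the single outlier is removed and the estimate falls back into the bulk of the data, while the plain empirical mean is pulled far away from it.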

11.2 Median-of-means estimator

Another idea is to compute the empirical means of disjoint subsets of the data and take the median of those. In particular, divide the data into \(k\) blocks of \(l\) points each, so that \(n=kl\). For each block \(j=1,\dots,k\) compute \(m_n^{(j)}= \frac{1}{l}\sum_{i=(j-1)l+1}^{jl}x_i\). Then the median-of-means estimator is simply:

\[ \begin{equation} \begin{aligned} && m_n&=\text{median}(m_n^{(1)},...,m_n^{(k)}) \\ \end{aligned} \tag{11.2} \end{equation} \]
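The construction in (11.2) can be sketched as follows. The function name `median_of_means` and the toy data are assumptions for illustration. The sketch also assumes that when \(n\) is not a multiple of \(k\), the leftover points are simply ignored, and that the data arrive in no particular order (otherwise one would shuffle them before forming blocks).

```python
import statistics
from typing import Sequence


def median_of_means(xs: Sequence[float], k: int) -> float:
    """Split xs into k consecutive blocks and return the median of the block means."""
    n = len(xs)
    if k < 1 or k > n:
        raise ValueError("need 1 <= k <= n")
    block_size = n // k  # l in the text; leftover points are ignored (assumption)
    block_means = [
        sum(xs[j * block_size : (j + 1) * block_size]) / block_size
        for j in range(k)
    ]
    return statistics.median(block_means)


# Hypothetical toy data: the outlier corrupts only one of the three blocks.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]
print(median_of_means(data, 3))  # block means are 2.0, 5.0 and about 338.3
```

An outlier can corrupt only the block it falls into, so as long as fewer than half of the blocks are affected, the median of the block means stays close to the bulk of the data.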