Bit Prediction
A Probability-Free Prediction Identity
In the lecture notes for his machine learning class, Ben Recht referenced the following identity:
Lemma 3.2
For a sequence \(y_1, y_2, \dots \in \mathbb R\) and any \(z \in \mathbb R\), let \(m_t = \frac 1 t \sum_{k=1}^t y_k\) represent the mean of the first \(t\) terms. Then
\[ \sum_{k=1}^{T-1} \left( \frac k {k+1} \right) (m_k-y_{k+1})^2 - \sum_{k=1}^T (z-y_k)^2 = -T (m_T - z)^2 \]
He comments “My proof of this lemma uses elementary but unintuitive and annoying algebra. I’d love a more straightforward argument.”
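Before attempting a proof, it is easy to sanity-check the identity numerically. Here is a minimal sketch using NumPy on a random sequence; the length \(T\), the seed, and the value of \(z\) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
y = rng.normal(size=T)                    # an arbitrary real sequence y_1, ..., y_T
z = 1.7                                   # an arbitrary fixed prediction z
m = np.cumsum(y) / np.arange(1, T + 1)    # running means m_1, ..., m_T

k = np.arange(1, T)                       # k = 1, ..., T-1
lhs = np.sum(k / (k + 1) * (m[k - 1] - y[k]) ** 2) - np.sum((z - y) ** 2)
rhs = -T * (m[-1] - z) ** 2
assert np.isclose(lhs, rhs)
```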
Here are some attempts to prove this in a more compact and intuitive way.
Bias-Variance approach
Proof. First note the bias-variance relation for any \(z\in \mathbb R\), well known in machine learning theory \[ \sum_{k=1}^T (y_k-z)^2 = \sum_{k=1}^T (y_k-m_T)^2 + T (m_T - z)^2 \tag{1}\]
This follows from the observation \[ \begin{align} \sum_{k=1}^T (y_k-z)^2 &= \sum_{k=1}^T (y_k-m_T + m_T - z)^2 \\ &= \sum_{k=1}^T (y_k-m_T)^2 + 2 (m_T - z) \sum_{k=1}^T (y_k-m_T) + \sum_{k=1}^T (m_T - z)^2 \end{align} \] and noting that the middle term is 0 because \(\sum_{k=1}^T y_k = Tm_T\).
Comparing Equation 1 with the lemma statement, we are done if we can show \[ \sum_{k=1}^T (y_k-m_T)^2 = \sum_{k=1}^{T-1} \left( \frac k {k+1} \right) (m_k-y_{k+1})^2 \]
This formula is a variation of Welford’s algorithm for computing the running variance of a sequence.
Let \(S_k = \sum_{t=1}^k (y_t-m_k)^2\) be the sum of squares around the mean of the first \(k\) points in the sequence. \[ \begin{align} S_{k+1} &= \sum_{t=1}^{k+1} (y_t-m_{k+1})^2 \\ &= \sum_{t=1}^{k} (y_t-m_{k+1})^2 + (y_{k+1} - m_{k+1})^2 \end{align} \] We can expand the first term using the bias-variance formula (applied to the first \(k\) points) with \(z=m_{k+1}\). This yields \[ S_{k+1} = S_k + k (m_{k+1}-m_k)^2 + (y_{k+1} - m_{k+1})^2 \tag{2}\]
Now we rearrange the recurrence formula for means \(y_{k+1} + k m_k = (k+1)m_{k+1}\) to find \[ \begin{aligned} m_{k+1}-m_k &= \frac 1 {k+1} (y_{k+1} - m_{k}) \\ y_{k+1} - m_{k+1} &= \frac k {k+1} (y_{k+1}-m_k ) \end{aligned} \tag{3}\] Plugging Equation 3 into Equation 2 we get \[ \begin{aligned} S_{k+1} &= S_k + \frac k {(k+1)^2}(y_{k+1}-m_k)^2 + \frac {k^2}{(k+1)^2}(y_{k+1}-m_k)^2 \\ &= S_k + \frac {k} {k+1} (y_{k+1}-m_k)^2 \end{aligned} \] Noting that \(S_1=0\), collapsing the recurrence for \(S_k\) finishes the proof \[ \sum_{k=1}^T (y_k-m_T)^2 = S_T = \sum_{k=1}^{T-1} (S_{k+1}-S_k) = \sum_{k=1}^{T-1} \frac {k} {k+1} (y_{k+1}-m_k)^2 \]
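This recurrence translates directly into a streaming computation of the running mean and sum of squares. Here is a minimal sketch in the spirit of Welford's algorithm (though not his exact update), checking the final \(S_T\) against the batch sum of squared deviations; the sequence and variable names are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)                  # arbitrary sequence y_1, ..., y_T

m, S = y[0], 0.0                          # m_1 = y_1 and S_1 = 0
for k, y_next in enumerate(y[1:], start=1):
    S += k / (k + 1) * (y_next - m) ** 2  # S_{k+1} = S_k + k/(k+1) (y_{k+1} - m_k)^2
    m += (y_next - m) / (k + 1)           # m_{k+1} = m_k + (y_{k+1} - m_k)/(k+1)

# after the loop, m = m_T and S = S_T, the sum of squared deviations around m_T
assert np.isclose(m, y.mean())
assert np.isclose(S, np.sum((y - y.mean()) ** 2))
```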
Pythagorean Decomposition
This approach is essentially the same as the previous one, except that we give it a geometric interpretation.
Proof. Consider the vector space \(\mathbb R^T\) with the standard Euclidean inner product \(\langle x, y \rangle = \sum_i x_i y_i\). Let \(\mathbb 1_k = (1,1, \dots, 1, 0,0,\dots, 0)\) where the first \(k\) components are 1 and the rest are zero. Let \(y^{(k)} = (y_1, y_2, \dots, y_k, 0, \dots, 0)\) be the vector whose first \(k\) components correspond to the first \(k\) elements of the sequence.
Now we can decompose the vector \(y^{(k)}\) into a vector in the “mean subspace” spanned by \(\mathbb 1_k\) and an orthogonal residual vector \(r_k\). This gives \[ y^{(k)} = \frac {\langle y^{(k)}, \mathbb 1_k \rangle} {\langle \mathbb 1_k , \mathbb 1_k \rangle} \mathbb 1_k + r_k \] The coefficient is calculated to be \[ \frac {\langle y^{(k)}, \mathbb 1_k \rangle} {\langle \mathbb 1_k , \mathbb 1_k \rangle} = \frac{\sum_{n=1}^k y_n}{k} = m_k \] Therefore \(r_k = y^{(k)} - m_k \mathbb 1_k\) satisfies \(r_k \perp \mathbb 1_k\). More generally, for any \(k'\geq k\) we have \(r_k \perp \mathbb 1_{k'}\), because \(r_k\) is zero in all the coordinates greater than \(k\), so \(\langle r_k, \mathbb 1_{k'} \rangle = \langle r_k, \mathbb 1_k \rangle = 0\).
Now we can use the Pythagorean theorem on the vector \(y^{(T)}-z \mathbb 1_T\) to conclude \[ \begin{align} \lVert y^{(T)} - z\mathbb 1_T \rVert^2 &= \lVert (y^{(T)} - m_T \mathbb 1_T) + (m_T- z)\mathbb 1_T \rVert^2 \\ &= \lVert r_T + (m_T- z)\mathbb 1_T \rVert^2 \\ &= \lVert r_T \rVert^2 + \lVert (m_T- z)\mathbb 1_T \rVert^2 \end{align} \tag{4}\] This is simply the bias-variance decomposition in another form.
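The projection and Equation 4 are easy to verify numerically. Here is a minimal NumPy sketch with arbitrary choices of \(T\), \(z\), and the sequence:

```python
import numpy as np

rng = np.random.default_rng(2)
T, z = 20, -0.3
y = rng.normal(size=T)
ones_T = np.ones(T)                       # the vector 1_T

m_T = (y @ ones_T) / (ones_T @ ones_T)    # projection coefficient: the mean m_T
r_T = y - m_T * ones_T                    # residual, orthogonal to 1_T
assert np.isclose(r_T @ ones_T, 0.0)

# Pythagorean / bias-variance decomposition (Equation 4)
lhs = np.sum((y - z * ones_T) ** 2)       # || y^(T) - z 1_T ||^2
rhs = r_T @ r_T + T * (m_T - z) ** 2      # || r_T ||^2 + || (m_T - z) 1_T ||^2
assert np.isclose(lhs, rhs)
```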
Now we would like to come up with an orthogonal decomposition which relates \(r_{k+1}\) to \(r_{k}\).
\[
\begin{align}
r_{k+1} &= y^{(k+1)} - m_{k+1}\mathbb 1_{k+1} \\
&= (y^{(k)} - m_k \mathbb 1_k) + (m_k- m_{k+1}) \mathbb 1_{k+1} + (y_{k+1}-m_k)e_{k+1} \\
&= r_k + (m_k- m_{k+1}) \mathbb 1_{k+1} + (y_{k+1}-m_k)e_{k+1} \\
&= r_k - \frac{1}{k+1}(y_{k+1} - m_k ) \mathbb 1_{k+1} + (y_{k+1}-m_k)e_{k+1}
\end{align}
\] Here \(e_k\) is the unit vector in the \(k\)th coordinate. In the last line we used the mean recurrence \(y_{k+1}+k m_k = (k+1) m_{k+1}\) to rewrite \(m_k-m_{k+1}\). The important thing here is that \(r_k \perp \mathbb 1_{k+1}\) and \(r_k \perp e_{k+1}\) so \(r_k\) is orthogonal to the terms which come after it. Therefore \[
\begin{align}
\lVert r_{k+1} \rVert^2 &= \lVert r_k \rVert^2 + (y_{k+1} - m_k )^2 \lVert e_{k+1} - (k+1)^{-1} \mathbb 1_{k+1}\rVert ^2\\
&= \lVert r_k \rVert^2 + (y_{k+1} - m_k )^2 \left(\frac{k^2}{(k+1)^2} + \frac k {(k+1)^2}\right) \\
&= \lVert r_k \rVert^2 + \frac k {k+1} (y_{k+1} - m_k )^2
\end{align}
\tag{5}\] The term \(\frac k {k+1} (y_{k+1} - m_k )^2\) is the energy added to the residual at step \(k\) as we successively project onto the subspaces spanned by the vectors \(\mathbb 1_k\).
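To make the orthogonality claims concrete, here is a small sketch that builds the residual vectors \(r_k\) explicitly in \(\mathbb R^T\) and checks both the orthogonality relations and the norm recurrence in Equation 5; again, the sequence and \(T\) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 15
y = rng.normal(size=T)
m = np.cumsum(y) / np.arange(1, T + 1)    # running means m_1, ..., m_T

def r(k):
    """Residual r_k = y^(k) - m_k 1_k, embedded in R^T (zero past coordinate k)."""
    v = np.zeros(T)
    v[:k] = y[:k] - m[k - 1]
    return v

for k in range(1, T):
    ones_next = np.zeros(T)
    ones_next[: k + 1] = 1.0              # the vector 1_{k+1}
    e_next = np.zeros(T)
    e_next[k] = 1.0                       # the unit vector e_{k+1}
    assert np.isclose(r(k) @ ones_next, 0.0)   # r_k is orthogonal to 1_{k+1}
    assert np.isclose(r(k) @ e_next, 0.0)      # r_k is orthogonal to e_{k+1}
    gain = k / (k + 1) * (y[k] - m[k - 1]) ** 2
    assert np.isclose(r(k + 1) @ r(k + 1), r(k) @ r(k) + gain)  # Equation 5
```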
Noting that \(r_1 = 0\), we collapse the recurrence as before to get \[ \lVert r_T \rVert^2 = \sum_{k=1}^{T-1} (\lVert r_{k+1} \rVert^2-\lVert r_k \rVert^2) = \sum_{k=1}^{T-1} \frac k {k+1} (y_{k+1} - m_k )^2 \] Combining this with Equation 4 and expanding the norms we get \[ \begin{align} \sum_{k=1}^T (y_k - z)^2 &= \lVert y^{(T)} - z\mathbb 1_T \rVert^2 \\ &= \lVert r_T \rVert^2 + \lVert (m_T- z)\mathbb 1_T \rVert^2 \\ &= \sum_{k=1}^{T-1} \frac k {k+1} (y_{k+1} - m_k )^2 + T (m_T-z)^2 \end{align} \] which is a rearrangement of the terms in the lemma.
Double-sum orthogonal decomposition
There’s another approach which is more algebra-heavy than the others, so I’ll skip it. But the short version is that by expanding \((y_{k+1}-m_k)^2\) in \(\sum_{k=1}^{T-1} \frac k {k+1} (y_{k+1}-m_k)^2\) and changing the order of summation you get a term like \(\frac 1 {k(k+1)}\) which telescopes.
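Though I’ll skip the algebra, the identity that argument would establish, \(\sum_{k=1}^{T-1} \frac k {k+1} (y_{k+1}-m_k)^2 = \sum_{k=1}^T (y_k-m_T)^2\), can at least be confirmed symbolically for a small \(T\). A sketch using SymPy (the choice \(T=5\) is arbitrary):

```python
import sympy as sp

T = 5
y = sp.symbols(f"y1:{T + 1}")                      # symbols y1, ..., y5
m = [sp.Add(*y[:k]) / k for k in range(1, T + 1)]  # m[k-1] is the mean m_k

lhs = sum(sp.Rational(k, k + 1) * (y[k] - m[k - 1]) ** 2 for k in range(1, T))
rhs = sum((y_k - m[T - 1]) ** 2 for y_k in y)
assert sp.expand(lhs - rhs) == 0                   # the two sides agree identically
```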