I've seen many probability concentration bounds in random places but I always struggle to remember any of them. This is a summary of the second chapter of High-Dimensional Statistics by Martin J. Wainwright.
So the purpose of this post is twofold:
Spoiler: many sentences are blatant copies from the aforementioned book, for which I claim zero credit. All the proofs are merely sketches of the key ideas. This is just a sincere attempt to appreciate other people's work.
Note this makes perfect sense: when the variance is small, the variable is more likely to concentrate around its mean.
Proof: define $Y=(X-\mu)^2$ and then apply Markov's inequality.
Extension: more generally, \begin{equation*} \mathbb P (|X-\mu| \geq t) \leq \frac{\mathbb E[|X-\mu|^k]}{t^k} \hspace{0.5cm} \text{for all $t >0$} \label{eq:mb} \tag{3} \end{equation*}
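Here is a quick Monte Carlo sanity check I threw together (not from the book; the Exponential(1) choice and the sample size are mine). It compares the empirical tail with the empirical $k$-th moment bound:

```python
# Sanity check of the moment bound (3) for X ~ Exponential(1). Since both sides
# are computed on the same sample, Markov's argument holds exactly on the
# empirical distribution, so the inequality should hold for every (k, t).
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)
mu = 1.0  # mean of Exponential(1)

for k in (2, 3, 4):
    for t in (1.0, 2.0, 3.0):
        tail = np.mean(np.abs(x - mu) >= t)            # P(|X - mu| >= t), empirical
        bound = np.mean(np.abs(x - mu) ** k) / t ** k  # E|X - mu|^k / t^k, empirical
        print(f"k={k}, t={t}: tail {tail:.4f} <= bound {bound:.4f}")
```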
The Gaussian tail bound motivates the definition of sub-Gaussian:
Definition: A random variable $X$ with mean $\mu=\mathbb E[X]$ is sub-Gaussian if there is a positive number $\sigma$ such that $$\mathbb E[e^{\lambda(X-\mu)}] \leq e^{\frac{\sigma^2 \lambda^2}{2}} \hspace{0.5cm} \text{for all $\lambda \in \mathbb R$}$$ $\sigma$ is referred to as the sub-Gaussian parameter.
Observation 1: $-X$ is sub-Gaussian iff $X$ is sub-Gaussian (simply by the symmetry of the definition: replace $\lambda$ with $-\lambda$).
Now, by observations 1 and 3, we can derive the concentration inequality for a sub-Gaussian variable: \begin{equation*} \mathbb P [|X -\mu| \geq t] \leq 2e^{\frac{-t^2}{2 \sigma ^2}} \hspace{0.5cm} \text{for all $t \geq 0$} \label{eq:sgci} \tag{6} \end{equation*}
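As a quick sanity check (mine, not the book's): a $N(\mu, \sigma^2)$ variable is sub-Gaussian with parameter $\sigma$, so its empirical two-sided tail should sit below the bound in (6):

```python
# Compare the empirical two-sided tail of X ~ N(mu, sigma^2) with the
# sub-Gaussian concentration bound 2*exp(-t^2 / (2*sigma^2)) from (6).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

for t in (1.0, 2.0, 4.0, 6.0):
    tail = np.mean(np.abs(x - mu) >= t)
    bound = 2 * np.exp(-t**2 / (2 * sigma**2))
    print(f"t={t}: empirical {tail:.5f} <= bound {bound:.5f}")
```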
Let $X$ be zero mean and supported on $[a, b]$. Recall Jensen's inequality: for a convex function $\Phi$ and a probability space $(\Omega, \mu)$, we have $\Phi (\int_{\Omega} f d \mu) \leq \int_{\Omega} (\Phi \circ f) d \mu$.
Let $X'$ be an independent copy of $X$. For any $\lambda \in \mathbb R$,
\begin{align} \mathbb E_X[e^{\lambda X}] &= \mathbb E_{X} [e^{\lambda (X- \mathbb E_{X'}[X'])}] \\ & \leq \mathbb E_X \mathbb E_{X'} [e^{\lambda (X-X')}]\\ & = \mathbb E_{X, X'} [e^{\lambda (X-X')}] \\ & = \mathbb E_{X, X'} [\mathbb E_{\epsilon} [e^{\lambda \epsilon (X-X')}]]\\ & \leq \mathbb E_{X, X'} [e^{\frac{\lambda ^2 (X-X')^2}{2}}] \\ & \leq e^{\frac{\lambda^2 (b-a)^2}{2}} \end{align}Here the first inequality is Jensen's inequality applied to the inner expectation over $X'$, the second equality holds because $X-X'$ is symmetric and hence equal in distribution to $\epsilon(X-X')$ for a Rademacher variable $\epsilon$, and the last two steps use the Rademacher MGF bound and $|X-X'| \leq b-a$. This shows $X$ must be sub-Gaussian with parameter at most $b-a$ (a sharper argument improves this to $\frac{b-a}{2}$).
The idea of introducing $X'$ and the Rademacher variable $\epsilon$ is called the "symmetrization argument".
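To convince myself, here is a small numerical check (my own choice of distribution, not from the book): a centered Bernoulli($p$) variable lives on $[a,b] = [-p, 1-p]$, so $b-a = 1$ and the bound above says its MGF is at most $e^{\lambda^2/2}$:

```python
# MGF of a centered Bernoulli(p) versus the symmetrization bound
# exp(lam^2 * (b - a)^2 / 2) with b - a = 1.
import numpy as np

p = 0.3
b_minus_a = 1.0
for lam in (-3.0, -1.0, -0.5, 0.5, 1.0, 3.0):
    mgf = np.exp(-lam * p) * (1 - p + p * np.exp(lam))  # E[exp(lam * (B - p))]
    bound = np.exp(lam**2 * b_minus_a**2 / 2)
    print(f"lam={lam:+.1f}: MGF {mgf:.4f} <= bound {bound:.4f}")
```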
For a zero-mean random variable $X$, the following conditions are equivalent to the sub-Gaussian definition from Section 5.1:
$\exists c \geq 0$ and $Z \sim N(0, \tau^2)$ such that $$\mathbb P[|X|>s] \leq c \mathbb P [|Z| \geq s] \hspace{0.5cm} \text{for all $s \geq 0$}$$
$\exists \theta \geq 0$ such that $$\mathbb E[|X|^{2k}] \leq \frac{(2k)!}{2^{k}k!} \theta^{2k} \hspace{0.5cm} \text{for all $k = 1,2 , \cdots$}$$
$\exists \sigma \geq 0$ such that $$\mathbb E [e^{\frac{\lambda X^2}{2\sigma^2}}] \leq \frac{1}{\sqrt{1-\lambda}} \hspace{0.5cm} \text{for all $\lambda \in [0,1)$}$$
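As a small aside (my own check, not from the book): the third condition actually holds with equality for a centered Gaussian, which is easy to verify by Monte Carlo (the $\lambda$ values are kept below $1/2$ so the estimator has finite variance):

```python
# For Z ~ N(0, sigma^2), E[exp(lam * Z^2 / (2*sigma^2))] = 1 / sqrt(1 - lam).
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
z = rng.normal(0.0, sigma, size=2_000_000)

for lam in (0.1, 0.3, 0.45):
    lhs = np.mean(np.exp(lam * z**2 / (2 * sigma**2)))
    rhs = 1.0 / np.sqrt(1.0 - lam)
    print(f"lam={lam}: Monte Carlo {lhs:.4f} vs 1/sqrt(1-lam) {rhs:.4f}")
```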
Recall the definition of sub-Gaussian requires $\mathbb E[e^{\lambda(X-\mu)}] \leq e^{\frac{\sigma^2 \lambda^2}{2}}$ to hold for all $\lambda$. We can of course relax this requirement, which leads to the more general sub-exponential family.
A random variable $X$ with mean $\mu=\mathbb E[X]$ is sub-exponential if there are non-negative parameters $(v, \alpha)$ such that $$\mathbb E[e^{\lambda(X-\mu)}] \leq e^{\frac{v^2 \lambda^2}{2}} \hspace{0.5cm} \text{for all $|\lambda| < \frac{1}{\alpha}$}$$
One way to verify whether a random variable is sub-exponential is to compute and bound its MGF $\mathbb E [e^{\lambda (X-\mu)}]$ analytically. This may be infeasible in many cases. Another strategy is to bound its polynomial moments. Here comes the Bernstein condition: we say that the Bernstein condition with parameter $b$ holds if $$|\mathbb E [(X-\mu)^k]| \leq \frac{1}{2} k!\sigma^2 b^{k-2} \hspace{0.5cm} \text{for $k= 2, 3, 4, \cdots$}$$
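To see where the geometric series in the next display comes from, expand the MGF in a power series and apply the Bernstein condition term by term (the linear term vanishes since $\mathbb E[X-\mu]=0$, and the $k=2$ term is exactly $\frac{\lambda^2\sigma^2}{2}$): $$\mathbb E [e^{\lambda(X-\mu)}] = 1+\frac{\lambda^2 \sigma^2}{2} + \sum_{k=3}^{\infty} \frac{\lambda^{k}\, \mathbb E[(X-\mu)^{k}]}{k!} \leq 1+\frac{\lambda^2 \sigma^2}{2} \sum_{k=2}^{\infty} (b |\lambda|)^{k-2}$$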
Now for $|\lambda| < \frac{1}{b}$, we can sum the geometric series and obtain $$\mathbb E [e^{\lambda(X-\mu)}] \leq 1+\frac{\lambda^2 \sigma^2 /2}{1-b|\lambda|} \leq e^{\frac{\lambda^2 \sigma^2 /2}{1-b|\lambda|}} \hspace{0.5cm} \text{since $1+x \leq e^{x}$}$$ which implies $$\mathbb E [e^{\lambda(X-\mu)}] \leq e^{\frac{\lambda^2 (\sqrt{2}\sigma)^2}{2}} \hspace{0.5cm} \text{for $|\lambda|<\frac{1}{2b}$}$$ i.e., $X$ is sub-exponential with parameters $(\sqrt{2}\sigma, 2b)$.
Based on Section 5.3, we can state the Bernstein-type bound formally:
Statement: for any random variable satisfying the Bernstein condition, we have \begin{equation*} \mathbb E [e^{\lambda (X -\mu)}] \leq e^{\frac{\lambda^2 \sigma^2/2}{1- b |\lambda|}} \hspace{0.5cm} \text{for $|\lambda| < \frac{1}{b}$} \label{eq:btb} \tag{10} \end{equation*}
and moreover the concentration inequality \begin{equation*} \mathbb P [|X -\mu| \geq t] \leq 2e^{\frac{-t^2}{2 (\sigma^2+bt)}} \hspace{0.5cm} \text{for $t \geq 0$} \label{eq:btc} \tag{11} \end{equation*}
Proof: the first inequality comes directly from Section 5.3. The second inequality follows from the Chernoff bound, in which we set $\lambda= \frac{t}{bt+\sigma^2} \in [0, \frac{1}{b})$.
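For completeness, the algebra behind that substitution (my own working, nothing deep): with $\lambda = \frac{t}{bt+\sigma^2}$ we get $1-b\lambda = \frac{\sigma^2}{bt+\sigma^2}$, hence $$-\lambda t + \frac{\lambda^2 \sigma^2/2}{1-b\lambda} = -\frac{t^2}{bt+\sigma^2} + \frac{t^2}{2(bt+\sigma^2)} = -\frac{t^2}{2(\sigma^2+bt)}$$ which is exactly the exponent in the concentration inequality (11).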
Remark: the Bernstein-type bound is often superior to the sub-Gaussian (Hoeffding-type) bound when controlling a bounded random variable ($|X-\mu| \leq b$). The most straightforward way to control such a variable is to argue that $(X-\mu)$ is sub-Gaussian with parameter $b$, giving the tail bound $2e^{-\frac{t^2}{2b^2}}$. Now observe that when $t$ is small in the inequality above, the variable $X$ behaves like a sub-Gaussian variable with parameter $\sigma$. Since $\sigma^2= \mathbb E (X-\mu)^2 \leq b^2$ (as $X$ is bounded), the Bernstein-type bound is at least as good in this regime, and much better when $\sigma^2 \ll b^2$!
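A tiny numerical illustration of this remark (the numbers are mine): compare the exponent $\frac{t^2}{2(\sigma^2+bt)}$ from (11) with the exponent $\frac{t^2}{2b^2}$ from the sub-Gaussian bound; a larger exponent means a tighter tail bound.

```python
# Compare the exponents of the Bernstein-type bound (11) and the sub-Gaussian
# bound with parameter b, for a variable bounded by b = 1 with small variance.
import numpy as np

b, sigma2 = 1.0, 0.01
for t in (0.05, 0.1, 0.3, 0.9):
    bern = t**2 / (2 * (sigma2 + b * t))
    hoef = t**2 / (2 * b**2)
    print(f"t={t}: Bernstein exponent {bern:.4f} vs sub-Gaussian {hoef:.4f} "
          f"({bern / hoef:.1f}x larger)")
```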
Remark 2: by observation 4 from section 5.1, we have $$\mathbb P [\frac{1}{n} \sum_{k=1}^{n} (X_k -\mu_k) \geq t] \leq e^{-\frac{n^2 t^2}{2v_{*}^2}} \hspace{0.5cm} \text{for $0 \leq t \leq \frac{v_{*}^2}{n\alpha_{*}}$}$$ $$\mathbb P [\frac{1}{n} \sum_{k=1}^{n} (X_k -\mu_k) \geq t] \leq e^{-\frac{nt}{2 \alpha_{*}}} \hspace{0.5cm} \text{for $t > \frac{v_{*}^2}{n\alpha_{*}}$}$$
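Here is a small helper (mine, not from the book) that evaluates this two-regime bound; the example plugs in the chi-squared parameters $(v, \alpha) = (2, 4)$ from observation 3, for which $v_* = 2\sqrt{n}$ and $\alpha_* = 4$.

```python
# Two-regime tail bound for the average of independent sub-exponential
# variables with aggregate parameters (v_star, alpha_star).
import numpy as np

def subexp_average_tail(t: float, n: int, v_star: float, alpha_star: float) -> float:
    """Upper bound on P[(1/n) * sum_k (X_k - mu_k) >= t]."""
    threshold = v_star**2 / (n * alpha_star)
    if t <= threshold:
        return np.exp(-n**2 * t**2 / (2 * v_star**2))  # Gaussian-like regime
    return np.exp(-n * t / (2 * alpha_star))           # exponential-tail regime

n = 100
print(subexp_average_tail(0.1, n, v_star=2 * np.sqrt(n), alpha_star=4.0))
print(subexp_average_tail(2.0, n, v_star=2 * np.sqrt(n), alpha_star=4.0))
```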
Statement: if $X \leq b$ almost surely, then \begin{equation*} \mathbb E [e^{\lambda (X -\mathbb E[X])}] \leq \exp(\frac{\frac{\lambda^2}{2} \mathbb E [X^2]}{1-\frac{b \lambda}{3}}) \hspace{0.5cm} \text{for $\lambda \in [0 , \frac{3}{b})$} \label{eq:obi1} \tag{12} \end{equation*} Consequently, given $n$ independent variables such that $X_i \leq b$ almost surely, we have \begin{equation*} \mathbb P [\sum_{i=1}^{n} (X_i -\mathbb E[X_i]) \geq n \delta] \leq \exp(-\frac{n\delta^2}{2 (\frac{1}{n} \sum_{i=1}^{n} \mathbb E[X_i^2] + \frac{b \delta}{3})}) \label{eq:obil2} \tag{13} \end{equation*}
Remark 1: obviously, if a random variable is bounded below, we can use the bound above to derive a bound on its lower tail (just apply the bound to $-X$).
Remark 2: Prof. Wainwright mentioned that the proof of this is very straightforward, which made me wonder what straightforward really meant.
Proof: Define $h(u) = 2 \frac{e^{u} - u -1}{u^2} = 2\sum_{k=2}^{\infty} \frac{u^{k-2}}{k!}$ (seriously, any proof starting by defining some random function shouldn't be called straightforward). We have $$\mathbb E [e^{\lambda X}] = 1+\lambda \mathbb E[X] + \frac{1}{2} \lambda^2 \mathbb E [X^{2} h(\lambda X)]$$ (ohh, now I kind of get where the function $h$ comes from). Note that for all scalars $x<0$, $x' \in [0, b]$ and $\lambda >0$, we have $$h(\lambda x) \leq h(0) \leq h(\lambda x') \leq h(\lambda b)$$ Since $X \leq b$ almost surely, we also have $\mathbb E[X^{2} h(\lambda X)] \leq \mathbb E[X^2] h(\lambda b)$. It follows that $$\mathbb E [e^{\lambda(X- \mathbb E[X])}] \leq e^{-\lambda \mathbb E [X]}(1+\lambda \mathbb E[X] + \frac{1}{2} \lambda^2 \mathbb E[X^2] h(\lambda b)) \leq \exp(\frac{\lambda^2 \mathbb E[X^2]}{2} h(\lambda b))$$ The first inequality follows from the fact that $X$ is bounded by $b$ and $e^{-\lambda \mathbb E [X]}$ is simply a constant; the second inequality follows from $x +1 \leq e^{x}$, so that the expression in the bracket is upper bounded by $\exp(\lambda \mathbb E[X] + \frac{1}{2} \lambda^2 \mathbb E[X^2] h(\lambda b))$. To finish the proof of the first bound (12), it remains to show that $h(\lambda b) \leq (1-\frac{\lambda b}{3})^{-1}$ for $\lambda b <3$. Note the inequality $k! \geq 2(3^{k-2})$ holds for $k \geq 2$, so we have $$h(\lambda b)= 2\sum_{k=2}^{\infty} \frac{(\lambda b)^{k-2}}{k!} \leq \sum_{k=2}^{\infty} (\frac{\lambda b}{3})^{k-2}= \frac{1}{1-\lambda b/3}$$ where the condition $\frac{\lambda b}{3} \in [0,1)$ guarantees the convergence of the geometric series. To prove the second bound (13), we apply the Chernoff bound together with the inequality above and obtain $$\mathbb P [\sum_{i=1}^{n} (X_i - \mathbb E[X_i]) \geq n\delta] \leq \exp(-\lambda n \delta + \frac{\frac{\lambda^2}{2}\sum_{i=1}^{n} \mathbb E [X_i^2]}{1-\frac{b \lambda}{3}}) \hspace{0.5cm} \text{valid for $b \lambda \in [0,3)$}$$ Substituting $$\lambda = \frac{n \delta}{ \sum_{i=1}^{n} \mathbb E [X_i^2] +\frac{n\delta b}{3}} \in [0, \frac{3}{b})$$ yields the desired bound.
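A quick Monte Carlo sanity check of the one-sided bound (13) (the Uniform$[0,1]$ variables, $n$, and $\delta$ values are my own choices, not the book's):

```python
# Check P[sum_i (X_i - E[X_i]) >= n*delta] against the one-sided Bernstein
# bound (13) for X_i ~ Uniform[0, 1], which satisfies X_i <= b = 1 and has
# E[X_i^2] = 1/3.
import numpy as np

rng = np.random.default_rng(0)
n, b, trials = 50, 1.0, 200_000
ex2 = 1.0 / 3.0

samples = rng.uniform(0.0, 1.0, size=(trials, n))
sums = samples.sum(axis=1) - n * 0.5  # sum of (X_i - E[X_i])

for delta in (0.05, 0.1, 0.2):
    empirical = np.mean(sums >= n * delta)
    bound = np.exp(-n * delta**2 / (2 * (ex2 + b * delta / 3)))
    print(f"delta={delta}: empirical {empirical:.5f} <= bound {bound:.5f}")
```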
Motivation: suppose we want to do dimension reduction from dimension $d$ to dimension $m$. Meanwhile, we hope the mapping can preserve the pairwise distances. What mapping options do we have?
Statement: let $X \in \mathbb R^{m \times d}$ be a random matrix filled with independent $N(0,1)$ entries, and consider the linear mapping $F: \mathbb R^{d} \longrightarrow \mathbb R^{m}$ given by $u \longmapsto Xu/\sqrt{m}$. Given $N$ data points $u^{1}, \cdots, u^{N} \in \mathbb R^{d}$ and some tolerance $\delta \in (0,1)$, the bound $$(1-\delta) \leq \frac{\|F(u^{i})- F(u^{j})\|^2}{\|u^{i}-u^{j}\|^2} \leq (1+\delta) \hspace{0.5cm} \text{for all pairs $u^{i} \neq u^{j}$}$$ holds with high probability (provided $m$ is large enough, as quantified below).
Proof: let $x_i \in \mathbb R^{d}$ denote the $i$-th row of the matrix, and consider some fixed vector $u \neq 0$. Since $x_i$ is a standard normal vector, the variable $\langle x_i, \frac{u}{\|u\|}\rangle$ follows the $N(0,1)$ distribution, and hence the quantity $$Y:=\frac{\|Xu\|^2}{\|u\|^2}=\sum_{i=1}^{m} \langle x_i, u/\|u\|\rangle^2$$ is a chi-squared variable with $m$ degrees of freedom. Now by observation 3 from section 5.1, $Z^2$ is sub-exponential with parameters $(2,4)$ for $Z \sim N(0,1)$. Then by observation 4 from the same section, we know $Y$ is sub-exponential with parameters $(2\sqrt{m}, 4)$. Then by remark 2 from the last section, for any chi-squared variable with $n$ degrees of freedom, we have the tail bound $$\mathbb P[|\frac{1}{n} \sum_{k=1}^{n} Z_k^2 -1| \geq t] \leq 2e^{-nt^2/8} \hspace{0.5cm} \text{for all $t \in (0,1)$}$$ Applying this bound to $Y$ (with $n=m$) yields $$\mathbb P[|\frac{\|Xu\|^2}{m\|u\|^2}-1| \geq \delta] \leq 2e^{-m\delta^2/8} \hspace{0.5cm} \text{for all $\delta \in (0,1)$}$$ Recalling the definition of $F$, the inequality above is equivalent to $$\mathbb P[\frac{\|F(u)\|^2}{\|u\|^2} \not \in [(1-\delta),(1+\delta)]] \leq 2e^{-m \delta^2 /8} \hspace{0.5cm} \text{for any fixed $0 \neq u \in \mathbb R^{d}$}$$ Since we have $N \choose 2$ distinct pairs of data points, the union bound gives $$\mathbb P[\frac{\|F(u^{i})- F(u^{j})\|^2}{\|u^{i}-u^{j}\|^2} \not \in [(1-\delta),(1+\delta)] \text{ for some $u^{i} \neq u^{j}$}] \leq 2 {N\choose 2} e^{-m \delta^2 /8}$$ Observe that for any $\epsilon \in (0,1)$, if we take $m$ sufficiently large, the probability above can be bounded above by $\epsilon$. Specifically, some algebra shows $m >\frac{16}{\delta^2} \log (N/\epsilon)$ suffices.
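To make this concrete, here is a compact sketch (mine) of the random projection in the statement: it projects $N$ random points from dimension $d$ down to $m = \lceil \frac{16}{\delta^2}\log(N/\epsilon) \rceil$ and reports the worst pairwise distortion.

```python
# Random projection F(u) = X u / sqrt(m) applied to N points in R^d, with m
# chosen according to the bound m > (16 / delta^2) * log(N / eps).
import numpy as np

rng = np.random.default_rng(0)
N, d, delta, eps = 50, 1000, 0.5, 0.01
m = int(np.ceil(16 * np.log(N / eps) / delta**2))

U = rng.normal(size=(N, d))      # N arbitrary data points (rows)
X = rng.normal(size=(m, d))      # projection matrix with N(0,1) entries
F = U @ X.T / np.sqrt(m)         # row i is F(u^i)

worst = 0.0
for i in range(N):
    for j in range(i + 1, N):
        ratio = np.sum((F[i] - F[j])**2) / np.sum((U[i] - U[j])**2)
        worst = max(worst, abs(ratio - 1.0))
print(f"m = {m}, worst distortion |ratio - 1| = {worst:.3f} (tolerance delta = {delta})")
```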