\chapter{Math: Probability and Statistics}\label{ch:probability}
Crucial to the study of quantum physics, statistical physics, experiment, and
eventually lattice field theory is an understanding of probability and
statistics. These skills are important for understanding all of modern science,
not just physics, and they are also useful in everyday
circumstances like politics.
You already have a rough idea of probability: It's a way of saying whether you
think something will happen or not, while in general signaling that you aren't
completely certain. We will now make these ideas precise.
Parts of this presentation follow selected parts
of Chapters~1 and~2 of Berg~\cite{berg_markov_2004}.
\section{Preliminaries}\label{sec:probprelim}
We will consider a {\it random variable}\index{random variable}\footnote{
We will try to denote random variables by capital letters.} $X$ which
has some possible {\it outcomes}\index{outcome} $x_i$. The random variable will take
one of the values in $x_i$, but we do not know for sure which one.
The set of all possible outcomes
\begin{equation}
\Omega=\{x_1,x_2,...,x_n\},
\end{equation}
assuming there are $n\in\N$ possible outcomes, is called the {\it sample
space}\index{sample space}. If $|\Omega|$ is finite as in the above case, we
say that $X$ is {\it discrete}\index{discrete}. An {\it event}\index{event} $E$
is any subset of outcomes $E\subset\Omega$, and we assign it a {\it
probability}\index{probability} $\pr{E}$. Axiomatically speaking, the
probability has to fulfill two conditions:
\begin{enumerate}
\item $\Forall E$, $0\leq\pr{E}\leq1$.
\item $\pr{\Omega}=1$, i.e. the random variable $X$ needs to have some outcome
in $\Omega$. Another way to state this is that the probability of at least one
of the possibilities is 1.
\end{enumerate}
Pragmatically you can assign a probability by repeating some experiment $n$
times. If the event $E$ occurs $n_E$ times, then\footnote{Note that this limit
is nontrivial since $n_E$ will in general change as $n$ increases.}
\begin{equation}\label{eq:freqdef}
\pr{E}\equiv\lim_{n\to\infty}\frac{n_E}{n}.
\end{equation}
This method of assignment also gives a way to interpret
what a probability means; in particular $\pr{E}=0.3$ would mean that
if I repeat some experiment 10 times, I can expect to observe event
$E$ 3 times on average. This interpretation is called
the {\it frequentist}\index{frequentist approach} interpretation. We will
discuss another interpretation in the context of the
{\it Bayesian}\index{Bayes!approach} approach in
\secref{sec:bayes}.
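
To make \equatref{eq:freqdef} concrete, here is a minimal Python sketch
(the die, the number of repetitions, and the use of numpy are illustrative
choices, not part of the text) that estimates a probability by brute-force
repetition:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000                          # number of repetitions of the experiment
rolls = rng.integers(1, 7, size=n)   # fair six-sided die
n_E = np.sum(rolls == 6)             # how often the event E = "roll a 6" occurs
print(n_E / n)                       # estimate of Pr(E); tends to 1/6 as n grows
\end{verbatim}
With this many repetitions the estimate should agree with the exact value $1/6$
to roughly two digits.
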
Alternatively, you can make some theoretical assignment of probabilities to
events. For instance when one tosses a coin, if one has no reason to believe the
coin is biased\footnote{One way of looking at this theoretical assignment
is that we have no reason to suppose heads is ``more likely'' than tails,
i.e. there is no reason to prefer heads to tails.
We call this the {\it principle of insufficient reason}\index{principle!insufficient reason}.
This approach to probability
assignment can be generalized to an arbitrary $n$-state system, which leads
to the uniform distribution.}, it must be that
\begin{equation}
\pr{\text{heads}}=\pr{\text{tails}}=\frac{1}{2}.
\end{equation}
This theoretical assignment is nice, but it should eventually be checked against
some observation or measurement as well.
In the case that $|\Omega|<\infty$,
we might say that the probability of event $E$ is $\pr{E}=|E|/|\Omega|$. In
this instance, we say that the probability is {\it uniform}
\index{PDF!uniform} since every point in the sample space is equally likely.
Sometimes the cardinality of $\Omega$ is uncountably infinite. For instance you
may ask something like: What is the probability that a particle in this gas has
a speed between $v_1$ and $v_2$? In this case, the sample space is
\begin{equation}
\Omega=\{x\suchthat0\leq x\leq c\},
\end{equation}
where $c$ is the speed of light.
In such a case we speak of a {\it continuous} random variable $X$.
The probability that $X$ takes any one particular value in $\Omega$ is
zero\footnote{Intuitively you could ask yourself: What is the probability that
my speed is exactly $\pi$ to all infinitely many digits? When checking against
experiment, you will never be able to resolve all infinitely many digits, so at
best you must specify a range corresponding to the resolution of your
instrument. In formal mathematical language, we require $\Omega$ to be
{\it measurable}.}.
Nonzero probabilities can only be assigned to subsets of $\Omega$ with
nonzero measure, such as intervals. In the above
case, this corresponds to the range of velocities between $v_1$ and $v_2$.
The probability that $X$ takes a value in the small range of velocities $\dd x$
is denoted
\begin{equation}
\pr{X\in[x,x+\dd x]}=\dd{x}f(x),
\end{equation}
and hence for continuous random variables, the probability must instead
fulfill the properties
\begin{enumerate}
\item $0\leq\dd{x}f(x)\leq 1$ and
\item $\int_{x\in\Omega}\dd{x}f(x)=1$.
\end{enumerate}
In general for some integrable function
$f:\R\to\R$, we assign a
probability that $X$ lies in the interval $[a,b]$ by
\begin{equation}
\label{eq:cpr}
\pr{X\in[a,b]}=\int_a^b\dd{x}f(x).
\end{equation}
For the remainder of this chapter, we will
only be concerned with continuous random variables, and we will simply call
them random variables.
The function $f$ is called the {\it probability density function}
(PDF)\index{PDF}; meanwhile the {\it cumulative distribution function}
(CDF) is the function\index{CDF} $F(x)$ given by
\begin{equation}
F(x)\equiv \pr{X<x}=\int_{-\infty}^x\dd{t}f(t).
\end{equation}
\begin{example*}{}{}
Two examples of important probability distributions include the {\it Gaussian}
or {\it normal} distribution,\index{PDF!normal}
\begin{equation}
\gau(x,\hat{x},\sigma)\equiv\frac{1}{\sigma\sqrt{2\pi}}
\exp\Bigg(-\frac{(x-\hat{x})^2}{2\sigma^2}\Bigg)
\end{equation}
where $\sigma$ is the standard deviation of the distribution and $\hat{x}$ is
the mean, and the {\it Cauchy} \index{PDF!Cauchy} distribution,
\begin{equation}
\cau(x,\alpha)\equiv\frac{\alpha}{\pi\big(\alpha^2+x^2\big)}.
\end{equation}
I will refer to these special PDFs later, particularly the normal distribution.
I'll call their CDFs $\Gau$ and $\Cau$, respectively.
\end{example*}
Now that we know what probabilities and PDFs are, we can start thinking about
ways to characterize them. For example we can think about typical values taken
by a random variable from some distribution. We can get some information
from the mean and variance of a distribution. These are
both special cases of a more general concept.
In particular, let $n\in\N$.
The {\it $n\nth$ moment} \index{moment}of the distribution $f(x)$ is
\begin{equation}\label{dfn:mom}
\ev{X^n}=\int_{-\infty}^\infty\dd{x}x^nf(x).
\end{equation}
The mean and variance are the special cases $\hat{x}=\ev{X}$ and
$\sigma^2=\ev{(X-\hat{x})^2}$. Sometimes we call the mean the {\it expected
value} and sometimes we denote the variance $\variance$. Note that not all
probability distributions have well defined moments. The Cauchy distribution is
very ill-behaved in this regard, since its $n\nth$ moment diverges
$\Forall n\in\N$.
\begin{example*}{}{}
A {\it standardized moment}\index{moment!standardized} is some moment
that has been rescaled. It is usually rescaled by some power of the variance in
order that the quantity is unitless. The {\it skewness}\index{skewness} of
the random variable $X$ is
\begin{equation*}
\gamma_1\equiv\ev{\left(\frac{X-\hat{x}}{\sigma}\right)^3}.
\end{equation*}
The variance measures the average squared distance of $X$ from the mean of its
distribution, and hence can be interpreted as a measure of the distribution's
spread. The skewness, carrying an odd power of this separation, can become
negative, and we can interpret it as a measure of the asymmetry of the
distribution. Another useful standardized moment is the {\it kurtosis}\index{kurtosis}
\begin{equation*}
\kappa\equiv\ev{\left(\frac{X-\hat{x}}{\sigma}\right)^4}.
\end{equation*}
Note that if the distribution from
which $X$ is drawn features lots of large deviations from the mean, the
numerator will win out over the denominator. This will happen when there are
healthy contributions from the tails of the distribution, and hence the kurtosis
is a measure of the tailedness.
\end{example*}
Generally in the lab, one draws random variables from distributions
about which one has no a priori knowledge. Therefore in principle one
does not know the true moments of these distributions. The
definition \eqref{dfn:mom} suggests a way to estimate them.
Suppose you draw a sample $X_1,...,X_N$:
An {\it estimator}\index{estimator} of the $n\nth$ moment is
\begin{equation}
\bar{X}^n\equiv\frac{1}{N}\sum_{i=1}^N X_i^n.
\end{equation}
In the case $n=1$ we obtain the ordinary arithmetic average.
We use the hat to distinguish true values from estimators, which will
generally be denoted with a bar. For estimators of moments besides the
mean, we must be more careful; this is discussed in \secref{sec:bias}.
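
As an illustration of this estimator, here is a minimal Python sketch (the
normal parent distribution, its parameters, and the sample size are arbitrary
illustrative choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=1)
X = rng.normal(loc=2.0, scale=0.5, size=10_000)   # draws with mean 2, sigma 0.5

first_moment  = np.mean(X)      # estimator of <X>;   exact value is 2.0
second_moment = np.mean(X**2)   # estimator of <X^2>; exact value is 0.5**2 + 2.0**2
print(first_moment, second_moment)
\end{verbatim}
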
Consider two intervals $[a,b]$ and $[c,d]$ and two random variables
$X$ and $Y$ drawn from PDFs $f$ and $g$, respectively. Then $X$ and $Y$ are
said to be {\it independent}\index{random variable!independent} if
\begin{equation}\label{dfn:ind}
\pr{X\in[a,b]\text{ and }Y\in[c,d]}=\int_a^b\int_c^d\dd{x}\dd{y}f(x)\,g(y)
\end{equation}
Hence we see that the {\it joint PDF}\index{PDF!joint} of $X$ and $Y$
is $f(x)g(y)$. On the other hand, we say $X$ and $Y$ are {\it uncorrelated}
\index{random variable!uncorrelated} if
\begin{equation}
\ev{XY}=\ev{X}\ev{Y}.
\end{equation}
The {\it covariance}\index{covariance}
\begin{equation}\label{dfn:cov}
\Cov[X,Y]\equiv\ev{XY}-\ev{X}\ev{Y}
\end{equation}
can be used to give a measure of how correlated $X$ and $Y$ are, or
one can use the {\it correlation}\index{correlation}
\begin{equation}\label{dfn:cor}
\rho(X,Y)=\frac{\Cov[X,Y]}{\sigma_X\sigma_Y}.
\end{equation}
So equivalently we say $X$ and $Y$ are uncorrelated if $\rho(X,Y)=0$.
It's worth emphasizing that if $X$ and $Y$ are independent,
it follows that they are uncorrelated. This can be seen by applying
definition \eqref{dfn:mom} to the random variable $XY$, then using
definition \eqref{dfn:ind}. However if $X$ and $Y$ are
uncorrelated, {\it they can still be dependent}.
\begin{example*}{}{}
Here's an extreme example by Cosma Shalizi~\cite{cosma_indep}.
Let $X$ be uniformly distributed
on [-1,1] and let $Y=|X|$. Then clearly $Y$ depends on $X$. However it
is easy to see that $Y$ is uniform on [0,1] and $\ev{XY}=0=\ev{X}\ev{Y}$.
Hence $X$ and $Y$ are not correlated.
\end{example*}
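
The example above is easy to check numerically. The following sketch (sample
size and seed arbitrary) draws $X$ uniformly on $[-1,1]$, sets $Y=|X|$, and
shows that the sample correlation is consistent with zero even though $Y$ is a
deterministic function of $X$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=2)
X = rng.uniform(-1.0, 1.0, size=100_000)
Y = np.abs(X)                    # Y is completely determined by X...

rho = np.corrcoef(X, Y)[0, 1]    # ...yet the sample correlation
print(rho)                       # is consistent with zero
\end{verbatim}
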
The next two propositions show us how to add expectation values and
random variables. Let $X$ and $Y$ be independent random variables
drawn from PDFs $f$ and $g$, respectively.
\begin{proposition}{}{}
Let $a,b\in\R$ be constants. Then
$$\ev{aX+bY}=a\ev{X}+b\ev{Y}.$$
\begin{proof}
Since $X$ and $Y$ are independent, their joint PDF is $fg$. Then
\begin{equation*}
\begin{aligned}
\ev{aX+bY}&=\int\dd{x}\dd{y}(ax+by)f(x)g(y)\\
&=a\int\dd{x}\dd{y}x\,f(x)g(y)+
b\int\dd{x}\dd{y}y\,f(x)g(y)\\
&=a\int\dd{x}x\,f(x)+b\int\dd{y}y\,g(y)\\
&=a\ev{X}+b\ev{Y}.
\end{aligned}
\end{equation*}
\end{proof}
\end{proposition}
\begin{proposition}{}{addvars}
The PDF of the random variable $Z=X+Y$ is given by the convolution
\begin{equation*}
h(z)=\int_{-\infty}^\infty\dd{x}f(x)g(z-x)
\end{equation*}
\begin{proof}
The CDF of $Z$ is, according to \equatref{dfn:ind},
\begin{equation*}
H(z)=\int_{x+y\leq z}\dd{x}\dd{y}f(x)g(y)
=\int_{-\infty}^\infty\dd{x}f(x)\int_{-\infty}^{z-x}
\dd{y}g(y).
\end{equation*}
The PDF $h$ follows by differentiating with respect to $z$ and applying the
Fundamental Theorem of Calculus to the inner integral:
\begin{equation*}
h(z)=\dv{H}{z}
=\int_{-\infty}^\infty\dd{x}f(x)g(z-x).
\end{equation*}
\end{proof}
\end{proposition}
A sequence $\{X_N\}$ of random variables
\index{converge!in probability}{\it converges in probability}
toward the random variable $X$ if $\Forall\epsilon>0$
\begin{equation}
\lim_{N\to\infty}\pr{|X_N-X|>\epsilon}=0,
\end{equation}
and we write
\begin{equation}
X_N\xrightarrow{\text{P}}X.
\end{equation}
The sequence converges to $X$
\index{converge!almost surely}{\it almost surely} if
\begin{equation}
\pr{\lim_{N\to\infty}X_N=X}=1,
\end{equation}
and in this case we write
\begin{equation}
X_N\xrightarrow{\text{AS}}X.
\end{equation}
\index{Chebyshev's inequality}
\begin{theorem}{Chebyshev's inequality}{}
\index{Chebyshev's inequality}
Let $X$ be drawn from a PDF with mean $\hat{x}$ and variance
$\sigma^2$ and let $a>0$. Then
\begin{equation*}
\pr{|X-\hat{x}|>a\sigma}<a^{-2}.
\end{equation*}
\begin{proof}
Let $T=(X-\hat{x})^2$ be a new random variable with PDF $g$. Then
\begin{equation*}
\pr{|X-\hat{x}|>a\sigma}=\pr{T>a^2\sigma^2}
=\int_{a^2\sigma^2}^\infty\dd{t}g(t)
\end{equation*}
But
\begin{equation*}
\begin{aligned}
\sigma^2&=\int_{0}^\infty\dd{t}t\,g(t)
=\Bigg(\int_0^{a^2\sigma^2}
+\int_{a^2\sigma^2}^\infty\Bigg)\dd{t}t\,g(t)\\
&\geq\int_{a^2\sigma^2}^\infty\dd{t}t\,g(t)
>a^2\sigma^2\int_{a^2\sigma^2}^\infty\dd{t}g(t)
=a^2\sigma^2\pr{T>a^2\sigma^2}.
\end{aligned}
\end{equation*}
Dividing through by $a^2\sigma^2$ completes the proof.
\end{proof}
\end{theorem}
Chebyshev's inequality tells you that large deviations from the mean are
unlikely. Intuitively you expect that as the number of measurements
increases, the sample average tends toward the true mean. This is
called the {\it Law of Large Numbers}\index{LLN}
(LLN). To prove it, we set up as follows: Let $X_1,...,X_N$ be a sequence
of random variables drawn from a PDF
with mean $\hat{x}$ and variance $\sigma^2$.
\begin{theorem}{Weak LLN}{}
$$
\bar{X}\xrightarrow{\text{P}}\hat{x}.
$$
\begin{proof} Our proof will rely on Chebyshev's inequality, so we will
first need to compute the mean and variance of the distribution of
$\bar{X}$. All the $X_i$ are drawn from the same PDF, so
$$
\ev{\bar{X}}=\frac{1}{N}\sum_{i=1}^N\ev{X_i}
=\frac{N\hat{x}}{N}=\hat{x}.
$$
Meanwhile the variance of the distribution of $\bar{X}$ is
$$
\sigma^2_{\bar{X}}
=\variance\sum_{i=1}^N \frac{X_i}{N}
=\sum_{i=1}^N\frac{\sigma^2}{N^2}
=\frac{\sigma^2}{N}.
$$
Now let $\epsilon>0$. Then $\exists\,a>0$ with
$\epsilon=a\,\sigma_{\bar{X}}$. Hence by Chebyshev's
inequality we have
$$
\lim_{N\to\infty}\pr{|\bar{X}-\hat{x}|>\epsilon}
\leq\lim_{N\to\infty}\frac{\sigma^2_{\bar{X}}}{\epsilon^2}
=\lim_{N\to\infty}\frac{\sigma^2}{N\epsilon^2}=0.
$$
The probability cannot be less than 0, so we're done.
\end{proof}
\end{theorem}
The above proof relies on the PDF having a finite
variance. As it turns out, the Weak LLN is true
even when the variance is infinite! This can be proved
using characteristic functions. But since we do not introduce characteristic
functions until \secref{sec:CLT}, and since we assume in practice
that our data are drawn from PDFs with finite variance anyway,
we direct the reader elsewhere. For example, a proof can be
found on Wikipedia~\cite{Wiki_LLN}.
For completeness we also list the Strong LLN, but without proof.
Like the Weak LLN, the Strong LLN is true even when the PDF variance
is infinite.
\begin{theorem}{Strong LLN}{}
$$
\bar{X}\xrightarrow{\text{AS}}\hat{x}.
$$
\end{theorem}
Also in this section we list a sometimes useful fact about CDFs, which is
for example helpful in interpreting the results of some statistical
tests. (See e.g. \thmref{thm:gaudif}.)
\begin{proposition}{}{}
If a random variable $X$ is distributed according to the PDF $f$ with
corresponding CDF $F$, then the random variable $F(X)$ is distributed uniformly in $[0,1]$.
\begin{proof}
If a random variable $Y$ is distributed uniformly in [0,1], then
by definition we have $\Forall r\in [0,1]$
$$
\pr{0\leq Y\leq r}=r.
$$
So all we have to do is show that $F(X)$ behaves like $Y$. To that end,
let $r\in[0,1]$ and define $X^*\in\R$ by
$$
r=\int_{-\infty}^{X^*}\dd{s}f(s).
$$
Then
$$
\pr{0\leq F(X)\leq r}
=\pr{-\infty\leq X\leq X^*}
=\int_{-\infty}^{X^*}\dd{s}f(s)
=r.
$$
\end{proof}
\end{proposition}
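
This fact, sometimes called the probability integral transform, is also easy to
verify numerically. Here is a minimal sketch (assuming numpy and scipy are
available; the choice of a standard normal PDF is arbitrary):
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=3)
X = rng.normal(size=100_000)     # draws from a standard normal PDF
U = norm.cdf(X)                  # apply the corresponding CDF

# Each bin of the histogram on [0,1] should be roughly equally occupied.
counts, _ = np.histogram(U, bins=10, range=(0.0, 1.0))
print(counts / len(U))           # each entry is close to 0.1
\end{verbatim}
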
\section{The Poisson distribution}\label{sec:Poisson}\index{PDF!Poisson}
Let $k\in\N$ and $\lambda>0$. A discrete random variable $X$ has
a {\it Poisson distribution} if
\begin{equation}\label{eq:poisson}
\pr{X=k}=\frac{\lambda^k e^{-\lambda}}{k!}.
\end{equation}
Generally $k$ is interpreted as a number of occurrences and $\lambda$ is
interpreted as a rate, i.e. as a number of occurrences per unit ``time''.
This PDF has many useful applications. In physics, it is often used to model
the number of particle decays given a decay rate $\lambda$, or it can
model the probability of particles popping in and out of existence. More hilariously,
Von Bortkiewicz noted in his {\it Das Gesetz der Kleinen Zahlen} that a Poisson
distribution modeled well the frequency with which horses would just straight
up kick Prussian soldiers to death~\cite{funnyPrussianMan}.
Let us try to understand how one arrives at such a distribution.
Forget about the rate for the time being, and consider a sequence of
$N$ independent trials, where a ``success'' occurs with probability $p$.
Then the probability of $k$ successes is given by the {\it binomial
distribution}\index{PDF!binomial}
\begin{equation}\label{eq:binomial}
\pr{X=k}=\binom{N}{k}p^k(1-p)^{N-k}.
\end{equation}
This is relatively straightforward to interpret: You have $k$ successes,
which has probability $p^k$ since the trials are independent, and $N-k$
failures, which each must have probability $1-p$. Finally, there are
$\binom{N}{k}$ ways to choose which of the $N$ trials are the successes. We can arrive at the Poisson
distribution by taking a limit of the binomial distribution in which $N$ becomes
large while the expected number of successes $pN$ stays fixed.
More precisely, set $\lambda=pN$ and eliminate $p$ in \equatref{eq:binomial}.
One arrives at
\begin{equation}
\pr{X=k}=\frac{\lambda^k}{k!}\frac{N}{N}\frac{N-1}{N}\dots
\frac{N-k+1}{N}\left(1-\frac{\lambda}{N}\right)^N
\left(1-\frac{\lambda}{N}\right)^{-k}.
\end{equation}
Now take the limit $N\to\infty$ with $p\to0$ such that $\lambda$ remains
constant. All factors in the above expression tend to 1, with the
exception of the first and penultimate factors: In fact, the penultimate
factor tends to $e^{-\lambda}$, and hence you recover Poisson.
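
You can see this limit emerge numerically. The following sketch (assuming
scipy is available; the value of $\lambda$ and the grid of $k$ are arbitrary)
compares the binomial distribution with $p=\lambda/N$ to the Poisson
distribution as $N$ grows:
\begin{verbatim}
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0
k = np.arange(10)
for N in (10, 100, 1000):
    p = lam / N
    max_diff = np.max(np.abs(binom.pmf(k, N, p) - poisson.pmf(k, lam)))
    print(N, max_diff)   # shrinks as N grows with lambda = p*N held fixed
\end{verbatim}
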
In the context of the phase diagram of strongly interacting matter, it is
sometimes of interest to model net-baryon number\index{baryon number}, $N_B$.
This is defined as the difference between the number of baryons and number
of antibaryons. If one models the rates of production of each kind of particle
as a Poisson, it becomes useful to consider the distribution of the difference
of Poissons. Following \propref{prp:addvars}, the PDF of the difference should
be given by the convolution of two Poissons. For a discrete random variable $X$
capturing the difference between two Poissons with rates $\lambda_1$ and
$\lambda_2$, this comes out to be
\begin{equation}
\pr{X=k}=e^{-(\lambda_1+\lambda_2)}
\sum_{N=\max(0,-k)}^\infty \frac{\lambda_1^{k+N}\lambda_2^N}{N!(k+N)!}.
\end{equation}
It can be shown this implies that
\begin{equation}
\frac{\pr{X=k}}{\pr{X=-k}}=
\left(\frac{\lambda_1}{\lambda_2}\right)^k.
\end{equation}
Moreover the sum can be carried out in closed form, giving
\begin{equation}
\pr{X=k}=e^{-(\lambda_1+\lambda_2)}
\left(\frac{\lambda_1}{\lambda_2}\right)^{k/2}I_{|k|}\left(2\sqrt{\lambda_1\lambda_2}\right),
\end{equation}
where $I_k(z)$ is the modified Bessel function of the first kind.
This is known as the {\it Skellam distribution}\index{PDF!Skellam}.
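
As a numerical cross-check of the Skellam formula, the following sketch
(assuming numpy and scipy are available; the rates and sample size are
arbitrary) compares a brute-force histogram of the difference of two Poisson
draws to the closed-form expression above:
\begin{verbatim}
import numpy as np
from scipy.special import iv   # modified Bessel function of the first kind

def skellam_pmf(k, lam1, lam2):
    """Probability that the difference of two Poissons equals k."""
    return (np.exp(-(lam1 + lam2))
            * (lam1 / lam2) ** (k / 2.0)
            * iv(abs(k), 2.0 * np.sqrt(lam1 * lam2)))

rng = np.random.default_rng(seed=4)
lam1, lam2 = 4.0, 2.5
diff = rng.poisson(lam1, size=1_000_000) - rng.poisson(lam2, size=1_000_000)
for k in (-2, 0, 3):
    print(k, np.mean(diff == k), skellam_pmf(k, lam1, lam2))
\end{verbatim}
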
\section{The normal distribution}
\index{PDF!normal}
Now we're going to focus on results about the normal distribution specifically.
This first proposition will aid us in some of the calculations.
\begin{proposition}{}{gauss}
Let $\alpha>0$. Then
\begin{equation*}
\int_{-\infty}^\infty\dd{x}e^{-\alpha x^2}=\sqrt{\frac{\pi}{\alpha}}.
\end{equation*}
\begin{proof}
The trick is to just square the LHS:
\begin{equation*}\begin{aligned}
\left(\int_{-\infty}^\infty\dd{x}e^{-\alpha x^2}\right)^2
&=\int_{-\infty}^\infty\int_{-\infty}^\infty\dd{x}\dd{y}
e^{-\alpha(x^2+y^2)}\\
&=\int_0^\infty \dd{r}r\int_0^{2\pi}\dd{\theta}e^{-\alpha r^2}\\
&=\frac{\pi}{\alpha}.
\end{aligned}\end{equation*}
\end{proof}
\end{proposition}
For the next result
let $X_1$ and $X_2$ be two independent random variables drawn from normal
distributions with respective means $\hat{x}_1$ and $\hat{x}_2$ and
standard deviations $\sigma_1$ and $\sigma_2$.
\begin{proposition}{}{addgauss}
The random variable $Y=X_1+X_2$ is normally distributed with mean
$\hat{x}_1+\hat{x}_2$ and variance $\sigma_1^2+\sigma_2^2$.
\begin{proof}
By \propref{prp:addvars}, the sum $Y$ has the distribution
\begin{equation*}
g(y)=\frac{1}{2\pi\sigma_1\sigma_2}
\int_{-\infty}^\infty\dd{x}\exp\left[
-\frac{(x-\hat{x}_1)^2}{2\sigma_1^2}
-\frac{(y-x-\hat{x}_2)^2}{2\sigma_2^2}\right].
\end{equation*}
Pull everything out of the integral that does not depend on $x$,
then complete the square with what is left over.
One obtains for $g(y)$
$$
\frac{1}{2\pi\sigma_1\sigma_2}
\exp\left[-\frac{(y-\hat{x}_1-\hat{x}_2)^2}
{2(\sigma_1^2+\sigma_2^2)}\right]
\int_{-\infty}^\infty\dd{x}
\exp\left[-\frac{\sigma_1^2+\sigma_2^2}
{2\sigma_1^2\sigma_2^2}(x+C)^2\right],
$$
where $C$ is just a bunch of stuff that does not depend on $x$.
Therefore you can make the substitution $u=x+C$ with $\dd u=\dd x$ and
carry out the new integral using \propref{prp:gauss}.
The result is
\begin{equation*}
g(y)=\frac{1}{\sqrt{2\pi(\sigma_1^2+\sigma_2^2)}}
\exp\left[-\frac{(y-\hat{x}_1-\hat{x}_2)^2}
{2(\sigma_1^2+\sigma_2^2)}\right].
\end{equation*}
\end{proof}
\end{proposition}
Since the normal distribution is so important, so must be its CDF.
Unfortunately the integral of the normal PDF is {\it non-elementary};
\index{non-elementary} that is, it cannot be expressed in terms of
polynomials or standard
functions like $\sin$, $\cos$, or $\exp$. Therefore we give a name
to this special function.
The {\it error function}\index{error function} is
\begin{equation}
\erf(x)\equiv\frac{2}{\sqrt{\pi}}\int_0^x\dd{t}e^{-t^2}.
\end{equation}
Then we can write the Gaussian CDF with mean 0 as
\begin{equation}
\label{eq:gaussCDF}
\Gau(x,0,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}
\int_{-\infty}^x\dd{t}e^{-t^2/2\sigma^2}
=\frac{1}{2}+\frac{1}{2}
\erf\left(\frac{x}{\sqrt{2}\sigma}\right).
\end{equation}
Now we can list some pretty powerful applications of the normal distribution.
For instance one often must compare two empirical estimates of some mean.
Usually these estimates are different, and one might wonder whether this
disparity is real or just plain unlucky. More precisely:
\index{Gaussian difference test}
\begin{theorem}{Gaussian difference test}{}\label{thm:gaudif}
Suppose $\bar{X}$ and $\bar{Y}$ are normally distributed with
the same mean and standard deviations $\sigma_{\bar{X}}$ and
$\sigma_{\bar{Y}}$. Then the probability that $\bar{X}$
and $\bar{Y}$ differ by at least $D$ is
\begin{equation*}
\pr{|\,\bar{X}-\bar{Y}|>D}=1-\erf\left(\frac{D}
{\sqrt{2(\sigma_{\bar{X}}^2+\sigma_{\bar{Y}}^2)}}\right).
\end{equation*}
\begin{proof}
From \propref{prp:addgauss}, the random variable
$\bar{X}-\bar{Y}$ is normally distributed with mean 0
and variance $\sigma_D^2=\sigma_{\bar{X}}^2+\sigma_{\bar{Y}}^2$. Therefore by
\equatref{eq:gaussCDF}, the probability that $\bar{X}$ and
$\bar{Y}$ are at most $D$ apart is
\begin{equation*}
\begin{aligned}
\pr{|\,\bar{X}-\bar{Y}|<D}
&=\pr{-D<\bar{X}-\bar{Y}<D}\\
&=\Gau(D,0,\sigma_D)-\Gau(-D,0,\sigma_D)\\
&=1-2\Gau(-D,0,\sigma_D)\\
&=\erf\left(\frac{D}{\sqrt{2}\sigma_D}\right).
\end{aligned}
\end{equation*}
And of course, $\pr{|\,\bar{X}-\bar{Y}|>D}
=1-\pr{|\,\bar{X}-\bar{Y}|<D}$.
\end{proof}
\end{theorem}
In other words, the above theorem gives the probability that the
observed difference $|\bar{X}-\bar{Y}|$ is due to chance. This probability
is called the \index{q-value}{\it q-value}. In practice one sets some
threshold on $q$ below which one investigates further whether the
underlying distributions of the estimates are different. Often one
takes the threshold as 0.05.
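
A minimal implementation of the Gaussian difference test needs nothing beyond
the Python standard library (the numbers in the example call are arbitrary):
\begin{verbatim}
import math

def gaussian_difference_test(xbar, sig_xbar, ybar, sig_ybar):
    """q-value: probability that the two estimates differ by at least
    the observed amount purely by chance, assuming a common true mean."""
    D = abs(xbar - ybar)
    sigma_D = math.sqrt(sig_xbar**2 + sig_ybar**2)
    return 1.0 - math.erf(D / (math.sqrt(2.0) * sigma_D))

# Two estimates whose difference is comparable to the combined error bar:
print(gaussian_difference_test(1.00, 0.02, 1.05, 0.03))
\end{verbatim}
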
In practice, the true variances $\sigma_{\bar{X}}^2$ and $\sigma_{\bar{Y}}^2$
are not known, so one cannot use the above theorem directly.
When the sample size is large enough, one can approximate
these variances with some estimators $\bar{\sigma}_{\bar{X}}^2$ and
$\bar{\sigma}_{\bar{Y}}^2$ and just apply the Gaussian difference test.
If one is worried about the effect of finite sample size,
one can instead perform a {\it Student
difference test}\footnote{Student is actually a pseudonym. This test was
originally discovered by William Sealy Gosset, a statistician who worked
for Guinness. He may have used the pseudonym due to some kind of company policy.}
or {\it t-test}\index{t-test} to investigate whether the
discrepancy is due to chance. Suppose the estimate $\bar{X}$ comes from
$M$ data while $\bar{Y}$ comes from $N$ data.
Assume the two underlying distributions have the same variance, which
happens for instance when the sampling methods used are identical.
Let
\begin{equation}
t=\frac{D}{\bar{\sigma}_D},
\end{equation}
where $D=\bar{X}-\bar{Y}$, and
\begin{equation}
\bar{\sigma}^2_D=\left(\frac{1}{M}+\frac{1}{N}\right)
\frac{M(M-1)\,\bar{\sigma}_{\bar{X}}^2
+N(N-1)\,\bar{\sigma}_{\bar{Y}}^2}
{M+N-2}.
\end{equation}
Then the probability that the two estimates differ by at least $|D|$ is
\begin{equation}
q=I\left(z,\frac{\nu}{2},\frac{1}{2}\right),
\end{equation}
where $I$ is the regularized incomplete beta function, $\nu=M+N-2$,
and
\begin{equation}
z=\frac{\nu}{\nu+t^2}.
\end{equation}
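
A sketch of the Student difference test as just described might look as
follows (assuming scipy is available, whose scipy.special.betainc is the
regularized incomplete beta function; the example numbers are arbitrary):
\begin{verbatim}
import numpy as np
from scipy.special import betainc   # regularized incomplete beta function

def student_difference_test(xbar, var_xbar, M, ybar, var_ybar, N):
    """q-value for the difference of two means using the pooled variance."""
    var_D = (1.0 / M + 1.0 / N) * (M * (M - 1) * var_xbar
                                   + N * (N - 1) * var_ybar) / (M + N - 2)
    t = (xbar - ybar) / np.sqrt(var_D)
    nu = M + N - 2
    return betainc(nu / 2.0, 0.5, nu / (nu + t**2))

print(student_difference_test(1.00, 0.02**2, 20, 1.05, 0.03**2, 20))
\end{verbatim}
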
%At this point it is worth commenting when it is appropriate to use
%the sample average~\eqref{eq:arithmeticaverage} and the estimator
%\eqref{eq:MCMCconsistentEst}. The sample average is the preferred
%method to estimate an observable that can be directly calculated from
%the configurations, i.e. when $X_i=X(C_i)$.
%Such random variables are called {\it primary quantities},
%\index{primary quantity} and they include many simple observables such
%as the action, the absolute value of the Polyakov loop, and so on.
%However some quantities can only be calculated
%as functions of the sample average; these are called {\it secondary quantities}.
%\index{secondary quantity} An obvious example is the variance, which is defined
%only in terms of averages. But a less obvious example are particle masses,
%which cannot be directly calculated on any single configuration. In this
%case we are forced to use \equatref{eq:MCMCconsistentEst}.
\section{The central limit theorem}\label{sec:CLT}
Let $X$ and $Y$ be real random variables. Then we can construct a
complex random variable $F=X+iY$, and its expectation value will be
\begin{equation}
\ev{F}=\ev{X}+i\ev{Y}.
\end{equation}
This allows us to speak sensibly about Fourier transformations of
PDFs. In particular let $X$ be drawn from the PDF $f$.
The {\it characteristic function}
\index{random variable!characteristic function} of $X$ is
\begin{equation}
\phi(t)\equiv\ev{e^{itX}}=\int_{-\infty}^\infty\dd{x}e^{itx}f(x).
\end{equation}
Knowing the characteristic function of $X$ is equivalent to knowing its PDF;
this is because we can take the inverse Fourier transformation
\begin{equation}
f(x)=\frac{1}{2\pi}\int_{-\infty}^\infty\dd{t}e^{-itx}\phi(t),
\end{equation}
which follows from the Fourier representation of the Dirac $\delta$-function. The derivatives of the
characteristic function are easily calculated to be
\begin{equation}
\phi^{(n)}(t)=i^n\int_{-\infty}^\infty\dd{x}x^ne^{itx}f(x);
\end{equation}
therefore
\begin{equation}
\phi^{(n)}(0)=i^n\ev{X^n}.
\end{equation}
If $|f(x)|$ falls off faster than any power of $x$, i.e. faster than $|x|^{-m}$ for every $m\in\N$, then
it follows from the above equation that all moments exist, and the
characteristic function is analytic in $t$ about $t=0$.
These are some neat properties of characteristic functions; however
our main use for them is summarized in the next proposition.
\begin{proposition}{}{addchars}
The characteristic function of a sum of independent random variables equals
the product of their characteristic functions.
\begin{proof}
Let $X_1$,...,$X_N$ be drawn from PDFs $f_1$,...,$f_N$ with
corresponding characteristic functions $\phi_1,...,\phi_N$, and
let $Y=\sum_j X_j$. Then using the definition of the characteristic
function we obtain
$$
\phi_Y(t)=\ev{e^{it\sum_j X_j}}=\ev{\prod_{j=1}^N e^{it X_j}}
=\prod_{j=1}^N \ev{e^{it X_j}}=\prod_{j=1}^N\phi_j(t),
$$
where we used independence for the third equality.
\end{proof}
\end{proposition}
Now suppose you are an experimenter taking independent measurements of
some observable. Furthermore suppose you do not know anything about the
observable, except that it comes from some distribution with finite variance.
The central limit theorem (CLT) says that armed with this information alone,
you know that for a large number of measurements the sample mean will be
approximately normally distributed about the true mean. Here is the precise statement.
\begin{theorem}{Central limit theorem}{}
\index{CLT}
Let $X_1,...,X_N$ be $N$ independent random variables drawn from PDF $f$.
Suppose further that $f$ has mean $\hat{x}$ and variance $\sigma^2$.
Then the PDF of the estimator $\bar{X}$ converges to
$\gau(\bar{x},\hat{x},\sigma/\sqrt{N})$.
\begin{proof}
What we're going to do is look at the characteristic function
$\phi_S$ of the random variable
$$
S\equiv\bar{X}-\hat{x}=\frac{X_1+...+X_N-N\hat{x}}{N}.
$$
If we can show that $\phi_S$ converges to the characteristic function
corresponding to $\gau(s,0,\sigma/\sqrt{N})$, then we are finished.
In order to show this, we first need the characteristic function for
the distribution $\gau(s,0,\sigma/\sqrt{N})$. By completing the
square and using \propref{prp:gauss}, we find
\begin{equation*}
\begin{aligned}
\phi_{\text{gau}}
&=\frac{1}{\sigma}\sqrt{\frac{N}{2\pi}}\int_{-\infty}^\infty\dd{s}
e^{its}\exp\left[-\frac{s^2N}{2\sigma^2}\right]\\
&=\frac{1}{\sigma}\sqrt{\frac{N}{2\pi}}
\exp\left[-\frac{\sigma^2t^2}{2N}\right]
\int_{-\infty}^\infty\dd{s}
\exp\left[-\frac{N}{2\sigma^2}(s-C)^2\right]\\
&=\exp\left[-\frac{\sigma^2t^2}{2N}\right],
\end{aligned}
\end{equation*}
where $C$ is a number that does not depend on $s$. It remains to show
$\phi_S=\phi_{\text{gau}}$. By \propref{prp:addchars} we have
$$
\phi_S(t)=\phi_{\frac{1}{N}\sum X_i-\hat{x}}(t)
=\left[\phi_{X-\hat{x}}\left(\frac{t}{N}\right)\right]^N,
$$
where $\phi_{X-\hat{x}}$ is the characteristic function corresponding
to the random variable $X-\hat{x}$. Call its PDF $g$. From the
properties of $f$, we know that $g$ has mean 0 and variance $\sigma^2$.
Therefore by expanding $\phi_S$ about $t=0$ and using the
definition \eqref{dfn:mom}, we find
$$
\phi_S(t)=\left[1-\frac{\sigma^2t^2}{2N^2}
+\order{\frac{t^3}{N^3}}\right]^N
=\exp\left[-\frac{\sigma^2t^2}{2N}\right]
+\order{\frac{t^3}{N^2}},
$$
as desired.
\end{proof}
\end{theorem}
Since the variance of the estimator $\bar{X}$ tends to 0 for large $N$,
it follows that the sample mean converges to the true mean $\hat{x}$.
In particular for large $N$, we expect the true mean to be within
$\sigma/\sqrt{N}$ of the estimator roughly 68\% of the time.
\tabref{tab:normal} gives the area under a Gaussian curve
for different numbers of standard deviations away from the mean.
\begin{table}
\centering
\caption{Table of areas under the curve for the normal distribution.
The last column gives the probability that a random variable
drawn from the distribution falls at least the given number of error bars
away from the mean.}
\begin{tabularx}{\linewidth}{cCr}
\hline\hline
Number of $\sigma$ from $\hat{x}$ & Area under curve & About 1 in ...\\
\hline
1 & 0.682 689 49 & 3\\
2 & 0.954 499 74 & 22\\
3 & 0.997 300 20 & 370\\
4 & 0.999 936 66 & 15 787\\
5 & 0.999 999 43 & 1 744 278\\
\hline\hline
\end{tabularx}
\label{tab:normal}
\end{table}
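
The CLT is easy to see in action numerically. Here is a minimal Python sketch
(the exponential parent distribution, sample sizes, and seed are arbitrary
illustrative choices) that draws many samples from a decidedly non-normal PDF
and inspects the distribution of their sample means:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=5)
N, samples = 100, 20_000

# Exponential PDF with mean 1 and variance 1 (so sigma/sqrt(N) = 0.1).
X = rng.exponential(scale=1.0, size=(samples, N))
xbar = X.mean(axis=1)

print(xbar.mean())                # close to the true mean 1
print(xbar.std(ddof=1))           # close to sigma/sqrt(N) = 0.1
inside = np.mean(np.abs(xbar - 1.0) < 1.0 / np.sqrt(N))
print(inside)                     # close to 0.68, cf. the first row of the table
\end{verbatim}
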
\section{Bias}\label{sec:bias}
For this section consider independent random variables $X_1,...,X_N$
drawn from a distribution with mean $\hat{x}$ and variance $\sigma^2$.
Earlier we recovered the familiar estimator for the mean, which was
just the ordinary arithmetic average. But what about an estimator for
the variance? Naively one might write
\begin{equation}\label{eq:bad}
\bar{\sigma}^2_{\text{biased}}=\frac{1}{N}\sum_{i=1}^N(X_i-\bar{X})^2.
\end{equation}
While we expect this estimator to converge\footnote{An estimator that
converges to the correct result as $N\to\infty$ is
{\it consistent}\index{consistent}. Note that unbiased and consistent
do not mean the same thing; in particular $\bar{\sigma}^2_{\text{biased}}$
is consistent but biased.} to the
exact result in the limit $N\to\infty$, it disagrees with
$\sigma^2$ for small $N$. Most glaringly when $N=1$, the
estimator is zero, regardless of the exact result.
\index{bias}
An estimator is said to be {\it biased} when its expectation value
does not agree with the exact result. The difference between the
expectation value of the estimator and the exact result is
correspondingly called the {\it bias}. When they agree, we say
the estimator is {\it unbiased}.
\begin{proposition}{}{stdev}
An unbiased estimator of the variance is
$$
\bar{\sigma}^2=\frac{1}{N-1}\sum_{i=1}^N(X_i-\bar{X})^2.
$$
\begin{proof}
To construct an unbiased estimator of the variance,
we will determine the bias of the estimator, then remove it. Note
\begin{equation*}
\ev{\bar{\sigma}^2_{\text{biased}}}=\frac{1}{N}\sum\limits_{i=1}^N
\left(\ev{X_i^2}-2\ev{X_i\bar{X}}+\ev{\bar{X}^2}\right).
\end{equation*}
Let us analyze the above equation term by term. Since the random
variables $X_i$ are drawn from the same distribution, the first term
equals $\ev{X^2}$ for each $i$. Next the
second term can be rewritten as
\begin{equation*}
\begin{aligned}
\ev{X_i\bar{X}}&=\frac{1}{N}\left(\ev{X_i^2}+
\sum_{j\suchthat j\neq i}\ev{X_iX_j}\right)\\
&=\frac{1}{N}\left(\ev{X^2}+(N-1)\ev{X}^2\right)\\
&=\frac{1}{N}\left(\ev{X^2}-\ev{X}^2\right)+\ev{X}^2\\
&=\frac{\sigma^2}{N}+\hat{x}^2,
\end{aligned}
\end{equation*}
where in the second line we used the independence of the $X_i$. Finally
for the last term we have
$$
\ev{\bar{X}^2}=\ev{\frac{1}{N^2}\sum_{i,j}X_iX_j}
=\frac{1}{N^2}\left(N\ev{X^2}+\sum_{i,j\suchthat i\neq j}\hat{x}^2\right)
=\frac{\sigma^2}{N}+\hat{x}^2,
$$
where we again used independence in the second equality. Plugging
everything into $\ev{\bar{\sigma}^2_{\text{biased}}}$ gives
\begin{equation*}
\ev{\bar{\sigma}^2_{\text{biased}}}
=\frac{1}{N}\sum_{i=1}^N
\left(\ev{X^2}-\frac{\sigma^2}{N}-\hat{x}^2\right)
=\left(\frac{N-1}{N}\right)\sigma^2.
\end{equation*}
This equation shows us the bias is $-\sigma^2/N$. Therefore according
to this equation, an unbiased estimator of the variance is
$$
\bar{\sigma}^2=\left(\frac{N}{N-1}\right)\bar{\sigma}_\text{biased}^2
=\frac{1}{N-1}\sum_{i=1}^N(X_i-\bar{X})^2
$$
as we wished to show.
\end{proof}
\end{proposition}
\vspace{1cm}
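
\propref{prp:stdev} is straightforward to verify numerically. The sketch below
(parent distribution, sample size, and number of repetitions are arbitrary
choices) compares the expectation values of the two estimators:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=6)
N, repeats = 5, 200_000

X = rng.normal(size=(repeats, N))        # true variance is 1
biased   = np.var(X, axis=1, ddof=0)     # divides by N
unbiased = np.var(X, axis=1, ddof=1)     # divides by N - 1

print(biased.mean())     # close to (N-1)/N * sigma^2 = 0.8
print(unbiased.mean())   # close to sigma^2 = 1.0
\end{verbatim}
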
We saw that the bias of the naive variance estimator goes like $1/N$.
So one might wonder: How much bias does one typically expect to encounter?
Bias problems often appear whenever one wants to estimate some function
of the mean $\hat{f}=f(\hat{x})$ that is not necessarily linear near the
mean. One might be tempted to take the estimator
\begin{equation}
\bar{f}_{\text{bad}}=\frac{1}{N}\sum_{i=1}^Nf_i,
\end{equation}
where $f_i\equiv f(X_i)$. However such an estimator is usually biased,
and even worse, we have in general that
\begin{equation}
\lim_{N\to\infty}\bar{f}_\text{bad}\neq\hat{f}.
\end{equation}
\begin{example}{}{} Consider $N$ measurements of a random variable $X$ that
are drawn from a PDF with $\hat{x}=0$, and suppose we are interested
in estimating the function $f(x)=x^2$. If we try the bad estimator, we find
\begin{equation}
\bar{f}_\text{bad}=\frac{1}{N}\sum_{i=1}^N X_i^2,
\end{equation}
which is greater than 0 with probability 1, since each $X_i=0$ only
with probability 0. Its expectation value is $\sigma^2$, not $f(\hat{x})=0$; in other words it is biased. For a more egregious example,
we consider the piecewise function
\begin{equation*}
f(x)=
\begin{cases}
0 & x=0, \\
1 & \text{otherwise}.
\end{cases}
\end{equation*}
In this example, clearly $\hat{f}=0$, but again since $X_i=0$ with probability
0 for each $i$, we will find
\begin{equation*}
\lim_{N\to\infty}\bar{f}_\text{bad}=1.
\end{equation*}
\end{example}
An estimator that does not converge to its true value as $N\to\infty$ is called
{\it inconsistent}; otherwise it is {\it consistent}.
\index{estimator!consistent}
So this bad estimator is not a consistent estimator. Note that a biased
estimator is not necessarily inconsistent; for instance the biased estimator
of the variance \equatref{eq:bad} is consistent. From a practical perspective,
it is enough for the expectation value of the estimator to converge to its true
value. Such estimators are {\it asymptotically unbiased}\index{asymptotically
unbiased}. An asymptotically unbiased estimator of $\hat{f}$ is
\begin{equation}\label{eq:consistentEstimator}
\bar{f}=f(\bar{X}).
\end{equation}
We can prove that $\bar{f}$ is asymptotically unbiased for a wide class of functions.
\begin{proposition}{}{bias}
Suppose $f:\R\to\R$ has a convergent Taylor series
in a region about $\hat{x}$. If $\bar{X}$ maps to this region,
then $\bar{f}$ has bias of order $1/N$.
\begin{proof}
If we consider $f$ as a function of the ordinary variable $x$, we can
expand it about $\hat{x}$ as
$$
f(x)=f(\hat{x})+f'(\hat{x})(x-\hat{x})
+\frac{1}{2}f''(\hat{x})(x-\hat{x})^2
+\order{(x-\hat{x})^3}.
$$
Since $\bar{X}$ maps to the region in which this expansion is valid,
we can plug it into the above formula and find its expected value.
This gives
$$
\ev{\bar{f}}-\hat{f}=f'(\hat{x})\ev{\bar{X}-\hat{x}}
+\frac{1}{2}f''(\hat{x})\ev{(\bar{X}-\hat{x})^2}
+\order{(\bar{X}-\hat{x})^3}.
$$
The LHS of this equation is the bias of $\bar{f}$. To simplify the
RHS, note that $\ev{\bar{X}-\hat{x}}=0$ exactly and, as in the proof of the weak LLN,
$\ev{(\bar{X}-\hat{x})^2}=\sigma^2/N$. Therefore
$$
\ev{\bar{f}}-\hat{f}=\frac{1}{2}f''(\hat{x})\frac{\sigma^2}{N}
+\order{\frac{1}{N^2}}.
$$
\end{proof}
\end{proposition}
According to the above proposition, the bias vanishes as $N\to\infty$, which
shows that $\bar{f}$ is asymptotically unbiased. For large $N$, $\bar{X}$ is very likely
to be close to $\hat{x}$ by the CLT, so \propref{prp:bias} will
essentially hold whenever $N$ is large and $f$ is a nice enough function.
There is another important consequence to this proposition: the bias
decreases faster than the statistical error bar, which you will recall
goes like $1/\sqrt{N}$. Hence when $N$ becomes large enough, the bias
can be ignored.
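
\propref{prp:bias} can also be checked numerically. The sketch below (the choice
of $f$, the parent distribution, and the sample sizes are arbitrary) measures
the bias of $\bar{f}=f(\bar{X})$ for $f(x)=x^2$ and compares it to
$\frac{1}{2}f''(\hat{x})\sigma^2/N$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=7)
f = lambda x: x**2                  # nonlinear, with f''(xhat) = 2
xhat, sigma, repeats = 0.0, 1.0, 200_000

for N in (4, 8, 16, 32):
    xbar = rng.normal(xhat, sigma, size=(repeats, N)).mean(axis=1)
    bias = f(xbar).mean() - f(xhat)
    print(N, bias, sigma**2 / N)    # measured bias vs. f''(xhat) sigma^2 / (2N) = 1/N
\end{verbatim}
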
\section{Error propagation}\label{sec:prop}
We will now reproduce the commonly used error propagation formula. We know
that if we have a smooth function of $N$ variables
$f:\R^N\to\R$ we can Taylor expand
\begin{equation}
f(x)=f(\hat{x})+\sum_{j=1}^N
\pdv{f}{x_j}\Bigg|_{x=\hat{x}}(x_j-\hat{x}_j)
+\order{|x-\hat{x}|^2},
\end{equation}
where $x=(x_1,...,x_N)$ and $\hat{x}=(\hat{x}_1,...,\hat{x}_N)$. So if
$|x-\hat{x}|$ is small enough, $f$ is a linear function of the components
of $x$. This motivates the following situation: Suppose we have a set of
$M$ random variables $Y_i$, each of which is a linear function of $N$
random variables $X_j$; i.e.
\begin{equation}\label{eq:lintran}
Y_i=a_{i0}+\sum_{j=1}^N a_{ij}X_j.
\end{equation}
Then the mean is given by
\begin{equation}
\hat{y}_i=\ev{Y_i}=a_{i0}+\sum_{j=1}^N a_{ij}\ev{X_j}
=a_{i0}+\sum_{j=1}^N a_{ij}\hat{x}_j.
\end{equation}
Meanwhile the variance of $Y_i$ is given by
\begin{equation}\label{eq:linprop}
\begin{aligned}
\sigma^2_{Y_i}=\ev{(Y_i-\hat{y}_i)^2}
&=\ev{\sum_{j=1}^N a_{ij}(X_j-\hat{x}_j)
\sum_{k=1}^N a_{ik}(X_k-\hat{x}_k)}\\
&=\sum_{j=1}^N a_{ij}^2\ev{(X_j-\hat{x}_j)^2}
+\sum_{j\neq k}a_{ij}a_{ik}
\ev{(X_j-\hat{x}_j)(X_k-\hat{x}_k)}\\
&=\sum_{j=1}^N a_{ij}^2\sigma^2_{X_j}
+\sum_{j\neq k}a_{ij}a_{ik}\Cov(X_j,X_k).
\end{aligned}
\end{equation}
In the case that the $X_j$ and $X_k$ are independent, $\Cov(X_j,X_k)=0$.
Furthermore if $Y_i$ is a linear function of the $X_j$, we
can associate the $a_{ij}$ with partial derivatives. Then
\equatref{eq:linprop} becomes
\begin{equation}\label{eq:errprop}\index{error propagation formula}
\sigma^2_{Y_i}=
\sum_{j=1}^N\left(\pdv{Y_i}{X_j}\right)^2\sigma_{X_j}^2.
\end{equation}
This is the {\it error propagation formula} as it's usually stated.
Berg~\cite{berg_markov_2004} emphasizes that this relation is only a
mnemonic, because strictly speaking it does not make
sense to take derivatives with respect to random variables.
In practice we can apply \equatref{eq:errprop} when we
\begin{enumerate}
\item know how some function $f$ depends on some variables $x_i$
(then taking derivatives with respect to these new variables
is well defined);
\item take measurements of the variables;
\item the measurements are independent; and
\item either the function is exactly linear in the $x_i$ or $f$ is
approximately linear in a region close to the mean.
\end{enumerate}
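
As a toy illustration of \equatref{eq:errprop}, suppose we have measured two
independent quantities and want the error on their product (all numbers below
are arbitrary):
\begin{verbatim}
import numpy as np

# f(x1, x2) = x1 * x2 with independently measured inputs.
x   = np.array([2.0, 3.0])      # measured means
sig = np.array([0.1, 0.2])      # their standard errors

# Partial derivatives of f evaluated at the measured means.
grad = np.array([x[1], x[0]])   # df/dx1 = x2, df/dx2 = x1

f_val = x[0] * x[1]
f_err = np.sqrt(np.sum(grad**2 * sig**2))
print(f_val, f_err)             # 6.0 +/- 0.5
\end{verbatim}
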
\section{Covariance}\index{covariance}
Oftentimes in physics you'll find yourself in a situation where you want
to calculate several functions of the same random variables $X_j$. If
the $X_j$ are close enough to their means, or if the $Y_i$ are linear
functions of the $X_j$, this is a situation in which \equatref{eq:lintran}
applies. Intuitively one might expect the $Y_i$ to be correlated; this
turns out to be the case. In particular
\begin{equation}
\begin{aligned}
\Cov(Y_i,Y_j)&=\ev{(Y_i-\hat{y}_i)(Y_j-\hat{y}_j)}\\