$$\notag \newcommand{\N}{\mathcal{N}} \newcommand{\g}{\,|\,} \newcommand{\la}{\!\leftarrow\!} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} $$

Laplace question answer

The posterior is proportional to prior times likelihood: \[ p(\lambda\g r) \,\propto\, e^{-\lambda}\,\frac{\lambda^{r-1}}{r!}. \] Define an ‘energy’, equal to the negative log posterior up to a constant: \[ E \,=\, \lambda - (r\tm1)\log\lambda. \] First find the MAP estimate, by minimizing the energy: \[ \pdd{E}{\lambda} = 1 - \frac{r-1}{\lambda}, \qquad \Rightarrow \lambda_\mathrm{MAP} = r-1. \] Then measure the curvature at the minimum: \[ \left.\pdd{^2E}{\lambda^2}\right|_{\lambda=\lambda_\mathrm{MAP}} = \frac{r-1}{\lambda^2_\mathrm{MAP}} = \frac{1}{r-1}. \] Unless \(r>1\), we don’t have a mode at \(\lambda\!>\!0\) to expand around sensibly.

The approximate Gaussian’s variance is the inverse of this curvature: \[\Rightarrow \fbox{$p(\lambda\g r) \approx \N(\lambda;\; r\tm1,\; r\tm1),~~~~~r>1.$}\]
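As a quick numerical check, here is a minimal Octave/Matlab sketch (not the code that produced the plots below). It uses the fact that the normalized posterior above is a Gamma\((r,1)\) density, \(\lambda^{r-1}e^{-\lambda}/\Gamma(r)\):

```matlab
% Sketch: exact posterior over lambda vs. its Laplace approximation.
% Assumes the normalized posterior is Gamma(r,1): lam^(r-1) exp(-lam) / gamma(r).
r = 2;                                     % try r = 20 for the higher-count case
lam = linspace(1e-3, 10, 1000);            % grid of rate values
post = lam.^(r-1) .* exp(-lam) / gamma(r);            % exact posterior
mu = r - 1; s2 = r - 1;                               % Laplace mean and variance
lap = exp(-(lam - mu).^2 / (2*s2)) / sqrt(2*pi*s2);   % N(lambda; r-1, r-1)
plot(lam, post, 'b-', lam, lap, 'r--');
xlabel('\lambda'); legend('true posterior', 'Laplace');
```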

The posterior over \(\ell \te \log\lambda\) is proportional to prior times likelihood: \[ p(\log\lambda\g r) \,\propto\, e^{-\lambda}\,\frac{\lambda^r}{r!}. \] (This is the posterior over \(\lambda\) above multiplied by the Jacobian \(\pdd{\lambda}{\ell} \te \lambda\) of the change of variables, which is where the extra factor of \(\lambda\) comes from.) Define an ‘energy’, equal to the negative log posterior up to a constant: \[ E = \lambda - r\log\lambda = e^\ell - r\ell. \] First find the MAP estimate of \(\ell\), by minimizing the energy: \[ \pdd{E}{\ell} = e^\ell - r, \quad \Rightarrow~ \ell_\mathrm{MAP} = \log r. \] We now only need \(r\!>\!0\) for the approximation to make sense. Then measure the curvature at the minimum: \[ \left.\pdd{^2E}{\ell^2}\right|_{\ell=\ell_\mathrm{MAP}} = e^\ell\big|_{\ell=\ell_\mathrm{MAP}} = r. \] Writing \(\ell\) as \(\log\lambda\) again, we get: \[\Rightarrow \fbox{$p(\log\lambda\g r) \approx \N(\log\lambda;\; \log r,\; \frac{1}{r}),~~~~~r>0.$}\]
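The same check works on the log scale (again a sketch, under the same assumptions). Multiplying by the Jacobian and normalizing gives \(p(\ell\g r) = \exp(r\ell - e^\ell)/\Gamma(r)\):

```matlab
% Sketch: exact posterior over ell = log(lambda) vs. its Laplace approximation.
% The lambda-posterior times the Jacobian, normalized: exp(r*ell - exp(ell))/gamma(r).
r = 2;
ell = linspace(-5, 3, 1000);               % grid over ell = log(lambda)
post = exp(r*ell - exp(ell)) / gamma(r);              % exact posterior over ell
mu = log(r); s2 = 1/r;                                % Laplace mean and variance
lap = exp(-(ell - mu).^2 / (2*s2)) / sqrt(2*pi*s2);   % N(ell; log r, 1/r)
plot(ell, post, 'b-', ell, lap, 'r--');
xlabel('log \lambda'); legend('true posterior', 'Laplace');
```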

Which approximation is better? Putting the rate on a log scale makes the distribution less skewed, and gives it unbounded support: \(\log\lambda\) can take any real value, whereas \(\lambda\) must be positive. Both of these properties are a better match for a Gaussian approximation. To plot the distribution of \(\log\lambda\), I put \(\lambda\) on the \(x\)-axis, but with a log scale. The numbers are then directly comparable between the plots.
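Concretely, that axis trick looks like this (a sketch, not the author’s plotting code):

```matlab
% Sketch: plot the density over ell = log(lambda) against lambda on a
% log-scaled x-axis, so the axis numbers match the linear-scale plot.
r = 2;
lam = logspace(-2, 1, 1000);               % evenly spaced in log(lambda)
ell = log(lam);
post_ell = exp(r*ell - exp(ell)) / gamma(r);   % density over ell, not lambda
semilogx(lam, post_ell, 'b-');
xlabel('\lambda (log scale)'); ylabel('p(log \lambda | r)');
```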

\(r\te2\): for low counts, the Laplace approximation on \(\lambda\) (dashed, red) puts lots of mass on negative (impossible) values. The approximation based on \(\log\lambda\) is better, although neither can fit the true distribution (solid, blue) perfectly:

[Figures: true posterior (solid, blue) with the Laplace approximations (dashed, red) for \(r\te2\), plotted over \(\lambda\) and over \(\log\lambda\).]
\(r\te20\): for higher counts, the Laplace approximation on \(\lambda\) (dashed, red) works better than before. However, the approximation based on \(\log\lambda\) is still slightly better, as the true distribution (solid, blue) is less skewed in this parameterization:

[Figures: the same comparison for \(r\te20\).]
The Matlab/Octave code that produced the plots is available online.