advances in markov chain monte carlo methods iain murray ma msci natural sciences physics university of cambridge uk gatsby computational neuroscience unit university college london queen square london wcn ar united kingdom thesis submitted for the degree of doctor of philosophy university of london i iain murray confirm that the work presented in this thesis is my own where information has been derived from other sources i confirm that this has been indicated in the thesis abstract probability distributions over many variables occur frequently in bayesian inference statistical physics and simulation studies samples from distributions give insight into their typical behavior and can allow approximation of any quantity of interest such as expectations or normalizing constants markov chain monte carlo mcmc introduced by metropolis et al allows sampling from distributions with intractable normalization and remains one of most important tools for approximate computation with probability distributions while not needed by mcmc normalizers are key quantities in bayesian statistics marginal likelihoods are needed for model comparison in statistical physics many physical quantities relate to the partition function in this thesis we propose and investigate several new monte carlo algorithms both for evaluating normalizing constants and for improved sampling of distributions many mcmc correctness proofs rely on using reversible transition operators often these operators lead to slow diffusive motion resembling a random walk after reviewing existing mcmc algorithms we develop a new framework for constructing nonreversible transition operators that allow more persistent motion next we explore and extend mcmcbased algorithms for computing normalizing constants we compare annealing multicanonical and nested sampling giving recommendations for their use we also develop a new mcmc operator and nested sampling approach for the potts model this demonstrates that nested sampling is sometimes better than annealing methods at computing normalizing constants and drawing posterior samples finally we consider doublyintractable distributions with extra unknown normalizer terms that do not cancel in standard mcmc algorithms we propose using several deterministic approximations for the unknown terms and investigate their interaction with sampling algorithms we then develop novel exactsamplingbased mcmc methods the exchange algorithm and latent histories for the first time these algorithms do not require separate approximation before sampling begins moreover the exchange algorithm outperforms the only alternative sampling algorithm for doubly intractable distributions acknowledgments i feel very fortunate to have been supervised by zoubin ghahramani my training has benefited greatly from his expertise enthusiasm and encouragement i have been equally fortunate to receive regular advice and ideas from david mackay much of this research would never have happened without these mentors and friends this work was carried out at the gatsby computational neuroscience unit at university college london this is a firstclass environment in which to conduct research both intellectually and socially peter dayan the director is owed a lot of credit for this as are all of gatsbys members and visitors past and present several individuals have offered me their constant support advice and good humor in particular angela yu ed snelson katherine heller and members of the inference group in cambridge id also like to extend a special thanks to alex boss whose administrative help often extended beyond the call of duty some of chapter is a review of john skillings nested sampling algorithm john has been generous with his advice and encouragement which enabled me to pursue a study based on his work thanks also to radford neal for comments on aspects of chapters and hyunchul kim for meanfield code used in chapter and matthew stephens for encouraging me to work out the noninfinite explanation of the exchange algorithm thanks to my examiners peter green and lorenz wernisch for reviewing this work and catching some mistakes of course i am solely responsible for any errors that remain i am enormously grateful to the gatsby charitable foundation for funding my research and providing travel money further travel monies were received from the pascal network of excellence auai the nips foundation and the valencia bayesian meeting various items of free software have played a vital role in conducting this research gcc a gnulinux gnuplot l tex metapost hobby octave valgrind seward and nethercote vim and many more projects in the public interest such as these deserve considerable support finally i would like to thank my family and friends for all their support and patience contents front matter abstract acknowledgments contents list of algorithms list of figures list of tables notes on notation introduction graphical models directed graphical models undirected graphical models the potts model computations with graphs the role of summation simple monte carlo sampling from distributions importance sampling markov chain monte carlo mcmc choice of method markov chain monte carlo metropolis methods generality of metropolishastings gibbs sampling a two stage acceptance rule conditional estimators raoblackwellization waste recycling construction of estimators convergence auxiliary variable methods swendsenwang contents slice sampling hamiltonian monte carlo simulated tempering expanded ensembles parallel tempering annealed importance sampling ais tempered transitions generalization to only forward transitions generalization to a single pass annealing methods multicanonical ensemble exact sampling exact sampling example the ising model discussion and outlook multiple proposals and nonreversible markov chains population monte carlo multipletry metropolis efficiency of mtm multipleimportance try wasterecycled mtm adapting k automatically ordered overrelaxation with pivotbased transitions persistence with pivot states ordered overrelaxation pivotbased transitions pivotbased metropolis summary related and future work normalizing constants and nested sampling starting at the prior bridging to the posterior aside on the prior factorization thermodynamic integration multicanonical sampling nested sampling a change of variables computations in the new representation nested sampling algorithms mcmc approximations integrating out x degenerate likelihoods contents efficiency of the algorithms nested sampling multicanonical sampling importance sampling constructing annealing schedules markov chains for normalizing constants randomize operator orderings changes in lengthscale and energy a new version of swendsenwang description of slice sampling experiments discussion of slice sampling results the potts model summary related work philosophy experiments discussion and conclusions doublyintractable distributions bayesian learning of undirected models do we need z for mcmc targets for mcmc approximation approximation algorithms extension to hidden variables experiments involving fully observed models experiment involving hidden variables discussion product space interpretation bridging exchange algorithm details for proof of correctness reinterpreting savm approximation schemes the exchange algorithm the single auxiliary variable method mavm a temperedtransitions refinement comparison of the exchange algorithm and mavm ising model comparison discussion metropolishastings algorithm performance latent history methods slice sampling doublyintractable distributions contents latent histories mavm discussion summary and future work references list of algorithms rejection sampling metropolishastings a two stage acceptance rule population monte carlo as in capp et al e a single step of the multipletry metropolis markov chain self sizeadjusting population for ordered overrelaxation the pivotbased transition the pivotbased metropolis operator single particle nested sampling multiple particle nested sampling swendsenwang for weighted bond configurations construction of an annealing schedule from nested sampling standard but infeasible metropolishastings exchange algorithm exchange algorithm with bridging multiple auxiliary variable method mavm simple rejection sampling algorithm for py template for a latent history sampler metropolishastings latent history sampler list of figures a selection of directed graphical models some graphical models over three variables a grid of binary variables an intractable undirected graphical model drawing samples from lowdimensional distributions challenges for markov chain exploration the swendsenwang algorithm slice sampling the effect of annealing parallel tempering annealed importance sampling ais tempered transitions multicanonical ensemble example coupling from the past cftp overview metropolis and population monte carlo on the funnel distribution multipletry metropolis mtm performance of multipletry metropolis the idea behind successive overrelaxation illustration of the pivotbased transitions operator ordered overrelaxation schemes applied to a bivariate gaussian using pivot states for persistent motion example of persistent motion reflect move for discrete distributions views of the integral z l d nested sampling illustrations the arithmetic and geometric means of nested samplings progress errors in nested sampling approximations empirical average behavior of ais and nested sampling potts model states accessible by mcmc and nested sampling list of figures a simple illustration of the global nature of z histograms of approximate samples for heart disease risk model quality of marginals from approximate mcmc methods toy semisupervised problem with results from approximate samplers the exchange algorithms augmented model a product space model motivating the exchange algorithm the proposed change in joint distribution under a bridged exchange augmented model used by savm joint model for mavm comparison of mavm and the exchange algorithm learning a precision performance comparison of exchange and mavm on an ising model the latent history representation list of tables equilibrium efficiency of mtm and metropolis accuracy on tdistribution after proposals a rough interpretation of the evidence scale nested sampling ais and multicanonical behavior with slicesampling estimates of the deceptive distribution partition function results for potts systems notes on notation probability distributions we have chosen to follow a fairly loose but commonlyused notation for probabilities occasionally we use p x x to denote the probability that a random variable x takes on the value x but as long as the meaning can be inferred from context we simply write p x often there is more than one probability distribution over the same variable we simply write qx for the probability under distribution q and px for the probability under p we never mention the space in which x lives nor any measure theory unless it is actually used we rarely need to distinguish between probability densities and discrete probabilities this loose notation is imprecise but hopefully its simplicity will be appreciated by some readers probability of a given b we use several notations for distributions over variables that depend on other quantities p ab this is the conditional probability of a given b bayes rule can be applied to infer b from a p ba p abp bp a p a b the probability of a is a function depending on some parameter b one should not necessarily assume that bayes rule holds for a and b t a b t is a transition operator which gives a probability distribution over a new position a given a starting position b one could also write t a b the arrow is to provide a more obvious distinction from authors that use t a b for the probability of the transition a b transition operators do not necessarily satisfy bayes rule known as detailed balance so the notation t ab is avoided t a b c this specifies parameters c in addition to starting location b expectations we use epxf x x pxf x for the expectation or average of f x under the distribution px a sum with no specified range should be taken to mean over all values the variance of a quantity is given by varpf epx f x epxf x chapter introduction probability distributions over many variables occur frequently in bayesian inference statistical physics and simulation studies computational methods for dealing with these large and complex distributions remains an active area of research graphical models section provide a powerful framework for representing these distributions we use these to explain challenging probability distributions and sometimes the algorithms to deal with them a surprising number of statistical problems result in the computation of averages which we explain in section monte carlo techniques section approximate these summations using random samples many of these methods rely on the use of markov chains section extending markov chain monte carlo mcmc techniques is the subject of this thesis graphical models compact representations of highdimensional probability distributions are essential for their interpretation feasibility of learning and computational reasons as a concrete example consider a distribution over a vector of d binary variables x for small d eg two or three an explicit table of counts for all possible joint settings could be maintained and used for frequencybased estimates of the settings probabilities explicitly storing such a table becomes exponentially more costly as d grows even if the table is stored in a sparse data structure the representation is not useful for learning probabilities most cells will contain zero observations enforcing some structure is essential when learning from large multivariate distributions the simplest multivariate distributions assume that all of their component variables xd are independent d px d pxd graphical models c x x x x x x x x x x y a na bayes ve b hidden causes y c simple parametric model figure a selection of directed graphical models this assumption is generally inappropriate as it means that no relationships between any of the variables can ever be learned the graphical model for equation has a node for each xd with no edges between any of the nodes graphs with more structure provide a convenient representation for different factorizations of px two classes of graphical model are used in this thesis directed graphs offer a natural representation of causal relationships or the result of a sequence of operations while useful in modeling we mainly use directed graphs for explanations of algorithms undirected graphical models are a more natural representation for constraints and mutual interactions these are an important motivation for chapter this section offers a brief review of the required concepts directed graphical models a directed graphical model is represented by a directed acyclic graph dag generally the joint distribution factorizes into a product of terms for each node conditioned on all of its parents figure a is a directed graphical model for the joint distribution found in the na bayes model with class variable c and feature vector x xd ve d pc x pcpxc pc d pxd c figure a and equation contain exactly the same information from either form it is readily apparent to those familiar with the representations that the features x are dependent but independent conditioned on the class variable c as more variables and more complicated structure are introduced the graphical model is often easier to interpret at a glance than an equation giving the factorization of the joint distribution several graphical model figures have been included throughout this thesis for those that might find them useful graphical models x x x fi x x x x x x x x x a b c d figure some graphical models over three variables figure c demonstrates another piece of directed graphical model notation sometimes shading or a doubleoutline is used to indicate that a node represents an observed variable here y are the data being modeled which are assumed to come from a distribution parameterized by strictly this graph says nothing else about the distribution any joint distribution over two variables can be written as py pyp pypy so the arrow could equally be drawn the other way the implication is that generating y from is a natural operation the arrow directions make a big difference in figure b the x variables are independent in the generative process after observing y our knowledge about x forms a potentially complex joint distribution we do not expect direct sampling from pxy to be easy methods based on markov chains will offer a solution to this problem undirected graphical models a probability distribution must be normalized so one representation used frequently in chapter is px f x z z x f x where f x is any positive function for which the normalization sum or integral exists however without detailed analysis of f this representation says nothing about the structure of the distribution undirected graphical models represent f as a product of terms but do not identify some nodes as children and others as parents this is appropriate for representing mutual interactions modern undirected models use bipartite factor graphs figure b shows three variables which all interact through a central factor node traditionally this distribution would be represented as in figure c where the three nodes all take part in a common interaction as they form a fully connected clique the factor representation is more powerful the traditional representation cannot represent the distribution in figure d which contains three factors px x x f x x f x x f x x z graphical models the potts model the potts model is a widespread example of an undirected graphical model its probability distribution is defined over discrete variables s each taking on one of q distinct settings or colors psj q exp zp j q jij si sj ije the variables exist as nodes on a graph where ij e means that nodes i and j are linked by an edge the kronecker delta si sj is one when si and sj are the same color and zero otherwise neighboring nodes pay an energy penalty of jij when they are different colors often a single common coupling jij j is used a common extension allows biases to be put on the colors an additional term i hi si is put inside the exponential the potts model with bias terms and q has several different names the boltzmann machine the ising model binary markov random fields or autologistic models computations with graphs imagine a problem that involves summing over the settings of a binary potts model with variables in general this seems impossible there are possible states for comparison the age of the universe is usually estimated to be about picoseconds while most current processors take about picoseconds to perform a single instruction if the models graph forms a chain s s s sn the distribution can be factored into a product of functions involving each adjacent pair psj zj n fn sn sn j n s now certain sums such as the normalizer zj can be decomposed is as follows n fn sn sn j and expectations of functions of variables epsgsi are tractable an example showing how the sums gsi s n fn sn sn s si s f s s s f s s fi si si sn fi si si gsi si sn fn sn sn fn sn sn performing the sums right to left makes the computation oqn rather than oq n where q for binary variables the role of summation a b figure a small state space of binary variables represented as a pixels and b an undirected graphical model the above technique easily generalizes to treestructured graphs but other topologies are common in applications figure shows a grid of binary variables arrays like this are common in computer vision spatial statistics and statistical physics treating each row of pixels as a single variable makes a chainstructured graph with n variables each taking on states summing has changed from an operation that takes orders of magnitude longer than the age of the universe to being almost instantaneous advanced versions of this procedure exist for general graphical models notably the junction tree algorithm lauritzen and spiegelhalter the cost of the algorithm is determined by the treewidth of a graph which indicates the largest number of variables that need to be joined in order to form a tree structure figure shows a genuinely intractable graphical model or does it even if the variables are only binary then summing over all states with a graphbased exact inference algorithm is infeasible forming a tractable tree will require making at least one node with joint settings of variables the topology of the graph isnt everything however if the variables were continuous and gaussian distributed then the model is quite tractable almost everything one might want to know about a gaussian distribution can be found easily from a cholesky decomposition of the covariance matrix this matrix factorization is flexible numerically stable and costs on for a nn covariance matrix seeger when n a current midrange workstation can perform the required computation in a couple of seconds the role of summation summing over all the configurations of a multivariate distribution turns out to be the dominant computational task in many fields one of the goals of statistical physics is to capture the collective behavior of systems that like the potts model involve enormous numbers of interacting parts many physical quantities relate to simple statistics of summing in diagonal stripes means this need only happen once rather than times the role of summation figure an intractable undirected graphical model a grid of variables with pairwise interactions these parts averaged over the entire system other key quantities can be derived from z the zustandssumme or sum over states bayesian inference the use of probability theory to deal with uncertainty is also dominated by the computation of averages as a canonical example we consider a statistical model x for data x generated using parameters the predictive distribution over new data xn given observations of n previous settings xn n is an average n under the posterior distribution pxn n n pxn xn n n pxn pxn n d n n epxn n the posterior distribution is given by bayes rule pxn n n px n pxn n p n pxn n p d n which involves another sum over in general any time a quantity is unknown we must consider all of its possible settings which tends to involve an average under a probability distribution this thesis concentrates on computational methods rather than their application simple monte carlo however modeling and learning from data are a key motivation for this work so we briefly mention some more general references bayesian inference is named after bayes although many of the ideas that currently fall under this banner came much later for more on the philosophy a good start is a beautifully written paper by cox recent textbooks offer a broader view of probabilistic modeling and are available from the viewpoints of various communities including statistics gelman et al machine learning mackay bishop and the physical sciences sivia and skilling simple monte carlo many of the alleged difficulties with finding averages are unduly pessimistic imagine we were interested in the average salary x of people p in the set of statisticians s formally this is a large intractable sum over all of the s statisticians in the world eps xp s xp ps but to claim that the question is unanswerable is absurd we can clearly get a reasonable estimate of the answer by measuring just some statisticians eps xp s s x ps for random survey of s people ps s s no reasonable application needs the exact answer which in any case is constantly fluctuating as individual statisticians are hired promoted and retire so for all practical purposes the problem is solved nearly conducting surveys that obtain a fair sample from a target population is difficult but the technique is so useful we are prepared to invest effort into good sampling methods this statistical sampling technique is directly relevant to solving difficult integrals in statistical inference for example what is the distribution over an unknown quantity x after observing data d from a distribution with unknown parameters pxd px dpd d epdpx d s s px s d s pd s we draw samples from the target distribution which rather than the set of statisticians is now pd approximating general averages by statistical sampling is known as simple monte carlo the estimates are unbiased which if not clear now is shown coxs work has been subject to some technical objections which are countered by horn simple monte carlo more generally in section as long as variances are bounded appropriately the sum of independent terms will obey a central limit theorem the estimators variance will scale like s having an estimators variance shrink only linearly with computational effort is often considered poor sokal begins a lecture course on monte carlo with the warning monte carlo is an extremely bad method it should be used only when all alternative methods are worse however as sokal goes on to acknowledge monte carlo is also an extremely important method on some problems it might be the only way to obtain accurate answers metropolis describes how the physicist enrico fermi used monte carlo before the development of electronic computers when insomnia struck he would spend his nights predicting the results of complex neutron scattering experiments by performing statistical sampling calculations with a mechanical adding machine as analytically deriving the behavior of many neutrons seemed intractable fermis ability to make accurate predictions astonished his colleagues many problems in physics and statistics are complex and involve many variables numerical methods that scale exponentially with the dimensionality of the problem will not work at all in contrast monte carlo is usually simple and its s scaling of error bars independent of dimensionality may be good enough even when more advanced deterministic methods are available monte carlo can be appropriate if todays computers can return useful answers with less implementation effort than more complex methods sampling from distributions just as finding a fair sample from a population is difficult in surveys sampling correctly from arbitrary probability distributions is also hard simple monte carlo is only as easy to implement as the random variate generator for the entire joint distribution involved a graphical description of sampling from a probability distribution is given by figure a points are drawn uniformly from the unit area underneath the probability density function and their corresponding value recorded this correctly assigns each element of the input space dx with probability pxdx the probability mass to the left of each point ux is distributed as uniform to implement a sampler we can first draw u and then compute xu from the inverse cumulative distribution that is xu u where x x px dx now the difficulty of sampling from distributions with this general method becomes apparent it is often infeasible to even normalize p a single integral over the entire state space whereas x is a whole continuum of such integrals simple monte carlo c q x copt q x p x px xi hi xj hj x x x x a direct sampling x x b rejection sampling x figure drawing samples from lowdimensional distributions a sample locations can be taken from points drawn uniformly under the density curve b rejection sampling provides a way to implement this samples are drawn uniformly underneath a simpler curve cq x as in algorithm points above p x px such as xi hi are rejected the remaining points come uniformly from underneath p and are valid samples from p the constant c is chosen such that cq is always above p the best setting is shown as copt but finding this value is not always possible while is tractable for some standard distributions random number generators must generally use less direct techniques rejection sampling is a method for drawing samples from a distribution px by using another distribution qx from which we can already sample the method requires an ability to evaluate the probability density of points under both distributions up to a constant p x px and q x qx further we must be able to find a constant c such that cq x p x sampling underneath a p x px curve gives the correct probability distribution we can sample under p x by drawing samples from the tractable qx distribution and applying algorithm figure b the more closely q matches p the lower the rejection rate can be made providing that c is adjusted to maintain a valid bound we can improve q after each step of the algorithm ideally q would be updated automatically as points from p are evaluated in general this could be difficult but for logconcave distributions there are at least two techniques gilks and wild gilks for applications of rejection sampling to standard distributions and several more techniques for nonuniform random variate generation see devroye simple monte carlo algorithm rejection sampling inputs target distribution px p x simple distribution qx q x and number of samples s outputs xs s samples from px s find a constant c such that cq x p x for all x ideally minimize c within the constraint for s s draw candidate x q draw height h uniform cq x if x h is below p ie h p x then xs x else go to end for importance sampling computing p x and q x then throwing x away along with all the associated computations seems wasteful yet discarding samples is a key part of rejection samplings correct operation importance sampling avoids such rejections by rewriting the integral as an expectation under any distribution q that has the same support as p f xpx dx s f x px qx dx qx if qx wherever px wx pxqx xs qx f xwx qx dx s f xs wxs s one interpretation is that the weights wxs make some samples from q more important than others rather than assigning harsh weights of zero and one as in rejection sampling the simplest view is that equation is is just simple monte carlo integration of an expectation under qx we immediately know that the estimator is unbiased and might obey a central limit theorem as before if qx is much smaller than px in some regions of the state space then the effective function f xpxqx will have very high or even infinite variance under qx states with extreme importance weights are rare events under q and might not be observed within a moderate number of samples this means that error bars based on the empirical variance of the importance weights can be very misleading for a practical example of this problem along with the associated recommendation to use broad distributions with heavy tails see mackay section interestingly the ideal q distribution is not equal to p if f x is a positive function then qx f xpx would give a zero variance estimator this optimal distribution is markov chain monte carlo mcmc unobtainable as evaluating qx requires the normalization zq f xpx dx the target integral but sometimes deliberate matches between q and p are useful in practice for example when interested in gathering statistics about the tails of p nothing about the importance sampling trick in equation actually requires the original integral to be an expectation we can divide and multiply any integrand by a convenient distribution to make it an expectation thus importance sampling allows any integral to be approximated by monte carlo as most of the integrals in dominant applications are expectations we still maintain px in the equations throughout evaluating the importance weights wx requires evaluating px p xzp but often zp is unknown in these cases the normalizing constant must be approximated separately as follows zp zq zq s p x dx p x qx dx q x w x p xq x xs qx w x qx dx s w xs s this gives an approximation for the importance weights wx zq p x px qx zp q x w x s s s s w x which can be used within equation the resulting estimator is biased but consistent when both estimators and have bounded variance markov chain monte carlo mcmc both rejection sampling and importance sampling require a tractable surrogate distribution qx neither method will perform well if maxx pxqx is large rejection sampling will rarely return samples and importance sampling will have large variance markov chain monte carlo methods can be used to sample from px distributions that are complex and have unknown normalization this is achieved by relaxing the requirement that the samples should be independent a markov chain generates a correlated sequence of states each step in the sequence is drawn from a transition operator t x x which gives the probability of moving from state x to state x according to the markov property the transition probabilities depend only on the current state x in particular any free parameters eg step markov chain monte carlo mcmc sizes in a family of transition operators t x x cannot be chosen based on the history of the chain a basic requirement for t is that given a sample from px the marginal distribution over the next state in the chain is also the target distribution of interest p px x t x x px for all x by induction all subsequent steps of the chain will have the same marginal distribution the transition operator is said to leave the target distribution p stationary mcmc algorithms often require operators that ensure the marginal distribution over a state of the chain tends to px regardless of starting state this requires irreducibility the ability to reach any x where px in a finite number of steps and aperiodicity no states are only accessible at certain regularly spaced times for more details see tierney for now we note that as long as a t satisfies equation it can be useful as the other conditions can be met through combinations with other operators given that px is a complicated distribution it might seem unreasonable to expect that we could find a transition operator t leaving it stationary however it is often easy to construct a transition operator satisfying detailed balance t x x px t x x px for all x x this states that a step starting at equilibrium and transitioning under t has the same probability forwards x x or backwards x x proving detailed balance only requires considering each pair of states in isolation there is no sum over all states as in equation having shown equation summing over x on both sides immediately recovers the stationary distribution requirement thus detailed balance is a useful property for deriving many mcmc methods however it is not always required or even desirable chapter introduces some mcmc transition operators that do not satisfy equation given any transition operator t satisfying the stationary condition equation we can construct a reverse operator t defined by t x x t x x px t x x px t x x px px x t x x px a symmetric form of this relationship shows that an operator satisfying detailed balance is its own reverse transition operator t x x px t x x px for all x x summing over x or x reveals that this mutual reversibility condition implies that the stationary condition equation holds for both t and t therefore constructing choice of method a pair of mutually reversible transition distributions is an alternative strategy for constructing mcmc operators detailed balance is a restricted case where a transition operator must be its own reverse operator choice of method in this introduction we described the importance of highdimensional probability distributions samples from these distributions capture their typical properties which is the basis of the monte carlo estimation of expectations such as f xpx dx we assume or hope that extreme values under px do not form a significant contribution to the integral for expectations in many physical and statistical applications this is a reasonable assumption in highdimensional problems sampling from an interesting target distribution px is often intractable methods such as rejection sampling fail because finding a closematching simple distribution qx is not possible for the same reason importance sampling estimators tend to have very high variance we described rejection sampling as it remains an important method when iid samples from lowdimensional distributions are required this occurs in some simulation work and as part of some mcmc methods similarly while simple importance sampling has problems in high dimensions it provides the basis of more advanced methods that are useful on some highdimensional problems we neglect much of the importance sampling literature but will study some methods relating to markov chains the focus of this thesis are methods that use markov chains these allow us to draw correlated samples from complex distributions and to perform monte carlo integration in highdimensional problems the next chapter reviews several important algorithms based on markov chains together with some new contributions after this we will be in a position to outline the remainder of the thesis chapter markov chain monte carlo this chapter reviews some important markov chain monte carlo mcmc algorithms much of this material is standard and could be skipped by those already familiar with the literature however some of the material in this chapter is to the best of our knowledge novel these contributions include generalizations of tempered transitions in subsection and the introduction of a slicesampling version of the two stage acceptance rule see sections and we also present some results in an unconventional way which is designed to help with reading later chapters metropolis methods algorithm gives a procedure for simulating a markov chain with stationary distribution px due to hastings algorithm metropolishastings input initial setting x number of iterations s for s s propose x qx x compute a px qx x px qx x set x x with probability min a eg a draw r uniform b if r a then set x x end for the setting of x at the end of each iteration is considered as a sample from the target probability distribution px adjacent samples are identical when the state is not updated in step b but every iteration must be recorded as part of the markov chain metropolis methods it is straightforward to show that the metropolishastings algorithm satisfies detailed balance t x x px min px qx x px qx x qx x px min px qx x px qx x px qx x min px qx x t x x px qx x px as required here the probability for a forwards transition x x was manipulated into the probability of the reverse transition x x however it would have been sufficient to stop after the second line and note that the expression is symmetric in x and x the algorithm is valid for any proposal distribution q ideal proposals would be constructed for rapid exploration of the distribution of interest p we could write qx x d to emphasize choices of proposal that are based on observed data d using any fixed d is valid but q cannot be based on the past history of the sampler as such a chain would not be markov often proposals are not complicated databased distributions simple perturbations such as a gaussian with mean x are commonly chosen the acceptreject step in the algorithm corrects for the mismatch between the proposal and target distributions when the proposal distribution is symmetric ie qx x qx x for all x x the acceptance ratio in step simplifies to a ratio under the distribution of interest a px px this is the original metropolis algorithm metropolis et al some authors eg mackay prefer to drop such distinctions and simply refer to all metropolishastings algorithms as metropolis methods the next section suggests another justification of this view generality of metropolishastings consider a mcmc algorithm that proposes a state from a distribution qx x and accepts with probability pa x x restricting attention to transition operators satisfying detailed balance pa x xqx x px pa x x qx x px for all x x gives an equality constraint we also have inequalities that must hold for probabilities pa x x and pa x x optimizing the average acceptance probability pxpa x x px pa x x with respect to ax x and ax x must saturate one or more of the inequalities a metropolis methods well known property of linear programming problems therefore either pa x x or pa x x and pa x x is given by equation this gives the metropolis hastings acceptance rule pa x x min px qx x pxqx x according to the constraint of equation a pair of valid acceptance probabilities must have the same ratio pa x xpa x x as any other valid pair therefore they can be obtained by multiplying the metropolishastings probabilities by a constant less than one this corresponds to mixing in some fraction of the do nothing transition operator which leaves the chain in the current state at those two sites it also corresponds to adjusting the proposal distribution q to suggest staying still more often staying still more often harms the asymptotic variance of the chain in this sense although not by all measures using equation is optimal peskun given this result it is unsurprising that metropolishastings has become almost synonymous with mcmc it is tempting to conclude that the only way to improve markov chains for monte carlo is by researching domainspecific proposal distributions such as in the vision communitys datadriven mcmc tu and zhu in fact a rich variety of more generic mcmcbased algorithms exist and continue to be developed many of these do satisfy detailed balance but the corresponding mh qx x distribution is often defined implicitly and would not be a natural description of the method to illustrate the limitations of claiming all reversible mcmc is just metropolis hastings we show that all reversible mcmc is just metropolis this also demonstrates a methodology used throughout the thesis we construct a new target distribution px x qx x px ie the joint distribution over a point x p and a point x proposed from that location the marginal distribution over x is the original target p now consider the symmetric metropolis proposal that swaps the values of x and x such that putatively x comes from p and x was proposed from it the metropolis acceptance probability for this swap proposal is min px x px x min px qx x px qx x ie the metropolishastings acceptance probability in this sense only an algorithm with symmetric proposals is needed but this is not a very natural description of the metropolishastings algorithm similarly while it is possible to describe all algorithms satisfying detailed balance as metropolis or metropolishastings algorithms other descriptions may be more natural however we will find constructing joint distributions like px x a useful theoretical tool and one that suggests new markov chain operators metropolis methods gibbs sampling gibbs sampling geman and geman resamples each dimension xi of a multivariate quantity x from their conditional distributions pxi xji any individual update maintains detailed balance which is easily checked directly alternatively we can write the gibbs sampling update as a metropolishastings proposal qx x pxi xji ixji xji where i is an indicator function ensuring all components other than xi stay fixed the acceptance probability for this proposal is identical to one so need not be checked gibbs sampling is often easy to implement if the target distribution is discrete and each variable takes on a small number of settings then the conditional distributions can be explicitly computed pxi xji pxi xji pxi xji x pxi xji i on continuous problems the onedimensional conditional distributions are usually amenable to standard sampling methods as mentioned in subsection a desirable feature of gibbs sampling is that it has no free parameters and so can be applied fairly automatically indeed the bugs packages can create gibbs samplers for a large variety of models specified using a simple description language spiegelhalter et al there are actually some free choices regarding how the variables are updated blockgibbs sampling chooses groups of variables to update at once also continuous distributions can be reparameterized before gibbs sampling a basis in which the dimensions are independent would obviously be particularly effective clifford et al contains several interesting discussions regarding the nature and history of the gibbs sampler the method has often been regarded with a rather undeserved special status amongst mcmc methods it is just one of many ways to construct a markov chain with a target stationary distribution in particular there is no need to approximate gibbs sampling when sampling from conditionals is not tractable as in ritter and tanner metropolishastings updates of each dimension can be used instead and there is no need to call this method metropoliswithingibbs a two stage acceptance rule if the target distribution is factored into two terms for example a prior and likelihood px xlx a proposal can be accepted with probability pa min qx x x qx xx min lx lx construction of estimators straightforward checking shows that this satisfies detailed balance we know from previous discussions that it will accept less often than standard metropolishastings and is inferior according to peskun but computationally it can be more efficient algorithm is an acceptance rule that will accept proposals with probability pa if x is able to veto an a priori unreasonable proposal then lx need not be computed factoring out a cheap sanitycheck distribution x may be worth a fall in acceptance rate in problems where likelihood evaluations are expensive algorithm a two stage acceptance rule input initial setting x proposed setting x draw r uniform if r qx x x qx xx draw r uniform if r lx accept else reject lx else reject as with standard metropolishastings none of the probabilities need to be computed exactly bounding them sufficiently to make the acceptreject decision in step or is all that is required in practice few monte carlo codes implement the required interval arithmetic to make these savings in contrast the two stage procedure is generally applicable and easy to implement algorithm is a special case of the surrogate transitions method liu pp and also the different algorithm of christen and fox the twostage method here is also equivalent to dostert et al algorithm ii which presented it in the context of a particular choice for qx x this literature has a rich variety of possible approximate distributions that can be used for x in particular applications in a statistical application where px depends on a data set a general choice would be to use a subset of the data to define x an obvious generalization is splitting the acceptance rule into more than two terms factoring the distribution into many terms could make the acceptance rate fall dramatically so there would need to be a specific computational benefit construction of estimators the output of an mcmc algorithm is a set of correlated samples drawn from a joint distribution p xs at equilibrium every marginal pxs is the correct distribution of interest using these equilibrium samples in the straightforward monte carlo construction of estimators estimator would be unbiased s s ep xs f xs s xs p xs s s s s s f xs s f xs s xs s xs p xs s s f xs pxs epf s xs the states drawn from a markov chain started at an arbitrary position will not have the correct marginal distribution asymptotically equation is still an unbiased estimator under weak conditions but in practice it is advisable to discard some initial states allowing the chain to burnin after this there is no need to remove intermediate samples in an attempt to obtain a set of approximately independent samples such thinning will only make the variance of the estimator worse geyer however thinning can be justified when adjacent steps are highly correlated or when computing f x is costly then discarding some samples can save computer time that is better spent running the markov chain for longer a related issue is whether it is better to run one long markov chain or perhaps to obtain independent points more quickly by trying many initializations geyer also shows that theoretically a longer chain is to be preferred intuitively an arbitrary initialization is bad as it introduces bias which takes time to remove it turns out this time is better spent evolving an existing chain to new locations despite the theory favoring running a single chain we tend to prefer running a few rather than one when more than one computer processor is available this is the easiest way to parallelize our code although agreement amongst multiple chains does not provide a guarantee that the markov chains are mixing differing results would reveal a failure to mix that might go unnoticed from a single chain finally multiple chains are useful for adapting stepsizes in metropolis methods and turn up naturally in population methods chapter conditional estimators raoblackwellization using a sample average of function values is not the only way to construct mcmc estimators it is an attractive default procedure as the estimator only needs to know f x evaluated at the samples however it is sometimes possible to perform some computations involving f x analytically a simple identity lets us use this analytical knowledge f xpx x h x f xpxh ph construction of estimators where h is an arbitrary statistic if we can sum over the distribution conditioned on h then the bracketed term evaluated under samples of h can be used as an estimator a special case provides more concrete motivation f xi px x xji xi f xi pxi xji pxji epxji epxi xji f xi here we are interested in a function of a particular variable xi sums over a single variable are often tractable the identity allows averaging epxi xji f xi under monte carlo samples rather than f x itself as an example consider finding the marginal of a binary variable xi f xi xi the standard estimator throws away xji and averages the xi samples a sequence of zeros and ones the new estimator throws away the observed xi values but does use xji the resulting average of real numbers between zero and one can be much better behaved if xi is nearly independent of the other variables the new estimator gives nearly the correct answer from one sample whereas the standard estimator is always noisy does the conditioning trick improve variances in general without loss of generality we assume that the function of interest has zero mean under this assumption the variance of the estimator that conditions on a statistic h becomes varph epxhf x h ph x pxhf x pxhf x x x h ph pxf x where the bound is an application of jensens inequality using the convexity of the square function the final equality from summing out h means that varph epxhf x varpxf x the variance of the conditional estimator is never worse than the standard monte carlo estimator this result applies under independent sampling from px unfortunately the result does not hold in general for correlated mcmc samples liu et al even if the variance is improved we dont generally know when a conditional estimator will be computationally more efficient than the straightforward monte carlo estimator as an extreme example hx x gives an estimator with zero variance because the target expectation is computed exactly clearly the cost of computing the conditional expectation must be considered the use of equation was suggested by gelfand and smith an influential paper that popularized gibbs sampling in the statistics community their motivation construction of estimators was essentially equation cited as a version of the raoblackwell theorem the bounds requirement of independent samples was satisfied because the paper proposed running many independent gibbs sampling runs as this practice has fallen out of favor it seems misleading to continue to call the method raoblackwellization although some of the literature continues to do so despite the lack of raoblackwell guarantees the estimator can often be justified empirically as was done earlier by at least pearl waste recycling the metropolishastings algorithm has the undesirable property that for many proposal distributions a large fraction of proposed states are rejected from the final set of samples as computations involving these rejected points were performed it seems a shame not to use this information in monte carlo estimators the same observation followed our description of rejection sampling for which the solution was to use the same proposal distribution in importance sampling mh proposal distributions are generally local in nature and unsuitable as importance sampling proposals waste recycling is a framework for constructing better estimators using the importance ratios computed while running mcmc a simple way to understand how to perform wasterecycling is to augment the stationary distribution over the current state x and the next proposal x with an auxiliary variable x we declare that x was generated by copying one of x and x min px qxx px qx x min px qxx px qx x px x x xx x x alternatively any other rule that gives x the same stationary distribution as x can be used now we could use draws of x for estimators instead of x we can also average estimators under the conditional distribution of x given x and x epxf x eqx xpx xxx f px x x x px qx x f x f x f x min px qx x this is just the conditional estimator equation from the previous section applied to the auxiliary distribution px x x px x qx xpx x we hesitate to assign credit for this estimator to any particular author it or closely related estimators can be found in various independent sources including at least kalos and whitlock tjelmeland and frenkel wasterecycling is not necessarily better than the straightforward estimator let alone worth the computa convergence p p q l p a a slow diffusion b burnin and mode finding a c balancing density and volume figure challenges for markov chain exploration tional expense however there is empirical evidence that it can be useful dramatically so in the case of frenkel casella and robert also derived a raoblackwellized estimator for using all the points drawn by the metropolis algorithm however their algorithm has an os cost in the number of steps of the markov chain and unsurprisingly has not been widely adopted convergence no review of markov chains would be complete without discussing their convergence properties if the simulation is initialized at an arbitrary starting location how long must we wait before we can treat the chains states as samples from the stationary distribution often metropolis proposals have a local nature the typical stepsize is limited by a need to maintain a reasonable acceptance rate the amount of time it takes to explore a distance of length l by a diffusive random walk scales like l figure a this offers a rule of thumb to understand sampling within a mode an arbitrarily chosen initial state is usually very improbable under the target distribution reaching a mode in high dimensions can take a long time with sophisticated optimizers let alone a markov chain simulation analyzing the chain can identify when the statistics of the sampler settle down allowing the initial exploratory phase to be discarded such diagnostics could be severely misleading if there are multiple modes as in figure b in general applications there is no way of knowing if a markov chain has yet to find the most important regions of a probability distribution there are also more subtle problems with convergence that do not relate to finding modes or step sizes figure c illustrates a distribution with a pillar at x a points within this mode have higher probability density than any state within the shaded plateau a x a but given the large size of the plateau few proposals will suggest moving to the pillar initializing the chain within the pillar using an optimizer would not help here p a and p a a the pillar and plateau have auxiliary variable methods equal probability mass yet only a small fraction of proposals from the pillar to the plateau can be accepted the acceptance rate is aaa for symmetric proposals a proposal distribution that knew the probability masses involved eg iid sampling qx x px would move rapidly between the modes but this would involve already having the knowledge that mcmc is being used to discover there is a theoretical literature on markov chain convergence but for general statistical problems it is difficult to prove much about the validity of any particular mcmc result in later chapters we make occasional use of a standard diagnostic tool to compare the convergence properties of chains but in general we would also run the sampling code on a cutdown version of the problem where exact results are available or simple importance sampling will work in inference problems it is a good idea to check that a sampler is consistent with ground truth when learning from synthetic data generated from the prior model geweke s method for getting it right is a more advanced way to check consistency between samplers for the prior and posterior such diagnostics are also a guard against errors in the software implementation a possibility considered seriously even by experts kass et al auxiliary variable methods as emphasized in the introduction monte carlo is usually best avoided when possible in tractable problems computations based on analytical solutions will usually be far more efficient one would think this means we should analytically marginalize out variables wherever possible auxiliary variable methods turn this thinking on its head given a target distribution px we introduce auxiliary variables z such that px px z dz we could just sample from px corresponding to analytically integrating out z from px z instead auxiliary variable methods instantiate the auxiliary variables in a markov chain that explores the joint distribution surprisingly this can result in a better monte carlo method an alternative way of introducing extra variables is to make the target distribution a conditional of a joint distribution px px the whole px distribution is explored but only samples where are retained this seemingly wasteful procedure can also yield better monte carlo estimates an example is simulated tempering discussed later in subsection swendsenwang the swendsenwang algorithm swendsen and wang is a highly effective algorithm for sampling potts models subsection with a small number of colors q the auxiliary variable methods algorithm is important in its own right we use it in chapters and and as a significant development in auxiliary variable methods edwards and sokal provided a scheme for constructing similar auxiliary variable methods for a wider class of models they also identified the fortuinkasteleynswendsenwang fksw auxiliary variable joint distribution that underlies the algorithm the fksw joint distribution is over the original potts color variables s si on the nodes of a graph and binary bond variables d dij present on each edge ps d zp j q pij dij pij dij si sj ije pij ejij as long as all couplings jij are positive the marginal distribution over s is the potts distribution equation the marginal distribution over the bonds is the random cluster model of fortuin and kasteleyn pd zp j q ije pijij pij dij q cd d where cd is the number of connected components in a graph with edges wherever dij the algorithm of swendsen and wang performs block gibbs sampling on the joint model by alternately sampling from p ds and p sd this also allows a sample from any of the three distributions potts ps random cluster pd or fksw ps d to be converted into a sample from one of the others the algorithm is illustrated in figure while implementing swendsenwang we were grateful for the efficient percolation code available in newman and ziff it is commonly stated that swendsenwang only allows positive jij couplings as presented here in fact swendsen and wang described the extension to negative bonds which also follows easily from edwards and sokals generalization the resulting algorithm only forms bonds between edges where jij if the adjacent sites have different colors colors joined by these negative bonds must remain different when choosing a new coloring for binary systems two colorings of a cluster are possible the previous coloring and its inverse when there are many strong negative interactions the entire system gets locked into one of two configurations whereas many configurations are probable swendsenwang can still be very useful when only some connections have negative weights stern et al provide a nice example in the context of modeling the game of go for a wider view of related methods in the context of spatial statistics see besag and green auxiliary variable methods a potts state b fksw state c random cluster state d new fksw state figure the swendsenwang algorithm a as an example we run the algorithm on a binary potts model on a square lattice b under the fksw conditional pds bonds are placed down with probability pij ejij wherever adjacent sites have the same color c discarding the colors gives a sample from the random cluster model d sampling from psd involves assigning each connected component or cluster a new color uniformly at random giving a new fksw state discarding the bonds gives a new setting of the potts model this coloring is dramatically different from the previous one in contrast a sweep of singlesite gibbs sampling can only diffuse the boundaries of the red and blue regions by roughly the width of a single site auxiliary variable methods x h x h x h h x h x h x a p x x h b p x c x h h x h x d e figure slice sampling ac show a procedure for unimodal distributions after sampling h from its conditional distribution an interval is found enclosing the region where hx h samples are drawn uniformly from this interval until one underneath the curve is found points outside the curve can be used to shrink the interval this is an adaptive rejection sampling method d sampling uniformly from the slice is difficult for multimodal distributions e one valid procedure uses an initial bracket of width that encloses x but is centered uniformly at random this bracket is extended in increments of until it sticks outside the curve at both ends this can cut off regions of the slice but applying the previous adaptive rejection sampling procedure still leaves the uniform distribution on the slice invariant slice sampling all metropolis methods make use of a proposal distribution qx x in continuous spaces such distributions have stepsize parameters that have to be set well almost all proposals are rejected if stepsizes are too large overly small stepsizes lead to slow diffusive exploration of the target distribution in contrast a selftuning method which has lessimportant or even no stepsize parameters would be preferable slice sampling neal is a method with one auxiliary variable that fits into the framework of edwards and sokal the auxiliary variable does not directly help mixing but allows the use of relatively easytoimplement algorithms that allow selftuning while generating a valid markov chain like gibbs sampling a slice sampling chain has no rejections it always moves when sampling a continuous distribution slice sampling works by stepping back to the view of figure a sampling involves drawing points uniformly underneath a curve hx p x rather than drawing points iid as in rejection sampling a markov chain explores this uniform distribution over the original variables x and a height variable h hx slice sampling algorithms alternate between updating h and one or more of the original variables the height has a simple conditional distribution from which it can be resampled h uniform hx conditioned on h the stationary distribution over x is uniform over the slice that lies under the curve hx h when the distribution is auxiliary variable methods unimodal this conditional can be sampled by rejection sampling see figure ac more generally we use transition operators that leave a uniform distribution on the slice stationary one of the operators introduced by neal is briefly explained in figure e like most metropolis methods it uses a stepsize parameter which would ideally match the lengthscale of the problem l when exponentially when region by a random walk neal also provides a slice sampling operator that can expand its search region exponentially quickly skilling and mackay provide a version with no stepsize it starts with a large search region which shrinks with a fast exponent both of these operators are slightly harder to implement than the simple version sketched in figure we now propose a simple we believe novel method that can shrink slice samplings search region more efficiently the method is a generalization of subsection where we factored the target density into px xlx here we also introduce two auxiliary variables ph x uniform x and ph x uniform lx the new method follows any of the standard slice sampling algorithms but defines the slice by requiring both x h and lx h this version of slice sampling still satisfies detailed balance with respect to the stationary distribution px we first identify all of the quantities involved in the slice sampling procedure the algorithm starts with a point x px then two auxiliary variables h h are generated the search procedure eg figure e will create an interval around x and possibly some unacceptable points and adaptations of the interval we collectively refer to all of the intermediate intervals and rejected points as s finally an acceptable point x is found and adopted the joint probability of all of these quantities is t x s h h x px x lx psh h xpx s h h x ph x ph x xlx z l the initial search region shrinks l the search region requires ol computations to grow out this is much better than the ol steps required by metropolis to explore the psh h xpx s h h xz all of the standard slicesampling procedures are designed to ensure that x is only accepted as the final point if this expression is symmetric in x and x therefore t x s h h x px t x s h h x px summing this expression over s h and h shows that the operator satisfies detailed balance multiple auxiliary variables sometimes allow very fast mixing as in swendsenwang however in this context the slice defined by the two auxiliary variables h and h will typically be smaller than under standard slice sampling this is likely to increase the mixing time of the chain neal warns about this problem with reference to damien et al our motivation is computational h can be checked first and only auxiliary variable methods if it satisfies the slice constraint need l h be checked thus schemes that start with a large range and rapidly shrink a bracket around the slice can do so with less computation if x can reject some unreasonable points while being cheap to compute hamiltonian monte carlo physical simulations often work by discretizing time and approximately computing the result of following the hamiltonian dynamics of the system being modeled these dynamics exploit gradients of a target stationary probability density not just pointwise evaluations as in the metropolis algorithm theoretically gradients only ever cost a constant multiple of the computer time needed to evaluate a function bischof and bcker and are a much richer source of information the difficulty with u simulation work is that inaccuracies accumulate rapidly unless time is discretized very finely which has a large computational cost the metropolis algorithm targets only an equilibrium distribution over states the method doesnt need this distribution to be the result of an underlying dynamical system instead the algorithm evolves according to a proposal distribution the progress of the metropolis algorithm often resembles a diffusive random walk in contrast a particle evolving under hamiltonian dynamics will naturally move in persistent trajectories across its statespace these two areas of physical simulation research were combined in hybrid monte carlo duane et al hamiltonian dynamics are simulated approximately as a metropolis proposal which is accepted or rejected this allows persistent rapid motion across a state space without damaging the equilibrium distribution of the chain through discretization errors if the target probability distribution does not correspond to a hamiltonian system then auxiliary variables are introduced to create a fictitious system that can be simulated while hamiltonian dynamics is time reversible some care is required to ensure the discretized version is also reversible neal reviews a variety of hamiltonian based monte carlo methods see mackay for another review with simple example code hybrid monte carlo crossed into a statistical setting through work on bayesian neural networks neal b where it is extremely successful hamiltonian monte carlo methods continue to appear in recent work especially in gaussian process modeling although they could be adopted more widely annealing methods annealing methods simulated annealing kirkpatrick et al is a heuristic for minimizing a cost function ex with isolated local optima the metropolis algorithm is run on a probability distribution derived from the energy e pk x p x k xek ex zk zk a base measure x is often omitted set to uniform but including this term ensures pk is defined when initially the inverse temperature k is set very low so that the distribution is diffuse the distribution is then slowly cooled by increasing k towards infinity eventually all of the probability mass will be concentrated on the global optimum of ex there can be no guarantee that a particular sampling procedure will actually find the global optimum in general it is not possible to do better than an exhaustive search however the heuristic has proved useful in a variety of applications monte carlo is not interested in optima but if px lxx and ex log lx then k returns the target distribution of interest and distributions with k give more diffuse distributions figure shows an example of a sequence of distributions it might be that isolated modes at are difficult to explore by available markov chain operators these modes join and disappear at lower these more diffuse distributions will typically be easier to explore this section reviews algorithms designed to exploit this observation to provide better samples from multimodal target distributions it turns out that all of these methods can also provide estimates of the target distributions normalizing constant a full discussion of normalizing constants is deferred to chapter notation all of the algorithms in this section and several later in the thesis use a sequence of distributions we have adopted the following conventions throughout p is a base distribution usually easy to explore or amenable to direct sampling if pk are defined by inverse temperatures then there are k intermediate distributions pk k between the base and target k distributions pk p is the target distribution if pk are defined by inverse temperatures then k the sequences are usually defined using temperatures as in equation but other methods for creating intermediate distributions between the base and target distributions may be used annealing methods figure the effect of annealing at the distribution is equal to a convenient base distribution as the inverse temperature increases the distribution morphs into the target distribution at simulated tempering expanded ensembles a coherent way of dealing with annealing in mcmc simulations was independently discovered under the names expanded ensembles lyubartsev et al and simulated tempering marinari and parisi the idea is to bring the choice of intermediate distribution pk into the joint distribution under exploration px k wst k p x k where wst k are userchosen weights attached to each temperature level the conditional distribution at temperature k returns the original distribution of interest pxk px a markov chain at equilibrium returns correlated samples from the correct distribution by only recording states when the hope is that the extra computation is justified by the decrease in autocorrelation of the markov chain the extent to which simulated tempering improves a chains mixing time depends heavily on the target distribution and the markov chains used one general characteristic of the algorithm is the amount of time it takes to move the temperature index k between its extreme values usually metropolis proposals k k are used because large changes in temperature are unlikely to be accepted the adjacent levels must be similar enough for these changes in k to be accepted frequently this sets a minimum number of temperature levels k even if all moves were accepted k would be undergoing a slow random walk it will take at least ok steps to bring information from the rapidly mixing k distribution to the target k k distribution the extra cost of the algorithm depends on the fraction of time spent at the other temperature levels at equilibrium each level has marginal probability pk x wst k p x wst k zk k the partition function zk can easily vary by orders of magnitude across temperature levels in order for simulated tempering to spend a reasonable amount of time at each temperature level we could set wst k zk except of course that this ideal weighting is unknown eg z is the unknown normalization of the target distribution a practical implementation of simulated tempering must be an iterative process where annealing methods px t t t t t t t t p x p x t t t t p x t t t t figure an illustration of the idea behind a family of parallel tempering algorithms each column represents the states of a markov chain each states in the first row have marginal distribution px the other rows have stationary distributions bridging between this distribution and a vague distribution for which exploration is easy mcmc proceeds by traditional simulation of each of the independent states and dotted arrows metropolis proposals to exchange the states between adjacent levels the wst k are improved over a series of preliminary runs parallel tempering parallel tempering is a popular alternative to simulated tempering based on a different joint distribution while simulated tempering uses a union space consisting of the temperature and the original state space parallel tempering simulates on a product space that is replicas of the original state space exist concurrently for each of the stationary distributions pk k pxk k pk xk as the states at each level are independent under the target distribution independent transition operators tk with corresponding stationary distributions pk can be applied for the higher temperature chains to communicate the results of their freer motion to the lower temperature chains interactions are also required figure illustrates one possible way of evolving a product space ensemble the state of the markov chain consists of an entire column of variables these evolve using standard mcmc operators or by swaps between states that are accepted or rejected according to the metropolis rule the swap proposals can be scheduled in a variety of ways but as with simulated tempering information will propagate amongst levels by a random walk or slower by construction the amount of computer time spent on each level is completely under the users control in particular no weights wst k are required which is one source of this method has several names and independent authors its history is documented by iba b annealing methods the methods relative popularity another feature of coexisting chains is the possibility of using proposals based on sophisticated interactions between states eg wang and swendsen these advantages are at the expense of an obvious increase in memory requirements and a hardertogauge cost for bringing a larger system to equilibrium annealed importance sampling ais annealed importance sampling neal a looks more like the original simulated annealing for optimization than the above ensemble methods an initial point x is drawn from p a vague high temperature base distribution further points x x xk result from cooling by evolving through a sequence of k markov chain transition operators t tk whose stationary distributions p pk become sharper at each iteration the ensemble of k points generated in the algorithm is denoted by x xk k k we treat the procedure that generated them as a proposal which has the following joint distribution qx p x k k tk xk xk but what is this a proposal for qxk xkk the marginal distribution over the final point qx is not the target distribution and is probably intractable we cannot directly compare qxk with pxk instead we must create an ensemble target distribution p x that can be compared to qx under this target model xk is drawn from the distribution of interest p pk the remaining auxiliary quantities are drawn using a sequence of reverse markov chain transitions such that k p x pk xk k tk xk xk where tk are reverse operators as described in section figure illustrates both the ensembles p x and qx samples of x from q can be assigned importance weights with respect to p in the usual way subsection w x p x q x k k p xk k p x k p xk tk xk xk k tk xk xk p x k k pk xk pk xk k p xk k p xk k in neals papers the indexes of the points started at n and ran down to the presentation here is designed to be consistent with similar algorithms found in later chapters annealing methods xk pk px p x xk tk xk tk xk x t x qx xk tk xk tk xk x t x x p x x figure annealed importance sampling ais is standard importance sampling applied to an ensemble p x this generative process starts with a draw from the target distribution px pk x which is intractable importance sampling uses proposals drawn from qx which starts at a simple distribution x p x samples weighted in proportional to w x can be used as though they came from p typically only xk is of interest which under p comes from the target pk p the weights themselves are an unbiased estimate of zp zq zk z as in standard importance sampling equation a valid choice for all of the tk is the do nothing operator which attaches all of its probability mass to staying still to x x x x then qx just corresponds to drawing from p and copying the results similarly p x only draws from pk as one might expect the importance weights collapse down to those of standard importance sampling of a target distribution p pk with proposals from the base distribution q p this highlights that the markov chains do not need to mix well for ais to be valid but poor mixing operators will not improve over standard importance sampling a feature of ais is that the importance weights do not require details of the transition operators tk this means that a given ensemble x s will always be given the same importance regardless of whether the tk mix immediately by drawing from their stationary distributions pk or in fact do nothing like to the downside to this behavior is that even with ideal transition operators there is always a mismatch between p x and qx under perfect mixing the final point is proposed from pk not the target distribution pk the annealed importance sampling technique was first described in the physics literature by jarzynski however earlier still was the method of tempered transitions which contains the same core idea tempered transitions tempered transitions neal a defines a reversible markov chain transition operator leaving a distribution of interest pk stationary thus its use is in mcmc annealing methods rather than importance sampling it uses pairs of mutually reversible base transitions tk and tk with corresponding stationary distributions pk k k given a current state k a candidate state k is proposed through a sequence of states x x drawn as follows generate k from k using tk x x generate k from k using tk x x generate x from using t x generate from x using t x x x generate k from k using tk generate k from k using tk x x this generating sequence is illustrated in figure a the candidate state is accepted with probability k x x a k k min k pk k pk k x x pk k pk k x x note that the normalizations of pk cancel so are not needed one way to derive this acceptance rule it is to identify the whole joint distribution of figure a as the target of x the sampler if the order of the states were reversed figure b k would be the new state generated at equilibrium a few lines of manipulation results in equation for the metropolis acceptance probability for this proposal the intermediate states can be discarded because they are resampled at the next iteration typically p is much more vague than pk which allows easier movement around the middle of the proposal sequence if there are many intermediate distributions then every pk can be made close to its neighbors pk this suggests each state visited during the proposal should be close to equilibrium under each of the transition operators and that the final state should nearly be drawn from pk which is close to pk thus the acceptance rate can be high for large k one might expect that deterministically raising and lowering the temperature gives better performance than the markov chain employed by simulated tempering in fact the advantage is roughly cancelled by the need for a larger number of intermediate distributions see neal a for details the main advantages of tempered transitions are it doesnt need a set of weights like simulated tempering it doesnt have parallel temperings large memory requirements because equation can be accumulated as the states are generated annealing methods x t tk tk x k x k x x t x x k tk x k x k tk x k x k pk x a current state x t tk x k x t x x k tk x k x tk x k x k tk x k xk pk x b proposed state figure tempered transitions a the algorithm starts with a state k and x generates the ensemble x b the acceptance probability is for the proposal x which by reversing all the states would make k the new state generated under x the equilibrium distribution pk generalization to only forward transitions in standard tempered transitions tk and tk are mutually reversible here we consider two modified algorithms a forwards tempered transition operator t uses tk tk tk to form the proposal distribution q where each tk is any transition operator leaving pk stationary a reverse tempered transition operator t uses tk tk tk to form the proposal distribution q where each tk is the reverse transition operator corresponding to tk both algorithms use the same acceptance rule as before neal a shows that the operator defined by standard tempered transitions satisfies detailed balance here we present a similar proof to show that t and t are mutually reversible with respect to pk as in standard tempered transitions we start at k and generate a sequence of states x x x x k k x k to simplify notation both and are taken to mean x x x throughout the probability of generating a particular sequence of states given the initial state using only forward transitions tk is k k q k k k x x x k k tk k k x x k k k tk k k x x k k tk k k pk k x x x pk k x k tk k k pk k x x x pk k x pk k x pk k x k k k tk k k x x k tk k k x x pk k pk k x x pk k pk k x x annealing methods using equation and equation we can compute the equilibrium probability x x x of starting at k generating the sequence k k and accepting the move with operator t x x x x t k k k pk k qk k k pk k ak k x x x x x x k k k k x x tk k k k tk k k pk k min x x x k pk k pk k x x pk k pk k x x qk k k pk k ak k x x x x x x t k k k pk k x x x x x x t summing both sides over all intermediate quantities k k kk gives x x x x x x t k k pk k t k k pk k this is the equilibrium probability of observing the corresponding reverse move under which shows that both t and t leave pk invariant therefore we only need one of them t say this means that the standard tempered transitions algorithm still leaves the target distribution invariant if forward operators tk are used for both tk and tk implementing reverse operators is not required generalization to a single pass the second half of tempered transitions is identical to annealed importance sampling neal states that the major difference between the two methods is the requirement for an upward pass in tempered transitions making the updates twice as expensive in fact this upward pass is not necessary looking back at figure there is no reason why the ais ensemble x q cannot be used as a metropolishastings proposal for updating a state x drawn at equilibrium from p this is a metropolis independence sampler where the proposal does not depend on the current state this method of turning an importance sampler into a markov chain operator dates back at least to hastings importance sampling seems a more natural way to construct estimators than the metropolis independence sampler notice that an independence sampler sometimes rejects states when the current state is more important the states kept by the algorithm will depend on the order in which the proposals are made which is arbitrary as they are independent it seems likely that treating the proposals identically through importance weights will be better and theoretically this is usually the case liu multicanonical ensemble tempered transitions and annealed importance sampling are really two applications of the same joint distribution the main difference is whether an mcmc operator or an importance sampler is preferable in isolation the importance sampler may yield better estimators but the mcmc operator can be combined with other mcmc operators hastings argued that unweighted mcmc samples are more interpretable although one could also obtain approximate unweighted samples by resampling from a distribution over points proportional to their importanceweights multicanonical ensemble the multicanonical ensemble berg and neuhaus is another method for allowing easier movement around a state space rather than introducing extra distributions the original probability distribution is reweighted pmc x wmc ex eex e log p x the weight function wmc x only depends on a states energy e which is the log of the states original unnormalized probability the weights are chosen such that the marginal distribution over energies pe is uniform as far as possible if the energy is unbounded a cutoff is introduced why set the distribution over energies uniform as with the temperaturebased algorithms more time is spent in regions of low probability which helps join isolated modes but the multicanonical ensemble is just a heuristic there is nothing fundamental about this distribution in fact visualizing such a distribution can be somewhat surprising figure shows the result of reweighting a simple onedimensional distribution for simplicity we created a set of energy bands and used a procedure that only ensured that the distribution over bands was uniform the most probable states in figure c are in the extreme tail of the original distribution few states occupy the lowest energy bands so any particular one must be visited frequently in many highdimensional problems the lowest probability states will be plentiful and the tails will be given much less prominence however when there are only a relatively small number states at a particular energy the very high probabilities assigned by the multicanonical ensemble can be very useful berg and neuhaus invented the multicanonical ensemble to deal with distributions containing bottlenecks regions with small numbers of states that separate regions with significant probability mass strangely bottleneck states can be problematic even when they are individually more probable than many of the states typical under the distribution a markov chain is not able to spend much time exploring the states between important regions if there are many more states elsewhere with much higher total probability this phenomenon is known as an entropic barrier and can be overcome by multicanonical ensemble original probability x a energy x b multicanonical probability x c figure an example of a multicanonical ensemble a the target distribution of interest b the energy function that defines this distribution the horizontal lines mark bands of energy used to construct the new ensemble c the final figure results from setting the probability of each state proportional to the reciprocal of the number of states in its energy range drawing a state from this multicanonical distribution and reading off its energy band gives a uniform distribution over the bands some artifacts in the plot are due to the precise way bands were chosen but note that the sharp peaks in the middle of the multicanonical plot are due in part to the truncation of the original distribution at x setting the distribution over energies is a global operation and can have unexpected consequences exact sampling a multicanonical ensemble which makes states with abnormal energies low or high much more probable as in simulated tempering finding the weights or weighting function for the multicanonical method is difficult the original application of the multicanonical ensemble used prior knowledge to construct a parametric approximation that would give a nearuniform distribution over energy general applications often use histogram methods that bin ranges of energy together and iteratively fit weights for each bin on the basis of preliminary runs for more advanced methodologies there are whole theses largely dedicated to these methods eg smith ferkinghoffborg unlike annealing methods the multicanonical method never simulates the target distribution however correlated samples from pmc can be used to construct importance sampling estimators of any quantity of interest this is explored further in chapter exact sampling exact or perfect sampling simply means drawing a sample from a target distribution markov chain monte carlo uses approximate samples the distribution over the position at the sth step of a markov chain depends perhaps weakly on its starting position at s surprisingly it is sometimes possible to obtain exact samples from markov chains a sketch of the technique known as coupling from the past cftp propp and wilson is given in figure the core idea is to find the location of a markov chain that has run for an infinitely long time without having to simulate its entire length typically exact sampling algorithms require the ability to track sets of states through a random possibly large number of markov chain steps this is technically challenging but provides a means to draw samples from some distributions such as large potts models and some spatial point processes where traditional means fail see wilson for reviews examples and more algorithms for exact sampling with markov chains is exact sampling actually useful some users like to see samples which can give insight into a problem but in many statistical applications there is no interest in the samples themselves only estimates of expectations given some exact samples the variance of an estimator could be improved by running markov chains from the samples to produce more correlated samples if it is possible to start with exact samples some comfort is derived from removing any bias associated with the burn in of the chain but it is still better to run a small number of long markov chains than investing in equilibrating many independent ones section several algorithms in chapter rely heavily on exact sampling these require an exact sample from a different distribution at each iteration using approximations is dangerous as biases can accumulate at every iteration so markov chains potentially exact sampling x time a a sample is at the end of an infinite chain x time t b we try to find the end with a finite number of random numbers x time t t c eventually after a random amount of time we find the sample figure coupling from the past cftp overview each dash on the time axis represents a random number used by the chain everything must be consistent with the infinite chain in a we try to locate the final state by looking at drawing just some of the random numbers b before time t we know nothing so the chain could be anywhere we evolve a bound on the state space forward to time zero at first this might only localize the sample c we look at more random numbers further into the past until a bound collapses to a single state we then follow it to the exact sample at time zero it takes a random amount of computation before the bound can be made tight exact sampling have to be run for a very long time exact sampling gives a stopping rule and one that guarantees zero bias exact sampling example the ising model coupling from the past relies on being able to prove where the final state of a markov chain lies without knowing the starting location to demonstrate that this is sometimes possible we describe the summary state technique following childs et al the method is originally due to huber who uses the name bounding chains our task is to find the setting of an ising model a potts model subsection where each variable takes on q settings si we assume the system evolved under gibbs sampling subsection from time t to t each variable was updated in order at each time step according to the following algorithm uit uniform p p si sji sji sji sji exp si jei t t t jij sj jij si sj t exp t jei if uit p then set si else si in general the setting of variable i at time t depends on the random number uit and the settings of its neighbors j ei the summary states algorithm starts by drawing all the random numbers uit for some finite range of time t t we dont know what happened before time t so for each variable we set si t we then follow the gibbs sampling updates from t t to t setting the states to s whenever the result of an update depends on the settings of states we do not know uit uniform pmax max p si sji sji sji sji sj sj t t t pmin min p si sji sji sji sji sj sj t if uit pmin then set si t if uit pmax then set si t otherwise set si t t t if some of the si states are set to then we do not know the setting of our exact sample we draw more of the random numbers uit from time t back to time t and start again from st the hope is that eventually after doubling the number of markov chain steps enough times no s will remain and we will know s discussion and outlook more sophisticated versions of the algorithm have been developed for models with q and for tracking dependencies among unknown states when the connection strengths j are large it wont be possible to get rid of all the s in practice better markov chains than gibbs sampling are required while the details are considerably more involved it is possible to bound the behavior of randomcluster samplers propp and wilson and carefully implemented swendsenwang samplers huber discussion and outlook in this chapter we have reviewed a subset of the established literature on markov chain monte carlo methods most mcmc algorithms are underpinned by proving detailed balance which is almost synonymous with the seminal algorithms of metropolis et al and hastings some choices still remain in the choice of acceptance rule and the construction of estimators we presented an interpretation of wasterecycling estimators which we will apply in a new setting in chapter we also reviewed a twostage acceptance rule in subsection which trades off statistical optimality to give computational savings we extended this idea to slice sampling in subsection algorithms and proposal distributions tuned for particular applications are clearly important but were beyond the scope of this review moreover we did not discuss techniques that sample distributions with variable dimensionality eg green or infinite dimensionality eg neal c while important in some applications or as a replacement for model selection techniques they are not used by the remainder of the thesis we do have a brief excursion into sampling from an infinitedimensional distribution at the end of chapter although this is more closely related to the work on exact sampling reviewed in the previous section auxiliary variable methods have the potential to dramatically accelerate markov chain exploration figure demonstrated the large changes possible in a single step of swendsenwang slice sampling doesnt make such dramatic moves but it is an easytoapply method that we have found personally useful both in this thesis and in quickly testing hypotheses in discussions with colleagues similarly hamiltonian monte carlo has proved personally useful in work outside this thesis murray and snelson and we feel it should be used more widely the use of annealing methods appears somewhat mysterious especially in a statistical setting it seems strange to sample at a range of temperatures in applications outside of physics however the annealing heuristic figure can be very effective in problems with multiple modes it is less clear whether these distributions need to be simulated together as in parallel tempering or separately as in the other algorithms we explore the issue of populations of samples in the next chapter this work leads us to construct discussion and outlook new algorithms for constructing nonreversible markov chains operators that do not satisfy detailed balance returning to annealing chapter explores the computation of normalizing constants there we explore why standard mcmc is insufficient for solving this problem and the advantages of using multiple distributions as in annealing we then provide a critical and comparative review of annealed importance sampling the multicanonical ensemble and nested sampling a new method by john skilling we will propose new procedures for running both nested sampling and annealed importance sampling finally all of the algorithms discussed so far rely on being able to evaluate a function proportional to the target distribution p x px it turns out that this assumption is not met by a class of statistical inference problems which we label doublyintractable chapter is dedicated to constructing new algorithms for this problem their operation draws on ideas from all areas of this review including annealing and exact sampling chapter multiple proposals and nonreversible markov chains markov chain monte carlo is ambitious a huge amount is asked of a solitary point diffusing through the void of a complex highdimensional space the target stationary distribution may have multiple modes a representative sample of which must be found also somehow the mass of each mode is implicitly estimated so that the correct amount of time is spent in each one even navigating a single mode can be difficult a long standing idea is that multiple points evolving simultaneously should help communication amongst points might help find and explore the regions of high probability mass the simplest way to introduce multiple points is to target a product model p x k k pxk this is just like parallel tempering subsection except that each of the independent distributions are the same although each of the xk variables are independent a markov chain operator used to update one of them can depend on the setting of the others at the same time step this provides a limited but adapting source of information about lengthscales in the problem and the positions of modes adaptive direction sampling ads is a name given to at least two algorithms one of these updates a point by resampling from its conditional distribution ie gibbs sampling along a direction suggested by another pair of points gilks et al this allows coupled movement of highly correlated variables which would be difficult under standard axisaligned gibbs sampling initial experiments with ads highlighted the need for many more particles than dimensions in order to equilibrate in all necessary directions then the burden of bringing k points all to equilibrium simultaneously removes the advantage on the other hand some advantage is obtained with a small number of points when combined with standard axisaligned updates to ensure mixing in the full space i have been able to produce very similar results using a slice sampling based version of ads as suggested in mackay related ideas are population monte carlo still being actively pursued in the literature ter braak christen and fox while general theoretical results seem elusive a small number of parallel chains is often empirically useful and seems advisable in general this chapter is dedicated to exploring other monte carlo algorithms that attempt to leverage an advantage from multiple points the first of these population monte e carlo as in capp et al is an existing importance sampling method described as a competitor to mcmc methods we highlight some important differences with mcmcbased methods like ads not made apparent enough in the current literature we then return to mcmc methods looking at those not based on the ads product model these methods have a single markov chain backbone which branches to a set of points for consideration at each step ensembles of dependent points might allow more thorough local exploration section reviews and extends multipletry metropolis mtm an algorithm for making good use of expensive proposal mechanisms section covers ordered overrelaxation an existing extension to gibbs sampling which draws multiple points to improve navigation of highly correlated variables we extend ordered overrelaxation from just gibbs sampling to arbitrary transition operators and multiple metropolis proposals population monte carlo many monte carlo algorithms could be described as population methods iba a provides a review of some of them this section refers to a specific method with the name population monte carlo pmc as described by capp et al and e popularized in robert and casella it is described as an algorithm which can be iterated like mcmc but unlike markov methods can easily be adapted over time examples were presented in which replacing mcmc with pmc gave better results in fact it is also easy to demonstrate the reverse this section provides a critical review of pmc the template of a pmc method is given in algorithm k particles are evolved in parallel an arbitrary distribution qks is used to move each particle at each step these distributions can depend on all the particles entire histories unbiased estimates can be constructed from the proposed points k weighted by their importance weights x as usual subsection the weights are usually only available up to a constant consistent estimators are still available by using normalized weights the resampled points drawn in step are a biased sample from the target distribution for finite k for example k would just give a sample from the proposal distribution when k is large the points are approximately unbiased samples from the target distribution it is hoped these will form a useful basis for constructing the s population monte carlo algorithm population monte carlo as in capp et al e for s to s for k to k select proposal distribution qks s generate xk qks x compute weight wk s pk qks k x x end for resample k positions with replacement from k with probabilities x s s s proportional to the weights giving this steps sample x x xk end for proposal distributions at the next iteration estimators should be formed from the importanceweighted ensemble before the resampling step as pmc is a framework for importance sampling the qks distributions must have appreciable probability mass across the entire support of the target distribution while capp et al do mention the need for heavy tails they simultaneously suggest that e pmc is a replacement for mcmc population monte carlo borrows from mcmc algorithms for the construction of the proposal from importance sampling for the construction of appropriate estimators these two goals seem mutually exclusive metropolishastings proposals are typically local in nature we do not expect them to capture the entire target distribution sometimes we might propose dramatic moves perhaps from a kernel density estimate tierney and mira warnes but such proposals do not need to be a good global approximation to the target distribution for mcmc to work such transition operators can be combined with local diffusion for robustness importance sampling has no such choice using mh proposals which tend to cut off some or most of the target density can have disastrous effects in pmc some particles would occasionally have very large weights this would give high variance estimators and the occasional collapse of all or most particles to a single point in the resampling step both capp et al and robert and casella describe a mixture pmc e algorithm which seems very likely to suffer from these failure modes of importance sampling the algorithm suggests choosing qks from a set of spherical gaussian distributions with a variety of widths the probability of adopting each width is adapted according to their past ability to make proposals that survive the resampling step to demonstrate the problem with mixture pmc we use a version of the funnel distribution described by neal the distribution is over a zeromean gaussian distributed variable v with standard deviation and several independent zeromean gaussian variates xi each with variance expv this type of structure can easily exist in hierarchical bayesian models the original funnel distribution used nine xi s s multipletry metropolis variates and was exceedingly challenging for simple metropolis sampling slice sampling largely alleviated this problem here we make the distribution easier to handle using only four xi variates so that metropolis sampling will equilibrate more quickly we used spherical gaussian proposals with standard deviations selected uniformly from an ad hoc set of proposal widths metropolis was run times with steps in each run which produced a good histogram for the marginal distribution of v figure a only in the tails of the distribution are some frequencies not quite right it would have been better to sample fewer longer runs section but we wished to provide a direct comparison to pmc steps with particles figure b shows a histogram of samples obtained from step of pmc algorithm these samples do not match the true marginal distribution although we can expect this distribution to be biased figures cf show estimates of vs marginal distribution based on the importance weights of the particles before resampling the pmc estimates are terrible for all four variants of the algorithm described in the figure none of the gaussian distributions are good importance samplers for this distribution a combination of the estimators does not work well either large importance weights during the pmc runs often caused many particles in the population to move together into a new region of space unlike capp et al we do not e see this as a sign that the space is being explored instead it indicates the underlying high variance of the importance sampling estimator when the distributions are inappropriate for importance sampling there is no guarantee that the time spent by the population within a region reflects the target distribution instead the times between large importance weight events depend on the details of the tails of the proposal distributions in this case adapting the distributions did not solve the problem in any case favoring proposals that can lead to extremely large weights seems unwise iterated importance sampling algorithms are attractive in some contexts however using proposals from mcmc algorithms within importance sampling is not generally advisable we now return to algorithms that stay within the mcmc framework multipletry metropolis this section concerns the multipletry metropolis method mtm of liu et al we review the method and provide a simple interpretation as an evolving joint distribution including auxiliary variables this almost trivial rewriting of the algorithm makes a possible improvement obvious and relates to the pivotbased metropolis method introduced later in the chapter the mtm algorithm was motivated as follows local proposal distributions qx x may lead to slow convergence but longer range moves are rarely accepted if multi multipletry metropolis a metropolis b pmc samples q separate adapted c pmc weights q separate adapted d pmc weights q combined adapted e pmc weights q combined fixed f pmc weights q separate fixed figure histograms of samples of the v parameter from the funnel distribution the curves show the correct marginal posterior all samplers were initialized at v x both metropolis and pmc used spherical gaussian proposal distributions with the same set of proposal widths for pmc these were adapted as e in capp et al or chosen from a fixed uniform distribution the pmc importance weights were either computed using the proposal width that was used q separate as in capp et al or using q combined the mixture of gaussians e obtained by summing over the choice of width robert and casella multipletry metropolis ple points are proposed from a longrange distribution then it is more likely that an acceptable one will be found mtm algorithm is one way of allowing k draws from a proposal distribution figure illustrates and motivates the procedure with a graphical model algorithm a single step of the multipletry metropolis markov chain draw k iid proposals from xk qxk x compute weights wx xk px qxk x x xk is any symmetric function x xk xk x choose one proposal x x xk using probabilities proportional to the weights generate a new set of k points xk qxk x for k k and define xk x wx x wxk x compute ax x wx x wxk x assign x x with probability mina mtm is actually a family of algorithms defined by the function used at step to weight the points mtmii is the mtm algorithm with symmetric q and q giving weights wxk pxk as various of choices of were reported to behave very similarly we use mtmii in our experiments figure s graphical model description immediately suggests an alternative algorithm step of the mtm algorithm resamples x xk from the joint distribution this discards the setting already made by a previous iteration in some circumstances keeping the existing proposals could save computer time this would be achieved by omitting step after the first iteration and updating step to include xk xk in the assignment efficiency of mtm for a given proposal distribution qxk x each step of mtm is more expensive than a single metropolishastings proposal counting the number of target distribution evaluations suggests that the computational cost is roughly k times more for mtm various practical issues make the exact cost tradeoff more complex in many circumstances the time taken to evaluate k probabilities will grow sublinearly with k this may be because some intermediate results can be shared or because related computations can be vectorized and computed efficiently on some types of computer the sequential nature of mcmc often makes it difficult to vectorize operations yet such code transformations can have surprisingly large benefits especially in popular interpreted environments such as r octave or matlab the results reported by liu et al suggest that mtm outperforms metropolis without needing implementationbased codetuning one example involved simple multipletry metropolis x x x xix xk a k proposals are generated and one selected x x x x xk b a suggested swap makes x the sample from px x x x xx i xk c resampling all but x and x makes the joint setting more plausible figure multipletry metropolis mtm a x is assumed to come from the target distribution px k proposals are drawn xk qxk x this defines a joint target distribution on which the markov chain operates firstly one of the proposed values xi say is chosen and labeled x b one could consider proposing a swap x x giving a joint state in which x generated x xki the acceptance probability for this proposal could typically be low for large k it will often be obvious that x rather than x generated the remaining variables c mtm swaps x and x and also resamples the remaining variables from their new stationary distribution xki qxk x the acceptance rule is simply metropolishastings applied to the proposal a c multipletry metropolis varxt index mtmii metropolis metropolis figure performance of multipletry metropolis based on figure from liu et al samplers were run on a tdistribution as described in the main text plotted is varx t the variance of a samplers state after t steps from x estimated from runs the time taken for a chains states to match the true variance of is a measure of its burnin time computer time was measured as follows for mtmii with k index is the number of steps taken for the metropolis algorithms index is optimistically the number of steps divided by k the performance of metropolis with and mtmii with closely reproduce the results from the original paper rerunning metropolis with a gaussian stepsize of shows a much shorter burnin time than mtm gaussianbased proposals on a onedimensional tdistribution with five degrees of freedom the amount of computer time to reach equilibrium from an extreme fixed starting point x appeared to be considerably less for mtmii with n x proposals than for metropolis with n x proposals as shown in figure this claim is reproducible but why were different proposal widths and used while making the comparison presumably a longer stepsize was deemed appropriate for mtm as it has more attempts to find a good proposal but our results show that metropolis with approaches equilibrium according to the measure used faster than metropolis with or mtm with the amount of computer time needed to forget a particular atypical initialization is only one measure of performance we also used rcoda cowles et al to estimate the effective number of samples based on a spectral analysis of the time series these are compared in table table equilibrium efficiency of mtm and metropolis method iid sampling control mh mh mtm effective samples proposals the acceptance rates for metropolishastings with and were and respectively this suggests that the optimal step size is somewhere in between the two multipletry metropolis values tried mtms acceptance rate was close to ideal for both standard mh and according to liu et als experience mtm despite this mtms sampling efficiency on this very simple example distribution is worse than mh at both equilibrium sampling and initial burning in this is a negative result but one worth noting using multiple parallel proposals is not in our experience worthwhile in itself this makes sense when proposals are made sequentially a good proposal can be accepted immediately and future proposals move from the new location in contrast only one point from a set of parallel proposals can be accepted if there is more than one good point the rest are wasted the motivation for using multiple proposals cannot be to explore larger search regions instead we should consider mtm if making multiple proposals together is computationally cheaper than making them separately the orientational bias procedure in frenkel and smit algorithm section is equivalent to mtmii in this context the proposals involve moving molecules a molecules probability depends on independent terms depending on its position and orientation the orientational bias procedure proposes a single new position with several possible orientations while the issue isnt discussed in detail presumably it is sharing the positional energy computation across proposals that makes the algorithm worthwhile in molecular dynamics simulations a statistical example claiming benefit from multiple proposals is given by qi and minka they drew proposals from local gaussian approximations based on the hessian of the log posterior at the current sample location in highdimensional problems these proposals would be expensive to construct od to prepare a matrix factorization for a multivariategaussian sampler drawing multiple samples is then cheaper od even sequential sampling can take advantage of this because the proposal will be reused until a move is accepted one wonders if given the effort expended to construct the distribution drawing more samples than needed for acceptance could be useful the reported accuracy for a given cpu time does appear to be slightly better by combining hessianbased sampling with an mtm variant multipleimportance try multipleimportance try mtm makes it more likely that the chain moves but is wasteful of information multiple evaluations of the probability density are made at each iteration but at most one of these points is used qi and minka suggested a way to use all of the xk points and called their combined hessianbased mtm method adaptive multipleimportance try amit we will refer to the part of the algorithm that concerns including all the points in mtm as multipleimportance try mit qi and minka also discuss cheap approximations to hessians but are still keen to get the best use from them however constructed multipletry metropolis mit is the standard multipletry metropolis algorithm with a particular choice for postprocessing the samples if p can be evaluated then the samples are weighted by wk pxk qxk x when px is only known up to a constant the wk are computed using the unnormalized version p x and then made to sum to one this is standard importance sampling using qxk x as a proposal distribution the locations x of these importance samplers are chosen using mcmc using local gaussian approximations can lead to importance sampling estimators with large or even infinite variances this suggests that mit is likely to give erroneous results on highdimensional problems mit will work when good global proposal distributions can be obtained but we are unsure why mcmc would be an appropriate way to adapt these distributions these are similar concerns to those discussed in section regarding population monte carlo wasterecycled mtm waste recycling subsection is a method for using proposals that are not accepted as part of the markov chain here we describe how to recycle the points from the mtm method that are normally unused waste recycling does not carry the risks of huge variances associated with using importance sampling estimators with metropolis proposals we first take the mtm joint distribution identified in figure a k p x xk k k px k qxk x and augment it with a new variable x the marginal distribution of the new variable should be the target distribution of interest this could be achieved by copying x or picking one of the proposals at random and copying it according to the metropolis hastings acceptance rule ak p x xk x k min pxk qxxk px qxkx x xk x x k ak as in subsection any quantity can be estimated by averaging over each of the x xk settings weighted by their probability under this conditional distribution so far we are still discarding the xk variables drawn by the algorithm if the x xk joint state were accepted we could draw a new auxiliary variable x p x xk using x the conditional distribution analogous to equation to include both possible joint multipletry metropolis settings in an estimator we use a further auxiliary variable x which is equal to one of x or x we could define the probability of x x according to the min a acceptance probability of the mtm algorithm however this would sometimes give zero weight to x in normal wasterecycling we do not mind ignoring the initial state as it was already used at the previous iteration but the xk were created at this iteration instead we use px x x xk x xk wxk x k wxk x k wxk x k which also gives x the correct stationary distribution the final estimator averages over all states visited at each iteration based on the following identity for epxf x epxxk x x k p x xk p x x xk x f x pxx xk x xk the expression looks cumbersome but when expanded out only costs ok to compute the estimator can be implemented once and used as a blackbox the conditional probabilities of x and x equation require some backwards pro posal probabilities not normally needed by mtm or mit this would remove the potential advantage when proposals are expensive as in qi and minka but would not be a problem when as in frenkel and smit the purpose of multiple proposals is shared computation of px we compared the mtms mean square error mse in estimating x to that obtained by the wasterecycled version wrmtm on the same tdistribution example as before xxxk x x xk xx x table accuracy on tdistribution after proposals method mh mh mtm wrmtm mit mean square error mse postprocessing the mtm run with wasterecycling reduces the mse considerably but still not to the level of metropolishastings speedups from efficient implementation of parallel proposals would still be required to justify wasterecycled mtm treating the xk proposals as draws from importance samplers the mit estimator obtains the unless we are applying the alternative algorithm suggested just before subsection ordered overrelaxation figure the idea behind successive overrelaxation for minimization when performing component updates moving past the minimum along the current search direction can give faster progress along the direction that points to the optimum monte carlo overrelaxation methods attempt to successively move to the other side of a components conditional distribution which can cause persistent motion along a more interesting direction best mse this reflects the fact that importance sampling is usually more appropriate than mcmc for estimating the statistics of simple d densities in higher dimensional problems where mcmc is favored over simple importance sampling the proposals will generally be more local and take longer to explore the target distribution this situation was simulated by setting the proposal width to and taking mtm steps with k averaging over such runs the mean square error mse of mtm in estimating x was this wasnt hurt by wasterecycling which had an mse of there wasnt much to gain from using the extra samples as the mtm sequence is highly correlated as such recycling would not be worth the computational expense but wouldnt be too harmful the mit importance sampling estimate had a significantly worse mse of mit does not necessarily reduce variance in highdimensional problems it could be much worse ordered overrelaxation ordered overrelaxation neal b is a technique to improve the convergence of gibbs sampling for some distributions with strong positive correlations amongst its variables a variety of mcmc methods exist that are inspired by successive overrelaxation in optimization see figure ordered overrelaxation is one of these methods considered here because it is based on a population of proposals ordered overrelaxation is based on gibbs sampling a normal iteration of gibbs sampling involves resampling xd from its conditional distribution pxd xd d a wasteful way to implement this rule is to draw k samples from the conditional distribution and then pick one of them uniformly at random we could leave the conditional dis ordered overrelaxation tribution stationary by also including the current setting and choosing uniformly from k settings moreover we need only apply a transition operator that leaves this uniform distribution over k settings stationary ordered overrelaxation provides such an operator the ordered overrelaxation transition operator requires that the k candidates for the dth component can be ordered in some way if they are real scalars we simply sort by numerical value if only a partial ordering is available then ties are broken arbitrarily after sorting the points they are relabeled in order s s si xd sk the operator chooses xd ski as the next step in the markov chain as long as k is odd this rule will always pick one of the k new values because the rule is deterministic and reversible it clearly leaves a uniform distribution over the points stationary choosing xd ski tends to move to the opposite side of the conditional distribution from xd this is precisely the goal of overrelaxation the effect becomes stronger when increasing k for very large k the point is moved almost deterministically from its current quantile q to q in na implementations any benefits from persistent motion is cancelled by the cost ve of drawing additional samples whether ordered overrelaxation is useful will depend on whether repeated draws sk t sk z can be performed cheaply neal b discusses some circumstances in which this can be done another potential cost saving is to make the number of points k small when ordered overrelaxation is less useful we now introduce a new procedure with this goal adapting k automatically our new algorithm attempts to use fewer samples in ordered overrelaxation when the current position is close to the center of its conditional distribution this is a situation where overrelaxation is typically unhelpful it requires the user to set kmin and kmax parameters giving the smallest and largest acceptable number of samples to draw per iteration the procedure is detailed in algorithm the numbers of new points to the right and left of the current position are tracked as r and l in an extreme situation where the current position is at the very edge of the conditional distribution one of r and l will remain at zero and kmax points will be drawn conversely when xd is near the median of the conditional distribution the algorithm will typically return sooner after obtaining an unordered set of points from algorithm the list is sorted together with the current point xd si giving an ordered set sk k as before we now show k that ordered overrelaxation still satisfies detailed balance using a variable k to do this pivotbased transitions algorithm self sizeadjusting population for ordered overrelaxation generates number of points k and unordered set sk k k inputs kmin kmax current location x initialize l r for k kmax sk pxd xd d if sk xd then r r else if sk xd then l l end if if r kmin and l kmin then kk return end if end for we need the equilibrium probability of starting at point xd si generating a set sk and then deterministically transitioning to xd ski the order in which the set of points was generated does not matter although some orderings are not possible because algorithm would terminate before generating the entire set let cl r be the number of orderings of r l points with r greater than xd that can be generated by the algorithm the equilibrium probability of all the points is ci ki k k psk xd d given the symmetry cl r cr l this is also the probability of starting at xd ski producing the same set of points and then deterministically moving to x si summing over all intermediate points shows that the transition xd xd satisfies detailed balance pivotbased transitions inspired by the ordered overrelaxation generalization of gibbs sampling we now introduce a new method which uses general markov chain transition operators we start from any pair of mutually reversible transition operators t x xpx t x x px which leave a desired target distribution px stationary a single reversible operator t t will also suffice algorithm takes the existing operator pair and constructs a new transition operator t x x this procedure is illustrated in figure the pivotbased transition operator will favor moves to the opposite side of the region accessible by t as defined by a user supplied ordering the stationary target stationary distribution px is maintained without introducing any rejections into the chain although t x x x will carry finite probability in some discrete settings or if t x x x is finite each of the sk k points in algorithm are marginally distributed according to k pivotbased transitions algorithm the pivotbased transition t x x take one step of t from point x to a pivot state z ie z t z x use t to draw k points one step away from z ie sk t sk z k k use the sk and x points to create an ordered list sk k break ties k arbitrarily and relabel the points such that sk sk identify index i giving x si set x equal to ski start point x pivot state z points generated from t s z point selected for x a forwards start point x pivot state z points generated from t s z point selected for x b backwards figure illustration of the pivotbased transitions operator for k between points x and x via points z and set s consisting of the set of round dots excluding the ringed one a moving forwards generating a draw from t x s z x b the reverse process involving the same s and z occurs with probability t x s z x pivotbased transitions the target distribution they also have the same relation to the pivot state all points including x and x can be seen as draws from t sk z by symmetry the point labeled as the starting position can be resampled from a uniform distribution over the points or updated as in ordered overrelaxation which leaves the uniform distribution stationary while perhaps unconvincing as a proof this symmetry is important if all the proposals started at x then this location would be made special making it harder to construct plausible reverse moves starting at any other point we now show explicitly that the new operator satisfies detailed balance given a starting point x the probability of generating a pivot state z a new point x and k other points set s as in figure a is t x s z x t z x k t x z ss t s z the first step must produce the pivot state z from x the other points can then be produced in any of k possible orderings the probability of producing the same configuration of points by starting at x as in figure b is t x s z x t z x k t x z ss t s z a small amount of manipulation gives t z x t x z px t x s z x pz px t x s z x t x z t z x px pz px t x s z x px t x s z xpx t x x px t x xpx ie the new chain satisfies detailed balance the second line was obtained from the t and t operators mutual reversibility condition the final line is obtained simply by summing over z and s the number of points k used by the algorithm could be automatically adjusted we would use algorithm to draw the points replacing the conditional distribution with t sk z the proof above still holds if we replace k with cl r ordered overrelaxation with pivotbased transitions gibbs sampling updates each component of a distribution xd in turn by applying a transition operator td that resamples from the conditional pxd xd d the original ordered overrelaxation replaces gibbs samplings td operators with modified versions that tend to move to the opposite side of the conditional distribution from the current setting of xd pivotbased transitions gibbs sampling is not the only monte carlo algorithm that updates components of a distribution in turn now we can perform ordered overrelaxation by replacing any algorithms componentbased update operators td with their pivotbased versions each update will tend to move the current setting to the opposite side of the distribution defined by td when the original markov chain is gibbs sampling the z states become irrelevant and the resulting transition operator is equivalent to the standard ordered overrelaxation algorithm figure gives an illustrative example of ordered overrelaxation applied to a bivariate gaussian distribution using pivotbased transitions we can obtain the same benefits as the original ordered overrelaxation with slice sampling which doesnt require the ability to sample from conditional distributions again as with standard ordered overrelaxation the benefits are only worth the computational cost if some saving can be made when drawing multiple points although pivotbased transitions do not introduce any rejections into the chain if the underlying transition operator is a metropolis method then the chain will sometimes stay still the rejections will cause reversals in the direction of motion reducing the ability of the chain to maintain a persistent direction this will tend to occur at the edges of a distribution where rejections are more frequent and where ordered overrelaxation is normally of most use figure c demonstrates that pivotbased transitions based on metropolis works much less well than the same method based on slice or gibbs sampling in section we will develop a better metropolis method based on pivot states first we explore another use of pivotbased transitions persistence with pivot states as pivotbased transitions apply to general transition operators we can consider algorithms that are not tied to componentbased updates in this section we present an alternative way to achieve persistent motion using pivotbased transitions this algorithm is based on the expanded distribution px z t z xpx t x zpz for other ways of constructing nonreversible chains that use this distribution see neal b here we do not require the base operator t to be reversible a sample x z from the expanded distribution provides an equilibrium sample x and a pivot state z as produced by step of algorithm following the rest of the algorithm gives a new setting x z the same pivot state with a new starting point pivotbased transitions a standard ordered overrelaxation gibbs sampling b pivotbased transitions univariate slice sampling c pivotbased transitions metropolis figure ordered overrelaxation schemes applied to a highly correlated bivariate gaussian some exact samples from the distribution are shown in black standard metropolis shown in red proceeds by slow diffusion ordered overrelaxation schemes shown in blue are able to make persistent progress along the distribution here exaggerated with k rejections in the underlying transition operators upset this persistent motion however the metropolis sampler in c gains little benefit from pivotbased transitions pivotbased transitions this procedure leaves the expanded distribution invariant the proof is similar to before t x s x z k t x z ss t s z t s z ss t x s x z k t x z t x z t z xpx pz t x s x z t x s x z t x z pz t z x px t z xpx t z x px t x s x z t z x px t x s x z t z xpx t x z x z px z t x z x z px z the final line was obtained by using equation and summing over the intermediate points s we have shown that the appropriate detailed balance relationship holds with respect to the expanded distribution equation suggests that the sample from the expanded distribution can also be interpreted as an equilibrium sample z and a pivot state x t x z this pivot state could be updated by an alternative pivotbased transition operator t based on algorithm as t was used in step we switch to using t to produce the k points in step a version of the above proof shows that this operator also leaves the expanded distribution invariant given a pair sampled from the expanded distribution x x px x t x x px we alternate between two markov chain updates first we update x using x as a pivot state with algorithm operator t next we update x using x as a pivot state with the alternate operator t figure illustrates the effect of alternately applying these two operators both operators have a tendency to reverse the ordering of x and x by continually jumping over each other the pair move persistently across the space that induces the ordering by chance for finite k the points will sometimes stay in the same order in subsequent iterations motion will be in the opposite direction allowing the chain to mix across the whole space figure shows pivotbased persistent motion when the base transition operator has a small step size which would normally lead to slow diffusion the effect is dramatic the entire range of the distribution is explored in many fewer iterations however the improved dynamics alone are not actually sufficient to justify the increased cost of drawing multiple proposals as with mtm some further savings such as parallel computation are required pivotbased transitions point to be updated pivot state updated at next iteration new points the set s updated point next pivot figure using pivot states for persistent motion each row shows a new iteration of the algorithm described in the main text the pivot state tends to move in the same direction for multiple iterations in the final two iterations shown the direction of motion has reversed drift in one direction would persist for longer if a larger number of states were drawn from the pivot state metropolis k position x iteration number figure example of persistent motion exploring uniform with gaussian proposals metropoliss random walk behavior is greatly reduced by pivotbased transitions although at a greater computational cost pivotbased metropolis pivotbased metropolis pivotbased metropolis is a new alternative to mtmbased methods for using multiple metropolis proposals as in mtm we use an arbitrary proposal distribution qxk x but by making different tradeoffs orderedoverrelaxation updates become possible the new procedure will follow algorithm closely but uses arbitrary proposal distributions rather than transition operators arbitrary proposals could be turned into metropolishastings transition operators for use in algorithm but many of the candidate points would be in the same place due to rejections algorithm provides a way to use all of the points proposed from a proposal distribution q the pivot state can optionally be created using a different proposal distribution q also the pivot state need not live in the same space as the distribution being sampled unless applying the persistence procedure of subsection this procedure can be seen as an instance of random proposal distributions besag et al algorithm the pivotbased metropolis operator t x x take one step of q from point x to a pivot state z ie z qz x use q to draw k points one step away from z ie sk qsk z k k use the sk and x points to create an ordered list sk k break ties k arbitrarily and relabel the points such that sk sk identify index i giving x si qz sk compute weights wk psk qsk z choose x sj from sk k using a distribution proportional to the k weights or any reversible move ts j i sk that leaves this distribution invariant the pivotbased metropolis operator of algorithm is a valid transition operator for px proof the equilibrium probability of starting at si generating pivot state z the remainder of the set of points s sk k and choosing final point sj is k t z s j si psi ts j i s psi qz si k ki qsk z which is invariant to swapping si sj summing over all intermediate points skij z shows that t sj si satisfies detailed balance ordered overrelaxation uses a transition rule that leaves a uniform distribution over a set of points stationary pivotbased metropolis and perhaps other algorithms require a generalized move that leaves a nonuniform discrete distribution stationary a probability distribution over points sk can be represented on a unit interval each point occupies a segment with width equal to its probability pk the segments can be ordered by any method that does not use the current setting si or the history of the markov chain summary c si pi sj pj oij c figure reflect move for discrete distributions a unit interval is constructed from segments sk each of these bars has width equal to the points probability given current point si the probability of transitioning to another point is proportional to its overlap with si s reflection for example sj will be selected with probability oij si selftransitions are only possible if si overlaps with the middle of the unit interval this is when ordered overrelaxation is less helpful assume we have a current sample si and wish to a apply a transition operator that tends to take us to the opposite end of the distribution such a transition operator is illustrated in figure the probability interval corresponding to the current point c cpi is reflected to cpi c the reflected interval will have some overlap with one or more points probability segments we sample from these points with probability proportional to their overlap oij with the reflected interval the probability of observing a transition from si sj is ts j i pi oij pi oij pi which is independent of the direction of the transition therefore detailed balance holds the generalized reflection rule leaves the discrete target distribution invariant using this rule within algorithm gives an operator based on arbitrary proposals that is suitable for use in ordered overrelaxation or the persistent motion of subsection summary this chapter explored ways to use populations of points drawn from proposal distributions we started with a review of two existing methods population monte carlo pmc and multipletry metropolis mtm which highlighted two important points proposal distributions used by mcmc are usually unsuitable for simple importance sampling attempts to combine mcmc with estimators based on importance sampling should not forget this drawing multiple points in parallel from a proposal distribution usually gives slower mixing than drawing the same number of points sequentially the parallel methods require savings from shared computations to be competitive in response to item we derived a waste recycling scheme for mtm as an alternative to an existing importance samplingbased estimator related and future work we then focussed on methods that use a population of points to move the state of a markov chain to the opposite side of a userdefined ordering chaining these updates together can cause persistent motion through neals ordered overrelaxation or the method we introduced in subsection a random walk will explore a distribution with lengthscale l in ol steps of size more persistent motion can bring this down to ol our contributions to this area were algorithm which automatically reduces the size of a population when ordered overrelaxation is less likely to be useful saving computation generalizing ordered overrelaxation to transition operators other than gibbs sampling algorithm using ordered transition operators for persistent motion without using componentbased updates subsection generalizing ordered updates to operate directly on a set of weighted points section these moves are less likely to reject than using metropolishastings within algorithm which is important because rejections destroy persistent motion the usefulness of these innovations relies on the assumption that drawing multiple proposals is computationally efficient it remains to be seen if implementations leveraging caching of intermediate results or parallel computations will make our work effective in real applications similar assumptions are made by related work in the literature which we now briefly consider along with directions for future research related and future work tjelmeland and stormark investigate an idea similar to pivotbased metropolis they suggest drawing a set of antithetic variables possibly using quasi monte carlo recent work also applies these ideas to mtm craiu and lemieux pivotbased metropolis could also be extended in this way tjelmeland also recommends wasterecycling which would also apply to pivotbased algorithms the main difference in his algorithm is that all the states are drawn as proposals from the current state x rather than an intermediate state z making reversible moves based on these proposals requires working out the probability of each state reproducing the entire ensemble an ok computation our pivot state approach reduces this computation to ok and makes an exchange of the current point with one of the proposals seem more natural as an example we showed pivotbased ordered overrelaxation applied to slice sampling in fact slice sampling already has its own method for overrelaxation neal although this is susceptible to rejections unless the edges of the slice are found accurately related and future work ordered overrelaxation may be preferable as it never has rejections while needing only enough information to draw k samples there is a way to eliminate the pivot state when slice sampling from unimodal distributions it is possible to gibbs sample the slice sampling auxiliary distribution so standard ordered overrelaxation applies subsection described a new way to make persistent motion along a direction that defines an ordering on the state space choosing suitable methods for ordering points will be problemdependent in general one generic application could be simulated tempering subsection where energy could be used to define an ordering persistent motion along this ordering should encourage the inversetemperature to move more rapidly between zero and one radford neal has suggested personal communication considering an alternative to pivotbased transitions this method starts with a random integer f drawn uniformly between and k inclusive the current point x is evolved through a sequence of f transitions using t then starting at x again a sequence of b k f backwards transitions is simulated using t choosing uniformly from the resulting k states maintains the stationary distribution also reordering the points and choosing the complement to x as in ordered overrelaxation is valid producing a chain of states rather than a star of steps starting at a single state is likely to produce larger moves a version of neals idea based on the weighted reflect operator of section is also possible the chain could use transition operators that leave a cheapertoevaluate surrogate distribution invariant proposal weights could be evaluated for a thinned subset of the chain and then the reflect operator applied yet another possibility is the multipoint metropolis method qin and liu an extension to mtm that uses a chain of states a disadvantage of using a chain of transitions is that each move must be made sequentially the pivotbased algorithms introduced in this chapter could evaluate their proposals in parallel on suitable computer hardware chapter normalizing constants and nested sampling the computation of normalizing constants plays an important role in inference and statistical physics for example bayesian model comparison needs the evidence or marginal likelihood of a model m given observed data d z pdm pd mpm d l d where the model has prior and likelihood l over parameters in statistical physics z exped is the normalizing constant for the distribution over states with energy e at inverse temperature in this context z is an important quantity known as the partition function a function of from which several fundamental properties of a system can be derived the marginal likelihood equation is also a function in this case of the model m which will become important in chapter in this chapter we focus on estimating the constant for a given model we also consider estimating the whole z function from physics which perhaps surprisingly is useful for solving the marginal likelihood problem statistical physics normally uses the normalization in its log form log z is known as the free energy of a system when used in statistical inference log z has also been found to be a useful and natural scale for comparing models given two models m and m the loglikelihood ratio log zz log pdm log pdm can be used in a classical hypothesis test which rejects m in favor of m when a threshold is exceeded this means that absolute differences in log z correspond to shifts in model quality that are meaningful without reference to an alternative model researchers decrypting codes at bletchley park judged that a difference of roughly a deciban log zz is about the smallest difference in two models that people can express table contains qualitative descriptions of the importance of marginal likelihood ratios table a selection of qualitative descriptions of the importance of evidence values taken from kass and raftery and jeffreys appendix b p the third column could be turned into a deviance by multiplying by two for more on units of information content particularly the ban see mackay p log zz loge zz log zz z z evidence against m bans nats bits to to to to to to to to to to to to weak substantial strong decisive beyond reasonable doubt the precise interpretation of the numerical value of a models evidence will depend on context in some applications it is quite possible to keep using m even with a marginallikelihood ratio that in isolation seems in favor of m beyond reasonable doubt in a legal setting it may be that a guilty hypothesis being a thousand times more likely than innocent is enough to convict but other applications associate even more extreme losses with adopting a particular model in data compression of large files a difference in size of bits is insignificant if compressing with m is computationally cheap an improvement in marginal likelihood very much greater than bits is required to adopt a new model even in the legal setting one must remember that the most probable model is not necessarily the one with the highest marginal likelihood extreme prior ratios can cancel large likelihood ratios although this shouldnt happen very often under the rules of rational inference one should mechanically apply bayes rule but if the data are strongly out of line with prior expectations it would seem sensible to carefully reexamine the models assumptions and check for computational errors this chapter focusses on the computation of z the message to take from the above discussion is that computing the evidence of a model to a very high level of precision is usually pointless it is difficult to appreciate a difference in log z of less than one and some errors larger than this may not affect decisions based on the computation this means that even noisy monte carlo estimates can be useful indeed monte carlo techniques are sometimes identified as providing gold standard estimates of z suitable for comparison with other approximate techniques eg beal and ghahramani kuss and rasmussen however just as with any monte carlo integration method it is quite possible to get wrong answers for some classes of difficult problems this includes algorithms guaranteed to be correct asymptotically developing a variety of methods and checks and hoping for the best is all we can do in general this chapter first examines the necessary ingredients of a monte carlo method for computing z next we review nested sampling a new method due to john skilling we provide a comparative analysis of this new algorithm and more established techniques then having analyzed nested sampling in its own right we consider using it as a method for guiding and checking other algorithms starting at the prior starting at the prior by definition the prior should spread its probability mass over settings of the parameters deemed reasonable before observing the likelihood l therefore samples from this distribution can sometimes provide a useful representation of the whole parameter space in particular as z is just an average under the prior a simple monte carlo approach could be attempted z l d s s ls s s the variance of this estimator may be huge most of the mass of the integral is associated with parameter settings that are typical under the posterior the posterior often occupies a small effective fraction of the priors volume so it can take a long time to obtain a representative sample however simple monte carlo equation does work on small problems and provides a simpletoimplement check of more complex code starting at the prior will also form the basis of more advanced methods why not start at the posterior instead posterior samples are concentrated around the mass of the integral also practitioners already sample from the posterior for making predictions it would be convenient if these samples could also estimate z the obvious importance sampling estimator for z based on posterior samples is nonsensical as it involves knowing z a moment of inspiration yields the following harmonic mean estimator noted by newton and raftery l z z l d pd m d l s pd m s s ls as acknowledged by its creators this estimator has the unfortunate property that the least probable points have small l and so carry the largest weights thus the effective sample size tends to be very small indeed the estimator can easily have infinite variance various attempts have been made to fix the harmonic mean estimator both within the original paper and in more recent research raftery et al the authors are keen to avoid sampling from the prior as well as the posterior however this goal seems misplaced firstly implementing samplers for most priors is relatively simple and allows several diagnostics to be performed eg checking prior assumptions look reasonable checking against simple importance sampling equation and geweke s tests for getting it right secondly while some specific instances seem to work there is a generic problem with posterioronly methods as pointed out by neal in the discussion of newton and raftery the choice of prior has a strong influence on the value of z taking a broad prior to an improper limit will send z towards zero starting at the prior however in many inference problems the data are so influential that the statistics of the posterior are fairly insensitive to such changes in the prior this demonstrates that posterior statistics alone bear little relation to z in many statistical problems reliable estimates of z require finding the prior mass associated with the large likelihood values found under the posterior in lowdimensional problems this is a feasible density estimation problem although deterministic approximations to z based on a direct fit to the posterior may be preferable in those cases chib notes that it is sufficient to know the posterior probability at just a single point for any point z l p d m chib provides an estimator for p d m based on the probability of a gibbs samplers transitioning to from each of the points visited by the sampler chibs method is relatively easy to apply and should work well on unimodal problems posteriors with more complex shapes will lead to subtle difficulties in general it is unreasonable to expect the sequence of states visited by a sampler to densely cover the posterior in models with many parameters if this were possible it would be feasible to build a kernel density estimate on the basis of a preliminary run and perform importance sampling instead practitioners simply hope to get a representative sample of points that are useful for prediction it is accepted that the sampler may not go near some typical points as an analogy consider conducting a survey of the world population a sample of people might not include anyone from some cities yet will be adequate for many statistical purposes in contrast chibs method is likely to fail without fairly dense sampling if none of the posterior samples can easily reach within one gibbssampling iteration are in its city the estimator will be badly behaved this wont occur if was chosen from the posterior samples but this is only hiding the problem at the expense of biasing the estimator jumping directly to the posterior is fraught with problems the normalizer z is a global quantity that depends on the entire posteriors relationship to the prior if standard numerical methods apply that is good news but if the posterior is sufficiently complicated to justify mcmc the place to start is probably a much simpler distribution like the prior on small problems simple monte carlo can be more reliable than chibs method neal when solving larger problems prior sampling equation doesnt work either we need advanced methods that start with a feasible sampling problem but overcome the variance problems of simple importance sampling bridging to the posterior bridging to the posterior when samples from the prior rarely find regions of high posterior mass a distribution closer to the posterior is needed for importance sampling to work designing or adaptively finding a tractable approximation to the posterior is one approach although hard in general another solution is a divideandconquer approach rather than finding z directly a series of easier subproblems are solved first consider a family of distributions specified by an inverse temperature p l ee z z e log l the prior has while p is the posterior all of the distributions for involve unknown normalizations given by the partition function z we will try to compute this function for a range of inverse temperatures k k when adjacent distributions p k and p k are close they can be compared with standard importance sampling as reviewed in subsection weights give the relative importance of states evaluated under the two distributions wk lk lk k lk the expected value of these weights is the ratio of the two distributions normalizers zk zk s s wk s s s p k these estimates can then be combined using z zk z z z zk z z z z zk numerical concerns suggest taking logs of a large positive product zk log z log z k log k zk zk where each term in the sum can be estimated from samples under p k this form is also convenient for error bars the variance of an estimator for log z is the sum of the variances of the estimators for the log of each ratio given this basic annealing approach there are several implementation alternatives simulated tempering and parallel tempering reviewed back in section both offer correlated samples from p over a userset range of annealed importance sampling bridging to the posterior subsection samples from distributions which are only close to p but a single run gives the same estimator as above for s k k log wk k k k log z k k log lk where k is the sample drawn at the k th iteration of ais or a single sample from p k when s or multiple ais runs are performed the samples are combined differently independent importance samplers provide local gradients or ratio information at each temperature the samples are first combined at each level as in equation ais gives up the requirement for valid samples from each p k by running importance sampling on a global distribution see subsection thus ais dictates forming independent z estimates from each run and then averaging them generally the ais estimator should be used because it does not assume exact samples are available from each temperature in some cases applying equation to pseudosamples from rapidly mixing markov chains can work better than ais but is difficult to justify aside on the prior factorization some readers may be more familiar with canonical distributions defined by p expez here the energy e is defined in terms of the negativelog of the whole unnormalized probability rather than just the likelihood for finite state spaces this corresponds to equation with a uniform in an unbounded state space p is an improper distribution introducing a tractable normalized base measure ensures it is always possible to sample from p the base measure need not be a prior distribution another choice may be computationally convenient or the problem may not be bayesian and so have no prior in subsection a distribution is factorized in an unconventional way for algorithmic reasons the term prior is used throughout this chapter to emphasize a common usage of these algorithms and that should usually be a simple tractable starting point thermodynamic integration the bridging procedures above have been independently developed by various communities gelman and meng these different views can bring a better understanding to the problem the statistical physics community views z as a useful function applying operators to the partition function often yields interesting physically meaningful bridging to the posterior quantities in particular for the canonical distribution p ee z z ee d the lognormalizer is a moment generating function d log z d z e ee d epe d log z d z e e e d z e e e d ep e epe varpe unlike z itself these expectations can reasonably be approximated for a given temperature using samples recorded under simulations at that temperature of course log z is related to its gradients by a trivial identity log z log z d log z d d z log z log z epe d the simplest discrete approximation to equation uses measurements at a sequence of temperatures k in order to estimate the normalizer at inverse temperature k k log z k k k ek k pk comparing to equation and remembering that e log l we see that simple thermodynamic integration and a bridging sequence of importance samplers are the same approximation thermodynamic integration seems deceptively general the difficulty is in choosing intermediate distributions appropriate temperaturebased distributions may be difficult and sometimes impossible to find this chapter explores how to choose a sequence of temperatures and alternative methods that dont use annealing schedules multicanonical sampling multicanonical sampling the multicanonical ensemble was mentioned in section as a distribution that might allow better markov chain exploration of the state space than the original target distribution the multicanonical ensemble is available in terms of the target distribution pt p zt and a multicanonical weighting function wmc t pmc p mc zmc p p wmc mc t the multicanonical heuristic suggests finding a weighting such that a markov chain exploring pmc spends equal times at all energies including those typical of the posterior and the prior a weighting function giving approximately this behavior must be found from preliminary runs samples from pmc can be used as an importance sampling proposal distribution as in subsection weights w p p give the normalizing constant ratio mc between any distribution p p zp and the multicanonical distribution pmc p s s w s s zp zmc the multicanonical normalization zmc is generally unavailable but can be eliminated by comparing to the base distribution log zt log pt log instead of bridging carefully between the prior and posterior the multicanonical ensemble has the ambitious goal of capturing both at once one hopes that the broad coverage of pmc gives estimators with reasonable variance for quantities relating to both pt and this is explored theoretically and experimentally in later sections nested sampling nested sampling is a new monte carlo method by skilling intended for general bayesian computation which bears some relation to earlier work found in mcdonald and singer it is designed to be a general and robust alternative to annealingbased methods like annealing it starts at the prior and samples from a sequence of distributions that become more constrained at each iteration however nested sampling makes no use of temperature and does not require tuning of intermediate distributions or other large sets of parameters it also provides a natural means to compute error bars on all of its results without needing multiple runs of the algorithm nested sampling the key feature and technical difficulty of nested sampling is sampling from the prior subject to a series of lower bounds on the likelihood the reward for attempting this novel challenge is an estimate of the prior probability mass associated with each observed likelihood value this representation of the posterior provides an estimate of the normalization constant and any other property of the posterior in addition to reviewing necessary material the remainder of this chapter provides an improved implementation of nested sampling for problems with degenerate likelihoods or discrete distributions a brief study of a deterministic approximation of nested samplings behavior this has theoretical implications for its performance and practical benefits for some applications a comparative analysis of the generic properties of annealed importance sampling nested sampling and multicanonical sampling illustrative examples some simple continuous distributions and the potts model an undirected graphical model incidentally a new variant of swendsenwang is derived some of this material was previously presented in murray et al b a change of variables the normalization of a posterior over variables is a weighted sum of likelihoods l over elements of prior mass d z pdm pd m pmd l d we label each element of prior mass dx d this is only a change in notation z l dx the integral could in principle by accumulating elements in a standard raster order in the parameter space figure a alternatively we can add up the contributions from each scalar element of prior mass dx in any order we like as illustrated in figure b we choose to order the elements by their corresponding likelihood as is a distribution its elements sum to one and can be arranged along a unit interval z lx dx x d l l for now assume is continuous and l provides a total ordering of elements so that x is an invertible change of variables in subsection these assumptions will be relaxed so that the mapping will always be invertible nested sampling a b figure views of the integral z l d for the posterior normalization a elements of the parameter space d are associated with likelihood values heights l these elements likelihood values are summed weighted by their prior masses d b bars with exactly the same set of heights are arranged in order along a scalar unit interval three bars are colored to help illustrate the correspondence in a the bars have a hyperarea base corresponding to an element in parameter space in b they have a scalar width corresponding to an element of prior mass dx xdx the area of each bar is its weighted likelihood so z is the sum of these the area under the curve in b the mapping x may seem strange but is closely related to the familiar cumulative distribution function cdf for a onedimensional density p c p d this quantity is often considered very natural some authors prefer to define distributions in terms of the cdf unlike a density the cdf is invariant to monotonic changes of variable and it directly returns the probability mass between two settings of the parameters also its inverse allows the straightforward generation of samples from uniform random variates each element dc corresponds to an element of probability mass pd naturally these elements sum to one dc cumulative distribution functions for multivariate distributions are less frequently used in statistics a sensible ordering of the elements of probability mass is less obvious and the multivariate integrals involved may be difficult comparing equations and we see that x is just a cumulative distribution function of the prior corresponding to a particular choice of ordering for scalar parameters the standard inverse cdf c gives the parameter such that fraction c of the prior mass is associated with parameters less than for general parameters the inverse of the likelihoodsorted cdf x gives the parameter value such that fraction x of the prior mass is associated with likelihoods greater than l this is illustrated in figure a for multimodal likelihoods the prior mass satisfying a likelihood constraint will be scattered over disconnected regions nested sampling lx lx lx x a x x x b x x x c figure nested sampling illustrations adapted from mackay a elements of parameter space top are sorted by likelihood and arranged on the xaxis an eighth of the prior mass is inside the innermost likelihood contour in this figure b point xs is drawn from the prior inside the likelihood contour defined by xs ls is identified the ordering on the xaxis and pxs are known but exact values of xs are not known c with n particles the one with smallest likelihood defines the likelihood contour and is replaced by a new point inside the contour ls and pxs are still known computations in the new representation given the change of variables described in the previous section the normalizer is just the area under a monotonic onedimensional curve l vs x onedimensional integrals are usually easy to approximate numerically assuming an oracle provided some points xs ls s ordered such that xs xs we can obtain an estimate z based s on quadrature s z s s ls wq s wq s x xs where the quadrature weights given correspond to the trapezoidal rule rectangle rules can upper and lower bound the error zz as long as an upper bound on l is known the boundary conditions of the sum require an arbitrary choice for the edge xvalues such as x x and xs xs sensitivity to such choices could always be checked in our experience they do not matter quadrature is effectively approximating the posterior with a distribution over s particles each with probability proportional to ls wq samples from this distribution can approximate samples from the true posterior the approximate distribution can also be used to directly approximate posterior expectations the change of variables has removed details of the highdimensional space and made s nested sampling the integration problem apparently easy of course the highdimensional space is still in the original problem and will make the change of variables intractable this is now the target for approximation nested sampling algorithms nested sampling aims to provide a set of points s ls s and a probability distris bution over their corresponding x xs values a simple algorithm to draw s such points is algorithm see also figure b algorithm single particle nested sampling initial point draw for s to s draw s ls where l ls ls otherwise end for the first parameter set by this algorithm is drawn from the prior which implies that the corresponding cumulative prior quantity must have distribution px uniform similarly pxs xs uniform xs as each point is drawn from the prior subject to ls ls xs xs this recursive relationship defines px a simple generalization algorithm uses multiple particles at each step the one with smallest likelihood is replaced with a draw from a constrained prior figure c algorithm multiple particle nested sampling initialize draw n points n n n m argminn ln m for s to s redraw m ls given by equation algorithm m argminn ln s m end for the first parameter is the setting with the smallest l from the initial n draws this is the particle with the largest x for this parameter to be at x the other n points must have x x so px n n x n the extra factor of n comes from the invariance of the points to reordering and is needed for correct normalization alternatively it is immediately identified as a standard result from order statistics the nth ordered point drawn from a distribution has its cumulative quantity distributed according to betan n n where our case corresponds to n n nested sampling eg balakrishnan and clifford cohen after replacing a particle in step there will be n samples uniformly between and xs the point with smallest l and largest x will be a fraction r xs xs through this range distributed as prxs n n rn changing variables gives pxs xs n n xs n xs n xs xs this defines the joint distribution over the entire sequence s px px n s pxs xs n note that px depends only on n it is the same distribution regardless of the problemdependent likelihood function if we knew the x locations we could combine these with the likelihood observations l ls s and compute the estimate of the normalization z given in equation s instead we have a distribution over x which gives a distribution over what this estimator would be pzls n zx z pxn dx we defer philosophizing over the precise meaning of this distribution to subsection for now we assume that this distribution gives reasonable beliefs over the estimate that would be obtained by quadrature as can be checked in a given application the uncertainty of this distribution tends to be much larger than the differences between choices of quadrature scheme we can therefore somewhat loosely take equation to be a distribution over z itself similarly distributions over any other quantity such as log z or posterior expectations can be obtained by averaging quadrature estimates over pxn to recap the key ideas required to understand nested sampling are it would be convenient if we could perform an intractable mapping from the original state space to a cumulative quantity x numerical computation of z or posterior expectations would only involve a onedimensional function samples from the prior subject to a nested sequence of constraints equation give a probabilistic realization of the mapping these samples give a distribution over the results of any numerical computation that could be performed given the change of variables no algorithm can solve all problems some pathological integration problems will always be impossible for nested sampling the difficulty is in obtaining samples from the nested sequence of constrained priors nested sampling e e xs e e e e s xs x exp log x error bars e e e e e e e e e s x exp log x error bars figure the arithmetic and geometric means of xs against iteration number s for algorithm with n error bars on the geometric mean show expsn sn samples of pxn are superimposed s omitted for clarity mcmc approximations the nested sampling algorithm assumes obtaining samples from ls equa tion is possible this is somewhat analogous to thermodynamic integrations requirement for samples drawn from a sequence of temperaturebased distributions in section annealed importance sampling offers a way to approximately sample from temperaturebased distributions and still obtain unbiased estimates of z in contrast nested sampling really needs exact samples from its intermediate distributions for its theoretical justification to be valid rejection sampling of the constrained distributions using would slow down exponentially with iteration number s in general we do not know how to sample efficiently from the constrained distributions so despite the theoretical difficulties we will replace step of algorithm with an approximate sampler using a few steps of a markov chain in practice this often works well but one must remember to be careful when interpreting error bars that ignore the approximation we must initialize the markov chain for each new sample somewhere one possibility is to start at the position of the deleted point s on the contour constraint which is independent of the other points and not far from the bulk of the required uniform distribution however if the markov chain mixes slowly amongst modes the new point starting at s may be trapped in an insignificant mode experience suggests it is generally better to start at one of the other n existing points inside the contour constraint these are all ideally draws from the correct distribution ls so represent modes fairly making this new point effectively independent of the point it cloned may take many markov chain steps how many to use is an unfortunate free parameter of mcmcbased nested sampling nested sampling integrating out x to estimate quantities of interest we must average over pxn as in equation the mean of a distribution over logz can be found by simple monte carlo estimation logz logzxpxn dx t t logzxt t xt pxn this scheme is easily implemented for any expectation under pxn including error bars from the variance of logz to reduce noise in comparisons between runs it is advisable to reuse the same samples from pxn eg clamp the seed used to generate them a simple deterministic approximation of pxn is useful for understanding and also provides fasttocompute lowvariance estimators after s iterations xs directly requires substantial work but notice that logxs characterized by its mean and standard deviation logxs sn sn s s s logt s s s t where the ts are drawn iid from pt n tn dealing with this distribution is a sum of independent terms the central limit theorem suggests that this should be well figure shows how well this description summarizes sequences sampled from pxn using the geometric path as a proxy for the unknown x is a cheap alternative to monte carlo averaging over settings equation see figure we might further assume that trapezoidal estimates of integrals are dominated by a number of trapezoids found around a particular iteration s the corresponding uncertainty in log z will be domi nated by uncertainty in log xs s n s n it usually takes extensive sampling to distinguish s n from the true standard deviation of the posterior over log z degenerate likelihoods the progress of nested sampling is supposed to be independent of the likelihood function the procedures that estimate quantities based on nested sampling results rely on this behavior as an example the geometric mean path suggests that with n particles nested sampling should typically get to x after about log iterations now consider running nested sampling algorithm on a problem with the following likelihood function x lx x nested sampling a b c figure histograms of errors in computing epx log z under three approximations for integrating over px random experiments shown the test system was a dimensional hypercube of length with uniform prior centered on the origin the loglikelihood was l nested sampling used n s a monte carlo estimation equation using t sampled trajectories b t sampled trajectories c deterministic approximation using the geometric mean trajectory in this example the distribution plog z from equation has a width so the errors in finding its mean in b and c are tolerable after drawing n particles from the prior about of them will have x and l if the inequality ls ls is strictly enforced all new particles must have l and therefore x therefore x will be reached in about s iterations leading to very wrong results if instead ls ls is enforced the results will still be wrong x will not be reached until about iteration s the problem is that the mapping set up in subsection required a total ordering of all the dx elements of prior mass it was assumed the likelihood function lx provided a unique key for this sort operation in many applications this will be sufficient as long as no two likelihood evaluations in algorithm are numerically identical there is no problem but degenerate likelihoods like equation require special treatment completely flat plateaus in likelihood functions of continuous variables are rare but when is discrete finite elements of prior mass carry the same likelihood this will always introduce plateaus in lx solving the degeneracy problem requires imposing a total ordering on the dx elements earlier presentations of nested sampling skilling suggest introducing an auxiliary labeling of the states that resolves likelihood ties it was suggested these labels could be chosen arbitrarily random samples from uniform suffice as would a cryptographic identification key derived from or almost anything else however a fixed key does not solve the problem with discrete parameters if the x plateau in the example above originated from a single discrete setting then all the dx elements in this range would receive the same cryptographic key likelihood nested sampling ties would still be unresolved if random labels are employed and always regenerated on repeated observations of the same then nested sampling will give the correct results this is how the code skilling has made available is implemented skilling mentions a better way of thinking about the random labels instead of an extra piece of information attached to each parameter setting is an extra input variable for an extended problem with a likelihood that depends on and the distinction may seem fine on large problems the same location will rarely be revisited but getting this right is important for obtaining correct answers on small test problems the joint distribution view also allows a rejectionless algorithm for the potts example in subsection which was tersely introduced in murray et al b the auxiliary joint distribution is over the variables of interest and an independent variable p p p l l z z lj z where lj ll l and z standard nested sampling is now possible at each iteration lj must increase we can choose such that log is smaller than the smallest difference in logl allowed by machine precision this ensures the auxiliary variable the constrainedprior distribution becomes l ls s s j l ls and otherwise will only matter when l ls s mcmcbased implementations of nested sampling require transition operators that leave j stationary the simplest method is to propose with an operator that leaves the original constrained prior stationary and then generate uniform rejecting the proposal if s a rejectionless method would be preferable and is sometimes possible by marginalizing out s m l ls s l ls otherwise if the operator used for was slice sampling subsection then m can also be slice sampled directly similarly a gibbs sampler could easily be adapted to use the conditionals of m instead of if metropolishastings proposals are used it will available from httpwwwinferencephycamacukbayesys efficiency of the algorithms sometimes be possible to reweight q to reduce or eliminate rejections an example of such a transition operator is developed in subsection after generating s the auxiliary variable tribution uniform ls ls s s s uniform s ls ls s can be drawn from its conditional dis as a minor refinement this could be done lazily if and when it is required by the next iteration efficiency of the algorithms this section attempts to compare the computational efficiency of three significant algorithms for estimating normalizing constants general comparisons of monte carlo algorithms are hard to make as they are designed for use with problems where ground truth is unavailable also some algorithms are targeted at a particular class of problem making performance comparisons on other classes unfair all three normalizing constant methods considered below are designed to deal with problems in which the base measure is much more diffuse than the compact target distribution pt we look at the scaling of computational cost with the nature of this disparity the behavior of a multivariate gaussian toy problem is worked out for each algorithm the base distribution is vaguely distributed with precision exp z the target distribution pt has higher precision t pt t exp zt d d d d d d z l zt t l exp d d d we pretend that the normalization constants of the gaussians are unknown and require all of the algorithms under study to return an approximation to the log of the ratio zt z t d many continuous distributions have locallygaussian peaks so poor performance on the toy problem would be a general cause for concern we also make some general remarks relevant to nongaussian modes despite this the theoretical comparison presented here is necessarily limited unreasonable assumptions are made regarding the efficiency of the algorithms ability to draw samples from intermediate distributions obviously this doesnt test the ability of the algorithms to find and then fairly represent isolated modes this will be probed in the empirical comparisons in the sections that follow for now the analysis is confined to behavior that does not require details of particular markov chains used in implementations nested sampling the scaling behavior of nested sampling can be found without detailed calculation the algorithm is characterized by an exponential shrinkage of the prior mass under consideration if the bulk of the posterior mass is found around x then an n particle nested sampling run reaches dominant terms in the integral after s n log x iterations subsection explained that the uncertainty in log z is typically s n obtaining a unit uncertainty in log z would require setting the number of particles to n s for this choice the number of iterations required to reach the target distributions mass from the prior is s log x most of a highdimensional gaussians mass is within a hyperspherical shell out at a radius of d this means that the bulk of the target distribution is in a shell enclosing a fraction t d of the prior volume if the prior were uniform then we could immediately say that the bulk of the target distribution is found around x t d it turns out that this is a good enough description for gaussians too log x d log t therefore the s log x computational cost of nested sampling on the toy problem is number of samples for unit error s d log t the od scaling revealed by this analysis is somewhat disappointing if each sample costs od to produce then the total cost for a unit error is od if one knew that each dimension were independent one could estimate the lognormalizer of each gaussian separately making the d estimates have variance d would require d log t samples each but samples of just one variable only cost o computation which gives a total cost of od rather than od the reason to resort to monte carlo is that real computations do not decompose this way but we shall see that without detailed knowledge of the distribution annealing does only need od iterations the advantages of nested sampling do come at a cost according to the cumulative distribution function the logcumulative prior mass inside a radius p of dt is log dd t which numerically matches the scaling of d log t to within a d small factor over a large range of t and d efficiency of the algorithms multicanonical sampling as there are many possible algorithms for constructing an approximate multicanonical ensemble a general comparison to the method is hard instead we generously assume that we have obtained by chance the ideal function that reweights the target distribution such that the distribution over energy is uniform this avoids having to consider the details of histogrambuilding methods or other energy density approximations the danger is that this sections results may be severely misleading if the dominant computational cost is actually obtaining the reweighting the analysis here should be seen as a lower bound of the costs involved and could be useful for indicating which method is best for refining initial results however they were obtained by definition the multicanonical ensemble has a uniform marginal distribution over energy to be defined this distribution must be confined to some finite range width h say so pmc e h this ensemble could be used to compute the normalizing constant of a simple target distribution ps which also has a uniform energy distribution but over an energy range with smaller width h the importance weights equation are a constant with probability hh and zero otherwise this distribution gives varlog ps h sh after s effective samples empirically the estimator still exhibits quite heavy tails after the s hh samples predicted for unit error but the expression gives good error bars for larger s finding effective widths corresponding to h and h will provide a ruleofthumb estimate of multicanonicals variance on more complex distributions we now derive properties of the multicanonical ensemble for the gaussian problem under the target multivariate gaussian the energy is t e log l which is proportional to r d d d d d d t r the squareradius r follows a scaled distribution with d degrees of freedom pr rd expt r weighting by the reciprocal of pr makes the distribution over r and the energy uniform giving the multicanonical ensemble p p wmc mc et r rd et r rd r rh a maximum energy h must be imposed for the distribution to be defined a reasonable choice for the square radius defining this limit is larger than the prior mean value rh d d this cutoff excludes little of the prior mass from the ensemble efficiency of the algorithms the target distribution is softly confined to a narrow range of the energy spectrum with an effective width likely to be related to the standard deviation of a distribution rheff dt while the exact variance of the estimator does not have a simple analytical form a more detailed analysis shows that the guess hsheff rhsrheff is basically correct var log zt t s d astoundingly the number of effective samples required to obtain a unit error scales only as o d this is better than possible by a method that exploited independence and estimated the normalizer of each of the d gaussians separately perhaps this is a sign that through the weighting function the method has unreasonable a priori knowledge of the answer despite the ideal weighting function the scaling with precision ratio t is not impressive as the target distributions precision grows the multicanonical sampler becomes exponentially worse than nested sampling equation this is a clear indication that the energy spectrum ratio hheff can be quite different from the volume ratio that determines nested samplings performance the details of a particular target distribution can make either method dramatically better than the other another concern is the cost of each of the s effective samples the ensemble pmc is a distribution capturing a large energy range which will usually take many steps of mcmc exploration to reach equilibrium in contrast the intermediate distributions of nested sampling or annealing were designed to be close so a small number of mcmc steps may be sufficient to equilibrate states at each iteration the logprobability of the variance of the two parts equation of the multicanonical estimator equation can be approximated as follows e w varw varlog s ew s ew the ratio of the target distribution equation and multicanonical distribution equation give importance weights w rd exp r noting the integrals z w r dr d d d z and w r dr d d and that for suitably large h these give approximately rh times the required expectations we have varlog rh d rh p sd d s d stirlings approx to both functions substituting in rh assuming large d and using the approximate independence of the separate log estimators when t gives the variance of the final estimator as r h i t d var log zt s the version in the main text drops further terms for clarity this footnote is terse because what matters is the simpler hheff description in the main text and the following assertion empirically the expression above reasonably matches the observed estimator variance for t in a few or more dimensions efficiency of the algorithms pmc has a range of emc d log t between the typical sets of the target and prior distributions if o iterations of mcmc are required to change this energy by then it will take at least oemc iterations to equilibrate the multicanonical ensemble by a random walk this would make the cost of a obtaining a single effective sample from the multicanonical ensemble the same order as the total cost of an entire nested sampling run the potentially good performance of the multicanonical method can be wiped out by the use of some standard markov chains importance sampling we now analyze the performance of combining a bridge of k importance sampling estimates as in section for the toy gaussian problem all the intermediate distributions are also gaussian pk lk zk k exp zk d d d k t this allows computation of the variance of the estimator in equation under the ideal case of direct sampling from each pk the log of the importance weights at each level from equation become log wk k k d d d the variance of the log weights under pk is varpk log wk k k varpk k k d d d k k d k d the variance of log wk only depends on k through the ratio k kk the variance of log z k t k k log wk k k is minimized by equalizing all the contributing variances the resulting estimator has variance t k dk var log z d k t log for large k constructing annealing schedules the approximation becomes accurate very quickly the amount of computation k required for unit variance scales with dimensionality of the problem as od number of samples for unit error k d log t comparing to equation we see that the dependence of the computational cost is the same as nested sampling and that the scaling with dimensionality is better this result is not strongly tied to the tail behavior of a gaussian redoing the calculations for distributions p exp slightly to dplog t p d d only changes the number of samples required before hastily discarding nested sampling remember the caveats at the beginning of this section practical realities of markov chain based samplers will make performance worse than predicted here the potts model subsection will be an example of a distribution which while easily sampled by multicanonical and nested sampling is totally unapproachable with temperaturebased annealing methods as with multicanonical sampling we were very generous to annealing by assuming the free parameters defining the annealing schedule were chosen optimally we were unlikely to guess k k t t k indeed the optimal schedule balances variances which through equation are intimately related to z itself we cannot know the optimal schedule without already having solved the problem some work is required to find a good schedule default choices may not be sufficient for example the linear annealing schedule k kk gives varkkk log z d t k t for large t this is exponentially worse than the correct schedule and nested sampling constructing annealing schedules various families of annealing schedule have been used in the literature while demonstrating ais on some example distributions neal spliced together a linear schedule at high temperatures with a geometric schedule at lower temperatures this is not too far from the behavior of equation the optimal schedule for some wellbehaved found by a largek integral approximation to the discrete sum over equation and also checked numerically constructing annealing schedules posteriors beal suggested a family of nonlinear annealing schedules k e kk e kk where the linearity parameter e makes the schedule approximately linear for large values of e but dwells for longer at high temperatures for small positive values another ad hoc nonlinear scheme is given by kuss and rasmussen where a linear schedule is raised to the power of four k kk k k each of the above schemes were presumably the result of some experimentation preliminary runs are required to check that a proposed schedule eg fourth power of a linear schedule is adequate and to eliminate others also any free parameters must be found such as the number of levels linearity setting e or the changepoint between a linear and geometric schedule algorithms for these fitting procedures could be although rarely are described in detail part of the difficulty is deciding how to include results from earlier higher variance runs often these are only run informally and simply discarded regardless of how an annealing schedule is chosen the selection should ideally be performed automatically otherwise computer time may become irrelevant compared to the time required by manual adjustments an annealing schedule should control the variance of the log weights vark log wk k k vark log l rearranging gives a recurrence relationship for the inverse temperatures k k v vark log l which achieves a given target variance v varlog w on each weight this update rule is applied from until an inverse temperature of is reached the variance of the log z estimator equation is k v a binary search on the control parameter v can find a balanced annealing schedule with a user chosen total variance or computational cost implementing this algorithm for constructing annealing schedules requires the variance of the loglikelihood or energy at each k considered these quantities are not available exactly otherwise log z would be known so the variances must be approximated from some preliminary experiments running mcmc at each k considered would be too costly one could try to interpolate results measured at a preliminary range of temperatures alternatively statistics at any temperature are available from a single run of either multicanonical or nested sampling we propose using nested sampling to set a schedule for ais the details are given in algorithm sampling from the multicanonical ensemble would need a large set of markov chains for normalizing constants control parameters to be set nested sampling can be run before annealing as it only has two control parameters n and the amount of effort to put into sampling at each iteration annealed importance sampling can then be run with a schedule estimated from nested sampling comparing the annealing results to those predicted by nested sampling could uncover problems with markov chains that would go unnoticed using a single method algorithm construction of an annealing schedule from nested sampling inputs total target variance vtarget num nest particles n numerical tolerance tol run nested sampling algorithm obtaining s assuming xs expsn compute weights wq as in equation vmax vtarget vmin k while vmax vmin k vtarget tol vtrial vmax vmin create discrete proxy for p k distribution pk wq ls k v k k varp log l k s if k then passed end of schedule start again with a higher variance per level vmin vtrial k else if k vtrial vtarget then exceeded target variance start again with lower variance per level vmax vtrial k else k k end if end while k return annealing schedule k markov chains for normalizing constants each of the monte carlo algorithms in this chapter require sampling from complex distributions p or pmc standard sampling techniques metropolis hastings slice sampling etc should apply but there are some issues special to these algorithms that are worth considering randomize operator orderings many mcmc operators concatenate several operators with different behaviors together gibbs sampling for example updates each dimension of a parameter vector separately markov chains for normalizing constants some users prefer to randomize the order of these updates so that the resulting mixture operator maintains detailed balance but many use a deterministic ordering if only for convenience of implementation this is not a good idea with algorithms that start out of equilibrium at each iteration the problem is most easily demonstrated by nested sampling with n at each iteration the only particle is by definition on the boundary of the constrained prior the first update must increase the likelihood of the particle subsequent updates have some freedom to decrease the likelihood again as only a limited number of markov chain steps can be performed at each iteration the particle will climb unnaturally fast up the likelihood surface in the direction of the first transition operator in subsection nested sampling is run using a univariate slice sampler applied to each variable in a random order initially these experiments used a fixed ordering the first variable to be updated would systematically become much more constrained than the last even if by symmetry they were equivalent fortunately this pathology is so severe that it quickly made itself known by causing numerical problems and crashing the slice sampling code experiments with spherical gaussians confirm that annealed importance sampling suffers from a similar problem histograms of the unweighted final states obtained show that the statistics of each dimension depend on the order in which they were updated the ais weights correct this bias asymptotically but samples without these artifacts will tend to need lower variance weights the easiest fix is to randomly permute the markov chain operators at each iteration changes in lengthscale and energy there is usually a dramatic difference in scale between a prior and posterior it is unlikely that the same markov chain operators are appropriate for both yet annealing has to sample them and all the interpolating distributions in between similarly nested sampling has to sample from a sequence of distributions that shrink exponentially in volume from the prior to the posterior and beyond step size parameters in metropolishastings algorithms must be changed as the algorithm proceeds it is also profitable to adapt the initial step size of slice sampling some authors set a schedule of step sizes by hand but automatic schemes are clearly desirable one option is to adapt based on the acceptance rate of the previous iteration or to use parallel chains for ais this would require giving up some theoretical correctness while nested sampling is already in an approximate setting by using a markov chain at all another option for ais is to adapt the schedule of proposals after each run each run is unbiased in z using any stepsize but when adapting it is still advisable to discard early runs which will have higher variance markov chains for normalizing constants there is usually a large change in the scale of probabilities involved between a diffuse prior and a posterior which concentrates mass on a much smaller number of states many standard markov chain operators such as simple metropolis and slice sampling are only able to make small changes in logprobability at each iteration depending on the algorithm this may or not be desirable the majority of the volume in a highdimensional solid body is in a thin shell near its surface mackay p for nested sampling this means that much of s mass is likely to be close to the likelihood contour surface and large changes in likelihood are not required instead we need efficient chains that sample well at close to constant likelihood temperaturebased distributions have soft constraints that lead to broader distributions over energy although in many problems they are still constrained to a relatively narrow range the multicanonical method samples from a single distribution which has significant overlap with both the prior and posterior making the distribution over energies uniform requires making some states much more probable than others under pmc simple metropolis methods are unable to move rapidly between regions of many states with low probability and more compact regions with high probability the exploration of pmc s energy spectrum will be characterized by a random walk or slower process this suggests that equilibrating pmc will need at least emc steps where emc is the range of log probabilities under pmc not the original distribution some monte carlo algorithms such as hybrid monte carlo are able to make larger changes in energy hamiltonian dynamics based on pmc could be simulated as long as the reweighting function is smooth nested sampling could also benefit from hamiltonian dynamics for its large movements in statespace although is not compatible with large changes in energy fortunately versions of slice sampling that can use hamiltonian dynamics on the prior and reflections from constraint boundaries have already been developed neal another important markov chain operator for dramatic moves in statespace and energy is swendsenwang subsection while this algorithm can work at any temperature it does not allow reweightings of the energy and is not easily modified to sample at near constant energy by recasting the problem we can develop a version of swendsenwang that will work with multicanonical and nested sampling a new version of swendsenwang the partition functions of the potts model equation the random cluster model equation and the fksw joint distribution equation are identical also a sample from any of these distributions is easily converted into a sample from one of for literature searches it is helpful to know that in physics a constant energy distribution is known as a microcanonical ensemble markov chains for normalizing constants the others this allows using any of the distributions to simulate the potts model and find its normalization zp j q we focus on the random cluster model assuming identical positive couplings j on each edge we rewrite the random cluster distribution in an unconventional way pd zp j q expje z ldd zn where cd q z dij zn ld expd logej d under this factorization the energy is minus the total number of bonds d and the inversetemperature the bonds d logej is set by the coupling parameter j as in subsection cd is the number of connected components or clusters formed by algorithm gives an mcmc operator to update the bond configuration d d the stationary distribution is a weighted prior wmc dd where wmc could be multicanonical weights or set to wmc for prior sampling or set to zero and one to sample from the prior subject to constraints on d algorithm swendsenwang for weighted bond configurations create a random coloring s uniformly from the q cd colorings satisfying the bond constraints d as in the swendsenwang algorithm count sites that allow bonds e ije si sj draw d from t d es zt s wmc d es d es d throw away the old bonds d and pick uniformly from one of the of setting d bonds in the e available sites ways the probability of proposing a particular coloring and new setting of the bonds is t s d d t d s d t d es t s d es d t d es q cd wmc d zt s q cd summing over all possible intermediate colorings the probability of starting with d bonds d and ending with d bonds d is proportional to t d d wmc d d d wmc d s t s d d zt s wmc d wmc d z q cd q cd s this expression is symmetric under the exchange of d d and d d therefore the transition operator satisfies detailed balance with respect to the weighted prior it is experiments also ergodic proof with finite probability all si are given the same color then any d with nonzero weight is possible in turn all allowable d have finite probability when performing nested sampling using the weighted prior representation the likelihood constraints in are thresholds on the total number of bonds d this can be realized by setting wmc for states with fewer bonds many states have identical d which requires careful treatment as discussed in subsection the simple implementation that draws a random key for each state will lead to some rejections when proposing moves to a state with the same d on the constraint surface sampling without rejections can be achieved by setting the weights such that algorithm leaves the auxiliary constrained distribution m equation invariant the number of bonds has previously been identified as a useful energylike target for reweighting janke and kappler in that work the bonds were updated by single site gibbs sampling rather than the blockgibbs sampling move of step this does not allow simulation of the fixed d ensemble or rapid exploration near fixed d single site updates are easier to implement however and would become more attractive on systems that allow a different jij on each edge in this case the implementation of global updates is much more involved experiments detailed comparisons of nested sampling and more established monte carlo techniques are not currently available in the literature anecdotally nested sampling has already been useful in astronomy mukherjee et al and shaw et al claim nested sampling gives speedups of orders of magnitude over annealing based methods the annealing approach that was cited in both papers was very carefully implemented beltrn et al these results are somewhat surprising given that both nested a sampling and annealing follow a sequence of increasingly constrained ensembles and theoretically seem quite similar if anything nested sampling seems slightly worse although this should be a small effect for the d astronomy problems that were tested the focus of this section is not applications but well understood test problems hopefully these will give some better insight into the relative merits of the approaches description of slice sampling experiments a set of experiments were performed on some continuous distributions that are amenable to slice sampling this allowed the same mcmc code to be used within each algorithm the distributions tested are described first and then details of the algorithms the results which appear in table are discussed in the next section experiments gaussian base and target distributions as considered theoretically in section were run in ten dimensions these experiments reveal the actual performance when confounded by interactions with a particular mcmc operator in version a we set the standard deviation of the base distribution to be times wider than the target ie t in version b we made the prior times wider t tdistribution this was included as another simple standard distribution with different tail behavior the target distribution was a tendimensional multivariatet with five degrees of freedom the base distribution was gaussian with two modes a mixture of gaussians as tested in neal a pt zt exp d d exp d d the base distribution was a unit sixdimensional gaussian p n i deceptive this twodimensional problem bridges from a spherical gaussian distribution with precision to a twodimensional mixture of gaussians taken from neal a pt zt i j ij exp i j exp ij i j ij exp i j exp ij where and the mixture components are in four groups ij i j ij i j ij i j ij i j means in the upperright quadrant means in the upperleft quadrant means in the lowerleft quadrant means in the lowerright quadrant this target distribution is exceedingly challenging and more pathological than experienced in many statistical problems it is interesting however the different spacings of the means make it hard for algorithms to know from a distance where the bulk of the probability mass is this highlights differences between the massfinding heuristics implicitly performed by the algorithms in all cases one markov chain update consisted of a simple univariate slice sampler applied once to each variable in a new random order at each iteration a linear stepping out procedure was employed with an initial stepsize equal to one or to the range of settings currently occupied by particles being simulated in parallel some additional choices were needed by each method nest nested sampling runs had two free parameters the number of particles n and the number of slicesampling steps used to update at each iteration unless n each experiments new particle was initialized at one of the n particles already satisfying the likelihood constraint as slice samplers always move and there are no plateaus in these problems likelihood functions the details in subsection were not required rather than setting a number of iterations s we terminated the nested sampler when the estimate of log z appeared to have converged in particular we used a geometric approximation for x to estimate log z with equation and terminated when the remaining prior mass appeared to contribute a fraction of only to the sum the reported results used monte carlo samples of x in equation these are very similar to those obtained from the geometric mean approximation but also provide quantiles of the predictive posterior over log z aisrc annealed importance sampling has two free parameters in addition to its annealing schedule an experiment was repeated r times each run using c parallel chains when c the parallel chains are used to set the initial step sizes of the slice sampler for multimodal distributions this is not strictly justified within the ais framework but it seems unlikely the results will differ greatly from using preliminary runs instead an ais initialization of e nest n n k k estimated an annealing schedule from a nested sampling run with the given n and step per iteration the given value of k intermediate temperatures was needed for target standard error e the details of this procedure and the linear and fourth power schedules were given in section multicanonical ten slice sampling chains initialized from the prior were run in parallel on the multicanonical ensemble simple error bars were calculated from the variance of the estimates from each chain on the gaussian problem the analyticallyderived weighting function that sets a uniform distribution over energies equation could be used we also tried setting the multicanonical weights by estimating the distribution over energies from a nested sampling run discussion of slice sampling results the experiments with a gaussian highlight differences between the theoretical ideals in section and the realities of mcmc nested sampling is fastest with n which was tried as a preliminary run the predictions from one or ten slice sampling sweeps per iteration are imprecise as one might expect but also overconfident the posterior over log z has negligible overlap with the true answer increasing the amount of mcmc sampling to slice sampling steps per iteration overcomes the bias but at a large computational cost increasing the number of particles n gives more precise answers and increases accuracy there are three reasons each particle can start at one of the other particles which are supposed to be drawn from the target distribution the more particles there are the easier it is to forget exactly which one was copied the distributions change more slowly with larger n the answer from n with one step of sampling gives experiments table empirical behavior of slicesamplingbased nested sampling ais and multicanonical sampling the distributions and methods are described in the main text subsection the results are discussed in subsection distribution gaussian a method initialization log l evals log z truth nest n step nest n steps nest n steps nest n step nest n steps nest n step nest n steps ais nest n k ais nest n k ais nest n k ais nest n k ais nest n k ais linear k ais fourth power k multicanonical nest n multicanonical nest n multicanonical analytic weighting multicanonical analytic weighting truth nest n step multicanonical analytic weighting gaussian b tdistribution truth nest n step nest n step nest n step ais nest n k ais linear k ais fourth power k two modes truth nest n step nest n step nest n step nest n step nest n step ais nest n k ais nest n k ais nest n k truth nest n step ais nest n k deceptive experiments a much better answer than n with steps and is also an order of magnitude cheaper even with only one sampling step per iteration the estimates and error bars with n and n are indistinguishable from those obtained by exact sampling from the constrained priors three standard annealed importance sampling runs were performed with the same target error of ais ais and ais obtaining the target error from only one run requires a very long annealing schedule k this gave similar results to shorter runs with k but error bars are much more easily obtained from multiple runs increasing the number of runs to required shortening the annealing schedule further for a similar target error or computational cost the short k runs had larger errors than predicted and unreliable error bars this schedule was designed assuming that the algorithm would keep close to the equilibrium distributions defined in the annealing schedule in this case there were not enough bridging distributions for this approximation to be accurate the result with ais k which uses ten parallel chains is similar to the result from separate runs ais but with many fewer likelihood evaluations in the absence of a population of points the slice sampling code used an initial step size of one which is inappropriately small for this problem at high temperatures a user not prepared to adapt stepsizes based on a population should find some way to set them appropriately the sequence of three ais runs with k are deliberately at higher precision than necessary so that the error bars are somewhat reliable and reproducible something that is sadly not true of the lower precision ais runs these confirm that a linear annealing schedule is worse than one that dwells for longer at higher temperatures in this case the fourth power schedule quite closely follows that set by nested sampling and has very similar performance the first multicanonical result with weights set by nested sampling seems to give comparable uncertainty to ais for the amount of computer effort however running the chains for ten times longer reveals problems with this multicanonical estimator we had set the weights by estimating the probability mass between each pair of likelihood values ls and ls visited by nested sampling for simplicity we used the deterministic approximation in equation this gives noisy estimates of the ideal multicanonical weights which it turns out are not good enough we could attempt to smooth the approximate multicanonical weighting function instead for the gaussian case we went directly to the correct multicanonical ensemble equation this confirmed that the problem with the estimator was the choice of weighting function and gives multicanonical a performance somewhere in between nested sampling and ais on the gaussian a problem even with the ideal weighting multicanonical fails to equilibrate and fails to give reasonable error bars on the gaussian b problem this is to be expected given the methods poor scaling with precision ratio t we did we were able to confirm this because is quite tractable for this toy problem experiments not continue to try multicanonical adapting the weighting function beyond a nested sampling initialization seems important and this would complicate the comparison turning to the tdistribution example we see behavior different from the gaussian case the annealing schedule estimated from nested sampling is closer to linear than to a fourth power moreover the linear schedule and that produced from nested sampling are reproducibly better than the fourth power notice the reversal from the gaussian experiment a good annealing schedule was estimated cheaply from nested sampling with n obtaining as good answers as ais with nested sampling alone would be more expensive exactly how much is hard to tell from individual runs because the error bars can be unreliable a more reliable comparison of nested sampling and ais can be made based on many repeated runs nested sampling was run times each over a range of n and compared to ais with runs of schedules of varying lengths set by nested sampling figure a shows that the actual performance measured as a mean squared error of the log z estimates was similar for the two methods nested sampling is slightly more accurate at low computational costs being taken over by ais at higher precisions increasing the number of ais runs at the expense of annealing schedule length performs worse for the same computational cost this difference goes away at high precisions but may be a concern when many runs are required on multimodal problems the main problem with ais is that at lower precisions simple estimates and error bars based on the mean and variance of the importance weights are unreliable figure b shows large biases in estimates for log z figure c shows that the error bars are too small on average jackknife estimates not shown are sometimes slightly better calibrated but give very similar results for a user the error bars are often very important so in this case the nested sampling estimator is the favorable choice the peaks of the two modes target distribution are highly separated at low temperatures the fraction of ais chains ending in the mode in the positive orthant is compared to its actual probability mass of a third through weighting these runs ais converges to the correct answer once enough sufficientlylong chains have been run further checks show that it also estimated the mass of each mode correctly the same is true for nested sampling the results from large n obtain nearly the correct relative masses of the modes and in turn the overall normalization detailed analysis still shows some signs of bias the posterior overlap with the correct answer is smaller than apparent from the scalar error bars in the table both methods find twomodes difficult because it is not feasible to sample correctly from at late iterations or from low temperature distributions ais relies on reweighting over many runs nested sampling requires many particles to maintain a fair representation of even after likelihood evaluations both nested sampling and ais have problems experiments mse a ais ais nest e likelihood evaluations mean error b ais ais nest e likelihood evaluations ais ais nest calibration c e likelihood evaluations figure average behavior of ais and nested sampling over runs for a range of n or target errors the xaxis gives computational cost in likelihood evaluations a mean square error of log z estimate b mean error or bias of point estimate c mean square normalized error error divided by error bar width a measure of calibration experiments table estimates of the deceptive distribution quadrant upperleft upperright lowerleft lowerright true probability nest estimate ais estimate with the deceptive distribution nested sampling has an error bar on log z that is far too small aiss estimate is consistent with the correct answer although as with two modes with a higher standard error than targeted due to poor mixing at low temperatures additional problems are revealed by looking at the probability distribution over the four quadrants containing each cluster of modes table again the nested sampling run is far too confident in this case aiss error bars are generally much better although it is very certain that the upperright quadrant has much less probability mass than it actually does runs of nested sampling with smaller n show that it too can easily lose the upperright quadrant nested samplings answers are wrong according to its error bars due to the markov chain approximation at late iterations with high likelihood constraints the slice sampler is unable to equilibrate and is relying on a large number of particles to provide a good starting point on this problem the inaccuracy of this approximation does not reduce as fast as the size of the error bars with a large number of points increasing the number of slice sampling steps per iteration does not solve the problem because it would take an unrealistically large number to move between isolated modes however proposals that take all of the particles positions into account could help shaw et al uses an approximate rejection sampler based on uniform samples within ellipsoidal fits to clusters of the existing particles no problems with biases were reported using this method although clearly problems could still occur when the ellipsoids fail to correctly capture in general ellipsoids could be used as metropolis proposals as part of several and various attempts to equilibrate a particle the potts model the potts model was introduced in subsection it describes a class of undirected graphical models over discrete variables s the variables take on one of q colors and the model has a temperaturelike coupling parameter j gibbs sampling updates of are the simplest way to implement approximate nested sampling we also try clusterbased updates subsection while physicists tend to be interested in a broader range of quantities we focus here on the normalization constant zp j q where the discrete variables s are the variables that need to be integrated ie summed over experiments table partition function results for potts systems see text for details method gibbs ais swendsenwang ais gibbs nested sampling randomcluster nested sampling acceptance ratio q ising j q j table shows results on two example systems an ising model q and a q potts model in a difficult parameter regime we tested nested sampling and ais with gibbs sampling and clusterbased updates annealed importance sampling ais was run times with a geometric spacing of settings of j as the annealing schedule nested sampling used n particles and fullsystem mcmc updates to approximate each draw from we also developed an acceptance ratio method bennett based on our representation in equation which we ran extensively and should give nearly correct results the markov chains used by nested sampling were initialized at one of the n particles satisfying the current constraint preliminary experiments that initialized a new particle s at s on the constraint surface were a failure the gibbs nested sampler could get stuck permanently in a local maximum of the likelihood while the cluster method gave erroneous answers for the ising system this supports the suggestions of subsection ais performed very well on the ising system and can work with q at low coupling strengths we took advantage of its performance in easy parameter regimes to compute z which was needed to interpret the results from the clusterbased nested sampler however with a temperaturebased annealing schedule ais was unable to give useful answers for the q system close to the critical j evaluated nested sampling appears to be correct within its error bars under these conditions it is known that even the efficient swendsenwang algorithm mixes slowly for potts models with q near critical values of j which correspond to a first order phase transition gore and jerrum see figure typical potts model states are either entirely disordered or ordered disordered states contain a jumble of small regions with different colors eg figure b in ordered states the system is predominantly one color eg figure d moving between these two phases is difficult defining a valid mcmc method that moves between distinct phases requires knowledge of the relative probability of the whole collections of states in those phases temperaturebased annealing algorithms explore the model for a range of settings of j and fail to capture the correct behavior near the transition despite using closely related markov chains to those used in ais nested sampling can work in all parameter discussion and conclusions a b c d e figure two q potts models with starting states a and c were simulated with fullsystem swendsenwang updates with j the corresponding results b and d are typical of all the intermediate samples swendsenwang is unable to take a into an ordered phase or c into a disordered phase although both phases are typical at this j e in contrast shows an intermediate state of nested sampling which succeeds in bridging the phases regimes figure e shows how nested sampling can explore a mixture of ordered and disordered phases by moving steadily through these states nested sampling is able to estimate the prior mass associated with each likelihood value this behavior is not possible in algorithms that use j as a control parameter such as ais with a temperaturebased schedule discussion and conclusions summary the main purpose of this chapter was to study nested sampling and its relationship with more established methods we find that it fits into a unique position amongst monte carlo algorithms unlike the majority of annealingbased methods nested sampling can deal with first order phase transitions while multicanonical sampling also solves this problem its properties are different in almost every other respect nested sampling does not need prior setting of weights its scaling with dimensionality and energy ranges are very different and nested sampling follows a sequence of distributions like annealing in some statistical settings nested sampling has clear advantages over multicanonical undoubtedly it will perform badly on some applications where multicanonical has already been successful nested samplings theoretical scaling with dimensionality od iterations is worse than that of annealing od iterations in practice the difference can be less dramatic as bringing a ensemble to equilibrium may require fewer function evaluations than a temperature based distribution the slicesampling results do support aiss superiority at least for higher accuracy results but other issues such as ease of implementation and quality of error bars may be more significant issues with annealings error bars were blamed for an order of magnitude extra cost by both beltrn et al and a shaw et al discussion and conclusions despite this the relative underperformance of annealing reported by these astronomers is difficult to account for given the theoretical and practical results of this chapter the most likely explanation is a difference in the operators used to update the intermediate distributions in this chapter the same or closely related operators where used with ais and nested sampling in shaw et al nested sampling was described with its own special update rule using ellipsoids fitted to the current particles this algorithm could easily require fewer function evaluations than a simple slice sampler or metropolis method as used within annealing in beltrn et al although the ellipsoid algoa rithm was suggested by the need to sample from nested samplings there is no reason not to use the same basic code to propose moves for an annealing method combined with an annealing schedule specified by nested sampling ais might perform as well as or better than nested sampling related work this chapter only considered a subset of the available methods for computing normalizing constants the focus has been on simple methods that compute the normalizing constant of a generic probability distribution we avoided assuming detailed knowledge about the target distribution and its relationships with other distributions as much as possible for users with more time a richer set of methods are available some of which are mentioned here the intermediate distributions in annealed importance sampling do not have to be temperaturebased distributions that has been the focus here because of the simplicity and generality of raising the likelihood to a power any other way of bringing in the likelihood gradually can be used one data point at a time may be natural in a statistical setting a generalization linked importance sampling neal may be helpful for such cases where the intermediate levels are fixed and limited in number annealed importance sampling can be seen as a member of a wider family of sequential monte carlo smc methods some of these algorithms allow transfer of particles amongst modes and should definitely be considered by anyone attracted to nested sampling by this property a recent example of smc combined with an interesting set of bridging distributions leveraging the power of graphical models is presented in hamze and de freitas some situations require the computation of more than one normalizing constant an option is to construct a distribution containing each of the models and explore it with mcmc reversible jump mcmc green makes this possible even when the models have parameter vectors of different dimensionalities the relative probability of each model is available from the amount of time the sampler spent exploring each model and more advanced estimators based on transition probabilities may be available this discussion and conclusions only provides the normalizing constants up to a constant although if one of the models considered is tractable this constant could be found the path sampling approach of gelman and meng suggests computing normalizing constants by integrating along a path of model hyperparameters rather than along a single inversetemperature parameter if the normalizer for each setting of the hyperparameters is required this is particularly effective contour plots based on independent estimates will usually be very noisy compared to a path sampling approach philosophy various authors have noted the irony of using monte carlo a frequentist procedure for bayesian computation eg ohagan neal rasmussen and ghahramani rather than giving beliefs about quantities given the computations performed monte carlo algorithms provide frequentist statistical estimators skilling claims that nested sampling is bayesian this section examines the extent to which this holds our target is a posterior distribution over z philosophically this is slightly tricky because z is a constant which we should be able to work out given the prior and likelihood functions thus according to any rational calculus of beliefs as in cox or jaynes the posterior should concentrate all of its mass at the true value the problem is not lack of available information but computer time to use it perfectly the solution to this conundrum is to be careful about what we claim to know assume that some agent is running nested sampling on our behalf and only reports to us l ls if it reported more information such as s locations we would be in an embarrassing situation to use this information we would need a probabilistic model including s in which inference would probably be hard but throwing away knowledge is hard to deal with rationally so we pretend it was never available thus the inferences for nested sampling as in rasmussen and ghahramani are rational for a fictitious observer with limited information this observers results will be more vague than if s were not ignored but should be sensible which is not guaranteed by general approximations to bayesian inference we first set up priors based on the knowledge that an agent is running nested sampling which will provide ls quantities with associated xs cumulative mass values our knowledge of nested sampling defines a problemindependent prior distribution over x xs we should also specify a prior distribution over functions lx after observing l ls we update our beliefs about the underlying cumulative values according to bayes rule p xl p lx p xp l it is this posterior distribution that should be used when computing posteriors over other quantities such discussion and conclusions as z p zl p zx l p xl dx note that specifying a prior over monotonic functions p lx and computing with it appears difficult in general skilling declares that i cant so i dont instead the prior p x is used this corresponds to a particular assumption an improper uniform prior over likelihood functions this cannot be avoided by claiming general ignorance unlike s we must be told ls thus to maintain a claim of rationality we must be happy with this particular choice of prior or type of ignorance about the likelihood function in addition and much more seriously we must assume the agent is actually running nested sampling as in algorithm in fact we really know that some approximation will be involved as we have seen this can give us unreasonable beliefsas with most probabilistic modeling where we know a priori that a models joint distribution does not really capture every detail of a real system similar concerns must surround all such attempts to introduce bayesian methodology into monte carlo algorithms for general inference problems for example bayesian learning of the onedimensional weighting function of the multicanonical method has been attempted eg smith this should be a difficult inference problem as the output of a sampler has a complicated dependence structure only with incorrect assumptions ie approximations can bayesian methods be applied the ultimate justification for such methods must be their empirical performance which is sometimes very good chapter doublyintractable distributions most of this thesis has been dedicated to sampling from probability distributions where the key difficulty has been an intractable normalization when considering a posterior over parameters given data y py py py p py py we assumed that the joint probability in the numerator could be easily evaluated for any particular joint setting y chapter described how the important quantity py can be approximated but it is often infeasible to compute this quantity exactly standard mcmc methods are designed for use with these intractable distributions markov chain operators can be constructed by restricting consideration to a manageable subset of s state space at each step in metropolishastings only two settings are considered the current setting and a randomly chosen proposal gibbs sampling changes only one component of at a time metropolis requires an ability to evaluate py ratios for various pairs and gibbs sampling requires the ability to sample from the conditional distributions pi ji y by considering restricted parts of the state space neither method needs to know the global normalizing constant py but what if py like py contains a summation over a large state space so it cannot feasibly be evaluated pointwise then the problem is doublyintractable and as we shall see even performing markov chain monte carlo is potentially exceedingly difficult the next section explains why doublyintractable distributions arise and the difficulties involved section explores approximations of standard mcmc algorithms in the context of undirected graphical models moller et al provided the first feasible algorithm for models where sampling from py is possible but its normalization is unknown their method is reviewed in section and generalized by us in section working on these algorithms inspired the new exchange algorithm sec bayesian learning of undirected models tion these innovations were first described in murray et al a this chapter provides a slightly more general version of the exchange algorithm with a new derivation which is somewhat simpler than the original description full mathematical derivations of detailed balance are provided which were previously omitted for space reasons in section we consider further new valid mcmc algorithms for doublyintractable distributions which provide a connection to the approximate bayesian computation abc literature section provides slice sampling algorithms for doublyintractable distributions finally new directions for research in this area are considered in section bayesian learning of undirected models the potts model discussed in the previous chapter belongs to a very wide class of energybased models where py exp z j ej ycj j z y j z fj ycj j j fj ycj j the sets cj each index a subset of variables that take part in a corresponding potential function fj parameterized by j each potential expresses mutual compatibilities amongst a subset of variables ycj sampling from the y variables is possible with mcmc but computing the normalization can be very difficult section reviewed how special structure in a graphical model sometimes allows efficient computation of the normalization z most of this chapter concerns methods that apply to general distributions so it is convenient for clarity to collapse the model to a simpler form py f y z we now consider sampling from the posterior over parameters equation when the likelihood is of the unnormalized form given in equation this posterior py f y p z py offers a new difficulty as before py is not needed for mcmc but the normalizing constant z cannot be ignored as it is a function of the parameters the variables being sampled every time new parameter values are considered it appears that an intractable computation involving z will be required as mcmc estimators are approximations unless an infinite number of iterations are performed and each itera bayesian learning of undirected models tion is generally infeasible py in equation can be called a doublyintractable distribution while sampling from parameter posteriors by mcmc is a well established technique it is largely associated with distributions that could be represented as directed graphical models however sampling parameters in anything but the most trivial undirected graphical model is doublyintractable while directed models are a more natural tool for modeling causal relationships the soft constraints provided by undirected models have proven useful in a variety of problem domains we briefly mention six applications a in computer vision markov random fields mrfs a form of undirected model are used to model the soft constraint a pixel or image feature imposes on nearby pixels or features geman and geman this use of mrfs grew out of a long tradition in spatial statistics besag b in language modeling a common form of sentence model measures a large number of features of a sentence fj s such as the presence of a word subjectverb agreement the output of a parser on the sentence etc and assigns each such feature a weight j a random field model of this is then ps z exp j j fj s where the weights can be learned via maximum likelihood iterative scaling methods della pietra et al c these undirected models can be extended to coreference analysis which deals with determining for example whether two items eg strings citations refer to the same underlying object mccallum and wellner d undirected models have been used to model protein folding winther and krogh and the soft constraints on the configuration of protein side chains yanover and weiss e semisupervised classification is the problem of classifying a large number of unlabeled points using a small number of labeled points and some prior knowledge that nearby points have the same label this problem can be approached by defining an undirected graphical model over both labeled and unlabeled data zhu and ghahramani f given a set of directed models pyj the products of experts idea is a simple way of defining a more powerful undirected model by multiplying them py z models despite the long history and wide applicability of undirected models until recently bayesian treatments of learning the parameters of large undirected models have been virtually nonexistent indeed there is a related statistical literature on bayesian inference in undirected models log linear models and contingency tables albert dellaportas and forster dobra et al however this literature with the notable exception of the technique reviewed in section assumes that the partition function z can be computed exactly but for many of the machine learning applications of undirected models cited above this assumption is unreasonable this chapter addresses bayesian learning for models with intractable z j pyj hinton the product assigns high probability when there is consensus among component ie low treewidth graphs graphical gaussian models and small contingency tables bayesian learning of undirected models do we need z for mcmc the modelers effort was put into specifying f y it is tempting to think that z should have little relevance and there must be some way to sidestep computing it it was established above that the normalizer z is not a constant in the context of sampling this section explores the rle of z in more detail addressing the o consequences for any scheme that avoids computing it algorithm gives the straightforward application of standard metropolishastings algorithm to the doublyintractable distribution in equation algorithm standard but infeasible metropolishastings mh input initial setting number of iterations s for s s propose q y compute a p y q y f y p q y z py q y f y p q y z draw r uniform if r a then set end for computing the acceptance ratio requires a ratio of normalizing constants or at least a bound tight enough for step this is difficult in general there are usually some free choices while constructing mcmc algorithms perhaps some nuisance parameters could be judiciously set to remove z through some fortunate cancellations we are not free to cancel out z through a choice of prior p the prior would have to depend on the number of observed data points and would take on extreme values dominating any inferences murray and ghahramani in theory the proposal distribution could be defined to remove explicit z dependence from the acceptance ratio but in practice this does not seem to help eg q y zg or q y z would be difficult to construct without knowing z and would be terrible proposals the only distribution we know of that contains z gives probabilities in data space not over parameters section reviews and extends a method by moller et al which introduced auxiliary variables taking on values in the data space this allows proposals that cancel out the unknown terms but only if it is possible to draw from the intractable distribution py this key insight makes it possible to sample from a limited but significant class of distributions for which mcmc was previously impossible however the algorithm and its extensions are not a panacea it is not always possible to draw exact samples from the data distribution another problem or opportunity is bayesian learning of undirected models x y a f x y b f figure a the dash marked y shows the position of our observation in data space the curve shows an unnormalized function f which gives low probability to the alternative data set x b after changing the parameter the unnormalized function evaluated at the observation has increased f y f y however the likelihood has decreased py py noticing this requires considering the new highprobability region of data space containing x which is not necessarily close to the observation that specifying a workable stationary distribution for the auxiliary variables requires approximating the parameter posterior before sampling begins an optimistic train of thought sees no problem as z is just a sum and summing over unknowns is a standard feature of mcmc for example models with latent variables z often have intractable likelihoods but while it may be difficult to evaluate py z py z z pyz pz we can jointly sample over p zy as long as this distribution is known up to a constant discarding the latents gives samples from the correct marginal py in doublyintractable problems the likelihood contains z the reciprocal of a sum of terms rather than a sum over latent variables but could there be some similar way of instantiating latent variables to remove the need to compute z section provides such a method with an infinite number of latent variables but how do we know there is not a better choice in general must all simple algorithms fail figure shows a hypothetical unnormalized probability function for two settings of its parameters an obvious but important observation is that the change in an unnormalized probability function evaluated at an observation does not necessarily tell us anything about the change in the likelihood of the parameters in fact we must consider the parameters effect on the entire observation space this is in sharp contrast to mcmc sampling of the y variables for fixed parameters where only two settings need be considered at a time if we are not going to compute z explicitly then a valid mcmc algorithm must have some other source of global information that is sensitive to changes anywhere in observation space one of the simplest ways to get information from a probability distribution is of course through a sample a sample x f x z using the same function as figure b could easily land in the new region of high probability on the left noticing that approximation schemes f x f x gives some indication that any perceived benefit of f y f y should be penalized despite the apparent paucity of such information surprisingly these single samples at each parameter setting are sufficient to create valid mcmc algorithms for py that do not need z this is how the approach in moller et al works which explains why it requires samples from the target distribution using an exact or perfect sampling method section approximate samples from a few mcmc steps cannot guarantee considering the new bubble in data space in figure b a markov chain started at eg the observation will be heavily biased towards a mode near the starting point and may never consider any other modes are spurious modes in data space a problem in practice contrastive divergence learning hinton uses very brief mcmc sampling starting at the observed data and can give useful results on complex problems in machine learning sometimes we know the parameter posterior is simple it is logconcave for fullyobserved exponentialfamily distributions in such cases deterministic approximations to the parameter posterior perform favorably compared to pseudomcmc approaches welling and parise however if we wish to construct a valid mcmc method we must formally show that the entire data space has been properly considered exact sampling explicitly tracks bounds that start off considering the entire observation space section a procedure with this flavor seems an essential part of samplers for doublyintractable distributions thus even though a new latent variable approach introduced in section does not use exact sampling as such the requirements are and must be very similar exact sampling is a formidable requirement deterministic approaches and mcmc methods that do not use the correct posterior distribution will always have a place some possibilities are studied in the next section there is also always a place for gold standard methods that given enough time should give correct answers those are the focus of the remainder of the chapter approximation schemes for concreteness this section studies a simple but widespread type of graphical model the boltzmann machine bm is a markov random field which defines a probability distribution over a vector of binary variables s s sk where si psw exp zw wij si sj ij the symmetric weight matrix w parameterizes this distribution in a bm there are usually also linear bias terms i bi si in the exponent with these the model is equivalent to a potts or ising model with magnetic field parameters we omit these biases to simplify notation although the models in the experiments assume them approximation schemes the usual algorithm for learning bms is a maximum likelihood version of the em algorithm assuming some of the variables are hidden sh and some observed so ackley et al the gradient of the log probability is log pso w ecsi sj eusi sj wij where ec denotes expectation under the clamped data distribution psh so w and eu denotes expectation under the unclamped distribution psw for a data set s s sn sn of iid data the gradient of the log likelihood is simply summed over n for boltzmann machines with large treewidth see section these expectations would take exponential time to compute and the usual approach is to approximate them using gibbs sampling or one of many more recent approximate inference algorithms targets for mcmc approximation metropolishastings for the parameters of a boltzmann machine given fully observed data needs to bound a psw pw qw w s psw pw qw w s zw zw n pw qw w s pw qw w s exp nij wij wij n n si sj the first class of approximation we will pursue is to substitute a deterministic approxi mation zw zw into the above expression clearly this results in an approximate sampler which does not converge to the true equilibrium distribution over parameters moreover it seems reckless to take an approximate quantity to the n th power despite these caveats we explore empirically whether approaches based on this class of approximation are viable note that above we need only compute the ratio of the partition function at pairs of parameter settings zw zw this ratio can be approximated directly by importance sampling zw epsw exp zw wij wij si sj ij thus any method for estimating expectations under psw samplingbased or deterministic can be nested into the metropolis sampler for w for small steps w w estimating ratios of normalizers and finding gradients with respect to the parameters are closely related problems gradients may be more useful approximation schemes as they provide a direction in which to move which is useful in algorithms based on dynamical systems such as hybrid monte carlo subsection hybrid monte carlo would also require a z ratio for its acceptreject step this effort may not be justified when the gradients and zw are only available as approximations simpler schemes that use gradient information also exist neal the simplest of these is the uncorrected langevin method parameters are updated without any rejections according to the rule i i log py ni i where ni are independent draws from a zeromean unit variance gaussian intuitively this rule performs gradient descent but explores away from the optimum through the noise term strictly this is only an approximation except in the limit of vanishing using the above or other dynamical methods a third target for approximation for systems with continuous parameters is the gradient of the joint log probability in the case of bms we have log ps w wij si sj n n n n epsw si sj log pw wij assuming an easy to differentiate prior the main difficulty arises as in equation from computing the middle term the unclamped expectations over the variables interestingly although many learning algorithms for undirected models eg the original boltzmann machine learning rule are based on computing gradients such as equation and it would be simple to plug these into approximate stochastic dynamics mcmc methods to do bayesian inference this approach does not appear to have been investigated we explore this approach in our experiments this section has considered two existing sampling schemes metropolis and langevin and identified three targets for approximation to make these schemes tractable zw zw zw and epsw si sj while the explicit derivations above focused on boltzmann machines these same expressions generalize in a straightforward way to bayesian parameter inference in a general undirected model as in equation in particular many undirected models of interest can be parameterized to have potentials in the exponential family ej ycj j uj ycj j for such models the key ingredients to an approximation are the expected sufficient statistics epy uj ycj approximation algorithms in this section approximations for each of the three target quantities in equations and are identified these are used to propose a variety of approximate sampling methods for doublyintractable distributions first outlined in murray and ghahramani approximation schemes variational lower bounds were developed in statistical physics but are also widely used in machine learning they use jensens inequality to lower bound the log partition function in the following way log z log x f x log x qx f x qx x qx log f x qx log z x qx log f x hq f q the relationship holds for any distribution qx provided it is not zero where px has support the second term in the bound is called the entropy of the distribution hq x qx log qx the overall bound f is often called the free energy see winn and bishop for more details and a framework for automatic construction and optimization of variational bounds for a large class of graphical models the na mean field method is a variational method with q constrained to belong to the ve set of fully factorized distributions qmf q qx i qi xi for the boltzmann machine a local maximum of this lower bound log zmf maxqqmf f q can be found with an iterative and tractable fixedpoint algorithm see for example mackay chapter let the meanfield metropolis algorithm be defined by using zmf in place of z in the acceptance probability computation equation the expectations from the na mean field algorithm could also be used to compute direct ve approximations to the gradients in equation for use in a stochastic dynamics method jensens inequality can be used to obtain much tighter bounds than those given by the na meanfield method for example when constraining q to be in the set of all ve treestructured distributions qtree optimizing the lower bound on the partition function is still tractable wiegerinck obtaining ztree z the tree metropolis algorithm is defined through the use of this approximation in equation alternatively expectations under the tree could be used to form the gradient estimate for a stochastic dynamics method equation bethe approximation a recent justification for applying belief propagation to graphs with cycles is the relationship between this algorithms messages and the fixed points of the bethe free energy yedidia et al this breakthrough gave a new approximation for the partition function in the loopy metropolis algorithm belief propagation is run on each proposed system and the bethe free energy is used to approximate the acceptance probability equation traditionally belief propagation is used to compute marginals pairwise marginals can be used to compute the expectations used in gradient methods eg equation or in finding partition function ratios equation these approaches lead to different algorithms although their approximation schemes approximations are clearly closely related langevin using brief sampling the pairwise marginals required in equations and can be approximated by mcmc sampling the gibbs sampler used in subsection is a popular choice whereas in subsection a more sophisticated swendsenwang sampler is employed unfortunately as in maximum likelihood learning equation the parameterdependent variance of these estimates can hinder convergence and introduce biases the brief langevin algorithm inspired by work on contrastive divergence hinton uses very brief sampling starting from the data x which gives biased but low variance estimates of the required expectations as the approximations in this section are run as an inner loop to the main sampler the cheapness of brief sampling makes it an attractive option langevin using exact sampling unbiased expectations can be obtained in some systems using an exact sampling algorithm section although the gradients from this method are guaranteed to be unbiased parameterdependent variance could lead to worse performance than the proposed brief langevin method variance could be reduced by reusing pseudo random numbers however we shall see later that there are much more elegant ways to use an exact sampler if one is available pseudolikelihood replacing the likelihood of the parameters with a tractable product of conditional probabilities is a common approximation in markov random fields for image modeling one of the earliest bayesian approaches to learning in large systems of which we are aware was in this context wang et al yu and cheng the models used in the experiments of subsection were not well approximated by the pseudolikelihood so it is not explored further here extension to hidden variables so far we have only considered models of the form py where all variables y are observed often models need to cope with missing data or have variables that are always hidden these are often the models that would most benefit from a bayesian approach to learning the parameters in fully observed models in the exponential family the parameter posteriors are often relatively simple as they are log concave if the prior used is also log concave as seen later in figure the parameter posterior with hidden variables will be a linear combination of log concave functions which need not be log concave and can be multimodal in theory the extension to hidden variables is simple first consider a model py h where h are unobserved variables the parameter posterior is still proportional to approximation schemes pyp and we observe py h py h z fj y hcj j h j log py log z log h j fj y hcj j log z log zy that is the sum in the second term is a partition function zy for an undirected graph of the variables h to see this compare to equation and consider the fixed observations y as parameters of the potential functions in a system with multiple iid observations zy must be computed for each setting of y note however that these additional partition function evaluations are for systems smaller than the original therefore any method that approximates z or related quantities directly from the parameters can still be used for parameter learning in systems with hidden variables the brief sampling and pseudolikelihood approximations rely on settings of every variable provided by the data for systems with hidden variables these methods could use settings from samples conditioned on the observed data in some systems this sampling can be performed easily hinton in subsection several steps of mcmc sampling over the hidden variables are performed in order to apply the brief langevin method experiments involving fully observed models the approximate samplers described in subsection were tested on three systems the first taken from edwards and havrnek lists six binary properties detaila ing risk factors for coronary heart disease in men modeling these variables as outputs of a fullyconnected boltzmann machine we attempted to draw samples from the distribution over the unknown weights we can compute z exactly in this system which allows us to compare methods against a metropolis sampler with an exact inner loop a previous bayesian treatment of these data also exists dellaportas and forster while predictions wouldnt need as many samples we performed sampling for iterations to obtain reasonable histograms for each of the weights figure the meanfield tree and loopy metropolis methods each proposed changes to one parameter at a time using a zeromean gaussian with variance the brief langevin method used a stepsize qualitatively the results are the same as those reported by dellaportas and forster parameters deemed important by them have very little overlap with zero the meanfield metropolis algorithm failed to converge producing noisy and wide histograms over an ever increasing range of weights figure the sampler with the approximation schemes mean field tree waa wab wac wad wae waf wbb wbc wbd wbe wbf wcc wcd wce wcf wdd wde wdf wee wef wff figure histograms of samples for every parameter in the heart disease risk factor model results from exact metropolis are shown in solid blue loopy metropolis dashed purple brief langevin dotted red these curves are often indistinguishable the meanfield and tree metropolis algorithms performed very badly to reduce clutter these are only shown once each in the plots for waa and wab respectively shown in dashdotted black f f parameters parameters figure loopy metropolis is shown dashed blue brief langevin solid black left an example histogram as in figure for the edge bm the vertical line shows the true weight also shown are the fractions of samples f within of the true value for every parameter in the edge system center and the edge system right the parameters are sorted by f for clarity higher curves indicate better performance approximation schemes treebased inner loop did not always converge either and when it did its samples did not match those of the exact metropolis algorithm very well the loopy metropolis and brief langevin methods closely match the marginal distributions predicted by the exact metropolis algorithm for most of the weights results are not shown for algorithms using expectations from loopy belief propagation in equation or equation as these gave almost identical performance to loopy metropolis based on equation our other two test systems are node boltzmann machines and demonstrate learning where exact computation of zw is intractable we considered two randomly generated systems one with edges and another with each of the parameters not set to zero including the biases was drawn from a unit gaussian experiments on an artificial system allow comparisons with the true weight matrix we ensured our training data were drawn from the correct distribution with an exact sampling method childs et al reviewed in subsection this level of control would not be available on a natural data set the loopy metropolis algorithm and the brief langevin method were applied to data points from each system the model structure was provided so that only nonzero parameters were learned figure shows a typical histogram of parameter samples the predictive ability over all parameters is also shown short runs on similar systems with stronger weights show that loopy metropolis can be made to perform arbitrarily badly more quickly than the brief langevin method on this class of system experiment involving hidden variables finally we consider an undirected model approach taken from work on semisupervised learning in zhu and ghahramani here a graph is defined using the d positions x xi yi of unlabeled and labeled data the variables on the graph are the class labels s si of the points the joint model for the l labeled points and u unobserved or hidden variables is psx exp z lu i ij si sj wij where wij exp xi xj yi yj x y the edge weights of the model wij are functions of the euclidean distance between points i and j measured with respect to scale parameters x y nearby points wish to be classified in the same way whereas far away points may be approximately uncorrelated unless linked by a bridge of points in between these test sets are available online httpwwwgatsbyuclacukiamblug approximation schemes y x x y y x a data set b brief langevin c loopy metropolis figure a a data set for semisupervised learning with variables two groups of classified points and and unlabeled data b approximate samples from the posterior of the parameters x and y equation an uncorrected langevin sampler using gradients with respect to log approximated by a swendsenwang sampler was used c approximate samples using loopy metropolis the likelihoods in this model can be interesting functions of zhu and ghahramani leading to nongaussian and possibly multimodal parameter posteriors with any simple prior as the likelihood is often a very flat function over some parameter regions the map parameters can change dramatically with small changes in the prior there is also the possibility that no single settings of the parameters can capture our knowledge when performing binary classification equation which is a type of potts model can be rewritten as a standard boltzmann machine the edge weights wij are now all coupled through so our sampler will only explore a twodimensional parameter space x y however little of the above theory is changed by this we can still approximate the partition function and use this in a standard metropolis scheme or apply langevin methods based on equation where gradients include sums over edges figure a shows an example data set for this problem this toy data set is designed to have an interpretable posterior over and demonstrates the type of parameter uncertainty observed in real problems we can see intuitively that we do not want x or y to be close to zero this would disconnect all points in the graph making the likelihood small l parameters that correlate nearby points that are the same will be much more probable under a large range of sensible priors neither can both x and y be large this would force the and clusters to be close which is also undesirable however one of x and y can be large as long as the other stays below around one these intuitions are closely matched by the results shown in figure b this plot shows draws from the parameter posterior using the brief langevin method based on a swendsenwang sampling inner loop described in zhu and ghahramani we also reparameterized the posterior to take gradients with respect to log rather than this is important for any unconstrained gradient method like langevin note that predictions from typical samples of will vary greatly for example large x approximation schemes predicts the unlabeled cluster in the top left as mainly s whereas large y predicts s it would not be possible to obtain the same predictive distribution over labels with a single optimal setting of the parameters as was pursued in zhu and ghahramani this demonstrates how bayesian inference over the parameters of an undirected model can have a significant impact on predictions figure c shows that loopy metropolis converges to a very poor posterior distribution which does not capture the long arms in figure b this is due to poor approximate partition functions from the inner loop the graph induced by w contains many tight cycles which cause problems for loopy belief propagation as expected loopy propagation gave sensible posteriors on other problems where the observed points were less dense and formed linear chains discussion although mcmc sampling in general undirected models is intractable there are a variety of approximate methods that can be brought forth to tackle this problem we have proposed and explored a range of such approximations including two variational approximations brief sampling and the bethe approximation combined with metropolis and langevin methods clearly many more approximations could be explored the mean field and treebased metropolis algorithms performed disastrously even on simple problems we believe these failures result from the use of a lower bound as an approximation where the lower bound is poor the acceptance probability for leaving that parameter setting will be exceedingly low thus the sampler is often attracted towards extreme regions where the bound is loose and does not return the bethe free energy based metropolis algorithm performs considerably better and gave the best results on one of our artificial systems however it also performed terribly on our final application in general if an approximation performs poorly in the inner loop then we cannot expect good parameter posteriors from the outer loop in loopy propagation it is well known that poor approximations result for frustrated systems and systems with large weights or tight cycles the typically broader distributions from brief langevin and its less rapid failure with strong weights means that we expect it to be more robust than loopy metropolis another advantage is the cost per iteration the mean field and belief propagation algorithms are iterative procedures with potentially large and uncontrolled costs for convergence brief langevin gave reasonable answers on some large systems where the other methods failed although it too can suffer from very large artifacts even on the very simple heart disease data set one of the posterior marginals was very different under this approximation this actually means the whole joint distribution is in the wrong place we now turn to valid mcmc algorithms which are clearly required if correct inferences are important the exchange algorithm f y z y q y f x z x figure an augmented model with the original generative model for observations py as a marginal the exchange algorithm is a particular sequence of metropolishastings steps on this model the exchange algorithm the standard metropolishastings algorithm proposes taking the data y away from the current parameters which would remove a factor including z from the joint probability p y the proposal also suggests giving the data to the new parameters which introduces a new factor including z if each parameter setting always owned a data set under the models joint distribution then we would not need to keep adding and removing these z factors we bring the proposed parameter into the model and give it a data set of its own x this augmented distribution illustrated in figure has joint probability py x p f y f x q y z z given a setting of y a second setting of parameters is generated from an arbitrary distribution q y a fantasy dataset is then generated from px f x z where the function f is the same as in the datagenerating likelihood equation the original joint distribution p y is evidently maintained on adding these child variables to the graphical model as long as we can generate fantasies from this prior model two monte carlo operators are now feasible operator one resamples x from its distribution conditioned on y this is a blockgibbs sampling operator which is naturally implemented by sampling from the proposal followed by generating a fantasy for that parameter setting thus we are able to change as long as we can sample from px resampling x at the same time removes the need to evaluate the change in z although this does require an exact sample this will need a method like coupling from the past section just having a markov chain with stationary distribution px is not sufficient operator two is a metropolis move that proposes swapping the values of the two parameter settings as the proposal is symmetric the acceptance ratio is simply the exchange algorithm the ratio of joint probabilities before and after the swap a q y p f y f x z z p q y f y f x p q y f y f x q y p f y f x z z combining the two operators gives algorithm the exchange algorithm algorithm exchange algorithm input initial number of iterations s for s s propose q y generate an auxiliary variable x f x z compute acceptance ratio a q y p f y f x q y p f y f x draw r uniform if r a then set end for each step tries to take the data y from the current parameter setting we speculate that a better parameter setting is which was generated by q y how can we persuade to give up the data to the rival parameter setting we offer it a replacement data set x from the distribution if f x f y then this replacement is preferred by to the real data y which is a good thing we have to consider both sides of the exchange the ratio f y f x measures how much likes the trade in data sets only by liking the data and generating good replacements can a parameter enter the set of the posterior samples parameters that spread high f settings over many data sets will like the swap but tend to generate unacceptable fantasies this penalty replaces the need to compute z comparing the acceptance ratio to the infeasible mh algorithms ratio in algorithm we identify that the exchange algorithm replaces the ratio of normalizers with a onesample unbiased importance sampling estimate cf equation z f x z f x x f x z this gives an interpretation of why the algorithm works but it is important to remember that arbitrary estimators based on importance sampling or other methods for computing z and its ratios are unlikely to give a transition operator with the correct stationary distribution we emphasize the exchange motivation because the algorithm is simply metropolishastings which is already well understood here applied to an augmented model for more mathematical detail see subsection which provides a proof of the stationary distribution for a more general version of the algorithm the exchange algorithm x x x x i y figure an alternative representation of the generative model for observations y all possible parameter settings j are instantiated fixed and used to generate a set of data variables xj the indicator i is used to set y xi the posterior over i the parameter chosen by the indicator variable i is identical to py in the original model product space interpretation the exchange algorithm was originally inspired by carlin and chib which gives a general method for model comparison by mcmc in this approach every model is instantiated in the sampler which explores a product space of all the models parameter settings an indicator variable which is also sampled specifies which model was used to generate the data the posterior probability distribution over the indicator variable is the posterior distribution over models exchange moves in product spaces are common in the parallel or replica tempering methods in the physics literature eg swendsen and wang and geyer by temporarily assuming that there is only a finite discrete set of possible parameters j each setting can be considered as a separate model this suggests a somewhat strange joint distribution see also figure pxj y i pi iy xj j f xj j zj where iy xi is an indicator function enforcing y xi this corresponds to choosing a parameter setting i from the prior and generating the observed data as before but then also generating unobserved data sets using every other setting of the parameters although this only appears to be a convoluted restatement of the original generative process all zj terms are now always present z is a constant again each of the unobserved datasets xji can be updated by standard mcmc methods such as gibbs sampling conditioned on i the remaining dataset xi y is not updated because it has a single known observed value to update i we notice that an isolated mh proposal i i will not work it is exceedingly unlikely that xi y instead we couple i i proposals with an exchange of datasets xi y and xi the mh acceptance the exchange algorithm ratio for this proposal is simple much of the joint distribution in equation cancels leaving a qi i y pi f y i f xi i qi i y pi f y i f xi i the only remaining problem is that if the number of parameter settings is very large we require huge amounts of storage and a very long time to reach equilibrium in particular the method seems impractical if is continuous the solution to these problems recreates the exchange algorithm of the previous section we declare that at each time step all of the x variables are at equilibrium conditioned on the indicator variable i but we do not need to store these values only when a swap of ownership between and is proposed do we need to know the current setting of s data we can then draw the value from its equilibrium distribution x px f x z pretending that this had already been computed after the exchange has been accepted or rejected the intermediate quantity x can be discarded we redraw any unobserved data set from its stationary distribution as required this retrospective sampling trick has been used in a variety of other mcmc algorithms for infinitedimensional models papaspiliopoulos and roberts beskos et al an infinitedimensional model and retrospective sampling are not required to describe the exchange algorithm but provide connections to the literature which might be useful for some readers the product model works without exact sampling for models with a small number of parameter settings and is also how the algorithm was originally conceived other readers may prefer the explanation in the previous section even on continuous parameters it uses metropolishastings on a finite auxiliary system which is more established theoretically bridging exchange algorithm the metropolishastings acceptance rule has the maximum acceptance rate for any reversible transition operator that may only accept or reject proposals from q see subsection as the exchange algorithm is only approximately the same as the direct mh approach in some sense it rejects moves that it should not as we have much better approximations of normalizing constant ratios than equation better methods should be possible how can good parameter proposals get rejected it may be that is a much better explanation of the data y than the current parameters and would be accepted under mh however the exchange algorithm can reject the swap because x f x z is improbable under this makes movement unnecessarily slow and suggests searching for a way to make swaps look more favorable the exchange algorithm with bridging draws a fantasy as before x f x z but then applies a series of modifications x x xk such that xk is typically the exchange algorithm more appealing to to describe the precise form of these modifications we specify a new augmented joint distribution py xk k k p f y z the augmented model combines the original model a parameter proposal generating a fantasy and bridging steps q y f x z k k tk xk xk we choose the tk to be markov chain transition operators with corresponding stationary distributions pk a convenient choice is pk x f x k f x k fk x where k k k giving k intermediate distributions that bridge between p x f x z and pk x f x z other bridging schemes could be used in what follows we only assume the directional symmetry pk x pkk x similarly we require tk xk xk tkk xk xk which is easily achieved by always using the same transition operator for the same underlying stationary distribution as before there are two possible markov chain operators operator one block gibbs sample from p xk y q y f x z k tk xk xk k operator two propose swapping whilst simultaneously reversing the order of the xk sequence as illustrated in figure if the tk operators satisfy detailed balance ie tk x x pk x tk x x pk x then the mh acceptance ratio for operator two does not depend on the details of tk and is easily computed the concatenation of the two operators results in algorithm figure note that k reduces to the previous exchange algorithm as before the k k fk xk fk xk term in the acceptance ratio corresponds the exchange algorithm f x z x t x x x q y xk tk xk xk xk f xk z xk t xk xk tk xk xk y f y z xk q y x tk x x t x x x y f y z figure the proposed change under a bridged exchange given a current parameter the algorithm generated an exact sample x from its distribution and a sequence of data sets that come from distributions closer and closer to px the swap move proposes making the current parameter and the owner of a fantasy as x was typical of not the stack of auxiliary variables is reversed so it looks like it might have been generated under the original process algorithm exchange algorithm with bridging input initial iterations s bridging levels k for s s propose q y generate an auxiliary variable by exact sampling x p x f x z generate k further auxiliary variables with transition operators x t x x x t x x xk tk xk xk compute acceptance ratio q y p f y a q y p f y draw r uniform if r a then set end for k k fk xk fk xk the exchange algorithm to an unbiased estimate of zz proof it is an annealed importance sampling ais weight equation this is the natural extension to the simple importance estimate equation in the original exchange algorithm linked importance sampling neal could also be used as a dropin replacement the bridging extension to the exchange algorithm allows us to improve its implicit normalization constant estimation and improve the acceptance rate for some additional cost fortunately no further expensive exact sampling on top of that needed by the original algorithm is required per iteration the performance as a function of k is explored in section details for proof of correctness it turns out that the algorithm as stated is also valid when the operators tk do not satisfy detailed balance we give the details of a proof for this more general case we define a set of reverse markov chain operators tk x x tk x x pk x tk x x pk x pk x define the transition operator t by steps of algorithm let t be defined by the same algorithm but using tk rather than tk we now prove that both t and t leave py stationary t x xk py probability of transition is py pyp f y p py zpy f x f x z z k probability of being at q y proposing a move to k generating x tk xk xk generating x xk k k min q y p f y q y p f y fk xk fk xk and finally accepting rearranging and using fk xk f xk min p f y q y f x py z z p f y q y f xk py z z k tk xk xk k k k tk xk xk fk xk fk xk the single auxiliary variable method we substitute equation and equation into the second product and reorder its terms min p f y q y f x py z z p f y q y f xk py z z k tk xk xk k k tk xkk xkk k now swapping and reversing the order of the auxiliary variables ie mapping x x xk xk x x and swapping t and t throughout only swaps the arguments of the min leaving the value of the expression unchanged therefore the probability of the reverse transition under t via the same intermediate values is the same as in equation that is for any values of xk t x xk py t xk x p y summing over all possible intermediate values xk gives t py t p y further summing over or shows that both t and t leave py stationary if as was originally proposed t is reversible then t t and the bridged exchange algorithm also satisfies detailed balance as an aside we mention that the bridging scheme is very similar to that found in the method of dragging fast variables neal a although dragging was presented using reversible transition operators a similar proof to the one above shows that it too can use nonreversible transition operators the single auxiliary variable method the first valid mcmc algorithm for doublyintractable distributions was discovered by moller et al for comparison this section reviews their method which we call the single auxiliary variable method savm savm extends the original model to include a single auxiliary variable x which shares the same state space as y figure px y px y f y p z the joint distribution py is unaffected no known method of defining auxiliary variables removes z from the joint distribution however through careful choice the single auxiliary variable method x y y a b figure a the original model unknown parameters generated observed variables y b the savm augmented model the conditional distribution of x must have a tractable dependence in existing approaches this distribution is only a function of one of y or eg f x yz or a normalizable function of x of q explicit z dependence can be removed from the mh ratio a px y qx x y px y qx x y a convenient form of proposal distribution is qx x y q y qx which corresponds to the usual change in parameters followed by a choice for the auxiliary variable if this choice which happens to ignore the old x uses qx f x z where f and z are the same functions as in py equation then the mh acceptance ratio becomes a px y p y qx q y px y py qx q y px y zf y p f x z q y px y z f y p f x z q y f y p q y px y f x f y p q y px y f x now every term can be computed as with the exchange algorithm the big assumption is that we can draw independent exact samples from the proposal distribution equation the missing part of this description was the conditional distribution of the auxiliary variable px y this choice is not key to constructing a valid mh algorithm but our choice will have a strong impact on the efficiency of the markov chain normally we have a choice over the proposal distribution here that choice is forced upon us and instead mavm a temperedtransitions refinement we choose the target distribution pxy to match the proposals as closely as possible we cannot maximize the acceptance rate by choosing pxy f x z as that would reintroduce explicit z terms into the mh ratio moller et al suggested two possibilities use a normalizable approximation to the ideal case replace with a point estimate such as the maximum pseudolikelihood estimate based on the observations y this gives an auxiliary distribution px y pxy pxy f x z with a fixed normalization constant z which will cancel in equation the broken lines in figure b indicate that while x could be a child of and y in practice previous work has only used one of the possible parents for concreteness we assume px y f xz for some fixed y in all that follows but our results are applicable to either case reinterpreting savm seen in the light of the product model in figure moller et als savm method appears slightly strange savm can be reproduced by augmenting our joint model in figure containing variables x x with an additional arbitrary latent x with no subscript as used in savm then we can define the following proposal draw j qj i perform the deterministic threeway swap x xi xj xj x xi before the swap xi was equal to the observed data y so after the swap xj will be equal to the data now owned by j the acceptance ratio for this proposal is precisely as in savm equation if we want to take y from i and give it to rival setting j why involve a third parameter in section we will see that the third party can make the transaction harder or mediate it the bridging steps in subsection were specifically designed to make the swap more palatable in the next section we propose an extension of savm which has similar bridging steps mavm a temperedtransitions refinement as with the exchange algorithm savms acceptance ratio equation can be seen as an approximation to the exact normalization constant evaluation in algorithm savm uses the following two unbiased onesample importancesampling estimators f x z z f x z f x z f x x f x z x f x z mavm a temperedtransitions refinement a biased estimate of zz is obtained by dividing equation by equa tion the unknown constant z fortuitously cancels and amazingly substituting this elementary approximation into the mh algorithm gave a valid method as with the exchange algorithm savms importance sampling estimators are very crude a large mismatch between px y and qx y can cause a high mh rejection rate bridging between these two distributions might help which suggests replacing the two importance sampling estimators with annealed importance sampling estimates this gives algorithm our new algorithm has k auxiliary variables and collapses to savm for k we call this method with k the multiple auxiliary variable method mavm algorithm multiple auxiliary variable method mavm input initial x iterations s bridging levels k for s s propose q y propose the first component of x by exact sampling x p x y f x z propose remainder of x using k transition operator steps x t x x y x t x x y xk tk xk xk y compute acceptance ratio a f y p q y f y p q y k k fk xk fk xk fk xk fk x k draw r uniform if r a then set x x end for as in the bridged exchange algorithm ratios involving the auxiliary variables can be computed online as they are generated there is no need to store the whole x ensemble the tk are any convenient transition operators that leave corresponding distributions pk stationary where pk x f x k f x k fk x and k k k other sequences of stationary distributions are possible they must start at p f x z as we are forced to draw an exact sample from this as part of the proposal from there they should bridge towards the approximate or estimatorbased distribution used by savm mavm a temperedtransitions refinement xk xk x x y figure the joint distribution for the annealingbased multiple auxiliary variable method mavm here it is assumed that pxk y is based only on a datadriven parameter estimate as in equation the auxiliary variables bridge towards the distribution implied by the graylevel and thickness of the arrows from y and indicate the strengths of influence on the auxiliary variables these are controlled by k in equation to motivate and validate this algorithm we extend the auxiliary variables x to an ensemble of variables x x x x xk figure we give xk the same conditional distribution as the single auxiliary variable x in savm equation the distribution over the remaining variables is defined by a sequence of markov chain transition operators tk xk xk with k k pxk xk y tk xk xk y px x y t x x y px x y t x x y where as usual tk x x pk x tk x x pk x this defines a stationary distribution over the ensemble px ypy treating the procedure in algorithm as a proposal qx x the acceptance rule is that of standard metropolishastings as in other bridging schemes the details of the transition operators cancel after substituting equation while we started by replacing savms importance sampling estimators with ais the resulting algorithm is more closely related to tempered transitions neal a our approach has cheaper moves than standard tempered transitions which would regenerate xk x from px y before every mh proposal this is exploiting the generalization of tempered transitions introduced in subsection as with adding bridging to the exchange algorithm mavm makes savm a closer match to ideal metropolishastings sampling there is an additional cost of k markov comparison of the exchange algorithm and mavm chain steps per iteration but no additional exact sampling which might need many markov chain steps we have also provided an answer to an open question in moller et al on how to use both and y in the auxiliary distribution px y we use y in coming up with a point estimate of the parameters to get a distribution in roughly the right place then we bridge towards a better fit to f x z using ideas from annealing comparison of the exchange algorithm and mavm we first consider a concrete example for which all computations are easy this allows comparison with exact partition function evaluation algorithm and averaging over chains starting from the true posterior we consider sampling from the posterior of a single precision parameter which has likelihood corresponding to n iid zeromean gaussian observations y y y yn with a conjugate prior pyn n p gamma the corresponding posterior is tractable py gamma n n yn but we pretend that the normalizing constant in the likelihood is unknown we compare the average acceptance rate of the algorithms for two choices of proposal distribution q y all of the algorithms require n exact gaussian samples for which we used standard generators for large n one could also generate the sufficient statistic n xn with a chisquared routine we also draw directly from the gaussian stationary distributions pk in the bridging algorithms this simulates an ideal case where the energy levels are close or the transition operators mix well more levels would be required for the same performance with less efficient operators we now report results for n and y the first experiment uses proposals drawn directly from the parameter posterior equation the mh acceptance probability becomes a all proposals are accepted when z is computed exactly therefore any rejections are undesirable byproducts of the auxiliary variable scheme which can only implicitly obtain noisy estimates of the normalizing constants figure a shows that both mavm and the exchange algorithm improve over the savm baseline of moller et al it appears that a large number k of bridging levels are required to bring the acceptance rate close to the attainable a however significant benefit is obtained from a relatively small number of levels after which there are diminishing returns as each algorithm requires an exact comparison of the exchange algorithm and mavm sample which in applications can require many markov chain steps the improvement from a few extra steps k can be worth the cost see subsection in this artificial situation the performance of mavm was similar to the exchange algorithm this result favors the exchange algorithm which has a slightly simpler update rule and does not need to find a maximum pseudolikelihood estimate before sampling begins in figure a we had set figure b shows that the performance of mavm falls off when this estimate is of poor quality for moderate k the exchange algorithm automatically obtains an acceptance rate similar to the best possible performance of mavm only for k was performance considerably worse than savm for this simple posterior sometimes manages to be a useful intermediary but by k the exchange algorithm has caught up with mavm more importantly the exchange algorithm performs significantly better than savm and mavm in a more realistic situation where the parameter proposal q y is not ideal figure c shows results using a gaussian proposal centered on the current parameter value the exchange algorithm exploits the local nature of the proposal rapidly obtaining the same acceptance rate as exactly evaluating z mavm performs much worse although adding bridging levels does rapidly improve performance over the original savm algorithm savm is now hindered by which is more rarely between and the posterior distribution over equation becomes sharper for n this makes the performance of savm and mavm fall off more rapidly as is moved away from its optimum value these methods require better estimates of with larger datasets ising model comparison we have also considered the ising model distribution with yi on a graph with nodes i and edges e py exp z j yi yj ije i h yi we used the summary states algorithm subsection for exact sampling and a single sweep of gibbs sampling for the transition operators t results are reported for y drawn from a model with h and j moller et al used uniform priors over h and j this only works or seems to for a q with small step sizes the algorithms hang if j is proposed because cftp based on gibbs sampling takes a very long time to return a sample in this regime we used a uniform prior over j and larger closertooptimal step sizes comparison of the exchange algorithm and mavm ave acceptance probability k exchange mavm savm a ave acceptance probability mavm k mavm k savm b ave acceptance probability k exchange mavm savm c figure comparison of mavm and the exchange algorithm learning a gaussian precision a average acceptance rate as a function of k mavm with k corresponds to savm the method of moller et al exact normalizing constant evaluation in would give an acceptance rate of one b average acceptance rate as a function of the initial parameter estimate required by savm k and the extended version mavm horizontal bars show the results for the exchange algorithm which has no for k c as in a but with a gaussian proposal distribution of width centered on the current parameter setting the horizontal line shows the maximum average acceptance rate for a reversible transition operator this is obtained by exact normalizing constant evaluation comparison of the exchange algorithm and mavm efficiency k efficiency exchange mavm savm exchange mavm savm k a b figure performance on a toroidal square ising lattice the data were generated from an exact sample with j and h proposals were gaussian perturbations of width the plot shows efficiency effective number of samples estimated by rcoda cowles et al divided by the total number of gibbssampling sweeps computer time all the algorithms are expensive savm requires gibbs sweeps of the ising model to obtain one effective sample of the parameters this could be improved by using more advanced exact sampling methods figure gives some example results for two different step sizes for both methods the number of effective samples obtained per iteration will tend to the maximum possible as k in practice the performance of the two methods does converge however these bridging steps come at a cost the best k is a compromise between the computer time per iteration and the mixing time of the chain in the examples we tried the exchange algorithm provided better mixing times than savmmavm for all finite k bridging provided good improvements over savm but only gave appreciable benefits to the exchange algorithm for larger step sizes discussion mcmc methods typically navigate complicated probability distributions by local diffusion longer range proposals will be rejected unless carefully constructed it is usually the case that as the stepsize of proposals are made sufficiently small the acceptance rate of a mh method tends to one however savm does not have this property it introduces rejections even when while the exchange algorithm has a for all k as the stepsize tends to zero mavm will only recover a as k this is because the third party in the proposed swap see subsection is not necessarily close to even in a simple unimodal dimensional posterior distribution figure c this is a significant disadvantage in comparison with the exchange algorithm we found the exchange algorithm performs better than the only other existing mcmc method for this problem and is simpler to implement latent history methods latent history methods both mavm and the exchange algorithm require exact samples and use them to compute similar metropolishastings acceptance ratios in this section we propose new algorithms which are also valid mcmc methods for doublyintractable distributions but which do not draw exact samples they do follow a very similar coupling procedure but the properties of the resulting algorithms are quite different we start with an inspiring but impractical rejection sampling algorithm for the parameters given for example by marjoram et al algorithm b algorithm simple rejection sampling algorithm for py generate p simulate y py accept if y y on real problems where the data can take on many values this algorithm will rarely accept marjoram et al suggest relaxing the equality in step by accepting data sets that are close in some sense they also suggest mcmc algorithms which diffuse the parameters in step rather than drawing from the prior these algorithms are valid samplers for the wrong distribution py is near observed value this section will introduce an alternative approach which tries to make it more probable that the fantasy data in step is equal to the observed data while keeping the validity of the original algorithm a generative model specifies a likelihood py but not necessarily the details of an algorithm to sample from this distribution the latent history representation brings a particular sampling procedure into the description of the model we declare that a stationary ergodic markov chain t was simulated for an infinite number of time steps generating an arbitrarilyinitialized sequence x x x x x and that we observed the data y x the marginal probability of any xt including x is the stationary distribution of the markov chain which we set to the target models py unraveling a markov chain and writing it down as a generative model was a trick employed by hinton et al our latent history representation is illustrated in figure as in coupling from the past section we also consider the random variables u u u u u used by the markov chain transition operator at each time step given and the ut variables which provide t with its source of randomness each state is a deterministic function of the previous xt xt ut latent history methods x x x x y u u u u u figure the latent history representation augments the observed data y and parameters with the history of a markov chain x x x with stationary distribution py if we can effectively sample over all unknowns in this graphical model we have by discarding xt ut an algorithm to sample metropolishastings algorithm the latent history representation is a distribution with an infinite number of latent variables although we can not explicitly update all of these variables mcmc sampling is notionally straightforward algorithm would be a valid mcmc procedure if we had an oracle to perform the infinite number of computations in steps and on our behalf as an aside we note that this algorithm is similar to algorithm f in marjoram et al for approximate bayesian computation abc algorithm template for a latent history sampler input initial setting number of iterations s for s s block gibbs sample from px u x y propose q y p q y with probability min p q y follow deterministic map xt xt ut to find y x if y y then accept move end for the algorithm can feasibly be implemented if t provably coalesces to a single final state from any start state given the sequence of random numbers ut this allows step of algorithm to be replaced with the procedure used by coupling from the past cftp see section to identify y while only examining some of the ut s further there are two differences firstly the equivalent of the generation in step is always performed before checking the equivalent to step our version would also be advisable in the original abc setting assuming data simulation is more expensive than evaluation of prior probabilities secondly the major difference we sample the fantasy y in a way that has a bias towards reproducing y the algorithms are not directly comparable however the abc community assumes that they cannot evaluate a function proportional to py which would make it difficult to construct the transition operators needed by our algorithm latent history methods by combining steps and the coupling from the past can be terminated early if at any point it becomes clear that y y all we require now is an oracle that can return any ut on request creating this service can replace the block gibbs sampling in step as long as we are consistent with its infinite computation markov properties in the model imply px u x t pxt xt t put xt xt this structure allows resampling of x x in sequence from pxt xt pxt xt pxt pxt t xt xt pxt t xt xt pxt the reverse transition operator corresponding to t as long as put xt xt is known for operator t any ut value can be sampled after the corresponding xt and xt are available this means that step in algorithm can be implemented lazily construct an object that will provide ut values on request consistent with block gibbs sampling the entire model this object will respond to a request for ut by returning the value from memory if it has previously been sampled or if necessary by sampling back to xt from the furthest back x value that is known and returning a sample from put xt xt within each iteration all samples are stored for future use to ensure consistency the practical implementation of a latent history metropolishastings sampler is summarized in algorithm this includes an optional refinement version b which analytically integrates out u this requires computing the transition probability for one step of t which is usually possible and should increase the acceptance rate the refinement also makes the algorithm feasible for markov chains that do not completely coalesce for example on continuous state spaces a disadvantage of version b is that step may be difficult to implement without always identifying x whereas step in version a will be able to prematurely halt cftp whenever y x which is often easy to prove performance a simple special case highlights large differences between latent history samplers and the previous auxiliary variable algorithms we first consider version a of the sampler applied to a large collection of d independent bernoulli variables sharing a single unknown parameter this is an ising model in the limit of zero connection strengths assume that the transition operator in use t latent history methods algorithm metropolishastings latent history sampler input initial setting number of iterations s and an algorithm cftp that can use a finite number of random numbers to prove where an infinitely long markov chain ends see section for s s create lazy ut generator consistent with px u x y propose q y either version a p q y with probability min p q y end for is gibbs sampling which for these independent variables draws from the stationary distribution in one sweep the sampler is implemented as follows for each of the observed spins yd the corresponding ud random variate is uniformly distributed in if the spin is up and if the spin is down now updates to are constrained to lie within the two closest surrounding ud values this range has a typical scale of d the posterior over parameters has a width that scales as d this suggests that it will take at least od steps to equilibrate the single scalar parameter by random walk sampling disappointing given the ideal conditions notice that the performance of the algorithm depends on the details of the transition operators an unreasonably clever transition operator could generate y if u py and some other data set otherwise then the probability of finding y y is min pypy giving an overall algorithm similar to the ideal metropolis hastings algorithm actually the two stage rule described in algorithm neither mavm nor the exchange algorithm have such dependence on the transition operators how the exact sample was obtained is irrelevant we now turn to version b of the sampler with the same independent bernoulli model and gibbs sampling operators here the transition probabilities in step are equal to the parameters likelihoods under the model the algorithm reduces to standard metropolishastings for which has the best possible acceptance rate for a given proposal distribution q rapidly mixing operators are rewarded by this version of latent histories operators that move very slowly will perform as in version a we have implemented both versions of the sampler and applied them to the same ising identify whether y x is y using cftp and ut generator if y y then accept move or version b draw r uniform identify enough about x for acceptance rule using cftp and ut s if r t x y x p q y then accept move t x y x p q y slice sampling doublyintractable distributions model problem as in subsection we used gibbs sampling for the y variables combined with the summary states algorithm subsection to perform inferences required by the algorithm the summary states code required some modification so that its transitions were driven by the latent history samplers ugenerator rather than a generic random number generator then the state at time t or t was identified as in subsection the parameter proposals were gaussian with width which preliminary runs indicated was roughly optimal in version a we terminated each iteration as soon as the summary state algorithm identified the state at time zero or a non state at time zero was incompatible with the data in version b we always waited until x had been identified more elaborate code could terminate earlier based on bounds of the acceptance ratio but this is also true for the exchange algorithm mavm and most standard metropolishastings algorithms as a result of early terminations version a was roughly five times cheaper per iteration than our implementation of version b even with this advantage the number of effective samples per gibbssamplinglike sweep through the system was around ten times worse for version a the efficiency of version b was effective samples per gibbs sweep about half that obtained by savm the latent history efficiency computations included the gibbslike sweeps needed to compute the us while savm was given its random numbers for free this makes the real computation time of the methods comparable which method is actually faster will come down to implementation details however both the bridged exchange and mavm algorithms are clearly faster than either variant of latent histories on this ising problem we might hope that latent histories will perform better on other problems a more realistic reason to be excited by latent histories is that they might be a much better route to approximate methods than any of the previous algorithms in this chapter we return to this issue in section first we turn to the original motivation for developing latent histories constructing slice samplers slice sampling doublyintractable distributions all of the algorithms for doublyintractable distributions described so far are based on metropolishastings such algorithms always require stepsize parameters that need to be set well overly large stepsizes lead to almost all proposals being rejected overly small stepsizes give lengthy randomwalklike exploration of the posterior over it would be nice to be able to use a selftuning method like slice sampling subsection which has lessimportant or even no stepsize parameters slice sampling doublyintractable distributions latent histories the latent history representation naturally allows slice sampling it also has the property like the exchange algorithm that small steps in are always accepted so moves will always be possible the trick is to identify the variables in the joint distribution figure as and u while thinking of the x quantities as just a deterministic function of these variables conditioned on the u variables any standard slice sampling algorithm subsection can be applied to the parameter the stationary distribution pu is simply the prior over the parameters conditioned on x u y any point not satisfying the constraint is off the slice and discarded by the slice sampling procedure this description gives a slice sampler corresponding to version a of latent history algorithm a slice sampling implementation of version b sets the stationary distribution to t y x p after each iteration u is effectively updated in a blockgibbs update this involves resetting a lazy ut generator as in the metropolishastings algorithm mavm standard slice sampling cannot be applied to the parameter in the mavm joint distribution because this would involve computing z as with metropolishastings the auxiliary variables must be changed simultaneously it turns out that with an appropriate definition of a slice algorithm can be converted into a slice sampler we first describe the new algorithm ostensibly we do just use a standard slice sampling procedure subsection to update the parameters however the range of auxiliary heights h h is defined in terms of the following quantity hx p y px y z f y p qx y k k fk xk fk xk after drawing the slicesampling auxiliary quantity h uniform hx we consider new parameter settings using any of the normal slice sampling procedures however whenever a new setting of the parameters is considered a new setting of auxiliary x variables is generated in the same way as algorithm then as usual slice membership is assertained by checking hx h we now show that this procedure satisfies detailed balance with respect to the desired stationary distribution as in subsection we use the label s for all rejected points and other ancillary points generated while exploring the slice in a steppingout procedure the probability of starting at a setting x generating a particular setting of discussion h generating intermediate quantities s and finally an acceptable new pair x is py px y phx y qs h q s h qx y py px y py z qx y px y p y z qs h q s h qx y q s h qs h qx y qx y all of the slicesampling bracketing procedures mentioned in subsection ensure that q s h qs h q s h qs h therefore equation is invariant to exchanging x x and the overall procedure satisfies detailed balance this derivation closely follows that of standard slice sampling this is possible because conditioned on h the decision to reject intermediate parameters in s is the same whether generating s forwards from or backwards from in contrast the acceptability of a proposal in the exchange algorithm cannot be summarized by a scalar slice height h a move depends on giving a new data set to the old parameter which will have different probabilities depending on whether the move started at or we dont currently see any way to define a slice sampler for the exchange representation as noted in subsection even very small moves in can be rejected when using the mavm auxiliary system the usual slice bracketing procedures work on an assumption that nearby regions will always be accepted and exponentially shrink towards the current point unless a large number of bridging levels are used this behavior could lead to overly small step sizes care would be needed in applying stock slicesampling code discussion we established in subsection that global information like that provided by exact sampling must be found as part of a strictly valid mcmc algorithm for doublyintractable distributions given this it seems likely that the exchange algorithm performs about as well as is possible on the problems we have tried the acceptance rate approaches that of metropolishastings with moderate k leaving exact sampling as the dominant cost we would generally recommend this algorithm if a valid mcmc procedure is required moller et al pointed out that good deterministic approximations to py could make savm perform very well our results firmly support the use of bridging and we would recommend always considering mavm instead there is an unresolved issue concerning doublyintractable distributions when learning discussion from n iid data points n py n py n zn n f y n n all of the known valid mcmc algorithms for py require drawing an exact sample from py this means drawing a fantasy containing n data sets x xn n but n perhaps there is enough information about z in a single exact sample from the data distribution drawing n exact samples may be prohibitively expensive with large data sets whether this problem can be avoided is an open question sadly it is not possible to draw exact samples from all distributions approximate relaxations must be used for some situations section observed that a bad choice of approximations applied to the inner loop of standard metropolishastings can have surprising and disastrous consequences on the quality of the approximate samples the valid algorithms discussed in this chapter are an alternative target for approximation the obvious relaxation of methods requiring exact sampling from the data distribution is to use a brief run of mcmc possibly starting from the observed data moller et al report that a fixed length of sampling can be more inefficient than exact sampling for similar levels of performance however this idea would be worth exploring further in cases where exact sampling is not possible a natural relaxation of the latent history representation is truncating the history to some finite length of time k this would require specifying an explicit prior distribution for xk perhaps an approximation to the original model one could optionally replace the stationary markov chain operator t with a sequence of operators tk bridging between the approximate distribution used for xk and the target model distribution this finite model could be simulated using the latent history algorithms with explicit computations or any standard mcmc technique we believe that truncated latent history representations are a promising direction for future research in approximate inference for doublyintractable distributions running finite markov chains within the exchange algorithm or mavm is an innerloop approximation somewhat like those explored in section the consequences of approximating mcmc procedures is hard to predict it is possible that very unreasonable inferences will result in contrast a truncated latent history representation is a model on which we can apply standard mcmc algorithms the model will not describe exactly the same beliefs as the original distribution but all of the inferences performed on it will be selfconsistent thus if prior draws from the truncated model look reasonable it is likely that the inferences will also be sensible this line of reasoning is really suggesting that we throw out the original doublyintractable model and replace it with a more tractable latent variable model if the model was originally introduced as an ad hoc way to introduce dependencies then us discussion ing an alternative that can be treated consistently may be preferable in particular we can look at draws from the prior and see if they actually reflect our beliefs heckerman et al follow a related thought process they discuss circumstances in which undirected graphical models may not be a natural representation and provide an alternative replacement changing the model doesnt seem attractive when an undirected model was derived directly from physical considerations an example of such a model is recent work on learning protein structure podtelezhnikov et al which learns the parameters of an energy function using contrastive divergence hinton a full bayesian treatment of these parameters with mcmc seems daunting exact sampling from the configurationaldistribution of a protein chain seems infeasibly difficult however using the wrong model could still be used as an approximation technique and doing so may be more stable than innerloop approximations this chapter has not addressed computing marginal likelihoods pym for bayesian model comparison of different undirected graphical models evidently the standard methods discussed in chapter will apply in theory although such computations are likely to be demanding a triplyintractable problem chapter summary and future work the markov chain monte carlo method introduced by metropolis et al was one of the earliest applications of electronic computers today it continues to be an important technique for approximate numerical integration in a growing number of applications chapter provided an introduction to the challenges involved while chapter described some of the current literature along with some minor extensions in chapter we explored the promising ideas of using multiple particles and multiple proposals in markov chain simulations as with other authors we found that modest improvements are possible largely derived from the basic lengthscale information given by a few samples from a distribution however drawing multiple proposals can easily cost more to compute than the statistical benefit it remains to be seen if implementations leveraging caching of intermediate results or parallel computations can gain advantage in real applications other authors in the literature remain hopeful that they can and our pivotbased approaches seem to bring unique benefits to this area of the field chapter investigated methods for computing normalizing constants one of these ais is actually an importance sampling method rather than mcmc and nested sampling offers a new sampling paradigm in its own right however both of these algorithms use markov chains in their practical implementation we extended existing techniques for nested sampling on discrete distributions we then compared three fundamentally different algorithms based on the same markov chain our theoretical and empirical results show that the algorithms perform very differently and that no method is appropriate for all distributions it would seem wise to use more than one algorithm as standard practice we proposed one practical approach to realize this a method for tuning ais based on preliminary runs of nested sampling chapter serves to highlight the limitations of mcmc sampling from doubly intractable distributions is exceedingly difficult yet these distributions are found in a large class of undirected statistical models and have received a great deal of at tention in recent years we have introduced three new algorithms for this problem mavm the exchange algorithm and latent history samplers these improve the existing state of the art remove the need for separate deterministic approximations and offer new directions for approximating this difficult problem exploring truncations of latent histories is a possible area for further work this thesis has concentrated on investigating new markov chains and new algorithms for the better use of existing mcmc operators these have been investigated empirically and sometimes theoretically on simple problems designed to demonstrate the algorithms key properties one limitation of the presentation here has been the focus on stationary distributions little has been said about the chains formal convergence rates this is partly because making meaningful statements about convergence for general statistical problems seems difficult section also wherever possible we have reduced new algorithms to the metropolishastings algorithm applied to an auxiliary system this passes some of the burden of deriving rates of convergence onto an established literature while we have focused on mcmc methodology and proposed a number of novel algorithms ultimately the computational advantages of these methods has to be proven on challenging realworld applications this is an area i hope to spend more time on in my future research bibliography d h ackley g e hinton and t j sejnowski a learning algorithm for boltzmann machines cognitive science page j albert bayesian selection of loglinear models technical report working paper duke university institute of statistics and decision sciences page n balakrishnan and a clifford cohen order statistics inference estimation methods academic press san diego isbn page t bayes an essay towards solving a problem in the doctrine of chances philosophical transations communicated by richard price in a letter to john canton also available edited by g a barnard in biometrika december page m j beal variational algorithms for approximate bayesian inference phd thesis gatsby computational neuroscience unit university college london page m j beal and z ghahramani the variational bayesian em algorithm for incomplete data with application to scoring graphical model structures in j m bernardo m j bayarri j o berger a p dawid d heckerman a f m smith and m west editors bayesian statistics pages oxford university press page m beltrn j garc a abellido j lesgourgues a r liddle and a slosar bayesian model selection and isocurvature perturbations physical review d pages and c h bennett efficient estimation of free energy differences from monte carlo data journal of computational physics october page b a berg and t neuhaus multicanonical ensemble a new approach to simulate firstorder phase transitions phys rev lett january page j besag spatial interaction and the statistical analysis of lattice systems journal of the royal statistical society page j besag and p j green spatial statistics and bayesian computation journal of the royal statistical society b page bibliography j besag p green d higdon and k mengersen bayesian computation and stochastic systems statistical science page a beskos o papaspiliopoulos g o roberts and p fearnhead exact and computationally efficient likelihoodbased estimation for discretely observed diffusion processes with discussion journal of the royal statistical society series b statistical methodology page c bischof and m bcker computing derivatives of computer programs in modern u methods and algorithms of quantum chemistry volume pages page c m bishop pattern recognition and machine learning springerverlag new york isbn page o capp a guillin jm marin and c p robert population monte carlo journal e of computational graphical statistics december pages and b p carlin and s chib bayesian model choice via markov chain monte carlo methods journal of the royal statistical society b methodological page g casella and c p robert raoblackwellisation of sampling schemes biometrika page s chib marginal likelihood from the gibbs output journal of the american statistical association december page a m childs r b patterson and d j c mackay exact sampling from nonattractive distributions using summary states physical review e pages and a christen and c fox mcmc using an approximation journal of computational and graphical statistics page j a christen and c fox a selfadjusting multiscale mcmc algorithm poster presented at valencia bayesian meeting benidorm preprint available from jaccimatmx code available from httpwwwcimatmxjactwalk page p clifford et al discussion on the meeting on the gibbs sampler and other markov chain monte carlo methods journal of the royal statistical society series b page methodological m k cowles n best k vines and m plummer rcoda available from httpwwwfisiarcfrcoda pages and bibliography r t cox probability frequency and reasonable expectation american journal of physics january pages and r v craiu and c lemieux acceleration of the multipletry metropolis algorithm using antithetic and stratified sampling statistics and computing page p damien j wakefield and s walker gibbs sampling for bayesian nonconjugate and hierarchical models by using auxiliary variables journal of the royal statistical society series b statistical methodology page s della pietra v j della pietra and j d lafferty inducing features of random fields ieee transactions on pattern analysis and machine intelligence page p dellaportas and j j forster markov chain monte carlo model determination for hierarchical and graphical loglinear models biometrika pages and l devroye nonuniform random variate generation springerverlag new york isbn out of print available from httpcgscscarletoncalucrnbookindexhtml page a dobra c tebaldi and m west data augmentation in multiway contingency tables with fixed marginal totals journal of statistical planning and inference page p dostert y efendiev t y hou and w luo coarsegradient langevin algorithms for dynamic data integration and uncertainty quantification journal of computational physics page s duane a d kennedy b j pendleton and d roweth hybrid monte carlo physics letters b september page d edwards and t havrnek a fast procedure for model search in multidimensional a contingency tables biometrika august page r g edwards and a d sokal generalizations of the fortuinkasteleynswendsenwang representation and monte carlo algorithm physical review pages and j ferkinghoffborg monte carlo methods in complex systems phd thesis graduate school of biophysics niels bohr institute and rise national laboratory faculty of science university of copenhagen may page c m fortuin and p w kasteleyn on the randomcluster model i introduction and relation to other models physica page bibliography d frenkel speedup of monte carlo simulations by sampling of rejected states proceedings of the national academy of sciences pages and d frenkel and b smit understanding molecular simulation academic press inc nd edition pages and a e gelfand and a f m smith samplingbased approaches to calculating marginal densities journal of the american statistical association page a gelman and xl meng simulating normalizing constants from importance sampling to bridge sampling to path sampling statistical science pages and a gelman j b carlin h s stern and d b rubin bayesian data analysis second edition chapman hallcrc isbn x page d geman and s geman stochastic relaxation gibbs distribution and bayesian restoration of images ieee transactions on pattern analysis and machine intelligence pages and j geweke getting it right joint distribution tests of posterior simulators journal of the american statistical association pages and c j geyer markov chain monte carlo maximum likelihood in computing science and statistics proceedings of the rd symposium on the interface pages page c j geyer practical markov chain monte carlo statistical science november w r gilks page derivativefree adaptive rejection sampling for gibbs sampling in j bernardo j berger a p dawid and a f m smith editors bayesian statistics oxford university press page w r gilks and p wild adaptive rejection sampling for gibbs sampling applied statistics page w r gilks g o roberts and e i george adaptive direction sampling the statistician special issue conference on practical bayesian statistics page v k gore and m r jerrum the swendsenwang process does not always mix rapidly in proceedings of the twentyninth annual acm symposium on theory of computing pages acm press new york ny usa page bibliography v k gore and m r jerrum the swendsenwang process does not always mix rapidly journal of statistical physics page p j green reversible jump markov chain monte carlo computation and bayesian model determination biometrika december pages and f hamze and n de freitas hot coupling a particle approach to inference and normalization on pairwise undirected graphs in y weiss b schlkopf and j platt o editors advances in neural information processing systems nips proceedings of the conference mit press page w k hastings monte carlo sampling methods using markov chains and their applications biometrika april pages and d heckerman d m checkering c meek r rounthwaite and c kadie dependency networks for inference collaborative filtering and data visualization journal of machine learning research jmlr page g e hinton training products of experts by minimizing contrastive divergence neural computation august pages and g e hinton s osindero and yw teh a fast learning algorithm for deep belief nets neural computation page j d hobby a users manual for metapost technical report att bell laboratories murray hill new jersey page k s v horn constructing a logic of plausible inference a guide to coxs theorem international journal of approximate reasoning page m huber exact sampling and approximate counting techniques in th acm symposium on the theory of computing pages page m huber a bounding chain for swendsenwang random structures algorithms page y iba population monte carlo algorithms transactions of the japanese society for artificial intelligence a page y iba extended ensemble monte carlo international journal of modern physics c b page w janke and s kappler multibondic cluster algorithm for monte carlo simulations of firstorder phase transitions physical review letters page c jarzynski equilibrium freeenergy differences from nonequilibrium measurements a masterequation approach physical review e november page bibliography e t jaynes probability theory the logic of science principles and elementary applications vol cambridge university press april page h jeffreys theory of probability oxford university press rd edition edition republished in oxford classic texts in the physical sciences page m h kalos and p a whitlock monte carlo methods volume i basics john wiley isbn page r e kass and a e raftery bayes factors journal of the american statistical association page r e kass b p carlin a gelman and r m neal markov chain monte carlo in practice a roundtable discussion american statistician page s kirkpatrick c d gelatt jr and m p vecchi optimization by simulated annealing science page m kuss and c e rasmussen assessing approximate inference for binary gaussian process classification journal of machine learning research pages and s l lauritzen and d j spiegelhalter local computations with probabilities on graphical structures and their application to expert systems with discussion journal of the royal statistical society series b page j s liu metropolized independent sampling with comparisons to rejection sampling and importance sampling statistics and computing page j s liu monte carlo strategies in scientific computing springer isbn page j s liu w h wong and a kong covariance structure of the gibbs sampler with applications to the comparisons of estimators and augmentation schemes biometrika page j s liu f liang and w h wong the use of multipletry method and local optimization in metropolis sampling journal of american statistical association pages and a p lyubartsev a a martsinovski s v shevkunov and p n vorontsovvelyaminov new approach to monte carlo calculation of the free energy method of expanded ensembles j chem phys d mackay nested sampling explanatory illustrations available from httpwwwinferencephycamacukbayesysboxnestedpdf page page bibliography d j c mackay information theory inference and learning algorithms cambridge university press available from httpwwwinferencephycamacukmackayitila pages and e marinari and g parisi simulated tempering a new monte carlo scheme europhysics letters page p marjoram j molitor v plagnol and s tavar markov chain monte carlo without e likelihoods proceedings of the national academy of sciences pages and a mccallum and b wellner toward conditional models of identity uncertainty with application to proper noun coreference in ijcai workshop on information integration on the web page i r mcdonald and k singer machine calculation of thermodynamic properties of a simple fluid at supercritical temperatures j chem phys page n metropolis the beginning of the monte carlo method los alamos science page n metropolis a w rosenbluth m n rosenbluth a h teller and e teller equation of state calculation by fast computing machines j chem phys pages and j moller a n pettitt k k berthelsen and r w reeves an efficient markov chain monte carlo method for distributions with intractable normalising constants technical report r department of mathematical sciences aalborg university pages and j moller a n pettitt r reeves and k k berthelsen biometrika an efficient markov chain monte carlo method for distributions with intractable normalising constants pages and p mukherjee d parkinson and a r liddle a nested sampling algorithm for cosmological model selection the astrophysical journal l page i murray and z ghahramani bayesian learning in undirected graphical models approximate mcmc algorithms in m chickering and j halpern editors proceedings of the th annual conference on uncertainty in artificial intelligence pages auai press arlington virginia united states pages and bibliography i murray and e snelson a pragmatic bayesian approach to predictive uncertainty in j quionerocandela i dagan b magnini and f dalchbuc editors machine n e learning challenges evaluating predictive uncertainty visual object classification and recognising textual entailment first pascal machine learning challenges workshop southampton uk april revised selected papers springer lecture notes in computer science page i murray z ghahramani and d j c mackay mcmc for doublyintractable distributions in r dechter and t s richardson editors proceedings of the nd annual conference on uncertainty in artificial intelligence uai pages auai press a page i murray d mackay z ghahramani and j skilling nested sampling for potts models in y weiss b schlkopf and j platt editors advances in neural information o processing systems nips proceedings of the conference pages mit press b pages and r m neal bayesian training of backpropagation networks by the hybrid monte carlo method technical report crgtr connectionist research group department of computer science university of toronto april page r m neal probabilistic inference using markov chain monte carlo methods technical report crgtr department of computer science university of toronto september pages and r m neal sampling from multimodal distributions using tempered transitions technical report department of statistics university of toronto october pages and r m neal suppressing random walks in markov chain monte carlo using ordered overrelaxation technical report no department of statistics university of toronto page r m neal sampling from multimodal distributions using tempered transitions statistics and computing a pages and r m neal bayesian learning for neural networks number in lecture notes in statistics springerverlag b page r m neal markov chain monte carlo methods based on slicing the density function technical report department of statistics university of toronto page r m neal annealed importance sampling technical report no department of statistics university of toronto a pages and bibliography r m neal suppressing random walks in markov chain monte carlo using ordered overrelaxation in m i jordan editor learning in graphical models pages kluwer academic publishers b pages and r m neal markov chain sampling methods for dirichlet process mixture models technical report department of statistics university of toronto september c page r m neal erroneous results in marginal likelihood from the gibbs output available from httpwwwcstorontoeduradfordchibletterhtml page r m neal annealed importance sampling statistics and computing pages and r m neal slice sampling annals of statistics pages and r m neal taking bigger metropolis steps by dragging fast variables technical report no department of statistics university of toronto october a page r m neal improving asymptotic variance of mcmc estimators nonreversible chains are better technical report department of statistics university of toronto july b page r m neal estimating ratios of normalizing constants using linked importance sampling technical report no department of statistics university of toronto pages and m e j newman and r m ziff fast monte carlo algorithm for site or bond percolation physical review e june page m a newton and a e raftery approximate bayesian inference with the weighted likelihood bootstrap journal of the royal statistical society series b methodological page a ohagan monte carlo is fundamentally unsound the statistician in special issue practical bayesian statistics page o papaspiliopoulos and g o roberts retrospective mcmc methods for dirichlet process hierarchical models submitted manuscript page j pearl evidential reasoning using stochastic simulation of causal models artificial intelligence page p peskun optimum montecarlo sampling using markov chains biometrika pages and bibliography a a podtelezhnikov z ghahramani and d l wild learning about protein hydrogen bonding by minimizing contrastive divergence proteins structure function and bioinformatics page j g propp and d b wilson exact sampling with coupled markov chains and applications to statistical mechanics random structures and algorithms pages and y qi and t p minka hessianbased markov chain montecarlo algorithms in first cape cod workshop on monte carlo methods september pages and z s qin and j s liu multipoint metropolis method with application to hybrid monte carlo journal of computational physics page a e raftery m a newton j m satagopan and p n krivitsky estimating the integrated likelihood via posterior simulation using the harmonic mean identity in j m bernardo m j bayarri j o berger a p dawid d heckerman a f m smith and m west editors bayesian statistics proceedings of the valencia isba th world meeting on bayesian statistics oxford university press page c e rasmussen and z ghahramani bayesian monte carlo in s becker s thrun and k obermayer editors advances in neural information processing systems nips proceedings of the conference mit press page c ritter and m a tanner facilitating the gibbs sampler the gibbs stopper and the griddygibbs sampler journal of the american statistical association page c p robert and g casella monte carlo statistical methods springer texts in statistics springer nd edition isbn m seeger pages and low rank updates for the cholesky decomposition technical repage port september httpwwwkybtuebingenmpgdebspeopleseeger paperscholupdatepdf j seward and n nethercote using valgrind to detect undefined value errors with bitprecision in proceedings of the usenix annual technical conference april page r shaw m bridges and m p hobson clustered nested sampling efficient bayesian inference for cosmology preprint available from httparxivorgpdfastroph pages and d s sivia and j skilling data analysis a bayesian tutorial oxford science publications isbn page bibliography j skilling nested sampling in r fischer r preuss and u von toussaint editors bayesian inference and maximum entropy methods in science and engineering aip conference proceeedings pages page j skilling nested sampling for general bayesian computation bayesian analysis pages and j skilling nested sampling for bayesian computations in j m bernardo m j bayarri j o berger a p dawid d heckerman a f m smith and m west editors bayesian statistics proceedings of the valencia isba th world meeting on bayesian statistics oxford university press pages and j skilling and d j c mackay slice sampling a binary implementation annals of statistics in discussion slice sampling radford m neal page g r smith the measurement of free energy by monte carlo computer simulation phd thesis university of edinburgh september pages and a sokal monte carlo methods in statistical mechanics foundations and new algorithms lectures at the cargse summer school on functional integration e basics and applications page d j spiegelhalter a thomas n g best and w r gilks bugs examples volumes and available from httpwwwmrcbsucamacukbugs page d stern t graepel and d j c mackay modelling uncertainty in the game of go in l k saul y weiss and l bottou editors advances in neural information processing systems nips proceedings of the conference mit press page k stormark multiple proposal strategies for markov chain monte carlo msc thesis department of mathematical sciences norwegian university of science and technology page r h swendsen and js wang replica monte carlo simulation of spinglasses physical review letters page r h swendsen and j s wang nonuniversal critical dynamics in monte carlo simulations physical review letters pages and c j f ter braak a markov chain monte carlo version of the genetic algorithm differential evolution easy bayesian computing for real parameter spaces statistical computing pages page l tierney markov chains for exploring posterior distributions the annals of statistics page bibliography l tierney and a mira some adaptive monte carlo methods for bayesian inference statistics in medicine page h tjelmeland using all metropolishastings proposals to estimate mean values technical report department of mathematical sciences norwegian university of science and technology pages and z tu and sc zhu image segmentation by datadriven markov chain monte carlo ieee transactions on pattern analysis and machine intelligence page js wang and r h swendsen lowtemperature properties of the j ising spin glass in two dimensions physical review b pages september page l wang j liu and s z li mrf parameter estimation by mcmc method pattern recognition page g r warnes the normal kernel coupler an adaptive mcmc method for efficiently sampling from multimodal distributions phd thesis department of biostatistics university of washington page m welling and s parise bayesian random fields the bethelaplace approximation in proceedings of the nd annual conference on uncertainty in artificial intelligence uai auai press page w wiegerinck variational approximations between mean field theory and the junction tree algorithm in proceedings of the th conference on uncertainty in artificial intelligence pages morgan kaufmann page d b wilson annotated bibliography of perfectly random sampling with markov chains in dimacs series in discrete mathematics and theoretical computer science volume american mathematical society updated version available online httpdbwilsoncomexact page j winn and c m bishop variational message passing journal of machine learning research page o winther and a krogh teaching computers to fold proteins physical review e page c yanover and y weiss approximate inference and protein folding in s becker s thrun and k obermayer editors advances in neural information processing systems nips proceedings of the conference mit press page j s yedidia w t freeman and y weiss constructing freeenergy approximations and generalized belief propagation algorithms ieee transactions on information theory july page bibliography y yu and q cheng mrf parameter estimation by an accelerated method pattern recognition letters page x zhu and z ghahramani towards semisupervised classification with markov random fields technical report cmucald school of computer science carnegie mellon university june pages and