Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, USA
*Corresponding author Email: firstname.lastname@example.org
Approximate Bayesian computation is an analysis approach that has arisen in response to the recent trend to collect data of very high dimension. This trend has led to many existing methods becoming intractable because of difficulties in calculating the likelihood function. Approximate Bayesian computation circumvents this issue by replacing calculation of the likelihood with a simulation step in which it is estimated in one way or another. In this review, we give an overview of the approximate Bayesian computation approach, giving examples of some of the more popular specific forms of approximate Bayesian computation. We then discuss some of the areas of most active research and application in the field, specifically, the choice of low-dimensional summaries of complex datasets and of metrics for measuring similarity between observed and simulated data. Next, we consider the question of how to do model selection in an approximate Bayesian computation context. Finally, we discuss an area of growing prominence in the approximate Bayesian computation world, the use of approximate Bayesian computation methods in genetic pathway inference.
We expect the rise of approximate Bayesian computation methods to continue, and we hope this will include the continued development of theory and machinery to guide the user in making some of the key choices discussed above.
At a time in which our ability to collect data is growing at great rates, new challenges also arise when attempting analysis of these data. Given data, D, a model, M, that attempts to explain the data, and a set of model parameters, θ, our analysis task often depends upon calculation of the likelihood, f(D | θ), either as a direct component of a frequentist analysis, or as a step towards calculating the posterior distribution f(θ | D) in the Bayesian paradigm (and our perspective in this article will be Bayesian). Using Bayes' theorem, the posterior distribution is calculated as f(θ | D) ∝ f(D | θ)π(θ), where π(·), the prior distribution, captures our beliefs about θ before the data are collected. However, as the complexity or volume of data increases, calculation of the likelihood (and, therefore, also the posterior) often becomes impossible, either because it is computationally intractable or because closed-form expressions are not derivable. This conflict has led to the rise of an alternative approach called approximate Bayesian computation (ABC).
ABC methods borrow intuition from likelihood estimation, introduced by Diggle and Gratton. There, large-scale Monte Carlo simulation is used to directly approximate the likelihood of D given θ (and all expressions here are implicitly also dependent upon the model M) as the proportion of times in which simulation of data, D′, using parameter θ, results in D′ = D. However, as data complexity grows, the probability of observing D′ = D typically becomes vanishingly small, even when the correct value of θ is used. This has led to the appearance of ABC versions of rejection methods. This review discusses the approximate Bayesian computational approach.
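The likelihood-estimation idea described above can be sketched in a few lines. The example below is only an illustration under assumed conditions: the binomial model, the function names, and the number of simulations are hypothetical choices made for this sketch, not part of the method of Diggle and Gratton.

```python
import numpy as np


def estimate_likelihood(observed, theta, simulate, n_sims=20_000, seed=0):
    """Estimate f(D | theta) as the proportion of simulated D' with D' == D."""
    rng = np.random.default_rng(seed)
    hits = sum(simulate(theta, rng) == observed for _ in range(n_sims))
    return hits / n_sims


def simulate_binomial(theta, rng, n_trials=10):
    """Hypothetical toy model: number of successes in n_trials Bernoulli(theta) trials."""
    return int(rng.binomial(n_trials, theta))
```

For low-dimensional discrete data such as this, exact matches are common enough for the estimate to be usable. For high-dimensional data the proportion of exact matches collapses towards zero, which is the motivation for the summary statistics and tolerances introduced next.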
The author has referenced some of his own studies in this review. The protocols of these studies have been approved by the relevant ethics committees related to the institution in which they were performed.
The simplest form of ABC, that based on rejection methods, supposes the existence of a set of summary statistics, S, that capture key features of D, and adopts the intuition of likelihood approximation within the following algorithm:
1. Set i = 0.
2. Sample θ′ from the prior π().
3. Simulate data D′ using model M with parameter θ′. Calculate a set of summary statistics S′ from D′.
4. If S′ = S accept θ′.
5. Set i = i + 1. If i < N, go to 2; else stop.
Here, N is a predetermined large number. The resulting set of accepted θ-values forms a sample from the distribution f(θ | S). In the best-case scenario in which the set of statistics S is sufficient for θ, we have (by definition of sufficiency) f(θ | S) = f(θ | D). However, in most contexts exact matching even of summary statistics is relatively unlikely, in which case we introduce a distance measure, d(·, ·), and a tolerance threshold, ε, and we replace step 4 above with
4′. If d(S′, S) ≤ ε accept θ′.
Now we obtain independent samples from a distribution that we will call φ(θ | S). One of the important caveats of an ABC analysis is that, in general, it is not possible to state the degree of agreement between the distribution one wanted to calculate, f(θ | D), and the distribution from which one obtains a sample, φ(θ | S). This is currently often assessed via simulation study, but is an area of active research in the ABC community.
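As a concrete illustration of the rejection algorithm with a tolerance, the following is a minimal sketch. The toy problem (inferring the mean of a normal distribution from 50 observations, using the sample mean as the summary statistic and a uniform prior on [-5, 5]) and all function names are assumptions made for this example.

```python
import numpy as np


def abc_rejection(s_obs, sample_prior, simulate, summarise, distance, eps, n_iter, rng):
    """Return the theta-values whose simulated summaries fall within eps of s_obs."""
    accepted = []
    for _ in range(n_iter):
        theta = sample_prior(rng)                # sample theta' from the prior
        s_sim = summarise(simulate(theta, rng))  # simulate D' and compute S'
        if distance(s_sim, s_obs) <= eps:        # accept if d(S', S) <= eps
            accepted.append(theta)
    return np.array(accepted)


def toy_normal_example(seed=0):
    """Hypothetical toy run: infer the mean of a Normal(theta, 1) sample."""
    rng = np.random.default_rng(seed)
    observed = rng.normal(1.0, 1.0, size=50)     # stand-in for the observed data D
    s_obs = observed.mean()
    accepted = abc_rejection(
        s_obs=s_obs,
        sample_prior=lambda r: r.uniform(-5.0, 5.0),
        simulate=lambda theta, r: r.normal(theta, 1.0, size=50),
        summarise=np.mean,
        distance=lambda a, b: abs(a - b),
        eps=0.1,
        n_iter=20_000,
        rng=rng,
    )
    return s_obs, accepted
```

In this sketch only a small fraction of the prior draws is accepted, which illustrates the inefficiency that arises when the prior is much broader than the posterior.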
Rejection methods work well provided there is good overlap between the prior and posterior parameter distributions. However, when this is not the case efficiency is low, since much time is wasted sampling potential θ-values from parts of the prior distribution that are poorly supported by the posterior. Problems also arise when the dimension of the parameter space is large. For this reason, a number of more efficient methods were developed before ABC existed. Many of these ideas have now been adapted into the ABC context. An early example is the adoption of Metropolis-Hastings Markov chain Monte Carlo (MCMC)[2,3] into what has become known as ABC-MCMC (or the ‘no-likelihoods’ MCMC method).
The ABC-MCMC algorithm starts from an arbitrarily chosen θ-value and proceeds as follows:
1. If now at θ, propose a move to θ′ according to a proposal distribution q(θ → θ′).
2. Simulate a dataset, D′, using θ′.
3. If D′ ~ D proceed to 4; else, output θ and return to 1.
4. Calculate the Hastings Ratio (HR): h = min(1, π(θ′)q(θ′ → θ) / [π(θ)q(θ → θ′)]).
5. Accept, and output, the new θ′ with probability h. Else, remain at, and output, θ. Go to 1.
Here, q() is a user-defined transition kernel that controls how we propose new θ-values. Once the chain of θ-values has reached stationarity, outputs from the chain have the required distribution. ABC-MCMC differs from traditional MCMC in that the calculation of the ratio of likelihoods for new and old parameter values has been replaced by a step in which we simulate a single dataset, D′, using θ′, and then proceed to calculate the rest of the HR only if D′ ~ D. Thus, the intractable likelihood has again been replaced by a simulation step, thereby recovering tractability. However, ABC-MCMC has been shown to mix relatively poorly, compared to traditional MCMC, in the tails of the posterior. The reason for this is simple. In traditional MCMC, if we propose a θ′-value in the tail of f(θ | D), then f(D | θ′) is likely to be very small. However, provided the transition kernel q() proposes small changes to θ when generating θ′, at least some of the time, f(D | θ) will also be small, and the ratio f(D | θ′)/f(D | θ) will typically be of order 1. Thus, the HR will, all other things being equal, not take too small a value. This encourages good mixing. In ABC-MCMC, we have replaced the ratio of likelihoods term with the generation of a dataset for θ′ only. Thus, in the tails of the posterior for θ, the probability of generating a D′ ~ D may be vanishingly small, and is not countered by similar behaviour of P(D′ ~ D | θ) at the current value θ.
There are several possible responses to this, if mixing becomes problematic. First, one can use a proposal kernel that sometimes proposes large changes to θ, thereby retaining the possibility of proposing θ′-values out of the tail of the posterior, whatever the currently accepted value of θ. Second, Andrieu and Roberts have shown that we can run a generalised version of the ABC-MCMC algorithm in which we simulate data to approximate the likelihood f(D | θ) for both new and old parameter values.
However, it is important to note that when one estimates f(D | θ) in the denominator of the traditional MCMC HR this way, one must recycle the estimate that was used when accepting θ, rather than re-estimate it. Otherwise biases are introduced. We note in passing that the ABC-MCMC algorithm above can be viewed as a version of this latter approach in which we use a single dataset, and an indicator function that takes the value 1 if D′ ~ D, as a crude estimate of f(D′ ~ D | θ′) and f(D′ ~ D | θ).
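The basic ABC-MCMC algorithm described above can be sketched as follows. This is an illustrative sketch only: it assumes a scalar parameter, a symmetric normal random-walk proposal (so that the q terms cancel in the HR), and a scalar summary statistic, and the helper functions for the toy normal-mean problem are hypothetical.

```python
import numpy as np


def abc_mcmc(s_obs, theta0, log_prior, simulate_summary, eps, n_steps, step_sd, rng):
    """ABC-MCMC with a symmetric random-walk proposal; returns the chain of theta-values."""
    theta = theta0
    chain = np.empty(n_steps)
    for k in range(n_steps):
        proposal = theta + rng.normal(0.0, step_sd)      # propose theta' from q
        # Log prior ratio; q cancels because the proposal is symmetric (assumption).
        log_ratio = log_prior(proposal) - log_prior(theta)
        if np.isfinite(log_ratio):                       # skip proposals with zero prior mass
            s_sim = simulate_summary(proposal, rng)      # simulate one dataset at theta'
            if abs(s_sim - s_obs) <= eps:                # continue only if D' ~ D
                if np.log(rng.uniform()) < min(0.0, log_ratio):
                    theta = proposal                     # accept with probability h
        chain[k] = theta                                 # output the current value
    return chain


def log_uniform_prior(theta, low=-5.0, high=5.0):
    """Hypothetical prior: Uniform(low, high) on the log scale."""
    return 0.0 if low <= theta <= high else -np.inf


def simulate_mean(theta, rng, n=50):
    """Hypothetical simulator: sample mean of n draws from Normal(theta, 1)."""
    return rng.normal(theta, 1.0, size=n).mean()
```

Proposals with zero prior density are rejected before simulating, which gives the same chain as simulating first, since the HR would be zero in that case. Note that a rejected proposal still outputs the current θ, so the chain length always equals the number of steps.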
An alternative response to these mixing issues results in another popular ABC algorithm, Sequential Monte Carlo ABC (ABC-SMC).
ABC-SMC uses a population of θ-values, rather than a single θ-value, at any given time[6,7]. While some of these may be in the tails of the posterior, others will likely not, thus improving mixing properties. The algorithm is a form of ‘importance sampling’. It iterates through T generations, proceeding as follows (we base our description on that of Secrier et al.):
1. Define a decreasing sequence of tolerances ε1 > ε2 > … > εT. Tolerance εt is used in generation t. Define the initial ‘posterior’ parameter distribution, f1, to be equal to the prior distribution π. Set the population count to t = 1, and define a target number of acceptances per population, N.
1.A. Set the particle indicator to i = 0.
1.B. Sample a parameter-value, θ, from ft. If t > 1 perturb the sampled parameter value (e.g., by adding a normal random variable).
1.C. Simulate data Dt,i using θ. If the distance between Dt,i and the observed data is greater than εt return to step 1.B; otherwise, set i = i + 1, and calculate a ‘weight’ for the accepted parameter value θ. This weight is an ‘importance sampling’ weight that corrects for the fact that θ was sampled from ft rather than π.
1.D. If i < N go to 1.B; otherwise construct a new ‘posterior’ distribution, ft+1, from the accepted parameter values and their weights.
2. If t < T, set t = t + 1 and go to 1.A.
We have omitted many of the technical details, but the intuition is that the algorithm performs a rejection method in which, rather than sampling from the prior, we sample from an importance sampling distribution formed from the posterior distribution calculated in the previous ‘generation’, adding noise to sampled parameter values to allow the generation of new values. As such, it is a form of importance sampling in which the importance sampling distribution changes over time. The algorithm has now been used in a number of applications[7,9,10,11], and is implemented in the ABC-SysBio package.
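To make the flow of the ABC-SMC algorithm concrete, a minimal sketch for a scalar parameter follows. The details omitted from the description above are filled in with common choices that are assumptions of this sketch: a normal perturbation kernel whose variance is twice the weighted variance of the previous population, and importance weights equal to the prior density divided by the kernel-smoothed density of the previous population. The helper functions and the toy normal-mean problem are hypothetical.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def abc_smc(s_obs, tolerances, n_particles, sample_prior, prior_pdf, simulate_summary, rng):
    """Run one population per tolerance; return final particles and normalised weights."""
    particles = weights = None
    for t, eps in enumerate(tolerances):
        if t > 0:
            # Assumed heuristic: kernel variance = 2 x weighted variance of last population.
            centre = np.sum(weights * particles)
            kernel_sd = np.sqrt(2.0 * np.sum(weights * (particles - centre) ** 2))
        new_particles = np.empty(n_particles)
        new_weights = np.empty(n_particles)
        i = 0
        while i < n_particles:
            if t == 0:
                theta = sample_prior(rng)                # first population: sample the prior
            else:
                # Resample from the previous population, then perturb.
                theta = rng.choice(particles, p=weights) + rng.normal(0.0, kernel_sd)
                if prior_pdf(theta) == 0.0:
                    continue                             # outside prior support: weight is zero
            if abs(simulate_summary(theta, rng) - s_obs) > eps:
                continue                                 # too far from the data: resample
            if t == 0:
                new_weights[i] = 1.0
            else:
                # Importance weight: prior density over the proposal density.
                proposal_density = np.sum(weights * normal_pdf(theta, particles, kernel_sd))
                new_weights[i] = prior_pdf(theta) / proposal_density
            new_particles[i] = theta
            i += 1
        particles, weights = new_particles, new_weights / new_weights.sum()
    return particles, weights


def sample_uniform_prior(rng, low=-5.0, high=5.0):
    return rng.uniform(low, high)


def uniform_prior_pdf(theta, low=-5.0, high=5.0):
    return 1.0 / (high - low) if low <= theta <= high else 0.0


def simulate_mean(theta, rng, n=50):
    """Hypothetical simulator: sample mean of n draws from Normal(theta, 1)."""
    return rng.normal(theta, 1.0, size=n).mean()
```

A call such as `abc_smc(1.0, [2.0, 1.0, 0.5, 0.2], 300, sample_uniform_prior, uniform_prior_pdf, simulate_mean, np.random.default_rng(0))` runs four generations with progressively tighter tolerances. Because each generation samples near previously accepted values, far fewer simulations are wasted than when sampling from the prior at the final tolerance.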
A number of decisions need to be made when performing an ABC analysis. Principal among them, perhaps, is the need to measure the match between observed and simulated data. This is often achieved through the adoption of a set of summary statistics that are designed to capture key features of the data. In the early days of ABC, these were often chosen using ‘investigator intuition’. More recently a number of studies have appeared in which more principled methods are proposed. Joyce and Marjoram developed a sequential scheme for scoring statistics according to whether their use in the analysis substantially improved the quality of inference, as measured by changes to the posterior distribution (the addition of uninformative statistics should not be expected to substantially change the posterior distribution that results). Nunes et al. proposed a similar scheme designed to minimise the average squared error of the posterior distribution. Fearnhead and Prangle showed how to construct statistics in a semi-automatic manner. Jung and Marjoram developed a method to choose both a subset of statistics and weights that should be applied to each statistic in the subsequent calculation of similarity with observed data.
In other related work, Beaumont et al. discarded the concept of ‘rejection’ and instead included all simulated iterations in the estimation of the posterior for θ, but now weighting each iteration by the distance between observed and simulated statistic values after fitting a local linear regression of θ on S. Blum et al. generalised this to use non-linear regression, using an importance sampling scheme to refine the fit. Wegmann et al. aimed to reduce the dimensionality of the analysis, and thereby increase efficiency, by reducing the number of data-points considered in the analysis, thereby raising the acceptance rate. One might hope to do this simply by calculating principal components of the values the data take over a large number of simulated datasets. However, principal components often perform rather poorly in ABC analyses, since they are designed to return orthogonal directions for which the variation in the data is greatest, whereas ABC performs best when projections of the data concisely capture variation in the parameters. The method of Wegmann et al. uses a partial least squares approach to choose orthogonal axes that have maximum correlation with the parameters of interest. These axes are analogous to the results of a principal component analysis, but the partial least squares approach ensures that the axes have good utility in predicting parameter values. In the study of Wegmann et al., the method was applied to an analysis of time of divergence of two populations in an ABC-MCMC context.
In an alternative approach, Hamilton et al. took an existing set of statistics and chose weights for them using a scheme in which large numbers of datasets were simulated, with only those that were similar to the observed data being retained. Using those data, each statistic Si is regressed on each parameter in θ in turn, recording the model-fit R² in each case. A set of weights is then calculated to measure the degree of informativeness of statistic i on parameter j. (In fact, rather than weighting the statistics directly, Hamilton et al. define a weighted Euclidean distance metric to measure the difference between observed and simulated statistic values, but the effect is the same.) The scheme was applied to an analysis of evolutionary parameters in models of human demography.
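The local linear regression idea attributed to Beaumont et al. above can be sketched as follows. The specific choices here are assumptions made for illustration: simulations within a bandwidth δ of the observed statistics are retained, an Epanechnikov-type kernel weight is used, and each retained θ is adjusted by the fitted linear effect of the discrepancy S′ − S. The function name and its arguments are hypothetical.

```python
import numpy as np


def regression_adjust(thetas, s_sims, s_obs, delta):
    """Weighted local-linear adjustment of simulated thetas towards s_obs.

    thetas: shape (n,); s_sims: shape (n, k); s_obs: shape (k,).
    Returns the adjusted theta-values and their kernel weights.
    """
    thetas = np.asarray(thetas, dtype=float)
    diffs = np.asarray(s_sims, dtype=float) - np.asarray(s_obs, dtype=float)
    dist = np.linalg.norm(diffs, axis=1)
    keep = dist < delta                              # kernel weight is zero beyond delta
    w = 1.0 - (dist[keep] / delta) ** 2              # Epanechnikov-type weights (assumption)
    # Weighted least squares of theta on (S' - S), with an intercept.
    design = np.column_stack([np.ones(keep.sum()), diffs[keep]])
    sqrt_w = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], thetas[keep] * sqrt_w, rcond=None)
    # Shift each theta to the value it would be expected to take at S' = S.
    adjusted = thetas[keep] - diffs[keep] @ coef[1:]
    return adjusted, w
```

The weighted mean of the adjusted values, for example `np.average(adjusted, weights=w)`, then serves as a posterior mean estimate. The adjustment allows a larger tolerance to be used than plain rejection would tolerate, provided the relationship between θ and S is roughly linear near the observed statistics.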
A number of these methods were compared by Barnes et al., in which a further new, improved method was proposed (see below).
One of the most active areas of research in ABC is its application to model selection. Here, we suppose we are trying to decide between two models, M1 and M2 (the following generalises in an obvious way if there are more than two models). In a Bayesian paradigm, evidence for M1 compared to M2 is weighed in terms of the Bayes Factor, BF = f(D | M1)/f(D | M2), the ratio of the posterior odds to the prior odds in favour of M1. In an ABC context, it has been common to use an approximation to the BF, BF_ABC = f(S | M1)/f(S | M2). Research in this area was perhaps provoked by a study of Templeton that attacked an ABC analysis of Fagundes et al. in which several possible models for early human evolution were compared. It was later shown that Templeton was in fact attacking the Bayesian method itself, rather than the ABC approach to Bayesian analysis, but a series of studies have subsequently emerged in which complications involved with ABC in a model selection context have been discussed. Fundamentally, the issue is that the ABC approximation to the Bayes Factor, BF_ABC, is related to the actual Bayes Factor, BF, in the following way: BF = BF_ABC × f(D | S,M1)/f(D | S,M2). However, it is not necessarily the case that f(D | S,M1)/f(D | S,M2) = 1. Most interestingly, as pointed out by Robert et al., we do not necessarily have f(D | S,M1)/f(D | S,M2) = 1, even when the statistics S are sufficient for parameter estimation in M1 and M2 individually. Robert et al. give a nice example in which count data might arise from Poisson or Geometric distributions. They show that the ratio f(D | S,M1)/f(D | S,M2) is not equal to 1 even when S is formed from the union of statistics that are sufficient for inference in the two models separately.
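The ABC approximation to the Bayes Factor can be estimated by a rejection scheme run under each model in turn. The sketch below uses the Poisson versus Geometric setting mentioned above, with the sum of the counts as the summary statistic. The priors (Exponential(1) on the Poisson rate, Uniform(0, 1] on the Geometric success probability), the equal prior model weights, and the function name are assumptions of this example rather than the choices of Robert et al.

```python
import numpy as np


def abc_bayes_factor(s_obs, n_obs, n_sims, rng):
    """Estimate f(S | M1)/f(S | M2) with M1 = Poisson and M2 = Geometric on {0, 1, ...}.

    Each model's marginal probability of the observed sum is estimated by the
    proportion of prior-predictive simulations whose sum equals s_obs exactly.
    """
    # Model 1: rate ~ Exponential(1), counts ~ Poisson(rate).
    rate = rng.exponential(1.0, size=n_sims)
    s_pois = rng.poisson(rate[:, None], size=(n_sims, n_obs)).sum(axis=1)
    # Model 2: p ~ Uniform(0, 1]; NumPy's geometric counts trials, so subtract 1
    # to obtain the number of failures before the first success.
    p = 1.0 - rng.uniform(size=n_sims)
    s_geom = (rng.geometric(p[:, None], size=(n_sims, n_obs)) - 1).sum(axis=1)
    accept_1 = np.mean(s_pois == s_obs)
    accept_2 = np.mean(s_geom == s_obs)
    if accept_2 == 0.0:
        raise ValueError("no acceptances under model 2; increase n_sims")
    return accept_1 / accept_2
```

This estimates only the ratio f(S | M1)/f(S | M2). As discussed above, that ratio need not equal the Bayes Factor computed from the full data, even though the sum is sufficient for the parameter within each model separately.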
There have been two responses to this issue. First, it has been noted that provided one works with the data, rather than summary statistics thereof, the problem is avoided. In this context, with a slight abuse of notation, the BF is approximated as BF_ε = P(d(D′, D) ≤ ε | M1)/P(d(D′, D) ≤ ε | M2), where D′ denotes data simulated under the model in question, and we note that, as ε → 0, we have BF_ε → BF. Of course, as we have noted, the choice of ε represents a compromise between accuracy and tractability, so achieving a result sufficiently close to the limiting behaviour may be practically difficult. The second approach introduces the concept of statistics that are sufficient for model selection (SFM), in the sense that if S is SFM, then f(D | S,M1)/f(D | S,M2) = 1. This was introduced by Barnes et al., in a study in which they present an algorithm that attempts to choose a set of statistics that appear to be SFM, and which generalises and improves the methods of Joyce and Marjoram for choosing approximately sufficient statistics in a non-model-selection context.
An area of growing application of ABC methods is that of inference of genetic networks. Here, the goal is to infer parameters of a known network relating expression of a set of genes, possibly related to some phenotypes of interest. Alternatively, we might wish to construct the network from scratch, aiming to infer which genes are involved and how they interact with each other. ABC methods are of interest here because as the complexity of networks grows, computational intractability becomes an issue (again either because exact solutions are impossible, or because networks contain genuinely stochastic components, or because numeric algorithms become too slow to perform well) (see Marjoram et al. for an overview of this perspective).
The leading exponents of ABC in this field are the group of Stumpf et al., who have written a number of papers on the subject (e.g.,[7,28]) and have also produced a software package (ABC-SysBio) that makes implementations of ABC methods relatively straightforward in this context, and which integrates with the widely used SysBio systems biology software package.
The ABC-SysBio method is for analysis of known networks. A recent study by Rau et al. addressed the issue of how to build networks from the ground up in an ABC context. Their method uses time-course data to test for linear relationships between pairs of genes, arguing that many networks can be well approximated using linear components. The complexity of the search space is kept manageable by supposing limits on the number of genes that can directly affect the behaviour of another gene.
It remains to be seen how popular applications of ABC analyses will become in the context of gene networks, but the growing view that such networks might be used to leverage the power of genome-wide association studies suggests that there is a strong need for methods that remain tractable for relatively complex networks.
In the modern era, we are collecting data that are bigger, generally by orders of magnitude, than data that were collected previously. This means that more detailed inference can be made, often using models that are more complex than before. A consequence of this is that standard statistical analysis methods often become intractable. There are two common responses to the intractability of the likelihood: (1) simplify the model so that the likelihood function can, once again, be calculated; or (2) add an approximation step to the analytic method itself. At this point, we recall a quote attributed to George Box: ‘All models are wrong, but some are useful’. While approach (1) above is possible, it may lead to a model so divorced from reality that conclusions drawn from it cannot be considered particularly informative. The American statistician John Tukey said ‘Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise’. ABC methods embrace this spirit, allowing tractable analysis of large, modern datasets. Consequently, there is an increasing tendency for investigators to turn to ABC methods in answer to the challenges of analysis of modern datasets. As such, the rise of ABC has been rapid: from essentially no studies prior to 2000, to over a hundred per year most recently.
In this review, we surveyed ABC methods and illustrated some of the key decisions that need to be made in an ABC analysis. We also pointed to areas of active research in the ABC community. We expect the rise of ABC methods to continue, and we hope this will include the continued development of theory and machinery to guide the user in making some of the key choices discussed above.
This study was funded by NSF award DMS 1101060 and NIH awards R01MH100879 and 1U01GM103804. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF or NIH.
All authors contributed to the conception, design, and preparation of the manuscript, as well as read and approved the final manuscript.
All authors abide by the Association for Medical Ethics (AME) ethical rules of disclosure.