[logo]
The BUGS Project
The BUGS Project
version 0.5 manual


next previous contents
Next: Acknowledgements Previous: 8 Errors

9 Topics in modelling

We shall examine some general issues in Bayesian analysis and associated implementation in BUGS.

9.1 General Strategy

We are unwilling, and unable, to recommend any fixed strategy for analysing data. However, some general points are as follows.

For a collection of case-studies and practical advice on using MCMC methods, see Gilks et al. (1995).

9.2 `Non-informative priors'

9.2.1 Issues with improper priors

The appropriate specification of `non-informative' priors is an old problem in Bayesian statistics, and is particularly important when techniques are to be used in a scientific context in which an `objective' inference is required. BUGS requires that a full probability model is defined, and hence forces all priors to be proper, and the problem arises of specifying appropriate priors on `founder' nodes (those with no parents) in the graph.

It is, however, vital to distinguish parameters of primary interest from those which specify secondary structure for the model. The former will generally be location parameters, such as regression coefficients, and in many cases a normal prior with an extremely small precision (large variance) is adequate. Our examples often feature a choice tex2html_wrap_inline3214 , corresponding to a standard deviation of 100, so that a 95% prior interval is around tex2html_wrap_inline3216 , and the prior will be locally uniform over the region supported by the likelihood. Alternatively a uniform prior on a suitably wide interval could be given , such as tex2html_wrap_inline3218

Secondary aspects of a model include, say, the precision of random effects in a hierarchical model. More care is required in setting such parameters and we recommend thinking carefully about reasonable values for such parameters in advance and so specifying fairly informative priors wherever possible - the inclusion of such external information is unlikely to bias the scientific conclusions of a study, although some sensitivity analysis should be performed to reassure consumers.

A particular, but insufficiently recognised, problem occurs with the precision parameter in a random effects model [DuMouchel, 1990, DuMouchel and Waternaux, 1992]. Let us call this parameter tex2html_wrap_inline3220 , with inverse tex2html_wrap_inline3222 , the variance of the random effects. The standard `non-informative' prior would generally be considered to be tex2html_wrap_inline3224 , equivalent to tex2html_wrap_inline3226 , which arises from assuming the log of the scale parameter has a uniform distribution on tex2html_wrap_inline3228 .

However, the boundary value tex2html_wrap_inline3230 is supported by non-negligible likelihood since it is theoretically possible that there are no subject-specific random effects and, remarkably, the improper prior     tex2html_wrap_inline3232 goes to infinity at zero at a rate sufficient to lead to a posterior distribution that is also improper.   Assuming a prior that is `only just' proper, such tex2html_wrap_inline3234 with mean 1 and variance tex2html_wrap_inline3236 , will formally avoid strict impropriety: such priors are investigated below.

9.2.2 Pareto priors for precision parameters

An alternative approach is to choose an improper prior such as tex2html_wrap_inline3241 being locally uniform on tex2html_wrap_inline3243 [Gelman and Rubin, 1992, Carlin, 1992], which does not lead to an improper posterior. Keeping within a general class of proper priors, we assume we would like a prior on tex2html_wrap_inline3245 that is uniform on a range tex2html_wrap_inline3247 . This implies that tex2html_wrap_inline3249 has a Pareto distribution with parameters tex2html_wrap_inline3251 , where the Pareto distribution (Section 4.4) has the form

displaymath3253

Thus a uniform prior for tex2html_wrap_inline3255 or tex2html_wrap_inline3257 on tex2html_wrap_inline3259 would respectively give rise to a Pareto tex2html_wrap_inline3261 or a Pareto tex2html_wrap_inline3263 distribution on tex2html_wrap_inline3265 .

Assuming tex2html_wrap_inline3267 is the precision parameter for a normal distribution   gives a likelihood contribution for tex2html_wrap_inline3269 from n observations proportional to tex2html_wrap_inline3273 , where S is a sum of squares statistic. Thus a Pareto prior for tex2html_wrap_inline3277 is conjugate with the likelihood within the class of left-truncated gamma distributions, and leads to a posterior distribution proportional to tex2html_wrap_inline3279 . We note that the Pareto distribution is not log-concave in tex2html_wrap_inline3281 , and hence BUGS recognises the conjugacy and constructs the full conditional distribution: this will be log-concave provided tex2html_wrap_inline3283 : hence for a locally uniform prior on tex2html_wrap_inline3285 we require at least n=3 contributions to the precision estimation, and at least n= 4 to assume a locally uniform distribution for tex2html_wrap_inline3291 .

These and other priors are explored in Figure 3.



Figure 3: Prior beliefs arising from an assumption that a precision parameter tex2html_wrap_inline3293 has various forms, shown on scales in tex2html_wrap_inline3295 and tex2html_wrap_inline3297 .

9.2.3 Assessment of proper priors for precision parameters

Instead of using an `off-the-shelf' prior, it is perhaps more appropriate to give some thought as to a reasonable proper prior for the random-effect precision parameter using knowledge of the particular context. Below we give some examples of the sort of thinking that could give rise to proper prior specification.   We shall consider normal, logistic and Poisson regression, in which it is assumed that the random effect for subject i enters the linear predictor additively as a variable tex2html_wrap_inline3303 .

Normal
Suppose we have a regression model such that the jth observation on the ith subject has distribution

eqnarray1734

Plausible values for tex2html_wrap_inline3309 will depend primarily on the measurement precision tex2html_wrap_inline3311 - it is unlikely that the between-subject precision tex2html_wrap_inline3313 is substantially smaller than the within-subject precision tex2html_wrap_inline3315 , and plausible that it is considerably greater.

Logistic
Assume a regression model

eqnarray1740

This model means that 95% of subjects with identical covariates will have log(odds) of the event in question occuring in a range with width tex2html_wrap_inline3317 .

Smith et al. (1995) derived the following prior within the context of meta-analysis. Suppose that we thought it plausible that there was roughly one order of magnitude difference between the odds on failure for subjects with identical covariates, and we interpreted this as having a prior belief that 95% of subjects had log(odds) in a range of width tex2html_wrap_inline3319 . This implies that tex2html_wrap_inline3321 is a reasonable guess for tex2html_wrap_inline3323 . Suppose in addition that we would be rather surprised to find two orders of magnitude difference between the odds on failure between subjects, and we would interpret such a finding as requiring 95% of subjects to have log(odds) in a range of width tex2html_wrap_inline3325 . This gives a `low' value for tex2html_wrap_inline3327 of tex2html_wrap_inline3329 . A gamma distribution with parameters (3,1) has a mean of 3 and 96% probability of exceeding .73, and hence tex2html_wrap_inline3331 might be an appropriate proper prior in this context. Figure 3 shows that the implied distribution for tex2html_wrap_inline3333 seemsquite reasonable, although ruling out either very low values (no random effect) and very high values (independent observations).

This analysis can be adjusted to reflect alternative prior beliefs: for example a guess that 95% of log(odds) were within a two-fold range ( tex2html_wrap_inline3335 ), with a ten-fold range being unlikely ( tex2html_wrap_inline3337 ), would lead to a prior tex2html_wrap_inline3339 , with mean 32 and standard deviation 27.

Poisson
Assume a regression model

eqnarray1752

This model means that 95% of subjects with identical covariates will have log(mean) in a range with width tex2html_wrap_inline3341 .

Following the analysis for the logistic model, we might think it plausible in a specific application that subjects with the same covariates had expectations that varied within one order of magnitude, but unlikely to vary over two orders. This leads to the same assumption of a Gamma(3,1) prior for tex2html_wrap_inline3343 .

Figure 3 shows a variety of prior suggestions plotted on scales for tex2html_wrap_inline3345 (.25 to 100) and tex2html_wrap_inline3347 (.1 to 2). The selected prior distributions show a variety of behaviours on the tex2html_wrap_inline3349 scale. The `just' proper Gamma(.001,.001) prior favours small values of tex2html_wrap_inline3351 in a reasonable manner, and we have found this to generally lead to sensible conclusions. Although the crucial aspect is whether the shape of the prior is reasonable in the area supported by the likelihood, the major reason for introducing substantial prior opinion is when the likelihood is fairly flat. In this situation some external judgement of plausible values of the precision parameter is unavoidable.  

9.3 Model criticism and selection

Although the emphasis in this manual and examples is firmly on drawing inferences assuming a model is an appropriate assumption, serious statistical science requires us to check these assumptions and justify our particular choice of model. We briefly outline some recent suggestions for model checking and comparison using MCMC methods - see the cited references for more details. We shall indicate which aspects can be easily included in a BUGS analysis.

We shall distinguish three objectives: model checking through examination of individual observations, comparison between two or more competitor models, and global checks of goodness-of-fit. In all cases we exploit the idea of comparing observed statistics with our expectations, were we making predictions conditional on the truth of a particular model assumption.

9.3.1 Checking through examination of individual observations.

Consider data tex2html_wrap_inline3357 and parameters tex2html_wrap_inline3359 under the assumed model. Gelfand et al. (1992) suggest a series of `checking functions' which can be calculated for each observation, involving a comparison of the observed tex2html_wrap_inline3361 with a predictive distribution tex2html_wrap_inline3363 . Leaving the basis for this predictive distribution aside for the moment, their suggestions include
  1. the residual: tex2html_wrap_inline3365  
  2. the standardised residual: tex2html_wrap_inline3367
  3. the chance of getting a more extreme observation: tex2html_wrap_inline3369
  4. the chance of getting a more `surprising' observation: tex2html_wrap_inline3371
  5. the predictive ordinate of the observation: tex2html_wrap_inline3373
The first and last require some standard against which to compare, whereas the middle three options speak for themselves.

Two situations can be identified: when the data tex2html_wrap_inline3375 form a separate evaluation set, and when they were used for deriving the posterior distribution of the parameters.

  1. Separate evaluation data available. In this case the posterior distribution is based on a `training set' tex2html_wrap_inline3377 . The predictive distribution is given by

    displaymath3379

    In many cases, unless say there is an explicit dependence on previous observations, the y's are conditionally independent of the x's given the total set of unknowns tex2html_wrap_inline3385 . Thus the required integral is simply

    displaymath3387

    This just requires adding additional random nodes tex2html_wrap_inline3389 in the graph with the appropriate parents, and monitoring the samples generated for tex2html_wrap_inline3391 .

    The observed values tex2html_wrap_inline3393 can then be compared with their predictive distributions which may be summarised using the stats command. As an application of the third checking function, one could then simply see how often the true observations lie within the nominal predictive intervals. Alternatively, one could calculate the first four checking functions above by explicitly including new nodes representing, for example, tex2html_wrap_inline3395 and tex2html_wrap_inline3397 and so obtain estimates of tex2html_wrap_inline3399 and tex2html_wrap_inline3401 directly by monitoring these nodes and noting the empirical mean.

  2. No separate evaluation data available. In this case the predictive distribution of tex2html_wrap_inline3403 should ideally be conditional on the model and remainder of the data in order to perform ``cross-validation". Hence for each observation tex2html_wrap_inline3405 we would require a distribution tex2html_wrap_inline3407 , where is the rest of the data excluding tex2html_wrap_inline3411 .

    Unfortunately this is generally difficult to do within BUGS. An exception is the final checking function, since we can explicitly calculate the predictive ordinate within BUGS. Gelfand and Dey (1994) point out the interesting fact that

    displaymath3413

    and hence a Monte Carlo estimate of tex2html_wrap_inline3415 is obtained as a harmonic mean of tex2html_wrap_inline3417 . This may be accomplished in BUGS by constructing a node which takes on values tex2html_wrap_inline3419 , and then taking the inverse of its empirical mean. However, harmonic means are notoriously unstable so care is required regarding convergence.

    An approximation to the cross-validatory   method is to use the methods for a separate evaluation set, but replacing tex2html_wrap_inline3421 by tex2html_wrap_inline3423 . Hence our predictive distribution is

    displaymath3425

    which will usually be expressible as

    displaymath3427

    and hence just requires adding additional random nodes tex2html_wrap_inline3429 in the graph with the same parents as tex2html_wrap_inline3431 .

    If we do wish to sample from the correct cross-validatory predictive distribution, this can be carried out using an additional importance sampling step to remove the effect of tex2html_wrap_inline3433 when repredicting tex2html_wrap_inline3435 [Gelfand et al., 1992], although this would have to be carried out external to BUGS.

9.3.2 Comparison between two or more candidate models

Bayes factor approaches.

The Bayesian theory of model comparison is based on the Bayes factor,   which for two competing models tex2html_wrap_inline3439 and tex2html_wrap_inline3441 is defined as the ratio of the marginal ordinates of the observed data

displaymath3443

where tex2html_wrap_inline3445 are the unobserved quantities in model i. There is a great deal of recent literature on the difficult issues surrounding the calculation and interpretation of Bayes factors: see for example Kass and Raftery (1995), Gelfand and Dey (1994). Carlin and Chib (1995) describe one means of calculating Bayes factors within a Monte Carlo run, and this is illustrated in our pines example.

Cross-validatory measures.

We note the analysis first described by Smith (1991) in which it is argued that a Bayes factor approach is only strictly appropriate if we truly believe that one, and only one, of the candidate models is true. If, as would generally be the case, the assumed models are simply considered as useful proxies for some theoretically and practically indeterminate `true' model, then model comparison based on cross-validatory   measures may be more suitable.

Comparison between models can therefore be based on an accumulation over observations of the checking functions described above. Examples might include the sum of squared or absolute unstandardised residuals tex2html_wrap_inline3449 , or the negative cross-validatory log-likelihood

displaymath3451

The latter results in a comparison between models based on the `pseudo-Bayes factor' [Geisser and Eddy, 1979, Gelfand and Dey, 1994].

We note that such accumulation needs to be carried out after a BUGS run using the estimates obtained from the monitors.

`Deviance' measures

Dempster (1974) suggested examining the posterior distribution of the log-likelihood of the observed data: this is straightforward to calculate with a BUGS run by forming a variable with the value tex2html_wrap_inline3453 . For non-hierarchical models, the minimum feasible value of this quantity is the traditional deviance   statistic, and in these circumstances a natural connection is made to classical model comparison. However, in for hierarchical models the minimum is likely to be very poorly estimated by the sample minimum, and the mean might be a more reasonable measure: this has been implemented by, for example, Gilks et al. (1992) and Zeger and Karim (1991) for model comparison. This quantity is illustrated in the seeds example for logistic models, the salm example for Poisson models, the lsat and beetles examples for binary data, the alli example for multinomial data and the jaw example for multivariate normal data. However, further work is required before giving recommendations on its use.

9.3.3 Global goodness-of-fit tests based on Bayesian p-values

Rubin (1984) introduced a `Bayesianly-justifiable' frequentist model checking device. Suppose we calculate a statistic tex2html_wrap_inline3458 that we expect might be sensitive to departures of interest from the assumed model. For example, if we doubted the normal assumption for the population distribution in random-effects meta-analysis, we might calculate the range of the observed empirical log-odds ratios in the trials. Allowing design aspects of the study to be fixed, such as the sample sizes of the trials, we could then generate replicate sets of data based on our current beliefs about the parameters tex2html_wrap_inline3460 , and re-calculate for each replicate set tex2html_wrap_inline3462 the statistic tex2html_wrap_inline3464 which we may abbreviate to tex2html_wrap_inline3466 . Our observed statistic can then be compared with the distribution tex2html_wrap_inline3468 generated under the assumed model. If tex2html_wrap_inline3470 lies in the extreme tails of this distribution then evidence exists against the model.

This approach can be easily accommodated in BUGS by producing replicate data-sets and constructing nodes to calculate d based on the original and replicate data. The empirical distribution of tex2html_wrap_inline3474 can then be monitored using the stats command.

Gelman et al. (1996) extend this work of Rubin to allow discrepancy measures tex2html_wrap_inline3476 that depend both on the data and parameters, and this is straightforward to program within BUGS.

9.4 Implementation of selected model checking criteria in BUGS

We use the trivial line example introduced in Section 1.4 of this manual to illustrate how to implement a number of the above checking criteria in BUGS. (Note that we have changed the data point (2,3) in the line.dat file to (2,7) to represent an `outlier').

The first 3 checking functions listed in Section 9.3.1 may be calculated directly in BUGS. Note that in this example, tex2html_wrap_inline3478 is denoted by mu[i]. To compute tex2html_wrap_inline3480 (see checking function 3), we first obtain values of the random variable tex2html_wrap_inline3482 by generating a replicate data set Y.rep[i] which depends on the current values of mu[] and tau at each iteration. The step() function is then used to calculate the variable p.smaller[i] which takes the value 1 if Y[i]-Y.rep[i] tex2html_wrap_inline3484 and zero otherwise. Hence the posterior mean of p.smaller[i] is simply the proportion of iterations for which Y.rep[i] < Y[i] (i.e. tex2html_wrap_inline3488 ). It follows that tex2html_wrap_inline3490 posterior mean of p.smaller; the chance of observing a more extreme value for tex2html_wrap_inline3492 is thus the minimum of these two probabilities. To compute the fifth checking function listed in Section 9.3.1, we calculate the variable p.inv[i] (the inverse of the likelihood for observation i) at each iteration. Having completed a BUGS run, we then calculate the inverse of the posterior mean of p.inv[i] to obtain the predictive ordinate tex2html_wrap_inline3496 . In addition, these values may be used to compute the negative cross-validatory log-likelihood discussed in Section 9.3.2.

The model `deviance' may be computed directly in the BUGS code by calculating the log-likelihood contribution (log.like[i]) for each observation, and then summing and multiplying by -2. Monitoring this sum over iterations will yield a posterior distribution for the deviance.

The replicate data set Y.rep[] mentioned earlier is also used to compute the Bayesian p-values described in Section 9.3.3. Here we consider 2 different statistics for tex2html_wrap_inline3500 which may be sensitive to outlying obervations in a Normal model. These are the coefficients of skewness and kurtosis, which are estimated by the average value of the third and fourth moments about the mean divided by the third and fourth powers of the standard deviation respectively. For samples from a Normal distribution, the skewness coefficient has expectation zero, whilst the kurtosis coefficient has expectation 3. Positive skewness indicates an extended right tail, whilst negative skewness implies an extended left tail. A kurtosis coefficient > 3 indicates an excess of values near the mean and far from it, with a corresponding depletion elsewhere: this is the manner in which the t-distribution departs from the Normal. Values < 3 result from distributions with a flatter top than Normal. For large sample sizes ( tex2html_wrap_inline3508 for skewness and tex2html_wrap_inline3510 for kurtosis), the sampling distributions of the coefficients are approximately Normal with known variances. For smaller sample sizes, these sampling distributions are not appropriate for testing whether the observed coefficients depart significantly from their expected values. The problem of an unkown sampling distribution may be overcome by implementing the ideas of Gelman et al. (1996), who suggest computing the empirical distribution of the relevant statistic tex2html_wrap_inline3512 conditional on current beliefs about the `true' model parameters. In the line example, this is achieved by computing the average of the standardised third (skew.rep) and fourth (kurtosis.rep) moments using the replicate data set Y.rep[] at each iteration. These values are then compared to the relevant average moment (skew.obs or kurtosis.obs) computed from the observed data. The proportion of iterations for which skew.obs < skew.rep and kurtosis.obs < kurtosis.rep correspond to the Bayesian p-values for testing significant departures from the values of skewness and kurtosis expected under Normality.

The BUGS code below shows the declarations needed to implement each of the above checking criteria for the line example.

model line;
const
 N = 5,  # number of observations
 PI = 3.141593;

var
 x[N],Y[N],mu[N],alpha,beta,tau,sigma,x.bar,resid[N],sresid[N],
 m3[N],m4[N],skew.obs,kutosis.obs,p.skew,p.kurtosis, Y.rep[N], 
 resid.rep[N],sresid.rep[N],m3.rep[N],m4.rep[N],skew.rep,kutosis.rep, 
 p.smaller,like[N],p.inv[N],log.like[N],deviance;
   
data in  "line.dat";
inits in "line.in";

{
 # Model
 
 for (i in 1:N) {
   mu[i] <- alpha + beta*(x[i] - x.bar);
   Y[i] ~ dnorm(mu[i],tau);
 }
 x.bar <- mean(x[]);
 alpha ~ dnorm(0.0,1.0E-4);  beta ~ dnorm(0.0,1.0E-4);
 tau ~ dgamma(1.0E-3,1.0E-3);  sigma <- 1.0/sqrt(tau);

 # Model checking

 for (i in 1:N) {
 # Residuals and moments for observed data
   resid[i] <- Y[i] - mu[i];             # Checking fn. 1
   sresid[i] <- resid[i]*sqrt(tau);      # Checking fn. 2

   m3[i] <- pow(sresid[i],3);
   m4[i] <- pow(sresid[i],4);
  
 # Replicate data set 
   Y.rep[i]  ~ dnorm(mu[i],tau);
   p.smaller[i] <- step(Y[i]-Y.rep[i]);  # Checking fn. 3

 # Residuals and moments for replicate data
   resid.rep[i] <- Y.rep[i] - mu[i];
   sresid.rep[i] <- resid.rep[i]*sqrt(tau);

   m3.rep[i] <- pow(sresid.rep[i],3);
   m4.rep[i] <- pow(sresid.rep[i],4);

 # Likelihood for each observed Y[i]
   like[i] <- sqrt(tau/(2*PI))*exp(-0.5*pow(sresid[i],2)); 
   p.inv[i] <- 1/like[i];         # For checking fn. 5
   log.like[i] <- log(like[i]);   # Log-likelihood for deviance
 }

 # Bayesian p-values
   skew.obs <- sum(m3[])/N;     skew.rep <- sum(m3.rep[])/N;
   p.skew <- step(skew.rep-skew.obs);
 
   kurtosis.obs <- sum(m4[])/N;     kurtosis.rep <- sum(m4.rep[])/N;
   p.kurtosis <- step(kurtosis.rep-kurtosis.obs);

 # Deviance distribution
   deviance <- -2*sum(log.like[]);
}

Table 6 gives the results for each of these model checks based on a 10000 iteration BUGS run.


Table 6: Results of selected model diagnostics for the line example  

  table1925

Recall that we edited observation tex2html_wrap_inline3590 to create an outlier. Each of the checking functions described in Section 9.3.1 indicate that tex2html_wrap_inline3592 is indeed the most outlying value: tex2html_wrap_inline3594 has the largest residuals, the smallest probability of observing a more extreme value and the smallest predictive ordinate. However, the Bayesian p-values suggest that the observed skewness and kurtosis coefficients are consistent with the values expected from a random sample of 5 points from a Normal distribution. That is, there is no evidence against the assumption of Normal errors for this data. Notice the wide 95% credible intervals for the empirical distribution of skew.rep and kurtosis.rep, suggesting large sampling variation for these coefficients when based on small sample sizes.

The deviance and negative cross-validatory log-likelihood are intended for comparing 2 or more alternative models. To illustrate this, we fitted a second model to the line data, in which the observations were assumed to follow a Student's t distribution on 3 degrees of freedom. This yielded a posterior mean deviance of 24.56 (minimum = 19.71) and a negative cross-validatory log-likelihood of 15.44, all of which are slightly smaller than the corresponding values for the Normal model. This suggests that a t-distributed error structure on 3 degrees of freedom provides a marginally better fit to the edited line data than does a Normal error structure.  

9.5 Ranking

Besag et al. (1995) emphasise how MCMC methods are particularly suitable for deriving posterior probabilities of complex functions of multiple parameters, simply by counting the proportion of iterations for which the specific condition obtains. A special but useful example is when inferring the rank order of a set of parameters, for example the mortality rates in a set of hospitals. This is described in detail in the surgical example, in which it is shown how to establish the rank order at each iteration by means of the step function.

9.6 Measurement error

There has been increasing acknowledgement of the importance of measurement error in epidemiology, and this has led to a large literature [Gail, 1990] on alternative methods for adjusting inferences based on the standard assumption of that measured covariates represent the true `exposure' to whatever risk factor is of interest. Richardson and Gilks (1993, 1994) discuss the Bayesian approach to measurement error using graphical models and Gibbs sampling: here we shall indicate how alternative models for measurement error may be accommodated using BUGS, and also issue a warning about the lack of robustness that can result from a full Bayesian analysis.

Suppose we measure a covariate tex2html_wrap_inline3599 on an individual i, but we feel this is only an approximation for the `true' covariate value which we shall denote tex2html_wrap_inline3603 . We consider two possible reasons for this approximation: first, that there is `measurement error' either due to the measurement instrument used or due to random fluctuations around a long-term mean, and second that there is `rounding error' due simply to recording with insufficient inaccuracy: an example of the latter is the notorious preference for numbers ending in 0 or 5 in records of blood pressure. Of course, it is possible for both of these causes to exist together.

For each of these causes, we can consider two possible structures relating the observed tex2html_wrap_inline3605 to the true underlying tex2html_wrap_inline3607 : first, in the `classical error' model tex2html_wrap_inline3609 depends on tex2html_wrap_inline3611 , and we explicitly assume an `exposure model', i.e. a probability distribution for tex2html_wrap_inline3613 in the population; second, we can adopt the `Berkson' model and consider the true tex2html_wrap_inline3615 as depending on tex2html_wrap_inline3617 - while this may seem odd at first it is essentially equivalent to assuming a uniform exposure distribution.


Figure 4: Alternative graphical models for measurement and rounding error.

The four alternatives are briefly discussed below, with their possible representation in BUGS, and shown graphically in Figure 4.

Classical measurement error
Assume tex2html_wrap_inline3619 is normal with mean tex2html_wrap_inline3621 and precision tex2html_wrap_inline3623 , and a normal exposure distribution with mean tex2html_wrap_inline3625 and precision tex2html_wrap_inline3627 . tex2html_wrap_inline3629 will need to be assumed known or a relevant calibration sample available (see the cervix example). Alternatively repeated measures of tex2html_wrap_inline3631 may be used to provide information on tex2html_wrap_inline3633 . There may be external prior information on tex2html_wrap_inline3635 and tex2html_wrap_inline3637 , or there may be additional relevant sets of data to include in the model, or we may adopt `vague' priors.
  z[i] ~ dnorm(x[i], tau.e);
  x[i] ~ dnorm(theta, psi);

Berkson measurement error
Assume tex2html_wrap_inline3639 is normal with mean tex2html_wrap_inline3641 and precision tex2html_wrap_inline3643 , with the same need for external information on tex2html_wrap_inline3645 as above (see the air example).
  x[i] ~ dnorm(z[i], tau.e);

Classical rounding error
Assume tex2html_wrap_inline3647 is tex2html_wrap_inline3649 rounded to a simple number, so that the real information we have about tex2html_wrap_inline3651 is that it lies in an interval centred on tex2html_wrap_inline3653 with width 2c, where 2c is the precision of measurement. For instance, if tex2html_wrap_inline3659 is only recorded to the nearest whole number ending in 0 or 5 (as in the blood pressure example),

We may again assume a normal exposure distribution with mean tex2html_wrap_inline3661 and precision tex2html_wrap_inline3663 .

  z[i] ~ dnorm(x[i], tau.e);
  x[i] ~ dnorm(theta, psi) I(lower[i], upper[i]);
  lower[i] <- z[i] - c;
  upper[i] <- z[i] + c;
The restriction in the range of the conditional distribution is dealt with by BUGS using the I(,) construct, and represented in the graph by dashed undirected links.

Berkson rounding error
We again assume that tex2html_wrap_inline3665 is tex2html_wrap_inline3667 rounded to a simple number, but now interpret this as meaning that tex2html_wrap_inline3669 could lie anywhere in a range tex2html_wrap_inline3671 to tex2html_wrap_inline3673 .
  x[i] ~ dunif(lower[i], upper[i]);
  lower[i] <- z[i] - c;
  upper[i] <- z[i] + c;

As a simple example of these methods, suppose we have an observed measurement vector tex2html_wrap_inline3675 . We shall assume tex2html_wrap_inline3677 , which gives a standard deviation of .5 and hence a 95% interval of around tex2html_wrap_inline3679 , and tex2html_wrap_inline3681 which says observations have been rounded to the nearest whole number. Fairly vague priors are assumed for tex2html_wrap_inline3683 (mean 0, standard deviation 100) and tex2html_wrap_inline3685 a gamma(.001,.001) distribution). We obtain the estimates shown in Table 7 based on 5000 iterations. We note that the classical approach has pulled the estimated x's towards each other, while the Berkson model produces no change.


Table 7: Estimated x's using different measurement error models  

  table2066

Suppose we follow the line example of Section 1.4 and now consider 5 observed tex2html_wrap_inline3695 pairs tex2html_wrap_inline3697 . We shall fit a simple linear regression of y on x, using the notation

eqnarray2077

where the x's have been standardised around the mean of the z's.

The results in Table 8 are based on 5000 iterations.


 Table 8: Estimated x's using different measurement error models when embedded in regression problem.

  table2083

In contrast to the analysis before considering the y's, in which only the classical approach pulled all estimates pulled towards the middle, we see that there is substantial adjustment with all methods. The x's are essentially pulled towards values that provide a better fit to the linear model, so that the classical approach makes a substantial adjustment to the second and fourth values, and the Berkson model estimates tex2html_wrap_inline3721 and tex2html_wrap_inline3723 as being even more extreme than the surrogate z. The classical measurement error approach also shows a dramatic increase in the estimate of tex2html_wrap_inline3727 .

Is this appropriate behaviour? MacMahon et al. (1990) have provided a highly influential use of measurement error adjustment, in which the x's were all adjusted before consideration of the regression analysis: they were all pulled towards the middle on the basis of an external sample. Simultaneous adjustment for measurement error and estimation of regression coefficients clearly leads to adjustments towards improving the fit of the assumed regression model, which of course is perfectly coherent given the assumption that the regression model is true. However, rarely do we feel so confident in a regression model.  

9.7 Multinomial-Poisson transformation

A widely applicable statistical trick for handling structured multinomial probabilities is to pretend the counts in the categories are actually independent Poisson counts: see Baker (1995) for a long list of possible applications. We show three possible applications of this technique, which may be used to speed sampling in a variety of models.

9.7.1 Multinomial-logistic models

We assume there are K categories, with the probability of the kth category given covariates tex2html_wrap_inline3737 given by

displaymath3739

where tex2html_wrap_inline3741 for identifiability. Equivalently

displaymath3743

For the ith covariate pattern, let the observed counts be tex2html_wrap_inline3747 . Then the likelihood is

displaymath3749

This likelihood can be handled directly within BUGS (see Section 4.8 and the alli and endo examples). However, in some circumstances the following trick may be more efficient.

Suppose we assume the data in fact were generated by

eqnarray2120

Then the likelihood is

displaymath3751

displaymath3753

Let the tex2html_wrap_inline3755 's have independent gamma(a,b) priors: integrating them out gives a marginal likelihood of tex2html_wrap_inline3759 's

displaymath3761

displaymath3763

so as tex2html_wrap_inline3765 , the correct likelihood is obtained.

Since a gamma(0,0) on tex2html_wrap_inline3767 is formally equivalent to a uniform prior on tex2html_wrap_inline3769 , we could also formulate the problem as

displaymath3771

and give tex2html_wrap_inline3773 a locally uniform prior.

The use of this technique is illustrated in the alli example.

9.7.2 Conditional likelihoods in case-control studies

Conditional likelihoods in case-control studies provide a further application of the multinomial-Poisson transformation. Chapter 8 of Breslow and Day (1980) describe a structure with I case-control strata, each containing one case with covariates tex2html_wrap_inline3778 , and tex2html_wrap_inline3780 controls with covariates tex2html_wrap_inline3782 . The conditional likelihood for relative risk parameters tex2html_wrap_inline3784 is

displaymath3786

Assume the data are given by the disease status indicator tex2html_wrap_inline3788 , and were in fact generated by

eqnarray2177

Then the likelihood is

displaymath3790

and giving each tex2html_wrap_inline3792 a locally uniform prior leads to a marginal likelihood of identical form to the conditional likelihood. So in the basic example of a single case and control and a single exposure variable tex2html_wrap_inline3794 if exposed, 0 otherwise, this would lead to likelihood contributions of tex2html_wrap_inline3796 when the case is exposed and the control not, tex2html_wrap_inline3798 when the control is exposed and the case not, and is uninformative if the case and control have matching exposure status. See the endo example.

This technique allows conditional inference in case-control studies to take advantage of the Bayesian modelling of proper prior information, hierarchical structure, measurement error and so on.

9.7.3 Partial likelihoods in Cox regression models

It is possible to directly model in BUGS the partial likelihood used in Cox regression models, although in practice we have found it easier to use the Poisson trick: see the leuk example.

9.8 Logical constraints

The I(,) construct described in Section 4.6 allows bounds on the range of stochastic nodes to be specified. We consider the generic statement
       y ~ ddist( theta ) I( l, u )
where ddist represents a specific distribution with parameter or parameters tex2html_wrap_inline3800 , and the bounds l, u may themselves depend on tex2html_wrap_inline3804 . A number of different circumstances can apply.

y, l ,u all observed, tex2html_wrap_inline3808 not observed: in this case the constraints are automatically fulfilled and the appropriate likelihood for tex2html_wrap_inline3810 will be generated.

l ,u observed (or l,u unobserved but not dependent on tex2html_wrap_inline3816 ), tex2html_wrap_inline3818 not observed: this is the case for censored survival times, where sampling can proceed accordingly.

y, l ,u all unobserved, tex2html_wrap_inline3822 observed: this is handled appropriately, as illustrated in the marsbars example.

tex2html_wrap_inline3824 all unobserved, and l,u dependent on tex2html_wrap_inline3828 : this must be handled with great care. Essentially, the children of tex2html_wrap_inline3830 are not conditionally independent and so the total likelihood for tex2html_wrap_inline3832 is not a simple product of terms for y, l and u. In this situation the overall joint likelihood contribution from y, l ,u must be known externally and the full conditional for tex2html_wrap_inline3842 expressed in BUGS - see the lips example and Section 9.10 for a general discussion of how this is dealt with in a context without logical links.  

9.9 Parameterisation

Appropriate parameterisation is important in MCMC work in order to improve convergence and stability of the samples. This is a large and complex topic and we merely provide some comments in relation to the worked examples.

9.10 Undirected links, chain graphs and neighbourhood models

Up to now the discussion has been in terms of Directed Acyclic Graphs (DAGs), which provide a direct mapping into the model specification language. The only deviation from this has been the imposition of bounds on random quantities in Section 4.6, which essentially imposes an associational or undirected link (see the marsbars example). We now consider the general problem of handling chain graphs: these also can be considered within the framework of `neighbourhood models': see Besag et al. (1995) for a general discusion of MCMC techniques in such models.

Chain graphs. These graphs combine directed and undirected links in such a way that there are no cycles in the graph that contain a directed link, and have been used primarily in modelling of substantive research hypotheses in the social sciences [Wermuth and Lauritzen, 1990, Cox and Wermuth, 1993]. The graphs can always be arranged in a sequence of blocks containing nodes tex2html_wrap_inline3851 , where nodes within each block are connected by undirected links, and links between blocks are directed with the order constraint that a node in tex2html_wrap_inline3853 can only be the parent of a node in tex2html_wrap_inline3855 , where j > i. Thus a directed acyclic graph is a chain graph where each block only contains a single node, and an undirected graph is a chain graph with only one block: chain graphs are also known as block recursive graphs.

Chain graphs may represent the chain local Markov property [Frydenberg, 1990], which states that for a node v in block j, conditional on its parents in preceding blocks and neighbours in the same block, v is independent of all other nodes that are not descendants of v: descendants of a node are considered to be all nodes that are reachable by a path containing a directed link.

Frydenberg (1990) shows that, provided the joint distribution is everywhere positive, this property is equivalent to a joint distribution

  equation2247

where tex2html_wrap_inline3867 if there is a directed link from u to a member of tex2html_wrap_inline3871 ; this is as if each block has been taken as a single `node' in a directed acyclic graph. Further, each of the terms in (4) factorises into

  equation2252

where tex2html_wrap_inline3873 are the cliques of the undirected graph with nodes tex2html_wrap_inline3875 and undirected links comprising those between members of tex2html_wrap_inline3877 , the links between tex2html_wrap_inline3879 and tex2html_wrap_inline3881 with the directed arrows removed, and a complete set of links between all members of tex2html_wrap_inline3883 .

Suppose we now require the full conditional   for a node tex2html_wrap_inline3885 . Then

  eqnarray2262

where k indexes blocks containing children of tex2html_wrap_inline3893 . By conditional probability and the chain local Markov property we have for the the first term

displaymath3895

In each of the likelihood terms in the product in (6), the non-children of v can be integrated out since (5) implies there will be no terms containing both v and non-children of v. Let tex2html_wrap_inline3903 represent the members of tex2html_wrap_inline3905 that are children of v. Then we obtain a full conditional distribution

  eqnarray2272

Gibbs sampling on chain graphs. Three main approaches for carrying out Gibbs sampling in chain graphs may be identified: treating each block as a single multivariate node, using a factorisation of univariate conditional distributions, and directly specifying the full conditional distributions.

  1. Direct use of expression (6) may be feasible if joint distributions within blocks are directly given in the model specification and a procedure exists for picking out the relevant terms: this may be feasible using multivariate normal distributions in future versions of BUGS.

  2. Factorisation (7) may be used if each child of v is in a separate block containing only one member; this occurs when sampling, for example, the random effects tex2html_wrap_inline3911 in the lip example, each of which has only tex2html_wrap_inline3913 as a child (ignoring intervening deterministic nodes). Each tex2html_wrap_inline3915 then has only a single member, say w, so that the full conditional (7) is then of the form

    displaymath3919

    The likelihood terms may be then handled by BUGS in the usual way, but we also need to specify the prior term tex2html_wrap_inline3921 .

    Two issues arise in specifying this prior term. First, we need to ensure that the prior is derived from the correct joint distribution tex2html_wrap_inline3923 . Second, it should be remembered that we will also be specifying a prior term for each neighbour of v in tex2html_wrap_inline3927 , which will be conditioning on v and hence appear to have v as a parent; BUGS would, unless told otherwise, include each such conditional distribution as a likelihood term for v and hence `double-count' the evidence from each neighbour. To stop this inappropriate use of a likelihood term, BUGS therefore has a precedence relation:   when constructing the full conditional for a node v, if a node w is used in calculating the prior term, w should not also appear as a child of v, i.e. any likelihood contribution that makes use of w should be ignored.

  3. The above procedure will fail if v has multiple children in a block, which occurs, for example, when sampling the hyperparameter tex2html_wrap_inline3947 that determines the degree of smoothing in the lip cancer model (see lips example), which has all the random effects tex2html_wrap_inline3949 as children. We now have the additional problem that the likelihood term does not factorise into univariate terms, and so it is necessary to derive the full conditional distribution tex2html_wrap_inline3951 by algebraic means and place it directly in the model specification. The precedence relation then ensures that inappropriate likelihood terms are not added.

We note that the latter technique, over-riding the automatic construction of full conditionals, means that BUGS may be used for handling fully undirected models. However, we emphasise that when using anything other than a DAG it is the user's responsibility to ensure they are defining a correct joint distribution.

While BUGS is capable of handling a variety of types of neighbourhood model, its rather simplistic approach to sampling will mean it may be very inefficent compared with the specially tuned methods described, for example, by Besag et al. (1995).  

9.11 Multivariate normal nodes

9.11.1 Use of Multivariate normal observations

The jaw example illustrates the use of a multivariate normal observation vector x, with unknown mean tex2html_wrap_inline3963 and precision T. This is represented in BUGS by
   x[] ~ dmnorm(mu[],T[,])  
  T[,] ~ dwish(R[,],k)

This will carry out multivariate updating of tex2html_wrap_inline3967 . Note our warnings in the birats example about instability when adopting multivariate normal observations and multivariate normal parmaters in the same model.

9.11.2 Use of Multivariate normal parameters

A set of regression parameters in a generalised linear model may be declared as multivariate normal (see birats example). However, unless the response variable is normally distributed, updating of tex2html_wrap_inline3969 will be element-wise rather than multivariate.

9.11.3 Use of the Wishart distribution

The parameterisation used for the Wishart distribution of a tex2html_wrap_inline3971 symmetric positive definite matrix x is

displaymath3975

is that of De Groot (1970) and not that of, say, Bernardo and Smith (1994): when reading studies it is important to check which form is being used. Here tex2html_wrap_inline3977 is the degrees of freedom, and R is a tex2html_wrap_inline3981 symmetric non-singular matrix: our parameterisation treats the Wishart as a multivariate extension of a chi-squared distribution.

Since E(x) = (Rtex2html_wrap_inline3999)tex2html_wrap_inline3995 and xtex2html_wrap_inline3995 is a covariance matrix, we might roughly think of Rtex2html_wrap_inline3999 as a prior estimate of the covariance based on k observations (in fact, since E(xtex2html_wrap_inline3995) = Rtex2html_wrap_inline4011 and k will generally be chosen to be near or equal to p, it is perhaps best to think of R as an assessment of the order of magnitude of the covariance matrix.

9.11.4 Avoiding multivariate nodes for parameters

The use of multivariate normal nodes for parameters can be avoided by assuming ``founder" parameters (with no parents) are a priori independent. If we are interested in marginal distributions of parameters this restriction to prior independence should still lead to reasonable conclusions, since dependence introduced by likelihood terms is appropriately handled.

Possible aids in avoiding the need for multivariate analyses include

  1. Defining parameters so as to make prior independence plausible. Thus, for example, covariates should be standardised about their mean: this will also lead to reduced posterior correlation and hence better convergence.
  2. Introducing additional levels of the hierarchical model to explain correlations.
Thus for random-intercept models with a single random effect term, it will generally be unnecessary to involve multivariate distributions. However, if there are additional random coefficients, as in teh birats and schools examples, then it is preferable to use multivariate normal models for the parameters.  



next previous contents
Next: Acknowledgements Previous: 8 Errors

Comments to bugs@mrc-bsu.cam.ac.uk
© 1995 MRC Biostatistics Unit
Return to the Welcome Page