![[logo]](/bugs/images/bugslogo.gif)
We shall examine some general issues in Bayesian analysis and associated implementation in BUGS.
We are unwilling, and unable, to recommend any fixed strategy for analysing data. However, some general points are as follows.
For a collection of case-studies and practical advice on using MCMC methods, see Gilks et al. (1995).
The appropriate specification of `non-informative' priors is an old problem in Bayesian statistics, and is particularly important when techniques are to be used in a scientific context in which an `objective' inference is required. BUGS requires that a full probability model is defined, and hence forces all priors to be proper, and the problem arises of specifying appropriate priors on `founder' nodes (those with no parents) in the graph.
It is, however, vital to
distinguish parameters of primary interest from those which
specify secondary structure for the model. The former will
generally be location parameters, such as regression coefficients, and in many cases a normal prior with an extremely small
precision (large variance) is adequate. Our examples often feature
a choice
, corresponding to a standard
deviation of 100, so that a 95% prior interval is around
,
and the prior will be locally uniform over the region supported
by the likelihood. Alternatively a uniform prior on a suitably
wide interval could be given , such as
Secondary aspects of a model include, say, the precision of random effects in a hierarchical model. More care is required in setting such parameters and we recommend thinking carefully about reasonable values for such parameters in advance and so specifying fairly informative priors wherever possible - the inclusion of such external information is unlikely to bias the scientific conclusions of a study, although some sensitivity analysis should be performed to reassure consumers.
A particular, but insufficiently recognised, problem occurs with
the precision parameter in a random effects model
[DuMouchel, 1990, DuMouchel and Waternaux, 1992]. Let us call this
parameter
, with inverse
, the variance of the
random effects. The standard `non-informative' prior would
generally be considered to be
,
equivalent to
, which arises from
assuming the log of the scale parameter has a uniform
distribution on
.
However, the boundary
value
is supported by non-negligible likelihood
since it is theoretically possible that there are no
subject-specific random effects and, remarkably,
the improper prior
goes to infinity at zero at a rate sufficient to lead to a posterior distribution
that is also improper.
Assuming a
prior that is `only just' proper, such
with mean 1 and variance
, will formally
avoid strict impropriety: such priors are investigated below.
An alternative approach is to choose an
improper prior such as
being locally uniform on
[Gelman and Rubin, 1992, Carlin, 1992],
which does not lead to an improper posterior. Keeping within a
general class
of proper priors, we assume we would like a prior on
that is
uniform on a range
. This implies that
has a
Pareto distribution with parameters
, where the
Pareto distribution (Section 4.4) has the form
Thus a uniform prior for
or
on
would
respectively give rise to
a Pareto
or a Pareto
distribution on
.
Assuming
is the precision parameter for a normal distribution
gives a likelihood contribution for
from n observations
proportional to
, where S is a sum of squares
statistic. Thus a Pareto prior for
is conjugate with the
likelihood within the class of left-truncated gamma distributions,
and leads to a posterior distribution proportional to
.
We note that the Pareto distribution is not log-concave in
, and
hence BUGS recognises the conjugacy and constructs
the full conditional distribution: this will be log-concave
provided
: hence for a locally uniform
prior on
we require at least n=3 contributions to
the precision estimation, and at least n= 4 to assume a locally uniform
distribution for
.
These and other priors are explored in Figure 3.
Instead of using an `off-the-shelf' prior,
it is perhaps more appropriate to give some thought as to a reasonable proper
prior for the random-effect precision parameter using knowledge of the particular context. Below we give some examples of the sort of thinking
that could give rise to proper prior specification.
We shall consider normal, logistic
and Poisson regression, in which it is assumed that the random
effect for subject i enters the linear predictor additively as a
variable
.
Plausible values for
will depend primarily
on the measurement precision
- it is unlikely that the between-subject
precision
is substantially smaller than the within-subject precision
, and
plausible that it is considerably greater.
This model means that 95% of subjects
with identical covariates will have log(odds) of the event in question
occuring in a range with width
.
Smith
et al. (1995)
derived the following prior within the context of meta-analysis.
Suppose that we thought it plausible that there was roughly one order
of magnitude difference between the odds on failure for subjects with
identical covariates, and we interpreted this as having a prior belief that
95% of subjects had log(odds) in a range of width
.
This implies that
is a reasonable guess for
. Suppose in addition that we would be rather surprised
to find two orders of magnitude difference between the odds on failure between
subjects, and we would interpret such a finding as requiring
95% of subjects to have log(odds) in a range of width
.
This gives a `low' value for
of
. A gamma distribution with
parameters (3,1) has a mean of 3 and 96% probability of exceeding
.73, and hence
might be an appropriate
proper prior in this context. Figure 3 shows that the implied distribution
for
seemsquite reasonable, although ruling out either very low values
(no random effect) and very high values (independent observations).
This analysis can be adjusted to reflect alternative prior beliefs: for example
a guess that 95% of log(odds) were within a two-fold range (
), with a
ten-fold range being unlikely (
), would lead to a prior
, with mean 32 and standard deviation
27.
This model means that 95% of subjects
with identical covariates will have log(mean) in a range with width
.
Following the analysis for the logistic model, we might think it
plausible in a specific application that subjects with the same
covariates had expectations that varied within one order of
magnitude, but unlikely to vary over two orders. This leads
to the same assumption of a Gamma(3,1) prior for
.
Figure 3 shows a variety of prior suggestions plotted on
scales
for
(.25 to 100) and
(.1 to 2). The selected prior
distributions show a variety of behaviours on the
scale.
The `just' proper Gamma(.001,.001) prior favours small values of
in a reasonable manner, and we have found this to
generally lead to sensible conclusions. Although
the crucial aspect is whether the shape of the prior is
reasonable in the area supported by the likelihood, the
major reason for introducing substantial prior opinion is when
the likelihood is fairly flat. In this situation some external judgement
of plausible values of the precision parameter is unavoidable.
We shall distinguish three objectives: model checking through examination of individual observations, comparison between two or more competitor models, and global checks of goodness-of-fit. In all cases we exploit the idea of comparing observed statistics with our expectations, were we making predictions conditional on the truth of a particular model assumption.
Two situations can be identified: when the data
form a
separate evaluation set, and when they were
used for deriving the posterior distribution of the parameters.
In many cases, unless say there is an explicit dependence on
previous observations, the y's are conditionally independent of
the x's given the total set
of unknowns
. Thus the required
integral is simply
This just requires adding additional random nodes
in the graph
with the appropriate parents,
and monitoring the samples generated for
.
The observed values
can then be compared with their predictive
distributions which may be summarised using the stats command.
As an application of the third checking function,
one could then simply see how often the true observations
lie within the nominal predictive intervals. Alternatively,
one could calculate the first four checking
functions above by explicitly including new nodes representing,
for example,
and
and so obtain
estimates of
and
directly by
monitoring these nodes and noting the empirical mean.
Unfortunately this is generally difficult to do within BUGS. An exception is the final checking function, since we can explicitly calculate the predictive ordinate within BUGS. Gelfand and Dey (1994) point out the interesting fact that
and hence a Monte Carlo estimate of
is
obtained as a harmonic mean of
.
This may be accomplished in BUGS by constructing a node
which takes on values
,
and then taking the inverse of its empirical mean. However, harmonic means
are notoriously unstable so care is required regarding convergence.
An approximation to the cross-validatory
method is to use the methods
for a separate evaluation set, but replacing
by
. Hence our predictive distribution is
which will usually be expressible as
and hence just requires adding additional random nodes
in the graph
with the same parents as
.
If we do wish to sample from the correct cross-validatory predictive
distribution, this can be carried out
using an additional importance sampling step to remove the effect of
when repredicting
[Gelfand et al., 1992], although
this would have to be carried out external to BUGS.
The Bayesian theory of model comparison is based on the Bayes factor,
which
for two competing models
and
is defined as the ratio of
the marginal ordinates of the observed data
where
are the unobserved quantities in model i. There is
a great deal of recent literature on the difficult issues surrounding the
calculation and interpretation of Bayes factors: see for example
Kass and Raftery (1995), Gelfand and Dey (1994). Carlin and Chib (1995) describe
one means of calculating Bayes factors within a Monte Carlo run, and this
is illustrated in our pines example.
Cross-validatory measures.
We note the analysis first described by Smith (1991) in which it is argued that a Bayes factor approach is only strictly appropriate if we truly believe that one, and only one, of the candidate models is true. If, as would generally be the case, the assumed models are simply considered as useful proxies for some theoretically and practically indeterminate `true' model, then model comparison based on cross-validatory measures may be more suitable.
Comparison between models can therefore be based on an accumulation
over observations of the checking functions described above. Examples
might include the sum of squared or absolute unstandardised residuals
,
or the negative cross-validatory log-likelihood
The latter results in a comparison between models based on the `pseudo-Bayes factor' [Geisser and Eddy, 1979, Gelfand and Dey, 1994].
We note that such accumulation needs to be carried out after a BUGS run using the estimates obtained from the monitors.
`Deviance' measures
Dempster (1974) suggested examining the posterior distribution
of the log-likelihood of the observed data: this is straightforward
to calculate with a BUGS run by forming a variable with the value
.
For non-hierarchical models, the minimum feasible value of this quantity is the traditional
deviance
statistic, and in these circumstances a natural
connection is made to classical model comparison. However, in for hierarchical models the minimum is likely to be very poorly estimated by the sample minimum, and the mean might be a more reasonable measure: this has been
implemented by, for example, Gilks et al. (1992) and
Zeger and Karim (1991) for model comparison. This quantity is illustrated
in the seeds example for logistic models, the salm example
for Poisson models, the lsat and beetles examples for binary data, the alli example for multinomial data and the jaw example for multivariate normal data. However,
further work is required
before giving recommendations on its use.
This approach can be easily accommodated in BUGS by
producing
replicate data-sets and constructing nodes to calculate d based
on the original and replicate data. The empirical distribution of
can then be monitored using the stats command.
Gelman et al. (1996) extend this work of Rubin to allow
discrepancy measures
that depend both on the data and parameters, and this is
straightforward to program within BUGS.
The first 3 checking functions listed in Section 9.3.1 may be calculated directly in BUGS. Note that in this example,
is denoted by mu[i]. To compute
(see checking function 3), we first obtain values of the random variable
by generating a replicate data set Y.rep[i] which depends on the current values of mu[] and tau at each iteration. The step() function is then used to calculate the variable p.smaller[i] which takes the value 1 if Y[i]-Y.rep[i]
and zero otherwise. Hence the posterior mean of p.smaller[i] is simply the proportion of iterations for which Y.rep[i] < Y[i] (i.e.
). It follows that
posterior mean of p.smaller; the chance of observing a more extreme value for
is thus the minimum of these two probabilities. To compute the fifth checking function listed in Section 9.3.1, we calculate the variable p.inv[i] (the inverse of the likelihood for observation i) at each iteration. Having completed a BUGS run, we then calculate the inverse of the posterior mean of p.inv[i] to obtain the predictive ordinate
. In addition, these values may be used to compute the negative cross-validatory log-likelihood discussed in Section 9.3.2.
The model `deviance' may be computed directly in the BUGS code by calculating the log-likelihood contribution (log.like[i]) for each observation, and then summing and multiplying by -2. Monitoring this sum over iterations will yield a posterior distribution for the deviance.
The replicate data set Y.rep[] mentioned earlier is also used to compute the Bayesian p-values described in Section 9.3.3. Here we consider 2 different statistics for
which may be sensitive to outlying obervations in a Normal model. These are the coefficients of skewness and kurtosis, which are estimated by the average value of the third and fourth moments about the mean divided by the third and fourth powers of the standard deviation respectively. For samples from a Normal distribution, the skewness coefficient has expectation zero, whilst the kurtosis coefficient has expectation 3. Positive skewness indicates an extended right tail, whilst negative skewness implies an extended left tail. A kurtosis coefficient > 3 indicates an excess of values near the mean and far from it, with a corresponding depletion elsewhere: this is the manner in which the t-distribution departs from the Normal. Values < 3 result from distributions with a flatter top than Normal. For large sample sizes (
for skewness and
for kurtosis), the sampling distributions of the coefficients are approximately Normal with known variances. For smaller sample sizes, these sampling distributions are not appropriate for testing whether the observed coefficients depart significantly from their expected values. The problem of an unkown sampling distribution may be overcome by implementing the ideas of Gelman et al. (1996), who suggest computing the empirical distribution of the relevant statistic
conditional on current beliefs about the `true' model parameters. In the line example, this is achieved by computing the average of the standardised third (skew.rep) and fourth (kurtosis.rep) moments using the replicate data set Y.rep[] at each iteration. These values are then compared to the relevant average moment (skew.obs or kurtosis.obs) computed from the observed data. The proportion of iterations for which skew.obs < skew.rep and kurtosis.obs < kurtosis.rep correspond to the Bayesian p-values for testing significant departures from the values of skewness and kurtosis expected under Normality.
The BUGS code below shows the declarations needed to implement each of the above checking criteria for the line example.
model line;
const
N = 5, # number of observations
PI = 3.141593;
var
x[N],Y[N],mu[N],alpha,beta,tau,sigma,x.bar,resid[N],sresid[N],
m3[N],m4[N],skew.obs,kutosis.obs,p.skew,p.kurtosis, Y.rep[N],
resid.rep[N],sresid.rep[N],m3.rep[N],m4.rep[N],skew.rep,kutosis.rep,
p.smaller,like[N],p.inv[N],log.like[N],deviance;
data in "line.dat";
inits in "line.in";
{
# Model
for (i in 1:N) {
mu[i] <- alpha + beta*(x[i] - x.bar);
Y[i] ~ dnorm(mu[i],tau);
}
x.bar <- mean(x[]);
alpha ~ dnorm(0.0,1.0E-4); beta ~ dnorm(0.0,1.0E-4);
tau ~ dgamma(1.0E-3,1.0E-3); sigma <- 1.0/sqrt(tau);
# Model checking
for (i in 1:N) {
# Residuals and moments for observed data
resid[i] <- Y[i] - mu[i]; # Checking fn. 1
sresid[i] <- resid[i]*sqrt(tau); # Checking fn. 2
m3[i] <- pow(sresid[i],3);
m4[i] <- pow(sresid[i],4);
# Replicate data set
Y.rep[i] ~ dnorm(mu[i],tau);
p.smaller[i] <- step(Y[i]-Y.rep[i]); # Checking fn. 3
# Residuals and moments for replicate data
resid.rep[i] <- Y.rep[i] - mu[i];
sresid.rep[i] <- resid.rep[i]*sqrt(tau);
m3.rep[i] <- pow(sresid.rep[i],3);
m4.rep[i] <- pow(sresid.rep[i],4);
# Likelihood for each observed Y[i]
like[i] <- sqrt(tau/(2*PI))*exp(-0.5*pow(sresid[i],2));
p.inv[i] <- 1/like[i]; # For checking fn. 5
log.like[i] <- log(like[i]); # Log-likelihood for deviance
}
# Bayesian p-values
skew.obs <- sum(m3[])/N; skew.rep <- sum(m3.rep[])/N;
p.skew <- step(skew.rep-skew.obs);
kurtosis.obs <- sum(m4[])/N; kurtosis.rep <- sum(m4.rep[])/N;
p.kurtosis <- step(kurtosis.rep-kurtosis.obs);
# Deviance distribution
deviance <- -2*sum(log.like[]);
}
Table 6 gives the results for each of these model checks based on a 10000 iteration BUGS run.
Recall that we edited observation
to create an outlier. Each of the checking functions described in Section 9.3.1 indicate that
is indeed the most outlying value:
has the largest residuals, the smallest probability of observing a more extreme value and the smallest predictive ordinate. However, the Bayesian p-values suggest that the observed skewness and kurtosis coefficients are consistent with the values expected from a random sample of 5 points from a Normal distribution. That is, there is no evidence against the assumption of Normal errors for this data. Notice the wide 95% credible intervals for the empirical distribution of skew.rep and kurtosis.rep, suggesting large sampling variation for these coefficients when based on small sample sizes.
The deviance and negative cross-validatory log-likelihood are intended for comparing 2 or more alternative models. To illustrate this, we fitted a second model to the line data, in which the observations were assumed to follow a Student's t distribution on 3 degrees of freedom. This yielded a posterior mean deviance of 24.56 (minimum = 19.71) and a negative cross-validatory log-likelihood of 15.44, all of which are slightly smaller than the corresponding values for the Normal model. This suggests that a t-distributed error structure on 3 degrees of freedom provides a marginally better fit to the edited line data than does a Normal error structure.
Suppose we measure a covariate
on an individual i, but we
feel this is only an approximation for the `true' covariate value which
we shall denote
. We consider two possible reasons for this
approximation: first, that there is `measurement error' either due to
the measurement instrument used or due to random fluctuations around a
long-term mean, and second that there is `rounding error' due
simply to recording with insufficient inaccuracy: an example of the
latter is the notorious preference for numbers ending in
0 or 5 in records of
blood pressure. Of course, it is possible for both
of these causes to exist together.
For each of these causes, we can consider two possible structures
relating the observed
to the true underlying
: first,
in the `classical error' model
depends on
, and we explicitly
assume an `exposure model', i.e. a probability
distribution for
in the population; second, we can adopt
the `Berkson' model and consider the true
as depending on
- while this may seem odd at first it is essentially equivalent
to assuming a uniform exposure distribution.
Figure 4: Alternative graphical models for measurement and rounding error.
The four alternatives are briefly discussed below, with their possible representation in BUGS, and shown graphically in Figure 4.
z[i] ~ dnorm(x[i], tau.e); x[i] ~ dnorm(theta, psi);
x[i] ~ dnorm(z[i], tau.e);
We may again assume a normal exposure distribution
with mean
and precision
.
z[i] ~ dnorm(x[i], tau.e); x[i] ~ dnorm(theta, psi) I(lower[i], upper[i]); lower[i] <- z[i] - c; upper[i] <- z[i] + c;The restriction in the range of the conditional distribution is dealt with by BUGS using the I(,) construct, and represented in the graph by dashed undirected links.
x[i] ~ dunif(lower[i], upper[i]); lower[i] <- z[i] - c; upper[i] <- z[i] + c;
As a simple example of these methods, suppose we have an observed
measurement vector
.
We shall assume
, which gives a standard deviation of .5
and hence a 95% interval of around
, and
which
says observations have been rounded to the nearest whole
number. Fairly vague priors are assumed for
(mean 0,
standard deviation 100) and
a gamma(.001,.001) distribution).
We obtain the estimates shown in
Table 7 based on 5000 iterations. We note that the
classical approach has pulled the estimated x's towards each other, while
the Berkson model produces no change.
Suppose we follow the line example of Section 1.4 and now consider 5
observed
pairs
. We shall fit
a simple linear regression of y on x, using
the notation
where the x's have been standardised around the mean of the z's.
The results in Table 8 are based on 5000 iterations.
In contrast to the analysis before considering the y's, in which only the classical
approach pulled all estimates pulled towards the middle, we see that there
is substantial adjustment with all methods. The x's are essentially
pulled towards values that provide a better fit to the linear
model, so that the classical approach makes a substantial adjustment
to the second and fourth values, and the Berkson model estimates
and
as being even more extreme than the surrogate z.
The classical measurement error approach also shows a dramatic increase in the
estimate of
.
Is this appropriate behaviour? MacMahon et al. (1990) have provided a highly influential use of measurement error adjustment, in which the x's were all adjusted before consideration of the regression analysis: they were all pulled towards the middle on the basis of an external sample. Simultaneous adjustment for measurement error and estimation of regression coefficients clearly leads to adjustments towards improving the fit of the assumed regression model, which of course is perfectly coherent given the assumption that the regression model is true. However, rarely do we feel so confident in a regression model.
where
for identifiability. Equivalently
For the ith covariate pattern, let the observed counts be
. Then
the likelihood is
This likelihood can be handled directly within BUGS (see Section 4.8 and the alli and endo examples). However, in some circumstances the following trick may be more efficient.
Suppose we assume the data in fact were generated by
Then the likelihood is
Let the
's have independent gamma(a,b) priors:
integrating them out gives a marginal likelihood of
's
so as
, the correct likelihood is obtained.
Since a gamma(0,0) on
is formally equivalent to a uniform prior on
, we could also formulate the problem as
and give
a locally uniform prior.
The use of this technique is illustrated in the alli example.
Conditional likelihoods in case-control studies provide a further
application of the multinomial-Poisson transformation. Chapter 8 of
Breslow and Day (1980) describe a structure with I case-control strata,
each containing one case with covariates
, and
controls
with covariates
. The conditional likelihood for relative risk parameters
is
Assume the data are given by the disease status indicator
, and were in fact generated by
Then the likelihood is
and giving each
a locally uniform prior leads to a marginal likelihood of identical form to the conditional likelihood.
So in the basic example of a single case and control and a single exposure variable
if exposed, 0 otherwise, this would
lead to likelihood contributions of
when the case is exposed and the control not,
when the control is exposed and the case not, and
is uninformative if the case and control have matching exposure status.
See the endo example.
This technique allows conditional inference in case-control studies to take advantage of the Bayesian modelling of proper prior information, hierarchical structure, measurement error and so on.
y ~ ddist( theta ) I( l, u )where ddist represents a specific distribution with parameter or parameters
y, l ,u all observed,
not observed: in this case the constraints are
automatically fulfilled and the appropriate likelihood for
will be generated.
l ,u observed (or l,u unobserved but not dependent on
),
not observed: this is the case for censored survival times, where sampling
can proceed accordingly.
y, l ,u all unobserved,
observed: this is handled
appropriately, as illustrated in the marsbars example.
all unobserved, and l,u dependent on
:
this must be
handled with great care. Essentially, the children of
are not conditionally independent and so the total likelihood for
is not a simple product of terms for y, l and u.
In this situation the overall joint likelihood
contribution from y, l ,u must be known externally and the full
conditional for
expressed in BUGS - see
the lips example and
Section 9.10 for a general discussion of how this is dealt with
in a context without logical links.
Chain graphs.
These graphs combine directed and undirected links
in such a way that there are no cycles in the graph that contain
a directed link, and have been used primarily in modelling
of substantive research hypotheses in the social
sciences
[Wermuth and Lauritzen, 1990, Cox and Wermuth, 1993]. The graphs can always be
arranged in a sequence
of blocks containing nodes
, where nodes within each block are connected by
undirected links, and links between blocks are directed with the
order constraint that a node in
can only be the parent
of a node in
, where j > i. Thus a directed
acyclic graph is a chain graph
where each block only contains a single node, and an undirected
graph is a chain graph with only one block: chain graphs
are also known as block recursive graphs.
Chain graphs may represent the chain local Markov property [Frydenberg, 1990], which states that for a node v in block j, conditional on its parents in preceding blocks and neighbours in the same block, v is independent of all other nodes that are not descendants of v: descendants of a node are considered to be all nodes that are reachable by a path containing a directed link.
Frydenberg (1990) shows that, provided the joint distribution is everywhere positive, this property is equivalent to a joint distribution
where
if there is a
directed link from u to a member of
; this is as if each block has been taken as a single `node'
in a directed acyclic graph. Further, each of the terms
in (4) factorises into
where
are the cliques of the undirected graph with nodes
and undirected links
comprising those between members of
, the links between
and
with the directed arrows
removed, and a complete set of links between all members of
.
Suppose we now require the full conditional
for a node
.
Then
where k indexes blocks containing children of
. By conditional
probability and the chain local Markov property we have for the the first term
In each of the likelihood terms in the product in (6), the non-children
of v can be integrated out since (5) implies there will be no terms containing both v and non-children of v. Let
represent the members of
that are
children of v. Then we obtain a full conditional distribution
Gibbs sampling on chain graphs. Three main approaches for carrying out Gibbs sampling in chain graphs may be identified: treating each block as a single multivariate node, using a factorisation of univariate conditional distributions, and directly specifying the full conditional distributions.
The likelihood terms may be then handled by BUGS
in the usual way, but we also need to specify the prior term
.
Two issues arise in specifying this prior term. First, we
need to ensure that the prior is derived from the correct joint
distribution
.
Second, it should be remembered that we will also be specifying a prior
term for each neighbour of v in
, which will be conditioning
on v and hence appear to have v as a parent; BUGS would, unless
told otherwise, include each such conditional distribution as a likelihood term for v
and hence `double-count' the evidence from each neighbour.
To stop this inappropriate use of
a likelihood term, BUGS therefore has a precedence
relation:
when constructing the full conditional for a node v, if a node w
is used in calculating the prior term, w should not also
appear as a child of v, i.e. any likelihood contribution that makes use of
w should be ignored.
We note that the latter technique, over-riding the automatic construction of full conditionals, means that BUGS may be used for handling fully undirected models. However, we emphasise that when using anything other than a DAG it is the user's responsibility to ensure they are defining a correct joint distribution.
While BUGS is capable of handling a variety of types of neighbourhood model, its rather simplistic approach to sampling will mean it may be very inefficent compared with the specially tuned methods described, for example, by Besag et al. (1995).
x[] ~ dmnorm(mu[],T[,]) T[,] ~ dwish(R[,],k)
This will carry out multivariate updating of
. Note our warnings in the birats example about instability when adopting multivariate normal observations and multivariate normal parmaters in the same model.
A set of regression parameters in a generalised linear
model may be declared as multivariate normal (see birats example). However, unless
the response variable is normally distributed, updating of
will be element-wise rather than multivariate.
The parameterisation used for the Wishart distribution of
a
symmetric positive definite matrix x is
is that of
De Groot (1970) and not that of, say, Bernardo and Smith (1994):
when reading studies it is important to check which
form is being used. Here
is the degrees of freedom, and
R is a
symmetric non-singular matrix: our
parameterisation treats the Wishart as a multivariate
extension of a chi-squared distribution.
Since E(x) = (R
)
and x
is a covariance matrix, we might roughly think of R
as a prior estimate of the covariance based on k observations (in fact,
since E(x
) = R
and k will generally be chosen to be near or equal to p, it is perhaps best to think of R as an assessment of the order of magnitude of the covariance matrix.
The use of multivariate normal nodes for parameters can be avoided by assuming ``founder" parameters (with no parents) are a priori independent. If we are interested in marginal distributions of parameters this restriction to prior independence should still lead to reasonable conclusions, since dependence introduced by likelihood terms is appropriately handled.
Possible aids in avoiding the need for multivariate analyses include
Comments to bugs@mrc-bsu.cam.ac.uk
© 1995 MRC Biostatistics Unit
Return to the Welcome Page