Over the next couple of weeks I will be using Bayesian analysis to model the spread of COVID-19. Inspired by Alex Stage who started the Athena Project, I have committed Winder.AI to helping Athena reach its goals.
The first project I will be helping is to develop and deploy simple epidemic models so that users are able to not only view current data, but also view estimates of future data. To achieve this I want to use Bayesian models, to capture the inherent uncertainty within the data.
This post is a quick set of best practices when performing Bayesian analysis; subsequent posts will present my work. I will update this post with new information.
Bayesian Data Analysis
Bayesian linear regression is still linear regression. But the Bayesian part helps us propagate probability estimates to provide a more intuitive interpretation of the results. Indeed, this is similar to how non-data scientists thinks about probability. The key is to think of your data, your models, your work, in terms of distributions of possibilities, not fixed points. You have to be comfortable with a lack of precision; shades of grey. For a detailed introduction into Bayesian thinking, check out the reading materials at the end.
Markov Chain Monte Carlo
Markov chains are links made of choices. They describe movement from one state to the next, according to a transition probability. Monte Carlo is a black-box sampling technique, whereby you repeatedly ask an unknown system to generate a response. These are used together in Markov chain Monte Carlo (MCMC) algorithms to build a chain of samples from a distribution that it would be infeasible to solve analytically. The chain “walks” (I like to think of it as “stumbles”) towards generating observations that look exactly like the distribution you want to simulate and at this point you can sample from the chain to generate summary statistics or plots of the probability distribution.
Performance of a Model
Trace plots show the estimate of a parameter over the course of a single chain. You can generate this chain multiple times. If you model is representative, then the chain should end up in the same place. I.e. each chain should predict the same parameters. If they do, then you can feel more confident about the results. If not, then you cannot trust the estimates. In other words they should be stationary and converge to the same location when repeated. Visually, you want chains that are commonly described as looking like a “hairy caterpillar” or “grass”. The probability distributions of chains should be similar.
Although I always recommend using visual interpretations of performance, you can make use of common summary statistics too:
- Standard error of the mean: this is the amount of uncertainty contributed by only having a finite number of posterior draws. The sampling error in Monte Carlo. Like standard error.
- Effective sample size: This is the amount of autocorrelation in the chain, this should be high, relative to the sample size. If it is low, that means the chain is randomly walking all over the place and randomly sampling. You don’t want this; you want the chain to hone in on the correct result.
- Potential Scale Reduction Factor: This measures how well chains “mix”. It is an estimate of convergence based on the variance of an estimated parameter between chains, and the variance within a chain. It is interpreted as the factor by which the variance in the estimate might be reduced with longer chains. We are looking for a value near 1.
Posterior Distribution checking
You can take the parameters of the model at any point and generate new data from those parameters. This generates a whole new set of data which should match the original data.
Similarly, you can plot one against the other, like plotting the best fit line on top of the original data. You can regenerate this data multiple times to see the variation around the real data.
Autocorrelations of the posterior draws should be correlated at the first lag only. Correlation at higher lags (high serial correlation) imply that the model has not converged. Larger chains may help.
n_eff attempts to represent this feature.
These are checks to see if changes to model settings, even slight changes, alter the result. For example, different hyper-parameters, different, but plausible priors, different sampling distributions or additions and removal of explanatory variables.
Predictive Accuracy / Model Comparison
Use Akaike information criterion (AIC) or deviance information criterion (DIC).
Choice of Priors
Non-informative priors place a preference on the data. Typically we use weakly-informative priors. Rules of thumb or constraints on the problem. Informative priors based upon prior research.
Use conjugate distributions where possible. In a regression model, the conjugate setting uses a normal distribution for the predictor coefficients and an inverse gamma for the variance. For example, the Gaussian family is conjugate to itself (or self-conjugate) with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. More information and list of conjugates.