Markov Chain Monte Carlo & Hamiltonian Monte Carlo

Our Story Thus Far…

  • We have been using Maximum A Posteriori Approximations

  • Assumes a Gaussian posterior (i.e., an approximately quadratic log-posterior)

  • Great for simple linear Gaussian models

But…

  • Problems with models of moderate complexity
    • GLMs can get odd
  • Many problems do not have an easy analytical solution
    • Autocorrelation
    • State-Space Models
    • Mixed Models
  • Solution: simulated draws from the posterior…

Monte Carlo Sampling for Bayesian Estimation

  1. Markov Chain Monte Carlo (MCMC)

  2. The problems of MCMC

  3. Hamiltonian Monte-Carlo (HMC)

  4. HMC v. MCMC

  5. Implementing HMC

King Markov and His Islands


How to move around Islands

  • Flip a coin. Heads, move left. Tails, move right.
  • Figure out the number of people on each island.
  • Assign a probability, p = # next island / # current island
  • Choose a random number. If number < p, move.
  • Rinse and repeat

What Metropolis MCMC Looks Like

    mcmc_ex <- function(num_weeks = 1e5, 
                        current=10, 
                        positions = rep(0, num_weeks)){
      for ( i in 1:num_weeks ) {
        # record current position
        positions[i] <- current
      
        # flip coin to generate proposal
        proposal <- current + sample( c(-1,1) , size=1 )
        
        # now make sure he loops around the archipelago
        if ( proposal < 1 ) proposal <- 10
        if ( proposal > 10 ) proposal <- 1
      
        # move?
        prob_move <- proposal/current
        current <- ifelse( runif(1) < prob_move , proposal , current )
      }
    
      positions
    }
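
    A quick way to see the simulator in action; the base-R table/plot call here is just one illustrative way to summarize the draws:

    weeks <- mcmc_ex(num_weeks = 1e5)

    # The king should spend time on each island in proportion to its population
    plot(table(weeks), xlab = "island", ylab = "weeks spent there")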

    Metropolis MCMC in Action: 10 Weeks

    Population = Island Number

    Metropolis MCMC in Action: 50 Weeks

    Population = Island Number

    Metropolis MCMC in Action: 1000 Weeks

    Population = Island Number

    Metropolis MCMC For Models

    • Each island is a set of parameter choices

    • Each “population” is a posterior density

    • The path is a ‘chain’

    • Note the autocorrelation - we “thin” chains

      • Only keep every i-th sample to reduce autocorrelation (see the sketch below)
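
    A minimal sketch of thinning, using the mcmc_ex() simulator from above; keeping every 10th draw is an arbitrary illustrative choice:

    positions <- mcmc_ex(num_weeks = 1e5)

    # Keep only every 10th draw to reduce autocorrelation between successive samples
    thinned <- positions[seq(1, length(positions), by = 10)]

    # Compare lag-1 autocorrelation before and after thinning
    acf(positions, lag.max = 1, plot = FALSE)
    acf(thinned, lag.max = 1, plot = FALSE)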

    MCMC In Practice for Models

    from http://www.cnblogs.com/Nietzsche/p/4255948.html

    Monte Carlo Sampling for Bayesian Estimation

    1. Markov Chain Monte Carlo (MCMC)

    2. The problems of MCMC

    3. Hamiltonian Monte-Carlo (HMC)

    4. HMC v. MCMC

    5. Implementing HMC

    MCMC is not a Panacea


    How can MCMC Fail?

    • MCMC (particularly Metropolis) can get stuck
      • High correlation between parameters
      • Concentration of Measure problem
    • Start values can still be important
      • But, who can always choose the right start values?
    • One way we assess this is to fit many chains and check that they converge - this can lead to depression as chains never converge, or converge on impossible values

    Concentration of Measure Problem

    As dimensionality increases, more and more of the posterior density lies away from the mode, so your samples land further and further from the mode
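
    An illustrative simulation (a sketch, not from the slides): draws from a d-dimensional standard normal sit on average about sqrt(d) away from the mode, so in high dimensions almost no draws land near the mode.

    # Radial distance of draws from the mode of a d-dimensional standard normal
    set.seed(1)
    for (d in c(1, 10, 100)) {
      x <- matrix(rnorm(1e4 * d), ncol = d)   # 10,000 draws in d dimensions
      dist_from_mode <- sqrt(rowSums(x^2))    # Euclidean distance from the mode at 0
      cat("d =", d, " mean distance from mode:", round(mean(dist_from_mode), 2), "\n")
    }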


    This is a Real Problem - Perhaps Bigger Than We Knew

    MCMC Algorithm Solution

    • Metropolis MCMC inefficient and prone to getting stuck

    • Many algorithms to come up with clever proposed moves to speed up

    • Gibbs sampling used for BUGS, JAGS, etc.

      • Still has same problems as Metropolis
    • Or… Abandon search and use more deterministic sampling

      • Hamiltonian MCMC

    Monte Carlo Sampling for Bayesian Estimation

    1. Markov Chain Monte Carlo (MCMC)

    2. The problems of MCMC

    3. Hamiltonian Monte-Carlo (HMC)

    4. HMC v. MCMC

    5. Implementing HMC

    King Monty of the Valley and Hamilton, His Advisor

    Moving Around the Valley

    • Royal carriage is pushed in a random direction with random momentum.

    • As it moves up the hill, it slows down until it reverses.

    • After a time, we stop it and record its location.

    • It will spend more time at the bottom of the hill (where most people live).

    arXiv:2006.16194v2

    How Monty Samples

    A Few Tuning/Algorithmic Details

    • We need to choose how large a step to take between each recalculation of the gradient
      • Automated in the warmup phase
      • Warmup is NOT burn-in - warmup draws are used only for tuning and carry no posterior information
      • These are leapfrog steps (see the sketch after this list)
    • We do NOT want our chain to come back to where it started
      • No-U-Turn Sampler (NUTS)
      • Calibrated during warmup as well
      • Algorithms are always improving
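
    A minimal sketch of one leapfrog trajectory, following Neal (2011) rather than Stan's actual implementation, for a 1-D standard normal target where U(q) = q^2/2 and grad U(q) = q; the step size eps and number of steps L are exactly the quantities tuned during warmup:

    # One leapfrog trajectory for a 1-D standard normal target
    leapfrog <- function(q, p, eps, L, grad_U = function(q) q) {
      p <- p - eps / 2 * grad_U(q)               # half step for momentum
      for (i in 1:L) {
        q <- q + eps * p                         # full step for position
        if (i < L) p <- p - eps * grad_U(q)      # full step for momentum (except at the end)
      }
      p <- p - eps / 2 * grad_U(q)               # final half step for momentum
      list(q = q, p = -p)                        # negate momentum so the proposal is reversible
    }

    leapfrog(q = 0, p = rnorm(1), eps = 0.1, L = 20)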

    Monte Carlo Sampling for Bayesian Estimation

    1. Markov Chain Monte Carlo (MCMC)

    2. The problems of MCMC

    3. Hamiltonian Monte-Carlo (HMC)

    4. HMC v. MCMC

    5. Implementing HMC

    Metropolis versus Hamilton

    Prokhorenko et al 2018

    Metropolis versus Hamiltonian

    Neal 2011, http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html


    Let’s Explore HMC and MCMC

    Monte Carlo Sampling for Bayesian Estimation

    1. Markov Chain Monte Carlo (MCMC)

    2. The problems of MCMC

    3. Hamiltonian Monte-Carlo (HMC)

    4. HMC v. MCMC

    5. Implementing HMC

    Implementing HMCMC via Stan

    • We use the ulam function (from the rethinking package) to call Stan

      • Honors Stanislaw Ulam, one of the originators of Monte-Carlo computing.
      • Stan is named after him!
      • Compiles a model, so it can take a while
    • Can specify number of chains and other parameters

    • And now our samples are already part of our model!

    • Careful, models can get large (in size) depending on number of parameters and samples

    Data Prep for HMC

    library(rethinking)

    data(rugged)
    rugged$log_gdp <- log(rugged$rgdppc_2000)

    # Prep the data: drop rows with missing GDP
    d <- rugged[complete.cases(rugged$log_gdp), ]

    # Need indices > 0
    d$cont_africa <- d$cont_africa + 1

    # Only keep the variables we are using
    # (otherwise it slows Stan down)
    d.trim <- d[, c("log_gdp", "rugged", "cont_africa")]

    The Model…

    int_mod <- alist(
      #likelihood
      log_gdp ~ dnorm(mu, sigma),
      
      #Data generating process
      mu <- bR[cont_africa]*rugged + bA[cont_africa],
      
      #priors
      bR[cont_africa] ~ dnorm(0,1),
      bA[cont_africa] ~ dnorm(8,100),
      sigma ~ dcauchy(0,2)
    )
    Wait, Cauchy???

    Sidenote: the Cauchy Distribution

    • Pronounced Ko-she
    • The ratio of two independent standard normals follows a Cauchy distribution
    • Large, thick tails
      • Extreme values are regularly sampled
    • We use a half-Cauchy here, so only positive values (see the comparison sketch below)
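
    A small sketch of how much heavier the Cauchy tail is than an exponential; the scale and rate values here simply mirror the prior choices above:

    # Compare upper quantiles of a half-Cauchy(0, 2) and an Exponential(1)
    set.seed(42)
    quantile(abs(rcauchy(1e5, location = 0, scale = 2)), c(0.5, 0.9, 0.99))
    quantile(rexp(1e5, rate = 1), c(0.5, 0.9, 0.99))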

    Cauchy v. Exponential

    Fitting with ulam

    fit <- ulam(int_mod, data=d.trim)
    Running MCMC with 1 chain, with 1 thread(s) per chain...
    
    Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 1 finished in 0.1 seconds.
    • Note where errors occur
      • Warmup only?
      • How often in your chain?

    How Much Time Did it Take?

    show(fit)

    Hamiltonian Monte Carlo approximation
    500 samples from 1 chain
    
    Sampling durations (seconds):
      chain_id warmup sampling total
    1        1   0.04     0.04  0.09
    
    Formula:
    log_gdp ~ dnorm(mu, sigma)
    mu <- bR[cont_africa] * rugged + bA[cont_africa]
    bR[cont_africa] ~ dnorm(0, 1)
    bA[cont_africa] ~ dnorm(8, 100)
    sigma ~ dcauchy(0, 2)

    How Do Your Chains Look?

    traceplot(fit)
    • Can use window argument to remove initial bit

    How Does Your Chain Look?

    traceplot(fit, window = c(50, 1e3))

    A Well Shaped Posterior

    pairs(fit, depth = 2)

    Inspect Your Output and Effective Samples

    summary(fit)
                mean         sd        5.5%       94.5%      rhat ess_bulk
    bR[1] -0.2027438 0.08174296 -0.33076321 -0.07174654 1.0025347 321.6917
    bR[2]  0.1823275 0.11145994  0.01284547  0.35732397 0.9988120 308.9929
    bA[1]  9.2206960 0.14386451  8.99236180  9.44758615 1.0124792 307.1217
    bA[2]  7.2848738 0.19399302  6.97542315  7.60929605 0.9991098 354.8102
    sigma  0.9546533 0.05306223  0.86897331  1.04197770 0.9992848 331.2134
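
    The posterior draws themselves live inside the fitted object and can be pulled out with extract.samples() from rethinking for further work - a quick sketch:

    # Pull the raw posterior draws out of the fitted ulam object
    post <- extract.samples(fit)
    str(post)          # a list of draws for bR, bA, and sigma
    mean(post$sigma)   # e.g., posterior mean of sigma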

    To See if Chains are Really Covering Sample Space, Refit with Multiple Chains

    fit_4 <- ulam(int_mod, data=d.trim,
                  chains = 4, cores = 4)
    Running MCMC with 4 parallel chains, with 1 thread(s) per chain...
    
    Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 1 finished in 0.2 seconds.
    Chain 2 finished in 0.1 seconds.
    Chain 3 finished in 0.2 seconds.
    Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 4 finished in 0.2 seconds.
    
    All 4 chains finished successfully.
    Mean chain execution time: 0.2 seconds.
    Total execution time: 0.3 seconds.

    Your Four Chains

    show(fit_4)
    Hamiltonian Monte Carlo approximation
    2000 samples from 4 chains
    
    Sampling durations (seconds):
      chain_id warmup sampling total
    1        1   0.08     0.08  0.16
    2        2   0.07     0.08  0.15
    3        3   0.07     0.09  0.16
    4        4   0.09     0.09  0.17
    
    Formula:
    log_gdp ~ dnorm(mu, sigma)
    mu <- bR[cont_africa] * rugged + bA[cont_africa]
    bR[cont_africa] ~ dnorm(0, 1)
    bA[cont_africa] ~ dnorm(8, 100)
    sigma ~ dcauchy(0, 2)

    Inspect Your Chains for Convergence!

    The Trankplot

    trankplot(fit_4)
    • Shows histograms of sample ranks for each chain.
    • Chains should be freely mixing - overlapping, roughly uniform rank histograms.

    Numerical Assessment of Convergence

    summary(fit_4)
                mean         sd        5.5%       94.5%     rhat ess_bulk
    bR[1] -0.1993123 0.07409950 -0.31862377 -0.07504129 1.002517 1199.031
    bR[2]  0.1880488 0.10388902  0.02127217  0.35425577 1.001472 1136.127
    bA[1]  9.2163276 0.13686809  8.99009865  9.43239495 1.003633 1145.768
    bA[2]  7.2790567 0.17194453  7.01205680  7.56030135 1.003980 1084.088
    sigma  0.9481244 0.05323086  0.86810399  1.03436385 1.000101 1482.074
    • ess_bulk (effective sample size, a.k.a. n_eff) is the effective number of independent samples
      • Should be reasonably large (at least 200)
    • Rhat is a measure of convergence
      • Gelman-Rubin diagnostic
      • Should be 1 - even 1.01 is suspect
    • Treat as warnings - necessary but not sufficient

    General Workflow

    1. Fit one chain to make sure things look OK
      • warmup = 1000, iter=2000 OK
    2. Fit multiple chains to ensure convergence
      • Inspect n_eff and r_hat
      • Make sure posterior converges and is stationary
      • Tune HMC and model parameters if needed
    3. Fit one long, final run once everything is tuned (see the sketch below)
      • Can use more chains if you have the processors
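
    A sketch of what that final run might look like with ulam; the chain, warmup, and iteration counts here are illustrative choices, not recommendations:

    # Final fit: tuned model, multiple chains run in parallel, more iterations
    fit_final <- ulam(int_mod, data = d.trim,
                      chains = 4, cores = 4,
                      warmup = 1000, iter = 2000)
    precis(fit_final, depth = 2)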

    What do bad chains look like?

    y <- c(-1,1)
    set.seed(11)
    m9.2 <- ulam(
        alist(
            y ~ dnorm( mu , sigma ) ,
            mu <- alpha ,
            alpha ~ dnorm( 0 , 1000 ) ,
            sigma ~ dexp( 0.0001 )
        ) , data=list(y=y) , chains=3 )

    What do bad chains look like?

               mean        sd        5.5%     94.5%     rhat  ess_bulk
    alpha -21.39546  360.6928 -642.619250  504.6249 1.068819 148.11487
    sigma 562.40833 1331.1809    5.726646 2539.1058 1.069306  38.32935

    The Wandering Chains

    set.seed(384)
    m9.4 <- ulam(
        alist(
            y ~ dnorm( mu , sigma ) ,
            mu <- a1 + a2 ,
            a1 ~ dnorm( 0 , 1000 ),
            a2 ~ dnorm( 0 , 1000 ),
            sigma ~ dexp( 1 )
        ) , data=list(y=y) , chains=3 )

    Warnings

    Warning: 1 of 1500 (0.0%) transitions ended with a divergence.
    See https://mc-stan.org/misc/warnings for details.
    
    Warning: 1103 of 1500 (74.0%) transitions hit the maximum treedepth limit of 10.
    See https://mc-stan.org/misc/warnings for details.

    Chains inefficient. Can change tree depth, but…

    Wandering Chains

    Lack of Convergence

    • Might be that the model has not found good values

    • More likely, a bad model

      • Too many parameters
      • Redundant parameters
      • Poor fit to data
      • Bad priors

    Taming a Bad Chain with Informative Priors

    set.seed(11)
    m9.3 <- ulam(
        alist(
            y ~ dnorm( mu , sigma ) ,
            mu <- alpha ,
            alpha ~ dnorm( 1 , 10 ) ,
            sigma ~ dexp( 1 )
        ) , data=list(y=y) , chains=3 )
    
    precis( m9.3 )

    Still Looks Odd…

    Eh….

    Folk theorem of statistical computing: When you are having trouble fitting a model, it often indicates a bad model.



    In many ways, when HMC and MCMC fail, they help you find flaws in your own intuition about model structure and how to build a good model, rather than just being a PITA.

    Frances Capel