Markov Chain Monte Carlo & Hamiltonian Monte Carlo

Our Story Thus Far…

  • We have been using Maximum A Posteriori Approximations

  • Assumes a Gaussian posterior (i.e., an approximately quadratic log-posterior)

  • Great for simple linear Gaussian models

But…

  • Problems with models of moderate complexity
    • GLMs can get odd
  • Many problems do not have an easy analytical solution
    • Autocorrelation
    • State-Space Models
    • Mixed Models
  • Solution: simulated draws from the posterior…

Monte Carlo Sampling for Bayesian Estimation

  1. Markov Chain Monte Carlo (MCMC)

  2. The problems of MCMC

  3. Hamiltonian Monte-Carlo (HMC)

  4. HMC v. MCMC

  5. Implementing HMC

King Markov and His Islands


How to move around Islands

  • Flip a coin. Heads, move left. Tails, move right.
  • Figure out the number of people on each island.
  • Assign a probability, p = # next island / # current island
  • Choose a random number. If number < p, move.
  • Rinse and repeat

What Metropolis MCMC Looks Like

    mcmc_ex <- function(num_weeks = 1e5, 
                        current=10, 
                        positions = rep(0, num_weeks)){
      for ( i in 1:num_weeks ) {
        # record current position
        positions[i] <- current
      
        # flip coin to generate proposal
        proposal <- current + sample( c(-1,1) , size=1 )
        
        # now make sure he loops around the archipelago
        if ( proposal < 1 ) proposal <- 10
        if ( proposal > 10 ) proposal <- 1
      
        # move?
        prob_move <- proposal/current
        current <- ifelse( runif(1) < prob_move , proposal , current )
      }
    
      positions
    }
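
    A quick way to see the simulator in action; the base-R table/plot call here is just one illustrative way to summarize the draws:

    weeks <- mcmc_ex(num_weeks = 1e5)

    # The king should spend time on each island in proportion to its population
    plot(table(weeks), xlab = "island", ylab = "weeks spent there")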

    Metropolis MCMC in Action: 10 Weeks

    Population = Island Number

    Metropolis MCMC in Action: 50 Weeks

    Population = Island Number

    Metropolis MCMC in Action: 1000 Weeks

    Population = Island Number

    Metropolis MCMC For Models

    • Each island is a set of parameter choices

    • Each “population” is a posterior density

    • The path is a ‘chain’

    • Note the autocorrelation - we “thin” chains

      • Only keep every i-th sample to reduce autocorrelation (see the sketch below)
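
    A minimal sketch of thinning, using the mcmc_ex() simulator from above; keeping every 10th draw is an arbitrary illustrative choice:

    positions <- mcmc_ex(num_weeks = 1e5)

    # Keep only every 10th draw to reduce autocorrelation between successive samples
    thinned <- positions[seq(1, length(positions), by = 10)]

    # Compare lag-1 autocorrelation before and after thinning
    acf(positions, lag.max = 1, plot = FALSE)
    acf(thinned, lag.max = 1, plot = FALSE)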

    MCMC In Practice for Models

    from http://www.cnblogs.com/Nietzsche/p/4255948.html

    Monte Carlo Sampling for Bayesian Estimation

    1. Markov Chain Monte Carlo (MCMC)

    2. The problems of MCMC

    3. Hamiltonian Monte-Carlo (HMC)

    4. HMC v. MCMC

    5. Implementing HMC

    MCMC is not a Panacea


    How can MCMC Fail?

    • MCMC (particularly Metropolis) can get stuck
      • High correlation between parameters
      • Concentration of Measure problem
    • Start values can still be important
      • But, who can always choose the right start values?
    • One way we assess this is to fit many chains and check that they converge - this can lead to depression as chains never converge, or converge on impossible values

    Concentration of Measure Problem

    As dimensionality increases, more and more of the posterior density lies away from the mode, so your samples land further and further from the mode
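
    An illustrative simulation (a sketch, not from the slides): draws from a d-dimensional standard normal sit on average about sqrt(d) away from the mode, so in high dimensions almost no draws land near the mode.

    # Radial distance of draws from the mode of a d-dimensional standard normal
    set.seed(1)
    for (d in c(1, 10, 100)) {
      x <- matrix(rnorm(1e4 * d), ncol = d)   # 10,000 draws in d dimensions
      dist_from_mode <- sqrt(rowSums(x^2))    # Euclidean distance from the mode at 0
      cat("d =", d, " mean distance from mode:", round(mean(dist_from_mode), 2), "\n")
    }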


    This is a Real Problem - Perhaps Bigger Than We Knew

    MCMC Algorithm Solution

    • Metropolis MCMC inefficient and prone to getting stuck

    • Many algorithms to come up with clever proposed moves to speed up

    • Gibbs sampling used for BUGS, JAGS, etc.

      • Still has same problems as Metropolis
    • Or… Abandon search and use more deterministic sampling

      • Hamiltonian MCMC

    Monte Carlo Sampling for Bayesian Estimation

    1. Markov Chain Monte Carlo (MCMC)

    2. The problems of MCMC

    3. Hamiltonian Monte-Carlo (HMC)

    4. HMC v. MCMC

    5. Implementing HMC

    King Monty of the Valley and Hamilton, His Advisor

    Moving Around the Valley

    • Royal carriage is pushed in a random direction with random momentum.

    • As it moves up the hill, it slows down until it reverses.

    • After a time, we stop it and record its location.

    • It will spend more time at the bottom of the hill (where most people live).

    arXiv:2006.16194v2

    How Monty Samples

    A Few Tuning/Algorithmic Details

    • We need to choose how large a step to take between each recalculation of the gradient
      • Automated in the warmup phase
      • Warmup is NOT burn-in - warmup draws are used only for tuning and carry no posterior information
      • These are leapfrog steps (see the sketch after this list)
    • We do NOT want our chain to come back to where it started
      • No-U-Turn Sampler (NUTS)
      • Calibrated during warmup as well
      • Algorithms are always improving
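
    A minimal sketch of one leapfrog trajectory, following Neal (2011) rather than Stan's actual implementation, for a 1-D standard normal target where U(q) = q^2/2 and grad U(q) = q; the step size eps and number of steps L are exactly the quantities tuned during warmup:

    # One leapfrog trajectory for a 1-D standard normal target
    leapfrog <- function(q, p, eps, L, grad_U = function(q) q) {
      p <- p - eps / 2 * grad_U(q)               # half step for momentum
      for (i in 1:L) {
        q <- q + eps * p                         # full step for position
        if (i < L) p <- p - eps * grad_U(q)      # full step for momentum (except at the end)
      }
      p <- p - eps / 2 * grad_U(q)               # final half step for momentum
      list(q = q, p = -p)                        # negate momentum so the proposal is reversible
    }

    leapfrog(q = 0, p = rnorm(1), eps = 0.1, L = 20)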

    Monte Carlo Sampling for Bayesian Estimation

    1. Markov Chain Monte Carlo (MCMC)

    2. The problems of MCMC

    3. Hamiltonian Monte-Carlo (HMC)

    4. HMC v. MCMC

    5. Implementing HMC

    Metropolis versus Hamilton

    Prokhorenko et al 2018

    Metropolis versus Hamiltonian

    Neal 2011, http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html


    Let’s Explore HMC and MCMC

    Monte Carlo Sampling for Bayesian Estimation

    1. Markov Chain Monte Carlo (MCMC)

    2. The problems of MCMC

    3. Hamiltonian Monte-Carlo (HMC)

    4. HMC v. MCMC

    5. Implementing HMC

    Implementing HMCMC via Stan

    • We use the ulam function (from the rethinking package) to call Stan

      • Honors Stanislaw Ulam, one of the originators of Monte-Carlo computing.
      • Stan is named after him!
      • Compiles a model, so it can take a while
    • Can specify number of chains and other parameters

    • And now our samples are already part of our model!

    • Careful, models can get large (in size) depending on number of parameters and samples

    Data Prep for HMC

    library(rethinking)

    data(rugged)
    rugged$log_gdp <- log(rugged$rgdppc_2000)

    # Prep the data: drop rows with missing GDP
    d <- rugged[complete.cases(rugged$log_gdp), ]

    # Need indices > 0
    d$cont_africa <- d$cont_africa + 1

    # Only keep the variables we are using
    # (otherwise it slows Stan down)
    d.trim <- d[, c("log_gdp", "rugged", "cont_africa")]

    The Model…

    int_mod <- alist(
      #likelihood
      log_gdp ~ dnorm(mu, sigma),
      
      #Data generating process
      mu <- bR[cont_africa]*rugged + bA[cont_africa],
      
      #priors
      bR[cont_africa] ~ dnorm(0,1),
      bA[cont_africa] ~ dnorm(8,100),
      sigma ~ dcauchy(0,2)
    )
    Wait, Cauchy???

    Sidenote: the Cauchy Distribution

    • Pronounced Ko-she
    • The ratio of two independent standard normals follows a Cauchy distribution
    • Large, thick tails
      • Extreme values are regularly sampled
    • We use a half-Cauchy here, so only positive values (see the comparison sketch below)
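
    A small sketch of how much heavier the Cauchy tail is than an exponential; the scale and rate values here simply mirror the prior choices above:

    # Compare upper quantiles of a half-Cauchy(0, 2) and an Exponential(1)
    set.seed(42)
    quantile(abs(rcauchy(1e5, location = 0, scale = 2)), c(0.5, 0.9, 0.99))
    quantile(rexp(1e5, rate = 1), c(0.5, 0.9, 0.99))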

    Cauchy v. Exponential

    Fitting with ulam

    fit <- ulam(int_mod, data=d.trim)
    Running MCMC with 1 chain, with 1 thread(s) per chain...
    
    Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 1 finished in 0.1 seconds.
    • Note where errors occur
      • Warmup only?
      • How often in your chain?

    How Much Time Did it Take?

    show(fit)

    Hamiltonian Monte Carlo approximation
    500 samples from 1 chain
    
    Sampling durations (seconds):
      chain_id warmup sampling total
    1        1   0.04     0.04  0.09
    
    Formula:
    log_gdp ~ dnorm(mu, sigma)
    mu <- bR[cont_africa] * rugged + bA[cont_africa]
    bR[cont_africa] ~ dnorm(0, 1)
    bA[cont_africa] ~ dnorm(8, 100)
    sigma ~ dcauchy(0, 2)

    How Do Your Chains Look?

    traceplot(fit)
    • Can use window argument to remove initial bit

    How Does Your Chain Look?

    traceplot(fit, window = c(50, 1e3))

    A Well Shaped Posterior

    pairs(fit, depth = 2)

    Inspect Your Output and Effective Samples

    summary(fit)
                mean         sd        5.5%       94.5%      rhat ess_bulk
    bR[1] -0.2027438 0.08174296 -0.33076321 -0.07174654 1.0025347 321.6917
    bR[2]  0.1823275 0.11145994  0.01284547  0.35732397 0.9988120 308.9929
    bA[1]  9.2206960 0.14386451  8.99236180  9.44758615 1.0124792 307.1217
    bA[2]  7.2848738 0.19399302  6.97542315  7.60929605 0.9991098 354.8102
    sigma  0.9546533 0.05306223  0.86897331  1.04197770 0.9992848 331.2134
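
    The posterior draws themselves live inside the fitted object and can be pulled out with extract.samples() from rethinking for further work - a quick sketch:

    # Pull the raw posterior draws out of the fitted ulam object
    post <- extract.samples(fit)
    str(post)          # a list of draws for bR, bA, and sigma
    mean(post$sigma)   # e.g., posterior mean of sigma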

    To See if Chains are Really Covering Sample Space, Refit with Multiple Chains

    fit_4 <- ulam(int_mod, data=d.trim,
                  chains = 4, cores = 4)
    Running MCMC with 4 parallel chains, with 1 thread(s) per chain...
    
    Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
    Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
    Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
    Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
    Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
    Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
    Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
    Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
    Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
    Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
    Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
    Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 1 finished in 0.2 seconds.
    Chain 2 finished in 0.1 seconds.
    Chain 3 finished in 0.2 seconds.
    Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
    Chain 4 finished in 0.2 seconds.
    
    All 4 chains finished successfully.
    Mean chain execution time: 0.2 seconds.
    Total execution time: 0.3 seconds.

    Your Four Chains

    show(fit_4)
    Hamiltonian Monte Carlo approximation
    2000 samples from 4 chains
    
    Sampling durations (seconds):
      chain_id warmup sampling total
    1        1   0.08     0.08  0.16
    2        2   0.07     0.08  0.15
    3        3   0.07     0.09  0.16
    4        4   0.09     0.09  0.17
    
    Formula:
    log_gdp ~ dnorm(mu, sigma)
    mu <- bR[cont_africa] * rugged + bA[cont_africa]
    bR[cont_africa] ~ dnorm(0, 1)
    bA[cont_africa] ~ dnorm(8, 100)
    sigma ~ dcauchy(0, 2)

    Inspect Your Chains for Convergence!

    The Trankplot

    trankplot(fit_4)
    • Shows histograms of sample ranks for each chain.
    • Chains should be freely mixing - overlapping, roughly uniform rank histograms.

    Numerical Assessment of Convergence

    summary(fit_4)
                mean         sd        5.5%       94.5%     rhat ess_bulk
    bR[1] -0.1993123 0.07409950 -0.31862377 -0.07504129 1.002517 1199.031
    bR[2]  0.1880488 0.10388902  0.02127217  0.35425577 1.001472 1136.127
    bA[1]  9.2163276 0.13686809  8.99009865  9.43239495 1.003633 1145.768
    bA[2]  7.2790567 0.17194453  7.01205680  7.56030135 1.003980 1084.088
    sigma  0.9481244 0.05323086  0.86810399  1.03436385 1.000101 1482.074
    • ess_bulk (effective sample size, a.k.a. n_eff) is the effective number of independent samples
      • Should be reasonably large (at least 200)
    • Rhat is a measure of convergence
      • Gelman-Rubin diagnostic
      • Should be 1 - even 1.01 is suspect
    • Treat as warnings - necessary but not sufficient

    General Workflow

    1. Fit one chain to make sure things look OK
      • warmup = 1000, iter=2000 OK
    2. Fit multiple chains to ensure convergence
      • Inspect n_eff and r_hat
      • Make sure posterior converges and is stationary
      • Tune HMC and model parameters if needed
    3. Fit one long, final run once everything is tuned (see the sketch below)
      • Can use more chains if you have the processors
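
    A sketch of what that final run might look like with ulam; the chain, warmup, and iteration counts here are illustrative choices, not recommendations:

    # Final fit: tuned model, multiple chains run in parallel, more iterations
    fit_final <- ulam(int_mod, data = d.trim,
                      chains = 4, cores = 4,
                      warmup = 1000, iter = 2000)
    precis(fit_final, depth = 2)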

    What do bad chains look like?

    y <- c(-1,1)
    set.seed(11)
    m9.2 <- ulam(
        alist(
            y ~ dnorm( mu , sigma ) ,
            mu <- alpha ,
            alpha ~ dnorm( 0 , 1000 ) ,
            sigma ~ dexp( 0.0001 )
        ) , data=list(y=y) , chains=3 )

    What do bad chains look like?

               mean        sd        5.5%     94.5%     rhat  ess_bulk
    alpha -21.39546  360.6928 -642.619250  504.6249 1.068819 148.11487
    sigma 562.40833 1331.1809    5.726646 2539.1058 1.069306  38.32935

    The Wandering Chains

    set.seed(384)
    m9.4 <- ulam(
        alist(
            y ~ dnorm( mu , sigma ) ,
            mu <- a1 + a2 ,
            a1 ~ dnorm( 0 , 1000 ),
            a2 ~ dnorm( 0 , 1000 ),
            sigma ~ dexp( 1 )
        ) , data=list(y=y) , chains=3 )

    Warnings

    Warning: 1 of 1500 (0.0%) transitions ended with a divergence.
    See https://mc-stan.org/misc/warnings for details.
    
    Warning: 1103 of 1500 (74.0%) transitions hit the maximum treedepth limit of 10.
    See https://mc-stan.org/misc/warnings for details.

    Chains inefficient. Can change tree depth, but…

    Wandering Chains

    Lack of Convergence

    • Might be that the model has not found good values

    • More likely, a bad model

      • Too many parameters
      • Redundant parameters
      • Poor fit to data
      • Bad priors

    Taming a Bad Chain with Informative Priors

    set.seed(11)
    m9.3 <- ulam(
        alist(
            y ~ dnorm( mu , sigma ) ,
            mu <- alpha ,
            alpha ~ dnorm( 1 , 10 ) ,
            sigma ~ dexp( 1 )
        ) , data=list(y=y) , chains=3 )
    
    precis( m9.3 )

    Still Looks Odd…

    Eh….

    Folk theorem of statistical computing: When you are having trouble fitting a model, it often indicates a bad model.



    In many ways, when HMC and MCMC fail, they help you find flaws in your own intuition about model structure and how to build a good model, rather than just being a PITA.

    Frances Capel