library(dplyr)
library(rethinking)
library(tidybayes)
library(tidybayes.rethinking)
library(ggplot2)
library(patchwork)
library(palmerpenguins)
library(naniar) #for missing data
vis_miss(penguins)
Missing Data with Bayes and Rethinking
1. Simple Missing Data Model
Let’s start with an old favorite. The Palmer Penguins! Shockingly, it does indeed have missing data. Let’s use naniar to explore a bit about the missingness.
1.1 Exploring Missingness
Missing data in sex, and some in measurements of bill length, depth, etc. Let’s look a little more closely to see if these are often complete cases, or just one variable at a time.
naniar::gg_miss_upset(penguins)
OK, there are two cases of lots of missing values, and 9 of just sex.
Let’s see if missingness is associated with anything. In the test below, from the naniar package, the null hypothesis is that the data are MCAR; a large p value means we fail to reject that null.
naniar::mcar_test(penguins)
# A tibble: 1 × 4
statistic df p.value missing.patterns
<dbl> <dbl> <dbl> <int>
1 12.7 10 0.242 3
Whew. OK. No deep thinking there!
Let’s assume things were even worse, though. Let’s say we wanted to look at how body mass and species interacted to predict bill depth, and we were missing 20 body mass measurements. Let’s make that data.
set.seed(2024)
penguins_na <- penguins |>
  select(species, body_mass_g, bill_depth_mm) |>
  filter(!(is.na(bill_depth_mm))) |>
  mutate(species_id = as.factor(species) |> as.numeric())

penguins_na$body_mass_g[sample(1:nrow(penguins_na), size = 20)] <- NA

vis_miss(penguins_na)
1.2 A Model for Missingness
Let’s fit the model, modeling body mass as coming from a normal distribution so its missing values can be imputed. We’ll standardize first to make life easier.
p <- with(penguins_na,
          list(bill_depth_mm = standardize(bill_depth_mm),
               body_mass_g = standardize(body_mass_g),
               species = species,
               species_id = species_id,
               is_imputed = is.na(body_mass_g),
               N = nrow(penguins_na)))

# model
depth_mod <- alist(
  # Likelihood
  bill_depth_mm ~ dnorm(mu, sigma),

  # DGP
  mu <- a[species_id] + b[species_id] * body_mass_g,

  # Missing data
  body_mass_g ~ dnorm(nu, sigma_mass),

  # define body_mass_g_merge as mix of observed and imputed values
  # body_mass_g_merge <-
  #   merge_missing( body_mass_g , body_mass_g_impute ),

  # Priors
  a[species_id] ~ dnorm(0, 2),
  b[species_id] ~ dnorm(0, 2),
  nu ~ dnorm(0, 2),
  sigma_mass ~ dhalfnorm(0, 2),
  sigma ~ dhalfnorm(0, 2)
) |> ulam(data = p, chains = 3, log_lik = TRUE)
Great, it worked!
One thing we might worry about, though, is the body masses themselves NOT coming from a normal distribution, but rather from a normal that differs by species. Fortunately, that’s just an index. We can write a similar model that allows nu
to vary by species.
depth_mod_sp <- alist(
  # Likelihood
  bill_depth_mm ~ dnorm(mu, sigma),

  # DGP
  mu <- a[species_id] + b[species_id] * body_mass_g,

  # Missing data
  body_mass_g ~ dnorm(nu[species_id], sigma_mass),

  # Priors
  a[species_id] ~ dnorm(0, 2),
  b[species_id] ~ dnorm(0, 2),
  nu[species_id] ~ dnorm(0, 2),
  sigma_mass ~ dhalfnorm(0, 2),
  sigma ~ dhalfnorm(0, 2)
) |> ulam(data = p, chains = 3, log_lik = TRUE)
Is one better than the other?
compare(depth_mod_sp, depth_mod)
WAIC SE dWAIC dSE pWAIC weight
depth_mod_sp 425.2749 27.32643 0.00000 NA 10.94201 0.994150488
depth_mod 435.5459 27.17503 10.27106 3.205515 14.53277 0.005849512
So, letting the body mass distribution (and hence the imputation) vary by species does make the model more predictive.
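We can also peek at the imputed body masses themselves. A minimal sketch: ulam typically stores the automatically imputed values as parameters named something like body_mass_g_impute[1] through body_mass_g_impute[20] (that naming is an assumption - check the output if it differs).

# Full posterior summary; look for the body_mass_g_impute rows, which are the
# model's estimates of the 20 missing (standardized) body masses.
precis(depth_mod_sp, depth = 2)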
2. Spatial Interpolation
OK, the beauty of the above is that we can use the same framework for space, time, and Gaussian processes. Let’s look at some data about boreal forests and try to interpolate across space.
boreal <- read.table("http://biol609.github.io/lectures/data/06/Boreality.txt",
                     header = TRUE)
ggplot(boreal,
       aes(x = x, y = y, color = NDVI, size = Wet)) +
  geom_point() +
  theme_bw() +
  scale_size_continuous("Index of Wetness", range = c(0, 7)) +
  scale_color_gradient("NDVI", low = "lightgreen", high = "darkgreen")
We can fit a Gaussian process model to this data by creating a distance matrix from the coordinates. First, the matrix.
# make a distance matrix on the km scale
dmat <- with(boreal,
             dist(cbind(x, y))) |> as.matrix()/1e3

image(dmat)
# standardize your data
boreal <- boreal |>
  mutate(NDVI_s = standardize(NDVI),
         Wet_s = standardize(Wet))
Now, let’s model NDVI as a function of wetness given the distances between points. I’m not going to run this, as it takes a looooong time. But, I wanted to present it so we can see the next step.
ndvi_mod <- alist(
  # likelihood
  NDVI_s ~ dnorm(mu, sigma),

  # Data Generating Process
  mu <- k[point_id] + bp * Wet_s,

  # Gaussian Process
  vector[N]:k ~ multi_normal( 0 , SIGMA ),
  matrix[N,N]:SIGMA <- cov_GPL2( dmat , etasq , rhosq , 0.01 ),

  # Priors
  bp ~ dnorm(0, 1),
  sigma ~ dhalfnorm(0, 2),
  etasq ~ dhalfnorm(0, 2),
  rhosq ~ dexp(2)
)
ndvi_fit <- ulam(ndvi_mod,
                 data = list(Wet_s = boreal$Wet_s,
                             NDVI_s = boreal$NDVI_s,
                             dmat = dmat,
                             point_id = 1:nrow(boreal),
                             N = nrow(boreal)),
                 start = list(k = rep(0, nrow(boreal))),
                 chains = 3)
So, what would we want to do next if we wanted to make new predictions? We have a few options. In the past we would have fed a new data frame to the model. We could do that here, but we would have to write a lot of ancillary code to take a distance matrix and a set of predictors and turn it into something useful. See here for an example.
OR - we could look at this as a data imputation problem. The model we have above takes a distance matrix for each point and a level of Wet_s
for each point and an observed NDVI_s
for each point. This is a dataset looking at trees. Let’s assume there are 30 trees in the upper right-hand corner of the image for which, for whatever reason, our satellite didn’t pick up NDVI - maybe there was too much image noise. But we have their coordinates and a measurement of wetness. We can build them in as missing data and impute them. We have even already specified a distribution for NDVI_s
as our likelihood.
What do we need, then? Well, we need points, Wet_s
values, and a new distance matrix with the new points in it.
Let’s start by adding data.
boreal_new <- boreal |>
  select(x, y, NDVI_s, Wet_s) |>
  bind_rows(tibble(
    x = runif(30, 1500, 2000),
    y = runif(30, 5000, 7000),
    Wet_s = rnorm(30, 0, 1)
  ))
From this, we can make a new distance matrix.
dmat_new <- with(boreal_new,
                 dist(cbind(x, y))) |> as.matrix()/1e3
At which point we can run our model again - but now with this new missing data.
ndvi_fit <- ulam(ndvi_mod,
                 data = list(Wet_s = boreal_new$Wet_s,
                             NDVI_s = boreal_new$NDVI_s,
                             dmat = dmat_new,
                             point_id = 1:nrow(boreal_new),
                             N = nrow(boreal_new)),
                 start = list(k = rep(0, nrow(boreal_new))),
                 chains = 3)
And that’s it!
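If we want to see what the model thinks those missing NDVI values are, they live in the posterior as well. A minimal sketch, assuming ulam names the automatically imputed values NDVI_s_impute[1] through NDVI_s_impute[30] (check the output if your version names them differently):

# Full posterior summary; the NDVI_s_impute rows are the interpolated values
# for the 30 new, unobserved points.
precis(ndvi_fit, depth = 2)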
3. Spatial Interpolation with Splines
As we discussed in class, we can also fit GP-style models with splines. This lets us lighten the computational load - something that is crucial when doing spatial interpolation, aka kriging. What if, instead of a few missing trees, we wanted to get NDVI for the entire landscape - all of the unsampled space between the canopies we did measure? We can start by fitting the same model as above - only now using B-splines.
The difficulty with splines and space is that we have 2 coordinates instead of one. What do we give to the basis function? In this case, we can still create a basis function for x and one for y, and then treat each point as an interaction of the two. With splines, this is called a tensor product of the two basis functions - a product across our two basis matrices. In the case of a spline tensor product, we compute it for each row of each basis function using the function kronecker().
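To make the tensor product concrete, here is what kronecker() does with two short vectors - every element of the first is multiplied by the whole second vector:

# all pairwise products of the two vectors, flattened into one long vector
kronecker(c(1, 2), c(10, 20, 30))
[1] 10 20 30 20 40 60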
But, we’ll get there in a second. Let’s start by setting up a basis function for x and y using a B-spline. We’ll make it 3rd order, but only use 10 knots. As you will see, with the tensor product, we’re going to have a lot of parameters to estimate.
library(splines2)
num_knots <- 10
x_knots <- seq(min(boreal$x), max(boreal$x), length.out = num_knots)
y_knots <- seq(min(boreal$y), max(boreal$y), length.out = num_knots)

x_bs <- bsp(boreal$x,
            knots = x_knots[-c(1, num_knots)],
            degree = 3, intercept = TRUE)

y_bs <- bsp(boreal$y,
            knots = y_knots[-c(1, num_knots)],
            degree = 3, intercept = TRUE)
Now that we have these basis functions, we need to, for each row of our data, calculate the tensor product with kronecker()
. Basically, we loop over both matrices and get the tensor product. At the end, sapply produces a matrix that is oriented wrong, so we’ll transpose it back.
xy_bs <- sapply(1:nrow(x_bs), function(i) kronecker(x_bs[i,], y_bs[i,])) |> t()
How much information is that? Let’s look at the size of the basis function for just x versus this spatial interaction basis function.
dim(x_bs)
[1] 533 12
dim(xy_bs)
[1] 533 144
Wow - so, we’ve gone from 12 predictors to 144 (12 columns for x times 12 for y). This is why we’re only using 10 knots at the moment!
With this in hand, though, we can make a standardized data set for ease of fitting:
dat <- list(
  NDVI_s = boreal$NDVI_s,
  Wet_s = boreal$Wet_s,
  xy_bs = xy_bs
)
And then a model. We’ll include an intercept this time, although we could go without, and use matrix multiplication with our basis function.
boreal_mod <- alist(
  # likelihood
  NDVI_s ~ dnorm(mu, sigma),

  # Data Generating Process
  mu <- a + b * Wet_s +
    xy_bs %*% w_xy,

  # Priors
  a ~ dnorm(0, 1),
  b ~ dnorm(0, 2),
  w_xy ~ dnorm(0, 2),
  sigma ~ dexp(2)
)
Last, we can fit this with quap()
, which is nice. Remember, we do have to specify a starting vector for w_xy
so that we have the right size in the model.
boreal_fit <- quap(boreal_mod,
                   data = dat,
                   start = list(
                     w_xy = rep(0, ncol(xy_bs)))
)
This is great! Check it with postcheck().
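For example (not run here - with 533 observations it produces a lot of panels):

# posterior predictive check of the spline model against the observed NDVI_s
postcheck(boreal_fit)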
Now, if we want to make predictions across the entire landscape, we’re also going to need wetness values everywhere. Let’s assume that Wet_s follows a Gaussian process with no predictors - which means we can fit a B-spline model using the same basis functions as above, with Wet_s as a response variable.
# Wetness Mod
wet_mod <- alist(
  # likelihood
  Wet_s ~ dnorm(mu, sigma),

  # Data Generating Process
  mu <- a +
    xy_bs %*% w_xy,

  # Priors
  a ~ dnorm(0, 1),
  c(w_xy) ~ dnorm(0, 2),
  sigma ~ dexp(2)
)
wet_fit <- quap(wet_mod,
                data = dat,
                start = list(
                  w_xy = rep(0, ncol(xy_bs)))
)
With this in hand, we’re ready to do some kriging!
3.2 Kriging Wet_s
To use our model, we need to first get Wet_s
values. As we are looking at an entire landscape, we can apply our basis function to an evenly spaced grid of x and y values. To not break our computers, let’s make a grid consisting of 100 x and 100 y values. We’ll use tidyr::crossing()
to make the grid using the range of data we observe.
# A new dataset
range(boreal$x)
[1] 1.91 2619.92
range(boreal$y)
[1] 0.13 7098.19
newgrid <- tidyr::crossing(x = seq(0, 2700, length.out = 100),
                           y = seq(0, 7000, length.out = 100))
That’s 10,000 rows of data. We’re definitely in the territory where we have to start thinking about not overloading memory, etc.
Once we have these x and y values, we can apply them to the same basis functions as before.
x_bs_new <- bsp(newgrid$x,
                knots = x_knots[-c(1, num_knots)],
                degree = 3, intercept = TRUE)

y_bs_new <- bsp(newgrid$y,
                knots = y_knots[-c(1, num_knots)],
                degree = 3, intercept = TRUE)

xy_bs_new <- sapply(1:nrow(x_bs_new),
                    function(i) kronecker(x_bs_new[i,], y_bs_new[i,])) |> t()
If you want, check out how big these objects are.
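One quick way to do that with base R:

# the new tensor-product basis is 10,000 rows by 144 columns of doubles,
# so it should be on the order of 10 MB
dim(xy_bs_new)
format(object.size(xy_bs_new), units = "Mb")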
Last, and this is the easy part, we can just use our xy_bs_new
with our fit model to get landscape-level predictions. We’ll only take 1,000 draws so we don’t break things.
# What does the distribution of Wet_s look like relative to what we have observed?
wet_pred <- linpred_draws(wet_fit,
                          list(xy_bs = xy_bs_new,
                               x = newgrid$x,
                               y = newgrid$y), ndraws = 1000) |>
  group_by(x, y) |>
  summarize(Wet_s = mean(.value))
`summarise()` has grouped output by 'x'. You can override using the `.groups`
argument.
To see how we did, let’s plot the landscape and compare it to the observed data for Wet_s.
wet_1 <- ggplot(wet_pred,
                aes(x = x, y = y, fill = Wet_s)) +
  geom_raster() +
  scale_fill_distiller() +
  theme_void()

wet_2 <- ggplot(boreal,
                aes(x = x, y = y, fill = Wet_s)) +
  geom_tile(width = 100, height = 300) +
  scale_fill_distiller() +
  theme_void()

wet_1 + wet_2
Looks good!
3.3 Moving to prediction of NDVI
So, now that we have a landscape for Wet_s
, how do we apply that to our NDVI model? Simple! We use the data frame wet_pred
with its coordinates and predictor alongside our new basis function. It’s all just predictions. If we wanted, we could also aggregate and get HPDIs and such for the predictions in order to display uncertainty (see the sketch after the prediction code below). At the moment, let’s just concentrate on the fitted prediction values and then see how they compare to the actual data.
boreal_pred <- linpred_draws(boreal_fit,
                             ndraws = 100,
                             list(x = wet_pred$x,
                                  y = wet_pred$y,
                                  Wet_s = wet_pred$Wet_s,
                                  xy_bs = xy_bs_new)) |>
  group_by(x, y, Wet_s) |>
  summarize(NDVI_s = mean(.value))
`summarise()` has grouped output by 'x', 'y'. You can override using the
`.groups` argument.
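As an aside on the uncertainty point above, here is a minimal sketch (not run) of how we could summarize each cell with a highest density interval via tidybayes::mean_hdi() and map the interval width instead of the mean:

# same newdata as above, but summarizing draws with mean_hdi() instead of mean()
boreal_pred_ci <- linpred_draws(boreal_fit,
                                ndraws = 100,
                                list(x = wet_pred$x,
                                     y = wet_pred$y,
                                     Wet_s = wet_pred$Wet_s,
                                     xy_bs = xy_bs_new)) |>
  group_by(x, y) |>
  mean_hdi(.value)

# wider intervals = less certainty about NDVI_s in that cell
ggplot(boreal_pred_ci,
       aes(x = x, y = y, fill = .upper - .lower)) +
  geom_raster() +
  scale_fill_viridis_c("HPDI width") +
  theme_void()

Now, back to the fitted means and how they compare to the data.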
ndvi_1 <- ggplot(boreal_pred,
                 aes(x = x, y = y, fill = NDVI_s)) +
  geom_raster() +
  scale_fill_distiller(palette = "Greens") +
  theme_void()

ndvi_2 <- ggplot(boreal,
                 aes(x = x, y = y, fill = NDVI_s)) +
  geom_tile(width = 100, height = 300) +
  scale_fill_distiller(palette = "Greens") +
  theme_void()

ndvi_1 + ndvi_2
Fantastic. As a last step, let’s compare the NDVI patterns to patterns in our Wetness index.
ndvi_1 + wet_1
Huh - they look fairly similar. Did adding space really add anything?
To test this, we can refit our NDVI model without spatial information and compare the spatial and nonspatial model.
boreal_mod_nospace <- alist(
  # likelihood
  NDVI_s ~ dnorm(mu, sigma),

  # Data Generating Process
  mu <- a + b * Wet_s,

  # Priors
  a ~ dnorm(0, 1),
  b ~ dnorm(0, 2),
  sigma ~ dexp(2)
)
boreal_fit_nospace <- quap(boreal_mod_nospace,
                           data = dat)
compare(boreal_fit_nospace, boreal_fit)
WAIC SE dWAIC dSE pWAIC weight
boreal_fit 690.3657 60.58190 0.0000 NA 80.951162 1.000000e+00
boreal_fit_nospace 954.9469 53.80924 264.5812 42.79626 4.793682 3.523115e-58
Now, it’s clear the spatial model is better for prediction. But AIC-like indices can tend to select more parameter-rich models. It might be worthwhile to have a think about what is going on here, and how much spatial information was needed to improve the fit this much - or, if your goal really is just prediction, to just go with it.
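One way to poke at that question is to look at the posterior for the 144 spline weights and see how many of them actually moved away from zero; a minimal sketch:

# posterior summaries of the spline weights; weights whose intervals sit well
# away from zero are where the spatial surface is doing real work
# (if the pars filter misbehaves in your version, drop it and scan the full output)
plot(precis(boreal_fit, depth = 2, pars = "w_xy"))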