So, we’ve flipped the globe 9 times, and drawn
W, L, W, W, W, L, W, L, W (6 W, 3 L)
In a data frame:

1. Use seq() to come up with a set of possible probability values
2. Add a column of priors. Make it flat, so they’re all 1, or get fancy.
3. Calculate your likelihood for each probability with size = 9 and W = 6
4. Calculate your prior * likelihood
5. Calculate your posterior, as the previous value divided by the sum of all prior * likelihoods
library(dplyr)

# observed data: 6 W in 9 tosses
water <- 6

# our hypotheses and a flat prior = 1/nrow
grid <- data.frame(prob = seq(0, 1, .01),
                   prior = 1/101) |>
  # calculate likelihood
  mutate(likelihood =
           dbinom(water, size = 9, prob = prob)) |>
  # multiply by prior
  mutate(posterior = likelihood * prior) |>
  # divide by marginal (normalize)
  mutate(posterior = posterior / sum(posterior))
# our hypotheses again, but on the log scale
# to avoid underflow when likelihoods get tiny
log_grid <- data.frame(prob = seq(0, 1, .01),
                       prior = 1/101) |>
  # calculate log likelihood
  mutate(likelihood =
           dbinom(water, size = 9, prob = prob, log = TRUE)) |>
  # add the log prior (multiplying on the log scale)
  mutate(posterior = likelihood + log(prior)) |>
  # exponentiate and divide by marginal (normalize)
  mutate(posterior = exp(posterior),
         posterior = posterior / sum(posterior))
The first rows of the grid:
prob prior likelihood posterior
1 0.00 0.00990099 0.000000e+00 0.000000e+00
2 0.01 0.00990099 8.150512e-11 8.150511e-12
3 0.02 0.00990099 5.059848e-09 5.059848e-10
4 0.03 0.00990099 5.588844e-08 5.588844e-09
5 0.04 0.00990099 3.044058e-07 3.044058e-08
6 0.05 0.00990099 1.125305e-06 1.125305e-07
The rows with the highest posterior, arranged in descending order:
prob prior likelihood posterior
1 0.67 0.00990099 0.2730674 0.02730674
2 0.66 0.00990099 0.2728850 0.02728850
3 0.68 0.00990099 0.2721339 0.02721339
4 0.65 0.00990099 0.2716211 0.02716211
5 0.69 0.00990099 0.2700592 0.02700591
6 0.64 0.00990099 0.2693188 0.02693188
How much of the posterior is less than a certain value?
How much of the posterior is greater than a certain value?
What value of the posterior has the highest density?
What range of values contains some percent of the posterior? e.g., 90%
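These questions are easiest to answer with samples from the posterior. A minimal sketch of drawing them (the name samp matches its use below; the exact call is an assumption):

# draw values of prob in proportion to their posterior probability
samp <- sample(grid$prob, size = 1e4,
               replace = TRUE, prob = grid$posterior)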
[1] 0.0504
So, 5.04% of the posterior is below that value.
It’s a filter thang!
What % is < 0.6
What % is > 0.6
What % is between 0.2 and 0.6
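On the grid these are filter-and-sum operations; with samples, they’re just means of logical tests. A sketch of both (using grid and samp from above):

# from the grid: posterior mass below 0.6
grid |>
  filter(prob < 0.6) |>
  summarize(mass = sum(posterior))

# from the samples: below, above, and between
mean(samp < 0.6)
mean(samp > 0.6)
mean(samp > 0.2 & samp < 0.6)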
We can calculate quantiles using the cumulative density of the posterior
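For the plot below, the quantile column can be built as a running sum of the normalized posterior (an assumption about how grid gains that column):

# cumulative posterior mass at each value of prob
grid <- grid |>
  mutate(quantile = cumsum(posterior))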
Visualize as before
library(ggplot2)

ggplot() +
  # grid posterior
  geom_line(data = grid,
            aes(x = prob, y = posterior)) +
  # the 50% interval between the 25% and 75% quantiles
  geom_ribbon(data = grid |> filter(quantile > 0.25 & quantile < 0.75),
              ymin = 0, aes(x = prob, ymax = posterior),
              fill = "black")
Note that this is not the Highest Posterior Density Interval
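The outputs below compare a 50% percentile interval with a 50% HPDI. A sketch of calls that could produce them (HPDI() is from the rethinking package; samp_bad is assumed to be a skewed sample vector defined elsewhere):

library(rethinking)

# 50% percentile interval vs. 50% HPDI
quantile(samp, c(0.25, 0.75))
HPDI(samp, 0.5)

# the same comparison for a skewed sample
quantile(samp_bad, c(0.25, 0.75))
HPDI(samp_bad, 0.5)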
For samp, the percentile interval and the HPDI nearly agree:

 25%  75%
0.54 0.74

|0.5 0.5|
0.54 0.73

For the skewed samp_bad, they diverge sharply:

       25%        75%
0.09251207 0.37267543

        |0.5         0.5|
9.248308e-05 2.082650e-01
# to get distributional properties
dens <- density(samp_bad)
hpdi_50 <- HPDI(samp_bad, 0.5)

# properties
interval_dat <- tibble(
  prob = dens$x,               # parameters
  dens = dens$y / sum(dens$y), # standardized density
  quant = cumsum(dens),        # quantiles based on the standardized density
  # use the quantiles for the PI
  pi_50 = ifelse(quant >= 0.25 & quant <= 0.75,
                 dens,
                 NA),
  # use the values from the HPDI for filtering
  hpdi_50 = ifelse(prob >= hpdi_50[1] & prob <= hpdi_50[2],
                   dens,
                   NA)
) |>
  tidyr::pivot_longer(cols = c(pi_50, hpdi_50))
## Plot!
ggplot(interval_dat,
aes(x = prob, y = dens)) +
geom_line() +
geom_ribbon(ymin = 0, aes(ymax = value),
fill = "darkgrey") +
facet_wrap(vars(name))
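The three outputs below look like point estimates from the samples. A guess at the calls (chainmode() is from rethinking; which call produced which number is an assumption):

mean(samp)
median(samp)
chainmode(samp)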
[1] 0.636778
[1] 0.65
[1] 0.71
# what is the cost of a given value being wrong,
# for any parameter - here, linear (absolute) distance
loss_fun <- function(d) {
  dist_from_each_hypothesis <- abs(d - grid$prob)
  scaled_distance_by_posterior_prob <-
    grid$posterior * dist_from_each_hypothesis

  # expected cost if we choose d and the truth is elsewhere
  sum(scaled_distance_by_posterior_prob)
}
# iterate over all parameters
loss <- sapply(grid$prob, loss_fun)
# and the point estimate with the lowest cost is...
grid$prob[which.min(loss)]
[1] 0.64

[1] 0.65

The minimum-cost point sits essentially at the posterior median: linear (absolute) loss is minimized by the median, so the two agree to within one grid step.
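The table below summarizes posterior predictive draws: for each sampled prob, simulate 9 tosses and count the Ws. A sketch of how it could be built (the exact call is an assumption; note that n sums to the 10,000 samples):

# simulate the number of W in 9 tosses for each posterior sample
tibble(w = rbinom(1e4, size = 9, prob = samp)) |>
  count(w) |>
  mutate(percent = n / sum(n))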
w n percent
0 10 0.0010
1 124 0.0124
2 336 0.0336
3 780 0.0780
4 1282 0.1282
5 1754 0.1754
6 1972 0.1972
7 1858 0.1858
8 1311 0.1311
9 573 0.0573
Note that 6 is the peak, and our draw was w=6!
In the longest-run check below, we had many 3s, matching the observed longest run of W (3): not bad, not spot on. Is this a good model, or a good check?
getRuns <- function(prob, draws = 9){
  # toss the globe with a given probability of W!
  toss <- rbinom(draws, size = 1, prob)

  runs <- rle(toss) # gets run lengths of all values
  runs <- runs$lengths[runs$values == 1] # filter to runs of W (1)

  # return 0 if all L, otherwise
  # return the longest run
  if(length(runs) == 0){return(0)}

  max(runs)
}
# iterate over all samples
# iterate over all samples, one simulation per row
samp <- tibble(samp = samp) |>
  rowwise() |>
  mutate(longest_run = getRuns(samp)) |>
  ungroup()
simplehist(samp$longest_run)
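For comparison, the longest run of W in the observed draw is 3 (encoding W = 1, L = 0 here is mine):

# W, L, W, W, W, L, W, L, W
obs <- c(1, 0, 1, 1, 1, 0, 1, 0, 1)
obs_runs <- rle(obs)
max(obs_runs$lengths[obs_runs$values == 1]) # 3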