Teaching Myself: Bayesian Linear Regression

As many psychologists, I was trained in frequentist regression. Here, I want to familiarise myself with Bayesian linear regression.
r
Bayes
Linear regression
Author

Olaf Borghi

Published

Tuesday, March 19, 2024

# use groundhog to make code maximally reproducible
if (!require("groundhog", quietly = TRUE)) {
  install.packages("groundhog")
}
library("groundhog")

# use groundhog to install and load packages
pkgs <- c("here",         # Path management
          "tidyverse",    # ggplot, dplyr, %>%, and friends
          "easystats",    # bayestestR, performance, effectsize and friends
          "psych",        # Descriptive statistics and other tools
          "brms",         # Bayesian modeling through Stan
          "tidybayes",    # Integration of Bayesian models in tidy + ggplot workflow
          "broom",        # Convert model objects to data frames
          "broom.mixed",  # Convert brms model objects to data frames
          "tinytable",    # Lightweight package to create tables
          "hrbrthemes",   # Additional ggplot themes
          "extrafont",    # Additional fonts for plots etc
          "ggdist",       # Special geoms for posterior distributions
          "ggrepel",      # Automatically position labels
          "gghalves",     # Special half geoms
          "patchwork"     # Combine ggplot objects
          )

groundhog.library(pkgs, "2024-03-01") # change to a recent date

Teaching Myself: Bayesian Linear Regression

Meet the data

I will use a large open data set described in Eisenberg et al. (2018; 2019), combined with data from Zmigrod et al. (2021). The data from Eisenberg and colleagues includes measures from a large number of questionnaires and cognitive tasks that tap into the domain of self regulation. The data from Zmigrod and colleagues includes data from several measures of ideology.

The goal will be to find out whether self-report and task measures of self-control predict authoritarianism.

Variables of interest:

  • Self-reported self control: Brief Self-Control Scale

  • Behavioral self-control: Accuracy in a Go / No-Go task, a cognitive task that requires a response to a frequent Go stimulus, such as pressing the space bar when a red square is displayed, and the inhibition of this response when a No-Go stimulus is displayed.

  • Self reported authoritarianism.

data1 <- data1 %>%
  rename("SubjectID" = "...1")

data <- inner_join(data1, data2, by = "SubjectID")
# select variables of interest
d <- data %>%
  select(c(SubjectID, BSCSSCON, GNGDRPIM, Authoritarianism)) %>%
  rename(ID = SubjectID,
         bscs = BSCSSCON,
         gonogo = GNGDRPIM,
         aut = Authoritarianism) %>%
  mutate(bscs_z = scale(bscs),
         gonogo_z = scale(gonogo),
         aut_z = scale(aut))

describe(d)
##          vars   n   mean    sd median trimmed    mad   min    max  range  skew kurtosis
## ID*         1 334 167.50 96.56 167.50  167.50 123.80  1.00 334.00 333.00  0.00    -1.21
## bscs        2 334  46.07 10.30  47.00   46.33  11.86 18.00  65.00  47.00 -0.22    -0.53
## gonogo      3 333   3.65  0.76   3.73    3.67   0.64  0.82   5.97   5.15 -0.11     1.18
## aut         4 334   4.43  2.94   4.00    4.25   2.97  0.00  12.00  12.00  0.55    -0.05
## bscs_z      5 334   0.00  1.00   0.09    0.02   1.15 -2.73   1.84   4.56 -0.22    -0.53
## gonogo_z    6 333   0.00  1.00   0.10    0.03   0.85 -3.74   3.07   6.81 -0.11     1.18
## aut_z       7 334   0.00  1.00  -0.15   -0.06   1.01 -1.51   2.58   4.08  0.55    -0.05
##            se
## ID*      5.28
## bscs     0.56
## gonogo   0.04
## aut      0.16
## bscs_z   0.05
## gonogo_z 0.05
## aut_z    0.05

vars1_density<- d %>%
  select(c(bscs_z, gonogo_z, aut_z)) %>%
  pivot_longer(., cols = c(bscs_z, gonogo_z, aut_z), 
               names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = factor(Variable))) + 
  geom_density(aes(alpha = 0.6)) + 
  labs(x="Standardised values", y=element_blank(),
       title="Distributions of predictor and outcome variables") +
  scale_fill_ipsum() +
  scale_color_ipsum() +
  guides(color = "none", alpha = "none") +
  theme_ipsum(plot_title_size = 14) + 
  theme(legend.position = "bottom",
        legend.title = element_blank())

vars1_density

Have a first visual look of the association between the variables.


ggplot(d, aes(x = bscs_z, y = aut_z)) +
  geom_point(size = 1) +
  geom_smooth(method = "lm") +
  labs(x="Brief Self-Control Scale", y="Authoritarianism",
       title="Association between self-control and authoritarianism") +
  scale_fill_ipsum() +
  scale_color_ipsum() +
  guides(color = "none", alpha = "none") +
  theme_ipsum(plot_title_size = 12)


ggplot(d, aes(x = gonogo_z, y = aut_z)) +
  geom_point(size = 1) +
  geom_smooth(method = "lm") +
  labs(x="Go / No-Go Accuracy", y="Authoritarianism",
       title="Association between Go / No-go accuracy and authoritarianism") +
  scale_fill_ipsum() +
  scale_color_ipsum() +
  guides(color = "none", alpha = "none") +
  theme_ipsum(plot_title_size = 12)


d %>%
  select(c(bscs_z, gonogo_z, aut_z)) %>%
  pairs.panels(.)

A first Bayesian linear regression: One continuous predictor

We already displayed regression lines. So far, however, all of them were fit using the frequentist lm() function. We want to change that and go Bayesian here. Let’s start by investigating the association between self-reported self-control and authoritarianism.

Selecting priors

I already scaled all variables of interest. This will make it a bit simpler to select priors. I will use weakly informative priors for our slope and intercept parameters (see https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations). The selected prior is \(N(0,1)\).

It is also recommended to conduct a prior sensitivity analysis. This can be done for example by refitting the model with a different prior and by comparing the results. We will thus fit another model using \(N(0,2)\) as my prior. Let’s plot both of them to see what kind of an effect they may have on our parameters.

# simulate some data
x_values <- seq(-6, 6, length.out = 1000)
dists <- data.frame(x = x_values,
                    y1 = dnorm(x_values, mean = 0, sd = 1),
                    y2 = dnorm(x_values, mean = 0, sd = 2))

# colors
colors <- ipsum_pal()(2)

# plot for N(0,1) 
p1 <- ggplot(dists, aes(x = x, y = y1)) +
  geom_area(aes(fill = "N(0,1)"), alpha = 0.8) +
  scale_fill_manual(values = c("N(0,1)" = colors[1])) +
  labs(title = "N(0,1)", y = "Density", x = "Value") +
  theme_ipsum() +
  theme(legend.position = "none")

# plot for N(0,2) 
p2 <- ggplot(dists, aes(x = x, y = y2)) +
  geom_area(aes(fill = "N(0,2)"), alpha = 0.8) +
  scale_fill_manual(values = c("N(0,2)" = colors[2])) +
  labs(title = "N(0,2)", y = "Density", x = "Value") +
  theme_ipsum() +
  theme(legend.position = "none")

# Combine plots 
p_combined <- p1 + p2 + plot_layout(ncol = 2)
print(p_combined)

We can see that in the the plot for \(N(0,1)\), more density is around 0. As we expect our parameter values to be between 0 and 1, this gives more weight to parameter values closer to zero, and less weight to parameter values that are far from zero, compared to \(N(0,2)\) , where more probability mass is also on parameters that are a bit further from 0. If we would know that the association is large, and at the same time our sample size is low, we could think of using informative priors.

Here we prefer the prior with less information - this prior also has a slight regularizing effect.

Let’s check how our priors are set by default in brms.

get_prior(aut_z ~ bscs_z, data = d)
##                    prior     class   coef group resp dpar nlpar lb ub       source
##                   (flat)         b                                         default
##                   (flat)         b bscs_z                             (vectorized)
##  student_t(3, -0.1, 2.5) Intercept                                         default
##     student_t(3, 0, 2.5)     sigma                               0         default

By default brms sets non-informative flat priors for our slope parameters. We thus have to change this, and can do so directly when fitting the model.

Fitting and summarising the first model

m1.1 <- brm(aut_z ~ bscs_z, 
          data = d, 
          prior = c(prior(normal(0, 1), class = "Intercept"), 
                    prior(normal(0, 1), class = "b")),
          seed = bayes_seed,
          file = "02_fits/fit_m1.1")

Check the results. We can use either summary(m1.1) or print(m1.1) to do so.

summary(m1.1, prob = 0.95)  # prob = 0.95 is the default
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: aut_z ~ bscs_z 
##    Data: d (Number of observations: 334) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.00      0.05    -0.11     0.11 1.00     4111     2956
## bscs_z        0.15      0.05     0.05     0.26 1.00     3702     2562
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.99      0.04     0.92     1.07 1.00     3605     3062
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Use the tools from bayestestR to have another look at the posterior. We also define a region of practical equivalence ranging from [-0.05; 0.05].

More on this later.

m1.1_posteriors <- describe_posterior(m1.1, 
                                      centrality = "median",
                                      dispersion = FALSE,
                                      ci = 0.95,
                                      ci_method = "hdi",
                                      test = c("rope"),
                                      rope_range = c(-0.05, 0.05),
                                      rope_ci = 1)

m1.1_posteriors <- as.data.frame(m1.1_posteriors)
m1.1_posteriors
##     Parameter   Median   CI  CI_low CI_high ROPE_CI ROPE_low ROPE_high ROPE_Percentage
## 2 b_Intercept 0.000595 0.95 -0.1085   0.103       1    -0.05      0.05          0.6392
## 1    b_bscs_z 0.153896 0.95  0.0424   0.253       1    -0.05      0.05          0.0272
##    Rhat  ESS
## 2 1.000 4139
## 1 0.999 3690

Plot the conditional effects.

c_eff_m1.1 <- conditional_effects(m1.1)
c_eff_m1.1_plot <- plot(c_eff_m1.1, 
                        points = T,
                        point_args = list(size = 1, alpha = 1/4, width = .05, height = .05, color = "black"),
                        plot = FALSE)[[1]] +
  scale_color_ipsum() +
  scale_fill_ipsum() +
  theme_ipsum()
  
c_eff_m1.1_plot

Let’s use bayesplot functions to plot the posterior distributions of the parameters.


mcmc_plot(m1.1, 
          type = "areas") +
  theme_ipsum()

We can also manually create a similar plot using draws from the posterior distribution. For simplicity, Here we use gather_draws from the tidybayes package - this automatically returns a long dataframe suitable for plotting in ggplot.

draws_m1.1 <- m1.1 %>% 
  gather_draws(`b_.*`, regex = TRUE)  %>% 
  mutate(variable = case_when(.variable == "b_Intercept" ~ "Intercept",
                              .variable == "b_bscs_z"    ~ "Self-Control",
                              TRUE  ~ .variable # Keeps the original value if it doesn't match any of the above
                              ))

ggplot(draws_m1.1, aes(x = .value, y = fct_rev(variable), fill = variable)) +
  geom_rect(aes(xmin = -0.05, xmax = 0.05, ymin = -Inf, ymax = Inf), fill = "#FFE3EB", alpha = 0.05) +
  geom_vline(xintercept = 0) +
  stat_halfeye(slab_alpha = 0.8,
               .width = c(0.9, 0.95), point_interval = "median_hdi") +
  scale_fill_ipsum() +
  guides(fill = "none", slab_alpha = "none", alpha = "none") +
  labs(title = "Effect estimates", x = "Coefficient", y = "Variable",
       caption = "90% and 95% credible intervals shown in black. Area in rose indicates ROPE.") +
  theme_ipsum(plot_title_size = 14,
              axis_title_size = 12,
              axis_title_face = "bold")

Checking the model

Prior check

Let’s check if the prior that we used for the slope and intercept was informative or, as intended, only weakly informative. We can use functionality from bayestestR for this.


prior_summary(m1.1)
##                 prior     class   coef group resp dpar nlpar lb ub       source
##          normal(0, 1)         b                                            user
##          normal(0, 1)         b bscs_z                             (vectorized)
##          normal(0, 1) Intercept                                            user
##  student_t(3, 0, 2.5)     sigma                               0         default
check_prior(m1.1)
##     Parameter Prior_Quality
## 1 b_Intercept uninformative
## 2    b_bscs_z uninformative

Ok - as intended our priors were uninformative.

Posterior predictive check

Let’s also check how well our model performs in generating data that is similar to our observed data. We call this a Posterior predictive check (PPC).

pp_check(m1.1, ndraws= 200)

pp_check(m1.1, ndraws = 100, type ='error_scatter_avg', alpha = .1)

Trace plot

Commonly, we look at the trace plots, i.e., the estimated values across each iteration for a chain. See this nice practical guide by Michael Clark for some details. We will follow his code in several steps below.

What should the trace plot look like? Ideally it should look like a series from a normal distribution.

qplot(x = 1:1000, y = rnorm(1000), geom = 'line') +
  theme_ipsum()

Now we can also plot the trace plot from our model, using nice functionality from bayesplot.

mcmc_plot(m1.1, type = 'trace') +
  theme_ipsum()

Prior sensitivity checks

Ok, so we fit our first model. We also want to check whether the choice of different priors would have a substantial effect. We refit a similar model, just this time with \(N(0,2)\) as prior for the intercept and slope parameter.

m1.2 <- brm(aut_z ~ bscs_z, 
          data = d, 
          prior = c(prior(normal(0, 2), class = "Intercept"), 
                    prior(normal(0, 2), class = "b")),
          seed = bayes_seed,
          file = "02_fits/fit_m1.2")
summary(m1.1, prob = 0.95)  # prob = 0.95 is the default
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: aut_z ~ bscs_z 
##    Data: d (Number of observations: 334) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.00      0.05    -0.11     0.11 1.00     4111     2956
## bscs_z        0.15      0.05     0.05     0.26 1.00     3702     2562
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.99      0.04     0.92     1.07 1.00     3605     3062
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
m1.2_posteriors <- describe_posterior(m1.2, 
                                      centrality = "median",
                                      dispersion = FALSE,
                                      ci = 0.95,
                                      ci_method = "hdi",
                                      test = c("rope"),
                                      rope_range = c(-0.05, 0.05),
                                      rope_ci = 1)

m1.2_posteriors <- as.data.frame(m1.2_posteriors)

# Add a new column 'model' to each dataframe
m1.1_posteriors <- m1.1_posteriors %>% mutate(model = "m1.1")
m1.2_posteriors <- m1.2_posteriors %>% mutate(model = "m1.2")

# Combine the dataframes
m1.1_m1.2_combined <- bind_rows(m1.1_posteriors, m1.2_posteriors)

# Print the combined dataframe
m1.1_m1.2_combined %>% 
  select(Parameter, Median, CI_low, CI_high, model) %>% 
  tt(.)
tinytable_0omihsch5sn1oesd8yp3
Parameter Median CI_low CI_high model
b_Intercept 0.000595 -0.1085 0.103 m1.1
b_bscs_z 0.153896 0.0424 0.253 m1.1
b_Intercept 0.000932 -0.1045 0.109 m1.2
b_bscs_z 0.152366 0.0470 0.265 m1.2

We can see that irrespective of our prior choice, both the median of the posterior distributions of the two different models, and the 95% HDI are very similar.

Takeaway

  • Using the Bayesian approach we get much more than just a point estimate for our slope parameter of interest - we get a posterior distribution of parameters that are plausible given our observed data.

  • An increase of self control of 1 SD is associated with an increase of 0.04-0.26 SDs (median: 0.15) in authoritarianism.

  • The 95% HDI of our predictor of interest “Self-Control” does not include 0. We can thus say with a relatively high probability (according to common criteria - some authors also argue for 89% or 90% CIs) given our data that the effect is different from 0.

  • This is still true when we define and look at the ROPE - a region of practical equivalence. There is only minor overlap between the posterior distribution of our parameter and the ROPE (around 3%), indicating that the probability that our effect is practically relevant given our defined criteria is large.

Resources

I am not a statistician, so please take all of what I wrote and coded here with a lot of caution. However, there are many great resources out there if you want to delve deeper into the topic - I am just getting started with this, but here are a few potentially helpful links.

And there are of course many many more resources, this is just to get started with some.