When the Spider-Man meme is relevant to multilevel models

27 minute read

For a while, I’ve wondered about the different approches for multilevel modeling, also known as mixed effects modeling. My initial understanding is with a Bayesian perspective since I learned about it from Statistical Rethinking. But when hearing others talk about “fixed effects”, “varying effects”, “random effects”, and “mixed effects”, I had trouble connecting my own understanding of the concept to theirs. Even more perplexing, I wasn’t sure what the source(s) of the differences were:

It it a frequentist vs. Bayesian thing?
Is it a statistical package thing?
Is it because there are five different definitions of “fixed and random effects”, infamously observed by Andrew Gelman and why he avoids using those terms?

I decided to take a deep dive to resolve my confusion, with much help from numerous sources. Please check out the Acknowledgments and references section!

In this post, I’ll be comparing an example of mixed effects modeling across statistical philosophies and across statistical languages. As a bonus, a meme awaits.

method	approach	language	package
1	frequentist	R	`lme4`
2	Bayesian	Python	`pymc`

Note that the default language in the code blocks is Python. A cell running R will have %%R designated at the top. A variable can be inputted (-i) or outputted (-o) on that same line if it is used between the two languages.

Special thanks to Patrick Robotham for providing a lot of feedback.

from aesara import tensor as at
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pymc as pm
import xarray as xr

%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-darkgrid")
rng = np.random.default_rng(1234)
az.rcParams["stats.hdi_prob"] = 0.95 

def standardize(x):
    x = (x - np.mean(x)) / np.std(x)
    return x

# Enable running of R code
%load_ext rpy2.ipython

%%R
suppressMessages(library(tidyverse))
suppressMessages(library(lme4))
suppressMessages(library(arm))
suppressMessages(library(merTools))

Create synthetic `cafe` dataset

The dataset I am using is created from a scenario described in Statistical Rethinking.

Here are a few more details of the dataset from Dr. McElreath’s book:

Begin by defining the population of cafés that the robot might visit. This means we’ll define the average wait time in the morning and the afternoon, as well as the correlation between them. These numbers are sufficient to define the average properties of the cafés. Let’s define these properties, then we’ll sample cafés from them.

Nearly all Python code is taken from the Statistical Rethinking pymc repo with some minor alterations.

a = 3.5  # average morning wait time
b = -1.0  # average difference afternoon wait time
sigma_a = 1.0  # std dev in intercepts
sigma_b = 0.5  # std dev in slopes
rho = -0.7  # correlation between intercepts and slopes

Mu = [a, b]

sigmas = [sigma_a, sigma_b]
Rho = np.matrix([[1, rho], [rho, 1]])
Sigma = np.diag(sigmas) * Rho * np.diag(sigmas)  # covariance matrix

N_cafes = 20
vary_effects = np.random.multivariate_normal(mean=Mu, cov=Sigma, size=N_cafes)
a_cafe = vary_effects[:, 0]
b_cafe = vary_effects[:, 1]

Now simulate the observations.

N_visits = 10
afternoon = np.tile([0, 1], N_visits * N_cafes // 2)
cafe_id = np.repeat(np.arange(0, N_cafes), N_visits)

mu = a_cafe[cafe_id] + b_cafe[cafe_id] * afternoon
sigma = 0.5  # std dev within cafes

wait = np.random.normal(loc=mu, scale=sigma, size=N_visits * N_cafes)
df_cafes = pd.DataFrame(dict(cafe=cafe_id, afternoon=afternoon, wait=wait))

To get a sense of the data structure we just created, let’s take a look at the first and last 5 rows.

df_cafes.head()

	afternoon	wait
0	0	2.724888
1	1	1.951626
2	0	2.488389
3	1	1.188077
4	0	2.026425

df_cafes.tail()

	cafe	afternoon	wait
195	19	1	3.394933
196	19	0	4.544430
197	19	1	2.719524
198	19	0	3.379111
199	19	1	2.459750

Note that this dataset is balanced, meaning that each group (cafe) has the same number of observations. Mixed effects / multilevel models shine with unbalanced data where it can leverage partial pooling.

Visualize data

Let’s plot the raw data and see how the effect of afternoon influences wait time. Instead of plotting in order of the arbitrarily named cafes (0 to 19), I’ll show in order of increasing average morning wait time so that we can appreciate the variability across the dataset.

df_cafes.head()

	afternoon	wait
0	0	2.644592
1	1	2.126485
2	0	2.596465
3	1	2.250297
4	0	3.310709

%%R -i df_cafes

# credit to TJ Mahr for a template of this code
xlab <- "Afternoon"
ylab <- "Wait time"
titlelab <- "Wait times for each cafe (ordered by increasing average time)"

# order by increasing average morning wait time (intercept only)
cafe_ordered_by_avgwaittime <- df_cafes %>%
              filter(afternoon==0) %>%
              group_by(cafe) %>%
              summarize(mean = mean(wait)) %>%
              arrange(mean)

# Turn the gear column from a numeric in a factor with a certain order
df_cafes$cafe <- factor(df_cafes$cafe, levels=cafe_ordered_by_avgwaittime$cafe)

ggplot(df_cafes) + 
  aes(x = afternoon, y = wait) + 
  geom_boxplot(aes(fill=factor(afternoon))) +
  stat_summary(fun.y="mean", geom="line") +
  facet_wrap("cafe") +
  labs(x = xlab, y = ylab, title=titlelab)

png

One pattern is that as we increase morning wait time (e.g. the intercept) the difference in wait time in the afternoon (the slope) gets bigger. In other words, when we simulated this dataset, we included a co-variance structure between the intercept and slope. When we develop an inferential model with this data, we want to be able to reveal this co-variance.

Definitions of mixed effects modeling

Equation set 1: both fixed and random effects terms in linear model

Galecki and Burzykowski, Wikipedia, and this page from UCLA all describe a linear mixed model with an equation similar to equation 1 below.

I rely heavily on the UCLA page since it is the one that helped me the most. In fact, if you don’t care about how it connects to the Bayesian approach, stop reading this and check that out instead!

In contrast to the Bayesian set of equations, the fixed effects and random effects are in the same equation here.

\[\textbf{y} = \textbf{X} \boldsymbol{\beta} + \textbf{Z} \textbf{u} + \boldsymbol{\epsilon} \tag{1}\]

The left side of the equation $\textbf{y}$ represents all of our observations (or the wait time in the cafe example). The $\boldsymbol{\beta}$ in the first term of the equation represents a vector of coefficients across the population of cafes. These are the fixed effects. The $\textbf{u}$ in the second term of equation 1 represents a matrix of coefficients for each individual cafe. These are the random effects. Both $\textbf{X}$ and $\textbf{Z}$ are the design matrix of covariates. Finally, there’s a residual error term $\boldsymbol{\epsilon}$.

When relating this equation all back to the cafe dataset we just created, I needed to dig deeper to how terms represented an individual observation versus the group (cafe) level. Doing a dimensional analysis helped.

Equation 1 variable	Dimensions	Effects type	Comment
$\textbf{y}$	200 x 1	n/a	This vector represents the wait time for all 200 observations. I’ll refer to this as $w_i$ later in equation 2.
$\textbf{X}$	200 x 2	associated with fixed	The first column of each observation is 1 since it is multiplied by the intercept term. The second column is $A$, which will be 0 or 1 for `afternoon`.
$\boldsymbol{\beta}$	2 x 1	fixed	The two elements in the $\boldsymbol{\beta}$ (bold font beta) are what I’ll refer to as the intercept $\alpha$ and the slope $\beta$ (unbolded beta) across all cafes in equation 2.
$\textbf{Z}$	200 x (2x20)	associated with random	The first 20 columns representing intercepts for each cafe and the second 20 for the covariate (`afternoon`). See visual below.
$\textbf{u}$	(2x20) x 1	random	$\textbf{u}$ holds each of the 20 cafes’ intercept $a_\text{cafe}$ and slope $b_\text{cafe}$. There’s an implied correlation structure between them.
$\boldsymbol{\epsilon}$	200 x 1	n/a	Normally distributed residual error.

To better understand what $\textbf{Z}$ looks like we can create an alternate representation of df_cafes. Each row of the matrix $\textbf{Z}$ is for an individual observation. The first 20 columns of a row are the 20 intercepts of a cafe (column 1 is cafe 1, column 2 is cafe 2, etc.) All of the first 20 columns will contain a 0 except for the column that represents the cafe that observation is associated with which will be a 1. The next 20 columns (columns 21-40) will represent afternoon. All of this second group of columns will be 0 except for the column that represents the cafe that observation is associated with and if the observation is associated with an afternon observation.

To be clear, the structure of df_cafes, where each row is an observation with the cafe, afternoon status, and wait time, is already in a form to be understood by the lmer and pymc packages. What I’m showing below is to help understand what the matrix $\textbf{Z}$ looks like in the above equations.

Z = np.zeros((200, 40))
for i in df_cafes.index:
    cafe = df_cafes.loc[i, 'cafe']
    afternoon = df_cafes.loc[i, 'afternoon']
    Z[i, cafe] = 1
    Z[i, 20+cafe] = afternoon

We can take a look at the first 12 rows of Z. The first 10 are for the first cafe and observations alternate morning and afternoon, hence what’s displayed in column 20. I included the first two rows of the second cafe to show how the 1 moves over a row after the first 10 rows. I’ll use pandas to better display the values.

pd.set_option('display.max_columns', 40)
(
    pd.DataFrame(Z[0:12, :])
    .astype(int)
    .style
    .highlight_max(axis=1, props='color:navy; background-color:yellow;')
    .highlight_min(axis=1, props='color:white; background-color:#3E0B51;')
)

	0	1	20	21
0	1	0	0	0
1	1	0	1	0
2	1	0	0	0
3	1	0	1	0
4	1	0	0	0
5	1	0	1	0
6	1	0	0	0
7	1	0	1	0
8	1	0	0	0
9	1	0	1	0
10	0	1	0	0
11	0	1	0	1

We can visualize all of $\textbf{Z}$ here.

plt.imshow(Z, aspect='auto')
plt.text(10, 220, s='intercept (cafe)', ha='center', fontsize=14)
plt.text(30, 220, s='covariate (afternoon)', ha='center', fontsize=14)
plt.ylabel('observations')
plt.title('Visual representation of Z')

Text(0.5, 1.0, 'Visual representation of Z')

png

The vector in $\textbf{u}$ is really where the mixed effects model takes advantage of the covariance structure of the data. In our dataset, the first 20 elements of the vector represent the random intercepts of the cafes and the next 20 are the random slopes. A cafe’s random effects can be thought of as an offset from the populations (the fixed effects). Accordingly, a random effect will be multivariate normally distributed, with mean 0 and a co-variance matrix S.

\[\textbf{u} \sim \text{Normal}(0, \textbf{S}) \tag{2}\]

Remember that the $\textbf{u}$ is a (2x20) x 1 matrix, where each cafe’s intercept $a_\text{cafe}$ and slope $b_\text{cafe}$ are contained. Therefore, we can also write this as.

\[\textbf{u} = \begin{bmatrix} a_{\text{cafe}} \\ b_{\text{cafe}} \end{bmatrix} \sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \textbf{S} \right) \tag{3}\]

In other words, in Equation 1, both the random intercept and random slope are both expected to lie at 0. With regards to $\textbf{S}$, my prior post talked about covariance matrixes so I won’t elaborate here. The key conceptual point of relevance in this problem is that the covariance matrix $\textbf{S}$ can reflect the correlation ($\rho$) that the intercept (average morning wait time) and slope (difference between morning and afternoon wait time).

\[\textbf{S} = \begin{pmatrix} \sigma_{\alpha}^2 & \rho\sigma_{\alpha}\sigma_{\beta} \\ \rho\sigma_{\alpha}\sigma_{\beta} & \sigma_{\beta}^2 \end{pmatrix} \tag{4}\]

We know there is a correlation because (a) we generated the data that way and (b) we can directly observe this when we visualized the data.

Finally, the role of $\boldsymbol{\epsilon}$ is to capture any residual variance. Between observations, it is assumed to be homogenous and independent.

Non-linear algebra form

Equation 1 is written concisely in linear algebra form. However, since our dataset is relatively simple (only one predictor variable), equation 1 can be written in an expanded, alternative form as equation 2. This might make it easier to understand (at least it did for me). The notation will start to get hairy with subscripts and so I will explicitly rename some variables for this explanation. It will also better match with the Bayesian set of equations described in the McElreath text. Equation 2 is written at the level of a single observation $i$. I’ll repeat Equation 1 here so it’s easier to see the conversion.

\[\textbf{y} = \textbf{X} \boldsymbol{\beta} + \textbf{Z} \textbf{u} + \boldsymbol{\epsilon} \tag{5}\] \[W_i = (\alpha + \beta \times A_i) + (a_{\text{cafe}[i]} + b_{\text{cafe}[i]} \times A_i) + \epsilon_i \tag{6}\]

Let’s start off with the left side where we can see that $\textbf{y}$ will now be $W_i$ for wait time. On the right side, I have segmented the fixed and random effects with parentheses. For both, I’ve deconstructed the linear algebra expression form to a simpler form. After re-arrangement, we can obtain the following form in equation 3.

\[W_i = (\alpha + a_{\text{cafe}[i]}) + (\beta + b_{\text{cafe}[i]}) \times A_i + \epsilon_{\text{cafe}} \tag{7}\]

Here, we can better appreciate how a cafe’s random effects intercept can be thought of as an offset from the population intercept. The same logic of an offset can be applied to its slope. We will come back to equation 3 after covering Equation set 2, the Bayesian approach.

Equation set 2: fixed effects as an adaptive prior, varying effects in the linear model

The following equations are taken from Chapter 14 in Statistical Rethinking. These set of equations look like a beast, but to be honest, they’re more intuitive to me, probably because I learned this approach initially. I’ll state the equations before comparing them directly with Equation set 1 but you may already start seeing the relationship. Essentially what is going on is a re-writing of the above equations in a Bayesian way such that the fixed effects can act as an adaptive prior.

$W_i \sim \text{Normal}(\mu_i, \sigma) \tag{8}$ $\mu_i = \alpha_{\text{cafe}[i]} + \beta_{\text{cafe}[i]} \times A_{i} \tag{9}$ $\sigma \sim \text{Exp}(1) \tag{10}$

Equation 8 is stating how wait time is normally distributed around $\mu$ and $\sigma$. By making $w_i$ stochastic instead of deterministic (using a ~ instead of =), the $\sigma$ replaces $\epsilon_i$. In equation 10, the prior for $\sigma$ is exponentially distributed and paramaterized with 1. The expected value parameter $\mu$ comes from the linear model in equation 9. You can start to see the similarities with equation 7 above.

\[\begin{bmatrix}\alpha_{\text{cafe}} \\ \beta_{\text{cafe}} \end{bmatrix} \sim \text{MVNormal} \left( \begin{bmatrix}{\alpha} \\ {\beta} \end{bmatrix} , \textbf{S} \right) \tag{11}\]

The $\alpha_{\text{cafe}}$ and $\beta_{\text{cafe}}$ terms come from sampling of a multivariate normal distribution as shown in equation 11. Note the very subtle difference in placement of the subscript cafe when compared to equation 6 and 7. This is an important point I’ll discuss later. On the right side, the two-dimensional normal distribution’s expected values are $\alpha$ and $\beta$. The rest of the equations shown below are our priors for each parameter we’re trying to estimate.

\[\alpha \sim \text{Normal}(5, 2) \tag{13}\] \[\beta \sim \text{Normal}(-1, 0.5) \tag{14}\] \[\textbf{S} = \begin{pmatrix} \sigma_{\alpha}^2 & \rho\sigma_{\alpha}\sigma_{\beta} \\ \rho\sigma_{\alpha}\sigma_{\beta} & \sigma_{\beta}^2 \end{pmatrix} = \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \textbf{R} \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \tag{12}\]

$\sigma, \sigma_{\alpha}, \sigma_{\beta} \sim \text{Exp}(1) \tag{15}$ $\textbf{R} \sim \text{LKJCorr}(2) \tag{16}$

Comparison of equation sets

To recap, the first equation set has an explicit fixed effects term and varying effects term in the linear model. In the second equation, the linear model is already “mixed”. It contains both the fixed and varying effects terms implicitly. The fixed effects estimates can be seen in equation 5.

I think you can think of these $\alpha_{\text{cafe}}$ and $\beta_{\text{cafe}}$ terms as already incorporating the information from the fixed and random effects simultaneously.

Now that we have the dataset, we can run the two models, one with lmer and one with pymc. Here are the equations that these packages run.

Running equation set 1 with `lmer` (frequentist)

The lmer and by extension (brms) syntax was initially confusing to me.

lmer(wait ~ 1 + afternoon + (1 + afternoon | cafe), df_cafes)

The 1 corresponds to inclusion of the intercept term. A 0 would exclude it. The 1 + wait corresponds to the “fixed effects” portion of the model ($\alpha + \beta \times A_i$) while the (1 + wait | cafe) is the “varying effects” ($a_{\text{cafe}} + b_{\text{cafe}} \times A_i$).

%%R -i df_cafes -o m -o df_fe_estimates -o df_fe_ci -o df_fe_summary

# m df_fe_summary
m <- lmer(wait ~ 1 + afternoon + (1 + afternoon | cafe), df_cafes)
arm::display(m)

# get fixed effects coefficients
df_fe_estimates <- data.frame(summary(m)$coefficients)
# get fixed effects coefficient CIs
df_fe_ci <- data.frame(confint(m))

df_fe_summary <- merge(
    df_fe_estimates,
    df_fe_ci[c('(Intercept)', 'afternoon'), ],
    by.x=0,
    by.y=0
)
rownames(df_fe_summary) <- df_fe_summary[, 1]

lmer(formula = wait ~ 1 + afternoon + (1 + afternoon | cafe), 
    data = df_cafes)
            coef.est coef.se
(Intercept)  3.64     0.23  
afternoon   -1.04     0.11  

Error terms:
 Groups   Name        Std.Dev. Corr  
 cafe     (Intercept) 0.99           
          afternoon   0.39     -0.74 
 Residual             0.48           
---
number of obs: 200, groups: cafe, 20
AIC = 369.9, DIC = 349.2
deviance = 353.5 


R[write to console]: Computing profile confidence intervals ...

Can we get the partial pooling results from the lmer output and see how it compares with the unpooled estimates? Let’s export it for use later.

%%R -i m -o df_partial_pooling -o random_sims

# Make a dataframe with the fitted effects
df_partial_pooling <- coef(m)[["cafe"]] %>% 
  rownames_to_column("cafe") %>% 
  as_tibble() %>% 
  rename(Intercept = `(Intercept)`, Slope_Days = afternoon) %>% 
  add_column(Model = "Partial pooling")

# estimate confidence interval
random_sims <- REsim(m, n.sims = 1000)
#plotREsim(random_sims)

random_sims

	groupFctr	groupID	term	mean	median	sd
1	cafe	0	(Intercept)	-1.277651	-1.283341	0.379761
2	cafe	1	(Intercept)	0.164935	0.162715	0.420411
3	cafe	2	(Intercept)	-1.047076	-1.043646	0.387153
4	cafe	3	(Intercept)	0.474320	0.500552	0.400053
5	cafe	4	(Intercept)	-1.473647	-1.468940	0.394707
6	cafe	5	(Intercept)	0.086072	0.082010	0.408971
7	cafe	6	(Intercept)	-0.640217	-0.628944	0.412642
8	cafe	7	(Intercept)	1.507154	1.516430	0.391119
9	cafe	8	(Intercept)	-0.657831	-0.659448	0.394984
10	cafe	9	(Intercept)	0.332758	0.331037	0.388295
11	cafe	10	(Intercept)	-1.018611	-1.025387	0.389930
12	cafe	11	(Intercept)	0.925071	0.913997	0.397095
13	cafe	12	(Intercept)	-1.407149	-1.403259	0.384820
14	cafe	13	(Intercept)	-0.412975	-0.414958	0.412863
15	cafe	14	(Intercept)	1.346380	1.343109	0.403694
16	cafe	15	(Intercept)	0.336807	0.346523	0.390567
17	cafe	16	(Intercept)	0.747439	0.735906	0.413094
18	cafe	17	(Intercept)	-0.046579	-0.035018	0.396795
19	cafe	18	(Intercept)	1.659019	1.646634	0.393909
20	cafe	19	(Intercept)	0.323375	0.327348	0.392401
21	cafe	0	afternoon	0.498557	0.501401	0.182594
22	cafe	1	afternoon	-0.336036	-0.337360	0.193462
23	cafe	2	afternoon	0.395379	0.391621	0.189140
24	cafe	3	afternoon	0.296956	0.293144	0.191710
25	cafe	4	afternoon	0.059611	0.055121	0.189680
26	cafe	5	afternoon	-0.033068	-0.036143	0.194723
27	cafe	6	afternoon	0.236107	0.237904	0.192575
28	cafe	7	afternoon	-0.473485	-0.479199	0.185549
29	cafe	8	afternoon	0.408039	0.411507	0.194145
30	cafe	9	afternoon	-0.402131	-0.393931	0.186868
31	cafe	10	afternoon	0.316072	0.309198	0.189218
32	cafe	11	afternoon	-0.335749	-0.340427	0.186644
33	cafe	12	afternoon	0.521558	0.519243	0.184606
34	cafe	13	afternoon	-0.006800	-0.014344	0.199548
35	cafe	14	afternoon	-0.277165	-0.281127	0.188748
36	cafe	15	afternoon	-0.234501	-0.235683	0.192804
37	cafe	16	afternoon	-0.182673	-0.185997	0.194017
38	cafe	17	afternoon	-0.017126	-0.023784	0.187302
39	cafe	18	afternoon	-0.364424	-0.364049	0.187532
40	cafe	19	afternoon	-0.028883	-0.032691	0.185824

OK, now let’s try the Bayesian approach and compare answers.

Running equation set 2 with `pymc` (Bayesian)

n_cafes = df_cafes['cafe'].nunique()
cafe_idx = pd.Categorical(df_cafes["cafe"]).codes

with pm.Model() as m14_1:
    # can't specify a separate sigma_a and sigma_b for sd_dist but they're equivalent here
    chol, Rho_, sigma_cafe = pm.LKJCholeskyCov(
        "chol_cov", n=2, eta=2, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    
    a_bar = pm.Normal("a_bar", mu=5, sigma=2.0)  # prior for average intercept
    b_bar = pm.Normal("b_bar", mu=-1, sigma=0.5)  # prior for average slope

    
    ab_subject = pm.MvNormal(
        "ab_subject", mu=at.stack([a_bar, b_bar]), chol=chol, shape=(n_cafes, 2)
    )  # population of varying effects
    # shape needs to be (n_cafes, 2) because we're getting back both a and b for each cafe

    mu = ab_subject[cafe_idx, 0] + ab_subject[cafe_idx, 1] * df_cafes["afternoon"].values  # linear model
    sigma_within = pm.Exponential("sigma_within", 1.0)  # prior stddev within cafes (in the top line)

    wait = pm.Normal("wait", mu=mu, sigma=sigma_within, observed=df_cafes["wait"].values)  # likelihood

    idata_m14_1 = pm.sample(1000, target_accept=0.9)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [chol_cov, a_bar, b_bar, ab_subject, sigma_within]

100.00% [8000/8000 02:03<00:00 Sampling 4 chains, 1 divergences]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 140 seconds.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.

# take a glimpse at the head and tail of the summary table
pd.concat(
    [
        az.summary(idata_m14_1).head(10),
        az.summary(idata_m14_1).tail(10)
    ]
)

/Users/blacar/opt/anaconda3/envs/pymc_env2/lib/python3.10/site-packages/arviz/stats/diagnostics.py:586: RuntimeWarning: invalid value encountered in double_scalars
  (between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)
/Users/blacar/opt/anaconda3/envs/pymc_env2/lib/python3.10/site-packages/arviz/stats/diagnostics.py:586: RuntimeWarning: invalid value encountered in double_scalars
  (between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)

	mean	sd	hdi_2.5%	hdi_97.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
a_bar	3.654	0.223	3.203	4.074	0.003	0.002	4802.0	3140.0	1.0
b_bar	-1.049	0.109	-1.265	-0.844	0.002	0.001	3446.0	3200.0	1.0
ab_subject[0, 0]	2.380	0.200	1.996	2.785	0.003	0.002	4271.0	2783.0	1.0
ab_subject[0, 1]	-0.587	0.245	-1.071	-0.119	0.004	0.003	3077.0	2833.0	1.0
ab_subject[1, 0]	3.820	0.199	3.442	4.220	0.003	0.002	3988.0	3167.0	1.0
ab_subject[1, 1]	-1.402	0.248	-1.897	-0.945	0.004	0.003	3165.0	3182.0	1.0
ab_subject[2, 0]	2.606	0.199	2.210	2.988	0.003	0.002	4702.0	3450.0	1.0
ab_subject[2, 1]	-0.681	0.240	-1.156	-0.218	0.004	0.003	3696.0	3014.0	1.0
ab_subject[3, 0]	4.120	0.203	3.739	4.532	0.003	0.002	3475.0	2800.0	1.0
ab_subject[3, 1]	-0.707	0.266	-1.213	-0.184	0.005	0.004	2482.0	2921.0	1.0
chol_cov[0]	0.988	0.163	0.710	1.328	0.002	0.002	5207.0	3263.0	1.0
chol_cov[1]	-0.226	0.105	-0.442	-0.033	0.002	0.001	2769.0	3178.0	1.0
chol_cov[2]	0.299	0.093	0.120	0.481	0.002	0.002	1379.0	1308.0	1.0
sigma_within	0.482	0.027	0.431	0.534	0.000	0.000	3773.0	2542.0	1.0
chol_cov_corr[0, 0]	1.000	0.000	1.000	1.000	0.000	0.000	4000.0	4000.0	NaN
chol_cov_corr[0, 1]	-0.579	0.192	-0.898	-0.196	0.003	0.002	3196.0	2983.0	1.0
chol_cov_corr[1, 0]	-0.579	0.192	-0.898	-0.196	0.003	0.002	3196.0	2983.0	1.0
chol_cov_corr[1, 1]	1.000	0.000	1.000	1.000	0.000	0.000	4087.0	4000.0	1.0
chol_cov_stds[0]	0.988	0.163	0.710	1.328	0.002	0.002	5207.0	3263.0	1.0
chol_cov_stds[1]	0.386	0.107	0.182	0.605	0.003	0.002	1541.0	1201.0	1.0

Comparison of `lmer` and `pymc` outputs

While pymc returns posterior estimates for each parameter, including $\rho$, for this post, we are interested in comparing the output comparable to the “fixed effects” and “varying effects” from lmer. Having the equations above can help us piece together the relevant bits of information. The fixed intercept and slope are easy because we’ve used the same characters $\alpha$ and $\beta$ in equation set 2 as we did in Equation set 1.

However, when identifying the “varying effects”, we’ll have to do some arithmetic with the pymc output. In contrast with the lmer output, the pymc outputs have the estimate for each cafe with “baked in” varying effects. In other words, the “offset” that we see in equation 7 ($a_{\text{cafe}[i]}$ and $b_{\text{cafe}[i]}$)

\[W_i = (\alpha + a_{\text{cafe}[i]}) + (\beta + b_{\text{cafe}[i]}) \times A_i + \epsilon_{\text{cafe}} \tag{7}\] \[\mu_i = \alpha_{\text{cafe}[i]} + \beta_{\text{cafe}[i]} \times A_{i} \tag{9}\]

are already embedded in ($\alpha_{\text{cafe}[i]}$ and $\beta_{\text{cafe}[i]}$) in equation 9. We’ll have to therefore subtract out the fixed effecs in the pymc output before we can compare with the lmer output. First, let’s get fixed effects from pymc.

df_summary_int_and_slope = az.summary(idata_m14_1, var_names=['a_bar', 'b_bar'])
df_summary_int_and_slope

	mean	sd	hdi_2.5%	hdi_97.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
a_bar	3.654	0.223	3.203	4.074	0.003	0.002	4802.0	3140.0	1.0
b_bar	-1.049	0.109	-1.265	-0.844	0.002	0.001	3446.0	3200.0	1.0

These estimates and uncertainties compare well with the fixed estimates lmer.

f, (ax0, ax1) = plt.subplots(1, 2, figsize=(12,4))
# value to generate data
# a, average morning wait time was defined above
ax0.vlines(x=a, ymin=0.8, ymax=1.2, linestyle='dashed', color='red')
ax1.vlines(x=b, ymin=0.8, ymax=1.2, linestyle='dashed', color='red', label='simulated value')

# pymc fixed effects value
ax0.scatter(df_summary_int_and_slope.loc['a_bar', 'mean'], 1.1, color='navy')
ax0.hlines(xmin=df_summary_int_and_slope.loc['a_bar', 'hdi_2.5%'], xmax=df_summary_int_and_slope.loc['a_bar', 'hdi_97.5%'], y=1.1, color='navy')
ax1.scatter(df_summary_int_and_slope.loc['b_bar', 'mean'], 1.1, color='navy')
ax1.hlines(xmin=df_summary_int_and_slope.loc['b_bar', 'hdi_2.5%'], xmax=df_summary_int_and_slope.loc['b_bar', 'hdi_97.5%'], y=1.1, color='navy', label='pymc estimate')

# lmer fixed effects estimate
ax0.scatter(df_fe_summary.loc['(Intercept)', 'Estimate'], 0.9, color='darkgreen')
ax0.hlines(xmin=df_fe_summary.loc['(Intercept)', 'X2.5..'], xmax=df_fe_summary.loc['(Intercept)', 'X97.5..'], y=0.9, color='darkgreen')
ax1.scatter(df_fe_summary.loc['afternoon', 'Estimate'], 0.9, color='darkgreen')
ax1.hlines(xmin=df_fe_summary.loc['afternoon', 'X2.5..'], xmax=df_fe_summary.loc['afternoon', 'X97.5..'], y=0.9, color='darkgreen', label='lmer estimate')

# plot formatting
f.suptitle('Fixed effect estimates')
ax0.set_yticks([0.9, 1.1])
ax0.set_yticklabels(['lmer', 'pymc'])

ax1.set_yticks([0.9, 1.1])
ax1.set_yticklabels(['', ''])

ax0.set(xlabel='intercept')
ax1.set(xlabel='slope')
ax1.legend(fontsize=10)
plt.tight_layout()

/var/folders/tw/b9j0wcdj6_9cyljwt364lx7c0000gn/T/ipykernel_5516/1253574855.py:30: UserWarning: This figure was using constrained_layout, but that is incompatible with subplots_adjust and/or tight_layout; disabling constrained_layout.
  plt.tight_layout()

png

As promised, here is the meme that rewards you for paying attention this far! jpg

Now to get the varying effects from pymc output, we’ll take each sample’s intercept and slope and subtract the fixed estimate.

# Convert to pandas dataframe and take a glimpse at the first few rows
idata_m14_1_df = idata_m14_1.to_dataframe()
idata_m14_1_df.head()

	draw	(posterior, a_bar)	(posterior, b_bar)	(posterior, ab_subject[0,0], 0, 0)	(posterior, ab_subject[0,1], 0, 1)	(posterior, ab_subject[1,0], 1, 0)	(posterior, ab_subject[1,1], 1, 1)	(posterior, ab_subject[10,0], 10, 0)	(posterior, ab_subject[10,1], 10, 1)	(posterior, ab_subject[11,0], 11, 0)	(posterior, ab_subject[11,1], 11, 1)	(posterior, ab_subject[12,0], 12, 0)	(posterior, ab_subject[12,1], 12, 1)	(posterior, ab_subject[13,0], 13, 0)	(posterior, ab_subject[13,1], 13, 1)	(posterior, ab_subject[14,0], 14, 0)	(posterior, ab_subject[14,1], 14, 1)	(posterior, ab_subject[15,0], 15, 0)	(posterior, ab_subject[15,1], 15, 1)	...	(log_likelihood, wait[97], 97)	(log_likelihood, wait[98], 98)	(log_likelihood, wait[99], 99)	(log_likelihood, wait[9], 9)	(sample_stats, tree_depth)	(sample_stats, max_energy_error)	(sample_stats, process_time_diff)	(sample_stats, perf_counter_diff)	(sample_stats, energy)	(sample_stats, step_size_bar)	(sample_stats, diverging)	(sample_stats, energy_error)	(sample_stats, lp)	(sample_stats, acceptance_rate)	(sample_stats, n_steps)	(sample_stats, largest_eigval)	(sample_stats, smallest_eigval)	(sample_stats, index_in_trajectory)	(sample_stats, step_size)	(sample_stats, perf_counter_start)
0	0	3.397744	-0.993140	2.353823	-0.712216	3.936642	-1.328451	2.497521	-0.990675	4.589760	-1.271864	2.272038	-0.780358	3.400074	-1.307487	4.660517	-0.920542	3.967868	-1.339014	...	-0.592594	-0.280869	-1.783441	-0.404212	5	-0.452539	0.234946	0.067260	194.679539	0.246795	False	-0.226605	-167.432037	0.975607	31.0	NaN	NaN	-17	0.284311	192.355518
1	1	3.227032	-1.105823	2.486742	-0.657790	3.890044	-1.788579	2.894867	-0.741011	4.346072	-1.048541	2.446301	-0.678041	3.564795	-1.520221	5.013627	-1.128684	3.793134	-1.084814	...	-0.581570	-0.708670	-1.709776	-0.741664	4	0.498338	0.123327	0.033713	196.867266	0.246795	False	0.273832	-177.694232	0.809115	15.0	NaN	NaN	-8	0.284311	192.423125
2	2	3.393307	-0.926431	2.348434	-0.604619	3.905778	-1.355137	2.712834	-1.124770	4.409195	-1.291088	2.324233	-0.754508	3.586107	-1.562165	5.050191	-1.556993	4.122478	-1.718417	...	-0.452885	-0.109849	-2.293094	-0.559207	5	-0.382814	0.236803	0.063232	207.926089	0.246795	False	-0.347905	-176.112370	0.968229	31.0	NaN	NaN	6	0.284311	192.457135
3	3	3.750943	-1.109148	2.613325	-0.667234	3.682009	-1.293790	2.558511	-0.362557	4.548968	-1.266139	2.264383	-0.445725	3.102086	-0.903726	4.589499	-0.409875	4.063760	-1.249921	...	-1.239451	-0.574010	-0.906557	-1.015460	4	-0.530897	0.116930	0.037484	198.279760	0.246795	False	-0.024171	-180.489888	0.987683	15.0	NaN	NaN	-9	0.284311	192.520656
4	4	3.416951	-1.152993	2.478859	-0.812085	3.773041	-1.423143	2.136978	-0.465100	4.385045	-1.180823	2.160109	-0.395771	3.459758	-1.300131	5.527213	-2.107117	3.906480	-1.388326	...	-0.400278	-0.240346	-2.188396	-0.471960	5	-0.382498	0.241781	0.072736	207.993298	0.246795	False	-0.041904	-183.942618	0.999986	31.0	NaN	NaN	-24	0.284311	192.558443

5 rows × 270 columns

# Get the "unbaked in" varying intercept and slope
bayesian_int = list()
bayesian_slope = list()
for i in range(20):
    idata_m14_1_df[f'varying_int_{i}'] = idata_m14_1_df[ ('posterior', f'ab_subject[{i},0]', i, 0)] - idata_m14_1_df[('posterior', 'a_bar')]
    bayesian_int.append(idata_m14_1_df[f'varying_int_{i}'].mean())

    idata_m14_1_df[f'varying_slope_{i}'] = idata_m14_1_df[ ('posterior', f'ab_subject[{i},1]', i, 1)] - idata_m14_1_df[('posterior', 'b_bar')]
    bayesian_slope.append(idata_m14_1_df[f'varying_slope_{i}'].mean())

We can now make a direct comparison between the lmer and pymc outputs. I’ll ignore the uncertainties for the sake of a cleaner plot.

random_sims_int = random_sims.loc[random_sims['term']=='(Intercept)', 'mean'].copy()
random_sims_slope = random_sims.loc[random_sims['term']=='afternoon', 'mean'].copy()

f, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 4))

min_max_int = [min(list(random_sims_int) + bayesian_int), max(list(random_sims_int) + bayesian_int)]
min_max_slope = [min(list(random_sims_slope) + bayesian_slope), max(list(random_sims_slope) + bayesian_slope)]

# intercepts
ax0.scatter(random_sims_int, bayesian_int, facecolors='none', edgecolors='navy')
ax0.plot(min_max_int, min_max_int, linestyle='dashed', color='gray')
ax0.set(xlabel='lmer intercept estimates', ylabel='pymc intercept estimates', title='Comparison of varying intercepts')

# slopes
ax1.scatter(random_sims_slope, bayesian_slope, facecolors='none', edgecolors='navy')
ax1.plot(min_max_slope, min_max_slope, linestyle='dashed', color='gray')
ax1.set(xlabel='lmer slope estimates', ylabel='pymc slope estimates', title='Comparison of varying slopes')

[Text(0.5, 0, 'lmer slope estimates'),
 Text(0, 0.5, 'pymc slope estimates'),
 Text(0.5, 1.0, 'Comparison of varying slopes')]

png

As you can see we get very similar intercepts and slopes for the cafe-specific estimates (varying effects) for the intercept and slope between the lmer and pymc approaches.

Summary

Here in this post, I set out to compare different mixed model approaches. I looked at the equations and the programmatic implementations. I concluded by showing how the two methods can arrive at the same answer. It required a careful understanding of the differences in equations and coding language- and package-specific implementations. There were various points of writing this post that confused me but it provided opportunities for deeper understanding.

Acknowledgements and references

Acknowledgements

Special shoutout to Patrick Robotham (@probot) from the University of Bayes Discord channel for helping me work through many of my confusions.
Eric J. Daza about some discussions about mixed effects modeling. It reminded me about improving my knowledge in this area.
Members of the Glymour group at UCSF for checking some of my code.

References

UCLA introduction to linear mixed models.
Richard McElreath’s Statistical Rethinking for my introduction to Bayesian multilevel modeling and the Statistical Rethinking Chapter 14 repo.
Andrzej Gałecki and Tomasz Burzykowski’s Linear Mixed-Effecsts Models Using R which references the lme4 package. Dr. McElreath referenced this package as a non-Bayesian alternative in his book.
Andrew Gelman wrote about why he doesn’t like using “fixed and random effects” (in a blog and in a paper).
TJ Mahr’s partial pooling blog post.

%load_ext watermark
%watermark -n -u -v -iv -w -p aesara,aeppl

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Last updated: Tue Sep 13 2022

Python implementation: CPython
Python version       : 3.10.6
IPython version      : 8.4.0

aesara: 2.8.2
aeppl : 0.0.35

pymc      : 4.1.7
xarray    : 2022.6.0
pandas    : 1.4.3
sys       : 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:43:44) [Clang 13.0.1 ]
arviz     : 0.12.1
matplotlib: 3.5.3
aesara    : 2.8.2
numpy     : 1.23.2

Watermark: 2.3.1

%%R
sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Monterey 12.5.1

Matrix products: default
LAPACK: /Users/blacar/opt/anaconda3/envs/pymc_env2/lib/libopenblasp-r0.3.21.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] merTools_0.5.2  arm_1.13-1      MASS_7.3-58.1   lme4_1.1-30    
 [5] Matrix_1.4-1    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.9    
 [9] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.8   
[13] ggplot2_3.3.6   tidyverse_1.3.2

loaded via a namespace (and not attached):
 [1] httr_1.4.4          jsonlite_1.8.0      splines_4.1.3      
 [4] foreach_1.5.2       modelr_0.1.9        shiny_1.7.2        
 [7] assertthat_0.2.1    broom.mixed_0.2.9.4 googlesheets4_1.0.1
[10] cellranger_1.1.0    globals_0.16.1      pillar_1.8.1       
[13] backports_1.4.1     lattice_0.20-45     glue_1.6.2         
[16] digest_0.6.29       promises_1.2.0.1    rvest_1.0.3        
[19] minqa_1.2.4         colorspace_2.0-3    httpuv_1.6.5       
[22] htmltools_0.5.3     pkgconfig_2.0.3     broom_1.0.0        
[25] listenv_0.8.0       haven_2.5.1         xtable_1.8-4       
[28] mvtnorm_1.1-3       scales_1.2.1        later_1.3.0        
[31] tzdb_0.3.0          googledrive_2.0.0   farver_2.1.1       
[34] generics_0.1.3      ellipsis_0.3.2      withr_2.5.0        
[37] furrr_0.3.1         cli_3.3.0           mime_0.12          
[40] magrittr_2.0.3      crayon_1.5.1        readxl_1.4.1       
[43] fs_1.5.2            future_1.27.0       fansi_1.0.3        
[46] parallelly_1.32.1   nlme_3.1-159        xml2_1.3.3         
[49] hms_1.1.2           gargle_1.2.0        lifecycle_1.0.1    
[52] munsell_0.5.0       reprex_2.0.2        compiler_4.1.3     
[55] rlang_1.0.4         blme_1.0-5          grid_4.1.3         
[58] nloptr_2.0.3        iterators_1.0.14    labeling_0.4.2     
[61] boot_1.3-28         gtable_0.3.0        codetools_0.2-18   
[64] abind_1.4-5         DBI_1.1.3           R6_2.5.1           
[67] lubridate_1.8.0     fastmap_1.1.0       utf8_1.2.2         
[70] stringi_1.7.8       parallel_4.1.3      Rcpp_1.0.9         
[73] vctrs_0.4.1         dbplyr_2.2.1        tidyselect_1.1.2   
[76] coda_0.19-4        

Twitter Facebook LinkedIn

When the Spider-Man meme is relevant to multilevel models

Create synthetic `cafe` dataset

Visualize data

Definitions of mixed effects modeling

Equation set 1: both fixed and random effects terms in linear model

Non-linear algebra form

Equation set 2: fixed effects as an adaptive prior, varying effects in the linear model

Comparison of equation sets

Running equation set 1 with `lmer` (frequentist)

Running equation set 2 with `pymc` (Bayesian)

Comparison of `lmer` and `pymc` outputs

Summary

Acknowledgements and references

You May Also Enjoy

Instrumental variable analysis with a binary outcome

Instrumental variable analysis with Bayesian modeling and statsmodels

Time series with varying intercepts

Follow-up after getting causal estimates

	0	1	20	21
0	1	0	0	0
1	1	0	1	0
2	1	0	0	0
3	1	0	1	0
4	1	0	0	0
5	1	0	1	0
6	1	0	0	0
7	1	0	1	0
8	1	0	0	0
9	1	0	1	0
10	0	1	0	0
11	0	1	0	1

	0	1	20	21
0	1	0	0	0
1	1	0	1	0
2	1	0	0	0
3	1	0	1	0
4	1	0	0	0
5	1	0	1	0
6	1	0	0	0
7	1	0	1	0
8	1	0	0	0
9	1	0	1	0
10	0	1	0	0
11	0	1	0	1

Create synthetic cafe dataset

Visualize data

Definitions of mixed effects modeling

Equation set 1: both fixed and random effects terms in linear model

Non-linear algebra form

Equation set 2: fixed effects as an adaptive prior, varying effects in the linear model

Comparison of equation sets

Running equation set 1 with lmer (frequentist)

Running equation set 2 with pymc (Bayesian)

Comparison of lmer and pymc outputs

Summary

Acknowledgements and references

You May Also Enjoy

Instrumental variable analysis with a binary outcome

Instrumental variable analysis with Bayesian modeling and statsmodels

Time series with varying intercepts

Follow-up after getting causal estimates

Create synthetic `cafe` dataset

Running equation set 1 with `lmer` (frequentist)

Running equation set 2 with `pymc` (Bayesian)

Comparison of `lmer` and `pymc` outputs

	0	1	20	21
0	1	0	0	0
1	1	0	1	0
2	1	0	0	0
3	1	0	1	0
4	1	0	0	0
5	1	0	1	0
6	1	0	0	0
7	1	0	1	0
8	1	0	0	0
9	1	0	1	0
10	0	1	0	0
11	0	1	0	1