<h1 id="when-the-spider-man-meme-is-relevant-to-multilevel-models">When the Spider-Man meme is relevant to multilevel models</h1>
<p><em>Ben Lacar, 2022-09-13</em></p>
<p>For a while, I’ve wondered about the different approaches for multilevel modeling, also known as mixed effects modeling. My initial understanding comes from a Bayesian perspective, since I learned about the topic from Statistical Rethinking. But when hearing others talk about “fixed effects”, “varying effects”, “random effects”, and “mixed effects”, I had trouble connecting my own understanding of the concept to theirs. Even more perplexing, I wasn’t sure what the <em>source(s)</em> of the differences were:</p>
<ul>
<li>Is it a frequentist vs. Bayesian thing?</li>
<li>Is it a statistical package thing?</li>
<li>Is it because there are five different definitions of “fixed and random effects”, <a href="https://statmodeling.stat.columbia.edu/2005/01/25/why_i_dont_use/">infamously observed by Andrew Gelman</a> and why he avoids using those terms?</li>
</ul>
<p>I decided to take a deep dive to resolve my confusion, with much help from numerous sources. Please check out the <a href="#acknowledgements-and-references">Acknowledgments and references</a> section!</p>
<p>In this post, I’ll be comparing an example of mixed effects modeling across statistical philosophies and across statistical languages. As a bonus, a meme awaits.</p>
<table>
<thead>
<tr>
<th>method</th>
<th>approach</th>
<th>language</th>
<th>package</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>frequentist</td>
<td>R</td>
<td><code class="language-plaintext highlighter-rouge">lme4</code></td>
</tr>
<tr>
<td>2</td>
<td>Bayesian</td>
<td>Python</td>
<td><code class="language-plaintext highlighter-rouge">pymc</code></td>
</tr>
</tbody>
</table>
<p>Note that the default language in the code blocks is Python. A cell running R is designated with <code class="language-plaintext highlighter-rouge">%%R</code> at the top. A variable shared between the two languages can be passed into R (<code class="language-plaintext highlighter-rouge">-i</code>) or returned back to Python (<code class="language-plaintext highlighter-rouge">-o</code>) on that same line.</p>
<p><em>Special thanks to Patrick Robotham for providing a lot of feedback.</em></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">aesara</span> <span class="kn">import</span> <span class="n">tensor</span> <span class="k">as</span> <span class="n">at</span>
<span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pymc</span> <span class="k">as</span> <span class="n">pm</span>
<span class="kn">import</span> <span class="nn">xarray</span> <span class="k">as</span> <span class="n">xr</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"stats.hdi_prob"</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.95</span>
<span class="k">def</span> <span class="nf">standardize</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Enable running of R code
</span><span class="o">%</span><span class="n">load_ext</span> <span class="n">rpy2</span><span class="p">.</span><span class="n">ipython</span>
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">R</span><span class="w">
</span><span class="n">suppressMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">))</span><span class="w">
</span><span class="n">suppressMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">lme4</span><span class="p">))</span><span class="w">
</span><span class="n">suppressMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">arm</span><span class="p">))</span><span class="w">
</span><span class="n">suppressMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">merTools</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<h1 id="create-synthetic-cafe-dataset">Create synthetic <code class="language-plaintext highlighter-rouge">cafe</code> dataset</h1>
<p>The dataset I am using is created from a scenario described in Statistical Rethinking.</p>
<p>Here are a few more details of the dataset from Dr. McElreath’s book:</p>
<blockquote>
<p>Begin by defining the population of cafés that the robot might visit. This means we’ll define the average wait time in the morning and the afternoon, as well as the correlation between them. These numbers are sufficient to define the average properties of the cafés. Let’s define these properties, then we’ll sample cafés from them.</p>
</blockquote>
<p>Nearly all Python code is taken from the <a href="https://github.com/pymc-devs/pymc-resources/blob/main/Rethinking_2/Chp_14.ipynb">Statistical Rethinking pymc repo</a> with some minor alterations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mf">3.5</span> <span class="c1"># average morning wait time
</span><span class="n">b</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="c1"># average difference afternoon wait time
</span><span class="n">sigma_a</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="c1"># std dev in intercepts
</span><span class="n">sigma_b</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="c1"># std dev in slopes
</span><span class="n">rho</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.7</span> <span class="c1"># correlation between intercepts and slopes
</span>
<span class="n">Mu</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span>
<span class="n">sigmas</span> <span class="o">=</span> <span class="p">[</span><span class="n">sigma_a</span><span class="p">,</span> <span class="n">sigma_b</span><span class="p">]</span>
<span class="n">Rho</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">matrix</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="n">rho</span><span class="p">],</span> <span class="p">[</span><span class="n">rho</span><span class="p">,</span> <span class="mi">1</span><span class="p">]])</span>
<span class="n">Sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">sigmas</span><span class="p">)</span> <span class="o">*</span> <span class="n">Rho</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">sigmas</span><span class="p">)</span> <span class="c1"># covariance matrix
</span>
<span class="n">N_cafes</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">vary_effects</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="n">Mu</span><span class="p">,</span> <span class="n">cov</span><span class="o">=</span><span class="n">Sigma</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N_cafes</span><span class="p">)</span>
<span class="n">a_cafe</span> <span class="o">=</span> <span class="n">vary_effects</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">b_cafe</span> <span class="o">=</span> <span class="n">vary_effects</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
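Since this covariance structure drives everything downstream, it's worth a quick sanity check. Below is a self-contained sketch (re-declaring the same population parameters, and using a large number of draws purely for illustration) confirming that intercepts and slopes sampled this way carry the intended correlation:

```python
import numpy as np

# Re-create the population parameters from above
a, b = 3.5, -1.0             # average morning wait, average afternoon difference
sigma_a, sigma_b = 1.0, 0.5  # std dev in intercepts and slopes
rho = -0.7                   # correlation between intercepts and slopes

Mu = np.array([a, b])
Rho = np.array([[1, rho], [rho, 1]])
Sigma = np.diag([sigma_a, sigma_b]) @ Rho @ np.diag([sigma_a, sigma_b])

# With many simulated "cafes", the empirical correlation between the sampled
# intercepts and slopes should approach rho
rng = np.random.default_rng(1234)
draws = rng.multivariate_normal(mean=Mu, cov=Sigma, size=100_000)
corr = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]
print(corr)  # close to -0.7
```

With only 20 cafes, the empirical correlation will of course be much noisier, which is part of why estimating it well requires the pooling machinery discussed later.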
<p>Now simulate the observations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">N_visits</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">afternoon</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">tile</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">N_visits</span> <span class="o">*</span> <span class="n">N_cafes</span> <span class="o">//</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">cafe_id</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">N_cafes</span><span class="p">),</span> <span class="n">N_visits</span><span class="p">)</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">a_cafe</span><span class="p">[</span><span class="n">cafe_id</span><span class="p">]</span> <span class="o">+</span> <span class="n">b_cafe</span><span class="p">[</span><span class="n">cafe_id</span><span class="p">]</span> <span class="o">*</span> <span class="n">afternoon</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="c1"># std dev within cafes
</span>
<span class="n">wait</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">mu</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N_visits</span> <span class="o">*</span> <span class="n">N_cafes</span><span class="p">)</span>
<span class="n">df_cafes</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">cafe</span><span class="o">=</span><span class="n">cafe_id</span><span class="p">,</span> <span class="n">afternoon</span><span class="o">=</span><span class="n">afternoon</span><span class="p">,</span> <span class="n">wait</span><span class="o">=</span><span class="n">wait</span><span class="p">))</span>
</code></pre></div></div>
<p>To get a sense of the data structure we just created, let’s take a look at the first and last 5 rows.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_cafes</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>cafe</th>
<th>afternoon</th>
<th>wait</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>0</td>
<td>2.724888</td>
</tr>
<tr>
<th>1</th>
<td>0</td>
<td>1</td>
<td>1.951626</td>
</tr>
<tr>
<th>2</th>
<td>0</td>
<td>0</td>
<td>2.488389</td>
</tr>
<tr>
<th>3</th>
<td>0</td>
<td>1</td>
<td>1.188077</td>
</tr>
<tr>
<th>4</th>
<td>0</td>
<td>0</td>
<td>2.026425</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_cafes</span><span class="p">.</span><span class="n">tail</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>cafe</th>
<th>afternoon</th>
<th>wait</th>
</tr>
</thead>
<tbody>
<tr>
<th>195</th>
<td>19</td>
<td>1</td>
<td>3.394933</td>
</tr>
<tr>
<th>196</th>
<td>19</td>
<td>0</td>
<td>4.544430</td>
</tr>
<tr>
<th>197</th>
<td>19</td>
<td>1</td>
<td>2.719524</td>
</tr>
<tr>
<th>198</th>
<td>19</td>
<td>0</td>
<td>3.379111</td>
</tr>
<tr>
<th>199</th>
<td>19</td>
<td>1</td>
<td>2.459750</td>
</tr>
</tbody>
</table>
</div>
<p>Note that this dataset is balanced, meaning that each group (cafe) has the same number of observations. Mixed effects / multilevel models especially shine with unbalanced data, where they can leverage partial pooling.</p>
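To make that balance concrete, here is a small self-contained check (re-building just the index arrays from the simulation above) that every cafe contributes the same number of visits, split evenly between morning and afternoon:

```python
import numpy as np
import pandas as pd

N_cafes, N_visits = 20, 10
afternoon = np.tile([0, 1], N_visits * N_cafes // 2)
cafe_id = np.repeat(np.arange(N_cafes), N_visits)
df = pd.DataFrame(dict(cafe=cafe_id, afternoon=afternoon))

visits_per_cafe = df.groupby("cafe").size()          # 10 visits for every cafe
afternoon_share = df.groupby("cafe")["afternoon"].mean()  # half of visits are afternoon
print(visits_per_cafe.unique(), afternoon_share.unique())  # [10] [0.5]
```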
<h1 id="visualize-data">Visualize data</h1>
<p>Let’s plot the raw data and see how the afternoon influences wait time. Instead of plotting the cafes in their arbitrary order (0 to 19), I’ll order them by increasing average morning wait time so that we can appreciate the variability across the dataset.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">R</span><span class="w"> </span><span class="o">-</span><span class="n">i</span><span class="w"> </span><span class="n">df_cafes</span><span class="w">
</span><span class="c1"># credit to TJ Mahr for a template of this code</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Afternoon"</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Wait time"</span><span class="w">
</span><span class="n">titlelab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Wait times for each cafe (ordered by increasing average time)"</span><span class="w">
</span><span class="c1"># order by increasing average morning wait time (intercept only)</span><span class="w">
</span><span class="n">cafe_ordered_by_avgwaittime</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df_cafes</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">afternoon</span><span class="o">==</span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">cafe</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">wait</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="c1"># Turn the cafe column from a numeric into a factor with a certain order</span><span class="w">
</span><span class="n">df_cafes</span><span class="o">$</span><span class="n">cafe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">df_cafes</span><span class="o">$</span><span class="n">cafe</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="o">=</span><span class="n">cafe_ordered_by_avgwaittime</span><span class="o">$</span><span class="n">cafe</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">df_cafes</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">afternoon</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wait</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_boxplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="n">factor</span><span class="p">(</span><span class="n">afternoon</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">stat_summary</span><span class="p">(</span><span class="n">fun</span><span class="o">=</span><span class="s2">"mean"</span><span class="p">,</span><span class="w"> </span><span class="n">geom</span><span class="o">=</span><span class="s2">"line"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="s2">"cafe"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlab</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylab</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="o">=</span><span class="n">titlelab</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/2022-09-13-mixed_effects_freqvsbayes_cafes_files/2022-09-13-mixed_effects_freqvsbayes_cafes_15_0.png" alt="png" /></p>
<p>One pattern is that as the morning wait time (i.e. the intercept) increases, the drop in wait time in the afternoon (the slope) gets steeper. In other words, when we simulated this dataset, we included a <em>covariance</em> structure between the intercept and slope. When we develop an inferential model with this data, we want to be able to recover this covariance.</p>
<h1 id="definitions-of-mixed-effects-modeling">Definitions of mixed effects modeling</h1>
<h2 id="equation-set-1-both-fixed-and-random-effects-terms-in-linear-model">Equation set 1: both fixed and random effects terms in linear model</h2>
<p><a href="https://link.springer.com/book/10.1007/978-1-4614-3900-4">Galecki and Burzykowski</a>, <a href="https://en.wikipedia.org/wiki/Mixed_model">Wikipedia</a>, and <a href="https://stats.oarc.ucla.edu/other/mult-pkg/introduction-to-linear-mixed-models/">this page from UCLA</a> all describe a linear mixed model with an equation similar to equation 1 below.</p>
<p><em>I rely heavily on the UCLA page since it is the one that helped me the most. In fact, if you don’t care about how it connects to the Bayesian approach, stop reading this and check that out instead!</em></p>
<p>In contrast to the Bayesian set of equations we’ll see later, the fixed effects and random effects appear in the same equation here.</p>
\[\textbf{y} = \textbf{X} \boldsymbol{\beta} + \textbf{Z} \textbf{u} + \boldsymbol{\epsilon} \tag{1}\]
<p>The left side of the equation, $\textbf{y}$, represents all of our observations (the wait times in the cafe example). The $\boldsymbol{\beta}$ in the first term represents a vector of coefficients shared across the population of cafes. These are the fixed effects. The $\textbf{u}$ in the second term represents a vector of coefficients for <em>each individual cafe</em>. These are the random effects. Both $\textbf{X}$ and $\textbf{Z}$ are design matrices of covariates. Finally, there’s a residual error term $\boldsymbol{\epsilon}$.</p>
<p>When relating this equation back to the cafe dataset we just created, I needed to dig into how the terms represent an individual observation versus the group (cafe) level. Doing a dimensional analysis helped.</p>
<table>
<thead>
<tr>
<th>Equation 1 variable</th>
<th>Dimensions</th>
<th>Effects type</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\textbf{y}$</td>
<td>200 x 1</td>
<td>n/a</td>
<td>This vector represents the wait time for all 200 observations. I’ll refer to this as $w_i$ later in equation 2.</td>
</tr>
<tr>
<td>$\textbf{X}$</td>
<td>200 x 2</td>
<td>associated with fixed</td>
<td>The first column of each observation is 1 since it is multiplied by the intercept term. The second column is $A$, which will be 0 or 1 for <code class="language-plaintext highlighter-rouge">afternoon</code>.</td>
</tr>
<tr>
<td>$\boldsymbol{\beta}$</td>
<td>2 x 1</td>
<td>fixed</td>
<td>The two elements in the $\boldsymbol{\beta}$ (bold font beta) are what I’ll refer to as the intercept $\alpha$ and the slope $\beta$ (unbolded beta) across all cafes in equation 2.</td>
</tr>
<tr>
<td>$\textbf{Z}$</td>
<td>200 x (2x20)</td>
<td>associated with random</td>
<td>The first 20 columns represent intercepts for each cafe and the second 20 represent the covariate (<code class="language-plaintext highlighter-rouge">afternoon</code>). See visual below.</td>
</tr>
<tr>
<td>$\textbf{u}$</td>
<td>(2x20) x 1</td>
<td>random</td>
<td>$\textbf{u}$ holds each of the 20 cafes’ intercept and slope terms, $a_\text{cafe}$ and $b_\text{cafe}$, expressed as deviations from the population-level $\boldsymbol{\beta}$. There’s an implied correlation structure between them.</td>
</tr>
<tr>
<td>$\boldsymbol{\epsilon}$</td>
<td>200 x 1</td>
<td>n/a</td>
<td>Normally distributed residual error.</td>
</tr>
</tbody>
</table>
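As a sanity check on these dimensions, here is a self-contained sketch that builds $\textbf{X}$, $\boldsymbol{\beta}$, $\textbf{Z}$, and $\textbf{u}$ with the shapes in the table and confirms that $\textbf{X}\boldsymbol{\beta} + \textbf{Z}\textbf{u} + \boldsymbol{\epsilon}$ yields one value per observation. The per-cafe deviations in <code class="language-plaintext highlighter-rouge">u</code> are drawn arbitrarily here, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N_cafes, N_visits = 20, 10
N = N_cafes * N_visits  # 200 observations

afternoon = np.tile([0, 1], N // 2)
cafe_id = np.repeat(np.arange(N_cafes), N_visits)

# Fixed-effects design matrix X: a column of 1s (intercept) and the afternoon indicator
X = np.column_stack([np.ones(N), afternoon])  # 200 x 2
beta = np.array([3.5, -1.0])                  # population intercept and slope

# Random-effects design matrix Z: an intercept column and an afternoon column per cafe
Z = np.zeros((N, 2 * N_cafes))                # 200 x 40
Z[np.arange(N), cafe_id] = 1
Z[np.arange(N), N_cafes + cafe_id] = afternoon

u = rng.normal(0.0, 0.5, size=2 * N_cafes)    # illustrative per-cafe deviations, 40 x 1
eps = rng.normal(0.0, 0.5, size=N)            # residual error, 200 x 1

y = X @ beta + Z @ u + eps
print(X.shape, Z.shape, u.shape, y.shape)  # (200, 2) (200, 40) (40,) (200,)
```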
<p>To better understand what $\textbf{Z}$ looks like, we can create an alternate representation of <code class="language-plaintext highlighter-rouge">df_cafes</code>. Each row of the matrix $\textbf{Z}$ corresponds to an individual observation. The first 20 columns are the intercept indicators for the 20 cafes (column 1 is cafe 1, column 2 is cafe 2, etc.). These columns all contain a 0 <em>except</em> for the column of the cafe the observation belongs to, which contains a 1. The next 20 columns (columns 21-40) represent <code class="language-plaintext highlighter-rouge">afternoon</code>. This second group of columns is all 0 <em>except</em> for the column of the cafe the observation belongs to, <em>and</em> only when the observation is an afternoon observation.</p>
<p>To be clear, the structure of <code class="language-plaintext highlighter-rouge">df_cafes</code>, where each row is an observation with the cafe, afternoon status, and wait time, is already in a form to be understood by the <code class="language-plaintext highlighter-rouge">lmer</code> and <code class="language-plaintext highlighter-rouge">pymc</code> packages. What I’m showing below is to help understand what the matrix $\textbf{Z}$ looks like in the above equations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Z</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">200</span><span class="p">,</span> <span class="mi">40</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">df_cafes</span><span class="p">.</span><span class="n">index</span><span class="p">:</span>
    <span class="n">cafe</span> <span class="o">=</span> <span class="n">df_cafes</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s">'cafe'</span><span class="p">]</span>
    <span class="n">is_afternoon</span> <span class="o">=</span> <span class="n">df_cafes</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s">'afternoon'</span><span class="p">]</span>  <span class="c1"># avoid clobbering the global afternoon array</span>
    <span class="n">Z</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">cafe</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="n">Z</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">20</span> <span class="o">+</span> <span class="n">cafe</span><span class="p">]</span> <span class="o">=</span> <span class="n">is_afternoon</span>
</code></pre></div></div>
<p>We can take a look at the first 12 rows of Z. The first 10 rows are for the first cafe, and its observations alternate between morning and afternoon, hence the alternating values in column 20. I included the first two rows of the second cafe to show how the <code class="language-plaintext highlighter-rouge">1</code> shifts over a column after the first 10 rows. I’ll use <code class="language-plaintext highlighter-rouge">pandas</code> to better display the values.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">set_option</span><span class="p">(</span><span class="s">'display.max_columns'</span><span class="p">,</span> <span class="mi">40</span><span class="p">)</span>
<span class="p">(</span>
<span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">Z</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">12</span><span class="p">,</span> <span class="p">:])</span>
<span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="p">.</span><span class="n">style</span>
<span class="p">.</span><span class="n">highlight_max</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">props</span><span class="o">=</span><span class="s">'color:navy; background-color:yellow;'</span><span class="p">)</span>
<span class="p">.</span><span class="n">highlight_min</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">props</span><span class="o">=</span><span class="s">'color:white; background-color:#3E0B51;'</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<style type="text/css">
#T_7cfc2_row0_col0, #T_7cfc2_row1_col0, #T_7cfc2_row1_col20, #T_7cfc2_row2_col0, #T_7cfc2_row3_col0, #T_7cfc2_row3_col20, #T_7cfc2_row4_col0, #T_7cfc2_row5_col0, #T_7cfc2_row5_col20, #T_7cfc2_row6_col0, #T_7cfc2_row7_col0, #T_7cfc2_row7_col20, #T_7cfc2_row8_col0, #T_7cfc2_row9_col0, #T_7cfc2_row9_col20, #T_7cfc2_row10_col1, #T_7cfc2_row11_col1, #T_7cfc2_row11_col21 {
color: navy;
background-color: yellow;
}
#T_7cfc2_row0_col1, #T_7cfc2_row0_col2, #T_7cfc2_row0_col3, #T_7cfc2_row0_col4, #T_7cfc2_row0_col5, #T_7cfc2_row0_col6, #T_7cfc2_row0_col7, #T_7cfc2_row0_col8, #T_7cfc2_row0_col9, #T_7cfc2_row0_col10, #T_7cfc2_row0_col11, #T_7cfc2_row0_col12, #T_7cfc2_row0_col13, #T_7cfc2_row0_col14, #T_7cfc2_row0_col15, #T_7cfc2_row0_col16, #T_7cfc2_row0_col17, #T_7cfc2_row0_col18, #T_7cfc2_row0_col19, #T_7cfc2_row0_col20, #T_7cfc2_row0_col21, #T_7cfc2_row0_col22, #T_7cfc2_row0_col23, #T_7cfc2_row0_col24, #T_7cfc2_row0_col25, #T_7cfc2_row0_col26, #T_7cfc2_row0_col27, #T_7cfc2_row0_col28, #T_7cfc2_row0_col29, #T_7cfc2_row0_col30, #T_7cfc2_row0_col31, #T_7cfc2_row0_col32, #T_7cfc2_row0_col33, #T_7cfc2_row0_col34, #T_7cfc2_row0_col35, #T_7cfc2_row0_col36, #T_7cfc2_row0_col37, #T_7cfc2_row0_col38, #T_7cfc2_row0_col39, #T_7cfc2_row1_col1, #T_7cfc2_row1_col2, #T_7cfc2_row1_col3, #T_7cfc2_row1_col4, #T_7cfc2_row1_col5, #T_7cfc2_row1_col6, #T_7cfc2_row1_col7, #T_7cfc2_row1_col8, #T_7cfc2_row1_col9, #T_7cfc2_row1_col10, #T_7cfc2_row1_col11, #T_7cfc2_row1_col12, #T_7cfc2_row1_col13, #T_7cfc2_row1_col14, #T_7cfc2_row1_col15, #T_7cfc2_row1_col16, #T_7cfc2_row1_col17, #T_7cfc2_row1_col18, #T_7cfc2_row1_col19, #T_7cfc2_row1_col21, #T_7cfc2_row1_col22, #T_7cfc2_row1_col23, #T_7cfc2_row1_col24, #T_7cfc2_row1_col25, #T_7cfc2_row1_col26, #T_7cfc2_row1_col27, #T_7cfc2_row1_col28, #T_7cfc2_row1_col29, #T_7cfc2_row1_col30, #T_7cfc2_row1_col31, #T_7cfc2_row1_col32, #T_7cfc2_row1_col33, #T_7cfc2_row1_col34, #T_7cfc2_row1_col35, #T_7cfc2_row1_col36, #T_7cfc2_row1_col37, #T_7cfc2_row1_col38, #T_7cfc2_row1_col39, #T_7cfc2_row2_col1, #T_7cfc2_row2_col2, #T_7cfc2_row2_col3, #T_7cfc2_row2_col4, #T_7cfc2_row2_col5, #T_7cfc2_row2_col6, #T_7cfc2_row2_col7, #T_7cfc2_row2_col8, #T_7cfc2_row2_col9, #T_7cfc2_row2_col10, #T_7cfc2_row2_col11, #T_7cfc2_row2_col12, #T_7cfc2_row2_col13, #T_7cfc2_row2_col14, #T_7cfc2_row2_col15, #T_7cfc2_row2_col16, #T_7cfc2_row2_col17, #T_7cfc2_row2_col18, #T_7cfc2_row2_col19, 
#T_7cfc2_row2_col20, #T_7cfc2_row2_col21, #T_7cfc2_row2_col22, #T_7cfc2_row2_col23, #T_7cfc2_row2_col24, #T_7cfc2_row2_col25, #T_7cfc2_row2_col26, #T_7cfc2_row2_col27, #T_7cfc2_row2_col28, #T_7cfc2_row2_col29, #T_7cfc2_row2_col30, #T_7cfc2_row2_col31, #T_7cfc2_row2_col32, #T_7cfc2_row2_col33, #T_7cfc2_row2_col34, #T_7cfc2_row2_col35, #T_7cfc2_row2_col36, #T_7cfc2_row2_col37, #T_7cfc2_row2_col38, #T_7cfc2_row2_col39, #T_7cfc2_row3_col1, #T_7cfc2_row3_col2, #T_7cfc2_row3_col3, #T_7cfc2_row3_col4, #T_7cfc2_row3_col5, #T_7cfc2_row3_col6, #T_7cfc2_row3_col7, #T_7cfc2_row3_col8, #T_7cfc2_row3_col9, #T_7cfc2_row3_col10, #T_7cfc2_row3_col11, #T_7cfc2_row3_col12, #T_7cfc2_row3_col13, #T_7cfc2_row3_col14, #T_7cfc2_row3_col15, #T_7cfc2_row3_col16, #T_7cfc2_row3_col17, #T_7cfc2_row3_col18, #T_7cfc2_row3_col19, #T_7cfc2_row3_col21, #T_7cfc2_row3_col22, #T_7cfc2_row3_col23, #T_7cfc2_row3_col24, #T_7cfc2_row3_col25, #T_7cfc2_row3_col26, #T_7cfc2_row3_col27, #T_7cfc2_row3_col28, #T_7cfc2_row3_col29, #T_7cfc2_row3_col30, #T_7cfc2_row3_col31, #T_7cfc2_row3_col32, #T_7cfc2_row3_col33, #T_7cfc2_row3_col34, #T_7cfc2_row3_col35, #T_7cfc2_row3_col36, #T_7cfc2_row3_col37, #T_7cfc2_row3_col38, #T_7cfc2_row3_col39, #T_7cfc2_row4_col1, #T_7cfc2_row4_col2, #T_7cfc2_row4_col3, #T_7cfc2_row4_col4, #T_7cfc2_row4_col5, #T_7cfc2_row4_col6, #T_7cfc2_row4_col7, #T_7cfc2_row4_col8, #T_7cfc2_row4_col9, #T_7cfc2_row4_col10, #T_7cfc2_row4_col11, #T_7cfc2_row4_col12, #T_7cfc2_row4_col13, #T_7cfc2_row4_col14, #T_7cfc2_row4_col15, #T_7cfc2_row4_col16, #T_7cfc2_row4_col17, #T_7cfc2_row4_col18, #T_7cfc2_row4_col19, #T_7cfc2_row4_col20, #T_7cfc2_row4_col21, #T_7cfc2_row4_col22, #T_7cfc2_row4_col23, #T_7cfc2_row4_col24, #T_7cfc2_row4_col25, #T_7cfc2_row4_col26, #T_7cfc2_row4_col27, #T_7cfc2_row4_col28, #T_7cfc2_row4_col29, #T_7cfc2_row4_col30, #T_7cfc2_row4_col31, #T_7cfc2_row4_col32, #T_7cfc2_row4_col33, #T_7cfc2_row4_col34, #T_7cfc2_row4_col35, #T_7cfc2_row4_col36, #T_7cfc2_row4_col37, #T_7cfc2_row4_col38, 
#T_7cfc2_row4_col39, #T_7cfc2_row5_col1, #T_7cfc2_row5_col2, #T_7cfc2_row5_col3, #T_7cfc2_row5_col4, #T_7cfc2_row5_col5, #T_7cfc2_row5_col6, #T_7cfc2_row5_col7, #T_7cfc2_row5_col8, #T_7cfc2_row5_col9, #T_7cfc2_row5_col10, #T_7cfc2_row5_col11, #T_7cfc2_row5_col12, #T_7cfc2_row5_col13, #T_7cfc2_row5_col14, #T_7cfc2_row5_col15, #T_7cfc2_row5_col16, #T_7cfc2_row5_col17, #T_7cfc2_row5_col18, #T_7cfc2_row5_col19, #T_7cfc2_row5_col21, #T_7cfc2_row5_col22, #T_7cfc2_row5_col23, #T_7cfc2_row5_col24, #T_7cfc2_row5_col25, #T_7cfc2_row5_col26, #T_7cfc2_row5_col27, #T_7cfc2_row5_col28, #T_7cfc2_row5_col29, #T_7cfc2_row5_col30, #T_7cfc2_row5_col31, #T_7cfc2_row5_col32, #T_7cfc2_row5_col33, #T_7cfc2_row5_col34, #T_7cfc2_row5_col35, #T_7cfc2_row5_col36, #T_7cfc2_row5_col37, #T_7cfc2_row5_col38, #T_7cfc2_row5_col39, #T_7cfc2_row6_col1, #T_7cfc2_row6_col2, #T_7cfc2_row6_col3, #T_7cfc2_row6_col4, #T_7cfc2_row6_col5, #T_7cfc2_row6_col6, #T_7cfc2_row6_col7, #T_7cfc2_row6_col8, #T_7cfc2_row6_col9, #T_7cfc2_row6_col10, #T_7cfc2_row6_col11, #T_7cfc2_row6_col12, #T_7cfc2_row6_col13, #T_7cfc2_row6_col14, #T_7cfc2_row6_col15, #T_7cfc2_row6_col16, #T_7cfc2_row6_col17, #T_7cfc2_row6_col18, #T_7cfc2_row6_col19, #T_7cfc2_row6_col20, #T_7cfc2_row6_col21, #T_7cfc2_row6_col22, #T_7cfc2_row6_col23, #T_7cfc2_row6_col24, #T_7cfc2_row6_col25, #T_7cfc2_row6_col26, #T_7cfc2_row6_col27, #T_7cfc2_row6_col28, #T_7cfc2_row6_col29, #T_7cfc2_row6_col30, #T_7cfc2_row6_col31, #T_7cfc2_row6_col32, #T_7cfc2_row6_col33, #T_7cfc2_row6_col34, #T_7cfc2_row6_col35, #T_7cfc2_row6_col36, #T_7cfc2_row6_col37, #T_7cfc2_row6_col38, #T_7cfc2_row6_col39, #T_7cfc2_row7_col1, #T_7cfc2_row7_col2, #T_7cfc2_row7_col3, #T_7cfc2_row7_col4, #T_7cfc2_row7_col5, #T_7cfc2_row7_col6, #T_7cfc2_row7_col7, #T_7cfc2_row7_col8, #T_7cfc2_row7_col9, #T_7cfc2_row7_col10, #T_7cfc2_row7_col11, #T_7cfc2_row7_col12, #T_7cfc2_row7_col13, #T_7cfc2_row7_col14, #T_7cfc2_row7_col15, #T_7cfc2_row7_col16, #T_7cfc2_row7_col17, #T_7cfc2_row7_col18, 
#T_7cfc2_row7_col19, #T_7cfc2_row7_col21, #T_7cfc2_row7_col22, #T_7cfc2_row7_col23, #T_7cfc2_row7_col24, #T_7cfc2_row7_col25, #T_7cfc2_row7_col26, #T_7cfc2_row7_col27, #T_7cfc2_row7_col28, #T_7cfc2_row7_col29, #T_7cfc2_row7_col30, #T_7cfc2_row7_col31, #T_7cfc2_row7_col32, #T_7cfc2_row7_col33, #T_7cfc2_row7_col34, #T_7cfc2_row7_col35, #T_7cfc2_row7_col36, #T_7cfc2_row7_col37, #T_7cfc2_row7_col38, #T_7cfc2_row7_col39, #T_7cfc2_row8_col1, #T_7cfc2_row8_col2, #T_7cfc2_row8_col3, #T_7cfc2_row8_col4, #T_7cfc2_row8_col5, #T_7cfc2_row8_col6, #T_7cfc2_row8_col7, #T_7cfc2_row8_col8, #T_7cfc2_row8_col9, #T_7cfc2_row8_col10, #T_7cfc2_row8_col11, #T_7cfc2_row8_col12, #T_7cfc2_row8_col13, #T_7cfc2_row8_col14, #T_7cfc2_row8_col15, #T_7cfc2_row8_col16, #T_7cfc2_row8_col17, #T_7cfc2_row8_col18, #T_7cfc2_row8_col19, #T_7cfc2_row8_col20, #T_7cfc2_row8_col21, #T_7cfc2_row8_col22, #T_7cfc2_row8_col23, #T_7cfc2_row8_col24, #T_7cfc2_row8_col25, #T_7cfc2_row8_col26, #T_7cfc2_row8_col27, #T_7cfc2_row8_col28, #T_7cfc2_row8_col29, #T_7cfc2_row8_col30, #T_7cfc2_row8_col31, #T_7cfc2_row8_col32, #T_7cfc2_row8_col33, #T_7cfc2_row8_col34, #T_7cfc2_row8_col35, #T_7cfc2_row8_col36, #T_7cfc2_row8_col37, #T_7cfc2_row8_col38, #T_7cfc2_row8_col39, #T_7cfc2_row9_col1, #T_7cfc2_row9_col2, #T_7cfc2_row9_col3, #T_7cfc2_row9_col4, #T_7cfc2_row9_col5, #T_7cfc2_row9_col6, #T_7cfc2_row9_col7, #T_7cfc2_row9_col8, #T_7cfc2_row9_col9, #T_7cfc2_row9_col10, #T_7cfc2_row9_col11, #T_7cfc2_row9_col12, #T_7cfc2_row9_col13, #T_7cfc2_row9_col14, #T_7cfc2_row9_col15, #T_7cfc2_row9_col16, #T_7cfc2_row9_col17, #T_7cfc2_row9_col18, #T_7cfc2_row9_col19, #T_7cfc2_row9_col21, #T_7cfc2_row9_col22, #T_7cfc2_row9_col23, #T_7cfc2_row9_col24, #T_7cfc2_row9_col25, #T_7cfc2_row9_col26, #T_7cfc2_row9_col27, #T_7cfc2_row9_col28, #T_7cfc2_row9_col29, #T_7cfc2_row9_col30, #T_7cfc2_row9_col31, #T_7cfc2_row9_col32, #T_7cfc2_row9_col33, #T_7cfc2_row9_col34, #T_7cfc2_row9_col35, #T_7cfc2_row9_col36, #T_7cfc2_row9_col37, #T_7cfc2_row9_col38, 
#T_7cfc2_row9_col39, #T_7cfc2_row10_col0, #T_7cfc2_row10_col2, #T_7cfc2_row10_col3, #T_7cfc2_row10_col4, #T_7cfc2_row10_col5, #T_7cfc2_row10_col6, #T_7cfc2_row10_col7, #T_7cfc2_row10_col8, #T_7cfc2_row10_col9, #T_7cfc2_row10_col10, #T_7cfc2_row10_col11, #T_7cfc2_row10_col12, #T_7cfc2_row10_col13, #T_7cfc2_row10_col14, #T_7cfc2_row10_col15, #T_7cfc2_row10_col16, #T_7cfc2_row10_col17, #T_7cfc2_row10_col18, #T_7cfc2_row10_col19, #T_7cfc2_row10_col20, #T_7cfc2_row10_col21, #T_7cfc2_row10_col22, #T_7cfc2_row10_col23, #T_7cfc2_row10_col24, #T_7cfc2_row10_col25, #T_7cfc2_row10_col26, #T_7cfc2_row10_col27, #T_7cfc2_row10_col28, #T_7cfc2_row10_col29, #T_7cfc2_row10_col30, #T_7cfc2_row10_col31, #T_7cfc2_row10_col32, #T_7cfc2_row10_col33, #T_7cfc2_row10_col34, #T_7cfc2_row10_col35, #T_7cfc2_row10_col36, #T_7cfc2_row10_col37, #T_7cfc2_row10_col38, #T_7cfc2_row10_col39, #T_7cfc2_row11_col0, #T_7cfc2_row11_col2, #T_7cfc2_row11_col3, #T_7cfc2_row11_col4, #T_7cfc2_row11_col5, #T_7cfc2_row11_col6, #T_7cfc2_row11_col7, #T_7cfc2_row11_col8, #T_7cfc2_row11_col9, #T_7cfc2_row11_col10, #T_7cfc2_row11_col11, #T_7cfc2_row11_col12, #T_7cfc2_row11_col13, #T_7cfc2_row11_col14, #T_7cfc2_row11_col15, #T_7cfc2_row11_col16, #T_7cfc2_row11_col17, #T_7cfc2_row11_col18, #T_7cfc2_row11_col19, #T_7cfc2_row11_col20, #T_7cfc2_row11_col22, #T_7cfc2_row11_col23, #T_7cfc2_row11_col24, #T_7cfc2_row11_col25, #T_7cfc2_row11_col26, #T_7cfc2_row11_col27, #T_7cfc2_row11_col28, #T_7cfc2_row11_col29, #T_7cfc2_row11_col30, #T_7cfc2_row11_col31, #T_7cfc2_row11_col32, #T_7cfc2_row11_col33, #T_7cfc2_row11_col34, #T_7cfc2_row11_col35, #T_7cfc2_row11_col36, #T_7cfc2_row11_col37, #T_7cfc2_row11_col38, #T_7cfc2_row11_col39 {
color: white;
background-color: #3E0B51;
}
</style>
<table id="T_7cfc2">
<thead>
<tr>
<th class="blank level0"> </th>
<th id="T_7cfc2_level0_col0" class="col_heading level0 col0">0</th>
<th id="T_7cfc2_level0_col1" class="col_heading level0 col1">1</th>
<th id="T_7cfc2_level0_col2" class="col_heading level0 col2">2</th>
<th id="T_7cfc2_level0_col3" class="col_heading level0 col3">3</th>
<th id="T_7cfc2_level0_col4" class="col_heading level0 col4">4</th>
<th id="T_7cfc2_level0_col5" class="col_heading level0 col5">5</th>
<th id="T_7cfc2_level0_col6" class="col_heading level0 col6">6</th>
<th id="T_7cfc2_level0_col7" class="col_heading level0 col7">7</th>
<th id="T_7cfc2_level0_col8" class="col_heading level0 col8">8</th>
<th id="T_7cfc2_level0_col9" class="col_heading level0 col9">9</th>
<th id="T_7cfc2_level0_col10" class="col_heading level0 col10">10</th>
<th id="T_7cfc2_level0_col11" class="col_heading level0 col11">11</th>
<th id="T_7cfc2_level0_col12" class="col_heading level0 col12">12</th>
<th id="T_7cfc2_level0_col13" class="col_heading level0 col13">13</th>
<th id="T_7cfc2_level0_col14" class="col_heading level0 col14">14</th>
<th id="T_7cfc2_level0_col15" class="col_heading level0 col15">15</th>
<th id="T_7cfc2_level0_col16" class="col_heading level0 col16">16</th>
<th id="T_7cfc2_level0_col17" class="col_heading level0 col17">17</th>
<th id="T_7cfc2_level0_col18" class="col_heading level0 col18">18</th>
<th id="T_7cfc2_level0_col19" class="col_heading level0 col19">19</th>
<th id="T_7cfc2_level0_col20" class="col_heading level0 col20">20</th>
<th id="T_7cfc2_level0_col21" class="col_heading level0 col21">21</th>
<th id="T_7cfc2_level0_col22" class="col_heading level0 col22">22</th>
<th id="T_7cfc2_level0_col23" class="col_heading level0 col23">23</th>
<th id="T_7cfc2_level0_col24" class="col_heading level0 col24">24</th>
<th id="T_7cfc2_level0_col25" class="col_heading level0 col25">25</th>
<th id="T_7cfc2_level0_col26" class="col_heading level0 col26">26</th>
<th id="T_7cfc2_level0_col27" class="col_heading level0 col27">27</th>
<th id="T_7cfc2_level0_col28" class="col_heading level0 col28">28</th>
<th id="T_7cfc2_level0_col29" class="col_heading level0 col29">29</th>
<th id="T_7cfc2_level0_col30" class="col_heading level0 col30">30</th>
<th id="T_7cfc2_level0_col31" class="col_heading level0 col31">31</th>
<th id="T_7cfc2_level0_col32" class="col_heading level0 col32">32</th>
<th id="T_7cfc2_level0_col33" class="col_heading level0 col33">33</th>
<th id="T_7cfc2_level0_col34" class="col_heading level0 col34">34</th>
<th id="T_7cfc2_level0_col35" class="col_heading level0 col35">35</th>
<th id="T_7cfc2_level0_col36" class="col_heading level0 col36">36</th>
<th id="T_7cfc2_level0_col37" class="col_heading level0 col37">37</th>
<th id="T_7cfc2_level0_col38" class="col_heading level0 col38">38</th>
<th id="T_7cfc2_level0_col39" class="col_heading level0 col39">39</th>
</tr>
</thead>
<tbody>
<tr>
<th id="T_7cfc2_level0_row0" class="row_heading level0 row0">0</th>
<td id="T_7cfc2_row0_col0" class="data row0 col0">1</td>
<td id="T_7cfc2_row0_col1" class="data row0 col1">0</td>
<td id="T_7cfc2_row0_col2" class="data row0 col2">0</td>
<td id="T_7cfc2_row0_col3" class="data row0 col3">0</td>
<td id="T_7cfc2_row0_col4" class="data row0 col4">0</td>
<td id="T_7cfc2_row0_col5" class="data row0 col5">0</td>
<td id="T_7cfc2_row0_col6" class="data row0 col6">0</td>
<td id="T_7cfc2_row0_col7" class="data row0 col7">0</td>
<td id="T_7cfc2_row0_col8" class="data row0 col8">0</td>
<td id="T_7cfc2_row0_col9" class="data row0 col9">0</td>
<td id="T_7cfc2_row0_col10" class="data row0 col10">0</td>
<td id="T_7cfc2_row0_col11" class="data row0 col11">0</td>
<td id="T_7cfc2_row0_col12" class="data row0 col12">0</td>
<td id="T_7cfc2_row0_col13" class="data row0 col13">0</td>
<td id="T_7cfc2_row0_col14" class="data row0 col14">0</td>
<td id="T_7cfc2_row0_col15" class="data row0 col15">0</td>
<td id="T_7cfc2_row0_col16" class="data row0 col16">0</td>
<td id="T_7cfc2_row0_col17" class="data row0 col17">0</td>
<td id="T_7cfc2_row0_col18" class="data row0 col18">0</td>
<td id="T_7cfc2_row0_col19" class="data row0 col19">0</td>
<td id="T_7cfc2_row0_col20" class="data row0 col20">0</td>
<td id="T_7cfc2_row0_col21" class="data row0 col21">0</td>
<td id="T_7cfc2_row0_col22" class="data row0 col22">0</td>
<td id="T_7cfc2_row0_col23" class="data row0 col23">0</td>
<td id="T_7cfc2_row0_col24" class="data row0 col24">0</td>
<td id="T_7cfc2_row0_col25" class="data row0 col25">0</td>
<td id="T_7cfc2_row0_col26" class="data row0 col26">0</td>
<td id="T_7cfc2_row0_col27" class="data row0 col27">0</td>
<td id="T_7cfc2_row0_col28" class="data row0 col28">0</td>
<td id="T_7cfc2_row0_col29" class="data row0 col29">0</td>
<td id="T_7cfc2_row0_col30" class="data row0 col30">0</td>
<td id="T_7cfc2_row0_col31" class="data row0 col31">0</td>
<td id="T_7cfc2_row0_col32" class="data row0 col32">0</td>
<td id="T_7cfc2_row0_col33" class="data row0 col33">0</td>
<td id="T_7cfc2_row0_col34" class="data row0 col34">0</td>
<td id="T_7cfc2_row0_col35" class="data row0 col35">0</td>
<td id="T_7cfc2_row0_col36" class="data row0 col36">0</td>
<td id="T_7cfc2_row0_col37" class="data row0 col37">0</td>
<td id="T_7cfc2_row0_col38" class="data row0 col38">0</td>
<td id="T_7cfc2_row0_col39" class="data row0 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row1" class="row_heading level0 row1">1</th>
<td id="T_7cfc2_row1_col0" class="data row1 col0">1</td>
<td id="T_7cfc2_row1_col1" class="data row1 col1">0</td>
<td id="T_7cfc2_row1_col2" class="data row1 col2">0</td>
<td id="T_7cfc2_row1_col3" class="data row1 col3">0</td>
<td id="T_7cfc2_row1_col4" class="data row1 col4">0</td>
<td id="T_7cfc2_row1_col5" class="data row1 col5">0</td>
<td id="T_7cfc2_row1_col6" class="data row1 col6">0</td>
<td id="T_7cfc2_row1_col7" class="data row1 col7">0</td>
<td id="T_7cfc2_row1_col8" class="data row1 col8">0</td>
<td id="T_7cfc2_row1_col9" class="data row1 col9">0</td>
<td id="T_7cfc2_row1_col10" class="data row1 col10">0</td>
<td id="T_7cfc2_row1_col11" class="data row1 col11">0</td>
<td id="T_7cfc2_row1_col12" class="data row1 col12">0</td>
<td id="T_7cfc2_row1_col13" class="data row1 col13">0</td>
<td id="T_7cfc2_row1_col14" class="data row1 col14">0</td>
<td id="T_7cfc2_row1_col15" class="data row1 col15">0</td>
<td id="T_7cfc2_row1_col16" class="data row1 col16">0</td>
<td id="T_7cfc2_row1_col17" class="data row1 col17">0</td>
<td id="T_7cfc2_row1_col18" class="data row1 col18">0</td>
<td id="T_7cfc2_row1_col19" class="data row1 col19">0</td>
<td id="T_7cfc2_row1_col20" class="data row1 col20">1</td>
<td id="T_7cfc2_row1_col21" class="data row1 col21">0</td>
<td id="T_7cfc2_row1_col22" class="data row1 col22">0</td>
<td id="T_7cfc2_row1_col23" class="data row1 col23">0</td>
<td id="T_7cfc2_row1_col24" class="data row1 col24">0</td>
<td id="T_7cfc2_row1_col25" class="data row1 col25">0</td>
<td id="T_7cfc2_row1_col26" class="data row1 col26">0</td>
<td id="T_7cfc2_row1_col27" class="data row1 col27">0</td>
<td id="T_7cfc2_row1_col28" class="data row1 col28">0</td>
<td id="T_7cfc2_row1_col29" class="data row1 col29">0</td>
<td id="T_7cfc2_row1_col30" class="data row1 col30">0</td>
<td id="T_7cfc2_row1_col31" class="data row1 col31">0</td>
<td id="T_7cfc2_row1_col32" class="data row1 col32">0</td>
<td id="T_7cfc2_row1_col33" class="data row1 col33">0</td>
<td id="T_7cfc2_row1_col34" class="data row1 col34">0</td>
<td id="T_7cfc2_row1_col35" class="data row1 col35">0</td>
<td id="T_7cfc2_row1_col36" class="data row1 col36">0</td>
<td id="T_7cfc2_row1_col37" class="data row1 col37">0</td>
<td id="T_7cfc2_row1_col38" class="data row1 col38">0</td>
<td id="T_7cfc2_row1_col39" class="data row1 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row2" class="row_heading level0 row2">2</th>
<td id="T_7cfc2_row2_col0" class="data row2 col0">1</td>
<td id="T_7cfc2_row2_col1" class="data row2 col1">0</td>
<td id="T_7cfc2_row2_col2" class="data row2 col2">0</td>
<td id="T_7cfc2_row2_col3" class="data row2 col3">0</td>
<td id="T_7cfc2_row2_col4" class="data row2 col4">0</td>
<td id="T_7cfc2_row2_col5" class="data row2 col5">0</td>
<td id="T_7cfc2_row2_col6" class="data row2 col6">0</td>
<td id="T_7cfc2_row2_col7" class="data row2 col7">0</td>
<td id="T_7cfc2_row2_col8" class="data row2 col8">0</td>
<td id="T_7cfc2_row2_col9" class="data row2 col9">0</td>
<td id="T_7cfc2_row2_col10" class="data row2 col10">0</td>
<td id="T_7cfc2_row2_col11" class="data row2 col11">0</td>
<td id="T_7cfc2_row2_col12" class="data row2 col12">0</td>
<td id="T_7cfc2_row2_col13" class="data row2 col13">0</td>
<td id="T_7cfc2_row2_col14" class="data row2 col14">0</td>
<td id="T_7cfc2_row2_col15" class="data row2 col15">0</td>
<td id="T_7cfc2_row2_col16" class="data row2 col16">0</td>
<td id="T_7cfc2_row2_col17" class="data row2 col17">0</td>
<td id="T_7cfc2_row2_col18" class="data row2 col18">0</td>
<td id="T_7cfc2_row2_col19" class="data row2 col19">0</td>
<td id="T_7cfc2_row2_col20" class="data row2 col20">0</td>
<td id="T_7cfc2_row2_col21" class="data row2 col21">0</td>
<td id="T_7cfc2_row2_col22" class="data row2 col22">0</td>
<td id="T_7cfc2_row2_col23" class="data row2 col23">0</td>
<td id="T_7cfc2_row2_col24" class="data row2 col24">0</td>
<td id="T_7cfc2_row2_col25" class="data row2 col25">0</td>
<td id="T_7cfc2_row2_col26" class="data row2 col26">0</td>
<td id="T_7cfc2_row2_col27" class="data row2 col27">0</td>
<td id="T_7cfc2_row2_col28" class="data row2 col28">0</td>
<td id="T_7cfc2_row2_col29" class="data row2 col29">0</td>
<td id="T_7cfc2_row2_col30" class="data row2 col30">0</td>
<td id="T_7cfc2_row2_col31" class="data row2 col31">0</td>
<td id="T_7cfc2_row2_col32" class="data row2 col32">0</td>
<td id="T_7cfc2_row2_col33" class="data row2 col33">0</td>
<td id="T_7cfc2_row2_col34" class="data row2 col34">0</td>
<td id="T_7cfc2_row2_col35" class="data row2 col35">0</td>
<td id="T_7cfc2_row2_col36" class="data row2 col36">0</td>
<td id="T_7cfc2_row2_col37" class="data row2 col37">0</td>
<td id="T_7cfc2_row2_col38" class="data row2 col38">0</td>
<td id="T_7cfc2_row2_col39" class="data row2 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row3" class="row_heading level0 row3">3</th>
<td id="T_7cfc2_row3_col0" class="data row3 col0">1</td>
<td id="T_7cfc2_row3_col1" class="data row3 col1">0</td>
<td id="T_7cfc2_row3_col2" class="data row3 col2">0</td>
<td id="T_7cfc2_row3_col3" class="data row3 col3">0</td>
<td id="T_7cfc2_row3_col4" class="data row3 col4">0</td>
<td id="T_7cfc2_row3_col5" class="data row3 col5">0</td>
<td id="T_7cfc2_row3_col6" class="data row3 col6">0</td>
<td id="T_7cfc2_row3_col7" class="data row3 col7">0</td>
<td id="T_7cfc2_row3_col8" class="data row3 col8">0</td>
<td id="T_7cfc2_row3_col9" class="data row3 col9">0</td>
<td id="T_7cfc2_row3_col10" class="data row3 col10">0</td>
<td id="T_7cfc2_row3_col11" class="data row3 col11">0</td>
<td id="T_7cfc2_row3_col12" class="data row3 col12">0</td>
<td id="T_7cfc2_row3_col13" class="data row3 col13">0</td>
<td id="T_7cfc2_row3_col14" class="data row3 col14">0</td>
<td id="T_7cfc2_row3_col15" class="data row3 col15">0</td>
<td id="T_7cfc2_row3_col16" class="data row3 col16">0</td>
<td id="T_7cfc2_row3_col17" class="data row3 col17">0</td>
<td id="T_7cfc2_row3_col18" class="data row3 col18">0</td>
<td id="T_7cfc2_row3_col19" class="data row3 col19">0</td>
<td id="T_7cfc2_row3_col20" class="data row3 col20">1</td>
<td id="T_7cfc2_row3_col21" class="data row3 col21">0</td>
<td id="T_7cfc2_row3_col22" class="data row3 col22">0</td>
<td id="T_7cfc2_row3_col23" class="data row3 col23">0</td>
<td id="T_7cfc2_row3_col24" class="data row3 col24">0</td>
<td id="T_7cfc2_row3_col25" class="data row3 col25">0</td>
<td id="T_7cfc2_row3_col26" class="data row3 col26">0</td>
<td id="T_7cfc2_row3_col27" class="data row3 col27">0</td>
<td id="T_7cfc2_row3_col28" class="data row3 col28">0</td>
<td id="T_7cfc2_row3_col29" class="data row3 col29">0</td>
<td id="T_7cfc2_row3_col30" class="data row3 col30">0</td>
<td id="T_7cfc2_row3_col31" class="data row3 col31">0</td>
<td id="T_7cfc2_row3_col32" class="data row3 col32">0</td>
<td id="T_7cfc2_row3_col33" class="data row3 col33">0</td>
<td id="T_7cfc2_row3_col34" class="data row3 col34">0</td>
<td id="T_7cfc2_row3_col35" class="data row3 col35">0</td>
<td id="T_7cfc2_row3_col36" class="data row3 col36">0</td>
<td id="T_7cfc2_row3_col37" class="data row3 col37">0</td>
<td id="T_7cfc2_row3_col38" class="data row3 col38">0</td>
<td id="T_7cfc2_row3_col39" class="data row3 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row4" class="row_heading level0 row4">4</th>
<td id="T_7cfc2_row4_col0" class="data row4 col0">1</td>
<td id="T_7cfc2_row4_col1" class="data row4 col1">0</td>
<td id="T_7cfc2_row4_col2" class="data row4 col2">0</td>
<td id="T_7cfc2_row4_col3" class="data row4 col3">0</td>
<td id="T_7cfc2_row4_col4" class="data row4 col4">0</td>
<td id="T_7cfc2_row4_col5" class="data row4 col5">0</td>
<td id="T_7cfc2_row4_col6" class="data row4 col6">0</td>
<td id="T_7cfc2_row4_col7" class="data row4 col7">0</td>
<td id="T_7cfc2_row4_col8" class="data row4 col8">0</td>
<td id="T_7cfc2_row4_col9" class="data row4 col9">0</td>
<td id="T_7cfc2_row4_col10" class="data row4 col10">0</td>
<td id="T_7cfc2_row4_col11" class="data row4 col11">0</td>
<td id="T_7cfc2_row4_col12" class="data row4 col12">0</td>
<td id="T_7cfc2_row4_col13" class="data row4 col13">0</td>
<td id="T_7cfc2_row4_col14" class="data row4 col14">0</td>
<td id="T_7cfc2_row4_col15" class="data row4 col15">0</td>
<td id="T_7cfc2_row4_col16" class="data row4 col16">0</td>
<td id="T_7cfc2_row4_col17" class="data row4 col17">0</td>
<td id="T_7cfc2_row4_col18" class="data row4 col18">0</td>
<td id="T_7cfc2_row4_col19" class="data row4 col19">0</td>
<td id="T_7cfc2_row4_col20" class="data row4 col20">0</td>
<td id="T_7cfc2_row4_col21" class="data row4 col21">0</td>
<td id="T_7cfc2_row4_col22" class="data row4 col22">0</td>
<td id="T_7cfc2_row4_col23" class="data row4 col23">0</td>
<td id="T_7cfc2_row4_col24" class="data row4 col24">0</td>
<td id="T_7cfc2_row4_col25" class="data row4 col25">0</td>
<td id="T_7cfc2_row4_col26" class="data row4 col26">0</td>
<td id="T_7cfc2_row4_col27" class="data row4 col27">0</td>
<td id="T_7cfc2_row4_col28" class="data row4 col28">0</td>
<td id="T_7cfc2_row4_col29" class="data row4 col29">0</td>
<td id="T_7cfc2_row4_col30" class="data row4 col30">0</td>
<td id="T_7cfc2_row4_col31" class="data row4 col31">0</td>
<td id="T_7cfc2_row4_col32" class="data row4 col32">0</td>
<td id="T_7cfc2_row4_col33" class="data row4 col33">0</td>
<td id="T_7cfc2_row4_col34" class="data row4 col34">0</td>
<td id="T_7cfc2_row4_col35" class="data row4 col35">0</td>
<td id="T_7cfc2_row4_col36" class="data row4 col36">0</td>
<td id="T_7cfc2_row4_col37" class="data row4 col37">0</td>
<td id="T_7cfc2_row4_col38" class="data row4 col38">0</td>
<td id="T_7cfc2_row4_col39" class="data row4 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row5" class="row_heading level0 row5">5</th>
<td id="T_7cfc2_row5_col0" class="data row5 col0">1</td>
<td id="T_7cfc2_row5_col1" class="data row5 col1">0</td>
<td id="T_7cfc2_row5_col2" class="data row5 col2">0</td>
<td id="T_7cfc2_row5_col3" class="data row5 col3">0</td>
<td id="T_7cfc2_row5_col4" class="data row5 col4">0</td>
<td id="T_7cfc2_row5_col5" class="data row5 col5">0</td>
<td id="T_7cfc2_row5_col6" class="data row5 col6">0</td>
<td id="T_7cfc2_row5_col7" class="data row5 col7">0</td>
<td id="T_7cfc2_row5_col8" class="data row5 col8">0</td>
<td id="T_7cfc2_row5_col9" class="data row5 col9">0</td>
<td id="T_7cfc2_row5_col10" class="data row5 col10">0</td>
<td id="T_7cfc2_row5_col11" class="data row5 col11">0</td>
<td id="T_7cfc2_row5_col12" class="data row5 col12">0</td>
<td id="T_7cfc2_row5_col13" class="data row5 col13">0</td>
<td id="T_7cfc2_row5_col14" class="data row5 col14">0</td>
<td id="T_7cfc2_row5_col15" class="data row5 col15">0</td>
<td id="T_7cfc2_row5_col16" class="data row5 col16">0</td>
<td id="T_7cfc2_row5_col17" class="data row5 col17">0</td>
<td id="T_7cfc2_row5_col18" class="data row5 col18">0</td>
<td id="T_7cfc2_row5_col19" class="data row5 col19">0</td>
<td id="T_7cfc2_row5_col20" class="data row5 col20">1</td>
<td id="T_7cfc2_row5_col21" class="data row5 col21">0</td>
<td id="T_7cfc2_row5_col22" class="data row5 col22">0</td>
<td id="T_7cfc2_row5_col23" class="data row5 col23">0</td>
<td id="T_7cfc2_row5_col24" class="data row5 col24">0</td>
<td id="T_7cfc2_row5_col25" class="data row5 col25">0</td>
<td id="T_7cfc2_row5_col26" class="data row5 col26">0</td>
<td id="T_7cfc2_row5_col27" class="data row5 col27">0</td>
<td id="T_7cfc2_row5_col28" class="data row5 col28">0</td>
<td id="T_7cfc2_row5_col29" class="data row5 col29">0</td>
<td id="T_7cfc2_row5_col30" class="data row5 col30">0</td>
<td id="T_7cfc2_row5_col31" class="data row5 col31">0</td>
<td id="T_7cfc2_row5_col32" class="data row5 col32">0</td>
<td id="T_7cfc2_row5_col33" class="data row5 col33">0</td>
<td id="T_7cfc2_row5_col34" class="data row5 col34">0</td>
<td id="T_7cfc2_row5_col35" class="data row5 col35">0</td>
<td id="T_7cfc2_row5_col36" class="data row5 col36">0</td>
<td id="T_7cfc2_row5_col37" class="data row5 col37">0</td>
<td id="T_7cfc2_row5_col38" class="data row5 col38">0</td>
<td id="T_7cfc2_row5_col39" class="data row5 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row6" class="row_heading level0 row6">6</th>
<td id="T_7cfc2_row6_col0" class="data row6 col0">1</td>
<td id="T_7cfc2_row6_col1" class="data row6 col1">0</td>
<td id="T_7cfc2_row6_col2" class="data row6 col2">0</td>
<td id="T_7cfc2_row6_col3" class="data row6 col3">0</td>
<td id="T_7cfc2_row6_col4" class="data row6 col4">0</td>
<td id="T_7cfc2_row6_col5" class="data row6 col5">0</td>
<td id="T_7cfc2_row6_col6" class="data row6 col6">0</td>
<td id="T_7cfc2_row6_col7" class="data row6 col7">0</td>
<td id="T_7cfc2_row6_col8" class="data row6 col8">0</td>
<td id="T_7cfc2_row6_col9" class="data row6 col9">0</td>
<td id="T_7cfc2_row6_col10" class="data row6 col10">0</td>
<td id="T_7cfc2_row6_col11" class="data row6 col11">0</td>
<td id="T_7cfc2_row6_col12" class="data row6 col12">0</td>
<td id="T_7cfc2_row6_col13" class="data row6 col13">0</td>
<td id="T_7cfc2_row6_col14" class="data row6 col14">0</td>
<td id="T_7cfc2_row6_col15" class="data row6 col15">0</td>
<td id="T_7cfc2_row6_col16" class="data row6 col16">0</td>
<td id="T_7cfc2_row6_col17" class="data row6 col17">0</td>
<td id="T_7cfc2_row6_col18" class="data row6 col18">0</td>
<td id="T_7cfc2_row6_col19" class="data row6 col19">0</td>
<td id="T_7cfc2_row6_col20" class="data row6 col20">0</td>
<td id="T_7cfc2_row6_col21" class="data row6 col21">0</td>
<td id="T_7cfc2_row6_col22" class="data row6 col22">0</td>
<td id="T_7cfc2_row6_col23" class="data row6 col23">0</td>
<td id="T_7cfc2_row6_col24" class="data row6 col24">0</td>
<td id="T_7cfc2_row6_col25" class="data row6 col25">0</td>
<td id="T_7cfc2_row6_col26" class="data row6 col26">0</td>
<td id="T_7cfc2_row6_col27" class="data row6 col27">0</td>
<td id="T_7cfc2_row6_col28" class="data row6 col28">0</td>
<td id="T_7cfc2_row6_col29" class="data row6 col29">0</td>
<td id="T_7cfc2_row6_col30" class="data row6 col30">0</td>
<td id="T_7cfc2_row6_col31" class="data row6 col31">0</td>
<td id="T_7cfc2_row6_col32" class="data row6 col32">0</td>
<td id="T_7cfc2_row6_col33" class="data row6 col33">0</td>
<td id="T_7cfc2_row6_col34" class="data row6 col34">0</td>
<td id="T_7cfc2_row6_col35" class="data row6 col35">0</td>
<td id="T_7cfc2_row6_col36" class="data row6 col36">0</td>
<td id="T_7cfc2_row6_col37" class="data row6 col37">0</td>
<td id="T_7cfc2_row6_col38" class="data row6 col38">0</td>
<td id="T_7cfc2_row6_col39" class="data row6 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row7" class="row_heading level0 row7">7</th>
<td id="T_7cfc2_row7_col0" class="data row7 col0">1</td>
<td id="T_7cfc2_row7_col1" class="data row7 col1">0</td>
<td id="T_7cfc2_row7_col2" class="data row7 col2">0</td>
<td id="T_7cfc2_row7_col3" class="data row7 col3">0</td>
<td id="T_7cfc2_row7_col4" class="data row7 col4">0</td>
<td id="T_7cfc2_row7_col5" class="data row7 col5">0</td>
<td id="T_7cfc2_row7_col6" class="data row7 col6">0</td>
<td id="T_7cfc2_row7_col7" class="data row7 col7">0</td>
<td id="T_7cfc2_row7_col8" class="data row7 col8">0</td>
<td id="T_7cfc2_row7_col9" class="data row7 col9">0</td>
<td id="T_7cfc2_row7_col10" class="data row7 col10">0</td>
<td id="T_7cfc2_row7_col11" class="data row7 col11">0</td>
<td id="T_7cfc2_row7_col12" class="data row7 col12">0</td>
<td id="T_7cfc2_row7_col13" class="data row7 col13">0</td>
<td id="T_7cfc2_row7_col14" class="data row7 col14">0</td>
<td id="T_7cfc2_row7_col15" class="data row7 col15">0</td>
<td id="T_7cfc2_row7_col16" class="data row7 col16">0</td>
<td id="T_7cfc2_row7_col17" class="data row7 col17">0</td>
<td id="T_7cfc2_row7_col18" class="data row7 col18">0</td>
<td id="T_7cfc2_row7_col19" class="data row7 col19">0</td>
<td id="T_7cfc2_row7_col20" class="data row7 col20">1</td>
<td id="T_7cfc2_row7_col21" class="data row7 col21">0</td>
<td id="T_7cfc2_row7_col22" class="data row7 col22">0</td>
<td id="T_7cfc2_row7_col23" class="data row7 col23">0</td>
<td id="T_7cfc2_row7_col24" class="data row7 col24">0</td>
<td id="T_7cfc2_row7_col25" class="data row7 col25">0</td>
<td id="T_7cfc2_row7_col26" class="data row7 col26">0</td>
<td id="T_7cfc2_row7_col27" class="data row7 col27">0</td>
<td id="T_7cfc2_row7_col28" class="data row7 col28">0</td>
<td id="T_7cfc2_row7_col29" class="data row7 col29">0</td>
<td id="T_7cfc2_row7_col30" class="data row7 col30">0</td>
<td id="T_7cfc2_row7_col31" class="data row7 col31">0</td>
<td id="T_7cfc2_row7_col32" class="data row7 col32">0</td>
<td id="T_7cfc2_row7_col33" class="data row7 col33">0</td>
<td id="T_7cfc2_row7_col34" class="data row7 col34">0</td>
<td id="T_7cfc2_row7_col35" class="data row7 col35">0</td>
<td id="T_7cfc2_row7_col36" class="data row7 col36">0</td>
<td id="T_7cfc2_row7_col37" class="data row7 col37">0</td>
<td id="T_7cfc2_row7_col38" class="data row7 col38">0</td>
<td id="T_7cfc2_row7_col39" class="data row7 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row8" class="row_heading level0 row8">8</th>
<td id="T_7cfc2_row8_col0" class="data row8 col0">1</td>
<td id="T_7cfc2_row8_col1" class="data row8 col1">0</td>
<td id="T_7cfc2_row8_col2" class="data row8 col2">0</td>
<td id="T_7cfc2_row8_col3" class="data row8 col3">0</td>
<td id="T_7cfc2_row8_col4" class="data row8 col4">0</td>
<td id="T_7cfc2_row8_col5" class="data row8 col5">0</td>
<td id="T_7cfc2_row8_col6" class="data row8 col6">0</td>
<td id="T_7cfc2_row8_col7" class="data row8 col7">0</td>
<td id="T_7cfc2_row8_col8" class="data row8 col8">0</td>
<td id="T_7cfc2_row8_col9" class="data row8 col9">0</td>
<td id="T_7cfc2_row8_col10" class="data row8 col10">0</td>
<td id="T_7cfc2_row8_col11" class="data row8 col11">0</td>
<td id="T_7cfc2_row8_col12" class="data row8 col12">0</td>
<td id="T_7cfc2_row8_col13" class="data row8 col13">0</td>
<td id="T_7cfc2_row8_col14" class="data row8 col14">0</td>
<td id="T_7cfc2_row8_col15" class="data row8 col15">0</td>
<td id="T_7cfc2_row8_col16" class="data row8 col16">0</td>
<td id="T_7cfc2_row8_col17" class="data row8 col17">0</td>
<td id="T_7cfc2_row8_col18" class="data row8 col18">0</td>
<td id="T_7cfc2_row8_col19" class="data row8 col19">0</td>
<td id="T_7cfc2_row8_col20" class="data row8 col20">0</td>
<td id="T_7cfc2_row8_col21" class="data row8 col21">0</td>
<td id="T_7cfc2_row8_col22" class="data row8 col22">0</td>
<td id="T_7cfc2_row8_col23" class="data row8 col23">0</td>
<td id="T_7cfc2_row8_col24" class="data row8 col24">0</td>
<td id="T_7cfc2_row8_col25" class="data row8 col25">0</td>
<td id="T_7cfc2_row8_col26" class="data row8 col26">0</td>
<td id="T_7cfc2_row8_col27" class="data row8 col27">0</td>
<td id="T_7cfc2_row8_col28" class="data row8 col28">0</td>
<td id="T_7cfc2_row8_col29" class="data row8 col29">0</td>
<td id="T_7cfc2_row8_col30" class="data row8 col30">0</td>
<td id="T_7cfc2_row8_col31" class="data row8 col31">0</td>
<td id="T_7cfc2_row8_col32" class="data row8 col32">0</td>
<td id="T_7cfc2_row8_col33" class="data row8 col33">0</td>
<td id="T_7cfc2_row8_col34" class="data row8 col34">0</td>
<td id="T_7cfc2_row8_col35" class="data row8 col35">0</td>
<td id="T_7cfc2_row8_col36" class="data row8 col36">0</td>
<td id="T_7cfc2_row8_col37" class="data row8 col37">0</td>
<td id="T_7cfc2_row8_col38" class="data row8 col38">0</td>
<td id="T_7cfc2_row8_col39" class="data row8 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row9" class="row_heading level0 row9">9</th>
<td id="T_7cfc2_row9_col0" class="data row9 col0">1</td>
<td id="T_7cfc2_row9_col1" class="data row9 col1">0</td>
<td id="T_7cfc2_row9_col2" class="data row9 col2">0</td>
<td id="T_7cfc2_row9_col3" class="data row9 col3">0</td>
<td id="T_7cfc2_row9_col4" class="data row9 col4">0</td>
<td id="T_7cfc2_row9_col5" class="data row9 col5">0</td>
<td id="T_7cfc2_row9_col6" class="data row9 col6">0</td>
<td id="T_7cfc2_row9_col7" class="data row9 col7">0</td>
<td id="T_7cfc2_row9_col8" class="data row9 col8">0</td>
<td id="T_7cfc2_row9_col9" class="data row9 col9">0</td>
<td id="T_7cfc2_row9_col10" class="data row9 col10">0</td>
<td id="T_7cfc2_row9_col11" class="data row9 col11">0</td>
<td id="T_7cfc2_row9_col12" class="data row9 col12">0</td>
<td id="T_7cfc2_row9_col13" class="data row9 col13">0</td>
<td id="T_7cfc2_row9_col14" class="data row9 col14">0</td>
<td id="T_7cfc2_row9_col15" class="data row9 col15">0</td>
<td id="T_7cfc2_row9_col16" class="data row9 col16">0</td>
<td id="T_7cfc2_row9_col17" class="data row9 col17">0</td>
<td id="T_7cfc2_row9_col18" class="data row9 col18">0</td>
<td id="T_7cfc2_row9_col19" class="data row9 col19">0</td>
<td id="T_7cfc2_row9_col20" class="data row9 col20">1</td>
<td id="T_7cfc2_row9_col21" class="data row9 col21">0</td>
<td id="T_7cfc2_row9_col22" class="data row9 col22">0</td>
<td id="T_7cfc2_row9_col23" class="data row9 col23">0</td>
<td id="T_7cfc2_row9_col24" class="data row9 col24">0</td>
<td id="T_7cfc2_row9_col25" class="data row9 col25">0</td>
<td id="T_7cfc2_row9_col26" class="data row9 col26">0</td>
<td id="T_7cfc2_row9_col27" class="data row9 col27">0</td>
<td id="T_7cfc2_row9_col28" class="data row9 col28">0</td>
<td id="T_7cfc2_row9_col29" class="data row9 col29">0</td>
<td id="T_7cfc2_row9_col30" class="data row9 col30">0</td>
<td id="T_7cfc2_row9_col31" class="data row9 col31">0</td>
<td id="T_7cfc2_row9_col32" class="data row9 col32">0</td>
<td id="T_7cfc2_row9_col33" class="data row9 col33">0</td>
<td id="T_7cfc2_row9_col34" class="data row9 col34">0</td>
<td id="T_7cfc2_row9_col35" class="data row9 col35">0</td>
<td id="T_7cfc2_row9_col36" class="data row9 col36">0</td>
<td id="T_7cfc2_row9_col37" class="data row9 col37">0</td>
<td id="T_7cfc2_row9_col38" class="data row9 col38">0</td>
<td id="T_7cfc2_row9_col39" class="data row9 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row10" class="row_heading level0 row10">10</th>
<td id="T_7cfc2_row10_col0" class="data row10 col0">0</td>
<td id="T_7cfc2_row10_col1" class="data row10 col1">1</td>
<td id="T_7cfc2_row10_col2" class="data row10 col2">0</td>
<td id="T_7cfc2_row10_col3" class="data row10 col3">0</td>
<td id="T_7cfc2_row10_col4" class="data row10 col4">0</td>
<td id="T_7cfc2_row10_col5" class="data row10 col5">0</td>
<td id="T_7cfc2_row10_col6" class="data row10 col6">0</td>
<td id="T_7cfc2_row10_col7" class="data row10 col7">0</td>
<td id="T_7cfc2_row10_col8" class="data row10 col8">0</td>
<td id="T_7cfc2_row10_col9" class="data row10 col9">0</td>
<td id="T_7cfc2_row10_col10" class="data row10 col10">0</td>
<td id="T_7cfc2_row10_col11" class="data row10 col11">0</td>
<td id="T_7cfc2_row10_col12" class="data row10 col12">0</td>
<td id="T_7cfc2_row10_col13" class="data row10 col13">0</td>
<td id="T_7cfc2_row10_col14" class="data row10 col14">0</td>
<td id="T_7cfc2_row10_col15" class="data row10 col15">0</td>
<td id="T_7cfc2_row10_col16" class="data row10 col16">0</td>
<td id="T_7cfc2_row10_col17" class="data row10 col17">0</td>
<td id="T_7cfc2_row10_col18" class="data row10 col18">0</td>
<td id="T_7cfc2_row10_col19" class="data row10 col19">0</td>
<td id="T_7cfc2_row10_col20" class="data row10 col20">0</td>
<td id="T_7cfc2_row10_col21" class="data row10 col21">0</td>
<td id="T_7cfc2_row10_col22" class="data row10 col22">0</td>
<td id="T_7cfc2_row10_col23" class="data row10 col23">0</td>
<td id="T_7cfc2_row10_col24" class="data row10 col24">0</td>
<td id="T_7cfc2_row10_col25" class="data row10 col25">0</td>
<td id="T_7cfc2_row10_col26" class="data row10 col26">0</td>
<td id="T_7cfc2_row10_col27" class="data row10 col27">0</td>
<td id="T_7cfc2_row10_col28" class="data row10 col28">0</td>
<td id="T_7cfc2_row10_col29" class="data row10 col29">0</td>
<td id="T_7cfc2_row10_col30" class="data row10 col30">0</td>
<td id="T_7cfc2_row10_col31" class="data row10 col31">0</td>
<td id="T_7cfc2_row10_col32" class="data row10 col32">0</td>
<td id="T_7cfc2_row10_col33" class="data row10 col33">0</td>
<td id="T_7cfc2_row10_col34" class="data row10 col34">0</td>
<td id="T_7cfc2_row10_col35" class="data row10 col35">0</td>
<td id="T_7cfc2_row10_col36" class="data row10 col36">0</td>
<td id="T_7cfc2_row10_col37" class="data row10 col37">0</td>
<td id="T_7cfc2_row10_col38" class="data row10 col38">0</td>
<td id="T_7cfc2_row10_col39" class="data row10 col39">0</td>
</tr>
<tr>
<th id="T_7cfc2_level0_row11" class="row_heading level0 row11">11</th>
<td id="T_7cfc2_row11_col0" class="data row11 col0">0</td>
<td id="T_7cfc2_row11_col1" class="data row11 col1">1</td>
<td id="T_7cfc2_row11_col2" class="data row11 col2">0</td>
<td id="T_7cfc2_row11_col3" class="data row11 col3">0</td>
<td id="T_7cfc2_row11_col4" class="data row11 col4">0</td>
<td id="T_7cfc2_row11_col5" class="data row11 col5">0</td>
<td id="T_7cfc2_row11_col6" class="data row11 col6">0</td>
<td id="T_7cfc2_row11_col7" class="data row11 col7">0</td>
<td id="T_7cfc2_row11_col8" class="data row11 col8">0</td>
<td id="T_7cfc2_row11_col9" class="data row11 col9">0</td>
<td id="T_7cfc2_row11_col10" class="data row11 col10">0</td>
<td id="T_7cfc2_row11_col11" class="data row11 col11">0</td>
<td id="T_7cfc2_row11_col12" class="data row11 col12">0</td>
<td id="T_7cfc2_row11_col13" class="data row11 col13">0</td>
<td id="T_7cfc2_row11_col14" class="data row11 col14">0</td>
<td id="T_7cfc2_row11_col15" class="data row11 col15">0</td>
<td id="T_7cfc2_row11_col16" class="data row11 col16">0</td>
<td id="T_7cfc2_row11_col17" class="data row11 col17">0</td>
<td id="T_7cfc2_row11_col18" class="data row11 col18">0</td>
<td id="T_7cfc2_row11_col19" class="data row11 col19">0</td>
<td id="T_7cfc2_row11_col20" class="data row11 col20">0</td>
<td id="T_7cfc2_row11_col21" class="data row11 col21">1</td>
<td id="T_7cfc2_row11_col22" class="data row11 col22">0</td>
<td id="T_7cfc2_row11_col23" class="data row11 col23">0</td>
<td id="T_7cfc2_row11_col24" class="data row11 col24">0</td>
<td id="T_7cfc2_row11_col25" class="data row11 col25">0</td>
<td id="T_7cfc2_row11_col26" class="data row11 col26">0</td>
<td id="T_7cfc2_row11_col27" class="data row11 col27">0</td>
<td id="T_7cfc2_row11_col28" class="data row11 col28">0</td>
<td id="T_7cfc2_row11_col29" class="data row11 col29">0</td>
<td id="T_7cfc2_row11_col30" class="data row11 col30">0</td>
<td id="T_7cfc2_row11_col31" class="data row11 col31">0</td>
<td id="T_7cfc2_row11_col32" class="data row11 col32">0</td>
<td id="T_7cfc2_row11_col33" class="data row11 col33">0</td>
<td id="T_7cfc2_row11_col34" class="data row11 col34">0</td>
<td id="T_7cfc2_row11_col35" class="data row11 col35">0</td>
<td id="T_7cfc2_row11_col36" class="data row11 col36">0</td>
<td id="T_7cfc2_row11_col37" class="data row11 col37">0</td>
<td id="T_7cfc2_row11_col38" class="data row11 col38">0</td>
<td id="T_7cfc2_row11_col39" class="data row11 col39">0</td>
</tr>
</tbody>
</table>
<p>We can visualize all of $\textbf{Z}$ here.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">Z</span><span class="p">,</span> <span class="n">aspect</span><span class="o">=</span><span class="s">'auto'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="s">'intercept (cafe)'</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mi">30</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="s">'covariate (afternoon)'</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'observations'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Visual representation of Z'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0.5, 1.0, 'Visual representation of Z')
</code></pre></div></div>
<p><img src="/assets/2022-09-13-mixed_effects_freqvsbayes_cafes_files/2022-09-13-mixed_effects_freqvsbayes_cafes_25_1.png" alt="png" /></p>
<p>The vector $\textbf{u}$ is where the mixed effects model takes advantage of the covariance structure of the data. In our dataset, the first 20 elements of the vector represent the random intercepts of the cafes and the next 20 represent the random slopes. A cafe’s random effects can be thought of as offsets from the population parameters (the fixed effects). Accordingly, the random effects are multivariate normally distributed, with mean 0 and covariance matrix $\textbf{S}$.</p>
\[\textbf{u} \sim \text{Normal}(0, \textbf{S}) \tag{2}\]
<p>Remember that $\textbf{u}$ is a (2 x 20) x 1 vector containing each cafe’s intercept offset $a_\text{cafe}$ and slope offset $b_\text{cafe}$. Therefore, we can also write this as:</p>
\[\textbf{u} = \begin{bmatrix} a_{\text{cafe}} \\ b_{\text{cafe}} \end{bmatrix} \sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \textbf{S} \right) \tag{3}\]
<p>In other words, in equations 2 and 3, the random intercept and random slope are both expected to lie at 0. With regards to $\textbf{S}$, <a href="https://benslack19.github.io/data%20science/statistics/cov_matrix_weirdness/">my prior post</a> talked about covariance matrices, so I won’t elaborate here. The key conceptual point in this problem is that the covariance matrix $\textbf{S}$ can reflect the correlation ($\rho$) between the intercept (average morning wait time) and the slope (difference between morning and afternoon wait time).</p>
\[\textbf{S} = \begin{pmatrix} \sigma_{\alpha}^2 & \rho\sigma_{\alpha}\sigma_{\beta} \\
\rho\sigma_{\alpha}\sigma_{\beta} & \sigma_{\beta}^2 \end{pmatrix} \tag{4}\]
<p>We know there is a correlation because (a) we generated the data that way and (b) we can directly observe this when we <a href="#visualize-data">visualized the data</a>.</p>
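<p>To make equations 2&ndash;4 concrete, here is a minimal numpy sketch of drawing the cafes&rsquo; random intercepts and slopes from a zero-centered multivariate normal. The values of $\sigma_{\alpha}$, $\sigma_{\beta}$, and $\rho$ below are illustrative placeholders, not the parameters used to simulate <code class="language-plaintext highlighter-rouge">df_cafes</code>.</p>

```python
import numpy as np

# Illustrative values only (not the ones used to generate the dataset)
sigma_a, sigma_b, rho = 1.0, 0.5, -0.7

# Covariance matrix S from equation 4
S = np.array([
    [sigma_a**2,              rho * sigma_a * sigma_b],
    [rho * sigma_a * sigma_b, sigma_b**2],
])

rng = np.random.default_rng(19)
# One row per cafe: column 0 = random intercept offset, column 1 = random slope offset
u = rng.multivariate_normal(mean=np.zeros(2), cov=S, size=20)
print(u.shape)  # (20, 2)
```

<p>With enough cafes, the sample correlation between the two columns approaches $\rho$; with only 20 draws it will merely be in the right neighborhood.</p>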
<p>Finally, the role of $\boldsymbol{\epsilon}$ is to capture any residual variance. The residuals are assumed to be homogeneous and independent across observations.</p>
<h3 id="non-linear-algebra-form">Non-linear algebra form</h3>
<p>Equation 1 is written concisely in linear algebra form. However, since our dataset is relatively simple (only one predictor variable), it can be rewritten in an expanded, alternative form as equation 6. This might make it easier to understand (at least it did for me). The notation will start to get hairy with subscripts, so I will explicitly rename some variables for this explanation. It will also better match the Bayesian set of equations described in the McElreath text. Equation 6 is written at the level of a single observation $i$. I’ll repeat equation 1 here (as equation 5) so it’s easier to see the conversion.</p>
\[\textbf{y} = \textbf{X} \boldsymbol{\beta} + \textbf{Z} \textbf{u} + \boldsymbol{\epsilon} \tag{5}\]
\[W_i = (\alpha + \beta \times A_i) + (a_{\text{cafe}[i]} + b_{\text{cafe}[i]} \times A_i) + \epsilon_i \tag{6}\]
<p>Let’s start with the left side, where $\textbf{y}$ becomes $W_i$, the wait time for observation $i$. On the right side, I have segmented the fixed and random effects with parentheses, deconstructing each linear algebra expression into a simpler form. After rearrangement, we obtain the following form in equation 7.</p>
\[W_i = (\alpha + a_{\text{cafe}[i]}) + (\beta + b_{\text{cafe}[i]}) \times A_i + \epsilon_{\text{cafe}} \tag{7}\]
<p>Here, we can better appreciate how a cafe’s random-effects intercept can be thought of as an offset from the population intercept. The same logic applies to its slope. We will come back to equation 7 after covering equation set 2, the Bayesian approach.</p>
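<p>As a quick numeric sketch of equation 7, a single observation&rsquo;s expected wait combines the population terms with that cafe&rsquo;s offsets. All values below are made up purely for illustration; they are not estimates from the model.</p>

```python
# Hypothetical values for illustration only (not model estimates)
alpha, beta = 3.5, -1.0     # population ("fixed") intercept and slope
a_cafe, b_cafe = 0.5, 0.25  # one cafe's random-effect offsets
A_i = 1                     # afternoon indicator for this observation

# Equation 7 without the residual term epsilon
mu_i = (alpha + a_cafe) + (beta + b_cafe) * A_i
print(mu_i)  # 3.25
```

<p>This cafe waits longer than average in the morning (positive intercept offset) and its afternoon drop is smaller than average (positive slope offset partially cancels the negative population slope).</p>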
<h2 id="equation-set-2-fixed-effects-as-an-adaptive-prior-varying-effects-in-the-linear-model">Equation set 2: fixed effects as an adaptive prior, varying effects in the linear model</h2>
<p>The following equations are taken from Chapter 14 of Statistical Rethinking. This set of equations looks like a beast but, to be honest, is more intuitive to me, probably because I learned this approach first. I’ll state the equations before comparing them directly with equation set 1, but you may already start seeing the relationship. Essentially, the equations above are rewritten in a Bayesian way such that the fixed effects act as an adaptive prior.</p>
<p>\(W_i \sim \text{Normal}(\mu_i, \sigma) \tag{8}\)
\(\mu_i = \alpha_{\text{cafe}[i]} + \beta_{\text{cafe}[i]} \times A_{i} \tag{9}\)
\(\sigma \sim \text{Exp}(1) \tag{10}\)</p>
<p>Equation 8 states that wait time is normally distributed with mean $\mu_i$ and standard deviation $\sigma$. By making $W_i$ stochastic instead of deterministic (using a ~ instead of =), $\sigma$ replaces $\epsilon_i$. In equation 10, the prior for $\sigma$ is exponentially distributed with rate parameter 1. The expected value $\mu_i$ comes from the linear model in equation 9. You can start to see the similarities with equation 7 above.</p>
\[\begin{bmatrix}\alpha_{\text{cafe}} \\ \beta_{\text{cafe}} \end{bmatrix} \sim \text{MVNormal} \left( \begin{bmatrix}{\alpha} \\ {\beta} \end{bmatrix} , \textbf{S} \right) \tag{11}\]
<p>The $\alpha_{\text{cafe}}$ and $\beta_{\text{cafe}}$ terms come from sampling a multivariate normal distribution, as shown in equation 11. <strong>Note the very subtle difference in the placement of the subscript <code class="language-plaintext highlighter-rouge">cafe</code> when compared to equations 6 and 7. This is an important point I’ll discuss later.</strong> On the right side, the two-dimensional normal distribution’s expected values are $\alpha$ and $\beta$. The rest of the equations shown below are our priors for each parameter we’re trying to estimate.</p>
\[\textbf{S} = \begin{pmatrix} \sigma_{\alpha}^2 & \rho\sigma_{\alpha}\sigma_{\beta} \\
\rho\sigma_{\alpha}\sigma_{\beta} & \sigma_{\beta}^2 \end{pmatrix} = \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \textbf{R} \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \tag{12}\]
\[\alpha \sim \text{Normal}(5, 2) \tag{13}\]
\[\beta \sim \text{Normal}(-1, 0.5) \tag{14}\]
<p>\(\sigma, \sigma_{\alpha}, \sigma_{\beta} \sim \text{Exp}(1) \tag{15}\)
\(\textbf{R} \sim \text{LKJCorr}(2) \tag{16}\)</p>
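<p>The factoring of $\textbf{S}$ in equation 12 into standard deviations and a correlation matrix is easy to verify numerically. Here is a small numpy check, with illustrative values:</p>

```python
import numpy as np

sigma_a, sigma_b, rho = 1.0, 0.5, -0.7  # illustrative values only

# Left-hand side of equation 12: S written out directly
S_direct = np.array([
    [sigma_a**2,              rho * sigma_a * sigma_b],
    [rho * sigma_a * sigma_b, sigma_b**2],
])

# Right-hand side: diag(sigmas) @ R @ diag(sigmas)
D = np.diag([sigma_a, sigma_b])
R = np.array([[1.0, rho],
              [rho, 1.0]])
S_decomposed = D @ R @ D

print(np.allclose(S_direct, S_decomposed))  # True
```

<p>This separation is what lets equations 15 and 16 place priors on the standard deviations and on the correlation matrix $\textbf{R}$ independently.</p>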
<h2 id="comparison-of-equation-sets">Comparison of equation sets</h2>
<p>To recap, the first equation set has an explicit fixed effects term and a varying effects term in the linear model. In the second equation set, the linear model is already “mixed”: it implicitly contains both the fixed and varying effects, with the fixed effects appearing as the means of the multivariate normal in equation 11.</p>
<p>You can think of these $\alpha_{\text{cafe}}$ and $\beta_{\text{cafe}}$ terms as already incorporating the information from the fixed and random effects simultaneously.</p>
<p>Now that we have the dataset, we can run the two models, one with <code class="language-plaintext highlighter-rouge">lmer</code> and one with <code class="language-plaintext highlighter-rouge">pymc</code>, and see how each package implements these equations.</p>
<h1 id="running-equation-set-1-with-lmer-frequentist">Running equation set 1 with <code class="language-plaintext highlighter-rouge">lmer</code> (frequentist)</h1>
<p>The <code class="language-plaintext highlighter-rouge">lmer</code> (and, by extension, <code class="language-plaintext highlighter-rouge">brms</code>) syntax was initially confusing to me.</p>
<p><code class="language-plaintext highlighter-rouge">lmer(wait ~ 1 + afternoon + (1 + afternoon | cafe), df_cafes)</code></p>
<p>The <code class="language-plaintext highlighter-rouge">1</code> corresponds to inclusion of the intercept term; a <code class="language-plaintext highlighter-rouge">0</code> would exclude it. The <code class="language-plaintext highlighter-rouge">1 + afternoon</code> part corresponds to the “fixed effects” portion of the model ($\alpha + \beta \times A_i$), while <code class="language-plaintext highlighter-rouge">(1 + afternoon | cafe)</code> is the “varying effects” portion ($a_{\text{cafe}} + b_{\text{cafe}} \times A_i$).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">R</span><span class="w"> </span><span class="o">-</span><span class="n">i</span><span class="w"> </span><span class="n">df_cafes</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">df_fe_estimates</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">df_fe_ci</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">df_fe_summary</span><span class="w">
</span><span class="c1"># m df_fe_summary</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmer</span><span class="p">(</span><span class="n">wait</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">afternoon</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">afternoon</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">cafe</span><span class="p">),</span><span class="w"> </span><span class="n">df_cafes</span><span class="p">)</span><span class="w">
</span><span class="n">arm</span><span class="o">::</span><span class="n">display</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="c1"># get fixed effects coefficients</span><span class="w">
</span><span class="n">df_fe_estimates</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="p">)</span><span class="w">
</span><span class="c1"># get fixed effects coefficient CIs</span><span class="w">
</span><span class="n">df_fe_ci</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">confint</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="n">df_fe_summary</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="w">
</span><span class="n">df_fe_estimates</span><span class="p">,</span><span class="w">
</span><span class="n">df_fe_ci</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="s1">'(Intercept)'</span><span class="p">,</span><span class="w"> </span><span class="s1">'afternoon'</span><span class="p">),</span><span class="w"> </span><span class="p">],</span><span class="w">
</span><span class="n">by.x</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">by.y</span><span class="o">=</span><span class="m">0</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">df_fe_summary</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df_fe_summary</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lmer(formula = wait ~ 1 + afternoon + (1 + afternoon | cafe),
data = df_cafes)
coef.est coef.se
(Intercept) 3.64 0.23
afternoon -1.04 0.11
Error terms:
Groups Name Std.Dev. Corr
cafe (Intercept) 0.99
afternoon 0.39 -0.74
Residual 0.48
---
number of obs: 200, groups: cafe, 20
AIC = 369.9, DIC = 349.2
deviance = 353.5
R[write to console]: Computing profile confidence intervals ...
</code></pre></div></div>
<p>Can we get the partial pooling results from the <code class="language-plaintext highlighter-rouge">lmer</code> output and see how it compares with the unpooled estimates? Let’s export it for use later.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">R</span><span class="w"> </span><span class="o">-</span><span class="n">i</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">df_partial_pooling</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">random_sims</span><span class="w">
</span><span class="c1"># Make a dataframe with the fitted effects</span><span class="w">
</span><span class="n">df_partial_pooling</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">)[[</span><span class="s2">"cafe"</span><span class="p">]]</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rownames_to_column</span><span class="p">(</span><span class="s2">"cafe"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">Intercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`(Intercept)`</span><span class="p">,</span><span class="w"> </span><span class="n">Slope_Days</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">afternoon</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">add_column</span><span class="p">(</span><span class="n">Model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Partial pooling"</span><span class="p">)</span><span class="w">
</span><span class="c1"># estimate confidence interval</span><span class="w">
</span><span class="n">random_sims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">REsim</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n.sims</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">
</span><span class="c1">#plotREsim(random_sims)</span><span class="w">
</span></code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">random_sims</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>groupFctr</th>
<th>groupID</th>
<th>term</th>
<th>mean</th>
<th>median</th>
<th>sd</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>cafe</td>
<td>0</td>
<td>(Intercept)</td>
<td>-1.277651</td>
<td>-1.283341</td>
<td>0.379761</td>
</tr>
<tr>
<th>2</th>
<td>cafe</td>
<td>1</td>
<td>(Intercept)</td>
<td>0.164935</td>
<td>0.162715</td>
<td>0.420411</td>
</tr>
<tr>
<th>3</th>
<td>cafe</td>
<td>2</td>
<td>(Intercept)</td>
<td>-1.047076</td>
<td>-1.043646</td>
<td>0.387153</td>
</tr>
<tr>
<th>4</th>
<td>cafe</td>
<td>3</td>
<td>(Intercept)</td>
<td>0.474320</td>
<td>0.500552</td>
<td>0.400053</td>
</tr>
<tr>
<th>5</th>
<td>cafe</td>
<td>4</td>
<td>(Intercept)</td>
<td>-1.473647</td>
<td>-1.468940</td>
<td>0.394707</td>
</tr>
<tr>
<th>6</th>
<td>cafe</td>
<td>5</td>
<td>(Intercept)</td>
<td>0.086072</td>
<td>0.082010</td>
<td>0.408971</td>
</tr>
<tr>
<th>7</th>
<td>cafe</td>
<td>6</td>
<td>(Intercept)</td>
<td>-0.640217</td>
<td>-0.628944</td>
<td>0.412642</td>
</tr>
<tr>
<th>8</th>
<td>cafe</td>
<td>7</td>
<td>(Intercept)</td>
<td>1.507154</td>
<td>1.516430</td>
<td>0.391119</td>
</tr>
<tr>
<th>9</th>
<td>cafe</td>
<td>8</td>
<td>(Intercept)</td>
<td>-0.657831</td>
<td>-0.659448</td>
<td>0.394984</td>
</tr>
<tr>
<th>10</th>
<td>cafe</td>
<td>9</td>
<td>(Intercept)</td>
<td>0.332758</td>
<td>0.331037</td>
<td>0.388295</td>
</tr>
<tr>
<th>11</th>
<td>cafe</td>
<td>10</td>
<td>(Intercept)</td>
<td>-1.018611</td>
<td>-1.025387</td>
<td>0.389930</td>
</tr>
<tr>
<th>12</th>
<td>cafe</td>
<td>11</td>
<td>(Intercept)</td>
<td>0.925071</td>
<td>0.913997</td>
<td>0.397095</td>
</tr>
<tr>
<th>13</th>
<td>cafe</td>
<td>12</td>
<td>(Intercept)</td>
<td>-1.407149</td>
<td>-1.403259</td>
<td>0.384820</td>
</tr>
<tr>
<th>14</th>
<td>cafe</td>
<td>13</td>
<td>(Intercept)</td>
<td>-0.412975</td>
<td>-0.414958</td>
<td>0.412863</td>
</tr>
<tr>
<th>15</th>
<td>cafe</td>
<td>14</td>
<td>(Intercept)</td>
<td>1.346380</td>
<td>1.343109</td>
<td>0.403694</td>
</tr>
<tr>
<th>16</th>
<td>cafe</td>
<td>15</td>
<td>(Intercept)</td>
<td>0.336807</td>
<td>0.346523</td>
<td>0.390567</td>
</tr>
<tr>
<th>17</th>
<td>cafe</td>
<td>16</td>
<td>(Intercept)</td>
<td>0.747439</td>
<td>0.735906</td>
<td>0.413094</td>
</tr>
<tr>
<th>18</th>
<td>cafe</td>
<td>17</td>
<td>(Intercept)</td>
<td>-0.046579</td>
<td>-0.035018</td>
<td>0.396795</td>
</tr>
<tr>
<th>19</th>
<td>cafe</td>
<td>18</td>
<td>(Intercept)</td>
<td>1.659019</td>
<td>1.646634</td>
<td>0.393909</td>
</tr>
<tr>
<th>20</th>
<td>cafe</td>
<td>19</td>
<td>(Intercept)</td>
<td>0.323375</td>
<td>0.327348</td>
<td>0.392401</td>
</tr>
<tr>
<th>21</th>
<td>cafe</td>
<td>0</td>
<td>afternoon</td>
<td>0.498557</td>
<td>0.501401</td>
<td>0.182594</td>
</tr>
<tr>
<th>22</th>
<td>cafe</td>
<td>1</td>
<td>afternoon</td>
<td>-0.336036</td>
<td>-0.337360</td>
<td>0.193462</td>
</tr>
<tr>
<th>23</th>
<td>cafe</td>
<td>2</td>
<td>afternoon</td>
<td>0.395379</td>
<td>0.391621</td>
<td>0.189140</td>
</tr>
<tr>
<th>24</th>
<td>cafe</td>
<td>3</td>
<td>afternoon</td>
<td>0.296956</td>
<td>0.293144</td>
<td>0.191710</td>
</tr>
<tr>
<th>25</th>
<td>cafe</td>
<td>4</td>
<td>afternoon</td>
<td>0.059611</td>
<td>0.055121</td>
<td>0.189680</td>
</tr>
<tr>
<th>26</th>
<td>cafe</td>
<td>5</td>
<td>afternoon</td>
<td>-0.033068</td>
<td>-0.036143</td>
<td>0.194723</td>
</tr>
<tr>
<th>27</th>
<td>cafe</td>
<td>6</td>
<td>afternoon</td>
<td>0.236107</td>
<td>0.237904</td>
<td>0.192575</td>
</tr>
<tr>
<th>28</th>
<td>cafe</td>
<td>7</td>
<td>afternoon</td>
<td>-0.473485</td>
<td>-0.479199</td>
<td>0.185549</td>
</tr>
<tr>
<th>29</th>
<td>cafe</td>
<td>8</td>
<td>afternoon</td>
<td>0.408039</td>
<td>0.411507</td>
<td>0.194145</td>
</tr>
<tr>
<th>30</th>
<td>cafe</td>
<td>9</td>
<td>afternoon</td>
<td>-0.402131</td>
<td>-0.393931</td>
<td>0.186868</td>
</tr>
<tr>
<th>31</th>
<td>cafe</td>
<td>10</td>
<td>afternoon</td>
<td>0.316072</td>
<td>0.309198</td>
<td>0.189218</td>
</tr>
<tr>
<th>32</th>
<td>cafe</td>
<td>11</td>
<td>afternoon</td>
<td>-0.335749</td>
<td>-0.340427</td>
<td>0.186644</td>
</tr>
<tr>
<th>33</th>
<td>cafe</td>
<td>12</td>
<td>afternoon</td>
<td>0.521558</td>
<td>0.519243</td>
<td>0.184606</td>
</tr>
<tr>
<th>34</th>
<td>cafe</td>
<td>13</td>
<td>afternoon</td>
<td>-0.006800</td>
<td>-0.014344</td>
<td>0.199548</td>
</tr>
<tr>
<th>35</th>
<td>cafe</td>
<td>14</td>
<td>afternoon</td>
<td>-0.277165</td>
<td>-0.281127</td>
<td>0.188748</td>
</tr>
<tr>
<th>36</th>
<td>cafe</td>
<td>15</td>
<td>afternoon</td>
<td>-0.234501</td>
<td>-0.235683</td>
<td>0.192804</td>
</tr>
<tr>
<th>37</th>
<td>cafe</td>
<td>16</td>
<td>afternoon</td>
<td>-0.182673</td>
<td>-0.185997</td>
<td>0.194017</td>
</tr>
<tr>
<th>38</th>
<td>cafe</td>
<td>17</td>
<td>afternoon</td>
<td>-0.017126</td>
<td>-0.023784</td>
<td>0.187302</td>
</tr>
<tr>
<th>39</th>
<td>cafe</td>
<td>18</td>
<td>afternoon</td>
<td>-0.364424</td>
<td>-0.364049</td>
<td>0.187532</td>
</tr>
<tr>
<th>40</th>
<td>cafe</td>
<td>19</td>
<td>afternoon</td>
<td>-0.028883</td>
<td>-0.032691</td>
<td>0.185824</td>
</tr>
</tbody>
</table>
</div>
<p>OK, now let’s try the Bayesian approach and compare answers.</p>
<h1 id="running-equation-set-2-with-pymc-bayesian">Running equation set 2 with <code class="language-plaintext highlighter-rouge">pymc</code> (Bayesian)</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_cafes</span> <span class="o">=</span> <span class="n">df_cafes</span><span class="p">[</span><span class="s">'cafe'</span><span class="p">].</span><span class="n">nunique</span><span class="p">()</span>
<span class="n">cafe_idx</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">df_cafes</span><span class="p">[</span><span class="s">"cafe"</span><span class="p">]).</span><span class="n">codes</span>
<span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">m14_1</span><span class="p">:</span>
<span class="c1"># can't specify a separate sigma_a and sigma_b for sd_dist but they're equivalent here
</span> <span class="n">chol</span><span class="p">,</span> <span class="n">Rho_</span><span class="p">,</span> <span class="n">sigma_cafe</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">LKJCholeskyCov</span><span class="p">(</span>
<span class="s">"chol_cov"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">sd_dist</span><span class="o">=</span><span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="mf">1.0</span><span class="p">),</span> <span class="n">compute_corr</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
<span class="n">a_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mf">2.0</span><span class="p">)</span> <span class="c1"># prior for average intercept
</span> <span class="n">b_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"b_bar"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span> <span class="c1"># prior for average slope
</span>
<span class="n">ab_subject</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">MvNormal</span><span class="p">(</span>
<span class="s">"ab_subject"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="n">at</span><span class="p">.</span><span class="n">stack</span><span class="p">([</span><span class="n">a_bar</span><span class="p">,</span> <span class="n">b_bar</span><span class="p">]),</span> <span class="n">chol</span><span class="o">=</span><span class="n">chol</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">n_cafes</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="p">)</span> <span class="c1"># population of varying effects
</span> <span class="c1"># shape needs to be (n_cafes, 2) because we're getting back both a and b for each cafe
</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">ab_subject</span><span class="p">[</span><span class="n">cafe_idx</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">ab_subject</span><span class="p">[</span><span class="n">cafe_idx</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">df_cafes</span><span class="p">[</span><span class="s">"afternoon"</span><span class="p">].</span><span class="n">values</span> <span class="c1"># linear model
</span> <span class="n">sigma_within</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">(</span><span class="s">"sigma_within"</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span> <span class="c1"># prior stddev within cafes (in the top line)
</span>
<span class="n">wait</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"wait"</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="n">sigma_within</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_cafes</span><span class="p">[</span><span class="s">"wait"</span><span class="p">].</span><span class="n">values</span><span class="p">)</span> <span class="c1"># likelihood
</span>
<span class="n">idata_m14_1</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="n">target_accept</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [chol_cov, a_bar, b_bar, ab_subject, sigma_within]
</code></pre></div></div>
<div>
      100.00% [8000/8000 02:03<00:00 Sampling 4 chains, 1 divergences]
</div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 140 seconds.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># take a glimpse at the head and tail of the summary table
</span><span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span>
<span class="p">[</span>
<span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">idata_m14_1</span><span class="p">).</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span>
<span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">idata_m14_1</span><span class="p">).</span><span class="n">tail</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="p">]</span>
<span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/Users/blacar/opt/anaconda3/envs/pymc_env2/lib/python3.10/site-packages/arviz/stats/diagnostics.py:586: RuntimeWarning: invalid value encountered in double_scalars
(between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_2.5%</th>
<th>hdi_97.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a_bar</th>
<td>3.654</td>
<td>0.223</td>
<td>3.203</td>
<td>4.074</td>
<td>0.003</td>
<td>0.002</td>
<td>4802.0</td>
<td>3140.0</td>
<td>1.0</td>
</tr>
<tr>
<th>b_bar</th>
<td>-1.049</td>
<td>0.109</td>
<td>-1.265</td>
<td>-0.844</td>
<td>0.002</td>
<td>0.001</td>
<td>3446.0</td>
<td>3200.0</td>
<td>1.0</td>
</tr>
<tr>
<th>ab_subject[0, 0]</th>
<td>2.380</td>
<td>0.200</td>
<td>1.996</td>
<td>2.785</td>
<td>0.003</td>
<td>0.002</td>
<td>4271.0</td>
<td>2783.0</td>
<td>1.0</td>
</tr>
<tr>
<th>ab_subject[0, 1]</th>
<td>-0.587</td>
<td>0.245</td>
<td>-1.071</td>
<td>-0.119</td>
<td>0.004</td>
<td>0.003</td>
<td>3077.0</td>
<td>2833.0</td>
<td>1.0</td>
</tr>
<tr>
<th>ab_subject[1, 0]</th>
<td>3.820</td>
<td>0.199</td>
<td>3.442</td>
<td>4.220</td>
<td>0.003</td>
<td>0.002</td>
<td>3988.0</td>
<td>3167.0</td>
<td>1.0</td>
</tr>
<tr>
<th>ab_subject[1, 1]</th>
<td>-1.402</td>
<td>0.248</td>
<td>-1.897</td>
<td>-0.945</td>
<td>0.004</td>
<td>0.003</td>
<td>3165.0</td>
<td>3182.0</td>
<td>1.0</td>
</tr>
<tr>
<th>ab_subject[2, 0]</th>
<td>2.606</td>
<td>0.199</td>
<td>2.210</td>
<td>2.988</td>
<td>0.003</td>
<td>0.002</td>
<td>4702.0</td>
<td>3450.0</td>
<td>1.0</td>
</tr>
<tr>
<th>ab_subject[2, 1]</th>
<td>-0.681</td>
<td>0.240</td>
<td>-1.156</td>
<td>-0.218</td>
<td>0.004</td>
<td>0.003</td>
<td>3696.0</td>
<td>3014.0</td>
<td>1.0</td>
</tr>
<tr>
<th>ab_subject[3, 0]</th>
<td>4.120</td>
<td>0.203</td>
<td>3.739</td>
<td>4.532</td>
<td>0.003</td>
<td>0.002</td>
<td>3475.0</td>
<td>2800.0</td>
<td>1.0</td>
</tr>
<tr>
<th>ab_subject[3, 1]</th>
<td>-0.707</td>
<td>0.266</td>
<td>-1.213</td>
<td>-0.184</td>
<td>0.005</td>
<td>0.004</td>
<td>2482.0</td>
<td>2921.0</td>
<td>1.0</td>
</tr>
<tr>
<th>chol_cov[0]</th>
<td>0.988</td>
<td>0.163</td>
<td>0.710</td>
<td>1.328</td>
<td>0.002</td>
<td>0.002</td>
<td>5207.0</td>
<td>3263.0</td>
<td>1.0</td>
</tr>
<tr>
<th>chol_cov[1]</th>
<td>-0.226</td>
<td>0.105</td>
<td>-0.442</td>
<td>-0.033</td>
<td>0.002</td>
<td>0.001</td>
<td>2769.0</td>
<td>3178.0</td>
<td>1.0</td>
</tr>
<tr>
<th>chol_cov[2]</th>
<td>0.299</td>
<td>0.093</td>
<td>0.120</td>
<td>0.481</td>
<td>0.002</td>
<td>0.002</td>
<td>1379.0</td>
<td>1308.0</td>
<td>1.0</td>
</tr>
<tr>
<th>sigma_within</th>
<td>0.482</td>
<td>0.027</td>
<td>0.431</td>
<td>0.534</td>
<td>0.000</td>
<td>0.000</td>
<td>3773.0</td>
<td>2542.0</td>
<td>1.0</td>
</tr>
<tr>
<th>chol_cov_corr[0, 0]</th>
<td>1.000</td>
<td>0.000</td>
<td>1.000</td>
<td>1.000</td>
<td>0.000</td>
<td>0.000</td>
<td>4000.0</td>
<td>4000.0</td>
<td>NaN</td>
</tr>
<tr>
<th>chol_cov_corr[0, 1]</th>
<td>-0.579</td>
<td>0.192</td>
<td>-0.898</td>
<td>-0.196</td>
<td>0.003</td>
<td>0.002</td>
<td>3196.0</td>
<td>2983.0</td>
<td>1.0</td>
</tr>
<tr>
<th>chol_cov_corr[1, 0]</th>
<td>-0.579</td>
<td>0.192</td>
<td>-0.898</td>
<td>-0.196</td>
<td>0.003</td>
<td>0.002</td>
<td>3196.0</td>
<td>2983.0</td>
<td>1.0</td>
</tr>
<tr>
<th>chol_cov_corr[1, 1]</th>
<td>1.000</td>
<td>0.000</td>
<td>1.000</td>
<td>1.000</td>
<td>0.000</td>
<td>0.000</td>
<td>4087.0</td>
<td>4000.0</td>
<td>1.0</td>
</tr>
<tr>
<th>chol_cov_stds[0]</th>
<td>0.988</td>
<td>0.163</td>
<td>0.710</td>
<td>1.328</td>
<td>0.002</td>
<td>0.002</td>
<td>5207.0</td>
<td>3263.0</td>
<td>1.0</td>
</tr>
<tr>
<th>chol_cov_stds[1]</th>
<td>0.386</td>
<td>0.107</td>
<td>0.182</td>
<td>0.605</td>
<td>0.003</td>
<td>0.002</td>
<td>1541.0</td>
<td>1201.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
<h1 id="comparison-of-lmer-and-pymc-outputs">Comparison of <code class="language-plaintext highlighter-rouge">lmer</code> and <code class="language-plaintext highlighter-rouge">pymc</code> outputs</h1>
<p>While <code class="language-plaintext highlighter-rouge">pymc</code> returns posterior estimates for every parameter, including $\rho$, in this post we are interested in the output comparable to the “fixed effects” and “varying effects” from <code class="language-plaintext highlighter-rouge">lmer</code>. The equations above help us piece together the relevant bits of information. The fixed intercept and slope are easy to identify because equation set 2 uses the same characters, $\alpha$ and $\beta$, as equation set 1.</p>
<p>However, identifying the “varying effects” requires some arithmetic with the <code class="language-plaintext highlighter-rouge">pymc</code> output. In contrast with <code class="language-plaintext highlighter-rouge">lmer</code>, <code class="language-plaintext highlighter-rouge">pymc</code> returns an estimate for each cafe with the varying effects “baked in”. In other words, the offsets we see in equation 7 ($a_{\text{cafe}[i]}$ and $b_{\text{cafe}[i]}$)</p>
\[W_i = (\alpha + a_{\text{cafe}[i]}) + (\beta + b_{\text{cafe}[i]}) \times A_i + \epsilon_{\text{cafe}} \tag{7}\]
\[\mu_i = \alpha_{\text{cafe}[i]} + \beta_{\text{cafe}[i]} \times A_{i} \tag{9}\]
<p>are already embedded in $\alpha_{\text{cafe}[i]}$ and $\beta_{\text{cafe}[i]}$ in equation 9. We’ll therefore have to subtract out the fixed effects from the <code class="language-plaintext highlighter-rouge">pymc</code> output before we can compare with the <code class="language-plaintext highlighter-rouge">lmer</code> output. First, let’s get the fixed effects from <code class="language-plaintext highlighter-rouge">pymc</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_summary_int_and_slope</span> <span class="o">=</span> <span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">idata_m14_1</span><span class="p">,</span> <span class="n">var_names</span><span class="o">=</span><span class="p">[</span><span class="s">'a_bar'</span><span class="p">,</span> <span class="s">'b_bar'</span><span class="p">])</span>
<span class="n">df_summary_int_and_slope</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_2.5%</th>
<th>hdi_97.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a_bar</th>
<td>3.654</td>
<td>0.223</td>
<td>3.203</td>
<td>4.074</td>
<td>0.003</td>
<td>0.002</td>
<td>4802.0</td>
<td>3140.0</td>
<td>1.0</td>
</tr>
<tr>
<th>b_bar</th>
<td>-1.049</td>
<td>0.109</td>
<td>-1.265</td>
<td>-0.844</td>
<td>0.002</td>
<td>0.001</td>
<td>3446.0</td>
<td>3200.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
<p>These estimates and uncertainties compare well with the fixed effect estimates from <code class="language-plaintext highlighter-rouge">lmer</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="p">(</span><span class="n">ax0</span><span class="p">,</span> <span class="n">ax1</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="c1"># value to generate data
# a, average morning wait time was defined above
</span><span class="n">ax0</span><span class="p">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">a</span><span class="p">,</span> <span class="n">ymin</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">ymax</span><span class="o">=</span><span class="mf">1.2</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'dashed'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">b</span><span class="p">,</span> <span class="n">ymin</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">ymax</span><span class="o">=</span><span class="mf">1.2</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'dashed'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'simulated value'</span><span class="p">)</span>
<span class="c1"># pymc fixed effects value
</span><span class="n">ax0</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_summary_int_and_slope</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'a_bar'</span><span class="p">,</span> <span class="s">'mean'</span><span class="p">],</span> <span class="mf">1.1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">)</span>
<span class="n">ax0</span><span class="p">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">xmin</span><span class="o">=</span><span class="n">df_summary_int_and_slope</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'a_bar'</span><span class="p">,</span> <span class="s">'hdi_2.5%'</span><span class="p">],</span> <span class="n">xmax</span><span class="o">=</span><span class="n">df_summary_int_and_slope</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'a_bar'</span><span class="p">,</span> <span class="s">'hdi_97.5%'</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="mf">1.1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_summary_int_and_slope</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'b_bar'</span><span class="p">,</span> <span class="s">'mean'</span><span class="p">],</span> <span class="mf">1.1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">xmin</span><span class="o">=</span><span class="n">df_summary_int_and_slope</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'b_bar'</span><span class="p">,</span> <span class="s">'hdi_2.5%'</span><span class="p">],</span> <span class="n">xmax</span><span class="o">=</span><span class="n">df_summary_int_and_slope</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'b_bar'</span><span class="p">,</span> <span class="s">'hdi_97.5%'</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="mf">1.1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'pymc estimate'</span><span class="p">)</span>
<span class="c1"># lmer fixed effects estimate
</span><span class="n">ax0</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_fe_summary</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'(Intercept)'</span><span class="p">,</span> <span class="s">'Estimate'</span><span class="p">],</span> <span class="mf">0.9</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">)</span>
<span class="n">ax0</span><span class="p">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">xmin</span><span class="o">=</span><span class="n">df_fe_summary</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'(Intercept)'</span><span class="p">,</span> <span class="s">'X2.5..'</span><span class="p">],</span> <span class="n">xmax</span><span class="o">=</span><span class="n">df_fe_summary</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'(Intercept)'</span><span class="p">,</span> <span class="s">'X97.5..'</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_fe_summary</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'afternoon'</span><span class="p">,</span> <span class="s">'Estimate'</span><span class="p">],</span> <span class="mf">0.9</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">xmin</span><span class="o">=</span><span class="n">df_fe_summary</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'afternoon'</span><span class="p">,</span> <span class="s">'X2.5..'</span><span class="p">],</span> <span class="n">xmax</span><span class="o">=</span><span class="n">df_fe_summary</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'afternoon'</span><span class="p">,</span> <span class="s">'X97.5..'</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'lmer estimate'</span><span class="p">)</span>
<span class="c1"># plot formatting
</span><span class="n">f</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'Fixed effect estimates'</span><span class="p">)</span>
<span class="n">ax0</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">([</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">])</span>
<span class="n">ax0</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">([</span><span class="s">'lmer'</span><span class="p">,</span> <span class="s">'pymc'</span><span class="p">])</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">([</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">])</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">([</span><span class="s">''</span><span class="p">,</span> <span class="s">''</span><span class="p">])</span>
<span class="n">ax0</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'intercept'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'slope'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/var/folders/tw/b9j0wcdj6_9cyljwt364lx7c0000gn/T/ipykernel_5516/1253574855.py:30: UserWarning: This figure was using constrained_layout, but that is incompatible with subplots_adjust and/or tight_layout; disabling constrained_layout.
plt.tight_layout()
</code></pre></div></div>
<p><img src="/assets/2022-09-13-mixed_effects_freqvsbayes_cafes_files/2022-09-13-mixed_effects_freqvsbayes_cafes_46_1.png" alt="png" /></p>
<p>As promised, here is the meme that rewards you for paying attention this far!
<img src="/assets/2022-09-13-mixed_effects_freqvsbayes_cafes_files/spideman_IMG_4672.JPG" alt="jpg" /></p>
<p>Now to get the varying effects from <code class="language-plaintext highlighter-rouge">pymc</code> output, we’ll take each sample’s intercept and slope and subtract the fixed estimate.</p>
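<p>Before running the full per-sample computation, the idea can be sanity-checked with the posterior means from the summary table above. This is only a sketch with the rounded means for the first four cafes (and since the mean of a difference equals the difference of the means, the point estimates come out the same either way):</p>

```python
import numpy as np

# Posterior means copied (rounded) from the az.summary table above: the
# per-cafe intercepts/slopes with the population mean "baked in" (equation 9)
alpha_cafe = np.array([2.380, 3.820, 2.606, 4.120])     # ab_subject[0..3, 0]
beta_cafe = np.array([-0.587, -1.402, -0.681, -0.707])  # ab_subject[0..3, 1]
a_bar, b_bar = 3.654, -1.049                            # fixed effects

# Subtracting the fixed effects recovers the lmer-style offsets
# (a_{cafe[i]} and b_{cafe[i]} in equation 7)
int_offsets = alpha_cafe - a_bar
slope_offsets = beta_cafe - b_bar
print(int_offsets.round(3))    # [-1.274  0.166 -1.048  0.466]
print(slope_offsets.round(3))  # [ 0.462 -0.353  0.368  0.342]
```

<p>For example, cafe 0’s varying intercept of about $-1.27$ says it sits roughly 1.27 minutes below the average morning wait time.</p>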
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Convert to pandas dataframe and take a glimpse at the first few rows
</span><span class="n">idata_m14_1_df</span> <span class="o">=</span> <span class="n">idata_m14_1</span><span class="p">.</span><span class="n">to_dataframe</span><span class="p">()</span>
<span class="n">idata_m14_1_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>chain</th>
<th>draw</th>
<th>(posterior, a_bar)</th>
<th>(posterior, b_bar)</th>
<th>(posterior, ab_subject[0,0], 0, 0)</th>
<th>(posterior, ab_subject[0,1], 0, 1)</th>
<th>(posterior, ab_subject[1,0], 1, 0)</th>
<th>(posterior, ab_subject[1,1], 1, 1)</th>
<th>(posterior, ab_subject[10,0], 10, 0)</th>
<th>(posterior, ab_subject[10,1], 10, 1)</th>
<th>(posterior, ab_subject[11,0], 11, 0)</th>
<th>(posterior, ab_subject[11,1], 11, 1)</th>
<th>(posterior, ab_subject[12,0], 12, 0)</th>
<th>(posterior, ab_subject[12,1], 12, 1)</th>
<th>(posterior, ab_subject[13,0], 13, 0)</th>
<th>(posterior, ab_subject[13,1], 13, 1)</th>
<th>(posterior, ab_subject[14,0], 14, 0)</th>
<th>(posterior, ab_subject[14,1], 14, 1)</th>
<th>(posterior, ab_subject[15,0], 15, 0)</th>
<th>(posterior, ab_subject[15,1], 15, 1)</th>
<th>...</th>
<th>(log_likelihood, wait[97], 97)</th>
<th>(log_likelihood, wait[98], 98)</th>
<th>(log_likelihood, wait[99], 99)</th>
<th>(log_likelihood, wait[9], 9)</th>
<th>(sample_stats, tree_depth)</th>
<th>(sample_stats, max_energy_error)</th>
<th>(sample_stats, process_time_diff)</th>
<th>(sample_stats, perf_counter_diff)</th>
<th>(sample_stats, energy)</th>
<th>(sample_stats, step_size_bar)</th>
<th>(sample_stats, diverging)</th>
<th>(sample_stats, energy_error)</th>
<th>(sample_stats, lp)</th>
<th>(sample_stats, acceptance_rate)</th>
<th>(sample_stats, n_steps)</th>
<th>(sample_stats, largest_eigval)</th>
<th>(sample_stats, smallest_eigval)</th>
<th>(sample_stats, index_in_trajectory)</th>
<th>(sample_stats, step_size)</th>
<th>(sample_stats, perf_counter_start)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>0</td>
<td>3.397744</td>
<td>-0.993140</td>
<td>2.353823</td>
<td>-0.712216</td>
<td>3.936642</td>
<td>-1.328451</td>
<td>2.497521</td>
<td>-0.990675</td>
<td>4.589760</td>
<td>-1.271864</td>
<td>2.272038</td>
<td>-0.780358</td>
<td>3.400074</td>
<td>-1.307487</td>
<td>4.660517</td>
<td>-0.920542</td>
<td>3.967868</td>
<td>-1.339014</td>
<td>...</td>
<td>-0.592594</td>
<td>-0.280869</td>
<td>-1.783441</td>
<td>-0.404212</td>
<td>5</td>
<td>-0.452539</td>
<td>0.234946</td>
<td>0.067260</td>
<td>194.679539</td>
<td>0.246795</td>
<td>False</td>
<td>-0.226605</td>
<td>-167.432037</td>
<td>0.975607</td>
<td>31.0</td>
<td>NaN</td>
<td>NaN</td>
<td>-17</td>
<td>0.284311</td>
<td>192.355518</td>
</tr>
<tr>
<th>1</th>
<td>0</td>
<td>1</td>
<td>3.227032</td>
<td>-1.105823</td>
<td>2.486742</td>
<td>-0.657790</td>
<td>3.890044</td>
<td>-1.788579</td>
<td>2.894867</td>
<td>-0.741011</td>
<td>4.346072</td>
<td>-1.048541</td>
<td>2.446301</td>
<td>-0.678041</td>
<td>3.564795</td>
<td>-1.520221</td>
<td>5.013627</td>
<td>-1.128684</td>
<td>3.793134</td>
<td>-1.084814</td>
<td>...</td>
<td>-0.581570</td>
<td>-0.708670</td>
<td>-1.709776</td>
<td>-0.741664</td>
<td>4</td>
<td>0.498338</td>
<td>0.123327</td>
<td>0.033713</td>
<td>196.867266</td>
<td>0.246795</td>
<td>False</td>
<td>0.273832</td>
<td>-177.694232</td>
<td>0.809115</td>
<td>15.0</td>
<td>NaN</td>
<td>NaN</td>
<td>-8</td>
<td>0.284311</td>
<td>192.423125</td>
</tr>
<tr>
<th>2</th>
<td>0</td>
<td>2</td>
<td>3.393307</td>
<td>-0.926431</td>
<td>2.348434</td>
<td>-0.604619</td>
<td>3.905778</td>
<td>-1.355137</td>
<td>2.712834</td>
<td>-1.124770</td>
<td>4.409195</td>
<td>-1.291088</td>
<td>2.324233</td>
<td>-0.754508</td>
<td>3.586107</td>
<td>-1.562165</td>
<td>5.050191</td>
<td>-1.556993</td>
<td>4.122478</td>
<td>-1.718417</td>
<td>...</td>
<td>-0.452885</td>
<td>-0.109849</td>
<td>-2.293094</td>
<td>-0.559207</td>
<td>5</td>
<td>-0.382814</td>
<td>0.236803</td>
<td>0.063232</td>
<td>207.926089</td>
<td>0.246795</td>
<td>False</td>
<td>-0.347905</td>
<td>-176.112370</td>
<td>0.968229</td>
<td>31.0</td>
<td>NaN</td>
<td>NaN</td>
<td>6</td>
<td>0.284311</td>
<td>192.457135</td>
</tr>
<tr>
<th>3</th>
<td>0</td>
<td>3</td>
<td>3.750943</td>
<td>-1.109148</td>
<td>2.613325</td>
<td>-0.667234</td>
<td>3.682009</td>
<td>-1.293790</td>
<td>2.558511</td>
<td>-0.362557</td>
<td>4.548968</td>
<td>-1.266139</td>
<td>2.264383</td>
<td>-0.445725</td>
<td>3.102086</td>
<td>-0.903726</td>
<td>4.589499</td>
<td>-0.409875</td>
<td>4.063760</td>
<td>-1.249921</td>
<td>...</td>
<td>-1.239451</td>
<td>-0.574010</td>
<td>-0.906557</td>
<td>-1.015460</td>
<td>4</td>
<td>-0.530897</td>
<td>0.116930</td>
<td>0.037484</td>
<td>198.279760</td>
<td>0.246795</td>
<td>False</td>
<td>-0.024171</td>
<td>-180.489888</td>
<td>0.987683</td>
<td>15.0</td>
<td>NaN</td>
<td>NaN</td>
<td>-9</td>
<td>0.284311</td>
<td>192.520656</td>
</tr>
<tr>
<th>4</th>
<td>0</td>
<td>4</td>
<td>3.416951</td>
<td>-1.152993</td>
<td>2.478859</td>
<td>-0.812085</td>
<td>3.773041</td>
<td>-1.423143</td>
<td>2.136978</td>
<td>-0.465100</td>
<td>4.385045</td>
<td>-1.180823</td>
<td>2.160109</td>
<td>-0.395771</td>
<td>3.459758</td>
<td>-1.300131</td>
<td>5.527213</td>
<td>-2.107117</td>
<td>3.906480</td>
<td>-1.388326</td>
<td>...</td>
<td>-0.400278</td>
<td>-0.240346</td>
<td>-2.188396</td>
<td>-0.471960</td>
<td>5</td>
<td>-0.382498</td>
<td>0.241781</td>
<td>0.072736</td>
<td>207.993298</td>
<td>0.246795</td>
<td>False</td>
<td>-0.041904</td>
<td>-183.942618</td>
<td>0.999986</td>
<td>31.0</td>
<td>NaN</td>
<td>NaN</td>
<td>-24</td>
<td>0.284311</td>
<td>192.558443</td>
</tr>
</tbody>
</table>
<p>5 rows × 270 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Get the "unbaked in" varying intercept and slope
</span><span class="n">bayesian_int</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="n">bayesian_slope</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">20</span><span class="p">):</span>
<span class="n">idata_m14_1_df</span><span class="p">[</span><span class="sa">f</span><span class="s">'varying_int_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">idata_m14_1_df</span><span class="p">[</span> <span class="p">(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'ab_subject[</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">,0]'</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">)]</span> <span class="o">-</span> <span class="n">idata_m14_1_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'a_bar'</span><span class="p">)]</span>
<span class="n">bayesian_int</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">idata_m14_1_df</span><span class="p">[</span><span class="sa">f</span><span class="s">'varying_int_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">].</span><span class="n">mean</span><span class="p">())</span>
<span class="n">idata_m14_1_df</span><span class="p">[</span><span class="sa">f</span><span class="s">'varying_slope_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">idata_m14_1_df</span><span class="p">[</span> <span class="p">(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="sa">f</span><span class="s">'ab_subject[</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">,1]'</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span> <span class="o">-</span> <span class="n">idata_m14_1_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'b_bar'</span><span class="p">)]</span>
<span class="n">bayesian_slope</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">idata_m14_1_df</span><span class="p">[</span><span class="sa">f</span><span class="s">'varying_slope_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">].</span><span class="n">mean</span><span class="p">())</span>
</code></pre></div></div>
<p>We can now make a direct comparison between the <code class="language-plaintext highlighter-rouge">lmer</code> and <code class="language-plaintext highlighter-rouge">pymc</code> outputs. I’ll ignore the uncertainties for the sake of a cleaner plot.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">random_sims_int</span> <span class="o">=</span> <span class="n">random_sims</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">random_sims</span><span class="p">[</span><span class="s">'term'</span><span class="p">]</span><span class="o">==</span><span class="s">'(Intercept)'</span><span class="p">,</span> <span class="s">'mean'</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">random_sims_slope</span> <span class="o">=</span> <span class="n">random_sims</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">random_sims</span><span class="p">[</span><span class="s">'term'</span><span class="p">]</span><span class="o">==</span><span class="s">'afternoon'</span><span class="p">,</span> <span class="s">'mean'</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">f</span><span class="p">,</span> <span class="p">(</span><span class="n">ax0</span><span class="p">,</span> <span class="n">ax1</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">min_max_int</span> <span class="o">=</span> <span class="p">[</span><span class="nb">min</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">random_sims_int</span><span class="p">)</span> <span class="o">+</span> <span class="n">bayesian_int</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">random_sims_int</span><span class="p">)</span> <span class="o">+</span> <span class="n">bayesian_int</span><span class="p">)]</span>
<span class="n">min_max_slope</span> <span class="o">=</span> <span class="p">[</span><span class="nb">min</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">random_sims_slope</span><span class="p">)</span> <span class="o">+</span> <span class="n">bayesian_slope</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">random_sims_slope</span><span class="p">)</span> <span class="o">+</span> <span class="n">bayesian_slope</span><span class="p">)]</span>
<span class="c1"># intercepts
</span><span class="n">ax0</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">random_sims_int</span><span class="p">,</span> <span class="n">bayesian_int</span><span class="p">,</span> <span class="n">facecolors</span><span class="o">=</span><span class="s">'none'</span><span class="p">,</span> <span class="n">edgecolors</span><span class="o">=</span><span class="s">'navy'</span><span class="p">)</span>
<span class="n">ax0</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">min_max_int</span><span class="p">,</span> <span class="n">min_max_int</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'dashed'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'gray'</span><span class="p">)</span>
<span class="n">ax0</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'lmer intercept estimates'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'pymc intercept estimates'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Comparison of varying intercepts'</span><span class="p">)</span>
<span class="c1"># slopes
</span><span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">random_sims_slope</span><span class="p">,</span> <span class="n">bayesian_slope</span><span class="p">,</span> <span class="n">facecolors</span><span class="o">=</span><span class="s">'none'</span><span class="p">,</span> <span class="n">edgecolors</span><span class="o">=</span><span class="s">'navy'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">min_max_slope</span><span class="p">,</span> <span class="n">min_max_slope</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'dashed'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'gray'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'lmer slope estimates'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'pymc slope estimates'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Comparison of varying slopes'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Text(0.5, 0, 'lmer slope estimates'),
Text(0, 0.5, 'pymc slope estimates'),
Text(0.5, 1.0, 'Comparison of varying slopes')]
</code></pre></div></div>
<p><img src="/assets/2022-09-13-mixed_effects_freqvsbayes_cafes_files/2022-09-13-mixed_effects_freqvsbayes_cafes_53_1.png" alt="png" /></p>
<p>As you can see, we get very similar cafe-specific estimates (varying effects) for the intercept and slope between the <code class="language-plaintext highlighter-rouge">lmer</code> and <code class="language-plaintext highlighter-rouge">pymc</code> approaches.</p>
<h1 id="summary">Summary</h1>
<p>In this post, I set out to compare different mixed model approaches. I looked at the equations and the programmatic implementations, and concluded by showing how the two methods arrive at the same answer. Getting there required a careful understanding of the differences in the equations and in the language- and package-specific implementations. Several points confused me while writing this post, but working through them provided opportunities for deeper understanding.</p>
<h1 id="acknowledgements-and-references">Acknowledgements and references</h1>
<p>Acknowledgements</p>
<ul>
<li>Special shoutout to Patrick Robotham (@probot) from the University of Bayes Discord channel for helping me work through <em>many</em> of my confusions.</li>
<li>Eric J. Daza for discussions about mixed effects modeling, which reminded me to improve my knowledge in this area.</li>
<li>Members of the Glymour group at UCSF for checking some of my code.</li>
</ul>
<p>References</p>
<ul>
<li><a href="https://stats.oarc.ucla.edu/other/mult-pkg/introduction-to-linear-mixed-models/">UCLA introduction to linear mixed models</a>.</li>
<li>Richard McElreath’s Statistical Rethinking for my introduction to Bayesian multilevel modeling and the <a href="https://github.com/pymc-devs/pymc-resources/blob/main/Rethinking_2/Chp_14.ipynb">Statistical Rethinking Chapter 14 repo</a>.</li>
<li>Andrzej Gałecki and Tomasz Burzykowski’s <a href="https://link.springer.com/book/10.1007/978-1-4614-3900-4">Linear Mixed-Effects Models Using R</a>, which references the <code class="language-plaintext highlighter-rouge">lme4</code> package. Dr. McElreath referenced this package as a non-Bayesian alternative in his book.</li>
<li>Andrew Gelman wrote about why he doesn’t like using “fixed and random effects” (in a <a href="https://statmodeling.stat.columbia.edu/2005/01/25/why_i_dont_use/">blog</a> and in a <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-33/issue-1/Analysis-of-variancewhy-it-is-more-important-than-ever/10.1214/009053604000001048.full">paper</a>).</li>
<li>TJ Mahr’s <a href="https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/">partial pooling blog post</a>.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
<span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span> <span class="o">-</span><span class="n">p</span> <span class="n">aesara</span><span class="p">,</span><span class="n">aeppl</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The watermark extension is already loaded. To reload it, use:
%reload_ext watermark
Last updated: Tue Sep 13 2022
Python implementation: CPython
Python version : 3.10.6
IPython version : 8.4.0
aesara: 2.8.2
aeppl : 0.0.35
pymc : 4.1.7
xarray : 2022.6.0
pandas : 1.4.3
sys : 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:43:44) [Clang 13.0.1 ]
arviz : 0.12.1
matplotlib: 3.5.3
aesara : 2.8.2
numpy : 1.23.2
Watermark: 2.3.1
</code></pre></div></div>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">R</span><span class="w">
</span><span class="n">sessionInfo</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Monterey 12.5.1
Matrix products: default
LAPACK: /Users/blacar/opt/anaconda3/envs/pymc_env2/lib/libopenblasp-r0.3.21.dylib
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] merTools_0.5.2 arm_1.13-1 MASS_7.3-58.1 lme4_1.1-30
[5] Matrix_1.4-1 forcats_0.5.2 stringr_1.4.1 dplyr_1.0.9
[9] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.8
[13] ggplot2_3.3.6 tidyverse_1.3.2
loaded via a namespace (and not attached):
[1] httr_1.4.4 jsonlite_1.8.0 splines_4.1.3
[4] foreach_1.5.2 modelr_0.1.9 shiny_1.7.2
[7] assertthat_0.2.1 broom.mixed_0.2.9.4 googlesheets4_1.0.1
[10] cellranger_1.1.0 globals_0.16.1 pillar_1.8.1
[13] backports_1.4.1 lattice_0.20-45 glue_1.6.2
[16] digest_0.6.29 promises_1.2.0.1 rvest_1.0.3
[19] minqa_1.2.4 colorspace_2.0-3 httpuv_1.6.5
[22] htmltools_0.5.3 pkgconfig_2.0.3 broom_1.0.0
[25] listenv_0.8.0 haven_2.5.1 xtable_1.8-4
[28] mvtnorm_1.1-3 scales_1.2.1 later_1.3.0
[31] tzdb_0.3.0 googledrive_2.0.0 farver_2.1.1
[34] generics_0.1.3 ellipsis_0.3.2 withr_2.5.0
[37] furrr_0.3.1 cli_3.3.0 mime_0.12
[40] magrittr_2.0.3 crayon_1.5.1 readxl_1.4.1
[43] fs_1.5.2 future_1.27.0 fansi_1.0.3
[46] parallelly_1.32.1 nlme_3.1-159 xml2_1.3.3
[49] hms_1.1.2 gargle_1.2.0 lifecycle_1.0.1
[52] munsell_0.5.0 reprex_2.0.2 compiler_4.1.3
[55] rlang_1.0.4 blme_1.0-5 grid_4.1.3
[58] nloptr_2.0.3 iterators_1.0.14 labeling_0.4.2
[61] boot_1.3-28 gtable_0.3.0 codetools_0.2-18
[64] abind_1.4-5 DBI_1.1.3 R6_2.5.1
[67] lubridate_1.8.0 fastmap_1.1.0 utf8_1.2.2
[70] stringi_1.7.8 parallel_4.1.3 Rcpp_1.0.9
[73] vctrs_0.4.1 dbplyr_2.2.1 tidyselect_1.1.2
[76] coda_0.19-4
</code></pre></div></div>Ben LacarFor a while, I’ve wondered about the different approaches for multilevel modeling, also known as mixed effects modeling. My initial understanding is with a Bayesian perspective since I learned about it from Statistical Rethinking. But when hearing others talk about “fixed effects”, “varying effects”, “random effects”, and “mixed effects”, I had trouble connecting my own understanding of the concept to theirs. Even more perplexing, I wasn’t sure what the source(s) of the differences were: Is it a frequentist vs. Bayesian thing? Is it a statistical package thing? Is it because there are five different definitions of “fixed and random effects”, infamously observed by Andrew Gelman and why he avoids using those terms?LKJCorr and LKJCov in pymc2022-04-12T00:00:00+00:002022-04-12T00:00:00+00:00https://benslack19.github.io/data%20science/statistics/LKJcorrcov<p>While continuing to deep dive on covariance priors following my <a href="/data%20science/statistics/cov_matrix_weirdness/">prior post</a>, I investigated implementations in <code class="language-plaintext highlighter-rouge">pymc</code>. I played around with the <code class="language-plaintext highlighter-rouge">LKJcorr</code> and <code class="language-plaintext highlighter-rouge">LKJcov</code> functions (named for the authors Lewandowski, Kurowicka, and Joe). There are already great explanations out there of the rationale for these distributions.</p>
<ul>
<li><a href="https://www.sciencedirect.com/science/article/pii/S0047259X09000876?via%3Dihub">Original paper</a></li>
<li><a href="https://distribution-explorer.github.io/multivariate_continuous/lkj.html">Explanation from Distribution Explorer</a></li>
<li><a href="https://yingqijing.medium.com/lkj-correlation-distribution-in-stan-29927b69e9be">Stan example implementation</a></li>
<li><a href="https://stats.stackexchange.com/questions/304684/why-lkjcorr-is-a-good-prior-for-correlation-matrix">Why LKJcorr is a good prior for correlation matrix?</a></li>
<li><a href="https://docs.pymc.io/en/v3/pymc-examples/examples/case_studies/LKJ.html">PyMC documentation on covariance priors</a></li>
</ul>
<p>I felt those references were more useful after I got my hands dirty. I learned by continuing to try some of McElreath’s Statistical Rethinking problems and dissecting some of the <code class="language-plaintext highlighter-rouge">pymc</code> output. That’s what I document here. <a href="https://media.giphy.com/media/BpGWitbFZflfSUYuZ9/giphy.gif">Let’s do this</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="n">pm</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">expit</span><span class="p">,</span> <span class="n">logit</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">tensor</span> <span class="k">as</span> <span class="n">tt</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">load_ext</span> <span class="n">nb_black</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
<span class="n">RANDOM_SEED</span> <span class="o">=</span> <span class="mi">8927</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">RANDOM_SEED</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"stats.hdi_prob"</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.89</span> <span class="c1"># sets default credible interval used by arviz
</span>
<span class="k">def</span> <span class="nf">standardize</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>
<h1 id="lkjcorr-distribution"><code class="language-plaintext highlighter-rouge">LKJcorr</code> distribution</h1>
<p>This function is used to draw correlation values that would comprise one set of parameters that produce a covariance matrix. The other is a vector of standard deviations which we’ll cover in a later section. The smallest correlation matrix you can have is one that is 2x2. Let’s take 5 draws from an LKJ distribution for a 2x2 matrix with an <code class="language-plaintext highlighter-rouge">eta</code> value of 2. Since a matrix is always square, we only need one value for its size, represented by <code class="language-plaintext highlighter-rouge">n</code> in the function. I’ll come back to what <code class="language-plaintext highlighter-rouge">eta</code> is representing later.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 5 draws of rho values for 2x2 correlation matrix
</span><span class="n">pm</span><span class="p">.</span><span class="n">LKJCorr</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mi">2</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 0.44536007],
[ 0.47925197],
[ 0.11555357],
[ 0.32794659],
[-0.36670234]])
</code></pre></div></div>
<p>Where does the one unique value in each draw come from? Since the correlation matrix is 2x2 and the diagonal is 1, we only need one rho value to complete the matrix, because the off-diagonals are symmetric, like this, where I’m using $a$ as the placeholder for one of the five values we sampled above:</p>
\[\begin{bmatrix} 1 & a \\ a & 1 \end{bmatrix}\]
<p>Why do we even need the LKJ distribution? If we need a prior distribution for rho, can’t we use a beta distribution? After all, we only need one value since the off-diagonals are symmetric. But the 2x2 correlation matrix is a special case, as we immediately appreciate once we use a 3x3 matrix.</p>
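<p>One reason independent per-entry priors (like a beta on each rho) break down beyond 2x2 is that the rho values are not free of each other: together they must form a positive semi-definite matrix. A quick sketch with made-up rho values, each legal on its own, that jointly fail this requirement:</p>

```python
import numpy as np

# Each pairwise rho below is a legal correlation on its own,
# but together they cannot form a valid correlation matrix:
# a valid one must be positive semi-definite.
R_bad = np.array(
    [
        [1.0, 0.9, 0.9],
        [0.9, 1.0, -0.9],
        [0.9, -0.9, 1.0],
    ]
)

eigenvalues = np.linalg.eigvalsh(R_bad)
print(eigenvalues.min() < 0)  # True: a negative eigenvalue, so not valid
```

The LKJ distribution sidesteps this by only ever generating valid correlation matrices.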
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 5 draws of rho values for 3x3 correlation matrix
</span><span class="n">pm</span><span class="p">.</span><span class="n">LKJCorr</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mi">2</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[-0.03769077, -0.11296013, -0.1146735 ],
[ 0.02914172, 0.87776358, 0.23930909],
[ 0.22782414, 0.17480301, -0.33129416],
[-0.10766053, 0.01332989, -0.47423766],
[-0.11734588, -0.56739363, -0.03539502]])
</code></pre></div></div>
<p>Where do the three unique values from each draw come from? We now have three unique rho values which are symmetric around the diagonal vector of 1s. I’ll use $a$, $b$, and $c$ to represent rho values between the first and second parameters, first and third parameters, and second and third parameters, respectively.</p>
\[\begin{bmatrix} 1 & a & b \\ a & 1 & c \\ b & c & 1 \end{bmatrix}\]
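<p>Assuming the flat draw lists the upper-triangle entries row by row (i.e., in the order $a$, $b$, $c$ above), a small helper can rebuild the full symmetric matrix. The function name here is my own, not part of <code class="language-plaintext highlighter-rouge">pymc</code>:</p>

```python
import numpy as np

def corr_matrix_from_flat(rho_flat, n):
    """Rebuild a symmetric n x n correlation matrix from the
    n*(n-1)/2 off-diagonal values of one LKJCorr draw (assuming
    row-major upper-triangle ordering)."""
    R = np.eye(n)
    iu = np.triu_indices(n, k=1)  # upper-triangle indices, row by row
    R[iu] = rho_flat
    R[iu[1], iu[0]] = rho_flat    # mirror into the lower triangle
    return R

R3 = corr_matrix_from_flat([0.1, 0.2, 0.3], 3)
# R3[0, 1] is a, R3[0, 2] is b, R3[1, 2] is c, mirrored below the diagonal
```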
<p>We see this same pattern continue when we increase the dimensions of the covariance matrix by one yet again.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 5 draws of rho values for 4x4 correlation matrix
</span><span class="n">pm</span><span class="p">.</span><span class="n">LKJCorr</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mi">2</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[-0.38633472, 0.68960155, 0.48638167, -0.51097999, -0.33741679,
0.41864488],
[ 0.21746371, -0.3314052 , 0.55754886, 0.14152549, 0.426748 ,
-0.03264997],
[-0.24486788, -0.01114335, 0.09792299, 0.52026915, 0.25866165,
0.50274309],
[ 0.89416089, -0.23197608, -0.29084055, -0.2306694 , -0.37897343,
0.10800676],
[-0.50015375, 0.30615676, -0.16429283, -0.36701092, -0.17276691,
-0.34336828]])
</code></pre></div></div>
<p>The six unique values when the correlation matrix is 4x4 are shown in this arrangement. Again, the diagonal is 1, while the off-diagonals are symmetric around it.</p>
\[\begin{bmatrix} 1 & a & b & c \\ a & 1 & d & e \\ b & d & 1 & f \\ c & e & f & 1 \end{bmatrix}\]
<p>Above I’ve illustrated what each draw of <code class="language-plaintext highlighter-rouge">LKJcorr</code> represents as it varies with <code class="language-plaintext highlighter-rouge">n</code>, but the values themselves are controlled by the parameter <code class="language-plaintext highlighter-rouge">eta</code>. Since the rho values are correlation coefficients, they are bounded by -1 and 1. The <code class="language-plaintext highlighter-rouge">eta</code> parameter influences the shape of the distribution within these bounds.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># repo code (for correlation matrix of 2x2)
</span><span class="n">f</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">textloc</span> <span class="o">=</span> <span class="p">[[</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">]]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
<span class="k">for</span> <span class="n">eta</span><span class="p">,</span> <span class="n">loc</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="n">textloc</span><span class="p">):</span>
<span class="n">R</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">LKJCorr</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="n">eta</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">10000</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_kde</span><span class="p">(</span><span class="n">R</span><span class="p">,</span> <span class="n">plot_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"alpha"</span><span class="p">:</span> <span class="mf">0.8</span><span class="p">},</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">text</span><span class="p">(</span><span class="n">loc</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">loc</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="s">"eta = %s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">eta</span><span class="p">),</span> <span class="n">horizontalalignment</span><span class="o">=</span><span class="s">"center"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"correlation"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Density"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="si">}</span><span class="s">x</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="si">}</span><span class="s"> correlation matrix"</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/assets/2022-04-22-LKJcorrcov_files/2022-04-22-LKJcorrcov_11_0.png" alt="png" /></p>
<p>You can see that an <code class="language-plaintext highlighter-rouge">eta</code> value of 2 is fairly conservative, placing most of the probability mass on low correlations. The size of the correlation matrix itself also influences the distribution, particularly at the tails.</p>
<h1 id="manually-building-a-covariance-matrix">Manually building a covariance matrix</h1>
<p>Remember that at this point we don’t have the covariance matrix yet. We only have sampled <em>correlation</em> matrices. But we can get the covariance matrix once we take some sampled sigmas. Let’s do that to see how this would look in 5 draws from a prior distribution. We wouldn’t do this for a real problem. This is just to demonstrate what sampling would look like, step-by-step. We’ll go back to a 2x2 covariance matrix to make things simple.</p>
<p>For each pass through this loop, we will:</p>
<ul>
<li>take one rho value, since this is a 2x2 correlation matrix</li>
<li>take two standard deviation (sigma) values and they’ll be from slightly different distributions just to make things interesting (Exp(1) and Exp(0.5))</li>
<li>arrange the sigmas as a vector</li>
<li>generate a new covariance matrix from the rho and sigmas</li>
<li>output the covariance matrix and the values that went into it</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="c1"># the [0] is just to get the value out of the array
</span> <span class="n">rho</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">LKJCorr</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mi">2</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">Rmat</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="n">rho</span><span class="p">],</span> <span class="p">[</span><span class="n">rho</span><span class="p">,</span> <span class="mi">1</span><span class="p">]])</span>
<span class="c1"># the sigmas themselves need to be sampled; again the [0] is to get the value out of the array
</span> <span class="n">sigma_a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="mf">1.0</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">sigma_b</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="mf">0.5</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">sigmas</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">sigma_a</span><span class="p">,</span> <span class="n">sigma_b</span><span class="p">])</span> <span class="c1"># arrange the sigmas as a vector
</span> <span class="n">Sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">sigmas</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">Rmat</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">sigmas</span><span class="p">))</span> <span class="c1"># use one of the weird ways a covariance matrix is made
</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"sample </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="se">\n</span><span class="s"> -- sigma for a: </span><span class="si">{</span><span class="n">sigma_a</span><span class="p">:</span><span class="mf">0.3</span><span class="n">f</span><span class="si">}</span><span class="s">,</span><span class="se">\t</span><span class="s"> sigma for b: </span><span class="si">{</span><span class="n">sigma_b</span><span class="p">:</span><span class="mf">0.3</span><span class="n">f</span><span class="si">}</span><span class="s">,</span><span class="se">\t</span><span class="s">Rho: </span><span class="si">{</span><span class="n">rho</span><span class="p">:</span><span class="mf">0.3</span><span class="n">f</span><span class="si">}</span><span class="s"> </span><span class="se">\n</span><span class="s"> -- covariance matrix:</span><span class="se">\n</span><span class="si">{</span><span class="n">Sigma</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sample 0
-- sigma for a: 0.411, sigma for b: 2.389, Rho: -0.289
-- covariance matrix:
[[ 0.16873355 -0.28350151]
[-0.28350151 5.7070884 ]]
sample 1
-- sigma for a: 1.969, sigma for b: 10.462, Rho: 0.436
-- covariance matrix:
[[ 3.87629967 8.97898901]
[ 8.97898901 109.46212826]]
sample 2
-- sigma for a: 1.932, sigma for b: 2.676, Rho: -0.404
-- covariance matrix:
[[ 3.73187298 -2.08904139]
[-2.08904139 7.1588578 ]]
sample 3
-- sigma for a: 0.601, sigma for b: 2.326, Rho: 0.233
-- covariance matrix:
[[0.36063857 0.32606953]
[0.32606953 5.41025897]]
sample 4
-- sigma for a: 3.659, sigma for b: 0.002, Rho: 0.458
-- covariance matrix:
[[1.33858317e+01 4.08000425e-03]
[4.08000425e-03 5.92186930e-06]]
</code></pre></div></div>
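<p>Any covariance matrix assembled this way should come out symmetric and positive semi-definite. As a quick sanity check (a sketch, with the numbers simply copied from sample 0 of the printed output above):</p>

```python
import numpy as np

# Covariance matrix from sample 0 above, copied from the printed output
Sigma = np.array([[0.16873355, -0.28350151],
                  [-0.28350151, 5.7070884]])

# A valid covariance matrix is symmetric with non-negative eigenvalues
assert np.allclose(Sigma, Sigma.T)
eigvals = np.linalg.eigvalsh(Sigma)
print("eigenvalues:", eigvals)
assert np.all(eigvals >= 0)

# The implied correlation should match the rho that generated it (-0.289)
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print("implied rho:", round(rho, 3))
```

<p>Dividing the off-diagonal by the product of the standard deviations recovers the correlation coefficient, which is just the covariance construction run in reverse.</p>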
<h1 id="lkjcholeskycov-distribution"><code class="language-plaintext highlighter-rouge">LKJCholeskyCov</code> distribution</h1>
<p>Unlike <code class="language-plaintext highlighter-rouge">LKJCorr</code>, there’s no <code class="language-plaintext highlighter-rouge">.dist</code> method that we can sample from directly. However, we can wrap this in a model container and sample from it. Again, this is not recommended practice. It is merely to get an idea of what this function is producing. Check out the <a href="https://docs.pymc.io/en/v3/pymc-examples/examples/case_studies/LKJ.html">pymc example</a> I referenced above for an authoritative source.</p>
<p>(One difference compared to the manual way of constructing the covariance matrix above is that I don’t think there’s a way to specify different prior distributions for the sigmas in the <code class="language-plaintext highlighter-rouge">sd_dist</code> parameter, at least with version 3.11.0 of <code class="language-plaintext highlighter-rouge">pymc</code>.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">m1</span><span class="p">:</span>
<span class="n">packed_L</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">LKJCholeskyCov</span><span class="p">(</span><span class="s">"packed_L"</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">sd_dist</span><span class="o">=</span><span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="mf">1.0</span><span class="p">))</span>
<span class="n">trace_m1</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="n">tune</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [packed_L]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 32 seconds.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.4781781973772051, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.705022869734104, but should be close to 0.8. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The estimated number of effective samples is smaller than 200 for some parameters.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pm</span><span class="p">.</span><span class="n">trace_to_dataframe</span><span class="p">(</span><span class="n">trace_m1</span><span class="p">).</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>packed_L__0</th>
<th>packed_L__1</th>
<th>packed_L__2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2.248979</td>
<td>-1.339856</td>
<td>2.574068</td>
</tr>
<tr>
<th>1</th>
<td>1.521427</td>
<td>-1.494167</td>
<td>1.882593</td>
</tr>
<tr>
<th>2</th>
<td>2.729876</td>
<td>-1.076162</td>
<td>2.427225</td>
</tr>
<tr>
<th>3</th>
<td>0.781399</td>
<td>-0.943537</td>
<td>1.931551</td>
</tr>
<tr>
<th>4</th>
<td>1.161240</td>
<td>-0.111627</td>
<td>0.320429</td>
</tr>
</tbody>
</table>
</div>
<p>With each sample, we have three values. They form a lower triangular matrix, but it is <em>not</em> the covariance matrix itself. Rather, it is the <strong>Cholesky decomposition</strong> of the covariance matrix. For practical purposes and for interpretation, it is better to use the following instantiation where we can get the rho and sigma values back automatically.</p>
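<p>To see what "Cholesky decomposition of the covariance matrix" means concretely, we can unpack one draw by hand. Below is a sketch using the three <code class="language-plaintext highlighter-rouge">packed_L</code> values from row 0 of the table above, interpreting them as the lower-triangular entries in row-major order ([L00, L10, L11] — the same ordering you can verify against the <code class="language-plaintext highlighter-rouge">chol_stds</code> columns in the next model):</p>

```python
import numpy as np

# Packed values from row 0 of the table above, read as a lower-triangular
# matrix in row-major order: [L00, L10, L11]
packed_L = [2.248979, -1.339856, 2.574068]
L = np.array([[packed_L[0], 0.0],
              [packed_L[1], packed_L[2]]])

# The covariance matrix is recovered as L @ L.T
Sigma = L @ L.T
print(Sigma)

# The sigmas and rho fall out of the recovered covariance matrix
sigmas = np.sqrt(np.diag(Sigma))
rho = Sigma[0, 1] / (sigmas[0] * sigmas[1])
print("sigmas:", np.round(sigmas, 3), "rho:", round(rho, 3))
```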
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">m2</span><span class="p">:</span>
<span class="n">chol</span><span class="p">,</span> <span class="n">corr</span><span class="p">,</span> <span class="n">stds</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">LKJCholeskyCov</span><span class="p">(</span>
<span class="s">"chol"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">eta</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">sd_dist</span><span class="o">=</span><span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="mf">1.0</span><span class="p">),</span> <span class="n">compute_corr</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">"cov"</span><span class="p">,</span> <span class="n">chol</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">chol</span><span class="p">.</span><span class="n">T</span><span class="p">))</span>
<span class="n">trace_m2</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="n">tune</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [chol]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 29 seconds.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.47696968632616427, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.6910481731222186, but should be close to 0.8. Try to increase the number of tuning steps.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5735096676868859, but should be close to 0.8. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The estimated number of effective samples is smaller than 200 for some parameters.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trace_m2_df</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">trace_to_dataframe</span><span class="p">(</span><span class="n">trace_m2</span><span class="p">)</span>
<span class="n">trace_m2_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>chol__0</th>
<th>chol__1</th>
<th>chol__2</th>
<th>chol_stds__0</th>
<th>chol_stds__1</th>
<th>chol_corr__0_0</th>
<th>chol_corr__0_1</th>
<th>chol_corr__1_0</th>
<th>chol_corr__1_1</th>
<th>cov__0_0</th>
<th>cov__0_1</th>
<th>cov__1_0</th>
<th>cov__1_1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>3.622728</td>
<td>-0.515597</td>
<td>1.710626</td>
<td>3.622728</td>
<td>1.786640</td>
<td>1.0</td>
<td>-0.288585</td>
<td>-0.288585</td>
<td>1.0</td>
<td>13.124155</td>
<td>-1.867867</td>
<td>-1.867867</td>
<td>3.192082</td>
</tr>
<tr>
<th>1</th>
<td>0.697870</td>
<td>-0.172701</td>
<td>1.485430</td>
<td>0.697870</td>
<td>1.495436</td>
<td>1.0</td>
<td>-0.115485</td>
<td>-0.115485</td>
<td>1.0</td>
<td>0.487023</td>
<td>-0.120523</td>
<td>-0.120523</td>
<td>2.236328</td>
</tr>
<tr>
<th>2</th>
<td>0.600202</td>
<td>0.428665</td>
<td>1.425867</td>
<td>0.600202</td>
<td>1.488909</td>
<td>1.0</td>
<td>0.287906</td>
<td>0.287906</td>
<td>1.0</td>
<td>0.360242</td>
<td>0.257286</td>
<td>0.257286</td>
<td>2.216850</td>
</tr>
<tr>
<th>3</th>
<td>1.777497</td>
<td>-0.023333</td>
<td>0.703293</td>
<td>1.777497</td>
<td>0.703680</td>
<td>1.0</td>
<td>-0.033159</td>
<td>-0.033159</td>
<td>1.0</td>
<td>3.159494</td>
<td>-0.041475</td>
<td>-0.041475</td>
<td>0.495165</td>
</tr>
<tr>
<th>4</th>
<td>0.671842</td>
<td>-0.077284</td>
<td>0.361800</td>
<td>0.671842</td>
<td>0.369962</td>
<td>1.0</td>
<td>-0.208898</td>
<td>-0.208898</td>
<td>1.0</td>
<td>0.451371</td>
<td>-0.051923</td>
<td>-0.051923</td>
<td>0.136872</td>
</tr>
</tbody>
</table>
</div>
<p>We can verify the values of the covariance matrix by using the standard deviations and correlation coefficients of the posterior.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="n">sigmas</span> <span class="o">=</span> <span class="n">trace_m2_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">[</span><span class="s">'chol_stds__0'</span><span class="p">,</span> <span class="s">'chol_stds__1'</span><span class="p">]]</span>
<span class="n">rho</span> <span class="o">=</span> <span class="n">trace_m2_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s">'chol_corr__0_1'</span><span class="p">]</span>
<span class="n">Rmat</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="n">rho</span><span class="p">],</span> <span class="p">[</span><span class="n">rho</span><span class="p">,</span> <span class="mi">1</span><span class="p">]])</span>
<span class="n">Sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">sigmas</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">Rmat</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">sigmas</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'draw: </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> <span class="n">Sigma</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>draw: 0
[[13.12415508 -1.86786674]
[-1.86786674 3.19208164]]
draw: 1
[[ 0.48702298 -0.1205228 ]
[-0.1205228 2.23632763]]
draw: 2
[[0.36024213 0.25728563]
[0.25728563 2.21685039]]
draw: 3
[[ 3.15949443 -0.0414752 ]
[-0.0414752 0.49516542]]
draw: 4
[[ 0.45137105 -0.05192275]
[-0.05192275 0.13687202]]
</code></pre></div></div>
<p>Compare the printed values from each sample draw and you see that we get an exact match with the <code class="language-plaintext highlighter-rouge">cov__0_0</code>, <code class="language-plaintext highlighter-rouge">cov__0_1</code>, <code class="language-plaintext highlighter-rouge">cov__1_0</code>, and <code class="language-plaintext highlighter-rouge">cov__1_1</code> columns.</p>
<h1 id="summary">Summary</h1>
<p>In this post, I wanted to better understand the <code class="language-plaintext highlighter-rouge">LKJcorr</code> and <code class="language-plaintext highlighter-rouge">LKJcov</code> outputs. You often wouldn’t need to go into this much detail, but it helped me gain a better understanding of using and interpreting these distributions when applied to problems with multivariate normals, including varying effects models.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last updated: Tue Apr 12 2022
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.20.0
seaborn : 0.11.1
matplotlib: 3.3.4
pymc3 : 3.11.0
scipy : 1.6.0
theano : 1.1.0
pandas : 1.2.1
numpy : 1.20.1
sys : 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:22:12)
[Clang 11.0.1 ]
arviz : 0.11.1
Watermark: 2.1.0
</code></pre></div></div>

<p>Ben Lacar</p>

<p>While continuing to deep dive on covariance priors following my prior post, I investigated implementations in pymc. I played around with the LKJcorr and LKJcov functions (named for the authors Lewandowski, Kurowicka, and Joe). There are already great explanations out there of the rationale for these distributions.</p>

<h1 id="weird-ways-that-covariance-matrices-are-made">Weird ways that covariance matrices are made</h1>

<p>2022-03-28 · <a href="https://benslack19.github.io/data%20science/statistics/cov_matrix_weirdness">https://benslack19.github.io/data%20science/statistics/cov_matrix_weirdness</a></p>

<p>Covariance priors for multivariate normal models are an important tool for the implementation of varying effects. By representing more than one parameter with a covarying structure, even more partial pooling can result than if the parameters had their own separate distribution. Before talking more about varying effects, I thought I’d write about the weird ways that covariance matrices are made.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="n">pm</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">scipy.linalg</span> <span class="k">as</span> <span class="n">linalg</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">nb_black</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
<span class="n">RANDOM_SEED</span> <span class="o">=</span> <span class="mi">8927</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">RANDOM_SEED</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"stats.hdi_prob"</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.89</span> <span class="c1"># sets default credible interval used by arviz
</span>
<span class="k">def</span> <span class="nf">standardize</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>
<p>What is a covariance matrix? One way to think of it is through an analogy: a standard deviation is to a univariate normal distribution as a covariance matrix is to a multivariate normal distribution.</p>
<p>In equation form, you could have variables that look like this:</p>
\[x \sim \text{Normal}(\mu, \sigma) \tag{univariate normal distribution}\]
\[\begin{bmatrix}x_1 \\ x_2 \\ ... \\ x_n \end{bmatrix} \sim \text{MVNormal} \left( \begin{bmatrix} \mu_1 \\ \mu_2 \\ ... \\ \mu_n \end{bmatrix} , \Sigma \right) \tag{multivariate normal distribution}\]
<p>In both cases, we have variables parameterized by random distributions. In the univariate case, a single draw from the distribution will result in one value. In the multivariate case, a single draw will result in <em>n</em> values, one for each parameter. In the multivariate normal (MVN) case, we have a vector of means ($\mu$), but the interesting relationships will result from the covariance matrix $\Sigma$. It will tell us about the variability of the parameters and also possible correlative relationships between them. This is seen in how we can construct covariance matrices.</p>
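<p>The "one draw gives <em>n</em> values" point is easy to see numerically. Here is a minimal sketch with arbitrary numbers (not yet the cafe example):</p>

```python
import numpy as np

rng = np.random.default_rng(19)

# Univariate: one draw from Normal(mu, sigma) gives one value
x = rng.normal(loc=0.0, scale=1.0)
print(x)

# Multivariate (n=2): one draw gives two values at once, and their
# relationship is governed by the covariance matrix Sigma
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, -0.35],
                  [-0.35, 0.25]])
x1, x2 = rng.multivariate_normal(mu, Sigma)
print(x1, x2)
```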
<p>Using numbers helps me understand things so let’s use Dr. McElreath’s example involving cafe waiting times. For the purposes of this post, you don’t need to know the details of the problem, but it is described in <a href="https://www.youtube.com/watch?v=yfXpjmWgyXU&list=PLDcUM9US4XdNM4Edgs7weiyIguLSToZRI&index=17&t=484s">this lecture</a>.</p>
<p>The multivariate normal distribution for this cafe waiting times example is described here:</p>
\[\begin{bmatrix}\alpha_{\text{cafe}} \\ \beta_{\text{cafe}} \end{bmatrix} \sim \text{MVNormal} \left( \begin{bmatrix}\alpha \\ \beta \end{bmatrix} , \textbf{S} \right) \tag{population of varying effects}\]
<p>We’ll create a simple 2x2 covariance matrix but the lessons can be extended to larger sizes. To construct it, we’ll need values for each parameter’s standard deviation (what I’ll call $\sigma$ below) and a correlation coefficient $\rho$. For a proper multivariate normal distribution, we’ll also need values for the means (the $\mu$ vector described above), denoted as <em>a</em> and <em>b</em>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mf">3.5</span> <span class="c1"># average morning wait time
</span><span class="n">b</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="c1"># average difference afternoon wait time
</span><span class="n">sigma_a</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="c1"># std dev in intercepts
</span><span class="n">sigma_b</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="c1"># std dev in slopes
</span><span class="n">rho</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.7</span> <span class="c1"># correlation between intercepts and slopes
</span></code></pre></div></div>
<p>While our focus is on the covariance matrix, let’s get the first term of the MVN distribution out of the way. I’ll generate the vector of the averages which is straightforward.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Mu</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Vector of means: "</span><span class="p">,</span> <span class="n">Mu</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Vector of means: [3.5, -1.0]
</code></pre></div></div>
<h1 id="intuitive-construction">Intuitive construction</h1>
<p>The first way the matrix can be made is the most intuitive to me.</p>
\[\textbf{S} = \begin{pmatrix} \sigma_{\alpha}^2 & \rho\sigma_{\alpha}\sigma_{\beta} \\ \rho\sigma_{\alpha}\sigma_{\beta} & \sigma_{\beta}^2 \end{pmatrix}\]
<p>The diagonals show each individual parameter’s variance (standard deviation squared) while the off-diagonal shows the co-variance, represented as the correlation coefficient $\rho$ multiplied by the parameters’ standard deviations.</p>
<p>I’ll use <code class="language-plaintext highlighter-rouge">Sigma1</code> with capital S to represent this covariance matrix with the <code class="language-plaintext highlighter-rouge">1</code> representing this first method of assembly but as you’ll see, they will be equivalent. (In equations like the one shown above, the covariance matrix is represented by a bold, capital S.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cov_ab</span> <span class="o">=</span> <span class="n">rho</span> <span class="o">*</span> <span class="n">sigma_a</span> <span class="o">*</span> <span class="n">sigma_b</span>
<span class="n">Sigma1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="n">sigma_a</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="n">cov_ab</span><span class="p">],</span> <span class="p">[</span><span class="n">cov_ab</span><span class="p">,</span> <span class="n">sigma_b</span><span class="o">**</span><span class="mi">2</span><span class="p">]])</span>
<span class="n">Sigma1</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1. , -0.35],
[-0.35, 0.25]])
</code></pre></div></div>
<p>The important parts are the off-diagonals, which show a negative covariance between the $\alpha$ and $\beta$ terms. They are symmetric because the calculation is the same in either direction. Hopefully there’s no confusion in how this covariance matrix resulted.</p>
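<p>One way to convince yourself that <code class="language-plaintext highlighter-rouge">Sigma1</code> behaves as advertised is to simulate from the multivariate normal it defines and check that the empirical covariance recovers it. A quick simulation sketch:</p>

```python
import numpy as np

rng = np.random.default_rng(8927)  # reusing the notebook's seed

Mu = [3.5, -1.0]
Sigma1 = np.array([[1.0, -0.35],
                   [-0.35, 0.25]])

# Draw many samples from the MVN defined by Mu and Sigma1
draws = rng.multivariate_normal(Mu, Sigma1, size=100_000)

# The empirical covariance should be close to Sigma1 ...
print(np.round(np.cov(draws.T), 3))

# ... and the empirical correlation close to rho = -0.7
print(round(float(np.corrcoef(draws.T)[0, 1]), 3))
```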
<h1 id="standard-deviation-diagonals">Standard deviation diagonals</h1>
<p>The second method for building the covariance matrix will be weirder:</p>
<ul>
<li>arrange the standard deviations along the diagonal and fill in zeros everywhere else</li>
<li>matrix multiply by a <em>correlation</em> matrix</li>
<li>matrix multiply by the same arrangement of standard deviations along the diagonal</li>
</ul>
<p>Here’s how it looks in equation form:</p>
\[\textbf{S} = \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \textbf{R} \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix}\]
<p>To create a matrix where the standard deviations are on the diagonal and zeros are everywhere else, we can use a handy <code class="language-plaintext highlighter-rouge">numpy</code> function called <code class="language-plaintext highlighter-rouge">diag</code> that can be applied to the parameter standard deviations arranged in a vector:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># put the sigmas in a vector first
</span><span class="n">sigmas</span> <span class="o">=</span> <span class="p">[</span><span class="n">sigma_a</span><span class="p">,</span> <span class="n">sigma_b</span><span class="p">]</span>
<span class="c1"># represent on the diagonal
</span><span class="n">sigma_diag</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">sigmas</span><span class="p">)</span>
<span class="n">sigma_diag</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[1. , 0. ],
[0. , 0.5]])
</code></pre></div></div>
<p>The $\textbf{R}$ matrix has <code class="language-plaintext highlighter-rouge">rho</code>, the correlation between the two parameters, in the off-diagonals. The diagonals are 1 since each parameter is always perfectly correlated with itself.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Rmat</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="n">rho</span><span class="p">],</span> <span class="p">[</span><span class="n">rho</span><span class="p">,</span> <span class="mi">1</span><span class="p">]])</span>
<span class="n">Rmat</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1. , -0.7],
[-0.7, 1. ]])
</code></pre></div></div>
<p>Now the final step is the matrix multiplication. In <code class="language-plaintext highlighter-rouge">numpy</code>, you can do this with a small chain of matrix multiplication (taken from <a href="https://stackoverflow.com/questions/11838352/multiply-several-matrices-in-numpy">this SO post</a>).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Sigma2</span> <span class="o">=</span> <span class="n">sigma_diag</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Rmat</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">sigma_diag</span><span class="p">)</span>
<span class="n">Sigma2</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1. , -0.35],
[-0.35, 0.25]])
</code></pre></div></div>
<p>As expected, we get the same values of the covariance matrix as we did with the previous method.</p>
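<p>To convince ourselves, here’s a self-contained sketch that rebuilds both versions side by side, assuming the same parameter values as above, and confirms they match:</p>

```python
import numpy as np

# parameter values assumed from earlier in the post
sigma_a, sigma_b, rho = 1.0, 0.5, -0.7

# method 1: fill in the variances and the covariance directly
cov_ab = rho * sigma_a * sigma_b
Sigma1 = np.array([[sigma_a**2, cov_ab], [cov_ab, sigma_b**2]])

# method 2: sandwich the correlation matrix between diagonal sd matrices
sigma_diag = np.diag([sigma_a, sigma_b])
Rmat = np.array([[1.0, rho], [rho, 1.0]])
Sigma2 = sigma_diag @ Rmat @ sigma_diag

print(np.allclose(Sigma1, Sigma2))  # True
```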
<h1 id="cholesky-factors">Cholesky factors</h1>
<p>Ok, now we have the third method of creating a covariance matrix. As promised, it gets even weirder. It deserves its own exploration, but I’ll just show how it works now and explain why later. The first thing we need to do is get the Cholesky factor, which can be derived from the $\textbf{R}$ correlation matrix. Other sources, like <a href="https://en.wikipedia.org/wiki/Cholesky_decomposition">the Wikipedia page</a>, explain Cholesky factors in more depth.</p>
<p>The matrix $\textbf{R}$ can be derived from this Cholesky factor with the following equation:</p>
<p>$ \textbf{R} = \textbf{LL}^\intercal $</p>
<p>Accordingly, we can substitute for $\textbf{R}$ in the equation we saw above:</p>
\[\textbf{S} = \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \textbf{LL}^\intercal \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix}\]
<p>$\textbf{L}$ is <strong>not</strong> simply the lower triangle of the correlation matrix.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># WRONG - this is not how to get L
</span><span class="n">np</span><span class="p">.</span><span class="n">tril</span><span class="p">(</span><span class="n">Rmat</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1. , 0. ],
[-0.7, 1. ]])
</code></pre></div></div>
<p>There is a different <code class="language-plaintext highlighter-rouge">numpy</code> function that calculates the lower triangle properly. (Note that <code class="language-plaintext highlighter-rouge">scipy.linalg.cholesky</code> returns the upper triangle $\textbf{U} = \textbf{L}^\intercal$ by default. With that factor, you’d compute $\textbf{R} = \textbf{U}^\intercal \textbf{U}$ instead.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># numpy.linalg.cholesky does the lower triangle
</span><span class="n">L</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">cholesky</span><span class="p">(</span><span class="n">Rmat</span><span class="p">)</span>
<span class="n">L</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1. , 0. ],
[-0.7 , 0.71414284]])
</code></pre></div></div>
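<p>We can verify that multiplying this factor by its transpose recovers $\textbf{R}$, and that <code class="language-plaintext highlighter-rouge">scipy.linalg.cholesky</code> returns the transpose of what <code class="language-plaintext highlighter-rouge">numpy</code> gives us. A quick check, assuming the same <code class="language-plaintext highlighter-rouge">rho</code> as above:</p>

```python
import numpy as np
from scipy.linalg import cholesky as scipy_cholesky

rho = -0.7  # value assumed from earlier in the post
Rmat = np.array([[1.0, rho], [rho, 1.0]])

L = np.linalg.cholesky(Rmat)  # numpy: lower-triangular factor
U = scipy_cholesky(Rmat)      # scipy default: upper-triangular factor

print(np.allclose(L @ L.T, Rmat))  # True
print(np.allclose(U.T @ U, Rmat))  # True
print(np.allclose(U, L.T))         # True: they are transposes of each other
```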
<p>In code, we can get this third re-construction of $\textbf{S}$ like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Sigma3</span> <span class="o">=</span> <span class="n">sigma_diag</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">L</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">L</span><span class="p">.</span><span class="n">T</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">sigma_diag</span><span class="p">)</span>
<span class="n">Sigma3</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1. , -0.35],
[-0.35, 0.25]])
</code></pre></div></div>
<p>As we would expect, all three ways to get a covariance matrix give equivalent results. Why would you even use this last, strange way? It will have to do with sampling efficiency in a varying effects problem. The Cholesky factors will allow us to generate non-centered parameterizations. I’ll cover this in a later post.</p>
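<p>Here’s a small preview of why the Cholesky factor is useful for sampling: it lets us turn <em>uncorrelated</em> standard-normal draws into draws with exactly the covariance structure $\textbf{S}$. A sketch, assuming the same parameter values as above:</p>

```python
import numpy as np

rng = np.random.default_rng(8927)

# parameter values assumed from earlier in the post
sigma_a, sigma_b, rho = 1.0, 0.5, -0.7
sigma_diag = np.diag([sigma_a, sigma_b])
Rmat = np.array([[1.0, rho], [rho, 1.0]])
L = np.linalg.cholesky(Rmat)

# uncorrelated standard-normal draws...
z = rng.normal(size=(2, 100_000))
# ...transformed deterministically into correlated draws with covariance S
samples = sigma_diag @ L @ z

print(np.cov(samples))  # close to [[1, -0.35], [-0.35, 0.25]]
```

This deterministic transformation of simple draws is the same idea behind non-centered parameterizations.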
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last updated: Mon Mar 28 2022
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.20.0
sys : 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:22:12)
[Clang 11.0.1 ]
matplotlib: 3.3.4
pandas : 1.2.1
pymc3 : 3.11.0
arviz : 0.11.1
scipy : 1.6.0
seaborn : 0.11.1
numpy : 1.20.1
Watermark: 2.1.0
</code></pre></div></div>Ben LacarCovariance priors for multivariate normal models are an important tool for the implementation of varying effects. By representing more than one parameter with a covarying structure, even more partial pooling can result than if the parameters had their own separate distribution. Before talking more about varying effects, I thought I’d write about the weird ways that covariance matrices are made.Escaping the Devil’s Funnel2022-03-26T00:00:00+00:002022-03-26T00:00:00+00:00https://benslack19.github.io/data%20science/statistics/devilsfunnel_cnc_param<p>Multi-level models are great for improving our estimates. However, the intuitive way these kinds of models are specified (which goes by the <a href="https://media.giphy.com/media/LS4AuDMMDZaUJJcusY/giphy.gif">unhelpful</a> name “centered” parameterization) can be <a href="https://media.giphy.com/media/AsDBIwyLjHc9G/giphy.gif">notorious</a> for producing posterior distributions that are difficult to sample using Markov chain Monte Carlo. This is because when a parameter (such as the scale variable of one distribution) depends on other parameters, the posterior can have weird shapes. This is the rationale for re-specifying the model into a <a href="https://benslack19.github.io/data%20science/statistics/diagnosing-a-model/#me-attempt-4-re-paramaterization">“non-centered” parameterization</a>.</p>
<p>One does not need a multi-level model to appreciate this concept. In the <a href="https://www.youtube.com/watch?v=n2aJYtuGu54&list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN&index=13&t=2319s">divergent transition section of Statistical Rethinking lecture 13</a>, Dr. McElreath illustrates the centered and non-centered parameterization ideas with what he calls “The Devil’s Funnel”. A funnel can be seen when plotting $\nu$ and $x$ from the following centered parameterization (figures shown below).</p>
\[\nu \sim \text{Normal}(0, \sigma)\]
\[x \sim \text{Normal}(0, \text{exp}(\nu))\]
<p>This is numerically equivalent to the non-centered form. The noteworthy trick is converting $x$ from a stochastic relationship to a deterministic one by creating a new variable $z$ that is easier to sample.</p>
\[\nu \sim \text{Normal}(0, \sigma)\]
\[z \sim \text{Normal}(0, 1)\]
\[x = z \times \text{exp}(\nu)\]
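<p>We can check this equivalence by simulation: for the same draws of $\nu$, a centered draw of $x$ and the transformed $z \times \text{exp}(\nu)$ should have the same marginal distribution. A minimal sketch with an assumed $\sigma = 0.5$:</p>

```python
import numpy as np

rng = np.random.default_rng(8927)
sigma = 0.5   # sd of v, assumed here for illustration
n = 200_000

# shared draws of v
v = rng.normal(0.0, sigma, size=n)

# centered: x drawn stochastically with scale exp(v)
x_centered = rng.normal(0.0, np.exp(v))

# non-centered: z drawn from a standard normal, x obtained deterministically
z = rng.normal(0.0, 1.0, size=n)
x_noncentered = z * np.exp(v)

# the two marginal standard deviations should agree closely
print(np.std(x_centered), np.std(x_noncentered))
```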
<p>In an online discussion forum, we shared experiences with these kinds of parameterizations around the time <a href="https://www.youtube.com/watch?v=n2aJYtuGu54&list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN&index=13">lecture 13 of Statistical Rethinking</a> was released. In the <a href="https://www.youtube.com/watch?v=n2aJYtuGu54&list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN&index=13&t=2319s">divergent transition section of the lecture</a>, I noticed that the centered parameterization produced a distribution that looked somewhat bivariate Gaussian when $\sigma$ was low, which made me think the choice of parameterization wouldn’t affect sampling efficiency in that regime. I then asked: at what value of $\sigma$ does the parameterization start to matter for sampling efficiency? Let’s find out!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># boiler plate setup code
</span><span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="n">pm</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">expit</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">logit</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">statsmodels.api</span> <span class="k">as</span> <span class="n">sm</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">nb_black</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
<span class="n">RANDOM_SEED</span> <span class="o">=</span> <span class="mi">8927</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">RANDOM_SEED</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"stats.hdi_prob"</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.89</span> <span class="c1"># sets default credible interval used by arviz
</span>
<span class="k">def</span> <span class="nf">standardize</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>
<p>The Devil’s Funnel variables were originally specified like this:</p>
\[\nu \sim \text{Normal}(0, \sigma=3)\]
\[x \sim \text{Normal}(0, \text{exp}(\nu))\]
<p>The funnel gets more extreme with higher values for the standard deviation of $\nu$. Since there is no data here, this would be like manipulating priors only. I therefore experimented with different values for the standard deviation (sigma) of $\nu$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Generate a list of sigmas for the prior nu
</span><span class="n">sigmas</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">geomspace</span><span class="p">(</span><span class="mf">0.025</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="c1"># Create dictionaries for storage
# samples for plotting
</span><span class="n">traces_C</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="n">traces_NC</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="c1"># summary results
</span><span class="n">summary_C</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="n">summary_NC</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="c1"># number of divergences
</span><span class="n">div_C</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="n">div_NC</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Look at sigma values
</span><span class="n">sigmas</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([0.025 , 0.03766575, 0.05674836, 0.0854988 , 0.12881507,
0.19407667, 0.29240177, 0.44054134, 0.66373288, 1. ])
</code></pre></div></div>
<p>The following code evaluates each sigma value and uses that to build centered and non-centered models. I’ll save the results at the end of each model run and then plot the sampling metrics down below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">sigma</span> <span class="ow">in</span> <span class="n">sigmas</span><span class="p">:</span>
<span class="c1"># Centered model
</span> <span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mC</span><span class="p">:</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"v"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"x"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">))</span>
<span class="n">trace_mC</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">tune</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">chains</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># Save results
</span> <span class="n">traces_C</span><span class="p">[</span><span class="n">sigma</span><span class="p">]</span> <span class="o">=</span> <span class="n">trace_mC</span>
<span class="n">summary_C</span><span class="p">[</span><span class="n">sigma</span><span class="p">]</span> <span class="o">=</span> <span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mC</span><span class="p">)</span>
<span class="n">div_C</span><span class="p">[</span><span class="n">sigma</span><span class="p">]</span> <span class="o">=</span> <span class="n">trace_mC</span><span class="p">[</span><span class="s">"diverging"</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span>
<span class="c1"># Non-centered model
</span> <span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mNC</span><span class="p">:</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"v"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"z"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="c1"># transformed variable
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">"x"</span><span class="p">,</span> <span class="n">z</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">))</span>
<span class="n">trace_mNC</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">tune</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">chains</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># Save results
</span> <span class="n">traces_NC</span><span class="p">[</span><span class="n">sigma</span><span class="p">]</span> <span class="o">=</span> <span class="n">trace_mNC</span>
<span class="n">summary_NC</span><span class="p">[</span><span class="n">sigma</span><span class="p">]</span> <span class="o">=</span> <span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mNC</span><span class="p">)</span>
<span class="n">div_NC</span><span class="p">[</span><span class="n">sigma</span><span class="p">]</span><span class="o">=</span> <span class="n">trace_mNC</span><span class="p">[</span><span class="s">"diverging"</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [x, v]
(Removed the rest of the pymc output to save space. Notes about divergences are explored down below.)
</code></pre></div></div>
<h1 id="exploration-of-the-joint-distribution">Exploration of the joint distribution</h1>
<p>We’ll plot the joint distribution of $x$ and $\nu$ under the centered parameterization (blue) and see how that compares to the non-centered parameterization (green). In the latter, we’re sampling $z$ from a standard Gaussian distribution and getting $x$ through a deterministic transformation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sigmas_sampled</span> <span class="o">=</span> <span class="n">sigmas</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">sigmas</span><span class="p">):</span><span class="mi">2</span><span class="p">]</span> <span class="c1"># plot every other sigma evaluated
</span>
<span class="n">f</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">),</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="c1"># top row: centered model
</span><span class="k">for</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">ax</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">,</span> <span class="n">axes</span><span class="p">.</span><span class="n">flat</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">)]):</span>
<span class="n">samples_C</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">trace_to_dataframe</span><span class="p">(</span><span class="n">traces_C</span><span class="p">[</span><span class="n">sigma</span><span class="p">])</span>
<span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">samples_C</span><span class="p">[</span><span class="s">'x'</span><span class="p">],</span> <span class="n">samples_C</span><span class="p">[</span><span class="s">'v'</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">facecolors</span><span class="o">=</span><span class="s">'none'</span><span class="p">,</span> <span class="n">edgecolors</span><span class="o">=</span><span class="s">'navy'</span><span class="p">)</span>
<span class="n">sigma_str</span> <span class="o">=</span> <span class="s">'{:.3f}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">'sigma = </span><span class="si">{</span><span class="n">sigma_str</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'x (stochastic)'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">ax</span><span class="p">.</span><span class="n">is_first_col</span><span class="p">()</span> <span class="o">&</span> <span class="n">ax</span><span class="p">.</span><span class="n">is_first_row</span><span class="p">():</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'centered</span><span class="se">\n\n</span><span class="s">v'</span><span class="p">)</span>
<span class="c1"># middle row: non-centered model, z on x-axis
</span><span class="k">for</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">ax</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">,</span> <span class="n">axes</span><span class="p">.</span><span class="n">flat</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">):</span><span class="mi">2</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">)]):</span>
<span class="n">samples_NC</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">trace_to_dataframe</span><span class="p">(</span><span class="n">traces_NC</span><span class="p">[</span><span class="n">sigma</span><span class="p">])</span>
<span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">samples_NC</span><span class="p">[</span><span class="s">'z'</span><span class="p">],</span> <span class="n">samples_NC</span><span class="p">[</span><span class="s">'v'</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">facecolors</span><span class="o">=</span><span class="s">'none'</span><span class="p">,</span> <span class="n">edgecolors</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'z (stochastic)'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">ax</span><span class="p">.</span><span class="n">is_first_col</span><span class="p">():</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'non-centered</span><span class="se">\n\n</span><span class="s">v'</span><span class="p">)</span>
<span class="c1"># bottom row: non-centered model, x on x-axis
</span><span class="k">for</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">ax</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">,</span> <span class="n">axes</span><span class="p">.</span><span class="n">flat</span><span class="p">[</span><span class="mi">2</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">):</span><span class="mi">3</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">sigmas_sampled</span><span class="p">)]):</span>
<span class="n">samples_NC</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">trace_to_dataframe</span><span class="p">(</span><span class="n">traces_NC</span><span class="p">[</span><span class="n">sigma</span><span class="p">])</span>
<span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">samples_NC</span><span class="p">[</span><span class="s">'x'</span><span class="p">],</span> <span class="n">samples_NC</span><span class="p">[</span><span class="s">'v'</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">facecolors</span><span class="o">=</span><span class="s">'none'</span><span class="p">,</span> <span class="n">edgecolors</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'x (deterministic)'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">ax</span><span class="p">.</span><span class="n">is_first_col</span><span class="p">()</span> <span class="o">&</span> <span class="n">ax</span><span class="p">.</span><span class="n">is_last_row</span><span class="p">():</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'non-centered</span><span class="se">\n\n</span><span class="s">v'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><ipython-input-21-5993b7c4c1c1>:34: UserWarning: This figure was using constrained_layout==True, but that is incompatible with subplots_adjust and or tight_layout: setting constrained_layout==False.
plt.tight_layout()
</code></pre></div></div>
<p><img src="/assets/2022-03-26-devilsfunnel_cnc_param_files/2022-03-26-devilsfunnel_cnc_param_8_1.png" alt="png" /></p>
<p>At the top is the centered target distribution with increasing values of $\sigma$. We can see that the Devil’s Funnel begins to form as $\sigma$ exceeds 0.2. However, the middle row looks like samples from a plain old bivariate Gaussian distribution and doesn’t change. That’s because we’ve defined it to <em>not</em> change: it will always be $z \sim \text{Normal}(0,1)$ regardless of $\sigma$. The last row shows that we can get $x$ and our target distribution back with a deterministic transformation.</p>
<p>Let’s look at other metrics to inform us about sampling efficiency.</p>
<h1 id="number-of-divergences">Number of divergences</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">div_C</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span> <span class="n">div_C</span><span class="p">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">div_C</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span> <span class="n">div_C</span><span class="p">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Centered'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">div_NC</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span> <span class="n">div_NC</span><span class="p">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">div_NC</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span> <span class="n">div_NC</span><span class="p">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Non-centered'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'Number of divergences'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Number of divergences</span><span class="se">\n</span><span class="s">Centered vs Non-centered'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.legend.Legend at 0x7fd419834520>
</code></pre></div></div>
<p><img src="/assets/2022-03-26-devilsfunnel_cnc_param_files/2022-03-26-devilsfunnel_cnc_param_11_1.png" alt="png" /></p>
<p>We don’t see any divergences at all in the non-centered parameterization, regardless of $\sigma$. We can get away with the centered parameterization only at low values of $\sigma$, as the bivariate plots suggested.</p>
<h1 id="number-of-effective-samples-and-r-hat">Number of effective samples and R-hat</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Put the summary results in one table to facilitate plotting
</span><span class="n">df_summary_C</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">summary_C</span><span class="p">[</span><span class="n">sigma</span><span class="p">]).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'index'</span><span class="p">:</span> <span class="s">'var'</span><span class="p">})</span> <span class="k">for</span> <span class="n">sigma</span> <span class="ow">in</span> <span class="n">sigmas</span><span class="p">).</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df_summary_C</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">sigmas</span><span class="p">)</span><span class="o">*</span><span class="mi">2</span><span class="p">)</span>
<span class="n">df_summary_NC</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">summary_NC</span><span class="p">[</span><span class="n">sigma</span><span class="p">]).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'index'</span><span class="p">:</span> <span class="s">'var'</span><span class="p">})</span> <span class="k">for</span> <span class="n">sigma</span> <span class="ow">in</span> <span class="n">sigmas</span><span class="p">).</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df_summary_NC</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">sigmas</span><span class="p">)</span><span class="o">*</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="p">((</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">),</span> <span class="p">(</span><span class="n">ax3</span><span class="p">,</span> <span class="n">ax4</span><span class="p">))</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="c1"># Top row (v) ---------
</span>
<span class="c1"># plot centered ESS
</span><span class="n">df_centered_v</span> <span class="o">=</span> <span class="n">df_summary_C</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df_summary_C</span><span class="p">[</span><span class="s">'var'</span><span class="p">]</span><span class="o">==</span><span class="s">'v'</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_centered_v</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">],</span> <span class="n">df_centered_v</span><span class="p">[</span><span class="s">'ess_mean'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Centered'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_centered_v</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">],</span> <span class="n">df_centered_v</span><span class="p">[</span><span class="s">'r_hat'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Centered'</span><span class="p">)</span>
<span class="c1"># plot non-centered ESS
</span><span class="n">df_noncentered_v</span> <span class="o">=</span> <span class="n">df_summary_NC</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df_summary_NC</span><span class="p">[</span><span class="s">'var'</span><span class="p">]</span><span class="o">==</span><span class="s">'v'</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_noncentered_v</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">],</span> <span class="n">df_noncentered_v</span><span class="p">[</span><span class="s">'ess_mean'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Non-centered'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_noncentered_v</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">],</span> <span class="n">df_noncentered_v</span><span class="p">[</span><span class="s">'r_hat'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Non-centered'</span><span class="p">)</span>
<span class="c1"># plot decorations
</span><span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'ESS'</span><span class="p">,</span> <span class="n">xscale</span><span class="o">=</span><span class="s">'linear'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Effective sample size for v'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax2</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'R-hat'</span><span class="p">,</span> <span class="n">xscale</span><span class="o">=</span><span class="s">'linear'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'R-hat for v'</span><span class="p">)</span>
<span class="c1"># Bottom row (x) ---------
</span>
<span class="c1"># plot centered ESS
</span><span class="n">df_centered_x</span> <span class="o">=</span> <span class="n">df_summary_C</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df_summary_C</span><span class="p">[</span><span class="s">'var'</span><span class="p">]</span><span class="o">==</span><span class="s">'x'</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_centered_x</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">],</span> <span class="n">df_centered_x</span><span class="p">[</span><span class="s">'ess_mean'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Centered'</span><span class="p">)</span>
<span class="n">ax4</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_centered_x</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">],</span> <span class="n">df_centered_x</span><span class="p">[</span><span class="s">'r_hat'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Centered'</span><span class="p">)</span>
<span class="c1"># plot non-centered ESS
</span><span class="n">df_noncentered_x</span> <span class="o">=</span> <span class="n">df_summary_NC</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df_summary_NC</span><span class="p">[</span><span class="s">'var'</span><span class="p">]</span><span class="o">==</span><span class="s">'x'</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_noncentered_x</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">],</span> <span class="n">df_noncentered_x</span><span class="p">[</span><span class="s">'ess_mean'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Non-centered'</span><span class="p">)</span>
<span class="n">ax4</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">df_noncentered_x</span><span class="p">[</span><span class="s">'sigma'</span><span class="p">],</span> <span class="n">df_noncentered_x</span><span class="p">[</span><span class="s">'r_hat'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'darkgreen'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Non-centered'</span><span class="p">)</span>
<span class="c1"># plot decorations
</span><span class="n">ax3</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax3</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'ESS'</span><span class="p">,</span> <span class="n">xscale</span><span class="o">=</span><span class="s">'linear'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Effective sample size for x'</span><span class="p">)</span>
<span class="n">ax4</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax4</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'R-hat'</span><span class="p">,</span> <span class="n">xscale</span><span class="o">=</span><span class="s">'linear'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'R-hat for x'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Text(0.5, 0, 'sigma'),
Text(0, 0.5, 'R-hat'),
None,
Text(0.5, 1.0, 'R-hat for x')]
</code></pre></div></div>
<p><img src="/assets/2022-03-26-devilsfunnel_cnc_param_files/2022-03-26-devilsfunnel_cnc_param_16_1.png" alt="png" /></p>
<h1 id="conclusion">Conclusion</h1>
<p>Looking at the number of divergences, effective sample size, and R-hat, small values of $\sigma$ give good sampling in both the centered and non-centered forms of the Devil’s Funnel equations. However, once $\sigma$ is between 0.2 and 0.4, all three metrics clearly favor the non-centered form.</p>
<h1 id="appendix-environment-and-system-parameters">Appendix: Environment and system parameters</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last updated: Sat Mar 26 2022
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.20.0
pymc3 : 3.11.0
arviz : 0.11.1
statsmodels: 0.12.2
numpy : 1.20.1
sys : 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:22:12)
[Clang 11.0.1 ]
matplotlib : 3.3.4
pandas : 1.2.1
scipy : 1.6.0
seaborn : 0.11.1
Watermark: 2.1.0
</code></pre></div></div>Ben LacarMulti-level models are great for improving our estimates. However, the intuitive way these kinds of models are specified (which goes by the unhelpful name “centered” parameterization) can be notorious for producing posterior distributions that are difficult to sample using Markov chain Monte Carlo. This is because when a parameter (such as the scale variable of one distribution) depends on other parameters, the posterior can have weird shapes. This is the rationale for re-specifying the model into a “non-centered” parameterization.Correlated data, different DAGs2022-02-03T00:00:00+00:002022-02-03T00:00:00+00:00https://benslack19.github.io/data%20science/statistics/stats_rethinking_corr_diffDAGs<p>One of the lessons from <a href="https://xcelab.net/rm/statistical-rethinking/">Statistical Rethinking</a> that really hit home for me was the importance of considering the data generation process. Different datasets can show similar patterns, but the data generation can be different. I’ll illustrate this below, showing how correlated data can arise from these varying processes.</p>
<p>As an homage to someone who’s recently been in the news … <a href="https://media.giphy.com/media/26FL5WoWnkl0xvXnG/giphy.gif">LFG!</a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">daft</span>
<span class="kn">from</span> <span class="nn">causalgraphicalmodels</span> <span class="kn">import</span> <span class="n">CausalGraphicalModel</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">load_ext</span> <span class="n">nb_black</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
<span class="n">RANDOM_SEED</span> <span class="o">=</span> <span class="mi">8927</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">RANDOM_SEED</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"stats.hdi_prob"</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.89</span> <span class="c1"># sets default credible interval used by arviz
</span></code></pre></div></div>
<p>We’ll examine three different directed acyclic graphs (DAGs), each relating only three variables, <a href="https://media.giphy.com/media/3oEduXdm2gjnrsJBOo/giphy.gif">named</a> X, Y, and Z. In each one, we’ll have 100 datapoints, <code class="language-plaintext highlighter-rouge">X</code> will be our “predictor” variable, and <code class="language-plaintext highlighter-rouge">Y</code> will always be the “outcome” variable. <code class="language-plaintext highlighter-rouge">Z</code> will be the wild card, moving around so we can see what effects it has on the relationship between <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">Y</code>.</p>
<h1 id="pipe">Pipe</h1>
<p>The first will be a mediator, AKA a pipe. Here <code class="language-plaintext highlighter-rouge">Z</code> passes on information from <code class="language-plaintext highlighter-rouge">X</code> to <code class="language-plaintext highlighter-rouge">Y</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dag</span> <span class="o">=</span> <span class="n">CausalGraphicalModel</span><span class="p">(</span>
<span class="n">nodes</span><span class="o">=</span><span class="p">[</span><span class="s">"X"</span><span class="p">,</span> <span class="s">"Y"</span><span class="p">,</span> <span class="s">"Z"</span><span class="p">],</span>
<span class="n">edges</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s">"X"</span><span class="p">,</span> <span class="s">"Z"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"Z"</span><span class="p">,</span> <span class="s">"Y"</span><span class="p">),</span>
<span class="p">],</span>
<span class="p">)</span>
<span class="n">pgm</span> <span class="o">=</span> <span class="n">daft</span><span class="p">.</span><span class="n">PGM</span><span class="p">()</span>
<span class="n">coordinates</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"X"</span><span class="p">:</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
<span class="s">"Z"</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
<span class="s">"Y"</span><span class="p">:</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">dag</span><span class="p">.</span><span class="n">dag</span><span class="p">.</span><span class="n">nodes</span><span class="p">:</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">add_node</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">node</span><span class="p">,</span> <span class="o">*</span><span class="n">coordinates</span><span class="p">[</span><span class="n">node</span><span class="p">])</span>
<span class="k">for</span> <span class="n">edge</span> <span class="ow">in</span> <span class="n">dag</span><span class="p">.</span><span class="n">dag</span><span class="p">.</span><span class="n">edges</span><span class="p">:</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">add_edge</span><span class="p">(</span><span class="o">*</span><span class="n">edge</span><span class="p">)</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">render</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.axes._axes.Axes at 0x7fe709497550>
</code></pre></div></div>
<p><img src="/assets/2022-02-03-stats_rethinking_corr_diffDAGs_files/2022-02-03-stats_rethinking_corr_diffDAGs_4_1.png" alt="png" /></p>
<p>When I say “data generation”, I literally mean that. We can simulate the relationships between all variables. Since X is a starting point, it is the easiest to get. We’ll simply take 100 random samples from a normal distribution with mean 0 and standard deviation 1 (represented in <code class="language-plaintext highlighter-rouge">stats.norm.rvs</code> as <code class="language-plaintext highlighter-rouge">loc</code> and <code class="language-plaintext highlighter-rouge">scale</code>, respectively). Then to generate Z, we’ll take another set of random samples, where the mean of this new normal distribution is the product of a coefficient <code class="language-plaintext highlighter-rouge">bXZ</code> and <code class="language-plaintext highlighter-rouge">X</code>. We draw Z from a normal distribution because a simple multiplication would give perfectly correlated data, which is not what we’re trying to represent. We then do something similar with Y, except this time the mean is the product of <code class="language-plaintext highlighter-rouge">bZY</code> and <code class="language-plaintext highlighter-rouge">Z</code>.</p>
<p>A few quick notes: the values I assign to the coefficients are arbitrary, and I’m appending “p” to the variable names, like <code class="language-plaintext highlighter-rouge">Xp</code>, to mark the variables belonging to this pipe DAG.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span> <span class="o">=</span> <span class="mi">100</span>
<span class="c1"># pipe X > Z > Y
</span><span class="n">bXZ</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">bZY</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">Xp</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">Zp</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">bXZ</span><span class="o">*</span><span class="n">Xp</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">Yp</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">bZY</span><span class="o">*</span><span class="n">Zp</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
</code></pre></div></div>
<p>We’ll plot the relationship of <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">Y</code> from this pipe at the end, after we generate all data examples.</p>
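<p>Before plotting, a quick numeric sanity check (an aside I’m adding, not part of the original walkthrough): under the pipe, X and Y should already be positively correlated even though X never influences Y directly. With both coefficients equal to 1 and unit noise at each step, Y = X + e1 + e2, so the theoretical correlation is 1/√3 ≈ 0.58.</p>

```python
# Sanity check (an aside, not in the original post): regenerate the pipe
# data and compute the X-Y correlation. With bXZ = bZY = 1 and unit noise
# at each step, corr(X, Y) should be near 1/sqrt(3) ~ 0.58.
import numpy as np
from scipy import stats

np.random.seed(8927)  # same seed the notebook sets
n = 100
bXZ, bZY = 1, 1
Xp = stats.norm.rvs(loc=0, scale=1, size=n)
Zp = stats.norm.rvs(loc=bXZ * Xp, scale=1, size=n)
Yp = stats.norm.rvs(loc=bZY * Zp, scale=1, size=n)

r = np.corrcoef(Xp, Yp)[0, 1]
print(round(r, 2))  # should land in the rough vicinity of 0.58
```

<p>The exact value wobbles with the sample, but the sign and rough magnitude are set by the coefficients, not by any direct X → Y arrow.</p>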
<h1 id="fork">Fork</h1>
<p>Let’s look at the second DAG, which is a fork. <code class="language-plaintext highlighter-rouge">Z</code> influences both <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">Y</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dag</span> <span class="o">=</span> <span class="n">CausalGraphicalModel</span><span class="p">(</span>
<span class="n">nodes</span><span class="o">=</span><span class="p">[</span><span class="s">"X"</span><span class="p">,</span> <span class="s">"Y"</span><span class="p">,</span> <span class="s">"Z"</span><span class="p">],</span>
<span class="n">edges</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s">"Z"</span><span class="p">,</span> <span class="s">"X"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"Z"</span><span class="p">,</span> <span class="s">"Y"</span><span class="p">),</span>
<span class="p">],</span>
<span class="p">)</span>
<span class="n">pgm</span> <span class="o">=</span> <span class="n">daft</span><span class="p">.</span><span class="n">PGM</span><span class="p">()</span>
<span class="n">coordinates</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"X"</span><span class="p">:</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
<span class="s">"Z"</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="s">"Y"</span><span class="p">:</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">dag</span><span class="p">.</span><span class="n">dag</span><span class="p">.</span><span class="n">nodes</span><span class="p">:</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">add_node</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">node</span><span class="p">,</span> <span class="o">*</span><span class="n">coordinates</span><span class="p">[</span><span class="n">node</span><span class="p">])</span>
<span class="k">for</span> <span class="n">edge</span> <span class="ow">in</span> <span class="n">dag</span><span class="p">.</span><span class="n">dag</span><span class="p">.</span><span class="n">edges</span><span class="p">:</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">add_edge</span><span class="p">(</span><span class="o">*</span><span class="n">edge</span><span class="p">)</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">render</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.axes._axes.Axes at 0x7fe6c9853eb0>
/Users/blacar/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/IPython/core/pylabtools.py:132: UserWarning: Calling figure.constrained_layout, but figure not setup to do constrained layout. You either called GridSpec without the fig keyword, you are using plt.subplot, or you need to call figure or subplots with the constrained_layout=True kwarg.
fig.canvas.print_figure(bytes_io, **kw)
</code></pre></div></div>
<p><img src="/assets/2022-02-03-stats_rethinking_corr_diffDAGs_files/2022-02-03-stats_rethinking_corr_diffDAGs_8_2.png" alt="png" /></p>
<p>The code will look similar to the pipe’s, but the relationships in the data generation process will reflect the DAG depicting this fork. You can see how <code class="language-plaintext highlighter-rouge">Z</code> influences both the predictor <code class="language-plaintext highlighter-rouge">X</code> and the outcome <code class="language-plaintext highlighter-rouge">Y</code>; <code class="language-plaintext highlighter-rouge">Z</code> is a confound of this relationship.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fork X < Z > Y
</span><span class="n">bZX</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">bZY</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">Zf</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">Xf</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">bZX</span><span class="o">*</span><span class="n">Zf</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">Yf</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">bZY</span><span class="o">*</span><span class="n">Zf</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="collider">Collider</h1>
<p>Now let’s move on to the trickiest DAG, the collider. Here, our predictor and outcome variables are no longer the consequences of <code class="language-plaintext highlighter-rouge">Z</code>, but are instead the <em>causes</em> of <code class="language-plaintext highlighter-rouge">Z</code>. They both influence and “collide” on Z.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dag</span> <span class="o">=</span> <span class="n">CausalGraphicalModel</span><span class="p">(</span>
<span class="n">nodes</span><span class="o">=</span><span class="p">[</span><span class="s">"X"</span><span class="p">,</span> <span class="s">"Y"</span><span class="p">,</span> <span class="s">"Z"</span><span class="p">],</span>
<span class="n">edges</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s">"X"</span><span class="p">,</span> <span class="s">"Z"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"Y"</span><span class="p">,</span> <span class="s">"Z"</span><span class="p">),</span>
<span class="p">],</span>
<span class="p">)</span>
<span class="n">pgm</span> <span class="o">=</span> <span class="n">daft</span><span class="p">.</span><span class="n">PGM</span><span class="p">()</span>
<span class="n">coordinates</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"X"</span><span class="p">:</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="s">"Z"</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
<span class="s">"Y"</span><span class="p">:</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">dag</span><span class="p">.</span><span class="n">dag</span><span class="p">.</span><span class="n">nodes</span><span class="p">:</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">add_node</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">node</span><span class="p">,</span> <span class="o">*</span><span class="n">coordinates</span><span class="p">[</span><span class="n">node</span><span class="p">])</span>
<span class="k">for</span> <span class="n">edge</span> <span class="ow">in</span> <span class="n">dag</span><span class="p">.</span><span class="n">dag</span><span class="p">.</span><span class="n">edges</span><span class="p">:</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">add_edge</span><span class="p">(</span><span class="o">*</span><span class="n">edge</span><span class="p">)</span>
<span class="n">pgm</span><span class="p">.</span><span class="n">render</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.axes._axes.Axes at 0x7fe6681ed070>
</code></pre></div></div>
<p><img src="/assets/2022-02-03-stats_rethinking_corr_diffDAGs_files/2022-02-03-stats_rethinking_corr_diffDAGs_12_1.png" alt="png" /></p>
<p>To get correlated data with the collider, we’ll have to do some non-intuitive things that are still faithfully represented on this DAG. I’ll show the code first, then explain it afterward.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># collider X > Z < Y
</span><span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">expit</span>
<span class="n">n</span><span class="o">=</span><span class="mi">200</span>
<span class="n">bXZ</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">bYZ</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">Xc</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mf">1.5</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">Yc</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mf">1.5</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">Zc</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">bernoulli</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">expit</span><span class="p">(</span><span class="n">bXZ</span><span class="o">*</span><span class="n">Xc</span> <span class="o">-</span> <span class="n">bYZ</span><span class="o">*</span><span class="n">Yc</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">df_c</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"Xc"</span><span class="p">:</span><span class="n">Xc</span><span class="p">,</span> <span class="s">"Yc"</span><span class="p">:</span><span class="n">Yc</span><span class="p">,</span> <span class="s">"Zc"</span><span class="p">:</span><span class="n">Zc</span><span class="p">,</span> <span class="s">"Znorm"</span><span class="p">:</span><span class="n">bXZ</span><span class="o">*</span><span class="n">Xc</span> <span class="o">-</span> <span class="n">bYZ</span><span class="o">*</span><span class="n">Yc</span><span class="p">})</span>
<span class="n">df_c0</span> <span class="o">=</span> <span class="n">df_c</span><span class="p">[</span><span class="n">df_c</span><span class="p">[</span><span class="s">'Zc'</span><span class="p">]</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span>
<span class="n">df_c1</span> <span class="o">=</span> <span class="n">df_c</span><span class="p">[</span><span class="n">df_c</span><span class="p">[</span><span class="s">'Zc'</span><span class="p">]</span><span class="o">==</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<p>What are we doing in the code?</p>
<ol>
<li>We’re going to make <code class="language-plaintext highlighter-rouge">Z</code> take on a value of 0 or 1. It is essentially a categorical variable. This is to aid in visualization.</li>
<li>To ensure <code class="language-plaintext highlighter-rouge">Z</code> acts as a collider, we still have to represent the causal influences of <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">Y</code>. Z is generated by first taking the difference <code class="language-plaintext highlighter-rouge">bXZ*Xc - bYZ*Yc</code> and then applying the expit (AKA logistic sigmoid) function. We take the difference to get a positively correlated relationship that mimics the pipe and fork examples, and it is still a faithful representation of the DAG.</li>
<li>We’re going to show only one subset of the data, those where <code class="language-plaintext highlighter-rouge">Z</code> equals 1, again to aid in visualization.</li>
<li>Since we’re subsetting the data, we double the number of observations.</li>
</ol>
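<p>To make the expit step concrete, here is a minimal sketch (using my own illustrative values, not numbers from the post) showing that expit squashes an unbounded score like <code class="language-plaintext highlighter-rouge">bXZ*Xc - bYZ*Yc</code> into the (0, 1) range, which is what lets it serve as a Bernoulli probability for Z:</p>

```python
# Minimal sketch (illustrative values of my own, not from the post) of how
# expit maps the unbounded score bXZ*Xc - bYZ*Yc into (0, 1) so it can
# serve as the Bernoulli probability for Z.
import numpy as np
from scipy import stats
from scipy.special import expit

scores = np.array([-4.0, 0.0, 4.0])  # stand-ins for bXZ*Xc - bYZ*Yc
probs = expit(scores)                # logistic sigmoid: 1 / (1 + exp(-score))
print(probs.round(3))                # probabilities 0.018, 0.5, 0.982

# Very negative scores make Z=0 likely; very positive scores make Z=1 likely
np.random.seed(0)
draws = stats.bernoulli.rvs(probs)
print(draws)
```

<p>So datapoints where X is large and Y is small tend to get Z = 1, which is exactly the selection that distorts the X–Y relationship when we look within one value of Z.</p>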
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_c</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Xc</th>
<th>Yc</th>
<th>Zc</th>
<th>Znorm</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1.517475</td>
<td>-0.412633</td>
<td>1</td>
<td>3.860216</td>
</tr>
<tr>
<th>1</th>
<td>-2.698825</td>
<td>0.746912</td>
<td>0</td>
<td>-6.891475</td>
</tr>
<tr>
<th>2</th>
<td>2.503817</td>
<td>0.868102</td>
<td>1</td>
<td>3.271430</td>
</tr>
<tr>
<th>3</th>
<td>-0.241992</td>
<td>-0.795235</td>
<td>1</td>
<td>1.106486</td>
</tr>
<tr>
<th>4</th>
<td>0.182652</td>
<td>-2.039822</td>
<td>1</td>
<td>4.444948</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_c</span><span class="p">[</span><span class="s">'Znorm'</span><span class="p">],</span> <span class="n">df_c</span><span class="p">[</span><span class="s">'Zc'</span><span class="p">])</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'bXZ*Xc - bYZ*Yc'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'Z'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'relationship before (x-axis) and after (y-axis) expit'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Text(0.5, 0, 'bXZ*Xc - bYZ*Yc'),
Text(0, 0.5, 'Z'),
Text(0.5, 1.0, 'relationship before (x-axis) and after (y-axis) expit')]
</code></pre></div></div>
<p><img src="/assets/2022-02-03-stats_rethinking_corr_diffDAGs_files/2022-02-03-stats_rethinking_corr_diffDAGs_17_1.png" alt="png" /></p>
<p>OK, now let’s plot the data!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">,</span> <span class="n">ax3</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">Xp</span><span class="p">,</span> <span class="n">Yp</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'X'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'Y'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'pipe, X > Z > Y'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">Xf</span><span class="p">,</span> <span class="n">Yf</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'X'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'Y'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'fork, X < Z > Y'</span><span class="p">)</span>
<span class="c1">#ax3.scatter(df_c0['Xc'], df_c0['Yc'], color='gray', label='Z=0')
</span><span class="n">ax3</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_c1</span><span class="p">[</span><span class="s">'Xc'</span><span class="p">],</span> <span class="n">df_c1</span><span class="p">[</span><span class="s">'Yc'</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Z=1'</span><span class="p">)</span>
<span class="n">ax3</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">'X'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'Y'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'collider, X > Z < Y,</span><span class="se">\n</span><span class="s">(inverse logit(bXZ*Xc - bYZ*Yc)'</span><span class="p">)</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.legend.Legend at 0x7fe6c142c4f0>
</code></pre></div></div>
<p><img src="/assets/2022-02-03-stats_rethinking_corr_diffDAGs_files/2022-02-03-stats_rethinking_corr_diffDAGs_19_1.png" alt="png" /></p>
<p>As you can see, each DAG produces a positively correlated relationship between X and Y, despite the DAGs being different and our working with only three variables! This drives home the point that we can’t look only at our data to infer the proper causal relationships. A data-generating model is key to knowing how (and whether) to stratify our statistical models.</p>
<p>Appendix: Environment and system parameters</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last updated: Thu Feb 03 2022
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.20.0
seaborn : 0.11.1
numpy : 1.20.1
pandas : 1.2.1
arviz : 0.11.1
daft : 0.1.0
sys : 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:22:12)
[Clang 11.0.1 ]
pymc3 : 3.11.0
scipy : 1.6.0
matplotlib : 3.3.4
statsmodels: 0.12.2
Watermark: 2.1.0
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
</code></pre></div></div>Ben LacarOne of the lessons from Statistical Rethinking that really hit home for me was the importance of considering the data generation process. Different datasets can show similar patterns, but the data generation can be different. I’ll illustrate this below, showing how correlated data can arise from these varying processes.Running models forwards and backwards2022-01-03T00:00:00+00:002022-01-03T00:00:00+00:00https://benslack19.github.io/data%20science/statistics/stats_rethinking_ch03_sim4blog_part1<p>The value of simulations is highighted by Dr. McElreath throughout his textbook and by <a href="nature.com/articles/s43586-020-00001-2.pdf">van de Schoot and colleagues</a>. I didn’t entirely appreciate its value until I implemented this myself. I started by reviewing some materials, then I went down a rabbit hole, where one question naturally branched into other questions. I’ll talk about simulations in multiple posts and how model running <a href="https://media.giphy.com/media/2bUpP71bbVnZ3x7lgQ/giphy.gif">forwards</a> and <a href="https://media.giphy.com/media/g08eUPeabWkAOh6n4Q/giphy.gif">backwards</a> can help with understanding.</p>
<p>We’ll start simple, with a weighted coin example. Then in later posts, we’ll move into simple linear regression, then multiple linear regression and thinking from a causal perspective. As most of my recent posts have been inspired by my learning of Bayesian inference, much credit goes to Dr. McElreath’s Statistical Rethinking, its associated <a href="https://github.com/pymc-devs/resources/tree/master/Rethinking_2"><code class="language-plaintext highlighter-rouge">pymc</code> repo</a>, and friends on the Discord server. Colleagues at UCSF who are experts in simulations and causal models have also been a great source.</p>
<p>Let’s get started using our weighted coin example!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="n">pm</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">statsmodels.api</span> <span class="k">as</span> <span class="n">sm</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="n">sns</span><span class="p">.</span><span class="n">set_context</span><span class="p">(</span><span class="s">"talk"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">load_ext</span> <span class="n">nb_black</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
<span class="n">RANDOM_SEED</span> <span class="o">=</span> <span class="mi">8927</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">RANDOM_SEED</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"stats.hdi_prob"</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.89</span> <span class="c1"># sets default credible interval used by arviz
</span></code></pre></div></div>
<h1 id="running-models-forwards-and-backwards">Running models forwards and backwards</h1>
<p>We can input parameters to generate data. Or we can start from data and come up with a model to infer a parameter. When most people do statistics, they’re doing the latter. But running a model in both directions can help us understand things more deeply.</p>
<p><img src="/assets/2022-01-03-stats_rethinking_ch03_sim4blog_part1_files/model_running-01.png" alt="png" /></p>
<p>Let’s pretend we have a weighted coin. (We’ll use this in place of McElreath’s globe tossing example.) In many problems, we’re asked to deduce the true proportion that the coin comes up heads (or tails). This would be <strong>parameter estimation</strong> or what McElreath calls “running a model backwards”. A parameter is typically <em>unobserved</em> in contrast to data, which we can observe. We can count the number of heads after a known number of tosses. But for simulation (running a model forwards), we input a <em>known parameter</em> (how weighted the coin is) and generate data.</p>
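<p>As a minimal sketch of the two directions, consider a coin weighted to come up heads 70% of the time. The quick frequentist point estimate below is just a stand-in for the “backwards” step; the rest of this post does it properly with a posterior distribution. (The seed and trial count here are arbitrary choices for illustration.)</p>

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # arbitrary seed for reproducibility

# Forwards: a known parameter (p = 0.7) generates data
data = stats.binom.rvs(n=2, p=0.7, size=1000)  # 1000 trials of 2 tosses each

# Backwards: the data are used to estimate the unobserved parameter
p_hat = data.sum() / (2 * len(data))  # fraction of all tosses that came up heads
print(p_hat)
```

<p>With enough trials, the estimate lands close to the known parameter that generated the data.</p>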
<p>In our simulation, let’s say that the true proportion the coin comes up heads is 0.7. Since each coin toss is independent of the others and has the same probability of heads, the conditions for a binomial distribution are met. We’ll illustrate simulations using <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html"><code class="language-plaintext highlighter-rouge">scipy.stats</code></a> and inference using <code class="language-plaintext highlighter-rouge">pymc</code>. I’ll also show how to do a simulation using <code class="language-plaintext highlighter-rouge">pymc</code> alone, but I think the <code class="language-plaintext highlighter-rouge">scipy.stats</code> module offers more flexibility.</p>
<h1 id="running-forwards-simulation-to-get-data">Running forwards: simulation to get data</h1>
<p>Let’s imagine we do two coin flips. The possibilities after two tosses ($n$=2) are to observe 0 heads (H), 1 H, or 2 H. The number of observed heads is assigned $k$. Formally, the model equation with the known parameters will look like this.</p>
\[H \sim \text{Binomial}(n=2, p=0.7)\]
<p>It can be read as “the number of heads we will observe is binomially distributed after two tosses of a coin, with true known proportion of heads of 0.7”. We can get the probability mass function (PMF) for this distribution like this.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># k = 0, 1, or 2 heads
</span><span class="n">stats</span><span class="p">.</span><span class="n">binom</span><span class="p">.</span><span class="n">pmf</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">),</span> <span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([0.09, 0.42, 0.49])
</code></pre></div></div>
<p>This means that, after two coin flips, we expect to observe 0 heads with 9% probability, 1 head with 42% probability, and 2 heads with 49% probability. These three possibilities sum to 100%. We can’t observe more heads ($k$) than the number of tosses ($n$). This is made clear if we try to input a value of $k$ that is greater than $n$. (This time we’ll construct the list of $k$ values explicitly.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stats</span><span class="p">.</span><span class="n">binom</span><span class="p">.</span><span class="n">pmf</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([0.09, 0.42, 0.49, 0. ])
</code></pre></div></div>
<p>We don’t get an error; we simply get a probability of zero when $k > n$. Now let’s simulate.</p>
<p>We can use <code class="language-plaintext highlighter-rouge">stats.binom.rvs</code> to input parameters and generate data. Let’s do multiple trials, which is parameterized by <code class="language-plaintext highlighter-rouge">size</code>. To be clear, each trial means we’re doing two tosses and recording the number of heads in that trial. We’ll repeat this until we have 10 total trials.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Use rvs to make dummy observations
</span><span class="n">stats</span><span class="p">.</span><span class="n">binom</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([2, 2, 1, 1, 1, 2, 2, 2, 2, 1])
</code></pre></div></div>
<p>If you keep executing this cell, you’ll get a new set of values for observed H. We can also generate a high number of trials by simply making <code class="language-plaintext highlighter-rouge">size</code> large. We’ll do 100,000 trials.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dummy_h</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">binom</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">dummy_h</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([2, 2, 1, ..., 1, 1, 2])
</code></pre></div></div>
<p>And from here, we can see how well the proportion of samples for each number of heads generated by the simulation matches the proportions determined analytically (using <code class="language-plaintext highlighter-rouge">stats.binom.pmf</code>).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Wrap the list in a series so I can use `value_counts`
</span><span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">dummy_h</span><span class="p">).</span><span class="n">value_counts</span><span class="p">()</span> <span class="o">/</span> <span class="mi">10</span> <span class="o">**</span> <span class="mi">5</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2 0.48837
1 0.42089
0 0.09074
dtype: float64
</code></pre></div></div>
<p>The numbers are pretty close to the PMF we calculated above.</p>
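<p>To make that comparison explicit, we can line the analytic PMF up against the simulated proportions. (The seed below is an arbitrary choice, and the data are regenerated so the block stands on its own, standing in for <code class="language-plaintext highlighter-rouge">dummy_h</code> above.)</p>

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(19)  # arbitrary seed

# Regenerate the simulated trials: 100,000 trials of 2 tosses each
dummy_h = stats.binom.rvs(n=2, p=0.7, size=10**5)

# Analytic probabilities vs. empirical proportions for k = 0, 1, 2 heads
comparison = pd.DataFrame({
    "analytic": stats.binom.pmf(k=range(3), n=2, p=0.7),
    "simulated": pd.Series(dummy_h).value_counts(normalize=True).sort_index(),
})
print(comparison)
```

<p>With this many trials, the two columns agree to within roughly the third decimal place.</p>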
<h1 id="running-backwards-inference-to-estimate-the-parameter">Running backwards: inference to estimate the parameter</h1>
<p>Now let’s run the model backwards with <code class="language-plaintext highlighter-rouge">pymc</code> starting with the data that we generated from running the model forward. This might seem like a silly exercise since the purpose of inference is to estimate an unknown parameter. But the point of this exercise is to see how the two directions of model running are connected.</p>
<p>Here is our model equation.</p>
<p>\(H \sim \text{Binomial}(n=2, p)\)
<br />
\(p \sim \text{Beta}(\alpha=2, \beta=2)\)</p>
<p>We can read this as “the number of heads we will observe is binomially distributed after two tosses, with some unknown parameter $p$.” We have to give $p$ some plausible prior distribution. Since $p$ should be between 0 and 1, a beta distribution is a good choice. I wrote about the beta distribution in a prior post <a href="https://benslack19.github.io/data%20science/statistics/prior-and-beta/">here</a>.</p>
<p>I chose the beta prior to be parameterized as (2, 2) since it is fairly conservative, with most of the mass suggesting a fair coin. You’ll see its shape below, when we plot it alongside the posterior. We can get the posterior with the following code. <strong>There is a closed-form solution where we can get the posterior analytically; the following obtains it with MCMC instead. While MCMC is overkill for this problem, it scales to more complicated models.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Generate observed data
</span><span class="n">dummy_h</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">binom</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="c1"># Infer the parameter
</span><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">m1</span><span class="p">:</span>
<span class="c1"># prior
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Beta</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="c1"># likelihood with unknown parameter p, observed dummy_h
</span> <span class="n">H</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"H"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">dummy_h</span><span class="p">)</span>
<span class="c1"># posterior
</span> <span class="n">trace_m1</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span>
<span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [p]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 12 seconds.
</code></pre></div></div>
<p>Let’s plot the prior and posterior for $p$ together.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">f</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="c1"># known parameter to generate data
</span><span class="n">ax1</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span>
<span class="mf">0.7</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"red"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"dashed"</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"known p to generate data"</span>
<span class="p">)</span>
<span class="c1"># prior
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">stats</span><span class="p">.</span><span class="n">beta</span><span class="p">.</span><span class="n">ppf</span><span class="p">(</span><span class="mf">0.00</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="n">stats</span><span class="p">.</span><span class="n">beta</span><span class="p">.</span><span class="n">ppf</span><span class="p">(</span><span class="mf">1.00</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">prior_label</span> <span class="o">=</span> <span class="s">"prior, beta("</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">+</span> <span class="s">", "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">+</span> <span class="s">")"</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">beta</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="n">lw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"gray"</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">prior_label</span><span class="p">)</span>
<span class="c1"># Make the posterior values accessible and plot
</span><span class="n">df_trace_m1</span> <span class="o">=</span> <span class="n">trace_m1</span><span class="p">.</span><span class="n">to_dataframe</span><span class="p">()</span>
<span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">df_trace_m1</span><span class="p">[(</span><span class="s">"posterior"</span><span class="p">,</span> <span class="s">"p"</span><span class="p">)],</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"blue"</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"posterior"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Distribution of p"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"random variable x"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"PDF"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.legend.Legend at 0x7fb94274c190>
</code></pre></div></div>
<p><img src="/assets/2022-01-03-stats_rethinking_ch03_sim4blog_part1_files/2022-01-03-stats_rethinking_ch03_sim4blog_part1_19_1.png" alt="png" /></p>
<p>You can see how our distribution of $p$ narrows and gets closer to the known true value of 0.7, but hasn’t centered over the true value yet. That will happen with more trials. Still, more of the probability mass sits near 0.7 than under the prior. We can look more specifically at the 89% compatibility interval like this.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_m1</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>p</th>
<td>0.623</td>
<td>0.097</td>
<td>0.474</td>
<td>0.778</td>
<td>0.002</td>
<td>0.002</td>
<td>1725.0</td>
<td>1714.0</td>
<td>1749.0</td>
<td>2869.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
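<p>That narrowing with more trials can also be seen directly from the conjugate form of the posterior: holding the observed proportion of heads fixed at 0.7, the 89% interval shrinks roughly with the square root of the total number of tosses. A sketch with idealized data (the trial counts are arbitrary choices for illustration):</p>

```python
from scipy import stats

a, b = 2, 2   # the same Beta(2, 2) prior
p_true = 0.7

widths = []
for n_trials in [10, 100, 1000]:
    n_tosses = 2 * n_trials                  # 2 tosses per trial
    heads = round(p_true * n_tosses)         # idealized count at the true proportion
    # central 89% interval of the exact Beta posterior
    lo, hi = stats.beta(a + heads, b + n_tosses - heads).interval(0.89)
    widths.append(hi - lo)
    print(n_trials, round(hi - lo, 3))
```

<p>At 10 trials the interval width is comparable to the MCMC interval in the table above; by 1,000 trials it has collapsed to a few hundredths.</p>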
<h1 id="summary">Summary</h1>
<p>In this post, we talked about running a model forwards and backwards. I used a simple binomial example to illustrate how a known parameter can be used to generate data (running forwards) and how using observed data can help us obtain plausible parameter values in the form of a distribution. You may know already that <code class="language-plaintext highlighter-rouge">pymc</code> (and other software) has built-in capability to produce prior and posterior predictive simulations. We’ll use this functionality in a later post.</p>
<p>Appendix: Environment and system parameters</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last updated: Mon Jan 03 2022
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.20.0
json : 2.0.9
arviz : 0.11.1
pandas : 1.2.1
matplotlib : 3.3.4
pymc3 : 3.11.0
scipy : 1.6.0
seaborn : 0.11.1
numpy : 1.20.1
statsmodels: 0.12.2
Watermark: 2.1.0
</code></pre></div></div>Ben LacarThe value of simulations is highlighted by Dr. McElreath throughout his textbook and by van de Schoot and colleagues. I didn’t entirely appreciate its value until I implemented this myself. I started by reviewing some materials, then I went down a rabbit hole, where one question naturally branched into other questions. I’ll talk about simulations in multiple posts and how model running forwards and backwards can help with understanding.Exploring modeling failure2021-11-23T00:00:00+00:002021-11-23T00:00:00+00:00https://benslack19.github.io/data%20science/statistics/diagnosing-a-model<blockquote>
<p>In <a href="https://benslack19.github.io/data%20science/statistics/multilevel_modeling_01/">my last post</a>, I gave an example of a multilevel model using a binomial generalized linear model (GLM). The varying intercept model helped illustrate partial pooling, shrinkage, and information sharing. The equation to create the mixed effects model was simple. But how exactly is “information shared”? Let’s get started!</p>
</blockquote>
<p>This is how I was going to start this post. I thought this would be a fairly quick write-up where I would simplify the dataset by showing a few visualizations to demonstrate some concepts. But during this process, I realized the act of simplifying my dataset (using only three clusters) caused divergences when trying to obtain the posterior distribution. This is something that Richard McElreath had warned readers about on pages 407 and 408 of Statistical Rethinking. I tested and adjusted my priors–hard and unglamorous work that I planned to minimize to not distract from the main lesson.</p>
<p>However, I listened to <a href="https://www.learnbayesstats.com/episodes/9#showEpisodes">Alex Andorra’s conversation with Michael Betancourt in the Learning Bayesian Statistics podcast #6</a>. Around the 30 min mark, Alex and Michael talk about how failure is necessary for learning. Remarkably (for me), a few minutes later Michael gives an example specific to hierarchical models, about how problems can come from interactions between population location and population scale, including when there are only a small number of groups.</p>
<p>Instead of glossing over my observation, let’s dive into the problem, explore the model failure, and see how adjustments to priors and the modeling equations can resolve it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="n">pm</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">gaussian_kde</span>
<span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">tensor</span> <span class="k">as</span> <span class="n">tt</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">load_ext</span> <span class="n">nb_black</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
<span class="n">RANDOM_SEED</span> <span class="o">=</span> <span class="mi">8927</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">RANDOM_SEED</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"stats.hdi_prob"</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.89</span> <span class="c1"># sets default credible interval used by arviz
</span><span class="n">sns</span><span class="p">.</span><span class="n">set_context</span><span class="p">(</span><span class="s">"talk"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">standardize</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>
<p>I’ll be re-using the problem described in the <a href="https://benslack19.github.io/data%20science/statistics/multilevel_modeling_01/#problem-description">last post</a>, so I won’t re-write everything. But let’s show the fixed effects and mixed effects models from the original post since we’ll be contrasting them.</p>
<p><strong>Equation for fixed effects model</strong></p>
<p>Model <code class="language-plaintext highlighter-rouge">mfe</code> equation</p>
\[C_i \sim \text{Binomial}(1, p_i) \tag{binomial likelihood}\]
\[\text{logit}(p_i) = \alpha_{\text{district}[i]} \tag{linear model using logit link}\]
\[\alpha_j \sim \text{Normal}(0, 1.5) \tag{regularizing prior}\]
<p><strong>Equation for mixed effects model</strong></p>
<p>Model <code class="language-plaintext highlighter-rouge">mme</code> equation</p>
\[C_i \sim \text{Binomial}(1, p_i) \tag{binomial likelihood}\]
\[\text{logit}(p_i) = \alpha_{\text{district}[i]} \tag{linear model using logit link}\]
\[\alpha_j \sim \text{Normal}(\bar{\alpha}, \sigma) \tag{adaptive prior}\]
\[\bar{\alpha} \sim \text{Normal}(0, 1.5) \tag{regularizing hyperprior}\]
\[\sigma \sim \text{Exponential}(1) \tag{regularizing hyperprior}\]
<p>The main difference is the use of an adaptive prior and the regularizing hyperpriors in the mixed effects equations. There will be another change to the $\sigma$ term of the mixed effects model, which I’ll detail later.</p>
<h1 id="data-exploration-and-setup">Data exploration and setup</h1>
<p>I’ll load and clean the data in this one cell and skip the details. Review the last post if you’d like to revisit this step.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_bangladesh</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
<span class="s">"../pymc3_ed_resources/resources/Rethinking/Data/bangladesh.csv"</span><span class="p">,</span>
<span class="n">delimiter</span><span class="o">=</span><span class="s">";"</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># Fix the district variable
</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district"</span><span class="p">]).</span><span class="n">codes</span>
</code></pre></div></div>
<p>To make the lessons of multilevel modeling more comprehensible, let’s limit the dataframe to only the first three districts. (As you’ll see, here is where I encountered trouble.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_bangladesh_first3</span> <span class="o">=</span> <span class="n">df_bangladesh</span><span class="p">[</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]</span> <span class="o"><</span> <span class="mi">3</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
</code></pre></div></div>
<p>We can also get a count of women represented in each district. The variability in the number of women will help drive some of the lessons home further.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">'district_code'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 117
1 20
2 2
Name: district_code, dtype: int64
</code></pre></div></div>
<h1 id="fixed-effects-model">Fixed-effects model</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mfe</span><span class="p">:</span>
<span class="c1"># alpha prior, one for each district
</span> <span class="n">a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]])</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mfe</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 22 seconds.
</code></pre></div></div>
<p>Let’s do some model diagnostics with some <code class="language-plaintext highlighter-rouge">arviz</code> functions.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">az.summary</code> can provide the effective sample size (like with <code class="language-plaintext highlighter-rouge">ess_mean</code>) and <code class="language-plaintext highlighter-rouge">r_hat</code> values. The effective sample size is an indication of how well the posterior distribution was explored by HMC. Since Markov chains are typically autocorrelated, sampling within a chain is not entirely independent. The effective sample size accounts for this correlation and can even exceed the raw sample size. The <code class="language-plaintext highlighter-rouge">r_hat</code> value is computed from the “estimated between-chains and within-chain variances for each model parameter” (<a href="https://blog.stata.com/2016/05/26/gelman-rubin-convergence-diagnostic-using-multiple-chains/">source</a>). (While an imperfect analogy, I think of <a href="https://en.wikipedia.org/wiki/Analysis_of_variance">ANOVA and the F-test</a>, which is calculated as variance between groups divided by variance within groups.) McElreath cautions that an <code class="language-plaintext highlighter-rouge">r_hat</code> value above 1.00 is a signal of danger, but a value of 1.00 is not a guarantee of safety; an invalid chain can still reach 1.00.</li>
</ul>
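<p>To make the between-chain vs. within-chain variance idea concrete, here is a rough numpy sketch of the original (non-split) Gelman–Rubin calculation. The function name and the simulated chains are illustrative only; in practice <code class="language-plaintext highlighter-rouge">az.rhat</code> computes a more robust rank-normalized, split-chain version.</p>

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Rough, non-split R-hat for one parameter.
    `chains` has shape (n_chains, n_draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled posterior-variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(19)
good = rng.normal(size=(4, 1000))      # four chains sampling the same target
bad = good + np.arange(4)[:, None]     # four chains stuck in different places
print(gelman_rubin_rhat(good))         # close to 1.0
print(gelman_rubin_rhat(bad))          # well above 1.0
```

<p>When chains disagree, the between-chain term inflates the pooled variance estimate relative to the within-chain variance, pushing R-hat above 1.</p>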
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a[0]</th>
<td>-1.058</td>
<td>0.211</td>
<td>-1.380</td>
<td>-0.710</td>
<td>0.003</td>
<td>0.002</td>
<td>4574.0</td>
<td>4005.0</td>
<td>4666.0</td>
<td>2322.0</td>
<td>1.0</td>
</tr>
<tr>
<th>a[1]</th>
<td>-0.592</td>
<td>0.439</td>
<td>-1.287</td>
<td>0.086</td>
<td>0.006</td>
<td>0.005</td>
<td>5652.0</td>
<td>3980.0</td>
<td>5674.0</td>
<td>3009.0</td>
<td>1.0</td>
</tr>
<tr>
<th>a[2]</th>
<td>1.220</td>
<td>1.162</td>
<td>-0.582</td>
<td>3.069</td>
<td>0.015</td>
<td>0.014</td>
<td>5776.0</td>
<td>3345.0</td>
<td>5811.0</td>
<td>2619.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
<p>The <code class="language-plaintext highlighter-rouge">az.plot_trace</code> and <code class="language-plaintext highlighter-rouge">az.plot_rank</code> functions output visualizations. The former can, in theory, be used to assess how well chains are mixing, but this can be hard to see. The latter produces histograms of the ranked samples, which tell us how evenly a particular chain’s samples rank across all samples and chains. Efficient exploration of the posterior should yield uniform distributions in these kinds of plots (known as trace rank, or “trank”, plots).</p>
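<p>A trank plot boils down to a simple computation: pool the draws from every chain, rank them, and histogram the ranks that belong to each chain. Here is a hypothetical numpy sketch of that bookkeeping with four well-mixed (independent, identically distributed) chains.</p>

```python
import numpy as np

rng = np.random.default_rng(19)
chains = rng.normal(size=(4, 1000))    # 4 chains x 1000 draws, well mixed

flat = chains.ravel()
# rank of every draw across all chains (0..3999), reshaped back per chain
ranks = flat.argsort().argsort().reshape(chains.shape)

# Histogram chain 0's ranks; well-mixed chains take an even share of each bin.
hist_chain0, _ = np.histogram(ranks[0], bins=20, range=(0, flat.size))
print(hist_chain0)   # each of the 20 bins holds roughly 1000 / 20 = 50 draws
```

<p>A stuck or shifted chain would instead pile its ranks toward one end of the histogram, which is exactly the non-uniformity a trank plot makes visible.</p>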
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">plot_trace</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[<AxesSubplot:title={'center':'a'}>,
<AxesSubplot:title={'center':'a'}>]], dtype=object)
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_18_1.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">plot_rank</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([<AxesSubplot:title={'center':'a\n0'}, xlabel='Rank (all chains)', ylabel='Chain'>,
<AxesSubplot:title={'center':'a\n1'}, xlabel='Rank (all chains)', ylabel='Chain'>,
<AxesSubplot:title={'center':'a\n2'}, xlabel='Rank (all chains)', ylabel='Chain'>],
dtype=object)
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_19_1.png" alt="png" /></p>
<p>As you can see in this case, the model ran great. Some indications for this:</p>
<ul>
<li>Pymc gave no warnings about divergences.</li>
<li>While we would need something to compare it to, the effective sample size (<code class="language-plaintext highlighter-rouge">ess_mean</code>) is high.</li>
<li>The <code class="language-plaintext highlighter-rouge">r_hat</code> is 1.0 which is a sign of “lack of danger”.</li>
<li>The trace plots apparently show good “wiggliness”, and inter-mixing between chains is confirmed by the trank plots.</li>
</ul>
<p>In the plots, the blue chain is <code class="language-plaintext highlighter-rouge">district0</code>, orange is <code class="language-plaintext highlighter-rouge">district1</code>, and green is <code class="language-plaintext highlighter-rouge">district2</code>, but let’s not focus too much on interpretation for now.</p>
<p>Let’s see how the first iteration of our mixed effects model looks. We’ll parameterize it as we had done in the last post and as the equations show above.</p>
<h1 id="mixed-effects-me-attempt-0">Mixed-effects (ME) attempt 0</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mme0</span><span class="p">:</span>
<span class="c1"># prior for average district
</span> <span class="n">a_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">)</span>
<span class="c1"># prior for SD of districts
</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">(</span><span class="s">"sigma"</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="c1"># alpha prior, one for each district
</span> <span class="n">a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">a_bar</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]])</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mme0</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, sigma, a_bar]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 21 seconds.
There were 71 divergences after tuning. Increase `target_accept` or reparameterize.
There were 71 divergences after tuning. Increase `target_accept` or reparameterize.
There were 99 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7116085658262987, but should be close to 0.8. Try to increase the number of tuning steps.
There were 49 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
</code></pre></div></div>
<p>Wow, we’ve already got a <a href="https://media.giphy.com/media/fwbo0KVql262TePTMZ/giphy.gif">number of issues</a>. Let’s take a look at our summary and trace plots.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme0</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a_bar</th>
<td>-0.449</td>
<td>0.703</td>
<td>-1.638</td>
<td>0.532</td>
<td>0.024</td>
<td>0.017</td>
<td>874.0</td>
<td>874.0</td>
<td>848.0</td>
<td>394.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[0]</th>
<td>-1.031</td>
<td>0.206</td>
<td>-1.355</td>
<td>-0.712</td>
<td>0.008</td>
<td>0.006</td>
<td>666.0</td>
<td>666.0</td>
<td>662.0</td>
<td>1641.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[1]</th>
<td>-0.642</td>
<td>0.434</td>
<td>-1.379</td>
<td>-0.025</td>
<td>0.014</td>
<td>0.011</td>
<td>996.0</td>
<td>821.0</td>
<td>993.0</td>
<td>330.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[2]</th>
<td>0.282</td>
<td>1.358</td>
<td>-1.513</td>
<td>2.137</td>
<td>0.051</td>
<td>0.036</td>
<td>715.0</td>
<td>715.0</td>
<td>663.0</td>
<td>382.0</td>
<td>1.00</td>
</tr>
<tr>
<th>sigma</th>
<td>0.966</td>
<td>0.752</td>
<td>0.167</td>
<td>1.885</td>
<td>0.033</td>
<td>0.023</td>
<td>523.0</td>
<td>523.0</td>
<td>232.0</td>
<td>134.0</td>
<td>1.01</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">plot_trace</span><span class="p">(</span><span class="n">trace_mme0</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[<AxesSubplot:title={'center':'a_bar'}>,
<AxesSubplot:title={'center':'a_bar'}>],
[<AxesSubplot:title={'center':'a'}>,
<AxesSubplot:title={'center':'a'}>],
[<AxesSubplot:title={'center':'sigma'}>,
<AxesSubplot:title={'center':'sigma'}>]], dtype=object)
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_25_1.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">plot_rank</span><span class="p">(</span><span class="n">trace_mme0</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[<AxesSubplot:title={'center':'a_bar'}, xlabel='Rank (all chains)', ylabel='Chain'>,
<AxesSubplot:title={'center':'a\n0'}, xlabel='Rank (all chains)', ylabel='Chain'>,
<AxesSubplot:title={'center':'a\n1'}, xlabel='Rank (all chains)', ylabel='Chain'>],
[<AxesSubplot:title={'center':'a\n2'}, xlabel='Rank (all chains)', ylabel='Chain'>,
<AxesSubplot:title={'center':'sigma'}, xlabel='Rank (all chains)', ylabel='Chain'>,
<AxesSubplot:>]], dtype=object)
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_26_1.png" alt="png" /></p>
<p>The <code class="language-plaintext highlighter-rouge">r_hat</code> seems to indicate that the chains didn’t have a lot of variation between each other. But another thing to look at is the <code class="language-plaintext highlighter-rouge">ess_mean</code>, which is an indicator of the effective sample size. A higher number means sampling of the posterior distribution is more efficient. When we compare the fixed effects model with this first version of our mixed effects model, we can see that the latter had trouble sampling. The number of effective samples in the mixed effects model is much smaller for each district than in the fixed effects model. The trank plots are sounding alarms, particularly with the <code class="language-plaintext highlighter-rouge">sigma</code> parameter, but let’s come back to this.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">ax</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mf">0.1</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)],</span> <span class="n">height</span><span class="o">=</span><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">)[</span><span class="s">'ess_mean'</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'fixed effects'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mf">0.1</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)],</span> <span class="n">height</span><span class="o">=</span><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme0</span><span class="p">)[</span><span class="s">'ess_mean'</span><span class="p">].</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">4</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'navy'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'mixed effects v0'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">([</span><span class="s">'district'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)])</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'effective sample size (mean)'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0, 0.5, 'effective sample size (mean)')
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_28_1.png" alt="png" /></p>
<p>Let’s now take a closer look at some of the warnings.</p>
<p><code class="language-plaintext highlighter-rouge">There were 71 divergences after tuning.</code> This is an indication that the model had trouble exploring all of the posterior distribution and that there could be a problem with the chains. Divergences result when the “energy at the start of the trajectory differs substantially from the energy at the end,” according to pg. 278 of Statistical Rethinking. The energy pertains to the physics simulation that Hamiltonian Monte Carlo performs.</p>
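<p>As an aside, when sampling with <code class="language-plaintext highlighter-rouge">return_inferencedata=True</code>, the per-draw divergence flags live in the trace’s <code class="language-plaintext highlighter-rouge">sample_stats</code> group, so you can count them directly. A sketch using arviz’s bundled <code class="language-plaintext highlighter-rouge">centered_eight</code> example trace (a classic divergence-prone centered hierarchical model) as a stand-in for our trace object:</p>

```python
import arviz as az

# Bundled example InferenceData from a centered hierarchical model
idata = az.load_arviz_data("centered_eight")

diverging = idata.sample_stats["diverging"]   # boolean array, dims (chain, draw)
print("total divergences:", int(diverging.sum()))
print("per chain:", diverging.sum(dim="draw").values)
```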
<p><code class="language-plaintext highlighter-rouge">Increase `target_accept` or reparameterize.</code> This suggestion follows the above warning. We’ll talk about reparameterization later, but what is <code class="language-plaintext highlighter-rouge">target_accept</code> for? Per <a href="https://docs.pymc.io/en/stable/api/inference.html">the pymc documentation</a>, this controls the step size of the physics simulation. A higher <code class="language-plaintext highlighter-rouge">target_accept</code> value will lead to smaller step sizes, or a smaller duration of time to run each segment of a simulation. This will help explore tricky and curvy parts of a posterior distribution where a smaller <code class="language-plaintext highlighter-rouge">target_accept</code> value might overshoot and miss exploring these areas. (Why would you want a smaller <code class="language-plaintext highlighter-rouge">target_accept</code> if you can get away with it? The model will sample more efficiently if the posterior is not problematic.) Therefore, increasing the <code class="language-plaintext highlighter-rouge">target_accept</code> parameter is an easy thing we can change, so let’s try this first. The default value is 0.8, so let’s try 0.9.</p>
<h1 id="me-attempt-1-higher-target_accept">ME attempt 1: higher <code class="language-plaintext highlighter-rouge">target_accept</code></h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mme1</span><span class="p">:</span>
<span class="c1"># prior for average district
</span> <span class="n">a_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">)</span>
<span class="c1"># prior for SD of districts
</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">(</span><span class="s">"sigma"</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="c1"># alpha prior, one for each district
</span> <span class="n">a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">a_bar</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]])</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mme1</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">target_accept</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, sigma, a_bar]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 23 seconds.
There were 151 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5782174075147031, but should be close to 0.9. Try to increase the number of tuning steps.
There were 34 divergences after tuning. Increase `target_accept` or reparameterize.
There were 56 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.8194929209484552, but should be close to 0.9. Try to increase the number of tuning steps.
There were 58 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The estimated number of effective samples is smaller than 200 for some parameters.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme1</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a_bar</th>
<td>-0.497</td>
<td>0.650</td>
<td>-1.371</td>
<td>0.559</td>
<td>0.024</td>
<td>0.017</td>
<td>720.0</td>
<td>720.0</td>
<td>647.0</td>
<td>636.0</td>
<td>1.01</td>
</tr>
<tr>
<th>a[0]</th>
<td>-1.010</td>
<td>0.208</td>
<td>-1.332</td>
<td>-0.685</td>
<td>0.009</td>
<td>0.007</td>
<td>495.0</td>
<td>463.0</td>
<td>544.0</td>
<td>1926.0</td>
<td>1.01</td>
</tr>
<tr>
<th>a[1]</th>
<td>-0.652</td>
<td>0.405</td>
<td>-1.273</td>
<td>0.029</td>
<td>0.011</td>
<td>0.008</td>
<td>1290.0</td>
<td>1290.0</td>
<td>1198.0</td>
<td>1199.0</td>
<td>1.05</td>
</tr>
<tr>
<th>a[2]</th>
<td>0.193</td>
<td>1.327</td>
<td>-1.449</td>
<td>2.039</td>
<td>0.066</td>
<td>0.047</td>
<td>407.0</td>
<td>407.0</td>
<td>369.0</td>
<td>1246.0</td>
<td>1.02</td>
</tr>
<tr>
<th>sigma</th>
<td>0.866</td>
<td>0.741</td>
<td>0.091</td>
<td>1.765</td>
<td>0.051</td>
<td>0.036</td>
<td>214.0</td>
<td>214.0</td>
<td>42.0</td>
<td>21.0</td>
<td>1.07</td>
</tr>
</tbody>
</table>
</div>
<p>We got higher <code class="language-plaintext highlighter-rouge">ess_mean</code> values, but our <code class="language-plaintext highlighter-rouge">r_hat</code> actually got worse. Looks like we have more work to do. But let’s try <code class="language-plaintext highlighter-rouge">target_accept</code> again by <a href="https://media.giphy.com/media/bHG5gzKfPESAGr4Dxg/giphy.gif">cranking it up</a> to 0.99.</p>
<h1 id="me-attempt-2-even-higher-target_accept">ME attempt 2: even higher <code class="language-plaintext highlighter-rouge">target_accept</code></h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mme2</span><span class="p">:</span>
<span class="c1"># prior for average district
</span> <span class="n">a_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">)</span>
<span class="c1"># prior for SD of districts
</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">(</span><span class="s">"sigma"</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
    <span class="c1"># alpha prior, one intercept per district
</span> <span class="n">a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">a_bar</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]])</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mme2</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">target_accept</span><span class="o">=</span><span class="mf">0.99</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, sigma, a_bar]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 35 seconds.
There were 10 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.9658147278026631, but should be close to 0.99. Try to increase the number of tuning steps.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme2</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a_bar</th>
<td>-0.497</td>
<td>0.669</td>
<td>-1.382</td>
<td>0.558</td>
<td>0.022</td>
<td>0.016</td>
<td>913.0</td>
<td>913.0</td>
<td>942.0</td>
<td>1475.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[0]</th>
<td>-1.015</td>
<td>0.209</td>
<td>-1.327</td>
<td>-0.664</td>
<td>0.006</td>
<td>0.004</td>
<td>1358.0</td>
<td>1358.0</td>
<td>1352.0</td>
<td>1871.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[1]</th>
<td>-0.684</td>
<td>0.406</td>
<td>-1.355</td>
<td>-0.065</td>
<td>0.011</td>
<td>0.008</td>
<td>1419.0</td>
<td>1419.0</td>
<td>1389.0</td>
<td>1621.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[2]</th>
<td>0.122</td>
<td>1.315</td>
<td>-1.444</td>
<td>1.987</td>
<td>0.054</td>
<td>0.038</td>
<td>603.0</td>
<td>603.0</td>
<td>709.0</td>
<td>1132.0</td>
<td>1.00</td>
</tr>
<tr>
<th>sigma</th>
<td>0.865</td>
<td>0.760</td>
<td>0.019</td>
<td>1.749</td>
<td>0.034</td>
<td>0.024</td>
<td>489.0</td>
<td>489.0</td>
<td>349.0</td>
<td>425.0</td>
<td>1.01</td>
</tr>
</tbody>
</table>
</div>
<p>These numbers look better, but the suggestions to reparameterize persist. The problem we’re observing is a good example of <a href="https://statmodeling.stat.columbia.edu/2008/05/13/the_folk_theore/">the folk theorem of statistical computing</a> and why we should make the effort to reparameterize. But first, let’s see if we can visualize this problematic posterior using the arviz <code class="language-plaintext highlighter-rouge">plot_pair</code> function.</p>
<h2 id="visualizing-divergent-transitions">Visualizing divergent transitions</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">import</span> <span class="n">figure</span>
<span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_pair</span><span class="p">(</span><span class="n">trace_mme2</span><span class="p">,</span> <span class="n">kind</span><span class="o">=</span><span class="s">'kde'</span><span class="p">,</span> <span class="n">divergences</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[<AxesSubplot:ylabel='a\n0'>, <AxesSubplot:>, <AxesSubplot:>,
<AxesSubplot:>],
[<AxesSubplot:ylabel='a\n1'>, <AxesSubplot:>, <AxesSubplot:>,
<AxesSubplot:>],
[<AxesSubplot:ylabel='a\n2'>, <AxesSubplot:>, <AxesSubplot:>,
<AxesSubplot:>],
[<AxesSubplot:xlabel='a_bar', ylabel='sigma'>,
<AxesSubplot:xlabel='a\n0'>, <AxesSubplot:xlabel='a\n1'>,
<AxesSubplot:xlabel='a\n2'>]], dtype=object)
<Figure size 800x600 with 0 Axes>
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_39_2.png" alt="png" /></p>
<p>That’s a lot of red dots! Each represents a divergent transition. Why could this happen? McElreath again provides a clue by pointing to the sigma parameter. When using a logit function “floor and ceiling effects sometimes render extreme values of the variance equally plausible as more realistic values.” Let’s look closer at some of the diagnostic metrics.</p>
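<p>To see that ceiling effect numerically, here’s a quick sketch (my own illustration, not part of the original notebook): the inverse-logit squashes increasingly extreme log-odds values onto nearly identical probabilities, so large values of <code class="language-plaintext highlighter-rouge">sigma</code> buy almost no change in the likelihood.</p>

```python
# Inverse-logit saturates: values far apart on the log-odds scale
# map to probabilities that are nearly indistinguishable.
from scipy.special import expit  # inverse-logit

for a in [1, 3, 5, 10]:
    print(f"invlogit({a:>2}) = {expit(a):.6f}")
```

<p>Once the intercepts are a few units from zero, doubling them barely moves the probabilities, which is exactly why extreme values of the variance can look as plausible as realistic ones.</p>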
<p>First, we see a low <code class="language-plaintext highlighter-rouge">ess_mean</code> and an <code class="language-plaintext highlighter-rouge">r_hat</code> above 1.00, even if barely.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme0</span><span class="p">,</span> <span class="n">var_names</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>sigma</th>
<td>0.966</td>
<td>0.752</td>
<td>0.167</td>
<td>1.885</td>
<td>0.033</td>
<td>0.023</td>
<td>523.0</td>
<td>523.0</td>
<td>232.0</td>
<td>134.0</td>
<td>1.01</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">plot_trace</span><span class="p">(</span><span class="n">trace_mme0</span><span class="p">,</span> <span class="n">var_names</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[<AxesSubplot:title={'center':'sigma'}>,
<AxesSubplot:title={'center':'sigma'}>]], dtype=object)
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_42_1.png" alt="png" /></p>
<p>It’s hard to tell what’s going on in the trace plot, but the rank plot definitely indicates something strange is going on.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">plot_rank</span><span class="p">(</span><span class="n">trace_mme0</span><span class="p">,</span> <span class="n">var_names</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><AxesSubplot:title={'center':'sigma'}, xlabel='Rank (all chains)', ylabel='Chain'>
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_44_1.png" alt="png" /></p>
<p>There’s something simple we can do first to help avoid extreme values of variance: use a much more informative prior. Let’s use a half-normal prior instead of an exponential. It will keep values of <code class="language-plaintext highlighter-rouge">sigma</code> positive while avoiding the more extreme values that an exponential distribution allows.</p>
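<p>To make that concrete (a sketch of my own, using unit-scale parameters for comparison), we can check how much upper-tail mass each prior places on extreme values of $\sigma$:</p>

```python
# Compare upper-tail probabilities: the Exponential(1) prior used above
# leaves noticeably more mass on extreme sigma values than a
# Half-Normal with the same (unit) scale does.
from scipy import stats

for thresh in [2, 3, 4]:
    p_exp = stats.expon(scale=1.0).sf(thresh)    # P(sigma > thresh), Exponential(1)
    p_hn = stats.halfnorm(scale=1.0).sf(thresh)  # P(sigma > thresh), Half-Normal(1)
    print(f"P(sigma > {thresh}): exponential = {p_exp:.5f}, half-normal = {p_hn:.5f}")
```

<p>The exponential’s heavier tail is what lets the sampler wander into those implausibly large variances.</p>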
<h1 id="me-attempt-3-more-informative-prior-for-sigma">ME attempt 3: more informative prior for sigma</h1>
<p>First, let’s plot the normal prior we’ll keep for $\bar{\alpha}$ alongside a half-normal prior for $\sigma$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="c1"># First two subplots ------------------------------------------
</span><span class="n">x1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">ppf</span><span class="p">(</span><span class="mf">0.01</span><span class="p">),</span>
<span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">ppf</span><span class="p">(</span><span class="mf">0.99</span><span class="p">),</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">x2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">stats</span><span class="p">.</span><span class="n">halfnorm</span><span class="p">.</span><span class="n">ppf</span><span class="p">(</span><span class="mf">0.01</span><span class="p">),</span>
<span class="n">stats</span><span class="p">.</span><span class="n">halfnorm</span><span class="p">.</span><span class="n">ppf</span><span class="p">(</span><span class="mf">0.99</span><span class="p">),</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x1</span><span class="p">),</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="sa">r</span><span class="s">'prior for $\bar{\alpha}$'</span><span class="p">,</span> <span class="n">xlabel</span> <span class="o">=</span> <span class="s">'x'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'density'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">halfnorm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x2</span><span class="p">),</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="sa">r</span><span class="s">'prior for $\sigma$'</span><span class="p">,</span> <span class="n">xlabel</span> <span class="o">=</span> <span class="s">'x'</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'density'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Text(0.5, 1.0, 'prior for $\\sigma$'),
Text(0.5, 0, 'x'),
Text(0, 0.5, 'density')]
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_48_1.png" alt="png" /></p>
<p>Model <code class="language-plaintext highlighter-rouge">mme3</code> equation</p>
\[C_i \sim \text{Binomial}(1, p_i) \tag{binomial likelihood}\]
\[\text{logit}(p_i) = \alpha_{\text{district}[i]} \tag{linear model using logit link}\]
\[\alpha_j \sim \text{Normal}(\bar{\alpha}, \sigma) \tag{adaptive prior}\]
\[\bar{\alpha} \sim \text{Normal}(0, 1.5) \tag{regularizing hyperprior}\]
\[\sigma \sim \text{Half-Normal}(\text{TBD}) \tag{regularizing hyperprior}\]
<p>I’m going to leave the exact value of our half-normal parameter TBD because I really don’t know what’s going to work. <a href="https://media.giphy.com/media/3o7TKCFii3mAz693Ko/giphy.gif">We’ll just have to try</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mme3</span><span class="p">:</span>
<span class="c1"># prior for average district
</span> <span class="n">a_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">)</span>
<span class="c1"># prior for SD of districts
</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfNormal</span><span class="p">(</span><span class="s">"sigma"</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">)</span>
    <span class="c1"># alpha prior, one intercept per district
</span> <span class="n">a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">a_bar</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]])</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mme3</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">target_accept</span><span class="o">=</span><span class="mf">0.99</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, sigma, a_bar]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 34 seconds.
There were 8 divergences after tuning. Increase `target_accept` or reparameterize.
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
There were 201 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6423494210597291, but should be close to 0.99. Try to increase the number of tuning steps.
There were 8 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The estimated number of effective samples is smaller than 200 for some parameters.
</code></pre></div></div>
<p>Hmmm… that looks worse. Let’s make the prior on <code class="language-plaintext highlighter-rouge">sigma</code> tighter by setting the half-normal scale parameter to 0.1.</p>
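<p>Before sampling, it’s worth checking what this prior actually implies (a quick sanity check of my own, matching the scale of 0.1 used below):</p>

```python
# Quantiles of a Half-Normal prior with scale 0.1: nearly all of the
# prior mass sits below ~0.26, so this is a very restrictive choice.
from scipy import stats

prior = stats.halfnorm(scale=0.1)
for q in [0.5, 0.945, 0.99]:
    print(f"{q:.1%} of prior mass below sigma = {prior.ppf(q):.3f}")
```

<p>Keep that in mind when reading the posterior summary: with a prior this tight, the estimate of <code class="language-plaintext highlighter-rouge">sigma</code> may be driven as much by the prior as by the data.</p>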
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mme3b</span><span class="p">:</span>
<span class="c1"># prior for average district
</span> <span class="n">a_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">)</span>
<span class="c1"># prior for SD of districts
</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfNormal</span><span class="p">(</span><span class="s">"sigma"</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
    <span class="c1"># alpha prior, one intercept per district
</span> <span class="n">a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">a_bar</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]])</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mme3b</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">target_accept</span><span class="o">=</span><span class="mf">0.99</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, sigma, a_bar]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 42 seconds.
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 4 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme3b</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a_bar</th>
<td>-0.924</td>
<td>0.205</td>
<td>-1.244</td>
<td>-0.597</td>
<td>0.009</td>
<td>0.006</td>
<td>571.0</td>
<td>542.0</td>
<td>573.0</td>
<td>517.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[0]</th>
<td>-0.952</td>
<td>0.186</td>
<td>-1.260</td>
<td>-0.661</td>
<td>0.007</td>
<td>0.005</td>
<td>640.0</td>
<td>613.0</td>
<td>650.0</td>
<td>662.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[1]</th>
<td>-0.917</td>
<td>0.220</td>
<td>-1.244</td>
<td>-0.557</td>
<td>0.009</td>
<td>0.007</td>
<td>600.0</td>
<td>567.0</td>
<td>601.0</td>
<td>608.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[2]</th>
<td>-0.911</td>
<td>0.230</td>
<td>-1.269</td>
<td>-0.548</td>
<td>0.010</td>
<td>0.007</td>
<td>585.0</td>
<td>567.0</td>
<td>578.0</td>
<td>649.0</td>
<td>1.00</td>
</tr>
<tr>
<th>sigma</th>
<td>0.081</td>
<td>0.059</td>
<td>0.004</td>
<td>0.161</td>
<td>0.003</td>
<td>0.002</td>
<td>346.0</td>
<td>346.0</td>
<td>286.0</td>
<td>532.0</td>
<td>1.02</td>
</tr>
</tbody>
</table>
</div>
<h1 id="me-attempt-4-re-paramaterization">ME attempt 4: re-parameterization</h1>
<p>The <code class="language-plaintext highlighter-rouge">r_hat</code> values for the <code class="language-plaintext highlighter-rouge">a</code> parameters are better, but the <code class="language-plaintext highlighter-rouge">ess_mean</code> still indicates some inefficiencies. The <code class="language-plaintext highlighter-rouge">sigma</code> also still looks bad. Now we should really consider re-parameterizing. Luckily, we have an example of how to do this in the book.</p>
<p>Let’s look again at the original equation. We’ve got a substitution we can make for $\alpha$. We can also loosen the prior on $\sigma$ back up, using a Half-Normal(0.5) prior.</p>
<p>Centered equation</p>
\[C_i \sim \text{Binomial}(1, p_i) \tag{binomial likelihood}\]
\[\text{logit}(p_i) = \alpha_{\text{district}[i]} \tag{linear model using logit link}\]
\[\alpha_j \sim \text{Normal}(\bar{\alpha}, \sigma) \tag{adaptive prior}\]
\[\bar{\alpha} \sim \text{Normal}(0, 1.5) \tag{regularizing hyperprior}\]
\[\sigma \sim \text{Half-Normal}(\text{TBD}) \tag{regularizing hyperprior}\]
<p>Non-centered equation</p>
\[C_i \sim \text{Binomial}(1, p_i) \tag{binomial likelihood}\]
\[\text{logit}(p_i) = \bar{\alpha} + z_{\text{district}[i]} \sigma \tag{substituting for alpha}\]
\[z \sim \text{Normal}(0, 1) \tag{adaptive prior}\]
\[\bar{\alpha} \sim \text{Normal}(0, 1.5) \tag{regularizing hyperprior}\]
\[\sigma \sim \text{Half-Normal}(0.5) \tag{regularizing hyperprior}\]
<p>The non-centered equation gives us a posterior that is easier to explore, while allowing us to use transformations to get back the numerical values we care about.</p>
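<p>The substitution works because if $z \sim \text{Normal}(0, 1)$, then $\bar{\alpha} + z\sigma$ has a $\text{Normal}(\bar{\alpha}, \sigma)$ distribution. A quick simulation (my own sketch, with arbitrary parameter values) confirms the two parameterizations describe the same distribution for $\alpha$:</p>

```python
# Centered vs. non-centered draws of alpha come from the same
# Normal(a_bar, sigma) distribution; arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(19)
a_bar, sigma, n = -0.5, 0.9, 200_000

a_centered = rng.normal(a_bar, sigma, n)  # a ~ Normal(a_bar, sigma)
z = rng.normal(0.0, 1.0, n)               # z ~ Normal(0, 1)
a_noncentered = a_bar + z * sigma         # transform back to alpha

print(a_centered.mean(), a_noncentered.mean())  # both near -0.5
print(a_centered.std(), a_noncentered.std())    # both near 0.9
```

<p>The sampler only ever sees the well-behaved standard-normal <code class="language-plaintext highlighter-rouge">z</code>; the geometry-inducing multiplication by <code class="language-plaintext highlighter-rouge">sigma</code> happens deterministically afterward.</p>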
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mme4</span><span class="p">:</span>
<span class="c1"># prior for average district
</span> <span class="n">a_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">)</span>
    <span class="c1"># prior for SD of districts
</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">HalfNormal</span><span class="p">(</span><span class="s">"sigma"</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="c1"># our substitution
</span> <span class="n">z</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"z"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
    <span class="c1"># the centered alpha prior we are replacing:
</span> <span class="c1"># a = pm.Normal("a", a_bar, sigma, shape=len(df_bangladesh_first3["district_code"].unique()))
</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a_bar</span> <span class="o">+</span> <span class="n">z</span><span class="p">[</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]]</span> <span class="o">*</span> <span class="n">sigma</span><span class="p">)</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh_first3</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mme4</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">target_accept</span><span class="o">=</span><span class="mf">0.99</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
INFO:pymc3:Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
INFO:pymc3:Multiprocess sampling (4 chains in 4 jobs)
NUTS: [z, sigma, a_bar]
INFO:pymc3:NUTS: [z, sigma, a_bar]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 22 seconds.
INFO:pymc3:Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 22 seconds.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme4</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a_bar</th>
<td>-0.731</td>
<td>0.409</td>
<td>-1.343</td>
<td>-0.137</td>
<td>0.014</td>
<td>0.010</td>
<td>905.0</td>
<td>905.0</td>
<td>1116.0</td>
<td>1060.0</td>
<td>1.0</td>
</tr>
<tr>
<th>z[0]</th>
<td>-0.513</td>
<td>0.883</td>
<td>-1.884</td>
<td>0.894</td>
<td>0.025</td>
<td>0.018</td>
<td>1238.0</td>
<td>1238.0</td>
<td>1245.0</td>
<td>1783.0</td>
<td>1.0</td>
</tr>
<tr>
<th>z[1]</th>
<td>0.059</td>
<td>0.870</td>
<td>-1.237</td>
<td>1.504</td>
<td>0.019</td>
<td>0.015</td>
<td>2150.0</td>
<td>1661.0</td>
<td>2138.0</td>
<td>2537.0</td>
<td>1.0</td>
</tr>
<tr>
<th>z[2]</th>
<td>0.453</td>
<td>1.002</td>
<td>-1.138</td>
<td>2.039</td>
<td>0.021</td>
<td>0.016</td>
<td>2334.0</td>
<td>1919.0</td>
<td>2332.0</td>
<td>2212.0</td>
<td>1.0</td>
</tr>
<tr>
<th>sigma</th>
<td>0.393</td>
<td>0.290</td>
<td>0.000</td>
<td>0.783</td>
<td>0.008</td>
<td>0.006</td>
<td>1216.0</td>
<td>1216.0</td>
<td>1037.0</td>
<td>1015.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
<p>Hallelujah! We’ve got no divergences and great <code class="language-plaintext highlighter-rouge">ess_mean</code> and <code class="language-plaintext highlighter-rouge">r_hat</code> values. Let’s visualize what we’ve got.</p>
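<p>Rather than eyeballing the summary table, it can be handy to flag suspect parameters programmatically. Here is a small sketch (the threshold values are common rules of thumb, not hard cutoffs, and <code class="language-plaintext highlighter-rouge">flag_diagnostics</code> is a helper I’m making up for illustration) that works on a frame shaped like the one <code class="language-plaintext highlighter-rouge">az.summary</code> returns:</p>

```python
import pandas as pd

def flag_diagnostics(summary_df, rhat_max=1.01, ess_min=400):
    """Return rows of an ArviZ-style summary whose r_hat or bulk ESS look suspect."""
    bad_rhat = summary_df["r_hat"] > rhat_max
    bad_ess = summary_df["ess_bulk"] < ess_min
    return summary_df[bad_rhat | bad_ess]

# e.g. flag_diagnostics(az.summary(trace_mme4)) should come back empty for this model
```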
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">plot_pair</span><span class="p">(</span><span class="n">trace_mme4</span><span class="p">,</span> <span class="n">kind</span><span class="o">=</span><span class="s">'kde'</span><span class="p">,</span> <span class="n">divergences</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[<AxesSubplot:ylabel='z\n0'>, <AxesSubplot:>, <AxesSubplot:>,
<AxesSubplot:>],
[<AxesSubplot:ylabel='z\n1'>, <AxesSubplot:>, <AxesSubplot:>,
<AxesSubplot:>],
[<AxesSubplot:ylabel='z\n2'>, <AxesSubplot:>, <AxesSubplot:>,
<AxesSubplot:>],
[<AxesSubplot:xlabel='a_bar', ylabel='sigma'>,
<AxesSubplot:xlabel='z\n0'>, <AxesSubplot:xlabel='z\n1'>,
<AxesSubplot:xlabel='z\n2'>]], dtype=object)
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_60_1.png" alt="png" /></p>
<p>Clean pair plots!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">az</span><span class="p">.</span><span class="n">plot_rank</span><span class="p">(</span><span class="n">trace_mme4</span><span class="p">,</span> <span class="n">var_names</span><span class="o">=</span><span class="s">'sigma'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><AxesSubplot:title={'center':'sigma'}, xlabel='Rank (all chains)', ylabel='Chain'>
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_62_1.png" alt="png" /></p>
<p>Great rank plots!</p>
<p>Now let’s get our <code class="language-plaintext highlighter-rouge">a</code> values back. It might be a little tricky to see where this came from, so let’s review the most important parts of the centered and non-centered equations.</p>
<p>Centered equation
\(\text{logit}(p_i) = \alpha_{\text{district}[i]} \tag{linear model using logit link}\)
\(\alpha_j \sim \text{Normal}(\bar{\alpha}, \sigma) \tag{adaptive prior}\)</p>
<p>Non-centered equation
\(\text{logit}(p_i) = \bar{\alpha} + z_{\text{district}[i]} \sigma \tag{substituting for alpha}\)
\(z \sim \text{Normal}(0, 1) \tag{adaptive prior}\)</p>
<p>$\alpha$ became $\bar{\alpha} + z_{\text{district}[i]} \sigma$.</p>
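<p>The two parameterizations describe the same prior; only the geometry presented to the sampler changes. As a quick sanity check in plain NumPy (the <code class="language-plaintext highlighter-rouge">a_bar</code> and <code class="language-plaintext highlighter-rouge">sigma</code> values here are arbitrary, not the fitted ones), drawing $z \sim \text{Normal}(0, 1)$ and computing $\bar{\alpha} + z\sigma$ gives the same distribution as drawing from $\text{Normal}(\bar{\alpha}, \sigma)$ directly:</p>

```python
import numpy as np

rng = np.random.default_rng(19)
a_bar, sigma = -0.7, 0.4

# centered: draw alpha directly from its adaptive prior
alpha_centered = rng.normal(a_bar, sigma, size=100_000)

# non-centered: draw a standardized z, then scale and shift
z = rng.normal(0.0, 1.0, size=100_000)
alpha_noncentered = a_bar + z * sigma

print(alpha_centered.mean(), alpha_noncentered.mean())  # both ≈ -0.7
print(alpha_centered.std(), alpha_noncentered.std())    # both ≈ 0.4
```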
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trace_mme4_df</span> <span class="o">=</span> <span class="n">trace_mme4</span><span class="p">.</span><span class="n">to_dataframe</span><span class="p">()</span>
<span class="n">trace_mme4_df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">7</span><span class="p">]</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>chain</th>
<th>draw</th>
<th>(posterior, a_bar)</th>
<th>(posterior, z[0], 0)</th>
<th>(posterior, z[1], 1)</th>
<th>(posterior, z[2], 2)</th>
<th>(posterior, sigma)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>0</td>
<td>0.185127</td>
<td>-1.501203</td>
<td>-0.768144</td>
<td>1.673570</td>
<td>0.547466</td>
</tr>
<tr>
<th>1</th>
<td>0</td>
<td>1</td>
<td>-0.025513</td>
<td>-1.515975</td>
<td>-1.125922</td>
<td>-0.096766</td>
<td>0.684001</td>
</tr>
<tr>
<th>2</th>
<td>0</td>
<td>2</td>
<td>-0.334302</td>
<td>-1.174676</td>
<td>-1.535301</td>
<td>0.709949</td>
<td>0.569022</td>
</tr>
<tr>
<th>3</th>
<td>0</td>
<td>3</td>
<td>-0.134617</td>
<td>-2.165963</td>
<td>-1.432953</td>
<td>0.741756</td>
<td>0.376201</td>
</tr>
<tr>
<th>4</th>
<td>0</td>
<td>4</td>
<td>-0.151168</td>
<td>-2.197087</td>
<td>-1.454579</td>
<td>0.214046</td>
<td>0.226324</td>
</tr>
</tbody>
</table>
</div>
<p>Therefore, we can get the <code class="language-plaintext highlighter-rouge">a</code> parameter for each district (<code class="language-plaintext highlighter-rouge">a[0], a[1], a[2]</code>) by doing the appropriate arithmetic on each row. I’m going to ignore the chains for now.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Initialize
</span><span class="n">df_a</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">trace_mme4_df</span><span class="p">),</span> <span class="mi">3</span><span class="p">)))</span>
<span class="c1"># Fill in rows with transformation
</span><span class="n">df_a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'a_bar'</span><span class="p">)]</span> <span class="o">+</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'z[0]'</span><span class="p">,</span> <span class="mi">0</span><span class="p">)]</span> <span class="o">*</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'sigma'</span><span class="p">)]</span>
<span class="n">df_a</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'a_bar'</span><span class="p">)]</span> <span class="o">+</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'z[1]'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span> <span class="o">*</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'sigma'</span><span class="p">)]</span>
<span class="n">df_a</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'a_bar'</span><span class="p">)]</span> <span class="o">+</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'z[2]'</span><span class="p">,</span> <span class="mi">2</span><span class="p">)]</span> <span class="o">*</span> <span class="n">trace_mme4_df</span><span class="p">[(</span><span class="s">'posterior'</span><span class="p">,</span> <span class="s">'sigma'</span><span class="p">)]</span>
</code></pre></div></div>
<p>We can plot our parameters now, with a bit more plotting code for the middle plot since we have raw values. And just as a reminder of the original goal of the post, we’ll show the fixed effects model as well. We’ll plot them on the same x-scale to show the dramatic difference that partial pooling has.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">,</span> <span class="n">ax3</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># z parameters, mixed effects
</span><span class="n">az</span><span class="p">.</span><span class="n">plot_forest</span><span class="p">(</span><span class="n">trace_mme4</span><span class="p">,</span> <span class="n">var_names</span><span class="o">=</span><span class="s">'z'</span><span class="p">,</span> <span class="n">combined</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"z parameters</span><span class="se">\n</span><span class="s">mixed-effects"</span><span class="p">)</span>
<span class="c1"># a parameters, mixed effects
</span><span class="n">ax2</span><span class="p">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">xmin</span><span class="o">=</span><span class="n">df_a</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.055</span><span class="p">),</span> <span class="n">xmax</span><span class="o">=</span><span class="n">df_a</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.945</span><span class="p">),</span> <span class="n">y</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">xmin</span><span class="o">=</span><span class="n">df_a</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.055</span><span class="p">),</span> <span class="n">xmax</span><span class="o">=</span><span class="n">df_a</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.945</span><span class="p">),</span> <span class="n">y</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">xmin</span><span class="o">=</span><span class="n">df_a</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.055</span><span class="p">),</span> <span class="n">xmax</span><span class="o">=</span><span class="n">df_a</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.945</span><span class="p">),</span> <span class="n">y</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df_a</span><span class="p">.</span><span class="n">mean</span><span class="p">(),</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">),</span> <span class="n">facecolors</span><span class="o">=</span><span class="s">'white'</span><span class="p">,</span> <span class="n">edgecolors</span><span class="o">=</span><span class="s">'blue'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">invert_yaxis</span><span class="p">()</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">([</span><span class="s">"a"</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)])</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"a parameters</span><span class="se">\n</span><span class="s">mixed-effects"</span><span class="p">)</span>
<span class="c1"># a parameters, fixed effects
</span><span class="n">az</span><span class="p">.</span><span class="n">plot_forest</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">,</span> <span class="n">var_names</span><span class="o">=</span><span class="s">'a'</span><span class="p">,</span> <span class="n">combined</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax3</span><span class="p">)</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"a parameters</span><span class="se">\n</span><span class="s">fixed-effects"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0.5, 1.0, 'a parameters\nfixed-effects')
</code></pre></div></div>
<p><img src="/assets/2021-11-23-diagnosing-a-model_files/2021-11-23-diagnosing-a-model_70_1.png" alt="png" /></p>
<p>In the mixed-effects model, we see that the point estimates of the <code class="language-plaintext highlighter-rouge">a</code> parameters mirror the pattern of the <code class="language-plaintext highlighter-rouge">z</code> parameters, which is to be expected. Of course, after we have gone through our model diagnostics and iterations, the bigger picture is that the multilevel model yields a much different result than the fixed-effects model that we created at the start of the post. We see shrinkage of our estimates, so the districts appear to differ less among themselves than the fixed-effects model originally suggested. We also see a reduction in variance. Both the movement in the point estimate and the reduction in variance are most dramatic for district 2, which has the smallest sample size of the three shown here.</p>
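<p>To build intuition for why the small district shrinks the most, here is a toy normal-normal approximation of partial pooling. This is not the binomial model fit above, and the <code class="language-plaintext highlighter-rouge">sigma_within</code> and <code class="language-plaintext highlighter-rouge">tau</code> values are made up; the point is only that each district estimate behaves like a precision-weighted average of its raw mean and the grand mean, so less data means more pull toward the grand mean:</p>

```python
def partial_pool(raw_mean, n, grand_mean, sigma_within, tau):
    """Normal-normal approximation of a varying-intercept estimate:
    a precision-weighted average of the district's raw mean and the grand mean."""
    w_data = n / sigma_within**2   # precision contributed by the district's data
    w_prior = 1 / tau**2           # precision contributed by the population prior
    return (w_data * raw_mean + w_prior * grand_mean) / (w_data + w_prior)

grand, sw, tau = -0.7, 1.0, 0.4
big = partial_pool(raw_mean=0.5, n=100, grand_mean=grand, sigma_within=sw, tau=tau)
small = partial_pool(raw_mean=0.5, n=5, grand_mean=grand, sigma_within=sw, tau=tau)
# same raw mean, but the small-sample district is pulled much closer to the grand mean
```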
<h1 id="summary">Summary</h1>
<p>Well, what started as a simple post led to a deeper understanding of how to diagnose a model and which knobs to fiddle with. More often than not, we’ll have to turn to alternate parameterizations as we did here. In another post, we’ll get back to the original goal of understanding how partial pooling happens, resulting in shrinkage of estimates for our clusters.</p>
<h1 id="appendix-environment-and-system-parameters">Appendix: Environment and system parameters</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last updated: Mon Nov 22 2021
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.20.0
scipy : 1.6.0
sys : 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:22:12)
[Clang 11.0.1 ]
pandas : 1.2.1
seaborn : 0.11.1
pymc3 : 3.11.0
numpy : 1.20.1
theano : 1.1.0
arviz : 0.11.1
matplotlib: 3.3.4
Watermark: 2.1.0
</code></pre></div></div>
<h1 id="multilevel-modeling-with-binomial-glm">Multilevel modeling with binomial GLM</h1>
<p><em>Ben Lacar, 2021-10-23, https://benslack19.github.io/data%20science/statistics/multilevel_modeling_01</em></p>
<p>I’ve been on a journey learning multilevel models and Bayesian inference through <a href="https://xcelab.net/rm/statistical-rethinking/">Richard McElreath’s Statistical Rethinking book</a>. The concepts of shrinkage and partial pooling that are inherent to multilevel models are really interesting to me. First, let’s get some terminology out of the way. As Dr. McElreath highlights, there are multiple terms that are used for multilevel models.</p>
<ul>
<li>hierarchical models</li>
<li>mixed-effects models</li>
<li>varying effects models</li>
</ul>
<p>Be mindful, though, because other authors may have different definitions for these terms. In this post, I will use these terms interchangeably, as Dr. McElreath does. Regardless of what they’re called, multilevel models use information sharing based on grouping variables to make more accurate estimates, especially when groups are of variable sizes. These concepts only made sense to me after working through a problem. Let’s look at the impact of a multilevel model structure using a binomial generalized linear model (GLM) example from the book. Dr. McElreath uses the “tadpole data”, but here I’ll use problem 13H1, which has similar concepts. This question illustrates the use of “varying intercepts”, the simplest kind of varying effects model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">arviz</span> <span class="k">as</span> <span class="n">az</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="n">pm</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">expit</span> <span class="k">as</span> <span class="n">logistic</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">logit</span>
<span class="kn">from</span> <span class="nn">scipy.optimize</span> <span class="kn">import</span> <span class="n">curve_fit</span>
<span class="kn">from</span> <span class="nn">causalgraphicalmodels</span> <span class="kn">import</span> <span class="n">CausalGraphicalModel</span>
<span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">tensor</span> <span class="k">as</span> <span class="n">tt</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">load_ext</span> <span class="n">nb_black</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
<span class="n">RANDOM_SEED</span> <span class="o">=</span> <span class="mi">8927</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">RANDOM_SEED</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">use</span><span class="p">(</span><span class="s">"arviz-darkgrid"</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"stats.hdi_prob"</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.89</span> <span class="c1"># sets default credible interval used by arviz
</span><span class="n">sns</span><span class="p">.</span><span class="n">set_context</span><span class="p">(</span><span class="s">"talk"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The nb_black extension is already loaded. To reload it, use:
%reload_ext nb_black
The watermark extension is already loaded. To reload it, use:
%reload_ext watermark
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">standardize</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>
<h1 id="problem-description">Problem description</h1>
<p><em>The description of problem 13H1 is taken directly from the book.</em></p>
<blockquote>
<p>In 1980, a typical Bengali woman could have 5 or more children in her lifetime. By the year 2000, a typical Bengali woman had only 2 or 3. You’re going to look at a historical set of data, when contraception was widely available but many families chose not to use it. These data reside in <code class="language-plaintext highlighter-rouge">data(bangladesh)</code> and come from the 1988 Bangladesh Fertility Survey. Each row is one of 1934 women. There are six variables, but you can focus on two of them for this practice problem:</p>
</blockquote>
<blockquote>
<ol>
<li><code class="language-plaintext highlighter-rouge">district</code>: ID number of administrative district each woman resided in</li>
<li><code class="language-plaintext highlighter-rouge">use.contraception</code>: An indicator (0/1) of whether the woman was using contraception</li>
</ol>
</blockquote>
<blockquote>
<p>The first thing to do is ensure that the cluster variable, <code class="language-plaintext highlighter-rouge">district</code>, is a contiguous set of integers. Recall that these values will be index values inside the model. If there are gaps, you’ll have parameters for which there is no data to inform them. Worse, the model probably won’t run. Look at the unique values of the <code class="language-plaintext highlighter-rouge">district</code> variable:</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R code 13.40
sort(unique(d$district))
[1] 1 2 3 4 5 ... 51 52 53 55 56 57.... 61
</code></pre></div></div>
<blockquote>
<p>District 54 is absent. So <code class="language-plaintext highlighter-rouge">district</code> isn’t yet a good index variable, because it’s not contiguous. This is easy to fix. Just make a new variable that is contiguous. This is enough to do it:</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R code 13.41
d$district_id <-as.integer(as.factor(d$district))
sort(unique(d$district_id))
[1] 1 2 3 4 5 ... 60
</code></pre></div></div>
<blockquote>
<p>Now there are 60 values, contiguous integers 1 to 60. Now, focus on predicting <code class="language-plaintext highlighter-rouge">use.contraception</code>, clustered by <code class="language-plaintext highlighter-rouge">district_id</code>. Fit both (1) a traditional fixed-effects model that uses an index variable for district and (2) a multilevel model with varying intercepts for district. Plot the predicted proportions of women in each district using contraception, for both the fixed-effects model and the varying-effects model. That is, make a plot in which district ID is on the horizontal axis and expected proportion using contraception is on the vertical. Make one plot for each model, or layer them on the same plot, as you prefer. How do the models disagree? Can you explain the pattern of disagreement? In particular, can you explain the most extreme cases of disagreement, both why they happen where they do and why the models reach different inferences?</p>
</blockquote>
<p>This problem provides some nice opportunities for how multilevel models work so I’ll make some additional plots in addition to what the question asks for. Let’s dive in!</p>
<h1 id="data-exploration-and-setup">Data exploration and setup</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_bangladesh</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
<span class="s">"../pymc3_ed_resources/resources/Rethinking/Data/bangladesh.csv"</span><span class="p">,</span>
<span class="n">delimiter</span><span class="o">=</span><span class="s">";"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">df_bangladesh</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>woman</th>
<th>district</th>
<th>use.contraception</th>
<th>living.children</th>
<th>age.centered</th>
<th>urban</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>1</td>
<td>0</td>
<td>4</td>
<td>18.4400</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>-5.5599</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>1</td>
<td>0</td>
<td>3</td>
<td>1.4400</td>
<td>1</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>1</td>
<td>0</td>
<td>4</td>
<td>8.4400</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>-13.5590</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
<p>The dataframe has several columns, but for this problem, we’ll focus on only the outcome variable <code class="language-plaintext highlighter-rouge">use.contraception</code> and the <code class="language-plaintext highlighter-rouge">district</code> feature. Note how each row represents one woman.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"shape of df: "</span><span class="p">,</span> <span class="n">df_bangladesh</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shape of df: (1934, 6)
</code></pre></div></div>
<p>Per the assignment, we fix the district variable since it is not a contiguous set of integers. Luckily, this is easy enough to do with <code class="language-plaintext highlighter-rouge">pd.Categorical</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district"</span><span class="p">]).</span><span class="n">codes</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># inspect and see that it's now 0-indexed for Python
</span>
<span class="k">print</span><span class="p">(</span>
<span class="s">"Head of the dataframe: "</span><span class="p">,</span> <span class="n">df_bangladesh</span><span class="p">[[</span><span class="s">"district"</span><span class="p">,</span> <span class="s">"district_code"</span><span class="p">]].</span><span class="n">drop_duplicates</span><span class="p">().</span><span class="n">head</span><span class="p">()</span>
<span class="p">)</span>
<span class="c1"># and also that it accounts for missing district 54
</span><span class="k">print</span><span class="p">(</span>
<span class="s">"Tail: of the dataframe "</span><span class="p">,</span> <span class="n">df_bangladesh</span><span class="p">[[</span><span class="s">"district"</span><span class="p">,</span> <span class="s">"district_code"</span><span class="p">]].</span><span class="n">drop_duplicates</span><span class="p">().</span><span class="n">tail</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Head of the dataframe: district district_code
0 1 0
117 2 1
137 3 2
139 4 3
169 5 4
Tail: of the dataframe district district_code
1622 51 50
1659 52 51
1720 53 52
1739 55 53
1745 56 54
1790 57 55
1817 58 56
1850 59 57
1860 60 58
1892 61 59
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Inspect the outcome variable
</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 1175
1 759
Name: use.contraception, dtype: int64
</code></pre></div></div>
<p>Now that we have a sense for how the data is structured, we can start building our models. To appreciate the mixed-effects model, we will start by creating a fixed-effects model so that we can highlight the differences between them.</p>
<h1 id="fixed-effects-model">Fixed-effects model</h1>
<p>Our goal is to predict <code class="language-plaintext highlighter-rouge">use.contraception</code>. Since there are two possible outcomes, it makes sense to use a binomial GLM for this problem. We’ll use an index variable for district, and it will be an intercept-only model. We are using a binomial likelihood where the number of trials <em>n</em> is 1 for each observation, since each row of our dataset represents one woman. (Alternatively, we could have used a Bernoulli likelihood.) The parameter <em>p</em> is the probability of a woman using contraception. We obtain it from the linear model on the second line, which uses the <a href="https://en.wikipedia.org/wiki/Logit">logit function</a> as our link function for the binomial GLM. Finally, we have a regularizing prior for $\alpha$ so that our considered values for <em>p</em> are within reason. What $\alpha$ represents in this case is the average contraception use for each district, <em>regardless</em> of any other variables in our dataframe, because we have omitted them from our model. This point will be contrasted with a future post where we build on this problem.</p>
<p><strong>Model <code class="language-plaintext highlighter-rouge">mfe</code> equation</strong></p>
\[C_i \sim \text{Binomial}(1, p_i) \tag{binomial likelihood}\]
\[\text{logit}(p_i) = \alpha_{\text{district}[i]} \tag{linear model using logit link}\]
\[\alpha_j \sim \text{Normal}(0, 1.5) \tag{regularizing prior}\]
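To see why $\text{Normal}(0, 1.5)$ works as a regularizing prior, we can push draws from it through the inverse-logit and inspect the implied prior on the probability scale. This is an illustrative check (not part of the original analysis), using only <code class="language-plaintext highlighter-rouge">numpy</code>:

```python
import numpy as np

rng = np.random.default_rng(19)

# draw alphas from the Normal(0, 1.5) prior on the log-odds scale
alpha = rng.normal(0.0, 1.5, size=10_000)

# inverse-logit (expit): map log-odds to probabilities
p = 1.0 / (1.0 + np.exp(-alpha))

# the implied prior on p stays inside (0, 1) and is centered near 0.5,
# keeping "considered values for p within reason"
print(p.min(), p.max())
print(np.quantile(p, [0.055, 0.945]))
```

A much wider prior (say, SD of 10) would instead pile most of its probability mass near 0 and 1, which is rarely what we intend.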
<p>Now let’s use <code class="language-plaintext highlighter-rouge">pymc</code> to build our model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mfe</span><span class="p">:</span>
<span class="c1"># alpha prior, one for each district
</span> <span class="n">a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]])</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mfe</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
INFO:pymc3:Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
INFO:pymc3:Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a]
INFO:pymc3:NUTS: [a]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 21 seconds.
INFO:pymc3:Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 21 seconds.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># View summary of mfe model
</span><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">).</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a[0]</th>
<td>-1.052</td>
<td>0.205</td>
<td>-1.389</td>
<td>-0.730</td>
<td>0.002</td>
<td>0.002</td>
<td>10791.0</td>
<td>8837.0</td>
<td>11082.0</td>
<td>2622.0</td>
<td>1.0</td>
</tr>
<tr>
<th>a[1]</th>
<td>-0.584</td>
<td>0.452</td>
<td>-1.287</td>
<td>0.134</td>
<td>0.005</td>
<td>0.005</td>
<td>9885.0</td>
<td>4432.0</td>
<td>9897.0</td>
<td>2824.0</td>
<td>1.0</td>
</tr>
<tr>
<th>a[2]</th>
<td>1.240</td>
<td>1.156</td>
<td>-0.647</td>
<td>2.980</td>
<td>0.012</td>
<td>0.014</td>
<td>8647.0</td>
<td>3240.0</td>
<td>8900.0</td>
<td>2517.0</td>
<td>1.0</td>
</tr>
<tr>
<th>a[3]</th>
<td>-0.003</td>
<td>0.362</td>
<td>-0.572</td>
<td>0.579</td>
<td>0.004</td>
<td>0.007</td>
<td>9690.0</td>
<td>1409.0</td>
<td>9763.0</td>
<td>2666.0</td>
<td>1.0</td>
</tr>
<tr>
<th>a[4]</th>
<td>-0.569</td>
<td>0.330</td>
<td>-1.077</td>
<td>-0.020</td>
<td>0.004</td>
<td>0.003</td>
<td>8362.0</td>
<td>4628.0</td>
<td>8431.0</td>
<td>2847.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>
</div>
<p>Let’s visualize the posterior distributions of the $\alpha$ parameter for each district.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">16</span><span class="p">))</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_forest</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">,</span> <span class="n">combined</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">"log-odds"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"alpha for each district"</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'posterior distribution (fixed effects model)'</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/assets/2021-10-23-multilevel_modeling_01_files/2021-10-23-multilevel_modeling_01_19_0.png" alt="png" /></p>
<p>We can learn a lot by looking at these results. First, it is clear that some districts are much less likely to use contraception (negative log-odds) than others (log-odds that span zero or are wholly positive). The district indexed as 10 (originally district 11 in the raw dataset) has the lowest mean estimate of contraceptive usage; if you inspect the raw data, no woman in that district used contraception.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"No. of district index 10 women who used contraception: "</span><span class="p">,</span> <span class="p">(</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]</span> <span class="o">==</span> <span class="mi">10</span><span class="p">][</span><span class="s">'use.contraception'</span><span class="p">]).</span><span class="nb">sum</span><span class="p">())</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>No. of district index 10 women who used contraception: 0
</code></pre></div></div>
<p>Another point worth our focus is the width of the credible intervals, which represents the uncertainty of our estimates. For example, district index 2 has the most positive mean estimate among all the districts, but its credible interval is exceptionally wide, ranging from a log-odds of -0.647 to 2.980. Other districts, however, have relatively narrow credible intervals, such as district index 13. This difference in variability can be explained by the number of women in each district: higher counts yield lower uncertainty.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"Top 5 lowest districts for counts of women:</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">df_bangladesh</span><span class="p">[</span><span class="s">'district_code'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">().</span><span class="n">sort_values</span><span class="p">().</span><span class="n">head</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Top 5 highest districts for counts of women:</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">df_bangladesh</span><span class="p">[</span><span class="s">'district_code'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">().</span><span class="n">sort_values</span><span class="p">().</span><span class="n">tail</span><span class="p">())</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Top 5 lowest districts for counts of women:
2 2
48 4
53 6
57 10
41 11
Name: district_code, dtype: int64
Top 5 highest districts for counts of women:
5 65
24 67
45 86
0 117
13 118
Name: district_code, dtype: int64
</code></pre></div></div>
<p>While the fixed-effects model is a reasonable approach, we can do better with a multilevel (mixed-effects) model. Let’s do that next.</p>
<h1 id="mixed-effects-model">Mixed-effects model</h1>
<p>Here we can allow information to pool between clusters (districts). This makes sense since the number of women varies across districts, as we identified above. We would expect our district index 2 estimate to become more precise (a narrower credible interval).</p>
<p>How can information be shared? This is where the structure of our equations can give some insight. The main change is the third line in the equations below. We’re now using an <strong>adaptive prior</strong> that borrows information from <em>each</em> district to make better estimates for <em>all</em> districts. Instead of specifying numerical values in our prior for $\alpha_j$, we replace them with new parameters: an “average alpha” $\bar{\alpha}$ and a standard deviation $\sigma$. These in turn have their own priors, which we call hyperpriors. Seeing parameters embedded within other parameters is how we can appreciate the “multilevel”-ness of the multilevel model. (As McElreath states in an earlier lecture, it can become parameters all the way down.)</p>
<p><strong>Model <code class="language-plaintext highlighter-rouge">mme</code> equation</strong></p>
\[C_i \sim \text{Binomial}(1, p_i) \tag{binomial likelihood}\]
\[\text{logit}(p_i) = \alpha_{\text{district}[i]} \tag{linear model using logit link}\]
\[\alpha_j \sim \text{Normal}(\bar{\alpha}, \sigma) \tag{adaptive prior}\]
\[\bar{\alpha} \sim \text{Normal}(0, 1.5) \tag{regularizing hyperprior}\]
\[\sigma \sim \text{Exponential}(1) \tag{regularizing hyperprior}\]
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># multilevel model
</span><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">mme</span><span class="p">:</span>
<span class="c1"># prior for average district
</span> <span class="n">a_bar</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">)</span>
<span class="c1"># prior for SD of districts
</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">(</span><span class="s">"sigma"</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="c1"># alpha priors for each district
</span> <span class="n">a</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">a_bar</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">].</span><span class="n">unique</span><span class="p">()))</span>
<span class="c1"># link function
</span> <span class="n">p</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">invlogit</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"district_code"</span><span class="p">]])</span>
<span class="c1"># likelihood, n=1 since each represents an individual woman
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s">"c"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">])</span>
<span class="n">trace_mme</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">return_inferencedata</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">progressbar</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
INFO:pymc3:Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
INFO:pymc3:Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, sigma, a_bar]
INFO:pymc3:NUTS: [a, sigma, a_bar]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 17 seconds.
INFO:pymc3:Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 17 seconds.
</code></pre></div></div>
<p>It looks like there are no divergences here so we don’t have to worry about re-parameterizing. Let’s take a look now.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># View summary of mme model
</span><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme</span><span class="p">).</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean</th>
<th>sd</th>
<th>hdi_5.5%</th>
<th>hdi_94.5%</th>
<th>mcse_mean</th>
<th>mcse_sd</th>
<th>ess_mean</th>
<th>ess_sd</th>
<th>ess_bulk</th>
<th>ess_tail</th>
<th>r_hat</th>
</tr>
</thead>
<tbody>
<tr>
<th>a_bar</th>
<td>-0.540</td>
<td>0.088</td>
<td>-0.679</td>
<td>-0.397</td>
<td>0.002</td>
<td>0.001</td>
<td>3125.0</td>
<td>3125.0</td>
<td>3125.0</td>
<td>3066.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[0]</th>
<td>-0.992</td>
<td>0.198</td>
<td>-1.299</td>
<td>-0.679</td>
<td>0.003</td>
<td>0.002</td>
<td>5931.0</td>
<td>5154.0</td>
<td>5973.0</td>
<td>2419.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[1]</th>
<td>-0.599</td>
<td>0.360</td>
<td>-1.144</td>
<td>-0.001</td>
<td>0.004</td>
<td>0.004</td>
<td>7047.0</td>
<td>3716.0</td>
<td>7138.0</td>
<td>2443.0</td>
<td>1.01</td>
</tr>
<tr>
<th>a[2]</th>
<td>-0.240</td>
<td>0.501</td>
<td>-1.003</td>
<td>0.559</td>
<td>0.006</td>
<td>0.008</td>
<td>7605.0</td>
<td>2191.0</td>
<td>7501.0</td>
<td>3176.0</td>
<td>1.00</td>
</tr>
<tr>
<th>a[3]</th>
<td>-0.179</td>
<td>0.298</td>
<td>-0.652</td>
<td>0.309</td>
<td>0.004</td>
<td>0.004</td>
<td>6683.0</td>
<td>2486.0</td>
<td>6638.0</td>
<td>3036.0</td>
<td>1.00</td>
</tr>
</tbody>
</table>
</div>
<p>Let’s visualize by plotting the mixed effects model posterior side-by-side with the fixed-effects model that we already visualized above. This will help us appreciate the impact of the multilevel model structure.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">16</span><span class="p">),</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_forest</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">,</span> <span class="n">combined</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">"log-odds"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"alpha for each district"</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'posterior distribution (fixed effects model)'</span><span class="p">)</span>
<span class="n">az</span><span class="p">.</span><span class="n">plot_forest</span><span class="p">(</span><span class="n">trace_mme</span><span class="p">,</span> <span class="n">var_names</span><span class="o">=</span><span class="s">'a'</span><span class="p">,</span> <span class="n">combined</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax2</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">"log-odds"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"alpha for each district"</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'posterior distribution</span><span class="se">\n</span><span class="s">(mixed effects model)'</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/assets/2021-10-23-multilevel_modeling_01_files/2021-10-23-multilevel_modeling_01_31_0.png" alt="png" /></p>
<p>The first thing that jumps out is that the uncertainty for several districts is now much smaller, particularly those with small sample sizes like district index 2. This is where the multilevel model really shines. Another change is that the mean estimates get pulled towards the center, especially those with more extreme values in the fixed effects model. This shrinkage may be harder to appreciate in this visualization, so we will also plot on the outcome scale later. Let’s explore these differences in a few ways.</p>
<p>Let’s look more closely at how sample size impacts the uncertainty for each district in the fixed-effects versus mixed-effects model for each district.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a new dataframe
</span><span class="n">col2inspect</span> <span class="o">=</span> <span class="p">[</span><span class="s">"mean"</span><span class="p">,</span> <span class="s">"sd"</span><span class="p">,</span> <span class="s">"hdi_5.5%"</span><span class="p">,</span> <span class="s">"hdi_94.5%"</span><span class="p">]</span>
<span class="n">df_summary</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">pd</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span>
<span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mfe</span><span class="p">)[</span><span class="n">col2inspect</span><span class="p">],</span>
<span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme</span><span class="p">)[</span><span class="n">col2inspect</span><span class="p">],</span>
<span class="n">how</span><span class="o">=</span><span class="s">"inner"</span><span class="p">,</span>
<span class="n">left_index</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">right_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># Add number of women for each district
</span><span class="n">df_summary</span><span class="p">[</span><span class="s">"n_women"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_bangladesh</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"district_code"</span><span class="p">).</span><span class="n">count</span><span class="p">().</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="c1"># Inspect
</span><span class="n">df_summary</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mean_x</th>
<th>sd_x</th>
<th>hdi_5.5%_x</th>
<th>hdi_94.5%_x</th>
<th>mean_y</th>
<th>sd_y</th>
<th>hdi_5.5%_y</th>
<th>hdi_94.5%_y</th>
<th>n_women</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.052</td>
<td>0.205</td>
<td>-1.389</td>
<td>-0.730</td>
<td>-0.992</td>
<td>0.198</td>
<td>-1.299</td>
<td>-0.679</td>
<td>117</td>
</tr>
<tr>
<th>1</th>
<td>-0.584</td>
<td>0.452</td>
<td>-1.287</td>
<td>0.134</td>
<td>-0.599</td>
<td>0.360</td>
<td>-1.144</td>
<td>-0.001</td>
<td>20</td>
</tr>
<tr>
<th>2</th>
<td>1.240</td>
<td>1.156</td>
<td>-0.647</td>
<td>2.980</td>
<td>-0.240</td>
<td>0.501</td>
<td>-1.003</td>
<td>0.559</td>
<td>2</td>
</tr>
<tr>
<th>3</th>
<td>-0.003</td>
<td>0.362</td>
<td>-0.572</td>
<td>0.579</td>
<td>-0.179</td>
<td>0.298</td>
<td>-0.652</td>
<td>0.309</td>
<td>30</td>
</tr>
<tr>
<th>4</th>
<td>-0.569</td>
<td>0.330</td>
<td>-1.077</td>
<td>-0.020</td>
<td>-0.577</td>
<td>0.279</td>
<td>-1.007</td>
<td>-0.118</td>
<td>39</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span>
<span class="n">data</span><span class="o">=</span><span class="n">df_summary</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"sd_x"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"sd_y"</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="s">"n_women"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"black"</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span>
<span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.2</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.2</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">"black"</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span>
<span class="n">xlim</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.2</span><span class="p">],</span> <span class="n">ylim</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.2</span><span class="p">],</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">"fixed effects SD"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"mixed effects SD"</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">"Impact of sample size on SD"</span>
<span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(0.0, 1.2),
(0.0, 1.2),
Text(0.5, 0, 'fixed effects SD'),
Text(0, 0.5, 'mixed effects SD'),
Text(0.5, 1.0, 'Impact of sample size on SD')]
</code></pre></div></div>
<p><img src="/assets/2021-10-23-multilevel_modeling_01_files/2021-10-23-multilevel_modeling_01_34_1.png" alt="png" /></p>
<p>On the x-axis are the standard deviations of the $\alpha$ values for each district in the fixed effects model. On the y-axis are the corresponding SD values for the mixed effects model. The size of each point corresponds to the number of women in that district. The dashed diagonal line marks where the x- and y-axis values are equal.</p>
<p>Here, we can see that the fixed effects model shows greater uncertainty, especially when the number of women in a district is small (points toward the right of the plot). The lower uncertainty in the mixed effects model is due to partial pooling. When the number of women is high, the mixed effects model shows uncertainty on par with the fixed effects model, meaning there’s “less benefit” to using a mixed effects model, but it also doesn’t hurt.</p>
<p>Now, let’s plot on the outcome scale. Here, we’ll show the predicted proportion of women in each district using contraception with fixed-effects and mixed-effects estimates shown side-by-side. We’ll use the <code class="language-plaintext highlighter-rouge">logistic</code> function to transform the log-odds back on the probability scale.</p>
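<p>The <code class="language-plaintext highlighter-rouge">logistic</code> function used below isn’t defined in this excerpt; it is presumably the standard inverse-logit. A minimal sketch with the standard library:</p>

```python
import math

def logistic(x):
    """Inverse-logit: map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))  # 0.5: log-odds of 0 corresponds to a 50% probability
```

<p>In the notebook it is applied to whole pandas Series, so an elementwise <code class="language-plaintext highlighter-rouge">np.exp</code>-based version would be used in practice.</p>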
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">ax1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="c1"># Plot means
</span><span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span>
<span class="n">df_summary</span><span class="p">.</span><span class="n">index</span> <span class="o">-</span> <span class="mf">0.15</span><span class="p">,</span>
<span class="n">logistic</span><span class="p">(</span><span class="n">df_summary</span><span class="p">[</span><span class="s">"mean_x"</span><span class="p">]),</span>
<span class="n">color</span><span class="o">=</span><span class="s">"gray"</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="s">"fixed effect"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span>
<span class="n">df_summary</span><span class="p">.</span><span class="n">index</span> <span class="o">+</span> <span class="mf">0.15</span><span class="p">,</span>
<span class="n">logistic</span><span class="p">(</span><span class="n">df_summary</span><span class="p">[</span><span class="s">"mean_y"</span><span class="p">]),</span>
<span class="n">color</span><span class="o">=</span><span class="s">"blue"</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="s">"mixed effect"</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># Plot uncertainties
</span><span class="n">ax1</span><span class="p">.</span><span class="n">vlines</span><span class="p">(</span>
<span class="n">x</span><span class="o">=</span><span class="n">df_summary</span><span class="p">.</span><span class="n">index</span> <span class="o">-</span> <span class="mf">0.15</span><span class="p">,</span>
<span class="n">ymin</span><span class="o">=</span><span class="n">logistic</span><span class="p">(</span><span class="n">df_summary</span><span class="p">[</span><span class="s">"hdi_5.5%_x"</span><span class="p">]),</span>
<span class="n">ymax</span><span class="o">=</span><span class="n">logistic</span><span class="p">(</span><span class="n">df_summary</span><span class="p">[</span><span class="s">"hdi_94.5%_x"</span><span class="p">]),</span>
<span class="n">color</span><span class="o">=</span><span class="s">"gray"</span><span class="p">,</span>
<span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">vlines</span><span class="p">(</span>
<span class="n">x</span><span class="o">=</span><span class="n">df_summary</span><span class="p">.</span><span class="n">index</span> <span class="o">+</span> <span class="mf">0.15</span><span class="p">,</span>
<span class="n">ymin</span><span class="o">=</span><span class="n">logistic</span><span class="p">(</span><span class="n">df_summary</span><span class="p">[</span><span class="s">"hdi_5.5%_y"</span><span class="p">]),</span>
<span class="n">ymax</span><span class="o">=</span><span class="n">logistic</span><span class="p">(</span><span class="n">df_summary</span><span class="p">[</span><span class="s">"hdi_94.5%_y"</span><span class="p">]),</span>
<span class="n">color</span><span class="o">=</span><span class="s">"blue"</span><span class="p">,</span>
<span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># Plot average mixed effect line
</span><span class="n">me_mean</span> <span class="o">=</span> <span class="n">logistic</span><span class="p">(</span><span class="n">az</span><span class="p">.</span><span class="n">summary</span><span class="p">(</span><span class="n">trace_mme</span><span class="p">).</span><span class="n">loc</span><span class="p">[</span><span class="s">"a_bar"</span><span class="p">,</span> <span class="s">"mean"</span><span class="p">])</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mi">62</span><span class="p">],</span>
<span class="p">[</span><span class="n">me_mean</span><span class="p">,</span> <span class="n">me_mean</span><span class="p">],</span>
<span class="n">color</span><span class="o">=</span><span class="s">"blue"</span><span class="p">,</span>
<span class="n">lw</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="s">"mixed effect mean"</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># Plot raw fixed effect line
</span><span class="n">fe_mean</span> <span class="o">=</span> <span class="n">df_bangladesh</span><span class="p">[</span><span class="s">"use.contraception"</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mi">62</span><span class="p">],</span>
<span class="p">[</span><span class="n">fe_mean</span><span class="p">,</span> <span class="n">fe_mean</span><span class="p">],</span>
<span class="n">color</span><span class="o">=</span><span class="s">"red"</span><span class="p">,</span>
<span class="n">lw</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="s">"fixed effect mean"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax1</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span>
<span class="n">xlim</span><span class="o">=</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">60</span><span class="p">],</span>
<span class="n">ylim</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="n">xlabel</span><span class="o">=</span><span class="s">"district index"</span><span class="p">,</span>
<span class="n">ylabel</span><span class="o">=</span><span class="s">"proportion predicted</span><span class="se">\n</span><span class="s">for contraception use"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(-2.0, 60.0),
(0.0, 1.0),
Text(0.5, 0, 'district index'),
Text(0, 0.5, 'proportion predicted\nfor contraception use')]
</code></pre></div></div>
<p><img src="/assets/2021-10-23-multilevel_modeling_01_files/2021-10-23-multilevel_modeling_01_36_1.png" alt="png" /></p>
<p>The district index is shown on the x-axis, and the predicted proportion of contraception use is on the y-axis. This scale is more directly interpretable than log-odds when thinking about the proportion of women using contraception. The horizontal dashed lines mark the overall means for the fixed effects model (red) and the mixed effects model (blue). The two lines differ because the red line is the raw proportion pooled over all women, while the blue line is the model’s estimate of the average district (<code class="language-plaintext highlighter-rouge">a_bar</code>), which weighs districts differently than the raw average does. As we saw on the log-odds scale, the districts with the smallest numbers of women (like district index 2) have their estimates most affected by the multilevel model. The estimates get pulled toward the horizontal blue dashed line, illustrating the concept of <strong>shrinkage</strong> that results from the <strong>partial pooling</strong> of information across districts.</p>
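<p>The shrinkage behavior can be caricatured without any MCMC. A stylized sketch (not the pymc model above): treat the partially pooled estimate for a district as a precision-weighted average of its raw mean and the grand mean, where the weight on the district’s own data grows with its sample size. The <code class="language-plaintext highlighter-rouge">pool_strength</code> constant here is an illustrative assumption standing in for the estimated between-district variance, not a value estimated from the data.</p>

```python
def partial_pool(raw_mean, n, grand_mean, pool_strength=25.0):
    """Precision-weighted average: the weight on the district's own raw
    mean grows with its sample size n, so small districts get pulled
    toward the grand mean more than large ones."""
    w = n / (n + pool_strength)
    return w * raw_mean + (1 - w) * grand_mean

grand = 0.39                              # roughly the overall contraception-use rate
small = partial_pool(0.80, 2, grand)      # tiny district: pulled strongly toward 0.39
large = partial_pool(0.80, 100, grand)    # big district: stays near its raw mean of 0.80
print(small, large)
```

<p>This is only a caricature: the real model shrinks on the log-odds scale and estimates the pooling strength from the data, but the qualitative behavior matches the figure.</p>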
<h1 id="summary">Summary</h1>
<p>In this post, we covered a simple example of multilevel modeling using a binomial GLM. We used a dataset where clusters (districts) contained variable sample sizes. By contrasting a fixed effects model with a mixed effects model, we can see how multilevel modeling improves our estimates and reduces uncertainty. Here, we covered an example of varying intercepts. In a later post, we’ll add varying slopes which will help us incorporate predictor variables.</p>
<h1 id="appendix-environment-and-system-parameters">Appendix: Environment and system parameters</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last updated: Sat Oct 23 2021
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.20.0
pandas : 1.2.1
numpy : 1.20.1
scipy : 1.6.0
seaborn : 0.11.1
matplotlib: 3.3.4
sys : 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:22:12)
[Clang 11.0.1 ]
pymc3 : 3.11.0
arviz : 0.11.1
theano : 1.1.0
Watermark: 2.1.0
</code></pre></div></div>Ben LacarI’ve been on a journey learning multilevel models and Bayesian inference through Richard McElreath’s Statistical Rethinking book. The concepts of shrinkage and partial pooling that are inherent to multilevel models are really interesting to me. First, let’s get some terminology out of the way. As Dr. McElreath highlights, there are multiple terms that are used for multilevel models. hierarchical models mixed-effects models varying effects modelsWorking with PyTorch’s Dataset and Dataloader classes (part 1)2021-06-24T00:00:00+00:002021-06-24T00:00:00+00:00https://benslack19.github.io/data%20science/statistics/pytorch-dataset-objects<p>Recently, I built a simple NLP algorithm for a work project, following the template described in <a href="https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html#sphx-glr-beginner-nlp-deep-learning-tutorial-py">this tutorial</a>. As I looked to increase my model’s complexity, I started to come across references to <a href="https://pytorch.org/tutorials/beginner/basics/data_tutorial.html">Dataset and Dataloader</a> classes. I tried adapting my work-related code to use these objects, but I found myself running into <a href="https://media.giphy.com/media/xwEVCKetQWpeYyumJJ/giphy.gif">pesky bugs</a>. I thought I should take some time to figure out how to properly use <code class="language-plaintext highlighter-rouge">Dataset</code> and <code class="language-plaintext highlighter-rouge">Dataloader</code> objects. In this post, I adapt the PyTorch NLP tutorial to work with <code class="language-plaintext highlighter-rouge">Dataset</code> and <code class="language-plaintext highlighter-rouge">Dataloader</code> objects. Since my focus is primarily on using these objects, please refer to the <a href="https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html#sphx-glr-beginner-nlp-deep-learning-tutorial-py">tutorial</a> for details regarding the NLP model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">DataLoader</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><torch._C.Generator at 0x7fef88a746f0>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">load_ext</span> <span class="n">nb_black</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
<span class="o">%</span><span class="n">load_ext</span> <span class="n">watermark</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Figure aesthetics
</span><span class="n">sns</span><span class="p">.</span><span class="n">set_theme</span><span class="p">()</span>
<span class="n">sns</span><span class="p">.</span><span class="n">set_context</span><span class="p">(</span><span class="s">"talk"</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">"white"</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="first-attempt">First attempt</h1>
<p>The tutorial generates a simple dataset to use for a logistic regression bag-of-words classifier. It takes sentences and trains a classifier to predict whether each sentence is in English or Spanish. The data was originally structured so that each sample is a tuple containing a word list and a language label.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_data</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="s">"me gusta comer en la cafeteria"</span><span class="p">.</span><span class="n">split</span><span class="p">(),</span> <span class="s">"SPANISH"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"Give it to me"</span><span class="p">.</span><span class="n">split</span><span class="p">(),</span> <span class="s">"ENGLISH"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"No creo que sea una buena idea"</span><span class="p">.</span><span class="n">split</span><span class="p">(),</span> <span class="s">"SPANISH"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"No it is not a good idea to get lost at sea"</span><span class="p">.</span><span class="n">split</span><span class="p">(),</span> <span class="s">"ENGLISH"</span><span class="p">),</span>
<span class="p">]</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="s">"Yo creo que si"</span><span class="p">.</span><span class="n">split</span><span class="p">(),</span> <span class="s">"SPANISH"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"it is lost on me"</span><span class="p">.</span><span class="n">split</span><span class="p">(),</span> <span class="s">"ENGLISH"</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Before putting the data into the <code class="language-plaintext highlighter-rouge">Dataset</code> object, I’ll organize it into a dataframe for easier input.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Combine so we have one data object
</span><span class="n">data</span> <span class="o">=</span> <span class="n">train_data</span> <span class="o">+</span> <span class="n">test_data</span>
<span class="c1"># Put into a dataframe
</span><span class="n">df_data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">df_data</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">"words"</span><span class="p">,</span> <span class="s">"labels"</span><span class="p">]</span>
<span class="n">df_data</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>words</th>
<th>labels</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>[me, gusta, comer, en, la, cafeteria]</td>
<td>SPANISH</td>
</tr>
<tr>
<th>1</th>
<td>[Give, it, to, me]</td>
<td>ENGLISH</td>
</tr>
<tr>
<th>2</th>
<td>[No, creo, que, sea, una, buena, idea]</td>
<td>SPANISH</td>
</tr>
<tr>
<th>3</th>
<td>[No, it, is, not, a, good, idea, to, get, lost...</td>
<td>ENGLISH</td>
</tr>
<tr>
<th>4</th>
<td>[Yo, creo, que, si]</td>
<td>SPANISH</td>
</tr>
<tr>
<th>5</th>
<td>[it, is, lost, on, me]</td>
<td>ENGLISH</td>
</tr>
</tbody>
</table>
</div>
<h2 id="putting-the-data-in-dataset-and-output-with-dataloader">Putting the data in <code class="language-plaintext highlighter-rouge">Dataset</code> and output with <code class="language-plaintext highlighter-rouge">Dataloader</code></h2>
<p>Now it is time to put the data into a <code class="language-plaintext highlighter-rouge">Dataset</code> object. I referred to <a href="https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#">PyTorch’s tutorial on datasets and dataloaders</a> and <a href="https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00">this helpful example specific to custom text</a>, especially for making my own dataset class, which is shown here.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TextDataset</span><span class="p">(</span><span class="n">Dataset</span><span class="p">):</span>
<span class="s">"""
    Characterizes a custom text dataset for PyTorch
"""</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">ids</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
<span class="s">"""
Initialization. Ids can be useful after splitting the dataset.
"""</span>
<span class="bp">self</span><span class="p">.</span><span class="n">ids</span> <span class="o">=</span> <span class="n">ids</span>
<span class="bp">self</span><span class="p">.</span><span class="n">text</span> <span class="o">=</span> <span class="n">text</span>
<span class="bp">self</span><span class="p">.</span><span class="n">labels</span> <span class="o">=</span> <span class="n">labels</span>
<span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
        This is simply the number of labels in the dataset.
"""</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">labels</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">idx</span><span class="p">):</span>
<span class="s">"""
Generate one sample of data
"""</span>
<span class="n">label</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">labels</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
<span class="n">text</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">text</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
<span class="n">sample</span> <span class="o">=</span> <span class="p">{</span><span class="s">"Text"</span><span class="p">:</span> <span class="n">text</span><span class="p">,</span> <span class="s">"Label"</span><span class="p">:</span> <span class="n">label</span><span class="p">}</span>
<span class="k">return</span> <span class="n">sample</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Put train and test into dataset objects
</span><span class="n">train_ids</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">test_ids</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="n">train_DS1</span> <span class="o">=</span> <span class="n">TextDataset</span><span class="p">(</span>
<span class="n">train_ids</span><span class="p">,</span>
<span class="n">df_data</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">train_ids</span><span class="p">,</span> <span class="s">"words"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span>
<span class="n">df_data</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">train_ids</span><span class="p">,</span> <span class="s">"labels"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span>
<span class="p">)</span>
<span class="n">test_DS1</span> <span class="o">=</span> <span class="n">TextDataset</span><span class="p">(</span>
<span class="n">train_ids</span><span class="p">,</span>
<span class="n">df_data</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">test_ids</span><span class="p">,</span> <span class="s">"words"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span>
<span class="n">df_data</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">test_ids</span><span class="p">,</span> <span class="s">"labels"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span>
<span class="p">)</span>
</code></pre></div></div>
<p>When putting the data into their respective dataset objects, it is important to use the <code class="language-plaintext highlighter-rouge">.tolist()</code> method or else <code class="language-plaintext highlighter-rouge">DataLoader</code> will return an error when retrieving the data. Now let’s use <code class="language-plaintext highlighter-rouge">DataLoader</code> and a simple for loop to return the values of the data. I’ll use only the training data and a <code class="language-plaintext highlighter-rouge">batch_size</code> of 1 for this purpose.</p>
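<p>Why does <code class="language-plaintext highlighter-rouge">.tolist()</code> matter? One likely reason (my reading, not stated in the tutorial): a pandas <code class="language-plaintext highlighter-rouge">Series</code> sliced from <code class="language-plaintext highlighter-rouge">df_data</code> keeps its original index labels, so the positional-looking access in <code class="language-plaintext highlighter-rouge">__getitem__</code> becomes a label lookup and can raise a <code class="language-plaintext highlighter-rouge">KeyError</code>. A quick illustration:</p>

```python
import pandas as pd

# Mimic the test split: these rows keep their original index labels 4 and 5
labels = pd.Series(["SPANISH", "ENGLISH"], index=[4, 5])

try:
    labels[0]        # with an integer index this is a *label* lookup -> KeyError
    failed = False
except KeyError:
    failed = True

print(failed)                 # True
print(labels.tolist()[0])     # 'SPANISH' -- plain lists index positionally
```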
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_DL</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_DS1</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Batch size of 1"</span><span class="p">)</span>
<span class="k">for</span> <span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_DL</span><span class="p">):</span> <span class="c1"># Print the 'text' data of the batch
</span>
<span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Text data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Text"</span><span class="p">])</span> <span class="c1"># Print the 'class' data of batch
</span> <span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Label data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Label"</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Batch size of 1
0 Text data: [('me',), ('gusta',), ('comer',), ('en',), ('la',), ('cafeteria',)]
0 Label data: ['SPANISH']
1 Text data: [('Give',), ('it',), ('to',), ('me',)]
1 Label data: ['ENGLISH']
2 Text data: [('No',), ('creo',), ('que',), ('sea',), ('una',), ('buena',), ('idea',)]
2 Label data: ['SPANISH']
3 Text data: [('No',), ('it',), ('is',), ('not',), ('a',), ('good',), ('idea',), ('to',), ('get',), ('lost',), ('at',), ('sea',)]
3 Label data: ['ENGLISH']
</code></pre></div></div>
<p>At first glance, things might look okay, but the eagle-eyed will notice that each word in our list is now wrapped in its own single-element tuple. If we increase <code class="language-plaintext highlighter-rouge">batch_size</code> to 2, we get an <a href="https://media.giphy.com/media/eGNtzon6aSbPA6qgU4/giphy.gif">ugly error</a>.</p>
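<p>The wrapping comes from <code class="language-plaintext highlighter-rouge">DataLoader</code>’s default collate function, which effectively transposes a batch so that position <em>i</em> across all samples ends up together. With a batch of one sample whose text field is a word list, that zips the list into one-word tuples. A simplified mimic of that transposition reproduces the output above:</p>

```python
# One "batch" whose single sample's Text field is a word list
batch_text = [["me", "gusta", "comer", "en", "la", "cafeteria"]]

# default_collate-style transposition: zip position i across all samples
transposed = list(zip(*batch_text))
print(transposed)
# [('me',), ('gusta',), ('comer',), ('en',), ('la',), ('cafeteria',)]
```

<p>With two samples of different lengths in the batch, this zipping has no consistent shape to produce, hence the error below.</p>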
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_DL2</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_DS1</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Batch size of 2"</span><span class="p">)</span>
<span class="k">for</span> <span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_DL2</span><span class="p">):</span> <span class="c1"># Print the 'text' data of the batch
</span>
<span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Text data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Text"</span><span class="p">])</span> <span class="c1"># Print the 'class' data of batch
</span> <span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Label data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Label"</span><span class="p">],</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Batch size of 2
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-9-b81921277760> in <module>
2
3 print("Batch size of 2")
----> 4 for (idx, batch) in enumerate(train_DL2): # Print the 'text' data of the batch
5
6 print(idx, "Text data: ", batch["Text"]) # Print the 'class' data of batch
~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
515 if self._sampler_iter is None:
516 self._reset()
--> 517 data = self._next_data()
518 self._num_yielded += 1
519 if self._dataset_kind == _DatasetKind.Iterable and \
~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
555 def _next_data(self):
556 index = self._next_index() # may raise StopIteration
--> 557 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
558 if self._pin_memory:
559 data = _utils.pin_memory.pin_memory(data)
~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)
~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
71 return batch
72 elif isinstance(elem, container_abcs.Mapping):
---> 73 return {key: default_collate([d[key] for d in batch]) for key in elem}
74 elif isinstance(elem, tuple) and hasattr(elem, '_fields'): # namedtuple
75 return elem_type(*(default_collate(samples) for samples in zip(*batch)))
~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in <dictcomp>(.0)
71 return batch
72 elif isinstance(elem, container_abcs.Mapping):
---> 73 return {key: default_collate([d[key] for d in batch]) for key in elem}
74 elif isinstance(elem, tuple) and hasattr(elem, '_fields'): # namedtuple
75 return elem_type(*(default_collate(samples) for samples in zip(*batch)))
~/opt/anaconda3/envs/sdoh_text/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
79 elem_size = len(next(it))
80 if not all(len(elem) == elem_size for elem in it):
---> 81 raise RuntimeError('each element in list of batch should be of equal size')
82 transposed = zip(*batch)
83 return [default_collate(samples) for samples in transposed]
RuntimeError: each element in list of batch should be of equal size
</code></pre></div></div>
<p>What’s going on? After some investigation, which I’ll spare you, it appears that having each sample’s text already split into a list confuses <code class="language-plaintext highlighter-rouge">DataLoader</code>’s default collation. Let’s restructure our data differently.</p>
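To make the failure concrete, here is a minimal sketch that reproduces the error without any of our classes, by calling PyTorch's default collate function directly on two dict samples whose token lists have different lengths. (The import path for <code>default_collate</code> moved in torch 1.11, so both locations are tried.)

```python
try:
    from torch.utils.data import default_collate  # torch >= 1.11
except ImportError:
    from torch.utils.data._utils.collate import default_collate

# Two samples whose "Text" fields are token lists of *different* lengths,
# mimicking what our Dataset was returning
batch = [
    {"Text": ["Give", "it", "to", "me"], "Label": "ENGLISH"},
    {"Text": ["me", "gusta", "comer", "en", "la", "cafeteria"], "Label": "SPANISH"},
]

try:
    default_collate(batch)
except RuntimeError as e:
    print(e)  # each element in list of batch should be of equal size
```

Equal-length lists would collate fine (they get transposed into per-position lists), which is why the error only surfaces once sentences of different lengths land in the same batch.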
<h1 id="re-structuring-data-as-a-comma-separated-string">Re-structuring data as a comma-separated string</h1>
<p>Due to the structure of our model, we still need a way to vectorize each sentence sample, but we can’t have each sample stored as a list of tokens. Here is a workaround, even if the syntax is awkward: I’m rejoining the tokens into a comma-separated string like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"me gusta comer en la cafeteria"</span><span class="p">.</span><span class="n">split</span><span class="p">())</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'me, gusta, comer, en, la, cafeteria'
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_data2</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"me gusta comer en la cafeteria"</span><span class="p">.</span><span class="n">split</span><span class="p">()),</span> <span class="s">"SPANISH"</span><span class="p">),</span>
<span class="p">(</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"Give it to me"</span><span class="p">.</span><span class="n">split</span><span class="p">()),</span> <span class="s">"ENGLISH"</span><span class="p">),</span>
<span class="p">(</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"No creo que sea una buena idea"</span><span class="p">.</span><span class="n">split</span><span class="p">()),</span> <span class="s">"SPANISH"</span><span class="p">),</span>
<span class="p">(</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"No it is not a good idea to get lost at sea"</span><span class="p">.</span><span class="n">split</span><span class="p">()),</span> <span class="s">"ENGLISH"</span><span class="p">),</span>
<span class="p">]</span>
<span class="n">test_data2</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"Yo creo que si"</span><span class="p">.</span><span class="n">split</span><span class="p">()),</span> <span class="s">"SPANISH"</span><span class="p">),</span>
<span class="p">(</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"it is lost on me"</span><span class="p">.</span><span class="n">split</span><span class="p">()),</span> <span class="s">"ENGLISH"</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data2</span> <span class="o">=</span> <span class="n">train_data2</span> <span class="o">+</span> <span class="n">test_data2</span>
<span class="n">df_data2</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data2</span><span class="p">)</span>
<span class="n">df_data2</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">"words"</span><span class="p">,</span> <span class="s">"labels"</span><span class="p">]</span>
</code></pre></div></div>
<p>Here’s how the data looks.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_data2</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>words</th>
<th>labels</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>me, gusta, comer, en, la, cafeteria</td>
<td>SPANISH</td>
</tr>
<tr>
<th>1</th>
<td>Give, it, to, me</td>
<td>ENGLISH</td>
</tr>
<tr>
<th>2</th>
<td>No, creo, que, sea, una, buena, idea</td>
<td>SPANISH</td>
</tr>
<tr>
<th>3</th>
<td>No, it, is, not, a, good, idea, to, get, lost,...</td>
<td>ENGLISH</td>
</tr>
<tr>
<th>4</th>
<td>Yo, creo, que, si</td>
<td>SPANISH</td>
</tr>
<tr>
<th>5</th>
<td>it, is, lost, on, me</td>
<td>ENGLISH</td>
</tr>
</tbody>
</table>
</div>
<h2 id="putting-the-data-in-dataset-and-output-with-dataloader-1">Putting the data in <code class="language-plaintext highlighter-rouge">Dataset</code> and output with <code class="language-plaintext highlighter-rouge">Dataloader</code></h2>
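As a reminder, <code>TextDataset</code> was defined earlier in the post. A minimal sketch of the kind of <code>Dataset</code> subclass assumed here (the real definition may carry extra bookkeeping) wraps parallel lists of ids, texts, and labels, and returns each sample as a dict so that batches come out as <code>{"Text": [...], "Label": [...]}</code>:

```python
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Minimal sketch of the Dataset class used in this post."""

    def __init__(self, ids, texts, labels):
        self.ids = list(ids)
        self.texts = list(texts)
        self.labels = list(labels)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Returning a dict is what makes DataLoader yield
        # batch["Text"] and batch["Label"] below
        return {"Text": self.texts[idx], "Label": self.labels[idx]}
```

Because each sample's text is now a single string rather than a list, the default collation simply gathers the strings into a list per batch, with no length check to trip over.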
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_DS2</span> <span class="o">=</span> <span class="n">TextDataset</span><span class="p">(</span>
<span class="n">train_ids</span><span class="p">,</span>
<span class="n">df_data2</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">train_ids</span><span class="p">,</span> <span class="s">"words"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span>
<span class="n">df_data2</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">train_ids</span><span class="p">,</span> <span class="s">"labels"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span>
<span class="p">)</span>
<span class="n">test_DS2</span> <span class="o">=</span> <span class="n">TextDataset</span><span class="p">(</span>
<span class="n">test_ids</span><span class="p">,</span>
<span class="n">df_data2</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">test_ids</span><span class="p">,</span> <span class="s">"words"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span>
<span class="n">df_data2</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">test_ids</span><span class="p">,</span> <span class="s">"labels"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span>
<span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_DL2a</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_DS2</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"batch size of 1"</span><span class="p">)</span>
<span class="k">for</span> <span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_DL2a</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Text data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Text"</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Label data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Label"</span><span class="p">],</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>batch size of 1
0 Text data: ['me, gusta, comer, en, la, cafeteria']
0 Label data: ['SPANISH']
1 Text data: ['Give, it, to, me']
1 Label data: ['ENGLISH']
2 Text data: ['No, creo, que, sea, una, buena, idea']
2 Label data: ['SPANISH']
3 Text data: ['No, it, is, not, a, good, idea, to, get, lost, at, sea']
3 Label data: ['ENGLISH']
</code></pre></div></div>
<p>Great, this is closer to the expected output: each batch is a list containing one sample, represented as a single string, created by <code class="language-plaintext highlighter-rouge">DataLoader</code>. We still have to vectorize it before feeding it into our model, but we can worry about that later. Additionally, when we increase the <code class="language-plaintext highlighter-rouge">batch_size</code>, we no longer get an error.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_DL2b</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_DS2</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"batch size of 2"</span><span class="p">)</span>
<span class="k">for</span> <span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_DL2b</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Text data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Text"</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Label data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Label"</span><span class="p">],</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>batch size of 2
0 Text data: ['me, gusta, comer, en, la, cafeteria', 'Give, it, to, me']
0 Label data: ['SPANISH', 'ENGLISH']
1 Text data: ['No, creo, que, sea, una, buena, idea', 'No, it, is, not, a, good, idea, to, get, lost, at, sea']
1 Label data: ['SPANISH', 'ENGLISH']
</code></pre></div></div>
<p>We can also verify that this works for our test set in its own <code class="language-plaintext highlighter-rouge">DataLoader</code> object.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_DL2b</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">test_DS2</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"batch size of 2"</span><span class="p">)</span>
<span class="k">for</span> <span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">test_DL2b</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Text data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Text"</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="s">"Label data: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Label"</span><span class="p">],</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>batch size of 2
0 Text data: ['Yo, creo, que, si', 'it, is, lost, on, me']
0 Label data: ['SPANISH', 'ENGLISH']
</code></pre></div></div>
<h1 id="train-model-using-dataloader-objects">Train model using <code class="language-plaintext highlighter-rouge">DataLoader</code> objects</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
</span><span class="n">word_to_ix</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">sent</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
<span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">sent</span><span class="p">:</span>
<span class="k">if</span> <span class="n">word</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">word_to_ix</span><span class="p">:</span>
<span class="n">word_to_ix</span><span class="p">[</span><span class="n">word</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">word_to_ix</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">word_to_ix</span><span class="p">)</span>
<span class="n">VOCAB_SIZE</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">word_to_ix</span><span class="p">)</span>
<span class="n">NUM_LABELS</span> <span class="o">=</span> <span class="mi">2</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sent</span> <span class="o">=</span> <span class="s">"me, gusta, comer"</span>
<span class="n">sent</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">", "</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['me', 'gusta', 'comer']
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BoWClassifier</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span> <span class="c1"># inheriting from nn.Module!
</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_labels</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">):</span>
<span class="c1"># calls the init function of nn.Module. Dont get confused by syntax,
</span> <span class="c1"># just always do it in an nn.Module
</span> <span class="nb">super</span><span class="p">(</span><span class="n">BoWClassifier</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
<span class="c1"># Define the parameters that you will need. In this case, we need A and b,
</span> <span class="c1"># the parameters of the affine mapping.
</span> <span class="c1"># Torch defines nn.Linear(), which provides the affine map.
</span> <span class="c1"># Make sure you understand why the input dimension is vocab_size
</span> <span class="c1"># and the output is num_labels!
</span> <span class="bp">self</span><span class="p">.</span><span class="n">linear</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">num_labels</span><span class="p">)</span>
<span class="c1"># NOTE! The non-linearity log softmax does not have parameters! So we don't need
</span> <span class="c1"># to worry about that here
</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">bow_vec</span><span class="p">):</span>
<span class="c1"># Pass the input through the linear layer,
</span> <span class="c1"># then pass that through log_softmax.
</span> <span class="c1"># Many non-linearities and other functions are in torch.nn.functional
</span> <span class="k">return</span> <span class="n">F</span><span class="p">.</span><span class="n">log_softmax</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">linear</span><span class="p">(</span><span class="n">bow_vec</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">make_bow_vector</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="n">word_to_ix</span><span class="p">):</span>
<span class="s">"""
Edited from original to get words wrapped in a list back
"""</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="n">sentence</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">", "</span><span class="p">)</span>
<span class="n">vec</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">word_to_ix</span><span class="p">))</span>
<span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">sentence</span><span class="p">:</span>
<span class="n">vec</span><span class="p">[</span><span class="n">word_to_ix</span><span class="p">[</span><span class="n">word</span><span class="p">]]</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">vec</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">make_target</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">label_to_ix</span><span class="p">):</span>
<span class="s">"""
Altered to extract label from list
"""</span>
<span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">([</span><span class="n">label_to_ix</span><span class="p">[</span><span class="n">label</span><span class="p">[</span><span class="mi">0</span><span class="p">]]])</span>
</code></pre></div></div>
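To see what the edited <code>make_bow_vector</code> produces on the comma-joined format, here is a standalone sketch with a toy three-word vocabulary (the function body mirrors the helper above; the toy vocabulary is just for illustration):

```python
import torch

def make_bow_vector(sentence, word_to_ix):
    # Mirrors the edited helper above: unwrap the single-element list
    # that DataLoader yields, then split on the comma separator
    words = sentence[0].split(", ")
    vec = torch.zeros(len(word_to_ix))
    for word in words:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)

toy_vocab = {"me": 0, "gusta": 1, "comer": 2}
print(make_bow_vector(["me, gusta, comer"], toy_vocab))
# tensor([[1., 1., 1.]])
```

Each position of the output counts how often the corresponding vocabulary word appears in the sample, and the <code>view(1, -1)</code> gives the row-vector shape the model's linear layer expects.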
<h2 id="batch-size-of-1">Batch size of 1</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_DL2a</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_DS2</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">test_DL2a</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">test_DS2</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">BoWClassifier</span><span class="p">(</span><span class="n">NUM_LABELS</span><span class="p">,</span> <span class="n">VOCAB_SIZE</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">():</span>
<span class="k">print</span><span class="p">(</span><span class="n">param</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Parameter containing:
tensor([[ 0.0544, 0.0097, 0.0716, -0.0764, -0.0143, -0.0177, 0.0284, -0.0008,
0.1714, 0.0610, -0.0730, -0.1184, -0.0329, -0.0846, -0.0628, 0.0094,
0.1169, 0.1066, -0.1917, 0.1216, 0.0548, 0.1860, 0.1294, -0.1787,
-0.1865, -0.0946],
[ 0.1722, -0.0327, 0.0839, -0.0911, 0.1924, -0.0830, 0.1471, 0.0023,
-0.1033, 0.1008, -0.1041, 0.0577, -0.0566, -0.0215, -0.1885, -0.0935,
0.1064, -0.0477, 0.1953, 0.1572, -0.0092, -0.1309, 0.1194, 0.0609,
-0.1268, 0.1274]], requires_grad=True)
Parameter containing:
tensor([0.1191, 0.1739], requires_grad=True)
</code></pre></div></div>
<p>Note that the model parameters are randomly initialized to small, non-zero values: random initialization breaks the symmetry between units, and keeping the values small avoids saturated activations that would make gradient descent slow. This point is explained more fully by Andrew Ng in <a href="https://www.youtube.com/watch?v=6by6Xas_Kho&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=35">this video</a>.</p>
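The two tensors printed above are simply the weight matrix and bias vector of the model's single <code>nn.Linear</code> layer. A quick sketch of their shapes, assuming the 26-word vocabulary and 2 labels from above:

```python
import torch.nn as nn

# vocab_size -> num_labels, matching BoWClassifier's linear layer
layer = nn.Linear(26, 2)
for name, p in layer.named_parameters():
    print(name, tuple(p.shape))
# weight (2, 26)
# bias (2,)
```

So the model has 2 × 26 + 2 = 54 trainable parameters: one weight per (label, vocabulary word) pair, plus one bias per label.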
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">label_to_ix</span> <span class="o">=</span> <span class="p">{</span><span class="s">"SPANISH"</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"ENGLISH"</span><span class="p">:</span> <span class="mi">1</span><span class="p">}</span>
</code></pre></div></div>
<h3 id="run-on-test-data-before-we-train-just-to-see-a-before-and-after">Run on test data before we train, just to see a before-and-after</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">test_DL2a</span><span class="p">:</span>
<span class="c1"># Alter code from tutorial
</span> <span class="c1"># for instance, label in test_data:
</span> <span class="n">instance</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Text"</span><span class="p">],</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Label"</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">instance</span><span class="p">,</span> <span class="n">label</span><span class="p">)</span>
<span class="n">bow_vec</span> <span class="o">=</span> <span class="n">make_bow_vector</span><span class="p">(</span><span class="n">instance</span><span class="p">,</span> <span class="n">word_to_ix</span><span class="p">)</span>
<span class="n">log_probs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">bow_vec</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">log_probs</span><span class="p">,</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="c1"># Print the matrix column corresponding to "creo"
</span><span class="k">print</span><span class="p">(</span>
<span class="s">"Tensor for 'creo' (before training): "</span><span class="p">,</span>
<span class="nb">next</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())[:,</span> <span class="n">word_to_ix</span><span class="p">[</span><span class="s">"creo"</span><span class="p">]],</span>
<span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['Yo, creo, que, si'] ['SPANISH']
tensor([[-0.9736, -0.4744]])
['it, is, lost, on, me'] ['ENGLISH']
tensor([[-0.7289, -0.6586]])
Tensor for 'creo' (before training): tensor([-0.0730, -0.1041], grad_fn=<SelectBackward>)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss_function</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">NLLLoss</span><span class="p">()</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># for instance, label in data:
</span>
<span class="k">for</span> <span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_DL2a</span><span class="p">):</span> <span class="c1"># Print the 'text' data of the batch
</span> <span class="n">instance</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Text"</span><span class="p">],</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Label"</span><span class="p">]</span>
<span class="c1"># Step 1. Remember that PyTorch accumulates gradients.
</span> <span class="c1"># We need to clear them out before each instance
</span> <span class="n">model</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="c1"># Step 2. Make our BOW vector and also we must wrap the target in a
</span> <span class="c1"># Tensor as an integer. For example, if the target is SPANISH, then
</span> <span class="c1"># we wrap the integer 0. The loss function then knows that the 0th
</span> <span class="c1"># element of the log probabilities is the log probability
</span> <span class="c1"># corresponding to SPANISH
</span> <span class="n">bow_vec</span> <span class="o">=</span> <span class="n">make_bow_vector</span><span class="p">(</span><span class="n">instance</span><span class="p">,</span> <span class="n">word_to_ix</span><span class="p">)</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">make_target</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">label_to_ix</span><span class="p">)</span>
<span class="c1"># Step 3. Run our forward pass.
</span> <span class="n">log_probs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">bow_vec</span><span class="p">)</span>
<span class="c1"># Step 4. Compute the loss, gradients, and update the parameters by
</span> <span class="c1"># calling optimizer.step()
</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_function</span><span class="p">(</span><span class="n">log_probs</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span>
<span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
<span class="k">if</span> <span class="p">(</span><span class="n">idx</span> <span class="o">%</span> <span class="mi">4</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">epoch</span> <span class="o">%</span> <span class="mi">20</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span> <span class="c1"># Edit when datasets are bigger
</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"epoch: </span><span class="si">{</span><span class="n">epoch</span><span class="si">}</span><span class="s">, training sample: </span><span class="si">{</span><span class="n">idx</span><span class="si">}</span><span class="s">, loss = </span><span class="si">{</span><span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">():</span><span class="mf">0.04</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>epoch: 0, training sample: 0, loss = 0.8369
epoch: 20, training sample: 0, loss = 0.0507
epoch: 40, training sample: 0, loss = 0.0257
epoch: 60, training sample: 0, loss = 0.0172
epoch: 80, training sample: 0, loss = 0.0129
</code></pre></div></div>
<p>We see the loss decrease quickly and saturate by the end of the training epochs.</p>
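<p>One lightweight way to confirm that saturation, rather than eyeballing the printout, is to accumulate the losses and check that the tail of the curve is flat. This is a small illustrative sketch (the <code class="language-plaintext highlighter-rouge">has_saturated</code> helper and the hard-coded history are my own, not part of the tutorial code):</p>

```python
# Loss history mirroring the values printed above
loss_history = [0.8369, 0.0507, 0.0257, 0.0172, 0.0129]

def has_saturated(losses, window=2, tol=0.02):
    """Return True if the last `window` losses changed by less than `tol`."""
    tail = losses[-window:]
    return max(tail) - min(tail) < tol

print(has_saturated(loss_history))  # True: the last two losses differ by ~0.004
```

In a real training loop, you would append <code class="language-plaintext highlighter-rouge">loss.item()</code> to such a list on each print step and use a check like this for early stopping.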
<h3 id="evaluation-after-training">Evaluation after training</h3>
<p>Look at the test set again, after model training.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">test_DL2a</span><span class="p">:</span>
<span class="c1"># Alter code from tutorial
</span> <span class="c1"># for instance, label in test_data:
</span> <span class="n">instance</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Text"</span><span class="p">],</span> <span class="n">batch</span><span class="p">[</span><span class="s">"Label"</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">instance</span><span class="p">,</span> <span class="n">label</span><span class="p">)</span>
<span class="n">bow_vec</span> <span class="o">=</span> <span class="n">make_bow_vector</span><span class="p">(</span><span class="n">instance</span><span class="p">,</span> <span class="n">word_to_ix</span><span class="p">)</span>
<span class="n">log_probs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">bow_vec</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">log_probs</span><span class="p">,</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['Yo, creo, que, si'] ['SPANISH']
tensor([[-0.2056, -1.6828]])
['it, is, lost, on, me'] ['ENGLISH']
tensor([[-2.7960, -0.0630]])
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Print the matrix column corresponding to "creo"
</span><span class="k">print</span><span class="p">(</span>
<span class="s">"Matrix for 'creo' (after training): "</span><span class="p">,</span>
<span class="nb">next</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())[:,</span> <span class="n">word_to_ix</span><span class="p">[</span><span class="s">"creo"</span><span class="p">]],</span>
<span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Matrix for 'creo' (after training): tensor([ 0.3702, -0.5473], grad_fn=<SelectBackward>)
</code></pre></div></div>
<p>We see that the coefficients for the Spanish word “creo” separate quite nicely relative to the initial values. <a href="https://media.giphy.com/media/U8GLl0bUYFLZVquOfY/giphy.gif">I believe</a> that the model training was successful.</p>
<h1 id="summary">Summary</h1>
<p>In this post, I sought to better understand how to use <code class="language-plaintext highlighter-rouge">Dataset</code> and <code class="language-plaintext highlighter-rouge">Dataloader</code> objects, especially in the context of model training. Fleshing this out showed me where I had to re-structure my data to get my code to work properly. Here, I had a batch size of 1, to mimic the original PyTorch tutorial. In a later post, I’ll write about how to take advantage of batching which is more relevant in larger datasets.</p>
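<p>As a preview of that later post, a <code class="language-plaintext highlighter-rouge">DataLoader</code> with a batch size greater than 1 groups samples automatically through its default collate function. The sketch below is hypothetical — <code class="language-plaintext highlighter-rouge">SentencesDataset</code> and the sample sentences are illustrative stand-ins, not the classes or data used above:</p>

```python
from torch.utils.data import Dataset, DataLoader

class SentencesDataset(Dataset):
    """Toy dataset mirroring the "Text"/"Label" keys used in this post."""
    def __init__(self, pairs):
        self.pairs = pairs  # list of (sentence, label) tuples

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        text, label = self.pairs[idx]
        return {"Text": text, "Label": label}

data = [
    ("me gusta comer en la cafeteria", "SPANISH"),
    ("Give it to me", "ENGLISH"),
    ("No creo que sea una buena idea", "SPANISH"),
    ("it is lost on me", "ENGLISH"),
]

# batch_size=2 yields two samples per iteration; the default collate_fn
# gathers each dict field into a list (or stacks tensors) across the batch.
loader = DataLoader(SentencesDataset(data), batch_size=2, shuffle=False)
for batch in loader:
    print(batch["Text"], batch["Label"])
```

Each iteration now returns lists of two sentences and two labels, so the vectorization step would need to handle a batch of texts rather than a single instance — exactly the restructuring alluded to above.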
<h1 id="appendix-environment-and-system-parameters">Appendix: Environment and system parameters</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">watermark</span> <span class="o">-</span><span class="n">n</span> <span class="o">-</span><span class="n">u</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">iv</span> <span class="o">-</span><span class="n">w</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Last updated: Thu Jun 24 2021
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.22.0
numpy : 1.19.5
torch : 1.8.1
re : 2.2.1
json : 2.0.9
seaborn: 0.11.1
pandas : 1.2.1
Watermark: 2.1.0
</code></pre></div></div>Ben LacarRecently, I built a simple NLP algorithm for a work project, following the template described in this tutorial. As I looked to increase my model’s complexity, I started to come across references to Dataset and Dataloader classes. I tried adapting my work-related code to use these objects, but I found myself running into pesky bugs. I thought I should take some time to figure out how to properly use Dataset and Dataloader objects. In this post, I adapt the PyTorch NLP tutorial to work with Dataset and Dataloader objects. Since my focus is primarily on using these objects, please refer to the tutorial for details regarding the NLP model.Extending the chain2021-06-21T00:00:00+00:002021-06-21T00:00:00+00:00https://benslack19.github.io/revisiting-the-chain<p>Last month, I <a href="https://benslack19.github.io/the-chain/">wrote</a> about how I aimed to make daily contributions towards my goals, no matter how small. Not surprisingly, it has been challenging. Nevertheless, I <em>have</em> maintained a chart where I can at least see what goals I’ve been able to contribute to in the last month.</p>
<p>Where have I spent the most time? It’s been the subject that most interests me and what I do first thing in the morning: statistics and Bayesian data analysis. Where have I fallen behind? The areas of tool development and writing. I feel like tool development has led to the lowest ROI currently. It is probably good that I don’t spend too much time there. But writing I can maintain. I continue to think of topics and will make additional entries soon.</p>Ben LacarLast month, I wrote about how I aimed to make daily contributions towards my goals, no matter how small. Not surprisingly, it has been challenging. Nevertheless, I have maintained a chart where I can at least see what goals I’ve been able to contribute to in the last month.