PyMC linear regression part 3: predicting average height
This is the next post in a series of linear regression posts using PyMC3. The series was inspired by my reading of Statistical Rethinking. Part 1 was dedicated to setting up the problem and understanding the package’s objects. Part 2 was about interpreting the posterior distribution. In this entry, I’ll be using the posterior distribution to make predictions. Specifically, I’ll make the distinction between predicting average height, which has its own uncertainty, and predicting actual height. I’ll cover predictions of average height here.
The first few pieces of code will replicate the previous posts to get us to where we want to be.
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pymc3 as pm
import scipy.stats as stats
import seaborn as sns
%load_ext nb_black
%config InlineBackend.figure_format = 'retina'
%load_ext watermark
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")
Here again is the question that motivated the series of posts.
The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% compatibility intervals for each of these individuals. That is, fill in the table below, using model-based predictions.
Individual | weight | expected height | 89% interval |
---|---|---|---|
1 | 45 | | |
2 | 40 | | |
3 | 65 | | |
4 | 31 | | |
d = pd.read_csv("../data/a_input/Howell1.csv", sep=";", header=0)
d2 = d[d.age >= 18] # filter to get only adults
Setting up the variables, producing the model and trace objects
# Get the average weight as part of the model definition
xbar = d2.weight.mean()
with pm.Model() as heights_model:
# Priors are variables a, b, sigma
# using pm.Normal represents the stochastic relationship between the left and right sides of the equation
a = pm.Normal("a", mu=178, sd=20)
b = pm.Lognormal("b", mu=0, sd=1)
sigma = pm.Uniform("sigma", 0, 50)
# This is a linear model (not really a prior or likelihood?)
# Data included here (d2.weight, which is observed)
# Mu is deterministic, but a and b are stochastic
mu = a + b * (d2.weight - xbar)
# Likelihood is the height variable, which is also observed (data included here, d2.height)
# Height is dependent on deterministic and stochastic variables
height = pm.Normal("height", mu=mu, sd=sigma, observed=d2.height)
# The next line does the fitting and sampling all at once.
# I'll use the return_inferencedata=False flag
# I'll use the progressbar flag to minimize output
trace_m2 = pm.sample(1000, tune=1000, return_inferencedata=False, progressbar=False)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, b, a]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 11 seconds.
trace_m2_df = pm.trace_to_dataframe(trace_m2)
Here again are some summary statistics of the posterior distribution.
az.rcParams["stats.hdi_prob"] = 0.89 # sets default credible interval used by arviz
az.summary(trace_m2, round_to=3, kind="stats")
/Users/blacar/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/arviz/data/io_pymc3.py:88: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.
warnings.warn(
mean | sd | hdi_5.5% | hdi_94.5% | |
---|---|---|---|---|
a | 154.602 | 0.267 | 154.152 | 154.996 |
b | 0.903 | 0.042 | 0.834 | 0.969 |
sigma | 5.106 | 0.197 | 4.802 | 5.430 |
# Get the covariance matrix.
trace_m2_df.cov().round(3)
a | b | sigma | |
---|---|---|---|
a | 0.071 | -0.000 | 0.000 |
b | -0.000 | 0.002 | -0.000 |
sigma | 0.000 | -0.000 | 0.039 |
The correlation matrix shows how strongly the parameters vary together; the near-zero correlation between a and b is one benefit of centering the weights on xbar.
trace_m2_df.corr()
a | b | sigma | |
---|---|---|---|
a | 1.000000 | -0.004523 | 0.000751 |
b | -0.004523 | 1.000000 | -0.011835 |
sigma | 0.000751 | -0.011835 | 1.000000 |
Making predictions for average height ($\mu_i$)
Now that we have a good grasp of the data, we can make posterior predictions. Functionally, what we are doing is taking into account the probability distributions of each parameter (alpha, beta, sigma) when making the prediction.
However, it was not immediately obvious to me how to carry this out mechanically using pymc. Let’s walk through this step-by-step, since there are different ways we can capture uncertainty.
One thing that I had some trouble grasping initially was the distinction in these two equations:
$\text{height}_i \sim \text{Normal}(\mu_i, \sigma)$
$\mu_i = \alpha + \beta(x_i - \bar{x})$
The first line is actual height, which incorporates the uncertainty of all parameters. The second line is average height. We can make predictions for both. Note that $\sigma$ does not appear in the second line, while $\alpha$ and $\beta$ are both vectors of posterior samples, so $\mu_i$ will also be a vector for a single weight value (or a matrix when looking at a range of weight values, see below). As stated above, we’ll focus on average height in this post.
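To make the distinction concrete, here is a toy aside of my own (not part of the original analysis; the weight value of 50 kg is arbitrary), using the trace_m2_df samples defined above.
# Toy aside: contrast the two equations at a single weight, say 50 kg
# Average height: only the uncertainty in a and b enters
mu_at_50 = trace_m2_df["a"] + trace_m2_df["b"] * (50 - xbar)
# Actual height: sigma (the observation noise) is folded in as well -- the subject of the next post
height_at_50 = np.random.normal(mu_at_50, trace_m2_df["sigma"])
# The spread of simulated actual heights is wider than the spread of average heights
print(mu_at_50.std(), height_at_50.std())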
In the next few plots, we are only considering plausible mean lines that can be generated with posterior alpha and beta values while ignoring sigma. Here again are the trace_m2 object’s returned parameters (showing only the head of the dataframe). When we are predicting the mean for a given weight, it is helpful to remember that there is uncertainty in the mean itself.
trace_m2_df.head()
a | b | sigma | |
---|---|---|---|
0 | 154.363232 | 0.878117 | 5.149980 |
1 | 155.100196 | 0.899341 | 5.166454 |
2 | 154.226905 | 0.920709 | 5.049191 |
3 | 154.226905 | 0.920709 | 5.049191 |
4 | 154.614243 | 0.872827 | 5.043696 |
We can generate credibility intervals for the range of weight values using an arviz function. This puts bounds on the plausible mean line. Note how this code omits sigma.
# compute the hpdi for a range of weight values
cred_intervals = np.array(
[
az.hdi(
np.array(trace_m2_df.loc[:, "a"])
+ np.array(trace_m2_df.loc[:, "b"]) * (x - xbar)
)
for x in np.linspace(30, 65) # This is inputting a range of weight values.
]
)
# Take a look at credibility intervals
cred_intervals[0:5, :]
array([[139.91398663, 142.10746662],
[140.61586835, 142.72290189],
[141.31055931, 143.33144088],
[141.98464947, 143.92085427],
[142.67405724, 144.52814949]])
We can represent uncertainty in two ways in the figure below. In the left plot, we use pairs of $\alpha$ and $\beta$ samples to draw many plausible lines from the posterior distribution. On the right, we plot the credibility interval instead.
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# ax1 plot --------------------------------------------
# Plotting the data as a scatter plot
ax1.scatter(d2["weight"], d2["height"], alpha=0.5)
# Plotting mean lines using the first 50 sampled values for alpha and beta parameters
# note that sigma is not used
for i in range(50):
ax1.plot(
d2["weight"],
trace_m2_df["a"][i] + trace_m2_df["b"][i] * (d2["weight"] - xbar),
color="gray",
linewidth=0.2,
linestyle="dashed",
)
# Re-plot the 50th line with a label so the legend has an entry
ax1.plot(
d2["weight"],
trace_m2_df["a"][i] + trace_m2_df["b"][i] * (d2["weight"] - xbar),
color="gray",
linewidth=0.2,
linestyle="dashed",
label=r"a line for each of 100 randomly $\alpha$ and $\beta$ sampled values",
)
# Plotting the overall mean line
ax1.plot(
d2["weight"],
trace_m2["a"].mean() + trace_m2["b"].mean() * (d2["weight"] - xbar),
label="posterior mean line",
color="orange",
)
ax1.set_xlabel("weight")
ax1.set_ylabel("height")
ax1.legend(fontsize=11)
# ax2 plot --------------------------------------------
# Plotting the data as a scatter plot
ax2.scatter(d2["weight"], d2["height"], alpha=0.5)
ax2.fill_between(
np.linspace(30, 65),
cred_intervals[:, 0],
cred_intervals[:, 1],
alpha=0.4,
color="gray",
label="89% CI",
)
# Plotting the overall mean line
ax2.plot(
d2["weight"],
trace_m2["a"].mean() + trace_m2["b"].mean() * (d2["weight"] - xbar),
label="posterior mean line",
color="orange",
)
ax2.set_xlabel("weight")
ax2.set_ylabel("height")
ax2.legend(fontsize=11)
It’s important to remember that the orange line drawn represents the posterior mean line. It is the most plausible line, but it’s helpful to keep in mind that there’s uncertainty and other lines (gray) are plausible (albeit with lower probability).
Understanding $\mu_i$ uncertainty at a single weight value
It might still be hard to see why we’d have uncertainty around a mean value. Let’s take only one weight, 45 kg. We can use the formula for our line and simply plug in the value of 45 for $x_i$. We will get back a vector of predicted means since $\alpha$ and $\beta$ are vectors. Again, note how $\sigma$ from our posterior distribution is still not used in this equation.
mu_at_45 = trace_m2_df["a"] + trace_m2_df["b"] * (45 - xbar)
mu_at_45
0 154.371587
1 155.108753
2 154.235666
3 154.235666
4 154.622548
...
3995 154.848588
3996 154.247808
3997 154.608258
3998 154.342976
3999 154.589418
Length: 4000, dtype: float64
We can plot the uncertainty around this single weight value.
# Get 89% compatibility interval
az.hdi(np.array(mu_at_45))
array([154.16661059, 155.01021454])
f, ax1 = plt.subplots(1, 1, figsize=(6, 3))
az.plot_kde(mu_at_45, ax=ax1)
ax1.set_title("KDE with arviz")
ax1.set_ylabel("density")
ax1.set_xlabel("mu|weight=45")
ax1.vlines(
az.hdi(np.array(mu_at_45)),
ymin=0,
ymax=1.5,
color="gray",
linestyle="dashed",
linewidth=1,
)
<matplotlib.collections.LineCollection at 0x7fecb04d5460>
Understanding $\mu_i$ uncertainty for a range of weight values
Now let’s take a look at a range of weight values.
This is adapted from a note in the repo: the book does this with the link function (code 4.58), while here the same operations are performed manually.
# Input a range of weight values
weight_seq = np.arange(25, 71)
# Given that we have a lot of samples, we could use fewer of them for plotting
# I just decided to use them all
# trace_m_thinned = trace_m2_df[::10]
# Pre-allocate a matrix to hold the predicted mu values
# Each weight value gets its own row; each posterior sample its own column
mu_pred = np.zeros((len(weight_seq), len(trace_m2_df)))
# Fill out the matrix in this loop
for i, w in enumerate(weight_seq):
mu_pred[i] = trace_m2_df["a"] + trace_m2_df["b"] * (w - d2["weight"].mean())
The line that fills mu_pred deserves further explanation. Unlike the example above, where mu was a vector because we evaluated at a single weight value, we now get back a matrix: one row per weight value, one column per posterior sample. (While I didn’t explicitly think to use this, this Stack Overflow link explains how MCMC uses multiple chains.)
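As a sanity check on the shape of this matrix, here is a vectorized sketch of my own (not from the book or repo) that builds the same matrix with numpy broadcasting instead of the loop.
# One row per weight value, one column per posterior sample
mu_pred_vec = trace_m2_df["a"].values + np.outer(weight_seq - xbar, trace_m2_df["b"].values)
# Should match the matrix built by the loop above
print(np.allclose(mu_pred, mu_pred_vec), mu_pred_vec.shape)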
df_mu_pred = pd.DataFrame(mu_pred, index=weight_seq)
df_mu_pred.index.name = "weight"
df_mu_pred.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3990 | 3991 | 3992 | 3993 | 3994 | 3995 | 3996 | 3997 | 3998 | 3999 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
weight | |||||||||||||||||||||
25 | 136.809256 | 137.121926 | 135.821484 | 135.821484 | 137.166000 | 135.657218 | 135.657218 | 135.657218 | 137.010733 | 137.442018 | ... | 136.315442 | 136.682549 | 137.083692 | 137.219905 | 135.234508 | 137.367422 | 136.002073 | 136.432719 | 136.817518 | 135.816661 |
26 | 137.687372 | 138.021267 | 136.742193 | 136.742193 | 138.038827 | 136.601131 | 136.601131 | 136.601131 | 137.900697 | 138.304255 | ... | 137.221021 | 137.580689 | 137.961734 | 138.091173 | 136.175118 | 138.241480 | 136.914360 | 137.341496 | 137.693791 | 136.755299 |
27 | 138.565489 | 138.920609 | 137.662903 | 137.662903 | 138.911655 | 137.545044 | 137.545044 | 137.545044 | 138.790662 | 139.166491 | ... | 138.126600 | 138.478828 | 138.839776 | 138.962441 | 137.115727 | 139.115539 | 137.826647 | 138.250273 | 138.570064 | 137.693937 |
28 | 139.443605 | 139.819950 | 138.583612 | 138.583612 | 139.784482 | 138.488957 | 138.488957 | 138.488957 | 139.680627 | 140.028727 | ... | 139.032180 | 139.376967 | 139.717819 | 139.833710 | 138.056337 | 139.989597 | 138.738933 | 139.159050 | 139.446337 | 138.632575 |
29 | 140.321722 | 140.719291 | 139.504321 | 139.504321 | 140.657310 | 139.432870 | 139.432870 | 139.432870 | 140.570591 | 140.890963 | ... | 139.937759 | 140.275107 | 140.595861 | 140.704978 | 138.996946 | 140.863655 | 139.651220 | 140.067827 | 140.322609 | 139.571212 |
5 rows × 4000 columns
f, axes = plt.subplots(2, 3, figsize=(12, 6))
# equally spaced weights in the weight range
arb_weights = np.linspace(25, 70, 6)
for arb_weight, ax in zip(arb_weights, axes.flat):
comp_interval = az.hdi(np.array(df_mu_pred.loc[arb_weight]))
legend_label = "CI range: {0:0.2f}".format(np.diff(comp_interval)[0])
az.plot_kde(df_mu_pred.loc[arb_weight], ax=ax)
ax.set_title("mu|weight=%i" % arb_weight + "\n" + legend_label)
# ax.legend() # cleaner to put as the title
if ax.is_first_col():
ax.set_ylabel("density")
if ax.is_last_row():
ax.set_xlabel("average height")
This is another way of looking at the variability of average height. The compatibility interval range (CI range) shows how the CI widens toward either end of the weight range. Here is a more intuitive way of showing this, with the data points in blue.
# Plotting the data as a scatter plot
plt.scatter(d2["weight"], d2["height"], alpha=0.5)
plt.plot(weight_seq, mu_pred, "C0.", color="gray", alpha=0.005)
plt.xlabel("weight")
plt.ylabel("height")
plt.title(r"Uncertainty of $\mu_i$, the linear model of the mean")
Text(0.5, 1.0, 'Uncertainty of $\\mu_i$, the linear model of the mean')
Summary
The purpose of this post was to show how to make predictions of average height using pymc. We used our posterior distribution to draw many different regression lines, and we plotted the uncertainty of average height at individual weights to appreciate this point. One thing I have stressed is the distinction between average height and actual height. The last figure highlights how making predictions on average height alone would lead to over-confidence in where a new data point may lie. In the next post, we will use the posterior distribution to make predictions of actual height.
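As a partial answer to the motivating table, here is a sketch of my own that fills in the expected (average) height and an 89% interval for $\mu$ at each of the four weights. Keep in mind these intervals reflect uncertainty in the mean only; the wider intervals for actual height come in the next post.
# Expected (average) height and 89% HDI of mu for the four individuals in the question
# Note: these are intervals for the mean, not for actual height
for w in [45, 40, 65, 31]:
    mu_w = trace_m2_df["a"] + trace_m2_df["b"] * (w - xbar)
    lower, upper = az.hdi(np.array(mu_w))
    print(f"weight={w}: expected height={mu_w.mean():.1f}, 89% interval=({lower:.1f}, {upper:.1f})")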
Appendix: Environment and system parameters
%watermark -n -u -v -iv -w
Last updated: Sat May 15 2021
Python implementation: CPython
Python version : 3.8.6
IPython version : 7.20.0
seaborn : 0.11.1
pymc3 : 3.11.0
arviz : 0.11.1
matplotlib: 3.3.4
pandas : 1.2.1
json : 2.0.9
scipy : 1.6.0
numpy : 1.20.1
Watermark: 2.1.0