PyMC linear regression part 4: predicting actual height

11 minute read

At last, we have come to the end. This is the final post in a series of linear regression posts using PyMC3, from my reading of Statistical Rethinking. Part 1 set up the problem and explored the package’s objects. Part 2 interpreted the posterior distribution. Part 3 predicted average height, which has its own uncertainty. Here, we’ll predict actual height, which avoids the over-confidence that comes from predicting only average height.

Going deep into each step took some time, but it helped embed the concepts in my brain. As we’ve done before, we’ll start off by duplicating code from the previous posts before making the final predictions.

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pymc3 as pm
import scipy.stats as stats
import seaborn as sns
%load_ext nb_black
%config InlineBackend.figure_format = 'retina'
%load_ext watermark
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")

Question

This question comes from the winter 2020, week 2 homework.

The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% compatibility intervals for each of these individuals. That is, fill in the table below, using model-based predictions.

Individual | weight | expected height | 89% interval
1 | 45 | |
2 | 40 | |
3 | 65 | |
4 | 31 | |

Let’s quickly take a look at the data to get a handle on what we’re working with.

d = pd.read_csv("../data/a_input/Howell1.csv", sep=";", header=0)
d2 = d[d.age >= 18]  # filter to get only adults

Setting up the variables

# Get the average weight as part of the model definition
xbar = d2.weight.mean()

with pm.Model() as heights_model:

    # Priors are variables a, b, sigma
    # pm.Normal represents the stochastic relationship between the left and right sides of the equation
    a = pm.Normal("a", mu=178, sd=20)
    b = pm.Lognormal("b", mu=0, sd=1)
    sigma = pm.Uniform("sigma", 0, 50)

    # This is a linear model (not really a prior or likelihood?)
    # Data included here (d2.weight, which is observed)
    # Mu is deterministic, but a and b are stochastic
    mu = a + b * (d2.weight - xbar)

    # Likelihood is the height variable, which is also observed (data included here: d2.height)
    # Height is dependent on deterministic and stochastic variables
    height = pm.Normal("height", mu=mu, sd=sigma, observed=d2.height)

    # The next line does the fitting and sampling all at once.
    trace_m2 = pm.sample(1000, tune=1000, return_inferencedata=False, progressbar=False)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, b, a]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 12 seconds.
trace_m2_df = pm.trace_to_dataframe(trace_m2)

The next set of code gives point estimates of the parameters as well as a sense of the uncertainty in the posterior, which is more useful.

az.rcParams["stats.hdi_prob"] = 0.89  # sets default credible interval used by arviz
az.summary(trace_m2, round_to=3, kind="stats")
/Users/blacar/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/arviz/data/io_pymc3.py:88: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.
  warnings.warn(
mean sd hdi_5.5% hdi_94.5%
a 154.600 0.275 154.178 155.055
b 0.903 0.043 0.835 0.970
sigma 5.103 0.195 4.789 5.404
# Covariance matrix
trace_m2_df.cov().round(3)
a b sigma
a 0.075 0.000 0.000
b 0.000 0.002 -0.000
sigma 0.000 -0.000 0.038

Predictions for actual height

We return now to the original problem, where we are seeking predictions for actual height. We will produce compatibility intervals within which we’d expect to see any observed height (note how the uncertainty region for $\mu_i$, shown previously, is much narrower than the spread of the data points).

We will do something similar to the previous post, but this time incorporating $\sigma$.

$\text{height}_i \sim \text{Normal}(\mu_i, \sigma)$

The repo initially confused me, but after exploring the functions some more I saw that the solution was straightforward. I’ll spare you most of that exploration, but I’ll note some of my key insights here.

Function exploration

One key realization for me was seeing that the scipy function for drawing from a normal distribution can take varied mean (loc) and standard deviation (scale) values. Previously, I had typically used it to get multiple samples from the same mean and standard deviation, like this:

stats.norm.rvs(loc=10, scale=2, size=10)
array([ 7.09403309, 12.255019  ,  7.32435791, 10.26594505, 11.79233188,
       12.09804331, 10.84097672, 14.51428249, 14.22362859,  9.25777341])

I hadn’t previously thought to use a different mean and standard deviation for each draw. That is, loc and scale can be vectors. Below, I vary both so that you can see the effect in the resulting samples.

stats.norm.rvs(
    loc=[1, 1, 1, 10, 10, 10, 20, 20, 20], scale=[0.5, 1, 2, 0.5, 1, 2, 0.5, 1, 2]
)
array([ 0.73038967,  0.37624878,  1.87080349,  9.87713139, 10.7586794 ,
        9.36231624, 20.29745599, 19.36382859, 18.66087831])

This is when I had my a-ha moment about what to do next. Remember: the posterior distribution that pymc outputs has a different mean and standard deviation in each row.

trace_m2_df.head()
a b sigma
0 154.030585 0.853788 4.989442
1 153.969973 0.861619 5.128790
2 155.111070 0.957769 5.067568
3 154.106523 0.838728 4.961068
4 154.328210 0.863311 5.160797

We don’t need a multivariate normal distribution function; we can instead use the regular, univariate normal. This implementation is more consistent with R code 4.63, which I copied below from McElreath’s repo.

## R code 4.63
post <- extract.samples(m4.3)
weight.seq <- 25:70
sim.height <- sapply( weight.seq , function(weight)
    rnorm(
        n=nrow(post) ,
        mean=post$a + post$b*( weight - xbar ) ,
        sd=post$sigma ) )
height.PI <- apply( sim.height , 2 , PI , prob=0.89 )
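
For reference, here is a rough Python translation of that R snippet, a minimal sketch using the vectorized stats.norm.rvs call described above (sim_height and height_PI simply mirror the R names, and az.hdi stands in for McElreath’s PI; we’ll build this up step by step below):

# Sketch: a Python analogue of R code 4.63
weight_seq = np.arange(25, 71)
sim_height = np.array(
    [
        stats.norm.rvs(
            loc=trace_m2_df["a"] + trace_m2_df["b"] * (w - xbar),
            scale=trace_m2_df["sigma"],
        )
        for w in weight_seq
    ]
)
# One 89% interval per weight (az.hdi uses the default hdi_prob we set earlier)
height_PI = np.array([az.hdi(row) for row in sim_height])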

Making predictions for actual height ($h_i$) at a single weight value

When we input a weight, we can create a vector of heights drawn from normal distributions parameterized by the posterior samples. This may seem confusing at first, so let’s just look at one weight, 45.

Note how the parameters within the stats.norm.rvs function mimic McElreath’s R code above.

example_weight = 45
h_at_45 = stats.norm.rvs(
    loc=trace_m2_df["a"] + trace_m2_df["b"] * (example_weight - xbar),
    scale=trace_m2_df["sigma"],
)
print(h_at_45)
print("Length of vector: ", len(h_at_45))
[155.7948111  146.56221928 159.43749538 ... 155.33600385 149.83412841
 151.31287874]
Length of vector:  4000

We have a vector of values for which we can compute summary statistics, including getting a compatibility interval.

# Get 89% compatibility interval
az.hdi(np.array(h_at_45))
array([146.08867804, 162.48740096])
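
The HDI isn’t the only way to summarize this vector. As a quick sketch, here are the mean and a simple percentile interval (the analogue of McElreath’s PI) alongside the HDI:

# Other summaries of the simulated heights at weight 45
print("mean:", h_at_45.mean())
print("5.5% / 94.5% percentile interval:", np.percentile(h_at_45, [5.5, 94.5]))
print("89% HDI:", az.hdi(np.array(h_at_45)))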

This may look like a repeat of what we did for average height in the last post, but here we’ve incorporated sigma from our posterior distribution. Let’s repeat the code for average height and compare it with what we just did for actual height.

# Vector for average height - note that sigma is not used!
mu_at_45 = trace_m2_df["a"] + trace_m2_df["b"] * (45 - xbar)
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), sharex=True)

az.plot_kde(mu_at_45, ax=ax1)
ax1.set_title("prediction of average height\nat weight of 45")
ax1.set_ylabel("density")
ax1.set_xlabel("average height ($\mu$)")
ax1.vlines(
    az.hdi(np.array(mu_at_45)),
    ymin=0,
    ymax=1.5,
    color="gray",
    linestyle="dashed",
    linewidth=1,
)

az.plot_kde(h_at_45, ax=ax2)
ax2.set_title("prediction of actual height\nat weight of 45")
ax2.set_ylabel("density")
ax2.set_xlabel("actual height ($h$)")
ax2.vlines(
    az.hdi(np.array(h_at_45)),
    ymin=0,
    ymax=0.1,
    color="gray",
    linestyle="dashed",
    linewidth=1,
)
<matplotlib.collections.LineCollection at 0x7fe5ad6e2940>

png

I am intentionally plotting the x-axis with the same axis limits so that you can appreciate the difference in uncertainty. (Note the difference in the y-axis scale.) We would expect that the probability distribution for average height would be much tighter than the one for actual height. The plots drive that point home visually.
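
To put a number on that visual difference, here is a quick sketch comparing the widths of the two 89% intervals:

# Compare interval widths for average vs. actual height at weight 45
mu_interval = az.hdi(np.array(mu_at_45))
h_interval = az.hdi(np.array(h_at_45))
print("average height (mu) interval width:", mu_interval[1] - mu_interval[0])
print("actual height (h) interval width:", h_interval[1] - h_interval[0])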

Making predictions for actual height ($h_i$) over a range of weight values

We’ll start off by generating a matrix that contains the posterior predictions for actual heights over a range of weights, between 25 and 70 kg. This is similar to what we did for average heights in the last post.

# Input a range of weight values
weight_seq = np.arange(25, 71)

# This is generating a matrix where the predicted h values will be kept
# Each weight value will be its own row
h_pred = np.zeros((len(weight_seq), len(trace_m2_df)))

# Fill out the matrix in this loop
for i, w in enumerate(weight_seq):
    h_pred[i] = stats.norm.rvs(
        loc=trace_m2_df["a"] + trace_m2_df["b"] * (w - xbar), scale=trace_m2_df["sigma"]
    )

# Output the matrix in a dataframe for viewing
df_h_pred = pd.DataFrame(h_pred, index=weight_seq)
df_h_pred.index.name = "weight"
df_h_pred.head()
0 1 2 3 4 5 6 7 8 9 ... 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999
weight
25 146.819087 131.950216 129.793433 141.425616 134.363193 134.193045 138.625503 127.385217 135.971128 132.278403 ... 128.438620 128.187695 140.142773 136.868872 132.564979 140.662170 135.461373 129.179569 133.174905 139.440426
26 136.456664 139.880879 138.150449 144.660997 130.087085 133.466438 134.668864 130.054509 137.989162 144.520142 ... 140.724558 122.746829 141.763728 140.855240 138.828098 125.438146 139.518013 145.103823 144.708165 137.155993
27 133.427578 146.873131 148.040623 134.562287 143.913001 141.570235 138.758449 126.249154 139.976170 136.232402 ... 132.069314 142.616321 139.039138 144.439965 140.008615 132.783098 136.572733 131.534774 135.076601 145.095471
28 143.316194 130.229996 135.166804 140.417059 143.595620 146.727791 142.071108 135.161632 138.845736 140.713138 ... 131.457780 131.502991 140.776164 144.107353 150.544606 144.113096 133.674960 137.908407 131.267668 141.673441
29 142.093570 142.195453 145.842916 144.098927 146.044805 138.547015 148.580323 134.374695 139.463730 141.527401 ... 142.074329 145.405100 144.384839 131.004655 150.856133 139.352867 133.089377 139.672434 139.466480 133.085828

5 rows × 4000 columns

To reiterate, each row is one weight’s sample of heights, as dictated by the posterior distribution of the parameters. We can illustrate that graphically.

# Plotting the data as a scatter plot
plt.scatter(d2["weight"], d2["height"], alpha=0.5)
plt.plot(weight_seq, h_pred, "C0.", color="gray", alpha=0.005)
plt.xlabel("weight")
plt.ylabel("height")
plt.title("Uncertainty of $h_i$")
Text(0.5, 1.0, 'Uncertainty of $h_i$')

png

PyMC method for predictions of actual height

The above plot helps show the uncertainty for a range of weights after some step-by-step calculations. Not surprisingly, pymc and arviz can get us to the same place more quickly. Here’s one way to do that.

# Predict heights
# set the samples parameter to 200 to match the notebook example
# could set it to 4000 to match the number of draws times chains
height_pred = pm.sample_posterior_predictive(trace_m2, 200, heights_model)
height_pred
/Users/blacar/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/pymc3/sampling.py:1687: UserWarning: samples parameter is smaller than nchains times ndraws, some draws and/or chains may not be represented in the returned posterior predictive sample
  warnings.warn(

{'height': array([[152.44015719, 143.8660879 , 139.40217284, ..., 156.3123799 ,
         162.99350218, 160.15982149],
        [161.48691291, 152.72929135, 138.60784035, ..., 162.5252433 ,
         167.12620516, 160.8162708 ],
        [154.18712134, 148.66960604, 140.74062707, ..., 165.56777632,
         160.19434841, 157.95943753],
        ...,
        [160.39855129, 147.75720026, 140.01191018, ..., 159.5460152 ,
         153.48386574, 163.72311138],
        [149.698651  , 142.51656198, 145.54450618, ..., 167.74115779,
         165.92839171, 164.0738555 ],
        [159.73732783, 153.56864439, 141.05003146, ..., 166.80582294,
         161.04500287, 160.92227836]])}
f, ax1 = plt.subplots()
az.plot_hdi(d2["weight"], height_pred["height"], ax=ax1)
ax1.scatter(d2["weight"], d2["height"])
ax1.set_xlabel("weight")
ax1.set_ylabel("height")
ax1.set_title("Uncertainty of $h_i$\n(plotting with the help of arviz)")
/Users/blacar/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/arviz/stats/stats.py:493: FutureWarning: hdi currently interprets 2d data as (draw, shape) but this will change in a future release to (chain, draw) for coherence with other functions
  warnings.warn(

Text(0.5, 1.0, 'Uncertainty of $h_i$\n(plotting with the help of arviz)')

png

The answer to the problem

After all that, we come to the answer, which can be written in a few lines of code.

# Initialize the table
df_answer = pd.DataFrame()
df_answer["individual"] = range(1, 5)
df_answer["weight"] = [45, 40, 65, 31]
df_answer["expected_height"] = None
df_answer["89perc_interval"] = None
# Fill out the table
for i, weight in enumerate(df_answer["weight"]):

    draw = stats.norm.rvs(
        loc=trace_m2_df["a"] + trace_m2_df["b"] * (weight - xbar),
        scale=trace_m2_df["sigma"],
    )

    df_answer.loc[i, "expected_height"] = "{0:0.2f}".format(draw.mean())
    # .at is for insertion of a list
    # https://stackoverflow.com/questions/26483254/python-pandas-insert-list-into-a-cell
    df_answer.at[i, "89perc_interval"] = "{0:0.2f}, {1:0.2f}".format(
        az.hdi(draw)[0], az.hdi(draw)[1]
    )

# Output the table
df_answer
individual weight expected_height 89perc_interval
0 1 45 154.55 146.06, 162.41
1 2 40 150.22 141.98, 157.86
2 3 65 172.79 164.58, 180.99
3 4 31 141.96 133.63, 149.96
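
As a quick sanity check, the h_pred matrix we built earlier already contains simulated heights at these four weights, so we can read essentially the same answer out of it (the numbers differ slightly from the table because of simulation noise):

# Cross-check the answer table against the earlier h_pred matrix
for w in [45, 40, 65, 31]:
    row = df_h_pred.loc[w].values
    print(w, round(row.mean(), 2), np.round(az.hdi(row), 2))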

Summary: A lesson about two kinds of uncertainty

In this series of posts, I took a deep dive into this problem. I initially had not appreciated the two kinds of uncertainty that it illustrates, and it took me some time to absorb the concept. McElreath addresses it directly in the book:

Rethinking: Two kinds of uncertainty. In the procedure above, we encountered both uncertainty in parameter values and uncertainty in a sampling process. These are distinct concepts, even though they are processed much the same way and end up blended together in the posterior predictive simulation. The posterior distribution is a ranking of the relative plausibilities of every possible combination of parameter values. The distribution of simulated outcomes, like height, is instead a distribution that includes sampling variation from some process that generates Gaussian random variables. This sampling variation is still a model assumption. It’s no more or less objective than the posterior distribution. Both kinds of uncertainty matter, at least sometimes. But it’s important to keep them straight, because they depend upon different model assumptions. Furthermore, it’s possible to view the Gaussian likelihood as a purely epistemological assumption (a device for estimating the mean and variance of a variable), rather than an ontological assumption about what future data will look like. In that case, it may not make complete sense to simulate outcomes.

Appendix: Environment and system parameters

%watermark -n -u -v -iv -w
Last updated: Wed May 19 2021

Python implementation: CPython
Python version       : 3.8.6
IPython version      : 7.20.0

numpy     : 1.20.1
scipy     : 1.6.0
arviz     : 0.11.1
seaborn   : 0.11.1
json      : 2.0.9
pandas    : 1.2.1
matplotlib: 3.3.4
pymc3     : 3.11.0

Watermark: 2.1.0