PyMC linear regression part 4: predicting actual height

11 minute read

At last, we have come to the end. This is the final post in a series of linear regression posts using PyMC3, from my reading of Statistical Rethinking. Part 1 set up the problem and explored the package’s objects. Part 2 interpreted the posterior distribution. Part 3 predicted average height, which has its own uncertainty. Here, we’ll predict actual height, which avoids the over-confidence that comes from predicting only average height.

Going deep into each step took some time, but it helped embed the concepts in my brain. As we’ve done before, we’ll start off by duplicating code from the previous posts before making the final predictions.

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pymc3 as pm
import scipy.stats as stats
import seaborn as sns
%load_ext nb_black
%config InlineBackend.figure_format = 'retina'
%load_ext watermark
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")

Question

This question comes from the winter 2020, week 2 homework.

The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% compatibility intervals for each of these individuals. That is, fill in the table below, using model-based predictions.

Individual | weight | expected height | 89% interval
1 | 45 | |
2 | 40 | |
3 | 65 | |
4 | 31 | |

Let’s quickly take a look at the data to get a handle on what we’re working with.

d = pd.read_csv("../data/a_input/Howell1.csv", sep=";", header=0)
d2 = d[d.age >= 18]  # filter to get only adults

Setting up the variables

# Get the average weight as part of the model definition
xbar = d2.weight.mean()

with pm.Model() as heights_model:

    # Priors are variables a, b, sigma
    # pm.Normal represents the stochastic relationship between the left and right sides of the equation
    a = pm.Normal("a", mu=178, sd=20)
    b = pm.Lognormal("b", mu=0, sd=1)
    sigma = pm.Uniform("sigma", 0, 50)

    # This is a linear model (not really a prior or likelihood?)
    # Data included here (d2.weight, which is observed)
    # Mu is deterministic, but a and b are stochastic
    mu = a + b * (d2.weight - xbar)

    # Likelihood is the height variable, which is also observed (data included here: d2.height)
    # Height is dependent on deterministic and stochastic variables
    height = pm.Normal("height", mu=mu, sd=sigma, observed=d2.height)

    # The next line does the fitting and sampling all at once.
    trace_m2 = pm.sample(1000, tune=1000, return_inferencedata=False, progressbar=False)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, b, a]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 12 seconds.
trace_m2_df = pm.trace_to_dataframe(trace_m2)

The next set of code gives point estimates of the parameters as well as a sense of the uncertainty in the posterior, which is more useful.

az.rcParams["stats.hdi_prob"] = 0.89  # sets default credible interval used by arviz
az.summary(trace_m2, round_to=3, kind="stats")
/Users/blacar/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/arviz/data/io_pymc3.py:88: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.
  warnings.warn(
mean sd hdi_5.5% hdi_94.5%
a 154.600 0.275 154.178 155.055
b 0.903 0.043 0.835 0.970
sigma 5.103 0.195 4.789 5.404
# Covariance matrix
trace_m2_df.cov().round(3)
a b sigma
a 0.075 0.000 0.000
b 0.000 0.002 -0.000
sigma 0.000 -0.000 0.038

Predictions for actual height

We return now to the original problem, where we are seeking predictions for actual height. We will produce compatibility intervals within which we’d expect to see any observed height (note how the uncertainty region for $\mu_i$, shown previously, is much narrower than the spread of the data points).

We will do something similar to the previous post, but this time incorporating $\sigma$.

$\text{height}_i \sim \text{Normal}(\mu_i, \sigma)$

The repo initially confused me, but after exploring the functions some more I saw that the solution was straightforward. I’ll spare you most of that exploration, but I’ll note some of my key insights here.

Function exploration

One key realization for me was seeing that the scipy function for drawing from a normal distribution can take varied mean (loc) and standard deviation (scale) values. Previously, I had typically used it to get multiple samples from the same mean and standard deviation, like this:

stats.norm.rvs(loc=10, scale=2, size=10)
array([ 7.09403309, 12.255019  ,  7.32435791, 10.26594505, 11.79233188,
       12.09804331, 10.84097672, 14.51428249, 14.22362859,  9.25777341])

I hadn’t previously thought to use a different mean and standard deviation for each draw. That is, loc and scale can be vectors. Below, I vary both so that you can see the effect in the resulting samples.

stats.norm.rvs(
    loc=[1, 1, 1, 10, 10, 10, 20, 20, 20], scale=[0.5, 1, 2, 0.5, 1, 2, 0.5, 1, 2]
)
array([ 0.73038967,  0.37624878,  1.87080349,  9.87713139, 10.7586794 ,
        9.36231624, 20.29745599, 19.36382859, 18.66087831])

This is when I had my a-ha moment about what to do next. Remember: the posterior distribution that pymc outputs has a different mean and standard deviation in each row.

trace_m2_df.head()
a b sigma
0 154.030585 0.853788 4.989442
1 153.969973 0.861619 5.128790
2 155.111070 0.957769 5.067568
3 154.106523 0.838728 4.961068
4 154.328210 0.863311 5.160797

We don’t need a multivariate normal distribution function; we can instead use the regular, univariate normal. This implementation is more consistent with R code 4.63, which I copied below from McElreath’s repo.

## R code 4.63
post <- extract.samples(m4.3)
weight.seq <- 25:70
sim.height <- sapply( weight.seq , function(weight)
    rnorm(
        n=nrow(post) ,
        mean=post$a + post$b*( weight - xbar ) ,
        sd=post$sigma ) )
height.PI <- apply( sim.height , 2 , PI , prob=0.89 )
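
For reference, here is a rough Python translation of that R snippet, a minimal sketch using the vectorized stats.norm.rvs call described above (sim_height and height_PI simply mirror the R names, and az.hdi stands in for McElreath’s PI; we’ll build this up step by step below):

# Sketch: a Python analogue of R code 4.63
weight_seq = np.arange(25, 71)
sim_height = np.array(
    [
        stats.norm.rvs(
            loc=trace_m2_df["a"] + trace_m2_df["b"] * (w - xbar),
            scale=trace_m2_df["sigma"],
        )
        for w in weight_seq
    ]
)
# One 89% interval per weight (az.hdi uses the default hdi_prob we set earlier)
height_PI = np.array([az.hdi(row) for row in sim_height])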

Making predictions for actual height ($h_i$) at a single weight value

When we input a weight, we can create a vector of heights drawn from normal distributions parameterized by the posterior samples. This may seem confusing at first, so let’s just look at one weight, 45.

Note how the parameters within the stats.norm.rvs function mimic McElreath’s R code above.

example_weight = 45
h_at_45 = stats.norm.rvs(
    loc=trace_m2_df["a"] + trace_m2_df["b"] * (example_weight - xbar),
    scale=trace_m2_df["sigma"],
)
print(h_at_45)
print("Length of vector: ", len(h_at_45))
[155.7948111  146.56221928 159.43749538 ... 155.33600385 149.83412841
 151.31287874]
Length of vector:  4000

We have a vector of values for which we can compute summary statistics, including getting a compatibility interval.

# Get 89% compatibility interval
az.hdi(np.array(h_at_45))
array([146.08867804, 162.48740096])
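
The HDI isn’t the only way to summarize this vector. As a quick sketch, here are the mean and a simple percentile interval (the analogue of McElreath’s PI) alongside the HDI:

# Other summaries of the simulated heights at weight 45
print("mean:", h_at_45.mean())
print("5.5% / 94.5% percentile interval:", np.percentile(h_at_45, [5.5, 94.5]))
print("89% HDI:", az.hdi(np.array(h_at_45)))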

This may look like a repeat of what we did for average height in the last post, but here we’ve incorporated sigma from our posterior distribution. Let’s repeat the code for average height and compare it with what we just did for actual height.

# Vector for average height - note that sigma is not used!
mu_at_45 = trace_m2_df["a"] + trace_m2_df["b"] * (45 - xbar)
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), sharex=True)

az.plot_kde(mu_at_45, ax=ax1)
ax1.set_title("prediction of average height\nat weight of 45")
ax1.set_ylabel("density")
ax1.set_xlabel("average height ($\mu$)")
ax1.vlines(
    az.hdi(np.array(mu_at_45)),
    ymin=0,
    ymax=1.5,
    color="gray",
    linestyle="dashed",
    linewidth=1,
)

az.plot_kde(h_at_45, ax=ax2)
ax2.set_title("prediction of actual height\nat weight of 45")
ax2.set_ylabel("density")
ax2.set_xlabel("actual height ($h$)")
ax2.vlines(
    az.hdi(np.array(h_at_45)),
    ymin=0,
    ymax=0.1,
    color="gray",
    linestyle="dashed",
    linewidth=1,
)
<matplotlib.collections.LineCollection at 0x7fe5ad6e2940>

png

I am intentionally plotting the x-axis with the same axis limits so that you can appreciate the difference in uncertainty. (Note the difference in the y-axis scale.) We would expect that the probability distribution for average height would be much tighter than the one for actual height. The plots drive that point home visually.
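
To put a number on that visual difference, here is a quick sketch comparing the widths of the two 89% intervals:

# Compare interval widths for average vs. actual height at weight 45
mu_interval = az.hdi(np.array(mu_at_45))
h_interval = az.hdi(np.array(h_at_45))
print("average height (mu) interval width:", mu_interval[1] - mu_interval[0])
print("actual height (h) interval width:", h_interval[1] - h_interval[0])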

Making predictions for actual height ($h_i$) over a range of weight values

We’ll start off by generating a matrix that contains the posterior predictions for actual heights over a range of weights, between 25 and 70 kg. This is similar to what we did for average heights in the last post.

# Input a range of weight values
weight_seq = np.arange(25, 71)

# This is generating a matrix where the predicted h values will be kept
# Each weight value will be its own row
h_pred = np.zeros((len(weight_seq), len(trace_m2_df)))

# Fill out the matrix in this loop
for i, w in enumerate(weight_seq):
    h_pred[i] = stats.norm.rvs(
        loc=trace_m2_df["a"] + trace_m2_df["b"] * (w - xbar), scale=trace_m2_df["sigma"]
    )

# Output the matrix in a dataframe for viewing
df_h_pred = pd.DataFrame(h_pred, index=weight_seq)
df_h_pred.index.name = "weight"
df_h_pred.head()
0 1 2 3 4 5 6 7 8 9 ... 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999
weight
25 146.819087 131.950216 129.793433 141.425616 134.363193 134.193045 138.625503 127.385217 135.971128 132.278403 ... 128.438620 128.187695 140.142773 136.868872 132.564979 140.662170 135.461373 129.179569 133.174905 139.440426
26 136.456664 139.880879 138.150449 144.660997 130.087085 133.466438 134.668864 130.054509 137.989162 144.520142 ... 140.724558 122.746829 141.763728 140.855240 138.828098 125.438146 139.518013 145.103823 144.708165 137.155993
27 133.427578 146.873131 148.040623 134.562287 143.913001 141.570235 138.758449 126.249154 139.976170 136.232402 ... 132.069314 142.616321 139.039138 144.439965 140.008615 132.783098 136.572733 131.534774 135.076601 145.095471
28 143.316194 130.229996 135.166804 140.417059 143.595620 146.727791 142.071108 135.161632 138.845736 140.713138 ... 131.457780 131.502991 140.776164 144.107353 150.544606 144.113096 133.674960 137.908407 131.267668 141.673441
29 142.093570 142.195453 145.842916 144.098927 146.044805 138.547015 148.580323 134.374695 139.463730 141.527401 ... 142.074329 145.405100 144.384839 131.004655 150.856133 139.352867 133.089377 139.672434 139.466480 133.085828

5 rows × 4000 columns

To reiterate, each row is one weight’s sample of heights, as dictated by the posterior distribution of the parameters. We can illustrate that graphically.

# Plotting the data as a scatter plot
plt.scatter(d2["weight"], d2["height"], alpha=0.5)
plt.plot(weight_seq, h_pred, "C0.", color="gray", alpha=0.005)
plt.xlabel("weight")
plt.ylabel("height")
plt.title("Uncertainty of $h_i$")
Text(0.5, 1.0, 'Uncertainty of $h_i$')

png

PyMC method for predictions of actual height

The above plot helps show the uncertainty for a range of weights after some step-by-step calculations. Not surprisingly, pymc and arviz can get us to the same place more quickly. Here’s one way to do that.

# Predict heights
# set the samples parameter to 200 to match the notebook example
# could set it to 4000 to match the number of draws times chains
height_pred = pm.sample_posterior_predictive(trace_m2, 200, heights_model)
height_pred
/Users/blacar/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/pymc3/sampling.py:1687: UserWarning: samples parameter is smaller than nchains times ndraws, some draws and/or chains may not be represented in the returned posterior predictive sample
  warnings.warn(

{'height': array([[152.44015719, 143.8660879 , 139.40217284, ..., 156.3123799 ,
         162.99350218, 160.15982149],
        [161.48691291, 152.72929135, 138.60784035, ..., 162.5252433 ,
         167.12620516, 160.8162708 ],
        [154.18712134, 148.66960604, 140.74062707, ..., 165.56777632,
         160.19434841, 157.95943753],
        ...,
        [160.39855129, 147.75720026, 140.01191018, ..., 159.5460152 ,
         153.48386574, 163.72311138],
        [149.698651  , 142.51656198, 145.54450618, ..., 167.74115779,
         165.92839171, 164.0738555 ],
        [159.73732783, 153.56864439, 141.05003146, ..., 166.80582294,
         161.04500287, 160.92227836]])}
f, ax1 = plt.subplots()
az.plot_hdi(d2["weight"], height_pred["height"], ax=ax1)
ax1.scatter(d2["weight"], d2["height"])
ax1.set_xlabel("weight")
ax1.set_ylabel("height")
ax1.set_title("Uncertainty of $h_i$\n(plotting with the help of arviz)")
/Users/blacar/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/arviz/stats/stats.py:493: FutureWarning: hdi currently interprets 2d data as (draw, shape) but this will change in a future release to (chain, draw) for coherence with other functions
  warnings.warn(

Text(0.5, 1.0, 'Uncertainty of $h_i$\n(plotting with the help of arviz)')

png

The answer to the problem

After all that, we come to the answer, which can be written in a few lines of code.

# Initialize the table
df_answer = pd.DataFrame()
df_answer["individual"] = range(1, 5)
df_answer["weight"] = [45, 40, 65, 31]
df_answer["expected_height"] = None
df_answer["89perc_interval"] = None
# Fill out the table
for i, weight in enumerate(df_answer["weight"]):

    draw = stats.norm.rvs(
        loc=trace_m2_df["a"] + trace_m2_df["b"] * (weight - xbar),
        scale=trace_m2_df["sigma"],
    )

    df_answer.loc[i, "expected_height"] = "{0:0.2f}".format(draw.mean())
    # .at is for insertion of a list
    # https://stackoverflow.com/questions/26483254/python-pandas-insert-list-into-a-cell
    df_answer.at[i, "89perc_interval"] = "{0:0.2f}, {1:0.2f}".format(
        az.hdi(draw)[0], az.hdi(draw)[1]
    )

# Output the table
df_answer
individual weight expected_height 89perc_interval
0 1 45 154.55 146.06, 162.41
1 2 40 150.22 141.98, 157.86
2 3 65 172.79 164.58, 180.99
3 4 31 141.96 133.63, 149.96
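
As a quick sanity check, the h_pred matrix we built earlier already contains simulated heights at these four weights, so we can read essentially the same answer out of it (the numbers differ slightly from the table because of simulation noise):

# Cross-check the answer table against the earlier h_pred matrix
for w in [45, 40, 65, 31]:
    row = df_h_pred.loc[w].values
    print(w, round(row.mean(), 2), np.round(az.hdi(row), 2))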

Summary: A lesson about two kinds of uncertainty

In this series of posts, I took a deep dive into this problem. I initially had not appreciated the two kinds of uncertainty that it illustrates, and it took me some time to absorb the concept. McElreath addresses it directly in the book:

Rethinking: Two kinds of uncertainty. In the procedure above, we encountered both uncertainty in parameter values and uncertainty in a sampling process. These are distinct concepts, even though they are processed much the same way and end up blended together in the posterior predictive simulation. The posterior distribution is a ranking of the relative plausibilities of every possible combination of parameter values. The distribution of simulated outcomes, like height, is instead a distribution that includes sampling variation from some process that generates Gaussian random variables. This sampling variation is still a model assumption. It’s no more or less objective than the posterior distribution. Both kinds of uncertainty matter, at least sometimes. But it’s important to keep them straight, because they depend upon different model assumptions. Furthermore, it’s possible to view the Gaussian likelihood as a purely epistemological assumption (a device for estimating the mean and variance of a variable), rather than an ontological assumption about what future data will look like. In that case, it may not make complete sense to simulate outcomes.

Appendix: Environment and system parameters

%watermark -n -u -v -iv -w
Last updated: Wed May 19 2021

Python implementation: CPython
Python version       : 3.8.6
IPython version      : 7.20.0

numpy     : 1.20.1
scipy     : 1.6.0
arviz     : 0.11.1
seaborn   : 0.11.1
json      : 2.0.9
pandas    : 1.2.1
matplotlib: 3.3.4
pymc3     : 3.11.0

Watermark: 2.1.0