Longitudinal Survey Designs
Notes for Chapter 4 of Causal Inference with Survey Data on LinkedIn Learning, given by Franz Buscha. I’m using this series of posts to take some notes.
```python
import graphviz as gr


def draw_causal_graph(
    edge_list, node_props=None, edge_props=None, graph_direction="UD"
):
    """Utility to draw a causal (directed) graph.

    Taken from: https://github.com/dustinstansbury/statisticalrethinking2023/blob/a0f4f2d15a06b33355cf3065597dcb43ef829991/utils.py#L52-L66
    """
    g = gr.Digraph(graph_attr={"rankdir": graph_direction})
    edge_props = {} if edge_props is None else edge_props
    for e in edge_list:
        props = edge_props[e] if e in edge_props else {}
        g.edge(e[0], e[1], **props)
    if node_props is not None:
        for name, props in node_props.items():
            g.node(name=name, **props)
    return g
```
Surveys with longitudinal data
- A series of snapshots.
- Captures information from the same subjects across multiple points in time.
- Useful for understanding how relationships evolve and for spotting trends.
Example: A training program
Cross-sectional

- snapshot
- static
- limited causality
- quick and cheap

Longitudinal

- time series
- dynamic (can follow someone's productivity over time)
- better causality
- slow and expensive
Types of longitudinal data
- Panel survey
  - Collects data on the same individuals, households, or companies over short time periods. Example: studies of the demographic dynamics of families.
- Cohort survey
  - Follows a group of people who share a common characteristic or experience within a defined period.
- Repeated cross-section
  - Collects data from different samples over time, but from the same population.
Statistical Framework
- The key to working with time is the \(t\) subscript
- Different methods manipulate the time subscript in different ways
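As a quick illustration (my own sketch, not from the course; the variables are hypothetical), here is what panel data with unit and time subscripts looks like, and one way a method can manipulate the \(t\) subscript:

```python
import pandas as pd

# Hypothetical panel: each unit i is observed at several times t,
# so every variable carries an "it" subscript (e.g. wage_it).
panel = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],        # unit subscript i
    "t": [1, 2, 3, 1, 2, 3],         # time subscript t
    "wage": [20.0, 21.5, 23.0, 18.0, 18.5, 19.0],
    "training": [0, 1, 1, 0, 0, 1],
})

# One manipulation of the t subscript: a one-period lag within each unit
panel["training_lag1"] = panel.groupby("id")["training"].shift(1)
print(panel)
```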
Summary
- Longitudinal designs allow for a deeper level of analysis, especially of cause-and-effect relationships
- Remember to consider challenges such as attrition, time-varying confounders, and the complexity of such data
- They often provide a richer and more nuanced view of the world.
Regression models with time effects
- Adding time to a regression model can significantly improve causal inference
- Time flows in one direction
- Time trends and lagged values are common ways to include time
OLS with longitudinal data
- The key to working with time is the \(t\) subscript
- The static model \(Y_{it} = \alpha + \beta X_{it} + \epsilon_{it}\) makes no specific use of time from a methods perspective
- Time can be added to this model
Time manipulation: trends
- Time can be included as a variable (linear or otherwise): \(Y_{it} = \alpha + \beta_1 X_{it} + \beta_2 T_t + \epsilon_{it}\)
- \(T\) is simply the survey time variable
- Many processes trend over time, so it often makes sense to add time as a control
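A simulated sketch (my addition; the coefficients and sample size are invented) of including a linear time trend as a control, estimated with plain NumPy least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
T = np.tile(np.arange(10), 50)   # the survey time variable T
X = rng.normal(size=n)
# Trending process: Y depends on X and on time itself
Y = 1.0 + 0.5 * X + 0.3 * T + rng.normal(scale=0.1, size=n)

# Design matrix: constant, X, and the linear time trend
A = np.column_stack([np.ones(n), X, T])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(beta)  # approximately [1.0, 0.5, 0.3]
```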
Time manipulation: lags
- Lags help explain how past values of X are related to present values of Y
- They help trace how past events affect today's outcome

Termed finite distributed lag models of order \(N\).

- Model of order 2:

\(Y_t = \alpha + \beta_1 X1_t + \beta_2 X2_t + \beta_3 X3_t + \beta_4 X3_{t-1} + \beta_5 X3_{t-2} + \epsilon_t\)

\(X1\) and \(X2\) are measured in the present. \(X3\) is measured at three timepoints (present, lag of 1, and lag of 2).

\(\beta_3, \beta_4, \beta_5\) are estimated as separate coefficients; they are often summed to estimate the long-run effect of \(X3\) on \(Y\).

A powerful model for estimating the causal effect of a variable.
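A simulated sketch of an order-2 finite distributed lag model, simplified to a single regressor (my own illustration; the true coefficients 0.5, 0.3, 0.2 are invented). Summing the estimated lag coefficients recovers the long-run effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)

# True coefficients on x_t, x_{t-1}, x_{t-2}
b = [0.5, 0.3, 0.2]
y = 1.0 + b[0] * x[2:] + b[1] * x[1:-1] + b[2] * x[:-2] \
    + rng.normal(scale=0.1, size=n - 2)

# Regress y_t on the present value and two lags of x
A = np.column_stack([np.ones(n - 2), x[2:], x[1:-1], x[:-2]])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

long_run = coef[1:].sum()  # long-run effect: sum of the lag coefficients
print(long_run)            # approximately 0.5 + 0.3 + 0.2 = 1.0
```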
Summary
Advantages
- Capture dynamic effects
- Temporal causality
- Flexibility

Disadvantages

- Require lots of data
- Autocorrelation/multicollinearity
- Reverse causality
Conclusion
- Using time in a regression can be a real game changer
- You can uncover short- and long-run effects, which cannot be done using static models
Fixed effects regression models
- A straightforward causal method that requires fewer theoretical assumptions.
- Has one major disadvantage: it is "terminal," in the sense that time-invariant variables are removed and can no longer be analyzed.
- Very frequently used with panel data.
Fixed effect: A DAG approach
- The focus is on variation within a data unit over time
- X2 and u are confounders
- X2 can be controlled for, but the unobserved u cannot
- Fixed effects removes both
```python
draw_causal_graph(
    edge_list=[
        ("u", "X1_1"),
        ("u", "X1_2"),
        ("u", "X1_3"),
        ("u", "Y_1"),
        ("u", "Y_2"),
        ("u", "Y_3"),
        ("X2", "X1_1"),
        ("X2", "X1_2"),
        ("X2", "X1_3"),
        ("X2", "Y_1"),
        ("X2", "Y_2"),
        ("X2", "Y_3"),
        ("X1_1", "Y_1"),
        ("X1_2", "Y_2"),
        ("X1_3", "Y_3"),
        ("X1_1", "X1_2"),
        ("X1_2", "X1_3"),
    ],
    edge_props={
        # Dashed edges mark the unobserved confounder u.
        # Keys must match the edge tuples above ("Y_1", not "Y1_1").
        ("u", "X1_1"): {"style": "dashed"},
        ("u", "X1_2"): {"style": "dashed"},
        ("u", "X1_3"): {"style": "dashed"},
        ("u", "Y_1"): {"style": "dashed"},
        ("u", "Y_2"): {"style": "dashed"},
        ("u", "Y_3"): {"style": "dashed"},
    },
)
```
Implementing FE: LSDV
- Least Squares Dummy Variable (LSDV) approach: include all panel units as dummy variables in a regression: \(Y_{it} = \alpha + \beta X_{it} + \lambda Z_i + \epsilon_{it}\)
- \(Z\) is a large vector of dummy variables
- Estimate this using standard regression
- The many \(\lambda\) terms control for time-invariant unobserved effects
- The method is at a disadvantage if the panel is large or hardware is limited
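A toy LSDV sketch (simulated data, my own illustration): the unit dummies absorb each unit's unobserved effect \(u_i\), so the coefficient on \(X\) is recovered even though \(u_i\) is correlated with \(X\) and would bias pooled OLS:

```python
import numpy as np

rng = np.random.default_rng(2)
n_units, n_periods = 50, 5
unit = np.repeat(np.arange(n_units), n_periods)

# Unobserved, time-invariant confounder u_i, correlated with X
u = rng.normal(size=n_units)
x = rng.normal(size=n_units * n_periods) + u[unit]
y = 2.0 * x + u[unit] + rng.normal(scale=0.1, size=n_units * n_periods)

# Pooled OLS omits u and is biased upward
naive = np.polyfit(x, y, 1)[0]

# LSDV: one dummy column per panel unit absorbs u_i
D = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
coef, *_ = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)
print(naive, coef[0])  # naive is biased; coef[0] is close to the true 2.0
```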
Implementing FE: Withintransformation

The within-transformation subtracts the panel-unit average from each measured data point:

\((Y_{it} - \bar{Y}_i) = \beta (X_{it} - \bar{X}_i) + (\epsilon_{it} - \bar{\epsilon}_i)\)

- We're removing unit averages from each variable
- Variables that do not vary over time drop out of the model, including unobserved variables
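The same idea via the within-transformation (simulated data, my own illustration): demeaning each unit's observations removes \(u_i\) without estimating any dummies, and yields the fixed-effects slope:

```python
import numpy as np

rng = np.random.default_rng(2)
n_units, n_periods = 50, 5
unit = np.repeat(np.arange(n_units), n_periods)

u = rng.normal(size=n_units)                        # unobserved u_i
x = rng.normal(size=n_units * n_periods) + u[unit]
y = 2.0 * x + u[unit] + rng.normal(scale=0.1, size=n_units * n_periods)

def demean(v):
    """Subtract each unit's average from its observations."""
    means = np.bincount(unit, weights=v) / n_periods
    return v - means[unit]

# u_i is constant within a unit, so it vanishes after demeaning
x_w, y_w = demean(x), demean(y)
beta_fe = (x_w @ y_w) / (x_w @ x_w)
print(beta_fe)  # close to the true effect of 2.0
```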
Summary
Advantages
- Controls for unobserved confounders
- More credible causal effects

Disadvantages

- Less efficient
- Cannot analyze time-invariant variables
- Time-varying confounders remain a problem
Conclusion
- A powerful tool for longitudinal data structures
- Controls for unobserved time-invariant confounders
- Always estimate FE models if you can, to see if the results differ
- Continue to think about time-varying confounders
Difference-in-differences estimation
- An older, relatively simple technique
- It can be done using only four numbers, which is often as good as more advanced methods
- Compares the difference in outcomes before and after for two groups: the difference in the differences
- The DiD is then the effect of the intervention
Basic DiD: Minimum Wages
- Famous study by Card and Krueger; Card later won the Nobel Prize
- They were interested in the effect of minimum wages on employment
- Theory says the effect can be either positive or negative

"Treated":

- New Jersey: Feb 1992 = $4.25
- New Jersey: Nov 1992 = $5.05
- Difference = after - before

"Untreated control":

- Pennsylvania: Feb 1992 = $4.25
- Pennsylvania: Nov 1992 = $4.25
Basic DiD: Minimum Wages (fast food restaurants)
| Average Employment (FTE) | New Jersey | Pennsylvania | Difference |
|--------------------------|------------|--------------|------------|
| Before                   | 20.44      | 23.33        | -2.89      |
| After                    | 21.03      | 21.17        | -0.14      |
| Difference               | 0.59       | -2.16        | 2.76       |
To get the full effect of the minimum wage law on employment in NJ, use the counterfactual: assume NJ would have been on the same trajectory as Pennsylvania. That's why you take the difference of the differences (\(0.59 - (-2.16) \approx 2.76\)).
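The "only four numbers" claim can be checked directly. Note that the rounded table values give 2.75; the published estimate of 2.76 comes from unrounded data:

```python
# Card & Krueger's four average-FTE numbers
nj_before, nj_after = 20.44, 21.03
pa_before, pa_after = 23.33, 21.17

# DiD = (treated after - before) - (control after - before)
did = (nj_after - nj_before) - (pa_after - pa_before)
print(round(did, 2))  # 2.75
```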
Basic DiD: Visualization
(The figure here is a generic DiD visualization; it does not use the NJ/Penn numbers.)

- The key assumption of DiD estimation is the parallel trends assumption (if nothing had happened, NJ would have followed the same trend as Pennsylvania), represented by the dashed line.
- The difference between the end of the dashed line and the observed outcome in NJ is the DiD
Regression DiD
- Needed if there are additional confounders:

\(Y_{it} = \alpha + \beta_1 Treatment_i + \beta_2 Time_t + \beta_3 (Treatment_i \times Time_t) + \beta_4 X_{it} + \epsilon_{it}\)

- \(Treatment_i\) = 1 if treated, 0 if control
- \(Time_t\) = 1 if post-period, 0 if pre-period
- \(Treatment_i \times Time_t\) = the interaction; its coefficient \(\beta_3\) is the DiD estimate
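A simulated sketch of a regression DiD with an interaction term (my own illustration; every coefficient is invented). The interaction coefficient recovers the planted treatment effect of 1.5:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
treat = rng.integers(0, 2, n)   # Treatment_i
time = rng.integers(0, 2, n)    # Time_t (0 = pre, 1 = post)
x = rng.normal(size=n)          # additional confounder X_it

# True DiD effect (the interaction coefficient) is 1.5
y = (0.5 + 1.0 * treat + 0.8 * time + 1.5 * treat * time
     + 0.4 * x + rng.normal(scale=0.1, size=n))

A = np.column_stack([np.ones(n), treat, time, treat * time, x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef[3])  # the DiD estimate, close to 1.5
```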
Assumptions
- Parallel trends
- No contamination
  - Treatment cannot jump into the control group and vice versa
Summary
Advantages
- Handles unobserved confounders
- Easy to apply
- Flexible (can be combined with other methods, like matching techniques or standard regression techniques)

Disadvantages

- Requires the assumptions to be met
- Can't be used for single-case evaluations
Conclusions
- Popular evaluation strategy
- Needs good-quality data with at least two periods
- But it can offer compelling insights into the causal impact of policies and interventions, helping to guide decision making
Synthetic control methods
- A newer innovation in causal inference (roughly the last 15 years)
- Somewhat similar to DiD
- Creates a "synthetic control" that looks like the treated unit, using a weighted combination of potential control units
- A great approach when no easy controls are available
Creating the synthetic control
- The synthetic control is the weighted sum of other potential control units: \(Y_t^{synthetic} = \sum_j w_j Y_{jt}\)
- The weights must sum to 1: \(\sum_j w_j = 1\)
Estimating the weights
- The weights minimize the difference between the treated unit and the synthetic control before the intervention
- This involves an optimization process
- Typically, it also involves control variables that assist in matching the treated and synthetic units
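A sketch of the weight optimization using `scipy.optimize.minimize` (my own illustration; the outcomes are simulated random walks and the "true" weights 0.6 and 0.4 are planted). The constraints enforce non-negative weights that sum to 1:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
T_pre, n_controls = 20, 5

# Pre-intervention outcome paths for the control units (random walks)
controls = rng.normal(size=(T_pre, n_controls)).cumsum(axis=0)
# Treated unit is secretly 0.6*control_0 + 0.4*control_2
treated = 0.6 * controls[:, 0] + 0.4 * controls[:, 2]

def loss(w):
    """Pre-intervention mismatch between treated unit and synthetic control."""
    return np.sum((treated - controls @ w) ** 2)

res = minimize(
    loss,
    x0=np.full(n_controls, 1 / n_controls),
    bounds=[(0, 1)] * n_controls,                              # w_j >= 0
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1},  # sum to 1
)
print(np.round(res.x, 2))  # weight concentrates on controls 0 and 2
```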
Control variables
- Must be unaffected by the intervention/policy/treatment
- Typical control variables include economic or demographic indicators and other pre-intervention attributes
- The quality of the synthetic control greatly depends on these variables
Evaluating the intervention

- The effect of the intervention is the difference between the treated unit and the synthetic control: \(\hat{\tau}_t = Y_{1t} - Y_t^{synthetic}\)
- Synthetic control estimates are tailored to recover the ATT (the average treatment effect on the treated).
Cross-contamination is allowed

- This occurs when control units are affected by the intervention
- It is tolerated in scenarios where pure separation between treated and control isn't possible, such as geographic cases

Example: if California implements a new health policy, neighboring states are likely to follow and implement some, but not all, of its policies, so the neighboring control states are not "clean." A real example is Proposition 99, California's tobacco legislation.
Summary
Advantages
- Handling heterogeneity
- Complex interventions

Disadvantages

- Requires many pre-treatment periods
- External validity
- Sensitive to pre-intervention fit
Conclusion
- Synthetic control provides a sophisticated tool
- Especially useful for analyzing interventions that affect only one unit, when cross-contamination exists
```python
%load_ext watermark
%watermark -n -u -v -iv -w
```
Last updated: Tue May 28 2024
Python implementation: CPython
Python version : 3.12.3
IPython version : 8.24.0
graphviz: 0.20.3
Watermark: 2.4.3