Background

This course is about modeling heterogeneity based on clustering or latent grouping structures. If that sounds abstract it’s because in this class we cover a lot of different kinds of heterogeneity and a lot of different kinds of clustering and grouping structures. Everything from students in classes in schools to countries in dyads and alliance structures to public opinion in counties and countries over time. This course is designed to start high up with the theoretical problems and then work our way through examples so these notes will provide you with a basic reference for the first week in class where we discuss the theory. Below, I outline the basic multilevel modeling toolkit that we will use repeatedly in this class. Depending on your field, i may use language different than what you’ve heard previously. You should reference these notes continually as they will likely be updated and edited throughout the course.

Multilevel models go by a lot of names. This course uses “multilevel” as opposed to “mixed” or “hierarchical” mainly out of habit. It’s the language that I (the instructor) learned first in graduate school and so it’s the default in my head for any kind of model designed to deal with clustering or nested structure. I will try to always reference which literature I’m talking about and if something is really more of one kind of model than another.

Terms

One of the hardest things with a class like this is keeping track of terms. Different people across fields can use different language to mean basically the same thing without realizing it. They can also use the same words to mean different things. Below you’ll find some general terms used across different fields that relate to things in this course.

multilevel model: Typically a nested model with group varying intercepts and slopes from the education and psych literature; Often not explicitly longitudinal (eg students in classes) but refers to longitundinal models as well
mixed effects models: Mainly used in biostat with longitudinal models to refer to a generalized linear model with random intercepts or slopes
hierarchical linear models: Mainly used in the education lit for linear models with random intercepts and slopes; Sometimes also confounded with bayesian hierarchical models which are a substitute but not synonom.
bayesian hierarchical model: a bayesian version of a multilevel model that is in practice a cross between a dummy variable model and a mixed effects model
random effects: Any combination of random intercepts or random slopes; Used by economists to mean a model with a random intercept in it
fixed effects: In the statistics, biostat, and psych lit – typically \(\alpha\) and \(\beta\) where the paramaters are not modeled as group-varying; In the econ, poli sci, and policy lit – typically \(\beta(X_{ij}-\overline{X}_{j})\) where \((\overline{X}_{j})\) has been removed from the data by either being modeled with dummy variables or removed from both sides of the equation with something like a first difference; In the econ, poli sci, and policy lit – a type of longitudinal model with between group variance removed from the analysis
random intercept: a latent variable built from group-conditional intercepts that does not vary within groups; sometimes called a group-varying intercept in the stats literature; sometimes just referred to as a random effect in most literatures
random slope: a latent variable built from group-conditional \(\beta\) coefficients that does not vary within groups; also called random or group-varying coefficient
variance components: an archaic term for random effects from the statistics literature; often interchangeable with the level-2 residual framework for random intercepts
longitudinal models: an umbrella term for models with data measured over time. often, interchangeable with mixed effects models
panel data models: econometric models for longitudinal data; typically these models are based on ripping between group variance out of the model as opposed to directly modeling it; often used interchangeably with “fixed effect models” despite panel data models being much broader
growth curve models: longitudinal data methods based on modeling change over time typically with splines

The Classical Linear Model (CLM)

The classical linear model (aka regression) is the building block of most modern and commonly used statistical techniques.

\[ Y_i = \alpha_i + \beta(X_i) + \varepsilon_i \] There are a number of basic assumptions inherent to the CLM that are the focus of any number of advanced classes. One of the core assumptions of the CLM is that the data are i.i.d. or independently and identically distributed. The residuals are presumed to be noise that can be reasonably ignored. Clustered data automatically break this assumption by having observations that are more alike within groups than across groups.

Random Intercepts

\[ Y_i = \alpha_i + \beta(X_{ij}) + \color{orange}{\mu_j} + \varepsilon_i \] The least problematic kind of i.i.d. violation is that where the clustering is related to the residuals but not to the betas . In this situation the most basic solution is to split the error term \(\varepsilon_{ij}\) into \(\varepsilon_i\) and \(\mu_j\) so that the within-group residuals become conditionally independent and identically distributed (in theory). This is known as a random or varying intercept model.¹ The core idea is that we have made \(\varepsilon_i\) i.i.d. by splitting out \(\mu_j\). Assuming that both \(\mu_j\) and \(\varepsilon_{ij}\) are conditionally independent of any \(\beta_{ij}\) then we should be fine with this solution.

This assumption has a tendency to break down in practice. It is often fine in an experimental context because random assignment makes the groups compositionally equivalent by construction. If the groups themselves are exchangeable draws of some larger population of groups then \(\mu_j\) is often all you need because the problem is about splitting up your error. If you don’t have random assignment (as with all observational data) then the groups are probably not compositionally equivalent and you’ll have correlation between \(\beta(X_{ij})\) and \(\mu_j\). In practice, this assumption is problematic because the same latent factors that cause \(\mu_j\) to be correlated with \(\varepsilon_i\) are also likely to cause it to be correlated with any given \(\beta_i\). In both experimental and observational settings you can still end up with groups that are not exchangeable despite assignment into those groups having been so. This means something about the group moderates the effect of X on Y but that something may or may not be about the composition of the group. There are a variety of potential solutions in this circumstance that we come to momentarily.

Before moving on to other modeling strategies for omitted variable bias it is worth spending some time on random effects more generally to help with intuition. Random intercepts and random slopes are not observed and measured variables that will show up in a given data set. They have to be built from other variables through some modeling strategy and so they are actually latent variables (see Rabe-Hesketh & Skrondal 2004). Conceptually, a random intercept is built from a set of \(\alpha_i\)’s that have been subdivided by group. The most intuitive way to think about it is if we start with the model \(Y_{ij} = \alpha_{ij} + \beta(X_{ij}) + \varepsilon_{ij}\) and decide to instead have G separate models, one for each group in J so that we have:

\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 1 where \(\alpha_i=5\)
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 2 where \(\alpha_i=10\)
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 3 where \(\alpha_i=15\)
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 4 where \(\alpha_i=20\)
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 5 where \(\alpha_i=25\)

We simply run these 5 models and extract \(\alpha_i\) from each of them in turn. We then stack them in a new variable \(\mu\) so that every value of \(\mu\) for group 1 is 5, every value of \(\mu\) for group 2 is 10, every value of \(\mu\) for group 3 is 15 and so on. This new variable goes back into a pooled model \(Y_i = \alpha_i + \beta(X_{ij}) + \varepsilon_i\) to become \(Y_i = \alpha_i + \beta(X_{ij}) + \color{orange}{\mu_j} + \varepsilon_i\) where \(\mu\) pulls out the group-level contamination in \(\varepsilon_{ij}\). This kind of two-stage approach to building random effects is atypical in practice because it doesn’t allow for shrinkage but it is nonetheless useful for intuition-building. ² ³ ⁴

Mundlak & group-mean centering

Once again, the random intercept by itself typically doesn’t do enough to solve the assumption violation inherent with clustered data because it mainly works to break apart the connections between the grouping structure and \(\varepsilon_i\). It can partially adjust for confouding in \(\beta_i\) but doesn’t do enough by itself to provide unbiased within effects. Outside of experimental settings, we typically still need to deal with the problem of correlation between the grouping structure and \(\beta_i\). The Mundlak device was developed as a way to orthogonalize \(X_i\) and \(X_j\) so that the effects of those variables on Y can be modeled separately.

In essence, the \(\beta_{ij}\)’s are split apart in the same way that \(\varepsilon_{ij}\) is split into \(\varepsilon_i\) and \(\mu\) ⁵. (Mundlak 1978) proved that the dummy variable approach yielded equivalent results for the within-group effects of X on Y as simply including \(\overline{X}\) in the model. This has been well known within panel data econometrics since 1978 and even forms the basis of a large number of panel data models (see the Hausman-Taylor model for instance) but has not proliferated widely to applied economics or to other social sciences. (Bell and Jones 2015) have an excellent applied paper working through the logic of group mean centered, Mundlak, and traditional FE specifications to demonstrate it.

\[Y_i = \alpha_i + \beta(X_{ij}-\overline{X}_{j}) + \beta(\overline{X}_{j}) + \color{orange}{\mu_j} + \varepsilon_i\]

The equation above is the group-mean centered variant on a Mundlak device. Centering \(X_{ij}\) so that the group mean of X is zero is a way to help fully orthogonalize the relationship between the within-group and between-group portions of the relationship between X and Y. This is typically the preferred approach to constructing a multilevel model (Bell and Jones 2015). It should be your default basic multilevel model. However, it is not always necessary to recenter X when including group means nor is it always necessary to include group means if you have recentered \(X_i\). Recentering and including group means is almost always a good idea for continuous variables. However, recentering can become somewhat problematic with categorical variables as you lose their natural interpreation. For categorical variables, recentering results in values that are nonsensical and so they are typically evaluated on their default scale.

A Fixed Effects Model

The standard alternative to the Mundlak device typically used in the social sciences is an econometric fixed effects model whereby group-level effects are removed directly from the data. This can be accomplished in a number of ways that get different mileage depending on if you are using a linear or generalized linear model.

\[Y_{ij} = \alpha_i + \beta(X_i) + \beta(J_1) + \beta(J_2) +...\beta(J_{g-1}) + \varepsilon_i\]

A fixed effects model (in the language of econometrics) is a model that has excised all between group variability from an analysis. We can do this with a collection of subject/group specific indicator variables used to decompose the within group and between group data. More commonly, in a fixed effects model we simply condition out the between group effects with some kind of data manipulation like a first difference. The kind of model outlined above can only represent the effect of X on Y given internal variability by group and therefore heavily depends on within group heterogeneity in both X and Y. Any group invariant variables in the model are automatically invalid and purged from a standard fixed effects model.

The obvious problem with this approach is that it simply ignores a sizable amount of variation that occurs in reality. It sacrifices generalizability to help attain validity. In most observational settings the between group effects are of important substantive interest.

Fixed Effects models do not help with standard error issues at all because they lead directly to errors that are not reflective of a population. They divorce an estimation sample from context the same way that an experimental design can’t be generalized to a population. This was the original reason why random effects were seen as valuable (Eisenhart 1947).

Fixed effects models do help with endogeneity problems like cluster confounding or general effect heterogeneity across groups in that they allow you to estimate the common effects of X on Y that are group-invariant. You get a clean measure of the within-group effects of X on Y so long as other sources of endogeneity (reciprocal causation, standard omitted variable bias, selection effects) are not at work. However, you get no measures of context and no way to evaluate consistency of effects in the real world.

Fixed Effects models (via something like a first difference) actually tend to make predictive accuracy worse than with a simple linear model. The entire point of a Fixed Effects model is to cut variability out of the analysis which is anathema to any kind of predictive enterprise.

Fixed effects are quite useful under a number of other circumstances that are much more difficult with other modeling strategies. If groups represent some actual mistake in either your research design or the implementation of your design, then a fixed effects model is the only real approach. Here, between group effects are pathological to your research design and cutting them out of your data helps you make better inferences. This applies if you are doing a controlled experiment with some form of random assignment, but the randomization process did not magically lead to equivalent groups—it usually doesn’t. It is also quite useful if you are studying some kind of observational treatment effect and need to pretend you have something like random assignment. This is why fixed effects estimators are often classified under tools for causal inference.

Fixed effects are also very useful if you are interested in potentially mixing heterogeneous populations within an analysis but do not have enough groups to effectively use random effects. Using dummy variables for gender or race instead of attempting incredibly ill-fitting random effects based on those categories is standard practice. When fixed effects are viewed this way, it becomes apparent that we often mix and match fixed effects models (with group-specific indicator variables) and random effects in the same analysis (Fitzmaurice, Laird, and Ware 2012).

Finally, fixed effects models provide an invaluable benchmark for evaluating an attempt to model a full data generating process and the scale of potential effect heterogeneity. The Hausman test allows for a direct comparison of β coefficients on a model with no cluster-based endogeneity and one based on the entire sample. By knowing the unbiased within-group effect of X on Y we gain a solid starting point in a more comprehensive attempt to model within, between, and cross-level effects. If we cannot effectively build a model of the DGP that yields similar results for within-group effects, then we know that our model is incomplete. We can start with this inconsistency and reason inductively toward a set of potentially better approximations to the DGP.

Random Slopes

Where the basic fixed effects (e.g. dummies) approach and the Mundlak device/group-mean centered model solve the conditional independence problem by splitting the X’s into two sets of variables we have another viable option that can be used as either a complement or partial and imperfect substitute. The random coefficient model simply recreates the random intercept solution of decomposing an error term for β. Intuitively, we can estimate 5 different models, one on each group J. We can then extract \(\alpha_i\) from each model as before to create the random intercept \(\mu_j\). We can simply go a step further and extract \(\beta_i\) from each model, stack them in a new vector, and create the random coefficient \(N_{xj}\).

\[ Y_i = \alpha_i + \beta(X_{ij}) + \color{orange}{\mu_j} + \color{orange}{N_{xj}} + \varepsilon_i \]

In practice, this is not how random coefficients are estimated in any multilevel modeling software. Typically, they are estimated with empirical Bayesian shrinkage which draws the group-specific \(\beta_j\) closer to the pooled average β. Shrinkage literally induces bias in the values of the random coefficient (also called the BLUP) to reduce variance via partial pooling. This shrinkage means that the random coefficient will reduce the confounding between \(X_i\) and \(X_j\) but will not completely eliminate it the way that either dummy variables or a Mundlak device will. For this reason, a random coefficient is often used WITH a Mundlak device and group mean centering.

The reason that a random coefficient is useful even in the context of a Mundlak device is because it can go further in being able to account for cross-level interactions in addition to just group-specific intercept differences. Random coefficients are simply group-conditional effects of X on Y (Beck and Katz 2007). Technically, random coefficient can be used without simultaneously including a random intercept. However, this goes against standard advice in the multilevel literature and would be very difficult to properly interpret. ⁶ In practice, you are always going to think of a random coefficient as an addition to a random intercept model. In that setting random coefficients simultaneously deal with standard error issues as well as random intercept models.

Unlike random intercept models, they help to solve the endogeneity problem by simply willing away the assumption that the effect of X on Y is uncorrelated with group membership (Hsiao and Pesaran 2008). Random coefficients give you a distribution of effects that you can directly compare based on group membership, but they do not tell you anything about the underlying mechanisms driving the heterogeneity. They are a kind of last resort modeling strategy when you have exhausted your observed variables because they are of limited value for standard hypothesis testing. ⁷

Glossary

Y

An outcome or dependent variable

X

an independent or predictor variable

\(Y_i\)

An outcome indexed to the \(i^{th}\) observation where i is not repeated

\(X_i\)

An independent variable indexed to the \(i^th\) observation where i is not repeated

\(X_J\)

Level 2 or group varying X’s that are constant within groups

\(\overline{X}_{j}\)

The group average of X for each group J

Typically used as part of a Mundlak device

\(\alpha_i\)

The model intercept. Often with i omitted from notation

\(\varepsilon_i\)

The model or level-1 residual term

\(\mu_j\)

the random intercept or level-2 residual

\(\varepsilon_{ij}\)

The model or confounded residual term where \(\varepsilon_{ij}\) is a weighted average of \(\varepsilon_{i}\) and \(\mu_j\)

\(N_{xj}\)

the random slope where the effect of X on Y is conditional on group J

\(\beta_i\)

The conditional effect of X on Y for \(X_i\)

\(\beta_j\)

The conditional effect of X on Y for \(X_j\)

References

Bell, A. and K. Jones (2015). “Explaining fixed effects: Random effects modeling of time-series cross-sectional and panel data.” Political Science Research and Methods 3(01): 133-153.

Mundlak, Y. (1978). “On the pooling of time series and cross section data.” Econometrica: journal of the Econometric Society: 69-85.

Variants of the random intercept model have been used for centuries (Airy 1861) but were amalgamated into the theory of the Analysis of Variance in the first half of the 20th century (Scheffe 1959, Fisher 1925). Within this context, the random intercept can be seen as a way to recover population standard errors when strict conditions about group exchangeability are met (Eisenhart 1947) and when ’s are uncorrelated with . This assumption is an extension of the basic conditional independence assumption of linear modeling that requires X and \(\varepsilon_{i}\) be uncorrelated (see Hausman 1978, for instance). For the random intercept model, we assume that X is independent of both \(\varepsilon_{i}\) and \(\mu_j\).↩
Random effects can be built in a number of different ways. In a linear model, a random intercept can be created when the residuals are split into two pieces via a residual decomposition formula like the Swamy–Arora estimator (if you are in panel data econometrics) or with MLE/REML (if you are in statistics). In nonlinear models the random intercept is more difficult to calculate and is typically built with approximation techniques. Numerical quadrature (typically adaptive Gauss-Hermite quadrature), Laplace approximation, or transformations based loosely on the linear model like PQL, MQL, and PIRLS are all viable techniques with pros and cons. The ways in which you build a random intercept can matter in terms of accuracy, time, and computational feasibility.↩
In fields like psychology and economics \(\mu_j\) is not a particularly interesting object anymore than \(\alpha_i\) is interesting. It is useful for making results more generalizable (Barr et al. 2013) and is often seen as a practical requirement in fields like psycholinguistics for this reason. In fields like education, ecology, and epigenetics, μ is an exceptionally valuable tool that operates as a latent variable manifestation of group tendencies in Y (Robinson 1991). In those fields, it is directly interpreted as a way to rank order groups conditional on other variables and therefore they have a fundamentally different approach to its use. It is crucial to understand that the random intercept itself is a tool and can be employed in different ways in different contexts.↩
The main way in which the two-stage model of random effects breaks down is that it ignores empirical Bayes shrinkage. This is one of the most important and least understood nuances of random intercepts in this class. In our hypothetical example above the intercepts in each of the split models were the values of \(\mu\) in the pooled model. In reality, these values would have been shrunk toward the full model’s intercept along a normal distribution. The smaller the group the greater the shrinkage toward the average. This phenomenon occurs with both random intercepts (shrunk towards \(\alpha_i\) of the full model) and random slopes (shrunk towards \(\beta_i\) of the full model). You can read more about shrinkage at https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixed-models/↩
This is no different in principle to orthogonalizing within and between effects by including a set of dummy variables except the \(X_j\)’s can be more easily interpreted on substantive grounds.↩
It’s much like running a classical linear model without an intercept term.↩
Random coefficients tend to be somewhat problematic mostly from an estimation standpoint. Models are not usually stable with more than a handful of random coefficients and it can become difficult to test their significance (or even estimate them) when they are correlated with one another. The problem becomes far more difficult in nonlinear as opposed to linear settings because the methods for calculating random effects generally become more difficult. Standard operating procedure is to try to limit the number of random coefficients if at all possible to simplify estimation by only including those that significantly improve model fit and by attempting to explain inconsistent \(\beta\)’s substantively with a more accurate model of the data generating process (Bates et al. 2015).↩

The Basic MLM Toolkit