This course is about modeling heterogeneity based on clustering or latent grouping structures. If that sounds abstract, it’s because this class covers many different kinds of heterogeneity and many different kinds of clustering and grouping structures: everything from students in classes in schools, to countries in dyads and alliance structures, to public opinion in counties and countries over time. The course is designed to start with the high-level theoretical problems and then work through examples, so these notes provide a basic reference for the first week of class, where we discuss the theory. Below, I outline the basic multilevel modeling toolkit that we will use repeatedly in this class. Depending on your field, I may use language different from what you’ve heard previously. You should reference these notes continually, as they will likely be updated and edited throughout the course.
Multilevel models go by a lot of names. This course uses “multilevel” as opposed to “mixed” or “hierarchical” mainly out of habit. It’s the language that I (the instructor) learned first in graduate school, so it’s the default in my head for any kind of model designed to deal with clustering or nested structure. I will try to always reference which literature I’m talking about and to note when something is really more of one kind of model than another.
One of the hardest things with a class like this is keeping track of terms. Different people across fields can use different language to mean basically the same thing without realizing it. They can also use the same words to mean different things. Below you’ll find some general terms used across different fields that relate to things in this course.
The classical linear model (aka regression) is the building block of most modern and commonly used statistical techniques.
\[ Y_i = \alpha_i + \beta(X_i) + \varepsilon_i \] There are a number of basic assumptions inherent to the CLM that are the focus of any number of advanced classes. One of the core assumptions of the CLM is that the data are i.i.d., or independent and identically distributed. The residuals are presumed to be noise that can be reasonably ignored. Clustered data automatically break this assumption because observations are more alike within groups than across groups.
\[ Y_{ij} = \alpha_i + \beta(X_{ij}) + \color{orange}{\mu_j} + \varepsilon_i \] The least problematic kind of i.i.d. violation is one where the clustering is related to the residuals but not to the betas. In this situation the most basic solution is to split the error term \(\varepsilon_{ij}\) into \(\varepsilon_i\) and \(\mu_j\) so that the within-group residuals become conditionally independent and identically distributed. This is known as a random or varying intercept model. 1 The core idea is that we have made \(\varepsilon_i\) i.i.d. by splitting out \(\mu_j\). Assuming that both \(\mu_j\) and \(\varepsilon_i\) are conditionally independent of any \(\beta_{ij}\), we should be fine with this solution. In practice, this assumption is problematic because the same latent factors that cause \(\mu_j\) to be correlated with \(\varepsilon_i\) are also likely to cause it to be correlated with any given \(\beta_i\). There are a variety of potential solutions in this circumstance that we will come to momentarily.
Before moving on to other modeling strategies for omitted variable bias, it is worth spending some time on random effects more generally to help with intuition. Random intercepts and random slopes are not observed and measured variables that will show up in a given data set. They have to be built from other variables through some modeling strategy, and so they are actually latent variables (see Rabe-Hesketh & Skrondal 2004). Conceptually, a random intercept is built from a set of \(\alpha_i\)’s that have been subdivided by group. The most intuitive way to think about it is to start with the model \(Y_{ij} = \alpha_{ij} + \beta(X_{ij}) + \varepsilon_{ij}\) and decide to instead estimate G separate models, one for each group in J, so that we have:
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 1 where \(\alpha_i=5\)
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 2 where \(\alpha_i=10\)
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 3 where \(\alpha_i=15\)
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 4 where \(\alpha_i=20\)
\(Y_i = \alpha_i + \beta(X_i) + \varepsilon_i\) if J = 5 where \(\alpha_i=25\)
We simply run these 5 models and extract \(\alpha_i\) from each of them in turn. We then stack them in a new variable \(\mu\) so that every value of \(\mu\) for group 1 is 5, every value of \(\mu\) for group 2 is 10, every value of \(\mu\) for group 3 is 15, and so on. This new variable goes back into the pooled model \(Y_{ij} = \alpha_i + \beta(X_{ij}) + \varepsilon_i\) to give \(Y_{ij} = \alpha_i + \beta(X_{ij}) + \color{orange}{\mu_j} + \varepsilon_i\), where \(\mu\) pulls the group-level contamination out of \(\varepsilon_{ij}\). This kind of two-stage approach to building random effects is atypical in practice because it doesn’t allow for shrinkage, but it is nonetheless useful for intuition-building. 2 3 4
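To make the two-stage intuition concrete, here is a minimal sketch in R. The data frame `dat` and the names `y`, `x`, and `id` are hypothetical stand-ins, and this is intuition-building code only, not how any multilevel software actually estimates random effects (note the lack of shrinkage):

```r
# Two-stage intuition only: fit one model per group, stack the intercepts,
# then feed them back into a pooled model. Hypothetical names: y, x, id.
dat$mu <- NA
for (j in unique(dat$id)) {
  fit_j <- lm(y ~ x, data = dat[dat$id == j, ])  # separate model for group j
  dat$mu[dat$id == j] <- coef(fit_j)[1]          # stack that group's intercept
}

pooled <- lm(y ~ x + mu, data = dat)  # mu absorbs the group-level part of the residual
summary(pooled)
```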
Once again, the random intercept by itself typically doesn’t do enough to solve the assumption violation inherent in clustered data because it only works to break apart the connections between the grouping structure and \(\varepsilon_i\). Outside of experimental settings we typically still need to deal with the problem of correlation between the grouping structure and \(\beta_i\). The Mundlak device is a way to orthogonalize \(X_i\) and \(X_j\) so that the effects of those variables on Y can be modeled separately. In essence, the \(\beta_{ij}\)’s are split apart in the same way that \(\varepsilon_{ij}\) is split into \(\varepsilon_i\) and \(\mu\).5
\[Y_{ij} = \alpha_i + \beta(X_{ij}-\overline{X}_{j}) + \beta(\overline{X}_{j}) + \color{orange}{\mu_j} + \varepsilon_i\]
The equation above is the group-mean-centered variant of a Mundlak device. Centering \(X_{ij}\) so that each group’s mean of X is zero helps fully orthogonalize the within-group and between-group portions of the relationship between X and Y. This is the preferred approach to constructing the model (Bell and Jones 2015). It is not always necessary to recenter X, as it often simply doesn’t matter in practice, but it is worth checking for continuous variables. For categorical variables, recentering produces nonsensical values, so they are typically evaluated on their default scale.
The standard alternative to the Mundlak device typically used in the social sciences is an econometric fixed effects model whereby group-level effects are removed directly from the data. This can be accomplished in a number of ways that get different mileage depending on whether you are using a linear or generalized linear model.
\[Y_{ij} = \alpha_i + \beta(X_{ij}) + \beta(J_1) + \beta(J_2) + \dots + \beta(J_{G-1}) + \varepsilon_i\]
A fixed effects model (in the language of econometrics) is a model that has excised all between-group variability from an analysis. We can do this with a collection of subject- or group-specific indicator variables that decompose the within-group and between-group data. More commonly, a fixed effects model simply conditions out the between-group effects with some kind of data manipulation like a first difference. The kind of model outlined above can only represent the effect of X on Y through variability internal to each group and therefore depends heavily on within-group heterogeneity in both X and Y. Any group-invariant variables are automatically invalid and purged from a standard fixed effects model.
The obvious problem with this approach is that it simply ignores a sizable amount of variation that occurs in reality. It sacrifices generalizability to help attain validity. In most observational settings the between group effects are of important substantive interest.
Where the basic fixed effects (e.g., dummies) approach and the Mundlak device solve the conditional independence problem by splitting the X’s into two sets of variables, we have another viable option. The random coefficients model simply recreates the random intercept solution for β. Intuitively, we can estimate 5 different models, one on each group in J. We can then extract \(\alpha_i\) from each model as before to create the random intercept \(\mu_j\). We simply go a step further and extract \(\beta_i\) from each model, stack them in a new vector, and create the random coefficient \(N_{xj}\).
\[ Y_{ij} = \alpha_i + \beta(X_{ij}) + \color{orange}{\mu_j} + \color{orange}{N_{xj}}(X_{ij}) + \varepsilon_i \]
In practice, this is not how random coefficients are estimated in any multilevel modeling software. Typically, they are estimated with empirical Bayes shrinkage, which draws the group-specific \(\beta_j\) closer to the pooled average β. Shrinkage literally induces bias in the values of the random coefficient (also called the BLUP) in order to reduce variance. This means that the random coefficient will reduce the confounding between \(X_i\) and \(X_j\) but will not completely eliminate it the way that either dummy variables or a Mundlak device will. For this reason, a random coefficient is often used WITH a Mundlak device.
The reason that a random coefficient is useful even in the context of a Mundlak device is that it can go further, accounting for cross-level interactions in addition to group-specific intercept differences. Random coefficients are simply group-conditional effects of X on Y (Beck and Katz 2007). Technically, a random coefficient can be used without simultaneously including a random intercept. However, this goes against standard advice in the multilevel literature and would be very difficult to interpret properly. In practice, you should always think of a random coefficient as an addition to a random intercept model. In that setting, random coefficients deal with standard error issues just as random intercept models do.
Unlike random intercept models, they help to solve the endogeneity problem by simply dropping the assumption that the effect of X on Y is uncorrelated with group membership (Hsiao and Pesaran 2008). Random coefficients give you a distribution of effects that you can directly compare based on group membership, but they do not tell you anything about the underlying mechanisms driving the heterogeneity. They are a kind of last-resort modeling strategy for when you have exhausted your observed variables, because they are of limited value for standard hypothesis testing. 6
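In lme4, adding a random coefficient is a one-line change to a random intercept model. A minimal sketch, again with the hypothetical names `y`, `x`, and `id`:

```r
# Random intercept AND random coefficient (slope) for x, varying by id.
# lme4 returns the shrunken group-specific deviations (the BLUPs).
library(lme4)

rc_mod <- lmer(y ~ x + (1 + x | id), data = dat)

fixef(rc_mod)  # pooled (average) intercept and slope
ranef(rc_mod)  # shrunken group-specific deviations from the pooled values
coef(rc_mod)   # fixed + random combined: one intercept and slope per group
```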
\[\begin{align}
Y_{ij} = {} & \alpha_i \, + \\
& \beta(X_{ij}-\overline{X}_{j}) \, + \\
& \beta(\overline{X}_{j}) \, + \\
& \beta(Z_j) \, + \\
& \beta((X_{ij}-\overline{X}_{j})*(Z_j)) \, + \\
& \beta(N_j) \, + \\
& \beta((X_{ij}-\overline{X}_{j})*(N_j)) \, + \\
& \beta(S_j) \, + \\
& \beta((X_{ij}-\overline{X}_{j})*(S_j)) \, + \\
& \beta(T_j) \, + \\
& \beta((X_{ij}-\overline{X}_{j})*(T_j)) \, + \\
& \color{orange}{\mu_j} \, + \\
& \color{orange}{N_{xj}}(X_{ij}-\overline{X}_{j}) \, + \\
& \varepsilon_i
\end{align}\]
Random intercepts and slopes attempt to deal with clustering problems by allowing α and β to vary by group, respectively. A logical extension of this approach is to build a model that articulates the mechanisms driving the group-conditional intercepts and slopes. This is ideally what we would always do, since it provides the most theoretical insight, but it is fundamentally the most difficult approach of the possible set of options. Note that this kind of modeling strategy can be used in conjunction with random intercepts and random slopes. In fact, it should be used with both. Without a random intercept we are unlikely to get adequate population standard errors (unless we use one of the other methods to get them).
Modeling a DGP can directly solve cluster confounding when we use a Mundlak device or group-mean centering (Enders 2013, Bell and Jones 2015). While some fields view this as a technical correction, it is really about improving the substantive specification of a model. General effect heterogeneity, while having a similar result to cluster confounding, is a bit more problematic. We certainly can model effect heterogeneity directly by including cross-level interactions, diffusion effects, cross-lagged effects, and so on. However, it is difficult to do in practice and the vast majority of people don’t do it well. It may be necessary to use either random coefficients or fixed effects to help clean up the parts of the data that we can’t directly model.
In terms of the fourth horseman, we do improve predictive accuracy over a standard (non)linear model. We also make that predictive model far more interpretable on substantive grounds, which can be quite useful. But ultimately, if we don’t really care about the mechanisms, the added benefits over a random intercepts and slopes model may not be worth the extra work of refining a decent structural model. The counterargument has more to do with the technical difficulty of estimating a model that is saturated with random effects (Bates et al. 2015) than with the substantive benefits of the model. If it’s easier to get a model to converge by building out the mechanisms than by simply allowing all effects to vary across all groups, then it’s worth the effort even if you don’t care about the mechanisms.
This assumption is a very easy one to conceptualize in practice. The random intercept \(\mu_j\) only helps to solve a clustering problem if the only thing that varies by group membership is \(\alpha_i\). If \(\beta_i\) also varies as a function of group membership, then we have a conditional independence violation at work that our random intercept is not able to solve. Outside of experimental settings, there is almost always at least one \(\beta\) that does change as a function of group membership, so in practice this is not a tenable assumption to make. Note that there are a variety of ways to test whether \(\beta_i\) is actually better viewed as \(\beta_{ij}\), but the most commonly used one is the Hausman test (Hausman 1978). This test simply compares the set of \(\beta\)’s in a random intercept model to a model where all between-group variability has been ripped out of the data. Since it is impossible for X to be correlated with a grouping structure that has been forcibly removed from the data, the \(\beta_i\) are automatically unbiased with respect to J. If the parallel comparisons of each \(\beta\) are different according to a \(\chi^2\) test, then we have a conditional independence assumption violation at work.
Cluster Confounding & Simpson’s Paradox
Baseline Linear Model
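The original code chunks are not reproduced in these notes, so the sketches that follow use hypothetical names: a data frame `dat` with outcome `y`, predictor `x`, and grouping variable `id`. A pooled OLS baseline might look like:

```r
# Pooled OLS: ignores the grouping structure entirely.
ols_mod <- lm(y ~ x, data = dat)
summary(ols_mod)
```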
Random Intercept Model
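A random intercept version of the same model, sketched with lme4 (same hypothetical names):

```r
# Random intercept for each group in id.
library(lme4)

ri_mod <- lmer(y ~ x + (1 | id), data = dat)
summary(ri_mod)
```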
Fixed Effects Model via Dummies
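A dummy-variable fixed effects sketch; `factor(id)` expands into the G - 1 group indicators from the equation above (one level is absorbed into the intercept):

```r
# Fixed effects via group-specific indicator variables.
fe_mod <- lm(y ~ x + factor(id), data = dat)
summary(fe_mod)
```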
Creating Group-Means
Take the mean of ses by group id
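The header above mentions `ses`, while the table below reports a predictor named `x`; for consistency with the table, here is a hedged dplyr sketch that builds a group mean and a group-mean-centered version of `x` within each level of `id`:

```r
# Group mean of x (between component) and group-mean-centered x (within
# component), computed within each level of id. Names are hypothetical.
library(dplyr)

dat <- dat %>%
  group_by(id) %>%
  mutate(mean.x = mean(x, na.rm = TRUE),
         cent.x = x - mean.x) %>%
  ungroup()
```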
Group Means (Mundlak) Model
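The Mundlak model then adds the group mean alongside the raw predictor (hypothetical names again):

```r
# Mundlak device: raw x plus its group mean, with a random intercept.
mundlak_mod <- lmer(y ~ x + mean.x + (1 | id), data = dat)
summary(mundlak_mod)
```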
Group Mean Centered Model
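And the group-mean-centered variant swaps in the centered predictor, so the within and between parts are fully orthogonal:

```r
# Group-mean-centered Mundlak variant.
gmc_mod <- lmer(y ~ cent.x + mean.x + (1 | id), data = dat)
summary(gmc_mod)
```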
|  | OLS | RI | FE | Mundlak | GMC |
|---|---|---|---|---|---|
| Predictors | Estimates | Estimates | Estimates | Estimates | Estimates |
| x | 0.99 | -0.09 | -0.10 | -0.10 |  |
| cent.x |  |  |  |  | -0.10 |
| Random Effects |  |  |  |  |  |
| σ² |  | 722.58 |  | 717.41 | 717.41 |
| τ₀₀ (choice) |  | 71629.23 |  | 0.00 | 0.00 |
| ICC |  | 0.99 |  |  |  |
| N (choice) |  | 15 |  | 15 | 15 |
| Observations | 2130 | 2130 | 2130 | 2130 | 2130 |
| R² / R² adjusted | 0.983 / 0.983 | 0.006 / 0.990 | 0.988 / 0.988 | 0.988 / NA | 0.988 / NA |
Hausman Test

Test statistic (x): 14.8758

Critical value (\(\chi^2\), 1 df): 3.841459

p-value (x): 0.0001148264
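The chunk that produced this output isn’t shown; one common way to run a Hausman test in R is through plm (a sketch, with the same hypothetical `y`, `x`, and `id`):

```r
# Hausman test: compare fixed effects and random effects estimates of beta.
library(plm)

pdat <- pdata.frame(dat, index = "id")
fe <- plm(y ~ x, data = pdat, model = "within")  # between-group variation removed
re <- plm(y ~ x, data = pdat, model = "random")  # random intercept (Swamy-Arora)
phtest(fe, re)                                   # chi-squared test on the betas
```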
Let’s actually look at the data
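A quick plot makes the Simpson’s paradox pattern visible: the within-group slopes run against the pooled slope. A sketch with ggplot2 and the same hypothetical names:

```r
# Within-group regression lines versus the pooled regression line.
library(ggplot2)

ggplot(dat, aes(x = x, y = y, color = factor(id))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +     # one line per group
  geom_smooth(aes(group = 1), method = "lm",
              se = FALSE, color = "black") +   # pooled line
  labs(color = "group")
```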
Fixed effects models do not help with standard error issues at all because they lead directly to standard errors that are not reflective of a population. They divorce an estimation sample from its context in the same way that an experimental design can’t be generalized to a population. This was the original reason why random effects were seen as valuable (Eisenhart 1947).
Fixed effects models do help with endogeneity problems like cluster confounding or general effect heterogeneity across groups in that they allow you to estimate the common effects of X on Y that are group-invariant. You get a clean measure of the within-group effects of X on Y so long as other sources of endogeneity (reciprocal causation, standard omitted variable bias, selection effects) are not at work. However, you get no measures of context and no way to evaluate consistency of effects in the real world.
Fixed effects models actually tend to make predictive accuracy worse than a simple linear model. The entire point of a fixed effects model is to cut variability out of the analysis, which is anathema to any kind of predictive enterprise.
Fixed effects are quite useful under a number of other circumstances that are much more difficult to handle with other modeling strategies. If groups represent some actual mistake in either your research design or the implementation of your design, then a fixed effects model is the only real approach. Here, between-group effects are pathological to your research design, and cutting them out of your data helps you make better inferences. This applies if you are doing a controlled experiment with some form of random assignment but the randomization process did not magically lead to equivalent groups (it usually doesn’t). It is also quite useful if you are studying some kind of observational treatment effect and need to pretend you have something like random assignment. This is why fixed effects estimators are often classified under tools for causal inference.
Fixed effects are also very useful if you are interested in potentially mixing heterogeneous populations within an analysis but do not have enough groups to effectively use random effects. Using dummy variables for gender or race instead of attempting incredibly ill-fitting random effects based on those categories is standard practice. When fixed effects are viewed this way, it becomes apparent that we often mix and match fixed effects models (with group-specific indicator variables) and random effects in the same analysis (Fitzmaurice, Laird, and Ware 2012).
Finally, fixed effects models provide an invaluable benchmark for evaluating an attempt to model a full data generating process and the scale of potential effect heterogeneity. The Hausman test allows for a direct comparison of β coefficients on a model with no cluster-based endogeneity and one based on the entire sample. By knowing the unbiased within-group effect of X on Y we gain a solid starting point in a more comprehensive attempt to model within, between, and cross-level effects. If we cannot effectively build a model of the DGP that yields similar results for within-group effects, then we know that our model is incomplete. We can start with this inconsistency and reason inductively toward a set of potentially better approximations to the DGP.
Variants of the random intercept model have been in use for well over a century (Airy 1861) but were amalgamated into the theory of the Analysis of Variance in the first half of the 20th century (Scheffe 1959, Fisher 1925). Within this context, the random intercept can be seen as a way to recover population standard errors when strict conditions about group exchangeability are met (Eisenhart 1947). This assumption is an extension of the basic conditional independence assumption of linear modeling that requires that X and \(\varepsilon_{i}\) be uncorrelated. For the random intercept model, we assume that X is independent of both \(\varepsilon_{i}\) and \(\mu_j\).↩
Random effects can be built in a number of different ways. In a linear model, a random intercept can be created when the residuals are split into two pieces via a residual decomposition formula like the Swamy–Arora estimator (if you are in panel data econometrics) or with MLE/REML (if you are in statistics). In nonlinear models the random intercept is more difficult to calculate and is typically built with approximation techniques. Numerical quadrature (typically adaptive Gauss-Hermite quadrature), Laplace approximation, or transformations based loosely on the linear model like PQL, MQL, and PIRLS are all viable techniques with pros and cons. The ways in which you build a random intercept can matter in terms of accuracy, time, and computational feasibility.↩
In fields like psychology and economics \(\mu_j\) is not a particularly interesting object, any more than \(\alpha_i\) is interesting. It is useful for making results more generalizable (Barr et al. 2013) and is often seen as a practical requirement in fields like psycholinguistics for this reason. In fields like education, ecology, and genetics, \(\mu\) is an exceptionally valuable tool that operates as a latent-variable manifestation of group tendencies in Y (Robinson 1991). In those fields it is directly interpreted as a way to rank-order groups conditional on other variables, and so they have a fundamentally different approach to its use. It is crucial to understand that the random intercept itself is a tool and can be employed in different ways in different contexts.↩
The main way in which the two-stage model of random effects breaks down is that it ignores empirical Bayes shrinkage. This is one of the most important and least understood nuances of random intercepts in this class. In our hypothetical example above, the intercepts from each of the split models became the values of \(\mu\) in the pooled model. In reality, these values would have been shrunk toward the full model’s intercept according to a normal distribution: the smaller the group, the greater the shrinkage toward the average. This phenomenon occurs with both random intercepts (shrunk toward the \(\alpha_i\) of the full model) and random slopes (shrunk toward the \(\beta_i\) of the full model). You can read more about shrinkage at https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixed-models/↩
This is no different in principle from orthogonalizing within and between effects by including a set of dummy variables, except that the \(X_j\)’s can be more easily interpreted on substantive grounds. Mundlak (1978) proved that the dummy variable approach yields the same within-group effects of X on Y as simply including \(\overline{X}\) in the model.↩
Random coefficients tend to be somewhat problematic, mostly from an estimation standpoint. Models are not usually stable with more than a handful of random coefficients, and it can become difficult to test their significance (or even estimate them) when they are correlated with one another. The problem becomes far more difficult in nonlinear as opposed to linear settings because the methods for calculating random effects generally become more difficult. Standard operating procedure is to limit the number of random coefficients wherever possible to simplify estimation: include only those that significantly improve model fit, and try to explain inconsistent \(\beta\)’s substantively with a more accurate model of the data generating process (Bates et al. 2015).↩