__Click [HERE](https://julianquandt.com/rmd_files/2020-08-05-power-analysis-by-data-simulation-in-r-part-iv.Rmd) to download the .Rmd file__ _This blog is also available on [R-Bloggers](https://www.r-bloggers.com/)_

**A few words on nested vs. crossed random effects** As the mixed-effects model literature has unfortunately developed different terms for the same things (and the same terms for different things), I want to spend a few words on what I mean by nested as opposed to crossed random effects. A data structure is _nested_ when observations are organized in one or more clusters. The classic example of nested data is observing the performance of school children over time. Because we collect data throughout multiple years, we have multiple observations per child, so observations are nested within children. We also observe multiple children per class, so children are nested within classes. If we involve not one but multiple schools in our study, then we have children nested in classes nested in schools. If we have schools from multiple countries we have ... well, you get the idea. It is important to realize that nested data is defined by "belong" and "contain" relationships. The data is _fully nested_ if (a reasonable assumption in this case) each child belongs to only one class, each class belongs to only one school, and so on. This point is sometimes confusing because our data-frame might contain classes from different schools that share the same ID; for instance, there might be a class called _9A_ in both school 1 and school 2. However, it is important to realize that these are not the same class but _different_ classes in reality: they contain different students and belong to different schools. Therefore the data is _nested_. In _crossed_, or _cross-classified_, data, each observation also belongs to not just one but multiple grouping levels. However, as opposed to fully nested data, the grouping levels are not bound to each other by a "contain" or "belong" relationship, as was the case in the example above where classes contained students and belonged to schools. 
A classic example of cross-classified data is experimental data in which, for instance, people have to press a button as quickly as possible whenever a picture appears on a computer screen. Imagine that we recruit 100 participants and each sees the same 20 pictures 5 times. In this case we have 100x20x5 = 10,000 observations. Imagine we identify these observations by their row number (i.e. 1 ... 10,000). Each of these observations "belongs" to a participant and "belongs" to a specific stimulus. For instance, assuming that the data is ordered ascending by participant and stimulus, the first observation (observation 1) belongs to participant 1 and stimulus 1, and the last observation (observation 10,000) belongs to participant 100 and stimulus 20. As you might already realize, in this case a participant neither contains nor belongs to a specific stimulus, and vice versa: each participant sees each stimulus, or, framed differently, each stimulus presents itself to each participant - the two grouping levels are independent of each other; they are _cross-classified_.
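The cross-classified structure described above is easy to sketch in code. The following is a minimal illustration (the variable names are made up for this example) showing that every participant is paired with every stimulus, so neither factor contains the other:

```{r}
# build the full crossing of participants, stimuli and repetitions
crossed <- expand.grid(participant = 1:100,
                       stimulus = 1:20,
                       repetition = 1:5)
nrow(crossed) # 100 * 20 * 5 = 10,000 observations
# every participant sees all 20 stimuli (and vice versa):
length(unique(subset(crossed, participant == 1)$stimulus)) # 20
```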

Note that we made the design choice that genre will be a within-subject effect. In theory, it could also be a between-subject design in which each person listens to only one genre, pop or rock, but a within-subject design has the advantage that general liking of music across conditions (e.g. person 1 might like both genres more than person 2) does not affect the results (as much) as it would in a between-subject design. I say "as much" because it could still affect the results even in a within-subject design if, for instance, a person who likes music more in general also shows a smaller difference between conditions than a person who generally dislikes music. In that case, the overall score is related to the _difference score_, which can impact our inference.
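To illustrate why the within-subject design is (mostly) immune to general liking differences, here is a small sketch (all parameter values are made up for illustration): a stable person-level baseline is added to both scores, so it cancels out of the within-subject difference score.

```{r}
set.seed(1)
n <- 1000
general_liking <- rnorm(n, 0, 10) # stable between-person baseline (sd = 10)
rock <- 50 + general_liking + rnorm(n, 0, 5)
pop  <- 50 + general_liking + rnorm(n, 0, 5)
# the baseline cancels out of the difference score:
sd(rock - pop) # close to sqrt(5^2 + 5^2) ~ 7.07, unaffected by the sd-10 baseline
```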

As a side note: we are working with a bounded scale here, where scores cannot be smaller than 0 or larger than 100. As normal distributions are unbounded, they do not perfectly represent this situation, and we might get scores beyond the boundaries. There are different ways we could deal with this. First, we could enforce the boundaries by _censoring_ the normal distribution: we set each score that falls outside the scale to either the minimum or the maximum of the scale. This results in scores that only fall inside the specified range, but, because we keep all scores, it produces peaks at both ends of the scale. Alternatively, we could _truncate_ the distribution, i.e. throw away all scores that do not fall inside our specified range. This does not produce peaks at the scale boundaries, but it does result in data loss, which we would either have to accept or compensate for by resampling until we get scores inside the range. Both approaches have their advantages and disadvantages, and which one we should use boils down to what we believe will happen in our study: will people disproportionately give very low or very high ratings (i.e. people really HATING or LOVING songs)? In that case we would _expect_ peaks at the boundaries and might prefer censoring the normal distribution, as it might be closer to what we will observe in our study. If we do not expect that and mainly care about keeping the scores inside the range, we could instead use truncation. Here, we will just ignore this problem for now (which is not that problematic after all, as the underlying process that generates the data is the same normal distribution in all cases).
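As a quick sketch of the two options (the numbers here are arbitrary), censoring can be done with `pmin()` and `pmax()`, while truncation simply discards out-of-range scores:

```{r}
set.seed(123)
raw <- rnorm(1e5, mean = 50, sd = 25) # unbounded normal scores

# censoring: clamp out-of-range scores to the 0-100 boundaries (creates peaks there)
censored <- pmin(pmax(raw, 0), 100)

# truncation: discard out-of-range scores (results in data loss)
truncated <- raw[raw >= 0 & raw <= 100]

range(censored)   # exactly 0 to 100
length(truncated) # fewer than the original 100,000 observations
```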

**An easy way to "break" mixed models** Consider a slightly different assumption about the correlation between liking scores. It is very much possible that a participant who likes rock music will dislike pop music. Thus, knowing the rock score of a certain participant, we would expect their pop score to be somewhat lower, and vice versa, i.e. there would be a _negative_ correlation between the scores. If we change the above correlation in the simulation to be negative, however, we get a rather odd output from the mixed model:

```{r}
correlation_neg <- -0.2 # now the correlation is negative
sigma_neg <- matrix(c(rock_sd^2, rock_sd*pop_sd*correlation_neg,
                      rock_sd*pop_sd*correlation_neg, pop_sd^2),
                    ncol = 2) # define variance-covariance matrix
set.seed(123)
bivnorm_neg <- data.frame(mvrnorm(nrow(data1)/2, group_means, sigma_neg))
bivnorm_dat_neg <- data.frame(cbind(liking = c(bivnorm_neg$rock, bivnorm_neg$pop),
                                    genre = c(rep("rock", (nrow(data1)/2)),
                                              rep("pop", (nrow(data1)/2))),
                                    participant = rep(1:(nrow(data1)/2), 2)))
bivnorm_dat_neg$liking <- as.numeric(bivnorm_dat_neg$liking)
bivnorm_dat_neg$genre_f <- factor(bivnorm_dat_neg$genre)
lmer_bivnorm_neg <- lmer(liking ~ genre_f + (1 | participant), bivnorm_dat_neg)
summary(lmer_bivnorm_neg)
```

First, we get a warning that the fit is "singular". Second, even though the covariance is just as large in absolute terms, only negative this time (-11.25), the model tells us that the participant variance is 0! Can that be right? 
The correlation is indeed `r cor(subset(bivnorm_dat_neg, genre == "rock")$liking, subset(bivnorm_dat_neg, genre == "pop")$liking)`, so using the formula to infer the covariance from the group SDs in the data, the covariance we would expect is

```
cor(subset(bivnorm_dat_neg, genre == "rock")$liking,
    subset(bivnorm_dat_neg, genre == "pop")$liking) *
  sd(subset(bivnorm_dat_neg, genre == "rock")$liking) *
  sd(subset(bivnorm_dat_neg, genre == "pop")$liking)
```

which evaluates to `r cor(subset(bivnorm_dat_neg, genre == "rock")$liking, subset(bivnorm_dat_neg, genre == "pop")$liking)*sd(subset(bivnorm_dat_neg, genre == "rock")$liking)*sd(subset(bivnorm_dat_neg, genre == "pop")$liking)` - certainly not 0! Take a moment, however, to reflect on how a variance is defined: it is the squared standard deviation, i.e. $var = sd^2$. By definition, a squared number cannot be negative, and therefore a variance can never be negative either. Mixed models have a built-in restriction that honors this notion. If a covariance is negative, the mixed model gives us the closest number that it considers reasonable for a random-effect variance, namely 0 (which is just an SD of 0, squared), and tells us that the fit is "singular", i.e. the variance component cannot be estimated. ANOVA procedures do not "suffer" from this problem, as they use a sum-of-squares approach, and sums of squares are always positive. This means that in cases where the observed correlation between the observations is negative, ANOVAs and mixed models will yield different results:

```{r}
ez_an <- ez::ezANOVA(bivnorm_dat_neg,
                     dv = .(liking),
                     wid = .(participant),
                     within = .(genre_f),
                     detailed = TRUE)
print(ez_an)
```

Which one is the true result, though? Both and neither; it is simply a design choice that the package developers made. 
[As Jake Westfall put it](https://stats.stackexchange.com/a/122662) better than I ever could for models that have these negative covariances: > "The best-fitting model is one that implies that several of the random variance components are negative. But lmer() (and most other mixed model programs) constrains the estimates of variance components to be non-negative. This is generally considered a sensible constraint, since variances can of course never truly be negative. However, a consequence of this constraint is that mixed models are unable to accurately represent datasets that feature negative intraclass correlations, that is, datasets where observations from the same cluster are less (rather than more) similar on average than observations drawn randomly from the dataset, and consequently where within-cluster variance substantially exceeds between-cluster variance. Such datasets are perfectly reasonable datasets that one will occasionally come across in the real world (or accidentally simulate!), but they cannot be sensibly described by a variance-components model, because they imply negative variance components. They can however be "non-sensibly" described by such models, if the software will allow it. aov() allows it. lmer() does not".
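To see Westfall's point in numbers, here is a small sketch (the parameter values are made up for illustration, using `mvrnorm` from MASS as above) showing that negatively correlated scores within a cluster make the within-cluster variance exceed the between-cluster variance - exactly the situation that would imply a negative variance component:

```{r}
library(MASS)
set.seed(123)
# 1,000 "participants", each contributing two negatively correlated scores
sigma_demo <- matrix(c(100, -20, -20, 100), ncol = 2)
scores_demo <- mvrnorm(1000, mu = c(50, 50), Sigma = sigma_demo)
within_var  <- mean(apply(scores_demo, 1, var)) # average within-participant variance
between_var <- var(rowMeans(scores_demo))       # variance of the participant means
c(within = within_var, between = between_var)   # within (~120) clearly exceeds between (~40)
```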

To make the random intercept simulation more tangible, we can plot it. If we use the code above but set the error term $\epsilon$ to a very small number, we can see what influence the random effect has on the data.

```{r}
epsilon_noerror <- .001
data1_noerror <- subset(data1, participant <= 20) # take only 20 participants for plotting
set.seed(1)
for(i in 1:nrow(data1_noerror)){
  data1_noerror$liking[i] <- rnorm(1, b0 + U0[data1_noerror$participant[i]] +
                                     b1*data1_noerror$genre_pop[i],
                                   epsilon_noerror)
}
ggplot(data1_noerror, aes(x = genre, y = liking, colour = factor(participant),
                          group = factor(participant))) +
  geom_line()
```

We can see in the plot above that each participant shows exactly no difference in liking between rock and pop music, as $\beta_1$ is 0 and there is no error term that causes random variation. Still, the likings differ between participants, as the random intercept causes the overall ratings to be lower or higher for different participants.

Again, we can plot this combination of random slope + intercept.

```{r}
data2_noerror <- subset(data2, participant <= 20) # take only 20 participants for plotting
set.seed(1)
for(i in 1:nrow(data2_noerror)){
  data2_noerror$liking[i] <- rnorm(1, b0 + U0[data2_noerror$participant[i]] +
                                     (b1 + U1[data2_noerror$participant[i]])*data2_noerror$genre_pop[i],
                                   epsilon_noerror)
}
ggplot(data2_noerror, aes(x = genre, y = liking, colour = factor(participant),
                          group = factor(participant))) +
  geom_line()
```

We can now see that not only does each participant still have a different mean, but their slopes are no longer flat: individual participants can even show large differences in their liking for pop vs. rock music. Across all participants, however, the average difference is still zero.

Should we add the genre random slope for songs as well? Try to think about this for a second. What would it mean to add a random slope for genre across songs here? It would mean that we would expect a certain song to differ in its propensity to be liked as a pop song compared to being liked as a rock song. Clearly, this does not make sense, as a song is either rock or pop but cannot "switch" genre from rock to pop or vice versa. Thus, genre is a within-participant factor here, but it is _between-songs_, meaning that we cannot add a genre random slope for songs, as it does not make sense.
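Putting this together, a model formula for the crossed design would include a by-participant random slope for genre but only a random intercept for songs (the grouping variable name `song` is an assumption here; it depends on how the data is coded):

```{r}
library(lme4)
# genre varies within participants (slope meaningful) but between songs (intercept only)
f_crossed <- liking ~ genre_f + (1 + genre_f | participant) + (1 | song)
# two random-effects terms, as intended:
length(findbars(f_crossed)) # 2
```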