Monday, October 14, 2019

"It's beginning to, and back again": Zheng and Zhang (2016) part 2

A few months ago, I blogged about an article by Zheng and Zhang (2016) that appeared in Social Behavior and Personality. I thought it would be useful to briefly return to this particular article, as it was (unbeknownst to me at the time) my first exposure to that lab's work, and because I think it might be helpful for you all to see what I am seeing when I read the article.

I don't think I need to re-litigate my reasoning for recommending a rejection when I was a peer reviewer on that particular article, nor my disappointment that the article was published anyway. Water under the bridge. What I want to do is share some screenshots of the analyses in question, as well as note a few other odds and ends that always bugged me about that particular paper.

I am keeping my focus on Study 2, as that seems to be the portion of the paper that is most problematic. Keep in mind that there were 240 children who participated in the experiment. One of the burning questions is why the degrees of freedom in the denominator for so many of the analyses were so low. As the authors provided no descriptive statistics (including n's), it is often difficult to know exactly what is happening, but I might have a guess. If you follow the Zhang lab's progression since near the start of this decade, sample sizes have increased in their published work. I have a sneaking hunch that the authors copied and pasted text from prior articles and did not adequately update the degrees of freedom reported. The df for simple effects analyses may actually be correct, but there is no real way of knowing given the lack of descriptive statistics reported.

One problem is that there seemed to be something of a shifting dependent variable (DV). In the first analysis where the authors attempted to establish a main effect, the authors only used the mean reaction times (rt) for aggressive words as the DV. In subsequent analyses, the authors used a mean difference in reaction times (rt neutral minus rt aggressive) as the DV. That created some confusion already.

So let's start with the main analysis, as I do have a screen shot I used in a tweet a while back:

So you see the problem I am seeing already. The analysis itself is nonsensical. There is no way to say that violent video games primed aggressive thoughts in children who played the ostensibly violent game, as there was no basis for comparison (i.e., rt for non-aggressive words). There is a reason why my colleagues and I in the Mizzou Aggression Lab a couple decades ago computed a difference score between rts for aggressive words and rts for non-aggressive words and used it as our DV when we ran pronunciation tasks, lexical decision tasks, etc. If the authors were not willing to do that much, then a complex between/within ANOVA in which the interaction term would have been the main focus would have been appropriate. Okay. Enough of that. Make note of the degrees of freedom (df) in the denominator. With 240 participants, there is no way one is going to end up with 68 df in the denominator for a main effects analysis.
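For anyone who wants to check the df arithmetic, here is a minimal Python sketch. It assumes a purely between-subjects design with complete data, where the error df is N minus the number of cells. The two-level split for each factor (including trait aggressiveness) is my assumption for illustration, since the paper reports no n's, and the function names are hypothetical.

```python
from math import prod

def error_df(n_participants, levels_per_factor):
    """Denominator (error) df for a fully between-subjects factorial ANOVA.

    Assumes complete data: error df = N minus the number of design cells.
    """
    return n_participants - prod(levels_per_factor)

def implied_n(df_error, levels_per_factor):
    """Sample size implied by a reported error df, under the same assumptions."""
    return df_error + prod(levels_per_factor)

# Game type only (two groups), all 240 children included.
print(error_df(240, [2]))
# Game type x gender x trait aggressiveness, assuming two levels each.
print(error_df(240, [2, 2, 2]))
# N implied by a reported 68 error df in a two-group comparison.
print(implied_n(68, [2]))
```

Under those assumptions the expected error df is 238 for a two-group comparison and 232 for the full three-factor design. A denominator df of 68 would instead imply roughly 70 participants in a two-group analysis, which is nowhere near 240.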

Let's look at the rest of the results. The Game Type by Gender interaction analyses were, um, a bit unusual.

First let's let it soak in that the authors claim to be running a four-way ANOVA, but there appear to be only three independent variables: game type, gender, and trait aggressiveness. Where is the fourth variable hiding? Something is already amiss. Now note that the first analysis goes back to main effects. Here the difference between rts for aggressive and nonaggressive words is used as the DV, unlike the prior analysis that only examined rts for aggressive words as the DV. Bad news, though: if we believe the df as reported, a Statcheck analysis shows that the reported F could not be significant. Bummer. Statcheck also found that although the reported F for the game type by gender interaction - F(1,62) = 4.89 - was significant, it was at p = .031. Note again, that is assuming the df can be taken at face value. The authors do report the mean difference scores for boys playing violent games and nonviolent games, but do not do so for girls in either condition. I found the lack of descriptive statistical data to be vexing, to say the least.
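If you want to verify that sort of recomputation by hand, here is a rough Python sketch that recovers a p-value from a reported F with 1 df in the numerator. To be clear, this is not how Statcheck itself works (it is an R package). It is only a standard-library check that relies on the identity F(1, v) = t(v) squared and integrates the t density numerically. The function names are mine.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def p_from_f_1_df(f_value, df_error, steps=2000):
    """p-value for F(1, df_error), using F(1, v) = t(v)^2.

    Integrates the t density from 0 to sqrt(F) with Simpson's rule
    (`steps` must be even), then converts the area to a two-tailed p.
    """
    t = math.sqrt(f_value)
    h = t / steps
    total = t_pdf(0.0, df_error) + t_pdf(t, df_error)
    for i in range(1, steps):
        # Simpson weights: 4 for odd interior points, 2 for even ones.
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_error)
    area = total * h / 3
    return 1.0 - 2.0 * area

# The game type by gender interaction as reported: F(1, 62) = 4.89.
p = p_from_f_1_df(4.89, 62)
print(f"F(1, 62) = 4.89 -> p = {p:.3f}")
```

The result should land close to the .031 that Statcheck reports. Again, that only holds if the reported df are taken at face value.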

How about the analyses examining a potential interaction of game type and trait aggressiveness? That doesn't exactly look great:

Although Statcheck reports no obvious decision errors for the primary interaction effect or the simple effects, the df reported are, for lack of a better way to phrase this, all over the place. The lack of descriptive statistics makes it difficult to diagnose exactly what is going on. Then the authors go on to report a 3-way interaction as non-significant, when a Statcheck analysis indicates that it would be. If there were a significant 3-way interaction, that would require some considerable effort to carefully characterize the interaction, and to carefully graphically portray the interaction.

It also helps to go back and look at the Method section and see how the authors determined how many trials each participant would experience in the experiment:

As I stated previously:

The authors selected 60 goal words for their reaction time task: 30 aggressive and 30 non-aggressive. These goal words are presented individually in four blocks of trials. The authors claim that their participants completed 120 trials total, when the actual total would appear to be 240 trials. I had fewer trials for adult participants in an experiment I ran over a couple decades ago and that was a nearly hour-long ordeal for my participants. I can only imagine the heroic level of attention and perseverance required of these children to complete this particular experiment. I do have to wonder if the authors tested for potential fatigue or practice effects that might have been detectable across blocks of trials. Doing so was standard operating procedure in our lab in the Aggression Lab at Mizzou back in the 1990s. Reporting those findings would have also been done - at least in a footnote when submitted for publication.
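The trial count is simple enough to check in a few lines of Python. The sketch assumes that every goal word is shown exactly once in each of the four blocks, which is my reading of the Method section rather than something the authors state outright.

```python
aggressive_words = 30
nonaggressive_words = 30
blocks = 4

goal_words = aggressive_words + nonaggressive_words

# Assumption: each goal word appears exactly once per block.
total_trials = goal_words * blocks

print(goal_words, total_trials)
```

That gives 60 goal words and 240 trials. The 120 trials the authors report would correspond to only two blocks under the same assumption.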

Finally, I just want to say something about the way the authors described the personality measure they used. The authors appeared to be interested in obtaining an overall assessment of aggressiveness, and the Buss & Perry AQ is arguably defensible for such an endeavor. The authors have a tendency to repeat the original reliability coefficients reported by Buss and Perry (1992). Given that they only examined overall trait aggressiveness, and given that they presumably had to translate this instrument into Chinese, they would have been better served by reporting the reliability coefficient(s) that they specifically obtained, rather than doing little more than copying and pasting the same basic statement they make in other papers published by the Zhang lab. It really takes getting to the General Discussion section before the authors even obliquely mention that this instrument was translated, and before they recommend an adaptation of the BPAQ specifically for Chinese-speaking and reading individuals.

This was a paper that had so many question marks that it should never have been published in the first place. That it did somehow slip through the peer review system is indeed unfortunate. If the authors are unable or unwilling to make the necessary corrections, it is up to the editorial team at the journal to do so. I hope that they will in due time. I know that I have asked.

If the paper is not retracted, any correction would have to report the descriptive statistics upon which the analyses for Study 2 were based, as well as provide the correct inferential statistics: accurate F-tests, df, and p-values. Yes, that means tables would be necessary. That is not a bad thing. The actual coefficient alphas obtained in this specific study for their version of the BPAQ should be reported, instead of simply repeating what Buss and Perry reported for the original English language version of the instrument in the previous century. The editorial team should insist on examining the original data themselves so that they can confirm that any corrections made are indeed correct, or so that they can determine that the data set is so hopelessly botched that the findings reported cannot be trusted, hence necessitating a retraction.

How all this landed on my radar is really just the luck of the draw. I was asked to review a paper in 2014, and I had the time and interest in doing so. The topic of the paper was in my wheelhouse, so I agreed to do so. I recommended a rejection, which in hindsight was sound. I moved on with my life. A couple years later I would read a weapons priming effect paper that was really odd and with reported analyses that were difficult to trust. I didn't make the connection until an ex-coauthor of mine appeared on a paper that appeared to have originated from this lab. At that point I scoured the databases until I could locate every English-language paper published by this lab, and discovered that this specific paper - which I recommended rejecting - had been published as well. In the process, I was able to notice that there was a distinct similarity among all the papers - how they were formatted, the types of analyses, and the types of data analytic errors. I realized pretty quickly that "holy forking shirtballs, this is awful." I honestly don't know if what I have read in this series of papers amounts to gross incompetence or fraud. I do know that it does not belong in the published record.

To be continued (unfortunately)....


Zheng, J., & Zhang, Q. (2016). Priming effect of computer game violence on children’s aggression levels. Social Behavior and Personality: An International Journal, 44(10), 1747–1759. doi:10.2224/sbp.2016.44.10.1747

Footnote: The lyric comes from the chorus in "German Shepherds" by Wire. Near the end of the last paragraph of this post, I make a reference to some common expressions used in the TV series The Good Place.
