Friday, December 6, 2019

Now about those Youth and Society retractions involving Qian Zhang

Hopefully you have had a moment to digest the recent article about the retraction of two of Qian Zhang's papers in Retraction Watch. I began tweeting about one of the articles in question around late September, 2018. You can follow this link to my tweet storm for the now retracted Zhang, Espelage, and Zhang (2018) paper. Under a pseudonym, I initially documented some concerns about both papers in PubPeer: Zhang, Espelage, and Zhang (2018) and Zhang, Espelage, and Rost (2018).

Really, what I did was to upload the papers into the online version of Statcheck, and flag any decision inconsistencies I noticed. I also tried to be mindful of any other oddities that seemed to stick out at that time. I might make note of df that seemed unusual given the reported sample size, for example, or problems with tables presuming to report means and standard deviations. By the time I would have looked at these two papers, I suspect that I was already concerned that papers from Zhang's lab showed a pervasive pattern of errors. Sadly, these two were no different.

With regard to Zhang, Espelage, and Zhang (2018), a Statcheck scan showed three decision errors. In this case, these were errors where the authors reported findings as statistically significant when they were not - given the test statistic value, the degrees of freedom, and the level of significance the authors tried to report.

The first decision inconsistency has to do with an assertion that playing violent video games increased accessibility of aggressive thoughts. The authors initially reported the effect as F(1, 51) = 2.87, p < .05. The actual p-value would have been .09634, according to Statcheck. In other words, there is no main effect for violent content of video games in this sample. Nor was a video game type by gender interaction found: F(1, 65) = 3.58, p < .01 (actual p-value: p = 0.06293). Finally, there is no game type by age interaction: F(1, 64) = 3.64, p < .05 (actual p-value: p = 0.06090). Stranger still, the sample was approximately 3000 students. Why were the denominator degrees of freedom so small for these reported test statistics? Something did not add up. Table 1 from Zhang, Espelage, and Zhang (2018) was also completely impossible to interpret - an issue I have highlighted in other papers that have been published from his lab.
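For readers who want to see how a p-value can be recomputed from a reported F statistic, here is a minimal sketch in Python using only the standard library. It is not Statcheck's own code (Statcheck is an R package). It relies on the fact that an F test with 1 numerator degree of freedom is equivalent to a two-tailed t test with t = sqrt(F), and it integrates the t density numerically with Simpson's rule. The function names and the labels in the list are my own, chosen for illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value for a t statistic, via Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return 1 - 2 * area


def f_p_value_df1_one(f, df2):
    """p-value for F(1, df2): same as a two-tailed t test with t = sqrt(F)."""
    return t_two_tailed_p(math.sqrt(f), df2)


# Test statistics as reported in the original article (labels are illustrative).
reported = [
    ("violent content main effect", 2.87, 51),
    ("game type x gender", 3.58, 65),
    ("game type x age", 3.64, 64),
]

for label, f_value, df2 in reported:
    p = f_p_value_df1_one(f_value, df2)
    verdict = "significant" if p < .05 else "not significant"
    print(f"{label}: F(1, {df2}) = {f_value}, recomputed p = {p:.5f} ({verdict} at .05)")
```

The recomputed values should land close to the Statcheck figures quoted above, all of them above the .05 threshold.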

A few months later, a correction would be published in which the authors would purport to correct a number of errors found on several of the pages of the original article, as well as Table 1. That was wonderful insofar as it went. However, there was a new oddity. The authors purported to only use 500 of the 3000 participants in order to have a "true experiment" - which was one of the more interesting uses of that term I have read over the course of my career. And as Joe Hilgard has aptly noticed, problems with the descriptive statistics continued to be pervasive - implausible and impossible cell means and marginal means and standard deviations, for example.

With regard to the Zhang, Espelage, and Rost (2018) article, my initial flag was simply for some decision errors in Study 1, in which the authors were attempting to establish that their stimulus materials were equivalent across a number of variables except for the level of violent content, and consistent across subsamples, such as sex of participant (male/female). Given the difficulty that exists in obtaining, say, film clips that are sufficiently equivalent except for level of violence, due diligence in using materials that are as equivalent as possible, except for the IV or IVs, is to be admired. Unfortunately, there were several decision errors that I flagged after a Statcheck run.

As noted at the time, contra the authors' assertion, there was evidence that there were some rated differences between the violent film (Street Fighter) and the nonviolent film (Air Crisis) in terms of pleasantness - t(798) = 2.32, p > .05 (actual p-value: p = 0.02059) - and fear - t(798) = 2.13, p > .05 (actual p-value: p = 0.03348). The extent to which failure to control for those factors might have impacted subsequent analyses in Study 2 is of course debatable. It is clear that the authors cannot demonstrate, based on their reported analyses, that they had films that were equivalent on variables that they identified as important to hold constant with only violent content varying. The final decision inconsistency suggested that there was a sex difference in ratings of the variable fear, contrary to the authors' claim: t(798) = -2.14, p > .05 (actual p-value: p = 0.03266). How much that impacted the experiment in Study 2 was not something I thought I could assess, but I found it troubling and worth flagging. At minimum, the film clips were less equivalent than reported, and the subsamples were potentially reacting differently to these film clips than reported.
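With 798 degrees of freedom, the t distribution is very nearly normal, so a quick sanity check on these three statistics needs nothing more than the complementary error function. The sketch below is a normal approximation, not the exact t-based calculation Statcheck performs, so the p-values will differ slightly in the later decimal places. The function name and labels are my own.

```python
import math


def normal_two_tailed_p(z):
    """Two-tailed p-value under a standard normal: P(|Z| > |z|)."""
    return math.erfc(abs(z) / math.sqrt(2))


# t statistics as reported in Study 1, all with df = 798 (labels are illustrative).
reported_t = [
    ("pleasantness, film difference", 2.32),
    ("fear, film difference", 2.13),
    ("fear, sex difference", -2.14),
]

for label, t_value in reported_t:
    p = normal_two_tailed_p(t_value)
    # The article reported each of these as p > .05.
    consistent = p > .05
    print(f"{label}: t(798) = {t_value}, approx p = {p:.4f}, "
          f"consistent with 'p > .05': {consistent}")
```

Even under this rough approximation, all three fall below .05, which is the opposite of what the article reported.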

Although I did not comment on Study 2, Hilgard demonstrated that the pattern of reported means in this paper was strikingly similar to the pattern of means reported in a couple of earlier papers from 2013 on which Zhang was a lead author or coauthor. That is troubling. If you have followed some of my coverage of Zhang's work on my blog, you are well aware that I have actually discovered at least one instance in which a reported test statistic was directly copied and pasted from one paper to another. Make of it what you will. Dr. Hilgard was eventually able to get hold of the data that were to accompany a correction to that article, and as noted in the coverage in Retraction Watch, the data and analyses were fatally flawed.

I was noticing a pervasive pattern of errors in these papers, along with others I was reading by Zhang and colleagues at the time. These are the first papers on which Zhang is a lead or coauthor to be retracted. I am willing to bet that these will not be the last, given the evidence I have been sharing with you all here and on Twitter over the last year. I have already probably stated this repeatedly about these retractions - I am relieved. There is no joy to be had here. This has been a bad week for the authors involved. Also please note that I am taking great care here not to assign motive. I think that the evidence speaks for itself that the research was poorly conducted and poorly analyzed. That can happen for any of a number of reasons. I don't know any of the authors involved. I have some awareness of Dr. Espelage's work in bullying, but that is a bit outside my own specialty area. My impression of her work has always been favorable, and these retractions notwithstanding, I see no reason to change my impression of her work on bullying.

If I was sounding alarms in 2018 and onward, it is because Zhang had begun to increasingly enlist as collaborators well-regarded American and European researchers, and was beginning to publish in top-tier journals in various specialties within the Psychological Sciences, such as child and adolescent development and aggression. Given that I thought a reasonable case could be made that Zhang's reputation for well-conducted and analyzed research was far from ideal, I did not want to see otherwise reputable researchers put their careers on the line. My fears to a certain degree are now being realized.

Note that in the preparation of this post, I relied heavily on my tweets from Sept. 24, 2018 and a couple of posts I published pseudonymously in PubPeer (see links above). And credit where it is due. I am glad I could get a conversation started about these papers (and others) by this lab. Joe Hilgard has clearly put a great deal of effort and talent into clearing the record since. Really, we owe him a debt of gratitude. And also a debt of gratitude to those who have asked questions on Twitter, retweeted, and refused to let up on the pressure. Science is not self-correcting. It takes people who care to actively do the correcting.
