Friday, December 6, 2019

Now about those Youth and Society retractions involving Qian Zhang

Hopefully you have had a moment to digest the recent article about the retraction of two of Qian Zhang's papers in Retraction Watch. I began tweeting about one of the articles in question around late September, 2018. You can follow this link to my tweet storm for the now retracted Zhang, Espelage, and Zhang (2018) paper. Under a pseudonym, I initially documented some concerns about both papers in PubPeer: Zhang, Espelage, and Zhang (2018) and Zhang, Espelage, and Rost (2018).

Really, what I did was to upload the papers into the online version of Statcheck, and flag any decision inconsistencies I noticed. I also tried to be mindful of any other oddities that seemed to stick out at that time. I might make note of df that seemed unusual given the reported sample size, for example, or problems with tables presuming to report means and standard deviations. By the time I would have looked at these two papers, I suspect that I was already concerned that papers from Zhang's lab showed a pervasive pattern of errors. Sadly, these two were no different.

With regard to Zhang, Espelage, and Zhang (2018), a Statcheck scan showed three decision errors. In this case, these were errors where the authors reported findings as statistically significant when they were not - given the test statistic value, the degrees of freedom, and the level of significance the authors tried to report.

The first decision inconsistency has to do with an assertion that playing violent video games increased accessibility of aggressive thoughts. The authors initially reported the effect as F(1, 51) = 2.87, p < .05. The actual p-value would have been .09634, according to Statcheck. In other words, there is no main effect for violent content of video games in this sample. Nor was a video game type by gender interaction found: F(1, 65) = 3.58, p < .01 actual p-value: p = 0.06293. Finally, there is no game type by age interaction: F(1, 64) = 3.64, p < .05 actual p-value: p = 0.06090. Stranger still, the sample was approximately 3000 students. Why were the denominator degrees of freedom so small for these reported test statistics? Something did not add up. Table 1 from Zhang, Espelage, and Zhang (2018) was also completely impossible to interpret - an issue I have highlighted in other papers that have been published from his lab:

A few months later, a correction would be published in which the authors would purport to correct a number of errors found on several of the pages of the original article, as well as Table 1. That was wonderful insofar as it went. However, there was a new oddity. The authors purported to only use 500 of the 3000 participants in order to have a "true experiment" - which was one of the more interesting uses of that term I have read over the course of my career. And as Joe Hilgard has aptly noticed, problems with the descriptive statistics continued to be pervasive - implausible and impossible cell means and marginal means and standard deviations, for example.

With regard to the Zhang, Espelage, and Rost (2018) article, my initial flag was simply for some decision errors in Study 1, in which the authors were attempting to establish that their stimulus materials were equivalent across a number of variables except for the level of violent content, and consistent across subsamples, such as sex of participant (male/female). Given the difficulty that exists in obtaining, say, film clips that are sufficiently equivalent except for level of violence, due diligence in using materials that are as equivalent as possible, except for the IV or IVs is to be admired. Unfortunately, There were several decision errors that I flagged after a Statcheck run.

As noted at the time, contra the authors' assertion, there was evidence that there were some rated differences between the violent film (Street Fighter) and the nonviolent film (Air Crisis) in terms of pleasantness - t(798) = 2.32, p > .05 actual p-value: p = 0.02059 - and fear - t(798) = 2.13, p > .05 actual p-value: p = 0.03348  To the extent that failure to control for those factors might have impacted subsequent analyses in Study 2 is of course debatable. It is clear that the authors cannot demonstrate, based on their reported analyses, that they had films that were equivalent on variables that they identified as important to hold constant with only violent content varying. The final decision inconsistency suggested that there was a sex difference in ratings of the variable fear, contrary to authors claim, t(798) = -2.14, p > .05 actual p-value: p = 0.03266. How much that impacted the experiment in study 2 was not something I thought I could assess, but I found it troubling and worth flagging. At minimum, the film clips were less equivalent than reported, and the subsamples were potentially reacting differently to these film clips than reported.

Although I did not comment on Study 2, Hilgard demonstrated that there was a consistency in the pattern of reported means in this paper were strikingly similar to the pattern of means reported in a couple of earlier papers in which Zhang was a lead or coauthor in 2013. That is troubling. If you have followed some of my coverage of Zhang's work on my blog, you are well aware that I have actually discovered at least one instance in which a reported test statistic was directly copied and pasted from one paper to another. Make of it what you will. Dr. Hilgard was able to eventually get a hold of the data that were to accompany a correction to that article, and as noted in the coverage in Retraction Watch, the data and analyses were fatally flawed.

I was noticing a pervasive pattern of errors in these papers, along with others I was reading by Zhang and colleagues at the time. These are the first papers on which Zhang is a lead or coauthor to be retracted. I am willing to bet that these will not be the last, given the evidence I have been sharing with you all here and on Twitter over the last year. I have already probably stated this repeatedly about these retractions - I am relieved. There is no joy to be had here. This has been a bad week for the authors involved. Also please note that I am taking great care here not to assign motive. I think that the evidence speaks for itself that the research was poorly conducted and poorly analyzed. That can happen for any of a number of reasons. I don't know any of the authors involved. I have some awareness of Dr. Espelage's work in bullying, but that is a bit outside my own specialty area. My impression of her work has always been favorable, and these retractions notwithstanding, I see no reason to change my impression of her work on bullying.

If I was sounding alarms in 2018 and onward, it is because Zhang had begun to increasingly enlist as collaborators well-regarded American and European researchers, and was beginning to publish in top-tier journals in various specialties within the Psychological Sciences, such as child and adolescent development and aggression. Given that I thought a reasonable case could be made that Zhang's reputation for well-conducted and analyzed research was far from ideal, I did not want to see otherwise reputable researchers put their careers on the line. My fears to a certain degree are now being realized.

Note that in the preparation of this post, I relied heavily on my tweets from Sept. 24, 2018 and a couple posts I published pseudonymously in PubPeer (see links above). And credit where it is due. I am glad I could get a conversation started about these papers (and others) by this lab. Joe Hilgard has clearly put a great deal of effort and talent into clearing the record since. Really we owe him a debt of gratitude. And also a debt of gratitude to those who have asked questions on Twitter, retweeted, and refused to let up on the pressure. Science is not self-correcting. It takes people who care to actively do the correcting.

For those visiting from Retraction Watch:

Retraction Watch posted an article about two retractions of articles in which Qian Zhang of Southwest University in China was the lead author. Since some of you might be interested in what I've documented about other published articles from Zhang's lab, your best bet is to either type Zhang in the search field for this blog. Or just follow this link, where I have done the work for you. I'll have more to say about these specific articles in a little bit. I think I documented some of my concerns on Twitter last year and pseudonymously on PubPeer. In the meantime, I am relieved to see two very flawed articles removed from the published record. Joe Hilgard deserves a tremendous amount of credit for his work reanalyzing some data he was able to obtain from the lab (and his meticulous documentation of the flaws in these papers), and for his persistence in contacting the Editor in Chief of Youth and Society. I am also grateful for tools like Statcheck, which enabled me to very quickly spot some of the problems with these papers.

Saturday, November 30, 2019

Revisiting the weapons effect database: The allegiance effect redux

A little over a year ago, I blogged about some of the missed opportunities from the Benjamin et al. (2018) meta-analysis. One of those missed opportunities was to examine something known as an investigator allegiance effect (Luborsky et al., 2006). As I noted at the time, I gave credit where credit was due (Sanjay Srivastava, personal communication) for the idea. I simply found a pattern as I was going back over the old database and Dr. Srivastava quite aptly told me what I was probably seeing. It wasn't to difficult to run some basic meta-analytic results through CMA software and tentatively demonstrate that there appears to be something of an allegiance effect.

So, just to break it all down, let's recall when the weapons effect appears to occur. Based on the old Berkowitz and LePage (1967) experiment, the weapons effect appears to occur when individuals are exposed to a weapon and are highly provoked. Under those circumstances, the short-term exposure to weapons instigates an increase in aggressive behavior. Note that this is presumably what Carlson et al. (1990) found in their early meta-analysis. So far, so good. Now, let's see where things get interesting.

I have been occasionally updating the database. Recently have added some behavioral research, although it is focused on low provocation conditions. I am aware of some work recently conducted under conditions of high provocation, but have yet to procure those analyses. That said, I have been going over the computations, double and triple checking them once more, cross-validating them and so on. Not glamorous work, but necessary. I can provide basic analyses along with funnel plots. If there is what Luborsky et al. (2006) define as an allegiance effect, the overall mean effect size for work conducted by researchers associated with Berkowitz should be considerably different than work conducted by non-affiliated researchers. The fairest test I could think of was to concentrate on studies in which there was a specific measure of provocation, a specific behavioral measure of aggression, and - more to the point - to concentrate on high provocation subsamples, based on the rationale provided by Berkowitz and LePage (1967) and Carlson et al. (1990). I coded these studies based on whether the authors were in some way affiliated with Berkowitz (e.g., former grad students, post-docs, or coauthors) or were independent. That was fairly easy to do. Just took a minimal amount of detective work. I then ran the analyses.

Here is the mixed-effects analysis:

The funnel plot also looks pretty asymmetrical for those in the allegiance group (i.e. labelled yes). The funnel plot for those studies in the non-allegiance group appears more symmetrical. Studies in the allegiance group may be showing considerable publication bias, which should be of concern. Null studies, if they exist, are not included.

Above is the funnel plot for studies from the allegiance group.

Above is the funnel plot for studies from the non-allegiance group.

I can slice these analyses any of a number of ways. For example, I could simply examine subsamples that are intended to be direct replications of the Berkowitz and LePage (1967) paper. I can simply collapse across all subsamples, which is what I did here. Either way, the mean effect size will trend higher when the authors are interconnected. I can also document that publication bias is a serious concern when examining funnel plots of the papers in which the authors have some allegiance to Berkowitz than when not. That should be concerning.

I want to play with these data further as time permits. I am hoping to incite a peer to share some unpublished data with me so that I can update the database. My guess is that the findings will be even more damning. I say so relying only on the basic analyses that CMA provides along with the funnel plots.

For better or for worse I am arguably one of the primary weapons effect experts - to the extent that we define the weapons effect as the influence of short-term exposure of weapons on aggressive behavioral outcomes as measured in lab or field experiments. That expertise is documented in some published empirical work - notably Anderson et al. (1998) in which I was responsible for Experiment 2, and Bartholow et al. (2005) in which I was also primarily responsible for Experiment 2, as well as the meta-analysis on which I was the primary author (Benjamin et al., 2018). I've discussed my shortcomings before, as well as the shortcomings of some recent work. It is what it is, and I am who I am. But I do know this area of research very well, am quite capable of looking at the data available, and changing my mind if the analyses dictate - in other words, I am as close to objective as one can get when examining this particular topic. I am also a reluctant expert given that the data dictate the necessity of adopting a considerably more skeptical stance after years of believing the phenomenon to be unquestionably real. I do have an obligation to report the truth as it appears in the literature.

As it stands, not only should we be concerned that the aggressive behavioral outcomes reported in Berkowitz and LePage (1967) represent something of an urban myth, but that in general the mythology appears to be largely due to mostly published reports by a particular group of highly affiliated authors. There appears to be an allegiance effect in this literature. Whether a similar effect exists in the broader body of media violence studies remains to be seen, but I would not be surprised if such an allegiance effect existed.

Friday, November 8, 2019

The New Academia

Rebecca Willen has an interesting post up on Medium about some alternatives to the tradition academic model of conducting research. I doubt traditional academia is going anywhere, but it is quite clear that independent institutions can fill in some gaps and provide some freedoms that might not be afforded elsewhere. In the meantime, this article offers an overview of the potential challenges independent researchers might face and how those challenges can be successfully handled. Worth a read.

Monday, November 4, 2019

Another resource for sleuths

This tweet by Elizabeth Bik is very useful:

The site she used to detect a publication that was self-plagiarized not only in terms of data and analyses but also in terms of text can be found here: Similarity Texter. I will be adding that site to this blog's links. I think as a peer reviewer it will help in detecting potential problem documents. Obviously I see the utility for post-peer review. Finally, any of us as authors who publish multiple articles and chapters on the same topic would do well to run our manuscripts through this particular website prior to submission to any publishing portal. Let's be real and accept that the major publishing houses are very lax when it comes to screening for potential duplicate publication, in spite of the enormous profits that they make from taxpayers across the planet. We should also be real about the quality of peer review. As someone who has been horrified to receive feedback on manuscript from a supposedly reputable journal in less than 48 hours, I think a good case can be made as an author for taking things into your own hands as much as possible. That along with statcheck can save some embarrassment as well as ensure that we as researchers and authors do due diligence to serve the public good.

Friday, November 1, 2019

To summarize, for the moment, my series on the Zhang lab's strange media violence research

It never hurt to keep something of a cumulative record of one's activities when investigating any phenomenon, including secondary analyses.

In the case of the work produced in the lab of Qian Zhang, I have been trying to understand their work, and what appears to have gone wrong with their reporting, for some time. Unbeknownst to me at the time in 2014, I was already encountering one of the lab's papers when by the luck of the draw I was asked to review a manuscript that I would later find was coauthored by Zhang. As I have previously noticed, that paper had a lot of problems and I recommended as constructively as I could that the paper not be published. It was published anyway.

More explicitly, I found a weapons priming article published in Personality and Individual Differences at the start of 2016. It was an empirical study and one that fit the inclusion criteria for a meta-analysis that I was working on at the time. However, I ran into some really odd statistical reporting, leaving me unsure as to what I should use to estimate an effect size. So I sent what I thought was a very polite email to the corresponding author and heard nothing. After a lot of head-scratching, I figured out a way to extract effect size estimates that I felt semi-comfortable with. In essence the authors had no main effect for weapon primes on aggressive thoughts - and it showed in the effect size estimate and confidence intervals. That study really had a minimal impact on the overall mean effect size for weapon primes on aggressive cognitive outcomes in my meta-analysis. I ran analyses and later re-ran analyses and went on with my life.

I probably saw a tweet by Joe Hilgard who was reporting some oddities in another Zhang et al paper sometime in the spring of 2018. That got me wondering what all I was missing. I made a few notes, bookmarked what I needed to bookmark, and came back to that question a bit later in the summer of 2018 when I had a bit of time and breathing room. By this point I could comb through the usual archives, EBSCO databases, ResearchGate, and Google Scholar, and was able to hone in on a fairly small set of English-language empirical articles coauthored by Qian Zhang of Southwest University. I saved all the PDF files, and did something that I am unsure if anyone had done already: I ran the articles through statcheck. With one exception at the time, all the papers I ran through statcheck that had the necessary elements reported (test stat value, p-value, degrees of freedom) showed serious decision errors. In other words, the conclusions the authors were drawing in these articles were patently false based on what they had reported. I was also able to document that the reported degrees of freedom were inconsistent within articles, and often much smaller than the reported sample sizes. There were some very strange tables in many of these articles that presumably reported means and standard deviations but looked more like poorly constructed ANOVA summary tables.

I first began tweeting about what I was finding in mid-to-late September 2018. I think between some conversations via Twitter and email, I at least was convinced that I had spotted something odd, and that my conclusions so far as they went were accurate. Joe Hilgard was especially helpful in confirming what I had found, and then going well beyond that. Someone else honed in on inaccuracies in the reporting of the number of reaction time trials reported in this body of articles. So that went on throughout the fall of 2018. By this juncture, there were a few folks tweeting and retweeting about this lab's troubling body of work, some of these issues were documented by individuals in PubPeer, and editors were being contacted, with varying degrees of success.

By spring of this year, the first corrections were published - one in Youth and Society and a corrigendum in Personality and Individual Differences. To what extent those corrections can be trusted is still an open question. At that point, I began blogging my findings and concerns here, in addition to the occasion tweet.

This summer, a new batch of errata were made public concerning articles published in journals hosted by a publisher called Scientific Research. Needless to say, once I became aware of these errata, I downloaded those and examined them. That has consumed a lot of space on this blog since. As you are now well aware, these errata themselves require errata.

I think I have been clear about my motivation throughout. Something looked wrong. I used some tools now at my disposal to test my hunch and found that my hunch appeared to be correct. I then communicated with others who are stakeholders in aggression research, as we depend on the accuracy of the work of our fellow researchers in order to get to as close an approximation of the truth as is humanly possible. At the end of the day, that is the bottom line - to be able to trust that the results in front of me are a close approximation of the truth. If they are not, then something has to be done. If authors won't cooperate, maybe editors will. If editors don't cooperate, then there is always a bit of public agitation to try to shake things up. In a sense, maybe my role in this unfolding series of events is to have started a conversation by documenting what I could about some articles that appeared to be problematic. If the published record is made more accurate - however that must occur - I will be satisfied with the small part I was able to play in the process. Data sleuthing, and the follow-up work required in the process, is time-consuming and really cannot be done alone.

One other thing to note - I have only searched for English-language articles published by Qian Zhang's lab. I do not read or speak Mandarin, so I may well be missing out on a number of potentially problematic articles in Chinese-language psychological journals. If someone who does know of such articles wishes to contact me please do. I leave my DM open on Twitter for a reason. I would especially be curious to know if there are any duplicate publications of data that we are not detecting. 

As noted before, how all this landed on my radar was really just the luck of the draw. A simple peer review roughly five years ago, and a weird weapons priming article that I read almost four years ago were what set these events in motion. Maybe I would have noticed something was off regardless. After all, this lab's work is in my particular wheelhouse. Maybe I would not have. Hard to say. All water under the bridge now. What is left is what I suspect will be a collective effort to get these articles properly corrected or retracted.

Monday, October 28, 2019

Consistency counts for something, right? Zhang et al. (2019)

If you manage to stumble upon this Zhang et al. (2019) paper, published in Aggressive Behavior, you'll notice that this lab really loves to use a variation of the Stroop Task. Nothing wrong with that in and of itself. It is, after all, presumably one of several ways to attempt to measure the accessibility of aggressive cognition. One can get mean differences between reactions times (rt) for aggressive words and for nonaggressive words under different priming conditions and see if the stimuli with what we believe is violent content make aggressive thoughts more accessible - in this case with reactions times being higher for aggressive words than nonaggressive words (hence, higher positive difference scores). I don't really want to get you too much into the weeds, but I just think having that context is useful in this instance.

So far so good, yeah?

Not so fast. Usually the differences we find in rt between aggressive and nonaggressive words in these various tasks - including the Stroop Task - are very small. We're talking maybe single digit or small double digit differences in milliseconds. As has been the case with several other studies where Zhang and colleagues have had to publish errata, that's not quite what happens here. Joe Hilgard certainly noticed (see his note in PubPeer). Take a peek for yourself:

Hilgard notes another oddity as well as the general tendency for the primary author (Qian Zhang) to essentially stonewall requests for data. This is yet another paper I would be hesitant to cite without access to data, given that this lab already has an interesting publishing history, including some very error-prone errata for several papers published from this decade.

Note that I am only commenting very briefly on the cognitive outcomes. The authors also have data analyzed using a competitive reaction time task. Maybe I'll comment more about that at a later date.

As always, reader beware.


Zhang, Q., Cao, Y., Gao, J., Yang, X., Rost, D. H., Cheng, G., Teng, Z., & Espelage, D. L. (2019). Effects of cartoon violence on aggressive thoughts and aggressive behaviors. Aggressive Behavior, 45, 489-497. doi: 10.1002/ab.21836