Sunday, April 7, 2019

Closing the books on a correction (Zhang et al., 2016)

When I was updating the weapons effect database for a then-in-progress meta-analysis a little over three years ago, I ran across a paper by Zhang, Tian, Cao, Zhang, & Rodkin (2016). You can read the original here; it required significant corrections. The corrigendum can be found here.

Initially, I was excited, as it is not often that one finds a published weapons effect paper based on non-American or non-European samples. There were obvious problems from the start, however. For one, although the authors purport to measure aggression in adolescents (the sample actually consisted of pre-adolescent children), the dependent variable was in reality a difference in reaction time between aggressive and non-aggressive words. To put it another way, the authors were merely measuring the accessibility of aggressive thoughts that presumably would be primed by mere exposure to weapons.

The analyses themselves never quite added up, which made determining an accurate effect size estimate from their work, shall we say, a wee bit challenging. I attempted to contact the corresponding author to ask for the data and any code or syntax used, in the hope of reproducing the analyses and obtaining the effect size estimate that would most closely approximate the truth. That email was sent on January 26, 2016. I never heard from Qian Zhang. I figured out a work-around in order to obtain a satisfactory-enough effect size estimate and moved on.

But that paper always bothered me once the initial excitement wore off. I am well aware that I am far from alone in having some serious questions about the Zhang et al. (2016) article. Some of those could be written off as potential typos: there were some weird discrepancies in degrees of freedom across the analyses. Others were more substantive. The authors contended that they had replicated work I had been involved in conducting (Anderson, Benjamin, & Bartholow, 1998) by simply examining whether reaction times to aggressive words were more rapid when primed with weapons than with neutral images. In our experiments, by contrast, we used the difference between aggressive and non-aggressive words as our dependent variable. And based on the degrees of freedom reported, it appeared that the analysis was based on one subsample, as opposed to the complete sample. So obviously there are some red flags.

The various subsample analyses using a proper difference score (they call it AAS) also looked a bit off. And of course the MANOVA table seemed unusual, especially since the unit of analysis appeared to be their difference score (reaction times for aggressive words minus non-aggressive words) - a single dependent variable - as opposed to multiple dependent variables. Although I have rarely used MANOVA and am unlikely to use MANOVA in my own research, I certainly had enough training to know what such analyses should look like. My understanding is that one would report MS, df, and F values for each IV-DV relationship, with the understanding that there will be at least two DVs for every IV. A cursory glance at the most recent edition I had of a classic textbook on multivariate statistics by Tabachnick and Fidell (2012) convinced me that the summary table reported in the article was inappropriate, and would confuse readers rather than enlighten them. There were other questions about the extent to which the authors more or less copied and pasted content from the Buss and Perry (1992) article in which they present their Aggression Questionnaire. Those as of yet have not been adequately addressed, and I suspect they never will.

So, I ran the analyses the authors provided through statcheck, and I had even more questions. There were numerous errors, including decision errors, even assuming that the test statistics and their respective degrees of freedom were accurate. Just to give you a flavor, here are my initial statcheck analyses:

As you can see, the authors misreport F(1, 155) = 1.75, p < .05 (actual p = .188), F(1, 288) = 3.76, p < .01 (actual p = .054), and F(1, 244) = 1.67, p < .05 (actual p = .197). The authors also appeared to misreport as non-significant a three-way interaction that clearly was statistically significant. Statcheck could not catch that one due to the authors' failure to include any degrees of freedom in their report. Basically, there was no good reason to trust the analyses at this point. Keep in mind that what I have done here is something that anyone with a basic graduate-level grounding in data analysis and access to statcheck could do. Anyone can reproduce what I did. That said, communicating with others about my findings was comforting: I was not alone in seeing what was clearly wrong.
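For anyone who wants to check p-values like these without statcheck, here is a minimal sketch in Python. It is not what statcheck does internally. It only handles F tests with one numerator degree of freedom, where F equals t squared, and it approximates the t distribution by numerical integration (Simpson's rule). The function names are my own and purely illustrative.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def p_value_f_1df(f_value, df2, steps=2000):
    """Approximate p-value for F(1, df2), using the identity F = t**2.

    Integrates the t density over [0, sqrt(F)] with Simpson's rule and
    returns the two-tailed area that remains outside that interval.
    """
    if steps % 2:  # Simpson's rule needs an even number of intervals
        steps += 1
    upper = math.sqrt(f_value)
    h = upper / steps
    total = t_pdf(0.0, df2) + t_pdf(upper, df2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df2)
    central_area = total * h / 3
    return 1.0 - 2.0 * central_area


# Test statistics exactly as reported in the original article
reported = [(1.75, 155), (3.76, 288), (1.67, 244)]
for f_value, df2 in reported:
    p = p_value_f_1df(f_value, df2)
    print(f"F(1, {df2}) = {f_value}: p is approximately {p:.3f}")
```

Each of the three recomputed p-values comes out above .05, so none of these tests supports a claim of statistical significance at the conventional threshold.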

In consultation with some of my peers, something else jumped out: the authors reported an incorrect number of trials. The authors reported 36 primes and 50 goal words, which were each randomly paired, and gave the total number of trials as 900. However, if you do the math (36 × 50), it becomes obvious that the actual number of trials was 1800. As someone who was once involved in conducting reaction time experiments, I know the importance not only of working out the necessary number of trials given the number of prime stimuli and target words that must be randomly paired, but also of accurately reporting the number of trials required of participants. It is possible that, given their description in the article itself, the authors took the number 18 (for weapons, for example) and multiplied it by 50. In itself, that seems like a probable and honest error. It happens, although it would have been helpful for this sort of thing to have been worked out in the peer review process.
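The trial arithmetic can be sketched in a few lines of Python. This assumes that every prime was paired with every word exactly once, and that the figure of 18 refers to the weapon primes alone. Both are my reading of the article's description, and the stimulus labels are placeholders rather than the actual materials.

```python
from itertools import product

N_PRIMES = 36        # primes reported in the article
N_WORDS = 50         # goal (target) words reported in the article
N_WEAPON_PRIMES = 18 # assumption: the weapon primes only

# Placeholder labels; the real stimuli are not reproduced here
primes = [f"prime_{i}" for i in range(N_PRIMES)]
words = [f"word_{j}" for j in range(N_WORDS)]

# Full crossing: each prime paired once with each word
all_pairs = list(product(primes, words))
full_design_trials = len(all_pairs)

# What you get if only one prime category is multiplied by the word count
weapon_only_trials = N_WEAPON_PRIMES * N_WORDS

print(full_design_trials)  # 1800
print(weapon_only_trials)  # 900
```

Under those assumptions, the reported figure of 900 matches one prime category crossed with the words, not the full design.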

The corrigendum amounts to a rather massive correction to the article. The presumed MANOVA table never quite gets resolved to satisfaction, and a lingering decision error remains. The authors also start using the term "marginally significant" to refer to a subsample analysis, which made me cringe. The concept of marginal significance was supposed to have been swept into the dustbin of history a long time ago. We are well enough along into the 21st century to avoid that vain attempt to rescue a finding altogether. Whether the corrections noted in the corrigendum are sufficient to save the conclusions the authors wished to make in the article is questionable. At minimum, we can conclude that Zhang et al. (2016) did not find evidence of weapon pictures priming aggressive thoughts, and even their effort to base a partial replication on subsample analyses was not sufficient. It is a non-replication, plain and simple.

My recommendation is not to cite Zhang et al. (2016) unless absolutely necessary. If one is conducting a relevant meta-analysis, citation is probably unavoidable. Otherwise, the article is probably worth citing only if one is writing about questionable reporting of research, or perhaps as an example of research that fails to replicate a weapons priming effect.

Please note that the intention is not to attack this set of researchers. My concern is strictly with the research report itself and the apparent inaccuracies it contained. I am quite pleased that, however it transpired, the editor and authors were able to quickly make corrections in this instance. Mistakes get made. The point is to make an effort to fix them when they are noticed. That should at least be normal science. So kudos to those involved in making the effort to do the right thing here.


Anderson, C. A., Benjamin, A. J., Jr., & Bartholow, B. D. (1998). Does the gun pull the trigger? Automatic priming effects of weapon pictures and weapon names. Psychological Science, 9, 308-314. doi: 10.1111/1467-9280.00061

Buss, A. H., & Perry, M. (1992). The aggression questionnaire. Journal of Personality and Social Psychology, 63, 452-459. doi: 10.1037/0022-3514.63.3.452

Tabachnick, B. G., & Fidell, L. S. (2012). Using multivariate statistics. New York: Pearson.

Zhang, Q., Tian, J., Cao, J., Zhang, D., & Rodkin, P. (2016). Exposure to weapon pictures and subsequent aggression in adolescence. Personality and Individual Differences, 90, 113-118. doi: 10.1016/j.paid.2015.09.017
