Monday, October 28, 2019

Consistency counts for something, right? Zhang et al. (2019)

If you stumble upon this Zhang et al. (2019) paper, published in Aggressive Behavior, you'll notice that this lab really loves to use a variation of the Stroop Task. Nothing wrong with that in and of itself. It is, after all, presumably one of several ways to attempt to measure the accessibility of aggressive cognition. One can compute mean differences between reaction times (RT) for aggressive words and for nonaggressive words under different priming conditions and see whether stimuli we believe have violent content make aggressive thoughts more accessible - in this case, with reaction times being higher for aggressive words than for nonaggressive words (hence, higher positive difference scores). I don't want to get you too far into the weeds, but I think having that context is useful in this instance.
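For readers who want to see the logic of that measure in concrete terms, here is a minimal sketch of how such a difference score might be computed from trial-level data. The column names (participant, condition, word_type, rt_ms) are hypothetical and are not taken from Zhang et al. (2019); this is just the shape of the computation.

```python
# Minimal sketch of the difference-score logic described above.
# Column names (participant, condition, word_type, rt_ms) are hypothetical,
# not taken from Zhang et al. (2019).
import pandas as pd

def accessibility_scores(trials: pd.DataFrame) -> pd.DataFrame:
    """Mean RT difference (aggressive minus nonaggressive) per participant and condition."""
    mean_rt = (
        trials
        .groupby(["participant", "condition", "word_type"])["rt_ms"]
        .mean()
        .unstack("word_type")
    )
    mean_rt["rt_diff"] = mean_rt["aggressive"] - mean_rt["nonaggressive"]
    return mean_rt.reset_index()

# A higher positive rt_diff under the violent-prime condition would be read
# as greater accessibility of aggressive thoughts.
```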

So far so good, yeah?

Not so fast. Usually the differences we find in RT between aggressive and nonaggressive words in these various tasks - including the Stroop Task - are very small. We're talking single-digit or small double-digit differences in milliseconds. As has been the case with several other studies for which Zhang and colleagues have had to publish errata, that's not quite what happens here. Joe Hilgard certainly noticed (see his note on PubPeer). Take a peek for yourself:


Hilgard notes another oddity, as well as the primary author's (Qian Zhang's) general tendency to essentially stonewall requests for data. This is yet another paper I would be hesitant to cite without access to the data, given that this lab already has an interesting publishing history, including some very error-prone errata for several papers published this decade.

Note that I am only commenting very briefly on the cognitive outcomes. The authors also have data analyzed using a competitive reaction time task. Maybe I'll comment more about that at a later date.

As always, reader beware.

Reference:

Zhang, Q., Cao, Y., Gao, J., Yang, X., Rost, D. H., Cheng, G., Teng, Z., & Espelage, D. L. (2019). Effects of cartoon violence on aggressive thoughts and aggressive behaviors. Aggressive Behavior, 45, 489-497. doi: 10.1002/ab.21836

Sunday, October 27, 2019

Postscript to the preceding

I am under the impression that the body of errata and corrigenda from the Zhang lab was composed as hastily and carelessly as the original articles themselves. I wonder how much scrutiny the editorial teams of the respective journals gave these corrections when they were submitted. I worry that very little was involved, and it is a shame that, once more, post-peer-review scrutiny is all that is available.

Erratum to Zhang, Zhang, & Wang (2013) has errors

This is a follow-up to my commentary on the following paper:

Zhang, Q., Zhang, D., & Wang, L. (2013). Is aggressive trait responsible for violence? Priming effects of aggressive words and violent movies. Psychology, 4, 96-100. doi: 10.4236/psych.2013.42013

The erratum can be found here.

It is disheartening when an erratum ends up being more problematic than the original published article. One thing that struck me immediately is that the authors continue to insist that they ran a MANCOVA. As I stated previously:
It is unclear just how a MANCOVA would be appropriate, given that the only DV the authors consider for the remaining analyses is a difference score. MANOVA and MANCOVA are appropriate analytic techniques for situations in which multiple DVs are analyzed simultaneously. The authors fail to list a covariate. Maybe it is gender? Hard to say. Without an adequate explanation, we as readers are left to guess. Even if a MANCOVA were appropriate, Table 4 is a case study in how not to set up a MANCOVA table. Authors should be as explicit as possible about what they are doing. I can read Method and Results sections just fine, thank you. I cannot, however, read minds.

In essence, my initial complaint remains unaddressed. One change: Table 4 is now Table 1, and it has different numbers in it. Great. Based on the description given, I still have no idea (nor would any reasonably minded reader) what the authors used as a covariate, nor do I know what purported multiple DVs were analyzed simultaneously. This is not an analysis I use very often in my own work, although I have certainly done so in the past. I do have an idea of how MANOVA and MANCOVA tables should be set up and how those analyses should be described; I did a fair amount of that for my first-year project at Mizzou a long time ago. The authors used as their DV a single difference score (the difference between RT for aggressive words and RT for nonaggressive words), which rules out the need for a MANOVA. And since no covariate is specified, a MANCOVA is ruled out as well. I am going to make a wild guess that the partial summary table that comprises Table 1 will turn out to be as nonsensical as similar tables in other papers from this lab, including their errata and corrigenda. I do not expect to be able to recover the necessary error MS, which I could then use to estimate the pooled SD.
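For what it is worth, here is a hedged sketch of the analysis the design, as described, actually seems to call for: a single difference-score DV crossed with between-subjects factors is an ordinary factorial ANOVA, not a MANCOVA. The column and factor names (rt_diff, movie, trait_group) are mine, not the authors':

```python
# Hedged sketch: a plain between-subjects factorial ANOVA on a single
# difference-score DV. Column names (rt_diff, movie, trait_group) are
# hypothetical; this is not the authors' actual analysis script.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

def factorial_anova(scores: pd.DataFrame) -> pd.DataFrame:
    # One DV, two between-subjects factors, no covariate.
    model = ols("rt_diff ~ C(movie) * C(trait_group)", data=scores).fit()
    return anova_lm(model, typ=2)  # SS, df, F, and p for each effect

# Only if several DVs were analyzed simultaneously (MANOVA) with a named
# covariate (MANCOVA) would the label the authors use make sense.
```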

I also want to note that the authors' description of Table 2 and the numbers in Table 2 itself do not match up. I find that troubling. I am assuming the authors mislabeled the columns and intended for the low-trait and high-trait columns to be reversed. It is still sloppy.
At least when I ran this document through Statcheck, the findings, as reported, appeared clean - no inconsistencies and no decision inconsistencies. I wish that provided even cold comfort. Since I don't know whether I can trust any of what I have read in either the original article or the current erratum, I am not sure there is any comfort to be had.

What saddens me is that so much media violence research is based on WEIRD samples. That limits the generalizability of the findings, and it also limits the scope of any skepticism my peers and I might have about media violence effects. We need good non-WEIRD research. So the fact that a lab generating a lot of non-WEIRD research is producing work riddled with errors is a major disappointment.

At this juncture, the only cold comfort I would find is if the whole lot of problematic studies from this lab were retracted. I do not say that lightly. I view retraction as a last resort, for when there is no reasonable way to correct the record without removing the paper itself. Doing so appears to be necessary here for at least a few reasons. One, meta-analysts might try to use this research - either the original article or the erratum (or both, if they are not paying attention) - to generate effect size estimates. If we cannot trust the effect size estimates we generate, it's pretty much game over. Two, given that in a globalized market we all consume much of the same media (or at least the same genres), it makes sense to have evidence not only from WEIRD samples but also from non-WEIRD samples. Some of us are trying to understand how violent media affect non-WEIRD populations in order to determine whether our understanding of these phenomena is universal. The findings from this paper, and from this lab more broadly, do not contribute to that understanding. If anything, they detract from our ability to get any closer to the truth. Three, the general public latches on to whatever seems real. If the findings are bogus - whether due to gross incompetence or fraud - then the public is essentially being fleeced, which to me is simply unacceptable. The Chinese taxpayers deserved better. So do all of us who are global citizens.


Monday, October 14, 2019

"It's beginning to, and back again": Zheng and Zhang (2016) part 2

A few months ago, I blogged about an article by Zheng and Zhang (2016) that appeared in Social Behavior and Personality. I thought it would be useful to briefly return to this particular article because it was (unbeknownst to me at the time) my first exposure to that lab's work, and because I think it might be helpful if you all could see what I see when I read it.

I don't think I need to re-litigate my reasoning for recommending rejection when I peer reviewed that article, nor my disappointment that it was published anyway. Water under the bridge. What I want to do is share some screenshots of the analyses in question, as well as note a few other odds and ends that have always bugged me about this particular paper.

I am keeping my focus on Study 2, as that seems to be the most problematic portion of the paper. Keep in mind that 240 children participated in the experiment. One of the burning questions is why the degrees of freedom in the denominator for so many of the analyses are so low. Because the authors provided no descriptive statistics (including n's), it is often difficult to know exactly what is happening, but I have a guess. If you follow the Zhang lab's progression since near the start of this decade, sample sizes in their published work have increased. I have a sneaking hunch that the authors copied and pasted text from prior articles and did not adequately update the reported degrees of freedom. The df for the simple effects analyses may actually be correct, but there is no real way of knowing given the lack of reported descriptive statistics.

One problem is that there seems to be a shifting dependent variable (DV). In the first analysis, where the authors attempted to establish a main effect, they used only the mean reaction time (RT) for aggressive words as the DV. In subsequent analyses, they used a mean difference in reaction times (RT for neutral words minus RT for aggressive words) as the DV. That alone creates some confusion.

So let's start with the main analysis, as I do have a screenshot I used in a tweet a while back:

So you can see the problem I am seeing already. The analysis itself is nonsensical. There is no way to say that violent video games primed aggressive thoughts in children who played the ostensibly violent game, as there is no basis for comparison (i.e., RT for non-aggressive words). There is a reason why my colleagues and I in the Mizzou Aggression Lab a couple of decades ago computed a difference score between RTs for aggressive words and RTs for non-aggressive words and used it as our DV when we ran pronunciation tasks, lexical decision tasks, and so on. If the authors were not willing to do that much, then a complex between/within ANOVA, in which the interaction term would have been the main focus, would have been appropriate. Okay. Enough of that. Make note of the degrees of freedom (df) in the denominator. With 240 participants, there is no way one ends up with 68 df in the denominator for a main effects analysis (see the back-of-the-envelope check below).
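To make the df complaint concrete: in a fully between-subjects factorial ANOVA, the error df is N minus the number of cells. The 2 x 2 x 2 structure assumed below (game type by gender by trait group) is my reading of the design, not something the paper spells out.

```python
# Back-of-the-envelope error df for a fully between-subjects factorial design.
# The 2 x 2 x 2 structure is my assumption, not something stated in the paper.
n_participants = 240
n_cells = 2 * 2 * 2          # game type x gender x trait group
df_error = n_participants - n_cells
print(df_error)              # 232, nowhere near the 68 reported for a main effect
```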

Let's look at the rest of the results. The Game Type by Gender interaction analyses were, um, a bit unusual.

First, let it soak in that the authors claim to be running a four-way ANOVA, yet there appear to be only three independent variables: game type, gender, and trait aggressiveness. Where is the fourth variable hiding? Something is already amiss. Now note that the first analysis goes back to main effects. Here the difference between RTs for aggressive and nonaggressive words is used as the DV, unlike the prior analysis, which examined only RTs for aggressive words. Bad news, though: if we believe the df as reported, a Statcheck analysis shows that the reported F could not be significant. Bummer. Statcheck also found that although the reported F for the game type by gender interaction - F(1, 62) = 4.89 - was significant, it was at p = .031. Note again that this assumes the df can be taken at face value. The authors report the mean difference scores for boys playing violent and nonviolent games, but do not do so for girls in either condition. I found the lack of descriptive statistics vexing, to say the least.
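That Statcheck-style check is easy to reproduce by hand. Here is a minimal sketch with scipy, taking the reported F(1, 62) = 4.89 and its df at face value:

```python
# Recompute the p-value implied by the reported F(1, 62) = 4.89,
# taking the printed degrees of freedom at face value.
from scipy import stats

p = stats.f.sf(4.89, 1, 62)   # survival function: P(F > 4.89) with dfn=1, dfd=62
print(round(p, 3))            # approximately 0.031
```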

How about the analyses examining a potential interaction of game type and trait aggressiveness? That doesn't exactly look great:

Although Statcheck reports no obvious decision errors for the primary interaction effect or the simple effects, the reported df are, for lack of a better way to phrase this, all over the place. The lack of descriptive statistics makes it difficult to diagnose exactly what is going on. Then the authors go on to report a 3-way interaction as non-significant, when a Statcheck analysis indicates that it would be significant. A significant 3-way interaction would require considerable effort to characterize carefully and to portray graphically.

It also helps to go back and look at the Method section and see how the authors determined how many trials each participant would experience in the experiment:


As I stated previously:

The authors selected 60 goal words for their reaction time task: 30 aggressive and 30 non-aggressive. These goal words are presented individually in four blocks of trials. The authors claim that their participants completed 120 trials in total, when the actual total would appear to be 240 trials. I used fewer trials with adult participants in an experiment I ran a couple of decades ago, and that was a nearly hour-long ordeal for them. I can only imagine the heroic level of attention and perseverance required of these children to complete this particular experiment. I do have to wonder whether the authors tested for potential fatigue or practice effects that might have been detectable across blocks of trials. Doing so was standard operating procedure in the Aggression Lab at Mizzou back in the 1990s, and those findings would have been reported - at least in a footnote - when submitted for publication.
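The trial-count arithmetic is simple enough to lay out explicitly. The assumption below, that each of the 60 goal words appears once per block across the four blocks, is my reading of the Method description rather than a direct quote from it:

```python
# Trial-count arithmetic, assuming each goal word appears once per block
# (my reading of the Method section, not a quote from it).
aggressive_words = 30
nonaggressive_words = 30
blocks = 4

total_trials = (aggressive_words + nonaggressive_words) * blocks
print(total_trials)   # 240, not the 120 the authors report
```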

Finally, I just want to say something about the way the authors described the personality measure they used. The authors appear to be interested in an overall assessment of trait aggressiveness, and the Buss and Perry AQ is arguably defensible for that purpose. The authors have a tendency to repeat the original reliability coefficients reported by Buss and Perry (1992). Given that they only examined overall trait aggressiveness, and given that they presumably had to translate the instrument into Chinese, they would have been better served by reporting the reliability coefficient(s) they actually obtained, rather than copying and pasting the same basic statement that appears in other papers published by the Zhang lab. One has to get to the General Discussion before the authors even obliquely mention that the instrument was translated, or recommend an adaptation of the BPAQ specifically for Chinese-speaking and Chinese-reading individuals.
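Reporting sample-specific reliability is not a heavy lift. Here is a minimal sketch of Coefficient Alpha computed from a respondents-by-items matrix of the translated scale's own responses; the function and variable names are mine, not the authors':

```python
# Minimal Cronbach's alpha for the translated scale's own item responses.
# `items` is a respondents x items array; the names here are mine, not the authors'.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                           # number of items
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# This is the statistic that should be reported for the Chinese translation,
# rather than the alphas Buss and Perry (1992) obtained for the English original.
```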

This was a paper that had so many question marks that it should never have been published in the first place. That it did somehow slip through the peer review system is indeed unfortunate. If the authors are unable or unwilling to make the necessary corrections, it is up to the editorial team at the journal to do so. I hope that they will in due time. I know that I have asked.

If the paper is not retracted, any correction would have to report the descriptive statistics upon which the Study 2 analyses were based, as well as the correct inferential statistics: accurate F-tests, df, and p-values. Yes, that means tables would be necessary. That is not a bad thing. The Coefficient Alphas actually obtained in this study for their version of the BPAQ should be reported, instead of simply repeating what Buss and Perry reported for the original English-language version of the instrument in the previous century. The editorial team should insist on examining the original data themselves, either to confirm that any corrections made are indeed correct or to determine that the data set is so hopelessly botched that the reported findings cannot be trusted, hence necessitating a retraction.

How all this landed on my radar is really just the luck of the draw. I was asked to review a paper in 2014, and I had the time and interest to do so. The topic was in my wheelhouse, so I agreed. I recommended rejection, which in hindsight was sound, and I moved on with my life. A couple of years later I read a weapons priming effect paper that was really odd, with reported analyses that were difficult to trust. I didn't make the connection until an ex-coauthor of mine turned up on a paper that appeared to have originated from this lab. At that point I scoured the databases until I could locate every English-language paper published by this lab, and discovered that this specific paper - the one I had recommended rejecting - had been published as well. In the process, I noticed a distinct similarity among all the papers: how they were formatted, the types of analyses, and the types of data analytic errors. I realized pretty quickly that "holy forking shirtballs, this is awful." I honestly don't know whether what I have read in this series of papers amounts to gross incompetence or fraud. I do know that it does not belong in the published record.

To be continued (unfortunately)....

Reference:

Zheng, J., & Zhang, Q. (2016). Priming effect of computer game violence on children’s aggression levels. Social Behavior and Personality: An International Journal, 44(10), 1747-1759. doi: 10.2224/sbp.2016.44.10.1747

Footnote: The lyric comes from the chorus of "German Shepherds" by Wire. Toward the end of this post, near the end of the last paragraph, I make a reference to some common expressions used in the TV series The Good Place.

Sunday, October 13, 2019

Back to that latest batch of errata: Tian et al. (2016)

So a few weeks ago I noted that there had been four recent corrections to papers published out of the Zhang lab. It's time to turn to a paper with a fun history to it:

Tian, J., Zhang, Q., Cao, J., & Rodkin, P. (2016). The short-term effect of online violent stimuli on aggression. Open Journal of Medical Psychology, 5, 35-42. doi: 10.4236/ojmp.2016.52005

What initially caught my attention was that a 3-way interaction reported as nonsignificant in this particular article was identical to a similar 3-way interaction in another article published by Zhang, Tian, Cao, Zhang, and Rodkin (2016). Same numbers, same failure to report degrees of freedom, and the same decision error in each paper. Quite the coincidence, as I noted before. Eventually, Zhang et al. (2016) did manage to change the numbers on several analyses in the paper published in Personality and Individual Differences. See the corrigendum for yourself. Even the 3-way interaction got "corrected" so that it no longer appeared significant. We even get - gasp - degrees of freedom! Not so with Tian et al. (2016) in OJMP. I guess that is the hill these authors choose to die on? Alrighty then.

So if you want to see what actually gets changed from the original Tian et al. (2016) paper, read here. Compared to the original, it appears that the decision errors go away - except, of course, for that pesky three-way ANOVA, which I guess the authors simply chose not to address. Gone too is any reference to computing MANCOVAs, which is the least I would expect, given that there was no evidence such analyses were ever done - no mention of a covariate, nor any mention of multiple dependent variables to be analyzed simultaneously. This is at least a bit better. The table of means, at least on the surface, seems to add up. The new Table 1 is a bit funky, though. I've noticed with another of these papers that the Mean Square error implied by the Mean Squares and F values reported for the main and interaction effects would not support the SDs supplied in the descriptive stats. That appears to be the case with this correction as well, to the extent that one can make an educated guess about the Mean Square error from an incomplete summary table. Even with those disadvantages in papers by other authors in the past, I have generally managed to get a reasonably close estimate of the MSE, and hence, with some simple computations, an estimate of the pooled standard deviation. That I am unable to do so satisfactorily here is troubling. When I ran into this difficulty with another of the corrections from this lab, I consulted with a colleague who quickly made it clear that the likely correct pooled MSE would not support the descriptive statistics as reported. So I am at least reasonably certain that I am not making a mistake here.
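For anyone who wants to try the same sanity check: in a between-subjects ANOVA, each reported effect implies MS_error = MS_effect / F, and the square root of MS_error estimates the pooled within-cell SD. A minimal sketch, with placeholder inputs rather than values copied from the correction:

```python
# Sanity check described above: each reported effect in a between-subjects
# ANOVA implies MS_error = MS_effect / F, and sqrt(MS_error) estimates the
# pooled within-cell SD. The inputs are placeholders, not values copied
# from the correction.
import math

def implied_pooled_sd(ms_effect: float, f_value: float) -> float:
    ms_error = ms_effect / f_value
    return math.sqrt(ms_error)

# If the SDs in the descriptive table are far from this estimate for every
# effect in the summary table, something in the reported analysis is off.
```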

I also find it odd that the authors now report that viewing violent stimuli produced no change in aggressive personality - as if the Buss and Perry Aggression Questionnaire, which measures a stable trait, would ever be changed by short-term exposure to a stimulus like a brief clip of a violent film. What the authors might have been trying to say is that there was no interaction between scores on the Buss and Perry AQ and movie violence. That is only a guess on my part.

These corrections, as they are billed, strike me as very rushed, and potentially as mistake-ridden as the original articles. This is the second correction out of this new batch that I have had time to review and it is as problematic as the first. Reader beware.

Tuesday, October 1, 2019

Interlude: When Social Sciences Become Social Movements

I found this very positive take on the recent SIPS conference in Rotterdam earlier today. Am so glad to have seen this. As someone who either grew up in the shadow of various social movements (including the No Nukes movement that was big when I was in my mid-teens) or was a participant (in particular Anti-Apartheid actions, as well as an ally of the feminist movement of the time and the then-struggling LGBT community that was largely at the time referred to as "Gay Rights"), I feel right at home in a social movement environment. They are after all calls to action, and at their best require of their participants a willingness to roll up our sleeves and do the hard work of raising awareness, changing policies and so on. All of that was present when I attended in July and what I experienced left me cautiously optimistic about the state of the psychological sciences. Questionable research practices still happen, and powerful players in our field still have too much pull in determining what gets said and what gets left unsaid. What's changed is Twitter, podcasts, etc. Anyone can listen to some state-of-the-art conversation with mostly early career researchers and educators who are at the cutting edge, and who are not afraid to blow whistles as necessary. Anyone can interact with these same individuals on Twitter. And although eliminating hierarchies is at best a pipe dream, the playing field among the Open Science community in my corner of the scientific universe is very level. I'll paraphrase something from someone I would prefer not to name: Some people merely talk about the weather. The point is to get up and do something about it. We've gone well beyond talk. We have the beginnings of some action thanks to what started as just a few understandably irate voices in the wake of the Bem paper and the Stapel scandal in 2011. We have a long way to go. And yes, if you do go to a SIPS conference, expect to meet some of the friendliest people you could hope to meet - amazing how those who are highly critical of bad and fraudulent science turn out to be genuinely decent in person. Well, not so amazing to me.