Friday, October 22, 2021

Back to Weapons Effect Theory

I spent a bit of time over the week looking at p-curve analyses for the handful of articles that predicted an interaction of provocation level and brief exposure to weapons and weapon images (starting with Berkowitz & LePage, 1967), and the preliminary evidence is not promising. Berkowitz and LePage (1967) predicted an interaction effect. They appeared to find one. Some subsequent authors attempted to test the same effect. Some were fairly good at reporting the specific test statistics necessary to be included in a p-curve analysis. Others failed miserably. Of those who did make explicit reports, some found a significant interaction that was consistent with Berkowitz and LePage (1967). Others found effects that were null or in the opposite direction. Given the dearth of reports that faithfully attempted to replicate Berkowitz and LePage (1967), I am not sure how much to read into things. Overall, I am getting the impression that there was just not much to write home about. At some point, I will rerun the analysis, download what I can, and get screenshots of the rest. The bottom line is that the interaction, to the extent it was reported during the 1970s, was far from robust, and an argument can be made that those who began to concentrate solely on experiments in which only high frustration or provocation was measured had put the proverbial cart before the horse. I understand, up to a point. In the Berkowitz and LePage (1967) experiment, only the cell in which participants were highly provoked and exposed briefly to firearms showed a higher level of aggression. Why other researchers found suppression effects under the same conditions, found no effect, or simply neglected to report the necessary test statistics for that interaction effect is a story I cannot adequately comment on. What I can say is that the findings were messy and that more rigorous replications made sense.
I'll give Buss et al. (1972) points for effort, even if they were a bit oblique in reporting the actual F-tests. Keep in mind that p-curve analyses only consider those studies where a significant effect is found. So if those come up short? Reader beware.

Tuesday, October 5, 2021

Converting Standard Error to Standard Deviation

This is just one of my brief posts that I hope is obvious to most folks who would bother to read this blog. It's just a technique that is part of my meta-analyst's toolbox, can be calculated on the back of a napkin, and comes in handy when authors report means and standard errors instead of the usual standard deviations we would want in order to estimate Cohen's d. The formula is simple: SD = SE × √N, where N is the sample size of the group in question.

I've used it a bit more often than I would have imagined. It comes in handy. One common mistake meta-analysts can make is to erroneously use the SE term instead of SD in the denominator to compute Cohen's d, which would inflate the effect size for the hypothesis test in question. Needless to say, you want to avoid that.
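For readers who prefer code to napkins, here is a minimal sketch of the conversion (SD = SE × √N, with N being the group's sample size) and of what goes wrong when SE is left in the denominator. The function names and all of the numbers are hypothetical and chosen only for illustration, and the pooled-SD version of Cohen's d is my assumption about which flavor of d is wanted.

```python
import math


def se_to_sd(se: float, n: int) -> float:
    """Convert a group's standard error of the mean to its standard deviation."""
    return se * math.sqrt(n)


def cohens_d(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> float:
    """Cohen's d using the pooled standard deviation of two independent groups."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


# Hypothetical report: means of 5.0 and 4.0, SE = 0.5 in each group, n = 25 each.
sd = se_to_sd(0.5, 25)
correct_d = cohens_d(5.0, sd, 25, 4.0, sd, 25)
# The mistake described above: plugging SE in where SD belongs.
inflated_d = cohens_d(5.0, 0.5, 25, 4.0, 0.5, 25)
print(sd, correct_d, inflated_d)
```

With equal group sizes, the mistaken version is inflated by a factor of √N, which is why the error is so easy to spot once you know to look for it.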

Credit to the Cochrane Handbook for Systematic Reviews of Interventions.

Monday, October 4, 2021

A grim day for another weapons effect paper

Sometimes a specific lab becomes the gift that keeps on giving. If the work is good, we are the better for it. If the work is questionable, our science becomes less trustworthy not only to the public, but to those of us who serve as educators and fellow researchers. As is true in other facets of life, there are gifts we would really rather return. 

Which brings us to a certain researcher from Southwest University: Qian Zhang. In spite of several recent retractions, I have to give the man credit. He remains prolific. A paper was recently uploaded to a preprint server, on a variation of weapons effect research that is quite well known to me. The author was even kind enough to upload the data and the analyses, which is to be commended.

That said, there are clearly some concerns with this paper. I will only discuss a few in this post. My hope is that others who are far more facile at error detection and have enough fluency in Mandarin can pick up where I will likely leave off. The basic premise of the paper is to examine if playing with weapon toys will lead children to show more accessibility of aggressive cognition (or think more aggressively, if that is easier on the eyes and tongue) and show higher levels of aggression on the Competitive Reaction Time Task (CRTT). As an aside, I seem to have some difficulty with the acronym, CRTT, and often misspell it when I tweet about the task. But I digress. These sorts of experiments have been run primarily in North America (specifically, the US) and Europe, and not so much outside of those limited geographical regions. Research of this sort outside of the US and Europe could be potentially useful if done well. Usually, experiments of this sort are done to examine only behavioral outcomes (Mendoza's 1972 dissertation is arguably an exception, if we code the variation of a TAT as a cognitive measure), so the idea of also examining cognitive outcomes could be potentially beneficial. 

As I read through the paper, I noticed that there were 104 participants in total. The author contends that he used an a priori power analysis to determine sample size at 95% power, using G*Power 3.1. That caught my attention. I dusted off my meta-analysis (Benjamin et al., 2018) and looked at effect size estimates for various distributions that we were interested in examining at the time. One of those distributions specifically included studies in which toy weapons were used as primes. The naive effect size is not exactly overwhelming: d = 0.32. That is arguably a generous estimate once we include various techniques for measuring the impact of publication bias, and a good-faith argument can easily be made that the true effect size for this particular type of prime is close to negligible. But let's ignore that detail for a minute. Let's pretend we exist in a universe in which the naive effect size of d = 0.32 is correct. The authors argue that an N of 52 would suffice, but that their "sample size of (N=102 [sic])" was more than sufficient to meet 95% power. If you ever plan a study in G*Power, you have to choose your analysis, enter the info required, and you are given a sample size estimate. One complication with G*Power is that, for ANOVA-family analyses, it does not let us enter Cohen's d directly. It asks for Cohen's f instead. Converting between the two is quite easy when two equal-sized groups are being compared: d = 2f, and f = d/2. So, if I know that the effect size for my research question of interest is d = 0.32, I divide by 2 and can plug that into G*Power as Cohen's f, and then make sure I have my other info correct, including number of conditions, covariates, etc. When I do all that, based on the experiment as described, with a Cohen's f of 0.16, it becomes clear that the experiment would require a sample of at least 510 students. Now let's say that the author merely made a mistake and entered the value of Cohen's d in the field for Cohen's f by accident.
The sample would still have to be about 129 in order to meet the requirements of 95% power, and really, given the intention to randomly divide an equal number of males and females into treatment and control conditions, the author should shoot for 132 students. In order for the claim of 95% power to hold in this study, we'd have to assume a Cohen's d of approximately 1.00. There may be individual studies in the literature that would yield such a Cohen's d, but across the available sample of studies? Not so much. So, we have another low-powered experiment in the research literature. It's hardly the end of the world.
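The arithmetic above can be roughly sanity-checked without G*Power. What follows is a sketch only, not a reproduction of the G*Power runs: it uses the textbook normal approximation for a two-tailed comparison of two equal-sized groups, ignores covariates, and assumes alpha = .05 (an alpha level is not stated above). The function names are my own. Because G*Power works from the exact noncentral distribution, this approximation should land close to, and a little below, the figures quoted above.

```python
import math
from statistics import NormalDist


def d_to_f(d: float) -> float:
    """Cohen's d to Cohen's f for two equal-sized groups."""
    return d / 2


def f_to_d(f: float) -> float:
    """Cohen's f to Cohen's d for two equal-sized groups."""
    return 2 * f


def total_n_two_groups(d: float, alpha: float = 0.05, power: float = 0.95) -> int:
    """Approximate total N for a two-tailed, two-group comparison.

    Normal approximation: n per group = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n_per_group = math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)
    return 2 * n_per_group


print(d_to_f(0.32))                      # the f to enter for d = 0.32
print(total_n_two_groups(0.32))          # d = 0.32 entered correctly
print(total_n_two_groups(f_to_d(0.32)))  # 0.32 mistakenly entered as f
print(total_n_two_groups(1.00))          # what d = 1.00 would require
```

Under this approximation, a d of 1.00 is what it takes for a total N of 52 to reach 95% power, which lines up with the sample size the authors say would suffice.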

What grabbed my attention was the research protocol described in the experiment. For the time being, I will take the author at his word that this was an experiment in which random assignment was involved (this author was once flagged for failing to disclose that participants chose which treatment condition they were in, which was, shall we say, a wee bit embarrassing). The way the treatment and control conditions are described seems standard enough. What was odd was what happened after the play session ended. The children were first given a semantic classification task. I admit that I had to do a double take on this, as some of the wording is a bit off. I am increasingly thinking that what the authors did was use a series of aggressive and neutral pictures and have children respond to them as quickly as they could. The author had made some mention of aggressive and neutral pictures also being used as cues, which threw me, because that would have seemed more like an experiment within an experiment. At minimum, there would have been needless contamination. Then the children participated in a CRTT in which they set noise blasts at 70 to 100 dB. Those controls ranged from 0 (no noise) to 4 (100 dB). The authors reported their means and standard deviations. I initially checked the means for the treatment and control conditions using GRIM, which is a nifty online tool for flagging errors. The results, to say the least, initially looked grim. However, I was reminded that there is the issue of granularity, which I might have overlooked. Even though there is one scale, the trials each count as independent items. So, an N=26 for one cell is, with the 13 out of 25 trials that the author included in the data set (those in which participants had an opportunity to send noise blasts after a loss), effectively an N=338.
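To make the granularity point concrete, here is a minimal sketch of the logic behind a GRIM check. It is my own simplified version, not the online tool's code, and the means below are made up for illustration rather than taken from the paper. The idea is that a mean of integer scores must equal some whole-number total divided by the effective N (participants times items), so we test whether any such total rounds to the reported mean. I am assuming conventional half-up rounding.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int, items: int = 1) -> bool:
    """Return True if an integer total exists that rounds to the reported mean.

    reported_mean is passed as a string so trailing zeros ("2.50") are kept
    and the number of reported decimals can be read off directly.
    """
    mean = Decimal(reported_mean)
    # One unit in the last reported decimal place, e.g. 0.01 for "2.51".
    quantum = Decimal((0, (1,), mean.as_tuple().exponent))
    effective_n = n * items
    nearest_below = math.floor(mean * effective_n)
    # Only the integer totals on either side of mean * N can round to the mean.
    for total in (nearest_below, nearest_below + 1):
        implied = (Decimal(total) / effective_n).quantize(quantum, rounding=ROUND_HALF_UP)
        if implied == mean:
            return True
    return False


# Hypothetical mean of 2.51 in a cell of 26 children.
print(grim_consistent("2.51", 26))            # treated as a single item
print(grim_consistent("2.51", 26, items=13))  # 13 trials each, effective N = 338
```

The same reported mean can fail with a single item and pass once the 13 trials are counted, which is why an initial GRIM flag deserves a second look before anyone gets excited.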
So I went and opened up the SPRITE test link and entered the same mean info for each cell I was interested in, along with the minimum and maximum scale values (0 and 4, respectively). SPRITE reported that, for each of the two cells measuring boys, it failed to arrive at a solution, at minimum for the standard deviation. In each case, the standard deviation was reported by SPRITE to be too low. I can get reproductions of possible distributions for the other two cells. I then downloaded the data set to see what it looked like. Much of it is in Mandarin, but I can make some educated guesses about the data in each column. I turned my attention to the "ANCOVA" analysis. It actually looked like a MANOVA was run. Perhaps a MANCOVA (but as I am admittedly not literate in Mandarin, it's hard to really know without taking time I don't yet have to put some terms into Google Translate and sort that all out). That's a project for later in the month. I could see the overall mean for the aggressive behavioral outcome, as measured by the CRTT, and entered it, its standard deviation, and the overall N in SPRITE, and noticed that it could also generate some potential score distributions. Still, given the failure to generate potential distributions for two of the cells in SPRITE, it's not a good day to have posted a preprint. At minimum, there is some sort of error in reporting, whatever the cause.
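SPRITE itself searches for whole score distributions that match a reported mean and SD, which is more than I want to sketch here. A much simpler check gets at the same "SD too low" complaint: for integer scores on a bounded scale, a given mean and N put a floor and a ceiling on the possible sample SD. The code below is my own illustration of that bounds check, not SPRITE's algorithm, and the numbers are hypothetical. For trial-level CRTT data, n would be the effective N (participants times trials).

```python
from statistics import stdev


def sd_bounds(mean: float, n: int, lo: int, hi: int) -> tuple[float, float]:
    """Smallest and largest sample SD possible for n integer scores in [lo, hi]."""
    total = round(mean * n)  # the integer sum implied by the reported mean

    # Tightest case: every score sits on one of the two integers nearest the mean.
    base, extra = divmod(total, n)
    tight = [base + 1] * extra + [base] * (n - extra)

    # Widest case: pile scores on the scale endpoints, with at most one in between.
    n_hi, remainder = divmod(total - n * lo, hi - lo)
    wide = [hi] * n_hi
    if n_hi < n:
        wide += [lo + remainder] + [lo] * (n - n_hi - 1)

    return stdev(tight), stdev(wide)


# Hypothetical cell: mean 2.5 on the 0-4 scale, effective N of 338.
smallest, largest = sd_bounds(2.5, 338, 0, 4)
print(smallest, largest)
```

A reported SD below the smallest value cannot come from integer data with that mean and N, no matter how the scores are arranged.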

I do need to take some time to sit down, try to reproduce the analyses, etc. Of course, it goes without saying that successfully reproducing analyses from a data set that has been in some way fabricated is going to add no new information. That said, I am satisfied that the findings as presented for the CRTT analyses, which were intended to establish that weapon toys could (at least for the male subsample) influence aggressive behavioral outcomes, are potentially questionable. This is a paper that should not make it past peer review in its present form.

Note that I have not yet successfully run this through Statcheck, although in recent years Zhang's lab has become more savvy about avoiding obvious decision errors. I made an effort as of this writing to run the analyses as they appeared in the PDF, and the report came back with nothing to be analyzed. I will likely have to enter the analyses by hand in a Word document and then upload that at a later date.

Please also note that the author appears not to have counterbalanced the SCT and CRTT measures to control for order effects. That strikes me as odd. The very superficial discussion about debriefing left me with a few questions as well. 

Note: Updated to reflect some more refined analyses. Any initial mistakes with GRIM are my own. I am on solid ground with the SPRITE runs, and I think my concerns about the lack of statistical power, failure to counterbalance, etc. are well founded.