Friday, February 4, 2022

A long-overdue retraction

Several years after I first sounded the alarm bells, a paper that I had failed to get retracted (the editor of PAID opted for a superficially "better" Corrigendum in 2019 instead) is now officially retracted. Dr. Joe Hilgard really put the work in to make it happen. Here is his story:

The saga of this weapons priming article is over. There are plenty of articles remaining that have yet to be adequately scrutinized.

Sunday, November 7, 2021

The interaction of mere exposure to weapons and provocation: A preliminary p-curve analysis

The primary hypothesis of interest in Berkowitz and LePage (1967) was an interaction of exposure to weapons (rifles vs. badminton racquets/no stimuli) and provocation (high vs. low) on aggression (measured as the number of electric shocks participants believed they were giving to the person who had just given them feedback). The authors predicted that the interaction would be statistically significant, with participants who had been highly provoked and had short-term exposure to rifles showing by far the highest level of aggression. Their hypothesis was confirmed. However, subsequent efforts to replicate that interaction were rather inconsistent. This space is not the place to repeat that history, as I have discussed it elsewhere on this blog and in recent literature reviews published in the National Social Science Journal (Benjamin, 2019) and in an encyclopedia chapter (Benjamin, 2021). In the Benjamin et al. (2018) meta-analysis, we did look very broadly at whether or not the weapons effect appeared stronger in highly provoked conditions than in neutral/low-provoked conditions. There appeared to be a trend, although there were some problems with our approach. We examined all DVs (cognitive, affective, appraisal, and behavioral) rather than focusing only on behavioral outcomes. That probably inflated the naive effect size for the neutral/low provoking condition and deflated the naive effect size in the high provoking condition. That said, depending on the method of publication bias analysis used, correcting for publication bias gives some reason to doubt that the noticeably strong effect of weapons in highly provoking conditions was particularly robust.

Another way of testing the robustness of the weapons effect hypothesis as presented by Berkowitz and LePage (1967), which is central to the viability of Weapons Effect Theory (yes, as I have noted, that is a thing), is to use a method called p-curves. This is fairly straightforward to do. All I need is a collection of studies that have a test of an interaction between the mere presence or absence of weapons and some manipulation of provocation as independent variables, and a dependent measure of an aggressive behavioral outcome. There really isn't an overwhelming number of published articles, so the task of finding a collection of published research was fairly simple. I just had to look through my old collection of articles and look to make sure no one had added anything new that would satisfy my criteria. That turned out to not be a problem. 

So to be clear, my inclusion criteria for articles (or studies within articles) were:

1. manipulation of short-term exposure to weapon (usually weapons vs. neutral objects)

2. manipulation of provocation (high vs. low/none)

3. explicit test of interaction of weapon exposure and provocation level

4. behavioral measure of aggression

Of the studies I am aware of, that left me with 11 potential studies to examine. Unfortunately, five were excluded for failing to report the necessary 2-way interaction or any simple effects analyses: Buss et al. (1972, Exp. 5), Page and O'Neal (1977), Caprara et al. (1984), and Cahoon and Edmonds (1984, 1985). The remaining six were entered into a p-curve disclosure table: Berkowitz and LePage (1967), Fischer et al. (1969), Ellis et al. (1971), Page and Scheidt (1971, Exp. 1), Frodi (1975), and Leyens and Parke (1975). Once I completed the table, I entered the relevant test statistics, degrees of freedom, and p-values into the p-curve app 4.06. Three studies were excluded from the analysis because the 2-way interaction was non-significant. For the remaining three studies, here is the graph:

As you might guess, there isn't a lot of diagnostic information to go on. According to the summary that was printed:

P-Curve analysis combines the half and full p-curve to make inferences about evidential value. In particular, if the half p-curve test is right-skewed with p<.05 or both the half and full test are right-skewed with p<.1, then p-curve analysis indicates the presence of evidential value. This combination test, introduced in Simonsohn, Simmons and Nelson (2015) 'Better P-Curves' paper, is much more robust to ambitious p-hacking than the simple full p-curve test is.

Here neither condition is met; hence p-curve does not indicate evidential value.

Similarly, p-curve analysis indicates that evidential value is inadequate or absent if the 33% power test is p<.05 for the full p-curve or both the half p-curve and binomial 33% power test are p<.1. Here neither condition is met; so p-curve does not indicate evidential value is inadequate nor absent.
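
For readers who want a feel for what the right-skew half of that decision rule involves, here is a minimal Python sketch. It assumes the Stouffer-style approach that, as far as I understand it, recent versions of the app use: each significant p-value is rescaled into a "pp-value" (its probability under the null, given that it was significant), converted to a z-score, and the z-scores are combined. The function names are hypothetical, and the p-values in the example are invented for illustration; they are not the values from the disclosure table described above. The 33% power test is left out because it requires noncentral distributions.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def right_skew_p(p_values, threshold):
    """Stouffer-style right-skew test on the p-values below `threshold`."""
    kept = [p for p in p_values if 0 < p < threshold]
    if not kept:
        # Nothing to test. nan compares False against any cutoff,
        # so an empty test can never count as evidence.
        return math.nan
    # Under the null of no effect, significant p-values are uniform on
    # (0, threshold), so p / threshold is itself a p-value (a "pp-value").
    z_scores = [_STD_NORMAL.inv_cdf(p / threshold) for p in kept]
    combined_z = sum(z_scores) / math.sqrt(len(kept))
    return _STD_NORMAL.cdf(combined_z)


def evidential_value(p_values):
    """Apply the combination rule quoted in the app summary."""
    full = right_skew_p(p_values, 0.05)    # full p-curve: p < .05
    half = right_skew_p(p_values, 0.025)   # half p-curve: p < .025
    return half < 0.05 or (half < 0.1 and full < 0.1)


# Hypothetical p-values for three significant interaction tests.
example = [0.004, 0.012, 0.041]
print(right_skew_p(example, 0.05), right_skew_p(example, 0.025))
print(evidential_value(example))
```

With only three p-values, one of them just under .05, neither the full nor the half test comes close to its cutoff, which is the same kind of "not much to go on" situation the app reported.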

From the available evidence, there is not much to suggest that the Weapons Effect as initially proposed by Berkowitz and LePage (1967) was replicable, or that the evidence was adequate to justify further exploration of the effect as first proposed. There might have been some alternatives worth further exploration (including Ellis et al. (1971), who wanted to explore associations we might make via operant conditioning). That work seemed to fizzle out. We know what happened next. Around the mid-1970s, researchers interested in this area of inquiry concentrated their efforts on exploring the short-term effect of weapons only under highly provoking conditions (with a few exceptions in field research) and took the original finding as gospel. By 1990, the Carlson et al. (1990) meta-analysis seemed to make it official, and research shifted to cognitive priming effects. I will try a p-curve analysis on that line of weapons effect research next, as the cognitive route appears to be the primary mechanism for a behavioral effect to occur.

Saturday, November 6, 2021

An interesting take on the 2021 Virginia Gubernatorial election

The Virginia Gubernatorial election was a predictably close one, and one that led to a GOP candidate winning this particular off-year election. I suppose there are any of a number of takes to be had. One factor that had my attention was the GOP candidate's (Youngkin) focus on Critical Race Theory (CRT) which is probably part of the curriculum in the context of advanced coursework in Legal Studies, but not a factor in the K-12 system. Youngkin made it a point to advocate for parents having "more of a say" in their kids' education, in the context of the moral panic over CRT that has developed over the past year. Did Youngkin's strategy work? The answer turns out to be complicated. Those who enjoy poring over the cross-tabs in public opinion polls found that it succeeded, but not in the way that it has been spun in the media:

The network exit poll, released on Nov. 2, showed the same pattern. Youngkin got 62 percent of the white vote and 13 percent of the Black vote, a gap of 49 points. But among voters who said parents should have a lot of “say in what schools teach”—about half the electorate—he got 90 percent of the white vote and only 19 percent of the Black vote, a gap of 71 points. The idea that parents should have more say in the curriculum—Youngkin’s central message—had become racially loaded. And the loading was specific to race: Other demographic gaps for which data were reported in the exit poll—between men and women, and between white college graduates and whites who hadn’t graduated from college—get smaller, not bigger, when you narrow your focus from the entire sample to the subset of voters who said parents should have a lot of say in what schools taught. Only the racial gap increases.

The exit poll didn’t ask voters about CRT, but it did ask about confederate monuments on government property. Sixty percent of white voters said the monuments should be left in place, not removed, and 87 percent of those voters went to Youngkin. That was 25 points higher than his overall share of white voters. The election had become demonstrably polarized, not just by race but by attitudes toward the history of racism. All the evidence indicates that Youngkin’s attacks on CRT played a role in this polarization. 

So, in a way, the strategy of homing in on this latest moral panic did work in gaining the favor of white voters, but that's it. As a newer Southern Strategy tactic, focusing on CRT, demonizing it, and tying it (inaccurately) to the public school systems is considerably more sophisticated than earlier efforts. The end result appears to be to further sow divisions among white voters, and between subsets of white voters and the rest of the voting population, in order to maintain hegemony. As a strategy, it may just work to an extent. School boards and more localized policymakers are ill-prepared for what awaits them in the upcoming months and years.

Friday, October 22, 2021

Back to Weapons Effect Theory

I spent a bit of time over the week looking at p-curve analyses for the handful of articles that predicted an interaction of provocation level and brief exposure to weapons or weapon images (starting with Berkowitz & LePage, 1967), and the preliminary evidence is not promising. Berkowitz and LePage (1967) predicted an interaction effect. They appeared to find one. Some subsequent authors attempted to test the same effect. Some were fairly good at reporting the specific test statistics necessary to be included in a p-curve analysis. Others failed miserably. Of those who did make explicit reports, some found a significant interaction that was consistent with Berkowitz and LePage (1967). Others found effects that were null or in the opposite direction. Given the dearth of reports that faithfully attempted to replicate Berkowitz and LePage (1967), I am not sure how much to read into things. Overall, I am getting the impression that there was just not much to write home about. At some point, I will re-run the analysis, download what I can, and get screen shots of the rest. The bottom line is that the interaction, to the extent it was reported during the 1970s, was far from robust, and an argument can be made that those who began to concentrate only on experimental research in which only high frustration or provocation was examined had put the proverbial cart before the horse. I understand, up to a point. In the Berkowitz and LePage (1967) experiment, only the cell in which participants were highly provoked and exposed briefly to firearms showed a higher level of aggression. Why other researchers found suppression effects under the same conditions, found no effect, or simply neglected to report the necessary test statistics for that interaction effect is a story that I cannot adequately offer commentary about. What I can say is that the findings were messy and that more rigorous replications made sense. 
I'll give Buss et al. (1972) points for effort, even if they were a bit oblique in reporting the actual F-tests. Keep in mind that p-curve analyses only consider those studies where a significant effect is found. So if those come up short? Reader beware.

Tuesday, October 5, 2021

Converting Standard Error to Standard Deviation

This is just one of my brief posts that I hope is obvious to most folks who would bother to read this blog. It's just a technique that is part of my meta-analyst's toolbox, can be calculated on the back of a napkin, and comes in handy when authors report means and standard errors instead of the usual standard deviations we would want in order to estimate Cohen's d. The formula is simple: SD = SE × √N, where N is the sample size of the group in question.

I've used it a bit more often than I would have imagined. It comes in handy. One common mistake meta-analysts can make is to erroneously use the SE term instead of SD in the denominator to compute Cohen's d, which would inflate the effect size for the hypothesis test in question. Needless to say, you want to avoid that.
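
For anyone who prefers code to napkins, here is a minimal Python sketch of the conversion, SD = SE × √N, followed by a pooled-SD Cohen's d. The function names and the numbers (means, standard errors, and group sizes) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def sd_from_se(se, n):
    """Convert a standard error of the mean back to a standard deviation."""
    return se * math.sqrt(n)


def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)


# Hypothetical report: means of 5.0 and 4.0, SE = 0.4 in each group, n = 25 per group.
sd = sd_from_se(0.4, 25)                           # 0.4 * sqrt(25) = 2.0
correct_d = cohens_d(5.0, sd, 25, 4.0, sd, 25)     # uses the SD
mistaken_d = cohens_d(5.0, 0.4, 25, 4.0, 0.4, 25)  # uses the SE by mistake
print(sd, correct_d, mistaken_d)
```

In this made-up case the correct d is 0.5, while plugging the SE into the denominator yields 2.5, a fivefold inflation (the square root of the group size).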

Credit to the Cochrane Handbook for Systematic Reviews of Interventions.

Monday, October 4, 2021

A grim day for another weapons effect paper

Sometimes a specific lab becomes the gift that keeps on giving. If the work is good, we are the better for it. If the work is questionable, our science becomes less trustworthy not only to the public, but to those of us who serve as educators and fellow researchers. As is true in other facets of life, there are gifts we would really rather return. 

Which brings us to a certain researcher from Southwest University: Qian Zhang. In spite of several recent retractions, I have to give the man credit. He remains prolific. A paper was recently uploaded to a preprint server, on a variation of weapons effect research that is quite well known to me. The author was even kind enough to upload the data and the analyses, which is to be commended. 

That said, there are clearly some concerns with this paper. I will only discuss a few in this post. My hope is that others who are far more facile at error detection and have enough fluency in Mandarin can pick up where I will likely leave off. The basic premise of the paper is to examine if playing with weapon toys will lead children to show more accessibility of aggressive cognition (or think more aggressively, if that is easier on the eyes and tongue) and show higher levels of aggression on the Competitive Reaction Time Task (CRTT). As an aside, I seem to have some difficulty with the acronym, CRTT, and often misspell it when I tweet about the task. But I digress. These sorts of experiments have been run primarily in North America (specifically, the US) and Europe, and not so much outside of those limited geographical regions. Research of this sort outside of the US and Europe could be potentially useful if done well. Usually, experiments of this sort are done to examine only behavioral outcomes (Mendoza's 1972 dissertation is arguably an exception, if we code the variation of a TAT as a cognitive measure), so the idea of also examining cognitive outcomes could be potentially beneficial. 

As I read through the paper, I noticed that there were 104 participants in total. The author contends that he used an a priori power test to determine sample size at 95% power, using G*Power 3.1. That caught my attention. I dusted off my meta-analysis (Benjamin et al., 2018) and looked at effect size estimates for various distributions that we were interested in examining at the time. One of those distributions specifically included studies in which toy weapons were used as primes. The naive effect size is not exactly overwhelming: d = 0.32. That is arguably a generous estimate, once we include various techniques for measuring the impact of publication bias, and a good-faith argument can easily be made that the true effect size for this particular type of prime is close to negligible. But let's ignore that detail for a minute. Let's pretend we exist in a universe in which the naive effect size of d = 0.32 is correct. The author argues that an N of 52 would suffice, but that the "sample size of (N=102 [sic])" was more than sufficient to meet 95% power. If you ever run any study in G*Power, you have to choose your analysis, enter the info required, and you are given a sample size estimate. One complication with G*Power is that, for ANOVA-type analyses, it does not let us enter Cohen's d directly. It asks for Cohen's f instead. Computing Cohen's d from Cohen's f is quite easy for a two-group comparison: d = 2f, and f = d/2. So, if I know that the effect size for my research question of interest is d = 0.32, I divide by 2 and can plug that into G*Power for Cohen's f, and then make sure I have my other info correct, including number of conditions, covariates, etc. When I do all that, based on the experiment as described, with a Cohen's f of 0.16, it becomes clear that the experiment would require a sample of at least 510 students. Now let's say that the author merely made a mistake and entered the value of Cohen's d (0.32) where Cohen's f belonged. 
The sample would still have to be about 129 in order to meet the requirements of 95% power, and really, given the intention to randomly divide an equal number of males and females into treatment and control conditions, the author should shoot for 132 students. In order for the argument of 95% power to be met in this study, we'd have to assume a Cohen's d of approximately 1.00. There may be individual studies in the literature that would yield such a Cohen's d, but of the available sample of studies? Not so much. So, we have another low-powered experiment in the research literature. It's hardly the end of the world. 
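
The sample size figures above can be roughly checked without G*Power. The sketch below is only an approximation: it assumes a simple two-group, two-tailed comparison at alpha = .05 and uses the textbook normal approximation, n per group = 2(z_alpha + z_power)^2 / d^2, rather than the noncentral distributions G*Power works with, so it lands slightly below G*Power's numbers. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.95):
    """Approximate per-group n for a two-tailed, two-group comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-tailed test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# d = 0.32 is the naive meta-analytic estimate; d = 0.64 is what results if
# 0.32 is entered as Cohen's f by mistake (d = 2f); d = 1.00 is the effect
# size needed for a sample of about 52 to reach 95% power.
for d in (0.32, 0.64, 1.00):
    n = n_per_group(d)
    print(f"d = {d:.2f} (f = {d / 2:.2f}): {n} per group, {2 * n} total")
```

Under those assumptions the totals come out at 508, 128, and 52, which line up closely with the 510, roughly 129, and 52 discussed above.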

What grabbed my attention were the research protocols described in the experiment. For the time being, I will take the author at his word that this was an experiment in which random assignment was involved (this author was once flagged for failing to disclose that participants chose which treatment condition they were involved in, which was, shall we say, a wee bit embarrassing). The way the treatment and control conditions are described seems standard enough. What was odd was what happened after the play session ended. The children were first given a semantic classification task. I admit that I've had to do a double take on this, as some of the wording is a bit off. I am increasingly thinking that what the author did was use a series of aggressive and neutral pictures and have children respond to them as quickly as they could. The author had made some mention of aggressive and neutral pictures also being used as cues, which threw me, because that would have seemed more like an experiment within an experiment. At minimum, there would have been needless contamination. Then the children participated in a CRTT where they set noise blasts at 70 to 100 dB. Those controls were set from 0 (no noise) to 4 (100 dB). The author reported the means and standard deviations. I initially looked at the means for the treatment and control conditions using GRIM, which is a nifty online tool for flagging errors. The results, to say the least, initially looked grim. However, I was reminded that there is the issue of granularity that I might have overlooked. Even though there is one scale, the trials each count as independent items. So, an N=26 for one cell is, with the 13 out of 25 trials that the author included in the data set (in which participants had an opportunity to send noise blasts after a loss), effectively an N=338. 
So I went and opened up the SPRITE test link and entered the minimum and maximum scale values (0 and 4, respectively), along with the target mean and standard deviation for each cell I was interested in. For each of the two cells measuring boys, SPRITE failed to arrive at a solution: in each case, SPRITE reported that the standard deviation was too low. I can get reproductions of possible distributions for the other two cells. I then downloaded the data set to see what it looked like. Much of it is in Mandarin, but I can make some educated guesses about the data in each column. I turned my attention to the "ANCOVA" analysis. It actually looked like a MANOVA was run. Perhaps a MANCOVA (but as I am admittedly not literate in Mandarin, it's hard to really know without taking time I don't have yet to put some terms into Google Translate and sort that all out). That's a project for later in the month. I could see the overall mean for the aggressive behavioral outcome, as measured by the CRTT, and entered it, its standard deviation, and the overall N in SPRITE, and noticed that it also could generate some potential score distributions. Still, given the failure to generate some potential distributions in SPRITE, it's not a good day to have posted a preprint. At minimum, there is some sort of error in reporting, whatever the cause.
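
To make the granularity point concrete, here is a minimal Python sketch of a GRIM-style check, plus a much cruder stand-in for one thing SPRITE does: working out the smallest and largest standard deviation that integer responses on a bounded scale could possibly have. This is not the code behind either online tool, the function names are hypothetical, and the mean of 2.47 is invented for illustration rather than taken from the preprint. The SD bounds assume the integer total is known exactly; with a rounded mean there may be more than one candidate total to try.

```python
import math
from statistics import stdev


def grim_consistent(mean, n, items=1, decimals=2):
    """True if some integer total could round to the reported mean."""
    cells = n * items  # each trial/item counts as its own integer response
    half_unit = 0.5 * 10 ** -decimals
    approx_total = mean * cells
    # Only the two integers nearest mean * cells can possibly match.
    for total in (math.floor(approx_total), math.ceil(approx_total)):
        if abs(total / cells - mean) <= half_unit + 1e-9:
            return True
    return False


def sd_bounds(total, n, low, high):
    """Smallest and largest sample SD for n integers in [low, high] with this sum."""
    if n < 2 or not n * low <= total <= n * high:
        raise ValueError("total is not reachable on this scale")
    # Smallest SD: every response sits on one of the two integers around the mean.
    q, r = divmod(total, n)
    tightest = [q + 1] * r + [q] * (n - r)
    # Largest SD: pile responses on the scale endpoints, with at most one in between.
    k, rem = divmod(total - n * low, high - low)
    widest = [high] * k
    if k < n:
        widest += [low + rem] + [low] * (n - k - 1)
    return stdev(tightest), stdev(widest)


# Hypothetical reported mean of 2.47 on the 0-4 scale.
print(grim_consistent(2.47, n=26))            # one response per child
print(grim_consistent(2.47, n=26, items=13))  # 13 trials per child, 338 responses
print(sd_bounds(total=5, n=4, low=0, high=4))
```

With one response per child, a mean of 2.47 is impossible for 26 children, but once the 13 trials are counted the granularity is fine enough that any two-decimal mean passes, which is why GRIM stops being informative here. A reported standard deviation that falls below the lower bound from sd_bounds is the kind of "too low" result SPRITE flags.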

I do need to take some time to sit down, try to reproduce the analyses, etc. Of course, it goes without saying that successfully reproducing a data set that has been in some way fabricated is going to add no new information. That said, I am satisfied that the findings as presented for the CRTT analyses, which were intended to establish that weapon toys could (at least most specifically for the male subsample) influence aggressive behavioral outcomes, are questionable. This is a paper that should not make it past peer review in its present form. 

Note that I have not yet run this through Statcheck, although in recent years Zhang's lab has become more savvy about avoiding obvious decision errors. I made an effort as of this writing to run the analyses as they appeared in the pdf, and the report came back with nothing to be analyzed. I will likely have to enter the analyses by hand in a Word document and then re-upload at a later date. 

Please also note that the author appears not to have counterbalanced the SCT and CRTT measures to control for order effects. That strikes me as odd. The very superficial discussion about debriefing left me with a few questions as well. 

Note: Updated to reflect some more refined analyses. Any initial mistakes with GRIM are my own. I am on solid ground with the SPRITE runs, and I think the same is true of my concerns about the lack of statistical power, failure to counterbalance, etc.

Tuesday, September 28, 2021

One of my pet peeves when it comes to cognitive priming tasks in aggression

I'm probably going to come across as the late Andy Rooney for a moment. You know what I hate? Some of the apparent flexibility in how cognitive priming tasks get measured in my specialty area: aggression. I spent some time with these sorts of tasks during my days as a doctoral student in Mizzou's aggression lab in the late 1990s. The idea is fairly simple. We prime participants (typically traditional first-year college students) with stimuli that are either thought to be aggression-inducing (e.g., violent content in video games, images of weapons, violent lyrical content, etc.) or neutral (e.g., non-violent content in video games, images of non-weapons such as flowers, nonviolent lyrical content, etc.), and then get reaction times to aggressive and non-aggressive words. I'm probably most familiar with the pronunciation task, in which participants see, for example, an image for a few seconds (weapon or neutral object), followed by a target word that participants read aloud into a microphone, and the latency is recorded in milliseconds. The lexical decision task is similar, except in addition to reacting to aggressive or neutral words, the participants also must decide whether what they are seeing is a word or a non-word. At the end of the day, we get reaction time data, and look for latency measured in milliseconds.

For a prime to work, we expect that the relative latency for aggressive words will be lower than for neutral words in the treatment condition when compared to the relative latency for aggressive versus neutral words in the control condition. That's the pattern we found in both experiments in Anderson et al. (1998) and Lindsay and Anderson (2000), for example. The difference in latency between aggressive words and non-aggressive words was significantly larger, and in the predicted direction, in the weapon condition than was the case between aggressive words and non-aggressive words in the neutral prime condition. We could conclude that weapons appeared to prime the relative accessibility of aggressive cognition, or we could say aggression-related schemata or whatever nomenclature you might prefer. 

The way I was trained, and the literature I tended to read, worked largely the way I just described, regardless of the stimuli used for primes and regardless of the target words or concepts the experimenters were attempting to prime. In our case, comparing the relative difference in reaction time latencies between responses to aggressive and non-aggressive words gave us a basis for comparison across treatment conditions, and took into account some of the noise we would likely get in the data, such as individual differences in reaction time speed. 
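
A small Python sketch may help show why the difference score matters. Everything in it is hypothetical: the reaction times are invented, the four "participants" are deliberately constructed so that the weapon group is simply faster overall, and the function names are mine rather than anything from the published studies.

```python
from statistics import mean

# Hypothetical reaction times in milliseconds, invented purely for illustration.
participants = [
    {"condition": "weapon", "aggressive": [480, 500], "neutral": [490, 510]},
    {"condition": "weapon", "aggressive": [520, 540], "neutral": [530, 550]},
    {"condition": "control", "aggressive": [580, 600], "neutral": [590, 610]},
    {"condition": "control", "aggressive": [620, 640], "neutral": [630, 650]},
]


def aggressive_latency(person):
    """Raw mean latency for aggressive words only."""
    return mean(person["aggressive"])


def accessibility_score(person):
    """Neutral minus aggressive latency: positive means faster on aggressive words."""
    return mean(person["neutral"]) - mean(person["aggressive"])


def condition_mean(condition, score):
    """Average a per-participant score within one condition."""
    return mean([score(p) for p in participants if p["condition"] == condition])


# Comparing raw aggressive-word latencies mixes priming with overall speed.
raw_gap = condition_mean("control", aggressive_latency) - condition_mean("weapon", aggressive_latency)

# Comparing difference scores removes each participant's baseline speed.
relative_gap = condition_mean("weapon", accessibility_score) - condition_mean("control", accessibility_score)

print(raw_gap, relative_gap)
```

In this contrived data set the weapon group responds 100 ms faster to aggressive words, yet every participant is exactly 10 ms faster on aggressive than on neutral words, so the relative accessibility of aggressive words does not differ between conditions at all.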

Lately I have seen in my corner of the research universe papers published in which the authors only publish reaction time latencies for aggressive words, even though they admit in their published reports that they did have reaction time data for non-aggressive words. They appear to be getting statistically significant findings, but I find myself asking a question: so you find that participants respond faster to aggressive words in the treatment condition than in the control condition. That's nice, but what do those reaction time findings for aggressive words alone really tell us about the priming of the relative accessibility of aggressive cognition? In more lay terms, you say you found participants respond faster to aggressive words, but compared to what? I have also seen the occasional paper slip through in which the authors attempt to have it both ways. They'll use raw aggressive word reaction times as their basis for establishing that there is a priming effect, but their other hypothesis tests actually do use what I see as a proper difference score between aggressive and non-aggressive words. Oddly enough, in one presumably soon-to-be retracted number, when the authors use the approach I was taught, the effect size for the treatment condition becomes negligible, and the authors have to rely on subsample analyses in order to make some statement about the treatment condition actually priming the relative accessibility of aggressive cognition. Now, when I see subsequent research where only the reaction times for aggressive words are reported, I wonder if what I am reading is to be trusted, or if something is being hidden from those of us relying on the accuracy of those reports. 

That is the sort of thing that can keep me awake at night.