Monday, February 18, 2019

A note of caution in justifying the use of the Aggressive Word Completion Task

The following blog post was co-authored by James Benjamin at University of Arkansas-Fort Smith and Randy McCarthy at Northern Illinois University.

Following the larger trends in social psychology, the past few decades in aggression research have been heavily influenced by theories examining the cognitive structures and cognitive processes that contribute to aggressive behavior. Accordingly, researchers have developed several tasks to measure “aggressive cognitions.” Good measures of aggressive cognitions (i.e., those that are reliable and valid) allow researchers to test these social-cognitive theories; bad measures (i.e., unreliable or those lacking validity evidence) do not allow researchers to test these social-cognitive theories.

Recently, we have been looking at the published literature of one commonly-used task for measuring aggressive cognitions: The Aggressive Word Completion Task (AWCT).

The most commonly-used stimuli for this task are freely available here (and have been freely available much longer than OSF has existed), which we think is swell. The posted instructions for using these stimuli recommend that researchers cite Anderson et al. (2003), Anderson et al. (2004), and/or Carnagey and Anderson (2005). After closely examining these three articles, we recommend that researchers NOT cite these articles as evidence for the validity of the AWCT.

What is the AWCT?

When completing the AWCT, participants are presented with several word fragments (i.e., words with one or more missing letters) and are instructed to fill in the missing letters to create a word. Critically, some of the word fragments can be completed either with a word that is semantically associated with the concept of aggression or with a word that is not. For example, the word fragment “KI_ _” can be completed with aggressive words, such as “KILL” or “KICK”, or it can be completed with non-aggressive words, such as “KITE” or “KISS.” Computationally, the more of these ambiguous word fragments a participant completes with aggressive words, the more aggressive cognitions that participant is considered to have.
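To make the scoring idea concrete, here is a minimal sketch in Python. It is our illustration, not the official scoring procedure: the function name, the key format, and the "H_T" fragment are all hypothetical, and only the "KI__" entry comes from the example above. Reporting a proportion alongside the raw count is also an assumption on our part.

```python
# Hypothetical scoring key: each ambiguous fragment maps to the completions
# counted as aggressive. Only the "KI__" entry comes from the example in the
# post; the "H_T" entry is invented for illustration.
AGGRESSIVE_KEY = {
    "KI__": {"KILL", "KICK"},
    "H_T": {"HIT"},
}


def score_awct(responses):
    """Return (aggressive_count, proportion) for one participant.

    `responses` maps each fragment to the word the participant wrote.
    Empty responses and fragments missing from the key are ignored.
    """
    scored = 0
    aggressive = 0
    for fragment, word in responses.items():
        if fragment not in AGGRESSIVE_KEY or not word:
            continue
        scored += 1
        if word.strip().upper() in AGGRESSIVE_KEY[fragment]:
            aggressive += 1
    proportion = aggressive / scored if scored else 0.0
    return aggressive, proportion


# One made-up participant: a non-aggressive and an aggressive completion.
participant = {"KI__": "kiss", "H_T": "hit"}
count, proportion = score_awct(participant)
print(count, proportion)
```

Even in a sketch this small, somebody has to decide which completions count as aggressive and what to do with blanks, and those decisions are part of what validation work would need to pin down.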

Thus, the AWCT has several desirable features. First, the AWCT does not require expensive equipment or software. Second, the AWCT can be administered in pencil-paper or online formats. Third, the AWCT is typically described as a task that is widely applicable to a range of research questions. These advantages make the AWCT an easy and flexible task to use.

What is the validity evidence in Anderson et al. (2003), Anderson et al. (2004), and Carnagey and Anderson (2005)?

The earliest published study that used the AWCT is Anderson et al. (2003). Anderson et al. (2003) cites Anderson et al. (2002) as a justification for using the AWCT, which is a manuscript described as “submitted for publication.” Anderson et al. (2002) is eventually published and becomes Anderson et al. (2004). Anderson et al. (2004; which was previously cited as Anderson et al., 2002) then cites Anderson et al. (2003) as justification for using the AWCT.

Right off the bat you can see that Anderson et al. (2003) and Anderson et al. (2004) are essentially two studies using one another in a sort of Penrose staircase of justification for using the AWCT.

By 2005, Carnagey and Anderson (2005) describes the AWCT as a “valid measure of aggressive cognitions” and jointly cites Anderson et al. (2003) and Anderson et al. (2004) as justification.

To our eye, there is no independent validation work that is publicly available (which is not to say that no validation work was done, just that it is not available). These early studies both start using the AWCT and point to one another as justification for doing so.

The lack of independent validation work is not great news, but it also isn’t inherently damning. The implication is the data from these studies provide some evidence for the validity of the AWCT because these studies purportedly demonstrate the AWCT changes predictably to theoretically-relevant conditions. Indeed, in Anderson et al. (2003), Experiment 4, participants’ trait hostility was associated with their AWCT performance (F [1, 134] = 4.21, p = .042) and participants who were exposed to a song with violent lyrics had higher AWCT performance than two other comparison conditions (F [2, 134] = 3.26, p = .073; note that the authors reported this p-value as being less than .05, which means that either the reported F ratio is incorrect or the reported p-value is incorrect). In Anderson et al. (2003), Experiment 5, participants who were exposed to a song with violent lyrics had higher AWCT performance than those who were exposed to songs without violent lyrics (F [1, 141] = 6.16, p = .014). In Anderson et al. (2004), participants who played a violent video game used more aggressive word completions than participants who played a non-violent video game (F [1, 120] = 4.26, p = .041). And in Carnagey and Anderson (2005), there were main effects for type of video game on aggressive word completions (F [2, 59] = 5.33, p = .007) and baseline systolic blood pressure (F [1, 59] = 5.79, p = .005). However, trait aggression (F [1, 57] = 1.05, p = .31), exposure to video game violence (F [1, 57] = 0.02, p = .89), and video-game ratings (F [1, 59] < 3.10, ps > .08) were not predictors of aggressive word completions.

Notably, the relevant p-values in the data from Anderson et al. (2003) and Anderson et al. (2004) are all high-yet-significant. Specifically, each p-value is less than .05 and greater than .01. This would be extraordinarily rare if these were all tests of a non-null hypothesis (Simonsohn, Nelson, & Simmons, 2014). In Carnagey and Anderson (2005) there is a mix of significant and non-significant results. Collectively, these results do not strike us as unambiguously supportive of the validity of the AWCT.
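For readers who want to see why a run of high-yet-significant p-values is suspicious, here is a rough sketch of the logic in Python. It uses a two-sided z-test as a stand-in for the F-tests in the articles, and the effect size (a noncentrality of 2.8, roughly 80% power) and the count of four results are our assumptions for illustration only. It is not the p-curve procedure from Simonsohn et al. (2014), just the intuition behind it.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(delta, alpha):
    """Probability that a two-sided z-test rejects at `alpha` when the
    test statistic is distributed Normal(delta, 1)."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(crit - delta)) + STD_NORMAL.cdf(-crit - delta)


def share_high_significant(delta):
    """Among results with p < .05, the share that have .01 < p < .05."""
    p05 = power_two_sided(delta, 0.05)
    p01 = power_two_sided(delta, 0.01)
    return (p05 - p01) / p05


n_results = 4  # illustrative count, not taken from the articles
for label, delta in [("null effect", 0.0), ("about 80% power", 2.8)]:
    share = share_high_significant(delta)
    print(f"{label}: share = {share:.2f}, "
          f"all {n_results} high = {share ** n_results:.3f}")
```

Under these assumptions, a true effect studied with decent power should mostly produce p-values below .01, so several significant results that all land between .01 and .05 is an unlikely pattern. The exact figures depend heavily on the assumed power.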

However, implicit in these three studies is another form of self-justification for the AWCT. The AWCT is used to test the hypothesis that exposure to violent media (e.g., songs with violent lyrics, violent video games) increases aggressive cognitions. These studies argue that the AWCT is valid because AWCT scores increased after exposure to violent media. Thus, the AWCT is claimed to be a valid measure of aggressive cognitions because it supported these hypotheses AND these hypotheses were claimed to be supported because the AWCT is a valid measure of aggressive cognitions. The claim that violent media increases aggressive cognitions and the claim that the AWCT is a valid measure of aggressive cognitions do not stand independently. Rather, they each rest on the assumption that the other is true. In short, this is a form of circular reasoning.

To throw another variable into the mix, let’s look at how the AWCT is administered in these three articles. Anderson et al. (2003) does not impose a time limit for administering the AWCT. Anderson et al. (2004) imposes a 3-minute time limit. Carnagey and Anderson (2005) imposes a 5-minute time limit. Thus, in these first 3 published studies using the AWCT, there are 3 slightly different procedures used to administer the task. The use of different procedures when administering the AWCT means that the evidence from these studies is not even directly cumulative.

So what?

We are not pointing to this pattern of citations to say “gotcha!” Rather, we are both aggression researchers who really want to ensure that our field is producing the best work possible. And, like a chef who ensures her knives are sharp or a mason who ensures his trowel is not bowed, we believe our best work is possible when we have confidence that our tools work well for the task at hand.

We believe that citing Anderson et al. (2003), Anderson et al. (2004), and Carnagey and Anderson (2005) is merely pointing out that this task has been used in previous publications. However, we believe these articles do not clearly demonstrate the AWCT is a valid measure of aggressive cognitions. In other words, these studies, in and of themselves, do not give us confidence in the AWCT as a useful tool for measuring aggressive cognitions.

Where do we go from here? As suggested by others (e.g., Zendle et al., 2018; see also Koopman et al., 2013), we strongly advocate for rigorous validation work on the AWCT. It may turn out that the AWCT is a valid measure of aggressive cognitions after all. If so, that would be great news. However, it may not. Either way, we need to know.


Anderson, C. A., Carnagey, N. L., & Eubanks, J. (2003). Exposure to violent media: The effects of songs with violent lyrics on aggressive thoughts and feelings. Journal of Personality and Social Psychology, 84, 960-971. DOI: 10.1037/0022-3514.84.5.960

Anderson, C. A., Carnagey, N. L., Flanagan, M., Benjamin, A. J., Jr., Eubanks, J., & Valentine, J. C. (2004). Violent video games: Specific effects of violent content on aggressive thoughts and behavior. Advances in Experimental Social Psychology, 36, 199-249. DOI: 10.1016/S0065-2601(04)36004-1

Carnagey, N. L., & Anderson, C. A. (2005). The effects of reward and punishment in violent video games on aggressive affect, cognition, and behavior. Psychological Science, 16, 882-889. DOI: 10.1111/j.1467-9280.2005.01632.x

Koopman, J., Howe, M., Johnson, R. E., Tan, J. A., & Chang, C. (2013). A framework for developing word fragment completion tasks. Human Resource Management Review, 23, 242-253. DOI: 10.1016/j.hrmr.2012.12.005

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). p-Curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666-681. DOI: 10.1177/1745691614553988

Zendle, D., Kudenko, D., & Cairns, P. (2018). Behavioural realism and the activation of aggressive concepts in violent video games. Entertainment Computing, 24, 21-29. DOI: 10.1016/j.entcom.2017.10.003

Sunday, February 17, 2019

How to be a good consumer of research

I have a few tips on how to read coverage of original research. Please note that this post is aimed primarily at lay individuals who are simply seeking answers to questions that are of interest to them. That said, even those of us who are experts seek answers to questions in fields where we have minimal knowledge.

First, ask yourself a question when reading coverage about a new study: Is this study a new finding or not? If it is new, I advise skepticism. Many novel findings do not replicate. Beware especially of findings that appear to be too good to be true, or too outrageous to be true. Maybe drinking coffee extends one's life. If so, based on the mass quantities of coffee I consume, I will probably live forever. But I wouldn't bet on it.

Next: If you answered no, is this a replication attempt? If it is a replication attempt, then you really have something worth reading. Follow-up question: Did the replication attempt succeed or not? Are there multiple replication attempts and is there a consistent pattern? Answers to those questions will give you an idea of whether or not a phenomenon is real. If I find that there are indeed multiple studies that successfully link coffee consumption to an increased lifespan, I can rest easily knowing that my pot a day of the thickest, sludgiest coffee imaginable (people blocks away are awakened to the smell of my coffee as it brews!) will allow me to live longer than I might have otherwise.

Finally, if the coverage is not of a replication study but of a meta-analysis, you have some interesting decisions to make. Keep in mind that a meta-analysis is not the last word on the matter. Meta-analyses are only as good as the database and the techniques used to assess the impact of publication bias. Look for any coverage of how the meta-analysts dealt with publication bias. If they simply relied on trim-and-fill analyses, treat any conclusions with caution. If the researchers appear to use methods with names such as PET-PEESE or p-curves, or some battery of methods to detect publication bias, and the effect size still appears to be reliable, then you have a set of findings worth trusting - at least for now. I am not aware of a meta-analysis on coffee consumption and lifespan (that does not mean one does not exist!) but if it used the most rigorous analyses and the effect appeared to be as strong as the coffee I brew, then I could argue reasonably that our youth need to consider not only the world they are leaving behind for Keith Richards (who seems to be alive in spite of himself), but also for me. If on the other hand, the effect size, corrected for publication bias, is not reliable, then I may want to rethink my coffee habit (if my goal is a long life).

One last point: A lot of media coverage of research is based on press releases. Sometimes science journalists have the luxury of going beyond the press releases, but not always. Press releases often gloss over the finer points of a study, including the many caveats that authors would want to make. I avoid press releases of my work precisely to avoid the possibility of my work being spun in some way that I did not intend. Beyond that, don't just gravitate toward findings that fit your preconceived beliefs. Go outside your bubble and look at research that appears to challenge what you believe. I love coffee a lot. However, if it turns out that the way I consume it is not good for me, I need to know - even if I do not like the answers I find.

This post did not age well

Sometimes I find it informative to look at my old blog posts. If nothing else, I get a reminder of how I was thinking about some concept or issue at the time. In the process, I occasionally find that I have done a 180 in the intervening months and years.

Case in point: a post from June 2016 about Apple removing rifle emojis from its line of mobile phones. When I posted that one, the meta-analysis I was working on led me to some overly optimistic conclusions. Some additions to the database, followed by a much-needed correction to that database, would throw cold water on those optimistic statements about that research a mere one and a half years later.

Keep in mind that at the time I made that post, I and my then-second author honestly believed that the findings in our meta-analysis on the weapons effect indicated that the mere exposure to a weapon (including images, and even emojis) was sufficient to increase aggressive behavior. It's just that once we had correct evidence in front of us, it was time to draw some different conclusions. Really, a clear-eyed account of the literature on the weapons effect from its early days onward suggests a skeptical stance is more appropriate. Live and learn.

Apple's decision to remove a rifle emoji was arguably overkill. I understand the inclination to act. After all, mass shootings (however defined) are way too common in the US, and responsible corporations do not want to be viewed as encouraging that sort of violence. I get it. I just don't think removing the emojis actually helped. As of the publication of that post, Apple still had toy emojis of rifles and knives. Those too are probably harmless. Why the change? To my knowledge, there is no solid evidence that short-term exposure to any sort of weapon leads to criminal violence. Heck, it is unclear if short-term weapon exposure even has an impact on the mildest of aggressive behaviors we can measure in the lab. The evidence, based on our meta-analysis, is inconclusive at best.

Obviously, those of you who have ever stumbled on to this blog know that I consider gun violence (including mass shootings) to be a serious social and public health concern, and that we as a society need to do far more to prevent such events from happening. Risk factors such as the widespread availability of firearms and individuals' prior history of violent behavior will be more helpful in getting a handle on this important problem. Weapons effect research in its current form? Not so much.


“There are no innocents. There are, however, different degrees of responsibility.” 

-- Lisbeth Salander (protagonist of the Stieg Larsson novel, The Girl Who Played With Fire)

I very much admired Larsson's original Millennium trilogy. Salander was a far from perfect individual, but one who had a moral compass she put into practice. The quote above is a good indicator of her ethos (and I suspect the ethos of Larsson): we get our hands dirty, we take sides, make our choices, and deal with the consequences. Some hands are dirtier than others. In an ideal world, those who were more responsible would shoulder more of the burden of accountability. Indeed, Salander tried to make that ideal a reality throughout the course of the original trilogy. Reality is considerably more complicated. Sometimes those who are more responsible, and who do deserve to be held to a higher level of accountability, manage to continue to skate. Although that is frustrating, we can note that there may be systems in place that prevent justice of any sort (restorative or otherwise) from occurring. In the book series there was indeed a system in place that enabled gross economic exploitation, exploitation of those who were vulnerable due to real or alleged psychological disabilities, and exploitation based on gender or sexual orientation. In that sense, art imitates life. In my corner of the sciences, there is a rigid hierarchy, as well as various forms of exploitation. Unlike in the book series, there aren't necessarily any obvious villains. When the occasional Wansink gets caught, it really isn't a cause for celebration, given not only the obvious loss of a life's work, but also the human toll on those less privileged who tried but were unable to replicate and build on that work, as well as those most directly affected as collaborators. We may wonder who knew and when. But really many who may have known the most were likely those with the least power in a game that was (and is) largely rigged against them. Different degrees of responsibility. As we work our way through the current crisis in confidence in our field, let's make sure to keep that in mind.
The end goal is to change the system, rather than take down individuals.

Saturday, February 16, 2019

When is a replication not a replication? Part 2

A while back I described a scenario, and offered some ideas for how that scenario might occur:

Let's imagine a scenario. A researcher several years ago designs a study with five treatment conditions and is mainly interested in a planned contrast between condition 1 and the remaining four conditions. That finding appears statistically significant. A few years later, the same researcher runs a second experiment that appears to be based on the same protocols, but with a larger sample (both good ideas) and finds the same planned contrast is no longer significant. That is problematic for the researcher. So, what to do? Here is where we meet some forking paths. One choice is to report the findings as they appear and acknowledge that the original finding did not replicate. Admittedly finding journals to publish non-replications is still a bit of a challenge (too much so in my professional opinion), so that option may seem a bit unsavory. Perhaps another theory driven path is available. The researcher could note that other than the controller used in condition 1 and condition 2 (and the same for condition 3 and condition 4), the stimulus is identical. So, taking a different path, the researcher combines conditions 1 and 2 to form a new category and does the same with conditions 3 and 4. Condition 5 remains the same. Now, a significant ANOVA is obtainable and the researcher can plausibly argue that the findings show that this new category (conditions 1 and 2 combined) really is distinct from the neutral condition, thus supporting a theoretical model. The reported findings now look good for publication in a higher impact journal. The researcher did not find what she/he initially set out to find, but did find something. But did the researcher really replicate the original findings? If based on the prior published work, the answer appears to be no. The original planned contrast between condition 1 and the other conditions does not replicate. Does the researcher have a finding that tells us something possibly interesting or useful? Maybe. Maybe not. 
Does the revised analysis appear to be consistent with an established theoretical model? Apparently. Does the new finding tell us something about everyday life that the original would not have already told us had it successfully replicated? That's highly questionable. At bare minimum, in the strict sense of how we define a replication (i.e., a study that finds similar results to the original and/or to similar other studies) the study in question fails to do so. That happens with many psychological phenomena, especially ones that are quite novel and counter-intuitive.
Here is a more concrete example of what I had described. It is worth noting, if for no other reason than that the article in question is now officially "in print." I still think the concept of forking paths is one that seems applicable here, as one attempts to digest what is reported in the article versus what was apparently intended originally, and what is argued for as an overarching narrative in the article in question. That the evidence, as it exists, contradicts the narrative that might have been desirable is in itself neither good nor bad. To put it more casually, it is what it is. If nothing else, for those willing to go into the weeds a bit, the new findings, compared side by side with those in the original article, are quite informative. Let's learn what we can and move on.

Why due diligence matters

A couple of months ago, I made some oblique reference to a series of articles published by authors from what appears to be a specific lab. I am still holding off on a more detailed post, although I doubt I will hold off for too much longer. I suspect whatever decisions are being made by journal editors are still in the pipeline regarding those particular articles that have yet to be corrected.

What I do want to note is what my role is as an educator and scholar when it comes to post-peer review. My goal when presenting contemporary research is to provide something cutting edge that may not yet appear in textbooks. It's really cool to be able to point to some experimental research in, say, China, and note that some phenomenon either does or does not appear to replicate across cultures - especially since so much of the research on media violence and more narrow phenomena like the weapons effect is based on American and European samples. However, if in the process of looking over articles I might wish to share with my students or use as citations for my own work I notice problems, I can't just remain silent. Since there is the possibility that I may be misreading something, I might initially examine some basics: do the methods as described match what should occur if I or my peers were to run an equivalent study? If yes, then maybe I need to lay off. If no, then it's time for a bit of a deep dive. Thanks to the work of Nuijten et al. (2016; a public version of their article can be found at OSF), we know that there are often mistakes in the statistical findings reported in Results sections. So, I like to run articles I find of potential use through Statcheck (the web-based version can be found here). If I find problems, then it comes down to how to proceed. I might initially try contacting the authors themselves. Let's say that for the lab in question I initially did so while in the midst of working on a meta-analysis where some of their work was relevant to me. Now if the authors are willing to respond, great! Maybe we can look at data analyses, and attempt to reproduce what they found. Honest mistakes happen, and generally those can and should be fixable. But what if the authors are not interested in communicating? I don't know an easy answer beyond contacting a relevant editor and sharing my concerns.
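For the curious, the basic idea behind this kind of check can be sketched in a few lines of Python: recompute the p-value from the reported test statistic and degrees of freedom, then compare it with the reported p-value. To be clear, this is not Statcheck's actual code. The function names, the numerical integration shortcut, and the example values below are all my own assumptions for illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_value, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps  # `steps` must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_area)


def is_consistent(t_value, df, reported_p, decimals=2):
    """True if the reported p is within rounding error of the recomputed p."""
    tolerance = 0.5 * 10 ** -decimals
    return abs(two_sided_p(t_value, df) - reported_p) <= tolerance


# Made-up reports, not taken from any article: t(60) = 2.00 with two
# different claimed p-values.
print(is_consistent(2.00, 60, 0.05))  # matches the recomputed value
print(is_consistent(2.00, 60, 0.01))  # does not match
```

Since an F ratio with one numerator degree of freedom is just a squared t, the same check covers F(1, df) results by passing in the square root of F. A flagged mismatch is a reason to look closer or to ask the authors, not proof of anything by itself.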

Note the purpose of the exercise is not to try to hang a fellow researcher (or team of researchers) out to dry. That's a lousy way to live and I have no interest in living that way. Instead, I want to know that what I am reading and what my students might read is factual, rather than fiction. Having the correct data analyses at hand affects me as a researcher as well. If I am working on a meta-analysis and the analyses and descriptive statistics upon which I may be basing my effect size calculations are wrong, my ability to estimate some approximation of the truth will be hampered, and what I might then communicate with my peers will be at least somewhat incorrect. I don't want that to happen either. I just want this to be clear: my intentions are benign. I just like to know that I can trust what I am reading and I can trust the process by which authors arrive at their conclusions. Having been on the other side of the equation, finding out that I had made a serious but correctable error led to a much better article once all the dust had settled. Being wrong does not "feel" good in the moment, but that is not the point. The point is to learn and to do better.

We as researchers are humans. We are not perfect. Mistakes just come with the territory. Peer review is arguably a poor line of defense for catching mistakes. Thankfully the tools we have in place post-peer review are much better than ever. In the process of evaluating work post-peer review, we can find what got missed, and hopefully help make our little corners of our science a little better. Ideally we can do so in a way that avoids unnecessary angst and hard feelings. At least that is the intention.

So when I spill some tea in a few months, just be aware that I will lay out a series of facts about what got reported, what went sideways, and how it got resolved (or not resolved). Nothing more, nothing less.


Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods, 48 (4), 1205-1226. DOI: 10.3758/s13428-015-0664-2