Friday, December 27, 2019
You all know that I have done just a bit of data sleuthing here and there. I do so without any especially fancy background in statistics. I have sufficient coursework to teach stats courses at the undergraduate level, but I am no quantitative psychologist. So I appreciate articles like How to Be a Statistical Detective. The author lays out some common problems and shows how any of us can use skills we already have to detect them. I use some of these resources already, and am reasonably adept at using a calculator. I will likely add links to more of these resources to this blog.
The article is behind a paywall, but I suspect my more enterprising readers already know how to obtain a copy. It is fundamental reading.
Saturday, December 21, 2019
Oblique Strategies
I found the card deck of Oblique Strategies, developed by Brian Eno and Peter Schmidt (1975), to be quite useful. Although I have never owned an original deck (those were quite pricey back when I was a grad student), thanks to the early days of the world wide web I could find sites that would generate Oblique Strategies, which I used in my creative process while working on my dissertation. Now, well into the 21st century, Daniel Lakens has developed a cool Shiny app that will serve up random Oblique Strategies for your own inspiration. I will make this link available elsewhere on my blog so that it is easily accessible.
What drew me to the dilemmas posed by the deck - or virtual deck - was that they required a certain willingness to think "outside the box," or outside of one's normal professional parameters. From my vantage point, there is something healthy about that. Give it a try. See how your writing changes. See how you view everything from study design to data analytic strategies to - yes - writing up a research report. At the end of the day, we social scientists are still creators. We may be data-driven creators, but creators nonetheless. Besides, we should have some fun with our work as truth seekers and truth tellers.
Friday, December 6, 2019
Now about those Youth and Society retractions involving Qian Zhang
Hopefully you have had a moment to digest the recent Retraction Watch article about the retraction of two of Qian Zhang's papers. I began tweeting about one of the articles in question in late September 2018. You can follow this link to my tweet storm about the now-retracted Zhang, Espelage, and Zhang (2018) paper. Under a pseudonym, I initially documented some concerns about both papers on PubPeer: Zhang, Espelage, and Zhang (2018) and Zhang, Espelage, and Rost (2018).
Really, what I did was upload the papers to the online version of Statcheck and flag any decision inconsistencies it found. I also tried to be mindful of any other oddities that stuck out at the time: degrees of freedom that seemed unusual given the reported sample size, for example, or problems with tables purporting to report means and standard deviations. By the time I looked at these two papers, I was already concerned that papers from Zhang's lab showed a pervasive pattern of errors. Sadly, these two were no different.
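To make that kind of check concrete, here is a rough Python sketch of what Statcheck automates. Statcheck itself is an R package, so the regex and the decision rule below are simplified stand-ins of my own, not its actual implementation.

import re
from scipy import stats

def check_f_report(report):
    """Parse a reported 'F(df1, df2) = value, p < .05' string and
    recompute the p-value from the F distribution."""
    f_match = re.search(r"F\((\d+),\s*(\d+)\)\s*=\s*([\d.]+)", report)
    p_match = re.search(r"p\s*([<>=])\s*(\.\d+)", report)
    if not (f_match and p_match):
        return None
    df1 = int(f_match.group(1))
    df2 = int(f_match.group(2))
    f_val = float(f_match.group(3))
    sign, threshold = p_match.group(1), float(p_match.group(2))
    recomputed_p = stats.f.sf(f_val, df1, df2)  # upper-tail p for the F test
    # Decision inconsistency: reported as significant (p < threshold),
    # but the recomputed p-value does not clear that threshold.
    inconsistent = (sign == "<" and recomputed_p >= threshold)
    return recomputed_p, inconsistent

print(check_f_report("F(1, 51) = 2.87, p < .05"))
# -> (0.0963..., True): reported as significant, but p is actually about .096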
With regard to Zhang, Espelage, and Zhang (2018), a Statcheck scan showed three decision errors. In this case, these were errors in which the authors reported findings as statistically significant when they were not, given the test statistic, the degrees of freedom, and the significance threshold the authors claimed.
The first decision inconsistency has to do with the assertion that playing violent video games increased the accessibility of aggressive thoughts. The authors reported the effect as F(1, 51) = 2.87, p < .05. According to Statcheck, the actual p-value would have been 0.09634. In other words, there is no main effect of violent video game content in this sample. Nor was there a video game type by gender interaction: F(1, 65) = 3.58, p < .01 (actual p-value: p = 0.06293). Finally, there is no game type by age interaction: F(1, 64) = 3.64, p < .05 (actual p-value: p = 0.06090). Stranger still, the sample was approximately 3,000 students. Why were the denominator degrees of freedom so small for these reported test statistics? Something did not add up. Table 1 from Zhang, Espelage, and Zhang (2018) was also completely impossible to interpret, an issue I have highlighted in other papers published by this lab.
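For anyone who wants to verify those numbers, the recomputed p-values above can be reproduced with a few lines of Python (scipy here standing in for Statcheck's own R routines):

from scipy.stats import f

reported = [(2.87, 1, 51, "p < .05"),
            (3.58, 1, 65, "p < .01"),
            (3.64, 1, 64, "p < .05")]

for f_val, df1, df2, claim in reported:
    p = f.sf(f_val, df1, df2)  # upper-tail probability of the F distribution
    print(f"F({df1}, {df2}) = {f_val}, claimed {claim}, recomputed p = {p:.5f}")

# Prints (approximately) p = 0.09634, 0.06293, and 0.06090 -- none below .05,
# let alone .01.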
A few months later, a correction was published in which the authors purported to correct a number of errors on several pages of the original article, as well as in Table 1. That was wonderful as far as it went. However, there was a new oddity: the authors claimed to have used only 500 of the 3,000 participants in order to have a "true experiment" - one of the more interesting uses of that term I have read over the course of my career. And as Joe Hilgard has aptly noted, problems with the descriptive statistics remained pervasive - implausible and impossible cell means, marginal means, and standard deviations, for example.
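As an aside on what an "impossible" mean can look like: when responses are integers on a fixed scale, a cell's mean can only take values of the form k/n, so a reported mean can be checked against its cell size. The sketch below is a GRIM-style granularity check (Brown & Heathers, 2017). It is only an illustration of that reasoning, not Hilgard's actual reanalysis, and the means and cell sizes in the example are made up.

def grim_consistent(mean, n, decimals=2):
    """Can a mean reported to `decimals` places arise from n integer scores?"""
    target = round(mean, decimals)
    # Every achievable mean is an integer total divided by n; check the
    # integer totals closest to mean * n.
    nearest = int(round(mean * n))
    return any(round(k / n, decimals) == target
               for k in (nearest - 1, nearest, nearest + 1))

# Hypothetical examples: a mean of 3.47 is impossible with 25 integer
# responses (every achievable mean is a multiple of 0.04), but fine with 30.
print(grim_consistent(3.47, 25))  # False
print(grim_consistent(3.47, 30))  # True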
With regard to the Zhang, Espelage, and Rost (2018) article, my initial flag was simply for some decision errors in Study 1, in which the authors were attempting to establish that their stimulus materials were equivalent on a number of variables other than level of violent content, and that this held across subsamples, such as sex of participant (male/female). Given how difficult it is to obtain, say, film clips that are equivalent except for their level of violence, due diligence in using materials that are as equivalent as possible apart from the IV or IVs is to be admired. Unfortunately, there were several decision errors that I flagged after a Statcheck run.
As I noted at the time, contra the authors' assertion, there was evidence of rated differences between the violent film (Street Fighter) and the nonviolent film (Air Crisis) in terms of pleasantness - t(798) = 2.32, p > .05 (actual p-value: p = 0.02059) - and fear - t(798) = 2.13, p > .05 (actual p-value: p = 0.03348). The extent to which the failure to control for those factors might have affected the subsequent analyses in Study 2 is, of course, debatable. What is clear is that the authors cannot demonstrate, based on their reported analyses, that their films were equivalent on the variables they identified as important to hold constant, with only violent content varying. The final decision inconsistency suggested that there was a sex difference in ratings of fear, contrary to the authors' claim: t(798) = -2.14, p > .05 (actual p-value: p = 0.03266). How much that affected the experiment in Study 2 was not something I felt I could assess, but I found it troubling and worth flagging. At minimum, the film clips were less equivalent than reported, and the subsamples were potentially reacting to those clips differently than reported.
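As with the F tests above, the two-tailed p-values for those t statistics are easy to reproduce (again with scipy standing in for Statcheck):

from scipy.stats import t

for t_val, df in [(2.32, 798), (2.13, 798), (-2.14, 798)]:
    p = 2 * t.sf(abs(t_val), df)  # two-tailed p-value
    print(f"t({df}) = {t_val}, recomputed p = {p:.5f}")

# Prints (approximately) p = 0.02059, 0.03348, and 0.03266 -- all below .05,
# contrary to the reported "p > .05".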
Although I did not comment on Study 2, Hilgard demonstrated that the pattern of means reported in this paper was strikingly similar to the pattern of means reported in a couple of earlier papers from 2013 on which Zhang was a lead author or coauthor. That is troubling. If you have followed some of my coverage of Zhang's work on this blog, you are well aware that I have discovered at least one instance in which a reported test statistic was copied and pasted directly from one paper to another. Make of that what you will. Dr. Hilgard was eventually able to obtain the data that were to accompany a correction to that article, and as noted in the Retraction Watch coverage, the data and analyses were fatally flawed.
I was noticing a pervasive pattern of errors in these papers, as well as in others by Zhang and colleagues that I was reading at the time. These are the first papers on which Zhang is a lead author or coauthor to be retracted. I am willing to bet that they will not be the last, given the evidence I have been sharing here and on Twitter over the last year. I have probably said this repeatedly already about these retractions: I am relieved. There is no joy to be had here. This has been a bad week for the authors involved. Please also note that I am taking great care not to assign motive. I think the evidence speaks for itself that the research was poorly conducted and poorly analyzed, and that can happen for any number of reasons. I don't know any of the authors involved. I have some awareness of Dr. Espelage's work on bullying, though that is a bit outside my own specialty area. My impression of her work has always been favorable, and these retractions notwithstanding, I see no reason to change my impression of her work on bullying.
If I was sounding alarms in 2018 and onward, it is because Zhang had increasingly begun to enlist well-regarded American and European researchers as collaborators, and was beginning to publish in top-tier journals in various specialties within the psychological sciences, such as child and adolescent development and aggression. Given that a reasonable case could be made that Zhang's track record for well-conducted and well-analyzed research was far from ideal, I did not want to see otherwise reputable researchers put their careers on the line. My fears are now, to a certain degree, being realized.
Note that in preparing this post, I relied heavily on my tweets from September 24, 2018, and a couple of posts I published pseudonymously on PubPeer (see links above). And credit where it is due: I am glad I could get a conversation started about these papers (and others) from this lab. Joe Hilgard has clearly put a great deal of effort and talent into clearing the record since, and we really owe him a debt of gratitude. We also owe a debt of gratitude to those who have asked questions on Twitter, retweeted, and refused to let up on the pressure. Science is not self-correcting. It takes people who care to actively do the correcting.
For those visiting from Retraction Watch:
Retraction Watch posted an article about the retraction of two articles on which Qian Zhang of Southwest University in China was the lead author. Since some of you might be interested in what I've documented about other published articles from Zhang's lab, your best bet is either to type Zhang into this blog's search field or simply to follow this link, where I have done the work for you. I'll have more to say about these specific articles in a little bit. I believe I documented some of my concerns on Twitter last year and pseudonymously on PubPeer. In the meantime, I am relieved to see two very flawed articles removed from the published record. Joe Hilgard deserves a tremendous amount of credit for his work reanalyzing some data he was able to obtain from the lab (and for his meticulous documentation of the flaws in these papers), and for his persistence in contacting the Editor in Chief of Youth and Society. I am also grateful for tools like Statcheck, which enabled me to spot some of the problems with these papers very quickly.