Saturday, January 27, 2018

Was Bem dishonest?

This is an overly contentious title, but we now seem to have confirmation that Bem provided a false description of experiment 5 in "Feeling the Future". Dr. R. on the Replicability index blog has made Bem's data available for download. 
https://replicationindex.wordpress.com/2018/01/20/my-email-correspondence-with-daryl-j-bem-about-the-data-for-his-2011-article-feeling-the-future/

The data for experiment 5 consists of 100 subjects, but there is a clear condition change after the first 50 subjects, in the number of trials each subject was exposed to. There is also a period of about 4 weeks separating the trials done on the first 50 subjects and the last 50.

Bem states, in "Feeling the Future", that the preliminary results of Experiment 5 (and 6 and 7) were previously reported in 2003. That report is here: https://pdfs.semanticscholar.org/8033/f0406daadc956c18d847cb39afc1610b2e73.pdf. The condition change I mention above is consistent with the change in conditions between Experiments 101 and 102 in that report.

The first experimental series (101) in that report consists of the following:
     34 women
     16 men
     negative/high arousal hit rate = 55.8%
     t-test(49) = 2.41
     p = 0.01 one-tailed

     "control" hit rate = 49.8%
     t-test of the difference(49) = 2.28
     p = 0.027 two-tailed.

If we take the Experiment 5 data from "Feeling the Future" that Dr. R. made available, and perform the same analysis on the first 50 subjects, we get:
     34 women
     16 men
     negative/high arousal hit rate = 55.8%
     t-test(49) = 2.41
     p = 0.01 one-tailed

     control hit rate = 49.8%
     t-test of the difference(49) = 2.28
     p = 0.027 two-tailed

It's pretty clear that both reports are talking about the same data. The description of this experiment from 2003 states:


"For the PH studies, the pictures were divided into six categories defined by crossing 3 levels of valence (negative, neutral, positive) with 2 levels of arousal (low, high)...

The first, Experiment 101, was designed to see if the PH procedure would yield a significant psi effect on any kind of target. Accordingly, the 6 kinds of picture pairs composed by crossing 3 levels of valence (negative, neutral, positive) with 2 levels of arousal (low, high) were equally represented across the 48 trials of the session, 8 of each kind...

The results were clear cut: Only the negative/high arousal pictures produced a significant psi effect...

After the fact, then, this experiment can be conceptualized as comprising 8 negative trials and 40 low-affect (“control”) trials."

But the description of this experiment, eight years later, in "Feeling the Future," states:


"This first retroactive habituation experiment comprised trials using either strongly arousing negative picture pairs or neutral control picture pairs;"

There is no mention of the fact that Bem started by looking for an effect on any kind of target, not just negative/high arousal, or that further experiments were planned on the basis of those results. Nor is there any mention that the "neutral controls" were a post-hoc compilation of pictures with a variety of valence and arousal levels, some of which were not "neutral" or not "low arousal".

A key criticism of "Feeling the Future" is that the results likely do not represent a true effect if these reports are cherry-picked from among a larger pool of exploratory studies. Yet even in the recent email exchange with Dr. R., Bem states, "Nor did I discard failed experiments or make decisions on the basis of the results obtained." This is clearly false for at least one of the experiments.

In light of these findings, perhaps Dr. R. is right in asking for retraction of "Feeling the Future".

Wednesday, January 17, 2018

QRPs in Bem's Feeling the Future

I have seen mentioned (here for example: https://replicationindex.wordpress.com/2018/01/05/why-the-journal-of-personality-and-social-psychology-should-retract-article-doi-10-1037-a0021524-feeling-the-future-experimental-evidence-for-anomalous-retroactive-influences-on-cognition-a/) that there seems to be little scope for questionable research practices (QRPs) to have an effect on Bem's results. I thought I'd make a list of the potential QRPs I've identified as I've gone through the study and the research which Bem references in support.

Experiment 1
Pictures are rated on arousal (low to high) and valence (positive to neutral to negative) which allows for a variety of eminently justifiable ways of forming groups in which an effect is ‘expected’ or ‘not expected’. Plus, Bem mentions that a large number of ‘non-arousing’ trials were run along with the 36 trials he selected out to report on. Note that he forms different groups in this study than he does using the same categories in experiments 5 and 6.
 
Experiment 2
Allowed for 3 different outcomes to serve as the main outcome  - first 100 trials, second 50 trials, or all 150 trials.
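
To give a rough sense of how much this kind of flexibility matters, here is a small simulation sketch. It is not a model of Bem's actual design: it assumes 150 independent chance-level (50%) binary outcomes, a one-tailed z-test at the 5% level, and a researcher free to report whichever of the three windows (first 100, last 50, all 150) comes out significant. All names in it are hypothetical.

```python
import random


def any_significant(rng, z_crit=1.645):
    """Simulate one null experiment; True if any of the 3 windows 'works'."""
    units = [rng.random() < 0.5 for _ in range(150)]  # pure chance outcomes
    windows = (units[:100], units[100:], units)       # first 100, last 50, all 150
    for w in windows:
        n = len(w)
        # Normal approximation to the binomial under the null p = 0.5.
        z = (sum(w) - 0.5 * n) / (0.25 * n) ** 0.5
        if z > z_crit:
            return True
    return False


def false_positive_rate(n_sims=5000, seed=1):
    """Fraction of null experiments with at least one significant window."""
    rng = random.Random(seed)
    return sum(any_significant(rng) for _ in range(n_sims)) / n_sims


rate = false_positive_rate()
```

Under these assumptions the rate should come out noticeably above the nominal 5%, even though there is no effect in the simulated data. The three windows overlap, so the inflation is smaller than it would be for three fully independent tests.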

Experiment 3 and 4
No explanation is offered for why the forward and backward conditions differ in both the delay before the prime is presented and the length of time the prime is shown. With no restrictions on these timings, it becomes possible to test multiple variations. Priming experiments in the literature differ in how long the prime is presented (from subliminal to explicit) and in the interval between prime and picture presentation. They find that there is a window in which priming is most effective, and that the effect is lost as the interval increases. The forward priming trials fall within this window, while the intervals in the retroactive trials are too long to do so. This raises the question of why those timings were chosen.
Ratcliff’s recommendations for dealing with the right skew of the data are to use either cutoffs or transformations, not to transform data to which cutoffs have already been applied, as Bem did. The choice of cutoff or method of transformation has substantial effects on the power of the study, which then makes the false-positive risk, mentioned by Colquhoun, relevant.
Also, more results were excluded than the 4 subjects who had more than 16 errors. Trials in which errors were made were excluded across all subjects which resulted in the exclusion of about 9% of the trials, in addition to those excluded by the choice of cutoff.

Experiment 5 and 6
This experiment was previously written up, so we can compare the original report with this new report. The original report describes presenting 6 categories of pictures (as per Experiment 1). There were multiple hypotheses available for use, depending upon which category or combinations of categories were found to have a finding which differed from chance, in either direction. For example, the idea which this experiment was based on, Mere Exposure, would predict target preference in any category. Bem’s idea, Retroactive Habituation, predicts target preference or avoidance, depending upon the category.
There are trials in this report which were not included in the original report (at least 50). And there are sets of trials in the original report (at least 60), which have not been included in this report. In addition, trials which were originally reported as separate series are now combined and treated as though they were a single preplanned experiment in this report.

Experiment 7
The description of this experiment is different from the initial report, which included strongly negative and erotic pictures. Either Bem neglected to include the results from 146 of the subjects, or neglected to include all the trials from each subject.

Experiment 8 and 9
The DR % is a novel outcome measure. Without the constraint of using an established outcome measure, this allows for flexibility in outcome measures.