Wednesday, February 21, 2018

More falsehoods?

Continuing from the previous two posts: Bem's data, which Dr. R. made available here, show that the data for experiment 7 was collected in 2005. I, and others, had assumed that the data for experiment 7 was collected along with the data for experiments 5 and 6, because the description of the experiment is the same as the description of the 300-series experiments in the 2003 report (which includes the data later used to form experiments 5 and 6). It turns out that this was not the case.

The 2003 report describes a set of experiments using supraliminal exposures and low-affect trials. Two groups are compared on the basis of "boredom proneness" (in this case, "openness to experience"). The t-test of the difference is reported with 92 degrees of freedom, which indicates there were 94 subjects in the experiment. This differs from the other experiments in the 300 supraliminal series, which tested variations of the precognitive habituation hypothesis, both in the targets used and in the numbers of subjects.
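For a standard (pooled-variance) independent-samples t-test, the degrees of freedom are the total number of subjects minus two, which is where the figure of 94 comes from:

```latex
\mathrm{df} = n_1 + n_2 - 2 = 92 \quad\Longrightarrow\quad n_1 + n_2 = 94
```

This assumes the pooled-variance test; a Welch test would generally report non-integer degrees of freedom.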

Then, as we now know from the newly available data, Bem later performed another set of experiments using supraliminal exposures and low-affect trials, on 200 subjects, in 2005. Again, two groups are compared on the basis of "boredom proneness" (in this case, "stimulus seeking"). This experiment is written up in "Feeling the Future" as experiment 7, but no mention is made of his previous experiment, which tested the same idea.

Bem has denied that there is a file-drawer of relevant experiments/trials that he failed to mention in "Feeling the Future". Yet here we have a relevant experiment, testing the same hypothesis in the same way as a later experiment that does make its way into "Feeling the Future". And he makes no mention of the earlier experiment, let alone includes its results. It has been suggested (here, for example) that Bem's passing reference to a couple of file-drawer studies on precognitive habituation can be taken as covering the file-drawered study on the induction of boredom. But why would this be the case? The supraliminal PH studies and the boredom study were performed using different targets and different subjects, and tested different hypotheses. And while Bem found a way to excuse the supraliminal PH studies from consideration, by regarding them as conceptual replications that didn't work, the same excuse doesn't hold for the boredom studies, since there is no difference in concept between the earlier and later studies.

So, to sum up, we find unequivocal evidence, from the 2003 report, that Bem did at least two things which he denies. He set up an experiment which tested multiple hypotheses, then presented it as though it had been set up to test a single hypothesis, one chosen post hoc based on where he was able to tease out statistically significant results (http://naturalismisuseful.blogspot.com/2018/01/). And he ran multiple tests of the same hypothesis, yet failed to mention or include one of those tests, leaving it in a file-drawer. He has had ample opportunity to own up to this. At what point does this become deliberate deception, and a violation of the standards of practice in place in 2011?

Saturday, January 27, 2018

Was Bem dishonest?

This is an overly contentious title, but we now seem to have confirmation that Bem provided a false description of experiment 5 in "Feeling the Future". Dr. R., on the Replicability-Index blog, has made Bem's data available for download:
https://replicationindex.wordpress.com/2018/01/20/my-email-correspondence-with-daryl-j-bem-about-the-data-for-his-2011-article-feeling-the-future/

The data for experiment 5 consist of 100 subjects, but there is a clear condition change after the first 50 subjects in the number of trials each subject was exposed to. There is also a gap of about 4 weeks separating the sessions for the first 50 subjects from those for the last 50.
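Both observations are easy to check once the data are in tabular form. Below is a minimal sketch, assuming each subject's record has been reduced to a (session date, number of trials) pair sorted by date. That layout, the function names, and the toy values are my own illustrative assumptions, not the format or contents of the file Dr. R. posted.

```python
from datetime import date

def first_condition_change(records):
    """Index of the first record whose trial count differs from the first record's.

    records: list of (session_date, n_trials) tuples, assumed sorted by date.
    Returns None if every record has the same trial count.
    """
    baseline = records[0][1]
    for i, (_, n_trials) in enumerate(records):
        if n_trials != baseline:
            return i
    return None

def largest_gap_days(records):
    """(index, days): the record that follows the longest gap between sessions."""
    best_i, best_days = None, -1
    for i in range(1, len(records)):
        days = (records[i][0] - records[i - 1][0]).days
        if days > best_days:
            best_i, best_days = i, days
    return best_i, best_days

# Toy records for illustration only (invented values, not Bem's data).
toy = [(date(2000, 1, 3), 10), (date(2000, 1, 4), 10),
       (date(2000, 2, 1), 20), (date(2000, 2, 2), 20)]
print(first_condition_change(toy), largest_gap_days(toy))
```

If the condition change and the long gap land on the same index, as in the toy example, that is the pattern described above.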

Bem states in "Feeling the Future" that the preliminary results of Experiment 5 (and 6 and 7) were previously reported in 2003. That report is here: https://pdfs.semanticscholar.org/8033/f0406daadc956c18d847cb39afc1610b2e73.pdf. The condition change I mention above is consistent with the change in conditions between experiments 101 and 102.

The first experimental series (101) in that report consists of the following:
     34 women
     16 men
     negative/high arousal hit rate = 55.8%
     t-test(49) = 2.41
     p = 0.01 one-tailed

     "control" hit rate = 49.8%
     t-test of the difference(49) = 2.28
     p = 0.027 two-tailed.

If we take the experiment 5 data from "Feeling the Future" that Dr. R. made available, and perform the same analysis on the first 50 subjects, we get:
     34 women
     16 men
     negative/high arousal hit rate = 55.8%
     t-test(49) = 2.41
     p = 0.01 one-tailed

     control hit rate = 49.8%
     t-test of the difference(49) = 2.28
     p = 0.027 two-tailed
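To check that these figures hang together, here is a minimal standard-library sketch. It computes a one-sample t against a 50% hit rate and a paired t for the difference, and it converts t to one- and two-tailed p-values by integrating the Student t density numerically with Simpson's rule. The function names are mine, and the per-subject input format is an assumption; scipy.stats would do the same job. For t(49) = 2.41, the one-tailed p is about 0.01 and the two-tailed p about 0.02; for t(49) = 2.28, the two-tailed p is about 0.027.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_sf(t, df, n=10_000):
    """Upper-tail probability P(T > t), via Simpson's rule on [0, t] plus symmetry."""
    h = t / n  # n must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def one_sample_t(values, mu):
    """t statistic and df for testing mean(values) against mu."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu) / se, n - 1

def paired_t(a, b):
    """Paired t: a one-sample t on the per-subject differences a - b."""
    return one_sample_t([x - y for x, y in zip(a, b)], 0.0)

# Hypothetical usage with per-subject hit rates (variable names are illustrative):
#   t, df = one_sample_t(negative_hit_rates, 0.5)
#   t, df = paired_t(negative_hit_rates, control_hit_rates)
for t, df in [(2.41, 49), (2.28, 49)]:
    one_tailed = t_sf(t, df)
    print(f"t({df}) = {t}: one-tailed p = {one_tailed:.4f}, "
          f"two-tailed p = {2 * one_tailed:.4f}")
```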

It's pretty clear that both reports are talking about the same data. The description of this experiment from 2003 states:


"For the PH studies, the pictures were divided into six categories defined by crossing 3 levels of valence (negative, neutral, positive) with 2 levels of arousal (low, high)...

The first, Experiment 101, was designed to see if the PH procedure would yield a significant psi effect on any kind of target. Accordingly, the 6 kinds of picture pairs composed by crossing 3 levels of valence (negative, neutral, positive) with 2 levels of arousal (low, high) were equally represented across the 48 trials of the session, 8 of each kind...

The results were clear cut: Only the negative/high arousal pictures produced a significant psi effect...

After the fact, then, this experiment can be conceptualized as comprising 8 negative trials and 40 low-affect (“control”) trials."

But the description of this experiment, eight years later, in "Feeling the Future," states:


"This first retroactive habituation experiment comprised trials using either strongly arousing negative picture pairs or neutral control picture pairs;"

There is no mention of the fact that Bem started by looking for an effect on any kind of target, not just negative/high-arousal ones, or that further experiments were planned on the basis of those results. Nor is there any mention that the "neutral controls" were a post-hoc compilation of pictures with a variety of valence and arousal levels, some of which were not "neutral" and some of which were not "low arousal".

A key criticism of "Feeling the Future" is that the results are unlikely to represent a true effect if the reported experiments were cherry-picked from a larger pool of exploratory studies. Yet even in the recent email exchange with Dr. R., Bem states, "Nor did I discard failed experiments or make decisions on the basis of the results obtained." This is clearly false for at least one of the experiments.

In light of these findings, perhaps Dr. R. is right in asking for retraction of "Feeling the Future".

Wednesday, January 17, 2018

QRPs in Bem's Feeling the Future

I have seen it mentioned (here, for example: https://replicationindex.wordpress.com/2018/01/05/why-the-journal-of-personality-and-social-psychology-should-retract-article-doi-10-1037-a0021524-feeling-the-future-experimental-evidence-for-anomalous-retroactive-influences-on-cognition-a/) that there seems to be little scope for questionable research practices (QRPs) to have affected Bem's results. I thought I'd make a list of the potential QRPs I've identified as I've gone through the study and the research Bem references in support.

Experiment 1
Pictures are rated on arousal (low to high) and valence (positive to neutral to negative), which allows for a variety of eminently justifiable ways of forming groups in which an effect is ‘expected’ or ‘not expected’. Plus, Bem mentions that a large number of ‘non-arousing’ trials were run along with the 36 trials he selected to report on. Note that the groups he forms in this study differ from those he forms from the same categories in experiments 5 and 6.
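To give a sense of how much room a 3 × 2 valence-by-arousal scheme leaves, here is a small sketch that counts every way of splitting the six categories into a non-empty ‘expected’ group and a non-empty ‘not expected’ group. Treat it as a loose upper bound. It ignores whether a given split would be theoretically defensible, and it does not count the further choice of direction. The category labels come from the post; the function name is hypothetical.

```python
from itertools import combinations, product

VALENCE = ("negative", "neutral", "positive")
AROUSAL = ("low", "high")
CATEGORIES = list(product(VALENCE, AROUSAL))  # 3 x 2 = 6 categories

def candidate_splits(categories):
    """Every (expected, not_expected) split with both groups non-empty."""
    splits = []
    for k in range(1, len(categories)):
        for expected in combinations(categories, k):
            not_expected = [c for c in categories if c not in expected]
            splits.append((expected, not_expected))
    return splits

splits = candidate_splits(CATEGORIES)
print(len(splits))  # 2**6 - 2 = 62
```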
 
Experiment 2
Allows for 3 different outcomes to serve as the main outcome: the first 100 trials, the second 50 trials, or all 150 trials.
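How much does a choice among three outcomes matter? Here is a minimal Monte Carlo sketch. It assumes each unit contributes an independent N(0, 1) score under the null, and that each candidate outcome is tested one-tailed at α = 0.05 with a z-test. Because the first 100 and the second 50 are independent, the chance that at least one of those two reaches significance is already 1 − 0.95² ≈ 0.0975; reporting the pooled outcome when it alone is significant adds a little more. The function and parameter names are illustrative.

```python
import math
import random

def best_of_three_fpr(n_sims=100_000, n_first=100, n_second=50,
                      z_crit=1.645, seed=0):
    """Monte Carlo false-positive rate when any of three outcomes may be reported.

    Under the null, the z-statistics for the first and second blocks are
    independent N(0, 1); the pooled z is their sample-size-weighted combination.
    """
    rng = random.Random(seed)
    n_all = n_first + n_second
    w1 = math.sqrt(n_first / n_all)
    w2 = math.sqrt(n_second / n_all)
    hits = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        z_all = w1 * z1 + w2 * z2
        if max(z1, z2, z_all) > z_crit:
            hits += 1
    return hits / n_sims

print(round(best_of_three_fpr(), 3))
```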

Experiments 3 and 4
No explanation is offered for why the timing differs between the forward and backward conditions, both in the delay before the prime is presented and in how long the prime is presented. With no restrictions on this, it is possible to test multiple variations in timing. Priming experiments in the literature differ in how long the prime is presented (from subliminal to explicit) and in the interval between prime and picture, and find that there is a window in which priming is most effective, after which the effect is lost as the interval increases. The forward priming trials fall within this window, while the retroactive trials are too long to do so. This raises the question of why.
Ratcliff’s recommendations for dealing with the right skew of the data are to use either cutoffs or transformations, not to transform data to which cutoffs have already been applied, as Bem did. The choice of cutoff or method of transformation has substantial effects on the power of the study, which then makes the false-positive risk, mentioned by Colquhoun, relevant.
Also, more results were excluded than just the 4 subjects who made more than 16 errors. Trials in which errors were made were excluded across all subjects, which resulted in the exclusion of about 9% of the trials, in addition to those excluded by the choice of cutoff.
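To make the forking concrete, here is a minimal sketch of how cutoff and transformation choices multiply into distinct analyses. The 250/1500 ms bounds, the log and inverse transforms, and the simulated response times are all hypothetical choices for illustration, not the values Bem used. The point is only that two choices (whether to apply a cutoff, and which transformation to use) already yield six different summaries of the same data.

```python
import math
import random

def apply_cutoff(rts, lo=250.0, hi=1500.0):
    """Keep only response times within [lo, hi] ms (hypothetical bounds)."""
    return [rt for rt in rts if lo <= rt <= hi]

# Candidate transformations for right-skewed RTs (choices are illustrative).
TRANSFORMS = {
    "none": lambda rt: rt,
    "log": math.log,
    "inverse": lambda rt: -1000.0 / rt,  # negated so larger still means slower
}

def pipeline_means(rts):
    """Map each cutoff/transform combination to (trials kept, mean transformed RT)."""
    results = {}
    for use_cutoff in (False, True):
        data = apply_cutoff(rts) if use_cutoff else list(rts)
        for name, f in TRANSFORMS.items():
            label = ("cutoff+" if use_cutoff else "") + name
            results[label] = (len(data), sum(f(x) for x in data) / len(data))
    return results

random.seed(1)
# Simulated right-skewed RTs: a normal component plus an exponential tail.
rts = [random.gauss(550, 60) + random.expovariate(1 / 250) for _ in range(500)]
for label, (n, mean) in pipeline_means(rts).items():
    print(f"{label:15s} n={n:4d} mean={mean:.4f}")
```

The "cutoff+log" and "cutoff+inverse" rows correspond to the combined approach criticized above. Ratcliff's recommendation amounts to picking from the "cutoff+none", "log" and "inverse" rows instead.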

Experiments 5 and 6
These experiments were previously written up, so we can compare the original report with the new one. The original report describes presenting 6 categories of pictures (as in Experiment 1). There were multiple hypotheses available for use, depending on which category or combination of categories showed a result that differed from chance, in either direction. For example, the idea this experiment was based on, Mere Exposure, would predict target preference in any category. Bem’s idea, Retroactive Habituation, predicts target preference or avoidance, depending on the category.
There are trials in the new report that were not included in the original report (at least 50), and there are sets of trials in the original report (at least 60) that have not been included in the new report. In addition, trials that were originally reported as separate series are now combined and treated as though they were a single preplanned experiment.
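A rough way to see what having multiple hypotheses available costs: suppose each of the k = 6 picture categories is tested against chance at α = 0.05, two-tailed so that a deviation in either direction counts, and the tests are treated as independent. Then the chance of at least one significant category under the null is

```latex
P(\text{at least one } p < \alpha) = 1 - (1 - \alpha)^{k} = 1 - 0.95^{6} \approx 0.265
```

This is only an approximation. The categories share subjects, so the tests are not truly independent, and also counting combinations of categories would push k higher.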

Experiment 7
The description of this experiment differs from that in the initial report, which included strongly negative and erotic pictures. Either Bem neglected to include the results from 146 of the subjects, or he neglected to include all the trials from each subject.

Experiments 8 and 9
The DR% is a novel outcome measure. Without the constraint of using an established outcome measure, there is flexibility in how the outcome is defined.