Wednesday, February 21, 2018

More falsehoods?

Continuing on from the previous two posts: Bem's data, which Dr. R made available here, shows that the data for experiment 7 was collected in 2005. I, and others, had assumed that the data for experiment 7 was collected along with the data for experiments 5 and 6, because the description of the experiment is the same as the description of the 300-series experiments in the 2003 report (which includes the data later used to form experiments 5 and 6). It turns out that this was not the case.

The 2003 report describes a set of experiments using supraliminal exposures and low-affect trials. Two groups are compared on the basis of "boredom proneness" (in this case, "openness to experience"). A t-test of the difference reports 92 degrees of freedom, which for a two-sample t-test indicates there were 94 subjects in the experiment. This differs from the descriptions of the other experiments in the 300 supraliminal series, which tested variations of the precognitive habituation hypothesis, both in the targets used and in the numbers of subjects.

Then, as we now know from the newly available data, Bem ran another set of experiments using supraliminal exposures and low-affect trials on 200 subjects, in 2005. Again, two groups are compared on the basis of "boredom proneness" (in this case, "stimulus seeking"). This experiment is written up in "Feeling the Future" as experiment 7, but no mention is made of his previous experiment, which tested the same idea.

Bem has denied that there is a file-drawer of relevant experiments/trials which he failed to mention in "Feeling the Future". Yet here we have a relevant experiment, testing the same hypothesis in the same way as a later experiment which does make its way into "Feeling the Future". And he makes no mention of the prior experiment, let alone its results. It has been suggested (here, for example) that Bem's passing reference to a couple of file-drawer studies on precognitive habituation can be taken to include the file-drawer induction-of-boredom study. But why would that be the case? The supraliminal PH studies and the boredom study were performed using different targets and different subjects, and were testing different hypotheses. And while Bem found a way to excuse the supraliminal PH studies from consideration, by regarding them as conceptual replications which didn't work, that same excuse doesn't hold for the boredom studies, since there is no difference in concept between the earlier and later studies.

So to sum up, we find unequivocal evidence, from the 2003 report, that Bem did at least two things which he denies. He set up an experiment which tested multiple hypotheses, then presented it as though it had been set up to test a single hypothesis, one chosen post hoc based on where he was able to tease out statistically significant results (http://naturalismisuseful.blogspot.com/2018/01/). And he ran multiple tests of the same hypothesis, yet failed to mention or include one of those tests, leaving it in a file-drawer. He has had ample opportunity to own up to doing this. At what point does this become deliberate deception and a violation of the standards of practice in place in 2011?

Saturday, January 27, 2018

Was Bem dishonest?

This is an overly contentious title, but we now seem to have confirmation that Bem provided a false description of experiment 5 in "Feeling the Future". Dr. R. on the Replicability index blog has made Bem's data available for download. 
https://replicationindex.wordpress.com/2018/01/20/my-email-correspondence-with-daryl-j-bem-about-the-data-for-his-2011-article-feeling-the-future/

The data for experiment 5 consists of 100 subjects, but there is a clear condition change after the first 50 subjects, in the number of trials each subject was exposed to. There is also a period of about 4 weeks separating the trials done on the first 50 subjects and the last 50.

Bem states, in "Feeling the Future", that the preliminary results of Experiment 5 (and 6 and 7) were previously reported in 2003. That report is here - https://pdfs.semanticscholar.org/8033/f0406daadc956c18d847cb39afc1610b2e73.pdf. The condition change I mention above is consistent with the change in conditions between experiments 101 and 102.

The first experimental series (101) in that report consists of the following:
     34 women
     16 men
     negative/high arousal hit rate = 55.8%
     t-test(49) = 2.41
     p = 0.01 one-tailed

     "control" hit rate = 49.8%
     t-test of the difference(49) = 2.28
     p = 0.027 two-tailed.

If we use the data on experiment 5 which Dr. R. made available from "Feeling the Future," and perform the same analysis on the first 50 subjects, we get:
     34 women
     16 men
     negative/high arousal hit rate = 55.8%
     t-test(49) = 2.41
     p = 0.01 one-tailed

     control hit rate = 49.8%
     t-test of the difference(49) = 2.28
     p = 0.027 two-tailed
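
For anyone who wants to reproduce this from the posted data, here is a minimal sketch of the analysis in Python. The file name and column names are placeholders (I'm assuming the spreadsheet holds per-subject hit rates as percentages), so adjust them to whatever labels appear in the actual file:

# A minimal sketch of the re-analysis, assuming a file with one row per
# subject and columns holding each subject's hit rate (as a percentage) on
# the negative/high-arousal trials and on the low-affect "control" trials.
# The file name and column names below are placeholders, not Bem's labels.
import pandas as pd
from scipy import stats

data = pd.read_csv("bem_exp5.csv")            # hypothetical file name
first50 = data.iloc[:50]                      # subjects before the condition change

neg = first50["neg_high_arousal_hit_pct"]     # placeholder column name
ctl = first50["control_hit_pct"]              # placeholder column name

# One-sample t-test of the negative/high-arousal hit rate against chance (50%)
res1 = stats.ttest_1samp(neg, 50)
print(f"negative/high arousal: mean = {neg.mean():.1f}%, "
      f"t({len(neg) - 1}) = {res1.statistic:.2f}, one-tailed p = {res1.pvalue / 2:.3f}")

# Paired t-test of the difference between the two kinds of trials
res2 = stats.ttest_rel(neg, ctl)
print(f"difference vs control: t({len(neg) - 1}) = {res2.statistic:.2f}, "
      f"two-tailed p = {res2.pvalue:.3f}")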

It's pretty clear that both reports are talking about the same data. The description of this experiment from 2003 states:


"For the PH studies, the pictures were divided into six categories defined by crossing 3 levels of valence (negative, neutral, positive) with 2 levels of arousal (low, high)...

The first, Experiment 101, was designed to see if the PH procedure would yield a significant psi effect on any kind of target. Accordingly, the 6 kinds of picture pairs composed by crossing 3 levels of valence (negative, neutral, positive) with 2 levels of arousal (low, high) were equally represented across the 48 trials of the session, 8 of each kind...

The results were clear cut: Only the negative/high arousal pictures produced a significant psi effect...

After the fact, then, this experiment can be conceptualized as comprising 8 negative trials and 40 low-affect (“control”) trials."

But the description of this experiment, eight years later, in "Feeling the Future," states:


"This first retroactive habituation experiment comprised trials using either strongly arousing negative picture pairs or neutral control picture pairs;"

There is no mention of the fact that Bem started by looking for an effect for any kind of target, not just negative/high arousal, nor that further experiments were planned on the basis of those results. And there is no mention that the "neutral controls" were a post-hoc compilation of pictures with a variety of valence and arousal levels, some of which were neither "neutral" nor "low arousal".

A key criticism of "Feeling the Future" is that the results likely do not represent a true effect if these reports are cherry-picked from among a larger pool of exploratory studies. Yet even in the recent email exchange with Dr. R., Bem states, "Nor did I discard failed experiments or make decisions on the basis of the results obtained." This is clearly false for at least one of the experiments.

In light of these findings, perhaps Dr. R. is right in asking for retraction of "Feeling the Future".

Wednesday, January 17, 2018

QRPs in Bem's Feeling the Future

I have seen mentioned (here for example: https://replicationindex.wordpress.com/2018/01/05/why-the-journal-of-personality-and-social-psychology-should-retract-article-doi-10-1037-a0021524-feeling-the-future-experimental-evidence-for-anomalous-retroactive-influences-on-cognition-a/) that there seems to be little scope for questionable research practices (QRPs) to have an effect on Bem's results. I thought I'd make a list of the potential QRPs I've identified as I've gone through the study and the research which Bem references in support.

Experiment 1
Pictures are rated on arousal (low to high) and valence (positive to neutral to negative), which allows for a variety of eminently justifiable ways of forming groups in which an effect is ‘expected’ or ‘not expected’. Plus, Bem mentions that a large number of ‘non-arousing’ trials were run along with the 36 trials he selected to report on. Note that he forms different groups in this study than he does using the same categories in experiments 5 and 6.
 
Experiment 2
The design allowed for three different outcomes to serve as the main outcome: the first 100 trials, the second 50 trials, or all 150 trials. (A toy simulation below illustrates how this kind of flexibility inflates the false-positive rate.)
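
This is the toy simulation mentioned above. It is my own illustration of the general problem, not a reconstruction of Bem's actual procedure or numbers: it generates pure-chance data and checks how often at least one of the three candidate windows comes out 'significant'.

# Toy simulation of outcome flexibility: under the null (hit rate = 50%),
# how often does at least one of three candidate analysis windows
# (first 100 trials, second 50 trials, all 150 trials) reach p < .05?
# This is my own illustration, not a reconstruction of Bem's procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_sims = 100, 2000
hits_single = hits_any = 0

for _ in range(n_sims):
    trials = rng.integers(0, 2, size=(n_subjects, 150))       # pure chance
    windows = [trials[:, :100], trials[:, 100:], trials]
    pvals = [stats.ttest_1samp(w.mean(axis=1), 0.5).pvalue for w in windows]
    hits_single += pvals[0] < 0.05            # one pre-specified outcome
    hits_any += any(p < 0.05 for p in pvals)  # best of the three outcomes

print(f"false-positive rate with one pre-specified outcome: {hits_single / n_sims:.3f}")
print(f"false-positive rate with a choice of three outcomes: {hits_any / n_sims:.3f}")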

Experiment 3 and 4
No explanation is offered for why the timing differs between the forward and backward conditions, both in the delay before the prime is presented and in how long the prime is presented. Once there are no restrictions on this, it allows for the possibility of testing multiple variations in timing. Priming experiments in the literature differ in how long the prime is presented (from subliminal to explicit) and in the interval between prime and picture presentation, with the finding that there is a window in which priming is most effective, after which the effect is lost as the interval increases. The forward priming trials fall within this window, while the retroactive trials are too long to do so. This raises the question of why.
Ratcliff’s recommendations for dealing with the right skew of the data are to use either cutoffs or transformations, not to transform data to which cutoffs have already been applied, as Bem did. The choice of cutoff or method of transformation has substantial effects on the power of the study, which then makes the false-positive risk, mentioned by Colquhoun, relevant. (The toy simulation below gives a sense of how much these choices can matter.)
Also, more results were excluded than the 4 subjects who had more than 16 errors. Trials in which errors were made were excluded across all subjects, which resulted in the exclusion of about 9% of the trials, in addition to those excluded by the choice of cutoff.
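
Here is that simulation, using made-up, right-skewed reaction-time data rather than Bem's: the same small priming difference is tested after different cutoff and transformation choices, and the resulting t and p values can shift.

# Toy illustration (made-up data, not Bem's): with right-skewed reaction
# times, the apparent size and significance of a small priming effect can
# shift depending on which cutoff and transformation are applied.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects = 50
# Lognormal RTs in ms; incongruent trials slightly slower on average
congruent = rng.lognormal(mean=6.50, sigma=0.4, size=(n_subjects, 40))
incongruent = rng.lognormal(mean=6.53, sigma=0.4, size=(n_subjects, 40))

def subject_means(rt, cutoff=None, log=False):
    # Per-subject mean after optionally dropping slow trials and/or log-transforming
    if cutoff is not None:
        rt = np.where(rt > cutoff, np.nan, rt)
    if log:
        rt = np.log(rt)
    return np.nanmean(rt, axis=1)

for cutoff, log in [(None, False), (1500, False), (2500, False), (2500, True)]:
    diff = subject_means(incongruent, cutoff, log) - subject_means(congruent, cutoff, log)
    res = stats.ttest_1samp(diff, 0)
    print(f"cutoff={cutoff}, log-transform={log}: "
          f"t({n_subjects - 1}) = {res.statistic:.2f}, p = {res.pvalue:.3f}")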

Experiment 5 and 6
These experiments were previously written up, so we can compare the original report with this new report. The original report describes presenting 6 categories of pictures (as per Experiment 1). There were multiple hypotheses available for use, depending upon which category or combination of categories was found to differ from chance, in either direction. For example, the idea these experiments were based on, Mere Exposure, would predict target preference in any category. Bem’s idea, Retroactive Habituation, predicts target preference or avoidance, depending upon the category.
There are trials in this report which were not included in the original report (at least 50), and there are sets of trials in the original report (at least 60) which have not been included in this report. In addition, trials which were originally reported as separate series are now combined and treated as though they were a single preplanned experiment in this report.

Experiment 7
The description of this experiment differs from the initial report, which included strongly negative and erotic pictures. Either Bem neglected to include the results from 146 of the subjects, or he neglected to include all the trials from each subject.

Experiment 8 and 9
The DR% is a novel outcome measure. Without the constraint of using an established measure, this allows for flexibility in how the outcome is defined.



Tuesday, August 22, 2017

The Birthday Problem

You are at a party chatting with the host when it is discovered that two of the guests have the same birthday.  "Wow, what are the odds of that?" asks your host.  People start throwing out answers:

"One in a million."
"One in 365."
"One in 365 squared."

"Even odds," is your response, whereupon everyone looks at you like you are nuts.  Intuitively, it seems a highly unlikely event.  After all, the likeliness that Cheryl and Louis would both be born on April 19 is remote (1/365 x 1/365)*.  Intuitively, we tend to substitute the probability of finding a pair with the probability of finding the pair which we found.

The same sort of response comes into play in parapsychology.  A correspondence is found between events (for example, a mediumship reading hits on a hummingbird tattoo, or a past-life regression finds the painter of a hunchback), and the likelihood of both events is too low to be due to chance.  Some sort of anomalous information must be present to account for the match, right?

If we go back to the question I asked in my previous blog post on evidence, “What would we expect in the absence of anomalous information?”, it turns out that we expect to find remarkable correspondences which have a very low probability of occurring by chance.

The Birthday Problem is a well-known puzzle which answers our host's question.  What is the probability that two people will share a birthday in a group of n people?  When there are 23 people (as your quick count at the party confirmed), the probability is 50%.  This answer seems counter-intuitive, as the probability of the match we found is very low.  It becomes less counter-intuitive when we realize that any two party members who happened to share any birthday would have been equally remarkable.  In a group of 23, there are 253 different pairs which can be formed when looking for a single match, which makes the task appear easier than we first realized (especially when we also drop the requirement that the matching birthdate is April 19).  The take-home message from the birthday problem is that the probability of finding a match is very different from the probability of the match you found.
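
The arithmetic is easy to check for yourself; a quick sketch:

# The birthday problem, assuming 365 equally likely birthdays (the same
# simplification flagged in the footnote below).
from math import comb

def p_shared_birthday(n):
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

print(f"distinct pairs in a group of 23: {comb(23, 2)}")                # 253
print(f"P(some shared birthday, n = 23): {p_shared_birthday(23):.3f}")  # ~0.507
print(f"P(two specific people both born April 19): {(1 / 365) ** 2:.2e}")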

When it comes to the issues discussed on parapsychology forums, including many of the anti-science issues like evolution denial, much use is made of the assumption that if events are extremely unlikely, then happenstance isn't a valid option. So we get Intelligent Design proponents who treat the (extremely low) probability that a mutation would produce a specific protein as evidence for an explanation involving God. Or we get Andrew Paquette publishing a paper demonstrating that it would be extremely unlikely for his dreams to correspond to events in real life (for the few dreams selected post hoc because they corresponded) unless there was anomalous information.

I'm not sure how to overcome this misunderstanding. Perhaps we can ask, “what other remarkable events could have come to our attention in the same way?” In the case of proteins and mutation, we are looking at a vast range of mutations which may undergo selection for a great variety of functions - not a single function selected post hoc. The emergence of a few dozen useful functions out of that environment seems less unexpected even in the absence of God. For a remarkable correspondence in one of Andrew Paquette’s “pre-cognitive” dreams, we are looking at tens of thousands of dreams in which there is potential for a correspondence to be found. That he was able to find a handful which were somewhat remarkable seems to be expected in the absence of anomalous information.

Remarkable stories which come to our attention after the fact can’t serve as evidence of the paranormal when this is also exactly what we'd expect to see in the absence of anomalous information. Yet from personal experience and surveys, the bulk of proponents will say that they believe because of their experience of a remarkable story (either their own or someone else’s). 

https://en.m.wikipedia.org/wiki/Birthday_problem

  * For the sake of forestalling another common probability error, the probability of an event is not one over the number of possible events unless the events are uniformly distributed, so the probability that someone was born on April 19 is not really 1/365.  Even though there are 365 days (excluding leap years) in a year, the distribution of births is not uniform.  The actual probability would have to be determined empirically, based on census data, and may be something like 1/378.


Thursday, August 10, 2017

Information that confirms an idea isn’t evidence.


Okay, this title seems counter-intuitive – how could information that supports an idea not be evidence for the idea? Sure, it may be “weak”, but it has to count at least a little, doesn’t it? Yet it turns out that, in most cases, data that supports an idea is more likely to be produced when the idea is false than when the idea is true.

It starts with the famous Ioannidis paper, “Why Most Published Research Findings Are False”. Table 4 outlines the positive predictive values under a variety of different types of investigation (the positive predictive value, or PPV, is the proportion of all positive findings that are true positives rather than false positives). Note that most of our discourse – people who claim to have had a weirdly accurate reading from a medium, conspiracy theories making the rounds of social media, experiences of horrific side-effects from vaccines/statins/<insert substance of choice here>, etc. – doesn’t even remotely reach the level of “exploratory study” with respect to rigor. But even with some element of ‘rigor’, a positive result from an exploratory study is still tens to hundreds of times more likely to be a false positive than a true positive. This means that the ability to “confirm” an idea (i.e. find information which supports the idea) says much, much more about how easy it is to find confirming information even when the idea is false, than it says about whether the idea is true.
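
To make the PPV idea concrete, here is the standard calculation with some illustrative numbers (my own choices, not the exact values from Ioannidis's Table 4, which also factors in bias):

# Positive predictive value: of all "positive" results, what fraction are true?
# PPV = power * prior / (power * prior + alpha * (1 - prior))
# The priors and power below are illustrative choices, not the exact
# values from Ioannidis's Table 4 (which also factors in bias).
def ppv(prior, power=0.8, alpha=0.05):
    return (power * prior) / (power * prior + alpha * (1 - prior))

for prior in [0.5, 0.1, 0.01, 0.001]:
    print(f"P(idea true before looking) = {prior:<6} -> PPV = {ppv(prior):.3f}")

The lower the prior plausibility of the idea, the more the positives are dominated by false positives, which is the situation most informal claims (and exploratory studies) find themselves in.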

We saw this in my prior post, where hummingbird statements, which are often used as proof that a particular medium’s reading depends on anomalous information, are also easily produced when the idea that mediums are tapping into anomalous information is false. This prevents us from being able to distinguish which of the great variety of contradictory and fantastical statements made about the afterlife may actually be true.

Unfortunately, confirmation bias tends to ensure that we spend our time looking for this confirming information, instead of looking for information that would help us distinguish between ideas that are true or false.

Another famous experiment, the Wason selection task, tests this bias: it asks you to turn over a card or two in order to determine whether a rule about those cards is true. Fewer than 10% of the people taking the test (even intelligent university students) pick cards which adequately test the rule. Almost everyone picks the card that would confirm the rule, but few also pick the card that could show it is false. This leads us to think that we are building evidence for an idea by finding more and more examples that confirm it, even when the idea is false.

When faced with evaluating whether something might be true, don’t look at the ‘evidence’ for the idea. Ask yourself, “what might I expect to see if this isn’t true?”  Perhaps what you’d expect to see if the idea isn’t true is pretty much the same as what you’d expect to see if the idea is true. If you spend even five minutes on Snopes looking at the plethora of false and unproven conspiracy theories out there, it becomes pretty obvious that no matter how sketchy the idea (Pizzagate anyone?), it’s pretty easy to build a case for it even when it’s false. The existence of ‘evidence’ may just tell you that ‘evidence’ is easy to produce, not that the conspiracy may be valid.

Monday, June 19, 2017

Hummingbird statements

"Hummingbird statements" is a term I came up with for those out-of-the-blue statements which seem eerily specific.  I got the idea from watching a video of a Dr. Phil episode where an inexperienced Skeptic was put up against an experienced Psychic in order to perform readings on audience members.  Of course, the Skeptic is shown performing poorly, while the Psychic is shown performing well, including getting an eerily accurate hit on a hummingbird tattoo.


From talking to people on Skeptiko, from my own experiences, from talking to friends and family, and from reading research on mediums, it seems to be hummingbird statements which drive people to believe in psychic abilities.  Or at least to give people pause in their skepticism.  What you hear over and over again is "there is no way they could have known that."  Our intuition tells us that these experiences are far too unlikely to be due to chance, and it becomes very hard to shake the feeling that something magical is happening when these kinds of hits are made.


So is this correct?  Are these experiences really so unlikely to occur by chance?  It turns out that they aren't.


One way to look at this is to consider whether these statements are as specific as they appear (i.e., is there really only one opportunity for a match?).  If we go back to the Dr. Phil show, the actual statement made by the Psychic was "I'm supposed to talk about a hummingbird."  There are multiple ways in which this could have been a hit, including specific hits.  For example, it could have been a hit for any of the other audience members, it could have referred to a hobby, a location, an occupation, etc., or it could have been one of the many statements made by the psychic that wasn't a hit and was simply passed over and forgotten.  If you go through a reading just keeping track of statements which have the potential to be a hummingbird statement (i.e. they have the potential to be regarded as eerily specific, if a match is made), you see that many such statements are made in rapid succession, and are simply discarded in the absence of a match.  Multiple opportunities dramatically increase the probability of a match, yet our intuition judges the probability as though only one opportunity was taken.
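
A back-of-the-envelope calculation, with made-up but not unreasonable numbers, shows how quickly those opportunities add up:

# Back-of-the-envelope: how likely is at least one "eerie" hit in a reading
# when many statements each have many ways to match? The numbers are made
# up for illustration; only the shape of the calculation matters.
p_single = 1 / 500           # chance that one statement matches one specific fact
opportunities = 30 * 10      # e.g. 30 probing statements x 10 possible referents each

p_at_least_one = 1 - (1 - p_single) ** opportunities
print(f"P(at least one eerie hit in the reading) = {p_at_least_one:.2f}")   # ~0.45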

But we don't even need to guess at whether or not these statements are more probable than they seem.  How often such statements arise by chance has been assessed in a few studies.


In Emily Kelly's study of mediums who seem to give accurate information, 12 of the 200 control readings and 14 of the 40 target readings generated comments like "I knew this was the correct reading," “I don’t see how it could be anything other than (X reading),” or the reading "stood out and contained many accurate descriptions".  One example of a hummingbird statement was a reference to the Wizard of Oz (in the control group).


In a different kind of study looking at the Ganzfeld studies, 20 out of 128 mentation reports were selected as having remarkable correspondences.  Fourteen of these remarkable correspondences occurred by chance, while six corresponded to the target picture (the rate of production of hummingbird statements was the same for "psi" as it was for "chance").


I mentioned in a previous blog post that the outcomes which are measured in mediumship studies are all over the place, which dramatically increases the probability of obtaining false positives.  I would propose that looking for hummingbird statements has strong external and face validity as an outcome measure.  That is, it seems to be the outcome measure people use informally to justify their belief in psi, and in the Kelly study, more hummingbird statements were generated in the target readings than in the control readings.

Monday, April 7, 2014

Shifting goalposts?

I commented on another blog about a study which had some similarities to the Sheldrake staring experiments.

http://thinkingdeeper.wordpress.com/2014/03/09/sheldrake-vs-ubc-the-same-experiment/

http://hct.ece.ubc.ca/publications/pdf/gauchou-rensink-cac2012.pdf

http://www.sheldrake.org/files/pdfs/papers/sensoryclues.pdf

While I was reading the UBC paper, I was aware that I felt less critical about the paper than I would be if it had been a parapsychology paper.  Considering Dean Radin's criticisms from my previous blog post, and my criticisms of Radin's presentation of the blessed tea study, is it fair for me to be any less critical of the UBC paper (or alternatively, more critical of parapsychology papers)?  After all, like Sheldrake's and Radin's papers, there were multiple ways offered to analyze the results, the findings were post hoc, and novel outcome measures were offered.

Or were they?

An important design choice in the UBC paper is highlighted by contrasting it with Radin's paper.  Radin gave two groups of people tea which had been blessed or not, and measured change in mood and the subject's belief that they were in the intervention group.  The authors of the UBC study asked people general knowledge questions explicitly and implicitly (through the use of a Ouija board), and measured accuracy and the subject's belief that they were guessing at the answers.  In both cases, the significant finding was an interaction between the intervention and the belief condition.  Amongst those who believed they received the blessed tea, those who actually received the blessed tea had more improvement than those who did not.  Amongst those who believed they were guessing, those who were asked general knowledge questions implicitly (via the Ouija board) performed more accurately than when asked those questions explicitly.

Why are Radin's findings likely false, while the UBC study findings may be true?  The biggest difference is that Radin's findings are post hoc, while those in the UBC study were pre-planned.  Post-hoc testing violates the assumptions which underlie statistical significance testing, which reduces the validity of the results.

How can we tell whether a finding is pre-planned vs. post hoc?  It is not sufficient for the researcher to state a comparison was pre-planned.  And merely choosing to measure a number of different variables does not qualify as pre-planning.  So we can look at other factors, such as experimental manipulation, descriptions of the planning, and the analysis of the results.

The UBC group deliberately manipulated the belief condition by selecting questions which the subject identified as guesses.  They were identified as "guesses" independently of the accuracy of the answer and independently of their use in the "Ouija" board condition.  This experimental manipulation must be pre-planned.  There was no equivalent in Radin's study.  To be equivalent, Radin would also need to manipulate the belief condition (in this case, by manipulating what information was given to the subjects).  Unlike the UBC study, "belief" was a dependent variable in Radin's study, so it wouldn't be possible to form groups on the basis of "belief" prior to the drinking of the tea.

Another way to tell whether a comparison was pre-planned is to look at which comparisons were used in the sample size calculations (if reported).  In the UBC study, there are no sample size calculations reported.  In Radin's study, he reports that the sample size was assumed to be adequate based on his intentional chocolate study.  In that study, mood level (not change in mood) on each day was compared between conditions and "belief" was not a reported variable.  Had "belief" been a pre-planned condition in the tea study, it should have been accounted for, in some way, in the sample size assessments.

Finally, a quick way to check whether a comparison was pre-planned is to look at whether all the subjects are included in the analysis and whether the reasons for any exclusions are independent of the outcome.  In the UBC study, the analysis included 21/27 of the subjects who participated in the study.  Exclusions were based on a lack of success (i.e. movement of the planchette without conscious interference) in the use of the Ouija board and were unrelated to the outcome.  Radin included only 40% of the subjects in his analysis, excluding more than half of the participants.  Thirty-two out of 221 were dropped for reasons unrelated to the outcome.  The remaining exclusions (101/221) were for reasons which were strongly related to the outcome.  It would be very unlikely that a researcher would pre-plan a comparison which would so dramatically violate the assumptions of significance testing.

To be fair, there is a good chance that the UBC study results are also false.  The sample size was small and it was somewhat exploratory, even if it was well-designed in comparison to Radin's study.  It will be interesting to see whether the findings hold up under attempted replications.

Linda