Tuesday, March 4, 2014

Outcomes, Part 1

My (very) informal survey of why other scientists and academics don't generally pay much attention to psi research reveals that they mostly assume that the purported effects are due to some sort of publication bias.  That is, the collection of studies presented as though they demonstrate an anomalous effect represents a selected sample of all the research performed - studies with positive results are presented for our attention while those with negative results quietly fade away.  There is an element of truth to this assumption, but it's more complicated than what we normally think of as publication bias.

A 'positive' study usually refers to one which demonstrates a "statistically significant" result.  A statistically significant result can represent a true-positive (a real effect is present and is responsible for the significant test result) or a false-positive (no real effect is present and the significant result arose by chance or bias).  John Ioannidis famously argued that most (or even all) of the positive results within a field may be false-positives.
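To see how that can happen, here is a back-of-the-envelope sketch of the arithmetic behind Ioannidis's argument, in Python (the specific numbers are my own illustrative assumptions, not figures from his paper):

    def positive_predictive_value(prior, power, alpha):
        """Probability that a statistically significant result is a true-positive."""
        true_positives = prior * power          # real effects correctly detected
        false_positives = (1 - prior) * alpha   # null effects crossing p < alpha
        return true_positives / (true_positives + false_positives)

    # Illustrative assumption: only 1 in 50 hypotheses tested in a field is true.
    print(positive_predictive_value(prior=0.02, power=0.8, alpha=0.05))
    # ~0.25 - roughly three out of four "positive" results would be false

Even when studies are well-powered and the nominal 5% error rate is respected, a field which mostly tests false hypotheses will mostly generate false-positives.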

There are a number of ways to boost the number of false-positive studies.  One of the easier ways is to exploit flexibility in outcomes.  The false-positive rate is meant to be 5% or less (the usual level chosen for significance testing is p < 0.05, where 0.05 is the alpha, or Type 1 "false-positive", error rate), but it can easily rise to 50% or higher once you violate the assumptions underlying significance testing by introducing multiple ways to choose an outcome measure.
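A quick simulation makes the inflation concrete.  The sketch below is my own illustration (the twelve "outcome measures" are just independent noise, so any significant result is a false-positive by construction): each simulated experiment tests every candidate outcome and counts as a "hit" if any of them reaches p < 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments, n_subjects, n_outcomes = 10000, 30, 12
    hits = 0

    for _ in range(n_experiments):
        # Null world: no real effect on any of the candidate outcome measures.
        data = rng.normal(0.0, 1.0, size=(n_subjects, n_outcomes))
        # One-sample t-test of each outcome measure against zero.
        pvals = stats.ttest_1samp(data, popmean=0).pvalue
        # Flexible analysis: report whichever outcome happens to look significant.
        if pvals.min() < 0.05:
            hits += 1

    print(hits / n_experiments)   # ~0.46 rather than the nominal 0.05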

One way to address this issue is to use only valid outcome measures, and preferably the same one for each type of study.  Any time you perform an experiment or study, you have a particular outcome in mind which you are interested in observing.  This is sometimes called the "dependent variable" (as opposed to the "independent variables", the characteristics which you think may alter the outcome).  Depending upon the circumstances, there may be multiple ways to measure that outcome - some valid and reliable, and some not.  A valid measure is one which actually captures the outcome of interest.  If you want to know how tall someone is, measuring their length with a ruler would be a valid measure, while measuring their weight would not be.  Sometimes it is obvious whether or not a measure is valid, but often it is not.  For example, how would you measure how "big" someone is?  A reliable measure is one which gives the same result whether someone else does the measuring (inter-rater agreement) or the same person does the measuring at different times (intra-rater agreement).
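Reliability, at least, is straightforward to quantify.  One standard approach is Cohen's kappa, which measures how often two raters agree beyond what chance alone would produce.  A minimal sketch (the two judges and their ratings here are hypothetical):

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Agreement between two raters, corrected for chance agreement."""
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Chance agreement: the probability both raters land on the same
        # category, given how often each rater uses each category.
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
        return (observed - expected) / (1 - expected)

    # Two hypothetical judges scoring ten statements as accurate (1) or not (0):
    judge_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    judge_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
    print(round(cohens_kappa(judge_1, judge_2), 2))   # 0.47

A kappa near 1 indicates strong agreement; a kappa near 0 means the raters agree no more often than chance would predict, which makes the measure useless no matter how valid it looks on paper.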

Establishing a valid outcome measure is often difficult.  The example I am going to use is mediumship research, as this topic came up recently on the Skeptiko forum (http://www.skeptiko.com/forum/threads/why-skeptics-are-wrong-podcast.568/page-4#post-15416).

Mediums receive visual, auditory, and other sensations (e.g. scent or emotion) which they interpret as coming from a connection to a deceased person ("discarnate").  A mediumship reading usually involves three components - identifying the discarnate for a recipient (the living person who receives the reading), verifying that the source of the information is the discarnate (usually by offering information which is regarded as accurate and specific to the recipient), and conveying messages from the discarnate to the recipient.  Research generally focuses on the verification component, since it is this aspect which speaks to the ideas of psi and survival of consciousness.  So what would be a valid way to measure "specific and accurate information has been received from a discarnate"?  If we look at how that question has been answered in the most rigorous of the mediumship studies (Robertson and Roy; Beischel and Schwartz; Kelly and Arcangel), we find 21 different answers in just 3 studies.

Robertson and Roy measured the dependent variables by breaking down each reading into individual statements and then recording:
- the number of statements accepted as true by each participant
- the number of participants who accepted a given statement
- the total number of statements

Beischel and Schwartz recorded:
- the placement of each general statement into 1 of 5 accuracy categories
- the placement of each of 4 Life Questions into 1 of 5 accuracy categories
- the placement of each of the Reverse Questions into 1 of 5 accuracy categories
- the placement of each general statement into 1 of 4 emotional categories
- the placement of each of 4 Life Questions into 1 of 4 emotional categories
- the placement of each of the Reverse Questions into 1 of 4 emotional categories
- a written explanation of each general statement placed into 1 of 2 accuracy categories
- a written explanation of each of 4 Life Questions placed into 1 of 2 accuracy categories
- a written explanation of each of the Reverse Questions placed into 1 of 2 accuracy categories
- a global numerical score for each reading on a 7-point scale
- the choice of 1 of 2 readings
- a rating of that choice on a 5-point scale

Kelly and Arcangel recorded:
Study 1
- the accuracy of each statement on a 5-point scale
- the significance of each statement on a 5-point scale
- the choice of 1 of 4 readings
Study 2
- the accuracy of each reading on a 10-point scale
- the rank of each reading within a group of 6, based on the scores (there were ties, including ties for first place)
- the choice of 1 of 6 readings
- written comments on each of the readings

What is striking about this list is not just the sheer number of different outcomes, but that no two of them are the same across the three studies.  Even the accuracy of individual statements is measured differently in each study.  These outcome measures cannot all be valid, since they do not all come to the same answer.  So it becomes important to ask whether any of them are valid, and if so, which ones.  The list also shows why concerns about a grossly inflated false-positive rate are legitimate.

It was suggested that all that was needed to perform more rigorous mediumship research was to repeat the Beischel study with a larger sample size.  However, such a replication is close to a coin flip for producing a false-positive result until it is determined which one (if any) of the 12 different outcome measures is valid.
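For a rough sense of the scale of the problem (treating the 12 outcomes as statistically independent, which they are not in practice, so this is only an approximation):

    alpha, k = 0.05, 12   # nominal error rate; number of candidate outcomes

    # Chance of at least one "significant" result in a single null experiment:
    print(1 - (1 - alpha) ** k)   # ~0.46

    # Pre-specifying a single primary outcome, or applying a Bonferroni
    # correction (testing each outcome at alpha/k), restores the nominal rate:
    print(alpha / k)              # p < ~0.004 per outcome

And a larger sample size does nothing to fix this - the inflated error rate comes from the number of outcomes, not the number of participants.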

Linda


Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undetected flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. http://pss.sagepub.com/content/22/11/1359.full

Beischel, J., & Schwartz, G. E. (2007). Anomalous information reception by research mediums demonstrated using a novel triple-blind protocol. EXPLORE, 3(1), 23-27. http://deanradin.com/evidence/Beischel2007.pdf

Kelly, E. W., & Arcangel, D. (2011). An investigation of mediums who claim to give information about deceased persons. Journal of Nervous and Mental Disease, 199(1), 11-17. http://deanradin.com/evidence/Kelly2011.pdf

Robertson, T. J., & Roy, A. E. (2004). Results of the application of the Robertson-Roy protocol to a series of experiments with mediums and participants. Journal of the Society for Psychical Research, 68.1.
