CSU Hayward

Statistics Department

Quiz Question 18:
Testing the Claims of Therapeutic Touch


Background on Therapeutic Touch

Although rooted in mysticism, therapeutic touch (TT) has been widely used in practice by nurses and other health professionals and is alleged to have a scientific basis. Its proponents state that TT is taught in more than 100 colleges and universities worldwide, that more than 100,000 people have been trained in the technique with more than 20,000 current practitioners who are health professionals, and that it is practiced in at least 80 hospitals in North America.

Various practitioners of TT have claimed a wide variety of benefits from its use: it promotes relaxation, relieves pain and inflammation, heals burns, ulcers, and broken bones, relieves symptoms of Alzheimer's disease and AIDS, and promotes healing in an almost unlimited number of other physiological disorders. Actually, the word "touch" is not entirely appropriate because, in the most common application, practitioners of TT move their hands a few inches (typically 3 or 4) above the patient's skin without making actual physical contact.

Using some of the language of TT practitioners, we summarize a typical therapeutic session as follows:

Testing the Claims of Therapeutic Touch

An important claim of the theory and practice of TT is the ability of practitioners to detect human energy fields (HEFs). Moreover, they must sense HEFs with sufficient reliability and duration to permit assessment and repatterning. Sessions typically last from 20 to 30 minutes.

To test the ability of TT practitioners to detect HEFs, Emily Rosa organized a small-scale study in 1996, using 15 volunteer practitioners as subjects. Subjects were seated behind an opaque screen with two holes, through which they placed their hands. Out of view on the other side of the screen the investigator held her hand above either the practitioner's left or right hand (as determined by a coin toss). After taking a desired length of time to "detect" the investigator's HEF, the subject indicated whether it was felt through the left or the right hand. (Preliminary runs, testing the apparatus with subjects who are not TT practitioners, satisfied the experimenters that such factors as body heat and air movements did not provide relevant tactile cues.)

For each of the 15 subjects, the number of correct identifications (as to left or right) out of ten trials was recorded. The sample mean number of successes was 4.67 with a standard deviation of 1.74. (Out of the 150 tries, there were 70 successful identifications altogether.) The most successful score, achieved by one subject only, was 8 successes in 10 trials.

In 1997 a follow-up study, using the same design, was conducted using 13 volunteer TT practitioner subjects (7 of them from the first study, along with 6 new subjects). For this second study the mean was 4.08 successes with a standard deviation of 1.44 (altogether, 53 successes out of 130 tries).

For the two studies together (15 + 13 = 28 runs of 10 trials each) the numbers of correct responses per run are summarized as follows:

Correct      Number
Responses    of Runs
--------------------
   0           0
   1           1
   2           1
   3           8
   4           5
   5           7
   6           2
   7           3
   8           1
   9           0
  10           0
--------------------
Total          28
Average         4.39

These results were reported in Rosa et al. (1998) Journal of the American Medical Association Vol. 279, No. 13, pages 1005-1010. With regard to the summary information on 28 runs, the only specific information in the published article about the scores of the 7 repeat subjects is that the one subject with 8 successes in the first study achieved only 6 successes in the second.
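The totals in the combined table can be checked with a few lines of Python. This is only a sketch; the dictionary name `runs_by_score` is ours, not something from the article.

```python
# Frequency table from the text: correct responses per run -> number of runs.
runs_by_score = {0: 0, 1: 1, 2: 1, 3: 8, 4: 5, 5: 7,
                 6: 2, 7: 3, 8: 1, 9: 0, 10: 0}

total_runs = sum(runs_by_score.values())
total_correct = sum(score * count for score, count in runs_by_score.items())
mean_correct = total_correct / total_runs

# The total number of successes should equal 70 + 53 from the two studies.
print(total_runs, total_correct, round(mean_correct, 2))
```

The number of successes recovered from the table agrees with the 70 + 53 successes reported for the two studies separately.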

Questions

1. State specifically the question that these studies were apparently designed to answer. What population and population parameter are of interest? Formulate appropriate null and alternative hypotheses. Should the alternative be one- or two-sided? Would a confidence interval for this parameter also be useful?

2. Reported above are various facts about the results of the study: the total number of successes by all subjects in each study, the mean and the standard deviation of the numbers of successes by each subject in each study, the combined table of results from both studies, etc. Specifically, how would you use the reported results to test your null hypothesis?

3. Give the test statistic and its distribution (or approximate distribution). What do you conclude about the value of the parameter of interest? If you reject the null hypothesis, give the P-value. If not, perform a relevant power computation. Provide and interpret appropriate confidence intervals.

4. Rosa et al. (1998) conclude "... that TT claims are groundless and that further use of TT by health professionals is unjustified." Do you believe that this conclusion is warranted on the basis of the information provided above? Defend your answer.


Answers

Question 1

The purpose of the study is to see whether TT practitioners can correctly identify through which hand they feel the HEF of the investigator more often than would be the case by chance alone. With p = P(Success) we want to test the null hypothesis H0: p = 1/2 against the two-sided alternative Ha: p ≠ 1/2.

Strictly speaking, one might insist on the one-sided alternative that p > 1/2 on the grounds that a TT practitioner who knows what he or she is doing would have many more successes than failures. However, it would still seem to us an important finding if TT practitioners were more often wrong than could be accounted for by chance alone. In the extreme case, a practitioner who consistently selected the wrong hand must still be sensing something about HEF, even if he or she is confused as to what.

A confidence interval for p that includes 1/2 provides no evidence of ability to sense HEF.

Question 2

We explore several approaches.

Aggregated number of successes for a study. In the first study there were 70 successes out of a total of 150 tries by all 15 subjects. If subjects have differing levels of ability to detect HEFs then it is inappropriate to treat these as 150 independent Bernoulli trials. Presumably, results among the 10 trials for one subject would be more similar than results among trials for different subjects. However, under H0 each trial for each subject has p = 1/2.

In 150 tosses of a fair coin there is about 95% probability of seeing a number of Heads in the range 75 ± 12. So even if the 150 tries were all from a single subject, 70 successes would certainly provide no evidence of ability to detect HEF. A total number of successes less than 75 certainly does not speak highly of the claim that these 15 TT practitioners know what they are doing.
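As a rough check on the 75 ± 12 range, the exact binomial probabilities can be summed directly. The sketch below assumes 150 independent fair-coin trials, which is the null model only; the helper name `binom_pmf` is ours.

```python
from math import comb, sqrt

n, p = 150, 0.5

def binom_pmf(k, n, p):
    """Exact Bino(n, p) probability of k successes."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Normal-approximation half-width: 1.96 standard deviations.
sd = sqrt(n * p * (1 - p))
half_width = 1.96 * sd

# Exact probability that the count falls in 63..87 (that is, 75 +/- 12).
prob_in_range = sum(binom_pmf(k, n, p) for k in range(63, 88))

# Exact probability of 70 or fewer successes under pure guessing.
prob_70_or_fewer = sum(binom_pmf(k, n, p) for k in range(0, 71))

print(round(half_width, 1), round(prob_in_range, 3), round(prob_70_or_fewer, 3))
```

The exact probability for the range is close to the nominal 95%, and 70 or fewer successes is an unremarkable outcome under guessing.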

Number of successes for each subject. Stated in terms of the mean number of successes out of 10 trials, the null hypothesis is H0: μ = 5. Under the null hypothesis each subject generates an observation from Bino(10, 0.5), which is approximately normal with mean 5, variance 2.5, and standard deviation 1.58. Thus, this is one of the rare cases in which a traditional z-test is appropriate. For the first study with n = 15, the variance of the sample mean is 2.5/15 = 0.167, so the standard error is 0.408. Thus, the z-statistic is z = (4.67 – 5.00)/0.408 = –0.81, which is clearly within the acceptance region (critical values ±1.96 for a 5% level two-sided test). The 95% confidence interval for μ is (3.87, 5.47).

The analysis of the second study, with n = 13, is similar: z = –2.10, which falls just outside the critical values ±1.96; CI = (3.22, 4.94).
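The z-statistics and confidence intervals for both studies can be reproduced with a short Python sketch. It assumes the known null variance 2.5 and uses the rounded sample means as reported; the function name `z_test_mean` is ours.

```python
from math import sqrt
from statistics import NormalDist

MU0 = 5.0    # mean of Bino(10, 0.5) under H0
VAR0 = 2.5   # variance of Bino(10, 0.5) under H0

def z_test_mean(xbar, n, mu0=MU0, var0=VAR0, level=0.95):
    """Return (z, (lower, upper)) for a known-variance z-test and CI."""
    se = sqrt(var0 / n)                          # standard error of the mean
    z = (xbar - mu0) / se
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return z, (xbar - z_crit * se, xbar + z_crit * se)

# Rounded sample means as reported for the 1996 and 1997 studies.
z_1996, ci_1996 = z_test_mean(4.67, 15)
z_1997, ci_1997 = z_test_mean(4.08, 13)

for label, z, ci in [("1996", z_1996, ci_1996), ("1997", z_1997, ci_1997)]:
    print(label, round(z, 2), tuple(round(b, 2) for b in ci))
```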

The article by Rosa et al. in JAMA shows a similar analysis but uses the one-sample t-test. Because the population standard deviation is known, this is an incorrect procedure. However, the t-distribution with 14 (or 12) degrees of freedom is not far from normal and the conclusions are the same.

Aggregating the two studies. If we ignore the overlap of 7 subjects between the two experiments, we can use a z-test with observations from 28 "runs." Because of the overlap of these subjects, however, we do not really have 28 independent observations. An approximately correct procedure may be to take the average of the 28 as a proxy for what we might have gotten if we had done a single study with 21 subjects. Then we would use n = 21 to compute the standard error.
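To see how much the choice of n matters here, the following sketch computes the z-statistic for the pooled mean with n = 28 (ignoring the overlap) and with n = 21 (the proxy suggested above). Both are approximations under the stated assumptions, not the analysis in the article; the function name is ours.

```python
from math import sqrt
from statistics import NormalDist

# Pooled mean over all 28 runs: 123 successes in total.
pooled_mean = (70 + 53) / 28

def z_and_two_sided_p(xbar, n_eff, mu0=5.0, var0=2.5):
    """z-statistic and two-sided P-value, treating n_eff runs as independent."""
    z = (xbar - mu0) / sqrt(var0 / n_eff)
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return z, p_two_sided

z28, p28 = z_and_two_sided_p(pooled_mean, 28)   # ignores the 7 repeat subjects
z21, p21 = z_and_two_sided_p(pooled_mean, 21)   # proxy: 21 distinct subjects

print(round(z28, 2), round(p28, 3))
print(round(z21, 2), round(p21, 3))
```

Under these assumptions z is about –2.03 with n = 28 and about –1.76 with n = 21, so whether the pooled result crosses the two-sided 5% critical value depends on how the overlap is handled.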

Defining a "Passing" performance. In what is not much more than an incidental remark, the JAMA article proposes the standard that a subject with 8 or more successes "Passes" the test of being able to identify HEF. If X is Bino(10, 0.5) then P(X ≥ 8) = 56/1024 = 0.055. [The article defends this with the irrelevant and misleading comment that P(X = 8) = 45/1024 = 0.044. A correct criterion based on a two-sided alternative would be that the subject "passes" at distinguishing HEF from non-HEF (without necessarily knowing which is which) if he or she obtains 0, 1, 9, or 10 successes.]

Using 8 or more as Passing, we have only 1 passing performance out of 21 subjects, if we charitably ignore the fact that the one subject who passed in the first study didn't pass in the second. Due to chance alone we would "expect" (21)(0.055) = 1.2 passes out of 21 randomly chosen subjects. So we certainly have no evidence of significantly more than the expected number of passes. If Y is the number of Passes out of 21 subjects then the null hypothesis gives the distribution of Y as Bino(21, 0.055) and 4 or more Passes might be taken as evidence that more subjects are passing than by chance alone, even though such a small number of Passes would hardly be a ringing endorsement of the ability of TT practitioners to detect HEF. [Using the two-sided criterion mentioned above, the only "Pass" is on the "wrong" side, a score of 1 success out of 10, where we expect (21)(22/1024) = 0.45 such passes by chance; it would take 3 or more of these passes to establish some sort of HEF effect.]
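The thresholds of 4 (one-sided criterion) and 3 (two-sided criterion) can be checked with exact binomial tail sums. The sketch below assumes 21 independent subjects and a 5% level; the helper names are ours.

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact Bino(n, p) probability of k successes."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p):
    """P(Y >= k) for Y ~ Bino(n, p)."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

def smallest_significant_count(n, p, alpha=0.05):
    """Smallest k with P(Y >= k) <= alpha, or None if there is none."""
    for k in range(n + 1):
        if upper_tail(k, n, p) <= alpha:
            return k
    return None

# Chance of a single subject "passing" under pure guessing.
p_pass = upper_tail(8, 10, 0.5)                               # 8, 9, or 10 correct
p_pass_two_sided = sum(binom_pmf(k, 10, 0.5) for k in (0, 1, 9, 10))

print(round(p_pass, 4), round(21 * p_pass, 2),
      smallest_significant_count(21, p_pass))
print(round(p_pass_two_sided, 4), round(21 * p_pass_two_sided, 2),
      smallest_significant_count(21, p_pass_two_sided))
```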

Question 3

Following the reasoning of the "Number of successes for each subject" approach in Question 2, we model each subject as generating an observation from a binomial distribution with n = 10 and p = 0.5. Since this population distribution is approximately normal and its variance under H0 is known, we may use the z-statistic instead of the t-statistic.

Test 1: We first consider the results of a one-sided hypothesis test for the first study (similar to the approach in the article). We have H0: p = 1/2 against Ha: p > 1/2 or, equivalently, H0: μ = 5 against Ha: μ > 5.

Computing the value of the statistic (15 subjects, σ² = 2.5) yields z = –0.81, which is not significant at the .05 level.

Power: If we take p = 0.5 under the null hypothesis and p = 0.6 under the alternative hypothesis, we have μ0 = 5 and μa = (10)(0.6) = 6. Now, computing β:

$$\beta = P\!\left(Z < z_{.05} - \frac{|\mu_0 - \mu_a|}{\sigma/\sqrt{n}}\right) = P\!\left(Z < 1.65 - \frac{|5 - 6|}{\sqrt{2.5/15}}\right) = P(Z < -0.80) = 0.2119.$$

The corresponding power is 1 – β = 0.7881.
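The power computation can be reproduced numerically. This sketch follows the text in keeping the null variance 2.5 under the alternative as well (the binomial variance at p = 0.6 would be 2.4), so it is an approximation; the function name is ours.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mua, var0, n, alpha=0.05):
    """Approximate power of a one-sided, known-variance z-test of mu0 vs mua."""
    std_normal = NormalDist()
    se = sqrt(var0 / n)                      # standard error of the sample mean
    z_alpha = std_normal.inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    beta = std_normal.cdf(z_alpha - abs(mu0 - mua) / se)
    return 1 - beta

# First study: n = 15, H0 mean 5, alternative mean 6 (that is, p = 0.6).
power = power_one_sided_z(mu0=5, mua=6, var0=2.5, n=15)
print(round(power, 2))
```

Because the code uses unrounded values of the critical point and the standard error, its result differs slightly in the third decimal place from the hand computation with 1.65 and –0.80.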

Confidence Interval: $\left(4.67 - 1.96\sqrt{2.5/15},\; 4.67 + 1.96\sqrt{2.5/15}\right) = (3.87,\ 5.47)$.

Test 2: Following the line of reasoning given in the answer to Question 1, we now consider the results of a two-sided hypothesis test for the second study: H0: μ = 5 against Ha: μ ≠ 5.

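Computing the value of the statistic (13 subjects, σ² = 2.5) yields z = –2.10. Since P(Z < –2.10) = 0.0179, the two-sided tail probability is 2(0.0179) = 0.0358, and we barely reject H0 at α = .05. That is, according to this model, the results are not consistent with subjects who, on average, were simply guessing; the departure, however, is in the direction of fewer successes than chance would give.

P-value: For the two-sided test, P = 2 P(Z < –2.10) = 2(0.0179) = 0.0358, or about 0.036.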

Confidence Interval: $\left(4.08 - 1.96\sqrt{2.5/13},\; 4.08 + 1.96\sqrt{2.5/13}\right) = (3.22,\ 4.94)$.
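The tail areas for z = –2.10 can be checked with the standard normal distribution in Python's standard library. The sketch computes both the lower-tail area and the two-tailed area, since the latter is the one that matches a two-sided alternative.

```python
from statistics import NormalDist

z = -2.10
std_normal = NormalDist()

lower_tail = std_normal.cdf(z)             # P(Z < -2.10)
two_sided = 2 * std_normal.cdf(-abs(z))    # P(|Z| > 2.10)

print(round(lower_tail, 4), round(two_sided, 4))
```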

Question 4

An alternative approach to a standard significance test is that of the "passing score" idea. Rather than looking at the mean success rate for a group of subjects and attempting to find a departure from chance, we could establish a cut-off point for the number of successes. Subjects scoring at or above this number (say, 8 out of 10) might be considered to have a true ability to detect some sort of physiological "pattern" while those scoring below the cut-off point would be regarded as lacking said ability.

Note, however, that this approach assumes the existence of such an ability to begin with and so might very well be viewed as begging the question. At any rate, one might argue that such a method could only be justified if the results of an initial study clearly identified individuals who were capable of consistently scoring better than by chance alone.

In question 3 we saw that by using the z-test for the second group (n = 13) we can no longer accept the null hypothesis at the .05 level of significance. Whether or not this is seen as evidence for an ability possessed by some of the subjects is still quite debatable. Certainly the failure of the first group to do better than chance would suggest caution in taking this point of view. And the p-value for the second group is not that small. Also, there is the fact that seven of the subjects in the second group had also been part of the first group.

To play the devil's advocate, it is clearly possible to come up with many explanations for why the TT practitioners failed to perform convincingly. For instance, one might argue that by its very nature, therapeutic touch relies on the compliance of a cooperative patient in order for the process to work. Alternatively, a true illness or "disruptive force" might have to be present in order for the TT practitioners to actually detect something.

The question still remains as to why therapeutic touch has gained some popularity among patients and practitioners. To be fair, it would seem quite possible that some real clinical effects could be observed as a result of the "treatment" administered by TT practitioners. But to attribute beneficial effects to anything other than positive social interaction would seem overly ambitious.


Copyright © 1999 by Bruce E. Trumbo, Chris M. Fraser. All rights reserved. Intended for instructional use at California State University, Hayward. Please request permission in advance for other uses: btrumbo@csuhayward.edu.