Eppur si muove

Things I Think About, by Peter Bowditch

Psychological testing

Australasian Science - October 2005

Every so often the matter of psychological testing comes up in discussion among skeptics, with opinions varying on where such tests fit on the spectrum of scientific activity. Usually the majority think that psychological testing is about as scientific as the study of alien abductions or the memory of water, but there are sometimes a couple of people prepared to defend the tests. I topped my class at university in the course about the design and interpretation of psychological tests, and my take on them is that they may be very useful if used appropriately, but they are also a very good way of illustrating the meaning of the terms “reliability” and “validity”. “Validity” is the relationship of the findings to the real world, and “reliability” is the reproducibility of the results. It is possible for something to be reliable but not valid, but it is impossible for the opposite to be true. You can print out this page and use the ruler below to measure things. It won’t matter if you use it to measure feet, firkins, furlongs or femtometres, it should produce very close to the same measurement each time and is therefore a reliable measuring instrument. As a valid measure of anything, though, it would be useless (unless you were a crook selling something by length to someone who had never seen a ruler).

A ruler
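
For anyone who likes to see the distinction in numbers, here is a minimal Python sketch of the ruler idea. Everything in it is an assumption made for illustration: the 10 cm object, the markings that read 25% high, the size of the reading noise, and the function names `read_ruler` and `spread_and_bias`. It takes fifty readings and reports how much they differ from each other (reliability) and how far they sit from the true length (validity).

```python
import random
import statistics


def read_ruler(true_length, scale, noise_sd, rng):
    """One reading from a ruler whose markings are stretched by `scale`."""
    return true_length * scale + rng.gauss(0.0, noise_sd)


def spread_and_bias(readings, true_length):
    """Return (spread of the readings, average error against the true length)."""
    spread = statistics.stdev(readings)
    bias = statistics.mean(readings) - true_length
    return spread, bias


rng = random.Random(2005)   # fixed seed so the run is repeatable
TRUE_LENGTH = 10.0          # hypothetical object, in centimetres

# Hypothetical printout whose markings read 25% high, with tiny reading noise.
readings = [read_ruler(TRUE_LENGTH, scale=1.25, noise_sd=0.02, rng=rng)
            for _ in range(50)]
spread, bias = spread_and_bias(readings, TRUE_LENGTH)

print(f"spread between readings: {spread:.3f} cm")  # small: reliable
print(f"average error: {bias:.3f} cm")              # large: not valid
```

The readings agree with each other to within a fraction of a millimetre, yet all of them are wrong by about the same amount. That is reliability without validity.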

I remember being asked once by an employment agent if I had any objection to being asked to do a psych test for a potential employer. I told them that I had no objection at all, because unless they could tell me what the test was and how it predicted any aspect of job performance the requirement for a test disqualified the potential employer and saved me the wasted time of interviews. I didn’t get the job. In one case where I did a test, it was simply a process of following some logical paths to reach conclusions based on information provided as part of the test. I was told that it would take about three hours to do the test. I finished in about 45 minutes, so I thought that I must have done something wrong. The only way to test the answers was to do the test again, which this time took three-quarters of an hour.

I was told that I was the first ever applicant to get all the answers correct, but even this wasn’t enough to get me the job. I didn’t care, really, because I didn’t want to work with people who were so dumb that they could get any of the test answers wrong. This appeared to be one of those tests which was highly reliable, but had no validity in the situation in which it was used. (I later found out that the person who would have been my boss was a misogynist creep who groped women at parties and all the programmers employed there really were brainless nincompoops. Lucky escape!)

In one of those discussions between skeptics recently, the matter of the Myers-Briggs test came up. This is a multiple-choice test which purports to place test subjects along several spectra or axes of personality traits. I went off and did a Myers-Briggs test and I am ENFJ:

moderately expressed extrovert (44%)
moderately expressed intuitive personality (50%)
moderately expressed feeling personality (38%)
moderately expressed judging personality (56%)

That sounds like me, especially all those “moderately” measurements. To get these results I answered the questions more-or-less honestly (and I do know something about self-serving bias in personality tests). The dangers in using the results of a test like this, however, are at least twofold. First, it is only a single test and can be done in a short time. For it to have validity, other tests need to be taken at the same time so that the results can be corroborated. I do know of people using just a single test for employment selection, and this makes the choices suspect. Second, it has an inherent reliability problem. Two actually – results can vary from time to time just because people feel different on different occasions, and anyone who knows how the test works (and which questions have special significance in scoring) can adjust the results. This is another reason for using batteries of tests – if each has a different reliability, the overall reliability of the collection can be improved. I know that if I were to be interviewing next Tuesday for the position of Promotions Manager for the Anthony Robbins outfit, my Myers-Briggs results would look nothing like the table above. And on Thursday, when I was going for Nursing Manager at a palliative care hospice, there would be a different picture again.
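
One common way to put a number on the "results vary from time to time" problem is test-retest reliability: give the same people the same test on two occasions and correlate the two sets of scores. The Python sketch below does that with a hand-written Pearson correlation. The scores are invented for illustration (they are not real Myers-Briggs data), and the names `pearson`, `stable_retest` and `moody_retest` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented scores for six people on the first sitting of a test.
first_sitting = [10, 20, 30, 40, 50, 60]

# Second sitting of a test that tracks something stable.
stable_retest = [12, 19, 33, 41, 48, 62]

# Second sitting of a test that mostly tracks how people felt on the day.
moody_retest = [40, 15, 55, 20, 35, 30]

print(f"stable test:  r = {pearson(first_sitting, stable_retest):.2f}")
print(f"mood-driven:  r = {pearson(first_sitting, moody_retest):.2f}")
```

A correlation near 1 means people keep roughly the same ranking from one sitting to the next. A correlation near 0 means the second sitting tells you almost nothing about the first, which is a problem if the test is being used as though it measured something stable.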

As I said above, there can be no validity without reliability. That is why we demand that experimental results in all areas of science be reproducible. It is especially important if those results suggest that the world is not as we think it is. Carl Sagan said that extraordinary claims require extraordinary evidence. He forgot to add that the evidence needs to be found more than once.

Something often associated in the public’s eye with psychological testing is IQ. I had my IQ measured recently and I came in a couple of points below where I was when I was twelve years old. Does that mean I am less smart now? No, it doesn’t. In fact, as IQ is a quotient where age is the divisor, I must be a hell of a lot smarter now than I was back then. The difference is that now I don’t think I am anywhere near as smart as I thought I was when I was a teenager.
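
The arithmetic behind that quip can be spelled out using the old "ratio" definition of IQ, which divides mental age by chronological age and multiplies by 100. The ages and scores below are hypothetical figures chosen only to make the sums easy; they are not the actual results, and modern tests generally report deviation scores instead of this ratio.

```latex
\[
\mathrm{IQ} = 100 \times \frac{\text{mental age}}{\text{chronological age}}
\quad\Longrightarrow\quad
\text{mental age} = \frac{\mathrm{IQ} \times \text{chronological age}}{100}
\]
\[
\text{At 12 with IQ } 130:\ \frac{130 \times 12}{100} = 15.6
\qquad
\text{At 50 with IQ } 128:\ \frac{128 \times 50}{100} = 64
\]
```

On those made-up numbers, a drop of two IQ points still goes with a mental age roughly four times higher, simply because the divisor has grown.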

A version of this article appeared as the “Naked Skeptic” column in the October 2005 edition of Australasian Science


4 thoughts on “Psychological testing”

  • Guy Curtis says:

    It is not 100% accurate to say there can be no validity without reliability. One form of reliability that is often assessed in psychological tests is called test-retest reliability – this is where you give the test to people on separate occasions and look for a correlation between the scores. A test is usually seen as reliable if scores correlate strongly. Where a construct is supposed to be stable, such as personality or IQ, test-retest reliability reflects on validity: a valid measure of a stable construct should have high test-retest reliability. However, a valid measure of something that changes, such as current mood, should have low test-retest reliability – low reliability of this kind for a test of this type actually speaks to its validity. Obscure point I know, but worth mentioning.

    • Fair points, but I wasn’t writing this as a paper for a scientific audience.

      My real objection is to the misuse of these tests, both through ignorance and through laziness. I’ve seen places use a single Myers-Briggs test alone as part of a recruitment process. M-B has low test-retest reliability because it is so influenced by mood, but it is being used as if it measures personality which is relatively stable.

  • Michael Kingsford Gray says:

    This reminds me of a lesson that I learned whilst young:
    Repeatability, Resolution, Accuracy are very different entities.
    (And clearly defined entities as well.)

    I garnered this gem from a lecture concerning Hewlett-Packard analog plotter-devices, of all things, but it has stuck with me as a mathematical truism.

  • david poole says:

    Great blog. Made me laugh so much, because it is just so spot on. Keep the comments coming.