It is hard to overestimate the importance of measurement in science.
First, it can be easy to come up with lots of theories to account for a particular set of data, although the best theories come from the best minds. In most biological sciences, such as psychology, theorizing is cheap and data is expensive. This may be because theorizing can be done in a warm office with pen, paper, and computer. Data gathering, on the other hand, is a hard slog, and much of the skill in science lies in thinking of, and then making, good measurements.
Second, new tools for measurement drive science. There is no doubt that throughout the history of science new techniques, or new ways of measuring things, have had a very profound influence.
Third, for purely practical reasons, measurement is not as simple as it might seem. This applies particularly to psychology and the social sciences where people try to measure abstract theoretical constructs like 'happiness' or 'social support' rather than more concrete things like weight or height. In addition, psychological experiments frequently suffer from bias in sampling the subjects and from intentional or unintentional bias in the experimenter. It is necessary to be aware of these problems and to know how to tackle them.
In psychology many advances can be traced to important measurements, or new ways of making measurements. For example, the PET (Positron Emission Tomography) image below illustrates how a new way of measuring something, in this case local cerebral blood flow, can open up huge areas of research. The figure shows a representation of blood flow in the brain of a patient with a panic disorder. The right half of the image represents the difference between blood flow in the right and left hemispheres. Note the 'blob' in the region of the parahippocampal gyrus, which shows that activity is greater in the right parahippocampal gyrus than in the left.
Levels of Measurement
Nominal measurement is a form of categorization where the categories bear no numerical relationship to each other; they are qualitatively different. Dividing animals by species and gender would constitute nominal measurement.
Ordinal measurement categorizes data into a ranked order (e.g. 1st, 2nd, 3rd etc.). The finishing order in a race is an example of ordinal measurement. Neither the differences between ranks nor the ratios between ranks are meaningful. Ordinal measurement tells us nothing about the size of the difference between ranks; rank 2 may be just behind rank 1 or a long way behind it. Two 3rd places do not equal one 6th place. 5th place minus 2nd place does not equal 3rd place. Many psychological scales, such as personality and attitude scales, are ordinal scales.
Interval measurement means that the absolute differences (intervals) between measurements are meaningful, but the ratios between measurements are not, because the scale does not have an 'absolute zero' point. The Celsius temperature scale is an example of an interval scale. The same amount of heat is required to raise the temperature by 1 degree for any point on the scale, so the intervals we call 1 degree are equal at all parts of the scale. However, because '0' is arbitrarily chosen to equal the freezing point of water, 10 deg. is not twice as hot as 5 deg.
Ratio measurement has all the properties of an interval scale and, in addition, has an absolute zero point. This means that both the absolute differences between measurements and the ratios of measurements are meaningful. The Kelvin temperature scale, where '0' is absolute zero, is a ratio scale. Many measures, such as weight, age, psychological reaction time or the frequency of an event, are examples of ratio measurement scales.
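The difference between interval and ratio scales can be checked with a little arithmetic. The Python sketch below is an illustration only; the function name celsius_to_kelvin and the variable names are introduced here and are not part of the original notes. It compares 10 deg. and 5 deg. Celsius on both scales: the intervals agree, but the ratios do not.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin: the zero point moves, the degree size does not."""
    return celsius + 273.15


warm_c, cool_c = 10.0, 5.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

# Differences (intervals) are the same on both scales.
difference_c = warm_c - cool_c
difference_k = warm_k - cool_k

# Ratios depend on where zero sits, so only the Kelvin ratio is meaningful.
ratio_c = warm_c / cool_c   # 2.0, an artifact of the arbitrary Celsius zero
ratio_k = warm_k / cool_k   # about 1.018

print(f"difference: {difference_c:.2f} C, {difference_k:.2f} K")
print(f"ratio: {ratio_c:.3f} in Celsius, {ratio_k:.3f} in Kelvin")
```

On the Kelvin scale the warmer temperature is less than 2% 'hotter' than the cooler one, which is why the apparent factor of two in Celsius should not be taken at face value.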
You will learn later that different statistical techniques apply to Nominal, Ordinal, Interval and Ratio data. For example, you can't generate a meaningful mean average from Nominal or Ordinal data.
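As a rough guide to what this means in practice, the sketch below picks a summary of the 'typical value' according to the level of measurement. The mapping used (mode for nominal, median for ordinal, mean for interval and ratio) is a common textbook convention assumed here, not something stated in these notes, and the data and names are made up for illustration.

```python
from statistics import mean, median, mode

# Assumed convention: which 'typical value' is meaningful at each level.
SUMMARY_BY_LEVEL = {
    "nominal": mode,     # categories can only be counted
    "ordinal": median,   # ranks can be ordered, but not added
    "interval": mean,    # equal intervals make averaging meaningful
    "ratio": mean,
}


def typical_value(values, level):
    """Summarize values with a statistic suited to their level of measurement."""
    return SUMMARY_BY_LEVEL[level](values)


# Hypothetical example data, one list per level of measurement.
species = ["cat", "dog", "dog", "horse"]
finishing_places = [1, 2, 3, 4, 5]
temperatures_celsius = [5.0, 10.0, 15.0]
reaction_times_ms = [250.0, 300.0, 350.0]

print(typical_value(species, "nominal"))
print(typical_value(finishing_places, "ordinal"))
print(typical_value(temperatures_celsius, "interval"))
print(typical_value(reaction_times_ms, "ratio"))
```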
Reliability and Validity
Reliability is a measure of how well repeated measurements of the same thing agree with each other. A stopped watch is highly reliable.
Validity is a measure of how well the measure measures what it sets out to measure. A working watch is more valid (but less reliable) than a stopped watch.
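The watch example can be made concrete with a small simulation. This is a sketch under stated assumptions: the times, the size of the reading error, and the helper names spread and mean_abs_error are all hypothetical. Reliability is treated here simply as how closely repeated readings agree, and validity as how close the readings are to the true time.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

# True time, in minutes after midnight, at twelve hourly checks (hypothetical).
true_times = list(range(0, 720, 60))

# A stopped watch shows 03:15 (195 minutes) whenever you look at it.
stopped_readings = [195.0 for _ in true_times]

# A working watch tracks the true time, with a small random reading error.
working_readings = [t + random.gauss(0, 0.5) for t in true_times]


def spread(readings):
    """Disagreement between repeated readings (0 means perfect agreement)."""
    return pstdev(readings)


def mean_abs_error(readings, truths):
    """Average distance between each reading and the true value."""
    return mean(abs(r - t) for r, t in zip(readings, truths))


print("stopped watch: spread", spread(stopped_readings),
      "error", mean_abs_error(stopped_readings, true_times))
print("working watch: spread", round(spread(working_readings), 1),
      "error", round(mean_abs_error(working_readings, true_times), 2))
```

The stopped watch's readings agree perfectly with each other but are usually far from the true time; the working watch's readings disagree with each other (because the time really has changed) but stay close to the truth.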
Good measures should be both reliable and valid. We will see later that problems with reliability can, in some cases, be minimized by taking repeated measurements. Problems with validity may necessitate finding a new measuring device altogether.
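The contrast in the last two sentences can be sketched in a few lines. The simulation below assumes a true value of 100, random measurement noise with a standard deviation of 2, and a constant bias of 5 for the invalid instrument; all of these numbers, and the function name measure, are invented for illustration.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 100.0


def measure(bias=0.0, noise_sd=2.0):
    """One simulated measurement: truth, plus constant bias, plus random noise."""
    return TRUE_VALUE + bias + random.gauss(0, noise_sd)


# An unreliable but valid instrument: noisy, with no systematic bias.
noisy_readings = [measure() for _ in range(100)]

# An invalid instrument: equally noisy, but it also reads 5 units too high.
biased_readings = [measure(bias=5.0) for _ in range(100)]

error_of_single = abs(noisy_readings[0] - TRUE_VALUE)
error_of_mean = abs(mean(noisy_readings) - TRUE_VALUE)
error_of_biased_mean = abs(mean(biased_readings) - TRUE_VALUE)

print(f"one noisy reading is off by {error_of_single:.2f}")
print(f"mean of 100 noisy readings is off by {error_of_mean:.2f}")
print(f"mean of 100 biased readings is off by {error_of_biased_mean:.2f}")
```

Averaging shrinks the random error, but the mean of the biased readings stays about 5 units from the truth however many readings are taken.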
Nominal, Invalid, and Unreliable
Comets were categorized as celestial swords by Johannes Hevelius in his Cometographia, published in 1668. This categorization is unreliable, since it is likely that different observers or repeated observations would not agree on cometary classification. This categorization is also of questionable validity.
Nominal, Invalid, and Reliable
Phrenology was a popular 'science' through much of the 19th century. The measurement of lumps and bumps could be highly reliable (since repeat measurements of the bumps would agree). The measurement is invalid since the bumps do not relate to any of the 35 or so supposed 'mental faculties'.
Observer and Experimenter Effects
Observer and experimenter effects are very important in the behavioral sciences. You should design your experiments to minimize (or control for) both.
Observer Effects
Neville Maskelyne (1732-1811), fifth Astronomer Royal, built an extremely accurate telescope at the Greenwich Observatory. He aimed to use transit observations (stars crossing a line in the telescope) to set chronometers to an accuracy of one tenth of a second. The time would be signaled to ships on the Thames by dropping a ball high on the Observatory. Maskelyne found that his observations were systematically different from those of his assistant, Kinnebrooke. The difference was around 0.8 seconds. Kinnebrooke was sacked, Maskelyne claiming he had 'fallen into some irregular and confused method of his own'.
Neville Maskelyne wrote up his experiences in 'Astronomical Observations at Greenwich' but did not appreciate the significance of the finding. It was several years later that the German astronomer and mathematician F.W. Bessel (1784-1846), with the help of J.K.F. Gauss (1777-1855), realized the importance of the systematic differences between observers. Bessel went on to calibrate individual observers, deriving 'personal equations' which compensated for their individual differences.
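The idea behind a 'personal equation' can be sketched as a simple calibration. The version below assumes that an observer's bias is a constant offset, estimated as the mean difference from a reference observer; this is a simplification for illustration and not necessarily how Bessel computed his corrections. The transit times are invented, chosen so that the offset comes out near the 0.8 seconds mentioned above.

```python
from statistics import mean

# Hypothetical transit times (in seconds) recorded for the same four stars.
reference_times = [10.00, 20.50, 31.20, 42.70]
assistant_times = [10.82, 21.28, 32.01, 43.49]


def personal_offset(observer, reference):
    """Mean amount by which one observer's timings differ from the reference."""
    return mean(o - r for o, r in zip(observer, reference))


def calibrate(observer, offset):
    """Remove an observer's systematic offset from each of their timings."""
    return [t - offset for t in observer]


offset = personal_offset(assistant_times, reference_times)
corrected_times = calibrate(assistant_times, offset)

print(f"estimated personal offset: {offset:.2f} s")
print("corrected timings:", [round(t, 2) for t in corrected_times])
```

Once the systematic offset is removed, the two observers differ only by small, unsystematic amounts, so a consistent bias is something to calibrate for rather than a reason to discard the observer.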
It is not only individual observers or subjects who have biases that can influence measurements. Experimenters can also influence results either deliberately or accidentally. Some examples are given below:
1) Measuring skulls: In 'The Mismeasure of Man', Gould gives an account of experimenter bias in making skull volume measurements. The volume of skulls was measured by filling them with seed, which was then weighed. The experimenter had certain preconceptions about skull volume, so would be a bit more enthusiastic about packing seed into skulls which he thought should be big. This introduced a systematic bias in the volume measurements.
2) Clever Hans, the amazing counting horse, lived in Germany in the early 1900s. His owner, Wilhelm von Osten, claimed that his horse could answer a wide variety of questions, such as solving mathematical problems and telling the time, and communicate the answers using hoof-taps. Prominent German scientists tested Clever Hans until most were convinced that the horse's highly accurate responses were not the result of trickery. The horse performed "almost as well" when von Osten was absent as when the master was present.
Researchers were not unanimous in pronouncing Clever Hans a bona-fide horse prodigy. Scientist Oskar Pfungst uncovered Hans's one weakness: he was unable to respond correctly when no one in front of him knew the answer to the question at hand.
Was Clever Hans being secretly cued by someone other than von Osten? Apparently so... but further tests showed that no one was tipping off Clever Hans intentionally -- he simply needed someone who knew the answer to BE there. The horse had learned to identify subtle tensing and relaxing of muscles that occur in someone who is anticipating the correct answer. Thus, Hans would tap his hoof until he saw the subconscious twitch in observers who knew he had arrived at the right spot in the alphabet, and there Hans would stop, oblivious to the semantic content of his actions.
3) Piers Cornelissen and Motion Detection Thresholds: Piers Cornelissen (who is in our department) uses a computerized motion detection task in his work on dyslexics. The subjects watch a dynamic pattern of random dots on a computer screen (which looks like the 'snow' on a poorly tuned T.V. set) and have to decide if the pattern is drifting to the left or to the right. The computer adjusts the strength of the motion signal, making the task easier or more difficult, to find the point at which the subject's response is only just better than guesswork. This is called the 'motion detection threshold'.
Although the computerized measurement of motion detection threshold appears objective, subjects tested by Piers himself routinely achieve a lower motion detection threshold than subjects tested by his 3rd year project students. Maybe Piers encourages the subjects to concentrate harder or explains the task better.
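The way a computer can home in on a threshold, as in the motion detection task above, can be sketched with a simple 'staircase': make the task harder after a correct response and easier after an error, then average the signal levels at which the direction reversed. The notes do not say which procedure the actual task uses, so the one-up, one-down rule, the step size, and the simulated subject below are all assumptions made for illustration.

```python
def run_staircase(respond, start=100, step=5, n_reversals=6, max_trials=200):
    """Estimate a threshold with a one-up, one-down staircase.

    respond(level) returns True if the subject answers correctly at that
    signal strength (here, percent of dots moving coherently).
    """
    level = start
    last_direction = None
    reversals = []
    for _ in range(max_trials):
        correct = respond(level)
        direction = -1 if correct else 1  # harder after correct, easier after error
        if last_direction is not None and direction != last_direction:
            reversals.append(level)       # the staircase changed direction here
        last_direction = direction
        level = min(100, max(0, level + direction * step))
        if len(reversals) == n_reversals:
            break
    if not reversals:
        raise ValueError("no reversals: the threshold lies outside the tested range")
    return sum(reversals) / len(reversals)


def simulated_subject(level, true_threshold=23):
    """Hypothetical, idealized subject: correct whenever the signal is strong enough."""
    return level >= true_threshold


estimate = run_staircase(simulated_subject)
print("estimated threshold:", estimate)
```

A real subject responds probabilistically rather than with a sharp cut-off, so real procedures need more trials and more careful rules. Even so, the sketch shows why the measurement looks objective: the computer, not the experimenter, chooses every signal level. Anything the experimenter does to change how hard the subject tries still feeds into respond, and so into the estimate.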
Measuring Stress: A case study in the problems of psychological measurement.
The term 'Stress' is commonly used in everyday life (e.g. 'Learning statistics is relaxing, and causes no stress'), but stress can be difficult to measure. (If you are interested in a fuller discussion of the problems of measuring stress, see Charlton 1992; Engel 1985; Levine and Coe 1985.)
1) Should we measure the stimulus? A disparate range of stimuli are described as 'stressful', and what is stressful to one person may not be stressful to someone else. How do you produce a single stress measure from a disparate range of stimuli, and how do you know the stimuli are stressful if you don't take the response into account?
2) Should we measure the response? One expert, Hans Selye, has defined stress as "the non-specific response of the body to any demand". Selye claims that the non-specific response is composed of a pattern of changes in the adrenal cortex, the thymus gland and the gut, and terms this response the "General Adaptation Syndrome" or GAS.
There are several problems here. First, many things commonly thought of as stressful do not cause GAS. Second, it is difficult to measure GAS in live humans. People have taken to measuring, for example, the levels of various hormones that are thought to relate to GAS, which assumes that these are valid measures of GAS. Third, if the response is 'non-specific', how do we know what stimulus caused it?
The problem of measuring stress comes from the error of 'reification': the process by which, once something is given a name, people assume that it exists as a (single) thing. Reification is common in the history of psychology (e.g. Intelligence, Motivation, Social Cohesion, Ego, Id, etc.) and you should be suspicious of it!
The Important Points of Measurement and Data