This lecture provides an introduction to variability. Variability is the degree to which scores in a distribution are spread out. Variability is an important descriptor of data and it is always difficult to interpret differences in the mean without considering variability. The following example illustrates the point.

In 1911, Lombroso claimed that criminals have smaller brains than law abiding citizens: "In criminals, the small capacities dominate". Lombroso measured the cranial capacities of criminals and of respectable citizens. He found that the mean from 121 male criminals was 1,450 cc while the mean from 328 decent citizens was 1,484 cc. The frequency distribution of cranial capacities is graphed below. Was Lombroso justified in his claim about the small brains of criminals?

A brief glance at the graph shows that cranial capacity is highly variable in both the law abiding citizens and the criminals. In fact, when the data is plotted in this way, it is clear that there is not much difference between the two groups. This is because both groups are very variable and the difference between the groups is small. To report the data accurately, Lombroso should have given an indication of variability.


Measures of variability

Just as 'the average' can be measured in several ways, so can variability. Measures include

1) Range.

2) Variance.

3) Standard deviation.


The Range

The range is the difference between the maximum and minimum values (Xmax and Xmin) in a distribution. The range is sensitive to extreme values. Below are two distributions of scores - both have the same mean but differ in variability. Suppose that the scores were achieved by students using two different study methods. The different methods yield the same mean score, but the performance of one of the groups of students is much more variable.

Graphs illustrating scores from students who followed two different educational programmes. The programmes achieved the same mean score but had very different variability.
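As a rough sketch of the calculation, the range can be computed as below. The scores are invented for illustration (they are not the data in the graphs); both hypothetical groups have a mean of 60, and the names `method_a` and `method_b` are assumptions.

```python
def value_range(scores):
    """Range = Xmax - Xmin."""
    return max(scores) - min(scores)

# Hypothetical scores, invented for illustration: both groups have a mean of 60.
method_a = [55, 58, 60, 62, 65]
method_b = [20, 45, 60, 75, 100]

mean_a = sum(method_a) / len(method_a)
mean_b = sum(method_b) / len(method_b)
range_a = value_range(method_a)
range_b = value_range(method_b)

print(mean_a, range_a)  # 60.0 10
print(mean_b, range_b)  # 60.0 80

# One extreme score is enough to change the range a great deal.
range_a_with_extreme = value_range(method_a + [100])
print(range_a_with_extreme)  # 45
```

The last line shows why the range is described as sensitive to extreme values: a single unusual score moves it from 10 to 45.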


Variance and standard deviation

Variance and standard deviation are the two most commonly used measures of variability. Variance is the mean squared deviation from the mean. Standard deviation is a measure of the 'typical' distance from the mean.

Remember, from the lecture on averages, that deviations from the mean are both positive and negative, and that the sum of deviations from the mean is, by definition, always zero. Therefore, the sum of deviations from the mean does not provide a measure of how well the mean describes the data.
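A quick check of this property, using a small set of made-up scores (the numbers are an assumption, chosen only to keep the arithmetic easy):

```python
# Hypothetical scores, invented for illustration.
scores = [2, 4, 4, 5, 10]

mean = sum(scores) / len(scores)          # 25 / 5 = 5.0
deviations = [x - mean for x in scores]   # [-3.0, -1.0, -1.0, 0.0, 5.0]
total = sum(deviations)                   # the negatives cancel the positives

print(deviations)
print(total)  # 0.0
```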

The mean has another important property. It is the most 'representative' value because the sum of squared differences of all scores from the mean is a minimum (see below). Squaring the errors is a mathematical trick to turn negative deviations into positive 'squared deviations'. This is important for calculating variance and standard deviation.
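The 'minimum' property can be illustrated, though not proved, by trying several candidate values and seeing which gives the smallest sum of squared differences. The scores and the list of candidates below are invented for illustration.

```python
def sum_of_squares(scores, centre):
    """Sum of squared differences between each score and a chosen centre."""
    return sum((x - centre) ** 2 for x in scores)

# Hypothetical scores, invented for illustration.
scores = [2, 4, 4, 5, 10]
mean = sum(scores) / len(scores)  # 5.0

# Try a few candidate 'representative' values, including the mean itself.
candidates = [3.0, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0]
ss_by_centre = {c: sum_of_squares(scores, c) for c in candidates}

best = min(ss_by_centre, key=ss_by_centre.get)
print(ss_by_centre)
print(best)  # 5.0, the mean
```

Any candidate other than the mean gives a larger total, and the total grows the further the candidate is from the mean.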

The ability of the square of the difference between each score and the mean [(Xi-Xbar)^2] to measure deviation from the mean is shown in the figure below. The two samples have the same mean but differ in variability. For sample 2 (burgundy colour), N = 100 and the sum of squared deviations, Σ(Xi-Xbar)^2, = 9826, therefore the mean squared deviation = Σ(Xi-Xbar)^2 / 100 = 98.26. For sample 1 (blue colour), N = 100 and the sum of squared deviations = 51882, therefore the mean squared deviation = 518.82. The mean squared deviation is known as VARIANCE and is an important measure of variability.

Two samples with the same mean but different variance. Sample 1 is in mauve. Sample 2 is in burgundy. Sample 1 has a much higher variability because the 'typical' value is further from the mean. Therefore the mean is a worse description of sample 1 than of sample 2.
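The arithmetic above can be reproduced in a few lines. The sums of squared deviations (9826 and 51882) and N = 100 come from the text; the function `population_variance` and the small list of scores are assumptions added for illustration.

```python
def population_variance(scores):
    """Mean squared deviation from the mean: SS / N."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)  # sum of squared deviations
    return ss / n

# Using the sums of squared deviations quoted in the text (N = 100 for both).
n = 100
variance_sample_2 = 9826 / n    # 98.26
variance_sample_1 = 51882 / n   # 518.82
print(variance_sample_2, variance_sample_1)

# The same calculation from raw scores (hypothetical data): SS = 36, N = 5.
example_variance = population_variance([2, 4, 4, 5, 10])
print(example_variance)
```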

Variance is defined formally below (SS=sum of squared deviations).
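In the notation used so far, the usual textbook definition can be written as follows. This is the standard form, given here as a reconstruction consistent with the worked example (SS divided by N); the index limits are an assumption about notation.

```latex
SS = \sum_{i=1}^{N} (X_i - \bar{X})^2 ,
\qquad
\text{variance} = \frac{SS}{N}
```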


Standard deviation

Variance is a useful measure of variability but it has awkward units. E.g., if the measure is cm, variance will be in cm^2. Standard deviation is the square root of the variance. It has the same units as the measure that is varying so is easier to relate to the scores in question.
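A short sketch of the relationship, using Python's standard `statistics` module. The heights are made up for illustration; `pvariance` and `pstdev` are the library's population (divide by N) versions.

```python
import math
import statistics

# Hypothetical heights in cm, invented for illustration.
heights_cm = [150, 160, 170, 180, 190]

variance_cm2 = statistics.pvariance(heights_cm)  # SS / N = 1000 / 5, in cm^2
sd_cm = math.sqrt(variance_cm2)                  # back in cm

print(variance_cm2)
print(sd_cm)
print(statistics.pstdev(heights_cm))  # the library's own square-root step
```

The variance (200 cm^2) is hard to picture, whereas the standard deviation (about 14 cm) can be compared directly with the heights themselves.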

 


Sample standard deviation

When considering averages, we saw that the accuracy of estimates of the population mean depends on sample size but that small samples, though unreliable, are unbiased. They do not systematically under- or over-estimate the mean. This is illustrated in the stem and leaf plot below showing the means of 40 samples drawn from a population with a mean of 100.

Just as population mean (mu) and sample mean (X bar) are different, population standard deviation and sample standard deviation are different. Therefore, sample standard deviation is known as 's' (not little 'sigma', the population standard deviation) and sample variance is 's^2'.

Xbar (sample mean) is an unbiased estimate of mu (population mean) because it is not systematically more or less than the population mean, but fluctuates randomly around it. However, the standard deviation of small samples tends systematically to be less than the standard deviation of the population from which they were drawn. This is illustrated in the table below. To produce the table, I made a large population of numbers with a mean of 100 and a standard deviation of 20 (variance = 400). I then took 40 samples of size 5 and 40 samples of size 20, and calculated the standard deviation of each sample.

The table shows that the standard deviation of small samples systematically underestimates the population standard deviation. The 'mean' values are the mean of the standard deviations of the samples. When N=5, the mean of the standard deviations of the samples is 17.4 while the true population standard deviation is 20.
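A similar experiment can be sketched in code. This is not the procedure used to build the table: the normal shape of the population, its size, the seed, and the use of 2000 samples per size (rather than 40, to steady the averages) are all assumptions. The standard deviations here divide SS by N, with no correction.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so that the run is repeatable

# Assumed stand-in for the lecture's population: normal, mean 100, sd 20.
population = [random.gauss(100, 20) for _ in range(100_000)]

def mean_uncorrected_sd(sample_size, n_samples=2000):
    """Average of the divide-by-N standard deviations of many random samples."""
    sds = [
        statistics.pstdev(random.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.mean(sds)

mean_sd_5 = mean_uncorrected_sd(5)
mean_sd_20 = mean_uncorrected_sd(20)

print(round(mean_sd_5, 1), round(mean_sd_20, 1))
```

Both averages should come out below 20, with the N=5 samples falling further short than the N=20 samples.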

Correcting sample variance and s.d.

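To correct this bias when estimating the population standard deviation from a sample, divide the sum of squared deviations by N-1 rather than N. Strictly, this makes the sample variance an unbiased estimate of the population variance; the sample standard deviation is left with a much smaller bias.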
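Written out in the same notation as before, the corrected sample formulas take the standard textbook form below (SS is the sum of squared deviations from the sample mean; the index limits are an assumption about notation).

```latex
s^2 = \frac{SS}{N-1} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N-1} ,
\qquad
s = \sqrt{s^2}
```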

The stem and leaf plot below has corrected for the sampling bias by dividing SS by N-1. 's', sample standard deviation, is still variable, but most of the systematic bias has gone.
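The effect of the correction is easiest to see on the variance. The sketch below is an illustration under assumptions, not the data behind the plot: samples of 5 are drawn directly from a normal distribution with mean 100 and standard deviation 20 (variance 400), and the seed and the number of samples (5000) are arbitrary.

```python
import random

random.seed(2)  # arbitrary seed so that the run is repeatable

POP_MEAN, POP_SD = 100, 20  # assumed population: variance = 400

def sum_of_squares(sample):
    """SS: sum of squared deviations from the sample mean."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)

def mean_variances(sample_size, n_samples=5000):
    """Average sample variance, dividing SS by N and by N-1."""
    by_n, by_n_minus_1 = 0.0, 0.0
    for _ in range(n_samples):
        sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(sample_size)]
        ss = sum_of_squares(sample)
        by_n += ss / sample_size
        by_n_minus_1 += ss / (sample_size - 1)
    return by_n / n_samples, by_n_minus_1 / n_samples

mean_var_n, mean_var_n_minus_1 = mean_variances(5)
print(round(mean_var_n), round(mean_var_n_minus_1))
```

Dividing by N should give an average well below 400, while dividing by N-1 should give an average close to 400.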

One point of clarification: if the population has 5 members (N=5), then use 'sigma', which you get by dividing the sum of squared deviations from the population mean ('mu') by N. If you want to estimate the standard deviation of a population ('sigma') from a sample, then calculate 's' by dividing the sum of squared deviations from the sample mean (Xbar) by N-1.
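Python's standard `statistics` module makes the same distinction, which gives a compact way to see the two calculations side by side. The five scores are made up for illustration.

```python
import statistics

# Hypothetical scores, invented for illustration: mean = 5, SS = 36.
scores = [2, 4, 4, 5, 10]

# If these five scores ARE the whole population: divide SS by N.
sigma = statistics.pstdev(scores)   # sqrt(36 / 5)

# If they are a SAMPLE used to estimate a population's sd: divide SS by N-1.
s = statistics.stdev(scores)        # sqrt(36 / 4) = 3.0

print(sigma, s)
```

With the same five numbers, 's' is always a little larger than 'sigma' because the divisor is smaller.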


The main points