Appendix L. Statistics
On Describing Data:
Some information is in numerical form. Your height, weight, age and
scores on tests are numbers that convey some information about you. If we
have such information on a group of people, we need some way to describe
that set of numbers. This is done by using statistics, but you need to
know only a few very simple things in order to understand the way that
numerical information is described in introductory textbooks.
Suppose, for example, that the 27 students in my class took the
vocabulary pre-test given in the preface. If you wanted to know how well
the class did, you could look at the list of their scores in my grade book
where the students are listed in alphabetical order. You might see the
following numbers:
85 79 52 64 62 79 53 46 64 84 76 49 64 68 61 55 73 41 88 58 74 65 93 59 53 73 36
Although reading those numbers might give you some general idea about how
well the class did, it would be better if the scores were listed in
numerical order:
36 41 46 49 52 53 53 55 58 59 61 62 64 64 64 65 68 73 73 74 76 79 79 84 85 88 93
Such an arrangement of the scores enables you to determine quickly that
they range from 36 to 93, and that the middle score is 64.
A still better picture of this set of scores emerges if we stack the
identical scores:
                                 64
               53                64       73       79
36 41 46 49 52 53 55 58 59 61 62 64 65 68 73 74 76 79 84 85 88 93
For greater convenience and clarity, it is helpful to lump scores into
five to seven class intervals. I have chosen a set of scores so that the
numerical decades provide good class intervals, and the results of this
lumping of the scores are shown in two ways:
 Score                 Percent
Interval   Frequency  Frequency                 68
 30-39         1         3.7                59  65  79
 40-49         3        11.1                58  64  79
 50-59         6        22.2                55  64  76
 60-69         7        25.9            49  53  64  74  88
 70-79         6        22.2            46  53  62  73  85
 80-89         3        11.1        36  41  52  61  73  84  93
 90-99         1         3.7
The table on the left above is called a frequency distribution; it is the
most common method of presenting statistical information. As a rule, only
the percent frequencies are displayed because, if you know the total
number of scores, you can compute the actual frequencies if they are of
special interest. The disadvantage of a frequency distribution is that
one cannot tell "at a glance" what the distribution of scores looks like.
For this purpose, graphic means of presenting statistical information are
preferable. To illustrate such methods, let us put boxes around the
stacks of numbers, as follows:
                ____
           ____: 68 :____
          : 59 : 65 : 79 :
          : 58 : 64 : 79 :
      ____: 55 : 64 : 76 :____
     : 49 : 53 : 64 : 74 : 88 :
 ____: 46 : 53 : 62 : 73 : 85 :____
: 36 : 41 : 52 : 61 : 73 : 84 : 93 :
In practice, we leave the actual numbers out, but display along the
baseline the range of numbers in each stack:
                ____
           ____:    :____
          :    :    :    :
          :    :    :    :
      ____:    :    :    :____
     :    :    :    :    :    :
 ____:    :    :    :    :    :____
:    :    :    :    :    :    :    :
 30-  40-  50-  60-  70-  80-  90-
 39   49   59   69   79   89   99
This graph is called a bar graph. The height of each bar depicts the
number of people with scores in the range shown along the baseline.
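If you would like to check the frequency distribution and the bar graph
for yourself, the lumping can be done in a few lines of Python. This is
only a minimal sketch: the function names frequency_distribution and
text_bar_graph are my own inventions for illustration, the class
intervals are assumed to be the numerical decades used above, and the
bars are drawn sideways with "#" marks because that is easier to print.

```python
from collections import Counter

# The 27 vocabulary pre-test scores, in grade-book order.
scores = [85, 79, 52, 64, 62, 79, 53, 46, 64, 84, 76, 49, 64, 68,
          61, 55, 73, 41, 88, 58, 74, 65, 93, 59, 53, 73, 36]


def frequency_distribution(values, width=10):
    """Count how many values fall in each class interval of the given width."""
    counts = Counter((v // width) * width for v in values)
    lows = range(min(counts), max(counts) + width, width)
    return {low: counts.get(low, 0) for low in lows}


def text_bar_graph(freq, width=10):
    """Draw one sideways bar per interval, with frequency and percent frequency."""
    total = sum(freq.values())
    lines = []
    for low, count in freq.items():
        percent = 100 * count / total
        bar = "#" * count
        lines.append(f"{low}-{low + width - 1}  {bar:<8} {count:2d}  {percent:4.1f}%")
    return "\n".join(lines)


freq = frequency_distribution(scores)
print(text_bar_graph(freq))
```

Each printed line corresponds to one row of the frequency distribution,
and the length of its bar corresponds to the height of one bar in the
graph above.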
Another very similar procedure can be illustrated by putting a dot
above each stack of numbers.
            .
        .   68  .
        59  65  79
        58  64  79
    .   55  64  76  .
    49  53  64  74  88
.   46  53  62  73  85  .
36  41  52  61  73  84  93
Again in practice, we leave the numbers out and
connect the dots:
                 .
            .         .         (Connect the
                                 dots)
       .                   .
  .                             .
___________________________________
 30-  40-  50-  60-  70-  80-  90-
 39   49   59   69   79   89   99
This graph is called a frequency polygon, and again the height of the
curve above the baseline represents the number of people who scored in the
indicated range.
Why do we have both bar graphs and frequency polygons? The
answer is that there are two fundamentally different kinds of data. Some
things come in discrete, indivisible units. With such things, you should,
and normally do, ask the question, "How many are there?" and you count
them in order to find out. The number of children in families, of seats
in classrooms, of rooms in dormitories, of people at parties, of pieces of
puzzles, etc., illustrate discrete data. With such data, we normally use
bar graphs.
Other things are continuous in nature, at least conceptually. With
such things you should, and normally do, ask the question, "How much is
there?" and you measure in order to find out. The amount of gasoline in
cars, the weight of people, the height of buildings, the temperature in
classrooms, the distance between cities, etc., illustrate continuous data.
With such data, we can use a frequency polygon because the lines
connecting the points imply continuity.
There would be no confusion except that numbers themselves are
discrete even when used with continuous things. If I ask, "How much do
you weigh?" you will answer with a number of pounds. But we know that
weight is continuous in nature. A person probably doesn't weigh exactly
125 pounds, and my home is not exactly 10 miles from campus. It may be
less obvious, but the scores on the vocabulary test are measures of a
continuous trait, namely knowledge of words. The amount of knowledge can't
be counted, but we can measure it by the number of items a person answers
correctly.
On the "Average"
You undoubtedly already understand the concept of an "average." It
means typical, usual, normal, common, middle, expected, central. In the
vernacular, "average" is not so much a single, exact number as it is a
range of values that fit most people. For example, I might say that "the
average student registers for somewhere between 15 and 18 hours of course
work." A 12-hour course load would be considered to be somewhat "below
average."
Statisticians have devised three different ways to compute a
numerical average. These are defined and illustrated below with a set of
nine scores where the three statistical meanings of "average" yield the
same answer:
                       3
 1                   2 3 4
 2                 1 2 3 4 5
 2                     ^
 3                 MODE = most frequent score
 3
 3                 1 2 2 3 3 3 4 4 5
 4                         ^
 4                 MEDIAN = middle score, half above, half below
 5
27/9 = 3  -->      MEAN = Sum of scores divided by number of scores
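The three definitions above can also be checked with Python's standard
statistics module. This is a small sketch rather than anything you need
to learn; the function name three_averages is simply a label I made up
for the illustration.

```python
from statistics import mean, median, multimode


def three_averages(scores):
    """Return the mode(s), median, and mean of a list of scores."""
    return {
        "mode": multimode(scores),  # most frequent score(s); a list, since ties are possible
        "median": median(scores),   # middle score, half above and half below
        "mean": mean(scores),       # sum of scores divided by number of scores
    }


scores = [1, 2, 2, 3, 3, 3, 4, 4, 5]
averages = three_averages(scores)
print(averages)
```

For this set of nine scores all three averages come out to 3, just as in
the diagram.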
You may have heard the expression, "Statistics don't lie, but
statisticians do." When the three statistical averages differ, one way to
deceive people is to choose the one that best suits your purposes.
Consider another set of 9 scores:
                   1
                   1 2
 1                 1 2
 1                 1 2   4 5
 1                 ^
 1                 MODE = most frequent score
 2
 2                 1 1 1 1 2 2 2 4 5
 2                         ^
 4                 MEDIAN = middle score, half above, half below
 5
19/9 = 2.1  -->    MEAN = Sum of scores divided by number of scores
Now if those numbers represent something that is undesirable, such as
"How many times a day do you exceed the speed limit?" I might answer,
"About once a day, on the average." In this case, the mode gives the
smallest "average." However, if the numbers represent something that
is desirable, such as "How many times a day do you brush your teeth?"
I might say "More than twice a day, on the average." Now I have chosen
the mean because it gives the largest number. Accordingly, by taking
advantage of the ambiguity of the meaning of "average," one can tilt
the picture one way or the other.
Fortunately, good scientists have no intention of deceiving anyone.
They will tell you explicitly which statistical average was used to
summarize their data. If the answers are substantially different, it
is customary to report them all. This should permit you to get a very
clear picture of the data. There is no one best statistical average;
each is a legitimate way to represent the average score. As long as
you know how each one is computed, you should not be misled.
On Variability
An average score can be thought of as a representative score; it
attempts to represent the entire set of scores with a single number.
If almost all of the scores are the same, or clustered very close to
the middle, then the average is a very good representative. However,
if there is a lot of variability in the scores, then the average does
not give enough information. Accordingly, in addition to an average,
you need some indication of the variability.
The most common measure of variability used in textbooks is the
standard deviation. There is no need for you to know how to compute
the standard deviation. (If you're interested, it is the square root
of the mean of the squared deviations from the mean.) What you should
know is that, in a symmetrical, bell-shaped distribution, about two-thirds
of the scores fall in the range between one standard deviation below and
one standard deviation above the mean. For example, if the mean is 100 and
the standard deviation is 10, about two-thirds of the scores are between
90 and 110.
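For the curious, here is a short Python sketch of the computation
described in the parentheses above, applied to the 27 vocabulary scores
from the beginning of this appendix. The function name
standard_deviation is my own, and dividing by the number of scores (rather
than by one less than that number, as some textbooks do) is an assumption
that follows the wording "the mean of the squared deviations." Those
scores are only roughly symmetrical, so the two-thirds rule should hold
only approximately.

```python
from math import sqrt

# The 27 vocabulary pre-test scores from the start of the appendix.
scores = [85, 79, 52, 64, 62, 79, 53, 46, 64, 84, 76, 49, 64, 68,
          61, 55, 73, 41, 88, 58, 74, 65, 93, 59, 53, 73, 36]


def standard_deviation(values):
    """Square root of the mean of the squared deviations from the mean."""
    m = sum(values) / len(values)
    squared_deviations = [(v - m) ** 2 for v in values]
    return sqrt(sum(squared_deviations) / len(values))


m = sum(scores) / len(scores)
sd = standard_deviation(scores)

# Scores that lie between one standard deviation below and above the mean.
within = [s for s in scores if m - sd <= s <= m + sd]

print(f"mean = {m:.1f}, standard deviation = {sd:.1f}")
print(f"{len(within)} of {len(scores)} scores are within one standard deviation")
```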
On Correlation
Two measures are correlated when the amount of one tends to vary with
the amount of the other. Some familiar correlations are:
tall people tend to weigh more than short people
well-educated people tend to earn more than uneducated people
intelligent people tend to be healthier than stupid people
In each of the above examples, the correlation is positive because larger
amounts of one measure go together with larger amounts of the other. But
in many cases, the correlation is negative:
fat people tend to live shorter lives
old people tend to have poor memory
happy people tend to get fewer ulcers
Notice that there is nothing "bad," or "wrong," or undesirable about
negative correlations. The sign (positive or negative) of a correlation
simply shows whether big numbers go with big numbers or whether big
numbers go with little numbers.
The degree of correlation is symbolized by "r" and is measured on a
scale ranging from zero (no correlation) to 1.00 (perfect correlation).
There is no need for you to know how to calculate "r" but it is important
to understand what it means if you hear that the correlation between hours
of study and grade point average (GPA) is +.6 among college freshmen.
One way to learn about correlations is to construct scatter plots.
Each person's measures on two variables can be plotted on a graph as
follows: Suppose person A is 5' tall and weighs 100 pounds. Person B is
5'8" tall and weighs 112 pounds, and person C is 6'2" tall and weighs 150
pounds. All of this information is displayed in Figure 1.
          : - - - - - - - - - - - - - - - - - - - - - - - - - - - - - C
        6 -
          : - - - - - - - - - - - - - - - - - - - - - B               :
          :
        5 - - - - - - - - - - - - - - - - - - - - A   :     D         :
Height    :
(feet)    :                                       :   :               :
        4 -
          :                                       :   :               :
          :
        3 -                                       :   :               :
          :
          :                                       :   :               :
        2 -
          :                                       :   :               :
          :
        1 -                                       :   :               :
          :
          :---------:---------:---------:---------:---------:---------:--
                   25        50        75        100       125       150
                                 Weight (pounds)
Figure 1. In a scatter plot, each person's two scores are
represented by a single point in the graph. Person D is 5'
tall and weighs 125 pounds.
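The idea that each person becomes one point can be made concrete with a
few lines of Python. This is only a rough sketch: the function name
scatter_rows, the size of the character grid, and the axis limits (150
pounds and 6.5 feet) are all assumptions I chose for the illustration,
and a character grid can place a point only approximately.

```python
# Each person's two measures: (weight in pounds, height in feet).
people = {
    "A": (100, 5.0),
    "B": (112, 5 + 8 / 12),
    "C": (150, 6 + 2 / 12),
    "D": (125, 5.0),
}


def scatter_rows(points, max_x=150, max_y=6.5, cols=60, rows=13):
    """Return text rows (top row first) with each label at its (x, y) position."""
    grid = [[" "] * (cols + 1) for _ in range(rows + 1)]
    for label, (x, y) in points.items():
        col = round(x / max_x * cols)         # farther right = heavier
        row = rows - round(y / max_y * rows)  # nearer the top = taller
        grid[row][col] = label
    return ["".join(cells).rstrip() for cells in grid]


rows = scatter_rows(people)
for line in rows:
    print("|" + line)
print("+" + "-" * 61)
```

Persons A and D land on the same row because they are the same height,
and D lands farther to the right because D weighs more.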
When a number of points are plotted, you can get a good idea about
both the degree and the sign of the correlation by just looking at the way
the dots are arranged (each dot representing one person's scores).
[Figure 2 consists of three scatter plots, each with Score X on the
horizontal axis and Score Y on the vertical axis, both running from
low to Hi:
   High Positive Correlation (r = +.90): the dots cluster around a
      line rising from lower left to upper right.
   High Negative Correlation (r = -.90): the dots cluster around a
      line falling from upper left to lower right.
   No Correlation (r = .00): the dots are scattered with no trend.]
Figure 2. Scatter plots that illustrate the range of likely
correlations between scores X and Y. A perfect correlation
(r=1.00) would show all of the dots on a single straight line.
A high correlation does not necessarily mean that one variable
causes the other. Both may be caused by some other factor(s). For
example, studies have usually found a correlation of about +.60 between
one's grades in high school and one's grades as a college freshman. To
see how this is determined, plot the following data in the graph:
Student   X = High    Y = Freshman        4.0 :
          School GPA  GPA                     :
-------   ----------  ------------            :
   1         2.00        1.60             3.0 :
   2         2.25        2.00   Freshman      :
   3         2.60        1.80     GPA         :
   4         2.65        2.80             2.0 :
   5         2.80        2.10                 :
   6         3.10        2.00                 :
   7         2.90        2.65             1.0 :
   8         3.25        2.25                 :---------:---------:---------:
   9         3.60        3.00                1.0       2.0       3.0       4.0
  10         3.25        3.10                         High School GPA
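Although you do not need to calculate "r" yourself, it may help to see
that it is an ordinary computation. The sketch below applies the usual
product-moment formula to the ten students in the table; the function
name pearson_r is my own label, and with only ten students the answer
should be expected to land only in the general neighborhood of the +.60
mentioned above.

```python
from math import sqrt

# The ten students from the table: high school GPA (X) and freshman GPA (Y).
high_school = [2.00, 2.25, 2.60, 2.65, 2.80, 3.10, 2.90, 3.25, 3.60, 3.25]
freshman = [1.60, 2.00, 1.80, 2.80, 2.10, 2.00, 2.65, 2.25, 3.00, 3.10]


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two measures vary together ...
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ... compared with how much each one varies on its own.
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


r = pearson_r(high_school, freshman)
print(f"r = {r:+.2f}")
```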
Grades in both high school and college are probably caused by factors
like intelligence, amount of effort, and the quality of instruction. Try
to think of reasons why the correlation is not perfect (like class size).
Even so, it is very unlikely (albeit possible) that a person who got mostly
C's in high school will get straight A's as a freshman. For the purpose of
prediction, a negative correlation is just as useful as a positive one.
Plot the following data:
Patient   X = Years   Y = Lung             60 :
          Smoked      Capacity                :
-------   ----------  ----------              :
   1          25          45               50 :
   2          36          40      Lung        :
   3          22          50    Capacity      :
   4          20          60               40 :
   5          48          25                  :
   6          39          30                  :
   7          42          30               30 :
   8          31          45                  :---------:---------:---------:
   9          28          58                 20        30        40        50
  10          33          65                           Years Smoked
In the preceding example, the correlation is negative (about -.80)
because smaller lung capacity is associated with a larger number of years
of smoking. Even so, we can safely predict that the longer a person has
smoked, the smaller his/her lung capacity is likely to be.
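The sign of a correlation falls out of the arithmetic automatically. One
way to see this is to restate every score as the number of standard
deviations it lies above or below its own mean, and then average the
products of each patient's two restated scores. The sketch below does
this for the ten patients in the table. The function names are mine, and
this restated-score route is simply one standard way of arriving at "r,"
not necessarily the way any particular textbook computes it.

```python
from statistics import mean, pstdev

# The ten patients from the table.
years_smoked = [25, 36, 22, 20, 48, 39, 42, 31, 28, 33]
lung_capacity = [45, 40, 50, 60, 25, 30, 30, 45, 58, 65]


def standard_scores(values):
    """Restate each value as standard deviations above (+) or below (-) the mean."""
    m = mean(values)
    sd = pstdev(values)  # divides by the number of scores
    return [(v - m) / sd for v in values]


def correlation(xs, ys):
    """Average product of paired standard scores."""
    products = [zx * zy for zx, zy in zip(standard_scores(xs), standard_scores(ys))]
    return mean(products)


r = correlation(years_smoked, lung_capacity)
print(f"r = {r:+.2f}")
```

Patients who smoked longer than average (a positive restated score) tend
to have lung capacities below average (a negative one), so most of the
products are negative, and so is their average.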
In general, the degree to which two sets of scores are correlated is
indicated by a number between 0 and 1. The sign of the correlation tells
whether it is a big-big or a big-little relationship. Both of these can be
displayed in a scatter plot. In the graphs below, make a scatter plot that
represents each of the following correlations and describe in words what
you can say about the correlation: (Be sure to put labels on the axes.)
(a) amount of rain and time the sun rises is r=+.01
(b) number of miles per gallon and speed of driving is r=-.80
(c) amount of alcohol drunk and time to react is r=+.90
(d) level of income and number of children is r=-.50
Hi :                                    Hi :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :---------:---------:---------:         :---------:---------:---------:
                                Hi                                      Hi
Hi :                                    Hi :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :                                       :
   :---------:---------:---------:         :---------:---------:---------:
                                Hi                                      Hi