Appendix L: Statistics

On Describing Data

     Some information is in numerical form.  Your height, weight, age and
scores on tests are numbers that convey some information about you.  If we
have such information on a group of people, we need some way to describe
that set of numbers. This is done by using statistics, but you need to
know only a few very simple things in order to understand the way that
numerical information is described in introductory textbooks.

     Suppose, for example, that the 27 students in my class took the
vocabulary pre-test given in the preface.  If you wanted to know how well
the class did, you could look at the list of their scores in my grade book
where the students are listed in alphabetical order.  You might see the
following numbers:

85 79 52 64 62 79 53 46 64 84 76 49 64 68 61 55 73 41 88 58 74 65 93 59 53 73 36

Although reading those numbers might give you some general idea about how
well the class did, it would be better if the scores were listed in
numerical order:

36 41 46 49 52 53 53 55 58 59 61 62 64 64 64 65 68 73 73 74 76 79 79 84 85 88 93

Such an arrangement of the scores lets you determine quickly that they
range from 36 to 93, and that the middle score (the 14th of the 27) is 64.
A still better picture of this set of scores emerges if we stack the
identical scores:

                                  64
                53                64       73       79
 36 41 46 49 52 53 55 58 59 61 62 64 65 68 73 74 76 79 84 85 88 93
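
     If you have a computer handy, the same steps can be sketched in a few
lines of Python.  This is only an illustration: the choice of Python and
the variable names (ordered, middle, stacked) are assumptions made here,
not part of the course material.

```python
from collections import Counter

# The 27 scores in grade-book (alphabetical) order, copied from the text.
scores = [85, 79, 52, 64, 62, 79, 53, 46, 64, 84, 76, 49, 64, 68,
          61, 55, 73, 41, 88, 58, 74, 65, 93, 59, 53, 73, 36]

ordered = sorted(scores)                 # numerical order
lowest, highest = ordered[0], ordered[-1]
middle = ordered[len(ordered) // 2]      # the 14th of the 27 scores

# Count how often each score occurs; any score that occurs more than
# once is one that gets "stacked" in the picture above.
counts = Counter(ordered)
stacked = {score: n for score, n in counts.items() if n > 1}

print(ordered)
print("range:", lowest, "to", highest)
print("middle score:", middle)
print("stacked scores:", stacked)
```

The scores 53, 73 and 79 each occur twice and 64 occurs three times, which
is why those four columns rise above the rest.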
 
For greater convenience and clarity, it is helpful to lump scores into
five to seven class intervals.  I have chosen a set of scores so that the
numerical decades provide good class intervals, and the results of this
lumping of the scores are shown in two ways: 

Score                Percent           
Interval  Frequency  Frequency                      68
 30-39        1          3.7                   59   65  79
 40-49        3         11.1                   58   64  79
 50-59        6         22.2                   55   64  76
 60-69        7         25.9              49   53   64  74   88
 70-79        6         22.2              46   53   62  73   85
 80-89        3         11.1         36   41   52   61  73   84   93
 90-99        1          3.7

The table on the left above is called a frequency distribution; it is the
most common method of presenting statistical information.  As a rule, only
the percent frequencies are displayed because, if you know the total
number of scores, you can compute the actual frequencies if they are of
special interest.  The disadvantage of a frequency distribution is that
one cannot tell "at a glance" what the distribution of scores looks like.  
For this purpose, graphic means of presenting statistical information are
preferable.  To illustrate such methods, let us put boxes around the
stacks of numbers, as follows:
                            ____
                       ____: 68 :____
                      : 59 : 65 : 79 :
                      : 58 : 64 : 79 :
                  ____: 55 : 64 : 76 :____
                 : 49 : 53 : 64 : 74 : 88 :
             ____: 46 : 53 : 62 : 73 : 85 :____
            : 36 : 41 : 52 : 61 : 73 : 84 : 93 :

In practice, we leave the actual numbers out, but display along the
baseline the range of numbers in each stack:
                            ____
                       ____:    :____
                      :    :    :    :
                      :    :    :    :
                  ____:    :    :    :____
                 :    :    :    :    :    :
             ____:    :    :    :    :    :____
            :    :    :    :    :    :    :    :
             30_  40_  50_  60_  70_  80_  90_
               39   49   59   69   79   89   99

This graph is called a bar graph.  The height of each bar depicts the
number of people with scores in the range shown along the baseline.
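
     The frequency distribution and a rough bar graph can also be produced
by a short program.  The following is a minimal sketch in Python, assuming
decade-wide class intervals as in the table above; the names (width,
frequency) are illustrative, and the bars are drawn sideways with one "#"
per student rather than upright.

```python
# The 27 vocabulary scores from the text.
scores = [85, 79, 52, 64, 62, 79, 53, 46, 64, 84, 76, 49, 64, 68,
          61, 55, 73, 41, 88, 58, 74, 65, 93, 59, 53, 73, 36]

width = 10                      # each class interval spans one decade
total = len(scores)

# Tally how many scores fall in each interval (30-39, 40-49, ...).
frequency = {}
for low in range(30, 100, width):
    frequency[low] = sum(1 for s in scores if low <= s < low + width)

print("Interval  Frequency  Percent")
for low, count in frequency.items():
    percent = 100 * count / total
    bar = "#" * count           # one mark per student in the interval
    print(f"{low}-{low + width - 1}     {count:>3}      {percent:5.1f}   {bar}")
```

The length of each row of marks plays the same role as the height of each
bar in the graph.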

     Another very similar procedure can be illustrated by putting a dot
above each stack of numbers.

                              .
                         .   68    .
                        59   65   79
                        58   64   79
                    .   55   64   76    .
                   49   53   64   74   88
               .   46   53   62   73   85    .
              36   41   52   61   73   84   93 

Again, in practice, we leave the numbers out and connect the dots:

                              .         
                         .         .      (Connect the
                                               dots)
                        
                    .                   .
                                         
               .                             .
             _________________________________
             30_  40_  50_  60_  70_  80_  90_
               39   49   59   69   79   89   99

This graph is called a frequency polygon, and again the height of the
curve above the baseline represents the number of people who scored in the
indicated range.

     Why do we have both bar graphs and frequency polygons?  The
answer is that there are two fundamentally different kinds of data. Some
things come in discrete, indivisible units.  With such things, you should,
and normally do, ask the question, "How many are there?" and you count
them in order to find out.  The number of children in families, of seats
in classrooms, of rooms in dormitories, of people at parties, of pieces of
puzzles, etc., illustrate discrete data. With such data, we normally use
bar graphs.

     Other things are continuous in nature, at least conceptually. With
such things you should, and normally do, ask the question, "How much is
there?" and you measure in order to find out.  The amount of gasoline in
cars, the weight of people, the height of buildings, the temperature in
classrooms, the distance between cities, etc., illustrate continuous data.
With such data, we can use a frequency polygon because the lines
connecting the points imply continuity.

     There would be no confusion except that numbers themselves are
discrete even when used with continuous things.  If I ask, "How much do
you weigh?"  you will answer with a number of pounds.  But we know that
weight is continuous in nature.  A person probably doesn't weigh exactly
125 pounds, and my home is not exactly 10 miles from campus. It may be
less obvious, but the scores on the vocabulary test are measures of a
continuous trait, namely knowledge of words. The amount of knowledge can't
be counted, but we can measure it by the number of items a person answers
correctly.


On the "Average"

      You undoubtedly already understand the concept of an "average." It
means typical, usual, normal, common, middle, expected, central. In the
vernacular, "average" is not so much a single, exact number as it is a
range of values that fit most people.  For example, I might say that "the
average student registers for somewhere between 15 and 18 hours of course
work."  A 12-hour course load would be considered to be somewhat "below
average."

       Statisticians have devised three different ways to compute a
numerical average.  These are defined and illustrated below with a set of
nine scores where the three statistical meanings of "average" yield the
same answer:
                  3
   1            2 3 4
   2          1 2 3 4 5
   2              ^  
   3          MODE = most frequent score
   3
   3         1 2 2 3 3 3 4 4 5
   4                 ^  
   4          MEDIAN = middle score, half above, half below
   5
  27/9 = 3 --> MEAN = Sum of scores divided by number of scores                  

     You may have heard the expression, "Statistics don't lie, but
statisticians do."  One way to deceive people arises when the three
statistical averages differ: you simply choose the one that best suits
your purposes.  Consider another set of 9 scores:

                  1
                  1 2
   1              1 2 
   1              1 2 4 5
   1              ^  
   1            MODE = most frequent score
   2
   2        1 1 1 1 2 2 2 4 5
   2                ^  
   4             MEDIAN = middle score, half above, half below
   5
  19/9 = 2.1 --> MEAN = Sum of scores divided by number of scores                   

Now if those numbers represent something that is undesirable, such as
"How many times a day do you exceed the speed limit?" I might answer,
"About once a day, on the average."  In this case, the mode gives the
smallest "average."  However, if the numbers represent something that
is desirable, such as "How many times a day do you brush your teeth?"
I might say "More than twice a day, on the average."  Now I have chosen
the mean because it gives the largest number.  Accordingly, by taking
advantage of the ambiguity of the meaning of "average,"  one can tilt
the picture one way or the other.
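
     The three averages for both sets of nine scores can be checked with
Python's standard statistics module.  This is a sketch for illustration
only; the list names first and second are assumptions made here.

```python
from statistics import mean, median, mode

first = [1, 2, 2, 3, 3, 3, 4, 4, 5]     # the three averages agree
second = [1, 1, 1, 1, 2, 2, 2, 4, 5]    # the three averages differ

for name, data in (("first set", first), ("second set", second)):
    print(name,
          "| mode =", mode(data),               # most frequent score
          "| median =", median(data),           # middle score
          "| mean =", round(mean(data), 1))     # sum divided by count
```

For the first set all three come out to 3.  For the second set the mode is
1, the median is 2, and the mean is 2.1.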

    Fortunately, good scientists have no intention to deceive anyone.
They will tell you explicitly which  statistical average  was used to
summarize their data.  If the answers are substantially different, it
is customary to report them all. This should permit you to get a very
clear picture of the data.  There is no one best statistical average;
each is a legitimate way to represent the average score.   As long as
you know how each one is computed, you should not be misled.


On Variability

     An average score can be thought of as a representative score; it
attempts to represent the entire set of scores  with a single number.
If almost all of the scores are the same,  or clustered very close to
the middle, then the average is a very good representative.  However,
if there is a lot of variability in the scores, then the average does
not give enough information.  Accordingly, in addition to an average,
you need some indication of the variability.

     The most common measure of variability  used in textbooks is the
standard deviation.   There is no need for you to know how to compute
the standard deviation.  (If you're interested, it is the square root
of the mean of the squared deviations from the mean.) What you should
know is that, in a symmetrical, bell-shaped distribution, about two-thirds
of the scores fall in the range between one standard deviation below and
one standard deviation above the mean.  For example, if the mean is 100
and the standard deviation is 10, about two-thirds of the scores are
between 90 and 110.
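
     For the curious, the parenthetical definition above can be followed
step by step.  The sketch below, in Python, applies it to the 27 vocabulary
scores from the first section and then counts how many scores lie within
one standard deviation of the mean.  The variable names are illustrative,
and dividing by the number of scores (rather than by one less) is simply
the definition as worded above.

```python
from math import sqrt

# The 27 vocabulary scores from the first section.
scores = [85, 79, 52, 64, 62, 79, 53, 46, 64, 84, 76, 49, 64, 68,
          61, 55, 73, 41, 88, 58, 74, 65, 93, 59, 53, 73, 36]

n = len(scores)
mean = sum(scores) / n

# Square root of the mean of the squared deviations from the mean.
squared_deviations = [(s - mean) ** 2 for s in scores]
sd = sqrt(sum(squared_deviations) / n)

# Scores no more than one standard deviation away from the mean.
within = [s for s in scores if mean - sd <= s <= mean + sd]

print(f"mean = {mean:.1f}, standard deviation = {sd:.1f}")
print(f"{len(within)} of {n} scores lie between "
      f"{mean - sd:.1f} and {mean + sd:.1f}")
```

For these scores the mean is about 65 and the standard deviation about 14,
and 19 of the 27 scores (roughly 70 percent) fall within one standard
deviation of the mean, which is close to the two-thirds rule of thumb.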

On Correlation
     
Two measures are correlated when the amount of one tends to vary with
the amount of the other.  Some familiar correlations are:
          tall people tend to weigh more than short people 
          well-educated people tend to earn more than uneducated people 
          intelligent people tend to be healthier than stupid people 
In each of the above examples, the correlation is positive because larger 
amounts of one measure go together with larger amounts of the other.   But 
in many cases, the correlation is negative: 
          fat people tend to live shorter lives 
          old people tend to have poor memory 
          happy people tend to get fewer ulcers 
Notice that there is nothing "bad," or "wrong," or undesirable about
negative correlations.  The sign (positive or negative) of a correlation
simply shows whether big numbers go with big numbers or whether big
numbers go with little numbers.
 
The degree of correlation is symbolized by "r" and, apart from its sign,
is measured on a scale ranging from zero (no correlation) to 1.00 (perfect
correlation).  There is no need for you to know how to calculate "r," but
it is important to understand what it means if you hear that the
correlation between hours of study and grade point average (GPA) is +.6
among college freshmen.
 
      One way to learn about correlations is to construct scatter plots.
Each person's measures on two variables can be plotted on a graph as
follows:  Suppose person A is 5' tall and weighs 100 pounds.  Person B is
5'8" tall and weighs 112 pounds, and person C is 6'2" tall and weighs 150
pounds.  All of this information is displayed in Figure 1.
 
           : - - - - - - - - - - - - - - - - - - - - - - - - - - -     C 
         6 - 
           : - - - - - - - - - - - - - - - - - - - -    B              : 
           : 
         5 - - - - - - - - - - - - - - - - - - -   A    :    D         : 
 Height    : 
 (feet)    :                                       :    :              : 
         4 - 
           :                                       :    :              : 
           : 
         3 -                                       :    :              : 
           :   
           :                                       :    :              : 
         2 - 
           :                                       :    :              : 
           : 
         1 -                                       :    :              : 
           : 
           :---------:---------:---------:---------:---------:---------:-- 
                    25        50        75        100       125       150 
                                      Weight (pounds) 
 
             Figure 1.  In a scatter plot, each person's two scores are 
             represented by a single point in the graph.  Person D is 5' 
             tall and weighs 125 pounds. 

      When a number of points are plotted, you can get a good idea about
both the degree and the sign of the correlation by just looking at the way
the dots are arranged (each dot representing one person's scores).
 
   Hi :                        .  .     Hi :  .              High Negative 
      :                       . .          :    ..             Correlation 
      :                  .  ..  .          :     .   .            r = -.90 
      :              .  . . ..             :       .... . 
Score :           . . .. .           Score :          . ..   .. 
  Y   :         .  ... .               Y   :              . .... 
      :      . .  .                        :               .. .. . 
      :   .  .         High Positive       :                     .  . 
      :   . .            Correlation       :                       .  .. 
      : .                  r = +.90        :                          . 
      :---------:---------:---------:      :---------:---------:---------: 
                 Score X           Hi                   Score X         Hi 
 
   Hi :     .      .            . 
      :        .            . 
      :   .  .       . No Correlation      Figure 2.   Scatter plots that 
      :.         .        r = .00  .       illustrate the range of likely 
Score :   .          .      .              correlations between  scores X 
  Y   :      .    .            .           and Y.  A perfect  correlation 
      : .          .    .    .             (r=1.00) would show all of the 
      :     .       .   .         .        dots on a single straight line. 
      :  .    .       .     . 
      :    .     .             . 
      :---------:---------:---------: 
                  Score X          Hi 
 
      A high correlation does not necessarily mean that one variable
causes the other.  Both may be caused by some other factor(s).  For
example, studies have usually found a correlation of about +.60 between
one's grades in high school and one's grades as a college freshman.  To
see how this is determined, plot the following data in the graph:
 
Student  X = High   Y = Fresh-       4.0 : 
        School GPA   man GPA             : 
  ---   ----------  ----------           : 
    1      2.00        1.60          3.0 : 
    2      2.25        2.00     Freshman :  
    3      2.60        1.80        GPA   :  
    4      2.65        2.80          2.0 :  
    5      2.80        2.10              :  
    6      3.10        2.00              :  
    7      2.90        2.65          1.0 :  
    8      3.25        2.25              :---------:---------:---------: 
    9      3.60        3.00             1.0       2.0       3.0       4.0 
   10      3.25        3.10                      High School GPA 
 
     Grades in both high school and college are probably caused by factors
like intelligence, amount of effort, and the quality of instruction.  Try
to think of reasons why the correlation is not perfect (like class size).
Even so, it is very unlikely (albeit possible) that a person who got
mostly C's in high school will get straight A's as a freshman.  For the
purpose of prediction, a negative correlation is just as useful as a
positive one.  Plot the following data:

Patient  X = Years   Y = Lung         60 : 
          Smoked     Capacity            : 
  ---   ----------  ----------           : 
    1       25          45            50 : 
    2       36          40        Lung   :  
    3       22          50      Capacity :  
    4       20          60            40 :  
    5       48          25               :  
    6       39          30               :  
    7       42          30            30 :  
    8       31          45               :---------:---------:---------: 
    9       28          58               20        30        40       50 
   10       33          65                        Years Smoked 
 
      In the preceding example, the correlation is about -.80; it is
negative because smaller lung capacity is associated with a larger number
of years of smoking.  Despite the negative sign, we can safely predict
that the longer a person has smoked, the smaller his/her likely lung
capacity.
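
      Although you do not need to calculate "r" by hand, seeing it computed
may make the idea more concrete.  The sketch below, in Python, uses the
usual product-moment formula: the sum of the products of paired deviations
from the two means, divided by the square root of the product of the two
sums of squared deviations.  The function name pearson_r and the list
names are assumptions made for this illustration; the data are the two
ten-row tables above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired scores xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from the two means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

# High school GPA and freshman GPA for the ten students.
high_school = [2.00, 2.25, 2.60, 2.65, 2.80, 3.10, 2.90, 3.25, 3.60, 3.25]
freshman = [1.60, 2.00, 1.80, 2.80, 2.10, 2.00, 2.65, 2.25, 3.00, 3.10]

# Years smoked and lung capacity for the ten patients.
years_smoked = [25, 36, 22, 20, 48, 39, 42, 31, 28, 33]
lung_capacity = [45, 40, 50, 60, 25, 30, 30, 45, 58, 65]

print("GPA:     r =", round(pearson_r(high_school, freshman), 2))
print("Smoking: r =", round(pearson_r(years_smoked, lung_capacity), 2))
```

With these particular rows the first value comes out positive (near +.7)
and the second negative (near -.8).  A sample of only ten people will not
reproduce exactly the figure found in large studies.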
 
      In general, the degree to which two sets of scores are correlated is 
indicated by a number between 0 and 1.   The sign of the correlation tells
whether it is a big-big or a big-little relationship. Both of these can be
displayed in a scatter plot. In the graphs below, make a scatter plot that
represents each of the following correlations  and describe in words  what
you can say about the correlation:  (Be sure to put labels on the axes.)  
       (a) amount of rain and time the sun rises is r=+.01 
       (b) number of miles per gallon and speed of driving is r=-.80 
       (c) amount of alcohol drunk and time to react is r=+.90
       (d) level of income and number of children is r=-.50 
 
Hi :                                    Hi : 
   :                                       : 
   :                                       : 
   :                                       : 
   :                                       : 
   :                                       : 
   :                                       : 
   :                                       :                              
   :                                       :                            
   :---------:---------:---------:         :---------:---------:---------: 
                                Hi                                      Hi 
Hi :                                    Hi : 
   :                                       : 
   :                                       : 
   :                                       : 
   :                                       : 
   :                                       : 
   :                                       : 
   :                                       :                              
   :                                       :                            
   :---------:---------:---------:         :---------:---------:---------: 
                                Hi                                      Hi