Appendix L. Statistics

On Describing Data

Some information is in numerical form. Your height, weight, age, and scores on tests are numbers that convey some information about you. If we have such information on a group of people, we need some way to describe that set of numbers. This is done by using statistics, but you only need to know a few very simple things in order to understand the way that numerical information is described in introductory textbooks.

Suppose, for example, that the 27 students in my class took the vocabulary pre-test given in the preface. If you wanted to know how well the class did, you could look at the list of their scores in my grade book, where the students are listed in alphabetical order. You might see the following numbers:

    85 79 52 64 62 79 53 46 64 84 76 49 64 68
    61 55 73 41 88 58 74 65 93 59 53 73 36

Although reading those numbers might give you some general idea about how well the class did, it would be better if the scores were listed in numerical order:

    36 41 46 49 52 53 53 55 58 59 61 62 64 64
    64 65 68 73 73 74 76 79 79 84 85 88 93

Such an arrangement of the scores enables you to determine quickly that they range between 36 and 93, and that the middle score is 64. A still better picture of this set of scores emerges if we stack the identical scores:

                                     64
                   53                64       73       79
    36 41 46 49 52 53 55 58 59 61 62 64 65 68 73 74 76 79 84 85 88 93

For greater convenience and clarity, it is helpful to lump scores into five to seven class intervals.
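The sorting and lumping just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the language, the helper name `frequency_distribution`, and the choice of intervals ten points wide are not part of the appendix itself.

```python
from collections import Counter

# The 27 vocabulary pre-test scores, in grade-book (alphabetical) order.
scores = [85, 79, 52, 64, 62, 79, 53, 46, 64, 84, 76, 49, 64, 68,
          61, 55, 73, 41, 88, 58, 74, 65, 93, 59, 53, 73, 36]


def frequency_distribution(values, width=10):
    """Lump values into class intervals of the given width (hypothetical helper).

    Returns one (low, high, frequency, percent) row per interval,
    in ascending order, including any empty interval in between.
    """
    # Map each score to the lower bound of its interval, e.g. 64 -> 60.
    counts = Counter((v // width) * width for v in values)
    rows = []
    for low in range(min(counts), max(counts) + width, width):
        freq = counts.get(low, 0)
        percent = round(100 * freq / len(values), 1)
        rows.append((low, low + width - 1, freq, percent))
    return rows


ordered = sorted(scores)
print("Range:", ordered[0], "to", ordered[-1])
print("Middle score:", ordered[len(ordered) // 2])
for low, high, freq, percent in frequency_distribution(scores):
    print(f"{low}-{high}  {freq:2d}  {percent:5.1f}")
```

Changing `width` shows how the picture of the class shifts when the scores are lumped more coarsely or more finely.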
I have chosen a set of scores so that the numerical decades provide good class intervals, and the results of this lumping of the scores are shown in two ways:

    Score                  Percent
    Interval   Frequency   Frequency                   68
     30-39         1          3.7                   59 65 79
     40-49         3         11.1                   58 64 79
     50-59         6         22.2                   55 64 76
     60-69         7         25.9                49 53 64 74 88
     70-79         6         22.2                46 53 62 73 85
     80-89         3         11.1             36 41 52 61 73 84 93
     90-99         1          3.7

The table on the left above is called a frequency distribution; it is the most common method of presenting statistical information. As a rule, only the percent frequencies are displayed because, if you know the total number of scores, you can compute the actual frequencies if they are of special interest.

The disadvantage of a frequency distribution is that one cannot tell "at a glance" what the distribution of scores looks like. For this purpose, graphic means of presenting statistical information are preferable. To illustrate such methods, let us put boxes around the stacks of numbers, as follows:

                    ____
               ____: 68 :____
              : 59 : 65 : 79 :
              : 58 : 64 : 79 :
          ____: 55 : 64 : 76 :____
         : 49 : 53 : 64 : 74 : 88 :
     ____: 46 : 53 : 62 : 73 : 85 :____
    : 36 : 41 : 52 : 61 : 73 : 84 : 93 :

In practice, we leave the actual numbers out, but display along the baseline the range of numbers in each stack:

                    ____
               ____:    :____
              :    :    :    :
              :    :    :    :
          ____:    :    :    :____
         :    :    :    :    :    :
     ____:    :    :    :    :    :____
    :    :    :    :    :    :    :    :
     30-  40-  50-  60-  70-  80-  90-
     39   49   59   69   79   89   99

This graph is called a bar graph. The height of each bar depicts the number of people with scores in the range shown along the baseline.

Another very similar procedure can be illustrated by putting a dot above each stack of numbers:

                .
            .   68  .
            59  65  79
            58  64  79
        .   55  64  76  .
        49  53  64  74  88
    .   46  53  62  73  85  .
    36  41  52  61  73  84  93

Again in practice, we leave the numbers out and connect the dots:

                .
            .       .

                                   (Connect the dots)
        .               .

    .                       .

    __________________________
    30- 40- 50- 60- 70- 80- 90-
    39  49  59  69  79  89  99

This graph is called a frequency polygon, and again the height of the curve above the baseline represents the number of people who scored in the indicated range.

Why do we have both bar graphs and frequency polygons? The answer is that there are two fundamentally different kinds of data. Some things come in discrete, indivisible units. With such things, you should, and normally do, ask the question, "How many are there?" and you count them in order to find out. The number of children in families, of seats in classrooms, of rooms in dormitories, of people at parties, of pieces of puzzles, etc., illustrate discrete data. With such data, we normally use bar graphs.

Other things are continuous in nature, at least conceptually. With such things you should, and normally do, ask the question, "How much is there?" and you measure in order to find out. The amount of gasoline in a car, the weight of a person, the height of a building, the temperature in a classroom, the distance between cities, etc., illustrate continuous data. With such data, we can use a frequency polygon because the lines connecting the points imply continuity.

There would be no confusion except that numbers themselves are discrete even when used with continuous things. If I ask, "How much do you weigh?" you will answer with a number of pounds. But we know that weight is continuous in nature. A person probably doesn't weigh exactly 125 pounds, and my home is not exactly 10 miles from campus. It may be less obvious, but the scores on the vocabulary test are measures of a continuous trait, namely knowledge of words. The amount of knowledge can't be counted, but we can measure it by the number of items a person answers correctly.

On the "Average"

You undoubtedly already understand the concept of an "average." It means typical, usual, normal, common, middle, expected, central.
In the vernacular, "average" is not so much a single, exact number as it is a range of values that fit most people. For example, I might say that "the average student registers for somewhere between 15 and 18 hours of course work." A 12-hour course load would be considered to be somewhat "below average."

Statisticians have devised three different ways to compute a numerical average. These are defined and illustrated below with a set of nine scores where the three statistical meanings of "average" yield the same answer:

    Score
                  3
      1         2 3 4
      2       1 2 3 4 5
      2           ^
      3       MODE = 3, the most frequent score
      3
      3       1 2 2 3 3 3 4 4 5
      4               ^
      4       MEDIAN = 3, the middle score (half above, half below)
      5       MEAN = 27/9 = 3, the sum of scores divided by number of scores

You may have heard the expression, "Statistics don't lie, but statisticians do." One way to deceive people arises when the three statistical averages differ and you choose the one that best suits your purposes. Consider another set of 9 scores:

    Score
                  1
                  1 2
      1           1 2
      1           1 2 4 5
      1           ^
      1       MODE = 1, the most frequent score
      2
      2       1 1 1 1 2 2 2 4 5
      2               ^
      4       MEDIAN = 2, the middle score (half above, half below)
      5       MEAN = 19/9 = 2.1, the sum of scores divided by number of scores

Now if those numbers represent something that is undesirable, such as "How many times a day do you exceed the speed limit?" I might answer, "About once a day, on the average." In this case, I have chosen the mode because it gives the smallest "average." However, if the numbers represent something that is desirable, such as "How many times a day do you brush your teeth?" I might say "More than twice a day, on the average." Now I have chosen the mean because it gives the largest number. Accordingly, by taking advantage of the ambiguity of the meaning of "average," one can tilt the picture one way or the other.

Fortunately, good scientists have no intention to deceive anyone. They will tell you explicitly which statistical average was used to summarize their data. If the answers are substantially different, it is customary to report them all.
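The three averages for the two sets of nine scores can be checked with a short Python sketch. The function names are chosen here for illustration only. The rule used for an even number of scores (averaging the two middle ones) is a common convention that the appendix does not itself state.

```python
from collections import Counter


def mode(values):
    """Most frequent score (the first one encountered if there is a tie)."""
    return Counter(values).most_common(1)[0][0]


def median(values):
    """Middle score of the sorted list.

    Assumption: with an even number of scores, the two middle scores
    are averaged.
    """
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mean(values):
    """Sum of scores divided by number of scores."""
    return sum(values) / len(values)


first_set = [1, 2, 2, 3, 3, 3, 4, 4, 5]
second_set = [1, 1, 1, 1, 2, 2, 2, 4, 5]

for name, data in (("first", first_set), ("second", second_set)):
    print(name, mode(data), median(data), round(mean(data), 1))
# first 3 3 3.0
# second 1 2 2.1
```

For the second set the three numbers disagree, which is exactly the situation in which the choice of "average" can tilt the picture.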
Seeing all three should give you a very clear picture of the data. There is no one best statistical average; each is a legitimate way to represent the average score. As long as you know how each one is computed, you should not be misled.

On Variability

An average score can be thought of as a representative score; it attempts to represent the entire set of scores with a single number. If almost all of the scores are the same, or clustered very close to the middle, then the average is a very good representative. However, if there is a lot of variability in the scores, then the average does not give enough information. Accordingly, in addition to an average, you need some indication of the variability.

The most common measure of variability used in textbooks is the standard deviation. There is no need for you to know how to compute the standard deviation. (If you're interested, it is the square root of the mean of the squared deviations from the mean.) What you should know is that, in a symmetrical, bell-shaped distribution, about two-thirds of the scores fall in the range between one standard deviation below and one standard deviation above the mean. For example, if the mean is 100 and the standard deviation is 10, about two-thirds of the scores are between 90 and 110.

On Correlation

Two measures are correlated when the amount of one tends to vary with the amount of the other. Some familiar correlations are:

    tall people tend to weigh more than short people
    well-educated people tend to earn more than uneducated people
    intelligent people tend to be healthier than less intelligent people

In each of the above examples, the correlation is positive because larger amounts of one measure go together with larger amounts of the other. But in many cases, the correlation is negative:

    fat people tend to live shorter lives
    old people tend to have poor memory
    happy people tend to get fewer ulcers

Notice that there is nothing "bad," or "wrong," or undesirable about negative correlations.
The sign (positive or negative) of a correlation simply shows whether big numbers go with big numbers or whether big numbers go with little numbers.

The degree of correlation is symbolized by "r" and is measured on a scale ranging from zero (no correlation) to 1.00 (perfect correlation). There is no need for you to know how to calculate "r" but it is important to understand what it means if you hear that the correlation between hours of study and grade point average (GPA) is +.6 among college freshmen.

One way to learn about correlations is to construct scatter plots. Each person's measures on two variables can be plotted on a graph as follows: Suppose person A is 5' tall and weighs 100 pounds. Person B is 5'8" tall and weighs 112 pounds, and person C is 6'2" tall and weighs 150 pounds. All of this information is displayed in Figure 1.

              :- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -C
          6 - :
              :- - - - - - - - - - - - - - - - - - - - - - B
              :
          5 - :- - - - - - - - - - - - - - - - - - - -A         D
    Height    :
    (feet)    :
          4 - :
              :
              :
          3 - :
              :
              :
          2 - :
              :
              :
          1 - :
              :
              :---------:---------:---------:---------:---------:---------:--
                       25        50        75       100       125       150
                                     Weight (pounds)

Figure 1. In a scatter plot, each person's two scores are represented by a single point in the graph. Person D is 5' tall and weighs 125 pounds.

When a number of points are plotted, you can get a good idea about both the degree and the sign of the correlation by just looking at the way the dots are arranged (each dot representing one person's scores).

[Figure 2: three scatter plots of Score Y (vertical axis) against Score X (horizontal axis), labeled High Negative Correlation (r = -.90), High Positive Correlation (r = +.90), and No Correlation (r = .00).]

Figure 2. Scatter plots that illustrate the range of likely correlations between scores X and Y. A perfect correlation (r=1.00) would show all of the dots on a single straight line.

A high correlation does not necessarily mean that one variable causes the other. Both may be caused by some other factor(s). For example, studies have usually found a correlation of about +.60 between one's grades in high school and one's grades as a college freshman. To see how this is determined, plot the following data in the graph:

    Student   X = High     Y = Fresh-
              School GPA   man GPA
    -------   ----------   ----------
       1         2.00         1.60
       2         2.25         2.00
       3         2.60         1.80
       4         2.65         2.80
       5         2.80         2.10
       6         3.10         2.00
       7         2.90         2.65
       8         3.25         2.25
       9         3.60         3.00
      10         3.25         3.10

             4.0 :
                 :
                 :
             3.0 :
    Freshman     :
    GPA          :
             2.0 :
                 :
                 :
             1.0 :
                 :---------:---------:---------:
                1.0       2.0       3.0       4.0
                         High School GPA

Grades in both high school and college are probably caused by factors like intelligence, amount of effort, and the quality of instruction. Try to think of reasons why the correlation is not perfect (like class size). Even so, it is very unlikely (albeit possible) that a person who got mostly C's in high school will get straight A's as a freshman.

For the purpose of prediction, a negative correlation is just as useful as a positive one. Plot the following data:

    Patient   X = Years    Y = Lung
              Smoked       Capacity
    -------   ----------   ----------
       1          25           45
       2          36           40
       3          22           50
       4          20           60
       5          48           25
       6          39           30
       7          42           30
       8          31           45
       9          28           58
      10          33           65

              60 :
                 :
                 :
              50 :
    Lung         :
    Capacity     :
              40 :
                 :
                 :
              30 :
                 :---------:---------:---------:
                20        30        40        50
                          Years Smoked

In the preceding example, the correlation is about -.80; it is negative because smaller lung capacity is associated with a larger number of years of smoking. Even though the correlation is negative, we can safely predict that the longer a person has smoked, the smaller his or her lung capacity is likely to be.

In general, the degree to which two sets of scores are correlated is indicated by a number between 0 and 1.
The sign of the correlation tells whether it is a big-big or a big-little relationship. Both of these can be displayed in a scatter plot.

In the graphs below, make a scatter plot that represents each of the following correlations and describe in words what you can say about the correlation. (Be sure to put labels on the axes.)

    (a) amount of rain and time the sun rises is r=+.01
    (b) number of miles per gallon and speed of driving is r=-.80
    (c) amount of alcohol drunk and time to react is r=+.90
    (d) level of income and number of children is r=-.50

    Hi :                                       Hi :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :---------:---------:---------:            :---------:---------:---------:
                                    Hi                                         Hi

    Hi :                                       Hi :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :                                          :
       :---------:---------:---------:            :---------:---------:---------:
                                    Hi                                         Hi
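Although you do not need to calculate "r" yourself, readers who are curious can check the two data sets given earlier with the following Python sketch. It uses the definition of the standard deviation quoted in "On Variability" (the square root of the mean of the squared deviations from the mean) and computes r as the mean product of the paired standard scores, which is one standard way of writing the Pearson correlation. The appendix does not say which formula its figures were based on, so treat this as an illustration, and the function names as hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def standard_deviation(values):
    """Square root of the mean of the squared deviations from the mean."""
    m = mean(values)
    return math.sqrt(mean([(v - m) ** 2 for v in values]))


def correlation(xs, ys):
    """Pearson r, written as the mean product of paired standard scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = standard_deviation(xs), standard_deviation(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])


# The ten students and ten patients listed in the tables above.
high_school_gpa = [2.00, 2.25, 2.60, 2.65, 2.80, 3.10, 2.90, 3.25, 3.60, 3.25]
freshman_gpa = [1.60, 2.00, 1.80, 2.80, 2.10, 2.00, 2.65, 2.25, 3.00, 3.10]
years_smoked = [25, 36, 22, 20, 48, 39, 42, 31, 28, 33]
lung_capacity = [45, 40, 50, 60, 25, 30, 30, 45, 58, 65]

print(round(correlation(high_school_gpa, freshman_gpa), 2))
print(round(correlation(years_smoked, lung_capacity), 2))
```

On these small samples the sketch gives about +.69 for the grade data and about -.77 for the smoking data, in line with the "about +.60" and "about -.80" mentioned in the text.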