Working With Your Data

You’ll be constructing numerous graphs as you progress through the simulation exercises and answer the accompanying questions.  Probably the easiest way to do graph your data is to enter them into a spreadsheet and use its graphing functions to construct the graphs you need.  This essay will take you through the steps involved in producing a graph based on data you’ve entered in a spreadsheet and will introduce you to the most basic aspects of fitting a regression model to your data.  For purposes of instruction, we’ll use Microsoft ExcelÔ, but you may use any spreadsheet software you wish.  I assume that you know how to enter data into a spreadsheet and do basic manipulations with them.  In this how-to guide I’ll just cover the basics of graph construction and regression analysis.

Creating A Graph

The Excel graphing packages suffer from some shortcomings for scientific graphs, most notably the fact that you can’t have a numerical scale for the x-axis of histograms.  Nonetheless, for the relatively simple graphs you’ll be working with in these simulations, the graphing packages that come with Excel or other spreadsheets are adequate.  Limitations notwithstanding, Excel and most other spreadsheet packages make it easy to construct a graph (called a “chart” in Excel) of your data.  Once you have your data in a MS Excel worksheet, the following steps will produce a graph you can be proud of. 

Let’s work with the following data:

Independent
Variable
Dependent
Variable
( x ) ( y )
1 1
2 21
3 28
4 23
5 43
6 36
7 61
8 68
9 112
10 128
11 161

Go ahead and enter these data into your spreadsheet so you can follow along as we proceed through the creation of a graph and fitting them with a regression line.  The easiest way to do this is to highlight the above table with your mouse, copy it to your computer’s clipboard, then paste the clipboard’s contents into your spreadsheet. 

I’ll assume you have two columns of data, and that the left-hand column (Column A in this case) contains the independent variable’s data (x) and the right-hand column (Column B) contains the dependent variable’s data (y). 

1.  Using the cursor and the left mouse button, highlight the data you wish to graph by clicking on the left uppermost datum in the left-hand column and while keeping the left-hand mouse button depressed, drag the cursor right one column and down the right-hand column until you’ve highlighted all of the x-y data you wish to include in your graph.  Release the mouse button.  Your display should now look similar to this:

2.  Click the Insert menu item (or press Alt, I) to pull up the first Chart Wizard window:

3.  For nearly all of the exercises you’ll be doing in this course, you’ll want to select the “XY (Scatter)” option under “Chart type” (the 5th entry in the list),  You can do this either (i) by clicking on it with the mouse, then clicking the Next button,  or (ii) by using the down arrow key (¯) to highlight the option, then hitting the Enter key.  Either way, the Chart sub-type on the right side of the Chart Wizard window will change to show the available scatter plots:

4.  Select the uppermost scatter plot sub-type (no lines) and click the Next button to pull up the 2nd Chart Wizard Window:

5.  Make sure that the indicated Data range is correct, and that the Columns radio button is selected. Click the Next button to bring up the 3rd Chart Wizard window:

6.  At your option, you can enter a title for your graph and labels for the x- and y-axis.  Once you’re through with this window, click the Next button to bring up the 4th (and last) Chart Wizard window:

This window simply lets you specify whether your new graph will be displayed as a small chart in the worksheet containing the data plotted in the graph (the default “As object in:” option) or as a full-size (8 ½ x 11) chart in its own page.  You’ll probably want to go with the default for now, but you may change this at any time. 

7.  Click the Finish button and your completed graph will appear at the location you’ve specified, and your worksheet’s display should look similar to this:

You may edit the graph to get it looking the way you want, but for now we’ll content ourselves with expanding the graph’s dimensions and make it easier to work with simply by deleting the legend that Excel automatically generated at the right-hand edge of the chart.  Do this by putting the cursor inside the legend’s box, click the left mouse button to highlight it, then hit the “Delete” key.  The legend should disappear.

Further techniques for editing the graph are beyond the scope of this essay, but there any number of references and tutorials available to help you learn the necessary steps to make your graphs look the way you want them to.

You can print the graph at this time, if you wish.  However, it’s often easier to gain information and insights from a simple graph if we superimpose some sort of ‘best fit’ line that depicts in general terms how the dependent variable responds to changes in the independent variable.  Often, this line is drawn “by eye”, but it’s usually preferable to used well-documented statistical techniques to generate the line of best fit.  In the next section, we’ll step through the procedure involved in fitting a least-squares regression to the data displayed in the graph we just created.

Fitting Regression Lines To Your Data

Since Excel was targeted primarily at business-oriented users, rather that scientists, its developers chose to use the term “trendline” to refer to what most scientists would call a “regression line”.  Perhaps also reflecting its business heritage, Excel makes fitting trendlines to your data very simple and straightforward. 

Once you have your graph constructed, all you have to do in order to fit a regression line to the data is:

1.  Click on the chart to select it.

2.  Open the “Add Trendline” window.  You do this by either:

a.  Pressing the Alt key, followed by the c key (for Chart), followed by the r key (for Add Trendline)

– or – 

b.  Using the mouse, click on the Chart menu item in the menu bar (near the top of the Excel window), then on Add Trendline in the drop-down menu that appears.

3.  You will now see the “Add Trendline” window in the center of the display:

You can select any of the six displayed regression models simply by clicking on the window displaying the regression model you wish to fit to your data.  For now, let’s go with the default Linear model.  Now, click on the “Options” tab near the top of the window. 

4.  This will bring up an “Add Trendline” window:

In this window, you can set a number of options.  For this example, we will click on the “Display equation on chart” and “Display R-squared value on chart” checkboxes (the bottom two checkboxes) to activate those options. 

5.  Click the OK button to see the regression line displayed on your graph:

You now see the data with the least-squares best fit regression line, the equation for the line, and the R2 value displayed as well (really, it’s an r2 value, but we won’t quibble……).  You’ll notice the change in the y-axis’ scale……that’s because we’re letting Excel automatically determine the scale.  You can change that option if you like.

You can format the graph to suit your individual tastes, using the various menu items that are available by double clicking on the item you wish to format (the x-axis, the y-axis, grid lines, the regression line, whatever), or by clicking the right mouse button with the cursor positioned anywhere within the chart’s display area and selecting from the options that are displayed in the pop-up window that appears.  Right now, you might want to move the regression equation/ R2 value away from the data points.  Do this by simply positioning the cursor anywhere in the area containing the equation & R2 value and dragging the field to the position you choose.

Selecting a Regression Model

This is an extremely important topic that’s too often overlooked, even by practicing research scientists.  It would serve us well to spend a short time discussing some of the basics involved in how we decide what model to use when conducting a regression analysis ( = fitting a regression line) of a data set.

Consider the (linear) regression model we fit to our data in the previous section.  The R2 value for the regression fit is 0.8882, which is extremely high (most scientists would just kill to get R2 values that high!).  Remember, R2 can only go as high as 1.00, and then only if the regression line fits the data perfectly.  An R2 of 0.8882 means that our linear model, y = ax + b accounts for nearly 90% of the variation in y.  Put another way, such a high R2 value means that we can use our model to make very accurate predictions about the ‘behavior’ of y. It’s often very tempting to take such high R2 at face value, assume the system is almost perfectly understood, and stop there.  Indeed, myriad published papers reveal that their authors have done just that.

But, let’s look more closely at the data and the regression line.  Notice that there is a pattern to the position of the data points relative to the regression line:  at small x-values (x < 3), the data points all fall above the regression line ( = “positive residuals”), but at intermediate x-values (4 < x < 8) the data points fall below the regression line ( = “negative residuals”), and for x > 9, the data points once again fall above the regression line (positive residuals).  Any pattern in the residuals of the data about the regression line is a strong hint that the model is not the best possible in terms of explaining the data and that the model may, in fact, be inappropriate or incorrect. 

The whole question of choosing a model to fit to your data is dealt with much more extensively in the accompanying essay entitled Models In Physiology.  For now, let’s just use Exel’s Chart Wizard to fit another model to the data. 

The pattern in the residuals described in the previous paragraph suggests that some sort of a curvilinear model would fit the data better.  Let’s see what we can do: 

1.  Click (left mouse button) on the regression line to highlight it.  Then, call up the Chart Wizard’s Format Trendline” window by:

a.  rapidly double-clicking the left mouse button with the cursor placed on the trendline,

b.  simultaneously depressing the Ctrl and 1 keys

– or – 

c.  clicking on the Format menu item, then the Selected Trendline… option.

This will cause the “Format Trendline” window to appear in the center of the screen:

Click on the Polynomial panel to fit the default 2nd Order (quadratic) model to the data, click on the Options tab and check to make sure that the Display equation and Display R2 options are still selected.  

2.  Click the OK button., hit the Esc key to ‘de-highlight’ the regression line, and your spreadsheet with the new regression model displayed should now look similar to:

Notice that the new regression line does seem to fit the data better, which is reflected in a larger R2 (0.9695) than when we fit the linear model (we won’t bother testing whether the new R2 is significantly larger in the statistical sense, although ordinarily one should do so). More importantly, the residuals don’t show the same dependence on the value of x  as previously.  Now, negative and positive residuals occur along the entire length of the curve, at small, intermediate, and large x-values. 

Thus, although a linear model fit the data quite well, blind acceptance of the resulting equation and R2 would clearly have been inappropriate and led to possible problems.  For example, suppose these data actually represent the response of the rate some physiological process to changes in the concentration of some ion.  Using the linear model would have led to a prediction of negative rates for the process at small x-values, clearly an impossibility.  Similarly, the linear model would badly underestimate the rate of the process at large values of x, which might be disastrous if you were, say, trying to design a medical treatment protocol based on how you thought the process’ rate depended on ion concentrations.  And, it almost goes without saying that attempting to extrapolate the linear model to even higher values of x (always a tricky endeavor, but sometimes necessary) would lead to wildly incorrect underestimates of y-values.

Take-home Message

Blindly accepting a model – and the regression line it produces – at face value
solely because it yields a large R2 value is a recipe for error, sometimes for disaster.
You must look – and look critically – at a regression line side-by-side with your data before you attribute any explanatory power whatsoever to the regression line or to the model on which the line is based.

In conclusion, the better fit of a 2nd order polynomial to our data implies that a quadratic model is a more appropriate model to use for fitting these data.  However, there is more – much more – to data analysis than just finding a model that fits the data well.  Any number of books have been written on the subject of models and their selection, and a treatment of the subject here is not possible.  Suffice to say, whenever possible, you should choose a model based on a thorough understanding of the underlying mechanisms responsible for your data. 

Take-home Exercise

1.  If you haven’t done so already, copy the data from the table at the start of this essay into your favorite spreadsheet and construct a graph of them following the above steps.  Practice fitting various regression models to the data.  In particular, try fitting polynomials of different orders to the data (select the order by clicking the up- and down-arrows to the right of the “Order” checkbox).  Note the effect on the value of R2.  Be certain to examine the residuals closely, looking for any patterns such as the one discussed above. To make it easier to detect patterns in the residuals, it's helpful to construct a graph of the residual values versus x.