Regression Review:

(The italicized “think” questions are ones that you should be able to answer at quiz time.)

 

Main concepts:

 

1.      Regression is a method for making predictions from 2 related variables (related means that the variables are correlated). The prediction usually takes the form: what is the expected y value for a given x value?

 

Think: What kind of predictions can be made between 2 variables which are completely uncorrelated?

 

2.      The form of the regression line is: predicted y = (slope multiplied by x value) + intercept.

 

Think: what does the slope mean in terms of the standard deviations of the variables you are working with?

 

3.      Steps to compute regression line :

a.       Find the slope: (correlation multiplied by the standard deviation of y) divided by (standard deviation of x).

b.      Find the intercept: (mean of y) minus (slope multiplied by the mean of x).

c.       Plug in x and solve for y.
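
As a rough illustration, the three steps above can be sketched in Python. The data set (hours studied and quiz scores) and the function names are made up for this example; they do not come from the textbook.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Correlation between x and y (Pearson's r)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Steps a and b: slope first, then intercept."""
    r = pearson_r(xs, ys)
    slope = r * stdev(ys) / stdev(xs)          # step a
    intercept = mean(ys) - slope * mean(xs)    # step b
    return slope, intercept


def predict(x, slope, intercept):
    """Step c: plug in x and solve for y."""
    return slope * x + intercept


# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

slope, intercept = fit_line(hours, scores)
print(round(slope, 3), round(intercept, 3))       # 0.6 2.2
print(round(predict(3.5, slope, intercept), 3))   # 4.3
```

Note that 3.5 hours lies inside the range of the made-up x values (1 to 5), so the prediction respects the range condition in point 4.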

 

4.      That’s it! Now you can predict a y value for any given x value. Remember, predictions are only good when:

a.      The x values you want to use come from the same population that the regression equation was derived from and

b.      The x values you want to use are within the range of the original x variable.
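
Condition b can be checked mechanically; condition a cannot, since it depends on knowing where your data came from. A minimal sketch, assuming a hypothetical regression line (slope 0.6, intercept 2.2) that was derived from x values between 1 and 5:

```python
def predict_in_range(x, slope, intercept, x_min, x_max):
    """Predict y, but refuse to extrapolate beyond the original x range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the original range [{x_min}, {x_max}]; "
            "the prediction would not be trustworthy"
        )
    return slope * x + intercept


# Hypothetical line and range, invented for illustration only.
print(round(predict_in_range(3, 0.6, 2.2, x_min=1, x_max=5), 3))  # 4.0

try:
    predict_in_range(10, 0.6, 2.2, x_min=1, x_max=5)
except ValueError as err:
    print("Refused:", err)
```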

 

 

5.      Error: Because the correlation between the two variables is not 1.0, the regression equation cannot make exact predictions. There will be prediction errors. You can check this in two ways (a and c below; b lists the computation steps):

a.      Check it against the original x and y values. Each (x,y) pair that went into computing the correlation can now be verified (what was the y predicted by the regression equation, and what was the actual y?). This will give you the standard error of the estimate (SEE).

b.      Steps to compute SEE:

                                                                          i.      Compute the predicted y value for each x in the data set. The predicted y’s will generally not be the same as the actual y’s.

                                                                        ii.      Subtract the predicted y from each actual y. The errors should be relatively random around the predicted y values (i.e., some actual y’s will be less than predicted, some will be more).

                                                                      iii.      Square these differences, and then add them all up.

                                                                      iv.      Divide this sum by N-2, where N is the number of (x,y) pairs.

                                                                        v.      Take the square root of the whole thing. This is the SEE.

 

Think: What is the SEE, in words? Is it an average? How is it like a standard deviation?

 

c.       The other way to compute an SEE is to go out and collect more (x,y) pairs, compute the predicted y for each of these new x’s, and compare the predicted y’s to the actual y’s, using the steps above.
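
The SEE steps (i through v) can be sketched in Python as follows. The data, the line (slope 0.6, intercept 2.2), and the function name are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def standard_error_of_estimate(xs, ys, slope, intercept):
    """Follow steps i-v for a set of (x, y) pairs and a given line."""
    predicted = [slope * x + intercept for x in xs]              # step i
    errors = [y - y_hat for y, y_hat in zip(ys, predicted)]      # step ii
    sum_of_squares = sum(e ** 2 for e in errors)                 # step iii
    return sqrt(sum_of_squares / (len(xs) - 2))                  # steps iv and v


# Way a: check the line against the original (made-up) data.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
see_original = standard_error_of_estimate(hours, scores, 0.6, 2.2)
print(round(see_original, 4))   # 0.8944

# Way c: check the same line against newly collected (made-up) pairs.
new_hours = [2, 3, 4]
new_scores = [3, 4, 5]
see_new = standard_error_of_estimate(new_hours, new_scores, 0.6, 2.2)
print(round(see_new, 4))
```

In the first call the errors are -0.8, 0.6, 1.0, -0.6 and -0.2: some below the line and some above, as step ii says they should be.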

 

6.      Homoscedasticity:

a.      Homoscedasticity is an assumption that we make when using a regression line to make predictions.  More precisely, it is an assumption that we make when interpreting the SEE.  For the SEE to be meaningful, the assumption must be met.

                                                                          i.      Homoscedasticity simply says that errors of prediction are random around the predicted y at each x value, and that the variability in y is the same at every x value.  Look carefully at pg. 142, figure 7.5 in your textbook.

                                                                        ii.      To evaluate the assumption of homoscedasticity, you must have several (x,y) pairs in your data set where the x’s are the same and the y’s are different.  Then you can look at the predicted y value for that x, and see how the actual y values vary around that predicted value. These deviations of actual y from predicted y should be randomly distributed around the predicted y value. If all the actual y values were, for example, larger than the predicted value, we fail to meet the assumption. The deviations must be randomly distributed like this at every possible x value (this would be hypothetical in the population data, but we can make a go at evaluating it in our sample data).
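
One informal way to make this evaluation is to group the prediction errors by x value and look at each group. The sketch below does that for made-up data and a hypothetical line (slope 1, intercept 0); it is an eyeball check under those assumptions, not a formal statistical test.

```python
from collections import defaultdict
from statistics import stdev


def errors_by_x(xs, ys, slope, intercept):
    """Collect (actual y - predicted y) separately for each x value."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y - (slope * x + intercept))
    return dict(groups)


def summarize(groups):
    """For each x: how many errors fall above/below the line, and their spread."""
    summary = {}
    for x, errs in sorted(groups.items()):
        summary[x] = {
            "above": sum(e > 0 for e in errs),
            "below": sum(e < 0 for e in errs),
            "spread": stdev(errs) if len(errs) > 1 else None,
        }
    return summary


# Hypothetical data: three y values observed at each of three x values.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ys = [0.5, 1.0, 1.5, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5]

summary = summarize(errors_by_x(xs, ys, slope=1, intercept=0))
for x, info in summary.items():
    print(x, info)
```

If the assumption holds, each x value should show errors on both sides of the line and roughly the same spread. A group where every error is above the line, or where the spread is much larger than in the other groups, is a warning sign.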

 

Think:  When the assumption of homoscedasticity is met, what can we infer about the population data? (i.e., what would the distribution of errors look like if we had N = 10,000?)

 


Due to popular demand, the ‘think’ questions have been answered here: ‘Think!’