What is a Regression?

Regression is a form of statistical analysis used to predict one variable (y) from values of another variable (x). A regression equation is derived from a known set of data. The adjacent graph shows the mortality indices (y) of groups of English men with different smoking indices (x). The smoking index is the ratio of the average number of cigarettes smoked per day by men in that particular group to the average number of cigarettes smoked per day by all men. The mortality index is the ratio of the rate of deaths from lung cancer among men in that particular group to the rate of lung cancer deaths among all men. In linear regression, a straight line is drawn through the data points. The line (y = mx + b, where m is the slope and b is the y-intercept) can then be used to predict the mortality index of a group with a known smoking index.

But how was the line drawn?

Many lines could have been drawn through the data in the above figure. How was the plotted line selected? A "residual" is the vertical distance from a data point to a line, that is, the difference between the observed y value and the y value the line predicts for the same x. If the residual for each data point is determined and squared, a sum of the "squared residuals" can be calculated for each line drawn through the data set. The regression line drawn in this example is the one line with the smallest sum of squared residuals. It's even got a catchy name: the "Least Squares Regression Line!"
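The idea of picking the line with the smallest sum of squared residuals can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the data values are invented (they are not the smoking data from the graph), and the slope and intercept come from the standard closed-form least squares formulas.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the line y = m*x + b with the smallest sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def sum_squared_residuals(xs, ys, m, b):
    """Add up the squared vertical distances from each point to the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Invented illustrative values, NOT the data plotted in the graph.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, b = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, m, b)

# Any other line through the same data gives a larger sum of squared residuals.
steeper = sum_squared_residuals(xs, ys, m + 0.1, b)
higher = sum_squared_residuals(xs, ys, m, b + 0.1)

# Use the fitted line to predict y for a new x value.
predicted = m * 6.0 + b

print("slope:", m, "intercept:", b)
print("sum of squared residuals:", best, "vs", steeper, "and", higher)
print("predicted y at x = 6:", predicted)
```

Nudging the slope or the intercept away from the fitted values always makes the sum of squared residuals larger, which is what makes the fitted line the "least squares" line.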

So…how well does the Regression equation predict an unknown y value based on a known x value?

If all of the data points fell on the line, there would be a perfect correlation between the x and y values (Figures 1a and 1b). With a perfect correlation, the correlation coefficient, r, has a value of 1.0 or -1.0. These cases represent the best scenario for predicting. The sign of r just shows how y varies with x. When r is positive, y increases as x increases. When r is negative, y decreases as x increases.
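As a rough check on the perfect-correlation cases, here is a small Python sketch that computes r with the usual Pearson formula. The function name and the data are invented for illustration; the points are built to fall exactly on a rising line and on a falling line.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented points that lie exactly on a straight line.
xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # y increases as x increases
falling = [10 - 3 * x for x in xs]  # y decreases as x increases

r_up = correlation(xs, rising)
r_down = correlation(xs, falling)

print("r for the rising line:", r_up)
print("r for the falling line:", r_down)
```

The rising line should give an r of 1.0 and the falling line an r of -1.0, apart from any tiny floating-point rounding.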

Figure 2a shows a regression line where r = 0.9. It is not a perfect correlation since the data points do not all fall on the line, but many are very close to the line. Since r is close to 1.0, the regression equation will predict unknown y values from known x values pretty well.

In Figures 2b and 2c, the data points fall farther from the least squares regression line, so the correlation between x and y drops. Hence, the regression equations will not be able to predict an unknown y value from a known x value as well. If r = 0, there is no linear correlation between x and y, and the regression line has no predicting value!
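To see r drop as points scatter farther from a line, the sketch below compares three invented data sets that share the same x values. None of these numbers come from Figures 2a to 2c; they are made up so that the scatter grows from one set to the next, and the helper function name is hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented illustrative values, NOT the data behind the figures.
xs = [1, 2, 3, 4, 5]
tight = [2.1, 3.9, 6.2, 7.8, 10.1]  # points hug a straight line
loose = [4, 2, 7, 5, 9]             # same upward trend, more scatter
no_trend = [3, 1, 2, 1, 3]          # no upward or downward trend at all

r_tight = correlation(xs, tight)
r_loose = correlation(xs, loose)
r_none = correlation(xs, no_trend)

print("tight scatter:", r_tight)
print("loose scatter:", r_loose)
print("no trend:", r_none)
```

For the last data set the least squares line is flat, so it predicts the same y (the average) for every x, which is another way of saying it has no predicting value.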



Original work on this document was done by Central Virginia Governor's School students Ashley Farmer, Josh Nelson and Sara Throckmorton (Class of '98). Revisions were made by Ryan Malec, John Lewis and Terri Kendrick (Class of '05).

Copyright © 2004 Central Virginia Governor's School for Science and Technology Lynchburg, VA