Errors Versus Residuals
Statistical errors and residuals are two closely related and easily confused measures of the deviation of an observed value of an element of a statistical sample from its "theoretical value." The error of an observed value is the deviation of the observed value from the (unobservable) true value, while the residual of an observed value is the difference between the observed value and the estimated value.
A statistical error is the amount by which an observation differs from its expected value, the latter being based on the whole population from which the statistical unit was chosen randomly. For example, if the mean height in a population of 21-year-old men is 5' 8", and one randomly chosen man is 5' 10" tall, then the "error" is 2 inches. If the randomly chosen man is 5' 6" tall, then the "error" is −2 inches.
A residual (or fitting error), on the other hand, is an observable estimate of the unobservable statistical error. Consider the previous example with men's heights and suppose we have a random sample of n men from the population.
The difference between the height of each man in the sample and the unobservable population mean is a statistical error, whereas the difference between the height of each man in the sample and the observable sample mean is a residual.
Note that the sum of the residuals within a random sample is necessarily zero, and thus the residuals are necessarily not independent. The statistical errors on the other hand are independent, and their sum within the random sample is almost surely not zero.
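The distinction can be made concrete with a short sketch. The heights below are hypothetical, and the population mean is treated as known purely for illustration — in practice it is unobservable:

```python
import statistics

# Hypothetical heights (in inches) of 21-year-old men.
mu = 68.0  # the (normally unobservable) population mean, assumed known here
sample = [70.0, 66.0, 69.5, 67.0, 71.5]

xbar = statistics.mean(sample)  # the observable sample mean

errors = [x - mu for x in sample]       # statistical errors: require the true mean
residuals = [x - xbar for x in sample]  # residuals: require only the sample itself

print(round(sum(residuals), 10))  # the residuals sum to zero
print(sum(errors))                # the errors almost surely do not
```

Because the residuals are forced to sum to zero, knowing any n − 1 of them determines the last one — this is the sense in which they are not independent.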
Residual Plots
In a scatter plot we typically plot the independent (explanatory) variable on the horizontal axis and the dependent (response) variable on the vertical axis. A residual plot instead plots the residuals on the vertical axis against the independent variable on the horizontal axis.
For a least-squares regression line, the average of the residuals is always equal to zero; therefore, the standard deviation of the residuals is equal to the root-mean-square (RMS) error of the regression line.
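A minimal sketch, with made-up data and a least-squares line fit by hand, verifies both facts:

```python
import math

# Hypothetical data; the line is fit by ordinary least squares.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

mean_resid = sum(residuals) / n  # zero for a least-squares fit
rms_error = math.sqrt(sum(r * r for r in residuals) / n)
sd_resid = math.sqrt(sum((r - mean_resid) ** 2 for r in residuals) / n)

print(abs(mean_resid) < 1e-9)             # True: residuals average to zero
print(abs(sd_resid - rms_error) < 1e-12)  # True: SD of residuals equals RMS error
```

The second equality follows directly from the first: when the mean of the residuals is zero, their standard deviation and their root mean square are the same quantity.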
As an example, consider the figure depicting the number of drunk driving fatalities in 2006 and 2009 for various states:
Residual Plot
This figure shows a scatter plot, and corresponding residual plot, of the number of drunk driving fatalities in 2006 (on the horizontal axis) and 2009 (on the vertical axis) for various states.
The relationship between the number of drunk driving fatalities in 2006 and 2009 is very strong, positive, and linear.
High Residual
These images depict the highest residual in our example.
Considering the above figure, we see that the high residual dot on the residual plot indicates that the number of drunk driving fatalities that actually occurred in this particular state in 2009 was higher than the linear regression model predicted it would be after the 3-year span. Based on the model, for a 2006 value of 415 drunk driving fatalities we would expect the 2009 value to be lower than 377. Therefore, the number of fatalities in this state did not fall as much as the model predicted it would.
Low Residual
These images depict the lowest residual in our example.
Considering the above figure, we see that the low residual dot on the residual plot indicates that the number of drunk driving fatalities that actually occurred in this particular state in 2009 was lower than the linear regression model predicted it would be after the 3-year span. Based on the model, for a 2006 value of 439 drunk driving fatalities we would expect the 2009 value to be higher than 313. Therefore, this particular state did an exceptional job of bringing down the number of drunk driving fatalities, compared to other states.
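Both cases reduce to the sign of actual minus predicted. The fitted line below is invented for illustration (the article does not report the actual regression coefficients); only the 2006 and 2009 fatality counts come from the example above:

```python
# Hypothetical fitted line predicting 2009 fatalities from 2006 fatalities.
# The slope and intercept are assumptions, not the figure's actual fit.
def predict_2009(fatalities_2006):
    return 10.0 + 0.85 * fatalities_2006

# High-residual state: 415 fatalities in 2006, 377 in 2009.
high = 377 - predict_2009(415)  # positive: more fatalities than predicted

# Low-residual state: 439 fatalities in 2006, 313 in 2009.
low = 313 - predict_2009(439)   # negative: fewer fatalities than predicted

print(high > 0, low < 0)
```

A positive residual means the model underpredicted the outcome; a negative residual means it overpredicted it.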
Advantages of Residual Plots
Residual plots can allow some aspects of data to be seen more easily.
- We can see nonlinearity in a residual plot when the residuals tend to be predominantly positive for some ranges of values of the independent variable and predominantly negative for other ranges.
- We see outliers in a residual plot depicted as unusually large positive or negative values.
- We see heteroscedasticity in a residual plot as a difference in the scatter of the residuals for different ranges of values of the independent variable.
The existence of heteroscedasticity is a major concern in regression analysis because it can invalidate statistical tests of significance that assume that the modelling errors are uncorrelated and normally distributed and that their variances do not vary with the effects being modelled.
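As a rough diagnostic sketch, one can compare the spread of the residuals over different ranges of the independent variable. The data below are synthetic, with noise that deliberately grows with x so that the residual scatter fans out:

```python
import math

# Synthetic heteroscedastic data: y = 2x plus alternating noise proportional to x.
x = [float(i) for i in range(1, 101)]
y = [2.0 * xi + 0.3 * xi * (-1) ** i for i, xi in enumerate(x)]

# Ordinary least-squares fit, coded by hand.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def spread(rs):
    """Population standard deviation of a list of residuals."""
    m = sum(rs) / len(rs)
    return math.sqrt(sum((r - m) ** 2 for r in rs) / len(rs))

low_x = spread(residuals[: n // 2])   # residual scatter for small x
high_x = spread(residuals[n // 2 :])  # residual scatter for large x

print(high_x > low_x)  # True: unequal scatter, a sign of heteroscedasticity
```

In a formal analysis this eyeball comparison would be replaced by a dedicated test (such as the Breusch-Pagan test), but the residual plot is usually the first place the pattern shows up.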