In the previous installment (*American Laboratory*, Oct 2011), the terms that contribute to R^{2} were defined and a formula relating them was developed. This article will show how R^{2} is calculated and explain shortcomings associated with this statistic.

A brief review is in order first. The three terms are: 1) sum of squares for the model (SS_{Model}, the amount of variation explained by the model), 2) sum of squares error (SS_{Error}, the variation not captured by the model plus the random noise inherent in the data), and 3) total corrected sum of squares (SS_{Total}). The relationship among the terms is:

SS_{Model} + SS_{Error} = SS_{Total} (1)

The “expanded” version of the formula is:

Σ(*y*_{pi} – *y*_{p-mean})^{2} + Σ(*y*_{i} – *y*_{pi})^{2} = Σ(*y*_{i} – *y*_{mean})^{2}, (2)

where:

*y*_{pi} = a *y*-value predicted by the proposed model,
*y*_{p-mean} = the mean of all the *y*_{pi} values,
*y*_{i} = an actual *y*-value (or response),
*y*_{mean} = the mean of all the actual *y*-values (i.e., the “grand mean”).

Lastly, R^{2} is defined as the proportion of the total variation that can be explained by the regression that has been performed. The statistic is typically reported as a decimal fraction, which can be read as a percentage.

How does the above definition translate into a formula for R^{2}? Since SS_{Model} is the amount of variation captured by the model and SS_{Total} is the total variation, the proportion is simply:

R^{2} = SS_{Model}/SS_{Total} (3)

Another form of this expression can be obtained if Eq. (1) is first rearranged to give:

SS_{Model} = SS_{Total} – SS_{Error} (4)

Substituting Eq. (4) into Eq. (3) gives:

R^{2} = (SS_{Total} – SS_{Error})/SS_{Total} (5)

or:

R^{2} = (SS_{Total}/SS_{Total}) – (SS_{Error}/SS_{Total}) (6)

which simplifies to:

R^{2} = 1 – (SS_{Error}/SS_{Total}) (7)

Since (SS_{Error}/SS_{Total}) is the proportion of the variation that is left over (i.e., not “eaten” by the model), R^{2} can be seen as an indication of how far away the regression is from perfection (i.e., from R^{2} = 1). In other words, a higher (SS_{Error}/SS_{Total}) will result in a lower R^{2}.
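The derivation above can be checked numerically. The sketch below (hypothetical calibration-style data, not from this article; NumPy assumed) fits a straight line by ordinary least squares, forms the three sums of squares, and computes R^{2} via both Eq. (3) and Eq. (7):

```python
import numpy as np

# Hypothetical data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

slope, intercept = np.polyfit(x, y, 1)            # ordinary least squares
y_pred = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)            # SS_Total, Eq. (2) RHS
ss_model = np.sum((y_pred - y_pred.mean()) ** 2)  # SS_Model
ss_error = np.sum((y - y_pred) ** 2)              # SS_Error

r2_eq3 = ss_model / ss_total                      # Eq. (3)
r2_eq7 = 1.0 - ss_error / ss_total                # Eq. (7)

print(round(r2_eq3, 6), round(r2_eq7, 6))         # the two forms agree
```

For a least-squares fit that includes an intercept, the mean of the predicted values equals the grand mean, which is what makes the partition in Eq. (1) (and hence the two forms of R^{2}) hold exactly.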

Why is this statistic insufficient for evaluating the adequacy of a chosen model? The answer lies in the definition of SS_{Error} (see the review paragraph above), which lumps together two components: the variation not captured by the model (i.e., lack of fit) and the random noise inherent in the data. R^{2} cannot reveal how much of the error comes from each source. If the noise is very large, then the presence of an acceptable model can be masked.

Other traps lie in wait. Consider *Figure 1*. Plot (a) contains a cluster of data plus one point at a much larger *x*-value (this last point is called a “leverage point” because it can act as a lever that strongly influences the slope and intercept estimates). Plot (b) appears to be a perfect straight line, with the exception of one outlier. Plot (c) contains data that seem to trend and exhibit random noise (in other words, fairly typical noisy calibration or recovery data). Now for a pop quiz. What is the common thread among the three? The answer is as follows. If a straight line is fitted to each of the data sets, using ordinary least squares as the fitting technique, then the value of R^{2} is almost identical for all three cases (0.670, 0.644, and 0.670, respectively)! The statistic has no ability at all to discriminate among different reasons for deviation from perfection.
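The figure's data are not reproduced here, but the same phenomenon can be seen in the well-known Anscombe quartet (an independent illustration, not the author's data sets): scatter patterns as different as a noisy linear trend, a smooth curve, and a near-perfect line with one outlier all yield essentially the same R^{2} when fitted with a straight line:

```python
import numpy as np

# First three of Anscombe's famous data sets; the scatter patterns are
# radically different, yet the straight-line R^2 values nearly coincide.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
datasets = {
    "noisy linear trend": np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                                    7.24, 4.26, 10.84, 4.82, 5.68]),
    "smooth curve":       np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                                    6.13, 3.10, 9.13, 7.26, 4.74]),
    "line with outlier":  np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                                    6.08, 5.39, 8.15, 6.42, 5.73]),
}

def r_squared(x, y):
    # R^2 via Eq. (7): 1 - SS_Error/SS_Total
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

for name, y in datasets.items():
    print(f"{name}: R^2 = {r_squared(x, y):.3f}")   # all approximately 0.67
```

As with the three plots in *Figure 1*, the numbers alone give no hint of which pattern produced them; only a plot of the data (or of the residuals) does.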

A second trap exists when the urge to try higher-order polynomials creeps into the regression landscape. For *any* data set, adding the next-higher *x*-term will *always* increase the R^{2} or leave it unchanged (when compared with the previous, lower-order model). The reason is that every additional term contributes an additional coefficient, which can be tweaked along with all the coefficients from the preceding model.

A key objective of regression is to have the curve go through the mean of the responses at each *x*-value. Thus, as more coefficients become available, there is greater opportunity to “fine-tune” in pursuit of this goal. However, the price of such an outcome is that the overall shape of the fitted model may be quite odd and implausible; the resulting meandering is not likely to be intrinsic to the actual data.

This trap is more severe than simply overfitting by going to higher and higher polynomials. As will be seen below, adding any term to an existing model will raise R^{2} (or leave it unchanged).
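This claim is easy to verify in simulation. The sketch below (hypothetical data and a made-up extra column, not the article's) augments a straight-line design matrix with a completely unrelated pseudo-random regressor; because least squares can always set the new coefficient to zero (or better), R^{2} cannot go down:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=x.size)   # noisy line

def r_squared(X, y):
    # Least-squares fit of y on the columns of X, then Eq. (7)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

X_line = np.column_stack([np.ones_like(x), x])           # intercept + slope
junk = rng.normal(size=x.size)                           # unrelated column
X_junk = np.column_stack([X_line, junk])

r2_line = r_squared(X_line, y)
r2_junk = r_squared(X_junk, y)
print(r2_junk >= r2_line)   # True: the extra column can only help (or tie)
```

The nested-model argument is the whole story: the smaller model is a special case of the larger one (new coefficient = 0), so the larger model's minimized SS_{Error} can never exceed the smaller model's.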

Consider the following (simulated) challenge given to fictitious Laboratory XYZ. They have been asked to perform an analytical test that has several quality requirements. One mandate is to generate a calibration curve with a “high value for R^{2}.” Another directive is to analyze five replicates of each of six different standards.

The laboratory’s chemists set out on their mission and generate the required data. The scatterplot is shown in *Figure 2a*. Standard-deviation modeling shows that ordinary least squares is the appropriate fitting technique. Since the data exhibit curvature, the analysts begin by fitting a quadratic; R^{2} = 0.88226. In an attempt to raise this value, they try a cubic, but gain very little (R^{2} = 0.88234). Undeterred, the scientists press on to a quartic, which gives a more significant increase (R^{2} = 0.88380).

About this time, a colleague comes by to see how the regression work is going. Upon hearing that the fitting is up to a fourth-order polynomial, the co-worker suggests adding the phases of the moon to the quadratic, as an alternative to cubic or quartic. Application of this advice gives an R^{2} of 0.88372, which competes nicely with the quartic! In fact, the two curves are almost identical (see *Figure 2b* and *2c*).

What on *earth* (or moon, perhaps) is going on?!? The similarity of *Figure 2b* and *2c* is purely a coincidence of this simulation (and was not planned by the authors!), but the rise in R^{2} values with the addition of the “odd” term is not coincidental. Indeed, even something as unrelated to analytical data as the phases of the moon will never decrease (and will generally increase) the R^{2} of the previously fitted, simpler model, which in this case is the quadratic.

By now, the reader may be wondering what to do, especially since R^{2} is so widely used within the analytical community. A more realistic version (R^{2}_{adjusted}, usually abbreviated as R^{2}_{adj}) exists and will be discussed in the next installment. Also included will be a review of a robust approach to diagnosing regression models. Please stay tuned!

*Mr. Coleman is an Applied Statistician, and Ms. Vanatta is an Analytical Chemist, e-mail: statistics@americanlaboratory.com.*