Statistics in Analytical Chemistry: Part 46—R2 (Concluded)

Over the course of the past two articles (Part 44, Oct 2011, and Part 45, Nov/Dec 2011), R2 has been defined, its components have been explained verbally and mathematically, and the statistic’s formula has been presented. Also included has been a discussion of the limitations of R2, and the traps that lie in wait for those who rely exclusively or too heavily on this number. Is there a way to cast this often-used statistic in a more favorable light? This installment will address that question, as well as offer a more robust path for evaluating candidate models.

To review the bidding, the formula for R2 is:

R2 = SSModel/SSTotal, or       (1)
R2 = 1 – (SSError/SSTotal)     (2)

Recall, though, that SSError includes the random noise inherent in the data, as well as the variation the model fails to capture. Furthermore, the value of R2 will increase (or remain unchanged) every time another term is added to the model. As was explained earlier, these two facts can lead the user down the primrose path.
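The two equivalent forms of the R2 formula can be sketched in a few lines of code. The data values below are invented purely for illustration, and a straight-line fit is assumed:

```python
# Sketch of Eqs. (1) and (2) for a straight-line fit; data are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # raw responses (made up)

# Fit y = b0 + b1*x by ordinary least squares.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)    # SSTotal
ss_error = np.sum((y - y_hat) ** 2)       # SSError
ss_model = ss_total - ss_error            # SSModel

r2_eq1 = ss_model / ss_total              # Eq. (1)
r2_eq2 = 1.0 - ss_error / ss_total        # Eq. (2)
# The two forms agree to machine precision.
```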

Fortunately, there is a statistic known as R2adj (where “adj” stands for “adjusted”); its value gives a more honest assessment of the model’s adequacy. Below are the details.

In a sentence, R2adj includes a penalty for each term used in the regression. If an additional term is not needed, R2adj will generally decline relative to its value for the previous (simpler) model. Mathematically, R2adj is a modification of Eq. (2) above; the new formula includes degrees-of-freedom (DOF) terms:

R2adj = 1 – (MSError/MSTotal)       (3)

where:

MSError = Mean Square Error = SSError/DOFE
DOFE = degrees of freedom associated with SSError
MSTotal = Mean Square Total = SSTotal/DOFT
DOFT = degrees of freedom associated with SSTotal

Two questions jump to mind. First, how are the DOF terms calculated? Second, why is “Mean” used to describe the adjusted “Square” terms? The answers are as follows.

In general, DOF terms for a statistic are computed by starting with the number of data points that are in the data set under discussion. Every time a calculation is made using this original set, a degree of freedom is lost for any statistic that depends on the calculation.

SSTotal is calculated using the entire set of raw responses; the total number of data points in a set is typically designated as n. However, to calculate SSTotal, one must first calculate the average of all the responses, thereby sacrificing a degree of freedom. (See Part 44 or Part 45 for the formulas for the SS terms.) Thus, the associated DOF term is (n-1).

For SSError, the starting point is the same as above. However, this time, a model must first be fitted to the data, since the predicted responses are needed in the calculation of this statistic. Each parameter in a model has a coefficient, which must be calculated; for example, a straight-line model requires the calculation of an intercept, as well as a coefficient for the x term. A degree of freedom is lost for each parameter, so if p is the number of parameters, the general expression for DOFE is (n-p).

The use of “Mean Square” to describe the terms in R2adj can be understood by thinking about what happens when one calculates the mean (i.e., average) of a set of data. The formula is the sum of all the values, divided by the total number of data points. In other words, the sum is divided by the degrees of freedom. In this case, no degrees of freedom were lost beforehand, since this determination is based solely on the original data. Thus, n is the appropriate DOF value. Since MSError and MSTotal also divide a sum by the associated DOF term, the use of “Mean” in the names is logical.

The stage is now set for deriving a more useful formula for R2adj.

Incorporating the two DOF expressions into Eq. (3) results in the following:

R2adj = 1 – [SSError/(n-p)]/[SSTotal/(n-1)]       (4)

Regrouping yields:

R2adj = 1 – [(SSError/SSTotal) * (n-1)/(n-p)]       (5)

Rearranging Eq. (2) gives:

SSError/SSTotal = 1 – R2       (6)

Combining Eqs. (5) and (6) yields:

R2adj = 1 – {(1-R2) * [(n-1)/(n-p)]}       (7)

In Eq. (7), the last expression, [(n-1)/(n-p)], can be considered a “penalty factor” that keeps R2adj honest. In other words, the inclusion of DOF terms levels the playing field somewhat when different models are compared using R2adj.
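Eq. (7) is easy to apply directly, and a short sketch shows the penalty factor at work. The R2 values, n, and p below are invented for illustration only:

```python
# Sketch of Eq. (7): R2adj from R2, n, and p. Illustrative values only.
def r2_adjusted(r2, n, p):
    """Eq. (7): R2adj = 1 - (1 - R2) * (n - 1) / (n - p)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

# An unneeded extra term may raise R2 slightly (here 0.980 -> 0.981),
# yet R2adj declines, because the penalty factor grows with p.
simpler = r2_adjusted(0.980, n=10, p=2)   # simpler model
larger = r2_adjusted(0.981, n=10, p=3)    # one extra, unhelpful term
```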

Keep in mind, though, that even R2adj must be used with caution. Recall the example in Part 45. There, a data set was fitted with four different models: 1) quadratic, 2) cubic, 3) quartic, and 4) quadratic + phases-of-the-moon (POM) term. The values for R2adj stack up as follows:

Quadratic 0.8735
Cubic 0.8688
Quartic 0.8652
Quadratic + POM 0.8703

The progression from the first through the third models is accompanied by a decrease in R2adj, thereby signaling the inclusion of inappropriate terms. (Note that this comparison is for illustrative purposes; one is splitting hairs by looking at essentially the third decimal place of R2adj.) Comparison of the POM-containing option with the cubic might lead the casual observer to think that he or she was getting somewhere, and that connecting a cubic with the moon might lead to victory!

It is time to turn to a more reliable (although more complex) alternative to either R2 or R2adj. The focus in model evaluation should be on the tools of: 1) the p-value for any term that was just added and 2) the residual plot and the related lack-of-fit (LOF) test. (For details related to this alternative, see Parts 9, 10, 22, and 23 of this series—American Laboratory, Feb 2004, Mar 2004, Jun/Jul 2006, and Oct 2006, respectively.)

First, if the p-value of the new term is insignificant (i.e., >0.01), then the term is not needed and its inclusion will result in overfitting. In the case of a straight line, the coefficient of the x term is the line’s slope; since raw responses typically rise with concentration, this term will be significant unless there is a major problem with the instrument.

Second, the residuals pattern will help the user determine if the model exhibits lack of fit; random scatter about the zero line suggests an adequate model. The LOF diagnostic is based on the residual values and does separate SSError into its parts. Thus, the door is open for distinguishing between random noise and “leftovers” from the model, and for producing a p-value that will reflect the sufficiency of the chosen curve.
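When replicate responses are available at each concentration, the LOF test just described can be sketched as follows. The decomposition of SSError into pure error and lack of fit is the standard one; the data are invented for illustration:

```python
# Sketch of a lack-of-fit (LOF) F-test for a straight-line fit, assuming
# replicate responses at each concentration. Data are hypothetical.
import numpy as np
from scipy import stats

x = np.repeat([1.0, 2.0, 3.0, 4.0], 3)          # 4 levels, 3 replicates each
y = np.array([1.9, 2.1, 2.0,  4.2, 3.8, 4.0,
              6.1, 5.9, 6.0,  7.9, 8.2, 8.1])   # hypothetical responses

b1, b0 = np.polyfit(x, y, 1)                    # straight line, p = 2
y_hat = b0 + b1 * x

levels = np.unique(x)
m, n, p = len(levels), len(y), 2

# Pure error: scatter of replicates about their own level means (random noise).
ss_pe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)

# Lack of fit: the remainder of SSError, i.e., what the model fails to capture.
ss_error = np.sum((y - y_hat) ** 2)
ss_lof = ss_error - ss_pe

f_stat = (ss_lof / (m - p)) / (ss_pe / (n - m))
p_value = stats.f.sf(f_stat, m - p, n - m)
# A large p-value indicates no detectable lack of fit.
```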

Figure 1 - Residuals plots for a simulated data set fit with a) a straight-line and b) a quadratic model. See text for details.

The usefulness of this alternative approach can be shown by returning to the moon example. In Part 45, the scatterplot displayed curvature, so a quadratic was selected as the first candidate. The wisdom of this decision can be seen in the results of the LOF test; the p-values for a straight line and a quadratic were 0.0276 and 0.9363, respectively. The residual patterns in Figure 1 agree with the LOF test. Furthermore, the p-value for the quadratic term is 0.0008, indicating that x2 is needed in the model.

Addition of either an x3 or POM term results in insignificant p-values for the new member (0.8935 and 0.5729, respectively). Thus, there is confirmation that a quadratic model is adequate; inclusion of additional terms will result in overfitting.

One additional matter is the choice of fitting technique (see Part 8, Nov 2003, for details on this topic). Neither R2 nor R2adj can help here, either. The proper way to evaluate ordinary least squares versus weighted least squares is to model the standard deviation of the responses; if there is trending with concentration, then the latter technique is needed.
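The standard-deviation check mentioned above can be sketched simply: compute the replicate standard deviation at each concentration and look for trending. The data below are hypothetical, and the positive-slope check is a deliberately simple stand-in for a formal test on the slope:

```python
# Sketch of standard-deviation modeling to choose between ordinary and
# weighted least squares. Replicate data are hypothetical.
import numpy as np

concentrations = np.array([1.0, 5.0, 10.0, 50.0])
replicates = [
    np.array([1.02, 0.98, 1.01]),
    np.array([5.1, 4.8, 5.2]),
    np.array([10.4, 9.6, 10.2]),
    np.array([51.5, 48.0, 50.9]),
]

# Sample standard deviation at each concentration level.
sds = np.array([r.std(ddof=1) for r in replicates])

# Fit sd = c0 + c1 * concentration; a slope clearly above zero signals
# trending, which argues for weighted least squares.
c1, c0 = np.polyfit(concentrations, sds, 1)
trending = c1 > 0   # simplistic check; a formal test would examine the slope's p-value
```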

The final take-home message is that even R2adj is not a strong tool for evaluating model selection or fitting-technique choice, and should be employed only with a very large grain of salt. Instead, users should depend on residual patterns and the LOF test for model evaluation, and on standard-deviation modeling for the fitting-technique decision. The authors cannot overemphasize the importance of these two recommendations!

David Coleman is an Applied Statistician, and Lynn Vanatta is an Analytical Chemist;