Statistics in Analytical Chemistry: Part 44—R2

Imagine that a new method has been developed and now needs a calibration curve. Three chemists are asked to analyze each of three standards in duplicate. The scientists are to work separately and generate individual curves, using ordinary-least-squares fitting. The resulting values of R2 are as follows: 0.7273, 0.9143, and 0.7268. Whose work should be accepted?

Figure 1 - Three separate data sets, each fitted with a straight line, and the corresponding values of R2; a) R2 = 0.7273, b) R2 = 0.9143, c) R2 = 0.7268.

If the sole criterion for the decision is the value of R2, then the second data set should be used. However, for all its popularity, this statistic is not a sufficient indicator of the adequacy of a given model or fitting technique. Indeed, an examination of the plots themselves illustrates this fact. Figure 1 shows the results for the three data sets. The plots show that the low R2 for set 3 is likely due to an easy-to-fix problem (i.e., the wrong model was used). R2 has not provided enough information for making a sound decision.

When a quadratic model is fitted to data set 3, R2 soars to 0.9999 (see Figure 2) and analyst 3 now “wins.” Although the model choice looks excellent and the replicates are very tight, further investigation of the plot reveals a disturbing feature. At the low end, the curve “doubles back” on itself; the responses for the blank appear to be slightly higher than the responses for the standard at x = 2. Clearly, there is a problem from a chemistry point of view! (Perhaps there was a mix-up in solutions.) If the difficulty can be resolved, then the work of analyst 3 has the potential to be the best choice, since he or she was able to achieve better replication than could analyst 1 or 2. In the end, R2 has been inadequate again.
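To make the R2 behavior concrete, here is a minimal Python sketch (the numbers are hypothetical and merely mimic the curved pattern of data set 3; they are not the actual values behind Figures 1 and 2). Both a straight line and a quadratic are fitted to the same points, and R2 is computed for each:

```python
# Minimal sketch with made-up data: three standards analyzed in duplicate,
# with a curved (non-linear) response pattern.
import numpy as np

x = np.array([0, 0, 2, 2, 4, 4], dtype=float)       # concentrations, in duplicate
y = np.array([0.10, 0.12, 0.35, 0.33, 1.40, 1.38])  # hypothetical responses

def r_squared(y, y_pred):
    # R2 = 1 - SSError/SSTotal (equivalent to SSModel/SSTotal for least squares)
    ss_error = np.sum((y - y_pred) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_error / ss_total

for degree in (1, 2):  # straight line, then quadratic
    coeffs = np.polyfit(x, y, degree)
    print(degree, r_squared(y, np.polyval(coeffs, x)))
# The quadratic's R2 is far higher, yet that number alone says nothing about
# whether the curve makes chemical sense at the low end.
```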

What is the take-home message from the above? Simply put, R2 should not be used as the primary (and certainly not the only) criterion for evaluating the acceptability of a calibration or recovery curve. Other tools (e.g., the residual pattern and the lack-of-fit p-value) are crucial in diagnosing the “health” of a given curve. To borrow a chemical phrase, R2 is “bulk-property” in nature. In one number, this statistic lumps together the uncertainty not explained (or “absorbed”) by the model and the uncertainty due to the noise in the data; the combined total is expressed relative to the total variation of the data. However, the statistic cannot help the user determine what portion is due to which source.

What is R2 and how does it actually “work”? R2 is the proportion of the total variation that can be explained by the regression that has been performed; the statistic is generally reported in decimal form but interpreted as a percentage. The following paragraphs explain the terms on which R2 is based.
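As a simple illustration, suppose the fitted model accounts for 9 units of a total variation of 10 units, leaving 1 unit unexplained; then R2 = 9/10 = 0.90, meaning that 90% of the variation in the responses has been captured by the regression.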

The first step is identifying the uncertainty sources and then developing: 1) a mathematical expression for each type, and 2) a formula that relates them all. In any set of calibration or recovery data, there is variation in the responses (i.e., in the y-values). Some of the variation is to be expected, since responses typically will grow as the concentrations (x-values) increase. In fact, the responses must change if the method is to be useful. As was mentioned above, when a model is fitted to the points, it will absorb or account for a portion of this uncertainty.

Figure 2 - Data from Figure 1c, fitted with a quadratic model; R2 = 0.9999

Nevertheless, the model, as an approximation to truth, generally will leave some variation unexplained. This “leftover” variation from the model can be combined with the noise to form what is known statistically as the error in the regression analysis. Thus:

uncertainty captured by model + error = total uncertainty        (1) 

The above equation is more than just a definition of total uncertainty. The two components can be expressed mathematically; the total can be calculated as well, independent of its constituents.

Determining mathematical expressions for each uncertainty involves the calculation of differences. Of necessity, the error term will involve subtraction; to avoid mixing apples and oranges, the other terms must be differences as well. Details are as follows, based on an arbitrary model and least-squares fitting.

The model component works with ypi (i.e., with each y-value that has been predicted by the model). To create a difference, the mean of all the predicted y-values (yp-mean) is used. This mean is an appropriate comparison “anchor” because the value is a summary statistic for the entire set of predicted values. This average also has a useful property, as will be seen below.

The resulting final expression is:

ypi – yp-mean        (2)

The error is determined from a concept that has been discussed throughout this series (i.e., residuals). Once a model has been fitted to a data set, the residual (ri) for any actual data point is the actual y-value (yi) minus the predicted (from the curve) y-value (ypi), or: 

ri = yi – ypi             (3)

This difference is a mixture of the variation not explained by the model and the random noise in the data.
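A short sketch (again with made-up numbers) makes this mixture visible. When a straight line is forced onto curved data, the residuals show a systematic sign pattern (the lack-of-fit portion) on top of the random scatter (the noise portion):

```python
# Hypothetical curved data plus a little noise, deliberately fitted with a line.
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = x**2 + np.array([0.1, -0.1, 0.05, -0.05, 0.1])  # curved response + noise

y_pred = np.polyval(np.polyfit(x, y, 1), x)  # straight-line (wrong-model) fit
print(y - y_pred)  # ri = yi - ypi; note the systematic +, -, -, -, + pattern
```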

Within a given data set, the total variability depends solely on the actual y-values (i.e., is independent of the proposed model). Thus, no predicted y-values will be included in the formula. Again, to avoid combining apples and oranges, the expression for total variability should be a difference. The logical option is to work with ymean (the mean of all the actual responses), since it is a useful summary statistic for the entire data set, regardless of model. Subtracting the mean from the actual gives:

yi – ymean               (4)

How are these expressions combined into a mathematical equation? Any of the differences can be either positive or negative. Since direction is irrelevant and will only distort any combining efforts, simple summation is not an option. Working with absolute values solves the sign problem, but is not enough. For each of the three components, the squares of the differences must be summed. Otherwise, it can be shown that more than one regression solution may result, and the user will not be able to determine which solution should be chosen. There are also some useful mathematical properties that result from minimizing squares of residuals (contact the authors for details).
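The sign problem is easy to demonstrate numerically; in the sketch below (arbitrary made-up responses), the signed differences from the mean always sum to zero, regardless of how scattered the data are, whereas the squared differences preserve the spread:

```python
import numpy as np

y = np.array([1.0, 4.0, 2.5, 9.0])  # arbitrary responses

print(np.sum(y - y.mean()))         # 0.0 (to round-off), no matter the scatter
print(np.sum((y - y.mean()) ** 2))  # squaring retains the spread information
```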

The resulting equation for expression (1) is:

Σ(ypi – yp-mean)2 + Σ(yi – ypi)2 = Σ(yi – ymean)2          (5)

Statisticians have given names to each of the three summation expressions. Respectively, the designations are: 1) sum of squares for the model (SSModel), 2) sum of squares error (SSError, also known as sum of squares of residuals), and 3) total corrected sum of squares (SSTotal):

SSModel + SSError = SSTotal                  (6)
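Equation (6) can be checked numerically. The sketch below (hypothetical calibration-style data, fitted with an ordinary-least-squares straight line) computes all three sums of squares and confirms that the first two add up to the third:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.2, 1.1, 1.9, 3.2, 3.9])  # hypothetical responses

y_pred = np.polyval(np.polyfit(x, y, 1), x)  # least-squares straight line

ss_model = np.sum((y_pred - y_pred.mean()) ** 2)  # Σ(ypi - yp-mean)^2
ss_error = np.sum((y - y_pred) ** 2)              # Σ(yi - ypi)^2
ss_total = np.sum((y - y.mean()) ** 2)            # Σ(yi - ymean)^2

print(ss_model + ss_error, ss_total)  # the two values agree to round-off
```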

Why does the equality in (5) hold? In other words, why can the sum on the left be calculated from terms that are independent of the selected model (i.e., from the expression on the right)? The key lies in a characteristic of the residuals. One of the important properties of least-squares regression is that the complete set of signed residual values will always sum to zero. For two data points:

r1 + r2 = (y1 – yp1) + (y2 – yp2) = 0         (7)

Combining like terms gives:

y1 + y2 = yp1 + yp2                       (8)

Since the two sums are equal, the averages are equal:

(y1 + y2)/2 = (yp1 + yp2)/2          (9)

Or:

ymean = yp-mean          (10)

(Don’t try calibrating with such a data set at home; two data points are not enough! This small data set has been used for simplicity.)
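The property is not limited to the two-point example; a quick numerical check (hypothetical data once more) shows that, for a least-squares fit that includes an intercept, the signed residuals sum to zero and hence the two means in Eq. (10) coincide:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.3, 0.9, 2.2, 2.8, 4.1, 4.9])  # hypothetical responses

y_pred = np.polyval(np.polyfit(x, y, 1), x)

print(np.sum(y - y_pred))       # signed residuals sum to zero (to round-off)
print(y.mean(), y_pred.mean())  # ymean = yp-mean, as in Eq. (10)
```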

At this stage, the wisdom of using means to form differences becomes evident. In Eq. (5), yp-mean can be replaced with ymean. If this new equation is expanded (using the two-data-point example again), the yi2 and the ymean2 terms will cancel. The collection of {ymean[(yp1 + yp2) – (y1 + y2)]} will drop out because of Eq. (8) (i.e., the ypi sum and the yi sum are equal, so the difference is 0). Thus, proving Eq. (5) reduces to showing that:

Σypi2 = Σyi·ypi = Σ(ri + ypi)·ypi = Σ(ri·ypi + ypi2)     (11)

Eq. (11) reduces to showing that:

Σri·ypi = 0       (12)
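Eq. (12) can likewise be confirmed numerically; in the sketch below (hypothetical data yet again), the sum of the residual-times-prediction products comes out as zero to within round-off:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.2, 1.8, 3.1, 4.2])  # hypothetical responses

y_pred = np.polyval(np.polyfit(x, y, 1), x)
residuals = y - y_pred

print(np.sum(residuals * y_pred))  # effectively zero: Σ(ri·ypi) = 0
```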

Eq. (12) holds for fitting done by least-squares regression, as is shown in textbooks such as Draper and Smith’s Applied Regression Analysis, 2nd ed., p. 18 (or contact the authors for a derivation). The proof involves the unique properties of the least-squares estimates of the model coefficients (in the case of linear regression, of the slope and intercept).
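For the straight-line case, a brief sketch of that derivation (standard textbook material) runs as follows. Writing the fitted line as ypi = b0 + b1·xi, the least-squares normal equations force Σri = 0 and Σ(ri·xi) = 0. Therefore, Σ(ri·ypi) = b0·Σri + b1·Σ(ri·xi) = 0 + 0 = 0.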

The above discussion has led to the desired result (i.e., a mathematical formula relating the various sources of variation). Now the stage is set for the next article, which will detail the formula for R2 and elaborate on the traps this statistic can set.

Mr. Coleman is an Applied Statistician, and Ms. Vanatta is an Analytical Chemist, e-mail: [email protected].