Statistics in Analytical Chemistry: Part 3—Calibration: Introduction and Ordinary Least Squares

Calibration is a modeling procedure by which a response (e.g., peak area) can be transformed into a useful measurement (e.g., concentration). In analytical chemistry, the most commonly used models are the straight line and the quadratic. To put this transformation procedure in perspective, one should keep in mind a quote by Prof. George Box, who is considered one of the greatest (and most highly published) living statisticians: “All models are wrong; some are useful.”

The implication is that, generally speaking, no model will be able to explain the data completely. However, this statement will be qualified later. Also, it will be shown that there are statistical tools for evaluating a model and deciding if it will be useful for our calibration purposes.

Suppose that chromatographic peak-area (PA) measurements {yi}, are collected for samples with known spike concentrations, {xi}. (Note: The brace notation indicates “the set of all.” Thus, {xi} is the set of all the xi values.) Suppose, further, that the relationship between y and x is known or believed to be a straight line. In its simplest form, the straight-line model is:

yi = A + Bxi

When a model is applied to a set of data of this type, {xi, yi; i = 1,…,n}, a fitting technique is involved. The most typically used procedure is ordinary least squares (OLS, also known as “regression,” though the latter term is broader).

OLS minimizes the sum of the squared residuals (SSR), where the residual, ri, is the error in the fit for the ith point and equals the observed response minus the fitted response. As will be seen later, residual analysis (i.e., investigating the plot of ri vs xi) is one of the most powerful tools for evaluating a proposed model. The fact that OLS is widely used to fit a straight-line model does not mean that this combination is always appropriate.

For readers who are interested, OLS details are presented in the following paragraph. While memorization of the formulas is not necessary, familiarity with these terms will be useful in future articles.

To minimize SSR, OLS produces estimates (a,b) of the true coefficients (A,B) as follows:

b = Sxy/Sxx,

a = yavg – (b * xavg),

where

Sxy = Σ [(xi – xavg) * (yi – yavg)] and Sxx = Σ (xi – xavg)²

and where xavg is the average of all the {xi} values; yavg is the average of all the {yi} values; and Σ indicates summation, i = 1,…,n. Sxy is known as the corrected sum of cross-products, and Sxx is known as the corrected sum of squares, where “corrected” refers to subtracting out the mean. Thus, the residuals are defined as: ri = yi – (a + bxi), and OLS minimizes: SSR = Σ (ri – ravg)². Furthermore, SSR = Σ ri², since it can be shown that ravg = 0.
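For readers who like to see the arithmetic in action, the formulas above can be sketched in a few lines of Python. The spike concentrations and peak areas below are invented values for illustration only, not data from this article:

```python
# Minimal OLS straight-line fit using the formulas in the text.
# The concentrations (x) and peak areas (y) are invented illustration values.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 2.1, 3.9, 6.2, 7.9]

n = len(x)
x_avg = sum(x) / n
y_avg = sum(y) / n

# Corrected sum of cross-products and corrected sum of squares
Sxy = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
Sxx = sum((xi - x_avg) ** 2 for xi in x)

b = Sxy / Sxx           # slope estimate
a = y_avg - b * x_avg   # intercept estimate

# Residuals and the minimized sum of squared residuals
r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
SSR = sum(ri ** 2 for ri in r)

print(a, b, SSR)
```

Note that summing the residual list confirms the claim in the text that ravg = 0 (up to floating-point rounding); plotting r against x would then be the starting point for the residual analysis mentioned earlier.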

Underlying straight-line OLS are some assumptions, the most important of which are listed below. It is not possible, in practice, to verify all of these assumptions. Often, it is not possible to verify any of them, and for a particular data set, it may even be known that one or more is false.

  1. The true relationship is approximately a straight line: yi ≈ A + Bxi.
  2. The standard deviation of the responses does not change over the range of x values for which the model will be applied. However, in analytical chemistry, this assumption does not always hold; the variability of the response will often increase with increasing concentration.
  3. Residuals, ri, are statistically independent from each other, and are distributed according to the Normal (Gaussian) distribution. While the Normal premise is not as critical as the previous two assumptions, the analyst should be wary of the corrupting influence of outliers on an OLS fit.
  4. The {xi} values are error free, while the {yi} are not. Although this assumption rarely holds in practice, it is important that the y errors dominate the x errors (as a percent of range in the data set).

Note that the fitting shown above is the PA (i.e., y, the dependent response, usually shown on the vertical axis of a plot) fitted as a straight-line function of spike concentration (i.e., x, the independent variable, usually shown on the horizontal axis), not vice versa. This “y vs x” direction may seem odd, since the intended use of the calibration line (or curve) is to transform a peak area (y) into a concentration (x), not x into y. The reason is to satisfy the fourth assumption. To calibrate, y is fitted versus x to find the approximate relationship: y ≈ a + bx. However, to use the calibration, the relationship is inverted; for a new value of y (e.g., PA), x = (y – a)/b is computed. The inverted formula does not give the same mathematical result as would be obtained if x were (unwisely) fitted versus y.
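To make the inversion concrete, the sketch below converts a new peak area into a predicted concentration via x = (y – a)/b. The coefficient values and the measurement are invented for illustration; in practice, a and b would come from an OLS fit of calibration data:

```python
# Inverse prediction: given fitted calibration coefficients (a, b),
# convert a newly measured peak area y_new into a concentration.
# These coefficient values are invented for illustration.
a = 0.10   # intercept estimate from a prior OLS fit
b = 1.97   # slope estimate from a prior OLS fit

def predict_concentration(y_new, a, b):
    """Invert y = a + b*x to obtain x = (y - a) / b."""
    if b == 0:
        raise ValueError("Slope is zero; calibration line cannot be inverted.")
    return (y_new - a) / b

x_pred = predict_concentration(5.02, a, b)  # peak area -> concentration
print(x_pred)
```

The zero-slope guard is a practical detail: a flat calibration line carries no information about concentration and cannot be inverted.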

With straight-line regression in hand, there is now one possible way to transform the responses into more useful numbers (i.e., concentrations). However, any predictions made are simply estimates. Thus, there must be a way to determine how much variability is in the values and how close they are to the truth.

Before specific procedures are discussed, several terms need to be defined, some of which are often used (in error) interchangeably. (See Figure 1 for a graphical illustration.) These definitions are:

Figure 1 - Idealized diagram of the error in measurement (of temperature, for purposes of illustration), showing how the error changes with the true temperature value. Shown are: 1) the bias function (a curved average-error line); 2) an envelope around the bias (the envelope width represents the measurement precision, within which will fall a user-specified percentage of individual measurement errors); 3) a blow-up of the bias function (revealing the reporting resolution of the measurement system); and 4) the ideal error line (everywhere zero). Not shown is noise, although the amount of noise is captured in the precision. (Adapted from Gibbons, R.D.; Coleman, D.E. Statistical methods for detection and quantitation of environmental contamination. Reprinted with permission from John Wiley & Sons, Inc., 2001.)

  1. Bias—the systematic difference between measurements and the true value. The bias is never known, but can be estimated or characterized by a value or a distribution.
  2. Error (refers to a measurement)—the (usually) unknown difference between a reported measurement and the true value; error may be random or systematic.
  3. Noise—measurement changes that vary over time, even with no change in the true value; also known as random error or stochastic error.
  4. Precision—the consistency of measurement, usually quantified by the sample standard deviation, s, of measurement error (or quantified by a function of s, such as a statistical interval). In either case, the lower the value of s, the greater the precision.
  5. Resolution—the smallest difference (in measurements) that can be consistently reported from a measurement system. For example, this difference may be limited by internal eight-bit representation or the number of reported digits. Poor resolution is called discretization or rounding error.
  6. Uncertainty (of measurements)—a statistical interval within which the measurement error is believed to occur, at some level of confidence. Uncertainty incorporates both bias and precision.

The alert reader will notice that “accuracy” is not included in the above list. While this term is still in common usage, it often has conflicting definitions. Much of the statistical community has recognized the confusion surrounding this term and has recommended that it be dropped from the vernacular. In keeping with this trend, the word has not been defined here and will not be used in this series of articles.

The next column will introduce the concept of uncertainty intervals. These intervals will be extremely useful; they are the single most valuable way to characterize the variability associated with a predicted concentration.

Mr. Coleman is an Applied Statistician, Alcoa Technical Center, MST-C, 100 Technical Dr., Alcoa Center, PA 15069, U.S.A.; e-mail: [email protected]. Ms. Vanatta is an Analytical Chemist, Air Liquide-Balazs Analytical Services, Box 650311, MS 301, Dallas, TX 75265, U.S.A.; tel.: 972-995-7541; fax: 972-995-3204; e-mail: [email protected].