Baseline estimation and correction is critical for many types of spectral data. This article shows how baseline correction of spectra used in multivariate calibration can be performed objectively. The software used in this paper implements the procedure proposed by Liland et al.^{1} It includes a set of baseline correction algorithms and optimization routines that find the algorithm and parameter values giving the best predictions, as well as graphical user interfaces (GUIs) for adjusting baselines visually and for setting up and performing the optimization. The procedure is in no way bound to this software, which is simply presented as an open-source, freely available alternative.

## Materials

**Raman spectroscopy** on pork fat

A set of 77 samples of melted back fat from pork adipose tissue was measured using Raman spectroscopy and GC.^{2} The Raman spectrometer used a sapphire ball probe, which caused peaks in the range between 775 and 378 cm^{–1}. GC was used for reference analysis of the fatty acid composition. In this paper, the authors have concentrated on the iodine value as a response. Before the statistical analysis, the Raman spectra were cut at 3100 and 775 cm^{–1} to remove areas dominated by artifacts and containing a minimal amount of information. This left 3875 wavelengths for analysis, and thus a predictor matrix of 77 × 3875, which is plotted in *Figure 1*.

In contrast to the Raman spectra used in Liland et al.,^{1} there is very little fluorescence in the spectra discussed here. Instead, there is an elevation between approx. 2000 and 1200 cm^{–1} in many of the spectra that seems to be noninformative with regard to the response. This may pose an even greater challenge for the baseline correction algorithms than dealing with fluorescence.

## Methods

### Baseline correction

Baseline variation is a problem encountered in many types of spectral data. Typically, it is a linear or nonlinear addition to the spectra that causes expected zero measurements to attain a positive value. Baselines can be described as the slowly varying curve going through the lower part of the spectra without the jumps of the peaks. A multitude of different algorithms exist for estimating and correcting baseline effects. More in-depth descriptions of baseline estimation and specific baseline correction algorithms can be found in Ref. 1.
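One widely used estimator of such a slowly varying lower curve is asymmetric least squares (ALS), one of the algorithms applied later in this article. The following is a minimal, illustrative sketch (the function name and defaults are the author's choices here, not the package's code; production implementations use sparse solvers): points above the current fit get a small weight *p*, points below get weight 1 − *p*, and a second-difference penalty keeps the fit smooth.

```python
import numpy as np

def als_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric least squares baseline (Eilers-style), dense sketch.

    lam : smoothness penalty on second differences.
    p   : asymmetry; points above the fit get weight p, below get 1 - p,
          so the fit hugs the lower envelope of the spectrum.
    """
    n = len(y)
    D = np.diff(np.eye(n), 2, axis=0)   # second-difference operator, (n-2, n)
    P = lam * (D.T @ D)                 # smoothness penalty matrix
    w = np.ones(n)
    for _ in range(n_iter):
        z = np.linalg.solve(np.diag(w) + P, w * y)  # weighted penalized fit
        w = np.where(y > z, p, 1.0 - p)             # reweight asymmetrically
    return z
```

Subtracting the returned curve from the spectrum removes the slowly varying background while leaving the peaks in place.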

### Optimization

The proposed general procedure for evaluating and choosing the optimal baseline correction for any given statistical analysis is as follows:

1. Limit the parameter spaces: For each baseline algorithm, select the levels to be tested for all baseline parameters.
2. Correct baselines and perform the statistical analyses: For each algorithm and combination of its parameter levels, perform the baseline corrections on the calibration data, perform the statistical analysis, and calculate the quality measure(s).
3. Select and validate the optimal parameter levels: For each baseline algorithm, select the combination of parameter levels that gives the best baseline correction, as judged by the quality measure(s). Validate the resulting baseline corrections by visual inspection.
4. Select the baseline algorithm(s) that gives the best quality measure, apply the correction to independent validation data, and predict the response using the model(s) from the calibration.
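The looping core of this procedure can be sketched as a grid search. In this toy version (function names and the quality surface are hypothetical; in the real analysis `evaluate` would run the baseline correction and PLSR and return the RMSECV), the best parameter combination is kept per algorithm, and the algorithm with the overall lowest quality measure wins:

```python
import itertools

def optimize_baseline(grids, evaluate):
    """Steps 1-3: for each algorithm, score every combination of its
    parameter levels with the quality measure (lower is better, e.g.
    RMSECV) and keep the best combination per algorithm."""
    best = {}
    for name, grid in grids.items():
        for values in itertools.product(*grid.values()):
            params = dict(zip(grid, values))
            score = evaluate(name, params)
            if name not in best or score < best[name][0]:
                best[name] = (score, params)
    # Step 4: the algorithm whose best combination scores lowest.
    winner = min(best, key=lambda name: best[name][0])
    return winner, best
```

In practice the selected corrections should still be validated by visual inspection before applying them to independent validation data.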

### Partial least squares regression

The statistical analysis is performed using partial least squares regression (PLSR),^{3} which is a dimension-reducing method able to handle high-dimensional data. It compresses the predictor space down to a few dimensions by making linear combinations through covariances between predictors and the response. Cross-validation^{4} is applied as a tool to find the optimal number of components/dimensions to use in the regression. Root mean squared error of prediction (RMSEP) is utilized to assess the goodness of fit (the acronym RMSECV is used when predictions come from cross-validation). RMSEP for a *k*-component model is denoted *θ _{k}*, and is computed as:

$$\theta_k = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_{i,k}\right)^2}$$

where *N* is the number of samples, *y _{i}* is the measured response of the *i*-th sample, and *ŷ _{i,k}* is its prediction using a *k*-component model.
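As a small illustration (the function name is ours, not the package's), *θ _{k}* is simply the root of the mean squared residual between measured and predicted responses:

```python
import numpy as np

def rmsep(y, y_hat):
    """theta_k = sqrt( (1/N) * sum_i (y_i - yhat_{i,k})^2 ).

    y     : measured responses.
    y_hat : predictions from a k-component model (or from
            cross-validation, in which case this is the RMSECV).
    """
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```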

### Software

In this paper, the authors make use of the R software environment for statistical computing and graphics (http://www.r-project.org). This is an open-source project under the GNU General Public License. R has become the de facto standard among statisticians and has a rapidly increasing number of user-submitted packages for all kinds of statistical analyses. The baseline package is available on the Comprehensive R Archive Network (http://cran.r-project.org) together with all packages it depends on. It contains several baseline correction methods, available both through a GUI, in which the effects of algorithms and parameters can be inspected visually, and as command-line methods for large-scale use and automation. In addition, the optimization procedure described above is included in the package, accompanied by its own GUI for setting up, running, and inspecting optimizations.

## Experimental

This section describes in general terms how a baseline correction using real data and free software was set up, executed, and evaluated. Accompanying R code examples are shown in *Figure 2*. The numbering of the following actions refers to the numbered list in the Methods section (“Optimization”).

Before the baseline correction begins, the software of choice (R) is opened, relevant packages are loaded (baseline correction and regression packages), and data are imported to the software.

1. Visual inspection of the effect of baseline corrections can be a good starting point for the optimization. In the baseline package, this is easily achieved using the baseline GUI function. For this analysis, the baseline correction methods chosen were asymmetric least squares (ALS),^{5} local medians,^{6} iterative polynomial fitting,^{7} and robust baseline estimation (RBE),^{8} along with the in-house methods Fill Peaks and iterative restricted least squares (IRLS). The choices made thus far are still quite subjective, but after choosing the initial parameters for baseline correction, one can span a grid of parameter values around the visually chosen values to use for the optimization (e.g., through the GUI, *Figure 3*).
2. Performing baseline correction for all combinations of parameter values and performing the statistical analysis needed on all of them is a matter of systematic execution and well-organized storage of results. If it is done manually, one has to ensure that looping is performed through all algorithms and parameter value combinations, and that the storage of the results can be traced back to the correct combinations. Through the GUI shown in *Figure 3*, this part of the optimization is done by clicking on “Start,” and all the looping and storing are done automatically. Many baseline algorithms are computationally heavy, especially for high-resolution spectra. This means that testing a large number of algorithms and parameter value combinations can be very time consuming. In practice, it might therefore be advisable to optimize in two or more steps to try to home in on the best-performing combinations, rather than having high enough resolution on the search to find the optimum in one run.
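The multi-step strategy can be sketched as a coarse pass followed by a finer pass around the best coarse value. This is an illustrative one-parameter version (the function name and grid sizes are arbitrary choices, not the package's procedure), where `evaluate` stands for the expensive correct-then-model-then-score step:

```python
import numpy as np

def two_stage_search(evaluate, lo, hi, n_coarse=6, n_fine=9):
    """Coarse grid over [lo, hi], then a finer grid bracketing the best
    coarse value -- far fewer evaluations than one high-resolution grid."""
    coarse = np.linspace(lo, hi, n_coarse)
    best = min(coarse, key=evaluate)           # cheapest pass first
    step = (hi - lo) / (n_coarse - 1)
    fine = np.linspace(max(lo, best - step),   # zoom in around the winner,
                       min(hi, best + step),   # clipped to the original range
                       n_fine)
    return min(fine, key=evaluate)
```

With 6 + 9 = 15 evaluations this reaches roughly the resolution of a 40-point single-stage grid, which matters when each evaluation involves a full baseline correction and cross-validated regression.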