What’s Happening in Chemometrics?

Chemometrics is the discipline associated with the application of mathematical and statistical methods to chemical measurements. The goals in applying chemometrics to data include improvement of the overall measurement process, ability to extract more useful information from the system, and enhanced understanding of the information. This is an active field of research in many parts of the world, with new and exciting developments. This article will review a few of the highlights from chemometric developments from late 2003 through mid-2004 both in theoretical developments and in their application to chemical analysis problems.

Discussion

Two of the most active areas in chemometrics research are support vector machines (SVM) and three-way analysis. Three-way analysis methods have become more important as large data sets become the norm. This is particularly true for process analysis data and imaging applications. SVM is a nonlinear modeling technique that has been gaining popularity in other fields, but has only recently been applied in chemometrics. There are several articles that give a background on the technique, including one by Vapnik1 and a tutorial provided by Smola and Schölkopf.2

Publications applying SVM to chemometrics are becoming more plentiful. Thissen3,4 proposes SVM as a means with which to optimize a model rapidly. SVM can deal with ill-posed data sets easily, but finding the final SVM model can be computationally difficult because it requires the solution of a set of nonlinear equations. Least-squares SVM, a class of kernel machines whose training reduces to solving a set of linear equations, has been proposed to address this problem. Thissen shows this applied to data from mixtures of ethanol, water, and 2-propanol measured at different temperatures. The least-squares SVM approach produced more accurate results in each case when compared to other methods.
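To make the computational point concrete, the following is a minimal sketch of least-squares SVM regression in which training is a single linear solve. It is not Thissen's implementation: the RBF kernel, the parameter names `gamma` (regularization) and `sigma` (kernel width), and the small Gaussian-elimination helper are all assumptions chosen for illustration.

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def lssvm_fit(X, y, gamma=100.0, sigma=1.0):
    """Train by solving the bordered system [[0, 1^T], [1, K + I/gamma]] [b, alpha] = [0, y]."""
    n = len(y)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # ridge term on the kernel diagonal
        A.append(row)
    solution = solve_linear(A, [0.0] + list(y))
    return solution[0], solution[1:]  # bias, support values (alphas)

def lssvm_predict(x_new, X, bias, alphas, sigma=1.0):
    """Prediction is the bias plus a kernel-weighted sum over all training samples."""
    return bias + sum(a * rbf_kernel(x_new, xi, sigma) for a, xi in zip(alphas, X))
```

A call such as `bias, alphas = lssvm_fit(spectra, concentrations)` followed by `lssvm_predict(new_spectrum, spectra, bias, alphas)` would be typical use, where `spectra` is a list of equal-length lists. Every training sample receives a nonzero weight in this formulation, which is the price paid for replacing the harder optimization with a linear solve.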

SVMs were applied for sample selection in classification by Zomer.5,6 The method discussed illustrates the use of SVM for nonlinearly separable cases. The SVM method uses functions of high dimension so that boundaries can be found to fit a variety of complex situations. SVM methodology is also shown to reduce the labeling requirements in classification over other methods.

Vogt et al.7 introduce a concept called secured pseudo principal components regression (PCR) as a means with which to correct for spectral drift and uncalibrated spectral features. This method was shown to produce better results than standard PCR. The method also demonstrates the ability to extract estimates of the spectra of the uncalibrated absorbers. Bro et al.8 discuss the theory of net analyte signal (NAS) vectors in inverse regression. The theory of NAS was originally derived from classical least squares (CLS), where responses of all pure analytes and interferents are assumed to be known. In chemometrics, the use of inverse methods such as partial least squares (PLS) is more common. This article gives a thorough development of a calibration-specific NAS vector and shows its application.

A bootstrap method for several three-way data analysis methods (i.e., CANDECOMP, PARAFAC, TUCKER3) is proposed by Kiers9 to produce percentile interval estimates that are related to the instability level of the solution. Kiers shows how these estimates can be interpreted as confidence intervals for the output parameters. Loethen et al.10 demonstrate a new second-derivative variance minimization procedure that can automatically extract the spectrum of a dilute component from a mixture whose spectrum is dominated by a major component. It is not necessary to have the spectrum of the pure solute. Results are shown for benzene in hexane and water in acetone measured by Raman spectroscopy.
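The percentile-interval idea behind the bootstrap can be sketched independently of any three-way model. The example below uses a plain sample mean as a stand-in for a model parameter; the function name, the default of 2000 resamples, and the fixed seed are assumptions. In the three-way setting each resample would require refitting the decomposition, and that step is not shown here.

```python
import random

def percentile_interval(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Bootstrap percentile interval for statistic(data).

    Resamples the observations with replacement, recomputes the statistic
    each time, and returns the lower and upper percentiles of the results.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_boot - 1))]
    upper = estimates[int((1.0 - tail) * (n_boot - 1))]
    return lower, upper

def mean(values):
    return sum(values) / len(values)
```

For example, `percentile_interval(replicate_values, mean)` returns a pair of bounds; a wide interval signals an unstable estimate, which is the interpretation the article attributes to the method.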

Hageman et al.11 describe the use of a TABU search for wavelength selection in calibration. TABU search is a deterministic global optimization technique loosely based on concepts from artificial intelligence. It examines the search space in a highly ordered fashion and keeps track of areas already explored. Given a particular starting point, the same end solution will always result. The methodology is compared with simulated annealing and genetic algorithms and is shown to perform comparably. The TABU search can provide better search coverage in cases in which local minima exist.
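A minimal sketch of a tabu search over a binary wavelength mask follows. It illustrates the general technique rather than the algorithm of Hageman et al.: the single-bit-flip neighborhood, the tabu list length, the aspiration rule, and the `cost` callback are assumptions. In practice `cost` would presumably return something like a cross-validated prediction error for a calibration built on the selected wavelengths.

```python
from collections import deque

def tabu_search(start_mask, cost, n_iter=50, tabu_len=5):
    """Deterministic tabu search; mask[i] == 1 means wavelength i is selected."""
    current = list(start_mask)
    best, best_cost = current[:], cost(current)
    tabu = deque(maxlen=tabu_len)  # indices flipped recently, barred from reuse
    for _ in range(n_iter):
        candidates = []
        for i in range(len(current)):
            neighbour = current[:]
            neighbour[i] = 1 - neighbour[i]  # flip one wavelength in or out
            c = cost(neighbour)
            # Aspiration: a tabu move is allowed if it beats the best so far.
            if i not in tabu or c < best_cost:
                candidates.append((c, i, neighbour))
        if not candidates:
            break
        # Take the best admissible move even if it is worse than the current
        # solution; this is what lets the search climb out of local minima.
        c, i, current = min(candidates, key=lambda t: (t[0], t[1]))
        tabu.append(i)
        if c < best_cost:
            best, best_cost = current[:], c
    return best, best_cost
```

Because ties are broken by index and no random numbers are used, the same starting mask always yields the same result, matching the deterministic behavior described above.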

Geladi et al.12 give an informative discussion on visualization methods used in multivariate analysis. They discuss what we plot and why. Methods include PCA, three-way analysis, and regression in general, with examples provided from near-infrared (NIR) spectroscopic data. The article stresses the importance of understanding the data being plotted rather than just accepting the output of a software program.

Orthogonal projection analysis (OPA) and multivariate curve resolution (MCR) are presented by Gourvenec et al.13 as a way to monitor batch processes using spectroscopic data. MCR allows one to look at a batch and predict concentration during a reaction. OPA is a self-modeling curve resolution method that will provide estimates of the number of components in the system and the shape of the pure spectra along with concentration profiles of each. The authors show results from a polymer reaction and discuss the situation in which the number of components in a new batch is different from previous batches. This method is shown to be a useful tool for control purposes in on-line measurements. van Sprang et al.14 discuss the use of a bilinear grey model to monitor a series of batch processes using NIR spectroscopy. The term “grey model” is meant to incorporate characteristics of white models and black models. White models are based on first principles such as kinetic data. Black models include regression or neural networks. The authors use examples from NIR measurement of a urethane reaction (di-isocyanate and alcohol). For use in the process, this involves first collecting the white model knowledge. In the urethane reaction, this provided the estimate of the pure spectra of the di-isocyanate and the alcohol. The authors demonstrate that this model gives insight into the physical and chemical changes occurring during the process that would otherwise go undetected.

Vogt et al.15 present a method to detect wavelength shifts in calibration spectra. When wavelength shifts occur, they usually introduce derivative-like features in multivariate loading vectors. If the shifts are random and not reproducible, then the model estimates of these features will not help account for their effect. The method proposed involves shifting one calibration spectrum and analyzing whether it is more or less similar to the unshifted unknown spectrum. If it is more similar, then the method concludes that there was a wavelength shift and is able to correct it before analysis. For data sets with shifts occurring in a small number of samples, this method is shown to work well. If a large number of samples have shifts, further investigation is normally required.
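The shift-and-compare idea can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure of Vogt and Booksh: shifts are restricted to whole detector channels, similarity is measured by cosine similarity, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def shift_spectrum(spectrum, k):
    """Shift by k channels (positive moves features to higher indices).

    Channels that fall off the end are filled with the nearest edge value.
    """
    n = len(spectrum)
    return [spectrum[min(max(i - k, 0), n - 1)] for i in range(n)]

def estimate_shift(calibration, unknown, max_shift=5):
    """Channel shift of the calibration spectrum that best matches the unknown.

    A result of 0 means the unshifted spectrum is already the closest match.
    """
    return max(
        range(-max_shift, max_shift + 1),
        key=lambda k: cosine_similarity(shift_spectrum(calibration, k), unknown),
    )
```

If `estimate_shift(calibration, unknown)` returns a nonzero `k`, the unknown could be realigned with `shift_spectrum(unknown, -k)` before prediction. Sub-channel shifts would need interpolation, which this sketch omits.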

Skibsted et al.16 use an indicator instead of prediction statistics to choose the optimal data preprocessing and wavelength selection. The indicator, called SE (signal-to-error), is based on the net analyte signal and the total error. The method requires that a blank spectrum and an analyte spectrum be measured. Data are shown for two sets of samples: powders and tablets. The method is contrasted with the RMSEP or RMSEPcv statistics (root mean square error of prediction, from a test set or from cross-validation) normally used in multivariate calibration.

Xu et al.17 discuss Monte Carlo cross-validation methods for selecting the optimal model and estimating the prediction error in multivariate calibration. The Monte Carlo cross-validation method leaves out a large part of the samples at each stage rather than one or a handful, as is normally done. The authors show results for simulated data, quantitative structure–activity relationship (QSAR) data, and NIR and UV spectroscopic data. The method is demonstrated to find the optimal model using fewer components and producing better predictive results. A method for accurately estimating the model predictive error is also presented. Bridge-PLS, discussed by Gidskehaug et al.,18 is a two-block bilinear regression method used to process large amounts of data more efficiently. It is a combination of standard PLS and Bookstein PLS, in which only one singular value decomposition is used to extract the latent variables. The method is illustrated with data from magnetic resonance imaging (MRI) with demonstrated time savings over standard PLS.
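Monte Carlo cross-validation as described above can be sketched in a few lines. The split fraction, number of splits, and the two toy candidate models (a constant and a straight line) are assumptions standing in for calibration models with different numbers of components; this is not the authors' code.

```python
import random

def monte_carlo_cv(x, y, fit, predict, n_splits=200, leave_out=0.5, seed=0):
    """Mean squared prediction error over repeated random train/validation splits.

    Each split holds out a large fraction of the samples (leave_out),
    in contrast to leave-one-out cross-validation.
    """
    rng = random.Random(seed)
    n = len(x)
    n_val = max(1, int(round(leave_out * n)))
    total = 0.0
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        val, train = idx[:n_val], idx[n_val:]
        model = fit([x[i] for i in train], [y[i] for i in train])
        total += sum((y[i] - predict(model, x[i])) ** 2 for i in val) / n_val
    return total / n_splits

# Two candidate models of different complexity (illustrative stand-ins).
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept
```

Model selection then amounts to calling `monte_carlo_cv` once per candidate with the same seed and keeping the candidate with the lowest error.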

Andrew and Fearn19 discuss calibration transfer by orthogonal projection. Orthogonal projection is used in the development of calibrations to make them less sensitive to variations between instruments. This allows calibration transfer without adjustment to the model. The idea is to orthogonalize the spectral data to directions in the spectral space in which most of the variation lies. The method requires that spectra from several instruments be available. Results are shown on agricultural NIR data for barley and corn and are compared to other transfer methods. The authors show better transfer results using this method in addition to reducing sensitivity to temperature and sample pathlength variations.

Thomas20 proposes a method for selecting the number of latent variables in multivariate calibration. This includes nonparametric statistical methods using the sign test and the Wilcoxon rank test. Results are shown from an octane data set. The method is shown to be sensitive to small differences in performance but still robust to unusual observations. The Wilcoxon rank test is demonstrated to be the preferred method.
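As an illustration of the simpler of the two tests, the sketch below applies a two-sided sign test to paired prediction errors from two candidate models, for instance models with k and k+1 latent variables. The function name and the handling of ties (dropped) are assumptions, and this is not necessarily how Thomas sets up the comparison.

```python
from math import comb

def sign_test_p_value(errors_a, errors_b):
    """Two-sided sign test on paired absolute prediction errors.

    Counts how often each model has the smaller absolute error on the same
    sample; ties are dropped. Under the null hypothesis each model wins
    with probability 0.5, so the win count follows a binomial distribution.
    """
    wins_a = sum(abs(a) < abs(b) for a, b in zip(errors_a, errors_b))
    wins_b = sum(abs(a) > abs(b) for a, b in zip(errors_a, errors_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no informative pairs
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

One plausible use is to add latent variables only while the larger model is significantly better than the smaller one at a chosen significance level. Because only the sign of each paired difference is used, a single very large error cannot dominate the decision, which is consistent with the robustness to unusual observations noted above.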

Seipel and Kalivas21 looked at ways to measure the effective rank for a given model. This applies to all modeling methods. The definition is based on the regression vector norm. The proper definition of the effective rank permits a better assessment of the number of degrees of freedom. Examples are shown with spectroscopic data using PLS, PCR, and ridge regression models.

Myles and Brown22 propose decision pathway modeling as a pattern recognition method for multigroup classification problems. The architecture is an interconnected graph of nodes and partial pathways. The method depends on the construction of accurate binary classification models and can be computationally expensive. Four data sets are used for demonstration. Improvements are demonstrated over traditional methods in several cases.

Wold et al.23 discuss the use of multivariate design in PLS modeling. The design can be done in the original variable space or in PLS scores space for x-variables, y-variables, or both. The authors also show how this can be used to select data from a larger population for use in calibration.

The use of orthogonal signal correction (OSC) is presented by Woody et al.24 as a means to transfer multivariate calibrations between spectrophotometers. This is a comparative study of different methods to determine the optimal method for calibration transfer.

Summary

The field of chemometrics is an active area of research. The potential benefits to be gained in applying this research include improved calibration transfer, better modeling for nonlinear data sets, a clearer understanding of data, and the ability to extract the maximum information from large data sets with ease. This paper is not meant to be an all-inclusive review of work done in this area, but rather a highlight of some of the interesting areas being explored.

References

  1. Vapnik V. The nature of statistical learning theory. New York: Springer-Verlag, 1995.
  2. Smola AJ, Schölkopf B. A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030. London: Royal Holloway College, University of London, 1998.
  3. Thissen U, Ustun B, Melssen W, Buydens LMC. Multivariate calibration with least-squares support vector machines. Anal Chem 2004; 76:3099–3105.
  4. Thissen U, Pepers M, Ustun B, Melssen WJ, Buydens LMC. Comparing support vector machines to PLS for spectral regression applications. Chemometrics and Intelligent Lab Sys 2004; 73:169–79.
  5. Zomer S. Active learning support vector machines for optimal sample selection in classification. J Chemometrics 2004; 18:294–305.
  6. Zomer S. Classification with support vector machines. Nov 2004. www.acc.umu.se/~tnkjtg/chemometrics/editorial/nov2004.pdf.
  7. Vogt F, Mizaikoff B. Fault-tolerant spectroscopic data evaluation based on extended principal component regression correcting for spectral drifts and uncalibrated spectral features. J Chemometrics 2003; 17:660–5.
  8. Bro R, Andersen CM. Theory for net analyte signal vectors in inverse regression. J Chemometrics 2003; 17:646–52.
  9. Kiers HAL. Bootstrap confidence intervals for three-way methods. J Chemometrics 2004; 18:22–36.
  10. Loethen YL, Zhang D, Favors RN, Basiaga SBG, Ben-Amotz D. Second-derivative variance minimization method for automated spectral subtraction. Appl Spectrosc 2004; 58:272–8.
  11. Hageman JA, Streppel M, Wehrens R, Buydens LMC. Wavelength selection with TABU search. J Chemometrics 2003; 17:427–37.
  12. Geladi P, Manley M, Lestander T. Scatter plotting in multivariate data analysis. J Chemometrics 2003; 17:503–11.
  13. Gourvenec S, Lamotte C, Pestiaux P, Massart DL. Use of the orthogonal projection approach (OPA) to monitor batch processes. Appl Spectrosc 2003; 57:80–7.
  14. van Sprang ENM, Ramaker H, Westerhuis JA, Smilde AK, Gurden SP, Wienke D. Near-infrared spectroscopic monitoring of a series of industrial batch processes using a bilinear grey model. Appl Spectrosc 2003; 57:1007–19.
  15. Vogt F, Booksh K. Influence of wavelength-shifted calibration spectra on multivariate calibration models. Appl Spectrosc 2004; 58:625–34.
  16. Skibsted ETS, Boelens HFM, Westerhuis JA, Witte DT, Smilde AK. New indicator for optimal preprocessing and wavelength selection of near-infrared spectra. Appl Spectrosc 2004; 58:264–71.
  17. Xu Q, Liang Y, Du Y. Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration. J Chemometrics 2004; 18:112–20.
  18. Gidskehaug L, Stodkilde-Jorgensen H, Martens M, Martens H. Bridge-PLS regression: two-block bilinear regression without deflation. J Chemometrics 2004; 18:208–15.
  19. Andrew A, Fearn T. Transfer by orthogonal projection: making near-infrared calibrations robust to between-instrument variation. Chemometrics and Intelligent Lab Sys 2004; 72:51–6.
  20. Thomas EV. Non-parametric statistical methods for multivariate calibration model selection and comparison. J Chemometrics 2003; 17:653–9.
  21. Seipel HA, Kalivas J. Effective rank for multivariate calibration methods. J Chemometrics 2004; 18:306–11.
  22. Myles AJ, Brown SD. Decision pathway modeling. J Chemometrics 2004; 18:286–93.
  23. Wold S, Josefson M, Gottfries J, Linusson A. The utility of multivariate design in PLS modeling. J Chemometrics 2004; 18:156–65.
  24. Woody NA, Feudale RN, Myles AJ, Brown SD. Transfer of multivariate calibrations between four near-infrared spectrometers using orthogonal signal correction. Anal Chem 2004; 76:2595–2600.

Ms. Foulk is Senior Applications Chemist, Guided Wave, Inc., 5190 Golden Foothill Pkwy., El Dorado Hills, CA 95762, U.S.A.; tel.: 916-939-4300; fax: 916-939-4307.
