Chemometrics is the discipline concerned
with the application of mathematical and
statistical methods to chemical measurements.
The goals in applying chemometrics
to data include improving the overall
measurement process, extracting more
useful information from the system, and
gaining a better understanding of that
information. This is an active field of
research in many parts of the world, with
new and exciting developments. This
article reviews a few highlights in
chemometrics from late 2003 through
mid-2004, covering both theoretical
developments and their application
to chemical analysis problems.
Discussion
Two of the most active areas in chemometrics
research are support vector
machines (SVM) and three-way analysis.
Three-way analysis methods have
become more important as large data
sets become the norm. This is particularly
true for process analysis data and
imaging applications. SVM is a nonlinear
modeling technique that has been
gaining popularity in other fields, but
has only recently been applied in
chemometrics. There are several articles
that give a background on the
technique, including one by Vapnik1
and a tutorial provided by Smola
and Schölkopf.2
Recent publications are becoming
more plentiful on this topic as it
applies to chemometrics. Thissen3,4
proposes SVM as a means with which
to optimize a model rapidly. SVM can deal
with ill-posed data sets easily, but finding
the final SVM model can be computationally
difficult because it requires the solution
of a set of nonlinear equations (a quadratic
programming problem). Least-squares SVM,
a class of kernel machines whose training
reduces to a set of linear equations, has
been proposed to solve this problem.
Thissen shows this applied to data from
mixtures of ethanol, water, and 2-propanol
measured over different temperatures. The
least-squares SVM approach produced
more accurate results in each case when
compared to other methods.
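The least-squares SVM idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used by Thissen: the RBF kernel, the parameter names `gamma` and `sigma`, and the synthetic sine-curve data are all choices made here for demonstration. The point it shows is that training reduces to one linear solve.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma=100.0, sigma=1.0):
    """Fit LS-SVM regression by solving a single (n+1) x (n+1) linear system."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0                      # constraint: sum(alpha) = 0
    system[1:, 0] = 1.0                      # bias term
    system[1:, 1:] = K + np.eye(n) / gamma   # regularized kernel matrix
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(system, rhs)
    return solution[1:], solution[0]         # alpha, bias

def lssvm_predict(X_train, alpha, bias, X_new, sigma=1.0):
    """Predict with the fitted dual coefficients and bias."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + bias

# Synthetic nonlinear data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_train = np.linspace(-3.0, 3.0, 40)[:, None]
y_train = np.sin(X_train[:, 0]) + 0.05 * rng.standard_normal(40)

alpha, bias = lssvm_fit(X_train, y_train, gamma=100.0, sigma=1.0)
X_test = np.linspace(-2.5, 2.5, 25)[:, None]
y_pred = lssvm_predict(X_train, alpha, bias, X_test, sigma=1.0)
max_abs_error = float(np.max(np.abs(y_pred - np.sin(X_test[:, 0]))))
```

In practice `gamma` and `sigma` would be tuned, for example by cross-validation; the values above are arbitrary.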
SVMs were applied for sample selection in
classification by Zomer.5,6 The method discussed
illustrates the use of SVM for nonlinearly
separable cases. The SVM method
uses functions of high dimension so that
boundaries can be found to fit a variety of
complex situations. SVM methodology is
also shown to reduce the labeling requirements
in classification over other methods.
Vogt et al.7 introduce a concept called
secured pseudo principal components
regression (PCR) as a means with which to
correct for spectral drift and uncalibrated
spectral features. This method was shown
to produce better results than standard
PCR. The method also demonstrates the
ability to extract estimates of the spectra of
the uncalibrated absorbers. Bro et al.8 discuss
the theory of net analyte signal
(NAS) vectors in inverse regression. The
theory of NAS was originally derived from
classical least squares (CLS), where
responses of all pure analytes and interferents
are assumed to be known. In chemometrics,
the use of inverse methods such as
partial least squares (PLS) is more common.
This article gives a thorough development
of a calibration-specific NAS vector
and shows its application.
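For orientation, the classical (CLS) form of the net analyte signal mentioned above can be sketched as follows. This assumes that all pure spectra are known and uses made-up Gaussian bands; it is not the calibration-specific NAS vector developed in the cited article for inverse regression.

```python
import numpy as np

def net_analyte_signal(analyte, interferents):
    """Part of `analyte` orthogonal to the span of the interferent spectra.

    `interferents` holds one pure interferent spectrum per column.
    """
    projection_onto_interferents = interferents @ np.linalg.pinv(interferents)
    return analyte - projection_onto_interferents @ analyte

# Hypothetical pure spectra: overlapping Gaussian bands on an arbitrary axis.
wavelengths = np.linspace(0.0, 1.0, 200)

def band(center, width):
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

analyte = band(0.40, 0.08)
interferents = np.column_stack([band(0.50, 0.10), band(0.70, 0.06)])
nas = net_analyte_signal(analyte, interferents)

# A noise-free mixture with known concentrations (analyte first).
true_conc = np.array([0.7, 1.3, 0.4])
mixture = true_conc[0] * analyte + interferents @ true_conc[1:]

# Projecting the mixture onto the NAS direction recovers the analyte level.
estimated_conc = float(nas @ mixture / (nas @ nas))
```

Because the NAS vector is orthogonal to every interferent, the interferent contributions drop out of the projection, which is why the estimate depends only on the analyte.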
A bootstrap method for several three-way
data analysis methods (i.e., CANDECOMP,
PARAFAC, TUCKER3) is proposed
by Kiers9 to produce percentile
interval estimates that are related to the
instability level of the solution. The author
shows how these estimates can be interpreted
as confidence intervals for the output
parameters. Loethen et al.10 demonstrate
a new second-derivative variance
minimization procedure that can automatically
extract spectra of a dilute component
from a mixture whose spectrum is
dominated by a major component. It is
not necessary to have the spectrum of the
pure solute. Results are shown from benzene
in hexane and water in acetone measured
by Raman spectroscopy.
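The percentile-interval bootstrap described for three-way methods can be illustrated generically. In this sketch the estimator is a deliberately simple stand-in (a mean profile along the second mode), an assumption made to keep the example short; a real application would refit a PARAFAC or Tucker3 model on each resample and would also have to handle the sign and permutation ambiguities of those models.

```python
import numpy as np

def bootstrap_percentile_interval(data, estimator, n_boot=500, level=0.95, seed=0):
    """Percentile intervals for `estimator`, resampling along the first mode."""
    rng = np.random.default_rng(seed)
    n_samples = data.shape[0]
    estimates = []
    for _ in range(n_boot):
        # Draw samples (first-mode slices) with replacement and re-estimate.
        idx = rng.integers(0, n_samples, size=n_samples)
        estimates.append(estimator(data[idx]))
    estimates = np.asarray(estimates)
    tail = 100.0 * (1.0 - level) / 2.0
    lower = np.percentile(estimates, tail, axis=0)
    upper = np.percentile(estimates, 100.0 - tail, axis=0)
    return lower, upper

def mean_profile(array3):
    """Stand-in estimator: average over samples and over the third mode."""
    return array3.mean(axis=(0, 2))

# Synthetic three-way array: 60 samples x 4 variables x 5 occasions.
rng = np.random.default_rng(1)
true_profile = np.array([1.0, 2.0, 3.0, 4.0])
data = true_profile[None, :, None] + 0.5 * rng.standard_normal((60, 4, 5))

point_estimate = mean_profile(data)
lower, upper = bootstrap_percentile_interval(data, mean_profile)
```

Wide intervals from such a procedure signal that the corresponding parameter is unstable under resampling.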
Hageman et al.11 describe the use of a TABU
search for wavelength selection in calibration.
TABU search is a deterministic
global optimization technique
loosely based on concepts from artificial
intelligence. It examines the search space
in a highly ordered fashion and keeps
track of areas already explored. Given
a particular starting point, the same
end solution will always result. The
methodology is compared with simulated
annealing and genetic algorithms
and is shown to perform comparably.
The TABU search can
provide better search coverage in
cases in which local minima exist.
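A toy version of a tabu search over wavelength subsets is sketched below. The flip-one-wavelength neighbourhood, the fixed-length tabu list, the ordinary least-squares cost with a per-variable penalty, and the synthetic data are all assumptions made for illustration; they are not taken from Hageman et al. The sketch does show the two properties noted above: recently changed variables are remembered, and the result is fully determined by the starting point.

```python
import numpy as np
from collections import deque

def subset_cost(mask, X, y, penalty=0.01):
    """Least-squares RMSE on the selected columns plus a per-variable penalty."""
    if not mask.any():
        return float("inf")
    Xs = X[:, mask]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rmse = np.sqrt(np.mean((y - Xs @ coef) ** 2))
    return float(rmse + penalty * mask.sum())

def tabu_search(X, y, start_mask, n_iter=20, tabu_length=3):
    """Deterministic tabu search over boolean column masks."""
    current = start_mask.copy()
    best_mask, best_cost = current.copy(), subset_cost(current, X, y)
    tabu = deque(maxlen=tabu_length)  # indices flipped most recently
    for _ in range(n_iter):
        candidates = []
        for j in range(current.size):
            if j in tabu:
                continue
            neighbour = current.copy()
            neighbour[j] = not neighbour[j]
            candidates.append((subset_cost(neighbour, X, y), j))
        # Take the best allowed move, even if it is worse than the current state.
        cost, j = min(candidates)
        current[j] = not current[j]
        tabu.append(j)
        if cost < best_cost:
            best_mask, best_cost = current.copy(), cost
    return best_mask, best_cost

# Synthetic data: only columns 1 and 5 carry information about y.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
y = 2.0 * X[:, 1] - 1.5 * X[:, 5] + 0.005 * rng.standard_normal(40)

start_mask = np.zeros(8, dtype=bool)
start_mask[0] = True  # deliberately poor starting point
best_mask, best_cost = tabu_search(X, y, start_mask)
```

Accepting the best non-tabu move even when it worsens the cost is what lets the search leave a local minimum instead of stalling there.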
Geladi et al.12 give an informative discussion
on visualization methods used
in multivariate analysis. They discuss
what we plot and why. Methods
include PCA, three-way analysis, and
regression in general, with examples
provided from near-infrared (NIR) spectroscopic data. The article
stresses the importance of understanding
the data being plotted rather
than just accepting the output of a
software program.
The orthogonal projection approach (OPA)
and multivariate curve resolution (MCR)
are presented by Gourvenec et al.13 as a
way to monitor batch processes using spectroscopic
data. MCR allows one to look at
a batch and predict concentration during a
reaction. OPA is a self-modeling curve resolution
method that will provide estimates
of the number of components in the system
and the shape of the pure spectra
along with concentration profiles of each.
The authors show results from a polymer
reaction and discuss the situation in which
the number of components in a new batch
is different from previous batches. This
method is shown to be a useful tool for
control purposes in on-line measurements.
van Sprang et al.14 discuss the use of a bilinear
grey model to monitor a series of batch
processes using NIR spectroscopy. The term
“grey model” is meant to incorporate characteristics
of white models and black models.
White models are based on first principles
such as kinetic data. Black box models
include regression or neural networks. The
authors use examples from NIR measurement
of a urethane reaction (di-isocyanate
and alcohol). For use in the process, this
involves first collecting the white model
knowledge. In the urethane reaction, this
provided the estimate of the pure spectra of
the di-isocyanate and the alcohol. The
authors demonstrate that this model gives
insight into the physical and chemical
changes occurring during the process that
would otherwise go undetected.
Vogt et al.15 present a method to detect
wavelength shifts in calibration spectra.
When wavelength shifts occur, they usually
introduce derivative-like features in multivariate
loading vectors. If the shifts are random
and not reproducible, then the model
estimates of these features will not help
account for their effect. The method proposed
involves shifting one calibration spectrum
and analyzing whether it is more or less
similar to the unshifted unknown spectrum.
If it is more similar, then the method concludes
that there was a wavelength shift and
is able to correct it before analysis. For data
sets with shifts occurring in a small number
of samples, this method is shown to work
well. If a large number of samples have shifts,
further investigation is normally required.
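The shift-and-compare idea can be sketched as follows. The use of correlation as the similarity measure, linear interpolation for shifting, the candidate grid, and the synthetic two-band spectrum are assumptions for illustration; the text does not specify these details of the published method.

```python
import numpy as np

def shifted(spectrum, axis, shift):
    """Spectrum moved by `shift` axis units (np.interp holds the edges constant)."""
    return np.interp(axis - shift, axis, spectrum)

def estimate_shift(reference, unknown, axis, candidates):
    """Candidate shift of the reference that correlates best with the unknown."""
    scores = [
        np.corrcoef(shifted(reference, axis, s), unknown)[0, 1]
        for s in candidates
    ]
    return float(candidates[int(np.argmax(scores))])

# Synthetic calibration spectrum with two bands (arbitrary wavelength units).
axis = np.linspace(1000.0, 1100.0, 501)
reference = (
    np.exp(-0.5 * ((axis - 1040.0) / 4.0) ** 2)
    + 0.6 * np.exp(-0.5 * ((axis - 1070.0) / 6.0) ** 2)
)

# Unknown spectrum: the same sample measured with a wavelength shift.
true_shift = 1.2
unknown = shifted(reference, axis, true_shift)

candidates = np.linspace(-3.0, 3.0, 31)
found_shift = estimate_shift(reference, unknown, axis, candidates)

# A nonzero best shift suggests a wavelength shift; undo it before analysis.
corrected = shifted(unknown, axis, -found_shift)
max_residual = float(np.max(np.abs(corrected - reference)))
```

A best candidate of zero would indicate that no shift is detectable at the resolution of the candidate grid.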
Skibsted et al.16 use an indicator instead of
prediction statistics to choose the optimal
data preprocessing and wavelength selection.
The indicator is called SE, or signal-to-error indicator, and is based on the net
analyte signal and the total error. The
method requires that a blank spectrum and an
analyte spectrum be measured. Data are
shown for two sets of samples: powders
and tablets. The method is contrasted with
the typical RMSEP or RMSEPcv statistics
(root mean square error of prediction, without
and with cross-validation) normally utilized
in multivariate calibration methods.
Xu et al.17 discuss Monte Carlo cross-validation
methods for selecting the optimal
model and estimating the prediction error
in multivariate calibration. The Monte
Carlo cross-validation method leaves out a
large part of the samples at each stage rather
than one or a handful, as is normally done.
The authors show results for simulated data,
quantitative structure–activity relationship
(QSAR) data, and NIR and UV spectroscopic
data. The method is demonstrated to
find the optimal model using fewer
components and producing better predictive
results. A method for accurately
estimating the model predictive error is also
presented. Bridge-PLS, discussed by Gidskehaug
et al.,18 is a two-block bilinear regression
method used to process large amounts
of data more efficiently. It is a combination
of standard PLS and Bookstein PLS, in
which only one singular value decomposition
is used to extract the latent variables.
The method is illustrated with data from
magnetic resonance imaging (MRI)
with demonstrated time savings over
standard PLS.
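The Monte Carlo cross-validation scheme discussed above can be sketched with a small principal components regression model. The 50% leave-out fraction, the number of random splits, the PCR stand-in model, and the simulated three-component mixture data are assumptions chosen for illustration, not the settings used by Xu et al.

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, n_components):
    """Principal components regression fitted on the training split only."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Xc @ loadings
    coef, *_ = np.linalg.lstsq(scores, y_train - y_mean, rcond=None)
    return (X_test - x_mean) @ loadings @ coef + y_mean

def monte_carlo_cv(X, y, max_components, n_splits=100, leave_out=0.5, seed=0):
    """RMSEP per model size, averaged over random large leave-out splits."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_test = int(round(leave_out * n))
    squared_error = np.zeros(max_components)
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        for k in range(1, max_components + 1):
            pred = pcr_fit_predict(X[train], y[train], X[test], k)
            squared_error[k - 1] += np.sum((y[test] - pred) ** 2)
    return np.sqrt(squared_error / (n_splits * n_test))

# Simulated mixtures of three components with overlapping bands.
rng = np.random.default_rng(0)
channels = np.linspace(0.0, 1.0, 50)
pure = np.array([np.exp(-0.5 * ((channels - c) / 0.1) ** 2) for c in (0.3, 0.5, 0.7)])
conc = rng.uniform(0.0, 1.0, size=(60, 3))
X = conc @ pure + 0.01 * rng.standard_normal((60, 50))
y = conc[:, 0]

rmsep = monte_carlo_cv(X, y, max_components=6)
best_k = int(np.argmin(rmsep)) + 1
```

Leaving out half of the samples in every split penalizes overfitted models more heavily than leave-one-out does, which is the motivation given for the approach.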
Andrew and Fearn19 discuss calibration
transfer by orthogonal projection. Orthogonal
projection is used in the development
of calibrations to make them less sensitive
to variations between instruments. This
allows calibration transfer without adjustment
to the model. The idea is to orthogonalize
the spectral data to the directions in the
spectral space in which most of the
between-instrument variation lies. The
method requires that spectra
from several instruments be available.
Results are shown on agricultural NIR data
for barley and corn and are compared to
other transfer methods. The authors show
better transfer results using this method in
addition to reducing sensitivity to temperature
and sample pathlength variations.
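A rough sketch of the orthogonal projection idea follows. How the between-instrument directions are estimated here (principal directions of the centred instrument mean spectra) and the synthetic data with a single sloping instrument effect are assumptions for illustration; the text does not give the exact procedure of Andrew and Fearn.

```python
import numpy as np

def between_instrument_directions(spectra_by_instrument, n_directions):
    """Leading directions of variation among the instrument mean spectra."""
    means = np.array([s.mean(axis=0) for s in spectra_by_instrument])
    centred = means - means.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return Vt[:n_directions].T  # orthonormal columns, one per direction

def project_out(spectra, directions):
    """Remove the components of each spectrum along the given directions."""
    return spectra - (spectra @ directions) @ directions.T

# Synthetic case: the same 20 samples measured on three instruments that
# differ only by an instrument-specific multiple of one sloping feature.
rng = np.random.default_rng(0)
n_samples, n_wavelengths = 20, 30
base = rng.standard_normal((n_samples, n_wavelengths))
drift_shape = np.linspace(-1.0, 1.0, n_wavelengths)
offsets = [0.0, 0.5, -0.8]
spectra_by_instrument = [base + b * drift_shape for b in offsets]

directions = between_instrument_directions(spectra_by_instrument, n_directions=1)
corrected = [project_out(s, directions) for s in spectra_by_instrument]

gap_before = float(np.max(np.abs(spectra_by_instrument[0] - spectra_by_instrument[1])))
gap_after = float(np.max(np.abs(corrected[0] - corrected[1])))
```

A calibration built on the projected spectra never sees the removed directions, so it can be applied to projected spectra from another instrument without adjusting the model.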
Thomas20 proposes a method for selecting
the number of latent variables in multivariate
calibration. The method applies nonparametric
statistical tests, namely the sign test and
the Wilcoxon signed-rank test. Results are shown
from an octane data set. The method is
shown to be sensitive to small differences in
performance but still robust to unusual
observations. The Wilcoxon signed-rank test is
demonstrated to be the preferred method.
Seipel and Kalivas21 examine ways to measure
the effective rank for a given model.
The approach applies to all modeling methods. The
definition is based on the regression vector
norm. The proper definition of the effective
rank permits a better assessment of the
number of degrees of freedom. Examples
are shown with spectroscopic data using
PLS, PCR, and ridge regression models.
Myles and Brown22 propose decision pathway
modeling as a pattern recognition
method for multigroup classification problems.
The architecture is an interconnected
graph of nodes and partial pathways. The
method depends on the construction of
accurate binary classification models and
can be computationally expensive. Four
data sets are used for demonstration.
Improvements are demonstrated over traditional
methods in several cases.
Wold et al.23 discuss the use of multivariate
design in PLS modeling. The design
can be done in the original variable space
or in PLS scores space for x-variables, y-variables,
or both. The authors also show
how this can be used to select data from a
larger population for use in calibration.
The use of orthogonal signal correction
(OSC) is presented by Woody et
al.24 as a means to transfer multivariate
calibrations between spectrophotometers.
This is a comparative study of different
methods to determine the optimal
method for calibration transfer.
Summary
The field of chemometrics is an active area
of research. The potential benefits to be
gained in applying this research include
improved calibration transfer, better modeling
for nonlinear data sets, a clearer understanding
of data, and the ability to extract
the maximum information from large data
sets with ease. This paper is not meant to be
an all-inclusive review of work done in this
area, but rather a highlight of some of the
interesting areas being explored.
References
1. Vapnik V. The nature of statistical learning
theory. New York: Springer-Verlag, 1995.
2. Smola AJ, Schölkopf B. A tutorial on
support vector regression. NeuroCOLT
Technical Report NC-TR-98-030. London:
Royal Holloway College, University
of London, 1998.
3. Thissen U, Ustun B, Melssen W, Buydens
LMC. Multivariate calibration with
least-squares support vector machines.
Anal Chem 2004; 76:3099–3105.
4. Thissen U, Pepers M, Ustun B, Melssen
WJ, Buydens LMC. Comparing support
vector machines to PLS for spectral regression
applications. Chemometrics and
Intelligent Lab Sys 2004; 73:169–79.
5. Zomer S. Active learning support vector
machines for optimal sample selection in classification.
J Chemometrics 2004; 18:294–305.
6. Zomer S. Classification with support vector
machines. Nov 2004. www.acc.umu.se/~tnkjtg/chemometrics/editorial/nov2004.pdf.
7. Vogt F, Mizaikoff B. Fault-tolerant spectroscopic
data evaluation based on
extended principal component regression
correcting for spectral drifts and uncalibrated
spectral features. J Chemometrics
2003; 17:660–5.
8. Bro R, Andersen CM. Theory for net
analyte signal vectors in inverse regression.
J Chemometrics 2003; 17:646–52.
9. Kiers HAL. Bootstrap confidence intervals
for three-way methods. J Chemometrics
2004; 18:22–36.
10. Loethen YL, Zhang D, Favors RN, Basiaga
SBG, Ben-Amotz D. Second-derivative
variance minimization method
for automated spectral subtraction. Appl
Spectrosc 2004; 58:272–8.
11. Hageman JA, Streppel M, Wehrens R, Buydens
LMC. Wavelength selection with TABU
search. J Chemometrics 2003; 17:427–37.
12. Geladi P, Manley M, Lestander T. Scatter
plotting in multivariate data analysis.
J Chemometrics 2003; 17:503–11.
13. Gourvenec S, Lamotte C, Pestiaux P,
Massart DL. Use of the orthogonal projection
approach (OPA) to monitor batch
processes. Appl Spectrosc 2003; 57:80–7.
14. van Sprang ENM, Ramaker H, Westerhuis
JA, Smilde AK, Gurden SP,
Wienke D. Near-infrared spectroscopic
monitoring of a series of industrial batch
processes using a bilinear grey model.
Appl Spectrosc 2003; 57:1007–19.
15. Vogt F, Booksh K. Influence of
wavelength-shifted calibration spectra on multivariate
calibration models. Appl Spectrosc
2004; 58:625–34.
16. Skibsted ETS, Boelens HFM, Westerhuis
JA, Witte DT, Smilde AK. New indicator
for optimal preprocessing and wavelength
selection of near-infrared spectra. Appl
Spectrosc 2004; 58:264–71.
17. Xu Q, Liang Y, Du Y. Monte Carlo
cross-validation for selecting a model and
estimating the prediction error in multivariate
calibration. J Chemometrics
2004; 18:112–20.
18. Gidskehaug L, Stodkilde-Jorgensen H,
Martens M, Martens H. Bridge-PLS
regression: two-block bilinear regression
without deflation. J Chemometrics 2004;
18:208–15.
19. Andrew A, Fearn T. Transfer by orthogonal
projection: making near-infrared
calibrations robust to between-instrument
variation. Chemometrics and Intelligent
Lab Sys 2004; 72:51–6.
20. Thomas EV. Non-parametric statistical
methods for multivariate calibration
model selection and comparison. J
Chemometrics 2003; 17:653–9.
21. Seipel HA, Kalivas J. Effective rank for
multivariate calibration methods. J
Chemometrics 2004; 18:306–11.
22. Myles AJ, Brown SD. Decision pathway
modeling. J Chemometrics 2004;
18:286–93.
23. Wold S, Josefson M, Gottfries J, Linusson
A. The utility of multivariate design
in PLS modeling. J Chemometrics 2004;
18:156–65.
24. Woody NA, Feudale RN, Myles AJ,
Brown SD. Transfer of multivariate calibrations
between four near-infrared spectrometers
using orthogonal signal correction.
Anal Chem 2004; 76:2595–2600.
Ms. Foulk is Senior Applications Chemist,
Guided Wave, Inc., 5190 Golden Foothill
Pkwy., El Dorado Hills, CA 95762, U.S.A.;
tel.: 916-939-4300; fax: 916-939-4307;
e-mail: [email protected].