A Systematic Computational Approach for Identifying Small Molecules From Accurate-Mass Fragmentation Data

The relative ease of obtaining accurate-mass fragmentation data on modern LC-MS instruments, faster computers, and the availability of large molecular structure databases have recently made it possible to change the “art” of interpreting mass spectral data into a systematic computational process. The basic approach described here is to generate simple models (modular structures) of compounds that are consistent with all of the available mass spectral data, and, where applicable, to subsequently compare those models to molecular structures of known compounds. Three applications will be briefly described: de novo identification, identification based on comparison to “template” compounds, and identification by matching compounds in molecular structure databases.

Modular structures

Figure 1 - Modular structure of xemilofiban.

A modular structure is a two-dimensional arrangement of connected groups of atoms (subfragments) that is consistent with a given set of mass spectral data. Usually more than one modular structure can be drawn from a given set of mass spectral data. A modular structure of xemilofiban, shown in Figure 1, is very similar to its molecular structure. The principal differences are: 

  1. The number of hydrogen atoms in the subfragments may differ from the number of hydrogen atoms in the corresponding molecular substructures, and
  2. The subfragments of modular structures are simply combinations of atoms, with no additional structural detail.

Generating modular structures from mass spectral data

The process of generating a modular structure from mass spectral data is called partitioning,1 and it is described as follows:

Step 1: The spectral ions are neutralized by adding the mass of a proton to negative ions and subtracting the mass of a proton from positive ions. Positive and negative ion data are then pooled.
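
As an illustration of Step 1, the following minimal sketch neutralizes and pools ion masses. It assumes singly charged ions observed as [M+H]+ in positive mode and [M-H]- in negative mode; the function names `neutralize` and `pool` are hypothetical and are not taken from any published software.

```python
# Mass of a proton in unified atomic mass units (u).
PROTON_MASS = 1.007276

def neutralize(mz, polarity):
    """Return the neutral mass for a singly charged ion.

    Assumes [M+H]+ in positive mode and [M-H]- in negative mode.
    """
    if polarity == "+":
        return mz - PROTON_MASS
    if polarity == "-":
        return mz + PROTON_MASS
    raise ValueError("polarity must be '+' or '-'")

def pool(positive_mz, negative_mz):
    """Neutralize both ion lists and merge them into one sorted list."""
    masses = [neutralize(m, "+") for m in positive_mz]
    masses += [neutralize(m, "-") for m in negative_mz]
    return sorted(masses)
```

In practice, neutralized masses from the two polarities that agree within the mass window would presumably be merged; that step is omitted here.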

Step 2: Possible molecular formulas for the whole compound are then tabulated. The inputs here are the accurate mass of the whole compound, a mass window, isotope ratio data of the whole compound, and an isotope ratio window. The window parameters are based on the expected accuracy of the mass spectral data.

Additional limitations on the molecular formulas are applied, depending on the particular application. For de novo identification work, an additional restriction based on element ratios is also applied.2 When searching a molecular structure database, formulas not present in that database are excluded. When using the template approach, only the formula of the template compound is used in assigning its fragments and subfragments.
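
The mass-window part of Step 2 can be sketched as a brute-force enumeration. This is only an illustration under stated assumptions: the element set is limited to C, H, N, and O, the upper bounds in `max_counts` are arbitrary, and the isotope ratio and element ratio filters described above are not implemented. The function name `candidate_formulas` is hypothetical.

```python
# Monoisotopic masses of the most abundant isotopes (u).
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def candidate_formulas(target, window, max_counts):
    """List (C, H, N, O) counts whose monoisotopic mass is within +/- window of target."""
    hits = []
    for c in range(max_counts["C"] + 1):
        for n in range(max_counts["N"] + 1):
            for o in range(max_counts["O"] + 1):
                heavy = c * MASS["C"] + n * MASS["N"] + o * MASS["O"]
                if heavy > target + window:
                    break  # adding more oxygen only increases the mass
                # Hydrogen count that brings the total closest to the target.
                h = round((target - heavy) / MASS["H"])
                if 0 <= h <= max_counts["H"]:
                    mass = heavy + h * MASS["H"]
                    if abs(mass - target) <= window:
                        hits.append(((c, h, n, o), round(mass, 4)))
    return hits

# Example: a neutral mass of 180.0634 with a +/- 0.002 window.
limits = {"C": 15, "H": 30, "N": 5, "O": 10}
for formula, mass in candidate_formulas(180.0634, 0.002, limits):
    print(formula, mass)
```

The example output includes (6, 12, 0, 6), i.e., C6H12O6. Because the window is much narrower than the mass of one hydrogen atom, at most one hydrogen count can qualify for each combination of heavy atoms, which is why a single rounded value is tested.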

Step 3: Partitions of the integral molecular weight are found. In mathematics, a partition of an integer is a set of positive integers that sum to that integer. For each partition, every combination of its integers is summed, and the partitions whose sums best account for the fragment masses are selected. Fragment masses are then “assigned” as sums of different combinations of the individual integers. The individual integers can be viewed as the integral masses of subfragments; assigned fragments are then sums of subfragments.
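
The idea behind Step 3 can be sketched in a few lines. This is a toy version, not the published algorithm: it enumerates every partition into a fixed number of parts, which is only practical for small integers, and it selects partitions purely by the count of fragment masses explained. The names `partitions`, `assign`, and `best_partitions` are hypothetical.

```python
from itertools import combinations

def partitions(total, parts, smallest=1):
    """Yield non-decreasing tuples of `parts` positive integers that sum to `total`."""
    if parts == 1:
        if total >= smallest:
            yield (total,)
        return
    for first in range(smallest, total // parts + 1):
        for rest in partitions(total - first, parts - 1, first):
            yield (first,) + rest

def assign(partition, fragment_masses):
    """Map each fragment mass to the index combinations of subfragments that sum to it."""
    assignments = {m: [] for m in fragment_masses}
    for size in range(1, len(partition)):
        for combo in combinations(range(len(partition)), size):
            total = sum(partition[i] for i in combo)
            if total in assignments:
                assignments[total].append(combo)
    return assignments

def best_partitions(total, parts, fragment_masses):
    """Return the partitions that account for the largest number of fragment masses."""
    scored = []
    for p in partitions(total, parts):
        explained = sum(1 for combos in assign(p, fragment_masses).values() if combos)
        scored.append((explained, p))
    top = max(score for score, _ in scored)
    return [p for score, p in scored if score == top]
```

For example, `best_partitions(6, 3, [3, 5])` returns `[(1, 2, 3)]`: of the three ways to split 6 into three parts, only 1+2+3 can produce both 3 (as 3, or as 1+2) and 5 (as 2+3).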

Step 4: The assignments must be logically consistent with any MSn or collision-induced dissociation (CID)-MS-MS data that are available. For example, if a compound is composed of five subfragments, A+B+C+D+E, and the 220 neutralized fragment ion is assigned as A+D+E, fragmentation of the 220 ion could not yield any fragments that have subfragments B or C present. Inconsistent sets of assignments are eliminated.
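
The consistency rule in Step 4 amounts to a subset test. In this small sketch, an assignment is written as a string of subfragment labels, which is an illustrative convention rather than anything prescribed by the method; the function name `consistent` is hypothetical.

```python
def consistent(parent_assignment, product_assignments):
    """True if every product ion uses only subfragments present in its precursor."""
    parent = set(parent_assignment)
    return all(set(product) <= parent for product in product_assignments)

# The 220 ion is assigned as A+D+E, so its product ions may contain only A, D, or E.
print(consistent("ADE", ["AD", "E"]))   # True
print(consistent("ADE", ["AB"]))        # False: B is not part of the 220 ion
```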

Step 5: The fragments have been assigned as integral sums of various combinations of subfragments. The mass defects of the subfragments that compose any particular fragment must also sum up to the mass defect of that fragment. Since the mass defects of the fragments are known, the mass defects of the subfragments can be calculated by solving a set of simultaneous linear equations.
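
One way to solve the simultaneous equations in Step 5 is by least squares, sketched below in plain Python. The least-squares formulation (via the normal equations) is an assumption made here so that more fragments than subfragments can be handled; the text only states that a set of simultaneous linear equations is solved. The function name `solve_defects` is hypothetical.

```python
def solve_defects(assignments, fragment_defects):
    """Least-squares subfragment mass defects.

    assignments: one 0/1 row per fragment, marking which subfragments it contains.
    fragment_defects: measured mass defect of each fragment
                      (accurate mass minus integral mass).
    """
    n = len(assignments[0])
    # Normal equations (A^T A) x = A^T b, stored as an augmented matrix.
    aug = [[sum(row[i] * row[j] for row in assignments) for j in range(n)]
           + [sum(row[i] * d for row, d in zip(assignments, fragment_defects))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("assignments do not determine every subfragment")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Three subfragments A, B, C; fragments assigned as A+B, B+C, and A+B+C.
rows = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
defects = [0.0474, 0.0161, 0.0580]
print([round(x, 4) for x in solve_defects(rows, defects)])
```

With these made-up numbers, the solution is approximately 0.0419, 0.0055, and 0.0106 for A, B, and C. A subfragment that never appears separately from another one cannot be resolved, which the sketch reports as an error.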

Step 6: Possible formulas for each of the subfragments are then generated. Because the masses of all of the subfragments are considerably smaller than the mass of the whole compound, far fewer formulas are consistent with these masses. In addition, the subfragment compositions must sum up to an overall composition consistent with the mass and isotope ratio of the whole compound. This will rule out many of the possible elemental formulas for the whole compound. For example, if a 46 subfragment has a mass defect consistent only with ethanol (and not, for example, with formic acid), then C2H6O1 can be subtracted from the atoms available to the other subfragments; this further restricts their compositions. This concept was first applied as the “basket-in-a-basket” approach by physically breaking a compound into five subfragments using MS5 on a Fourier transform-ion cyclotron resonance (FT-ICR) instrument.3 Here the “basket-in-a-basket” concept is applied mathematically, and MS5 data are not needed. The “basket-in-a-basket” concept has since been extended into the “top-down/bottom-up” approach.4
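
The ethanol example in Step 6 can be reproduced with a short sketch. It assumes a C, H, O element set and arbitrary upper bounds on atom counts; the function names and the whole-compound composition in the usage lines are hypothetical and chosen only for illustration.

```python
# Monoisotopic masses of the most abundant isotopes (u).
MASS = {"C": 12.0, "H": 1.007825, "O": 15.994915}

def subfragment_formulas(nominal, defect, tolerance, max_c=4, max_o=3, max_h=12):
    """Return (C, H, O) counts with the given integral mass and a matching mass defect."""
    hits = []
    for c in range(max_c + 1):
        for o in range(max_o + 1):
            h = nominal - 12 * c - 16 * o  # hydrogens make up the remaining integral mass
            if 0 <= h <= max_h:
                exact = c * MASS["C"] + h * MASS["H"] + o * MASS["O"]
                if abs((exact - nominal) - defect) <= tolerance:
                    hits.append((c, h, o))
    return hits

def remaining_atoms(whole, used):
    """Subtract a subfragment's atoms from those available to the other subfragments."""
    return tuple(w - u for w, u in zip(whole, used))

# A 46 subfragment with a mass defect of about 0.0419 matches C2H6O only.
print(subfragment_formulas(46, 0.0419, 0.003))   # [(2, 6, 1)]
# A defect near 0.0055 would instead point to CH2O2.
print(subfragment_formulas(46, 0.0055, 0.003))   # [(1, 2, 2)]
# Hypothetical whole-compound composition (C, H, O) = (10, 14, 3).
print(remaining_atoms((10, 14, 3), (2, 6, 1)))   # (8, 8, 2)
```

The two candidate compositions at integral mass 46 differ in mass defect by about 0.036, so even a modest mass accuracy separates them.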

Step 7: Using logic tables, applying some chemical–spatial rules, and assuming that no rearrangements (excluding hydrogen atoms) have taken place, the subfragments are arranged spatially into modular structures.

Step 8: The modular structures are scored and sorted based on the number of fragment ions that each assigns, also taking into account the relative intensities of those fragment ions.
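
The text does not give the scoring formula used in Step 8, so the sketch below uses an assumed one: candidates are ordered first by the number of fragment ions assigned and then by the summed relative intensity of those ions. The names `score` and `rank` are hypothetical.

```python
def score(assigned_masses, intensities):
    """Return (number of assigned ions, summed relative intensity of assigned ions)."""
    assigned = [m for m in assigned_masses if m in intensities]
    return (len(assigned), sum(intensities[m] for m in assigned))

def rank(candidates, intensities):
    """Sort (name, assigned_masses) pairs from best to worst score."""
    return sorted(candidates, key=lambda c: score(c[1], intensities), reverse=True)

# Relative intensities of three neutralized fragment ions (made-up values).
spectrum = {220: 100, 150: 40, 91: 10}
candidates = [("X", [220]), ("Y", [150, 91]), ("Z", [220, 150])]
print([name for name, _ in rank(candidates, spectrum)])   # ['Z', 'Y', 'X']
```

A weighting that lets one intense ion outweigh several weak ones would be an equally plausible reading of the description.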

De novo identification of a novel compound

With limited background information, it is extremely difficult to identify a novel compound from mass spectral data. However, combined with NMR data, the complete molecular structure can often be derived. NMR is very useful for determining which atom is connected to which atom, but sometimes there are gaps (substructures with no hydrogens or carbons) in a compound. In a sense, mass spectrometry shows the clumps of trees in the whole forest, whereas NMR shows exactly how the trees are arranged in each clump.

In the case of de novo identification, the 10 modular structures best accounting for the mass spectral data are saved. These modular structures give a rough idea of the overall structure of the compound. Some modular structures will fit the data very well, but may not correspond well to the actual molecular structure. Although the modular structures are ranked, there is no way of knowing a priori which ones match the structure of the compound that produced the spectral data and which ones do not.

Identification using the “template” approach

In the pharmaceutical industry, unknown compounds are usually closely related to a lead compound: degradation products, impurities, or metabolites. Traditionally, the mass spectral data of that lead compound are used to work out the fragmentation pathways, and the unknown compounds are then identified based on the changes in the masses of various fragments. This approach works well, but it can be very time-consuming.

Systematic bond-disconnection has been used to assign accurate-mass fragments to known compounds.5,6 A similar approach can also be used to assign subfragments of modular structures to specific molecular substructures of a lead compound. The heavy atom distribution of modular structures, derived from the mass spectral data, is compared to the heavy atom distribution of the molecular structure to find matches. Heavy atoms are atoms of elements other than hydrogen. Only the modular structures that correlate with the molecular structure are saved, and a monochrome molecular structure can then be color-coded with the same color scheme as the modular structures. This makes the fragmentation easy to visualize. An example is xemilofiban in Figure 1.

Figure 2 - Template approach: Modular structures matching a template compound (left) are compared to modular structures of a related unknown compound (right). Clicking on the magenta-colored squares yields the formulas of the subfragments. Here, the magenta-colored subfragment has clearly added the atoms H2O.

By using the modular structures that match the lead compound as templates, related unknown compounds can now be identified by comparing modular structures to modular structures. The modular structures of the unknown compound that best match the templates are saved and linked to the template modular structure that they most closely match. This process is illustrated in Figure 2. For correlating related compounds to a lead compound of known structure (the template approach7), subfragments are clearly the simplest units of comparison.

Identification by matching compounds in a molecular structure database

The basic approach used to assign subfragments and fragments to a single template compound (systematic bond-disconnection followed by comparison of the heavy atom distributions) can also be applied to searching molecular structure databases. Traditional spectral libraries are not needed. A set of modular structures is derived from the mass spectral data, and then this set of modular structures is compared to all molecular structures in the database that have a similar mass (Figure 3). Molecular structures that match modular structures are then ranked according to how many modular structures are matched and the scores of the matching modular structures. The overall objective is to draw a rough picture of molecules that could yield a particular set of numbers, and then to search through an index of the MDL® Available Chemicals Directory to find matching compounds. Using the Available Chemicals Directory,8 matching compounds can then be viewed and conveniently ordered from suppliers.

Most molecular structure databases list a large number of compounds as salts (e.g., alkali salts, hydrochlorides) and hydrates; these salts are often the most stable solid form. However, in LC-MS work, only the organic moiety of the compound is observed. To improve results, most of the small organic compounds in the Available Chemicals Directory have been indexed by the exact mass at which they would be observed using electrospray ionization (ESI) LC-MS (e.g., D-(+)-2-phosphoglyceric acid sodium hydrate, MFCD00150613, has been indexed with a molecular weight of 185.993, excluding the water of crystallization and replacing the two sodiums with hydrogens). Quaternary amines are indexed at the mass of the quaternary portion minus the mass of a proton; thus, quaternary compounds such as acetylcholine are also found.
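
The indexing rules described above can be checked with a short calculation. This is a sketch only: it handles sodium as the sole counter-ion, and the function names are hypothetical. The hydrate count of one in the usage example is an assumption, since the text does not state it; it does not affect the result because the water is removed.

```python
# Monoisotopic masses of the most abundant isotopes (u).
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915,
        "P": 30.973762, "Na": 22.989770}

def monoisotopic(formula):
    """Sum of monoisotopic atomic masses for a {element: count} formula."""
    return sum(MASS[element] * count for element, count in formula.items())

def index_mass(formula, waters=0, quaternary=False):
    """Mass at which an entry would be indexed under the rules described above."""
    f = dict(formula)
    # Exclude water of crystallization.
    f["H"] = f.get("H", 0) - 2 * waters
    f["O"] = f.get("O", 0) - waters
    # Replace sodium counter-ions with hydrogens.
    f["H"] += f.pop("Na", 0)
    if quaternary:
        # Cation mass minus a proton equals the atom sum minus one hydrogen atom
        # (one electron for the positive charge plus one proton).
        f["H"] -= 1
    return monoisotopic(f)

# Disodium 2-phosphoglycerate with one assumed water: C3H5Na2O7P + H2O.
salt = {"C": 3, "H": 7, "Na": 2, "O": 8, "P": 1}
print(round(index_mass(salt, waters=1), 3))            # 185.993
# Acetylcholine cation, C7H16NO2+.
acetylcholine = {"C": 7, "H": 16, "N": 1, "O": 2}
print(round(index_mass(acetylcholine, quaternary=True), 4))   # 145.1103
```

The first result reproduces the 185.993 quoted for MFCD00150613. Indexing the quaternary cation one proton lower means that neutralizing its observed positive ion in the usual way lands on the indexed value.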

Figure 3 - Databases can be searched by comparing the set of modular structures (left) derived from the mass spectral data to molecular structures (right) in the database.

Conclusion

Recent advances in accurate-mass mass spectrometry are making practicable some novel approaches for identifying small molecules in complex samples.

References

  1. Sweeney, D.L. Small molecules as mathematical partitions. Anal. Chem. 2003, 75(20), 5362–73.
  2. Kind, T.; Fiehn, O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics 2007, 8, 105.
  3. Wu, Q. Multistage accurate mass spectrometry: a “basket in a basket” approach for structure elucidation and its application to a compound from combinatorial synthesis. Anal. Chem. 1998, 70, 865–72.
  4. McDonald, L.A.; Barbieri, L.R.; Carter, G.T.; Kruppa, G.; Feng, X.; Lotvin, J.A.; Siegel, M.M. FTMS structure elucidation of natural products: application to muraymycin antibiotics using ESI multi-CHEF SORICID FTMSn, the top-down/bottom-up approach, and HPLC ESI capillary-skimmer CID FTMS. Anal. Chem. 2003, 75(11), 2730–9.
  5. Watson, I.A.; Mahoui, A.; Duckworth, D.C.; Peake, D.A. A strategy for structure confirmation of drug molecules via automated matching of structures with exact mass MS/MS spectra. Proceedings of the 53rd ASMS Conference on Mass Spectrometry, June 5–9, 2005, San Antonio, TX.
  6. Hill, A.; Mortishire-Smith, R. Automated assignment of high-resolution collisionally activated dissociation mass spectra using a systematic bond disconnection approach. Rapid Commun. Mass Spectrom. 2005, 19, 3111–18.
  7. Rourick, R.A.; Volk, K.J.; Klohr, S.E.; Spears, T.; Kerns, E.H.; Lee, M.S. Predictive strategy for the rapid structure elucidation of drug degradants. J. Pharm. Biomed. Anal. 1996, 14, 1743–52.

Dr. Sweeney is President, MathSpec, Inc., 1314 North Highland Ave., Arlington Heights, IL 60004, U.S.A.; tel.: 847-840-4994.