About 25 years ago the former MDL introduced substance groups (Sgroups) into the world of chemical representation, mainly to address the needs of industrial chemistry, where chemical structures are not as well defined as in the small-molecule world of the pharmaceutical industry.
The main Sgroup components are
- pairs of square brackets [ ] to mark an entire molecule or any suitable collection of atoms and bonds within a molecule,
- Sgroup data as text strings or numbers that can be linked with any collection of atoms, bonds, brackets, fragments or the entire molecule,
- the “wildcard” * atom that matches any other atom in a search (hence “wildcard”) but has no meaning of its own unless Sgroup data supply additional information.
With these 3 elements MDL started to handle a variety of different structural entities including
- Multiple Groups
- Chemical representations for mixtures and formulations
- Chemical structures that cannot be described by a full structure representation
- Special cases in stereochemistry
- Statistically distributed structural elements
The search for Sgroup data is based on standard Oracle string and number search operators like “=”, “<=”, “>”, “like” and others with “%” as wildcard.
In substructure searches (SSS), Sgroup data act as an additional query element. A query structure without Sgroup data finds the structure in compounds with and without Sgroup data, while an SSS query with one or more Sgroup data elements returns only those structures that fulfill both the structural query conditions and the text or number query for the Sgroup data.
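The way Sgroup data act as extra query conditions can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual database implementation: the helper names `like` and `sgroup_conditions_met`, the field name `GRADE` and the two records are hypothetical, the structural match itself is taken as given, and only the “%” wildcard of the “like” operator is modelled.

```python
import re

def like(value: str, pattern: str) -> bool:
    """SQL-style LIKE restricted to '%', which matches any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value) is not None

# A few of the operators named in the text.
OPERATORS = {
    "=": lambda value, ref: value == ref,
    "<=": lambda value, ref: value <= ref,
    ">": lambda value, ref: value > ref,
    "like": like,
}

def sgroup_conditions_met(record_data: dict, conditions: list) -> bool:
    """True if every (field, operator, reference) condition holds.

    An empty condition list always passes, so a query drawn without
    Sgroup data hits records with and without Sgroup data.
    """
    for field, op, ref in conditions:
        if field not in record_data:
            return False
        if not OPERATORS[op](record_data[field], ref):
            return False
    return True

# Two hypothetical records that both contain the queried substructure.
with_data = {"GRADE": "technical"}
without_data = {}

text_condition = [("GRADE", "like", "tech%")]

print(sgroup_conditions_met(with_data, []))                 # True
print(sgroup_conditions_met(without_data, []))              # True
print(sgroup_conditions_met(with_data, text_condition))     # True
print(sgroup_conditions_met(without_data, text_condition))  # False
```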
To keep the results of your structure searches and the structure duplicate check consistent, you must ensure that the collection of atoms, bonds etc. is consistently defined in your database. An example with more details can be found in “Special cases in stereochemistry”.
Abbreviations (shortcuts, residues)
Structural abbreviations (synonyms: shortcuts or residues) such as NO2 for the nitro group represent the most common use of Sgroups in chemical representation. In the following example the blue section on the right is the Sgroup that is used to define the nitro group shortcut.
Note that the chemist drawing the abbreviation is responsible for ensuring that the name corresponds to the underlying structure.
Abbreviations are used very heavily in the context of biologics to keep structures readable, as in the following peptide sequence. In these cases some of the abbreviation definitions overlap each other to ensure the functionality of peptide builders and similar tools.
Multiple groups are used to simplify the depiction of chemical structures that contain the same collection of atoms and bonds multiple times, like the chloride in the following example of calcium chloride:
If you look into the details of [Cl-]2 you find that two Cl- atoms overlap exactly so that only one is perceived, while the index of the brackets reflects the number of overlaid levels. Therefore the left and right representations of calcium chloride are identical, except that the Sgroup brackets simplify the depiction. Analogously, 100 polyethylene glycol units are overlaid in the following example:
Accordingly, the formula C200H402O101 and a molecular weight of 4423 Dalton are obtained for this structure.
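The formula and molecular weight quoted above can be checked with a short calculation. The sketch below assumes that the chain carries H and OH end groups (which is what the formula C200H402O101 implies) and uses rounded standard atomic weights; the function names are chosen for illustration only.

```python
# Rounded standard atomic weights in Dalton.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "O": 15.999}

def peg_composition(n_units: int) -> dict:
    """Atom counts of H-(O-CH2-CH2)n-OH, i.e. H2O plus n times C2H4O."""
    return {"C": 2 * n_units, "H": 4 * n_units + 2, "O": n_units + 1}

def formula(counts: dict) -> str:
    """Formula string in the order C, H, O."""
    return "".join(f"{element}{counts[element]}" for element in ("C", "H", "O"))

def mol_weight(counts: dict) -> float:
    """Sum of atomic weights over all atoms."""
    return sum(ATOMIC_WEIGHT[element] * n for element, n in counts.items())

peg_100 = peg_composition(100)
print(formula(peg_100))            # C200H402O101
print(round(mol_weight(peg_100)))  # 4423
```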
Polyethylene glycol (PEG) is a polymer with a characteristic structural repeating unit (SRU) that can be represented by
The square brackets mark the structural repeating part [-O-CH2-CH2-], and the index n turns the representation into an SRU.
The formula is calculated as H2O(C2H4O)n, while the molecular weight is not defined because no information is given on how many repeating units are represented in this case.
Quite frequently polymers do not have a fixed molecular weight (as in the PEG example under multiple groups) but a statistical distribution of units that may be described by the average molecular weight or by a range with lower and upper molecular weight limits. By introducing the Sgroup data fields Average_MW (for the average molecular weight), Lower_MW (for the lower MW limit) and Upper_MW (for the upper MW limit) you may use the following representations:
In the left example the Sgroup data field Average_MW is attached to the brackets with a number value of 4400. In the right example the Sgroup data field Lower_MW is filled with 3900 (and manually moved to the left bracket) while the value 4900 is stored in the Sgroup data field Upper_MW positioned underneath the right bracket.
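To get a feeling for what these molecular weight values mean in terms of chain length, the number of repeating units can be estimated from the formula H2O(C2H4O)n. The following sketch is only an illustration: it assumes H and OH end groups and rounded atomic weights, and the function name `repeat_units` is not part of any toolkit.

```python
UNIT_MW = 44.053        # one C2H4O repeating unit, Dalton
END_GROUPS_MW = 18.015  # H and OH end groups together (H2O), Dalton

def repeat_units(mol_weight: float) -> float:
    """Estimated number n of repeating units in H2O(C2H4O)n."""
    return (mol_weight - END_GROUPS_MW) / UNIT_MW

for field, value in [("Average_MW", 4400), ("Lower_MW", 3900), ("Upper_MW", 4900)]:
    print(f"{field} = {value}: about {repeat_units(value):.1f} repeating units")
```

Under these assumptions an average molecular weight of 4400 corresponds to roughly 99.5 repeating units.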
If the parameter for Sgroup data is enabled for exact searches in BioVia (Accelrys) Direct, only those structures are returned that match exactly in both the chemical structure and the molecular weight values. If the Sgroup search key is omitted, an exact-match search with the PEG drawn as SRU returns all entries with PEG (written as SRU), independent of the molecular weight values.
For substructure searches (SSS) with the PEG SRU and no additional values as query, you get all SRUs of PEG that are in the database. If you add the molecular weight as an additional search condition, the hit list is drastically reduced, as demonstrated in the following search examples:
The first query finds all PEG structures in SRU format. The second one reduces the hit list to those database entries that provide an average molecular weight between 4200 and 4400 Dalton. The third query cannot find our original entry because it asks for average molecular weights > 5000, while our example entry “only” has an average molecular weight of 4400.
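The three queries can be mimicked with a small Python model. It is a sketch under assumptions, not the behaviour of any specific cartridge: the `Entry` class and the `sss` function are invented for illustration, the structural match is reduced to a boolean flag, two of the three database entries are hypothetical, and the bounds of the second query are taken as inclusive.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Entry:
    name: str
    is_peg_sru: bool             # stands in for the structural match
    average_mw: Optional[float]  # Sgroup data field Average_MW, if present

def sss(entries, mw_condition: Optional[Callable[[float], bool]] = None):
    """Return names of entries matching the structure and, if given, the MW condition."""
    hits = []
    for entry in entries:
        if not entry.is_peg_sru:
            continue
        if mw_condition is not None:
            if entry.average_mw is None or not mw_condition(entry.average_mw):
                continue
        hits.append(entry.name)
    return hits

database = [
    Entry("PEG, Average_MW 4400", True, 4400),  # the example entry
    Entry("PEG, no MW data", True, None),       # hypothetical entry
    Entry("PEG, Average_MW 6000", True, 6000),  # hypothetical entry
]

query_1 = sss(database)                                 # structure only
query_2 = sss(database, lambda mw: 4200 <= mw <= 4400)  # MW between 4200 and 4400
query_3 = sss(database, lambda mw: mw > 5000)           # MW greater than 5000

print(query_1)
print(query_2)
print(query_3)
```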
In the second case (PEG with a molecular weight range) the lower and upper molecular weight fields are searched independently of each other, together with the structure, so that you may run searches with the SRU of PEG and one or both fields as query conditions, as in the following example:
In the first search example the queried lower and upper molecular weights fit into the values of our PEG example. The second query only operates on the lower molecular weight field of the PEG structure so that the 3900 of our example structure is identified while the Upper_MW is not searched for. The last example cannot find our PEG because it requests that the upper molecular weight is greater than 5200 while our example has an upper limit of 4900.
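The independence of the two fields can be illustrated as follows. In this sketch each query is a list of conditions and every condition is evaluated only against its own field. Apart from the stored values 3900 and 4900 and the threshold 5200 of the last query, the query values are hypothetical, and `matches` is an illustrative helper rather than a real API.

```python
import operator

# Sgroup data of the example PEG with a molecular weight range.
example_peg = {"Lower_MW": 3900, "Upper_MW": 4900}

def matches(sgroup_data: dict, conditions: list) -> bool:
    """Each (field, comparison, value) condition is checked on its own field."""
    return all(
        field in sgroup_data and compare(sgroup_data[field], value)
        for field, compare, value in conditions
    )

# Hypothetical query values, except for the 5200 in the last query.
both_fields = [("Lower_MW", operator.ge, 3500), ("Upper_MW", operator.le, 5000)]
lower_only = [("Lower_MW", operator.ge, 3500)]
upper_too_high = [("Upper_MW", operator.gt, 5200)]

print(matches(example_peg, both_fields))     # True
print(matches(example_peg, lower_only))      # True, Upper_MW is not consulted
print(matches(example_peg, upper_too_high))  # False, 4900 is not greater than 5200
```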
Although our two example representations of PEG (multiple group on the left, structural repeating unit on the right with an average molecular weight of 4400) essentially describe the same polymer, they do not find each other, because the types of Sgroup elements being used are not identical. The maximum common substructure is ethylene glycol.
Besides structural repeating units (SRU), the molfile format knows additional Sgroup types for handling all kinds of polymers, such as block (blk), alternating (alt), or random (ran) copolymers. In the following example, 80% and 20% reflect the relative composition of a copolymer of polyethylene and polystyrene with “either unknown” repeating units (eu = either unknown, ht = head to tail, hh = head to head):
Chemical representations for mixtures and formulations
While most mixtures and formulations of chemical compounds are handled directly by relational database systems, the molfile format offers a notation for chemical structures that allows you to keep structure searches consistent and to keep structure databases consistent with the alphanumeric table that handles most of the information.
Technical xylene, for example, is normally not split up into ortho-, meta-, and para-xylene. The mixture can be drawn as
Sgroup data may be added to handle information about the composition of the mixture.
In this context formulations are defined as ordered mixtures. The following example is developed from a shampoo formulation and uses multiple Sgroups and PEG as described above in this article:
In the formulation above the substances magnesium lauryl sulfate, cocamidopropyl betaine, Polysorbate 20, PEG 600, citric acid and water are added stepwise, where c1, c2, … describe the order of the components within the formulation. The blue numbers and texts describe the composition of each component, while the numbers in black represent the molecular weight of the component. For more information on the component c3 see “Statistically distributed structural collections”.
Compounds that cannot be described by structural representation and no-structures
Typical examples in this category are natural products (for example plant extracts with unknown chemical structures) or biologics like antibodies that are too “big” or structurally unknown to be fully represented in a structure database. For these cases the *-atom has been introduced as a “wildcard” atom. It does not have any chemical meaning unless Sgroup data are used to specify it. Let’s assume the antibody is called SAB1400491 and that a peptide sequence forms a cysteine bridge to the antibody. A structural representation may look like the following depiction:
In the case of compounds with unresolved structure (like many natural products), the *-atom is frequently used as a “placeholder” for the compound in the structure database, which keeps the structural part consistent with the other relational data tables. In the following example it is assumed that the ID / primary key of the compound is “123456”. You may then use the following drawing as the database structure:
Alternatively you may use
because the structure is not known. If a duplicate check is active on the structure database, it can hold only one “no structure” entry. That may be compensated for by the data model. Alternatively, you may keep the relational data table and the structure table consistent by using the *-atom plus an ID Sgroup data field as in the example above, so that every “no structure” gets its own ID that makes it unique in the structure set.
*-atoms are also found as start or end groups of polymers:
This reflects the fact that the start and end groups of polymers are unknown in most cases.
Special cases in stereochemistry
Helicenes belong to a set of structures whose stereochemistry cannot be described by V2 or V3 molfiles using up or down bonds to indicate the stereochemistry (for example by the enhanced stereochemistry of BioVia (Accelrys)). Therefore the P(lus) and M(inus) forms cannot be distinguished by using molfiles (or other ASCII-based representations). Workaround: to make a duplicate check work and to let the two forms be separated in searches, you may define an Sgroup data text field, select the entire molecule on the left and use “P” (= Plus) for the right-handed helix and “M” (= Minus) for the left-handed one.
By using “P” and “M”, heptahelicene can be added to the structure database twice even if the duplicate check on the database is invoked. If the structure is registered without “P” or “M”, it means that the stereochemistry is unknown and/or that the compound is the racemate. Note that you register a different compound as soon as you use a vocabulary other than P and M, or if you define the Sgroup data text field not over the entire molecule but over only a few atoms and/or bonds. For another example with another dictionary see Automated Structure Modifications and Normalizations.
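Why vocabulary and scope matter for the duplicate check can be made explicit with a toy registration key. The sketch below is an assumption about how such a check behaves in principle, not a description of a particular product: the function `registration_key`, the field name `HELICITY` and the atom numbering are hypothetical, and the drawn structure is replaced by a placeholder string.

```python
def registration_key(structure: str, sgroup_data=()):
    """Duplicate-check key: the structure plus all attached Sgroup data.

    Each Sgroup data item is (field, value, atoms), where atoms holds the
    indices of the atoms the data field is attached to.
    """
    return (
        structure,
        frozenset((field, value, frozenset(atoms)) for field, value, atoms in sgroup_data),
    )

HELICENE = "heptahelicene"  # placeholder for the drawn structure
ALL_ATOMS = range(30)       # the 30 carbon atoms of heptahelicene (C30H18)

p_form = registration_key(HELICENE, [("HELICITY", "P", ALL_ATOMS)])
m_form = registration_key(HELICENE, [("HELICITY", "M", ALL_ATOMS)])
unspecified = registration_key(HELICENE)  # unknown stereochemistry or racemate
other_vocabulary = registration_key(HELICENE, [("HELICITY", "plus", ALL_ATOMS)])
partial_scope = registration_key(HELICENE, [("HELICITY", "P", range(5))])

# All five keys differ, so each would be registered as a separate compound.
print(len({p_form, m_form, unspecified, other_vocabulary, partial_scope}))  # 5
```

The last two keys show the pitfall named in the text: “plus” instead of “P”, or a data field attached to only some atoms, silently creates a new compound.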
Statistically distributed structural collections
The standard structure database assumes that the chemical structure is fully resolved. That is true for most small molecules in the pharmaceutical industry. In the cosmetic industry, however, statistical distributions of molecule segments like PEG are quite frequent. The formulation example above contains Polysorbate 20 and PEG 600. While PEG 600 is already described in the “Polymer” section above, Polysorbate 20 is defined by the condition w + x + y + z = 16. This means that the polymer brackets identify 16 PEG units, but the number per side chain is not given; instead, the distribution over the 4 potential locations is constrained only by the total number of units.
The Sgroup data element “16” in the drawing above represents the rule w + x + y + z = 16. Because Sgroup data are searchable, other polysorbates differing in the total number of units may be registered in a database as well.
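A short enumeration shows why a single fully resolved drawing cannot stand for this compound. The sketch assumes that each of the four side chains may carry any number of units from 0 to 16; if every chain had to carry at least one unit, the count would be smaller.

```python
from itertools import product

TOTAL_UNITS = 16  # Sgroup data value from the drawing
CHAINS = 4        # side chains w, x, y, z

# All ordered ways to distribute the units over the four chains.
distributions = [
    combo
    for combo in product(range(TOTAL_UNITS + 1), repeat=CHAINS)
    if sum(combo) == TOTAL_UNITS
]

print(len(distributions))  # 969
```

Storing the rule as one Sgroup data value avoids having to register each of these distributions as a separate structure.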
General remark: if you have to work with statistical distributions in chemical structures, you have to ensure unique drawing rules and support the drawing with structure templates that can be easily modified by the personnel involved. If in doubt, let an administrator decide which drawing is correct.