Help Needed: How Do We Use CML Properly?

November 25th, 2006

Bear with us for this post as we send out the bat signal to attract the chemical informatics crowd. I’m talking about the Peter Murray-Rusts of the world, who draw their superpowers from the glow of fluorescent lighting as they sit and hammer out code for hours on end.

On their blogs, Joerg Wegner and Murray-Rust took the rest of the chemistry blogosphere to task for not including more minable data, using standards such as Chemical Markup Language (CML). CML is used to include metadata on chemical structures and compounds that people post on the Internet. While it’s essentially invisible to human readers, search engines can use the extra data to sift through content on the Internet more efficiently for information that is specific to user-defined queries about chemical structures or substructures.

Aside from that gibberish in the last paragraph, we know nothing about the subject of informatics and making chemically-minable data. Since we’re basically starting this blog from scratch, we want to try to get all the informatics stuff right from the beginning. I notice that using the latest version of ChemDraw (Ultra 9), you can save structures as CML files (in addition to the “usual” .cdx). Where is the proper place to put this CML code? In the image file? Can it go anywhere in the blog post?

We also have the following questions:

1. Is CML the “standard” way of including structural information in electronic form? For instance, how does SciFinder or Beilstein store this information? If they don’t use CML, why don’t we just do it the way these programs do it, since they are the main tools already used by chemists?

2. How do you recommend tagging images of chemical structures? Is a name and CAS number good? Is this unnecessary if the CML data is in there?

3. Is there anything else we’re missing?


  1. Jean-Claude Bradley Says:
    November 25th, 2006 at 4:48 am A very simple thing that you can do is put InChIs in your posts for the molecules you mention. Free programs like ChemSketch let you export in InChI format. This is nice because Google indexes InChIs very well and there is very little likelihood of false hits. For example, see this Google search for DOPAL. Each molecule has a unique InChI, but few databases currently use it. It is much more common to find SMILES, which you can also export from ChemSketch and similar programs. Unfortunately there is more than one SMILES per molecule, which makes it difficult to ensure proper indexing. If you are searching for commercial sources, searching Google with the CAS number is most useful because company catalogs tend to use that.
    You are also welcome to use our UsefulChem molecules blog (unless you have massive numbers of molecules). All you have to do is create a post with a pic and the text “SMILES:” then put a SMILES of your molecule. Every day a script reads that blog and converts the SMILES to InChI, queries emolecules, calculates the MW, creates a JMOL view and a few other things. It also generates CML so that the molecules blog feed can be tracked by CMLRSS readers. The resulting files can be accessed from the “molecules automated info” button at the top of the molecules blog. Just let me know if you want an account. I am sure Peter and Egon can offer additional info that will be helpful.
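The first step of a script like the one described above could be sketched as follows. This is only an illustration, not the UsefulChem script itself: the function name `extract_smiles` and the exact marker handling are assumptions, and the later steps (conversion to InChI, molecular weight, CML output) need a cheminformatics toolkit and are not shown.

```python
import re

# Assumed marker format: the literal text "SMILES:" followed by one
# whitespace-delimited SMILES string, as the comment above describes.
SMILES_PATTERN = re.compile(r"SMILES:\s*(\S+)")


def extract_smiles(post_text):
    """Return every string that follows a 'SMILES:' marker in a post."""
    return SMILES_PATTERN.findall(post_text)


post = "Today's molecule is ethanol. SMILES: CCO"
print(extract_smiles(post))
```

Note that `CCO` and `OCC` are both valid SMILES for ethanol, so comparing the extracted strings directly will not reliably match molecules. That is the indexing problem mentioned above, and the reason for converting to InChI before searching.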
  3. Peter Murray-Rust Says:
    November 25th, 2006 at 6:09 am Thanks very much Paul. I have blogged your post on the CML blog.

    Paul: Where is the proper place to put this CML code? In the image file? Can it go anywhere in the blog post?

    In principle CML – like other XML languages such as MathML – can be mixed with XHTML. Unfortunately blogs don’t do XHTML very well yet. It may take a year or so for the world to sort this out. So don’t try to put CML in the blog. That’s a great pity, but it shouldn’t last for long.

    1. Is CML the “standard” way of including structural information in electronic form? For instance, how does SciFinder or Beilstein store this information? If they don’t use CML, why don’t we just do it the way these programs do it, since they are the main tools already used by chemists?

    There is no “standard” way of including structural information in chemistry – there are about 10 main file types and many others in use. I’ll discuss the pluses and minuses of CML and the other formats in later posts on my blog. But for now:

    There is no single answer and no best “format”. Chemistry is a complex subject and contains many types of information – molecules, reactions, spectra, wavefunctions, crystals, etc. Traditionally each subdomain was separate – a spectrum could not contain a molecule and so on. So a whole range of formats grew up. In organic chemistry MDL Molfile, SMILES and CAS are the best known approaches. They all have many uses and many restrictions. Thus CAS only tells you what the substance is – it cannot carry a diagram. None can carry spectra, and so on.

    XML is an approach that allows people to define their own information representation. It is extensible in that you can add things to it without breaking it. (By contrast MOLfiles and SMILES are not extensible – you cannot carry both 2-D and 3-D structures at the same time – only one or the other).
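To make the extensibility point concrete, here is a minimal sketch that builds a small CML fragment in which each atom carries both 2-D and 3-D coordinates. The element and attribute names (`molecule`, `atomArray`, `atom`, `x2`, `y2`, `x3`, `y3`, `z3`, `bondArray`, `atomRefs2`) and the namespace are written from my understanding of the CML schema and have not been validated against it; the coordinates are illustrative values for water, not measured data.

```python
import xml.etree.ElementTree as ET

# CML namespace as I understand it; treat as an assumption.
CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("", CML_NS)


def cml(tag):
    """Qualify a tag name with the CML namespace."""
    return "{%s}%s" % (CML_NS, tag)


# (id, element, 2-D diagram coordinates, 3-D coordinates) - illustrative only.
ATOMS = [
    ("a1", "O", (0.0, 0.0), (0.0, 0.0, 0.117)),
    ("a2", "H", (-0.8, -0.6), (0.0, 0.757, -0.469)),
    ("a3", "H", (0.8, -0.6), (0.0, -0.757, -0.469)),
]
BONDS = [("a1", "a2"), ("a1", "a3")]


def build_water_cml():
    """Return a CML string whose atoms hold 2-D and 3-D coordinates together."""
    molecule = ET.Element(cml("molecule"), {"id": "m1", "title": "water"})
    atom_array = ET.SubElement(molecule, cml("atomArray"))
    for atom_id, element, (x2, y2), (x3, y3, z3) in ATOMS:
        ET.SubElement(atom_array, cml("atom"), {
            "id": atom_id, "elementType": element,
            "x2": str(x2), "y2": str(y2),
            "x3": str(x3), "y3": str(y3), "z3": str(z3),
        })
    bond_array = ET.SubElement(molecule, cml("bondArray"))
    for first, second in BONDS:
        ET.SubElement(bond_array, cml("bond"),
                      {"atomRefs2": "%s %s" % (first, second), "order": "1"})
    return ET.tostring(molecule, encoding="unicode")


cml_text = build_water_cml()
print(cml_text)
```

A reader that only understands the 2-D attributes can ignore `x3`, `y3` and `z3` without the file breaking, which is the sense in which the format can grow.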

    2. How do you recommend tagging images of chemical structures? Is a name and CAS number good? Is this unnecessary if the CML data is in there?

    I’m assuming you are referring to a blog. (If you have a database or a separate website it’s different). At present I think the best and simplest thing is to use the InChI. It works well for organic structures (less well for organometallic – though nothing else is better). The InChI is a unique handle for the molecule. That means that Google and other search engines can find it and index it. It is also possible to reconstruct the connection table from the InChI.
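As a rough illustration of why an InChI works as a handle, the sketch below splits one into its slash-separated layers. The helper `split_inchi` is hypothetical and deliberately simplified: it keeps only one value per layer letter, so identifiers that repeat a layer (for example in isotopic or fixed-hydrogen sections) would need more careful handling. The example string is the standard InChI for ethanol; the `1S` version prefix belongs to a later InChI release than the one available when this thread was written.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and single-letter layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError("InChI has no formula layer: %r" % inchi)
    layers = {"version": parts[0], "formula": parts[1]}
    for layer in parts[2:]:
        if layer:
            # The first character names the layer, e.g. 'c' for connectivity
            # and 'h' for hydrogens; the rest is the layer's content.
            layers[layer[0]] = layer[1:]
    return layers


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
print(split_inchi(ethanol))
```

The `c` layer holds the connectivity between the numbered heavy atoms and the `h` layer says where the hydrogens sit, which is the information a connection table is rebuilt from.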

    If you are brave you can put the InChI in full text at the bottom of your post. It may look a little strange to readers at first but I think that if all the blogosphere did it it would become accepted very soon. There are several ways of creating InChIs – we have an Open interactive website:…..erver.html
    It will accept MOLFiles and SMILES.

    If you want to save the coordinates of your diagram that is slightly harder. The InChI actually has a way of preserving coordinates although obviously it’s more verbose.

    These things become more tractable as we move from blogs to Wikis to web pages and databases. One idea that I have floated recently is to create a communal site for molecules – a little like Flickr or Facebook or Wikimedia. Common molecules could be found and re-used.

    CML starts to become valuable when you need:

    transmission without corruption
    machine interpretability of chemistry
    complex documents, such as lab books and journal articles
    output of programs (e.g. theochem)
    interaction with other semantic tools (Wikis, RSS, etc.)

    I’ll review these later, and add a comment here when I do

    It depends what you want to do. If you want to index organic structures, then I think the InChI is the best way to do it. The problem with CAS numbers is there is no way to look them up without subscribing to CAS. And also that if the molecule hasn’t been formally published it won’t be in CAS.

    3. Is there anything else we’re missing?

    I’ll cover these in my comments on your blog.


  4. Paul Says:
    November 25th, 2006 at 5:03 pm Thanks, everyone. We’ll take a look through the SMILES and InChI stuff to figure it out.
  5. Mitch Says:
    November 25th, 2006 at 5:52 pm You can play with SMILES code on my site. Go to:…..dchemicals, click draw structure in the chmoogle search box, draw a structure and then click on the SMILE face to see the SMILES code. Mitch :)
  6. MoTD Says:
    November 27th, 2006 at 9:27 pm For what it’s worth, I was spurred on by a post on PMR’s blog and just started using InChI (google “inchi converter”). Save as .mol and .gif, convert the .mol, paste the result as an alt tag. Better than nothing and I’m always open to other options. One thing that is a pain for reaction schemes is that the converter works on one-structure .mol files. Fortunately, the converter is perfect for a blog like mine, which has simple, one-small-molecule structures. As always, I’m open to anything that doesn’t take more than 90s from my day, so let me know what you end up going with. I suspect this will probably become useful well after we end up doing this; I’m not googling for InChI tags and I don’t think anybody (except guys like Peter :) ) is.
  7. Paul Says:
    November 28th, 2006 at 12:24 am Yeah…I was thinking about using an alt tag on the image, but then wasn’t sure if search engines register them outside of image searches. We’re considering doing as Peter suggests above, writing the InChI code at the bottom of the post, except changing the font color to white so that it’s invisible unless you go looking for it.
  8. Richard Says:
    November 28th, 2006 at 9:41 am Hi there, search engines do indeed index alt tags, but low importance is placed on them because they are open to abuse, i.e. you can stuff them full of irrelevant keywords etc. In general (puts on flame-proof trousers :) I think using an InChI in an alt tag to describe your molecule is a bad idea, at least from a web-standards point of view. The alt tag, IMHO, is meant to give people using a text-based browser (e.g. those with sight problems, search engines and people who have disabled images in their browser) a description of the image which they cannot see. So in a text-based browser, the following img tag…

    will just show up as


    which isn’t very helpful for your (partially sighted) reader.

    I realise the following is no use for reactions etc and doesn’t solve the debate about how to best use CML in your blog, but how about ‘tagging’ or better still directly linking your image to PubChem wherever possible? e.g for the example above…

    This gives your reader the opportunity to retrieve the InChi, Canonical Smiles, SDFile, IUPAC Name, calculated properties and tons of other data directly from pubchem should they wish to.

    It also saves you generating the InChi yourself.

    Because InChIs can get very long for large molecules, this approach also keeps your pages from becoming ‘bloated’.

    It is also far more ‘semantic’ from a web standards point of view than having white text on a white background, which is most definitely frowned upon by the xhtml / css community.

    Obviously you won’t find every single molecule on PubChem and not every entry has additional data but there are 15 million substances to search through which should cover the vast majority of cases for your blog.

    If you do use this I recommend linking to the Compound Identifier (which won’t ever change) if possible, rather than the Substance Identifier (which can be revoked / deprecated by the PubChem depositor).
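The suggestion above (a descriptive alt text, plus a link to the PubChem compound record) might be generated along these lines. This is a sketch under stated assumptions: the function `pubchem_linked_image` is hypothetical, the URL pattern is the one PubChem uses for compound pages at the time of editing rather than the one in use when this comment was written, and the file name `ethanol.gif` is a placeholder. CID 702 is PubChem's compound identifier for ethanol.

```python
from html import escape

# Assumed URL pattern for a PubChem compound page, keyed by Compound ID (CID).
PUBCHEM_COMPOUND_URL = "https://pubchem.ncbi.nlm.nih.gov/compound/%d"


def pubchem_linked_image(cid, image_src, description):
    """Return XHTML for a structure image that links to its PubChem record.

    The alt text is a human-readable description, not an identifier, so it
    stays useful to readers who cannot see the image.
    """
    url = PUBCHEM_COMPOUND_URL % cid
    return '<a href="%s"><img src="%s" alt="%s" /></a>' % (
        escape(url, quote=True),
        escape(image_src, quote=True),
        escape(description, quote=True),
    )


print(pubchem_linked_image(702, "ethanol.gif", "Structure of ethanol"))
```

The identifiers themselves (InChI, canonical SMILES, names) stay on the PubChem page, so the post carries only a short link.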

  9. Richard Says:
    November 28th, 2006 at 9:44 am Sorry about the missing img tags – the code tag doesn’t seem to work…
  10. Paul Says:
    November 28th, 2006 at 10:25 pm Thanks, Richard.
  12. Egon Says:
    December 10th, 2006 at 10:39 am Paul, I wrote up some of my personal experiences in the blog items below, which link to a few earlier blog items on the same thing. The first is of general interest and answers your question on how to tag CAS registry numbers, while the second goes into adding CML in blog items:
    http://chem-bla-ics.blogspot.c…..hi-in.html
    http://chem-bla-ics.blogspot.c…..otcom.html
    Kind regards, Egon
  14. Egon Willighagen Says:
    December 17th, 2006 at 8:26 am I’ve just published a small Greasemonkey script which addresses some of the issues discussed here. Using proper markup, the script will recognize the chemistry in the HTML page, and automatically link to PubChem and Google. Read it at: http://chem-bla-ics.blogspot.c…..blogs.html
  15. Paul Says:
    December 17th, 2006 at 7:36 pm Sweet…I’m going to check that out over the holidays.
  3. Tim Says:

    Great posting, thx

