Wednesday, July 31, 2013

Writing CDK code in Bioclipse

Of course, there is the CDK manager in Bioclipse (Fig. 2, doi:10.1186/1471-2105-10-397), which exposes a good deal of the commonly used functionality, but yesterday there was this question on the CHMINF-L mailing list about how to find all existing isotopes, given a certain exact mass.

Now, that's something we can do with CDK's IsotopeFactory (with thanx to John for these "stable" URIs!), but I did not want to propose using something arcane as a command line, so was wondering if it could be done in Bioclipse. So, I had to figure out how to create Java objects in JavaScript, a skill I once had but lost.

But, I regained it after some googling and ended up with this code:

It's an image as I had to make one for MyExperiment.org where the source code can be downloaded.

The key JavaScript calls as importPackage() and importClass(), and I am sure there are some things that can be optimized here (please leave a comment if you have a better solution), but once I have the IsotopeFactory, I iterate over all elements and for each ask all isotopes (yeah, I think there is some API methods missing here :). And for each isotope I test if it matches (within a certain error) the search mass, and if it matches, reports the mass number, element symbol, and exact mass (using the js manager).

Tuesday, July 30, 2013

CTR #6: Ring counts in a SMILES file

The sixth solution to CTR problems is one that uses a SSSR algorithm to count the number of rings in SMILES entries. It combines code to read a gzipped SMILES file, iterates over all structures, splits out components for SMILES strings with non-covalently bound parts, and calculates the total sum of rings in those components:

import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.graph.*;
import org.openscience.cdk.io.*;
import org.openscience.cdk.io.iterator.*;
import org.openscience.cdk.ringsearch.*;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.tools.manipulator.*;
import java.io.File;
import java.util.zip.GZIPInputStream;

new GZIPInputStream(
new File("ctr/benzodiazepine.smi.gz")
.newInputStream()
),
SilentChemObjectBuilder.getInstance()
)
)
while (iterator.hasNext()) {
mol = iterator.next()
components =
ConnectivityChecker.partitionIntoMolecules(
mol
)
totalRingCount = 0
for (component in components.atomContainers()) {
ringset = new SSSRFinder(component).findSSSR()
totalRingCount += ringset.atomContainerCount
}
println totalRingCount
}


CTR #5: Convert a SMILES string to canonical SMILES

The CDK SmilesGenerator generates a canonical SMILES by default, so that one can simply parse the SMILES and generate a SMILES again to convert a SMILES string to a canonical SMILES:

import org.openscience.cdk.smiles.SmilesGenerator;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.silent.SilentChemObjectBuilder;

parser = new SmilesParser(
SilentChemObjectBuilder.getInstance()
);
generator = new SmilesGenerator();

smi = [
"CN2C(=O)N(C)C(=O)C1=C2N=CN1C",
"CN1C=NC2=C1C(=O)N(C)C(=O)N2C"
]
can = [];

smi.each { smiles ->
generator.createSMILES(
parser.parseSmiles(smiles)
)
)
}

assert can[0] == can[1]


CTR #4: Working with SD tag data

Next in the Chemistry Toolkit Rosetta CDK/Groovy solutions series is how to work with SD tag data. This CTR problem asks to scan the content of a number of SD fields, process them, and summarize them in a new field. Here's the CDK/Groovy solution, taking advantage of Groovy's grab functionality:

@GrabResolver(
name='idea',
root='http://ambit.uni-plovdiv.bg:8083/nexus/content/repositories/thirdparty/'
)
@Grapes([
@Grab(
group='org.openscience.cdk',
module='cdk-io',
version='1.4.11'
),
@Grab(
group='org.openscience.cdk',
module='cdk-silent',
version='1.4.11'
)
])

import org.openscience.cdk.io.*;
import org.openscience.cdk.io.iterator.*;
import org.openscience.cdk.silent.*;
import java.util.zip.GZIPInputStream;

new GZIPInputStream(
new File("benzodiazepine.sdf.gz")
.newInputStream()
),
SilentChemObjectBuilder.getInstance()
)
writer = new SDFWriter(
new FileWriter("RULES5.sdf")
)
while (iterator.hasNext()) {
mol = iterator.next()
if (mol.getProperty("PUBCHEM_XLOGP3") == null) {
mol.setProperty("RULE5", "no logP")
} else {
ruleCount = 0;
if (Integer.valueOf(mol.getProperty("PUBCHEM_CACTVS_HBOND_ACCEPTOR")) <= 10)
      ruleCount++
if (Integer.valueOf(mol.getProperty("PUBCHEM_CACTVS_HBOND_DONOR")) <= 5)
      ruleCount++
if (Double.valueOf(mol.getProperty("PUBCHEM_MOLECULAR_WEIGHT")) <= 500.0)
      ruleCount++
if (Double.valueOf(mol.getProperty("PUBCHEM_XLOGP3")) <= 5.0)
      ruleCount++
mol.setProperty("RULE5", ruleCount >= 3  ? "1" : "0")
writer.write(mol)
}
}
writer.close()


Tuesday, July 09, 2013

JChemInf volumes as single PDFs

One of the advantages of a print journal is that you are effectively forced to look at papers which may not have received your attention in the first place. Online journals do not provide such functionality, and you're stuck with the table of contents, and never see that cool figure from that paper with the boring title.

Of course, the problem is artificial. We have pdftk and we can make PDF of issues, or in the present example, of complete volumes. Handy, I'd say. It saves you from many, many downloads and forces you to scan through all pages. Anyway, I wanted to scan the full JChemInf volumes, and rather have one PDF per volume. So, I created them. And you can get them too. The journal is Open Access after all (CC-BY).

Thanks to the wonderful FigShare, here are the first four volumes of the journal:
And since you can embed things hosted on FigShare too, here's the first bits of Vol. 1:

Sunday, July 07, 2013

Visualizing SPARQL end point results with d3.js (but not yet for #WikiPathways)

 SVG pie chart visualization of SPARQL end point results in Firefox with the Web Console.
SPARQL and RDF are very quickly becoming the (Open) standard for linking and accessing database works. Readers of my blog I have been searching the corners of what can and cannot be achieved with this for some time now.

Triggered by some nice visualization work at the BioHackathon on ChEMBL content, I picked up visualization of RDF data (see this 2010 post where I asked people to visualize data using SPARQL). So, and since d3.js is cool nowadays (it was processing.js in the past), so I had a go at the learning curve.

I started with a pie chart and this example code. Because I was working on the SPARQL queries for metabolites in WikiPathways (using Andra's important WP-RDF work, doi:10.1038/npre.2011.6300.1).

Because SPARQL end points can spit results in many formats, there are many ways to hook it up. CSV (MIME typetext/csv) seems to be the simplest, and because my SPARQL query has all the semantics already, I am happy with simple column headers like source and count.

So, the first step is the SPARQL query, and I am asking here for the number of unique identifiers per database (and note the variable names I marked in red):

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select
str(?datasource) as ?source
count(distinct ?identifier) as ?count
where {
?mb a wp:Metabolite ;
dc:source ?datasource ;
dc:identifier ?identifier .
}
order by desc(?count)

When I ask for the results as CSV, this looks like (and note the column headers):

"source","count"
"HMDB",524
"Kegg Compound",389
"CAS",267
"ChEBI",240
"Entrez Gene",143
"PubChem-compound",87
"Wikipedia",8
"PubChem-substance",8
"ChemIDplus",7
"Chemspider",6
"Ensembl",4
"ChEMBL compound",2
"3DMET",1
"TAIR",1
"LIPID MAPS",1

Then, the HTML/JavaScript code I am using is based on the earlier-listed example code:

<!DOCTYPE html>
<body>
<script src="d3.v3.js"></script>
<script>

var width = 300,
height = 300,
radius = Math.min(width, height) / 2;

var color = d3.scale.category20();

var arc = d3.svg.arc()

var pie = d3.layout.pie()
.sort(null)
.value(function(d) { return d.total; });

var svg = d3.select("body").append("svg")
.attr("width", width)
.attr("height", height)
.append("g")
.attr(
"transform",
"translate(" + width / 2 + "," + height / 2 + ")"
);

d3.csv("data.csv", function(data) {

data.forEach(function(d) {
d.total = +d.count;
});

var g = svg.selectAll(".arc")
.data(pie(data))
.enter().append("g")
.attr("class", "arc");

g.append("path")
.attr("d", arc)
.style("fill", function(d) {
return color(d.data.source); }
);

g.append("text")
.attr("transform", function(d) {
return "translate(" + arc.centroid(d) + ")";
})
.attr("dy", ".35em")
.style("text-anchor", "middle")
.text(function(d) {
if (d.data.count > 40) {
return d.data.source;
} else {
return "";
}
});
});
</script>
</body>

I cannot say I understand 100% of this code yet, but here goes: the first half defines some parameters, like how large the pie chart is, which colors to use, etc. The red bits are the column names in the CSV output. One of the more puzzling things was the code that calculated the total count, but I changed the "count" variable name to "total" (colored blue) to make it more clear to me.

The d3.csv() method loads the data from a CSV file or stream. The stream is asked for by replacing the file name (viz data.csv) with the full URL with the SPARQL query, such as this one for the running example. This URL has this bit "&format=text/csv" which ensures the SPARQL end point returns the data in CSV format. Now, changing this to "&format=text/html" returns a HTML version of the results. The results are shown at the top right of this blog post.

However, you may now be wondering why I load it from file in the above example, rather than from the stream. Well, it turns out there is something wrong with the Virtuoso 6.1 installation or version for WikiPathways SPARQL end point in that does not complete the stream. Indeed, my Firefox returns a partial file, even though it is complete. I tried it with DBPedia instead which does close the stream, and then it does work directly from the SPARQL end point:

But that problem aside, which will resolve itself when upgrading the Virtuoso installation I expect, mission accomplished.