## Sunday, July 17, 2016

### Use of the BridgeDb metabolite ID mapping database in PathVisio

A long time ago Martijn van Iersel wrote a PathVisio plugin that visualizes 2D chemical structures of metabolites in pathways as found on WikiPathways. Some time ago I tried to update it to a more recent CDK version, but did not have enough time back then to get it going. However, John May's helpful DepictionGenerator made it a lot easier, so I set out this morning to update the code base to use this class and CDK 1.5.13 (well, strictly speaking it's running a prerelease (snapshot) of CDK 1.5.14). With success:

The released version is a bit more tweaked, so that the 2D structure diagram better fills the Structure tab. I have submitted the plugin to the PathVisio Plugin Repository.

Now, you may know that these GPML pathways only contain identifiers, and no chemical structures. But this is where the metabolite identifier mapping database helps (doi:10.6084/m9.figshare.3413668.v1): it contains SMILES strings for many of the compounds. It does not yet contain SMILES strings from Wikidata, but I will start adding those in upcoming releases. The current SMILES strings come from HMDB.

To show how all this works, check out the PathVisio screenshot below. The selected node in the pathway has the label uracil, and the leftmost front dialog was used to search the metabolite identifier mapping database, which found many hits in HMDB and Wikidata (middle dialog). The Wikidata identifier was chosen for the data node, allowing PathVisio to "interpret" the biological nature of that node in the pathway. Along with many mapped identifiers (see the Backpage on the right), this also provides a SMILES string that is used by the updated ChemPaint plugin.

## Sunday, July 10, 2016

### Setting up a local SPARQL endpoint

... has never been easier, and I have to say, with Virtuoso it already was easy.

OK, you do need Java installed, and for many this is still the case, despite Oracle doing their very best to totally ruin it for everyone. But seriously, visit the Blazegraph website (@blazegraph) and download the jar and type:

    $ java -jar blazegraph.jar

It will give some output on the console, including a webpage with SPARQL endpoint, upload form etc. That it tracks past queries is a nice extra.

Step 2: there is no step two

Step 3: OK, OK, you also want to try a SPARQL from the command line

Now, I have to say, the webpage does not have a "Download CSV" button on the SPARQL endpoint. That would be great, but doing so from the command line is not too hard either.

    $ curl -i -H "Accept: text/csv" --data-urlencode \
        query@list.rq http://192.168.0.233:9999/blazegraph/sparql

But it would be nice if you would not have to copy/paste the query into a file, or go to the command line in the first place. Also, I had some trouble finding the correct SPARQL endpoint URL, as it seems to have changed at least twice in recent history, given the (outdated) documentation I found online (common problem; no complaint!).
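
For those who would rather script this step, the curl call can also be sketched in Python using only the standard library. This is a minimal sketch, not something Blazegraph ships: the function names `build_csv_request` and `run_query` are made up for illustration, and the endpoint URL is simply the one from the curl example, so it will differ per installation.

```python
import urllib.parse
import urllib.request

# Endpoint copied from the curl example; adjust host and port for your own setup.
ENDPOINT = "http://192.168.0.233:9999/blazegraph/sparql"


def build_csv_request(query, endpoint=ENDPOINT):
    """Build a POST request that asks the SPARQL endpoint for CSV results."""
    # Form-encode the query as the "query" field, like curl's --data-urlencode.
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(endpoint, data=body, headers={"Accept": "text/csv"})


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the CSV response as text (needs a running endpoint)."""
    with urllib.request.urlopen(build_csv_request(query, endpoint)) as response:
        return response.read().decode("utf-8")
```

Reading the query from `list.rq` and passing it to `run_query` should then return the same CSV that curl prints, minus the response headers that `-i` adds.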

HT to Andra who first mentioned Blazegraph to me, and the Blazegraph team.

## Friday, July 08, 2016

### Metabolomics 2016 Write up #1: some interesting bits

A good conference needs some time to digest. A previous supervisor advised me that a conference trip of 5 days takes 5 full days to follow up on everything. I think he is right, though few of us actually block our schedules to make time for that. Anyway, I started following up on things last weekend, resulting in the first two blog posts:
The second was pretty much how I have been blogging a lot: it's my electronic lab notebook. The first is about how people can link out to WikiPathways. That post explains how people can create links between identifiers and pathways.

But there was a lot of very interesting stuff at Metabolomics 2016. I hope to be blogging about more things, but please find some initial coverage in the slides of a presentation I gave yesterday at our department:

Also check the Twitter hashtag #metsocdublin2016.

## Saturday, July 02, 2016

### Harmonized identifiers in the WikiPathways RDF

Biological knowledge should not only be captured in nice graphics, but should be machine readable. Public domain image from Wikipedia.

WikiPathways describes biological processes. Entities in these processes are genes, gene products (like miRNAs and proteins), and metabolites. The pathways do not describe what these entities are, but only provide identifiers in external databases, allowing you to study the identity in those databases. Therefore, for metabolites you will not find chemical graphs but identifiers from HMDB, CAS, KEGG, ChEBI, and others.

To ensure experimental data can be mapped to these pathways, independent of whatever identifiers are used, BridgeDb was developed. WikiPathways uses a BridgeDb webservice, Open PHACTS embeds BridgeDb technologies in their Identifier Mapping Service (particularly developed by Carole Goble's team), and PathVisio uses local BridgeDb ID mapping files.

The WikiPathways SPARQL end point is not using the Open PHACTS IMS. Instead, Andra introduced harmonized identifiers and provides these as additional triples in the WikiPathways RDF. For example:

    SELECT DISTINCT ?gene (fn:substring(?ensId,32) AS ?ensembl)
    WHERE {
      ?gene a wp:GeneProduct ;
            wp:bdbEnsembl ?ensId .
    }
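
The `fn:substring(?ensId,32)` part strips the IRI prefix from the harmonized identifier. A small Python sketch of the arithmetic, assuming the Ensembl IRIs have the form `http://identifiers.org/ensembl/<id>` (the prefix is my assumption here, and the helper name `ensembl_id` is made up):

```python
# Assumed IRI prefix of the harmonized Ensembl identifiers (not spelled out in the post).
ENSEMBL_PREFIX = "http://identifiers.org/ensembl/"


def ensembl_id(iri, start=32):
    """Mimic fn:substring(?ensId, 32): XPath positions are 1-based, Python's are 0-based."""
    return iri[start - 1:]


# ENSG00000139618 is only an example identifier.
example = ENSEMBL_PREFIX + "ENSG00000139618"
print(len(ENSEMBL_PREFIX))  # 31, so character 32 is the first one of the identifier
print(ensembl_id(example))
```

Under that assumption the offset 32 is exactly one past the 31-character prefix; a different prefix would need a different offset.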

Now, the gene resource IRIs actually use the Ensembl identifier when available, so this query returns redundant information, but there are other harmonized identifiers available:

    SELECT DISTINCT ?type ?pred
    WHERE {
      ?entity a ?type ; ?pred [] .
      FILTER (regex(str(?pred), 'bdb'))
    }

That results in a table like this:

Therefore, for these databases it is easy to make links between those identifiers and the pathways in which entities with those identifiers are found. For example, to create a link between Ensembl identifiers and pathways, we could do something like:

    SELECT DISTINCT
      ?pathwayRes (str(?wpid) AS ?pathway)
      (str(?title) AS ?pathwayTitle)
      (fn:substring(?ensId,32) AS ?ensembl)
    WHERE {
      ?gene a wp:GeneProduct ;
            dcterms:identifier ?id ;
            dcterms:isPartOf ?pathwayRes ;
            wp:bdbEnsembl ?ensId .
      ?pathwayRes a wp:Pathway ;
                  dcterms:identifier ?wpid ;
                  dc:title ?title .
    }
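
Once the results of such a query are downloaded as CSV, turning them into identifier-to-pathway links is a small grouping step. The Python sketch below assumes the CSV columns are named after the query variables (`pathway`, `pathwayTitle`, `ensembl`); the function name `pathways_by_ensembl` and the sample rows are invented placeholders, not real WikiPathways content.

```python
import csv
import io
from collections import defaultdict

# Invented placeholder rows that only mimic the shape of the query result.
SAMPLE_CSV = """pathwayRes,pathway,pathwayTitle,ensembl
http://example.org/WP_A,WP_A,Example pathway A,ENSG_EXAMPLE_1
http://example.org/WP_B,WP_B,Example pathway B,ENSG_EXAMPLE_1
http://example.org/WP_A,WP_A,Example pathway A,ENSG_EXAMPLE_2
"""


def pathways_by_ensembl(csv_text):
    """Group (pathway id, title) pairs by Ensembl identifier."""
    links = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        links[row["ensembl"]].append((row["pathway"], row["pathwayTitle"]))
    return dict(links)


# Print one line per identifier with the pathways it occurs in.
for ensembl, pathways in pathways_by_ensembl(SAMPLE_CSV).items():
    print(ensembl, "->", ", ".join(wpid for wpid, _ in pathways))
```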

I am collecting a number of those queries in the WikiPathways help wiki's page with many example SPARQL queries. For example, check out the federated SPARQL queries listed there.

### Two Apache Jena SPARQL query performance observations

Doing searches in RDF stores is commonly done with SPARQL queries. I have been using this with the semantic web translation of WikiPathways by Andra to find common content issues, though sometimes combined with some additional Java code. For example, find PubMed identifiers that are not numbers.

Based on Ryan's work on interactions, a more complex curation query I recently wrote, in reply to issues that Alex ran into when converting pathways to BioPax, finds interactions that convert a gene to another gene. Such conversions occur in WikiPathways because graphically you do not see the difference. I originally had this query:

    SELECT (str(?organismName) as ?organism) ?page
           ?gene1 ?gene2 ?interaction
    WHERE {
      ?gene1 a wp:GeneProduct .
      ?gene2 a wp:GeneProduct .
      ?interaction wp:source ?gene1 ;
                   wp:target ?gene2 ;
                   a wp:Conversion ;
                   dcterms:isPartOf ?pathway .
      ?pathway foaf:page ?page ;
               wp:organismName ?organismName .
    } ORDER BY ASC(?organism)

This query properly found all gene-gene conversions to be fixed. However, it was also horribly slow with my JUnit/Apache Jena setup. The query runs very efficiently on the Virtuoso-based SPARQL end point. I had been trying to speed it up in the past, but without much success. Instead, I ended up batching the testing on our Jenkins instance. But this got a bit silly, with at some point subsets of less than 100 pathways.

Observation #1

So, I turned to Twitter, and quite soon got three useful leads. The first two suggestions did not solve it, but did help me rule out possible causes. Of course, there is literature about optimizing, like this recent paper by Antonis (doi:10.1016/j.websem.2014.11.003), but I haven't been able to convert this knowledge into practical steps either. After ruling out these options (though I kept the sameTerm() suggestion), I realized it had to be the first two triples with the variables ?gene1 and ?gene2. So, I tried using FILTER there too, resulting in this query:

    WHERE {
      ?interaction wp:source ?gene1 ;
                   wp:target ?gene2 ;
                   a wp:Conversion ;
                   dcterms:isPartOf ?pathway .
      ?pathway foaf:page ?page ;
               wp:organismName ?organismName .
      FILTER (!sameTerm(?gene1, ?gene2))
      FILTER EXISTS {?gene1 a wp:GeneProduct}
      FILTER EXISTS {?gene2 a wp:GeneProduct}
    } ORDER BY ASC(?organism)

That did it! The time to run the query halved. Not so surprising, in retrospect, but it all depends on the SPARQL engine: which parts does it run first? Apparently, Jena's SPARQL engine starts at the top. This seems to be confirmed by the third comment I got. However, I had always understood that an engine can also start at the bottom.
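
To see why the position of the two type patterns could matter, here is a toy model in Python. It is not how Jena works internally; it merely assumes an engine that evaluates patterns in the order written, as guessed here, and counts how many candidate bindings each strategy examines. All names and data in it are made up.

```python
from itertools import product


def types_first(genes, conversions):
    """Bind ?gene1 and ?gene2 to every gene first, then look for a conversion."""
    examined, hits = 0, []
    for g1, g2 in product(genes, repeat=2):
        examined += 1
        if (g1, g2) in conversions:
            hits.append((g1, g2))
    return hits, examined


def interactions_first(genes, conversions):
    """Start from the conversions, then check the type of both ends (the FILTER)."""
    examined, hits = 0, []
    for g1, g2 in conversions:
        examined += 1
        if g1 in genes and g2 in genes:
            hits.append((g1, g2))
    return hits, examined


# Hypothetical data: 200 gene products and three conversions, one ending in a metabolite.
genes = {f"gene{i}" for i in range(200)}
conversions = {("gene0", "gene1"), ("gene2", "gene3"), ("gene4", "metabolite0")}

for strategy in (types_first, interactions_first):
    hits, examined = strategy(genes, conversions)
    print(strategy.__name__, "found", len(hits), "after examining", examined, "candidates")
```

Both strategies find the same gene-gene conversions, but the first one walks through every pair of genes to get there. The speed-up I actually measured was far more modest, so the real engine is presumably smarter than this toy; the point is only that the type checks are cheap once the interaction has fixed both genes.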

Observation #2

But that's not all. This speed-up made me wonder about something else. The problem clearly seems to be the engine's approach to running parts of the query. So, what if I remove further choices in what to run first? That leads me to a second observation. It helps significantly if you reduce the number of subgraphs it should later "merge". Instead, if possible, use property paths. That, again, roughly halved the runtime of the query. I ended up with the query below, which, obviously, no longer gives me access to the pathway resources, but I can live with that:

    WHERE {
      ?interaction wp:source ?gene1 ;
                   wp:target ?gene2 ;
                   a wp:Conversion ;
                   dcterms:isPartOf/foaf:page ?pathway ;
                   dcterms:isPartOf/wp:organismName ?organismName .
      FILTER (!sameTerm(?gene1, ?gene2))
      FILTER EXISTS {?gene1 a wp:GeneProduct}
      FILTER EXISTS {?gene2 a wp:GeneProduct}
    } ORDER BY ASC(?organism)

I'm hoping these two observations may help others who use Apache Jena for unit and integration testing of RDF generation too.

Loizou, A., Angles, R., Groth, P., Mar. 2015. On the formulation of performant SPARQL queries. Web Semantics: Science, Services and Agents on the World Wide Web 31, 1-26. http://dx.doi.org/10.1016/j.websem.2014.11.003