Monday, January 25, 2010

How Informatics can help in a crisis

Today's post is not about bioinformatics but about another open-source area of informatics that I feel passionate about: open geographical data. The OpenStreetMap (OSM) project provides free geographic data, such as street maps. It is essentially the wiki of the mapping world. In response to the recent earthquake disaster in Haiti, OSM started the Project Haiti.

In a short time, so much data has been collected that what used to be a largely empty map is now one of the most complete maps of Haiti. Michael Maron posted some before and after images on Flickr:

Before the earthquake (Jan 12th):

and by Jan 14th:

Beyond just marking roads, there is also an analysis of damaged buildings and displacement camps. This rendering, taken from the OSM wiki, shows damaged buildings and refugee camps mapped within OpenStreetMap.
A few other geographical projects related to Haiti are described at .


Friday, January 22, 2010

BioJava Hackathon - Last Day

Today was the last day of the BioJava Hackathon. It has been an exciting week and we made progress along several lines, which I will talk about in a moment. Special thanks go to Jonathan Warren for organizing the meeting room at the Sanger Institute. Thanks also to our hackers, without whom this hackathon would not have been possible. In particular, thanks to Scooter Willis, Jules Jacobsen, Andy Yates, Jonathan Warren, Christoph Gille and Matias Piipari for participating during the week, and to our special guests who joined us for a day, Richard Holland and Jim Procter.

All the code that has been written is available through the new modules labeled with the biojava3 name. Most work was related to the new sequence and protein structure modules:

Sequence modules

Over the last few years there has been a lot of discussion about the way sequences are currently represented. The "sequence guys" among the developers therefore worked on a new design that provides a biologically meaningful (think central dogma) representation of sequences. What is still missing are file parsers that use the new modules. The first FASTA parser is about to be committed by Scooter as I am writing this. More work is required before the code will be ready for the next release. Still, this is the beginning of a new data representation which should make the code base ready for the next couple of years.

Structure modules

The protein structure modules are the part of BioJava 3 that is closest to being released. During this week we added the CE algorithm for protein structure alignment and implemented core interfaces for a generic Model View Control wrapping of various 3D visualization tools. We also added better support for chemically modified residues (like MSE) and natural ones like selenocysteine; they are now treated as amino acids. Finally, we refactored the code base so that the structure data model is clearly separated from the new graphical user interfaces. The gui module now provides a nice way of calculating and visualizing protein structure alignments.

Next BioJava release (3.0)

More work is still required to push the new sequence module to a state where it can be released. We also did not write any documentation this week, so that will have to be added later on. Over the next weeks we will try to bring the modules to a state where they can be released. Once a module is release-ready, a detailed summary of the new features will be posted to the mailing list. In any case there will be a BioJava 3.0 release in time for the ISMB/BOSC conference, as in previous years.

Thursday, January 21, 2010

BioJava Hackathon - Day 3

Today the various efforts of this week are starting to come together. We have several new or refactored modules in SVN that contain the code designed and written this week. They are named with biojava3 to clearly indicate that these are the new things.

BioJava Hackathon - Day 3 - Sequence Modules

Another main feature of the Wednesday session is an update to the sequence modules. Scooter Willis extended the sequence-core modules to provide the framework for working with sequences according to what was discussed earlier. There is still room for tweaks, refactoring and future optimizations, but for the most part we can represent DNA, RNA and protein sequences in a biologically aware manner with minimal code complexity at the top level. Check out the biojava3 package in the sequence-core module.

Wednesday, January 20, 2010

BioJava Hackathon - Day 3 - Structure Modules

Today the main new feature in the structure modules is the release of a Java port of the Combinatorial Extension (CE) algorithm. It includes both a version of the algorithm that can be run from the command line and a GUI to view the results and trigger custom alignments. Essentially, this is what is available from the RCSB website at:

In the generic Model View Control design for 3D viewers, one open problem is how to deal with selections. In PyMOL and Jmol, selecting ranges, chains or atoms in proteins is done through a scripting interface. Should we have a scripting interface (based on the syntax of one of these), or multiple select methods that accept various arguments? Jules Jacobsen wrapped the Jmol-BioJava interface using the new interface definitions for the MVC.
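To make the two options concrete, here is a minimal sketch of what each could look like. All names in it (`ScriptSelectable`, `TypedSelectable`, `RecordingViewer`) are made up for illustration and are not the interfaces that were committed during the hackathon; the toy viewer only records what was selected instead of talking to Jmol or PyMOL.

```java
import java.util.ArrayList;
import java.util.List;

public class SelectionSketch {

    /** Option 1 (hypothetical): one method that takes a viewer-specific script string. */
    interface ScriptSelectable {
        void select(String script);
    }

    /** Option 2 (hypothetical): typed methods that do not depend on any viewer's syntax. */
    interface TypedSelectable {
        void selectChain(String chainId);
        void selectRange(String chainId, int firstResidue, int lastResidue);
        void clearSelection();
    }

    /** Toy viewer that only records the selections it receives. */
    static class RecordingViewer implements TypedSelectable {
        final List<String> selections = new ArrayList<>();

        public void selectChain(String chainId) {
            selections.add("chain " + chainId);
        }

        public void selectRange(String chainId, int firstResidue, int lastResidue) {
            if (firstResidue > lastResidue) {
                throw new IllegalArgumentException("range start is after range end");
            }
            selections.add("chain " + chainId + " residues " + firstResidue + "-" + lastResidue);
        }

        public void clearSelection() {
            selections.clear();
        }
    }

    public static void main(String[] args) {
        RecordingViewer viewer = new RecordingViewer();
        viewer.selectChain("A");
        viewer.selectRange("B", 10, 20);
        System.out.println(viewer.selections);
    }
}
```

The typed variant can be checked by the compiler and validated (as with the range check above), while the script variant stays flexible but ties callers to one viewer's syntax.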

Tuesday, January 19, 2010

BioJava Hackathon - Day 2

The contributor who added the most lines of code yesterday was Michael Heuer, who is joining the hackathon remotely (i.e. from somewhere in the US). He added the new FASTQ parser to BioJava. Well done Michael!

During the morning session we did a "Post Up", a silent and structured way of brainstorming. The goal was to come up with new requirements for pushing the sequence modules toward the state of the art. Scooter moderated a discussion in which we focused on biologically meaningful representations of biological sequences. A Chemical Compound will be at the core of any sequence representation, and we want to have different types of sequences like Chromosome sequence, Scaffold, DNA, RNA, Protein, and Sugars.
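As a rough illustration of the idea of putting a compound at the core of every sequence, here is a small sketch. The names `Compound`, `DnaBase`, `RnaBase` and `Sequence` are placeholders I picked for this example and are not the classes in the biojava3 modules; the transcription step simply copies the coding strand and replaces T with U.

```java
import java.util.ArrayList;
import java.util.List;

public class SequenceSketch {

    /** The smallest unit of any sequence (hypothetical interface). */
    interface Compound {
        char symbol();
    }

    enum DnaBase implements Compound {
        A, C, G, T;
        public char symbol() { return name().charAt(0); }
    }

    enum RnaBase implements Compound {
        A, C, G, U;
        public char symbol() { return name().charAt(0); }
    }

    /** A sequence is an ordered list of compounds of a single type. */
    static class Sequence<C extends Compound> {
        private final List<C> compounds;

        Sequence(List<C> compounds) {
            this.compounds = new ArrayList<>(compounds);
        }

        int length() { return compounds.size(); }

        C get(int index) { return compounds.get(index); }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            for (C c : compounds) {
                sb.append(c.symbol());
            }
            return sb.toString();
        }
    }

    /** Builds a DNA sequence from text; throws IllegalArgumentException on other letters. */
    static Sequence<DnaBase> parseDna(String text) {
        List<DnaBase> bases = new ArrayList<>();
        for (char ch : text.toUpperCase().toCharArray()) {
            bases.add(DnaBase.valueOf(String.valueOf(ch)));
        }
        return new Sequence<>(bases);
    }

    /** Transcribes the coding strand: same letters, with T replaced by U. */
    static Sequence<RnaBase> transcribe(Sequence<DnaBase> dna) {
        List<RnaBase> bases = new ArrayList<>();
        for (int i = 0; i < dna.length(); i++) {
            DnaBase b = dna.get(i);
            bases.add(b == DnaBase.T ? RnaBase.U : RnaBase.valueOf(b.name()));
        }
        return new Sequence<>(bases);
    }

    public static void main(String[] args) {
        Sequence<DnaBase> dna = parseDna("ATGGCT");
        System.out.println(dna + " -> " + transcribe(dna)); // prints "ATGGCT -> AUGGCU"
    }
}
```

Because the compound type is a type parameter, the compiler can tell a DNA sequence from an RNA sequence, which is one way of reading "biologically meaningful" here.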

We have started with test-driven development for the new sequence interfaces, and next we will wrap the existing sequence code with the new interfaces. Here you can see us during the brainstorming session:

On the 3D structure side of things, we added a new 3D structure-gui module that is going to provide the Model View Control interface for the various open source viewers.

Monday, January 18, 2010

BioJava Hackathon - Day 1 part 2

Continuation of Day 1...

We had more discussions about how to deal with the sequence modules, the bytecode dependencies of the core module and related topics. There seems to be general agreement about moving the current sequence code out of the core module into its own space. We will continue tomorrow morning, when Richard Holland is back.

On a different note, Christoph Gille, Jules Jacobsen and I discussed how to provide a Model View Control interface for using various open source 3D visualization libraries (Jmol, RCSB Libraries, Astex Viewer) together with BioJava.

We spent a lot of time on discussions today and hope to get more code done tomorrow.

BioJava Hackathon - Day 1


I am going to blog every day about the BioJava Hackathon, so you can stay updated with what is happening here in Cambridge.

In the morning I gave this presentation, around which we had several discussions about which issues are the most critical to solve. The issues are:

  • Installation problems. Getting the latest checkout of the new Maven based build system causes problems for some of us. Sorting out the installation procedure is a major topic of the afternoon. It works successfully with the latest Eclipse, the m2eclipse plugin and the subclipse plugin. Some of the NetBeans based developers also reported no problems during installation.
  • Features. BioJava features should become first-class citizens. This means it should be possible to instantiate them independently of sequence objects.
  • Simplify Sequences: Sequences should be Strings as far as possible. Only convert them to Sequence objects if required.
  • Documentation: Some of the BioJava 3 documentation is not up to date and can lead to misunderstandings. The latest BioJava 3 code is available in the trunk.
  • Memory efficiency: Make sure that iterating over RichSequences is memory efficient. (Fix a memory leak there)
  • Bytecode: The Biojava - core module should not require the Bytecode module.
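The "Features" and "Simplify Sequences" points from the list above can be sketched together as follows. This is only an illustration under my own assumptions: the `Feature` class, its fields and `extractFrom` are hypothetical names, not existing BioJava API. The feature is created without any sequence object and is only later applied to a plain String.

```java
public class FeatureSketch {

    /** A stand-alone feature: a type plus 1-based, inclusive coordinates (hypothetical class). */
    static final class Feature {
        final String type;
        final int start;
        final int end;

        Feature(String type, int start, int end) {
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("invalid feature range " + start + "-" + end);
            }
            this.type = type;
            this.start = start;
            this.end = end;
        }

        /** Applies the feature to any plain String sequence that is long enough. */
        String extractFrom(String sequence) {
            if (end > sequence.length()) {
                throw new IllegalArgumentException("feature ends beyond the sequence");
            }
            return sequence.substring(start - 1, end);
        }
    }

    public static void main(String[] args) {
        // The feature is instantiated first, independently of any sequence.
        Feature helix = new Feature("helix", 3, 6);

        // The sequence stays a plain String until something richer is needed.
        String protein = "MKTAYIAK";
        System.out.println(helix.type + ": " + helix.extractFrom(protein)); // prints "helix: TAYI"
    }
}
```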
Andy Yates is tweeting about it at

Saturday, January 16, 2010

BioJava Hackathon 2010

I am off to Cambridge, U.K. where we will have the BioJava Hackathon next week. I am planning to blog on a regular basis about what is going on there.

Friday, January 15, 2010

All vs All Structure Alignments in PDB

Proteins can have various degrees of similarity. If two proteins show high similarity in their amino acid sequences, it is generally assumed that they are closely related in evolutionary terms. With increasing evolutionary distance the degree of similarity usually drops, but proteins can still show similar function and have a similar overall 3D structure, even if the sequence similarity is low. Detecting such remote similarities is important for inferring functional and evolutionary relationships between protein families and is a core technique in structural bioinformatics.

For the RCSB PDB web site I have recently been working on a new all-against-all comparison of all protein chains. While protein sequence comparisons can be computed quickly, calculating protein structure alignments is much more time-consuming. So far we have computed about 140 million pairwise alignments in ~100,000 CPU hours on the Open Science Grid (OSG). With the help of Chris Bizon we could easily deploy our code there, and I can highly recommend that other scientists give the OSG a try as well. A technical report about how we ran these calculations is available from here:
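As a back-of-the-envelope reading of the two figures quoted in this post (about 140 million alignments, roughly 100,000 CPU hours), the small sketch below works out the average cost per alignment and shows how quickly an all-against-all comparison grows. The class and method names are made up for this example, and the chain count of 1000 is an arbitrary input, not a number from the calculation itself.

```java
import java.util.Locale;

public class AlignmentCost {

    /** Average CPU seconds spent per pairwise alignment. */
    static double secondsPerAlignment(double cpuHours, double alignments) {
        return cpuHours * 3600.0 / alignments;
    }

    /** Number of unordered pairs in an all-against-all comparison of n chains: n(n-1)/2. */
    static long pairCount(long chains) {
        return chains * (chains - 1) / 2;
    }

    public static void main(String[] args) {
        // Both inputs are the approximate figures quoted in the post.
        double seconds = secondsPerAlignment(100_000, 140_000_000);
        System.out.println(String.format(Locale.ROOT, "%.1f CPU seconds per alignment", seconds));

        // 1000 chains is a made-up input used only to show the quadratic growth.
        System.out.println(pairCount(1000) + " pairs for 1000 chains");
    }
}
```

With the rounded figures this comes to roughly 2.6 CPU seconds per alignment on average, and the quadratic pair count is why a grid was needed in the first place.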

Sunday, January 10, 2010

Protein Comparison Tool

In recent months I have spent some time developing the new RCSB PDB Protein Comparison Tool (you can see an example of it in the right-hand menu of this blog).

In particular, I spent a lot of time porting the CE and FATCAT algorithms from C to Java and developing a new user interface. Check out the latest version at . (E.g. try to align 4HHB chain A and 4HHB chain B.)

Having the algorithms in Java opens the door for a number of nice applications. It is now possible to launch the structure comparison application with a single mouse click using the Java Web Start technology.