Monday, December 10, 2012

Protein Feature View goes open source

The source code for the Protein Feature View has been released on GitHub. This allows you to incorporate the dynamic SVG graphics that visualize UniProt and PDB relationships into your own web sites. You can either use the public JSON services provided by RCSB PDB to populate the view, or display your own data (after setting up your own services).

Saturday, December 8, 2012

RCSB PDB's NAR paper

The manuscript describing recent developments at the RCSB PDB has been released as part of the latest Nucleic Acids Research database issue.

View the manuscript at NAR

The Author Profile provides a graphical timeline of when particular structures were released in the PDB.

Some of the highlights of this year are improvements in the following areas:
Our continuous efforts to provide a structural view of biology are also reflected by an increase in our user base.  The RCSB PDB web site currently hosts ∼240 000 unique visitors per month (based on the number of unique IP addresses), an increase from the 180 000 visitors last reported in 2011. 

Thursday, December 6, 2012

Ten Simple Rules For The Open Development of Scientific Software

Jim Procter and I drafted a new manuscript for PLOS Computational Biology describing ten simple rules for the open development of scientific software. Our motivation for writing this was to make the development of open scientific software more rewarding and the experience of using software more positive.  The ten rules are intended to serve as a guide for any computational scientist:

Ten Simple Rules For the Open Development of Scientific Software (article)

Tuesday, December 4, 2012

RCSB PDB is hiring

The RCSB PDB is one of the leading biological databases, with more than 240,000 unique visitors per month. We have an open position on our team for a Lead Web Architect. The position is located in beautiful San Diego at UCSD.

A detailed job description and online application form can be found at:


* MS Degree in Computer Science or comparable combination of education and experience with considerable focus in JavaEE software development.

* Established demonstrated work experience in the role of an architect and developer on medium to large size database-driven web applications using Java EE technology and standards.

* Advanced experience developing the presentation layer of a dynamic, database-driven web application using HTML, CSS, JavaScript, JavaScript Toolkits, Ajax, JSP, XML, Java. Experience resolving browser and cross-platform compatibility issues. Advanced experience with Struts2, Tiles, jQuery.

* Advanced experience with database design, Structured Query Language and RDBMS's such as MySQL. Expertise in web application server administration and configuration such as Tomcat.

* Established expertise in software life cycle methodologies. Experience with build tools such as Maven and Ant, and continuous integration systems such as Cruise Control. Experience with project tracking tools such as Jira.

Friday, November 30, 2012

BioJava 3.0.5 released

BioJava 3.0.5 has been released and is available for download as well as from the BioJava Maven repository.

New Features:

- New parser for CATH classification

- New parser for Stockholm file format

- Significantly improved representation of biological assemblies of protein structures. Biological assemblies can now be re-created from the asymmetric unit

- Several bug fixes

Thanks to Daniel Asarnow for contributing the CATH parser and Amr Al Hossary and Marco Vaz for their contributions to the Stockholm parser.

Thursday, November 29, 2012

The PLOS Computational Biology Software Section

PLOS Computational Biology has been accepting and publishing so-called Software Articles for about a year now. The articles that have been released under this category span a wide range of topics in computational biology. In today's editorial, Hilmar Lapp and I provide a brief overview of what has been published so far and describe some of the ideas behind the Software section.

Sunday, October 28, 2012

RCSB PDB web site update Fall 2012

New Features at the RCSB PDB web site

This week the RCSB PDB released its latest major web site update. Here is a quick description of some of the new features.

Protein Feature View

One of the main new features is the new Protein Feature View. It allows you to compare the full-length protein sequence, as defined by UniProt, with the regions that have been determined in 3D and are available together with their coordinates from the Protein Data Bank. Besides visualizing the PDB and UniProt relationships, the new view adds further annotations for a more comprehensive understanding of the protein. External data such as Pfam domains, or regions for which homology models are available from the ProteinModelPortal, are indicated. Some annotations are calculated on the fly: protein disorder regions, as predicted by Peter Troshin's BioJava implementation of RONN, are available as a histogram-style track. Finally, regions with increased hydrophobicity can be spotted by looking at the Hydropathy track.
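
The post does not say how the Hydropathy track is calculated. As a rough sketch only, assuming the Kyte-Doolittle scale and a plain sliding-window average (the scale, the window size and the class name HydropathySketch are all assumptions made for this illustration, not the site's actual implementation), such a track could be computed like this:

```java
import java.util.HashMap;
import java.util.Map;

class HydropathySketch {

    // Kyte-Doolittle hydropathy value per amino acid.
    // Assumed scale: the post does not name the one used on the site.
    private static final Map<Character, Double> KD = new HashMap<>();
    static {
        String aa = "ARNDCQEGHILKMFPSTWYV";
        double[] v = {1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
                      3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2};
        for (int i = 0; i < aa.length(); i++) {
            KD.put(aa.charAt(i), v[i]);
        }
    }

    /** Returns one averaged value per window start (length - window + 1 values). */
    static double[] profile(String sequence, int window) {
        if (window < 1 || window > sequence.length()) {
            throw new IllegalArgumentException(
                    "window must be between 1 and the sequence length");
        }
        double[] result = new double[sequence.length() - window + 1];
        for (int start = 0; start < result.length; start++) {
            double sum = 0;
            for (int i = start; i < start + window; i++) {
                // Unknown residues (e.g. X) contribute 0 in this sketch.
                sum += KD.getOrDefault(sequence.charAt(i), 0.0);
            }
            result[start] = sum / window;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sequence, window of 5 residues.
        double[] p = profile("MKTLLILAVVAAALAKRDE", 5);
        for (int i = 0; i < p.length; i++) {
            System.out.printf("%d\t%.2f%n", i + 1, p[i]);
        }
    }
}
```

Stretches of high positive values in the output correspond to the hydrophobic regions that would stand out in such a track.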

The Protein Feature View is built using SVG graphics and makes extensive use of the jQuery-SVG library. Using SVG graphics for a prominent feature on the site (it is on every protein-explorer page) has become possible because most modern browsers now support this type of graphics. However, there are still a number of users who are stuck with old browser versions. According to our web site traffic logs, this number is rapidly declining, and we estimate that currently fewer than 15% of our users can't use the new view. These users won't see error messages on the protein-explorer page, though. The graphics will simply not be shown, and the page gracefully falls back to the way it looked before the graphics were introduced.


Better Pfam integration

Another new feature of this release is better integration with Pfam. Pfam family names are now searchable, and one can quickly look up all protein structures related to these families. Since Pfam is used in structural genomics projects to prioritize targets for crystallization, a possible use case is to look up domains of unknown function (DUFs) and check whether 3D coordinates have already been determined for them. As mentioned above, Pfam domains can be viewed as part of the new Protein Feature View. Up-to-date Pfam-PDB mappings are calculated weekly by submitting newly released PDB entries to the HMMER3 web site. This process is described in more detail at the Pfam blog site.

Searching and Reporting

Other improvements in this RCSB PDB web site update concern searching and reporting. RCSB searches now better support poly-proteins and their sub-components (see screenshot above). There is also better support for searching drug names, and the Ligand Summary page (e.g. Lipitor) shows more information about drugs, coming from DrugBank. Once a search has been performed, there are now four different types of reports available for investigating the results. Besides the "traditional" search results there is now a "condensed" view, which provides a compact summary of results. The "gallery" provides images for the proteins that have been found in the search. A "timeline" gives a historical overview of when proteins were released in the PDB.

A full description of all the new features is (as always) available on the What's New Page.

Friday, August 10, 2012

BioJava 2012 paper published

Today the latest BioJava paper was published, describing the BioJava version 3 series.

Thanks to all developers for their contributions. It would not have been possible without them!




BioJava: an open-source framework for bioinformatics in 2012

Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius
Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock
Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L.
Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis

Bioinformatics 2012; doi: 10.1093/bioinformatics/bts494

Monday, May 21, 2012

BioJava 3.0.4 released

BioJava 3.0.4 just hit the servers. This is mainly a bug-fix release addressing a few issues with the protein structure and the disorder modules.

One new feature is that SCOP can now be accessed either from the original SCOP site in the UK or from the Berkeley version.

Wednesday, May 9, 2012

About Pfam and PDB mappings

Today's blog article is not going to get published here but at the Pfam blog site, where Rob Finn and I wrote about the Pfam and PDB mappings:

Does my family of interest have a determined 3D protein structure?

Friday, May 4, 2012

Systematic domain based structure alignments at the RCSB PDB

At the RCSB PDB web site we have been providing pre-calculated, systematic protein structure alignments for about two years. Every week we run systematic structure alignments for newly released proteins across a representative subset of the database and try to identify related proteins based on their 3D shape.

This week we have released a major upgrade to those efforts. Our pre-computed alignments are now using domain information to split protein chains into smaller subunits. We introduced this change because many proteins are built of more than just one domain. In that case our previous results were a bit unclear in the sense that results for any of the domains were displayed together and that made the data more difficult to compare and interpret.

The new domain-based procedure uses the SCOP domain assignments, where available, to define how to break up protein chains. If the structures are too new to be annotated by SCOP (like all newly released proteins), we use a program called ProteinDomainParser to define domains based on geometric criteria. Even if the algorithm sometimes defines a break point that differs from what SCOP would define, it is still interesting to find structural neighbors for such protein fragments.

In addition, this release of the RCSB PDB web site also provides a new display of protein chains and how different sources annotate protein domains (see the image above). This domain summary shows SCOP domains, ProteinDomainParser domains and Pfam domains. Here is an example for a cyclodextrin glycosyl transferase (3BMV) from Thermoanaerobacterium thermosulfurigenes. It is composed of four domains, which are identified by all three data sources.

Friday, April 27, 2012

down?!

I just realized that is not reachable  and displays this little message:

Thanks to the Google cache (all the sites containing the original info are unreachable as well) I found a notification in one of the Java user groups:

Notice: will be offline starting around noon pacific time (7pm UTC) on Friday, April 27, through noon pacific time (7pm UTC) on Monday, April 30 for scheduled maintenance.

A planned downtime of three days. If we did that at work we would get into serious trouble! We go to great lengths to ensure 24/7 availability, and even when we have major site updates we don't let that impact our availability.

I noticed this because one of my Tomcat instances did not want to start up, which was caused by a failing XML DTD validation.

Luckily this can be easily fixed by removing the DTD validation in the XML file. However, I wonder how many Tomcat servers worldwide will have similar problems starting up until Monday! Such a long downtime of such "standard" URLs is rather unprofessional IMHO.
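
The fix described above is to edit the XML file. Where you control the parsing code yourself, a different approach (not what I did here) is to keep the parser from fetching the DTD over the network at all. Here is a minimal sketch using the standard JAXP API. The class name, the DTD URL and the sample document are made up for illustration, and Tomcat's own configuration parsing is not changed by this.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class OfflineDtdSketch {

    /** Parses XML without contacting the host named in its DOCTYPE. */
    static Document parseOffline(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // parse only, no DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Answer every external entity request (including the DTD) with
            // an empty document instead of opening a network connection.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document whose DTD lives on an unreachable host.
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE config SYSTEM \"http://unreachable.invalid/config.dtd\">\n"
                + "<config><entry name=\"a\"/></config>";
        Document doc = parseOffline(xml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The trade-off is that entities or default attribute values declared in the DTD are no longer available, so this only suits documents that do not depend on them.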

Thursday, March 29, 2012

PLoS comp biol goes Wikipedia

Wikipedia has become a primary resource for many students when looking up basic information. However, there is an interesting gap between the scientific community and the people who regularly contribute to wikipedia articles. There are only a few prominent scientists who are regulars, such as the Pfam authors, who recently integrated wikipedia into the Xfam series of databases. Another major science-related project on wikipedia, with about ten thousand articles describing various genes, is GeneWiki, led by Andrew Su. A possible reason for this gap might be that wikipedia articles are not accepted as academic publications. As of today PLoS comp biol tries to resolve this disparity by publishing a new type of manuscript, Topic Pages.

Topic Pages are designed to provide review-style articles. The published article serves as a reference copy that can be cited and will show up in PubMed. It will also be released at wikipedia, where a living copy of the document can be edited and updated by the wider public. This is done in collaboration with the wikiproject computational biology.

How does this work? In short, an article is first submitted to PLoS where it is peer reviewed and upon acceptance it will be published by PLoS comp biol as well as uploaded to wikipedia. While this sounds rather straightforward, one of the issues with this approach is around licensing.

PLoS publishes all articles under a very liberal license, the Creative Commons Attribution License. This means you can do with the article what you want, even change the license, as long as you credit the original sources. This license is in fact more liberal than the wikipedia license, which is Creative Commons Attribution Share Alike. This means we can take a PLoS comp biol article and publish it on wikipedia, as long as we cite the original source of the text, but we cannot do this in the opposite direction.

In order to avoid any licensing conflicts, Spencer Bliven set up a custom Mediawiki instance with the liberal PLoS-style license. By drafting the manuscript there we are able to transfer the content easily over to both PLoS and wikipedia, once it has passed the PLoS review process. In addition, the review process is transparent: you can see what the referees commented on our article at the talk page of the article (both at wikipedia and the topicpages sites).

Our latest paper is the first such Topic Page. It provides a review of circular permutations in proteins, a type of relationship between proteins in which the order of amino acids in the sequence is changed while the 3D shape remains very similar.
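
To make the idea concrete at the sequence level: in the idealized case, one sequence is a circular permutation of another if cutting it at some position and swapping the two pieces yields the other. The sketch below checks exactly that for identical residues. Real proteins diverge in sequence, so actual detection relies on structure alignment rather than string matching. The class and method names are invented for this illustration.

```java
class CircularPermutationSketch {

    /**
     * Returns the position at which a has to be cut so that swapping the
     * two pieces gives b, or -1 if b is not a rotation of a.
     */
    static int cutPosition(String a, String b) {
        if (a.length() != b.length()) {
            return -1;
        }
        // Every rotation of a occurs as a substring of a + a.
        return (a + a).indexOf(b);
    }

    public static void main(String[] args) {
        // "AB" + "CDE" rearranged to "CDE" + "AB": cut after position 2.
        System.out.println(cutPosition("ABCDE", "CDEAB")); // prints 2
        // Same residues, but not a rotation.
        System.out.println(cutPosition("ABCDE", "CDAEB")); // prints -1
    }
}
```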

For more information read the full article at PLoS ( doi:10.1371/journal.pcbi.1002445 ) or take a look at the latest version at wikipedia. Also read the PLoS comp biol editorial announcing the Topic Pages.

Sunday, March 18, 2012

GSoC 2012 - how to get started with a proposal

To get started with a proposal I recommend looking at the BioJava
project proposals from the last two years to see what kind of
projects got funded and how those proposals were written. Think
about what you would like to work on. Get a copy of BioJava and see
how related features work. Come up with a plan for how to extend
them.

We are fairly flexible regarding what kind of projects we will run
this summer; this really depends on the submitted project
proposals. All proposals will be compared and ranked together with
proposals from the other Bio* projects. As such, a good proposal is
key to getting funded.

A good proposal

- shows the motivation of the student
- shows that the candidate is qualified to do what he or she is proposing
- adds useful new functionality to BioJava
- discusses possible risks and what to do about them

It is difficult to answer questions like "how should I perform this or
that project?" There is more than one possible path, and the best
answer depends on your skills and interests.
Overall I recommend picking a project on a topic that is close to your
(future?) thesis, or is of particular interest to you.

Here are a couple more thoughts which are project specific:

-  The best projects are those that you come up with yourself. If you
want to distinguish yours from every other proposal, suggest something
which is not on our list.

- File parsers:

If you want to work on file parsers, take a look at existing ones. What
features do they provide? How can they be extended? For example, if you
want to work on the CATH parser, take a look at how the SCOP parser
works. What features are available around this (access to domains) and
how can something like this be set up for CATH? Look at how the CATH
website provides files.

- Porting of algorithms:

There are several possible approaches for doing this. I recommend that
you have some background in both C and Java for this. Get a
copy of the algorithm you want to port, compile it, and take a look at
the source. There are several ways to proceed with the actual port,
and having a good strategy is key for this proposal. Perhaps
try out your strategy on some simple test case to see how this
might work.

- BioJava in the cloud

The goal here is parallelization of existing code. What parts of
BioJava are suitable for this? How can they be parallelized and moved
to current cloud infrastructure? There is a lot of online material
on this topic which will be helpful here.

Friday, March 16, 2012

BioJava at Google Summer of Code 2012

The Open Bioinformatics Foundation, as the umbrella organisation for
BioJava, has been accepted to participate in this year's Google Summer
of Code.

This means we will again be able to offer mentoring through BioJava
this year. Accepted students will get a stipend of $5,000 from Google.
Participation is possible from most countries in the world, as long as
you are eligible to work in the country in which you'll reside
throughout the duration of the program.

If you are interested in working on a BioJava related project, now is
the time to start preparing and discussing your proposals. In the
last two years we had many applications for the projects proposed by
mentors. If you want to distinguish your application, I recommend
proposing your own project. Don't forget to discuss any proposal with
us before you submit it. We will try to provide feedback and match
you with a suitable mentor.


The student application deadline is April 6th. Google will announce
which proposals got accepted on April 23rd.

BioJava 3.0.3 released

BioJava 3.0.3 has been released and is available for download as well as
from the BioJava Maven repository.

New Features

BioJava 3.0.3 adds several new features:

- Significant improvements for the web service module (ncbi blast and
hmmer web services)

- Fastq parser (ported from the biojava 1 series to version 3)

- Support for SIFTS-PDB to UniProt mapping

- Improved support for working with external protein domain definitions

- Protmod module renamed to modfinder

- Numerous improvements all over the place (several hundred commits
since last release)

- We are also working on an update for the legacy biojava 1.8 series.

This release would not have been possible without contributions from
numerous people. Thanks to all for their support!

Happy BioJava-ing!