Monday, October 4, 2010

Getting the most out of the Combinatorial Extension (CE) algorithm

The CE algorithm is a frequently used algorithm for protein structure alignment. We have recently ported the CE algorithm from C to Java and made the new implementation available through BioJava project. It is run as part of the Protein Comparison Tool at the RCSB PDB web site. This was a good opportunity to learn a few hidden aspects of the original implementation and re-consider some of the parameter choices that have been take for the original code. Here a quick summary of how CE works and how to get the most out of it.

CE essentially has two main steps:
  1. Finding Aligned Fragment Pairs (AFPs) between the two protein structures that share similarity. This step in most cases is quite quick, however this only provides a rough alignment as a result which needs refinement.
  2. The alignment optimization steps is the slow part of the algorithm. It iteratively tries to extend the initial alignment while staying below a certain RMSD threshold.
So how does these two steps work in detail? Today we will look at Step 1:


Step 1: Finding AFPs:This step starts with calculating Matrices of intra-molecular distances withing each of the two proteins. These are matrices that describe the distances within a protein and they look something like this:



Here you can see the two matrices, on the left for the alpha chain of Hemoglobin, and on the right side for the beta chain of Hemoglobin. You can view their alignment at the RCSB PDB web site.

As you can see the two chains are very similar. Their pattern of intra-molecular distances looks similar, too. The regions that are colored indicate close proximity in space. Can you spot the helices?
There are a few different ways to compare such distance matrices. In order to identify the AFPs, CE takes the approach to "slide a window" over each of the matrices and tries to find regions where the two proteins looks similar.

This is how it is calculated: First all distances for fragments of size 8 are calculated and the difference of Distances is


This can be done by calculating differences of distances for each fragment of length winSize, which is one of the parameters that can be sent to CE and which is by default set to a length of 8.

The original CE paper has a table on the impact of this parameter. Here you can see how different values for winSize (here called Fragment size m) affect the final RMSD, alignment length and calculation time.It is impressive how much CPU speed has improved since back in those days: Now the alignment shows up almost instantly on my screen. However back then they only had a 248 Mhz CPU and you can see in the table above how many seconds one had to be patient back in 1998.

I tried to reproduce this table with the current implementation, but the numbers don't show up exactly the same. There were a few follow up publication on the algorithm during which this step got improved. The trace step has now some RMSD threshold parameters which make sure that the RMSD between the traces stay within limits.

In this table the internal numbers after the Step 1: As you can see the main effect is on the nr. of traces that are being detected.

winSizenr tracesRMSD after initial trace
42,662,2114.4
61,858,3803.8
81,158,3274.6
10891,8074.3
12847,8024.3
167585844.0
24261,2905.0
36173954.6

To come back to the differences of distances between the fragments of length winSize: at this stage the difference of distance matrix looks like this:


Each dot on the matrix corresponds to the start of a fragment and color corresponds to closely similar fragments. As you can see there are plenty of regions that have similar fragments, and the fragments come in groups.

The next step, optimizing the alignment, will be discussed some other time...