Why new generation of HPC algorithms are needed for Mass Spectrometry based omics - Part 2

3 minute read


In my previous post I have argued that the current HPC methods for MS based omics have been designed for making the arithmatic efficiency faster. The post can be seen at this link: https://prof-s.github.io/posts/2020/10/whyhpc/.

We also argued that the current HPC MS based omics data analysis methods need to designed for communication-avoidence i.e. they are bottlenecked by the data that needs to be communicated across different processing units instead of the compute-efficinecy.

But how much difference does that make? We will still have multiple processing units (that can be used for data-parallelism) to get the results; and they are going to be faster. right?

The short answer is No.

Long answer. Here are the reasons:

1) All of the existing HPC methods work in the following way. Assume that there are N spectra that needs to be processed, and database D is used for processing. All of these methods divide N spectra among P processors such that N/P spectra are processed on each processing unit. The underlying assumption is that either database D is divided (and replicated equally among processing units) or that (smaller) fasta database is communicated which is then expanded on each machine. In either case the database D is either communicated or is expanded. This is where the problem lies. No matter what the current HPC methods do; they scale with the complexity of O(N/P+D) where you can see that D is not affected by P i.e. no matter how many more processing units you have they will not be affecting the complexity due to D. This bring us to the next point.

2) All of the HPC methods are developed for expliting parallelims on distributed- and shared memory architectures; None of them exhibit linear-speedups with increasing size of the database (which is where the explosion in data takes places when you increase the number of search parameters) or even with increasing number of processors or spectra. In other words no matter how many processing units were increased; they did not result in (at least) linear-speedups which are essential for any reasonable parallel computing algorithm.

All of the current HPC methods exhibit speedups that are much less than linear. This is problematic because HPC methods would not scale with increasing size of the database, or increasing number of processors; something that current bottleneck for both closed- and open- database searches.

In order to make the MS based omics process more efficient we need to design, and develope communication-avoiding HPC algorithms which will be an ideal solution for scalable MS data analysis for non-model multispecies databases, that can translate into enormous search-space (several terabytes), against which MS data need to be matched.

The HPC methods must exhibit at least linear speedups with increasing number of processors and database size.

In the next blog post I will discuss:

a) If it is even algorithmically possible to do better than the current HPC methods? b) If yes by how much?