Why new generation of HPC algorithms are needed for Mass Spectrometry based omics - Part 3

2 minute read


In my previous post I have argued that the current HPC methods for MS based omics have been designed for making the arithmatic efficiency faster. The post can be seen at these links: https://prof-s.github.io/posts/2020/10/whyhpc/ and https://prof-s.github.io/posts/2020/11/whyhpc-part2/.

We also argued that the current HPC MS based omics data analysis methods need to designed for communication-avoidence, and we have also established that the current HPC algorithms are not optimal enough i.e. new communication-avoiding paradigm is needed to ensure that the HPC algorithms can scale effectively with increasing size of the proteome database.

The questions that we are going to answer in this blog post is as follows: a) If it is even algorithmically possible to do better than the current HPC methods? b) If yes by how much?

The answer to this question can be found in our recent pre-print in which we prove that it is possible to get superior algorithmic workflows as compared to the existing ones. We have shown that if the parallel algorithm is optimally designed, and implemented the theoretical framework dictates that it is possible to get a run time complexity of Omega(n/p) as compared to the Omega (n) that is obtained by the current workflows. Here n is the size of the theoretical database that is obtained by expanding the proteome using search-parameters.

Our pre-print can be accessed here: https://arxiv.org/pdf/2009.14123.pdf

In this paper, we prove that the communication bound that is reached by the existing parallel algorithms is Omega(mn+2r (q/p)), where $m$ and $n$ are the dimensions of the theoretical database matrix, $q$ and $r$ are dimensions of spectra, and $p$ is the number of processors. We further prove that #communication-optimal# strategy with fast-memory \sqrt{M} = mn + \frac{2qr}{p} can achieve \Omega({\frac{2mnq}{p}}) but is not achieved by any existing parallel proteomics algorithms till date.

In the paper, to further validate our claim, we performed a meta-analysis of published parallel algorithms, and their performance results. We show that sub-optimal speedups with increasing number of processors is a direct consequence of not achieving the communication lower-bounds proved in this paper. ##Consequently, we assert that next-generation of provable, and demonstrated superior parallel algorithms are urgently needed for MS based large systems-biology studies especially for meta-proteomics, protegenomics, microbiome, and proteomics for non-model organisms##.

Our hope is that this paper will excite the parallel computing community to further investigate parallel algorithms for highly influential MS based omics problems.