Why new generation of HPC algorithms are needed for Mass Spectrometry based omics - Part 1

3 minute read


Everyone you talk to will tell you that database search algorithms for proteomics are embarrassingly parallel problem - and is easy to solve. You just need to distribute your data on multiple nodes, and you will get excellent parallelization and scalable solutions. This ideal scenairo of scalabililty with the introduction of few pragmas does not go far.

I argue in this blog post that we need a new generation of HPC algorithms for MS based omics.

For the past 30 years (or more), database-search algorithms, that deduce peptides from Mass Spectrometry (MS) data, have tried to improve the computational efficiency to accomplish larger, and more complex systems biology studies. Poor-scalability with increasing size of theoretical search-space, and search-parameters is a well-known problem in MS based omics. There are excellent works (like MSFragger) that try to reduce the computational complexity and make the process more efficient using smart indexing.

Even with these advance serial algorithms HPC algorithms are needed [1-5], and wanted (to do large-scale non-model MS based omics) - you can argue otherwise and that is okay - but as computational scientists our objective is to come up with the most efficient way to deal with computations. Good example would be matrix multiplications (and the 50 years of literature on it).

Over the years, number of high-performance computing (HPC) algorithms have been proposed to mitigate this scalability problem with the objective of improved efficiency of underlying arithmetic operations. However, bottleneck for many HPC algorithms have shifted from computational arithmetic operations to communication costs of moving the data between hierarchy of memories, or between memory-distributed pro-cessors; which has dampened overall effectiveness of the existing HPC workflows [6,7,8]. Even grappling with scalability issues, communication-avoiding HPC algorithms that can exploit multi-layered parallelism of heterogenous architectures are nearly non-existent.

In order to make the MS based omics process more efficient we need to design, and develope communication-avoiding HPC algorithms which will be an ideal solution for scalable MS data analysis for non-model multispecies databases, that can translate into enormous search-space (several terabytes), against which MS data need to be matched.

How do we develop communication avoiding HPC algorithms you say? How bad the current HPC algorithms might be you say?

Will discuss this in the next blog post.

References: [1] Chuang Li, Tao Chen, Qiang He, Yunping Zhu, and Kenli Li. Mruninovo: an effi-cient tool for de novo peptide sequencing utilizing the hadoop distributed computingframework.Bioinformatics, 33(6):944–946, 12 2016.
[2] Ananth Kalyanaraman, William R. Cannon, Benjamin Latt, and Douglas J. Baxter.Mapreduce implementation of a hybrid spectral library-database search method forlarge-scale peptide identification.Bioinformatics, 27(21):3072–3073, 09 2011.
[3] Chuang Li, Kenli Li, Keqin Li, Xianghui Xie, and Feng Lin. Swpepnovo: An efficientde novo peptide sequencing tool for large-scale ms/ms spectra analysis.Internationaljournal of biological sciences, 15(9):1787, 2019.
[4] Lydia Ashleigh Baumgardner, Avinash Kumar Shanmugam, Henry Lam, Jimmy K Eng,and Daniel B Martin. Fast parallel tandem mass spectral library searching using gpuhardware acceleration.Journal of proteome research, 10(6):2882–2888, 2011.
[5] Brian Pratt, J Jeffry Howbert, Natalie I Tasman, and Erik J Nilsson. Mr-tandem:parallel x! tandem using hadoop mapreduce on amazon web services.Bioinformatics,28(1):136–137, 2012.
[6] Grey Ballard, Erin Carson, James Demmel, Mark Hoemmen, Nicholas Knight, and Oded Schwartz. Communication lower bounds and optimal algorithms for numericallinear algebra.Acta Numerica, 23:1, 2014.
[7] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Minimizing communi-cation in numerical linear algebra.SIAM Journal on Matrix Analysis and Applications,32(3):866–901, 2011.
[8] National Research Council et al.Getting up to speed: The future of supercomputing.National Academies Press, 2005