Jekyll2021-11-12T13:18:38-08:00https://prof-s.github.io/feed.xmlDr. Fahad SaeedComputational Scientists and Associate Professor of ComputingDr. Fahad Saeedfsaeed@fiu.eduWhy new generation of HPC algorithms are needed for Mass Spectrometry based omics - Part 32021-01-25T00:00:00-08:002021-01-25T00:00:00-08:00https://prof-s.github.io/posts/2021/01/blog-post-4<p>In my previous posts I argued that the current HPC methods for MS based omics have been designed to improve arithmetic efficiency. The posts can be found at these links: <a href="https://prof-s.github.io/posts/2020/10/whyhpc/">https://prof-s.github.io/posts/2020/10/whyhpc/</a> and <a href="https://prof-s.github.io/posts/2020/11/whyhpc-part2/">https://prof-s.github.io/posts/2020/11/whyhpc-part2/</a>.</p>
<p>We also argued that the current HPC methods for MS based omics data analysis need to be designed for communication avoidance, and we established that the current HPC algorithms are not optimal, i.e. a new communication-avoiding paradigm is needed to ensure that HPC algorithms scale effectively with the increasing size of the proteome database.</p>
<p>The questions that we are going to answer in this blog post are: a) Is it even algorithmically possible to do better than the current HPC methods? b) If yes, by how much?</p>
<p>The answers can be found in our recent pre-print, in which we prove that it is possible to obtain algorithmic workflows that are superior to the existing ones. We show that if the parallel algorithm is optimally designed and implemented, the theoretical framework dictates that a run-time complexity of $\Omega(n/p)$ is possible, as compared to the $\Omega(n)$ obtained by the current workflows. Here $n$ is the size of the theoretical database obtained by expanding the proteome using the search parameters, and $p$ is the number of processors.</p>
<p>Our pre-print can be accessed here: <a href="https://arxiv.org/pdf/2009.14123.pdf">https://arxiv.org/pdf/2009.14123.pdf</a></p>
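<p>To make the gap between $\Omega(n)$ and $\Omega(n/p)$ concrete, here is a minimal sketch. It is not the algorithm from the pre-print; it only counts how many database entries each processor has to hold and scan under two hypothetical layouts, one where every processor keeps the whole expanded database and one where the database is split evenly. The function names and the toy sizes are assumptions made purely for illustration.</p>

```python
def per_processor_entries_replicated(n: int, p: int) -> int:
    """Every one of the p processors holds all n database entries."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return n


def per_processor_entries_partitioned(n: int, p: int) -> int:
    """The n entries are split as evenly as possible; return the largest share."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return -(-n // p)  # ceiling division without floating point


if __name__ == "__main__":
    n = 1_000_000  # hypothetical number of theoretical database entries
    for p in (1, 4, 16, 64):
        print(p,
              per_processor_entries_replicated(n, p),
              per_processor_entries_partitioned(n, p))
```

<p>In the replicated layout the per-processor share stays at $n$ however large $p$ gets, while in the partitioned layout it shrinks as $n/p$.</p>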
<p>In this paper, we prove that the communication bound reached by the existing parallel algorithms is $\Omega\left(mn + \frac{2qr}{p}\right)$, where $m$ and $n$ are the dimensions of the theoretical database matrix, $q$ and $r$ are the dimensions of the spectra, and $p$ is the number of processors. We further prove that a <em>communication-optimal</em> strategy with fast memory $\sqrt{M} = mn + \frac{2qr}{p}$ can achieve $\Omega\left(\frac{2mnq}{p}\right)$, but that this has not been achieved by any existing parallel proteomics algorithm to date.</p>
<p>In the paper, to further validate our claim, we performed a meta-analysis of published parallel algorithms and their performance results. We show that sub-optimal speedups with an increasing number of processors are a direct consequence of not achieving the communication lower bounds proved in this paper. <strong>Consequently, we assert that a next generation of provably and demonstrably superior parallel algorithms is urgently needed for MS based large systems-biology studies, especially for meta-proteomics, proteogenomics, microbiome studies, and proteomics for non-model organisms</strong>.</p>
<p>Our hope is that this paper will excite the parallel computing community to further investigate parallel algorithms for highly influential MS based omics problems.</p>Dr. Fahad Saeedfsaeed@fiu.eduWhy new generation of HPC algorithms are needed for Mass Spectrometry based omics - Part 22020-11-04T00:00:00-08:002020-11-04T00:00:00-08:00https://prof-s.github.io/posts/2020/11/blog-post-1<p>In my previous post I argued that the current HPC methods for MS based omics have been designed to improve arithmetic efficiency. The post can be found at this link: <a href="https://prof-s.github.io/posts/2020/10/whyhpc/">https://prof-s.github.io/posts/2020/10/whyhpc/</a>.</p>
<p>We also argued that the current HPC methods for MS based omics data analysis need to be designed for communication avoidance, i.e. they are bottlenecked by the data that needs to be communicated across processing units rather than by compute efficiency.</p>
<p>But how much difference does that make? We will still have multiple processing units (that can be used for data parallelism) to get the results, and they are going to be faster, right?</p>
<p>The short answer is No.</p>
<p>Long answer. Here are the reasons:</p>
<p>1) All of the existing HPC methods work in the following way. Assume that there are N spectra that need to be processed, and that database D is used for processing. All of these methods divide the N spectra among P processors such that N/P spectra are processed on each processing unit. The underlying assumption is that either database D is divided (and replicated equally among processing units), or a (smaller) FASTA database is communicated and then expanded on each machine. In either case the database D is either communicated or expanded. This is where the problem lies. No matter what the current HPC methods do, they scale with a complexity of O(N/P + D), where D is not affected by P, i.e. no matter how many more processing units you add, they will not reduce the cost due to D. This brings us to the next point.</p>
<p>2) All of the HPC methods are developed for exploiting parallelism on distributed- and shared-memory architectures. None of them exhibit linear speedups with increasing size of the database (which is where the explosion in data takes place when you increase the number of search parameters), or even with an increasing number of processors or spectra. In other words, no matter how many processing units were added, they did not deliver the (at least) linear speedups that are essential for any reasonable parallel computing algorithm.</p>
<p>All of the current HPC methods exhibit speedups that are much less than linear. This is problematic because the HPC methods will not scale with increasing size of the database or an increasing number of processors, which is the current bottleneck for both closed- and open-database searches.</p>
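<p>The O(N/P + D) argument can be checked with a few lines of arithmetic. The sketch below is a toy model, not a measurement of any real tool: it assumes unit cost per spectrum and per database entry, and the function names are made up for illustration. It computes the modeled run time, the speedup over one processor, the parallel efficiency (speedup divided by P), and the ceiling that the speedup can never exceed because the D term does not shrink.</p>

```python
def modeled_time(n_spectra: float, db_size: float, p: int) -> float:
    """Toy cost model from the text: N/P shrinks with P, the D term does not."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return n_spectra / p + db_size


def speedup(n_spectra: float, db_size: float, p: int) -> float:
    """Modeled time on one processor divided by modeled time on p processors."""
    return modeled_time(n_spectra, db_size, 1) / modeled_time(n_spectra, db_size, p)


def efficiency(n_spectra: float, db_size: float, p: int) -> float:
    """Speedup per processor; 1.0 would correspond to linear speedup."""
    return speedup(n_spectra, db_size, p) / p


def speedup_ceiling(n_spectra: float, db_size: float) -> float:
    """Limit of the speedup as p grows without bound: (N + D) / D."""
    if db_size <= 0:
        raise ValueError("db_size must be positive")
    return (n_spectra + db_size) / db_size


if __name__ == "__main__":
    n_spectra, db_size = 1000.0, 1000.0  # hypothetical unit-cost workloads
    for p in (1, 4, 16, 256):
        print(p, speedup(n_spectra, db_size, p), efficiency(n_spectra, db_size, p))
    print("ceiling:", speedup_ceiling(n_spectra, db_size))
```

<p>With N = D in this toy model, the speedup can never reach 2 regardless of how many processors are added, and the efficiency falls steadily as P grows.</p>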
<p>In order to make the MS based omics process more efficient we need to design and develop <strong>communication-avoiding</strong> HPC algorithms, which will be an ideal solution for scalable MS data analysis for non-model multispecies databases that can translate into an enormous search space (several terabytes) against which MS data need to be matched.</p>
<p>The HPC methods <strong>must</strong> exhibit at least linear speedups with increasing number of processors and database size.</p>
<p>In the next blog post I will discuss:</p>
<p>a) Is it even algorithmically possible to do better than the current HPC methods?
b) If yes, by how much?</p>Dr. Fahad Saeedfsaeed@fiu.eduWhy new generation of HPC algorithms are needed for Mass Spectrometry based omics - Part 12020-10-26T00:00:00-07:002020-10-26T00:00:00-07:00https://prof-s.github.io/posts/2020/10/blog-post-1<p>Everyone you talk to will tell you that database search algorithms for proteomics are an embarrassingly parallel problem that is easy to solve. You just need to distribute your data across multiple nodes, and you will get excellent parallelization and scalable solutions. This ideal scenario of scalability with the introduction of a few pragmas does not go far.</p>
<p>I argue in this blog post that <strong>we need a new generation of HPC algorithms for MS based omics</strong>.</p>
<p>For the past 30 years (or more), database-search algorithms that deduce peptides from Mass Spectrometry (MS) data have tried to improve computational efficiency in order to accomplish larger and more complex systems-biology studies. Poor scalability with increasing size of the theoretical search space and of the search parameters is a well-known problem in MS based omics. There are excellent works (like MSFragger) that try to reduce the computational complexity and make the process more efficient using smart indexing.</p>
<p>Even with these advanced serial algorithms, HPC algorithms are needed [1-5] and wanted (to do large-scale non-model MS based omics). You can argue otherwise, and that is okay, <em>but as computational scientists our objective is to come up with the most efficient way to deal with computations</em>. A good example is matrix multiplication (and the 50 years of literature on it).</p>
<p>Over the years, a number of high-performance computing (HPC) algorithms have been proposed to mitigate this scalability problem, with the objective of improving the efficiency of the underlying arithmetic operations. <strong>However, the bottleneck for many HPC algorithms has shifted from computational arithmetic operations to the communication costs of moving data between the hierarchy of memories, or between distributed-memory processors, which has dampened the overall effectiveness of the existing HPC workflows [6,7,8]</strong>. Even as the field grapples with scalability issues, communication-avoiding HPC algorithms that can exploit the multi-layered parallelism of heterogeneous architectures are nearly non-existent.</p>
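<p>One common way to reason about this shift, used in the communication-avoiding literature such as [6], is to model run time as an arithmetic term plus a bandwidth term plus a latency term. The sketch below is only an illustration of that model: the class and function names are my own, and the machine parameters and workload counts are hypothetical numbers, not measurements.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineCosts:
    gamma: float  # seconds per arithmetic operation
    beta: float   # seconds per word moved
    alpha: float  # seconds per message (latency)


def modeled_runtime(costs: MachineCosts, flops: float, words: float,
                    messages: float) -> float:
    """Run time = arithmetic cost + bandwidth cost + latency cost."""
    return costs.gamma * flops + costs.beta * words + costs.alpha * messages


def communication_fraction(costs: MachineCosts, flops: float, words: float,
                           messages: float) -> float:
    """Share of the modeled run time spent moving data rather than computing."""
    total = modeled_runtime(costs, flops, words, messages)
    if total <= 0:
        raise ValueError("modeled run time must be positive")
    return (costs.beta * words + costs.alpha * messages) / total


if __name__ == "__main__":
    # Hypothetical machine: moving a word costs ten times an arithmetic operation.
    machine = MachineCosts(gamma=1e-9, beta=1e-8, alpha=1e-6)
    print(communication_fraction(machine, flops=1e9, words=1e8, messages=1e3))
```

<p>When moving a word is much more expensive than an arithmetic operation, speeding up the arithmetic alone leaves most of the modeled run time untouched, which is the point made above.</p>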
<p>In order to make the MS based omics process more efficient we need to design and develop <strong>communication-avoiding</strong> HPC algorithms, which will be an ideal solution for scalable MS data analysis for non-model multispecies databases that can translate into an enormous search space (several terabytes) against which MS data need to be matched.</p>
<p>How do we develop communication-avoiding HPC algorithms, you say?
How bad might the current HPC algorithms be, you say?</p>
<p>I will discuss this in the next blog post.</p>
<hr />
<p><strong>References:</strong>
[1] Chuang Li, Tao Chen, Qiang He, Yunping Zhu, and Kenli Li. Mruninovo: an efficient tool for de novo peptide sequencing utilizing the hadoop distributed computing framework. Bioinformatics, 33(6):944–946, 12 2016.<br />
[2] Ananth Kalyanaraman, William R. Cannon, Benjamin Latt, and Douglas J. Baxter. Mapreduce implementation of a hybrid spectral library-database search method for large-scale peptide identification. Bioinformatics, 27(21):3072–3073, 09 2011.<br />
[3] Chuang Li, Kenli Li, Keqin Li, Xianghui Xie, and Feng Lin. Swpepnovo: An efficient de novo peptide sequencing tool for large-scale ms/ms spectra analysis. International journal of biological sciences, 15(9):1787, 2019.<br />
[4] Lydia Ashleigh Baumgardner, Avinash Kumar Shanmugam, Henry Lam, Jimmy K Eng, and Daniel B Martin. Fast parallel tandem mass spectral library searching using gpu hardware acceleration. Journal of proteome research, 10(6):2882–2888, 2011.<br />
[5] Brian Pratt, J Jeffry Howbert, Natalie I Tasman, and Erik J Nilsson. Mr-tandem: parallel x! tandem using hadoop mapreduce on amazon web services. Bioinformatics, 28(1):136–137, 2012.<br />
[6] Grey Ballard, Erin Carson, James Demmel, Mark Hoemmen, Nicholas Knight, and Oded Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1, 2014.<br />
[7] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901, 2011.<br />
[8] National Research Council et al. Getting up to speed: The future of supercomputing. National Academies Press, 2005.</p>Dr. Fahad Saeedfsaeed@fiu.eduFirst blog post2020-09-16T00:00:00-07:002020-09-16T00:00:00-07:00https://prof-s.github.io/posts/2020/09/blog-post-3<p>This is my first blog post. This blog is going to be used for science communication, and all the cool sciency things that I can think of.
Part of this blog is to post things that are related to our papers; other things are ideas that need to be floated and discussed; of course, my favourite part is to push scientific theories and agendas that I would like to pursue.
So there it is. Stay tuned and hopefully I will write something useful.</p>Dr. Fahad Saeedfsaeed@fiu.edu