Summary: Bioinformaticians are tackling increasingly computation-intensive tasks. Meanwhile, workstations are shifting towards multicore architectures, and massively multicore machines may soon be the norm. Bag-of-Tasks (BoT) applications, which consist of a large number of independent computation-intensive tasks, are commonly encountered in bioinformatics. This note introduces PAR, a scalable, dynamic, parallel and distributed execution engine for Bag-of-Tasks applications. PAR is aimed at multicore architectures and small clusters. Speedups obtained with PAR on two different applications are shown.
Availability: PAR is released under the GNU General Public License version three and can be freely downloaded.
1 Introduction
Bioinformaticians are significant users of high-performance computing, in particular for simulations of biological phenomena. At the same time, the available hardware is getting faster but also much more parallel (Intel publicly reported working on 80-core prototype chips in 2007). In this context, most bioinformaticians could benefit from easy-to-use software to harness such computing power.
The focus of this note is the execution of Bag-of-Tasks (BoT) applications. As the name suggests, a BoT application can be seen as a bag filled with tasks to do, each independent of all the others. Middleware for BoT applications is called a job crusher. It consists of at least a server component connected to a set of clients.
This note introduces PAR, a parallel and distributed job crusher working in pull mode and inspired by desktop grid platforms. Workers join the computation and can be added dynamically at run-time; the server delivers tasks to whichever workers are available at a given moment. PAR is in effect a transposition of some concepts and features from earlier distributed middleware to small HPC clusters and multicore workstations.
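The pull-mode idea described above can be illustrated with a minimal sketch. This is not PAR's implementation: the function name `run_bag_of_tasks`, the `n_workers` parameter and the use of in-process threads as stand-ins for workers are all assumptions made for illustration. A real job crusher would typically run tasks as separate processes, possibly on remote machines.

```python
import queue
import threading


def run_bag_of_tasks(tasks, n_workers=4):
    """Run independent zero-argument callables in pull mode.

    Hypothetical helper for illustration only; returns the task
    results in the same order as the input list.
    """
    # The "bag": a thread-safe queue holding (index, task) pairs.
    bag = queue.Queue()
    for index, task in enumerate(tasks):
        bag.put((index, task))

    results = {}
    results_lock = threading.Lock()

    def worker():
        # Pull mode: an idle worker asks for the next task itself,
        # so faster workers naturally process more tasks.
        while True:
            try:
                index, task = bag.get_nowait()
            except queue.Empty:
                return  # the bag is empty, so this worker leaves
            outcome = task()
            with results_lock:
                results[index] = outcome

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    return [results[i] for i in range(len(tasks))]


if __name__ == "__main__":
    # Ten independent toy tasks; each one squares a number.
    toy_tasks = [lambda n=n: n * n for n in range(10)]
    print(run_bag_of_tasks(toy_tasks, n_workers=3))
```

Because workers fetch tasks themselves rather than receiving a fixed share up front, no scheduling decision is needed on the server side beyond handing out the next task, and load balancing follows from each worker's own speed.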
This paper is organized as follows: Section 2 presents an overview of related projects and technologies used in bioinformatics. Section 3 presents two examples using PAR to illustrate scalability. The last section lists upcoming enhancements.
2 Related projects
A wide variety of tools and technologies have been used over the last two decades in bioinformatics. While PAR is a user-level tool with its own niche, it has some limitations. At the cost of a little more complexity, some of the tools listed hereafter offer fair sharing of resources, stronger reliability and even higher job or data throughput.
At the programming level, the Message-Passing Interface (MPI, Forum (1994)), CORBA (Object Management Group (1998)) or even MapReduce (Dean and Ghemawat (2004)) are noteworthy technology candidates.
MPI has become the de facto standard for programming highly parallel applications. It has been used in computational genomics (Swain et al. (2005)) and in molecular dynamics (Johnston et al. (2005); de Lomana et al. (2008)).
For applications following a client-server model, CORBA can be used. It has been applied successfully to the handling of genome maps (Hu et al. (1998), Jungfer and Rodriguez-Tomé (1998)).
For data-intensive applications, MapReduce and its open source implementation Hadoop are more appropriate. They enable operations over huge amounts of data and were used recently in sequence alignment (Sadasivam and Baktavatchalam (2010)).
However, at the application level, Desktop Grids (DG) are closer to the focus of this note. A server distributes tasks to workers located on machines that do not communicate with each other, potentially anywhere on the Internet. Condor (Litzkow et al. (1988)), XtremWeb (Fedak et al. (2001)) and BOINC (Anderson (2004)) are three platforms for highly parallel, multiuser applications. One of the best-known DG projects in bioinformatics is probably Folding@home (Beberg et al. (2009)).