FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems

FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems

Fisk, Michael E. and Hash, Curtis L.
LANL

Michael E. Fisk and Curtis L. Hash 2014. FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems. Proceedings of the 4th International Workshop on Cloud Data and Platforms (Amsterdam, the Netherlands, Apr. 2014), to appear.

Abstract

In this paper we present FileMap, an open-source,1 alternative map-reduce-based computing system that we have developed and utilized over the last 5 years. This system features several significant design decisions and performance aspects that are not found in prevalent map-reduce systems such as Hadoop [16]. The prevailing design goal is to have a system for scheduling and orchestrating parallel and distributed data processing, but that does not interpose itself between data and the serial programs that process data. FileMap manages the organization of input and output files and the scheduling of program execution, but does not process files itself and is agnostic to the format in which data is stored and/or indexed. We layer on top of existing, ubiquitous file systems, security models, and network access software in order to minimize the complexity of FileMap and maximize the ability of its users to benefit from specialized compute platforms, file systems, and software. We measure the performance of FileMap in several instantiations including a heterogeneous “cloud” conglomeration of computers and storage distributed across multiple owning organizations with no cross-organization trust or synchronization. This “cloud” model intentionally supports distributed sensor systems in which nodes collect their own data and participate in analysis of data by moving map/reduce processing upstream to where the data is collected. Our on SMP systems, clusters built for Hadoop, and this distributed cloud, show that FileMap outperforms more prevalent computing systems and models by factors between 2x (compared to Hadoop) and 14x (cloud vs. centralized). FileMap provides a single programming model that allows processing to seamlessly scale from a single laptop to data-intensive compute clusters to distributed sensing and analysis clouds. This paper introduces our fundamental design decisions in Section 1 and specifics of the implementation in Section 2. Section 3 describes the use of embedded databases within our system. Section 4 presents a performance benchmark measured across a variety of computing environments.

Reference

@inproceedings{Fisk14a,
  author = {Fisk, Michael E. and Hash, Curtis L.},
  title = {FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems},
  booktitle = {Proceedings of the 4th International Workshop on Cloud Data and Platforms},
  year = {2014},
  sortdate = {2014-04-01},
  pages = {to appear},
  month = apr,
  address = {Amsterdam, the Netherlands},
  publisher = {ACM},
  doi = {http://dx.doi.org/10.1145/2592784.2592790},
  location = {johnh: pafile},
  keywords = {map/reduce, file map, lanl, retro-future},
  myorganization = {LANL},
  codeurl = {http://mfisk.github.io/filemap},
  project = {ant, retrofuture}
}