Exhibits, Demos & Posters
On Building Scientific Workflow Systems for Data Management in the Cloud
- Yogesh Simmhan, Microsoft Research, San Francisco, California
- Roger Barga, Microsoft Research, Redmond, Washington
- Catharine van Ingen, Microsoft Research, San Francisco, California
- Ed Lazowska, University of Washington, Seattle, Washington
- Alex Szalay, Johns Hopkins University, Baltimore, Maryland
Scientific workflows have become an archetype for scientists to model in silico experiments and perform complex analyses in the cloud. While much research has gone into designing workflow systems for the scientist, diverse other users are engaged in end-to-end data management in the science cloud, and the scientific workflow model suits them well. One such role, the "data valet," manages and prepares raw scientific data arriving from instruments and sensors, converting it into a science-ready form for use by scientists. Workflows for data valets share data-intensive traits with traditional scientific workflows, yet differ significantly. For example, valet workflows require a greater degree of reliability since they operate at a larger scale with larger consequences. The type of provenance collected also differs: valets need to track the state of resources acted upon by workflows, while scientists require more semantic provenance for context. We compare and contrast these two classes of workflows, Science Application and Data Valet, through two exemplar e-Science projects: the Pan-STARRS Sky Survey project and the NEPTUNE Oceanography project. We use these to illustrate shared and unique requirements for scientific workflows to support data management in the cloud. Our analysis can guide the evolution of workflow systems to support emerging scientific applications; the Trident Scientific Workbench is one such system that directly benefits from this analysis.