Lawrence Berkeley National Lab

Stampede: Middleware for Monitoring and Troubleshooting of Large-Scale Applications on National Cyberinfrastructure

PI: Ewa Deelman
Collaborators: Ewa Deelman, Information Sciences Institute; Christopher Brooks, University of San Francisco; Dan Gunter, LBNL; Martin Swany, University of Delaware
website: Stampede wiki

Large-scale applications today use distributed resources to support their computations and, as part of their execution, generate large amounts of log information. Until now, we have used the NetLogger analysis tools to perform offline log analysis. Stampede extends this offline workflow log analysis capability into a comprehensive middleware solution that will allow users of complex scientific applications to track the status of their jobs in real time, to detect execution anomalies automatically, and to troubleshoot online without logging in to remote nodes or searching through thousands of log files.

We build on an important class of applications, scientific workflows, which are used today in a number of scientific disciplines, including astronomy, biology, ecology, earthquake science, gravitational-wave physics, and many others, and which run on today's large-scale infrastructure such as the OSG or the TeraGrid. The solution will be modular, distributed, and reusable across a broad class of applications and workflow systems.

The system will be able to capture application-level logs from jobs as they are executing on the cyberinfrastructure. At the same time, it will also collect log information from the underlying cyberinfrastructure services, such as resource management and data transfer. These end-to-end logs will be combined and brokered through a subscription interface. External components will use the subscription interface to provide monitoring services.
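As a rough illustration of the subscription idea described above, the following Python sketch shows a toy in-process broker that merges application-level and infrastructure-level log events and routes them to subscribers. It is not the Stampede implementation: the class `LogBroker`, its `subscribe` and `publish` methods, the dotted event names, and the event fields (`job_id`, `status`, `file`) are all hypothetical names chosen for this example, and a real deployment would presumably use a networked message bus rather than in-memory callbacks.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class LogBroker:
    """Toy broker (hypothetical): routes log events by event-name prefix."""

    def __init__(self) -> None:
        # Maps a subscription prefix to the handlers registered for it.
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, prefix: str, handler: Handler) -> None:
        """Register a handler for every event whose name starts with prefix."""
        self._subscribers[prefix].append(handler)

    def publish(self, event: Event) -> int:
        """Deliver an event to all matching handlers; return the delivery count."""
        delivered = 0
        for prefix, handlers in self._subscribers.items():
            if event["event"].startswith(prefix):
                for handler in handlers:
                    handler(event)
                    delivered += 1
        return delivered


broker = LogBroker()
failures: List[Event] = []
transfers: List[Event] = []


def record_failure(event: Event) -> None:
    # A monitoring component that keeps only jobs with a nonzero exit status.
    if event.get("status") != "0":
        failures.append(event)


# One subscriber watches application-level job events, another watches
# infrastructure-level data transfer events (event names are made up).
broker.subscribe("workflow.job.", record_failure)
broker.subscribe("transfer.", transfers.append)

broker.publish({"event": "workflow.job.end", "job_id": "j1", "status": "0"})
broker.publish({"event": "workflow.job.end", "job_id": "j2", "status": "1"})
broker.publish({"event": "transfer.end", "file": "f.dat", "status": "0"})
```

In this sketch the failure monitor never has to read log files on the node where job `j2` ran; it only sees the events the broker forwards to it, which is the decoupling the subscription interface is meant to provide.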

This work is supported by the NSF under grant OCI-0943705.


 

Publications

Please see the publications page.