Stampede: Middleware for Monitoring and Troubleshooting of Large-Scale Applications on National Cyberinfrastructure
|Collaborators:||Ewa Deelman, Information Sciences Institute; Christopher Brooks, University of San Francisco; Dan Gunter, LBNL; Martin Swany, University of Delaware|
Large-scale applications today make use of distributed resources to support computations and as part of their execution, generate large amounts of log information. Up to now, we have been using the Netlogger analysis tools to perform off-line log analysis. Stampede extends the current offline workflow log analysis capability and develops a comprehensive middleware solution that will allow users of complex scientific applications to track the status of their jobs in real time, to detect execution anomalies automatically, and to perform on-line troubleshooting without logging in to remote nodes or searching through thousands of log files.
We build on an important class of applications, scientific workflows, that are being used today in a number of scientific disciplines including astronomy, biology, ecology, earthquake science, gravitational-wave physics, and many others that are running on today's large-scale infrastructure such as the OSG or the TeraGrid. This solution will be modular and distributed, and reusable across a broad class of applications and workflow systems.
The system will be able to capture application-level logs from jobs as they are executing on the cyberinfrastructure. At the same time, it will also collect log information from the underlying cyberinfrastructure services, such as resource management and data transfer. These end-to-end logs will be combined and brokered through a subscription interface. External components will use the subscription interface to provide monitoring services.
This work is supported by the NSF under grant OCI-0943705