Publications W. Hoschek
Abstract: In a large distributed system spanning administrative domains such as a Grid, it is desirable to maintain and query dynamic and timely information about active participants such as services, resources and user communities. The web services vision promises that programs are made more flexible and powerful by querying Internet databases (registries) at runtime in order to discover information and network attached third-party building blocks. Services can advertise themselves and related metadata via such databases, enabling the assembly of distributed higher-level components. In support of this vision, this thesis shows how to support expressive general-purpose queries over a view that integrates autonomous dynamic database nodes from a wide range of distributed system topologies.
We motivate and justify the assertion that realistic ubiquitous service and resource discovery requires a rich general-purpose query language such as XQuery or SQL. Next, we introduce the Web Service Discovery Architecture (WSDA), which subsumes an array of disparate concepts, interfaces and protocols under a single semi-transparent umbrella. WSDA specifies a small set of orthogonal multi-purpose communication primitives (building blocks) for discovery. These primitives cover service identification, service description retrieval, data publication as well as minimal and powerful query support. The individual primitives can be combined and plugged together by specific clients and services to yield a wide range of behaviors and emerging synergies. Based on WSDA, we introduce the hyper registry, which is a centralized database node for discovery of dynamic distributed content, supporting XQueries over a tuple set from an XML data model. We address the problem of maintaining dynamic and timely information populated from a large variety of unreliable, frequently changing, autonomous and heterogeneous remote data sources.
However, in a large cross-organizational system, the set of information tuples is partitioned over many such distributed nodes, for reasons including autonomy, scalability, availability, performance and security. This suggests the use of Peer-to-Peer (P2P) query technology. Consequently, we take the first steps towards unifying the fields of database management systems and P2P computing. As a result, we propose the WSDA-based Unified Peer-to-Peer Database Framework (UPDF) and its associated Peer Database Protocol (PDP), which are unified in the sense that they allow specific applications to be expressed for a wide range of data types (typed or untyped XML, any MIME type), node topologies (e.g. ring, tree, graph), query languages (e.g. XQuery, SQL), query response modes (e.g. Routed, Direct and Referral Response), neighbor selection policies, pipelining, timeout and other scope characteristics. The uniformity and wide applicability of our approach distinguish it from related work, which (1) addresses some but not all problems, and (2) does not propose a unified framework.
Book Chapters and Journals
Abstract: In this paper, we propose the Web Service Discovery Architecture (WSDA). At runtime, Grid applications can use this architecture to discover and adapt to remote services. WSDA promotes an interoperable web service discovery layer by defining appropriate services, interfaces, operations and protocol bindings, based on industry standards. It is unified because it subsumes an array of disparate concepts, interfaces and protocols under a single semi-transparent umbrella. It is modular because it defines a small set of orthogonal multi-purpose communication primitives (building blocks) for discovery. These primitives cover service identification, service description retrieval, data publication as well as minimal and powerful query support. The architecture is open and flexible because each primitive can be used, implemented, customized and extended in many ways. It is powerful because the individual primitives can be combined and plugged together by specific clients and services to yield a wide range of behaviors and emerging synergies.
Abstract: In wide area computing systems, it is often desirable to create remote read-only copies (replicas) of files. Replication can be used to reduce access latency, improve data locality, and/or increase robustness, scalability and performance for distributed applications. We define a replica location service (RLS) as a system that maintains and provides access to information about the physical locations of copies. An RLS typically functions as one component of a data grid architecture. This paper makes the following contributions. First, we characterize RLS requirements. Next, we describe a parameterized architectural framework, which we name Giggle (for GIGa-scale Global Location Engine), within which a wide range of RLSs can be defined. We define several concrete instantiations of this framework with different performance characteristics. Finally, we present initial performance results for an RLS prototype, demonstrating that RLS systems can be constructed that meet performance goals.
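As a rough illustration of the RLS definition above (a system that maintains and provides access to information about the physical locations of copies), the following minimal Python sketch keeps an in-memory index from logical names to replica locations. It is not the Giggle framework or its prototype; the class name, method names and location strings are all hypothetical.

```python
from collections import defaultdict


class ReplicaLocationService:
    """Toy in-memory index from logical file names to replica locations.

    Hypothetical sketch only: it illustrates the mapping an RLS maintains,
    not the architecture or API described in the paper.
    """

    def __init__(self):
        # logical name -> set of physical locations holding a copy
        self._replicas = defaultdict(set)

    def register(self, logical_name, physical_location):
        """Record that a copy of logical_name exists at physical_location."""
        self._replicas[logical_name].add(physical_location)

    def unregister(self, logical_name, physical_location):
        """Forget one copy; drop the logical name once no copies remain."""
        locations = self._replicas.get(logical_name)
        if locations is None:
            return
        locations.discard(physical_location)
        if not locations:
            del self._replicas[logical_name]

    def lookup(self, logical_name):
        """Return the known physical locations, sorted for stable output."""
        return sorted(self._replicas.get(logical_name, ()))


rls = ReplicaLocationService()
rls.register("run42/events.dat", "site-a:/data/events.dat")  # made-up names
rls.register("run42/events.dat", "site-b:/cache/events.dat")
locations = rls.lookup("run42/events.dat")
```

A client would pick one of the returned locations, for example the one with the lowest expected access latency.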
Abstract: The most straightforward but also most inflexible configuration approach for invocation of remote services is to hard wire the location, interface, behavior and other properties of remote services into the local application. Loosely coupled decentralized systems call for solutions that are more flexible and can seamlessly adapt to changing conditions. While advances have recently been made in the field of web service specification, invocation and registration, the problem has so far received little systematic conceptual attention. In this paper, we outline seven web service problem areas and their associated processing steps, namely description, presentation, publication, request, discovery, brokering and execution.
Abstract: A variety of query response modes can be used to return matching query results from Peer-to-Peer nodes to an originator. Although from the functional perspective all response modes are equivalent, no mode is optimal under all circumstances. Which query response modes allow suitable trade-offs to be expressed for a wide range of P2P applications? In this paper, we answer this question by systematically describing and characterizing four query response modes, namely Routed Response, Direct Response, Routed Metadata Response, and Direct Metadata Response. The response modes are compared with respect to distribution and location transparency, efficiency of query support, economics, number of TCP connections at originator and agent, latency, caching and trust delegation to unknown parties. We discuss to what extent a given P2P network must mandate the use of any particular response mode throughout the system. As a result, we propose that response modes can be mixed by switches and shifts, in arbitrary permutations.
Abstract: In a Peer-to-Peer (P2P) network, non-pipelining query result set delivery without a dynamic abort timeout feature is highly unreliable due to what we propose as the simultaneous abort problem. To address the problem, we derive dynamic abort timeouts using exponential decay with halving as the policy. This ensures that a maximum number of results can be delivered reliably within the time frame desired by a user. We establish that a timeout for loop detection in query routes must be static. A dynamic timeout is unsuitable as a loop timeout, due to what we propose as the non-simultaneous loop timeout problem. In a P2P network, a node forwards a query to the set of nodes obtained from neighbor selection. Using neighbor selection, explicit topology characteristics can be exploited in query guidance. In addition to the normal query, the user defines a neighbor selection query (XQuery) that takes the tuple set of the current node as input and returns a subset that indicates the nodes selected for forwarding. A wide range of policies can be implemented in this manner, as the neighbor selection policy can draw from the rich set of information contained in the tuples published to the node.
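The two mechanisms in this abstract can be sketched in a few lines of Python. The sketch assumes that "exponential decay with halving" means each node forwards the query with half of its own abort timeout, so that downstream nodes abort earlier than upstream ones and partial results still have time to travel back. The abstract does not spell out the derivation, so this is one plausible reading. The neighbor selection query is XQuery in the paper; here a plain Python predicate stands in for it, and the tuple fields (`link`, `domain`) are hypothetical.

```python
def dynamic_abort_timeouts(initial_timeout, hops):
    """Abort timeout at each hop, assuming it is halved on every forward.

    Index 0 is the originator's timeout; index k is the timeout at hop k.
    """
    return [initial_timeout / (2 ** hop) for hop in range(hops + 1)]


def select_neighbors(tuples, predicate):
    """Return the links of the tuples chosen for query forwarding.

    `tuples` plays the role of the current node's tuple set; `predicate`
    stands in for the user-defined neighbor selection query.
    """
    return [t["link"] for t in tuples if predicate(t)]


timeouts = dynamic_abort_timeouts(8.0, 3)

# Hypothetical tuple set published to the current node.
node_tuples = [
    {"link": "node-a", "domain": "cern.ch"},
    {"link": "node-b", "domain": "example.org"},
    {"link": "node-c", "domain": "cern.ch"},
]
# Example policy: forward only to neighbors in one administrative domain.
selected = select_neighbors(node_tuples, lambda t: t["domain"] == "cern.ch")
```

Under this reading the loop timeout would be carried unchanged from hop to hop, since the abstract establishes that it must be static.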
Abstract: In a Peer-to-Peer (P2P) network, a link topology such as a ring, tree or graph describes the link structure among autonomous nodes. A node deployment model defines where and how one or more partitions of the link topology are physically running, stored and accessed. Link topology and node deployment are distinct and orthogonal concepts, and hence a node deployment model need not correspond to a link topology at all. The simplest (and most common) deployment model has distinct nodes running on distinct hosts. In this paper, we propose that nodes can also be concentrated in node containers, which are transparent software hosting environments that embed one or more virtual nodes. Node deployment models range from centralized to fully distributed. Virtual hosting has the potential for increased query performance (as opposed to increased scalability). Thus, we introduce the separate P2P query scope parameters logical radius and physical radius, as well as three novel query execution strategies that transparently exploit the properties of virtual hosting. The key idea is to reduce or eliminate the need for messaging between container-internal nodes and to run as few queries as possible against the database of shared nodes. Under normal query execution and under collecting traversal, a query to a container node can be efficiently answered without violating the semantics of query and scope. Under the quick scope violating query strategy, a query can be answered even more efficiently, by relaxing the conditions imposed by the query scope.
Abstract: In a distributed system such as a DataGrid, it is often desirable to maintain and query dynamic and timely information about active participants such as services, resources and user communities. This enables information discovery and collective collaborative functionality that operate on the system as a whole, rather than on a given part of it. However, it is not obvious how a database (registry) should maintain information populated from a large variety of unreliable, frequently changing, autonomous and heterogeneous remote data sources. In particular, how can one avoid sacrificing reliability, predictability and simplicity while allowing powerful queries to be expressed over time-sensitive dynamic information? We propose the so-called hyper registry, which has a number of key properties. An XML data model allows for structured and semi-structured data, which is important for integration of heterogeneous content. The XQuery language allows for powerful searching, which is critical for non-trivial applications. Database state maintenance is based on soft state, which enables reliable, predictable and simple content integration from a large number of autonomous distributed content providers. Content link, content cache and a hybrid pull/push communication model allow for a wide range of dynamic content freshness policies, which may be driven by all three system components: content provider, hyper registry and client.
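Soft state maintenance, as named in this abstract, can be illustrated with a minimal Python sketch: a published tuple stays in the registry only until its time-to-live runs out, so a provider that crashes or disappears is forgotten without any explicit removal. This is a generic illustration of the soft-state idea, not the hyper registry's interface; the class, its methods and the tuple contents are hypothetical, and a Python predicate stands in for an XQuery.

```python
import time


class SoftStateRegistry:
    """Toy registry whose tuples expire unless the provider republishes them."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so expiry can be simulated
        self._tuples = {}            # content link -> (content, expiry time)

    def publish(self, link, content, ttl):
        """Insert or refresh a tuple; it is valid for `ttl` seconds from now."""
        self._tuples[link] = (content, self._clock() + ttl)

    def _purge_expired(self):
        now = self._clock()
        expired = [link for link, (_, expiry) in self._tuples.items()
                   if expiry <= now]
        for link in expired:
            del self._tuples[link]

    def query(self, predicate=lambda content: True):
        """Return the live tuples whose content satisfies the predicate."""
        self._purge_expired()
        return {link: content
                for link, (content, _) in self._tuples.items()
                if predicate(content)}


# Simulated clock so the example does not have to sleep.
now = [0.0]
registry = SoftStateRegistry(clock=lambda: now[0])
registry.publish("svc-a", {"type": "scheduler"}, ttl=10)
registry.publish("svc-b", {"type": "storage"}, ttl=5)

now[0] = 6.0                       # svc-b was never refreshed and has expired
live_after_6s = sorted(registry.query())
```

Republishing a tuple before it expires simply extends its lifetime, which is what lets many autonomous providers keep their content fresh without coordinating with the registry.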
Abstract: In a large distributed system spanning many administrative domains such as a DataGrid, it is often desirable to maintain and query dynamic and timely information about active participants such as services, resources and user communities. However, in such a database system, the set of information tuples in the universe is partitioned over one or more distributed nodes, for reasons including autonomy, scalability, availability, performance and security. It appears that a Peer-to-Peer (P2P) database network may be well suited to support dynamic distributed database search, for example for service discovery. In this paper, we devise a framework that is unified in the sense that it allows us to express specific applications for a wide range of data types, node topologies, query languages, query response modes, neighbor selection policies, pipelining characteristics, timeout and other scope options.
Abstract: In a large distributed system spanning administrative domains such as a DataGrid, it is often desirable to maintain and query dynamic and timely information about active participants such as services, resources and user communities. However, in such a database system, the set of information tuples in the universe is partitioned over one or more distributed nodes, for reasons including autonomy, scalability, availability, performance and security. This suggests the use of Peer-to-Peer (P2P) query technology. In this paper, we develop a messaging, communication and network protocol model for the P2P database framework and the hyper registry proposed in our prior studies.
Abstract: In a large heterogeneous distributed system spanning administrative domains such as a DataGrid, it is often desirable to maintain and query dynamic and timely information about active participants such as services, resources and user communities. This paper develops a suitable database and query model as well as a generic and dynamic data model for such database systems. Example service discovery queries are given. Three query types are identified, namely simple, medium and complex. An appropriate query language (XQuery) is suggested. The suitability of the query language is demonstrated by formulating the example prose queries in the language. Detailed requirements for a query language supporting service and resource discovery are given. The capabilities of various query languages are compared.