Saturday, February 28, 2004

I had an interview with Dr. Jack Edwards on 2/27/03. Dr. Edwards conducts physics research and has used a large number of diverse HPC systems throughout his career. I explained the problem of expressing reliability preferences as a query on a set of service data.

Dr. Edwards said that his most important concern was achieving the shortest time to application execution. For him, the exclusive criterion for service selection would be which service can begin in the shortest amount of time.

He believes the "time-to-start" is important because you cannot predict the activity of other users, and a system may go down during the execution of a program. He says that it should not matter if a service has failed for other users, and that past performance does not predict future rates of failure. Users have different parameters and different data sets, and a service call may execute an entirely different application from one user to the next.

I showed Dr. Edwards an example query that prefers the service with the lowest OutstandingRequests and then uses AvgRunTime and Exceptions only in case of a tie. He said that this query is "about right."

I had an interview with Dr. Gary Howell on 2/15/03. I described the problem of expressing reliability preferences as a query on a set of service data. I explained that I wished to relieve the user of selecting a service replica by creating some pre-built queries that would do a 'reasonable' job of selecting a service instance.

Dr. Howell told me that perhaps it was not such a good idea to take endpoint selection choices away from the user. He said, "It's really a matter of individual preference." He went on to say that some users may highly value a service with a faster total run-time, some may value a service with the shortest queue, and others will value a service with the most reliable history.

He suggested that I borrow an idea from Operations Research. He referred me to a graduate OR textbook. The essence of this approach is to construct a function to translate speed and reliability into a common comparable good.
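One way to picture such a function is sketched below. This is only my illustration, not the textbook's method: it assumes each attempt fails independently, that a failed request is simply retried on the same replica, and that a failed attempt costs a full queue-plus-run cycle. Under those assumptions reliability can be converted into time, so speed and reliability end up in the same unit (expected seconds until a successful response). The function name and the sample numbers are hypothetical.

```python
def expected_completion_time(avg_queue_time, avg_run_time, total_requests, exceptions):
    """Expected seconds until one successful response.

    Assumes independent failures and that every failed attempt costs a
    full queue + run cycle before the retry (an illustrative assumption).
    """
    if total_requests == 0:
        return float("inf")  # no history: treated as least preferred (assumption)
    success_rate = 1.0 - exceptions / total_requests
    if success_rate <= 0:
        return float("inf")  # never succeeded so far
    # Expected number of attempts is 1 / success_rate.
    return (avg_queue_time + avg_run_time) / success_rate


# Hypothetical replicas: one quick but failure-prone, one slower but dependable.
fast_but_flaky = expected_completion_time(10, 50, 1000, 400)
slow_but_solid = expected_completion_time(30, 60, 1000, 10)
preferred = "slow_but_solid" if slow_but_solid < fast_but_flaky else "fast_but_flaky"
```

With these made-up numbers the slower replica comes out ahead, because the flaky one is expected to need more than one attempt. A user who weighs the goods differently would simply substitute a different function.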

Wednesday, February 18, 2004

I had a talk with Dr. Eric Sills, who is an expert on HPC systems, to help determine a set of queries against a service data registry for service replica selection. We spoke on Feb. 17, 2004.

We established that in order to create a set of service replica selection preferences expressed as queries, we would need to make a couple of assumptions. The first concerns the freshness of the service data. The number of outstanding requests and the server load can change drastically from one second to the next. He suggested 30 seconds as a good service data lifetime, and said “any more than that and you hurt the performance of the jobs you’re monitoring.”
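A client could enforce the freshness assumption with a simple age check before running any selection query. The sketch below is hypothetical: it assumes each registry entry carries a `published_at` timestamp in seconds, which is a field I made up for illustration.

```python
import time

MAX_AGE_SECONDS = 30  # service data lifetime suggested by Dr. Sills


def fresh_entries(entries, now=None, max_age=MAX_AGE_SECONDS):
    """Keep only registry entries published within the last max_age seconds."""
    now = time.time() if now is None else now
    return [e for e in entries if now - e["published_at"] <= max_age]


# Hypothetical entries; timestamps are plain seconds for readability.
entries = [
    {"endpoint": "replica-a", "published_at": 1000.0},
    {"endpoint": "replica-b", "published_at": 960.0},
]
usable = fresh_entries(entries, now=1005.0)
```

Here only replica-a survives the filter, since replica-b's data is 45 seconds old.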

The other assumption is that a small difference in server load, failure rate, or queue length should be considered negligible. A difference of a few OutstandingRequests, AvgProcUtilization points, or AvgLoad points should not be considered. So how much is a small difference? Sills suggests deviations of 5-10%.

The general approach that Sills advocated was to look at a primary criterion (OutstandingRequests) and then use a secondary (AvgRunTime) and a tertiary (Exceptions) criterion in case there is more than one result within 5-10% of the top result.
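The tiered approach with a tolerance band could look roughly like the sketch below. It is my reading of the conversation, not something Dr. Sills wrote: the function names are hypothetical, the tolerance is taken relative to the best (lowest) value at each step, and lower is assumed to be better for all three criteria.

```python
def within_tolerance(candidates, key, tolerance):
    """Keep candidates whose value for `key` is within `tolerance`
    (a fraction such as 0.10) of the lowest value."""
    best = min(c[key] for c in candidates)
    cutoff = best * (1 + tolerance)
    return [c for c in candidates if c[key] <= cutoff]


def select_replica(candidates, tolerance=0.10):
    """Narrow the candidates by the primary, secondary and tertiary
    criteria in turn, stopping as soon as one candidate remains."""
    remaining = list(candidates)
    for key in ("OutstandingRequests", "AvgRunTime", "Exceptions"):
        remaining = within_tolerance(remaining, key, tolerance)
        if len(remaining) == 1:
            break
    return remaining[0]  # any survivors are considered equivalent


# Hypothetical service data for three replicas.
replicas = [
    {"endpoint": "a", "OutstandingRequests": 100, "AvgRunTime": 40.0, "Exceptions": 3},
    {"endpoint": "b", "OutstandingRequests": 105, "AvgRunTime": 30.0, "Exceptions": 9},
    {"endpoint": "c", "OutstandingRequests": 150, "AvgRunTime": 10.0, "Exceptions": 0},
]
chosen = select_replica(replicas)
```

With a 10% band, replicas a and b count as tied on OutstandingRequests, so AvgRunTime decides and b is chosen. With a tolerance of 0 the same data would select a. Note that when the best value is 0 the band collapses and only exact ties survive, which may or may not be what a user wants.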

He justified his approach as the one that would get the service run completed the fastest. However, he also explained that the right replica selection query depends on how long the service may take to run. He felt that for large jobs the hardware requirements are so great that you won’t have many choices. But for small jobs, the differences in AvgCueTime will not be significant.

Monday, February 09, 2004

I’ve spent the last few weeks studying software reliability engineering to construct the fault-tolerance and replica-selection grid service data models. I’ve been gathering background info on the concept of The Operational Profile and Fault-Tolerant Software Reliability Engineering.

Here’s some draft thesis text on Replica Selection in Grid Systems.


4.2 A Model for Replica Selection


File Replica Selection

Replica Selection is the mechanism for selecting the best copy of a duplicated resource, such as a file or grid service instance. The Globus Consortium literature uses the term replica selection more narrowly as a data grid concept:

A replica selection service within a data grid responds to requests for the “best” copy of files that are replicated on multiple storage systems. Here, information sources can once again include system configuration, instantaneous performance, and predictions, but for storage systems and networks rather than computers. [Czajkowski01].

Other literature, such as Ian Foster’s A Decentralized, Adaptive, Replica Location Service, describes the similarities between selecting a host for filesharing within the Gnutella network and selecting the best grid resource. As this paper acknowledges, P2P networks have weak QoS guarantees and are weakened by the free-rider problem. When a list of fileshares is presented in a P2P client, hosts are ranked by available bandwidth and by the success of previous requests. Later in this chapter, I will illustrate how this information can be expressed as a grid service data XSD schema.

Replica Selection is discussed in the context of Chimera astrophysics applications and a proposed Replica Selection Service [Foster02, Vazhkudai01]. The conclusion of these conference papers is that a software component called the “Storage Broker” will communicate with replicas and base a decision on a set of performance-predicting metadata.

In the [Foster02, Vazhkudai01] replica selection architecture, a broker, or intermediate application, is responsible for communicating with replicas, which must advertise their properties through a MetaData Repository or Classified Ads. However, [Czajkowski01] acknowledges that replica selection is just one of several uses of a MetaData Repository.

The MetaData Repository allows for metadata from all the Replica Sites to be aggregated, and thus allows the client to compare one replica with another. The metadata is illustrated as information about the hardware such as storage capabilities, volume, and bandwidth. A client may want to search, match, or invoke services from a number of replica sites.

An interesting question that these papers raise is how the client chooses a single replica. Does a client prefer reliability over performance, bandwidth over memory volume, Sun over Hewlett-Packard?

These questions become more than theoretical when constructing XQuery/XPath expressions to select a GSH endpoint from a VOregistry. Such questions become design issues when a collection of workflow components shares service data in the form of an XSD vocabulary. Therefore, in this chapter I attempt to construct models for replica selection within a scientific workflow application.
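To make the design issue concrete, here is a small sketch of selecting an endpoint from aggregated XML. Everything about the document shape is hypothetical: the `registry` and `service` element names, the `handle` attribute, and the example host names are made up for illustration and are not the actual VOregistry schema. Python's standard `xml.etree.ElementTree` supports only a limited XPath subset, so the path expression gathers the candidates and ordinary code expresses the preference.

```python
import xml.etree.ElementTree as ET

# Hypothetical aggregated service data; not the real VOregistry format.
REGISTRY_XML = """
<registry>
  <service handle="http://host-a.example.org/FileshareService">
    <OutstandingRequests>12</OutstandingRequests>
  </service>
  <service handle="http://host-b.example.org/FileshareService">
    <OutstandingRequests>4</OutstandingRequests>
  </service>
</registry>
"""

root = ET.fromstring(REGISTRY_XML)

# The path expression collects every advertised replica...
services = root.findall("./service")

# ...and the client's preference decides among them.
least_loaded = min(services, key=lambda s: int(s.findtext("OutstandingRequests")))
handle = least_loaded.get("handle")
```

Even in this tiny case the preference (fewest outstanding requests) has to be written down explicitly, which is exactly where the question of what the client values becomes a design decision.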

Service Replica Selection

Just as a file can be replicated on a data grid, web services can be replicated within a virtual organization to achieve increased reliability and performance. In many ways, data retrieval is more efficient when the data is coupled with the retrieval mechanism. In my examples, I use a FileSharing service to represent an arbitrary data retrieval or data derivation service. I assume that the grid service consumer is interested in a data product, which can be a raw file or something derived from that file. Web Service technologies have begun to erase the distinction between data retrieval and data derivation functions, so that a replica selection client need only be concerned with retrieving a data product and not with how that data product is produced.

While previous Globus Toolkit projects use an LDAP directory or GIS model, the Globus Toolkit 3 project is now gravitating towards a VOregistry service for XML aggregation, as described in the background chapter. What this means for programmers who wish to create a replica selection client is that they need an XSD schema to represent relevant replica selection metadata.

From the history of the Grid Information Services Working Group’s research, many possible replica selection MetaData ontologies have been discussed. The broad concepts that have emerged in Grid research relevant to replica selection are:

Hardware Capabilities
  Machine Specifications
  Bandwidth
  Volume

Service Provider State and QoS
  Concurrent Users
  Outstanding Requests
  Served Requests
  Previous Response Times
  Exceptions
  Failures
  Uptime

Replica Selection metadata candidates

Ideally, a client would know the probability of a provider delivering a correct data product, as well as the confidence interval and the margin of error. However, it is more realistic to simply track a service’s lifetime as the number of requests and responses that have passed through. A client could use information similar to what is presented in an HTTP server’s log files to determine the reliability and average response time.

However, it is expensive in terms of space and bandwidth to keep records of everything a service has ever done, so instead I propose some service data elements that can be derived from the log and exported to a VOregistry.
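As a rough sketch of what "derived from the log" could mean, the class below folds each finished request into a few running counters, so the log itself never has to be stored or shipped. The class and attribute names are hypothetical; they loosely mirror the TotalRequests, Exceptions, and AvgRunTime elements used in the survey on this page.

```python
class ServiceStats:
    """Running counters a service could publish instead of its full log."""

    def __init__(self):
        self.total_requests = 0
        self.exceptions = 0
        self.avg_run_time = 0.0

    def record(self, succeeded, run_seconds):
        """Fold one finished request into the counters in constant space."""
        self.total_requests += 1
        if not succeeded:
            self.exceptions += 1
        # Incremental mean: new_avg = old_avg + (x - old_avg) / n
        self.avg_run_time += (run_seconds - self.avg_run_time) / self.total_requests

    def observed_reliability(self):
        """Fraction of requests so far that finished without an exception."""
        if self.total_requests == 0:
            return None  # no history yet
        return 1 - self.exceptions / self.total_requests


# Hypothetical history: two successes and one failure.
stats = ServiceStats()
for succeeded, seconds in [(True, 2.0), (True, 4.0), (False, 6.0)]:
    stats.record(succeeded, seconds)
```

After these three requests the counters hold 3 total requests, 1 exception, and an average run time of 4.0 seconds, which is all a registry would need to receive.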

Replica Selection from an Operational Profile

Another MetaData model of replica selection can be constructed from the more mature research field of Software Reliability Engineering. A common practice is to define an operational profile in order to conduct reliability tests. The operational profile is used in much the same way as GIS MetaData or Service Data; the profile is a collection of information used to predict reliability and performance.

While practitioners of software reliability engineering use this information to reimplement and debug systems, a service consumer may have the option of simply selecting a different replica.

According to the Handbook of Software Reliability Engineering [Lyu96], “A profile is a set of disjoint (only one can occur at a time) elements, each with the probability that it will occur.” Subgroups of elements are arranged into system modes; divisions of work are called runs. The elements are given a probability of failure that is weighted by their usage probability.
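The weighting described above amounts to a usage-weighted sum. The sketch below is a minimal illustration under my own assumptions: the operation names and probabilities are invented, and each profile element is represented as a (usage probability, failure probability) pair.

```python
def profile_failure_probability(profile):
    """Chance that a run fails: each element's failure probability
    weighted by how often that element is used."""
    total_usage = sum(usage for usage, _ in profile.values())
    if abs(total_usage - 1.0) > 1e-9:
        # Elements are disjoint, so their usage probabilities must sum to 1.
        raise ValueError("usage probabilities must sum to 1")
    return sum(usage * failure for usage, failure in profile.values())


# Hypothetical profile for a file sharing service:
# operation -> (usage probability, failure probability)
profile = {
    "download": (0.70, 0.01),
    "search": (0.25, 0.02),
    "upload": (0.05, 0.10),
}
overall = profile_failure_probability(profile)
```

For these made-up numbers the overall failure probability is about 0.017. The rarely used upload operation contributes as much as search despite being five times less frequent, because it fails far more often.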

[Foster02] Chimera: A Virtual Data System for Representing, Querying and Automating Data Derivation. I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Proceedings of the 14th Conference on Scientific and Statistical Database Management, Edinburgh, Scotland, July 2002.

[Vazhkudai01] Replica Selection in the Globus Data Grid. S. Vazhkudai, S. Tuecke, I. Foster. Proceedings of the First IEEE/ACM International Conference on Cluster Computing and the Grid (CCGRID 2001), pp. 106-113, IEEE Computer Society Press, May 2001.
Discusses a high-level replica selection service that uses information regarding replica location and user preferences to guide selection from among storage replica alternatives.

Preferences Survey

I interviewed several experts in the field of high performance computing in order to recommend a model for selecting a grid service replica. A realistic replica selection model should balance several goods, including expected run time, probability of failure, service capacity, and utilization. At the most basic level, this survey asks: When should the client wait longer for a more reliable service?

In my survey of HPC experts, I ask each expert to construct a query on a set of grid service data for selecting the optimal replica of a FileshareService. We assume that there are N identical FileshareService instances residing on disparate hardware and software platforms. All services reliably and accurately publish their properties to a VORegistry.

Name                 Description
TotalRequests        The total number of requests that this instance has received in its lifetime
InvalidRequests      The total number of server-side exceptions caused by an invalid SOAP request message
Exceptions           The total number of service-side exceptions that have occurred in its lifetime
AvgCueTime           A running average of how long a request spends waiting in the resource manager’s queue before execution begins
AvgRunTime           A running average of how long the service provider takes to complete sending the SOAP response
AvgLoad              A 1 minute exponential average of resource requests, expressed as a percentage of capacity
AvgProcUtilization   A 1 minute exponential average of processor usage, expressed as a percentage of capacity
OutstandingRequests  The number of outstanding requests, both queued and running
Table X: Service Data Set Derived from NC State HPC center statistics

An example query may be as follows:
1. Select the service with the fewest OutstandingRequests
2. If the result set contains more than 1 element, select the service with the lowest AvgRunTime
3. If the result set still contains more than 1 element, select the service with the fewest Exceptions
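For readers who think in code rather than queries, the three steps can be sketched as a single ordered comparison. This is just one possible rendering, with a hypothetical function name and sample data: comparing tuples in Python is lexicographic, so the second and third values only matter when the earlier ones tie exactly.

```python
def select_replica_strict(replicas):
    """Pick the replica with the fewest OutstandingRequests, breaking exact
    ties by lowest AvgRunTime and then by fewest Exceptions."""
    return min(
        replicas,
        key=lambda r: (r["OutstandingRequests"], r["AvgRunTime"], r["Exceptions"]),
    )


# Hypothetical service data for three FileshareService replicas.
replicas = [
    {"endpoint": "a", "OutstandingRequests": 3, "AvgRunTime": 12.0, "Exceptions": 5},
    {"endpoint": "b", "OutstandingRequests": 3, "AvgRunTime": 12.0, "Exceptions": 1},
    {"endpoint": "c", "OutstandingRequests": 7, "AvgRunTime": 4.0, "Exceptions": 0},
]
chosen = select_replica_strict(replicas)
```

Replicas a and b tie on the first two criteria, so the Exceptions count decides in favour of b. Replica c is never considered despite its short run time, because it loses on the primary criterion.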

This set of query preferences could be expressed as a set of XQuery ‘PathExpr’ expressions combined with Left Outer Joins. Since few of the survey participants are familiar with XQuery, they are free to use SQL syntax, pseudocode, or English-language descriptions.

Each survey interview begins with a brief introduction of myself and my thesis research. I explain Replica Selection as the process of choosing an endpoint or GSH from among a number of functionally identical services. The services publish a set of service data to a registry, and the client must decide which replica to invoke based only on information provided by the registry and internal state data.

I ask the survey participants to consider three scenarios. First, consider a single, one-time service invocation. Second, consider invoking a service N times, where N > 1000. Third, consider the collective action problem where a large number of clients use the same replica selection queries.


More Better Grid Service Data Stuff Coming Soon...












