Wednesday, December 31, 2003
A Brief Survey of Grid Services In Higher Education
Grid Services Graduate Courses --
Introduction to Grid Computing, CS 580G
Instructor
Madhusudhan Govindaraju
CP5170 - Topics in Systems and Networks
Subject Coordinator
Eoin Hyden
CS595 Grid and Ubiquitous Computing at IIT
Course notes - Collected by Gregor von Laszewski
Groups (ranked highly in google):
University of Virginia: Grid Computing Group
University of Connecticut: Grid Computing Group
UCSD: Grid Computing Laboratory
Virginia Tech: Grid-Computing Research Group
Sunday, December 28, 2003
I just got a Regular Paper accepted at ITCC 2004 in the Modern Web and Grid Systems conference track.
I might even try to go to Vegas.
It's titled High Throughput Web Services for Life Sciences
Definition of Terms From
Grid Information Services for Distributed Resource Sharing. K. Czajkowski, S. Fitzgerald, I. Foster, C. Kesselman. Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Press, August 2001.
[Citation, PDF]
A superscheduler routes computational requests to the
“best” available computer in a Grid containing multiple
high-end computers, where “best” can encompass issues of
architecture, installed software, performance, availability,
and policy. Here, information sources are computers, and
information can include both relatively static information
such as system configuration (architecture, OS version, access
policy) and more dynamic information such as instantaneous
load and predictions of future availability [40, 10].
A service discovery service records the identity and essential
characteristics of “services” available to community
members. Such a discovery service might enable a physicist
to determine that a new university that has just joined
his consortium has 100 new CPUs available for approved
use. Here, information sources are relatively static and the
information itself relates primarily to availability.
A replica selection service within a data grid responds
to requests for the “best” copy of files that are replicated
on multiple storage systems. Here, information sources can
once again include system configuration, instantaneous performance,
and predictions, but for storage systems and networks
rather than computers.
An application adaptation agent monitors both a running
application and external resource availability and modifies
application behavior (e.g., reduces accuracy, changes
algorithms) and/or its resource consumption (e.g., migrates
to other resources) if, due to changes in resource status or
application behavior, these changes are thought likely to
improve performance. Information sources include various
components of both the application and the underlying execution
environment.
A troubleshooting service monitors Grid resources,
looking for anomalous behaviors such as excessive load or
extended failure of critical services. Here, the information
sources can be arbitrary; the information that is of interest is
determined by troubleshooter heuristics and can be highly
dynamic.
A performance diagnosis tool, invoked by a user when
anomalous behavior is detected, discovers what information
sources are associated with an application and its resources
(e.g., application sensors, network sensors, historical information
sources) and accesses these information sources as
it seeks to diagnose the poor performance.
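To make the first definition concrete for myself, here is a minimal Python sketch of a superscheduler choosing the "best" computer. It is not from the paper: the `Computer` record, the `pick_best` function, and the sample grid are all hypothetical, and "best" is reduced to "least loaded among the machines that pass the static checks."

```python
from dataclasses import dataclass


@dataclass
class Computer:
    """Hypothetical record of what an information source reports about one machine."""
    name: str
    arch: str             # relatively static: architecture
    software: frozenset   # relatively static: installed software
    load: float           # dynamic: instantaneous load, 0.0 (idle) to 1.0 (saturated)
    allowed: bool         # outcome of the access policy check for this requester


def pick_best(computers, arch, needs):
    """Return the least-loaded computer that meets the static requirements, or None."""
    eligible = [
        c for c in computers
        if c.allowed and c.arch == arch and needs <= c.software
    ]
    return min(eligible, key=lambda c: c.load, default=None)


# Made-up sample grid, for illustration only.
GRID = [
    Computer("alpha", "x86", frozenset({"blast"}), 0.9, True),
    Computer("beta", "x86", frozenset({"blast", "hmmer"}), 0.4, True),
    Computer("gamma", "sparc", frozenset({"blast"}), 0.1, True),
    Computer("delta", "x86", frozenset({"blast"}), 0.0, False),
]

choice = pick_best(GRID, "x86", {"blast"})
print(choice.name if choice else "no eligible computer")
```

Static information filters the candidates and dynamic information ranks them; a real superscheduler would also weigh predictions of future availability, which this sketch leaves out.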
Monday, December 22, 2003
Grid Service Workflows:
Service Data in Scientific Workflow Systems
Outline:
Abstract
Introduction
Background: Service Registries
UDDI Summary
WSIL: Summary
Grid Service Data Summary
Background: Workflow Systems
Historical Perspective
A Survey of Current Projects
Workflow Patterns
Grid Service Data Registry in Scientific Workflow Systems
Model for Fault Tolerance
Model for Workflow Optimization
Implementation Discussion
Future Work
References
Abstract
The emerging technologies of grid computing, web services, and service-oriented workflow will soon enable scientific projects to be conducted on a larger scale than ever before. Scientific workflows can be constructed by combining dispersed network-accessible services into virtual organizations. Within a scientific workflow environment, metadata or grid service data is necessary for consumer applications to discover services and for services to publish their properties.
This thesis proposes one of the first architectures and implementations of a grid service registry for use in scientific workflow applications. The registry uses the Globus Toolkit 3 and is made available as an OGSI compliant grid service. This thesis outlines several formats for the service data which allow consumer applications to attain features such as fault-tolerance, monitoring, and dynamic discovery or selection of services.
The Open Grid Service Architecture (OGSA) allows such registries to use service data for soft-state management, keeping track of metadata for service instances created from application factories. A service aggregation registry subscribes to services produced from a number of factories. Service data is expressed through a shared vocabulary defined in an XSD namespace. Discovery policy is expressed in XPath queries.
Introduction
Background: Service Registries
In this chapter, I provide background on the three most widely known web service registry technologies. In a web service workflow system, a registry’s primary function is to provide a mechanism for service discovery. Service discovery may enable improvements in workflow systems such as fault-tolerance, planning, or improved performance. Since the inception of web services, many people have believed that service discovery would become an essential part of web service technology. However, as programmers have used web services, most have left service discovery to the user (as parameters) or have even hard-coded the service endpoint. This is especially true for scientific systems. Since there are usually a limited number of services that work with a scientific workflow system, the burden of discovering services is easily pushed to the user.
Academic projects tend to take a roll-your-own (RYO) approach to service registries, and several existing scientific workflow systems have used registries. Projects such as DiscoveryNet [discoverynet] and ICENI [iceni] use custom registries. The Self-Serv [benatallah03] and Triana [Triana] projects use UDDI.
UDDI Summary
Universal Description, Discovery, and Integration (UDDI) version 1.0 was created alongside the dot-com and e-commerce technologies of 2000. It was originally conceived as a machine-readable “Universal Business Registry,” an e-commerce directory that would revolutionize the supply chain. The version 1.0 specification was supposed to create a standard platform on which businesses could compete to offer services. The UDDI specification was then substantially revised in version 2.0 in 2001. The specification was tied to XML, WSDL, XSD, and other overlapping projects. As of 2004, the current version of UDDI (version 3.0) is a conglomeration of the initial web services specifications (WSDL, SOAP, XSD) and the emerging technologies of XML security and service publication. The UDDI project is currently negotiating with W3C and OASIS to turn the specification over to a standards body working group. UDDI’s authors have claimed [STENCILUDDI] that the standard is evolving to become more practical.
UDDI software offers a powerful, secure, and feature-rich web service registry at the cost of complexity and uncertainty about future versions of UDDI. UDDI has not yet been widely adopted in academia or in industry, but it has been used successfully in some business projects. Many major corporations as well as numerous startups have been involved in the evolution of UDDI. The most influential actors, IBM, Microsoft, Sun, and SAP, have all published UDDI implementations.
Independent research carried out by SalCentral shows that over 67% of entries in public UDDI registries today are either invalidly formatted or validly formatted but unavailable. This is due to inadequate quality of service guarantees and a lack of moderation in these registries. Furthermore, the available services are underutilized and very rarely “discovered” programmatically. In order to mitigate these problems, UDDI version 3.0 relies on a publish/subscribe model.
One interacts with a UDDI server or UDDI-enabled application through a number of APIs and their respective implementations. Today there are six major implementations of UDDI 2.0; however, only Systinet claims to support version 3.0 features.
API Implementations
• IBM WSTK and UDDI4J
• Systinet WASP UDDI
• jUDDI.org
• UDDI 2.0 in Java
• Microsoft UDDI SDK
• Trenian Web Services Directory
While the features and internals of each implementation differ, they all provide mechanisms to publish, find, and bind services. Publishing a service is simply a matter of including it in a UDDI registry. This can be done through a series of RPC-style web service calls or by a person through a web page interface. The specification provides a mechanism for assigning a unique identifier (UDDI Key) to a newly published service. Finding or discovering a service is accomplished by creating a client-side proxy and searching by business, service, or description. Binding covers how an application connects to, and interacts with, a web service after it has been found.
A UDDI registry is capable of storing the following data elements:
• businessEntity: Describes a business or other organization that typically provides Web services.
• businessService: Describes a collection of related Web services offered by an organization described by a businessEntity.
• bindingTemplate: Describes the technical information necessary to use a particular Web service.
• tModel: Describes a “technical model” representing a reusable concept, such as a Web service type, a protocol used by Web services, or a category system.
The most flexible component is the tModel, which can hold any data structure expressible in XML Schema (XSD). WSDL documents, metadata, and human-readable descriptions can all be encoded as tModel components.
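The relationship between the four data elements, and the publish and find steps, can be sketched in a few lines of Python. This is an in-memory toy under my own assumptions, not any UDDI implementation's API: the class names mirror the UDDI element names, but every field, the `ToyRegistry` class, and its `publish` and `find_service` methods are hypothetical.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class TModel:
    """A reusable technical concept, e.g. a pointer to a WSDL interface."""
    name: str
    overview_url: str


@dataclass
class BindingTemplate:
    """Technical information needed to call one service endpoint."""
    access_point: str
    tmodels: list = field(default_factory=list)


@dataclass
class BusinessService:
    """A group of related web services offered by one organization."""
    name: str
    bindings: list = field(default_factory=list)
    key: str = ""  # assigned by the registry on publication


@dataclass
class BusinessEntity:
    """The organization that provides the services."""
    name: str
    services: list = field(default_factory=list)


class ToyRegistry:
    """Hypothetical in-memory stand-in for a UDDI registry."""

    def __init__(self):
        self.entities = []

    def publish(self, entity):
        # Assign a unique key to every newly published service.
        for service in entity.services:
            if not service.key:
                service.key = str(uuid4())
        self.entities.append(entity)

    def find_service(self, text):
        # Case-insensitive substring search over service names.
        return [
            s for e in self.entities for s in e.services
            if text.lower() in s.name.lower()
        ]


wsdl = TModel("AlignmentInterface", "http://example.org/align.wsdl")
binding = BindingTemplate("http://example.org/services/align", [wsdl])
lab = BusinessEntity("Example Lab", [BusinessService("Sequence Alignment", [binding])])

registry = ToyRegistry()
registry.publish(lab)
match = registry.find_service("alignment")[0]
print(match.name, match.bindings[0].access_point)
```

Binding is then just the step of taking `access_point` from the matching bindingTemplate and calling the service there.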
WSIL: Summary
Web Services Inspection Language (WSIL) is a project created by Microsoft and IBM that offers a lightweight, decentralized service registry in contrast to the complex, centralized approach of UDDI. WSIL is a specification for an XML-based meta-language, and it works through the creation and consumption of web-based XML documents [APPNEL02].
A WSIL document is simply a way for an organization to aggregate and advertise its web services. The WSIL specification says:
"WSIL defines how a service requestor can discover an XML Web Service description on a Web server, enabling such requestors to easily browse Web servers for XML Web Services."
WSIL encourages organizations to locate their WSIL documents in a uniform manner.
The WSIL document is published at http://example.org/inspection.wsil or http://examples.org/services/inspection.wsil.
All WSIL documents must have a root element, <inspection>, which wraps all the service advertisements. Each service is wrapped in a <service> tag, which contains a <description> tag. Usually, the <description> element provides a reference to the namespace and the WSDL.
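As a rough illustration of how a consumer might read such a document, here is a small Python sketch using only the standard library. The element names (`inspection`, `service`, `description`), the `referencedNamespace` and `location` attributes, and the namespace URI follow my reading of the WS-Inspection specification and should be treated as assumptions; the sample document and the `list_descriptions` helper are made up.

```python
import xml.etree.ElementTree as ET

# Assumed WS-Inspection namespace; check it against the specification.
WSIL_NS = "http://schemas.xmlsoap.org/ws/2001/10/inspection/"

SAMPLE = """<inspection xmlns="http://schemas.xmlsoap.org/ws/2001/10/inspection/">
  <service>
    <abstract>Sequence alignment service</abstract>
    <description referencedNamespace="http://schemas.xmlsoap.org/wsdl/"
                 location="http://example.org/services/align.wsdl"/>
  </service>
  <service>
    <abstract>Structure prediction service</abstract>
    <description referencedNamespace="http://schemas.xmlsoap.org/wsdl/"
                 location="http://example.org/services/predict.wsdl"/>
  </service>
</inspection>"""


def list_descriptions(wsil_text):
    """Return (referencedNamespace, location) for every advertised description."""
    root = ET.fromstring(wsil_text)
    ns = {"wsil": WSIL_NS}
    found = []
    for service in root.findall("wsil:service", ns):
        for desc in service.findall("wsil:description", ns):
            found.append((desc.get("referencedNamespace"), desc.get("location")))
    return found


for namespace, location in list_descriptions(SAMPLE):
    print(namespace, location)
```

There is no server-side query here: the consumer fetches the whole document and filters it locally, which is the main contrast with UDDI.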
WSIL documents incorporate XML Schema, so the WSIL specification is extensible with other definition types. WSDL support in WSIL is achieved through the use of extensible XSD elements. WSDL and UDDI extensions are pre-built and implemented in the most widely used WSIL toolkit, the Apache Axis WSIL4J project [WSIL4J].
Academic researchers have noticed WSIL and have mentioned or discussed it in many documents, most prominently at the UK e-Science project, Indiana University, and the University of Chicago.
XMethods.com, a directory of publicly available Web services, is an early adopter of WSIL and has developed a binding extension for its service.
Grid Service Data Summary
The Globus Toolkit includes an example grid service, VOregistry. This service implements the OGSA model of service data aggregation; it combines XSD schema data, called service data, from any number of grid services. The VOregistry service listens for services to publish their service data at regular intervals, and creates an element for each service.
This model of service data aggregation makes use of the GWSDL/OGSI mechanism of portType inheritance. Every service capable of publishing its service data to the registry extends NotificationSource, and the service subscribing to this data extends NotificationSink. The conventional rules of inheritance apply to grid services, so it is possible to create further inherited services. A service’s GWSDL file can contain several elements that define the data structures of publishable service data in XSD schema.
Like UDDI and WSIL, a grid service data registry is queried through a set of APIs.
The service data registry can hold any collection of XSD elements and can process complex queries as XPath or XQuery expressions. The processing is done on the registry side and only the query results are returned.
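The idea of a registry-side query can be illustrated with a short Python sketch. This is not the Globus Toolkit API: the aggregated document, its element names, and the `query` function are hypothetical, and Python's `xml.etree.ElementTree` supports only a small subset of XPath, so this shows the pattern rather than the full expressive power of XPath or XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical aggregated service data: one <entry> per published service.
AGGREGATED = """
<registry>
  <entry service="blast-1"><host>node1.example.org</host><status>up</status></entry>
  <entry service="blast-2"><host>node2.example.org</host><status>down</status></entry>
  <entry service="blast-3"><host>node3.example.org</host><status>up</status></entry>
</registry>
"""


def query(registry_xml, xpath):
    """Evaluate the expression inside the registry and return only matching service names."""
    root = ET.fromstring(registry_xml)
    return [entry.get("service") for entry in root.findall(xpath)]


print(query(AGGREGATED, "./entry[status='up']"))
print(query(AGGREGATED, "./entry[host='node2.example.org']"))
```

The client sends only the expression and receives only the matches, so the aggregated service data never has to leave the registry.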
In comparison to UDDI and WSIL, service data aggregation is a middle-of-the-road approach. It is not as centralized as UDDI, but it does include server-side processing capabilities not found in WSIL. Like UDDI, service data aggregation attempts to protect QoS with a publish/subscribe model based on timeouts. However, this mechanism shifts the responsibility to service providers to publish correct service data through the use of a shared XSD namespace vocabulary.
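A timeout-based publish/subscribe registry can be sketched as follows. This is a toy model under my own assumptions, not the VOregistry implementation: the `SoftStateRegistry` class, its method names, and the 30-second lifetime are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class SoftStateRegistry:
    """Toy soft-state registry: an entry survives only while its service keeps publishing."""

    def __init__(self, lifetime):
        self.lifetime = lifetime   # seconds an entry stays valid after each publication
        self._entries = {}         # handle -> (service_data, expires_at)

    def publish(self, handle, service_data, now):
        # Each publication (or re-publication) renews the entry's lease.
        self._entries[handle] = (service_data, now + self.lifetime)

    def live(self, now):
        # Drop expired entries, then return the service data that is still valid.
        self._entries = {
            h: (data, expires) for h, (data, expires) in self._entries.items()
            if expires > now
        }
        return {h: data for h, (data, _) in self._entries.items()}


registry = SoftStateRegistry(lifetime=30)
registry.publish("blast-1", {"status": "up"}, now=0)
registry.publish("blast-2", {"status": "up"}, now=0)
registry.publish("blast-1", {"status": "up"}, now=25)   # blast-1 renews; blast-2 goes silent

print(sorted(registry.live(now=20)))   # both leases still valid
print(sorted(registry.live(now=40)))   # blast-2 expired at t=30
```

A service that crashes simply stops renewing and disappears from query results, which is how stale entries of the kind found in public UDDI registries are avoided.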
Service data aggregation is also compatible with WSIL and UDDI. All three registries reside on a server and use a similar concept of web service discovery, and all three implementations rely on the same Java XML libraries.
The current service data aggregation implementation is limited by the fact that only OGSI-compliant services can be included. When (and if) the WSDL and GWSDL specifications converge, this model will become available as a web service registry mechanism as well as a grid service registry mechanism.
Background: Workflow Systems
Historical Perspective
This chapter provides background on the numerous scientific workflow projects.
The most widely used definition of the term workflow is presented by the Workflow Management Coalition [WFMC]. Workflow can be defined as the orchestration of a set of activities to accomplish a larger and more sophisticated goal, referred to as a business process.
Academic work on workflow in mathematics and computer science dates back to the mid-1970s. Skip Ellis and Michael Zisman worked on “Office Automation” prototypes at Xerox PARC. Work in this area went on from 1975 to 1985, but did not make significant progress until the business process re-engineering trends of the 1990s. Several early commercial products were also created, with Lotus Domino being the most successful. Simultaneously, office automation projects were sponsored by the EU and the University of Dortmund.
One of today’s workflow researchers, Michael zur Muehlen, has created a histogram of early workflow projects.
By the 1990s, the automation of business workflow had matured into many successful software products, including [Staffware], [COSA], [InConcert], [Eastman], [Domino], [Websphere], [Workflo], and [I-Flow]. There are many open source workflow frameworks available as well.
Research on workflow systems was reopened from a different perspective in 2000-2001, when work began on the Web Service Choreography Interface Specification (WSCI), which introduced the notions of web service composition and service orchestration. As part of their commitment to adopting web services and XML, Microsoft and IBM developed XLANG [xlang] and WSFL [wsfl] respectively. In 2002, Microsoft and IBM jointly announced the BPEL workflow language specification. The specification [BPEL4WSSPEC] was finalized with input from several other partners including BEA, and then rolled into a working group at OASIS [oasis].
Over the last two years, XML meta-languages have gained a lot of attention in both technical and business publications. WSCI, XLANG, WSFL, and BPEL have been debated and analyzed from both a technical and a political perspective by James Snell at IBM, David Berlind for coverpages.org, and Paul Krill reporting for InfoWorld. Simultaneously, many academic projects began using these languages or systems derived from these languages.
A Survey of Current Projects
Woflan
Self-Serv
BioPipe
SCIRUN
WASA
myGrid
Triana
Chimera
Ptolemy-II
DiscoveryNet
wftk
SWFL
ICENI
BioOpera
ILab
GridAnt
DAGMan
EDSS
METEOR
COUGAAR
Todo:
LabVIEW
XCAT
GridFlow
Webflow
Woflan
Woflan [www-woflan] is a "Petri-net-based workflow diagnosis tool" developed at TU/e in the Netherlands. The Woflan program allows a user to input a workflow specification in file formats from several commercial workflow applications including Staffware [www-staffware], COSA [www-cosa], and Protos [www-pallas]. Woflan produces descriptive, warning, and error messages from the input file. Woflan checks workflows for correctness, and generates diagnostics to help repair problems [Verbeek00].
Self-Serv
Self-Serv [benatallah03] is a collection of web services middleware for the creation of web services from other web services. Self-Serv is based on a peer-to-peer model, in that any service is capable of executing composite services without the need for a central scheduler. Self-Serv workflows are represented as state-charts that allow for a range of logical operations to occur within a composite service. Self-Serv also contains components for web service discovery with UDDI and web service deployment.
BioPipe
BioPipe [biopipe] is one of a series of related packages for carrying out bioinformatics analysis from the Open Bioinformatics Foundation [obf]. Biopipe is a collection of Perl modules for constructing workflows from BioPerl [bioperl] applications. Much of the code and ideas are borrowed from the Ensembl [ensembl] pipeline project.
SCIRUN
SCIRUN [scirun] is an application begun in 1992 by the National Center for Research Resources Center at Utah. SCIRUN is a GUI “scientific workbench” that allows users to construct, manage, and debug simulations in domains such as physics and neurobiology. SCIRUN simulations may be thought of as workflows since they allow for parallel and conditional execution of tasks. Some SCIRUN applications allow for jobs to be run on a grid. SCIRUN also provides extensive scientific visualization libraries.
WASA
WASA2 [wasa] is an application that supports the creation and execution of workflows from CORBA components. It includes a GUI workflow modeler as well as controls to modify a workflow while it is running. WASA2 has been used in the domains of geoprocessing and molecular biology as well as for modeling business processes. WASA2 uses a strictly object-oriented approach where workflows are represented as CORBA objects and can be displayed as UML diagrams.
myGrid
The myGrid Project [greenwood02] [addis03] is a collection of bioinformatics web services and grid services hosted by the European Bioinformatics Institute. myGrid uses SoapLab [soaplab] and the Apache Axis [AXIS] framework to provide a web service interface to a collection of grid-based data-analysis services. Scientists are able to compose, edit, and save workflows with either a web portal or a GUI workbench [freeflow]. The system currently supports two XML workflow languages, a subset of IBM's WSFL [WSFL] and a more domain-specific language called XScufl [taverna] [wiki]. myGrid executes workflows defined in these documents with the IT Innovation Workflow Enactment Engine [it].
Triana
Triana [Triana] is a set of open source Java libraries that provide a GUI for building workflows from a collection of OGSA grid services. Triana has been leveraged by several other projects including myGrid and Chimera. Triana contains an engine for coordinating and invoking a set of grid services. Triana also contains a peer-to-peer component based on JXTA [JXTA], allowing Triana to run on a variety of devices including PDAs and mobile phones.
Chimera
Chimera [deelman03c] is a system used to find or create a workflow for a series of OGSA grid services to provide a scientist’s requested “data product.” Chimera was begun in 1999 and is enabled by the Pegasus Planner [pegasus] described in [deelman02b], a GriPhyN project at USC ISI. Chimera is middleware designed to be invisible to a client who requests the data product. The workflow is represented as an “abstract program execution graph.” This graph is transformed into an executable DAG for the Condor DAGMan [dagman] scheduler.
Ptolemy-II
Ptolemy-II [ptolemy] is a visual modeling platform written in Java. It was begun in 1997 at UC Berkeley. Several recent SDM [sdm] efforts, including [altintas02a], [ludaesher03], [mladen03], and [altintas03], have extended the Ptolemy-II platform to allow for the drag-and-drop creation of scientific workflows from libraries of actors. A Ptolemy actor is often a wrapper around a call to a web service or grid service. Ptolemy leverages an XML meta-language called MoML [lee-moml] to produce a workflow document describing the relationships of the entities, properties, and ports in a workflow. Presently, Ptolemy actor libraries exist for the domains of bioinformatics and ecology at [sdmnc] and [seek].
DiscoveryNet
DiscoveryNet [discoverynet] is a collection of software built on top of the UNICORE grid system for arranging database access and knowledge discovery procedures. DiscoveryNet was begun in 2001 at Imperial College of Science. DiscoveryNet provides a means of describing workflow between analysis service providers, data owners, and scientists who arrange and execute these workflows [row03]. DiscoveryNet makes use of the OGSI components and protocols as well as its own protocol for workflows, Discovery Process Markup Language [guo02]. This language is used for constructing, running, and managing grid services, as well as recording their history.
wftk
Open-source workflow toolkit [wftk] is the name of a generalized workflow system, implemented as a series of Java libraries. wftk was begun in 1998 by Michael Roberts. wftk uses its own high-level language to describe workflow, and stores workflow-related documents in a series of XML “datasheets.” The wftk workflow engine contains two models of a workflow: a “task-based” model, essentially a DAG, and a “state-based” model, essentially an FSM. wftk libraries can also be run as web services.
SWFL
Service Workflow Language [swfl] is an XML-based meta-language for the construction of scientific workflows from OGSA-compliant services. SWFL was developed at Cardiff University in 2002 and 2003. SWFL extends IBM’s WSFL and supports a new set of conditional operators and loop constructs as well as arrays and objects. SWFL has a workflow engine based on JISGA (Jini-based Service-oriented Grid Architecture). JISGA [jigsa] uses JavaSpaces [jinni] as shared memory for parallel execution through grid services. Therefore, SWFL supports integrating parallel programs into the workflow.
ICENI
ICENI (Imperial College e-Science Networked Infrastructure) [iceni] is a collection of grid middleware used for providing and coordinating grid services for e-science applications. ICENI includes a GUI workflow construction tool integrated into the Netbeans [netbeans] IDE. This tool can create a textual “execution plan” of the workflow in an XML meta-language derived from YAWL (Yet Another Workflow Language) [YAWLa]. The ICENI workflow system supports conditionals, loops, and parallel execution.
BioOpera
BioOpera [bioopera] is an application for the composition of various bioinformatics applications built on top of the [opera] architecture. The project was begun in 1999 at the Swiss Federal Institute of Technology. BioOpera provides a GUI tool for the construction of workflows from web services and grid services. BioOpera represents a workflow “process template” as a DAG, and translates this set of activities into an execution script for Condor-G [bausch03]. Additionally, the web service invocation engine within BioOpera uses UDDI, and the grid service component uses service description documents [haller03-thesis].
ILab
ILab [ilab] is a set of grid middleware developed at The Fraunhofer ICT Group since 2001. ILab includes a GUI tool for the construction of workflows from grid services. This tool uses an XML-based language called Grid Application Definition Language (GADL) [der03] to assemble applications from grid services, and Grid Job Definition Language (GJobDL) to describe the runtime behavior of such applications. ILab uses Petri nets instead of DAGs to model and control the workflow.
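To show why a Petri net can model workflow control that a plain DAG only implies, here is a minimal Python sketch of the Petri net firing rule applied to a split/join workflow. It is a generic textbook illustration, not ILab's or Woflan's implementation: the place and transition names and the `enabled` and `fire` helpers are all hypothetical.

```python
# Each transition maps to (input places, output places).
NET = {
    "split": (["start"], ["a_ready", "b_ready"]),
    "run_a": (["a_ready"], ["a_done"]),
    "run_b": (["b_ready"], ["b_done"]),
    "join": (["a_done", "b_done"], ["end"]),
}


def enabled(marking, transition):
    """A transition is enabled when every input place holds at least one token."""
    inputs, _ = NET[transition]
    return all(marking.get(place, 0) >= 1 for place in inputs)


def fire(marking, transition):
    """Consume one token from each input place and add one to each output place."""
    if not enabled(marking, transition):
        raise ValueError(f"transition {transition!r} is not enabled")
    inputs, outputs = NET[transition]
    new_marking = dict(marking)
    for place in inputs:
        new_marking[place] -= 1
    for place in outputs:
        new_marking[place] = new_marking.get(place, 0) + 1
    return new_marking


marking = {"start": 1}
marking = fire(marking, "split")     # two parallel branches become ready
marking = fire(marking, "run_a")
print(enabled(marking, "join"))      # branch b has not finished yet
marking = fire(marking, "run_b")
marking = fire(marking, "join")
print(marking["end"])
```

The marking records the workflow's state explicitly, so questions such as "can the join ever fire?" become checks on reachable markings, which is the kind of analysis a Petri-net-based tool can automate.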
GridAnt
GridAnt [GRANT] is an extension of the Apache Ant build tool residing in the Globus CoG Kit. GridAnt was begun as GSFL: Grid Service Flow Language in 2001. GridAnt allows for the construction of client-side workflows for Globus Toolkit 3 applications. It allows for the specification of preconditions and parallel tasks in much the same way as the Ant build tool.
DAGMan
DAGMan [dagman] is a set of C libraries that allow the user to schedule programs based on dependencies. DAGMan is part of the Condor project and extends the Condor Job Scheduler [condor] to handle inter-job dependencies. As the name suggests, DAGMan represents a collection of job dependencies as a directed acyclic graph. These DAGs are specified using a simple text file format. DAGMan leverages technologies from the Condor-G scheduler, Globus Toolkit, and the EU Datagrid [ppdg-20].
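Dependency-based scheduling of this kind can be sketched in Python with the standard library's `graphlib` module (Python 3.9 or later). The text format below is loosely modeled on DAGMan's `JOB` and `PARENT ... CHILD ...` lines, but the parser handles only those two keywords; the job names, submit file names, and the `parse_dag` and `run_order` helpers are hypothetical.

```python
from graphlib import TopologicalSorter

DAG_TEXT = """
JOB fetch fetch.sub
JOB align align.sub
JOB annotate annotate.sub
JOB report report.sub
PARENT fetch CHILD align annotate
PARENT align annotate CHILD report
"""


def parse_dag(text):
    """Return a mapping of job name -> set of jobs that must finish first."""
    deps = {}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "JOB":
            deps.setdefault(words[1], set())
        elif words[0] == "PARENT":
            split = words.index("CHILD")
            parents, children = words[1:split], words[split + 1:]
            for parent in parents:
                deps.setdefault(parent, set())
            for child in children:
                deps.setdefault(child, set()).update(parents)
    return deps


def run_order(deps):
    """Return one valid execution order; graphlib raises CycleError if the graph is cyclic."""
    return list(TopologicalSorter(deps).static_order())


print(run_order(parse_dag(DAG_TEXT)))
```

Here `align` and `annotate` have no dependency on each other, so a real scheduler could run them in parallel; `static_order` simply picks one serial order that respects the constraints.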
EDSS
The Environmental Decision Support System [edss] is a graphical tool for creating air quality simulation experiments. EDSS allows the user to define text file definitions or “templates” used by the various air quality simulation programs which are typically FORTRAN programs. The user defines a workflow, which is represented as a DAG, and can export the model for later use.
METEOR
The Managing End-To-End OpeRations [meteor] project is a collection of Java programs for constructing and executing workflows. METEOR began in 1994 and now continues as METEOR-S [verma03], a project on semantic web services. These projects were used to create IntelliGEN [kochut03], a bioinformatics workflow system that coordinated a number of web services. METEOR was also used to build an information management system for Genome Databases [hall03]. METEOR has also been used in several projects for business workflows and to illustrate quality of service issues in [cardoso02] and [cardoso03].
COUGAAR
The Cognitive Agent Architecture [cougaar] project is a collection of Java libraries that provide a framework for distributed multi-agent systems. The Logical Cougaar Experiment [roaringshrimp] is a project that attempts to define workflow structures and languages for distributed services-based applications. A derived work, the Microsoft Desktop Collaboration Agent Experiment [extendedexp], uses these projects to automate office tasks such as email. This project leverages techniques from narrative scripting and expert systems such as [amzi].
coming soon... pictures
Service Data in Scientific Workflow Systems
Outline:
Abstract
Introduction
Background: Service Registries
UDDI Summary
WSIL: Summary
Grid Service Data Summary
Background: Workflow Systems
Historical Perspective
A Survey of Current Projects
Workflow Patterns
Grid Service Data Registry in Scientific Workflow Systems
Model for Fault Tolerance
Model for Workflow Optimization
Implantation Discussion
Future Work
References
Abstract
The emerging technologies of grid computing, web services, and service-oriented workflow will soon enable scientific projects to be conducted on a larger scale than ever before. Scientific workflows can be constructed by combining dispersed network accessible services into virtual organizations. Within a scientific workflow environment, metadata or grid service data is necessary for consumer application to discover services and for services to publish their properties.
This thesis proposes one of the first architectures and implementations of a grid service registry for use in scientific workflow applications. The registry uses the Globus Toolkit 3 and is made available as an OGSI compliant grid service. This thesis outlines several formats for the service data which allow consumer applications to attain features such as fault-tolerance, monitoring, and dynamic discovery or selection of services.
The Open Grid Service Architecture (OGSA) allows such registries to use service data for soft-state management, keeping track of metadata for service instances created from application factories. A service aggregation registry subscribes to services produced from a number of factories. Service data is express through an XSD namespace shared vocabulary. Discovery policy is expressed in XPath queries.
Introduction
Background: Service Registries
In this chapter, I provide background on the three most widely known web service registry technologies. In a web service workflow system, a registry’s primary function is to provide a mechanism for service discovery. Service discovery may enable improvements in workflow systems such as fault-tolerance, planning, or improved performance. Since the inception of web services, many people have believed that service discovery would become an essential part of web service technology. However, as programmers have used web services, most have left service discovery to the user (as parameters) or have even hard-coded the service endpoint. This is especially true for scientific systems. Since there are usually a limited number of services that work with a scientific workflow system, the burden of discovering services is easily pushed to the user.
Academic projects tend to take a roll-your-own (RYO) approach to service registries, and several existing scientific workflow systems have used registries. Projects such as DiscoveryNet [discoverynet] and ICENI [iceni] use custom registries. The Self-Serv [benatallah03] and Triana [Triana] projects use UDDI.
UDDI Summary
Universal Description, Discovery, and Integration (UDDI) version 1.0 was created alongside the dot-com and e-commerce technologies of 2000. It was originally conceived as a machine-readable “Universal Business Registry,” an e-commerce directory that would revolutionize the supply chain. The version 1.0 specification was supposed to create a standard platform on which businesses could compete to offer services. The UDDI specification was then substantially revised in version 2.0 in 2001. The specification was tied to XML, WSDL, XSD, and other overlapping projects. As of 2004, the current version of UDDI (version 3.0) is a conglomeration of the initial web services specifications (WSDL, SOAP, XSD) and the emerging technologies of XML security and service publication. The UDDI project is currently negotiating with W3C and OASIS to turn the specification over to a standards body working group. UDDI’s authors have claimed [STENCILUDDI] that the standard is evolving to become more practical.
UDDI software offers a powerful, secure, and feature-rich web service registry at the cost of complexity and uncertainty about future versions of UDDI. UDDI has not yet been widely adopted in academia or in industry, but it has been used successfully in some business projects. Many major corporations as well as numerous startups have been involved in the evolution of UDDI. The most influential actors (IBM, Microsoft, Sun, and SAP) have all published UDDI implementations.
Independent research carried out by SalCentral shows that over 67% of entries in public UDDI registries today are either invalidly formatted or validly formatted but unavailable. This is due to inadequate quality of service guarantees and a lack of moderation in these registries. Furthermore, the available services are underutilized and very rarely “discovered” programmatically. In order to mitigate these problems, UDDI version 3.0 relies on a publish/subscribe model.
One interacts with a UDDI server or UDDI-enabled application through a number of APIs and their respective implementations. Today there are six major implementations of UDDI 2.0; however, only Systinet claims to support version 3.0 features.
API Implementations
• IBM WSTK and UDDI4J
• Systinet WASP UDDI
• jUDDI.org
• UDDI 2.0 in Java
• Microsoft UDDI SDK
• Trenian Web Services Directory
While the features and design of each implementation are different, they all provide mechanisms to publish, find, and bind services. Publishing a service is simply a matter of including it in a UDDI registry. This can be done through a series of RPC-style web service calls or by a person through a web page interface. The specification provides a mechanism for assigning a unique identifier (UDDI Key) to a newly published service. Finding or discovering a service is accomplished by creating a client-side proxy and searching by business, service, or description. Binding is how an application connects to, and interacts with, a web service after it has been found.
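The publish, find, and bind steps can be illustrated with a small in-memory sketch. This is not a UDDI client and does not use any of the toolkits listed above; the class `ToyRegistry`, its method names, and the example endpoints are all hypothetical, chosen only to show the shape of the three operations.

```python
import uuid


class ToyRegistry:
    """In-memory stand-in for a registry (hypothetical, not a UDDI API)."""

    def __init__(self):
        self._services = {}  # unique key -> service record

    def publish(self, business, name, description, access_point):
        """Store a service record and return its newly assigned unique key."""
        key = str(uuid.uuid4())
        self._services[key] = {
            "business": business,
            "name": name,
            "description": description,
            "access_point": access_point,
        }
        return key

    def find(self, **criteria):
        """Return keys of services whose fields contain every given substring."""
        return [
            key
            for key, record in self._services.items()
            if all(
                text.lower() in record[field].lower()
                for field, text in criteria.items()
            )
        ]

    def bind(self, key):
        """Return the endpoint a client would connect to after discovery."""
        return self._services[key]["access_point"]


registry = ToyRegistry()
blast_key = registry.publish(
    "Example Lab", "blast", "sequence alignment service",
    "http://example.org/services/blast",
)
registry.publish(
    "Example Lab", "clustal", "multiple alignment service",
    "http://example.org/services/clustal",
)

matches = registry.find(name="blast")   # search by service name
endpoint = registry.bind(matches[0])    # resolve the endpoint to call
```

A real registry would perform these steps over the network and persist the records; the sketch only mirrors the search-by-business, search-by-service, and search-by-description idea.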
The UDDI is capable of storing the following data elements:
• businessEntity: Describes a business or other organization that typically provides Web services.
• businessService: Describes a collection of related Web services offered by an organization described by a businessEntity.
• bindingTemplate: Describes the technical information necessary to use a particular Web service.
• tModel: Describes a “technical model” representing a reusable concept, such as a Web service type, a protocol used by Web services, or a category system.
The most flexible component is the tModel, which can be any data structure expressible in XML Schema (XSD). WSDL documents, metadata, and human-readable descriptions can all be encoded as tModel components.
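As a rough picture of how these four elements relate, the following sketch models them as nested Python dataclasses. The field names are simplified assumptions rather than the UDDI schema; the one structural point it tries to capture is that a bindingTemplate refers to tModels by key, so the same tModel can be reused across services.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TModel:
    key: str
    name: str
    overview_url: str = ""  # assumed field: where a WSDL or other document lives


@dataclass
class BindingTemplate:
    access_point: str
    tmodel_keys: List[str] = field(default_factory=list)  # references, not copies


@dataclass
class BusinessService:
    name: str
    bindings: List[BindingTemplate] = field(default_factory=list)


@dataclass
class BusinessEntity:
    name: str
    services: List[BusinessService] = field(default_factory=list)


def endpoints_for(entity, tmodel_key):
    """Collect access points of every binding that references the given tModel."""
    return [
        binding.access_point
        for service in entity.services
        for binding in service.bindings
        if tmodel_key in binding.tmodel_keys
    ]


# Hypothetical example data.
wsdl_tmodel = TModel(
    "uuid:example-blast-interface", "BLAST interface",
    "http://example.org/services/blast.wsdl",
)
lab = BusinessEntity("Example Lab", [
    BusinessService("Alignment", [
        BindingTemplate("http://example.org/services/blast", [wsdl_tmodel.key]),
        BindingTemplate("http://example.org/services/other", []),
    ]),
])
found = endpoints_for(lab, wsdl_tmodel.key)
```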
WSIL: Summary
Web Services Inspection Language (WSIL) is a project created by Microsoft and IBM that offers a lightweight, decentralized service registry, in contrast to the complex, centralized approach of UDDI. WSIL is a specification for an XML-based meta-language and is used for the creation and consumption of web-based XML documents [APPNEL02].
A WSIL document is simply a way for an organization to aggregate and advertise its web services. The WSIL specification says:
"WSIL defines how a service requestor can discover an XML Web Service description on a Web server, enabling such requestors to easily browse Web servers for XML Web Services."
WSIL encourages organizations to locate their WSIL documents in a uniform manner.
The WSIL document is published at http://example.org/inspection.wsil or http://examples.org/services/inspection.wsil.
All WSIL documents must have a root element,
WSIL documents incorporate XML Schema, so the WSIL specification can be extended with other definition types. WSDL support in WSIL is achieved through the use of extensible XSD elements. WSDL and UDDI extensions are pre-built and implemented in the most widely used WSIL toolkit, the Apache Axis WSIL4J project [WSIL4J].
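To make the idea of consuming an inspection document concrete, here is a minimal sketch that lists the service descriptions advertised in a small sample document using only the Python standard library. The sample document, its element names, and the namespace URI reflect my reading of the WS-Inspection 1.0 specification and should be treated as assumptions to check against the specification; the service locations are made up. It does not use WSIL4J.

```python
import xml.etree.ElementTree as ET

# Assumed WS-Inspection namespace; verify against the specification.
WSIL_NS = "http://schemas.xmlsoap.org/ws/2001/10/inspection/"

SAMPLE = """
<inspection xmlns="http://schemas.xmlsoap.org/ws/2001/10/inspection/">
  <service>
    <abstract>Sequence alignment service</abstract>
    <description referencedNamespace="http://schemas.xmlsoap.org/wsdl/"
                 location="http://example.org/services/blast.wsdl"/>
  </service>
  <service>
    <abstract>Multiple alignment service</abstract>
    <description referencedNamespace="http://schemas.xmlsoap.org/wsdl/"
                 location="http://example.org/services/clustal.wsdl"/>
  </service>
</inspection>
"""


def list_descriptions(wsil_text):
    """Return (abstract, location) pairs for each advertised description."""
    root = ET.fromstring(wsil_text)
    ns = {"wsil": WSIL_NS}
    results = []
    for service in root.findall("wsil:service", ns):
        abstract = service.findtext("wsil:abstract", default="", namespaces=ns)
        for description in service.findall("wsil:description", ns):
            results.append((abstract, description.get("location")))
    return results


descriptions = list_descriptions(SAMPLE)
```

A consumer would normally fetch the document from a well-known location on the provider's web server before parsing it; the fetch is left out here so the sketch runs offline.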
Academic researchers have definitely noticed WSIL, and have mentioned or discussed it in many documents, most prominently by the UK e-science project, Indiana University, and University of Chicago.
XMethods.com, a directory of publicly available Web services, is an early adopter of WSIL and has developed a binding extension for its service.
Grid Service Data Summary
The Globus Toolkit includes an example grid service, VOregistry. This service implements the OGSA model of service data aggregation; it combines XSD schema data, called service data, from any number of grid services. The VOregistry service listens for services to publish their service data at regular intervals, and creates an
This model of service data aggregation makes use of the GWSDL/OGSI mechanism of portType inheritance. Every service capable of publishing its service data to the registry extends NotificationSource, and the service subscribing to this data extends NotificationSink. The conventional rules of inheritance apply to grid services, so it is possible to create further inherited services. A service’s GWSDL file can contain several
Like UDDI and WSIL, a grid service data registry is queried through a set of APIs.
The service data registry can hold any collection of XSD elements and can process complex queries expressed as XPath or XQuery expressions. The processing is done on the registry side, and only the query results are returned.
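The registry-side query idea can be sketched with the Python standard library. The aggregated document below is invented for illustration (the element and attribute names are not an OGSI or Globus schema), and `xml.etree.ElementTree` supports only a subset of XPath, so this stands in for the richer XPath or XQuery processing described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical aggregated service data held by the registry.
AGGREGATED = """
<serviceData>
  <service name="blast" host="node1.example.org">
    <load>0.20</load>
    <status>up</status>
  </service>
  <service name="blast" host="node2.example.org">
    <load>0.90</load>
    <status>down</status>
  </service>
  <service name="clustal" host="node3.example.org">
    <load>0.50</load>
    <status>up</status>
  </service>
</serviceData>
"""


def query(xpath):
    """Evaluate the query on the registry side and return only the results."""
    root = ET.fromstring(AGGREGATED)
    return [(el.get("name"), el.get("host")) for el in root.findall(xpath)]


# Services that report themselves as up.
up = query(".//service[status='up']")
# Every registered instance of one service type.
blast_instances = query(".//service[@name='blast']")
```

The client sends only the query string and receives only the matching entries, which is the point of doing the processing on the registry side.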
In comparison to UDDI and WSIL, service data aggregation is a middle-of-the-road approach. It is not as centralized as UDDI, but it does include server-side processing capabilities not found in WSIL. Like UDDI, service data aggregation attempts to protect QoS with a publish/subscribe model based on timeouts. However, this mechanism shifts the responsibility to service providers to publish correct service data through the use of a shared XSD namespace vocabulary.
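A timeout-based publish model of this kind is often called soft state: an entry survives only as long as its provider keeps republishing it. The sketch below is a generic illustration under that assumption, not the VOregistry implementation; the class name, the lifetime value, and the explicit `now` parameter (used instead of a real clock so the behavior is easy to follow) are all hypothetical.

```python
class SoftStateRegistry:
    """Entries expire unless the provider republishes within the lifetime."""

    def __init__(self, lifetime_seconds):
        self.lifetime = lifetime_seconds
        self._entries = {}  # service name -> (service data, expiry time)

    def publish(self, name, service_data, now):
        """Insert or refresh an entry; each call pushes the expiry forward."""
        self._entries[name] = (service_data, now + self.lifetime)

    def live_services(self, now):
        """Drop expired entries and return the names that remain."""
        self._entries = {
            name: entry
            for name, entry in self._entries.items()
            if entry[1] > now
        }
        return sorted(self._entries)


registry = SoftStateRegistry(lifetime_seconds=30)
registry.publish("blast", {"status": "up"}, now=0)
registry.publish("clustal", {"status": "up"}, now=0)

# Only blast refreshes its entry; clustal goes silent (perhaps it crashed).
registry.publish("blast", {"status": "up"}, now=20)

at_25 = registry.live_services(now=25)  # both entries still valid
at_40 = registry.live_services(now=40)  # clustal's entry has timed out
```

A failed service therefore disappears from the registry without anyone having to unregister it, which is how a timeout model can keep stale entries out.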
Service data aggregation is also compatible with WSIL and UDDI. All three registries reside on a server and use a similar concept of web service discovery, and all three implementations rely on the same Java XML libraries.
The current service data aggregation implementation is limited by the fact that only OGSI-compliant services can be included. When (and if) the WSDL and GWSDL specifications converge, this model will become available as a web service as well as a grid service registry mechanism.
Background: Workflow Systems
Historical Perspective
This chapter provides background on the numerous scientific workflow projects.
The most widely used definition of the term workflow is presented by the Workflow Management Coalition [WFMC]. Workflow can be defined as the orchestration of a set of activities to accomplish a larger, more sophisticated goal, referred to as a business process.
Academic work on workflow in mathematics and computer science dates back to the mid-1970s. Skip Ellis and Michael Zisman worked on “Office Automation” prototypes at Xerox PARC. Work in this area went on from 1975 to 1985, but did not make significant progress until the business process re-engineering trends of the 1990s. Several early commercial products were also created, with Lotus Domino being the most successful. Simultaneously, office automation projects were sponsored by the EU and the University of Dortmund.
One of today’s workflow researchers, Michael zur Muehlen, has created a histogram of early workflow projects.
By the 1990s, the automation of business workflow had matured into many successful software products, including [Staffware], [COSA], [InConcert], [Eastman], [Domino], [Websphere], [Workflo], and [I-Flow]. There are many open source workflow frameworks available as well.
Research on workflow systems was reopened from a different perspective in 2000-2001, when work began on the Web Service Choreography Interface (WSCI) specification, which introduced the notions of web service composition and service orchestration. As part of their commitment to adopting web services and XML, Microsoft and IBM developed XLANG [xlang] and WSFL [wsfl] respectively. In 2002, Microsoft and IBM jointly announced the BPEL workflow language specification. The specification [BPEL4WSSPEC] was finalized with input from several other partners, including BEA, and then rolled into a working group at OASIS [oasis].
Over the last two years, XML meta-languages have gained a lot of attention in both technical and business publications. WSCI, XLANG, WSFL, and BPEL have been debated and analyzed from both technical and political perspectives by James Snell at IBM, David Berlind for coverpages.org, and Paul Krill reporting for InfoWorld. Simultaneously, many academic projects began using these languages or systems derived from these languages.
A Survey of Current Projects
Woflan
Self-Serv
BioPipe
SCIRUN
WASA
myGrid
Triana
Chimera
Ptolemy-II
DiscoveryNet
wftk
SWFL
ICENI
BioOpera
ILab
GridAnt
DAGMan
EDSS
METEOR
COUGAAR
Todo:
LabVIEW
XCAT
GridFlow
Webflow
Woflan
Woflan [www-woflan] is a "Petri-net-based workflow diagnosis tool" developed at TU/e in the Netherlands. The Woflan program allows a user to input a workflow specification in file formats from several commercial workflow applications, including Staffware [www-staffware], COSA [www-cosa], and Protos [www-pallas]. Woflan produces descriptive, warning, and error messages from the input file. Woflan checks workflows for correctness and generates diagnostics to help repair problems [Verbeek00].
Self-Serv
Self-Serv [benatallah03] is a collection of web services middleware for the creation of web services from other web services. Self-Serv is based on a peer-to-peer model, in that any service is capable of executing composite services without the need for a central scheduler. Self-Serv workflows are represented as state-charts that allow for a range of logical operations to occur within a composite service. Self-Serv also contains components for web service discovery with UDDI and web service deployment.
BioPipe
BioPipe [biopipe] is one of a series of related packages for carrying out bioinformatics analysis from the Open Bioinformatics Foundation [obf]. BioPipe is a collection of Perl modules for constructing workflows from BioPerl [bioperl] applications. Much of the code and many of the ideas are borrowed from the Ensembl [ensembl] pipeline project.
SCIRUN
SCIRUN [scirun] is an application begun in 1992 by the National Center for Research Resources Center at Utah. SCIRUN is a GUI “scientific workbench” that allows users to construct, manage, and debug simulations in domains such as physics and neurobiology. SCIRUN simulations may be thought of as workflows since they allow for parallel and conditional execution of tasks. Some SCIRUN applications allow for jobs to be run on a grid. SCIRUN also provides extensive scientific visualization libraries.
WASA
WASA2 [wasa] is an application that supports the creation and execution of workflows from CORBA components. It includes a GUI workflow modeler as well as controls to modify a workflow while it is running. WASA2 has been used in the domains of geoprocessing and molecular biology as well as for modeling business processes. WASA2 uses a strictly object-oriented approach where workflows are represented as CORBA objects and can be displayed as UML diagrams.
myGrid
The myGrid Project [greenwood02] [addis03] is a collection of bioinformatics web services and grid services hosted by the European Bioinformatics Institute. myGrid uses SoapLab [soaplab] and the Apache Axis [AXIS] framework to provide a web service interface to a collection of grid-based data-analysis services. Scientists are able to compose, edit, and save workflows with either a web portal or a GUI workbench [freeflow]. The system currently supports two XML workflow languages, a subset of IBM's WSFL [WSFL] and a more domain-specific language called XScufl [taverna] [wiki]. myGrid executes workflows defined in these documents with the IT Innovation Workflow Enactment Engine [it].
Triana
Triana [Triana] is a set of open source Java libraries that provide a GUI for building workflows from a collection of OGSA grid services. Triana has been leveraged by several other projects, including myGrid and Chimera. Triana contains an engine for coordinating and invoking a set of grid services. Triana also contains a peer-to-peer component based on JXTA [JXTA], allowing Triana to run on a variety of devices, including PDAs and mobile phones.
Chimera
Chimera [deelman03c] is a system used to find or create a workflow for a series of OGSA grid services to provide a scientist’s requested “data product.” Chimera was begun in 1999 and is enabled by the Pegasus Planner [pegasus] described in [deelman02b], a GriPhyN project at USC ISI. Chimera is middleware designed to be invisible to a client who requests the data product. The workflow is represented as an “abstract program execution graph.” This graph is transformed into an executable DAG for the Condor DAGMan [dagman] scheduler.
Ptolemy-II
Ptolemy-II [ptolemy] is a visual modeling platform written in Java. It was begun in 1997 at UC Berkeley. Several recent SDM [sdm] efforts, including [altintas02a], [ludaesher03], [mladen03], and [altintas03], have extended the Ptolemy-II platform to allow for the drag-and-drop creation of scientific workflows from libraries of actors. A Ptolemy actor is often a wrapper around a call to a web service or grid service. Ptolemy leverages an XML meta-language called MoML [lee-moml] to produce a workflow document describing the relationships of the entities, properties, and ports in a workflow. Presently, Ptolemy actor libraries exist for the domains of bioinformatics and ecology at [sdmnc] and [seek].
DiscoveryNet
DiscoveryNet [discoverynet] is a collection of software built on top of the UNICORE grid system for arranging database access and knowledge discovery procedures. DiscoveryNet was begun in 2001 at Imperial College of Science. DiscoveryNet provides a means of describing workflow between analysis service providers, data owners, and the scientists who arrange and execute these workflows [row03]. DiscoveryNet makes use of the OGSI components and protocols as well as its own protocol for workflows, Discovery Process Markup Language [guo02]. This language is used for constructing, running, and managing grid services, as well as recording their history.
wftk
The Open-source workflow toolkit [wftk] is a generalized workflow system, implemented as a series of Java libraries. wftk was begun in 1998 by Michael Roberts. wftk uses its own high-level language to describe workflow, and stores workflow-related documents in a series of XML “datasheets.” The wftk workflow engine contains two models of a workflow: a “task-based” model, essentially a DAG, and a “state-based” model, essentially an FSM. wftk libraries can also be run as web services.
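The difference between the two models is easiest to see in the state-based case, where a workflow may loop back to an earlier state, something a DAG cannot express. The following is a generic finite state machine sketch, not wftk code; the states, events, and function name are invented for illustration.

```python
# Hypothetical document-review workflow: (current state, event) -> next state.
TRANSITIONS = {
    ("draft", "submit"): "review",
    ("review", "approve"): "done",
    ("review", "reject"): "draft",  # a cycle, which a DAG could not represent
}


def run(state, events):
    """Apply events in order and return the final state."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not allowed in state {state!r}")
        state = TRANSITIONS[key]
    return state


final_state = run("draft", ["submit", "reject", "submit", "approve"])
```

A task-based model would instead list the tasks once with their dependencies and run each exactly one time in dependency order.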
SWFL
Service Workflow Language [swfl] is an XML-based meta-language for the construction of scientific workflows from OGSA-compliant services. SWFL was developed at Cardiff University in 2002 and 2003. SWFL extends IBM’s WSFL and supports a new set of conditional operators and loop constructs as well as arrays and objects. SWFL has a workflow engine based on JISGA (Jini-based Service-oriented Grid Architecture). JISGA [jigsa] uses JavaSpaces [jinni] as shared memory for parallel execution through grid services. Therefore, SWFL supports integrating parallel programs into the workflow.
ICENI
ICENI (Imperial College e-Science Networked Infrastructure) [iceni] is a collection of grid middleware used for providing and coordinating grid services for e-science applications. ICENI includes a GUI workflow construction tool integrated into the Netbeans [netbeans] IDE. This tool can create a textual “execution plan” of the workflow in an XML meta-language derived from YAWL (Yet Another Workflow Language) [YAWLa]. The ICENI workflow system supports conditionals, loops, and parallel execution.
BioOpera
BioOpera [bioopera] is an application for the composition of various bioinformatics applications built on top of the [opera] architecture. The project was begun in 1999 at the Swiss Federal Institute of Technology. BioOpera provides a GUI tool for the construction of workflows from web services and grid services. BioOpera represents a workflow “process template” as a DAG, and translates this set of activities into an execution script for Condor-G [bausch03]. Additionally, the web service invocation engine within BioOpera uses UDDI, and the grid service component uses service description documents [haller03-thesis].
ILab
ILab [ilab] is a set of grid middleware developed at The Fraunhofer ICT Group since 2001. ILab includes a GUI tool for the construction of workflows from grid services. This tool uses an XML-based language called Grid Application Definition Language (GADL) [der03] to assemble applications from grid services, and Grid Job Definition Language (GJobDL) to describe the runtime behavior of such applications. ILab uses Petri nets instead of DAGs to model and control the workflow.
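For readers unfamiliar with Petri nets as a workflow model, here is a minimal generic sketch (not ILab, GADL, or GJobDL code). Places hold tokens, and a transition may fire only when every one of its input places holds a token. The class, the place and transition names, and the simplifying assumption of arc weight 1 with distinct input places are all mine.

```python
class PetriNet:
    """Tiny Petri net: arc weights are 1 and input places are distinct."""

    def __init__(self, marking):
        self.marking = dict(marking)  # place -> token count
        self.transitions = {}         # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        """A transition is enabled when every input place holds a token."""
        inputs, _ = self.transitions[name]
        return all(self.marking.get(place, 0) >= 1 for place in inputs)

    def fire(self, name):
        """Consume one token per input place and produce one per output place."""
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] = self.marking.get(place, 0) + 1


# Fork/join workflow: two analyses run in parallel, then results are merged.
net = PetriNet({"start": 1})
net.add_transition("split", ["start"], ["todo_a", "todo_b"])
net.add_transition("analyze_a", ["todo_a"], ["done_a"])
net.add_transition("analyze_b", ["todo_b"], ["done_b"])
net.add_transition("merge", ["done_a", "done_b"], ["end"])

net.fire("split")
net.fire("analyze_a")
merge_ready_early = net.enabled("merge")  # still waiting on analyze_b
net.fire("analyze_b")
net.fire("merge")
```

Because the state of the workflow is the token marking rather than a position in a fixed graph, the same formalism can also describe loops and shared resources, which is one reason a project might prefer it to a DAG.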
GridAnt
GridAnt [GRANT] is an extension of the Apache Ant build tool residing in the Globus CoG Kit. GridAnt was begun as GSFL (Grid Service Flow Language) in 2001. GridAnt allows for the construction of client-side workflows for Globus Toolkit 3 applications. It allows for the specification of preconditions and parallel tasks in much the same way as the Ant build tool.
DAGMan
DAGMan [dagman] is a set of C libraries that allow the user to schedule programs based on dependencies. DAGMan is part of the Condor project and extends the Condor Job Scheduler [condor] to handle dependencies between jobs. As the name suggests, DAGMan represents a collection of job dependencies as a directed acyclic graph. These DAGs are specified using a simple text file format. DAGMan leverages technologies from the Condor-G scheduler, Globus Toolkit, and the EU Datagrid [ppdg-20].
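The idea of scheduling from a dependency graph can be sketched in a few lines. The input below is a simplified text format loosely modeled on JOB and PARENT/CHILD lines; the parser is an illustration only and is not the DAGMan parser, and the job names and submit file names are made up. Ordering is done with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Simplified, hypothetical DAG description.
DAG_TEXT = """
JOB A prepare.sub
JOB B align.sub
JOB C filter.sub
JOB D report.sub
PARENT A CHILD B C
PARENT B C CHILD D
"""


def parse_dag(text):
    """Return a mapping from each job to the set of jobs it depends on."""
    deps = {}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "JOB":
            deps.setdefault(words[1], set())
        elif words[0] == "PARENT":
            split = words.index("CHILD")
            parents, children = words[1:split], words[split + 1:]
            for child in children:
                deps.setdefault(child, set()).update(parents)
    return deps


def execution_waves(deps):
    """Group jobs into waves; jobs within a wave could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


deps = parse_dag(DAG_TEXT)
waves = execution_waves(deps)
```

Here B and C land in the same wave because both depend only on A, while D must wait for both of them.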
EDSS
The Environmental Decision Support System [edss] is a graphical tool for creating air quality simulation experiments. EDSS allows the user to define text file definitions or “templates” used by the various air quality simulation programs which are typically FORTRAN programs. The user defines a workflow, which is represented as a DAG, and can export the model for later use.
METEOR
The Managing End-To-End OpeRations [meteor] project is a collection of Java programs for constructing and executing workflows. METEOR began in 1994 and now continues as METEOR-S [verma03], a project on semantic web services. These projects were used to create IntelliGEN [kochut03], a bioinformatics workflow system that coordinated a number of web services. METEOR was also used to build an information management system for Genome Databases [hall03]. METEOR has also been used in several projects for business workflows and to illustrate quality of service issues in [cardoso02] and [cardoso03].
COUGAAR
The Cognitive Agent Architecture [cougaar] project is a collection of Java libraries that provide a framework for distributed multi-agent systems. The Logical Cougaar Experiment [roaringshrimp] is a project that attempts to define workflow structures and languages for distributed services-based applications. A derived work, the Microsoft Desktop Collaboration Agent Experiment [extendedexp], uses these projects to automate office tasks such as email. This project leverages techniques from narrative scripting and expert systems such as [amzi].
coming soon... pictures
Sunday, December 21, 2003
University of Applied Sciences Rapperswil (HSR), Rapperswil, Switzerland
Marcial Rion and Mathias Kengelbacher have finished their Ph.D. thesis:
Adapting BPEL4WS to the Grid: A workflow language proposal for Grid environments
http://www.volewa.ch/singapur/thesis.pdf (7.2 megabytes!)
Well, this isn't the first workflow language and won't be the last. The number of XML languages, schemas, proposals, and specifications is exploding. Who exactly are we proposing to, and why? In fact, one could see this thesis as only adding to the problem of too many workflow languages. However, I see this project as a step in the right direction for creating better workflow languages. The main improvements in this workflow language are the authors' acknowledgement of the workflow patterns theory, existing languages, and the BPEL specification.
If future XML languages build on existing languages and have a set of workflow pattern standards for comparison, then each language can be evaluated for what it is: a set of trade-offs.
Sunday, December 14, 2003
What's new with workflow?
A list of active open source workflow projects that are written in Java
a new presentation from Bertram Ludäscher
Kepler: Scientific Workflows Based on Dataflow Process Networks (or: Workflow considered harmful!?), e-Science Workflow Services Meeting, NeSC, Edinburgh, December 3rd, 2003.
What's new with Grid?
There's (yet another) NSF Middleware project:
OGCE: Open Grid Computing Environment
Gregor's found people who have used the CoG Kits