GFD-R-P.186
Copyright Notice
Copyright © Open Grid Forum (2006-2011). Some Rights Reserved. Distribution is unlimited.
Abstract
This document follows the document produced by the GridRPC-WG on the GridRPC Model and API for End-
User applications. This new document aims to complete the GridRPC API with data management mechanisms
and the corresponding API.
This document is not intended to provide features and capabilities for building data management middle-
ware. Its goal is to complete the GridRPC set of functions and definitions to allow users to manipulate their
data. The motivation for this document is to provide explicit functions to manipulate the exchange of data
between users, components of a GridRPC platform and storage resources, since (1) the size of the data used
in Grid applications may be large and useless data transfers must be avoided, and (2) data are not always
stored on the client side but may be made available either on a storage resource or within the GridRPC
platform. All functions in the API are designed so that they can be called by each part of a GridRPC
platform (client, agent and server) if needed.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Security Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Data Management motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3 GridRPC Data Management model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4 Data Management API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.1 GridRPC data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.1.1 The grpc_data_t type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.1.2 Function specific types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
[email protected] 1
GFD-R-P.186 June 6, 2011
[email protected] 2
GFD-R-P.186 June 6, 2011
1 Introduction
The goal of this document is to define a data management extension to the GridRPC API for End-User
applications. As for the GridRPC API document [? ], discussing the implementation of the described API
is out of scope; this document focuses on data management mechanisms inside a GridRPC platform.
The motivation of the data management extension is to provide explicit functions to handle data exchanges
between data storage resources, components of a GridRPC platform and the user. The GridRPC API
defines an RPC mechanism to access Network Enabled Servers. However, an application needs data to run and
generates some output data, which have to be transferred. As the size of data may be large in grid envi-
ronments, it is mandatory to optimize transfers of large data by avoiding useless exchanges. Several cases
may be considered depending on where data are stored: on an external data storage, inside the GridRPC
platform or on the user side. In all these cases, the knowledge of “what to do with these data?” is owned
by the client. Thus, the GridRPC API must be extended to provide functions for explicit and simple data
management.
We first present the motivation for data management; the proposed data management model is then
introduced in Section 3. The main contribution of this document is given in Section 4, where we describe
our proposal for a data management API.
Next, we explain two different cases concerning external data and internal data:
• External data are placed on servers, like data repositories. These servers are not registered inside the
platform but can be directly accessed to read/write data. The use of such data implies several data
transfers if the client uses the basic GridRPC API: the client must download the data and then send
it to the GridRPC platform when issuing the call to grpc_call() (see Figure 1, which shows the Client,
a Computational Server and a Data Storage Resource around the GridRPC Platform). One of these
transfers should be
avoided: the client may just give a data reference (or handle) to the platform/server and the transfer
is completed by the platform/server. Consider a client with small data storage capacities that needs
to issue a request on large data stored on a data storage server. It is costly, and may not even be
possible, to send the data to the client first before issuing the call. The client could also need to store
its results directly from the computational server to the data storage, or to share an input or output
data with other users of the GridRPC platform. Examples of such Data Storage servers are IBP [? ]
and SRB [? ]. Among the different available examples of this approach, we can cite: (1) the Distributed
Storage Infrastructure of NetSolve [? ]; (2) the utilization of JuxMem [? ] or DAGDA [? ] in DIET. This
approach is well suited for, but not limited to, long-lived persistent data.
• Internal data are managed inside the GridRPC platform. Their placement depends on computations
and it may be transparent to clients: in this case, the GridRPC middleware can manage data. Temporary
data, generated by request sequencing [? ], are examples of internal data. For instance, a client
issues two calls to solve the same problem and the second call uses input or output data from the first
call. This is the case if we solve C = t(A × B) (where tM denotes the transpose of M), A and B being
matrices. If the client does not find a solver which computes these two operations in one step, he must
issue two calls: W = A × B and C = tW. But the value of W is of no interest to him; this matrix should
not be sent back to the client host. Temporary data should be left inside the platform, close to or on the
computational server, when clients do not need them. Other cases of useless temporary data occur
when the results of a simulation are sent to a graphical viewer, as done in most Problem Solving
Environments. Among the examples of internal data management, we can cite the Data Tree Management
infrastructure used in DIET [? ] and the data layer omniStorage in omniRPC [? ]. This approach is
suitable for, but not limited to, intermediate results to be reused in case of request sequencing.
[email protected] 4
GFD-R-P.186 June 6, 2011
For internal data, two cases arise: either the client knows where the data is stored and can manage the
transfer (using the same calls as the ones used to manage external data stored on a storage server), or
the data is transparently managed by the GridRPC middleware, in which case the middleware provides
mechanisms for transfers between computational servers.
These two approaches are complementary in the data management model proposed here. The GridRPC
platform and the data storage interact to implement data transfers. Note that some additional
functionalities, which are not addressed in this document, can be designed. For example, reusable data generated
by a computation could be stored during a TTL (Time To Live) on the computational server before being
sent to data storage servers; or, when the storage capacity of a computational server is exceeded, data may
be sent to another data storage server.
In both cases, it is mandatory to identify each data. All data stored either in the platform or on storage
servers will be identified by Data Handles and Storage Information. Without loss of generality, we define
the term GridRPC data as either the data used for a computational problem, or both a Data Handle and
storage information. Indeed, when a computational server receives a GridRPC data which does not contain
the computational data, it must know the unique name of the data through its Data Handle, as well as
one location from which to get the data and the location where the client wants to save it after the
computation. Thus, storage information must record the original location of the data and its destination.
A data in a GridRPC middleware is defined by the grpc_data_t type. Variables of this type represent
information on a specific data, which can be local or remote. It is at least composed of:
• A unique identifier, of type grpc_data_handle_t. This is created and used internally by the
GridRPC middleware to manage the data. Depending on the data management possibilities of the
GridRPC middleware, the data itself can also be stored.
[email protected] 5
GFD-R-P.186 June 6, 2011
• Two NULL-terminated lists of URIs, one to access the data and one to record the data (for example, an
OUT parameter to transfer at the end of a computation, or a prefetch of the data).
• Information concerning the mode of management. For example, data management defaults to that of
the standard GridRPC paradigm, but it can be set, for example, to GRPC_PERSISTENT,
which corresponds to a transparent management by the GridRPC middleware.
• Information concerning the type of the data, as well as its dimensions.
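As an illustration, an implementation could gather this information in a structure similar to the following
sketch. The field names and layout are purely illustrative assumptions: the document only mandates the
informational content of grpc_data_t, not its representation.

/* Illustrative sketch only: neither the field names nor the layout are
 * mandated by this document.                                           */
typedef struct {
    grpc_data_handle_t handle;   /* unique identifier, managed by the middleware   */
    char **input_uris;           /* NULL-terminated list of locations to read from */
    char **output_uris;          /* NULL-terminated list of locations to write to  */
    grpc_data_mode_t *modes;     /* management modes, e.g. GRPC_PERSISTENT         */
    grpc_data_type_t type;       /* type of the data, e.g. GRPC_DOUBLE             */
    size_t *dimensions;          /* zero-terminated vector of dimensions           */
} grpc_data_t;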
Details on Storage Information
• URI: it defines the location where a data is stored. The URI formalism is described in [? ]. It is built
like “protocol:[//machine_name][:port]/path_to_data” and thus contains up to four fields. Some
fields are optional, depending on the requested protocol (e.g., no hostname for “file”). Several full
examples of utilization can be found in Section A.
– char * protocol: a token like “ibp”, “http”, “ftp”, “file”, “memory”, “LFS” (local file system),
“DFS” (distributed file system) or “middleware” can be used. It gives some information on how
to access the data (the list is not exhaustive).
– char * hostname: the name of the server on which the data resides.
– int port: the port to use to access the data.
– char * path: the full path of the data or an ID.
For example,
– A GridRPC data corresponding to an input matrix stored in memory can partly be constructed
with the protocol field set to "memory", the port set to a null string, the machine name set to
the one of the localhost, and path_to_data set to the path used to access the data in memory (a
key that lets the GridRPC API make the correspondence with the correct input to give to the
data middleware, see Section 4.2.2).
– The URI “http://myName/myhome/data/matrix1” corresponds to the location of a file named
matrix1, which we can access on the machine named myName, with the http protocol. Typi-
cally, the data, stored as a file, can be downloaded with a command like:
wget http://myName/myhome/data/matrix1
• The management mode is an enumerated type, grpc_data_mode_t. It is useful to set the behavior (see
Table 1) of the data on the platform. It is given by the client, and is related to the following policy
values. If the middleware does not handle the given behavior, it returns an error.
– GRPC_VOLATILE: used when the user explicitly manages the grpc_data_t data (location and
contained data). Thus the data may not be kept inside the platform after a computation, and
this can be considered as the default usage for the GridRPC API. For coherency issues, note that an
underlying data middleware can only consider using any internal copy if it verifies, with the
help of checksum functions for example, that the data is indeed identical to the one specified in
the URI provided by the user.
– GRPC_STRICTLY_VOLATILE: used when the data must not be kept inside the platform after
a computation, for security reasons for example (can be considered as the default usage for the
GridRPC API).
[email protected] 6
GFD-R-P.186 June 6, 2011
– GRPC_STICKY: used when a data is kept inside the platform but cannot be moved between the
servers, except if the user explicitly asks for the migration. This is used if the client needs the
data in the platform for a second computation on the same server, for example. Note that in this
case, the data can also be replicated and that potential coherency issues may arise.
– GRPC_UNIQUE_STICKY: used when a data is kept inside the platform and cannot be moved
between the servers. This is used if the client needs the data in the platform for a second
computation on the same server, for example. Note that in this case, the data cannot be replicated,
for security reasons for example.
– GRPC_PERSISTENT: used when a data has to be kept inside the platform. The underlying data
middleware is explicitly asked to handle the data: the data can migrate or be replicated between
servers depending on scheduling decisions, and potential coherency issues may arise if the user
attempts to modify the data on his own.
– GRPC_END_LIST: this is not a mode but a marker, used to terminate grpc_data_mode_t
lists.
• The type of the data is an enumerated type, grpc_data_type_t: it is set by the client and describes
the type of the data, for example GRPC_DOUBLE or GRPC_INT, as exposed in Table 2.
We have defined a special grpc_data_t that can contain other grpc_data_t, namely the GRPC_
CONTAINER_OF_GRPC_DATA (see Section 4.2.2 on the management of containers). That way, the
user relies on the GridRPC Data Middleware to transfer a set of grpc_data_t data. Whether it is
implemented by an array, a list or anything else is GridRPC Data Middleware dependent, and thus
not in the scope of this document.
[email protected] 7
GFD-R-P.186 June 6, 2011
In this section, we describe some types that are internal to a given function. They are enumerated types,
and are given here for convenience.
The grpc_completion_mode_t type. This type is used in grpc_data_wait() and is defined by the
enumerated type { GRPC_WAIT_ALL, GRPC_WAIT_ANY }, which can be extended. It is used to detail the
behavior of the waiting process: the function can wait for one or all transfers concerning the data involved
in the call to grpc_data_wait().
The grpc_data_info_type_t type. This type, only used in grpc_data_getinfo(), is used to define the
wanted information. It is an enumerated type defined with the following values (which can be extended):
• GRPC_DATA_HANDLE: used to get information on the handle.
• GRPC_INPUT_URI: used to know the different protocols and locations where replicas of the data can
be accessed.
• GRPC_OUTPUT_URI: used to know the different protocols and locations where the data has to be
transferred.
• GRPC_MANAGEMENT_MODE: used to know the management mode of a data, for example
GRPC_VOLATILE.
• GRPC_DIMENSION: used to know the dimensions of the data.
• GRPC_TYPE: used to know to which type of the implementation language the data corresponds.
• GRPC_LOCATIONS_LIST: used to get all the locations known by the underlying data middleware
where the data can be accessed, and the protocols used to do so.
• GRPC_STATUS: used to know whether a GridRPC data is, for example, “GRPC_IN_PLACE”,
“GRPC_TRANSFERING” or “GRPC_ERROR_TRANSFER” (exact outputs have to be discussed in an
interoperability document).
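For illustration, these two types could be declared as follows. This is only a sketch: the sets of values can
be extended, and the actual declarations are left to implementations.

typedef enum {
    GRPC_WAIT_ALL,          /* wait for the completion of all listed transfers */
    GRPC_WAIT_ANY           /* return as soon as one listed transfer completes */
} grpc_completion_mode_t;

typedef enum {
    GRPC_DATA_HANDLE,       /* information on the handle                       */
    GRPC_INPUT_URI,         /* protocols/locations where replicas can be read  */
    GRPC_OUTPUT_URI,        /* protocols/locations the data is transferred to  */
    GRPC_MANAGEMENT_MODE,   /* management mode, e.g. GRPC_VOLATILE             */
    GRPC_DIMENSION,         /* dimensions of the data                          */
    GRPC_TYPE,              /* type in the implementation language             */
    GRPC_LOCATIONS_LIST,    /* all locations known by the data middleware      */
    GRPC_STATUS             /* e.g. "GRPC_IN_PLACE", "GRPC_TRANSFERING"        */
} grpc_data_info_type_t;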
Note: Information is managed by the GridRPC Data Management API, which relies on at least one Data
Management middleware. Hence, information concerning a data can be stored within the GridRPC Data
Management middleware, and/or within the grpc_data_t type. Nonetheless, this document does not
focus on implementation, and the syntax of outputs will be discussed in an interoperability document.
[email protected] 8
GFD-R-P.186 June 6, 2011
Data exchanges between the client and explicit locations (computational servers or storage servers) are
done using the asynchronous transfer function. The GridRPC data can also be inspected, to get more
information about the status of the data or its location. Functions are also given to wait for the completion
of some transfers. Finally, one can unbind the handle and the data, and free the GridRPC data.
To provide identification of long-lived data, data handles should be saved and restored, for instance in a
file. This allows two different users to share the same data.
Security and data life cycle management issues are not concerns of this API.
Examples of the use of this API are given in Appendix A.
The init function initializes the GridRPC data with a specific data. This data may be available locally or
on a remote storage server; both identifications can be used. GridRPC data referencing input parameters
must be initialized with identified data before being used in a grpc_call().
Function prototype:
grpc_error_t grpc_data_init(grpc_data_t * data,
const char ** list_of_URI_input,
const char ** list_of_URI_output,
const grpc_data_type_t data_type,
const size_t * data_dimensions,
const grpc_data_mode_t * list_of_data_mode);
The list_of_URI_input and list_of_URI_output parameters are NULL-terminated lists of strings,
which give the different locations from which the data can be transferred, and the required locations to
which the data has to be transferred. Hence, a list generally describes all the available locations known by
the client, in order for the GridRPC Data Management middleware to possibly implement some efficient
and fault-tolerant mechanisms to perform a choice among all the proposed locations (and the ones possibly
already known by the Data Management middleware if the handle has already been defined during a
previous call). For the sake of simplicity, one can imagine that the default behavior would be a sequential
try until the transfer from one of them can be achieved.
[email protected] 9
GFD-R-P.186 June 6, 2011
Remarks:
• If the function is called with a grpc_data_t which has been used in a previous call, fields corre-
sponding to information already given are overwritten.
• The list_of_URI_input and list_of_URI_output parameters can be set to NULL if empty.
Typically, input parameters will have their list_of_URI_output set to NULL, output parameters will
have their list_of_URI_input set to NULL, and inout parameters will have their list_of_URI
_output containing the list_of_URI_input.
Note that the presence of list_of_URI_input and list_of_URI_output does not imply that
the data is IN, INOUT or OUT. Of course, giving the same value to both lists should give the same
behavior as setting a data INOUT.
• If list_of_data_mode is NULL, the data is managed with either GRPC_VOLATILE or
GRPC_STRICTLY_VOLATILE as the default mode, depending on the capabilities of the Grid
middleware.
If the data has to be managed differently on at least one other resource, for example with a GRPC_
STICKY mode on a given resource, then the storage management has to be defined for all locations:
this implies that the size of this list is the same as the size of the output list, since there is one mode
for each output location.
• The data_dimensions parameter is a vector terminated by a zero value, containing the dimensions
of the data. For example, an array of [n m 0] would be used to describe an n × m matrix.
The dimension is always known by the client: in case the number of results is not known for a service,
this service will generally return a GRPC_CONTAINER_OF_GRPC_DATA of dimension [1 0].
Each result can then be accessed inside the single container with grpc_data_container_get(),
and its dimension known by a call to grpc_data_getinfo().
• Error code identifiers and meanings are described in Table 3. When an argument is not valid, it can
mean either that the user made a mistake when calling the function, or that the corresponding
information is not supported by the implementation of the API.
Table 3: Error codes identifiers and meanings for the init function.
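As an illustration, the following sketch initializes an inout 100 × 100 matrix of doubles whose only known
location is an FTP server; the host name and path are illustrative, and complete examples are given in
Appendix A.

grpc_data_t dhM;
grpc_error_t err = grpc_data_init(&dhM,
    (const char * []){"ftp://myName/myhome/data/matrix1", NULL},  /* read from here  */
    (const char * []){"ftp://myName/myhome/data/matrix1", NULL},  /* write back here */
    GRPC_DOUBLE,
    (size_t []){100, 100, 0},                 /* a 100 x 100 matrix, zero-terminated */
    (grpc_data_mode_t []){GRPC_VOLATILE, GRPC_END_LIST});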
[email protected] 10
GFD-R-P.186 June 6, 2011
To use a data stored in memory, the user must provide in the input and/or output URI fields a name
which can be understood by the GridRPC Data Management layer of the GridRPC system, in addition to
using the memory protocol. For this reason, we provide here two functions:
Function prototype:
grpc_error_t grpc_data_memory_mapping_set(const char * key, void * data );
grpc_error_t grpc_data_memory_mapping_get(const char * key, void ** data );
The function grpc_data_memory_mapping_set() is used to make the relation between a data stored
in memory and a grpc_data_t data when the memory protocol is used: the aim is to set a keyword that
will then be used in the URI, for example during the initialization of the data.
Error code identifiers and meanings are described in Table 4. Note that, like for all functions of this API,
grpc_initialize() has to be called prior to the use of these two functions.
Table 4: Error codes identifiers and meanings for the memory mapping function.
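A minimal sketch of their use follows, assuming a 3 × 3 matrix A allocated by the application; the host
name in the comment is illustrative.

double A[9] = {0};                     /* allocated and filled by the application */
void  *p    = NULL;

grpc_data_memory_mapping_set("A", A);  /* "A" can now be used in a memory URI,    */
                                       /* e.g. "memory://britannia.ens-lyon.fr/A" */
grpc_data_memory_mapping_get("A", &p); /* p now points to A                       */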
In order to facilitate the use of some special structures like lists or arrays of grpc_data_t variables, the
two following functions let the user manipulate them at a higher level, without knowing the contents of
the structures.
Function prototype:
grpc_error_t grpc_data_container_set(grpc_data_t * container, int rank,
const grpc_data_t * data);
grpc_error_t grpc_data_container_get(const grpc_data_t * container, int rank,
grpc_data_t ** data);
The container variable is necessarily a grpc_data_t of type GRPC_CONTAINER_OF_GRPC_DATA. rank is
a given integer which acts as a key index, and data is the data that the user wants to add in or get from
the container. Note that getting the data does not remove it from the container. Furthermore, the
container management is left to the implementation.
Error code identifiers and meanings are described in Table 5.
[email protected] 11
GFD-R-P.186 June 6, 2011
Table 5: Error codes identifiers and meanings for container management functions.
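A sketch of a typical sequence follows. It assumes that container has been initialized with the
GRPC_CONTAINER_OF_GRPC_DATA type and that dhA and dhB are initialized GridRPC data.

grpc_data_t container, dhA, dhB;  /* assumed initialized with grpc_data_init() */
grpc_data_t *elem = NULL;

grpc_data_container_set(&container, 0, &dhA);  /* store dhA at index 0        */
grpc_data_container_set(&container, 1, &dhB);  /* store dhB at index 1        */
grpc_data_container_get(&container, 1, &elem); /* elem now refers to dhB; the */
                                               /* data stays in the container */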
This function writes a GridRPC data to the output locations set, during the init call, in the output parameter
fields. For convenience, additional URIs from which the data can be downloaded and to which the
data has to be uploaded can be provided.
A user may want to transfer data while computations are performed, for example when a computation can
begin as soon as some data are downloaded but needs all of them to finish. The management of data must
therefore use asynchronous mechanisms as its default behavior. This function initiates the call for the
transfers, and returns immediately after.
Function prototype:
grpc_error_t grpc_data_transfer(grpc_data_t * data,
const char ** list_of_input_URI,
const char ** list_of_output_URI,
const grpc_data_mode_t * list_of_output_modes);
list_of_output_modes is a GRPC_END_LIST-terminated list with the same number of items as
list_of_output_URI. For each URI describing the hostname, the protocol used to access the data, etc.,
a management mode can be specified. Hence, list_of_output_modes can be used to set different
management policies on some resources (for example, set the data as GRPC_STICKY on a set of resources and
GRPC_PERSISTENT on the others) while possibly benefiting from an “aggressive” write, as the data is the
same everywhere.
Remarks:
• If list_of_output_modes is set to NULL, the management mode of the data is the one specified
during the initialization of the data, or the default one if it was not set as a unique mode.
Note that if a user wants to change the management modes of a data, this function can be called with
the list_of_output_URI and list_of_output_modes fields correctly filled.
• No information is given as to when the transfer will actually begin.
• If a user needs to know if the transfer is completed on one or more locations, he can use the grpc_
data_getinfo() function.
• If a user wants to wait after the completion of one or more transfers, he can use the grpc_data_wait()
function.
[email protected] 12
GFD-R-P.186 June 6, 2011
• If the data middleware (e.g., the GridRPC middleware or the data middleware on which it relies) does
not manage coherency between the duplicates on the platform, a correct call to this function can be
useful to ensure that all copies are up-to-date.
Error code identifiers and meanings are described in Table 6.
Table 6: Error codes identifiers and meanings for the transfer function.
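As a sketch, the following call initiates the transfer of an already initialized data dhC to one additional
location, marking it GRPC_STICKY there; the host name and path are illustrative.

grpc_data_transfer(&dhC,
    NULL,                                   /* no additional input locations */
    (const char * []){"ftp://britannia.ens-lyon.fr/home/user/C.out", NULL},
    (grpc_data_mode_t []){GRPC_STICKY, GRPC_END_LIST});
/* returns immediately; use grpc_data_wait() to block until completion */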
The function grpc_data_transfer() is asynchronous by design. Hence, a user has the possibility to
overlap transfers with computation and to try to perform transfers in parallel. This function can then be
used by the user to wait for the completion of one or several transfers.
Function prototype:
grpc_error_t grpc_data_wait(const grpc_data_t ** list_of_data,
                            grpc_completion_mode_t mode,
                            grpc_data_t ** returned_data);
Depending on the value of mode (GRPC_WAIT_ALL or GRPC_WAIT_ANY), the call returns when all or one
of the data listed in list_of_data is transferred.
Remarks:
• returned_data is the (or one of the) data which makes the function return: if no special error is
returned and the mode was set to GRPC_WAIT_ANY, then the data is the one whose transfer has been
completed. If an error occurred, then there has been at least one error on the transfer of this data.
• This function considers only the information that the user is aware of: if the data is shared between
different users, then a call to grpc_data_wait() returns depending on the input of the user that
has performed the call. Hence, the call will not depend on another user's action, for example.
• This function can be used in such a way that the server can test whether data are in place (i.e., whether
transfers involved in the grpc_data_transfer() on the client side have been completed) before
doing anything. If the user performs a grpc_data_transfer() on a grpc_data_t whose transfer
has not yet been completed, the behavior depends on the data middleware that manages the data:
if the middleware implements some stamp mechanisms, then no problem will occur.
Error code identifiers and meanings are described in Table 7.
[email protected] 13
GFD-R-P.186 June 6, 2011
Table 7: Error codes identifiers and meanings for the wait function.
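For instance, the sketch below blocks until one of two transfers completes; it assumes that list_of_data
is NULL-terminated, like the other list parameters of this API.

grpc_data_t *done = NULL;

grpc_data_wait((const grpc_data_t * []){&dhA, &dhC, NULL},
               GRPC_WAIT_ANY,
               &done);  /* done points to whichever data completed first */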
When the user does not need a handle anymore but knows that the data may be used by another user, for
example, he can unbind the handle and the GridRPC data by calling this function, without actually freeing
the GridRPC data on the remote servers.
Function prototype:
grpc_error_t grpc_data_unbind(grpc_data_t * data);
After calling this function, data does not reference the data anymore.
Error code identifiers and meanings are described in Table 8.
Table 8: Error codes identifiers and meanings for the unbind function.
This function frees the GridRPC data identified by data on a subset of, or on all, the different locations
where the data is stored, and unbinds the handle from the data. This function may be used to explicitly
erase the data on a storage resource.
Function prototype:
grpc_error_t grpc_data_free(grpc_data_t * data, const char ** URI_locations);
If URI_locations is NULL, then the data is erased on all the locations where it is stored; otherwise it is
freed on all the locations contained in the list of URIs.
After calling this function, data does not reference the data anymore.
Error code identifiers and meanings are described in Table 9.
[email protected] 14
GFD-R-P.186 June 6, 2011
Table 9: Error codes identifiers and meanings for the free function.
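For instance, the following sketch erases a data on a single storage location (the URI is illustrative);
passing NULL instead of the list would erase the data everywhere it is stored.

grpc_data_free(&dhC,
    (const char * []){"ftp://britannia.ens-lyon.fr/home/user/C.out", NULL});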
This function lets the user access information about an instantiation of a grpc_data_t. It returns
information on data characteristics, status, locations, etc.
Function prototype:
grpc_error_t grpc_data_getinfo(const grpc_data_t * data,
grpc_data_info_type_t info_tag,
const char * URI,
char ** info);
The kind of information that the function gets is defined by the info_tag parameter (defined page 8). By
setting URI, a server name can be given to get some data information which depends on the location of
the data (like GRPC_STICKY). info is a NULL-terminated list containing the different available information
corresponding to the request.
Remarks:
• The exact syntax of outputs of this function has to be defined in an interoperable document.
• For values of info_tag equal to GRPC_INPUT_URI and GRPC_OUTPUT_URI, the returned list is
considered to be information on the GridRPC data in the system, not only the information obtained
locally for the handle (or stored in the grpc_data_t).
• URI can be set to NULL (default behavior). In that case, if the user tries to access the information of
the mode (GRPC_STICKY for example) and the data has different management modes on the platform,
then the value GRPC_UNDEFINED may be returned.
• If info_tag equals GRPC_STATUS, then info may be “GRPC_IN_PLACE”, “GRPC_TRANSFERING”
or “GRPC_ERROR_TRANSFER” for example (see the interoperability document for the standard
outputs).
Note that in case info_tag is set to GRPC_DATA_HANDLE, the information is of no use to manage data with
the given API: handles are initialized in the init call function and stored in the grpc_data_t. Furthermore,
the user has to free the memory allocated by the function for info.
Error code identifiers and meanings are described in Table 10.
[email protected] 15
GFD-R-P.186 June 6, 2011
Table 10: Error codes identifiers and meanings for the getinfo function.
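A sketch of a status query follows. How the NULL-terminated result list is allocated is
implementation-dependent, so the declaration of info below is only an assumption.

char *info[8];  /* filled by the function with a NULL-terminated list of strings */

grpc_data_getinfo(&dhC, GRPC_STATUS, NULL, info);
/* info[0] may then be, e.g., "GRPC_IN_PLACE"; the memory allocated by */
/* the function has to be freed by the user                            */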
In order to communicate a reference between Grid users, for example in case of large data, one should
be able to store a GridRPC data. The location can then be shared, for example by mail, and one can then
load the corresponding information.
Function prototype:
grpc_error_t grpc_data_load(grpc_data_t * data, const char * URI_input);
grpc_error_t grpc_data_save(const grpc_data_t * data, const char * URI_output);
These functions are used to load/save the data descriptions. Even if the GridRPC data contains the data
in addition to metadata management information (dimension, type, etc.), only the data information has to
be saved to the URI_output. The format used by these functions is left to the developer's choice. The way
the information is shared by different middleware is out of the scope of this document and should be
discussed in an interoperability recommendation document.
Error code identifiers and meanings are described in Table 11.
Table 11: Error codes identifiers and meanings for the load and save functions.
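A sketch of their use to share a data between two programs follows; the URI is illustrative.

/* first program: save the description of an existing data */
grpc_data_save(&dhC, "file:/home/user/C.desc");

/* second program: rebuild a handle to the same data from the description */
grpc_data_t dhShared;
grpc_data_load(&dhShared, "file:/home/user/C.desc");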
[email protected] 16
GFD-R-P.186 June 6, 2011
5 Contributors
Yves Caniou (Corresponding Author)
University of Lyon / CNRS / ENS Lyon / INRIA / UCBL
46 Allée d’Italie
69364 Lyon Cedex 7
France
Email: [email protected]
Frederic Desprez
University of Lyon / CNRS / ENS Lyon / INRIA / UCBL / SysFera
46 Allée d’Italie
69364 Lyon Cedex 7
France
Email: [email protected]
Gaël Le Mahec
University of Picardie Jules Verne
33, rue Saint Leu
80039 Amiens Cedex 01
France
Email: [email protected]
Yusuke Tanimura
Information Technology Research Institute, AIST
1-1-1 Umezono, Tsukuba Central 2
Tsukuba City 305-8568
Japan
Email: [email protected]
[email protected] 17
GFD-R-P.186 June 6, 2011
7 Disclaimer
This document and the information contained herein is provided on an “As Is” basis and the OGF disclaims
all warranties, express or implied, including but not limited to any warranty that the use of the information
herein will not infringe any rights or any implied warranties of merchantability or fitness for a particular
purpose.
[email protected] 18
GFD-R-P.186 June 6, 2011
9 References
[email protected] 19
GFD-R-P.186 June 6, 2011
Appendix
A Examples of use of the API
In this section, we give examples of the data management API usage to illustrate its interest. We do not
consider these examples an exhaustive list, but they can help in understanding how to use the API, as well
as the way to build a data management API in a GridRPC middleware.
grpc_data_memory_mapping_set("A", A);  /* A must be allocated and defined beforehand */
grpc_function_handle_init(handle1,"karadoc.aist.go.jp","*");
grpc_data_init(&dhA,
(const char * []){"memory://britannia.ens-lyon.fr/A", NULL},
NULL,
GRPC_DOUBLE, (size_t []){3, 3, 0},
NULL);
grpc_data_init(&dhB,
(const char * []){"DFS://britannia.ens-lyon.fr/home/user/B.dat", NULL},
NULL,
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_VOLATILE, GRPC_END_LIST});
grpc_data_init(&dhC,
NULL,
(const char * []){"FTP://britannia.ens-lyon.fr/home/user/C.out", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_VOLATILE, GRPC_END_LIST});
grpc_call(handle1, dhA, dhB, &dhC);
Here, we illustrate the way to send local data (in memory and on disk) to the GridRPC platform. In this
example, the client issues a call with two input data A and B, which are local to the client. As A is in
memory, a call to grpc_data_memory_mapping_set() has to be performed first, hence allowing the use
of the mapped string to reference A in the URI. B is a local file, and the protocol is set to DFS (Distributed
File System) to express the will of the user to benefit from a distributed file system on the remote server if
one is available. This can help to distribute the data, avoiding copies, if the computing resource is a cluster
and the service will use some of its nodes.
Note that the memory and DFS protocols are not network protocols: the involved transfers will be
performed using the GridRPC communication layer, and the provided protocols are intended to give
information to help the management of the data. Moreover, note that in a real code, A must be allocated
and defined before any use.
[email protected] 20
GFD-R-P.186 June 6, 2011
In this example, no data persistency is needed: the client issues a call with A and B as input data and C as
output data, and output data C is sent back to the client at the end of the computation. But only the transfer
is initiated, and a call to grpc_data_wait() would be mandatory on the client side to be sure that the
data is in place before using it.
A.1.3 Note
In the first call to grpc_data_init(), we used a NULL value for the list of grpc_data_mode_t, which
means the default value of the underlying data middleware (either GRPC_VOLATILE or GRPC_STRICTLY_
VOLATILE). Thus, it may be the same as using a list containing GRPC_VOLATILE, as done in the second
and third calls.
[email protected] 21
GFD-R-P.186 June 6, 2011
Figure 2: Simple RPC call with input and output data using external storage resources.
Here, we illustrate the way to send remote data stored on an SRB or IBP server to the GridRPC platform.
In this example, the client issues a call with two input data A and B. A is available on an IBP repository
and B is available on an SRB repository. With the input and output parameters of the grpc_data_init()
function, we can copy the data from one repository to another:
[email protected] 22
GFD-R-P.186 June 6, 2011
• A is read from the IBP server, and will be sent to the client.
• B is read from the SRB server and will be sent to the IBP server.
Note that in Figure 2, we arbitrarily show that the data A is transferred from the server karadoc.aist.
go.jp to the client. But it could have been from kaamelott.cs.utk.edu, depending on the implementation.
The same comment applies to B: it could have been transferred from the server carmelide.ens-lyon.fr.
Moreover, as A and B are input data, transfers can take place anytime before, during or after the
computation. A call to grpc_data_wait() is needed to be sure that data A is completely stored on the
client before using it locally.
The output data C is sent back to the client after the end of the computation. Only the transfer is initiated,
and a call to grpc_data_wait() is needed to be sure that the transfer of data C to the client is completed
before using it locally.
[email protected] 23
GFD-R-P.186 June 6, 2011
double * A;
grpc_data_memory_mapping_set("A", A); /* set mapping for memory scheme */
grpc_function_handle_init(handle1,"karadoc.aist.go.jp","*");
grpc_data_init(&dhA,
(const char * []){"memory://britannia.ens-lyon.fr/A", NULL},
(const char * []){"memory://karadoc.aist.go.jp/A", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_STICKY, GRPC_END_LIST});
grpc_data_init(&dhC,
(const char * []){"FTP://britannia.ens-lyon.fr/home/user/C.in", NULL},
(const char * []){"memory://karadoc.aist.go.jp/C", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_STICKY, GRPC_END_LIST});
for(i=0;i<n+1;i++) {
if( i==1 ) {
grpc_data_init(&dhA,
(const char * []){"memory://karadoc.aist.go.jp/A", NULL}, NULL,
GRPC_DOUBLE, (size_t []){3, 3, 0},
NULL);
grpc_data_init(&dhC,
(const char * []){"memory://karadoc.aist.go.jp/C", NULL}, NULL,
GRPC_DOUBLE, (size_t []){3, 3, 0},
NULL);
}
if( i==n )
grpc_data_init(&dhC,
(const char * []){"memory://karadoc.aist.go.jp/C", NULL},
(const char * []){"FTP://britannia.ens-lyon.fr/home/user/C.out", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_VOLATILE, GRPC_END_LIST});
Data A will be used and will remain on server karadoc: we use the GRPC_STICKY parameter to keep the
data on server karadoc. Data C is an input/output data. The first call to grpc_data_init() for this data
requires only an input location and the GRPC_STICKY mode. In this example, we assume that the
middleware deals with the readers-writers problem using a readers-writer lock for the access to the data.
Then, the second call can only start when A has been transferred from britannia.
If the middleware cannot deal with such an access, users should use grpc_wait() after each grpc_call().
[email protected] 24
GFD-R-P.186 June 6, 2011
Figure 3: GridRPC calls with data management using persistence through the GRPC_STICKY mode
(client: britannia.ens-lyon.fr; server: karadoc.aist.go.jp; A and C are exchanged through the middleware).
Output data C is generated on server karadoc, but only the last result is useful for the client. Thus, to
send the final result to the client, we update the output location just before the last grpc_call(). We
again assume that the middleware allows only one process/thread at a time to access C for writing.
[email protected] 25
GFD-R-P.186 June 6, 2011
grpc_data_init(&dhB,
(const char * []){"FTP://britannia.ens-lyon.fr/home/user/B.dat", NULL},
(const char * []){"memory://karadoc.aist.go.jp/B", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_STICKY, GRPC_END_LIST});
grpc_data_init(&dhC,
NULL,
(const char * []){"memory://karadoc.aist.go.jp/C", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_STICKY, GRPC_END_LIST});
grpc_data_init(&dhB,
(const char * []){"memory://karadoc.aist.go.jp/B", NULL},
NULL,
GRPC_DOUBLE, (size_t []){3, 3, 0},
NULL);
grpc_data_init(&dhC,
(const char * []){"memory://karadoc.aist.go.jp/C", NULL},
(const char * []){"memory://perceval.rush.aero.org/C", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_STICKY, GRPC_END_LIST});
grpc_data_init(&dhA,
(const char * []){"memory://perceval.rush.aero.org/A", NULL},
NULL,
GRPC_DOUBLE, (size_t []){3, 3, 0},
NULL);
grpc_data_init(&dhC,
(const char * []){"memory://perceval.rush.aero.org/C", NULL},
(const char * []){"FTP://britannia.ens-lyon.fr/home/user/C.out", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_VOLATILE, GRPC_END_LIST});
grpc_call(handle3, dhA, dhC, &dhC);
[email protected] 26
GFD-R-P.186 June 6, 2011
[Figure: data management between the client (britannia.ens-lyon.fr) and two servers, Server 1
(karadoc.aist.go.jp) and Server 2 (perceval.rush.aero.org), involving the data A, B and C through the
middleware.]
Data A will be used on karadoc and perceval; we use a GRPC_STICKY persistence to reuse it with the
third grpc_call(). The data B is only used on karadoc. The first grpc_call() transfers the data to
karadoc to compute C = A × B and to perceval for the computation of C = A × C. The second
grpc_call() computes C = C + B and transfers the data C from karadoc to perceval for the final
computation.
Output data C is created on server karadoc. C moves (or is duplicated) from server karadoc to server
perceval. Then, C is sent back to the client.
[email protected] 27
GFD-R-P.186 June 6, 2011
grpc_data_memory_mapping_set("A", contA);
// "contA" is pointer to a data container type.
// The container type depends on the implementation.
grpc_function_handle_init(handle1,"karadoc.aist.go.jp","sum");
grpc_data_init(&dhA,
(const char * []){"memory://britannia.ens-lyon.fr/A", NULL},
NULL, GRPC_CONTAINER_OF_GRPC_DATA, (size_t []){1, 0},
(grpc_data_mode_t []) NULL);
for (i=0; i<10; ++i) {
char name[34];
ssprintf(name,"memory://britannia.ens-lyon.fr/a%d", i);
data_memory_mapping_set(name, &values[i]);
grpc_data_init(&dhB, NULL,
(const char * []){"memory://britannia.ens-lyon.fr/B", NULL},
GRPC_DOUBLE, (size_t []){3, 3, 0},
(grpc_data_mode_t []){GRPC_STICKY, GRPC_END_LIST});
We define A as a data container. Each element a of the set A is mapped in memory and added to the data
container. The container is then used as input parameter of the sum service, exactly as a data of any other
type.
The output is simply computed on karadoc and sent back to britannia. Here, the user has chosen to store
the output data in memory. He asks the middleware to create a new data memory mapping using the
name B on britannia.
[email protected] 28
GFD-R-P.186 June 6, 2011
We want to compute B = A^n on the karadoc server with the matrix multiplication service available there.
After the computation, we use grpc_data_transfer() to transfer the result to an FTP server running on
the client (britannia.ens-lyon.fr).
grpc_function_handle_init(handle1,"karadoc.aist.go.jp","*");
grpc_data_init(&dhA,
(const char * []) {"dagda://id-01234567-89ab-cdef-0123456789ab", NULL},
NULL, GRPC_DOUBLE, (size_t []) {100, 100, 0},
(grpc_data_mode_t []) {GRPC_PERSISTENT, GRPC_END_LIST});
grpc_data_init(&dhB,
(const char * []) {"ftp://karadoc.aist.go.jp/pub/mat100_identity.dat", NULL},
(const char * []) {"dagda://id-98765432-10fe-dcba-9876543210fe", NULL},
GRPC_DOUBLE, (size_t []) {100, 100, 0},
(grpc_data_mode_t []) {GRPC_PERSISTENT, GRPC_END_LIST});
grpc_data_init(&dhB,
(const char * []) {"dagda://id-98765432-10fe-dcba-9876543210fe", NULL},
(const char * []) {"dagda://id-98765432-10fe-dcba-9876543210fe", NULL},
GRPC_DOUBLE, (size_t []) {100, 100, 0},
(grpc_data_mode_t []) {GRPC_PERSISTENT, GRPC_END_LIST});
grpc_data_transfer(&dhB, NULL,
(const char * []) {"ftp://britannia.ens-lyon.fr/pub/results/B.out", NULL},
NULL);
The data is designated by its unique identifier in the data middleware (here the data middleware is
DAGDA, the DIET data manager). Then the data is located “somewhere” on the grid. Because we chose
GRPC_PERSISTENT as the data persistency, the data will stay on the server after the first computation
(DAGDA manages this kind of data persistency). Matrix B is initialized with the identity matrix and then
stores the intermediate results.
The output data is designated to be a DAGDA data with GRPC_PERSISTENT persistency. Then, after
each computation, data B stays on karadoc for the next loop step. After the nth computation, we get back
the result using the extra destination field of the grpc_data_transfer() function.
[email protected] 29
GFD-R-P.186 June 6, 2011
We can see that using an underlying data middleware greatly simplifies the application.
[email protected] 30
GFD-R-P.186 June 6, 2011
B Table of functions

grpc_data_init()                 initializes a grpc_data_t
grpc_data_memory_mapping_set()   maps a data in memory to a key used in “memory” URIs
grpc_data_memory_mapping_get()   retrieves a data mapped in memory from its key
grpc_data_container_set()        adds a grpc_data_t to a container
grpc_data_container_get()        gets a grpc_data_t from a container
grpc_data_transfer()             initiates asynchronous transfers of a data
grpc_data_wait()                 waits for the completion of one or all listed transfers
grpc_data_unbind()               unbinds a handle and the corresponding data
grpc_data_free()                 frees a data on some or all of its locations
grpc_data_getinfo()              returns information on a data
grpc_data_load()                 loads a data description
grpc_data_save()                 saves a data description
[email protected] 31
GFD-R-P.186 June 6, 2011
C Table of types
Category          Type Name                 Possible values                  Section
data structure    grpc_data_t               structured type                  4.1
                  grpc_data_type_t          GRPC_BOOL, GRPC_INT,             4.1
                                            GRPC_DOUBLE, GRPC_COMPLEX,
                                            GRPC_STRING, GRPC_FILE,
                                            GRPC_CONTAINER_OF_GRPC_DATA
data management   grpc_completion_mode_t    GRPC_WAIT_ALL, GRPC_WAIT_ANY     4.1.2