This release is called "BCR" in recognition
of the fact that there remains work to be done, and to prevent having several
different "CDS interfaces" floating around. Many apparent shortcomings
of the BCR interface have already been addressed by Elepar's
CDS description, but even that has had limited external involvement,
and has not been implemented at the time of this writing.
The correct approach to designing a stable
interface would be to involve potential users, hopefully after they had
experimented with current implementations like BCR. Though the common
approach would be to create some sort of standards body to physically meet
at regular intervals and hammer out details, it may be just as (or more)
productive to handle this in a distributed fashion on mailing lists.
A specific forum, called API_design,
has been set up on SourceForge to address CDS interface design issues.
Consider this web page as food for thought for discussion there.
One goal of an ultimate CDS interface might
be to minimize the number of reasons people might have to opt for some other
interface, like PVM, MPI, UDP/IP, or distributed shared memory.
For example, the Elepar design suggests several features over
and above those in BCR; these are described below.
BCR may already have suitable (or superior)
alternatives to the MPI one-sided operations, but it is not clear how features
like collective communication and/or distributed I/O would map to CDS,
or whether they should remain as higher-level functions, since CDS makes
no assumption that it will execute on a high-performance system, or that
there will be any recognizable topology among the CCEs. Similarly,
how much support should there be for "stdin" and "stdout"? How much
process (or CCE) control should be possible?
Cell contexts. These play a similar
role to contexts (one dimension of "communicators") in MPI, by allowing
the communicating entities (in this case CCEs) to partition the communication
address space (in this case cell IDs) among their individual routines or libraries.
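As a rough illustration of the idea (nothing like this exists in BCR today; the encoding and names are hypothetical), a context could simply be carved out of the high bits of the existing flat cell-ID space, so that two libraries using distinct contexts can never collide on a cell:

```c
#include <stdint.h>

/* Hypothetical scheme: the high CTX_BITS of a 32-bit cell ID name a
 * context, the remaining low bits name a cell within that context.
 * This only illustrates how contexts could partition the existing
 * cell-ID space without changing its overall shape. */
#define CTX_BITS   8
#define LOCAL_BITS (32 - CTX_BITS)

static inline uint32_t cell_in_context(uint32_t ctx, uint32_t local)
{
    /* Combine a context number with a context-local cell number. */
    return (ctx << LOCAL_BITS) | (local & ((1u << LOCAL_BITS) - 1));
}

static inline uint32_t context_of(uint32_t cell)
{
    /* Recover the context number from a full cell ID. */
    return cell >> LOCAL_BITS;
}
```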
Blocking "enq" ("benq"). This is an
operation like an "enq" (i.e. a "put" with non-zero "qlike" argument),
but it blocks unless the cell is empty and there is another entity waiting
to get the result. Such a simple extension can fulfill the demands
of both the so-called "ready send" in MPI, and the potential in MPI for
a large send to block until there is a waiting buffer on the receiver.
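A minimal sketch of such rendezvous semantics, built on POSIX threads rather than anything in BCR (all names here are hypothetical): the producer blocks until the cell is empty and a consumer is already waiting, then hands the value off directly.

```c
#include <pthread.h>
#include <stdbool.h>

/* One-slot rendezvous cell modelling "benq": put blocks until the
 * cell is empty AND a getter is waiting.  A sketch only, not BCR. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int  value;
    bool full;      /* cell currently holds a value    */
    int  waiters;   /* consumers blocked in benq_get() */
} rdv_cell;

void benq_put(rdv_cell *c, int v)
{
    pthread_mutex_lock(&c->mu);
    while (c->full || c->waiters == 0)   /* wait for empty + waiter */
        pthread_cond_wait(&c->cv, &c->mu);
    c->value = v;
    c->full  = true;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->mu);
}

int benq_get(rdv_cell *c)
{
    pthread_mutex_lock(&c->mu);
    c->waiters++;
    pthread_cond_broadcast(&c->cv);      /* wake a blocked producer */
    while (!c->full)
        pthread_cond_wait(&c->cv, &c->mu);
    int v = c->value;
    c->full = false;
    c->waiters--;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->mu);
    return v;
}

static void *producer(void *p) { benq_put((rdv_cell *)p, 7); return NULL; }

/* Demonstration: one producer thread, one consumer; returns the value
 * handed through the cell. */
int benq_demo(void)
{
    rdv_cell c;
    pthread_mutex_init(&c.mu, NULL);
    pthread_cond_init(&c.cv, NULL);
    c.full = false;
    c.waiters = 0;
    pthread_t t;
    pthread_create(&t, NULL, producer, &c);
    int v = benq_get(&c);
    pthread_join(t, NULL);
    return v;
}
```

Note that the producer cannot proceed until a getter has arrived, which is exactly the guarantee MPI's "ready send" asks the programmer to establish by other means.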
Full support for non-blocking operations.
Although BCR does OK with its "bcr_PENDING" in the time argument of "get",
it does not extend well to non-blocking "recv" operations, etc. Non-blocking
operations can be defined as simply independent threads performing the
equivalent of the blocking operation.
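That definition can be sketched directly with POSIX threads. The names below are hypothetical and the code is only a model of the idea, not the BCR implementation: a non-blocking operation is started by spawning a thread that performs the blocking equivalent, and "wait" is simply a join that yields the result.

```c
#include <pthread.h>
#include <stdlib.h>

/* Handle for an in-flight non-blocking operation. */
typedef struct { pthread_t thread; } nb_handle;

/* Any blocking operation taking an argument and returning an int. */
typedef int (*blocking_op)(void *arg);

struct nb_ctx { blocking_op op; void *arg; };

static void *nb_trampoline(void *p)
{
    struct nb_ctx *c = p;
    int r = c->op(c->arg);       /* run the blocking equivalent */
    free(c);
    return (void *)(long)r;
}

/* Start op(arg) without blocking the caller; 0 on success. */
int nb_start(nb_handle *h, blocking_op op, void *arg)
{
    struct nb_ctx *c = malloc(sizeof *c);
    if (!c) return -1;
    c->op  = op;
    c->arg = arg;
    return pthread_create(&h->thread, NULL, nb_trampoline, c);
}

/* Block until the operation completes; return its result. */
int nb_wait(nb_handle *h)
{
    void *r;
    pthread_join(h->thread, &r);
    return (int)(long)r;
}

/* A stand-in for some blocking CDS call. */
static int slow_answer(void *arg) { (void)arg; return 42; }
```

An implementation would of course want to pool these threads rather than create one per operation, but the semantics are the same.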
"rgmod" optimization. Since an "rgmod"
may be required after any "get" or "put", it may be beneficial to add an
extra argument onto "get" and "put" that tells whether to perform an "rgmod".
There are also binding questions.
Because of its use of pointers, CDS targets C-like languages, though it
is usable (and arguably as efficient as message passing) in Fortran by
using the "copyfm" and "copyto" operations exclusively to access regions.
Are there better solutions? Can Fortran pointers fulfill the needs
of CDS region IDs?
Related to the above, is it sensible to consider
a Linux kernel level interface to CDS functionality? In some sense,
CDS can be considered an amalgam and generalization of SysV constructs
like shared memory and message segments, and could serve as an alternative
to sockets. It can also be considered as a position-independent thread
interface. It could truly set Linux apart as a network-ready operating system.
Completion, Optimization and Porting
Some tweaks don't even require changes in
the interface per se, just in how it is interpreted and implemented.
For example, the "objname" argument in an enlist is currently interpreted
as a filename on the target machine which stores the executable for the
process which will become the CCE. The caller shouldn't need to know
the file structure on the target machine, or whether the target CCE will
be implemented as a process or a thread, or whether it is already running
(and blocked at enlist) or needs to be initiated. To handle these,
each machine/program should have an independent means of mapping incoming
CCE "objnames" to a thread, and a means to determine how to initiate it
if it isn't already waiting. (A potential race condition must be
avoided, where the thread has begun but hasn't yet reached the "init" when
the CCE request comes in.)
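One way to close that race window, sketched here with POSIX threads and hypothetical names, is to have each CCE thread announce its objname once it reaches "init", while an incoming enlist request waits on the table instead of failing when the name is not yet present:

```c
#include <pthread.h>
#include <string.h>

/* Tiny objname -> ready-thread table.  An enlist request that arrives
 * after the thread has started but before it has reached "init"
 * simply blocks on the condition variable until the announcement,
 * avoiding the race described above. */
#define MAX_OBJ   16
#define NAME_LEN  32

static struct { char name[NAME_LEN]; int ready; } table[MAX_OBJ];
static int n_obj;
static pthread_mutex_t tbl_mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tbl_cv = PTHREAD_COND_INITIALIZER;

/* Called by a CCE thread when it reaches its "init". */
void obj_announce(const char *name)
{
    pthread_mutex_lock(&tbl_mu);
    strncpy(table[n_obj].name, name, NAME_LEN - 1);
    table[n_obj].ready = 1;
    n_obj++;
    pthread_cond_broadcast(&tbl_cv);
    pthread_mutex_unlock(&tbl_mu);
}

/* Called on an incoming enlist; blocks until the objname is
 * announced, then returns 0. */
int obj_await(const char *name)
{
    pthread_mutex_lock(&tbl_mu);
    for (;;) {
        int i, found = 0;
        for (i = 0; i < n_obj; i++)
            if (table[i].ready && strcmp(table[i].name, name) == 0)
                found = 1;
        if (found) break;
        pthread_cond_wait(&tbl_cv, &tbl_mu);
    }
    pthread_mutex_unlock(&tbl_mu);
    return 0;
}
```

The part that decides how to initiate a not-yet-running program for a given objname would sit in front of this table, but the waiting logic is the piece that removes the race.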
Some features specifiable in the "init"
call have not been implemented. These include garbage collection
(i.e. to automatically compact the comm heap when it becomes fragmented)
and error regions (i.e. to inform a CCE of failures in region delivery or
enlistment through automatically-generated regions to a specified cell).
Some modifications to the interface may be required to make these operate
correctly (e.g. to ensure that pointers within region IDs don't change
due to garbage collection without the user being informed).
Even without extensions to the interface,
the existing BCR could be further optimized for different platforms.
For example, nothing more extensive than spinlocks or UDP/IP (with some
rather basic flow control and reliability algorithms) is currently used
for communication. Implementing lock-free queues and various forms of
OS bypass to high-performance networks could increase performance significantly.
The shared memory allocation logic is also ad hoc, and could probably be
significantly optimized. Send and recv are not optimized in any way
yet, although they are designed to allow significant optimization by circumventing
much of the intervening copying.
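For illustration, the lock-free queues mentioned above typically look like the following single-producer/single-consumer ring buffer built on C11 atomics. This is a generic textbook structure, not BCR code:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 64u              /* must be a power of two */

/* SPSC ring buffer: the producer only writes tail, the consumer only
 * writes head, so neither side ever takes a lock. */
typedef struct {
    int value[QSIZE];
    _Atomic unsigned head;     /* consumer index (free-running) */
    _Atomic unsigned tail;     /* producer index (free-running) */
} spsc_queue;

bool spsc_push(spsc_queue *q, int v)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE)
        return false;                          /* full */
    q->value[t % QSIZE] = v;
    /* Release makes the stored value visible before the new tail. */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

bool spsc_pop(spsc_queue *q, int *v)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h)
        return false;                          /* empty */
    *v = q->value[h % QSIZE];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```

On shared memory, a pair of such queues (one per direction) can replace the spinlock-protected channel with no locking at all on the fast path.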
CDS is the proper level to build security
into many systems. Its method of initiating/enlisting CCEs may even
help to facilitate the generation and passing of keys, etc.