GASNet Portals-conduit documentation
Michael Welcome
$Revision: 1.11 $

User Information:
-----------------

The GASNet Portals conduit is being developed exclusively for the Cray XT3/XT4
and follow-on Portals based systems.  The design and implementation is based on
the Portals 3.3 Message Passing Interface (Revision 1.0) developed at Sandia
National Laboratory and the University of New Mexico.  There should be only a
few modifications required to port this implementation to non-Cray Portals
systems.

This implementation runs under both the Cray Catamount and Compute Node Linux
(CNL) operating systems.

Also, note that earlier implementations of this conduit were hybrid
implementations, using Portals directly for Put and Get operations and the
MPI-conduit for the active message layer.  This version of the conduit is
implemented entirely over Portals and has no MPI dependencies.

This implementation will only work under the GASNET_SEGMENT_FAST environment,
although modifications to support GASNET_SEGMENT_EVERYTHING are possible using
the GASNet FireHose algorithm.

In addition, this implementation is restricted as follows:
  Under Catamount:  GASNET_SEQ (no threading environment under Catamount)
  Under CNL:        GASNET_SEQ and GASNET_PAR.
                    Note: One could use GASNET_PARSYNC by simply defining
                    GASNET_PAR, but there are probably several efficiencies
                    to be gained by combing through the uses of locks to see
                    which are required in this environment and which are not.

Some notes on building GASNet for the Cray XT3/XT4
--------------------------------------------------

* Since the XT3/XT4 requires using a cross-compiler, there is a special cross
  configuration script located in
     $GASNET_SRC_DIR/other/contrib/cross-configure-crayxt-catamount
  for a Catamount system OR
     $GASNET_SRC_DIR/other/contrib/cross-configure-crayxt-linux
  for CNL.
  Instructions for building gasnet are:
     cd $GASNET_SRC_DIR
     cp other/contrib/cross-configure-crayxt- .
     ./Bootstrap
     vi ./cross-configure-crayxt-
        to modify any of the configure-time options, such as turning on
        DEBUG or TRACE mode.
     ./cross-configure-crayxt-
     make

* Special notes on configuring for GASNET_PAR under CNL
  You must be sure to use the correct version of pthreads, one that works
  with Portals.  At the time of this writing, 09/01/2007, the only
  Portals-compatible pthreads library available on CNL is NPTL, which is
  located in /usr/lib64/nptl.  The header files are in /usr/include/nptl.
  Unfortunately, the pthread.h header file in /usr/include/nptl is buggy and
  needs to be modified.  The easiest thing to do is copy the
  /usr/include/nptl directory to a local location, say $HOME/nptl.
  Modify the pthread.h file as follows:
  Change:
     extern int __sigsetjmp (struct __jmp_buf_tag __env[1], int __savemask) __THROW;
  To:
     extern int __sigsetjmp (struct __jmp_buf_tag *__env, int __savemask) __THROW;
  Change:
     #define PTHREAD_MUTEX_INITIALIZER {}
     #define PTHREAD_COND_INITIALIZER {}
  To:
     #define PTHREAD_MUTEX_INITIALIZER { { 0 } }
     #define PTHREAD_COND_INITIALIZER { { 0 } }

  Assuming no significant changes to pthread.h, this patching can be done
  automatically (and non-destructively) during configure by using the
  configure argument:
     --with-pthreads-patch=other/contrib/crayxt-nptl-pthread-patch

  Finally, adding the following to EXTRA_CONFIGURE_ARGS in
  cross-configure-crayxt-linux ensures the use of the NPTL libraries and
  headers:
     --with-pthreads-include=/usr/include/nptl
     --with-pthreads-lib=/usr/lib64/nptl
  These settings are on by default in cross-configure-crayxt-linux.

Recognized environment variables:
---------------------------------

GASNET_PORTAL_MSG_LIMIT=N
  This value limits the total number of messages in-flight from this node.
  Given that the Cray SeaStar has 256 available slots in the DMA engine,
  any more in-flight messages require a more expensive reliability
  algorithm to be used at the low levels.
  This value only limits the number of INITIATED messages; there is no
  control over the number of incoming messages, which also consume DMA slots.
  Earlier tests showed that performance would drop if this value got much
  larger than about 300.  The default value is 250.

GASNET_PORTAL_NUM_TMPMD=N
  Specifies the number of temporary memory descriptors that can be allocated
  at any time.  The default value is 1024.  A temporary MD is allocated when
  a Put or Get operation is initiated, the message is larger than what would
  fit into a Request chunk (bounce buffer), and the memory for the source of
  the Put or destination of the Get is not already pinned (as it would be,
  e.g., if it were in the local segment).

GASNET_PORTAL_PACKED_LONG=N
  If N=0, this turns off the ability to pack short-payload AM Long messages
  into a Request Send Buffer chunk.  A normal AM Long request or reply will
  send two messages, a header and a data put.  For small AM Long messages,
  using PACKED longs has better performance.  The default value is N=1,
  allowing PACKED AM Longs.

GASNET_PORTAL_FLOW_CONTROL=N
  If N=1, then active message flow control is enabled.  This is the default.
  N=0 disables flow control, but the application may fail if a node receives
  so many AM requests that its receive buffer overflows.

GASNET_PORTAL_DYNAMIC_CREDITS=N
  If N=1, then dynamic credit management is enabled.  This is the default.
  With dynamic credit management, a node dynamically distributes its credits
  to the nodes it has been communicating with recently.  N=0 disables this
  feature.

GASNET_PORTAL_SB_CHUNKS=N
  Specifies the number of memory chunks to allocate to the Request Send
  Buffer (ReqSB) memory descriptor.  Each active message request must
  allocate a chunk for its use.  The chunk is held until the corresponding
  AM Reply arrives.  These chunks are also used as bounce buffers for small
  Put and Get operations.

GASNET_PORTAL_RPL_CHUNKS=N
  Specifies the number of memory chunks to allocate to the Reply Send
  Buffer (RplSB) memory descriptor.
  Each active message reply must allocate a chunk for its use.  The chunk is
  freed when the AM Reply is known to have been sent off-node.

GASNET_PORTAL_SAFE_LIMIT=N
  This value defines the maximum number of events that are allowed to be
  processed on the SAFE event queue, per polling call.
  The default value is 12.

GASNET_PORTAL_AM_LIMIT=N
  This value defines the maximum number of events that are allowed to be
  processed on the AM event queue, per polling call.
  The default value is 8.

GASNET_PORTAL_SYS_LIMIT=N
  This value defines the maximum number of events that are allowed to be
  processed on the SYS event queue, per polling call.
  The default value is 0, meaning there is no limit.  That is, all events are
  processed per polling call.

GASNET_PORTAL_MAX_CRED_PER_NODE=N
  This value limits the total number of credits that can be allocated to a
  remote node.  The default value is 400.

GASNET_PORTAL_ACCEL=N   /* if built with -DGASNETC_USE_SANDIA_ACCEL */
  This value can only be used on a Sandia based Catamount system with
  accelerated portals.  In this environment, the majority of the portals
  processing is done on the seastar chip, rather than on the opteron.
  If N=1, (and GASNet is built with the -DGASNETC_USE_SANDIA_ACCEL flag) then
  accelerated mode is enabled.  The default is N=0.

GASNET_PORTAL_CREDIT_STATS=1
  will write credit stats to a file.  This allows a developer to get a sense
  of how the flow-control credits are distributed.  It should not be used in
  performance runs.

GASNET_PORTAL_EPOCH_DURATION=N
  The number of AMRequest operations a node receives before it ends its
  credit Epoch.  The default value is 1024.  See the section on dynamic
  credit management (below).

GASNET_PORTAL_CRED_PER_NODE=N
  The number of flow_control credits issued to each remote node at startup.
  By default, this is determined via a table lookup based on the number of
  nodes in the job.

GASNET_PORTAL_BANKED_CREDITS=N
  The number of flow_control credits that are put into the discretionary
  bank pool on each node at startup.  By default, this is determined via a
  table lookup based on the number of nodes in the job.

GASNET_PORTAL_AMRECV_SPACE=N
  The number of bytes to allocate for the ReqRB space.  By default, this is
  determined via a table lookup based on the number of nodes in the job.

* All the standard GASNet environment variables (see top-level README)

* The GASNET_EXITTIMEOUT family of environment variables (see top-level README)

Optional compile-time settings:
------------------------------

* All the compile-time settings from extended-ref (see the extended-ref README)

Known problems:
---------------

* See the Berkeley UPC Bugzilla server for details on known bugs.

* On CNL, if we try to pin more memory than the OS will allow, the job is
  killed.  Therefore there is really no way (that we know of) to determine
  the maximum pinnable memory (and therefore the maximal GASNET_SEGMENT_FAST
  segment size) under CNL without dire consequences.  Furthermore, the
  maximal value appears to be site-specific.  Currently portals-conduit
  leaves this quantity unlimited (subject only to GASNET_MAX_SEGSIZE, which
  determines the maximal mmap), and jobs requesting too large a GASNet
  segment will be killed by the kernel with an error message like:
     [NID 18]Apid 49229: initiated application termination
     Application 49229 exit signals: Killed
  Programs encountering this error should reduce their GASNet segment size
  demands, either at the GASNet client level (e.g. UPC_SHARED_HEAP_SIZE), or
  by setting the environment variable GASNET_MAX_SEGSIZE to a smaller value
  to impose a segment limit.
  The default value for GASNET_MAX_SEGSIZE can be established at configure
  time using:
     --with-segment-mmap-max=XGB

==============================================================================

A Brief Overview of Portals:
---------------------------

All GASNet communication operations (puts, gets, active messages, barriers,
etc) are implemented in terms of Portals PtlPut, PtlPutRegion, PtlGet and
PtlGetRegion operations.

Portals Put and Get operations are RDMA operations between a local and remote
"Memory Descriptor".  A Memory Descriptor represents a pinned region of memory
that is endowed with various properties, such as the type of events that will
be generated on the MD, the set of operations that can be performed on the MD,
and how the memory is managed ("Locally" or "Remotely").  In addition, Portals
MDs can be Free Floating, or on an ordered list attached to a Portals Table
Entry.  MDs that are attached to a Portals Table Entry are accessible to
remote nodes as the target of a Put or source of a Get RDMA operation.  Free
Floating MDs can only be used as the source of a Put or the destination of a
Get operation.  Each MD attached to a Portals Table Entry has a set of
Match-Bits and Ignore-Bits (64 bits wide) that are used to determine if it is
the target of an incoming message.

In addition, an MD may be associated with a local Portals Event Queue (EQ).
Depending on how the MD is configured, communication operations that operate
on the MD will generate events on the associated EQ.
The types of events are:
  PTL_EVENT_SEND_END  - a locally initiated Put operation has been sent
  PTL_EVENT_ACK       - a locally initiated Put operation has reached its
                        remote destination MD
  PTL_EVENT_PUT_END   - a remotely initiated Put operation has completed on a
                        local MD
  PTL_EVENT_GET_END   - a remotely initiated Get operation has completed on a
                        local MD
  PTL_EVENT_REPLY_END - a locally initiated Get operation has completed on a
                        local MD
  PTL_EVENT_UNLINK    - a "Locally Managed" MD has filled and has been
                        unlinked from its Portals Table List.

Portals Put and Get operations are non-blocking and do not return a handle.
As an example, a PtlPutRegion operation takes the following arguments:
  src_md:        A local memory descriptor, the source of the data put.
  local_offset:  The byte offset within src_md where the source message starts
  msg_bytes:     The length (in bytes) of the Put operation
  ack_req:       Should an ACK be generated when the data reaches the remote MD?
  target_id:     The process ID of the target node
  pt_index:      The Portals Table Index to target on the remote node
  ac_index:      The Access-Control table index on the remote node
  match_bits:    The 64-bit match-bits used to select the remote MD for this
                 operation
  remote_offset: The byte offset in the remote MD where the data should be
                 placed.
  hdr_data:      64-bits of immediate data that should be sent with the
                 message.

The operation works as follows (and not necessarily in this strict order):
(1) The data in the src_md of length msg_bytes starting at local_offset is
    sent to the target_id node along with the match_bits and hdr_data.
(2) When the data is safely copied out of src_md, and if src_md is configured
    to generate such an event, a PTL_EVENT_SEND_END event is generated on the
    EQ associated with the src_md (if one exists).
(3) When the message arrives at the remote node, Portals searches the list of
    MDs attached to the Portals Table at index pt_index.
    For each MD, it masks the message match-bits with the MD's ignore-bits
    and then compares the result with the MD's match-bits.  If they match,
    it checks if the MD accepts Put operations.  Finally, if the MD is
    "Locally Managed", it checks that there is enough space left to fit the
    message.  If all of these conditions hold, the MD is selected and the
    data is delivered.  If any of these conditions fail, the next MD on the
    list is examined.
When an MD is selected:
(4) If the sender requested an ACK, it is delivered back to the EQ of the
    src_md.
(5) If the selected target_md is configured to generate PTL_EVENT_PUT_END
    events and if the target_md is associated with an EQ, such an event is
    generated for this operation.

Several notes:
* If the target MD is configured to be "Locally Managed", then each incoming
  message bound for that MD is placed, in sequence, after the previous one.
  In this case, the remote_offset of the PtlPut operation is ignored.  When
  the MD fills, it will no longer accept incoming data.  The MD can be
  configured to be automatically unlinked from the Portals Table list and
  can generate a PTL_EVENT_UNLINK event on its EQ.
* If the target MD is not configured as "Locally Managed", then the incoming
  data is placed at the remote_offset specified in the PtlPut operation.
* Get operations do not have the equivalent of the hdr_data (64-bits of
  immediate data).

When events are generated, the event structure contains the type of the
event, a copy of the match_bits used, and the hdr_data (if available), along
with other data.  Since Portals Put/Get operations do not return a handle,
the user must determine which operation the event was associated with via
the context contained in the event structure.  Fortunately, the GASNet
implementation uses only a handful of MDs, so most of the 64 bits in the
match-bits can be used to hide information (masked off by the MD's
ignore-bits) that associates the event with the corresponding GASNet
operation.
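
The matching rule and the trick of hiding operation context in ignored
match-bits can be sketched in C as follows.  This is a minimal illustration,
not the conduit's code: the 4-bit MD selector / 24-bit handle layout and the
names md_matches, pack_match_bits and unpack_handle are assumptions made
for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout (an assumption for this sketch): the low 4 bits
 * select the target MD, the next 24 bits carry a compact handle. */
#define MD_SELECT_BITS 4
#define MD_SELECT_MASK ((UINT64_C(1) << MD_SELECT_BITS) - 1)
#define HANDLE_BITS    24
#define HANDLE_MASK    ((UINT64_C(1) << HANDLE_BITS) - 1)

/* Portals-style match test: bits set in md_ignore_bits are not compared,
 * so only the remaining bits must equal the MD's match-bits. */
static int md_matches(uint64_t msg_bits, uint64_t md_match_bits,
                      uint64_t md_ignore_bits)
{
    return ((msg_bits ^ md_match_bits) & ~md_ignore_bits) == 0;
}

/* Build match-bits that select an MD and carry a 24-bit handle in the
 * portion the target MD ignores. */
static uint64_t pack_match_bits(uint64_t md_select, uint32_t handle24)
{
    assert(md_select <= MD_SELECT_MASK);
    assert(handle24 <= HANDLE_MASK);
    return ((uint64_t)handle24 << MD_SELECT_BITS) | md_select;
}

/* Recover the handle from the match-bits copied into an event. */
static uint32_t unpack_handle(uint64_t match_bits)
{
    return (uint32_t)((match_bits >> MD_SELECT_BITS) & HANDLE_MASK);
}
```

With an MD whose ignore-bits are ~MD_SELECT_MASK, any handle value can ride
along in the upper bits without affecting which MD is selected, and the
initiator can recover it later from the match_bits field of the ACK event.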

Full details of the Portals software interface can be found in:
"The Portals 3.3 Message Passing Interface, Revision 1.0", authored by
Ron Brightwell, Arthur B. Maccabe, Rolf Riesen and Trammell Hudson in
May of 2003.

Memory Descriptors used in the GASNet Implementation
----------------------------------------------------

RAR    - Covers the local GASNet memory segment.  It has no EQ but will
         generate ACK events if requested.  It is linked on the Portals Table
         and is intended as the target of remote GASNet Put/Get operations.
RARSRC - Also covers the local GASNet memory segment.  It is free floating
         and used as the source MD of a Put or destination MD of a Get when
         the data happens to reside in the GASNet segment and events are
         needed.  It is also used as the target of AMLong Reply operations.
         It generates events on the SAFE_EQ event queue.
RARAM  - Also covers the local GASNet memory segment and is associated with
         the AM_EQ event queue.  It is used as the target of AMLong Request
         operations.  This MD is also linked on the Portals Table.
ReqSB  - This MD covers a local send buffer that is divided into "chunks"
         that get allocated and freed for use by various GASNet
         communication operations.  It is associated with the SAFE_EQ.
         It is used as the source of AM Requests, the destination of AM
         Reply operations, and as a pre-pinned bounce-buffer for small Put
         and Get operations.  This MD is linked through the Portals Table.
RplSB  - This free-floating MD covers a local send buffer, also divided into
         allocatable "chunks".  It is associated with the SAFE_EQ and used
         as the source of AM Reply operations.
ReqRB  - These MDs are the only "Locally Managed" MDs used by GASNet.  They
         are associated with the AM_EQ and are used as the target of AM
         Request operations.  They are linked on the Portals Table.
CB     - This MD is always linked at the end of the chain of ReqRB MDs and
         has the same match bits as ReqRB.  If an operation makes it to this
         MD, it represents a ReqRB overflow and results in a fatal error.
         It is called the "catch-basin" MD.
TMPMD  - These MDs are temporary and free floating.  They are used as the
         source of large Put and destination of large Get operations where
         the local data is not in the GASNet segment (RAR) and is too big to
         fit into a ReqSB bounce buffer.
SYSMD  - This memory descriptor is associated with a zero-length buffer and
         the SYS_EQ event queue.  This MD is used to send zero-length
         out-of-band system messages between the nodes.  All the information
         in the messages is contained in the hdr_data and unused portions of
         the match-bits.

Event Queues used in the GASNet Implementation
----------------------------------------------

SAFE_EQ - The events on this queue can be polled and processed at any time.
          They represent the completion of operations and will generally
          result in the release of resources.  The operations associated
          with these events will never require the acquisition of additional
          resources.
AM_EQ   - This event queue is only associated with the ReqRB buffers.
          Events on this queue will result in the execution of AM Request
          handlers that will require sending AM Reply messages.  This EQ can
          only be polled when resources are guaranteed to be available to
          complete the AM Reply message.
SYS_EQ  - The event queue used to send/recv out-of-band system messages
          between nodes.  It is used during program termination and in the
          dynamic re-distribution of flow-control credits.

The System Event Queue
----------------------

The SYS_EQ is used for out-of-band management messages sent between nodes.
The messages are short and all the information can be packed into the unused
match-bits and hdr_data of the underlying PtlPut operations.  In order to
limit the size of the event queue, each node can have at most one
outstanding message to any other node at a time.  For this reason, each
message must have a corresponding reply or acknowledgement.  The SYS_EQ is
used for clean shutdown operations and to implement the dynamic credit
re-distribution algorithm.
When a thread polls the network, it processes all available events on the
SYS_EQ before attempting to process any events on either the SAFE_EQ or the
AM_EQ.

The Implementation of GASNet Put and Get Operations
---------------------------------------------------

All GASNet Put operations send data to the RAR of a remote node and all
GASNet Get operations pull data from the RAR of a remote node.  The remote
node does not have to be informed of the delivery (or theft) of this data, so
the GASNet operations target the RAR MD, which has no EQ associated with it.
However, in the case of a GASNet Put operation, the underlying PtlPut
operation requests an ACK event to be sent back to the sender when the data
has arrived.

Each GASNet Put or Get operation is associated with a GASNet handle, a
polymorphic typed object that records the state of the operation.  The
objects can be referenced by either a standard (64 bit) pointer, or a compact
24-bit representation that can be converted to a pointer to the object.

Consider the implementation of a non-blocking GASNet Put operation:

   gasnet_handle_t gasnet_put_nb_bulk(void *dest, gasnet_node_t node,
                                      void *src, size_t nbytes);

* Poll until a send ticket can be obtained.
* We know the destination node and the virtual memory address where the data
  is to be put.  This region must lie within the RAR of the remote node.  We
  know the starting address of this RAR since this information was exchanged
  at job startup.  We know the RAR is covered by the RAR MD memory descriptor
  and we can compute the (remote) offset from the start of this MD.
* There are several cases for determining the MD of the source memory region:
  (1) It lives within the local RAR.  If so, use the local RARSRC as the
      source MD and compute its offset within this MD.
  (2) It does not live within the local RAR, but the message size is small
      enough to fit into a ReqSB chunk.  If so, attempt to allocate a chunk
      and copy the data into this "bounce buffer".
      The local offset is the offset of this chunk within the ReqSB.
  (3) It does not live in the local RAR, and either it is too big to be
      copied through a bounce buffer, or it is small enough but no ReqSB
      chunks were available.  In this case, construct a TmpMD to cover the
      source region.  Set the local offset to zero.
* Allocate a gasnet_handle_t object and encode its 24-bit representation into
  a portion of the 64-bit MATCH_BITS that will be ignored by the remote
  memory descriptors (the upper 60 bits).  The handle is marked "IN_FLIGHT".
* Issue the PtlPutRegion operation from the selected local MD and
  local_offset to the remote node, specifying the RAR_PTE portals table entry
  and MATCH_BITS as specified above.  Request that an ACK event be delivered
  when the data has been written to the remote memory.
* Return the gasnet_handle_t object to the client.

At some later point in time, the data will be sent to the remote node,
generating a PTL_EVENT_SEND_END event on the local source MD.  In addition,
when the data has been delivered to the target memory, a PTL_EVENT_ACK event
will be delivered to the source MD.

At some point in time, the client or runtime layer will poll the SAFE_EQ to
process some of the outstanding events.  For this Put operation, the events
will cause the following actions:
** The SEND_END event will be ignored by all memory descriptors for this
   operation.
** The ACK event will cause the following actions:
   - The 24-bit representation of the gasnet_handle_t object will be
     extracted from the MATCH_BITS in the event structure.  The 64-bit
     pointer to the object will be generated from the 24-bit representation.
     The operation will be marked "DONE".
   - If the event occurred on the local RARSRC MD (case (1) above), no
     further action is taken.  If the event occurred on the ReqSB_MD
     (case (2)), this was a copy through a ReqSB bounce buffer.  The chunk
     is freed for re-use.  If the event occurred on a TEMP_MD (case (3)),
     the memory descriptor is "unlinked" (unpinned).
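
The three-way choice of source MD for a Put described above can be sketched
as follows.  This is an illustrative model under stated assumptions, not the
conduit's implementation: put_ctx_t, select_src_md and the toy bump
allocator standing in for the ReqSB chunk allocator are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SRC_RARSRC, SRC_REQSB_BOUNCE, SRC_TMPMD } src_md_kind_t;

/* Hypothetical per-node state for this sketch. */
typedef struct {
    uintptr_t seg_start;   /* start of the local GASNet segment (RAR) */
    size_t    seg_len;     /* length of the local segment in bytes    */
    size_t    chunk_size;  /* bytes that fit in one ReqSB chunk       */
    size_t    num_chunks;  /* total ReqSB chunks                      */
    size_t    next_chunk;  /* toy allocator: chunks are never freed   */
} put_ctx_t;

/* True if [addr, addr+nbytes) lies entirely inside the local segment. */
static int in_segment(const put_ctx_t *c, uintptr_t addr, size_t nbytes)
{
    return addr >= c->seg_start && nbytes <= c->seg_len &&
           addr - c->seg_start <= c->seg_len - nbytes;
}

/* Pick the source MD for a Put and compute the local offset within it. */
static src_md_kind_t select_src_md(put_ctx_t *c, uintptr_t src,
                                   size_t nbytes, size_t *local_offset)
{
    assert(c != NULL && local_offset != NULL);
    if (in_segment(c, src, nbytes)) {            /* case (1) */
        *local_offset = (size_t)(src - c->seg_start);
        return SRC_RARSRC;
    }
    if (nbytes <= c->chunk_size && c->next_chunk < c->num_chunks) {
        /* case (2): the caller would copy the data into this chunk */
        *local_offset = c->next_chunk++ * c->chunk_size;
        return SRC_REQSB_BOUNCE;
    }
    *local_offset = 0;                           /* case (3) */
    return SRC_TMPMD;
}
```

The kind returned here also tells the ACK handler what to clean up: nothing
for the segment case, a chunk to free for the bounce-buffer case, and an MD
to unlink for the temporary-MD case.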
Finally, the next time the client or the runtime layer calls
gasnet_wait_syncnb() or gasnet_try_syncnb() with this handle, the handle is
freed and the operation is complete.

GASNet Get operations are handled in a similar manner.  All blocking Put/Get
operations are implemented in terms of non-blocking operations.  They poll
the network until the handles indicate the operations are complete before
returning from the call.

The Implementation of GASNet Active Messages
--------------------------------------------

Before an active message request can be issued, the sending thread must
acquire several resources.  If it cannot obtain these resources immediately,
it must poll on the event queues until the resources become available.
The required resources are:
* N flow-control send credits.  The value of N depends on the size of the AM
  message header.  One credit corresponds to 256 bytes of data.  All AMShort
  messages require 1 credit.  AMMedium and AMLong messages may require up to
  4 credits if GASNet is configured with a chunksize of 1024 bytes, or up to
  8 credits if it is configured with a chunksize of 2048 bytes.  Note that if
  the data payload of an AMLong will fit inside a chunk, the message will be
  sent with the data payload packed in with the header (otherwise, only the
  args are sent in the header and the data is sent in a separate message).
* N send tickets.  AMShort, AMMedium and packed AMLong requests require N=1
  send tickets.  Non-packed AMLong messages require N=2 send tickets.
  Send tickets are used to limit the total number of outstanding messages
  sent off-node at any one time.  The Cray XT SeaStar has 256 DMA slots and
  attempting to use more than this at any time can cause additional protocol
  overhead, decreasing performance.
* One ReqSB chunk.
  This memory chunk will be used to send the AM Request header message, which
  (along with the unused portions of the match-bits and hdr_data) will
  include the Request handler to run on the remote node, the arguments to
  that handler, the offset of this ReqSB chunk, and the data payload for
  AMMedium and packed AMLong requests.  The ReqSB offset will be needed by
  the remote node since it will send the corresponding AMReply header message
  back to this chunk.
* In the case of a non-packed AMLong message, the sending thread may also
  have to allocate a TMPMD ticket.  This will only be required if the data
  payload is not already in an existing MD (such as the RAR).  We limit the
  number of outstanding TmpMDs in use at any time, although the total allowed
  is rather liberal.

Once the thread has acquired all the necessary resources, it packs the
Request handler id, the ReqSB offset, the handler arguments, the credits used
(and additional credits requested) and, in the case of the AMMedium and
packed AMLong, the data payload into the match-bits, hdr_data and ReqSB
chunk.  It then issues a PtlPutRegion to the target node with the match-bits
set to select the ReqRB MD.  In the case of a non-packed AMLong request, the
data payload is sent to the remote RARAM MD at the desired location.

On the remote node, before a thread can poll the AM_EQ, it also must acquire
certain resources.  It must gather 2 send tickets, one TmpMD ticket and a
RplSB chunk.  If it cannot get these, it must wait until the next polling
opportunity to try again.  Once these resources have been obtained, it can
poll the AM_EQ to receive an event (if any exist).  The event will correspond
to either a PUT_END event on the ReqRB MD, indicating the receipt of an
AMRequest header message, or a PUT_END event on the RARAM MD, indicating the
receipt of the data segment associated with an AMLong Request.
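
The all-or-nothing resource check that guards polling of the AM_EQ might
look like the sketch below.  The struct and function names are hypothetical,
and handing the resources back when the queue turns out to be empty is an
assumption; the text above only states what must be held before polling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters for the resources a reply may need. */
typedef struct {
    int send_tickets;   /* up to 2 needed: reply header + AMLong data */
    int tmpmd_tickets;  /* 1 needed if the reply payload is unpinned  */
    int rplsb_chunks;   /* 1 needed to build the reply header         */
} am_resources_t;

/* Returns 1 and debits the pool if everything is available; otherwise
 * returns 0 and leaves the pool untouched, so the caller simply skips
 * the AM_EQ on this polling pass. */
static int try_acquire_am_poll_resources(am_resources_t *pool)
{
    assert(pool != NULL);
    if (pool->send_tickets < 2 || pool->tmpmd_tickets < 1 ||
        pool->rplsb_chunks < 1)
        return 0;
    pool->send_tickets  -= 2;
    pool->tmpmd_tickets -= 1;
    pool->rplsb_chunks  -= 1;
    return 1;
}

/* Assumed counterpart: hand everything back if no AM event was pending. */
static void release_am_poll_resources(am_resources_t *pool)
{
    assert(pool != NULL);
    pool->send_tickets  += 2;
    pool->tmpmd_tickets += 1;
    pool->rplsb_chunks  += 1;
}
```

Checking every counter before debiting any of them avoids holding a partial
set of resources, which could otherwise starve other operations while the
thread waits for the rest.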

In the case of an arriving AMShort or AMMedium Request, the handler id, args
and data payload are unpacked from the message and the requested handler is
executed.  Before running the handler, however, the receiving thread
constructs a token that contains the ID of the calling node, the ReqSB offset
of the sender, the RplSB offset it just obtained, and credit info.  If the
Request handler issues an AMReply operation, the AMReply notes this by
setting a bit in the token.  When the handler completes, the calling thread
checks for this bit.  If it was not set, then the handler did not issue a
reply, so the thread issues an AMReply Short with a no-op handler.  The reply
message serves to (1) return send credits and (2) cause the ReqSB chunk to be
deallocated.

If the handler issues an AMLong Reply that cannot be packed into the RplSB
chunk, then two messages are sent: the header is sent back to the originating
chunk of the ReqSB and the data payload is sent to the specified offset of
the RAR.

When the AMReply header arrives on the originating node, an event is
generated on the SAFE_EQ for the ReqSB MD.  In response to this event, the
receiving thread unpacks the data from the match-bits, hdr_data and header
payload and executes the requested Reply handler.  Once complete, the ReqSB
chunk is deallocated.  Also, the Reply message will contain credit update
values, allowing future AMRequests to be sent to that target.

There is an additional complication in the case of non-packed AMLong Request
and Reply operations.  Since there are two messages, the Request/Reply
handler cannot be executed until both messages have arrived.  To resolve this
problem, the sender computes a unique "Long ID" or LID and includes it in
both the header and data messages (as part of the match-bits or hdr_data).
The receiving thread checks a local cache to see if a message with that LID
from that source node has arrived yet.
If not, it adds an object to the LID cache with the metadata contained in the
message and returns a NULL object.  If, when it checks the LID cache, it
finds an object with that LID and source node ID, it removes the object from
the cache and returns it.  A call that gets a non-NULL object back from the
LID cache knows that both header and payload have arrived, and the AMLong
handler function can then be executed.

Refresh of ReqRB buffers
------------------------

At startup, a number of ReqRB buffers are allocated and chained, one after
the other, from the same Portals Table Entry with the same match-bits and
ignore-bits (with the CB at the end of the list).  As AM Requests arrive,
they are placed, in order, in the first ReqRB buffer.  The ReqRBs are
configured so that if the remaining available space in the buffer is less
than one ReqSB chunk, then the next AMRequest will be placed in the next
ReqRB in the chain.  As soon as all the AM Requests contained in the buffer
have been processed, the buffer is unlinked from the match-list and re-linked
as an empty buffer at the end of the list (just before CB).

Credit Management for GASNet Active Messages
--------------------------------------------

The only messages that require flow-control in the GASNet Portals-conduit are
AMRequest messages.  Regardless of how large the ReqRB buffers are, if all
nodes issue AMRequests to a single node (say X) quickly enough, then the
ReqRB buffers on X will be exhausted before X can reply and relink the
buffers.  The incoming messages will then fall into the CB MD: the message
body is dropped and an event is generated on the CB.  When X processes this
event, it is a fatal error and the application is killed.
In order to prevent this, and in an attempt to limit the size of the ReqRB
space, this conduit implements a credit-based flow control algorithm that
dynamically attempts to migrate credits to nodes that are frequent
communication partners.
See the section at the end of this document for a description of the
algorithm.

Notes on Bootstrapping at Startup
---------------------------------

Like every other GASNet communication operation in the portals-conduit,
gasnet barriers are implemented using Portals Put (or Get) operations.
However, we cannot issue a Portals Put operation to a remote node until we
are sure that node has allocated and configured all of its Portals resources
(MDs, EQs, etc).  So, at startup we need an external mechanism to act as a
barrier such that after the barrier, we can be sure that all nodes have their
resources initialized.  This mechanism will be system dependent.  On the Cray
XT series, we use the "cnos_barrier()" function.  However, before using this
function, each node must call "cnos_barrier_init()" and
"cnos_register_ptlid()".  In addition, we get the total number of nodes in
the job and our rank in the job from the "cnos_get_size()" and
"cnos_get_rank()" runtime functions.  We also get a mapping of job ranks to
portals_id addresses using the "cnos_get_nidpid_map()" function.
See gasnetc_init_portals_network() in gasnet_portals.c for details.

Notes on the Job Shutdown Process
---------------------------------

Job shutdown is a complicated affair.  For best behaviour, all applications
should have a barrier before calling gasnet_exit to terminate the job.
However, any node can call gasnet_exit at any time (after gasnet_init) and
the result should be to cleanly terminate all the nodes in the job.  So, the
first thread of a gasnet node to enter gasnet_exit will send shutdown
messages to all the other nodes in the job (using the SYS queue) and wait,
for a limited time, for them to reply that they too are shutting down.  If
they all reply, the shutdown is clean.  If not all reply, and a reasonable
time limit passes, the thread will reset the signal handler for SIGINT to its
default value and raise a SIGINT signal.
This causes the process to terminate and, as a side effect on the Cray XT, causes
the job launcher to kill all the other processes in the job.

Note that at startup, GASNet does redefine the handler for SIGINT.  Upon receipt
of SIGINT, the node calls gasnet_exit.  Unfortunately, this has the effect of
running the shutdown process, including sending Portals messages to other nodes,
in a signal handler context.  If the signal arrived when the thread was in a
non-reentrant Portals call, or holding a lock required to send Portals messages,
the result could be deadlock (on that node) or the results of Portals operations
could be undefined.  Recall that on the Cray XT, all communication and I/O are
performed using Portals operations.  In practice, this has not been an issue.
However, a better way to deal with this situation would be to set a global
"shutdown" flag that is checked frequently (such as each time we poll).  If a
thread sees that the flag is set, it should call gasnet_exit() in this safe
context.

=======================================================================================
Scheme for credit balancing and re-distribution

In Portals Conduit:
- 1 credit is equivalent to 256 bytes of ReqRB space + one event structure
  (128 bytes) = 384 bytes.
- Initially, total number of credits is computed based on size of ReqRB space:
     credit_memsize = (num_ReqRB_buffers-1)*(chunks_per_buffer-1)*sizeof_chunk
     total credits  = credit_memsize/256
  Must allocate AM_EQ with this number of event structures.  Once allocated,
  cannot increase size of EQ dynamically, at least not easily.
  NOTE: we can only allocate credits for num_ReqRB_buffers-1, since we cannot
  recycle a buffer until all AM Requests contained in the buffer have been
  processed.  Credits are returned to the requester with the AM Reply message,
  so the sender of earlier messages may already have received the credits for
  the space it used in this buffer but the buffer has yet to be recycled.
  Furthermore, we cannot count the last chunk in each buffer, since our mechanism
  for recycling the chunks ensures that the buffer is unlinked when the space
  remaining in the buffer is less than one chunk in size.
  The values of num_ReqRB_buffers and chunks_per_buffer will have to be adjusted
  at job bootstrap time depending on the value of gasneti_nodes and a reasonable
  value for the number of credits per node, etc.  These will have to be
  adjustable by runtime environment vars.  See table at end of this note for
  possible default values.
- A percentage of the total credits is banked for re-distribution.  The remainder
  are distributed to the Nodes-1 other gasnet_nodes.  (Note: no initial
  communication necessary, all nodes compute the same values).
- In order for Node X to send an AM Request to Node Y, it must have the required
  number of credits before it can send.  If not, it must poll the network until
  the credits are available:
  * An AMShort is 1 credit.
  * An AMMedium is 1-4 credits, depending on the payload size.
  * An AMLong is 2 credits: 1 for header (small) and 1 for data Put which
    accounts for the event consumed by remote RARAM PUT_END.
  * An AMLongPacked is 1-4 credits, depending on the payload size.
- AMReply messages do not consume credits since:
  * Short, Medium and Long headers are sent back to the ReqSB chunk the Request
    was issued from.
  * Number of AM's in-flight is controlled => AMLong Data PUT_END event in RARSRC
    is already accounted for in size of SAFE_EQ.
- AMReply messages carry a credit_update value back to the requester.  This is
  generally the number of credits the requester used to send the AMRequest.
  It may be more, see below.

In UPC, main use of AMs is:
  * global memory allocation, which only requires one AM per node even if all
    going to same target.
  * Lock management.  Again, 1 AM should do this.
  * collectives: usually tree based so log(nodes) communicating partners which is
    small (< 15 nodes).  Also, only a few AMs per collective (usually).
  * VIS: Point-to-point, but packed message bodies can be large.
    Example:
    - 32x32x32 grid of doubles per physical variable
    - 5 variables per cell
    - Ghost zones are 4 cells wide.
    There are 6 faces, assume only face values need to be communicated, no edge
    or corner values.  For example, the left face boundary region is 4x32x32
    doubles for each variable.  This is 32KB/variable.  AMMedium messages are
    limited to less than 1KB so this requires about 32 AMMediums per variable per
    face.  Each AMMedium is 4 credits and there are 5 variables per cell so this
    requires 32*4*5 = 640 credits for the owner of the patch to send the 160
    AMMediums to us.  Note that if the remote node is attentive to the network
    and can process the AM Requests quickly, it can return credits in the
    corresponding AM Replies and the actual number of credits needed to send
    these messages without stalling may be lower.
    Ignoring this possibility and granting 640 credits is equivalent to 240KB of
    buffer/event space on my node.  The 6 senders require a total of 1.4MB of
    space on my node, which is very reasonable.  However if we have to allocate
    this to ALL the nodes in the job, a 1024 node job would need 240MB of buffer
    space and a 10,000 node job would require 2.4GB.
    Our goal would be to grant fewer credits to each node to constrain the local
    buffer footprint and dynamically redistribute the credits so that frequent
    communication partners have the credits they need.

Goals of the algorithm:
1) The algorithm should be demand driven.  If there is no use for dynamic
   re-distribution no credits should be moved.
2) The overhead should be as low as possible.
3) If the communication pattern is static, the credits should migrate to the
   nodes needing them and not move afterwards.
4) We should have parameters that control the rate at which credits are
   migrated.  Migrating too quickly could cause credits to oscillate around the
   system.
   Dampening factors can help prevent this at the expense of building up the
   amount of credits over time.  The goal should be to reach a steady state in a
   'reasonable' amount of time.

Limits:
* Each remote node MUST be allocated a MINIMUM of 4 credits.  This is required
  to send a full-sized AM Medium.
  NOTE: For a 10,000 node job, this would require 15MB of ReqRB and event queue
  space on each node.  Allocating 6 credits per node would require 22 MB of
  space, but would allow each node to redistribute a maximum of 20,000 credits,
  which would be sufficient to grant 200 additional credits to 100 frequent
  communicating partners.  If tree based communication algorithms are used for
  collectives, ln(10000) is less than 17 frequent partners.
* There is no point to giving a remote node more than 2^16 = 64*1024 credits.
  This would be equivalent to 24MB of ReqRB + event space for that node alone.
  Thus, to keep metadata overhead small, per-node credit variables can be of
  type uint16_t.

Basic algorithm overview:
- When a node X stalls due to lack of credits trying to send an AM to a remote
  node Y, it requests additional credits.  If Y has them banked, it increases
  its loan to X; if not, the loan is not increased.
- When a node's BankedCredits gets low, it will issue CREDIT_REVOKE requests to
  nodes that communicate with it infrequently.  The target nodes determine how
  many, if any, credits to return to the requester.  The returned credits are
  banked for future use.
- State variables that record recent send/credit activity decay over time so
  that decisions on how many credits to revoke or request are made with current
  information, not old information.

Algorithm Details:
* Each node allocates its ReqRB and AM event queue structures resulting in
  TotalCredits credits.  A portion of these are banked for discretionary use,
  the rest are loaned, or distributed evenly to all other nodes during job
  initialization.
  By default, each node computes the number of credits_per_node and
  BankedCredits by interpolating values from a table, based on the number of
  GASNet nodes in the job.  From this,
     TotalCredits_est = credits_per_node*(num_nodes-1) + BankedCredits
  Since each credit is equivalent to 256 bytes of ReqRB space, the number of
  ReqRB buffers (each of a fixed, pre-determined size) needed to hold
  TotalCredits_est worth of ReqRB space is allocated.  Since ReqRB buffers are a
  fixed size, the total number of credits available is greater than
  TotalCredits_est with the additional going into the BankedCredits pool.
     gasneti_semaphore_t BankedCredits;  /* discretionary use credits */
  The default values of credits_per_node and BankedCredits can be changed at
  run-time via environment variables.
* We use the prefix 'Loan' for variables that represent a loan of our credits to
  a remote node.  We use the prefix 'Send' for variables that represent credits
  that have been loaned to us and that enable us to send AMRequests to the
  corresponding remote node.
* Runtime on each node is divided into epochs.  Certain state information is
  maintained on credit use during an epoch, and these values are reduced by a
  factor, or decayed, at the end of each epoch.  We use this to maintain temporal
  use information that decays to zero after a few epochs without use.
* Each node maintains a small amount of "connection state" for each other node
  in the job.  The state is maintained in the conn_state[] array, indexed by
  node number.
     #define GASNETC_MIN_CREDITS 4
     uint16_t LoanCredits;   /* number of my credits allocated to this remote node */
     uint16_t SendCredits;   /* number of remote node credits allocated to me */
     uint16_t SendStalls;    /* number of times we stall on credits trying to send AMs.  Decayed */
     uint16_t SendInuse;     /* current number of SEND credits in flight to them.
                              */
     uint16_t SendMax;       /* max number of SEND credits in flight this epoch.  Decayed */
     uint16_t SendRevoked;   /* number of SEND credits I give back this epoch.  Decayed */
     uint16_t LoanRequested; /* number of additional LOAN credits requested by them.  Decayed */
     uint16_t LoanGiven;     /* number of additional LOAN credits given to them.  Decayed */
     dll_link_t link;        /* 32 bit structure used as link in doubly linked list. */
     uint8_t flags;          /* used mainly in credit management */

  NOTE: node[X]:conn_state[Y].LoanCredits == node[Y]:conn_state[X].SendCredits
        except during times when messages containing additional credits are
        in-flight.
  NOTE: SendCredits only changes when additional credits are allocated to us.
        We can reduce the number of vars in the state by eliminating SendInuse
        and just decrementing SendCredits when they are in use.  In this case,
        we would record SendMin rather than SendMax to record how many "extra"
        credits (above active use) we have.  Unfortunately, we would not be able
        to Decay SendMin in the same way as other values, since in this case its
        value would have to increase, and we would no longer know the total
        number of credits we have been granted, so we would not know the max
        value that SendMin could take.
  NOTE: SendStalls and LoanRequested could be replaced with flag values, but
        then would not decay slowly but instantly at end of epoch.
  NOTE: The link field allows the individual records to be threaded on a list of
        nodes to be scavenged for un-used credits when the bank pool is
        depleted.  Only nodes that have been lent more than the minimum number
        of credits are on the list.  For small problems, it will usually be the
        case that all nodes are always on the list.  For large problems, the
        list should be significantly smaller than the total number of nodes.
  Some flag values are:
     #define GASNETC_SYS_MSG_INFLIGHT          0x02U
     #define GASNETC_CREDIT_REVOKE_ZERO_REPLY  0x04U
     #define GASNETC_REACHED_EPOCH             0x08U
  NOTE: We must update these fields as atomic operations.
        The locks would be held for only short periods of time so we can
        possibly use gasneti_spinlock_t, but they might not be fair.  These
        cannot be used in interrupt handler code.
  NOTE: conn_state ~ 40 bytes per node or less than 400 KB for 10000 nodes.
* Several (global) control knobs are set:
     int gasnetc_revoke_limit = XXX;    /* Max number of credits a node can revoke in an epoch */
     int gasnetc_lender_limit = XXX;    /* Max number of credits a node can lend another within an epoch */
     int gasnetc_max_cpn = XXX;         /* The max number of credits that will be lent to a node */
     int gasnetc_epoch_duration = XXX;  /* The number of AM Requests a node will receive before
                                         * aging LOAN variables and starting a new epoch. */
* Other global variables:
     gasneti_mutex_t gasnetc_epoch_lock;     /* control access to epoch update */
     gasneti_mutex_t gasnetc_scavenge_lock;  /* control access to/updates of link fields for scavenge list */
     uint16_t gasnetc_scavenge_list;         /* head of list of nodes to scavenge for credits */
     /* GASNETC_CREDIT_DECAY is used to decay the credit values at the end of an epoch. */
     #define GASNETC_CREDIT_DECAY(var) var = ((var)>>2)
* Each node Y will end its own epoch after it receives gasnetc_epoch_duration AM
  Requests.  At this point, it will decay the LOAN values (LoanGiven,
  LoanRequested) of all its conn_state[X] records.  It also sets the
  GASNETC_REACHED_EPOCH bit in conn_state[X].flags for each state record.  The
  next time Y sends an AMRequest or Reply, or SYS queue Revoke Request message
  to a node X, it will clear the GASNETC_REACHED_EPOCH bit in
  conn_state[X].flags and include a tag in the message informing the target to
  decay its SEND variables in its conn_state[Y] record.
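
  As a rough illustration of the end-of-epoch step just described, the sketch
  below decays the LOAN counters of every record and marks each record so that
  the next message to that peer can carry the "decay your SEND variables" tag.
  It is a simplified, single-threaded model and not the conduit's code: the
  struct is trimmed to the fields involved, the names conn_state_t and
  end_epoch_decay are hypothetical, and all locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GASNETC_REACHED_EPOCH 0x08U

/* Same effect as the macro in the text: keep one quarter of the value. */
#define GASNETC_CREDIT_DECAY(var) ((var) = (uint16_t)((var) >> 2))

/* Hypothetical, trimmed-down connection record (illustration only). */
typedef struct {
    uint16_t LoanRequested;
    uint16_t LoanGiven;
    uint8_t  flags;
} conn_state_t;

/* Hypothetical helper: called by a node once it has received
 * gasnetc_epoch_duration AM Requests.  Decays the LOAN counters of every
 * record and sets GASNETC_REACHED_EPOCH so the next message sent to that
 * peer can tell it to decay its SEND variables.  Other flag bits are
 * left untouched. */
static void end_epoch_decay(conn_state_t *state, size_t nnodes)
{
    assert(state != NULL);
    for (size_t i = 0; i < nnodes; i++) {
        GASNETC_CREDIT_DECAY(state[i].LoanRequested);
        GASNETC_CREDIT_DECAY(state[i].LoanGiven);
        state[i].flags |= GASNETC_REACHED_EPOCH;
    }
}
```

  With a shift of 2, a counter that is not refreshed drops to a quarter of its
  value each epoch, so any 16-bit value reaches zero within eight epochs.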
* When Node X wants to send an AM Request to node Y, it computes the credits
  required (ncredit) then checks if it has enough credits:

     credits_avail = conn_state[Y].SendCredits - conn_state[Y].SendInuse;
     stalled = false;
     nextra = 0;
     if (ncredit > credits_avail) {
        stalled = true;
        nextra = ncredit - credits_avail;
        conn_state[Y].SendStalls++;
        do {
           poll_the_network
           credits_avail = conn_state[Y].SendCredits - conn_state[Y].SendInuse;
        } until (ncredit <= credits_avail);
     }
     conn_state[Y].SendInuse += ncredit;
     conn_state[Y].SendMax = MAX(conn_state[Y].SendMax,conn_state[Y].SendInuse);

  Construct a credit_byte = EXXXNNNN where each character represents a bit:
     E    = 1 if sender ended an epoch, otherwise 0.
     XXX  = nextra   The number of credits it was stalled waiting for.
     NNNN = ncredit  The number of credits required to send the message.
  The credit_byte is included in the header of the message.

* When Node Y processes the AMRequest from node X:
  It extracts the credit_byte and examines the fields: EXXXNNNN
     If E == 1 then it decays the SEND vars of conn_state[X].
     set nextra = XXX, the number of additional credits the node would like to have.
     set ncredit = NNNN, the number of credits required to send the Request.
  It executes the requested GASNet handler, then the AMReplyXXXX code.
  The return message will also contain a credit_byte: EXXXNNNN, in which:

     return_E = 0;
     if (conn_state[X].flags & GASNETC_REACHED_EPOCH) {
        return_E = 1;   /* we reached our end of epoch since last message */
        conn_state[X].flags &= ~GASNETC_REACHED_EPOCH;  /* clear the bit */
     }
     return_credits = ncredit;  /* always return the number of credits the sender used */
     return_nextra = 0;
     if (nextra > 0) {
        conn_state[X].LoanRequested++;
        if ((nextra <= BankedCredits) && (conn_state[X].LoanGiven <= gasnetc_lender_limit)) {
           BankedCredits -= nextra;
           conn_state[X].LoanGiven += nextra;
           return_nextra = nextra;
        }
     }
     Construct credit_byte EXXXNNNN with:
        E    = return_E
        XXX  = return_nextra
        NNNN = return_credits
     Send AM Reply message.
     if (BankedCredits is low) scavenge_for_credits();

* When Node X processes AM Reply from node Y:
  It extracts the credit_byte field EXXXNNNN = {E}{nextra}{ncredit}

     if (E == 1) {
        Decay SEND variables in conn_state[Y].
     }
     conn_state[Y].SendInuse -= ncredit;   /* the original number have been returned */
     conn_state[Y].SendCredits += nextra;  /* and possibly extra credits have been loaned to us */

* When a lender is running low on BankedCredits, it calls
  scavenge_for_credits(), which uses a special out-of-band system queue to issue
  credit_revoke() messages to nodes that have not been sending it AMRequests in
  the recent past.
  scavenge_for_credits() will first lock the gasnetc_scavenge_lock to ensure no
  other thread can add or remove nodes from the list.  It walks the
  gasnetc_scavenge_list looking for nodes from which to revoke credits.  As it
  walks the list, it will skip a node on the list for any of the following
  reasons:
  * node == gasneti_mynode.  We do not need credits to send to ourself.
  * trylock to obtain the state lock fails (another thread holding it).
  * An out-of-band system message is in progress.  We can only send one system
    message to a node at a time.
  * LoanCredits <= GASNETC_MIN_CREDITS.  In general, this will not happen, since
    only nodes with more than the minimal number of credits will be put on the
    list.  However, there is a small race condition between when the LoanCredits
    is decremented and when the node gets removed from the list.
  * A previous revoke request resulted in a zero reply during this epoch.  No
    sense trying again.
  * The node requested additional credits during this epoch, so do not ask it
    to give up any.
  Otherwise, send a SYS_CREDIT_REVOKE message to the node.  Continue this search
  until the list wraps or until gasnetc_num_scavenge targets are hit.
  NOTE: The head of the list, gasnetc_scavenge_list, will be set to point to
  where we left off, ensuring that nodes on the list are targeted in a fair
  manner.
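
  The credit_byte carried by AM Requests and Replies (EXXXNNNN) can be packed
  and unpacked with a few shifts and masks.  The sketch below is only an
  illustration: it assumes the pattern is read most-significant bit first
  (E in bit 7, XXX in bits 6..4, NNNN in bits 3..0), which the text does not
  state explicitly, and the names credit_info_t, credit_byte_pack and
  credit_byte_unpack are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded form of the credit_byte, EXXXNNNN (hypothetical type name). */
typedef struct {
    unsigned end_epoch; /* E:    1 if the sender ended an epoch, else 0 */
    unsigned nextra;    /* XXX:  extra credits wanted or granted, 0..7  */
    unsigned ncredit;   /* NNNN: credits used or returned, 0..15        */
} credit_info_t;

/* Pack the three fields into one byte.
 * Assumed layout: bit 7 = E, bits 6..4 = XXX, bits 3..0 = NNNN. */
static uint8_t credit_byte_pack(credit_info_t c)
{
    assert(c.end_epoch <= 1 && c.nextra <= 7 && c.ncredit <= 15);
    return (uint8_t)((c.end_epoch << 7) | (c.nextra << 4) | c.ncredit);
}

/* Inverse of credit_byte_pack(). */
static credit_info_t credit_byte_unpack(uint8_t b)
{
    credit_info_t c;
    c.end_epoch = (b >> 7) & 0x1U;
    c.nextra    = (b >> 4) & 0x7U;
    c.ncredit   = b & 0xFU;
    return c;
}
```

  One consequence of the 3-bit XXX field is that a single message can request
  or grant at most 7 extra credits, so a larger shortfall would have to be
  clamped before packing.  How the conduit handles that case is not described
  here.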
  When Y sends an out-of-band SYS_CREDIT_REVOKE message to node X, it sets
     conn_state[X].flags |= GASNETC_SYS_MSG_INFLIGHT;
  and will not send another system message to X until this bit is cleared by the
  reply message.
  Note: if conn_state[X].flags & GASNETC_REACHED_EPOCH then the system message
  will also send a bit indicating that X should decay its SEND variables to Y.

* When node X receives a SYS_CREDIT_REVOKE message from Y on the system queue,
  it will evaluate whether it has SendCredits that can be returned to the
  lender, but first: if the message contained a flag indicating that Y reached
  an end of epoch, decay our SEND variables in conn_state[Y].

     navail = conn_state[Y].SendCredits - MAX(GASNETC_MIN_CREDITS,conn_state[Y].SendMax);
     ncredit = MIN(navail, MAX(0,gasnetc_revoke_limit - conn_state[Y].SendRevoked));
     /* Since SendMax is the max number of credits we used recently, navail is
      * the number we could return.  However, if this is a large number, we should
      * probably limit the amount we return by gasnetc_revoke_limit, at least this
      * epoch, since if we have a large number of extra credits, that means we used
      * them in the past and may need them again.  On the other hand, if navail is
      * small, then we may be in the "each node of 10,000 gets 6 credits" situation
      * and all nodes have to give up the extra credits.  We can possibly have other
      * run-time variables to indicate we are in a low-overall-credit-per-node
      * situation to change this behavior */
     if (conn_state[Y].SendStalls > 0) ncredit = 0;
        since we recently stalled sending to the requester and so are low on credits.
     if (conn_state[Y].SendRevoked >= gasnetc_revoke_limit) ncredit = 0;
        since we have already given up enough SendCredits recently
     conn_state[Y].SendCredits -= ncredit;
     conn_state[Y].SendRevoked += ncredit;
     end_epoch = 0;
     if (conn_state[Y].flags & GASNETC_REACHED_EPOCH) {
        /* Inform Y we have reached an epoch and it should decay send vars */
        conn_state[Y].flags &= ~GASNETC_REACHED_EPOCH;
        end_epoch = 1;
     }
     issue a SYS_CREDIT_RETURN message back to the calling node Y with ncredit
     credits and including the value of end_epoch.

* When node Y receives a SYS_CREDIT_RETURN message with arguments
  (ncredit, end_epoch) from node X it will:

     conn_state[X].flags &= ~GASNETC_SYS_MSG_INFLIGHT;
        to indicate the REVOKE message is complete
     if (end_epoch) decay SEND vars in conn_state[X].
     if (ncredit > 0) {
        conn_state[X].LoanCredits -= ncredit;  /* reduce the amount of their loan */
        BankedCredits += ncredit;              /* bank the credits for later distribution */
     } else {
        conn_state[X].flags |= GASNETC_CREDIT_REVOKE_ZERO_REPLY;
        /* this will prevent us from hitting up node X again this epoch */
     }

* Epoch: Used to decay influence of usage variables over time.
  When polling network, if epoch time has expired, call start_epoch();
  Ensure only one thread enters start_epoch() per epoch by using the lock and
  checking the start time.
     #define DECAY(val)  val = (uint16_t)((double)(val) * (1.0-gasnetc_decay_rate))

     void start_epoch()
     {
        lock(gasnetc_epoch_lock);
        if (current_time - gasnetc_epoch_starttime < gasnetc_epoch_duration) {
           /* another thread already did this, just exit */
           unlock(gasnetc_epoch_lock);
           return;
        }
        gasnetc_epoch_starttime = current_time;
        for (node = 0; node < gasneti_nodes; node++) {
           state = &conn_state[node];
           LOCK(state);    /* spinlock */
           DECAY(state->SendStalls);
           DECAY(state->SendRevoked);
           DECAY(state->SendMax);
           DECAY(state->LoanRequested);
           DECAY(state->LoanGiven);
           /* keep only the in-flight bit; all other flags reset each epoch */
           if (state->flags & GASNETC_SYS_MSG_INFLIGHT) {
              state->flags = GASNETC_SYS_MSG_INFLIGHT;
           } else {
              state->flags = 0;
           }
           UNLOCK(state);
        }
        unlock(gasnetc_epoch_lock);
        return;
     }

Additional Notes:
* The state variables LoanRequested and SendStalls have accumulated values but
  are only used to see if, respectively, a loan was requested or a stall
  occurred.  These could be replaced by flag values.  If so, they would only
  last from the time they were set to the end of the epoch (similar to
  REVOKE_ZERO_REPLY).  As variables, they will be decayed, but not necessarily
  set to zero at the start of a new epoch.  Could try both approaches to see if
  it matters.
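
The decision a node makes when it receives a SYS_CREDIT_REVOKE request, as
described in the algorithm details, can be collected into one small function.
The sketch below is an illustration under stated assumptions and not the
conduit's implementation: send_state_t is a hypothetical, trimmed-down view of
the SEND fields of conn_state[Y], credits_to_return is a hypothetical name, the
revoke limit is passed in as a parameter, and the clamp to zero for a negative
navail is an added guard that the text does not mention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GASNETC_MIN_CREDITS 4

/* Hypothetical, trimmed-down view of the SEND side of conn_state[Y]. */
typedef struct {
    uint16_t SendCredits;  /* credits the lender has allocated to us   */
    uint16_t SendMax;      /* recent peak of credits in flight         */
    uint16_t SendStalls;   /* recent stalls while sending to lender    */
    uint16_t SendRevoked;  /* credits already given back this epoch    */
} send_state_t;

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Hypothetical helper: decide how many credits to hand back in reply to
 * a SYS_CREDIT_REVOKE and update the record accordingly.  Returns the
 * number of credits returned (possibly zero). */
static int credits_to_return(send_state_t *s, int revoke_limit)
{
    assert(s != NULL);
    /* Credits above recent peak use, never dipping below the minimum. */
    int navail = (int)s->SendCredits
               - imax(GASNETC_MIN_CREDITS, (int)s->SendMax);
    /* Cap by what is left of this epoch's revoke budget. */
    int ncredit = imin(navail, imax(0, revoke_limit - (int)s->SendRevoked));
    if (ncredit < 0) ncredit = 0;        /* added guard, see lead-in   */
    if (s->SendStalls > 0) ncredit = 0;  /* recently stalled: keep all */
    s->SendCredits = (uint16_t)(s->SendCredits - ncredit);
    s->SendRevoked = (uint16_t)(s->SendRevoked + ncredit);
    return ncredit;
}
```

For example, a node holding 20 credits with a recent peak of 6 in flight and a
revoke limit of 8 returns 8 credits on the first request in an epoch and none
on a second request, since its budget for the epoch is then used up.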