GASNet inter-Process SHared Memory (PSHM) design --------------------------------------------- $Revision: 1.16 $ Document by: Dan Bonachea Paul H. Hargrove Filip Blagojevic Implementation by: Jason Duell Filip Blagojevic Paul H. Hargrove Goal: ---- Provide GASNet with a mechanism to communicate through shared memory among processes on the same compute node. This is expected to be more robust than pthreads (which greatly complicates the Berkeley UPC runtime, and prevents linking to any numeric libraries that that are not thread-safe). It is also expected to display lower latency than use of a network API's loopback capabilities (though the network hardware might provide other benefits such as asynchronous bulk memory copy w/o cache pollution). We appreciate your feedback related to PSHM (both positive and negative) and would be happy to work with you to improve PSHM. To use: ------ In the current release, GASNet's PSHM support is enabled by default only on Linux. On all other platforms, one must pass --enable-pshm if PSHM support is desired. On Linux PSHM can be disabled by passing --disable-pshm at configure time. The PSHM support in GASNet can operate via three possible mechanisms: POSIX shared memory, SystemV shared memory, or mmap()ed disk files. When PSHM is enabled, the default is for the configure step to probe only for support via POSIX shared memory (except on MacOS, where it is known to be broken). If no POSIX shared memory support if found, there is no automatic fallback to any other mechanism. So, if one wishes to use SystemV shared memory or mmap()ed files, one should explicitly disable the POSIX support and enable the desired mechanism: Usage Summary (flags to be passed to the configure script): ---------------------------------------------------------- OFF: --disable-pshm POSIX: --enable-pshm SYSV: --enable-pshm --disable-pshm-posix --enable-pshm-sysv FILE: --enable-pshm --disable-pshm-posix --enable-pshm-file On Linux "--enable-pshm" is the default. On all other platforms "--disable-pshm" is the default. PSHM includes 2 environment variables for controlling use of memory for intra-memory AM traffic: GASNET_PSHMNET_QUEUE_DEPTH Number of slots (1 cache line each) in the AM receive queues. For P process per node, there are O(P^2) of these queues. If GASNET_PSHMNET_QUEUE_DEPTH is not set, the default number of slots in each queue is 8. The PSHM-AM network provides a flow-control mechanism: the sender waits until there is a free slot in the receiver's AM queue. GASNET_PSHMNET_QUEUE_MEMORY Size of memory pool used to hold AM payloads. For P processes per node, there are O(P) of these pools. If GASNET_PSHMNET_QUEUE_MEMORY is not set, the default size of each memory pool is 1MB. The PSHM-AM network provides a flow-control mechanism: the sender waits until there is an available chunk of memory for holding the AM's payload. Parameters Setting: ------------------ Although recommended as the first option, POSIX shared memory is not available on all systems, even systems running Linux may not be configured to support it. In the absence of the POSIX shared memory, users are advised to use the SystemV shared memory as the next-best option. In the absence of both, POSIX and SystemV shared memory, a user may try using mmap()ed disk files. However, on some systems we see significant performance degradation when using files (apparently due to committing the changes from memory to disk). On most operating systems the amount of available SystemV shared memory and the number of shared memory segments is controlled by the kernel parameters: shmmax, shmall and shmmni. shmmax = largest size of a shared memory segment (in bytes) shmall = total amount of memory allocatable as shared (in pages) shmmni = maximum number of shared memory segments Insufficient amount of SystemV shared memory will lead to failures at start-up of any application using a runtime configured to use PSHM over SystemV. Setting these parameters is system-specific and requires administrator privileges. * Examples to set: - Linux: sudo /sbin/sysctl -w kernel.shmmax= sudo /sbin/sysctl -w kernel.shmall= sudo /sbin/sysctl -w kernel.shmmni= - MacOS: sudo /sbin/sysctl -w kern.sysv.shmmax= sudo /sbin/sysctl -w kern.sysv.shmall= sudo /sbin/sysctl -w kern.sysv.shmmni= - Various BSD flavors: FreeBSD: follow MacOS example, replacing "sysv" with "ipc". NetBSD: follow MacOS example, replacing "sysv" with "ipc". OpenBSD: follow MacOS example, replacing "sysv" with "shminfo". - Solaris: Depends on version. Please see the Sun/Oracle documentation. - Cygwin: See Cygwin under "change parameters permanently", below. * To change parameters permanently: - Linux: Add the following lines to /etc/sysctl.conf (sudo required): kernel.shmmax= kernel.shmall= kernel.shmmni= To reload the new settings: sudo /sbin/sysctl -p /etc/sysctl.conf - MacOS: Add the following lines to /etc/sysctl.conf (sudo required): kern.sysv.shmmax= kern.sysv.shmall= kern.sysv.shmmni= To reload the new settings: reboot the machine. - Various BSD flavors: FreeBSD: follow MacOS example, replacing "sysv" with "ipc". NetBSD: follow MacOS example, replacing "sysv" with "ipc". OpenBSD: follow MacOS example, replacing "sysv" with "shminfo". - Solaris: Depends on version. Please see the Sun/Oracle documentation. - Cygwin: If you have not done so yet, please Read the Cygwin documentation on "cygserver-config" to create an initial /etc/cygsever.conf and start the server as a Windows service (optional). You may then edit /etc/cygserver to edit kern.ipc.shmmaxpgs kern.ipc.shmmni kern.ipc.shmseg Except for shmmaxpgs, the defaults are often large enough. These configuration values are only read when cygserver starts. So, read the Cygwin documentation to determine if/how to restart the cygserver service. Under Cygwin-1.5 you may also need to add "server" to the value of the CYGWIN environment variable. Again, you should see the Cygwin documentation for more information on this subject. IMPORTANT, SYSTEM CLEANING: -------------------------- If a GASNet application using PSHM is terminated before ending the initialization phase, there is a possibility that the shared memory objects will remain in the system. A large amount of memory or disk space can remain allocated, preventing users from fully utilizing all available hardware resources. In the SystemV case, the allocated (but not released) shared memory segments can be listed via the "ipcs" command, and can be removed via the "ipcrm" command. Note that on the systems with a batch scheduler, the "ipcs" and "ipcrm" instructions need to be run on the compute nodes. In the mmap()ed file case, the allocated but not released shared memory files can be found in the directory pointed by the TMPDIR environment variable (default: /tmp). These files are named with the prefix GASNT (the lack of an 'E' is not a typo), and can be deleted using the "rm" command. In the case of POSIX shared memory, the implementation is system-specific. In the case of Linux and Solaris, POSIX shared memory objects are visible in the file system. For Linux the default location is /dev/shm and on Solaris the default is /tmp. Scope: ----- * GASNet segment via PSHM only supported for SEGMENT_FAST or SEGMENT_LARGE (not meaningful for SEGMENT_EVERYTHING mode) * May eventually support AM-over-PSHM for SEGMENT_EVERYTHING (but not yet) * Applicable both w/ and w/o pthreads Terminology: ----------- * node: each UNIX process running GASNet * supernode: 1 or more nodes with cross-mapped segments using PSHM support * supernode peers: nodes which share a supernode Interface notes: --------------- * All node processes call gasnet_init(), each is a separate GASNet node * PSHM is enabled/disabled at configure time and GASNETI_PSHM_ENABLED is #defined to either 1 or 0. Each conduit can then #define GASNET_PSHM to 1 if it implements PSHM support. * gasnetc_init() performs super-node discovery, using OS-appropriate (or conduit-specific) mechanisms to figure out which nodes are capable of sharing memory with which other nodes: - unconditionally calls gasneti_nodemapInit() (to drive "discovery") - calls gasneti_pshm_init() only if PSHM support enabled (to setup data) * MaxLocal/Global return values reflecting the amount of segment space divided evenly among the supernode peers, and each node passes a size to gasnet_attach reflecting the per-node segment size they want. * gasnet_attach takes care of mapping each processor's segments as usual, but also maps the segments of supernode peers into each nodes VM space using OS-appropriate mechanisms. (shm_open()+mmap(), shmget()+shmat(), etc.). * Nodes on a supernode typically have different virtual address map of the segments on that supernode. They are typically not contiguous either. * Client calls getSegmentInfo to get the location of his segment and those of other nodes (as always) * seginfo_t for node X reflects the shared segment belonging to X, but also includes a supernode identifier (node_info) so nodes can see which nodes share their supernode * Client may directly load/store into the segments of any node sharing their supernode (currently implemented in Berkeley UPC runtime library) * remotely-addressable segment restrictions on gasnet_put/get/AMLong apply to the individual segments - ie gasnet_put() to an address in the segment of node X must give node X as the target node, not some other supernode peer Restrictions: ------------ * gasnet_hsl_t's are node-local and while they might reside in the segment, they may not be accessed by more than one node in a supernode - we can/should add a debug-mode check for this (also applies to shmem-conduit) * Use of GASNet atomics in the segment is allowed, but they must not be weak atomics (which means using the explicitly "strong" ones in client code). Closed (previously "Open") questions: ------------------------------------ Q1) Do we need a separate build or separate configure of libgasnet and/or libupcr with PSHM enabled/disabled? A1) Since the set of conduits supported by PSHM was initially a small subset of the total list, we chose not to complicate the UPC compiler with this. Thus we've chosen to configure everything (UPCR+gasnet) w/ --enable-pshm or w/o. The number of conduits supporting PSHM is now irrelevant since in a PSHM-enabled build of GASNet any conduits not supporting PSHM are simply built w/o it (as opposed to not built at all as was once the case). Q2) If we want to use the same build, then how should GASNET_ALIGNED_SEGMENTS definition behave? Never true when any supernode contains more than one node, but don't know that until runtime. A2) We assume that you don't use PSHM unless also using > 1 proc/node. May also revisit if we don't configure PSHM as a distinct build. Q3) Can we get away with always connecting segments after all processes are created, or do we need to fork after setting up shared memory segments? Will drivers & spawners even allow that? If we decide that a fork is required after job launch, then it should definitely be done by the conduit, not the client code. But how would the interface look? (this would very likely break MPI interoperability) A3) All supported conduits are attaching to segments in gasnet_attach(). We don't need to work about fork() at all (except that smp-conduit now has a fork-based spawner inside gasnetc_init()). Q4) Does the client code between init/attach need to know the supernode associations? (eg to make segsize decision) A4) So far we have not seen a need for this (though internal to GASNet we do). Q5) Can/do we still get allocate on first write mapping for the segment? - If so, who's responsible for establishing processor/memory affinity with first touch? (probably the client) A5) We have each node mmap() its own segment before any cross-mapping is done which should ensure locality if the OS does allocation at mmap() time. We currently have the client doing first-touch to deal with the case that the OS does page frame allocation on touch, rather than mmap(). Open questions: -------------- * How do we handle 8 or 16-way SMPs on 32-bit platforms where VM space is already tight, or OS's where the limit on sharable memory is small? This design would make our per-node segsizes rather small. Do we want a mode where segments are not cross-mapped, but the gasnet_put/get can bypass the NIC using a two-copy scheme through bounce buffers? - This bounce buffer mode could potentially also help for EVERYTHING mode (without pshm segments), although due to attentiveness issues, it may be slower than using loopback RDMA - Is this mode just the extended-ref using AM-over-PSHM? * Do we ever want to allow supernodes to share a physical node? (eg to increase segment size or to leverage NUMA affinity) - if so, need an interface to specify this (probably environment variables) * Will there be contention with MPI for resources (and should we care)? Known Problems / To do: ---------------------- * The mechanism we are using to probe for maximum segment size works fine on a system with plenty of memory, but dies on systems with less. The work around is to set the GASNET_MAX_SEGSIZE small enough for a given system. * There are still error cases that will leak shared memory. Status: ------ * The entire GASNet and Berkeley UPC test suites have been run on the following platforms and there are no known pshm-specific failures: - Linux 2.6/i686 smp, vapi, gm, mpi, udp - Linux 2.6/x86-64 smp, vapi, ibv, elan, mpi, udp - Linux 2.6/ia64 smp - Linux 2.6/ppc64 smp - AIX 5.3/ppc64 smp - Solaris 10/SPARC64 smp, udp - Solaris 10/x86 smp - Solaris 10/x64 smp - OpenSolaris/x86 smp - OpenSolaris/x64 smp - FreeBSD 8/i386 smp - FreeBSD 8/amd64 smp - IBM BG/P smp, dcmf We have no reason to think that any conduit listed above would NOT support PSHM on any OS/cpu above (assuming the conduit works on that platform at all). * IBM BG/P platform-specific notes: - PSHM does not work at all in "SMP" mode, only "DUAL" or "VN" - In "VN" mode one cannot run a hybrid GASNet+MPI code (appears that some scarce resource is exhausted in this case, but we have no details). - Currently one must manually set up a few things to use PSHM on BG/P + Must set CROSS_HAVE_SHM_OPEN=1 on configure command line + Shared segment size must be limited to a lower-than-default value + BG_SHAREDMEMPOOLSIZE env var must be set to fit the shared segment plus extra for Active Message buffers - Example environment variable settings for 200MB shared heap in VN mode when using Berkeley UPC's upcrun command: UPC_SHARED_HEAP_SIZE=200M BG_SHAREDMEMPOOLSIZE=820 - Example environment variable settings for 400MB shared heap in DUAL mode when using Berkeley UPC's upcrun command: UPC_SHARED_HEAP_SIZE=400M BG_SHAREDMEMPOOLSIZE=810 * Cray XT and XE platform-specific notes: - Portals conduit supports PSHM on the Cray XT - We have no native conduit for the Cray XE, but mpi-conduit is supported - CNL/CLE does not (at least on any systems we've encountered) support POSIX shared memory, but SystemV shared memory is available. - Add the following to your (cross-)configure options: --enable-pshm --enable-pshm-sysv * MacOS X platform-specific notes: - MacOS X with POSIX shared memory is NOT supported because we appear to trigger a kernel memory leak. - SystemV shared memory is a valid choice: --enable-pshm --enable-pshm-sysv - Use of mmap()ed files has been seen to cause VERY slow start-up. * {Free,Open,Net}BSD platform-specific notes: - FreeBSD supports POSIX shared memory and has been well tested - OpenBSD and NetBSD do not support POSIX shared memory, but do support SystemV: --enable-pshm --enable-pshm-sysv * GASNet conduits known NOT to work: - LAPI conduit requires "large pages" which cannot be obtained for POSIX shared memory segments by any mechanism we are aware of. While we now have support for SystemV-based PSHM, we lack access to a LAPI system to complete the required integration with the conduit. - SHMEM conduit does not support PSHM, but there is no reason to think that doing so would be constructive. - SCI conduit does not support PSHM because we no longer have any platform on which to develop and test sci-conduit. Keep in mind that if you use one of these conduits on a platform with the necessary support for PSHM, you may still configure with --enable-pshm to get PSHM support in other conduits (e.g. SMP and MPI), and these few conduits will still build (they will simply be missing PSHM support). * Platforms known NOT to work: - Catamount can't share memory (in any way that we know of).