KNEM: Fast Intra-Node MPI Communication

See also the main documentation for more information about installing, starting and using KNEM.

See knem_io.h for more details about the interface. This header file also explains how to port from the old KNEM interface to the new one.

Interface Basics

Once loaded, the KNEM kernel module creates a /dev/knem pseudo-character device (see the main documentation for details about granting access rights to this file). Applications must open this file (in read/write mode) before passing commands to the driver.

  #include <fcntl.h>   /* open(), O_RDWR */
  #include <knem_io.h>

  ...
  knem_fd = open(KNEM_DEVICE_FILENAME, O_RDWR);
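The open call can fail if the module is not loaded or access rights were not granted, so it is worth checking the result. A minimal sketch follows. The helper open_knem_device is hypothetical (it is not part of KNEM), and it takes the path as a parameter so that it can be compiled without knem_io.h; in a real application the argument would be KNEM_DEVICE_FILENAME.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of KNEM): open the device file in
 * read/write mode. `path` would normally be KNEM_DEVICE_FILENAME.
 * Returns the file descriptor, or -1 after printing the reason. */
static int open_knem_device(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0) {
        /* ENOENT usually means the device file does not exist (module
         * not loaded); EACCES that access rights were not granted. */
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
        return -1;
    }
    return fd;
}
```

A caller would then write knem_fd = open_knem_device(KNEM_DEVICE_FILENAME) and fall back to another communication strategy when the result is -1.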

Preparing remote access to a local memory region

To prepare access to local memory from another process, declare a memory region (composed of one or several segments) and pass it to the driver.

  struct knem_cmd_create_region create;
  struct knem_cmd_param_iovec knem_iov[2];
  ...
  knem_iov[0].base = <myaddress>;
  knem_iov[0].len = <mylength>;
  knem_iov[1].base = <myotheraddress>;
  knem_iov[1].len = <myotherlength>;
  ...
  create.iovec_array = (uintptr_t) &knem_iov[0];
  create.iovec_nr = 2;
  create.flags = KNEM_FLAG_SINGLEUSE; /* automatically destroy after first use */
  create.protection = PROT_READ; /* only allow remote readers */
  err = ioctl(knem_fd, KNEM_CMD_CREATE_REGION, &create);

The region is now stored in the driver and associated with a 64-bit cookie, whose value is available in create.cookie.
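Since the cookie is a plain 64-bit value, it can be handed to the peer process over any channel the two processes already share. The sketch below assumes a pipe or stream socket between them and represents the cookie as uint64_t, which is an assumption based on the 64-bit size stated above. The helpers send_cookie and recv_cookie are hypothetical, not part of KNEM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: write the whole cookie to fd, retrying on
 * short writes and EINTR. Returns 0 on success, -1 on error. */
static int send_cookie(int fd, uint64_t cookie)
{
    const char *p = (const char *) &cookie;
    size_t left = sizeof(cookie);

    assert(fd >= 0);
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }
    return 0;
}

/* Hypothetical helper: read exactly one cookie from fd.
 * Returns 0 on success, -1 on error or premature end of stream. */
static int recv_cookie(int fd, uint64_t *cookie)
{
    char *p = (char *) cookie;
    size_t left = sizeof(*cookie);

    assert(fd >= 0 && cookie != NULL);
    while (left > 0) {
        ssize_t n = read(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1; /* peer closed before sending a full cookie */
        p += n;
        left -= (size_t) n;
    }
    return 0;
}
```

The bytes are sent in host order, which is sufficient here because both processes run on the same node.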

Accessing a remote memory region

Once region creation has returned a cookie, the application may pass this cookie to the receiver through any other communication channel. The receiver process then passes its own array of destination memory segments (where the data should be copied to) to the driver, along with the corresponding remote cookie.

  struct knem_cmd_inline_copy icopy;
  struct knem_cmd_param_iovec knem_iov[5];
  ...
  knem_iov[0].base = <myaddress>;
  knem_iov[0].len = <mylength>;
  ... set up the other knem_iov entries as well ...
  knem_iov[4].base = <myfifthaddress>;
  knem_iov[4].len = <myfifthlength>;
  ...
  icopy.local_iovec_array = (uintptr_t) &knem_iov[0];
  icopy.local_iovec_nr = 5;
  icopy.remote_cookie = <myremotecookie>;
  icopy.remote_offset = 0;
  icopy.write = 0; /* read from the remote region into our local segments */
  icopy.flags = 0;
  err = ioctl(knem_fd, KNEM_CMD_INLINE_COPY, &icopy);

If the ioctl succeeds (returns 0), the copy was properly initialized. An error during the copy itself (as opposed to during its initialization) is reported in the request status. In this example, the request is processed synchronously, so the status is available immediately in icopy.current_status.

  if (icopy.current_status != KNEM_STATUS_SUCCESS)
    printf("request failed\n");

It is also possible to copy between two declared regions: declare a local region, then ask the driver to copy from the remote region into it.

  struct knem_cmd_copy copy;
  struct knem_cmd_create_region create;
  struct knem_cmd_param_iovec knem_iov[3];
  ...
  knem_iov[0].base = <myaddress>;
  knem_iov[0].len = <mylength>;
  ... set up the other knem_iov entries as well ...
  knem_iov[2].base = <mythirdaddress>;
  knem_iov[2].len = <mythirdlength>;
  ...
  create.iovec_array = (uintptr_t) &knem_iov[0];
  create.iovec_nr = 3;
  create.flags = KNEM_FLAG_SINGLEUSE; /* automatically destroy after first use */
  create.protection = PROT_WRITE; /* only writers */
  err = ioctl(knem_fd, KNEM_CMD_CREATE_REGION, &create);
  ...
  copy.src_cookie = <myremotecookie>; /* read from the other process */
  copy.src_offset = 0;
  copy.dst_cookie = <create.cookie>; /* write in our local region */
  copy.dst_offset = 0;
  copy.flags = 0;
  err = ioctl(knem_fd, KNEM_CMD_COPY, &copy);

Reusing memory regions multiple times

The above code tells the driver to destroy the memory region after its first use because of KNEM_FLAG_SINGLEUSE. It is possible to keep the region available after use by removing this flag at region creation (just set create.flags to 0 before the ioctl). The region will then be accessible multiple times by any KNEM process. It will only be destroyed when the owner process exits or when it explicitly destroys it:

  err = ioctl(knem_fd, KNEM_CMD_DESTROY_REGION, &<mycookie>);

Asynchronous requests

By default, KNEM processes requests synchronously, which means that current_status is set by the time the ioctl returns. It is also possible to perform asynchronous data transfers through a kernel thread by setting the corresponding flag in the copy ioctl:

  icopy.flags = KNEM_FLAG_MEMCPYTHREAD;

Such an asynchronous request shows KNEM_STATUS_PENDING in current_status, meaning that further polling is required to find out when the request actually completes in the background. To make this possible, the ioctl must specify where the asynchronous status should be written:

  icopy.async_status_index = <myindex>;

It is an index within an array of status slots that should be allocated at initialization by mapping the device file. This array may be mapped only once per file descriptor, but its size may be freely chosen by the application (depending on how many simultaneous pending requests it may need).

  static volatile knem_status_t *knem_status;
  #define KNEM_STATUS_NR 4096
  ...
  knem_status = mmap(NULL, KNEM_STATUS_NR, PROT_READ|PROT_WRITE, MAP_SHARED, knem_fd, KNEM_STATUS_ARRAY_FILE_OFFSET);
  ...
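Because each pending request needs its own slot, the application has to keep track of which indexes are currently in use. One possible way to do this is a small stack of free indexes, sketched below. The names slot_pool, slot_pool_init, slot_pool_get and slot_pool_put are hypothetical and not part of KNEM, and SLOT_NR is assumed to match the number of slots that were mapped.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_NR 4096 /* assumed to match the size of the mapped status array */

/* Hypothetical pool of status slot indexes (not part of KNEM). */
struct slot_pool {
    unsigned int free_idx[SLOT_NR]; /* stack of currently unused indexes */
    size_t nfree;                   /* number of entries on the stack */
};

/* Mark every index as free; index 0 ends up on top of the stack. */
static void slot_pool_init(struct slot_pool *pool)
{
    assert(pool != NULL);
    for (size_t i = 0; i < SLOT_NR; i++)
        pool->free_idx[i] = (unsigned int) (SLOT_NR - 1 - i);
    pool->nfree = SLOT_NR;
}

/* Return a free index, or -1 when every slot is already in use. */
static long slot_pool_get(struct slot_pool *pool)
{
    if (pool->nfree == 0)
        return -1;
    return (long) pool->free_idx[--pool->nfree];
}

/* Give an index back once its request has completed. */
static void slot_pool_put(struct slot_pool *pool, unsigned int idx)
{
    assert(idx < SLOT_NR && pool->nfree < SLOT_NR);
    pool->free_idx[pool->nfree++] = idx;
}
```

The value returned by slot_pool_get would be stored in async_status_index before submitting the request, and returned with slot_pool_put after the final status has been read. The sketch is not thread-safe; a multithreaded application would need its own locking.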

When a request is submitted with async_status_index = N, the application detects its completion by polling slot N of the knem_status array. The driver takes care of automatically freeing the corresponding sender and receiver resources. Note that you should always check current_status first, since some requests may be processed synchronously if the driver lacks the required features.

  if (icopy.current_status != KNEM_STATUS_PENDING) {
    /* completed synchronously */
    if (icopy.current_status != KNEM_STATUS_SUCCESS)
      printf("request failed\n");
  } else {
    /* processed asynchronously, waiting for completion */
    while (knem_status[<myindex>] == KNEM_STATUS_PENDING);
    /* completed asynchronously */
    if (knem_status[<myindex>] != KNEM_STATUS_SUCCESS)
      printf("request failed\n");
  }
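The loop above spins forever if the request never completes. A bounded variant can be sketched as follows. The helper wait_status is hypothetical; it takes the pending value as a parameter (KNEM_STATUS_PENDING in a real application) so that it compiles without knem_io.h, and it assumes that a status slot is one byte wide, which should be checked against the definition of knem_status_t.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of KNEM): poll one status slot until it
 * no longer holds `pending`, or until `max_polls` checks have been made.
 * Assumption: a slot is one byte wide; check knem_status_t in knem_io.h.
 * Returns 0 and stores the final status in *result, or -1 on timeout
 * (in which case *result is left untouched). */
static int wait_status(const volatile uint8_t *slot, uint8_t pending,
                       unsigned long max_polls, uint8_t *result)
{
    assert(slot != NULL && result != NULL);
    for (unsigned long i = 0; i < max_polls; i++) {
        uint8_t s = *slot; /* volatile read: the driver updates the slot */
        if (s != pending) {
            *result = s;
            return 0;
        }
        sched_yield(); /* let other runnable threads use the core */
    }
    return -1;
}
```

A timeout only stops the polling; it does not cancel the request, so the slot and the buffers involved should still be treated as in use.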

Again, initialization problems (detected synchronously) are reported in the ioctl return value, while errors in the copy itself are reported in the request status: either in current_status, or later in the asynchronous status slot.

I/OAT copy offload through DMA Engine

One useful asynchronous feature is I/OAT copy offload, which is requested with the following flag:

  icopy.flags = KNEM_FLAG_DMA;

If the DMA engine is not supported by the kernel or the hardware, setting this flag will cause the ioctl to fail. To find out whether DMA is supported, query the driver information and check its feature flags:

  struct knem_cmd_info info;
  ...
  err = ioctl(knem_fd, KNEM_CMD_GET_INFO, &info);
  if (info.features & KNEM_FEATURE_DMA)
    printf("DMA engine is supported\n");

To overlap data transfer with computation, combine I/OAT offload with asynchronous completion. The polling loop shown earlier will then not exit immediately, and the application may perform useful work before the request is actually done.

  icopy.flags = KNEM_FLAG_DMA | KNEM_FLAG_ASYNCDMACOMPLETE;

If DMA does not seem to work (for instance if KNEM_FEATURE_DMA is missing in info.features), you may want to check the DMA engine status reported by the driver:

  $ cat /dev/knem
  [...]
   DMAEngine: KernelSupported Enabled ChansAvail ChunkMin=1024B

The above line means that the DMA engine is supported by the kernel, enabled in KNEM, and that some DMA channels are available. NoKernelSupport would mean that DMA engine support is missing in the kernel. NoChannelAvailable would mean that the DMA engine is supported by the kernel, but either no hardware DMA engine is available or no driver was loaded to use it. On Intel machines, loading the ioatdma kernel module will usually help.
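The same check can be automated by scanning the DMAEngine line for the expected words. The sketch below relies only on the sample line shown above, so the token names are assumptions that may not hold for every KNEM version. The helpers has_token and dma_engine_usable are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `word` appears in `line` as a whole
 * whitespace-delimited token (so "KernelSupported" does not
 * match inside a longer word). */
static bool has_token(const char *line, const char *word)
{
    size_t wlen = strlen(word);
    const char *p = line;

    assert(wlen > 0);
    while ((p = strstr(p, word)) != NULL) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[wlen];
        bool stops = end == '\0' || end == ' ' || end == '\t' || end == '\n';
        if (starts && stops)
            return true;
        p += 1; /* keep searching after this partial match */
    }
    return false;
}

/* Hypothetical check, based only on the sample output shown above:
 * the line must report kernel support, KNEM enablement and channels. */
static bool dma_engine_usable(const char *line)
{
    return has_token(line, "DMAEngine:")
        && has_token(line, "KernelSupported")
        && has_token(line, "Enabled")
        && has_token(line, "ChansAvail");
}
```

An application would read /dev/knem line by line (for instance with fgets) and pass each line to dma_engine_usable. Checking KNEM_FEATURE_DMA through KNEM_CMD_GET_INFO remains the documented way to test for DMA support; parsing the text output is mainly useful for diagnostics.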