High-Performance Intra-Node MPI Communication
See the
Download page for details about all releases.
News:
Use the KNEM git repository if building for recent kernels.
No new releases are published anymore, but fixes for the newest kernels are still applied to the repository.
(2024)
News:
Mellanox OFED
distribution includes KNEM starting with release MLNX_OFED-1.5.3-3.0.0.
(2011/12/08)
News:
Open MPI 1.5 released,
enables KNEM support by default.
(2010/10/10)
News:
NetPIPE 3.7.2
released with KNEM support.
(2010/08/20)
News:
MVAPICH2 1.5 released, includes MPICH2 1.2.1p1, which contains KNEM support.
(2010/07/12)
News:
MPICH2 1.1.1 released, includes KNEM support.
(2009/07/21)
Summary
KNEM is a Linux kernel module enabling high-performance intra-node
MPI communication for large messages.
KNEM works on all Linux kernels since 2.6.15 and supports
asynchronous and vectorial data transfers as well as offloading
memory copies onto Intel I/OAT hardware.
MPICH2
(since release 1.1.1) uses KNEM in its DMA LMT (Large Message Transfer) module
to improve large-message performance within a single node.
Open MPI has also included KNEM support
in its SM BTL component since release 1.5.
Additionally, NetPIPE
includes a KNEM backend since version 3.7.2.
Discover how to use them here.
The general documentation covers installing, running and using KNEM,
while the interface documentation describes the programming
interface and how to port an application or MPI implementation to KNEM.
To get the latest KNEM news, report issues, or ask questions, subscribe to the
knem mailing list.
See also the news archive.
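For a first impression of that programming interface, here is a minimal sketch of the cookie-based flow described in the interface documentation: the sender declares a memory region and obtains a cookie, then the receiver pulls the data out of that region through a single ioctl. For brevity both sides run in the same process here; in a real MPI implementation the cookie would be passed to the peer process, for instance over the existing shared-memory channel. The constant and field names (KNEM_CMD_CREATE_REGION, struct knem_cmd_inline_copy, and so on) follow the knem_io.h header, but check them against the interface documentation of your installed KNEM version before relying on this sketch.

    /* Minimal sketch of the KNEM cookie-based copy flow, both sides in one
     * process for brevity.  Names such as KNEM_CMD_CREATE_REGION are taken
     * from the knem_io.h header as documented; verify them against your
     * installed KNEM version. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <knem_io.h>

    #define LEN 4096

    int main(void)
    {
      char sendbuf[LEN], recvbuf[LEN];
      memset(sendbuf, 0x42, LEN);

      /* every process using KNEM opens the pseudo-device once */
      int fd = open("/dev/knem", O_RDWR);
      if (fd < 0) { perror("open /dev/knem"); return 1; }

      /* sender side: declare the send buffer as a region and get a cookie back */
      struct knem_cmd_param_iovec siov = { .base = (uintptr_t) sendbuf, .len = LEN };
      struct knem_cmd_create_region create;
      memset(&create, 0, sizeof(create));
      create.iovec_array = (uintptr_t) &siov;
      create.iovec_nr = 1;
      create.protection = PROT_READ;        /* the peer will only read from this region */
      create.flags = KNEM_FLAG_SINGLEUSE;   /* region is destroyed after its first use */
      if (ioctl(fd, KNEM_CMD_CREATE_REGION, &create) < 0) { perror("create region"); return 1; }
      /* create.cookie would normally be sent to the peer process here */

      /* receiver side: fetch the data from the region in a single kernel copy */
      struct knem_cmd_param_iovec riov = { .base = (uintptr_t) recvbuf, .len = LEN };
      struct knem_cmd_inline_copy icopy;
      memset(&icopy, 0, sizeof(icopy));
      icopy.local_iovec_array = (uintptr_t) &riov;
      icopy.local_iovec_nr = 1;
      icopy.remote_cookie = create.cookie;
      icopy.remote_offset = 0;
      icopy.write = 0;                      /* 0 = read from the remote region */
      icopy.flags = 0;                      /* synchronous, no I/OAT offload */
      if (ioctl(fd, KNEM_CMD_INLINE_COPY, &icopy) < 0) { perror("inline copy"); return 1; }
      if (icopy.current_status != KNEM_STATUS_SUCCESS) { fprintf(stderr, "copy failed\n"); return 1; }

      printf("copied %d bytes: %s\n", LEN, memcmp(sendbuf, recvbuf, LEN) ? "MISMATCH" : "ok");
      close(fd);
      return 0;
    }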
Why?
MPI implementations usually offer a user-space, double-copy-based intra-node communication
strategy. It is very good for small-message latency, but it wastes many CPU cycles, pollutes
the caches, and saturates memory buses.
KNEM transfers data from one process to another through a single copy within the Linux kernel.
The system call overhead (about 100ns these days) isn't good for small-message latency, but
having a single memory copy is very good for large messages (usually starting from dozens
of kilobytes).
Some vendor-specific MPI stacks (such as Myricom MX, QLogic PSM, ...) offer similar abilities,
but they may only run on specific hardware interconnects, while KNEM is generic (and open-source).
Also, none of these competitors offers asynchronous completion models, I/OAT copy offload,
or vectorial memory buffer support as KNEM does.
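To make the motivation concrete, here is a purely illustrative C program (not taken from any existing MPI implementation) that spells out the conventional double-copy scheme: the payload goes through an intermediate shared-memory bounce buffer, so every byte is copied twice and both the sender and the receiver spend CPU cycles in memcpy().

    /* Purely illustrative double-copy scheme (not the code of any particular
     * MPI implementation): the payload is copied by the sender into a shared
     * bounce buffer, then copied again by the receiver, so every byte crosses
     * the memory hierarchy twice and both processes burn cycles in memcpy(). */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define MSG_LEN 4096

    struct bounce {
      volatile int full;    /* naive flag; real code uses proper synchronization */
      char data[MSG_LEN];   /* bounce buffer mapped by both processes */
    };

    int main(void)
    {
      /* anonymous shared mapping stands in for the MPI shared-memory segment */
      struct bounce *shm = mmap(NULL, sizeof(*shm), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      if (shm == MAP_FAILED) { perror("mmap"); return 1; }
      shm->full = 0;

      if (fork() == 0) {
        /* receiver process: second copy, shared segment -> destination buffer */
        char recvbuf[MSG_LEN];
        while (!shm->full)
          ;                                  /* busy-wait for the sender */
        memcpy(recvbuf, shm->data, MSG_LEN); /* copy #2 */
        shm->full = 0;
        printf("receiver got '%c'...\n", recvbuf[0]);
        return 0;
      }

      /* sender process: first copy, source buffer -> shared segment */
      char sendbuf[MSG_LEN];
      memset(sendbuf, 'A', MSG_LEN);
      memcpy(shm->data, sendbuf, MSG_LEN);   /* copy #1 */
      shm->full = 1;

      wait(NULL);
      return 0;
    }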
Download
KNEM is freely available under the terms of the BSD license (user-space tools)
and of the GPL license (Linux kernel driver).
Source code access and all tarballs are available from the Download page.
Papers
-
Brice Goglin and Stéphanie Moreaud.
KNEM: a Generic and Scalable Kernel-Assisted Intra-node MPI Communication Framework.
In Journal of Parallel and Distributed Computing (JPDC).
73(2):176-188, February 2013.
Elsevier.
Available here.
This paper describes the design of KNEM and summarizes how it was successfully integrated
into MPICH and Open MPI to improve point-to-point and collective operations.
If you are looking for a general-purpose KNEM citation, please use this one.
-
Teng Ma, George Bosilca, Aurelien Bouteiller, Brice Goglin, Jeffrey M. Squyres, and Jack J. Dongarra.
Kernel Assisted Collective Intra-node Communication Among Multi-core and Many-core CPUs.
In Proceedings of the 40th International Conference on Parallel Processing (ICPP-2011),
Taipei, Taiwan, September 2011.
Available here.
This article describes the implementation of native collective operations in Open MPI
on top of the KNEM RMA interface and the knowledge of the machine topology,
leading to dramatic performance improvement on various multicore and manycore servers.
-
Stéphanie Moreaud, Brice Goglin, Dave Goodell, and Raymond Namyst.
Optimizing MPI Communication within large Multicore nodes with Kernel assistance.
In CAC 2010: The 10th Workshop on Communication Architecture for Clusters, held in conjunction with IPDPS 2010.
Atlanta, GA, April 2010.
IEEE Computer Society Press.
Available here.
This paper discusses the use of kernel assistance and memory copy offload for various
point-to-point and collective operations on a wide variety of modern shared-memory
multicore machines up to 96 cores.
-
Darius Buntinas, Brice Goglin, Dave Goodell, Guillaume Mercier, and Stéphanie Moreaud.
Cache-Efficient, Intranode Large-Message MPI Communication with MPICH2-Nemesis.
In Proceedings of the 38th International Conference on Parallel Processing (ICPP-2009),
Vienna, Austria, September 2009.
IEEE Computer Society Press.
Available here.
This paper describes the initial design and performance of the KNEM implementation
when used within MPICH2/Nemesis and compares it to a vmsplice-based implementation
as well as the usual double-buffering strategy.
-
Stéphanie Moreaud.
Adaptation des communications MPI intra-noeud aux architectures multicoeurs modernes (Adapting intra-node MPI communications to modern multicore architectures).
In 19ème Rencontres Francophones du Parallélisme (RenPar'19),
Toulouse, France, September 2009.
Available here.
This French paper presents KNEM and its use in MPICH2/Nemesis before looking in depth
at its performance for point-to-point and collective MPI operations.
-
Brice Goglin.
High Throughput Intra-Node MPI Communication with Open-MX.
In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2009),
Weimar, Germany, February 2009.
IEEE Computer Society Press.
Available here.
The Open-MX intra-node communication subsystem achieves very high throughput
thanks to overlapped memory pinning and I/OAT copy offload.
This paper led to the development of KNEM to provide generic MPI implementations
with similar performance without requiring Open-MX.
There are several papers from people using KNEM:
-
Teng Ma, George Bosilca, Aurélien Bouteiller, and Jack Dongarra.
HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters.
In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS '12).
Best paper award.
Shanghai, China, May 2012.
IEEE Computer Society Press.
This article presents a framework that orchestrates multi-layer hierarchical collective algorithms
with inter-node and kernel-assisted intra-node communication.
-
Teng Ma, Thomas Herault, George Bosilca, and Jack Dongarra.
Process Distance-aware Adaptive MPI Collective Communications.
In Proceedings of the International Conference on Cluster Computing.
Austin, TX, September 2011.
IEEE Computer Society Press.
This article presents the distance- and topology-aware implementation of some collective
operations over KNEM in Open MPI.
-
Teng Ma, Aurélien Bouteiller, George Bosilca, and Jack Dongarra.
Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW.
In Proceedings of the 18th EuroMPI conference.
Santorini, Greece, September 2011.
LNCS, Springer.
This article shows how Open MPI KNEM-based collective operations improve CPMD and FFTW performance.
-
Teng Ma, George Bosilca, Aurélien Bouteiller, and Jack Dongarra.
Locality and Topology aware Intra-node Communication Among Multicore CPUs.
In Proceedings of the 17th EuroMPI conference.
Stuttgart, Germany, September 2010.
LNCS, Springer.
This article describes a framework for tuning shared memory communications
in Open MPI according to locality and topology.
-
Ping Lai, Sayantan Sur, and Dhabaleswar Panda.
Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems.
In the International Supercomputing Conference (ISC'10).
Hamburg, Germany, May-June 2010.
Springer.
This paper compares the one-sided performance of a dedicated custom implementation in MVAPICH2
with that of MPICH2 and Open MPI using their generic KNEM support.
Last updated on 2024/02/06.