OpenOnload 201104-u2 Release Notes
==================================

These release notes describe major new features, and known limitations and
bugs. Please also see the Onload User Guide.

This release mainly fixes bugs, but also adds a few performance
improvements -- see the ChangeLog for details. The notes below are
unchanged from the openonload-201104 release.

----------------------------------------

Kernel config dependency
------------------------

This release of OpenOnload uses the Linux kernel's netfilter feature, so
requires that the kernel be built with CONFIG_NETFILTER enabled. As far as
we know all standard distribution kernels do have this enabled.

We are aware that some users of Onload build their own custom kernels, and
some have disabled netfilter to avoid any overhead it might add. However,
CONFIG_NETFILTER only adds a very cheap inline test to fast paths, so we are
confident that requiring CONFIG_NETFILTER will not adversely affect
performance.

NB. The same is not true of iptables. Having a number of active iptables
rules can have quite a significant impact on performance.

----------------------------------------

Epoll
-----

This release adds a new implementation of epoll to OpenOnload. There are
now three choices when using epoll:

EF_UL_EPOLL=0 (unaccelerated kernel epoll)

  Accelerated epoll is disabled -- just use the kernel epoll. Sends,
  receives and other calls are still accelerated, but epoll_ctl() and
  epoll_wait() are just normal system calls.

  Downsides: Interrupt avoidance does not work properly in this mode, and
  spinning is not an option. If a socket is handed over to the kernel
  stack after it has been added to an epoll set, it will be dropped from
  the epoll set.

EF_UL_EPOLL=1 (userlevel accelerated epoll, default)

  epoll_ctl() and epoll_wait() are handled mostly at userlevel.
  This option gives the best latency for relatively small numbers of file
  descriptors, and is likely to give best performance if the app invokes
  epoll_ctl() very frequently. Spinning is supported, and interrupts are
  effectively avoided until an application blocks.

  Downsides:

  - The CPU overhead and latency of each epoll_wait() call increases with
    the number of file descriptors in the set.

  - Accelerated sockets are removed from the epoll set when the file
    descriptor is closed, rather than when the last reference to the
    socket goes away.

  - With EPOLLET and EPOLLONESHOT, the Onload implementation can return
    more than one event if mixing accelerated and non-accelerated paths on
    a socket.

  - Deadlock is possible if a process fork()s concurrently with invoking
    epoll calls.

EF_UL_EPOLL=2 (accelerated kernel epoll)

  This version of epoll leverages the kernel support for epoll, so each
  call to epoll_wait() is a system call. This means that the baseline
  latency is slightly worse than EF_UL_EPOLL=1. However, with this
  implementation the CPU overhead and latency of epoll_wait() calls is
  independent of the number of file descriptors in the set. This option
  really shines when handling many accelerated file descriptors. Spinning
  is supported, and interrupts are effectively avoided until the
  application blocks.

  Downsides: Slightly worse baseline latency than EF_UL_EPOLL=1, and
  epoll_ctl() calls are on average more expensive than with EF_UL_EPOLL=1
  (but still cheaper than with EF_UL_EPOLL=0).

----------------------------------------

SO_BINDTODEVICE
---------------

This socket option is now partially supported. In previous releases
applying this option to a socket caused it to be handed over to the kernel
stack, and so not accelerated.

The partial implementation ensures that all sends from a socket go via the
bound interface, and only packets received via the bound interface are
delivered to the socket.
It currently deviates from the correct semantics in that it is not possible
to bind two sockets to the same set of addresses on two different
interfaces, although it should be. We hope to fix this in a future release
-- please let us know if this is important to you.

----------------------------------------

Acceleration of pipes
---------------------

This release adds acceleration of anonymous pipes, that is, pipes created
by invoking the pipe() call. Sends and receives can usually be done
entirely at user-level, and the performance of select(), poll() and
epoll_wait() is improved when pipes are included in the set of file
descriptors being monitored.

By default pipes are not accelerated. To accelerate pipes, set EF_PIPE=1.
Alternatively if you set EF_PIPE=2, pipes will only be accelerated if the
process has already created an Onload stack. This can be useful to ensure
that processes that do no networking whatsoever will not instantiate Onload
stacks and consume resources. See the documentation for EF_PIPE
(onload_stackdump doc) for more details.

Each pipe instance may consume quite a few "endpoint buffers" within
Onload. If you run out you can increase the limit by setting the
EF_MAX_ENDPOINTS option.

----------------------------------------

Onload extension API
--------------------

We have added the beginnings of an extension API to OpenOnload. This will
allow you to customise applications to OpenOnload in order to achieve
further performance improvements. The API is defined in:

  /src/include/onload/extensions.h

Using this extension API does not force you to use Onload all of the time:
You can still run the application over the kernel stack simply by removing
the "onload" prefix or not setting LD_PRELOAD. This is because you link
your application to a stub library that provides a dummy implementation of
the interface when Onload is not present.
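
The stub-library arrangement described above can be illustrated with a
minimal sketch in C. All names here (demo_ext_set_stackname,
demo_app_configure) are hypothetical and are not part of the Onload
extension API. The stub is defined in the same file only to keep the
example self-contained; in practice it would typically live in a separate
shared library, so that a preloaded library exporting the same symbol could
take its place.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for a function exported by a stub library.
 * It is NOT the Onload API.  The dummy implementation ignores its
 * argument and reports success, so the application behaves the same
 * whether or not an accelerating library is present.
 */
int demo_ext_set_stackname(const char *name)
{
    (void)name;   /* the stub has nothing to configure */
    return 0;     /* success: callers need no special case */
}

/*
 * Application code (also hypothetical).  It calls the extension
 * unconditionally; with only the stub linked in, the call is a no-op.
 */
int demo_app_configure(void)
{
    int rc = demo_ext_set_stackname("demo-stack");
    if (rc != 0) {
        fprintf(stderr, "demo_ext_set_stackname failed: %d\n", rc);
        return -1;
    }
    return 0;
}
```

The point of the design is that the application contains no conditional
code: the same binary runs with or without the accelerating library, and
only the implementation behind the symbol changes.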

----------------------------------------

Controlling sockets and stacks
------------------------------

The first extension is onload_set_stackname(), which gives you full control
over the mapping between sockets and stacks.

----------------------------------------

Jumbo frames
------------

When using jumbo frames with TCP, Onload limits the maximum segment size to
a little under 2K. It does this because the codebase does not yet support
TCP segments that span more than one of Onload's fixed-size packet buffers.
In practice this should not present any problems unless the other end of
the connection does not support negotiation of the MSS.


David Riddoch
2011/05/06