[RFC,0/1] RFC: Allow busy poll to be set per epoll instance

Message ID 20240120004247.42036-1-jdamato@fastly.com

Joe Damato Jan. 20, 2024, 12:42 a.m. UTC
Greetings:

TL;DR: This RFC builds on commit bf3b9f6372c4 ("epoll: Add busy poll
support to epoll with socket fds.") by adding two fcntl knobs that enable
epoll-based busy poll per epoll instance, instead of only through the
current system-wide sysctl. This makes epoll-based busy poll usable on
systems where only some applications should busy poll.

I have another implementation which uses epoll_ctl and adds a new
EPOLL_CTL_BUSY_POLL_TIMEOUT knob instead of using fcntl, but fcntl
seemed to be slightly cleaner.

I am happy to use whatever interface is desired by the kernel community
in order to allow for per-epoll instance busy poll to be supported.

Longer explanation:

Presently epoll has support for a very useful form of busy poll based on
the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).

This form of busy poll lets epoll_wait drive NAPI packet processing, so
a user application can decide when it is appropriate to process network
data instead of being preempted at less opportune times.

For example, a network application might process an entire datagram and
get better use of L2/L3 cache by deferring packet processing until all
events are processed and epoll_wait is called.

The documentation available on this is, IMHO, a bit confusing, so please
allow me to explain how to use this kernel feature.

In order to use this feature, user applications must do three things:

1. Ensure each application thread has its own epoll instance mapping
1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
direct connections with specific destination ports to these queues.

2. Ensure that all incoming connections added to an epoll instance
have the same NAPI ID. This can be done with a BPF filter when
SO_REUSEPORT is used, or with getsockopt + SO_INCOMING_NAPI_ID when a
single accept thread dispatches incoming connections to worker threads.

3. Lastly, busy poll must be enabled via a sysctl
(/proc/sys/net/core/busy_poll).
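To make steps 2 and 3 more concrete, here is a minimal sketch (not part
of the patch). The helper names get_incoming_napi_id and
read_busy_poll_sysctl are hypothetical and exist only for illustration.
The fallback value for SO_INCOMING_NAPI_ID is the asm-generic one and is
an assumption about the build architecture.

```c
#include <stdio.h>
#include <sys/socket.h>

/* Fallback for older libc headers. 56 is the asm-generic value; a few
 * architectures use a different number (assumption: x86 or arm64). */
#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56
#endif

/* Hypothetical helper: fetch the NAPI ID of the RX queue that last
 * delivered a packet for this socket. A value of 0 means no NAPI ID
 * has been recorded yet. Returns 0 on success, -1 on error (errno set). */
static int get_incoming_napi_id(int fd, unsigned int *napi_id)
{
	socklen_t len = sizeof(*napi_id);

	*napi_id = 0;
	return getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, napi_id, &len);
}

/* Hypothetical helper: read the system-wide busy poll timeout in
 * microseconds. Returns -1 if the sysctl cannot be read or parsed. */
static long read_busy_poll_sysctl(void)
{
	long usecs = -1;
	FILE *f = fopen("/proc/sys/net/core/busy_poll", "r");

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &usecs) != 1)
		usecs = -1;
	fclose(f);
	return usecs;
}
```

An accept thread could call get_incoming_napi_id() on each accepted
socket and hand the socket to the worker whose epoll instance serves
that NAPI ID. A read_busy_poll_sysctl() result of 0 would mean step 3
has not been done.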

The unfortunate part about step 3 above is that it enables busy poll
system-wide, which affects all user applications on the system.

It is worth noting that setting /proc/sys/net/core/busy_poll has
different effects on different system calls:

- poll- and select-based applications would not be affected, as busy
  polling is only enabled when this sysctl is set *and* sockets have
  SO_BUSY_POLL set.
- All epoll-based applications on the system, however, will busy poll
  when this sysctl is set.
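For contrast, the per-socket opt-in that poll and select rely on can be
sketched as below. The helper name set_socket_busy_poll is hypothetical.
The fallback value for SO_BUSY_POLL is the asm-generic one and is an
assumption about the build architecture. Depending on the kernel,
raising the value may require CAP_NET_ADMIN, so callers should expect
this to fail for unprivileged processes.

```c
#include <sys/socket.h>

/* Fallback for older libc headers. 46 is the asm-generic value; a few
 * architectures use a different number (assumption: x86 or arm64). */
#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif

/* Hypothetical helper: opt a single socket in to busy polling with the
 * given timeout in microseconds. Returns 0 on success, -1 on error
 * (errno set). */
static int set_socket_busy_poll(int fd, int usecs)
{
	return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
}
```

epoll has no equivalent per-instance opt-in today, which is the gap this
RFC aims to close.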

If the user wants to run one low latency epoll-based server application with
epoll-based busy poll, but would like to run the rest of the applications on
the system (which may also use epoll) without busy poll, this
system-wide sysctl presents a significant problem.

This change preserves the system-wide sysctl, but adds a mechanism (via
fcntl) to enable or disable busy poll for epoll instances as needed.

This change is extremely useful for low latency network applications
that need to run side-by-side with other network applications where
latency is not a major concern.

As mentioned above, the epoll_ctl approach I have (which works) seemed
less clean than the fcntl approach in this RFC. I would be happy to use
whatever interface the kernel maintainers prefer to make epoll based busy
poll more convenient for user applications to use.

Thanks,
Joe

[1]: https://lore.kernel.org/lkml/20170324170836.15226.87178.stgit@localhost.localdomain/

Joe Damato (1):
  eventpoll: support busy poll per epoll instance

 fs/eventpoll.c                   | 71 ++++++++++++++++++++++++++++++--
 fs/fcntl.c                       |  5 +++
 include/linux/eventpoll.h        |  2 +
 include/uapi/linux/fcntl.h       |  6 +++
 tools/include/uapi/linux/fcntl.h |  6 +++
 tools/perf/trace/beauty/fcntl.c  |  3 +-
 6 files changed, 88 insertions(+), 5 deletions(-)