[net-next,0/3] Per epoll context busy poll support

Message ID 20240124025359.11419-1-jdamato@fastly.com
Series: Per epoll context busy poll support

Message

Joe Damato Jan. 24, 2024, 2:53 a.m. UTC
  Greetings:

TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
epoll with socket fds.") by allowing user applications to enable
epoll-based busy polling and set a busy poll packet budget on a per epoll
context basis.

To allow for this, two ioctls have been added for epoll contexts for
getting and setting a new struct, struct epoll_params.

This makes epoll-based busy polling much more usable for user
applications than the current system-wide sysctl and hardcoded budget.
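
As a rough sketch of the intended usage from user space (the EPIOCSPARAMS
name and the epoll_params fields shown here are taken from this series and
could change during review; treat this as illustration, not the final API):

    /* Minimal sketch: assumes the patched uapi header from this series
     * provides struct epoll_params and the EPIOCSPARAMS ioctl.
     */
    #include <sys/ioctl.h>
    #include <linux/eventpoll.h>

    static int enable_epoll_busy_poll(int epfd)
    {
            struct epoll_params params = {
                    .busy_poll_usecs  = 64, /* busy poll up to 64 usec per epoll_wait */
                    .busy_poll_budget = 8,  /* process at most 8 packets per poll */
            };

            /* 0 on success, -1 with errno set on failure */
            return ioctl(epfd, EPIOCSPARAMS, &params);
    }

The matching "get" ioctl reads the current parameters back the same way.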

Longer explanation:

Presently epoll has support for a very useful form of busy poll based on
the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).

This form of busy poll allows epoll_wait to drive NAPI packet processing
which allows for a few interesting user application designs which can
reduce latency and also potentially improve L2/L3 cache hit rates by
deferring NAPI until userland has finished its work.

The documentation available on this is, IMHO, a bit confusing so please
allow me to explain how one might use this:

1. Ensure each application thread has its own epoll instance mapping
1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
direct connections with specific dest ports to these queues.

2. Optionally: Set up IRQ coalescing for the NIC RX queues where busy
polling will occur. This can help prevent the userland app from being
preempted by a hard IRQ while userland is running. Note this means that
userland must take care to call epoll_wait and not take too long in
userland, since it now drives NAPI via epoll_wait.

3. Ensure that all incoming connections added to an epoll instance
have the same NAPI ID. This can be done with a BPF filter when
SO_REUSEPORT is used, or with getsockopt + SO_INCOMING_NAPI_ID when a single
accept thread is used which dispatches incoming connections to threads (a
sketch of this approach follows this list).

4. Lastly, busy poll must be enabled via a sysctl
(/proc/sys/net/core/busy_poll).
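
To illustrate step 3, here is a minimal sketch of the single accept thread
approach (dispatch_to_worker() is a placeholder for application logic that
hands the fd to the thread/epoll instance owning that NAPI ID):

    #include <sys/socket.h>
    #include <unistd.h>

    void dispatch_to_worker(unsigned int napi_id, int fd); /* app-defined */

    void accept_loop(int listen_fd)
    {
            for (;;) {
                    unsigned int napi_id = 0;
                    socklen_t len = sizeof(napi_id);
                    int fd = accept(listen_fd, NULL, NULL);

                    if (fd < 0)
                            continue;

                    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                                   &napi_id, &len) || napi_id == 0) {
                            /* no NAPI ID yet (e.g. loopback); handle separately */
                            close(fd);
                            continue;
                    }

                    dispatch_to_worker(napi_id, fd);
            }
    }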

The unfortunate part about step 4 above is that this enables busy poll
system-wide which affects all user applications on the system,
including epoll-based network applications which were not intended to
be used this way or applications where increased CPU usage for lower
latency network processing is unnecessary or not desirable.

If the user wants to run one low latency epoll-based server application
with epoll-based busy poll, but would like to run the rest of the
applications on the system (which may also use epoll) without busy poll,
this system-wide sysctl presents a significant problem.

This change preserves the system-wide sysctl, but adds a mechanism (via
ioctl) to enable or disable busy poll for epoll contexts as needed by
individual applications, making epoll-based busy poll more usable.

Thanks,
Joe

[1]: https://lore.kernel.org/lkml/20170324170836.15226.87178.stgit@localhost.localdomain/

Joe Damato (3):
  eventpoll: support busy poll per epoll instance
  eventpoll: Add per-epoll busy poll packet budget
  eventpoll: Add epoll ioctl for epoll_params

 .../userspace-api/ioctl/ioctl-number.rst      |  1 +
 fs/eventpoll.c                                | 99 ++++++++++++++++++-
 include/uapi/linux/eventpoll.h                | 12 +++
 3 files changed, 107 insertions(+), 5 deletions(-)
  

Comments

Eric Dumazet Jan. 24, 2024, 8:20 a.m. UTC | #1
On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@fastly.com> wrote:
>
> Greetings:
>
> TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> epoll with socket fds.") by allowing user applications to enable
> epoll-based busy polling and set a busy poll packet budget on a per epoll
> context basis.
>
> To allow for this, two ioctls have been added for epoll contexts for
> getting and setting a new struct, struct epoll_params.
>
> This makes epoll-based busy polling much more usable for user
> applications than the current system-wide sysctl and hardcoded budget.
>
> Longer explanation:
>
> Presently epoll has support for a very useful form of busy poll based on
> the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
>
> This form of busy poll allows epoll_wait to drive NAPI packet processing
> which allows for a few interesting user application designs which can
> reduce latency and also potentially improve L2/L3 cache hit rates by
> deferring NAPI until userland has finished its work.
>
> The documentation available on this is, IMHO, a bit confusing so please
> allow me to explain how one might use this:
>
> 1. Ensure each application thread has its own epoll instance mapping
> 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> direct connections with specific dest ports to these queues.
>
> 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
> polling will occur. This can help avoid the userland app from being
> pre-empted by a hard IRQ while userland is running. Note this means that
> userland must take care to call epoll_wait and not take too long in
> userland since it now drives NAPI via epoll_wait.
>
> 3. Ensure that all incoming connections added to an epoll instance
> have the same NAPI ID. This can be done with a BPF filter when
> SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> accept thread is used which dispatches incoming connections to threads.
>
> 4. Lastly, busy poll must be enabled via a sysctl
> (/proc/sys/net/core/busy_poll).
>
> The unfortunate part about step 4 above is that this enables busy poll
> system-wide which affects all user applications on the system,
> including epoll-based network applications which were not intended to
> be used this way or applications where increased CPU usage for lower
> latency network processing is unnecessary or not desirable.
>
> If the user wants to run one low latency epoll-based server application
> with epoll-based busy poll, but would like to run the rest of the
> applications on the system (which may also use epoll) without busy poll,
> this system-wide sysctl presents a significant problem.
>
> This change preserves the system-wide sysctl, but adds a mechanism (via
> ioctl) to enable or disable busy poll for epoll contexts as needed by
> individual applications, making epoll-based busy poll more usable.
>

I think this description missed the napi_defer_hard_irqs and
gro_flush_timeout settings ?

I would think that if an application really wants to make sure its
thread is the only one
eventually calling napi->poll(), we must make sure NIC interrupts stay masked.

Current implementations of busy poll always release NAPI_STATE_SCHED bit when
returning to user space.

It seems you want to make sure the application and only the
application calls the napi->poll()
at chosen times.

Some kind of contract is needed, and the presence of the hrtimer
(currently only driven from dev->gro_flush_timeout)
would allow to do that correctly.

Whenever we 'trust' user space to perform the napi->poll shortly, we
also want to arm the hrtimer to eventually detect
the application took too long, to restart the other mechanisms (NIC irq based)

Note that we added the kthread based napi polling, and we are working
to add a busy polling feature to these kthreads, allowing us to
completely mask NIC interrupts and further reduce latencies.

Thank you

> Thanks,
> Joe
>
> [1]: https://lore.kernel.org/lkml/20170324170836.15226.87178.stgit@localhost.localdomain/
>
> Joe Damato (3):
>   eventpoll: support busy poll per epoll instance
>   eventpoll: Add per-epoll busy poll packet budget
>   eventpoll: Add epoll ioctl for epoll_params
>
>  .../userspace-api/ioctl/ioctl-number.rst      |  1 +
>  fs/eventpoll.c                                | 99 ++++++++++++++++++-
>  include/uapi/linux/eventpoll.h                | 12 +++
>  3 files changed, 107 insertions(+), 5 deletions(-)
>
> --
> 2.25.1
>
  
Joe Damato Jan. 24, 2024, 2:20 p.m. UTC | #2
On Wed, Jan 24, 2024 at 09:20:09AM +0100, Eric Dumazet wrote:
> On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@fastly.com> wrote:
> >
> > Greetings:
> >
> > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> > epoll with socket fds.") by allowing user applications to enable
> > epoll-based busy polling and set a busy poll packet budget on a per epoll
> > context basis.
> >
> > To allow for this, two ioctls have been added for epoll contexts for
> > getting and setting a new struct, struct epoll_params.
> >
> > This makes epoll-based busy polling much more usable for user
> > applications than the current system-wide sysctl and hardcoded budget.
> >
> > Longer explanation:
> >
> > Presently epoll has support for a very useful form of busy poll based on
> > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
> >
> > This form of busy poll allows epoll_wait to drive NAPI packet processing
> > which allows for a few interesting user application designs which can
> > reduce latency and also potentially improve L2/L3 cache hit rates by
> > deferring NAPI until userland has finished its work.
> >
> > The documentation available on this is, IMHO, a bit confusing so please
> > allow me to explain how one might use this:
> >
> > 1. Ensure each application thread has its own epoll instance mapping
> > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> > direct connections with specific dest ports to these queues.
> >
> > 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
> > polling will occur. This can help avoid the userland app from being
> > pre-empted by a hard IRQ while userland is running. Note this means that
> > userland must take care to call epoll_wait and not take too long in
> > userland since it now drives NAPI via epoll_wait.
> >
> > 3. Ensure that all incoming connections added to an epoll instance
> > have the same NAPI ID. This can be done with a BPF filter when
> > SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> > accept thread is used which dispatches incoming connections to threads.
> >
> > 4. Lastly, busy poll must be enabled via a sysctl
> > (/proc/sys/net/core/busy_poll).
> >
> > The unfortunate part about step 4 above is that this enables busy poll
> > system-wide which affects all user applications on the system,
> > including epoll-based network applications which were not intended to
> > be used this way or applications where increased CPU usage for lower
> > latency network processing is unnecessary or not desirable.
> >
> > If the user wants to run one low latency epoll-based server application
> > with epoll-based busy poll, but would like to run the rest of the
> > applications on the system (which may also use epoll) without busy poll,
> > this system-wide sysctl presents a significant problem.
> >
> > This change preserves the system-wide sysctl, but adds a mechanism (via
> > ioctl) to enable or disable busy poll for epoll contexts as needed by
> > individual applications, making epoll-based busy poll more usable.
> >
> 
> I think this description missed the napi_defer_hard_irqs and
> gro_flush_timeout settings ?

I'm not sure if those settings are strictly related to the change I am
proposing which makes epoll-based busy poll something that can be
enabled/disabled on a per-epoll context basis and allows the budget to be
set as well, but maybe I am missing something? Sorry for my
misunderstanding if so.

IMHO: a single system-wide busy poll setting is difficult to use
properly and it is unfortunate that the packet budget is hardcoded. It would
be extremely useful to be able to set both of these on a per-epoll basis
and I think my suggested change helps to solve this.

Please let me know.

Re the two settings you noted:

I didn't mention those in the interest of brevity, but yes they can be used
instead of or in addition to what I've described above.

While those settings are very useful, IMHO, they have their own issues
because they are system-wide as well. If they were settable per-NAPI, that
would make it much easier to use them because they could be enabled for the
NAPIs which are being busy-polled by applications that support busy-poll.

Imagine you have 3 types of apps running side-by-side:
  - A low latency epoll-based busy poll app,
  - An app where latency doesn't matter as much, and
  - A latency sensitive legacy app which does not yet support epoll-based
    busy poll.

In the first two cases, the settings you mention would be helpful or not
make any difference, but in the third case the system-wide impact might be
undesirable because having IRQs fire might be important to keep latency
down.

If your comment was more that my cover letter should have mentioned these,
I can include that in a future cover letter or suggest some kernel
documentation which will discuss all of these features and how they relate
to each other.

> 
> I would think that if an application really wants to make sure its
> thread is the only one
> eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
> 
> Current implementations of busy poll always release NAPI_STATE_SCHED bit when
> returning to user space.
>
> It seems you want to make sure the application and only the
> application calls the napi->poll()
> at chosen times.
> 
> Some kind of contract is needed, and the presence of the hrtimer
> (currently only driven from dev->gro_flush_timeout)
> would allow to do that correctly.
> 
> Whenever we 'trust' user space to perform the napi->poll shortly, we
> also want to arm the hrtimer to eventually detect
> the application took too long, to restart the other mechanisms (NIC irq based)

There is another change [1] I've been looking at from a research paper [2]
which does something similar to what you've described above -- it keeps
IRQs suppressed during busy polling. The paper suggests a performance
improvement is measured when using a mechanism like this to keep IRQs off.
Please see the paper for more details.

I haven't had a chance to reach out to the authors or to tweak this patch
to attempt an RFC / submission for it, but it seems fairly promising in my
initial synthetic tests.

When I tested their patch, as you might expect, no IRQs were generated at
all for the NAPIs that were being busy polled, but the rest of the
NAPIs and queues were generating IRQs as expected.

Regardless of the above patch: I think my proposed change is helpful and
the IRQ suppression bit can be handled in a separate change in the future.
What do you think?

> Note that we added the kthread based napi polling, and we are working
> to add a busy polling feature to these kthreads.
> allowing to completely mask NIC interrupts and further reduce latencies.

I am aware of kthread based NAPI polling, yes, but I was not aware that
busy polling was being considered as a feature for them, thanks for the
heads-up.

> Thank you

Thanks for your comments - I appreciate your time and attention.

Could you let me know if your comments are meant as a NACK or similar?

I am unsure if you were suggesting that per-epoll context based busy
polling is unneeded/unnecessary from your perspective - or if it was more
of a hint that I should be including more context somewhere in the kernel
documentation as part of this change :)

Again, IMHO, allowing epoll based busy polling to be configured on a
per-epoll context basis (both the usecs and the packet budget) really help
to make epoll-based busy polling much more usable by user apps.

Thanks,
Joe

[1]: https://gitlab.uwaterloo.ca/p5cai/netstack-exp/-/raw/master/kernel-polling-5.15.79-base.patch?ref_type=heads
[2]: https://dl.acm.org/doi/pdf/10.1145/3626780
  
Eric Dumazet Jan. 24, 2024, 2:38 p.m. UTC | #3
On Wed, Jan 24, 2024 at 3:20 PM Joe Damato <jdamato@fastly.com> wrote:
>
> On Wed, Jan 24, 2024 at 09:20:09AM +0100, Eric Dumazet wrote:
> > On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@fastly.com> wrote:
> > >
> > > Greetings:
> > >
> > > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> > > epoll with socket fds.") by allowing user applications to enable
> > > epoll-based busy polling and set a busy poll packet budget on a per epoll
> > > context basis.
> > >
> > > To allow for this, two ioctls have been added for epoll contexts for
> > > getting and setting a new struct, struct epoll_params.
> > >
> > > This makes epoll-based busy polling much more usable for user
> > > applications than the current system-wide sysctl and hardcoded budget.
> > >
> > > Longer explanation:
> > >
> > > Presently epoll has support for a very useful form of busy poll based on
> > > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
> > >
> > > This form of busy poll allows epoll_wait to drive NAPI packet processing
> > > which allows for a few interesting user application designs which can
> > > reduce latency and also potentially improve L2/L3 cache hit rates by
> > > deferring NAPI until userland has finished its work.
> > >
> > > The documentation available on this is, IMHO, a bit confusing so please
> > > allow me to explain how one might use this:
> > >
> > > 1. Ensure each application thread has its own epoll instance mapping
> > > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> > > direct connections with specific dest ports to these queues.
> > >
> > > 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
> > > polling will occur. This can help avoid the userland app from being
> > > pre-empted by a hard IRQ while userland is running. Note this means that
> > > userland must take care to call epoll_wait and not take too long in
> > > userland since it now drives NAPI via epoll_wait.
> > >
> > > 3. Ensure that all incoming connections added to an epoll instance
> > > have the same NAPI ID. This can be done with a BPF filter when
> > > SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> > > accept thread is used which dispatches incoming connections to threads.
> > >
> > > 4. Lastly, busy poll must be enabled via a sysctl
> > > (/proc/sys/net/core/busy_poll).
> > >
> > > The unfortunate part about step 4 above is that this enables busy poll
> > > system-wide which affects all user applications on the system,
> > > including epoll-based network applications which were not intended to
> > > be used this way or applications where increased CPU usage for lower
> > > latency network processing is unnecessary or not desirable.
> > >
> > > If the user wants to run one low latency epoll-based server application
> > > with epoll-based busy poll, but would like to run the rest of the
> > > applications on the system (which may also use epoll) without busy poll,
> > > this system-wide sysctl presents a significant problem.
> > >
> > > This change preserves the system-wide sysctl, but adds a mechanism (via
> > > ioctl) to enable or disable busy poll for epoll contexts as needed by
> > > individual applications, making epoll-based busy poll more usable.
> > >
> >
> > I think this description missed the napi_defer_hard_irqs and
> > gro_flush_timeout settings ?
>
> I'm not sure if those settings are strictly related to the change I am
> proposing which makes epoll-based busy poll something that can be
> enabled/disabled on a per-epoll context basis and allows the budget to be
> set as well, but maybe I am missing something? Sorry for my
> misunderstanding if so.
>
> IMHO: a single system-wide busy poll setting is difficult to use
> properly and it is unforunate that the packet budget is hardcoded. It would
> be extremely useful to be able to set both of these on a per-epoll basis
> and I think my suggested change helps to solve this.
>
> Please let me know.
>
> Re the two settings you noted:
>
> I didn't mention those in the interest of brevity, but yes they can be used
> instead of or in addition to what I've described above.
>
> While those settings are very useful, IMHO, they have their own issues
> because they are system-wide as well. If they were settable per-NAPI, that
> would make it much easier to use them because they could be enabled for the
> NAPIs which are being busy-polled by applications that support busy-poll.
>
> Imagine you have 3 types of apps running side-by-side:
>   - A low latency epoll-based busy poll app,
>   - An app where latency doesn't matter as much, and
>   - A latency sensitive legacy app which does not yet support epoll-based
>     busy poll.
>
> In the first two cases, the settings you mention would be helpful or not
> make any difference, but in the third case the system-wide impact might be
> undesirable because having IRQs fire might be important to keep latency
> down.
>
> If your comment was more that my cover letter should have mentioned these,
> I can include that in a future cover letter or suggest some kernel
> documentation which will discuss all of these features and how they relate
> to each other.
>
> >
> > I would think that if an application really wants to make sure its
> > thread is the only one
> > eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
> >
> > Current implementations of busy poll always release NAPI_STATE_SCHED bit when
> > returning to user space.
> >
> > It seems you want to make sure the application and only the
> > application calls the napi->poll()
> > at chosen times.
> >
> > Some kind of contract is needed, and the presence of the hrtimer
> > (currently only driven from dev->gro_flush_timeout)
> > would allow to do that correctly.
> >
> > Whenever we 'trust' user space to perform the napi->poll shortly, we
> > also want to arm the hrtimer to eventually detect
> > the application took too long, to restart the other mechanisms (NIC irq based)
>
> There is another change [1] I've been looking at from a research paper [2]
> which does something similar to what you've described above -- it keeps
> IRQs suppressed during busy polling. The paper suggests a performance
> improvement is measured when using a mechanism like this to keep IRQs off.
> Please see the paper for more details.
>
> I haven't had a chance to reach out to the authors or to tweak this patch
> to attempt an RFC / submission for it, but it seems fairly promising in my
> initial synthetic tests.
>
> When I tested their patch, as you might expect, no IRQs were generated at
> all for the NAPIs that were being busy polled, but the rest of the
> NAPIs and queues were generating IRQs as expected.
>
> Regardless of the above patch: I think my proposed change is helpful and
> the IRQ suppression bit can be handled in a separate change in the future.
> What do you think?
>
> > Note that we added the kthread based napi polling, and we are working
> > to add a busy polling feature to these kthreads.
> > allowing to completely mask NIC interrupts and further reduce latencies.
>
> I am aware of kthread based NAPI polling, yes, but I was not aware that
> busy polling was being considered as a feature for them, thanks for the
> head's up.
>
> > Thank you
>
> Thanks for your comments - I appreciate your time and attention.
>
> Could you let me know if your comments are meant as a n-ack or similar?

Patch #2 needs the 'why' part, and why would we allow user space to
ask to poll up to 65535 packets...
There is a reason we have a warning in place when a driver attempts to
set a budget bigger than 64.

You cited recent papers; I wrote this one specific to Linux busy
polling (https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf)

Busy polling had been in the pipeline when Wei sent her patches and follow-ups.

cb038357937ee4f589aab2469ec3896dce90f317 net: fix race between napi
kthread mode and busy poll
5fdd2f0e5c64846bf3066689b73fc3b8dddd1c74 net: add sysfs attribute to
control napi threaded mode
29863d41bb6e1d969c62fdb15b0961806942960e net: implement threaded-able
napi poll loop support

I am saying that I am currently working on the kthread busy polling
implementation, after fixing two bugs in SCTP and UDP (making me wonder
if busy polling is really used these days).

I am also considering unifying napi_threaded_poll() and napi_busy_loop(),
and seeing your patches coming makes this work more challenging.
  
Joe Damato Jan. 24, 2024, 3:19 p.m. UTC | #4
On Wed, Jan 24, 2024 at 03:38:19PM +0100, Eric Dumazet wrote:
> On Wed, Jan 24, 2024 at 3:20 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Wed, Jan 24, 2024 at 09:20:09AM +0100, Eric Dumazet wrote:
> > > On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@fastly.com> wrote:
> > > >
> > > > Greetings:
> > > >
> > > > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> > > > epoll with socket fds.") by allowing user applications to enable
> > > > epoll-based busy polling and set a busy poll packet budget on a per epoll
> > > > context basis.
> > > >
> > > > To allow for this, two ioctls have been added for epoll contexts for
> > > > getting and setting a new struct, struct epoll_params.
> > > >
> > > > This makes epoll-based busy polling much more usable for user
> > > > applications than the current system-wide sysctl and hardcoded budget.
> > > >
> > > > Longer explanation:
> > > >
> > > > Presently epoll has support for a very useful form of busy poll based on
> > > > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
> > > >
> > > > This form of busy poll allows epoll_wait to drive NAPI packet processing
> > > > which allows for a few interesting user application designs which can
> > > > reduce latency and also potentially improve L2/L3 cache hit rates by
> > > > deferring NAPI until userland has finished its work.
> > > >
> > > > The documentation available on this is, IMHO, a bit confusing so please
> > > > allow me to explain how one might use this:
> > > >
> > > > 1. Ensure each application thread has its own epoll instance mapping
> > > > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> > > > direct connections with specific dest ports to these queues.
> > > >
> > > > 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
> > > > polling will occur. This can help avoid the userland app from being
> > > > pre-empted by a hard IRQ while userland is running. Note this means that
> > > > userland must take care to call epoll_wait and not take too long in
> > > > userland since it now drives NAPI via epoll_wait.
> > > >
> > > > 3. Ensure that all incoming connections added to an epoll instance
> > > > have the same NAPI ID. This can be done with a BPF filter when
> > > > SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> > > > accept thread is used which dispatches incoming connections to threads.
> > > >
> > > > 4. Lastly, busy poll must be enabled via a sysctl
> > > > (/proc/sys/net/core/busy_poll).
> > > >
> > > > The unfortunate part about step 4 above is that this enables busy poll
> > > > system-wide which affects all user applications on the system,
> > > > including epoll-based network applications which were not intended to
> > > > be used this way or applications where increased CPU usage for lower
> > > > latency network processing is unnecessary or not desirable.
> > > >
> > > > If the user wants to run one low latency epoll-based server application
> > > > with epoll-based busy poll, but would like to run the rest of the
> > > > applications on the system (which may also use epoll) without busy poll,
> > > > this system-wide sysctl presents a significant problem.
> > > >
> > > > This change preserves the system-wide sysctl, but adds a mechanism (via
> > > > ioctl) to enable or disable busy poll for epoll contexts as needed by
> > > > individual applications, making epoll-based busy poll more usable.
> > > >
> > >
> > > I think this description missed the napi_defer_hard_irqs and
> > > gro_flush_timeout settings ?
> >
> > I'm not sure if those settings are strictly related to the change I am
> > proposing which makes epoll-based busy poll something that can be
> > enabled/disabled on a per-epoll context basis and allows the budget to be
> > set as well, but maybe I am missing something? Sorry for my
> > misunderstanding if so.
> >
> > IMHO: a single system-wide busy poll setting is difficult to use
> > properly and it is unforunate that the packet budget is hardcoded. It would
> > be extremely useful to be able to set both of these on a per-epoll basis
> > and I think my suggested change helps to solve this.
> >
> > Please let me know.
> >
> > Re the two settings you noted:
> >
> > I didn't mention those in the interest of brevity, but yes they can be used
> > instead of or in addition to what I've described above.
> >
> > While those settings are very useful, IMHO, they have their own issues
> > because they are system-wide as well. If they were settable per-NAPI, that
> > would make it much easier to use them because they could be enabled for the
> > NAPIs which are being busy-polled by applications that support busy-poll.
> >
> > Imagine you have 3 types of apps running side-by-side:
> >   - A low latency epoll-based busy poll app,
> >   - An app where latency doesn't matter as much, and
> >   - A latency sensitive legacy app which does not yet support epoll-based
> >     busy poll.
> >
> > In the first two cases, the settings you mention would be helpful or not
> > make any difference, but in the third case the system-wide impact might be
> > undesirable because having IRQs fire might be important to keep latency
> > down.
> >
> > If your comment was more that my cover letter should have mentioned these,
> > I can include that in a future cover letter or suggest some kernel
> > documentation which will discuss all of these features and how they relate
> > to each other.
> >
> > >
> > > I would think that if an application really wants to make sure its
> > > thread is the only one
> > > eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
> > >
> > > Current implementations of busy poll always release NAPI_STATE_SCHED bit when
> > > returning to user space.
> > >
> > > It seems you want to make sure the application and only the
> > > application calls the napi->poll()
> > > at chosen times.
> > >
> > > Some kind of contract is needed, and the presence of the hrtimer
> > > (currently only driven from dev->gro_flush_timeout)
> > > would allow to do that correctly.
> > >
> > > Whenever we 'trust' user space to perform the napi->poll shortly, we
> > > also want to arm the hrtimer to eventually detect
> > > the application took too long, to restart the other mechanisms (NIC irq based)
> >
> > There is another change [1] I've been looking at from a research paper [2]
> > which does something similar to what you've described above -- it keeps
> > IRQs suppressed during busy polling. The paper suggests a performance
> > improvement is measured when using a mechanism like this to keep IRQs off.
> > Please see the paper for more details.
> >
> > I haven't had a chance to reach out to the authors or to tweak this patch
> > to attempt an RFC / submission for it, but it seems fairly promising in my
> > initial synthetic tests.
> >
> > When I tested their patch, as you might expect, no IRQs were generated at
> > all for the NAPIs that were being busy polled, but the rest of the
> > NAPIs and queues were generating IRQs as expected.
> >
> > Regardless of the above patch: I think my proposed change is helpful and
> > the IRQ suppression bit can be handled in a separate change in the future.
> > What do you think?
> >
> > > Note that we added the kthread based napi polling, and we are working
> > > to add a busy polling feature to these kthreads.
> > > allowing to completely mask NIC interrupts and further reduce latencies.
> >
> > I am aware of kthread based NAPI polling, yes, but I was not aware that
> > busy polling was being considered as a feature for them, thanks for the
> > head's up.
> >
> > > Thank you
> >
> > Thanks for your comments - I appreciate your time and attention.
> >
> > Could you let me know if your comments are meant as a n-ack or similar?
> 
> Patch #2 needs the 'why' part, and why would we allow user space to
> ask to poll up to 65535 packets...
> There is a reason we have a warning in place when a driver attempts to
> set a budget bigger than 64.

Sure, thanks for pointing this out.

I am happy to cap the budget to 64 if a user app requests a larger amount
and I can add a netdev_warn when this happens, if you'd like.

The 'why' has two reasons:
  - Increasing the budget for fast NICs can help improve throughput
    under load (i.e. the hardcoded amount might be too low for some users)
  - other poll methods have user-configurable budget amounts
    (SO_BUSY_POLL_BUDGET), so epoll stands out as an edge case where the
    budget is hardcoded.

I hope that reasoning is sufficient and I can include that more explicitly
in the commit message.

FWIW: My reading of SO_BUSY_POLL_BUDGET suggests that any budget amount
up to U16_MAX will be accepted. I probably missed it somewhere, but I
didn't see a warning in this case.

I think in a v2 SO_BUSY_POLL_BUDGET and the epoll ioctl budget should
be capped at the same amount for consistency and I am happy to agree to 64
or 128 or similar as a cap.
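
For illustration only, the clamp + warn could look roughly like the
following in the ioctl handler (the helper name and the 64 cap here are
hypothetical, not taken from the actual patches):

    #include <linux/printk.h>

    /* Hypothetical sketch: clamp a user-requested busy poll budget and
     * warn once, as discussed above.
     */
    static unsigned int ep_clamp_busy_poll_budget(unsigned int requested)
    {
            const unsigned int cap = 64; /* in line with the usual NAPI poll weight */

            if (requested > cap) {
                    pr_warn_once("epoll: busy_poll_budget %u clamped to %u\n",
                                 requested, cap);
                    return cap;
            }

            return requested;
    }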

Let me know what you think and thanks again for your thoughts and detailed
response.

> You cited recent papers,  I wrote this one specific to linux busy
> polling ( https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf )

Thanks for sending this link, I'll need to take a closer look, but a quick
read surfaced two things:

  Their use is very limited, since they enforce busy polling
  for all sockets, which is not desirable

We agree on the limited usefulness of system-wide settings ;)

  Another big problem is that Busy Polling was not really
  deployed in production, because it works well when having
  no more than one thread per NIC RX queue.

I've been testing epoll-based busy poll in production with a few different
NICs and application setups and it has been pretty helpful, but I agree
that this is application architecture specific as you allude to in
your next paragraph about the scheduler.

Thanks for linking to the paper.

It would be great if all of this context and information could be put in
one place in the kernel docs. If I have time in the future, I'll propose a
doc change to try to outline all of this.

> Busy polling had been in the pipe, when Wei sent her patches and follow ups.
> 
> cb038357937ee4f589aab2469ec3896dce90f317 net: fix race between napi
> kthread mode and busy poll
> 5fdd2f0e5c64846bf3066689b73fc3b8dddd1c74 net: add sysfs attribute to
> control napi threaded mode
> 29863d41bb6e1d969c62fdb15b0961806942960e net: implement threaded-able
> napi poll loop support

Thanks for letting me know. I think I'd seen these in passing, but hadn't
remembered until you mentioned it now.

> I am saying that I am currently working to implement the kthread busy
> polling implementation,
> after fixing two bugs in SCTP and UDP (making me wondering if busy
> polling is really used these days)

Ah, I see. FWIW, I have so far only been trying to use it for TCP and so
far I haven't hit any bugs.

I was planning to use it with UDP in the future, though, once the TCP
epoll-based busy polling stuff I am working on is done... so thanks in
advance for the bug fixes in UDP.

> I am also considering unifying napi_threaded_poll() and the
> napi_busy_loop(), and seeing your patches
> coming make this work more challenging.

Sorry about that. I am happy to make modifications to my patches if there's
anything I could do which would make your work easier in the future.

Thanks,
Joe
  
Samudrala, Sridhar Jan. 30, 2024, 6:54 p.m. UTC | #5
On 1/24/2024 2:20 AM, Eric Dumazet wrote:
> On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@fastly.com> wrote:
>>
>> Greetings:
>>
>> TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
>> epoll with socket fds.") by allowing user applications to enable
>> epoll-based busy polling and set a busy poll packet budget on a per epoll
>> context basis.
>>
>> To allow for this, two ioctls have been added for epoll contexts for
>> getting and setting a new struct, struct epoll_params.
>>
>> This makes epoll-based busy polling much more usable for user
>> applications than the current system-wide sysctl and hardcoded budget.

Agree. looking forward to see this patch series accepted soon.

>>
>> Longer explanation:
>>
>> Presently epoll has support for a very useful form of busy poll based on
>> the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
>>
>> This form of busy poll allows epoll_wait to drive NAPI packet processing
>> which allows for a few interesting user application designs which can
>> reduce latency and also potentially improve L2/L3 cache hit rates by
>> deferring NAPI until userland has finished its work.
>>
>> The documentation available on this is, IMHO, a bit confusing so please
>> allow me to explain how one might use this:
>>
>> 1. Ensure each application thread has its own epoll instance mapping
>> 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
>> direct connections with specific dest ports to these queues.
>>
>> 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
>> polling will occur. This can help avoid the userland app from being
>> pre-empted by a hard IRQ while userland is running. Note this means that
>> userland must take care to call epoll_wait and not take too long in
>> userland since it now drives NAPI via epoll_wait.
>>
>> 3. Ensure that all incoming connections added to an epoll instance
>> have the same NAPI ID. This can be done with a BPF filter when
>> SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
>> accept thread is used which dispatches incoming connections to threads.
>>
>> 4. Lastly, busy poll must be enabled via a sysctl
>> (/proc/sys/net/core/busy_poll).
>>
>> The unfortunate part about step 4 above is that this enables busy poll
>> system-wide which affects all user applications on the system,
>> including epoll-based network applications which were not intended to
>> be used this way or applications where increased CPU usage for lower
>> latency network processing is unnecessary or not desirable.
>>
>> If the user wants to run one low latency epoll-based server application
>> with epoll-based busy poll, but would like to run the rest of the
>> applications on the system (which may also use epoll) without busy poll,
>> this system-wide sysctl presents a significant problem.
>>
>> This change preserves the system-wide sysctl, but adds a mechanism (via
>> ioctl) to enable or disable busy poll for epoll contexts as needed by
>> individual applications, making epoll-based busy poll more usable.
>>
> 
> I think this description missed the napi_defer_hard_irqs and
> gro_flush_timeout settings ?
> 
> I would think that if an application really wants to make sure its
> thread is the only one
> eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
> 
> Current implementations of busy poll always release NAPI_STATE_SCHED bit when
> returning to user space.
> 
> It seems you want to make sure the application and only the
> application calls the napi->poll()
> at chosen times.
> 
> Some kind of contract is needed, and the presence of the hrtimer
> (currently only driven from dev->gro_flush_timeout)
> would allow to do that correctly.
> 
> Whenever we 'trust' user space to perform the napi->poll shortly, we
> also want to arm the hrtimer to eventually detect
> the application took too long, to restart the other mechanisms (NIC irq based)
> 
> Note that we added the kthread based napi polling, and we are working
> to add a busy polling feature to these kthreads.
> allowing to completely mask NIC interrupts and further reduce latencies.


Good to know that you are looking into enabling busy polling for napi 
kthreads.
We have something similar implemented in our ice OOT driver, which we 
call 'independent pollers': in this mode, busy polling is not app 
dependent or triggered by an application.
Here is a link to the slides we presented at netdev 0x16 driver workshop.
https://netdevconf.info/0x16/slides/48/netdev0x16_driver_workshop_ADQ.pdf

We haven't yet submitted the patches upstream as there is no kernel 
interface to configure napi specific timeouts.
With the recent per-queue and per-napi netlink APIs that are accepted 
upstream
https://lore.kernel.org/netdev/170147307026.5260.9300080745237900261.stgit@anambiarhost.jf.intel.com/

we are thinking of making the timeout a per-napi parameter that can be 
used as an interface to enable napi kthread based busy polling.

I think even the per-device napi_defer_hard_irqs and gro_flush_timeout 
should become per-napi parameters.

> 
> Thank you
> 
>> Thanks,
>> Joe
>>
>> [1]: https://lore.kernel.org/lkml/20170324170836.15226.87178.stgit@localhost.localdomain/
>>
>> Joe Damato (3):
>>    eventpoll: support busy poll per epoll instance
>>    eventpoll: Add per-epoll busy poll packet budget
>>    eventpoll: Add epoll ioctl for epoll_params
>>
>>   .../userspace-api/ioctl/ioctl-number.rst      |  1 +
>>   fs/eventpoll.c                                | 99 ++++++++++++++++++-
>>   include/uapi/linux/eventpoll.h                | 12 +++
>>   3 files changed, 107 insertions(+), 5 deletions(-)
>>
>> --
>> 2.25.1
>>
  
Joe Damato Feb. 2, 2024, 3:28 a.m. UTC | #6
On Tue, Jan 30, 2024 at 12:54:50PM -0600, Samudrala, Sridhar wrote:
> 
> 
> On 1/24/2024 2:20 AM, Eric Dumazet wrote:
> >On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@fastly.com> wrote:
> >>
> >>Greetings:
> >>
> >>TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> >>epoll with socket fds.") by allowing user applications to enable
> >>epoll-based busy polling and set a busy poll packet budget on a per epoll
> >>context basis.
> >>
> >>To allow for this, two ioctls have been added for epoll contexts for
> >>getting and setting a new struct, struct epoll_params.
> >>
> >>This makes epoll-based busy polling much more usable for user
> >>applications than the current system-wide sysctl and hardcoded budget.
> 
> Agree. looking forward to see this patch series accepted soon.
> 
> >>
> >>Longer explanation:
> >>
> >>Presently epoll has support for a very useful form of busy poll based on
> >>the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
> >>
> >>This form of busy poll allows epoll_wait to drive NAPI packet processing
> >>which allows for a few interesting user application designs which can
> >>reduce latency and also potentially improve L2/L3 cache hit rates by
> >>deferring NAPI until userland has finished its work.
> >>
> >>The documentation available on this is, IMHO, a bit confusing so please
> >>allow me to explain how one might use this:
> >>
> >>1. Ensure each application thread has its own epoll instance mapping
> >>1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> >>direct connections with specific dest ports to these queues.
> >>
> >>2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
> >>polling will occur. This can help avoid the userland app from being
> >>pre-empted by a hard IRQ while userland is running. Note this means that
> >>userland must take care to call epoll_wait and not take too long in
> >>userland since it now drives NAPI via epoll_wait.
> >>
> >>3. Ensure that all incoming connections added to an epoll instance
> >>have the same NAPI ID. This can be done with a BPF filter when
> >>SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> >>accept thread is used which dispatches incoming connections to threads.
> >>
> >>4. Lastly, busy poll must be enabled via a sysctl
> >>(/proc/sys/net/core/busy_poll).
> >>
> >>The unfortunate part about step 4 above is that this enables busy poll
> >>system-wide which affects all user applications on the system,
> >>including epoll-based network applications which were not intended to
> >>be used this way or applications where increased CPU usage for lower
> >>latency network processing is unnecessary or not desirable.
> >>
> >>If the user wants to run one low latency epoll-based server application
> >>with epoll-based busy poll, but would like to run the rest of the
> >>applications on the system (which may also use epoll) without busy poll,
> >>this system-wide sysctl presents a significant problem.
> >>
> >>This change preserves the system-wide sysctl, but adds a mechanism (via
> >>ioctl) to enable or disable busy poll for epoll contexts as needed by
> >>individual applications, making epoll-based busy poll more usable.
> >>
> >
> >I think this description missed the napi_defer_hard_irqs and
> >gro_flush_timeout settings ?
> >
> >I would think that if an application really wants to make sure its
> >thread is the only one
> >eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
> >
> >Current implementations of busy poll always release NAPI_STATE_SCHED bit when
> >returning to user space.
> >
> >It seems you want to make sure the application and only the
> >application calls the napi->poll()
> >at chosen times.
> >
> >Some kind of contract is needed, and the presence of the hrtimer
> >(currently only driven from dev->gro_flush_timeout)
> >would allow to do that correctly.
> >
> >Whenever we 'trust' user space to perform the napi->poll shortly, we
> >also want to arm the hrtimer to eventually detect
> >the application took too long, to restart the other mechanisms (NIC irq based)
> >
> >Note that we added the kthread based napi polling, and we are working
> >to add a busy polling feature to these kthreads.
> >allowing to completely mask NIC interrupts and further reduce latencies.
> 
> 
> Good to know that you are looking into enabling busy polling for napi
> kthreads.
> We have something similar in our ice OOT driver that is implemented and we
> call it 'independent pollers' as in this mode, busy polling will not be app
> dependent or triggered by an application.
> Here is a link to the slides we presented at netdev 0x16 driver workshop.
> https://netdevconf.info/0x16/slides/48/netdev0x16_driver_workshop_ADQ.pdf
> 
> We haven't yet submitted the patches upstream as there is no kernel
> interface to configure napi specific timeouts.
> With the recent per-queue and per-napi netlink APIs that are accepted
> upstream
> https://lore.kernel.org/netdev/170147307026.5260.9300080745237900261.stgit@anambiarhost.jf.intel.com/
> 
> we are thinking of making timeout as a per-napi parameter and can be used as
> an interface to enable napi kthread based busy polling.

I know I am replying to a stale thread on the patches I've submitted (there is
a v5 now [1]), but I just looked at your message - sorry I didn't reply
sooner.

The per-queue and per-napi netlink APIs look extremely useful, thanks for
pointing this out.

In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name. This is
useful on machines with multiple NICs where each NIC could be located in
one of many different NUMA zones.

The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the NAPI
ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID to
compute which NIC the connection came in on. The app would then (via
configuration) know where to pin that worker thread; ideally somewhere NUMA
local to the NIC.

I had assumed that such a change would be rejected, but I figured I'd send
an RFC for it after the per epoll context stuff was done and see if anyone
thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as well.

> I think even the per-device napi_defer_hard_irqs and gro_flush_timeout
> should become per-napi parameters.

I agree.

I had been contemplating implementing this until I tried a different method
similar to an academic paper I was reading [2][3]. I think per-device
defer_hard_irqs and gro_flush_timeout would be extremely useful and a
better approach than the one I'm currently using.

Is this something you are currently working? I may try implementing this,
but didn't want to duplicate effort if you are already working on this.

Thanks,
Joe

[1]: https://lore.kernel.org/all/20240131180811.23566-1-jdamato@fastly.com/
[2]: https://dl.acm.org/doi/pdf/10.1145/3626780
[3]: https://gitlab.uwaterloo.ca/p5cai/netstack-exp/-/blob/master/kernel-polling-5.15.79-base.patch?ref_type=heads
  
Samudrala, Sridhar Feb. 2, 2024, 5:23 p.m. UTC | #7
On 2/1/2024 9:28 PM, Joe Damato wrote:
> On Tue, Jan 30, 2024 at 12:54:50PM -0600, Samudrala, Sridhar wrote:
>>
>>
>> On 1/24/2024 2:20 AM, Eric Dumazet wrote:
>>> On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@fastly.com> wrote:
>>>>
>>>> Greetings:
>>>>
>>>> TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
>>>> epoll with socket fds.") by allowing user applications to enable
>>>> epoll-based busy polling and set a busy poll packet budget on a per epoll
>>>> context basis.
>>>>
>>>> To allow for this, two ioctls have been added for epoll contexts for
>>>> getting and setting a new struct, struct epoll_params.
>>>>
>>>> This makes epoll-based busy polling much more usable for user
>>>> applications than the current system-wide sysctl and hardcoded budget.
>>
>> Agree. looking forward to see this patch series accepted soon.
>>
>>>>
>>>> Longer explanation:
>>>>
>>>> Presently epoll has support for a very useful form of busy poll based on
>>>> the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
>>>>
>>>> This form of busy poll allows epoll_wait to drive NAPI packet processing
>>>> which allows for a few interesting user application designs which can
>>>> reduce latency and also potentially improve L2/L3 cache hit rates by
>>>> deferring NAPI until userland has finished its work.
>>>>
>>>> The documentation available on this is, IMHO, a bit confusing so please
>>>> allow me to explain how one might use this:
>>>>
>>>> 1. Ensure each application thread has its own epoll instance mapping
>>>> 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
>>>> direct connections with specific dest ports to these queues.
>>>>
>>>> 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
>>>> polling will occur. This can help avoid the userland app from being
>>>> pre-empted by a hard IRQ while userland is running. Note this means that
>>>> userland must take care to call epoll_wait and not take too long in
>>>> userland since it now drives NAPI via epoll_wait.
>>>>
>>>> 3. Ensure that all incoming connections added to an epoll instance
>>>> have the same NAPI ID. This can be done with a BPF filter when
>>>> SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
>>>> accept thread is used which dispatches incoming connections to threads.
>>>>
>>>> 4. Lastly, busy poll must be enabled via a sysctl
>>>> (/proc/sys/net/core/busy_poll).
>>>>
>>>> The unfortunate part about step 4 above is that this enables busy poll
>>>> system-wide which affects all user applications on the system,
>>>> including epoll-based network applications which were not intended to
>>>> be used this way or applications where increased CPU usage for lower
>>>> latency network processing is unnecessary or not desirable.
>>>>
>>>> If the user wants to run one low latency epoll-based server application
>>>> with epoll-based busy poll, but would like to run the rest of the
>>>> applications on the system (which may also use epoll) without busy poll,
>>>> this system-wide sysctl presents a significant problem.
>>>>
>>>> This change preserves the system-wide sysctl, but adds a mechanism (via
>>>> ioctl) to enable or disable busy poll for epoll contexts as needed by
>>>> individual applications, making epoll-based busy poll more usable.
>>>>
>>>
>>> I think this description missed the napi_defer_hard_irqs and
>>> gro_flush_timeout settings ?
>>>
>>> I would think that if an application really wants to make sure its
>>> thread is the only one
>>> eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
>>>
>>> Current implementations of busy poll always release NAPI_STATE_SCHED bit when
>>> returning to user space.
>>>
>>> It seems you want to make sure the application and only the
>>> application calls the napi->poll()
>>> at chosen times.
>>>
>>> Some kind of contract is needed, and the presence of the hrtimer
>>> (currently only driven from dev->gro_flush_timeout)
>>> would allow to do that correctly.
>>>
>>> Whenever we 'trust' user space to perform the napi->poll shortly, we
>>> also want to arm the hrtimer to eventually detect
>>> the application took too long, to restart the other mechanisms (NIC irq based)
>>>
>>> Note that we added the kthread based napi polling, and we are working
>>> to add a busy polling feature to these kthreads.
>>> allowing to completely mask NIC interrupts and further reduce latencies.
>>
>>
>> Good to know that you are looking into enabling busy polling for napi
>> kthreads.
>> We have something similar in our ice OOT driver that is implemented and we
>> call it 'independent pollers' as in this mode, busy polling will not be app
>> dependent or triggered by an application.
>> Here is a link to the slides we presented at netdev 0x16 driver workshop.
>> https://netdevconf.info/0x16/slides/48/netdev0x16_driver_workshop_ADQ.pdf
>>
>> We haven't yet submitted the patches upstream as there is no kernel
>> interface to configure napi specific timeouts.
>> With the recent per-queue and per-napi netlink APIs that are accepted
>> upstream
>> https://lore.kernel.org/netdev/170147307026.5260.9300080745237900261.stgit@anambiarhost.jf.intel.com/
>>
>> we are thinking of making timeout as a per-napi parameter and can be used as
>> an interface to enable napi kthread based busy polling.
> 
> I know I am replying to a stale thread on the patches I've submit (there is
> a v5 now [1]), but I just looked at your message - sorry I didn't reply
> sooner.
> 
> The per-queue and per-napi netlink APIs look extremely useful, thanks for
> pointing this out.
> 
> In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
> similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name. This is
> useful on machines with multiple NICs where each NIC could be located in
> one of many different NUMA zones.
> 
> The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the NAPI
> ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID to
> compute which NIC the connection came in on. The app would then (via
> configuration) know where to pin that worker thread; ideally somewhere NUMA
> local to the NIC.
> 
> I had assumed that such a change would be rejected, but I figured I'd send
> an RFC for it after the per epoll context stuff was done and see if anyone
> thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as well.

I think you should be able to get this functionality via the netdev-genl 
API to get napi parameters. It returns ifindex as one of the parameters 
and you should be able to get the name from ifindex.

$ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
{'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}
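
Turning that ifindex into a name from C is then just if_indextoname(); a
minimal sketch (assuming the ifindex was already read from the napi-get
reply):

    #include <net/if.h>
    #include <stdio.h>

    static void print_napi_ifname(unsigned int ifindex)
    {
            char name[IF_NAMESIZE];

            /* resolve the 'ifindex' attribute from the napi-get reply */
            if (if_indextoname(ifindex, name))
                    printf("napi is on %s\n", name);
    }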

> 
>> I think even the per-device napi_defer_hard_irqs and gro_flush_timeout
>> should become per-napi parameters.
> 
> I agree.
> 
> I had been contemplating implementing this until I tried a different method
> similar to an academic paper I was reading [2][3]. I think per-device
> defer_hard_irqs and gro_flush_timeout would be extremely useful and a
> better approach than the one I'm currently using.
> 
> Is this something you are currently working? I may try implementing this,
> but didn't want to duplicate effort if you are already working on this.

Not yet. Please go ahead and work on it if you have time.
napi-get and napi-set can be extended to show and set these parameters.

> 
> Thanks,
> Joe
> 
> [1]: https://lore.kernel.org/all/20240131180811.23566-1-jdamato@fastly.com/
> [2]: https://dl.acm.org/doi/pdf/10.1145/3626780
> [3]: https://gitlab.uwaterloo.ca/p5cai/netstack-exp/-/blob/master/kernel-polling-5.15.79-base.patch?ref_type=heads
  
Jakub Kicinski Feb. 2, 2024, 6:22 p.m. UTC | #8
On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:
> > I know I am replying to a stale thread on the patches I've submit (there is
> > a v5 now [1]), but I just looked at your message - sorry I didn't reply
> > sooner.
> > 
> > The per-queue and per-napi netlink APIs look extremely useful, thanks for
> > pointing this out.
> > 
> > In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
> > similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name. This is
> > useful on machines with multiple NICs where each NIC could be located in
> > one of many different NUMA zones.
> > 
> > The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the NAPI
> > ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID to
> > compute which NIC the connection came in on. The app would then (via
> > configuration) know where to pin that worker thread; ideally somewhere NUMA
> > local to the NIC.
> > 
> > I had assumed that such a change would be rejected, but I figured I'd send
> > an RFC for it after the per epoll context stuff was done and see if anyone
> > thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as well.  
> 
> I think you should be able to get this functionality via the netdev-genl 
> API to get napi parameters. It returns ifindex as one of the parameters 
> and you should able to get the name from ifindex.
> 
> $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}

FWIW we also have a C library to access those. Out of curiosity what's
the programming language you'd use in user space, Joe?
  
Joe Damato Feb. 2, 2024, 7:33 p.m. UTC | #9
On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:
> > > I know I am replying to a stale thread on the patches I've submit (there is
> > > a v5 now [1]), but I just looked at your message - sorry I didn't reply
> > > sooner.
> > > 
> > > The per-queue and per-napi netlink APIs look extremely useful, thanks for
> > > pointing this out.
> > > 
> > > In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
> > > similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name. This is
> > > useful on machines with multiple NICs where each NIC could be located in
> > > one of many different NUMA zones.
> > > 
> > > The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the NAPI
> > > ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID to
> > > compute which NIC the connection came in on. The app would then (via
> > > configuration) know where to pin that worker thread; ideally somewhere NUMA
> > > local to the NIC.
> > > 
> > > I had assumed that such a change would be rejected, but I figured I'd send
> > > an RFC for it after the per epoll context stuff was done and see if anyone
> > > thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as well.  
> > 
> > I think you should be able to get this functionality via the netdev-genl 
> > API to get napi parameters. It returns ifindex as one of the parameters 
> > and you should able to get the name from ifindex.
> > 
> > $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> > {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}
> 
> FWIW we also have a C library to access those. Out of curiosity what's
> the programming language you'd use in user space, Joe?

I am using C from user space. Curious what you think about
SIOCGIFNAME_BY_NAPI_ID, Jakub? I think it would be very useful, but not
sure if such an extension would be accepted. I can send an RFC, if you'd
like to take a look and consider it. I know you are busy and I don't want
to add too much noise to the list if I can help it :)

Here's a brief description of what I'm doing, which others might find
helpful:

1. Machine has multiple NICs. Each NIC has 1 queue per busy poll app
thread, plus a few extra queues for other non busy poll usage.

2. A custom RSS context is created to distribute flows to the busy poll
queues. This context is created for each NIC. The default context directs
flows to the non-busy poll queues.

3. Each NIC has n-tuple filters inserted to direct incoming connections
with certain destination ports (e.g. 80, 443) to the custom RSS context.
All other incoming connections will land in the default context and go to
the other queues.

4. IRQs for the busy poll queues are pinned to specific CPUs which are NUMA
local to the NIC.

5. IRQ coalescing values are set up with busy poll in mind, so IRQs are
deferred as much as possible with the assumption userland will drive NAPI
via epoll_wait. This is done per queue (using ethtool --per-queue and a
queue mask). This is where napi_defer_hard_irqs and gro_flush_timeout
could help even more. IRQ deferral is only needed for the busy poll queues.

6. userspace app config has NICs with their NUMA local CPUs listed, for
example like this:

   - eth0: 0,1,2,3
   - eth1: 4,5,6,7

The app reads that configuration in when it starts. Ideally, these are the
same CPUs the IRQs are pinned to in step 4, but hopefully the coalesce
settings let IRQs be deferred quite a bit so busy poll can take over.

7. App threads are created and sockets are opened with REUSEPORT. Notably:
when the sockets are created, SO_BINDTODEVICE is used* (see below for
longer explanation about this).

8. cbpf reuseport program inserted to distribute incoming connections to
threads based on skb->queue_mapping. skb->queue_mapping values are not
unique (e.g. each NIC will have queue_mapping==0), which is why BINDTODEVICE
is needed. Again, see below; a minimal sketch of such a filter follows this
list.

9. worker thread epoll contexts are set to busy poll by the ioctl I've
submitted in my patches.
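
(Step 8 above, sketched out: this is not the exact program from this setup,
just the usual pattern for a cbpf reuseport filter that selects the listen
socket by skb->queue_mapping; attach_queue_mapping_filter() is an
illustrative name:)

  #include <linux/filter.h>   /* struct sock_filter, sock_fprog, SKF_AD_* */
  #include <sys/socket.h>

  static int attach_queue_mapping_filter(int listen_fd)
  {
          struct sock_filter code[] = {
                  /* A = skb->queue_mapping */
                  { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
                  /* return A: used as the index into the reuseport group */
                  { BPF_RET | BPF_A, 0, 0, 0 },
          };
          struct sock_fprog prog = {
                  .len    = sizeof(code) / sizeof(code[0]),
                  .filter = code,
          };

          return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                            &prog, sizeof(prog));
  }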

The first time a worker thread receives a connection, it:

1. calls SO_INCOMING_NAPI_ID to get the NAPI ID associated with the
connection it received.

2. Takes that NAPI ID and calls SIOCGIFNAME_BY_NAPI_ID to figure out which
NIC the connection came in on.

3. Looks for an unused CPU from the list it read in at configuration time
that is associated with that NIC and then pins itself to that CPU. That CPU
is removed from the list so other threads can't take it.

All future incoming connections with the same NAPI ID will be distributed
to app threads which are pinned in the appropriate place and are doing busy
polling.
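
(A hedged sketch of those first-connection steps in C: SO_INCOMING_NAPI_ID
and sched_setaffinity() are real APIs, while lookup_nic_by_napi_id() and
pick_unused_cpu_for() are hypothetical helpers standing in for the
SIOCGIFNAME_BY_NAPI_ID lookup and the per-NIC CPU list from the app config:)

  #define _GNU_SOURCE
  #include <sys/socket.h>
  #include <sched.h>

  /* hypothetical helpers, not real APIs: */
  int lookup_nic_by_napi_id(unsigned int napi_id);
  int pick_unused_cpu_for(int nic);

  static void pin_worker_for(int conn_fd)
  {
          unsigned int napi_id = 0;
          socklen_t len = sizeof(napi_id);
          cpu_set_t set;
          int cpu;

          /* step 1: which NAPI did this connection arrive on? */
          if (getsockopt(conn_fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                         &napi_id, &len))
                  return;

          /* steps 2 and 3: NAPI ID -> NIC -> free NUMA-local CPU */
          cpu = pick_unused_cpu_for(lookup_nic_by_napi_id(napi_id));
          if (cpu < 0)
                  return;

          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          sched_setaffinity(0, sizeof(set), &set);   /* 0 == calling thread */
  }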

So, as you can see, SIOCGIFNAME_BY_NAPI_ID makes this implementation very
simple.

I plan to eventually add some information to the kernel networking
documentation to capture some more details of the above, which I think
might be helpful for others.

Thanks,
Joe

* Longer explanation about SO_BINDTODEVICE (only relevant if you have
multiple NICs):

It turns out that reuseport groups in the kernel are keyed on a few
attributes, port being one of them but also ifindex. Since multiple NICs
can have queue_mapping == 0, reuseport groups need to be constructed in
userland with care if there are multiple NICs. This is required because
each epoll context can only do epoll busy poll on a single NAPI ID. So,
even if multiple NICs have queue_mapping == 0, the queues will have
different NAPI IDs and incoming connections must be distributed to threads
uniquely based on NAPI ID.

I am doing this by creating listen sockets for each NIC, one NIC at a time.
When the listen socket is created, SO_BINDTODEVICE is used on the socket
before calling listen.
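
(For concreteness, a minimal sketch of that per-NIC listener setup;
make_listener() is just an illustrative name, error handling is omitted,
and SO_BINDTODEVICE requires CAP_NET_RAW:)

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
  #include <string.h>
  #include <stdint.h>

  static int make_listener(const char *ifname, uint16_t port)
  {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          int one = 1;
          struct sockaddr_in addr = {
                  .sin_family      = AF_INET,
                  .sin_port        = htons(port),
                  .sin_addr.s_addr = htonl(INADDR_ANY),
          };

          setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
          /* scope the reuseport group to this NIC, per the explanation above */
          setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);
          bind(fd, (struct sockaddr *)&addr, sizeof(addr));
          listen(fd, SOMAXCONN);
          return fd;
  }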

In the kernel, this results in all listen sockets with the same ifindex to
form a reuseport group. So, if I have 2 NICs and 1 listen port (say port
80), this results in 2 reuseport groups -- one for nic0 port 80 and one for
nic1 port 80, because of SO_BINDTODEVICE.

The reuseport cbpf filter is inserted for each reuseport group, and then
the skb->queue_mapping based listen socket selection will work as expected
distributing NAPI IDs to app threads without breaking epoll busy poll.

Without the above, you can run into an issue where two connections with the
same queue_mapping (but from different NICs) can land in the same epoll
context, which breaks busy poll.

Another potential solution to avoid the above might be to use eBPF: build a
hash that maps NAPI IDs to thread IDs and write a more complicated eBPF
program to distribute connections that way. This seemed
cool, but involved a lot more work so I went with the SO_BINDTODEVICE +
SIOCGIFNAME_BY_NAPI_ID method instead which was pretty simple (C code wise)
and easy to implement.
  
Jakub Kicinski Feb. 2, 2024, 7:58 p.m. UTC | #10
On Fri, 2 Feb 2024 11:33:33 -0800 Joe Damato wrote:
> On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> > On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:  
> > > I think you should be able to get this functionality via the netdev-genl 
> > > API to get napi parameters. It returns ifindex as one of the parameters 
> > > and you should able to get the name from ifindex.
> > > 
> > > $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> > > {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}  
> > 
> > FWIW we also have a C library to access those. Out of curiosity what's
> > the programming language you'd use in user space, Joe?  
> 
> I am using C from user space. 

Ah, great! Here comes the advert.. :)

  make -C tools/net/ynl/

will generate the C lib for you. tools/net/ynl/generated/netdev-user.h
will have the full API. There are some samples in
tools/net/ynl/samples/. And basic info also here:
https://docs.kernel.org/next/userspace-api/netlink/intro-specs.html#ynl-lib

You should be able to convert Sridhar's cli.py into an equivalent 
in C in ~10 LoC.
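
(A hedged sketch of roughly what that C equivalent might look like; the
netdev_napi_get_* names below follow the usual ynl codegen pattern but are
assumptions here and should be checked against the generated netdev-user.h:)

  #include <ynl.h>
  #include "netdev-user.h"   /* generated by make -C tools/net/ynl/ */

  /* assumed generated names; verify against tools/net/ynl/generated/ */
  static int napi_to_ifindex(unsigned int napi_id)
  {
          struct ynl_error yerr;
          struct ynl_sock *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
          struct netdev_napi_get_req *req;
          struct netdev_napi_get_rsp *rsp;
          int ifindex = -1;

          if (!ys)
                  return -1;

          req = netdev_napi_get_req_alloc();
          netdev_napi_get_req_set_id(req, napi_id);
          rsp = netdev_napi_get(ys, req);
          netdev_napi_get_req_free(req);

          if (rsp) {
                  if (rsp->_present.ifindex)
                          ifindex = rsp->ifindex;
                  netdev_napi_get_rsp_free(rsp);
          }
          ynl_sock_destroy(ys);
          return ifindex;
  }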

> Curious what you think about
> SIOCGIFNAME_BY_NAPI_ID, Jakub? I think it would be very useful, but not
> sure if such an extension would be accepted. I can send an RFC, if you'd
> like to take a look and consider it. I know you are busy and I don't want
> to add too much noise to the list if I can help it :)

Nothing wrong with it in particular, but we went with the netlink API
because all the objects are related. There are interrupts, NAPI
instances, queues, page pools etc. and we need to show all sort of
attributes, capabilities, stats as well as the linking. So getsockopts
may not scale, or we'd need to create a monster mux getsockopt?
Plus with some luck the netlink API will send you notifications of
things changing.

> Here's a brief description of what I'm doing, which others might find
> helpful:
> 
> 1. Machine has multiple NICs. Each NIC has 1 queue per busy poll app
> thread, plus a few extra queues for other non busy poll usage.
> 
> 2. A custom RSS context is created to distribute flows to the busy poll
> queues. This context is created for each NIC. The default context directs
> flows to the non-busy poll queues.
> 
> 3. Each NIC has n-tuple filters inserted to direct incoming connections
> with certain destination ports (e.g. 80, 443) to the custom RSS context.
> All other incoming connections will land in the default context and go to
> the other queues.
> 
> 4. IRQs for the busy poll queues are pinned to specific CPUs which are NUMA
> local to the NIC.
> 
> 5. IRQ coalescing values are setup with busy poll in mind, so IRQs are
> deferred as much as possible with the assumption userland will drive NAPI
> via epoll_wait. This is done per queue (using ethtool --per-queue and a
> queue mask). This is where napi_defer_hard_irqs and gro_flush_timeout
> could help even more. IRQ deferral is only needed for the busy poll queues.

Did you see SO_PREFER_BUSY_POLL by any chance? (In combination with
gro_flush_timeout IIRC). We added it a while back with Bjorn; it seemed
like a great idea to me at the time, but I'm unclear if anyone uses it
in production.

> 6. userspace app config has NICs with their NUMA local CPUs listed, for
> example like this:
> 
>    - eth0: 0,1,2,3
>    - eth1: 4,5,6,7
> 
> The app reads that configuration in when it starts. Ideally, these are the
> same CPUs the IRQs are pinned to in step 4, but hopefully the coalesce
> settings let IRQs be deferred quite a bit so busy poll can take over.

FWIW if the driver you're using annotates things right you'll also get
the NAPI <> IRQ mapping via the netdev netlink. Hopefully that
simplifies the pinning setup.

> 7. App threads are created and sockets are opened with REUSEPORT. Notably:
> when the sockets are created, SO_BINDTODEVICE is used* (see below for
> longer explanation about this).
> 
> 8. cbpf reusport program inserted to distribute incoming connections to
> threads based on skb->queue_mapping. skb->queue_mapping values are not
> unique (e.g. each NIC will have queue_mapping==0), this is why BINDTODEVICE
> is needed. Again, see below.
> 
> 9. worker thread epoll contexts are set to busy poll by the ioctl I've
> submit in my patches.
> 
> The first time a worker thread receives a connection, it:
> 
> 1. calls SO_INCOMING_NAPI_ID to get the NAPI ID associated with the
> connection it received.
> 
> 2. Takes that NAPI ID and calls SIOCGIFNAME_BY_NAPI_ID to figure out which
> NIC the connection came in on.
> 
> 3. Looks for an un-unsed CPU from the list it read in at configuration time
> that is associated with that NIC and then pins itself to that CPU. That CPU
> is removed from the list so other threads can't take it.
> 
> All future incoming connections with the same NAPI ID will be distributed
> to app threads which are pinned in the appropriate place and are doing busy
> polling.
> 
> So, as you can see, SIOCGIFNAME_BY_NAPI_ID makes this implementation very
> simple.
> 
> I plan to eventually add some information to the kernel networking
> documentation to capture some more details of the above, which I think
> might be helpful for others.

Sounds very sensible & neat indeed. And makes sense to describe this 
in the docs, that should hopefully put more people on the right path :)

> Another potential solution to avoid the above might be use an eBPF program
> and to build a hash that maps NAPI IDs to thread IDs and write a more
> complicated eBPF program to distribute connections that way. This seemed
> cool, but involved a lot more work so I went with the SO_BINDTODEVICE +
> SIOCGIFNAME_BY_NAPI_ID method instead which was pretty simple (C code wise)
> and easy to implement.

Interesting!
  
Joe Damato Feb. 2, 2024, 8:23 p.m. UTC | #11
On Fri, Feb 02, 2024 at 11:58:28AM -0800, Jakub Kicinski wrote:
> On Fri, 2 Feb 2024 11:33:33 -0800 Joe Damato wrote:
> > On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> > > On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:  
> > > > I think you should be able to get this functionality via the netdev-genl 
> > > > API to get napi parameters. It returns ifindex as one of the parameters 
> > > > and you should able to get the name from ifindex.
> > > > 
> > > > $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> > > > {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}  
> > > 
> > > FWIW we also have a C library to access those. Out of curiosity what's
> > > the programming language you'd use in user space, Joe?  
> > 
> > I am using C from user space. 
> 
> Ah, great! Here comes the advert.. :)
> 
>   make -C tools/net/ynl/
> 
> will generate the C lib for you. tools/net/ynl/generated/netdev-user.h
> will have the full API. There are some samples in
> tools/net/ynl/samples/. And basic info also here:
> https://docs.kernel.org/next/userspace-api/netlink/intro-specs.html#ynl-lib
> 
> You should be able to convert Sridhar's cli.py into an equivalent 
> in C in ~10 LoC.
> 
> > Curious what you think about
> > SIOCGIFNAME_BY_NAPI_ID, Jakub? I think it would be very useful, but not
> > sure if such an extension would be accepted. I can send an RFC, if you'd
> > like to take a look and consider it. I know you are busy and I don't want
> > to add too much noise to the list if I can help it :)
> 
> Nothing wrong with it in particular, but we went with the netlink API
> because all the objects are related. There are interrupts, NAPI
> instances, queues, page pools etc. and we need to show all sort of
> attributes, capabilities, stats as well as the linking. So getsockopts
> may not scale, or we'd need to create a monster mux getsockopt?
> Plus with some luck the netlink API will send you notifications of
> things changing.

Yes this all makes sense. The notification on changes would be excellent,
especially if NAPI IDs change (e.g. the queue count is adjusted or the
queues are rebuilt by the driver after something like a timeout).

I think the issue I'm solving with SIOCGIFNAME_BY_NAPI_ID is related, but
different.

In my case, SIOCGIFNAME_BY_NAPI_ID identifies which NIC a specific fd
returned by accept arrived on.

AFAICT, the netlink API wouldn't be able to help me answer that question. I
could use SIOCGIFNAME_BY_NAPI_ID to tell me which NIC the fd is from and
then use netlink to figure out which CPU to bind to (for example), but I
think SIOCGIFNAME_BY_NAPI_ID is still needed.

I'll send an RFC for that shortly, hope that's OK.

> > Here's a brief description of what I'm doing, which others might find
> > helpful:
> > 
> > 1. Machine has multiple NICs. Each NIC has 1 queue per busy poll app
> > thread, plus a few extra queues for other non busy poll usage.
> > 
> > 2. A custom RSS context is created to distribute flows to the busy poll
> > queues. This context is created for each NIC. The default context directs
> > flows to the non-busy poll queues.
> > 
> > 3. Each NIC has n-tuple filters inserted to direct incoming connections
> > with certain destination ports (e.g. 80, 443) to the custom RSS context.
> > All other incoming connections will land in the default context and go to
> > the other queues.
> > 
> > 4. IRQs for the busy poll queues are pinned to specific CPUs which are NUMA
> > local to the NIC.
> > 
> > 5. IRQ coalescing values are setup with busy poll in mind, so IRQs are
> > deferred as much as possible with the assumption userland will drive NAPI
> > via epoll_wait. This is done per queue (using ethtool --per-queue and a
> > queue mask). This is where napi_defer_hard_irqs and gro_flush_timeout
> > could help even more. IRQ deferral is only needed for the busy poll queues.
> 
> Did you see SO_PREFER_BUSY_POLL by any chance? (In combination with
> gro_flush_timeout IIRC). We added it a while back with Bjorn, it seems
> like a great idea to me at the time but I'm unclear if anyone uses it 
> in production..

I have seen it while reading the code, yes. I think maybe I missed
something about its interaction with gro_flush_timeout. In my use case,
the machine has no traffic until after the app is started.

In this case, I haven't needed to worry about regular NAPI monopolizing the
CPU and preventing busy poll from working.

Maybe I am missing something more nuanced, though? I'll have another look
at the code, just in case.

> 
> > 6. userspace app config has NICs with their NUMA local CPUs listed, for
> > example like this:
> > 
> >    - eth0: 0,1,2,3
> >    - eth1: 4,5,6,7
> > 
> > The app reads that configuration in when it starts. Ideally, these are the
> > same CPUs the IRQs are pinned to in step 4, but hopefully the coalesce
> > settings let IRQs be deferred quite a bit so busy poll can take over.
> 
> FWIW if the driver you're using annotates things right you'll also get
> the NAPI <> IRQ mapping via the netdev netlink. Hopefully that
> simplifies the pinning setup.

I looked last night after I learned about the netlink interface. It does
not look like the driver I am using annotates that mapping, but I can fix
that fairly easily, I think.

I'll try to send a patch for this next week.

> > 7. App threads are created and sockets are opened with REUSEPORT. Notably:
> > when the sockets are created, SO_BINDTODEVICE is used* (see below for
> > longer explanation about this).
> > 
> > 8. cbpf reusport program inserted to distribute incoming connections to
> > threads based on skb->queue_mapping. skb->queue_mapping values are not
> > unique (e.g. each NIC will have queue_mapping==0), this is why BINDTODEVICE
> > is needed. Again, see below.
> > 
> > 9. worker thread epoll contexts are set to busy poll by the ioctl I've
> > submit in my patches.
> > 
> > The first time a worker thread receives a connection, it:
> > 
> > 1. calls SO_INCOMING_NAPI_ID to get the NAPI ID associated with the
> > connection it received.
> > 
> > 2. Takes that NAPI ID and calls SIOCGIFNAME_BY_NAPI_ID to figure out which
> > NIC the connection came in on.
> > 
> > 3. Looks for an un-unsed CPU from the list it read in at configuration time
> > that is associated with that NIC and then pins itself to that CPU. That CPU
> > is removed from the list so other threads can't take it.
> > 
> > All future incoming connections with the same NAPI ID will be distributed
> > to app threads which are pinned in the appropriate place and are doing busy
> > polling.
> > 
> > So, as you can see, SIOCGIFNAME_BY_NAPI_ID makes this implementation very
> > simple.
> > 
> > I plan to eventually add some information to the kernel networking
> > documentation to capture some more details of the above, which I think
> > might be helpful for others.
> 
> Sounds very sensible & neat indeed. And makes sense to describe this 
> in the docs, that should hopefully put more people on the right path :)
> 
> > Another potential solution to avoid the above might be use an eBPF program
> > and to build a hash that maps NAPI IDs to thread IDs and write a more
> > complicated eBPF program to distribute connections that way. This seemed
> > cool, but involved a lot more work so I went with the SO_BINDTODEVICE +
> > SIOCGIFNAME_BY_NAPI_ID method instead which was pretty simple (C code wise)
> > and easy to implement.
> 
> Interesting!
  
Samudrala, Sridhar Feb. 2, 2024, 8:50 p.m. UTC | #12
On 2/2/2024 2:23 PM, Joe Damato wrote:
> On Fri, Feb 02, 2024 at 11:58:28AM -0800, Jakub Kicinski wrote:
>> On Fri, 2 Feb 2024 11:33:33 -0800 Joe Damato wrote:
>>> On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
>>>> On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:
>>>>> I think you should be able to get this functionality via the netdev-genl
>>>>> API to get napi parameters. It returns ifindex as one of the parameters
>>>>> and you should able to get the name from ifindex.
>>>>>
>>>>> $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
>>>>> {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}
>>>>
>>>> FWIW we also have a C library to access those. Out of curiosity what's
>>>> the programming language you'd use in user space, Joe?
>>>
>>> I am using C from user space.
>>
>> Ah, great! Here comes the advert.. :)
>>
>>    make -C tools/net/ynl/
>>
>> will generate the C lib for you. tools/net/ynl/generated/netdev-user.h
>> will have the full API. There are some samples in
>> tools/net/ynl/samples/. And basic info also here:
>> https://docs.kernel.org/next/userspace-api/netlink/intro-specs.html#ynl-lib
>>
>> You should be able to convert Sridhar's cli.py into an equivalent
>> in C in ~10 LoC.
>>
>>> Curious what you think about
>>> SIOCGIFNAME_BY_NAPI_ID, Jakub? I think it would be very useful, but not
>>> sure if such an extension would be accepted. I can send an RFC, if you'd
>>> like to take a look and consider it. I know you are busy and I don't want
>>> to add too much noise to the list if I can help it :)
>>
>> Nothing wrong with it in particular, but we went with the netlink API
>> because all the objects are related. There are interrupts, NAPI
>> instances, queues, page pools etc. and we need to show all sort of
>> attributes, capabilities, stats as well as the linking. So getsockopts
>> may not scale, or we'd need to create a monster mux getsockopt?
>> Plus with some luck the netlink API will send you notifications of
>> things changing.
> 
> Yes this all makes sense. The notification on changes would be excellent,
> especially if NAPI IDs get changed for some reason  (e.g. the queue count
> is adjusted or the queues are rebuilt by the driver for some reason like a
> timeout, etc).
> 
> I think the issue I'm solving with SIOCGIFNAME_BY_NAPI_ID is related, but
> different.
> 
> In my case, SIOCGIFNAME_BY_NAPI_ID identifies which NIC a specific fd from
> accept arrived from.
> 
> AFAICT, the netlink API wouldn't be able to help me answer that question. I
> could use SIOCGIFNAME_BY_NAPI_ID to tell me which NIC the fd is from and
> then use netlink to figure out which CPU to bind to (for example), but I
> think SIOCGIFNAME_BY_NAPI_ID is still needed.

The napi-get netlink api takes napi_id and returns ifindex, irq and pid 
associated with the napi id. You can then pass ifindex to the 
SIOCGIFNAME ioctl to get the interface name. So it is definitely 
possible without the need for the new SIOCGIFNAME_BY_NAPI_ID ioctl.

> 
> I'll send an RFC about for that shortly, hope that's OK.
>
  
Joe Damato Feb. 2, 2024, 8:55 p.m. UTC | #13
On Fri, Feb 02, 2024 at 02:50:58PM -0600, Samudrala, Sridhar wrote:
> 
> 
> On 2/2/2024 2:23 PM, Joe Damato wrote:
> >On Fri, Feb 02, 2024 at 11:58:28AM -0800, Jakub Kicinski wrote:
> >>On Fri, 2 Feb 2024 11:33:33 -0800 Joe Damato wrote:
> >>>On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> >>>>On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:
> >>>>>I think you should be able to get this functionality via the netdev-genl
> >>>>>API to get napi parameters. It returns ifindex as one of the parameters
> >>>>>and you should able to get the name from ifindex.
> >>>>>
> >>>>>$ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> >>>>>{'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}
> >>>>
> >>>>FWIW we also have a C library to access those. Out of curiosity what's
> >>>>the programming language you'd use in user space, Joe?
> >>>
> >>>I am using C from user space.
> >>
> >>Ah, great! Here comes the advert.. :)
> >>
> >>   make -C tools/net/ynl/
> >>
> >>will generate the C lib for you. tools/net/ynl/generated/netdev-user.h
> >>will have the full API. There are some samples in
> >>tools/net/ynl/samples/. And basic info also here:
> >>https://docs.kernel.org/next/userspace-api/netlink/intro-specs.html#ynl-lib
> >>
> >>You should be able to convert Sridhar's cli.py into an equivalent
> >>in C in ~10 LoC.
> >>
> >>>Curious what you think about
> >>>SIOCGIFNAME_BY_NAPI_ID, Jakub? I think it would be very useful, but not
> >>>sure if such an extension would be accepted. I can send an RFC, if you'd
> >>>like to take a look and consider it. I know you are busy and I don't want
> >>>to add too much noise to the list if I can help it :)
> >>
> >>Nothing wrong with it in particular, but we went with the netlink API
> >>because all the objects are related. There are interrupts, NAPI
> >>instances, queues, page pools etc. and we need to show all sort of
> >>attributes, capabilities, stats as well as the linking. So getsockopts
> >>may not scale, or we'd need to create a monster mux getsockopt?
> >>Plus with some luck the netlink API will send you notifications of
> >>things changing.
> >
> >Yes this all makes sense. The notification on changes would be excellent,
> >especially if NAPI IDs get changed for some reason  (e.g. the queue count
> >is adjusted or the queues are rebuilt by the driver for some reason like a
> >timeout, etc).
> >
> >I think the issue I'm solving with SIOCGIFNAME_BY_NAPI_ID is related, but
> >different.
> >
> >In my case, SIOCGIFNAME_BY_NAPI_ID identifies which NIC a specific fd from
> >accept arrived from.
> >
> >AFAICT, the netlink API wouldn't be able to help me answer that question. I
> >could use SIOCGIFNAME_BY_NAPI_ID to tell me which NIC the fd is from and
> >then use netlink to figure out which CPU to bind to (for example), but I
> >think SIOCGIFNAME_BY_NAPI_ID is still needed.
> 
> The napi-get netlink api takes napi_id and returns ifindex, irq and pid
> associated with the napi id. You can then pass ifindex to the SIOCGIFNAME
> ioctl to get the interface name. So it is definitely possible without the
> need for the new SIOCGIFNAME_BY_NAPI_ID

Ah, I see. OK. In that case, I won't bother with the RFC for
SIOCGIFNAME_BY_NAPI_ID.

I'll give your suggestion a try next week after I make the driver changes
needed to support it.
  
Jakub Kicinski Feb. 3, 2024, 1:15 a.m. UTC | #14
On Fri, 2 Feb 2024 12:23:44 -0800 Joe Damato wrote:
> > Did you see SO_PREFER_BUSY_POLL by any chance? (In combination with
> > gro_flush_timeout IIRC). We added it a while back with Bjorn, it seems
> > like a great idea to me at the time but I'm unclear if anyone uses it 
> > in production..  
> 
> I have seen it while reading the code, yes. I think maybe I missed
> something about its interaction with gro_flush_timeout. In my use case,
> the machine has no traffic until after the app is started.
> 
> In this case, I haven't needed to worry about regular NAPI monopolizing the
> CPU and preventing busy poll from working.
> 
> Maybe I am missing something more nuanced, though? I'll have another look
> at the code, just incase.

We reused the gro_flush_timeout as an existing "user doesn't care if
packets get delayed by this much in worst case" value. If you set
SO_PREFER_BUSY_POLL the next time you busy poll, the NAPI will be marked 
as "already scheduled" and a timer is set (to gro_flush_timeout).
If NIC IRQ fires before gro_flush_timeout it gets ignored, because NAPI
is already marked as scheduled.
If you busy poll again the timer gets postponed for another
gro_flush_timeout nsec.
If timer fires we go back to normal NAPI processing.

The idea is that you set gro_flush_timeout to some high value, like 
10 msec, and expect your app to poll more often than every 10 msec. 

Then the normal NAPI processing will never kick in, and there will 
be only 1 NIC IRQ after which the HW IRQ remains masked.
With a high coalescing timer you technically still get an IRQ every
so often and interrupt the app. Worst case (UDP flood) you may even
get into an overload where the app gets starved out completely..
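
(To make the knobs concrete: gro_flush_timeout and napi_defer_hard_irqs are
per-device files under /sys/class/net/<dev>/, with gro_flush_timeout in
nanoseconds. Opting a socket in then looks roughly like the hedged sketch
below; the values in the comments are illustrative, not recommendations:)

  #include <sys/socket.h>
  #include <stdio.h>

  static int prefer_busy_poll(int fd)
  {
          int one = 1;

          /* paired with something like:
           *   echo 10000000 > /sys/class/net/eth0/gro_flush_timeout  (10 msec)
           *   echo 2        > /sys/class/net/eth0/napi_defer_hard_irqs
           * SO_PREFER_BUSY_POLL may need a fallback #define on older libc
           * headers. */
          if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &one, sizeof(one))) {
                  perror("SO_PREFER_BUSY_POLL");
                  return -1;
          }
          return 0;
  }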
  
Stanislav Fomichev Feb. 5, 2024, 6:17 p.m. UTC | #15
On 02/02, Jakub Kicinski wrote:
> On Fri, 2 Feb 2024 12:23:44 -0800 Joe Damato wrote:
> > > Did you see SO_PREFER_BUSY_POLL by any chance? (In combination with
> > > gro_flush_timeout IIRC). We added it a while back with Bjorn, it seems
> > > like a great idea to me at the time but I'm unclear if anyone uses it 
> > > in production..  
> > 
> > I have seen it while reading the code, yes. I think maybe I missed
> > something about its interaction with gro_flush_timeout. In my use case,
> > the machine has no traffic until after the app is started.
> > 
> > In this case, I haven't needed to worry about regular NAPI monopolizing the
> > CPU and preventing busy poll from working.
> > 
> > Maybe I am missing something more nuanced, though? I'll have another look
> > at the code, just incase.
> 
> We reused the gro_flush_timeout as an existing "user doesn't care if
> packets get delayed by this much in worst case" value. If you set
> SO_PREFER_BUSY_POLL the next time you busy pool the NAPI will be marked 
> as "already scheduled" and a timer is set (to gro_flush_timeout).
> If NIC IRQ fires before gro_flush_timeout it gets ignored, because NAPI
> is already marked as scheduled.
> If you busy poll again the timer gets postponed for another
> gro_flush_timeout nsec.
> If timer fires we go back to normal NAPI processing.
> 
> The idea is that you set gro_flush_timeout to some high value, like 
> 10 msec, and expect your app to poll more often than every 10 msec. 
> 
> Then the normal NAPI processing will never kick in, and there will 
> be only 1 NIC IRQ after which the HW IRQ remains masked.
> With high coalescing timer you technically still get an IRQ every
> so often and interrupt the app. Worst case (UDP flood) you may even
> get into an overload where the app gets starved out completely..

Should we also add a prefer_busy_poll parameter to EPIOCSPARAMS?
napi_busy_loop in ep_busy_loop has its prefer_busy_poll argument
hard-coded as false.
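
(For concreteness, a hedged sketch of what that could look like from user
space. The struct fields and EPIOCSPARAMS come from the series under
discussion plus the field proposed here; nothing below is merged UAPI, and
enable_epoll_busy_poll() is just an illustrative name:)

  #include <sys/ioctl.h>
  #include <linux/eventpoll.h>   /* EPIOCSPARAMS + struct epoll_params,
                                  * from headers with this series applied */

  static int enable_epoll_busy_poll(int epfd)
  {
          struct epoll_params p = {
                  .busy_poll_usecs  = 64,   /* illustrative values */
                  .busy_poll_budget = 16,
                  .prefer_busy_poll = 1,    /* the field proposed here */
          };

          return ioctl(epfd, EPIOCSPARAMS, &p);
  }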
  
Joe Damato Feb. 5, 2024, 6:51 p.m. UTC | #16
On Fri, Feb 02, 2024 at 05:15:39PM -0800, Jakub Kicinski wrote:
> On Fri, 2 Feb 2024 12:23:44 -0800 Joe Damato wrote:
> > > Did you see SO_PREFER_BUSY_POLL by any chance? (In combination with
> > > gro_flush_timeout IIRC). We added it a while back with Bjorn, it seems
> > > like a great idea to me at the time but I'm unclear if anyone uses it 
> > > in production..  
> > 
> > I have seen it while reading the code, yes. I think maybe I missed
> > something about its interaction with gro_flush_timeout. In my use case,
> > the machine has no traffic until after the app is started.
> > 
> > In this case, I haven't needed to worry about regular NAPI monopolizing the
> > CPU and preventing busy poll from working.
> > 
> > Maybe I am missing something more nuanced, though? I'll have another look
> > at the code, just incase.
> 
> We reused the gro_flush_timeout as an existing "user doesn't care if
> packets get delayed by this much in worst case" value. If you set
> SO_PREFER_BUSY_POLL the next time you busy pool the NAPI will be marked 
> as "already scheduled" and a timer is set (to gro_flush_timeout).
> If NIC IRQ fires before gro_flush_timeout it gets ignored, because NAPI
> is already marked as scheduled.
> If you busy poll again the timer gets postponed for another
> gro_flush_timeout nsec.
> If timer fires we go back to normal NAPI processing.

Ah, I see. From my reading of the code in busy_poll_stop (which could be
wrong), defer_hard_irqs_count must also be non-zero to postpone the timer.

Is that right?

If so, I think the tricky thing with this is that these settings are
per-device, so they'd affect non-busy poll apps sharing the NIC, too.

I think in the ideal case being able to set these on a per-NAPI basis would
be very helpful. Maybe something for me to try working on next.

> The idea is that you set gro_flush_timeout to some high value, like 
> 10 msec, and expect your app to poll more often than every 10 msec. 

Yea, that makes sense.

> Then the normal NAPI processing will never kick in, and there will 
> be only 1 NIC IRQ after which the HW IRQ remains masked.
> With high coalescing timer you technically still get an IRQ every
> so often and interrupt the app. Worst case (UDP flood) you may even
> get into an overload where the app gets starved out completely..

Yup, this is true. I had been using a modified version of a patch from a
research paper to avoid enabling NIC IRQs [1][2], but I think making
defer_hard_irqs_count and gro_flush_timeout per NAPI parameters would make
more sense.

[1]: https://gitlab.uwaterloo.ca/p5cai/netstack-exp/-/raw/master/kernel-polling-5.15.79-base.patch?ref_type=heads
[2]: https://dl.acm.org/doi/pdf/10.1145/3626780
  
Joe Damato Feb. 5, 2024, 6:52 p.m. UTC | #17
On Mon, Feb 05, 2024 at 10:17:03AM -0800, Stanislav Fomichev wrote:
> On 02/02, Jakub Kicinski wrote:
> > On Fri, 2 Feb 2024 12:23:44 -0800 Joe Damato wrote:
> > > > Did you see SO_PREFER_BUSY_POLL by any chance? (In combination with
> > > > gro_flush_timeout IIRC). We added it a while back with Bjorn, it seems
> > > > like a great idea to me at the time but I'm unclear if anyone uses it 
> > > > in production..  
> > > 
> > > I have seen it while reading the code, yes. I think maybe I missed
> > > something about its interaction with gro_flush_timeout. In my use case,
> > > the machine has no traffic until after the app is started.
> > > 
> > > In this case, I haven't needed to worry about regular NAPI monopolizing the
> > > CPU and preventing busy poll from working.
> > > 
> > > Maybe I am missing something more nuanced, though? I'll have another look
> > > at the code, just incase.
> > 
> > We reused the gro_flush_timeout as an existing "user doesn't care if
> > packets get delayed by this much in worst case" value. If you set
> > SO_PREFER_BUSY_POLL the next time you busy pool the NAPI will be marked 
> > as "already scheduled" and a timer is set (to gro_flush_timeout).
> > If NIC IRQ fires before gro_flush_timeout it gets ignored, because NAPI
> > is already marked as scheduled.
> > If you busy poll again the timer gets postponed for another
> > gro_flush_timeout nsec.
> > If timer fires we go back to normal NAPI processing.
> > 
> > The idea is that you set gro_flush_timeout to some high value, like 
> > 10 msec, and expect your app to poll more often than every 10 msec. 
> > 
> > Then the normal NAPI processing will never kick in, and there will 
> > be only 1 NIC IRQ after which the HW IRQ remains masked.
> > With high coalescing timer you technically still get an IRQ every
> > so often and interrupt the app. Worst case (UDP flood) you may even
> > get into an overload where the app gets starved out completely..
> 
> Should we also add prefer_busy_poll parameter to EPIOCSPARAMS?
> napi_busy_loop in ep_busy_loop has its prefer_busy_poll argument
> hard-coded as false.

I think making this configurable is a good idea. I can add that in the v6
in addition to fixing the incorrect commit message in patch 1/3.

What do you think, Jakub?
  
Jakub Kicinski Feb. 5, 2024, 7:48 p.m. UTC | #18
On Mon, 5 Feb 2024 10:52:18 -0800 Joe Damato wrote:
> > Should we also add prefer_busy_poll parameter to EPIOCSPARAMS?
> > napi_busy_loop in ep_busy_loop has its prefer_busy_poll argument
> > hard-coded as false.  

Good catch! We only plumbed it thru to socket read busy poll.

> I think making this configurable is a good idea. I can add that in the v6
> in addition to fixing the incorrect commit message in patch 1/3.
> 
> What do you think, Jakub?

SGTM (with the caveat that I'm not an epoll expert)
  
Jakub Kicinski Feb. 5, 2024, 7:59 p.m. UTC | #19
On Mon, 5 Feb 2024 10:51:30 -0800 Joe Damato wrote:
> > We reused the gro_flush_timeout as an existing "user doesn't care if
> > packets get delayed by this much in worst case" value. If you set
> > SO_PREFER_BUSY_POLL the next time you busy pool the NAPI will be marked 
> > as "already scheduled" and a timer is set (to gro_flush_timeout).
> > If NIC IRQ fires before gro_flush_timeout it gets ignored, because NAPI
> > is already marked as scheduled.
> > If you busy poll again the timer gets postponed for another
> > gro_flush_timeout nsec.
> > If timer fires we go back to normal NAPI processing.  
> 
> Ah, I see. From my reading of the code in busy_poll_stop (which could be
> wrong), defer_hard_irqs_count must also be non-zero to postpone the timer.
> 
> Is that right?
> 
> If so, I think the tricky thing with this is that these settings are
> system-wide, so they'd affect non-busy poll apps, too.
> 
> I think in the ideal case being able to set these on a per-NAPI basis would
> be very helpful. Maybe something for me to try working on next.

I wonder if it'd be good enough to do:

	min(defer_hard_irqs_count, 1)

there. If the caller asked to prefer busy poll, they clearly want to poll.
An explicit per-NAPI API works too, but it's a bit more work.
If I was doing the work I'd try min(..., 1) with the workload.
If there's value in having the full config - go for it.