[net-next,v3,01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

Message ID 20230620145338.1300897-2-dhowells@redhat.com
State New
Series splice, net: Switch over users of sendpage() and remove it

Commit Message

David Howells June 20, 2023, 2:53 p.m. UTC
  If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
some data that's resident in the slab, copy it rather than returning EIO.
This can be made use of by a number of drivers in the kernel, including:
iwarp, ceph/rds, dlm, nvme, ocfs2, drbd.  It could also be used by iscsi,
rxrpc, sunrpc, cifs and probably others.

skb_splice_from_iter() is given its own fragment allocator as
page_frag_alloc_align() can't be used because it does no locking to prevent
parallel callers from racing.  alloc_skb_frag() uses a separate folio for
each cpu and locks to the cpu whilst allocating, reenabling cpu migration
around folio allocation.
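
For example, a user inside the network stack could keep a copy of a small
protocol header alive for as long as the skb with something like the
following (illustrative only; "hdr"/"hdr_len" are placeholder names and the
skb length/truesize accounting is omitted):

	void *p = copy_skb_frag(hdr, hdr_len, GFP_KERNEL);

	if (!p)
		return -ENOMEM;

	/* The page reference obtained from the allocator travels with the
	 * skb fragment and is dropped when the skb is freed.
	 */
	skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
			   virt_to_page(p), offset_in_page(p), hdr_len);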

This could allocate a whole page instead for each fragment to be copied, as
alloc_skb_with_frags() would do, but that would waste a lot of
space (most of the fragments look like they're going to be small).

This allows an entire message that consists of, say, a protocol header or
two, a number of pages of data and a protocol footer to be sent using a
single call to sock_sendmsg().

The callers could be made to copy the data into fragments before calling
sendmsg(), but that then penalises them if MSG_SPLICE_PAGES gets ignored.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Duyck <alexander.duyck@gmail.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: David Ahern <dsahern@kernel.org>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Menglong Dong <imagedong@tencent.com>
cc: netdev@vger.kernel.org
---

Notes:
    ver #3)
     - Remove duplicate decl of skb_splice_from_iter().
    
    ver #2)
     - Fix parameter to put_cpu_ptr() to have an '&'.

 include/linux/skbuff.h |   2 +
 net/core/skbuff.c      | 171 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 170 insertions(+), 3 deletions(-)
  

Comments

Jakub Kicinski June 22, 2023, 6:12 p.m. UTC | #1
On Tue, 20 Jun 2023 15:53:20 +0100 David Howells wrote:
> If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> some data that's resident in the slab, copy it rather than returning EIO.

How did that happen? I thought MSG_SPLICE_PAGES comes from former
sendpage users and sendpage can't operate on slab pages.

> This can be made use of by a number of drivers in the kernel, including:
> iwarp, ceph/rds, dlm, nvme, ocfs2, drbd.  It could also be used by iscsi,
> rxrpc, sunrpc, cifs and probably others.
> 
> skb_splice_from_iter() is given its own fragment allocator as
> page_frag_alloc_align() can't be used because it does no locking to prevent
> parallel callers from racing.

The locking is to local_bh_disable(). Does the milliont^w new frag
allocator have any additional benefits?

>  alloc_skb_frag() uses a separate folio for
> each cpu and locks to the cpu whilst allocating, reenabling cpu migration
> around folio allocation.
  
Alexander Duyck June 22, 2023, 6:28 p.m. UTC | #2
On Thu, Jun 22, 2023 at 11:12 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 20 Jun 2023 15:53:20 +0100 David Howells wrote:
> > If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> > some data that's resident in the slab, copy it rather than returning EIO.
>
> How did that happen? I thought MSG_SPLICE_PAGES comes from former
> sendpage users and sendpage can't operate on slab pages.
>
> > This can be made use of by a number of drivers in the kernel, including:
> > iwarp, ceph/rds, dlm, nvme, ocfs2, drbd.  It could also be used by iscsi,
> > rxrpc, sunrpc, cifs and probably others.
> >
> > skb_splice_from_iter() is given its own fragment allocator as
> > page_frag_alloc_align() can't be used because it does no locking to prevent
> > parallel callers from racing.
>
> The locking is to local_bh_disable(). Does the milliont^w new frag
> allocator have any additional benefits?

Actually, I would be concerned about it causing confusion, since it is
called out as being how we do it for sockets, but we already have
several different socket-based setups using skb_page_frag_refill() and
the like.
  
David Howells June 22, 2023, 7:40 p.m. UTC | #3
Jakub Kicinski <kuba@kernel.org> wrote:

> > If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> > some data that's resident in the slab, copy it rather than returning EIO.
> 
> How did that happen? I thought MSG_SPLICE_PAGES comes from former
> sendpage users and sendpage can't operate on slab pages.

Some of my patches, take the siw one for example, now aggregate all the bits
that make up a message into a single sendmsg() call, including any protocol
header and trailer in the same bio_vec[] as the payload where before it would
have to do, say, sendmsg+sendpage+sendpage+...+sendpage+sendmsg.

I'm trying to make it so that I make the minimum number of sendmsg calls
(ie. 1 where possible) and the loop that processes the data is inside of that.
This offers the opportunity, at least in the future, to append slab data to an
already-existing private fragment in the skbuff.

> The locking is to local_bh_disable(). Does the milliont^w new frag
> allocator have any additional benefits?

It is shareable because it does locking.  Multiple sockets of multiple
protocols can share the pages it has reserved.  It drops the lock around calls
to the page allocator so that GFP_KERNEL/GFP_NOFS can be used with it.

Without this, the page fragment allocator would need to be per-socket, I
think, or be done further up the stack where the higher level drivers would
have to have a fragment bucket per whatever unit they use to deal with the
lack of locking.

Doing it here makes cleanup simpler since I just transfer my ref on the
fragment to the skbuff frag list and it will automatically be cleaned up with
the skbuff.

Willy suggested that I just allocate a page for each thing I want to copy, but
I would rather not do that for, say, an 8-byte bit of protocol data.

David
  
Jakub Kicinski June 22, 2023, 8:28 p.m. UTC | #4
On Thu, 22 Jun 2023 20:40:43 +0100 David Howells wrote:
> > How did that happen? I thought MSG_SPLICE_PAGES comes from former
> > sendpage users and sendpage can't operate on slab pages.  
> 
> Some of my patches, take the siw one for example, now aggregate all the bits
> that make up a message into a single sendmsg() call, including any protocol
> header and trailer in the same bio_vec[] as the payload where before it would
> have to do, say, sendmsg+sendpage+sendpage+...+sendpage+sendmsg.

Maybe it's just me but I'd prefer to keep the clear rule that splice
operates on pages not slab objects. SIW is the software / fake
implementation of RDMA, right? You couldn't have picked a less
important user :(

Paolo indicated that he'll take a look tomorrow, we'll see what he
thinks.

> I'm trying to make it so that I make the minimum number of sendmsg calls
> (ie. 1 where possible) and the loop that processes the data is inside of that.

The in-kernel users can be fixed to not use slab, and user space can't
feed us slab objects.

> This offers the opportunity, at least in the future, to append slab data to an
> already-existing private fragment in the skbuff.

Maybe we can get Eric to comment. The ability to identify "frag type"
seems cool indeed, but I haven't thought about using it to attach
slab objects.

> > The locking is to local_bh_disable(). Does the milliont^w new frag
> > allocator have any additional benefits?  
> 
> It is shareable because it does locking.  Multiple sockets of multiple
> protocols can share the pages it has reserved.  It drops the lock around calls
> to the page allocator so that GFP_KERNEL/GFP_NOFS can be used with it.
> 
> Without this, the page fragment allocator would need to be per-socket, I
> think, or be done further up the stack where the higher level drivers would
> have to have a fragment bucket per whatever unit they use to deal with the
> lack of locking.

There's also the per task frag which can be used under normal conditions
(sk_use_task_frag).

> Doing it here makes cleanup simpler since I just transfer my ref on the
> fragment to the skbuff frag list and it will automatically be cleaned up with
> the skbuff.
> 
> Willy suggested that I just allocate a page for each thing I want to copy, but
> I would rather not do that for, say, an 8-byte bit of protocol data.

TBH my intuition would also be to get a full page and let the callers who
care about performance fix themselves. Assuming we want to let slab
objects in in the first place.
  
David Howells June 22, 2023, 10:54 p.m. UTC | #5
Jakub Kicinski <kuba@kernel.org> wrote:

> Maybe it's just me but I'd prefer to keep the clear rule that splice
> operates on pages not slab objects.

sendpage isn't only being used for splice().  Or were you referring to
splicing pages into socket buffers more generally?

> SIW is the software / fake implementation of RDMA, right? You couldn't have
> picked a less important user :(

ISCSI and sunrpc could both make use of this, as could ceph and others.  I
have patches for sunrpc to make it condense into a single bio_vec[] and
sendmsg() in the server code (ie. nfsd) but for the moment, Chuck wanted me to
just do the xdr payload.

> > This offers the opportunity, at least in the future, to append slab data
> > to an already-existing private fragment in the skbuff.
> 
> Maybe we can get Eric to comment. The ability to identify "frag type"
> seems cool indeed, but I haven't thought about using it to attach
> slab objects.

Unfortunately, you can't attach slab objects.  Their lifetime isn't controlled
by put_page() or folio_put().  kmalloc()/kfree() doesn't refcount them -
they're recycled immediately.  Hence why I was copying them.  (Well, you
could attach, but then you need a callback mechanism).


What I'm trying to do is make it so that the process of calling sock_sendmsg()
with MSG_SPLICE_PAGES looks exactly the same as without: You fill in a
bio_vec[] pointing to your protocol header, the payload and the trailer,
pointing as appropriate to bits of slab, static, stack data or ref'able pages,
and call sendmsg and then the data will get copied or spliced as appropriate
to the page type, whether the MSG_SPLICE_PAGES flag is supplied and whether
the flag is supported.

There are a couple of things I'd like to avoid: (1) having to call
sock_sendmsg() more than once per message and (2) having sendmsg allocate more
space and make a copy of data that you had to copy into a frag before calling
sendmsg.
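
To make that concrete, the calling pattern I have in mind is roughly this
(sketch only; sock, hdr, trailer and payload_page are made-up names rather
than anything taken from a particular driver):

	struct bio_vec bv[3];
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

	/* The header and trailer may be slab, static or stack data; the
	 * payload is a refcountable page.  All of it goes out in one call.
	 */
	bvec_set_virt(&bv[0], hdr, hdr_len);
	bvec_set_page(&bv[1], payload_page, payload_len, payload_off);
	bvec_set_virt(&bv[2], trailer, trailer_len);

	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 3,
		      hdr_len + payload_len + trailer_len);
	ret = sock_sendmsg(sock, &msg);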

David
  
Jakub Kicinski June 23, 2023, 2:11 a.m. UTC | #6
On Thu, 22 Jun 2023 23:54:31 +0100 David Howells wrote:
> > Maybe it's just me but I'd prefer to keep the clear rule that splice
> > operates on pages not slab objects.  
> 
> sendpage isn't only being used for splice().  Or were you referring to
> splicing pages into socket buffers more generally?

Yes, sorry, any sort of "zero-copy attachment of data onto a socket
queue".

> > SIW is the software / fake implementation of RDMA, right? You couldn't have
> > picked a less important user :(  
> 
> ISCSI and sunrpc could both make use of this, as could ceph and others.  I
> have patches for sunrpc to make it condense into a single bio_vec[] and
> sendmsg() in the server code (ie. nfsd) but for the moment, Chuck wanted me to
> just do the xdr payload.

But to be clear (and I'm not implying that it's not a strong enough
reason) - the only benefit from letting someone pass headers in a slab
object is that the code already uses kmalloc(), right? IOW it could be
changed to use frags without much of a LoC bloat?

> > Maybe we can get Eric to comment. The ability to identify "frag type"
> > seems cool indeed, but I haven't thought about using it to attach
> > slab objects.  
> 
> Unfortunately, you can't attach slab objects.  Their lifetime isn't controlled
> by put_page() or folio_put().  kmalloc()/kfree() doesn't refcount them -
> they're recycled immediately.  Hence why I was copying them.  (Well, you
> could attach, but then you need a callback mechanism).

Right, right, I thought you were saying that _in the future_ we may try
to attach the slab objects as frags (and presumably copy when someone
tries to ref them). Maybe I over-interpreted.

> What I'm trying to do is make it so that the process of calling sock_sendmsg()
> with MSG_SPLICE_PAGES looks exactly the same as without: You fill in a
> bio_vec[] pointing to your protocol header, the payload and the trailer,
> pointing as appropriate to bits of slab, static, stack data or ref'able pages,
> and call sendmsg and then the data will get copied or spliced as appropriate
> to the page type, whether the MSG_SPLICE_PAGES flag is supplied and whether
> the flag is supported.
> 
> There are a couple of things I'd like to avoid: (1) having to call
> sock_sendmsg() more than once per message and (2) having sendmsg allocate more
> space and make a copy of data that you had to copy into a frag before calling
> sendmsg.

If we're not planning to attach the slab objects as frags, then surely
doing kmalloc() + free() in the caller, and then allocating a frag and
copying the data over in the skb / socket code is also inefficient.
Fixing the caller gives all the benefits you want, and then some.

Granted some form of alloc_skb_frag() needs to be added so that callers
don't curse us, I'd start with something based on sk_page_frag().

Or we could pull the copying out into an intermediate helper which
first replaces all slab objects in the iovec with page frags and then
calls sock_sendmsg()? Maybe that's stupid...

Let's hear what others think. If we can't reach instant agreement --
can you strategically separate out the minimal set of changes required
to just kill MSG_SENDPAGE_NOTLAST. IMHO it's worth getting that into
6.5.
  
Paolo Abeni June 23, 2023, 8:08 a.m. UTC | #7
Hi,

First things first, I'm sorry for the delayed feedback. I was traveling.

On Tue, 2023-06-20 at 15:53 +0100, David Howells wrote:
> If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> some data that's resident in the slab, copy it rather than returning EIO.
> This can be made use of by a number of drivers in the kernel, including:
> iwarp, ceph/rds, dlm, nvme, ocfs2, drbd.  It could also be used by iscsi,
> rxrpc, sunrpc, cifs and probably others.
> 
> skb_splice_from_iter() is given its own fragment allocator as
> page_frag_alloc_align() can't be used because it does no locking to prevent
> parallel callers from racing.  alloc_skb_frag() uses a separate folio for
> each cpu and locks to the cpu whilst allocating, reenabling cpu migration
> around folio allocation.
> 
> This could allocate a whole page instead for each fragment to be copied, as
> alloc_skb_with_frags() would do, but that would waste a lot of
> space (most of the fragments look like they're going to be small).
> 
> This allows an entire message that consists of, say, a protocol header or
> two, a number of pages of data and a protocol footer to be sent using a
> single call to sock_sendmsg().
> 
> The callers could be made to copy the data into fragments before calling
> sendmsg(), but that then penalises them if MSG_SPLICE_PAGES gets ignored.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Alexander Duyck <alexander.duyck@gmail.com>
> cc: Eric Dumazet <edumazet@google.com>
> cc: "David S. Miller" <davem@davemloft.net>
> cc: David Ahern <dsahern@kernel.org>
> cc: Jakub Kicinski <kuba@kernel.org>
> cc: Paolo Abeni <pabeni@redhat.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Menglong Dong <imagedong@tencent.com>
> cc: netdev@vger.kernel.org
> ---
> 
> Notes:
>     ver #3)
>      - Remove duplicate decl of skb_splice_from_iter().
>     
>     ver #2)
>      - Fix parameter to put_cpu_ptr() to have an '&'.
> 
>  include/linux/skbuff.h |   2 +
>  net/core/skbuff.c      | 171 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 170 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 91ed66952580..5f53bd5d375d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -5037,6 +5037,8 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
>  #endif
>  }
>  
> +void *alloc_skb_frag(size_t fragsz, gfp_t gfp);
> +void *copy_skb_frag(const void *s, size_t len, gfp_t gfp);
>  ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
>  			     ssize_t maxsize, gfp_t gfp);
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index fee2b1c105fe..d962c93a429d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6755,6 +6755,145 @@ nodefer:	__kfree_skb(skb);
>  		smp_call_function_single_async(cpu, &sd->defer_csd);
>  }
>  
> +struct skb_splice_frag_cache {
> +	struct folio	*folio;
> +	void		*virt;
> +	unsigned int	offset;
> +	/* we maintain a pagecount bias, so that we dont dirty cache line
> +	 * containing page->_refcount every time we allocate a fragment.
> +	 */
> +	unsigned int	pagecnt_bias;
> +	bool		pfmemalloc;
> +};
> +
> +static DEFINE_PER_CPU(struct skb_splice_frag_cache, skb_splice_frag_cache);
> +
> +/**
> + * alloc_skb_frag - Allocate a page fragment for using in a socket
> + * @fragsz: The size of fragment required
> + * @gfp: Allocation flags
> + */
> +void *alloc_skb_frag(size_t fragsz, gfp_t gfp)
> +{
> +	struct skb_splice_frag_cache *cache;
> +	struct folio *folio, *spare = NULL;
> +	size_t offset, fsize;
> +	void *p;
> +
> +	if (WARN_ON_ONCE(fragsz == 0))
> +		fragsz = 1;
> +
> +	cache = get_cpu_ptr(&skb_splice_frag_cache);
> +reload:
> +	folio = cache->folio;
> +	offset = cache->offset;
> +try_again:
> +	if (fragsz > offset)
> +		goto insufficient_space;
> +
> +	/* Make the allocation. */
> +	cache->pagecnt_bias--;
> +	offset = ALIGN_DOWN(offset - fragsz, SMP_CACHE_BYTES);
> +	cache->offset = offset;
> +	p = cache->virt + offset;
> +	put_cpu_ptr(&skb_splice_frag_cache);
> +	if (spare)
> +		folio_put(spare);
> +	return p;
> +
> +insufficient_space:
> +	/* See if we can refurbish the current folio. */
> +	if (!folio || !folio_ref_sub_and_test(folio, cache->pagecnt_bias))
> +		goto get_new_folio;
> +	if (unlikely(cache->pfmemalloc)) {
> +		__folio_put(folio);
> +		goto get_new_folio;
> +	}
> +
> +	fsize = folio_size(folio);
> +	if (unlikely(fragsz > fsize))
> +		goto frag_too_big;
> +
> +	/* OK, page count is 0, we can safely set it */
> +	folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
> +
> +	/* Reset page count bias and offset to start of new frag */
> +	cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> +	offset = fsize;
> +	goto try_again;

IMHO this function uses a bit too many labels and would be easier to
read, e.g. by moving the above chunk of code into a conditional branch.

Even without such a change, I think the above 'goto try_again;'
introduces an unneeded conditional, as at this point we know 'fragsz <=
fsize'.

> +
> +get_new_folio:
> +	if (!spare) {
> +		cache->folio = NULL;
> +		put_cpu_ptr(&skb_splice_frag_cache);
> +
> +#if PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE
> +		spare = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY |
> +				    __GFP_NOMEMALLOC,
> +				    PAGE_FRAG_CACHE_MAX_ORDER);
> +		if (!spare)
> +#endif
> +			spare = folio_alloc(gfp, 0);
> +		if (!spare)
> +			return NULL;
> +
> +		cache = get_cpu_ptr(&skb_splice_frag_cache);
> +		/* We may now be on a different cpu and/or someone else may
> +		 * have refilled it
> +		 */
> +		cache->pfmemalloc = folio_is_pfmemalloc(spare);
> +		if (cache->folio)
> +			goto reload;

I think there is some problem with the above.

If cache->folio is != NULL, and cache->folio was not pfmemalloc-ed
while the spare one is, it looks like the wrong policy will be used.
And it would be even worse if folio was pfmemalloc-ed while spare is not.

I think moving 'cache->pfmemalloc' initialization...

> +	}
> +

... here should fix the above.

> +	cache->folio = spare;
> +	cache->virt  = folio_address(spare);
> +	folio = spare;
> +	spare = NULL;
> +
> +	/* Even if we own the page, we do not use atomic_set().  This would
> +	 * break get_page_unless_zero() users.
> +	 */
> +	folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
> +
> +	/* Reset page count bias and offset to start of new frag */
> +	cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> +	offset = folio_size(folio);
> +	goto try_again;

What if fragsz > PAGE_SIZE, we are consistently unable to allocate a
high-order page, but order-0, pfmemalloc-ed page allocation is
successful? It looks like this could become an unbounded loop?

> +
> +frag_too_big:
> +	/* The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
> +	 * but the cache isn't big enough to satisfy the request, this may
> +	 * happen in low memory conditions.  We don't release the cache page
> +	 * because it could make memory pressure worse so we simply return NULL
> +	 * here.
> +	 */
> +	cache->offset = offset;
> +	put_cpu_ptr(&skb_splice_frag_cache);
> +	if (spare)
> +		folio_put(spare);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(alloc_skb_frag);
> +
> +/**
> + * copy_skb_frag - Copy data into a page fragment.
> + * @s: The data to copy
> + * @len: The size of the data
> + * @gfp: Allocation flags
> + */
> +void *copy_skb_frag(const void *s, size_t len, gfp_t gfp)
> +{
> +	void *p;
> +
> +	p = alloc_skb_frag(len, gfp);
> +	if (!p)
> +		return NULL;
> +
> +	return memcpy(p, s, len);
> +}
> +EXPORT_SYMBOL(copy_skb_frag);
> +
>  static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
>  				 size_t offset, size_t len)
>  {
> @@ -6808,17 +6947,43 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
>  			break;
>  		}
>  
> +		if (space == 0 &&
> +		    !skb_can_coalesce(skb, skb_shinfo(skb)->nr_frags,
> +				      pages[0], off)) {
> +			iov_iter_revert(iter, len);
> +			break;
> +		}
> +
>  		i = 0;
>  		do {
>  			struct page *page = pages[i++];
>  			size_t part = min_t(size_t, PAGE_SIZE - off, len);
> -
> -			ret = -EIO;
> -			if (WARN_ON_ONCE(!sendpage_ok(page)))
> +			bool put = false;
> +
> +			if (PageSlab(page)) {

I'm a bit concerned by the above. If I read correctly, tcp 0-copy
will go through that for every page, even if the expected use-case is
always !PageSlab(page). compound_head() could be costly if the head
page is not hot on cache and I'm not sure if that could be the case for
tcp 0-copy. The bottom line is that I fear a possible regression here.

Given all the above, and the late stage of the current devel cycle,
would you consider slicing down this series to just kill
MSG_SENDPAGE_NOTLAST, as Jakub suggested?

Thanks!

Paolo
  
David Howells June 23, 2023, 9:06 a.m. UTC | #8
Paolo Abeni <pabeni@redhat.com> wrote:

> IMHO this function uses a bit too many labels and would be easier to
> read, e.g. by moving the above chunk of code into a conditional branch.

Maybe.  I was trying to put the fast path up at the top without the slow path
bits in it, but I can put the "insufficient_space" bit there.

> Even without such a change, I think the above 'goto try_again;'
> introduces an unneeded conditional, as at this point we know 'fragsz <=
> fsize'.

Good point.

> > +		cache->pfmemalloc = folio_is_pfmemalloc(spare);
> > +		if (cache->folio)
> > +			goto reload;
> 
> I think there is some problem with the above.
> 
> If cache->folio is != NULL, and cache->folio was not pfmemalloc-ed
> while the spare one is, it looks like the wrong policy will be used.
> And it would be even worse if folio was pfmemalloc-ed while spare is not.
> 
> I think moving 'cache->pfmemalloc' initialization...
> 
> > +	}
> > +
> 
> ... here should fix the above.

Yeah.  We might have raced with someone else or been moved to another cpu and
there might now be a folio we can allocate from.
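
That would make the tail of the slow path look something like this (the same
code as in the patch, just with the pfmemalloc assignment moved down to the
point where we know the spare folio is really going to be installed):

		cache = get_cpu_ptr(&skb_splice_frag_cache);
		/* We may now be on a different cpu and/or someone else may
		 * have refilled it
		 */
		if (cache->folio)
			goto reload;
	}

	cache->pfmemalloc = folio_is_pfmemalloc(spare);
	cache->folio = spare;
	cache->virt  = folio_address(spare);
	folio = spare;
	spare = NULL;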

> > +	/* Reset page count bias and offset to start of new frag */
> > +	cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> > +	offset = folio_size(folio);
> > +	goto try_again;
> 
> What if fragsz > PAGE_SIZE, we are consistently unable to allocate a
> high-order page, but order-0, pfmemalloc-ed page allocation is
> successful? It looks like this could become an unbounded loop?

It shouldn't.  It should go:

	try_again:
		if (fragsz > offset)
			goto insufficient_space;
	insufficient_space:
		/* See if we can refurbish the current folio. */
		...
		fsize = folio_size(folio);
		if (unlikely(fragsz > fsize))
			goto frag_too_big;
	frag_too_big:
		...
		return NULL;

Though for safety's sake, it would make sense to put in a size check in the
case we fail to allocate a larger-order folio.

> >  		do {
> >  			struct page *page = pages[i++];
> >  			size_t part = min_t(size_t, PAGE_SIZE - off, len);
> > -
> > -			ret = -EIO;
> > -			if (WARN_ON_ONCE(!sendpage_ok(page)))
> > +			bool put = false;
> > +
> > +			if (PageSlab(page)) {
> 
> I'm a bit concerned by the above. If I read correctly, tcp 0-copy

Well, splice()-to-tcp will; MSG_ZEROCOPY is unaffected.

> will go through that for every page, even if the expected use-case is
> always !PageSlab(page). compound_head() could be costly if the head
> page is not hot on cache and I'm not sure if that could be the case for
> tcp 0-copy. The bottom line is that I fear a possible regression here.

I can put the PageSlab() check inside the sendpage_ok() so the page flag is
only checked once.  But PageSlab() doesn't check the headpage, only the page
it is given.  sendpage_ok() is more the problem as it also calls
page_count().  I could drop the check.

David
  
David Howells June 23, 2023, 9:08 a.m. UTC | #9
Jakub Kicinski <kuba@kernel.org> wrote:

> If we can't reach instant agreement --
> can you strategically separate out the minimal set of changes required
> to just kill MSG_SENDPAGE_NOTLAST. IMHO it's worth getting that into
> 6.5.

Paolo Abeni <pabeni@redhat.com> wrote:

> Given all the above, and the late stage of the current devel cycle,
> would you consider slicing down this series to just kill
> MSG_SENDPAGE_NOTLAST, as Jakub suggested?

I could do that.

There is also another alternative.  I could just push the sendpage wrappers up
the stack into the higher-level callers.  Basically this:

int udp_sendpage(struct sock *sk, struct page *page, int offset,
		 size_t size, int flags)
{
	struct bio_vec bvec;
	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES };

	if (flags & MSG_SENDPAGE_NOTLAST)
		msg.msg_flags |= MSG_MORE;

	bvec_set_page(&bvec, page, size, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
	return udp_sendmsg(sk, &msg, size);
}

and kill off sendpage and MSG_SENDPAGE_NOTLAST.

David
  
Paolo Abeni June 23, 2023, 9:37 a.m. UTC | #10
On Fri, 2023-06-23 at 10:06 +0100, David Howells wrote:
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > IMHO this function uses a bit too many labels and would be easier to
> > read, e.g. by moving the above chunk of code into a conditional branch.
> 
> Maybe.  I was trying to put the fast path up at the top without the slow path
> bits in it, but I can put the "insufficient_space" bit there.

I *think* you could move the insufficient_space into a separate helper,
that should achieve your goal with fewer labels and hopefully no
additional complexity.

> 
> > Even without such a change, I think the above 'goto try_again;'
> > introduces an unneeded conditional, as at this point we know 'fragsz <=
> > fsize'.
> 
> Good point.
> 
> > > +		cache->pfmemalloc = folio_is_pfmemalloc(spare);
> > > +		if (cache->folio)
> > > +			goto reload;
> > 
> > I think there is some problem with the above.
> > 
> > If cache->folio is != NULL, and cache->folio was not pfmemalloc-ed
> > while the spare one is, it looks like the wrong policy will be used.
> > And it would be even worse if folio was pfmemalloc-ed while spare is not.
> > 
> > I think moving 'cache->pfmemalloc' initialization...
> > 
> > > +	}
> > > +
> > 
> > ... here should fix the above.
> 
> Yeah.  We might have raced with someone else or been moved to another cpu and
> there might now be a folio we can allocate from.
> 
> > > +	/* Reset page count bias and offset to start of new frag */
> > > +	cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> > > +	offset = folio_size(folio);
> > > +	goto try_again;
> > 
> > What if fragsz > PAGE_SIZE, we are consistently unable to allocate a
> > high-order page, but order-0, pfmemalloc-ed page allocation is
> > successful? It looks like this could become an unbounded loop?
> 
> It shouldn't.  It should go:
> 
> 	try_again:
> 		if (fragsz > offset)
> 			goto insufficient_space;
> 	insufficient_space:
> 		/* See if we can refurbish the current folio. */
> 		...

I think the critical path is with pfmemalloc-ed pages:

		if (unlikely(cache->pfmemalloc)) {
			__folio_put(folio);
			goto get_new_folio;
		}

just before the following.

> 		fsize = folio_size(folio);
> 		if (unlikely(fragsz > fsize))
> 			goto frag_too_big;
> 	frag_too_big:
> 		...
> 		return NULL;
> 
> Though for safety's sake, it would make sense to put in a size check in the
> case we fail to allocate a larger-order folio.
> 
> > >  		do {
> > >  			struct page *page = pages[i++];
> > >  			size_t part = min_t(size_t, PAGE_SIZE - off, len);
> > > -
> > > -			ret = -EIO;
> > > -			if (WARN_ON_ONCE(!sendpage_ok(page)))
> > > +			bool put = false;
> > > +
> > > +			if (PageSlab(page)) {
> > 
> > I'm a bit concerned by the above. If I read correctly, tcp 0-copy
> 
> Well, splice()-to-tcp will; MSG_ZEROCOPY is unaffected.

Ah right! I got lost in some 'if' branch.

> > will go through that for every page, even if the expected use-case is
> > always !PageSlab(page). compound_head() could be costly if the head
> > page is not hot on cache and I'm not sure if that could be the case for
> > tcp 0-copy. The bottom line is that I fear a possible regression here.
> 
> I can put the PageSlab() check inside the sendpage_ok() so the page flag is
> only checked once.  

Perhaps I'm lost again, but AFAICS:

__PAGEFLAG(Slab, slab, PF_NO_TAIL)

// ...
#define __PAGEFLAG(uname, lname, policy)			\
	TESTPAGEFLAG(uname, lname, policy)			\
// ...

#define TESTPAGEFLAG(uname, lname, policy)				\
static __always_inline bool folio_test_##lname(struct folio *folio)	\
{ return test_bit(PG_##lname, folio_flags(folio, FOLIO_##policy));}	\
static __always_inline int Page##uname(struct page *page)               \
{ return test_bit(PG_##lname, &policy(page, 0)->flags); }
// ... 'policy' is PF_NO_TAIL here

#define PF_NO_TAIL(page, enforce) ({                                    \
                VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);	\
                PF_POISONED_CHECK(compound_head(page)); })

It looks at compound_head in the end ?!?

> But PageSlab() doesn't check the headpage, only the page
> it is given.  sendpage_ok() is more the problem as it also calls
> page_count().  I could drop the check.

Once the head page is hot on cache due to the previous check, it should
be cheap?

Cheers,

Paolo
  
Paolo Abeni June 23, 2023, 9:52 a.m. UTC | #11
On Fri, 2023-06-23 at 10:08 +0100, David Howells wrote:
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > If we can't reach instant agreement --
> > can you strategically separate out the minimal set of changes required
> > to just kill MSG_SENDPAGE_NOTLAST. IMHO it's worth getting that into
> > 6.5.
> 
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > Given all the above, and the late stage of the current devel cycle,
> > would you consider slicing down this series to just kill
> > MSG_SENDPAGE_NOTLAST, as Jakub suggested?
> 
> I could do that.
> 
> There is also another alternative.  I could just push the sendpage wrappers up
> the stack into the higher-level callers.  Basically this:
> 
> int udp_sendpage(struct sock *sk, struct page *page, int offset,
> 		 size_t size, int flags)
> {
> 	struct bio_vec bvec;
> 	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES };
> 
> 	if (flags & MSG_SENDPAGE_NOTLAST)
> 		msg.msg_flags |= MSG_MORE;
> 
> 	bvec_set_page(&bvec, page, size, offset);
> 	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
> 	return udp_sendmsg(sk, &msg, size);
> }
> 
> and kill off sendpage and MSG_SENDPAGE_NOTLAST.

I'm unsure I follow the above ?!? I *thought* sendpage could be killed
even without patch 1/18 and 2/18, leaving some patches in this series
unmodified, and mangling those explicitly leveraging 1/18 to use
multiple sendmsg()s with different flags?

I haven't tried to code the above, but my wild guess/hope is that the
delta should be doable - ideally less than the other option.

Introducing slab support should still be possible later, with hopefully
less work.

Cheers,

Paolo
  
David Howells June 23, 2023, 10 a.m. UTC | #12
Paolo Abeni <pabeni@redhat.com> wrote:

> > Maybe.  I was trying to put the fast path up at the top without the slow path
> > bits in it, but I can put the "insufficient_space" bit there.
> 
> I *think* you could move the insufficient_space into a separate helper,
> that should achieve your goal with fewer labels and hopefully no
> additional complexity.

It would require moving things in and out of variables more, but that's
probably fine in the slow path.  The code I derived this from seems to do its
best only to touch memory when it absolutely has to.  But doing what you
suggest would certainly make this more readable, I think.

> > > What if fragsz > PAGE_SIZE, we are consistently unable to allocate a
> > > high-order page, but order-0, pfmemalloc-ed page allocation is
> > > successful? It looks like this could become an unbounded loop?
> > 
> > It shouldn't.  It should go:
> > 
> > 	try_again:
> > 		if (fragsz > offset)
> > 			goto insufficient_space;
> > 	insufficient_space:
> > 		/* See if we can refurbish the current folio. */
> > 		...
> 
> I think the critical path is with pfmemalloc-ed pages:
> 
> 		if (unlikely(cache->pfmemalloc)) {
> 			__folio_put(folio);
> 			goto get_new_folio;
> 		}

I see what you're getting at.  I was thinking that you meant that the critical
bit was that we got into a loop because we never managed to allocate a folio
big enough.

Inserting a check in the event that we fail to allocate an order-3 folio would
take care of that, I think.  After that point, we have a spare folio of
sufficient capacity, even if the folio currently in residence is marked
pfmemalloc.
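
Something like this, perhaps (untested, just to show where the guard would
sit; it gives up if the request can never fit in the folio we managed to
allocate):

#if PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE
		spare = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY |
				    __GFP_NOMEMALLOC,
				    PAGE_FRAG_CACHE_MAX_ORDER);
		if (!spare)
#endif
			spare = folio_alloc(gfp, 0);
		if (!spare)
			return NULL;

		/* If we fell back to a folio that's too small for the
		 * request, don't loop back round with an undersized spare.
		 */
		if (unlikely(fragsz > folio_size(spare))) {
			folio_put(spare);
			return NULL;
		}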

> > > will go through that for every page, even if the expected use-case is
> > > always !PageSlab(page). compound_head() could be costly if the head
> > > page is not hot on cache and I'm not sure if that could be the case for
> > > tcp 0-copy. The bottom line is that I fear a possible regression here.
> > 
> > I can put the PageSlab() check inside the sendpage_ok() so the page flag is
> > only checked once.  
> 
> Perhaps I'm lost again, but AFAICS:
> 
> __PAGEFLAG(Slab, slab, PF_NO_TAIL)
> ...
>                 PF_POISONED_CHECK(compound_head(page)); })
> 
> It looks at compound_head in the end ?!?

Fair point.  Those macros are somewhat hard to read.  Hopefully, all the
compound_head() calls will go away soon-ish.
  
David Howells June 23, 2023, 10:06 a.m. UTC | #13
Paolo Abeni <pabeni@redhat.com> wrote:

> I'm unsure I follow the above ?!? I *thought* sendpage could be killed
> even without patch 1/18 and 2/18, leaving some patches in this series
> unmodified, and mangling those explicitly leveraging 1/18 to use
> multiple sendmsg()s with different flags?

That's what I meant.

With the example, I was showing the minimum needed replacement for a call to
sendpage.

David
  
Paolo Abeni June 23, 2023, 10:21 a.m. UTC | #14
On Fri, 2023-06-23 at 11:06 +0100, David Howells wrote:
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > I'm unsure I follow the above ?!? I *thought* sendpage could be killed
> > even without patch 1/18 and 2/18, leaving some patches in this series
> > unmodified, and mangling those explicitly leveraging 1/18 to use
> > multiple sendmsg()s with different flags?
> 
> That's what I meant.
> 
> With the example, I was showing the minimum needed replacement for a call to
> sendpage.

Thanks, LGTM!

Thanks,

Paolo
  

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..5f53bd5d375d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -5037,6 +5037,8 @@  static inline void skb_mark_for_recycle(struct sk_buff *skb)
 #endif
 }
 
+void *alloc_skb_frag(size_t fragsz, gfp_t gfp);
+void *copy_skb_frag(const void *s, size_t len, gfp_t gfp);
 ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
 			     ssize_t maxsize, gfp_t gfp);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fee2b1c105fe..d962c93a429d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6755,6 +6755,145 @@  nodefer:	__kfree_skb(skb);
 		smp_call_function_single_async(cpu, &sd->defer_csd);
 }
 
+struct skb_splice_frag_cache {
+	struct folio	*folio;
+	void		*virt;
+	unsigned int	offset;
+	/* we maintain a pagecount bias, so that we dont dirty cache line
+	 * containing page->_refcount every time we allocate a fragment.
+	 */
+	unsigned int	pagecnt_bias;
+	bool		pfmemalloc;
+};
+
+static DEFINE_PER_CPU(struct skb_splice_frag_cache, skb_splice_frag_cache);
+
+/**
+ * alloc_skb_frag - Allocate a page fragment for using in a socket
+ * @fragsz: The size of fragment required
+ * @gfp: Allocation flags
+ */
+void *alloc_skb_frag(size_t fragsz, gfp_t gfp)
+{
+	struct skb_splice_frag_cache *cache;
+	struct folio *folio, *spare = NULL;
+	size_t offset, fsize;
+	void *p;
+
+	if (WARN_ON_ONCE(fragsz == 0))
+		fragsz = 1;
+
+	cache = get_cpu_ptr(&skb_splice_frag_cache);
+reload:
+	folio = cache->folio;
+	offset = cache->offset;
+try_again:
+	if (fragsz > offset)
+		goto insufficient_space;
+
+	/* Make the allocation. */
+	cache->pagecnt_bias--;
+	offset = ALIGN_DOWN(offset - fragsz, SMP_CACHE_BYTES);
+	cache->offset = offset;
+	p = cache->virt + offset;
+	put_cpu_ptr(&skb_splice_frag_cache);
+	if (spare)
+		folio_put(spare);
+	return p;
+
+insufficient_space:
+	/* See if we can refurbish the current folio. */
+	if (!folio || !folio_ref_sub_and_test(folio, cache->pagecnt_bias))
+		goto get_new_folio;
+	if (unlikely(cache->pfmemalloc)) {
+		__folio_put(folio);
+		goto get_new_folio;
+	}
+
+	fsize = folio_size(folio);
+	if (unlikely(fragsz > fsize))
+		goto frag_too_big;
+
+	/* OK, page count is 0, we can safely set it */
+	folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+
+	/* Reset page count bias and offset to start of new frag */
+	cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+	offset = fsize;
+	goto try_again;
+
+get_new_folio:
+	if (!spare) {
+		cache->folio = NULL;
+		put_cpu_ptr(&skb_splice_frag_cache);
+
+#if PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE
+		spare = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY |
+				    __GFP_NOMEMALLOC,
+				    PAGE_FRAG_CACHE_MAX_ORDER);
+		if (!spare)
+#endif
+			spare = folio_alloc(gfp, 0);
+		if (!spare)
+			return NULL;
+
+		cache = get_cpu_ptr(&skb_splice_frag_cache);
+		/* We may now be on a different cpu and/or someone else may
+		 * have refilled it
+		 */
+		cache->pfmemalloc = folio_is_pfmemalloc(spare);
+		if (cache->folio)
+			goto reload;
+	}
+
+	cache->folio = spare;
+	cache->virt  = folio_address(spare);
+	folio = spare;
+	spare = NULL;
+
+	/* Even if we own the page, we do not use atomic_set().  This would
+	 * break get_page_unless_zero() users.
+	 */
+	folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+
+	/* Reset page count bias and offset to start of new frag */
+	cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+	offset = folio_size(folio);
+	goto try_again;
+
+frag_too_big:
+	/* The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
+	 * but the cache isn't big enough to satisfy the request, this may
+	 * happen in low memory conditions.  We don't release the cache page
+	 * because it could make memory pressure worse so we simply return NULL
+	 * here.
+	 */
+	cache->offset = offset;
+	put_cpu_ptr(&skb_splice_frag_cache);
+	if (spare)
+		folio_put(spare);
+	return NULL;
+}
+EXPORT_SYMBOL(alloc_skb_frag);
+
+/**
+ * copy_skb_frag - Copy data into a page fragment.
+ * @s: The data to copy
+ * @len: The size of the data
+ * @gfp: Allocation flags
+ */
+void *copy_skb_frag(const void *s, size_t len, gfp_t gfp)
+{
+	void *p;
+
+	p = alloc_skb_frag(len, gfp);
+	if (!p)
+		return NULL;
+
+	return memcpy(p, s, len);
+}
+EXPORT_SYMBOL(copy_skb_frag);
+
 static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
 				 size_t offset, size_t len)
 {
@@ -6808,17 +6947,43 @@  ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
 			break;
 		}
 
+		if (space == 0 &&
+		    !skb_can_coalesce(skb, skb_shinfo(skb)->nr_frags,
+				      pages[0], off)) {
+			iov_iter_revert(iter, len);
+			break;
+		}
+
 		i = 0;
 		do {
 			struct page *page = pages[i++];
 			size_t part = min_t(size_t, PAGE_SIZE - off, len);
-
-			ret = -EIO;
-			if (WARN_ON_ONCE(!sendpage_ok(page)))
+			bool put = false;
+
+			if (PageSlab(page)) {
+				const void *p;
+				void *q;
+
+				p = kmap_local_page(page);
+				q = copy_skb_frag(p + off, part, gfp);
+				kunmap_local(p);
+				if (!q) {
+					iov_iter_revert(iter, len);
+					ret = -ENOMEM;
+					goto out;
+				}
+				page = virt_to_page(q);
+				off = offset_in_page(q);
+				put = true;
+			} else if (WARN_ON_ONCE(!sendpage_ok(page))) {
+				ret = -EIO;
 				goto out;
+			}
 
 			ret = skb_append_pagefrags(skb, page, off, part,
 						   frag_limit);
+			if (put)
+				put_page(page);
 			if (ret < 0) {
 				iov_iter_revert(iter, len);
 				goto out;