Endless loop in udp with MSG_SPLICE_READ - Re: [syzbot] [fs?] INFO: task hung in pipe_release (4)

Message ID 792238.1690667367@warthog.procyon.org.uk
State New
Headers
Series Endless loop in udp with MSG_SPLICE_READ - Re: [syzbot] [fs?] INFO: task hung in pipe_release (4) |

Commit Message

David Howells July 29, 2023, 9:49 p.m. UTC
  Hi Jakub, Willem,

I think I'm going to need your help with this one.

> > syzbot has bisected this issue to:
> > 
> > commit 7ac7c987850c3ec617c778f7bd871804dc1c648d
> > Author: David Howells <dhowells@redhat.com>
> > Date:   Mon May 22 12:11:22 2023 +0000
> > 
> >     udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
> > 
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000
> > start commit:   3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
> > git tree:       upstream
> > final oops:     https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000
> > console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
> > dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
> > 
> > Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com
> > Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
> > 
> > For information about bisection process see: https://goo.gl/tpsmEJ#bisection

The issue that syzbot is triggering seems to be something to do with the
calculations in the "if (copy <= 0) { ... }" chunk in __ip_append_data() when
MSG_SPLICE_PAGES is in operation.

What seems to happen is that the test program uses sendmsg() + MSG_MORE to
loads a UDP packet with 1406 bytes of data to the MTU size (1434) and then
splices in 8 extra bytes.

	r3 = socket$inet_udp(0x2, 0x2, 0x0)
	setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
	bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
	connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
	sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
	write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
	splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)

This results in some negative intermediate values turning up in the
calculations - and this results in the remaining length being made longer
from 8 to 14.

I added some printks (patch attached), resulting in the attached tracelines:

	==>splice_to_socket() 7099
	udp_sendmsg(8,8)
	__ip_append_data(copy=-6,len=8, mtu=1434 skblen=1434 maxfl=1428)
	pagedlen 14 = 14 - 0
	copy -6 = 14 - 0 - 6 - 14
	length 8 -= -6 + 0
	__ip_append_data(copy=1414,len=14, mtu=1434 skblen=20 maxfl=1428)
	copy=1414 len=14
	skb_splice_from_iter(8,14)
	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
	copy=1406 len=6
	skb_splice_from_iter(0,6)
	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
	copy=1406 len=6
	skb_splice_from_iter(0,6)
	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
	copy=1406 len=6
	skb_splice_from_iter(0,6)
	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
	copy=1406 len=6
	skb_splice_from_iter(0,6)
	copy=1406 len=6
	skb_splice_from_iter(0,6)
	...

'copy' gets calculated as -6 because the maxfraglen (maxfl=1428) is 8 bytes
less than the amount of data then in the packet (skblen=1434).

'copy' gets recalculated part way down as -6 from datalen (14) - transhdrlen
(0) - fraggap (6) - pagedlen (14).

datalen is 14 because it was length (8) + fraggap (6).

Inside skb_splice_from_iter(), we eventually end up in an enless loop in which
msg_iter.count is 0 and the length to be copied is 6.  It always returns 0
because there's nothing to copy, and so __ip_append_data() cycles round the
loop endlessly.

Any suggestion as to how to fix this?

Thanks,
David
---

Debug hang in pipe_release's pipe_lock
---
 fs/splice.c          |    3 +++
 net/core/skbuff.c    |    7 +++++++
 net/ipv4/ip_output.c |   24 ++++++++++++++++++++++++
 net/ipv4/udp.c       |    3 +++
 4 files changed, 37 insertions(+)
  

Comments

Willem de Bruijn July 30, 2023, 1:35 p.m. UTC | #1
David Howells wrote:
> Hi Jakub, Willem,
> 
> I think I'm going to need your help with this one.
> 
> > > syzbot has bisected this issue to:
> > > 
> > > commit 7ac7c987850c3ec617c778f7bd871804dc1c648d
> > > Author: David Howells <dhowells@redhat.com>
> > > Date:   Mon May 22 12:11:22 2023 +0000
> > > 
> > >     udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
> > > 
> > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000
> > > start commit:   3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
> > > git tree:       upstream
> > > final oops:     https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
> > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
> > > 
> > > Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com
> > > Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
> > > 
> > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> 
> The issue that syzbot is triggering seems to be something to do with the
> calculations in the "if (copy <= 0) { ... }" chunk in __ip_append_data() when
> MSG_SPLICE_PAGES is in operation.
> 
> What seems to happen is that the test program uses sendmsg() + MSG_MORE to
> loads a UDP packet with 1406 bytes of data to the MTU size (1434) and then
> splices in 8 extra bytes.
> 
> 	r3 = socket$inet_udp(0x2, 0x2, 0x0)
> 	setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
> 	bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
> 	connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
> 	sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
> 	write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
> 	splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
> 
> This results in some negative intermediate values turning up in the
> calculations - and this results in the remaining length being made longer
> from 8 to 14.
> 
> I added some printks (patch attached), resulting in the attached tracelines:
> 
> 	==>splice_to_socket() 7099
> 	udp_sendmsg(8,8)
> 	__ip_append_data(copy=-6,len=8, mtu=1434 skblen=1434 maxfl=1428)
> 	pagedlen 14 = 14 - 0
> 	copy -6 = 14 - 0 - 6 - 14
> 	length 8 -= -6 + 0
> 	__ip_append_data(copy=1414,len=14, mtu=1434 skblen=20 maxfl=1428)
> 	copy=1414 len=14
> 	skb_splice_from_iter(8,14)
> 	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> 	copy=1406 len=6
> 	skb_splice_from_iter(0,6)
> 	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> 	copy=1406 len=6
> 	skb_splice_from_iter(0,6)
> 	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> 	copy=1406 len=6
> 	skb_splice_from_iter(0,6)
> 	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> 	copy=1406 len=6
> 	skb_splice_from_iter(0,6)
> 	copy=1406 len=6
> 	skb_splice_from_iter(0,6)
> 	...
> 
> 'copy' gets calculated as -6 because the maxfraglen (maxfl=1428) is 8 bytes
> less than the amount of data then in the packet (skblen=1434).
> 
> 'copy' gets recalculated part way down as -6 from datalen (14) - transhdrlen
> (0) - fraggap (6) - pagedlen (14).
> 
> datalen is 14 because it was length (8) + fraggap (6).
> 
> Inside skb_splice_from_iter(), we eventually end up in an enless loop in which
> msg_iter.count is 0 and the length to be copied is 6.  It always returns 0
> because there's nothing to copy, and so __ip_append_data() cycles round the
> loop endlessly.
> 
> Any suggestion as to how to fix this?

Still ingesting.

The syzkaller repro runs in threaded mode, so syscalls should not be
interpreted in order.

There are two seemingly (but evidently not really) independent
operations:

One involving splicing

    pipe(&(0x7f0000000100)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff})
    r2 = socket$inet_udp(0x2, 0x2, 0x0)
    write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
    splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
    close(r2)

And separately the MSG_MORE transmission that you mentioned

    r3 = socket$inet_udp(0x2, 0x2, 0x0)
    setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
    bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
    connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
    sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)

This second program also sets setsockopt SOL_SOCKET/SO_BROADCAST. That
is likely not some random noise in the program (but that can be easily
checked, by removing it -- assuming the repro triggers quickly).

Question is whether the infinite loop happens on the r2, the socket
involving MSG_SPLICE_PAGES, or r3, the socket involving SO_BROADCAST.
If the second, on the surface it would seem that splicing is entirely
unrelated.
  
Willem de Bruijn July 30, 2023, 2:30 p.m. UTC | #2
Willem de Bruijn wrote:
> David Howells wrote:
> > Hi Jakub, Willem,
> > 
> > I think I'm going to need your help with this one.
> > 
> > > > syzbot has bisected this issue to:
> > > > 
> > > > commit 7ac7c987850c3ec617c778f7bd871804dc1c648d
> > > > Author: David Howells <dhowells@redhat.com>
> > > > Date:   Mon May 22 12:11:22 2023 +0000
> > > > 
> > > >     udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
> > > > 
> > > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000
> > > > start commit:   3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
> > > > git tree:       upstream
> > > > final oops:     https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000
> > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
> > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
> > > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
> > > > 
> > > > Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com
> > > > Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
> > > > 
> > > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> > 
> > The issue that syzbot is triggering seems to be something to do with the
> > calculations in the "if (copy <= 0) { ... }" chunk in __ip_append_data() when
> > MSG_SPLICE_PAGES is in operation.
> > 
> > What seems to happen is that the test program uses sendmsg() + MSG_MORE to
> > loads a UDP packet with 1406 bytes of data to the MTU size (1434) and then
> > splices in 8 extra bytes.
> > 
> > 	r3 = socket$inet_udp(0x2, 0x2, 0x0)
> > 	setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
> > 	bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
> > 	connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
> > 	sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
> > 	write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
> > 	splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
> > 
> > This results in some negative intermediate values turning up in the
> > calculations - and this results in the remaining length being made longer
> > from 8 to 14.
> > 
> > I added some printks (patch attached), resulting in the attached tracelines:
> > 
> > 	==>splice_to_socket() 7099
> > 	udp_sendmsg(8,8)
> > 	__ip_append_data(copy=-6,len=8, mtu=1434 skblen=1434 maxfl=1428)
> > 	pagedlen 14 = 14 - 0
> > 	copy -6 = 14 - 0 - 6 - 14
> > 	length 8 -= -6 + 0
> > 	__ip_append_data(copy=1414,len=14, mtu=1434 skblen=20 maxfl=1428)
> > 	copy=1414 len=14
> > 	skb_splice_from_iter(8,14)
> > 	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> > 	copy=1406 len=6
> > 	skb_splice_from_iter(0,6)
> > 	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> > 	copy=1406 len=6
> > 	skb_splice_from_iter(0,6)
> > 	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> > 	copy=1406 len=6
> > 	skb_splice_from_iter(0,6)
> > 	__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> > 	copy=1406 len=6
> > 	skb_splice_from_iter(0,6)
> > 	copy=1406 len=6
> > 	skb_splice_from_iter(0,6)
> > 	...
> > 
> > 'copy' gets calculated as -6 because the maxfraglen (maxfl=1428) is 8 bytes
> > less than the amount of data then in the packet (skblen=1434).
> > 
> > 'copy' gets recalculated part way down as -6 from datalen (14) - transhdrlen
> > (0) - fraggap (6) - pagedlen (14).
> > 
> > datalen is 14 because it was length (8) + fraggap (6).
> > 
> > Inside skb_splice_from_iter(), we eventually end up in an enless loop in which
> > msg_iter.count is 0 and the length to be copied is 6.  It always returns 0
> > because there's nothing to copy, and so __ip_append_data() cycles round the
> > loop endlessly.
> > 
> > Any suggestion as to how to fix this?
> 
> Still ingesting.
> 
> The syzkaller repro runs in threaded mode, so syscalls should not be
> interpreted in order.
> 
> There are two seemingly (but evidently not really) independent
> operations:
> 
> One involving splicing
> 
>     pipe(&(0x7f0000000100)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff})
>     r2 = socket$inet_udp(0x2, 0x2, 0x0)
>     write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
>     splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
>     close(r2)
> 
> And separately the MSG_MORE transmission that you mentioned
> 
>     r3 = socket$inet_udp(0x2, 0x2, 0x0)
>     setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
>     bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
>     connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
>     sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
> 
> This second program also sets setsockopt SOL_SOCKET/SO_BROADCAST. That
> is likely not some random noise in the program (but that can be easily
> checked, by removing it -- assuming the repro triggers quickly).
> 
> Question is whether the infinite loop happens on the r2, the socket
> involving MSG_SPLICE_PAGES, or r3, the socket involving SO_BROADCAST.
> If the second, on the surface it would seem that splicing is entirely
> unrelated.

I still don't entirely follow how the splice and sendmmsg end up on
the same socket.

Also, the sendmmsg in the case on the dashboard is very odd, a vlen of
0x4000000000003bd and flags MSG_MORE | MSG_CONFIRM. But maybe other
runs have more sensible flags here.

The issue appears to be with appending through splicing to an skb that
exceeds the length with fragments, triggering the if (fraggap) branch
to copy some trailer from skb_prev to skb, then appending the spliced
data.

Perhaps not an true fix based on understanding in detail, but one way
out may be to disable splicing if !transhdrlen (which ip_append_data
clears if appending).
  
David Howells July 30, 2023, 5:32 p.m. UTC | #3
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> The syzkaller repro runs in threaded mode, so syscalls should not be
> interpreted in order.

I think they are actually ordered.  It's kind of weirdly done, though,
flipping back and forth between threads and using futexes for synchronisation.

David
  
David Howells July 31, 2023, 8:13 a.m. UTC | #4
Hi Willem,

Here's a reduced testcase.  I doesn't require anything special; the key is
that the amount of data placed in the packet by the send() - it's related to
the MTU size.  It needs to stuff in sufficient data to go over the
fragmentation limit (I think).

In this case, my interface's MTU is 8192.  send() is sticking in 8161 bytes of
data and then the output from the aforeposted debugging patch is:

	==>splice_to_socket() 6630
	udp_sendmsg(8,8)
	__ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188)
	pagedlen 9 = 9 - 0
	copy -1 = 9 - 0 - 1 - 9
	length 8 -= -1 + 0
	__ip_append_data(copy=8172,len=9, mtu=8192 skblen=20 maxfl=8188)
	copy=8172 len=9
	skb_splice_from_iter(8,9)
	__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
	copy=8164 len=1
	skb_splice_from_iter(0,1)
	__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
	copy=8164 len=1
	skb_splice_from_iter(0,1)
	__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
	copy=8164 len=1
	skb_splice_from_iter(0,1)
	__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
	copy=8164 len=1
	skb_splice_from_iter(0,1)
	copy=8164 len=1
	skb_splice_from_iter(0,1)

It looks like send() pushes 1 byte over the fragmentation limit, then the
splice sees -1 crop up, the length to be copied is increased by 1, but
insufficient data is available and we go into an endless loop.

---
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <sys/uio.h>

#define OSERROR(R, S) do { if ((long)(R) == -1L) { perror((S)); exit(1); } } while(0)

int main()
{
	struct sockaddr_storage ss;
	struct sockaddr_in sin;
	void *buffer;
	unsigned int tmp;
	int pfd[2], sfd;
	int res;

	OSERROR(pipe(pfd), "pipe");

	sfd = socket(AF_INET, SOCK_DGRAM, 0);
	OSERROR(sfd, "socket/2");

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(0);
	sin.sin_addr.s_addr = htonl(0xc0a80601);
#warning you might want to set the address here - this is 192.168.6.1
	OSERROR(connect(sfd, (struct sockaddr *)&sin, sizeof(sin)), "connect");

	buffer = mmap(NULL, 1024*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
	OSERROR(buffer, "mmap");

	OSERROR(send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE), "send");
#warning you need to adjust the length on the above line to match your MTU

	OSERROR(write(pfd[1], buffer, 8), "write");

	OSERROR(splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0), "splice");
	return 0;
}
  
Willem de Bruijn July 31, 2023, 12:45 p.m. UTC | #5
David Howells wrote:
> Hi Willem,
> 
> Here's a reduced testcase.  I doesn't require anything special; the key is
> that the amount of data placed in the packet by the send() - it's related to
> the MTU size.  It needs to stuff in sufficient data to go over the
> fragmentation limit (I think).
> 
> In this case, my interface's MTU is 8192.  send() is sticking in 8161 bytes of
> data and then the output from the aforeposted debugging patch is:
> 
> 	==>splice_to_socket() 6630
> 	udp_sendmsg(8,8)
> 	__ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188)
> 	pagedlen 9 = 9 - 0
> 	copy -1 = 9 - 0 - 1 - 9
> 	length 8 -= -1 + 0
> 	__ip_append_data(copy=8172,len=9, mtu=8192 skblen=20 maxfl=8188)
> 	copy=8172 len=9
> 	skb_splice_from_iter(8,9)
> 	__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
> 	copy=8164 len=1
> 	skb_splice_from_iter(0,1)
> 	__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
> 	copy=8164 len=1
> 	skb_splice_from_iter(0,1)
> 	__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
> 	copy=8164 len=1
> 	skb_splice_from_iter(0,1)
> 	__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
> 	copy=8164 len=1
> 	skb_splice_from_iter(0,1)
> 	copy=8164 len=1
> 	skb_splice_from_iter(0,1)
> 
> It looks like send() pushes 1 byte over the fragmentation limit, then the
> splice sees -1 crop up, the length to be copied is increased by 1, but
> insufficient data is available and we go into an endless loop.
> 
> ---
> #define _GNU_SOURCE
> #include <arpa/inet.h>
> #include <fcntl.h>
> #include <netinet/in.h>
> #include <stdarg.h>
> #include <stdbool.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/socket.h>
> #include <sys/mman.h>
> #include <sys/uio.h>
> 
> #define OSERROR(R, S) do { if ((long)(R) == -1L) { perror((S)); exit(1); } } while(0)
> 
> int main()
> {
> 	struct sockaddr_storage ss;
> 	struct sockaddr_in sin;
> 	void *buffer;
> 	unsigned int tmp;
> 	int pfd[2], sfd;
> 	int res;
> 
> 	OSERROR(pipe(pfd), "pipe");
> 
> 	sfd = socket(AF_INET, SOCK_DGRAM, 0);
> 	OSERROR(sfd, "socket/2");
> 
> 	memset(&sin, 0, sizeof(sin));
> 	sin.sin_family = AF_INET;
> 	sin.sin_port = htons(0);
> 	sin.sin_addr.s_addr = htonl(0xc0a80601);
> #warning you might want to set the address here - this is 192.168.6.1
> 	OSERROR(connect(sfd, (struct sockaddr *)&sin, sizeof(sin)), "connect");
> 
> 	buffer = mmap(NULL, 1024*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
> 	OSERROR(buffer, "mmap");
> 
> 	OSERROR(send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE), "send");
> #warning you need to adjust the length on the above line to match your MTU
> 
> 	OSERROR(write(pfd[1], buffer, 8), "write");
> 
> 	OSERROR(splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0), "splice");
> 	return 0;
> }

That's helpful.

Is the MSG_CONFIRM needed to trigger this?

Appending to a MSG_MORE datagram that previously fit within MTU, but
no longer, triggers the copy from skb_prev to skb in if (fraggap).

I did not see how that would cause issues, but maybe something in how
that second skb is setup makes none of the cases in the while loop
successfully append, yet also not fail and exit. It would be helpful
to know which path it takes (I assume skb_splice_from_iter) and what
that returns (0?).

Is this indeed trivially sidestepped if downgrading from splicing to
regular copying with fragmentation?

@@ -1042,7 +1042,7 @@ static int __ip_append_data(struct sock *sk,
                if (inet->hdrincl)
                        return -EPERM;
                if (rt->dst.dev->features & NETIF_F_SG &&
-                   getfrag == ip_generic_getfrag)
+                   getfrag == ip_generic_getfrag && transhdrlen)
  
David Howells July 31, 2023, 1:34 p.m. UTC | #6
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> Is the MSG_CONFIRM needed to trigger this?

It's not actually necessary.  The syz test was doing it.

> Is this indeed trivially sidestepped if downgrading from splicing to
> regular copying with fragmentation?

That works.  The logging looks like:

==>splice_to_socket() 5535
udp_sendmsg(8,8)
__ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188)
copy 8 = 9 - 0 - 1 - 0
length 8 -= 8 + 0
<==splice_to_socket() = 8

It looks like pagedlen being non-zero might be the issue.

David
  
David Howells Aug. 1, 2023, 12:40 p.m. UTC | #7
The more I look at __ip_append_data(), the more I think the maths is wrong.
In the bit that allocates a new skbuff:

	if (copy <= 0) {
	...
		datalen = length + fraggap;
		if (datalen > mtu - fragheaderlen)
			datalen = maxfraglen - fragheaderlen;
		fraglen = datalen + fragheaderlen;
		pagedlen = 0;
	...
		if ((flags & MSG_MORE) &&
		    !(rt->dst.dev->features&NETIF_F_SG))
	...
		else if (!paged &&
			 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
			  !(rt->dst.dev->features & NETIF_F_SG)))
	...
		else {
			alloclen = fragheaderlen + transhdrlen;
			pagedlen = datalen - transhdrlen;
		}
	...

In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause.
The values used here, a few lines further along:

		copy = datalen - transhdrlen - fraggap - pagedlen;

are constant over the intervening span.  This means that, provided the splice
isn't going to exceed the MTU on the second fragment, the calculation of
'copy' can then be simplified algebraically thus:

		copy = (length + fraggap) - transhdrlen - fraggap - pagedlen;

		copy = length - transhdrlen - pagedlen;

		copy = length - transhdrlen - (datalen - transhdrlen);

		copy = length - transhdrlen - datalen + transhdrlen;

		copy = length - datalen;

		copy = length - (length + fraggap);

		copy = length - length - fraggap;

		copy = -fraggap;

I think we might need to recalculate copy after the conditional call to
getfrag().  Possibly we should skip that entirely for MSG_SPLICE_READ.  The
root seems to be that we're subtracting pagedlen from datalen - but probably
we shouldn't be doing getfrag() if pagedlen > 0.

David
  
Willem de Bruijn Aug. 1, 2023, 12:58 p.m. UTC | #8
David Howells wrote:
> The more I look at __ip_append_data(), the more I think the maths is wrong.
> In the bit that allocates a new skbuff:
> 
> 	if (copy <= 0) {
> 	...
> 		datalen = length + fraggap;
> 		if (datalen > mtu - fragheaderlen)
> 			datalen = maxfraglen - fragheaderlen;
> 		fraglen = datalen + fragheaderlen;
> 		pagedlen = 0;
> 	...
> 		if ((flags & MSG_MORE) &&
> 		    !(rt->dst.dev->features&NETIF_F_SG))
> 	...
> 		else if (!paged &&
> 			 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
> 			  !(rt->dst.dev->features & NETIF_F_SG)))
> 	...
> 		else {
> 			alloclen = fragheaderlen + transhdrlen;
> 			pagedlen = datalen - transhdrlen;
> 		}
> 	...
> 
> In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause.
> The values used here, a few lines further along:
> 
> 		copy = datalen - transhdrlen - fraggap - pagedlen;
> 
> are constant over the intervening span.  This means that, provided the splice
> isn't going to exceed the MTU on the second fragment, the calculation of
> 'copy' can then be simplified algebraically thus:
> 
> 		copy = (length + fraggap) - transhdrlen - fraggap - pagedlen;
> 
> 		copy = length - transhdrlen - pagedlen;
> 
> 		copy = length - transhdrlen - (datalen - transhdrlen);
> 
> 		copy = length - transhdrlen - datalen + transhdrlen;
> 
> 		copy = length - datalen;
> 
> 		copy = length - (length + fraggap);
> 
> 		copy = length - length - fraggap;
> 
> 		copy = -fraggap;
> 
> I think we might need to recalculate copy after the conditional call to
> getfrag().  Possibly we should skip that entirely for MSG_SPLICE_READ.  The
> root seems to be that we're subtracting pagedlen from datalen - but probably
> we shouldn't be doing getfrag() if pagedlen > 0.

q
  
Willem de Bruijn Aug. 1, 2023, 1:08 p.m. UTC | #9
David Howells wrote:
> The more I look at __ip_append_data(), the more I think the maths is wrong.
> In the bit that allocates a new skbuff:
> 
> 	if (copy <= 0) {
> 	...
> 		datalen = length + fraggap;
> 		if (datalen > mtu - fragheaderlen)
> 			datalen = maxfraglen - fragheaderlen;
> 		fraglen = datalen + fragheaderlen;
> 		pagedlen = 0;
> 	...
> 		if ((flags & MSG_MORE) &&
> 		    !(rt->dst.dev->features&NETIF_F_SG))
> 	...
> 		else if (!paged &&
> 			 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
> 			  !(rt->dst.dev->features & NETIF_F_SG)))
> 	...
> 		else {
> 			alloclen = fragheaderlen + transhdrlen;
> 			pagedlen = datalen - transhdrlen;
> 		}
> 	...
> 
> In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause.
> The values used here, a few lines further along:
> 
> 		copy = datalen - transhdrlen - fraggap - pagedlen;
> 
> are constant over the intervening span.  This means that, provided the splice
> isn't going to exceed the MTU on the second fragment, the calculation of
> 'copy' can then be simplified algebraically thus:
> 
> 		copy = (length + fraggap) - transhdrlen - fraggap - pagedlen;
> 
> 		copy = length - transhdrlen - pagedlen;
> 
> 		copy = length - transhdrlen - (datalen - transhdrlen);
> 
> 		copy = length - transhdrlen - datalen + transhdrlen;
> 
> 		copy = length - datalen;
> 
> 		copy = length - (length + fraggap);
> 
> 		copy = length - length - fraggap;
> 
> 		copy = -fraggap;
> 
> I think we might need to recalculate copy after the conditional call to
> getfrag().  Possibly we should skip that entirely for MSG_SPLICE_READ.  The
> root seems to be that we're subtracting pagedlen from datalen - but probably
> we shouldn't be doing getfrag() if pagedlen > 0.

That getfrag is needed. For non-splice cases, to fill the linear part
of an skb. As your example shows, it is skipped if all data is covered
by pagedlen?

In this edge case, splicing appends pagedlen to an skb that holds only
a small linear part for fragheaderlen and fraggap.

Splicing has no problem appending to a normal linear skb, right. Say

  send(fd, buf, 100, MSG_MORE);
  write(pipe[1], buf, 8);
  splice(pipe[2], 0, fd, 0, 8, 0);

This only happens when a new skb has to be allocated during the splice
call.

What causes the infinite loop: does skb_splice_from_iter return 0 and
therefore the loop neither decrements copy, nor breaks out with error?

Apologies for the earlier mail. Accidentally hit send too soon..
  
David Howells Aug. 1, 2023, 2:01 p.m. UTC | #10
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> That getfrag is needed. For non-splice cases, to fill the linear part
> of an skb. As your example shows, it is skipped if all data is covered
> by pagedlen?

Yes, because copy goes negative.  To quote from the previously quoted log:

	==>splice_to_socket() 6630
	udp_sendmsg(8,8)
	__ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188)
	pagedlen 9 = 9 - 0
	copy -1 = 9 - 0 - 1 - 9

copy is -(the number of excess bytes).

	length 8 -= -1 + 0

which then gets deducted from the length - but why?  I wonder if copy should
be cleared if we don't call getfrag().  It looks like it's meant to deduct the
amount copied by getfrag(), but that doesn't happen if copy < 0.

Also, note that MSG_ZEROCOPY might see the same maths issue here.

David
  
David Howells Aug. 1, 2023, 2:02 p.m. UTC | #11
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> What causes the infinite loop: does skb_splice_from_iter return 0 and
> therefore the loop neither decrements copy, nor breaks out with error?

Yeah.  skb_splice_from_iter() starts returning 0 because the iterator is
empty, but it's still being asked to copy data.  Possibly it should break out
of the loop (or give an error) in such a case.

David
  
David Howells Aug. 1, 2023, 2:19 p.m. UTC | #12
The attached seems to work.  I still think copy isn't correctly calculated in
some circumstances - as I showed, several terms in the maths cancel out,
including the length of the data.

I'm also not entirely sure what 'paged' means in this function.  Should it
actually be set in the MSG_SPLICE_PAGES context?

---
udp: Fix __ip_addend_data()

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e70839257f7..54675a4f2c9f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1157,7 +1157,7 @@ static int __ip_append_data(struct sock *sk,
 				pskb_trim_unique(skb_prev, maxfraglen);
 			}
 
-			copy = datalen - transhdrlen - fraggap - pagedlen;
+			copy = max_t(int, datalen - transhdrlen - fraggap - pagedlen, 0);
 			if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
 				err = -EFAULT;
 				kfree_skb(skb);
  
Willem de Bruijn Aug. 1, 2023, 2:31 p.m. UTC | #13
David Howells wrote:
> The attached seems to work.  I still think copy isn't correctly calculated in
> some circumstances - as I showed, several terms in the maths cancel out,
> including the length of the data.

That arithmetic makes assumptions that are specific to a particular
set of conditions (e.g., that pagedlen is non-zero).

Since the arithmetic is so complicated and error prone, I would try
to structure a fix that is easy to reason about to only change
behavior for the MSG_SPLICE_PAGES case.
 
> I'm also not entirely sure what 'paged' means in this function.  Should it
> actually be set in the MSG_SPLICE_PAGES context?

I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the
length that is not copied.

If the existing code would affect MSG_ZEROCOPY too, I expect syzbot
to have reported that previously.

> ---
> udp: Fix __ip_addend_data()
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 6e70839257f7..54675a4f2c9f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1157,7 +1157,7 @@ static int __ip_append_data(struct sock *sk,
>  				pskb_trim_unique(skb_prev, maxfraglen);
>  			}
>  
> -			copy = datalen - transhdrlen - fraggap - pagedlen;
> +			copy = max_t(int, datalen - transhdrlen - fraggap - pagedlen, 0);
>  			if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
>  				err = -EFAULT;
>  				kfree_skb(skb);
>
  
David Howells Aug. 1, 2023, 2:47 p.m. UTC | #14
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> > I'm also not entirely sure what 'paged' means in this function.  Should it
> > actually be set in the MSG_SPLICE_PAGES context?
> 
> I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the
> length that is not copied.
> 
> If the existing code would affect MSG_ZEROCOPY too, I expect syzbot
> to have reported that previously.

Ah...  I think it *should* affect MSG_ZEROCOPY also... but...  If you look at:

		} else {
			err = skb_zerocopy_iter_dgram(skb, from, copy);
			if (err < 0)
				goto error;
		}
		offset += copy;
		length -= copy;

MSG_ZEROCOPY assumes that if it didn't return an error, then
skb_zerocopy_iter_dgram() copied all the data requested - whether or not the
iterator had sufficient data to copy.

If you look in __zerocopy_sg_from_iter(), it will drop straight out, returning
0 if/when iov_iter_count() is/reaches 0, even if length is still > 0, just as
skb_splice_from_iter() does.

So there's a potential bug in the handling of MSG_ZEROCOPY - but one that you
survive because it subtracts 'copy' from 'length', reducing it to zero, exits
the loop and returns without looking at 'length' again.  The actual length to
be transmitted is in the skbuff.

> Since the arithmetic is so complicated and error prone, I would try
> to structure a fix that is easy to reason about to only change
> behavior for the MSG_SPLICE_PAGES case.

Does that mean you want to have a go at that - or did you want me to try?

David
  
Willem de Bruijn Aug. 1, 2023, 2:59 p.m. UTC | #15
David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> > > I'm also not entirely sure what 'paged' means in this function.  Should it
> > > actually be set in the MSG_SPLICE_PAGES context?
> > 
> > I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the
> > length that is not copied.
> > 
> > If the existing code would affect MSG_ZEROCOPY too, I expect syzbot
> > to have reported that previously.
> 
> Ah...  I think it *should* affect MSG_ZEROCOPY also... but...  If you look at:
> 
> 		} else {
> 			err = skb_zerocopy_iter_dgram(skb, from, copy);
> 			if (err < 0)
> 				goto error;
> 		}
> 		offset += copy;
> 		length -= copy;
> 
> MSG_ZEROCOPY assumes that if it didn't return an error, then
> skb_zerocopy_iter_dgram() copied all the data requested - whether or not the
> iterator had sufficient data to copy.
> 
> If you look in __zerocopy_sg_from_iter(), it will drop straight out, returning
> 0 if/when iov_iter_count() is/reaches 0, even if length is still > 0, just as
> skb_splice_from_iter() does.
> 
> So there's a potential bug in the handling of MSG_ZEROCOPY - but one that you
> survive because it subtracts 'copy' from 'length', reducing it to zero, exits
> the loop and returns without looking at 'length' again.  The actual length to
> be transmitted is in the skbuff.
> 
> > Since the arithmetic is so complicated and error prone, I would try
> > to structure a fix that is easy to reason about to only change
> > behavior for the MSG_SPLICE_PAGES case.
> 
> Does that mean you want to have a go at that - or did you want me to try
Please give it a try. I can review. It's just safer if it's trivial
to review that the patch only affects the behavior of the recently
introduced MSG_SPLICE_PAGES code.
  

Patch

diff --git a/fs/splice.c b/fs/splice.c
index 004eb1c4ce31..9ee82b818bd6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -801,6 +801,8 @@  ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
 	size_t spliced = 0;
 	bool need_wakeup = false;
 
+	printk("==>splice_to_socket() %u\n", current->pid);
+
 	pipe_lock(pipe);
 
 	while (len > 0) {
@@ -911,6 +913,7 @@  ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
 	pipe_unlock(pipe);
 	if (need_wakeup)
 		wakeup_pipe_writers(pipe);
+	printk("<==splice_to_socket() = %zd\n", spliced ?: ret);
 	return spliced ?: ret;
 }
 #endif
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a298992060e6..c3d60da9e3f7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6801,6 +6801,13 @@  ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
 	ssize_t spliced = 0, ret = 0;
 	unsigned int i;
 
+	static int __pcount;
+
+	if (__pcount < 6) {
+		printk("skb_splice_from_iter(%zu,%zd)\n", iter->count, maxsize);
+		__pcount++;
+	}
+
 	while (iter->count > 0) {
 		ssize_t space, nr, len;
 		size_t off;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e70839257f7..8c84a7d13627 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1066,6 +1066,14 @@  static int __ip_append_data(struct sock *sk,
 		copy = mtu - skb->len;
 		if (copy < length)
 			copy = maxfraglen - skb->len;
+		if (flags & MSG_SPLICE_PAGES) {
+			static int __pcount;
+			if (__pcount < 6) {
+				printk("__ip_append_data(copy=%d,len=%d, mtu=%d skblen=%d maxfl=%d)\n",
+				       copy, length, mtu, skb->len, maxfraglen);
+				__pcount++;
+			}
+		}
 		if (copy <= 0) {
 			char *data;
 			unsigned int datalen;
@@ -1112,6 +1120,10 @@  static int __ip_append_data(struct sock *sk,
 			else {
 				alloclen = fragheaderlen + transhdrlen;
 				pagedlen = datalen - transhdrlen;
+				if (flags & MSG_SPLICE_PAGES) {
+					printk("pagedlen %d = %d - %d\n",
+					       pagedlen, datalen, transhdrlen);
+				}
 			}
 
 			alloclen += alloc_extra;
@@ -1158,6 +1170,9 @@  static int __ip_append_data(struct sock *sk,
 			}
 
 			copy = datalen - transhdrlen - fraggap - pagedlen;
+			if (flags & MSG_SPLICE_PAGES)
+				printk("copy %d = %d - %d - %d - %d\n",
+				       copy, datalen, transhdrlen, fraggap, pagedlen);
 			if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
 				err = -EFAULT;
 				kfree_skb(skb);
@@ -1165,6 +1180,8 @@  static int __ip_append_data(struct sock *sk,
 			}
 
 			offset += copy;
+			if (flags & MSG_SPLICE_PAGES)
+				printk("length %d -= %d + %d\n", length, copy, transhdrlen);
 			length -= copy + transhdrlen;
 			transhdrlen = 0;
 			exthdrlen = 0;
@@ -1192,6 +1209,13 @@  static int __ip_append_data(struct sock *sk,
 			continue;
 		}
 
+		if (flags & MSG_SPLICE_PAGES) {
+			static int __qcount;
+			if (__qcount < 6) {
+				printk("copy=%d len=%d\n", copy, length);
+				__qcount++;
+			}
+		}
 		if (copy > length)
 			copy = length;
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 42a96b3547c9..bd3f4e62574b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1081,6 +1081,9 @@  int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility */
 		return -EOPNOTSUPP;
 
+	if (msg->msg_flags & MSG_SPLICE_PAGES)
+		printk("udp_sendmsg(%zx,%zx)\n", msg->msg_iter.count, len);
+
 	getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
 
 	fl4 = &inet->cork.fl.u.ip4;