parisc: Fix csum_ipv6_magic on 64-bit systems

Message ID 20240213234631.940055-1-linux@roeck-us.net
State New
Series parisc: Fix csum_ipv6_magic on 64-bit systems

Commit Message

Guenter Roeck Feb. 13, 2024, 11:46 p.m. UTC
  hppa 64-bit systems calculate the IPv6 checksum using 64-bit add
operations. The last add folds the protocol and length fields into the
64-bit result. While unlikely, this operation can overflow. The overflow
can be triggered with a code sequence such as the following.

	/* try to trigger massive overflows */
	memset(tmp_buf, 0xff, sizeof(struct in6_addr));
	csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
				      (struct in6_addr *)tmp_buf,
				      0xffff, 0xff, 0xffffffff);

Fix the problem by adding any overflows from the final add operation into
the calculated checksum. Fortunately, we can do this without additional
cost by replacing the add operation used to fold the checksum into 32 bits
with "add,dc" to add in the missing carry.

Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
This patch does not completely fix the problems with csum_ipv6_magic seen
when running 64-bit parisc images with the C3700 emulation in qemu. That
is due to unaligned 64-bit load operations which (presumably as part of
unaligned trap handling) generate bad carry flags. It is unknown if that
is a problem with the qemu emulation or with the Linux kernel, so it is not
addressed here.

 arch/parisc/include/asm/checksum.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
  

Comments

Charlie Jenkins Feb. 14, 2024, 8:22 p.m. UTC | #1
On Tue, Feb 13, 2024 at 03:46:31PM -0800, Guenter Roeck wrote:
> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
> operations. The last add folds protocol and length fields into the 64-bit
> result. While unlikely, this operation can overflow. The overflow can be
> triggered with a code sequence such as the following.
> 
> 	/* try to trigger massive overflows */
> 	memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> 	csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> 				      (struct in6_addr *)tmp_buf,
> 				      0xffff, 0xff, 0xffffffff);
> 
> Fix the problem by adding any overflows from the final add operation into
> the calculated checksum. Fortunately, we can do this without additional
> cost by replacing the add operation used to fold the checksum into 32 bit
> with "add,dc" to add in the missing carry.
> 
> Cc: Charlie Jenkins <charlie@rivosinc.com>
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
> ---
> This patch does not completely fix the problems with csum_ipv6_magic seen
> when running 64-bit parisc images with the C3700 emulation in qemu. That
> is due to unaligned 64-bit load operations which (presumably as part of
> unaligned trap handling) generate bad carry flags. It is unknown if that
> is a problem with the qemu emulation or with the Linux kernel, so it is not
> addressed here.
> 
>  arch/parisc/include/asm/checksum.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
> index e619e67440db..c949aa20fa16 100644
> --- a/arch/parisc/include/asm/checksum.h
> +++ b/arch/parisc/include/asm/checksum.h
> @@ -137,8 +137,8 @@ static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
>  "	add,dc		%3, %0, %0\n"  /* fold in proto+len | carry bit */
>  "	extrd,u		%0, 31, 32, %4\n"/* copy upper half down */
>  "	depdi		0, 31, 32, %0\n"/* clear upper half */
> -"	add		%4, %0, %0\n"	/* fold into 32-bits */
> -"	addc		0, %0, %0\n"	/* add carry */
> +"	add,dc		%4, %0, %0\n"	/* fold into 32-bits, plus carry */
> +"	addc		0, %0, %0\n"	/* add final carry */
>  
>  #else
>  
> -- 
> 2.39.2
> 

Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
  
Helge Deller Feb. 16, 2024, 12:38 p.m. UTC | #2
* Guenter Roeck <linux@roeck-us.net>:
> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
> operations. The last add folds protocol and length fields into the 64-bit
> result. While unlikely, this operation can overflow. The overflow can be
> triggered with a code sequence such as the following.
> 
> 	/* try to trigger massive overflows */
> 	memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> 	csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> 				      (struct in6_addr *)tmp_buf,
> 				      0xffff, 0xff, 0xffffffff);
> 
> Fix the problem by adding any overflows from the final add operation into
> the calculated checksum. Fortunately, we can do this without additional
> cost by replacing the add operation used to fold the checksum into 32 bit
> with "add,dc" to add in the missing carry.


Guenter,

Thanks for your patch for 32- and 64-bit systems.
But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
checksumming.
The patch below converts the code to use standard C-coding (taken from the
s390 header file) and it survives the v8 lib/checksum patch.

Opinions?

Helge

Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines

Guenter noticed that the 32- and 64-bit checksum functions may calculate
wrong values under certain circumstances. He fixed it by usining
corrected carry-flags added, but overall I think it's better to convert
away from inline assembly to generic C-coding.  That way the code is
much cleaner and the compiler can do much better optimizations based on
the target architecture (32- vs. 64-bit). Furthermore, the compiler
generates nowadays much better code, so inline assembly usually won't
give any benefit any longer.

Signed-off-by: Helge Deller <deller@gmx.de>
Noticed-by: Guenter Roeck <linux@roeck-us.net>

diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
index 3c43baca7b39..c72f14536353 100644
--- a/arch/parisc/include/asm/checksum.h
+++ b/arch/parisc/include/asm/checksum.h
@@ -18,160 +18,93 @@
  */
 extern __wsum csum_partial(const void *, int, __wsum);
 
+
 /*
- *	Optimized for IP headers, which always checksum on 4 octet boundaries.
- *
- *	Written by Randolph Chung <tausq@debian.org>, and then mucked with by
- *	LaMont Jones <lamont@debian.org>
+ * Fold a partial checksum without adding pseudo headers.
  */
-static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
+static inline __sum16 csum_fold(__wsum sum)
 {
-	unsigned int sum;
-	unsigned long t0, t1, t2;
+	u32 csum = (__force u32) sum;
 
-	__asm__ __volatile__ (
-"	ldws,ma		4(%1), %0\n"
-"	addib,<=	-4, %2, 2f\n"
-"\n"
-"	ldws		4(%1), %4\n"
-"	ldws		8(%1), %5\n"
-"	add		%0, %4, %0\n"
-"	ldws,ma		12(%1), %3\n"
-"	addc		%0, %5, %0\n"
-"	addc		%0, %3, %0\n"
-"1:	ldws,ma		4(%1), %3\n"
-"	addib,<		0, %2, 1b\n"
-"	addc		%0, %3, %0\n"
-"\n"
-"	extru		%0, 31, 16, %4\n"
-"	extru		%0, 15, 16, %5\n"
-"	addc		%4, %5, %0\n"
-"	extru		%0, 15, 16, %5\n"
-"	add		%0, %5, %0\n"
-"	subi		-1, %0, %0\n"
-"2:\n"
-	: "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2)
-	: "1" (iph), "2" (ihl)
-	: "memory");
-
-	return (__force __sum16)sum;
+	csum += (csum >> 16) | (csum << 16);
+	csum >>= 16;
+	return (__force __sum16) ~csum;
 }
 
 /*
- *	Fold a partial checksum
+ * This is a version of ip_compute_csum() optimized for IP headers,
+ * which always checksums on 4 octet boundaries.
  */
-static inline __sum16 csum_fold(__wsum csum)
+static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
-	u32 sum = (__force u32)csum;
-	/* add the swapped two 16-bit halves of sum,
-	   a possible carry from adding the two 16-bit halves,
-	   will carry from the lower half into the upper half,
-	   giving us the correct sum in the upper half. */
-	sum += (sum << 16) + (sum >> 16);
-	return (__force __sum16)(~sum >> 16);
+	__u64 csum = 0;
+	__u32 *ptr = (u32 *)iph;
+
+	csum += *ptr++;
+	csum += *ptr++;
+	csum += *ptr++;
+	csum += *ptr++;
+	ihl -= 4;
+	while (ihl--)
+		csum += *ptr++;
+	csum += (csum >> 32) | (csum << 32);
+	return csum_fold((__force __wsum)(csum >> 32));
 }
- 
-static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
-					__u32 len, __u8 proto,
-					__wsum sum)
+
+/*
+ * Computes the checksum of the TCP/UDP pseudo-header.
+ * Returns a 32-bit checksum.
+ */
+static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
+					__u8 proto, __wsum sum)
 {
-	__asm__(
-	"	add  %1, %0, %0\n"
-	"	addc %2, %0, %0\n"
-	"	addc %3, %0, %0\n"
-	"	addc %%r0, %0, %0\n"
-		: "=r" (sum)
-		: "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum));
-	return sum;
+	__u64 csum = (__force __u64)sum;
+
+	csum += (__force __u32)saddr;
+	csum += (__force __u32)daddr;
+	csum += len;
+	csum += proto;
+	csum += (csum >> 32) | (csum << 32);
+	return (__force __wsum)(csum >> 32);
 }
 
 /*
- * computes the checksum of the TCP/UDP pseudo-header
- * returns a 16-bit checksum, already complemented
+ * Computes the checksum of the TCP/UDP pseudo-header.
+ * Returns a 16-bit checksum, already complemented.
  */
-static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
-					__u32 len, __u8 proto,
-					__wsum sum)
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
+					__u8 proto, __wsum sum)
 {
-	return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum));
+	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
 }
 
 /*
- * this routine is used for miscellaneous IP-like checksums, mainly
- * in icmp.c
+ * Used for miscellaneous IP-like checksums, mainly icmp.
  */
-static inline __sum16 ip_compute_csum(const void *buf, int len)
+static inline __sum16 ip_compute_csum(const void *buff, int len)
 {
-	 return csum_fold (csum_partial(buf, len, 0));
+	return csum_fold(csum_partial(buff, len, 0));
 }
 
-
 #define _HAVE_ARCH_IPV6_CSUM
-static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
-					  const struct in6_addr *daddr,
-					  __u32 len, __u8 proto,
-					  __wsum sum)
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+				      const struct in6_addr *daddr,
+				      __u32 len, __u8 proto, __wsum csum)
 {
-	unsigned long t0, t1, t2, t3;
-
-	len += proto;	/* add 16-bit proto + len */
-
-	__asm__ __volatile__ (
-
-#if BITS_PER_LONG > 32
-
-	/*
-	** We can execute two loads and two adds per cycle on PA 8000.
-	** But add insn's get serialized waiting for the carry bit.
-	** Try to keep 4 registers with "live" values ahead of the ALU.
-	*/
-
-"	ldd,ma		8(%1), %4\n"	/* get 1st saddr word */
-"	ldd,ma		8(%2), %5\n"	/* get 1st daddr word */
-"	add		%4, %0, %0\n"
-"	ldd,ma		8(%1), %6\n"	/* 2nd saddr */
-"	ldd,ma		8(%2), %7\n"	/* 2nd daddr */
-"	add,dc		%5, %0, %0\n"
-"	add,dc		%6, %0, %0\n"
-"	add,dc		%7, %0, %0\n"
-"	add,dc		%3, %0, %0\n"  /* fold in proto+len | carry bit */
-"	extrd,u		%0, 31, 32, %4\n"/* copy upper half down */
-"	depdi		0, 31, 32, %0\n"/* clear upper half */
-"	add		%4, %0, %0\n"	/* fold into 32-bits */
-"	addc		0, %0, %0\n"	/* add carry */
-
-#else
-
-	/*
-	** For PA 1.x, the insn order doesn't matter as much.
-	** Insn stream is serialized on the carry bit here too.
-	** result from the previous operation (eg r0 + x)
-	*/
-"	ldw,ma		4(%1), %4\n"	/* get 1st saddr word */
-"	ldw,ma		4(%2), %5\n"	/* get 1st daddr word */
-"	add		%4, %0, %0\n"
-"	ldw,ma		4(%1), %6\n"	/* 2nd saddr */
-"	addc		%5, %0, %0\n"
-"	ldw,ma		4(%2), %7\n"	/* 2nd daddr */
-"	addc		%6, %0, %0\n"
-"	ldw,ma		4(%1), %4\n"	/* 3rd saddr */
-"	addc		%7, %0, %0\n"
-"	ldw,ma		4(%2), %5\n"	/* 3rd daddr */
-"	addc		%4, %0, %0\n"
-"	ldw,ma		4(%1), %6\n"	/* 4th saddr */
-"	addc		%5, %0, %0\n"
-"	ldw,ma		4(%2), %7\n"	/* 4th daddr */
-"	addc		%6, %0, %0\n"
-"	addc		%7, %0, %0\n"
-"	addc		%3, %0, %0\n"	/* fold in proto+len, catch carry */
-
-#endif
-	: "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len),
-	  "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3)
-	: "0" (sum), "1" (saddr), "2" (daddr), "3" (len)
-	: "memory");
-	return csum_fold(sum);
+	__u64 sum = (__force __u64)csum;
+
+	sum += (__force __u32)saddr->s6_addr32[0];
+	sum += (__force __u32)saddr->s6_addr32[1];
+	sum += (__force __u32)saddr->s6_addr32[2];
+	sum += (__force __u32)saddr->s6_addr32[3];
+	sum += (__force __u32)daddr->s6_addr32[0];
+	sum += (__force __u32)daddr->s6_addr32[1];
+	sum += (__force __u32)daddr->s6_addr32[2];
+	sum += (__force __u32)daddr->s6_addr32[3];
+	sum += len;
+	sum += proto;
+	sum += (sum >> 32) | (sum << 32);
+	return csum_fold((__force __wsum)(sum >> 32));
 }
 
 #endif
-
  
Guenter Roeck Feb. 16, 2024, 3:31 p.m. UTC | #3
Hi Helge,

On 2/16/24 04:38, Helge Deller wrote:
> * Guenter Roeck <linux@roeck-us.net>:
>> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
>> operations. The last add folds protocol and length fields into the 64-bit
>> result. While unlikely, this operation can overflow. The overflow can be
>> triggered with a code sequence such as the following.
>>
>> 	/* try to trigger massive overflows */
>> 	memset(tmp_buf, 0xff, sizeof(struct in6_addr));
>> 	csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
>> 				      (struct in6_addr *)tmp_buf,
>> 				      0xffff, 0xff, 0xffffffff);
>>
>> Fix the problem by adding any overflows from the final add operation into
>> the calculated checksum. Fortunately, we can do this without additional
>> cost by replacing the add operation used to fold the checksum into 32 bit
>> with "add,dc" to add in the missing carry.
> 
> 
> Gunter,
> 
> Thanks for your patch for 32- and 64-bit systems.
> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
> checksumming.
> The patch below converts the code to use standard C-coding (taken from the
> s390 header file) and it survives the v8 lib/checksum patch.
> 
> Opinions?
> 

Works for me.

> Helge
> 
> Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines
> 
> Guenter noticed that the 32- and 64-bit checksum functions may calculate
> wrong values under certain circumstances. He fixed it by usining

using

> corrected carry-flags added, but overall I think it's better to convert
> away from inline assembly to generic C-coding.  That way the code is
> much cleaner and the compiler can do much better optimizations based on
> the target architecture (32- vs. 64-bit). Furthermore, the compiler
> generates nowadays much better code, so inline assembly usually won't
> give any benefit any longer.
> 
> Signed-off-by: Helge Deller <deller@gmx.de>
> Noticed-by: Guenter Roeck <linux@roeck-us.net>

That would be a new tag. Maybe use "Reported-by:", or just drop it.

Maybe also consider adding

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

Other than that,

Tested-by: Guenter Roeck <linux@roeck-us.net>

Thanks,
Guenter
  
Charlie Jenkins Feb. 17, 2024, 3 a.m. UTC | #4
On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote:
> * Guenter Roeck <linux@roeck-us.net>:
> > hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
> > operations. The last add folds protocol and length fields into the 64-bit
> > result. While unlikely, this operation can overflow. The overflow can be
> > triggered with a code sequence such as the following.
> > 
> > 	/* try to trigger massive overflows */
> > 	memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> > 	csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> > 				      (struct in6_addr *)tmp_buf,
> > 				      0xffff, 0xff, 0xffffffff);
> > 
> > Fix the problem by adding any overflows from the final add operation into
> > the calculated checksum. Fortunately, we can do this without additional
> > cost by replacing the add operation used to fold the checksum into 32 bit
> > with "add,dc" to add in the missing carry.
> 
> 
> Gunter,
> 
> Thanks for your patch for 32- and 64-bit systems.
> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
> checksumming.
> The patch below converts the code to use standard C-coding (taken from the
> s390 header file) and it survives the v8 lib/checksum patch.
> 
> Opinions?
> 
> Helge
> 
> Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines
> 
> Guenter noticed that the 32- and 64-bit checksum functions may calculate
> wrong values under certain circumstances. He fixed it by usining
> corrected carry-flags added, but overall I think it's better to convert
> away from inline assembly to generic C-coding.  That way the code is
> much cleaner and the compiler can do much better optimizations based on
> the target architecture (32- vs. 64-bit). Furthermore, the compiler
> generates nowadays much better code, so inline assembly usually won't
> give any benefit any longer.
> 
> Signed-off-by: Helge Deller <deller@gmx.de>
> Noticed-by: Guenter Roeck <linux@roeck-us.net>
> 
> diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
> index 3c43baca7b39..c72f14536353 100644
> --- a/arch/parisc/include/asm/checksum.h
> +++ b/arch/parisc/include/asm/checksum.h
> @@ -18,160 +18,93 @@
>   */
>  extern __wsum csum_partial(const void *, int, __wsum);
>  
> +
>  /*
> - *	Optimized for IP headers, which always checksum on 4 octet boundaries.
> - *
> - *	Written by Randolph Chung <tausq@debian.org>, and then mucked with by
> - *	LaMont Jones <lamont@debian.org>
> + * Fold a partial checksum without adding pseudo headers.
>   */
> -static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> +static inline __sum16 csum_fold(__wsum sum)
>  {
> -	unsigned int sum;
> -	unsigned long t0, t1, t2;
> +	u32 csum = (__force u32) sum;
>  
> -	__asm__ __volatile__ (
> -"	ldws,ma		4(%1), %0\n"
> -"	addib,<=	-4, %2, 2f\n"
> -"\n"
> -"	ldws		4(%1), %4\n"
> -"	ldws		8(%1), %5\n"
> -"	add		%0, %4, %0\n"
> -"	ldws,ma		12(%1), %3\n"
> -"	addc		%0, %5, %0\n"
> -"	addc		%0, %3, %0\n"
> -"1:	ldws,ma		4(%1), %3\n"
> -"	addib,<		0, %2, 1b\n"
> -"	addc		%0, %3, %0\n"
> -"\n"
> -"	extru		%0, 31, 16, %4\n"
> -"	extru		%0, 15, 16, %5\n"
> -"	addc		%4, %5, %0\n"
> -"	extru		%0, 15, 16, %5\n"
> -"	add		%0, %5, %0\n"
> -"	subi		-1, %0, %0\n"
> -"2:\n"
> -	: "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2)
> -	: "1" (iph), "2" (ihl)
> -	: "memory");
> -
> -	return (__force __sum16)sum;
> +	csum += (csum >> 16) | (csum << 16);
> +	csum >>= 16;
> +	return (__force __sum16) ~csum;
>  }

For all of my analysis I am using gcc 12.3.0.

We can do better than this! By inspection this looks like a performance
regression. The generic version of csum_fold in
include/asm-generic/checksum.h is better than this so should be used
instead.
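
For reference, the asm-generic csum_fold() being referred to is roughly the
following C (paraphrased, with ror32() written out so the snippet stands
alone; the helper name mirrors the csum_fold_generic symbol in the
disassembly further down):

/*
 * Sketch of the asm-generic fold: add the two 16-bit halves with
 * end-around carry and complement, leaving the result in the top half.
 */
static inline __sum16 csum_fold_generic(__wsum csum)
{
	u32 sum = (__force u32)csum;
	u32 rot = (sum >> 16) | (sum << 16);	/* ror32(sum, 16) */

	return (__force __sum16)((~sum - rot) >> 16);
}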

I am not familiar with hppa assembly but this is the assembly for the
original csum_fold here:

0000000000000000 <csum_fold>:
   0:	08 03 02 41 	copy r3,r1
   4:	08 1e 02 43 	copy sp,r3
   8:	73 c1 00 88 	std,ma r1,40(sp)
   c:	d3 5a 09 fc 	shrpw r26,r26,16,ret0
  10:	0b 9a 0a 1c 	add,l r26,ret0,ret0
  14:	0b 80 09 9c 	uaddcm r0,ret0,ret0
  18:	db 9c 09 f0 	extrd,u,* ret0,47,16,ret0
  1c:	34 7e 00 80 	ldo 40(r3),sp
  20:	e8 40 d0 00 	bve (rp)
  24:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the assembly for the generic version:
0000000000000000 <csum_fold_generic>:
   0:	08 03 02 41 	copy r3,r1
   4:	08 1e 02 43 	copy sp,r3
   8:	73 c1 00 88 	std,ma r1,40(sp)
   c:	0b 40 09 9c 	uaddcm r0,r26,ret0
  10:	d3 5a 09 fa 	shrpw r26,r26,16,r26
  14:	0b 5c 04 1c 	sub ret0,r26,ret0
  18:	db 9c 09 f0 	extrd,u,* ret0,47,16,ret0
  1c:	34 7e 00 80 	ldo 40(r3),sp
  20:	e8 40 d0 00 	bve (rp)
  24:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the assembly for yours:
0000000000000028 <csum_fold_new>:
  28:	08 03 02 41 	copy r3,r1
  2c:	08 1e 02 43 	copy sp,r3
  30:	73 c1 00 88 	std,ma r1,40(sp)
  34:	d3 5a 09 fc 	shrpw r26,r26,16,ret0
  38:	0b 9a 0a 1c 	add,l r26,ret0,ret0
  3c:	d3 9c 19 f0 	extrw,u ret0,15,16,ret0
  40:	0b 80 09 9c 	uaddcm r0,ret0,ret0
  44:	db 9c 0b f0 	extrd,u,* ret0,63,16,ret0
  48:	34 7e 00 80 	ldo 40(r3),sp
  4c:	e8 40 d0 00 	bve (rp)
  50:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  54:	00 00 00 00 	break 0,0

The assembly is almost the same for the generic and the original code,
except for rearranging the shift and add operation which allows the
addition to become a subtraction and shortens the dependency chain on
some architectures (looks like it doesn't change much here). However,
this new code introduces an additional extrw instruction.

>  
>  /*
> - *	Fold a partial checksum
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksums on 4 octet boundaries.
>   */
> -static inline __sum16 csum_fold(__wsum csum)
> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  {
> -	u32 sum = (__force u32)csum;
> -	/* add the swapped two 16-bit halves of sum,
> -	   a possible carry from adding the two 16-bit halves,
> -	   will carry from the lower half into the upper half,
> -	   giving us the correct sum in the upper half. */
> -	sum += (sum << 16) + (sum >> 16);
> -	return (__force __sum16)(~sum >> 16);
> +	__u64 csum = 0;
> +	__u32 *ptr = (u32 *)iph;
> +
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	ihl -= 4;
> +	while (ihl--)
> +		csum += *ptr++;
> +	csum += (csum >> 32) | (csum << 32);
> +	return csum_fold((__force __wsum)(csum >> 32));
>  }

This doesn't leverage add with carry well. This causes the code size of this
function to be dramatically larger than the original assembly, which I assume
nicely correlates to an increased execution time.

This is the original assembly in 64-bit (almost the same in 32-bit):
0000000000000028 <ip_fast_csum>:
  28:	08 03 02 41 	copy r3,r1
  2c:	08 1e 02 43 	copy sp,r3
  30:	73 c1 00 88 	std,ma r1,40(sp)
  34:	0f 48 10 bc 	ldw,ma 4(r26),ret0
  38:	a7 39 60 70 	addib,<= -4,r25,78 <ip_fast_csum+0x50>
  3c:	0f 48 10 93 	ldw 4(r26),r19
  40:	0f 50 10 94 	ldw 8(r26),r20
  44:	0a 7c 06 1c 	add ret0,r19,ret0
  48:	0f 58 10 bf 	ldw,ma c(r26),r31
  4c:	0a 9c 07 1c 	add,c ret0,r20,ret0
  50:	0b fc 07 1c 	add,c ret0,r31,ret0
  54:	0f 48 10 bf 	ldw,ma 4(r26),r31
  58:	a7 20 5f ed 	addib,< 0,r25,54 <ip_fast_csum+0x2c>
  5c:	0b fc 07 1c 	add,c ret0,r31,ret0
  60:	d3 93 1b f0 	extrw,u ret0,31,16,r19
  64:	d3 94 19 f0 	extrw,u ret0,15,16,r20
  68:	0a 93 07 1c 	add,c r19,r20,ret0
  6c:	d3 94 19 f0 	extrw,u ret0,15,16,r20
  70:	0a 9c 06 1c 	add ret0,r20,ret0
  74:	97 9c 07 ff 	subi -1,ret0,ret0
  78:	db 9c 0b f0 	extrd,u,* ret0,63,16,ret0
  7c:	34 7e 00 80 	ldo 40(r3),sp
  80:	e8 40 d0 00 	bve (rp)
  84:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the 64-bit assembly of your proposal:

0000000000000058 <ip_fast_csum_new>:
  58:	08 03 02 41 	copy r3,r1
  5c:	0f c2 12 c1 	std rp,-10(sp)
  60:	08 1e 02 43 	copy sp,r3
  64:	73 c1 01 08 	std,ma r1,80(sp)
  68:	37 39 3f f9 	ldo -4(r25),r25
  6c:	0c 64 12 d0 	std r4,8(r3)
  70:	0f 48 10 9f 	ldw 4(r26),r31
  74:	db 39 0b e0 	extrd,u,* r25,63,32,r25
  78:	37 5a 00 20 	ldo 10(r26),r26
  7c:	0f 41 10 9c 	ldw -10(r26),ret0
  80:	0b fc 0a 1c 	add,l ret0,r31,ret0
  84:	0f 51 10 9f 	ldw -8(r26),r31
  88:	0b 9f 0a 1f 	add,l r31,ret0,r31
  8c:	0f 59 10 9c 	ldw -4(r26),ret0
  90:	0b fc 0a 1c 	add,l ret0,r31,ret0
  94:	37 3f 3f ff 	ldo -1(r25),r31
  98:	8f ff 20 68 	cmpib,<> -1,r31,d4 <ip_fast_csum_new+0x7c>
  9c:	db f9 0b e0 	extrd,u,* r31,63,32,r25
  a0:	d3 9c 07 fa 	shrpd,* ret0,ret0,32,r26
  a4:	37 dd 3f a1 	ldo -30(sp),ret1
  a8:	0b 9a 0a 1a 	add,l r26,ret0,r26
  ac:	00 00 14 a1 	mfia r1
  b0:	28 20 00 00 	addil L%0,r1,r1
  b4:	34 21 00 00 	ldo 0(r1),r1
  b8:	e8 20 f0 00 	bve,l (r1),rp
  bc:	db 5a 03 e0 	extrd,u,* r26,31,32,r26
  c0:	0c 70 10 c4 	ldd 8(r3),r4
  c4:	0c 61 10 c2 	ldd -10(r3),rp
  c8:	34 7e 00 80 	ldo 40(r3),sp
  cc:	e8 40 d0 00 	bve (rp)
  d0:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  d4:	0f 40 10 9f 	ldw 0(r26),r31
  d8:	37 5a 00 08 	ldo 4(r26),r26
  dc:	e8 1f 1f 65 	b,l 94 <ip_fast_csum_new+0x3c>,r0
  e0:	0b fc 0a 1c 	add,l ret0,r31,ret0
  e4:	00 00 00 00 	break 0,0

The 32-bit assembly is even longer.

Maybe you can get some inspiration from my riscv implementation
arch/riscv/include/asm/checksum.h. You can swap out the line
csum += csum < ((const unsigned int *)iph)[pos];
with the addc macro defined in arch/parisc/lib/checksum.c.

I will guess that this will improve the code but I haven't checked.
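
For illustration, the idiom being suggested looks roughly like this in plain
C (hypothetical helper names, not the parisc addc macro itself; how well gcc
turns it into add/addc on hppa was not checked here):

/* End-around-carry add: the "csum += csum < x" idiom spelled out. */
static inline u32 csum_add32(u32 sum, u32 addend)
{
	sum += addend;
	if (sum < addend)	/* unsigned wrap == carry out of bit 31 */
		sum++;		/* feed the carry back in */
	return sum;
}

/* Possible shape of ip_fast_csum() built on it (ihl is in 32-bit words). */
static inline u32 ipv4_header_sum(const void *iph, unsigned int ihl)
{
	const u32 *word = iph;
	u32 sum = 0;
	unsigned int i;

	for (i = 0; i < ihl; i++)
		sum = csum_add32(sum, word[i]);
	return sum;	/* caller still applies csum_fold() */
}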

> - 
> -static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
> -					__u32 len, __u8 proto,
> -					__wsum sum)
> +
> +/*
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 32-bit checksum.
> + */
> +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
> +					__u8 proto, __wsum sum)
>  {
> -	__asm__(
> -	"	add  %1, %0, %0\n"
> -	"	addc %2, %0, %0\n"
> -	"	addc %3, %0, %0\n"
> -	"	addc %%r0, %0, %0\n"
> -		: "=r" (sum)
> -		: "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum));
> -	return sum;
> +	__u64 csum = (__force __u64)sum;
> +
> +	csum += (__force __u32)saddr;
> +	csum += (__force __u32)daddr;
> +	csum += len;
> +	csum += proto;
> +	csum += (csum >> 32) | (csum << 32);
> +	return (__force __wsum)(csum >> 32);
>  }

The assembly from the original:

0000000000000088 <csum_tcpudp_nofold>:
  88:	08 03 02 41 	copy r3,r1
  8c:	08 1e 02 43 	copy sp,r3
  90:	73 c1 00 88 	std,ma r1,40(sp)
  94:	da f7 0b f8 	extrd,u,* r23,63,8,r23
  98:	0a f8 0a 17 	add,l r24,r23,r23
  9c:	0a d9 06 16 	add r25,r22,r22
  a0:	0a da 07 16 	add,c r26,r22,r22
  a4:	0a d7 07 16 	add,c r23,r22,r22
  a8:	0a c0 07 16 	add,c r0,r22,r22
  ac:	da dc 0b e0 	extrd,u,* r22,63,32,ret0
  b0:	34 7e 00 80 	ldo 40(r3),sp
  b4:	e8 40 d0 00 	bve (rp)
  b8:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  bc:	00 00 00 00 	break 0,0

The 64-bit assembly from your proposal:

0000000000000140 <csum_tcpudp_nofold_new>:
 140:	08 03 02 41 	copy r3,r1
 144:	08 1e 02 43 	copy sp,r3
 148:	73 c1 00 88 	std,ma r1,40(sp)
 14c:	db 5a 0b e0 	extrd,u,* r26,63,32,r26
 150:	db 39 0b e0 	extrd,u,* r25,63,32,r25
 154:	da f7 0b f8 	extrd,u,* r23,63,8,r23
 158:	db 18 0b e0 	extrd,u,* r24,63,32,r24
 15c:	da d6 0b e0 	extrd,u,* r22,63,32,r22
 160:	0a f8 0a 18 	add,l r24,r23,r24
 164:	0b 38 0a 18 	add,l r24,r25,r24
 168:	0b 58 0a 18 	add,l r24,r26,r24
 16c:	0a d8 0a 16 	add,l r24,r22,r22
 170:	d2 d6 07 fc 	shrpd,* r22,r22,32,ret0
 174:	0a dc 0a 1c 	add,l ret0,r22,ret0
 178:	db 9c 03 e0 	extrd,u,* ret0,31,32,ret0
 17c:	34 7e 00 80 	ldo 40(r3),sp
 180:	e8 40 d0 00 	bve (rp)
 184:	53 c3 3f 8d 	ldd,mb -40(sp),r3

There are a lot of extra extrd instructions, and again the result is really
bad on 32-bit parisc.

Maybe there is a good way to represent this in C, but the assembly is
so simple and clean for this function already.
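
For comparison, the same end-around-carry idea applied to this function would
look something like the sketch below: all arithmetic stays in 32 bits and each
carry is folded back in immediately (the raw len + proto add mirrors the
operand the assembly already uses). This is an untested illustration, not a
proposed replacement.

/* Sketch only: csum_tcpudp_nofold() without a 64-bit accumulator. */
static inline __wsum csum_tcpudp_nofold_c(__be32 saddr, __be32 daddr,
					  __u32 len, __u8 proto, __wsum sum)
{
	u32 s = (__force u32)sum;
	u32 addend[3] = {
		(__force u32)saddr,
		(__force u32)daddr,
		len + proto,		/* same combined operand as the asm */
	};
	unsigned int i;

	for (i = 0; i < 3; i++) {
		s += addend[i];
		if (s < addend[i])	/* carry out of bit 31 */
			s++;		/* end-around carry */
	}
	return (__force __wsum)s;
}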

>  
>  /*
> - * computes the checksum of the TCP/UDP pseudo-header
> - * returns a 16-bit checksum, already complemented
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 16-bit checksum, already complemented.
>   */
> -static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> -					__u32 len, __u8 proto,
> -					__wsum sum)
> +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
> +					__u8 proto, __wsum sum)
>  {
> -	return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum));
> +	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
>  }
>  
The implementation of csum_tcpudp_magic is the same as the generic, so
the generic version should be used instead.

>  /*
> - * this routine is used for miscellaneous IP-like checksums, mainly
> - * in icmp.c
> + * Used for miscellaneous IP-like checksums, mainly icmp.
>   */
> -static inline __sum16 ip_compute_csum(const void *buf, int len)
> +static inline __sum16 ip_compute_csum(const void *buff, int len)
>  {
> -	 return csum_fold (csum_partial(buf, len, 0));
> +	return csum_fold(csum_partial(buff, len, 0));
>  }

The generic version is better than this so it should be used instead.

>  
> -
>  #define _HAVE_ARCH_IPV6_CSUM
> -static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> -					  const struct in6_addr *daddr,
> -					  __u32 len, __u8 proto,
> -					  __wsum sum)
> +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> +				      const struct in6_addr *daddr,
> +				      __u32 len, __u8 proto, __wsum csum)
>  {
> -	unsigned long t0, t1, t2, t3;
> -
> -	len += proto;	/* add 16-bit proto + len */
> -
> -	__asm__ __volatile__ (
> -
> -#if BITS_PER_LONG > 32
> -
> -	/*
> -	** We can execute two loads and two adds per cycle on PA 8000.
> -	** But add insn's get serialized waiting for the carry bit.
> -	** Try to keep 4 registers with "live" values ahead of the ALU.
> -	*/
> -
> -"	ldd,ma		8(%1), %4\n"	/* get 1st saddr word */
> -"	ldd,ma		8(%2), %5\n"	/* get 1st daddr word */
> -"	add		%4, %0, %0\n"
> -"	ldd,ma		8(%1), %6\n"	/* 2nd saddr */
> -"	ldd,ma		8(%2), %7\n"	/* 2nd daddr */
> -"	add,dc		%5, %0, %0\n"
> -"	add,dc		%6, %0, %0\n"
> -"	add,dc		%7, %0, %0\n"
> -"	add,dc		%3, %0, %0\n"  /* fold in proto+len | carry bit */
> -"	extrd,u		%0, 31, 32, %4\n"/* copy upper half down */
> -"	depdi		0, 31, 32, %0\n"/* clear upper half */
> -"	add		%4, %0, %0\n"	/* fold into 32-bits */
> -"	addc		0, %0, %0\n"	/* add carry */
> -
> -#else
> -
> -	/*
> -	** For PA 1.x, the insn order doesn't matter as much.
> -	** Insn stream is serialized on the carry bit here too.
> -	** result from the previous operation (eg r0 + x)
> -	*/
> -"	ldw,ma		4(%1), %4\n"	/* get 1st saddr word */
> -"	ldw,ma		4(%2), %5\n"	/* get 1st daddr word */
> -"	add		%4, %0, %0\n"
> -"	ldw,ma		4(%1), %6\n"	/* 2nd saddr */
> -"	addc		%5, %0, %0\n"
> -"	ldw,ma		4(%2), %7\n"	/* 2nd daddr */
> -"	addc		%6, %0, %0\n"
> -"	ldw,ma		4(%1), %4\n"	/* 3rd saddr */
> -"	addc		%7, %0, %0\n"
> -"	ldw,ma		4(%2), %5\n"	/* 3rd daddr */
> -"	addc		%4, %0, %0\n"
> -"	ldw,ma		4(%1), %6\n"	/* 4th saddr */
> -"	addc		%5, %0, %0\n"
> -"	ldw,ma		4(%2), %7\n"	/* 4th daddr */
> -"	addc		%6, %0, %0\n"
> -"	addc		%7, %0, %0\n"
> -"	addc		%3, %0, %0\n"	/* fold in proto+len, catch carry */
> -
> -#endif
> -	: "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len),
> -	  "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3)
> -	: "0" (sum), "1" (saddr), "2" (daddr), "3" (len)
> -	: "memory");
> -	return csum_fold(sum);
> +	__u64 sum = (__force __u64)csum;
> +
> +	sum += (__force __u32)saddr->s6_addr32[0];
> +	sum += (__force __u32)saddr->s6_addr32[1];
> +	sum += (__force __u32)saddr->s6_addr32[2];
> +	sum += (__force __u32)saddr->s6_addr32[3];
> +	sum += (__force __u32)daddr->s6_addr32[0];
> +	sum += (__force __u32)daddr->s6_addr32[1];
> +	sum += (__force __u32)daddr->s6_addr32[2];
> +	sum += (__force __u32)daddr->s6_addr32[3];
> +	sum += len;
> +	sum += proto;
> +	sum += (sum >> 32) | (sum << 32);
> +	return csum_fold((__force __wsum)(sum >> 32));
>  }
>  
>  #endif
> -

Similar story again here where the add with carry is not well translated
into C, resulting in significantly worse assembly. Using __u64 seems to
be a big contributing factor for why the 32-bit assembly is worse.

I am not sure the best way to represent this in a clean way in C.

add with carry is not well understood by GCC 12.3 it seems. These
functions are generally heavily optimized on every architecture, so I
think it is worth it to default to assembly if you aren't able to
achieve comparable performance in C.
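
As a sketch of how the carry can at least be made explicit to the compiler,
__builtin_add_overflow() (gcc 5+ and clang) can replace the comparison;
whether gcc 12.3 actually emits add/addc on hppa for it has not been checked
here:

/* Illustration only: end-around carry via __builtin_add_overflow(). */
static inline u32 csum_add32_builtin(u32 sum, u32 addend)
{
	u32 res;

	if (__builtin_add_overflow(sum, addend, &res))
		res++;		/* fold the carry back in */
	return res;
}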

- Charlie
  
David Laight Feb. 17, 2024, 10:47 p.m. UTC | #5
..
> We can do better than this! By inspection this looks like a performance
> regression. The generic version of csum_fold in
> include/asm-generic/checksum.h is better than this so should be used
> instead.

Yes, that got changed for 6.8-rc1 (I pretty much suggested the patch)
but hadn't noticed Linus has applied it.
That C version is (probably) not worse than any of the asm versions
except sparc32 - which has a carry flag but no rotate.
(It is better than the x86-64 asm one.)

..
> This doesn't leverage add with carry well. This causes the code size of this
> to be dramatically larger than the original assembly, which I assume
> nicely correlates to an increased execution time.

It is pretty much impossible to do add with carry from C.
So an asm adc block is pretty much always going to win.

For csum_partial and short to moderate length buffers
on x86 it is hard to beat 10: adc, adc, dec, jnz 10b
which (on modern intel cpu at least) does 8 bytes/clock.
You can get 12 bytes/clock but it only really wins for 256+ bytes.
(See the current x86-64 version.)

For cpu without a carry flag it is likely that a common C
function will be pretty much optimal on all architectures.
(Or maybe a couple of implementations based the actual
cpu implementation - not the architecture.)

Mostly I don't think you can beat 4 instructions/word, but they
will pipeline so with multi-issue you might get a read/clock. 
Arm's barrel shifter might give 3: v + *p; x += v, y += v >> 32.

	David

  
Helge Deller Feb. 19, 2024, 9:11 p.m. UTC | #6
On 2/17/24 04:00, Charlie Jenkins wrote:
> On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote:
>> * Guenter Roeck <linux@roeck-us.net>:
>>> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
>>> operations. The last add folds protocol and length fields into the 64-bit
>>> result. While unlikely, this operation can overflow. The overflow can be
>>> triggered with a code sequence such as the following.
>>>
>>> 	/* try to trigger massive overflows */
>>> 	memset(tmp_buf, 0xff, sizeof(struct in6_addr));
>>> 	csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
>>> 				      (struct in6_addr *)tmp_buf,
>>> 				      0xffff, 0xff, 0xffffffff);
>>>
>>> Fix the problem by adding any overflows from the final add operation into
>>> the calculated checksum. Fortunately, we can do this without additional
>>> cost by replacing the add operation used to fold the checksum into 32 bit
>>> with "add,dc" to add in the missing carry.
>>
>>
>> Gunter,
>>
>> Thanks for your patch for 32- and 64-bit systems.
>> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
>> checksumming.
>> The patch below converts the code to use standard C-coding (taken from the
>> s390 header file) and it survives the v8 lib/checksum patch.
>>
>> Opinions?
>> [...]

> We can do better than this! By inspection this looks like a performance
> regression.
> [...]
> Similar story again here where the add with carry is not well translated
> into C, resulting in significantly worse assembly. Using __u64 seems to
> be a big contributing factor for why the 32-bit assembly is worse.
>
> I am not sure the best way to represent this in a clean way in C.
>
> add with carry is not well understood by GCC 12.3 it seems. These
> functions are generally heavily optimized on every architecture, so I
> think it is worth it to default to assembly if you aren't able to
> achieve comparable performance in C.

Thanks a lot for your analysis!!!
I've now reverted my change to switch to generic code and applied
the 3 suggested patches from Guenter which fix the hppa assembly.
Let's see how they behave in the for-next git tree during the next few days.

Helge
  

Patch

diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
index e619e67440db..c949aa20fa16 100644
--- a/arch/parisc/include/asm/checksum.h
+++ b/arch/parisc/include/asm/checksum.h
@@ -137,8 +137,8 @@  static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
 "	add,dc		%3, %0, %0\n"  /* fold in proto+len | carry bit */
 "	extrd,u		%0, 31, 32, %4\n"/* copy upper half down */
 "	depdi		0, 31, 32, %0\n"/* clear upper half */
-"	add		%4, %0, %0\n"	/* fold into 32-bits */
-"	addc		0, %0, %0\n"	/* add carry */
+"	add,dc		%4, %0, %0\n"	/* fold into 32-bits, plus carry */
+"	addc		0, %0, %0\n"	/* add final carry */
 
 #else
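
For orientation, the arithmetic the fixed fold sequence is meant to implement
can be written in plain C as below; this models the intended math (fold the
64-bit accumulator to 32 bits without losing any carry), not the
instruction-level carry handling:

/* Model of the intended math behind the 64-bit -> 32-bit fold. */
static inline u32 fold64_to_32(u64 sum64)
{
	/* first fold: at most a 33-bit result */
	sum64 = (sum64 >> 32) + (sum64 & 0xffffffffULL);
	/* second fold: absorb the possible carry out of bit 31 */
	sum64 = (sum64 >> 32) + (sum64 & 0xffffffffULL);
	return (u32)sum64;	/* ready for csum_fold() */
}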