PR29189, dlltool delaylibs corrupt float/double arguments

Message ID ZGQlZ4KofqcYiJX1@squeak.grove.modra.org
State Accepted
Headers
Series PR29189, dlltool delaylibs corrupt float/double arguments |

Checks

Context Check Description
snail/binutils-gdb-check success Github commit url

Commit Message

Alan Modra May 17, 2023, 12:52 a.m. UTC
  PR29189 is an excellent bug report.  The reporter debugged the problem
to the point of finding out exactly where things were going wrong (in
a windows dll, so not a mingw problem) and even supplied a fix, giving
an ABI reference.  There hasn't been any action on the bug report due
to lack of an active mingw binutils maintainer, so I thought I'd take
a look as part of trying to whittle down the enormous number of
binutils bugzilla entries.

This is a rewrite of the patch given in the PR.  (It might even
resemble code emitted by Microsoft's LINK.EXE as reported in
https://www.mail-archive.com/llvm-bugs@lists.llvm.org/msg55205.html).
I'm not going to apply it without review from an x86 maintainer,
because it's been too long since I did any serious x86 specific work
and I don't want to break things for someone running an original amd64
machine.  The patch needs to be tested too.  I don't have a mingw
setup.

	PR 29189
	* dlltool.c (i386_x64_trampoline): Save and restore xmm0-3.  Make
	use of parameter save area for integer arg regs.  Comment.
  

Comments

Jan Beulich May 17, 2023, 10:18 a.m. UTC | #1
On 17.05.2023 02:52, Alan Modra wrote:
> PR29189 is an excellent bug report.  The reporter debugged the problem
> to the point of finding out exactly where things were going wrong (in
> a windows dll, so not a mingw problem) and even supplied a fix, giving
> an ABI reference.  There hasn't been any action on the bug report due
> to lack of an active mingw binutils maintainer, so I thought I'd take
> a look as part of trying to whittle down the enormous number of
> binutils bugzilla entries.
> 
> This is a rewrite of the patch given in the PR.  (It might even
> resemble code emitted by Microsoft's LINK.EXE as reported in
> https://www.mail-archive.com/llvm-bugs@lists.llvm.org/msg55205.html).
> I'm not going to apply it without review from an x86 maintainer,
> because it's been too long since I did any serious x86 specific work
> and I don't want to break things for someone running an original amd64
> machine.  The patch needs to be tested too.  I don't have a mingw
> setup.
> 
> 	PR 29189
> 	* dlltool.c (i386_x64_trampoline): Save and restore xmm0-3.  Make
> 	use of parameter save area for integer arg regs.  Comment.

The change looks good to me, and I'm inclined to suggest that we postpone
dealing with wider vector registers (it's not just AVX, but also AVX512
which may need dealing with), in the hope that the system DLLs would only
ever touch the XMM ones, and only via legacy (pre-AVX) instructions (thus
leaving their upper halves / three quarters intact).

However, I'm inclined to suggest that we deal with %xmm4 and %xmm5 right
away: They're not callee preserved, so even a legacy-only system DLL
might clobber their low half/quarter.

Jan

> --- a/binutils/dlltool.c
> +++ b/binutils/dlltool.c
> @@ -583,22 +583,37 @@ static const char i386_trampoline[] =
>    "\tpopl %%ecx\n"
>    "\tjmp *%%eax\n";
>  
> +/* Save integer arg regs in parameter space reserved by our caller
> +   above the return address.  Allocate space for four fp arg regs plus
> +   parameter space possibly used by __delayLoadHelper2 plus alignment.
> +   We enter with the stack offset from 16-byte alignment by the return
> +   address, so allocate 64 + 32 + 8 = 104 bytes.
> +   FIXME: do we need to save avx ymm0-5 used to pass vectors args?  If
> +   so, how to do it without blowing up on cpus without avx, cpuid?  */
>  static const char i386_x64_trampoline[] =
> -  "\tsubq $72, %%rsp\n"
> -  "\t.seh_stackalloc 72\n"
> +  "\tsubq $104, %%rsp\n"
> +  "\t.seh_stackalloc 104\n"
>    "\t.seh_endprologue\n"
> -  "\tmovq %%rcx, 64(%%rsp)\n"
> -  "\tmovq %%rdx, 56(%%rsp)\n"
> -  "\tmovq %%r8, 48(%%rsp)\n"
> -  "\tmovq %%r9, 40(%%rsp)\n"
> -  "\tmovq  %%rax, %%rdx\n"
> -  "\tleaq  __DELAY_IMPORT_DESCRIPTOR_%s(%%rip), %%rcx\n"
> +  "\tmovq %%rcx, 104+8(%%rsp)\n"
> +  "\tmovq %%rdx, 104+16(%%rsp)\n"
> +  "\tmovq %%r8, 104+24(%%rsp)\n"
> +  "\tmovq %%r9, 104+32(%%rsp)\n"
> +  "\tmovaps %%xmm0, 32(%%rsp)\n"
> +  "\tmovaps %%xmm1, 48(%%rsp)\n"
> +  "\tmovaps %%xmm2, 64(%%rsp)\n"
> +  "\tmovaps %%xmm3, 80(%%rsp)\n"
> +  "\tmovq %%rax, %%rdx\n"
> +  "\tleaq __DELAY_IMPORT_DESCRIPTOR_%s(%%rip), %%rcx\n"
>    "\tcall __delayLoadHelper2\n"
> -  "\tmovq 40(%%rsp), %%r9\n"
> -  "\tmovq 48(%%rsp), %%r8\n"
> -  "\tmovq 56(%%rsp), %%rdx\n"
> -  "\tmovq 64(%%rsp), %%rcx\n"
> -  "\taddq $72, %%rsp\n"
> +  "\tmovq 104+8(%%rsp), %%rcx\n"
> +  "\tmovq 104+16(%%rsp), %%rdx\n"
> +  "\tmovq 104+24(%%rsp), %%r8\n"
> +  "\tmovq 104+32(%%rsp), %%r9\n"
> +  "\tmovaps 32(%%rsp), %%xmm0\n"
> +  "\tmovaps 48(%%rsp), %%xmm1\n"
> +  "\tmovaps 64(%%rsp), %%xmm2\n"
> +  "\tmovaps 80(%%rsp), %%xmm3\n"
> +  "\taddq $104, %%rsp\n"
>    "\tjmp *%%rax\n";
>  
>  struct mac
>
  
Alan Modra May 17, 2023, 11:52 p.m. UTC | #2
On Wed, May 17, 2023 at 12:18:15PM +0200, Jan Beulich wrote:
> On 17.05.2023 02:52, Alan Modra wrote:
> > PR29189 is an excellent bug report.  The reporter debugged the problem
> > to the point of finding out exactly where things were going wrong (in
> > a windows dll, so not a mingw problem) and even supplied a fix, giving
> > an ABI reference.  There hasn't been any action on the bug report due
> > to lack of an active mingw binutils maintainer, so I thought I'd take
> > a look as part of trying to whittle down the enormous number of
> > binutils bugzilla entries.
> > 
> > This is a rewrite of the patch given in the PR.  (It might even
> > resemble code emitted by Microsoft's LINK.EXE as reported in
> > https://www.mail-archive.com/llvm-bugs@lists.llvm.org/msg55205.html).
> > I'm not going to apply it without review from an x86 maintainer,
> > because it's been too long since I did any serious x86 specific work
> > and I don't want to break things for someone running an original amd64
> > machine.  The patch needs to be tested too.  I don't have a mingw
> > setup.
> > 
> > 	PR 29189
> > 	* dlltool.c (i386_x64_trampoline): Save and restore xmm0-3.  Make
> > 	use of parameter save area for integer arg regs.  Comment.
> 
> The change looks good to me, and I'm inclined to suggest that we postpone
> dealing with wider vector registers (it's not just AVX, but also AVX512
> which may need dealing with), in the hope that the system DLLs would only
> ever touch the XMM ones, and only via legacy (pre-AVX) instructions (thus
> leaving their upper halves / three quarters intact).

Yes, I think the same.  "only via pre-AVX insns" is more of a concern
than any real vector processing occurring.

> However, I'm inclined to suggest that we deal with %xmm4 and %xmm5 right
> away: They're not callee preserved, so even a legacy-only system DLL
> might clobber their low half/quarter.

That sounds very reasonable.  Here's an updated patch.

	PR 29189
	* dlltool.c (i386_x64_trampoline): Save and restore xmm0-5.  Make
	use of parameter save area for integer arg regs.  Comment.

diff --git a/binutils/dlltool.c b/binutils/dlltool.c
index 31c864d7d5c..3ae9de28f10 100644
--- a/binutils/dlltool.c
+++ b/binutils/dlltool.c
@@ -583,22 +583,48 @@ static const char i386_trampoline[] =
   "\tpopl %%ecx\n"
   "\tjmp *%%eax\n";
 
+/* Save integer arg regs in parameter space reserved by our caller
+   above the return address.  Allocate space for six fp arg regs plus
+   parameter space possibly used by __delayLoadHelper2 plus alignment.
+   We enter with the stack offset from 16-byte alignment by the return
+   address, so allocate 96 + 32 + 8 = 136 bytes.  Note that only the
+   first four xmm regs are used to pass fp args, but the first six
+   vector ymm (zmm too?) are used to pass vector args.  We are
+   assuming that volatile vector regs are not modified inside
+   __delayLoadHelper2.  However, it is known that at least xmm0 and
+   xmm1 are trashed in some versions of Microsoft dlls, and if xmm4 or
+   xmm5 are also used then that would trash the lower bits of ymm4 and
+   ymm5.  If it turns out that vector insns with a vex prefix are used
+   then we'll need to save ymm0-5 here but that can't be done without
+   first testing cpuid to see whether the instructions are available.  */
 static const char i386_x64_trampoline[] =
-  "\tsubq $72, %%rsp\n"
-  "\t.seh_stackalloc 72\n"
+  "\tsubq $136, %%rsp\n"
+  "\t.seh_stackalloc 136\n"
   "\t.seh_endprologue\n"
-  "\tmovq %%rcx, 64(%%rsp)\n"
-  "\tmovq %%rdx, 56(%%rsp)\n"
-  "\tmovq %%r8, 48(%%rsp)\n"
-  "\tmovq %%r9, 40(%%rsp)\n"
-  "\tmovq  %%rax, %%rdx\n"
-  "\tleaq  __DELAY_IMPORT_DESCRIPTOR_%s(%%rip), %%rcx\n"
+  "\tmovq %%rcx, 136+8(%%rsp)\n"
+  "\tmovq %%rdx, 136+16(%%rsp)\n"
+  "\tmovq %%r8, 136+24(%%rsp)\n"
+  "\tmovq %%r9, 136+32(%%rsp)\n"
+  "\tmovaps %%xmm0, 32(%%rsp)\n"
+  "\tmovaps %%xmm1, 48(%%rsp)\n"
+  "\tmovaps %%xmm2, 64(%%rsp)\n"
+  "\tmovaps %%xmm3, 80(%%rsp)\n"
+  "\tmovaps %%xmm4, 96(%%rsp)\n"
+  "\tmovaps %%xmm5, 112(%%rsp)\n"
+  "\tmovq %%rax, %%rdx\n"
+  "\tleaq __DELAY_IMPORT_DESCRIPTOR_%s(%%rip), %%rcx\n"
   "\tcall __delayLoadHelper2\n"
-  "\tmovq 40(%%rsp), %%r9\n"
-  "\tmovq 48(%%rsp), %%r8\n"
-  "\tmovq 56(%%rsp), %%rdx\n"
-  "\tmovq 64(%%rsp), %%rcx\n"
-  "\taddq $72, %%rsp\n"
+  "\tmovq 136+8(%%rsp), %%rcx\n"
+  "\tmovq 136+16(%%rsp), %%rdx\n"
+  "\tmovq 136+24(%%rsp), %%r8\n"
+  "\tmovq 136+32(%%rsp), %%r9\n"
+  "\tmovaps 32(%%rsp), %%xmm0\n"
+  "\tmovaps 48(%%rsp), %%xmm1\n"
+  "\tmovaps 64(%%rsp), %%xmm2\n"
+  "\tmovaps 80(%%rsp), %%xmm3\n"
+  "\tmovaps 96(%%rsp), %%xmm4\n"
+  "\tmovaps 112(%%rsp), %%xmm5\n"
+  "\taddq $136, %%rsp\n"
   "\tjmp *%%rax\n";
 
 struct mac
  
Jan Beulich May 19, 2023, 7:11 a.m. UTC | #3
On 18.05.2023 01:52, Alan Modra wrote:
> On Wed, May 17, 2023 at 12:18:15PM +0200, Jan Beulich wrote:
>> On 17.05.2023 02:52, Alan Modra wrote:
>>> PR29189 is an excellent bug report.  The reporter debugged the problem
>>> to the point of finding out exactly where things were going wrong (in
>>> a windows dll, so not a mingw problem) and even supplied a fix, giving
>>> an ABI reference.  There hasn't been any action on the bug report due
>>> to lack of an active mingw binutils maintainer, so I thought I'd take
>>> a look as part of trying to whittle down the enormous number of
>>> binutils bugzilla entries.
>>>
>>> This is a rewrite of the patch given in the PR.  (It might even
>>> resemble code emitted by Microsoft's LINK.EXE as reported in
>>> https://www.mail-archive.com/llvm-bugs@lists.llvm.org/msg55205.html).
>>> I'm not going to apply it without review from an x86 maintainer,
>>> because it's been too long since I did any serious x86 specific work
>>> and I don't want to break things for someone running an original amd64
>>> machine.  The patch needs to be tested too.  I don't have a mingw
>>> setup.
>>>
>>> 	PR 29189
>>> 	* dlltool.c (i386_x64_trampoline): Save and restore xmm0-3.  Make
>>> 	use of parameter save area for integer arg regs.  Comment.
>>
>> The change looks good to me, and I'm inclined to suggest that we postpone
>> dealing with wider vector registers (it's not just AVX, but also AVX512
>> which may need dealing with), in the hope that the system DLLs would only
>> ever touch the XMM ones, and only via legacy (pre-AVX) instructions (thus
>> leaving their upper halves / three quarters intact).
> 
> Yes, I think the same.  "only via pre-AVX insns" is more of a concern
> than any real vector processing occurring.
> 
>> However, I'm inclined to suggest that we deal with %xmm4 and %xmm5 right
>> away: They're not callee preserved, so even a legacy-only system DLL
>> might clobber their low half/quarter.
> 
> That sounds very reasonable.  Here's an updated patch.
> 
> 	PR 29189
> 	* dlltool.c (i386_x64_trampoline): Save and restore xmm0-5.  Make
> 	use of parameter save area for integer arg regs.  Comment.

Lgtm, fwiw. One more remark though (you may want to further extend the
comment):

> --- a/binutils/dlltool.c
> +++ b/binutils/dlltool.c
> @@ -583,22 +583,48 @@ static const char i386_trampoline[] =
>    "\tpopl %%ecx\n"
>    "\tjmp *%%eax\n";
>  
> +/* Save integer arg regs in parameter space reserved by our caller
> +   above the return address.  Allocate space for six fp arg regs plus
> +   parameter space possibly used by __delayLoadHelper2 plus alignment.
> +   We enter with the stack offset from 16-byte alignment by the return
> +   address, so allocate 96 + 32 + 8 = 136 bytes.  Note that only the
> +   first four xmm regs are used to pass fp args, but the first six
> +   vector ymm (zmm too?) are used to pass vector args.  We are
> +   assuming that volatile vector regs are not modified inside
> +   __delayLoadHelper2.  However, it is known that at least xmm0 and
> +   xmm1 are trashed in some versions of Microsoft dlls, and if xmm4 or
> +   xmm5 are also used then that would trash the lower bits of ymm4 and
> +   ymm5.  If it turns out that vector insns with a vex prefix are used
> +   then we'll need to save ymm0-5 here but that can't be done without
> +   first testing cpuid to see whether the instructions are available.  */

Checking CPUID alone isn't going to be sufficient. XCR0 would also need
consulting.

Jan

>  static const char i386_x64_trampoline[] =
> -  "\tsubq $72, %%rsp\n"
> -  "\t.seh_stackalloc 72\n"
> +  "\tsubq $136, %%rsp\n"
> +  "\t.seh_stackalloc 136\n"
>    "\t.seh_endprologue\n"
> -  "\tmovq %%rcx, 64(%%rsp)\n"
> -  "\tmovq %%rdx, 56(%%rsp)\n"
> -  "\tmovq %%r8, 48(%%rsp)\n"
> -  "\tmovq %%r9, 40(%%rsp)\n"
> -  "\tmovq  %%rax, %%rdx\n"
> -  "\tleaq  __DELAY_IMPORT_DESCRIPTOR_%s(%%rip), %%rcx\n"
> +  "\tmovq %%rcx, 136+8(%%rsp)\n"
> +  "\tmovq %%rdx, 136+16(%%rsp)\n"
> +  "\tmovq %%r8, 136+24(%%rsp)\n"
> +  "\tmovq %%r9, 136+32(%%rsp)\n"
> +  "\tmovaps %%xmm0, 32(%%rsp)\n"
> +  "\tmovaps %%xmm1, 48(%%rsp)\n"
> +  "\tmovaps %%xmm2, 64(%%rsp)\n"
> +  "\tmovaps %%xmm3, 80(%%rsp)\n"
> +  "\tmovaps %%xmm4, 96(%%rsp)\n"
> +  "\tmovaps %%xmm5, 112(%%rsp)\n"
> +  "\tmovq %%rax, %%rdx\n"
> +  "\tleaq __DELAY_IMPORT_DESCRIPTOR_%s(%%rip), %%rcx\n"
>    "\tcall __delayLoadHelper2\n"
> -  "\tmovq 40(%%rsp), %%r9\n"
> -  "\tmovq 48(%%rsp), %%r8\n"
> -  "\tmovq 56(%%rsp), %%rdx\n"
> -  "\tmovq 64(%%rsp), %%rcx\n"
> -  "\taddq $72, %%rsp\n"
> +  "\tmovq 136+8(%%rsp), %%rcx\n"
> +  "\tmovq 136+16(%%rsp), %%rdx\n"
> +  "\tmovq 136+24(%%rsp), %%r8\n"
> +  "\tmovq 136+32(%%rsp), %%r9\n"
> +  "\tmovaps 32(%%rsp), %%xmm0\n"
> +  "\tmovaps 48(%%rsp), %%xmm1\n"
> +  "\tmovaps 64(%%rsp), %%xmm2\n"
> +  "\tmovaps 80(%%rsp), %%xmm3\n"
> +  "\tmovaps 96(%%rsp), %%xmm4\n"
> +  "\tmovaps 112(%%rsp), %%xmm5\n"
> +  "\taddq $136, %%rsp\n"
>    "\tjmp *%%rax\n";
>  
>  struct mac
>
  

Patch

diff --git a/binutils/dlltool.c b/binutils/dlltool.c
index 31c864d7d5c..eb2c4f34658 100644
--- a/binutils/dlltool.c
+++ b/binutils/dlltool.c
@@ -583,22 +583,37 @@  static const char i386_trampoline[] =
   "\tpopl %%ecx\n"
   "\tjmp *%%eax\n";
 
+/* Save integer arg regs in parameter space reserved by our caller
+   above the return address.  Allocate space for four fp arg regs plus
+   parameter space possibly used by __delayLoadHelper2 plus alignment.
+   We enter with the stack offset from 16-byte alignment by the return
+   address, so allocate 64 + 32 + 8 = 104 bytes.
+   FIXME: do we need to save avx ymm0-5 used to pass vectors args?  If
+   so, how to do it without blowing up on cpus without avx, cpuid?  */
 static const char i386_x64_trampoline[] =
-  "\tsubq $72, %%rsp\n"
-  "\t.seh_stackalloc 72\n"
+  "\tsubq $104, %%rsp\n"
+  "\t.seh_stackalloc 104\n"
   "\t.seh_endprologue\n"
-  "\tmovq %%rcx, 64(%%rsp)\n"
-  "\tmovq %%rdx, 56(%%rsp)\n"
-  "\tmovq %%r8, 48(%%rsp)\n"
-  "\tmovq %%r9, 40(%%rsp)\n"
-  "\tmovq  %%rax, %%rdx\n"
-  "\tleaq  __DELAY_IMPORT_DESCRIPTOR_%s(%%rip), %%rcx\n"
+  "\tmovq %%rcx, 104+8(%%rsp)\n"
+  "\tmovq %%rdx, 104+16(%%rsp)\n"
+  "\tmovq %%r8, 104+24(%%rsp)\n"
+  "\tmovq %%r9, 104+32(%%rsp)\n"
+  "\tmovaps %%xmm0, 32(%%rsp)\n"
+  "\tmovaps %%xmm1, 48(%%rsp)\n"
+  "\tmovaps %%xmm2, 64(%%rsp)\n"
+  "\tmovaps %%xmm3, 80(%%rsp)\n"
+  "\tmovq %%rax, %%rdx\n"
+  "\tleaq __DELAY_IMPORT_DESCRIPTOR_%s(%%rip), %%rcx\n"
   "\tcall __delayLoadHelper2\n"
-  "\tmovq 40(%%rsp), %%r9\n"
-  "\tmovq 48(%%rsp), %%r8\n"
-  "\tmovq 56(%%rsp), %%rdx\n"
-  "\tmovq 64(%%rsp), %%rcx\n"
-  "\taddq $72, %%rsp\n"
+  "\tmovq 104+8(%%rsp), %%rcx\n"
+  "\tmovq 104+16(%%rsp), %%rdx\n"
+  "\tmovq 104+24(%%rsp), %%r8\n"
+  "\tmovq 104+32(%%rsp), %%r9\n"
+  "\tmovaps 32(%%rsp), %%xmm0\n"
+  "\tmovaps 48(%%rsp), %%xmm1\n"
+  "\tmovaps 64(%%rsp), %%xmm2\n"
+  "\tmovaps 80(%%rsp), %%xmm3\n"
+  "\taddq $104, %%rsp\n"
   "\tjmp *%%rax\n";
 
 struct mac