[v1,0/4] LoongArch: Add support for TLS Descriptors (TLSDESC)

Message ID 20231201090424.854662-1-cailulu@loongson.cn
Series LoongArch: Add support for TLS Descriptors (TLSDESC)

Message

Lulu Cai Dec. 1, 2023, 9:04 a.m. UTC
  The LoongArch TLS Descriptors implementation contains several points:

1. The instruction sequence is:
   pcalau12i  $a0,%desc_pc_hi20(var)		#R_LARCH_TLS_DESC_PC_HI20
   ld.d       $a1,$a0,%desc_ld_pc_lo12(var)	#R_LARCH_TLS_DESC_LD_PC_LO12
   addi.d     $a0,$a0,%desc_add_pc_lo12(var)	#R_LARCH_TLS_DESC_ADD_PC_LO12
   jirl       $ra,$a1,%desc_call(var)		#R_LARCH_TLS_DESC_CALL

   For each DESC, the linker generates an R_LARCH_TLS_DESC64 dynamic relocation,
   which is placed in .rela.dyn.
   Each TLSDESC is always allocated two GOT slots and one dynamic relocation entry.

2. When the same TLS variable is accessed in multiple ways, at most 5 GOT
   slots are used. For example, when GD, TLSDESC, and IE all access the same TLS
   variable, GD always uses the first two of the five GOT slots, TLSDESC uses the
   third and fourth, and IE uses the last.

3. TLSDESC always requires a dynamic relocation because LoongArch does not yet have
   TLS type transitions. However, statically linked programs cannot resolve TLSDESC's
   dynamic relocation, so we added a type transition for this case.
   DESC -> LE:
   pcalau12i  $a0,%desc_pc_hi20(var)          =>  lu12i.w $a0,%le_hi20(var)
   ld.d       $a1,$a0,%desc_ld_pc_lo12(var)   =>  ori $a0,$a0,%le_lo12(var)
   addi.d     $a0,$a0,%desc_add_pc_lo12(var)  =>  NOP
   jirl       $ra,$a1,%desc_call(var)         =>  NOP

4. The current code passes the gas, ld, and glibc tests.
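
The GOT slot allocation in point 2 can be sketched as a small model. This is only an illustration, not the code in bfd/elfnn-loongarch.c: the name got_layout is hypothetical, and packing the slots when only a subset of the access models is used is an assumption (the cover letter only describes the case where all three are present).

```python
# Hypothetical model of the per-symbol GOT slot layout from point 2.
# Slot counts follow the cover letter: GD = 2, TLSDESC = 2, IE = 1.
SLOTS = (("GD", 2), ("DESC", 2), ("IE", 1))  # fixed order: GD, DESC, IE


def got_layout(used):
    """Map each access model in `used` to the GOT slot indices it occupies.

    Assumption: models that are not used take no slots and later models
    move up to fill the gap.
    """
    layout = {}
    next_slot = 0
    for model, count in SLOTS:
        if model in used:
            layout[model] = list(range(next_slot, next_slot + count))
            next_slot += count
    return layout


if __name__ == "__main__":
    # All three access models on the same variable: five slots in total.
    print(got_layout({"GD", "DESC", "IE"}))
```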

Lulu Cai (4):
  LoongArch: Add new relocs and macro for TLSDESC.
  LoongArch: Add support for TLSDESC in ld.
  LoongArch: Add transition support for DESC to LE.
  LoongArch: Add testsuits for TLSDESC in gas and ld.

 bfd/bfd-in2.h                                 |  12 +
 bfd/elfnn-loongarch.c                         | 276 ++++++++++++++++--
 bfd/elfxx-loongarch.c                         | 209 ++++++++++++-
 bfd/libbfd.h                                  |  12 +
 bfd/reloc.c                                   |  29 ++
 gas/config/tc-loongarch.c                     |  14 +-
 gas/testsuite/gas/loongarch/tlsdesc_32.d      |  26 ++
 gas/testsuite/gas/loongarch/tlsdesc_32.s      |  12 +
 gas/testsuite/gas/loongarch/tlsdesc_64.d      |  26 ++
 gas/testsuite/gas/loongarch/tlsdesc_64.s      |  12 +
 .../gas/loongarch/tlsdesc_large_abs.d         |  21 ++
 .../gas/loongarch/tlsdesc_large_abs.s         |   9 +
 .../gas/loongarch/tlsdesc_large_pc.d          |  34 +++
 .../gas/loongarch/tlsdesc_large_pc.s          |  16 +
 include/elf/loongarch.h                       |  22 +-
 include/opcode/loongarch.h                    |   3 +
 .../ld-loongarch-elf/ld-loongarch-elf.exp     |  16 +
 ld/testsuite/ld-loongarch-elf/tls-desc.dd     |  74 +++++
 ld/testsuite/ld-loongarch-elf/tls-desc.rd     |  79 +++++
 ld/testsuite/ld-loongarch-elf/tls-desc.s      | 102 +++++++
 .../ld-loongarch-elf/tls-relax-desc-le.d      |  15 +
 .../ld-loongarch-elf/tls-relax-desc-le.s      |   8 +
 opcodes/loongarch-opc.c                       |  54 ++++
 23 files changed, 1054 insertions(+), 27 deletions(-)
 create mode 100644 gas/testsuite/gas/loongarch/tlsdesc_32.d
 create mode 100644 gas/testsuite/gas/loongarch/tlsdesc_32.s
 create mode 100644 gas/testsuite/gas/loongarch/tlsdesc_64.d
 create mode 100644 gas/testsuite/gas/loongarch/tlsdesc_64.s
 create mode 100644 gas/testsuite/gas/loongarch/tlsdesc_large_abs.d
 create mode 100644 gas/testsuite/gas/loongarch/tlsdesc_large_abs.s
 create mode 100644 gas/testsuite/gas/loongarch/tlsdesc_large_pc.d
 create mode 100644 gas/testsuite/gas/loongarch/tlsdesc_large_pc.s
 create mode 100644 ld/testsuite/ld-loongarch-elf/tls-desc.dd
 create mode 100644 ld/testsuite/ld-loongarch-elf/tls-desc.rd
 create mode 100644 ld/testsuite/ld-loongarch-elf/tls-desc.s
 create mode 100644 ld/testsuite/ld-loongarch-elf/tls-relax-desc-le.d
 create mode 100644 ld/testsuite/ld-loongarch-elf/tls-relax-desc-le.s
  

Comments

Alexandre Oliva Dec. 1, 2023, 4:14 p.m. UTC | #1
Hello,

On Dec  1, 2023, Lulu Cai <cailulu@loongson.cn> wrote:

> The LoongArch TLS Descriptors implementation contains several points:

I'm excited to see another platform gain TLS Descriptors support.

I'm not deeply acquainted with LoongArch, but I'll dare chime in.

> 1. The instruction sequence is:
>    pcalau12i  $a0,%desc_pc_hi20(var)		#R_LARCH_TLS_DESC_PC_HI20
>    ld.d       $a1,$a0,%desc_ld_pc_lo12(var)	#R_LARCH_TLS_DESC_LD_PC_LO12
>    addi.d     $a0,$a0,%desc_add_pc_lo12(var)	#R_LARCH_TLS_DESC_ADD_PC_LO12
>    jirl       $ra,$a1,%desc_call(var)		#R_LARCH_TLS_DESC_CALL

Are these instructions fixed, and supposed to appear in this sequence,
or can different registers be used, and the instructions intermixed with
other unrelated ones?  The ability to intermix them for better
scheduling and register allocation was one of the guiding design
principles of TLS Descriptors, so the canonical sequence and the design
of relaxations should ideally take flexibility into account, and choose
relaxations with similar scheduling profiles.

Say, would compiler-generated or hand-coded asm still work if one used:

     pcalau12i  $a2,%desc_pc_hi20(var)		#R_LARCH_TLS_DESC_PC_HI20
     ld.d       $a3,$a2,%desc_ld_pc_lo12(var)	#R_LARCH_TLS_DESC_LD_PC_LO12
     addi.d     $a0,$a2,%desc_add_pc_lo12(var)	#R_LARCH_TLS_DESC_ADD_PC_LO12
     jirl       $ra,$a3,%desc_call(var)		#R_LARCH_TLS_DESC_CALL

or even

     pcalau12i  $a2,%desc_pc_hi20(var)		#R_LARCH_TLS_DESC_PC_HI20
     or         $a5,$a2,$r0
     or         $a6,$a2,$r0
     addi.d     $a4,$a5,%desc_add_pc_lo12(var)	#R_LARCH_TLS_DESC_ADD_PC_LO12
     ld.d       $a3,$a6,%desc_ld_pc_lo12(var)	#R_LARCH_TLS_DESC_LD_PC_LO12
     or         $a0,$a4,$r0
     jirl       $ra,$a3,%desc_call(var)		#R_LARCH_TLS_DESC_CALL

?

(I realize you seem to have not planned/implemented relaxations, aside
from the LE one for static linking, but planning for them ahead of time
helps make sure they're doable)

E.g., for IE, I'd suggest turning the latter sequence into (I'm making
up relocation names):

     pcalau12i  $a2,%gotpc_tlsoff_hi20(var)
     or         $a5,$a2,$r0 #not necessary, but not marked, so unchanged
     or         $a6,$a2,$r0
     nop
     ld.d       $a3,$a6,%gotpc_tlsoff_lo12(var)
     or         $a0,$a4,$r0 #not necessary, but not marked, so unchanged
     or         $a0,$a3,$r0

and for LE, I'd suggest:

     pcalau12i  $a2,%tlsoff_hi20(var)
     or         $a5,$a2,$r0
     or         $a6,$a2,$r0 #not necessary, but not marked, so unchanged
     addi.d     $a4,$a5,%tlsoff_lo12(var)
     nop
     or         $a0,$a4,$r0 #not necessary, but not marked, so unchanged
     nop

This addi.d is what I suggest instead of the 'ori' in the LE relaxation.
The main difference in my suggestion is that it takes the same position
of the original addi instruction, thus the very same scheduling profile,
and more importantly participating the same way in the data flow, as the
extra moves help see.

I realize that addi rather than ori may require offsetting the base
address to account for the signed rather than unsigned (I suppose)
immediate, so maybe it's not worth it.  I am not sure, however, whether
you can even separate the pcalau12i hi20 instruction from the subsequent
lo12 one (ISTM that it would be challenging to match them if so,
especially if a single hi20 is reused by multiple lo12 loads), so maybe
there is less flexibility to be exploited than I'm making out.
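
The signed-versus-unsigned point can be made concrete with a small sketch. It assumes the upper 20 bits are loaded with a lu12i.w-style instruction (sign-extended hi20 shifted left by 12), that ori zero-extends its 12-bit immediate, and that addi.d sign-extends it. The function names are made up for illustration and are not part of the patch.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def split_for_ori(off):
    """hi20/lo12 for lu12i.w + ori: the low part is zero-extended."""
    return (off >> 12) & 0xFFFFF, off & 0xFFF


def split_for_addi(off):
    """hi20/lo12 for lu12i.w + addi.d: the low part is sign-extended,
    so hi20 is biased by 0x800 to compensate when lo12 is negative."""
    return ((off + 0x800) >> 12) & 0xFFFFF, sign_extend(off, 12)


def rebuild_ori(hi20, lo12):
    """Value produced by lu12i.w followed by ori."""
    return (sign_extend(hi20, 20) << 12) | lo12


def rebuild_addi(hi20, lo12):
    """Value produced by lu12i.w followed by addi.d."""
    return (sign_extend(hi20, 20) << 12) + lo12


if __name__ == "__main__":
    off = 0x12345800  # bit 11 set, so the two splits differ
    print([hex(x) for x in split_for_ori(off)])
    print(split_for_addi(off))
```

With bit 11 of the offset set, the addi form needs hi20 one higher than the ori form, which is the base offsetting mentioned above.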

Anyway, I hope this makes sense and that it helps,
  
mengqinggang Dec. 2, 2023, 3:54 a.m. UTC | #2
Thank you very much for your suggestions.


On 2023/12/2 at 12:14 AM, Alexandre Oliva wrote:
> Hello,
>
> On Dec  1, 2023, Lulu Cai <cailulu@loongson.cn> wrote:
>
>> The LoongArch TLS Descriptors implementation contains several points:
> I'm excited to see another platform gain TLS Descriptors support.
>
> I'm not deeply acquainted with LoongArch, but I'll dare chime in.
>
>> 1. The instruction sequence is:
>>     pcalau12i  $a0,%desc_pc_hi20(var)		#R_LARCH_TLS_DESC_PC_HI20
>>     ld.d       $a1,$a0,%desc_ld_pc_lo12(var)	#R_LARCH_TLS_DESC_LD_PC_LO12
>>     addi.d     $a0,$a0,%desc_add_pc_lo12(var)	#R_LARCH_TLS_DESC_ADD_PC_LO12
>>     jirl       $ra,$a1,%desc_call(var)		#R_LARCH_TLS_DESC_CALL
> Are these instructions fixed, and supposed to appear in this sequence,
> or can different registers be used, and the instructions intermixed with
> other unrelated ones?  The ability to intermix them for better
> scheduling and register allocation was one of the guiding design
> principles of TLS Descriptors, so the canonical sequence and the design
> of relaxations should ideally take flexibility into account, and choose
> relaxations with similar scheduling profiles.
>
> Say, would compiler-generated or hand-coded asm still work if one used:
>
>       pcalau12i  $a2,%desc_pc_hi20(var)		#R_LARCH_TLS_DESC_PC_HI20
>       ld.d       $a3,$a2,%desc_ld_pc_lo12(var)	#R_LARCH_TLS_DESC_LD_PC_LO12
>       addi.d     $a0,$a2,%desc_add_pc_lo12(var)	#R_LARCH_TLS_DESC_ADD_PC_LO12
>       jirl       $ra,$a3,%desc_call(var)		#R_LARCH_TLS_DESC_CALL
>
> or even
>
>       pcalau12i  $a2,%desc_pc_hi20(var)		#R_LARCH_TLS_DESC_PC_HI20
>       or         $a5,$a2,$r0
>       or         $a6,$a2,$r0
>       addi.d     $a4,$a5,%desc_add_pc_lo12(var)	#R_LARCH_TLS_DESC_ADD_PC_LO12
>       ld.d       $a3,$a6,%desc_ld_pc_lo12(var)	#R_LARCH_TLS_DESC_LD_PC_LO12
>       or         $a0,$a4,$r0
>       jirl       $ra,$a3,%desc_call(var)		#R_LARCH_TLS_DESC_CALL
>
> ?


I ran a test, and both sequences still work.
However, in this version of the patch, the TLS descriptor instruction
sequence is only generated by expanding la.tls.desc, so fixed registers
and instructions are used.


> (I realize you seem to have not planned/implemented relaxations, aside
> from the LE one for static linking, but planning for them ahead of time
> helps make sure they're doable)


We will support relaxation to IE in the future.
Because glibc can only resolve R_XXX_IRELATIVE relocations in statically
linked programs, we relax DESC to LE to avoid generating an
R_LARCH_TLS_DESC relocation.


> E.g., for IE, I'd suggest turning the latter sequence into (I'm making
> up relocation names):
>
>       pcalau12i  $a2,%gotpc_tlsoff_hi20(var)
>       or         $a5,$a2,$r0 #not necessary, but not marked, so unchanged
>       or         $a6,$a2,$r0
>       nop
>       ld.d       $a3,$a6,%gotpc_tlsoff_lo12(var)
>       or         $a0,$a4,$r0 #not necessary, but not marked, so unchanged
>       or         $a0,$a3,$r0
>
> and for LE, I'd suggest:
>
>       pcalau12i  $a2,%tlsoff_hi20(var)
>       or         $a5,$a2,$r0
>       or         $a6,$a2,$r0 #not necessary, but not marked, so unchanged
>       addi.d     $a4,$a5,%tlsoff_lo12(var)
>       nop
>       or         $a0,$a4,$r0 #not necessary, but not marked, so unchanged
>       nop
>
> This addi.d is what I suggest instead of the 'ori' in the LE relaxation.
> The main difference in my suggestion is that it takes the same position
> of the original addi instruction, thus the very same scheduling profile,
> and more importantly participating the same way in the data flow, as the
> extra moves help see.

We will add a new relocation for addi.d. The related patch is here:
https://sourceware.org/pipermail/binutils/2023-December/130921.html

>
> I realize that addi rather than ori may require offsetting the base
> address to account for the signed rather than unsigned (I suppose)
> immediate, so maybe it's not worth it.  I am not sure, however, whether
> you can even separate the pcalau12i hi20 instruction from the subsequent
> lo12 one (ISTM that it would be challenging to match them if so,
> especially if a single hi20 is reused by multiple lo12 loads), so maybe
> there is less flexibility to be exploited than I'm making out.
> Anyway, I hope this makes sense and that it helps,
>
  
Jinyang He Dec. 8, 2023, 3:04 a.m. UTC | #3
On 2023-12-01 17:04, Lulu Cai wrote:

> The LoongArch TLS Descriptors implementation contains several points:
>
> 1. The instruction sequence is:
>     pcalau12i  $a0,%desc_pc_hi20(var)		#R_LARCH_TLS_DESC_PC_HI20
>     ld.d       $a1,$a0,%desc_ld_pc_lo12(var)	#R_LARCH_TLS_DESC_LD_PC_LO12
>     addi.d     $a0,$a0,%desc_add_pc_lo12(var)	#R_LARCH_TLS_DESC_ADD_PC_LO12
>     jirl       $ra,$a1,%desc_call(var)		#R_LARCH_TLS_DESC_CALL
>
>     For each DESC, the linker generates an R_LARCH_TLS_DESC64 dynamic relocation,
>     which is placed in .rela.dyn.
>     Each TLSDESC is always allocated two GOT slots and one dynamic relocation entry.

Hi, all,

There is a new idea for the la.tls.desc instruction sequence.
The sequence is:
pcalau12i  $a0,%desc_pc_hi20(var)     #R_LARCH_TLS_DESC_PC_HI20
                                       #R_LARCH_RELAX if needed
addi.d     $a0,$a0,%desc_pc_lo12(var) #R_LARCH_TLS_DESC_PC_LO12
                                       #R_LARCH_RELAX if needed
ld.d       $ra,$a0,%desc_ld(var)      #R_LARCH_TLS_DESC_LD
                                       #R_LARCH_RELAX if needed
jirl       $ra,$ra,%desc_call(var)    #R_LARCH_TLS_DESC_CALL
                                       #R_LARCH_RELAX if needed
It first loads the address of the TLSDESC GOT entry, then loads from that
entry and jumps.
The pcalau12i + addi.d pair should be adjacent.

For DESC to LE type transition,
pcalau12i  $a0,%desc_pc_hi20(var)     => lu12i.w $a0,%le_hi20(var)
addi.d     $a0,$a0,%desc_pc_lo12(var) => ori $a0,$a0,%le_lo12(var)
ld.d       $ra,$a0,%desc_ld(var)      => NOP, delete if with RELAX
jirl       $ra,$ra,%desc_call(var)    => NOP, delete if with RELAX

For DESC to IE type transition,
pcalau12i  $a0,%desc_pc_hi20(var)     => pcalau12i $a0,%ie_hi20(var)
addi.d     $a0,$a0,%desc_pc_lo12(var) => ld.d $a0,$a0,%ie_lo12(var)
ld.d       $ra,$a0,%desc_ld(var)      => NOP, delete if with RELAX
jirl       $ra,$ra,%desc_call(var)    => NOP, delete if with RELAX
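
The two transition tables above can be read as a lookup keyed by relocation type. The following is a rough sketch of that reading, not linker code: the TRANSITIONS table and the transition function are hypothetical, and the relocation and operator names are copied from the proposal as written.

```python
# Hypothetical table-driven model of the proposed DESC -> LE / DESC -> IE
# rewrites. Keys are the relocation names used in the proposal.
TRANSITIONS = {
    "LE": {
        "R_LARCH_TLS_DESC_PC_HI20": "lu12i.w $a0,%le_hi20({sym})",
        "R_LARCH_TLS_DESC_PC_LO12": "ori $a0,$a0,%le_lo12({sym})",
        "R_LARCH_TLS_DESC_LD": "nop",
        "R_LARCH_TLS_DESC_CALL": "nop",
    },
    "IE": {
        "R_LARCH_TLS_DESC_PC_HI20": "pcalau12i $a0,%ie_hi20({sym})",
        "R_LARCH_TLS_DESC_PC_LO12": "ld.d $a0,$a0,%ie_lo12({sym})",
        "R_LARCH_TLS_DESC_LD": "nop",
        "R_LARCH_TLS_DESC_CALL": "nop",
    },
}

DESC_SEQUENCE = [
    "R_LARCH_TLS_DESC_PC_HI20",
    "R_LARCH_TLS_DESC_PC_LO12",
    "R_LARCH_TLS_DESC_LD",
    "R_LARCH_TLS_DESC_CALL",
]


def transition(relocs, sym, target, relax=False):
    """Rewrite a DESC sequence (given as relocation names) for `target`.

    When `relax` is true, the resulting NOPs are dropped, modelling
    "delete if with RELAX".
    """
    out = []
    for reloc in relocs:
        insn = TRANSITIONS[target][reloc].format(sym=sym)
        if insn == "nop" and relax:
            continue
        out.append(insn)
    return out


if __name__ == "__main__":
    for line in transition(DESC_SEQUENCE, "var", "IE", relax=True):
        print(line)
```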

For DESC relaxation, applied when neither DESC to LE nor DESC to IE is possible:
pcalau12i + addi.d -> pcaddi $a0, %???(var) (pseudo reloc type, maybe "R_LARCH_TLS_DESC_PCREL20_S2")
ld.d       $ra,$a0,%desc_ld(var)
jirl       $ra,$ra,%desc_call(var)

For la.tls.gd or la.tls.ld, we can apply the same relaxation when loading the GOT entry address:
pcalau12i + addi.d -> pcaddi $a0, %???(var) (pseudo reloc type, maybe "R_LARCH_TLS_GD/LD_PCREL20_S2")

The relevant information is available in loongarch_elf_relax_section(),
e.g. sec_addr (got), got_off, desc_off, so the relaxation is possible in theory.
It seems R_LARCH_PCREL20_S2 cannot be reused, so new relocation types would be needed.
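
Whether the pcaddi form can be used depends on the GOT entry being in reach. pcaddi adds a signed 20-bit immediate shifted left by 2 to the PC, so the target must be 4-byte aligned relative to the PC and within about +/-2 MiB. A rough sketch of that check follows; the function names are made up, and how the linker would obtain pc and target is left out.

```python
def pcaddi_reaches(pc, target):
    """True if `target` equals pc + (si20 << 2) for some signed 20-bit si20."""
    delta = target - pc
    return delta % 4 == 0 and -(1 << 21) <= delta < (1 << 21)


def pcaddi_imm(pc, target):
    """Return the 20-bit immediate field, or raise if pcaddi cannot reach."""
    if not pcaddi_reaches(pc, target):
        raise ValueError("out of pcaddi range; keep pcalau12i + addi.d")
    return ((target - pc) >> 2) & 0xFFFFF


if __name__ == "__main__":
    # Illustrative addresses only.
    print(hex(pcaddi_imm(0x120000000, 0x120010008)))
```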

All suggestions and ideas are welcome. Thanks in advance.

Jinyang
