RISC-V: Fix wrong tune parameters on int_div

Message ID 20231026185021.654679-1-chenyangyu@isrc.iscas.ac.cn
State Accepted
Headers
Series RISC-V: Fix wrong tune parameters on int_div |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

Yangyu Chen Oct. 26, 2023, 6:50 p.m. UTC
  This patch fixes an issue with the cost on "int_div" in various RISC-V
tune parameters including those for Rocket, SiFive U7 series, and T-Head
C906. This incorrect cost value interferes with the optimization process.
For example, it prevents the optimization of division by a constant to a
more efficient method known as Barrett reduction. This lack of
optimization negatively affects the performance of these systems.

The integer div cost of the Rocket and SiFive U7 is taken from the
Rocket-Chip Divider source code[1] with BigCore configuration[2]. It shows
the divUnroll unchanged which is 1 by default. Thus, the maximum int_div
cycles should be the dataWidth + 1, which is 33 for 32-bit and 65 for
64-bit.

As for C906, the divider takes 2 cycle to start[3], and it produce 2-bit
result each cycle[4]. Thus, the maximum int_div cycles should be the
dataWidth / 2 + 2, which is 18 for 32-bit and 34 for 64-bit.

I also test the performance on VisionFive2 which has Qual-Core Sifive U74.
I write a simple C program to do 1e8 times div by constant 6 in int32. The
result shows it takes 1.998s using div, and 0.420s using barrett reduction
to replace div with mul, which is 4.75x faster.

[1] https://github.com/chipsalliance/rocket-chip/blob/v1.6/src/main/scala/rocket/Multiplier.scala#L40
[2] https://github.com/chipsalliance/rocket-chip/blob/v1.6/src/main/scala/subsystem/Configs.scala#L97
[3] https://github.com/T-head-Semi/openc906/blob/af5614d72de7e5a4b8609c427d2e20af1deb21c4/C906_RTL_FACTORY/gen_rtl/iu/rtl/aq_iu_div.v#L267
[4] https://github.com/T-head-Semi/openc906/blob/af5614d72de7e5a4b8609c427d2e20af1deb21c4/C906_RTL_FACTORY/gen_rtl/iu/rtl/aq_iu_div_shift2_kernel.v#L93

gcc/ChangeLog:

        * config/riscv/riscv.cc: Fix wrong tune parameters on int_div

Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn>
---
 gcc/config/riscv/riscv.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
  

Comments

Robin Dapp Oct. 27, 2023, 7:49 a.m. UTC | #1
> @@ -346,7 +346,7 @@ static const struct riscv_tune_param rocket_tune_info = {
>    {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},	/* fp_mul */
>    {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},	/* fp_div */
>    {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},	/* int_mul */
> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},	/* int_div */
> +  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)},	/* int_div */
>    1,						/* issue_rate */
>    3,						/* branch_cost */
>    5,						/* memory_cost */
> @@ -361,7 +361,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
>    {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},	/* fp_mul */
>    {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},	/* fp_div */
>    {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},	/* int_mul */
> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},	/* int_div */
> +  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)},	/* int_div */
>    2,						/* issue_rate */
>    4,						/* branch_cost */
>    3,						/* memory_cost */
> @@ -376,7 +376,7 @@ static const struct riscv_tune_param thead_c906_tune_info = {
>    {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
>    {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
>    {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
> +  {COSTS_N_INSNS (18), COSTS_N_INSNS (34)}, /* int_div */
>    1,            /* issue_rate */
>    3,            /* branch_cost */
>    5,            /* memory_cost */

Instruction costs don't really correspond to latencies even though
sometimes they are used as if they were.  I'm a bit wary of using
e.g. 65 which would disparage each use of an integer division inside
a sequence.

Could you check which costs we need in order to still emit your wanted
sequence?  Maybe we can use values a bit lower than yours and still
get the proper code.  Where is the decision being made actually?

Regards
 Robin
  
Jeff Law Oct. 27, 2023, 1:55 p.m. UTC | #2
On 10/27/23 01:49, Robin Dapp wrote:
>> @@ -346,7 +346,7 @@ static const struct riscv_tune_param rocket_tune_info = {
>>     {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},	/* fp_mul */
>>     {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},	/* fp_div */
>>     {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},	/* int_mul */
>> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},	/* int_div */
>> +  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)},	/* int_div */
>>     1,						/* issue_rate */
>>     3,						/* branch_cost */
>>     5,						/* memory_cost */
>> @@ -361,7 +361,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
>>     {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},	/* fp_mul */
>>     {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},	/* fp_div */
>>     {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},	/* int_mul */
>> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},	/* int_div */
>> +  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)},	/* int_div */
>>     2,						/* issue_rate */
>>     4,						/* branch_cost */
>>     3,						/* memory_cost */
>> @@ -376,7 +376,7 @@ static const struct riscv_tune_param thead_c906_tune_info = {
>>     {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
>>     {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
>>     {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
>> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
>> +  {COSTS_N_INSNS (18), COSTS_N_INSNS (34)}, /* int_div */
>>     1,            /* issue_rate */
>>     3,            /* branch_cost */
>>     5,            /* memory_cost */
> 
> Instruction costs don't really correspond to latencies even though
> sometimes they are used as if they were.  I'm a bit wary of using
> e.g. 65 which would disparage each use of an integer division inside
> a sequence.
> 
> Could you check which costs we need in order to still emit your wanted
> sequence?  Maybe we can use values a bit lower than yours and still
> get the proper code.  Where is the decision being made actually?
The main use of costing of a div/mod instruction is to guide the 
reciprocal division code when dividing by a constant.    In that context 
we're comparing costs against a sequence of multiplies, shifts, add/sub 
insns which are almost always costed by their latency.  So using latency 
for division is a reasonable place to start.

The other thing that might be worth investigating for those processors 
would be to set "use_divmod_expansion" in the cost structure.  I've 
heard talk of fusing div/mod into divmod, though I'm not aware of any 
part implementing that fusion (from a prior life, that would seem to 
require a 2nd output port on the integer unit which could be highly 
undesirable).  Anyway, this could be a followup item for Yangyu if it 
looks profitable.

jeff
  
Jeff Law Oct. 27, 2023, 2:44 p.m. UTC | #3
On 10/26/23 12:50, Yangyu Chen wrote:
> This patch fixes an issue with the cost on "int_div" in various RISC-V
> tune parameters including those for Rocket, SiFive U7 series, and T-Head
> C906. This incorrect cost value interferes with the optimization process.
> For example, it prevents the optimization of division by a constant to a
> more efficient method known as Barrett reduction. This lack of
> optimization negatively affects the performance of these systems.
> 
> The integer div cost of the Rocket and SiFive U7 is taken from the
> Rocket-Chip Divider source code[1] with BigCore configuration[2]. It shows
> the divUnroll unchanged which is 1 by default. Thus, the maximum int_div
> cycles should be the dataWidth + 1, which is 33 for 32-bit and 65 for
> 64-bit.
> 
> As for C906, the divider takes 2 cycle to start[3], and it produce 2-bit
> result each cycle[4]. Thus, the maximum int_div cycles should be the
> dataWidth / 2 + 2, which is 18 for 32-bit and 34 for 64-bit.
> 
> I also test the performance on VisionFive2 which has Qual-Core Sifive U74.
> I write a simple C program to do 1e8 times div by constant 6 in int32. The
> result shows it takes 1.998s using div, and 0.420s using barrett reduction
> to replace div with mul, which is 4.75x faster.
> 
> [1] https://github.com/chipsalliance/rocket-chip/blob/v1.6/src/main/scala/rocket/Multiplier.scala#L40
> [2] https://github.com/chipsalliance/rocket-chip/blob/v1.6/src/main/scala/subsystem/Configs.scala#L97
> [3] https://github.com/T-head-Semi/openc906/blob/af5614d72de7e5a4b8609c427d2e20af1deb21c4/C906_RTL_FACTORY/gen_rtl/iu/rtl/aq_iu_div.v#L267
> [4] https://github.com/T-head-Semi/openc906/blob/af5614d72de7e5a4b8609c427d2e20af1deb21c4/C906_RTL_FACTORY/gen_rtl/iu/rtl/aq_iu_div_shift2_kernel.v#L93
> 
> gcc/ChangeLog:
> 
>          * config/riscv/riscv.cc: Fix wrong tune parameters on int_div
> 
> Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn>
I adjusted the ChangeLog entry and pushed this to the trunk.

As a follow-up you should look at the tuning parameter which controls 
using the divmod expansion.  If these uarchs don't do div+mod fusion, 
then I should strongly suspect that they'll benefit from using the 
divmod expansion path when we need a/b and a%b.

Jeff
  
Andrew Waterman Oct. 27, 2023, 5:21 p.m. UTC | #4
On Fri, Oct 27, 2023 at 6:55 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 10/27/23 01:49, Robin Dapp wrote:
> >> @@ -346,7 +346,7 @@ static const struct riscv_tune_param rocket_tune_info = {
> >>     {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},  /* fp_mul */
> >>     {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},        /* fp_div */
> >>     {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},  /* int_mul */
> >> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},   /* int_div */
> >> +  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
> >>     1,                                               /* issue_rate */
> >>     3,                                               /* branch_cost */
> >>     5,                                               /* memory_cost */
> >> @@ -361,7 +361,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
> >>     {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},  /* fp_mul */
> >>     {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},        /* fp_div */
> >>     {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},  /* int_mul */
> >> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},   /* int_div */
> >> +  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
> >>     2,                                               /* issue_rate */
> >>     4,                                               /* branch_cost */
> >>     3,                                               /* memory_cost */
> >> @@ -376,7 +376,7 @@ static const struct riscv_tune_param thead_c906_tune_info = {
> >>     {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
> >>     {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
> >>     {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
> >> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
> >> +  {COSTS_N_INSNS (18), COSTS_N_INSNS (34)}, /* int_div */
> >>     1,            /* issue_rate */
> >>     3,            /* branch_cost */
> >>     5,            /* memory_cost */
> >
> > Instruction costs don't really correspond to latencies even though
> > sometimes they are used as if they were.  I'm a bit wary of using
> > e.g. 65 which would disparage each use of an integer division inside
> > a sequence.
> >
> > Could you check which costs we need in order to still emit your wanted
> > sequence?  Maybe we can use values a bit lower than yours and still
> > get the proper code.  Where is the decision being made actually?
> The main use of costing of a div/mod instruction is to guide the
> reciprocal division code when dividing by a constant.    In that context
> we're comparing costs against a sequence of multiplies, shifts, add/sub
> insns which are almost always costed by their latency.  So using latency
> for division is a reasonable place to start.
>
> The other thing that might be worth investigating for those processors
> would be to set "use_divmod_expansion" in the cost structure.  I've
> heard talk of fusing div/mod into divmod, though I'm not aware of any
> part implementing that fusion

I'm also unaware of existing implementations that fuse these
operations; div + mul + sub is probably best for most uarches...

> (from a prior life, that would seem to
> require a 2nd output port on the integer unit which could be highly
> undesirable).

...but it can be done more cheaply than this, so I wouldn't foreclose
on the possibility.  Nevertheless, future work, as you say.

> Anyway, this could be a followup item for Yangyu if it
> looks profitable.
>
> jeff
  

Patch

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index f2dcb0db6fb..ca9a2ca81d5 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -346,7 +346,7 @@  static const struct riscv_tune_param rocket_tune_info = {
   {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},	/* fp_mul */
   {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},	/* fp_div */
   {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},	/* int_mul */
-  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},	/* int_div */
+  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)},	/* int_div */
   1,						/* issue_rate */
   3,						/* branch_cost */
   5,						/* memory_cost */
@@ -361,7 +361,7 @@  static const struct riscv_tune_param sifive_7_tune_info = {
   {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},	/* fp_mul */
   {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},	/* fp_div */
   {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},	/* int_mul */
-  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},	/* int_div */
+  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)},	/* int_div */
   2,						/* issue_rate */
   4,						/* branch_cost */
   3,						/* memory_cost */
@@ -376,7 +376,7 @@  static const struct riscv_tune_param thead_c906_tune_info = {
   {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
   {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
   {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
-  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
+  {COSTS_N_INSNS (18), COSTS_N_INSNS (34)}, /* int_div */
   1,            /* issue_rate */
   3,            /* branch_cost */
   5,            /* memory_cost */