RISC-V: Fix wrong tune parameters on int_div
Checks
Commit Message
This patch fixes an issue with the cost on "int_div" in various RISC-V
tune parameters including those for Rocket, SiFive U7 series, and T-Head
C906. This incorrect cost value interferes with the optimization process.
For example, it prevents the optimization of division by a constant to a
more efficient method known as Barrett reduction. This lack of
optimization negatively affects the performance of these systems.
The integer div cost of the Rocket and SiFive U7 is taken from the
Rocket-Chip Divider source code[1] with BigCore configuration[2]. It shows
the divUnroll unchanged which is 1 by default. Thus, the maximum int_div
cycles should be the dataWidth + 1, which is 33 for 32-bit and 65 for
64-bit.
As for C906, the divider takes 2 cycle to start[3], and it produce 2-bit
result each cycle[4]. Thus, the maximum int_div cycles should be the
dataWidth / 2 + 2, which is 18 for 32-bit and 34 for 64-bit.
I also test the performance on VisionFive2 which has Qual-Core Sifive U74.
I write a simple C program to do 1e8 times div by constant 6 in int32. The
result shows it takes 1.998s using div, and 0.420s using barrett reduction
to replace div with mul, which is 4.75x faster.
[1] https://github.com/chipsalliance/rocket-chip/blob/v1.6/src/main/scala/rocket/Multiplier.scala#L40
[2] https://github.com/chipsalliance/rocket-chip/blob/v1.6/src/main/scala/subsystem/Configs.scala#L97
[3] https://github.com/T-head-Semi/openc906/blob/af5614d72de7e5a4b8609c427d2e20af1deb21c4/C906_RTL_FACTORY/gen_rtl/iu/rtl/aq_iu_div.v#L267
[4] https://github.com/T-head-Semi/openc906/blob/af5614d72de7e5a4b8609c427d2e20af1deb21c4/C906_RTL_FACTORY/gen_rtl/iu/rtl/aq_iu_div_shift2_kernel.v#L93
gcc/ChangeLog:
* config/riscv/riscv.cc: Fix wrong tune parameters on int_div
Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn>
---
gcc/config/riscv/riscv.cc | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Comments
> @@ -346,7 +346,7 @@ static const struct riscv_tune_param rocket_tune_info = {
> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
> + {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
> 1, /* issue_rate */
> 3, /* branch_cost */
> 5, /* memory_cost */
> @@ -361,7 +361,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
> + {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
> 2, /* issue_rate */
> 4, /* branch_cost */
> 3, /* memory_cost */
> @@ -376,7 +376,7 @@ static const struct riscv_tune_param thead_c906_tune_info = {
> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
> + {COSTS_N_INSNS (18), COSTS_N_INSNS (34)}, /* int_div */
> 1, /* issue_rate */
> 3, /* branch_cost */
> 5, /* memory_cost */
Instruction costs don't really correspond to latencies even though
sometimes they are used as if they were. I'm a bit wary of using
e.g. 65 which would disparage each use of an integer division inside
a sequence.
Could you check which costs we need in order to still emit your wanted
sequence? Maybe we can use values a bit lower than yours and still
get the proper code. Where is the decision being made actually?
Regards
Robin
On 10/27/23 01:49, Robin Dapp wrote:
>> @@ -346,7 +346,7 @@ static const struct riscv_tune_param rocket_tune_info = {
>> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
>> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
>> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
>> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
>> + {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
>> 1, /* issue_rate */
>> 3, /* branch_cost */
>> 5, /* memory_cost */
>> @@ -361,7 +361,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
>> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
>> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
>> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
>> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
>> + {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
>> 2, /* issue_rate */
>> 4, /* branch_cost */
>> 3, /* memory_cost */
>> @@ -376,7 +376,7 @@ static const struct riscv_tune_param thead_c906_tune_info = {
>> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
>> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
>> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
>> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
>> + {COSTS_N_INSNS (18), COSTS_N_INSNS (34)}, /* int_div */
>> 1, /* issue_rate */
>> 3, /* branch_cost */
>> 5, /* memory_cost */
>
> Instruction costs don't really correspond to latencies even though
> sometimes they are used as if they were. I'm a bit wary of using
> e.g. 65 which would disparage each use of an integer division inside
> a sequence.
>
> Could you check which costs we need in order to still emit your wanted
> sequence? Maybe we can use values a bit lower than yours and still
> get the proper code. Where is the decision being made actually?
The main use of costing of a div/mod instruction is to guide the
reciprocal division code when dividing by a constant. In that context
we're comparing costs against a sequence of multiplies, shifts, add/sub
insns which are almost always costed by their latency. So using latency
for division is a reasonable place to start.
The other thing that might be worth investigating for those processors
would be to set "use_divmod_expansion" in the cost structure. I've
heard talk of fusing div/mod into divmod, though I'm not aware of any
part implementing that fusion (from a prior life, that would seem to
require a 2nd output port on the integer unit which could be highly
undesirable). Anyway, this could be a followup item for Yangyu if it
looks profitable.
jeff
On 10/26/23 12:50, Yangyu Chen wrote:
> This patch fixes an issue with the cost on "int_div" in various RISC-V
> tune parameters including those for Rocket, SiFive U7 series, and T-Head
> C906. This incorrect cost value interferes with the optimization process.
> For example, it prevents the optimization of division by a constant to a
> more efficient method known as Barrett reduction. This lack of
> optimization negatively affects the performance of these systems.
>
> The integer div cost of the Rocket and SiFive U7 is taken from the
> Rocket-Chip Divider source code[1] with BigCore configuration[2]. It shows
> the divUnroll unchanged which is 1 by default. Thus, the maximum int_div
> cycles should be the dataWidth + 1, which is 33 for 32-bit and 65 for
> 64-bit.
>
> As for C906, the divider takes 2 cycle to start[3], and it produce 2-bit
> result each cycle[4]. Thus, the maximum int_div cycles should be the
> dataWidth / 2 + 2, which is 18 for 32-bit and 34 for 64-bit.
>
> I also test the performance on VisionFive2 which has Qual-Core Sifive U74.
> I write a simple C program to do 1e8 times div by constant 6 in int32. The
> result shows it takes 1.998s using div, and 0.420s using barrett reduction
> to replace div with mul, which is 4.75x faster.
>
> [1] https://github.com/chipsalliance/rocket-chip/blob/v1.6/src/main/scala/rocket/Multiplier.scala#L40
> [2] https://github.com/chipsalliance/rocket-chip/blob/v1.6/src/main/scala/subsystem/Configs.scala#L97
> [3] https://github.com/T-head-Semi/openc906/blob/af5614d72de7e5a4b8609c427d2e20af1deb21c4/C906_RTL_FACTORY/gen_rtl/iu/rtl/aq_iu_div.v#L267
> [4] https://github.com/T-head-Semi/openc906/blob/af5614d72de7e5a4b8609c427d2e20af1deb21c4/C906_RTL_FACTORY/gen_rtl/iu/rtl/aq_iu_div_shift2_kernel.v#L93
>
> gcc/ChangeLog:
>
> * config/riscv/riscv.cc: Fix wrong tune parameters on int_div
>
> Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn>
I adjusted the ChangeLog entry and pushed this to the trunk.
As a follow-up you should look at the tuning parameter which controls
using the divmod expansion. If these uarchs don't do div+mod fusion,
then I should strongly suspect that they'll benefit from using the
divmod expansion path when we need a/b and a%b.
Jeff
On Fri, Oct 27, 2023 at 6:55 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 10/27/23 01:49, Robin Dapp wrote:
> >> @@ -346,7 +346,7 @@ static const struct riscv_tune_param rocket_tune_info = {
> >> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
> >> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
> >> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
> >> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
> >> + {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
> >> 1, /* issue_rate */
> >> 3, /* branch_cost */
> >> 5, /* memory_cost */
> >> @@ -361,7 +361,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
> >> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
> >> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
> >> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
> >> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
> >> + {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
> >> 2, /* issue_rate */
> >> 4, /* branch_cost */
> >> 3, /* memory_cost */
> >> @@ -376,7 +376,7 @@ static const struct riscv_tune_param thead_c906_tune_info = {
> >> {COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
> >> {COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
> >> {COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
> >> - {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
> >> + {COSTS_N_INSNS (18), COSTS_N_INSNS (34)}, /* int_div */
> >> 1, /* issue_rate */
> >> 3, /* branch_cost */
> >> 5, /* memory_cost */
> >
> > Instruction costs don't really correspond to latencies even though
> > sometimes they are used as if they were. I'm a bit wary of using
> > e.g. 65 which would disparage each use of an integer division inside
> > a sequence.
> >
> > Could you check which costs we need in order to still emit your wanted
> > sequence? Maybe we can use values a bit lower than yours and still
> > get the proper code. Where is the decision being made actually?
> The main use of costing of a div/mod instruction is to guide the
> reciprocal division code when dividing by a constant. In that context
> we're comparing costs against a sequence of multiplies, shifts, add/sub
> insns which are almost always costed by their latency. So using latency
> for division is a reasonable place to start.
>
> The other thing that might be worth investigating for those processors
> would be to set "use_divmod_expansion" in the cost structure. I've
> heard talk of fusing div/mod into divmod, though I'm not aware of any
> part implementing that fusion
I'm also unaware of existing implementations that fuse these
operations; div + mul + sub is probably best for most uarches...
> (from a prior life, that would seem to
> require a 2nd output port on the integer unit which could be highly
> undesirable).
...but it can be done more cheaply than this, so I wouldn't foreclose
on the possibility. Nevertheless, future work, as you say.
> Anyway, this could be a followup item for Yangyu if it
> looks profitable.
>
> jeff
@@ -346,7 +346,7 @@ static const struct riscv_tune_param rocket_tune_info = {
{COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
{COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
{COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
- {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
+ {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
1, /* issue_rate */
3, /* branch_cost */
5, /* memory_cost */
@@ -361,7 +361,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
{COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
{COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
{COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
- {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
+ {COSTS_N_INSNS (33), COSTS_N_INSNS (65)}, /* int_div */
2, /* issue_rate */
4, /* branch_cost */
3, /* memory_cost */
@@ -376,7 +376,7 @@ static const struct riscv_tune_param thead_c906_tune_info = {
{COSTS_N_INSNS (4), COSTS_N_INSNS (5)}, /* fp_mul */
{COSTS_N_INSNS (20), COSTS_N_INSNS (20)}, /* fp_div */
{COSTS_N_INSNS (4), COSTS_N_INSNS (4)}, /* int_mul */
- {COSTS_N_INSNS (6), COSTS_N_INSNS (6)}, /* int_div */
+ {COSTS_N_INSNS (18), COSTS_N_INSNS (34)}, /* int_div */
1, /* issue_rate */
3, /* branch_cost */
5, /* memory_cost */