[RFA] Avoid unnecessary load-immediate in coremark

Message ID f41501c6-4a9a-6dc0-7224-0f9a721a0765@ventanamicro.com
State Accepted, archived
Headers
Series [RFA] Avoid unnecessary load-immediate in coremark

Checks

Context Check Description
snail/gcc-patches-check success Github commit url

Commit Message

Jeff Law Sept. 27, 2022, 7:53 p.m. UTC
  This is another minor improvement to coremark.   I suspect this only 
improves code size as the load-immediate was likely issuing with the ret 
statement on multi-issue machines.


Basically we're failing to utilize conditional equivalences during the 
post-reload CSE pass.  So if a particular block is only reached when a 
certain condition holds (say for example a4 == 0) and the block has an 
assignment like a4 = 0, we would fail to eliminate the unnecessary 
assignment.


So the way this works, as we enter each block in reload_cse_regs_1 we 
look at the block's predecessors to see if all of them have the same 
implicit assignment.  If they do, then we create a dummy insn 
representing that implicit assignment.


Before processing the first real insn, we enter the implicit assignment 
into the cselib hash tables.    This deferred action is necessary 
because of CODE_LABEL handling in cselib -- when it sees a CODE_LABEL it 
wipes state.  So we have to add the implicit assignment after processing 
the (optional) CODE_LABEL, but before processing real insns.


Note we have to walk all the block's predecessors to verify they all 
have the same implicit assignment.  That could potentially be expensive, 
so we limit it to cases where there are only a few predecessors.   For 
reference on x86_64, 81% of the cases where implicit assignments can be 
found are for single predecessor blocks.  96% have two preds, 99.1% have 
3 preds, 99.6% have 4 preds, 99.8% have 5 preds and so on.   While there 
were cases where all 19 preds had the same implicit assignment, capturing 
those cases just doesn't seem terribly important.   I put the clamp at 3 
preds.    If folks think it's important, I could certainly make that a 
PARAM.


Bootstrapped and regression tested on x86.  Bootstrapped on riscv as well.


OK for the trunk?


Jeff
gcc/
	* postreload.cc (reload_cse_regs_1): Record implicit sets from
	conditional branches into the cselib tables.

gcc/testsuite/

	* gcc.target/riscv/implicit-set.c: New test.
  

Comments

Richard Biener Sept. 29, 2022, 7:44 a.m. UTC | #1
On Tue, Sep 27, 2022 at 9:54 PM Jeff Law <jlaw@ventanamicro.com> wrote:
>
>
> This is another minor improvement to coremark.   I suspect this only
> improves code size as the load-immediate was likely issuing with the ret
> statement on multi-issue machines.
>
>
> Basically we're failing to utilize conditional equivalences during the
> post-reload CSE pass.  So if a particular block is only reached when a
> certain condition holds (say for example a4 == 0) and the block has an
> assignment like a4 = 0, we would fail to eliminate the unnecessary
> assignment.

conditional equivalences on RTL - ick ;)

I'm not familiar with RTL pattern matching so somebody else has to
comment on that, but

+                     /* If this is not the first time through, then
+                        verify the source and destination match.  */
+                     else if (dest == XEXP (cond, 0) && src == XEXP (cond, 1))
+                       ;

shouldn't you restrict dest/src somehow?  It might be a MEM?
The way you create the fake insn suggests only REG_P dest are OK
(not SUBREGs for example?)?
Should you use rtx_equal_p (not using that possibly exempts MEM,
but being more explicit would be nice).  Should you restrict this to
MODE_INT compares?

Richard.

>
> So the way this works, as we enter each block in reload_cse_regs_1 we
> look at the block's predecessors to see if all of them have the same
> implicit assignment.  If they do, then we create a dummy insn
> representing that implicit assignment.
>
>
> Before processing the first real insn, we enter the implicit assignment
> into the cselib hash tables.    This deferred action is necessary
> because of CODE_LABEL handling in cselib -- when it sees a CODE_LABEL it
> wipes state.  So we have to add the implicit assignment after processing
> the (optional) CODE_LABEL, but before processing real insns.
>
>
> Note we have to walk all the block's predecessors to verify they all
> have the same implicit assignment.  That could potentially be expensive,
> so we limit it to cases where there are only a few predecessors.   For
> reference on x86_64, 81% of the cases where implicit assignments can be
> found are for single predecessor blocks.  96% have two preds, 99.1% have
> 3 preds, 99.6% have 4 preds, 99.8% have 5 preds and so-on.   While there
> were cases where all 19 preds had the same implicit assignment capturing
> those cases just doesn't seem terribly important.   I put the clamp at 3
> preds.    If folks think it's important, I could certainly make that a
> PARAM.
>
>
> Bootstrapped and regression tested on x86.  Bootstrapped on riscv as well.
>
>
> OK for the trunk?
>
>
> Jeff
>
>
  
Richard Sandiford Sept. 30, 2022, 10:47 a.m. UTC | #2
Jeff Law <jlaw@ventanamicro.com> writes:
> This is another minor improvement to coremark.   I suspect this only 
> improves code size as the load-immediate was likely issuing with the ret 
> statement on multi-issue machines.
>
>
> Basically we're failing to utilize conditional equivalences during the 
> post-reload CSE pass.  So if a particular block is only reached when a 
> certain condition holds (say for example a4 == 0) and the block has an 
> assignment like a4 = 0, we would fail to eliminate the unnecessary 
> assignment.

I wasn't sure (and was too lazy to try, sorry), but: is the reason
that we fail to catch this earlier because the two uses of r4 are
entirely separate (i.e. not from the same pseudo)?

> +	  /* Iterate over each incoming edge and see if they
> +	     all have the same implicit set.  */
> +	  FOR_EACH_EDGE (e, ei, bb->preds)
> +	    {
> +	      /* If the predecessor does not end in a conditional
> +		 jump, then it does not have an implicit set.  */
> +	      if (e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)
> +		  && !block_ends_with_condjump_p (e->src))
> +		{
> +		  found = false;
> +		  break;
> +		}
> +
> +	      /* We know the predecessor ends with a conditional
> +		 jump.  Now dig into the actual form of the jump
> +		 to potentially extract an implicit set.  */

Very minor, but it looked odd to fall through for the entry block.
How about:

	      if (e->src == ENTRY_BLOCK_PTR_FOR_FN (cfun)
		  || !block_ends_with_condjump_p (e->src))

?

> +	      rtx_insn *condjump = BB_END (e->src);
> +	      if (condjump
> +		  && any_condjump_p (condjump)
> +		  && onlyjump_p (condjump))
> +		{
> +		  /* Extract the condition.  */
> +		  rtx pat = PATTERN (condjump);
> +		  rtx i_t_e = SET_SRC (pat);
> +		  gcc_assert (GET_CODE (i_t_e) == IF_THEN_ELSE);
> +		  rtx cond = XEXP (i_t_e, 0);
> +		  if ((GET_CODE (cond) == EQ
> +		       && GET_CODE (XEXP (i_t_e, 1)) == LABEL_REF
> +		       && XEXP (XEXP (i_t_e, 1), 0) == BB_HEAD (bb))
> +		      || (GET_CODE (cond) == NE
> +			  && XEXP (i_t_e, 2) == pc_rtx
> +			  && e->src->next_bb == bb))
> +		    {
> +		      /* If this is the first time through record
> +			 the source and destination.  */
> +		      if (!dest)
> +			{
> +			  dest = XEXP (cond, 0);
> +			  src = XEXP (cond, 1);
> +			}
> +		      /* If this is not the first time through, then
> +			 verify the source and destination match.  */
> +		      else if (dest == XEXP (cond, 0) && src == XEXP (cond, 1))
> +			;

FWIW, agree with what Richi said about using rtx_equal_p here.  We don't
necessarily end up with shared hard regs, especially if they originated
from different pseudos.

Thanks,
Richard
  
Jeff Law Oct. 1, 2022, 6:58 p.m. UTC | #3
On 9/29/22 01:44, Richard Biener wrote:
> On Tue, Sep 27, 2022 at 9:54 PM Jeff Law <jlaw@ventanamicro.com> wrote:
>>
>> This is another minor improvement to coremark.   I suspect this only
>> improves code size as the load-immediate was likely issuing with the ret
>> statement on multi-issue machines.
>>
>>
>> Basically we're failing to utilize conditional equivalences during the
>> post-reload CSE pass.  So if a particular block is only reached when a
>> certain condition holds (say for example a4 == 0) and the block has an
>> assignment like a4 = 0, we would fail to eliminate the unnecessary
>> assignment.
> conditional equivalences on RTL - ick ;)

That was my first reaction as well.


>
> I'm not familiar with RTL pattern matching so somebody else has to
> comment on that, but
>
> +                     /* If this is not the first time through, then
> +                        verify the source and destination match.  */
> +                     else if (dest == XEXP (cond, 0) && src == XEXP (cond, 1))
> +                       ;
>
> shouldn't you restrict dest/src somehow?  It might be a MEM?
> The way you create the fake insn suggests only REG_P dest are OK
> (not SUBREGs for example?)?

You're absolutely right, as is Richard S WRT unexpected sharing. I'll 
adjust the patch appropriately.


> Should you use rtx_equal_p (not using that possibly exempts MEM,
> but being more explicit would be nice).  Should you restrict this to
> MODE_INT compares?

rtx_equal_p would be better, yes.  I'll adjust that too.

This should work regardless of the mode type, though.  The key is that the 
post-reload cse bits have to check that the pattern matches and that the 
constraints are satisfied when a replacement is made.   My only concern 
would be MODE_CC stuff.  I'll think a bit more about that case.


Jeff
  
Jeff Law Oct. 1, 2022, 7:03 p.m. UTC | #4
On 9/30/22 04:47, Richard Sandiford wrote:
> Jeff Law <jlaw@ventanamicro.com> writes:
>> This is another minor improvement to coremark.   I suspect this only
>> improves code size as the load-immediate was likely issuing with the ret
>> statement on multi-issue machines.
>>
>>
>> Basically we're failing to utilize conditional equivalences during the
>> post-reload CSE pass.  So if a particular block is only reached when a
>> certain condition holds (say for example a4 == 0) and the block has an
>> assignment like a4 = 0, we would fail to eliminate the unnecessary
>> assignment.
> I wasn't sure (and was too lazy to try, sorry), but: is the reason
> that we fail to catch this earlier because the two uses of r4 are
> entirely separate (i.e. not from the same pseudo)?

Right.  Different pseudos used in the comparison vs the destination of 
the assignment.  If they used the same pseudo, then I would have 
expected cse or gcse to pick it up.



>
>> +	  /* Iterate over each incoming edge and see if they
>> +	     all have the same implicit set.  */
>> +	  FOR_EACH_EDGE (e, ei, bb->preds)
>> +	    {
>> +	      /* If the predecessor does not end in a conditional
>> +		 jump, then it does not have an implicit set.  */
>> +	      if (e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)
>> +		  && !block_ends_with_condjump_p (e->src))
>> +		{
>> +		  found = false;
>> +		  break;
>> +		}
>> +
>> +	      /* We know the predecessor ends with a conditional
>> +		 jump.  Now dig into the actual form of the jump
>> +		 to potentially extract an implicit set.  */
> Very minor, but it looked odd to fall through for the entry block.
> How about:
>
> 	      if (e->src == ENTRY_BLOCK_PTR_FOR_FN (cfun)
> 		  || !block_ends_with_condjump_p (e->src))
>
> ?

Looks like a mistake to me.  We don't want to process anything for a 
successor of the entry block.  I'll adjust and retest.


Thanks,

Jeff
  

Patch

diff --git a/gcc/postreload.cc b/gcc/postreload.cc
index 41f61d32648..2f155a239ae 100644
--- a/gcc/postreload.cc
+++ b/gcc/postreload.cc
@@ -33,6 +33,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "emit-rtl.h"
 #include "recog.h"
 
+#include "cfghooks.h"
 #include "cfgrtl.h"
 #include "cfgbuild.h"
 #include "cfgcleanup.h"
@@ -221,13 +222,108 @@  reload_cse_regs_1 (void)
   init_alias_analysis ();
 
   FOR_EACH_BB_FN (bb, cfun)
-    FOR_BB_INSNS (bb, insn)
-      {
-	if (INSN_P (insn))
-	  cfg_changed |= reload_cse_simplify (insn, testreg);
+    {
+      /* If BB has a small number of predecessors, see if each of them
+	 has the same implicit set.  If so, record that implicit set so
+	 that we can add it to the cselib tables.  */
+      rtx_insn *implicit_set;
 
-	cselib_process_insn (insn);
-      }
+      implicit_set = NULL;
+      if (EDGE_COUNT (bb->preds) <= 3)
+	{
+	  edge e;
+	  edge_iterator ei;
+	  rtx src = NULL_RTX;
+	  rtx dest = NULL_RTX;
+	  bool found = true;
+
+	  /* Iterate over each incoming edge and see if they
+	     all have the same implicit set.  */
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    {
+	      /* If the predecessor does not end in a conditional
+		 jump, then it does not have an implicit set.  */
+	      if (e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)
+		  && !block_ends_with_condjump_p (e->src))
+		{
+		  found = false;
+		  break;
+		}
+
+	      /* We know the predecessor ends with a conditional
+		 jump.  Now dig into the actual form of the jump
+		 to potentially extract an implicit set.  */
+	      rtx_insn *condjump = BB_END (e->src);
+	      if (condjump
+		  && any_condjump_p (condjump)
+		  && onlyjump_p (condjump))
+		{
+		  /* Extract the condition.  */
+		  rtx pat = PATTERN (condjump);
+		  rtx i_t_e = SET_SRC (pat);
+		  gcc_assert (GET_CODE (i_t_e) == IF_THEN_ELSE);
+		  rtx cond = XEXP (i_t_e, 0);
+		  if ((GET_CODE (cond) == EQ
+		       && GET_CODE (XEXP (i_t_e, 1)) == LABEL_REF
+		       && XEXP (XEXP (i_t_e, 1), 0) == BB_HEAD (bb))
+		      || (GET_CODE (cond) == NE
+			  && XEXP (i_t_e, 2) == pc_rtx
+			  && e->src->next_bb == bb))
+		    {
+		      /* If this is the first time through record
+			 the source and destination.  */
+		      if (!dest)
+			{
+			  dest = XEXP (cond, 0);
+			  src = XEXP (cond, 1);
+			}
+		      /* If this is not the first time through, then
+			 verify the source and destination match.  */
+		      else if (dest == XEXP (cond, 0) && src == XEXP (cond, 1))
+			;
+		      else
+			{
+			  found = false;
+			  break;
+			}
+		    }
+		}
+	      else
+		{
+		  found = false;
+		  break;
+		}
+	    }
+
+	  /* If all the incoming edges had the same implicit
+	     set, then create a dummy insn for that set.
+
+	     It will be entered into the cselib tables before
+	     we process the first real insn in this block.  */
+	  if (dest && found)
+	    implicit_set = make_insn_raw (gen_rtx_SET (dest, src));
+	}
+
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (INSN_P (insn))
+	    {
+	      /* If we recorded an implicit set, enter it
+		 into the tables before the first real insn.
+
+		 We have to do it this way because a CODE_LABEL
+		 will flush the cselib tables.  */
+	      if (implicit_set)
+		{
+		  cselib_process_insn (implicit_set);
+		  implicit_set = NULL;
+		}
+	      cfg_changed |= reload_cse_simplify (insn, testreg);
+	    }
+
+	  cselib_process_insn (insn);
+	}
+    }
 
   /* Clean up.  */
   end_alias_analysis ();
diff --git a/gcc/testsuite/gcc.target/riscv/implicit-set.c b/gcc/testsuite/gcc.target/riscv/implicit-set.c
new file mode 100644
index 00000000000..91106bb5d80
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/implicit-set.c
@@ -0,0 +1,40 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -dp" } */
+/* This was extracted from coremark.  */
+
+
+typedef signed short ee_s16;
+typedef struct list_data_s
+{
+    ee_s16 data16;
+    ee_s16 idx;
+} list_data;
+
+typedef struct list_head_s
+{
+    struct list_head_s *next;
+    struct list_data_s *info;
+} list_head;
+
+
+list_head *
+core_list_find(list_head *list, list_data *info)
+{
+    if (info->idx >= 0)
+    {
+        while (list && (list->info->idx != info->idx))
+            list = list->next;
+        return list;
+    }
+    else
+    {
+        while (list && ((list->info->data16 & 0xff) != info->data16))
+            list = list->next;
+        return list;
+    }
+}
+
+/* There was an unnecessary assignment to the return value until
+   recently.  Scan for that in the resulting output.  */
+/* { dg-final { scan-assembler-not "li\\ta0,0" } } */
+