PR rtl-optimization/106594: Preserve zero_extend when cheap.

Message ID 007201d8c637$e2c3bfd0$a84b3f70$@nextmovesoftware.com
State New, archived
Headers
Series PR rtl-optimization/106594: Preserve zero_extend when cheap. |

Commit Message

Roger Sayle Sept. 11, 2022, 11:40 p.m. UTC
  This patch addresses PR rtl-optimization/106594, a significant performance
regression affecting aarch64 recently introduced (exposed) by one of my
recent RTL simplification improvements.  Firstly many thanks to
Tamar Christina for confirming that the core of this patch provides ~5%
performance improvement on some on his benchmarks.

GCC's combine pass uses the function expand_compound_operation to
conceptually simplify/canonicalize SIGN_EXTEND and ZERO_EXTEND as
a pair of shift operations, as not all targets support extension
instructions [technically ZERO_EXTEND may potentially be simplified/
canonicalized to an AND operation, but the theory remains the same].

In that function, around line 7226 of combine.cc, there's an optimization
that's remarkably similar to part of my recent simplify-rtx patch posted
at https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599196.html
The comment above this code reads:

  /* Convert sign extension to zero extension, if we know that the high
     bit is not set, as this is easier to optimize.  It will be converted
     back to cheaper alternative in make_extraction.  */

which is exactly the SIGN_EXTEND to ZERO_EXTEND canonicalization that
we now perform in simplify-rtx.  The significant difference is that this
code checks the backend's RTL costs, via set_src_cost, and selects either
the SIGN_EXTEND, the ZERO_EXTEND or the pair of SHIFTs depending on which
is cheaper.

The problem is that now that SIGN_EXTEND is converted to ZERO_EXTEND
earlier, this transform/check is no longer triggered, and as a result the
incoming ZERO_EXTEND is always converted to a pair of shifts, irrespective
of the backend's costs.  The latent bug, revealed by benchmarking, is that
we should avoid converting SIGN_EXTEND or ZERO_EXTEND into (slower) shifts
on targets where extensions are cheap (i.e. a single instruction, that's
cheaper than two shift instructions, or as cheap as an AND).

This core fix (and performance improvement) is the first chunk of the
attached patch.  Unfortunately (as is often the case), as soon as you
tweak the RTL stream/canonical forms of instructions, you expose a number
of missed optimization issues, in both the middle-end and backends, that
were expecting one pattern but now see a (cheaper) equivalent pattern...

The remaining chunks affecting expand_compound_operation, prevent
combine from generating SUBREGs of RTX other than REG or MEM (considered
invalid RTL) where previously gen_lowpart would have generated a CLOBBER.

In simplify_unary_operation_1, the middle-end can create rtx for FFS, PARITY
and POPCOUNT where the operand mode didn't match the result mode
[which is no longer supported according to the RTL documentation].

In i386.md, I needed to add variations of the define_insn_and_split
patterns for *clzsi2_lzcnt_zext, *clzsi2_lzcnt_zext_falsedep,
*bmi2_bzhi_zero_extendsidi_4 , *popcountsi2_zext,
*popcountsi2_zext_falsedep, *popcounthi2 to recognize ZERO_EXTEND
in addition to the previous AND forms, and I added a variation of a
popcount-related peephole2 now that we generate one less instruction
in the input sequence that it's expecting to match.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  I'm happy to help fix any turbulence encountered
by targets with cheap sign/zero extension operations, but most should
see a performance improvement, hopefully to better than before the
identified performance regression.  Ok for mainline?

Fingers-crossed, Uros can review the x86 backend changes, which are
potentially independent (fixing regressions caused by the middle-end
changes), but included in this post to provide better context.  TIA.


2022-09-12  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR rtl-optimization/106594
        * gcc/combine.cc (expand_compound_operation): Don't expand/transform
        ZERO_EXTEND or SIGN_EXTEND on targets where rtx_cost claims they are
        cheap.  If gen_lowpart returns a SUBREG of something other than a
        REG or a MEM, i.e. invalid RTL, return the original expression.
        * gcc/simplify-rtx.cc (simplify_unary_operation_1) <case FFS>:
        Avoid generating FFS with mismatched operand and result modes, by
        using an explicit SIGN_EXTEND/ZERO_EXTEND instead.
        <case POPCOUNT>: Likewise, for POPCOUNT of ZERO_EXTEND.
        <case PARITY>: Likewise, for PARITY of {ZERO,SIGN}_EXTEND.

        * config/i386/i386.md (*clzsi2_lzcnt_zext_2): define_insn_and_split
        to match ZERO_EXTEND form of *clzsi2_lzcnt_zext.
        (*clzsi2_lzcnt_zext_2_falsedep): Likewise, new define_insn to match
        ZERO_EXTEND form of *clzsi2_lzcnt_zext_falsedep.
        (*bmi2_bzhi_zero_extendsidi_5): Likewise, new define_insn to match
        ZERO_EXTEND form of *bmi2_bzhi_zero_extendsidi.
        (*popcountsi2_zext_2): Likewise, new define_insn_and_split to match
        ZERO_EXTEND form of *popcountsi2_zext.
        (*popcountsi2_zext_2_falsedep): Likewise, new define_insn to match
        ZERO_EXTEND form of *popcountsi2_zext_falsedep.
        (*popcounthi2_2): Likewise, new define_insn_and_split to match
        ZERO_EXTEND form of *popcounthi2.
        (define_peephole2): ZERO_EXTEND variant of HImode popcount&1 using
        parity flag peephole2.


Thanks in advance,
Roger
--
  

Comments

Richard Sandiford Sept. 12, 2022, 10:31 a.m. UTC | #1
"Roger Sayle" <roger@nextmovesoftware.com> writes:
> This patch addresses PR rtl-optimization/106594, a significant performance
> regression affecting aarch64 recently introduced (exposed) by one of my
> recent RTL simplification improvements.  Firstly many thanks to
> Tamar Christina for confirming that the core of this patch provides ~5%
> performance improvement on some on his benchmarks.
>
> GCC's combine pass uses the function expand_compound_operation to
> conceptually simplify/canonicalize SIGN_EXTEND and ZERO_EXTEND as
> a pair of shift operations, as not all targets support extension
> instructions [technically ZERO_EXTEND may potentially be simplified/
> canonicalized to an AND operation, but the theory remains the same].

Are you sure the reason is that not all targets support extension
instructions?  I thought in practice this routine would only tend
to see ZEOR_EXTEND etc. if those codes appeared in the original rtl
insns.

Like you say earlier, I'd thought it was a (pure) canonicalisation
thing: try to have only one representation of an operation before
simplifying/combining, rather than multiple representations.  Then
make_compound_operation was supposed to do the opposite operation
after simplifying/combining.

On that basis…

> In that function, around line 7226 of combine.cc, there's an optimization
> that's remarkably similar to part of my recent simplify-rtx patch posted
> at https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599196.html
> The comment above this code reads:
>
>   /* Convert sign extension to zero extension, if we know that the high
>      bit is not set, as this is easier to optimize.  It will be converted
>      back to cheaper alternative in make_extraction.  */
>
> which is exactly the SIGN_EXTEND to ZERO_EXTEND canonicalization that
> we now perform in simplify-rtx.  The significant difference is that this
> code checks the backend's RTL costs, via set_src_cost, and selects either
> the SIGN_EXTEND, the ZERO_EXTEND or the pair of SHIFTs depending on which
> is cheaper.
>
> The problem is that now that SIGN_EXTEND is converted to ZERO_EXTEND
> earlier, this transform/check is no longer triggered, and as a result the
> incoming ZERO_EXTEND is always converted to a pair of shifts, irrespective
> of the backend's costs.  The latent bug, revealed by benchmarking, is that
> we should avoid converting SIGN_EXTEND or ZERO_EXTEND into (slower) shifts
> on targets where extensions are cheap (i.e. a single instruction, that's
> cheaper than two shift instructions, or as cheap as an AND).

…I don't think the expectation was that shifts were cheaper than/as cheap
as the original expressions.  On most targets the shifts will be more
expensive.  It was more that we had (sometimes misplaced) faith in
make_compound_operation's being able to switch back to the compound
operations if they were supported/cheaper.

So it feels like taking rtx costs into account is muddying the waters.
But like you say, that line has already been crossed.  The (existing)
code in question is:

  /* Convert sign extension to zero extension, if we know that the high
     bit is not set, as this is easier to optimize.  It will be converted
     back to cheaper alternative in make_extraction.  */
  if (GET_CODE (x) == SIGN_EXTEND
      && HWI_COMPUTABLE_MODE_P (mode)
      && ((nonzero_bits (XEXP (x, 0), inner_mode)
	   & ~(((unsigned HOST_WIDE_INT) GET_MODE_MASK (inner_mode)) >> 1))
	  == 0))
    {
      rtx temp = gen_rtx_ZERO_EXTEND (mode, XEXP (x, 0));
      rtx temp2 = expand_compound_operation (temp);

      /* Make sure this is a profitable operation.  */
      if (set_src_cost (x, mode, optimize_this_for_speed_p)
          > set_src_cost (temp2, mode, optimize_this_for_speed_p))
       return temp2;
      else if (set_src_cost (x, mode, optimize_this_for_speed_p)
               > set_src_cost (temp, mode, optimize_this_for_speed_p))
       return temp;
      else
       return x;
    }

I can see an argument for using rtx costs to switch between an
expand_compound_operation of a SIGN_EXTEND and an expand_compound_operation
of a ZERO_EXTEND, given that this is one case where even e_c_o can't give
a single, unique representation.  But it feels odd to be returning an
unexpanded SIGN_EXTEND or ZERO_EXTEND when the purpose of the routine
is to expand them.  (And it rtx costs *are* supposed to influence the
choice between the SIGN_EXTEND-based and ZERO_EXTEND-based expansions,
it feels like we should have corresponding code for when the original
operation is a ZERO_EXTEND.)

Obviously Segher's call, not mine.  But given that we already have
the code above, I think it'd be worth clarifying what the
expand/make_compound_operation pair are supposed to do and when.

Thanks,
Richard

> This core fix (and performance improvement) is the first chunk of the
> attached patch.  Unfortunately (as is often the case), as soon as you
> tweak the RTL stream/canonical forms of instructions, you expose a number
> of missed optimization issues, in both the middle-end and backends, that
> were expecting one pattern but now see a (cheaper) equivalent pattern...
>
> The remaining chunks affecting expand_compound_operation, prevent
> combine from generating SUBREGs of RTX other than REG or MEM (considered
> invalid RTL) where previously gen_lowpart would have generated a CLOBBER.
>
> In simplify_unary_operation_1, the middle-end can create rtx for FFS, PARITY
> and POPCOUNT where the operand mode didn't match the result mode
> [which is no longer supported according to the RTL documentation].
>
> In i386.md, I needed to add variations of the define_insn_and_split
> patterns for *clzsi2_lzcnt_zext, *clzsi2_lzcnt_zext_falsedep,
> *bmi2_bzhi_zero_extendsidi_4 , *popcountsi2_zext,
> *popcountsi2_zext_falsedep, *popcounthi2 to recognize ZERO_EXTEND
> in addition to the previous AND forms, and I added a variation of a
> popcount-related peephole2 now that we generate one less instruction
> in the input sequence that it's expecting to match.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  I'm happy to help fix any turbulence encountered
> by targets with cheap sign/zero extension operations, but most should
> see a performance improvement, hopefully to better than before the
> identified performance regression.  Ok for mainline?
>
> Fingers-crossed, Uros can review the x86 backend changes, which are
> potentially independent (fixing regressions caused by the middle-end
> changes), but included in this post to provide better context.  TIA.
>
>
> 2022-09-12  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR rtl-optimization/106594
>         * gcc/combine.cc (expand_compound_operation): Don't expand/transform
>         ZERO_EXTEND or SIGN_EXTEND on targets where rtx_cost claims they are
>         cheap.  If gen_lowpart returns a SUBREG of something other than a
>         REG or a MEM, i.e. invalid RTL, return the original expression.
>         * gcc/simplify-rtx.cc (simplify_unary_operation_1) <case FFS>:
>         Avoid generating FFS with mismatched operand and result modes, by
>         using an explicit SIGN_EXTEND/ZERO_EXTEND instead.
>         <case POPCOUNT>: Likewise, for POPCOUNT of ZERO_EXTEND.
>         <case PARITY>: Likewise, for PARITY of {ZERO,SIGN}_EXTEND.
>
>         * config/i386/i386.md (*clzsi2_lzcnt_zext_2): define_insn_and_split
>         to match ZERO_EXTEND form of *clzsi2_lzcnt_zext.
>         (*clzsi2_lzcnt_zext_2_falsedep): Likewise, new define_insn to match
>         ZERO_EXTEND form of *clzsi2_lzcnt_zext_falsedep.
>         (*bmi2_bzhi_zero_extendsidi_5): Likewise, new define_insn to match
>         ZERO_EXTEND form of *bmi2_bzhi_zero_extendsidi.
>         (*popcountsi2_zext_2): Likewise, new define_insn_and_split to match
>         ZERO_EXTEND form of *popcountsi2_zext.
>         (*popcountsi2_zext_2_falsedep): Likewise, new define_insn to match
>         ZERO_EXTEND form of *popcountsi2_zext_falsedep.
>         (*popcounthi2_2): Likewise, new define_insn_and_split to match
>         ZERO_EXTEND form of *popcounthi2.
>         (define_peephole2): ZERO_EXTEND variant of HImode popcount&1 using
>         parity flag peephole2.
>
>
> Thanks in advance,
> Roger
> --
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index a5fabf3..9412a9c 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -7288,7 +7288,17 @@ expand_compound_operation (rtx x)
>  	  && (STORE_FLAG_VALUE & ~GET_MODE_MASK (inner_mode)) == 0)
>  	return SUBREG_REG (XEXP (x, 0));
>  
> +      /* If ZERO_EXTEND is cheap on this target, do nothing,
> +	 i.e. don't attempt to convert it to a pair of shifts.  */
> +      if (set_src_cost (x, mode, optimize_this_for_speed_p)
> +	  <= COSTS_N_INSNS (1))
> +	return x;
>      }
> +  /* Likewise, if SIGN_EXTEND is cheap, do nothing.  */
> +  else if (GET_CODE (x) == SIGN_EXTEND
> +	   && set_src_cost (x, mode, optimize_this_for_speed_p)
> +	      <= COSTS_N_INSNS (1))
> +    return x;
>  
>    /* If we reach here, we want to return a pair of shifts.  The inner
>       shift is a left shift of BITSIZE - POS - LEN bits.  The outer
> @@ -7309,7 +7319,11 @@ expand_compound_operation (rtx x)
>    if (modewidth >= pos + len)
>      {
>        tem = gen_lowpart (mode, XEXP (x, 0));
> -      if (!tem || GET_CODE (tem) == CLOBBER)
> +      if (!tem
> +	  || GET_CODE (tem) == CLOBBER
> +	  || (GET_CODE (tem) == SUBREG
> +	      && !REG_P (SUBREG_REG (tem))
> +	      && !MEM_P (SUBREG_REG (tem))))
>  	return x;
>        tem = simplify_shift_const (NULL_RTX, ASHIFT, mode,
>  				  tem, modewidth - pos - len);
> @@ -7321,7 +7335,11 @@ expand_compound_operation (rtx x)
>        tem = simplify_shift_const (NULL_RTX, LSHIFTRT, inner_mode,
>  				  XEXP (x, 0), pos);
>        tem = gen_lowpart (mode, tem);
> -      if (!tem || GET_CODE (tem) == CLOBBER)
> +      if (!tem
> +	  || GET_CODE (tem) == CLOBBER
> +	  || (GET_CODE (tem) == SUBREG
> +	      && !REG_P (SUBREG_REG (tem))
> +	      && !MEM_P (SUBREG_REG (tem))))
>  	return x;
>        tem = simplify_and_const_int (NULL_RTX, mode, tem,
>  				    (HOST_WIDE_INT_1U << len) - 1);
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index fc0d6c3..698ca6e 100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -1404,22 +1404,32 @@ simplify_context::simplify_unary_operation_1 (rtx_code code, machine_mode mode,
>        break;
>  
>      case FFS:
> -      /* (ffs (*_extend <X>)) = (ffs <X>) */
> +      /* (ffs (*_extend <X>)) = (*_extend (ffs <X>)).  */
>        if (GET_CODE (op) == SIGN_EXTEND
>  	  || GET_CODE (op) == ZERO_EXTEND)
> -	return simplify_gen_unary (FFS, mode, XEXP (op, 0),
> -				   GET_MODE (XEXP (op, 0)));
> +	{
> +	  temp = simplify_gen_unary (FFS, GET_MODE (XEXP (op, 0)),
> +				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
> +	  return simplify_gen_unary (GET_CODE (op), mode, temp,
> +				     GET_MODE (temp));
> +	}
>        break;
>  
>      case POPCOUNT:
>        switch (GET_CODE (op))
>  	{
>  	case BSWAP:
> -	case ZERO_EXTEND:
> -	  /* (popcount (zero_extend <X>)) = (popcount <X>) */
> +	  /* (popcount (bswap <X>)) = (popcount <X>).  */
>  	  return simplify_gen_unary (POPCOUNT, mode, XEXP (op, 0),
>  				     GET_MODE (XEXP (op, 0)));
>  
> +	case ZERO_EXTEND:
> +	  /* (popcount (zero_extend <X>)) = (zero_extend (popcount <X>)).  */
> +	  temp = simplify_gen_unary (POPCOUNT, GET_MODE (XEXP (op, 0)),
> +				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
> +	  return simplify_gen_unary (ZERO_EXTEND, mode, temp,
> +				     GET_MODE (temp));
> +
>  	case ROTATE:
>  	case ROTATERT:
>  	  /* Rotations don't affect popcount.  */
> @@ -1438,11 +1448,16 @@ simplify_context::simplify_unary_operation_1 (rtx_code code, machine_mode mode,
>  	{
>  	case NOT:
>  	case BSWAP:
> -	case ZERO_EXTEND:
> -	case SIGN_EXTEND:
>  	  return simplify_gen_unary (PARITY, mode, XEXP (op, 0),
>  				     GET_MODE (XEXP (op, 0)));
>  
> +	case ZERO_EXTEND:
> +	case SIGN_EXTEND:
> +	  temp = simplify_gen_unary (PARITY, GET_MODE (XEXP (op, 0)),
> +				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
> +	  return simplify_gen_unary (GET_CODE (op), mode, temp,
> +				     GET_MODE (temp));
> +
>  	case ROTATE:
>  	case ROTATERT:
>  	  /* Rotations don't affect parity.  */
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 1be9b66..dcf50cb 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -17010,6 +17010,42 @@
>     (set_attr "type" "bitmanip")
>     (set_attr "mode" "SI")])
>  
> +(define_insn_and_split "*clzsi2_lzcnt_zext_2"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> +	(zero_extend:DI
> +	  (clz:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_LZCNT && TARGET_64BIT"
> +  "lzcnt{l}\t{%1, %k0|%k0, %1}"
> +  "&& TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
> +   && optimize_function_for_speed_p (cfun)
> +   && !reg_mentioned_p (operands[0], operands[1])"
> +  [(parallel
> +    [(set (match_dup 0)
> +	  (zero_extend:DI (clz:SI (match_dup 1))))
> +     (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
> +     (clobber (reg:CC FLAGS_REG))])]
> +  "ix86_expand_clear (operands[0]);"
> +  [(set_attr "prefix_rep" "1")
> +   (set_attr "type" "bitmanip")
> +   (set_attr "mode" "SI")])
> +
> +; False dependency happens when destination is only updated by tzcnt,
> +; lzcnt or popcnt.  There is no false dependency when destination is
> +; also used in source.
> +(define_insn "*clzsi2_lzcnt_zext_2_falsedep"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> +	(zero_extend:DI
> +	  (clz:SI (match_operand:SWI48 1 "nonimmediate_operand" "rm"))))
> +   (unspec [(match_operand:DI 2 "register_operand" "0")]
> +	   UNSPEC_INSN_FALSE_DEP)
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_LZCNT"
> +  "lzcnt{l}\t{%1, %k0|%k0, %1}"
> +  [(set_attr "prefix_rep" "1")
> +   (set_attr "type" "bitmanip")
> +   (set_attr "mode" "SI")])
> +
>  (define_int_iterator LT_ZCNT
>  	[(UNSPEC_TZCNT "TARGET_BMI")
>  	 (UNSPEC_LZCNT "TARGET_LZCNT")])
> @@ -17328,6 +17364,22 @@
>     (set_attr "prefix" "vex")
>     (set_attr "mode" "DI")])
>  
> +(define_insn "*bmi2_bzhi_zero_extendsidi_5"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> +	(and:DI
> +	  (zero_extend:DI
> +	    (plus:SI
> +	      (ashift:SI (const_int 1)
> +			 (match_operand:QI 2 "register_operand" "r"))
> +	      (const_int -1)))
> +	  (match_operand:DI 1 "nonimmediate_operand" "rm")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_64BIT && TARGET_BMI2"
> +  "bzhi\t{%q2, %q1, %q0|%q0, %q1, %q2}"
> +  [(set_attr "type" "bitmanip")
> +   (set_attr "prefix" "vex")
> +   (set_attr "mode" "DI")])
> +
>  (define_insn "bmi2_pdep_<mode>3"
>    [(set (match_operand:SWI48 0 "register_operand" "=r")
>          (unspec:SWI48 [(match_operand:SWI48 1 "register_operand" "r")
> @@ -17590,6 +17642,54 @@
>     (set_attr "type" "bitmanip")
>     (set_attr "mode" "SI")])
>  
> +(define_insn_and_split "*popcountsi2_zext_2"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> +	(zero_extend:DI
> +	  (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_POPCNT && TARGET_64BIT"
> +{
> +#if TARGET_MACHO
> +  return "popcnt\t{%1, %k0|%k0, %1}";
> +#else
> +  return "popcnt{l}\t{%1, %k0|%k0, %1}";
> +#endif
> +}
> +  "&& TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
> +   && optimize_function_for_speed_p (cfun)
> +   && !reg_mentioned_p (operands[0], operands[1])"
> +  [(parallel
> +    [(set (match_dup 0)
> +	  (zero_extend:DI (popcount:SI (match_dup 1))))
> +     (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
> +     (clobber (reg:CC FLAGS_REG))])]
> +  "ix86_expand_clear (operands[0]);"
> +  [(set_attr "prefix_rep" "1")
> +   (set_attr "type" "bitmanip")
> +   (set_attr "mode" "SI")])
> +
> +; False dependency happens when destination is only updated by tzcnt,
> +; lzcnt or popcnt.  There is no false dependency when destination is
> +; also used in source.
> +(define_insn "*popcountsi2_zext_2_falsedep"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> +	(zero_extend:DI
> +	  (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
> +   (unspec [(match_operand:DI 2 "register_operand" "0")]
> +	   UNSPEC_INSN_FALSE_DEP)
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_POPCNT && TARGET_64BIT"
> +{
> +#if TARGET_MACHO
> +  return "popcnt\t{%1, %k0|%k0, %1}";
> +#else
> +  return "popcnt{l}\t{%1, %k0|%k0, %1}";
> +#endif
> +}
> +  [(set_attr "prefix_rep" "1")
> +   (set_attr "type" "bitmanip")
> +   (set_attr "mode" "SI")])
> +
>  (define_insn_and_split "*popcounthi2_1"
>    [(set (match_operand:SI 0 "register_operand")
>  	(popcount:SI
> @@ -17608,6 +17708,24 @@
>    DONE;
>  })
>  
> +(define_insn_and_split "*popcounthi2_2"
> +  [(set (match_operand:SI 0 "register_operand")
> +	(zero_extend:SI
> +	  (popcount:HI (match_operand:HI 1 "nonimmediate_operand"))))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_POPCNT
> +   && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(const_int 0)]
> +{
> +  rtx tmp = gen_reg_rtx (HImode);
> +
> +  emit_insn (gen_popcounthi2 (tmp, operands[1]));
> +  emit_insn (gen_zero_extendhisi2 (operands[0], tmp));
> +  DONE;
> +})
> +
>  (define_insn "popcounthi2"
>    [(set (match_operand:HI 0 "register_operand" "=r")
>  	(popcount:HI
> @@ -17927,6 +18045,39 @@
>    PUT_CODE (operands[5], GET_CODE (operands[5]) == EQ ? UNORDERED : ORDERED);
>  })
>  
> +;; Eliminate HImode popcount&1 using parity flag (variant 2)
> +(define_peephole2
> +  [(match_scratch:HI 0 "Q")
> +   (parallel [(set (match_operand:HI 1 "register_operand")
> +		   (popcount:HI
> +		    (match_operand:HI 2 "nonimmediate_operand")))
> +	      (clobber (reg:CC FLAGS_REG))])
> +   (set (reg:CCZ FLAGS_REG)
> +        (compare:CCZ (and:QI (match_operand:QI 3 "register_operand")
> +			     (const_int 1))
> +		     (const_int 0)))
> +   (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator"
> +			    [(reg:CCZ FLAGS_REG)
> +			     (const_int 0)])
> +			   (label_ref (match_operand 5))
> +			   (pc)))]
> +  "REGNO (operands[1]) == REGNO (operands[3])
> +   && peep2_reg_dead_p (2, operands[1])
> +   && peep2_reg_dead_p (2, operands[3])
> +   && peep2_regno_dead_p (3, FLAGS_REG)"
> +  [(set (match_dup 0) (match_dup 2))
> +   (parallel [(set (reg:CC FLAGS_REG)
> +		   (unspec:CC [(match_dup 0)] UNSPEC_PARITY))
> +	      (clobber (match_dup 0))])
> +   (set (pc) (if_then_else (match_op_dup 4 [(reg:CC FLAGS_REG)
> +					    (const_int 0)])
> +			   (label_ref (match_dup 5))
> +			   (pc)))]
> +{
> +  operands[4] = shallow_copy_rtx (operands[4]);
> +  PUT_CODE (operands[4], GET_CODE (operands[4]) == EQ ? UNORDERED : ORDERED);
> +})
> +
>  
>  ;; Thread-local storage patterns for ELF.
>  ;;
  
Roger Sayle Sept. 12, 2022, 2:16 p.m. UTC | #2
Hi Richard,

> "Roger Sayle" <roger@nextmovesoftware.com> writes:
> > This patch addresses PR rtl-optimization/106594, a significant
> > performance regression affecting aarch64 recently introduced (exposed)
> > by one of my recent RTL simplification improvements.  Firstly many
> > thanks to Tamar Christina for confirming that the core of this patch
> > provides ~5% performance improvement on some on his benchmarks.
> >
> > GCC's combine pass uses the function expand_compound_operation to
> > conceptually simplify/canonicalize SIGN_EXTEND and ZERO_EXTEND as a
> > pair of shift operations, as not all targets support extension
> > instructions [technically ZERO_EXTEND may potentially be simplified/
> > canonicalized to an AND operation, but the theory remains the same].
> 
> Are you sure the reason is that not all targets support extension instructions?
> I thought in practice this routine would only tend to see ZERO_EXTEND etc. if
> those codes appeared in the original rtl insns.

Excellent question.  My current understanding is that this subroutine is used
for both SIGN_EXTEND/ZERO_EXTEND (for which may processors have instructions
and even addressing mode support) and also SIGN_EXTRACT/ZERO_EXTRACT for
which many platforms, really do need a pair of shift instructions. (or an AND).

The bit (to me) that that's suspicious is the exact wording of the comment...

> >   /* Convert sign extension to zero extension, if we know that the high
> >      bit is not set, as this is easier to optimize.  It will be converted
> >      back to cheaper alternative in make_extraction.  */

Notice that things are converted back in "make_extraction", and may not be
getting converted back (based on empirical evidence) for non-extractions,
i.e. extensions.  This code is being called for ZERO_EXTEND on a path that
doesn't subsequently call make_compound.  As shown in the PR, there are
code generation differences (that impact performance), but I agree there's
some ambiguity around the intent of the original code.

My personal preference is to write backend patterns that contain ZERO_EXTEND
(or SIGN_EXTEND) rather than a pair of shifts, or an AND of a paradoxical SUBREG.
For example, the new patterns added to i386.md look (to me) "cleaner" than the
forms that they complement/replace.  The burden is then on the middle-end to
simplify {ZERO,SIGN}_EXTEND forms as efficiently as it would shifts or ANDs.

Interestingly, this is sometimes already the case, for example, we simplify
(ffs (zero_extend ...)) in cases where we wouldn't simplify the equivalent
(ffs (and ... 255)) [but these are perhaps also just missed optimizations].

Thoughts?  I'll also dig more into the history of these (combine) functions.

Cheers,
Roger
--
  
Richard Sandiford Sept. 12, 2022, 2:56 p.m. UTC | #3
"Roger Sayle" <roger@nextmovesoftware.com> writes:
> Hi Richard,
>
>> "Roger Sayle" <roger@nextmovesoftware.com> writes:
>> > This patch addresses PR rtl-optimization/106594, a significant
>> > performance regression affecting aarch64 recently introduced (exposed)
>> > by one of my recent RTL simplification improvements.  Firstly many
>> > thanks to Tamar Christina for confirming that the core of this patch
>> > provides ~5% performance improvement on some on his benchmarks.
>> >
>> > GCC's combine pass uses the function expand_compound_operation to
>> > conceptually simplify/canonicalize SIGN_EXTEND and ZERO_EXTEND as a
>> > pair of shift operations, as not all targets support extension
>> > instructions [technically ZERO_EXTEND may potentially be simplified/
>> > canonicalized to an AND operation, but the theory remains the same].
>> 
>> Are you sure the reason is that not all targets support extension instructions?
>> I thought in practice this routine would only tend to see ZERO_EXTEND etc. if
>> those codes appeared in the original rtl insns.
>
> Excellent question.  My current understanding is that this subroutine is used
> for both SIGN_EXTEND/ZERO_EXTEND (for which may processors have instructions
> and even addressing mode support) and also SIGN_EXTRACT/ZERO_EXTRACT for
> which many platforms, really do need a pair of shift instructions. (or an AND).
>
> The bit (to me) that that's suspicious is the exact wording of the comment...
>
>> >   /* Convert sign extension to zero extension, if we know that the high
>> >      bit is not set, as this is easier to optimize.  It will be converted
>> >      back to cheaper alternative in make_extraction.  */
>
> Notice that things are converted back in "make_extraction", and may not be
> getting converted back (based on empirical evidence) for non-extractions,
> i.e. extensions.  This code is being called for ZERO_EXTEND on a path that
> doesn't subsequently call make_compound.  As shown in the PR, there are
> code generation differences (that impact performance), but I agree there's
> some ambiguity around the intent of the original code.
>
> My personal preference is to write backend patterns that contain ZERO_EXTEND
> (or SIGN_EXTEND) rather than a pair of shifts, or an AND of a paradoxical SUBREG.
> For example, the new patterns added to i386.md look (to me) "cleaner" than the
> forms that they complement/replace.  The burden is then on the middle-end to
> simplify {ZERO,SIGN}_EXTEND forms as efficiently as it would shifts or ANDs.

Yeah.  I think the intention is still that .md patterns match the
"compound" forms like zero_extend/sign_extend where possible.
AIUI the "expanded" from is just an internal thing, to reduce the
number of simplification rules needed, and combine is supposed to call
make_compound_operation before recog()nising the result.

But given that the simplification rules in simplify-rtx.cc (as opposed
to combine.cc itself) have to cope with passes that *don't* do this
canonicalisation, I'm not sure how much it really helps in practice.
And the cost of make_compound_operation failing to recognise the
(possibly quite convoluted) results of substituting into an expanded
compound operation can be quite severe.

> Interestingly, this is sometimes already the case, for example, we simplify
> (ffs (zero_extend ...)) in cases where we wouldn't simplify the equivalent
> (ffs (and ... 255)) [but these are perhaps also just missed optimizations].

Yeah, sounds like it.

Thanks,
Richard

> Thoughts?  I'll also dig more into the history of these (combine) functions.
>
> Cheers,
> Roger
> --
  
Segher Boessenkool Sept. 12, 2022, 4:47 p.m. UTC | #4
Hi!

On Mon, Sep 12, 2022 at 12:40:31AM +0100, Roger Sayle wrote:
>         * gcc/combine.cc (expand_compound_operation): Don't expand/transform
>         ZERO_EXTEND or SIGN_EXTEND on targets where rtx_cost claims they are
>         cheap.  If gen_lowpart returns a SUBREG of something other than a
>         REG or a MEM, i.e. invalid RTL, return the original expression.

This needs testing and analysis on all platforms.

> @@ -7309,7 +7319,11 @@ expand_compound_operation (rtx x)
>    if (modewidth >= pos + len)
>      {
>        tem = gen_lowpart (mode, XEXP (x, 0));
> -      if (!tem || GET_CODE (tem) == CLOBBER)
> +      if (!tem
> +	  || GET_CODE (tem) == CLOBBER
> +	  || (GET_CODE (tem) == SUBREG
> +	      && !REG_P (SUBREG_REG (tem))
> +	      && !MEM_P (SUBREG_REG (tem))))
>  	return x;

Why handle subregs of memory here?  Those are antiquated, and should go
away pretty soon.

All of {expand,make}_compound_operation should be removed.  But this is
non-trivial work.  All of {SIGN,ZERO}_EXTRACT should be eradicated as
well, in my opinion.


Segher
  
Segher Boessenkool Sept. 12, 2022, 6:09 p.m. UTC | #5
On Mon, Sep 12, 2022 at 11:47:47AM -0500, Segher Boessenkool wrote:
> On Mon, Sep 12, 2022 at 12:40:31AM +0100, Roger Sayle wrote:
> >         * gcc/combine.cc (expand_compound_operation): Don't expand/transform
> >         ZERO_EXTEND or SIGN_EXTEND on targets where rtx_cost claims they are
> >         cheap.  If gen_lowpart returns a SUBREG of something other than a
> >         REG or a MEM, i.e. invalid RTL, return the original expression.
> 
> This needs testing and analysis on all platforms.

Btw.  This then needs to be of only the combine change.  Importantly not
the simplify changes, which should be a separate patch anyway: they
should be an improvment always (otherwise they are not okay anyway!)

The i386 patches should be separate and probably can be committed first,
too?  I haven't looked at those target things.

Other target will probably want changes as well?


Segher
  

Patch

diff --git a/gcc/combine.cc b/gcc/combine.cc
index a5fabf3..9412a9c 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7288,7 +7288,17 @@  expand_compound_operation (rtx x)
 	  && (STORE_FLAG_VALUE & ~GET_MODE_MASK (inner_mode)) == 0)
 	return SUBREG_REG (XEXP (x, 0));
 
+      /* If ZERO_EXTEND is cheap on this target, do nothing,
+	 i.e. don't attempt to convert it to a pair of shifts.  */
+      if (set_src_cost (x, mode, optimize_this_for_speed_p)
+	  <= COSTS_N_INSNS (1))
+	return x;
     }
+  /* Likewise, if SIGN_EXTEND is cheap, do nothing.  */
+  else if (GET_CODE (x) == SIGN_EXTEND
+	   && set_src_cost (x, mode, optimize_this_for_speed_p)
+	      <= COSTS_N_INSNS (1))
+    return x;
 
   /* If we reach here, we want to return a pair of shifts.  The inner
      shift is a left shift of BITSIZE - POS - LEN bits.  The outer
@@ -7309,7 +7319,11 @@  expand_compound_operation (rtx x)
   if (modewidth >= pos + len)
     {
       tem = gen_lowpart (mode, XEXP (x, 0));
-      if (!tem || GET_CODE (tem) == CLOBBER)
+      if (!tem
+	  || GET_CODE (tem) == CLOBBER
+	  || (GET_CODE (tem) == SUBREG
+	      && !REG_P (SUBREG_REG (tem))
+	      && !MEM_P (SUBREG_REG (tem))))
 	return x;
       tem = simplify_shift_const (NULL_RTX, ASHIFT, mode,
 				  tem, modewidth - pos - len);
@@ -7321,7 +7335,11 @@  expand_compound_operation (rtx x)
       tem = simplify_shift_const (NULL_RTX, LSHIFTRT, inner_mode,
 				  XEXP (x, 0), pos);
       tem = gen_lowpart (mode, tem);
-      if (!tem || GET_CODE (tem) == CLOBBER)
+      if (!tem
+	  || GET_CODE (tem) == CLOBBER
+	  || (GET_CODE (tem) == SUBREG
+	      && !REG_P (SUBREG_REG (tem))
+	      && !MEM_P (SUBREG_REG (tem))))
 	return x;
       tem = simplify_and_const_int (NULL_RTX, mode, tem,
 				    (HOST_WIDE_INT_1U << len) - 1);
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index fc0d6c3..698ca6e 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -1404,22 +1404,32 @@  simplify_context::simplify_unary_operation_1 (rtx_code code, machine_mode mode,
       break;
 
     case FFS:
-      /* (ffs (*_extend <X>)) = (ffs <X>) */
+      /* (ffs (*_extend <X>)) = (*_extend (ffs <X>)).  */
       if (GET_CODE (op) == SIGN_EXTEND
 	  || GET_CODE (op) == ZERO_EXTEND)
-	return simplify_gen_unary (FFS, mode, XEXP (op, 0),
-				   GET_MODE (XEXP (op, 0)));
+	{
+	  temp = simplify_gen_unary (FFS, GET_MODE (XEXP (op, 0)),
+				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
+	  return simplify_gen_unary (GET_CODE (op), mode, temp,
+				     GET_MODE (temp));
+	}
       break;
 
     case POPCOUNT:
       switch (GET_CODE (op))
 	{
 	case BSWAP:
-	case ZERO_EXTEND:
-	  /* (popcount (zero_extend <X>)) = (popcount <X>) */
+	  /* (popcount (bswap <X>)) = (popcount <X>).  */
 	  return simplify_gen_unary (POPCOUNT, mode, XEXP (op, 0),
 				     GET_MODE (XEXP (op, 0)));
 
+	case ZERO_EXTEND:
+	  /* (popcount (zero_extend <X>)) = (zero_extend (popcount <X>)).  */
+	  temp = simplify_gen_unary (POPCOUNT, GET_MODE (XEXP (op, 0)),
+				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
+	  return simplify_gen_unary (ZERO_EXTEND, mode, temp,
+				     GET_MODE (temp));
+
 	case ROTATE:
 	case ROTATERT:
 	  /* Rotations don't affect popcount.  */
@@ -1438,11 +1448,16 @@  simplify_context::simplify_unary_operation_1 (rtx_code code, machine_mode mode,
 	{
 	case NOT:
 	case BSWAP:
-	case ZERO_EXTEND:
-	case SIGN_EXTEND:
 	  return simplify_gen_unary (PARITY, mode, XEXP (op, 0),
 				     GET_MODE (XEXP (op, 0)));
 
+	case ZERO_EXTEND:
+	case SIGN_EXTEND:
+	  temp = simplify_gen_unary (PARITY, GET_MODE (XEXP (op, 0)),
+				     XEXP (op, 0), GET_MODE (XEXP (op, 0)));
+	  return simplify_gen_unary (GET_CODE (op), mode, temp,
+				     GET_MODE (temp));
+
 	case ROTATE:
 	case ROTATERT:
 	  /* Rotations don't affect parity.  */
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 1be9b66..dcf50cb 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -17010,6 +17010,42 @@ 
    (set_attr "type" "bitmanip")
    (set_attr "mode" "SI")])
 
+(define_insn_and_split "*clzsi2_lzcnt_zext_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (clz:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_LZCNT && TARGET_64BIT"
+  "lzcnt{l}\t{%1, %k0|%k0, %1}"
+  "&& TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
+   && optimize_function_for_speed_p (cfun)
+   && !reg_mentioned_p (operands[0], operands[1])"
+  [(parallel
+    [(set (match_dup 0)
+	  (zero_extend:DI (clz:SI (match_dup 1))))
+     (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
+     (clobber (reg:CC FLAGS_REG))])]
+  "ix86_expand_clear (operands[0]);"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
+; False dependency happens when destination is only updated by tzcnt,
+; lzcnt or popcnt.  There is no false dependency when destination is
+; also used in source.
+(define_insn "*clzsi2_lzcnt_zext_2_falsedep"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (clz:SI (match_operand:SWI48 1 "nonimmediate_operand" "rm"))))
+   (unspec [(match_operand:DI 2 "register_operand" "0")]
+	   UNSPEC_INSN_FALSE_DEP)
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_LZCNT"
+  "lzcnt{l}\t{%1, %k0|%k0, %1}"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
 (define_int_iterator LT_ZCNT
 	[(UNSPEC_TZCNT "TARGET_BMI")
 	 (UNSPEC_LZCNT "TARGET_LZCNT")])
@@ -17328,6 +17364,22 @@ 
    (set_attr "prefix" "vex")
    (set_attr "mode" "DI")])
 
+(define_insn "*bmi2_bzhi_zero_extendsidi_5"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(and:DI
+	  (zero_extend:DI
+	    (plus:SI
+	      (ashift:SI (const_int 1)
+			 (match_operand:QI 2 "register_operand" "r"))
+	      (const_int -1)))
+	  (match_operand:DI 1 "nonimmediate_operand" "rm")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT && TARGET_BMI2"
+  "bzhi\t{%q2, %q1, %q0|%q0, %q1, %q2}"
+  [(set_attr "type" "bitmanip")
+   (set_attr "prefix" "vex")
+   (set_attr "mode" "DI")])
+
 (define_insn "bmi2_pdep_<mode>3"
   [(set (match_operand:SWI48 0 "register_operand" "=r")
         (unspec:SWI48 [(match_operand:SWI48 1 "register_operand" "r")
@@ -17590,6 +17642,54 @@ 
    (set_attr "type" "bitmanip")
    (set_attr "mode" "SI")])
 
+(define_insn_and_split "*popcountsi2_zext_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT && TARGET_64BIT"
+{
+#if TARGET_MACHO
+  return "popcnt\t{%1, %k0|%k0, %1}";
+#else
+  return "popcnt{l}\t{%1, %k0|%k0, %1}";
+#endif
+}
+  "&& TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
+   && optimize_function_for_speed_p (cfun)
+   && !reg_mentioned_p (operands[0], operands[1])"
+  [(parallel
+    [(set (match_dup 0)
+	  (zero_extend:DI (popcount:SI (match_dup 1))))
+     (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
+     (clobber (reg:CC FLAGS_REG))])]
+  "ix86_expand_clear (operands[0]);"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
+; False dependency happens when destination is only updated by tzcnt,
+; lzcnt or popcnt.  There is no false dependency when destination is
+; also used in source.
+(define_insn "*popcountsi2_zext_2_falsedep"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (unspec [(match_operand:DI 2 "register_operand" "0")]
+	   UNSPEC_INSN_FALSE_DEP)
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT && TARGET_64BIT"
+{
+#if TARGET_MACHO
+  return "popcnt\t{%1, %k0|%k0, %1}";
+#else
+  return "popcnt{l}\t{%1, %k0|%k0, %1}";
+#endif
+}
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
 (define_insn_and_split "*popcounthi2_1"
   [(set (match_operand:SI 0 "register_operand")
 	(popcount:SI
@@ -17608,6 +17708,24 @@ 
   DONE;
 })
 
+(define_insn_and_split "*popcounthi2_2"
+  [(set (match_operand:SI 0 "register_operand")
+	(zero_extend:SI
+	  (popcount:HI (match_operand:HI 1 "nonimmediate_operand"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  rtx tmp = gen_reg_rtx (HImode);
+
+  emit_insn (gen_popcounthi2 (tmp, operands[1]));
+  emit_insn (gen_zero_extendhisi2 (operands[0], tmp));
+  DONE;
+})
+
 (define_insn "popcounthi2"
   [(set (match_operand:HI 0 "register_operand" "=r")
 	(popcount:HI
@@ -17927,6 +18045,39 @@ 
   PUT_CODE (operands[5], GET_CODE (operands[5]) == EQ ? UNORDERED : ORDERED);
 })
 
+;; Eliminate HImode popcount&1 using parity flag (variant 2)
+(define_peephole2
+  [(match_scratch:HI 0 "Q")
+   (parallel [(set (match_operand:HI 1 "register_operand")
+		   (popcount:HI
+		    (match_operand:HI 2 "nonimmediate_operand")))
+	      (clobber (reg:CC FLAGS_REG))])
+   (set (reg:CCZ FLAGS_REG)
+        (compare:CCZ (and:QI (match_operand:QI 3 "register_operand")
+			     (const_int 1))
+		     (const_int 0)))
+   (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator"
+			    [(reg:CCZ FLAGS_REG)
+			     (const_int 0)])
+			   (label_ref (match_operand 5))
+			   (pc)))]
+  "REGNO (operands[1]) == REGNO (operands[3])
+   && peep2_reg_dead_p (2, operands[1])
+   && peep2_reg_dead_p (2, operands[3])
+   && peep2_regno_dead_p (3, FLAGS_REG)"
+  [(set (match_dup 0) (match_dup 2))
+   (parallel [(set (reg:CC FLAGS_REG)
+		   (unspec:CC [(match_dup 0)] UNSPEC_PARITY))
+	      (clobber (match_dup 0))])
+   (set (pc) (if_then_else (match_op_dup 4 [(reg:CC FLAGS_REG)
+					    (const_int 0)])
+			   (label_ref (match_dup 5))
+			   (pc)))]
+{
+  operands[4] = shallow_copy_rtx (operands[4]);
+  PUT_CODE (operands[4], GET_CODE (operands[4]) == EQ ? UNORDERED : ORDERED);
+})
+
 
 ;; Thread-local storage patterns for ELF.
 ;;