[x86] Provide zero_extend versions/variants of several patterns.

Message ID 00e801d91a59$df0f3e70$9d2dbb50$@nextmovesoftware.com
State Accepted
Headers
Series [x86] Provide zero_extend versions/variants of several patterns. |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

Roger Sayle Dec. 28, 2022, 1:15 a.m. UTC
  Back in September, the review of my patch for PR rtl-optimization/106594,
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601501.html
suggested that I submit the x86 backend bits, independently and first.

The executive summary is that the middle-end doesn't have a preferred
canonical form for expressing zero-extension, sometimes using an AND
and sometimes using zero_extend.  Pending changes to RTL simplification
will/may alter some of these representations, so a few additional
patterns are required to recognize these alternate representations
and avoid any testsuite regressions.

As an example, *popcountsi2_zext is currently represented as:
  [(set (match_operand:DI 0 "register_operand" "=r")
        (and:DI
          (subreg:DI
            (popcount:SI
              (match_operand:SI 1 "nonimmediate_operand" "rm")) 0)
          (const_int 63)))
   (clobber (reg:CC FLAGS_REG))]

this patch adds an alternate/equivalent pattern that matches:
  [(set (match_operand:DI 0 "register_operand" "=r")
       (zero_extend:DI
         (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
   (clobber (reg:CC FLAGS_REG))]

Another example is *popcounthi2 which is currently represented as:
  [(set (match_operand:SI 0 "register_operand")
        (popcount:SI
          (zero_extend:SI (match_operand:HI 1 "nonimmediate_operand"))))
   (clobber (reg:CC FLAGS_REG))]

this patch adds an alternate/equivalent pattern that matches:
  [(set (match_operand:SI 0 "register_operand")
        (zero_extend:SI
          (popcount:HI (match_operand:HI 1 "nonimmediate_operand"))))
   (clobber (reg:CC FLAGS_REG))]

The contents of the machine description definitions remain the same,
it's just the expected RTL is slightly different but equivalent.
Providing both forms makes the backend more robust to middle-end
changes [and possibly catches some missed optimizations].

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-12-28  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386.md (*clzsi2_lzcnt_zext_2): define_insn_and_split
        to match ZERO_EXTEND form of *clzsi2_lzcnt_zext.
        (*clzsi2_lzcnt_zext_2_falsedep): Likewise, new define_insn to match
        ZERO_EXTEND form of *clzsi2_lzcnt_zext_falsedep.
        (*bmi2_bzhi_zero_extendsidi_5): Likewise, new define_insn to match
        ZERO_EXTEND form of *bmi2_bzhi_zero_extendsidi.
        (*popcountsi2_zext_2): Likewise, new define_insn_and_split to match
        ZERO_EXTEND form of *popcountsi2_zext.
        (*popcountsi2_zext_2_falsedep): Likewise, new define_insn to match
        ZERO_EXTEND form of *popcountsi2_zext_falsedep.
        (*popcounthi2_2): Likewise, new define_insn_and_split to match
        ZERO_EXTEND form of *popcounthi2.
        (define_peephole2): ZERO_EXTEND variant of HImode popcount&1 using
        parity flag peephole2.

Thanks in advance,
Roger
--
  

Comments

Jeff Law Dec. 28, 2022, 1:32 a.m. UTC | #1
On 12/27/22 18:15, Roger Sayle wrote:
> 
> Back in September, the review of my patch for PR rtl-optimization/106594,
> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601501.html
> suggested that I submit the x86 backend bits, independently and first.
> 
> The executive summary is that the middle-end doesn't have a preferred
> canonical form for expressing zero-extension, sometimes using an AND
> and sometimes using zero_extend.  Pending changes to RTL simplification
> will/may alter some of these representations, so a few additional
> patterns are required to recognize these alternate representations
> and avoid any testsuite regressions.
Oh and internally other forms can show up.  Like paired shifts.  In 
fact, I think paired shifts are canonical inside at least some parts of 
combine.

jeff
  
Uros Bizjak Dec. 28, 2022, 9:35 a.m. UTC | #2
On Wed, Dec 28, 2022 at 2:15 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Back in September, the review of my patch for PR rtl-optimization/106594,
> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601501.html
> suggested that I submit the x86 backend bits, independently and first.
>
> The executive summary is that the middle-end doesn't have a preferred
> canonical form for expressing zero-extension, sometimes using an AND
> and sometimes using zero_extend.  Pending changes to RTL simplification
> will/may alter some of these representations, so a few additional
> patterns are required to recognize these alternate representations
> and avoid any testsuite regressions.
>
> As an example, *popcountsi2_zext is currently represented as:
>   [(set (match_operand:DI 0 "register_operand" "=r")
>         (and:DI
>           (subreg:DI
>             (popcount:SI
>               (match_operand:SI 1 "nonimmediate_operand" "rm")) 0)
>           (const_int 63)))
>    (clobber (reg:CC FLAGS_REG))]
>
> this patch adds an alternate/equivalent pattern that matches:
>   [(set (match_operand:DI 0 "register_operand" "=r")
>        (zero_extend:DI
>          (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
>    (clobber (reg:CC FLAGS_REG))]
>
> Another example is *popcounthi2 which is currently represented as:
>   [(set (match_operand:SI 0 "register_operand")
>         (popcount:SI
>           (zero_extend:SI (match_operand:HI 1 "nonimmediate_operand"))))
>    (clobber (reg:CC FLAGS_REG))]
>
> this patch adds an alternate/equivalent pattern that matches:
>   [(set (match_operand:SI 0 "register_operand")
>         (zero_extend:SI
>           (popcount:HI (match_operand:HI 1 "nonimmediate_operand"))))
>    (clobber (reg:CC FLAGS_REG))]
>
> The contents of the machine description definitions remain the same,
> it's just the expected RTL is slightly different but equivalent.
> Providing both forms makes the backend more robust to middle-end
> changes [and possibly catches some missed optimizations].

It would be nice to have a canonical representation of zero-extended
patterns, but this is what we have now. Unfortunately, a certain HW
limitation requires several patterns for one insn, so the canonical
representation is even more desirable here.  Hopefully, a "future"
patch will allow us some cleanups in this area.

> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  Ok for mainline?

OK, but please split out HImode popcount&1 pattern to a separate patch
to not mix separate topics in one patch.

Thanks,
Uros.

>
>
> 2022-12-28  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386.md (*clzsi2_lzcnt_zext_2): define_insn_and_split
>         to match ZERO_EXTEND form of *clzsi2_lzcnt_zext.
>         (*clzsi2_lzcnt_zext_2_falsedep): Likewise, new define_insn to match
>         ZERO_EXTEND form of *clzsi2_lzcnt_zext_falsedep.
>         (*bmi2_bzhi_zero_extendsidi_5): Likewise, new define_insn to match
>         ZERO_EXTEND form of *bmi2_bzhi_zero_extendsidi.
>         (*popcountsi2_zext_2): Likewise, new define_insn_and_split to match
>         ZERO_EXTEND form of *popcountsi2_zext.
>         (*popcountsi2_zext_2_falsedep): Likewise, new define_insn to match
>         ZERO_EXTEND form of *popcountsi2_zext_falsedep.
>         (*popcounthi2_2): Likewise, new define_insn_and_split to match
>         ZERO_EXTEND form of *popcounthi2.
>         (define_peephole2): ZERO_EXTEND variant of HImode popcount&1 using
>         parity flag peephole2.
>
> Thanks in advance,
> Roger
> --
>
  
Roger Sayle Dec. 28, 2022, 12:22 p.m. UTC | #3
Hi Uros,
Many thanks for your reviews.

> On Wed, Dec 28, 2022 at 2:15 AM Roger Sayle
> <roger@nextmovesoftware.com> wrote:
> >
> >
> > Back in September, the review of my patch for PR
> > rtl-optimization/106594,
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601501.html
> > suggested that I submit the x86 backend bits, independently and first.
> >
> > The executive summary is that the middle-end doesn't have a preferred
> > canonical form for expressing zero-extension, sometimes using an AND
> > and sometimes using zero_extend.  Pending changes to RTL
> > simplification will/may alter some of these representations, so a few
> > additional patterns are required to recognize these alternate
> > representations and avoid any testsuite regressions.
> >
> > As an example, *popcountsi2_zext is currently represented as:
> >   [(set (match_operand:DI 0 "register_operand" "=r")
> >         (and:DI
> >           (subreg:DI
> >             (popcount:SI
> >               (match_operand:SI 1 "nonimmediate_operand" "rm")) 0)
> >           (const_int 63)))
> >    (clobber (reg:CC FLAGS_REG))]
> >
> > this patch adds an alternate/equivalent pattern that matches:
> >   [(set (match_operand:DI 0 "register_operand" "=r")
> >        (zero_extend:DI
> >          (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
> >    (clobber (reg:CC FLAGS_REG))]
> >
> > Another example is *popcounthi2 which is currently represented as:
> >   [(set (match_operand:SI 0 "register_operand")
> >         (popcount:SI
> >           (zero_extend:SI (match_operand:HI 1 "nonimmediate_operand"))))
> >    (clobber (reg:CC FLAGS_REG))]
> >
> > this patch adds an alternate/equivalent pattern that matches:
> >   [(set (match_operand:SI 0 "register_operand")
> >         (zero_extend:SI
> >           (popcount:HI (match_operand:HI 1 "nonimmediate_operand"))))
> >    (clobber (reg:CC FLAGS_REG))]
> >
> > The contents of the machine description definitions remain the same,
> > it's just the expected RTL is slightly different but equivalent.
> > Providing both forms makes the backend more robust to middle-end
> > changes [and possibly catches some missed optimizations].
> 
> It would be nice to have a canonical representation of zero-extended patterns,
> but this is what we have now. Unfortunately, a certain HW limitation requires
> several patterns for one insn, so the canonical representation is even more
> desirable here.  Hopefully, a "future"
> patch will allow us some cleanups in this area.
> 
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> 
> OK, but please split out HImode popcount&1 pattern to a separate patch to not
> mix separate topics in one patch.

The peephole2 is the exact same topic; it's not a new optimization or feature, 
but identical fall-out from the representational changes to ZERO_EXTEND.

The existing peephole2 (line 18305 of i386.md) matches this sequence:

   (parallel [(set (match_operand:HI 1 "register_operand")
                   (popcount:HI
                    (match_operand:HI 2 "nonimmediate_operand")))
              (clobber (reg:CC FLAGS_REG))])
   (set (match_operand 3 "register_operand")
        (zero_extend (match_dup 1)))
   (set (reg:CCZ FLAGS_REG)
        (compare:CCZ (and:QI (match_operand:QI 4 "register_operand")
                             (const_int 1))
                     (const_int 0)))
   (set (pc) (if_then_else (match_operator 5 "bt_comparison_operator"
                            [(reg:CCZ FLAGS_REG)
                             (const_int 0)])
                           (label_ref (match_operand 6))
                           (pc)))]

but with pending tweaks/changes to middle-end zero_extend handing, the
peephole2 pass now sees the equivalent (apologies for the renumbering):

   (parallel [(set (match_operand:HI 1 "register_operand")
                  (popcount:HI
                   (match_operand:HI 2 "nonimmediate_operand")))
             (clobber (reg:CC FLAGS_REG))])
   (set (reg:CCZ FLAGS_REG)
        (compare:CCZ (and:QI (match_operand:QI 3 "register_operand")
                            (const_int 1))
                    (const_int 0)))
   (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator"
                           [(reg:CCZ FLAGS_REG)
                            (const_int 0)])
                           (label_ref (match_operand 5))
                           (pc)))]

Notice that the zero_extension has been eliminated, as match_dup 3
and match_dup 4 are the same register, and we only need to inspect
the lowest bit.  i.e. the current transform matches a redundant
instruction, that the RTL optimizers will soon be able to eliminate.

Hence, IMHO, the issue is identical, the RTL that the backend is expecting
is sensitive the representations and transformations performed upstream.

All the new patterns appear in the MD file immediately after their originals,
so this peephole2 appears immediately after the peephole2 that used to
match.

I'm happy to perform two commits, if that might make bisecting issues
easier, but these changes really are just a single topic (already split into 
three pieces from a serious P1 regression fix).  It may be possible to
remove the current patterns, once they become "dead", perhaps when
we return to stage1 (after the middle-end pieces have been approved).

> 
> Thanks,
> Uros.
> 
> >
> >
> > 2022-12-28  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386.md (*clzsi2_lzcnt_zext_2): define_insn_and_split
> >         to match ZERO_EXTEND form of *clzsi2_lzcnt_zext.
> >         (*clzsi2_lzcnt_zext_2_falsedep): Likewise, new define_insn to match
> >         ZERO_EXTEND form of *clzsi2_lzcnt_zext_falsedep.
> >         (*bmi2_bzhi_zero_extendsidi_5): Likewise, new define_insn to match
> >         ZERO_EXTEND form of *bmi2_bzhi_zero_extendsidi.
> >         (*popcountsi2_zext_2): Likewise, new define_insn_and_split to match
> >         ZERO_EXTEND form of *popcountsi2_zext.
> >         (*popcountsi2_zext_2_falsedep): Likewise, new define_insn to match
> >         ZERO_EXTEND form of *popcountsi2_zext_falsedep.
> >         (*popcounthi2_2): Likewise, new define_insn_and_split to match
> >         ZERO_EXTEND form of *popcounthi2.
> >         (define_peephole2): ZERO_EXTEND variant of HImode popcount&1 using
> >         parity flag peephole2.
> >
> > Thanks in advance,
> > Roger
> > --
> >
  
Uros Bizjak Dec. 28, 2022, 1:34 p.m. UTC | #4
On Wed, Dec 28, 2022 at 1:22 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Many thanks for your reviews.
>
> > On Wed, Dec 28, 2022 at 2:15 AM Roger Sayle
> > <roger@nextmovesoftware.com> wrote:
> > >
> > >
> > > Back in September, the review of my patch for PR
> > > rtl-optimization/106594,
> > > https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601501.html
> > > suggested that I submit the x86 backend bits, independently and first.
> > >
> > > The executive summary is that the middle-end doesn't have a preferred
> > > canonical form for expressing zero-extension, sometimes using an AND
> > > and sometimes using zero_extend.  Pending changes to RTL
> > > simplification will/may alter some of these representations, so a few
> > > additional patterns are required to recognize these alternate
> > > representations and avoid any testsuite regressions.
> > >
> > > As an example, *popcountsi2_zext is currently represented as:
> > >   [(set (match_operand:DI 0 "register_operand" "=r")
> > >         (and:DI
> > >           (subreg:DI
> > >             (popcount:SI
> > >               (match_operand:SI 1 "nonimmediate_operand" "rm")) 0)
> > >           (const_int 63)))
> > >    (clobber (reg:CC FLAGS_REG))]
> > >
> > > this patch adds an alternate/equivalent pattern that matches:
> > >   [(set (match_operand:DI 0 "register_operand" "=r")
> > >        (zero_extend:DI
> > >          (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
> > >    (clobber (reg:CC FLAGS_REG))]
> > >
> > > Another example is *popcounthi2 which is currently represented as:
> > >   [(set (match_operand:SI 0 "register_operand")
> > >         (popcount:SI
> > >           (zero_extend:SI (match_operand:HI 1 "nonimmediate_operand"))))
> > >    (clobber (reg:CC FLAGS_REG))]
> > >
> > > this patch adds an alternate/equivalent pattern that matches:
> > >   [(set (match_operand:SI 0 "register_operand")
> > >         (zero_extend:SI
> > >           (popcount:HI (match_operand:HI 1 "nonimmediate_operand"))))
> > >    (clobber (reg:CC FLAGS_REG))]
> > >
> > > The contents of the machine description definitions remain the same,
> > > it's just the expected RTL is slightly different but equivalent.
> > > Providing both forms makes the backend more robust to middle-end
> > > changes [and possibly catches some missed optimizations].
> >
> > It would be nice to have a canonical representation of zero-extended patterns,
> > but this is what we have now. Unfortunately, a certain HW limitation requires
> > several patterns for one insn, so the canonical representation is even more
> > desirable here.  Hopefully, a "future"
> > patch will allow us some cleanups in this area.
> >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32},
> > > with no new failures.  Ok for mainline?
> >
> > OK, but please split out HImode popcount&1 pattern to a separate patch to not
> > mix separate topics in one patch.
>
> The peephole2 is the exact same topic; it's not a new optimization or feature,
> but identical fall-out from the representational changes to ZERO_EXTEND.
>
> The existing peephole2 (line 18305 of i386.md) matches this sequence:
>
>    (parallel [(set (match_operand:HI 1 "register_operand")
>                    (popcount:HI
>                     (match_operand:HI 2 "nonimmediate_operand")))
>               (clobber (reg:CC FLAGS_REG))])
>    (set (match_operand 3 "register_operand")
>         (zero_extend (match_dup 1)))
>    (set (reg:CCZ FLAGS_REG)
>         (compare:CCZ (and:QI (match_operand:QI 4 "register_operand")
>                              (const_int 1))
>                      (const_int 0)))
>    (set (pc) (if_then_else (match_operator 5 "bt_comparison_operator"
>                             [(reg:CCZ FLAGS_REG)
>                              (const_int 0)])
>                            (label_ref (match_operand 6))
>                            (pc)))]
>
> but with pending tweaks/changes to middle-end zero_extend handing, the
> peephole2 pass now sees the equivalent (apologies for the renumbering):
>
>    (parallel [(set (match_operand:HI 1 "register_operand")
>                   (popcount:HI
>                    (match_operand:HI 2 "nonimmediate_operand")))
>              (clobber (reg:CC FLAGS_REG))])
>    (set (reg:CCZ FLAGS_REG)
>         (compare:CCZ (and:QI (match_operand:QI 3 "register_operand")
>                             (const_int 1))
>                     (const_int 0)))
>    (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator"
>                            [(reg:CCZ FLAGS_REG)
>                             (const_int 0)])
>                            (label_ref (match_operand 5))
>                            (pc)))]
>
> Notice that the zero_extension has been eliminated, as match_dup 3
> and match_dup 4 are the same register, and we only need to inspect
> the lowest bit.  i.e. the current transform matches a redundant
> instruction, that the RTL optimizers will soon be able to eliminate.
>
> Hence, IMHO, the issue is identical, the RTL that the backend is expecting
> is sensitive the representations and transformations performed upstream.
>
> All the new patterns appear in the MD file immediately after their originals,
> so this peephole2 appears immediately after the peephole2 that used to
> match.
>
> I'm happy to perform two commits, if that might make bisecting issues
> easier, but these changes really are just a single topic (already split into
> three pieces from a serious P1 regression fix).  It may be possible to
> remove the current patterns, once they become "dead", perhaps when
> we return to stage1 (after the middle-end pieces have been approved).

Thanks for the explanation, please go ahead with your original patch.

Thanks,
Uros.

>
> >
> > Thanks,
> > Uros.
> >
> > >
> > >
> > > 2022-12-28  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         * config/i386/i386.md (*clzsi2_lzcnt_zext_2): define_insn_and_split
> > >         to match ZERO_EXTEND form of *clzsi2_lzcnt_zext.
> > >         (*clzsi2_lzcnt_zext_2_falsedep): Likewise, new define_insn to match
> > >         ZERO_EXTEND form of *clzsi2_lzcnt_zext_falsedep.
> > >         (*bmi2_bzhi_zero_extendsidi_5): Likewise, new define_insn to match
> > >         ZERO_EXTEND form of *bmi2_bzhi_zero_extendsidi.
> > >         (*popcountsi2_zext_2): Likewise, new define_insn_and_split to match
> > >         ZERO_EXTEND form of *popcountsi2_zext.
> > >         (*popcountsi2_zext_2_falsedep): Likewise, new define_insn to match
> > >         ZERO_EXTEND form of *popcountsi2_zext_falsedep.
> > >         (*popcounthi2_2): Likewise, new define_insn_and_split to match
> > >         ZERO_EXTEND form of *popcounthi2.
> > >         (define_peephole2): ZERO_EXTEND variant of HImode popcount&1 using
> > >         parity flag peephole2.
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
>
  

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 0626752..ca40c4f 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -17419,6 +17419,42 @@ 
    (set_attr "type" "bitmanip")
    (set_attr "mode" "SI")])
 
+(define_insn_and_split "*clzsi2_lzcnt_zext_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (clz:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_LZCNT && TARGET_64BIT"
+  "lzcnt{l}\t{%1, %k0|%k0, %1}"
+  "&& TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
+   && optimize_function_for_speed_p (cfun)
+   && !reg_mentioned_p (operands[0], operands[1])"
+  [(parallel
+    [(set (match_dup 0)
+	  (zero_extend:DI (clz:SI (match_dup 1))))
+     (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
+     (clobber (reg:CC FLAGS_REG))])]
+  "ix86_expand_clear (operands[0]);"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
+; False dependency happens when destination is only updated by tzcnt,
+; lzcnt or popcnt.  There is no false dependency when destination is
+; also used in source.
+(define_insn "*clzsi2_lzcnt_zext_2_falsedep"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (clz:SI (match_operand:SWI48 1 "nonimmediate_operand" "rm"))))
+   (unspec [(match_operand:DI 2 "register_operand" "0")]
+	   UNSPEC_INSN_FALSE_DEP)
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_LZCNT"
+  "lzcnt{l}\t{%1, %k0|%k0, %1}"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
 (define_int_iterator LT_ZCNT
 	[(UNSPEC_TZCNT "TARGET_BMI")
 	 (UNSPEC_LZCNT "TARGET_LZCNT")])
@@ -17737,6 +17773,22 @@ 
    (set_attr "prefix" "vex")
    (set_attr "mode" "DI")])
 
+(define_insn "*bmi2_bzhi_zero_extendsidi_5"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(and:DI
+	  (zero_extend:DI
+	    (plus:SI
+	      (ashift:SI (const_int 1)
+			 (match_operand:QI 2 "register_operand" "r"))
+	      (const_int -1)))
+	  (match_operand:DI 1 "nonimmediate_operand" "rm")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT && TARGET_BMI2"
+  "bzhi\t{%q2, %q1, %q0|%q0, %q1, %q2}"
+  [(set_attr "type" "bitmanip")
+   (set_attr "prefix" "vex")
+   (set_attr "mode" "DI")])
+
 (define_insn "bmi2_pdep_<mode>3"
   [(set (match_operand:SWI48 0 "register_operand" "=r")
         (unspec:SWI48 [(match_operand:SWI48 1 "register_operand" "r")
@@ -17999,6 +18051,54 @@ 
    (set_attr "type" "bitmanip")
    (set_attr "mode" "SI")])
 
+(define_insn_and_split "*popcountsi2_zext_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT && TARGET_64BIT"
+{
+#if TARGET_MACHO
+  return "popcnt\t{%1, %k0|%k0, %1}";
+#else
+  return "popcnt{l}\t{%1, %k0|%k0, %1}";
+#endif
+}
+  "&& TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
+   && optimize_function_for_speed_p (cfun)
+   && !reg_mentioned_p (operands[0], operands[1])"
+  [(parallel
+    [(set (match_dup 0)
+	  (zero_extend:DI (popcount:SI (match_dup 1))))
+     (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
+     (clobber (reg:CC FLAGS_REG))])]
+  "ix86_expand_clear (operands[0]);"
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
+; False dependency happens when destination is only updated by tzcnt,
+; lzcnt or popcnt.  There is no false dependency when destination is
+; also used in source.
+(define_insn "*popcountsi2_zext_2_falsedep"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (popcount:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (unspec [(match_operand:DI 2 "register_operand" "0")]
+	   UNSPEC_INSN_FALSE_DEP)
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT && TARGET_64BIT"
+{
+#if TARGET_MACHO
+  return "popcnt\t{%1, %k0|%k0, %1}";
+#else
+  return "popcnt{l}\t{%1, %k0|%k0, %1}";
+#endif
+}
+  [(set_attr "prefix_rep" "1")
+   (set_attr "type" "bitmanip")
+   (set_attr "mode" "SI")])
+
 (define_insn_and_split "*popcounthi2_1"
   [(set (match_operand:SI 0 "register_operand")
 	(popcount:SI
@@ -18017,6 +18117,24 @@ 
   DONE;
 })
 
+(define_insn_and_split "*popcounthi2_2"
+  [(set (match_operand:SI 0 "register_operand")
+	(zero_extend:SI
+	  (popcount:HI (match_operand:HI 1 "nonimmediate_operand"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_POPCNT
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  rtx tmp = gen_reg_rtx (HImode);
+
+  emit_insn (gen_popcounthi2 (tmp, operands[1]));
+  emit_insn (gen_zero_extendhisi2 (operands[0], tmp));
+  DONE;
+})
+
 (define_insn "popcounthi2"
   [(set (match_operand:HI 0 "register_operand" "=r")
 	(popcount:HI
@@ -18336,6 +18454,39 @@ 
   PUT_CODE (operands[5], GET_CODE (operands[5]) == EQ ? UNORDERED : ORDERED);
 })
 
+;; Eliminate HImode popcount&1 using parity flag (variant 2)
+(define_peephole2
+  [(match_scratch:HI 0 "Q")
+   (parallel [(set (match_operand:HI 1 "register_operand")
+		   (popcount:HI
+		    (match_operand:HI 2 "nonimmediate_operand")))
+	      (clobber (reg:CC FLAGS_REG))])
+   (set (reg:CCZ FLAGS_REG)
+        (compare:CCZ (and:QI (match_operand:QI 3 "register_operand")
+			     (const_int 1))
+		     (const_int 0)))
+   (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator"
+			    [(reg:CCZ FLAGS_REG)
+			     (const_int 0)])
+			   (label_ref (match_operand 5))
+			   (pc)))]
+  "REGNO (operands[1]) == REGNO (operands[3])
+   && peep2_reg_dead_p (2, operands[1])
+   && peep2_reg_dead_p (2, operands[3])
+   && peep2_regno_dead_p (3, FLAGS_REG)"
+  [(set (match_dup 0) (match_dup 2))
+   (parallel [(set (reg:CC FLAGS_REG)
+		   (unspec:CC [(match_dup 0)] UNSPEC_PARITY))
+	      (clobber (match_dup 0))])
+   (set (pc) (if_then_else (match_op_dup 4 [(reg:CC FLAGS_REG)
+					    (const_int 0)])
+			   (label_ref (match_dup 5))
+			   (pc)))]
+{
+  operands[4] = shallow_copy_rtx (operands[4]);
+  PUT_CODE (operands[4], GET_CODE (operands[4]) == EQ ? UNORDERED : ORDERED);
+})
+
 
 ;; Thread-local storage patterns for ELF.
 ;;