From patchwork Tue Feb 27 13:56:42 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
X-Patchwork-Id: 21024
From: Andre Vieira <andre.simoesdiasvieira@arm.com>
To: gcc-patches@gcc.gnu.org
Cc: stam.markianos-wright@arm.com, richard.earnshaw@arm.com,
 Andre Vieira <andre.simoesdiasvieira@arm.com>
Subject: [PATCH v5 0/5] arm: Add support for MVE Tail-Predicated Low Overhead
 Loops
Date: Tue, 27 Feb 2024 13:56:42 +0000
Message-Id: <20240227135647.30404-1-andre.simoesdiasvieira@arm.com>
X-Mailer: git-send-email 2.17.1
MIME-Version: 1.0

Hi,

I have re-ordered the patches. Our latest plan is to commit only patches 1-3 and
leave 4-5 for GCC 15, as we believe it is too late in Stage 4 to be making
changes to target-agnostic parts, especially since these affect so many ports
that we cannot easily test.

[1/5] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns
[2/5] arm: Annotate instructions with mve_safe_imp_xlane_pred
[3/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators
[4/5] doloop: Add support for predicated vectorized loops
[5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
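#include <arm_mve.h>

void __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      /* Predicate that enables only the first n 16-bit lanes.  */
      mve_pred16_t p = vctp16q (n);
      /* Predicated loads: inactive lanes are zeroed.  */
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      /* Predicated store: only active lanes are written.  */
      vstrhq_p_s16 (c, vc, p);
      /* Advance by one vector of eight 16-bit elements.  */
      c += 8;
      a += 8;
      b += 8;
      n -= 8;
    }
}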
```
... would output:

```
        <pre-calculate the number of iterations and place it into lr>
        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr     P0, ip @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.
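
As a rough illustration of the DLS/LE scheme above, here is a scalar C
sketch of the loop-counter behaviour. The helper names and the rounding-up
formula are assumptions made for this example; the cover letter only says
that the iteration count is pre-calculated and placed into LR, not how.

```c
#include <assert.h>

enum { LANES_16 = 8 };  /* 16-bit elements in a 128-bit MVE vector */

/* Hypothetical helper: the iteration count that would be placed into LR
   before DLS, assuming it is n / 8 rounded up.  */
static int dls_iteration_count (int n)
{
  assert (n >= 0);
  return (n + LANES_16 - 1) / LANES_16;
}

/* Hypothetical model of the DLS/LE loop: LR counts iterations, and LE
   decrements it by one each time round.  Returns how many times the
   loop body ran.  */
static int run_dls_le (int n)
{
  int lr = dls_iteration_count (n);
  int executed = 0;
  while (lr > 0)
    {
      executed++;   /* loop body, still predicated by an explicit VCTP */
      lr--;         /* le lr, .L3 */
    }
  return executed;
}
```

Note that in this scheme the element count still has to be tracked
separately (r3 in the listing) so that VCTP can build the predicate.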

(There are also other inefficiencies in the above code, like the
pointless vmrs/sxth/vmsr on the VPR, the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing the LR by one, LETP will decrement it
  by the number of elements processed in each iteration, as
  determined by FPSCR.LTPSIZE: 16 for 8-bit elements, 8 for 16-bit
  elements, etc.
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.
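
The differences listed above can be sketched with a scalar C model of a
DLSTP.16/LETP loop. This is only an illustration of the described
semantics, not how the hardware or the patch implements it; the function
name and the lane-by-lane inner loop are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES_16 = 8 };  /* 16-bit elements in a 128-bit MVE vector */

/* Hypothetical scalar model of the tail-predicated add loop.
   Returns the number of loop iterations executed.  */
static int add_tail_predicated (const int16_t *a, const int16_t *b,
                                int16_t *c, int n)
{
  assert (n >= 0);
  int lr = n;            /* dlstp.16 lr, r3: LR holds the element count */
  int iterations = 0;
  for (int base = 0; lr > 0; base += LANES_16)
    {
      /* Implicit tail predication: only the final iteration can have
         fewer than LANES_16 active lanes.  */
      int active = lr < LANES_16 ? lr : LANES_16;
      for (int lane = 0; lane < active; lane++)
        c[base + lane] = (int16_t) (a[base + lane] + b[base + lane]);
      lr -= LANES_16;    /* letp: decrement by elements per iteration */
      iterations++;
    }
  return iterations;
}
```

For example, with n = 20 the model runs three iterations with 8, 8 and 4
active lanes, and never touches elements past index 19.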

The dlstp/letp loop now looks like:

```
        <place n into r3>
        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16  q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14
```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).

The LETP instruction here decrements LR by 8 in each iteration.
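
To see why dropping the user's VCTP is safe in a loop like the one above,
here is a small C model of the predicate that VCTP.16 would produce. The
function names are made up for this sketch, and the model assumes the
usual VPR.P0 layout of one predicate bit per byte, so two bits per 16-bit
lane.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of VCTP.16: enable the first n 16-bit lanes,
   two predicate bits per lane, all lanes if n >= 8.  */
static uint16_t vctp16_model (uint32_t n)
{
  uint16_t mask = 0;
  for (uint32_t lane = 0; lane < 8; lane++)
    if (lane < n)
      mask |= (uint16_t) (0x3u << (2 * lane));
  return mask;
}

/* Count the iterations of a "while (n > 0) { ...; n -= 8; }" loop in
   which the predicate is not all-true.  */
static int partial_predicates (int n)
{
  assert (n >= 0);
  int partial = 0;
  for (; n > 0; n -= 8)
    if (vctp16_model ((uint32_t) n) != 0xFFFFu)
      partial++;
  return partial;
}
```

In this model the predicate is all-true on every iteration except
possibly the last, which is exactly the case the automatic tail
predication covers.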

Stam Markianos-Wright (1):
  arm: Add define_attr to create a mapping between MVE predicated and
    unpredicated insns

Andre Vieira (4):
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  doloop: Add support for predicated vectorized loops
  arm: Add support for MVE Tail-Predicated Low Overhead Loops