From: Stamatis Markianos-Wright
Date: Thu, 15 Jun 2023 12:47:44 +0100
To: "gcc-patches@gcc.gnu.org"
Cc: Kyrylo Tkachov, Richard Earnshaw, ramana.gcc@gmail.com, "nickc@redhat.com"
Subject: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
X-Patchwork-Submitter: Stamatis Markianos-Wright
X-Patchwork-Id: 108452

Hi all,

This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.

Mid-end changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
   decrement numbers other than -1 and for `count` to be an
   rtx containing a simple REG (which in this case will contain
   the number of elements to be processed), rather
   than an expression for calculating the number of iterations.
2) Add a new df utility function: `df_bb_regno_only_def_find` that
   will return the DEF of a REG only if it is DEF-ed once within the
   basic block.

And many things in the backend to implement the above optimisation:

3) Implement the `arm_predict_doloop_p` target hook to instruct the
   mid-end about Low Overhead Loops (MVE or not), as well as
   `arm_loop_unroll_adjust` which will prevent unrolling of any loops
   that are valid for becoming MVE Tail-Predicated Low Overhead Loops
   (unrolling can transform a loop in ways that invalidate the dlstp/letp
   transformation logic and the benefit of the dlstp/letp loop
   would be considerably higher than that of unrolling).
4) Appropriate changes to the define_expand of doloop_end, new
   patterns for dlstp and letp, new iterators, unspecs, etc.
5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
   * `arm_mve_dlstp_check_dec_counter`
   * `arm_mve_dlstp_check_inc_counter`
   * `arm_mve_check_reg_origin_is_num_elems`
   * `arm_mve_check_df_chain_back_for_implic_predic`
   * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
   These all, in some way or another, run checks on the loop
   structure in order to determine if the loop is valid for dlstp/letp
   transformation.
6) `arm_attempt_dlstp_transform`: (called from the define_expand of
   doloop_end) this function re-checks the loop's suitability for
   dlstp/letp transformation and then implements it, if possible.
7) Various utility functions, e.g. `arm_mve_get_vctp_lanes` to map
   from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
   to check an insn to see if it requires the VPR or not:
   * `arm_mve_get_loop_vctp`
   * `arm_mve_get_vctp_lanes`
   * `arm_emit_mve_unpredicated_insn_to_seq`
   * `arm_get_required_vpr_reg`
   * `arm_get_required_vpr_reg_param`
   * `arm_get_required_vpr_reg_ret_val`
   * `arm_mve_vec_insn_is_predicated_with_this_predicate`
   * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf.  Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

	* config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
	(arm_target_bb_ok_for_lob): ...this.
	(arm_attempt_dlstp_transform): New.
	* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
	(TARGET_PREDICT_DOLOOP_P): New.
	(arm_block_set_vect):
	(arm_target_insn_ok_for_lob): Rename to...
	(arm_target_bb_ok_for_lob): ...this.
	(arm_mve_get_vctp_lanes): New.
	(arm_get_required_vpr_reg): New.
	(arm_get_required_vpr_reg_param): New.
	(arm_get_required_vpr_reg_ret_val): New.
	(arm_mve_get_loop_vctp): New.
	(arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
	(arm_mve_vec_insn_is_predicated_with_this_predicate): New.
	(arm_mve_check_df_chain_back_for_implic_predic): New.
	(arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
	(arm_mve_check_reg_origin_is_num_elems): New.
	(arm_mve_dlstp_check_inc_counter): New.
	(arm_mve_dlstp_check_dec_counter): New.
	(arm_mve_loop_valid_for_dlstp): New.
	(arm_predict_doloop_p): New.
	(arm_loop_unroll_adjust): New.
	(arm_emit_mve_unpredicated_insn_to_seq): New.
	(arm_attempt_dlstp_transform): New.
	* config/arm/iterators.md (DLSTP): New.
	(mode1): Add DLSTP mappings.
	* config/arm/mve.md (*predicated_doloop_end_internal): New.
	(dlstp_insn): New.
	* config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
	* config/arm/unspecs.md: New unspecs.
	* df-core.cc (df_bb_regno_only_def_find): New.
	* df.h (df_bb_regno_only_def_find): New.
	* loop-doloop.cc (doloop_condition_get): Relax conditions.
	(doloop_optimize): Add support for elementwise LoLs.

gcc/testsuite/ChangeLog:

	* gcc.target/arm/lob.h: Update framework.
	* gcc.target/arm/lob1.c: Likewise.
	* gcc.target/arm/lob6.c: Likewise.
	* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
	* gcc.target/arm/mve/dlstp-int16x8.c: New test.
	* gcc.target/arm/mve/dlstp-int32x4.c: New test.
	* gcc.target/arm/mve/dlstp-int64x2.c: New test.
	* gcc.target/arm/mve/dlstp-int8x16.c: New test.
	* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.
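
For readers less familiar with tail predication, the element-wise counting
described in item 1 can be modelled in scalar C++ roughly as below.  This is
only an illustrative sketch and not code from the patch: the function name
`tail_predicated_add` and its parameters are hypothetical, `lanes` stands in
for the vctp lane count (16, 8, 4 or 2), and the inner loop merely imitates
what the predicate would do in hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of a tail-predicated loop (not GCC code).
// The counter holds the number of elements still to be processed and
// decrements by `lanes` per iteration, rather than counting iterations
// and decrementing by 1.
static std::size_t
tail_predicated_add (const std::vector<int> &a, const std::vector<int> &b,
                     std::vector<int> &out, int lanes)
{
  assert (lanes > 0);  // Assumption: a positive lane count.
  std::size_t iterations = 0;
  std::size_t base = 0;
  for (long remaining = static_cast<long> (a.size ()); remaining > 0;
       remaining -= lanes)
    {
      // Models the predicate: only the first min (lanes, remaining)
      // lanes are active, so the tail never touches out-of-range data.
      long active = std::min<long> (lanes, remaining);
      for (long lane = 0; lane < active; ++lane)
        out[base + lane] = a[base + lane] + b[base + lane];
      base += lanes;
      ++iterations;
    }
  return iterations;
}
```

With ten elements and four lanes this model runs three iterations, which
matches (10 + 4 - 1) / 4, and the last iteration enables only two lanes.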
commit 4f6b2c886257c4e265fe255145f2ecb17510501b
Author: Stam Markianos-Wright
Date:   Tue Oct 18 17:42:56 2022 +0100

    arm: Add support for MVE Tail-Predicated Low Overhead Loops

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 61fcd671437..b4f6538c991 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -64,8 +64,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class arm_mode_base_reg_class (machine_mode);
 
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index c3e731b8982..760635da914 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34416,9 +34422,8 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
      having as single predecessor and successor the body of the loop
      itself.  Only simple loops with a single basic block as body are
@@ -34427,8 +34432,923 @@ arm_target_insn_ok_for_lob (rtx insn)
   return single_succ_p (bb)
     && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx x)
+{
+  if (GET_CODE (x) == SET && GET_CODE (XEXP (x, 1)) == UNSPEC
+      && (XINT (XEXP (x, 1), 1) == VCTP || XINT (XEXP (x, 1), 1) == VCTP_M))
+    {
+      switch (GET_MODE (XEXP (x, 1)))
+        {
+          case V16BImode:
+            return 16;
+          case V8BImode:
+            return 8;
+          case V4BImode:
+            return 4;
+          case V2QImode:
+            return 2;
+          default:
+            break;
+        }
+    }
+  return 0;
+}
+
+/* Check if an insn requires the use of the VPR_REG, if it does, return the
+   sub-rtx of the VPR_REG.  The `type` argument controls whether
+   this function should:
+   * For type == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR_REG.
+   * For type == 1, only check the input operands.
+   * For type == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && (recog_data.operand_type[op] == OP_OUT
+                        || recog_data.operand_type[op] == OP_INOUT))
+        continue;
+      else if (type == 2 && (recog_data.operand_type[op] == OP_IN
+                             || recog_data.operand_type[op] == OP_INOUT))
+        continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+         identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+        {
+          const operand_alternative *op_alt
+            = &recog_op_alt[alt * n_operands];
+          /* Fetch the reg_class for each entry and check it against the
+           * VPR_REG reg_class.  */
+          if (alternative_class (op_alt, op) != VPR_REG)
+            requires_vpr = false;
+        }
+      /* If all alternatives of the insn require the VPR reg for this operand,
+         it means that either this is a VPR-generating instruction, like a vctp,
+         vcmp, etc., or it is a VPT-predicated instruction.  Return the subrtx
+         of the VPR reg operand.  */
+      if (requires_vpr)
+        return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with type == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with type == 2, so return
+   something only if the VPR reg is the return value, an output of, or is
+   clobbered by the insn.  */
+
+static rtx
+ALWAYS_INLINE
+arm_get_required_vpr_reg_ret_val (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 2);
+}
+
+/* Scan the basic block of a loop body for a vctp instruction.  If there is
+   at least one vctp instruction, return the first rtx_insn *.  */
+
+static rtx_insn *
+arm_mve_get_loop_vctp (basic_block bb)
+{
+  rtx_insn *insn = BB_HEAD (bb);
+
+  /* Now scan through all the instruction patterns and pick out the VCTP
+     instruction.  We require arm_get_required_vpr_reg_param to be false
+     to make sure we pick up a VCTP, rather than a VCTP_M.  */
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      if (arm_get_required_vpr_reg_ret_val (insn)
+          && (arm_mve_get_vctp_lanes (PATTERN (insn)) != 0)
+          && !arm_get_required_vpr_reg_param (insn))
+        return insn;
+  return NULL;
+}
+
+/* Return true if an insn is an MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than vpr_reg.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+                                                          rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+      || (MVE_VPT_PREDICATED_INSN_P (insn)
+          && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+          && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+    return true;
+  else
+    return false;
+}
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+                                                    rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+      && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+      && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+    return true;
+  else
+    return false;
+}
+
+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where they were DEF-ed, etc., recursively) were affected by implicit VPT
+   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
+   Having such implicit predication on an unpredicated insn isn't in itself
+   an error, because the output of that insn might then be used in a correctly
+   predicated store insn, where the disabled lanes will be ignored.  To verify
+   this we later call arm_mve_check_df_chain_fwd_for_implic_predic_impact,
+   which will check the DF chains forward to see if any implicitly-predicated
+   operand gets used in an improper way.  */
+
+static bool
+arm_mve_check_df_chain_back_for_implic_predic (rtx_insn *insn,
+                                               rtx vctp_vpr_generated)
+{
+  basic_block body = BLOCK_FOR_INSN (insn);
+  /* If we've traced back to the loop vctp, then this operand must be the VPR
+     reg and is safe.  */
+  if (insn != arm_mve_get_loop_vctp (body))
+    {
+      /* The circumstances under which an instruction is affected by "implicit
+         predication" are as follows:
+          * It is an UNPREDICATED_INSN_P:
+            * That loads/stores from/to memory.
+            * Where any one of its operands is an MVE vector from outside the
+              loop body bb.
+         Or:
+          * Any of its operands, recursively backwards, are affected.  */
+      if (MVE_VPT_UNPREDICATED_INSN_P (insn))
+        {
+          /* Find if this is a load or a store insn.  */
+          extract_insn (insn);
+          int n_operands = recog_data.n_operands;
+          for (int op = 0; op < n_operands; op++)
+            if (mve_memory_operand (recog_data.operand[op],
+                                    GET_MODE (recog_data.operand[op])))
+              return true;
+        }
+
+      df_ref insn_uses = NULL;
+      FOR_EACH_INSN_USE (insn_uses, insn)
+        {
+          /* If the insn is unpredicated, the operand is in the input reg set
+             to the basic block and is an MVE vector, consider it unsafe.
+          */
+          if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+              && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
+              && REGNO_REG_SET_P (df_get_live_in (body), DF_REF_REGNO
+                                  (insn_uses)))
+            return true;
+
+          /* Starting from the current insn, scan backwards through the insn
+             chain until BB_HEAD: "for each insn in the BB prior to the current".
+          */
+          rtx_insn *prev_insn = NULL;
+          for (prev_insn = insn;
+               prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
+               prev_insn = PREV_INSN (prev_insn))
+            {
+              /* Look at all the DEFs of that previous insn: if one of them is on
+                 the same REG as our current insn, then recurse in order to check
+                 that insn's USEs.  If any of these insns return true as
+                 MVE_VPT_UNPREDICATED_INSN_Ps, then the whole chain is affected
+                 by the change in behaviour from being placed in dlstp/letp loop.
+ */ + df_ref prev_insn_defs = NULL; + FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn) + { + if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs) + && insn != prev_insn + && body == BLOCK_FOR_INSN (prev_insn) + && !arm_mve_vec_insn_is_predicated_with_this_predicate + (insn, vctp_vpr_generated) + && arm_mve_check_df_chain_back_for_implic_predic + (prev_insn, vctp_vpr_generated)) + return true; + } + } + } + } + return false; +} + +/* If we have identified that the current DEF will be modified + by such implicit predication, scan through all the + insns that USE it and bail out if any one is outside the + current basic block (i.e. the reg is live after the loop) + or if any are store insns that are unpredicated or using a + predicate other than the loop VPR. */ + +static bool +arm_mve_check_df_chain_fwd_for_implic_predic_impact (rtx_insn *insn, + rtx vctp_vpr_generated) +{ + /* If this insn is indeed an unpredicated store to memory, bail out. */ + if (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate + (insn, vctp_vpr_generated) + && mve_memory_operand (SET_DEST (single_set (insn)), + GET_MODE (SET_DEST (single_set (insn))))) + return true; + + /* Next, scan forward to the various USEs of the DEFs in this insn. */ + df_ref insn_def = NULL; + FOR_EACH_INSN_DEF (insn_def, insn) + { + for (df_ref use = DF_REG_USE_CHAIN (DF_REF_REGNO (insn_def)); use; + use = DF_REF_NEXT_REG (use)) + { + rtx_insn *next_use_insn = DF_REF_INSN (use); + if (insn != next_use_insn && NONDEBUG_INSN_P (next_use_insn)) + { + /* If the USE is outside the loop body bb, or it is inside, but + is an unpredicated store to memory. 
*/ + if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn) + || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate + (next_use_insn, vctp_vpr_generated) + && mve_memory_operand + (SET_DEST (single_set (next_use_insn)), + GET_MODE (SET_DEST (single_set (next_use_insn)))))) + return true; + } + } + } + return false; +} + +static bool +arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step) +{ + /* Ok, we now know the loop starts from zero and increments by one. + Now just show that the max value of the counter came from an + appropriate ASHIFTRT expr of the correct amount. */ + basic_block pre_loop_bb = body->prev_bb; + while (pre_loop_bb && BB_END (pre_loop_bb) + && !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg))) + pre_loop_bb = pre_loop_bb->prev_bb; + + df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)); + rtx counter_max_last_set; + if (counter_max_last_def) + counter_max_last_set = PATTERN (DF_REF_INSN (counter_max_last_def)); + else + return false; + + /* If we encounter a simple SET from a REG, follow it through. */ + if (GET_CODE (counter_max_last_set) == SET + && REG_P (XEXP (counter_max_last_set, 1))) + return arm_mve_check_reg_origin_is_num_elems + (pre_loop_bb, XEXP (counter_max_last_set, 1), vctp_step); + + if (GET_CODE (XEXP (counter_max_last_set, 1)) == ASHIFTRT + && CONST_INT_P (XEXP (XEXP (counter_max_last_set, 1), 1)) + && ((1 << INTVAL (XEXP (XEXP (counter_max_last_set, 1), 1))) + == abs (INTVAL (vctp_step)))) + return true; + + return false; +} + +/* If we have identified the loop to have an incrementing counter, we need to + make sure that it increments by 1 and that the loop is structured correctly: + * The counter starts from 0 + * The counter terminates at (num_of_elem + num_of_lanes - 1) / num_of_lanes + * The vctp insn uses a reg that decrements appropriately in each iteration. 
+*/ + +static bool +arm_mve_dlstp_check_inc_counter (basic_block body, rtx_insn* vctp_insn, + rtx condconst, rtx condcount) +{ + rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0); + /* The loop latch has to be empty. When compiling all the known MVE LoLs in + user applications, none of those with incrementing counters had any real + insns in the loop latch. As such, this function has only been tested with + an empty latch and may misbehave or ICE if we somehow get here with an + increment in the latch, so, for sanity, error out early. */ + rtx_insn *dec_insn = BB_END (body->loop_father->latch); + if (NONDEBUG_INSN_P (dec_insn)) + gcc_unreachable (); + + class rtx_iv vctp_reg_iv; + /* For loops of type B) the loop counter is independent of the decrement + of the reg used in the vctp_insn. So run iv analysis on that reg. This + has to succeed for such loops to be supported. */ + if (!iv_analyze (vctp_insn, as_a <scalar_int_mode> (GET_MODE (vctp_reg)), + vctp_reg, &vctp_reg_iv)) + return false; + + /* Find where both of those are modified in the loop body bb. */ + rtx condcount_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find + (body, REGNO (condcount)))); + rtx vctp_reg_set = PATTERN (DF_REF_INSN (df_bb_regno_only_def_find + (body, REGNO (vctp_reg)))); + if (!vctp_reg_set || !condcount_reg_set) + return false; + + if (REG_P (condcount) && REG_P (condconst)) + { + /* First we need to prove that the loop is going 0..condconst with an + inc of 1 in each iteration. */ + if (GET_CODE (XEXP (condcount_reg_set, 1)) == PLUS + && CONST_INT_P (XEXP (XEXP (condcount_reg_set, 1), 1)) + && INTVAL (XEXP (XEXP (condcount_reg_set, 1), 1)) == 1) + { + /* Check that the counter did indeed start from zero. 
*/ + rtx counter_orig_set; + counter_orig_set = XEXP (PATTERN + (DF_REF_INSN + (DF_REF_NEXT_REG + (DF_REG_DEF_CHAIN + (REGNO + (XEXP (condcount_reg_set, 0)))))), 1); + if (!CONST_INT_P (counter_orig_set) + || (INTVAL (counter_orig_set) != 0)) + return false; + /* And finally check that the target value of the counter, condconst + is of the correct shape. */ + if (!arm_mve_check_reg_origin_is_num_elems (body, condconst, vctp_reg_iv.step)) + return false; + } + else + return false; + } + else + return false; + + /* Extract the decrementnum of the vctp reg. */ + int decrementnum = abs (INTVAL (vctp_reg_iv.step)); + /* Ensure it matches the number of lanes of the vctp instruction. */ + if (decrementnum != arm_mve_get_vctp_lanes (PATTERN (vctp_insn))) + return false; + + /* Everything looks valid. */ + return true; +} + +static bool +arm_mve_dlstp_check_dec_counter (basic_block body, rtx_insn* vctp_insn, + rtx condconst, rtx condcount) +{ + rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0); + class rtx_iv vctp_reg_iv; + int decrementnum; + /* For decrementing loops of type A), the counter is usually present in the + loop latch. Here we simply need to verify that this counter is the same + reg that is also used in the vctp_insn and that it is not otherwise + modified. */ + rtx_insn *dec_insn = BB_END (body->loop_father->latch); + /* If not in the loop latch, try to find the decrement in the loop body. */ + if (!NONDEBUG_INSN_P (dec_insn)) + { + df_ref temp = df_bb_regno_only_def_find (body, REGNO (condcount)); + /* If we haven't been able to find the decrement, bail out. */ + if (!temp) + return false; + dec_insn = DF_REF_INSN (temp); + } + + /* Next, ensure that it is a PLUS of the form: + (set (reg a) (plus (reg a) (const_int))) + where (reg a) is the same as condcount. 
*/ + if (GET_CODE (XEXP (PATTERN (dec_insn), 1)) == PLUS + && REGNO (XEXP (PATTERN (dec_insn), 0)) + == REGNO (XEXP (XEXP (PATTERN (dec_insn), 1), 0)) + && REGNO (XEXP (PATTERN (dec_insn), 0)) == REGNO (condcount)) + decrementnum = abs (INTVAL (XEXP (XEXP (PATTERN (dec_insn), 1), 1))); + else + return false; + + /* Ok, so we now know the loop decrement. If it is a 1, then we need to + look at the loop vctp_reg and verify that it also decrements correctly. + Then, we need to establish that the starting value of the loop decrement + originates from the starting value of the vctp decrement. */ + if (decrementnum == 1) + { + class rtx_iv vctp_reg_iv; + /* The loop counter is found to be independent of the decrement + of the reg used in the vctp_insn, again. Ensure that IV analysis + succeeds and check the step. */ + if (!iv_analyze (vctp_insn, as_a <scalar_int_mode> (GET_MODE (vctp_reg)), + vctp_reg, &vctp_reg_iv)) + return false; + /* Ensure it matches the number of lanes of the vctp instruction. */ + if (abs (INTVAL (vctp_reg_iv.step)) + != arm_mve_get_vctp_lanes (PATTERN (vctp_insn))) + return false; + if (!arm_mve_check_reg_origin_is_num_elems (body, condcount, vctp_reg_iv.step)) + return false; + } + /* If the decrements are the same, then the situation is simple: either they + are also the same reg, which is safe, or they are different registers, in + which case make sure that there is only a simple SET from one to the + other inside the loop. */ + else if (decrementnum == arm_mve_get_vctp_lanes (PATTERN (vctp_insn))) + { + if (REGNO (condcount) != REGNO (vctp_reg)) + { + /* It wasn't the same reg, but it could be behind a + (set (vctp_reg) (condcount)), so instead find where + the VCTP insn is DEF'd inside the loop. */ + rtx vctp_reg_set = + PATTERN (DF_REF_INSN (df_bb_regno_only_def_find + (body, REGNO (vctp_reg)))); + /* This must just be a simple SET from the condcount. 
*/ + if (GET_CODE (vctp_reg_set) != SET || !REG_P (XEXP (vctp_reg_set, 1)) + || REGNO (XEXP (vctp_reg_set, 1)) != REGNO (condcount)) + return false; + } + } + else + return false; + + /* We now only need to find out that the loop terminates with a LE + zero condition. If condconst is a const_int, then this is easy. + If it's a REG, look at the last condition+jump in a bb before + the loop, because that usually will have a branch jumping over + the loop body. */ + if (CONST_INT_P (condconst) + && !(INTVAL (condconst) == 0 && JUMP_P (BB_END (body)) + && GET_CODE (XEXP (PATTERN (BB_END (body)), 1)) == IF_THEN_ELSE + && (GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == NE + || GET_CODE (XEXP (XEXP (PATTERN (BB_END (body)), 1), 0)) == GT))) + return false; + else if (REG_P (condconst)) + { + basic_block pre_loop_bb = body; + while (pre_loop_bb->prev_bb && BB_END (pre_loop_bb->prev_bb) + && !JUMP_P (BB_END (pre_loop_bb->prev_bb))) + pre_loop_bb = pre_loop_bb->prev_bb; + if (pre_loop_bb && BB_END (pre_loop_bb)) + pre_loop_bb = pre_loop_bb->prev_bb; + else + return false; + rtx initial_compare = NULL_RTX; + if (!(prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb)) + && INSN_P (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))))) + return false; + else + initial_compare + = PATTERN (prev_nonnote_nondebug_insn_bb (BB_END (pre_loop_bb))); + if (!(initial_compare && GET_CODE (initial_compare) == SET + && cc_register (XEXP (initial_compare, 0), VOIDmode) + && GET_CODE (XEXP (initial_compare, 1)) == COMPARE + && CONST_INT_P (XEXP (XEXP (initial_compare, 1), 1)) + && INTVAL (XEXP (XEXP (initial_compare, 1), 1)) == 0)) + return false; + + /* Usually this is a LE condition, but it can also just be a GT or an EQ + condition (if the value is unsigned or the compiler knows it's not negative). */ + rtx_insn *loop_jumpover = BB_END (pre_loop_bb); + if (!(JUMP_P (loop_jumpover) + && GET_CODE (XEXP (PATTERN (loop_jumpover), 1)) == IF_THEN_ELSE + && (GET_CODE (XEXP (XEXP (PATTERN 
(loop_jumpover), 1), 0)) == LE + || GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == GT + || GET_CODE (XEXP (XEXP (PATTERN (loop_jumpover), 1), 0)) == EQ))) + return false; + } + + /* Everything looks valid. */ + return true; +} + +static bool +arm_mve_loop_valid_for_dlstp (basic_block body) +{ + /* Doloop can only be done "elementwise" with predicated dlstp/letp if it + contains a VCTP on the number of elements processed by the loop. + Find the VCTP predicate generation inside the loop body BB. */ + rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body); + if (!vctp_insn) + return false; + + /* There are only two types of loops that can be turned into dlstp/letp + loops: + A) Loops of the form: + while (num_of_elem > 0) + { + p = vctp (num_of_elem) + n -= num_of_lanes; + } + B) Loops of the form: + int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes + for (i = 0; i < num_of_iters; i++) + { + p = vctp (num_of_elem) + n -= num_of_lanes; + } + + Then, depending on the type of loop above we will need to do + different sets of checks. */ + iv_analysis_loop_init (body->loop_father); + + /* In order to find out if the loop is of type A or B above look for the + loop counter: it will either be incrementing by one per iteration or + it will be decrementing by num_of_lanes. We can find the loop counter + in the condition at the end of the loop. */ + rtx_insn *loop_cond = prev_nonnote_nondebug_insn_bb (BB_END (body)); + gcc_assert (cc_register (XEXP (PATTERN (loop_cond), 0), VOIDmode) + && GET_CODE (XEXP (PATTERN (loop_cond), 1)) == COMPARE); + /* The operands in the condition: Try to identify which one is the + constant and which is the counter and run IV analysis on the latter. 
*/ + rtx cond_arg_1 = XEXP (XEXP (PATTERN (loop_cond), 1), 0); + rtx cond_arg_2 = XEXP (XEXP (PATTERN (loop_cond), 1), 1); + + rtx loop_cond_constant; + rtx loop_counter; + class rtx_iv cond_counter_iv, cond_temp_iv; + + if (CONST_INT_P (cond_arg_1)) + { + /* cond_arg_1 is the constant and cond_arg_2 is the counter. */ + loop_cond_constant = cond_arg_1; + loop_counter = cond_arg_2; + iv_analyze (loop_cond, as_a <scalar_int_mode> (GET_MODE (cond_arg_2)), + cond_arg_2, &cond_counter_iv); + } + else if (CONST_INT_P (cond_arg_2)) + { + /* cond_arg_2 is the constant and cond_arg_1 is the counter. */ + loop_cond_constant = cond_arg_2; + loop_counter = cond_arg_1; + iv_analyze (loop_cond, as_a <scalar_int_mode> (GET_MODE (cond_arg_1)), + cond_arg_1, &cond_counter_iv); + } + else if (REG_P (cond_arg_1) && REG_P (cond_arg_2)) + { + /* If both operands to the compare are REGs, we can safely + run IV analysis on both and then determine which is the + constant by looking at the step. + First assume cond_arg_1 is the counter. */ + loop_counter = cond_arg_1; + loop_cond_constant = cond_arg_2; + iv_analyze (loop_cond, as_a <scalar_int_mode> (GET_MODE (cond_arg_1)), + cond_arg_1, &cond_counter_iv); + iv_analyze (loop_cond, as_a <scalar_int_mode> (GET_MODE (cond_arg_2)), + cond_arg_2, &cond_temp_iv); + + if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step)) + return false; + /* Look at the steps and swap around the rtx's if needed. Error out if + one of them cannot be identified as constant. 
*/ + if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0) + return false; + if (INTVAL (cond_counter_iv.step) == 0 && INTVAL (cond_temp_iv.step) != 0) + { + loop_counter = cond_arg_2; + loop_cond_constant = cond_arg_1; + cond_counter_iv = cond_temp_iv; + } + } + else + return false; + + if (!REG_P (loop_counter)) + return false; + if (!(REG_P (loop_cond_constant) || CONST_INT_P (loop_cond_constant))) + return false; + + /* Now we have extracted the IV step of the loop counter, call the + appropriate checking function. 
*/ + if (INTVAL (cond_counter_iv.step) > 0) + return arm_mve_dlstp_check_inc_counter (body, vctp_insn, + loop_cond_constant, loop_counter); + else if (INTVAL (cond_counter_iv.step) < 0) + return arm_mve_dlstp_check_dec_counter (body, vctp_insn, + loop_cond_constant, loop_counter); + else + return false; +} + +/* Predict whether the given loop in gimple will be transformed in the RTL + doloop_optimize pass. */ + +static bool +arm_predict_doloop_p (struct loop *loop) +{ + gcc_assert (loop); + /* On arm, targetm.can_use_doloop_p is actually + can_use_doloop_if_innermost. Ensure the loop is innermost, + it is valid as per arm_target_bb_ok_for_lob, and the + correct architecture flags are enabled. */ + if (!(TARGET_32BIT && TARGET_HAVE_LOB && optimize > 0)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Predict doloop failure due to" + " target architecture or optimisation flags.\n"); + return false; + } + else if (loop->inner != NULL) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Predict doloop failure due to" + " loop nesting.\n"); + return false; + } + else if (!arm_target_bb_ok_for_lob (loop->header->next_bb)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Predict doloop failure due to" + " loop bb complexity.\n"); + return false; + } + + return true; +} + +/* Implement targetm.loop_unroll_adjust. Use this to block unrolling of loops + that may later be turned into MVE Tail Predicated Low Overhead Loops. The + performance benefit of an MVE LoL is likely to be much higher than that of + the unrolling. */ + +unsigned +arm_loop_unroll_adjust (unsigned nunroll, struct loop *loop) +{ + if (TARGET_THUMB2 && TARGET_HAVE_LOB && TARGET_HAVE_MVE + && arm_target_bb_ok_for_lob (loop->header->next_bb) + && arm_mve_loop_valid_for_dlstp (loop->header)) + return 0; + else + return nunroll; +} + +/* Function to handle emitting a VPT-unpredicated version of a VPT-predicated + insn to a sequence. 
*/ + +static bool +arm_emit_mve_unpredicated_insn_to_seq (rtx_insn* insn) +{ + rtx insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn); + int new_icode = get_attr_mve_unpredicated_insn (insn); + if (!in_sequence_p () + || !MVE_VPT_PREDICATED_INSN_P (insn) + || (!insn_vpr_reg_operand) + || (!new_icode)) + return false; + + extract_insn (insn); + rtx arr[8]; + int j = 0; + + /* When transforming a VPT-predicated instruction + into its unpredicated equivalent we need to drop + the VPR operand and we may need to also drop a + merge "vuninit" input operand, depending on the + instruction pattern. Here ensure that we have at + most a two-operand difference between the two + instructions. */ + int n_operands_diff + = recog_data.n_operands - insn_data[new_icode].n_operands; + if (!(n_operands_diff > 0 && n_operands_diff <= 2)) + return false; + + /* Then, loop through the operands of the predicated + instruction, and retain the ones that map to the + unpredicated instruction. */ + for (int i = 0; i < recog_data.n_operands; i++) + { + /* Ignore the VPR and, if needed, the vuninit + operand. */ + if (insn_vpr_reg_operand == recog_data.operand[i] + || (n_operands_diff == 2 + && !strcmp (recog_data.constraints[i], "0"))) + continue; + else + { + arr[j] = recog_data.operand[i]; + j++; + } + } + + /* Finally, emit the unpredicated instruction. 
*/ + switch (j) + { + case 1: + emit_insn (GEN_FCN (new_icode) (arr[0])); + break; + case 2: + emit_insn (GEN_FCN (new_icode) (arr[0], arr[1])); + break; + case 3: + emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2])); + break; + case 4: + emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], + arr[3])); + break; + case 5: + emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3], + arr[4])); + break; + case 6: + emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3], + arr[4], arr[5])); + break; + case 7: + emit_insn (GEN_FCN (new_icode) (arr[0], arr[1], arr[2], arr[3], + arr[4], arr[5], arr[6])); + break; + default: + gcc_unreachable (); + } + return true; +} + +/* Attempt to transform the loop contents of loop basic block from VPT + predicated insns into unpredicated insns for a dlstp/letp loop. */ + +rtx +arm_attempt_dlstp_transform (rtx label) +{ + basic_block body = BLOCK_FOR_INSN (label)->prev_bb; + + /* Ensure that the bb is within a loop that has all required metadata. */ + if (!body->loop_father || !body->loop_father->header + || !body->loop_father->simple_loop_desc) + return GEN_INT (1); + + rtx_insn *vctp_insn = arm_mve_get_loop_vctp (body); + if (!vctp_insn || !arm_mve_loop_valid_for_dlstp (body)) + return GEN_INT (1); + rtx vctp_reg = XVECEXP (XEXP (PATTERN (vctp_insn), 1), 0, 0); + /* decrementnum is already known to be valid at this point. */ + int decrementnum = arm_mve_get_vctp_lanes + (PATTERN (arm_mve_get_loop_vctp (body))); + + rtx_insn *insn = 0; + rtx_insn *cur_insn = 0; + rtx_insn *seq; + rtx vctp_vpr_generated = NULL_RTX; + + /* Scan through the insns in the loop bb and emit the transformed bb + insns to a sequence. 
*/ + start_sequence (); + FOR_BB_INSNS (body, insn) + { + if (GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn)) + continue; + else if (NOTE_P (insn)) + emit_note ((enum insn_note)NOTE_KIND (insn)); + else if (DEBUG_INSN_P (insn)) + emit_debug_insn (PATTERN (insn)); + else if (!INSN_P (insn)) + { + end_sequence (); + return GEN_INT (1); + } + /* When we find the vctp instruction: This may be followed by + a zero-extend insn to SImode. If it is, then save the + zero-extended REG into vctp_vpr_generated. If there is no + zero-extend, then store the raw output of the vctp. + For any VPT-predicated instructions we need to ensure that + the VPR they use is the same as the one given here and + they often consume the output of a subreg of the SImode + zero-extended VPR-reg. As a result, comparing against the + output of the zero-extend is more likely to succeed. + This code also guarantees to us that the vctp comes before + any instructions that use the VPR within the loop, for the + dlstp/letp transform to succeed. 
*/ + else if (insn == vctp_insn) + { + rtx_insn *next_use1 = NULL; + df_ref use; + for (use = DF_REG_USE_CHAIN + (DF_REF_REGNO (DF_INSN_INFO_DEFS + (DF_INSN_INFO_GET (insn)))); + use; use = DF_REF_NEXT_REG (use)) + if (!next_use1 && NONDEBUG_INSN_P (DF_REF_INSN (use))) + next_use1 = DF_REF_INSN (use); + + if (GET_CODE (SET_SRC (single_set (next_use1))) == ZERO_EXTEND) + { + rtx_insn *next_use2 = NULL; + for (use = DF_REG_USE_CHAIN + (DF_REF_REGNO (DF_INSN_INFO_DEFS + (DF_INSN_INFO_GET (next_use1)))); + use; use = DF_REF_NEXT_REG (use)) + if (!next_use2 && NONDEBUG_INSN_P (DF_REF_INSN (use))) + next_use2 = DF_REF_INSN (use); + + if (GET_CODE (SET_SRC (single_set (next_use2))) == SUBREG) + vctp_vpr_generated = XEXP (PATTERN (next_use2), 0); + } + + if (!vctp_vpr_generated || !REG_P (vctp_vpr_generated) + || !VALID_MVE_PRED_MODE (GET_MODE (vctp_vpr_generated))) + { + end_sequence (); + return GEN_INT (1); + } + continue; + } + /* If the insn pattern requires the use of the VPR value from the + vctp as an input parameter. */ + else if (arm_mve_vec_insn_is_predicated_with_this_predicate (insn, vctp_vpr_generated)) + { + bool success = arm_emit_mve_unpredicated_insn_to_seq (insn); + if (!success) + { + end_sequence (); + return GEN_INT (1); + } + } + /* If the insn isn't VPT predicated on vctp_vpr_generated, we need to + make sure that it is still valid within the dlstp/letp loop. */ + else + { + /* None of the registers USE-d by the instruction can be the VPR + vctp_vpr_generated. This blocks the optimisation if there are any + instructions that use the optimised-out VPR value in any way + other than as a VPT block predicate. */ + df_ref insn_uses = NULL; + FOR_EACH_INSN_USE (insn_uses, insn) + { + if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses))) + { + end_sequence (); + return GEN_INT (1); + } + } + /* If within the loop we have an MVE vector instruction that is + unpredicated, the dlstp/letp looping will add implicit + predication to it. 
This will result in a change in behaviour + of the instruction, so we need to find out if any instructions + that feed into the current instruction were implicitly + predicated. */ + if (arm_mve_check_df_chain_back_for_implic_predic + (insn, vctp_vpr_generated)) + { + if (arm_mve_check_df_chain_fwd_for_implic_predic_impact + (insn, vctp_vpr_generated)) + { + end_sequence (); + return GEN_INT (1); + } + } + emit_insn (PATTERN (insn)); + } + } + seq = get_insns (); + end_sequence (); + + /* Re-write the entire BB contents with the transformed + sequence. */ + FOR_BB_INSNS_SAFE (body, insn, cur_insn) + if (!(GET_CODE (insn) == CODE_LABEL || NOTE_INSN_BASIC_BLOCK_P (insn))) + delete_insn (insn); + for (insn = seq; NEXT_INSN (insn); insn = NEXT_INSN (insn)) + if (NOTE_P (insn)) + emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body)); + else if (DEBUG_INSN_P (insn)) + emit_debug_insn_after (PATTERN (insn), BB_END (body)); + else + emit_insn_after (PATTERN (insn), BB_END (body)); + + emit_jump_insn_after (PATTERN (insn), BB_END (body)); + /* The transformation has succeeded, so now modify the "count" + (a.k.a. niter_expr) for the middle-end. 
*/ + simple_loop_desc (body->loop_father)->niter_expr = vctp_reg; + return GEN_INT (decrementnum); } #if CHECKING_P diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md index 597c1dae640..835caa06fad 100644 --- a/gcc/config/arm/iterators.md +++ b/gcc/config/arm/iterators.md @@ -2599,6 +2599,9 @@ (define_int_attr mrrc [(VUNSPEC_MRRC "mrrc") (VUNSPEC_MRRC2 "mrrc2")]) (define_int_attr MRRC [(VUNSPEC_MRRC "MRRC") (VUNSPEC_MRRC2 "MRRC2")]) +(define_int_attr mode1 [(DLSTP8 "8") (DLSTP16 "16") (DLSTP32 "32") + (DLSTP64 "64")]) + (define_int_attr opsuffix [(UNSPEC_DOT_S "s8") (UNSPEC_DOT_U "u8") (UNSPEC_DOT_US "s8") @@ -2841,6 +2844,8 @@ (define_int_iterator VSHLCQ_M [VSHLCQ_M_S VSHLCQ_M_U]) (define_int_iterator VQSHLUQ_M_N [VQSHLUQ_M_N_S]) (define_int_iterator VQSHLUQ_N [VQSHLUQ_N_S]) +(define_int_iterator DLSTP [DLSTP8 DLSTP16 DLSTP32 + DLSTP64]) ;; Define iterators for VCMLA operations (define_int_iterator VCMLA_OP [UNSPEC_VCMLA diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md index 74b8af8d57e..a81720f4aa2 100644 --- a/gcc/config/arm/mve.md +++ b/gcc/config/arm/mve.md @@ -7194,3 +7194,41 @@ } } ) + +;; Originally expanded by 'predicated_doloop_end'. +;; In the rare situation where the branch is too far, we do also need to +;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration. 
+(define_insn "*predicated_doloop_end_internal" + [(set (pc) + (if_then_else + (ge (plus:SI (reg:SI LR_REGNUM) + (match_operand:SI 0 "const_int_operand" "")) + (const_int 0)) + (label_ref (match_operand 1 "" "")) + (pc))) + (set (reg:SI LR_REGNUM) + (plus:SI (reg:SI LR_REGNUM) (match_dup 0))) + (clobber (reg:CC CC_REGNUM)) + (clobber (match_scratch:SI 2 "=r"))] + "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2" + { + if (get_attr_length (insn) == 4) + return "letp\t%|lr, %l1"; + else + return "subs\t%|lr, #%n0\n\tbgt\t%l1\n\tvmrs\t%2, FPSCR\n\torr\t%2, %2, #0x40000\n\tand\t%2, %2, 0xFFFCFFFF\n\tvmsr\tFPSCR, %2;"; + } + [(set (attr "length") + (if_then_else + (ltu (minus (pc) (match_dup 1)) (const_int 1024)) + (const_int 4) + (const_int 6))) + (set_attr "type" "branch")]) + +(define_insn "dlstp_insn" + [ + (set (reg:SI LR_REGNUM) + (unspec:SI [(match_operand:SI 0 "s_register_operand" "r")] + DLSTP)) + ] + "TARGET_32BIT && TARGET_HAVE_LOB && TARGET_HAVE_MVE && TARGET_THUMB2" + "dlstp.\t%|lr, %0") diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md index e1e013befa7..2094e9c4641 100644 --- a/gcc/config/arm/thumb2.md +++ b/gcc/config/arm/thumb2.md @@ -1613,7 +1613,7 @@ (use (match_operand 1 "" ""))] ; label "TARGET_32BIT" " - { +{ /* Currently SMS relies on the do-loop pattern to recognize loops where (1) the control part consists of all insns defining and/or using a certain 'count' register and (2) the loop count can be @@ -1623,41 +1623,67 @@ Also used to implement the low over head loops feature, which is part of the Armv8.1-M Mainline Low Overhead Branch (LOB) extension. */ - if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB)) - { - rtx s0; - rtx bcomp; - rtx loc_ref; - rtx cc_reg; - rtx insn; - rtx cmp; - - if (GET_MODE (operands[0]) != SImode) - FAIL; - - s0 = operands [0]; - - /* Low over head loop instructions require the first operand to be LR. 
*/ - if (TARGET_HAVE_LOB && arm_target_insn_ok_for_lob (operands [1])) - s0 = gen_rtx_REG (SImode, LR_REGNUM); - - if (TARGET_THUMB2) - insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1))); - else - insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1))); - - cmp = XVECEXP (PATTERN (insn), 0, 0); - cc_reg = SET_DEST (cmp); - bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx); - loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]); - emit_jump_insn (gen_rtx_SET (pc_rtx, - gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp, - loc_ref, pc_rtx))); - DONE; - } - else - FAIL; - }") + if (optimize > 0 && (flag_modulo_sched || TARGET_HAVE_LOB)) + { + rtx s0; + rtx bcomp; + rtx loc_ref; + rtx cc_reg; + rtx insn; + rtx cmp; + rtx decrement_num; + + if (GET_MODE (operands[0]) != SImode) + FAIL; + + s0 = operands[0]; + + if (TARGET_HAVE_LOB && arm_target_bb_ok_for_lob (BLOCK_FOR_INSN (operands[1]))) + { + s0 = gen_rtx_REG (SImode, LR_REGNUM); + + /* If we have a compatible MVE target, try and analyse the loop + contents to determine if we can use predicated dlstp/letp + looping. */ + if (TARGET_HAVE_MVE && TARGET_THUMB2 + && (decrement_num = arm_attempt_dlstp_transform (operands[1])) + && (INTVAL (decrement_num) != 1)) + { + insn = emit_insn + (gen_thumb2_addsi3_compare0 + (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num))))); + cmp = XVECEXP (PATTERN (insn), 0, 0); + cc_reg = SET_DEST (cmp); + bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx); + loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]); + emit_jump_insn (gen_rtx_SET (pc_rtx, + gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp, + loc_ref, pc_rtx))); + DONE; + } + + /* Otherwise, try standard decrement-by-one dls/le looping. 
*/ + if (TARGET_THUMB2) + insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, + GEN_INT (-1))); + else + insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1))); + + cmp = XVECEXP (PATTERN (insn), 0, 0); + cc_reg = SET_DEST (cmp); + bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx); + loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]); + emit_jump_insn (gen_rtx_SET (pc_rtx, + gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp, + loc_ref, pc_rtx))); + DONE; + } + else + FAIL; + } + else + FAIL; +}") (define_insn "*clear_apsr" [(unspec_volatile:SI [(const_int 0)] VUNSPEC_CLRM_APSR) @@ -1755,7 +1781,37 @@ { if (REGNO (operands[0]) == LR_REGNUM) { - emit_insn (gen_dls_insn (operands[0])); + /* Pick out the number by which we are decrementing the loop counter + in every iteration. If it's > 1, then use dlstp. */ + int const_int_dec_num + = abs (INTVAL (XEXP (XEXP (XVECEXP (PATTERN (operands[1]), 0, 1), + 1), + 1))); + switch (const_int_dec_num) + { + case 16: + emit_insn (gen_dlstp8_insn (operands[0])); + break; + + case 8: + emit_insn (gen_dlstp16_insn (operands[0])); + break; + + case 4: + emit_insn (gen_dlstp32_insn (operands[0])); + break; + + case 2: + emit_insn (gen_dlstp64_insn (operands[0])); + break; + + case 1: + emit_insn (gen_dls_insn (operands[0])); + break; + + default: + gcc_unreachable (); + } DONE; } else diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md index dccda283573..f7c4f2b1b0e 100644 --- a/gcc/config/arm/unspecs.md +++ b/gcc/config/arm/unspecs.md @@ -581,6 +581,10 @@ VADDLVQ_U VCTP VCTP_M + DLSTP8 + DLSTP16 + DLSTP32 + DLSTP64 VPNOT VCREATEQ_F VCVTQ_N_TO_F_S diff --git a/gcc/df-core.cc b/gcc/df-core.cc index d4812b04a7c..4fcc14bf790 100644 --- a/gcc/df-core.cc +++ b/gcc/df-core.cc @@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno) return NULL; } +/* Return the one and only def of REGNO within BB. If there is no def or + there are multiple defs, return NULL. 
*/ + +df_ref +df_bb_regno_only_def_find (basic_block bb, unsigned int regno) +{ + df_ref temp = df_bb_regno_first_def_find (bb, regno); + if (!temp) + return NULL; + else if (temp == df_bb_regno_last_def_find (bb, regno)) + return temp; + else + return NULL; +} + /* Finds the reference corresponding to the definition of REG in INSN. DF is the dataflow object. */ diff --git a/gcc/df.h b/gcc/df.h index 402657a7076..98623637f9c 100644 --- a/gcc/df.h +++ b/gcc/df.h @@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void); #endif extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int); extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int); +extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int); extern df_ref df_find_def (rtx_insn *, rtx); extern bool df_reg_defined (rtx_insn *, rtx); extern df_ref df_find_use (rtx_insn *, rtx); diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc index 4feb0a25ab9..f6dbd0515de 100644 --- a/gcc/loop-doloop.cc +++ b/gcc/loop-doloop.cc @@ -85,29 +85,29 @@ doloop_condition_get (rtx_insn *doloop_pat) forms: 1) (parallel [(set (pc) (if_then_else (condition) - (label_ref (label)) - (pc))) - (set (reg) (plus (reg) (const_int -1))) - (additional clobbers and uses)]) + (label_ref (label)) + (pc))) + (set (reg) (plus (reg) (const_int -n))) + (additional clobbers and uses)]) The branch must be the first entry of the parallel (also required by jump.cc), and the second entry of the parallel must be a set of the loop counter register. Some targets (IA-64) wrap the set of the loop counter in an if_then_else too. - 2) (set (reg) (plus (reg) (const_int -1)) - (set (pc) (if_then_else (reg != 0) - (label_ref (label)) - (pc))). + 2) (set (reg) (plus (reg) (const_int -n)) + (set (pc) (if_then_else (reg != 0) + (label_ref (label)) + (pc))). 
Some targets (ARM) do the comparison before the branch, as in the following form: - 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0))) - (set (reg) (plus (reg) (const_int -1)))]) - (set (pc) (if_then_else (cc == NE) - (label_ref (label)) - (pc))) */ + 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -n), 0))) + (set (reg) (plus (reg) (const_int -n)))]) + (set (pc) (if_then_else (cc == NE) + (label_ref (label)) + (pc))) */ pattern = PATTERN (doloop_pat); @@ -143,7 +143,7 @@ doloop_condition_get (rtx_insn *doloop_pat) || GET_CODE (cmp_arg1) != PLUS) return 0; reg_orig = XEXP (cmp_arg1, 0); - if (XEXP (cmp_arg1, 1) != GEN_INT (-1) + if (!CONST_INT_P (XEXP (cmp_arg1, 1)) || !REG_P (reg_orig)) return 0; cc_reg = SET_DEST (cmp_orig); @@ -156,7 +156,8 @@ doloop_condition_get (rtx_insn *doloop_pat) { /* We expect the condition to be of the form (reg != 0) */ cond = XEXP (SET_SRC (cmp), 0); - if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx) + if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE) + || XEXP (cond, 1) != const0_rtx) return 0; } } @@ -173,14 +174,14 @@ doloop_condition_get (rtx_insn *doloop_pat) if (! REG_P (reg)) return 0; - /* Check if something = (plus (reg) (const_int -1)). + /* Check if something = (plus (reg) (const_int -n)). On IA-64, this decrement is wrapped in an if_then_else. */ inc_src = SET_SRC (inc); if (GET_CODE (inc_src) == IF_THEN_ELSE) inc_src = XEXP (inc_src, 1); if (GET_CODE (inc_src) != PLUS || XEXP (inc_src, 0) != reg - || XEXP (inc_src, 1) != constm1_rtx) + || !CONST_INT_P (XEXP (inc_src, 1))) return 0; /* Check for (set (pc) (if_then_else (condition) @@ -211,42 +212,49 @@ doloop_condition_get (rtx_insn *doloop_pat) || (GET_CODE (XEXP (condition, 0)) == PLUS && XEXP (XEXP (condition, 0), 0) == reg)) { - if (GET_CODE (pattern) != PARALLEL) /* For the second form we expect: - (set (reg) (plus (reg) (const_int -1)) - (set (pc) (if_then_else (reg != 0) - (label_ref (label)) - (pc))). 
+ (set (reg) (plus (reg) (const_int -n)) + (set (pc) (if_then_else (reg != 0) + (label_ref (label)) + (pc))). - is equivalent to the following: + If n == 1, that is equivalent to the following: - (parallel [(set (pc) (if_then_else (reg != 1) - (label_ref (label)) - (pc))) - (set (reg) (plus (reg) (const_int -1))) - (additional clobbers and uses)]) + (parallel [(set (pc) (if_then_else (reg != 1) + (label_ref (label)) + (pc))) + (set (reg) (plus (reg) (const_int -1))) + (additional clobbers and uses)]) - For the third form we expect: + For the third form we expect: - (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0)) - (set (reg) (plus (reg) (const_int -1)))]) - (set (pc) (if_then_else (cc == NE) - (label_ref (label)) - (pc))) + (parallel [(set (cc) (compare ((plus (reg) (const_int -n)), 0)) + (set (reg) (plus (reg) (const_int -n)))]) + (set (pc) (if_then_else (cc == NE) + (label_ref (label)) + (pc))) - which is equivalent to the following: + which, for n == 1, is also equivalent to the following: - (parallel [(set (cc) (compare (reg, 1)) - (set (reg) (plus (reg) (const_int -1))) - (set (pc) (if_then_else (NE == cc) - (label_ref (label)) - (pc))))]) + (parallel [(set (cc) (compare (reg, 1)) + (set (reg) (plus (reg) (const_int -1))) + (set (pc) (if_then_else (NE == cc) + (label_ref (label)) + (pc))))]) - So we return the second form instead for the two cases. + So we return the second form instead for the two cases. + For the "elementwise" form, where the decrement isn't -1, + the counter may step past the final value, so use GE instead of NE. 
*/ - condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx); + if (GET_CODE (pattern) != PARALLEL) + { + if (INTVAL (XEXP (inc_src, 1)) != -1) + condition = gen_rtx_fmt_ee (GE, VOIDmode, inc_src, const0_rtx); + else + condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx); + } return condition; } @@ -685,17 +693,6 @@ doloop_optimize (class loop *loop) return false; } - max_cost - = COSTS_N_INSNS (param_max_iterations_computation_cost); - if (set_src_cost (desc->niter_expr, mode, optimize_loop_for_speed_p (loop)) - > max_cost) - { - if (dump_file) - fprintf (dump_file, - "Doloop: number of iterations too costly to compute.\n"); - return false; - } - if (desc->const_iter) iterations = widest_int::from (rtx_mode_t (desc->niter_expr, mode), UNSIGNED); @@ -716,11 +713,24 @@ doloop_optimize (class loop *loop) /* Generate looping insn. If the pattern FAILs then give up trying to modify the loop since there is some aspect the back-end does - not like. */ - count = copy_rtx (desc->niter_expr); + not like. If this succeeds, there is a chance that the loop + desc->niter_expr has been altered by the backend, so only extract + that data after the gen_doloop_end. 
*/ start_label = block_label (desc->in_edge->dest); doloop_reg = gen_reg_rtx (mode); rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label); + count = copy_rtx (desc->niter_expr); + + max_cost + = COSTS_N_INSNS (param_max_iterations_computation_cost); + if (set_src_cost (count, mode, optimize_loop_for_speed_p (loop)) + > max_cost) + { + if (dump_file) + fprintf (dump_file, + "Doloop: number of iterations too costly to compute.\n"); + return false; + } word_mode_size = GET_MODE_PRECISION (word_mode); word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1; diff --git a/gcc/testsuite/gcc.target/arm/lob.h b/gcc/testsuite/gcc.target/arm/lob.h index feaae7cc899..3941fe7a8b6 100644 --- a/gcc/testsuite/gcc.target/arm/lob.h +++ b/gcc/testsuite/gcc.target/arm/lob.h @@ -1,15 +1,131 @@ #include - +#include /* Common code for lob tests. */ #define NO_LOB asm volatile ("@ clobber lr" : : : "lr" ) -#define N 10000 +#define N 100 + +static void +reset_data (int *a, int *b, int *c, int x) +{ + memset (a, -1, x * sizeof (*a)); + memset (b, -1, x * sizeof (*b)); + memset (c, 0, x * sizeof (*c)); +} + +static void +reset_data8 (int8_t *a, int8_t *b, int8_t *c, int x) +{ + memset (a, -1, x * sizeof (*a)); + memset (b, -1, x * sizeof (*b)); + memset (c, 0, x * sizeof (*c)); +} + +static void +reset_data16 (int16_t *a, int16_t *b, int16_t *c, int x) +{ + memset (a, -1, x * sizeof (*a)); + memset (b, -1, x * sizeof (*b)); + memset (c, 0, x * sizeof (*c)); +} + +static void +reset_data32 (int32_t *a, int32_t *b, int32_t *c, int x) +{ + memset (a, -1, x * sizeof (*a)); + memset (b, -1, x * sizeof (*b)); + memset (c, 0, x * sizeof (*c)); +} + +static void +reset_data64 (int64_t *a, int64_t *c, int x) +{ + memset (a, -1, x * sizeof (*a)); + memset (c, 0, x * sizeof (*c)); +} + +static void +check_plus (int *a, int *b, int *c, int x) +{ + for (int i = 0; i < N; i++) + { + NO_LOB; + if (i < x) + { + if (c[i] != (a[i] + b[i])) abort (); + } + else + { + if (c[i] != 
0) abort (); + } + } +} + +static void +check_plus8 (int8_t *a, int8_t *b, int8_t *c, int x) +{ + for (int i = 0; i < N; i++) + { + NO_LOB; + if (i < x) + { + if (c[i] != (a[i] + b[i])) abort (); + } + else + { + if (c[i] != 0) abort (); + } + } +} + +static void +check_plus16 (int16_t *a, int16_t *b, int16_t *c, int x) +{ + for (int i = 0; i < N; i++) + { + NO_LOB; + if (i < x) + { + if (c[i] != (a[i] + b[i])) abort (); + } + else + { + if (c[i] != 0) abort (); + } + } +} + +static void +check_plus32 (int32_t *a, int32_t *b, int32_t *c, int x) +{ + for (int i = 0; i < N; i++) + { + NO_LOB; + if (i < x) + { + if (c[i] != (a[i] + b[i])) abort (); + } + else + { + if (c[i] != 0) abort (); + } + } +} static void -reset_data (int *a, int *b, int *c) +check_memcpy64 (int64_t *a, int64_t *c, int x) { - memset (a, -1, N * sizeof (*a)); - memset (b, -1, N * sizeof (*b)); - memset (c, -1, N * sizeof (*c)); + for (int i = 0; i < N; i++) + { + NO_LOB; + if (i < x) + { + if (c[i] != a[i]) abort (); + } + else + { + if (c[i] != 0) abort (); + } + } } diff --git a/gcc/testsuite/gcc.target/arm/lob1.c b/gcc/testsuite/gcc.target/arm/lob1.c index ba5c82cd55c..c8ce653a5c3 100644 --- a/gcc/testsuite/gcc.target/arm/lob1.c +++ b/gcc/testsuite/gcc.target/arm/lob1.c @@ -54,29 +54,18 @@ loop3 (int *a, int *b, int *c) } while (i < N); } -void -check (int *a, int *b, int *c) -{ - for (int i = 0; i < N; i++) - { - NO_LOB; - if (c[i] != a[i] + b[i]) - abort (); - } -} - int main (void) { - reset_data (a, b, c); + reset_data (a, b, c, N); loop1 (a, b ,c); - check (a, b ,c); - reset_data (a, b, c); + check_plus (a, b, c, N); + reset_data (a, b, c, N); loop2 (a, b ,c); - check (a, b ,c); - reset_data (a, b, c); + check_plus (a, b, c, N); + reset_data (a, b, c, N); loop3 (a, b ,c); - check (a, b ,c); + check_plus (a, b, c, N); return 0; } diff --git a/gcc/testsuite/gcc.target/arm/lob6.c b/gcc/testsuite/gcc.target/arm/lob6.c index 17b6124295e..4fe116e2c2b 100644 --- 
a/gcc/testsuite/gcc.target/arm/lob6.c +++ b/gcc/testsuite/gcc.target/arm/lob6.c @@ -79,14 +79,14 @@ check (void) int main (void) { - reset_data (a1, b1, c1); - reset_data (a2, b2, c2); + reset_data (a1, b1, c1, N); + reset_data (a2, b2, c2, N); loop1 (a1, b1, c1); ref1 (a2, b2, c2); check (); - reset_data (a1, b1, c1); - reset_data (a2, b2, c2); + reset_data (a1, b1, c1, N); + reset_data (a2, b2, c2, N); loop2 (a1, b1, c1); ref2 (a2, b2, c2); check (); diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c new file mode 100644 index 00000000000..fe573ebb749 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm.c @@ -0,0 +1,442 @@ +/* { dg-do compile { target { arm*-*-* } } } */ +/* { dg-require-effective-target arm_v8_1m_mve_ok } */ +/* { dg-options "-O3 -save-temps" } */ +/* { dg-add-options arm_v8_1m_mve } */ + +#include + +#define IMM 5 + +#define TEST_COMPILE_IN_DLSTP_TERNARY(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED) \ +void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *b, TYPE##BITS##_t *c, int n) \ +{ \ + while (n > 0) \ + { \ + mve_pred16_t p = vctp##BITS##q (n); \ + TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p); \ + TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p); \ + TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (va, vb, p); \ + vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p); \ + c += LANES; \ + a += LANES; \ + b += LANES; \ + n -= LANES; \ + } \ +} + +#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY(BITS, LANES, LDRSTRYTPE, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_TERNARY (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED) + +#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY(NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (8, 16, b, NAME, PRED) \ 
+TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (16, 8, h, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY (32, 4, w, NAME, PRED) + + +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vaddq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vmulq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vsubq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vhaddq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY (vorrq, _x) + + +#define TEST_COMPILE_IN_DLSTP_TERNARY_M(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED) \ +void test_##NAME##PRED##_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *b, TYPE##BITS##_t *c, int n) \ +{ \ + while (n > 0) \ + { \ + mve_pred16_t p = vctp##BITS##q (n); \ + TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p); \ + TYPE##BITS##x##LANES##_t vb = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (b, p); \ + TYPE##BITS##x##LANES##_t vc = NAME##PRED##_##SIGN##BITS (__inactive, va, vb, p); \ + vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p); \ + c += LANES; \ + a += LANES; \ + b += LANES; \ + n -= LANES; \ + } \ +} + +#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M(BITS, LANES, LDRSTRYTPE, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_TERNARY_M (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED) + +#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M(NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (8, 16, b, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (16, 8, h, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M (32, 4, w, NAME, PRED) + + +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vaddq, _m) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vmulq, _m) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vsubq, _m) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vhaddq, _m) 
+TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M (vorrq, _m) + +#define TEST_COMPILE_IN_DLSTP_TERNARY_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED) \ +void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##_t *a, TYPE##BITS##_t *c, int n) \ +{ \ + while (n > 0) \ + { \ + mve_pred16_t p = vctp##BITS##q (n); \ + TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p); \ + TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (va, IMM, p); \ + vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p); \ + c += LANES; \ + a += LANES; \ + n -= LANES; \ + } \ +} + +#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N(BITS, LANES, LDRSTRYTPE, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_TERNARY_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED) + +#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N(NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (8, 16, b, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (16, 8, h, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_N (32, 4, w, NAME, PRED) + +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vaddq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vmulq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vsubq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vhaddq, _x) + +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vbrsrq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshlq, _x) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_N (vshrq, _x) + +#define TEST_COMPILE_IN_DLSTP_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, TYPE, SIGN, NAME, PRED) \ +void test_##NAME##PRED##_n_##SIGN##BITS (TYPE##BITS##x##LANES##_t __inactive, TYPE##BITS##_t *a, TYPE##BITS##_t *c, int n) \ +{ \ + while (n > 0) \ + { \ + mve_pred16_t p = vctp##BITS##q (n); \ + TYPE##BITS##x##LANES##_t va = vldr##LDRSTRYTPE##q_z_##SIGN##BITS (a, p); \ 
+ TYPE##BITS##x##LANES##_t vc = NAME##PRED##_n_##SIGN##BITS (__inactive, va, IMM, p); \ + vstr##LDRSTRYTPE##q_p_##SIGN##BITS (c, vc, p); \ + c += LANES; \ + a += LANES; \ + n -= LANES; \ + } \ +} + +#define TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N(BITS, LANES, LDRSTRYTPE, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, int, s, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_TERNARY_M_N (BITS, LANES, LDRSTRYTPE, uint, u, NAME, PRED) + +#define TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N(NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (8, 16, b, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (16, 8, h, NAME, PRED) \ +TEST_COMPILE_IN_DLSTP_SIGNED_UNSIGNED_TERNARY_M_N (32, 4, w, NAME, PRED) + +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vaddq, _m) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vmulq, _m) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vsubq, _m) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vhaddq, _m) + +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vbrsrq, _m) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshlq, _m) +TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY_M_N (vshrq, _m) + +/* Now test some more configurations. */ + +/* Using a >=1 condition. */ +void test1 (int32_t *a, int32_t *b, int32_t *c, int n) +{ + while (n >= 1) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_z_s32 (b, p); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + vstrwq_p_s32 (c, vc, p); + c+=4; + a+=4; + b+=4; + n-=4; + } +} + +/* Test a for loop format of decrementing to zero */ +int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7}; +void test2 (int32_t *b, int num_elems) +{ + for (int i = num_elems; i > 0; i-= 4) + { + mve_pred16_t p = vctp32q (i); + int32x4_t va = vldrwq_z_s32 (&(a[i]), p); + vstrwq_p_s32 (b + i, va, p); + } +} + +/* Iteration counter counting up to num_iter. 
*/ +void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n) +{ + int num_iter = (n + 15)/16; + for (int i = 0; i < num_iter; i++) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_x_u8 (va, vb, p); + vstrbq_p_u8 (c, vc, p); + n-=16; + } +} + +/* Using an unpredicated arithmetic instruction within the loop. */ +void test4 (uint8_t *a, uint8_t *b, uint8_t *c, uint8_t *d, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_u8 (b); + /* Is affected by implicit predication, because vb also + came from an unpredicated load, but there is no functional + problem, because the result is used in a predicated store. */ + uint8x16_t vc = vaddq_u8 (va, vb); + uint8x16_t vd = vaddq_x_u8 (va, vb, p); + vstrbq_p_u8 (c, vc, p); + vstrbq_p_u8 (d, vd, p); + n-=16; + } +} + +/* Using a different VPR value for one instruction in the loop. */ +void test5 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_z_s32 (b, p1); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Generating and using another VPR value in the loop, with a vctp. + The doloop logic will always try to do the transform on the first + vctp it encounters, so this is still expected to work. 
*/ +void test6 (int32_t *a, int32_t *b, int32_t *c, int n, int g) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + mve_pred16_t p1 = vctp32q (g); + int32x4_t vb = vldrwq_z_s32 (b, p1); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Generating and using a different VPR value in the loop, with a vctp, + but this time the p1 will also change in every loop (still fine) */ +void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + mve_pred16_t p1 = vctp32q (g); + int32x4_t vb = vldrwq_z_s32 (b, p1); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + g++; + } +} + +/* Generating and using a different VPR value in the loop, with a vctp_m + that is independent of the loop vctp VPR. */ +void test8 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + mve_pred16_t p2 = vctp32q_m (n, p1); + int32x4_t vb = vldrwq_z_s32 (b, p1); + int32x4_t vc = vaddq_x_s32 (va, vb, p2); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Generating and using a different VPR value in the loop, + with a vctp_m that is tied to the base vctp VPR. This + is still fine, because the vctp_m will be transformed + into a vctp and be implicitly predicated. */ +void test9 (int32_t *a, int32_t *b, int32_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + mve_pred16_t p1 = vctp32q_m (n, p); + int32x4_t vb = vldrwq_z_s32 (b, p1); + int32x4_t vc = vaddq_x_s32 (va, vb, p1); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Generating and using a different VPR value in the loop, with a vcmp. 
*/ +void test10 (int32_t *a, int32_t *b, int32_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_z_s32 (b, p); + mve_pred16_t p1 = vcmpeqq_s32 (va, vb); + int32x4_t vc = vaddq_x_s32 (va, vb, p1); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Generating and using a different VPR value in the loop, with a vcmp_m. */ +void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_z_s32 (b, p); + mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p1); + int32x4_t vc = vaddq_x_s32 (va, vb, p2); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Generating and using a different VPR value in the loop, with a vcmp_m + that is tied to the base vctp VPR (same as above, this will be turned + into a vcmp and be implicitly predicated). */ +void test12 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p1) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_z_s32 (b, p); + mve_pred16_t p2 = vcmpeqq_m_s32 (va, vb, p); + int32x4_t vc = vaddq_x_s32 (va, vb, p2); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Using an unpredicated op with a scalar output, where the result is valid + outside the bb. This is valid, because all the inputs to the unpredicated + op are correctly predicated. 
*/ +uint8_t test13 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx) +{ + uint8_t sum = 0; + while (n > 0) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p); + sum += vaddvq_u8 (vc); + a += 16; + b += 16; + n -= 16; + } + return sum; +} + +/* Same as above, but with another scalar op between the unpredicated op and + the scalar op outside the loop. */ +uint8_t test14 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx, int g) +{ + uint8_t sum = 0; + while (n > 0) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p); + sum += vaddvq_u8 (vc); + sum += g; + a += 16; + b += 16; + n -= 16; + } + return sum; +} + +/* Using an unpredicated vcmp to generate a new predicate value in the + loop and then using it in a predicated store insn. */ +void test15 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_s32 (b); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + mve_pred16_t p1 = vcmpeqq_s32 (va, vc); + vstrwq_p_s32 (c, vc, p1); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Using a predicated vcmp to generate a new predicate value in the + loop and then using it in a predicated store insn. */ +void test16 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_z_s32 (b, p); + int32x4_t vc = vaddq_s32 (va, vb); + mve_pred16_t p1 = vcmpeqq_m_s32 (va, vc, p); + vstrwq_p_s32 (c, vc, p1); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* The final number of DLSTPs currently is calculated by the number of + `TEST_COMPILE_IN_DLSTP_INTBITS_SIGNED_UNSIGNED_TERNARY.*` macros * 6 + 16. 
*/ +/* { dg-final { scan-assembler-times {\tdlstp} 160 } } */ +/* { dg-final { scan-assembler-times {\tletp} 160 } } */ diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c new file mode 100644 index 00000000000..0cdffb312b3 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c @@ -0,0 +1,68 @@ +/* { dg-do run { target { arm*-*-* } } } */ +/* { dg-require-effective-target arm_v8_1m_mve_ok } */ +/* { dg-options "-O2 -save-temps" } */ +/* { dg-add-options arm_v8_1m_mve } */ + +#include +#include +#include +#include "../lob.h" + +void __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp16q (n); + int16x8_t va = vldrhq_z_s16 (a, p); + int16x8_t vb = vldrhq_z_s16 (b, p); + int16x8_t vc = vaddq_x_s16 (va, vb, p); + vstrhq_p_s16 (c, vc, p); + c+=8; + a+=8; + b+=8; + n-=8; + } +} + +int main () +{ + int i; + int16_t temp1[N]; + int16_t temp2[N]; + int16_t temp3[N]; + reset_data16 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 0); + check_plus16 (temp1, temp2, temp3, 0); + + reset_data16 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 1); + check_plus16 (temp1, temp2, temp3, 1); + + reset_data16 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 7); + check_plus16 (temp1, temp2, temp3, 7); + + reset_data16 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 8); + check_plus16 (temp1, temp2, temp3, 8); + + reset_data16 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 9); + check_plus16 (temp1, temp2, temp3, 9); + + reset_data16 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 16); + check_plus16 (temp1, temp2, temp3, 16); + + reset_data16 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 17); + check_plus16 (temp1, temp2, temp3, 17); + + reset_data16 (temp1, temp2, temp3, N); +} + +/* { dg-final { scan-assembler-times {\tdlstp.16} 1 } } */ +/* { dg-final { scan-assembler-times {\tletp} 1 } } */ 
+/* { dg-final { scan-assembler-not "\tvctp" } } */ +/* { dg-final { scan-assembler-not "\tvpst" } } */ +/* { dg-final { scan-assembler-not "p0" } } */ diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c new file mode 100644 index 00000000000..7ff789d7650 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c @@ -0,0 +1,68 @@ +/* { dg-do run { target { arm*-*-* } } } */ +/* { dg-require-effective-target arm_v8_1m_mve_ok } */ +/* { dg-options "-O2 -save-temps" } */ +/* { dg-add-options arm_v8_1m_mve } */ + +#include +#include +#include +#include "../lob.h" + +void __attribute__ ((noinline)) test (int32_t *a, int32_t *b, int32_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_z_s32 (b, p); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + vstrwq_p_s32 (c, vc, p); + c+=4; + a+=4; + b+=4; + n-=4; + } +} + +int main () +{ + int i; + int32_t temp1[N]; + int32_t temp2[N]; + int32_t temp3[N]; + reset_data32 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 0); + check_plus32 (temp1, temp2, temp3, 0); + + reset_data32 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 1); + check_plus32 (temp1, temp2, temp3, 1); + + reset_data32 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 3); + check_plus32 (temp1, temp2, temp3, 3); + + reset_data32 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 4); + check_plus32 (temp1, temp2, temp3, 4); + + reset_data32 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 5); + check_plus32 (temp1, temp2, temp3, 5); + + reset_data32 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 8); + check_plus32 (temp1, temp2, temp3, 8); + + reset_data32 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 9); + check_plus32 (temp1, temp2, temp3, 9); + + reset_data32 (temp1, temp2, temp3, N); +} + +/* { dg-final { scan-assembler-times {\tdlstp.32} 1 } } */ +/* { dg-final { 
scan-assembler-times {\tletp} 1 } } */ +/* { dg-final { scan-assembler-not "\tvctp" } } */ +/* { dg-final { scan-assembler-not "\tvpst" } } */ +/* { dg-final { scan-assembler-not "p0" } } */ diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c new file mode 100644 index 00000000000..8065bd02469 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c @@ -0,0 +1,68 @@ +/* { dg-do run { target { arm*-*-* } } } */ +/* { dg-require-effective-target arm_v8_1m_mve_ok } */ +/* { dg-options "-O2 -save-temps" } */ +/* { dg-add-options arm_v8_1m_mve } */ + +#include +#include +#include +#include "../lob.h" + +void __attribute__ ((noinline)) test (int64_t *a, int64_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp64q (n); + int64x2_t va = vldrdq_gather_offset_z_s64 (a, vcreateq_u64 (0, 8), p); + vstrdq_scatter_offset_p_s64 (c, vcreateq_u64 (0, 8), va, p); + c+=2; + a+=2; + n-=2; + } +} + +int main () +{ + int i; + int64_t temp1[N]; + int64_t temp3[N]; + reset_data64 (temp1, temp3, N); + test (temp1, temp3, 0); + check_memcpy64 (temp1, temp3, 0); + + reset_data64 (temp1, temp3, N); + test (temp1, temp3, 1); + check_memcpy64 (temp1, temp3, 1); + + reset_data64 (temp1, temp3, N); + test (temp1, temp3, 2); + check_memcpy64 (temp1, temp3, 2); + + reset_data64 (temp1, temp3, N); + test (temp1, temp3, 3); + check_memcpy64 (temp1, temp3, 3); + + reset_data64 (temp1, temp3, N); + test (temp1, temp3, 4); + check_memcpy64 (temp1, temp3, 4); + + reset_data64 (temp1, temp3, N); + test (temp1, temp3, 5); + check_memcpy64 (temp1, temp3, 5); + + reset_data64 (temp1, temp3, N); + test (temp1, temp3, 6); + check_memcpy64 (temp1, temp3, 6); + + reset_data64 (temp1, temp3, N); + test (temp1, temp3, 7); + check_memcpy64 (temp1, temp3, 7); + + reset_data64 (temp1, temp3, N); +} + +/* { dg-final { scan-assembler-times {\tdlstp.64} 1 } } */ +/* { dg-final { scan-assembler-times {\tletp} 1 } } */ +/* { dg-final { 
scan-assembler-not "\tvctp" } } */ +/* { dg-final { scan-assembler-not "\tvpst" } } */ +/* { dg-final { scan-assembler-not "p0" } } */ diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c new file mode 100644 index 00000000000..552781001e9 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c @@ -0,0 +1,68 @@ +/* { dg-do run { target { arm*-*-* } } } */ +/* { dg-require-effective-target arm_v8_1m_mve_ok } */ +/* { dg-options "-O2 -save-temps" } */ +/* { dg-add-options arm_v8_1m_mve } */ + +#include +#include +#include +#include "../lob.h" + +void __attribute__ ((noinline)) test (int8_t *a, int8_t *b, int8_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp8q (n); + int8x16_t va = vldrbq_z_s8 (a, p); + int8x16_t vb = vldrbq_z_s8 (b, p); + int8x16_t vc = vaddq_x_s8 (va, vb, p); + vstrbq_p_s8 (c, vc, p); + c+=16; + a+=16; + b+=16; + n-=16; + } +} + +int main () +{ + int i; + int8_t temp1[N]; + int8_t temp2[N]; + int8_t temp3[N]; + reset_data8 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 0); + check_plus8 (temp1, temp2, temp3, 0); + + reset_data8 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 1); + check_plus8 (temp1, temp2, temp3, 1); + + reset_data8 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 15); + check_plus8 (temp1, temp2, temp3, 15); + + reset_data8 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 16); + check_plus8 (temp1, temp2, temp3, 16); + + reset_data8 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 17); + check_plus8 (temp1, temp2, temp3, 17); + + reset_data8 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 32); + check_plus8 (temp1, temp2, temp3, 32); + + reset_data8 (temp1, temp2, temp3, N); + test (temp1, temp2, temp3, 33); + check_plus8 (temp1, temp2, temp3, 33); + + reset_data8 (temp1, temp2, temp3, N); +} + +/* { dg-final { scan-assembler-times {\tdlstp.8} 1 } } */ +/* { dg-final { scan-assembler-times {\tletp} 1 } } */ 
+/* { dg-final { scan-assembler-not "\tvctp" } } */ +/* { dg-final { scan-assembler-not "\tvpst" } } */ +/* { dg-final { scan-assembler-not "p0" } } */ diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c new file mode 100644 index 00000000000..70e374dce81 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c @@ -0,0 +1,203 @@ +/* { dg-do compile { target { arm*-*-* } } } */ +/* { dg-require-effective-target arm_v8_1m_mve_ok } */ +/* { dg-options "-O3 -save-temps" } */ +/* { dg-add-options arm_v8_1m_mve } */ + +#include + +/* Terminating on a non-zero number of elements. */ +void test0 (uint8_t *a, uint8_t *b, uint8_t *c, int n) +{ + while (n > 1) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_x_u8 (va, vb, p); + vstrbq_p_u8 (c, vc, p); + n -= 16; + } +} + +/* Terminating on n >= 0. */ +void test1 (uint8_t *a, uint8_t *b, uint8_t *c, int n) +{ + while (n >= 0) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_x_u8 (va, vb, p); + vstrbq_p_u8 (c, vc, p); + n -= 16; + } +} + +/* Similar, terminating on a non-zero number of elements, but in a for loop + format. */ +int32_t a[] = {0, 1, 2, 3, 4, 5, 6, 7}; +void test2 (int32_t *b, int num_elems) +{ + for (int i = num_elems; i >= 2; i-= 4) + { + mve_pred16_t p = vctp32q (i); + int32x4_t va = vldrwq_z_s32 (&(a[i]), p); + vstrwq_p_s32 (b + i, va, p); + } +} + +/* Iteration counter counting up to num_iter, with a non-zero starting num. 
*/ +void test3 (uint8_t *a, uint8_t *b, uint8_t *c, int n) +{ + int num_iter = (n + 15)/16; + for (int i = 1; i < num_iter; i++) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_x_u8 (va, vb, p); + vstrbq_p_u8 (c, vc, p); + n -= 16; + } +} + +/* Iteration counter counting up to num_iter, with a larger increment */ +void test4 (uint8_t *a, uint8_t *b, uint8_t *c, int n) +{ + int num_iter = (n + 15)/16; + for (int i = 0; i < num_iter; i+=2) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_x_u8 (va, vb, p); + vstrbq_p_u8 (c, vc, p); + n -= 16; + } +} + +/* Using an unpredicated store instruction within the loop. */ +void test5 (uint8_t *a, uint8_t *b, uint8_t *c, uint8_t *d, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_u8 (va, vb); + uint8x16_t vd = vaddq_x_u8 (va, vb, p); + vstrbq_u8 (d, vd); + n -= 16; + } +} + +/* Using an unpredicated store outside the loop. */ +void test6 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx) +{ + while (n > 0) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_z_u8 (b, p); + uint8x16_t vc = vaddq_m_u8 (vx, va, vb, p); + vx = vaddq_u8 (vx, vc); + a += 16; + b += 16; + n -= 16; + } + vstrbq_u8 (c, vx); +} + +/* Using a VPR that gets modified within the loop. */ +void test9 (int32_t *a, int32_t *b, int32_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_z_s32 (a, p); + p++; + int32x4_t vb = vldrwq_z_s32 (b, p); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Using a VPR that gets re-generated within the loop. 
*/ +void test10 (int32_t *a, int32_t *b, int32_t *c, int n) +{ + mve_pred16_t p = vctp32q (n); + while (n > 0) + { + int32x4_t va = vldrwq_z_s32 (a, p); + p = vctp32q (n); + int32x4_t vb = vldrwq_z_s32 (b, p); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Using vctp32q_m instead of vctp32q. */ +void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q_m (n, p0); + int32x4_t va = vldrwq_z_s32 (a, p); + int32x4_t vb = vldrwq_z_s32 (b, p); + int32x4_t vc = vaddq_x_s32 (va, vb, p); + vstrwq_p_s32 (c, vc, p); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} + +/* Using an unpredicated op with a scalar output, where the result is valid + outside the bb. This is invalid, because one of the inputs to the + unpredicated op is also unpredicated. */ +uint8_t test12 (uint8_t *a, uint8_t *b, uint8_t *c, int n, uint8x16_t vx) +{ + uint8_t sum = 0; + while (n > 0) + { + mve_pred16_t p = vctp8q (n); + uint8x16_t va = vldrbq_z_u8 (a, p); + uint8x16_t vb = vldrbq_u8 (b); + uint8x16_t vc = vaddq_u8 (va, vb); + sum += vaddvq_u8 (vc); + a += 16; + b += 16; + n -= 16; + } + return sum; +} + +/* Using an unpredicated vcmp to generate a new predicate value in the + loop and then using that VPR to predicate a store insn. */ +void test13 (int32_t *a, int32_t *b, int32x4_t vc, int32_t *c, int n) +{ + while (n > 0) + { + mve_pred16_t p = vctp32q (n); + int32x4_t va = vldrwq_s32 (a); + int32x4_t vb = vldrwq_z_s32 (b, p); + int32x4_t vc = vaddq_s32 (va, vb); + mve_pred16_t p1 = vcmpeqq_s32 (va, vc); + vstrwq_p_s32 (c, vc, p1); + c += 4; + a += 4; + b += 4; + n -= 4; + } +} +/* { dg-final { scan-assembler-not "\tdlstp" } } */ +/* { dg-final { scan-assembler-not "\tletp" } } */