From patchwork Fri Oct 14 07:54:42 2022
X-Patchwork-Submitter: "Jiang, Haochen"
X-Patchwork-Id: 2554
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 3/6] i386: Add intrinsic for vector __bf16
Date: Fri, 14 Oct 2022 15:54:42 +0800
Message-Id: <20221014075445.7938-4-haochen.jiang@intel.com>
In-Reply-To: <20221014075445.7938-1-haochen.jiang@intel.com>
References: <20221014075445.7938-1-haochen.jiang@intel.com>
From: Haochen Jiang via Gcc-patches
Reply-To: Haochen Jiang
Cc: hongtao.liu@intel.com

From: konglin1

gcc/ChangeLog:

	* config/i386/avx512fp16intrin.h (_mm_load_sbf16): New intrinsic.
	(_mm_mask_load_sbf16): Ditto.
	(_mm_maskz_load_sbf16): Ditto.
	(_mm_mask_store_sbf16): Ditto.
	(_mm_mask_move_sbf16): Ditto.
	(_mm_maskz_move_sbf16): Ditto.
	* config/i386/avx512bf16intrin.h (_mm_setzero_pbf16): New intrinsic.
	(_mm256_setzero_pbf16): Ditto.
	(_mm512_setzero_pbf16): Ditto.
	(_mm512_undefined_pbf16): Ditto.
	(_mm512_set1_pbf16): Ditto.
	(_mm512_set_pbf16): Ditto.
	(_mm512_setr_pbf16): Ditto.
	(_mm_castpbf16_ps): Ditto.
	(_mm256_castpbf16_ps): Ditto.
	(_mm512_castpbf16_ps): Ditto.
	(_mm_castpbf16_pd): Ditto.
	(_mm256_castpbf16_pd): Ditto.
	(_mm512_castpbf16_pd): Ditto.
	(_mm_castpbf16_si128): Ditto.
	(_mm256_castpbf16_si256): Ditto.
	(_mm512_castpbf16_si512): Ditto.
	(_mm_castps_pbf16): Ditto.
	(_mm256_castps_pbf16): Ditto.
	(_mm512_castps_pbf16): Ditto.
	(_mm_castpd_pbf16): Ditto.
	(_mm256_castpd_pbf16): Ditto.
	(_mm512_castpd_pbf16): Ditto.
	(_mm_castsi128_pbf16): Ditto.
	(_mm256_castsi256_pbf16): Ditto.
	(_mm512_castsi512_pbf16): Ditto.
	(_mm256_castpbf16256_pbf16128): Ditto.
	(_mm512_castpbf16512_pbf16128): Ditto.
	(_mm512_castpbf16512_pbf16256): Ditto.
	(_mm256_castpbf16128_pbf16256): Ditto.
	(_mm512_castpbf16128_pbf16512): Ditto.
	(_mm512_castpbf16256_pbf16512): Ditto.
	(_mm256_zextpbf16128_pbf16256): Ditto.
	(_mm512_zextpbf16128_pbf16512): Ditto.
	(_mm512_zextpbf16256_pbf16512): Ditto.
	(_mm512_abs_pbf16): Ditto.
	(_mm512_load_pbf16): Ditto.
	(_mm256_load_pbf16): Ditto.
	(_mm_load_pbf16): Ditto.
	(_mm512_loadu_pbf16): Ditto.
	(_mm256_loadu_pbf16): Ditto.
	(_mm_loadu_pbf16): Ditto.
	(_mm_store_sbf16): Ditto.
	(_mm512_store_pbf16): Ditto.
	(_mm256_store_pbf16): Ditto.
	(_mm_store_pbf16): Ditto.
	(_mm512_storeu_pbf16): Ditto.
	(_mm256_storeu_pbf16): Ditto.
	(_mm_storeu_pbf16): Ditto.
	(_mm_move_sbf16): Ditto.
	(_mm512_mask_blend_pbf16): Ditto.
	(_mm512_permutex2var_pbf16): Ditto.
	(_mm512_permutexvar_pbf16): Ditto.
	(_mm512_bcstnebf16_ps): Ditto.
	(_mm512_mask_bcstnebf16_ps): Ditto.
	(_mm512_bcstnesh_ps): Ditto.
	(_mm512_mask_bcstnesh_ps): Ditto.
	(_mm512_maskz_bcstnesh_ps): Ditto.
	(_mm512_cvtne2ps_ph): Ditto.
	(_mm512_mask_cvtne2ps_ph): Ditto.
	(_mm512_cvtne_round2ps_ph): Ditto.
	(_mm512_mask_cvtne_round2ps_ph): Ditto.
	(_mm512_cvtneebf16_ps): Ditto.
	(_mm512_mask_cvtneebf16_ps): Ditto.
	(_mm512_maskz_cvtneebf16_ps): Ditto.
	(_mm512_cvtneeph_ps): Ditto.
	(_mm512_mask_cvtneeph_ps): Ditto.
	(_mm512_cvtneobf16_ps): Ditto.
	(_mm512_mask_cvtneobf16_ps): Ditto.
	(_mm512_maskz_cvtneobf16_ps): Ditto.
	(_mm512_cvtneoph_ps): Ditto.
	(_mm512_mask_cvtneoph_ps): Ditto.
	* config/i386/avx512bf16vlintrin.h (_mm_cvtsbf16_bf16): Ditto.
	(_mm256_cvtsbf16_bf16): Ditto.
	(_mm256_undefined_pbf16): Ditto.
	(_mm_undefined_pbf16): Ditto.
	(_mm_set_sbf16): Ditto.
	(_mm_set1_pbf16): Ditto.
	(_mm256_set1_pbf16): Ditto.
	(_mm_set_pbf16): Ditto.
	(_mm256_set_pbf16): Ditto.
	(_mm_setr_pbf16): Ditto.
	(_mm256_setr_pbf16): Ditto.
	(_mm256_abs_pbf16): Ditto.
	(_mm_abs_pbf16): Ditto.
	(_mm_mask_blend_pbf16): Ditto.
	(_mm256_mask_blend_pbf16): Ditto.
	(_mm_permutex2var_pbf16): Ditto.
	(_mm256_permutex2var_pbf16): Ditto.
	(_mm_permutexvar_pbf16): Ditto.
	(_mm256_permutexvar_pbf16): Ditto.
	(_mm_cvtneebf16_ps): Change bf16 mode.
	(_mm256_cvtneebf16_ps): Ditto.
	(_mm_cvtneobf16_ps): Ditto.
	(_mm256_cvtneobf16_ps): Ditto.
	(_mm_mask_cvtneebf16_ps): Ditto.
	(_mm_maskz_cvtneebf16_ps): Ditto.
	(_mm256_mask_cvtneebf16_ps): Ditto.
	(_mm256_maskz_cvtneebf16_ps): Ditto.
	(_mm_mask_cvtneobf16_ps): Ditto.
	(_mm_maskz_cvtneobf16_ps): Ditto.
	(_mm256_mask_cvtneobf16_ps): Ditto.
	(_mm256_maskz_cvtneobf16_ps): Ditto.
	* config/i386/immintrin.h: Add SSE2 dependency for the avx512bf16 headers.
---
 gcc/config/i386/avx512bf16intrin.h   | 418 +++++++++++++++++++++++++++
 gcc/config/i386/avx512bf16vlintrin.h | 177 ++++++++++++
 gcc/config/i386/avx512fp16intrin.h   |  70 +++++
 gcc/config/i386/immintrin.h          |   2 +
 4 files changed, 667 insertions(+)

diff --git a/gcc/config/i386/avx512bf16intrin.h b/gcc/config/i386/avx512bf16intrin.h
index b6e9ddad157..d09a59c1509 100644
--- a/gcc/config/i386/avx512bf16intrin.h
+++ b/gcc/config/i386/avx512bf16intrin.h
@@ -51,6 +51,424 @@ _mm_cvtsbh_ss (__bfloat16 __A)
   return __tmp.a;
 }
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_setzero_pbf16 (void)
+{
+  return (__m512bf16)(__v32bf) _mm512_setzero_ps ();
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_undefined_pbf16 (void)
+{
+  __m512bf16 __Y = __Y;
+  return __Y;
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_set1_pbf16 (__bf16 __h)
+{
+  return (__m512bf16)(__v32bf) {__h, __h, __h, __h, __h, __h, __h, __h,
+				__h, __h, __h, __h, __h, __h, __h, __h,
+				__h, __h, __h, __h, __h, __h, __h, __h,
+				__h, __h, __h, __h, __h, __h, __h, __h};
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_set_pbf16 (__bf16 __h1, __bf16 __h2, __bf16 __h3, __bf16 __h4,
+		  __bf16 __h5, __bf16 __h6, __bf16 __h7, __bf16 __h8,
+		  __bf16 __h9, __bf16 __h10, __bf16 __h11, __bf16 __h12,
+		  __bf16 __h13, __bf16 __h14, __bf16 __h15, __bf16 __h16,
+		  __bf16 __h17, __bf16 __h18, __bf16 __h19, __bf16 __h20,
+		  __bf16 __h21, __bf16 __h22, __bf16 __h23, __bf16 __h24,
+		  __bf16 __h25, __bf16 __h26, __bf16 __h27, __bf16 __h28,
+		  __bf16 __h29, __bf16 __h30, __bf16 __h31, __bf16 __h32)
+{
+  return
+    (__m512bf16)(__v32bf) {__h32, __h31, __h30, __h29, __h28, __h27, __h26,
+			   __h25, __h24, __h23, __h22, __h21, __h20, __h19,
+			   __h18, __h17, __h16, __h15, __h14, __h13, __h12,
+			   __h11, __h10, __h9, __h8, __h7, __h6, __h5,
+			   __h4, __h3, __h2, __h1};
+}
+
+#define _mm512_setr_pbf16(h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h12, \
+			  h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, \
+			  h23, h24, h25, h26, h27, h28, h29, h30, h31, h32) \
+  _mm512_set_pbf16 ((h32), (h31), (h30), (h29), (h28), (h27), (h26), (h25), \
+		    (h24), (h23), (h22), (h21), (h20), (h19), (h18), (h17), \
+		    (h16), (h15), (h14), (h13), (h12), (h11), (h10), (h9), \
+		    (h8), (h7), (h6), (h5), (h4), (h3), (h2), (h1))
+
+extern __inline __m128
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_castpbf16_ps (__m128bf16 __a)
+{
+  return (__m128) __a;
+}
+
+extern __inline __m256
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_castpbf16_ps (__m256bf16 __a)
+{
+  return (__m256) __a;
+}
+
+extern __inline __m512
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castpbf16_ps (__m512bf16 __a)
+{
+  return (__m512) __a;
+}
+
+extern __inline __m128d
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_castpbf16_pd (__m128bf16 __a)
+{
+  return (__m128d) __a;
+}
+
+extern __inline __m256d
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_castpbf16_pd (__m256bf16 __a)
+{
+  return (__m256d) __a;
+}
+
+extern __inline __m512d
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castpbf16_pd (__m512bf16 __a)
+{
+  return (__m512d) __a;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_castpbf16_si128 (__m128bf16 __a)
+{
+  return (__m128i) __a;
+}
+
+extern __inline __m256i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_castpbf16_si256 (__m256bf16 __a)
+{
+  return (__m256i) __a;
+}
+
+extern __inline __m512i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castpbf16_si512 (__m512bf16 __a)
+{
+  return (__m512i) __a;
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_castps_pbf16 (__m128 __a)
+{
+  return (__m128bf16) __a;
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_castps_pbf16 (__m256 __a)
+{
+  return (__m256bf16) __a;
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castps_pbf16 (__m512 __a)
+{
+  return (__m512bf16) __a;
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_castpd_pbf16 (__m128d __a)
+{
+  return (__m128bf16) __a;
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_castpd_pbf16 (__m256d __a)
+{
+  return (__m256bf16) __a;
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castpd_pbf16 (__m512d __a)
+{
+  return (__m512bf16) __a;
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_castsi128_pbf16 (__m128i __a)
+{
+  return (__m128bf16) __a;
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_castsi256_pbf16 (__m256i __a)
+{
+  return (__m256bf16) __a;
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castsi512_pbf16 (__m512i __a)
+{
+  return (__m512bf16) __a;
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_castpbf16256_pbf16128 (__m256bf16 __a)
+{
+  return __builtin_shufflevector (__a, __a, 0, 1, 2, 3, 4, 5, 6, 7);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castpbf16512_pbf16128 (__m512bf16 __a)
+{
+  return __builtin_shufflevector (__a, __a, 0, 1, 2, 3, 4, 5, 6, 7);
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castpbf16512_pbf16256 (__m512bf16 __a)
+{
+  return __builtin_shufflevector (__a, __a, 0, 1, 2, 3, 4, 5, 6, 7,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_castpbf16128_pbf16256 (__m128bf16 __a)
+{
+  return __builtin_shufflevector (__a, __a, 0, 1, 2, 3, 4, 5, 6, 7,
+				  -1, -1, -1, -1, -1, -1, -1, -1);
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castpbf16128_pbf16512 (__m128bf16 __a)
+{
+  return __builtin_shufflevector (__a, __a, 0, 1, 2, 3, 4, 5, 6, 7, -1, -1,
+				  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+				  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_castpbf16256_pbf16512 (__m256bf16 __a)
+{
+  return __builtin_shufflevector (__a, __a, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
+				  11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1,
+				  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_zextpbf16128_pbf16256 (__m128bf16 __A)
+{
+  return (__m256bf16) _mm256_insertf128_ps (_mm256_setzero_ps (),
+					    (__m128) __A, 0);
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_zextpbf16128_pbf16512 (__m128bf16 __A)
+{
+  return (__m512bf16) _mm512_insertf32x4 (_mm512_setzero_ps (),
+					  (__m128) __A, 0);
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_zextpbf16256_pbf16512 (__m256bf16 __A)
+{
+  return (__m512bf16) _mm512_insertf64x4 (_mm512_setzero_pd (),
+					  (__m256d) __A, 0);
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_abs_pbf16 (__m512bf16 __A)
+{
+  return
+    (__m512bf16) _mm512_and_epi32 (_mm512_set1_epi32 (0x7FFF7FFF),
+				   (__m512i) __A);
+}
+
+/* Loads with vmovsh if AVX512FP16 is enabled.  */
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_load_pbf16 (void const *__p)
+{
+  return *(const __m512bf16 *) __p;
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_load_pbf16 (void const *__p)
+{
+  return *(const __m256bf16 *) __p;
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_load_pbf16 (void const *__p)
+{
+  return *(const __m128bf16 *) __p;
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_loadu_pbf16 (void const *__p)
+{
+  struct __loadu_pbf16
+  {
+    __m512bf16_u __v;
+  } __attribute__((__packed__, __may_alias__));
+  return ((const struct __loadu_pbf16 *) __p)->__v;
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_loadu_pbf16 (void const *__p)
+{
+  struct __loadu_pbf16
+  {
+    __m256bf16_u __v;
+  } __attribute__((__packed__, __may_alias__));
+  return ((const struct __loadu_pbf16 *) __p)->__v;
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_loadu_pbf16 (void const *__p)
+{
+  struct __loadu_pbf16
+  {
+    __m128bf16_u __v;
+  } __attribute__((__packed__, __may_alias__));
+  return ((const struct __loadu_pbf16 *) __p)->__v;
+}
+
+/* Stores with vmovsh if AVX512FP16 is enabled.  */
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_store_sbf16 (void *__dp, __m128bf16 __a)
+{
+  struct __mm_store_sbf16_struct
+  {
+    __bf16 __u;
+  } __attribute__((__packed__, __may_alias__));
+  ((struct __mm_store_sbf16_struct *) __dp)->__u = __a[0];
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_store_pbf16 (void *__P, __m512bf16 __A)
+{
+  *(__m512bf16 *) __P = __A;
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_store_pbf16 (void *__P, __m256bf16 __A)
+{
+  *(__m256bf16 *) __P = __A;
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_store_pbf16 (void *__P, __m128bf16 __A)
+{
+  *(__m128bf16 *) __P = __A;
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_storeu_pbf16 (void *__P, __m512bf16 __A)
+{
+  struct __storeu_pbf16 {
+    __m512bf16_u __v;
+  } __attribute__((__packed__, __may_alias__));
+  ((struct __storeu_pbf16 *) __P)->__v = __A;
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_storeu_pbf16 (void *__P, __m256bf16 __A)
+{
+  struct __storeu_pbf16
+  {
+    __m256bf16_u __v;
+  } __attribute__((__packed__, __may_alias__));
+  ((struct __storeu_pbf16 *) __P)->__v = __A;
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_storeu_pbf16 (void *__P, __m128bf16 __A)
+{
+  struct __storeu_pbf16
+  {
+    __m128bf16_u __v;
+  } __attribute__((__packed__, __may_alias__));
+  ((struct __storeu_pbf16 *) __P)->__v = __A;
+}
+
+/* Moves with vmovsh if AVX512FP16 is enabled.  */
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_move_sbf16 (__m128bf16 __a, __m128bf16 __b)
+{
+  __a[0] = __b[0];
+  return __a;
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_mask_blend_pbf16 (__mmask32 __U, __m512bf16 __A, __m512bf16 __W)
+{
+  return (__m512bf16) __builtin_ia32_movdquhi512_mask ((__v32hi) __W,
+						       (__v32hi) __A,
+						       (__mmask32) __U);
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_permutex2var_pbf16 (__m512bf16 __A, __m512i __I, __m512bf16 __B)
+{
+  return (__m512bf16) __builtin_ia32_vpermi2varhi512_mask ((__v32hi) __A,
+							   (__v32hi) __I,
+							   (__v32hi) __B,
+							   (__mmask32)-1);
+}
+
+extern __inline __m512bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm512_permutexvar_pbf16 (__m512i __A, __m512bf16 __B)
+{
+  return (__m512bf16) __builtin_ia32_permvarhi512_mask ((__v32hi) __B,
+							(__v32hi) __A,
+							(__v32hi)
+							(_mm512_setzero_si512 ()),
+							(__mmask32)-1);
+}
+
 /* vcvtne2ps2bf16 */
 extern __inline __m512bh
diff --git a/gcc/config/i386/avx512bf16vlintrin.h b/gcc/config/i386/avx512bf16vlintrin.h
index 969335ff358..732623a94a2 100644
--- a/gcc/config/i386/avx512bf16vlintrin.h
+++ b/gcc/config/i386/avx512bf16vlintrin.h
@@ -44,6 +44,183 @@ typedef short __m256bh __attribute__ ((__vector_size__ (32), __may_alias__));
 typedef short __m128bh __attribute__ ((__vector_size__ (16), __may_alias__));
 typedef unsigned short __bfloat16;
+
+extern __inline __bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtsbf16_bf16 (__m128bf16 __a)
+{
+  return __a[0];
+}
+
+extern __inline __bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_cvtsbf16_bf16 (__m256bf16 __a)
+{
+  return __a[0];
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_undefined_pbf16 (void)
+{
+  __m256bf16 __Y = __Y;
+  return __Y;
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_undefined_pbf16 (void)
+{
+  __m128bf16 __Y = __Y;
+  return __Y;
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_setzero_pbf16 (void)
+{
+  return (__m128bf16)(__v8bf) _mm_setzero_ps ();
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_setzero_pbf16 (void)
+{
+  return (__m256bf16)(__v16bf) _mm256_setzero_ps ();
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_set_sbf16 (__bf16 bf)
+{
+  return (__v8bf)
+    __builtin_shufflevector ((__v8bf){bf, bf, bf, bf, bf, bf, bf, bf},
+			     (__v8bf) _mm_setzero_pbf16 (), 0,
+			     8, 8, 8, 8, 8, 8, 8);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_set1_pbf16 (__bf16 bf)
+{
+  return (__m128bf16)(__v8bf) {bf, bf, bf, bf, bf, bf, bf, bf};
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_set1_pbf16 (__bf16 bf)
+{
+  return (__m256bf16)(__v16bf) {bf, bf, bf, bf, bf, bf, bf, bf,
+				bf, bf, bf, bf, bf, bf, bf, bf};
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_set_pbf16 (__bf16 bf1, __bf16 bf2, __bf16 bf3, __bf16 bf4,
+	       __bf16 bf5, __bf16 bf6, __bf16 bf7, __bf16 bf8)
+{
+  return (__m128bf16)(__v8bf) {bf1, bf2, bf3, bf4, bf5, bf6, bf7, bf8};
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_set_pbf16 (__bf16 bf1, __bf16 bf2, __bf16 bf3, __bf16 bf4,
+		  __bf16 bf5, __bf16 bf6, __bf16 bf7, __bf16 bf8,
+		  __bf16 bf9, __bf16 bf10, __bf16 bf11, __bf16 bf12,
+		  __bf16 bf13, __bf16 bf14, __bf16 bf15, __bf16 bf16)
+{
+  return (__m256bf16)(__v16bf) {bf1, bf2, bf3, bf4, bf5, bf6, bf7, bf8,
+				bf9, bf10, bf11, bf12, bf13, bf14,
+				bf15, bf16};
+}
+
+#define _mm_setr_pbf16(bf1, bf2, bf3, bf4, bf5, bf6, bf7, bf8) \
+  _mm_set_pbf16 ((bf8), (bf7), (bf6), (bf5), (bf4), (bf3), (bf2), (bf1))
+
+#define _mm256_setr_pbf16(bf1, bf2, bf3, bf4, bf5, bf6, bf7, bf8, bf9, bf10, \
+			  bf11, bf12, bf13, bf14, bf15, bf16) \
+  _mm256_set_pbf16 ((bf16), (bf15), (bf14), (bf13), (bf12), (bf11), (bf10), \
+		    (bf9), (bf8), (bf7), (bf6), (bf5), (bf4), (bf3), (bf2), \
+		    (bf1))
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_abs_pbf16 (__m256bf16 __A)
+{
+  return (__m256bf16) _mm256_and_si256 (_mm256_set1_epi32 (0x7FFF7FFF),
+					(__m256i)__A);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_abs_pbf16 (__m128bf16 __A)
+{
+  return (__m128bf16) _mm_and_si128 (_mm_set1_epi32 (0x7FFF7FFF),
+				     (__m128i)__A);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mask_blend_pbf16 (__mmask8 __U, __m128bf16 __A, __m128bf16 __W)
+{
+  return (__m128bf16)
+    __builtin_ia32_movdquhi128_mask ((__v8hi) __W,
+				     (__v8hi) __A,
+				     (__mmask8) __U);
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_mask_blend_pbf16 (__mmask16 __U, __m256bf16 __A, __m256bf16 __W)
+{
+  return (__m256bf16)
+    __builtin_ia32_movdquhi256_mask ((__v16hi) __W,
+				     (__v16hi) __A,
+				     (__mmask16) __U);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_permutex2var_pbf16 (__m128bf16 __A, __m128i __I, __m128bf16 __B)
+{
+  return (__m128bf16)
+    __builtin_ia32_vpermi2varhi128_mask ((__v8hi) __A,
+					 (__v8hi) __I,
+					 (__v8hi) __B,
+					 (__mmask8) -1);
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_permutex2var_pbf16 (__m256bf16 __A, __m256i __I, __m256bf16 __B)
+{
+  return (__m256bf16) __builtin_ia32_vpermi2varhi256_mask ((__v16hi) __A,
+							   (__v16hi) __I,
+							   (__v16hi) __B,
+							   (__mmask16)-1);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_permutexvar_pbf16 (__m128i __A, __m128bf16 __B)
+{
+  return (__m128bf16) __builtin_ia32_permvarhi128_mask ((__v8hi) __B,
+							(__v8hi) __A,
+							(__v8hi)
+							(_mm_setzero_si128 ()),
+							(__mmask8) -1);
+}
+
+extern __inline __m256bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm256_permutexvar_pbf16 (__m256i __A, __m256bf16 __B)
+{
+  return (__m256bf16) __builtin_ia32_permvarhi256_mask ((__v16hi) __B,
+							(__v16hi) __A,
+							(__v16hi)
+							(_mm256_setzero_si256 ()),
+							(__mmask16) -1);
+}
 /* vcvtne2ps2bf16 */
 extern __inline __m256bh
diff --git a/gcc/config/i386/avx512fp16intrin.h b/gcc/config/i386/avx512fp16intrin.h
index 75f7475ad18..82b814abde2 100644
--- a/gcc/config/i386/avx512fp16intrin.h
+++ b/gcc/config/i386/avx512fp16intrin.h
@@ -53,6 +53,18 @@ typedef _Float16 __m256h_u __attribute__ ((__vector_size__ (32), \
 typedef _Float16 __m512h_u __attribute__ ((__vector_size__ (64), \
 					   __may_alias__, __aligned__ (1)));
+
+/* Internal data types for implementing the bf16 intrinsics. */
+typedef __bf16 __v32bf __attribute__((__vector_size__(64), __aligned__(64)));
+typedef __bf16 __m512bf16 __attribute__((__vector_size__(64), __aligned__(64)));
+typedef __bf16 __m512bf16_u __attribute__((__vector_size__(64), __aligned__(1)));
+typedef __bf16 __v8bf __attribute__((__vector_size__(16), __aligned__(16)));
+typedef __bf16 __m128bf16 __attribute__((__vector_size__(16), __aligned__(16)));
+typedef __bf16 __m128bf16_u __attribute__((__vector_size__(16), __aligned__(1)));
+typedef __bf16 __v16bf __attribute__((__vector_size__(32), __aligned__(32)));
+typedef __bf16 __m256bf16 __attribute__((__vector_size__(32), __aligned__(32)));
+typedef __bf16 __m256bf16_u __attribute__((__vector_size__(32), __aligned__(1)));
+
 extern __inline __m128h
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm_set_ph (_Float16 __A7, _Float16 __A6, _Float16 __A5,
@@ -2771,6 +2783,44 @@ _mm_mask_store_sh (_Float16 const* __A, __mmask8 __B, __m128h __C)
   __builtin_ia32_storesh_mask (__A, __C, __B);
 }
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_load_sbf16 (void const *__dp)
+{
+  return (__m128bf16)
+    __builtin_ia32_loadsh_mask ((_Float16 const*) __dp,
+				_mm_setzero_ph(),
+				(__mmask8) -1);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mask_load_sbf16 (__m128bf16 __A, __mmask8 __B, const void *__C)
+{
+  return (__m128bf16)
+    __builtin_ia32_loadsh_mask ((_Float16 const*) __C,
+				(__v8hf) __A,
+				(__mmask8) __B);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_maskz_load_sbf16 (__mmask8 __A, const void *__B)
+{
+  return (__m128bf16)
+    __builtin_ia32_loadsh_mask ((_Float16 const*) __B,
+				_mm_setzero_ph(),
+				(__mmask8) __A);
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mask_store_sbf16 (const void *__A, __mmask8 __B, __m128bf16 __C)
+{
+  __builtin_ia32_storesh_mask ((_Float16 const*) __A,
+			       (__v8hf) __C, (__mmask8) __B);
+}
+
 extern __inline __m128h
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_sh (__m128h __A, __m128h __B)
@@ -2793,6 +2843,26 @@ _mm_maskz_move_sh (__mmask8 __A, __m128h __B, __m128h __C)
   return __builtin_ia32_vmovsh_mask (__B, __C, _mm_setzero_ph (), __A);
 }
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mask_move_sbf16 (__m128bf16 __A, __mmask8 __B,
+		     __m128bf16 __C, __m128bf16 __D)
+{
+  return (__m128bf16)
+    __builtin_ia32_vmovsh_mask ((__v8hf) __C, (__v8hf) __D,
+				(__v8hf) __A, (__mmask8) __B);
+}
+
+extern __inline __m128bf16
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_maskz_move_sbf16 (__mmask8 __A, __m128bf16 __B, __m128bf16 __C)
+{
+  return (__m128bf16)
+    __builtin_ia32_vmovsh_mask ((__v8hf) __B, (__v8hf) __C,
+				_mm_setzero_ph(),
+				(__mmask8) __A);
+}
+
 /* Intrinsics vcvtph2dq. */
 extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
diff --git a/gcc/config/i386/immintrin.h b/gcc/config/i386/immintrin.h
index ddea249d09b..c62d50f1951 100644
--- a/gcc/config/i386/immintrin.h
+++ b/gcc/config/i386/immintrin.h
@@ -118,9 +118,11 @@
 #include
+#ifdef __SSE2__
 #include
 #include
+#endif
 #include
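
For reviewers who want to exercise the new vector API, here is a minimal usage
sketch.  It is not part of the patch; it assumes this series is applied and the
file is built with -mavx512bf16 (plus -mavx512vl for the 128/256-bit forms), and
the function name and values are invented for illustration only.

#include <immintrin.h>

/* Broadcast one bf16 value, take the element-wise absolute value, keep the
   lanes selected by the mask, and store the vector unaligned.  */
void
bf16_abs_masked (__bf16 __x, void *__dst, __mmask32 __m)
{
  __m512bf16 __v = _mm512_set1_pbf16 (__x);
  __m512bf16 __a = _mm512_abs_pbf16 (__v);
  /* Lanes selected by __m take __a; the remaining lanes stay zero.  */
  __m512bf16 __r = _mm512_mask_blend_pbf16 (__m, _mm512_setzero_pbf16 (), __a);
  _mm512_storeu_pbf16 (__dst, __r);
}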
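
Similarly, a sketch of the new scalar bf16 lane helpers added to
avx512fp16intrin.h (again illustrative only; these rely on the vmovsh-based
builtins, so -mavx512fp16 together with -mavx512bf16 -mavx512vl is assumed, and
the function name is made up):

#include <immintrin.h>

/* Load one bf16 element from memory (upper lanes zeroed), merge it into a
   zero vector as lane 0, and store that single element back out.  */
void
bf16_copy_scalar (const void *__src, void *__dst)
{
  __m128bf16 __a = _mm_load_sbf16 (__src);
  __m128bf16 __b = _mm_move_sbf16 (_mm_setzero_pbf16 (), __a);
  _mm_store_sbf16 (__dst, __b);
}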