diff mbox series

vect: Cost adjacent vector loads/stores together [PR111784]

Message ID	fb808aab-2d4b-0d0d-8fa2-f3693bcd4bf1@linux.ibm.com
State	Unresolved
Headers	Received-SPF: pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 2620:52:3:1:0:246e:9693:128c as permitted sender) client-ip=2620:52:3:1:0:246e:9693:128c; DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 49ED03858D33 Message-ID: <fb808aab-2d4b-0d0d-8fa2-f3693bcd4bf1@linux.ibm.com> Date: Wed, 18 Oct 2023 13:09:15 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.6.1 Content-Language: en-US From: "Kewen.Lin" <linkw@linux.ibm.com> Subject: [PATCH] vect: Cost adjacent vector loads/stores together [PR111784] To: GCC Patches <gcc-patches@gcc.gnu.org> Cc: Richard Biener <richard.guenther@gmail.com>, Richard Sandiford <richard.sandiford@arm.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit MIME-Version: 1.0 Precedence: list Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org X-getmail-retrieved-from-mailbox: INBOX
Series	vect: Cost adjacent vector loads/stores together [PR111784] \| vect: Cost adjacent vector loads/stores together [PR111784]

Checks

Context	Check	Description
snail/gcc-patch-check	warning	Git am fail log

Commit Message

Kewen.Lin Oct. 18, 2023, 5:09 a.m. UTC

  Hi,

As comments[1][2], this patch is to change the costing way
on some adjacent vector loads/stores from costing one by
one to costing them together with the total number once.

It helps to fix the exposed regression PR111784 on aarch64,
as aarch64 specific costing could make different decisions
according to the different costing ways (counting with total
number vs. counting one by one).  Based on a reduced test
case from PR111784, only considering vec_num can fix the
regression already, but vector loads/stores in regard to
ncopies are also adjacent accesses, so they are considered
as well.

btw, this patch leaves the costing on dr_explicit_realign
and dr_explicit_realign_optimized alone to make it simple.
The costing way change can cause the differences for them
since there is one costing depending on targetm.vectorize.
builtin_mask_for_load and it's costed according to the
calling times.  IIUC, these two dr_alignment_support are
mainly used for old Power? (only having 16 bytes aligned
vector load/store but no unaligned vector load/store).

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu, powerpc64-linux-gnu P{7,8,9}
and powerpc64le-linux-gnu P{8,9,10}.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630742.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630744.html

BR,
Kewen
-----
gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_store): Adjust costing way for
	adjacent vector stores, by costing them with the total number
	rather than costing them one by one.
	(vectorizable_load): Adjust costing way for adjacent vector
	loads, by costing them with the total number rather than costing
	them one by one.
---
 gcc/tree-vect-stmts.cc | 137 ++++++++++++++++++++++++++++-------------
 1 file changed, 95 insertions(+), 42 deletions(-)

--
2.31.1

Comments

Richard Sandiford Oct. 19, 2023, 10:12 p.m. UTC | #1

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi,
>
> As comments[1][2], this patch is to change the costing way
> on some adjacent vector loads/stores from costing one by
> one to costing them together with the total number once.
>
> It helps to fix the exposed regression PR111784 on aarch64,
> as aarch64 specific costing could make different decisions
> according to the different costing ways (counting with total
> number vs. counting one by one).  Based on a reduced test
> case from PR111784, only considering vec_num can fix the
> regression already, but vector loads/stores in regard to
> ncopies are also adjacent accesses, so they are considered
> as well.
>
> btw, this patch leaves the costing on dr_explicit_realign
> and dr_explicit_realign_optimized alone to make it simple.
> The costing way change can cause the differences for them
> since there is one costing depending on targetm.vectorize.
> builtin_mask_for_load and it's costed according to the
> calling times.  IIUC, these two dr_alignment_support are
> mainly used for old Power? (only having 16 bytes aligned
> vector load/store but no unaligned vector load/store).
>
> Bootstrapped and regtested on x86_64-redhat-linux,
> aarch64-linux-gnu, powerpc64-linux-gnu P{7,8,9}
> and powerpc64le-linux-gnu P{8,9,10}.
>
> Is it ok for trunk?
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630742.html
> [2] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630744.html
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* tree-vect-stmts.cc (vectorizable_store): Adjust costing way for
> 	adjacent vector stores, by costing them with the total number
> 	rather than costing them one by one.
> 	(vectorizable_load): Adjust costing way for adjacent vector
> 	loads, by costing them with the total number rather than costing
> 	them one by one.

OK.  Thanks for doing this!  Like Richard says, the way that the aarch64
cost hooks rely on the count isn't super robust, but I think it's the
best we can easily do in the circumstances.  Hopefully costing will
become much easier once the non-SLP representation goes away.

Richard

> ---
>  gcc/tree-vect-stmts.cc | 137 ++++++++++++++++++++++++++++-------------
>  1 file changed, 95 insertions(+), 42 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index b3a56498595..af134ff2bf7 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8681,6 +8681,9 @@ vectorizable_store (vec_info *vinfo,
>        alias_off = build_int_cst (ref_type, 0);
>        stmt_vec_info next_stmt_info = first_stmt_info;
>        auto_vec<tree> vec_oprnds (ncopies);
> +      /* For costing some adjacent vector stores, we'd like to cost with
> +	 the total number of them once instead of cost each one by one. */
> +      unsigned int n_adjacent_stores = 0;
>        for (g = 0; g < group_size; g++)
>  	{
>  	  running_off = offvar;
> @@ -8738,10 +8741,7 @@ vectorizable_store (vec_info *vinfo,
>  			 store to avoid ICE like 110776.  */
>  		      if (VECTOR_TYPE_P (ltype)
>  			  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> -			vect_get_store_cost (vinfo, stmt_info, 1,
> -					     alignment_support_scheme,
> -					     misalignment, &inside_cost,
> -					     cost_vec);
> +			n_adjacent_stores++;
>  		      else
>  			inside_cost
>  			  += record_stmt_cost (cost_vec, 1, scalar_store,
> @@ -8798,11 +8798,18 @@ vectorizable_store (vec_info *vinfo,
>  	    break;
>  	}
>
> -      if (costing_p && dump_enabled_p ())
> -	dump_printf_loc (MSG_NOTE, vect_location,
> -			 "vect_model_store_cost: inside_cost = %d, "
> -			 "prologue_cost = %d .\n",
> -			 inside_cost, prologue_cost);
> +      if (costing_p)
> +	{
> +	  if (n_adjacent_stores > 0)
> +	    vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
> +				 alignment_support_scheme, misalignment,
> +				 &inside_cost, cost_vec);
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "vect_model_store_cost: inside_cost = %d, "
> +			     "prologue_cost = %d .\n",
> +			     inside_cost, prologue_cost);
> +	}
>
>        return true;
>      }
> @@ -8909,6 +8916,9 @@ vectorizable_store (vec_info *vinfo,
>      {
>        gcc_assert (!slp && grouped_store);
>        unsigned inside_cost = 0, prologue_cost = 0;
> +      /* For costing some adjacent vector stores, we'd like to cost with
> +	 the total number of them once instead of cost each one by one. */
> +      unsigned int n_adjacent_stores = 0;
>        for (j = 0; j < ncopies; j++)
>  	{
>  	  gimple *new_stmt;
> @@ -8974,10 +8984,7 @@ vectorizable_store (vec_info *vinfo,
>
>  	  if (costing_p)
>  	    {
> -	      for (i = 0; i < vec_num; i++)
> -		vect_get_store_cost (vinfo, stmt_info, 1,
> -				     alignment_support_scheme, misalignment,
> -				     &inside_cost, cost_vec);
> +	      n_adjacent_stores += vec_num;
>  	      continue;
>  	    }
>
> @@ -9067,11 +9074,18 @@ vectorizable_store (vec_info *vinfo,
>  	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>  	}
>
> -      if (costing_p && dump_enabled_p ())
> -	dump_printf_loc (MSG_NOTE, vect_location,
> -			 "vect_model_store_cost: inside_cost = %d, "
> -			 "prologue_cost = %d .\n",
> -			 inside_cost, prologue_cost);
> +      if (costing_p)
> +	{
> +	  if (n_adjacent_stores > 0)
> +	    vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
> +				 alignment_support_scheme, misalignment,
> +				 &inside_cost, cost_vec);
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "vect_model_store_cost: inside_cost = %d, "
> +			     "prologue_cost = %d .\n",
> +			     inside_cost, prologue_cost);
> +	}
>
>        return true;
>      }
> @@ -9290,6 +9304,9 @@ vectorizable_store (vec_info *vinfo,
>  	      || memory_access_type == VMAT_CONTIGUOUS_REVERSE);
>
>    unsigned inside_cost = 0, prologue_cost = 0;
> +  /* For costing some adjacent vector stores, we'd like to cost with
> +     the total number of them once instead of cost each one by one. */
> +  unsigned int n_adjacent_stores = 0;
>    auto_vec<tree> result_chain (group_size);
>    auto_vec<tree, 1> vec_oprnds;
>    for (j = 0; j < ncopies; j++)
> @@ -9451,9 +9468,7 @@ vectorizable_store (vec_info *vinfo,
>
>  	  if (costing_p)
>  	    {
> -	      vect_get_store_cost (vinfo, stmt_info, 1,
> -				   alignment_support_scheme, misalignment,
> -				   &inside_cost, cost_vec);
> +	      n_adjacent_stores++;
>
>  	      if (!slp)
>  		{
> @@ -9623,6 +9638,11 @@ vectorizable_store (vec_info *vinfo,
>
>    if (costing_p)
>      {
> +      if (n_adjacent_stores > 0)
> +	vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
> +			     alignment_support_scheme, misalignment,
> +			     &inside_cost, cost_vec);
> +
>        /* When vectorizing a store into the function result assign
>  	 a penalty if the function returns in a multi-register location.
>  	 In this case we assume we'll end up with having to spill the
> @@ -10337,6 +10357,9 @@ vectorizable_load (vec_info *vinfo,
>        unsigned HOST_WIDE_INT
>  	elsz = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
>        unsigned int n_groups = 0;
> +      /* For costing some adjacent vector loads, we'd like to cost with
> +	 the total number of them once instead of cost each one by one. */
> +      unsigned int n_adjacent_loads = 0;
>        for (j = 0; j < ncopies; j++)
>  	{
>  	  if (nloads > 1 && !costing_p)
> @@ -10350,10 +10373,7 @@ vectorizable_load (vec_info *vinfo,
>  		     avoid ICE, see PR110776.  */
>  		  if (VECTOR_TYPE_P (ltype)
>  		      && memory_access_type != VMAT_ELEMENTWISE)
> -		    vect_get_load_cost (vinfo, stmt_info, 1,
> -					alignment_support_scheme, misalignment,
> -					false, &inside_cost, nullptr, cost_vec,
> -					cost_vec, true);
> +		    n_adjacent_loads++;
>  		  else
>  		    inside_cost += record_stmt_cost (cost_vec, 1, scalar_load,
>  						     stmt_info, 0, vect_body);
> @@ -10447,11 +10467,19 @@ vectorizable_load (vec_info *vinfo,
>  					  false, &n_perms);
>  	}
>
> -      if (costing_p && dump_enabled_p ())
> -	dump_printf_loc (MSG_NOTE, vect_location,
> -			 "vect_model_load_cost: inside_cost = %u, "
> -			 "prologue_cost = 0 .\n",
> -			 inside_cost);
> +      if (costing_p)
> +	{
> +	  if (n_adjacent_loads > 0)
> +	    vect_get_load_cost (vinfo, stmt_info, n_adjacent_loads,
> +				alignment_support_scheme, misalignment, false,
> +				&inside_cost, nullptr, cost_vec, cost_vec,
> +				true);
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "vect_model_load_cost: inside_cost = %u, "
> +			     "prologue_cost = 0 .\n",
> +			     inside_cost);
> +	}
>
>        return true;
>      }
> @@ -10756,6 +10784,9 @@ vectorizable_load (vec_info *vinfo,
>        gcc_assert (grouped_load && !slp);
>
>        unsigned int inside_cost = 0, prologue_cost = 0;
> +      /* For costing some adjacent vector loads, we'd like to cost with
> +	 the total number of them once instead of cost each one by one. */
> +      unsigned int n_adjacent_loads = 0;
>        for (j = 0; j < ncopies; j++)
>  	{
>  	  if (costing_p)
> @@ -10787,9 +10818,7 @@ vectorizable_load (vec_info *vinfo,
>  					  true);
>  		    }
>  		}
> -	      vect_get_load_cost (vinfo, stmt_info, 1, alignment_support_scheme,
> -				  misalignment, false, &inside_cost,
> -				  &prologue_cost, cost_vec, cost_vec, true);
> +	      n_adjacent_loads++;
>  	      continue;
>  	    }
>
> @@ -10891,11 +10920,19 @@ vectorizable_load (vec_info *vinfo,
>  	  *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>  	}
>
> -      if (costing_p && dump_enabled_p ())
> -	dump_printf_loc (MSG_NOTE, vect_location,
> -			 "vect_model_load_cost: inside_cost = %u, "
> -			 "prologue_cost = %u .\n",
> -			 inside_cost, prologue_cost);
> +      if (costing_p)
> +	{
> +	  if (n_adjacent_loads > 0)
> +	    vect_get_load_cost (vinfo, stmt_info, n_adjacent_loads,
> +				alignment_support_scheme, misalignment, false,
> +				&inside_cost, &prologue_cost, cost_vec,
> +				cost_vec, true);
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "vect_model_load_cost: inside_cost = %u, "
> +			     "prologue_cost = %u .\n",
> +			     inside_cost, prologue_cost);
> +	}
>
>        return true;
>      }
> @@ -11111,6 +11148,9 @@ vectorizable_load (vec_info *vinfo,
>
>    poly_uint64 group_elt = 0;
>    unsigned int inside_cost = 0, prologue_cost = 0;
> +  /* For costing some adjacent vector loads, we'd like to cost with
> +     the total number of them once instead of cost each one by one. */
> +  unsigned int n_adjacent_loads = 0;
>    for (j = 0; j < ncopies; j++)
>      {
>        /* 1. Create the vector or array pointer update chain.  */
> @@ -11505,10 +11545,18 @@ vectorizable_load (vec_info *vinfo,
>  		  || memory_access_type == VMAT_CONTIGUOUS_REVERSE
>  		  || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
>  		      && (!grouped_load || first_stmt_info_p)))
> -		vect_get_load_cost (vinfo, stmt_info, 1,
> -				    alignment_support_scheme, misalignment,
> -				    add_realign_cost, &inside_cost,
> -				    &prologue_cost, cost_vec, cost_vec, true);
> +		{
> +		  /* Leave realign cases alone to keep them simple.  */
> +		  if (alignment_support_scheme == dr_explicit_realign_optimized
> +		      || alignment_support_scheme == dr_explicit_realign)
> +		    vect_get_load_cost (vinfo, stmt_info, 1,
> +					alignment_support_scheme, misalignment,
> +					add_realign_cost, &inside_cost,
> +					&prologue_cost, cost_vec, cost_vec,
> +					true);
> +		  else
> +		    n_adjacent_loads++;
> +		}
>  	    }
>  	  else
>  	    {
> @@ -11679,6 +11727,11 @@ vectorizable_load (vec_info *vinfo,
>        gcc_assert (memory_access_type == VMAT_CONTIGUOUS
>  		  || memory_access_type == VMAT_CONTIGUOUS_REVERSE
>  		  || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
> +      if (n_adjacent_loads > 0)
> +	vect_get_load_cost (vinfo, stmt_info, n_adjacent_loads,
> +			    alignment_support_scheme, misalignment, false,
> +			    &inside_cost, &prologue_cost, cost_vec, cost_vec,
> +			    true);
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_NOTE, vect_location,
>  			 "vect_model_load_cost: inside_cost = %u, "
> --
> 2.31.1

Kewen.Lin Oct. 23, 2023, 6:15 a.m. UTC | #2

Hi Richard,

on 2023/10/20 06:12, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi,
>>
>> As comments[1][2], this patch is to change the costing way
>> on some adjacent vector loads/stores from costing one by
>> one to costing them together with the total number once.
>>
>> It helps to fix the exposed regression PR111784 on aarch64,
>> as aarch64 specific costing could make different decisions
>> according to the different costing ways (counting with total
>> number vs. counting one by one).  Based on a reduced test
>> case from PR111784, only considering vec_num can fix the
>> regression already, but vector loads/stores in regard to
>> ncopies are also adjacent accesses, so they are considered
>> as well.
>>
>> btw, this patch leaves the costing on dr_explicit_realign
>> and dr_explicit_realign_optimized alone to make it simple.
>> The costing way change can cause the differences for them
>> since there is one costing depending on targetm.vectorize.
>> builtin_mask_for_load and it's costed according to the
>> calling times.  IIUC, these two dr_alignment_support are
>> mainly used for old Power? (only having 16 bytes aligned
>> vector load/store but no unaligned vector load/store).
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu, powerpc64-linux-gnu P{7,8,9}
>> and powerpc64le-linux-gnu P{8,9,10}.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630742.html
>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630744.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* tree-vect-stmts.cc (vectorizable_store): Adjust costing way for
>> 	adjacent vector stores, by costing them with the total number
>> 	rather than costing them one by one.
>> 	(vectorizable_load): Adjust costing way for adjacent vector
>> 	loads, by costing them with the total number rather than costing
>> 	them one by one.
> 
> OK.  Thanks for doing this!  Like Richard says, the way that the aarch64
> cost hooks rely on the count isn't super robust, but I think it's the
> best we can easily do in the circumstances.  Hopefully costing will
> become much easier once the non-SLP representation goes away.

Looking forward to that, thanks for the review!  Committed in r14-4842.

BR,
Kewen

diff mbox series

Patch

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index b3a56498595..af134ff2bf7 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8681,6 +8681,9 @@  vectorizable_store (vec_info *vinfo,
       alias_off = build_int_cst (ref_type, 0);
       stmt_vec_info next_stmt_info = first_stmt_info;
       auto_vec<tree> vec_oprnds (ncopies);
+      /* For costing some adjacent vector stores, we'd like to cost with
+	 the total number of them once instead of cost each one by one. */
+      unsigned int n_adjacent_stores = 0;
       for (g = 0; g < group_size; g++)
 	{
 	  running_off = offvar;
@@ -8738,10 +8741,7 @@  vectorizable_store (vec_info *vinfo,
 			 store to avoid ICE like 110776.  */
 		      if (VECTOR_TYPE_P (ltype)
 			  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
-			vect_get_store_cost (vinfo, stmt_info, 1,
-					     alignment_support_scheme,
-					     misalignment, &inside_cost,
-					     cost_vec);
+			n_adjacent_stores++;
 		      else
 			inside_cost
 			  += record_stmt_cost (cost_vec, 1, scalar_store,
@@ -8798,11 +8798,18 @@  vectorizable_store (vec_info *vinfo,
 	    break;
 	}

-      if (costing_p && dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "vect_model_store_cost: inside_cost = %d, "
-			 "prologue_cost = %d .\n",
-			 inside_cost, prologue_cost);
+      if (costing_p)
+	{
+	  if (n_adjacent_stores > 0)
+	    vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
+				 alignment_support_scheme, misalignment,
+				 &inside_cost, cost_vec);
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_store_cost: inside_cost = %d, "
+			     "prologue_cost = %d .\n",
+			     inside_cost, prologue_cost);
+	}

       return true;
     }
@@ -8909,6 +8916,9 @@  vectorizable_store (vec_info *vinfo,
     {
       gcc_assert (!slp && grouped_store);
       unsigned inside_cost = 0, prologue_cost = 0;
+      /* For costing some adjacent vector stores, we'd like to cost with
+	 the total number of them once instead of cost each one by one. */
+      unsigned int n_adjacent_stores = 0;
       for (j = 0; j < ncopies; j++)
 	{
 	  gimple *new_stmt;
@@ -8974,10 +8984,7 @@  vectorizable_store (vec_info *vinfo,

 	  if (costing_p)
 	    {
-	      for (i = 0; i < vec_num; i++)
-		vect_get_store_cost (vinfo, stmt_info, 1,
-				     alignment_support_scheme, misalignment,
-				     &inside_cost, cost_vec);
+	      n_adjacent_stores += vec_num;
 	      continue;
 	    }

@@ -9067,11 +9074,18 @@  vectorizable_store (vec_info *vinfo,
 	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
 	}

-      if (costing_p && dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "vect_model_store_cost: inside_cost = %d, "
-			 "prologue_cost = %d .\n",
-			 inside_cost, prologue_cost);
+      if (costing_p)
+	{
+	  if (n_adjacent_stores > 0)
+	    vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
+				 alignment_support_scheme, misalignment,
+				 &inside_cost, cost_vec);
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_store_cost: inside_cost = %d, "
+			     "prologue_cost = %d .\n",
+			     inside_cost, prologue_cost);
+	}

       return true;
     }
@@ -9290,6 +9304,9 @@  vectorizable_store (vec_info *vinfo,
 	      || memory_access_type == VMAT_CONTIGUOUS_REVERSE);

   unsigned inside_cost = 0, prologue_cost = 0;
+  /* For costing some adjacent vector stores, we'd like to cost with
+     the total number of them once instead of cost each one by one. */
+  unsigned int n_adjacent_stores = 0;
   auto_vec<tree> result_chain (group_size);
   auto_vec<tree, 1> vec_oprnds;
   for (j = 0; j < ncopies; j++)
@@ -9451,9 +9468,7 @@  vectorizable_store (vec_info *vinfo,

 	  if (costing_p)
 	    {
-	      vect_get_store_cost (vinfo, stmt_info, 1,
-				   alignment_support_scheme, misalignment,
-				   &inside_cost, cost_vec);
+	      n_adjacent_stores++;

 	      if (!slp)
 		{
@@ -9623,6 +9638,11 @@  vectorizable_store (vec_info *vinfo,

   if (costing_p)
     {
+      if (n_adjacent_stores > 0)
+	vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
+			     alignment_support_scheme, misalignment,
+			     &inside_cost, cost_vec);
+
       /* When vectorizing a store into the function result assign
 	 a penalty if the function returns in a multi-register location.
 	 In this case we assume we'll end up with having to spill the
@@ -10337,6 +10357,9 @@  vectorizable_load (vec_info *vinfo,
       unsigned HOST_WIDE_INT
 	elsz = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
       unsigned int n_groups = 0;
+      /* For costing some adjacent vector loads, we'd like to cost with
+	 the total number of them once instead of cost each one by one. */
+      unsigned int n_adjacent_loads = 0;
       for (j = 0; j < ncopies; j++)
 	{
 	  if (nloads > 1 && !costing_p)
@@ -10350,10 +10373,7 @@  vectorizable_load (vec_info *vinfo,
 		     avoid ICE, see PR110776.  */
 		  if (VECTOR_TYPE_P (ltype)
 		      && memory_access_type != VMAT_ELEMENTWISE)
-		    vect_get_load_cost (vinfo, stmt_info, 1,
-					alignment_support_scheme, misalignment,
-					false, &inside_cost, nullptr, cost_vec,
-					cost_vec, true);
+		    n_adjacent_loads++;
 		  else
 		    inside_cost += record_stmt_cost (cost_vec, 1, scalar_load,
 						     stmt_info, 0, vect_body);
@@ -10447,11 +10467,19 @@  vectorizable_load (vec_info *vinfo,
 					  false, &n_perms);
 	}

-      if (costing_p && dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "vect_model_load_cost: inside_cost = %u, "
-			 "prologue_cost = 0 .\n",
-			 inside_cost);
+      if (costing_p)
+	{
+	  if (n_adjacent_loads > 0)
+	    vect_get_load_cost (vinfo, stmt_info, n_adjacent_loads,
+				alignment_support_scheme, misalignment, false,
+				&inside_cost, nullptr, cost_vec, cost_vec,
+				true);
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: inside_cost = %u, "
+			     "prologue_cost = 0 .\n",
+			     inside_cost);
+	}

       return true;
     }
@@ -10756,6 +10784,9 @@  vectorizable_load (vec_info *vinfo,
       gcc_assert (grouped_load && !slp);

       unsigned int inside_cost = 0, prologue_cost = 0;
+      /* For costing some adjacent vector loads, we'd like to cost with
+	 the total number of them once instead of cost each one by one. */
+      unsigned int n_adjacent_loads = 0;
       for (j = 0; j < ncopies; j++)
 	{
 	  if (costing_p)
@@ -10787,9 +10818,7 @@  vectorizable_load (vec_info *vinfo,
 					  true);
 		    }
 		}
-	      vect_get_load_cost (vinfo, stmt_info, 1, alignment_support_scheme,
-				  misalignment, false, &inside_cost,
-				  &prologue_cost, cost_vec, cost_vec, true);
+	      n_adjacent_loads++;
 	      continue;
 	    }

@@ -10891,11 +10920,19 @@  vectorizable_load (vec_info *vinfo,
 	  *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 	}

-      if (costing_p && dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "vect_model_load_cost: inside_cost = %u, "
-			 "prologue_cost = %u .\n",
-			 inside_cost, prologue_cost);
+      if (costing_p)
+	{
+	  if (n_adjacent_loads > 0)
+	    vect_get_load_cost (vinfo, stmt_info, n_adjacent_loads,
+				alignment_support_scheme, misalignment, false,
+				&inside_cost, &prologue_cost, cost_vec,
+				cost_vec, true);
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: inside_cost = %u, "
+			     "prologue_cost = %u .\n",
+			     inside_cost, prologue_cost);
+	}

       return true;
     }
@@ -11111,6 +11148,9 @@  vectorizable_load (vec_info *vinfo,

   poly_uint64 group_elt = 0;
   unsigned int inside_cost = 0, prologue_cost = 0;
+  /* For costing some adjacent vector loads, we'd like to cost with
+     the total number of them once instead of cost each one by one. */
+  unsigned int n_adjacent_loads = 0;
   for (j = 0; j < ncopies; j++)
     {
       /* 1. Create the vector or array pointer update chain.  */
@@ -11505,10 +11545,18 @@  vectorizable_load (vec_info *vinfo,
 		  || memory_access_type == VMAT_CONTIGUOUS_REVERSE
 		  || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
 		      && (!grouped_load || first_stmt_info_p)))
-		vect_get_load_cost (vinfo, stmt_info, 1,
-				    alignment_support_scheme, misalignment,
-				    add_realign_cost, &inside_cost,
-				    &prologue_cost, cost_vec, cost_vec, true);
+		{
+		  /* Leave realign cases alone to keep them simple.  */
+		  if (alignment_support_scheme == dr_explicit_realign_optimized
+		      || alignment_support_scheme == dr_explicit_realign)
+		    vect_get_load_cost (vinfo, stmt_info, 1,
+					alignment_support_scheme, misalignment,
+					add_realign_cost, &inside_cost,
+					&prologue_cost, cost_vec, cost_vec,
+					true);
+		  else
+		    n_adjacent_loads++;
+		}
 	    }
 	  else
 	    {
@@ -11679,6 +11727,11 @@  vectorizable_load (vec_info *vinfo,
       gcc_assert (memory_access_type == VMAT_CONTIGUOUS
 		  || memory_access_type == VMAT_CONTIGUOUS_REVERSE
 		  || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
+      if (n_adjacent_loads > 0)
+	vect_get_load_cost (vinfo, stmt_info, n_adjacent_loads,
+			    alignment_support_scheme, misalignment, false,
+			    &inside_cost, &prologue_cost, cost_vec, cost_vec,
+			    true);
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "vect_model_load_cost: inside_cost = %u, "