aarch64: New RTL optimization pass avoid-store-forwarding.
Commit Message
This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks
through testing on ampere1/ampere1a machines.
For example, it can transform cases like
str d5, [sp, #320]
fmul d5, d31, d29
ldp d31, d17, [sp, #312] # Large load from small store
to
str d5, [sp, #320]
fmul d5, d31, d29
ldr d31, [sp, #312]
ldr d17, [sp, #320]
Currently, the pass is disabled by default on all architectures and can be enabled with a target-specific option.
If deemed beneficial enough to become a default, it will be enabled on ampere1/ampere1a,
or on other architectures as well, without needing to be turned on by this option.
Bootstrapped and regtested on aarch64-linux.
gcc/ChangeLog:
* alias.cc (memrefs_conflict_p): Expose static function.
* alias.h (memrefs_conflict_p): Expose static function.
* config.gcc: Add aarch64-store-forwarding.o to extra_objs.
* config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
* config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
* config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
(aarch64-store-forwarding-threshold): New param.
* config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
* doc/invoke.texi: Document new option and new param.
* config/aarch64/aarch64-store-forwarding.cc: New file.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
* gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
* gcc.target/aarch64/ldp_ssll_overlap.c: New test.
Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
---
gcc/alias.cc | 2 +-
gcc/alias.h | 1 +
gcc/config.gcc | 1 +
gcc/config/aarch64/aarch64-passes.def | 1 +
gcc/config/aarch64/aarch64-protos.h | 1 +
.../aarch64/aarch64-store-forwarding.cc | 347 ++++++++++++++++++
gcc/config/aarch64/aarch64.opt | 9 +
gcc/config/aarch64/t-aarch64 | 10 +
gcc/doc/invoke.texi | 12 +-
.../aarch64/ldp_ssll_no_overlap_address.c | 33 ++
.../aarch64/ldp_ssll_no_overlap_offset.c | 33 ++
.../gcc.target/aarch64/ldp_ssll_overlap.c | 33 ++
12 files changed, 481 insertions(+), 2 deletions(-)
create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
--
2.40.1
Comments
Thanks for the patch.
Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
>
> This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
> through testing on ampere1/ampere1a machines.
>
> For example, it can transform cases like
>
> str d5, [sp, #320]
> fmul d5, d31, d29
> ldp d31, d17, [sp, #312] # Large load from small store
>
> to
>
> str d5, [sp, #320]
> fmul d5, d31, d29
> ldr d31, [sp, #312]
> ldr d17, [sp, #320]
For this particular case, it would be good to have a spill forwarding
pass that inserts:
mov d17, d5
before the fmul (possibly with further clean-up beyond that).
But I realise the patch handles the general case where that isn't possible.
The patch can also afford to be approximate.
Some of the insn recognition code will probably need to be updated for
Alex's pass that forms LDP/STP, since that's going to have writeback
support. We'll also need to arrange a pass ordering that works for both
passes.
The patch generally looks good, and it's nicely compact. Some specific
comments below:
> Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.
>
> If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
> or other architectures as well, without needing to be turned on by this option.
>
> Bootstrapped and regtested on aarch64-linux.
>
> gcc/ChangeLog:
>
> * alias.cc (memrefs_conflict_p): Expose static function.
> * alias.h (memrefs_conflict_p): Expose static function.
> * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
> * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> (aarch64-store-forwarding-threshold): New param.
> * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
> * doc/invoke.texi: Document new option and new param.
> * config/aarch64/aarch64-store-forwarding.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
>
> Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> ---
> gcc/alias.cc | 2 +-
> gcc/alias.h | 1 +
> gcc/config.gcc | 1 +
> gcc/config/aarch64/aarch64-passes.def | 1 +
> gcc/config/aarch64/aarch64-protos.h | 1 +
> .../aarch64/aarch64-store-forwarding.cc | 347 ++++++++++++++++++
> gcc/config/aarch64/aarch64.opt | 9 +
> gcc/config/aarch64/t-aarch64 | 10 +
> gcc/doc/invoke.texi | 12 +-
> .../aarch64/ldp_ssll_no_overlap_address.c | 33 ++
> .../aarch64/ldp_ssll_no_overlap_offset.c | 33 ++
> .../gcc.target/aarch64/ldp_ssll_overlap.c | 33 ++
> 12 files changed, 481 insertions(+), 2 deletions(-)
> create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>
> diff --git a/gcc/alias.cc b/gcc/alias.cc
> index 86d8f7104ad..303683f85e3 100644
> --- a/gcc/alias.cc
> +++ b/gcc/alias.cc
> @@ -2488,7 +2488,7 @@ offset_overlap_p (poly_int64 c, poly_int64 xsize, poly_int64 ysize)
> one for X + non-constant and Y + non-constant when X and Y are equal.
> If that is fixed the TBAA hack for union type-punning can be removed. */
>
> -static int
> +int
> memrefs_conflict_p (poly_int64 xsize, rtx x, poly_int64 ysize, rtx y,
> poly_int64 c)
> {
> diff --git a/gcc/alias.h b/gcc/alias.h
> index ab06ac9055f..49836f7d808 100644
> --- a/gcc/alias.h
> +++ b/gcc/alias.h
> @@ -41,6 +41,7 @@ bool alias_ptr_types_compatible_p (tree, tree);
> int compare_base_decls (tree, tree);
> bool refs_same_for_tbaa_p (tree, tree);
> bool mems_same_for_tbaa_p (rtx, rtx);
> +int memrefs_conflict_p (poly_int64, rtx, poly_int64, rtx, poly_int64);
>
> /* This alias set can be used to force a memory to conflict with all
> other memories, creating a barrier across which no memory reference
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index ba6d63e33ac..ae50d36004f 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -350,6 +350,7 @@ aarch64*-*-*)
> cxx_target_objs="aarch64-c.o"
> d_target_objs="aarch64-d.o"
> extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> + extra_objs="${extra_objs} aarch64-store-forwarding.o"
> target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> target_has_targetm_common=yes
> ;;
> diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> index 6ace797b738..fa79e8adca8 100644
> --- a/gcc/config/aarch64/aarch64-passes.def
> +++ b/gcc/config/aarch64/aarch64-passes.def
> @@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
> INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 60a55f4bc19..5a7e68f22e5 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1049,6 +1049,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
> rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
>
> poly_uint64 aarch64_regmode_natural_size (machine_mode);
>
> diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> new file mode 100644
> index 00000000000..30148fb4435
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> @@ -0,0 +1,347 @@
> +/* Avoid store forwarding optimization pass.
> + Copyright (C) 2015-2023 Free Software Foundation, Inc.
Just checking the copyright: does the file include code from other
files? The line above is OK if so, otherwise I guess it should
start at 2023.
> + Contributed by VRULL GmbH.
> +
> + This file is part of GCC.
> +
> + GCC is free software; you can redistribute it and/or modify it
> + under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 3, or (at your option)
> + any later version.
> +
> + GCC is distributed in the hope that it will be useful, but
> + WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + General Public License for more details.
> +
> + You should have received a copy of the GNU General Public License
> + along with GCC; see the file COPYING3. If not see
> + <http://www.gnu.org/licenses/>. */
> +
> +#define IN_TARGET_CODE 1
> +
> +#include "config.h"
> +#define INCLUDE_VECTOR
> +#define INCLUDE_MAP
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "alias.h"
> +#include "rtlanal.h"
> +#include "tree-pass.h"
> +#include "cselib.h"
> +
> +/* This is an RTL pass that detects store forwarding from stores to larger
> + loads (load pairs). For example, it can transform cases like
> +
> + str d5, [sp, #320]
> + fmul d5, d31, d29
> + ldp d31, d17, [sp, #312] # Large load from small store
> +
> + to
> +
> + str d5, [sp, #320]
> + fmul d5, d31, d29
> + ldr d31, [sp, #312]
> + ldr d17, [sp, #320]
> +
> + Design: The pass follows a straightforward design. It starts by
> + initializing the alias analysis and the cselib. Both of these are used to
> + find stores and larger loads with overlapping addresses, which are
> + candidates for store forwarding optimizations. It then scans on basic block
> + level to find stores that forward to larger loads and handles them
> + accordingly as described in the above example. Finally, the alias analysis
> + and the cselib library are closed. */
> +
> +typedef std::vector<std::pair<rtx_insn *, rtx>> vec_rtx_pair;
> +typedef std::map<rtx, uint32_t> map_rtx_dist;
> +
> +/* Statistics counters. */
> +static unsigned int stats_store_count = 0;
> +static unsigned int stats_ldp_count = 0;
> +static unsigned int stats_ssll_count = 0;
> +static unsigned int stats_transformed_count = 0;
> +
> +/* Default. */
> +static rtx dummy;
> +static bool is_store (rtx expr, rtx &op_0=dummy);
> +static bool is_load (rtx expr, rtx &op_1=dummy);
> +
> +/* Return true if EXPR is a store; otherwise false. OP_0 will contain the MEM
> + operand of the store. */
Probably worth saying "Return true if SET expression EXPR...".
> +static bool
> +is_store (rtx expr, rtx &op_0)
> +{
> + op_0 = XEXP (expr, 0);
Using SET_DEST would be clearer.
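E.g. a minimal sketch of is_store with that change (the rest of the
function stays as in the patch, and SET_SRC would be the analogous
accessor in is_load):

  static bool
  is_store (rtx expr, rtx &op_0)
  {
    /* EXPR is assumed to be a SET here; SET_DEST names that assumption
       explicitly instead of indexing the operands with XEXP.  */
    op_0 = SET_DEST (expr);

    if (GET_CODE (op_0) == ZERO_EXTEND
        || GET_CODE (op_0) == SIGN_EXTEND)
      op_0 = XEXP (op_0, 0);

    return MEM_P (op_0);
  }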
> +
> + if (GET_CODE (op_0) == ZERO_EXTEND
> + || GET_CODE (op_0) == SIGN_EXTEND)
> + op_0 = XEXP (op_0, 0);
> +
> + return MEM_P (op_0);
> +}
> +
> +/* Return true if EXPR is a load; otherwise false. OP_1 will contain the MEM
> + operand of the load. */
Same as above.
> +
> +static bool
> +is_load (rtx expr, rtx &op_1)
> +{
> + op_1 = XEXP (expr, 1);
SET_SRC.
> +
> + if (GET_CODE (op_1) == ZERO_EXTEND
> + || GET_CODE (op_1) == SIGN_EXTEND)
> + op_1 = XEXP (op_1, 0);
> +
> + return MEM_P (op_1);
> +}
> +
> +/* Check the maximum instruction distance threshold and if crossed, remove the
> + store from STORE_EXPRS. DISTANCES contains the instruction distances. */
> +
> +void
> +update_store_load_distances (vec_rtx_pair &store_exprs, map_rtx_dist &distances)
> +{
> + for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
> + {
> + (distances[(*iter).second])++;
> + iter = distances[(*iter).second]
> + > (uint32_t) aarch64_store_forwarding_threshold_param
> + ? store_exprs.erase (iter) : iter + 1;
What are the guidelines for picking a good param value? It looks
like this is called once for each instruction, without taking issue
rate into account, so I assume the param value is supposed to account
for that instead.
> + }
> +}
> +
> +/* Return true if STORE_MEM is forwarding to LOAD_MEM; otherwise false. */
> +
> +static bool
> +is_forwarding (rtx store_mem, rtx load_mem)
> +{
> + gcc_checking_assert (MEM_P (store_mem) && MEM_P (load_mem));
> +
> + rtx store_mem_addr = canon_rtx (get_addr (XEXP (store_mem, 0)));
> + rtx load_mem_addr = canon_rtx (get_addr (XEXP (load_mem, 0)));
> +
> + return (memrefs_conflict_p (GET_MODE_SIZE (GET_MODE (store_mem)),
> + store_mem_addr,
> + GET_MODE_SIZE (GET_MODE (load_mem)),
> + load_mem_addr, 0) == 1
> + || rtx_equal_for_cselib_1 (store_mem_addr,
> + load_mem_addr,
> + GET_MODE (store_mem), 0));
> +}
> +
> +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> + otherwise false. STORE_EXPRS contains the stores. */
> +
> +static bool
> +is_small_store_to_large_load (vec_rtx_pair &store_exprs, rtx_insn *insn)
> +{
> + unsigned int load_count = 0;
> + bool forwarding = false;
> + rtx expr = PATTERN (insn);
> +
> + if (GET_CODE (expr) != PARALLEL
> + || XVECLEN (expr, 0) != 2)
> + return false;
> +
> + for (int i = 0; i < XVECLEN (expr, 0); i++)
> + {
> + rtx op_1;
> + rtx out_exp = XVECEXP (expr, 0, i);
> +
> + if (GET_CODE (out_exp) != SET)
> + continue;
> +
> + if (!is_load (out_exp, op_1))
> + {
> + load_count = 0;
> + continue;
> + }
> +
> + load_count++;
> +
> + for (std::pair<rtx_insn *, rtx> str : store_exprs)
> + {
> + if (!is_forwarding (XEXP (str.second, 0), op_1))
> + continue;
> +
> + if (dump_file)
> + {
> + fprintf (dump_file,
> + "Store forwarding to PARALLEL with loads:\n");
> + fprintf (dump_file, " From: ");
> + print_rtl_single (dump_file, str.first);
> + fprintf (dump_file, " To: ");
> + print_rtl_single (dump_file, insn);
> + }
> +
> + forwarding = true;
> + }
> + }
> +
> + if (load_count == 2)
> + stats_ldp_count++;
> +
> + return load_count == 2 && forwarding;
> +}
> +
> +/* If the memory register in EXPR is overwritten between the store and the larger
> + load, we need to avoid handling that case and thus we remove the store from
> + STORE_EXPRS. */
> +
> +static void check_memory_reg_overwrite (rtx expr, vec_rtx_pair &store_exprs)
Formatting nit: new line after "void".
Could you explain more detail why this is needed? Doesn't cselib
handle the definition itself?
> +{
> + if (REG_P (XEXP (expr, 0)))
> + for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
> + iter = reg_mentioned_p (XEXP (expr, 0), XEXP ((*iter).second, 0))
> + ? store_exprs.erase (iter) : iter + 1;
> +}
> +
> +/* Break a load pair into its 2 distinct loads, except if the base source
> + address to load from is overwritten in the first load. INSN should be the
> + PARALLEL of the load pair. */
> +
> +static void
> +break_ldp (rtx_insn *insn)
> +{
> + rtx expr = PATTERN (insn);
> +
> + gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> +
> + rtx load_0 = XVECEXP (expr, 0, 0);
> + rtx load_1 = XVECEXP (expr, 0, 1);
> +
> + gcc_checking_assert (is_load (load_0) && is_load (load_1));
> +
> + /* The base address was overwritten in the first load. */
> + if (reg_mentioned_p (XEXP (load_0, 0), XEXP (load_1, 1)))
SET_DEST and SET_SRC here.
> + return;
> +
> + emit_insn_after (load_1, insn);
> + emit_insn_after (load_0, insn);
> + remove_insn (insn);
It's probably safer to insert before the instruction, since that copes
correctly with non-call exceptions (where a load could raise an exception).
It's unlikely to matter as things stand, but LDPs could be formed earlier
in future.
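A minimal sketch of that variant (assuming the same surrounding checks
in break_ldp as in the patch):

  /* Emitting the single loads before the LDP keeps any exception point
     at or before the original instruction for -fnon-call-exceptions.  */
  emit_insn_before (load_0, insn);
  emit_insn_before (load_1, insn);
  remove_insn (insn);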
> +
> + stats_transformed_count++;
> +}
> +
> +static void
> +scan_and_transform_bb_level ()
> +{
> + rtx_insn *insn, *next;
> + basic_block bb;
> + FOR_EACH_BB_FN (bb, cfun)
> + {
> + vec_rtx_pair store_exprs;
> + map_rtx_dist store_cnt_byte_dist;
> + for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> + {
> + next = NEXT_INSN (insn);
> +
> + /* If we cross a CALL_P insn, clear the vectors, because the
> + small-store-to-large-load is unlikely to cause performance
> + difference. */
> + if (CALL_P (insn))
> + {
> + store_exprs.clear ();
> + store_cnt_byte_dist.clear ();
> + }
> +
> + if (!NONJUMP_INSN_P (insn))
> + continue;
> +
> + cselib_process_insn (insn);
> +
> + rtx expr = PATTERN (insn);
> +
> + if (GET_CODE (expr) == SET)
Would be good to use single_set here, so that we pick up sets that are
parallel with clobbers.
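For instance, a sketch of the single_set variant (the body is otherwise
the same as in the patch):

  /* single_set also matches a SET wrapped in a PARALLEL together with
     CLOBBERs, which the plain GET_CODE (expr) == SET test misses.  */
  if (rtx set = single_set (insn))
    {
      if (is_store (set))
        {
          store_exprs.push_back ({insn, set});

          /* The store insn shouldn't have been encountered before.  */
          gcc_checking_assert (store_cnt_byte_dist.find (set)
                               == store_cnt_byte_dist.end ());
          store_cnt_byte_dist.insert ({ set, 0 });
          stats_store_count++;
        }

      check_memory_reg_overwrite (set, store_exprs);
    }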
> + {
> + /* If a store is encountered, push it to the store_exprs vector
> + to check it later. */
> + if (is_store (expr))
> + {
> + store_exprs.push_back ({insn, expr});
> +
> + /* The store insn shouldn't have been encountered before. */
> + gcc_checking_assert (store_cnt_byte_dist.find (expr)
> + == store_cnt_byte_dist.end ());
> + store_cnt_byte_dist.insert ({ expr, 0 });
> + stats_store_count++;
> + }
> +
> + check_memory_reg_overwrite (expr, store_exprs);
Depending on the answer to the question above, we probably ought to
use note_stores to iterate over all stores for check_memory_reg_overwrite.
Or, alternatively:
  vec_rtx_properties properties;
  properties.add_insn (insn, false);
  for (rtx_obj_reference ref : properties.refs ())
    if (ref.is_write () && ref.is_reg ())
      check_memory_reg_overwrite ...
> + }
> +
> + /* Check for small-store-to-large-load. */
> + if (is_small_store_to_large_load (store_exprs, insn))
> + {
> + stats_ssll_count++;
> + break_ldp (insn);
> + }
> +
> + update_store_load_distances (store_exprs, store_cnt_byte_dist);
> + }
> + }
It looks like the loop carries information across BB boundaries.
Is that deliberate, to try to detect forwarding in EBBs? If so,
I'm not sure whether BBs are guaranteed to be in fallthrough order.
The information probably ought to be reset after non-fallthru.
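If the per-block state were ever carried across blocks, a hypothetical
reset at the top of each block could look like:

  /* Keep the accumulated stores only if the previous block falls
     through to BB; otherwise start from a clean slate.  */
  edge fallthru = find_fallthru_edge (bb->preds);
  if (!fallthru || fallthru->src != bb->prev_bb)
    {
      store_exprs.clear ();
      store_cnt_byte_dist.clear ();
    }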
Thanks,
Richard
> +}
> +
> +static void
> +execute_avoid_store_forwarding ()
> +{
> + init_alias_analysis ();
> + cselib_init (0);
> + scan_and_transform_bb_level ();
> + end_alias_analysis ();
> + cselib_finish ();
> + statistics_counter_event (cfun, "Number of stores identified: ",
> + stats_store_count);
> + statistics_counter_event (cfun, "Number of load pairs identified: ",
> + stats_ldp_count);
> + statistics_counter_event (cfun,
> + "Number of forwarding cases identified: ",
> + stats_ssll_count);
> + statistics_counter_event (cfun, "Number of transformed cases: ",
> + stats_transformed_count);
> +}
> +
> +const pass_data pass_data_avoid_store_forwarding =
> +{
> + RTL_PASS, /* type. */
> + "avoid_store_forwarding", /* name. */
> + OPTGROUP_NONE, /* optinfo_flags. */
> + TV_NONE, /* tv_id. */
> + 0, /* properties_required. */
> + 0, /* properties_provided. */
> + 0, /* properties_destroyed. */
> + 0, /* todo_flags_start. */
> + 0 /* todo_flags_finish. */
> +};
> +
> +class pass_avoid_store_forwarding : public rtl_opt_pass
> +{
> +public:
> + pass_avoid_store_forwarding (gcc::context *ctxt)
> + : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> + {}
> +
> + /* opt_pass methods: */
> + virtual bool gate (function *)
> + {
> + return aarch64_flag_avoid_store_forwarding && optimize >= 2;
> + }
> +
> + virtual unsigned int execute (function *)
> + {
> + execute_avoid_store_forwarding ();
> + return 0;
> + }
> +
> +}; // class pass_avoid_store_forwarding
> +
> +/* Create a new avoid store forwarding pass instance. */
> +
> +rtl_opt_pass *
> +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> +{
> + return new pass_avoid_store_forwarding (ctxt);
> +}
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index f5a518202a1..a9e944a272f 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -304,6 +304,10 @@ moutline-atomics
> Target Var(aarch64_flag_outline_atomics) Init(2) Save
> Generate local calls to out-of-line atomic operations.
>
> +mavoid-store-forwarding
> +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> +Avoid store forwarding to load pairs.
> +
> -param=aarch64-sve-compare-costs=
> Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach. Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
>
> EnumValue
> Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> +
> +-param=aarch64-store-forwarding-threshold=
> +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> +Maximum instruction distance allowed between a store and a load pair for this to be
> +considered a candidate to avoid when using aarch64-avoid-store-forwarding.
> diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> index a9a244ab6d6..7639b50358d 100644
> --- a/gcc/config/aarch64/t-aarch64
> +++ b/gcc/config/aarch64/t-aarch64
> @@ -176,6 +176,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
>
> +aarch64-store-forwarding.o: \
> + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> + $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> + dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> + output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> + $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> + $(srcdir)/config/aarch64/aarch64-protos.h
> + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> +
> comma=,
> MULTILIB_OPTIONS = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> MULTILIB_DIRNAMES = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 2caa7ceec40..ca30f52109a 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -792,7 +792,7 @@ Objective-C and Objective-C++ Dialects}.
> -moverride=@var{string} -mverbose-cost-dump
> -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> --moutline-atomics }
> +-moutline-atomics -mavoid-store-forwarding}
>
> @emph{Adapteva Epiphany Options}
> @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> @@ -16583,6 +16583,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
> With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> source pointer is aligned to at least double the alignment of the type.
>
> +@item aarch64-store-forwarding-threshold
> +Maximum allowed instruction distance between a store and a load pair for
> +this to be considered a candidate to avoid when using
> +aarch64-avoid-store-forwarding.
> +
> @item aarch64-loop-vect-issue-rate-niters
> The tuning for some AArch64 CPUs tries to take both latencies and issue
> rates into account when deciding whether a loop should be vectorized
> @@ -20558,6 +20563,11 @@ Generate code which uses only the general-purpose registers. This will prevent
> the compiler from using floating-point and Advanced SIMD registers but will not
> impose any restrictions on the assembler.
>
> +@item -mavoid-store-forwarding
> +@itemx -mno-avoid-store-forwarding
> +Avoid store forwarding to load pairs.
> +@option{-mavoid-store-forwarding} at levels @option{-O2}, @option{-O3}.
> +
> @opindex mlittle-endian
> @item -mlittle-endian
> Generate little-endian code. This is the default when GCC is configured for an
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> new file mode 100644
> index 00000000000..b77de6c64b6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Different address, same offset, no overlap */
> +
> +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> + TYPE r, y; \
> + st_arr[0] = i; \
> + ld_arr[0] = dummy; \
> + r = st_arr_2[0]; \
> + y = st_arr_2[1]; \
> + return r + y; \
> +}
> +
> +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> new file mode 100644
> index 00000000000..f1b3a66abfd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Same address, different offset, no overlap */
> +
> +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> + TYPE r, y; \
> + st_arr[0] = i; \
> + ld_arr[0] = dummy; \
> + r = st_arr[10]; \
> + y = st_arr[11]; \
> + return r + y; \
> +}
> +
> +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> new file mode 100644
> index 00000000000..8d5ce5cc87e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Same address, same offset, overlap */
> +
> +#define LDP_SSLL_OVERLAP(TYPE) \
> +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> + TYPE r, y; \
> + st_arr[0] = i; \
> + ld_arr[0] = dummy; \
> + r = st_arr[0]; \
> + y = st_arr[1]; \
> + return r + y; \
> +}
> +
> +LDP_SSLL_OVERLAP(uint32_t)
> +LDP_SSLL_OVERLAP(uint64_t)
> +LDP_SSLL_OVERLAP(int32_t)
> +LDP_SSLL_OVERLAP(int64_t)
> +LDP_SSLL_OVERLAP(int)
> +LDP_SSLL_OVERLAP(long)
> +LDP_SSLL_OVERLAP(float)
> +LDP_SSLL_OVERLAP(double)
> +LDP_SSLL_OVERLAP(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> --
> 2.40.1
Hi Richard,
thank you for reviewing the patch.
On Sat, Nov 11, 2023 at 6:57 PM Richard Sandiford <richard.sandiford@arm.com>
wrote:
> Thanks for the patch.
>
> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> > This is an RTL pass that detects store forwarding from stores to larger
> loads (load pairs).
> >
> > This optimization is SPEC2017-driven and was found to be beneficial for
> some benchmarks,
> > through testing on ampere1/ampere1a machines.
> >
> > For example, it can transform cases like
> >
> > str d5, [sp, #320]
> > fmul d5, d31, d29
> > ldp d31, d17, [sp, #312] # Large load from small store
> >
> > to
> >
> > str d5, [sp, #320]
> > fmul d5, d31, d29
> > ldr d31, [sp, #312]
> > ldr d17, [sp, #320]
>
> For this particular case, it would be good to have a spill forwarding
> pass that inserts:
>
> mov d17, d5
>
> before the fmul (possibly with further clean-up beyond that).
>
> But I realise the patch handles the general case where that isn't possible.
> The patch can also afford to be approximate.
>
> Some of the insn recognition code will probably need to be updated for
> Alex's pass that forms LDP/STP, since that's going to have writeback
> support. We'll also need to arrange a pass ording that works for both
> passes.
>
> The patch generally looks good, and it's nicely compact. Some specific
> comments below:
>
> > Currently, the pass is disabled by default on all architectures and
> enabled by a target-specific option.
> >
> > If deemed beneficial enough for a default, it will be enabled on
> ampere1/ampere1a,
> > or other architectures as well, without needing to be turned on by this
> option.
> >
> > Bootstrapped and regtested on aarch64-linux.
> >
> > gcc/ChangeLog:
> >
> > * alias.cc (memrefs_conflict_p): Expose static function.
> > * alias.h (memrefs_conflict_p): Expose static function.
> > * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> > * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New
> pass.
> > * config/aarch64/aarch64-protos.h
> (make_pass_avoid_store_forwarding): Declare.
> > * config/aarch64/aarch64.opt (mavoid-store-forwarding): New
> option.
> > (aarch64-store-forwarding-threshold): New param.
> > * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
> > * doc/invoke.texi: Document new option and new param.
> > * config/aarch64/aarch64-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> > * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> > * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> > Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> > ---
> > gcc/alias.cc | 2 +-
> > gcc/alias.h | 1 +
> > gcc/config.gcc | 1 +
> > gcc/config/aarch64/aarch64-passes.def | 1 +
> > gcc/config/aarch64/aarch64-protos.h | 1 +
> > .../aarch64/aarch64-store-forwarding.cc | 347 ++++++++++++++++++
> > gcc/config/aarch64/aarch64.opt | 9 +
> > gcc/config/aarch64/t-aarch64 | 10 +
> > gcc/doc/invoke.texi | 12 +-
> > .../aarch64/ldp_ssll_no_overlap_address.c | 33 ++
> > .../aarch64/ldp_ssll_no_overlap_offset.c | 33 ++
> > .../gcc.target/aarch64/ldp_ssll_overlap.c | 33 ++
> > 12 files changed, 481 insertions(+), 2 deletions(-)
> > create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> > create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >
> > diff --git a/gcc/alias.cc b/gcc/alias.cc
> > index 86d8f7104ad..303683f85e3 100644
> > --- a/gcc/alias.cc
> > +++ b/gcc/alias.cc
> > @@ -2488,7 +2488,7 @@ offset_overlap_p (poly_int64 c, poly_int64 xsize,
> poly_int64 ysize)
> > one for X + non-constant and Y + non-constant when X and Y are equal.
> > If that is fixed the TBAA hack for union type-punning can be
> removed. */
> >
> > -static int
> > +int
> > memrefs_conflict_p (poly_int64 xsize, rtx x, poly_int64 ysize, rtx y,
> > poly_int64 c)
> > {
> > diff --git a/gcc/alias.h b/gcc/alias.h
> > index ab06ac9055f..49836f7d808 100644
> > --- a/gcc/alias.h
> > +++ b/gcc/alias.h
> > @@ -41,6 +41,7 @@ bool alias_ptr_types_compatible_p (tree, tree);
> > int compare_base_decls (tree, tree);
> > bool refs_same_for_tbaa_p (tree, tree);
> > bool mems_same_for_tbaa_p (rtx, rtx);
> > +int memrefs_conflict_p (poly_int64, rtx, poly_int64, rtx, poly_int64);
> >
> > /* This alias set can be used to force a memory to conflict with all
> > other memories, creating a barrier across which no memory reference
> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > index ba6d63e33ac..ae50d36004f 100644
> > --- a/gcc/config.gcc
> > +++ b/gcc/config.gcc
> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> > cxx_target_objs="aarch64-c.o"
> > d_target_objs="aarch64-d.o"
> > extra_objs="aarch64-builtins.o aarch-common.o
> aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o
> aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o
> cortex-a57-fma-steering.o aarch64-speculation.o
> falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > + extra_objs="${extra_objs} aarch64-store-forwarding.o"
> > target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> > target_has_targetm_common=yes
> > ;;
> > diff --git a/gcc/config/aarch64/aarch64-passes.def
> b/gcc/config/aarch64/aarch64-passes.def
> > index 6ace797b738..fa79e8adca8 100644
> > --- a/gcc/config/aarch64/aarch64-passes.def
> > +++ b/gcc/config/aarch64/aarch64-passes.def
> > @@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1,
> pass_track_speculation);
> > INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> > INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> > INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> > index 60a55f4bc19..5a7e68f22e5 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1049,6 +1049,7 @@ rtl_opt_pass *make_pass_track_speculation
> (gcc::context *);
> > rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> > rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> > rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >
> > poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >
> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc
> b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > new file mode 100644
> > index 00000000000..30148fb4435
> > --- /dev/null
> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > @@ -0,0 +1,347 @@
> > +/* Avoid store forwarding optimization pass.
> > + Copyright (C) 2015-2023 Free Software Foundation, Inc.
>
> Just checking the copyright: does the file include code from other
> files? The line above is OK if so, otherwise I guess it should
> start at 2023.
>
> + Contributed by VRULL GmbH.
> > +
> > + This file is part of GCC.
> > +
> > + GCC is free software; you can redistribute it and/or modify it
> > + under the terms of the GNU General Public License as published by
> > + the Free Software Foundation; either version 3, or (at your option)
> > + any later version.
> > +
> > + GCC is distributed in the hope that it will be useful, but
> > + WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + General Public License for more details.
> > +
> > + You should have received a copy of the GNU General Public License
> > + along with GCC; see the file COPYING3. If not see
> > + <http://www.gnu.org/licenses/>. */
> > +
> > +#define IN_TARGET_CODE 1
> > +
> > +#include "config.h"
> > +#define INCLUDE_VECTOR
> > +#define INCLUDE_MAP
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "alias.h"
> > +#include "rtlanal.h"
> > +#include "tree-pass.h"
> > +#include "cselib.h"
> > +
> > +/* This is an RTL pass that detects store forwarding from stores to
> larger
> > + loads (load pairs). For example, it can transform cases like
> > +
> > + str d5, [sp, #320]
> > + fmul d5, d31, d29
> > + ldp d31, d17, [sp, #312] # Large load from small store
> > +
> > + to
> > +
> > + str d5, [sp, #320]
> > + fmul d5, d31, d29
> > + ldr d31, [sp, #312]
> > + ldr d17, [sp, #320]
> > +
> > + Design: The pass follows a straightforward design. It starts by
> > + initializing the alias analysis and the cselib. Both of these are
> used to
> > + find stores and larger loads with overlapping addresses, which are
> > + candidates for store forwarding optimizations. It then scans on
> basic block
> > + level to find stores that forward to larger loads and handles them
> > + accordingly as described in the above example. Finally, the alias
> analysis
> > + and the cselib library are closed. */
> > +
> > +typedef std::vector<std::pair<rtx_insn *, rtx>> vec_rtx_pair;
> > +typedef std::map<rtx, uint32_t> map_rtx_dist;
> > +
> > +/* Statistics counters. */
> > +static unsigned int stats_store_count = 0;
> > +static unsigned int stats_ldp_count = 0;
> > +static unsigned int stats_ssll_count = 0;
> > +static unsigned int stats_transformed_count = 0;
> > +
> > +/* Default. */
> > +static rtx dummy;
> > +static bool is_store (rtx expr, rtx &op_0=dummy);
> > +static bool is_load (rtx expr, rtx &op_1=dummy);
> > +
> > +/* Return true if EXPR is a store; otherwise false. OP_0 will contain
> the MEM
> > + operand of the store. */
>
> Probably worth saying "Return true if SET expression EXPR...".
>
> > +static bool
> > +is_store (rtx expr, rtx &op_0)
> > +{
> > + op_0 = XEXP (expr, 0);
>
> Using SET_DEST would be clearer.
>
> > +
> > + if (GET_CODE (op_0) == ZERO_EXTEND
> > + || GET_CODE (op_0) == SIGN_EXTEND)
> > + op_0 = XEXP (op_0, 0);
> > +
> > + return MEM_P (op_0);
> > +}
> > +
> > +/* Return true if EXPR is a load; otherwise false. OP_1 will contain
> the MEM
> > + operand of the load. */
>
> Same as above.
>
> > +
> > +static bool
> > +is_load (rtx expr, rtx &op_1)
> > +{
> > + op_1 = XEXP (expr, 1);
>
> SET_SRC.
>
> > +
> > + if (GET_CODE (op_1) == ZERO_EXTEND
> > + || GET_CODE (op_1) == SIGN_EXTEND)
> > + op_1 = XEXP (op_1, 0);
> > +
> > + return MEM_P (op_1);
> > +}
> > +
> > +/* Check the maximum instruction distance threshold and if crossed,
> remove the
> > + store from STORE_EXPRS. DISTANCES contains the intruction
> distances. */
> > +
> > +void
> > +update_store_load_distances (vec_rtx_pair &store_exprs, map_rtx_dist
> &distances)
> > +{
> > + for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
> > + {
> > + (distances[(*iter).second])++;
> > + iter = distances[(*iter).second]
> > + > (uint32_t) aarch64_store_forwarding_threshold_param
> > + ? store_exprs.erase (iter) : iter + 1;
>
> What are the guidelines for picking a good param value? It looks
> like this is called one for each instruction, without taking issue
> rate into account, so I assume the param value is supposed to account
> for that instead.
>
>
The param value specifies a threshold of RTL instructions that follow each
store.
We needed a way to count the instruction distance between stores and loads
in order to distinguish store forwardings that do or do not have an impact
on performance.
A store that forwards to a "very far" load pair should have no impact.
We have chosen this method for simplicity, despite the limitations and
possible inaccuracies that exist.
If you happen to know a more accurate way to track instruction distance, I
would gladly use it instead.
> > + }
> > +}
> > +
> > +/* Return true if STORE_MEM is forwarding to LOAD_MEM; otherwise
> false. */
> > +
> > +static bool
> > +is_forwarding (rtx store_mem, rtx load_mem)
> > +{
> > + gcc_checking_assert (MEM_P (store_mem) && MEM_P (load_mem));
> > +
> > + rtx store_mem_addr = canon_rtx (get_addr (XEXP (store_mem, 0)));
> > + rtx load_mem_addr = canon_rtx (get_addr (XEXP (load_mem, 0)));
> > +
> > + return (memrefs_conflict_p (GET_MODE_SIZE (GET_MODE (store_mem)),
> > + store_mem_addr,
> > + GET_MODE_SIZE (GET_MODE (load_mem)),
> > + load_mem_addr, 0) == 1
> > + || rtx_equal_for_cselib_1 (store_mem_addr,
> > + load_mem_addr,
> > + GET_MODE (store_mem), 0));
> > +}
> > +
> > +/* Return true if INSN is a load pair, preceded by a store forwarding
> to it;
> > + otherwise false. STORE_EXPRS contains the stores. */
> > +
> > +static bool
> > +is_small_store_to_large_load (vec_rtx_pair &store_exprs, rtx_insn *insn)
> > +{
> > + unsigned int load_count = 0;
> > + bool forwarding = false;
> > + rtx expr = PATTERN (insn);
> > +
> > + if (GET_CODE (expr) != PARALLEL
> > + || XVECLEN (expr, 0) != 2)
> > + return false;
> > +
> > + for (int i = 0; i < XVECLEN (expr, 0); i++)
> > + {
> > + rtx op_1;
> > + rtx out_exp = XVECEXP (expr, 0, i);
> > +
> > + if (GET_CODE (out_exp) != SET)
> > + continue;
> > +
> > + if (!is_load (out_exp, op_1))
> > + {
> > + load_count = 0;
> > + continue;
> > + }
> > +
> > + load_count++;
> > +
> > + for (std::pair<rtx_insn *, rtx> str : store_exprs)
> > + {
> > + if (!is_forwarding (XEXP (str.second, 0), op_1))
> > + continue;
> > +
> > + if (dump_file)
> > + {
> > + fprintf (dump_file,
> > + "Store forwarding to PARALLEL with loads:\n");
> > + fprintf (dump_file, " From: ");
> > + print_rtl_single (dump_file, str.first);
> > + fprintf (dump_file, " To: ");
> > + print_rtl_single (dump_file, insn);
> > + }
> > +
> > + forwarding = true;
> > + }
> > + }
> > +
> > + if (load_count == 2)
> > + stats_ldp_count++;
> > +
> > + return load_count == 2 && forwarding;
> > +}
> > +
> > +/* If the memory register in EXPR is ovewriten between the store and
> the larger
> > + load, we need to avoid handling that case and thus we remove the
> store from
> > + STORE_EXPRS. */
> > +
> > +static void check_memory_reg_overwrite (rtx expr, vec_rtx_pair
> &store_exprs)
>
> Formatting nit: new line after "void".
>
> Could you explain more detail why this is needed? Doesn't cselib
> handle the definition itself?
>
>
Actually, this is supposed to catch cases where a limitation might cause
cselib to return false while memrefs_conflict_p returns true in the
"is_forwarding" function.
In that case, memrefs_conflict_p would miss a register overwrite that
happened between the store and the load pair and would report a false
positive.
I'll re-evaluate to make sure this is indeed needed, or whether we can do
it by just utilizing cselib.
> > +{
> > + if (REG_P (XEXP (expr, 0)))
> > + for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
> > + iter = reg_mentioned_p (XEXP (expr, 0), XEXP ((*iter).second, 0))
> > + ? store_exprs.erase (iter) : iter + 1;
> > +}
> > +
> > +/* Break a load pair into its 2 distinct loads, except if the base
> source
> > + address to load from is overwriten in the first load. INSN should
> be the
> > + PARALLEL of the load pair. */
> > +
> > +static void
> > +break_ldp (rtx_insn *insn)
> > +{
> > + rtx expr = PATTERN (insn);
> > +
> > + gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0)
> == 2);
> > +
> > + rtx load_0 = XVECEXP (expr, 0, 0);
> > + rtx load_1 = XVECEXP (expr, 0, 1);
> > +
> > + gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > +
> > + /* The base address was overwriten in the first load. */
> > + if (reg_mentioned_p (XEXP (load_0, 0), XEXP (load_1, 1)))
>
> SET_DEST and SET_SRC here.
>
> > + return;
> > +
> > + emit_insn_after (load_1, insn);
> > + emit_insn_after (load_0, insn);
> > + remove_insn (insn);
>
> It's probably safer to insert before the instruction, since that copes
> correctly with non-call exceptions (where a load could raise an exception).
> It's unlikely to matter as things stand, but LDPs could be formed earlier
> in future.
>
> > +
> > + stats_transformed_count++;
> > +}
> > +
> > +static void
> > +scan_and_transform_bb_level ()
> > +{
> > + rtx_insn *insn, *next;
> > + basic_block bb;
> > + FOR_EACH_BB_FN (bb, cfun)
> > + {
> > + vec_rtx_pair store_exprs;
> > + map_rtx_dist store_cnt_byte_dist;
> > + for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn =
> next)
> > + {
> > + next = NEXT_INSN (insn);
> > +
> > + /* If we cross a CALL_P insn, clear the vectors, because the
> > + small-store-to-large-load is unlikely to cause performance
> > + difference. */
> > + if (CALL_P (insn))
> > + {
> > + store_exprs.clear ();
> > + store_cnt_byte_dist.clear ();
> > + }
> > +
> > + if (!NONJUMP_INSN_P (insn))
> > + continue;
> > +
> > + cselib_process_insn (insn);
> > +
> > + rtx expr = PATTERN (insn);
> > +
> > + if (GET_CODE (expr) == SET)
>
> Would be good to use single_set here, so that we pick up sets that are
> parallel with clobbers.
>
> > + {
> > + /* If a store is encountered, push it to the store_exprs
> vector
> > + to check it later. */
> > + if (is_store (expr))
> > + {
> > + store_exprs.push_back ({insn, expr});
> > +
> > + /* The store insn shouldn't have been encountered
> before. */
> > + gcc_checking_assert (store_cnt_byte_dist.find (expr)
> > + == store_cnt_byte_dist.end ());
> > + store_cnt_byte_dist.insert ({ expr, 0 });
> > + stats_store_count++;
> > + }
> > +
> > + check_memory_reg_overwrite (expr, store_exprs);
>
> Depending on the answer to the question above, we probably ought to
> use note_stores to iterate over all stores for check_memory_reg_overwrite.
> Or, alternatively:
>
> vec_rtx_properties properties;
> properties.add_insn (insn, false);
> for (rtx_obj_reference ref : properties.refs ())
> if (ref.is_write () && ref.is_reg ())
> check_memory_reg_overwrite ...
>
>
Thanks for the point.
I may use it in V2 depending on the outcome of the re-evaluation.
> > + }
> > +
> > + /* Check for small-store-to-large-load. */
> > + if (is_small_store_to_large_load (store_exprs, insn))
> > + {
> > + stats_ssll_count++;
> > + break_ldp (insn);
> > + }
> > +
> > + update_store_load_distances (store_exprs, store_cnt_byte_dist);
> > + }
> > + }
>
> It looks like the loop carries information across BB boundaries.
> Is that deliberate, to try to detect forwarding in EBBs? If so,
> I'm not sure whether BBs are guaranteed to be in fallthrough order.
>
> The information probably ought to be reset after non-fallthru.
>
>
I am not sure I understood exactly what you said here, but the vectors are
local to each BB iteration so no information is carried across BBs.
If I am missing something, please let me know.
> Thanks,
> Richard
>
> > +}
> > +
> > +static void
> > +execute_avoid_store_forwarding ()
> > +{
> > + init_alias_analysis ();
> > + cselib_init (0);
> > + scan_and_transform_bb_level ();
> > + end_alias_analysis ();
> > + cselib_finish ();
> > + statistics_counter_event (cfun, "Number of stores identified: ",
> > + stats_store_count);
> > + statistics_counter_event (cfun, "Number of load pairs identified: ",
> > + stats_ldp_count);
> > + statistics_counter_event (cfun,
> > + "Number of forwarding cases identified: ",
> > + stats_ssll_count);
> > + statistics_counter_event (cfun, "Number of trasformed cases: ",
> > + stats_transformed_count);
> > +}
> > +
> > +const pass_data pass_data_avoid_store_forwarding =
> > +{
> > + RTL_PASS, /* type. */
> > + "avoid_store_forwarding", /* name. */
> > + OPTGROUP_NONE, /* optinfo_flags. */
> > + TV_NONE, /* tv_id. */
> > + 0, /* properties_required. */
> > + 0, /* properties_provided. */
> > + 0, /* properties_destroyed. */
> > + 0, /* todo_flags_start. */
> > + 0 /* todo_flags_finish. */
> > +};
> > +
> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > +{
> > +public:
> > + pass_avoid_store_forwarding (gcc::context *ctxt)
> > + : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > + {}
> > +
> > + /* opt_pass methods: */
> > + virtual bool gate (function *)
> > + {
> > + return aarch64_flag_avoid_store_forwarding && optimize >= 2;
> > + }
> > +
> > + virtual unsigned int execute (function *)
> > + {
> > + execute_avoid_store_forwarding ();
> > + return 0;
> > + }
> > +
> > +}; // class pass_avoid_store_forwarding
> > +
> > +/* Create a new avoid store forwarding pass instance. */
> > +
> > +rtl_opt_pass *
> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > +{
> > + return new pass_avoid_store_forwarding (ctxt);
> > +}
> > diff --git a/gcc/config/aarch64/aarch64.opt
> b/gcc/config/aarch64/aarch64.opt
> > index f5a518202a1..a9e944a272f 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -304,6 +304,10 @@ moutline-atomics
> > Target Var(aarch64_flag_outline_atomics) Init(2) Save
> > Generate local calls to out-of-line atomic operations.
> >
> > +mavoid-store-forwarding
> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0)
> Optimization
> > +Avoid store forwarding to load pairs.
> > +
> > -param=aarch64-sve-compare-costs=
> > Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1)
> IntegerRange(0, 1) Param
> > When vectorizing for SVE, consider using unpacked vectors for smaller
> elements and use the cost model to pick the cheapest approach. Also use
> the cost model to choose between SVE and Advanced SIMD vectorization.
> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never)
> Value(AARCH64_LDP_STP_POLICY_NEVER)
> >
> > EnumValue
> > Enum(aarch64_ldp_stp_policy) String(aligned)
> Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > +
> > +-param=aarch64-store-forwarding-threshold=
> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param)
> Init(20) Param
> > +Maximum instruction distance allowed between a store and a load pair
> for this to be
> > +considered a candidate to avoid when using
> aarch64-avoid-store-forwarding.
> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > index a9a244ab6d6..7639b50358d 100644
> > --- a/gcc/config/aarch64/t-aarch64
> > +++ b/gcc/config/aarch64/t-aarch64
> > @@ -176,6 +176,16 @@ aarch64-cc-fusion.o:
> $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> > $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >
> > +aarch64-store-forwarding.o: \
> > + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > + $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h
> $(RTL_BASE_H) \
> > + dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H)
> $(RECOG_H) \
> > + output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > + $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > + $(srcdir)/config/aarch64/aarch64-protos.h
> > + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > +
> > comma=,
> > MULTILIB_OPTIONS = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst
> $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> > MULTILIB_DIRNAMES = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 2caa7ceec40..ca30f52109a 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -792,7 +792,7 @@ Objective-C and Objective-C++ Dialects}.
> > -moverride=@var{string} -mverbose-cost-dump
> > -mstack-protector-guard=@var{guard}
> -mstack-protector-guard-reg=@var{sysreg}
> > -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > --moutline-atomics }
> > +-moutline-atomics -mavoid-store-forwarding}
> >
> > @emph{Adapteva Epiphany Options}
> > @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> > @@ -16583,6 +16583,11 @@ With @option{--param=aarch64-stp-policy=never},
> do not emit stp.
> > With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> > source pointer is aligned to at least double the alignment of the type.
> >
> > +@item aarch64-store-forwarding-threshold
> > +Maximum allowed instruction distance between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +aarch64-avoid-store-forwarding.
> > +
> > @item aarch64-loop-vect-issue-rate-niters
> > The tuning for some AArch64 CPUs tries to take both latencies and issue
> > rates into account when deciding whether a loop should be vectorized
> > @@ -20558,6 +20563,11 @@ Generate code which uses only the
> general-purpose registers. This will prevent
> > the compiler from using floating-point and Advanced SIMD registers but
> will not
> > impose any restrictions on the assembler.
> >
> > +@item -mavoid-store-forwarding
> > +@itemx -mno-avoid-store-forwarding
> > +Avoid store forwarding to load pairs.
> > +@option{-mavoid-store-forwarding} at levels @option{-O2}, @option{-O3}.
> > +
> > @opindex mlittle-endian
> > @item -mlittle-endian
> > Generate little-endian code. This is the default when GCC is
> configured for an
> > diff --git
> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > new file mode 100644
> > index 00000000000..b77de6c64b6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Different address, same offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr,
> TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> > + TYPE r, y; \
> > + st_arr[0] = i; \
> > + ld_arr[0] = dummy; \
> > + r = st_arr_2[0]; \
> > + y = st_arr_2[1]; \
> > + return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git
> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > new file mode 100644
> > index 00000000000..f1b3a66abfd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, different offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > + TYPE r, y; \
> > + st_arr[0] = i; \
> > + ld_arr[0] = dummy; \
> > + r = st_arr[10]; \
> > + y = st_arr[11]; \
> > + return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > new file mode 100644
> > index 00000000000..8d5ce5cc87e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, same offset, overlap */
> > +
> > +#define LDP_SSLL_OVERLAP(TYPE) \
> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > + TYPE r, y; \
> > + st_arr[0] = i; \
> > + ld_arr[0] = dummy; \
> > + r = st_arr[0]; \
> > + y = st_arr[1]; \
> > + return r + y; \
> > +}
> > +
> > +LDP_SSLL_OVERLAP(uint32_t)
> > +LDP_SSLL_OVERLAP(uint64_t)
> > +LDP_SSLL_OVERLAP(int32_t)
> > +LDP_SSLL_OVERLAP(int64_t)
> > +LDP_SSLL_OVERLAP(int)
> > +LDP_SSLL_OVERLAP(long)
> > +LDP_SSLL_OVERLAP(float)
> > +LDP_SSLL_OVERLAP(double)
> > +LDP_SSLL_OVERLAP(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > --
> > 2.40.1
>
As for the rest of your points, they should be trivial to address.
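For reference, the new option and param can be exercised with an invocation
along these lines (the input file name is just a placeholder; 20 is the
default value of the param in this patch, and the dump flag follows the usual
-fdump-rtl-<passname> scheme for the pass named "avoid_store_forwarding"):

  gcc -O2 -S -mavoid-store-forwarding \
      --param=aarch64-store-forwarding-threshold=20 \
      -fdump-rtl-avoid_store_forwarding test.c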
@@ -2488,7 +2488,7 @@ offset_overlap_p (poly_int64 c, poly_int64 xsize, poly_int64 ysize)
one for X + non-constant and Y + non-constant when X and Y are equal.
If that is fixed the TBAA hack for union type-punning can be removed. */
-static int
+int
memrefs_conflict_p (poly_int64 xsize, rtx x, poly_int64 ysize, rtx y,
poly_int64 c)
{
@@ -41,6 +41,7 @@ bool alias_ptr_types_compatible_p (tree, tree);
int compare_base_decls (tree, tree);
bool refs_same_for_tbaa_p (tree, tree);
bool mems_same_for_tbaa_p (rtx, rtx);
+int memrefs_conflict_p (poly_int64, rtx, poly_int64, rtx, poly_int64);
/* This alias set can be used to force a memory to conflict with all
other memories, creating a barrier across which no memory reference
@@ -350,6 +350,7 @@ aarch64*-*-*)
cxx_target_objs="aarch64-c.o"
d_target_objs="aarch64-d.o"
extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
+ extra_objs="${extra_objs} aarch64-store-forwarding.o"
target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
target_has_targetm_common=yes
;;
@@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
+INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
@@ -1049,6 +1049,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
+rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
poly_uint64 aarch64_regmode_natural_size (machine_mode);
new file mode 100644
@@ -0,0 +1,347 @@
+/* Avoid store forwarding optimization pass.
+ Copyright (C) 2015-2023 Free Software Foundation, Inc.
+ Contributed by VRULL GmbH.
+
+ This file is part of GCC.
+
+ GCC is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 3, or (at your option)
+ any later version.
+
+ GCC is distributed in the hope that it will be useful, but
+ WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with GCC; see the file COPYING3. If not see
+ <http://www.gnu.org/licenses/>. */
+
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#define INCLUDE_VECTOR
+#define INCLUDE_MAP
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "alias.h"
+#include "rtlanal.h"
+#include "tree-pass.h"
+#include "cselib.h"
+
+/* This is an RTL pass that detects store forwarding from stores to larger
+ loads (load pairs). For example, it can transform cases like
+
+ str d5, [sp, #320]
+ fmul d5, d31, d29
+ ldp d31, d17, [sp, #312] # Large load from small store
+
+ to
+
+ str d5, [sp, #320]
+ fmul d5, d31, d29
+ ldr d31, [sp, #312]
+ ldr d17, [sp, #320]
+
+   Design: The pass follows a straightforward design.  It starts by
+   initializing the alias analysis and the cselib library.  Both of these are
+   used to find stores and larger loads with overlapping addresses, which are
+   candidates for the store-forwarding optimization.  It then scans each
+   basic block to find stores that forward to larger loads and handles them
+   as described in the example above.  Finally, the alias analysis and the
+   cselib library are finalized. */
+
+typedef std::vector<std::pair<rtx_insn *, rtx>> vec_rtx_pair;
+typedef std::map<rtx, uint32_t> map_rtx_dist;
+
+/* Statistics counters. */
+static unsigned int stats_store_count = 0;
+static unsigned int stats_ldp_count = 0;
+static unsigned int stats_ssll_count = 0;
+static unsigned int stats_transformed_count = 0;
+
+/* Default for the MEM output arguments of is_store and is_load below.  */
+static rtx dummy;
+static bool is_store (rtx expr, rtx &op_0=dummy);
+static bool is_load (rtx expr, rtx &op_1=dummy);
+
+/* Return true if EXPR is a store; otherwise false. OP_0 will contain the MEM
+ operand of the store. */
+
+static bool
+is_store (rtx expr, rtx &op_0)
+{
+ op_0 = XEXP (expr, 0);
+
+ if (GET_CODE (op_0) == ZERO_EXTEND
+ || GET_CODE (op_0) == SIGN_EXTEND)
+ op_0 = XEXP (op_0, 0);
+
+ return MEM_P (op_0);
+}
+
+/* Return true if EXPR is a load; otherwise false. OP_1 will contain the MEM
+ operand of the load. */
+
+static bool
+is_load (rtx expr, rtx &op_1)
+{
+ op_1 = XEXP (expr, 1);
+
+ if (GET_CODE (op_1) == ZERO_EXTEND
+ || GET_CODE (op_1) == SIGN_EXTEND)
+ op_1 = XEXP (op_1, 0);
+
+ return MEM_P (op_1);
+}
+
+/* Check the maximum instruction distance threshold and if crossed, remove the
+   store from STORE_EXPRS. DISTANCES contains the instruction distances. */
+
+static void
+update_store_load_distances (vec_rtx_pair &store_exprs, map_rtx_dist &distances)
+{
+ for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
+ {
+ (distances[(*iter).second])++;
+ iter = distances[(*iter).second]
+ > (uint32_t) aarch64_store_forwarding_threshold_param
+ ? store_exprs.erase (iter) : iter + 1;
+ }
+}
+
+/* Return true if STORE_MEM is forwarding to LOAD_MEM; otherwise false. */
+
+static bool
+is_forwarding (rtx store_mem, rtx load_mem)
+{
+ gcc_checking_assert (MEM_P (store_mem) && MEM_P (load_mem));
+
+ rtx store_mem_addr = canon_rtx (get_addr (XEXP (store_mem, 0)));
+ rtx load_mem_addr = canon_rtx (get_addr (XEXP (load_mem, 0)));
+
+ return (memrefs_conflict_p (GET_MODE_SIZE (GET_MODE (store_mem)),
+ store_mem_addr,
+ GET_MODE_SIZE (GET_MODE (load_mem)),
+ load_mem_addr, 0) == 1
+ || rtx_equal_for_cselib_1 (store_mem_addr,
+ load_mem_addr,
+ GET_MODE (store_mem), 0));
+}
+
+/* Return true if INSN is a load pair, preceded by a store forwarding to it;
+ otherwise false. STORE_EXPRS contains the stores. */
+
+static bool
+is_small_store_to_large_load (vec_rtx_pair &store_exprs, rtx_insn *insn)
+{
+ unsigned int load_count = 0;
+ bool forwarding = false;
+ rtx expr = PATTERN (insn);
+
+ if (GET_CODE (expr) != PARALLEL
+ || XVECLEN (expr, 0) != 2)
+ return false;
+
+ for (int i = 0; i < XVECLEN (expr, 0); i++)
+ {
+ rtx op_1;
+ rtx out_exp = XVECEXP (expr, 0, i);
+
+ if (GET_CODE (out_exp) != SET)
+ continue;
+
+ if (!is_load (out_exp, op_1))
+ {
+ load_count = 0;
+ continue;
+ }
+
+ load_count++;
+
+ for (std::pair<rtx_insn *, rtx> str : store_exprs)
+ {
+ if (!is_forwarding (XEXP (str.second, 0), op_1))
+ continue;
+
+ if (dump_file)
+ {
+ fprintf (dump_file,
+ "Store forwarding to PARALLEL with loads:\n");
+ fprintf (dump_file, " From: ");
+ print_rtl_single (dump_file, str.first);
+ fprintf (dump_file, " To: ");
+ print_rtl_single (dump_file, insn);
+ }
+
+ forwarding = true;
+ }
+ }
+
+ if (load_count == 2)
+ stats_ldp_count++;
+
+ return load_count == 2 && forwarding;
+}
+
+/* If a register used in the address of a tracked store is overwritten by
+   EXPR before the larger load, we must not handle that case, so remove the
+   store from STORE_EXPRS. */
+
+static void
+check_memory_reg_overwrite (rtx expr, vec_rtx_pair &store_exprs)
+{
+ if (REG_P (XEXP (expr, 0)))
+ for (auto iter = store_exprs.begin (); iter < store_exprs.end ();)
+ iter = reg_mentioned_p (XEXP (expr, 0), XEXP ((*iter).second, 0))
+ ? store_exprs.erase (iter) : iter + 1;
+}
+
+/* Break a load pair into its two distinct loads, except if the base source
+   address to load from is overwritten in the first load. INSN should be the
+   PARALLEL of the load pair. */
+
+static void
+break_ldp (rtx_insn *insn)
+{
+ rtx expr = PATTERN (insn);
+
+ gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
+
+ rtx load_0 = XVECEXP (expr, 0, 0);
+ rtx load_1 = XVECEXP (expr, 0, 1);
+
+ gcc_checking_assert (is_load (load_0) && is_load (load_1));
+
+  /* The base address was overwritten in the first load. */
+ if (reg_mentioned_p (XEXP (load_0, 0), XEXP (load_1, 1)))
+ return;
+
+ emit_insn_after (load_1, insn);
+ emit_insn_after (load_0, insn);
+ remove_insn (insn);
+
+ stats_transformed_count++;
+}
+
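+/* Scan each basic block, tracking recently seen stores and splitting any
+   load pair that one of those stores forwards to.  */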
+static void
+scan_and_transform_bb_level ()
+{
+ rtx_insn *insn, *next;
+ basic_block bb;
+ FOR_EACH_BB_FN (bb, cfun)
+ {
+ vec_rtx_pair store_exprs;
+ map_rtx_dist store_cnt_byte_dist;
+ for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
+ {
+ next = NEXT_INSN (insn);
+
+	  /* If we cross a CALL_P insn, clear the vectors, because a
+	     small-store-to-large-load across a call is unlikely to cause a
+	     performance difference. */
+ if (CALL_P (insn))
+ {
+ store_exprs.clear ();
+ store_cnt_byte_dist.clear ();
+ }
+
+ if (!NONJUMP_INSN_P (insn))
+ continue;
+
+ cselib_process_insn (insn);
+
+ rtx expr = PATTERN (insn);
+
+ if (GET_CODE (expr) == SET)
+ {
+ /* If a store is encountered, push it to the store_exprs vector
+ to check it later. */
+ if (is_store (expr))
+ {
+ store_exprs.push_back ({insn, expr});
+
+ /* The store insn shouldn't have been encountered before. */
+ gcc_checking_assert (store_cnt_byte_dist.find (expr)
+ == store_cnt_byte_dist.end ());
+ store_cnt_byte_dist.insert ({ expr, 0 });
+ stats_store_count++;
+ }
+
+ check_memory_reg_overwrite (expr, store_exprs);
+ }
+
+ /* Check for small-store-to-large-load. */
+ if (is_small_store_to_large_load (store_exprs, insn))
+ {
+ stats_ssll_count++;
+ break_ldp (insn);
+ }
+
+ update_store_load_distances (store_exprs, store_cnt_byte_dist);
+ }
+ }
+}
+
+static void
+execute_avoid_store_forwarding ()
+{
+ init_alias_analysis ();
+ cselib_init (0);
+ scan_and_transform_bb_level ();
+ end_alias_analysis ();
+ cselib_finish ();
+ statistics_counter_event (cfun, "Number of stores identified: ",
+ stats_store_count);
+ statistics_counter_event (cfun, "Number of load pairs identified: ",
+ stats_ldp_count);
+ statistics_counter_event (cfun,
+ "Number of forwarding cases identified: ",
+ stats_ssll_count);
+  statistics_counter_event (cfun, "Number of transformed cases: ",
+ stats_transformed_count);
+}
+
+const pass_data pass_data_avoid_store_forwarding =
+{
+ RTL_PASS, /* type. */
+ "avoid_store_forwarding", /* name. */
+ OPTGROUP_NONE, /* optinfo_flags. */
+ TV_NONE, /* tv_id. */
+ 0, /* properties_required. */
+ 0, /* properties_provided. */
+ 0, /* properties_destroyed. */
+ 0, /* todo_flags_start. */
+ 0 /* todo_flags_finish. */
+};
+
+class pass_avoid_store_forwarding : public rtl_opt_pass
+{
+public:
+ pass_avoid_store_forwarding (gcc::context *ctxt)
+ : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
+ {}
+
+ /* opt_pass methods: */
+ virtual bool gate (function *)
+ {
+ return aarch64_flag_avoid_store_forwarding && optimize >= 2;
+ }
+
+ virtual unsigned int execute (function *)
+ {
+ execute_avoid_store_forwarding ();
+ return 0;
+ }
+
+}; // class pass_avoid_store_forwarding
+
+/* Create a new avoid store forwarding pass instance. */
+
+rtl_opt_pass *
+make_pass_avoid_store_forwarding (gcc::context *ctxt)
+{
+ return new pass_avoid_store_forwarding (ctxt);
+}
@@ -304,6 +304,10 @@ moutline-atomics
Target Var(aarch64_flag_outline_atomics) Init(2) Save
Generate local calls to out-of-line atomic operations.
+mavoid-store-forwarding
+Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
+Avoid store forwarding to load pairs.
+
-param=aarch64-sve-compare-costs=
Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach. Also use the cost model to choose between SVE and Advanced SIMD vectorization.
@@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
EnumValue
Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
+
+-param=aarch64-store-forwarding-threshold=
+Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
+Maximum instruction distance allowed between a store and a load pair for the pair to be
+considered a store-forwarding candidate when using -mavoid-store-forwarding.
@@ -176,6 +176,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
$(srcdir)/config/aarch64/aarch64-cc-fusion.cc
+aarch64-store-forwarding.o: \
+ $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
+ $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
+ dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
+ output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
+ $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
+ $(srcdir)/config/aarch64/aarch64-protos.h
+ $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+ $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
+
comma=,
MULTILIB_OPTIONS = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
MULTILIB_DIRNAMES = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
@@ -792,7 +792,7 @@ Objective-C and Objective-C++ Dialects}.
-moverride=@var{string} -mverbose-cost-dump
-mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
-mstack-protector-guard-offset=@var{offset} -mtrack-speculation
--moutline-atomics }
+-moutline-atomics -mavoid-store-forwarding}
@emph{Adapteva Epiphany Options}
@gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
@@ -16583,6 +16583,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
source pointer is aligned to at least double the alignment of the type.
+@item aarch64-store-forwarding-threshold
+Maximum allowed instruction distance between a store and a load pair for
+the pair to be considered a store-forwarding candidate when using
+@option{-mavoid-store-forwarding}.
+
@item aarch64-loop-vect-issue-rate-niters
The tuning for some AArch64 CPUs tries to take both latencies and issue
rates into account when deciding whether a loop should be vectorized
@@ -20558,6 +20563,11 @@ Generate code which uses only the general-purpose registers. This will prevent
the compiler from using floating-point and Advanced SIMD registers but will not
impose any restrictions on the assembler.
+@item -mavoid-store-forwarding
+@itemx -mno-avoid-store-forwarding
+Avoid store forwarding to load pairs.
+@option{-mavoid-store-forwarding} only takes effect at optimization levels @option{-O2} and @option{-O3}.
+
@opindex mlittle-endian
@item -mlittle-endian
Generate little-endian code. This is the default when GCC is configured for an
new file mode 100644
@@ -0,0 +1,33 @@
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Different address, same offset, no overlap */
+
+#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
+TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
+ TYPE r, y; \
+ st_arr[0] = i; \
+ ld_arr[0] = dummy; \
+ r = st_arr_2[0]; \
+ y = st_arr_2[1]; \
+ return r + y; \
+}
+
+LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int)
+LDP_SSLL_NO_OVERLAP_ADDRESS(long)
+LDP_SSLL_NO_OVERLAP_ADDRESS(float)
+LDP_SSLL_NO_OVERLAP_ADDRESS(double)
+LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
new file mode 100644
@@ -0,0 +1,33 @@
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Same address, different offset, no overlap */
+
+#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
+TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
+ TYPE r, y; \
+ st_arr[0] = i; \
+ ld_arr[0] = dummy; \
+ r = st_arr[10]; \
+ y = st_arr[11]; \
+ return r + y; \
+}
+
+LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int)
+LDP_SSLL_NO_OVERLAP_OFFSET(long)
+LDP_SSLL_NO_OVERLAP_OFFSET(float)
+LDP_SSLL_NO_OVERLAP_OFFSET(double)
+LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
new file mode 100644
@@ -0,0 +1,33 @@
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Same address, same offset, overlap */
+
+#define LDP_SSLL_OVERLAP(TYPE) \
+TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
+ TYPE r, y; \
+ st_arr[0] = i; \
+ ld_arr[0] = dummy; \
+ r = st_arr[0]; \
+ y = st_arr[1]; \
+ return r + y; \
+}
+
+LDP_SSLL_OVERLAP(uint32_t)
+LDP_SSLL_OVERLAP(uint64_t)
+LDP_SSLL_OVERLAP(int32_t)
+LDP_SSLL_OVERLAP(int64_t)
+LDP_SSLL_OVERLAP(int)
+LDP_SSLL_OVERLAP(long)
+LDP_SSLL_OVERLAP(float)
+LDP_SSLL_OVERLAP(double)
+LDP_SSLL_OVERLAP(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */