diff mbox series

[V2] VECT: Add SELECT_VL support

Message ID	20230605035741.613909-1-juzhe.zhong@rivai.ai
State	Unresolved
Headers	Received-SPF: pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as permitted sender) client-ip=8.43.85.97; DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5CECF3858D28 From: juzhe.zhong@rivai.ai To: gcc-patches@gcc.gnu.org Cc: richard.sandiford@arm.com, rguenther@suse.de, Ju-Zhe Zhong <juzhe.zhong@rivai.ai> Subject: [PATCH V2] VECT: Add SELECT_VL support Date: Mon, 5 Jun 2023 11:57:41 +0800 Message-Id: <20230605035741.613909-1-juzhe.zhong@rivai.ai> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Feedback-ID: bizesmtp:rivai.ai:qybglogicsvrgz:qybglogicsvrgz7a-one-0 Precedence: list Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org> X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
Series	[V2] VECT: Add SELECT_VL support \| [V2] VECT: Add SELECT_VL support

Checks

Context	Check	Description
snail/gcc-patch-check	warning	Git am fail log

Commit Message

juzhe.zhong@rivai.ai June 5, 2023, 3:57 a.m. UTC

  From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

This patch address comments from Richard and rebase to trunk.

This patch is adding SELECT_VL middle-end support
allow target have target dependent optimization in case of
length calculation.

This patch is inspired by RVV ISA and LLVM:
https://reviews.llvm.org/D99750

The SELECT_VL is same behavior as LLVM "get_vector_length" with
these following properties:

1. Only apply on single-rgroup.
2. non SLP.
3. adjust loop control IV.
4. adjust data reference IV.
5. allow non-vf elements processing in non-final iteration

Code:
   # void vvaddint32(size_t n, const int*x, const int*y, int*z)
    # { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }

Take RVV codegen for example:

Before this patch:
vvaddint32:
        ble     a0,zero,.L6
        csrr    a4,vlenb
        srli    a6,a4,2
.L4:
        mv      a5,a0
        bleu    a0,a6,.L3
        mv      a5,a6
.L3:
        vsetvli zero,a5,e32,m1,ta,ma
        vle32.v v2,0(a1)
        vle32.v v1,0(a2)
        vsetvli a7,zero,e32,m1,ta,ma
        sub     a0,a0,a5
        vadd.vv v1,v1,v2
        vsetvli zero,a5,e32,m1,ta,ma
        vse32.v v1,0(a3)
        add     a2,a2,a4
        add     a3,a3,a4
        add     a1,a1,a4
        bne     a0,zero,.L4
.L6:
        ret

After this patch:

vvaddint32:
    vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
    vle32.v v0, (a1)         # Get first vector
      sub a0, a0, t0         # Decrement number done
      slli t0, t0, 2         # Multiply number done by 4 bytes
      add a1, a1, t0         # Bump pointer
    vle32.v v1, (a2)         # Get second vector
      add a2, a2, t0         # Bump pointer
    vadd.vv v2, v0, v1       # Sum vectors
    vse32.v v2, (a3)         # Store result
      add a3, a3, t0         # Bump pointer
      bnez a0, vvaddint32    # Loop back
      ret                    # Finished

gcc/ChangeLog:

        * doc/md.texi: Add SELECT_VL support.
        * internal-fn.def (SELECT_VL): Ditto.
        * optabs.def (OPTAB_D): Ditto.
        * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto.
        * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Ditto.
        * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): Ditto.
        (vectorizable_store): Ditto.
        (vectorizable_load): Ditto.
        * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): Ditto.

---
 gcc/doc/md.texi             | 22 +++++++++++++
 gcc/internal-fn.def         |  1 +
 gcc/optabs.def              |  1 +
 gcc/tree-vect-loop-manip.cc | 32 +++++++++++++-----
 gcc/tree-vect-loop.cc       | 48 +++++++++++++++++++++++++++
 gcc/tree-vect-stmts.cc      | 66 +++++++++++++++++++++++++++++++++++++
 gcc/tree-vectorizer.h       |  6 ++++
 7 files changed, 167 insertions(+), 9 deletions(-)

Comments

Richard Sandiford June 5, 2023, 6:21 a.m. UTC | #1

juzhe.zhong@rivai.ai writes:
> +  /* If we're using decrement IV approach in loop control, we can use output of
> +     SELECT_VL to adjust IV of loop control and data reference when it satisfies
> +     the following checks:
> +
> +     (a) SELECT_VL is supported by the target.
> +     (b) LOOP_VINFO is single-rgroup control.
> +     (c) non-SLP.
> +     (d) LOOP can not be unrolled.
> +
> +     Otherwise, we use MIN_EXPR approach.
> +
> +     1. We only apply SELECT_VL on single-rgroup since:
> +
> +     (1). Multiple-rgroup controls N vector loads/stores would need N pointer
> +	  updates by variable amounts.
> +     (2). SELECT_VL allows flexible length (<=VF) in each iteration.
> +     (3). For decrement IV approach, we calculate the MAX length of the loop
> +	  and then deduce the length of each control from this MAX length.
> +
> +     Base on (1), (2) and (3) situations, if we try to use SELECT_VL on
> +     multiple-rgroup control, we need to generate multiple SELECT_VL to
> +     carefully adjust length of each control.

If we use SELECT_VL to refer only to the target-independent ifn, I don't
see why this last bit is true.  Like I said in the previous message,
when it comes to determining the length of each control, the approach we
take for MIN_EXPR IVs should work for SELECT_VL IVs.  The point is that,
in both cases, any inactive lanes are always the last lanes.

E.g. suppose that, for one particular iteration, SELECT_VL decides that
6 lanes should be active in a loop with VF==8.  If there is a 2-control
rgroup with 4 lanes each, the first control must be 4 and the second
control must be 2, just as if a MIN_EXPR had decided that 6 lanes of
the final iteration are active.

I'm not saying the decision itself is wrong.  But I think the explanation
could be clearer.

> +     Such approach is very inefficient
> +     and unprofitable for targets that are using a standalone instruction
> +     to configure the length of each operation.
> +     E.g. RISC-V vector use 'vsetvl' to configure the length of each operation.

What I don't understand is why this isn't also a problem with the
fallback MIN_EXPR approach.  That is, with the same example as above,
but using MIN_EXPR IVs, I would have expected:

  VF == 8

  1-control rgroup "A":
    A set by MIN_EXPR IV

  2-control rgroup "B1", "B2":
    B1 = MIN (A, 4)
    B2 = A - B1

and so the vectors controlled by A, B1 and B2 would all have different
lengths.

Is the point that, when using MIN_EXPR, this only happens in the
final iteration?  And that you use a tail/epilogue loop for that,
so that the main loop body operates on full vectors only?

Thanks,
Richard

juzhe.zhong@rivai.ai June 5, 2023, 7:23 a.m. UTC | #2

Hi, Richard. Thanks for the comments.

>> If we use SELECT_VL to refer only to the target-independent ifn, I don't
>> see why this last bit is true.
Could you give me more details and information about this since I am not sure whether I catch up with you. 
You mean the current SELECT_VL is not an appropriate IFN?

>>Like I said in the previous message,
>>when it comes to determining the length of each control, the approach we
>>take for MIN_EXPR IVs should work for SELECT_VL IVs.  The point is that,
>>in both cases, any inactive lanes are always the last lanes.
>>E.g. suppose that, for one particular iteration, SELECT_VL decides that
>>6 lanes should be active in a loop with VF==8.  If there is a 2-control
>>rgroup with 4 lanes each, the first control must be 4 and the second
>>control must be 2, just as if a MIN_EXPR had decided that 6 lanes of
>>the final iteration are active.
>>What I don't understand is why this isn't also a problem with the
>>fallback MIN_EXPR approach.  That is, with the same example as above,
>>but using MIN_EXPR IVs, I would have expected:
>>  VF == 8
>>  1-control rgroup "A":
>>    A set by MIN_EXPR IV
>>  2-control rgroup "B1", "B2":
>>    B1 = MIN (A, 4)
>>    B2 = A - B1
>>and so the vectors controlled by A, B1 and B2 would all have different
>>lengths.
>>Is the point that, when using MIN_EXPR, this only happens in the
>>final iteration?  And that you use a tail/epilogue loop for that,
>>so that the main loop body operates on full vectors only?
In general, I think your description is correct and  comprehensive. 
I'd like to share more my understanding to make sure we are on the same page.

Take the example as you said:

FOR one  particular iteration, SELECT_VL decides that 6 lanes should be active in a loop with VF==8.
and 2-control rgroup with 4 lanes each
which means:

VF = 8;
each control VF = 4;

Total length = SELECT_VL(or MIN_EXPR) (remain, 8)
Then, IMHO, we can have 3 solutions to deduce the length of 2-control base on current flow we already built

Also, let me share "vsetvl" ISA spec:
ceil(AVL / 2) ≤ vl ≤ VF if  VF <AVL < (2 * VF)
"vl" is the number of the elements we need to process, "avl" = the actual number of elements we will process in the current iteration

Solution 1:

Total length = SELECT_VL (remain, 8) ===> suppose Total length value = 6

control 1 length = SELECT_VL (Total length, 4) ===> If we use "vsetvl" intruction to get the control 1 length,
 it can be 3 or 4, since RVV ISA: ceil(AVL / 2) ≤ vl ≤ VF if AVL < (2 * VF), the outcome of SELECT_VL may be Total length / 2 = 3
Depending on the hardware implementation of "vsetvli", Let's say some RVV CPU likes "even distribution" the outcome = 3

control 2 length = Total length - control 1 length  ===> 6 - 3 = 3 (if control 1 = 3) or 6 - 4 = 2 (if control 1 = 4) .

Since RVV ISA gives the flexible definition of "vsetvli", we will end up with this deduction.

Solution 2:

Total length = SELECT_VL (remain, 8)  ====> 6
control 1 length = MIN_EXPR (Total length, 4)  ====> since use MIN, so always 4
control 2 length = Total length - control 1 length  ===> 6 - 4 = 2

Solution 3 (Current flow):

Total length = MIN_EXPR  (remain, 8)  ====> 6 only when the remain  = 6 in tail/epilogue, otherwise, it always be 8 in loop body.
control 1 length = MIN_EXPR (Total length, 4)  ====> since use MIN, so always 4
control 2 length = Total length - control 1 length  ===> Total length -  4

I'd like to say these 3 solutions all work for RVV. 
However, RVV length configuration unlike IBM or ARM SVE using a mask. (I would like to say mask or length they are the same thing, use for control of each operations).
For example, ARM SVE has 8 mask registers, whenever it generate a mask, it can be include in use list in the instructions, since ARM SVE use encoding to specify the mask
register.

For example:
If we are using solution 1 in a target that control by length and length is specified in general registers, we can simulate the codegen as below.

max length = select_vl (vf=8)
length 1 = select_vl (vf=4)
length 2 = max length - length 1
...
load (...use general register which storing length 1 let's said r0, r0 is specified in the load encoding)
...
load (...use general register which storing length 2 let's said r1, r1 is specified in the load encoding)
....

However, for RVV, we don't specify the length in the instructions encoding.
Instead, we have only one VL register, and every time we want to change the length, we need"vsetvli"

So for solution 1, we will have:

max length = vsetvli (vf=8)
length 1 = vsetlvi (vf=4)
length 2 = max length = length 1
...
vsetvli zero, length 1 <======insert by "VSETVL" PASS of RISC-V backend
load....
vsetvli zero, length 2 <======insert by "VSETVL" PASS of RISC-V backend
load....

"vsetlvi" instruction is the instruction much more expensive than the general scalar instruction (for example "min" is much cheaper than "vsetvli").
So I am 100% sure that solution 3 (current MIN flow in GCC) is much better than above:

max length = min (vf=8) ===> replaced "vsetli" by "min"
length 1 = min (vf=4) ===> replaced "vsetli" by "min"
length 2 = max length = length 1
...
vsetvli zero, length 1 <======insert by "VSETVL" PASS of RISC-V backend
load....
vsetvli zero, length 2 <======insert by "VSETVL" PASS of RISC-V backend
load....

This is much better than Solution 3 and avoid multiple switching of "VL" register by "vsetvli"

Ok, you may want ask if "min" is much cheaper than "vsetvli", why we need SELECT_VL?
The reason is I want to optimize the special case (single-rgoup), since rgroup is just using a single length, 
unlike multiple-rgroup control which has multiple length calculation statement:

Current flow of single-rgoup:

...
length = min (vf)
...
vsetvli zero. length <=== insert by VSETLVI PASS
load (pointer IV)
vadd.
...
pointer IV = pointer IV + VF

I want to optimize it into:

...
length = vsetvli (Vf)
... <=== not need to insert vsetvlli.
load (pointer IV)
vadd.
...
pointer IV = pointer IV + length (adjust in bytesize).

This flow is the same as RVV ISA and LLVM. 
And also base on "vsetvli" definition, we can allow "even distribution" in the last iterations.

Hope my description is clear, feel free to comment.
Thanks so much.

juzhe.zhong@rivai.ai

From: Richard Sandiford
Date: 2023-06-05 14:21
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V2] VECT: Add SELECT_VL support
juzhe.zhong@rivai.ai writes:
> +  /* If we're using decrement IV approach in loop control, we can use output of
> +     SELECT_VL to adjust IV of loop control and data reference when it satisfies
> +     the following checks:
> +
> +     (a) SELECT_VL is supported by the target.
> +     (b) LOOP_VINFO is single-rgroup control.
> +     (c) non-SLP.
> +     (d) LOOP can not be unrolled.
> +
> +     Otherwise, we use MIN_EXPR approach.
> +
> +     1. We only apply SELECT_VL on single-rgroup since:
> +
> +     (1). Multiple-rgroup controls N vector loads/stores would need N pointer
> +   updates by variable amounts.
> +     (2). SELECT_VL allows flexible length (<=VF) in each iteration.
> +     (3). For decrement IV approach, we calculate the MAX length of the loop
> +   and then deduce the length of each control from this MAX length.
> +
> +     Base on (1), (2) and (3) situations, if we try to use SELECT_VL on
> +     multiple-rgroup control, we need to generate multiple SELECT_VL to
> +     carefully adjust length of each control.

If we use SELECT_VL to refer only to the target-independent ifn, I don't
see why this last bit is true.  Like I said in the previous message,
when it comes to determining the length of each control, the approach we
take for MIN_EXPR IVs should work for SELECT_VL IVs.  The point is that,
in both cases, any inactive lanes are always the last lanes.

E.g. suppose that, for one particular iteration, SELECT_VL decides that
6 lanes should be active in a loop with VF==8.  If there is a 2-control
rgroup with 4 lanes each, the first control must be 4 and the second
control must be 2, just as if a MIN_EXPR had decided that 6 lanes of
the final iteration are active.

I'm not saying the decision itself is wrong.  But I think the explanation
could be clearer.

> +     Such approach is very inefficient
> +     and unprofitable for targets that are using a standalone instruction
> +     to configure the length of each operation.
> +     E.g. RISC-V vector use 'vsetvl' to configure the length of each operation.

What I don't understand is why this isn't also a problem with the
fallback MIN_EXPR approach.  That is, with the same example as above,
but using MIN_EXPR IVs, I would have expected:

  VF == 8

  1-control rgroup "A":
    A set by MIN_EXPR IV

  2-control rgroup "B1", "B2":
    B1 = MIN (A, 4)
    B2 = A - B1

and so the vectors controlled by A, B1 and B2 would all have different
lengths.

Is the point that, when using MIN_EXPR, this only happens in the
final iteration?  And that you use a tail/epilogue loop for that,
so that the main loop body operates on full vectors only?

Thanks,
Richard

Richard Sandiford June 5, 2023, 7:54 a.m. UTC | #3

"juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> Hi, Richard. Thanks for the comments.
>
>>> If we use SELECT_VL to refer only to the target-independent ifn, I don't
>>> see why this last bit is true.
> Could you give me more details and information about this since I am not sure whether I catch up with you. 
> You mean the current SELECT_VL is not an appropriate IFN?

No, I meant that the comment I quoted seemed to be saying that solution
3 wasn't possible.  The comment seemed to say that we would need to do
solution 1.

>>>Like I said in the previous message,
>>>when it comes to determining the length of each control, the approach we
>>>take for MIN_EXPR IVs should work for SELECT_VL IVs.  The point is that,
>>>in both cases, any inactive lanes are always the last lanes.
>>>E.g. suppose that, for one particular iteration, SELECT_VL decides that
>>>6 lanes should be active in a loop with VF==8.  If there is a 2-control
>>>rgroup with 4 lanes each, the first control must be 4 and the second
>>>control must be 2, just as if a MIN_EXPR had decided that 6 lanes of
>>>the final iteration are active.
>>>What I don't understand is why this isn't also a problem with the
>>>fallback MIN_EXPR approach.  That is, with the same example as above,
>>>but using MIN_EXPR IVs, I would have expected:
>>>  VF == 8
>>>  1-control rgroup "A":
>>>    A set by MIN_EXPR IV
>>>  2-control rgroup "B1", "B2":
>>>    B1 = MIN (A, 4)
>>>    B2 = A - B1
>>>and so the vectors controlled by A, B1 and B2 would all have different
>>>lengths.
>>>Is the point that, when using MIN_EXPR, this only happens in the
>>>final iteration?  And that you use a tail/epilogue loop for that,
>>>so that the main loop body operates on full vectors only?
> In general, I think your description is correct and  comprehensive. 
> I'd like to share more my understanding to make sure we are on the same page.
>
> Take the example as you said:
>
> FOR one  particular iteration, SELECT_VL decides that 6 lanes should be active in a loop with VF==8.
> and 2-control rgroup with 4 lanes each
> which means:
>
> VF = 8;
> each control VF = 4;
>
> Total length = SELECT_VL(or MIN_EXPR) (remain, 8)
> Then, IMHO, we can have 3 solutions to deduce the length of 2-control base on current flow we already built
>
> Also, let me share "vsetvl" ISA spec:
> ceil(AVL / 2) ≤ vl ≤ VF if  VF <AVL < (2 * VF)
> "vl" is the number of the elements we need to process, "avl" = the actual number of elements we will process in the current iteration
>
> Solution 1:
>
> Total length = SELECT_VL (remain, 8) ===> suppose Total length value = 6
>
> control 1 length = SELECT_VL (Total length, 4) ===> If we use "vsetvl" intruction to get the control 1 length,
>  it can be 3 or 4, since RVV ISA: ceil(AVL / 2) ≤ vl ≤ VF if AVL < (2 * VF), the outcome of SELECT_VL may be Total length / 2 = 3
> Depending on the hardware implementation of "vsetvli", Let's say some RVV CPU likes "even distribution" the outcome = 3
>
> control 2 length = Total length - control 1 length  ===> 6 - 3 = 3 (if control 1 = 3) or 6 - 4 = 2 (if control 1 = 4) .
>
> Since RVV ISA gives the flexible definition of "vsetvli", we will end up with this deduction.

Yeah, this one wouldn't work, for reasons discussed previously.
I was thinking only about solutions 2 and 3.

> Solution 2:
>
> Total length = SELECT_VL (remain, 8)  ====> 6
> control 1 length = MIN_EXPR (Total length, 4)  ====> since use MIN, so always 4
> control 2 length = Total length - control 1 length  ===> 6 - 4 = 2
>
> Solution 3 (Current flow):
>
> Total length = MIN_EXPR  (remain, 8)  ====> 6 only when the remain  = 6 in tail/epilogue, otherwise, it always be 8 in loop body.
> control 1 length = MIN_EXPR (Total length, 4)  ====> since use MIN, so always 4
> control 2 length = Total length - control 1 length  ===> Total length -  4
>
> I'd like to say these 3 solutions all work for RVV. 
> However, RVV length configuration unlike IBM or ARM SVE using a mask. (I would like to say mask or length they are the same thing, use for control of each operations).
> For example, ARM SVE has 8 mask registers, whenever it generate a mask, it can be include in use list in the instructions, since ARM SVE use encoding to specify the mask
> register.
>
> For example:
> If we are using solution 1 in a target that control by length and length is specified in general registers, we can simulate the codegen as below.
>
> max length = select_vl (vf=8)
> length 1 = select_vl (vf=4)
> length 2 = max length - length 1
> ...
> load (...use general register which storing length 1 let's said r0, r0 is specified in the load encoding)
> ...
> load (...use general register which storing length 2 let's said r1, r1 is specified in the load encoding)
> ....
>
> However, for RVV, we don't specify the length in the instructions encoding.
> Instead, we have only one VL register, and every time we want to change the length, we need"vsetvli"
>
> So for solution 1, we will have:
>
> max length = vsetvli (vf=8)
> length 1 = vsetlvi (vf=4)
> length 2 = max length = length 1
> ...
> vsetvli zero, length 1 <======insert by "VSETVL" PASS of RISC-V backend
> load....
> vsetvli zero, length 2 <======insert by "VSETVL" PASS of RISC-V backend
> load....
>
> "vsetlvi" instruction is the instruction much more expensive than the general scalar instruction (for example "min" is much cheaper than "vsetvli").
> So I am 100% sure that solution 3 (current MIN flow in GCC) is much better than above:
>
> max length = min (vf=8) ===> replaced "vsetli" by "min"
> length 1 = min (vf=4) ===> replaced "vsetli" by "min"
> length 2 = max length = length 1
> ...
> vsetvli zero, length 1 <======insert by "VSETVL" PASS of RISC-V backend
> load....
> vsetvli zero, length 2 <======insert by "VSETVL" PASS of RISC-V backend
> load....

Well, it depends on *why* the loop has a 2-control rgroup.  There are two
possibilities:

(a) The riscv target has asked the vectoriser to unroll the loop 2 times
    (via the unrolling hook).  In this case there will be a 2-control rgroup
    but no 1-control rgroup.

(b) The loop operates on a mixture of data element sizes and the loop
    operates on fully-populated vectors.  In that case, there will be
    a 1-control rgroup for the narrowest element size and a 2-control
    rgroup for the next widest element size.

Your example describes what would happen for (a), whereas I was thinking
about (b).

For (b), there would be three controls in total, even for solution 3.
So there would be three vsetvlis rather than two.  That matches the
number of vsetvlis for solution 2.

When comparing solutions 2 and 3 for case (b), is solution 3 still better?
E.g. is "vsetvli zero" cheaper than "vsetvli <gpr>"?

Thanks,
Richard

Richard Sandiford June 5, 2023, 7:57 a.m. UTC | #4

Richard Sandiford <richard.sandiford@arm.com> writes:
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
>> Hi, Richard. Thanks for the comments.
>>
>>>> If we use SELECT_VL to refer only to the target-independent ifn, I don't
>>>> see why this last bit is true.
>> Could you give me more details and information about this since I am not sure whether I catch up with you. 
>> You mean the current SELECT_VL is not an appropriate IFN?
>
> No, I meant that the comment I quoted seemed to be saying that solution
> 3 wasn't possible.  The comment seemed to say that we would need to do
> solution 1.

Sorry, I meant solution 2 rather than solution 3.

juzhe.zhong@rivai.ai June 5, 2023, 8:08 a.m. UTC | #5

Hi, Richard.

>> No, I meant that the comment I quoted seemed to be saying that solution
>> 3 wasn't possible.  The comment seemed to say that we would need to do
>> solution 1.
I am so sorry that I didn't write the comments accurately.
Could you help me with comments ? Base on what we have discussed above (I think we are on same page now).
Hmmm. I am not the native English speaker, I use google translator for comments :).

>> When comparing solutions 2 and 3 for case (b), is solution 3 still better?
>> E.g. is "vsetvli zero" cheaper than "vsetvli <gpr>"?


"vsetvli zero" is the same cost as "vsetvli gpr", 

I think for (b),  solution 2 and solution 3 should be almost the same.




juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-06-05 15:57
To: juzhe.zhong\@rivai.ai
CC: gcc-patches; rguenther
Subject: Re: [PATCH V2] VECT: Add SELECT_VL support
Richard Sandiford <richard.sandiford@arm.com> writes:
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
>> Hi, Richard. Thanks for the comments.
>>
>>>> If we use SELECT_VL to refer only to the target-independent ifn, I don't
>>>> see why this last bit is true.
>> Could you give me more details and information about this since I am not sure whether I catch up with you. 
>> You mean the current SELECT_VL is not an appropriate IFN?
>
> No, I meant that the comment I quoted seemed to be saying that solution
> 3 wasn't possible.  The comment seemed to say that we would need to do
> solution 1.
 
Sorry, I meant solution 2 rather than solution 3.

Richard Sandiford June 5, 2023, 10:09 a.m. UTC | #6

"juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> Hi, Richard.
>
>>> No, I meant that the comment I quoted seemed to be saying that solution
>>> 3 wasn't possible.  The comment seemed to say that we would need to do
>>> solution 1.
> I am so sorry that I didn't write the comments accurately.
> Could you help me with comments ? Base on what we have discussed above (I think we are on same page now).

Yeah, agree we seem to be on the same page

>>> When comparing solutions 2 and 3 for case (b), is solution 3 still better?
>>> E.g. is "vsetvli zero" cheaper than "vsetvli <gpr>"?
>
>
> "vsetvli zero" is the same cost as "vsetvli gpr", 
>
> I think for (b),  solution 2 and solution 3 should be almost the same.

OK, thanks.  If we wanted to use solution 2 for (b), the condition
would be just:

  LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1

dropping the:

  LOOP_VINFO_LENS (loop_vinfo).length () == 1

But it would make the pointer IV updates more complex.  So let's
say that that's the reason for preferring solution 3.

So rather than:

+  /* If we're using decrement IV approach in loop control, we can use output of
+     SELECT_VL to adjust IV of loop control and data reference when it satisfies
+     the following checks:
+
+     (a) SELECT_VL is supported by the target.
+     (b) LOOP_VINFO is single-rgroup control.
+     (c) non-SLP.
+     (d) LOOP can not be unrolled.
+
+     Otherwise, we use MIN_EXPR approach.
+
+     1. We only apply SELECT_VL on single-rgroup since:
+
+     (1). Multiple-rgroup controls N vector loads/stores would need N pointer
+	  updates by variable amounts.
+     (2). SELECT_VL allows flexible length (<=VF) in each iteration.
+     (3). For decrement IV approach, we calculate the MAX length of the loop
+	  and then deduce the length of each control from this MAX length.
+
+     Base on (1), (2) and (3) situations, if we try to use SELECT_VL on
+     multiple-rgroup control, we need to generate multiple SELECT_VL to
+     carefully adjust length of each control. Such approach is very inefficient
+     and unprofitable for targets that are using a standalone instruction
+     to configure the length of each operation.
+     E.g. RISC-V vector use 'vsetvl' to configure the length of each operation.

how about:

  /* If a loop uses length controls and has a decrementing loop control IV,
     we will normally pass that IV through a MIN_EXPR to calcaluate the
     basis for the length controls.  E.g. in a loop that processes one
     element per scalar iteration, the number of elements would be
     MIN_EXPR <N, VF>, where N is the number of scalar iterations left.

     This MIN_EXPR approach allows us to use pointer IVs with an invariant
     step, since only the final iteration of the vector loop can have
     inactive lanes.

     However, some targets have a dedicated instruction for calculating the
     preferred length, given the total number of elements that still need to
     be processed.  This is encapsulated in the SELECT_VL internal function.

     If the target supports SELECT_VL, we can use it instead of MIN_EXPR
     to determine the basis for the length controls.  However, unlike the
     MIN_EXPR calculation, the SELECT_VL calculation can decide to make
     lanes inactive in any iteration of the vector loop, not just the last
     iteration.  This SELECT_VL approach therefore requires us to use pointer
     IVs with variable steps.

     Once we've decided how many elements should be processed by one
     iteration of the vector loop, we need to populate the rgroup controls.
     If a loop has multiple rgroups, we need to make sure that those rgroups
     "line up" (that is, they must be consistent about which elements are
     active and which aren't).  This is done by vect_adjust_loop_lens_control.

     In principle, it would be possible to use vect_adjust_loop_lens_control
     on either the result of a MIN_EXPR or the result of a SELECT_VL.
     However:

     (1) In practice, it only makes sense to use SELECT_VL when a vector
         operation will be controlled directly by the result.  It is not
         worth using SELECT_VL if it would only be the input to other
         calculations.

     (2) If we use SELECT_VL for an rgroup that has N controls, each associated
         pointer IV will need N updates by a variable amount (N-1 updates
         within the iteration and 1 update to move to the next iteration).
         
     Because of this, we prefer to use the MIN_EXPR approach whenever there
     is more than one length control.

     In addition, SELECT_VL always operates to a granularity of 1 unit.
     If we wanted to use it to control an SLP operation on N consecutive
     elements, we would need to make the SELECT_VL inputs measure scalar
     iterations (rather than elements) and then multiply the SELECT_VL
     result by N.  But using SELECT_VL this way is inefficient because
     of (1) above.

Thanks,
Richard

juzhe.zhong@rivai.ai June 5, 2023, 10:26 a.m. UTC | #7

>> But it would make the pointer IV updates more complex.  So let's
>> say that that's the reason for preferring solution 3.

Yes, I prefer pointer solution 3 to avoid complex IV updates and there is
no benefits in solution 2(unlike single-rgroup).

I read your comments, it's more comprehensive than I wrote.

I will send V3 patch with appending your comments.

Thanks you so much!


juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-06-05 18:09
To: juzhe.zhong\@rivai.ai
CC: gcc-patches; rguenther
Subject: Re: [PATCH V2] VECT: Add SELECT_VL support
"juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> Hi, Richard.
>
>>> No, I meant that the comment I quoted seemed to be saying that solution
>>> 3 wasn't possible.  The comment seemed to say that we would need to do
>>> solution 1.
> I am so sorry that I didn't write the comments accurately.
> Could you help me with comments ? Base on what we have discussed above (I think we are on same page now).
 
Yeah, agree we seem to be on the same page
 
>>> When comparing solutions 2 and 3 for case (b), is solution 3 still better?
>>> E.g. is "vsetvli zero" cheaper than "vsetvli <gpr>"?
>
>
> "vsetvli zero" is the same cost as "vsetvli gpr", 
>
> I think for (b),  solution 2 and solution 3 should be almost the same.
 
OK, thanks.  If we wanted to use solution 2 for (b), the condition
would be just:
 
  LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1
 
dropping the:
 
  LOOP_VINFO_LENS (loop_vinfo).length () == 1
 
But it would make the pointer IV updates more complex.  So let's
say that that's the reason for preferring solution 3.
 
So rather than:
 
+  /* If we're using decrement IV approach in loop control, we can use output of
+     SELECT_VL to adjust IV of loop control and data reference when it satisfies
+     the following checks:
+
+     (a) SELECT_VL is supported by the target.
+     (b) LOOP_VINFO is single-rgroup control.
+     (c) non-SLP.
+     (d) LOOP can not be unrolled.
+
+     Otherwise, we use MIN_EXPR approach.
+
+     1. We only apply SELECT_VL on single-rgroup since:
+
+     (1). Multiple-rgroup controls N vector loads/stores would need N pointer
+   updates by variable amounts.
+     (2). SELECT_VL allows flexible length (<=VF) in each iteration.
+     (3). For decrement IV approach, we calculate the MAX length of the loop
+   and then deduce the length of each control from this MAX length.
+
+     Base on (1), (2) and (3) situations, if we try to use SELECT_VL on
+     multiple-rgroup control, we need to generate multiple SELECT_VL to
+     carefully adjust length of each control. Such approach is very inefficient
+     and unprofitable for targets that are using a standalone instruction
+     to configure the length of each operation.
+     E.g. RISC-V vector use 'vsetvl' to configure the length of each operation.
 
how about:
 
  /* If a loop uses length controls and has a decrementing loop control IV,
     we will normally pass that IV through a MIN_EXPR to calcaluate the
     basis for the length controls.  E.g. in a loop that processes one
     element per scalar iteration, the number of elements would be
     MIN_EXPR <N, VF>, where N is the number of scalar iterations left.
 
     This MIN_EXPR approach allows us to use pointer IVs with an invariant
     step, since only the final iteration of the vector loop can have
     inactive lanes.
 
     However, some targets have a dedicated instruction for calculating the
     preferred length, given the total number of elements that still need to
     be processed.  This is encapsulated in the SELECT_VL internal function.
 
     If the target supports SELECT_VL, we can use it instead of MIN_EXPR
     to determine the basis for the length controls.  However, unlike the
     MIN_EXPR calculation, the SELECT_VL calculation can decide to make
     lanes inactive in any iteration of the vector loop, not just the last
     iteration.  This SELECT_VL approach therefore requires us to use pointer
     IVs with variable steps.
 
     Once we've decided how many elements should be processed by one
     iteration of the vector loop, we need to populate the rgroup controls.
     If a loop has multiple rgroups, we need to make sure that those rgroups
     "line up" (that is, they must be consistent about which elements are
     active and which aren't).  This is done by vect_adjust_loop_lens_control.
 
     In principle, it would be possible to use vect_adjust_loop_lens_control
     on either the result of a MIN_EXPR or the result of a SELECT_VL.
     However:
 
     (1) In practice, it only makes sense to use SELECT_VL when a vector
         operation will be controlled directly by the result.  It is not
         worth using SELECT_VL if it would only be the input to other
         calculations.
 
     (2) If we use SELECT_VL for an rgroup that has N controls, each associated
         pointer IV will need N updates by a variable amount (N-1 updates
         within the iteration and 1 update to move to the next iteration).
         
     Because of this, we prefer to use the MIN_EXPR approach whenever there
     is more than one length control.
 
     In addition, SELECT_VL always operates to a granularity of 1 unit.
     If we wanted to use it to control an SLP operation on N consecutive
     elements, we would need to make the SELECT_VL inputs measure scalar
     iterations (rather than elements) and then multiply the SELECT_VL
     result by N.  But using SELECT_VL this way is inefficient because
     of (1) above.
 
Thanks,
Richard

diff mbox series

Patch

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 6a435eb4461..95f7fe1f802 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4974,6 +4974,28 @@  for (i = 1; i < operand3; i++)
   operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
 @end smallexample
 
+@cindex @code{select_vl@var{m}} instruction pattern
+@item @code{select_vl@var{m}}
+Set operand 0 to the number of scalar iterations that should be handled
+by one iteration of a vector loop.  Operand 1 is the total number of
+scalar iterations that the loop needs to process and operand 2 is a
+maximum bound on the result (also known as the maximum ``vectorization
+factor'').
+
+The maximum value of operand 0 is given by:
+@smallexample
+operand0 = MIN (operand1, operand2)
+@end smallexample
+However, targets might choose a lower value than this, based on
+target-specific criteria.  Each iteration of the vector loop might
+therefore process a different number of scalar iterations, which in turn
+means that induction variables will have a variable step.  Because of
+this, it is generally not useful to define this instruction if it will
+always calculate the maximum value.
+
+This optab is only useful on targets that implement @samp{len_load_@var{m}}
+and/or @samp{len_store_@var{m}}.
+
 @cindex @code{check_raw_ptrs@var{m}} instruction pattern
 @item @samp{check_raw_ptrs@var{m}}
 Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 7fe742c2ae7..6f6fa7d37f9 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -153,6 +153,7 @@  DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
 DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
 
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
+DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
 DEF_INTERNAL_OPTAB_FN (CHECK_WAR_PTRS, ECF_CONST | ECF_NOTHROW,
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 695f5911b30..b637471b76e 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -476,3 +476,4 @@  OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
 OPTAB_D (len_load_optab, "len_load_$a")
 OPTAB_D (len_store_optab, "len_store_$a")
+OPTAB_D (select_vl_optab, "select_vl$a")
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 3f735945e67..1c8100c1a1c 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -534,7 +534,7 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	   _10 = (unsigned long) count_12(D);
 	   ...
 	   # ivtmp_9 = PHI <ivtmp_35(6), _10(5)>
-	   _36 = MIN_EXPR <ivtmp_9, POLY_INT_CST [4, 4]>;
+	   _36 = (MIN_EXPR | SELECT_VL) <ivtmp_9, POLY_INT_CST [4, 4]>;
 	   ...
 	   vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
 	   ...
@@ -549,15 +549,28 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
       tree step = rgc->controls.length () == 1 ? rgc->controls[0]
 					       : make_ssa_name (iv_type);
       /* Create decrement IV.  */
-      create_iv (nitems_total, MINUS_EXPR, nitems_step, NULL_TREE, loop,
-		 &incr_gsi, insert_after, &index_before_incr,
-		 &index_after_incr);
-      gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
-							    index_before_incr,
-							    nitems_step));
+      if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
+	{
+	  create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, &incr_gsi,
+		     insert_after, &index_before_incr, &index_after_incr);
+	  tree len = gimple_build (header_seq, IFN_SELECT_VL, iv_type,
+				   index_before_incr, nitems_step);
+	  gimple_seq_add_stmt (header_seq, gimple_build_assign (step, len));
+	}
+      else
+	{
+	  create_iv (nitems_total, MINUS_EXPR, nitems_step, NULL_TREE, loop,
+		     &incr_gsi, insert_after, &index_before_incr,
+		     &index_after_incr);
+	  gimple_seq_add_stmt (header_seq,
+			       gimple_build_assign (step, MIN_EXPR,
+						    index_before_incr,
+						    nitems_step));
+	}
       *iv_step = step;
       *compare_step = nitems_step;
-      return index_before_incr;
+      return LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) ? index_after_incr
+						       : index_before_incr;
     }
 
   /* Create increment IV.  */
@@ -888,7 +901,8 @@  vect_set_loop_condition_partial_vectors (class loop *loop,
   /* Get a boolean result that tells us whether to iterate.  */
   edge exit_edge = single_exit (loop);
   gcond *cond_stmt;
-  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
+      && !LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
     {
       gcc_assert (compare_step);
       tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : GT_EXPR;
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 5b7a0da0034..68c3432c0a4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -974,6 +974,7 @@  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
     using_partial_vectors_p (false),
     using_decrementing_iv_p (false),
+    using_select_vl_p (false),
     epil_using_partial_vectors_p (false),
     partial_load_store_bias (0),
     peeling_for_gaps (false),
@@ -2737,6 +2738,53 @@  start_over:
 			LOOP_VINFO_VECT_FACTOR (loop_vinfo))))
     LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
 
+  /* If we're using decrement IV approach in loop control, we can use output of
+     SELECT_VL to adjust IV of loop control and data reference when it satisfies
+     the following checks:
+
+     (a) SELECT_VL is supported by the target.
+     (b) LOOP_VINFO is single-rgroup control.
+     (c) non-SLP.
+     (d) LOOP can not be unrolled.
+
+     Otherwise, we use MIN_EXPR approach.
+
+     1. We only apply SELECT_VL on single-rgroup since:
+
+     (1). Multiple-rgroup controls N vector loads/stores would need N pointer
+	  updates by variable amounts.
+     (2). SELECT_VL allows flexible length (<=VF) in each iteration.
+     (3). For decrement IV approach, we calculate the MAX length of the loop
+	  and then deduce the length of each control from this MAX length.
+
+     Base on (1), (2) and (3) situations, if we try to use SELECT_VL on
+     multiple-rgroup control, we need to generate multiple SELECT_VL to
+     carefully adjust length of each control. Such approach is very inefficient
+     and unprofitable for targets that are using a standalone instruction
+     to configure the length of each operation.
+     E.g. RISC-V vector use 'vsetvl' to configure the length of each operation.
+
+     2. We don't apply SELECT_VL on single-rgroup when both (1) and (2) are
+	satisfied:
+
+     (1). LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) is true.
+     (2). LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant () is true.
+
+     Since SELECT_VL (variable step) will make SCEV analysis failed and then
+     we will fail to gain benefits of following unroll optimizations. We prefer
+     using the MIN_EXPR approach in this situation.  */
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+    {
+      tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+      if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
+					  OPTIMIZE_FOR_SPEED)
+	  && LOOP_VINFO_LENS (loop_vinfo).length () == 1
+	  && LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1 && !slp
+	  && (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	      || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
+	LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
+    }
+
   /* If we're vectorizing an epilogue loop, the vectorized loop either needs
      to be able to handle fewer than VF scalars, or needs to have a lower VF
      than the main loop.  */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index e37c401b688..b28226b6e7e 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -3148,6 +3148,57 @@  vect_get_data_ptr_increment (vec_info *vinfo,
   return iv_step;
 }
 
+/* Prepare the pointer IVs which needs to be updated by a variable amount.
+   Such variable amount is the outcome of .SELECT_VL. In this case, we can
+   allow each iteration process the flexible number of elements as long as
+   the number <= vf elments.
+
+   Return data reference according to SELECT_VL.
+   If new statements are needed, insert them before GSI.  */
+
+static tree
+get_select_vl_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
+			    tree aggr_type, class loop *at_loop, tree offset,
+			    tree *dummy, gimple_stmt_iterator *gsi,
+			    bool simd_lane_access_p, vec_loop_lens *loop_lens,
+			    dr_vec_info *dr_info,
+			    vect_memory_access_type memory_access_type)
+{
+  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
+  tree step = vect_dr_behavior (vinfo, dr_info)->step;
+
+  /* TODO: We don't support gather/scatter or load_lanes/store_lanes for pointer
+     IVs are updated by variable amount but we will support them in the future.
+   */
+  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
+	      && memory_access_type != VMAT_LOAD_STORE_LANES);
+
+  /* When we support SELECT_VL pattern, we dynamic adjust
+     the memory address by .SELECT_VL result.
+
+     The result of .SELECT_VL is the number of elements to
+     be processed of each iteration. So the memory address
+     adjustment operation should be:
+
+     addr = addr + .SELECT_VL (ARG..) * step;
+  */
+  gimple *ptr_incr;
+  tree loop_len
+    = vect_get_loop_len (loop_vinfo, gsi, loop_lens, 1, aggr_type, 0, 0);
+  tree len_type = TREE_TYPE (loop_len);
+  /* Since the outcome of .SELECT_VL is element size, we should adjust
+     it into bytesize so that it can be used in address pointer variable
+     amount IVs adjustment.  */
+  tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len,
+			  wide_int_to_tree (len_type, wi::to_widest (step)));
+  tree bump = make_temp_ssa_name (len_type, NULL, "ivtmp");
+  gassign *assign = gimple_build_assign (bump, tmp);
+  gsi_insert_before (gsi, assign, GSI_SAME_STMT);
+  return vect_create_data_ref_ptr (vinfo, stmt_info, aggr_type, at_loop, offset,
+				   dummy, gsi, &ptr_incr, simd_lane_access_p,
+				   bump);
+}
+
 /* Check and perform vectorization of BUILT_IN_BSWAP{16,32,64,128}.  */
 
 static bool
@@ -8631,6 +8682,14 @@  vectorizable_store (vec_info *vinfo,
 	    vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
 					 slp_node, &gs_info, &dataref_ptr,
 					 &vec_offsets);
+	  else if (loop_lens && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
+		   && memory_access_type != VMAT_INVARIANT)
+	    dataref_ptr
+	      = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
+					    simd_lane_access_p ? loop : NULL,
+					    offset, &dummy, gsi,
+					    simd_lane_access_p, loop_lens,
+					    dr_info, memory_access_type);
 	  else
 	    dataref_ptr
 	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
@@ -10022,6 +10081,13 @@  vectorizable_load (vec_info *vinfo,
 					   slp_node, &gs_info, &dataref_ptr,
 					   &vec_offsets);
 	    }
+	  else if (loop_lens && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
+		   && memory_access_type != VMAT_INVARIANT)
+	    dataref_ptr
+	      = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
+					    at_loop, offset, &dummy, gsi,
+					    simd_lane_access_p, loop_lens,
+					    dr_info, memory_access_type);
 	  else
 	    dataref_ptr
 	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 34552f6bf9e..c6e0140a0b5 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -825,6 +825,11 @@  public:
 	(b) can iterate more than once.  */
   bool using_decrementing_iv_p;
 
+  /* True if we've decided to use output of select_vl to adjust IV of
+     both loop control and data reference pointer. This is only true
+     for single-rgroup control.  */
+  bool using_select_vl_p;
+
   /* True if we've decided to use partially-populated vectors for the
      epilogue of loop.  */
   bool epil_using_partial_vectors_p;
@@ -898,6 +903,7 @@  public:
 #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
 #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
 #define LOOP_VINFO_USING_DECREMENTING_IV_P(L) (L)->using_decrementing_iv_p
+#define LOOP_VINFO_USING_SELECT_VL_P(L) (L)->using_select_vl_p
 #define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
   (L)->epil_using_partial_vectors_p
 #define LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS(L) (L)->partial_load_store_bias