[v2,0/6] KVM: arm64: implement vcpu_is_preempted check

Message ID	20221104062105.4119003-1-usama.arif@bytedance.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Usama Arif <usama.arif@bytedance.com> To: linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org, linux-doc@vger.kernel.org, virtualization@lists.linux-foundation.org, linux@armlinux.org.uk, yezengruan@huawei.com, catalin.marinas@arm.com, will@kernel.org, maz@kernel.org, steven.price@arm.com, mark.rutland@arm.com, bagasdotme@gmail.com Cc: fam.zheng@bytedance.com, liangma@liangbit.com, punit.agrawal@bytedance.com, Usama Arif <usama.arif@bytedance.com> Subject: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Date: Fri, 4 Nov 2022 06:20:59 +0000 Message-Id: <20221104062105.4119003-1-usama.arif@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	KVM: arm64: implement vcpu_is_preempted check \| [v2,0/6] KVM: arm64: implement vcpu_is_preempted check [v2,1/6] KVM: arm64: Document PV-lock interface [v2,2/6] KVM: arm64: Add SMCCC paravirtualised lock calls [v2,3/6] KVM: arm64: Support pvlock preempted via shared structure [v2,4/6] KVM: arm64: Provide VCPU attributes for PV lock [v2,5/6] KVM: arm64: Support the VCPU preemption check [v2,6/6] KVM: selftests: add tests for PV time specific hypercall

Message ID

20221104062105.4119003-1-usama.arif@bytedance.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Usama Arif <usama.arif@bytedance.com>
To: linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
        kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org,
        linux-doc@vger.kernel.org,
        virtualization@lists.linux-foundation.org, linux@armlinux.org.uk,
        yezengruan@huawei.com, catalin.marinas@arm.com, will@kernel.org,
        maz@kernel.org, steven.price@arm.com, mark.rutland@arm.com,
        bagasdotme@gmail.com
Cc: fam.zheng@bytedance.com, liangma@liangbit.com,
        punit.agrawal@bytedance.com, Usama Arif <usama.arif@bytedance.com>
Subject: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
Date: Fri,  4 Nov 2022 06:20:59 +0000
Message-Id: <20221104062105.4119003-1-usama.arif@bytedance.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

KVM: arm64: implement vcpu_is_preempted check |

Message

Usama Arif Nov. 4, 2022, 6:20 a.m. UTC

  This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
to check if a vcpu was scheduled out, which is useful to know incase it was
holding a lock. vcpu_is_preempted can be used to improve
performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
(see available_idle_cpu which is used in several places in kernel/sched/fair.c
for e.g. in wake_affine to determine which CPU can run soonest):

This patchset shows improvement on overcommitted hosts (vCPUs > pCPUS), as waiting
for preempted vCPUs reduces performance.

This patchset is inspired from the para_steal_clock implementation and from the
work originally done by Zengruan Ye:
https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/.

All the results in the below experiments are done on an aws r6g.metal instance
which has 64 pCPUs.

The following table shows the index results of UnixBench running on a 128 vCPU VM
with (6.0.0+vcpu_is_preempted) and without (6.0.0 base) the patchset.
TestName                                6.0.0 base  6.0.0+vcpu_is_preempted    % improvement for vcpu_is_preempted
Dhrystone 2 using register variables    187761      191274.7                   1.871368389
Double-Precision Whetstone              96743.6     98414.4                    1.727039308
Execl Throughput                        689.3       10426                      1412.548963
File Copy 1024 bufsize 2000 maxblocks   549.5       3165                       475.978162
File Copy 256 bufsize 500 maxblocks     400.7       2084.7                     420.2645371
File Copy 4096 bufsize 8000 maxblocks   894.3       5003.2                     459.4543218
Pipe Throughput                         76819.5     78601.5                    2.319723508
Pipe-based Context Switching            3444.8      13414.5                    289.4130283
Process Creation                        301.1       293.4                      -2.557289937
Shell Scripts (1 concurrent)            1248.1      28300.6                    2167.494592
Shell Scripts (8 concurrent)            781.2       26222.3                    3256.669227
System Call Overhead                    3426        3729.4                     8.855808523

System Benchmarks Index Score           3053        11534                      277.7923354

This shows a 277% overall improvement using these patches.

The biggest improvement is in the shell scripts benchmark, which forks a lot of processes.
This acquires rwsem lock where a large chunk of time is spent in base 6.0.0 kernel.
This can be seen from one of the callstack of the perf output of the shell
scripts benchmark on 6.0.0 base (pseudo NMI enabled for perf numbers below):
- 33.79% el0_svc
   - 33.43% do_el0_svc
      - 33.43% el0_svc_common.constprop.3
         - 33.30% invoke_syscall
            - 17.27% __arm64_sys_clone
               - 17.27% __do_sys_clone
                  - 17.26% kernel_clone
                     - 16.73% copy_process
                        - 11.91% dup_mm
                           - 11.82% dup_mmap
                              - 9.15% down_write
                                 - 8.87% rwsem_down_write_slowpath
                                    - 8.48% osq_lock

Just under 50% of the total time in the shell script benchmarks ends up being
spent in osq_lock in the base 6.0.0 kernel:
  Children      Self  Command   Shared Object        Symbol
   17.19%    10.71%  sh      [kernel.kallsyms]  [k] osq_lock
    6.17%     4.04%  sort    [kernel.kallsyms]  [k] osq_lock
    4.20%     2.60%  multi.  [kernel.kallsyms]  [k] osq_lock
    3.77%     2.47%  grep    [kernel.kallsyms]  [k] osq_lock
    3.50%     2.24%  expr    [kernel.kallsyms]  [k] osq_lock
    3.41%     2.23%  od      [kernel.kallsyms]  [k] osq_lock
    3.36%     2.15%  rm      [kernel.kallsyms]  [k] osq_lock
    3.28%     2.12%  tee     [kernel.kallsyms]  [k] osq_lock
    3.16%     2.02%  wc      [kernel.kallsyms]  [k] osq_lock
    0.21%     0.13%  looper  [kernel.kallsyms]  [k] osq_lock
    0.01%     0.00%  Run     [kernel.kallsyms]  [k] osq_lock

and this comes down to less than 1% total with 6.0.0+vcpu_is_preempted kernel:
  Children      Self  Command   Shared Object        Symbol
     0.26%     0.21%  sh      [kernel.kallsyms]  [k] osq_lock
     0.10%     0.08%  multi.  [kernel.kallsyms]  [k] osq_lock
     0.04%     0.04%  sort    [kernel.kallsyms]  [k] osq_lock
     0.02%     0.01%  grep    [kernel.kallsyms]  [k] osq_lock
     0.02%     0.02%  od      [kernel.kallsyms]  [k] osq_lock
     0.01%     0.01%  tee     [kernel.kallsyms]  [k] osq_lock
     0.01%     0.00%  expr    [kernel.kallsyms]  [k] osq_lock
     0.01%     0.01%  looper  [kernel.kallsyms]  [k] osq_lock
     0.00%     0.00%  wc      [kernel.kallsyms]  [k] osq_lock
     0.00%     0.00%  rm      [kernel.kallsyms]  [k] osq_lock

To make sure, there is no change in performance when vCPUs < pCPUs, UnixBench
was run on a 32 CPU VM. The kernel with vcpu_is_preempted implemented
performed 0.9% better overall than base kernel, and the individual benchmarks
were within +/-2% improvement over 6.0.0 base.
Hence the patches have no negative affect when vCPUs < pCPUs.


The other method discussed in https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/
was pv conditional yield by Marc Zyngier and Will Deacon to reduce vCPU reschedule
if the vCPU will exit immediately.
(https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy).
The patches were ported to 6.0.0 kernel and tested with UnixBench with a 128 vCPU VM:

TestName                                6.0.0 base  6.0.0+pvcy      % improvement for pvcy
Dhrystone 2 using register variables    187761      183128          -2.467498575
Double-Precision Whetstone              96743.6     96645           -0.101918887
Execl Throughput                        689.3       1019.8          47.9471928
File Copy 1024 bufsize 2000 maxblocks   549.5       2029.7          269.3721565
File Copy 256 bufsize 500 maxblocks     400.7       1439.4          259.2213626
File Copy 4096 bufsize 8000 maxblocks   894.3       3434.1          283.9986582
Pipe Throughput                         76819.5     74268.8         -3.320380893
Pipe-based Context Switching            3444.8      5963.3          73.11019508
Process Creation                        301.1       163.2           -45.79873796
Shell Scripts (1 concurrent)            1248.1      1859.7          49.00248378
Shell Scripts (8 concurrent)            781.2       1171            49.89759345
System Call Overhead                    3426        3194.4          -6.760070053

System Benchmarks Index Score           3053        4605            50.83524402

pvcy shows a smaller overall improvement (50%) compared to vcpu_is_preempted (277%).
Host side flamegraph analysis shows that ~60% of the host time when using pvcy
is spent in kvm_handle_wfx, compared with ~1.5% when using vcpu_is_preempted,
hence vcpu_is_preempted shows a larger improvement.

It might be that pvcy can be used in combination with vcpu_is_preempted, but this
series is to start a discussion on vcpu_is_preempted as it shows a much bigger
improvement in performance and its much easier to review vcpu_is_preempted standalone.

The respective QEMU change to test this is at
https://github.com/uarif1/qemu/commit/2da2c2927ae8de8f03f439804a0dad9cf68501b6,

Looking forward to your response!
Thanks,
Usama
---
RFC->v2
- Fixed table and code referencing in pvlock documentation
- Switched to using a single hypercall similar to ptp_kvm and made check
  for has_kvm_pvlock simpler

Usama Arif (6):
  KVM: arm64: Document PV-lock interface
  KVM: arm64: Add SMCCC paravirtualised lock calls
  KVM: arm64: Support pvlock preempted via shared structure
  KVM: arm64: Provide VCPU attributes for PV lock
  KVM: arm64: Support the VCPU preemption check
  KVM: selftests: add tests for PV time specific hypercall

 Documentation/virt/kvm/arm/hypercalls.rst     |   3 +
 Documentation/virt/kvm/arm/index.rst          |   1 +
 Documentation/virt/kvm/arm/pvlock.rst         |  52 ++++++++
 Documentation/virt/kvm/devices/vcpu.rst       |  25 ++++
 arch/arm64/include/asm/kvm_host.h             |  25 ++++
 arch/arm64/include/asm/paravirt.h             |   2 +
 arch/arm64/include/asm/pvlock-abi.h           |  17 +++
 arch/arm64/include/asm/spinlock.h             |  16 ++-
 arch/arm64/include/uapi/asm/kvm.h             |   3 +
 arch/arm64/kernel/paravirt.c                  | 112 ++++++++++++++++++
 arch/arm64/kernel/setup.c                     |   3 +
 arch/arm64/kvm/Makefile                       |   2 +-
 arch/arm64/kvm/arm.c                          |   8 ++
 arch/arm64/kvm/guest.c                        |   9 ++
 arch/arm64/kvm/hypercalls.c                   |   8 ++
 arch/arm64/kvm/pvlock.c                       | 100 ++++++++++++++++
 include/linux/arm-smccc.h                     |   8 ++
 include/linux/cpuhotplug.h                    |   1 +
 include/uapi/linux/kvm.h                      |   2 +
 tools/arch/arm64/include/uapi/asm/kvm.h       |   1 +
 tools/include/linux/arm-smccc.h               |   8 ++
 .../selftests/kvm/aarch64/hypercalls.c        |   2 +
 22 files changed, 406 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/virt/kvm/arm/pvlock.rst
 create mode 100644 arch/arm64/include/asm/pvlock-abi.h
 create mode 100644 arch/arm64/kvm/pvlock.c

Comments

Marc Zyngier Nov. 4, 2022, 9:02 a.m. UTC | #1

On Fri, 04 Nov 2022 06:20:59 +0000,
Usama Arif <usama.arif@bytedance.com> wrote:
> 
> This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
> to check if a vcpu was scheduled out, which is useful to know incase it was
> holding a lock. vcpu_is_preempted can be used to improve
> performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> (see available_idle_cpu which is used in several places in kernel/sched/fair.c
> for e.g. in wake_affine to determine which CPU can run soonest):

Please refrain from reposting a series only two days after the initial
one. One week is a minimum, and only if there is enough review
comments to justify a respin (there were no valuable comments so far).

Reposting more often only results in the review process being
exponentially delayed.

Thanks,

	M.

Marc Zyngier Nov. 6, 2022, 4:35 p.m. UTC | #2

On Fri, 04 Nov 2022 06:20:59 +0000,
Usama Arif <usama.arif@bytedance.com> wrote:
> 
> This patchset adds support for vcpu_is_preempted in arm64, which
> allows the guest to check if a vcpu was scheduled out, which is
> useful to know incase it was holding a lock. vcpu_is_preempted can
> be used to improve performance in locking (see owner_on_cpu usage in
> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
> and osq_lock) and scheduling (see available_idle_cpu which is used
> in several places in kernel/sched/fair.c for e.g. in wake_affine to
> determine which CPU can run soonest):

[...]

> pvcy shows a smaller overall improvement (50%) compared to
> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
> compared with ~1.5% when using vcpu_is_preempted, hence
> vcpu_is_preempted shows a larger improvement.

And have you worked out *why* we spend so much time handling WFE?

	M.

Usama Arif Nov. 7, 2022, noon UTC | #3

On 06/11/2022 16:35, Marc Zyngier wrote:
> On Fri, 04 Nov 2022 06:20:59 +0000,
> Usama Arif <usama.arif@bytedance.com> wrote:
>>
>> This patchset adds support for vcpu_is_preempted in arm64, which
>> allows the guest to check if a vcpu was scheduled out, which is
>> useful to know incase it was holding a lock. vcpu_is_preempted can
>> be used to improve performance in locking (see owner_on_cpu usage in
>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>> and osq_lock) and scheduling (see available_idle_cpu which is used
>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>> determine which CPU can run soonest):
> 
> [...]
> 
>> pvcy shows a smaller overall improvement (50%) compared to
>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>> compared with ~1.5% when using vcpu_is_preempted, hence
>> vcpu_is_preempted shows a larger improvement.
> 
> And have you worked out *why* we spend so much time handling WFE?
> 
> 	M.

Its from the following change in pvcy patchset:

diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index e778eefcf214..915644816a85 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
         }

         if (esr & ESR_ELx_WFx_ISS_WFE) {
-               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
+               int state;
+               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
+                       schedule();
+
+               if (state == -1)
+                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
         } else {
                 if (esr & ESR_ELx_WFx_ISS_WFxT)
                         vcpu_set_flag(vcpu, IN_WFIT);


If my understanding is correct of the pvcy changes, whenever pvcy 
returns an unchanged vcpu state, we would schedule to another vcpu. And 
its the constant scheduling where the time is spent. I guess the affects 
are much higher when the lock contention is very high. This can be seem 
from the pvcy host side flamegraph as well with (~67% of the time spent 
in the schedule() call in kvm_handle_wfx), For reference, I have put the 
graph at:
https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg

Thanks,
Usama

>

Punit Agrawal Nov. 7, 2022, 6:24 p.m. UTC | #4

Hi Usama,

Usama Arif <usama.arif@bytedance.com> writes:

> This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
> to check if a vcpu was scheduled out, which is useful to know incase it was
> holding a lock. vcpu_is_preempted can be used to improve
> performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> (see available_idle_cpu which is used in several places in kernel/sched/fair.c
> for e.g. in wake_affine to determine which CPU can run soonest):
>
> This patchset shows improvement on overcommitted hosts (vCPUs > pCPUS), as waiting
> for preempted vCPUs reduces performance.
>
> This patchset is inspired from the para_steal_clock implementation and from the
> work originally done by Zengruan Ye:
> https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/.
>
> All the results in the below experiments are done on an aws r6g.metal instance
> which has 64 pCPUs.
>
> The following table shows the index results of UnixBench running on a 128 vCPU VM
> with (6.0.0+vcpu_is_preempted) and without (6.0.0 base) the patchset.
> TestName                                6.0.0 base  6.0.0+vcpu_is_preempted    % improvement for vcpu_is_preempted
> Dhrystone 2 using register variables    187761      191274.7                   1.871368389
> Double-Precision Whetstone              96743.6     98414.4                    1.727039308
> Execl Throughput                        689.3       10426                      1412.548963
> File Copy 1024 bufsize 2000 maxblocks   549.5       3165                       475.978162
> File Copy 256 bufsize 500 maxblocks     400.7       2084.7                     420.2645371
> File Copy 4096 bufsize 8000 maxblocks   894.3       5003.2                     459.4543218
> Pipe Throughput                         76819.5     78601.5                    2.319723508
> Pipe-based Context Switching            3444.8      13414.5                    289.4130283
> Process Creation                        301.1       293.4                      -2.557289937
> Shell Scripts (1 concurrent)            1248.1      28300.6                    2167.494592
> Shell Scripts (8 concurrent)            781.2       26222.3                    3256.669227
> System Call Overhead                    3426        3729.4                     8.855808523
>
> System Benchmarks Index Score           3053        11534                      277.7923354
>
> This shows a 277% overall improvement using these patches.
>
> The biggest improvement is in the shell scripts benchmark, which forks a lot of processes.
> This acquires rwsem lock where a large chunk of time is spent in base 6.0.0 kernel.
> This can be seen from one of the callstack of the perf output of the shell
> scripts benchmark on 6.0.0 base (pseudo NMI enabled for perf numbers below):
> - 33.79% el0_svc
>    - 33.43% do_el0_svc
>       - 33.43% el0_svc_common.constprop.3
>          - 33.30% invoke_syscall
>             - 17.27% __arm64_sys_clone
>                - 17.27% __do_sys_clone
>                   - 17.26% kernel_clone
>                      - 16.73% copy_process
>                         - 11.91% dup_mm
>                            - 11.82% dup_mmap
>                               - 9.15% down_write
>                                  - 8.87% rwsem_down_write_slowpath
>                                     - 8.48% osq_lock
>
> Just under 50% of the total time in the shell script benchmarks ends up being
> spent in osq_lock in the base 6.0.0 kernel:
>   Children      Self  Command   Shared Object        Symbol
>    17.19%    10.71%  sh      [kernel.kallsyms]  [k] osq_lock
>     6.17%     4.04%  sort    [kernel.kallsyms]  [k] osq_lock
>     4.20%     2.60%  multi.  [kernel.kallsyms]  [k] osq_lock
>     3.77%     2.47%  grep    [kernel.kallsyms]  [k] osq_lock
>     3.50%     2.24%  expr    [kernel.kallsyms]  [k] osq_lock
>     3.41%     2.23%  od      [kernel.kallsyms]  [k] osq_lock
>     3.36%     2.15%  rm      [kernel.kallsyms]  [k] osq_lock
>     3.28%     2.12%  tee     [kernel.kallsyms]  [k] osq_lock
>     3.16%     2.02%  wc      [kernel.kallsyms]  [k] osq_lock
>     0.21%     0.13%  looper  [kernel.kallsyms]  [k] osq_lock
>     0.01%     0.00%  Run     [kernel.kallsyms]  [k] osq_lock
>
> and this comes down to less than 1% total with 6.0.0+vcpu_is_preempted kernel:
>   Children      Self  Command   Shared Object        Symbol
>      0.26%     0.21%  sh      [kernel.kallsyms]  [k] osq_lock
>      0.10%     0.08%  multi.  [kernel.kallsyms]  [k] osq_lock
>      0.04%     0.04%  sort    [kernel.kallsyms]  [k] osq_lock
>      0.02%     0.01%  grep    [kernel.kallsyms]  [k] osq_lock
>      0.02%     0.02%  od      [kernel.kallsyms]  [k] osq_lock
>      0.01%     0.01%  tee     [kernel.kallsyms]  [k] osq_lock
>      0.01%     0.00%  expr    [kernel.kallsyms]  [k] osq_lock
>      0.01%     0.01%  looper  [kernel.kallsyms]  [k] osq_lock
>      0.00%     0.00%  wc      [kernel.kallsyms]  [k] osq_lock
>      0.00%     0.00%  rm      [kernel.kallsyms]  [k] osq_lock
>
> To make sure, there is no change in performance when vCPUs < pCPUs, UnixBench
> was run on a 32 CPU VM. The kernel with vcpu_is_preempted implemented
> performed 0.9% better overall than base kernel, and the individual benchmarks
> were within +/-2% improvement over 6.0.0 base.
> Hence the patches have no negative affect when vCPUs < pCPUs.
>
>
> The other method discussed in https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/
> was pv conditional yield by Marc Zyngier and Will Deacon to reduce vCPU reschedule
> if the vCPU will exit immediately.
> (https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy).
> The patches were ported to 6.0.0 kernel and tested with UnixBench with a 128 vCPU VM:
>
> TestName                                6.0.0 base  6.0.0+pvcy      % improvement for pvcy
> Dhrystone 2 using register variables    187761      183128          -2.467498575
> Double-Precision Whetstone              96743.6     96645           -0.101918887
> Execl Throughput                        689.3       1019.8          47.9471928
> File Copy 1024 bufsize 2000 maxblocks   549.5       2029.7          269.3721565
> File Copy 256 bufsize 500 maxblocks     400.7       1439.4          259.2213626
> File Copy 4096 bufsize 8000 maxblocks   894.3       3434.1          283.9986582
> Pipe Throughput                         76819.5     74268.8         -3.320380893
> Pipe-based Context Switching            3444.8      5963.3          73.11019508
> Process Creation                        301.1       163.2           -45.79873796
> Shell Scripts (1 concurrent)            1248.1      1859.7          49.00248378
> Shell Scripts (8 concurrent)            781.2       1171            49.89759345
> System Call Overhead                    3426        3194.4          -6.760070053
>
> System Benchmarks Index Score           3053        4605            50.83524402
>
> pvcy shows a smaller overall improvement (50%) compared to vcpu_is_preempted (277%).
> Host side flamegraph analysis shows that ~60% of the host time when using pvcy
> is spent in kvm_handle_wfx, compared with ~1.5% when using vcpu_is_preempted,
> hence vcpu_is_preempted shows a larger improvement.
>
> It might be that pvcy can be used in combination with vcpu_is_preempted, but this
> series is to start a discussion on vcpu_is_preempted as it shows a much bigger
> improvement in performance and its much easier to review vcpu_is_preempted standalone.

Looking at both the patchsets - this one and the pvcy, it looks to me
that vcpu_is_preempted() and the pvcy patches are somewhat
orthogonal. The former is optimizing mutex and rwsem in their optimistic
spinning phase while the latter is going after spinlocks (via wfe).

Unless I'm missing something the features are not necessarily comparable
on the same workloads - unixbench is probably mutex heavy and doesn't
show as much benefit with just the pvcy changes. I wonder if it's easy
to have both the features enabled to see this in effect.

I've left some comments on the patches; but no need to respin just
yet. Let's see if there is any other feedback.

Thanks,
Punit

[...]

Usama Arif Nov. 9, 2022, 7:38 p.m. UTC | #5

On 07/11/2022 18:24, Punit Agrawal wrote:
> Hi Usama,
> 
> Usama Arif <usama.arif@bytedance.com> writes:
> 
>> This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
>> to check if a vcpu was scheduled out, which is useful to know incase it was
>> holding a lock. vcpu_is_preempted can be used to improve
>> performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
>> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
>> (see available_idle_cpu which is used in several places in kernel/sched/fair.c
>> for e.g. in wake_affine to determine which CPU can run soonest):
>>
>> This patchset shows improvement on overcommitted hosts (vCPUs > pCPUS), as waiting
>> for preempted vCPUs reduces performance.
>>
>> This patchset is inspired from the para_steal_clock implementation and from the
>> work originally done by Zengruan Ye:
>> https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/.
>>
>> All the results in the below experiments are done on an aws r6g.metal instance
>> which has 64 pCPUs.
>>
>> The following table shows the index results of UnixBench running on a 128 vCPU VM
>> with (6.0.0+vcpu_is_preempted) and without (6.0.0 base) the patchset.
>> TestName                                6.0.0 base  6.0.0+vcpu_is_preempted    % improvement for vcpu_is_preempted
>> Dhrystone 2 using register variables    187761      191274.7                   1.871368389
>> Double-Precision Whetstone              96743.6     98414.4                    1.727039308
>> Execl Throughput                        689.3       10426                      1412.548963
>> File Copy 1024 bufsize 2000 maxblocks   549.5       3165                       475.978162
>> File Copy 256 bufsize 500 maxblocks     400.7       2084.7                     420.2645371
>> File Copy 4096 bufsize 8000 maxblocks   894.3       5003.2                     459.4543218
>> Pipe Throughput                         76819.5     78601.5                    2.319723508
>> Pipe-based Context Switching            3444.8      13414.5                    289.4130283
>> Process Creation                        301.1       293.4                      -2.557289937
>> Shell Scripts (1 concurrent)            1248.1      28300.6                    2167.494592
>> Shell Scripts (8 concurrent)            781.2       26222.3                    3256.669227
>> System Call Overhead                    3426        3729.4                     8.855808523
>>
>> System Benchmarks Index Score           3053        11534                      277.7923354
>>
>> This shows a 277% overall improvement using these patches.
>>
>> The biggest improvement is in the shell scripts benchmark, which forks a lot of processes.
>> This acquires rwsem lock where a large chunk of time is spent in base 6.0.0 kernel.
>> This can be seen from one of the callstack of the perf output of the shell
>> scripts benchmark on 6.0.0 base (pseudo NMI enabled for perf numbers below):
>> - 33.79% el0_svc
>>     - 33.43% do_el0_svc
>>        - 33.43% el0_svc_common.constprop.3
>>           - 33.30% invoke_syscall
>>              - 17.27% __arm64_sys_clone
>>                 - 17.27% __do_sys_clone
>>                    - 17.26% kernel_clone
>>                       - 16.73% copy_process
>>                          - 11.91% dup_mm
>>                             - 11.82% dup_mmap
>>                                - 9.15% down_write
>>                                   - 8.87% rwsem_down_write_slowpath
>>                                      - 8.48% osq_lock
>>
>> Just under 50% of the total time in the shell script benchmarks ends up being
>> spent in osq_lock in the base 6.0.0 kernel:
>>    Children      Self  Command   Shared Object        Symbol
>>     17.19%    10.71%  sh      [kernel.kallsyms]  [k] osq_lock
>>      6.17%     4.04%  sort    [kernel.kallsyms]  [k] osq_lock
>>      4.20%     2.60%  multi.  [kernel.kallsyms]  [k] osq_lock
>>      3.77%     2.47%  grep    [kernel.kallsyms]  [k] osq_lock
>>      3.50%     2.24%  expr    [kernel.kallsyms]  [k] osq_lock
>>      3.41%     2.23%  od      [kernel.kallsyms]  [k] osq_lock
>>      3.36%     2.15%  rm      [kernel.kallsyms]  [k] osq_lock
>>      3.28%     2.12%  tee     [kernel.kallsyms]  [k] osq_lock
>>      3.16%     2.02%  wc      [kernel.kallsyms]  [k] osq_lock
>>      0.21%     0.13%  looper  [kernel.kallsyms]  [k] osq_lock
>>      0.01%     0.00%  Run     [kernel.kallsyms]  [k] osq_lock
>>
>> and this comes down to less than 1% total with 6.0.0+vcpu_is_preempted kernel:
>>    Children      Self  Command   Shared Object        Symbol
>>       0.26%     0.21%  sh      [kernel.kallsyms]  [k] osq_lock
>>       0.10%     0.08%  multi.  [kernel.kallsyms]  [k] osq_lock
>>       0.04%     0.04%  sort    [kernel.kallsyms]  [k] osq_lock
>>       0.02%     0.01%  grep    [kernel.kallsyms]  [k] osq_lock
>>       0.02%     0.02%  od      [kernel.kallsyms]  [k] osq_lock
>>       0.01%     0.01%  tee     [kernel.kallsyms]  [k] osq_lock
>>       0.01%     0.00%  expr    [kernel.kallsyms]  [k] osq_lock
>>       0.01%     0.01%  looper  [kernel.kallsyms]  [k] osq_lock
>>       0.00%     0.00%  wc      [kernel.kallsyms]  [k] osq_lock
>>       0.00%     0.00%  rm      [kernel.kallsyms]  [k] osq_lock
>>
>> To make sure, there is no change in performance when vCPUs < pCPUs, UnixBench
>> was run on a 32 CPU VM. The kernel with vcpu_is_preempted implemented
>> performed 0.9% better overall than base kernel, and the individual benchmarks
>> were within +/-2% improvement over 6.0.0 base.
>> Hence the patches have no negative affect when vCPUs < pCPUs.
>>
>>
>> The other method discussed in https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/
>> was pv conditional yield by Marc Zyngier and Will Deacon to reduce vCPU reschedule
>> if the vCPU will exit immediately.
>> (https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy).
>> The patches were ported to 6.0.0 kernel and tested with UnixBench with a 128 vCPU VM:
>>
>> TestName                                6.0.0 base  6.0.0+pvcy      % improvement for pvcy
>> Dhrystone 2 using register variables    187761      183128          -2.467498575
>> Double-Precision Whetstone              96743.6     96645           -0.101918887
>> Execl Throughput                        689.3       1019.8          47.9471928
>> File Copy 1024 bufsize 2000 maxblocks   549.5       2029.7          269.3721565
>> File Copy 256 bufsize 500 maxblocks     400.7       1439.4          259.2213626
>> File Copy 4096 bufsize 8000 maxblocks   894.3       3434.1          283.9986582
>> Pipe Throughput                         76819.5     74268.8         -3.320380893
>> Pipe-based Context Switching            3444.8      5963.3          73.11019508
>> Process Creation                        301.1       163.2           -45.79873796
>> Shell Scripts (1 concurrent)            1248.1      1859.7          49.00248378
>> Shell Scripts (8 concurrent)            781.2       1171            49.89759345
>> System Call Overhead                    3426        3194.4          -6.760070053
>>
>> System Benchmarks Index Score           3053        4605            50.83524402
>>
>> pvcy shows a smaller overall improvement (50%) compared to vcpu_is_preempted (277%).
>> Host side flamegraph analysis shows that ~60% of the host time when using pvcy
>> is spent in kvm_handle_wfx, compared with ~1.5% when using vcpu_is_preempted,
>> hence vcpu_is_preempted shows a larger improvement.
>>
>> It might be that pvcy can be used in combination with vcpu_is_preempted, but this
>> series is to start a discussion on vcpu_is_preempted as it shows a much bigger
>> improvement in performance and its much easier to review vcpu_is_preempted standalone.
> 
> Looking at both the patchsets - this one and the pvcy, it looks to me
> that vcpu_is_preempted() and the pvcy patches are somewhat
> orthogonal. The former is optimizing mutex and rwsem in their optimistic
> spinning phase while the latter is going after spinlocks (via wfe).
> 
> Unless I'm missing something the features are not necessarily comparable
> on the same workloads - unixbench is probably mutex heavy and doesn't
> show as much benefit with just the pvcy changes. I wonder if it's easy
> to have both the features enabled to see this in effect.
> 
> I've left some comments on the patches; but no need to respin just
> yet. Let's see if there is any other feedback.
> 
> Thanks,
> Punit
> 

There was a small bug in v2, where pv_lock_init was called too early in 
the boot in setup_arch, hence pvlock_vcpu_state was not initialized for 
vCPU 0 (the state was initialized for vCPUs 1-127 during secondary core 
boot, hence the rest of the vCPUs were using pvlock correctly). I will 
send the next revision making it an early_initcall along with addressing 
Punits' comments, but will wait for further comments on v2 before 
sending it. I have tested it with early_initcall and didnt see a 
significant change in performance (which is expected as only 1 out of 
128 vCPUs wasnt using pvlock correctly).

I tried pvcy+vcpu_is_preempted patches together and I see a slight 
reduction in performance over pvcy only.
As a summary, with the above changes to move to early_initcall included 
the overall Unixbench score improvements are:

Change over 6.0.0 base kernel                 % improvement over base
vcpu_is_preempted                             279%
pvcy                                          51%
pvcy+vcpu_is_preempted                        37%

Thanks,
Usama


> [...]
>

Marc Zyngier Nov. 18, 2022, 12:20 a.m. UTC | #6

On Mon, 07 Nov 2022 12:00:44 +0000,
Usama Arif <usama.arif@bytedance.com> wrote:
> 
> 
> 
> On 06/11/2022 16:35, Marc Zyngier wrote:
> > On Fri, 04 Nov 2022 06:20:59 +0000,
> > Usama Arif <usama.arif@bytedance.com> wrote:
> >> 
> >> This patchset adds support for vcpu_is_preempted in arm64, which
> >> allows the guest to check if a vcpu was scheduled out, which is
> >> useful to know incase it was holding a lock. vcpu_is_preempted can
> >> be used to improve performance in locking (see owner_on_cpu usage in
> >> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
> >> and osq_lock) and scheduling (see available_idle_cpu which is used
> >> in several places in kernel/sched/fair.c for e.g. in wake_affine to
> >> determine which CPU can run soonest):
> > 
> > [...]
> > 
> >> pvcy shows a smaller overall improvement (50%) compared to
> >> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
> >> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
> >> compared with ~1.5% when using vcpu_is_preempted, hence
> >> vcpu_is_preempted shows a larger improvement.
> > 
> > And have you worked out *why* we spend so much time handling WFE?
> > 
> > 	M.
> 
> Its from the following change in pvcy patchset:
> 
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index e778eefcf214..915644816a85 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>         }
> 
>         if (esr & ESR_ELx_WFx_ISS_WFE) {
> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
> +               int state;
> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
> +                       schedule();
> +
> +               if (state == -1)
> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>         } else {
>                 if (esr & ESR_ELx_WFx_ISS_WFxT)
>                         vcpu_set_flag(vcpu, IN_WFIT);
> 
> 
> If my understanding is correct of the pvcy changes, whenever pvcy
> returns an unchanged vcpu state, we would schedule to another
> vcpu. And its the constant scheduling where the time is spent. I guess
> the affects are much higher when the lock contention is very
> high. This can be seem from the pvcy host side flamegraph as well with
> (~67% of the time spent in the schedule() call in kvm_handle_wfx), For
> reference, I have put the graph at:
> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg

The real issue here is that we don't try to pick the right vcpu to
run, and strictly rely on schedule() to eventually pick something that
can run.

An interesting to do would be to try and fit the directed yield
mechanism there. It would be a lot more interesting than the one-off
vcpu_is_preempted hack, as it gives us a low-level primitive on which
to construct things (pvcy is effectively a mwait-like primitive).

	M.

Usama Arif Nov. 24, 2022, 1:55 p.m. UTC | #7

On 18/11/2022 00:20, Marc Zyngier wrote:
> On Mon, 07 Nov 2022 12:00:44 +0000,
> Usama Arif <usama.arif@bytedance.com> wrote:
>>
>>
>>
>> On 06/11/2022 16:35, Marc Zyngier wrote:
>>> On Fri, 04 Nov 2022 06:20:59 +0000,
>>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>>
>>>> This patchset adds support for vcpu_is_preempted in arm64, which
>>>> allows the guest to check if a vcpu was scheduled out, which is
>>>> useful to know incase it was holding a lock. vcpu_is_preempted can
>>>> be used to improve performance in locking (see owner_on_cpu usage in
>>>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>>>> and osq_lock) and scheduling (see available_idle_cpu which is used
>>>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>>>> determine which CPU can run soonest):
>>>
>>> [...]
>>>
>>>> pvcy shows a smaller overall improvement (50%) compared to
>>>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>>>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>>>> compared with ~1.5% when using vcpu_is_preempted, hence
>>>> vcpu_is_preempted shows a larger improvement.
>>>
>>> And have you worked out *why* we spend so much time handling WFE?
>>>
>>> 	M.
>>
>> Its from the following change in pvcy patchset:
>>
>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>> index e778eefcf214..915644816a85 100644
>> --- a/arch/arm64/kvm/handle_exit.c
>> +++ b/arch/arm64/kvm/handle_exit.c
>> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>>          }
>>
>>          if (esr & ESR_ELx_WFx_ISS_WFE) {
>> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>> +               int state;
>> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
>> +                       schedule();
>> +
>> +               if (state == -1)
>> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>          } else {
>>                  if (esr & ESR_ELx_WFx_ISS_WFxT)
>>                          vcpu_set_flag(vcpu, IN_WFIT);
>>
>>
>> If my understanding is correct of the pvcy changes, whenever pvcy
>> returns an unchanged vcpu state, we would schedule to another
>> vcpu. And its the constant scheduling where the time is spent. I guess
>> the affects are much higher when the lock contention is very
>> high. This can be seem from the pvcy host side flamegraph as well with
>> (~67% of the time spent in the schedule() call in kvm_handle_wfx), For
>> reference, I have put the graph at:
>> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg
> 
> The real issue here is that we don't try to pick the right vcpu to
> run, and strictly rely on schedule() to eventually pick something that
> can run.
> 
> An interesting to do would be to try and fit the directed yield
> mechanism there. It would be a lot more interesting than the one-off
> vcpu_is_preempted hack, as it gives us a low-level primitive on which
> to construct things (pvcy is effectively a mwait-like primitive).

We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
would we determine which vcpu to yield to?

IMO vcpu_is_preempted is very well integrated in a lot of core kernel 
code, i.e. mutex, rtmutex, rwsem and osq_lock. It is also used in 
scheduler to determine better which vCPU we can run on soonest, select 
idle core, etc. I am not sure if all of these cases will be optimized by 
pvcy? Also, with vcpu_is_preempted, some of the lock heavy benchmarks 
come down from spending around 50% of the time in lock to less than 1% 
(so not sure how much more room is there for improvement).

We could also use vcpu_is_preempted to optimize IPI performance (along 
with directed yield to target IPI vCPU) similar to how its done in x86 
(https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/). 
This case definitely wont be covered by pvcy.

Considering all the above, i.e. the core kernel integration already 
present and possible future usecases of vcpu_is_preempted, maybe its 
worth making vcpu_is_preempted work on arm independently of pvcy?

Thanks,
Usama

> 
> 	M.
>

Usama Arif Dec. 5, 2022, 1:43 p.m. UTC | #8

On 24/11/2022 13:55, Usama Arif wrote:
> 
> 
> On 18/11/2022 00:20, Marc Zyngier wrote:
>> On Mon, 07 Nov 2022 12:00:44 +0000,
>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>
>>>
>>>
>>> On 06/11/2022 16:35, Marc Zyngier wrote:
>>>> On Fri, 04 Nov 2022 06:20:59 +0000,
>>>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>>>
>>>>> This patchset adds support for vcpu_is_preempted in arm64, which
>>>>> allows the guest to check if a vcpu was scheduled out, which is
>>>>> useful to know incase it was holding a lock. vcpu_is_preempted can
>>>>> be used to improve performance in locking (see owner_on_cpu usage in
>>>>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>>>>> and osq_lock) and scheduling (see available_idle_cpu which is used
>>>>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>>>>> determine which CPU can run soonest):
>>>>
>>>> [...]
>>>>
>>>>> pvcy shows a smaller overall improvement (50%) compared to
>>>>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>>>>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>>>>> compared with ~1.5% when using vcpu_is_preempted, hence
>>>>> vcpu_is_preempted shows a larger improvement.
>>>>
>>>> And have you worked out *why* we spend so much time handling WFE?
>>>>
>>>>     M.
>>>
>>> Its from the following change in pvcy patchset:
>>>
>>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>>> index e778eefcf214..915644816a85 100644
>>> --- a/arch/arm64/kvm/handle_exit.c
>>> +++ b/arch/arm64/kvm/handle_exit.c
>>> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>>>          }
>>>
>>>          if (esr & ESR_ELx_WFx_ISS_WFE) {
>>> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>> +               int state;
>>> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
>>> +                       schedule();
>>> +
>>> +               if (state == -1)
>>> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>          } else {
>>>                  if (esr & ESR_ELx_WFx_ISS_WFxT)
>>>                          vcpu_set_flag(vcpu, IN_WFIT);
>>>
>>>
>>> If my understanding is correct of the pvcy changes, whenever pvcy
>>> returns an unchanged vcpu state, we would schedule to another
>>> vcpu. And its the constant scheduling where the time is spent. I guess
>>> the affects are much higher when the lock contention is very
>>> high. This can be seem from the pvcy host side flamegraph as well with
>>> (~67% of the time spent in the schedule() call in kvm_handle_wfx), For
>>> reference, I have put the graph at:
>>> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg
>>
>> The real issue here is that we don't try to pick the right vcpu to
>> run, and strictly rely on schedule() to eventually pick something that
>> can run.
>>
>> An interesting to do would be to try and fit the directed yield
>> mechanism there. It would be a lot more interesting than the one-off
>> vcpu_is_preempted hack, as it gives us a low-level primitive on which
>> to construct things (pvcy is effectively a mwait-like primitive).
> 
> We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
> would we determine which vcpu to yield to?
> 
> IMO vcpu_is_preempted is very well integrated in a lot of core kernel 
> code, i.e. mutex, rtmutex, rwsem and osq_lock. It is also used in 
> scheduler to determine better which vCPU we can run on soonest, select 
> idle core, etc. I am not sure if all of these cases will be optimized by 
> pvcy? Also, with vcpu_is_preempted, some of the lock heavy benchmarks 
> come down from spending around 50% of the time in lock to less than 1% 
> (so not sure how much more room is there for improvement).
> 
> We could also use vcpu_is_preempted to optimize IPI performance (along 
> with directed yield to target IPI vCPU) similar to how its done in x86 
> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/). 
> This case definitely wont be covered by pvcy.
> 
> Considering all the above, i.e. the core kernel integration already 
> present and possible future usecases of vcpu_is_preempted, maybe its 
> worth making vcpu_is_preempted work on arm independently of pvcy?
> 

Hi,

Just wanted to check if there are any comments on above? I can send a v3 
with the doc and code fixes suggested in the earlier reviews if it makes 
sense?

Thanks,
Usama

> Thanks,
> Usama
> 
>>
>>     M.
>>

Usama Arif Jan. 17, 2023, 10:50 a.m. UTC | #9

On 05/12/2022 13:43, Usama Arif wrote:
> 
> 
> On 24/11/2022 13:55, Usama Arif wrote:
>>
>>
>> On 18/11/2022 00:20, Marc Zyngier wrote:
>>> On Mon, 07 Nov 2022 12:00:44 +0000,
>>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>>
>>>>
>>>>
>>>> On 06/11/2022 16:35, Marc Zyngier wrote:
>>>>> On Fri, 04 Nov 2022 06:20:59 +0000,
>>>>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>>>>
>>>>>> This patchset adds support for vcpu_is_preempted in arm64, which
>>>>>> allows the guest to check if a vcpu was scheduled out, which is
>>>>>> useful to know incase it was holding a lock. vcpu_is_preempted can
>>>>>> be used to improve performance in locking (see owner_on_cpu usage in
>>>>>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>>>>>> and osq_lock) and scheduling (see available_idle_cpu which is used
>>>>>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>>>>>> determine which CPU can run soonest):
>>>>>
>>>>> [...]
>>>>>
>>>>>> pvcy shows a smaller overall improvement (50%) compared to
>>>>>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>>>>>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>>>>>> compared with ~1.5% when using vcpu_is_preempted, hence
>>>>>> vcpu_is_preempted shows a larger improvement.
>>>>>
>>>>> And have you worked out *why* we spend so much time handling WFE?
>>>>>
>>>>>     M.
>>>>
>>>> Its from the following change in pvcy patchset:
>>>>
>>>> diff --git a/arch/arm64/kvm/handle_exit.c 
>>>> b/arch/arm64/kvm/handle_exit.c
>>>> index e778eefcf214..915644816a85 100644
>>>> --- a/arch/arm64/kvm/handle_exit.c
>>>> +++ b/arch/arm64/kvm/handle_exit.c
>>>> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>>>>          }
>>>>
>>>>          if (esr & ESR_ELx_WFx_ISS_WFE) {
>>>> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>> +               int state;
>>>> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
>>>> +                       schedule();
>>>> +
>>>> +               if (state == -1)
>>>> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>>          } else {
>>>>                  if (esr & ESR_ELx_WFx_ISS_WFxT)
>>>>                          vcpu_set_flag(vcpu, IN_WFIT);
>>>>
>>>>
>>>> If my understanding is correct of the pvcy changes, whenever pvcy
>>>> returns an unchanged vcpu state, we would schedule to another
>>>> vcpu. And its the constant scheduling where the time is spent. I guess
>>>> the affects are much higher when the lock contention is very
>>>> high. This can be seem from the pvcy host side flamegraph as well with
>>>> (~67% of the time spent in the schedule() call in kvm_handle_wfx), For
>>>> reference, I have put the graph at:
>>>> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg
>>>
>>> The real issue here is that we don't try to pick the right vcpu to
>>> run, and strictly rely on schedule() to eventually pick something that
>>> can run.
>>>
>>> An interesting to do would be to try and fit the directed yield
>>> mechanism there. It would be a lot more interesting than the one-off
>>> vcpu_is_preempted hack, as it gives us a low-level primitive on which
>>> to construct things (pvcy is effectively a mwait-like primitive).
>>
>> We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
>> would we determine which vcpu to yield to?
>>
>> IMO vcpu_is_preempted is very well integrated in a lot of core kernel 
>> code, i.e. mutex, rtmutex, rwsem and osq_lock. It is also used in 
>> scheduler to determine better which vCPU we can run on soonest, select 
>> idle core, etc. I am not sure if all of these cases will be optimized 
>> by pvcy? Also, with vcpu_is_preempted, some of the lock heavy 
>> benchmarks come down from spending around 50% of the time in lock to 
>> less than 1% (so not sure how much more room is there for improvement).
>>
>> We could also use vcpu_is_preempted to optimize IPI performance (along 
>> with directed yield to target IPI vCPU) similar to how its done in x86 
>> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/). 
>> This case definitely wont be covered by pvcy.
>>
>> Considering all the above, i.e. the core kernel integration already 
>> present and possible future usecases of vcpu_is_preempted, maybe its 
>> worth making vcpu_is_preempted work on arm independently of pvcy?
>>
> 
> Hi,
> 
> Just wanted to check if there are any comments on above? I can send a v3 
> with the doc and code fixes suggested in the earlier reviews if it makes 
> sense?
> 
> Thanks,
> Usama
> 
>> Thanks,
>> Usama
>>

Hi,

The discussion on the patches had died down around November. I have sent 
v3 of the patches 
(https://lore.kernel.org/all/20230117102930.1053337-1-usama.arif@bytedance.com/) 
to hopefully restart it as I think that there is a significant 
performance improvement to be had with vcpu_is_preempted being 
implemented in arm64 which is well integrated in mutex, rtmutex, rwsem, 
osq_lock and scheduler, and could potentially be used to improve the IPI 
performance in the future.

Thanks,
Usama

>>>
>>>     M.
>>>