Message ID | 20230315164910.302265-1-krzysztof.kozlowski@linaro.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:604a:0:0:0:0:0 with SMTP id j10csp8599wrt; Wed, 15 Mar 2023 09:56:42 -0700 (PDT) X-Google-Smtp-Source: AK7set8QAZUgL4PH9X/4nt7lnzgIPd0cj5cNBdLngfrSzQ33NKwq0HkJ2IMfBT9XQ8X93+4TLUsc X-Received: by 2002:a17:902:d2cc:b0:19d:2542:96a4 with SMTP id n12-20020a170902d2cc00b0019d254296a4mr314153plc.4.1678899402157; Wed, 15 Mar 2023 09:56:42 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1678899402; cv=none; d=google.com; s=arc-20160816; b=P1Yq61fheuxGkcdHAPi4VUEXSGUlbeRVxnf7sZvWqYGyUBbNE18f5D/gchk2aA3W7j +xYMnzVuGxvbYDFewF10sFQC9155qFgjFzFPwNUhOi1G52YzzW9d9e/r1CRKo9uHRrQ9 MxzYsi1c/o2GnWasxKfHBfCHY4X0Sh9BgveZ1Bmb76yG3J7oSX1Xkv3olioahu3HdsGT eAXCTcXI7tATO/maE/fWfG3M5Nzccb+fuNV6VZA8LtLmMesMW1wUMwl/ouDqUeh1lKv/ qHUll/L9MLjkLyRSpQo9Glw9FMEyHIpMGHw6vO+PkZoUYNuz9v8s8x+6qzUaG11FrnJj A3Hg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=1MDMGCcobqwi/a1l36yEQFELeDyhrLe4cYbKsHJruU0=; b=mjRcynIXi0MiA6PJZYK7oNDWic/XJx3JEsMVP70rWHLH4p1IWrFz4nXUZo5F0fRD5J wykBZ12vBb+ouhnlfyovI2ttYkFhtRlIMG0ORU1MforzKLXPJ5bMGVmPxhtEjoSDoHTK C+LwPfNGxmlCNp/xq4GM4GabXdnXdYsHXmfZx3JPjhfkROZ40ntmPHd7Rt/m6Mo2QUOF Oxyhys/pJQPeNb6+IXJF/E0P5ZqsOTd8Hs0thlFrfpy+v+z8gb8CynoigU1OGnEyV+N6 kE54N9kwna72mRIAo0lE2JGuprLDRmilgmtMO/uDtHbdQQTaGHSKbdKm7ewtFkd3WStz 88Ng== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=Jue10VCE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id kq15-20020a170903284f00b0019acbf1dc4asi5592779plb.181.2023.03.15.09.56.29; Wed, 15 Mar 2023 09:56:42 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=Jue10VCE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231955AbjCOQtW (ORCPT <rfc822;ruipengqi7@gmail.com> + 99 others); Wed, 15 Mar 2023 12:49:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33442 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231802AbjCOQtQ (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 15 Mar 2023 12:49:16 -0400 Received: from mail-ed1-x533.google.com (mail-ed1-x533.google.com [IPv6:2a00:1450:4864:20::533]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 718F2149B9 for <linux-kernel@vger.kernel.org>; Wed, 15 Mar 2023 09:49:14 -0700 (PDT) Received: by mail-ed1-x533.google.com with SMTP id y4so48582586edo.2 for <linux-kernel@vger.kernel.org>; Wed, 15 Mar 2023 09:49:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; t=1678898953; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=1MDMGCcobqwi/a1l36yEQFELeDyhrLe4cYbKsHJruU0=; b=Jue10VCE7pDR/X9n255Txry2emurXDGYizex363AJ1dlTCcwKfWx/POZeXjLPpi44P le8JKKoxhZrZcv8PCtAn0yA4ppGgEhv1s5Pe8OZY3NeWH15o1tXl0UIwAWXF7b5YhyL6 31A45AdfR30DFSfFXb8ncHVu2U57qPdy+udRDHWgMdLjdyyfm/nx+1YYSbww60UuXKQN m1dLTkS3NWkbJOnFpDng10ymPi3HlsReR1tud1HxvJX5L/VkpoYnrEb73lvl6VNY3qEt DvAojHfEYviiPRqzndrxvkruoWTltwUNGEGlXskulg0k1IDFYB7Ikiw1Pf3FejxqRork OnAA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1678898953; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=1MDMGCcobqwi/a1l36yEQFELeDyhrLe4cYbKsHJruU0=; b=upbgG81X2E0dol5c3f7rvaZbj39zy/DjhV46FXj5oPpBdqqcAUCkVfixNCGJf5M628 mxzeJH61i1lSd2BtKypdVnBFCPutJqXAVRPfqpcAXzodtIqftm5Igiw7XRlIbIN/2Qq7 O38tDm5l/b/NlQHvU3DCjjdVFqxnuLwn6BXljpxsoEpz3kC1poPXDVfjTazhy5eVSekd spsOuMJ6RUx4KCccgnjC1TAu1SyJTPvNMJSPH8N9KeZ7awDPGefIfxEerfW5fgopGNOj wefFPTxVfEIksu9u72ET61IaZrE/vi9nhyVmaWTWrJYkudEfVShu2VAN1tuFq2cTeVzm 842A== X-Gm-Message-State: AO0yUKU08Rx+szEzBpvcNID4TBWYkIsPmT7C1s0hIVnkxkDNwELJr1eQ M0VpjevkGPUd2YNvYNWOKzjkuA== X-Received: by 2002:a17:906:facf:b0:8f4:809e:faee with SMTP id lu15-20020a170906facf00b008f4809efaeemr58817ejb.19.1678898952955; Wed, 15 Mar 2023 09:49:12 -0700 (PDT) Received: from krzk-bin.. ([2a02:810d:15c0:828:940e:8615:37dc:c2bd]) by smtp.gmail.com with ESMTPSA id u10-20020a170906068a00b008e3bf17fb2asm2786972ejb.19.2023.03.15.09.49.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Mar 2023 09:49:12 -0700 (PDT) From: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> To: Andy Gross <agross@kernel.org>, Bjorn Andersson <andersson@kernel.org>, Konrad Dybcio <konrad.dybcio@linaro.org>, "Rafael J. Wysocki" <rafael@kernel.org>, Viresh Kumar <viresh.kumar@linaro.org>, linux-arm-msm@vger.kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Adrien Thierry <athierry@redhat.com>, Brian Masney <bmasney@redhat.com>, linux-rt-users@vger.kernel.org, Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Subject: [RFC PATCH] cpufreq: qcom-cpufreq-hw: allow work to be done on other CPU for PREEMPT_RT Date: Wed, 15 Mar 2023 17:49:10 +0100 Message-Id: <20230315164910.302265-1-krzysztof.kozlowski@linaro.org> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1760453619536171307?= X-GMAIL-MSGID: =?utf-8?q?1760453619536171307?= |
Series |
[RFC] cpufreq: qcom-cpufreq-hw: allow work to be done on other CPU for PREEMPT_RT
|
|
Commit Message
Krzysztof Kozlowski
March 15, 2023, 4:49 p.m. UTC
Qualcomm cpufreq driver configures interrupts with affinity to each
cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250.
Triggered interrupt will schedule delayed work, but, since workqueue
prefers local CPUs, it might get executed on a CPU dedicated to realtime
tasks causing unexpected latencies in realtime workload.
Use unbound workqueue for such case. This might come with performance
or energy penalty, e.g. because of cache miss or when other CPU is
sleeping.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
---
drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
Comments
On 15/03/2023 17:49, Krzysztof Kozlowski wrote: > Qualcomm cpufreq driver configures interrupts with affinity to each > cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250. > Triggered interrupt will schedule delayed work, but, since workqueue > prefers local CPUs, it might get executed on a CPU dedicated to realtime > tasks causing unexpected latencies in realtime workload. > > Use unbound workqueue for such case. This might come with performance > or energy penalty, e.g. because of cache miss or when other CPU is > sleeping. > > Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> > --- > drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++- Let me also paste impact of this patch - rtla osnoise on entirely idle system (cores 2-7 isolated for Realtime): BEFORE: osnoise/7-2967 [007] d..h2.. 3937.898311: irq_noise: dcvsh-irq-7:179 start 3937.898310871 duration 104 ns irq/179-dcvsh-i-343 [007] d..h2.. 3937.898318: irq_noise: IPI:6 start 3937.898317537 duration 104 ns irq/179-dcvsh-i-343 [007] d...3.. 3937.898321: thread_noise: irq/179-dcvsh-i:343 start 3937.898316287 duration 4740 ns kworker/7:0-85 [007] d..h2.. 3937.898323: irq_noise: IPI:6 start 3937.898322381 duration 104 ns kworker/7:0-85 [007] d...3.. 3937.898343: thread_noise: kworker/7:0:85 start 3937.898321339 duration 20990 ns osnoise/7-2967 [007] ....... 3937.898343: sample_threshold: start 3937.898308475 duration 34531 ns interference 5 Noise duration: 34 us AFTER: osnoise/7-2637 [007] d..h2.. 530.563819: irq_noise: dcvsh-irq-7:178 start 530.563817139 duration 260 ns osnoise/7-2637 [007] d..h2.. 530.563827: irq_noise: IPI:6 start 530.563826670 duration 156 ns osnoise/7-2637 [007] ....... 530.563828: sample_threshold: start 530.563814587 duration 12864 ns interference 2 Noise duration: 13 us Best regards, Krzysztof
On 17/03/2023 00:57, Hillf Danton wrote: > On 16 Mar 2023 13:28:18 +0100 Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >> On 15/03/2023 17:49, Krzysztof Kozlowski wrote: >>> Qualcomm cpufreq driver configures interrupts with affinity to each >>> cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250. >>> Triggered interrupt will schedule delayed work, but, since workqueue >>> prefers local CPUs, it might get executed on a CPU dedicated to realtime >>> tasks causing unexpected latencies in realtime workload. >>> >>> Use unbound workqueue for such case. This might come with performance >>> or energy penalty, e.g. because of cache miss or when other CPU is >>> sleeping. >>> >>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >>> --- >>> drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++- >> >> Let me also paste impact of this patch - rtla osnoise on entirely idle >> system (cores 2-7 isolated for Realtime): > > Are cores 2-7 the non-housekeeping CPUs[1]? Yes, since their are isolated for Realtime this is synonymous. > > [1] https://lore.kernel.org/lkml/20230223150624.GA29739@lst.de/ Best regards, Krzysztof
On 2023-03-15 17:49:10 [+0100], Krzysztof Kozlowski wrote: > Qualcomm cpufreq driver configures interrupts with affinity to each > cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250. > Triggered interrupt will schedule delayed work, but, since workqueue > prefers local CPUs, it might get executed on a CPU dedicated to realtime > tasks causing unexpected latencies in realtime workload. > > Use unbound workqueue for such case. This might come with performance > or energy penalty, e.g. because of cache miss or when other CPU is > sleeping. I miss the point where it explains that only PREEMPT_RT is affected by this. > Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> > --- > drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c > index 2f581d2d617d..c5ff8d25fabb 100644 > --- a/drivers/cpufreq/qcom-cpufreq-hw.c > +++ b/drivers/cpufreq/qcom-cpufreq-hw.c > @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data) > > /* Disable interrupt and enable polling */ > disable_irq_nosync(c_data->throttle_irq); > - schedule_delayed_work(&c_data->throttle_work, 0); > + > + /* > + * Workqueue prefers local CPUs and since interrupts have set affinity, > + * the work might execute on a CPU dedicated to realtime tasks. > + */ > + if (IS_ENABLED(CONFIG_PREEMPT_RT)) > + queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq, > + &c_data->throttle_work, 0); > + else > + schedule_delayed_work(&c_data->throttle_work, 0); You isolated CPUs and use this on PREEMPT_RT. And this special use-case is your reasoning to make this change and let it depend on PREEMPT_RT? If you do PREEMPT_RT and you care about latency I would argue that you either disable cpufreq and set it to PERFORMANCE so that the highest available frequency is set once and not changed afterwards. > if (qcom_cpufreq.soc_data->reg_intr_clr) > writel_relaxed(GT_IRQ_STATUS, > -- > 2.34.1 >
On 21/03/2023 11:04, Sebastian Andrzej Siewior wrote: > On 2023-03-15 17:49:10 [+0100], Krzysztof Kozlowski wrote: >> Qualcomm cpufreq driver configures interrupts with affinity to each >> cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250. >> Triggered interrupt will schedule delayed work, but, since workqueue >> prefers local CPUs, it might get executed on a CPU dedicated to realtime >> tasks causing unexpected latencies in realtime workload. >> >> Use unbound workqueue for such case. This might come with performance >> or energy penalty, e.g. because of cache miss or when other CPU is >> sleeping. > > I miss the point where it explains that only PREEMPT_RT is affected by > this. I assume "realtime tasks" implies this, but I can make it clearer. > >> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >> --- >> drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++- >> 1 file changed, 10 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c >> index 2f581d2d617d..c5ff8d25fabb 100644 >> --- a/drivers/cpufreq/qcom-cpufreq-hw.c >> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c >> @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data) >> >> /* Disable interrupt and enable polling */ >> disable_irq_nosync(c_data->throttle_irq); >> - schedule_delayed_work(&c_data->throttle_work, 0); >> + >> + /* >> + * Workqueue prefers local CPUs and since interrupts have set affinity, >> + * the work might execute on a CPU dedicated to realtime tasks. >> + */ >> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) >> + queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq, >> + &c_data->throttle_work, 0); >> + else >> + schedule_delayed_work(&c_data->throttle_work, 0); > > You isolated CPUs and use this on PREEMPT_RT. And this special use-case > is your reasoning to make this change and let it depend on PREEMPT_RT? > > If you do PREEMPT_RT and you care about latency I would argue that you > either disable cpufreq and set it to PERFORMANCE so that the highest > available frequency is set once and not changed afterwards. The cpufreq is set to performance. It will be changed anyway because underlying FW notifies through such interrupts about thermal mitigation happening. The only other solution is to disable the cpufreq device, e.g. by not compiling it. Best regards, Krzysztof
On 2023-03-21 11:24:46 [+0100], Krzysztof Kozlowski wrote: > >> --- a/drivers/cpufreq/qcom-cpufreq-hw.c > >> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c > >> @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data) > >> > >> /* Disable interrupt and enable polling */ > >> disable_irq_nosync(c_data->throttle_irq); > >> - schedule_delayed_work(&c_data->throttle_work, 0); > >> + > >> + /* > >> + * Workqueue prefers local CPUs and since interrupts have set affinity, > >> + * the work might execute on a CPU dedicated to realtime tasks. > >> + */ > >> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) > >> + queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq, > >> + &c_data->throttle_work, 0); > >> + else > >> + schedule_delayed_work(&c_data->throttle_work, 0); > > > > You isolated CPUs and use this on PREEMPT_RT. And this special use-case > > is your reasoning to make this change and let it depend on PREEMPT_RT? > > > > If you do PREEMPT_RT and you care about latency I would argue that you > > either disable cpufreq and set it to PERFORMANCE so that the highest > > available frequency is set once and not changed afterwards. > > The cpufreq is set to performance. It will be changed anyway because > underlying FW notifies through such interrupts about thermal mitigation > happening. I still fail to understand why this is PREEMPT_RT specific and not a problem in general when it comes not NO_HZ_FULL and/ or CPU isolation. However the thermal notifications have nothing to do with cpufreq. > The only other solution is to disable the cpufreq device, e.g. by not > compiling it. People often disable cpufreq because _usually_ the system boots at maximum performance. There are however exceptions and even x86 system are configured sometimes to a lower clock speed by the firmware/ BIOS. In this case it is nice to have a cpufreq so it is possible to set the system during boot to a higher clock speed. And then remain idle unless the cpufreq governor changed. > Best regards, > Krzysztof Sebastian
On 21/03/2023 11:57, Sebastian Andrzej Siewior wrote: > On 2023-03-21 11:24:46 [+0100], Krzysztof Kozlowski wrote: >>>> --- a/drivers/cpufreq/qcom-cpufreq-hw.c >>>> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c >>>> @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data) >>>> >>>> /* Disable interrupt and enable polling */ >>>> disable_irq_nosync(c_data->throttle_irq); >>>> - schedule_delayed_work(&c_data->throttle_work, 0); >>>> + >>>> + /* >>>> + * Workqueue prefers local CPUs and since interrupts have set affinity, >>>> + * the work might execute on a CPU dedicated to realtime tasks. >>>> + */ >>>> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) >>>> + queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq, >>>> + &c_data->throttle_work, 0); >>>> + else >>>> + schedule_delayed_work(&c_data->throttle_work, 0); >>> >>> You isolated CPUs and use this on PREEMPT_RT. And this special use-case >>> is your reasoning to make this change and let it depend on PREEMPT_RT? >>> >>> If you do PREEMPT_RT and you care about latency I would argue that you >>> either disable cpufreq and set it to PERFORMANCE so that the highest >>> available frequency is set once and not changed afterwards. >> >> The cpufreq is set to performance. It will be changed anyway because >> underlying FW notifies through such interrupts about thermal mitigation >> happening. > > I still fail to understand why this is PREEMPT_RT specific and not a > problem in general when it comes not NO_HZ_FULL and/ or CPU isolation. Hm, good point, I actually don't know what is the workqueue recommendation for NO_HZ_FULL CPUs - is still locality of the workqueue preferred? And how such code would look like? if (tick_nohz_tick_stopped())? > However the thermal notifications have nothing to do with cpufreq. They have. The FW notifies that thermal mitigation is happening and maximum allowed frequency is now XYZ. The cpufreq receives this and sets maximum allowed scaling frequency for governor. > >> The only other solution is to disable the cpufreq device, e.g. by not >> compiling it. > > People often disable cpufreq because _usually_ the system boots at > maximum performance. There are however exceptions and even x86 system > are configured sometimes to a lower clock speed by the firmware/ BIOS. > In this case it is nice to have a cpufreq so it is possible to set the > system during boot to a higher clock speed. And then remain idle unless > the cpufreq governor changed. Which we do not want here, thus disabling cpufreq is not the interesting solution... Best regards, Krzysztof
On 2023-03-21 12:27:42 [+0100], Krzysztof Kozlowski wrote: > > I still fail to understand why this is PREEMPT_RT specific and not a > > problem in general when it comes not NO_HZ_FULL and/ or CPU isolation. > > Hm, good point, I actually don't know what is the workqueue > recommendation for NO_HZ_FULL CPUs - is still locality of the workqueue > preferred? If you isolate a CPU you want the kernel to stay away from it. The idea is that something is done on that CPU and the kernel should leave it alone. That is why the HZ tick avoided. That is why timers migrate to the "housekeeping" CPU and do not fire on the CPU that it was programmed on (unless the timer has to fire on this CPU). > And how such code would look like? > if (tick_nohz_tick_stopped())? Yeah closer :) The CPU-mask for workqueues can still be different on non-NOHZ-full CPUs. Still you interrupt the CPU doing in-userland work and this is not desired. You have a threaded-IRQ which does nothing but schedules a worker. Why? Why not sleep and remain in that threaded IRQ until the work is done? You _can_ sleep in the threaded IRQ if you have to. Force-threaded is different but this is one is explicit threaded so you could do it. > > However the thermal notifications have nothing to do with cpufreq. > > They have. The FW notifies that thermal mitigation is happening and > maximum allowed frequency is now XYZ. The cpufreq receives this and sets > maximum allowed scaling frequency for governor. I see. So the driver is doing something in worst case. This interrupt, you have per-CPU and you need to do this CPU? I mean could you change the affinity of the interrupt to another CPU? > Best regards, > Krzysztof Sebastian
On 21/03/2023 14:39, Sebastian Andrzej Siewior wrote: > On 2023-03-21 12:27:42 [+0100], Krzysztof Kozlowski wrote: >>> I still fail to understand why this is PREEMPT_RT specific and not a >>> problem in general when it comes not NO_HZ_FULL and/ or CPU isolation. >> >> Hm, good point, I actually don't know what is the workqueue >> recommendation for NO_HZ_FULL CPUs - is still locality of the workqueue >> preferred? > > If you isolate a CPU you want the kernel to stay away from it. The idea > is that something is done on that CPU and the kernel should leave it > alone. That is why the HZ tick avoided. That is why timers migrate to > the "housekeeping" CPU and do not fire on the CPU that it was programmed > on (unless the timer has to fire on this CPU). > >> And how such code would look like? >> if (tick_nohz_tick_stopped())? > > Yeah closer :) The CPU-mask for workqueues can still be different on > non-NOHZ-full CPUs. Still you interrupt the CPU doing in-userland work > and this is not desired. Probably this should be done by workqueue core code. Individual drivers should not need to investigate which CPUs are isolated. > You have a threaded-IRQ which does nothing but schedules a worker. Why? > Why not sleep and remain in that threaded IRQ until the work is done? > You _can_ sleep in the threaded IRQ if you have to. Force-threaded is > different but this is one is explicit threaded so you could do it. If I get your point correctly, you want the IRQ handler thread to do the actual work instead of scheduling work? The answer to this is probably here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e0e27c3d4e20dab861566f1c348ae44e4b498630 > >>> However the thermal notifications have nothing to do with cpufreq. >> >> They have. The FW notifies that thermal mitigation is happening and >> maximum allowed frequency is now XYZ. The cpufreq receives this and sets >> maximum allowed scaling frequency for governor. > > I see. So the driver is doing something in worst case. This interrupt, > you have per-CPU and you need to do this CPU? I mean could you change > the affinity of the interrupt to another CPU? I don't know. The commit introducing it: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3ed6dfbd3bb987b3d2de86304ae45972ebff5870 claimed it helps to reduce number of interrupts hitting CPU 10x-100x times... I don't see it - neither in tests nor in the code, so I am just thinking to revert that one. Best regards, Krzysztof
On 2023-03-23 09:16:27 [+0100], Krzysztof Kozlowski wrote: > > Yeah closer :) The CPU-mask for workqueues can still be different on > > non-NOHZ-full CPUs. Still you interrupt the CPU doing in-userland work > > and this is not desired. > > Probably this should be done by workqueue core code. Individual drivers > should not need to investigate which CPUs are isolated. _Either_ this is part of the interrupt service routine or it is not. Sometimes work can be offloaded. However this interrupt is short and offloads work to a workqueue. Can the interrupt be moved to another CPU without breaking something? The use can only change the CPUs on which the workqueue can run but also the affinity of the IRQ itself. If the user wishes to isolate CPU X and move workqueues and interrupts away from the CPU the question is why is this a problem for you. > > You have a threaded-IRQ which does nothing but schedules a worker. Why? > > Why not sleep and remain in that threaded IRQ until the work is done? > > You _can_ sleep in the threaded IRQ if you have to. Force-threaded is > > different but this is one is explicit threaded so you could do it. > > If I get your point correctly, you want the IRQ handler thread to do the > actual work instead of scheduling work? The answer to this is probably here: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e0e27c3d4e20dab861566f1c348ae44e4b498630 Let me look. | Re-enabling an interrupt from its own interrupt handler may cause | an interrupt storm, if there is a pending interrupt and because its | handling is disabled due to already done entrance into the handler | above in the stack. I have hard time parsing this. disable_irq_nosync() and enable enable_irq() shouldn't be invoked from within the interrupt handler itself. This interrupt is already requested as a threaded handler with IRQF_ONESHOT. This means the IRQ-chip already disables the interrupt while the threaded handler is invoked. No need for that. I don't know what the purpose of reg_intr_clr here is. Acking the interrupt before reading the status and doing any work does not look right. | Also, apparently it is improper to lock a mutex in an interrupt contex. Again, this is an interrupt handler requested as a threaded handler. This handler is invoked with enabled interrupts and preemption. The code in this handler can invoke ssleep() or mutex_lock(). A might_sleep() does not produce a plat here. It okay to acquire a mutex. This is why we have threaded interrupts. You can't acquire a mutex in a forced-threaded handler. This is not the case here. > > > >>> However the thermal notifications have nothing to do with cpufreq. > >> > >> They have. The FW notifies that thermal mitigation is happening and > >> maximum allowed frequency is now XYZ. The cpufreq receives this and sets > >> maximum allowed scaling frequency for governor. > > > > I see. So the driver is doing something in worst case. This interrupt, > > you have per-CPU and you need to do this CPU? I mean could you change > > the affinity of the interrupt to another CPU? > > I don't know. The commit introducing it: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3ed6dfbd3bb987b3d2de86304ae45972ebff5870 > claimed it helps to reduce number of interrupts hitting CPU 10x-100x > times... I don't see it - neither in tests nor in the code, so I am just > thinking to revert that one. So it may run on another CPU but doing it on the right cluster reduces the received interrupt 10-100 times. Do we have per-cluster register or is the interrupt ACK wrong and this what is observed? The questions are: - What is the interrupt signaling. - What must be done to acknowledge the interrupt. This needs to be figured out and verified that it actually works as intended. The hardware might keep sending interrupt because the source is either not acknowledge properly or the source of the interrupt (the reason why it was generated in first place) is still pending/ not handled. The changes you reference look like "if we do this then it seems better.". No explanation about the issue/ root cause and the targeted solution. > Best regards, > Krzysztof Sebastian
diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c index 2f581d2d617d..c5ff8d25fabb 100644 --- a/drivers/cpufreq/qcom-cpufreq-hw.c +++ b/drivers/cpufreq/qcom-cpufreq-hw.c @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data) /* Disable interrupt and enable polling */ disable_irq_nosync(c_data->throttle_irq); - schedule_delayed_work(&c_data->throttle_work, 0); + + /* + * Workqueue prefers local CPUs and since interrupts have set affinity, + * the work might execute on a CPU dedicated to realtime tasks. + */ + if (IS_ENABLED(CONFIG_PREEMPT_RT)) + queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq, + &c_data->throttle_work, 0); + else + schedule_delayed_work(&c_data->throttle_work, 0); if (qcom_cpufreq.soc_data->reg_intr_clr) writel_relaxed(GT_IRQ_STATUS,