Message ID | 20240206185709.928420669@redhat.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-55520-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:168b:b0:106:860b:bbdd with SMTP id ma11csp1754061dyb; Tue, 6 Feb 2024 11:03:09 -0800 (PST) X-Google-Smtp-Source: AGHT+IFQHIgIxWvGrtZhAZQs33WzArD5KyBC/KbCjdOK8lQfPgdViLs/IWTqJystfKsHAhCfZaz1 X-Received: by 2002:a17:902:d4c1:b0:1d7:2e86:fb2a with SMTP id o1-20020a170902d4c100b001d72e86fb2amr2769145plg.65.1707246188771; Tue, 06 Feb 2024 11:03:08 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1707246188; cv=pass; d=google.com; s=arc-20160816; b=nBfc4sOXq40qrkKnXeJ572i2Yzby5bGFaO+I4zpysk19dUMRQsiPjWrI7rX/fNmruj 92bDDfe4lmB1cyWA8t2choiQ0PMNlocpoJI6p3lOuiJxNVxlfp6Ax+F0K1BrDDmQj6or YRV3AVI/39lMH9X0Uy6msxOrGl/V9HjThP3RbNYt9eCiejtlMnJWA1/lxCLrA6Z/ofCc 2A2t7Nx8ah6EEmjGWQqZKJEirnBuxNBULDaZ1Gm5z1UQWzLdONt3lkmQqAM4mfgliWhV MlVdq1r5uqF2a4EsFwdj7NbL7V0siX7E3INce7I6AkqvqqtEArFYBIPZCz/cNPs14GjY JTHQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=mime-version:list-unsubscribe:list-subscribe:list-id:precedence :references:subject:cc:to:from:date:user-agent:message-id :dkim-signature; bh=Ra7EGxeWV8AGneO4cyp0zFD1CrKmCahpgLqv1cR4INo=; fh=S5pK+IFXb9KWWJgt1GmmpnukymGtwWq8gVNDoEMMZns=; b=BFlBuoGZ6PjQ3Sn7g29teKSgKHvrZ2Yd+3Cg6uDn+8tNeYu6PpaQzMkQI/2i4KRAvo yxZDCpxZAPbjNbDAdA0+92T/81fzsRtWUlfrOnPE6loHDv0GNRPDInAtsegM9DzmmZmf N4o6VDUZqHLQ6oZyKD0PJbkfnguejPDzXRp0DkeHZAEWBYQ6b5kqXZ2sEzddzvO43QAX GhHJOE40Bjnxw0UJYtl1UuJthQ+u53oSWTIQAAsNMBtJhmYUjUwIyNjy2Vgvwu83XvHY lEX5tqX/uI2rpP2/SXY2tZjWjo9MfvC9bU7JUQmNGyRy8QL6fBkVfeno6J0Lfftfc2Rc tXcg==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Tii4fEL6; arc=pass (i=1 spf=pass spfdomain=redhat.com dkim=pass dkdomain=redhat.com dmarc=pass fromdomain=redhat.com); spf=pass (google.com: domain of linux-kernel+bounces-55520-ouuuleilei=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-55520-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com X-Forwarded-Encrypted: i=1; AJvYcCXuKX/d1STQ17T5LpterVIZuIrA2DibVsKdz0aneohBVWcBw3DR8g8BoPq6/53JEC3bsBJKhu2at8WcjOg5loO167YVog== Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id 21-20020a170902e9d500b001d4e7ad3810si2043169plk.604.2024.02.06.11.03.08 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 06 Feb 2024 11:03:08 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-55520-ouuuleilei=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Tii4fEL6; arc=pass (i=1 spf=pass spfdomain=redhat.com dkim=pass dkdomain=redhat.com dmarc=pass fromdomain=redhat.com); spf=pass (google.com: domain of linux-kernel+bounces-55520-ouuuleilei=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-55520-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 547E32926E4 for <ouuuleilei@gmail.com>; Tue, 6 Feb 2024 19:02:47 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 865121BF54; Tue, 6 Feb 2024 19:00:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Tii4fEL6" Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6086C13AF0 for <linux-kernel@vger.kernel.org>; Tue, 6 Feb 2024 19:00:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707246035; cv=none; b=ATIGXjB59qklynLx61vZGzmr2C0xq8zwKj1/Te+iiyHixB36p7qkOw+90j23xETxHlLzSNtn1CHBKxNoqZMBMGtuKp11MmAjbVuImeiIWIa7EF5ZHil7w+7nz+2NvGRYd427Hkkx4VZYZpBv5sS0V6YN6O/kLoajw0qBKuAPkEg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707246035; c=relaxed/simple; bh=BYgVLfv6v/ttWx32QuJAxOjLWrclNlzBJvovbd6xDjw=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=XcoQstIpi4mpYM40W8c68ZyQ9EWZ0udyaQ4l4B/JahHafX11cb1s3fE5rH4DP3/Cz6Wf4IlVA7FEikHwO5/cBcF9r5Mf7HPeGpbhDkvowza/+S7CQsYgCtRUhgUKEX+/RDDBaF4xn3RqSjN3hjtn60L9qBHLDBng5skISEwZdrQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Tii4fEL6; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1707246030; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=Ra7EGxeWV8AGneO4cyp0zFD1CrKmCahpgLqv1cR4INo=; b=Tii4fEL6HSrqbjLiWgNqB/TZEV5g56FfItQFZ2eaBFJ/q0ZK9DW8ekXUPcuyC60X79Tg0Y lQbJrvawTXHVq6NGrFAGvShmTH+gSb2NZEPyzvEouUrzS/6lsFPRLzz9K7MEdGcC0NKWu0 w4DT2MRQzkGofzUbgfnFVFIho+iDCks= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-54-dMpKcn1bNXyBXxUWz-pMEA-1; Tue, 06 Feb 2024 14:00:26 -0500 X-MC-Unique: dMpKcn1bNXyBXxUWz-pMEA-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 5D0933CBDF69; Tue, 6 Feb 2024 19:00:25 +0000 (UTC) Received: from tpad.localdomain (unknown [10.96.133.4]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 1DE8F492BC7; Tue, 6 Feb 2024 19:00:25 +0000 (UTC) Received: by tpad.localdomain (Postfix, from userid 1000) id 25B48401E119E; Tue, 6 Feb 2024 15:58:07 -0300 (-03) Message-ID: <20240206185709.928420669@redhat.com> User-Agent: quilt/0.67 Date: Tue, 06 Feb 2024 15:49:15 -0300 From: Marcelo Tosatti <mtosatti@redhat.com> To: linux-kernel@vger.kernel.org Cc: Daniel Bristot de Oliveira <bristot@kernel.org>, Juri Lelli <juri.lelli@redhat.com>, Valentin Schneider <vschneid@redhat.com>, Frederic Weisbecker <frederic@kernel.org>, Leonardo Bras <leobras@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Thomas Gleixner <tglx@linutronix.de>, Marcelo Tosatti <mtosatti@redhat.com> Subject: [patch 04/12] clockevent unbind: use smp_call_func_single_fail References: <20240206184911.248214633@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.9 X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1790177379646603918 X-GMAIL-MSGID: 1790177379646603918 |
Series |
cpu isolation: infra to block interference to select CPUs
|
|
Commit Message
Marcelo Tosatti
Feb. 6, 2024, 6:49 p.m. UTC
Convert clockevents_unbind from smp_call_function_single
to smp_call_func_single_fail, which will fail in case
the target CPU is tagged as block interference CPU.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Comments
On Tue, Feb 06 2024 at 15:49, Marcelo Tosatti wrote: > Convert clockevents_unbind from smp_call_function_single > to smp_call_func_single_fail, which will fail in case > the target CPU is tagged as block interference CPU. Seriously? This is a root only operation. So if root wants to interfere then so be it.
On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote: > On Tue, Feb 06 2024 at 15:49, Marcelo Tosatti wrote: > > Convert clockevents_unbind from smp_call_function_single > > to smp_call_func_single_fail, which will fail in case > > the target CPU is tagged as block interference CPU. > > Seriously? This is a root only operation. So if root wants to interfere > then so be it. Hi Thomas! OK, so the problem is the following: due to software complexity, one is often not aware of all operations that might take place. For example: https://lore.kernel.org/lkml/Y6mXDUZkII5OnuE8@tpad/T/ [PATCH] hwmon: coretemp: avoid RDMSR interruptions to isolated CPUs The coretemp driver uses rdmsr_on_cpu calls to read MSR_IA32_PACKAGE_THERM_STATUS/MSR_IA32_THERM_STATUS registers, which contain information about current core temperature. For certain low latency applications, the RDMSR interruption exceeds the applications requirements. So disable reading of crit_alarm and temp files via /sys, in case CPU isolation is enabled. Temperature information from the housekeeping cores should be sufficient to infer die temperature. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c index 9bee4d33fbdf..30a35f4130d5 100644 --- a/drivers/hwmon/coretemp.c +++ b/drivers/hwmon/coretemp.c @@ -27,6 +27,7 @@ #include <asm/msr.h> #include <asm/processor.h> #include <asm/cpu_device_id.h> +#include <linux/sched/isolation.h> #define DRVNAME "coretemp" @@ -121,6 +122,10 @@ static ssize_t show_crit_alarm(struct device *dev, struct platform_data *pdata = dev_get_drvdata(dev); struct temp_data *tdata = pdata->core_data[attr->index]; + + if (!housekeeping_cpu(tdata->cpu, HK_TYPE_MISC)) + return -EINVAL; + mutex_lock(&tdata->update_lock); rdmsr_on_cpu(tdata->cpu, tdata->status_reg, &eax, &edx); mutex_unlock(&tdata->update_lock); @@ -158,6 +163,8 @@ static ssize_t show_temp(struct device *dev, /* Check whether the time interval has elapsed */ if (!tdata->valid || time_after(jiffies, tdata->last_updated + HZ)) { + if (!housekeeping_cpu(tdata->cpu, HK_TYPE_MISC)) + return -EINVAL; rdmsr_on_cpu(tdata->cpu, tdata->status_reg, &eax, &edx); /* * Ignore the valid bit. In all observed cases the register In this case, a userspace application (collecting telemetry data), was responsible for reading the sysfs files. Now think of all possible paths, from userspace, that lead to kernel code that ends up in smp_call_function_* variants (or other functions that cause IPIs to isolated CPUs). The alternative, from blocking this in the kernel, would be to validate all userspace software involved in your application, to ensure it won't end up in the kernel sending IPIs. Which is impractical, isnt it ? (or rather, with such option in the kernel, it would be possible to run applications which have not been validated, since the kernel would fail the operation that results in IPI to isolated CPU). So the idea would be an additional "isolation mode", which when enabled, would disallow the IPIs. Its still possible for root user to disable this mode, and retry the operation. So lets say i want to read MSRs on a given CPU, as root. You'd have to: 1) readmsr on given CPU (returns -EPERM or whatever), since the "block interference" mode is enabled for that CPU. 2) Disable that CPU in the block interference cpumask. 3) readmsr on the given CPU (success). 4) Re-enable CPU in block interference cpumask, if desired. (BTW, better ideas than the cpumask are welcome, but it seems the realibility aspect of something similar to this is necessary). Thanks!
On Wed, Feb 07 2024 at 09:51, Marcelo Tosatti wrote: > On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote: > > OK, so the problem is the following: due to software complexity, one is > often not aware of all operations that might take place. The problem is that people throw random crap on their systems and avoid proper system engineering and then complain that their realtime constraints are violated. So you are proliferating bad engineering practices and encourage people not to care. > Now think of all possible paths, from userspace, that lead to kernel > code that ends up in smp_call_function_* variants (or other functions > that cause IPIs to isolated CPUs). So you need to analyze every possible code path and interface and add your magic functions there after figuring out whether that's valid or not. > The alternative, from blocking this in the kernel, would be to validate all > userspace software involved in your application, to ensure it won't end > up in the kernel sending IPIs. Which is impractical, isnt it ? It's absolutely not impractical. It's part of proper system engineering. The wet dream that you can run random docker containers and everything works magically is just a wet dream. > (or rather, with such option in the kernel, it would be possible to run > applications which have not been validated, since the kernel would fail > the operation that results in IPI to isolated CPU). That's a fallacy because you _cannot_ define with a single CPU mask which interface is valid in a particular configuration to end up with an IPI and which one is not. There are legitimate reasons in realtime or latency constraint systems to invoke selective functionality which interferes with the overall system constraints. How do you cover that with your magic CPU mask? You can't. Aside of that there is a decent chance that you are subtly breaking user space that way. Just look at that hwmon/coretemp commit you pointed to: "Temperature information from the housekeeping cores should be sufficient to infer die temperature." That's just wishful thinking for various reasons: - The die temperature on larger packages is not evenly distributed and you can run into situations where the housekeeping cores are sitting "far" enough away from the worker core which creates the heat spot - Some monitoring applications just stop to work when they can't read the full data set, which means that they break subtly and you can infer exactly nothing. > So the idea would be an additional "isolation mode", which when enabled, > would disallow the IPIs. Its still possible for root user to disable > this mode, and retry the operation. > > So lets say i want to read MSRs on a given CPU, as root. > > You'd have to: > > 1) readmsr on given CPU (returns -EPERM or whatever), since the > "block interference" mode is enabled for that CPU. > > 2) Disable that CPU in the block interference cpumask. > > 3) readmsr on the given CPU (success). > > 4) Re-enable CPU in block interference cpumask, if desired. That's just wrong. Why? Once you enable it just to read the MSR you enable the operation for _ALL_ other non-validated crap too. So while the single MSR read might be OK under certain circumstances the fact that you open up a window for all other interfaces to do far more interfering operations is a red flag. This whole thing is a really badly defined policy mechanism of very dubious value. Thanks, tglx
On Sun, Feb 11, 2024 at 09:52:35AM +0100, Thomas Gleixner wrote: > On Wed, Feb 07 2024 at 09:51, Marcelo Tosatti wrote: > > On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote: > > > > OK, so the problem is the following: due to software complexity, one is > > often not aware of all operations that might take place. > > The problem is that people throw random crap on their systems and avoid > proper system engineering and then complain that their realtime > constraints are violated. So you are proliferating bad engineering > practices and encourage people not to care. Its more of a practicality and cost concern: one usually does not have resources to fully review software before using that software. > > Now think of all possible paths, from userspace, that lead to kernel > > code that ends up in smp_call_function_* variants (or other functions > > that cause IPIs to isolated CPUs). > > So you need to analyze every possible code path and interface and add > your magic functions there after figuring out whether that's valid or > not. "A magic function", yes. > > The alternative, from blocking this in the kernel, would be to validate all > > userspace software involved in your application, to ensure it won't end > > up in the kernel sending IPIs. Which is impractical, isnt it ? > > It's absolutely not impractical. It's part of proper system > engineering. The wet dream that you can run random docker containers and > everything works magically is just a wet dream. Unfortunately that is what people do. I understand that "full software review" would be the ideal, but in most situations it does not seem to happen. > > (or rather, with such option in the kernel, it would be possible to run > > applications which have not been validated, since the kernel would fail > > the operation that results in IPI to isolated CPU). > > That's a fallacy because you _cannot_ define with a single CPU mask > which interface is valid in a particular configuration to end up with an > IPI and which one is not. There are legitimate reasons in realtime or > latency constraint systems to invoke selective functionality which > interferes with the overall system constraints. > > How do you cover that with your magic CPU mask? You can't. > > Aside of that there is a decent chance that you are subtly breaking user > space that way. Just look at that hwmon/coretemp commit you pointed to: > > "Temperature information from the housekeeping cores should be > sufficient to infer die temperature." > > That's just wishful thinking for various reasons: > > - The die temperature on larger packages is not evenly distributed and > you can run into situations where the housekeeping cores are sitting > "far" enough away from the worker core which creates the heat spot I know. > - Some monitoring applications just stop to work when they can't read > the full data set, which means that they break subtly and you can > infer exactly nothing. > > > So the idea would be an additional "isolation mode", which when enabled, > > would disallow the IPIs. Its still possible for root user to disable > > this mode, and retry the operation. > > > > So lets say i want to read MSRs on a given CPU, as root. > > > > You'd have to: > > > > 1) readmsr on given CPU (returns -EPERM or whatever), since the > > "block interference" mode is enabled for that CPU. > > > > 2) Disable that CPU in the block interference cpumask. > > > > 3) readmsr on the given CPU (success). > > > > 4) Re-enable CPU in block interference cpumask, if desired. > > That's just wrong. Why? > > Once you enable it just to read the MSR you enable the operation for > _ALL_ other non-validated crap too. So while the single MSR read might > be OK under certain circumstances the fact that you open up a window for > all other interfaces to do far more interfering operations is a red > flag. > > This whole thing is a really badly defined policy mechanism of very > dubious value. > > Thanks, OK, fair enough. From your comments, it seems that per-callsite toggling would be ideal, for example: /sys/kernel/interference_blocking/ directory containing one sub-directory per call site. Inside each sub-directory, a "enabled" file, controlling a boolean to enable or disable interference blocking for that particular callsite. Also a "cpumask" file on each directory, by default containing the same cpumask as the nohz_full CPUs, to control to which CPUs to block the interference for. How does that sound?
Index: linux-isolation/kernel/time/clockevents.c =================================================================== --- linux-isolation.orig/kernel/time/clockevents.c +++ linux-isolation/kernel/time/clockevents.c @@ -13,6 +13,7 @@ #include <linux/module.h> #include <linux/smp.h> #include <linux/device.h> +#include <linux/sched/isolation.h> #include "tick-internal.h" @@ -416,9 +417,14 @@ static void __clockevents_unbind(void *a */ static int clockevents_unbind(struct clock_event_device *ced, int cpu) { + int ret, idx; struct ce_unbind cu = { .ce = ced, .res = -ENODEV }; - smp_call_function_single(cpu, __clockevents_unbind, &cu, 1); + idx = block_interf_srcu_read_lock(); + ret = smp_call_function_single_fail(cpu, __clockevents_unbind, &cu, 1); + block_interf_srcu_read_unlock(idx); + if (ret) + return ret; return cu.res; }