Message ID | 20240126041126.1927228-12-michael.roth@amd.com |
---|---|
State | New |
Headers |
From: Michael Roth <michael.roth@amd.com>
To: x86@kernel.org
Subject: [PATCH v2 11/25] x86/sev: Adjust directmap to avoid inadvertant RMP faults
Date: Thu, 25 Jan 2024 22:11:11 -0600
Message-ID: <20240126041126.1927228-12-michael.roth@amd.com>
In-Reply-To: <20240126041126.1927228-1-michael.roth@amd.com>
|
Series | Add AMD Secure Nested Paging (SEV-SNP) Initialization Support |
Commit Message
Michael Roth
Jan. 26, 2024, 4:11 a.m. UTC
If the kernel uses a 2MB or larger directmap mapping to write to an
address, and that mapping contains any 4KB pages that are set to private
in the RMP table, an RMP #PF will trigger and cause a host crash.
SNP-aware code that owns the private PFNs will never attempt such a
write, but other kernel tasks writing to other PFNs in the range may
trigger these checks inadvertently due to writing to those other PFNs
via a large directmap mapping that happens to also map a private PFN.
Prevent this by splitting any 2MB+ mappings that might end up containing
a mix of private/shared PFNs as a result of a subsequent RMPUPDATE for
the PFN/rmp_level passed in.
Another way to handle this would be to limit the directmap to 4K
mappings in the case of hosts that support SNP, but there is potential
risk for performance regressions of certain host workloads. Handling it
as-needed results in the directmap being slowly split over time, which
lessens the risk of a performance regression since the more the
directmap gets split as a result of running SNP guests, the more likely
the host is being used primarily to run SNP guests, where a mostly-split
directmap is actually beneficial since there is less chance of TLB
flushing and cpa_lock contention being needed to perform these splits.
Cases where a host knows in advance it wants to primarily run SNP guests
and wishes to pre-split the directmap can be handled by adding a
tuneable in the future, but preliminary testing has shown this to not
provide a significant benefit in the common case of guests that are
backed primarily by 2MB THPs, so it does not seem to be warranted
currently and can be added later if a need arises in the future.
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
arch/x86/virt/svm/sev.c | 75 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 73 insertions(+), 2 deletions(-)
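In short, the patch hooks rmpupdate() so that, before a 4KB page is made
private in the RMP table, the directmap mapping covering that PFN is split
down to 4KB if it is currently part of a larger (2MB/1GB) mapping. A minimal
sketch of that decision, using the same kernel helpers as the patch below
(simplified for illustration; the function name here is illustrative and not
part of the patch):

static int split_directmap_if_needed(u64 pfn)
{
	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
	unsigned int level;
	pte_t *pte;

	/* Find the leaf directmap mapping that covers this PFN. */
	pte = lookup_address(vaddr, &level);
	if (!pte || pte_none(*pte) || level == PG_LEVEL_4K)
		return 0;	/* unmapped or already 4K: nothing to split */

	/* Split the covering 2MB/1GB directmap mapping down to 4K pages. */
	return set_memory_4k(vaddr, 1);
}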
Comments
On Thu, Jan 25, 2024 at 10:11:11PM -0600, Michael Roth wrote:
> +static int adjust_direct_map(u64 pfn, int rmp_level)
> +{
> +	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
> +	unsigned int level;
> +	int npages, ret;
> +	pte_t *pte;

Again, something I asked the last time but no reply:

Looking at Documentation/arch/x86/x86_64/mm.rst, the direct map starts
at page_offset_base so this here should at least check

	if (vaddr < __PAGE_OFFSET)
		return 0;

I'm not sure about the upper end. Right now, the adjusting should not
happen only for the direct map but also for the whole kernel address
space range because we don't want to cause any mismatch between page
mappings anywhere.

Which means, this function should be called adjust_kernel_map() or so...

Hmmm.
On Fri, Jan 26, 2024 at 04:34:51PM +0100, Borislav Petkov wrote:
> On Thu, Jan 25, 2024 at 10:11:11PM -0600, Michael Roth wrote:
> > +static int adjust_direct_map(u64 pfn, int rmp_level)
> > +{
> > +	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
> > +	unsigned int level;
> > +	int npages, ret;
> > +	pte_t *pte;
>
> Again, something I asked the last time but no reply:
>
> Looking at Documentation/arch/x86/x86_64/mm.rst, the direct map starts
> at page_offset_base so this here should at least check
>
> 	if (vaddr < __PAGE_OFFSET)
> 		return 0;

vaddr comes from pfn_to_kaddr(pfn), i.e. __va(paddr), so it will
necessarily be a direct-mapped address above __PAGE_OFFSET.

> I'm not sure about the upper end. Right now, the adjusting should not

For the upper end, a pfn_valid(pfn) check might suffice, since only a valid
PFN would have a possibly-valid mapping within the directmap range.

> happen only for the direct map but also for the whole kernel address
> space range because we don't want to cause any mismatch between page
> mappings anywhere.
>
> Which means, this function should be called adjust_kernel_map() or so...

These are PFNs that are owned/allocated-to the caller. Due to the nature
of the directmap it's possible non-owners would write to a mapping that
overlaps, but vmalloc()/etc. would not create mappings for any pages that
were not specifically part of an allocation that belongs to the caller,
so I don't see where there's any chance for an overlap there. And the caller
of these functions would not be adjusting directmap for PFNs that might be
mapped into other kernel address ranges like kernel-text/etc. unless the
caller was specifically making SNP-aware adjustments to those ranges, in
which case it would be responsible for making those other adjustments,
or implementing the necessary helpers/etc.

I'm not aware of such cases in the current code, and I don't think it makes
sense to attempt to try to handle them here generically until such a case
arises, since it will likely involve more specific requirements than what
we can anticipate from a theoretical/generic standpoint.

-Mike

>
> Hmmm.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
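For reference, that relationship is visible directly in the x86 headers
(paraphrased sketch of the macros in arch/x86/include/asm/page.h), which is
why pfn_to_kaddr() can only ever produce a directmap address:

#define __va(x)            ((void *)((unsigned long)(x) + PAGE_OFFSET))
#define pfn_to_kaddr(pfn)  __va((pfn) << PAGE_SHIFT)

where PAGE_OFFSET resolves to page_offset_base (0xffff888000000000 with
4-level paging), the start of the direct mapping of all physical memory.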
On Fri, Jan 26, 2024 at 11:04:15AM -0600, Michael Roth wrote:
> vaddr comes from pfn_to_kaddr(pfn), i.e. __va(paddr), so it will
> necessarily be a direct-mapped address above __PAGE_OFFSET.

Ah, true.

> For the upper end, a pfn_valid(pfn) check might suffice, since only a valid
> PFN would have a possibly-valid mapping within the directmap range.

Looking at it, yap, that could be a sensible thing to check.

> These are PFNs that are owned/allocated-to the caller. Due to the nature
> of the directmap it's possible non-owners would write to a mapping that
> overlaps, but vmalloc()/etc. would not create mappings for any pages that
> were not specifically part of an allocation that belongs to the caller,
> so I don't see where there's any chance for an overlap there. And the caller
> of these functions would not be adjusting directmap for PFNs that might be
> mapped into other kernel address ranges like kernel-text/etc. unless the
> caller was specifically making SNP-aware adjustments to those ranges, in
> which case it would be responsible for making those other adjustments,
> or implementing the necessary helpers/etc.

Why does any of that matter?

If you can make this helper as generic as possible now, why don't you?

> I'm not aware of such cases in the current code, and I don't think it makes
> sense to attempt to try to handle them here generically until such a case
> arises, since it will likely involve more specific requirements than what
> we can anticipate from a theoretical/generic standpoint.

Then that's a different story. If it will likely involve more specific
handling, then that function should deal with pfns for which it can DTRT
and for others warn loudly so that the code gets fixed in time.

IOW, then it should check for the upper pfn of the direct map here and
we have two, depending on the page sizes used...

Thx.
On Fri, Jan 26, 2024 at 07:43:40PM +0100, Borislav Petkov wrote:
> On Fri, Jan 26, 2024 at 11:04:15AM -0600, Michael Roth wrote:
> > vaddr comes from pfn_to_kaddr(pfn), i.e. __va(paddr), so it will
> > necessarily be a direct-mapped address above __PAGE_OFFSET.
>
> Ah, true.
>
> > For the upper end, a pfn_valid(pfn) check might suffice, since only a valid
> > PFN would have a possibly-valid mapping within the directmap range.
>
> Looking at it, yap, that could be a sensible thing to check.
>
> > These are PFNs that are owned/allocated-to the caller. Due to the nature
> > of the directmap it's possible non-owners would write to a mapping that
> > overlaps, but vmalloc()/etc. would not create mappings for any pages that
> > were not specifically part of an allocation that belongs to the caller,
> > so I don't see where there's any chance for an overlap there. And the caller
> > of these functions would not be adjusting directmap for PFNs that might be
> > mapped into other kernel address ranges like kernel-text/etc. unless the
> > caller was specifically making SNP-aware adjustments to those ranges, in
> > which case it would be responsible for making those other adjustments,
> > or implementing the necessary helpers/etc.
>
> Why does any of that matter?
>
> If you can make this helper as generic as possible now, why don't you?

In this case, it would make it more difficult to handle things efficiently
and implement proper bounds-checking/etc. For instance, if the caller
*knows* they are doing something different like splitting a kernel-text
mapping, then we could implement proper bounds-checking based on expected
ranges, and implement any special handling associated with that use-case,
and capture that in a nice/understandable adjust_kernel_text_mapping()
helper. Or maybe if these are adjustments for non-static/non-linear
mappings, it makes more sense to be given an HVA rather than a PFN, etc.,
since we might not have any sort of reverse-map structure/function that
can be used to do the PFN->HVA lookups efficiently. It's just a lot to
guess at.

And just the directmap splitting itself has been the source of so much
discussion, investigation, re-work, benchmarking, etc., that it hints
that implementing similar handling for other use-cases really needs to
have a clear and necessary purpose and set of requirements that can be
evaluated/tested before enabling them and reasonably expecting them to
work as expected.

> > I'm not aware of such cases in the current code, and I don't think it makes
> > sense to attempt to try to handle them here generically until such a case
> > arises, since it will likely involve more specific requirements than what
> > we can anticipate from a theoretical/generic standpoint.
>
> Then that's a different story. If it will likely involve more specific
> handling, then that function should deal with pfns for which it can DTRT
> and for others warn loudly so that the code gets fixed in time.
>
> IOW, then it should check for the upper pfn of the direct map here and
> we have two, depending on the page sizes used...

Is something like this close to what you're thinking? I've re-tested with
SNP guests and it seems to work as expected.
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 846e9e53dff0..c09497487c08 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -421,7 +421,12 @@ static int adjust_direct_map(u64 pfn, int rmp_level)
 	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
 		return -EINVAL;
 
-	if (WARN_ON_ONCE(rmp_level == PG_LEVEL_2M && !IS_ALIGNED(pfn, PTRS_PER_PMD)))
+	if (!pfn_valid(pfn))
+		return -EINVAL;
+
+	if (rmp_level == PG_LEVEL_2M &&
+	    (!IS_ALIGNED(pfn, PTRS_PER_PMD) ||
+	     !pfn_valid(pfn + PTRS_PER_PMD - 1)))
 		return -EINVAL;

Note that I removed the WARN_ON_ONCE(), which I think was a bit overzealous
for this check. The one above it for rmp_level > PG_LEVEL_2M I think is more
warranted, since it returns an error for a use-case that might in theory
become valid on future hardware, but would result in unnecessary 1GB->2MB
directmap splitting if we were to try to implement handling for that before
it's actually needed (which may well be never).

-Mike

>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Jan 26, 2024 at 05:54:20PM -0600, Michael Roth wrote:
> Is something like this close to what you're thinking? I've re-tested with
> SNP guests and it seems to work as expected.
>
> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
> index 846e9e53dff0..c09497487c08 100644
> --- a/arch/x86/virt/svm/sev.c
> +++ b/arch/x86/virt/svm/sev.c
> @@ -421,7 +421,12 @@ static int adjust_direct_map(u64 pfn, int rmp_level)
>  	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
>  		return -EINVAL;
>
> -	if (WARN_ON_ONCE(rmp_level == PG_LEVEL_2M && !IS_ALIGNED(pfn, PTRS_PER_PMD)))
> +	if (!pfn_valid(pfn))

_text at VA 0xffffffff81000000 is also a valid pfn so no, this is not
enough.

Either this function should not have "direct map" in the name as it
converts *any* valid pfn not just the direct map ones or it should check
whether the pfn belongs to the direct map range.

Thx.
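To make that point concrete, a hypothetical snippet (not from the patch;
names as in the mainline headers) showing that the kernel-text page passes
a pfn_valid() check even though it is also mapped outside the directmap:

#include <linux/mm.h>
#include <asm/sections.h>	/* _text */

static bool text_pfn_passes_pfn_valid(void)
{
	unsigned long pfn = __pa_symbol(_text) >> PAGE_SHIFT;

	/*
	 * Returns true: the page backing _text has a struct page, so
	 * pfn_valid() by itself cannot distinguish a "plain" directmap PFN
	 * from one that is also mapped in the kernel text mapping at
	 * 0xffffffff81000000.
	 */
	return pfn_valid(pfn);
}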
On Sat, Jan 27, 2024 at 12:42:07PM +0100, Borislav Petkov wrote:
> On Fri, Jan 26, 2024 at 05:54:20PM -0600, Michael Roth wrote:
> > Is something like this close to what you're thinking? I've re-tested with
> > SNP guests and it seems to work as expected.
> >
> > diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
> > index 846e9e53dff0..c09497487c08 100644
> > --- a/arch/x86/virt/svm/sev.c
> > +++ b/arch/x86/virt/svm/sev.c
> > @@ -421,7 +421,12 @@ static int adjust_direct_map(u64 pfn, int rmp_level)
> >  	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
> >  		return -EINVAL;
> >
> > -	if (WARN_ON_ONCE(rmp_level == PG_LEVEL_2M && !IS_ALIGNED(pfn, PTRS_PER_PMD)))
> > +	if (!pfn_valid(pfn))
>
> _text at VA 0xffffffff81000000 is also a valid pfn so no, this is not
> enough.

directmap maps all physical memory accessible by kernel, including text
pages, so those are valid PFNs as far as this function is concerned.
We can't generally guard against the caller passing in any random PFN that
might also be mapped into additional address ranges, similarly to how we
can't guard against something doing a write to some random PFN __va(0x1234)
and scribbling over memory that it doesn't own, or just unmapping the entire
directmap range and blowing up the kernel.

The expectation is that the caller is aware of what PFNs it is passing in,
whether those PFNs have additional mappings, and if those mappings are of
concern, implement the necessary handlers if new use-cases are ever
introduced, like the adjust_kernel_text_mapping() example I mentioned
earlier.

>
> Either this function should not have "direct map" in the name as it
> converts *any* valid pfn not just the direct map ones or it should check
> whether the pfn belongs to the direct map range.

This function only splits mappings in the 0xffff888000000000 directmap
range. It would be inaccurate to name it in such a way that suggests that
it does anything else.

If a use-case ever arises for splitting _text mappings at
0xffffffff81000000, or any other ranges, those too would be best served by
dedicated helpers like adjust_kernel_text_mapping() that *actually* modify
mappings for those virtual ranges, and implement bounds-checking
appropriate for those physical/virtual ranges.

The directmap range maps all kernel-accessible physical memory, so it's
appropriate that our bounds-checking for the purpose of adjust_direct_map()
is all kernel-accessible physical memory. If that includes PFNs mapped to
other virtual ranges as well, the caller needs to consider that and
implement additional helpers as necessary, but they'd likely *still* need
to call adjust_direct_map() to adjust the directmap range in addition to
those other ranges, so even if it were possible to do so reliably, we
shouldn't try to selectively reject any PFN ranges beyond what mm.rst
suggests is valid for 0xffff888000000000.

-Mike

>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>
On Sat, Jan 27, 2024 at 09:45:06AM -0600, Michael Roth wrote:
> directmap maps all physical memory accessible by kernel, including text
> pages, so those are valid PFNs as far as this function is concerned.

Why don't you have a look at Documentation/arch/x86/x86_64/mm.rst to
sync up on the nomenclature first?

  ffff888000000000 | -119.5 TB | ffffc87fffffffff |  64 TB | direct mapping of all physical memory (page_offset_base)
  ...
  ffffffff80000000 |   -2  GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0

and so on.

> The expectation is that the caller is aware of what PFNs it is passing in,

There are no expectations. Have you written them down somewhere?

> This function only splits mappings in the 0xffff888000000000 directmap
> range.

This function takes any PFN it gets passed in as it is. I don't care
who its users are now or in the future and whether they pay attention
what they pass into - it needs to be properly defined.

Mike, please get on with the program. Use the right naming for the
function and basta.

IOW, this:

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 0a8f9334ec6e..652ee63e87fd 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -394,7 +394,7 @@ EXPORT_SYMBOL_GPL(psmash);
  * More specifics on how these checks are carried out can be found in APM
  * Volume 2, "RMP and VMPL Access Checks".
  */
-static int adjust_direct_map(u64 pfn, int rmp_level)
+static int split_pfn(u64 pfn, int rmp_level)
 {
 	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
 	unsigned int level;
@@ -405,7 +405,12 @@ static int adjust_direct_map(u64 pfn, int rmp_level)
 	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
 		return -EINVAL;
 
-	if (WARN_ON_ONCE(rmp_level == PG_LEVEL_2M && !IS_ALIGNED(pfn, PTRS_PER_PMD)))
+	if (!pfn_valid(pfn))
+		return -EINVAL;
+
+	if (rmp_level == PG_LEVEL_2M &&
+	    (!IS_ALIGNED(pfn, PTRS_PER_PMD) ||
+	     !pfn_valid(pfn + PTRS_PER_PMD - 1)))
 		return -EINVAL;
 
 	/*
@@ -456,7 +461,7 @@ static int rmpupdate(u64 pfn, struct rmp_state *state)
 
 	level = RMP_TO_PG_LEVEL(state->pagesize);
 
-	if (adjust_direct_map(pfn, level))
+	if (split_pfn(pfn, level))
 		return -EFAULT;
 
 	do {
On Sat, Jan 27, 2024 at 05:02:49PM +0100, Borislav Petkov wrote:
> This function takes any PFN it gets passed in as it is. I don't care
> who its users are now or in the future and whether they pay attention
> what they pass into - it needs to be properly defined.

Ok, we solved it offlist, here's the final version I have. It has
a comment explaining what I was asking.

---
From: Michael Roth <michael.roth@amd.com>
Date: Thu, 25 Jan 2024 22:11:11 -0600
Subject: [PATCH] x86/sev: Adjust the directmap to avoid inadvertent RMP faults

If the kernel uses a 2MB or larger directmap mapping to write to an
address, and that mapping contains any 4KB pages that are set to private
in the RMP table, an RMP #PF will trigger and cause a host crash.

SNP-aware code that owns the private PFNs will never attempt such
a write, but other kernel tasks writing to other PFNs in the range may
trigger these checks inadvertently due to writing to those other PFNs
via a large directmap mapping that happens to also map a private PFN.

Prevent this by splitting any 2MB+ mappings that might end up containing
a mix of private/shared PFNs as a result of a subsequent RMPUPDATE for
the PFN/rmp_level passed in.

Another way to handle this would be to limit the directmap to 4K
mappings in the case of hosts that support SNP, but there is potential
risk for performance regressions of certain host workloads.

Handling it as-needed results in the directmap being slowly split over
time, which lessens the risk of a performance regression since the more
the directmap gets split as a result of running SNP guests, the more
likely the host is being used primarily to run SNP guests, where
a mostly-split directmap is actually beneficial since there is less
chance of TLB flushing and cpa_lock contention being needed to perform
these splits.

Cases where a host knows in advance it wants to primarily run SNP guests
and wishes to pre-split the directmap can be handled by adding
a tuneable in the future, but preliminary testing has shown this to not
provide a significant benefit in the common case of guests that are
backed primarily by 2MB THPs, so it does not seem to be warranted
currently and can be added later if a need arises in the future.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240126041126.1927228-12-michael.roth@amd.com
---
 arch/x86/virt/svm/sev.c | 86 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index c0b4c2306e8d..8da9c5330ff0 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -368,6 +368,82 @@ int psmash(u64 pfn)
 }
 EXPORT_SYMBOL_GPL(psmash);
 
+/*
+ * If the kernel uses a 2MB or larger directmap mapping to write to an address,
+ * and that mapping contains any 4KB pages that are set to private in the RMP
+ * table, an RMP #PF will trigger and cause a host crash. Hypervisor code that
+ * owns the PFNs being transitioned will never attempt such a write, but other
+ * kernel tasks writing to other PFNs in the range may trigger these checks
+ * inadvertently due to a large directmap mapping that happens to overlap such
+ * a PFN.
+ *
+ * Prevent this by splitting any 2MB+ mappings that might end up containing a
+ * mix of private/shared PFNs as a result of a subsequent RMPUPDATE for the
+ * PFN/rmp_level passed in.
+ *
+ * Note that there is no attempt here to scan all the RMP entries for the 2MB
+ * physical range, since it would only be worthwhile in determining if a
+ * subsequent RMPUPDATE for a 4KB PFN would result in all the entries being of
+ * the same shared/private state, thus avoiding the need to split the mapping.
+ * But that would mean the entries are currently in a mixed state, and so the
+ * mapping would have already been split as a result of prior transitions.
+ * And since the 4K split is only done if the mapping is 2MB+, and there isn't
+ * currently a mechanism in place to restore 2MB+ mappings, such a check would
+ * not provide any usable benefit.
+ *
+ * More specifics on how these checks are carried out can be found in APM
+ * Volume 2, "RMP and VMPL Access Checks".
+ */
+static int adjust_direct_map(u64 pfn, int rmp_level)
+{
+	unsigned long vaddr;
+	unsigned int level;
+	int npages, ret;
+	pte_t *pte;
+
+	/*
+	 * pfn_to_kaddr() will return a vaddr only within the direct
+	 * map range.
+	 */
+	vaddr = (unsigned long)pfn_to_kaddr(pfn);
+
+	/* Only 4KB/2MB RMP entries are supported by current hardware. */
+	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
+		return -EINVAL;
+
+	if (!pfn_valid(pfn))
+		return -EINVAL;
+
+	if (rmp_level == PG_LEVEL_2M &&
+	    (!IS_ALIGNED(pfn, PTRS_PER_PMD) ||
+	     !pfn_valid(pfn + PTRS_PER_PMD - 1)))
+		return -EINVAL;
+
+	/*
+	 * If an entire 2MB physical range is being transitioned, then there is
+	 * no risk of RMP #PFs due to write accesses from overlapping mappings,
+	 * since even accesses from 1GB mappings will be treated as 2MB accesses
+	 * as far as RMP table checks are concerned.
+	 */
+	if (rmp_level == PG_LEVEL_2M)
+		return 0;
+
+	pte = lookup_address(vaddr, &level);
+	if (!pte || pte_none(*pte))
+		return 0;
+
+	if (level == PG_LEVEL_4K)
+		return 0;
+
+	npages = page_level_size(rmp_level) / PAGE_SIZE;
+	ret = set_memory_4k(vaddr, npages);
+	if (ret)
+		pr_warn("Failed to split direct map for PFN 0x%llx, ret: %d\n",
+			pfn, ret);
+
+	return ret;
+}
+
 /*
  * It is expected that those operations are seldom enough so that no mutual
  * exclusion of updaters is needed and thus the overlap error condition below
@@ -384,11 +460,16 @@ EXPORT_SYMBOL_GPL(psmash);
 static int rmpupdate(u64 pfn, struct rmp_state *state)
 {
 	unsigned long paddr = pfn << PAGE_SHIFT;
-	int ret;
+	int ret, level;
 
 	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
 		return -ENODEV;
 
+	level = RMP_TO_PG_LEVEL(state->pagesize);
+
+	if (adjust_direct_map(pfn, level))
+		return -EFAULT;
+
 	do {
 		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
 		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
@@ -398,7 +479,8 @@ static int rmpupdate(u64 pfn, struct rmp_state *state)
 	} while (ret == RMPUPDATE_FAIL_OVERLAP);
 
 	if (ret) {
-		pr_err("RMPUPDATE failed for PFN %llx, ret: %d\n", pfn, ret);
+		pr_err("RMPUPDATE failed for PFN %llx, pg_level: %d, ret: %d\n",
+		       pfn, level, ret);
 		dump_rmpentry(pfn);
 		dump_stack();
 		return -EFAULT;
On 1/29/24 12:59, Borislav Petkov wrote:
> On Sat, Jan 27, 2024 at 05:02:49PM +0100, Borislav Petkov wrote:
>> This function takes any PFN it gets passed in as it is. I don't care
>> who its users are now or in the future and whether they pay attention
>> what they pass into - it needs to be properly defined.
>
> Ok, we solved it offlist, here's the final version I have. It has
> a comment explaining what I was asking.
>
> ---
> From: Michael Roth <michael.roth@amd.com>
> Date: Thu, 25 Jan 2024 22:11:11 -0600
> Subject: [PATCH] x86/sev: Adjust the directmap to avoid inadvertent RMP faults
>
> If the kernel uses a 2MB or larger directmap mapping to write to an
> address, and that mapping contains any 4KB pages that are set to private
> in the RMP table, an RMP #PF will trigger and cause a host crash.
>
> SNP-aware code that owns the private PFNs will never attempt such
> a write, but other kernel tasks writing to other PFNs in the range may
> trigger these checks inadvertently due to writing to those other PFNs
> via a large directmap mapping that happens to also map a private PFN.
>
> Prevent this by splitting any 2MB+ mappings that might end up containing
> a mix of private/shared PFNs as a result of a subsequent RMPUPDATE for
> the PFN/rmp_level passed in.
>
> Another way to handle this would be to limit the directmap to 4K
> mappings in the case of hosts that support SNP, but there is potential
> risk for performance regressions of certain host workloads.
>
> Handling it as-needed results in the directmap being slowly split over
> time, which lessens the risk of a performance regression since the more
> the directmap gets split as a result of running SNP guests, the more
> likely the host is being used primarily to run SNP guests, where
> a mostly-split directmap is actually beneficial since there is less
> chance of TLB flushing and cpa_lock contention being needed to perform
> these splits.
>
> Cases where a host knows in advance it wants to primarily run SNP guests
> and wishes to pre-split the directmap can be handled by adding
> a tuneable in the future, but preliminary testing has shown this to not
> provide a significant benefit in the common case of guests that are
> backed primarily by 2MB THPs, so it does not seem to be warranted
> currently and can be added later if a need arises in the future.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> Link: https://lore.kernel.org/r/20240126041126.1927228-12-michael.roth@amd.com

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 16b3d8139649..1a13eff78c9d 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -368,6 +368,71 @@ int psmash(u64 pfn)
 }
 EXPORT_SYMBOL_GPL(psmash);
 
+/*
+ * If the kernel uses a 2MB or larger directmap mapping to write to an address,
+ * and that mapping contains any 4KB pages that are set to private in the RMP
+ * table, an RMP #PF will trigger and cause a host crash. Hypervisor code that
+ * owns the PFNs being transitioned will never attempt such a write, but other
+ * kernel tasks writing to other PFNs in the range may trigger these checks
+ * inadvertently due to a large directmap mapping that happens to overlap such
+ * a PFN.
+ *
+ * Prevent this by splitting any 2MB+ mappings that might end up containing a
+ * mix of private/shared PFNs as a result of a subsequent RMPUPDATE for the
+ * PFN/rmp_level passed in.
+ *
+ * Note that there is no attempt here to scan all the RMP entries for the 2MB
+ * physical range, since it would only be worthwhile in determining if a
+ * subsequent RMPUPDATE for a 4KB PFN would result in all the entries being of
+ * the same shared/private state, thus avoiding the need to split the mapping.
+ * But that would mean the entries are currently in a mixed state, and so the
+ * mapping would have already been split as a result of prior transitions.
+ * And since the 4K split is only done if the mapping is 2MB+, and there isn't
+ * currently a mechanism in place to restore 2MB+ mappings, such a check would
+ * not provide any usable benefit.
+ *
+ * More specifics on how these checks are carried out can be found in APM
+ * Volume 2, "RMP and VMPL Access Checks".
+ */
+static int adjust_direct_map(u64 pfn, int rmp_level)
+{
+	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
+	unsigned int level;
+	int npages, ret;
+	pte_t *pte;
+
+	/* Only 4KB/2MB RMP entries are supported by current hardware. */
+	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(rmp_level == PG_LEVEL_2M && !IS_ALIGNED(pfn, PTRS_PER_PMD)))
+		return -EINVAL;
+
+	/*
+	 * If an entire 2MB physical range is being transitioned, then there is
+	 * no risk of RMP #PFs due to write accesses from overlapping mappings,
+	 * since even accesses from 1GB mappings will be treated as 2MB accesses
+	 * as far as RMP table checks are concerned.
+	 */
+	if (rmp_level == PG_LEVEL_2M)
+		return 0;
+
+	pte = lookup_address(vaddr, &level);
+	if (!pte || pte_none(*pte))
+		return 0;
+
+	if (level == PG_LEVEL_4K)
+		return 0;
+
+	npages = page_level_size(rmp_level) / PAGE_SIZE;
+	ret = set_memory_4k(vaddr, npages);
+	if (ret)
+		pr_warn("Failed to split direct map for PFN 0x%llx, ret: %d\n",
+			pfn, ret);
+
+	return ret;
+}
+
 /*
  * It is expected that those operations are seldom enough so that no mutual
  * exclusion of updaters is needed and thus the overlap error condition below
@@ -384,11 +449,16 @@ EXPORT_SYMBOL_GPL(psmash);
 static int rmpupdate(u64 pfn, struct rmp_state *state)
 {
 	unsigned long paddr = pfn << PAGE_SHIFT;
-	int ret;
+	int ret, level;
 
 	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
 		return -ENODEV;
 
+	level = RMP_TO_PG_LEVEL(state->pagesize);
+
+	if (adjust_direct_map(pfn, level))
+		return -EFAULT;
+
 	do {
 		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
 		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
@@ -398,7 +468,8 @@ static int rmpupdate(u64 pfn, struct rmp_state *state)
 	} while (ret == RMPUPDATE_FAIL_OVERLAP);
 
 	if (ret) {
-		pr_err("RMPUPDATE failed for PFN %llx, ret: %d\n", pfn, ret);
+		pr_err("RMPUPDATE failed for PFN %llx, pg_level: %d, ret: %d\n",
+		       pfn, level, ret);
 		dump_rmpentry(pfn);
 		dump_stack();
 		return -EFAULT;