Message ID | 20240222-gunyah-v17-19-1e9da6763d38@quicinc.com |
---|---|
State | New |
Headers |
From: Elliot Berman <quic_eberman@quicinc.com> Date: Thu, 22 Feb 2024 15:16:42 -0800 Subject: [PATCH v17 19/35] arch/mm: Export direct {un,}map functions Message-ID: <20240222-gunyah-v17-19-1e9da6763d38@quicinc.com> References: <20240222-gunyah-v17-0-1e9da6763d38@quicinc.com> In-Reply-To: <20240222-gunyah-v17-0-1e9da6763d38@quicinc.com> X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> To: Alex Elder <elder@linaro.org>, Srinivas Kandagatla <srinivas.kandagatla@linaro.org>, Murali Nalajal <quic_mnalajal@quicinc.com>, Trilok Soni <quic_tsoni@quicinc.com>, Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>, Carl van Schaik <quic_cvanscha@quicinc.com>, Philip Derrin <quic_pderrin@quicinc.com>, Prakruthi Deepak Heragu <quic_pheragu@quicinc.com>, Jonathan Corbet <corbet@lwn.net>, Rob Herring <robh+dt@kernel.org>, Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>, Conor Dooley <conor+dt@kernel.org>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Konrad Dybcio <konrad.dybcio@linaro.org>, Bjorn Andersson <andersson@kernel.org>, Dmitry Baryshkov <dmitry.baryshkov@linaro.org>, Fuad Tabba <tabba@google.com>, Sean Christopherson <seanjc@google.com>, Andrew Morton <akpm@linux-foundation.org> CC: linux-arm-msm@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, devicetree@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, Elliot Berman <quic_eberman@quicinc.com> |
Series |
Drivers for Gunyah hypervisor
Commit Message
Elliot Berman
Feb. 22, 2024, 11:16 p.m. UTC
Firmware and hypervisor drivers can donate system heap memory to their
respective firmware/hypervisor entities. Those drivers should unmap the
pages from the kernel's logical map before doing so.
Export can_set_direct_map, set_direct_map_invalid_noflush, and
set_direct_map_default_noflush.
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
arch/arm64/mm/pageattr.c | 3 +++
1 file changed, 3 insertions(+)
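For context, here is one way a hypervisor driver might sequence the newly exported helpers around a donation call. This is a kernel-style sketch only, not buildable standalone: gunyah_donate_page() is a hypothetical wrapper name, and restoring the direct map on failure is an assumed policy, not something taken from this series.

```c
/*
 * Sketch (illustrative, not from the series): donate a page to the
 * hypervisor, dropping the linear-map alias first so a stray kernel
 * access cannot land on memory the host no longer owns.
 */
static int donate_page_to_hyp(struct page *page)
{
	int ret;

	if (!can_set_direct_map())
		return -EOPNOTSUPP;	/* assumed policy choice */

	/* Drop the linear-map alias before the hypervisor removes the
	 * page from the host's stage-2 tables. */
	ret = set_direct_map_invalid_noflush(page);
	if (ret)
		return ret;

	ret = gunyah_donate_page(page);	/* hypothetical hypercall wrapper */
	if (ret)
		set_direct_map_default_noflush(page);	/* restore on failure */
	return ret;
}
```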
Comments
On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> Firmware and hypervisor drivers can donate system heap memory to their
> respective firmware/hypervisor entities. Those drivers should unmap the
> pages from the kernel's logical map before doing so.
>
> Export can_set_direct_map, set_direct_map_invalid_noflush, and
> set_direct_map_default_noflush.

Err, no, they should not. And certainly not by using such super low-level interfaces from modular code.
On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> Err, no, they should not. And certainly not by using such super low-level
> interfaces from modular code.

Hi Christoph,

We've observed a few times that Linux can unintentionally access a page we've unmapped from the host's stage-2 page table via an unaligned load from an adjacent page. The stage 2 is managed by Gunyah. There are a few scenarios where, even though we allocate and own a page from buddy, someone else could try to access the page without going through the hypervisor driver. One such instance we know about is load_unaligned_zeropad() via path_lookupat() [1].

load_unaligned_zeropad() could be called near the end of a page. If the next page isn't mapped by the kernel in the stage-1 page tables, then the access to the unmapped page from load_unaligned_zeropad() will land in __do_kernel_fault(), call fixup_exception(), and fill the remainder of the load with zeroes. If the page in question is mapped in stage 1 but was unmapped from stage 2, then the access lands back in Linux in do_sea(), leading to a panic().

Our preference would be to add fixup_exception() handling to S2 PTW errors, for two reasons:
1. It's cheaper performance-wise: we've already manipulated the S2 page table, and the pKVM/Gunyah drivers prevent intentional access to the page because they know access to it has been lost.
2. Page-granular S1 mappings only happen on arm64 with rodata=full.

In an off-list discussion with the Android pKVM folks, their preference was to have the pages unmapped from stage 1. I've gone with that approach to get started, but welcome discussion on the best approach.

The Android (downstream) implementation of arm64 pKVM currently implements a hack where S2 PTW faults are given back to the host as S1 PTW faults (i.e. __do_kernel_fault() gets called and not do_sea()), allowing the kernel to fix up the exception.

arm64 pKVM will also face this issue when implementing guest_memfd or when donating more memory to the hyp for S2 page tables, etc. As far as I can tell, this isn't an issue for arm64 pKVM today because memory isn't being dynamically donated to the hypervisor.

Thanks,
Elliot

[1]:
 path_lookupat+0x340/0x3228
 filename_lookup+0xbc/0x1c0
 __arm64_sys_newfstatat+0xb0/0x4a0
 invoke_syscall+0x58/0x118
The point is that we can't just allow modules to unmap data from the kernel mapping, no matter how noble your intentions are.
On 26.02.24 12:06, Christoph Hellwig wrote:
> The point is that we can't just allow modules to unmap data from
> the kernel mapping, no matter how noble your intentions are.

I absolutely agree.
On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
> On 26.02.24 12:06, Christoph Hellwig wrote:
> > The point is that we can't just allow modules to unmap data from
> > the kernel mapping, no matter how noble your intentions are.
>
> I absolutely agree.

Hi David and Christoph,

Are your preferences that we should make Gunyah builtin-only, or that we should add fixing up of S2 PTW errors (or something else)?

Also, do you extend that preference to modifying S2 mappings? This would require any hypervisor driver that supports confidential compute usecases to only ever be builtin.

Is your concern about unmapping data from the kernel mapping, then the module being unloaded, and then having no way to recover the mapping? Would a permanent module be better? The primary reason we wanted to have it as a module was to avoid having the driver in memory if you're not a Gunyah guest.

Thanks,
Elliot
On 26.02.24 18:27, Elliot Berman wrote:
> Hi David and Christoph,
>
> Are your preferences that we should make Gunyah builtin-only, or that we
> should add fixing up of S2 PTW errors (or something else)?

Having that built into the kernel certainly does sound better than exposing that functionality to arbitrary OOT modules. But still, this feels like it is using a "too-low-level" interface.

> Also, do you extend that preference to modifying S2 mappings? This would
> require any hypervisor driver that supports confidential compute
> usecases to only ever be builtin.
>
> Is your concern about unmapping data from the kernel mapping, then the
> module being unloaded, and then having no way to recover the mapping?
> Would a permanent module be better? The primary reason we wanted to have
> it as a module was to avoid having the driver in memory if you're not a
> Gunyah guest.

What I didn't grasp from this patch description: is the area where a driver would unmap/remap that memory somehow known ahead of time and limited?

How would the driver obtain the memory whose direct map it would try to unmap/remap? Simply allocate some pages and then unmap the direct map?

For example, we do have mm/secretmem.c, where we unmap the directmap on allocation and remap when freeing a page. A nice abstraction on alloc/free, so one cannot really do a lot of harm. Further, we enlightened the remainder of the system about secretmem, such that we can detect that the directmap is no longer there. As one example, see the secretmem_active() check in kernel/power/hibernate.c.

A similar abstraction would make sense (I remember a discussion about having secretmem functionality in guest_memfd, would that help?), but the question is "which" memory you want to unmap the direct map of, and how the driver became "owner" of that memory such that it would really be allowed to mess with the directmap.
On Tue, Feb 27, 2024 at 10:49:32AM +0100, David Hildenbrand wrote:
> Having that built into the kernel certainly does sound better than exposing
> that functionality to arbitrary OOT modules. But still, this feels like it
> is using a "too-low-level" interface.

What are your thoughts about fixing up the stage-2 fault instead? I think this gives MMU-based isolation a slight speed boost because we avoid modifying the kernel mapping. The hypervisor driver (KVM or Gunyah) knows that the page isn't mapped. Whether we get an S2 or S1 fault, the kernel is likely going to crash, except in the rare case where we want to fix up the exception. In that case, we can modify the S2 fault handler to call fixup_exception() when appropriate.

> What I didn't grasp from this patch description: is the area where a driver
> would unmap/remap that memory somehow known ahead of time and limited?
>
> How would the driver obtain the memory whose direct map it would try to
> unmap/remap? Simply allocate some pages and then unmap the direct map?

That's correct.

> For example, we do have mm/secretmem.c, where we unmap the directmap on
> allocation and remap when freeing a page. A nice abstraction on alloc/free,
> so one cannot really do a lot of harm.
>
> Further, we enlightened the remainder of the system about secretmem, such
> that we can detect that the directmap is no longer there. As one example,
> see the secretmem_active() check in kernel/power/hibernate.c.

I'll take a look at this. guest_memfd might be able to use PM notifiers here instead, but I'll dig into the archives to see why secretmem isn't using them.

> A similar abstraction would make sense (I remember a discussion about having
> secretmem functionality in guest_memfd, would that help?), but the question
> is "which" memory you want to unmap the direct map of, and how the driver
> became "owner" of that memory such that it would really be allowed to mess
> with the directmap.
On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote:
> The Android (downstream) implementation of arm64 pKVM currently
> implements a hack where S2 PTW faults are given back to the host as S1
> PTW faults (i.e. __do_kernel_fault() gets called and not do_sea()),
> allowing the kernel to fix up the exception.
>
> arm64 pKVM will also face this issue when implementing guest_memfd or
> when donating more memory to the hyp for S2 page tables, etc. As far as
> I can tell, this isn't an issue for arm64 pKVM today because memory
> isn't being dynamically donated to the hypervisor.

FWIW pKVM already donates memory dynamically to the hypervisor, to store e.g. guest VM metadata and page tables, and we've never seen that problem as far as I can recall. A key difference is that pKVM injects a data abort back into the kernel in case of a stage-2 fault, so the whole EXTABLE trick/hack in load_unaligned_zeropad() should work fine out of the box.

As discussed offline, Gunyah injecting an SEA into the kernel is questionable, but I understand that the architecture is a bit lacking in this department, and that's probably the next best thing.

Could the Gunyah driver allocate from a CMA region instead? That would surely simplify unmapping from EL1 stage 1 (similar to how drivers usually donate memory to TZ).

Thanks,
Quentin
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 924843f1f661b..a9bd84588c98a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -32,6 +32,7 @@ bool can_set_direct_map(void)
 	return rodata_full || debug_pagealloc_enabled() ||
 	       arm64_kfence_can_set_direct_map();
 }
+EXPORT_SYMBOL_GPL(can_set_direct_map);
 
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
@@ -176,6 +177,7 @@ int set_direct_map_invalid_noflush(struct page *page)
 				   (unsigned long)page_address(page),
 				   PAGE_SIZE, change_page_range, &data);
 }
+EXPORT_SYMBOL_GPL(set_direct_map_invalid_noflush);
 
 int set_direct_map_default_noflush(struct page *page)
 {
@@ -191,6 +193,7 @@ int set_direct_map_default_noflush(struct page *page)
 				   (unsigned long)page_address(page),
 				   PAGE_SIZE, change_page_range, &data);
 }
+EXPORT_SYMBOL_GPL(set_direct_map_default_noflush);
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)