Message ID | 1697202267-23600-1-git-send-email-quic_charante@quicinc.com |
---|---|
State | New |
Headers |
From: Charan Teja Kalla <quic_charante@quicinc.com>
To: akpm@linux-foundation.org, osalvador@suse.de, dan.j.williams@intel.com, david@redhat.com, vbabka@suse.cz, mgorman@techsingularity.net, aneesh.kumar@linux.ibm.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Charan Teja Kalla <quic_charante@quicinc.com>
Subject: [PATCH] mm/sparsemem: fix race in accessing memory_section->usage
Date: Fri, 13 Oct 2023 18:34:27 +0530
Message-ID: <1697202267-23600-1-git-send-email-quic_charante@quicinc.com> |
Series | mm/sparsemem: fix race in accessing memory_section->usage |
Commit Message
Charan Teja Kalla
Oct. 13, 2023, 1:04 p.m. UTC
The below race is observed on a PFN that falls into a device memory
region, with a system memory configuration where the PFNs are laid out
as [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since the normal zone's start
and end PFNs also contain the device memory PFNs, compaction will be
attempted on the device memory PFNs too, although those attempts end up
as NOPs (because pfn_to_online_page() returns NULL for ZONE_DEVICE
memory sections). When, on another core, the section mappings are being
removed for the ZONE_DEVICE region that the PFN currently being
compacted belongs to, the result is a kernel crash with
CONFIG_SPARSEMEM_VMEMMAP enabled.
compact_zone()                          memunmap_pages
--------------                          ---------------
__pageblock_pfn_to_page
  ......
  (a) pfn_valid():
        valid_section()  // returns true
                                        (b) __remove_pages() ->
                                              sparse_remove_section() ->
                                                section_deactivate():
                                                  [frees the ms->usage array and
                                                   sets ms->usage = NULL]
  pfn_section_valid()
    [accesses ms->usage, which
     is now NULL]
NOTE: From the above, the race narrows down to one between
pfn_valid()/pfn_section_valid() and section deactivation, with
SPARSEMEM_VMEMMAP enabled.

Commit b943f045a9af ("mm/sparse: fix kernel crash with
pfn_section_valid check") tried to address the same problem by clearing
SECTION_HAS_MEM_MAP, with the expectation that valid_section() then
returns false so that ms->usage is not accessed.

Fix this issue with the following steps:
a) Clear SECTION_HAS_MEM_MAP before freeing ->usage.
b) An RCU-protected read-side critical section will either bail out
   when SECTION_HAS_MEM_MAP is cleared or successfully access ->usage.
c) Synchronize RCU on the write side and then free ->usage. No attempt
   to access ->usage can be made after this point, since
   SECTION_HAS_MEM_MAP is already cleared and valid_section() therefore
   returns false.

Since section_deactivate() is a rare operation that only runs in the
hot-remove path, the impact of synchronize_rcu() should be negligible.
Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
include/linux/mmzone.h | 11 +++++++++--
mm/sparse.c | 14 ++++++++------
2 files changed, 17 insertions(+), 8 deletions(-)
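For readers skimming the thread, a simplified, self-contained sketch of the
unpublish -> wait-for-readers -> free ordering described in steps a)-c) above.
The demo_* names are stand-ins for illustration only, not the real mm
structures; the actual change is the diff at the bottom of this page.

/*
 * Illustration only: stand-in types, not the kernel's mem_section.
 * Shows the ordering from steps a)-c) of the commit message.
 */
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/types.h>

struct demo_usage { unsigned long subsection_map; };

struct demo_section {
	unsigned long flags;		/* stands in for the SECTION_HAS_MEM_MAP bit */
	struct demo_usage *usage;	/* stands in for ms->usage */
};

#define DEMO_HAS_MEM_MAP	0x1UL

/* Read side, analogous to pfn_valid()/pfn_section_valid(). */
static bool demo_pfn_valid(struct demo_section *ms)
{
	bool ret = false;

	rcu_read_lock();
	if (ms->flags & DEMO_HAS_MEM_MAP)		/* cleared in step a) */
		ret = ms->usage->subsection_map != 0;	/* safe: free waits in c) */
	rcu_read_unlock();
	return ret;
}

/* Write side, analogous to section_deactivate() with this patch applied. */
static void demo_deactivate(struct demo_section *ms)
{
	ms->flags &= ~DEMO_HAS_MEM_MAP;	/* a) new readers bail out early */
	synchronize_rcu();		/* c) wait out readers that saw the flag set */
	kfree(ms->usage);		/* no reader can still dereference it */
	ms->usage = NULL;
}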
Comments
On Fri, 13 Oct 2023 18:34:27 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote: > The below race is observed on a PFN which falls into the device memory > region with the system memory configuration where PFN's are such that > [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and > end pfn contains the device memory PFN's as well, the compaction > triggered will try on the device memory PFN's too though they end up in > NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory > sections). When from other core, the section mappings are being removed > for the ZONE_DEVICE region, that the PFN in question belongs to, > on which compaction is currently being operated is resulting into the > kernel crash with CONFIG_SPASEMEM_VMEMAP enabled. Seems this bug is four years old, yes? It must be quite hard to hit. When people review this, please offer opinions on whether a fix should be backported into -stable kernels, thanks. > compact_zone() memunmap_page > ------------- --------------- > __pageblock_pfn_to_page > ...... > (a)pfn_valid(): > valid_section()//return true > (b)__remove_pages()-> > sparse_remove_section()-> > section_deactivate(): > [Free the array ms->usage and set > ms->usage = NULL] > pfn_section_valid() > [Access ms->usage which > is NULL] > > NOTE: From the above it can be said that the race is reduced to between > the pfn_valid()/pfn_section_valid() and the section deactivate with > SPASEMEM_VMEMAP enabled. > > The commit b943f045a9af("mm/sparse: fix kernel crash with > pfn_section_valid check") tried to address the same problem by clearing > the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns > false thus ms->usage is not accessed. > > Fix this issue by the below steps: > a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. > b) RCU protected read side critical section will either return NULL when > SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. > c) Synchronize the rcu on the write side and free the ->usage. No > attempt will be made to access ->usage after this as the > SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. > > Since the section_deactivate() is a rare operation and will come in the > hot remove path, impact of synchronize_rcu() should be negligble. > > Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot")
On 15.10.23 00:25, Andrew Morton wrote: > On Fri, 13 Oct 2023 18:34:27 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote: > >> The below race is observed on a PFN which falls into the device memory >> region with the system memory configuration where PFN's are such that >> [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and >> end pfn contains the device memory PFN's as well, the compaction >> triggered will try on the device memory PFN's too though they end up in >> NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory >> sections). When from other core, the section mappings are being removed >> for the ZONE_DEVICE region, that the PFN in question belongs to, >> on which compaction is currently being operated is resulting into the >> kernel crash with CONFIG_SPASEMEM_VMEMAP enabled. > > Seems this bug is four years old, yes? It must be quite hard to hit. From the description, it's not quite clear to me if this was actually hit -- usually people include the dmesg bug/crash info. > > When people review this, please offer opinions on whether a fix should > be backported into -stable kernels, thanks. > >> compact_zone() memunmap_page >> ------------- --------------- >> __pageblock_pfn_to_page >> ...... >> (a)pfn_valid(): >> valid_section()//return true >> (b)__remove_pages()-> >> sparse_remove_section()-> >> section_deactivate(): >> [Free the array ms->usage and set >> ms->usage = NULL] >> pfn_section_valid() >> [Access ms->usage which >> is NULL] >> >> NOTE: From the above it can be said that the race is reduced to between >> the pfn_valid()/pfn_section_valid() and the section deactivate with >> SPASEMEM_VMEMAP enabled. >> >> The commit b943f045a9af("mm/sparse: fix kernel crash with >> pfn_section_valid check") tried to address the same problem by clearing >> the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns >> false thus ms->usage is not accessed. >> >> Fix this issue by the below steps: >> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. >> b) RCU protected read side critical section will either return NULL when >> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. >> c) Synchronize the rcu on the write side and free the ->usage. No >> attempt will be made to access ->usage after this as the >> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. This affects any kind of memory hotunplug. When hotunplugging memory we will end up calling synchronize_rcu() for each and every memory section, which sounds extremely wasteful. Can't we find a way to kfree_rcu() that thing and read/write the pointer using READ?ONCE?WRITE_ONCE instead?
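As a rough illustration of the kfree_rcu() direction suggested above (not the
posted patch): it assumes struct mem_section_usage grows an rcu_head member,
named rcu here purely for the example.

/* Sketch only: assumes struct mem_section_usage gains "struct rcu_head rcu;". */
static void free_section_usage(struct mem_section *ms)
{
	struct mem_section_usage *usage = ms->usage;

	/* Unpublish first so valid_section()/pfn_section_valid() bail out. */
	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
	WRITE_ONCE(ms->usage, NULL);

	/* Reclaim after a grace period without blocking the hot-remove path. */
	kfree_rcu(usage, rcu);
}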
On Fri, Oct 13, 2023 at 06:34:27PM +0530, Charan Teja Kalla wrote: > The below race is observed on a PFN which falls into the device memory > region with the system memory configuration where PFN's are such that > [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and > end pfn contains the device memory PFN's as well, the compaction > triggered will try on the device memory PFN's too though they end up in > NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory > sections). When from other core, the section mappings are being removed > for the ZONE_DEVICE region, that the PFN in question belongs to, > on which compaction is currently being operated is resulting into the > kernel crash with CONFIG_SPASEMEM_VMEMAP enabled. > > compact_zone() memunmap_pages > ------------- --------------- > __pageblock_pfn_to_page > ...... > (a)pfn_valid(): > valid_section()//return true > (b)__remove_pages()-> > sparse_remove_section()-> > section_deactivate(): > [Free the array ms->usage and set > ms->usage = NULL] > pfn_section_valid() > [Access ms->usage which > is NULL] > > NOTE: From the above it can be said that the race is reduced to between > the pfn_valid()/pfn_section_valid() and the section deactivate with > SPASEMEM_VMEMAP enabled. > > The commit b943f045a9af("mm/sparse: fix kernel crash with > pfn_section_valid check") tried to address the same problem by clearing > the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns > false thus ms->usage is not accessed. > > Fix this issue by the below steps: > a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. > b) RCU protected read side critical section will either return NULL when > SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. > c) Synchronize the rcu on the write side and free the ->usage. No > attempt will be made to access ->usage after this as the > SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. > > Since the section_deactivate() is a rare operation and will come in the > hot remove path, impact of synchronize_rcu() should be negligble. struct mem_section_usage has other field like pageblock_flags. Do we need to protect its readers with RCU? Also can we annotate usage field in struct mem_section with __rcu and use RCU accessors like rcu_dereference() while using memsection::usage field? > > Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot") > Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> > --- > include/linux/mmzone.h | 11 +++++++++-- > mm/sparse.c | 14 ++++++++------ > 2 files changed, 17 insertions(+), 8 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 4106fbc..c877396 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1987,6 +1987,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) > static inline int pfn_valid(unsigned long pfn) > { > struct mem_section *ms; > + int ret; > > /* > * Ensure the upper PAGE_SHIFT bits are clear in the > @@ -2000,13 +2001,19 @@ static inline int pfn_valid(unsigned long pfn) > if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) > return 0; > ms = __pfn_to_section(pfn); > - if (!valid_section(ms)) > + rcu_read_lock(); > + if (!valid_section(ms)) { > + rcu_read_unlock(); > return 0; > + } > /* > * Traditionally early sections always returned pfn_valid() for > * the entire section-sized span. 
> */ > - return early_section(ms) || pfn_section_valid(ms, pfn); > + ret = early_section(ms) || pfn_section_valid(ms, pfn); > + rcu_read_unlock(); > + > + return ret; > } > #endif > > diff --git a/mm/sparse.c b/mm/sparse.c > index 77d91e5..ca7dbe1 100644 > --- a/mm/sparse.c > +++ b/mm/sparse.c > @@ -792,6 +792,13 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, > unsigned long section_nr = pfn_to_section_nr(pfn); > > /* > + * Mark the section invalid so that valid_section() > + * return false. This prevents code from dereferencing > + * ms->usage array. > + */ > + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; > + This trick may not be needed if we add proper NULL checks around ms->usage. We are anyway introducing a new rule this check needs to be done under RCU lock, so why not revisit it? > + /* > * When removing an early section, the usage map is kept (as the > * usage maps of other sections fall into the same page). It > * will be re-used when re-adding the section - which is then no > @@ -799,16 +806,11 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, > * was allocated during boot. > */ > if (!PageReserved(virt_to_page(ms->usage))) { > + synchronize_rcu(); > kfree(ms->usage); > ms->usage = NULL; > } If we add NULL checks around ms->usage, this becomes tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked()); syncrhonize_rcu(); kfree(tmp); btw, Do we come here with any global locks? if yes, synchronize_rcu() can add delays in releasing the lock. In that case we may have to go for async RCU free. > memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); > - /* > - * Mark the section invalid so that valid_section() > - * return false. This prevents code from dereferencing > - * ms->usage array. > - */ > - ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; > } > > /* > Thanks, Pavan
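A sketch of the __rcu/rcu_dereference() idea floated above (simplified, not the
actual patch); hotplug_locked() is kept as the placeholder condition used in the
review comment and is not a real helper.

/* Assumes mem_section::usage is annotated __rcu; simplified for illustration. */
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
	int idx = subsection_map_index(pfn);
	struct mem_section_usage *usage = rcu_dereference(ms->usage);

	/* NULL check instead of relying on SECTION_HAS_MEM_MAP alone. */
	return usage ? test_bit(idx, usage->subsection_map) : 0;
}

/* Write side; hotplug_locked() is a placeholder from the review comment. */
static void section_usage_free(struct mem_section *ms)
{
	struct mem_section_usage *tmp;

	tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked());
	synchronize_rcu();
	kfree(tmp);
}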
Thanks Andrew/David, On 10/16/2023 1:53 PM, David Hildenbrand wrote: >>> The below race is observed on a PFN which falls into the device memory >>> region with the system memory configuration where PFN's are such that >>> [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and >>> end pfn contains the device memory PFN's as well, the compaction >>> triggered will try on the device memory PFN's too though they end up in >>> NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory >>> sections). When from other core, the section mappings are being removed >>> for the ZONE_DEVICE region, that the PFN in question belongs to, >>> on which compaction is currently being operated is resulting into the >>> kernel crash with CONFIG_SPASEMEM_VMEMAP enabled. >> >> Seems this bug is four years old, yes? It must be quite hard to hit. > > From the description, it's not quite clear to me if this was actually > hit -- usually people include the dmesg bug/crash info. On Snapdragon SoC, with the mentioned memory configuration of PFN's as [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of issues daily while testing on a device farm. I note that from next time on wards will send the demsg bug/crash info for these type of issues. For this particular issue below is the log. Though the below log is not directly pointing to the pfn_section_valid(){ ms->usage;}, when we loaded this dump on T32 lauterbach tool, it is pointing. [ 540.578056] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 540.578068] Mem abort info: [ 540.578070] ESR = 0x0000000096000005 [ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits [ 540.578077] SET = 0, FnV = 0 [ 540.578080] EA = 0, S1PTW = 0 [ 540.578082] FSC = 0x05: level 1 translation fault [ 540.578085] Data abort info: [ 540.578086] ISV = 0, ISS = 0x00000005 [ 540.578088] CM = 0, WnR = 0 [ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--) [ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c [ 540.579454] lr : compact_zone+0x994/0x1058 [ 540.579460] sp : ffffffc03579b510 [ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27: 000000000000000c [ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24: ffffffc03579b640 [ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21: 0000000000000000 [ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18: ffffffdebf66d140 [ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15: 00000000009f4bff [ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12: 0000000000000001 [ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffff897d2cd440 [ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffffc03579b5b4 [ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 : 0000000000000001 [ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 : 0000000000235800 [ 540.579524] Call trace: [ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c [ 540.579533] compact_zone+0x994/0x1058 [ 540.579536] try_to_compact_pages+0x128/0x378 [ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0 [ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10 [ 540.579547] __alloc_pages+0x250/0x2d0 [ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc [ 540.579561] iommu_dma_alloc+0xa0/0x320 [ 540.579565] dma_alloc_attrs+0xd4/0x108 >>> Fix this issue by the below steps: >>> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. 
>>> b) RCU protected read side critical section will either return NULL when >>> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. >>> c) Synchronize the rcu on the write side and free the ->usage. No >>> attempt will be made to access ->usage after this as the >>> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. > > > This affects any kind of memory hotunplug. When hotunplugging memory we > will end up calling synchronize_rcu() for each and every memory section, > which sounds extremely wasteful. > > Can't we find a way to kfree_rcu() that thing and read/write the pointer > using READ?ONCE?WRITE_ONCE instead? I am inspired to use the synchronize_rcu() because of [1] where we did use it in offline_page_ext(). And my limited understanding is that, a user can trigger the offline operation more often than the unplug operation. I agree here that there is a scope to use kfree_rcu() unlike in [1]. Let me check for a way to use it. [1] https://lore.kernel.org/all/1661496993-11473-1-git-send-email-quic_charante@quicinc.com/ Thanks, Charan
On Mon, 16 Oct 2023 19:08:00 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote: > > From the description, it's not quite clear to me if this was actually > > hit -- usually people include the dmesg bug/crash info. > > On Snapdragon SoC, with the mentioned memory configuration of PFN's as > [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of > issues daily while testing on a device farm. > > I note that from next time on wards will send the demsg bug/crash info > for these type of issues. For this particular issue below is the log. > Though the below log is not directly pointing to the > pfn_section_valid(){ ms->usage;}, when we loaded this dump on T32 > lauterbach tool, it is pointing. > > [ 540.578056] Unable to handle kernel NULL pointer dereference at > virtual address 0000000000000000 > [ 540.578068] Mem abort info: > [ 540.578070] ESR = 0x0000000096000005 > [ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits > [ 540.578077] SET = 0, FnV = 0 > [ 540.578080] EA = 0, S1PTW = 0 > [ 540.578082] FSC = 0x05: level 1 translation fault > [ 540.578085] Data abort info: > [ 540.578086] ISV = 0, ISS = 0x00000005 > [ 540.578088] CM = 0, WnR = 0 > [ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS > BTYPE=--) > [ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c > [ 540.579454] lr : compact_zone+0x994/0x1058 > [ 540.579460] sp : ffffffc03579b510 > [ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27: > 000000000000000c > [ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24: > ffffffc03579b640 > [ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21: > 0000000000000000 > [ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18: > ffffffdebf66d140 > [ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15: > 00000000009f4bff > [ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12: > 0000000000000001 > [ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 : > ffffff897d2cd440 > [ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 : > ffffffc03579b5b4 > [ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 : > 0000000000000001 > [ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 : > 0000000000235800 > [ 540.579524] Call trace: > [ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c > [ 540.579533] compact_zone+0x994/0x1058 > [ 540.579536] try_to_compact_pages+0x128/0x378 > [ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0 > [ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10 > [ 540.579547] __alloc_pages+0x250/0x2d0 > [ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc > [ 540.579561] iommu_dma_alloc+0xa0/0x320 > [ 540.579565] dma_alloc_attrs+0xd4/0x108 Thanks. I added the above info to the changelog, added a cc:stable and I added a note-to-myself that a new version of the fix may be forthcoming.
Thanks Pavan!! On 10/16/2023 4:03 PM, Pavan Kondeti wrote: >> Fix this issue by the below steps: >> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. >> b) RCU protected read side critical section will either return NULL when >> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. >> c) Synchronize the rcu on the write side and free the ->usage. No >> attempt will be made to access ->usage after this as the >> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. >> >> Since the section_deactivate() is a rare operation and will come in the >> hot remove path, impact of synchronize_rcu() should be negligble. > struct mem_section_usage has other field like pageblock_flags. Do we > need to protect its readers with RCU? Also can we annotate usage field > in struct mem_section with __rcu and use RCU accessors like > rcu_dereference() while using memsection::usage field? Good question about the pageblock_flags!! I think we rely on the get_pageblock_bitmap() to read the ms->usage->pageblock_flags by passing struct page*. 1) All the functions that I have come across calling get_pageblock_bitmap()/get_pfnblock_flags_mask() passing the page* which it get from buddy. I think we are safe here as the device pages(from which the problem is coming will never be onlined/added to buddy) 2) There are functions in compaction which directly operate on the pfn's through pfn_to_online_page(). As for device pages, it is going to return NULL, I think we are safe here too. 3) alloc_contig_range() which also operate on the pfn's directly, and TMK, we will not pass the [start , end) values of this function to contains the hole/offlined pages. I think we are safe here too. May be David/other reviewers can help in commenting if there are some mistakes above. I am currently working on to use the __rcu accessors. > >> /* >> + * Mark the section invalid so that valid_section() >> + * return false. This prevents code from dereferencing >> + * ms->usage array. >> + */ >> + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; >> + > This trick may not be needed if we add proper NULL checks around ms->usage. We are anyway > introducing a new rule this check needs to be done under RCU lock, so why not revisit it? > I think this is for valid_section(). >> * was allocated during boot. >> */ >> if (!PageReserved(virt_to_page(ms->usage))) { >> + synchronize_rcu(); >> kfree(ms->usage); >> ms->usage = NULL; >> } > If we add NULL checks around ms->usage, this becomes > > tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked()); > syncrhonize_rcu(); > kfree(tmp); Per David input, I am working on using kfree_rcu(). Thanks, Charan
On 17.10.23 16:10, Charan Teja Kalla wrote: > Thanks Pavan!! > > On 10/16/2023 4:03 PM, Pavan Kondeti wrote: >>> Fix this issue by the below steps: >>> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. >>> b) RCU protected read side critical section will either return NULL when >>> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. >>> c) Synchronize the rcu on the write side and free the ->usage. No >>> attempt will be made to access ->usage after this as the >>> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. >>> >>> Since the section_deactivate() is a rare operation and will come in the >>> hot remove path, impact of synchronize_rcu() should be negligble. >> struct mem_section_usage has other field like pageblock_flags. Do we >> need to protect its readers with RCU? Also can we annotate usage field >> in struct mem_section with __rcu and use RCU accessors like >> rcu_dereference() while using memsection::usage field? > > Good question about the pageblock_flags!! I think we rely on the > get_pageblock_bitmap() to read the ms->usage->pageblock_flags by passing > struct page*. > > 1) All the functions that I have come across calling > get_pageblock_bitmap()/get_pfnblock_flags_mask() passing the page* which > it get from buddy. I think we are safe here as the device pages(from > which the problem is coming will never be onlined/added to buddy) > > 2) There are functions in compaction which directly operate on the pfn's > through pfn_to_online_page(). As for device pages, it is going to return > NULL, I think we are safe here too. > > 3) alloc_contig_range() which also operate on the pfn's directly, and > TMK, we will not pass the [start , end) values of this function to > contains the hole/offlined pages. I think we are safe here too. > > May be David/other reviewers can help in commenting if there are some > mistakes above. Sound reasonable to me; most PFN walkers shouldn't deal with pageblock flags. alloc_contig_range() is certainly interesting, I suspect it's fine but we better double-check.
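For reference, the pageblock-flags reader path being discussed reduces to a
dereference along the lines of the sketch below (simplified, not verbatim
kernel code); it performs the same ms->usage access as pfn_section_valid(),
which is why only callers that can race with section teardown matter.

/* Simplified sketch of the pageblock-flags read path under discussion. */
static unsigned long *pageblock_bitmap_of(unsigned long pfn)
{
	struct mem_section *ms = __pfn_to_section(pfn);

	/*
	 * The argument above is that callers reach here with a page taken
	 * from the buddy allocator or via pfn_to_online_page(), so they
	 * never observe a ZONE_DEVICE section that is being torn down.
	 */
	return ms->usage->pageblock_flags;
}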
On 16.10.23 15:38, Charan Teja Kalla wrote: > Thanks Andrew/David, > > On 10/16/2023 1:53 PM, David Hildenbrand wrote: >>>> The below race is observed on a PFN which falls into the device memory >>>> region with the system memory configuration where PFN's are such that >>>> [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and >>>> end pfn contains the device memory PFN's as well, the compaction >>>> triggered will try on the device memory PFN's too though they end up in >>>> NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory >>>> sections). When from other core, the section mappings are being removed >>>> for the ZONE_DEVICE region, that the PFN in question belongs to, >>>> on which compaction is currently being operated is resulting into the >>>> kernel crash with CONFIG_SPASEMEM_VMEMAP enabled. >>> >>> Seems this bug is four years old, yes? It must be quite hard to hit. >> >> From the description, it's not quite clear to me if this was actually >> hit -- usually people include the dmesg bug/crash info. > > On Snapdragon SoC, with the mentioned memory configuration of PFN's as > [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of > issues daily while testing on a device farm. > > I note that from next time on wards will send the demsg bug/crash info > for these type of issues. For this particular issue below is the log. > Though the below log is not directly pointing to the > pfn_section_valid(){ ms->usage;}, when we loaded this dump on T32 > lauterbach tool, it is pointing. > > [ 540.578056] Unable to handle kernel NULL pointer dereference at > virtual address 0000000000000000 > [ 540.578068] Mem abort info: > [ 540.578070] ESR = 0x0000000096000005 > [ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits > [ 540.578077] SET = 0, FnV = 0 > [ 540.578080] EA = 0, S1PTW = 0 > [ 540.578082] FSC = 0x05: level 1 translation fault > [ 540.578085] Data abort info: > [ 540.578086] ISV = 0, ISS = 0x00000005 > [ 540.578088] CM = 0, WnR = 0 > [ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS > BTYPE=--) > [ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c > [ 540.579454] lr : compact_zone+0x994/0x1058 > [ 540.579460] sp : ffffffc03579b510 > [ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27: > 000000000000000c > [ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24: > ffffffc03579b640 > [ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21: > 0000000000000000 > [ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18: > ffffffdebf66d140 > [ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15: > 00000000009f4bff > [ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12: > 0000000000000001 > [ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 : > ffffff897d2cd440 > [ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 : > ffffffc03579b5b4 > [ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 : > 0000000000000001 > [ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 : > 0000000000235800 > [ 540.579524] Call trace: > [ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c > [ 540.579533] compact_zone+0x994/0x1058 > [ 540.579536] try_to_compact_pages+0x128/0x378 > [ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0 > [ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10 > [ 540.579547] __alloc_pages+0x250/0x2d0 > [ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc > [ 540.579561] iommu_dma_alloc+0xa0/0x320 > [ 540.579565] dma_alloc_attrs+0xd4/0x108 > >>>> Fix 
this issue by the below steps: >>>> a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. >>>> b) RCU protected read side critical section will either return NULL when >>>> SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. >>>> c) Synchronize the rcu on the write side and free the ->usage. No >>>> attempt will be made to access ->usage after this as the >>>> SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. >> >> >> This affects any kind of memory hotunplug. When hotunplugging memory we >> will end up calling synchronize_rcu() for each and every memory section, >> which sounds extremely wasteful. >> >> Can't we find a way to kfree_rcu() that thing and read/write the pointer >> using READ?ONCE?WRITE_ONCE instead? > > I am inspired to use the synchronize_rcu() because of [1] where we did > use it in offline_page_ext(). And my limited understanding is that, a > user can trigger the offline operation more often than the unplug operation. In theory yes. In practice there are not many use cases where we do that. Further, page_ext is already not used that frequently (page owner, young, idle tracking only), especially in most production (!debug) environments. > > I agree here that there is a scope to use kfree_rcu() unlike in [1]. Let > me check for a way to use it. Exactly, if we could use that, it would be ideal.
On Tue, 17 Oct 2023 19:40:15 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote: > >> * was allocated during boot. > >> */ > >> if (!PageReserved(virt_to_page(ms->usage))) { > >> + synchronize_rcu(); > >> kfree(ms->usage); > >> ms->usage = NULL; > >> } > > If we add NULL checks around ms->usage, this becomes > > > > tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked()); > > syncrhonize_rcu(); > > kfree(tmp); > Per David input, I am working on using kfree_rcu(). How's it coming along? Given that we're at 6.6-rc7 and given that this issue is causing daily crashes in your device farm, I'm thinking that we use the current version of your patch for 6.6 and for -stable. We can look at the kfree_rcu() optimization for later kernel releases?
On 25.10.23 23:35, Andrew Morton wrote: > On Tue, 17 Oct 2023 19:40:15 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote: > >>>> * was allocated during boot. >>>> */ >>>> if (!PageReserved(virt_to_page(ms->usage))) { >>>> + synchronize_rcu(); >>>> kfree(ms->usage); >>>> ms->usage = NULL; >>>> } >>> If we add NULL checks around ms->usage, this becomes >>> >>> tmp = rcu_replace_pointer(ms->usage, NULL, hotplug_locked()); >>> syncrhonize_rcu(); >>> kfree(tmp); >> Per David input, I am working on using kfree_rcu(). > > How's it coming along? > > Given that we're at 6.6-rc7 and given that this issue is causing > daily crashes in your device farm, I'm thinking that we use the current > version of your patch for 6.6 and for -stable. We can look at the > kfree_rcu() optimization for later kernel releases? Any particular reason we have to rush this in? It's been seen by one company in a testing farm; there were no other reports, especially not from production systems. ... and the issue seems to be quite old.
Thanks Andrew/David, On 10/26/2023 12:30 PM, David Hildenbrand wrote: >> >> How's it coming along? >> I was in vacation... Will send V2 by Monday IST. >> Given that we're at 6.6-rc7 and given that this issue is causing >> daily crashes in your device farm, I'm thinking that we use the current >> version of your patch for 6.6 and for -stable. We can look at the >> kfree_rcu() optimization for later kernel releases? > > Any particular reason we have to rush this in? It's been seen by one > company in a testing farm; there were no other reports, especially not > from production systems. ... and the issue seems to be quite old. I think we can work on to merge the final solution here rather than an incremental one. At least internally we are unblocked with V1. Thanks, Charan
Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: kasan-dev@googlegroups.com Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Hi folks, (adding KMSAN reviewers and IBM people who are currently porting KMSAN to other architectures, plus Paul for his opinion on refactoring RCU) this patch broke x86 KMSAN in a subtle way. For every memory access in the code instrumented by KMSAN we call kmsan_get_metadata() to obtain the metadata for the memory being accessed. For virtual memory the metadata pointers are stored in the corresponding `struct page`, therefore we need to call virt_to_page() to get them. According to the comment in arch/x86/include/asm/page.h, virt_to_page(kaddr) returns a valid pointer iff virt_addr_valid(kaddr) is true, so KMSAN needs to call virt_addr_valid() as well. To avoid recursion, kmsan_get_metadata() must not call instrumented code, therefore ./arch/x86/include/asm/kmsan.h forks parts of arch/x86/mm/physaddr.c to check whether a virtual address is valid or not. But the introduction of rcu_read_lock() to pfn_valid() added instrumented RCU API calls to virt_to_page_or_null(), which is called by kmsan_get_metadata(), so there is an infinite recursion now. I do not think it is correct to stop that recursion by doing kmsan_enter_runtime()/kmsan_exit_runtime() in kmsan_get_metadata(): that would prevent instrumented functions called from within the runtime from tracking the shadow values, which might introduce false positives. I am currently looking into inlining __rcu_read_lock()/__rcu_read_unlock(), into KMSAN code to prevent it from being instrumented, but that might require factoring out parts of kernel/rcu/tree_plugin.h into a non-private header. Do you think this is feasible? Another option is to cut some edges in the code calling virt_to_page(). First, my observation is that virt_addr_valid() is quite rare in the kernel code, i.e. not all cases of calling virt_to_page() are covered with it. Second, every memory access to KMSAN metadata residing in virt_to_page(kaddr)->shadow always accompanies an access to `kaddr` itself, so if there is a race on a PFN then the access to `kaddr` will probably also trigger a fault. Third, KMSAN metadata accesses are inherently non-atomic, and even if we ensure pfn_valid() is returning a consistent value for a single memory access, calling it twice may already return different results. Considering the above, how bad would it be to drop synchronization for KMSAN's version of pfn_valid() called from kmsan_virt_addr_valid()?
On Mon, 15 Jan 2024 at 19:44, Alexander Potapenko <glider@google.com> wrote: > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > Cc: Marco Elver <elver@google.com> > Cc: Dmitry Vyukov <dvyukov@google.com> > Cc: kasan-dev@googlegroups.com > Cc: Ilya Leoshkevich <iii@linux.ibm.com> > Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> > > Hi folks, > > (adding KMSAN reviewers and IBM people who are currently porting KMSAN to other > architectures, plus Paul for his opinion on refactoring RCU) > > this patch broke x86 KMSAN in a subtle way. > > For every memory access in the code instrumented by KMSAN we call > kmsan_get_metadata() to obtain the metadata for the memory being accessed. For > virtual memory the metadata pointers are stored in the corresponding `struct > page`, therefore we need to call virt_to_page() to get them. > > According to the comment in arch/x86/include/asm/page.h, virt_to_page(kaddr) > returns a valid pointer iff virt_addr_valid(kaddr) is true, so KMSAN needs to > call virt_addr_valid() as well. > > To avoid recursion, kmsan_get_metadata() must not call instrumented code, > therefore ./arch/x86/include/asm/kmsan.h forks parts of arch/x86/mm/physaddr.c > to check whether a virtual address is valid or not. > > But the introduction of rcu_read_lock() to pfn_valid() added instrumented RCU > API calls to virt_to_page_or_null(), which is called by kmsan_get_metadata(), > so there is an infinite recursion now. I do not think it is correct to stop that > recursion by doing kmsan_enter_runtime()/kmsan_exit_runtime() in > kmsan_get_metadata(): that would prevent instrumented functions called from > within the runtime from tracking the shadow values, which might introduce false > positives. > > I am currently looking into inlining __rcu_read_lock()/__rcu_read_unlock(), into > KMSAN code to prevent it from being instrumented, but that might require factoring > out parts of kernel/rcu/tree_plugin.h into a non-private header. Do you think this > is feasible? __rcu_read_lock/unlock() is only outlined in PREEMPT_RCU. Not sure that helps. Otherwise, there is rcu_read_lock_sched_notrace() which does the bare minimum and is static inline. Does that help? > Another option is to cut some edges in the code calling virt_to_page(). First, > my observation is that virt_addr_valid() is quite rare in the kernel code, i.e. > not all cases of calling virt_to_page() are covered with it. Second, every > memory access to KMSAN metadata residing in virt_to_page(kaddr)->shadow always > accompanies an access to `kaddr` itself, so if there is a race on a PFN then > the access to `kaddr` will probably also trigger a fault. Third, KMSAN metadata > accesses are inherently non-atomic, and even if we ensure pfn_valid() is > returning a consistent value for a single memory access, calling it twice may > already return different results. > > Considering the above, how bad would it be to drop synchronization for KMSAN's > version of pfn_valid() called from kmsan_virt_addr_valid()?
On Mon, Jan 15, 2024 at 09:34PM +0100, Marco Elver wrote: > On Mon, 15 Jan 2024 at 19:44, Alexander Potapenko <glider@google.com> wrote: > > > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > Cc: Marco Elver <elver@google.com> > > Cc: Dmitry Vyukov <dvyukov@google.com> > > Cc: kasan-dev@googlegroups.com > > Cc: Ilya Leoshkevich <iii@linux.ibm.com> > > Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> > > > > Hi folks, > > > > (adding KMSAN reviewers and IBM people who are currently porting KMSAN to other > > architectures, plus Paul for his opinion on refactoring RCU) > > > > this patch broke x86 KMSAN in a subtle way. > > > > For every memory access in the code instrumented by KMSAN we call > > kmsan_get_metadata() to obtain the metadata for the memory being accessed. For > > virtual memory the metadata pointers are stored in the corresponding `struct > > page`, therefore we need to call virt_to_page() to get them. > > > > According to the comment in arch/x86/include/asm/page.h, virt_to_page(kaddr) > > returns a valid pointer iff virt_addr_valid(kaddr) is true, so KMSAN needs to > > call virt_addr_valid() as well. > > > > To avoid recursion, kmsan_get_metadata() must not call instrumented code, > > therefore ./arch/x86/include/asm/kmsan.h forks parts of arch/x86/mm/physaddr.c > > to check whether a virtual address is valid or not. > > > > But the introduction of rcu_read_lock() to pfn_valid() added instrumented RCU > > API calls to virt_to_page_or_null(), which is called by kmsan_get_metadata(), > > so there is an infinite recursion now. I do not think it is correct to stop that > > recursion by doing kmsan_enter_runtime()/kmsan_exit_runtime() in > > kmsan_get_metadata(): that would prevent instrumented functions called from > > within the runtime from tracking the shadow values, which might introduce false > > positives. > > > > I am currently looking into inlining __rcu_read_lock()/__rcu_read_unlock(), into > > KMSAN code to prevent it from being instrumented, but that might require factoring > > out parts of kernel/rcu/tree_plugin.h into a non-private header. Do you think this > > is feasible? > > __rcu_read_lock/unlock() is only outlined in PREEMPT_RCU. Not sure that helps. > > Otherwise, there is rcu_read_lock_sched_notrace() which does the bare > minimum and is static inline. > > Does that help? Hrm, rcu_read_unlock_sched_notrace() can still call __preempt_schedule_notrace(), which is again instrumented by KMSAN. This patch gets me a working kernel: diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4ed33b127821..2d62df462d88 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -2000,6 +2000,7 @@ static inline int pfn_valid(unsigned long pfn) { struct mem_section *ms; int ret; + unsigned long flags; /* * Ensure the upper PAGE_SHIFT bits are clear in the @@ -2013,9 +2014,9 @@ static inline int pfn_valid(unsigned long pfn) if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) return 0; ms = __pfn_to_section(pfn); - rcu_read_lock(); + local_irq_save(flags); if (!valid_section(ms)) { - rcu_read_unlock(); + local_irq_restore(flags); return 0; } /* @@ -2023,7 +2024,7 @@ static inline int pfn_valid(unsigned long pfn) * the entire section-sized span. */ ret = early_section(ms) || pfn_section_valid(ms, pfn); - rcu_read_unlock(); + local_irq_restore(flags); return ret; } Disabling interrupts is a little heavy handed - it also assumes the current RCU implementation. 
There is preempt_enable_no_resched_notrace(), but that might be worse because it breaks scheduling guarantees. That being said, whatever we do here should be wrapped in some rcu_read_lock/unlock_<newvariant>() helper. Is there an existing helper we can use? If not, we need a variant that can be used from extremely constrained contexts that can't even call into the scheduler. And if we want pfn_valid() to switch to it, it also should be fast. Thanks, -- Marco
> > Hrm, rcu_read_unlock_sched_notrace() can still call > __preempt_schedule_notrace(), which is again instrumented by KMSAN. > > This patch gets me a working kernel: > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 4ed33b127821..2d62df462d88 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -2000,6 +2000,7 @@ static inline int pfn_valid(unsigned long pfn) > { > struct mem_section *ms; > int ret; > + unsigned long flags; > > /* > * Ensure the upper PAGE_SHIFT bits are clear in the > @@ -2013,9 +2014,9 @@ static inline int pfn_valid(unsigned long pfn) > if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) > return 0; > ms = __pfn_to_section(pfn); > - rcu_read_lock(); > + local_irq_save(flags); > if (!valid_section(ms)) { > - rcu_read_unlock(); > + local_irq_restore(flags); > return 0; > } > /* > @@ -2023,7 +2024,7 @@ static inline int pfn_valid(unsigned long pfn) > * the entire section-sized span. > */ > ret = early_section(ms) || pfn_section_valid(ms, pfn); > - rcu_read_unlock(); > + local_irq_restore(flags); > > return ret; > } > > Disabling interrupts is a little heavy handed - it also assumes the > current RCU implementation. There is > preempt_enable_no_resched_notrace(), but that might be worse because it > breaks scheduling guarantees. > > That being said, whatever we do here should be wrapped in some > rcu_read_lock/unlock_<newvariant>() helper. We could as well redefine rcu_read_lock/unlock in mm/kmsan/shadow.c (or the x86-specific KMSAN header, depending on whether people are seeing the problem on s390 and Power) with some header magic. But that's probably more fragile than adding a helper. > > Is there an existing helper we can use? If not, we need a variant that > can be used from extremely constrained contexts that can't even call > into the scheduler. And if we want pfn_valid() to switch to it, it also > should be fast.
On Thu, Jan 18, 2024 at 10:01AM +0100, Alexander Potapenko wrote: > > > > Hrm, rcu_read_unlock_sched_notrace() can still call > > __preempt_schedule_notrace(), which is again instrumented by KMSAN. > > > > This patch gets me a working kernel: > > [...] > > Disabling interrupts is a little heavy handed - it also assumes the > > current RCU implementation. There is > > preempt_enable_no_resched_notrace(), but that might be worse because it > > breaks scheduling guarantees. > > > > That being said, whatever we do here should be wrapped in some > > rcu_read_lock/unlock_<newvariant>() helper. > > We could as well redefine rcu_read_lock/unlock in mm/kmsan/shadow.c > (or the x86-specific KMSAN header, depending on whether people are > seeing the problem on s390 and Power) with some header magic. > But that's probably more fragile than adding a helper. > > > > > Is there an existing helper we can use? If not, we need a variant that > > can be used from extremely constrained contexts that can't even call > > into the scheduler. And if we want pfn_valid() to switch to it, it also > > should be fast. The below patch also gets me a working kernel. For pfn_valid(), using rcu_read_lock_sched() should be reasonable, given its critical section is very small and also enables it to be called from more constrained contexts again (like KMSAN). Within KMSAN we also have to suppress reschedules. This is again not ideal, but since it's limited to KMSAN should be tolerable. WDYT? ------ >8 ------ diff --git a/arch/x86/include/asm/kmsan.h b/arch/x86/include/asm/kmsan.h index 8fa6ac0e2d76..bbb1ba102129 100644 --- a/arch/x86/include/asm/kmsan.h +++ b/arch/x86/include/asm/kmsan.h @@ -64,6 +64,7 @@ static inline bool kmsan_virt_addr_valid(void *addr) { unsigned long x = (unsigned long)addr; unsigned long y = x - __START_KERNEL_map; + bool ret; /* use the carry flag to determine if x was < __START_KERNEL_map */ if (unlikely(x > y)) { @@ -79,7 +80,21 @@ static inline bool kmsan_virt_addr_valid(void *addr) return false; } - return pfn_valid(x >> PAGE_SHIFT); + /* + * pfn_valid() relies on RCU, and may call into the scheduler on exiting + * the critical section. However, this would result in recursion with + * KMSAN. Therefore, disable preemption here, and re-enable preemption + * below while suppressing rescheduls to avoid recursion. + * + * Note, this sacrifices occasionally breaking scheduling guarantees. + * Although, a kernel compiled with KMSAN has already given up on any + * performance guarantees due to being heavily instrumented. + */ + preempt_disable(); + ret = pfn_valid(x >> PAGE_SHIFT); + preempt_enable_no_resched(); + + return ret; } #endif /* !MODULE */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4ed33b127821..a497f189d988 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -2013,9 +2013,9 @@ static inline int pfn_valid(unsigned long pfn) if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) return 0; ms = __pfn_to_section(pfn); - rcu_read_lock(); + rcu_read_lock_sched(); if (!valid_section(ms)) { - rcu_read_unlock(); + rcu_read_unlock_sched(); return 0; } /* @@ -2023,7 +2023,7 @@ static inline int pfn_valid(unsigned long pfn) * the entire section-sized span. */ ret = early_section(ms) || pfn_section_valid(ms, pfn); - rcu_read_unlock(); + rcu_read_unlock_sched(); return ret; }
On Thu, Jan 18, 2024 at 10:43:06AM +0100, Marco Elver wrote: > On Thu, Jan 18, 2024 at 10:01AM +0100, Alexander Potapenko wrote: > > > > > > Hrm, rcu_read_unlock_sched_notrace() can still call > > > __preempt_schedule_notrace(), which is again instrumented by KMSAN. > > > > > > This patch gets me a working kernel: > > > > [...] > > > Disabling interrupts is a little heavy handed - it also assumes the > > > current RCU implementation. There is > > > preempt_enable_no_resched_notrace(), but that might be worse because it > > > breaks scheduling guarantees. > > > > > > That being said, whatever we do here should be wrapped in some > > > rcu_read_lock/unlock_<newvariant>() helper. > > > > We could as well redefine rcu_read_lock/unlock in mm/kmsan/shadow.c > > (or the x86-specific KMSAN header, depending on whether people are > > seeing the problem on s390 and Power) with some header magic. > > But that's probably more fragile than adding a helper. > > > > > > > > Is there an existing helper we can use? If not, we need a variant that > > > can be used from extremely constrained contexts that can't even call > > > into the scheduler. And if we want pfn_valid() to switch to it, it also > > > should be fast. > > The below patch also gets me a working kernel. For pfn_valid(), using > rcu_read_lock_sched() should be reasonable, given its critical section > is very small and also enables it to be called from more constrained > contexts again (like KMSAN). > > Within KMSAN we also have to suppress reschedules. This is again not > ideal, but since it's limited to KMSAN should be tolerable. > > WDYT? I like this one better from a purely selfish RCU perspective. ;-) Thanx, Paul > ------ >8 ------ > > diff --git a/arch/x86/include/asm/kmsan.h b/arch/x86/include/asm/kmsan.h > index 8fa6ac0e2d76..bbb1ba102129 100644 > --- a/arch/x86/include/asm/kmsan.h > +++ b/arch/x86/include/asm/kmsan.h > @@ -64,6 +64,7 @@ static inline bool kmsan_virt_addr_valid(void *addr) > { > unsigned long x = (unsigned long)addr; > unsigned long y = x - __START_KERNEL_map; > + bool ret; > > /* use the carry flag to determine if x was < __START_KERNEL_map */ > if (unlikely(x > y)) { > @@ -79,7 +80,21 @@ static inline bool kmsan_virt_addr_valid(void *addr) > return false; > } > > - return pfn_valid(x >> PAGE_SHIFT); > + /* > + * pfn_valid() relies on RCU, and may call into the scheduler on exiting > + * the critical section. However, this would result in recursion with > + * KMSAN. Therefore, disable preemption here, and re-enable preemption > + * below while suppressing rescheduls to avoid recursion. > + * > + * Note, this sacrifices occasionally breaking scheduling guarantees. > + * Although, a kernel compiled with KMSAN has already given up on any > + * performance guarantees due to being heavily instrumented. > + */ > + preempt_disable(); > + ret = pfn_valid(x >> PAGE_SHIFT); > + preempt_enable_no_resched(); > + > + return ret; > } > > #endif /* !MODULE */ > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 4ed33b127821..a497f189d988 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -2013,9 +2013,9 @@ static inline int pfn_valid(unsigned long pfn) > if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) > return 0; > ms = __pfn_to_section(pfn); > - rcu_read_lock(); > + rcu_read_lock_sched(); > if (!valid_section(ms)) { > - rcu_read_unlock(); > + rcu_read_unlock_sched(); > return 0; > } > /* > @@ -2023,7 +2023,7 @@ static inline int pfn_valid(unsigned long pfn) > * the entire section-sized span. 
> */ > ret = early_section(ms) || pfn_section_valid(ms, pfn); > - rcu_read_unlock(); > + rcu_read_unlock_sched(); > > return ret; > }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc..c877396 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1987,6 +1987,7 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 static inline int pfn_valid(unsigned long pfn)
 {
 	struct mem_section *ms;
+	int ret;
 
 	/*
 	 * Ensure the upper PAGE_SHIFT bits are clear in the
@@ -2000,13 +2001,19 @@ static inline int pfn_valid(unsigned long pfn)
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
 	ms = __pfn_to_section(pfn);
-	if (!valid_section(ms))
+	rcu_read_lock();
+	if (!valid_section(ms)) {
+		rcu_read_unlock();
 		return 0;
+	}
 	/*
 	 * Traditionally early sections always returned pfn_valid() for
 	 * the entire section-sized span.
 	 */
-	return early_section(ms) || pfn_section_valid(ms, pfn);
+	ret = early_section(ms) || pfn_section_valid(ms, pfn);
+	rcu_read_unlock();
+
+	return ret;
 }
 #endif
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e5..ca7dbe1 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -792,6 +792,13 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 	unsigned long section_nr = pfn_to_section_nr(pfn);
 
 	/*
+	 * Mark the section invalid so that valid_section()
+	 * return false. This prevents code from dereferencing
+	 * ms->usage array.
+	 */
+	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
+
+	/*
 	 * When removing an early section, the usage map is kept (as the
 	 * usage maps of other sections fall into the same page). It
 	 * will be re-used when re-adding the section - which is then no
@@ -799,16 +806,11 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 	 * was allocated during boot.
 	 */
 	if (!PageReserved(virt_to_page(ms->usage))) {
+		synchronize_rcu();
 		kfree(ms->usage);
 		ms->usage = NULL;
 	}
 	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-	/*
-	 * Mark the section invalid so that valid_section()
-	 * return false. This prevents code from dereferencing
-	 * ms->usage array.
-	 */
-	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
 }
 
 /*