Message ID: 20231208151036.2458921-1-guoren@kernel.org
State: New
From: guoren@kernel.org
To: paul.walmsley@sifive.com, palmer@dabbelt.com, akpm@linux-foundation.org, alexghiti@rivosinc.com, catalin.marinas@arm.com, willy@infradead.org, david@redhat.com, muchun.song@linux.dev, will@kernel.org, peterz@infradead.org, rppt@kernel.org, paulmck@kernel.org, atishp@atishpatra.org, anup@brainfault.org, alex@ghiti.fr, mike.kravetz@oracle.com, dfustini@baylibre.com, wefu@redhat.com, jszhang@kernel.org, falcon@tinylab.org
Cc: linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, Guo Ren <guoren@linux.alibaba.com>, Guo Ren <guoren@kernel.org>
Subject: [PATCH] riscv: pgtable: Enhance set_pte to prevent OoO risk
Date: Fri, 8 Dec 2023 10:10:36 -0500
Message-Id: <20231208151036.2458921-1-guoren@kernel.org>
Series: riscv: pgtable: Enhance set_pte to prevent OoO risk
Commit Message
Guo Ren
Dec. 8, 2023, 3:10 p.m. UTC
From: Guo Ren <guoren@linux.alibaba.com>

When changing from an invalid pte to a valid one for a kernel page, there is no
need for tlb_flush. That is okay for the TSO memory model, but there is an OoO
risk under the weak one, e.g.:

  sd t0, (a0)   // a0 = pte address; pteval is changed from invalid to valid
  ...
  ld t1, (a1)   // a1 = va mapped by the above pte

If the ld instruction is executed speculatively before the sd instruction, it
brings an invalid entry into the TLB, and when the ld instruction retires, a
spurious page fault occurs. Because the vmemmap region is ignored by
vmalloc_fault, the spurious page fault causes a kernel panic.

This patch was inspired by commit 7f0b1bf04511 ("arm64: Fix barriers used for
page table modifications"). For RISC-V, there is no requirement in the spec
that all TLB entries be valid, and no requirement that the PTW filter out
invalid entries. Of course, a micro-architecture could provide a more robust
design, but here a software fence is used to guarantee the ordering.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
---
 arch/riscv/include/asm/pgtable.h | 7 +++++++
 1 file changed, 7 insertions(+)
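The guarded fence in the patch can be sketched as a stand-alone C model. This is only an illustration: the flag values are placeholders (not the real riscv pte encodings), a C11 seq_cst fence stands in for the actual `fence rw,rw` instruction, and `fence_count` is a hypothetical counter added so the behavior can be observed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-ins for the kernel's pte flag bits -- placeholder
 * values, not the real riscv encodings. */
#define _PAGE_PRESENT (1UL << 0)
#define _PAGE_GLOBAL  (1UL << 5)

typedef struct { unsigned long val; } pte_t;
#define pte_val(p) ((p).val)

static int fence_count; /* hypothetical: counts emitted fences */

/* Stand-in for RISCV_FENCE(rw,rw); the real macro emits "fence rw,rw". */
static void riscv_fence_rw_rw(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	fence_count++;
}

/* Sketch of the patched set_pte(): store the pte, then order that store
 * before any subsequent load/store when the new mapping is a present or
 * global (kernel) one. */
static void set_pte(pte_t *ptep, pte_t pteval)
{
	*ptep = pteval;

	if (pte_val(pteval) & (_PAGE_PRESENT | _PAGE_GLOBAL))
		riscv_fence_rw_rw();
}
```

In this toy harness, installing a present/global pte emits the fence, while clearing a pte does not; clearing is instead covered by the normal TLB maintenance paths, as the patch's comment says.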
Comments
On Fri, Dec 8, 2023 at 11:10 PM <guoren@kernel.org> wrote:
>
> From: Guo Ren <guoren@linux.alibaba.com>
>
> When changing from an invalid pte to a valid one for a kernel page,
> there is no need for tlb_flush. It's okay for the TSO memory model, but
> there is an OoO risk for the Weak one. eg:

Sorry, TSO has no immunity here. The above sentence should be rewritten as:

When changing from an invalid pte to a valid one for a kernel page, there is
no guarantee of a tlb_flush in Linux. So there is an OoO risk for riscv, e.g.:

> sd t0, (a0) // a0 = pte address, pteval is changed from invalid to valid
> ...
> ld t1, (a1) // a1 = va of above pte
>
> [... remainder of the commit message quoted unchanged; snipped ...]
>
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> ---
>  arch/riscv/include/asm/pgtable.h | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 294044429e8e..2fae5a5438e0 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -511,6 +511,13 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
>  static inline void set_pte(pte_t *ptep, pte_t pteval)
>  {
>         *ptep = pteval;
> +
> +       /*
> +        * Only if the new pte is present and kernel, otherwise TLB
> +        * maintenance or update_mmu_cache() have the necessary barriers.
> +        */
> +       if (pte_val(pteval) & (_PAGE_PRESENT | _PAGE_GLOBAL))
> +               RISCV_FENCE(rw,rw);
>  }
>
>  void flush_icache_pte(pte_t pte);
> --
> 2.40.1
>
Hi Guo,

On Fri, Dec 8, 2023 at 4:10 PM <guoren@kernel.org> wrote:
>
> From: Guo Ren <guoren@linux.alibaba.com>
>
> [... commit message and diff context snipped ...]
>
> +       /*
> +        * Only if the new pte is present and kernel, otherwise TLB
> +        * maintenance or update_mmu_cache() have the necessary barriers.
> +        */
> +       if (pte_val(pteval) & (_PAGE_PRESENT | _PAGE_GLOBAL))
> +               RISCV_FENCE(rw,rw);

Only a sfence.vma can guarantee that the PTW actually sees a new mapping; a
fence is not enough. That being said, new kernel mappings (vmalloc ones) are
correctly handled in the kernel by using flush_cache_vmap(). Did you observe
something that this patch fixes?

Thanks,

Alex

>  }
>
>  void flush_icache_pte(pte_t pte);
> --
> 2.40.1
>
On Mon, Dec 11, 2023 at 1:52 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hi Guo,
>
> [... quoted commit message and patch snipped ...]
>
> Only a sfence.vma can guarantee that the PTW actually sees a new
> mapping, a fence is not enough. That being said, new kernel mappings
> (vmalloc ones) are correctly handled in the kernel by using
> flush_cache_vmap(). Did you observe something that this patch fixes?

Thx for the reply!

The sfence.vma is too expensive, so the situation is tricky. See the arm64
commit 7f0b1bf04511 ("arm64: Fix barriers used for page table modifications"),
which is similar. That is, Linux assumes an invalid pte won't get into the
TLB. Think about memory hotplug:

mm/sparse.c: sparse_add_section() {
	...
	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
	if (IS_ERR(memmap))
		return PTR_ERR(memmap);

	/*
	 * Poison uninitialized struct pages in order to catch invalid flags
	 * combinations.
	 */
	page_init_poison(memmap, sizeof(struct page) * nr_pages);
	...
}

section_activate() would use set_pte to set up the vmemmap, and
page_init_poison() would access those pages' struct pages. That means:

sd t0, (a0)     // a0 = struct page's pte address, pteval is changed from
                // invalid to valid
...
lw/sw t1, (a1)  // a1 = va of struct page

If the lw/sw instruction is executed speculatively before the set_pte, we need
a fence to prevent this.

> Thanks,
>
> Alex
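The hazard Guo describes can be modeled deterministically in plain C. Everything here is a toy: a hypothetical one-entry "TLB", an explicit "speculation point" where the CPU samples a translation early, and a fence modeled simply as discarding that early sample. None of this is kernel API.

```c
#include <stdbool.h>

/* Toy model: one pte in memory, plus an access whose translation may have
 * been sampled speculatively, before the pte store became visible. */
static bool pte_valid;      /* the pte in memory                        */
static bool spec_sample;    /* translation sampled at speculation time  */
static bool spec_in_flight; /* an early (speculative) sample is pending */

/* The CPU starts the ld/lw/sw early and samples the (still invalid) pte. */
static void speculate_access(void)
{
	spec_sample = pte_valid;
	spec_in_flight = true;
}

/* set_pte without ordering: the early sample survives. */
static void set_pte_no_fence(void)
{
	pte_valid = true;
}

/* set_pte followed by a fence: modeled as discarding the early sample,
 * forcing the access to re-translate against the updated pte. */
static void set_pte_with_fence(void)
{
	pte_valid = true;
	spec_in_flight = false;
}

/* Returns true if the access succeeds, false for a spurious fault. */
static bool access_va(void)
{
	bool xlate = spec_in_flight ? spec_sample : pte_valid;

	spec_in_flight = false;
	return xlate;
}
```

Without the fence, the access retires using the stale invalid translation and traps; with the fence, it observes the new mapping. This is the "uarch does not cache invalid entries, but reorders the access" case, not the stronger caching case discussed below.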
On Mon, Dec 11, 2023 at 9:41 AM Guo Ren <guoren@kernel.org> wrote:
>
> [... quoted patch and earlier discussion snipped ...]
>
> The sfence.vma is too expensive, so the situation is tricky. See the
> arm64 commit: 7f0b1bf04511 ("arm64: Fix barriers used for page table
> modifications"), which is similar. That is, linux assumes invalid pte
> won't get into TLB. Think about memory hotplug:
>
> [... sparse_add_section() example snipped ...]
>
> The section_activate would use set_pte to setup vmemmap, and
> page_init_poison would access these pages' struct.

So I think the generic code must be fixed by adding a flush_cache_vmap() in
vmemmap_populate_range() or similar: several architectures implement
flush_cache_vmap() because they need to do "something" after a new mapping is
established, so vmemmap should not be any different.

> That means:
> sd t0, (a0) // a0 = struct page's pte address, pteval is changed from
> invalid to valid
> ...
> lw/sw t1, (a1) // a1 = va of struct page
>
> If the lw/sw instruction is executed speculatively before the set_pte,
> we need a fence to prevent this.

Yes I agree, but to me we need the fence property of sfence.vma to make sure
the PTW sees the new pte, unless I'm mistaken and something in the privileged
specification states that a fence is enough?

> --
> Best Regards
> Guo Ren
On Mon, Dec 11, 2023 at 5:04 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> [... earlier discussion snipped ...]
>
> So I think the generic code must be fixed by adding a
> flush_cache_vmap() in vmemmap_populate_range() or similar: several
> architectures implement flush_cache_vmap() because they need to do
> "something" after a new mapping is established, so vmemmap should not
> be any different.

Perhaps the generic code assumes the TLB won't contain invalid entries. On
invalid -> valid, Linux won't do any tlb_flush; ref:

 * Use set_p*_safe(), and elide TLB flushing, when confident that *no*
 * TLB flush will be required as a result of the "set". For example, use
 * in scenarios where it is known ahead of time that the routine is
 * setting non-present entries, or re-setting an existing entry to the
 * same value. Otherwise, use the typical "set" helpers and flush the
 * TLB.

> Yes I agree, but to me we need the fence property of sfence.vma to
> make sure the PTW sees the new pte, unless I'm mistaken and something
> in the privileged specification states that a fence is enough?

All PTWs are triggered by the IFU and the load/store unit. For the "set"
scenarios, we just need to prevent accesses to the va before the set_pte. So:
 - Don't worry about the IFU, which fetches code sequentially.
 - Use a fence to prevent load/store from being reordered before set_pte.

Sfence.vma is for invalidating the TLB, not for invalid -> valid.
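The set_p*_safe() contract quoted above can be sketched in a few lines. This is a user-space approximation: the real kernel helpers use VM_WARN_ON rather than assert, and the flag value below is a placeholder bit, not the real pte encoding.

```c
#include <assert.h>

#define _PAGE_PRESENT (1UL << 0) /* placeholder bit, not the real encoding */

typedef struct { unsigned long val; } pte_t;
#define pte_val(p) ((p).val)

/* Sketch: the "safe" setter may elide the TLB flush precisely because it
 * only ever installs over a non-present entry, or re-installs the same
 * value -- the two cases the quoted comment allows. */
static void set_pte_safe(pte_t *ptep, pte_t pteval)
{
	assert(!(pte_val(*ptep) & _PAGE_PRESENT) ||
	       pte_val(*ptep) == pte_val(pteval));
	*ptep = pteval;
}
```

The invalid -> valid transition in the hotplug example is exactly the first allowed case, which is why no tlb_flush follows it, and why the ordering of the pte store against later accesses becomes the only protection.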
On 2023-12-11 5:36 AM, Guo Ren wrote:
> On Mon, Dec 11, 2023 at 5:04 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>>
>> [... earlier discussion snipped ...]
>>
>> Yes I agree, but to me we need the fence property of sfence.vma to
>> make sure the PTW sees the new pte, unless I'm mistaken and something
>> in the privileged specification states that a fence is enough?
> All PTW are triggered by IFU & load/store. For the "set" scenarios, we
> just need to prevent the access va before the set_pte. So:
> - Don't worry about IFU, which fetches the code sequentially.
> - Use a fence prevent load/store before set_pte.
>
> Sfence.vma is used for invalidate TLB, not for invalid -> valid.

I think the problem is that, architecturally, you can't prevent a PTW by
preventing access to the virtual address. The RISC-V privileged spec allows
caching the results of PTWs from speculative execution, and it allows caching
invalid PTEs. So effectively, as soon as satp is written, software must be
able to handle _any_ virtual address being in the TLB.

To avoid the sfence.vma in the invalid->valid case, you need to handle the
possible page fault, like in Alex's series here:

https://lore.kernel.org/linux-riscv/20231207150348.82096-1-alexghiti@rivosinc.com/

Regards,
Samuel
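The retry approach Samuel points to can be sketched with a toy fault handler. The helper `lookup_pte()`, the single tracked mapping, and the return codes are all hypothetical stand-ins; the real series hooks the kernel's actual page fault path.

```c
#include <stddef.h>

#define _PAGE_PRESENT (1UL << 0) /* placeholder bit */

typedef struct { unsigned long val; } pte_t;
#define pte_val(p) ((p).val)

enum fault_result { RETRY_ACCESS, REAL_FAULT };

/* Toy page table: a single tracked kernel mapping. */
static pte_t vmalloc_pte;

static pte_t *lookup_pte(unsigned long va)
{
	(void)va; /* toy: every va resolves to the one tracked pte */
	return &vmalloc_pte;
}

/* Sketch: if the faulting kernel va now has a valid pte, the fault came
 * from a stale or reordered translation -- just retry the access instead
 * of treating it as fatal, so no sfence.vma is needed on the
 * invalid -> valid transition. */
static enum fault_result handle_kernel_fault(unsigned long va)
{
	pte_t *ptep = lookup_pte(va);

	if (ptep && (pte_val(*ptep) & _PAGE_PRESENT))
		return RETRY_ACCESS; /* spurious: mapping is valid now */
	return REAL_FAULT;
}
```

The trade-off under discussion is then between this retry path (a rare fault, no fence on every set_pte) and Guo's fence (no fault handling, but an extra fence on every kernel pte installation).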
On Mon, Dec 11, 2023 at 11:27 PM Samuel Holland <samuel.holland@sifive.com> wrote: > > On 2023-12-11 5:36 AM, Guo Ren wrote: > > On Mon, Dec 11, 2023 at 5:04 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote: > >> > >> On Mon, Dec 11, 2023 at 9:41 AM Guo Ren <guoren@kernel.org> wrote: > >>> > >>> On Mon, Dec 11, 2023 at 1:52 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote: > >>>> > >>>> Hi Guo, > >>>> > >>>> On Fri, Dec 8, 2023 at 4:10 PM <guoren@kernel.org> wrote: > >>>>> > >>>>> From: Guo Ren <guoren@linux.alibaba.com> > >>>>> > >>>>> When changing from an invalid pte to a valid one for a kernel page, > >>>>> there is no need for tlb_flush. It's okay for the TSO memory model, but > >>>>> there is an OoO risk for the Weak one. eg: > >>>>> > >>>>> sd t0, (a0) // a0 = pte address, pteval is changed from invalid to valid > >>>>> ... > >>>>> ld t1, (a1) // a1 = va of above pte > >>>>> > >>>>> If the ld instruction is executed speculatively before the sd > >>>>> instruction. Then it would bring an invalid entry into the TLB, and when > >>>>> the ld instruction retired, a spurious page fault occurred. Because the > >>>>> vmemmap has been ignored by vmalloc_fault, the spurious page fault would > >>>>> cause kernel panic. > >>>>> > >>>>> This patch was inspired by the commit: 7f0b1bf04511 ("arm64: Fix barriers > >>>>> used for page table modifications"). For RISC-V, there is no requirement > >>>>> in the spec to guarantee all tlb entries are valid and no requirement to > >>>>> PTW filter out invalid entries. Of course, micro-arch could give a more > >>>>> robust design, but here, use a software fence to guarantee. 
> >>>>> > >>>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > >>>>> Signed-off-by: Guo Ren <guoren@kernel.org> > >>>>> --- > >>>>> arch/riscv/include/asm/pgtable.h | 7 +++++++ > >>>>> 1 file changed, 7 insertions(+) > >>>>> > >>>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h > >>>>> index 294044429e8e..2fae5a5438e0 100644 > >>>>> --- a/arch/riscv/include/asm/pgtable.h > >>>>> +++ b/arch/riscv/include/asm/pgtable.h > >>>>> @@ -511,6 +511,13 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b) > >>>>> static inline void set_pte(pte_t *ptep, pte_t pteval) > >>>>> { > >>>>> *ptep = pteval; > >>>>> + > >>>>> + /* > >>>>> + * Only if the new pte is present and kernel, otherwise TLB > >>>>> + * maintenance or update_mmu_cache() have the necessary barriers. > >>>>> + */ > >>>>> + if (pte_val(pteval) & (_PAGE_PRESENT | _PAGE_GLOBAL)) > >>>>> + RISCV_FENCE(rw,rw); > >>>> > >>>> Only a sfence.vma can guarantee that the PTW actually sees a new > >>>> mapping, a fence is not enough. That being said, new kernel mappings > >>>> (vmalloc ones) are correctly handled in the kernel by using > >>>> flush_cache_vmap(). Did you observe something that this patch fixes? > >>> Thx for the reply! > >>> > >>> The sfence.vma is too expensive, so the situation is tricky. See the > >>> arm64 commit: 7f0b1bf04511 ("arm64: Fix barriers used for page table > >>> modifications"), which is similar. That is, linux assumes invalid pte > >>> won't get into TLB. Think about memory hotplug: > >>> > >>> mm/sparse.c: sparse_add_section() { > >>> ... > >>> memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap); > >>> if (IS_ERR(memmap)) > >>> return PTR_ERR(memmap); > >>> > >>> /* > >>> * Poison uninitialized struct pages in order to catch invalid flags > >>> * combinations. > >>> */ > >>> page_init_poison(memmap, sizeof(struct page) * nr_pages); > >>> ... 
> >>> }
> >>>
> >>> section_activate() would use set_pte() to set up the vmemmap, and
> >>> page_init_poison() would access those pages' struct page.
> >>
> >> So I think the generic code must be fixed by adding a
> >> flush_cache_vmap() in vmemmap_populate_range() or similar: several
> >> architectures implement flush_cache_vmap() because they need to do
> >> "something" after a new mapping is established, so vmemmap should not
> >> be any different.
> > Perhaps the generic code assumes the TLB won't contain invalid
> > entries. When going invalid -> valid, Linux won't do any tlb_flush;
> > ref:
> >
> >  * Use set_p*_safe(), and elide TLB flushing, when confident that *no*
> >  * TLB flush will be required as a result of the "set". For example, use
> >  * in scenarios where it is known ahead of time that the routine is
> >  * setting non-present entries, or re-setting an existing entry to the
> >  * same value. Otherwise, use the typical "set" helpers and flush the
> >  * TLB.
> >
> >>
> >>> That means:
> >>>   sd t0, (a0)     // a0 = struct page's pte address, pteval changes
> >>>                   // from invalid to valid
> >>>   ...
> >>>   lw/sw t1, (a1)  // a1 = va of the struct page
> >>>
> >>> If the lw/sw instruction is executed speculatively before the
> >>> set_pte, we need a fence to prevent this.
> >>
> >> Yes I agree, but to me we need the fence property of sfence.vma to
> >> make sure the PTW sees the new pte, unless I'm mistaken and something
> >> in the privileged specification states that a fence is enough?
> > All PTWs are triggered by the IFU & load/store units. For the "set"
> > scenarios, we just need to prevent accesses to the va before the
> > set_pte. So:
> >  - Don't worry about the IFU, which fetches code sequentially.
> >  - Use a fence to prevent load/store reordering before set_pte.
> >
> > Sfence.vma is used to invalidate the TLB, not for invalid -> valid.
>
> I think the problem is that, architecturally, you can't prevent a PTW
> by preventing access to the virtual address.
> The RISC-V privileged spec allows caching the results of PTWs from
> speculative execution, and it allows caching invalid PTEs. So
> effectively, as soon as satp is written, software must be able to
> handle _any_ virtual address being in the TLB.
>
> To avoid the sfence.vma in the invalid->valid case, you need to handle
> the possible page fault, like in Alex's series here:
>
> https://lore.kernel.org/linux-riscv/20231207150348.82096-1-alexghiti@rivosinc.com/

Just as that patch series says:

+ * The RISC-V kernel does not eagerly emit a sfence.vma after each
+ * new vmalloc mapping, which may result in exceptions:
+ * - if the uarch caches invalid entries, the new mapping would not be
+ *   observed by the page table walker and an invalidation is needed.
+ * - if the uarch does not cache invalid entries, a reordered access
+ *   could "miss" the new mapping and traps: in that case, we only need
+ *   to retry the access, no sfence.vma is required.

I'm talking about the second case: "the uarch does not cache invalid
entries, a reordered access could 'miss' the new mapping and trap".
Using a fence in set_pte is another solution, and better than retrying.
Of course, the premise is that the fence is not expensive for the
micro-arch.

 - Arm64 used this approach: commit 7f0b1bf04511 ("arm64: Fix barriers
   used for page table modifications").
 - X86 is similar: because it's TSO, any load + store instructions
   between set_pte & the next access would give a barrier, e.g.:
       set_pte va
       load  (acquire)
       store (release)
       load/store va

Another topic is "retry the access": this is about spurious page faults
in the kernel virtual address space, right? Alex's series prevents that
in the RISC-V Linux kernel, but it would cause a lot of side effects.

>
> Regards,
> Samuel
>
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 294044429e8e..2fae5a5438e0 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -511,6 +511,13 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
 static inline void set_pte(pte_t *ptep, pte_t pteval)
 {
        *ptep = pteval;
+
+       /*
+        * Only if the new pte is present and kernel, otherwise TLB
+        * maintenance or update_mmu_cache() have the necessary barriers.
+        */
+       if (pte_val(pteval) & (_PAGE_PRESENT | _PAGE_GLOBAL))
+               RISCV_FENCE(rw,rw);
 }

 void flush_icache_pte(pte_t pte);