Message ID | 20221207203034.650899-9-peterx@redhat.com |
---|---|
State | New |
Headers |
From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Muchun Song <songmuchun@bytedance.com>, John Hubbard <jhubbard@nvidia.com>, Andrea Arcangeli <aarcange@redhat.com>, James Houghton <jthoughton@google.com>, Jann Horn <jannh@google.com>, Rik van Riel <riel@surriel.com>, Miaohe Lin <linmiaohe@huawei.com>, Andrew Morton <akpm@linux-foundation.org>, Mike Kravetz <mike.kravetz@oracle.com>, peterx@redhat.com, David Hildenbrand <david@redhat.com>, Nadav Amit <nadav.amit@gmail.com>
Subject: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
Date: Wed, 7 Dec 2022 15:30:32 -0500
Message-Id: <20221207203034.650899-9-peterx@redhat.com>
In-Reply-To: <20221207203034.650899-1-peterx@redhat.com>
References: <20221207203034.650899-1-peterx@redhat.com>
Series | [v2,01/10] mm/hugetlb: Let vma_offset_start() to return start |
Commit Message
Peter Xu
Dec. 7, 2022, 8:30 p.m. UTC
Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/mm/gmap.c      |  2 ++
 fs/proc/task_mmu.c       |  2 ++
 include/linux/pagewalk.h | 11 ++++++++++-
 mm/hmm.c                 | 15 ++++++++++++++-
 mm/pagewalk.c            |  2 ++
 5 files changed, 30 insertions(+), 2 deletions(-)
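For orientation, the locking pattern the series introduces can be condensed as below. This is a simplified sketch based on the mm/pagewalk.c hunk at the bottom of this page, not the exact upstream code:

```c
/*
 * Simplified sketch of walk_hugetlb_range() with this patch applied
 * (condensed from the mm/pagewalk.c hunk below).  The vma lock is held
 * across the walk so a concurrent pmd unshare cannot free the pgtable
 * page while the hugetlb_entry hook dereferences the pte.
 */
static int walk_hugetlb_range(unsigned long addr, unsigned long end,
			      struct mm_walk *walk)
{
	struct vm_area_struct *vma = walk->vma;
	struct hstate *h = hstate_vma(vma);
	unsigned long sz = huge_page_size(h);
	unsigned long hmask = huge_page_mask(h);
	const struct mm_walk_ops *ops = walk->ops;
	unsigned long next;
	pte_t *pte;
	int err = 0;

	hugetlb_vma_lock_read(vma);	/* new: protects the pgtable pages */
	do {
		next = hugetlb_entry_end(h, addr, end);
		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
		if (pte)
			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
		if (err)
			break;
	} while (addr = next, addr != end);
	hugetlb_vma_unlock_read(vma);	/* new */

	return err;
}
```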
Comments
On 12/7/22 12:30, Peter Xu wrote:
> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/s390/mm/gmap.c      |  2 ++
>  fs/proc/task_mmu.c       |  2 ++
>  include/linux/pagewalk.h | 11 ++++++++++-
>  mm/hmm.c                 | 15 ++++++++++++++-
>  mm/pagewalk.c            |  2 ++
>  5 files changed, 30 insertions(+), 2 deletions(-)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
On 07.12.22 21:30, Peter Xu wrote:
> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/s390/mm/gmap.c      |  2 ++
>  fs/proc/task_mmu.c       |  2 ++
>  include/linux/pagewalk.h | 11 ++++++++++-
>  mm/hmm.c                 | 15 ++++++++++++++-
>  mm/pagewalk.c            |  2 ++
>  5 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 8947451ae021..292a54c490d4 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
>  	end = start + HPAGE_SIZE - 1;
>  	__storage_key_init_range(start, end);
>  	set_bit(PG_arch_1, &page->flags);
> +	hugetlb_vma_unlock_read(walk->vma);
>  	cond_resched();
> +	hugetlb_vma_lock_read(walk->vma);
>  	return 0;
>  }
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..cf3887fb2905 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
>  		frame++;
>  	}
>
> +	hugetlb_vma_unlock_read(walk->vma);
>  	cond_resched();
> +	hugetlb_vma_lock_read(walk->vma);

We already hold the mmap_lock and reschedule. Even without the
cond_resched() we might happily reschedule on a preemptive kernel. So I'm
not sure if this optimization is strictly required or even helpful in
practice here.

In the worst case, concurrent unsharing would have to wait. For example,
s390_enable_skey() is called at most once for a VM, for most VMs it gets
never even called.

Or am I missing something important?
On Thu, Dec 08, 2022 at 02:14:42PM +0100, David Hildenbrand wrote:
> On 07.12.22 21:30, Peter Xu wrote:
> > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> > to make sure the pgtable page will not be freed concurrently.
> >
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/s390/mm/gmap.c      |  2 ++
> >  fs/proc/task_mmu.c       |  2 ++
> >  include/linux/pagewalk.h | 11 ++++++++++-
> >  mm/hmm.c                 | 15 ++++++++++++++-
> >  mm/pagewalk.c            |  2 ++
> >  5 files changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > index 8947451ae021..292a54c490d4 100644
> > --- a/arch/s390/mm/gmap.c
> > +++ b/arch/s390/mm/gmap.c
> > @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
> >  	end = start + HPAGE_SIZE - 1;
> >  	__storage_key_init_range(start, end);
> >  	set_bit(PG_arch_1, &page->flags);
> > +	hugetlb_vma_unlock_read(walk->vma);
> >  	cond_resched();
> > +	hugetlb_vma_lock_read(walk->vma);
> >  	return 0;
> >  }
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index e35a0398db63..cf3887fb2905 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
> >  		frame++;
> >  	}
> >
> > +	hugetlb_vma_unlock_read(walk->vma);
> >  	cond_resched();
> > +	hugetlb_vma_lock_read(walk->vma);
>
> We already hold the mmap_lock and reschedule. Even without the
> cond_resched() we might happily reschedule on a preemptive kernel. So I'm
> not sure if this optimization is strictly required or even helpful in
> practice here.

It's just low hanging fruit if we need that complexity anyway.

That's also why I didn't do that for v1 (where I missed hmm special case,
though..), but I think since we'll need that anyway, we'd better release
the vma lock if we can easily do so.

mmap_lock is just more special because it needs more work in the caller to
release (e.g. vma invalidations). Otherwise I'm happy dropping that too.

>
> In the worst case, concurrent unsharing would have to wait.
> For example, s390_enable_skey() is called at most once for a VM, for most
> VMs it gets never even called.
>
> Or am I missing something important?

Nothing important. I just don't see why we need to strictly follow the
same release rule of mmap_lock here when talking about vma lock.

In short - if we can drop a lock earlier before sleep, why not?

I tend to just keep it as-is, but let me know if you have further thoughts
or concerns.
On Thu, Dec 08, 2022 at 03:47:26PM -0500, Peter Xu wrote:
> On Thu, Dec 08, 2022 at 02:14:42PM +0100, David Hildenbrand wrote:
> > On 07.12.22 21:30, Peter Xu wrote:
> > > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> > > to make sure the pgtable page will not be freed concurrently.
> > >
> > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  arch/s390/mm/gmap.c      |  2 ++
> > >  fs/proc/task_mmu.c       |  2 ++
> > >  include/linux/pagewalk.h | 11 ++++++++++-
> > >  mm/hmm.c                 | 15 ++++++++++++++-
> > >  mm/pagewalk.c            |  2 ++
> > >  5 files changed, 30 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > > index 8947451ae021..292a54c490d4 100644
> > > --- a/arch/s390/mm/gmap.c
> > > +++ b/arch/s390/mm/gmap.c
> > > @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
> > >  	end = start + HPAGE_SIZE - 1;
> > >  	__storage_key_init_range(start, end);
> > >  	set_bit(PG_arch_1, &page->flags);
> > > +	hugetlb_vma_unlock_read(walk->vma);
> > >  	cond_resched();
> > > +	hugetlb_vma_lock_read(walk->vma);
> > >  	return 0;
> > >  }
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index e35a0398db63..cf3887fb2905 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
> > >  		frame++;
> > >  	}
> > >
> > > +	hugetlb_vma_unlock_read(walk->vma);
> > >  	cond_resched();
> > > +	hugetlb_vma_lock_read(walk->vma);
> >
> > We already hold the mmap_lock and reschedule. Even without the
> > cond_resched() we might happily reschedule on a preemptive kernel. So I'm
> > not sure if this optimization is strictly required or even helpful in
> > practice here.
>
> It's just low hanging fruit if we need that complexity anyway.
>
> That's also why I didn't do that for v1 (where I missed hmm special case,
> though..), but I think since we'll need that anyway, we'd better release
> the vma lock if we can easily do so.
>
> mmap_lock is just more special because it needs more work in the caller to
> release (e.g. vma invalidations). Otherwise I'm happy dropping that too.
>
> >
> > In the worst case, concurrent unsharing would have to wait.
> > For example, s390_enable_skey() is called at most once for a VM, for most
> > VMs it gets never even called.
> >
> > Or am I missing something important?
>
> Nothing important. I just don't see why we need to strictly follow the
> same release rule of mmap_lock here when talking about vma lock.
>
> In short - if we can drop a lock earlier before sleep, why not?
>
> I tend to just keep it as-is, but let me know if you have further thoughts
> or concerns.

One thing I can do better here is:

-	cond_resched();
+
+	if (need_resched()) {
+		hugetlb_vma_unlock_read(walk->vma);
+		cond_resched();
+		hugetlb_vma_lock_read(walk->vma);
+	}
+

It's a pity we don't have rwsem version of cond_resched_rwlock_read(), or
it'll be even cleaner.
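For illustration only, the rwsem-flavoured helper wished for above might look roughly like the sketch below. The name hugetlb_vma_cond_resched() is hypothetical; no such API exists in the kernel:

```c
/*
 * Hypothetical helper (not an existing kernel API) mirroring what
 * cond_resched_rwlock_read() does for rwlocks: only pay for the
 * unlock/relock of the hugetlb vma lock when a reschedule is actually
 * pending.
 */
static inline void hugetlb_vma_cond_resched(struct vm_area_struct *vma)
{
	if (need_resched()) {
		hugetlb_vma_unlock_read(vma);
		cond_resched();
		hugetlb_vma_lock_read(vma);
	}
}
```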
On 08.12.22 21:47, Peter Xu wrote:
> On Thu, Dec 08, 2022 at 02:14:42PM +0100, David Hildenbrand wrote:
>> On 07.12.22 21:30, Peter Xu wrote:
>>> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
>>> to make sure the pgtable page will not be freed concurrently.
>>>
>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>  arch/s390/mm/gmap.c      |  2 ++
>>>  fs/proc/task_mmu.c       |  2 ++
>>>  include/linux/pagewalk.h | 11 ++++++++++-
>>>  mm/hmm.c                 | 15 ++++++++++++++-
>>>  mm/pagewalk.c            |  2 ++
>>>  5 files changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
>>> index 8947451ae021..292a54c490d4 100644
>>> --- a/arch/s390/mm/gmap.c
>>> +++ b/arch/s390/mm/gmap.c
>>> @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
>>>  	end = start + HPAGE_SIZE - 1;
>>>  	__storage_key_init_range(start, end);
>>>  	set_bit(PG_arch_1, &page->flags);
>>> +	hugetlb_vma_unlock_read(walk->vma);
>>>  	cond_resched();
>>> +	hugetlb_vma_lock_read(walk->vma);
>>>  	return 0;
>>>  }
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index e35a0398db63..cf3887fb2905 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
>>>  		frame++;
>>>  	}
>>>
>>> +	hugetlb_vma_unlock_read(walk->vma);
>>>  	cond_resched();
>>> +	hugetlb_vma_lock_read(walk->vma);
>>
>> We already hold the mmap_lock and reschedule. Even without the
>> cond_resched() we might happily reschedule on a preemptive kernel. So I'm
>> not sure if this optimization is strictly required or even helpful in
>> practice here.
>
> It's just low hanging fruit if we need that complexity anyway.
>
> That's also why I didn't do that for v1 (where I missed hmm special case,
> though..), but I think since we'll need that anyway, we'd better release
> the vma lock if we can easily do so.
>
> mmap_lock is just more special because it needs more work in the caller to
> release (e.g. vma invalidations). Otherwise I'm happy dropping that too.
>
>>
>> In the worst case, concurrent unsharing would have to wait.
>> For example, s390_enable_skey() is called at most once for a VM, for most
>> VMs it gets never even called.
>>
>> Or am I missing something important?
>
> Nothing important. I just don't see why we need to strictly follow the
> same release rule of mmap_lock here when talking about vma lock.
>
> In short - if we can drop a lock earlier before sleep, why not?
>
> I tend to just keep it as-is, but let me know if you have further thoughts
> or concerns.

To me this looks like a possibly unnecessary optimization, requiring
additional code. IMHO, possibly unnecessary unlock+relock makes that code
harder to get. For such cases, it would be good to have any evidence that
it really helps.
On Fri, Dec 09, 2022 at 11:24:55AM +0100, David Hildenbrand wrote:
> For such cases, it would be good to have any evidence that it really helps.
I don't know much on the s390 path, but if a process has a large hugetlb
vma, even MADV_DONTNEED will be blocked for whatever long time if there's
another process or thread scanning pagemap for this vma.
Would this justify a bit?
It's just that the vma lock is taken for write far more often than the
mmap_lock is, I think, and what we need here is only to release the lock
and retake it, nothing else. It doesn't make things overcomplicated, IMO.
No strong opinion from my side; as I said, to me it's really low hanging
fruit. If you still think it isn't justified, and Mike doesn't have a
preference either, I can just drop it for later.
On 09.12.22 15:39, Peter Xu wrote:
> On Fri, Dec 09, 2022 at 11:24:55AM +0100, David Hildenbrand wrote:
>> For such cases, it would be good to have any evidence that it really helps.
>
> I don't know much on the s390 path, but if a process has a large hugetlb
> vma, even MADV_DONTNEED will be blocked for whatever long time if there's
> another process or thread scanning pagemap for this vma.
>
> Would this justify a bit?

I get your point. But that raises the question if we should voluntarily drop
the VMA lock already in the caller every now and then on such large VMAs and
maybe move even the cond_resched() into the common page walker, if you get
what I mean?

On a preemptible kernel you could reschedule just before you drop the lock
and call cond_resched() ... hmm

No strong opinion here, it just looked a bit weird to optimize for a
cond_resched() if we might just reschedule either way even without the
cond_resched().
On Fri, Dec 09, 2022 at 04:18:11PM +0100, David Hildenbrand wrote:
> On 09.12.22 15:39, Peter Xu wrote:
> > On Fri, Dec 09, 2022 at 11:24:55AM +0100, David Hildenbrand wrote:
> > > For such cases, it would be good to have any evidence that it really helps.
> >
> > I don't know much on the s390 path, but if a process has a large hugetlb
> > vma, even MADV_DONTNEED will be blocked for whatever long time if there's
> > another process or thread scanning pagemap for this vma.
> >
> > Would this justify a bit?
>
> I get your point. But that raises the question if we should voluntarily drop
> the VMA lock already in the caller every now and then on such large VMAs and
> maybe move even the cond_resched() into the common page walker, if you get
> what I mean?
>
> On a preemptible kernel you could reschedule just before you drop the lock
> and call cond_resched() ... hmm

On full preempt the old cond_resched() doesn't work anyway. IIUC most
distros now should be using dynamic preempt, which makes voluntary mode
the default, so this change should help there.

> No strong opinion here, it just looked a bit weird to optimize for a
> cond_resched() if we might just reschedule either way even without the
> cond_resched().

It's really not the core of the patchset. Let me drop it for now.
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 8947451ae021..292a54c490d4 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
 	end = start + HPAGE_SIZE - 1;
 	__storage_key_init_range(start, end);
 	set_bit(PG_arch_1, &page->flags);
+	hugetlb_vma_unlock_read(walk->vma);
 	cond_resched();
+	hugetlb_vma_lock_read(walk->vma);
 	return 0;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..cf3887fb2905 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 		frame++;
 	}
 
+	hugetlb_vma_unlock_read(walk->vma);
 	cond_resched();
+	hugetlb_vma_lock_read(walk->vma);
 
 	return err;
 }
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 959f52e5867d..27a6df448ee5 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -21,7 +21,16 @@ struct mm_walk;
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
  *			are skipped.
- * @hugetlb_entry:	if set, called for each hugetlb entry
+ * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
+ *			function is called with the vma lock held, in order to
+ *			protect against a concurrent freeing of the pte_t* or
+ *			the ptl. In some cases, the hook function needs to drop
+ *			and retake the vma lock in order to avoid deadlocks
+ *			while calling other functions. In such cases the hook
+ *			function must either refrain from accessing the pte or
+ *			ptl after dropping the vma lock, or else revalidate
+ *			those items after re-acquiring the vma lock and before
+ *			accessing them.
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
diff --git a/mm/hmm.c b/mm/hmm.c
index 3850fb625dda..796de6866089 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -493,8 +493,21 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	required_fault =
 		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
 	if (required_fault) {
+		int ret;
+
 		spin_unlock(ptl);
-		return hmm_vma_fault(addr, end, required_fault, walk);
+		hugetlb_vma_unlock_read(vma);
+		/*
+		 * Avoid deadlock: drop the vma lock before calling
+		 * hmm_vma_fault(), which will itself potentially take and
+		 * drop the vma lock. This is also correct from a
+		 * protection point of view, because there is no further
+		 * use here of either pte or ptl after dropping the vma
+		 * lock.
+		 */
+		ret = hmm_vma_fault(addr, end, required_fault, walk);
+		hugetlb_vma_lock_read(vma);
+		return ret;
 	}
 
 	pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 7f1c9b274906..d98564a7be57 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 
+	hugetlb_vma_lock_read(vma);
 	do {
 		next = hugetlb_entry_end(h, addr, end);
 		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
@@ -314,6 +315,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 		if (err)
 			break;
 	} while (addr = next, addr != end);
+	hugetlb_vma_unlock_read(vma);
 
 	return err;
 }
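As a usage note, a hugetlb_entry hook that needs to sleep would follow the rule documented in the pagewalk.h hunk above. The sketch below is hypothetical (my_hugetlb_entry(), needs_to_block() and do_blocking_work() are invented names) and simply mirrors the shape of the mm/hmm.c hunk:

```c
/*
 * Hypothetical hugetlb_entry hook: drop the ptl and the vma lock before
 * sleeping, and do not touch pte/ptl again afterwards without
 * revalidating them.  Names marked above are illustrative only.
 */
static int my_hugetlb_entry(pte_t *pte, unsigned long hmask,
			    unsigned long addr, unsigned long end,
			    struct mm_walk *walk)
{
	spinlock_t *ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
	pte_t entry = huge_ptep_get(pte);
	int ret = 0;

	if (needs_to_block(entry)) {
		spin_unlock(ptl);
		hugetlb_vma_unlock_read(walk->vma);
		ret = do_blocking_work(addr, end, walk);	/* may sleep */
		hugetlb_vma_lock_read(walk->vma);
		/* pte/ptl are stale here: return without touching them. */
		return ret;
	}

	/* ... non-blocking handling of @entry under ptl ... */
	spin_unlock(ptl);
	return ret;
}
```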