Message ID | 20230216051750.3125598-27-surenb@google.com |
---|---|
State | New |
Headers |
Date: Wed, 15 Feb 2023 21:17:41 -0800
In-Reply-To: <20230216051750.3125598-1-surenb@google.com>
References: <20230216051750.3125598-1-surenb@google.com>
Message-ID: <20230216051750.3125598-27-surenb@google.com>
Subject: [PATCH v3 26/35] mm: fall back to mmap_lock if vma->anon_vma is not yet set
From: Suren Baghdasaryan <surenb@google.com>
To: akpm@linux-foundation.org
Cc: michel@lespinasse.org, jglisse@google.com, mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mgorman@techsingularity.net, dave@stgolabs.net, willy@infradead.org, liam.howlett@oracle.com, peterz@infradead.org, ldufour@linux.ibm.com, paulmck@kernel.org, mingo@redhat.com, will@kernel.org, luto@kernel.org, songliubraving@fb.com, peterx@redhat.com, david@redhat.com, dhowells@redhat.com, hughd@google.com, bigeasy@linutronix.de, kent.overstreet@linux.dev, punit.agrawal@bytedance.com, lstoakes@gmail.com, peterjung1337@gmail.com, rientjes@google.com, chriscli@google.com, axelrasmussen@google.com, joelaf@google.com, minchan@google.com, rppt@kernel.org, jannh@google.com, shakeelb@google.com, tatashin@google.com, edumazet@google.com, gthelen@google.com, gurua@google.com, arjunroy@google.com, soheil@google.com, leewalsh@google.com, posk@google.com, michalechner92@googlemail.com, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-kernel@vger.kernel.org, kernel-team@android.com, Suren Baghdasaryan <surenb@google.com>
List-ID: <linux-kernel.vger.kernel.org> |
Series |
Per-VMA locks
|
|
Commit Message
Suren Baghdasaryan
Feb. 16, 2023, 5:17 a.m. UTC
When vma->anon_vma is not set, page fault handler will set it by either
reusing anon_vma of an adjacent VMA if VMAs are compatible or by
allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
a compatible adjacent VMA and that requires not only the faulting VMA
to be stable but also the tree structure and other VMAs inside that tree.
Therefore locking just the faulting VMA is not enough for this search.
Fall back to taking mmap_lock when vma->anon_vma is not set. This
situation happens only on the first page fault and should not affect
overall performance.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)
Comments
On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> When vma->anon_vma is not set, page fault handler will set it by either
> reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> a compatible adjacent VMA and that requires not only the faulting VMA
> to be stable but also the tree structure and other VMAs inside that tree.
> Therefore locking just the faulting VMA is not enough for this search.
> Fall back to taking mmap_lock when vma->anon_vma is not set. This
> situation happens only on the first page fault and should not affect
> overall performance.

I think I asked this before, but don't remember getting an answer.
Why do we defer setting anon_vma to the first fault? Why don't we
set it up at mmap time?
On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> I think I asked this before, but don't remember getting an answer.
> Why do we defer setting anon_vma to the first fault? Why don't we
> set it up at mmap time?

Yeah, I remember that conversation, Matthew, and I could not find the
definitive answer at the time. I'll look into that again or maybe
someone can answer it here.

In the end, rather than changing that logic, I decided to skip
vma->anon_vma==NULL cases because I measured them being less than
0.01% of all page faults, so ROI from changing that would be quite
low. But I agree that the logic is weird and maybe we can improve
that. I will have to review that again when I'm working on eliminating
all these special cases we skip, like swap/userfaults/etc.
On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Yeah, I remember that conversation Matthew and I could not find the
> definitive answer at the time. I'll look into that again or maybe
> someone can answer it here.

After looking into it again I'm still under the impression that
vma->anon_vma is populated lazily (during the first page fault rather
than at mmap time) to avoid doing extra work for areas which are never
faulted. Though I might be missing some important detail here.
On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> After looking into it again I'm still under the impression that
> vma->anon_vma is populated lazily (during the first page fault rather
> than at mmap time) to avoid doing extra work for areas which are never
> faulted. Though I might be missing some important detail here.

I think this is because the kernel cannot merge VMAs that have
different anon_vmas?

Enabling lazy population of anon_vma could potentially increase the
chances of merging VMAs.
On Thu, Feb 16, 2023 at 06:14:59PM -0800, Suren Baghdasaryan wrote:
> After looking into it again I'm still under the impression that
> vma->anon_vma is populated lazily (during the first page fault rather
> than at mmap time) to avoid doing extra work for areas which are never
> faulted. Though I might be missing some important detail here.

How often does userspace call mmap() and then _never_ fault on it?
I appreciate that userspace might mmap() gigabytes of address space and
then only end up using a small amount of it, so populating it lazily
makes sense. But creating a region and never faulting on it? The only
use-case I can think of is loading shared libraries:

openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
(...)
mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000
mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000
mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000
mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000
mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000

but that's a file-backed VMA, not an anon VMA.
On Fri, Feb 17, 2023 at 8:05 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> How often does userspace call mmap() and then _never_ fault on it?
> I appreciate that userspace might mmap() gigabytes of address space and
> then only end up using a small amount of it, so populating it lazily
> makes sense. But creating a region and never faulting on it? The only
> use-case I can think of is loading shared libraries:
>
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> (...)
> mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000
> mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000
> mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000
> mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000
> mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000
>
> but that's a file-backed VMA, not an anon VMA.

Might the case of dup_mmap() while forking be the reason why a VMA in
the child process might never be used while the parent uses it (or vice
versa)? Again, I'm not sure this is the reason but I can find no other
good explanation.
On Fri, Feb 17, 2023 at 2:21 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> I think this is because the kernel cannot merge VMAs that have
> different anon_vmas?
>
> Enabling lazy population of anon_vma could potentially increase the
> chances of merging VMAs.

Hmm. Do you have a clear explanation why merging chances increase this
way? A couple of possibilities I can think of would be:
1. If after mmap'ing a VMA and before faulting the first page into it
we often change something that affects anon_vma_compatible() decision,
like vm_policy;
2. When mmap'ing VMAs we do not map them consecutively but the final
arrangement is actually contiguous.

Don't think either of those cases would be very representative of a
usual case but maybe I'm wrong or there is another reason?
On Fri, Feb 17, 2023 at 08:13:01AM -0800, Suren Baghdasaryan wrote:
> Hmm. Do you have a clear explanation why merging chances increase this
> way? A couple of possibilities I can think of would be:
> 1. If after mmap'ing a VMA and before faulting the first page into it
> we often change something that affects anon_vma_compatible() decision,
> like vm_policy;
> 2. When mmap'ing VMAs we do not map them consecutively but the final
> arrangement is actually contiguous.
>
> Don't think either of those cases would be very representative of a
> usual case but maybe I'm wrong or there is another reason?

Ok. I agree it does not represent common cases.

Hmm then I wonder how it went from the initial approach of "allocate
anon_vma objects only via fork()" [1] to "populate anon_vma at page
faults". [2] [3]

Maybe Hugh, Andrea or Andrew have opinions?

[1] anon_vma RFC2, lore.kernel.org
https://lore.kernel.org/lkml/20040311065254.GT30940@dualathlon.random

[2] The status of object-based reverse mapping, LWN.net
https://lwn.net/Articles/85908

[3] rmap 39 add anon_vma rmap
https://gitlab.com/hyeyoo/linux-historical/-/commit/8aa3448cabdfca146aa3fd36e852d0209fb2276a
On Fri, Feb 17, 2023 at 08:10:35AM -0800, Suren Baghdasaryan wrote:
> On Fri, Feb 17, 2023 at 8:05 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > How often does userspace call mmap() and then _never_ fault on it?
> > I appreciate that userspace might mmap() gigabytes of address space and
> > then only end up using a small amount of it, so populating it lazily
> > makes sense. But creating a region and never faulting on it? The only
> > use-case I can think of is loading shared libraries:
> >
> > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> > (...)
> > mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000
> > mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000
> > mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000
> > mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000
> > mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000
> >
> > but that's a file-backed VMA, not an anon VMA.
>
> Might the case of dup_mmap() while forking be the reason why a VMA in
> the child process might be never used while parent uses it (or vice
> versa)? Again, I'm not sure this is the reason but I can find no other
> good explanation.

I found an explanation! Well, a partial one. If we MAP_PRIVATE a file
mapping (like, er, those ones up there) and only take read faults on it,
we can postpone allocation of the anon_vma indefinitely. But once we
take a write fault in that VMA, we need to allocate an anon_vma for it
so that we can track the anonymous pages that have been allocated to
satisfy the copy-on-write (see do_cow_fault()).

However, I think in that case, we could probably skip the
find_mergeable_anon_vma() step. We don't today; we check whether
a->vm_file == b->vm_file in anon_vma_compatible(), but I wonder if that
triggers often.
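
The userspace-visible side of the copy-on-write behaviour described above can be sketched in C. This is not part of the patch; it assumes a Linux/POSIX system with a writable /tmp, and the function name private_write_stays_private() is made up for illustration. It takes a read fault and then a write fault on a MAP_PRIVATE file mapping and checks that the file itself is unchanged. It cannot observe vma->anon_vma directly, since that allocation happens inside the kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a write through a MAP_PRIVATE file
 * mapping changed the mapping but not the file, 0 if not, and -1 if the
 * setup (temp file, write, mmap, pread) failed.
 */
static int private_write_stays_private(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";
    char on_disk = 0;
    char before, after;
    int result = -1;
    void *map;
    volatile char *p;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                 /* the open fd keeps the file alive */
    if (write(fd, "A", 1) != 1)
        goto out;

    map = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    p = map;

    before = p[0];                /* read fault: served from the page cache */
    p[0] = 'B';                   /* write fault: private copy of the page */
    after = p[0];

    /* Read the file back through the fd, bypassing the mapping. */
    if (pread(fd, &on_disk, 1, 0) == 1)
        result = (before == 'A' && after == 'B' && on_disk == 'A');

    munmap(map, 1);
out:
    close(fd);
    return result;
}
```

Calling private_write_stays_private() from a small main() and checking for a return value of 1 is enough to confirm the behaviour on a given system.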
diff --git a/mm/memory.c b/mm/memory.c
index 5e1c124552a1..13369ff15ec1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5242,6 +5242,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma_is_anonymous(vma))
 		goto inval;
 
+	/* find_mergeable_anon_vma uses adjacent vmas which are not locked */
+	if (!vma->anon_vma)
+		goto inval;
+
 	if (!vma_start_read(vma))
 		goto inval;
 
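
For readers who want to follow the control flow outside the kernel tree, here is a minimal userspace model of the checks in this hunk. It is only a sketch: struct vma_model, try_lock_vma_model() and choose_fault_path() are hypothetical names invented for illustration, and the write_locked flag is a simplified stand-in for vma_start_read() failing. None of this is the kernel's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct anon_vma_model { int unused; };

struct vma_model {
    bool is_anonymous;               /* models vma_is_anonymous(vma) */
    struct anon_vma_model *anon_vma; /* NULL until the first fault sets it */
    bool write_locked;               /* models vma_start_read() failing */
};

enum fault_path { FAULT_PER_VMA_LOCK, FAULT_MMAP_LOCK };

/*
 * Models the checks shown in the hunk: returns the VMA when the per-VMA
 * lock path may be used, or NULL for every "goto inval" case.
 */
static struct vma_model *try_lock_vma_model(struct vma_model *vma)
{
    if (vma == NULL)
        return NULL;
    if (!vma->is_anonymous)
        return NULL;
    /* The check added by this patch: anon_vma is not set up yet. */
    if (vma->anon_vma == NULL)
        return NULL;
    if (vma->write_locked)
        return NULL;
    return vma;
}

/* A NULL result means the caller retries the fault under mmap_lock. */
static enum fault_path choose_fault_path(struct vma_model *vma)
{
    return try_lock_vma_model(vma) ? FAULT_PER_VMA_LOCK : FAULT_MMAP_LOCK;
}
```

In this model a freshly mapped anonymous VMA (anon_vma == NULL) always takes the mmap_lock path, and only switches to the per-VMA lock path once anon_vma has been set, which matches the "only on the first page fault" claim in the commit message.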