Message ID | 20230713042037.980211-1-42.hyeyoo@gmail.com |
---|---|
Headers |
From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
To: Minchan Kim <minchan@kernel.org>, Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Matthew Wilcox <willy@infradead.org>, Mike Rapoport <rppt@kernel.org>, Hyeonggon Yoo <42.hyeyoo@gmail.com>
Subject: [RFC PATCH v2 00/21] mm/zsmalloc: Split zsdesc from struct page
Date: Thu, 13 Jul 2023 13:20:15 +0900
Message-ID: <20230713042037.980211-1-42.hyeyoo@gmail.com> |
Series | mm/zsmalloc: Split zsdesc from struct page |
Message
Hyeonggon Yoo
July 13, 2023, 4:20 a.m. UTC
v1: https://lore.kernel.org/linux-mm/20230220132218.546369-1-42.hyeyoo@gmail.com

v1 -> v2:
  - rebased to the latest mm-unstable, resulting in some patches dropped
  - addressed comments from Mike Rapoport, defining helpers when converting
    their users

The purpose of this series is to define its own memory descriptor for
zsmalloc, instead of re-using various fields of struct page. This is part
of the effort to reduce the size of struct page to unsigned long and enable
dynamic allocation of memory descriptors.

While [1] outlines this ultimate objective, the current use of struct page
is highly dependent on its definition, making it challenging to separately
allocate memory descriptors. Therefore, this series introduces a new
descriptor for zsmalloc, called zsdesc. It overlays struct page for now,
but will eventually be allocated independently. Apart from enabling dynamic
allocation of descriptors, this is also a nice cleanup.

This work is also available at:
https://gitlab.com/hyeyoo/linux/-/tree/separate_zsdesc_rfc-v2

[1] State Of The Page, August 2022
https://lore.kernel.org/lkml/YvV1KTyzZ+Jrtj9x@casper.infradead.org

Hyeonggon Yoo (21):
  mm/zsmalloc: create new struct zsdesc
  mm/zsmalloc: add utility functions for zsdesc
  mm/zsmalloc: replace first_page to first_zsdesc in struct zspage
  mm/zsmalloc: add alternatives of frequently used helper functions
  mm/zsmalloc: convert {try,}lock_zspage() to use zsdesc
  mm/zsmalloc: convert __zs_{map,unmap}_object() to use zsdesc
  mm/zsmalloc: convert obj_to_location() and its users to use zsdesc
  mm/zsmalloc: convert obj_malloc() to use zsdesc
  mm/zsmalloc: convert create_page_chain() and its user to use zsdesc
  mm/zsmalloc: convert obj_allocated() and related helpers to use zsdesc
  mm/zsmalloc: convert init_zspage() to use zsdesc
  mm/zsmalloc: convert obj_to_page() and zs_free() to use zsdesc
  mm/zsmalloc: convert reset_page() to reset_zsdesc()
  mm/zsmalloc: convert zs_page_{isolate,migrate,putback} to use zsdesc
  mm/zsmalloc: convert __free_zspage() to use zsdesc
  mm/zsmalloc: convert location_to_obj() to use zsdesc
  mm/zsmalloc: convert migrate_zspage() to use zsdesc
  mm/zsmalloc: convert get_zspage() to take zsdesc
  mm/zsmalloc: convert SetZsPageMovable() to use zsdesc
  mm/zsmalloc: remove now unused helper functions
  mm/zsmalloc: convert {get,set}_first_obj_offset() to use zsdesc

 mm/zsmalloc.c | 574 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 360 insertions(+), 214 deletions(-)
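To illustrate the overlay approach the cover letter describes, here is a minimal sketch. The field names, the layout checks, and the page_zsdesc() helper below are illustrative stand-ins under the assumption that zsdesc mirrors the head of struct page; they are not the exact definitions from the series (see patch 01 for those).

```c
/* Sketch only: a zsmalloc-private view of struct page. */
#include <linux/mm_types.h>
#include <linux/build_bug.h>
#include <linux/stddef.h>

struct zspage;

struct zsdesc {
	unsigned long flags;		/* must line up with page->flags */
	struct list_head list;		/* links the pages of a zspage */
	struct zspage *zspage;		/* what the old code kept in page->private */
	unsigned int first_obj_offset;	/* offset of the first object in this page */
};

/* The overlay is only sound while both layouts agree. */
static_assert(offsetof(struct zsdesc, flags) == offsetof(struct page, flags));
static_assert(sizeof(struct zsdesc) <= sizeof(struct page));

static inline struct zsdesc *page_zsdesc(struct page *page)
{
	return (struct zsdesc *)page;
}
```

The cast works only because of the overlay; once descriptors are allocated separately, page_zsdesc() becomes a real lookup, which is why the series funnels every access through such helpers now.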
Comments
On (23/07/13 13:20), Hyeonggon Yoo wrote:
> The purpose of this series is to define own memory descriptor for zsmalloc,
> instead of re-using various fields of struct page. This is a part of the
> effort to reduce the size of struct page to unsigned long and enable
> dynamic allocation of memory descriptors.
>
> While [1] outlines this ultimate objective, the current use of struct page
> is highly dependent on its definition, making it challenging to separately
> allocate memory descriptors.

I glanced through the series and it all looks pretty straightforward to
me. I'll have a closer look. And we definitely need Minchan to ACK it.

> Therefore, this series introduces new descriptor for zsmalloc, called
> zsdesc. It overlays struct page for now, but will eventually be allocated
> independently in the future.

So I don't expect a zsmalloc memory usage increase. On one hand, for each
physical page that a zspage consists of we will allocate a zsdesc (extra
bytes), but at the same time struct page gets slimmer. So we should be
even, or am I wrong?
On Thu, Jul 20, 2023 at 12:18 AM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (23/07/13 13:20), Hyeonggon Yoo wrote:
> > The purpose of this series is to define own memory descriptor for zsmalloc,
> > instead of re-using various fields of struct page. This is a part of the
> > effort to reduce the size of struct page to unsigned long and enable
> > dynamic allocation of memory descriptors.
> >
> > While [1] outlines this ultimate objective, the current use of struct page
> > is highly dependent on its definition, making it challenging to separately
> > allocate memory descriptors.
>
> I glanced through the series and it all looks pretty straight forward to
> me. I'll have a closer look. And we definitely need Minchan to ACK it.
>
> > Therefore, this series introduces new descriptor for zsmalloc, called
> > zsdesc. It overlays struct page for now, but will eventually be allocated
> > independently in the future.
>
> So I don't expect zsmalloc memory usage increase. On one hand for each
> physical page that zspage consists of we will allocate zsdesc (extra bytes),
> but at the same time struct page gets slimmer. So we should be even, or
> am I wrong?

Well, it depends. Here is my understanding (which may be completely wrong):

The end goal would be to have an 8-byte memdesc for each order-0 page, and
then allocate a specialized struct per folio according to the use case. In
this case, we would have both a memdesc and a zsdesc for each order-0 page.
If sizeof(zsdesc) is 64 bytes (on 64-bit), then it's a net loss. The
savings only start kicking in with higher-order folios. As of now, zsmalloc
only uses order-0 pages as far as I can tell, so the usage would increase
if I understand correctly.

It seems to me, though, that sizeof(zsdesc) is actually 56 bytes (on
64-bit), so sizeof(zsdesc) + sizeof(memdesc) would be equal to the current
size of struct page. If that's true, then there is no loss, and there's
potential gain if we start using higher-order folios in zsmalloc in the
future.

(That is of course unless we want to maintain cache line alignment for the
zsdescs, in which case we might end up using 64 bytes anyway.)
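To make the arithmetic in this exchange concrete, here is a small compile-time check one could write. The struct below is an invented stand-in for a 56-byte zsdesc (today's 64-byte struct page minus the unused 8-byte memcg_data word), and memdesc_t stands in for the 8-byte per-page descriptor; it assumes an LP64 target.

```c
#include <stdint.h>
#include <assert.h>	/* static_assert (C11) */

/* Stand-in for the 8-byte per-page descriptor ("memdesc"). */
typedef uint64_t memdesc_t;

/* Stand-in for a 56-byte zsdesc on 64-bit: struct page minus memcg_data. */
struct zsdesc_sketch {
	unsigned long flags;	/*  8 bytes */
	void *slots[5];		/* 40 bytes of list/private/index words */
	unsigned int lo, hi;	/*  8 bytes */
};

static_assert(sizeof(struct zsdesc_sketch) == 56, "zsdesc stays at 56 bytes");
static_assert(sizeof(memdesc_t) + sizeof(struct zsdesc_sketch) == 64,
	      "memdesc + zsdesc == the current sizeof(struct page)");

int main(void) { return 0; }
```

If both asserts hold, splitting the descriptor is size-neutral for order-0 pages, which is the crux of the "no loss" argument above.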
On Thu, Jul 20, 2023 at 4:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Jul 20, 2023 at 12:18 AM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> >
> > On (23/07/13 13:20), Hyeonggon Yoo wrote:
> > > <snip>
> >
> > So I don't expect zsmalloc memory usage increase. On one hand for each
> > physical page that zspage consists of we will allocate zsdesc (extra bytes),
> > but at the same time struct page gets slimmer. So we should be even, or
> > am I wrong?
>
> Well, it depends. Here is my understanding (which may be completely wrong):
>
> The end goal would be to have an 8-byte memdesc for each order-0 page,
> and then allocate a specialized struct per-folio according to the use
> case. In this case, we would have a memdesc and a zsdesc for each
> order-0 page. If sizeof(zsdesc) is 64 bytes (on 64-bit), then it's a
> net loss. The savings only start kicking in with higher order folios.
> As of now, zsmalloc only uses order-0 pages as far as I can tell, so
> the usage would increase if I understand correctly.

I partially agree with you that the point of the memdesc work is allocating
a use-case-specific descriptor per folio, but I thought the primary gain
from memdesc was from anon and file pages (where high-order pages are more
usable), rather than from zsmalloc.

And I believe enabling a memory descriptor per folio would be impossible
(or inefficient) if zsmalloc and other subsystems kept using struct page
in the current way (or please tell me I'm wrong?).

So I expect the primary gain would be from high-order anon/file folios,
while this series is a prerequisite for them to work sanely.

> It seems to me though the sizeof(zsdesc) is actually 56 bytes (on
> 64-bit), so sizeof(zsdesc) + sizeof(memdesc) would be equal to the
> current size of struct page. If that's true, then there is no loss,

Yeah, zsdesc would be 56 bytes on 64-bit CPUs, as the memcg_data field is
not used in zsmalloc. More fields in the current struct page might turn
out to be unneeded in the future, although it's hard to say at the moment.
But it's not a loss.

> and there's potential gain if we start using higher order folios in
> zsmalloc in the future.

AFAICS zsmalloc should work even when system memory is fragmented, so we
may implement fallback allocation (as currently discussed in the large
anon folios thread).

It might work, but IMHO the purpose of this series is to enable memdesc
for large anon/file folios, rather than to see a large gain in zsmalloc
itself. (But even in zsmalloc, it's not a loss.)

> (That is of course unless we want to maintain cache line alignment for
> the zsdescs, then we might end up using 64 bytes anyway).

We already don't require cache line alignment for struct page. The current
alignment requirement is due to SLUB's cmpxchg128 operation, not cache
line alignment.

I might be wrong in some aspects, so please tell me if I am.
And thank you and Sergey for taking a look at this!

--
Hyeonggon
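For reference on that alignment point: the requirement is double-word, not cache-line, and it comes from how struct page opts into alignment in include/linux/mm_types.h. The snippet below is quoted from memory, so verify the exact form against your tree:

```c
/* SLUB's cmpxchg128-based freelist update needs double-word alignment
 * (16 bytes on 64-bit); _struct_page_alignment provides exactly that,
 * not a full 64-byte cache line. */
#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
#define _struct_page_alignment	__aligned(2 * sizeof(unsigned long))
#else
#define _struct_page_alignment
#endif

struct page {
	unsigned long flags;
	/* ... unions of type-specific fields ... */
} _struct_page_alignment;
```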
On Thu, Jul 20, 2023 at 4:34 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 4:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > <snip>
>
> I partially agree with you that the point of memdesc stuff is
> allocating a use-case specific descriptor per folio. but I thought the
> primary gain from memdesc was from anon and file pages (where high
> order pages are more usable), rather than zsmalloc.
>
> And I believe enabling a memory descriptor per folio would be
> impossible (or inefficient) if zsmalloc and other subsystems are using
> struct page in the current way (or please tell me I'm wrong?)
>
> So I expect the primary gain would be from high-order anon/file folios,
> while this series is a prerequisite for them to work sanely.

Right, I agree with that, sorry if I wasn't clear. I meant that, generally
speaking, we see gains from memdesc with higher-order folios, so for
zsmalloc specifically we probably won't be seeing any savings, and *might*
see some extra usage (which I might be wrong about, see below).

> > It seems to me though the sizeof(zsdesc) is actually 56 bytes (on
> > 64-bit), so sizeof(zsdesc) + sizeof(memdesc) would be equal to the
> > current size of struct page. If that's true, then there is no loss,
>
> Yeah, zsdesc would be 56 bytes on 64 bit CPUs as memcg_data field is
> not used in zsmalloc. More fields in the current struct page might not
> be needed in the future, although it's hard to say at the moment.
> but it's not a loss.

Is page->memcg_data something that we can drop? Aren't there code paths
that will check page->memcg_data even for kernel pages (e.g.
__folio_put() -> __folio_put_small() -> mem_cgroup_uncharge())?

> > and there's potential gain if we start using higher order folios in
> > zsmalloc in the future.
>
> AFAICS zsmalloc should work even when the system memory is fragmented,
> so we may implement fallback allocation (as currently discussed in
> large anon folios thread).

Of course, any usage of higher-order folios in zsmalloc must have fallback
logic, although it might be simpler for zsmalloc than for anon folios. I
agree that's off topic here.

> It might work, but IMHO the purpose of this series is to enable memdesc
> for large anon/file folios, rather than seeing a large gain in zsmalloc itself.
> (But even in zsmalloc, it's not a loss)
>
> > (That is of course unless we want to maintain cache line alignment for
> > the zsdescs, then we might end up using 64 bytes anyway).
>
> we already don't require cache line alignment for struct page. the current
> alignment requirement is due to SLUB's cmpxchg128 operation, not cache
> line alignment.

I thought we wanted struct page to be cache line aligned (to avoid having
to fetch two cache lines for one struct page), but I can easily be wrong.

> I might be wrong in some aspects, so please tell me if I am.
> And thank you and Sergey for taking a look at this!

Thanks to you for doing the work!

> --
> Hyeonggon
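The free path Yosry points at looks roughly like this, paraphrased from mm/swap.c of that era (the exact code in any given tree may differ):

```c
/* Simplified shape of __folio_put() -> __folio_put_small(): the generic
 * free path consults folio->memcg_data for every page, charged or not.
 * That is why dropping the field from one descriptor type requires the
 * generic code to learn about descriptor types first. */
static void __folio_put_small(struct folio *folio)
{
	__page_cache_release(folio);
	mem_cgroup_uncharge(folio);	/* reads folio->memcg_data, even when 0 */
	free_unref_page(&folio->page, 0);
}
```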
On Fri, Jul 21, 2023 at 3:31 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Jul 20, 2023 at 4:34 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
> >
> > <snip>
> >
> > So I expect the primary gain would be from high-order anon/file folios,
> > while this series is a prerequisite for them to work sanely.
>
> Right, I agree with that, sorry if I wasn't clear. I meant that
> generally speaking, we see gains from memdesc from higher order
> folios, so for zsmalloc specifically we probably won't be seeing any
> savings, and *might* see some extra usage (which I might be wrong
> about, see below).

Yeah. Even if I said "oh, we don't necessarily need to use extra memory
for zsdesc" below, a slight increase wouldn't hurt too much from that
perspective, because there will be savings from other users of memdesc.

> > > It seems to me though the sizeof(zsdesc) is actually 56 bytes (on
> > > 64-bit), so sizeof(zsdesc) + sizeof(memdesc) would be equal to the
> > > current size of struct page. If that's true, then there is no loss,
> >
> > Yeah, zsdesc would be 56 bytes on 64 bit CPUs as memcg_data field is
> > not used in zsmalloc. More fields in the current struct page might not
> > be needed in the future, although it's hard to say at the moment.
> > but it's not a loss.
>
> Is page->memcg_data something that we can drop? Aren't there code
> paths that will check page->memcg_data even for kernel pages (e.g.
> __folio_put() -> __folio_put_small() -> mem_cgroup_uncharge())?

zsmalloc pages are not accounted for via __GFP_ACCOUNT, and IIUC the
current implementation of zswap memcg charging does not use memcg_data
either - so I think it can be dropped.

I think we don't want to grow memdesc to 16 bytes by adding memcg_data.
It should live in the use-case-specific descriptors that can actually be
charged to a memcg?

> > > and there's potential gain if we start using higher order folios in
> > > zsmalloc in the future.
> >
> > AFAICS zsmalloc should work even when the system memory is fragmented,
> > so we may implement fallback allocation (as currently discussed in
> > large anon folios thread).
>
> Of course, any usage of higher order folios in zsmalloc must have a
> fallback logic, although it might be simpler for zsmalloc than anon
> folios. I agree that's off topic here.
>
> > It might work, but IMHO the purpose of this series is to enable memdesc
> > for large anon/file folios, rather than seeing a large gain in zsmalloc itself.
> > (But even in zsmalloc, it's not a loss)
> >
> > > (That is of course unless we want to maintain cache line alignment for
> > > the zsdescs, then we might end up using 64 bytes anyway).
> >
> > we already don't require cache line alignment for struct page. the current
> > alignment requirement is due to SLUB's cmpxchg128 operation, not cache
> > line alignment.
>
> I thought we want struct page to be cache line aligned (to avoid
> having to fetch two cache lines for one struct page), but I can easily
> be wrong.

Right. I admit that even if it's not required to be cache line aligned,
struct page is 64 bytes in commonly used configurations, and changing that
could affect some workloads. But I think it would be better not to align
zsdesc to the cache line size before observing degradations due to
alignment. By the time zsmalloc is intensively used, it shouldn't be a
huge issue.

> > I might be wrong in some aspects, so please tell me if I am.
> > And thank you and Sergey for taking a look at this!
>
> Thanks to you for doing the work!

No problem! :)
<snip>

> > > It seems to me though the sizeof(zsdesc) is actually 56 bytes (on
> > > 64-bit), so sizeof(zsdesc) + sizeof(memdesc) would be equal to the
> > > current size of struct page. If that's true, then there is no loss,
> >
> > Yeah, zsdesc would be 56 bytes on 64 bit CPUs as memcg_data field is
> > not used in zsmalloc. More fields in the current struct page might not
> > be needed in the future, although it's hard to say at the moment.
> > but it's not a loss.
>
> Is page->memcg_data something that we can drop? Aren't there code
> paths that will check page->memcg_data even for kernel pages (e.g.
> __folio_put() -> __folio_put_small() -> mem_cgroup_uncharge())?

> zsmalloc pages are not accounted for via __GFP_ACCOUNT,

Yeah, but the code in the free path above will check page->memcg_data
nonetheless to check if it is charged. I think to drop memcg_data we need
to enlighten the code that some pages do not even have memcg_data at all,
no?

> and IIUC the current implementation of zswap memcg charging does not
> use memcg_data either - so I think it can be dropped.

My question is more about the generic mm code expecting to see
page->memcg_data in every page, even if it is not actually used (zero).

> I think we don't want to increase memdesc to 16 bytes by adding memcg_data.
> It should be in use-case specific descriptors if it can be charged to memcg?

<snip>
On Fri, Jul 21, 2023 at 6:39 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> <snip>
>
> > zsmalloc pages are not accounted for via __GFP_ACCOUNT,
>
> Yeah, but the code in the free path above will check page->memcg_data
> nonetheless to check if it is charged.

Right.

> I think to drop memcg_data we need to enlighten the code that some pages
> do not even have memcg_data at all

I agree with you. It should be one of the milestones for all of this to
work. It won't be complicated for the code to be aware of it, because
there will be a freeing (and uncharging, if needed) routine per type of
descriptor.
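A purely hypothetical sketch of the per-descriptor-type freeing routine described here; every name in it is invented for illustration, and nothing like this exists in the tree yet:

```c
/* Once the 8-byte memdesc encodes the page's type, only descriptor
 * types that can be memcg-charged carry (and uncharge) memcg_data. */
enum memdesc_type { MEMDESC_FOLIO, MEMDESC_SLAB, MEMDESC_ZSDESC };

static void memdesc_free(struct page *page)
{
	switch (memdesc_type_of(page)) {	/* hypothetical helper */
	case MEMDESC_FOLIO:
		/* folios keep memcg_data, so uncharge before freeing */
		mem_cgroup_uncharge(page_folio(page));
		folio_free(page_folio(page));	/* hypothetical */
		break;
	case MEMDESC_ZSDESC:
		/* zsdescs never carry memcg_data: nothing to uncharge */
		zsdesc_free(page_zsdesc(page));	/* hypothetical */
		break;
	default:
		break;
	}
}
```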
On Thu, Jul 20, 2023 at 2:52 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> On Fri, Jul 21, 2023 at 6:39 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > <snip>
> >
> > I think to drop memcg_data we need to enlighten the code that some pages
> > do not even have memcg_data at all
>
> I agree with you. It should be one of the milestones for all of this to work.
> It won't be complicated for the code to be aware of it, because there will be
> a freeing (and uncharging if need) routine per type of descriptors.

Right. For this patch series, do we need to maintain memcg_data in zsdesc
to avoid any subtle problems?