Message ID: 20240116133145.12454-1-debug.penguin32@gmail.com
State: New
Headers:
From: Ronald Monthero <debug.penguin32@gmail.com>
To: nphamcs@gmail.com
Cc: sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, akpm@linux-foundation.org, chrisl@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ronald Monthero <debug.penguin32@gmail.com>
Subject: [PATCH] mm/zswap: Improve with alloc_workqueue() call
Date: Tue, 16 Jan 2024 23:31:45 +1000
Message-Id: <20240116133145.12454-1-debug.penguin32@gmail.com>
In-Reply-To: <CAKEwX=NLe-N6dLvOVErPSL3Vfw6wqHgcUBQoNRLeWkN6chdvLQ@mail.gmail.com>
References: <CAKEwX=NLe-N6dLvOVErPSL3Vfw6wqHgcUBQoNRLeWkN6chdvLQ@mail.gmail.com>
List-Id: <linux-kernel.vger.kernel.org>
Series: mm/zswap: Improve with alloc_workqueue() call
Commit Message
Ronald Monthero
Jan. 16, 2024, 1:31 p.m. UTC
The core-api create_workqueue() is deprecated; this patch replaces
create_workqueue() with alloc_workqueue(). zswap's shrink workqueue was
previously a bound workqueue; this patch uses alloc_workqueue() to
create an unbound one instead. The WQ_UNBOUND attribute is desirable
because the shrink work is no longer pinned to the CPU that queued it,
leaving the scheduler free to place it on any permitted CPU when the
system is busy, for example when other workqueues on the same CPU,
marked WQ_HIGHPRI or WQ_CPU_INTENSIVE, need to be served. An unbound
workqueue also tends to be more efficient than a bound one when the
system is under memory pressure.

shrink_wq = alloc_workqueue("zswap-shrink",
                WQ_UNBOUND|WQ_MEM_RECLAIM, 1);

Overall the change should be seamless and does not alter existing
behavior, other than the workqueue becoming unbound.
Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
---
mm/zswap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
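For context on the API being swapped, create_workqueue() is nowadays only a legacy convenience wrapper around alloc_workqueue(). The sketch below shows the approximate wrapper definition and the before/after calls; the exact flag set of the macro can differ between kernel versions.

/* Approximate definition from include/linux/workqueue.h: the deprecated
 * helper creates a per-cpu (bound) workqueue with max_active = 1. */
#define create_workqueue(name) \
        alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))

/* Before: bound workqueue, the shrink work runs on the CPU that queued it. */
shrink_wq = create_workqueue("zswap-shrink");

/* After: unbound workqueue, the scheduler may run the shrink work on any
 * CPU allowed by wq_unbound_cpumask; WQ_MEM_RECLAIM is kept so the queue
 * has a rescuer thread and can make progress under memory pressure. */
shrink_wq = alloc_workqueue("zswap-shrink", WQ_UNBOUND | WQ_MEM_RECLAIM, 1);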
Comments
On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero <debug.penguin32@gmail.com> wrote:

+ Johannes and Yosry

> The core-api create_workqueue is deprecated, this patch replaces the create_workqueue with alloc_workqueue. The previous implementation workqueue of zswap was a bounded workqueue, this patch uses alloc_workqueue() to create an unbounded workqueue. The WQ_UNBOUND attribute is desirable making the workqueue not localized to a specific cpu so that the scheduler is free to exercise improvisations in any demanding scenarios for offloading cpu time slices for workqueues.

nit: extra space between paragraph would be nice.

> For example if any other workqueues of the same primary cpu had to be served which are WQ_HIGHPRI and WQ_CPU_INTENSIVE. Also Unbound workqueue happens to be more efficient in a system during memory pressure scenarios in comparison to a bounded workqueue.
>
> shrink_wq = alloc_workqueue("zswap-shrink",
>                 WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>
> Overall the change suggested in this patch should be seamless and does not alter the existing behavior, other than the improvisation to be an unbounded workqueue.
>
> Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> ---
>  mm/zswap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 74411dfdad92..64dbe3e944a2 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
>                 zswap_enabled = false;
>         }
>
> -       shrink_wq = create_workqueue("zswap-shrink");
> +       shrink_wq = alloc_workqueue("zswap-shrink",
> +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);

Have you benchmarked this to check if there is any regression, just to be safe? With an unbounded workqueue, you're gaining scheduling flexibility at the cost of cache locality. My intuition is that it doesn't matter too much here, but you should probably double check by stress testing - run some workload with a relatively small zswap pool limit (i.e heavy global writeback), and see if there is any difference in performance.

> if (!shrink_wq)
>         goto fallback_fail;
>
> --
> 2.34.1

On a different note, I wonder if it would help to perform synchronous reclaim here instead. With our current design, the zswap store failure (due to global limit hit) would leave the incoming page going to swap instead, creating an LRU inversion. Not sure if that's ideal.
On Wed, Jan 17, 2024 at 11:14 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero <debug.penguin32@gmail.com> wrote:
>
> + Johannes and Yosry
>
> > The core-api create_workqueue is deprecated, this patch replaces the create_workqueue with alloc_workqueue. The previous implementation workqueue of zswap was a bounded workqueue, this patch uses alloc_workqueue() to create an unbounded workqueue. The WQ_UNBOUND attribute is desirable making the workqueue not localized to a specific cpu so that the scheduler is free to exercise improvisations in any demanding scenarios for offloading cpu time slices for workqueues.
>
> nit: extra space between paragraph would be nice.
>
> > For example if any other workqueues of the same primary cpu had to be served which are WQ_HIGHPRI and WQ_CPU_INTENSIVE. Also Unbound workqueue happens to be more efficient in a system during memory pressure scenarios in comparison to a bounded workqueue.
> >
> > shrink_wq = alloc_workqueue("zswap-shrink",
> >                 WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > Overall the change suggested in this patch should be seamless and does not alter the existing behavior, other than the improvisation to be an unbounded workqueue.
> >
> > Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> > ---
> >  mm/zswap.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 74411dfdad92..64dbe3e944a2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> >                 zswap_enabled = false;
> >         }
> >
> > -       shrink_wq = create_workqueue("zswap-shrink");
> > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>
> Have you benchmarked this to check if there is any regression, just to be safe? With an unbounded workqueue, you're gaining scheduling flexibility at the cost of cache locality. My intuition is that it doesn't matter too much here, but you should probably double check by stress testing - run some workload with a relatively small zswap pool limit (i.e heavy global writeback), and see if there is any difference in performance.

I also think this shouldn't make a large difference. The global shrinking work is already expensive, and I imagine that it exhausts the caches anyway by iterating memcgs. A performance smoketest would be reassuring for sure, but I believe it won't make a difference.

Keep in mind that even with WQ_UNBOUND, we prefer the local CPU (see wq_select_unbound_cpu()), so it will take more than global writeback to observe a difference. The local CPU must not be in wq_unbound_cpumask, or CONFIG_DEBUG_WQ_FORCE_RR_CPU should be on.

> > if (!shrink_wq)
> >         goto fallback_fail;
> >
> > --
> > 2.34.1
>
> On a different note, I wonder if it would help to perform synchronous reclaim here instead. With our current design, the zswap store failure (due to global limit hit) would leave the incoming page going to swap instead, creating an LRU inversion. Not sure if that's ideal.

The global shrink path keeps reclaiming until zswap can accept again (by default, that means reclaiming 10% of the total limit). I think this is too expensive to be done synchronously.
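The local-CPU preference mentioned above lives in wq_select_unbound_cpu() in kernel/workqueue.c. A simplified paraphrase of that logic follows (not the exact kernel source; pick_round_robin_cpu() is a made-up stand-in for the real round-robin selection):

/* Sketch of wq_select_unbound_cpu(): unbound work queued from a CPU that is
 * part of wq_unbound_cpumask keeps running on that CPU; another CPU is only
 * chosen when the local CPU is excluded from the mask, or when the
 * CONFIG_DEBUG_WQ_FORCE_RR_CPU debug option forces round-robin placement. */
static int wq_select_unbound_cpu_sketch(int cpu)
{
        if (!wq_debug_force_rr_cpu && cpumask_test_cpu(cpu, wq_unbound_cpumask))
                return cpu;                              /* prefer the local CPU */

        return pick_round_robin_cpu(wq_unbound_cpumask); /* hypothetical helper */
}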
On Wed, Jan 17, 2024 at 11:30:50AM -0800, Yosry Ahmed wrote:
> On Wed, Jan 17, 2024 at 11:14 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero <debug.penguin32@gmail.com> wrote:
> >
> > + Johannes and Yosry
> >
> > > The core-api create_workqueue is deprecated, this patch replaces the create_workqueue with alloc_workqueue. The previous implementation workqueue of zswap was a bounded workqueue, this patch uses alloc_workqueue() to create an unbounded workqueue. The WQ_UNBOUND attribute is desirable making the workqueue not localized to a specific cpu so that the scheduler is free to exercise improvisations in any demanding scenarios for offloading cpu time slices for workqueues.
> >
> > nit: extra space between paragraph would be nice.
> >
> > > For example if any other workqueues of the same primary cpu had to be served which are WQ_HIGHPRI and WQ_CPU_INTENSIVE. Also Unbound workqueue happens to be more efficient in a system during memory pressure scenarios in comparison to a bounded workqueue.
> > >
> > > shrink_wq = alloc_workqueue("zswap-shrink",
> > >                 WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> > >
> > > Overall the change suggested in this patch should be seamless and does not alter the existing behavior, other than the improvisation to be an unbounded workqueue.
> > >
> > > Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> > > ---
> > >  mm/zswap.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 74411dfdad92..64dbe3e944a2 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > >                 zswap_enabled = false;
> > >         }
> > >
> > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > Have you benchmarked this to check if there is any regression, just to be safe? With an unbounded workqueue, you're gaining scheduling flexibility at the cost of cache locality. My intuition is that it doesn't matter too much here, but you should probably double check by stress testing - run some workload with a relatively small zswap pool limit (i.e heavy global writeback), and see if there is any difference in performance.
>
> I also think this shouldn't make a large difference. The global shrinking work is already expensive, and I imagine that it exhausts the caches anyway by iterating memcgs. A performance smoketest would be reassuring for sure, but I believe it won't make a difference.

The LRU inherently makes the shrinker work on the oldest and coldest entries, so I doubt we benefit a lot from cache locality there.

What could make a difference though is the increased concurrency by switching max_active from 1 to 0. This could cause a higher rate of shrinker runs, which might increase lock contention and reclaim volume. That part would be good to double check with the shrinker benchmarks.

> > On a different note, I wonder if it would help to perform synchronous reclaim here instead. With our current design, the zswap store failure (due to global limit hit) would leave the incoming page going to swap instead, creating an LRU inversion. Not sure if that's ideal.
>
> The global shrink path keeps reclaiming until zswap can accept again (by default, that means reclaiming 10% of the total limit). I think this is too expensive to be done synchronously.

That thresholding code is a bit weird right now.

It wakes the shrinker and rejects at the same time. We're guaranteed to see rejections, even if the shrinker has no trouble flushing some entries a split second later.

It would make more sense to wake the shrinker at e.g. 95% full and have it run until 90%.

But with that in place we also *should* do synchronous reclaim once we hit 100%. Just enough to make room for the store. This is important to catch the case where reclaim rate exceeds swapout rate. Rejecting and going to swap means the reclaimer will be throttled down to IO rate anyway, and the app latency isn't any worse. But this way we keep the pipeline alive, and keep swapping out the oldest zswap entries, instead of rejecting and swapping what would be the hottest ones.
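A rough sketch of the hysteresis plus bounded synchronous reclaim described above, using hypothetical helper names and thresholds (this is the proposal being discussed, not existing zswap code):

/* Hypothetical store-path logic: wake the async shrinker at a high-water
 * mark, and only once the pool is actually full do just enough synchronous
 * writeback to admit the incoming page. */
static bool zswap_make_room(struct zswap_pool *pool)
{
        if (zswap_pool_pct_full() >= 95)                   /* high-water mark */
                queue_work(shrink_wq, &pool->shrink_work); /* drains to ~90% */

        while (zswap_pool_pct_full() >= 100) {             /* hard limit hit */
                if (zswap_writeback_one_entry(pool))       /* hypothetical, 0 on success */
                        return false;                      /* nothing evictable: reject */
        }
        return true;                                       /* store can proceed */
}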
On Thu, Jan 18, 2024 at 11:16:08AM -0500, Johannes Weiner wrote:
> > > On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> > > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > > >                 zswap_enabled = false;
> > > >         }
> > > >
> > > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>
> What could make a difference though is the increased concurrency by switching max_active from 1 to 0. This could cause a higher rate of shrinker runs, which might increase lock contention and reclaim volume. That part would be good to double check with the shrinker benchmarks.

Nevermind, I clearly can't read.

Could still be worthwhile testing with the default 0, but it's not a concern in the patch as-is.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
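For reference, the max_active argument is what this exchange is about: 1 caps the workqueue at a single in-flight work item, while 0 asks alloc_workqueue() for the default limit (WQ_DFL_ACTIVE). Roughly:

/* As submitted: at most one shrink work executing at a time. */
shrink_wq = alloc_workqueue("zswap-shrink", WQ_UNBOUND | WQ_MEM_RECLAIM, 1);

/* Alternative mentioned above: 0 means "use the default max_active"
 * (WQ_DFL_ACTIVE), i.e. many work items could be in flight concurrently. */
shrink_wq = alloc_workqueue("zswap-shrink", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);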
On Thu, Jan 18, 2024 at 8:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 11:16:08AM -0500, Johannes Weiner wrote:
> > > > On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> > > > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > > > >                 zswap_enabled = false;
> > > > >         }
> > > > >
> > > > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>
> > What could make a difference though is the increased concurrency by switching max_active from 1 to 0. This could cause a higher rate of shrinker runs, which might increase lock contention and reclaim volume. That part would be good to double check with the shrinker benchmarks.
>
> Nevermind, I clearly can't read.

Regardless of max_active, we only have one shrink_work per zswap pool, and we can only have one instance of the work running at any time, right?

> Could still be worthwhile testing with the default 0, but it's not a concern in the patch as-is.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > On a different note, I wonder if it would help to perform synchronous reclaim here instead. With our current design, the zswap store failure (due to global limit hit) would leave the incoming page going to swap instead, creating an LRU inversion. Not sure if that's ideal.
> >
> > The global shrink path keeps reclaiming until zswap can accept again (by default, that means reclaiming 10% of the total limit). I think this is too expensive to be done synchronously.
>
> That thresholding code is a bit weird right now.
>
> It wakes the shrinker and rejects at the same time. We're guaranteed to see rejections, even if the shrinker has no trouble flushing some entries a split second later.
>
> It would make more sense to wake the shrinker at e.g. 95% full and have it run until 90%.
>
> But with that in place we also *should* do synchronous reclaim once we hit 100%. Just enough to make room for the store. This is important to catch the case where reclaim rate exceeds swapout rate. Rejecting and going to swap means the reclaimer will be throttled down to IO rate anyway, and the app latency isn't any worse. But this way we keep the pipeline alive, and keep swapping out the oldest zswap entries, instead of rejecting and swapping what would be the hottest ones.

I fully agree with the thresholding code being weird, and with waking up the shrinker before the pool is full. What I don't understand is how we can do synchronous reclaim when we hit 100% and still respect the acceptance threshold :/

Are you proposing we change the semantics of the acceptance threshold to begin with?
On Thu, Jan 18, 2024 at 09:06:43AM -0800, Yosry Ahmed wrote:
> > > > On a different note, I wonder if it would help to perform synchronous reclaim here instead. With our current design, the zswap store failure (due to global limit hit) would leave the incoming page going to swap instead, creating an LRU inversion. Not sure if that's ideal.
> > >
> > > The global shrink path keeps reclaiming until zswap can accept again (by default, that means reclaiming 10% of the total limit). I think this is too expensive to be done synchronously.
> >
> > That thresholding code is a bit weird right now.
> >
> > It wakes the shrinker and rejects at the same time. We're guaranteed to see rejections, even if the shrinker has no trouble flushing some entries a split second later.
> >
> > It would make more sense to wake the shrinker at e.g. 95% full and have it run until 90%.
> >
> > But with that in place we also *should* do synchronous reclaim once we hit 100%. Just enough to make room for the store. This is important to catch the case where reclaim rate exceeds swapout rate. Rejecting and going to swap means the reclaimer will be throttled down to IO rate anyway, and the app latency isn't any worse. But this way we keep the pipeline alive, and keep swapping out the oldest zswap entries, instead of rejecting and swapping what would be the hottest ones.
>
> I fully agree with the thresholding code being weird, and with waking up the shrinker before the pool is full. What I don't understand is how we can do synchronous reclaim when we hit 100% and still respect the acceptance threshold :/
>
> Are you proposing we change the semantics of the acceptance threshold to begin with?

I kind of am. It's worth looking at the history of this knob.

It was added in 2020 by 45190f01dd402112d3d22c0ddc4152994f9e1e55, and from the changelogs and the code in this patch I do not understand how this was supposed to work.

It also *didn't* work for very basic real world applications. See Domenico's follow-up (e0228d590beb0d0af345c58a282f01afac5c57f3), which effectively reverted it to get halfway reasonable behavior.

If there are no good usecases for this knob, then I think it makes sense to phase it out again.
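For readers following the thread, the behavior being criticized is roughly the following pattern on zswap's store path (a paraphrase, not the exact source): hitting the limit both wakes the shrinker and rejects the same store, and further stores keep being rejected until the pool has dropped back below the acceptance threshold.

/* Paraphrase of the current limit handling in zswap_store(): */
if (zswap_is_full()) {
        zswap_pool_reached_full = true;
        goto shrink;            /* queue shrink_work on shrink_wq *and* reject */
}
if (zswap_pool_reached_full && !zswap_can_accept())
        goto shrink;            /* still above the acceptance threshold: reject */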
On Thu, Jan 18, 2024 at 9:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 09:06:43AM -0800, Yosry Ahmed wrote:
> > > > > On a different note, I wonder if it would help to perform synchronous reclaim here instead. With our current design, the zswap store failure (due to global limit hit) would leave the incoming page going to swap instead, creating an LRU inversion. Not sure if that's ideal.
> > > >
> > > > The global shrink path keeps reclaiming until zswap can accept again (by default, that means reclaiming 10% of the total limit). I think this is too expensive to be done synchronously.
> > >
> > > That thresholding code is a bit weird right now.
> > >
> > > It wakes the shrinker and rejects at the same time. We're guaranteed to see rejections, even if the shrinker has no trouble flushing some entries a split second later.
> > >
> > > It would make more sense to wake the shrinker at e.g. 95% full and have it run until 90%.
> > >
> > > But with that in place we also *should* do synchronous reclaim once we hit 100%. Just enough to make room for the store. This is important to catch the case where reclaim rate exceeds swapout rate. Rejecting and going to swap means the reclaimer will be throttled down to IO rate anyway, and the app latency isn't any worse. But this way we keep the pipeline alive, and keep swapping out the oldest zswap entries, instead of rejecting and swapping what would be the hottest ones.
> >
> > I fully agree with the thresholding code being weird, and with waking up the shrinker before the pool is full. What I don't understand is how we can do synchronous reclaim when we hit 100% and still respect the acceptance threshold :/
> >
> > Are you proposing we change the semantics of the acceptance threshold to begin with?
>
> I kind of am. It's worth looking at the history of this knob.
>
> It was added in 2020 by 45190f01dd402112d3d22c0ddc4152994f9e1e55, and from the changelogs and the code in this patch I do not understand how this was supposed to work.
>
> It also *didn't* work for very basic real world applications. See Domenico's follow-up (e0228d590beb0d0af345c58a282f01afac5c57f3), which effectively reverted it to get halfway reasonable behavior.
>
> If there are no good usecases for this knob, then I think it makes sense to phase it out again.

I am always nervous about removing/altering user visible knobs, but if you think it's fine then I am all for it.

I think it makes more sense to start writeback early to avoid the whole situation if possible, and synchronously reclaim a little bit if we hit 100%. I think the proactive writeback should reduce the amount of synchronous IO we need to do in reclaim as well, so we may see some latency improvements.
On Wed, Jan 17, 2024 at 11:13 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero <debug.penguin32@gmail.com> wrote:
>
> + Johannes and Yosry
>
> > The core-api create_workqueue is deprecated, this patch replaces the create_workqueue with alloc_workqueue. The previous implementation workqueue of zswap was a bounded workqueue, this patch uses alloc_workqueue() to create an unbounded workqueue. The WQ_UNBOUND attribute is desirable making the workqueue not localized to a specific cpu so that the scheduler is free to exercise improvisations in any demanding scenarios for offloading cpu time slices for workqueues.
>
> nit: extra space between paragraph would be nice.
>
> > For example if any other workqueues of the same primary cpu had to be served which are WQ_HIGHPRI and WQ_CPU_INTENSIVE. Also Unbound workqueue happens to be more efficient in a system during memory pressure scenarios in comparison to a bounded workqueue.
> >
> > shrink_wq = alloc_workqueue("zswap-shrink",
> >                 WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > Overall the change suggested in this patch should be seamless and does not alter the existing behavior, other than the improvisation to be an unbounded workqueue.
> >
> > Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> > ---
> >  mm/zswap.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 74411dfdad92..64dbe3e944a2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> >                 zswap_enabled = false;
> >         }
> >
> > -       shrink_wq = create_workqueue("zswap-shrink");
> > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> [...]
> > if (!shrink_wq)
> >         goto fallback_fail;
> >
> > --
> > 2.34.1

FWIW:
Acked-by: Nhat Pham <nphamcs@gmail.com>
On Thu, Jan 18, 2024 at 9:03 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Jan 18, 2024 at 8:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Jan 18, 2024 at 11:16:08AM -0500, Johannes Weiner wrote:
> > > > > On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> > > > > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > > > > >                 zswap_enabled = false;
> > > > > >         }
> > > > > >
> > > > > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > > > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > > > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > > What could make a difference though is the increased concurrency by switching max_active from 1 to 0. This could cause a higher rate of shrinker runs, which might increase lock contention and reclaim volume. That part would be good to double check with the shrinker benchmarks.
> >
> > Nevermind, I clearly can't read.
>
> Regardless of max_active, we only have one shrink_work per zswap pool, and we can only have one instance of the work running at any time, right?

I believe so, yeah. Well I guess you can have a weird setup where somehow multiple pools are full and submit shrink_work concurrently? But who does that :)

But let's just keep it as is to reduce our mental workload (i.e not having to keep track of what changes) would be ideal.

> >
> > Could still be worthwhile testing with the default 0, but it's not a concern in the patch as-is.
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
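The single-instance behavior relied on above comes from queue_work() itself: queueing a work item that is already pending returns false and does not create a second instance, so one shrink_work per pool never runs concurrently with itself regardless of max_active. A minimal illustration (the helper name is made up; the queueing pattern mirrors zswap's):

static void zswap_wake_shrinker(void)           /* hypothetical helper name */
{
        struct zswap_pool *pool = zswap_pool_last_get();

        /* queue_work() returns false if shrink_work is already pending, so
         * back-to-back wakeups collapse into a single shrinker run. */
        if (pool && !queue_work(shrink_wq, &pool->shrink_work))
                zswap_pool_put(pool);           /* already queued: drop the extra ref */
}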
On Thu, Jan 18, 2024 at 9:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 09:06:43AM -0800, Yosry Ahmed wrote:
> > > > > On a different note, I wonder if it would help to perform synchronous reclaim here instead. With our current design, the zswap store failure (due to global limit hit) would leave the incoming page going to swap instead, creating an LRU inversion. Not sure if that's ideal.
> > > >
> > > > The global shrink path keeps reclaiming until zswap can accept again (by default, that means reclaiming 10% of the total limit). I think this is too expensive to be done synchronously.
> > >
> > > That thresholding code is a bit weird right now.
> > >
> > > It wakes the shrinker and rejects at the same time. We're guaranteed to see rejections, even if the shrinker has no trouble flushing some entries a split second later.
> > >
> > > It would make more sense to wake the shrinker at e.g. 95% full and have it run until 90%.

Yep, we should be reclaiming zswap objects way ahead of the pool limit. Hence the new shrinker, which is memory pressure-driven (i.e independent of zswap internal limits), and will typically be triggered even if the pool is not full. During experiments, I never observe the pool becoming full, with the default settings. I'd be happy to extend it (or build in extra shrinking logic) to cover these pool limits too, if it turns out to be necessary.

> > >
> > > But with that in place we also *should* do synchronous reclaim once we hit 100%. Just enough to make room for the store. This is important to catch the case where reclaim rate exceeds swapout rate. Rejecting and going to swap means the reclaimer will be throttled down to IO rate anyway, and the app latency isn't any worse. But this way we keep the pipeline alive, and keep swapping out the oldest zswap entries, instead of rejecting and swapping what would be the hottest ones.
> >
> > I fully agree with the thresholding code being weird, and with waking up the shrinker before the pool is full. What I don't understand is how we can do synchronous reclaim when we hit 100% and still respect the acceptance threshold :/
> >
> > Are you proposing we change the semantics of the acceptance threshold to begin with?
>
> I kind of am. It's worth looking at the history of this knob.
>
> It was added in 2020 by 45190f01dd402112d3d22c0ddc4152994f9e1e55, and from the changelogs and the code in this patch I do not understand how this was supposed to work.
>
> It also *didn't* work for very basic real world applications. See Domenico's follow-up (e0228d590beb0d0af345c58a282f01afac5c57f3), which effectively reverted it to get halfway reasonable behavior.
>
> If there are no good usecases for this knob, then I think it makes sense to phase it out again.

Yeah this was my original proposal - remove this knob altogether :)

Based on a cursory read, it just seems like zswap was originally trying to shrink (synchronously) one "object", then try to check if the pool size is now under the limit. This is indeed insufficient. However, I'm not quite convinced by the solution (hysteresis) either.

Maybe we can synchronously shrink a la Domenico, i.e until the pool can accept new pages, but this time capacity-based (maybe under the limit + some headroom, 1 page for example)? This is just so that the immediate incoming zswap store succeeds - we can still have the shrinker freeing up space later on (or maybe keep an asynchronous pool-limit based shrinker around).
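A sketch of the capacity-based synchronous fallback floated above, with hypothetical helpers (illustration only, not a tested patch):

/* Hypothetical: before rejecting a store at the limit, synchronously write
 * back just enough of the coldest entries to fit the one incoming page,
 * leaving bulk reclaim to the asynchronous shrinker. */
static bool zswap_reserve_room(struct zswap_pool *pool, size_t need)
{
        while (zswap_pool_total_size + need > zswap_pool_limit()) {  /* hypothetical limit helper */
                if (zswap_writeback_one_entry(pool))  /* hypothetical, 0 on success */
                        return false;                 /* LRU empty or writeback failed */
        }
        return true;                                  /* room for this store */
}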
Thanks for the reviews.

This patch is available at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-zswap-improve-with-alloc_workqueue-call.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

The -mm tree is included into linux-next via the mm-everything branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Ronald Monthero <debug.penguin32@gmail.com>
Subject: mm/zswap: improve with alloc_workqueue() call
Date: Tue, 16 Jan 2024 23:31:45 +1000

The core-api create_workqueue is deprecated, this patch replaces the create_workqueue with alloc_workqueue. The previous implementation workqueue of zswap was a bounded workqueue, this patch uses alloc_workqueue() to create an unbounded workqueue. The WQ_UNBOUND attribute is desirable making the workqueue not localized to a specific cpu so that the scheduler is free to exercise improvisations in any demanding scenarios for offloading cpu time slices for workqueues. For example if any other workqueues of the same primary cpu had to be served which are WQ_HIGHPRI and WQ_CPU_INTENSIVE. Also Unbound workqueue happens to be more efficient in a system during memory pressure scenarios in comparison to a bounded workqueue.

shrink_wq = alloc_workqueue("zswap-shrink",
                WQ_UNBOUND|WQ_MEM_RECLAIM, 1);

Overall the change suggested in this patch should be seamless and does not alter the existing behavior, other than the improvisation to be an unbounded workqueue.

Link: https://lkml.kernel.org/r/20240116133145.12454-1-debug.penguin32@gmail.com
Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zswap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/zswap.c~mm-zswap-improve-with-alloc_workqueue-call
+++ a/mm/zswap.c
@@ -1884,7 +1884,8 @@ static int zswap_setup(void)
                zswap_enabled = false;
        }

-       shrink_wq = create_workqueue("zswap-shrink");
+       shrink_wq = alloc_workqueue("zswap-shrink",
+                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
        if (!shrink_wq)
                goto fallback_fail;
diff --git a/mm/zswap.c b/mm/zswap.c
index 74411dfdad92..64dbe3e944a2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1620,7 +1620,8 @@ static int zswap_setup(void)
                zswap_enabled = false;
        }

-       shrink_wq = create_workqueue("zswap-shrink");
+       shrink_wq = alloc_workqueue("zswap-shrink",
+                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
        if (!shrink_wq)
                goto fallback_fail;