Message ID | 20230804180529.2483231-6-aleksander.lobakin@intel.com |
---|---|
State | New |
Headers |
From: Alexander Lobakin <aleksander.lobakin@intel.com>
To: "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: Alexander Lobakin <aleksander.lobakin@intel.com>, Maciej Fijalkowski <maciej.fijalkowski@intel.com>, Larysa Zaremba <larysa.zaremba@intel.com>, Yunsheng Lin <linyunsheng@huawei.com>, Alexander Duyck <alexanderduyck@fb.com>, Jesper Dangaard Brouer <hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>, Simon Horman <simon.horman@corigine.com>, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH net-next v4 5/6] page_pool: add a lockdep check for recycling in hardirq
Date: Fri, 4 Aug 2023 20:05:28 +0200
Message-ID: <20230804180529.2483231-6-aleksander.lobakin@intel.com>
In-Reply-To: <20230804180529.2483231-1-aleksander.lobakin@intel.com>
References: <20230804180529.2483231-1-aleksander.lobakin@intel.com>
X-Mailer: git-send-email 2.41.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Series | page_pool: a couple of assorted optimizations |
Commit Message
Alexander Lobakin
Aug. 4, 2023, 6:05 p.m. UTC
From: Jakub Kicinski <kuba@kernel.org>

Page pool use in hardirq is prohibited, add debug checks
to catch misuses. IIRC we previously discussed using
DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
that people will have DEBUG_NET enabled in perf testing.
I don't think anyone enables lockdep in perf testing,
so use lockdep to avoid pushback and arguing :)

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/linux/lockdep.h | 7 +++++++
 net/core/page_pool.c | 2 ++
 2 files changed, 9 insertions(+)
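To illustrate the trade-off described above, here is a sketch of what a DEBUG_NET-based check might have looked like next to the lockdep-based assertion this patch actually adds; only the second macro comes from the patch, the first is hypothetical:

```c
/* Hypothetical DEBUG_NET variant (NOT what this patch does): it would be
 * compiled in whenever CONFIG_DEBUG_NET is set, which people do enable in
 * performance testing.
 */
#define page_pool_assert_no_hardirq_debug_net()				\
	DEBUG_NET_WARN_ON_ONCE(in_hardirq() || irqs_disabled())

/* What the patch adds instead (see the diff below): a lockdep assertion,
 * so the check only exists on lockdep-enabled debug builds and costs
 * nothing in production or perf configurations.
 */
#define lockdep_assert_no_hardirq()					\
do {									\
	WARN_ON_ONCE(__lockdep_enabled &&				\
		     (this_cpu_read(hardirq_context) ||			\
		      !this_cpu_read(hardirqs_enabled)));		\
} while (0)
```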
Comments
On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
> From: Jakub Kicinski <kuba@kernel.org>
>
> Page pool use in hardirq is prohibited, add debug checks
> to catch misuses. IIRC we previously discussed using
> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
> that people will have DEBUG_NET enabled in perf testing.
> I don't think anyone enables lockdep in perf testing,
> so use lockdep to avoid pushback and arguing :)
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
> include/linux/lockdep.h | 7 +++++++
> net/core/page_pool.c | 2 ++
> 2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index 310f85903c91..dc2844b071c2 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -625,6 +625,12 @@ do { \
> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
> } while (0)
>
> +#define lockdep_assert_no_hardirq() \
> +do { \
> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
> + !this_cpu_read(hardirqs_enabled))); \
> +} while (0)
> +
> #define lockdep_assert_preemption_enabled() \
> do { \
> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
> @@ -659,6 +665,7 @@ do { \
> # define lockdep_assert_irqs_enabled() do { } while (0)
> # define lockdep_assert_irqs_disabled() do { } while (0)
> # define lockdep_assert_in_irq() do { } while (0)
> +# define lockdep_assert_no_hardirq() do { } while (0)
>
> # define lockdep_assert_preemption_enabled() do { } while (0)
> # define lockdep_assert_preemption_disabled() do { } while (0)
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 03ad74d25959..77cb75e63aca 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -587,6 +587,8 @@ static __always_inline struct page *
> __page_pool_put_page(struct page_pool *pool, struct page *page,
> unsigned int dma_sync_size, bool allow_direct)
> {
> + lockdep_assert_no_hardirq();
> +
> /* This allocator is optimized for the XDP mode that uses
> * one-frame-per-page, but have fallbacks that act like the
> * regular page allocator APIs.

So two points.

First could we look at moving this inside the if statement just before
we return the page, as there isn't a risk until we get into that path
of needing a lock.

Secondly rather than returning an error is there any reason why we
couldn't just look at not returning page and instead just drop into the
release path which wouldn't take the locks in the first place? Either
that or I would even be good with some combination of the two where we
threw a warning, but still just dropped the page so we reduce our risk
further of actually locking things up.
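For illustration, a heavily simplified sketch of the two alternatives suggested here — this is not the real __page_pool_put_page() body, and page_can_be_recycled() is a made-up placeholder for the refcount/pfmemalloc checks:

```c
static __always_inline struct page *
__page_pool_put_page(struct page_pool *pool, struct page *page,
		     unsigned int dma_sync_size, bool allow_direct)
{
	if (page_can_be_recycled(page)) {	/* placeholder predicate */
		/* Alternative 1: assert only here, right before the page is
		 * handed back for recycling, since only the recycling paths
		 * (direct cache / ptr_ring) can end up taking a lock.
		 */
		lockdep_assert_no_hardirq();
		return page;	/* caller recycles it via the ptr_ring */
	}

	/* Alternative 2: instead of (or in addition to) warning, fall
	 * through to the plain release path, which takes no locks and is
	 * therefore safe even in hardirq context.
	 */
	page_pool_return_page(pool, page);
	return NULL;
}
```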
From: Alexander H Duyck <alexander.duyck@gmail.com>
Date: Mon, 07 Aug 2023 07:48:54 -0700

> On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
>> From: Jakub Kicinski <kuba@kernel.org>
>>
>> Page pool use in hardirq is prohibited, add debug checks
>> to catch misuses. IIRC we previously discussed using
>> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
>> that people will have DEBUG_NET enabled in perf testing.
>> I don't think anyone enables lockdep in perf testing,
>> so use lockdep to avoid pushback and arguing :)
>>
>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>> include/linux/lockdep.h | 7 +++++++
>> net/core/page_pool.c | 2 ++
>> 2 files changed, 9 insertions(+)
>>
>> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
>> index 310f85903c91..dc2844b071c2 100644
>> --- a/include/linux/lockdep.h
>> +++ b/include/linux/lockdep.h
>> @@ -625,6 +625,12 @@ do { \
>> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
>> } while (0)
>>
>> +#define lockdep_assert_no_hardirq() \
>> +do { \
>> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
>> + !this_cpu_read(hardirqs_enabled))); \
>> +} while (0)
>> +
>> #define lockdep_assert_preemption_enabled() \
>> do { \
>> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
>> @@ -659,6 +665,7 @@ do { \
>> # define lockdep_assert_irqs_enabled() do { } while (0)
>> # define lockdep_assert_irqs_disabled() do { } while (0)
>> # define lockdep_assert_in_irq() do { } while (0)
>> +# define lockdep_assert_no_hardirq() do { } while (0)
>>
>> # define lockdep_assert_preemption_enabled() do { } while (0)
>> # define lockdep_assert_preemption_disabled() do { } while (0)
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 03ad74d25959..77cb75e63aca 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -587,6 +587,8 @@ static __always_inline struct page *
>> __page_pool_put_page(struct page_pool *pool, struct page *page,
>> unsigned int dma_sync_size, bool allow_direct)
>> {
>> + lockdep_assert_no_hardirq();
>> +
>> /* This allocator is optimized for the XDP mode that uses
>> * one-frame-per-page, but have fallbacks that act like the
>> * regular page allocator APIs.
>
> So two points.
>
> First could we look at moving this inside the if statement just before
> we return the page, as there isn't a risk until we get into that path
> of needing a lock.
>
> Secondly rather than returning an error is there any reason why we
> couldn't just look at not returning page and instead just drop into the
> release path which wouldn't take the locks in the first place? Either

That is exception path to quickly catch broken drivers and fix them, why
bother? It's not something we have to live with.

> that or I would even be good with some combination of the two where we
> threw a warning, but still just dropped the page so we reduce our risk
> further of actually locking things up.

Thanks,
Olek
On Tue, Aug 8, 2023 at 6:16 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Alexander H Duyck <alexander.duyck@gmail.com>
> Date: Mon, 07 Aug 2023 07:48:54 -0700
>
> > On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
> >> From: Jakub Kicinski <kuba@kernel.org>
> >>
> >> Page pool use in hardirq is prohibited, add debug checks
> >> to catch misuses. IIRC we previously discussed using
> >> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
> >> that people will have DEBUG_NET enabled in perf testing.
> >> I don't think anyone enables lockdep in perf testing,
> >> so use lockdep to avoid pushback and arguing :)
> >>
> >> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> >> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> >> ---
> >> include/linux/lockdep.h | 7 +++++++
> >> net/core/page_pool.c | 2 ++
> >> 2 files changed, 9 insertions(+)
> >>
> >> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> >> index 310f85903c91..dc2844b071c2 100644
> >> --- a/include/linux/lockdep.h
> >> +++ b/include/linux/lockdep.h
> >> @@ -625,6 +625,12 @@ do { \
> >> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
> >> } while (0)
> >>
> >> +#define lockdep_assert_no_hardirq() \
> >> +do { \
> >> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
> >> + !this_cpu_read(hardirqs_enabled))); \
> >> +} while (0)
> >> +
> >> #define lockdep_assert_preemption_enabled() \
> >> do { \
> >> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
> >> @@ -659,6 +665,7 @@ do { \
> >> # define lockdep_assert_irqs_enabled() do { } while (0)
> >> # define lockdep_assert_irqs_disabled() do { } while (0)
> >> # define lockdep_assert_in_irq() do { } while (0)
> >> +# define lockdep_assert_no_hardirq() do { } while (0)
> >>
> >> # define lockdep_assert_preemption_enabled() do { } while (0)
> >> # define lockdep_assert_preemption_disabled() do { } while (0)
> >> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >> index 03ad74d25959..77cb75e63aca 100644
> >> --- a/net/core/page_pool.c
> >> +++ b/net/core/page_pool.c
> >> @@ -587,6 +587,8 @@ static __always_inline struct page *
> >> __page_pool_put_page(struct page_pool *pool, struct page *page,
> >> unsigned int dma_sync_size, bool allow_direct)
> >> {
> >> + lockdep_assert_no_hardirq();
> >> +
> >> /* This allocator is optimized for the XDP mode that uses
> >> * one-frame-per-page, but have fallbacks that act like the
> >> * regular page allocator APIs.
> >
> > So two points.
> >
> > First could we look at moving this inside the if statement just before
> > we return the page, as there isn't a risk until we get into that path
> > of needing a lock.
> >
> > Secondly rather than returning an error is there any reason why we
> > couldn't just look at not returning page and instead just drop into the
> > release path which wouldn't take the locks in the first place? Either
> >
> > That is exception path to quickly catch broken drivers and fix them, why
> > bother? It's not something we have to live with.

My concern is that the current "fix" consists of stalling a Tx ring.
We need to have a way to allow forward progress when somebody mixes
xdp_frame and skb traffic as I suspect we will end up with a number of
devices doing this since they cannot handle recycling the pages in
hardirq context.

The only reason why the skbs don't have the problem is that they are
queued and then cleaned up in the net_tx_action. That is why I wonder
if we shouldn't look at adding some sort of support for doing
something like that with xdp_frame as well. Something like a
dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.
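For reference, a simplified sketch of the deferral mechanism referred to here, roughly what dev_kfree_skb_any()/__dev_kfree_skb_irq() do today with refcounting details omitted; the key point is that an sk_buff has a ->next pointer to chain onto the per-CPU completion queue, while an xdp_frame has nothing equivalent:

```c
/* Simplified: from hardirq context the skb is only chained onto the
 * per-CPU completion queue; the actual freeing happens later in
 * net_tx_action(), i.e. in softirq context where taking locks is fine.
 */
static void dev_kfree_skb_irq_sketch(struct sk_buff *skb)
{
	unsigned long flags;

	local_irq_save(flags);
	skb->next = __this_cpu_read(softnet_data.completion_queue);
	__this_cpu_write(softnet_data.completion_queue, skb);
	raise_softirq_irqoff(NET_TX_SOFTIRQ);
	local_irq_restore(flags);
}
```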
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Tue, 8 Aug 2023 06:45:26 -0700

> On Tue, Aug 8, 2023 at 6:16 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Alexander H Duyck <alexander.duyck@gmail.com>
>> Date: Mon, 07 Aug 2023 07:48:54 -0700
>>
>>> On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
>>>> From: Jakub Kicinski <kuba@kernel.org>
>>>>
>>>> Page pool use in hardirq is prohibited, add debug checks
>>>> to catch misuses. IIRC we previously discussed using
>>>> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
>>>> that people will have DEBUG_NET enabled in perf testing.
>>>> I don't think anyone enables lockdep in perf testing,
>>>> so use lockdep to avoid pushback and arguing :)
>>>>
>>>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>>>> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>>> ---
>>>> include/linux/lockdep.h | 7 +++++++
>>>> net/core/page_pool.c | 2 ++
>>>> 2 files changed, 9 insertions(+)
>>>>
>>>> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
>>>> index 310f85903c91..dc2844b071c2 100644
>>>> --- a/include/linux/lockdep.h
>>>> +++ b/include/linux/lockdep.h
>>>> @@ -625,6 +625,12 @@ do { \
>>>> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
>>>> } while (0)
>>>>
>>>> +#define lockdep_assert_no_hardirq() \
>>>> +do { \
>>>> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
>>>> + !this_cpu_read(hardirqs_enabled))); \
>>>> +} while (0)
>>>> +
>>>> #define lockdep_assert_preemption_enabled() \
>>>> do { \
>>>> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
>>>> @@ -659,6 +665,7 @@ do { \
>>>> # define lockdep_assert_irqs_enabled() do { } while (0)
>>>> # define lockdep_assert_irqs_disabled() do { } while (0)
>>>> # define lockdep_assert_in_irq() do { } while (0)
>>>> +# define lockdep_assert_no_hardirq() do { } while (0)
>>>>
>>>> # define lockdep_assert_preemption_enabled() do { } while (0)
>>>> # define lockdep_assert_preemption_disabled() do { } while (0)
>>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>>> index 03ad74d25959..77cb75e63aca 100644
>>>> --- a/net/core/page_pool.c
>>>> +++ b/net/core/page_pool.c
>>>> @@ -587,6 +587,8 @@ static __always_inline struct page *
>>>> __page_pool_put_page(struct page_pool *pool, struct page *page,
>>>> unsigned int dma_sync_size, bool allow_direct)
>>>> {
>>>> + lockdep_assert_no_hardirq();
>>>> +
>>>> /* This allocator is optimized for the XDP mode that uses
>>>> * one-frame-per-page, but have fallbacks that act like the
>>>> * regular page allocator APIs.
>>>
>>> So two points.
>>>
>>> First could we look at moving this inside the if statement just before
>>> we return the page, as there isn't a risk until we get into that path
>>> of needing a lock.
>>>
>>> Secondly rather than returning an error is there any reason why we
>>> couldn't just look at not returning page and instead just drop into the
>>> release path which wouldn't take the locks in the first place? Either
>>
>> That is exception path to quickly catch broken drivers and fix them, why
>> bother? It's not something we have to live with.
>
> My concern is that the current "fix" consists of stalling a Tx ring.
> We need to have a way to allow forward progress when somebody mixes
> xdp_frame and skb traffic as I suspect we will end up with a number of
> devices doing this since they cannot handle recycling the pages in
> hardirq context.

You could've seen that several vendors already disabled recycling XDP
buffers when in hardirq (= netpoll) in their drivers. hardirq is in
general not for networking-related operations.

> The only reason why the skbs don't have the problem is that they are
> queued and then cleaned up in the net_tx_action. That is why I wonder
> if we shouldn't look at adding some sort of support for doing
> something like that with xdp_frame as well. Something like a
> dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.

I still don't get why we may need to clean XDP buffers in hardirq, maybe
someone could give me some links to read why we may need this and how
that happens? netpoll is a very specific thing for some debug
operations, isn't it? XDP shouldn't in general be enabled when this
happens, should it?

(unrelated: 6:58 AM West Coast, you use to wake up early or traveling?
:D)

Thanks,
Olek
On Tue, Aug 8, 2023 at 6:59 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Tue, 8 Aug 2023 06:45:26 -0700
>
> > On Tue, Aug 8, 2023 at 6:16 AM Alexander Lobakin
> > <aleksander.lobakin@intel.com> wrote:
> >>
> >> From: Alexander H Duyck <alexander.duyck@gmail.com>
> >> Date: Mon, 07 Aug 2023 07:48:54 -0700
> >>
> >>> On Fri, 2023-08-04 at 20:05 +0200, Alexander Lobakin wrote:
> >>>> From: Jakub Kicinski <kuba@kernel.org>
> >>>>
> >>>> Page pool use in hardirq is prohibited, add debug checks
> >>>> to catch misuses. IIRC we previously discussed using
> >>>> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
> >>>> that people will have DEBUG_NET enabled in perf testing.
> >>>> I don't think anyone enables lockdep in perf testing,
> >>>> so use lockdep to avoid pushback and arguing :)
> >>>>
> >>>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> >>>> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
> >>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> >>>> ---
> >>>> include/linux/lockdep.h | 7 +++++++
> >>>> net/core/page_pool.c | 2 ++
> >>>> 2 files changed, 9 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> >>>> index 310f85903c91..dc2844b071c2 100644
> >>>> --- a/include/linux/lockdep.h
> >>>> +++ b/include/linux/lockdep.h
> >>>> @@ -625,6 +625,12 @@ do { \
> >>>> WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
> >>>> } while (0)
> >>>>
> >>>> +#define lockdep_assert_no_hardirq() \
> >>>> +do { \
> >>>> + WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
> >>>> + !this_cpu_read(hardirqs_enabled))); \
> >>>> +} while (0)
> >>>> +
> >>>> #define lockdep_assert_preemption_enabled() \
> >>>> do { \
> >>>> WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
> >>>> @@ -659,6 +665,7 @@ do { \
> >>>> # define lockdep_assert_irqs_enabled() do { } while (0)
> >>>> # define lockdep_assert_irqs_disabled() do { } while (0)
> >>>> # define lockdep_assert_in_irq() do { } while (0)
> >>>> +# define lockdep_assert_no_hardirq() do { } while (0)
> >>>>
> >>>> # define lockdep_assert_preemption_enabled() do { } while (0)
> >>>> # define lockdep_assert_preemption_disabled() do { } while (0)
> >>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >>>> index 03ad74d25959..77cb75e63aca 100644
> >>>> --- a/net/core/page_pool.c
> >>>> +++ b/net/core/page_pool.c
> >>>> @@ -587,6 +587,8 @@ static __always_inline struct page *
> >>>> __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>>> unsigned int dma_sync_size, bool allow_direct)
> >>>> {
> >>>> + lockdep_assert_no_hardirq();
> >>>> +
> >>>> /* This allocator is optimized for the XDP mode that uses
> >>>> * one-frame-per-page, but have fallbacks that act like the
> >>>> * regular page allocator APIs.
> >>>
> >>> So two points.
> >>>
> >>> First could we look at moving this inside the if statement just before
> >>> we return the page, as there isn't a risk until we get into that path
> >>> of needing a lock.
> >>>
> >>> Secondly rather than returning an error is there any reason why we
> >>> couldn't just look at not returning page and instead just drop into the
> >>> release path which wouldn't take the locks in the first place? Either
> >>
> >> That is exception path to quickly catch broken drivers and fix them, why
> >> bother? It's not something we have to live with.
> >
> > My concern is that the current "fix" consists of stalling a Tx ring.
> > We need to have a way to allow forward progress when somebody mixes
> > xdp_frame and skb traffic as I suspect we will end up with a number of
> > devices doing this since they cannot handle recycling the pages in
> > hardirq context.
>
> You could've seen that several vendors already disabled recycling XDP
> buffers when in hardirq (= netpoll) in their drivers. hardirq is in
> general not for networking-related operations.

The whole idea behind the netpoll cleanup is to get the Tx buffers out
of the way so that we can transmit even after the system has crashed.
The idea isn't to transmit XDP buffers, but to get the buffers out of
the way in the cases where somebody is combining both xdp_frame and
sk_buff on the same queue due to a limited number of rings being
present on the device.

My concern is that at some point in the near future somebody is going
to have a system crash and instead of being able to get the crash log
message out via their netconsole it is going to get cut off because
the driver stopped cleaning the Tx ring because somebody was also
using it as an XDP redirect destination.

> > The only reason why the skbs don't have the problem is that they are
> > queued and then cleaned up in the net_tx_action. That is why I wonder
> > if we shouldn't look at adding some sort of support for doing
> > something like that with xdp_frame as well. Something like a
> > dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.
>
> I still don't get why we may need to clean XDP buffers in hardirq, maybe
> someone could give me some links to read why we may need this and how
> that happens? netpoll is a very specific thing for some debug
> operations, isn't it? XDP shouldn't in general be enabled when this
> happens, should it?

I think I kind of explained it above. It isn't so much about cleaning
the XDP buffers as getting them off of the ring and out of the way. If
we block a Tx queue because of an XDP buffer then we cannot use that
Tx queue. I would be good with us just deferring the cleanup like we
do with an sk_buff in dev_kfree_skb_irq, the only issue is we don't
have the ability to put them on a queue since they don't have
prev/next pointers.

I suppose an alternative to cleaning them might be to make a mandatory
requirement that you cannot support netpoll and mix xdp_frame and
sk_buff on the same queue. If we enforced that then my concern about
them blocking a queue would be addressed.

> (unrelated: 6:58 AM West Coast, you use to wake up early or traveling?
> :D)

I am usually up pretty early, especially this time of year. Sunrise
here is 6AM and I am usually up a little before that.. :)
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Tue, 8 Aug 2023 07:52:32 -0700

> On Tue, Aug 8, 2023 at 6:59 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Alexander Duyck <alexander.duyck@gmail.com>
>> Date: Tue, 8 Aug 2023 06:45:26 -0700

[...]

>>>>> Secondly rather than returning an error is there any reason why we
>>>>> couldn't just look at not returning page and instead just drop into the
>>>>> release path which wouldn't take the locks in the first place? Either
>>>>
>>>> That is exception path to quickly catch broken drivers and fix them, why
>>>> bother? It's not something we have to live with.
>>>
>>> My concern is that the current "fix" consists of stalling a Tx ring.
>>> We need to have a way to allow forward progress when somebody mixes
>>> xdp_frame and skb traffic as I suspect we will end up with a number of
>>> devices doing this since they cannot handle recycling the pages in
>>> hardirq context.
>>
>> You could've seen that several vendors already disabled recycling XDP
>> buffers when in hardirq (= netpoll) in their drivers. hardirq is in
>> general not for networking-related operations.
>
> The whole idea behind the netpoll cleanup is to get the Tx buffers out
> of the way so that we can transmit even after the system has crashed.
> The idea isn't to transmit XDP buffers, but to get the buffers out of
> the way in the cases where somebody is combining both xdp_frame and
> sk_buff on the same queue due to a limited number of rings being
> present on the device.

I see now, thanks a lot!

> My concern is that at some point in the near future somebody is going
> to have a system crash and instead of being able to get the crash log
> message out via their netconsole it is going to get cut off because
> the driver stopped cleaning the Tx ring because somebody was also
> using it as an XDP redirect destination.
>
>>>
>>> The only reason why the skbs don't have the problem is that they are
>>> queued and then cleaned up in the net_tx_action. That is why I wonder
>>> if we shouldn't look at adding some sort of support for doing
>>> something like that with xdp_frame as well. Something like a
>>> dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.
>>
>> I still don't get why we may need to clean XDP buffers in hardirq, maybe
>> someone could give me some links to read why we may need this and how
>> that happens? netpoll is a very specific thing for some debug
>> operations, isn't it? XDP shouldn't in general be enabled when this
>> happens, should it?
>
> I think I kind of explained it above. It isn't so much about cleaning
> the XDP buffers as getting them off of the ring and out of the way. If
> we block a Tx queue because of an XDP buffer then we cannot use that
> Tx queue. I would be good with us just deferring the cleanup like we
> do with an sk_buff in dev_kfree_skb_irq, the only issue is we don't
> have the ability to put them on a queue since they don't have
> prev/next pointers.
>
> I suppose an alternative to cleaning them might be to make a mandatory
> requirement that you cannot support netpoll and mix xdp_frame and
> sk_buff on the same queue. If we enforced that then my concern about
> them blocking a queue would be addressed.

I'm leaning more towards this one TBH. I don't feel sole netpoll as
a solid argument for introducing XDP frame deferred queues :s

>> (unrelated: 6:58 AM West Coast, you use to wake up early or traveling?
>> :D)
>
> I am usually up pretty early, especially this time of year. Sunrise
> here is 6AM and I am usually up a little before that.. :)

Nice!

Thanks,
Olek
On Tue, Aug 8, 2023 at 8:06 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Tue, 8 Aug 2023 07:52:32 -0700
>
> > On Tue, Aug 8, 2023 at 6:59 AM Alexander Lobakin
> > <aleksander.lobakin@intel.com> wrote:
> >>
> >> From: Alexander Duyck <alexander.duyck@gmail.com>
> >> Date: Tue, 8 Aug 2023 06:45:26 -0700
>
> [...]
>
> >>>>> Secondly rather than returning an error is there any reason why we
> >>>>> couldn't just look at not returning page and instead just drop into the
> >>>>> release path which wouldn't take the locks in the first place? Either
> >>>>
> >>>> That is exception path to quickly catch broken drivers and fix them, why
> >>>> bother? It's not something we have to live with.
> >>>
> >>> My concern is that the current "fix" consists of stalling a Tx ring.
> >>> We need to have a way to allow forward progress when somebody mixes
> >>> xdp_frame and skb traffic as I suspect we will end up with a number of
> >>> devices doing this since they cannot handle recycling the pages in
> >>> hardirq context.
> >>
> >> You could've seen that several vendors already disabled recycling XDP
> >> buffers when in hardirq (= netpoll) in their drivers. hardirq is in
> >> general not for networking-related operations.
> >
> > The whole idea behind the netpoll cleanup is to get the Tx buffers out
> > of the way so that we can transmit even after the system has crashed.
> > The idea isn't to transmit XDP buffers, but to get the buffers out of
> > the way in the cases where somebody is combining both xdp_frame and
> > sk_buff on the same queue due to a limited number of rings being
> > present on the device.
>
> I see now, thanks a lot!
> >
> > My concern is that at some point in the near future somebody is going
> > to have a system crash and instead of being able to get the crash log
> > message out via their netconsole it is going to get cut off because
> > the driver stopped cleaning the Tx ring because somebody was also
> > using it as an XDP redirect destination.
> >
> >>>
> >>> The only reason why the skbs don't have the problem is that they are
> >>> queued and then cleaned up in the net_tx_action. That is why I wonder
> >>> if we shouldn't look at adding some sort of support for doing
> >>> something like that with xdp_frame as well. Something like a
> >>> dev_kfree_pp_page_any to go along with the dev_kfree_skb_any.
> >>
> >> I still don't get why we may need to clean XDP buffers in hardirq, maybe
> >> someone could give me some links to read why we may need this and how
> >> that happens? netpoll is a very specific thing for some debug
> >> operations, isn't it? XDP shouldn't in general be enabled when this
> >> happens, should it?
> >
> > I think I kind of explained it above. It isn't so much about cleaning
> > the XDP buffers as getting them off of the ring and out of the way. If
> > we block a Tx queue because of an XDP buffer then we cannot use that
> > Tx queue. I would be good with us just deferring the cleanup like we
> > do with an sk_buff in dev_kfree_skb_irq, the only issue is we don't
> > have the ability to put them on a queue since they don't have
> > prev/next pointers.
> >
> > I suppose an alternative to cleaning them might be to make a mandatory
> > requirement that you cannot support netpoll and mix xdp_frame and
> > sk_buff on the same queue. If we enforced that then my concern about
> > them blocking a queue would be addressed.
>
> I'm leaning more towards this one TBH. I don't feel sole netpoll as
> a solid argument for introducing XDP frame deferred queues :s

That was kind of my line of thought as well. That is why I was thinking
that instead of bothering with a queue it might work just as well to
just throw all recycling out the window and just call put_page if we
are dealing with XDP in netpoll and just force it into the free path.
Then it becomes more of an "_any" type handler.
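A rough sketch of the kind of "_any"-style handler floated in this thread; the name dev_kfree_pp_page_any() is taken from the suggestion above and does not exist in the tree, and DMA-unmap bookkeeping is ignored, so this is a shape of the idea rather than a working design:

```c
/* Hypothetical helper: recycle through the page_pool only when it is safe
 * to do so; otherwise give up on recycling and push the page straight into
 * the plain free path, which takes no locks and is usable from
 * netpoll/hardirq context.
 */
static inline void dev_kfree_pp_page_any(struct page_pool *pool,
					 struct page *page)
{
	if (in_hardirq() || irqs_disabled())
		put_page(page);		/* plain free, no recycling */
	else
		page_pool_put_full_page(pool, page, false);
}
```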
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 310f85903c91..dc2844b071c2 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -625,6 +625,12 @@ do { \
 	WARN_ON_ONCE(__lockdep_enabled && !this_cpu_read(hardirq_context)); \
 } while (0)
 
+#define lockdep_assert_no_hardirq() \
+do { \
+	WARN_ON_ONCE(__lockdep_enabled && (this_cpu_read(hardirq_context) || \
+					   !this_cpu_read(hardirqs_enabled))); \
+} while (0)
+
 #define lockdep_assert_preemption_enabled() \
 do { \
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
@@ -659,6 +665,7 @@ do { \
 # define lockdep_assert_irqs_enabled() do { } while (0)
 # define lockdep_assert_irqs_disabled() do { } while (0)
 # define lockdep_assert_in_irq() do { } while (0)
+# define lockdep_assert_no_hardirq() do { } while (0)
 
 # define lockdep_assert_preemption_enabled() do { } while (0)
 # define lockdep_assert_preemption_disabled() do { } while (0)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 03ad74d25959..77cb75e63aca 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -587,6 +587,8 @@ static __always_inline struct page *
 __page_pool_put_page(struct page_pool *pool, struct page *page,
 		     unsigned int dma_sync_size, bool allow_direct)
 {
+	lockdep_assert_no_hardirq();
+
 	/* This allocator is optimized for the XDP mode that uses
 	 * one-frame-per-page, but have fallbacks that act like the
 	 * regular page allocator APIs.
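As a usage note, a minimal sketch of the contract the new assertion encodes — recycling into a page_pool is expected from softirq/NAPI (BH) context, never from hardirq; the driver-side function names below are invented for illustration:

```c
/* Fine: XDP Tx completion inside the NAPI poll loop runs in softirq (BH)
 * context, so returning pages to the pool (even into the direct cache via
 * allow_direct) is allowed and the assertion stays silent.
 */
static void example_xdp_tx_clean_napi(struct page_pool *pool, struct page *page)
{
	page_pool_put_full_page(pool, page, true);
}

/* Caught: the same cleanup reached via netpoll runs in hardirq context (or
 * with IRQs disabled), so with lockdep enabled the new
 * lockdep_assert_no_hardirq() in __page_pool_put_page() will warn here.
 */
static void example_xdp_tx_clean_netpoll(struct page_pool *pool, struct page *page)
{
	page_pool_put_full_page(pool, page, false);
}
```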