Message ID | 20230727144336.1646454-7-aleksander.lobakin@intel.com |
---|---|
State | New |
Headers | From: Alexander Lobakin <aleksander.lobakin@intel.com>
To: "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: Alexander Lobakin <aleksander.lobakin@intel.com>, Maciej Fijalkowski <maciej.fijalkowski@intel.com>, Larysa Zaremba <larysa.zaremba@intel.com>, Yunsheng Lin <linyunsheng@huawei.com>, Alexander Duyck <alexanderduyck@fb.com>, Jesper Dangaard Brouer <hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>, Simon Horman <simon.horman@corigine.com>, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH net-next 6/9] page_pool: avoid calling no-op externals when possible
Date: Thu, 27 Jul 2023 16:43:33 +0200
Message-ID: <20230727144336.1646454-7-aleksander.lobakin@intel.com>
In-Reply-To: <20230727144336.1646454-1-aleksander.lobakin@intel.com>
References: <20230727144336.1646454-1-aleksander.lobakin@intel.com> |
Series | page_pool: a couple of assorted optimizations |
Commit Message
Alexander Lobakin
July 27, 2023, 2:43 p.m. UTC
It turns out page_pool_put{,_full}_page() can burn quite a few cycles
even on DMA-coherent platforms (like x86) with no active IOMMU or
swiotlb, purely on the call ladder.
Indeed, it's
page_pool_put_page()
page_pool_put_defragged_page() <- external
__page_pool_put_page()
page_pool_dma_sync_for_device() <- non-inline
dma_sync_single_range_for_device()
dma_sync_single_for_device() <- external
dma_direct_sync_single_for_device()
dev_is_dma_coherent() <- exit
For the inline functions, there is no guarantee the compiler won't
uninline them (they're clearly not one-liners, and compilers sometimes
uninline even 2 + 2). The first external call is necessary, but the
remaining two or more are done for nothing each time, plus a bunch of
checks here and there.
Since Page Pool mappings are long-term and dma_need_sync() will always
return the same value for a given "device + addr" pair (basically,
whether it belongs to a swiotlb pool), addresses can be tested once
right after they're obtained and the result can be reused until the
page is unmapped.
Define a new PP DMA sync operation type meaning "do DMA syncs for the
device, but only when needed" and enable it by default when the driver
asks to sync pages. When a page is mapped, check whether it needs
syncs and, if so, switch the whole pool from "sync when needed" back
to "always do syncs" (better safe than sorry). As long as the pool has
no pages requiring DMA syncs, this cuts off a good chunk of calls and
checks. Once at least one page requires them, the pool conservatively
falls back to "always call the sync functions", with no per-page
verdicts. It's a fairly rare case anyway that only a few pages would
require syncing.
On my x86_64, this gives a 2% to 5% performance benefit, with no
negative impact in cases where the IOMMU is on and the shortcut can't
be used.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
include/net/page_pool/types.h | 3 ++-
net/core/page_pool.c | 7 ++++++-
2 files changed, 8 insertions(+), 2 deletions(-)
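
As a reading aid, here is a condensed C sketch of the map-time check the commit message describes. The enum and field names are taken from the diff at the bottom of this page; the function body is simplified and is not the verbatim kernel code.

/* Condensed sketch of page_pool_dma_map() with the new check; simplified,
 * not the exact kernel function.
 */
static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
{
	dma_addr_t dma;

	/* Page Pool mappings are long-term: map once, keep until the page
	 * leaves the pool.
	 */
	dma = dma_map_page_attrs(pool->p.dev, page, 0,
				 PAGE_SIZE << pool->p.order,
				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
	if (dma_mapping_error(pool->p.dev, dma))
		return false;

	page_pool_set_dma_addr(page, dma);

	/* Test the address once, right after it is obtained: if any page of
	 * this pool ever needs syncs, fall back to "always sync" for the
	 * whole pool -- no per-page verdicts.
	 */
	if (pool->dma_sync_act == PP_DMA_SYNC_ACT_SKIP &&
	    dma_need_sync(pool->p.dev, dma))
		pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;

	return true;
}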
Comments
On 2023/7/27 22:43, Alexander Lobakin wrote:

...

It seems other subsystems may have a similar problem to page_pool. Is it possible to implement this kind of trick in the DMA subsystem instead of every subsystem inventing its own?
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Fri, 28 Jul 2023 20:39:24 +0800

...

> It seems other subsystems may have a similar problem to page_pool.
> Is it possible to implement this kind of trick in the DMA subsystem
> instead of every subsystem inventing its own?

In the ladder I described above, most of the overhead comes from jumping between Page Pool functions, not the generic DMA ones. Say I do this shortcut in dma_sync_single_range_for_device(): that is already too late to count on any good CPU saves.
Plus, the DMA sync API operates on dma_addr_t, not struct page. IOW, it's not clear to me where to store this "we can shortcut" bit in that case.

From "other subsystems" I remember only XDP sockets. There, they also avoid calling their own non-inline functions in the first place, not the generic DMA ones. So I'd say both cases (PP and XSk) can't be solved via some "generic" solution.

Thanks,
Olek
On 2023/7/28 22:14, Alexander Lobakin wrote:

...

> In the ladder I described above, most of the overhead comes from
> jumping between Page Pool functions, not the generic DMA ones. Say I
> do this shortcut in dma_sync_single_range_for_device(): that is
> already too late to count on any good CPU saves.

We can force-inline the page_pool_dma_sync_for_device() function if that is 'the good CPU saves' you mention above.

> Plus, the DMA sync API operates on dma_addr_t, not struct page. IOW,
> it's not clear to me where to store this "we can shortcut" bit in
> that case.

It seems we only need one bit in 'struct device' to do the 'shortcut', and there seems to be an available bit at the end of 'struct device'?

Is it possible to do something like this patch does in dma_sync_single_range_for_device()?

One thing to note is that there may be multiple concurrent callers of dma_sync_single_range_for_device(), which differs from the atomic context of page_pool_dma_map(), so the state change may need some atomic operation if we want to implement it in a 'generic' way.

> From "other subsystems" I remember only XDP sockets. There, they also
> avoid calling their own non-inline functions in the first place, not
> the generic DMA ones. So I'd say both cases (PP and XSk) can't be
> solved via some "generic" solution.

If PP and XSk both have a similar trick, isn't that a clearer sign that it may be solved via some "generic" solution?

Is there any reason there is no similar trick for sync-for-CPU in XSk, as the code below indicates?
https://elixir.free-electrons.com/linux/v6.4-rc6/source/include/net/xsk_buff_pool.h#L152
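
For reference, the XSk shortcut the thread keeps coming back to is built around a per-pool boolean cached at DMA-map time. The snippet below is a rough paraphrase based on include/net/xsk_buff_pool.h from that kernel era, not a verbatim quote of the header.

/* Paraphrased XSk device-sync shortcut: the per-pool dma_need_sync flag is
 * filled in once when the umem is DMA-mapped, so the hot path can skip the
 * out-of-line helper entirely on coherent setups.
 */
static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
					  dma_addr_t dma, size_t size)
{
	if (!pool->dma_need_sync)
		return;

	xp_dma_sync_for_device_slow(pool, dma, size);
}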
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Sat, 29 Jul 2023 20:46:22 +0800

...

> It seems we only need one bit in 'struct device' to do the
> 'shortcut', and there seems to be an available bit at the end of
> 'struct device'?

dma_need_sync() can return different results for two different DMA addresses within the same device.

...

> If PP and XSk both have a similar trick, isn't that a clearer sign
> that it may be solved via some "generic" solution?

Both shortcut their own functions in the first place, so I don't know what a generic solution could be for optimizing non-generic functions.

Thanks,
Olek
On 2023/8/1 21:42, Alexander Lobakin wrote:

...

>> It seems we only need one bit in 'struct device' to do the
>> 'shortcut', and there seems to be an available bit at the end of
>> 'struct device'?
>
> dma_need_sync() can return different results for two different DMA
> addresses within the same device.

Yes, that's why we need per-device state in order to do a trick similar to what this patch does.

>> If PP and XSk both have a similar trick, isn't that a clearer sign
>> that it may be solved via some "generic" solution?
>
> Both shortcut their own functions in the first place, so I don't know
> what a generic solution could be for optimizing non-generic functions.

If we are able to shortcut the generic functions, then for the page_pool and XSk cases it seems the non-generic functions just need to be inlined, if I understand your concern correctly.

And with that, we may be able to shortcut the generic functions for dma_sync_single_range_for_device() as used in drivers too?

>> Is there any reason there is no similar trick for sync-for-CPU in
>> XSk, as the code below indicates?
>> https://elixir.free-electrons.com/linux/v6.4-rc6/source/include/net/xsk_buff_pool.h#L152
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Wed, 2 Aug 2023 19:37:35 +0800

...

>> dma_need_sync() can return different results for two different DMA
>> addresses within the same device.
>
> Yes, that's why we need per-device state in order to do a trick
> similar to what this patch does.

The per-device state is dev_is_dma_coherent() [0]. The rest is address-dependent (or almost).

> If we are able to shortcut the generic functions, then for the
> page_pool and XSk cases it seems the non-generic functions just need
> to be inlined, if I understand your concern correctly.
>
> And with that, we may be able to shortcut the generic functions for
> dma_sync_single_range_for_device() as used in drivers too?

dma_sync_single_range_for_device() takes @dev, @addr, @offset, @size and @dir. Where are you planning to put/get a flag indicating whether the sync is [not] needed?

The thing with the PP shortcut, or the XSk shortcut, is that they're local. We have a limited number of DMA mappings with a certain direction and for a certain purpose. A global device flag with the same logic (i.e. one which gets disabled as soon as you hit the first address requiring a sync) is way less optimal. The same device you use for Page Pool is also used for Tx mappings. There, you can get a kmalloced skb head, a highmem frag page, or anything else more complicated than the pure lowmem low-order pages we use for Rx or XSk pools. So some odd skb will eventually disable your shortcut, even though it was perfectly legit for your Page Pool.

>> Is there any reason there is no similar trick for sync-for-CPU in
>> XSk, as the code below indicates?
>> https://elixir.free-electrons.com/linux/v6.4-rc6/source/include/net/xsk_buff_pool.h#L152

[0] https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/dma-map-ops.h#L268

Thanks,
Olek
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index dd26f4b2b66c..9134fb458cb2 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -97,8 +97,9 @@ struct page_pool {
 	bool dma_map:1;				/* Perform DMA mapping */
 	enum {
 		PP_DMA_SYNC_ACT_DISABLED = 0,	/* Driver didn't ask to sync */
+		PP_DMA_SYNC_ACT_SKIP,		/* Syncs can be skipped */
 		PP_DMA_SYNC_ACT_DO,		/* Perform DMA sync ops */
-	} dma_sync_act:1;
+	} dma_sync_act:2;
 	bool page_frag:1;			/* Allow page fragments */
 
 	long frag_users;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 6a8f105e2df5..529e4b41e9eb 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -197,7 +197,8 @@ static int page_pool_init(struct page_pool *pool,
 		if (!pool->p.max_len)
 			return -EINVAL;
 
-		pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;
+		/* Try to avoid calling no-op syncs */
+		pool->dma_sync_act = PP_DMA_SYNC_ACT_SKIP;
 
 		/* pool->p.offset has to be set according to the address
 		 * offset used by the DMA engine to start copying rx data
@@ -350,6 +351,10 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 
 	page_pool_set_dma_addr(page, dma);
 
+	if (pool->dma_sync_act == PP_DMA_SYNC_ACT_SKIP &&
+	    dma_need_sync(pool->p.dev, dma))
+		pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;
+
 	if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
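
To tie the diff back to the savings claimed in the commit message, here is a hedged sketch of the hot-path gate that consumes dma_sync_act. The helper name below is ours for illustration only; in the kernel the equivalent check sits inline in net/core/page_pool.c around page_pool_dma_sync_for_device(), as the last hunk above shows.

/* Simplified illustration of the hot-path gate: while dma_sync_act stays at
 * PP_DMA_SYNC_ACT_SKIP, the external dma_sync_single_*() call ladder is
 * avoided entirely on every page recycle.
 */
static void pp_maybe_sync_for_device(struct page_pool *pool,
				     struct page *page, u32 dma_sync_size)
{
	if (pool->dma_sync_act != PP_DMA_SYNC_ACT_DO)
		return;

	dma_sync_single_range_for_device(pool->p.dev,
					 page_pool_get_dma_addr(page),
					 pool->p.offset, dma_sync_size,
					 pool->p.dma_dir);
}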