From patchwork Wed Apr 19 05:51:58 2023
X-Patchwork-Submitter: "Daisuke Matsuda (Fujitsu)"
X-Patchwork-Id: 85191
From: Daisuke Matsuda
To: linux-rdma@vger.kernel.org, leonro@nvidia.com, jgg@nvidia.com, zyjzyj2000@gmail.com
Cc: linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com, yangx.jy@fujitsu.com, lizhijian@fujitsu.com, Daisuke Matsuda
Subject: [PATCH for-next v4 6/8] RDMA/rxe: Allow registering MRs for On-Demand Paging
Date: Wed, 19 Apr 2023 14:51:58 +0900
Message-Id: <962918aada29762772d1874b60c7a93891077059.1681882651.git.matsuda-daisuke@fujitsu.com>

Allow applications to register an ODP-enabled MR, in which case the flag
IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, no RDMA
operation is supported yet; they will be enabled in the subsequent two
patches.

rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. When
called with the RXE_PAGEFAULT_SNAPSHOT flag, it syncs the process address
space from the CPU page table to the driver page table (dma_list/pfn_list
in umem_odp). Additionally, it can be used to trigger a page fault when
the pages being accessed are not present or lack proper read/write
permissions, and possibly to prefetch pages in the future.
Signed-off-by: Daisuke Matsuda
---
 drivers/infiniband/sw/rxe/rxe.c       |   7 ++
 drivers/infiniband/sw/rxe/rxe_loc.h   |  14 +++
 drivers/infiniband/sw/rxe/rxe_mr.c    |   9 +-
 drivers/infiniband/sw/rxe/rxe_odp.c   | 120 ++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c  |  15 +++-
 drivers/infiniband/sw/rxe/rxe_verbs.c |   5 +-
 drivers/infiniband/sw/rxe/rxe_verbs.h |   2 +
 7 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 54c723a6edda..f2284d27229b 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -73,6 +73,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 			rxe->ndev->dev_addr);
 
 	rxe->max_ucontext		= RXE_MAX_UCONTEXT;
+
+	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
+		rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING;
+
+		/* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
+		rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
+	}
 }
 
 /* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 3476726425d9..0f91e23c151e 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -202,4 +202,18 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
 	return rxe_wr_opcode_info[opcode].mask[qp->ibqp.qp_type];
 }
 
+/* rxe_odp.c */
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
+			 u64 iova, int access_flags, struct rxe_mr *mr);
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+static inline int
+rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
+		     int access_flags, struct rxe_mr *mr)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
 #endif /* RXE_LOC_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index ffbac6f5e828..cd368cd096c8 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -318,7 +318,10 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
 		return err;
 	}
 
-	return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
+	if (mr->odp_enabled)
+		return -EOPNOTSUPP;
+	else
+		return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
 }
 
 /* copy data in or out of a wqe, i.e. sg list
@@ -527,6 +530,10 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 	struct page *page;
 	u64 *va;
 
+	/* ODP is not supported right now. WIP. */
+	if (mr->odp_enabled)
+		return RESPST_ERR_UNSUPPORTED_OPCODE;
+
 	/* See IBA oA19-28 */
 	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
 		rxe_dbg_mr(mr, "mr not in valid state");
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index b69b25e0fef6..e5497d09c399 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -24,6 +24,24 @@ static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
 	spin_unlock(&mr->page_list.xa_lock);
 }
 
+static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
+			      unsigned long end, unsigned long *pfn_list)
+{
+	unsigned long lower, upper, idx;
+	struct page *page;
+
+	lower = rxe_mr_iova_to_index(mr, start);
+	upper = rxe_mr_iova_to_index(mr, end);
+
+	/* make pages visible in xarray. no sleep while taking the lock */
+	spin_lock(&mr->page_list.xa_lock);
+	for (idx = lower; idx <= upper; idx++) {
+		page = hmm_pfn_to_page(pfn_list[idx]);
+		__xa_store(&mr->page_list, idx, page, GFP_ATOMIC);
+	}
+	spin_unlock(&mr->page_list.xa_lock);
+}
+
 static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
 				    const struct mmu_notifier_range *range,
 				    unsigned long cur_seq)
@@ -54,3 +72,105 @@ static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
 const struct mmu_interval_notifier_ops rxe_mn_ops = {
 	.invalidate = rxe_ib_invalidate_range,
 };
+
+#define RXE_PAGEFAULT_RDONLY BIT(1)
+#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
+static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
+{
+	int np;
+	u64 access_mask;
+	bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+	access_mask = ODP_READ_ALLOWED_BIT;
+	if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
+		access_mask |= ODP_WRITE_ALLOWED_BIT;
+
+	/*
+	 * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
+	 * Callers must release the lock later to let invalidation handler
+	 * do its work again.
+	 */
+	np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
+					  access_mask, fault);
+	if (np < 0)
+		return np;
+
+	/* umem_mutex is still locked here, so we can use hmm_pfn_to_page()
+	 * safely to fetch pages in the range.
+	 */
+	rxe_mr_set_xarray(mr, user_va, user_va + bcnt, umem_odp->pfn_list);
+
+	return np;
+}
+
+static int rxe_odp_init_pages(struct rxe_mr *mr)
+{
+	int ret;
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+	ret = rxe_odp_do_pagefault(mr, mr->umem->address, mr->umem->length,
+				   RXE_PAGEFAULT_SNAPSHOT);
+
+	if (ret >= 0)
+		mutex_unlock(&umem_odp->umem_mutex);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
+			 u64 iova, int access_flags, struct rxe_mr *mr)
+{
+	int err;
+	struct ib_umem_odp *umem_odp;
+
+	if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
+		return -EOPNOTSUPP;
+
+	rxe_mr_init(access_flags, mr);
+
+	xa_init(&mr->page_list);
+
+	if (!start && length == U64_MAX) {
+		if (iova != 0)
+			return -EINVAL;
+		if (!(rxe->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
+			return -EINVAL;
+
+		/* Never reach here, for implicit ODP is not implemented. */
+	}
+
+	umem_odp = ib_umem_odp_get(&rxe->ib_dev, start, length, access_flags,
+				   &rxe_mn_ops);
+	if (IS_ERR(umem_odp)) {
+		rxe_dbg_mr(mr, "Unable to create umem_odp err = %d\n",
+			   (int)PTR_ERR(umem_odp));
+		return PTR_ERR(umem_odp);
+	}
+
+	umem_odp->private = mr;
+
+	mr->odp_enabled = true;
+	mr->umem = &umem_odp->umem;
+	mr->access = access_flags;
+	mr->ibmr.length = length;
+	mr->ibmr.iova = iova;
+	mr->page_offset = ib_umem_offset(&umem_odp->umem);
+
+	err = rxe_odp_init_pages(mr);
+	if (err) {
+		ib_umem_odp_release(umem_odp);
+		return err;
+	}
+
+	err = rxe_mr_fill_pages_from_sgt(mr, &umem_odp->umem.sgt_append.sgt);
+	if (err) {
+		ib_umem_odp_release(umem_odp);
+		return err;
+	}
+
+	mr->state = RXE_MR_STATE_VALID;
+	mr->ibmr.type = IB_MR_TYPE_USER;
+
+	return err;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index f915128ed32a..b40c47477be3 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -641,6 +641,10 @@ static enum resp_states process_flush(struct rxe_qp *qp,
 	struct rxe_mr *mr = qp->resp.mr;
 	struct resp_res *res = qp->resp.res;
 
+	/* ODP is not supported right now. WIP. */
+	if (mr->odp_enabled)
+		return RESPST_ERR_UNSUPPORTED_OPCODE;
+
 	/* oA19-14, oA19-15 */
 	if (res && res->replay)
 		return RESPST_ACKNOWLEDGE;
@@ -694,10 +698,13 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 	if (!res->replay) {
 		u64 iova = qp->resp.va + qp->resp.offset;
 
-		err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
-					  atmeth_comp(pkt),
-					  atmeth_swap_add(pkt),
-					  &res->atomic.orig_val);
+		if (mr->odp_enabled)
+			err = RESPST_ERR_UNSUPPORTED_OPCODE;
+		else
+			err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
+						  atmeth_comp(pkt),
+						  atmeth_swap_add(pkt),
+						  &res->atomic.orig_val);
 
 		if (err)
 			return err;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index dea605b7f683..9c23defdc7b5 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -1274,7 +1274,10 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd, u64 start,
 	mr->ibmr.pd = ibpd;
 	mr->ibmr.device = ibpd->device;
 
-	err = rxe_mr_init_user(rxe, start, length, iova, access, mr);
+	if (access & IB_ACCESS_ON_DEMAND)
+		err = rxe_odp_mr_init_user(rxe, start, length, iova, access, mr);
+	else
+		err = rxe_mr_init_user(rxe, start, length, iova, access, mr);
 	if (err) {
 		rxe_dbg_mr(mr, "reg_user_mr failed, err = %d", err);
 		goto err_cleanup;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index b6fbd9b3d086..de5a982c7c7e 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -333,6 +333,8 @@ struct rxe_mr {
 	u32			nbuf;
 
 	struct xarray		page_list;
+
+	bool			odp_enabled;
 };
 
 static inline unsigned int mr_page_size(struct rxe_mr *mr)