Message ID | 20221017171118.1588820-1-sammler@google.com |
---|---|
State | New |
Headers |
Date: Mon, 17 Oct 2022 17:11:18 +0000
Subject: [PATCH v1] virtio_pmem: populate numa information
From: Michael Sammler <sammler@google.com>
To: Pankaj Gupta <pankaj.gupta.linux@gmail.com>, Dan Williams <dan.j.williams@intel.com>, Vishal Verma <vishal.l.verma@intel.com>, Dave Jiang <dave.jiang@intel.com>, Ira Weiny <ira.weiny@intel.com>, Pasha Tatashin <pasha.tatashin@soleen.com>, nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org
Cc: Michael Sammler <sammler@google.com>
Message-ID: <20221017171118.1588820-1-sammler@google.com> |
Series | [v1] virtio_pmem: populate numa information |
Commit Message
Michael Sammler
Oct. 17, 2022, 5:11 p.m. UTC
Compute the numa information for a virtio_pmem device from the memory
range of the device. Previously, the target_node was always 0 since
the ndr_desc.target_node field was never explicitly set. The code for
computing the numa node is taken from cxl_pmem_region_probe in
drivers/cxl/pmem.c.
Signed-off-by: Michael Sammler <sammler@google.com>
---
drivers/nvdimm/virtio_pmem.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
--
2.38.0.413.g74048e4d9e-goog
Comments
> Compute the numa information for a virtio_pmem device from the memory
> range of the device. Previously, the target_node was always 0 since
> the ndr_desc.target_node field was never explicitly set. The code for
> computing the numa node is taken from cxl_pmem_region_probe in
> drivers/cxl/pmem.c.
>
> Signed-off-by: Michael Sammler <sammler@google.com>
> [...]
> +	ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> +	ndr_desc.target_node = phys_to_target_node(res.start);
> +	if (ndr_desc.target_node == NUMA_NO_NODE) {
> +		ndr_desc.target_node = ndr_desc.numa_node;
> +		dev_dbg(&vdev->dev, "changing target node from %d to %d",
> +			NUMA_NO_NODE, ndr_desc.target_node);
> +	}

As this memory later gets hotplugged using "devm_memremap_pages", I
don't see if 'target_node' is used for the fsdax case?

It seems to me "target_node" is used mainly for a volatile range above
persistent memory (e.g. the kmem driver?).

Thanks,
Pankaj
Hi Pankaj,

Thank you for looking at the patch.

> > [...]
> > +	ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> > +	ndr_desc.target_node = phys_to_target_node(res.start);
> > +	if (ndr_desc.target_node == NUMA_NO_NODE) {
> > +		ndr_desc.target_node = ndr_desc.numa_node;
> > +		dev_dbg(&vdev->dev, "changing target node from %d to %d",
> > +			NUMA_NO_NODE, ndr_desc.target_node);
> > +	}
>
> As this memory later gets hotplugged using "devm_memremap_pages", I
> don't see if 'target_node' is used for the fsdax case?
>
> It seems to me "target_node" is used mainly for a volatile range above
> persistent memory (e.g. the kmem driver?).

I am not sure if 'target_node' is used in the fsdax case, but it is
indeed used by the devdax/kmem driver when hotplugging the memory (see
'dev_dax_kmem_probe' and '__dax_pmem_probe').

Best,
Michael
> > As this memory later gets hotplugged using "devm_memremap_pages", I
> > don't see if 'target_node' is used for the fsdax case?
> >
> > It seems to me "target_node" is used mainly for a volatile range above
> > persistent memory (e.g. the kmem driver?).
>
> I am not sure if 'target_node' is used in the fsdax case, but it is
> indeed used by the devdax/kmem driver when hotplugging the memory (see
> 'dev_dax_kmem_probe' and '__dax_pmem_probe').

Yes, but not currently for FS_DAX iiuc.

Thanks,
Pankaj
On Mon, Oct 17, 2022 at 1:11 PM Michael Sammler <sammler@google.com> wrote:
>
> Compute the numa information for a virtio_pmem device from the memory
> range of the device. Previously, the target_node was always 0 since
> the ndr_desc.target_node field was never explicitly set. The code for
> computing the numa node is taken from cxl_pmem_region_probe in
> drivers/cxl/pmem.c.
>
> Signed-off-by: Michael Sammler <sammler@google.com>

Enables the hot-plugging of virtio-pmem memory into the correct memory
nodes. Does not look like it affects FS_DAX.

Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>

Thanks,
Pasha
Pankaj Gupta wrote:
> > > As this memory later gets hotplugged using "devm_memremap_pages", I
> > > don't see if 'target_node' is used for the fsdax case?
> > >
> > > It seems to me "target_node" is used mainly for a volatile range above
> > > persistent memory (e.g. the kmem driver?).
> >
> > I am not sure if 'target_node' is used in the fsdax case, but it is
> > indeed used by the devdax/kmem driver when hotplugging the memory (see
> > 'dev_dax_kmem_probe' and '__dax_pmem_probe').
>
> Yes, but not currently for FS_DAX iiuc.

The target_node is only used by the dax_kmem driver. In the FSDAX case
the memory (persistent or otherwise) is mapped behind a block-device.
That block-device has affinity to a CPU initiator, but that memory does
not itself have any NUMA affinity or identity as a target.

So:

  block-device NUMA node == closest CPU initiator node to the device

  dax-device target node == memory only NUMA node target, after onlining
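On a running guest, this distinction can be checked directly from sysfs: the dax device created on top of the virtio-pmem region typically exposes both values. Below is a minimal user-space sketch; the device name (dax0.0) and attribute paths are assumptions that may differ per system and kernel version.

/*
 * Illustrative only; not part of the patch. The sysfs paths are
 * assumptions for a devdax setup named dax0.0.
 */
#include <stdio.h>
#include <string.h>

/* Read a small sysfs attribute into buf; returns 0 on success. */
static int read_attr(const char *path, char *buf, int len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

int main(void)
{
    char node[16], target[16];

    /* CPU-initiator affinity (what ndr_desc.numa_node feeds into). */
    if (!read_attr("/sys/bus/dax/devices/dax0.0/numa_node", node, sizeof(node)))
        printf("dax0.0 numa_node:   %s\n", node);

    /* Memory-only target node consumed by dax_kmem (ndr_desc.target_node). */
    if (!read_attr("/sys/bus/dax/devices/dax0.0/target_node", target, sizeof(target)))
        printf("dax0.0 target_node: %s\n", target);

    return 0;
}

With the patch applied, target_node should reflect the node derived from the device's start address rather than always 0.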
On Wed, Oct 26, 2022 at 2:50 PM Dan Williams <dan.j.williams@intel.com> wrote:
> [...]
> The target_node is only used by the dax_kmem driver. In the FSDAX case
> the memory (persistent or otherwise) is mapped behind a block-device.
> That block-device has affinity to a CPU initiator, but that memory does
> not itself have any NUMA affinity or identity as a target.

Tested-by: Mina Almasry <almasrymina@google.com>

I don't have much expertise on this driver, but with the help of this
patch I was able to get memory tiering [1] emulation going on qemu. As
far as I know there is no alternative to this emulation, and so I
would love to see this or equivalent merged, if possible.

This is what I have going to get memory tiering emulation:

In qemu, added these configs:

    -object memory-backend-file,id=m4,share=on,mem-path="$path_to_virtio_pmem_file",size=2G \
    -smp 2,sockets=2,maxcpus=2 \
    -numa node,nodeid=0,memdev=m0 \
    -numa node,nodeid=1,memdev=m1 \
    -numa node,nodeid=2,memdev=m2,initiator=0 \
    -numa node,nodeid=3,initiator=0 \
    -device virtio-pmem-pci,memdev=m4,id=nvdimm1 \

On boot, ran these commands:

    ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> /dev/null
    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
    for i in `ls /sys/devices/system/memory/`; do
        state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
        if [ "$state" == "offline" ]; then
            echo online_movable > "/sys/devices/system/memory/$i/state"
        fi
    done

Without this CL, I see the memory onlined in node 0 always, and is not
a separate memory tier. With this CL and qemu configs, the memory is
onlined in node 3 and is set as a separate memory tier, which enables
qemu-based development:

    ==> /sys/devices/virtual/memory_tiering/memory_tier22/nodelist <==
    3
    ==> /sys/devices/virtual/memory_tiering/memory_tier4/nodelist <==
    0-2

AFAIK there is no alternative to enabling memory tiering emulation in
qemu, and would love to see this or equivalent merged, if possible.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
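The tier check above can also be done with a small program instead of shell redirection. Below is a sketch, assuming the sysfs-kernel-mm-memory-tiers ABI referenced in [1]; the memory_tierN directory names vary between boots and systems.

/* Illustrative only; tier names (memory_tierN) differ per system. */
#include <glob.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    glob_t g;
    size_t i;

    if (glob("/sys/devices/virtual/memory_tiering/memory_tier*/nodelist",
             0, NULL, &g) != 0) {
        fprintf(stderr, "no memory tiers found (ABI not present?)\n");
        return 1;
    }

    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        char nodes[64] = "";

        if (!f)
            continue;
        if (fgets(nodes, sizeof(nodes), f))
            nodes[strcspn(nodes, "\n")] = '\0';
        fclose(f);
        printf("%s -> nodes %s\n", g.gl_pathv[i], nodes);
    }

    globfree(&g);
    return 0;
}

With the setup described above, this should show node 3 in its own tier and nodes 0-2 in another, matching the nodelist output quoted earlier.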
> Tested-by: Mina Almasry <almasrymina@google.com>
>
> I don't have much expertise on this driver, but with the help of this
> patch I was able to get memory tiering [1] emulation going on qemu. As
> far as I know there is no alternative to this emulation, and so I
> would love to see this or equivalent merged, if possible.
>
> [...]
>
> On boot, ran these commands:
>
>     ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> /dev/null
>     echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>     echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>     for i in `ls /sys/devices/system/memory/`; do
>         state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
>         if [ "$state" == "offline" ]; then
>             echo online_movable > "/sys/devices/system/memory/$i/state"
>         fi
>     done

Nice to see the way to handle the virtio-pmem device memory through the
kmem driver and online the corresponding memory blocks to 'zone_movable'.

This also opens a way to use this memory range directly, irrespective of
the attached block device. Of course there won't be any persistent data
guarantee, but it is a good way to simulate memory tiering inside a
guest, as demonstrated below.

> Without this CL, I see the memory onlined in node 0 always, and is not
> a separate memory tier. With this CL and qemu configs, the memory is
> onlined in node 3 and is set as a separate memory tier, which enables
> qemu-based development:
>
>     ==> /sys/devices/virtual/memory_tiering/memory_tier22/nodelist <==
>     3
>     ==> /sys/devices/virtual/memory_tiering/memory_tier4/nodelist <==
>     0-2
>
> AFAIK there is no alternative to enabling memory tiering emulation in
> qemu, and would love to see this or equivalent merged, if possible.

Just wondering if the Qemu vNVDIMM device can also achieve this?

In any case, this patch is useful, so:

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
On Sun, Nov 13, 2022 at 9:44 AM Pankaj Gupta <pankaj.gupta.linux@gmail.com> wrote:
> [...]
> Nice to see the way to handle the virtio-pmem device memory through the
> kmem driver and online the corresponding memory blocks to 'zone_movable'.
>
> This also opens a way to use this memory range directly, irrespective of
> the attached block device. Of course there won't be any persistent data
> guarantee, but it is a good way to simulate memory tiering inside a
> guest, as demonstrated below.
>
> Just wondering if the Qemu vNVDIMM device can also achieve this?

I spent a few minutes on this. Please note I'm really not familiar
with these drivers, but as far as I can tell the qemu vNVDIMM device
has the same problem and needs a fix similar to what Michael did here.

What I did with the vNVDIMM qemu device:

- Added these qemu configs:

      -object memory-backend-file,id=m4,share=on,mem-path=./hello,size=2G,readonly=off \
      -device nvdimm,id=nvdimm1,memdev=m4,unarmed=off \

- Ran the same commands as in my previous email (they seem to apply to
  the vNVDIMM device without modification):

      ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> /dev/null
      echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
      echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
      for i in `ls /sys/devices/system/memory/`; do
          state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
          if [ "$state" == "offline" ]; then
              echo online_movable > "/sys/devices/system/memory/$i/state"
          fi
      done

I see the memory from the vNVDIMM device get onlined on node0, and it
is not detected as a separate memory tier. I suspect that driver needs
a similar fix to this one.

> In any case, this patch is useful, so:
> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
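A quick way to see which NUMA node actually received the hotplugged capacity, independent of the memory-tier view, is to compare per-node MemTotal before and after onlining. A minimal sketch, assuming the standard /sys/devices/system/node layout:

/* Illustrative only; prints per-node MemTotal from the node sysfs layout. */
#include <glob.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    glob_t g;
    size_t i;

    if (glob("/sys/devices/system/node/node*/meminfo", 0, NULL, &g) != 0) {
        fprintf(stderr, "no NUMA node information found\n");
        return 1;
    }

    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        char line[256];

        if (!f)
            continue;
        while (fgets(line, sizeof(line), f)) {
            /* e.g. "Node 3 MemTotal:  2097152 kB" */
            if (strstr(line, "MemTotal")) {
                fputs(line, stdout);
                break;
            }
        }
        fclose(f);
    }

    globfree(&g);
    return 0;
}

In the vNVDIMM experiment above the extra capacity shows up under node 0; with the virtio-pmem path plus this patch it lands on the node derived from the device's address range (node 3 in the earlier setup).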
> I spent a few minutes on this. Please note I'm really not familiar
> with these drivers, but as far as I can tell the qemu vNVDIMM device
> has the same problem and needs a fix similar to what Michael did here.
>
> What I did with the vNVDIMM qemu device:
>
> - Added these qemu configs:
>
>       -object memory-backend-file,id=m4,share=on,mem-path=./hello,size=2G,readonly=off \
>       -device nvdimm,id=nvdimm1,memdev=m4,unarmed=off \
>
> - Ran the same commands as in my previous email (they seem to apply to
>   the vNVDIMM device without modification):
> [...]
>
> I see the memory from the vNVDIMM device get onlined on node0, and it
> is not detected as a separate memory tier. I suspect that driver needs
> a similar fix to this one.

Thanks for trying. It seems the vNVDIMM device already has an option to
provide the target node [1].

[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg827765.html
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index 20da455d2ef6..a92eb172f0e7 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
 static int virtio_pmem_probe(struct virtio_device *vdev)
 {
 	struct nd_region_desc ndr_desc = {};
-	int nid = dev_to_node(&vdev->dev);
 	struct nd_region *nd_region;
 	struct virtio_pmem *vpmem;
 	struct resource res;
@@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
 	dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
 
 	ndr_desc.res = &res;
-	ndr_desc.numa_node = nid;
+
+	ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
+	ndr_desc.target_node = phys_to_target_node(res.start);
+	if (ndr_desc.target_node == NUMA_NO_NODE) {
+		ndr_desc.target_node = ndr_desc.numa_node;
+		dev_dbg(&vdev->dev, "changing target node from %d to %d",
+			NUMA_NO_NODE, ndr_desc.target_node);
+	}
+
 	ndr_desc.flush = async_pmem_flush;
 	ndr_desc.provider_data = vdev;
 	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);