Message ID | 20221027065925.476955-1-ying.huang@intel.com |
---|---|
State | New |
Series | [RFC] memory tiering: use small chunk size and more tiers |
Commit Message
Huang, Ying
Oct. 27, 2022, 6:59 a.m. UTC
We need some way to override the system default memory tiers. Take the
following example system,
type      abstract distance
----      -----------------
HBM       300
DRAM      1000
CXL_MEM   5000
PMEM      5100
Given a memory tier chunk size of 100, the default memory tiers could
be as follows (a short sketch of the underlying arithmetic follows the
table),
tier   abstract distance   types
       range
----   -----------------   -----
3      300-400             HBM
10     1000-1100           DRAM
50     5000-5100           CXL_MEM
51     5100-5200           PMEM
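For reference, the tier number above is just the abstract distance
rounded down to a chunk boundary and divided by the chunk size. A
minimal user-space sketch of that arithmetic (illustrative only, not
the kernel code; it assumes the example chunk size of 100):

#include <stdio.h>

#define CHUNK_SIZE 100  /* memory tier chunk size used in the example */

static void show_tier(const char *type, int adistance)
{
        int tier = adistance / CHUNK_SIZE;  /* round down to a chunk boundary */

        printf("%-8s adistance %4d -> tier %2d, range %d-%d\n",
               type, adistance, tier,
               tier * CHUNK_SIZE, (tier + 1) * CHUNK_SIZE);
}

int main(void)
{
        show_tier("HBM", 300);      /* tier 3,  300-400   */
        show_tier("DRAM", 1000);    /* tier 10, 1000-1100 */
        show_tier("CXL_MEM", 5000); /* tier 50, 5000-5100 */
        show_tier("PMEM", 5100);    /* tier 51, 5100-5200 */
        return 0;
}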
If we want to group CXL_MEM and PMEM into one tier, we have two choices.
1) Override the abstract distance of CXL_MEM or PMEM. For example, if
we change the abstract distance of PMEM to 5050, the memory tiers
become,
tier   abstract distance   types
       range
----   -----------------   -----
3      300-400             HBM
10     1000-1100           DRAM
50     5000-5100           CXL_MEM, PMEM
2) Override the memory tier chunk size. For example, if we change the
memory tier chunk size to 200, the memory tiers become,
tier   abstract distance   types
       range
----   -----------------   -----
1      200-400             HBM
5      1000-1200           DRAM
25     5000-5200           CXL_MEM, PMEM
After some thought, however, I think choice 2) is not a good option.
The problem is that even if two abstract distances are almost the same,
they may be put in two tiers if they sit on different sides of a tier
boundary. For example, suppose the abstract distance of CXL_MEM is 4990
while the abstract distance of PMEM is 5010. Although the two abstract
distances differ by only 20, CXL_MEM and PMEM will be put in different
tiers for any tier chunk size of 50, 100, 200, 250, 500, and so on.
This makes choice 2) hard to use; it can become tricky to find a tier
chunk size that satisfies all requirements.
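A small illustrative sketch of this boundary effect (user space, not
kernel code): CXL_MEM at 4990 and PMEM at 5010 differ by only 20, but
5000 is a tier boundary for every chunk size listed above, so the two
types are always split.

#include <stdio.h>

int main(void)
{
        int chunk_sizes[] = { 50, 100, 200, 250, 500 };
        int cxl = 4990, pmem = 5010;  /* abstract distances from the example */
        int i;

        for (i = 0; i < 5; i++) {
                int c = chunk_sizes[i];

                /* 5000 is a boundary for all of these chunk sizes, so the
                 * two types land in different tiers despite the tiny
                 * difference in abstract distance. */
                printf("chunk %3d: CXL_MEM -> tier %3d, PMEM -> tier %3d\n",
                       c, cxl / c, pmem / c);
        }
        return 0;
}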
So I suggest abandoning choice 2) and using choice 1) only. This makes
the overall design and user space interface simpler and easier to use.
The overall design of the abstract distance could be,
1. Use decimal numbers for the abstract distance and its chunk size.
This makes them more user friendly.
2. Make the tier chunk size as small as possible, for example 10. By
default, this will put different memory types in one memory tier only
if their performance is almost the same. And we will not provide an
interface to override the chunk size.
3. Make the abstract distance of normal DRAM large enough, for example
1000. Then 100 tiers can be defined below DRAM, which is more than
enough in practice (see the short arithmetic check after this list).
4. If we want to override the default memory tiers, just override the
abstract distances of some memory types with a per-memory-type
interface.
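As a quick check of points 2 and 3, using the values that appear in the
diff below (MEMTIER_CHUNK_SIZE of 10 and a DRAM abstract distance in the
middle of chunk 100); this is only an arithmetic sketch, not kernel
code:

#include <stdio.h>

/* Values mirroring the patch below; the program itself is only a sketch. */
#define MEMTIER_CHUNK_SIZE      10
#define MEMTIER_ADISTANCE_DRAM  ((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2))

int main(void)
{
        int dram_tier = MEMTIER_ADISTANCE_DRAM / MEMTIER_CHUNK_SIZE;

        /* DRAM adistance 1005 sits in the middle of tier 100, leaving
         * tiers 0-99 for memory types faster than default DRAM. */
        printf("DRAM adistance %d -> tier %d, %d possible faster tiers\n",
               MEMTIER_ADISTANCE_DRAM, dram_tier, dram_tier);
        return 0;
}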
This patch applies the design choices above to the existing code.
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
---
include/linux/memory-tiers.h | 7 +++----
mm/memory-tiers.c | 7 +++----
2 files changed, 6 insertions(+), 8 deletions(-)
Comments
On 10/27/22 12:29 PM, Huang Ying wrote: > We need some way to override the system default memory tiers. For > the example system as follows, > > type abstract distance > ---- ----------------- > HBM 300 > DRAM 1000 > CXL_MEM 5000 > PMEM 5100 > > Given the memory tier chunk size is 100, the default memory tiers > could be, > > tier abstract distance types > range > ---- ----------------- ----- > 3 300-400 HBM > 10 1000-1100 DRAM > 50 5000-5100 CXL_MEM > 51 5100-5200 PMEM > > If we want to group CXL MEM and PMEM into one tier, we have 2 choices. > > 1) Override the abstract distance of CXL_MEM or PMEM. For example, if > we change the abstract distance of PMEM to 5050, the memory tiers > become, > > tier abstract distance types > range > ---- ----------------- ----- > 3 300-400 HBM > 10 1000-1100 DRAM > 50 5000-5100 CXL_MEM, PMEM > > 2) Override the memory tier chunk size. For example, if we change the > memory tier chunk size to 200, the memory tiers become, > > tier abstract distance types > range > ---- ----------------- ----- > 1 200-400 HBM > 5 1000-1200 DRAM > 25 5000-5200 CXL_MEM, PMEM > > But after some thoughts, I think choice 2) may be not good. The > problem is that even if 2 abstract distances are almost same, they may > be put in 2 tier if they sit in the different sides of the tier > boundary. For example, if the abstract distance of CXL_MEM is 4990, > while the abstract distance of PMEM is 5010. Although the difference > of the abstract distances is only 20, CXL_MEM and PMEM will put in > different tiers if the tier chunk size is 50, 100, 200, 250, 500, .... > This makes choice 2) hard to be used, it may become tricky to find out > the appropriate tier chunk size that satisfying all requirements. > Shouldn't we wait for gaining experience w.r.t how we would end up mapping devices with different latencies and bandwidth before tuning these values? > So I suggest to abandon choice 2) and use choice 1) only. This makes > the overall design and user space interface to be simpler and easier > to be used. The overall design of the abstract distance could be, > > 1. Use decimal for abstract distance and its chunk size. This makes > them more user friendly. > > 2. Make the tier chunk size as small as possible. For example, 10. > This will put different memory types in one memory tier only if their > performance is almost same by default. And we will not provide the > interface to override the chunk size. > this could also mean we can end up with lots of memory tiers with relative smaller performance difference between them. Again it depends how HMAT attributes will be used to map to abstract distance. > 3. Make the abstract distance of normal DRAM large enough. For > example, 1000, then 100 tiers can be defined below DRAM, this is > more than enough in practice. Why 100? Will we really have that many tiers below/faster than DRAM? As of now I see only HBM below it. > > 4. If we want to override the default memory tiers, just override the > abstract distances of some memory types with a per memory type > interface. > > This patch is to apply the design choices above in the existing code. 
> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com> > Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > Cc: Alistair Popple <apopple@nvidia.com> > Cc: Bharata B Rao <bharata@amd.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Hansen <dave.hansen@intel.com> > Cc: Davidlohr Bueso <dave@stgolabs.net> > Cc: Hesham Almatary <hesham.almatary@huawei.com> > Cc: Jagdish Gediya <jvgediya.oss@gmail.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Tim Chen <tim.c.chen@intel.com> > Cc: Wei Xu <weixugc@google.com> > Cc: Yang Shi <shy828301@gmail.com> > --- > include/linux/memory-tiers.h | 7 +++---- > mm/memory-tiers.c | 7 +++---- > 2 files changed, 6 insertions(+), 8 deletions(-) > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > index 965009aa01d7..2e39d9a6c8ce 100644 > --- a/include/linux/memory-tiers.h > +++ b/include/linux/memory-tiers.h > @@ -7,17 +7,16 @@ > #include <linux/kref.h> > #include <linux/mmzone.h> > /* > - * Each tier cover a abstrace distance chunk size of 128 > + * Each tier cover a abstrace distance chunk size of 10 > */ > -#define MEMTIER_CHUNK_BITS 7 > -#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) > +#define MEMTIER_CHUNK_SIZE 10 > /* > * Smaller abstract distance values imply faster (higher) memory tiers. Offset > * the DRAM adistance so that we can accommodate devices with a slightly lower > * adistance value (slightly faster) than default DRAM adistance to be part of > * the same memory tier. > */ > -#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) > +#define MEMTIER_ADISTANCE_DRAM ((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2)) > #define MEMTIER_HOTPLUG_PRIO 100 > > struct memory_tier; > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > index fa8c9d07f9ce..e03011428fa5 100644 > --- a/mm/memory-tiers.c > +++ b/mm/memory-tiers.c > @@ -165,11 +165,10 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty > bool found_slot = false; > struct memory_tier *memtier, *new_memtier; > int adistance = memtype->adistance; > - unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE; > > lockdep_assert_held_once(&memory_tier_lock); > > - adistance = round_down(adistance, memtier_adistance_chunk_size); > + adistance = rounddown(adistance, MEMTIER_CHUNK_SIZE); > /* > * If the memtype is already part of a memory tier, > * just return that. > @@ -204,7 +203,7 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty > else > list_add_tail(&new_memtier->list, &memory_tiers); > > - new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS; > + new_memtier->dev.id = adistance / MEMTIER_CHUNK_SIZE; > new_memtier->dev.bus = &memory_tier_subsys; > new_memtier->dev.release = memory_tier_device_release; > new_memtier->dev.groups = memtier_dev_groups; > @@ -641,7 +640,7 @@ static int __init memory_tier_init(void) > #endif > mutex_lock(&memory_tier_lock); > /* > - * For now we can have 4 faster memory tiers with smaller adistance > + * For now we can have 100 faster memory tiers with smaller adistance > * than default DRAM tier. > */ > default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
Hi, Aneesh, Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 10/27/22 12:29 PM, Huang Ying wrote: >> We need some way to override the system default memory tiers. For >> the example system as follows, >> >> type abstract distance >> ---- ----------------- >> HBM 300 >> DRAM 1000 >> CXL_MEM 5000 >> PMEM 5100 >> >> Given the memory tier chunk size is 100, the default memory tiers >> could be, >> >> tier abstract distance types >> range >> ---- ----------------- ----- >> 3 300-400 HBM >> 10 1000-1100 DRAM >> 50 5000-5100 CXL_MEM >> 51 5100-5200 PMEM >> >> If we want to group CXL MEM and PMEM into one tier, we have 2 choices. >> >> 1) Override the abstract distance of CXL_MEM or PMEM. For example, if >> we change the abstract distance of PMEM to 5050, the memory tiers >> become, >> >> tier abstract distance types >> range >> ---- ----------------- ----- >> 3 300-400 HBM >> 10 1000-1100 DRAM >> 50 5000-5100 CXL_MEM, PMEM >> >> 2) Override the memory tier chunk size. For example, if we change the >> memory tier chunk size to 200, the memory tiers become, >> >> tier abstract distance types >> range >> ---- ----------------- ----- >> 1 200-400 HBM >> 5 1000-1200 DRAM >> 25 5000-5200 CXL_MEM, PMEM >> >> But after some thoughts, I think choice 2) may be not good. The >> problem is that even if 2 abstract distances are almost same, they may >> be put in 2 tier if they sit in the different sides of the tier >> boundary. For example, if the abstract distance of CXL_MEM is 4990, >> while the abstract distance of PMEM is 5010. Although the difference >> of the abstract distances is only 20, CXL_MEM and PMEM will put in >> different tiers if the tier chunk size is 50, 100, 200, 250, 500, .... >> This makes choice 2) hard to be used, it may become tricky to find out >> the appropriate tier chunk size that satisfying all requirements. >> > > Shouldn't we wait for gaining experience w.r.t how we would end up > mapping devices with different latencies and bandwidth before tuning these values? Just want to discuss the overall design. >> So I suggest to abandon choice 2) and use choice 1) only. This makes >> the overall design and user space interface to be simpler and easier >> to be used. The overall design of the abstract distance could be, >> >> 1. Use decimal for abstract distance and its chunk size. This makes >> them more user friendly. >> >> 2. Make the tier chunk size as small as possible. For example, 10. >> This will put different memory types in one memory tier only if their >> performance is almost same by default. And we will not provide the >> interface to override the chunk size. >> > > this could also mean we can end up with lots of memory tiers with relative > smaller performance difference between them. Again it depends how HMAT > attributes will be used to map to abstract distance. Per my understanding, there will not be many memory types in a system. So, there will not be many memory tiers too. In most systems, there are only 2 or 3 memory tiers in the system, for example, HBM, DRAM, CXL, etc. Do you know systems with many memory types? The basic idea is to put different memory types in different memory tiers by default. If users want to group them, they can do that via overriding the abstract distance of some memory type. > >> 3. Make the abstract distance of normal DRAM large enough. For >> example, 1000, then 100 tiers can be defined below DRAM, this is >> more than enough in practice. > > Why 100? Will we really have that many tiers below/faster than DRAM? 
As of now > I see only HBM below it. Yes. 100 is more than enough. We just want to avoid to group different memory types by default. Best Regards, Huang, Ying >> >> 4. If we want to override the default memory tiers, just override the >> abstract distances of some memory types with a per memory type >> interface. >> >> This patch is to apply the design choices above in the existing code. >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> >> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >> Cc: Alistair Popple <apopple@nvidia.com> >> Cc: Bharata B Rao <bharata@amd.com> >> Cc: Dan Williams <dan.j.williams@intel.com> >> Cc: Dave Hansen <dave.hansen@intel.com> >> Cc: Davidlohr Bueso <dave@stgolabs.net> >> Cc: Hesham Almatary <hesham.almatary@huawei.com> >> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> >> Cc: Johannes Weiner <hannes@cmpxchg.org> >> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> Cc: Michal Hocko <mhocko@kernel.org> >> Cc: Tim Chen <tim.c.chen@intel.com> >> Cc: Wei Xu <weixugc@google.com> >> Cc: Yang Shi <shy828301@gmail.com> >> --- >> include/linux/memory-tiers.h | 7 +++---- >> mm/memory-tiers.c | 7 +++---- >> 2 files changed, 6 insertions(+), 8 deletions(-) >> >> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h >> index 965009aa01d7..2e39d9a6c8ce 100644 >> --- a/include/linux/memory-tiers.h >> +++ b/include/linux/memory-tiers.h >> @@ -7,17 +7,16 @@ >> #include <linux/kref.h> >> #include <linux/mmzone.h> >> /* >> - * Each tier cover a abstrace distance chunk size of 128 >> + * Each tier cover a abstrace distance chunk size of 10 >> */ >> -#define MEMTIER_CHUNK_BITS 7 >> -#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) >> +#define MEMTIER_CHUNK_SIZE 10 >> /* >> * Smaller abstract distance values imply faster (higher) memory tiers. Offset >> * the DRAM adistance so that we can accommodate devices with a slightly lower >> * adistance value (slightly faster) than default DRAM adistance to be part of >> * the same memory tier. >> */ >> -#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) >> +#define MEMTIER_ADISTANCE_DRAM ((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2)) >> #define MEMTIER_HOTPLUG_PRIO 100 >> >> struct memory_tier; >> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c >> index fa8c9d07f9ce..e03011428fa5 100644 >> --- a/mm/memory-tiers.c >> +++ b/mm/memory-tiers.c >> @@ -165,11 +165,10 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty >> bool found_slot = false; >> struct memory_tier *memtier, *new_memtier; >> int adistance = memtype->adistance; >> - unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE; >> >> lockdep_assert_held_once(&memory_tier_lock); >> >> - adistance = round_down(adistance, memtier_adistance_chunk_size); >> + adistance = rounddown(adistance, MEMTIER_CHUNK_SIZE); >> /* >> * If the memtype is already part of a memory tier, >> * just return that. 
>> @@ -204,7 +203,7 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty >> else >> list_add_tail(&new_memtier->list, &memory_tiers); >> >> - new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS; >> + new_memtier->dev.id = adistance / MEMTIER_CHUNK_SIZE; >> new_memtier->dev.bus = &memory_tier_subsys; >> new_memtier->dev.release = memory_tier_device_release; >> new_memtier->dev.groups = memtier_dev_groups; >> @@ -641,7 +640,7 @@ static int __init memory_tier_init(void) >> #endif >> mutex_lock(&memory_tier_lock); >> /* >> - * For now we can have 4 faster memory tiers with smaller adistance >> + * For now we can have 100 faster memory tiers with smaller adistance >> * than default DRAM tier. >> */ >> default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
On 10/28/22 8:33 AM, Huang, Ying wrote: > Hi, Aneesh, > > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > >> On 10/27/22 12:29 PM, Huang Ying wrote: >>> We need some way to override the system default memory tiers. For >>> the example system as follows, >>> >>> type abstract distance >>> ---- ----------------- >>> HBM 300 >>> DRAM 1000 >>> CXL_MEM 5000 >>> PMEM 5100 >>> >>> Given the memory tier chunk size is 100, the default memory tiers >>> could be, >>> >>> tier abstract distance types >>> range >>> ---- ----------------- ----- >>> 3 300-400 HBM >>> 10 1000-1100 DRAM >>> 50 5000-5100 CXL_MEM >>> 51 5100-5200 PMEM >>> >>> If we want to group CXL MEM and PMEM into one tier, we have 2 choices. >>> >>> 1) Override the abstract distance of CXL_MEM or PMEM. For example, if >>> we change the abstract distance of PMEM to 5050, the memory tiers >>> become, >>> >>> tier abstract distance types >>> range >>> ---- ----------------- ----- >>> 3 300-400 HBM >>> 10 1000-1100 DRAM >>> 50 5000-5100 CXL_MEM, PMEM >>> >>> 2) Override the memory tier chunk size. For example, if we change the >>> memory tier chunk size to 200, the memory tiers become, >>> >>> tier abstract distance types >>> range >>> ---- ----------------- ----- >>> 1 200-400 HBM >>> 5 1000-1200 DRAM >>> 25 5000-5200 CXL_MEM, PMEM >>> >>> But after some thoughts, I think choice 2) may be not good. The >>> problem is that even if 2 abstract distances are almost same, they may >>> be put in 2 tier if they sit in the different sides of the tier >>> boundary. For example, if the abstract distance of CXL_MEM is 4990, >>> while the abstract distance of PMEM is 5010. Although the difference >>> of the abstract distances is only 20, CXL_MEM and PMEM will put in >>> different tiers if the tier chunk size is 50, 100, 200, 250, 500, .... >>> This makes choice 2) hard to be used, it may become tricky to find out >>> the appropriate tier chunk size that satisfying all requirements. >>> >> >> Shouldn't we wait for gaining experience w.r.t how we would end up >> mapping devices with different latencies and bandwidth before tuning these values? > > Just want to discuss the overall design. > >>> So I suggest to abandon choice 2) and use choice 1) only. This makes >>> the overall design and user space interface to be simpler and easier >>> to be used. The overall design of the abstract distance could be, >>> >>> 1. Use decimal for abstract distance and its chunk size. This makes >>> them more user friendly. >>> >>> 2. Make the tier chunk size as small as possible. For example, 10. >>> This will put different memory types in one memory tier only if their >>> performance is almost same by default. And we will not provide the >>> interface to override the chunk size. >>> >> >> this could also mean we can end up with lots of memory tiers with relative >> smaller performance difference between them. Again it depends how HMAT >> attributes will be used to map to abstract distance. > > Per my understanding, there will not be many memory types in a system. > So, there will not be many memory tiers too. In most systems, there are > only 2 or 3 memory tiers in the system, for example, HBM, DRAM, CXL, > etc. So we don't need the chunk size to be 10 because we don't forsee us needing to group devices into that many tiers. > Do you know systems with many memory types? The basic idea is to > put different memory types in different memory tiers by default. If > users want to group them, they can do that via overriding the abstract > distance of some memory type. 
> with small chunk size and depending on how we are going to derive abstract distance, I am wondering whether we would end up with lots of memory tiers with no real value. Hence my suggestion to wait making a change like this till we have code that map HMAT/CDAT attributes to abstract distance. >> >>> 3. Make the abstract distance of normal DRAM large enough. For >>> example, 1000, then 100 tiers can be defined below DRAM, this is >>> more than enough in practice. >> >> Why 100? Will we really have that many tiers below/faster than DRAM? As of now >> I see only HBM below it. > > Yes. 100 is more than enough. We just want to avoid to group different > memory types by default. > > Best Regards, > Huang, Ying >
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: > On 10/28/22 8:33 AM, Huang, Ying wrote: >> Hi, Aneesh, >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes: >> >>> On 10/27/22 12:29 PM, Huang Ying wrote: >>>> We need some way to override the system default memory tiers. For >>>> the example system as follows, >>>> >>>> type abstract distance >>>> ---- ----------------- >>>> HBM 300 >>>> DRAM 1000 >>>> CXL_MEM 5000 >>>> PMEM 5100 >>>> >>>> Given the memory tier chunk size is 100, the default memory tiers >>>> could be, >>>> >>>> tier abstract distance types >>>> range >>>> ---- ----------------- ----- >>>> 3 300-400 HBM >>>> 10 1000-1100 DRAM >>>> 50 5000-5100 CXL_MEM >>>> 51 5100-5200 PMEM >>>> >>>> If we want to group CXL MEM and PMEM into one tier, we have 2 choices. >>>> >>>> 1) Override the abstract distance of CXL_MEM or PMEM. For example, if >>>> we change the abstract distance of PMEM to 5050, the memory tiers >>>> become, >>>> >>>> tier abstract distance types >>>> range >>>> ---- ----------------- ----- >>>> 3 300-400 HBM >>>> 10 1000-1100 DRAM >>>> 50 5000-5100 CXL_MEM, PMEM >>>> >>>> 2) Override the memory tier chunk size. For example, if we change the >>>> memory tier chunk size to 200, the memory tiers become, >>>> >>>> tier abstract distance types >>>> range >>>> ---- ----------------- ----- >>>> 1 200-400 HBM >>>> 5 1000-1200 DRAM >>>> 25 5000-5200 CXL_MEM, PMEM >>>> >>>> But after some thoughts, I think choice 2) may be not good. The >>>> problem is that even if 2 abstract distances are almost same, they may >>>> be put in 2 tier if they sit in the different sides of the tier >>>> boundary. For example, if the abstract distance of CXL_MEM is 4990, >>>> while the abstract distance of PMEM is 5010. Although the difference >>>> of the abstract distances is only 20, CXL_MEM and PMEM will put in >>>> different tiers if the tier chunk size is 50, 100, 200, 250, 500, .... >>>> This makes choice 2) hard to be used, it may become tricky to find out >>>> the appropriate tier chunk size that satisfying all requirements. >>>> >>> >>> Shouldn't we wait for gaining experience w.r.t how we would end up >>> mapping devices with different latencies and bandwidth before tuning these values? >> >> Just want to discuss the overall design. >> >>>> So I suggest to abandon choice 2) and use choice 1) only. This makes >>>> the overall design and user space interface to be simpler and easier >>>> to be used. The overall design of the abstract distance could be, >>>> >>>> 1. Use decimal for abstract distance and its chunk size. This makes >>>> them more user friendly. >>>> >>>> 2. Make the tier chunk size as small as possible. For example, 10. >>>> This will put different memory types in one memory tier only if their >>>> performance is almost same by default. And we will not provide the >>>> interface to override the chunk size. >>>> >>> >>> this could also mean we can end up with lots of memory tiers with relative >>> smaller performance difference between them. Again it depends how HMAT >>> attributes will be used to map to abstract distance. >> >> Per my understanding, there will not be many memory types in a system. >> So, there will not be many memory tiers too. In most systems, there are >> only 2 or 3 memory tiers in the system, for example, HBM, DRAM, CXL, >> etc. > > So we don't need the chunk size to be 10 because we don't forsee us needing > to group devices into that many tiers. 
I suggest to use small chunk size to avoid to group 2 memory types into one memory tier accidently. >> Do you know systems with many memory types? The basic idea is to >> put different memory types in different memory tiers by default. If >> users want to group them, they can do that via overriding the abstract >> distance of some memory type. >> > > with small chunk size and depending on how we are going to derive abstract distance, > I am wondering whether we would end up with lots of memory tiers with no > real value. Hence my suggestion to wait making a change like this till we have > code that map HMAT/CDAT attributes to abstract distance. Per my understanding, the NUMA nodes of the same memory type/tier will have the exact same latency and bandwidth in HMAT/CDAT for the CPU in the same socket. If my understanding were correct, you think the latency / bandwidth of these NUMA nodes will near each other, but may be different. Even if the latency / bandwidth of these NUMA nodes isn't exactly same, we should deal with that in memory types instead of memory tiers. There's only one abstract distance for each memory type. So, I still believe we will not have many memory tiers with my proposal. I don't care too much about the exact number, but want to discuss some general design choice, a) Avoid to group multiple memory types into one memory tier by default at most times. b) Abandon customizing abstract distance chunk size. Best Regards, Huang, Ying > >>> >>>> 3. Make the abstract distance of normal DRAM large enough. For >>>> example, 1000, then 100 tiers can be defined below DRAM, this is >>>> more than enough in practice. >>> >>> Why 100? Will we really have that many tiers below/faster than DRAM? As of now >>> I see only HBM below it. >> >> Yes. 100 is more than enough. We just want to avoid to group different >> memory types by default. >> >> Best Regards, >> Huang, Ying >>
On 10/28/2022 11:16 AM, Huang, Ying wrote: > If my understanding were correct, you think the latency / bandwidth of > these NUMA nodes will near each other, but may be different. > > Even if the latency / bandwidth of these NUMA nodes isn't exactly same, > we should deal with that in memory types instead of memory tiers. > There's only one abstract distance for each memory type. > > So, I still believe we will not have many memory tiers with my proposal. > > I don't care too much about the exact number, but want to discuss some > general design choice, > > a) Avoid to group multiple memory types into one memory tier by default > at most times. Do you expect the abstract distances of two different types to be close enough in real life (like you showed in your example with CXL - 5000 and PMEM - 5100) that they will get assigned into same tier most times? Are you foreseeing that abstract distance that get mapped by sources like HMAT would run into this issue? Regards, Bharata.
Bharata B Rao <bharata@amd.com> writes: > On 10/28/2022 11:16 AM, Huang, Ying wrote: >> If my understanding were correct, you think the latency / bandwidth of >> these NUMA nodes will near each other, but may be different. >> >> Even if the latency / bandwidth of these NUMA nodes isn't exactly same, >> we should deal with that in memory types instead of memory tiers. >> There's only one abstract distance for each memory type. >> >> So, I still believe we will not have many memory tiers with my proposal. >> >> I don't care too much about the exact number, but want to discuss some >> general design choice, >> >> a) Avoid to group multiple memory types into one memory tier by default >> at most times. > > Do you expect the abstract distances of two different types to be > close enough in real life (like you showed in your example with > CXL - 5000 and PMEM - 5100) that they will get assigned into same tier > most times? > > Are you foreseeing that abstract distance that get mapped by sources > like HMAT would run into this issue? Only if we set abstract distance chunk size large. So, I think that it's better to set chunk size as small as possible to avoid potential issue. What is the downside to set the chunk size small? Best Regards, Huang, Ying
On 10/28/2022 2:03 PM, Huang, Ying wrote: > Bharata B Rao <bharata@amd.com> writes: > >> On 10/28/2022 11:16 AM, Huang, Ying wrote: >>> If my understanding were correct, you think the latency / bandwidth of >>> these NUMA nodes will near each other, but may be different. >>> >>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same, >>> we should deal with that in memory types instead of memory tiers. >>> There's only one abstract distance for each memory type. >>> >>> So, I still believe we will not have many memory tiers with my proposal. >>> >>> I don't care too much about the exact number, but want to discuss some >>> general design choice, >>> >>> a) Avoid to group multiple memory types into one memory tier by default >>> at most times. >> >> Do you expect the abstract distances of two different types to be >> close enough in real life (like you showed in your example with >> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier >> most times? >> >> Are you foreseeing that abstract distance that get mapped by sources >> like HMAT would run into this issue? > > Only if we set abstract distance chunk size large. So, I think that > it's better to set chunk size as small as possible to avoid potential > issue. What is the downside to set the chunk size small? I don't see anything in particular. However - With just two memory types (default_dram_type and dax_slowmem_type with adistance values of 576 and 576*5 respectively) defined currently, - With no interface yet to set/change adistance value of a memory type, - With no defined way to convert the performance characteristics info (bw and latency) from sources like HMAT into a adistance value, I find it a bit difficult to see how a chunk size of 10 against the existing 128 could be more useful. Regards, Bharata.
Bharata B Rao <bharata@amd.com> writes: > On 10/28/2022 2:03 PM, Huang, Ying wrote: >> Bharata B Rao <bharata@amd.com> writes: >> >>> On 10/28/2022 11:16 AM, Huang, Ying wrote: >>>> If my understanding were correct, you think the latency / bandwidth of >>>> these NUMA nodes will near each other, but may be different. >>>> >>>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same, >>>> we should deal with that in memory types instead of memory tiers. >>>> There's only one abstract distance for each memory type. >>>> >>>> So, I still believe we will not have many memory tiers with my proposal. >>>> >>>> I don't care too much about the exact number, but want to discuss some >>>> general design choice, >>>> >>>> a) Avoid to group multiple memory types into one memory tier by default >>>> at most times. >>> >>> Do you expect the abstract distances of two different types to be >>> close enough in real life (like you showed in your example with >>> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier >>> most times? >>> >>> Are you foreseeing that abstract distance that get mapped by sources >>> like HMAT would run into this issue? >> >> Only if we set abstract distance chunk size large. So, I think that >> it's better to set chunk size as small as possible to avoid potential >> issue. What is the downside to set the chunk size small? > > I don't see anything in particular. However > > - With just two memory types (default_dram_type and dax_slowmem_type > with adistance values of 576 and 576*5 respectively) defined currently, > - With no interface yet to set/change adistance value of a memory type, > - With no defined way to convert the performance characteristics info > (bw and latency) from sources like HMAT into a adistance value, > > I find it a bit difficult to see how a chunk size of 10 against the > existing 128 could be more useful. OK. Maybe we pay too much attention to specific number. My target isn't to push this specific RFC into kernel. I just want to discuss the design choices with community. My basic idea is NOT to group memory types into memory tiers via customizing abstract distance chunk size. Because that's hard to be used and implemented. So far, it appears that nobody objects this. Then, it's even better to avoid to adjust abstract chunk size in kernel as much as possible. This will make the life of the user space tools/scripts easier. One solution is to define more than enough possible tiers under DRAM (we have unlimited number of tiers above DRAM). In the upstream implementation, 4 tiers are possible below DRAM. That's enough for now. But in the long run, it may be better to define more. 100 possible tiers below DRAM may be too extreme. How about define the abstract distance of DRAM to be 1050 and chunk size to be 100. Then we will have 10 possible tiers below DRAM. That may be more than enough even in the long run? Again, the specific number isn't so important for me. So please suggest your number if necessary. Best Regards, Huang, Ying
On Mon 31-10-22 09:33:49, Huang, Ying wrote: [...] > In the upstream implementation, 4 tiers are possible below DRAM. That's > enough for now. But in the long run, it may be better to define more. > 100 possible tiers below DRAM may be too extreme. I am just curious. Is any configurations with more than couple of tiers even manageable? I mean applications have been struggling even with regular NUMA systems for years and vast majority of them is largerly NUMA unaware. How are they going to configure for a more complex system when a) there is no resource access control so whatever you aim for might not be available and b) in which situations there is going to be a demand only for subset of tears (GPU memory?) ? Thanks!
Michal Hocko <mhocko@suse.com> writes: > On Mon 31-10-22 09:33:49, Huang, Ying wrote: > [...] >> In the upstream implementation, 4 tiers are possible below DRAM. That's >> enough for now. But in the long run, it may be better to define more. >> 100 possible tiers below DRAM may be too extreme. > > I am just curious. Is any configurations with more than couple of tiers > even manageable? I mean applications have been struggling even with > regular NUMA systems for years and vast majority of them is largerly > NUMA unaware. How are they going to configure for a more complex system > when a) there is no resource access control so whatever you aim for > might not be available and b) in which situations there is going to be a > demand only for subset of tears (GPU memory?) ? Sorry for confusing. I think that there are only several (less than 10) tiers in a system in practice. Yes, here, I suggested to define 100 (10 in the later text) POSSIBLE tiers below DRAM. My intention isn't to manage a system with tens memory tiers. Instead, my intention is to avoid to put 2 memory types into one memory tier by accident via make the abstract distance range of each memory tier as small as possible. More possible memory tiers, smaller abstract distance range of each memory tier. Best Regards, Huang, Ying
On Wed 02-11-22 08:39:49, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Mon 31-10-22 09:33:49, Huang, Ying wrote: > > [...] > >> In the upstream implementation, 4 tiers are possible below DRAM. That's > >> enough for now. But in the long run, it may be better to define more. > >> 100 possible tiers below DRAM may be too extreme. > > > > I am just curious. Is any configurations with more than couple of tiers > > even manageable? I mean applications have been struggling even with > > regular NUMA systems for years and vast majority of them is largerly > > NUMA unaware. How are they going to configure for a more complex system > > when a) there is no resource access control so whatever you aim for > > might not be available and b) in which situations there is going to be a > > demand only for subset of tears (GPU memory?) ? > > Sorry for confusing. I think that there are only several (less than 10) > tiers in a system in practice. Yes, here, I suggested to define 100 (10 > in the later text) POSSIBLE tiers below DRAM. My intention isn't to > manage a system with tens memory tiers. Instead, my intention is to > avoid to put 2 memory types into one memory tier by accident via make > the abstract distance range of each memory tier as small as possible. > More possible memory tiers, smaller abstract distance range of each > memory tier. TBH I do not really understand how tweaking ranges helps anything. IIUC drivers are free to assign any abstract distance so they will clash without any higher level coordination.
Michal Hocko <mhocko@suse.com> writes: > On Wed 02-11-22 08:39:49, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote: >> > [...] >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's >> >> enough for now. But in the long run, it may be better to define more. >> >> 100 possible tiers below DRAM may be too extreme. >> > >> > I am just curious. Is any configurations with more than couple of tiers >> > even manageable? I mean applications have been struggling even with >> > regular NUMA systems for years and vast majority of them is largerly >> > NUMA unaware. How are they going to configure for a more complex system >> > when a) there is no resource access control so whatever you aim for >> > might not be available and b) in which situations there is going to be a >> > demand only for subset of tears (GPU memory?) ? >> >> Sorry for confusing. I think that there are only several (less than 10) >> tiers in a system in practice. Yes, here, I suggested to define 100 (10 >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to >> manage a system with tens memory tiers. Instead, my intention is to >> avoid to put 2 memory types into one memory tier by accident via make >> the abstract distance range of each memory tier as small as possible. >> More possible memory tiers, smaller abstract distance range of each >> memory tier. > > TBH I do not really understand how tweaking ranges helps anything. > IIUC drivers are free to assign any abstract distance so they will clash > without any higher level coordination. Yes. That's possible. Each memory tier corresponds to one abstract distance range. The larger the range is, the higher the possibility of clashing is. So I suggest to make the abstract distance range smaller to reduce the possibility of clashing. Best Regards, Huang, Ying
On Wed 02-11-22 16:02:54, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Wed 02-11-22 08:39:49, Huang, Ying wrote: > >> Michal Hocko <mhocko@suse.com> writes: > >> > >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote: > >> > [...] > >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's > >> >> enough for now. But in the long run, it may be better to define more. > >> >> 100 possible tiers below DRAM may be too extreme. > >> > > >> > I am just curious. Is any configurations with more than couple of tiers > >> > even manageable? I mean applications have been struggling even with > >> > regular NUMA systems for years and vast majority of them is largerly > >> > NUMA unaware. How are they going to configure for a more complex system > >> > when a) there is no resource access control so whatever you aim for > >> > might not be available and b) in which situations there is going to be a > >> > demand only for subset of tears (GPU memory?) ? > >> > >> Sorry for confusing. I think that there are only several (less than 10) > >> tiers in a system in practice. Yes, here, I suggested to define 100 (10 > >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to > >> manage a system with tens memory tiers. Instead, my intention is to > >> avoid to put 2 memory types into one memory tier by accident via make > >> the abstract distance range of each memory tier as small as possible. > >> More possible memory tiers, smaller abstract distance range of each > >> memory tier. > > > > TBH I do not really understand how tweaking ranges helps anything. > > IIUC drivers are free to assign any abstract distance so they will clash > > without any higher level coordination. > > Yes. That's possible. Each memory tier corresponds to one abstract > distance range. The larger the range is, the higher the possibility of > clashing is. So I suggest to make the abstract distance range smaller > to reduce the possibility of clashing. I am sorry but I really do not understand how the size of the range actually addresses a fundamental issue that each driver simply picks what it wants. Is there any enumeration defining basic characteristic of each tier? How does a driver developer knows which tear to assign its driver to?
Michal Hocko <mhocko@suse.com> writes: > On Wed 02-11-22 16:02:54, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Wed 02-11-22 08:39:49, Huang, Ying wrote: >> >> Michal Hocko <mhocko@suse.com> writes: >> >> >> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote: >> >> > [...] >> >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's >> >> >> enough for now. But in the long run, it may be better to define more. >> >> >> 100 possible tiers below DRAM may be too extreme. >> >> > >> >> > I am just curious. Is any configurations with more than couple of tiers >> >> > even manageable? I mean applications have been struggling even with >> >> > regular NUMA systems for years and vast majority of them is largerly >> >> > NUMA unaware. How are they going to configure for a more complex system >> >> > when a) there is no resource access control so whatever you aim for >> >> > might not be available and b) in which situations there is going to be a >> >> > demand only for subset of tears (GPU memory?) ? >> >> >> >> Sorry for confusing. I think that there are only several (less than 10) >> >> tiers in a system in practice. Yes, here, I suggested to define 100 (10 >> >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to >> >> manage a system with tens memory tiers. Instead, my intention is to >> >> avoid to put 2 memory types into one memory tier by accident via make >> >> the abstract distance range of each memory tier as small as possible. >> >> More possible memory tiers, smaller abstract distance range of each >> >> memory tier. >> > >> > TBH I do not really understand how tweaking ranges helps anything. >> > IIUC drivers are free to assign any abstract distance so they will clash >> > without any higher level coordination. >> >> Yes. That's possible. Each memory tier corresponds to one abstract >> distance range. The larger the range is, the higher the possibility of >> clashing is. So I suggest to make the abstract distance range smaller >> to reduce the possibility of clashing. > > I am sorry but I really do not understand how the size of the range > actually addresses a fundamental issue that each driver simply picks > what it wants. Is there any enumeration defining basic characteristic of > each tier? How does a driver developer knows which tear to assign its > driver to? The smaller range size will not guarantee anything. It just tries to help the default behavior. The drivers are expected to assign the abstract distance based on the memory latency/bandwidth, etc. And the abstract distance range of a memory tier corresponds to a memory latency/bandwidth range too. So, if the size of the abstract distance range is smaller, the possibility for two types of memory with different latency/bandwidth to clash on the abstract distance range is lower. Clashing isn't a totally disaster. We plan to provide a per-memory-type knob to offset the abstract distance provided by driver. Then, we can move clashing memory types away if necessary. Best Regards, Huang, Ying
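The per-memory-type offset knob mentioned above does not exist in this
patch; purely as a hypothetical sketch of the idea, a user-supplied
offset applied on top of the driver-provided abstract distance could
pull two clashing memory types back into one tier:

#include <stdio.h>

#define MEMTIER_CHUNK_SIZE 10

/* Hypothetical per-memory-type offset; no such interface exists yet. */
struct memtype_example {
        const char *name;
        int driver_adistance;  /* abstract distance reported by the driver */
        int user_offset;       /* offset tuned from user space */
};

static int effective_tier(const struct memtype_example *t)
{
        return (t->driver_adistance + t->user_offset) / MEMTIER_CHUNK_SIZE;
}

int main(void)
{
        struct memtype_example cxl = { "CXL_MEM", 4990, 0 };
        struct memtype_example pmem = { "PMEM", 5010, -20 };  /* pull PMEM into the CXL_MEM tier */

        printf("%s -> tier %d\n", cxl.name, effective_tier(&cxl));
        printf("%s -> tier %d\n", pmem.name, effective_tier(&pmem));
        return 0;
}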
On Wed 02-11-22 16:28:08, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Wed 02-11-22 16:02:54, Huang, Ying wrote: > >> Michal Hocko <mhocko@suse.com> writes: > >> > >> > On Wed 02-11-22 08:39:49, Huang, Ying wrote: > >> >> Michal Hocko <mhocko@suse.com> writes: > >> >> > >> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote: > >> >> > [...] > >> >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's > >> >> >> enough for now. But in the long run, it may be better to define more. > >> >> >> 100 possible tiers below DRAM may be too extreme. > >> >> > > >> >> > I am just curious. Is any configurations with more than couple of tiers > >> >> > even manageable? I mean applications have been struggling even with > >> >> > regular NUMA systems for years and vast majority of them is largerly > >> >> > NUMA unaware. How are they going to configure for a more complex system > >> >> > when a) there is no resource access control so whatever you aim for > >> >> > might not be available and b) in which situations there is going to be a > >> >> > demand only for subset of tears (GPU memory?) ? > >> >> > >> >> Sorry for confusing. I think that there are only several (less than 10) > >> >> tiers in a system in practice. Yes, here, I suggested to define 100 (10 > >> >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to > >> >> manage a system with tens memory tiers. Instead, my intention is to > >> >> avoid to put 2 memory types into one memory tier by accident via make > >> >> the abstract distance range of each memory tier as small as possible. > >> >> More possible memory tiers, smaller abstract distance range of each > >> >> memory tier. > >> > > >> > TBH I do not really understand how tweaking ranges helps anything. > >> > IIUC drivers are free to assign any abstract distance so they will clash > >> > without any higher level coordination. > >> > >> Yes. That's possible. Each memory tier corresponds to one abstract > >> distance range. The larger the range is, the higher the possibility of > >> clashing is. So I suggest to make the abstract distance range smaller > >> to reduce the possibility of clashing. > > > > I am sorry but I really do not understand how the size of the range > > actually addresses a fundamental issue that each driver simply picks > > what it wants. Is there any enumeration defining basic characteristic of > > each tier? How does a driver developer knows which tear to assign its > > driver to? > > The smaller range size will not guarantee anything. It just tries to > help the default behavior. > > The drivers are expected to assign the abstract distance based on the > memory latency/bandwidth, etc. Would it be possible/feasible to have a canonical way to calculate the abstract distance from these characteristics by the core kernel so that drivers do not even have fall into that trap?
Michal Hocko <mhocko@suse.com> writes: > On Wed 02-11-22 16:28:08, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Wed 02-11-22 16:02:54, Huang, Ying wrote: >> >> Michal Hocko <mhocko@suse.com> writes: >> >> >> >> > On Wed 02-11-22 08:39:49, Huang, Ying wrote: >> >> >> Michal Hocko <mhocko@suse.com> writes: >> >> >> >> >> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote: >> >> >> > [...] >> >> >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's >> >> >> >> enough for now. But in the long run, it may be better to define more. >> >> >> >> 100 possible tiers below DRAM may be too extreme. >> >> >> > >> >> >> > I am just curious. Is any configurations with more than couple of tiers >> >> >> > even manageable? I mean applications have been struggling even with >> >> >> > regular NUMA systems for years and vast majority of them is largerly >> >> >> > NUMA unaware. How are they going to configure for a more complex system >> >> >> > when a) there is no resource access control so whatever you aim for >> >> >> > might not be available and b) in which situations there is going to be a >> >> >> > demand only for subset of tears (GPU memory?) ? >> >> >> >> >> >> Sorry for confusing. I think that there are only several (less than 10) >> >> >> tiers in a system in practice. Yes, here, I suggested to define 100 (10 >> >> >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to >> >> >> manage a system with tens memory tiers. Instead, my intention is to >> >> >> avoid to put 2 memory types into one memory tier by accident via make >> >> >> the abstract distance range of each memory tier as small as possible. >> >> >> More possible memory tiers, smaller abstract distance range of each >> >> >> memory tier. >> >> > >> >> > TBH I do not really understand how tweaking ranges helps anything. >> >> > IIUC drivers are free to assign any abstract distance so they will clash >> >> > without any higher level coordination. >> >> >> >> Yes. That's possible. Each memory tier corresponds to one abstract >> >> distance range. The larger the range is, the higher the possibility of >> >> clashing is. So I suggest to make the abstract distance range smaller >> >> to reduce the possibility of clashing. >> > >> > I am sorry but I really do not understand how the size of the range >> > actually addresses a fundamental issue that each driver simply picks >> > what it wants. Is there any enumeration defining basic characteristic of >> > each tier? How does a driver developer knows which tear to assign its >> > driver to? >> >> The smaller range size will not guarantee anything. It just tries to >> help the default behavior. >> >> The drivers are expected to assign the abstract distance based on the >> memory latency/bandwidth, etc. > > Would it be possible/feasible to have a canonical way to calculate the > abstract distance from these characteristics by the core kernel so that > drivers do not even have fall into that trap? Yes. That sounds a good idea. We can provide a function to map from the memory latency/bandwidth to the abstract distance for the drivers. Best Regards, Huang, Ying
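No such helper exists at the time of this exchange; purely as a
hypothetical sketch of the kind of core-kernel mapping being discussed,
an abstract distance could be scaled by how a device's latency and
bandwidth compare with default DRAM (the formula and the numbers are
assumptions for illustration only):

#include <stdio.h>

#define MEMTIER_ADISTANCE_DRAM  1005  /* value implied by the patch below */

/*
 * Hypothetical helper: derive an abstract distance from read latency and
 * bandwidth relative to default DRAM. Higher latency and lower bandwidth
 * both push the distance up. The formula is illustrative only.
 */
static int adistance_from_perf(long latency_ns, long bandwidth_mbps,
                               long dram_latency_ns, long dram_bandwidth_mbps)
{
        long lat_part = MEMTIER_ADISTANCE_DRAM * latency_ns / dram_latency_ns;
        long bw_part = MEMTIER_ADISTANCE_DRAM * dram_bandwidth_mbps / bandwidth_mbps;

        return (int)((lat_part + bw_part) / 2);
}

int main(void)
{
        /* Assumed numbers, purely for illustration. */
        printf("CXL-like device: adistance %d\n",
               adistance_from_perf(300, 30000, 100, 100000));
        return 0;
}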
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 965009aa01d7..2e39d9a6c8ce 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -7,17 +7,16 @@ #include <linux/kref.h> #include <linux/mmzone.h> /* - * Each tier cover a abstrace distance chunk size of 128 + * Each tier cover a abstrace distance chunk size of 10 */ -#define MEMTIER_CHUNK_BITS 7 -#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) +#define MEMTIER_CHUNK_SIZE 10 /* * Smaller abstract distance values imply faster (higher) memory tiers. Offset * the DRAM adistance so that we can accommodate devices with a slightly lower * adistance value (slightly faster) than default DRAM adistance to be part of * the same memory tier. */ -#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) +#define MEMTIER_ADISTANCE_DRAM ((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2)) #define MEMTIER_HOTPLUG_PRIO 100 struct memory_tier; diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index fa8c9d07f9ce..e03011428fa5 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -165,11 +165,10 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty bool found_slot = false; struct memory_tier *memtier, *new_memtier; int adistance = memtype->adistance; - unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE; lockdep_assert_held_once(&memory_tier_lock); - adistance = round_down(adistance, memtier_adistance_chunk_size); + adistance = rounddown(adistance, MEMTIER_CHUNK_SIZE); /* * If the memtype is already part of a memory tier, * just return that. @@ -204,7 +203,7 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty else list_add_tail(&new_memtier->list, &memory_tiers); - new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS; + new_memtier->dev.id = adistance / MEMTIER_CHUNK_SIZE; new_memtier->dev.bus = &memory_tier_subsys; new_memtier->dev.release = memory_tier_device_release; new_memtier->dev.groups = memtier_dev_groups; @@ -641,7 +640,7 @@ static int __init memory_tier_init(void) #endif mutex_lock(&memory_tier_lock); /* - * For now we can have 4 faster memory tiers with smaller adistance + * For now we can have 100 faster memory tiers with smaller adistance * than default DRAM tier. */ default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);