[1/9] cpumask: introduce for_each_cpu_and_from()

Message ID 20240120025053.684838-2-yury.norov@gmail.com
State New
Series lib/group_cpus: rework grp_spread_init_one() and make it O(1)

Commit Message

Yury Norov Jan. 20, 2024, 2:50 a.m. UTC
  Similarly to for_each_cpu_and(), introduce for_each_cpu_and_from(),
which is handy when one needs to traverse two cpumasks or bitmaps
starting from a given position.
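
For example, a minimal usage sketch (start, mask1, mask2 and
do_something() are placeholders, not part of this patch):

	unsigned int cpu = start;

	for_each_cpu_and_from(cpu, mask1, mask2)
		do_something(cpu);

After the loop, cpu is >= nr_cpu_ids.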

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/cpumask.h | 11 +++++++++++
 include/linux/find.h    |  3 +++
 2 files changed, 14 insertions(+)
  

Comments

Ming Lei Jan. 20, 2024, 3:03 a.m. UTC | #1
On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> Similarly to for_each_cpu_and(), introduce for_each_cpu_and_from(),
> which is handy when one needs to traverse two cpumasks or bitmaps
> starting from a given position.

The new helper is useless, see

https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/


Thanks,
Ming
  
Yury Norov Jan. 21, 2024, 7:50 p.m. UTC | #2
On Sat, Jan 20, 2024 at 11:03:37AM +0800, Ming Lei wrote:
> On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> > Similarly to for_each_cpu_and(), introduce for_each_cpu_and_from(),
> > which is handy when one needs to traverse two cpumasks or bitmaps
> > starting from a given position.
> 
> The new helper is useless, see
> 
> https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/

Let's consider the following configuration.

CPUs: 0b1111
Sibling groups: 0b0011 and 0b1100
nmsk: 0b1111

As the complexity measure, we take the number of accesses to nmsk in
the outer loop, and to (nmsk & siblmsk) in the inner loop, made by the
search routines, so that

	cpumask_first(1111)

requires 1 access to find the first set bit, and

	cpumask_first(1000)

requires 4 such accesses.

Actual find_bit() ops do better than this by using __ffs() to scan a
whole word at a time, but on long bitmaps the scaling is as described above.
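
In other words, the model charges one access per tested bit, like this
illustrative scan (naive_first_bit() is not a kernel helper, just the
cost model spelled out):

  static unsigned int naive_first_bit(unsigned long mask, unsigned int nbits)
  {
          unsigned int i;

          for (i = 0; i < nbits; i++)     /* one access per tested bit */
                  if (mask & BIT(i))
                          break;
          return i;
  }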

Now, look at the code. This is yours:

  static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
                                  unsigned int cpus_per_grp)
  {
          const struct cpumask *siblmsk;
          int cpu, sibl;
  
          for ( ; cpus_per_grp > 0; ) {
                  cpu = cpumask_first(nmsk);
  
                  /* Should not happen, but I'm too lazy to think about it */
                  if (cpu >= nr_cpu_ids)
                          return;
  
                  cpumask_clear_cpu(cpu, nmsk);
                  cpumask_set_cpu(cpu, irqmsk);
                  cpus_per_grp--;
  
                  /* If the cpu has siblings, use them first */
                  siblmsk = topology_sibling_cpumask(cpu);
                  for (sibl = -1; cpus_per_grp > 0; ) {
                          sibl = cpumask_next(sibl, siblmsk);
                          if (sibl >= nr_cpu_ids)
                                  break;
                          if (!cpumask_test_and_clear_cpu(sibl, nmsk))
                                  continue;
                          cpumask_set_cpu(sibl, irqmsk);
                          cpus_per_grp--;
                  }
          }
  }

This is your code step-by-step:

#	loop	cpu	match	siblmsk	nmsk	irqmsk
 0	outer	0	yes 		1110	0001
 1	inner	0	no  		1110	0001
 2	inner	1	yes 	0011	1100	0011
 3	inner	2	no 		1100	0011
 4	inner	3	no 		1100	0011
 5	outer	0	no	 	1100	0011
 6	outer	1	no	 	1100	0011
 7	outer	2	yes	 	1000	0111
 8	inner	0	no 	1100	1000	0111
 9	inner	1	no 	1100	1000	0111
10	inner	2	no 	1100	1000	0111
11	inner	3	yes	1100	0000	1111
12	outer	0	no	 	0000	1111
13	outer	1	no	 	0000	1111
14	outer	2	no	 	0000	1111
15	outer	3	no	 	0000	1111
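
That's 16 mask accesses to hand out 4 CPUs.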

This is mine:

  static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
                                  unsigned int cpus_per_grp)
  {
          const struct cpumask *siblmsk;
          int cpu, sibl;
  
          for_each_cpu(cpu, nmsk) {
                  if (cpus_per_grp-- == 0)
                          return;
  
                  /*
                   * If a caller wants to spread IRQs on offline CPUs, we need to
                   * take care of it explicitly because those offline CPUs are not
                   * included in the siblings cpumask.
                   */
                  __cpumask_clear_cpu(cpu, nmsk);
                  __cpumask_set_cpu(cpu, irqmsk);
  
                  /* If the cpu has siblings, use them first */
                  siblmsk = topology_sibling_cpumask(cpu);
                  sibl = cpu + 1;
  
                  for_each_cpu_and_from(sibl, siblmsk, nmsk) {
                          if (cpus_per_grp-- == 0)
                                  return;
  
                          __cpumask_clear_cpu(sibl, nmsk);
                          __cpumask_set_cpu(sibl, irqmsk);
                          cpu = sibl + 1;
                  }
          }
  }

Step-by-step:

#	loop	cpu	match	siblmsk	nmsk	irqmsk
 0	outer	0	yes 		1110	0001
 1	inner	1	yes 	0011	1100	0011
 2	inner	2	no 	0011	1100	0011
 3	inner	3	no 	0011	1100	0011
 4	outer	2	yes	 	1000	0111
 5	inner	3	yes	1100	0000	1111
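
That's 6 accesses instead of 16: neither loop ever re-scans a bit it has
already visited.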

Your code works worse because it's a Schlemiel the Painter's algorithm.
I mentioned it twice in the commit messages and at least 3 times in
replies to your comments.

Here I'll stop and will not reply to your emails, including the rest of
that Friday night's mailbombing, unless you at least admit you're wrong
in this case and that for_each_cpu_and_from() is useful here.

I'd also recommend that you learn more about the basics of atomic
operations and revoke your NAK on patch #3.

Thanks,
	Yury

PS: There's a typo in the series name; I meant, of course, that the series
makes the function O(N). But even that is overly optimistic: it's O(N*S),
where S is the number of sibling groups. A couple more patches are needed
to make it truly O(N). Still, much better.
  
Ming Lei Jan. 22, 2024, 2:41 a.m. UTC | #3
On Sun, Jan 21, 2024 at 11:50:02AM -0800, Yury Norov wrote:
> On Sat, Jan 20, 2024 at 11:03:37AM +0800, Ming Lei wrote:
> > On Fri, Jan 19, 2024 at 06:50:45PM -0800, Yury Norov wrote:
> > > Similarly to for_each_cpu_and(), introduce for_each_cpu_and_from(),
> > > which is handy when one needs to traverse two cpumasks or bitmaps
> > > starting from a given position.
> > 
> > The new helper is useless, see
> > 
> > https://lore.kernel.org/lkml/ZZNgDb6bzOscrNmk@fedora/
> 
> Let's consider the following configuration.
> Step-by-step:

..

> 
> #	loop	cpu	match	siblmsk	nmsk	irqmsk
>  0	outer	0	yes 		1110	0001
>  1	inner	1	yes 	0011	1100	0011
>  2	inner	2	no 	0011	1100	0011
>  3	inner	3	no 	0011	1100	0011
>  4	outer	2	yes	 	1000	0111
>  5	inner	3	yes	1100	0000	1111
> 
> Your code works worse because it's a Schlemiel the Painter's algorithm.
> I mentioned it twice in the commit messages and at least 3 times in
> replies to your comments.

Does it really matter in reality? What kind of user-visible improvement
can be observed?

I have mentioned several times that for control/management code paths we
care more about maintainability and correctness than about efficiency.

You are _wasting_ resources in the wrong place. If you are really
interested in optimization, please do it in fast paths, such as (but not
limited to) irq handling, io handling, memory allocation, ...

Unfortunately, your v5 still has an obvious bug, and as you mentioned,
the patchset title is wrong too.

> 
> Here I'll stop and will not reply to your emails, including the rest of
> that Friday's night mailbombing, unless you at least admit you're wrong
> in this case and for_each_cpu_and_from() is useful here. 

It is easy to get the same result without adding for_each_cpu_and_from();
see the patch I sent:

https://lore.kernel.org/lkml/20240120065543.739203-1-ming.lei@redhat.com/

in which we needn't update the iterator variable inside the loop. It also
fixes the bug in patch 4 of your v5, and it is still O(N). Meanwhile, it
is simpler and easier to prove correct.

Your use of for_each_cpu_and_from() here is tricky too, because the loop
condition variable (part of the iterator state) of the cpu mask is updated
inside the loop. And we can get the same result with cpumask_next_and(),
without playing that trick.
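
For example, a minimal sketch of that inner loop (an illustration of the
idea, not the actual patch linked above):

                  /*
                   * Scan the remaining siblings in nmsk without touching
                   * the outer loop's iterator.
                   */
                  for (sibl = cpu; ; ) {
                          sibl = cpumask_next_and(sibl, siblmsk, nmsk);
                          if (sibl >= nr_cpu_ids)
                                  break;
                          if (cpus_per_grp-- == 0)
                                  return;
                          __cpumask_clear_cpu(sibl, nmsk);
                          __cpumask_set_cpu(sibl, irqmsk);
                  }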

> 
> I'd also recommend you to learn more about atomic operations basics and
> revoke your NAK from the patch #3.

If you think my comment on the NAK is wrong, please reply to that comment
directly.

> 
> Thanks,
> 	Yury
> 
> PS: There's a typo in the series name, I meant that the series makes the
> function O(N) of course. But even that is overly optimistic. It's O(N*S),
> where S is the number of sibling groups. A couple more patches needed to
> make it a true O(N). Still, much better.

Neither O(1) nor O(N) is a big deal here, because this is a one-shot slow
path, and nr_cpu_ids is not that big in reality.

You can't even make it truly O(N), because your patch 4 has a logic
mistake; see my comment:

https://lore.kernel.org/lkml/ZatlggW%2F8SH6od9O@fedora/



Thanks,
Ming
  

Patch

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cfb545841a2c..73ff2e0ef090 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -332,6 +332,17 @@  unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
 #define for_each_cpu_and(cpu, mask1, mask2)				\
 	for_each_and_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
 
+/**
+ * for_each_cpu_and_from - iterate over every cpu in both masks starting from a given cpu
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_and_from(cpu, mask1, mask2)				\
+	for_each_and_bit_from(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
+
 /**
  * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
  *			 those present in another.
diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..dfd3d51ff590 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -563,6 +563,9 @@  unsigned long find_next_bit_le(const void *addr, unsigned
 	     (bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\
 	     (bit)++)
 
+#define for_each_and_bit_from(bit, addr1, addr2, size) \
+	for (; (bit) = find_next_and_bit((addr1), (addr2), (size), (bit)), (bit) < (size); (bit)++)
+
 #define for_each_andnot_bit(bit, addr1, addr2, size) \
 	for ((bit) = 0;									\
 	     (bit) = find_next_andnot_bit((addr1), (addr2), (size), (bit)), (bit) < (size);\