[1/4] mm: introduce 'encoded' page pointers with embedded extra bits

Message ID 20221108194139.57604-1-torvalds@linux-foundation.org

Commit Message

Linus Torvalds Nov. 8, 2022, 7:41 p.m. UTC
  We already have this notion in parts of the MM code (see the mlock code
with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
case, and I refuse to do the same thing we've done before where we just
put bits in the raw pointer and say it's still a normal pointer.

So this introduces a 'struct encoded_page' pointer that cannot be used
for anything else than to encode a real page pointer and a couple of
extra bits in the low bits.  That way the compiler can trivially track
the state of the pointer and you just explicitly encode and decode the
extra bits.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm_types.h | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)
  

Comments

Nadav Amit Nov. 8, 2022, 8:37 p.m. UTC | #1
On Nov 8, 2022, at 11:41 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> We already have this notion in parts of the MM code (see the mlock code
> with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
> case, and I refuse to do the same thing we've done before where we just
> put bits in the raw pointer and say it's still a normal pointer.
> 
> So this introduces a 'struct encoded_page' pointer that cannot be used
> for anything else than to encode a real page pointer and a couple of
> extra bits in the low bits.  That way the compiler can trivially track
> the state of the pointer and you just explicitly encode and decode the
> extra bits.

I tested again all of the patches with the PoC. They pass.

> 
> +struct encoded_page;
> +#define ENCODE_PAGE_BITS 3ul
> +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> +{
> +	return (struct encoded_page *)(flags | (unsigned long)page);
> +}
> +
> +static inline bool encoded_page_flags(struct encoded_page *page)
> +{
> +	return ENCODE_PAGE_BITS & (unsigned long)page;
> +}

I think this one wants to be some unsigned, as otherwise why have
ENCODE_PAGE_BITS as 3ul ?
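
To make the point above concrete, here is a minimal user-space sketch, not the kernel code itself. It assumes a stand-in 'struct page' and an LP64-style platform where a pointer fits in an unsigned long. The names encoded_page_flags_bool() and encoded_page_flags_ul() are hypothetical, introduced only to contrast the posted 'bool' return type with an unsigned one.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's 'struct page' (assumption for this sketch).
 * Its natural alignment leaves the low two address bits zero. */
struct page { unsigned long flags; };

struct encoded_page;		/* deliberately never defined */
#define ENCODE_PAGE_BITS 3ul

static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
{
	return (struct encoded_page *)(flags | (unsigned long)page);
}

/* Return type as posted: 'bool' collapses every non-zero value to 1. */
static inline bool encoded_page_flags_bool(struct encoded_page *page)
{
	return ENCODE_PAGE_BITS & (unsigned long)page;
}

/* Hypothetical variant with an unsigned return type: both bits survive. */
static inline unsigned long encoded_page_flags_ul(struct encoded_page *page)
{
	return ENCODE_PAGE_BITS & (unsigned long)page;
}

static inline struct page *encoded_page_ptr(struct encoded_page *page)
{
	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
}

static void encoded_page_selftest(void)
{
	static struct page pg;
	struct encoded_page *e = encode_page(&pg, 2);

	assert(encoded_page_ptr(e) == &pg);		/* pointer round-trips */
	assert(encoded_page_flags_ul(e) == 2);		/* flag value preserved */
	assert(encoded_page_flags_bool(e) == true);	/* 2 was reduced to 1 */
}
```

With a single flag bit the two variants are indistinguishable, which is presumably why the 'bool' version went unnoticed; the difference only shows once a caller needs both low bits.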
  
Linus Torvalds Nov. 8, 2022, 8:46 p.m. UTC | #2
On Tue, Nov 8, 2022 at 12:37 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > +static inline bool encoded_page_flags(struct encoded_page *page)
> > +{
> > +     return ENCODE_PAGE_BITS & (unsigned long)page;
> > +}
>
> I think this one wants to be some unsigned, as otherwise why have
> ENCODE_PAGE_BITS as 3ul ?

Right you are. That came from my old old version where this was just
"bool dirty".

Will fix.

Doesn't matter for the TLB flushing case, but I really did hope that
we could use this for mlock too, and that case needs both bits.

I did look at converting mlock (and it's why I wanted to make
release_pages() take that whole encoded thing in general, rather than
make some special case for it), but the mlock code uses that "struct
pagevec" abstraction that seems entirely pointless ("pvec->nr" becomes
"pagevec_count(pvec)", which really doesn't seem to be any clearer at
all), but whatever.

               Linus
  
Alexander Gordeev Nov. 9, 2022, 6:36 a.m. UTC | #3
On Tue, Nov 08, 2022 at 11:41:36AM -0800, Linus Torvalds wrote:

Hi Linus,

[...]

> +struct encoded_page;
> +#define ENCODE_PAGE_BITS 3ul
> +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> +{

Any reaction in case ((flags & ~ENCODE_PAGE_BITS) != 0)?

> +	return (struct encoded_page *)(flags | (unsigned long)page);
> +}

Thanks!
  
Linus Torvalds Nov. 9, 2022, 6 p.m. UTC | #4
On Tue, Nov 8, 2022 at 10:38 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> On Tue, Nov 08, 2022 at 11:41:36AM -0800, Linus Torvalds wrote:
>
> > +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> > +{
>
> Any reaction in case ((flags & ~ENCODE_PAGE_BITS) != 0)?

Heh. I've actually had three different implementations for that during
the development series, and I think I even posted them all at one
point or another (although usually just as attachments). And none of
them are good.

Those three trivial versions are: (a) use VM_BUG_ON(), (b) just
silently mask the bits and (c) just silently add them.

And (c) is the least annoying option that this latest patch uses,
because both (a) and (b) are just nasty.

Basically, all users are locally trivial to verify statically, so
VM_BUG_ON() is just conceptually wrong and generates extra pointless
code. And the silent masking - if it makes any difference - is just
another version of "just silently add the bits": regardless of whether
it clears them or not, it does the wrong thing if the bits don't fit.
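
The three runtime variants described above can be sketched side by side in user-space C. This is an illustration under stated assumptions, not the code from any posted version: assert() stands in for VM_BUG_ON(), 'struct page' is a stand-in type over-aligned so that bit 2 of its address is known to be zero, and the function names encode_page_checked(), encode_page_masked() and encode_page_raw() are hypothetical.

```c
#include <assert.h>

/* Stand-in type (assumption); 16-byte alignment keeps address bits 0-3 zero. */
struct page { _Alignas(16) unsigned long flags; };

struct encoded_page;
#define ENCODE_PAGE_BITS 3ul

/* (a) runtime check; assert() plays the role of VM_BUG_ON() here */
static inline struct encoded_page *encode_page_checked(struct page *page, unsigned long flags)
{
	assert((flags & ~ENCODE_PAGE_BITS) == 0);
	return (struct encoded_page *)(flags | (unsigned long)page);
}

/* (b) silently mask off bits that do not fit */
static inline struct encoded_page *encode_page_masked(struct page *page, unsigned long flags)
{
	return (struct encoded_page *)((flags & ENCODE_PAGE_BITS) | (unsigned long)page);
}

/* (c) silently OR the bits in, with no extra work */
static inline struct encoded_page *encode_page_raw(struct page *page, unsigned long flags)
{
	return (struct encoded_page *)(flags | (unsigned long)page);
}

static inline struct page *encoded_page_ptr(struct encoded_page *page)
{
	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
}

static void encode_options_selftest(void)
{
	static struct page pg;
	/* 4 does not fit in the two low bits */
	struct encoded_page *b = encode_page_masked(&pg, 4);
	struct encoded_page *c = encode_page_raw(&pg, 4);

	assert(encoded_page_ptr(b) == &pg);			/* pointer intact ... */
	assert(((unsigned long)b & ENCODE_PAGE_BITS) == 0);	/* ... but the flag is lost */
	assert(encoded_page_ptr(c) != &pg);			/* bit 2 leaked into the pointer */
	assert(encoded_page_ptr(encode_page_checked(&pg, 1)) == &pg);
}
```

Either way an out-of-range value produces a wrong result: (b) loses the flag and (c) corrupts the decoded pointer, which matches the argument that masking is just another version of silently adding the bits.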

So there are three bad options, I've gone back and forth between them
all, and I chose the least offensive one that is "invisible", in that
it at least doesn't do any extra pointless work.

Now, there are two non-offensive options too, which I actually
considered but never implemented. They both fix the problem
properly, by making it a *buildtime* check, but they have other
issues.

There's two ways to just make it a build-time check, and it's
annoyingly _close_ to being usable, but not quite there.

One is simply to require that the flags argument is always a plain
constant, and then simply use BUILD_BUG_ON().

I actually almost went down that path - one of the things I considered
was to not add a 'flags' argument to __tlb_remove_page() at all, but
instead just have separate __tlb_remove_page() and
__tlb_remove_page_dirty() functions.

That would have meant that the argument to __tlb_remove_page_size
would have always been a build-time constant, and then it would be
trivial to just have that BUILD_BUG_ON(). Problem solved.
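
As a rough user-space illustration of that constant-only approach, the sketch below rejects out-of-range flags at build time by requiring a true integer constant expression. It uses C11 _Static_assert rather than the kernel's BUILD_BUG_ON(), and the macro names STATIC_ASSERT_ZERO() and encode_page_const(), the helper encode_page_unchecked() and the stand-in 'struct page' are all hypothetical.

```c
#include <assert.h>

struct page { unsigned long flags; };	/* stand-in type (assumption) */
struct encoded_page;
#define ENCODE_PAGE_BITS 3ul

static inline struct encoded_page *encode_page_unchecked(struct page *page, unsigned long flags)
{
	return (struct encoded_page *)(flags | (unsigned long)page);
}

/* Evaluates to 0, but only compiles if 'cond' is an integer constant
 * expression that is true. */
#define STATIC_ASSERT_ZERO(cond) \
	(0 * sizeof(struct { \
		_Static_assert(cond, "flags must fit in ENCODE_PAGE_BITS"); \
		char unused; \
	}))

/* 'flags' must be a plain constant here; a variable will not compile. */
#define encode_page_const(page, flags) \
	encode_page_unchecked((page), \
		(flags) + STATIC_ASSERT_ZERO(((flags) & ~ENCODE_PAGE_BITS) == 0))

static void encode_const_selftest(void)
{
	static struct page pg;
	struct encoded_page *e = encode_page_const(&pg, 1);

	assert(((unsigned long)e & ENCODE_PAGE_BITS) == 1);
	/* encode_page_const(&pg, 4) would be rejected at build time. */
}
```

The cost is the one described next in the message: every caller has to pass a literal, which does not fit configurations that compute the flag value.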

But it turns out that it's just nasty, particularly with different
configurations wanting different rules for what the dirty bit is. So
forcing it to some constant value was really not acceptable.

The thing that I actually *wanted* to do, but didn't actually dare,
was to just say "I will trust the compiler to do the value range
tracking".

Because *technically* our BUILD_BUG_ON() doesn't need a compile-time
constant. Because our implementation of BUILD_BUG_ON() is not the
garbage that the compiler gives us in "_Static_assert()" that really
requires a syntactically pure integer constant expression.

So the kernel version of BUILD_BUG_ON() is actually something much
smarter: it depends on the compiler actually *optimizing* the
expression, and it's only that optimized value that needs to be
determined at compile-time to be either true or false. You can use
things like inline functions etc, just as long as the end result is
obvious enough that the compiler ends up saying "ok, that's never the
case".

And *if* the compiler does any kind of reasonable range analysis, then a

        BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);

should actually work. In theory.

In practice? Not so much.

Because while the argument isn't constant (not even in the caller),
the compiler *should* be smart enough to see that in the use in
mm/memory.c, 'flags' is always that

        unsigned int delay_rmap;

which then gets initialized to

        delay_rmap = 0;

and conditionally set to '1' later. So it's not a *constant*, but the
compiler can see that the value of flags is clearly never larger than
ENCODE_PAGE_BITS.

But right now the compiler cannot track that across the call to the
non-inline function __tlb_remove_page_size().

Maybe if the 'encode_page()' was done in the caller, and
__tlb_remove_page_size() were to just take an encoded_page as the
argument, then the compiler would always only see this all through
inlined functions, and it would work.

But even if it were to work for me (I never tried), I'd have been much
too worried that some other compiler version, with some other config
options, on some other architecture, wouldn't make the required
optimizations.

We do require compiler optimizations to be on for 'BUILD_BUG_ON()' to
do anything at all:

   #ifdef __OPTIMIZE__
   # define __compiletime_assert(condition, msg, prefix, suffix)           \
   ..
   #else
   # define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0)
   #endif

and we have a lot of places that depend on BUILD_BUG_ON() to do basic
constant folding and other fairly simple optimizations.
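
For readers unfamiliar with the mechanism, here is a simplified user-space sketch of an optimization-dependent build check in the same spirit. It is not the kernel's implementation: the macros OPT_BUILD_BUG_ON() and COMPILE_ERROR_ATTR(), the function encode_page_flags_out_of_range() and the stand-in 'struct page' are hypothetical. It assumes GCC or clang, and like the kernel version it does nothing when the compiler is not optimizing.

```c
#include <assert.h>

struct page { unsigned long flags; };	/* stand-in type (assumption) */
struct encoded_page;
#define ENCODE_PAGE_BITS 3ul

/* Use the 'error' function attribute where the compiler provides it. */
#if defined(__has_attribute)
# if __has_attribute(error)
#  define COMPILE_ERROR_ATTR(msg) __attribute__((error(msg)))
# endif
#endif
#ifndef COMPILE_ERROR_ATTR
# define COMPILE_ERROR_ATTR(msg)
#endif

#ifdef __OPTIMIZE__
/* Declared but never defined. If the optimizer cannot prove the call
 * dead, the build fails: at compile time with the 'error' attribute,
 * otherwise at link time. */
extern void encode_page_flags_out_of_range(void)
	COMPILE_ERROR_ATTR("flags do not fit in ENCODE_PAGE_BITS");
# define OPT_BUILD_BUG_ON(cond) \
	do { if (cond) encode_page_flags_out_of_range(); } while (0)
#else
# define OPT_BUILD_BUG_ON(cond) do { } while (0)
#endif

/* Forced inlining so the check is evaluated with the caller's knowledge
 * of 'flags'. */
static inline __attribute__((always_inline))
struct encoded_page *encode_page(struct page *page, unsigned long flags)
{
	OPT_BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
	return (struct encoded_page *)(flags | (unsigned long)page);
}

static void encode_page_selftest(void)
{
	static struct page pg;
	struct encoded_page *e = encode_page(&pg, 1);

	assert(((unsigned long)e & ENCODE_PAGE_BITS) == 1);
}
```

The sketch only passes literal values, which plain constant folding handles. Whether a value that is merely bounded, such as a variable that is 0 or conditionally 1, also passes depends on the compiler's range analysis, which is exactly the uncertainty discussed in this message.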

But while I think a BUILD_BUG_ON() would be the right thing to do
here, I do not feel confident enough to really put that to the test.

              Linus
  
Linus Torvalds Nov. 9, 2022, 8:02 p.m. UTC | #5
On Wed, Nov 9, 2022 at 10:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But while I think a BUILD_BUG_ON() would be the right thing to do
> here, I do not feel confident enough to really put that to the test.

Oh, what the hell.

Just writing that whole explanation out made me just go "let's try to
re-organize it a bit so that we *can* inline everything, and see how
well it works".

And it does actually work to use BUILD_BUG_ON(), both with gcc and clang.

At least that's the case with the versions of gcc and clang _I_ use,
and in the configurations I tested.

So now I have a slightly massaged version of the patches (I did have
to move the 'encode_page()' around a bit), which has that
BUILD_BUG_ON() in it, and it passes for me.

And I find that I really like seeing that whole page pointer encoding
be so obviously much stricter. That was obviously the point of the
whole separate type system checking; now it does bit value validity
checking too.

So I'll walk through my patches one more time to check for it, but
I'll post it as a git branch and send out a new series (and do it in a
separate thread with a cover letter, to not confuse the little mind of
'b4' again).

If it turns out that some other compiler version or configuration
doesn't deal with the BUILD_BUG_ON() gracefully, it's easy enough to
remove, and it will hopefully show up in linux-next when Andrew picks
it up.

                  Linus
  

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..b5cffd250784 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -67,7 +67,7 @@  struct mem_cgroup;
 #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
 #define _struct_page_alignment	__aligned(2 * sizeof(unsigned long))
 #else
-#define _struct_page_alignment
+#define _struct_page_alignment	__aligned(sizeof(unsigned long))
 #endif
 
 struct page {
@@ -241,6 +241,37 @@  struct page {
 #endif
 } _struct_page_alignment;
 
+/**
+ * struct encoded_page - a nonexistent type marking this pointer
+ *
+ * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
+ * with the low bits of the pointer indicating extra context-dependent
+ * information. Not super-common, but happens in mmu_gather and mlock
+ * handling, and this acts as a type system check on that use.
+ *
+ * We only really have two guaranteed bits in general, although you could
+ * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+ * for more.
+ *
+ * Use the supplied helper functions to encode/decode the pointer and bits.
+ */
+struct encoded_page;
+#define ENCODE_PAGE_BITS 3ul
+static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
+{
+	return (struct encoded_page *)(flags | (unsigned long)page);
+}
+
+static inline bool encoded_page_flags(struct encoded_page *page)
+{
+	return ENCODE_PAGE_BITS & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+}
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.