[v2,2/7] mm/vmalloc.c: add flags to mark vm_map_ram area

Message ID 20221217015435.73889-3-bhe@redhat.com
State New
Series mm/vmalloc.c: allow vread() to read out vm_map_ram areas

Commit Message

Baoquan He Dec. 17, 2022, 1:54 a.m. UTC
Through the vmalloc API, a virtual kernel area is reserved for physical
address mapping. A vmap_area is used to track it, while a vm_struct is
allocated and associated with the vmap_area to store more information
and be passed out.

However, an area reserved via vm_map_ram() is an exception. It doesn't
have a vm_struct associated with its vmap_area. And we can't recognize a
vmap_area with '->vm == NULL' as a vm_map_ram() area, because the normal
freeing path sets va->vm = NULL before unmapping; please see
remove_vm_area().

Meanwhile, there are two types of vm_map_ram area. One is a whole
vmap_area reserved and mapped in one go; the other is a whole vmap_area
of VMAP_BLOCK_SIZE reserved up front, then mapped into smaller split
regions several times via vb_alloc().

To mark an area reserved through vm_map_ram(), add a flags field to
struct vmap_area. Bit 0 indicates whether it's a vm_map_ram area, while
bit 1 indicates whether it's a vmap_block type of vm_map_ram area.
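
For illustration only (not part of the diff below; these helper names
are made up), the two bits classify a vmap_area like this:

  /* true for either type of vm_map_ram area */
  static bool is_vm_map_ram_area(struct vmap_area *va)
  {
  	return va->flags & VMAP_RAM;
  }

  /* true only for the vmap_block type, subdivided via vb_alloc() */
  static bool is_vmap_block_area(struct vmap_area *va)
  {
  	return (va->flags & VMAP_FLAGS_MASK) == (VMAP_RAM | VMAP_BLOCK);
  }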

This is a preparation for later use.

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 include/linux/vmalloc.h |  1 +
 mm/vmalloc.c            | 22 +++++++++++++++++-----
 2 files changed, 18 insertions(+), 5 deletions(-)
  

Comments

Lorenzo Stoakes Dec. 17, 2022, 11:44 a.m. UTC | #1
On Sat, Dec 17, 2022 at 09:54:30AM +0800, Baoquan He wrote:
> @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
>  		return;
>  	}
>
> -	va = find_vmap_area(addr);
> +	spin_lock(&vmap_area_lock);
> +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
>  	BUG_ON(!va);
> +	if (va)
> +		va->flags &= ~VMAP_RAM;
> +	spin_unlock(&vmap_area_lock);
>  	debug_check_no_locks_freed((void *)va->va_start,
>  				    (va->va_end - va->va_start));
>  	free_unmap_vmap_area(va);

Would it be better to perform the BUG_ON() after the lock is released? You
already check if va exists before unmasking so it's safe.

Also, do we want to clear VMAP_BLOCK here?
  
Baoquan He Dec. 19, 2022, 8:01 a.m. UTC | #2
On 12/17/22 at 11:44am, Lorenzo Stoakes wrote:
> On Sat, Dec 17, 2022 at 09:54:30AM +0800, Baoquan He wrote:
> > @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> >  		return;
> >  	}
> >
> > -	va = find_vmap_area(addr);
> > +	spin_lock(&vmap_area_lock);
> > +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> >  	BUG_ON(!va);
> > +	if (va)
> > +		va->flags &= ~VMAP_RAM;
> > +	spin_unlock(&vmap_area_lock);
> >  	debug_check_no_locks_freed((void *)va->va_start,
> >  				    (va->va_end - va->va_start));
> >  	free_unmap_vmap_area(va);
> 
> Would it be better to perform the BUG_ON() after the lock is released? You
> already check if va exists before unmasking so it's safe.

It's a little unclear to me why we care whether BUG_ON() is performed
before or after the lock is released. We won't have a stable kernel
after BUG_ON(), right?
> 
> Also, do we want to clear VMAP_BLOCK here?

I do, but I don't find a good place to clear VMAP_BLOCK.

In v1, I tried to clear it in free_vmap_area_noflush() as below, but
Uladzislau disliked it, so I removed it. My thinking is that when we
unmap and free the vmap area, the vmap_area is moved from vmap_area_root
into free_vmap_area_root. When we allocate a new vmap_area via
alloc_vmap_area(), we allocate a new va with kmem_cache_alloc_node(), so
its va->flags must be 0. It seems that not initializing it to 0 won't
have any impact.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5d3fd3e6fe09..d6f376060d83 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1815,6 +1815,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)

        spin_lock(&vmap_area_lock);
        unlink_va(va, &vmap_area_root);
+       va->flags = 0;
        spin_unlock(&vmap_area_lock);

        nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>

>
  
Lorenzo Stoakes Dec. 19, 2022, 9:09 a.m. UTC | #3
On Mon, Dec 19, 2022 at 04:01:00PM +0800, Baoquan He wrote:
> On 12/17/22 at 11:44am, Lorenzo Stoakes wrote:
> > On Sat, Dec 17, 2022 at 09:54:30AM +0800, Baoquan He wrote:
> > > @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> > >  		return;
> > >  	}
> > >
> > > -	va = find_vmap_area(addr);
> > > +	spin_lock(&vmap_area_lock);
> > > +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> > >  	BUG_ON(!va);
> > > +	if (va)
> > > +		va->flags &= ~VMAP_RAM;
> > > +	spin_unlock(&vmap_area_lock);
> > >  	debug_check_no_locks_freed((void *)va->va_start,
> > >  				    (va->va_end - va->va_start));
> > >  	free_unmap_vmap_area(va);
> >
> > Would it be better to perform the BUG_ON() after the lock is released? You
> > already check if va exists before unmasking so it's safe.
>
> It's a little unclear to me why we care BUG_ON() is performed before or
> after the lock released. We won't have a stable kernel after BUG_ON()(),
> right?

BUG_ON()s can be recoverable in user context, and it would be a very
simple change that would not fundamentally alter anything to simply
place the added lines prior to the BUG_ON().

The code as-is doesn't really make sense anyway: you BUG_ON(!va), then
check if va is non-null, and then immediately afterwards the function
passes va around as if it were not null, so I think it'd also be an
aesthetic and logical improvement :)

> >
> > Also, do we want to clear VMAP_BLOCK here?
>
> I do, but I don't find a good place to clear VMAP_BLOCK.
>
> In v1, I tried to clear it in free_vmap_area_noflush() as below,
> Uladzislau dislikes it. So I remove it. My thinking is when we unmap and
> free the vmap area, the vmap_area is moved from vmap_area_root into
> &free_vmap_area_root. When we allocate a new vmap_area via
> alloc_vmap_area(), we will allocate a new va by kmem_cache_alloc_node(),
> the va->flags must be 0. Seems not initializing it to 0 won't impact
> thing.
>

You are at this point clearing the VMAP_RAM flag though, so if it is unimportant
what the flags are after this call, why are you clearing this one?

It is just a little confusing. I wonder whether the VMAP_BLOCK flag is
necessary at all: is it possible to just treat a non-VMAP_BLOCK VMAP_RAM
area as if it were simply a fully occupied block? Do we gain much by the
distinction?

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5d3fd3e6fe09..d6f376060d83 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1815,6 +1815,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
>
>         spin_lock(&vmap_area_lock);
>         unlink_va(va, &vmap_area_root);
> +       va->flags = 0;
>         spin_unlock(&vmap_area_lock);
>
>         nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
>
> >
>
  
Baoquan He Dec. 19, 2022, 12:24 p.m. UTC | #4
On 12/19/22 at 09:09am, Lorenzo Stoakes wrote:
> On Mon, Dec 19, 2022 at 04:01:00PM +0800, Baoquan He wrote:
> > On 12/17/22 at 11:44am, Lorenzo Stoakes wrote:
> > > On Sat, Dec 17, 2022 at 09:54:30AM +0800, Baoquan He wrote:
> > > > @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> > > >  		return;
> > > >  	}
> > > >
> > > > -	va = find_vmap_area(addr);
> > > > +	spin_lock(&vmap_area_lock);
> > > > +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> > > >  	BUG_ON(!va);
> > > > +	if (va)
> > > > +		va->flags &= ~VMAP_RAM;
> > > > +	spin_unlock(&vmap_area_lock);
> > > >  	debug_check_no_locks_freed((void *)va->va_start,
> > > >  				    (va->va_end - va->va_start));
> > > >  	free_unmap_vmap_area(va);
> > >
> > > Would it be better to perform the BUG_ON() after the lock is released? You
> > > already check if va exists before unmasking so it's safe.
> >
> > It's a little unclear to me why we care BUG_ON() is performed before or
> > after the lock released. We won't have a stable kernel after BUG_ON()(),
> > right?
> 
> BUG_ON()'s can be recoverable in user context and it would be a very simple
> change that would not fundamentally alter anything to simply place the added
> lines prior to the BUG_ON().
> 
> The code as-is doesn't really make sense anyway, you BUG_ON(!va) then check if
> va is non-null, then immediately the function afterwards passes va around as if
> it were not null, so I think it'd also be an aesthetic and logical improvement
> :)

In fact, I should not do the checking, but do the clearing anyway. If I
change it as below, does it look better to you?


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5e578563784a..369b848d097a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2253,8 +2253,7 @@ void vm_unmap_ram(const void *mem, unsigned int count)
 	spin_lock(&vmap_area_lock);
 	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
 	BUG_ON(!va);
-	if (va)
-		va->flags &= ~VMAP_RAM;
+	va->flags &= ~VMAP_RAM;
 	spin_unlock(&vmap_area_lock);
 	debug_check_no_locks_freed((void *)va->va_start,
 				    (va->va_end - va->va_start));

> 
> > >
> > > Also, do we want to clear VMAP_BLOCK here?
> >
> > I do, but I don't find a good place to clear VMAP_BLOCK.
> >
> > In v1, I tried to clear it in free_vmap_area_noflush() as below,
> > Uladzislau dislikes it. So I remove it. My thinking is when we unmap and
> > free the vmap area, the vmap_area is moved from vmap_area_root into
> > &free_vmap_area_root. When we allocate a new vmap_area via
> > alloc_vmap_area(), we will allocate a new va by kmem_cache_alloc_node(),
> > the va->flags must be 0. Seems not initializing it to 0 won't impact
> > thing.
> >
> 
> You are at this point clearing the VMAP_RAM flag though, so if it is unimportant
> what the flags are after this call, why are you clearing this one?

With my understanding, we had better do the clearing. Currently, judging
from the code, not doing the clearing won't cause an issue. If possible,
I would like to clear it as in the draft code below.

> 
> It is just a little confusing, I wonder whether the VMAP_BLOCK flag is necessary
> at all, is it possible to just treat a non-VMAP_BLOCK VMAP_RAM area as if it
> were simply a fully occupied block? Do we gain much by the distinction?

Yeah, the VMAP_BLOCK flag is necessary. A vmap_block contains used
regions, or dirty/free regions, while a non-vmap_block vm_map_ram area
is similar to a non-vm_map_ram area. When reading out vm_map_ram
regions, vmap_block regions need to be treated differently.
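
For example, a reader like vread() could dispatch on the flags roughly
as in the sketch below; vmap_ram_vread() is only a made-up name here, to
show that vmap_block regions need their own walk over used/dirty
regions, while other areas keep using aligned_vread() as vread() already
does:

	if (va->flags & VMAP_RAM)
		/* may contain unmapped vmap_block holes */
		vmap_ram_vread(buf, addr, n, va->flags);
	else
		aligned_vread(buf, addr, n);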

> 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5d3fd3e6fe09..d6f376060d83 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1815,6 +1815,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
> >
> >         spin_lock(&vmap_area_lock);
> >         unlink_va(va, &vmap_area_root);
> > +       va->flags = 0;
> >         spin_unlock(&vmap_area_lock);
> >
> >         nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
> >
> > >
> >
>
  
Lorenzo Stoakes Dec. 19, 2022, 1:01 p.m. UTC | #5
On Mon, Dec 19, 2022 at 08:24:47PM +0800, Baoquan He wrote:
> In fact, I should not do the checking, but do the clearing anyway. If I
> change it as below, does it look better to you?
>
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5e578563784a..369b848d097a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2253,8 +2253,7 @@ void vm_unmap_ram(const void *mem, unsigned int count)
>  	spin_lock(&vmap_area_lock);
>  	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
>  	BUG_ON(!va);
> -	if (va)
> -		va->flags &= ~VMAP_RAM;
> +	va->flags &= ~VMAP_RAM;
>  	spin_unlock(&vmap_area_lock);
>  	debug_check_no_locks_freed((void *)va->va_start,
>  				    (va->va_end - va->va_start));

This is better as it avoids the slightly contradictory situation of checking for
a condition we've asserted is not the case, but I would still far prefer keeping
this as-is and placing the unlock before the BUG_ON().

This avoids explicitly and knowingly holding a lock over a BUG_ON() and is
simple to implement, e.g.:-

  	spin_lock(&vmap_area_lock);
  	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
 	if (va)
 		va->flags &= ~VMAP_RAM;
  	spin_unlock(&vmap_area_lock);
  	BUG_ON(!va);

> > You are at this point clearing the VMAP_RAM flag though, so if it is unimportant
> > what the flags are after this call, why are you clearing this one?
>
> With my understanding, We had better do the clearing. Currently, from
> the code, not doing the clearing won't cause issue. If possible, I would
> like to clear it as below draft code.
>

Sure, it seems appropriate to clear it, I'm just unsure as to why you aren't
just clearing both flags? Perhaps just set va->flags = 0?

> >
> > It is just a little confusing, I wonder whether the VMAP_BLOCK flag is necessary
> > at all, is it possible to just treat a non-VMAP_BLOCK VMAP_RAM area as if it
> > were simply a fully occupied block? Do we gain much by the distinction?
>
> Yeah, VMAP_BLOCK flag is necessary. vmap_block contains used region,
> or dirty/free regions. While the non-vmap_blcok vm_map_ram area is
> similar with the non vm_map_ram area. When reading out vm_map_ram
> regions, vmap_block regions need be treated differently.

OK, looking through again closely, I see you're absolutely right. I
wondered whether you could somehow make a non-VMAP_BLOCK vread()
operation equivalent to a block one (only across the whole mapping), but
I don't think you can.
  
Baoquan He Dec. 20, 2022, 12:14 p.m. UTC | #6
On 12/19/22 at 01:01pm, Lorenzo Stoakes wrote:
> On Mon, Dec 19, 2022 at 08:24:47PM +0800, Baoquan He wrote:
> > In fact, I should not do the checking, but do the clearing anyway. If I
> > change it as below, does it look better to you?
> >
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5e578563784a..369b848d097a 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2253,8 +2253,7 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> >  	spin_lock(&vmap_area_lock);
> >  	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> >  	BUG_ON(!va);
> > -	if (va)
> > -		va->flags &= ~VMAP_RAM;
> > +	va->flags &= ~VMAP_RAM;
> >  	spin_unlock(&vmap_area_lock);
> >  	debug_check_no_locks_freed((void *)va->va_start,
> >  				    (va->va_end - va->va_start));
> 
> This is better as it avoids the slightly contradictory situation of checking for
> a condition we've asserted is not the case, but I would still far prefer keeping
> this as-is and placing the unlock before the BUG_ON().
> 
> This avoids explicitly and knowingly holding a lock over a BUG_ON() and is
> simple to implement, e.g.:-
> 
>   	spin_lock(&vmap_area_lock);
>   	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
>  	if (va)
>  		va->flags &= ~VMAP_RAM;
>   	spin_unlock(&vmap_area_lock);
>   	BUG_ON(!va);

OK, I will change like this.

> 
> > > You are at this point clearing the VMAP_RAM flag though, so if it is unimportant
> > > what the flags are after this call, why are you clearing this one?
> >
> > With my understanding, We had better do the clearing. Currently, from
> > the code, not doing the clearing won't cause issue. If possible, I would
> > like to clear it as below draft code.
> >
> 
> Sure, it seems appropriate to clear it, I'm just unsure as to why you aren't
> just clearing both flags? Perhaps just set va->flags = 0?

Hmm, for the two kinds of vm_map_ram areas, their code paths are
different. For unmapping a vmap_block vm_map_ram area, it goes through
vb_free(), so I can only do the clearing in free_vmap_block().

  -->vm_unmap_ram()
     -->vb_free()
        -->free_vmap_block()

For a non-vmap_block vm_map_ram area, I can do the clearing in
vm_unmap_ram():
  -->vm_unmap_ram()
     -->__find_vmap_area()
     -->free_unmap_vmap_area()

As said earlier, clearing va->flags when unmapping a vm_map_ram area, or
clearing va->vm in remove_vm_area(), had better be done. However, not
clearing them won't cause an issue currently, because the old vmap_area
is inserted into free_vmap_area_root, and when we allocate a new
vmap_area through alloc_vmap_area(), we get a new vmap_area from
vmap_area_cachep, so the old va->flags or va->vm won't be carried into
the new vmap_area. Clearing them is good to have, just in case.

Rethinking about this, I may need to do the clearing when freeing a
vmap_block. Otherwise, people will be confused about why the clearing is
not done.

@@ -1815,6 +1815,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)

        spin_lock(&vmap_area_lock);
        unlink_va(va, &vmap_area_root);
+       va->flags = 0;
        spin_unlock(&vmap_area_lock);

        nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>

> 
> > >
> > > It is just a little confusing, I wonder whether the VMAP_BLOCK flag is necessary
> > > at all, is it possible to just treat a non-VMAP_BLOCK VMAP_RAM area as if it
> > > were simply a fully occupied block? Do we gain much by the distinction?
> >
> > Yeah, VMAP_BLOCK flag is necessary. vmap_block contains used region,
> > or dirty/free regions. While the non-vmap_blcok vm_map_ram area is
> > similar with the non vm_map_ram area. When reading out vm_map_ram
> > regions, vmap_block regions need be treated differently.
> 
> OK looking through again closely I see you're absolutely right, I wondered
> whether you could somehow make a non-VMAP_BLOCK vread() operation be equivalent
> to a block one (only across the whole mapping), but I don't think you can.

Right, it's much easier to handle a non-VMAP_BLOCK vm_map_ram area the
same way as a normal vmap area.
  
Lorenzo Stoakes Dec. 20, 2022, 12:42 p.m. UTC | #7
On Tue, Dec 20, 2022 at 08:14:15PM +0800, Baoquan He wrote:
> Hmm, for the two kinds of vm_map_ram areas, their code paths are
> different. for unmapping vmap_block vm_map_ram, it goes through
> vb_free(). I can only do the clearing in free_vmap_block().
>
>   -->vm_unmap_ram()
>      -->vb_free()
>         -->free_vmap_block()
>
> For non vmap_block vm_map_ram area, I can do the clearing in
> vm_unmap_ram().
>   -->vm_unmap_ram()
>      -->__find_vmap_area()
>      -->free_unmap_vmap_area()
>
> As said earlier, clearing va->flags when unmap vm_map_ram area, or
> clearing va->vm in remove_vm_area(), these have better be done.
> However, not clearing them won't cause issue currently. Because the
> old vmap_area is inserted into free_vmap_area_root, when we allocate a
> new vmap_area through alloc_vmap_area(), we will get q new vmap_area
> from vmap_area_cachep, the old va->flags or va->vm won't be carried into
> the new vmap_area. Clearing them is a good to have, just in case.
>

Sure, this is more so about avoiding confusion and perhaps some future change
which might not take into account that flag state could be invalid.

I guess logically speaking, this is still a block when you are unmapping ram, so
it's not unreasonable to retain the VMAP_BLOCK flag. In that case I'd say we're
good simply clearing VMAP_RAM here. Thanks for the explanations!

> Rethinking about this, I may need to do the clearing when freeing
> vmap_block. Otherwise, people will be confused why the clearing is not
> done.
>
> @@ -1815,6 +1815,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
>
>         spin_lock(&vmap_area_lock);
>         unlink_va(va, &vmap_area_root);
> +       va->flags = 0;
>         spin_unlock(&vmap_area_lock);
>
>         nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
>

That sounds like a good idea!
  
Uladzislau Rezki Dec. 20, 2022, 4:55 p.m. UTC | #8
> Through vmalloc API, a virtual kernel area is reserved for physical
> address mapping. And vmap_area is used to track them, while vm_struct
> is allocated to associate with the vmap_area to store more information
> and passed out.
> 
> However, area reserved via vm_map_ram() is an exception. It doesn't have
> vm_struct to associate with vmap_area. And we can't recognize the
> vmap_area with '->vm == NULL' as a vm_map_ram() area because the normal
> freeing path will set va->vm = NULL before unmapping, please see
> function remove_vm_area().
> 
A normal "free" path sets it to NULL in order to prevent a double-free
of the same VA. We can avoid touching va->vm if needed and instead do an
unlink on entry in remove_vm_area(), while the lock is taken to find the
area.

Will it help you?
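
Something like the sketch below (the helper name is made up here):

	/*
	 * Find the VA and take it off the busy tree in one step, so the
	 * caller does not need to clear va->vm at all.
	 */
	static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
	{
		struct vmap_area *va;

		spin_lock(&vmap_area_lock);
		va = __find_vmap_area(addr, &vmap_area_root);
		if (va)
			unlink_va(va, &vmap_area_root);
		spin_unlock(&vmap_area_lock);

		return va;
	}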

> Meanwhile, there are two types of vm_map_ram area. One is the whole
> vmap_area being reserved and mapped at one time; the other is the
> whole vmap_area with VMAP_BLOCK_SIZE size being reserved, while mapped
> into split regions with smaller size several times via vb_alloc().
> 
> To mark the area reserved through vm_map_ram(), add flags field into
> struct vmap_area. Bit 0 indicates whether it's a vm_map_ram area,
> while bit 1 indicates whether it's a vmap_block type of vm_map_ram
> area.
> 
> This is a preparatoin for later use.
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  include/linux/vmalloc.h |  1 +
>  mm/vmalloc.c            | 22 +++++++++++++++++-----
>  2 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 096d48aa3437..69250efa03d1 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -76,6 +76,7 @@ struct vmap_area {
>  		unsigned long subtree_max_size; /* in "free" tree */
>  		struct vm_struct *vm;           /* in "busy" tree */
>  	};
> +	unsigned long flags; /* mark type of vm_map_ram area */
>  };
>  
>  /* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5d3fd3e6fe09..190f29bbaaa7 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1586,7 +1586,8 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
>  static struct vmap_area *alloc_vmap_area(unsigned long size,
>  				unsigned long align,
>  				unsigned long vstart, unsigned long vend,
> -				int node, gfp_t gfp_mask)
> +				int node, gfp_t gfp_mask,
> +				unsigned long va_flags)
>  {
>  	struct vmap_area *va;
>  	unsigned long freed;
> @@ -1630,6 +1631,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	va->va_start = addr;
>  	va->va_end = addr + size;
>  	va->vm = NULL;
> +	va->flags = va_flags;
>  
>  	spin_lock(&vmap_area_lock);
>  	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> @@ -1887,6 +1889,10 @@ struct vmap_area *find_vmap_area(unsigned long addr)
>  
>  #define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
>  
> +#define VMAP_RAM		0x1
> +#define VMAP_BLOCK		0x2
> +#define VMAP_FLAGS_MASK		0x3
> 
Maybe rename VMAP_BLOCK to something like VMAP_BLOCK_RESERVED or
VMAP_PER_CPU_BLOCK?

>  struct vmap_block_queue {
>  	spinlock_t lock;
>  	struct list_head free;
> @@ -1962,7 +1968,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
>  
>  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
>  					VMALLOC_START, VMALLOC_END,
> -					node, gfp_mask);
> +					node, gfp_mask,
> +					VMAP_RAM|VMAP_BLOCK);
>
A new_vmap_block() is for the per-cpu path. As far as I see, the
VMAP_BLOCK flag is used to mark a VA that corresponds to a reserved
per-cpu free area.

Whereas VMAP_RAM is for a VA that was obtained not over the per-cpu path
but over alloc_vmap_area(), thus a VA that should be read out of the
"busy" tree directly.

Why do you need to set both VMAP_RAM and VMAP_BLOCK here?

>  	if (IS_ERR(va)) {
>  		kfree(vb);
>  		return ERR_CAST(va);
> @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
>  		return;
>  	}
>  
> -	va = find_vmap_area(addr);
> +	spin_lock(&vmap_area_lock);
> +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
>  	BUG_ON(!va);
> +	if (va)
> +		va->flags &= ~VMAP_RAM;
> +	spin_unlock(&vmap_area_lock);
>  	debug_check_no_locks_freed((void *)va->va_start,
>
Agree with Lorenzo. BUG_ON() should be out of the spinlock. Furthermore,
I think it makes sense to go with WARN_ON_ONCE() and not kill the
system; instead emit a warning and bail out.

What do you think? Maybe a separate patch for it?

>  				    (va->va_end - va->va_start));
>  	free_unmap_vmap_area(va);
> @@ -2265,7 +2276,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
>  	} else {
>  		struct vmap_area *va;
>  		va = alloc_vmap_area(size, PAGE_SIZE,
> -				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> +				VMALLOC_START, VMALLOC_END,
> +				node, GFP_KERNEL, VMAP_RAM);
>  		if (IS_ERR(va))
>  			return NULL;
>  
> @@ -2505,7 +2517,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
>  	if (!(flags & VM_NO_GUARD))
>  		size += PAGE_SIZE;
>  
> -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
> +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
>  	if (IS_ERR(va)) {
>  		kfree(area);
>  		return NULL;
>
I know we have already discussed the new parameter. But what if we just
use an atomic_set() operation to mark a VA as either vmap-ram or
vmap-block?

As for alloc_vmap_area(), we just set it to zero.
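
Roughly like below (sketch only; it assumes va->flags would become an
atomic_t instead of taking the new parameter):

	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
					VMALLOC_START, VMALLOC_END,
					node, gfp_mask);
	if (!IS_ERR(va))
		/* mark the type after allocation instead of passing a flag */
		atomic_set(&va->flags, VMAP_RAM | VMAP_BLOCK);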

--
Uladzislau Rezki
  
Baoquan He Dec. 23, 2022, 4:14 a.m. UTC | #9
On 12/20/22 at 05:55pm, Uladzislau Rezki wrote:
> > Through vmalloc API, a virtual kernel area is reserved for physical
> > address mapping. And vmap_area is used to track them, while vm_struct
> > is allocated to associate with the vmap_area to store more information
> > and passed out.
> > 
> > However, area reserved via vm_map_ram() is an exception. It doesn't have
> > vm_struct to associate with vmap_area. And we can't recognize the
> > vmap_area with '->vm == NULL' as a vm_map_ram() area because the normal
> > freeing path will set va->vm = NULL before unmapping, please see
> > function remove_vm_area().
> > 
> A normal "free" path sets it to NULL in order to prevent a double-free
> of same VA. We can avoid of touching the va->vm if needed and do an unlink
> on entry in the remove_vm_area() when a lock is taken to find an area.
> 
> Will it help you?

Sorry, this mail sneaked out of my sight until I noticed it just now. My
mutt client made it look like it was in the thread where I talked with
Lorenzo.

Yes, as I replied to your v2 patch, that is very helpful, thanks.

> 
> > Meanwhile, there are two types of vm_map_ram area. One is the whole
> > vmap_area being reserved and mapped at one time; the other is the
> > whole vmap_area with VMAP_BLOCK_SIZE size being reserved, while mapped
> > into split regions with smaller size several times via vb_alloc().
> > 
> > To mark the area reserved through vm_map_ram(), add flags field into
> > struct vmap_area. Bit 0 indicates whether it's a vm_map_ram area,
> > while bit 1 indicates whether it's a vmap_block type of vm_map_ram
> > area.
> > 
> > This is a preparatoin for later use.
> > 
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  include/linux/vmalloc.h |  1 +
> >  mm/vmalloc.c            | 22 +++++++++++++++++-----
> >  2 files changed, 18 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index 096d48aa3437..69250efa03d1 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -76,6 +76,7 @@ struct vmap_area {
> >  		unsigned long subtree_max_size; /* in "free" tree */
> >  		struct vm_struct *vm;           /* in "busy" tree */
> >  	};
> > +	unsigned long flags; /* mark type of vm_map_ram area */
> >  };
> >  
> >  /* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5d3fd3e6fe09..190f29bbaaa7 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1586,7 +1586,8 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
> >  static struct vmap_area *alloc_vmap_area(unsigned long size,
> >  				unsigned long align,
> >  				unsigned long vstart, unsigned long vend,
> > -				int node, gfp_t gfp_mask)
> > +				int node, gfp_t gfp_mask,
> > +				unsigned long va_flags)
> >  {
> >  	struct vmap_area *va;
> >  	unsigned long freed;
> > @@ -1630,6 +1631,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> >  	va->va_start = addr;
> >  	va->va_end = addr + size;
> >  	va->vm = NULL;
> > +	va->flags = va_flags;
> >  
> >  	spin_lock(&vmap_area_lock);
> >  	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> > @@ -1887,6 +1889,10 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> >  
> >  #define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> >  
> > +#define VMAP_RAM		0x1
> > +#define VMAP_BLOCK		0x2
> > +#define VMAP_FLAGS_MASK		0x3
> > 
> Maybe to rename a VMAP_BLOCK to something like VMAP_BLOCK_RESERVED or
> VMAP_PER_CPU_BLOCK?

Both VMAP_BLOCK and VMAP_PER_CPU_BLOCK look good to me; please see my
explanation below.

> 
> >  struct vmap_block_queue {
> >  	spinlock_t lock;
> >  	struct list_head free;
> > @@ -1962,7 +1968,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> >  
> >  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
> >  					VMALLOC_START, VMALLOC_END,
> > -					node, gfp_mask);
> > +					node, gfp_mask,
> > +					VMAP_RAM|VMAP_BLOCK);
> >
> A new_vmap_block() is for a per-cpu path. As far as i see the VMAP_BLOCK
> flag is used to mark a VA that corresponds to a reserved per-cpu free area.
> 
> Whereas a VMAP_RAM is for VA that was obtained over per-cpu path but
> over alloc_vmap_area() thus a VA should be read out over "busy" tree
> directly.
> 
> Why do you need to set here both VMAP_RAM and VMAP_BLOCK?

My understanding is that the vm_map_ram area has two types: one is the
vb percpu area via vb_alloc(), the other is allocated via
alloc_vmap_area(). While both of them are obtained from the vm_map_ram()
interface, this is the main point that distinguishes a vm_map_ram area
from a normal vmalloc area, and this is what makes a vm_map_ram area not
own a va->vm pointer. So here, I use the VMAP_RAM flag to mark the
vm_map_ram area, including both types; meanwhile, I add VMAP_BLOCK to
mark out the vb percpu area.

I understand people could have a different view about them, e.g. as you
said, use VMAP_RAM to mark the type of vm_map_ram area allocated through
alloc_vmap_area(), while using VMAP_PER_CPU_BLOCK to mark the vb percpu
area from vb_alloc(). In that case, we may need to rename VMAP_RAM to
reflect the area allocated from alloc_vmap_area() only. Both are fine to
me.

> 
> >  	if (IS_ERR(va)) {
> >  		kfree(vb);
> >  		return ERR_CAST(va);
> > @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> >  		return;
> >  	}
> >  
> > -	va = find_vmap_area(addr);
> > +	spin_lock(&vmap_area_lock);
> > +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> >  	BUG_ON(!va);
> > +	if (va)
> > +		va->flags &= ~VMAP_RAM;
> > +	spin_unlock(&vmap_area_lock);
> >  	debug_check_no_locks_freed((void *)va->va_start,
> >
> Agree with Lorenzo. BUG_ON() should be out of spinlock(). Furthermore
> i think it makes sense to go with WARN_ON_ONCE() and do not kill a system.
> Instead emit a warning and bailout.
> 
> What do you think? Maybe separate patch for it?

Agree, your patch looks great to me. Thanks.

> 
> >  				    (va->va_end - va->va_start));
> >  	free_unmap_vmap_area(va);
> > @@ -2265,7 +2276,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
> >  	} else {
> >  		struct vmap_area *va;
> >  		va = alloc_vmap_area(size, PAGE_SIZE,
> > -				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> > +				VMALLOC_START, VMALLOC_END,
> > +				node, GFP_KERNEL, VMAP_RAM);
> >  		if (IS_ERR(va))
> >  			return NULL;
> >  
> > @@ -2505,7 +2517,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
> >  	if (!(flags & VM_NO_GUARD))
> >  		size += PAGE_SIZE;
> >  
> > -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
> > +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
> >  	if (IS_ERR(va)) {
> >  		kfree(area);
> >  		return NULL;
> >
> I know we have already discussed the new parameter. But what if we just
> use atomic_set operation to mark VA as either vmap-ram or vmap-block?
> 
> As for alloc_vmap_area() we set it just as zero.

Sorry, I may not have gotten your point clearly; could you be more
specific?
  
Baoquan He Jan. 13, 2023, 3:55 a.m. UTC | #10
Hi Uladzislau Rezki,

On 12/23/22 at 12:14pm, Baoquan He wrote:
> On 12/20/22 at 05:55pm, Uladzislau Rezki wrote:
......
 > >  	spin_lock(&vmap_area_lock);
> > >  	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> > > @@ -1887,6 +1889,10 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> > >  
> > >  #define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> > >  
> > > +#define VMAP_RAM		0x1
> > > +#define VMAP_BLOCK		0x2
> > > +#define VMAP_FLAGS_MASK		0x3
> > > 
> > Maybe to rename a VMAP_BLOCK to something like VMAP_BLOCK_RESERVED or
> > VMAP_PER_CPU_BLOCK?
> 
> Both VMAP_BLOCK or VMAP_PER_CPU_BLOCK look good to me, please see my
> explanation at below.
> 
> > 
> > >  struct vmap_block_queue {
> > >  	spinlock_t lock;
> > >  	struct list_head free;
> > > @@ -1962,7 +1968,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> > >  
> > >  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
> > >  					VMALLOC_START, VMALLOC_END,
> > > -					node, gfp_mask);
> > > +					node, gfp_mask,
> > > +					VMAP_RAM|VMAP_BLOCK);
> > >
> > A new_vmap_block() is for a per-cpu path. As far as i see the VMAP_BLOCK
> > flag is used to mark a VA that corresponds to a reserved per-cpu free area.
> > 
> > Whereas a VMAP_RAM is for VA that was obtained over per-cpu path but
> > over alloc_vmap_area() thus a VA should be read out over "busy" tree
> > directly.

Rethinking about the vmap->flags and the bit0->VMAP_RAM,
bit1->VMAP_BLOCK correspondence, it looks better to use bit0->VMAP_RAM
to indicate a vm_map_ram area, no matter how it's handled inside the
vm_map_ram() interface, and to use bit1->VMAP_BLOCK to mark out the
special vm_map_ram area which is further subdivided and managed by
struct vmap_block. With these, we can identify a vm_map_ram area and
treat it as one type of vmalloc area, e.g. in vread() and s_show().

This means that when we are talking about vm_map_ram areas, we use
(vmap->flags & VMAP_RAM) to recognize them; when we need to
differentiate and handle the vm_map_ram areas separately, we use
(vmap->flags & VMAP_BLOCK) to pick out the area which is further managed
by a vmap_block. Please help check if this is OK with you.

> > 
> > Why do you need to set here both VMAP_RAM and VMAP_BLOCK?
> 
> My understanding is that the vm_map_ram area has two types, one is
> the vb percpu area via vb_alloc(), the other is allocated via
> alloc_vmap_area(). While both of them is got from vm_map_ram()
> interface, this is the main point that distinguishes the vm_map_ram area
> than the normal vmalloc area, and this makes vm_map_ram area not owning
> va->vm pointer. So here, I use flag VMAP_RAM to mark the vm_map_ram
> area, including the two types; meanwhile, I add VMAP_BLOCK to mark out
> the vb percpu area. 
> 
> I understand people could have different view about them, e.g as you
> said, use VMAP_RAM to mark the type of vm_map_ram area allocated through
> alloc_vmap_area(), while use VMAP_PER_CPU_BLOCK to mark vb percpu area
> from vb_alloc. In this way, we may need to rename VMAP_RAM to reflect
> the area allocated from alloc_vmap_area() only. Both is fine to me.
> 
> > 
> > >  	if (IS_ERR(va)) {
> > >  		kfree(vb);
> > >  		return ERR_CAST(va);
> > > @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> > >  		return;
> > >  	}
> > >  
> > > -	va = find_vmap_area(addr);
> > > +	spin_lock(&vmap_area_lock);
> > > +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> > >  	BUG_ON(!va);
> > > +	if (va)
> > > +		va->flags &= ~VMAP_RAM;
> > > +	spin_unlock(&vmap_area_lock);
> > >  	debug_check_no_locks_freed((void *)va->va_start,
> > >
> > Agree with Lorenzo. BUG_ON() should be out of spinlock(). Furthermore
> > i think it makes sense to go with WARN_ON_ONCE() and do not kill a system.
> > Instead emit a warning and bailout.
> > 
> > What do you think? Maybe separate patch for it?
> 
> Agree, your patch looks great to me. Thanks.
> 
> > 
> > >  				    (va->va_end - va->va_start));
> > >  	free_unmap_vmap_area(va);
> > > @@ -2265,7 +2276,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
> > >  	} else {
> > >  		struct vmap_area *va;
> > >  		va = alloc_vmap_area(size, PAGE_SIZE,
> > > -				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> > > +				VMALLOC_START, VMALLOC_END,
> > > +				node, GFP_KERNEL, VMAP_RAM);
> > >  		if (IS_ERR(va))
> > >  			return NULL;
> > >  
> > > @@ -2505,7 +2517,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
> > >  	if (!(flags & VM_NO_GUARD))
> > >  		size += PAGE_SIZE;
> > >  
> > > -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
> > > +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
> > >  	if (IS_ERR(va)) {
> > >  		kfree(area);
> > >  		return NULL;
> > >
> > I know we have already discussed the new parameter. But what if we just
> > use atomic_set operation to mark VA as either vmap-ram or vmap-block?

As I replied above, I take vm_map_ram as one kind of vmalloc area, and
mark out the percpu vmap_block handling of the vm_map_ram area. It seems
that passing in the flags through a function parameter is better. I am
not sure if I got your suggestion correctly and whether my code change
is appropriate. I have sent v3 according to your and Lorenzo's comments
and suggestions, and my rethinking after reading your words. I made some
adjustments to try to remove misunderstanding or confusion when reading
the patch and code. Please help check if it's OK.
  
Uladzislau Rezki Jan. 16, 2023, 5:54 p.m. UTC | #11
On Fri, Jan 13, 2023 at 11:55:07AM +0800, Baoquan He wrote:
> Hi Uladzislau Rezki,
> 
> On 12/23/22 at 12:14pm, Baoquan He wrote:
> > On 12/20/22 at 05:55pm, Uladzislau Rezki wrote:
> ......
>  > >  	spin_lock(&vmap_area_lock);
> > > >  	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> > > > @@ -1887,6 +1889,10 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> > > >  
> > > >  #define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> > > >  
> > > > +#define VMAP_RAM		0x1
> > > > +#define VMAP_BLOCK		0x2
> > > > +#define VMAP_FLAGS_MASK		0x3
> > > > 
> > > Maybe to rename a VMAP_BLOCK to something like VMAP_BLOCK_RESERVED or
> > > VMAP_PER_CPU_BLOCK?
> > 
> > Both VMAP_BLOCK or VMAP_PER_CPU_BLOCK look good to me, please see my
> > explanation at below.
> > 
> > > 
> > > >  struct vmap_block_queue {
> > > >  	spinlock_t lock;
> > > >  	struct list_head free;
> > > > @@ -1962,7 +1968,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> > > >  
> > > >  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
> > > >  					VMALLOC_START, VMALLOC_END,
> > > > -					node, gfp_mask);
> > > > +					node, gfp_mask,
> > > > +					VMAP_RAM|VMAP_BLOCK);
> > > >
> > > A new_vmap_block() is for a per-cpu path. As far as i see the VMAP_BLOCK
> > > flag is used to mark a VA that corresponds to a reserved per-cpu free area.
> > > 
> > > Whereas a VMAP_RAM is for VA that was obtained over per-cpu path but
> > > over alloc_vmap_area() thus a VA should be read out over "busy" tree
> > > directly.
> 
> Rethinking about the vmap->flags and the bit0->VMAP_RAM,
> bit1->VMAP_BLOCK correspondence, it looks better to use bit0->VMAP_RAM
> to indicate the vm_map_ram area, no matter how it's handled inside
> vm_map_ram() interface; and use bit1->VMAP_BLOCK to mark out the special
> vm_map_ram area which is further subdivided and managed by struct
> vmap_block. With these, you can see that we can identify vm_map_ram area
> and treat it as one type of vmalloc area, e.g in vread(), s_show().
> 
> Means when we are talking about vm_map_ram areas, we use
> (vmap->flags & VMAP_RAM) to recognize them; when we need to
> differentiate and handle vm_map_ram areas respectively, we use
> (vmap->flags & VMAP_BLOCK) to pick out the area which is further managed
> by vmap_block. Please help check if this is OK to you.
> 
> > > 
> > > Why do you need to set here both VMAP_RAM and VMAP_BLOCK?
> > 
> > My understanding is that the vm_map_ram area has two types, one is
> > the vb percpu area via vb_alloc(), the other is allocated via
> > alloc_vmap_area(). While both of them is got from vm_map_ram()
> > interface, this is the main point that distinguishes the vm_map_ram area
> > than the normal vmalloc area, and this makes vm_map_ram area not owning
> > va->vm pointer. So here, I use flag VMAP_RAM to mark the vm_map_ram
> > area, including the two types; meanwhile, I add VMAP_BLOCK to mark out
> > the vb percpu area. 
> > 
> > I understand people could have different view about them, e.g as you
> > said, use VMAP_RAM to mark the type of vm_map_ram area allocated through
> > alloc_vmap_area(), while use VMAP_PER_CPU_BLOCK to mark vb percpu area
> > from vb_alloc. In this way, we may need to rename VMAP_RAM to reflect
> > the area allocated from alloc_vmap_area() only. Both is fine to me.
> > 
> > > 
> > > >  	if (IS_ERR(va)) {
> > > >  		kfree(vb);
> > > >  		return ERR_CAST(va);
> > > > @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> > > >  		return;
> > > >  	}
> > > >  
> > > > -	va = find_vmap_area(addr);
> > > > +	spin_lock(&vmap_area_lock);
> > > > +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> > > >  	BUG_ON(!va);
> > > > +	if (va)
> > > > +		va->flags &= ~VMAP_RAM;
> > > > +	spin_unlock(&vmap_area_lock);
> > > >  	debug_check_no_locks_freed((void *)va->va_start,
> > > >
> > > Agree with Lorenzo. BUG_ON() should be out of spinlock(). Furthermore
> > > i think it makes sense to go with WARN_ON_ONCE() and do not kill a system.
> > > Instead emit a warning and bailout.
> > > 
> > > What do you think? Maybe separate patch for it?
> > 
> > Agree, your patch looks great to me. Thanks.
> > 
> > > 
> > > >  				    (va->va_end - va->va_start));
> > > >  	free_unmap_vmap_area(va);
> > > > @@ -2265,7 +2276,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
> > > >  	} else {
> > > >  		struct vmap_area *va;
> > > >  		va = alloc_vmap_area(size, PAGE_SIZE,
> > > > -				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> > > > +				VMALLOC_START, VMALLOC_END,
> > > > +				node, GFP_KERNEL, VMAP_RAM);
> > > >  		if (IS_ERR(va))
> > > >  			return NULL;
> > > >  
> > > > @@ -2505,7 +2517,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
> > > >  	if (!(flags & VM_NO_GUARD))
> > > >  		size += PAGE_SIZE;
> > > >  
> > > > -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
> > > > +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
> > > >  	if (IS_ERR(va)) {
> > > >  		kfree(area);
> > > >  		return NULL;
> > > >
> > > I know we have already discussed the new parameter. But what if we just
> > > use atomic_set operation to mark VA as either vmap-ram or vmap-block?
> 
> As I replied at above, I take the vm_map_ram as one kind of vmalloc
> area, and mark out the percpu vmap block handling of vm_map_ram area.
> Seems the passing in the flags through function parameter is better. Not
> sure if I got your suggestion correctly, and my code change is
> appropriate. I have sent v3 according to your and Lorenzo's comments and
> suggestion, and my rethinking after reading your words. I make some
> adjustment to try to remove misundersanding or confusion when reading
> patch and code. Please help check if it's OK.
> 
OK, if we decided to go with a parameter it is OK; it is not a big deal
or much complexity. If needed, it can be adjusted later on.

Thanks!

--
Uladzislau Rezki
  
Baoquan He Jan. 18, 2023, 3:09 a.m. UTC | #12
On 01/16/23 at 06:54pm, Uladzislau Rezki wrote:
> On Fri, Jan 13, 2023 at 11:55:07AM +0800, Baoquan He wrote:
> > Hi Uladzislau Rezki,
> > 
> > On 12/23/22 at 12:14pm, Baoquan He wrote:
> > > On 12/20/22 at 05:55pm, Uladzislau Rezki wrote:
> > ......
> >  > >  	spin_lock(&vmap_area_lock);
> > > > >  	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> > > > > @@ -1887,6 +1889,10 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> > > > >  
> > > > >  #define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> > > > >  
> > > > > +#define VMAP_RAM		0x1
> > > > > +#define VMAP_BLOCK		0x2
> > > > > +#define VMAP_FLAGS_MASK		0x3
> > > > > 
> > > > Maybe to rename a VMAP_BLOCK to something like VMAP_BLOCK_RESERVED or
> > > > VMAP_PER_CPU_BLOCK?
> > > 
> > > Both VMAP_BLOCK or VMAP_PER_CPU_BLOCK look good to me, please see my
> > > explanation at below.
> > > 
> > > > 
> > > > >  struct vmap_block_queue {
> > > > >  	spinlock_t lock;
> > > > >  	struct list_head free;
> > > > > @@ -1962,7 +1968,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> > > > >  
> > > > >  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
> > > > >  					VMALLOC_START, VMALLOC_END,
> > > > > -					node, gfp_mask);
> > > > > +					node, gfp_mask,
> > > > > +					VMAP_RAM|VMAP_BLOCK);
> > > > >
> > > > A new_vmap_block() is for a per-cpu path. As far as i see the VMAP_BLOCK
> > > > flag is used to mark a VA that corresponds to a reserved per-cpu free area.
> > > > 
> > > > Whereas a VMAP_RAM is for VA that was obtained over per-cpu path but
> > > > over alloc_vmap_area() thus a VA should be read out over "busy" tree
> > > > directly.
> > 
> > Rethinking about the vmap->flags and the bit0->VMAP_RAM,
> > bit1->VMAP_BLOCK correspondence, it looks better to use bit0->VMAP_RAM
> > to indicate the vm_map_ram area, no matter how it's handled inside
> > vm_map_ram() interface; and use bit1->VMAP_BLOCK to mark out the special
> > vm_map_ram area which is further subdivided and managed by struct
> > vmap_block. With these, you can see that we can identify vm_map_ram area
> > and treat it as one type of vmalloc area, e.g in vread(), s_show().
> > 
> > Means when we are talking about vm_map_ram areas, we use
> > (vmap->flags & VMAP_RAM) to recognize them; when we need to
> > differentiate and handle vm_map_ram areas respectively, we use
> > (vmap->flags & VMAP_BLOCK) to pick out the area which is further managed
> > by vmap_block. Please help check if this is OK to you.
> > 
> > > > 
> > > > Why do you need to set here both VMAP_RAM and VMAP_BLOCK?
> > > 
> > > My understanding is that the vm_map_ram area has two types, one is
> > > the vb percpu area via vb_alloc(), the other is allocated via
> > > alloc_vmap_area(). While both of them is got from vm_map_ram()
> > > interface, this is the main point that distinguishes the vm_map_ram area
> > > than the normal vmalloc area, and this makes vm_map_ram area not owning
> > > va->vm pointer. So here, I use flag VMAP_RAM to mark the vm_map_ram
> > > area, including the two types; meanwhile, I add VMAP_BLOCK to mark out
> > > the vb percpu area. 
> > > 
> > > I understand people could have different view about them, e.g as you
> > > said, use VMAP_RAM to mark the type of vm_map_ram area allocated through
> > > alloc_vmap_area(), while use VMAP_PER_CPU_BLOCK to mark vb percpu area
> > > from vb_alloc. In this way, we may need to rename VMAP_RAM to reflect
> > > the area allocated from alloc_vmap_area() only. Both is fine to me.
> > > 
> > > > 
> > > > >  	if (IS_ERR(va)) {
> > > > >  		kfree(vb);
> > > > >  		return ERR_CAST(va);
> > > > > @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> > > > >  		return;
> > > > >  	}
> > > > >  
> > > > > -	va = find_vmap_area(addr);
> > > > > +	spin_lock(&vmap_area_lock);
> > > > > +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> > > > >  	BUG_ON(!va);
> > > > > +	if (va)
> > > > > +		va->flags &= ~VMAP_RAM;
> > > > > +	spin_unlock(&vmap_area_lock);
> > > > >  	debug_check_no_locks_freed((void *)va->va_start,
> > > > >
> > > > Agree with Lorenzo. BUG_ON() should be out of spinlock(). Furthermore
> > > > i think it makes sense to go with WARN_ON_ONCE() and do not kill a system.
> > > > Instead emit a warning and bailout.
> > > > 
> > > > What do you think? Maybe separate patch for it?
> > > 
> > > Agree, your patch looks great to me. Thanks.
> > > 
> > > > 
> > > > >  				    (va->va_end - va->va_start));
> > > > >  	free_unmap_vmap_area(va);
> > > > > @@ -2265,7 +2276,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
> > > > >  	} else {
> > > > >  		struct vmap_area *va;
> > > > >  		va = alloc_vmap_area(size, PAGE_SIZE,
> > > > > -				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> > > > > +				VMALLOC_START, VMALLOC_END,
> > > > > +				node, GFP_KERNEL, VMAP_RAM);
> > > > >  		if (IS_ERR(va))
> > > > >  			return NULL;
> > > > >  
> > > > > @@ -2505,7 +2517,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
> > > > >  	if (!(flags & VM_NO_GUARD))
> > > > >  		size += PAGE_SIZE;
> > > > >  
> > > > > -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
> > > > > +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
> > > > >  	if (IS_ERR(va)) {
> > > > >  		kfree(area);
> > > > >  		return NULL;
> > > > >
> > > > I know we have already discussed the new parameter. But what if we just
> > > > use atomic_set operation to mark VA as either vmap-ram or vmap-block?
> > 
> > As I replied at above, I take the vm_map_ram as one kind of vmalloc
> > area, and mark out the percpu vmap block handling of vm_map_ram area.
> > Seems the passing in the flags through function parameter is better. Not
> > sure if I got your suggestion correctly, and my code change is
> > appropriate. I have sent v3 according to your and Lorenzo's comments and
> > suggestion, and my rethinking after reading your words. I make some
> > adjustment to try to remove misundersanding or confusion when reading
> > patch and code. Please help check if it's OK.
> > 
> OK, if we decided to go with a parameter it is OK, it is not a big deal
> and complexity. If needed it can be adjusted later on if there is a
> need.

My preference for passing a function parameter is that we don't need to
do an atomic read when we want to check va->flags. However, on the
va->flags setting side, the atomic_set() code is simpler than a function
parameter:

	flags = atomic_read(&va->flags);
        if (flags & VMAP_RAM) {
		
	}

I checked the code and feel it doesn't make much difference, so I'll
keep the current code. If there's anything else I didn't think of, we
can still change it. Thanks.
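
For comparison, the checking side with the plain field added in this
patch is just a direct read:

	if (va->flags & VMAP_RAM) {
		/* handle a vm_map_ram area */
	}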
  
Uladzislau Rezki Jan. 18, 2023, 12:20 p.m. UTC | #13
On Wed, Jan 18, 2023 at 11:09:44AM +0800, Baoquan He wrote:
> On 01/16/23 at 06:54pm, Uladzislau Rezki wrote:
> > On Fri, Jan 13, 2023 at 11:55:07AM +0800, Baoquan He wrote:
> > > Hi Uladzislau Rezki,
> > > 
> > > On 12/23/22 at 12:14pm, Baoquan He wrote:
> > > > On 12/20/22 at 05:55pm, Uladzislau Rezki wrote:
> > > ......
> > >  > >  	spin_lock(&vmap_area_lock);
> > > > > >  	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> > > > > > @@ -1887,6 +1889,10 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> > > > > >  
> > > > > >  #define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> > > > > >  
> > > > > > +#define VMAP_RAM		0x1
> > > > > > +#define VMAP_BLOCK		0x2
> > > > > > +#define VMAP_FLAGS_MASK		0x3
> > > > > > 
> > > > > Maybe to rename a VMAP_BLOCK to something like VMAP_BLOCK_RESERVED or
> > > > > VMAP_PER_CPU_BLOCK?
> > > > 
> > > > Both VMAP_BLOCK or VMAP_PER_CPU_BLOCK look good to me, please see my
> > > > explanation at below.
> > > > 
> > > > > 
> > > > > >  struct vmap_block_queue {
> > > > > >  	spinlock_t lock;
> > > > > >  	struct list_head free;
> > > > > > @@ -1962,7 +1968,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> > > > > >  
> > > > > >  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
> > > > > >  					VMALLOC_START, VMALLOC_END,
> > > > > > -					node, gfp_mask);
> > > > > > +					node, gfp_mask,
> > > > > > +					VMAP_RAM|VMAP_BLOCK);
> > > > > >
> > > > > A new_vmap_block() is for a per-cpu path. As far as i see the VMAP_BLOCK
> > > > > flag is used to mark a VA that corresponds to a reserved per-cpu free area.
> > > > > 
> > > > > Whereas a VMAP_RAM is for VA that was obtained over per-cpu path but
> > > > > over alloc_vmap_area() thus a VA should be read out over "busy" tree
> > > > > directly.
> > > 
> > > Rethinking about the vmap->flags and the bit0->VMAP_RAM,
> > > bit1->VMAP_BLOCK correspondence, it looks better to use bit0->VMAP_RAM
> > > to indicate the vm_map_ram area, no matter how it's handled inside
> > > vm_map_ram() interface; and use bit1->VMAP_BLOCK to mark out the special
> > > vm_map_ram area which is further subdivided and managed by struct
> > > vmap_block. With these, you can see that we can identify vm_map_ram area
> > > and treat it as one type of vmalloc area, e.g in vread(), s_show().
> > > 
> > > Means when we are talking about vm_map_ram areas, we use
> > > (vmap->flags & VMAP_RAM) to recognize them; when we need to
> > > differentiate and handle vm_map_ram areas respectively, we use
> > > (vmap->flags & VMAP_BLOCK) to pick out the area which is further managed
> > > by vmap_block. Please help check if this is OK to you.
> > > 
> > > > > 
> > > > > Why do you need to set here both VMAP_RAM and VMAP_BLOCK?
> > > > 
> > > > My understanding is that the vm_map_ram area has two types, one is
> > > > the vb percpu area via vb_alloc(), the other is allocated via
> > > > alloc_vmap_area(). While both of them is got from vm_map_ram()
> > > > interface, this is the main point that distinguishes the vm_map_ram area
> > > > than the normal vmalloc area, and this makes vm_map_ram area not owning
> > > > va->vm pointer. So here, I use flag VMAP_RAM to mark the vm_map_ram
> > > > area, including the two types; meanwhile, I add VMAP_BLOCK to mark out
> > > > the vb percpu area. 
> > > > 
> > > > I understand people could have different view about them, e.g as you
> > > > said, use VMAP_RAM to mark the type of vm_map_ram area allocated through
> > > > alloc_vmap_area(), while use VMAP_PER_CPU_BLOCK to mark vb percpu area
> > > > from vb_alloc. In this way, we may need to rename VMAP_RAM to reflect
> > > > the area allocated from alloc_vmap_area() only. Both is fine to me.
> > > > 
> > > > > 
> > > > > >  	if (IS_ERR(va)) {
> > > > > >  		kfree(vb);
> > > > > >  		return ERR_CAST(va);
> > > > > > @@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
> > > > > >  		return;
> > > > > >  	}
> > > > > >  
> > > > > > -	va = find_vmap_area(addr);
> > > > > > +	spin_lock(&vmap_area_lock);
> > > > > > +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> > > > > >  	BUG_ON(!va);
> > > > > > +	if (va)
> > > > > > +		va->flags &= ~VMAP_RAM;
> > > > > > +	spin_unlock(&vmap_area_lock);
> > > > > >  	debug_check_no_locks_freed((void *)va->va_start,
> > > > > >
> > > > > Agree with Lorenzo. BUG_ON() should be out of spinlock(). Furthermore
> > > > > i think it makes sense to go with WARN_ON_ONCE() and do not kill a system.
> > > > > Instead emit a warning and bailout.
> > > > > 
> > > > > What do you think? Maybe separate patch for it?
> > > > 
> > > > Agree, your patch looks great to me. Thanks.
> > > > 
> > > > > 
> > > > > >  				    (va->va_end - va->va_start));
> > > > > >  	free_unmap_vmap_area(va);
> > > > > > @@ -2265,7 +2276,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
> > > > > >  	} else {
> > > > > >  		struct vmap_area *va;
> > > > > >  		va = alloc_vmap_area(size, PAGE_SIZE,
> > > > > > -				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> > > > > > +				VMALLOC_START, VMALLOC_END,
> > > > > > +				node, GFP_KERNEL, VMAP_RAM);
> > > > > >  		if (IS_ERR(va))
> > > > > >  			return NULL;
> > > > > >  
> > > > > > @@ -2505,7 +2517,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
> > > > > >  	if (!(flags & VM_NO_GUARD))
> > > > > >  		size += PAGE_SIZE;
> > > > > >  
> > > > > > -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
> > > > > > +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
> > > > > >  	if (IS_ERR(va)) {
> > > > > >  		kfree(area);
> > > > > >  		return NULL;
> > > > > >
> > > > > I know we have already discussed the new parameter. But what if we just
> > > > > use atomic_set operation to mark VA as either vmap-ram or vmap-block?
> > > 
> > > As I replied at above, I take the vm_map_ram as one kind of vmalloc
> > > area, and mark out the percpu vmap block handling of vm_map_ram area.
> > > Seems the passing in the flags through function parameter is better. Not
> > > sure if I got your suggestion correctly, and my code change is
> > > appropriate. I have sent v3 according to your and Lorenzo's comments and
> > > suggestion, and my rethinking after reading your words. I make some
> > > adjustment to try to remove misundersanding or confusion when reading
> > > patch and code. Please help check if it's OK.
> > > 
> > OK, if we decided to go with a parameter it is OK, it is not a big deal
> > and complexity. If needed it can be adjusted later on if there is a
> > need.
> 
> My preference for function parameter passing is we don't need do the
> atomic reading when we want to check va->flags. However, in va->flags
> setting side, atomic_set() code is simpler than function parameter.
> 
> 	flags = atomic_read(&va->flags);
>         if (flags & VMAP_RAM) {
> 		
> 	}
> 
> I checked code, and feel it doesn't have much difference, so keep the
> current code. If there's other thing I didn't think of, we can still
> change. Thanks.
> 
That is fine.

--
Uladzislau Rezki
  

Patch

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..69250efa03d1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -76,6 +76,7 @@  struct vmap_area {
 		unsigned long subtree_max_size; /* in "free" tree */
 		struct vm_struct *vm;           /* in "busy" tree */
 	};
+	unsigned long flags; /* mark type of vm_map_ram area */
 };
 
 /* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5d3fd3e6fe09..190f29bbaaa7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1586,7 +1586,8 @@  preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
 static struct vmap_area *alloc_vmap_area(unsigned long size,
 				unsigned long align,
 				unsigned long vstart, unsigned long vend,
-				int node, gfp_t gfp_mask)
+				int node, gfp_t gfp_mask,
+				unsigned long va_flags)
 {
 	struct vmap_area *va;
 	unsigned long freed;
@@ -1630,6 +1631,7 @@  static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->va_start = addr;
 	va->va_end = addr + size;
 	va->vm = NULL;
+	va->flags = va_flags;
 
 	spin_lock(&vmap_area_lock);
 	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
@@ -1887,6 +1889,10 @@  struct vmap_area *find_vmap_area(unsigned long addr)
 
 #define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
 
+#define VMAP_RAM		0x1
+#define VMAP_BLOCK		0x2
+#define VMAP_FLAGS_MASK		0x3
+
 struct vmap_block_queue {
 	spinlock_t lock;
 	struct list_head free;
@@ -1962,7 +1968,8 @@  static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 
 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
 					VMALLOC_START, VMALLOC_END,
-					node, gfp_mask);
+					node, gfp_mask,
+					VMAP_RAM|VMAP_BLOCK);
 	if (IS_ERR(va)) {
 		kfree(vb);
 		return ERR_CAST(va);
@@ -2229,8 +2236,12 @@  void vm_unmap_ram(const void *mem, unsigned int count)
 		return;
 	}
 
-	va = find_vmap_area(addr);
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
 	BUG_ON(!va);
+	if (va)
+		va->flags &= ~VMAP_RAM;
+	spin_unlock(&vmap_area_lock);
 	debug_check_no_locks_freed((void *)va->va_start,
 				    (va->va_end - va->va_start));
 	free_unmap_vmap_area(va);
@@ -2265,7 +2276,8 @@  void *vm_map_ram(struct page **pages, unsigned int count, int node)
 	} else {
 		struct vmap_area *va;
 		va = alloc_vmap_area(size, PAGE_SIZE,
-				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+				VMALLOC_START, VMALLOC_END,
+				node, GFP_KERNEL, VMAP_RAM);
 		if (IS_ERR(va))
 			return NULL;
 
@@ -2505,7 +2517,7 @@  static struct vm_struct *__get_vm_area_node(unsigned long size,
 	if (!(flags & VM_NO_GUARD))
 		size += PAGE_SIZE;
 
-	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
+	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
 	if (IS_ERR(va)) {
 		kfree(area);
 		return NULL;