[bpf-next] bpf: Support default .validate() and .update() behavior for struct_ops links

Message ID 20230810220456.521517-1-void@manifault.com
State New
Headers
Series [bpf-next] bpf: Support default .validate() and .update() behavior for struct_ops links |

Commit Message

David Vernet Aug. 10, 2023, 10:04 p.m. UTC
  Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
define the .validate() and .update() callbacks in its corresponding
struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
in its own right to ensure that the map is unloaded if an application
crashes. For example, with sched_ext, we want to automatically unload
the host-wide scheduler if the application crashes. We would likely
never support updating elements of a sched_ext struct_ops map, so we'd
have to implement these callbacks showing that they _can't_ support
element updates just to benefit from the basic lifetime management of
struct_ops links.

Let's enable struct_ops maps to work with BPF_F_LINK even if they
haven't defined these callbacks, by assuming that a struct_ops map
element cannot be updated by default.

Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/bpf/bpf_struct_ops.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)
  

Comments

Stanislav Fomichev Aug. 10, 2023, 10:46 p.m. UTC | #1
On 08/10, David Vernet wrote:
> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> define the .validate() and .update() callbacks in its corresponding
> struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> in its own right to ensure that the map is unloaded if an application
> crashes. For example, with sched_ext, we want to automatically unload
> the host-wide scheduler if the application crashes. We would likely
> never support updating elements of a sched_ext struct_ops map, so we'd
> have to implement these callbacks showing that they _can't_ support
> element updates just to benefit from the basic lifetime management of
> struct_ops links.
> 
> Let's enable struct_ops maps to work with BPF_F_LINK even if they
> haven't defined these callbacks, by assuming that a struct_ops map
> element cannot be updated by default.

Any reason this is not part of sched_ext series? As you mention,
we don't seem to have such users in the three?
  
David Vernet Aug. 10, 2023, 11:01 p.m. UTC | #2
On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
> On 08/10, David Vernet wrote:
> > Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> > define the .validate() and .update() callbacks in its corresponding
> > struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> > in its own right to ensure that the map is unloaded if an application
> > crashes. For example, with sched_ext, we want to automatically unload
> > the host-wide scheduler if the application crashes. We would likely
> > never support updating elements of a sched_ext struct_ops map, so we'd
> > have to implement these callbacks showing that they _can't_ support
> > element updates just to benefit from the basic lifetime management of
> > struct_ops links.
> > 
> > Let's enable struct_ops maps to work with BPF_F_LINK even if they
> > haven't defined these callbacks, by assuming that a struct_ops map
> > element cannot be updated by default.
> 
> Any reason this is not part of sched_ext series? As you mention,
> we don't seem to have such users in the three?

Hi Stanislav,

The sched_ext series [0] implements these callbacks. See
bpf_scx_update() and bpf_scx_validate(). 

[0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/

We could add this into that series and remove those callbacks, but this
patch is fixing a UX / API issue with struct_ops links that's not really
relevant to sched_ext. I don't think there's any reason to couple
updating struct_ops map elements with allowing the kernel to manage the
lifetime of struct_ops maps -- just because we only have 1 (non-test)
struct_ops implementation in-tree doesn't mean we shouldn't improve APIs
where it makes sense.

Thanks,
David
  
Stanislav Fomichev Aug. 10, 2023, 11:15 p.m. UTC | #3
On 08/10, David Vernet wrote:
> On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
> > On 08/10, David Vernet wrote:
> > > Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> > > define the .validate() and .update() callbacks in its corresponding
> > > struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> > > in its own right to ensure that the map is unloaded if an application
> > > crashes. For example, with sched_ext, we want to automatically unload
> > > the host-wide scheduler if the application crashes. We would likely
> > > never support updating elements of a sched_ext struct_ops map, so we'd
> > > have to implement these callbacks showing that they _can't_ support
> > > element updates just to benefit from the basic lifetime management of
> > > struct_ops links.
> > > 
> > > Let's enable struct_ops maps to work with BPF_F_LINK even if they
> > > haven't defined these callbacks, by assuming that a struct_ops map
> > > element cannot be updated by default.
> > 
> > Any reason this is not part of sched_ext series? As you mention,
> > we don't seem to have such users in the three?
> 
> Hi Stanislav,
> 
> The sched_ext series [0] implements these callbacks. See
> bpf_scx_update() and bpf_scx_validate(). 
> 
> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
> 
> We could add this into that series and remove those callbacks, but this
> patch is fixing a UX / API issue with struct_ops links that's not really
> relevant to sched_ext. I don't think there's any reason to couple
> updating struct_ops map elements with allowing the kernel to manage the
> lifetime of struct_ops maps -- just because we only have 1 (non-test)
> struct_ops implementation in-tree doesn't mean we shouldn't improve APIs
> where it makes sense.
> 
> Thanks,
> David

Ack. I guess up to you and Martin. Just trying to understand whether I'm
missing something or the patch does indeed fix some use-case :-)
  
Kui-Feng Lee Aug. 11, 2023, 6:22 a.m. UTC | #4
Overall, this patch make sense to me.

On 8/10/23 15:04, David Vernet wrote:
> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> define the .validate() and .update() callbacks in its corresponding
> struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> in its own right to ensure that the map is unloaded if an application
> crashes. For example, with sched_ext, we want to automatically unload
> the host-wide scheduler if the application crashes. We would likely
> never support updating elements of a sched_ext struct_ops map, so we'd
> have to implement these callbacks showing that they _can't_ support
> element updates just to benefit from the basic lifetime management of
> struct_ops links.
> 
> Let's enable struct_ops maps to work with BPF_F_LINK even if they
> haven't defined these callbacks, by assuming that a struct_ops map
> element cannot be updated by default.
> 
> Signed-off-by: David Vernet <void@manifault.com>
> ---
>   kernel/bpf/bpf_struct_ops.c | 17 +++++++++++------
>   1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index eaff04eefb31..3d2fb85186a9 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -509,9 +509,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>   	}
>   
>   	if (st_map->map.map_flags & BPF_F_LINK) {
> -		err = st_ops->validate(kdata);
> -		if (err)
> -			goto reset_unlock;
> +		err = 0;
> +		if (st_ops->validate) {
> +			err = st_ops->validate(kdata);
> +			if (err)
> +				goto reset_unlock;
> +		}
>   		set_memory_rox((long)st_map->image, 1);
>   		/* Let bpf_link handle registration & unregistration.
>   		 *
> @@ -663,9 +666,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
>   	if (attr->value_size != vt->size)
>   		return ERR_PTR(-EINVAL);
>   
> -	if (attr->map_flags & BPF_F_LINK && (!st_ops->validate || !st_ops->update))
> -		return ERR_PTR(-EOPNOTSUPP);
> -
>   	t = st_ops->type;
>   
>   	st_map_size = sizeof(*st_map) +
> @@ -838,6 +838,11 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
>   		goto err_out;
>   	}
>   
> +	if (!st_map->st_ops->update) {
> +		err = -EOPNOTSUPP;
> +		goto err_out;
> +	}
> +

We can perform this check before calling mutex_lock(), and
return -EOPNOTSUPP early.


>   	err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
>   	if (err)
>   		goto err_out;
  
Yonghong Song Aug. 11, 2023, 6:43 a.m. UTC | #5
On 8/10/23 3:04 PM, David Vernet wrote:
> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> define the .validate() and .update() callbacks in its corresponding
> struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> in its own right to ensure that the map is unloaded if an application
> crashes. For example, with sched_ext, we want to automatically unload
> the host-wide scheduler if the application crashes. We would likely
> never support updating elements of a sched_ext struct_ops map, so we'd
> have to implement these callbacks showing that they _can't_ support
> element updates just to benefit from the basic lifetime management of
> struct_ops links.
> 
> Let's enable struct_ops maps to work with BPF_F_LINK even if they
> haven't defined these callbacks, by assuming that a struct_ops map
> element cannot be updated by default.

Maybe you want to add one map_flag to indicate validate/update callbacks
are optional for a struct_ops link? In this case, some struct_ops maps
can still require validate() and update(), but others can skip them?

> 
> Signed-off-by: David Vernet <void@manifault.com>
> ---
>   kernel/bpf/bpf_struct_ops.c | 17 +++++++++++------
>   1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index eaff04eefb31..3d2fb85186a9 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -509,9 +509,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>   	}
>   
>   	if (st_map->map.map_flags & BPF_F_LINK) {
> -		err = st_ops->validate(kdata);
> -		if (err)
> -			goto reset_unlock;
> +		err = 0;
> +		if (st_ops->validate) {
> +			err = st_ops->validate(kdata);
> +			if (err)
> +				goto reset_unlock;
> +		}
>   		set_memory_rox((long)st_map->image, 1);
>   		/* Let bpf_link handle registration & unregistration.
>   		 *
> @@ -663,9 +666,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
>   	if (attr->value_size != vt->size)
>   		return ERR_PTR(-EINVAL);
>   
> -	if (attr->map_flags & BPF_F_LINK && (!st_ops->validate || !st_ops->update))
> -		return ERR_PTR(-EOPNOTSUPP);
> -
>   	t = st_ops->type;
>   
>   	st_map_size = sizeof(*st_map) +
> @@ -838,6 +838,11 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
>   		goto err_out;
>   	}
>   
> +	if (!st_map->st_ops->update) {
> +		err = -EOPNOTSUPP;
> +		goto err_out;
> +	}
> +
>   	err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
>   	if (err)
>   		goto err_out;
  
David Vernet Aug. 11, 2023, 3:09 p.m. UTC | #6
On Thu, Aug 10, 2023 at 11:43:26PM -0700, Yonghong Song wrote:
> 
> 
> On 8/10/23 3:04 PM, David Vernet wrote:
> > Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> > define the .validate() and .update() callbacks in its corresponding
> > struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> > in its own right to ensure that the map is unloaded if an application
> > crashes. For example, with sched_ext, we want to automatically unload
> > the host-wide scheduler if the application crashes. We would likely
> > never support updating elements of a sched_ext struct_ops map, so we'd
> > have to implement these callbacks showing that they _can't_ support
> > element updates just to benefit from the basic lifetime management of
> > struct_ops links.
> > 
> > Let's enable struct_ops maps to work with BPF_F_LINK even if they
> > haven't defined these callbacks, by assuming that a struct_ops map
> > element cannot be updated by default.
> 
> Maybe you want to add one map_flag to indicate validate/update callbacks
> are optional for a struct_ops link? In this case, some struct_ops maps
> can still require validate() and update(), but others can skip them?

Are you proposing that a map flag be added that a user space caller can
specify to say that they're OK with a struct_ops implementation not
supporting .validate() and .update(), but still want to use a link to
manage registration and unregistration?  Assuming I'm understanding your
suggestion correctly, I don't think it's what we want. Updating a
struct_ops map value is arguably orthogonal to the bpf link handling
registration and unregistration, so it seems confusing to require a user
to specify that it's the behavior they want as there's no reason they
shouldn't want it. If they mistakenly thought that update element is
supposed for that struct_ops variant, they'll just get an -EOPNOTSUPP
error at runtime, which seems reasonable. If a struct_ops implementation
should have implemented .validate() and/or .update() and neglects to,
that would just be a bug in the struct_ops implementation.

Apologies if I've misunderstood your proposal, and please feel free to
clarify if I have.

Thanks,
David

> 
> > 
> > Signed-off-by: David Vernet <void@manifault.com>
> > ---
> >   kernel/bpf/bpf_struct_ops.c | 17 +++++++++++------
> >   1 file changed, 11 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > index eaff04eefb31..3d2fb85186a9 100644
> > --- a/kernel/bpf/bpf_struct_ops.c
> > +++ b/kernel/bpf/bpf_struct_ops.c
> > @@ -509,9 +509,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> >   	}
> >   	if (st_map->map.map_flags & BPF_F_LINK) {
> > -		err = st_ops->validate(kdata);
> > -		if (err)
> > -			goto reset_unlock;
> > +		err = 0;
> > +		if (st_ops->validate) {
> > +			err = st_ops->validate(kdata);
> > +			if (err)
> > +				goto reset_unlock;
> > +		}
> >   		set_memory_rox((long)st_map->image, 1);
> >   		/* Let bpf_link handle registration & unregistration.
> >   		 *
> > @@ -663,9 +666,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> >   	if (attr->value_size != vt->size)
> >   		return ERR_PTR(-EINVAL);
> > -	if (attr->map_flags & BPF_F_LINK && (!st_ops->validate || !st_ops->update))
> > -		return ERR_PTR(-EOPNOTSUPP);
> > -
> >   	t = st_ops->type;
> >   	st_map_size = sizeof(*st_map) +
> > @@ -838,6 +838,11 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
> >   		goto err_out;
> >   	}
> > +	if (!st_map->st_ops->update) {
> > +		err = -EOPNOTSUPP;
> > +		goto err_out;
> > +	}
> > +
> >   	err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
> >   	if (err)
> >   		goto err_out;
  
David Vernet Aug. 11, 2023, 3:10 p.m. UTC | #7
On Thu, Aug 10, 2023 at 11:22:04PM -0700, Kui-Feng Lee wrote:
> Overall, this patch make sense to me.

Thanks for the review.

> On 8/10/23 15:04, David Vernet wrote:
> > Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> > define the .validate() and .update() callbacks in its corresponding
> > struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> > in its own right to ensure that the map is unloaded if an application
> > crashes. For example, with sched_ext, we want to automatically unload
> > the host-wide scheduler if the application crashes. We would likely
> > never support updating elements of a sched_ext struct_ops map, so we'd
> > have to implement these callbacks showing that they _can't_ support
> > element updates just to benefit from the basic lifetime management of
> > struct_ops links.
> > 
> > Let's enable struct_ops maps to work with BPF_F_LINK even if they
> > haven't defined these callbacks, by assuming that a struct_ops map
> > element cannot be updated by default.
> > 
> > Signed-off-by: David Vernet <void@manifault.com>
> > ---
> >   kernel/bpf/bpf_struct_ops.c | 17 +++++++++++------
> >   1 file changed, 11 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > index eaff04eefb31..3d2fb85186a9 100644
> > --- a/kernel/bpf/bpf_struct_ops.c
> > +++ b/kernel/bpf/bpf_struct_ops.c
> > @@ -509,9 +509,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> >   	}
> >   	if (st_map->map.map_flags & BPF_F_LINK) {
> > -		err = st_ops->validate(kdata);
> > -		if (err)
> > -			goto reset_unlock;
> > +		err = 0;
> > +		if (st_ops->validate) {
> > +			err = st_ops->validate(kdata);
> > +			if (err)
> > +				goto reset_unlock;
> > +		}
> >   		set_memory_rox((long)st_map->image, 1);
> >   		/* Let bpf_link handle registration & unregistration.
> >   		 *
> > @@ -663,9 +666,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> >   	if (attr->value_size != vt->size)
> >   		return ERR_PTR(-EINVAL);
> > -	if (attr->map_flags & BPF_F_LINK && (!st_ops->validate || !st_ops->update))
> > -		return ERR_PTR(-EOPNOTSUPP);
> > -
> >   	t = st_ops->type;
> >   	st_map_size = sizeof(*st_map) +
> > @@ -838,6 +838,11 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
> >   		goto err_out;
> >   	}
> > +	if (!st_map->st_ops->update) {
> > +		err = -EOPNOTSUPP;
> > +		goto err_out;
> > +	}
> > +
> 
> We can perform this check before calling mutex_lock(), and
> return -EOPNOTSUPP early.

Yep, let's do the check outside of that global mutex. Will make that
change for v2.

Thanks,
David
  
Yonghong Song Aug. 11, 2023, 3:43 p.m. UTC | #8
On 8/11/23 8:09 AM, David Vernet wrote:
> On Thu, Aug 10, 2023 at 11:43:26PM -0700, Yonghong Song wrote:
>>
>>
>> On 8/10/23 3:04 PM, David Vernet wrote:
>>> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
>>> define the .validate() and .update() callbacks in its corresponding
>>> struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
>>> in its own right to ensure that the map is unloaded if an application
>>> crashes. For example, with sched_ext, we want to automatically unload
>>> the host-wide scheduler if the application crashes. We would likely
>>> never support updating elements of a sched_ext struct_ops map, so we'd
>>> have to implement these callbacks showing that they _can't_ support
>>> element updates just to benefit from the basic lifetime management of
>>> struct_ops links.
>>>
>>> Let's enable struct_ops maps to work with BPF_F_LINK even if they
>>> haven't defined these callbacks, by assuming that a struct_ops map
>>> element cannot be updated by default.
>>
>> Maybe you want to add one map_flag to indicate validate/update callbacks
>> are optional for a struct_ops link? In this case, some struct_ops maps
>> can still require validate() and update(), but others can skip them?
> 
> Are you proposing that a map flag be added that a user space caller can
> specify to say that they're OK with a struct_ops implementation not
> supporting .validate() and .update(), but still want to use a link to
> manage registration and unregistration?  Assuming I'm understanding your
> suggestion correctly, I don't think it's what we want. Updating a
> struct_ops map value is arguably orthogonal to the bpf link handling
> registration and unregistration, so it seems confusing to require a user
> to specify that it's the behavior they want as there's no reason they
> shouldn't want it. If they mistakenly thought that update element is
> supposed for that struct_ops variant, they'll just get an -EOPNOTSUPP
> error at runtime, which seems reasonable. If a struct_ops implementation
> should have implemented .validate() and/or .update() and neglects to,
> that would just be a bug in the struct_ops implementation.
> 
> Apologies if I've misunderstood your proposal, and please feel free to
> clarify if I have.

You understanding with my proposal is correct.
Okay, after further thought, I agree with your above point.
Lacking implementation of 'validate' and 'update' itself is
equivalent to a flag. So flag itself is not really needed.

> 
> Thanks,
> David
> 
>>
>>>
>>> Signed-off-by: David Vernet <void@manifault.com>
>>> ---
>>>    kernel/bpf/bpf_struct_ops.c | 17 +++++++++++------
>>>    1 file changed, 11 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>>> index eaff04eefb31..3d2fb85186a9 100644
>>> --- a/kernel/bpf/bpf_struct_ops.c
>>> +++ b/kernel/bpf/bpf_struct_ops.c
>>> @@ -509,9 +509,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>>>    	}
>>>    	if (st_map->map.map_flags & BPF_F_LINK) {
>>> -		err = st_ops->validate(kdata);
>>> -		if (err)
>>> -			goto reset_unlock;
>>> +		err = 0;
>>> +		if (st_ops->validate) {
>>> +			err = st_ops->validate(kdata);
>>> +			if (err)
>>> +				goto reset_unlock;
>>> +		}
>>>    		set_memory_rox((long)st_map->image, 1);
>>>    		/* Let bpf_link handle registration & unregistration.
>>>    		 *
>>> @@ -663,9 +666,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
>>>    	if (attr->value_size != vt->size)
>>>    		return ERR_PTR(-EINVAL);
>>> -	if (attr->map_flags & BPF_F_LINK && (!st_ops->validate || !st_ops->update))
>>> -		return ERR_PTR(-EOPNOTSUPP);
>>> -
>>>    	t = st_ops->type;
>>>    	st_map_size = sizeof(*st_map) +
>>> @@ -838,6 +838,11 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
>>>    		goto err_out;
>>>    	}
>>> +	if (!st_map->st_ops->update) {
>>> +		err = -EOPNOTSUPP;
>>> +		goto err_out;
>>> +	}
>>> +
>>>    	err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
>>>    	if (err)
>>>    		goto err_out;
>
  
Martin KaFai Lau Aug. 11, 2023, 5:35 p.m. UTC | #9
On 8/10/23 4:15 PM, Stanislav Fomichev wrote:
> On 08/10, David Vernet wrote:
>> On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
>>> On 08/10, David Vernet wrote:
>>>> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
>>>> define the .validate() and .update() callbacks in its corresponding
>>>> struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
>>>> in its own right to ensure that the map is unloaded if an application
>>>> crashes. For example, with sched_ext, we want to automatically unload
>>>> the host-wide scheduler if the application crashes. We would likely
>>>> never support updating elements of a sched_ext struct_ops map, so we'd
>>>> have to implement these callbacks showing that they _can't_ support
>>>> element updates just to benefit from the basic lifetime management of
>>>> struct_ops links.
>>>>
>>>> Let's enable struct_ops maps to work with BPF_F_LINK even if they
>>>> haven't defined these callbacks, by assuming that a struct_ops map
>>>> element cannot be updated by default.
>>>
>>> Any reason this is not part of sched_ext series? As you mention,
>>> we don't seem to have such users in the three?
>>
>> Hi Stanislav,
>>
>> The sched_ext series [0] implements these callbacks. See
>> bpf_scx_update() and bpf_scx_validate().
>>
>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>
>> We could add this into that series and remove those callbacks, but this
>> patch is fixing a UX / API issue with struct_ops links that's not really
>> relevant to sched_ext. I don't think there's any reason to couple
>> updating struct_ops map elements with allowing the kernel to manage the
>> lifetime of struct_ops maps -- just because we only have 1 (non-test)

Agree the link-update does not necessarily couple with link-creation, so 
removing 'link' update function enforcement is ok. The intention was to avoid 
the struct_ops link inconsistent experience (one struct_ops link support update 
and another struct_ops link does not) because consistency was one of the reason 
for the true kernel backed link support that Kui-Feng did. tcp-cc is the only 
one for now in struct_ops and it can support update, so the enforcement is here. 
I can see Stan's point that removing it now looks immature before a struct_ops 
landed in the kernel showing it does not make sense or very hard to support 
'link' update. However, the scx patch set has shown this point, so I think it is 
good enough.

For 'validate', it is not related a 'link' update. It is for the struct_ops 
'map' update. If the loaded struct_ops map is invalid, it will end up having a 
useless struct_ops map and no link can be created from it. I can see some 
struct_ops subsystem check all the 'ops' function for NULL before calling (like 
the FUSE RFC). I can also see some future struct_ops will prefer not to check 
NULL at all and prefer to assume a subset of the ops is always valid. Does 
having a 'validate' enforcement is blocking the scx patchset in some way? If 
not, I would like to keep this for now. Once it is removed, there is no turning 
back.

>> struct_ops implementation in-tree doesn't mean we shouldn't improve APIs
>> where it makes sense.
>>
>> Thanks,
>> David
> 
> Ack. I guess up to you and Martin. Just trying to understand whether I'm
> missing something or the patch does indeed fix some use-case :-)
  
Kui-Feng Lee Aug. 11, 2023, 6:17 p.m. UTC | #10
On 8/11/23 10:35, Martin KaFai Lau wrote:
> On 8/10/23 4:15 PM, Stanislav Fomichev wrote:
>> On 08/10, David Vernet wrote:
>>> On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
>>>> On 08/10, David Vernet wrote:
>>>>> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
>>>>> define the .validate() and .update() callbacks in its corresponding
>>>>> struct bpf_struct_ops in the kernel. Enabling struct_ops link is 
>>>>> useful
>>>>> in its own right to ensure that the map is unloaded if an application
>>>>> crashes. For example, with sched_ext, we want to automatically unload
>>>>> the host-wide scheduler if the application crashes. We would likely
>>>>> never support updating elements of a sched_ext struct_ops map, so we'd
>>>>> have to implement these callbacks showing that they _can't_ support
>>>>> element updates just to benefit from the basic lifetime management of
>>>>> struct_ops links.
>>>>>
>>>>> Let's enable struct_ops maps to work with BPF_F_LINK even if they
>>>>> haven't defined these callbacks, by assuming that a struct_ops map
>>>>> element cannot be updated by default.
>>>>
>>>> Any reason this is not part of sched_ext series? As you mention,
>>>> we don't seem to have such users in the three?
>>>
>>> Hi Stanislav,
>>>
>>> The sched_ext series [0] implements these callbacks. See
>>> bpf_scx_update() and bpf_scx_validate().
>>>
>>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>
>>> We could add this into that series and remove those callbacks, but this
>>> patch is fixing a UX / API issue with struct_ops links that's not really
>>> relevant to sched_ext. I don't think there's any reason to couple
>>> updating struct_ops map elements with allowing the kernel to manage the
>>> lifetime of struct_ops maps -- just because we only have 1 (non-test)
> 
> Agree the link-update does not necessarily couple with link-creation, so 
> removing 'link' update function enforcement is ok. The intention was to 
> avoid the struct_ops link inconsistent experience (one struct_ops link 
> support update and another struct_ops link does not) because consistency 
> was one of the reason for the true kernel backed link support that 
> Kui-Feng did. tcp-cc is the only one for now in struct_ops and it can 
> support update, so the enforcement is here. I can see Stan's point that 
> removing it now looks immature before a struct_ops landed in the kernel 
> showing it does not make sense or very hard to support 'link' update. 
> However, the scx patch set has shown this point, so I think it is good 
> enough.
> 
> For 'validate', it is not related a 'link' update. It is for the 
> struct_ops 'map' update. If the loaded struct_ops map is invalid, it 
> will end up having a useless struct_ops map and no link can be created 
> from it. I can see some struct_ops subsystem check all the 'ops' 
> function for NULL before calling (like the FUSE RFC). I can also see 
> some future struct_ops will prefer not to check NULL at all and prefer 
> to assume a subset of the ops is always valid. Does having a 'validate' 
> enforcement is blocking the scx patchset in some way? If not, I would 
> like to keep this for now. Once it is removed, there is no turning back.

I am not saying which one is right or wrong, but the followings are some
of my concerns. Just FYI!

The 'validate' change more likes a default implementation that always
return 0.  It is up to struct_ops types to decide how to validate
values. If they decide to always success, this change will save them
a bit of time. In opposite, allowing empty update may make difficulties
to the developers of new struct_ops types. New developers
may spend a lot of time in the code base to figure out that they should 
implement an update function to make it work. A better document may
help. However, checking these function pointers at the first moment is
even better.

> 
>>> struct_ops implementation in-tree doesn't mean we shouldn't improve APIs
>>> where it makes sense.
>>>
>>> Thanks,
>>> David
>>
>> Ack. I guess up to you and Martin. Just trying to understand whether I'm
>> missing something or the patch does indeed fix some use-case :-)
> 
>
  
David Vernet Aug. 11, 2023, 8:19 p.m. UTC | #11
On Fri, Aug 11, 2023 at 10:35:03AM -0700, Martin KaFai Lau wrote:
> On 8/10/23 4:15 PM, Stanislav Fomichev wrote:
> > On 08/10, David Vernet wrote:
> > > On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
> > > > On 08/10, David Vernet wrote:
> > > > > Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> > > > > define the .validate() and .update() callbacks in its corresponding
> > > > > struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> > > > > in its own right to ensure that the map is unloaded if an application
> > > > > crashes. For example, with sched_ext, we want to automatically unload
> > > > > the host-wide scheduler if the application crashes. We would likely
> > > > > never support updating elements of a sched_ext struct_ops map, so we'd
> > > > > have to implement these callbacks showing that they _can't_ support
> > > > > element updates just to benefit from the basic lifetime management of
> > > > > struct_ops links.
> > > > > 
> > > > > Let's enable struct_ops maps to work with BPF_F_LINK even if they
> > > > > haven't defined these callbacks, by assuming that a struct_ops map
> > > > > element cannot be updated by default.
> > > > 
> > > > Any reason this is not part of sched_ext series? As you mention,
> > > > we don't seem to have such users in the three?
> > > 
> > > Hi Stanislav,
> > > 
> > > The sched_ext series [0] implements these callbacks. See
> > > bpf_scx_update() and bpf_scx_validate().
> > > 
> > > [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
> > > 
> > > We could add this into that series and remove those callbacks, but this
> > > patch is fixing a UX / API issue with struct_ops links that's not really
> > > relevant to sched_ext. I don't think there's any reason to couple
> > > updating struct_ops map elements with allowing the kernel to manage the
> > > lifetime of struct_ops maps -- just because we only have 1 (non-test)
> 
> Agree the link-update does not necessarily couple with link-creation, so
> removing 'link' update function enforcement is ok. The intention was to
> avoid the struct_ops link inconsistent experience (one struct_ops link
> support update and another struct_ops link does not) because consistency was
> one of the reason for the true kernel backed link support that Kui-Feng did.
> tcp-cc is the only one for now in struct_ops and it can support update, so
> the enforcement is here. I can see Stan's point that removing it now looks
> immature before a struct_ops landed in the kernel showing it does not make
> sense or very hard to support 'link' update. However, the scx patch set has
> shown this point, so I think it is good enough.

Sorry for sending v2 of the patch a bit prematurely. Should have let you
weigh in first.

> For 'validate', it is not related a 'link' update. It is for the struct_ops
> 'map' update. If the loaded struct_ops map is invalid, it will end up having
> a useless struct_ops map and no link can be created from it. I can see some

To be honest I'm actually not sure I understand why .validate() is only
called for when BPF_F_LINK is specified. Is it because it could break
existing programs if they defined a struct_ops map that wasn't valid
_without_ using BPF_F_LINK? Whether or not a map is valid should inform
whether we can load it regardless of whether there's a link, no? It
seems like .init_member() was already doing this as well. That's why I
got confused and conflated the two.

> struct_ops subsystem check all the 'ops' function for NULL before calling
> (like the FUSE RFC). I can also see some future struct_ops will prefer not
> to check NULL at all and prefer to assume a subset of the ops is always
> valid. Does having a 'validate' enforcement is blocking the scx patchset in
> some way? If not, I would like to keep this for now. Once it is removed,

No, it's not blocking scx at all. scx, as with any other struct_ops
implementation, could and does just implement these callbacks. As
Kui-Feng said in [0], this is really just about enabling a sane default
to improve usability. If a struct_ops implementation actually should
have implemented some validation but neglected to, that would be a bug
in exactly the same manner as if it had implemented .validate(), but
neglected to check some corner case that makes the map invalid.

[0]: https://lore.kernel.org/lkml/887699ea-f837-6ed7-50bd-48720cea581c@gmail.com/

> there is no turning back.

Hmm, why there would be no turning back from this? This isn't a UAPI
concern, is it? Whether or not a struct_ops implementation needs to
implement .validate() or can just rely on the default behavior of "no
.validate() callback implies the map is valid" is 100% an implementation
detail that's hidden from the end user. This is meant to be a UX
improvement for a developr defining a struct bpf_struct_ops instance in
the main kernel, not someone defining an instance of that struct_ops
(e.g. struct tcp_congestion_ops) in a BPF prog.
  
Kui-Feng Lee Aug. 11, 2023, 9:25 p.m. UTC | #12
On 8/11/23 13:19, David Vernet wrote:
> On Fri, Aug 11, 2023 at 10:35:03AM -0700, Martin KaFai Lau wrote:
>> On 8/10/23 4:15 PM, Stanislav Fomichev wrote:
>>> On 08/10, David Vernet wrote:
>>>> On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
>>>>> On 08/10, David Vernet wrote:
>>>>>> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
>>>>>> define the .validate() and .update() callbacks in its corresponding
>>>>>> struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
>>>>>> in its own right to ensure that the map is unloaded if an application
>>>>>> crashes. For example, with sched_ext, we want to automatically unload
>>>>>> the host-wide scheduler if the application crashes. We would likely
>>>>>> never support updating elements of a sched_ext struct_ops map, so we'd
>>>>>> have to implement these callbacks showing that they _can't_ support
>>>>>> element updates just to benefit from the basic lifetime management of
>>>>>> struct_ops links.
>>>>>>
>>>>>> Let's enable struct_ops maps to work with BPF_F_LINK even if they
>>>>>> haven't defined these callbacks, by assuming that a struct_ops map
>>>>>> element cannot be updated by default.
>>>>>
>>>>> Any reason this is not part of sched_ext series? As you mention,
>>>>> we don't seem to have such users in the three?
>>>>
>>>> Hi Stanislav,
>>>>
>>>> The sched_ext series [0] implements these callbacks. See
>>>> bpf_scx_update() and bpf_scx_validate().
>>>>
>>>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>>
>>>> We could add this into that series and remove those callbacks, but this
>>>> patch is fixing a UX / API issue with struct_ops links that's not really
>>>> relevant to sched_ext. I don't think there's any reason to couple
>>>> updating struct_ops map elements with allowing the kernel to manage the
>>>> lifetime of struct_ops maps -- just because we only have 1 (non-test)
>>
>> Agree the link-update does not necessarily couple with link-creation, so
>> removing 'link' update function enforcement is ok. The intention was to
>> avoid the struct_ops link inconsistent experience (one struct_ops link
>> support update and another struct_ops link does not) because consistency was
>> one of the reason for the true kernel backed link support that Kui-Feng did.
>> tcp-cc is the only one for now in struct_ops and it can support update, so
>> the enforcement is here. I can see Stan's point that removing it now looks
>> immature before a struct_ops landed in the kernel showing it does not make
>> sense or very hard to support 'link' update. However, the scx patch set has
>> shown this point, so I think it is good enough.
> 
> Sorry for sending v2 of the patch a bit prematurely. Should have let you
> weigh in first.
> 
>> For 'validate', it is not related a 'link' update. It is for the struct_ops
>> 'map' update. If the loaded struct_ops map is invalid, it will end up having
>> a useless struct_ops map and no link can be created from it. I can see some
> 
> To be honest I'm actually not sure I understand why .validate() is only
> called for when BPF_F_LINK is specified. Is it because it could break
> existing programs if they defined a struct_ops map that wasn't valid
> _without_ using BPF_F_LINK? Whether or not a map is valid should inform
> whether we can load it regardless of whether there's a link, no? It
> seems like .init_member() was already doing this as well. That's why I
> got confused and conflated the two.

With the previous solution (without link), you can not update the values
of a struct_ops map directly.
You have to delete the existing value before update it.
Updating a value would register a value, a function set,
to the implementation of a struct_ops type. Deleting a value
would unregister the value. So, the validation can be performed
in the registration function.

For BPF_LINK, it provides a solution to update a function
set atomically.  You doesn't have to unregister an existing
one before installing a new one. That is why validate functions
are invented.

init_member() handles/validates per-member value.  It can not detect
what is necessary but absent.  validate() has a full set of function
pointers (all members), so it is able to determine if something
necessary is missing.

> 
>> struct_ops subsystem check all the 'ops' function for NULL before calling
>> (like the FUSE RFC). I can also see some future struct_ops will prefer not
>> to check NULL at all and prefer to assume a subset of the ops is always
>> valid. Does having a 'validate' enforcement is blocking the scx patchset in
>> some way? If not, I would like to keep this for now. Once it is removed,
> 
> No, it's not blocking scx at all. scx, as with any other struct_ops
> implementation, could and does just implement these callbacks. As
> Kui-Feng said in [0], this is really just about enabling a sane default
> to improve usability. If a struct_ops implementation actually should
> have implemented some validation but neglected to, that would be a bug
> in exactly the same manner as if it had implemented .validate(), but
> neglected to check some corner case that makes the map invalid.
> 
> [0]: https://lore.kernel.org/lkml/887699ea-f837-6ed7-50bd-48720cea581c@gmail.com/
> 
>> there is no turning back.
> 
> Hmm, why there would be no turning back from this? This isn't a UAPI
> concern, is it? Whether or not a struct_ops implementation needs to
> implement .validate() or can just rely on the default behavior of "no
> .validate() callback implies the map is valid" is 100% an implementation
> detail that's hidden from the end user. This is meant to be a UX
> improvement for a developr defining a struct bpf_struct_ops instance in
> the main kernel, not someone defining an instance of that struct_ops
> (e.g. struct tcp_congestion_ops) in a BPF prog.
>
  
Martin KaFai Lau Aug. 11, 2023, 10:49 p.m. UTC | #13
On 8/11/23 1:19 PM, David Vernet wrote:
> On Fri, Aug 11, 2023 at 10:35:03AM -0700, Martin KaFai Lau wrote:
>> On 8/10/23 4:15 PM, Stanislav Fomichev wrote:
>>> On 08/10, David Vernet wrote:
>>>> On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
>>>>> On 08/10, David Vernet wrote:
>>>>>> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
>>>>>> define the .validate() and .update() callbacks in its corresponding
>>>>>> struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
>>>>>> in its own right to ensure that the map is unloaded if an application
>>>>>> crashes. For example, with sched_ext, we want to automatically unload
>>>>>> the host-wide scheduler if the application crashes. We would likely
>>>>>> never support updating elements of a sched_ext struct_ops map, so we'd
>>>>>> have to implement these callbacks showing that they _can't_ support
>>>>>> element updates just to benefit from the basic lifetime management of
>>>>>> struct_ops links.
>>>>>>
>>>>>> Let's enable struct_ops maps to work with BPF_F_LINK even if they
>>>>>> haven't defined these callbacks, by assuming that a struct_ops map
>>>>>> element cannot be updated by default.
>>>>>
>>>>> Any reason this is not part of sched_ext series? As you mention,
>>>>> we don't seem to have such users in the three?
>>>>
>>>> Hi Stanislav,
>>>>
>>>> The sched_ext series [0] implements these callbacks. See
>>>> bpf_scx_update() and bpf_scx_validate().
>>>>
>>>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>>
>>>> We could add this into that series and remove those callbacks, but this
>>>> patch is fixing a UX / API issue with struct_ops links that's not really
>>>> relevant to sched_ext. I don't think there's any reason to couple
>>>> updating struct_ops map elements with allowing the kernel to manage the
>>>> lifetime of struct_ops maps -- just because we only have 1 (non-test)
>>
>> Agree the link-update does not necessarily couple with link-creation, so
>> removing 'link' update function enforcement is ok. The intention was to
>> avoid the struct_ops link inconsistent experience (one struct_ops link
>> support update and another struct_ops link does not) because consistency was
>> one of the reason for the true kernel backed link support that Kui-Feng did.
>> tcp-cc is the only one for now in struct_ops and it can support update, so
>> the enforcement is here. I can see Stan's point that removing it now looks
>> immature before a struct_ops landed in the kernel showing it does not make
>> sense or very hard to support 'link' update. However, the scx patch set has
>> shown this point, so I think it is good enough.
> 
> Sorry for sending v2 of the patch a bit prematurely. Should have let you
> weigh in first.
> 
>> For 'validate', it is not related a 'link' update. It is for the struct_ops
>> 'map' update. If the loaded struct_ops map is invalid, it will end up having
>> a useless struct_ops map and no link can be created from it. I can see some
> 
> To be honest I'm actually not sure I understand why .validate() is only
> called for when BPF_F_LINK is specified. Is it because it could break

Regardless '.validate' must be enforced or not, the ->validate() should be 
called for the non BPF_F_LINK case also during map update. This should be fixed.

> existing programs if they defined a struct_ops map that wasn't valid
> _without_ using BPF_F_LINK? Whether or not a map is valid should inform
> whether we can load it regardless of whether there's a link, no? It
> seems like .init_member() was already doing this as well. That's why I
> got confused and conflated the two.

I think the best is to look at bpf_struct_ops_map_update_elem() and the 
differences between BPF_F_LINK and the older non BPF_F_LINK behavior.

Before the BPF_F_LINK was introduced, the map update and ->reg() happened 
together, so the kernel can reject at the map update time through ->reg() 
because '->reg()' does the validation also. If the earlier map update failed, 
the user space can do a map update again.

With the BPF_F_LINK, the map update and ->reg are two separated actions. The 
->reg is done later in the link creation time (after the map is updated). If the 
BPF_F_LINK struct_ops is not validated as a whole (like ops1 and ops2 must be 
defined) during map update, it will only be discovered during the link creation 
time in bpf_struct_ops_link_create() by ->reg(). It will be too late for the 
userspace to correct that mistake because the map cannot be updated again. Then 
it will end up having a struct_ops map loaded in the kernel that cannot do 
anything. I don't think it is the common case but at least the map should not be 
left in some unusable state when it did happen.

It is why the validation part has been separated from the '.reg', so '.validate' 
was added and enforced.  and ->validate() is called during the map update.

'.init_member' is for validating individual ops/member but not for validating 
struct_ops as a whole, like if ops_x is implemented, then ops_y must be 
implemented also.

>> struct_ops subsystem check all the 'ops' function for NULL before calling
>> (like the FUSE RFC). I can also see some future struct_ops will prefer not
>> to check NULL at all and prefer to assume a subset of the ops is always
>> valid. Does having a 'validate' enforcement is blocking the scx patchset in
>> some way? If not, I would like to keep this for now. Once it is removed,
> 
> No, it's not blocking scx at all. scx, as with any other struct_ops
> implementation, could and does just implement these callbacks. As
> Kui-Feng said in [0], this is really just about enabling a sane default
> to improve usability. If a struct_ops implementation actually should
> have implemented some validation but neglected to, that would be a bug
> in exactly the same manner as if it had implemented .validate(), but
> neglected to check some corner case that makes the map invalid.
> 
> [0]: https://lore.kernel.org/lkml/887699ea-f837-6ed7-50bd-48720cea581c@gmail.com/
> 
>> there is no turning back.
> 
> Hmm, why there would be no turning back from this? This isn't a UAPI
> concern, is it? Whether or not a struct_ops implementation needs to

hmm...at least, map update success in one kernel and then map update failure in 
a later kernel is a different behavior. yeah, the succeeded map is unusable 
anyway but still no need to create this inconsistency to begin with if it does 
not have to.

> implement .validate() or can just rely on the default behavior of "no
> .validate() callback implies the map is valid" is 100% an implementation
> detail that's hidden from the end user. This is meant to be a UX
> improvement for a developr defining a struct bpf_struct_ops instance in
> the main kernel, not someone defining an instance of that struct_ops
> (e.g. struct tcp_congestion_ops) in a BPF prog.

The UX here is about the subsystem doing the very first time struct_ops 
implementation in the kernel, so yes it is the internal details of the kernel 
and one time cost.

Multiple struct_ops bpf prog can then be developed and the bpf developer is not 
affected no matter .validate is enforced or not.

I think I weighted the end-user space experience more. Having map in unusable 
state is a bad userspace experience. Yes, the way it is enforcing it now looks 
bureaucratic. I think it took two emails to explain the internal details of the 
struct_ops update and the difference between doing validation in .validate vs in 
.reg. I am not sure if the subsystem implementer wants to know all this details 
or just go ahead to implement validation in '.validate' and put an empty one for 
the subsystem does not need to check anything.

I think enough words have exchanged on this subject. I am not going to insist. 
If it is still preferred to have this check removed, please add details 
description to the '.validate' of struct bpf_struct_ops in bpf.h and also the 
commit message to spell out the details for the future subsystem struct_ops 
kernel developer to follow:

If it needs to validate struct_ops as a while,

1. it must be implemented in .validate instead of .reg. Otherwise, it may end up 
having an unusable map.

2. if the validation is implemented in '.reg' only, the map update behavior will 
be different between BPF_F_LINK map and the non BPF_F_LINK map.
  
Kui-Feng Lee Aug. 11, 2023, 11:12 p.m. UTC | #14
On 8/11/23 15:49, Martin KaFai Lau wrote:
> On 8/11/23 1:19 PM, David Vernet wrote:
>> On Fri, Aug 11, 2023 at 10:35:03AM -0700, Martin KaFai Lau wrote:
>>> On 8/10/23 4:15 PM, Stanislav Fomichev wrote:
>>>> On 08/10, David Vernet wrote:
>>>>> On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
>>>>>> On 08/10, David Vernet wrote:
>>>>>>> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must 
>>>>>>> also
>>>>>>> define the .validate() and .update() callbacks in its corresponding
>>>>>>> struct bpf_struct_ops in the kernel. Enabling struct_ops link is 
>>>>>>> useful
>>>>>>> in its own right to ensure that the map is unloaded if an 
>>>>>>> application
>>>>>>> crashes. For example, with sched_ext, we want to automatically 
>>>>>>> unload
>>>>>>> the host-wide scheduler if the application crashes. We would likely
>>>>>>> never support updating elements of a sched_ext struct_ops map, so 
>>>>>>> we'd
>>>>>>> have to implement these callbacks showing that they _can't_ support
>>>>>>> element updates just to benefit from the basic lifetime 
>>>>>>> management of
>>>>>>> struct_ops links.
>>>>>>>
>>>>>>> Let's enable struct_ops maps to work with BPF_F_LINK even if they
>>>>>>> haven't defined these callbacks, by assuming that a struct_ops map
>>>>>>> element cannot be updated by default.
>>>>>>
>>>>>> Any reason this is not part of sched_ext series? As you mention,
>>>>>> we don't seem to have such users in the three?
>>>>>
>>>>> Hi Stanislav,
>>>>>
>>>>> The sched_ext series [0] implements these callbacks. See
>>>>> bpf_scx_update() and bpf_scx_validate().
>>>>>
>>>>> [0]: 
>>>>> https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>>>
>>>>> We could add this into that series and remove those callbacks, but 
>>>>> this
>>>>> patch is fixing a UX / API issue with struct_ops links that's not 
>>>>> really
>>>>> relevant to sched_ext. I don't think there's any reason to couple
>>>>> updating struct_ops map elements with allowing the kernel to manage 
>>>>> the
>>>>> lifetime of struct_ops maps -- just because we only have 1 (non-test)
>>>
>>> Agree the link-update does not necessarily couple with link-creation, so
>>> removing 'link' update function enforcement is ok. The intention was to
>>> avoid the struct_ops link inconsistent experience (one struct_ops link
>>> support update and another struct_ops link does not) because 
>>> consistency was
>>> one of the reason for the true kernel backed link support that 
>>> Kui-Feng did.
>>> tcp-cc is the only one for now in struct_ops and it can support 
>>> update, so
>>> the enforcement is here. I can see Stan's point that removing it now 
>>> looks
>>> immature before a struct_ops landed in the kernel showing it does not 
>>> make
>>> sense or very hard to support 'link' update. However, the scx patch 
>>> set has
>>> shown this point, so I think it is good enough.
>>
>> Sorry for sending v2 of the patch a bit prematurely. Should have let you
>> weigh in first.
>>
>>> For 'validate', it is not related a 'link' update. It is for the 
>>> struct_ops
>>> 'map' update. If the loaded struct_ops map is invalid, it will end up 
>>> having
>>> a useless struct_ops map and no link can be created from it. I can 
>>> see some
>>
>> To be honest I'm actually not sure I understand why .validate() is only
>> called for when BPF_F_LINK is specified. Is it because it could break
> 
> Regardless '.validate' must be enforced or not, the ->validate() should 
> be called for the non BPF_F_LINK case also during map update. This 
> should be fixed.

For the case of the TCP congestion control, its validation function is
called by the implementations of ->validate() and ->reg(). I mean it
expects ->reg() to do validation as well.

... SKIP ...
  
Martin KaFai Lau Aug. 11, 2023, 11:34 p.m. UTC | #15
On 8/11/23 4:12 PM, Kui-Feng Lee wrote:
> 
> 
> On 8/11/23 15:49, Martin KaFai Lau wrote:
>> On 8/11/23 1:19 PM, David Vernet wrote:
>>> On Fri, Aug 11, 2023 at 10:35:03AM -0700, Martin KaFai Lau wrote:
>>>> On 8/10/23 4:15 PM, Stanislav Fomichev wrote:
>>>>> On 08/10, David Vernet wrote:
>>>>>> On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
>>>>>>> On 08/10, David Vernet wrote:
>>>>>>>> Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
>>>>>>>> define the .validate() and .update() callbacks in its corresponding
>>>>>>>> struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
>>>>>>>> in its own right to ensure that the map is unloaded if an application
>>>>>>>> crashes. For example, with sched_ext, we want to automatically unload
>>>>>>>> the host-wide scheduler if the application crashes. We would likely
>>>>>>>> never support updating elements of a sched_ext struct_ops map, so we'd
>>>>>>>> have to implement these callbacks showing that they _can't_ support
>>>>>>>> element updates just to benefit from the basic lifetime management of
>>>>>>>> struct_ops links.
>>>>>>>>
>>>>>>>> Let's enable struct_ops maps to work with BPF_F_LINK even if they
>>>>>>>> haven't defined these callbacks, by assuming that a struct_ops map
>>>>>>>> element cannot be updated by default.
>>>>>>>
>>>>>>> Any reason this is not part of sched_ext series? As you mention,
>>>>>>> we don't seem to have such users in the three?
>>>>>>
>>>>>> Hi Stanislav,
>>>>>>
>>>>>> The sched_ext series [0] implements these callbacks. See
>>>>>> bpf_scx_update() and bpf_scx_validate().
>>>>>>
>>>>>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>>>>
>>>>>> We could add this into that series and remove those callbacks, but this
>>>>>> patch is fixing a UX / API issue with struct_ops links that's not really
>>>>>> relevant to sched_ext. I don't think there's any reason to couple
>>>>>> updating struct_ops map elements with allowing the kernel to manage the
>>>>>> lifetime of struct_ops maps -- just because we only have 1 (non-test)
>>>>
>>>> Agree the link-update does not necessarily couple with link-creation, so
>>>> removing 'link' update function enforcement is ok. The intention was to
>>>> avoid the struct_ops link inconsistent experience (one struct_ops link
>>>> support update and another struct_ops link does not) because consistency was
>>>> one of the reason for the true kernel backed link support that Kui-Feng did.
>>>> tcp-cc is the only one for now in struct_ops and it can support update, so
>>>> the enforcement is here. I can see Stan's point that removing it now looks
>>>> immature before a struct_ops landed in the kernel showing it does not make
>>>> sense or very hard to support 'link' update. However, the scx patch set has
>>>> shown this point, so I think it is good enough.
>>>
>>> Sorry for sending v2 of the patch a bit prematurely. Should have let you
>>> weigh in first.
>>>
>>>> For 'validate', it is not related a 'link' update. It is for the struct_ops
>>>> 'map' update. If the loaded struct_ops map is invalid, it will end up having
>>>> a useless struct_ops map and no link can be created from it. I can see some
>>>
>>> To be honest I'm actually not sure I understand why .validate() is only
>>> called for when BPF_F_LINK is specified. Is it because it could break
>>
>> Regardless '.validate' must be enforced or not, the ->validate() should be 
>> called for the non BPF_F_LINK case also during map update. This should be fixed.
> 
> For the case of the TCP congestion control, its validation function is
> called by the implementations of ->validate() and ->reg(). I mean it
> expects ->reg() to do validation as well.

Right, for tcp-cc, the reg is doing the validation because it is how the kernel 
tcp-cc module is done.

For newer subsystem supporting struct_ops, it should expect the validation is 
done in the .validate alone.
  
David Vernet Aug. 11, 2023, 11:36 p.m. UTC | #16
On Fri, Aug 11, 2023 at 03:49:34PM -0700, Martin KaFai Lau wrote:
> On 8/11/23 1:19 PM, David Vernet wrote:
> > On Fri, Aug 11, 2023 at 10:35:03AM -0700, Martin KaFai Lau wrote:
> > > On 8/10/23 4:15 PM, Stanislav Fomichev wrote:
> > > > On 08/10, David Vernet wrote:
> > > > > On Thu, Aug 10, 2023 at 03:46:18PM -0700, Stanislav Fomichev wrote:
> > > > > > On 08/10, David Vernet wrote:
> > > > > > > Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
> > > > > > > define the .validate() and .update() callbacks in its corresponding
> > > > > > > struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
> > > > > > > in its own right to ensure that the map is unloaded if an application
> > > > > > > crashes. For example, with sched_ext, we want to automatically unload
> > > > > > > the host-wide scheduler if the application crashes. We would likely
> > > > > > > never support updating elements of a sched_ext struct_ops map, so we'd
> > > > > > > have to implement these callbacks showing that they _can't_ support
> > > > > > > element updates just to benefit from the basic lifetime management of
> > > > > > > struct_ops links.
> > > > > > >
> > > > > > > Let's enable struct_ops maps to work with BPF_F_LINK even if they
> > > > > > > haven't defined these callbacks, by assuming that a struct_ops map
> > > > > > > element cannot be updated by default.
> > > > > >
> > > > > > Any reason this is not part of sched_ext series? As you mention,
> > > > > > we don't seem to have such users in the three?
> > > > >
> > > > > Hi Stanislav,
> > > > >
> > > > > The sched_ext series [0] implements these callbacks. See
> > > > > bpf_scx_update() and bpf_scx_validate().
> > > > >
> > > > > [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
> > > > >
> > > > > We could add this into that series and remove those callbacks, but this
> > > > > patch is fixing a UX / API issue with struct_ops links that's not really
> > > > > relevant to sched_ext. I don't think there's any reason to couple
> > > > > updating struct_ops map elements with allowing the kernel to manage the
> > > > > lifetime of struct_ops maps -- just because we only have 1 (non-test)
> > >
> > > Agree the link-update does not necessarily couple with link-creation, so
> > > removing 'link' update function enforcement is ok. The intention was to
> > > avoid the struct_ops link inconsistent experience (one struct_ops link
> > > support update and another struct_ops link does not) because consistency was
> > > one of the reason for the true kernel backed link support that Kui-Feng did.
> > > tcp-cc is the only one for now in struct_ops and it can support update, so
> > > the enforcement is here. I can see Stan's point that removing it now looks
> > > immature before a struct_ops landed in the kernel showing it does not make
> > > sense or very hard to support 'link' update. However, the scx patch set has
> > > shown this point, so I think it is good enough.
> >
> > Sorry for sending v2 of the patch a bit prematurely. Should have let you
> > weigh in first.
> >
> > > For 'validate', it is not related a 'link' update. It is for the struct_ops
> > > 'map' update. If the loaded struct_ops map is invalid, it will end up having
> > > a useless struct_ops map and no link can be created from it. I can see some
> >
> > To be honest I'm actually not sure I understand why .validate() is only
> > called for when BPF_F_LINK is specified. Is it because it could break
>
> Regardless '.validate' must be enforced or not, the ->validate() should be
> called for the non BPF_F_LINK case also during map update. This should be
> fixed.

Thanks for clarifying (and also to Kui-Feng), this makes more sense to
me.

> > existing programs if they defined a struct_ops map that wasn't valid
> > _without_ using BPF_F_LINK? Whether or not a map is valid should inform
> > whether we can load it regardless of whether there's a link, no? It
> > seems like .init_member() was already doing this as well. That's why I
> > got confused and conflated the two.
>
> I think the best is to look at bpf_struct_ops_map_update_elem() and the
> differences between BPF_F_LINK and the older non BPF_F_LINK behavior.
>
> Before the BPF_F_LINK was introduced, the map update and ->reg() happened
> together, so the kernel can reject at the map update time through ->reg()
> because '->reg()' does the validation also. If the earlier map update
> failed, the user space can do a map update again.
>
> With the BPF_F_LINK, the map update and ->reg are two separated actions. The
> ->reg is done later in the link creation time (after the map is updated). If
> the BPF_F_LINK struct_ops is not validated as a whole (like ops1 and ops2
> must be defined) during map update, it will only be discovered during the
> link creation time in bpf_struct_ops_link_create() by ->reg(). It will be
> too late for the userspace to correct that mistake because the map cannot be
> updated again. Then it will end up having a struct_ops map loaded in the
> kernel that cannot do anything. I don't think it is the common case but at
> least the map should not be left in some unusable state when it did happen.

I see, thanks for explaining. This is why sched_ext doesn't really work
with the BPF_F_LINK version of map update. We can't guarantee that a map
can be updated if we can't succeed in ->reg(), because we can also race
with e.g. sysrq unloading the scheduler between ->validate() and
->reg(). In a sense, it feels like ->reg() in "updateable" struct_ops
implementations should be void, whereas in other struct_ops
implementations like scx() it has to be int *. If validate() is meant to
prevent the scenario you outlined, can you help me understand why we
still check the return value of ->reg() in bpf_struct_ops_link_create()?
Or at the very least it seems like we should WARN_ON()?

> It is why the validation part has been separated from the '.reg', so
> '.validate' was added and enforced.  and ->validate() is called during the
> map update.
>
> '.init_member' is for validating individual ops/member but not for
> validating struct_ops as a whole, like if ops_x is implemented, then ops_y
> must be implemented also.

Got it, this makes sense.

> > > struct_ops subsystem check all the 'ops' function for NULL before calling
> > > (like the FUSE RFC). I can also see some future struct_ops will prefer not
> > > to check NULL at all and prefer to assume a subset of the ops is always
> > > valid. Does having a 'validate' enforcement is blocking the scx patchset in
> > > some way? If not, I would like to keep this for now. Once it is removed,
> >
> > No, it's not blocking scx at all. scx, as with any other struct_ops
> > implementation, could and does just implement these callbacks. As
> > Kui-Feng said in [0], this is really just about enabling a sane default
> > to improve usability. If a struct_ops implementation actually should
> > have implemented some validation but neglected to, that would be a bug
> > in exactly the same manner as if it had implemented .validate(), but
> > neglected to check some corner case that makes the map invalid.
> >
> > [0]: https://lore.kernel.org/lkml/887699ea-f837-6ed7-50bd-48720cea581c@gmail.com/
> >
> > > there is no turning back.
> >
> > Hmm, why there would be no turning back from this? This isn't a UAPI
> > concern, is it? Whether or not a struct_ops implementation needs to
>
> hmm...at least, map update success in one kernel and then map update failure
> in a later kernel is a different behavior. yeah, the succeeded map is
> unusable anyway but still no need to create this inconsistency to begin with
> if it does not have to.

I was under the impression that we didn't provide any of those UAPI
guarantees for struct_ops. The sched_ext struct_ops API can change at
any time because it's a kernel <-> kernel API (though we don't expect
that to happen very often at all). IMO this would fall under that
umbrella as well.

> > implement .validate() or can just rely on the default behavior of "no
> > .validate() callback implies the map is valid" is 100% an implementation
> > detail that's hidden from the end user. This is meant to be a UX
> > improvement for a developr defining a struct bpf_struct_ops instance in
> > the main kernel, not someone defining an instance of that struct_ops
> > (e.g. struct tcp_congestion_ops) in a BPF prog.
>
> The UX here is about the subsystem doing the very first time struct_ops
> implementation in the kernel, so yes it is the internal details of the
> kernel and one time cost.
>
> Multiple struct_ops bpf prog can then be developed and the bpf developer is
> not affected no matter .validate is enforced or not.

I don't think this is or should be a guarantee that we provide to BPF
developers using struct_ops, though. In my opinion it's the exact same
thing as the kfunc API deprecation story. These are purely kernel <->
kernel APIs, so you get no hard guarantees.

> I think I weighted the end-user space experience more. Having map in
> unusable state is a bad userspace experience. Yes, the way it is enforcing
> it now looks bureaucratic. I think it took two emails to explain the
> internal details of the struct_ops update and the difference between doing
> validation in .validate vs in .reg. I am not sure if the subsystem
> implementer wants to know all this details or just go ahead to implement
> validation in '.validate' and put an empty one for the subsystem does not
> need to check anything.

Yeah, it's definitely subjective. I can certainly understand the
sentiment for requiring it to be implemented for the sake of forcing the
developer to "know what they're doing".

In general I do think the subsystem developer should probably know how
all of this works, and that we should document the struct bpf_struct_ops
API regardless of what we decide to do for this patch specifically.

My personal opinion is that the UX is more intuitive and representative
of the actual feature if we don't force a subsystem implementer to
implement those callbacks, but it is what it is. FWIW, I actually think
the decoupling of ->validate() and ->reg() is what's most confusing,
because in different contexts ->reg() is actually serving the same
purpose as ->validate(). That's beside the point though, I guess.

> I think enough words have exchanged on this subject. I am not going to
> insist. If it is still preferred to have this check removed, please add
> details description to the '.validate' of struct bpf_struct_ops in bpf.h and
> also the commit message to spell out the details for the future subsystem
> struct_ops kernel developer to follow:

Sure thing, happy to do this, and thank you for taking the time to
explain this stuff.

> If it needs to validate struct_ops as a while,
>
> 1. it must be implemented in .validate instead of .reg. Otherwise, it may
> end up having an unusable map.

Some clarity on this point (why we check ->reg() on the ->validate()
path) would help me write this comment more clearly.

> 2. if the validation is implemented in '.reg' only, the map update behavior
> will be different between BPF_F_LINK map and the non BPF_F_LINK map.

Ack, this is good to document regardless.

I'll send a v3 on Monday with these comments added both to the code, and
to the commit summary.

Thanks,
David
  
Martin KaFai Lau Aug. 14, 2023, 4:55 p.m. UTC | #17
On 8/11/23 4:36 PM, David Vernet wrote:
> I see, thanks for explaining. This is why sched_ext doesn't really work
> with the BPF_F_LINK version of map update. We can't guarantee that a map
> can be updated if we can't succeed in ->reg(), because we can also race
> with e.g. sysrq unloading the scheduler between ->validate() and
> ->reg(). In a sense, it feels like ->reg() in "updateable" struct_ops
> implementations should be void, whereas in other struct_ops
> implementations like scx() it has to be int *. If validate() is meant to
> prevent the scenario you outlined, can you help me understand why we
> still check the return value of ->reg() in bpf_struct_ops_link_create()?
> Or at the very least it seems like we should WARN_ON()?

->regs() can fail if another struct_ops under the same name has already been 
loaded to the subsystem. If another subsystem needs another return value to 
support .update, I believe it can be done if that is blocking scx to support 
"updateable" link.

>> If it needs to validate struct_ops as a while,

There was a typo: as a /whole/.

>>
>> 1. it must be implemented in .validate instead of .reg. Otherwise, it may
>> end up having an unusable map.
> 
> Some clarity on this point (why we check ->reg() on the ->validate()
> path) would help me write this comment more clearly.


hmm... where does it check ->reg() on the ->validate() now?

I was meaning the struct_ops supported subsystem should validate the struct_ops 
map in '.validate' instead of in the '.reg'.

or I misunderstood the question?

> 
>> 2. if the validation is implemented in '.reg' only, the map update behavior
>> will be different between BPF_F_LINK map and the non BPF_F_LINK map.
> 
> Ack, this is good to document regardless.
> 
> I'll send a v3 on Monday with these comments added both to the code, and
> to the commit summary.
> 
> Thanks,
> David
  
David Vernet Aug. 14, 2023, 5:45 p.m. UTC | #18
On Mon, Aug 14, 2023 at 09:55:37AM -0700, Martin KaFai Lau wrote:
> On 8/11/23 4:36 PM, David Vernet wrote:
> > I see, thanks for explaining. This is why sched_ext doesn't really work
> > with the BPF_F_LINK version of map update. We can't guarantee that a map
> > can be updated if we can't succeed in ->reg(), because we can also race
> > with e.g. sysrq unloading the scheduler between ->validate() and
> > ->reg(). In a sense, it feels like ->reg() in "updateable" struct_ops
> > implementations should be void, whereas in other struct_ops
> > implementations like scx() it has to be int *. If validate() is meant to
> > prevent the scenario you outlined, can you help me understand why we
> > still check the return value of ->reg() in bpf_struct_ops_link_create()?
> > Or at the very least it seems like we should WARN_ON()?
> 
> ->regs() can fail if another struct_ops under the same
> name has already been loaded to the subsystem. If another
> subsystem needs another return value to support .update, I
> believe it can be done if that is blocking scx to support
> "updateable" link.

Ok, so ->validate() is a static check that should either always succeed
or always fail, and ->reg() may fail due to runtime circumstances. So a
map that passes ->validate() could e.g. retry to create the link in a
loop or something. Or create a series of validated struct_ops maps and
then have a management layer that destroys and creates links for the map
you want to actually use. Thanks for explaining.

> > > If it needs to validate struct_ops as a while,
> 
> There was a typo: as a /whole/.
> 
> > > 
> > > 1. it must be implemented in .validate instead of .reg. Otherwise, it may
> > > end up having an unusable map.
> > 
> > Some clarity on this point (why we check ->reg() on the ->validate()
> > path) would help me write this comment more clearly.
> 
> 
> hmm... where does it check ->reg() on the ->validate() now?
> 
> I was meaning the struct_ops supported subsystem should
> validate the struct_ops map in '.validate' instead of in
> the '.reg'.
> 
> or I misunderstood the question?

I just meant that I wasn't understanding why we had to check the return
value of ->reg() in bpf_struct_ops_link_create(). Now that I understand
the semantics, I can document them.
  

Patch

diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index eaff04eefb31..3d2fb85186a9 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -509,9 +509,12 @@  static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 	}
 
 	if (st_map->map.map_flags & BPF_F_LINK) {
-		err = st_ops->validate(kdata);
-		if (err)
-			goto reset_unlock;
+		err = 0;
+		if (st_ops->validate) {
+			err = st_ops->validate(kdata);
+			if (err)
+				goto reset_unlock;
+		}
 		set_memory_rox((long)st_map->image, 1);
 		/* Let bpf_link handle registration & unregistration.
 		 *
@@ -663,9 +666,6 @@  static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	if (attr->value_size != vt->size)
 		return ERR_PTR(-EINVAL);
 
-	if (attr->map_flags & BPF_F_LINK && (!st_ops->validate || !st_ops->update))
-		return ERR_PTR(-EOPNOTSUPP);
-
 	t = st_ops->type;
 
 	st_map_size = sizeof(*st_map) +
@@ -838,6 +838,11 @@  static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
 		goto err_out;
 	}
 
+	if (!st_map->st_ops->update) {
+		err = -EOPNOTSUPP;
+		goto err_out;
+	}
+
 	err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
 	if (err)
 		goto err_out;