[v2] module: Don't wait for GOING modules

Message ID 20221205103557.18363-1-petr.pavlu@suse.com
State New
Headers
Series [v2] module: Don't wait for GOING modules |

Commit Message

Petr Pavlu Dec. 5, 2022, 10:35 a.m. UTC
  During a system boot, it can happen that the kernel receives a burst of
requests to insert the same module but loading it eventually fails
during its init call. For instance, udev can make a request to insert
a frequency module for each individual CPU when another frequency module
is already loaded which causes the init function of the new module to
return an error.

Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for
modules that have finished loading"), the kernel waits for modules in
MODULE_STATE_GOING state to finish unloading before making another
attempt to load the same module.

This creates unnecessary work in the described scenario and delays the
boot. In the worst case, it can prevent udev from loading drivers for
other devices and might cause timeouts of services waiting on them and
subsequently a failed boot.

This patch attempts a different solution for the problem 6e6de3dee51a
was trying to solve. Rather than waiting for the unloading to complete,
it returns a different error code (-EBUSY) for modules in the GOING
state. This should avoid the error situation that was described in
6e6de3dee51a (user space attempting to load a dependent module because
the -EEXIST error code would suggest to user space that the first module
had been loaded successfully), while avoiding the delay situation too.

Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
Co-developed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: stable@vger.kernel.org
---

Changes since v1 [1]:
- Don't attempt a new module initialization when a same-name module
  completely disappeared while waiting on it, which means it went
  through the GOING state implicitly already.

[1] https://lore.kernel.org/linux-modules/20221123131226.24359-1-petr.pavlu@suse.com/

 kernel/module/main.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)
  

Comments

Petr Mladek Dec. 5, 2022, 7:54 p.m. UTC | #1
On Mon 2022-12-05 11:35:57, Petr Pavlu wrote:
> During a system boot, it can happen that the kernel receives a burst of
> requests to insert the same module but loading it eventually fails
> during its init call. For instance, udev can make a request to insert
> a frequency module for each individual CPU when another frequency module
> is already loaded which causes the init function of the new module to
> return an error.
> 
> Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for
> modules that have finished loading"), the kernel waits for modules in
> MODULE_STATE_GOING state to finish unloading before making another
> attempt to load the same module.
> 
> This creates unnecessary work in the described scenario and delays the
> boot. In the worst case, it can prevent udev from loading drivers for
> other devices and might cause timeouts of services waiting on them and
> subsequently a failed boot.
> 
> This patch attempts a different solution for the problem 6e6de3dee51a
> was trying to solve. Rather than waiting for the unloading to complete,
> it returns a different error code (-EBUSY) for modules in the GOING
> state. This should avoid the error situation that was described in
> 6e6de3dee51a (user space attempting to load a dependent module because
> the -EEXIST error code would suggest to user space that the first module
> had been loaded successfully), while avoiding the delay situation too.
> 
> Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
> Co-developed-by: Martin Wilck <mwilck@suse.com>
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> Cc: stable@vger.kernel.org
> ---
> 
> Changes since v1 [1]:
> - Don't attempt a new module initialization when a same-name module
>   completely disappeared while waiting on it, which means it went
>   through the GOING state implicitly already.
> 
> [1] https://lore.kernel.org/linux-modules/20221123131226.24359-1-petr.pavlu@suse.com/
> 
>  kernel/module/main.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index d02d39c7174e..7a627345d4fd 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
>  	sched_annotate_sleep();
>  	mutex_lock(&module_mutex);
>  	mod = find_module_all(name, strlen(name), true);
> -	ret = !mod || mod->state == MODULE_STATE_LIVE;
> +	ret = !mod || mod->state == MODULE_STATE_LIVE
> +		|| mod->state == MODULE_STATE_GOING;

There is a actually one more race.

This function is supposed to wait until load of a particular module
finishes. But we might find some another module of the same name here.

Maybe, it is not that bad. If many modules of the same name are loaded
in parallel then hopefully most of them would wait for the first one
in add_unformed_module(). And they will never appear in the @modules
list.

Anyway, to be on the safe side. We might want to pass the pointer
to the @old module found in add_unformed_module() and make sure
that we find the same module here. Something like:

/*
 * @pending_mod: pointer to module that we are waiting for
 * @name: name of the module; the string must stay even when
 *	the pending module goes away completely
 */
static bool finished_loading(const struct module *pending_mod,
			    const char *name)
{
	struct module *mod;
	bool ret = true;

	/*
	 * The module_mutex should not be a heavily contended lock;
	 * if we get the occasional sleep here, we'll go an extra iteration
	 * in the wait_event_interruptible(), which is harmless.
	 */
	sched_annotate_sleep();
	mutex_lock(&module_mutex);

	mod = find_module_all(name, strlen(name), true);
	/* Check if the pending module is still being loaded */
	if (mod == pending_mod &&
	    (mod->state == MODULE_STATE_UNFORMED ||
	       mod->state == MODULE_STATE_COMMING))
	       ret = false;
	mutex_unlock(&module_mutex);

	return ret;
}

Another advantage is that this is using the very same logic
(if condition) as add_formed_module() and not the inverted one ;-)

Otherwise, the patch looks good. I wonder how it works for Prarit.

Best Regards,
Petr
  
Petr Pavlu Dec. 6, 2022, 4:57 p.m. UTC | #2
On 12/5/22 20:54, Petr Mladek wrote:
> On Mon 2022-12-05 11:35:57, Petr Pavlu wrote:
>> During a system boot, it can happen that the kernel receives a burst of
>> requests to insert the same module but loading it eventually fails
>> during its init call. For instance, udev can make a request to insert
>> a frequency module for each individual CPU when another frequency module
>> is already loaded which causes the init function of the new module to
>> return an error.
>>
>> Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for
>> modules that have finished loading"), the kernel waits for modules in
>> MODULE_STATE_GOING state to finish unloading before making another
>> attempt to load the same module.
>>
>> This creates unnecessary work in the described scenario and delays the
>> boot. In the worst case, it can prevent udev from loading drivers for
>> other devices and might cause timeouts of services waiting on them and
>> subsequently a failed boot.
>>
>> This patch attempts a different solution for the problem 6e6de3dee51a
>> was trying to solve. Rather than waiting for the unloading to complete,
>> it returns a different error code (-EBUSY) for modules in the GOING
>> state. This should avoid the error situation that was described in
>> 6e6de3dee51a (user space attempting to load a dependent module because
>> the -EEXIST error code would suggest to user space that the first module
>> had been loaded successfully), while avoiding the delay situation too.
>>
>> Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
>> Co-developed-by: Martin Wilck <mwilck@suse.com>
>> Signed-off-by: Martin Wilck <mwilck@suse.com>
>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>> Cc: stable@vger.kernel.org
>> ---
>>
>> Changes since v1 [1]:
>> - Don't attempt a new module initialization when a same-name module
>>   completely disappeared while waiting on it, which means it went
>>   through the GOING state implicitly already.
>>
>> [1] https://lore.kernel.org/linux-modules/20221123131226.24359-1-petr.pavlu@suse.com/
>>
>>  kernel/module/main.c | 26 +++++++++++++++++++++-----
>>  1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/module/main.c b/kernel/module/main.c
>> index d02d39c7174e..7a627345d4fd 100644
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
>>  	sched_annotate_sleep();
>>  	mutex_lock(&module_mutex);
>>  	mod = find_module_all(name, strlen(name), true);
>> -	ret = !mod || mod->state == MODULE_STATE_LIVE;
>> +	ret = !mod || mod->state == MODULE_STATE_LIVE
>> +		|| mod->state == MODULE_STATE_GOING;
> 
> There is a actually one more race.
> 
> This function is supposed to wait until load of a particular module
> finishes. But we might find some another module of the same name here.
> 
> Maybe, it is not that bad. If many modules of the same name are loaded
> in parallel then hopefully most of them would wait for the first one
> in add_unformed_module(). And they will never appear in the @modules
> list.

Good point, a load waiting in add_unformed_module() could miss that its older
parallel load already finished if another insert of the same module appears in
the modules list in the meantime. This requires that the new load happens to
arrive just after the old one finishes and before the waiting load makes its
check.

This is somewhat similar to the current state where new same-name insert
requests can skip and starve the ones already waiting in
add_unformed_module().

I think in practice the situation should occur very rarely and be cleared
soon, as long one doesn't continuously try to insert the same module.

> Anyway, to be on the safe side. We might want to pass the pointer
> to the @old module found in add_unformed_module() and make sure
> that we find the same module here. Something like:
> 
> /*
>  * @pending_mod: pointer to module that we are waiting for
>  * @name: name of the module; the string must stay even when
>  *	the pending module goes away completely
>  */
> static bool finished_loading(const struct module *pending_mod,
> 			    const char *name)
> {
> 	struct module *mod;
> 	bool ret = true;
> 
> 	/*
> 	 * The module_mutex should not be a heavily contended lock;
> 	 * if we get the occasional sleep here, we'll go an extra iteration
> 	 * in the wait_event_interruptible(), which is harmless.
> 	 */
> 	sched_annotate_sleep();
> 	mutex_lock(&module_mutex);
> 
> 	mod = find_module_all(name, strlen(name), true);
> 	/* Check if the pending module is still being loaded */
> 	if (mod == pending_mod &&
> 	    (mod->state == MODULE_STATE_UNFORMED ||
> 	       mod->state == MODULE_STATE_COMMING))
> 	       ret = false;
> 	mutex_unlock(&module_mutex);
> 
> 	return ret;
> }

The new pending_mod pointer has no ownership of the target module which can be
then destroyed at any time. While no dereference of pending_mod is made in
finished_loading(), this looks still problematic to me.

I think it is generally good to treat a pointer as invalid and its value as
indeterminate when the object it points to reaches the end of its lifetime.
One specific problem here is that nothing guarantees that a new module doesn't
get allocated at the exactly same address as the old one that got released and
the code is actually waiting on.

Improving this case then likely results in a more complex solution, similar
to the one that was discussed originally.

Thanks,
Petr
  
Petr Mladek Dec. 7, 2022, 3:15 p.m. UTC | #3
On Tue 2022-12-06 17:57:30, Petr Pavlu wrote:
> On 12/5/22 20:54, Petr Mladek wrote:
> > On Mon 2022-12-05 11:35:57, Petr Pavlu wrote:
> >> During a system boot, it can happen that the kernel receives a burst of
> >> requests to insert the same module but loading it eventually fails
> >> during its init call. For instance, udev can make a request to insert
> >> a frequency module for each individual CPU when another frequency module
> >> is already loaded which causes the init function of the new module to
> >> return an error.
> >>
> >> Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for
> >> modules that have finished loading"), the kernel waits for modules in
> >> MODULE_STATE_GOING state to finish unloading before making another
> >> attempt to load the same module.
> >>
> >> This creates unnecessary work in the described scenario and delays the
> >> boot. In the worst case, it can prevent udev from loading drivers for
> >> other devices and might cause timeouts of services waiting on them and
> >> subsequently a failed boot.
> >>
> >> This patch attempts a different solution for the problem 6e6de3dee51a
> >> was trying to solve. Rather than waiting for the unloading to complete,
> >> it returns a different error code (-EBUSY) for modules in the GOING
> >> state. This should avoid the error situation that was described in
> >> 6e6de3dee51a (user space attempting to load a dependent module because
> >> the -EEXIST error code would suggest to user space that the first module
> >> had been loaded successfully), while avoiding the delay situation too.
> >>
> >> --- a/kernel/module/main.c
> >> +++ b/kernel/module/main.c
> >> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
> >>  	sched_annotate_sleep();
> >>  	mutex_lock(&module_mutex);
> >>  	mod = find_module_all(name, strlen(name), true);
> >> -	ret = !mod || mod->state == MODULE_STATE_LIVE;
> >> +	ret = !mod || mod->state == MODULE_STATE_LIVE
> >> +		|| mod->state == MODULE_STATE_GOING;
> > 
> > There is a actually one more race.
> > 
> > This function is supposed to wait until load of a particular module
> > finishes. But we might find some another module of the same name here.
> > 
> > Maybe, it is not that bad. If many modules of the same name are loaded
> > in parallel then hopefully most of them would wait for the first one
> > in add_unformed_module(). And they will never appear in the @modules
> > list.
> 
> Good point, a load waiting in add_unformed_module() could miss that its older
> parallel load already finished if another insert of the same module appears in
> the modules list in the meantime. This requires that the new load happens to
> arrive just after the old one finishes and before the waiting load makes its
> check.
>
> This is somewhat similar to the current state where new same-name insert
> requests can skip and starve the ones already waiting in
> add_unformed_module().
> 
> I think in practice the situation should occur very rarely and be cleared
> soon, as long one doesn't continuously try to insert the same module.

I am not completely sure how rare it is. Anyway, the conditions are
rather complex so it feels rare ;-)

Anyway, the important part of this race is that the new module
found in finished_loading() did not see the previously loaded
module and did not wait in add_unformed_module(). It means
that the load was not parallel enough. It means that the
optimization added by this patch was not used.

By other words, if the multiple loads are not parallel enough
then the optimization added by this patch will not help.

And the other way. If this patch helps in practice then
this race is not important.

> > Anyway, to be on the safe side. We might want to pass the pointer
> > to the @old module found in add_unformed_module() and make sure
> > that we find the same module here. Something like:
> > 
> > /*
> >  * @pending_mod: pointer to module that we are waiting for
> >  * @name: name of the module; the string must stay even when
> >  *	the pending module goes away completely
> >  */
> > static bool finished_loading(const struct module *pending_mod,
> > 			    const char *name)
> > {
> > 	struct module *mod;
> > 	bool ret = true;
> > 
> > 	/*
> > 	 * The module_mutex should not be a heavily contended lock;
> > 	 * if we get the occasional sleep here, we'll go an extra iteration
> > 	 * in the wait_event_interruptible(), which is harmless.
> > 	 */
> > 	sched_annotate_sleep();
> > 	mutex_lock(&module_mutex);
> > 
> > 	mod = find_module_all(name, strlen(name), true);
> > 	/* Check if the pending module is still being loaded */
> > 	if (mod == pending_mod &&
> > 	    (mod->state == MODULE_STATE_UNFORMED ||
> > 	       mod->state == MODULE_STATE_COMMING))
> > 	       ret = false;
> > 	mutex_unlock(&module_mutex);
> > 
> > 	return ret;
> > }
> 
> The new pending_mod pointer has no ownership of the target module which can be
> then destroyed at any time. While no dereference of pending_mod is made in
> finished_loading(), this looks still problematic to me.
> 
> I think it is generally good to treat a pointer as invalid and its value as
> indeterminate when the object it points to reaches the end of its lifetime.
> One specific problem here is that nothing guarantees that a new module doesn't
> get allocated at the exactly same address as the old one that got released and
> the code is actually waiting on.
>
> Improving this case then likely results in a more complex solution, similar
> to the one that was discussed originally.

Fair enough. Let's ignore the race in finished_loading().

Otherwise, the patch looks good to me. It is easy enough.
IMHO, it makes sense to use it if it helps in practice
(multiple loads are parallel enough[*]). Feel free to use:

Reviewed-by: Petr Mladek <pmladek@suse.com>

Of course, the ideal solution would be to avoid the multiple
loads in the first place. AFAIK, everything starts in the kernel
that sends the same udev events for each CPU...

Best Regards,
Petr
  
Luis Chamberlain Dec. 8, 2022, 2:44 a.m. UTC | #4
On Wed, Dec 07, 2022 at 04:15:21PM +0100, Petr Mladek wrote:
> Reviewed-by: Petr Mladek <pmladek@suse.com>

Queued onto modules-next.

> Of course, the ideal solution would be to avoid the multiple
> loads in the first place. AFAIK, everything starts in the kernel
> that sends the same udev events for each CPU...

Fixes go first, *then we can address enhancements. I have some old
fixes I can send after htis is merged. I believe folks have others.

  Luis
  
Petr Pavlu Dec. 12, 2022, 11:29 a.m. UTC | #5
On 12/8/22 03:44, Luis Chamberlain wrote:
> On Wed, Dec 07, 2022 at 04:15:21PM +0100, Petr Mladek wrote:
>> Reviewed-by: Petr Mladek <pmladek@suse.com>
> 
> Queued onto modules-next.

Thanks both.

>> Of course, the ideal solution would be to avoid the multiple
>> loads in the first place. AFAIK, everything starts in the kernel
>> that sends the same udev events for each CPU...
> 
> Fixes go first, *then we can address enhancements. I have some old
> fixes I can send after htis is merged. I believe folks have others.

This simpler fix should be sufficient and so the original more complex
approach can be scraped. My plan is then to only progress the additional
developed patches: a test case for the issue, minor clean-up patches, and an
ACPI change to load acpi_cpufreq/pcc_cpufreq only once.

Note that none of this work now addresses the problem that David Hildenbrand
encountered with vmap allocation issues [1]. That needs to be fixed
separately.

[1] https://lore.kernel.org/linux-modules/20221013180518.217405-1-david@redhat.com/

Cheers,
Petr
  
Luis Chamberlain Dec. 13, 2022, 2:58 a.m. UTC | #6
On Mon, Dec 12, 2022 at 12:29:02PM +0100, Petr Pavlu wrote:
> On 12/8/22 03:44, Luis Chamberlain wrote:
> > On Wed, Dec 07, 2022 at 04:15:21PM +0100, Petr Mladek wrote:
> >> Reviewed-by: Petr Mladek <pmladek@suse.com>
> > 
> > Queued onto modules-next.
> 
> Thanks both.

I'm **terribly sorry** for dropping this to Linus' pull request late
but I kept trying to justify a few things and I couldn't do it, and
so I'll reply to this patch with my last minute observations soon.

I figured it was better to be safe than sorry and the potential for
yet *more* regressions was *real* and I really wanted to avoid us
having to crunch / race for a fix over the holidays.

I will leave the fix as-is now in modules-next though, so it can get
more testing, but indeed, I don't feel quite ready for the masses
specially before the holidays.

  Luis
  
Luis Chamberlain Dec. 13, 2022, 5:09 a.m. UTC | #7
On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote:
> During a system boot, it can happen that the kernel receives a burst of
> requests to insert the same module but loading it eventually fails
> during its init call. For instance, udev can make a request to insert
> a frequency module for each individual CPU when another frequency module
> is already loaded which causes the init function of the new module to
> return an error.
> 
> Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for
> modules that have finished loading"), the kernel waits for modules in
> MODULE_STATE_GOING state to finish unloading before making another
> attempt to load the same module.
> 
> This creates unnecessary work in the described scenario and delays the
> boot. In the worst case, it can prevent udev from loading drivers for
> other devices and might cause timeouts of services waiting on them and
> subsequently a failed boot.
> 
> This patch attempts a different solution for the problem 6e6de3dee51a
> was trying to solve. Rather than waiting for the unloading to complete,
> it returns a different error code (-EBUSY) for modules in the GOING
> state. This should avoid the error situation that was described in
> 6e6de3dee51a (user space attempting to load a dependent module because
> the -EEXIST error code would suggest to user space that the first module
> had been loaded successfully), while avoiding the delay situation too.
> 
> Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
> Co-developed-by: Martin Wilck <mwilck@suse.com>
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> Cc: stable@vger.kernel.org
> ---
> 
> Changes since v1 [1]:
> - Don't attempt a new module initialization when a same-name module
>   completely disappeared while waiting on it, which means it went
>   through the GOING state implicitly already.
> 
> [1] https://lore.kernel.org/linux-modules/20221123131226.24359-1-petr.pavlu@suse.com/
> 
>  kernel/module/main.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)

So sorry for the late review but these ideas only came to me as I
drafted my pull request to Linus.

Let's recap the issue from the start as this is why I ended having
to pause for a second and couldn't believe myself as I wrote that
we had merged this.

The old commit which broke things is old, that has been in the kerne
since since v5.3-rc1, May 2019, over 2 years ago!

The original issue was seeing "Unknown symbols" on the kernel logs for
Linux guests on Microsoft HyperV as it disables the X86_FEATURE_SMCA bit on
AMD systems. That's nothing to scream about as the error message in this
case was bogus and could be ignored. The *fix* to this not-big-deal-issue,
however, it was applied on v5.3-rc1 (May 2019), and in the end though
created a modern real tux-pain-screaming-worthy regression which in some
cases creates failed boots on systems with many CPUs.

The issue is that the fix triggered multiple same module requests to
eventually return a failure even though the same module may be creeping
up, and boot is also delayed by hammering on this check over and over again,
instead of informing userspace that the kernel is busy. So this new fix
limits these checks, it bounds it, and we pretty much return -EBUSY if
we see the same module for which we sent a request for twice in a row.

The issue is fatal for large systems given that for each CPU udev stupidly
requests the same stupid frequency module, and eventually it runs into
the unformed case the fix boobie trapped us into and fails. The series
of failures will eventually trigger a full boot failure.

> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index d02d39c7174e..7a627345d4fd 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
>  	sched_annotate_sleep();
>  	mutex_lock(&module_mutex);
>  	mod = find_module_all(name, strlen(name), true);
> -	ret = !mod || mod->state == MODULE_STATE_LIVE;
> +	ret = !mod || mod->state == MODULE_STATE_LIVE
> +		|| mod->state == MODULE_STATE_GOING;
>  	mutex_unlock(&module_mutex);
>  
>  	return ret;
> @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod)
>  
>  	mod->state = MODULE_STATE_UNFORMED;
>  
> -again:

So this is part of my biggest concern for regression, the removal of
this tag and its use.

Before this we always looped back to trying again and again.

>  	mutex_lock(&module_mutex);
>  	old = find_module_all(mod->name, strlen(mod->name), true);
>  	if (old != NULL) {
> -		if (old->state != MODULE_STATE_LIVE) {
> +		if (old->state == MODULE_STATE_COMING
> +		    || old->state == MODULE_STATE_UNFORMED) {
>  			/* Wait in case it fails to load. */
>  			mutex_unlock(&module_mutex);
>  			err = wait_event_interruptible(module_wq,
>  					       finished_loading(mod->name));
>  			if (err)
>  				goto out_unlocked;
> -			goto again;

We essentially bound this now, and before we didn't.

Yes we we wait for finished_loading() of the module -- but if udev is
hammering tons of same requests, well, we *will* surely hit this, as
many requests managed to get in before userspace saw the module present.

While this typically can be useful, it means *quite a bit* of conditions which
definitely *did* happen before will now *bail out* fast, to the extent
that I'm not even sure why we just re-try once now. If we're going to
just re-check *once* why not do something graceful like *at least*
cond_resched() to let the system breathe for a *tiny bit*. Because
otherwise we it would seem we can end up in similar situations as before
except we now deal with that condition differently. And let's see what
that looks like now.

> +
> +			/* The module might have gone in the meantime. */
> +			mutex_lock(&module_mutex);
> +			old = find_module_all(mod->name, strlen(mod->name),
> +					      true);
>  		}
> -		err = -EEXIST;
> +
> +		/*
> +		 * We are here only when the same module was being loaded. Do
> +		 * not try to load it again right now. It prevents long delays
> +		 * caused by serialized module load failures. It might happen
> +		 * when more devices of the same type trigger load of
> +		 * a particular module.
> +		 */
> +		if (old && old->state == MODULE_STATE_LIVE)
> +			err = -EEXIST;
> +		else
> +			err = -EBUSY;

And for all those use cases we end up here now, with -EBUSY. So udev
before was not bounded, and kept busy-looping on the retry in-kernel,
and we now immediately bound its condition to just 2 tries to see if the
old module existed and now *return* a new value to userspace.

My main concerns are:

0) Why not use cond_resched() if we're just going to check twice?

1) How are we sure we are not regressing userspace by removing the boundless
loop there? (even if the endless loop was stupid)

2) How is it we expect that we won't resgress userspace now by bounding
that check and pretty much returning -EBUSY right away? This last part
seems dangerous, in that if userspace did not expect -EBUSY and if an
error before caused a module to fail and fail boot, why wouldn't we fail
boot now by bailing out faster??

3) *Fixing* a kernel regression by adding new expected API for testing
against -EBUSY seems not ideal.

  Luis
  
Petr Mladek Dec. 13, 2022, 10:17 a.m. UTC | #8
On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
> On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote:
> > During a system boot, it can happen that the kernel receives a burst of
> > requests to insert the same module but loading it eventually fails
> > during its init call. For instance, udev can make a request to insert
> > a frequency module for each individual CPU when another frequency module
> > is already loaded which causes the init function of the new module to
> > return an error.
> > 
> > This patch attempts a different solution for the problem 6e6de3dee51a
> > was trying to solve. Rather than waiting for the unloading to complete,
> > it returns a different error code (-EBUSY) for modules in the GOING
> > state. This should avoid the error situation that was described in
> > 6e6de3dee51a (user space attempting to load a dependent module because
> > the -EEXIST error code would suggest to user space that the first module
> > had been loaded successfully), while avoiding the delay situation too.
> > 
> 
> So sorry for the late review but these ideas only came to me as I
> drafted my pull request to Linus.
> 
> Let's recap the issue from the start as this is why I ended having
> to pause for a second and couldn't believe myself as I wrote that
> we had merged this.

IMHO, it is perfectly fine to delay this. The problem is not trivial.
It always made my head spin. There is a risk of regression. The change
spent in linux next only one week. And this particular merge window is
not good for risking because of the holiday season.


> > diff --git a/kernel/module/main.c b/kernel/module/main.c
> > index d02d39c7174e..7a627345d4fd 100644
> > --- a/kernel/module/main.c
> > +++ b/kernel/module/main.c
> > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
> >  	sched_annotate_sleep();
> >  	mutex_lock(&module_mutex);
> >  	mod = find_module_all(name, strlen(name), true);
> > -	ret = !mod || mod->state == MODULE_STATE_LIVE;
> > +	ret = !mod || mod->state == MODULE_STATE_LIVE
> > +		|| mod->state == MODULE_STATE_GOING;
> >  	mutex_unlock(&module_mutex);
> >  
> >  	return ret;
> > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod)
> >  
> >  	mod->state = MODULE_STATE_UNFORMED;
> >  
> > -again:
> 
> So this is part of my biggest concern for regression, the removal of
> this tag and its use.
> 
> Before this we always looped back to trying again and again.

Just to be sure that we are on the same page.

The loop was _not_ infinite. It serialized all attempts to load
the same module. In our case, it serialized all failures and
prolonged the pain.


> >  	mutex_lock(&module_mutex);
> >  	old = find_module_all(mod->name, strlen(mod->name), true);
> >  	if (old != NULL) {
> > -		if (old->state != MODULE_STATE_LIVE) {
> > +		if (old->state == MODULE_STATE_COMING
> > +		    || old->state == MODULE_STATE_UNFORMED) {
> >  			/* Wait in case it fails to load. */
> >  			mutex_unlock(&module_mutex);
> >  			err = wait_event_interruptible(module_wq,
> >  					       finished_loading(mod->name));
> >  			if (err)
> >  				goto out_unlocked;
> > -			goto again;
> 
> We essentially bound this now, and before we didn't.
> 
> Yes we we wait for finished_loading() of the module -- but if udev is
> hammering tons of same requests, well, we *will* surely hit this, as
> many requests managed to get in before userspace saw the module present.
> 
> While this typically can be useful, it means *quite a bit* of conditions which
> definitely *did* happen before will now *bail out* fast, to the extent
> that I'm not even sure why we just re-try once now.

I do not understand this. We do _not_ re-try the load in the new
version. We just wait for the result of the parallel attempt to
load the module.

Maybe, you are confused that we repeat find_module_all(). But it is
the way how to find the result of the parallel load.


> If we're going to 
> just re-check *once* why not do something graceful like *at least*
> cond_resched() to let the system breathe for a *tiny bit*.

We must check the result under module_mutex. We have to take this
sleeping lock. There is actually a rescheduling. I do not think that
cond_resched() would do any difference.


> > +
> > +			/* The module might have gone in the meantime. */
> > +			mutex_lock(&module_mutex);
> > +			old = find_module_all(mod->name, strlen(mod->name),
> > +					      true);
> >  		}
> > -		err = -EEXIST;
> > +
> > +		/*
> > +		 * We are here only when the same module was being loaded. Do
> > +		 * not try to load it again right now. It prevents long delays
> > +		 * caused by serialized module load failures. It might happen
> > +		 * when more devices of the same type trigger load of
> > +		 * a particular module.
> > +		 */
> > +		if (old && old->state == MODULE_STATE_LIVE)
> > +			err = -EEXIST;
> > +		else
> > +			err = -EBUSY;
> 
> And for all those use cases we end up here now, with -EBUSY. So udev
> before was not bounded, and kept busy-looping on the retry in-kernel,
> and we now immediately bound its condition to just 2 tries to see if the
> old module existed and now *return* a new value to userspace.
> 
> My main concerns are:
> 
> 0) Why not use cond_resched() if we're just going to check twice?

We take module_mutex. It should cause even bigger delay than cond_resched().


> 1) How are we sure we are not regressing userspace by removing the boundless
> loop there? (even if the endless loop was stupid)

We could not be sure. On the other hand, if more attempts help to load
the module then it is racy and not reliable. The new approach would
make it better reproducible and fix the race.


> 2) How is it we expect that we won't resgress userspace now by bounding
> that check and pretty much returning -EBUSY right away? This last part
> seems dangerous, in that if userspace did not expect -EBUSY and if an
> error before caused a module to fail and fail boot, why wouldn't we fail
> boot now by bailing out faster??

Same answer as for 1)


> 3) *Fixing* a kernel regression by adding new expected API for testing
> against -EBUSY seems not ideal.

IMHO, the right solution is to fix the subsystems so that they send
only one uevent.

The question is how the module loader would deal with "broken"
subsystems. Petr Pavlu, please, fixme. I think that there are
more subsystems doing this ugly thing.

I personally thing that returning -EBUSY is better than serializing
all the loads. It makes eventual problem easier to reproduce and fix.

Best Regards,
Petr
  
Petr Pavlu Dec. 13, 2022, 1:36 p.m. UTC | #9
On 12/13/22 11:17, Petr Mladek wrote:
> On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
>> 3) *Fixing* a kernel regression by adding new expected API for testing
>> against -EBUSY seems not ideal.
> 
> IMHO, the right solution is to fix the subsystems so that they send
> only one uevent.
> 
> The question is how the module loader would deal with "broken"
> subsystems. Petr Pavlu, please, fixme. I think that there are
> more subsystems doing this ugly thing.

The issue has been seen with cpufreq and edac modules. It is a combination of
them being loaded per CPU and use of a cooperative pattern to allow only one
module of each such type on the system.

Fixing the module loader addresses the immediate regression, but should be
useful in general to cope better if some module which is tried to be loaded
multiple times per N devices is failing its initialization.

I'm not sure if these subsystems can be called "broken". However, I agree it
makes sense to have a look at some of the mentioned drivers separately if they
can be improved to try to load them only once on each system, because they can
be viewed more as whole-platform drivers than per-CPU ones.

Thanks,
Petr
  
Luis Chamberlain Jan. 18, 2023, 12:04 a.m. UTC | #10
On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote:
> On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
> > On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote:
> > > diff --git a/kernel/module/main.c b/kernel/module/main.c
> > > index d02d39c7174e..7a627345d4fd 100644
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
> > >  	sched_annotate_sleep();
> > >  	mutex_lock(&module_mutex);
> > >  	mod = find_module_all(name, strlen(name), true);
> > > -	ret = !mod || mod->state == MODULE_STATE_LIVE;
> > > +	ret = !mod || mod->state == MODULE_STATE_LIVE
> > > +		|| mod->state == MODULE_STATE_GOING;
> > >  	mutex_unlock(&module_mutex);
> > >  
> > >  	return ret;
> > > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod)
> > >  
> > >  	mod->state = MODULE_STATE_UNFORMED;
> > >  
> > > -again:
> > 
> > So this is part of my biggest concern for regression, the removal of
> > this tag and its use.
> > 
> > Before this we always looped back to trying again and again.
> 
> Just to be sure that we are on the same page.
> 
> The loop was _not_ infinite. It serialized all attempts to load
> the same module. In our case, it serialized all failures and
> prolonged the pain.

That's fair yes. The loop happens so long as an already existing module is
present with the same name.

> > >  	mutex_lock(&module_mutex);
> > >  	old = find_module_all(mod->name, strlen(mod->name), true);
> > >  	if (old != NULL) {
> > > -		if (old->state != MODULE_STATE_LIVE) {
> > > +		if (old->state == MODULE_STATE_COMING
> > > +		    || old->state == MODULE_STATE_UNFORMED) {
> > >  			/* Wait in case it fails to load. */
> > >  			mutex_unlock(&module_mutex);
> > >  			err = wait_event_interruptible(module_wq,
> > >  					       finished_loading(mod->name));
> > >  			if (err)
> > >  				goto out_unlocked;
> > > -			goto again;
> > 
> > We essentially bound this now, and before we didn't.
> > 
> > Yes we we wait for finished_loading() of the module -- but if udev is
> > hammering tons of same requests, well, we *will* surely hit this, as
> > many requests managed to get in before userspace saw the module present.
> > 
> > While this typically can be useful, it means *quite a bit* of conditions which
> > definitely *did* happen before will now *bail out* fast, to the extent
> > that I'm not even sure why we just re-try once now.
> 
> I do not understand this. We do _not_ re-try the load in the new
> version. We just wait for the result of the parallel attempt to
> load the module.
> 
> Maybe, you are confused that we repeat find_module_all(). But it is
> the way how to find the result of the parallel load.

My point is that prior to the buggy commit 6e6de3dee51a ("kernel/module.c: Only
return -EEXIST for modules that have finished loading") and even after that
commit it we 'goto again' if an old request is found. We now simply bound this
right away. Yes, the loop was not infinite, but in theory at least a few
iterations were possible before whereas now immediately return -EBUSY
and I don't think all use cases may be ready yet.

> > If we're going to 
> > just re-check *once* why not do something graceful like *at least*
> > cond_resched() to let the system breathe for a *tiny bit*.
> 
> We must check the result under module_mutex. We have to take this
> sleeping lock. There is actually a rescheduling. I do not think that
> cond_resched() would do any difference.

Makes sense.

> > > +
> > > +			/* The module might have gone in the meantime. */
> > > +			mutex_lock(&module_mutex);
> > > +			old = find_module_all(mod->name, strlen(mod->name),
> > > +					      true);
> > >  		}
> > > -		err = -EEXIST;
> > > +
> > > +		/*
> > > +		 * We are here only when the same module was being loaded. Do
> > > +		 * not try to load it again right now. It prevents long delays
> > > +		 * caused by serialized module load failures. It might happen
> > > +		 * when more devices of the same type trigger load of
> > > +		 * a particular module.
> > > +		 */
> > > +		if (old && old->state == MODULE_STATE_LIVE)
> > > +			err = -EEXIST;
> > > +		else
> > > +			err = -EBUSY;
> > 
> > And for all those use cases we end up here now, with -EBUSY. So udev
> > before was not bounded, and kept busy-looping on the retry in-kernel,
> > and we now immediately bound its condition to just 2 tries to see if the
> > old module existed and now *return* a new value to userspace.
> > 
> > My main concerns are:
> > 
> > 0) Why not use cond_resched() if we're just going to check twice?
> 
> We take module_mutex. It should cause even bigger delay than cond_resched().

ACK.

> > 1) How are we sure we are not regressing userspace by removing the boundless
> > loop there? (even if the endless loop was stupid)
> 
> We could not be sure. On the other hand, if more attempts help to load
> the module then it is racy and not reliable. The new approach would
> make it better reproducible and fix the race.

Yes, but the short cut it is a userspace visible change.

> > 2) How is it we expect that we won't resgress userspace now by bounding
> > that check and pretty much returning -EBUSY right away? This last part
> > seems dangerous, in that if userspace did not expect -EBUSY and if an
> > error before caused a module to fail and fail boot, why wouldn't we fail
> > boot now by bailing out faster??
> 
> Same answer as for 1)
> 
> 
> > 3) *Fixing* a kernel regression by adding new expected API for testing
> > against -EBUSY seems not ideal.
> 
> IMHO, the right solution is to fix the subsystems so that they send
> only one uevent.

Makes sense, but that can take time and some folks are stuck on old kernels
and perhaps porting fixes for this on subsystems may take time to land
to some enterprisy kernels. And then there is also systemd that issues
the requests too, at least that was reflected in commit 6e6de3dee51a
("kernel/module.c: Only return -EEXIST for modules that have finished loading")
that commit claims it was systemd issueing the requests which I mean to
interpret finit_module(), not calling modprobe.

The rationale for making a regression fix with a new userspace return value
is fair given the old fix made things even much worse the point some kernel
boots would fail. So the rationale to suggest we *must* short-cut
parallel loads as effectively as possible seems sensible *iff* that
could not make things worse too but sadly I've found an isssue
proactively with this fix, or at least that this issue is also not fixed:

./tools/testing/selftests/kmod/kmod.sh -t 0006
Tue Jan 17 23:18:13 UTC 2023
Running test: kmod_test_0006 - run #0
kmod_test_0006: OK! - loading kmod test
kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22)
----------------------------------------------------
Custom trigger configuration for: test_kmod0
Number of threads:      50
Test_case:      TEST_KMOD_FS_TYPE (2)
driver: test_module
fs:     xfs
----------------------------------------------------
Test completed

When can multiple get_fs_type() calls be issued on a system? When
mounting a large number of filesystems. Sadly though this issue seems
to have gone unnoticed for a while now. Even reverting commit
6e6de3dee51a doesn't fix it, and I've run into issues with trying
to bisect, first due to missing Kees' patch which fixes a compiler
failure on older kernel [0] and now I'm seeing this while trying to
build v5.1:

ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
ld: warning: creating DT_TEXTREL in a PIE
make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1
make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2
make: *** [arch/x86/Makefile:283: bzImage] Error 2

[0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org

But we should try to bisect to see what cauased the above kmod test 0006
to start failing.

> The question is how the module loader would deal with "broken"
> subsystems. Petr Pavlu, please, fixme. I think that there are
> more subsystems doing this ugly thing.
> 
> I personally thing that returning -EBUSY is better than serializing
> all the loads. It makes eventual problem easier to reproduce and fix.

I agree with this assessment, however given the multiple get_fs_type()
calls as an example, I am not sure if there are other areas which rely on the
old busy-wait mechanism.

*If* we knew this issue was not so common I'd go so far as to say we
should pr_warn_once() on failure, but at this point in time I think it'd
be pretty chatty.

I don't yet have confidence that the new fast track to -EXIST or -EBUSY may
not create regressions, so the below I think would be too chatty. If it
wasn't true, I'd say we should keep record of these uses so we fix the
callers.

diff --git a/kernel/module/main.c b/kernel/module/main.c
index d3be89de706d..d1ad0b510cb8 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2589,13 +2589,6 @@ static int add_unformed_module(struct module *mod)
 					      true);
 		}
 
-		/*
-		 * We are here only when the same module was being loaded. Do
-		 * not try to load it again right now. It prevents long delays
-		 * caused by serialized module load failures. It might happen
-		 * when more devices of the same type trigger load of
-		 * a particular module.
-		 */
 		if (old && old->state == MODULE_STATE_LIVE)
 			err = -EEXIST;
 		else
@@ -2610,6 +2603,15 @@ static int add_unformed_module(struct module *mod)
 out:
 	mutex_unlock(&module_mutex);
 out_unlocked:
+	/*
+	 * We get an error here only when there is an attempt to load the
+	 * same module. Subsystems should strive to only issue one request
+	 * for a needed module. Multiple requests might happen when more devices
+	 * of the same type trigger load of a particular module.
+	 */
+	if (err)
+		pr_warn_once("%: dropping duplicate module request, err: %d\n",
+			     mod->name, err);
 	return err;
 }
 

  Luis
  
Luis Chamberlain Jan. 18, 2023, 12:18 a.m. UTC | #11
On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote:
> ./tools/testing/selftests/kmod/kmod.sh -t 0006
> Tue Jan 17 23:18:13 UTC 2023
> Running test: kmod_test_0006 - run #0
> kmod_test_0006: OK! - loading kmod test
> kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22)
> ----------------------------------------------------
> Custom trigger configuration for: test_kmod0
> Number of threads:      50
> Test_case:      TEST_KMOD_FS_TYPE (2)
> driver: test_module
> fs:     xfs
> ----------------------------------------------------
> Test completed
> 
> When can multiple get_fs_type() calls be issued on a system? When
> mounting a large number of filesystems. Sadly though this issue seems
> to have gone unnoticed for a while now. Even reverting commit
> 6e6de3dee51a doesn't fix it, and I've run into issues with trying
> to bisect, first due to missing Kees' patch which fixes a compiler
> failure on older kernel [0] and now I'm seeing this while trying to
> build v5.1:
> 
> ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
> arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
> ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack
> ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
> ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
> ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
> ld: warning: creating DT_TEXTREL in a PIE
> make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1
> make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2
> make: *** [arch/x86/Makefile:283: bzImage] Error 2
> 
> [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org
> 
> But we should try to bisect to see what cauased the above kmod test 0006
> to start failing.

BTW if someone beats me to bisecting the above it would be appreciated.
I have a feeling having old gcc / linker, etc would be easier.

  Luis
  
Petr Pavlu Jan. 18, 2023, 3:12 p.m. UTC | #12
On 1/18/23 01:04, Luis Chamberlain wrote:
> On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote:
>> On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
>>> 3) *Fixing* a kernel regression by adding new expected API for testing
>>> against -EBUSY seems not ideal.
>>
>> IMHO, the right solution is to fix the subsystems so that they send
>> only one uevent.
> 
> Makes sense, but that can take time and some folks are stuck on old kernels
> and perhaps porting fixes for this on subsystems may take time to land
> to some enterprisy kernels. And then there is also systemd that issues
> the requests too, at least that was reflected in commit 6e6de3dee51a
> ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
> that commit claims it was systemd issueing the requests which I mean to
> interpret finit_module(), not calling modprobe.
> 
> The rationale for making a regression fix with a new userspace return value
> is fair given the old fix made things even much worse the point some kernel
> boots would fail. So the rationale to suggest we *must* short-cut
> parallel loads as effectively as possible seems sensible *iff* that
> could not make things worse too but sadly I've found an isssue
> proactively with this fix, or at least that this issue is also not fixed:
> 
> ./tools/testing/selftests/kmod/kmod.sh -t 0006
> Tue Jan 17 23:18:13 UTC 2023
> Running test: kmod_test_0006 - run #0
> kmod_test_0006: OK! - loading kmod test
> kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22)
> ----------------------------------------------------
> Custom trigger configuration for: test_kmod0
> Number of threads:      50
> Test_case:      TEST_KMOD_FS_TYPE (2)
> driver: test_module
> fs:     xfs
> ----------------------------------------------------
> Test completed
> 
> When can multiple get_fs_type() calls be issued on a system? When
> mounting a large number of filesystems. Sadly though this issue seems
> to have gone unnoticed for a while now. Even reverting commit
> 6e6de3dee51a doesn't fix it, and I've run into issues with trying
> to bisect, first due to missing Kees' patch which fixes a compiler
> failure on older kernel [0] and now I'm seeing this while trying to
> build v5.1:
> 
> ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
> arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
> ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack
> ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
> ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
> ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
> ld: warning: creating DT_TEXTREL in a PIE
> make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1
> make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2
> make: *** [arch/x86/Makefile:283: bzImage] Error 2
> 
> [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org
> 
> But we should try to bisect to see what cauased the above kmod test 0006
> to start failing.

It is not clear to me from your description if the observed failure of
kmod_test_0006 is related to the fix in this thread.

The problem was not possible for me to reproduce on my system. My test was on
an 8-CPU x86_64 machine using v6.2-rc4 with "defconfig + kvm_guest.config +
tools/testing/selftests/kmod/config".

Could you perhaps trace the test to determine where the EINVAL value comes
from?

>> The question is how the module loader would deal with "broken"
>> subsystems. Petr Pavlu, please, fixme. I think that there are
>> more subsystems doing this ugly thing.
>>
>> I personally thing that returning -EBUSY is better than serializing
>> all the loads. It makes eventual problem easier to reproduce and fix.
> 
> I agree with this assessment, however given the multiple get_fs_type()
> calls as an example, I am not sure if there are other areas which rely on the
> old busy-wait mechanism.
> 
> *If* we knew this issue was not so common I'd go so far as to say we
> should pr_warn_once() on failure, but at this point in time I think it'd
> be pretty chatty.
> 
> I don't yet have confidence that the new fast track to -EXIST or -EBUSY may
> not create regressions, so the below I think would be too chatty. If it
> wasn't true, I'd say we should keep record of these uses so we fix the
> callers.

A similar fast track logic was present prior to 6e6de3dee51a. The fix in this
thread doesn't bring a completely new behavior but rather restores the
previous one. The fix should have only two differences: a window when parallel
loads are detected is extended, the return code is -EBUSY instead of -EEXIST.

> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index d3be89de706d..d1ad0b510cb8 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -2589,13 +2589,6 @@ static int add_unformed_module(struct module *mod)
>  					      true);
>  		}
>  
> -		/*
> -		 * We are here only when the same module was being loaded. Do
> -		 * not try to load it again right now. It prevents long delays
> -		 * caused by serialized module load failures. It might happen
> -		 * when more devices of the same type trigger load of
> -		 * a particular module.
> -		 */
>  		if (old && old->state == MODULE_STATE_LIVE)
>  			err = -EEXIST;
>  		else
> @@ -2610,6 +2603,15 @@ static int add_unformed_module(struct module *mod)
>  out:
>  	mutex_unlock(&module_mutex);
>  out_unlocked:
> +	/*
> +	 * We get an error here only when there is an attempt to load the
> +	 * same module. Subsystems should strive to only issue one request
> +	 * for a needed module. Multiple requests might happen when more devices
> +	 * of the same type trigger load of a particular module.
> +	 */
> +	if (err)
> +		pr_warn_once("%: dropping duplicate module request, err: %d\n",
> +			     mod->name, err);
>  	return err;
>  }

I'm not sure if this would be the right thing to do.

What I think would be good to fix is the pattern utilized by some cpufreq (and
edac) modules. It is a combination of them being loaded per-CPU and using
a cooperative pattern to allow only one module of each such type on the
system. They can be viewed more as a whole-platform drivers that should be
attempted to be loaded only once. It is still on my todo list to post a patch
to cpufreq maintainers to start a discussion on this.

Another consideration on the kernel side could be to try grouping other
currently per-CPU loaded modules.

However, in general, a system can have many hardware pieces of the same type
and it looks correct to me that each is individually exposed in sysfs and via
uevents with their set of needed modules.

I would say that if this turns out to be a further issue in practice, udevd
could optimize its process and avoid same-module loads when it is bringing up
to life multiple devices.

Thanks,
Petr
  
Luis Chamberlain Jan. 18, 2023, 6:42 p.m. UTC | #13
On Wed, Jan 18, 2023 at 04:12:05PM +0100, Petr Pavlu wrote:
> On 1/18/23 01:04, Luis Chamberlain wrote:
> > The rationale for making a regression fix with a new userspace return value
> > is fair given the old fix made things even much worse the point some kernel
> > boots would fail. So the rationale to suggest we *must* short-cut
> > parallel loads as effectively as possible seems sensible *iff* that
> > could not make things worse too but sadly I've found an isssue
> > proactively with this fix, or at least that this issue is also not fixed:
> > 
> > ./tools/testing/selftests/kmod/kmod.sh -t 0006
> > Tue Jan 17 23:18:13 UTC 2023
> > Running test: kmod_test_0006 - run #0
> > kmod_test_0006: OK! - loading kmod test
> > kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22)
> > ----------------------------------------------------
> > Custom trigger configuration for: test_kmod0
> > Number of threads:      50
> > Test_case:      TEST_KMOD_FS_TYPE (2)
> > driver: test_module
> > fs:     xfs
> > ----------------------------------------------------
> > Test completed
> > 
> > When can multiple get_fs_type() calls be issued on a system? When
> > mounting a large number of filesystems. Sadly though this issue seems
> > to have gone unnoticed for a while now. Even reverting commit
> > 6e6de3dee51a doesn't fix it, and I've run into issues with trying
> > to bisect, first due to missing Kees' patch which fixes a compiler
> > failure on older kernel [0] and now I'm seeing this while trying to
> > build v5.1:
> > 
> > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
> > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
> > ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack
> > ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
> > ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
> > ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
> > ld: warning: creating DT_TEXTREL in a PIE
> > make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1
> > make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2
> > make: *** [arch/x86/Makefile:283: bzImage] Error 2
> > 
> > [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org
> > 
> > But we should try to bisect to see what cauased the above kmod test 0006
> > to start failing.
> 
> It is not clear to me from your description if the observed failure of
> kmod_test_0006 is related to the fix in this thread.

The issue happens with and without the patch in this thread, I'd just hate to
exacerbate the issue further.

> The problem was not possible for me to reproduce on my system. My test was on
> an 8-CPU x86_64 machine using v6.2-rc4 with "defconfig + kvm_guest.config +
> tools/testing/selftests/kmod/config".

With the patch?

> Could you perhaps trace the test to determine where the EINVAL value comes
> from?

Sure, it'll take a bit.

  Luis
  
Borislav Petkov Jan. 18, 2023, 8:02 p.m. UTC | #14
On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote:
> and now I'm seeing this while trying to build v5.1:
> 
> ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
> arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here

You need to backport 

aa5cacdc29d7 ("x86/asm: Replace __force_order with a memory clobber")

for that.

Happens when building older kernels with newer toolchain.
  
Lucas De Marchi Jan. 19, 2023, 1:15 a.m. UTC | #15
On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote:
>On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote:
>> On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
>> > On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote:
>> > > diff --git a/kernel/module/main.c b/kernel/module/main.c
>> > > index d02d39c7174e..7a627345d4fd 100644
>> > > --- a/kernel/module/main.c
>> > > +++ b/kernel/module/main.c
>> > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
>> > >  	sched_annotate_sleep();
>> > >  	mutex_lock(&module_mutex);
>> > >  	mod = find_module_all(name, strlen(name), true);
>> > > -	ret = !mod || mod->state == MODULE_STATE_LIVE;
>> > > +	ret = !mod || mod->state == MODULE_STATE_LIVE
>> > > +		|| mod->state == MODULE_STATE_GOING;
>> > >  	mutex_unlock(&module_mutex);
>> > >
>> > >  	return ret;
>> > > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod)
>> > >
>> > >  	mod->state = MODULE_STATE_UNFORMED;
>> > >
>> > > -again:
>> >
>> > So this is part of my biggest concern for regression, the removal of
>> > this tag and its use.
>> >
>> > Before this we always looped back to trying again and again.
>>
>> Just to be sure that we are on the same page.
>>
>> The loop was _not_ infinite. It serialized all attempts to load
>> the same module. In our case, it serialized all failures and
>> prolonged the pain.
>
>That's fair yes. The loop happens so long as an already existing module is
>present with the same name.
>
>> > >  	mutex_lock(&module_mutex);
>> > >  	old = find_module_all(mod->name, strlen(mod->name), true);
>> > >  	if (old != NULL) {
>> > > -		if (old->state != MODULE_STATE_LIVE) {
>> > > +		if (old->state == MODULE_STATE_COMING
>> > > +		    || old->state == MODULE_STATE_UNFORMED) {
>> > >  			/* Wait in case it fails to load. */
>> > >  			mutex_unlock(&module_mutex);
>> > >  			err = wait_event_interruptible(module_wq,
>> > >  					       finished_loading(mod->name));
>> > >  			if (err)
>> > >  				goto out_unlocked;
>> > > -			goto again;
>> >
>> > We essentially bound this now, and before we didn't.
>> >
>> > Yes we we wait for finished_loading() of the module -- but if udev is
>> > hammering tons of same requests, well, we *will* surely hit this, as
>> > many requests managed to get in before userspace saw the module present.
>> >
>> > While this typically can be useful, it means *quite a bit* of conditions which
>> > definitely *did* happen before will now *bail out* fast, to the extent
>> > that I'm not even sure why we just re-try once now.
>>
>> I do not understand this. We do _not_ re-try the load in the new
>> version. We just wait for the result of the parallel attempt to
>> load the module.
>>
>> Maybe, you are confused that we repeat find_module_all(). But it is
>> the way how to find the result of the parallel load.
>
>My point is that prior to the buggy commit 6e6de3dee51a ("kernel/module.c: Only
>return -EEXIST for modules that have finished loading") and even after that
>commit it we 'goto again' if an old request is found. We now simply bound this
>right away. Yes, the loop was not infinite, but in theory at least a few
>iterations were possible before whereas now immediately return -EBUSY
>and I don't think all use cases may be ready yet.
>
>> > If we're going to
>> > just re-check *once* why not do something graceful like *at least*
>> > cond_resched() to let the system breathe for a *tiny bit*.
>>
>> We must check the result under module_mutex. We have to take this
>> sleeping lock. There is actually a rescheduling. I do not think that
>> cond_resched() would do any difference.
>
>Makes sense.
>
>> > > +
>> > > +			/* The module might have gone in the meantime. */
>> > > +			mutex_lock(&module_mutex);
>> > > +			old = find_module_all(mod->name, strlen(mod->name),
>> > > +					      true);
>> > >  		}
>> > > -		err = -EEXIST;
>> > > +
>> > > +		/*
>> > > +		 * We are here only when the same module was being loaded. Do
>> > > +		 * not try to load it again right now. It prevents long delays
>> > > +		 * caused by serialized module load failures. It might happen
>> > > +		 * when more devices of the same type trigger load of
>> > > +		 * a particular module.
>> > > +		 */
>> > > +		if (old && old->state == MODULE_STATE_LIVE)
>> > > +			err = -EEXIST;
>> > > +		else
>> > > +			err = -EBUSY;
>> >
>> > And for all those use cases we end up here now, with -EBUSY. So udev
>> > before was not bounded, and kept busy-looping on the retry in-kernel,
>> > and we now immediately bound its condition to just 2 tries to see if the
>> > old module existed and now *return* a new value to userspace.
>> >
>> > My main concerns are:
>> >
>> > 0) Why not use cond_resched() if we're just going to check twice?
>>
>> We take module_mutex. It should cause even bigger delay than cond_resched().
>
>ACK.
>
>> > 1) How are we sure we are not regressing userspace by removing the boundless
>> > loop there? (even if the endless loop was stupid)
>>
>> We could not be sure. On the other hand, if more attempts help to load
>> the module then it is racy and not reliable. The new approach would
>> make it better reproducible and fix the race.
>
>Yes, but the short cut it is a userspace visible change.
>
>> > 2) How is it we expect that we won't resgress userspace now by bounding
>> > that check and pretty much returning -EBUSY right away? This last part
>> > seems dangerous, in that if userspace did not expect -EBUSY and if an
>> > error before caused a module to fail and fail boot, why wouldn't we fail
>> > boot now by bailing out faster??
>>
>> Same answer as for 1)
>>
>>
>> > 3) *Fixing* a kernel regression by adding new expected API for testing
>> > against -EBUSY seems not ideal.
>>
>> IMHO, the right solution is to fix the subsystems so that they send
>> only one uevent.
>
>Makes sense, but that can take time and some folks are stuck on old kernels
>and perhaps porting fixes for this on subsystems may take time to land
>to some enterprisy kernels. And then there is also systemd that issues
>the requests too, at least that was reflected in commit 6e6de3dee51a
>("kernel/module.c: Only return -EEXIST for modules that have finished loading")
>that commit claims it was systemd issueing the requests which I mean to
>interpret finit_module(), not calling modprobe.

just a comment on this, if it helps anything: commit 6e6de3dee51a says
systemd, but it would be more accurate to say udev or systemd-udev to
disambiguate with pid 1.

udev uses libkmod to load modules so it's pretty much the same logic as
modprobe.

Lucas De Marchi
  
Luis Chamberlain Jan. 19, 2023, 1:23 a.m. UTC | #16
On Wed, Jan 18, 2023 at 09:02:52PM +0100, Borislav Petkov wrote:
> On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote:
> > and now I'm seeing this while trying to build v5.1:
> > 
> > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
> > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
> 
> You need to backport 
> 
> aa5cacdc29d7 ("x86/asm: Replace __force_order with a memory clobber")
> 
> for that.
> 
> Happens when building older kernels with newer toolchain.

Thanks that certainly helps. FWIW if someone needs it, I had to remove
the double colons on write cr0 and cr4 to compile, but this crashed :(
Any ideas?

diff --git a/arch/x86/boot/compressed/kaslr_64.c b/arch/x86/boot/compressed/kaslr_64.c
index 748456c365f4..9557c5a15b91 100644
--- a/arch/x86/boot/compressed/kaslr_64.c
+++ b/arch/x86/boot/compressed/kaslr_64.c
@@ -29,9 +29,6 @@
 #define __PAGE_OFFSET __PAGE_OFFSET_BASE
 #include "../../mm/ident_map.c"
 
-/* Used by pgtable.h asm code to force instruction serialization. */
-unsigned long __force_order;
-
 /* Used to track our page table allocation area. */
 struct alloc_pgt_data {
 	unsigned char *pgt_buf;
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index f8debf7aeb4c..7471b48524cb 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -5,15 +5,6 @@
 #include "pgtable.h"
 #include "../string.h"
 
-/*
- * __force_order is used by special_insns.h asm code to force instruction
- * serialization.
- *
- * It is not referenced from the code, but GCC < 5 with -fPIE would fail
- * due to an undefined symbol. Define it to make these ancient GCCs work.
- */
-unsigned long __force_order;
-
 #define BIOS_START_MIN		0x20000U	/* 128K, less than this is insane */
 #define BIOS_START_MAX		0x9f000U	/* 640K, absolute maximum */
 
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 0a3c4cab39db..5586e4cf62d3 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -7,49 +7,50 @@
 
 #include <asm/nops.h>
 
+#define __FORCE_ORDER "m"(*(unsigned int *)0x1000UL)
+
 /*
- * Volatile isn't enough to prevent the compiler from reordering the
- * read/write functions for the control registers and messing everything up.
- * A memory clobber would solve the problem, but would prevent reordering of
- * all loads stores around it, which can hurt performance. Solution is to
- * use a variable and mimic reads and writes to it to enforce serialization
+ * The compiler should not reorder volatile asm statements with respect to each
+ * other: they should execute in program order. However GCC 4.9.x and 5.x have
+ * a bug (which was fixed in 8.1, 7.3 and 6.5) where they might reorder
+ * volatile asm. The write functions are not affected since they have memory
+ * clobbers preventing reordering. To prevent reads from being reordered with
+ * respect to writes, use a dummy memory operand
  */
-extern unsigned long __force_order;
-
 static inline unsigned long native_read_cr0(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr0,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr0,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 	return val;
 }
 
 static inline void native_write_cr0(unsigned long val)
 {
-	asm volatile("mov %0,%%cr0": : "r" (val), "m" (__force_order));
+	asm volatile("mov %0,%%cr0": : "r" (val) : "memory");
 }
 
 static inline unsigned long native_read_cr2(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr2,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr2,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 	return val;
 }
 
 static inline void native_write_cr2(unsigned long val)
 {
-	asm volatile("mov %0,%%cr2": : "r" (val), "m" (__force_order));
+	asm volatile("mov %0,%%cr2": : "r" (val) : "memory");
 }
 
 static inline unsigned long __native_read_cr3(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr3,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 	return val;
 }
 
 static inline void native_write_cr3(unsigned long val)
 {
-	asm volatile("mov %0,%%cr3": : "r" (val), "m" (__force_order));
+	asm volatile("mov %0,%%cr3": : "r" (val) : "memory");
 }
 
 static inline unsigned long native_read_cr4(void)
@@ -64,17 +65,17 @@ static inline unsigned long native_read_cr4(void)
 	asm volatile("1: mov %%cr4, %0\n"
 		     "2:\n"
 		     _ASM_EXTABLE(1b, 2b)
-		     : "=r" (val), "=m" (__force_order) : "0" (0));
+		     : "=r" (val) : "0" (0), __FORCE_ORDER);
 #else
 	/* CR4 always exists on x86_64. */
-	asm volatile("mov %%cr4,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr4,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 #endif
 	return val;
 }
 
 static inline void native_write_cr4(unsigned long val)
 {
-	asm volatile("mov %0,%%cr4": : "r" (val), "m" (__force_order));
+	asm volatile("mov %0,%%cr4": : "r" (val) : "memory");
 }
 
 #ifdef CONFIG_X86_64
  
Petr Pavlu Jan. 19, 2023, 12:26 p.m. UTC | #17
On 1/18/23 19:42, Luis Chamberlain wrote:
> On Wed, Jan 18, 2023 at 04:12:05PM +0100, Petr Pavlu wrote:
>> On 1/18/23 01:04, Luis Chamberlain wrote:
>>> The rationale for making a regression fix with a new userspace return value
>>> is fair given the old fix made things even much worse the point some kernel
>>> boots would fail. So the rationale to suggest we *must* short-cut
>>> parallel loads as effectively as possible seems sensible *iff* that
>>> could not make things worse too but sadly I've found an isssue
>>> proactively with this fix, or at least that this issue is also not fixed:
>>>
>>> ./tools/testing/selftests/kmod/kmod.sh -t 0006
>>> Tue Jan 17 23:18:13 UTC 2023
>>> Running test: kmod_test_0006 - run #0
>>> kmod_test_0006: OK! - loading kmod test
>>> kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22)
>>> ----------------------------------------------------
>>> Custom trigger configuration for: test_kmod0
>>> Number of threads:      50
>>> Test_case:      TEST_KMOD_FS_TYPE (2)
>>> driver: test_module
>>> fs:     xfs
>>> ----------------------------------------------------
>>> Test completed
>>>
>>> When can multiple get_fs_type() calls be issued on a system? When
>>> mounting a large number of filesystems. Sadly though this issue seems
>>> to have gone unnoticed for a while now. Even reverting commit
>>> 6e6de3dee51a doesn't fix it, and I've run into issues with trying
>>> to bisect, first due to missing Kees' patch which fixes a compiler
>>> failure on older kernel [0] and now I'm seeing this while trying to
>>> build v5.1:
>>>
>>> ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
>>> arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
>>> ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack
>>> ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
>>> ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
>>> ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
>>> ld: warning: creating DT_TEXTREL in a PIE
>>> make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1
>>> make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2
>>> make: *** [arch/x86/Makefile:283: bzImage] Error 2
>>>
>>> [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org
>>>
>>> But we should try to bisect to see what cauased the above kmod test 0006
>>> to start failing.
>>
>> It is not clear to me from your description if the observed failure of
>> kmod_test_0006 is related to the fix in this thread.
> 
> The issue happens with and without the patch in this thread, I'd just hate to
> exacerbate the issue further.
> 
>> The problem was not possible for me to reproduce on my system. My test was on
>> an 8-CPU x86_64 machine using v6.2-rc4 with "defconfig + kvm_guest.config +
>> tools/testing/selftests/kmod/config".
> 
> With the patch?

It is same for me without and with the patch. The test doesn't produce any
failure on my test system. The complete kmod.sh selftest passes too.

Cheers,
Petr
  
Petr Mladek Jan. 19, 2023, 3:47 p.m. UTC | #18
On Wed 2023-01-18 16:12:05, Petr Pavlu wrote:
> On 1/18/23 01:04, Luis Chamberlain wrote:
> > On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote:
> >> On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
> >>> 3) *Fixing* a kernel regression by adding new expected API for testing
> >>> against -EBUSY seems not ideal.
> >>
> >> IMHO, the right solution is to fix the subsystems so that they send
> >> only one uevent.
> > 
> > Makes sense, but that can take time and some folks are stuck on old kernels
> > and perhaps porting fixes for this on subsystems may take time to land
> > to some enterprisy kernels. And then there is also systemd that issues
> > the requests too, at least that was reflected in commit 6e6de3dee51a
> > ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
> > that commit claims it was systemd issueing the requests which I mean to
> > interpret finit_module(), not calling modprobe.
> > 
> > The rationale for making a regression fix with a new userspace return value
> > is fair given the old fix made things even much worse the point some kernel
> > boots would fail. So the rationale to suggest we *must* short-cut
> > parallel loads as effectively as possible seems sensible *iff* that
> > could not make things worse too but sadly I've found an isssue
> > proactively with this fix, or at least that this issue is also not fixed:
> > 
> > ./tools/testing/selftests/kmod/kmod.sh -t 0006
> > Tue Jan 17 23:18:13 UTC 2023
> > Running test: kmod_test_0006 - run #0
> > kmod_test_0006: OK! - loading kmod test
> > kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22)
> > ----------------------------------------------------
> > Custom trigger configuration for: test_kmod0
> > Number of threads:      50
> > Test_case:      TEST_KMOD_FS_TYPE (2)
> > driver: test_module
> > fs:     xfs
> > ----------------------------------------------------
> > Test completed
> > 
> > When can multiple get_fs_type() calls be issued on a system? When
> > mounting a large number of filesystems.

This patch should not change the behavior that much. The parallel
load still waits until the pending one finishes.

The difference is that it does not try to load it once again.
But when the first load fails then something is broken anyway.

I could imagine that this could cause regression from the user POV
when the first failure is not important from some reasons. But
it means that the things work only by chance and the problem
might hit the user later anyway.

Another difference is that the parallel load finishes immediately
also when it sees a going module. But it should not affect that
much the multiple get_fs_type() calls. They are all trying
to load the module.


> Sadly though this issue seems
> > to have gone unnoticed for a while now. Even reverting commit
> > 6e6de3dee51a doesn't fix it, and I've run into issues with trying
> > to bisect, first due to missing Kees' patch which fixes a compiler
> > failure on older kernel [0] and now I'm seeing this while trying to
> > build v5.1:
> > 
> > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
> > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
> > ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack
> > ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
> > ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
> > ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
> > ld: warning: creating DT_TEXTREL in a PIE
> > make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1
> > make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2
> > make: *** [arch/x86/Makefile:283: bzImage] Error 2
> > 
> > [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org
> > 
> > But we should try to bisect to see what cauased the above kmod test 0006
> > to start failing.
> 
> It is not clear to me from your description if the observed failure of
> kmod_test_0006 is related to the fix in this thread.
> 
> The problem was not possible for me to reproduce on my system. My test was on
> an 8-CPU x86_64 machine using v6.2-rc4 with "defconfig + kvm_guest.config +
> tools/testing/selftests/kmod/config".

I can't reproduce it either. I guess that it needs some "good" timing,
probably more CPUs or so.

I wonder if it races with module -r that removes the module before
it tries to load it multiple times in parallel.

Does the test pass when you add sleep after the module -r, like this:

diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh
index 7189715d7960..8a020f90a3f6 100755
--- a/tools/testing/selftests/kmod/kmod.sh
+++ b/tools/testing/selftests/kmod/kmod.sh
@@ -322,6 +322,7 @@ kmod_defaults_fs()
 {
 	config_reset
 	modprobe -r $DEFAULT_KMOD_FS
+	sleep 1
 	config_set_fs $DEFAULT_KMOD_FS
 	config_set_test_case_fs
 }



> Could you perhaps trace the test to determine where the EINVAL value comes
> from?

Yes, the -EINVAL error is strange. It is returned also in
kernel/module/main.c on few locations. But neither of them
looks like a good candidate.

My assumption is that it is more likely returned by the module_init()
callback from the loaded module. But it is just a guess and I might be wrong.

I wonder if it is cause by a delayed release of some resources,
when the module is removed, e.g. sysfs or so. It might theoretically
cause conflict when they still exist and the reloaded module
tries to create them again.

Best Regards,
Petr
  
Luis Chamberlain Jan. 19, 2023, 11:37 p.m. UTC | #19
On Wed, Jan 18, 2023 at 05:23:15PM -0800, Luis Chamberlain wrote:
> On Wed, Jan 18, 2023 at 09:02:52PM +0100, Borislav Petkov wrote:
> > On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote:
> > > and now I'm seeing this while trying to build v5.1:
> > > 
> > > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order';
> > > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
> > 
> > You need to backport 
> > 
> > aa5cacdc29d7 ("x86/asm: Replace __force_order with a memory clobber")
> > 
> > for that.
> > 
> > Happens when building older kernels with newer toolchain.
> 
> Thanks that certainly helps. FWIW if someone needs it, I had to remove
> the double colons on write cr0 and cr4 to compile, but this crashed :(
> Any ideas?

The issue was my write for cr0 and cr4 didn't have r+, so:

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 5586e4cf62d3..e4da7248edcf 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -26,7 +26,7 @@ static inline unsigned long native_read_cr0(void)
 
 static inline void native_write_cr0(unsigned long val)
 {
-	asm volatile("mov %0,%%cr0": : "r" (val) : "memory");
+	asm volatile("mov %0,%%cr0": "+r" (val) : : "memory");
 }
 
 static inline unsigned long native_read_cr2(void)
@@ -75,7 +75,7 @@ static inline unsigned long native_read_cr4(void)
 
 static inline void native_write_cr4(unsigned long val)
 {
-	asm volatile("mov %0,%%cr4": : "r" (val) : "memory");
+	asm volatile("mov %0,%%cr4": "+r" (val) : : "memory");
 }
 
 #ifdef CONFIG_X86_64

The complete patch below now boots on v5.2 however this still ends up in
a panic, it dies on cpu_startup_entry().

[    0.595694] x86: Booting SMP configuration:
[    0.596807] .... node  #0, CPUs:      #1
[    0.264759] kvm-clock: cpu 1, msr 15edff041, secondary cpu clock
[    0.264759] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1c6/0x1d0
[    0.264759] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0+ #5
[    0.264759] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[    0.264759] Call Trace:
[    0.264759]  dump_stack+0x5c/0x80
[    0.264759]  panic+0x102/0x2f5
[    0.264759]  ? start_secondary+0x1c6/0x1d0
[    0.264759]  __stack_chk_fail+0x15/0x20
[    0.264759]  start_secondary+0x1c6/0x1d0
[    0.264759]  secondary_startup_64+0xa4/0xb0
[    0.264759] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1c6/0x1d0 ]---

(gdb) l *(start_secondary+0x1c6)
0xffffffff810506e6 is at arch/x86/kernel/smpboot.c:265.
260             boot_init_stack_canary();
261
262             x86_cpuinit.setup_percpu_clockev();
263
264             wmb();
265             cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
266     }
267
268     /**
269      * topology_is_primary_thread - Check whether CPU is the primary SMT thread


The full attempt to backport aa5cacdc29d7 to v5.2:

diff --git a/arch/x86/boot/compressed/kaslr_64.c b/arch/x86/boot/compressed/kaslr_64.c
index 748456c365f4..9557c5a15b91 100644
--- a/arch/x86/boot/compressed/kaslr_64.c
+++ b/arch/x86/boot/compressed/kaslr_64.c
@@ -29,9 +29,6 @@
 #define __PAGE_OFFSET __PAGE_OFFSET_BASE
 #include "../../mm/ident_map.c"
 
-/* Used by pgtable.h asm code to force instruction serialization. */
-unsigned long __force_order;
-
 /* Used to track our page table allocation area. */
 struct alloc_pgt_data {
 	unsigned char *pgt_buf;
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index f8debf7aeb4c..7471b48524cb 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -5,15 +5,6 @@
 #include "pgtable.h"
 #include "../string.h"
 
-/*
- * __force_order is used by special_insns.h asm code to force instruction
- * serialization.
- *
- * It is not referenced from the code, but GCC < 5 with -fPIE would fail
- * due to an undefined symbol. Define it to make these ancient GCCs work.
- */
-unsigned long __force_order;
-
 #define BIOS_START_MIN		0x20000U	/* 128K, less than this is insane */
 #define BIOS_START_MAX		0x9f000U	/* 640K, absolute maximum */
 
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 0a3c4cab39db..e4da7248edcf 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -7,49 +7,50 @@
 
 #include <asm/nops.h>
 
+#define __FORCE_ORDER "m"(*(unsigned int *)0x1000UL)
+
 /*
- * Volatile isn't enough to prevent the compiler from reordering the
- * read/write functions for the control registers and messing everything up.
- * A memory clobber would solve the problem, but would prevent reordering of
- * all loads stores around it, which can hurt performance. Solution is to
- * use a variable and mimic reads and writes to it to enforce serialization
+ * The compiler should not reorder volatile asm statements with respect to each
+ * other: they should execute in program order. However GCC 4.9.x and 5.x have
+ * a bug (which was fixed in 8.1, 7.3 and 6.5) where they might reorder
+ * volatile asm. The write functions are not affected since they have memory
+ * clobbers preventing reordering. To prevent reads from being reordered with
+ * respect to writes, use a dummy memory operand
  */
-extern unsigned long __force_order;
-
 static inline unsigned long native_read_cr0(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr0,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr0,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 	return val;
 }
 
 static inline void native_write_cr0(unsigned long val)
 {
-	asm volatile("mov %0,%%cr0": : "r" (val), "m" (__force_order));
+	asm volatile("mov %0,%%cr0": "+r" (val) : : "memory");
 }
 
 static inline unsigned long native_read_cr2(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr2,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr2,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 	return val;
 }
 
 static inline void native_write_cr2(unsigned long val)
 {
-	asm volatile("mov %0,%%cr2": : "r" (val), "m" (__force_order));
+	asm volatile("mov %0,%%cr2": : "r" (val) : "memory");
 }
 
 static inline unsigned long __native_read_cr3(void)
 {
 	unsigned long val;
-	asm volatile("mov %%cr3,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 	return val;
 }
 
 static inline void native_write_cr3(unsigned long val)
 {
-	asm volatile("mov %0,%%cr3": : "r" (val), "m" (__force_order));
+	asm volatile("mov %0,%%cr3": : "r" (val) : "memory");
 }
 
 static inline unsigned long native_read_cr4(void)
@@ -64,17 +65,17 @@ static inline unsigned long native_read_cr4(void)
 	asm volatile("1: mov %%cr4, %0\n"
 		     "2:\n"
 		     _ASM_EXTABLE(1b, 2b)
-		     : "=r" (val), "=m" (__force_order) : "0" (0));
+		     : "=r" (val) : "0" (0), __FORCE_ORDER);
 #else
 	/* CR4 always exists on x86_64. */
-	asm volatile("mov %%cr4,%0\n\t" : "=r" (val), "=m" (__force_order));
+	asm volatile("mov %%cr4,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 #endif
 	return val;
 }
 
 static inline void native_write_cr4(unsigned long val)
 {
-	asm volatile("mov %0,%%cr4": : "r" (val), "m" (__force_order));
+	asm volatile("mov %0,%%cr4": "+r" (val) : : "memory");
 }
 
 #ifdef CONFIG_X86_64
  
Luis Chamberlain Jan. 20, 2023, 12:51 a.m. UTC | #20
On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
> Yes, the -EINVAL error is strange. It is returned also in
> kernel/module/main.c on few locations. But neither of them
> looks like a good candidate.

OK I updated to next-20230119 and I don't see the issue now.
Odd. It could have been an issue with next-20221207 which I was
on before.

I'll run some more test and if nothing fails I'll send the fix
to Linux for rc5.

  Luis
  
Luis Chamberlain Jan. 20, 2023, 12:58 a.m. UTC | #21
On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote:
> On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
> > Yes, the -EINVAL error is strange. It is returned also in
> > kernel/module/main.c on few locations. But neither of them
> > looks like a good candidate.
> 
> OK I updated to next-20230119 and I don't see the issue now.
> Odd. It could have been an issue with next-20221207 which I was
> on before.
> 
> I'll run some more test and if nothing fails I'll send the fix
> to Linux for rc5.

Jeesh it just occured to me the difference, which I'll have to
test next, for next-20221207 I had enabled module compression
on kdevops with zstd.

You can see the issues on kdevops git log with that... and I finally
disabled it and the kmod test issue is gone. So it could be that
but I just am ending my day so will check tomorrow if that was it.
But if someone else beats me then great.

With kdevops it should be a matter of just enabling zstd as I
just bumped support for next-20230119 and that has module decompression
disabled.

  Luis
  
Luis Chamberlain Jan. 21, 2023, 10:40 p.m. UTC | #22
On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote:
> On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote:
> > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
> > > Yes, the -EINVAL error is strange. It is returned also in
> > > kernel/module/main.c on few locations. But neither of them
> > > looks like a good candidate.
> > 
> > OK I updated to next-20230119 and I don't see the issue now.
> > Odd. It could have been an issue with next-20221207 which I was
> > on before.
> > 
> > I'll run some more test and if nothing fails I'll send the fix
> > to Linux for rc5.
> 
> Jeesh it just occured to me the difference, which I'll have to
> test next, for next-20221207 I had enabled module compression
> on kdevops with zstd.
> 
> You can see the issues on kdevops git log with that... and I finally
> disabled it and the kmod test issue is gone. So it could be that
> but I just am ending my day so will check tomorrow if that was it.
> But if someone else beats me then great.
> 
> With kdevops it should be a matter of just enabling zstd as I
> just bumped support for next-20230119 and that has module decompression
> disabled.

So indeed, my suspcions were correct. There is one bug with
compression on debian:

 - gzip compressed modules don't end up in the initramfs

There is a generic upstream kmod bug:

  - modprobe --show-depends won't grok compressed modules so initramfs
    tools that use this as Debian likely are not getting module dependencies
    installed in their initramfs

But using xz compression reveals 4 GiB memory is not enough for kmod.sh
test 0004, the -EINVAL is due to an OOM hit on modprobe so the request
fails. That's a test bug.

But increasing memory (8 GiB seems to do it) still reveals kmod.sh test 0009
does fail, not all the times, and it is why the test runs 150 times if
you run the test once. The failure is not deterministic but surely fails
for me every time at least once out of the 150 runs. Test 0009 tries to
trigger running kmod_concurrent over max_modprobes for get_fs_type(). 

I'm trying to test to see if I this failure can trigger without module
compression but I don't see the failure yet.

Reverting the patch on this thread on linux-next does not fix that issue
and so this has perhaps been broken for a much longer time. And so this
patch still remains a candidate fix.

 Luis
  
Luis Chamberlain Jan. 24, 2023, 7:58 p.m. UTC | #23
On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
> I wonder if it races with module -r that removes the module before
> it tries to load it multiple times in parallel.
> 
> Does the test pass when you add sleep after the module -r, like this:
> 
> diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh
> index 7189715d7960..8a020f90a3f6 100755
> --- a/tools/testing/selftests/kmod/kmod.sh
> +++ b/tools/testing/selftests/kmod/kmod.sh
> @@ -322,6 +322,7 @@ kmod_defaults_fs()
>  {
>  	config_reset
>  	modprobe -r $DEFAULT_KMOD_FS
> +	sleep 1
>  	config_set_fs $DEFAULT_KMOD_FS
>  	config_set_test_case_fs
>  }

FWIW I was curious if the kmod test 0009.sh now could pass with this
too, but alas it can sometimes fail too.

[  138.590663] misc test_kmod0: reset
[  139.729273] misc test_kmod0: Test case: TEST_KMOD_FS_TYPE (2)
[  139.732874] misc test_kmod0: Test filesystem to load: xfs
[  139.736230] misc test_kmod0: Number of threads to run: 62
[  139.739575] misc test_kmod0: Thread IDs will range from 0 - 61
[  140.402079] __request_module: 2 callbacks suppressed
[  140.402082] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.418075] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.430124] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.450119] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.478037] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.498080] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.518066] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.530207] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.549949] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.562182] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling...
[  140.582342] misc test_kmod0: No errors were found while initializing threads
[  145.385414] SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled
[  145.461962] request_module: modprobe fs-xfs cannot be processed, kmod busy with 50 threads for more than 5 seconds now
[  145.652541] misc test_kmod0: Done: 62 threads have all run now
[  145.655020] misc test_kmod0: Last thread to run: 39
[  145.655831] misc test_kmod0: Results:
[  145.656450] misc test_kmod0: Sync thread 0 fs: xfs
[  145.657164] misc test_kmod0: Sync thread 1 fs: xfs
[  145.657875] misc test_kmod0: Sync thread 2 fs: xfs
[  145.658615] misc test_kmod0: Sync thread 3 fs: xfs
[  145.659324] misc test_kmod0: Sync thread 4 fs: xfs
[  145.660026] misc test_kmod0: Sync thread 5 fs: xfs
[  145.660948] misc test_kmod0: Sync thread 6 fs: xfs
[  145.661870] misc test_kmod0: Sync thread 7 fs: xfs
[  145.662625] misc test_kmod0: Sync thread 8 fs: xfs
[  145.663363] misc test_kmod0: Sync thread 9 fs: xfs
[  145.664073] misc test_kmod0: Sync thread 10 fs: xfs
[  145.664791] misc test_kmod0: Sync thread 11 fs: xfs
[  145.665509] misc test_kmod0: Sync thread 12 fs: xfs
[  145.666252] misc test_kmod0: Sync thread 13 fs: xfs
[  145.666971] misc test_kmod0: Sync thread 14 fs: xfs
[  145.667693] misc test_kmod0: Sync thread 15 fs: xfs
[  145.668405] misc test_kmod0: Sync thread 16 fs: xfs
[  145.669113] misc test_kmod0: Sync thread 17 fs: xfs
[  145.669823] misc test_kmod0: Sync thread 18 fs: xfs
[  145.670634] misc test_kmod0: Sync thread 19 fs: xfs
[  145.671390] misc test_kmod0: Sync thread 20 fs: xfs
[  145.672126] misc test_kmod0: Sync thread 21 fs: xfs
[  145.672842] misc test_kmod0: Sync thread 22 fs: xfs
[  145.673561] misc test_kmod0: Sync thread 23 fs: xfs
[  145.674327] misc test_kmod0: Sync thread 24 fs: xfs
[  145.675051] misc test_kmod0: Sync thread 25 fs: xfs
[  145.675772] misc test_kmod0: Sync thread 26 fs: xfs
[  145.676491] misc test_kmod0: Sync thread 27 fs: xfs
[  145.677207] misc test_kmod0: Sync thread 28 fs: xfs
[  145.677920] misc test_kmod0: Sync thread 29 fs: xfs
[  145.678658] misc test_kmod0: Sync thread 30 fs: xfs
[  145.679369] misc test_kmod0: Sync thread 31 fs: xfs
[  145.680075] misc test_kmod0: Sync thread 32 fs: xfs
[  145.680780] misc test_kmod0: Sync thread 33 fs: xfs
[  145.681481] misc test_kmod0: Sync thread 34 fs: xfs
[  145.682211] misc test_kmod0: Sync thread 35 fs: xfs
[  145.682925] misc test_kmod0: Sync thread 36 fs: xfs
[  145.683633] misc test_kmod0: Sync thread 37 fs: xfs
[  145.684363] misc test_kmod0: Sync thread 38 fs: xfs
[  145.685196] misc test_kmod0: Sync thread 39 fs: xfs
[  145.685896] misc test_kmod0: Sync thread 40 fs: xfs
[  145.687009] misc test_kmod0: Sync thread 41 fs: xfs
[  145.687800] misc test_kmod0: Sync thread 42 fs: xfs
[  145.688540] misc test_kmod0: Sync thread 43 fs: xfs
[  145.689227] misc test_kmod0: Sync thread 44 fs: xfs
[  145.689901] misc test_kmod0: Sync thread 45 fs: xfs
[  145.690924] misc test_kmod0: Sync thread 46 fs: xfs
[  145.691721] misc test_kmod0: Sync thread 47 fs: xfs
[  145.692533] misc test_kmod0: Sync thread 48 fs: xfs
[  145.693329] misc test_kmod0: Sync thread 49 fs: xfs
[  145.694140] misc test_kmod0: Sync thread 50 fs: xfs
[  145.694910] misc test_kmod0: Sync thread 51 fs: xfs
[  145.695695] misc test_kmod0: Sync thread 52 fs: NULL
[  145.696461] misc test_kmod0: Sync thread 53 fs: xfs
[  145.697229] misc test_kmod0: Sync thread 54 fs: xfs
[  145.698027] misc test_kmod0: Sync thread 55 fs: xfs
[  145.698898] misc test_kmod0: Sync thread 56 fs: xfs
[  145.699953] misc test_kmod0: Sync thread 57 fs: xfs
[  145.700667] misc test_kmod0: Sync thread 58 fs: xfs
[  145.701362] misc test_kmod0: Sync thread 59 fs: xfs
[  145.702078] misc test_kmod0: Sync thread 60 fs: xfs
[  145.702765] misc test_kmod0: Sync thread 61 fs: xfs
[  145.703456] misc test_kmod0: General test result: -22

The key here:

[  145.461962] request_module: modprobe fs-xfs cannot be processed, kmod busy with 50 threads for more than 5 seconds now

That is not printed when the test iterates and does not fail. That
comes from kernel/kmod.c:

	if (!ret) {                                                     
		pr_warn_ratelimited("request_module: %s cannot be processed, kmod busy with %d threads for more than %d seconds now",
				    module_name, MAX_KMOD_CONCURRENT, MAX_KMOD_ALL_BUSY_TIMEOUT);
		return -ETIME;   
	} ...

ETIME however is 62 as per include/uapi/asm-generic/errno.h. The loss
of the value comes from the fact get_fs_type() ignores error types and
so lib/test_kmod.c just sets err_ret = -EINVAL in tally_work_test() as
it cannot get more heuristics out of the kernel as to why get_fs_type()
failed.

This should mean that the failure observed with test 0009 on kmod is
very likely not due to module compression but just a timing issue, and
that compression just increases the probability of having 50 threads
busy concurrently on modprobe in 5 seconds with that test. I've
confirmed this by running a test with a modified kmod as follows
*after* booting into a kernel with no compression:

diff --git a/tools/modprobe.c b/tools/modprobe.c
index 3b7897c..0b7574d 100644
--- a/tools/modprobe.c
+++ b/tools/modprobe.c
@@ -1012,6 +1012,8 @@ static int do_modprobe(int argc, char **orig_argv)
 
 	log_setup_kmod_log(ctx, verbose);
 
+	usleep(5000000);
+
 	kmod_load_resources(ctx);
 
 	if (do_show_config)

Modules don't tend to be large in size but module compression is an
extrapolation of what could happen without compression if we had huge
modules often and userspace doing something wild. If you end up with 50
concurrent threads running modprobe for more than 5 seconds the kernel
pr_warn_ratelimited() would print though and it surely is a sign userspace
is doing something stupid. The sad part though is that a filesystem
mount *can* be triggered in these cases and so can fail to boot.

The above test were run with next-20230119 without the patch on this
thread, and so as noted before this doesn't create a regression, this is
a known issue now. And so -- further confirmation I'll move forward with
this patch for the next rc.

Note kernel/kmod.c in the kernel states:

/*                                                                              
 * This is a restriction on having *all* MAX_KMOD_CONCURRENT threads            
 * running at the same time without returning. When this happens we             
 * believe you've somehow ended up with a recursive module dependency           
 * creating a loop.                                                             
 *                                                                              
 * We have no option but to fail.                                               
 *                                                                              
 * Userspace should proactively try to detect and prevent these.                
 */                                                                             
#define MAX_KMOD_ALL_BUSY_TIMEOUT 5 

I can update the docs to reflect that this can be triggered by
the kernel trying to auto-loading DoS or each CPU count triggering
tons of unecessary duplicate auto-loading module requests. This is a self
kernel inflicted situation.

As for the DoS Vegard Nossum did report that user namespaces *could*
trigger / abuse kernel module autoloading and that they shouldn't be allowed
to do that as they can't load modules directly anyway (finit_module() won't work
for them). I'm waiting for a proper patch follow up from him, but *that* in
theory could then be another way to trigger this issue other than kmod test
0009, abuse user namespaces so to trigger a module failure by going over board on
auto-module loading. Boot likely can't be compromised unless creation of user
namespaces is allowed to be exploited early on boot. But post boot
kernel module auto-loading could be DoS'd with user namespaces.

As for the other case -- each additional CPU causing more module
auto-loading than before -- this is a real issue to monitor for and likely
can cause odd boot failures. This thread already dealt with
cpu-frequency modules as an example of abuse in the kernel for this.
Folks are working to fix this though but older kernels will have these
issues.

Since the kernel is the one that *is* dealing with throttling of this
auto-load situation, fixing these cases in-kernel is the right solution.
I don't think userspace can do much here as the limit hit is inherent to
auto-loading.

Perhaps the we should upgrade the pr_warn_ratelimited() to WARN_ONCE()?

[0] https://lore.kernel.org/all/Y8HkC1re3Fo46Ne3@bombadil.infradead.org/T/#u

  Luis
  
Luis Chamberlain March 11, 2023, 9:36 p.m. UTC | #24
On Sat, Jan 21, 2023 at 2:40 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> There is one bug with  compression on debian:
>
>  - gzip compressed modules don't end up in the initramfs
>
> There is a generic upstream kmod bug:
>
>   - modprobe --show-depends won't grok compressed modules so initramfs
>     tools that use this as Debian likely are not getting module dependencies
>     installed in their initramfs

Vincenzo *might* be up to tackle the above, let's see!

  Luis
  
Lucas De Marchi March 12, 2023, 6:25 a.m. UTC | #25
On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote:
>On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote:
>> On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote:
>> > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
>> > > Yes, the -EINVAL error is strange. It is returned also in
>> > > kernel/module/main.c on few locations. But neither of them
>> > > looks like a good candidate.
>> >
>> > OK I updated to next-20230119 and I don't see the issue now.
>> > Odd. It could have been an issue with next-20221207 which I was
>> > on before.
>> >
>> > I'll run some more test and if nothing fails I'll send the fix
>> > to Linux for rc5.
>>
>> Jeesh it just occured to me the difference, which I'll have to
>> test next, for next-20221207 I had enabled module compression
>> on kdevops with zstd.
>>
>> You can see the issues on kdevops git log with that... and I finally
>> disabled it and the kmod test issue is gone. So it could be that
>> but I just am ending my day so will check tomorrow if that was it.
>> But if someone else beats me then great.
>>
>> With kdevops it should be a matter of just enabling zstd as I
>> just bumped support for next-20230119 and that has module decompression
>> disabled.
>
>So indeed, my suspcions were correct. There is one bug with
>compression on debian:
>
> - gzip compressed modules don't end up in the initramfs
>
>There is a generic upstream kmod bug:
>
>  - modprobe --show-depends won't grok compressed modules so initramfs
>    tools that use this as Debian likely are not getting module dependencies
>    installed in their initramfs

are you sure you have the relevant compression setting enabled
in kmod?

$ kmod --version
kmod version 30
+ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL
$ modprobe --show-depends ext4
insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/jbd2/jbd2.ko.zst 
insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/mbcache.ko.zst 
insmod /lib/modules/6.1.12-1-MANJARO/kernel/lib/crc16.ko.zst 
insmod /lib/modules/6.1.12-1-MANJARO/kernel/arch/x86/crypto/crc32c-intel.ko.zst 
insmod /lib/modules/6.1.12-1-MANJARO/kernel/crypto/crc32c_generic.ko.zst 
insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/ext4/ext4.ko.zst 

Lucas De Marchi
  
Luis Chamberlain March 22, 2023, 10:31 p.m. UTC | #26
On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote:
> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote:
> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote:
> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote:
> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
> > > > > Yes, the -EINVAL error is strange. It is returned also in
> > > > > kernel/module/main.c on few locations. But neither of them
> > > > > looks like a good candidate.
> > > >
> > > > OK I updated to next-20230119 and I don't see the issue now.
> > > > Odd. It could have been an issue with next-20221207 which I was
> > > > on before.
> > > >
> > > > I'll run some more test and if nothing fails I'll send the fix
> > > > to Linux for rc5.
> > > 
> > > Jeesh it just occured to me the difference, which I'll have to
> > > test next, for next-20221207 I had enabled module compression
> > > on kdevops with zstd.
> > > 
> > > You can see the issues on kdevops git log with that... and I finally
> > > disabled it and the kmod test issue is gone. So it could be that
> > > but I just am ending my day so will check tomorrow if that was it.
> > > But if someone else beats me then great.
> > > 
> > > With kdevops it should be a matter of just enabling zstd as I
> > > just bumped support for next-20230119 and that has module decompression
> > > disabled.
> > 
> > So indeed, my suspcions were correct. There is one bug with
> > compression on debian:
> > 
> > - gzip compressed modules don't end up in the initramfs
> > 
> > There is a generic upstream kmod bug:
> > 
> >  - modprobe --show-depends won't grok compressed modules so initramfs
> >    tools that use this as Debian likely are not getting module dependencies
> >    installed in their initramfs
> 
> are you sure you have the relevant compression setting enabled
> in kmod?
> 
> $ kmod --version
> kmod version 30
> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL

Debian has:

kmod version 30
+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL

> $ modprobe --show-depends ext4
> insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/jbd2/jbd2.ko.zst insmod
> /lib/modules/6.1.12-1-MANJARO/kernel/fs/mbcache.ko.zst insmod
> /lib/modules/6.1.12-1-MANJARO/kernel/lib/crc16.ko.zst insmod
> /lib/modules/6.1.12-1-MANJARO/kernel/arch/x86/crypto/crc32c-intel.ko.zst
> insmod /lib/modules/6.1.12-1-MANJARO/kernel/crypto/crc32c_generic.ko.zst
> insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/ext4/ext4.ko.zst

Perhaps this was related to the above gzip issue in debian then.

I'm hoping will have a bit more time than me to verify.

  Luis
  
Lucas De Marchi March 23, 2023, 3:01 p.m. UTC | #27
On Wed, Mar 22, 2023 at 03:31:59PM -0700, Luis Chamberlain wrote:
>On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote:
>> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote:
>> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote:
>> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote:
>> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
>> > > > > Yes, the -EINVAL error is strange. It is returned also in
>> > > > > kernel/module/main.c on few locations. But neither of them
>> > > > > looks like a good candidate.
>> > > >
>> > > > OK I updated to next-20230119 and I don't see the issue now.
>> > > > Odd. It could have been an issue with next-20221207 which I was
>> > > > on before.
>> > > >
>> > > > I'll run some more test and if nothing fails I'll send the fix
>> > > > to Linux for rc5.
>> > >
>> > > Jeesh it just occured to me the difference, which I'll have to
>> > > test next, for next-20221207 I had enabled module compression
>> > > on kdevops with zstd.
>> > >
>> > > You can see the issues on kdevops git log with that... and I finally
>> > > disabled it and the kmod test issue is gone. So it could be that
>> > > but I just am ending my day so will check tomorrow if that was it.
>> > > But if someone else beats me then great.
>> > >
>> > > With kdevops it should be a matter of just enabling zstd as I
>> > > just bumped support for next-20230119 and that has module decompression
>> > > disabled.
>> >
>> > So indeed, my suspcions were correct. There is one bug with
>> > compression on debian:
>> >
>> > - gzip compressed modules don't end up in the initramfs
>> >
>> > There is a generic upstream kmod bug:
>> >
>> >  - modprobe --show-depends won't grok compressed modules so initramfs
>> >    tools that use this as Debian likely are not getting module dependencies
>> >    installed in their initramfs
>>
>> are you sure you have the relevant compression setting enabled
>> in kmod?
>>
>> $ kmod --version
>> kmod version 30
>> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL
>
>Debian has:
>
>kmod version 30
>+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL

	   ^ so... mind the minus :). It doesn't support zlib.

Change your kernel config to either compress the modules as xz or zstd.


Lucas De Marchi

>
>> $ modprobe --show-depends ext4
>> insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/jbd2/jbd2.ko.zst insmod
>> /lib/modules/6.1.12-1-MANJARO/kernel/fs/mbcache.ko.zst insmod
>> /lib/modules/6.1.12-1-MANJARO/kernel/lib/crc16.ko.zst insmod
>> /lib/modules/6.1.12-1-MANJARO/kernel/arch/x86/crypto/crc32c-intel.ko.zst
>> insmod /lib/modules/6.1.12-1-MANJARO/kernel/crypto/crc32c_generic.ko.zst
>> insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/ext4/ext4.ko.zst
>
>Perhaps this was related to the above gzip issue in debian then.
>
>I'm hoping will have a bit more time than me to verify.
>
>  Luis
  
Luis Chamberlain March 23, 2023, 3:08 p.m. UTC | #28
On Thu, Mar 23, 2023 at 8:02 AM Lucas De Marchi
<lucas.demarchi@intel.com> wrote:
>
> On Wed, Mar 22, 2023 at 03:31:59PM -0700, Luis Chamberlain wrote:
> >On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote:
> >> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote:
> >> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote:
> >> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote:
> >> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
> >> > > > > Yes, the -EINVAL error is strange. It is returned also in
> >> > > > > kernel/module/main.c on few locations. But neither of them
> >> > > > > looks like a good candidate.
> >> > > >
> >> > > > OK I updated to next-20230119 and I don't see the issue now.
> >> > > > Odd. It could have been an issue with next-20221207 which I was
> >> > > > on before.
> >> > > >
> >> > > > I'll run some more test and if nothing fails I'll send the fix
> >> > > > to Linux for rc5.
> >> > >
> >> > > Jeesh it just occured to me the difference, which I'll have to
> >> > > test next, for next-20221207 I had enabled module compression
> >> > > on kdevops with zstd.
> >> > >
> >> > > You can see the issues on kdevops git log with that... and I finally
> >> > > disabled it and the kmod test issue is gone. So it could be that
> >> > > but I just am ending my day so will check tomorrow if that was it.
> >> > > But if someone else beats me then great.
> >> > >
> >> > > With kdevops it should be a matter of just enabling zstd as I
> >> > > just bumped support for next-20230119 and that has module decompression
> >> > > disabled.
> >> >
> >> > So indeed, my suspcions were correct. There is one bug with
> >> > compression on debian:
> >> >
> >> > - gzip compressed modules don't end up in the initramfs
> >> >
> >> > There is a generic upstream kmod bug:
> >> >
> >> >  - modprobe --show-depends won't grok compressed modules so initramfs
> >> >    tools that use this as Debian likely are not getting module dependencies
> >> >    installed in their initramfs
> >>
> >> are you sure you have the relevant compression setting enabled
> >> in kmod?
> >>
> >> $ kmod --version
> >> kmod version 30
> >> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL
> >
> >Debian has:
> >
> >kmod version 30
> >+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL
>
>            ^ so... mind the minus :). It doesn't support zlib.
>
> Change your kernel config to either compress the modules as xz or zstd.

Oh so then we should complain about these things if an initramfs is
detected with modules compressed using a compression algorithm which
modprobe installed does not support. What tool would do that?

  Luis
  
Lucas De Marchi March 24, 2023, 6:03 a.m. UTC | #29
On Thu, Mar 23, 2023 at 08:08:49AM -0700, Luis Chamberlain wrote:
>On Thu, Mar 23, 2023 at 8:02 AM Lucas De Marchi
><lucas.demarchi@intel.com> wrote:
>>
>> On Wed, Mar 22, 2023 at 03:31:59PM -0700, Luis Chamberlain wrote:
>> >On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote:
>> >> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote:
>> >> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote:
>> >> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote:
>> >> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
>> >> > > > > Yes, the -EINVAL error is strange. It is returned also in
>> >> > > > > kernel/module/main.c on few locations. But neither of them
>> >> > > > > looks like a good candidate.
>> >> > > >
>> >> > > > OK I updated to next-20230119 and I don't see the issue now.
>> >> > > > Odd. It could have been an issue with next-20221207 which I was
>> >> > > > on before.
>> >> > > >
>> >> > > > I'll run some more test and if nothing fails I'll send the fix
>> >> > > > to Linux for rc5.
>> >> > >
>> >> > > Jeesh it just occured to me the difference, which I'll have to
>> >> > > test next, for next-20221207 I had enabled module compression
>> >> > > on kdevops with zstd.
>> >> > >
>> >> > > You can see the issues on kdevops git log with that... and I finally
>> >> > > disabled it and the kmod test issue is gone. So it could be that
>> >> > > but I just am ending my day so will check tomorrow if that was it.
>> >> > > But if someone else beats me then great.
>> >> > >
>> >> > > With kdevops it should be a matter of just enabling zstd as I
>> >> > > just bumped support for next-20230119 and that has module decompression
>> >> > > disabled.
>> >> >
>> >> > So indeed, my suspcions were correct. There is one bug with
>> >> > compression on debian:
>> >> >
>> >> > - gzip compressed modules don't end up in the initramfs
>> >> >
>> >> > There is a generic upstream kmod bug:
>> >> >
>> >> >  - modprobe --show-depends won't grok compressed modules so initramfs
>> >> >    tools that use this as Debian likely are not getting module dependencies
>> >> >    installed in their initramfs
>> >>
>> >> are you sure you have the relevant compression setting enabled
>> >> in kmod?
>> >>
>> >> $ kmod --version
>> >> kmod version 30
>> >> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL
>> >
>> >Debian has:
>> >
>> >kmod version 30
>> >+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL
>>
>>            ^ so... mind the minus :). It doesn't support zlib.
>>
>> Change your kernel config to either compress the modules as xz or zstd.
>
>Oh so then we should complain about these things if an initramfs is
>detected with modules compressed using a compression algorithm which
>modprobe installed does not support. What tool would do that?

I guess we could add that in depmod side as a dummy handler for when
that config is off. Thoughts?

Lucas De Marchi

>
>  Luis
  
Luis Chamberlain March 24, 2023, 6:47 p.m. UTC | #30
On Thu, Mar 23, 2023 at 11:03:21PM -0700, Lucas De Marchi wrote:
> On Thu, Mar 23, 2023 at 08:08:49AM -0700, Luis Chamberlain wrote:
> > On Thu, Mar 23, 2023 at 8:02 AM Lucas De Marchi
> > <lucas.demarchi@intel.com> wrote:
> > > 
> > > On Wed, Mar 22, 2023 at 03:31:59PM -0700, Luis Chamberlain wrote:
> > > >On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote:
> > > >> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote:
> > > >> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote:
> > > >> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote:
> > > >> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote:
> > > >> > > > > Yes, the -EINVAL error is strange. It is returned also in
> > > >> > > > > kernel/module/main.c on few locations. But neither of them
> > > >> > > > > looks like a good candidate.
> > > >> > > >
> > > >> > > > OK I updated to next-20230119 and I don't see the issue now.
> > > >> > > > Odd. It could have been an issue with next-20221207 which I was
> > > >> > > > on before.
> > > >> > > >
> > > >> > > > I'll run some more test and if nothing fails I'll send the fix
> > > >> > > > to Linux for rc5.
> > > >> > >
> > > >> > > Jeesh it just occured to me the difference, which I'll have to
> > > >> > > test next, for next-20221207 I had enabled module compression
> > > >> > > on kdevops with zstd.
> > > >> > >
> > > >> > > You can see the issues on kdevops git log with that... and I finally
> > > >> > > disabled it and the kmod test issue is gone. So it could be that
> > > >> > > but I just am ending my day so will check tomorrow if that was it.
> > > >> > > But if someone else beats me then great.
> > > >> > >
> > > >> > > With kdevops it should be a matter of just enabling zstd as I
> > > >> > > just bumped support for next-20230119 and that has module decompression
> > > >> > > disabled.
> > > >> >
> > > >> > So indeed, my suspcions were correct. There is one bug with
> > > >> > compression on debian:
> > > >> >
> > > >> > - gzip compressed modules don't end up in the initramfs
> > > >> >
> > > >> > There is a generic upstream kmod bug:
> > > >> >
> > > >> >  - modprobe --show-depends won't grok compressed modules so initramfs
> > > >> >    tools that use this as Debian likely are not getting module dependencies
> > > >> >    installed in their initramfs
> > > >>
> > > >> are you sure you have the relevant compression setting enabled
> > > >> in kmod?
> > > >>
> > > >> $ kmod --version
> > > >> kmod version 30
> > > >> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL
> > > >
> > > >Debian has:
> > > >
> > > >kmod version 30
> > > >+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL
> > > 
> > >            ^ so... mind the minus :). It doesn't support zlib.
> > > 
> > > Change your kernel config to either compress the modules as xz or zstd.
> > 
> > Oh so then we should complain about these things if an initramfs is
> > detected with modules compressed using a compression algorithm which
> > modprobe installed does not support. What tool would do that?
> 
> I guess we could add that in depmod side as a dummy handler for when
> that config is off. Thoughts?

That sounds like a good solution, better than and complain before
allowing someone to boot and *not* be able to.

  Luis
  

Patch

diff --git a/kernel/module/main.c b/kernel/module/main.c
index d02d39c7174e..7a627345d4fd 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2386,7 +2386,8 @@  static bool finished_loading(const char *name)
 	sched_annotate_sleep();
 	mutex_lock(&module_mutex);
 	mod = find_module_all(name, strlen(name), true);
-	ret = !mod || mod->state == MODULE_STATE_LIVE;
+	ret = !mod || mod->state == MODULE_STATE_LIVE
+		|| mod->state == MODULE_STATE_GOING;
 	mutex_unlock(&module_mutex);
 
 	return ret;
@@ -2562,20 +2563,35 @@  static int add_unformed_module(struct module *mod)
 
 	mod->state = MODULE_STATE_UNFORMED;
 
-again:
 	mutex_lock(&module_mutex);
 	old = find_module_all(mod->name, strlen(mod->name), true);
 	if (old != NULL) {
-		if (old->state != MODULE_STATE_LIVE) {
+		if (old->state == MODULE_STATE_COMING
+		    || old->state == MODULE_STATE_UNFORMED) {
 			/* Wait in case it fails to load. */
 			mutex_unlock(&module_mutex);
 			err = wait_event_interruptible(module_wq,
 					       finished_loading(mod->name));
 			if (err)
 				goto out_unlocked;
-			goto again;
+
+			/* The module might have gone in the meantime. */
+			mutex_lock(&module_mutex);
+			old = find_module_all(mod->name, strlen(mod->name),
+					      true);
 		}
-		err = -EEXIST;
+
+		/*
+		 * We are here only when the same module was being loaded. Do
+		 * not try to load it again right now. It prevents long delays
+		 * caused by serialized module load failures. It might happen
+		 * when more devices of the same type trigger load of
+		 * a particular module.
+		 */
+		if (old && old->state == MODULE_STATE_LIVE)
+			err = -EEXIST;
+		else
+			err = -EBUSY;
 		goto out;
 	}
 	mod_update_bounds(mod);