[RFC] Fix stuck UCSI controller on DELL

Message ID 20240103100635.57099-1-lk@c--e.de
State New

Commit Message

Christian A. Ehrhardt Jan. 3, 2024, 10:06 a.m. UTC
  I have a DELL Latitude 5431 where typec only works somewhat.
After the first plug/unplug event the PPM seems to be stuck and
commands end with a timeout (GET_CONNECTOR_STATUS failed (-110)).

This patch fixes it for me but according to my reading it is in
violation of the UCSI spec. On the other hand searching through
the net it appears that many DELL models seem to have timeout problems
with UCSI.

Do we want some kind of quirk here? There does not seem to be a quirk
framework for this part of the code, yet. Or is it ok to just send the
additional ACK in all cases and hope that the PPM will do the right
thing?

     regards   Christian
  

Comments

Heikki Krogerus Jan. 4, 2024, 11:59 a.m. UTC | #1
Hi Christian,

On Wed, Jan 03, 2024 at 11:06:35AM +0100, Christian A. Ehrhardt wrote:
> I have a DELL Latitude 5431 where typec only works somewhat.
> After the first plug/unplug event the PPM seems to be stuck and
> commands end with a timeout (GET_CONNECTOR_STATUS failed (-110)).
> 
> This patch fixes it for me but according to my reading it is in
> violation of the UCSI spec. On the other hand searching through
> the net it appears that many DELL models seem to have timeout problems
> with UCSI.
> 
> Do we want some kind of quirk here? There does not seem to be a quirk
> framework for this part of the code, yet. Or is it ok to just send the
> additional ACK in all cases and hope that the PPM will do the right
> thing?

We can use DMI quirks. Something like the attached diff (not tested).

thanks,
  
Christian A. Ehrhardt Jan. 15, 2024, 6:55 p.m. UTC | #2
Hi Heikki,

sorry to bother you again with this but I'm afraid there's
a misunderstanding wrt. the nature of the quirk. See below:

On Thu, Jan 04, 2024 at 01:59:02PM +0200, Heikki Krogerus wrote:
> Hi Christian,
> 
> On Wed, Jan 03, 2024 at 11:06:35AM +0100, Christian A. Ehrhardt wrote:
> > I have a DELL Latitude 5431 where typec only works somewhat.
> > After the first plug/unplug event the PPM seems to be stuck and
> > commands end with a timeout (GET_CONNECTOR_STATUS failed (-110)).
> > 
> > This patch fixes it for me but according to my reading it is in
> > violation of the UCSI spec. On the other hand searching through
> > the net it appears that many DELL models seem to have timeout problems
> > with UCSI.
> > 
> > Do we want some kind of quirk here? There does not seem to be a quirk
> > framework for this part of the code, yet. Or is it ok to just send the
> > additional ACK in all cases and hope that the PPM will do the right
> > thing?
> 
> We can use DMI quirks. Something like the attached diff (not tested).
> 
> thanks,
> 
> -- 
> heikki

> diff --git a/drivers/usb/typec/ucsi/ucsi_acpi.c b/drivers/usb/typec/ucsi/ucsi_acpi.c
> index 6bbf490ac401..7e8b1fcfa024 100644
> --- a/drivers/usb/typec/ucsi/ucsi_acpi.c
> +++ b/drivers/usb/typec/ucsi/ucsi_acpi.c
> @@ -113,18 +113,44 @@ ucsi_zenbook_read(struct ucsi *ucsi, unsigned int offset, void *val, size_t val_
>  	return 0;
>  }
>  
> -static const struct ucsi_operations ucsi_zenbook_ops = {
> -	.read = ucsi_zenbook_read,
> -	.sync_write = ucsi_acpi_sync_write,
> -	.async_write = ucsi_acpi_async_write
> -};
> +static int ucsi_dell_sync_write(struct ucsi *ucsi, unsigned int offset,
> +				const void *val, size_t val_len)
> +{
> +	u64 ctrl = *(u64 *)val;
> +	int ret;
> +
> +	ret = ucsi_acpi_sync_write(ucsi, offset, val, val_len);
> +	if (ret && (ctrl & (UCSI_ACK_CC_CI | UCSI_ACK_CONNECTOR_CHANGE))) {
> +		ctrl= UCSI_ACK_CC_CI | UCSI_ACK_COMMAND_COMPLETE;
> +
> +		dev_dbg(ucsi->dev->parent, "%s: ACK failed\n", __func__);
> +		ret = ucsi_acpi_sync_write(ucsi, UCSI_CONTROL, &ctrl, sizeof(ctrl));
> +	}

Unfortunately, this has the logic reversed. The quirk (i.e. the
additional UCSI_ACK_COMMAND_COMPLETE) is required after a _successful_
UCSI_ACK_CONNECTOR_CHANGE. Otherwise, _subsequent_ commands will time out
(usually the next GET_CONNECTOR_STATUS).

This means the quirk must be applied _before_ we detect any failure.
Consequently, the quirk has the potential to break working systems.
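
The ordering described above can be illustrated with a small userspace
sketch. This is not driver code: mock_ppm, mock_sync_write() and
quirk_ack_connector_change() are hypothetical names, the stuck-PPM model
is an assumption based on the behavior reported in this thread, and the
bit values are quoted from memory of drivers/usb/typec/ucsi/ucsi.h.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as recalled from ucsi.h; treat them as assumptions. */
#define UCSI_ACK_CC_CI            0x04ULL
#define UCSI_GET_CONNECTOR_STATUS 0x12ULL
#define UCSI_ACK_CONNECTOR_CHANGE (1ULL << 16)
#define UCSI_ACK_COMMAND_COMPLETE (1ULL << 17)

/* Hypothetical model of the buggy PPM described in this thread. */
struct mock_ppm {
	bool stuck;
};

/* Stand-in for ops->sync_write(); returns 0 or -ETIMEDOUT. */
static int mock_sync_write(struct mock_ppm *ppm, uint64_t ctrl)
{
	bool is_ack = (ctrl & 0xff) == UCSI_ACK_CC_CI;

	if (is_ack && (ctrl & UCSI_ACK_COMMAND_COMPLETE)) {
		ppm->stuck = false;	/* the extra ACK un-sticks the PPM */
		return 0;
	}
	if (ppm->stuck)
		return -ETIMEDOUT;	/* every other command times out */
	if (is_ack && (ctrl & UCSI_ACK_CONNECTOR_CHANGE))
		ppm->stuck = true;	/* ACK succeeds, next command hangs */
	return 0;
}

/* Send the extra ACK right after a successful connector change ACK. */
static int quirk_ack_connector_change(struct mock_ppm *ppm, bool quirk)
{
	int ret = mock_sync_write(ppm, UCSI_ACK_CC_CI |
				       UCSI_ACK_CONNECTOR_CHANGE);

	if (ret || !quirk)
		return ret;
	return mock_sync_write(ppm, UCSI_ACK_CC_CI |
				    UCSI_ACK_COMMAND_COMPLETE);
}
```

In this model the first ACK never fails, so a check that only fires on
an error return never sends the extra ACK, and the following
GET_CONNECTOR_STATUS is the command that times out.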

Sorry, if that wasn't clear from my original mail. Please let me know
if this changes how you want the quirks handled.

     Thanks    Christian
  
Mario Limonciello Jan. 17, 2024, 3 a.m. UTC | #3
On 1/15/2024 12:55, Christian A. Ehrhardt wrote:
> 
> Hi Heikki,
> 
> sorry to bother you again with this but I'm afraid there's
> a misunderstanding wrt. the nature of the quirk. See below:
> 
> On Thu, Jan 04, 2024 at 01:59:02PM +0200, Heikki Krogerus wrote:
>> Hi Christian,
>>
>> On Wed, Jan 03, 2024 at 11:06:35AM +0100, Christian A. Ehrhardt wrote:
>>> I have a DELL Latitude 5431 where typec only works somewhat.
>>> After the first plug/unplug event the PPM seems to be stuck and
>>> commands end with a timeout (GET_CONNECTOR_STATUS failed (-110)).
>>>
>>> This patch fixes it for me but according to my reading it is in
>>> violation of the UCSI spec. On the other hand searching through
>>> the net it appears that many DELL models seem to have timeout problems
>>> with UCSI.
>>>
>>> Do we want some kind of quirk here? There does not seem to be a quirk
>>> framework for this part of the code, yet. Or is it ok to just send the
>>> additional ACK in all cases and hope that the PPM will do the right
>>> thing?
>>
>> We can use DMI quirks. Something like the attached diff (not tested).
>>
>> thanks,
>>
>> -- 
>> heikki
> 
>> diff --git a/drivers/usb/typec/ucsi/ucsi_acpi.c b/drivers/usb/typec/ucsi/ucsi_acpi.c
>> index 6bbf490ac401..7e8b1fcfa024 100644
>> --- a/drivers/usb/typec/ucsi/ucsi_acpi.c
>> +++ b/drivers/usb/typec/ucsi/ucsi_acpi.c
>> @@ -113,18 +113,44 @@ ucsi_zenbook_read(struct ucsi *ucsi, unsigned int offset, void *val, size_t val_
>>   	return 0;
>>   }
>>   
>> -static const struct ucsi_operations ucsi_zenbook_ops = {
>> -	.read = ucsi_zenbook_read,
>> -	.sync_write = ucsi_acpi_sync_write,
>> -	.async_write = ucsi_acpi_async_write
>> -};
>> +static int ucsi_dell_sync_write(struct ucsi *ucsi, unsigned int offset,
>> +				const void *val, size_t val_len)
>> +{
>> +	u64 ctrl = *(u64 *)val;
>> +	int ret;
>> +
>> +	ret = ucsi_acpi_sync_write(ucsi, offset, val, val_len);
>> +	if (ret && (ctrl & (UCSI_ACK_CC_CI | UCSI_ACK_CONNECTOR_CHANGE))) {
>> +		ctrl= UCSI_ACK_CC_CI | UCSI_ACK_COMMAND_COMPLETE;
>> +
>> +		dev_dbg(ucsi->dev->parent, "%s: ACK failed\n", __func__);
>> +		ret = ucsi_acpi_sync_write(ucsi, UCSI_CONTROL, &ctrl, sizeof(ctrl));
>> +	}
> 
> Unfortunately, this has the logic reversed. The quirk (i.e. the
> additional UCSI_ACK_COMMAND_COMPLETE) is required after a _successful_
> UCSI_ACK_CONNECTOR_CHANGE. Otherwise, _subsequent_ commands will time out
> (usually the next GET_CONNECTOR_STATUS).
> 
> This means the quirk must be applied _before_ we detect any failure.
> Consequently, the quirk has the potential to break working systems.
> 
> Sorry, if that wasn't clear from my original mail. Please let me know
> if this changes how you want the quirks handled.
> 
>       Thanks    Christian
> 

For the problematic scenario have you tried to play with it a bit to see 
if it's too short of a timeout (raise timeout) or to output the response 
bits to see if anything else surprising is sent?

Does it always fail on the same command, or does it happen to a bunch of 
them?
  
Christian A. Ehrhardt Jan. 17, 2024, 6:35 a.m. UTC | #4
Hi Mario,

On Tue, Jan 16, 2024 at 09:00:03PM -0600, Mario Limonciello wrote:
> On 1/15/2024 12:55, Christian A. Ehrhardt wrote:
> > 
> > Hi Heikki,
> > 
> > sorry to bother you again with this but I'm afraid there's
> > a misunderstanding wrt. the nature of the quirk. See below:
> > 
> > On Thu, Jan 04, 2024 at 01:59:02PM +0200, Heikki Krogerus wrote:
> > > Hi Christian,
> > > 
> > > On Wed, Jan 03, 2024 at 11:06:35AM +0100, Christian A. Ehrhardt wrote:
> > > > I have a DELL Latitude 5431 where typec only works somewhat.
> > > > After the first plug/unplug event the PPM seems to be stuck and
> > > > commands end with a timeout (GET_CONNECTOR_STATUS failed (-110)).
> > > > 
> > > > This patch fixes it for me but according to my reading it is in
> > > > violation of the UCSI spec. On the other hand searching through
> > > > the net it appears that many DELL models seem to have timeout problems
> > > > with UCSI.
> > > > 
> > > > Do we want some kind of quirk here? There does not seem to be a quirk
> > > > framework for this part of the code, yet. Or is it ok to just send the
> > > > additional ACK in all cases and hope that the PPM will do the right
> > > > thing?
> > > 
> > > We can use DMI quirks. Something like the attached diff (not tested).
> > > 
> > > thanks,
> > > 
> > > -- 
> > > heikki
> > 
> > > diff --git a/drivers/usb/typec/ucsi/ucsi_acpi.c b/drivers/usb/typec/ucsi/ucsi_acpi.c
> > > index 6bbf490ac401..7e8b1fcfa024 100644
> > > --- a/drivers/usb/typec/ucsi/ucsi_acpi.c
> > > +++ b/drivers/usb/typec/ucsi/ucsi_acpi.c
> > > @@ -113,18 +113,44 @@ ucsi_zenbook_read(struct ucsi *ucsi, unsigned int offset, void *val, size_t val_
> > >   	return 0;
> > >   }
> > > -static const struct ucsi_operations ucsi_zenbook_ops = {
> > > -	.read = ucsi_zenbook_read,
> > > -	.sync_write = ucsi_acpi_sync_write,
> > > -	.async_write = ucsi_acpi_async_write
> > > -};
> > > +static int ucsi_dell_sync_write(struct ucsi *ucsi, unsigned int offset,
> > > +				const void *val, size_t val_len)
> > > +{
> > > +	u64 ctrl = *(u64 *)val;
> > > +	int ret;
> > > +
> > > +	ret = ucsi_acpi_sync_write(ucsi, offset, val, val_len);
> > > +	if (ret && (ctrl & (UCSI_ACK_CC_CI | UCSI_ACK_CONNECTOR_CHANGE))) {
> > > +		ctrl= UCSI_ACK_CC_CI | UCSI_ACK_COMMAND_COMPLETE;
> > > +
> > > +		dev_dbg(ucsi->dev->parent, "%s: ACK failed\n", __func__);
> > > +		ret = ucsi_acpi_sync_write(ucsi, UCSI_CONTROL, &ctrl, sizeof(ctrl));
> > > +	}
> > 
> > Unfortunately, this has the logic reversed. The quirk (i.e. the
> > additional UCSI_ACK_COMMAND_COMPLETE) is required after a _successful_
> > UCSI_ACK_CONNECTOR_CHANGE. Otherwise, _subsequent_ commands will time out
> > (usually the next GET_CONNECTOR_STATUS).
> > 
> > This means the quirk must be applied _before_ we detect any failure.
> > Consequently, the quirk has the potential to break working systems.
> > 
> > Sorry, if that wasn't clear from my original mail. Please let me know
> > if this changes how you want the quirks handled.
> > 
> >       Thanks    Christian
> > 
> 
> For the problematic scenario have you tried to play with it a bit to see if
> it's too short of a timeout (raise timeout) or to output the response bits
> to see if anything else surprising is sent?

It is not a problem with the timeout. Waiting forever in this case
doesn't help. IMHO this is actually a bug in the PPM, i.e. in Dell's
bios.

Sending an ack after the timeout fixes things, though.

> Does it always fail on the same command, or does it happen to a bunch of
> them?

It always fails on the first command after UCSI_ACK_CC_CI for a
connector change. However, there might be no such command if the
next event is a notification.

I did play around with it a bit more and came up with a way to
probe for the issue:

    https://lore.kernel.org/all/20240116224041.220740-1-lk@c--e.de/

regards    Christian
  
Mario Limonciello Jan. 17, 2024, 5:34 p.m. UTC | #5
On 1/17/2024 00:35, Christian A. Ehrhardt wrote:
> 
> Hi Mario,
> 
> On Tue, Jan 16, 2024 at 09:00:03PM -0600, Mario Limonciello wrote:
>> On 1/15/2024 12:55, Christian A. Ehrhardt wrote:
>>>
>>> Hi Heikki,
>>>
>>> sorry to bother you again with this but I'm afraid there's
>>> a misunderstanding wrt. the nature of the quirk. See below:
>>>
>>> On Thu, Jan 04, 2024 at 01:59:02PM +0200, Heikki Krogerus wrote:
>>>> Hi Christian,
>>>>
>>>> On Wed, Jan 03, 2024 at 11:06:35AM +0100, Christian A. Ehrhardt wrote:
>>>>> I have a DELL Latitude 5431 where typec only works somewhat.
>>>>> After the first plug/unplug event the PPM seems to be stuck and
>>>>> commands end with a timeout (GET_CONNECTOR_STATUS failed (-110)).
>>>>>
>>>>> This patch fixes it for me but according to my reading it is in
>>>>> violation of the UCSI spec. On the other hand searching through
>>>>> the net it appears that many DELL models seem to have timeout problems
>>>>> with UCSI.
>>>>>
>>>>> Do we want some kind of quirk here? There does not seem to be a quirk
>>>>> framework for this part of the code, yet. Or is it ok to just send the
>>>>> additional ACK in all cases and hope that the PPM will do the right
>>>>> thing?
>>>>
>>>> We can use DMI quirks. Something like the attached diff (not tested).
>>>>
>>>> thanks,
>>>>
>>>> -- 
>>>> heikki
>>>
>>>> diff --git a/drivers/usb/typec/ucsi/ucsi_acpi.c b/drivers/usb/typec/ucsi/ucsi_acpi.c
>>>> index 6bbf490ac401..7e8b1fcfa024 100644
>>>> --- a/drivers/usb/typec/ucsi/ucsi_acpi.c
>>>> +++ b/drivers/usb/typec/ucsi/ucsi_acpi.c
>>>> @@ -113,18 +113,44 @@ ucsi_zenbook_read(struct ucsi *ucsi, unsigned int offset, void *val, size_t val_
>>>>    	return 0;
>>>>    }
>>>> -static const struct ucsi_operations ucsi_zenbook_ops = {
>>>> -	.read = ucsi_zenbook_read,
>>>> -	.sync_write = ucsi_acpi_sync_write,
>>>> -	.async_write = ucsi_acpi_async_write
>>>> -};
>>>> +static int ucsi_dell_sync_write(struct ucsi *ucsi, unsigned int offset,
>>>> +				const void *val, size_t val_len)
>>>> +{
>>>> +	u64 ctrl = *(u64 *)val;
>>>> +	int ret;
>>>> +
>>>> +	ret = ucsi_acpi_sync_write(ucsi, offset, val, val_len);
>>>> +	if (ret && (ctrl & (UCSI_ACK_CC_CI | UCSI_ACK_CONNECTOR_CHANGE))) {
>>>> +		ctrl= UCSI_ACK_CC_CI | UCSI_ACK_COMMAND_COMPLETE;
>>>> +
>>>> +		dev_dbg(ucsi->dev->parent, "%s: ACK failed\n", __func__);
>>>> +		ret = ucsi_acpi_sync_write(ucsi, UCSI_CONTROL, &ctrl, sizeof(ctrl));
>>>> +	}
>>>
>>> Unfortunately, this has the logic reversed. The quirk (i.e. the
>>> additional UCSI_ACK_COMMAND_COMPLETE) is required after a _successful_
>>> UCSI_ACK_CONNECTOR_CHANGE. Otherwise, _subsequent_ commands will time out
>>> (usually the next GET_CONNECTOR_STATUS).
>>>
>>> This means the quirk must be applied _before_ we detect any failure.
>>> Consequently, the quirk has the potential to break working systems.
>>>
>>> Sorry, if that wasn't clear from my original mail. Please let me know
>>> if this changes how you want the quirks handled.
>>>
>>>        Thanks    Christian
>>>
>>
>> For the problematic scenario have you tried to play with it a bit to see if
>> it's too short of a timeout (raise timeout) or to output the response bits
>> to see if anything else surprising is sent?
> 
> It is not a problem with the timeout. Waiting forever in this case
> doesn't help. IMHO this is actually a bug in the PPM, i.e. in Dell's
> bios.

"Usually" the PD controller F/W is distributed with the EC, but yes Dell 
nominally puts everything in a monolithic BIOS package.

> 
> Sending an ack after the timeout fixes things, though.
> 
>> Does it always fail on the same command, or does it happen to a bunch of
>> them?
> 
> It always fails on the first command after UCSI_ACK_CC_CI for a
> connector change. However, there might be no such command if the
> next event is a notification.
> 
> I did play around with it a bit more and came up with a way to
> probe for the issue:
> 
>      https://lore.kernel.org/all/20240116224041.220740-1-lk@c--e.de/

If some variation of your probe-able workaround is picked up, I think it's
worth making noise when it is probed (dev_warn or dev_notice) to say that
it is being used to work around a PPM bug.
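
A rough userspace sketch of that suggestion follows. The names
quirk_state and record_probe_result() are hypothetical, the message
wording is only an example, and in the driver the snprintf() call would
be a dev_warn() or dev_notice() on the UCSI device.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-controller state; not a structure from the driver. */
struct quirk_state {
	bool extra_ack_needed;	/* set when the probe detects the PPM bug */
};

/*
 * Record the probe result and emit a notice when the workaround is
 * enabled. Returns the length of the message, or 0 if nothing was logged.
 */
static int record_probe_result(struct quirk_state *q, bool bug_detected,
			       char *buf, size_t len)
{
	q->extra_ack_needed = bug_detected;
	if (!bug_detected)
		return 0;

	/* Stand-in for dev_warn()/dev_notice() in kernel code. */
	return snprintf(buf, len,
			"firmware bug: extra ACK needed after connector change ACK\n");
}
```

Logging only when the probe actually detects the problem keeps working
systems quiet while leaving a trace in the log on affected machines.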

> 
> regards    Christian
> 
> 

+ Dell Client Kernel mailbox

Dell team,

Can you look into this?  It sounds like it should be investigated more
closely to see where the mismatch between the spec and the real
behavior actually lies.
  

Patch

diff --git a/drivers/usb/typec/ucsi/ucsi.c b/drivers/usb/typec/ucsi/ucsi.c
index 61b64558f96c..65098a454f63 100644
--- a/drivers/usb/typec/ucsi/ucsi.c
+++ b/drivers/usb/typec/ucsi/ucsi.c
@@ -53,7 +53,10 @@  static int ucsi_acknowledge_connector_change(struct ucsi *ucsi)
 	ctrl = UCSI_ACK_CC_CI;
 	ctrl |= UCSI_ACK_CONNECTOR_CHANGE;
 
-	return ucsi->ops->sync_write(ucsi, UCSI_CONTROL, &ctrl, sizeof(ctrl));
+	if (ucsi->ops->sync_write(ucsi, UCSI_CONTROL, &ctrl, sizeof(ctrl)))
+		pr_err("ACK FAILED\n");
+
+	return ucsi_acknowledge_command(ucsi);
 }
 
 static int ucsi_exec_command(struct ucsi *ucsi, u64 command);