EDAC/device: Add sysfs notification for UE,CE count change

Message ID 20230731220059.28474-1-quic_djaggi@quicinc.com
State New
Series EDAC/device: Add sysfs notification for UE,CE count change

Commit Message

Deepti Jaggi July 31, 2023, 10 p.m. UTC
  A daemon running in user space collects information on correctable
and uncorrectable errors from the EDAC driver by reading the
corresponding sysfs entries and takes appropriate action.
This patch lets the user space daemon wait in poll() until the sysfs
entries for the UE count and CE count change and then read the updated
counts, instead of continuously monitoring the sysfs entries for
changes.

Signed-off-by: Deepti Jaggi <quic_djaggi@quicinc.com>
---
 drivers/edac/edac_device.c       | 16 ++++++++++++++++
 drivers/edac/edac_device.h       |  8 ++++++++
 drivers/edac/edac_device_sysfs.c | 20 ++++++++++++++++++++
 3 files changed, 44 insertions(+)
  

Comments

Trilok Soni Aug. 1, 2023, 5:48 a.m. UTC | #1
On 7/31/2023 3:40 PM, Trilok Soni wrote:
> On 7/31/2023 3:00 PM, Deepti Jaggi wrote:
>> A daemon running in user space collects information on correctable
>> and uncorrectable errors from EDAC driver by reading corresponding
>> sysfs entries and takes appropriate action.
> 
> Which daemon we are referring here? Can you please provide the link to 
> the project?
> 
> Are you using this daemon?
> 
> https://mcelog.org/ - It is for x86, but is your daemon project different?
> 
>> This patch adds support for user space daemon to wait on poll() until
>> the sysfs entries for UE count and CE count change and then read updated
>> counts instead of continuously monitoring the sysfs entries for
>> any changes.
> 
> The modifications below are architecture agnostic so I really want to 
> know what exactly we are fixing and if there is a problem.

+ CC linux-arm-msm

Please keep linux-arm-msm in CC if there is a next revision.
  
Deepti Jaggi Aug. 1, 2023, 10:37 p.m. UTC | #2
On 7/31/2023 10:48 PM, Trilok Soni wrote:
> On 7/31/2023 3:40 PM, Trilok Soni wrote:
>> On 7/31/2023 3:00 PM, Deepti Jaggi wrote:
>>> A daemon running in user space collects information on correctable
>>> and uncorrectable errors from EDAC driver by reading corresponding
>>> sysfs entries and takes appropriate action.
>>
>> Which daemon we are referring here? Can you please provide the link to 
>> the project?
>>
>> Are you using this daemon?
>>
>> https://mcelog.org/ - It is for x86, but is your daemon project 
>> different?
>>

No, we are not using this daemon. Our daemon is under development and
is more specific to Qualcomm use cases.
Based on my limited understanding of mcelog, that daemon handles
errors in an architecture-specific way.
By adding support for sysfs notification in the EDAC framework, drivers
which are not using any custom sysfs attributes can take advantage of
this modification to notify a user space daemon polling on the ue_count
and/or ce_count attributes.

>>> This patch adds support for user space daemon to wait on poll() until
>>> the sysfs entries for UE count and CE count change and then read updated
>>> counts instead of continuously monitoring the sysfs entries for
>>> any changes.
>>
>> The modifications below are architecture agnostic so I really want to 
>> know what exactly we are fixing and if there is a problem.
> 

This change set adds support for user space to poll on the ue_count
and/or ce_count sysfs attributes.
When ue_count or ce_count changes, the EDAC driver framework unblocks
the user space poll, and user space can then read the updated ce_count
and ue_count.

As an example, user space performs the following steps:
	1. Open the sysfs attribute files for UE count and CE count.
	2. Read the initial CE count and UE count.
	3. Poll for changes on the CE count and UE count fds.
	4. Once poll unblocks, read the updated count.
	5. Take appropriate action on the changed counts.

#####################################################################
Example simple user space code snippet:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_POLL_FDS     2

static const char ue_count_file[] =
	"/sys/devices/system/edac/qcom-llcc/qcom-llcc0/ue_count";
static const char ce_count_file[] =
	"/sys/devices/system/edac/qcom-llcc/qcom-llcc0/ce_count";

int main(void)
{
	struct pollfd poll_fds[MAX_POLL_FDS] = {0};
	char data[100];
	ssize_t ret;

	poll_fds[0].fd = open(ue_count_file, O_RDONLY);
	poll_fds[1].fd = open(ce_count_file, O_RDONLY);

	/* Read initial value before poll and set poll events */
	for (int i = 0; i < MAX_POLL_FDS; i++) {
		if (poll_fds[i].fd < 0) {
			perror("open");
			return 1;
		}
		ret = read(poll_fds[i].fd, data, sizeof(data) - 1);
		if (ret < 0) {
			perror("read");
			return 1;
		}
		poll_fds[i].events = POLLPRI;
	}

	while (1) {
		/* Block on poll until ue_count or ce_count change */
		if (poll(poll_fds, MAX_POLL_FDS, -1) < 0) {
			perror("poll");
			return 1;
		}

		/*
		 * Read the changed UE/CE count. lseek() to the start
		 * (or close/re-open the changed fd) before reading.
		 */
		for (int i = 0; i < MAX_POLL_FDS; i++) {
			if (!(poll_fds[i].revents & POLLPRI))
				continue;

			lseek(poll_fds[i].fd, 0, SEEK_SET);
			ret = read(poll_fds[i].fd, data, sizeof(data) - 1);
			if (ret < 0)
				continue;
			data[ret] = '\0';

			/* Take an appropriate action */
		}
	}
}
######################################################################
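
Step 4 above (reading the updated count once poll() unblocks) could
also be factored into a small helper. This is only a sketch under
assumptions: the name read_count() is hypothetical and not part of the
patch, and it assumes the attribute prints a single unsigned decimal
number. It uses pread() at offset 0 in place of lseek() followed by
read().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * read_count() is a hypothetical helper, not part of the patch.
 * Re-read a counter attribute from offset 0 and parse it as an
 * unsigned decimal number. Returns 0 on success, -1 on failure.
 */
static int read_count(int fd, unsigned long *count)
{
	char buf[32];
	char *end;
	ssize_t n;

	assert(count != NULL);

	/*
	 * Reading from offset 0 returns the current value; for sysfs
	 * this is assumed to re-run the attribute's show routine.
	 */
	n = pread(fd, buf, sizeof(buf) - 1, 0);
	if (n <= 0)
		return -1;
	buf[n] = '\0';

	errno = 0;
	*count = strtoul(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;

	return 0;
}
```

With such a helper, the POLLPRI branch of the loop would reduce to a
read_count(poll_fds[i].fd, &count) call followed by the action taken
on the new value.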

> + CC linux-arm-msm
> 
> Please keep linux-arm-msm in CC if there is a next revision.
> 

Noted.


--Deepti
  
Trilok Soni Sept. 13, 2023, 5:22 p.m. UTC | #3
On 8/1/2023 3:37 PM, Deepti Jaggi wrote:
> On 7/31/2023 10:48 PM, Trilok Soni wrote:
>> On 7/31/2023 3:40 PM, Trilok Soni wrote:
>>> On 7/31/2023 3:00 PM, Deepti Jaggi wrote:
>>>> A daemon running in user space collects information on correctable
>>>> and uncorrectable errors from EDAC driver by reading corresponding
>>>> sysfs entries and takes appropriate action.
>>>
>>> Which daemon we are referring here? Can you please provide the link to the project?
>>>
>>> Are you using this daemon?
>>>
>>> https://mcelog.org/ - It is for x86, but is your daemon project different?
>>>
> 
> No this daemon is not used. Daemon is under development and it is more specific to Qualcomm use cases.
> Based on my limited understanding of mcelog, this daemon is handling errors in an architecture specific way.
> By adding support for sysfs notification in EDAC framework, drivers which are not using any custom sysfs attributes can take advantage of this modification to notify the user space daemon polling on ue_count and/or ce_count attributes.


Did you look at rasdaemon then?

https://github.com/mchehab/rasdaemon - rasdaemon is also used on more than one architecture, including ARM.


> 
>>>> This patch adds support for user space daemon to wait on poll() until
>>>> the sysfs entries for UE count and CE count change and then read updated
>>>> counts instead of continuously monitoring the sysfs entries for
>>>> any changes.
>>>
>>> The modifications below are architecture agnostic so I really want to know what exactly we are fixing and if there is a problem.
>>
> 
> In the change set, adding support for user space to poll on the ue_count and/or ce_count sysfs attributes.
> On changes in ue_count,ce_count attributes, unblock user space poll from EDAC driver framework and user space can read the changed ce_count, ue_count.
> 
> As an example from user space perform the following steps:
>     1. Open the sysfs attribute file for UE count and CE count
>     2. Read the initial CE count and UE count
>     3. Poll on any changes on CE count, UE count fds.
>     4. Once poll unblocks, Read the updated count.
>         5.Take appropriate action on the changed counts.
> 
> #####################################################################
> Example Simple User space code Snippet:

As I understand it, all of this is already addressed in the EDAC framework through tracing. If any changes
are required, shouldn't we extend rasdaemon and show the use case to explain it better?

This is a very old link, but if you follow this patch series you will understand the tracing events in EDAC,
and the latest EDAC framework code will help.

https://lkml.indiana.edu/hypermail/linux/kernel/1205.1/01751.html
  
Adrien Thierry Sept. 29, 2023, 6:40 p.m. UTC | #4
Hi Deepti,

On Mon, Jul 31, 2023 at 03:00:59PM -0700, Deepti Jaggi wrote:
> A daemon running in user space collects information on correctable
> and uncorrectable errors from EDAC driver by reading corresponding
> sysfs entries and takes appropriate action.
> This patch adds support for user space daemon to wait on poll() until
> the sysfs entries for UE count and CE count change and then read updated
> counts instead of continuously monitoring the sysfs entries for
> any changes.
> 
> Signed-off-by: Deepti Jaggi <quic_djaggi@quicinc.com>
> ---
>  drivers/edac/edac_device.c       | 16 ++++++++++++++++
>  drivers/edac/edac_device.h       |  8 ++++++++
>  drivers/edac/edac_device_sysfs.c | 20 ++++++++++++++++++++
>  3 files changed, 44 insertions(+)
> 
> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
> index 8c4d947fb848..7b7aec4da6b9 100644
> --- a/drivers/edac/edac_device.c
> +++ b/drivers/edac/edac_device.c
> @@ -587,12 +587,20 @@ void edac_device_handle_ce_count(struct edac_device_ctl_info *edac_dev,
>  	if (instance->nr_blocks > 0) {
>  		block = instance->blocks + block_nr;
>  		block->counters.ce_count += count;
> +
> +		/* Notify block sysfs attribute change */
> +		if (block->kn_ce)
> +			sysfs_notify_dirent(block->kn_ce);
>  	}
>  
>  	/* Propagate the count up the 'totals' tree */
>  	instance->counters.ce_count += count;
>  	edac_dev->counters.ce_count += count;
>  
> +	/* Notify instance sysfs attribute change */
> +	if (instance->kn_ce)
> +		sysfs_notify_dirent(instance->kn_ce);
> +
>  	if (edac_device_get_log_ce(edac_dev))
>  		edac_device_printk(edac_dev, KERN_WARNING,
>  				   "CE: %s instance: %s block: %s count: %d '%s'\n",
> @@ -633,12 +641,20 @@ void edac_device_handle_ue_count(struct edac_device_ctl_info *edac_dev,
>  	if (instance->nr_blocks > 0) {
>  		block = instance->blocks + block_nr;
>  		block->counters.ue_count += count;
> +
> +		/* Notify block sysfs attribute change */
> +		if (block->kn_ue)
> +			sysfs_notify_dirent(block->kn_ue);
>  	}
>  
>  	/* Propagate the count up the 'totals' tree */
>  	instance->counters.ue_count += count;
>  	edac_dev->counters.ue_count += count;
>  
> +	/* Notify instance sysfs attribute change */
> +	if (instance->kn_ue)
> +		sysfs_notify_dirent(instance->kn_ue);
> +
>  	if (edac_device_get_log_ue(edac_dev))
>  		edac_device_printk(edac_dev, KERN_EMERG,
>  				   "UE: %s instance: %s block: %s count: %d '%s'\n",
> diff --git a/drivers/edac/edac_device.h b/drivers/edac/edac_device.h
> index fc2d2c218064..459514ea549e 100644
> --- a/drivers/edac/edac_device.h
> +++ b/drivers/edac/edac_device.h
> @@ -127,6 +127,10 @@ struct edac_device_block {
>  
>  	/* edac sysfs device control */
>  	struct kobject kobj;
> +
> +	/* kern fs node for block ue_count and ce count attributes*/
> +	struct kernfs_node *kn_ue;
> +	struct kernfs_node *kn_ce;
>  };
>  
>  /* device instance control structure */
> @@ -141,6 +145,10 @@ struct edac_device_instance {
>  
>  	/* edac sysfs device control */
>  	struct kobject kobj;
> +
> +	/* kern fs node for block ue_count and ce count attributes*/
> +	struct kernfs_node *kn_ue;
> +	struct kernfs_node *kn_ce;
>  };
>  
>  
> diff --git a/drivers/edac/edac_device_sysfs.c b/drivers/edac/edac_device_sysfs.c
> index 5e7593753799..d1e04a9296c7 100644
> --- a/drivers/edac/edac_device_sysfs.c
> +++ b/drivers/edac/edac_device_sysfs.c
> @@ -562,6 +562,13 @@ static int edac_device_create_block(struct edac_device_ctl_info *edac_dev,
>  	}
>  	kobject_uevent(&block->kobj, KOBJ_ADD);
>  
> +	/*
> +	 * Save kernfs pointer for ue count and ce count
> +	 * to notify from any context when attributes change
> +	 */
> +	block->kn_ue = sysfs_get_dirent(block->kobj.sd, "ue_count");
> +	block->kn_ce = sysfs_get_dirent(block->kobj.sd, "ce_count");
> +
>  	return 0;
>  
>  	/* Error unwind stack */
> @@ -594,6 +601,9 @@ static void edac_device_delete_block(struct edac_device_ctl_info *edac_dev,
>  		}
>  	}
>  
> +	block->kn_ue = NULL;
> +	block->kn_ce = NULL;
> +

Isn't there a possibility of a race condition here? It seems to me that
between the moment the attribute files are removed with
sysfs_remove_file() a few lines above, and the moment block->kn_ue and
block->kn_ce are nulled, sysfs_notify_dirent() can be called from
edac_device_handle_ce_count() with a block->kn_ce that refers to a
deleted file.

>  	/* unregister this block's kobject, SEE:
>  	 *	edac_device_ctrl_block_release() callback operation
>  	 */
> @@ -660,6 +670,13 @@ static int edac_device_create_instance(struct edac_device_ctl_info *edac_dev,
>  	edac_dbg(4, "Registered instance %d '%s' kobject\n",
>  		 idx, instance->name);
>  
> +	/*
> +	 * Save kernfs pointer for ue count and ce count
> +	 * to notify from any context when attributes change
> +	 */
> +	instance->kn_ue = sysfs_get_dirent(instance->kobj.sd, "ue_count");
> +	instance->kn_ce = sysfs_get_dirent(instance->kobj.sd, "ce_count");
> +
>  	return 0;
>  
>  	/* error unwind stack */
> @@ -682,6 +699,9 @@ static void edac_device_delete_instance(struct edac_device_ctl_info *edac_dev,
>  
>  	instance = &edac_dev->instances[idx];
>  
> +	instance->kn_ue = NULL;
> +	instance->kn_ce = NULL;
> +
>  	/* unregister all blocks in this instance */
>  	for (i = 0; i < instance->nr_blocks; i++)
>  		edac_device_delete_block(edac_dev, &instance->blocks[i]);
> -- 
> 2.31.1
>

Best,
Adrien
  

Patch

diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 8c4d947fb848..7b7aec4da6b9 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -587,12 +587,20 @@  void edac_device_handle_ce_count(struct edac_device_ctl_info *edac_dev,
 	if (instance->nr_blocks > 0) {
 		block = instance->blocks + block_nr;
 		block->counters.ce_count += count;
+
+		/* Notify block sysfs attribute change */
+		if (block->kn_ce)
+			sysfs_notify_dirent(block->kn_ce);
 	}
 
 	/* Propagate the count up the 'totals' tree */
 	instance->counters.ce_count += count;
 	edac_dev->counters.ce_count += count;
 
+	/* Notify instance sysfs attribute change */
+	if (instance->kn_ce)
+		sysfs_notify_dirent(instance->kn_ce);
+
 	if (edac_device_get_log_ce(edac_dev))
 		edac_device_printk(edac_dev, KERN_WARNING,
 				   "CE: %s instance: %s block: %s count: %d '%s'\n",
@@ -633,12 +641,20 @@  void edac_device_handle_ue_count(struct edac_device_ctl_info *edac_dev,
 	if (instance->nr_blocks > 0) {
 		block = instance->blocks + block_nr;
 		block->counters.ue_count += count;
+
+		/* Notify block sysfs attribute change */
+		if (block->kn_ue)
+			sysfs_notify_dirent(block->kn_ue);
 	}
 
 	/* Propagate the count up the 'totals' tree */
 	instance->counters.ue_count += count;
 	edac_dev->counters.ue_count += count;
 
+	/* Notify instance sysfs attribute change */
+	if (instance->kn_ue)
+		sysfs_notify_dirent(instance->kn_ue);
+
 	if (edac_device_get_log_ue(edac_dev))
 		edac_device_printk(edac_dev, KERN_EMERG,
 				   "UE: %s instance: %s block: %s count: %d '%s'\n",
diff --git a/drivers/edac/edac_device.h b/drivers/edac/edac_device.h
index fc2d2c218064..459514ea549e 100644
--- a/drivers/edac/edac_device.h
+++ b/drivers/edac/edac_device.h
@@ -127,6 +127,10 @@  struct edac_device_block {
 
 	/* edac sysfs device control */
 	struct kobject kobj;
+
+	/* kern fs node for block ue_count and ce count attributes*/
+	struct kernfs_node *kn_ue;
+	struct kernfs_node *kn_ce;
 };
 
 /* device instance control structure */
@@ -141,6 +145,10 @@  struct edac_device_instance {
 
 	/* edac sysfs device control */
 	struct kobject kobj;
+
+	/* kern fs node for block ue_count and ce count attributes*/
+	struct kernfs_node *kn_ue;
+	struct kernfs_node *kn_ce;
 };
 
 
diff --git a/drivers/edac/edac_device_sysfs.c b/drivers/edac/edac_device_sysfs.c
index 5e7593753799..d1e04a9296c7 100644
--- a/drivers/edac/edac_device_sysfs.c
+++ b/drivers/edac/edac_device_sysfs.c
@@ -562,6 +562,13 @@  static int edac_device_create_block(struct edac_device_ctl_info *edac_dev,
 	}
 	kobject_uevent(&block->kobj, KOBJ_ADD);
 
+	/*
+	 * Save kernfs pointer for ue count and ce count
+	 * to notify from any context when attributes change
+	 */
+	block->kn_ue = sysfs_get_dirent(block->kobj.sd, "ue_count");
+	block->kn_ce = sysfs_get_dirent(block->kobj.sd, "ce_count");
+
 	return 0;
 
 	/* Error unwind stack */
@@ -594,6 +601,9 @@  static void edac_device_delete_block(struct edac_device_ctl_info *edac_dev,
 		}
 	}
 
+	block->kn_ue = NULL;
+	block->kn_ce = NULL;
+
 	/* unregister this block's kobject, SEE:
 	 *	edac_device_ctrl_block_release() callback operation
 	 */
@@ -660,6 +670,13 @@  static int edac_device_create_instance(struct edac_device_ctl_info *edac_dev,
 	edac_dbg(4, "Registered instance %d '%s' kobject\n",
 		 idx, instance->name);
 
+	/*
+	 * Save kernfs pointer for ue count and ce count
+	 * to notify from any context when attributes change
+	 */
+	instance->kn_ue = sysfs_get_dirent(instance->kobj.sd, "ue_count");
+	instance->kn_ce = sysfs_get_dirent(instance->kobj.sd, "ce_count");
+
 	return 0;
 
 	/* error unwind stack */
@@ -682,6 +699,9 @@  static void edac_device_delete_instance(struct edac_device_ctl_info *edac_dev,
 
 	instance = &edac_dev->instances[idx];
 
+	instance->kn_ue = NULL;
+	instance->kn_ce = NULL;
+
 	/* unregister all blocks in this instance */
 	for (i = 0; i < instance->nr_blocks; i++)
 		edac_device_delete_block(edac_dev, &instance->blocks[i]);