[v3,5/5] selftests/resctrl: Add non-contiguous CBMs CAT test

Message ID 647fbfd449f8b0e0ad6cfe58bb280ff44ee162b8.1706180726.git.maciej.wieczor-retman@intel.com
State New
Headers
Series selftests/resctrl: Add non-contiguous CBMs in Intel CAT selftest |

Commit Message

Maciej Wieczor-Retman Jan. 25, 2024, 11:13 a.m. UTC
  Add tests for both L2 and L3 CAT to verify the return values
generated by writing non-contiguous CBMs don't contradict the
reported non-contiguous support information.

Use a logical XOR to confirm return value of write_schemata() and
non-contiguous CBMs support information match.

Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
---
Changelog v3:
- Roll back __cpuid_count part. (Reinette)
- Update function name to read sparse_masks file.
- Roll back get_cache_level() changes.
- Add ksft_print_msg() to contiguous schemata write error handling
  (Reinette).

Changelog v2:
- Redo the patch message. (Ilpo)
- Tidy up __cpuid_count calls. (Ilpo)
- Remove redundant AND in noncont_mask calculations (Ilpo)
- Fix bit_center offset.
- Add newline before function return. (Ilpo)
- Group non-contiguous tests with CAT tests. (Ilpo)
- Use a helper for reading sparse_masks file. (Ilpo)
- Make get_cache_level() available in other source files. (Ilpo)

 tools/testing/selftests/resctrl/cat_test.c    | 81 +++++++++++++++++++
 tools/testing/selftests/resctrl/resctrl.h     |  2 +
 .../testing/selftests/resctrl/resctrl_tests.c |  2 +
 3 files changed, 85 insertions(+)
  

Comments

Reinette Chatre Jan. 26, 2024, 9:10 p.m. UTC | #1
Hi Maciej,

On 1/25/2024 3:13 AM, Maciej Wieczor-Retman wrote:
> Add tests for both L2 and L3 CAT to verify the return values
> generated by writing non-contiguous CBMs don't contradict the
> reported non-contiguous support information.
> 
> Use a logical XOR to confirm return value of write_schemata() and
> non-contiguous CBMs support information match.
> 
> Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
> ---
> Changelog v3:
> - Roll back __cpuid_count part. (Reinette)
> - Update function name to read sparse_masks file.
> - Roll back get_cache_level() changes.
> - Add ksft_print_msg() to contiguous schemata write error handling
>   (Reinette).
> 
> Changelog v2:
> - Redo the patch message. (Ilpo)
> - Tidy up __cpuid_count calls. (Ilpo)
> - Remove redundant AND in noncont_mask calculations (Ilpo)
> - Fix bit_center offset.
> - Add newline before function return. (Ilpo)
> - Group non-contiguous tests with CAT tests. (Ilpo)
> - Use a helper for reading sparse_masks file. (Ilpo)
> - Make get_cache_level() available in other source files. (Ilpo)
> 
>  tools/testing/selftests/resctrl/cat_test.c    | 81 +++++++++++++++++++
>  tools/testing/selftests/resctrl/resctrl.h     |  2 +
>  .../testing/selftests/resctrl/resctrl_tests.c |  2 +
>  3 files changed, 85 insertions(+)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
> index 39fc9303b8e8..9086bf359072 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -294,6 +294,71 @@ static int cat_run_test(const struct resctrl_test *test, const struct user_param
>  	return ret;
>  }
>  
> +static int noncont_cat_run_test(const struct resctrl_test *test,
> +				const struct user_params *uparams)
> +{
> +	unsigned long full_cache_mask, cont_mask, noncont_mask;
> +	unsigned int eax, ebx, ecx, edx, ret, sparse_masks;
> +	char schemata[64];
> +	int bit_center;
> +
> +	/* Check to compare sparse_masks content to CPUID output. */
> +	ret = resource_info_unsigned_get(test->resource, "sparse_masks", &sparse_masks);
> +	if (ret)
> +		return ret;
> +
> +	if (!strcmp(test->resource, "L3"))
> +		__cpuid_count(0x10, 1, eax, ebx, ecx, edx);
> +	else if (!strcmp(test->resource, "L2"))
> +		__cpuid_count(0x10, 2, eax, ebx, ecx, edx);
> +	else
> +		return -EINVAL;
> +
> +	if (sparse_masks != ((ecx >> 3) & 1)) {
> +		ksft_print_msg("CPUID output doesn't match 'sparse_masks' file content!\n");
> +		return -1;

If I understand correctly this falls into the "test failure" [1] category
and should return 1? ...

> +	}
> +
> +	/* Write checks initialization. */
> +	ret = get_full_cbm(test->resource, &full_cache_mask);
> +	if (ret < 0)
> +		return ret;
> +	bit_center = count_bits(full_cache_mask) / 2;
> +	cont_mask = full_cache_mask >> bit_center;
> +
> +	/* Contiguous mask write check. */
> +	snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
> +	ret = write_schemata("", schemata, uparams->cpu, test->resource);
> +	if (ret) {
> +		ksft_print_msg("Write of contiguous CBM failed\n");
> +		return ret;

.. although here I think the goal to distinguish between test error and test failure
falls apart since it is not possible to tell within the test if the failure is
because of error in the test or if test failed.

Reinette

[1] https://lore.kernel.org/all/33787043-5823-6de4-4e5c-a24a136ba541@linux.intel.com/
  
Maciej Wieczor-Retman Jan. 31, 2024, 12:55 p.m. UTC | #2
Hi!

On 2024-01-26 at 13:10:18 -0800, Reinette Chatre wrote:
>Hi Maciej,
>
>On 1/25/2024 3:13 AM, Maciej Wieczor-Retman wrote:
>> Add tests for both L2 and L3 CAT to verify the return values
>> generated by writing non-contiguous CBMs don't contradict the
>> reported non-contiguous support information.
>> 
>> Use a logical XOR to confirm return value of write_schemata() and
>> non-contiguous CBMs support information match.
>> 
>> Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
>> ---
>> Changelog v3:
>> - Roll back __cpuid_count part. (Reinette)
>> - Update function name to read sparse_masks file.
>> - Roll back get_cache_level() changes.
>> - Add ksft_print_msg() to contiguous schemata write error handling
>>   (Reinette).
>> 
>> Changelog v2:
>> - Redo the patch message. (Ilpo)
>> - Tidy up __cpuid_count calls. (Ilpo)
>> - Remove redundant AND in noncont_mask calculations (Ilpo)
>> - Fix bit_center offset.
>> - Add newline before function return. (Ilpo)
>> - Group non-contiguous tests with CAT tests. (Ilpo)
>> - Use a helper for reading sparse_masks file. (Ilpo)
>> - Make get_cache_level() available in other source files. (Ilpo)
>> 
>>  tools/testing/selftests/resctrl/cat_test.c    | 81 +++++++++++++++++++
>>  tools/testing/selftests/resctrl/resctrl.h     |  2 +
>>  .../testing/selftests/resctrl/resctrl_tests.c |  2 +
>>  3 files changed, 85 insertions(+)
>> 
>> diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
>> index 39fc9303b8e8..9086bf359072 100644
>> --- a/tools/testing/selftests/resctrl/cat_test.c
>> +++ b/tools/testing/selftests/resctrl/cat_test.c
>> @@ -294,6 +294,71 @@ static int cat_run_test(const struct resctrl_test *test, const struct user_param
>>  	return ret;
>>  }
>>  
>> +static int noncont_cat_run_test(const struct resctrl_test *test,
>> +				const struct user_params *uparams)
>> +{
>> +	unsigned long full_cache_mask, cont_mask, noncont_mask;
>> +	unsigned int eax, ebx, ecx, edx, ret, sparse_masks;
>> +	char schemata[64];
>> +	int bit_center;
>> +
>> +	/* Check to compare sparse_masks content to CPUID output. */
>> +	ret = resource_info_unsigned_get(test->resource, "sparse_masks", &sparse_masks);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (!strcmp(test->resource, "L3"))
>> +		__cpuid_count(0x10, 1, eax, ebx, ecx, edx);
>> +	else if (!strcmp(test->resource, "L2"))
>> +		__cpuid_count(0x10, 2, eax, ebx, ecx, edx);
>> +	else
>> +		return -EINVAL;
>> +
>> +	if (sparse_masks != ((ecx >> 3) & 1)) {
>> +		ksft_print_msg("CPUID output doesn't match 'sparse_masks' file content!\n");
>> +		return -1;
>
>If I understand correctly this falls into the "test failure" [1] category
>and should return 1? ...
>
>> +	}
>> +
>> +	/* Write checks initialization. */
>> +	ret = get_full_cbm(test->resource, &full_cache_mask);
>> +	if (ret < 0)
>> +		return ret;
>> +	bit_center = count_bits(full_cache_mask) / 2;
>> +	cont_mask = full_cache_mask >> bit_center;
>> +
>> +	/* Contiguous mask write check. */
>> +	snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>> +	ret = write_schemata("", schemata, uparams->cpu, test->resource);
>> +	if (ret) {
>> +		ksft_print_msg("Write of contiguous CBM failed\n");
>> +		return ret;
>
>... although here I think the goal to distinguish between test error and test failure
>falls apart since it is not possible to tell within the test if the failure is
>because of error in the test or if test failed.

Is there even a distinction between test error and failure in resctrl selftest?
I've been looking at it for a while and can't find any instances where
ksft_test_result_error() would be used. Everywhere I look it's either pass or
fail. By grep-ing over all selftests I found only five tests that use
ksft_test_result_error().

Furthermore there is this one "TODO" in kselftests.h:

	/* TODO: how does "error" differ from "fail" or "skip"? */

If you meant the distintion less literally then I'd say the sparse_masks
comparison to CPUID would be a failure. What I had in mind is that it tries to
validate a resctrl interface relevant to non-contiguous CBMs. If it fails
there is probably something wrong with the code concerning non-contiguous CBMs.
On the other hand writing contiguous CBMs shouldn't fail as far as the
non-contiguous CBMs in CAT test is concerned. So if that fails there might be
something wrong on a higher level and I'd say that can be more of an error than
a failure.

But I'm just saying how I undestood it so far. If there is some clear
distinction between error and failure definitions I could try to separate it
more explicitly.

>
>Reinette
>
>[1] https://lore.kernel.org/all/33787043-5823-6de4-4e5c-a24a136ba541@linux.intel.com/
>
  
Reinette Chatre Feb. 1, 2024, 7:47 p.m. UTC | #3
Hi Maciej,

On 1/31/2024 4:55 AM, Maciej Wieczor-Retman wrote:
> On 2024-01-26 at 13:10:18 -0800, Reinette Chatre wrote:
>> On 1/25/2024 3:13 AM, Maciej Wieczor-Retman wrote:
>>> Add tests for both L2 and L3 CAT to verify the return values
>>> generated by writing non-contiguous CBMs don't contradict the
>>> reported non-contiguous support information.
>>>
>>> Use a logical XOR to confirm return value of write_schemata() and
>>> non-contiguous CBMs support information match.
>>>
>>> Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
>>> ---
>>> Changelog v3:
>>> - Roll back __cpuid_count part. (Reinette)
>>> - Update function name to read sparse_masks file.
>>> - Roll back get_cache_level() changes.
>>> - Add ksft_print_msg() to contiguous schemata write error handling
>>>   (Reinette).
>>>
>>> Changelog v2:
>>> - Redo the patch message. (Ilpo)
>>> - Tidy up __cpuid_count calls. (Ilpo)
>>> - Remove redundant AND in noncont_mask calculations (Ilpo)
>>> - Fix bit_center offset.
>>> - Add newline before function return. (Ilpo)
>>> - Group non-contiguous tests with CAT tests. (Ilpo)
>>> - Use a helper for reading sparse_masks file. (Ilpo)
>>> - Make get_cache_level() available in other source files. (Ilpo)
>>>
>>>  tools/testing/selftests/resctrl/cat_test.c    | 81 +++++++++++++++++++
>>>  tools/testing/selftests/resctrl/resctrl.h     |  2 +
>>>  .../testing/selftests/resctrl/resctrl_tests.c |  2 +
>>>  3 files changed, 85 insertions(+)
>>>
>>> diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
>>> index 39fc9303b8e8..9086bf359072 100644
>>> --- a/tools/testing/selftests/resctrl/cat_test.c
>>> +++ b/tools/testing/selftests/resctrl/cat_test.c
>>> @@ -294,6 +294,71 @@ static int cat_run_test(const struct resctrl_test *test, const struct user_param
>>>  	return ret;
>>>  }
>>>  
>>> +static int noncont_cat_run_test(const struct resctrl_test *test,
>>> +				const struct user_params *uparams)
>>> +{
>>> +	unsigned long full_cache_mask, cont_mask, noncont_mask;
>>> +	unsigned int eax, ebx, ecx, edx, ret, sparse_masks;
>>> +	char schemata[64];
>>> +	int bit_center;
>>> +
>>> +	/* Check to compare sparse_masks content to CPUID output. */
>>> +	ret = resource_info_unsigned_get(test->resource, "sparse_masks", &sparse_masks);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	if (!strcmp(test->resource, "L3"))
>>> +		__cpuid_count(0x10, 1, eax, ebx, ecx, edx);
>>> +	else if (!strcmp(test->resource, "L2"))
>>> +		__cpuid_count(0x10, 2, eax, ebx, ecx, edx);
>>> +	else
>>> +		return -EINVAL;
>>> +
>>> +	if (sparse_masks != ((ecx >> 3) & 1)) {
>>> +		ksft_print_msg("CPUID output doesn't match 'sparse_masks' file content!\n");
>>> +		return -1;
>>
>> If I understand correctly this falls into the "test failure" [1] category
>> and should return 1? ...
>>
>>> +	}
>>> +
>>> +	/* Write checks initialization. */
>>> +	ret = get_full_cbm(test->resource, &full_cache_mask);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +	bit_center = count_bits(full_cache_mask) / 2;
>>> +	cont_mask = full_cache_mask >> bit_center;
>>> +
>>> +	/* Contiguous mask write check. */
>>> +	snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>> +	ret = write_schemata("", schemata, uparams->cpu, test->resource);
>>> +	if (ret) {
>>> +		ksft_print_msg("Write of contiguous CBM failed\n");
>>> +		return ret;
>>
>> ... although here I think the goal to distinguish between test error and test failure
>> falls apart since it is not possible to tell within the test if the failure is
>> because of error in the test or if test failed.
> 
> Is there even a distinction between test error and failure in resctrl selftest?

There is such a distinction in the current tests (and from what I understand the reason
behind the logical XOR used in this test) . In existing tests the running of
the test precedes and is clearly separate from determining of the test pass/fail.
All the current tests have a clear "run the test" phase where data is collected to
a file, followed by an analysis (aka "check results") phase that looks at collected
data to determine if the test passes or fails.
Note how all the "check results" return either 0 or 1 to indicate test pass
or fail respectively. Specifically, you can refer to:
mbm_test.c->check_results()
mba_test.c->check_results()
cmt_test.c->check_results()
cat_test.c->check_results()

> I've been looking at it for a while and can't find any instances where
> ksft_test_result_error() would be used. Everywhere I look it's either pass or
> fail. By grep-ing over all selftests I found only five tests that use
> ksft_test_result_error().

Yes, from the user perspective there is no such distinction. This seems to
be entirely internal to the resctrl selftests (but I do not think that this
should or can be a hard requirement).

> 
> Furthermore there is this one "TODO" in kselftests.h:
> 
> 	/* TODO: how does "error" differ from "fail" or "skip"? */
> 
> If you meant the distintion less literally then I'd say the sparse_masks
> comparison to CPUID would be a failure. What I had in mind is that it tries to
> validate a resctrl interface relevant to non-contiguous CBMs. If it fails
> there is probably something wrong with the code concerning non-contiguous CBMs.

Wrong with which code? As I understand this particular check compares the
resctrl view of the world to the hardware realities. If this check fails
then I do not think this is an issue with the test code (which would make it a test
error) but instead a resctrl bug and thus a test failure.

> On the other hand writing contiguous CBMs shouldn't fail as far as the
> non-contiguous CBMs in CAT test is concerned. So if that fails there might be
> something wrong on a higher level and I'd say that can be more of an error than
> a failure.

I think that the write_schemata() can fail for a variety of reasons, some may
indicate an issue with the test while some may indicate an issue with resctrl.
It is not possible for the caller of write_schemata() to distinguish.

> But I'm just saying how I undestood it so far. If there is some clear
> distinction between error and failure definitions I could try to separate it
> more explicitly.

I do not think it is possible to clearly distinguish between error and failure.
These are already lumped together as a ksft_test_result_fail() anyway so no
risk of confusion to folks just running the tests.
I think the final test result may be confusing to folks parsing the
resctrl selftest internals:

	run_single_test()
	{
		...
		ret = test->run_test(test, uparams);
		ksft_test_result(!ret, "%s: test\n", test->name);
		...
	}

above means that a test returning negative or greater than zero value is
considered a test failure and resctrl tests may return either in the case of
an actual test failure ... but from user perspective there is no difference
so I do not think it is an issue, just lack of consistency in the resctrl
test internals in cases like write_schemata() failure where a possible
test fail is captured as a test error. 

I do not think it is required to be strict here. Keeping "test returns
negative or greater than zero on test failure" seems reasonable to me.

Reinette
  
Maciej Wieczor-Retman Feb. 2, 2024, 10:17 a.m. UTC | #4
Hello!

On 2024-02-01 at 11:47:44 -0800, Reinette Chatre wrote:
>Hi Maciej,
>
>On 1/31/2024 4:55 AM, Maciej Wieczor-Retman wrote:
>> On 2024-01-26 at 13:10:18 -0800, Reinette Chatre wrote:
>>> On 1/25/2024 3:13 AM, Maciej Wieczor-Retman wrote:
>>>> +	if (sparse_masks != ((ecx >> 3) & 1)) {
>>>> +		ksft_print_msg("CPUID output doesn't match 'sparse_masks' file content!\n");
>>>> +		return -1;
>>>
>>> If I understand correctly this falls into the "test failure" [1] category
>>> and should return 1? ...
>>>
>>>> +	}
>>>> +
>>>> +	/* Write checks initialization. */
>>>> +	ret = get_full_cbm(test->resource, &full_cache_mask);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +	bit_center = count_bits(full_cache_mask) / 2;
>>>> +	cont_mask = full_cache_mask >> bit_center;
>>>> +
>>>> +	/* Contiguous mask write check. */
>>>> +	snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>>> +	ret = write_schemata("", schemata, uparams->cpu, test->resource);
>>>> +	if (ret) {
>>>> +		ksft_print_msg("Write of contiguous CBM failed\n");
>>>> +		return ret;
>>>
>>> ... although here I think the goal to distinguish between test error and test failure
>>> falls apart since it is not possible to tell within the test if the failure is
>>> because of error in the test or if test failed.
>> 
>> Is there even a distinction between test error and failure in resctrl selftest?
>
>There is such a distinction in the current tests (and from what I understand the reason
>behind the logical XOR used in this test) . In existing tests the running of
>the test precedes and is clearly separate from determining of the test pass/fail.
>All the current tests have a clear "run the test" phase where data is collected to
>a file, followed by an analysis (aka "check results") phase that looks at collected
>data to determine if the test passes or fails.
>Note how all the "check results" return either 0 or 1 to indicate test pass
>or fail respectively. Specifically, you can refer to:
>mbm_test.c->check_results()
>mba_test.c->check_results()
>cmt_test.c->check_results()
>cat_test.c->check_results()
>
>> I've been looking at it for a while and can't find any instances where
>> ksft_test_result_error() would be used. Everywhere I look it's either pass or
>> fail. By grep-ing over all selftests I found only five tests that use
>> ksft_test_result_error().
>
>Yes, from the user perspective there is no such distinction. This seems to
>be entirely internal to the resctrl selftests (but I do not think that this
>should or can be a hard requirement).

Okay, thank you, that's what I wanted to know.

>
>> 
>> Furthermore there is this one "TODO" in kselftests.h:
>> 
>> 	/* TODO: how does "error" differ from "fail" or "skip"? */
>> 
>> If you meant the distintion less literally then I'd say the sparse_masks
>> comparison to CPUID would be a failure. What I had in mind is that it tries to
>> validate a resctrl interface relevant to non-contiguous CBMs. If it fails
>> there is probably something wrong with the code concerning non-contiguous CBMs.
>
>Wrong with which code? As I understand this particular check compares the
>resctrl view of the world to the hardware realities. If this check fails
>then I do not think this is an issue with the test code (which would make it a test
>error) but instead a resctrl bug and thus a test failure.

I also meant a resctrl bug. I was thinking about the kernel resctrl code that
handles taking the CPUID information about non-contiguous CBMs and putting it in
the sparse_masks file.

If there was a hardware problem and CPUID returned wrong information, then the
check wouldn't fail as sparse_masks relies on CPUID too and both values would
match. So in view of this I thought that this check could make sure that the
resctrl kernel code handles CPUID returned information properly.

So should this check be moved from the "run the test" phase to the end of the
function ("check results" phase) to signify that it's not an error but a
failure?

>
>> On the other hand writing contiguous CBMs shouldn't fail as far as the
>> non-contiguous CBMs in CAT test is concerned. So if that fails there might be
>> something wrong on a higher level and I'd say that can be more of an error than
>> a failure.
>
>I think that the write_schemata() can fail for a variety of reasons, some may
>indicate an issue with the test while some may indicate an issue with resctrl.
>It is not possible for the caller of write_schemata() to distinguish.
>
>> But I'm just saying how I undestood it so far. If there is some clear
>> distinction between error and failure definitions I could try to separate it
>> more explicitly.
>
>I do not think it is possible to clearly distinguish between error and failure.
>These are already lumped together as a ksft_test_result_fail() anyway so no
>risk of confusion to folks just running the tests.
>I think the final test result may be confusing to folks parsing the
>resctrl selftest internals:
>
>	run_single_test()
>	{
>		...
>		ret = test->run_test(test, uparams);
>		ksft_test_result(!ret, "%s: test\n", test->name);
>		...
>	}
>
>above means that a test returning negative or greater than zero value is
>considered a test failure and resctrl tests may return either in the case of
>an actual test failure ... but from user perspective there is no difference
>so I do not think it is an issue, just lack of consistency in the resctrl
>test internals in cases like write_schemata() failure where a possible
>test fail is captured as a test error. 
>
>I do not think it is required to be strict here. Keeping "test returns
>negative or greater than zero on test failure" seems reasonable to me.

Okay, so the approach I applied in noncont_cat_run_test() with write_schemata()
is acceptable?

>
>Reinette
  
Reinette Chatre Feb. 2, 2024, 5:10 p.m. UTC | #5
Hi Maciej,

On 2/2/2024 2:17 AM, Maciej Wieczor-Retman wrote:
> On 2024-02-01 at 11:47:44 -0800, Reinette Chatre wrote:
>> Hi Maciej,
>>
>> On 1/31/2024 4:55 AM, Maciej Wieczor-Retman wrote:
>>> On 2024-01-26 at 13:10:18 -0800, Reinette Chatre wrote:
>>>> On 1/25/2024 3:13 AM, Maciej Wieczor-Retman wrote:
>>>>> +	if (sparse_masks != ((ecx >> 3) & 1)) {
>>>>> +		ksft_print_msg("CPUID output doesn't match 'sparse_masks' file content!\n");
>>>>> +		return -1;
>>>>
>>>> If I understand correctly this falls into the "test failure" [1] category
>>>> and should return 1? ...
>>>>
>>>>> +	}
>>>>> +
>>>>> +	/* Write checks initialization. */
>>>>> +	ret = get_full_cbm(test->resource, &full_cache_mask);
>>>>> +	if (ret < 0)
>>>>> +		return ret;
>>>>> +	bit_center = count_bits(full_cache_mask) / 2;
>>>>> +	cont_mask = full_cache_mask >> bit_center;
>>>>> +
>>>>> +	/* Contiguous mask write check. */
>>>>> +	snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>>>> +	ret = write_schemata("", schemata, uparams->cpu, test->resource);
>>>>> +	if (ret) {
>>>>> +		ksft_print_msg("Write of contiguous CBM failed\n");
>>>>> +		return ret;
>>>>
>>>> ... although here I think the goal to distinguish between test error and test failure
>>>> falls apart since it is not possible to tell within the test if the failure is
>>>> because of error in the test or if test failed.
>>>
>>> Is there even a distinction between test error and failure in resctrl selftest?
>>
>> There is such a distinction in the current tests (and from what I understand the reason
>> behind the logical XOR used in this test) . In existing tests the running of
>> the test precedes and is clearly separate from determining of the test pass/fail.
>> All the current tests have a clear "run the test" phase where data is collected to
>> a file, followed by an analysis (aka "check results") phase that looks at collected
>> data to determine if the test passes or fails.
>> Note how all the "check results" return either 0 or 1 to indicate test pass
>> or fail respectively. Specifically, you can refer to:
>> mbm_test.c->check_results()
>> mba_test.c->check_results()
>> cmt_test.c->check_results()
>> cat_test.c->check_results()
>>
>>> I've been looking at it for a while and can't find any instances where
>>> ksft_test_result_error() would be used. Everywhere I look it's either pass or
>>> fail. By grep-ing over all selftests I found only five tests that use
>>> ksft_test_result_error().
>>
>> Yes, from the user perspective there is no such distinction. This seems to
>> be entirely internal to the resctrl selftests (but I do not think that this
>> should or can be a hard requirement).
> 
> Okay, thank you, that's what I wanted to know.
> 
>>
>>>
>>> Furthermore there is this one "TODO" in kselftests.h:
>>>
>>> 	/* TODO: how does "error" differ from "fail" or "skip"? */
>>>
>>> If you meant the distintion less literally then I'd say the sparse_masks
>>> comparison to CPUID would be a failure. What I had in mind is that it tries to
>>> validate a resctrl interface relevant to non-contiguous CBMs. If it fails
>>> there is probably something wrong with the code concerning non-contiguous CBMs.
>>
>> Wrong with which code? As I understand this particular check compares the
>> resctrl view of the world to the hardware realities. If this check fails
>> then I do not think this is an issue with the test code (which would make it a test
>> error) but instead a resctrl bug and thus a test failure.
> 
> I also meant a resctrl bug. I was thinking about the kernel resctrl code that
> handles taking the CPUID information about non-contiguous CBMs and putting it in
> the sparse_masks file.
> 
> If there was a hardware problem and CPUID returned wrong information, then the
> check wouldn't fail as sparse_masks relies on CPUID too and both values would
> match. So in view of this I thought that this check could make sure that the
> resctrl kernel code handles CPUID returned information properly.
> 
> So should this check be moved from the "run the test" phase to the end of the
> function ("check results" phase) to signify that it's not an error but a
> failure?

I do not think this test matches the "run" and "check" phases of previous tests,
unless you create a new test for every scenario checked within this test.

Just returning 1 when the check (if (sparse_masks != ((ecx >> 3) & 1))) fails
should be ok, no?

>>> On the other hand writing contiguous CBMs shouldn't fail as far as the
>>> non-contiguous CBMs in CAT test is concerned. So if that fails there might be
>>> something wrong on a higher level and I'd say that can be more of an error than
>>> a failure.
>>
>> I think that the write_schemata() can fail for a variety of reasons, some may
>> indicate an issue with the test while some may indicate an issue with resctrl.
>> It is not possible for the caller of write_schemata() to distinguish.
>>
>>> But I'm just saying how I undestood it so far. If there is some clear
>>> distinction between error and failure definitions I could try to separate it
>>> more explicitly.
>>
>> I do not think it is possible to clearly distinguish between error and failure.
>> These are already lumped together as a ksft_test_result_fail() anyway so no
>> risk of confusion to folks just running the tests.
>> I think the final test result may be confusing to folks parsing the
>> resctrl selftest internals:
>>
>> 	run_single_test()
>> 	{
>> 		...
>> 		ret = test->run_test(test, uparams);
>> 		ksft_test_result(!ret, "%s: test\n", test->name);
>> 		...
>> 	}
>>
>> above means that a test returning negative or greater than zero value is
>> considered a test failure and resctrl tests may return either in the case of
>> an actual test failure ... but from user perspective there is no difference
>> so I do not think it is an issue, just lack of consistency in the resctrl
>> test internals in cases like write_schemata() failure where a possible
>> test fail is captured as a test error. 
>>
>> I do not think it is required to be strict here. Keeping "test returns
>> negative or greater than zero on test failure" seems reasonable to me.
> 
> Okay, so the approach I applied in noncont_cat_run_test() with write_schemata()
> is acceptable?

In general I'd say a write_schemata() failure's return code will be acceptable,
but you should be consistent in this test. There are two write_schemata()
calls in this test, one treats an error return as a failure and the other treats
an error return as an error. Considering this inconsistency I would thus rather
suggest that you always treat write_schemata() error return as a test failure.

Reinette
  

Patch

diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
index 39fc9303b8e8..9086bf359072 100644
--- a/tools/testing/selftests/resctrl/cat_test.c
+++ b/tools/testing/selftests/resctrl/cat_test.c
@@ -294,6 +294,71 @@  static int cat_run_test(const struct resctrl_test *test, const struct user_param
 	return ret;
 }
 
+static int noncont_cat_run_test(const struct resctrl_test *test,
+				const struct user_params *uparams)
+{
+	unsigned long full_cache_mask, cont_mask, noncont_mask;
+	unsigned int eax, ebx, ecx, edx, ret, sparse_masks;
+	char schemata[64];
+	int bit_center;
+
+	/* Check to compare sparse_masks content to CPUID output. */
+	ret = resource_info_unsigned_get(test->resource, "sparse_masks", &sparse_masks);
+	if (ret)
+		return ret;
+
+	if (!strcmp(test->resource, "L3"))
+		__cpuid_count(0x10, 1, eax, ebx, ecx, edx);
+	else if (!strcmp(test->resource, "L2"))
+		__cpuid_count(0x10, 2, eax, ebx, ecx, edx);
+	else
+		return -EINVAL;
+
+	if (sparse_masks != ((ecx >> 3) & 1)) {
+		ksft_print_msg("CPUID output doesn't match 'sparse_masks' file content!\n");
+		return -1;
+	}
+
+	/* Write checks initialization. */
+	ret = get_full_cbm(test->resource, &full_cache_mask);
+	if (ret < 0)
+		return ret;
+	bit_center = count_bits(full_cache_mask) / 2;
+	cont_mask = full_cache_mask >> bit_center;
+
+	/* Contiguous mask write check. */
+	snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
+	ret = write_schemata("", schemata, uparams->cpu, test->resource);
+	if (ret) {
+		ksft_print_msg("Write of contiguous CBM failed\n");
+		return ret;
+	}
+
+	/*
+	 * Non-contiguous mask write check. CBM has a 0xf hole approximately in the middle.
+	 * Output is compared with support information to catch any edge case errors.
+	 */
+	noncont_mask = ~(0xf << (bit_center - 2)) & full_cache_mask;
+	snprintf(schemata, sizeof(schemata), "%lx", noncont_mask);
+	ret = write_schemata("", schemata, uparams->cpu, test->resource);
+	if (ret && sparse_masks)
+		ksft_print_msg("Non-contiguous CBMs supported but write of non-contiguous CBM failed\n");
+	else if (ret && !sparse_masks)
+		ksft_print_msg("Non-contiguous CBMs not supported and write of non-contiguous CBM failed as expected\n");
+	else if (!ret && !sparse_masks)
+		ksft_print_msg("Non-contiguous CBMs not supported but write of non-contiguous CBM succeeded\n");
+
+	return !ret == !sparse_masks;
+}
+
+static bool noncont_cat_feature_check(const struct resctrl_test *test)
+{
+	if (!resctrl_resource_exists(test->resource))
+		return false;
+
+	return resource_info_file_exists(test->resource, "sparse_masks");
+}
+
 struct resctrl_test l3_cat_test = {
 	.name = "L3_CAT",
 	.group = "CAT",
@@ -301,3 +366,19 @@  struct resctrl_test l3_cat_test = {
 	.feature_check = test_resource_feature_check,
 	.run_test = cat_run_test,
 };
+
+struct resctrl_test l3_noncont_cat_test = {
+	.name = "L3_NONCONT_CAT",
+	.group = "CAT",
+	.resource = "L3",
+	.feature_check = noncont_cat_feature_check,
+	.run_test = noncont_cat_run_test,
+};
+
+struct resctrl_test l2_noncont_cat_test = {
+	.name = "L2_NONCONT_CAT",
+	.group = "CAT",
+	.resource = "L2",
+	.feature_check = noncont_cat_feature_check,
+	.run_test = noncont_cat_run_test,
+};
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index c39105f46da9..8cb97f278459 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -210,5 +210,7 @@  extern struct resctrl_test mbm_test;
 extern struct resctrl_test mba_test;
 extern struct resctrl_test cmt_test;
 extern struct resctrl_test l3_cat_test;
+extern struct resctrl_test l3_noncont_cat_test;
+extern struct resctrl_test l2_noncont_cat_test;
 
 #endif /* RESCTRL_H */
diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c b/tools/testing/selftests/resctrl/resctrl_tests.c
index 3044179ee6e9..f3dc1b9696e7 100644
--- a/tools/testing/selftests/resctrl/resctrl_tests.c
+++ b/tools/testing/selftests/resctrl/resctrl_tests.c
@@ -19,6 +19,8 @@  static struct resctrl_test *resctrl_tests[] = {
 	&mba_test,
 	&cmt_test,
 	&l3_cat_test,
+	&l3_noncont_cat_test,
+	&l2_noncont_cat_test,
 };
 
 static int detect_vendor(void)