[RESEND,v9,1/3] libfs: Introduce case-insensitive string comparison helper

Message ID 20240208064334.268216-2-eugen.hristev@collabora.com
State New
Headers
Series Introduce case-insensitive string comparison helper |

Commit Message

Eugen Hristev Feb. 8, 2024, 6:43 a.m. UTC
  From: Gabriel Krisman Bertazi <krisman@collabora.com>

generic_ci_match can be used by case-insensitive filesystems to compare
strings under lookup with dirents in a case-insensitive way.  This
function is currently reimplemented by each filesystem supporting
casefolding, so this reduces code duplication in filesystem-specific
code.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com>
---
 fs/libfs.c         | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  4 +++
 2 files changed, 72 insertions(+)
  

Comments

Gabriel Krisman Bertazi Feb. 8, 2024, 6:38 p.m. UTC | #1
Eugen Hristev <eugen.hristev@collabora.com> writes:

> From: Gabriel Krisman Bertazi <krisman@collabora.com>
>
> generic_ci_match can be used by case-insensitive filesystems to compare
> strings under lookup with dirents in a case-insensitive way.  This
> function is currently reimplemented by each filesystem supporting
> casefolding, so this reduces code duplication in filesystem-specific
> code.
>
> Reviewed-by: Eric Biggers <ebiggers@google.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com>

Hi Eugen,

Thanks for picking this up.  Please,  CC me in future versions.

> ---
>  fs/libfs.c         | 68 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h |  4 +++
>  2 files changed, 72 insertions(+)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index bb18884ff20e..f80cb982ac89 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1773,6 +1773,74 @@ static const struct dentry_operations generic_ci_dentry_ops = {
>  	.d_hash = generic_ci_d_hash,
>  	.d_compare = generic_ci_d_compare,
>  };
> +
> +/**
> + * generic_ci_match() - Match a name (case-insensitively) with a dirent.
> + * @parent: Inode of the parent of the dirent under comparison
> + * @name: name under lookup.
> + * @folded_name: Optional pre-folded name under lookup
> + * @de_name: Dirent name.
> + * @de_name_len: dirent name length.
> + *
> + *
> + * Test whether a case-insensitive directory entry matches the filename
> + * being searched.  If @folded_name is provided, it is used instead of
> + * recalculating the casefold of @name.

Can we add a note that this is a filesystem helper for comparison with
directory entries, and VFS' ->d_compare should use generic_ci_d_compare.

> + *
> + * Return: > 0 if the directory entry matches, 0 if it doesn't match, or
> + * < 0 on error.
> + */
> +int generic_ci_match(const struct inode *parent,
> +		     const struct qstr *name,
> +		     const struct qstr *folded_name,
> +		     const u8 *de_name, u32 de_name_len)
> +{
> +	const struct super_block *sb = parent->i_sb;
> +	const struct unicode_map *um = sb->s_encoding;
> +	struct fscrypt_str decrypted_name = FSTR_INIT(NULL, de_name_len);
> +	struct qstr dirent = QSTR_INIT(de_name, de_name_len);
> +	int res, match = false;

I know I originally wrote it this way, but match is an integer, so
let's use integers instead of false/true.

> +
> +	if (IS_ENCRYPTED(parent)) {
> +		const struct fscrypt_str encrypted_name =
> +			FSTR_INIT((u8 *) de_name, de_name_len);
> +
> +		if (WARN_ON_ONCE(!fscrypt_has_encryption_key(parent)))
> +			return -EINVAL;
> +
> +		decrypted_name.name = kmalloc(de_name_len, GFP_KERNEL);
> +		if (!decrypted_name.name)
> +			return -ENOMEM;
> +		res = fscrypt_fname_disk_to_usr(parent, 0, 0, &encrypted_name,
> +						&decrypted_name);
> +		if (res < 0)
> +			goto out;
> +		dirent.name = decrypted_name.name;
> +		dirent.len = decrypted_name.len;
> +	}
> +
> +	if (folded_name->name)
> +		res = utf8_strncasecmp_folded(um, folded_name, &dirent);
> +	else
> +		res = utf8_strncasecmp(um, name, &dirent);

Similar to

  https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git/commit/?h=for-next&id=367122c529f35b4655acbe33c0cc4d6d3b32ba71

We should be checking for an exact-match first to avoid the utf8
comparison cost unnecessarily.  The only problem is that we need to
ensure we fail for an invalid utf-8 de_name in "strict mode".

Fortunately, if folded_name->name exists, we know the name-under-lookup
was validated when initialized, so an exact-match of de_name must also
be valid.  If folded_name is NULL, though, we might either have an
invalid utf-8 dentry-under-lookup or the allocation itself failed, so we
need to utf8_validate it.

Honestly, I don't care much about this !folded_name->name case, since
utf8_strncasecmp will do the right thing and an invalid utf8 on
case-insensitive directories should be an exception, not the norm.  but
the code might get simpler if we do both:

(untested)

if (folded_name->name) {
	if (dirent.len == folded_name->len &&
	    !memcmp(folded_name->name, dirent.name, dirent.len)) {
            	res = 1;
		goto out;
        }
	res = utf8_strncasecmp_folded(um, folded_name, &dirent);
} else {
	if (dirent.len == name->len &&
	    !memcmp(name->name, dirent.name, dirent.len) &&
            (!sb_has_strict_encoding(sb) || !utf8_validate(um, name))) {
            	res = 1;
		goto out;
        }
	res = utf8_strncasecmp(um, name, &dirent);
}

> +
> +	if (!res)
> +		match = true;
> +	else if (res < 0 && !sb_has_strict_encoding(sb)) {
> +		/*
> +		 * In non-strict mode, fallback to a byte comparison if
> +		 * the names have invalid characters.
> +		 */
> +		res = 0;
> +		match = ((name->len == dirent.len) &&
> +			 !memcmp(name->name, dirent.name, dirent.len));
> +	}

This goes away entirely.

> +
> +out:
> +	kfree(decrypted_name.name);
> +	return (res >= 0) ? match : res;

and this becomes:

return res;
  
Eugen Hristev Feb. 9, 2024, 10:30 a.m. UTC | #2
On 2/8/24 20:38, Gabriel Krisman Bertazi wrote:
> Eugen Hristev <eugen.hristev@collabora.com> writes:
> 
>> From: Gabriel Krisman Bertazi <krisman@collabora.com>
>>
>> generic_ci_match can be used by case-insensitive filesystems to compare
>> strings under lookup with dirents in a case-insensitive way.  This
>> function is currently reimplemented by each filesystem supporting
>> casefolding, so this reduces code duplication in filesystem-specific
>> code.
>>
>> Reviewed-by: Eric Biggers <ebiggers@google.com>
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com>
> 
> Hi Eugen,
> 
> Thanks for picking this up.  Please,  CC me in future versions.

Hello Gabriel,

Thanks for reviewing :)
> 
>> ---
>>  fs/libfs.c         | 68 ++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/fs.h |  4 +++
>>  2 files changed, 72 insertions(+)
>>
>> diff --git a/fs/libfs.c b/fs/libfs.c
>> index bb18884ff20e..f80cb982ac89 100644
>> --- a/fs/libfs.c
>> +++ b/fs/libfs.c
>> @@ -1773,6 +1773,74 @@ static const struct dentry_operations generic_ci_dentry_ops = {
>>  	.d_hash = generic_ci_d_hash,
>>  	.d_compare = generic_ci_d_compare,
>>  };
>> +
>> +/**
>> + * generic_ci_match() - Match a name (case-insensitively) with a dirent.
>> + * @parent: Inode of the parent of the dirent under comparison
>> + * @name: name under lookup.
>> + * @folded_name: Optional pre-folded name under lookup
>> + * @de_name: Dirent name.
>> + * @de_name_len: dirent name length.
>> + *
>> + *
>> + * Test whether a case-insensitive directory entry matches the filename
>> + * being searched.  If @folded_name is provided, it is used instead of
>> + * recalculating the casefold of @name.
> 
> Can we add a note that this is a filesystem helper for comparison with
> directory entries, and VFS' ->d_compare should use generic_ci_d_compare.
> 
>> + *
>> + * Return: > 0 if the directory entry matches, 0 if it doesn't match, or
>> + * < 0 on error.
>> + */
>> +int generic_ci_match(const struct inode *parent,
>> +		     const struct qstr *name,
>> +		     const struct qstr *folded_name,
>> +		     const u8 *de_name, u32 de_name_len)
>> +{
>> +	const struct super_block *sb = parent->i_sb;
>> +	const struct unicode_map *um = sb->s_encoding;
>> +	struct fscrypt_str decrypted_name = FSTR_INIT(NULL, de_name_len);
>> +	struct qstr dirent = QSTR_INIT(de_name, de_name_len);
>> +	int res, match = false;
> 
> I know I originally wrote it this way, but match is an integer, so
> let's use integers instead of false/true.

With the changes below, 'match' is no longer needed
> 
>> +
>> +	if (IS_ENCRYPTED(parent)) {
>> +		const struct fscrypt_str encrypted_name =
>> +			FSTR_INIT((u8 *) de_name, de_name_len);
>> +
>> +		if (WARN_ON_ONCE(!fscrypt_has_encryption_key(parent)))
>> +			return -EINVAL;
>> +
>> +		decrypted_name.name = kmalloc(de_name_len, GFP_KERNEL);
>> +		if (!decrypted_name.name)
>> +			return -ENOMEM;
>> +		res = fscrypt_fname_disk_to_usr(parent, 0, 0, &encrypted_name,
>> +						&decrypted_name);
>> +		if (res < 0)
>> +			goto out;
>> +		dirent.name = decrypted_name.name;
>> +		dirent.len = decrypted_name.len;
>> +	}
>> +
>> +	if (folded_name->name)
>> +		res = utf8_strncasecmp_folded(um, folded_name, &dirent);
>> +	else
>> +		res = utf8_strncasecmp(um, name, &dirent);
> 
> Similar to
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git/commit/?h=for-next&id=367122c529f35b4655acbe33c0cc4d6d3b32ba71
> 
> We should be checking for an exact-match first to avoid the utf8
> comparison cost unnecessarily.  The only problem is that we need to
> ensure we fail for an invalid utf-8 de_name in "strict mode".
> 
> Fortunately, if folded_name->name exists, we know the name-under-lookup
> was validated when initialized, so an exact-match of de_name must also
> be valid.  If folded_name is NULL, though, we might either have an
> invalid utf-8 dentry-under-lookup or the allocation itself failed, so we
> need to utf8_validate it.
> 
> Honestly, I don't care much about this !folded_name->name case, since
> utf8_strncasecmp will do the right thing and an invalid utf8 on
> case-insensitive directories should be an exception, not the norm.  but
> the code might get simpler if we do both:
> 
> (untested)

I implemented your suggestion, but any idea about testing ? I ran smoke on xfstests
and it appears to be fine, but maybe some specific test case might try the
different paths here ?

Let me know,
Eugen
> 
> if (folded_name->name) {
> 	if (dirent.len == folded_name->len &&
> 	    !memcmp(folded_name->name, dirent.name, dirent.len)) {
>             	res = 1;
> 		goto out;
>         }
> 	res = utf8_strncasecmp_folded(um, folded_name, &dirent);
> } else {
> 	if (dirent.len == name->len &&
> 	    !memcmp(name->name, dirent.name, dirent.len) &&
>             (!sb_has_strict_encoding(sb) || !utf8_validate(um, name))) {
>             	res = 1;
> 		goto out;
>         }
> 	res = utf8_strncasecmp(um, name, &dirent);
> }
> 
>> +
>> +	if (!res)
>> +		match = true;
>> +	else if (res < 0 && !sb_has_strict_encoding(sb)) {
>> +		/*
>> +		 * In non-strict mode, fallback to a byte comparison if
>> +		 * the names have invalid characters.
>> +		 */
>> +		res = 0;
>> +		match = ((name->len == dirent.len) &&
>> +			 !memcmp(name->name, dirent.name, dirent.len));
>> +	}
> 
> This goes away entirely.
> 
>> +
>> +out:
>> +	kfree(decrypted_name.name);
>> +	return (res >= 0) ? match : res;
> 
> and this becomes:
> 
> return res;
>
  
Gabriel Krisman Bertazi Feb. 9, 2024, 2:40 p.m. UTC | #3
Eugen Hristev <eugen.hristev@collabora.com> writes:

> On 2/8/24 20:38, Gabriel Krisman Bertazi wrote:

>> (untested)
>
> I implemented your suggestion, but any idea about testing ? I ran smoke on xfstests
> and it appears to be fine, but maybe some specific test case might try the
> different paths here ?

Other than running the fstests quick group for each affected filesystems
looking for regressions, the way I'd do it is create a few files and
look them up with exact and inexact name matches.  While doing that,
observe through bpftrace which functions got called and what they
returned.

Here, since you are testing the uncached lookup, you want to make sure
to drop the cached version prior to each lookup.
  
Eugen Hristev Feb. 13, 2024, 4:44 a.m. UTC | #4
On 2/9/24 16:40, Gabriel Krisman Bertazi wrote:
> Eugen Hristev <eugen.hristev@collabora.com> writes:
> 
>> On 2/8/24 20:38, Gabriel Krisman Bertazi wrote:
> 
>>> (untested)
>>
>> I implemented your suggestion, but any idea about testing ? I ran smoke on xfstests
>> and it appears to be fine, but maybe some specific test case might try the
>> different paths here ?
> 
> Other than running the fstests quick group for each affected filesystems
> looking for regressions, the way I'd do it is create a few files and
> look them up with exact and inexact name matches.  While doing that,
> observe through bpftrace which functions got called and what they
> returned.
> 
> Here, since you are testing the uncached lookup, you want to make sure
> to drop the cached version prior to each lookup.
> 


Hello Gabriel,

With the changes you suggested, I get these errors now :

[  107.409410] EXT4-fs error (device sda1): ext4_lookup:1816: inode #521217: comm
ls: 'CUC' linked to parent dir
ls: cannot access '/media/CI_dir/CUC': Structure needs cleaning
total 8
drwxr-xr-x 2 root root 4096 Feb 12 11:51 .
drwxr-xr-x 4 root root 4096 Feb 12 11:47 ..
-????????? ? ?    ?       ?            ? CUC

Do you have any idea about what is wrong ?

Thanks,
Eugen
  
Gabriel Krisman Bertazi Feb. 13, 2024, 4:09 p.m. UTC | #5
Eugen Hristev <eugen.hristev@collabora.com> writes:

> On 2/9/24 16:40, Gabriel Krisman Bertazi wrote:
>> Eugen Hristev <eugen.hristev@collabora.com> writes:
> With the changes you suggested, I get these errors now :
>
> [  107.409410] EXT4-fs error (device sda1): ext4_lookup:1816: inode #521217: comm
> ls: 'CUC' linked to parent dir
> ls: cannot access '/media/CI_dir/CUC': Structure needs cleaning
> total 8
> drwxr-xr-x 2 root root 4096 Feb 12 11:51 .
> drwxr-xr-x 4 root root 4096 Feb 12 11:47 ..
> -????????? ? ?    ?       ?            ? CUC
>
> Do you have any idea about what is wrong ?

Hm, there's a bug somewhere. The lookup got broken and ls got an error.
Did you debug it a bit?  can you share the code and a reproducer?

From a quick look at the example I suggested, utf8_strncasecmp* return 0
on match, but ext4_match should return true when matched. So remember to
negate the output:

..
res = !utf8_strncasecmp(um, name, &dirent);
..
  

Patch

diff --git a/fs/libfs.c b/fs/libfs.c
index bb18884ff20e..f80cb982ac89 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1773,6 +1773,74 @@  static const struct dentry_operations generic_ci_dentry_ops = {
 	.d_hash = generic_ci_d_hash,
 	.d_compare = generic_ci_d_compare,
 };
+
+/**
+ * generic_ci_match() - Match a name (case-insensitively) with a dirent.
+ * @parent: Inode of the parent of the dirent under comparison
+ * @name: name under lookup.
+ * @folded_name: Optional pre-folded name under lookup
+ * @de_name: Dirent name.
+ * @de_name_len: dirent name length.
+ *
+ *
+ * Test whether a case-insensitive directory entry matches the filename
+ * being searched.  If @folded_name is provided, it is used instead of
+ * recalculating the casefold of @name.
+ *
+ * Return: > 0 if the directory entry matches, 0 if it doesn't match, or
+ * < 0 on error.
+ */
+int generic_ci_match(const struct inode *parent,
+		     const struct qstr *name,
+		     const struct qstr *folded_name,
+		     const u8 *de_name, u32 de_name_len)
+{
+	const struct super_block *sb = parent->i_sb;
+	const struct unicode_map *um = sb->s_encoding;
+	struct fscrypt_str decrypted_name = FSTR_INIT(NULL, de_name_len);
+	struct qstr dirent = QSTR_INIT(de_name, de_name_len);
+	int res, match = false;
+
+	if (IS_ENCRYPTED(parent)) {
+		const struct fscrypt_str encrypted_name =
+			FSTR_INIT((u8 *) de_name, de_name_len);
+
+		if (WARN_ON_ONCE(!fscrypt_has_encryption_key(parent)))
+			return -EINVAL;
+
+		decrypted_name.name = kmalloc(de_name_len, GFP_KERNEL);
+		if (!decrypted_name.name)
+			return -ENOMEM;
+		res = fscrypt_fname_disk_to_usr(parent, 0, 0, &encrypted_name,
+						&decrypted_name);
+		if (res < 0)
+			goto out;
+		dirent.name = decrypted_name.name;
+		dirent.len = decrypted_name.len;
+	}
+
+	if (folded_name->name)
+		res = utf8_strncasecmp_folded(um, folded_name, &dirent);
+	else
+		res = utf8_strncasecmp(um, name, &dirent);
+
+	if (!res)
+		match = true;
+	else if (res < 0 && !sb_has_strict_encoding(sb)) {
+		/*
+		 * In non-strict mode, fallback to a byte comparison if
+		 * the names have invalid characters.
+		 */
+		res = 0;
+		match = ((name->len == dirent.len) &&
+			 !memcmp(name->name, dirent.name, dirent.len));
+	}
+
+out:
+	kfree(decrypted_name.name);
+	return (res >= 0) ? match : res;
+}
+EXPORT_SYMBOL(generic_ci_match);
 #endif
 
 #ifdef CONFIG_FS_ENCRYPTION
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 820b93b2917f..7af691ff8d44 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3296,6 +3296,10 @@  extern int generic_file_fsync(struct file *, loff_t, loff_t, int);
 extern int generic_check_addressable(unsigned, u64);
 
 extern void generic_set_encrypted_ci_d_ops(struct dentry *dentry);
+extern int generic_ci_match(const struct inode *parent,
+			    const struct qstr *name,
+			    const struct qstr *folded_name,
+			    const u8 *de_name, u32 de_name_len);
 
 static inline bool sb_has_encoding(const struct super_block *sb)
 {