Message ID | 20231014-get-maintainers-utf8-v1-1-3af8c7aeb239@bang-olufsen.dk |
---|---|
State | New |
Headers |
From: =?utf-8?q?Alvin_=C5=A0ipraga?= <alvin@pqrs.dk> Date: Sat, 14 Oct 2023 19:22:44 +0200 Subject: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in files MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit Message-Id: <20231014-get-maintainers-utf8-v1-1-3af8c7aeb239@bang-olufsen.dk> To: Joe Perches <joe@perches.com> Cc: =?utf-8?q?Duje_Mihanovi=C4=87?= <duje.mihanovic@skole.hr>, Konstantin Ryabitsev <konstantin@linuxfoundation.org>, linux-kernel@vger.kernel.org, =?utf-8?q?Alvin_=C5=A0ipraga?= <alsi@bang-olufsen.dk> X-Mailer: b4 0.12.3 List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
get_maintainer: correctly parse UTF-8 encoded names in files
|
|
Commit Message
Alvin Šipraga
Oct. 14, 2023, 5:22 p.m. UTC
From: Alvin Šipraga <alsi@bang-olufsen.dk>

While the script correctly extracts UTF-8 encoded names from the
MAINTAINERS file, the regular expressions damage my name when parsing
from .yaml files. Fix this by replacing the Latin-1-compatible regular
expressions with the unicode property matcher \p{Latin}. It's also
necessary to instruct Perl to open all files with UTF-8 encoding.

The issue was also identified on the b4 mailing list [1]. This should
solve the observed side effects there as well.

Link: https://lore.kernel.org/all/20230726-gush-slouching-a5cd41@meerkat/ [1]
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
---
 scripts/get_maintainer.pl | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

---
base-commit: 70f8c6f8f8800d970b10676cceae42bba51a4899
change-id: 20231014-get-maintainers-utf8-32c65c4d6f8a
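To illustrate the commit message, here is a minimal sketch (not taken from get_maintainer.pl) of why a character class limited to the Latin-1 range cannot match this name once the text has been decoded. The sample string and variable names are hypothetical; the letter in question is U+0160, which lies outside U+00C0..U+00FF.

```perl
use strict;
use warnings;

# Hypothetical sample: "\x{160}" is U+0160 LATIN CAPITAL LETTER S WITH CARON.
my $name = "Alvin \x{160}ipraga";

# Old-style class: ASCII letters plus the Latin-1 letter range U+00C0..U+00FF.
my $latin1_ok = $name =~ /^[A-Za-z\x{C0}-\x{FF} ]+$/ ? 1 : 0;

# Script-based class as proposed in this patch.
my $latin_ok = $name =~ /^[\p{Latin} ]+$/ ? 1 : 0;

printf "Latin-1 range: %d, \\p{Latin}: %d\n", $latin1_ok, $latin_ok;
```

The first match fails because U+0160 belongs to Latin Extended-A, while the script property covers it.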
Comments
On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> From: Alvin Šipraga <alsi@bang-olufsen.dk>
>
> While the script correctly extracts UTF-8 encoded names from the
> MAINTAINERS file, the regular expressions damage my name when parsing
> from .yaml files. Fix this by replacing the Latin-1-compatible regular
> expressions with the unicode property matcher \p{Latin}. It's also
> necessary to instruct Perl to open all files with UTF-8 encoding.
>
> The issue was also identified on the b4 mailing list [1]. This should
> solve the observed side effects there as well.
>
> Link: https://lore.kernel.org/all/20230726-gush-slouching-a5cd41@meerkat/ [1]
> Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
> ---
> scripts/get_maintainer.pl | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)

Tested-by: Duje Mihanović <duje.mihanovic@skole.hr>
On Mon, 2023-10-16 at 16:37 +0200, Duje Mihanović wrote:
> On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> > From: Alvin Šipraga <alsi@bang-olufsen.dk>
> >
> > While the script correctly extracts UTF-8 encoded names from the
> > MAINTAINERS file, the regular expressions damage my name when parsing
> > from .yaml files. Fix this by replacing the Latin-1-compatible regular
> > expressions with the unicode property matcher \p{Latin}.

Well, OK

> > It's also
> > necessary to instruct Perl to open all files with UTF-8 encoding.

But I'm not at all sure this is actually desired.
Hi Joe,

On Mon, Oct 16, 2023 at 03:17:56PM -0700, Joe Perches wrote:
> On Mon, 2023-10-16 at 16:37 +0200, Duje Mihanović wrote:
> > On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> > > From: Alvin Šipraga <alsi@bang-olufsen.dk>
> > >
> > > While the script correctly extracts UTF-8 encoded names from the
> > > MAINTAINERS file, the regular expressions damage my name when parsing
> > > from .yaml files. Fix this by replacing the Latin-1-compatible regular
> > > expressions with the unicode property matcher \p{Latin}.
>
> Well, OK
>
> > > It's also
> > > necessary to instruct Perl to open all files with UTF-8 encoding.
>
> But I'm not at all sure this is actually desired.

The whole patch, or just this last part?

Regarding the last part, it's necessary because Perl defaults to opening files
with (I think) Latin-1/ISO-8859-1, and this prevents the script from correctly
parsing UTF-8 encoded strings. It seemed the most practical solution was to just
open everything as UTF-8, including stdin/out.

Are you worried that this will cause breakage elsewhere? Indeed, while Latin-1
and UTF-8 both have the same encoding for printable ASCII, the former is not a
strict subset of the latter. But I assumed that UTF-8 would be being used
everywhere in the source tree.

Now I did a check to see if that is the case using the encguess tool. See below.
It is a basic test but it seems that the vast majority of the tree is ASCII or
UTF-8.

For your reference, below is also test sequence that shows the different results
with/without my patch, and with modifications to the encoding Perl uses when
opening files. I hope you reconsider.

Kind regards,
Alvin

----8<--------- FILE ENCODINGS IN THE TREE -------8<-------------

linux $ make mrproper
linux $ find . -type f -not -path './.git/*' \
        | parallel encguess \
        | grep -v -e US-ASCII -e UTF-8 \
        > out.txt
linux $ head -n 2 out.txt # output is <file> <detected encoding>
./tools/include/linux/nmi.h unknown
./tools/testing/selftests/tc-testing/plugins/__init__.py unknown
linux $ cat out.txt | cut -f1 | xargs wc
   0 0 0 ./tools/include/linux/nmi.h
   # comment: this file is empty so encguess says unknown; ditto the others
   0 0 0 ./tools/testing/selftests/tc-testing/plugins/__init__.py
   0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/processor.h
   0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/ppc-opcode.h
   0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/firmware.h
   0 0 0 ./tools/testing/selftests/powerpc/primitives/linux/stringify.h
   0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/processor.h
   0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/kasan.h
   0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/feature-fixups.h
   0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/asm-compat.h
   0 0 0 ./tools/testing/kunit/test_data/test_insufficient_memory.log
   66 168 1668 ./tools/perf/util/top.h
   # comment: has a console escape sequence in macro CONSOLE_CLEAR
   0 0 0 ./tools/perf/util/help-unknown-cmd.h
   334 1950 141644 ./tools/perf/tests/pe-file.exe.debug
   58 594 75595 ./tools/perf/tests/pe-file.exe
   # comment: these are binary files
   0 0 0 ./tools/virtio/linux/hrtimer.h
   0 0 0 ./tools/virtio/generated/autoconf.h
   0 0 0 ./tools/virtio/crypto/hash.h
   0 0 0 ./tools/build/tests/ex/empty/Build
   252 1088 5563 ./arch/m68k/hp300/hp300map.map
   # comment: seems deliberately crafted, probably OK to ignore
   0 0 0 ./arch/riscv/Kconfig.debug
   0 0 0 ./drivers/s390/crypto/zcrypt_cex2c.h
   0 0 0 ./drivers/s390/crypto/zcrypt_cex2c.c
   0 0 0 ./drivers/s390/crypto/zcrypt_cex2a.h
   0 0 0 ./drivers/s390/crypto/zcrypt_cex2a.c
   0 0 0 ./drivers/staging/axis-fifo/README
   358 1709 12218 ./drivers/tty/vt/defkeymap.map
   # comment: seems deliberately crafted, probably OK to ignore
   0 0 0 ./drivers/gpu/drm/ci/xfails/virtio_gpu-none-flakes.txt
   0 0 0 ./drivers/gpu/drm/ci/xfails/mediatek-mt8173-flakes.txt
   89 482 16335 ./Documentation/images/logo.gif
   # comment: this is an image
   0 0 0 ./Documentation/devicetree/bindings/media/s5p-mfc.txt
   0 0 0 ./scripts/dummy-tools/dummy-plugin-dir/include/plugin-version.h
   1190 6057 254726 total


----8<--------- TEST SEQUENCE FOR THIS PATCH -----8<-------------

# fetch reference patch which exhibits this issue
# => name is corrupted
linux $ git checkout master
linux $ b4 shazam -P _ 20231014-alvin-clk-si5351-no-pll-reset-v4-1-a3567024007d@bang-olufsen.dk
...
Applying: dt-bindings: clock: si5351: convert to yaml
linux $ git format-patch HEAD^
0001-dt-bindings-clock-si5351-convert-to-yaml.patch
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi
grep: (standard input): binary file matches
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
" ipraga" <alsi@bang-olufsen.dk> (in file)


# apply my patch to get_maintainer.pl
# => name is OK
linux $ b4 shazam 20231014-get-maintainers-utf8-v1-1-3af8c7aeb239@bang-olufsen.dk
...
Applying: get_maintainer: correctly parse UTF-8 encoded names in files
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
Alvin Šipraga <alsi@bang-olufsen.dk> (in file)


# remove 'use open qw(:std :encoding(UTF-8))'
# => name is still corrupted, slightly differently
linux $ sed -i '/^use open/d' -i ./scripts/get_maintainer.pl
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
ipraga <alsi@bang-olufsen.dk> (in file)


# remove only the :std part
# => name is OK(?), but perl complains about wide char
linux $ git restore .
linux $ sed -i 's/:std //' -i ./scripts/get_maintainer.pl
linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
Wide character in print at ./scripts/get_maintainer.pl line 2522.
Alvin Šipraga <alsi@bang-olufsen.dk> (in file)
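The encoding point argued above can be sketched in isolation. This is a minimal illustration, not code from the script: it assumes the core Encode module and a hypothetical byte string, and it decodes by hand what a `use open qw(:std :encoding(UTF-8))` layer would do for file handles.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the raw UTF-8 bytes of "Šipraga" as read from a file
# opened without any encoding layer. U+0160 is encoded as 0xC5 0xA0.
my $bytes = "\xC5\xA0ipraga";

# Without decoding, Perl sees one character per byte.
my $as_bytes = length($bytes);

# After decoding, the two bytes collapse into the single character U+0160.
my $chars    = decode('UTF-8', $bytes);
my $as_chars = length($chars);

# Encoding again restores the original bytes, so nothing is lost on output
# as long as the output side is encoded too.
my $round_trip = encode('UTF-8', $chars) eq $bytes ? 1 : 0;

print "bytes=$as_bytes chars=$as_chars round_trip=$round_trip\n";
```

A regular expression only sees the name as letters in the decoded form. Printing a decoded string to a handle that has no encoding layer is what produces the "Wide character in print" warning shown in the last step of the test sequence.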
Hi again,

Sorry to be a nuisance, but could you please have another look below and
reconsider this patch? Otherwise NAK is fine, but I wanted to follow up
on this as it solves an actual, albeit minor, issue for people with
unusual names when sending and receiving patches.

Thanks!

Kind regards,
Alvin

On Mon, Oct 16, 2023 at 11:56:32PM +0000, Alvin Šipraga wrote:
> Hi Joe,
>
> On Mon, Oct 16, 2023 at 03:17:56PM -0700, Joe Perches wrote:
> > On Mon, 2023-10-16 at 16:37 +0200, Duje Mihanović wrote:
> > > On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> > > > From: Alvin Šipraga <alsi@bang-olufsen.dk>
> > > >
> > > > While the script correctly extracts UTF-8 encoded names from the
> > > > MAINTAINERS file, the regular expressions damage my name when parsing
> > > > from .yaml files. Fix this by replacing the Latin-1-compatible regular
> > > > expressions with the unicode property matcher \p{Latin}.
> >
> > Well, OK
> >
> > > > It's also
> > > > necessary to instruct Perl to open all files with UTF-8 encoding.
> >
> > But I'm not at all sure this is actually desired.
>
> The whole patch, or just this last part?
>
> Regarding the last part, it's necessary because Perl defaults to opening files
> with (I think) Latin-1/ISO-8859-1, and this prevents the script from correctly
> parsing UTF-8 encoded strings. It seemed the most practical solution was to just
> open everything as UTF-8, including stdin/out.
>
> Are you worried that this will cause breakage elsewhere? Indeed, while Latin-1
> and UTF-8 both have the same encoding for printable ASCII, the former is not a
> strict subset of the latter. But I assumed that UTF-8 would be being used
> everywhere in the source tree.
>
> Now I did a check to see if that is the case using the encguess tool. See below.
> It is a basic test but it seems that the vast majority of the tree is ASCII or
> UTF-8.
>
> For your reference, below is also test sequence that shows the different results
> with/without my patch, and with modifications to the encoding Perl uses when
> opening files. I hope you reconsider.
>
> Kind regards,
> Alvin
>
> ----8<--------- FILE ENCODINGS IN THE TREE -------8<-------------
>
> linux $ make mrproper
> linux $ find . -type f -not -path './.git/*' \
>         | parallel encguess \
>         | grep -v -e US-ASCII -e UTF-8 \
>         > out.txt
> linux $ head -n 2 out.txt # output is <file> <detected encoding>
> ./tools/include/linux/nmi.h unknown
> ./tools/testing/selftests/tc-testing/plugins/__init__.py unknown
> linux $ cat out.txt | cut -f1 | xargs wc
>    0 0 0 ./tools/include/linux/nmi.h
>    # comment: this file is empty so encguess says unknown; ditto the others
>    0 0 0 ./tools/testing/selftests/tc-testing/plugins/__init__.py
>    0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/processor.h
>    0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/ppc-opcode.h
>    0 0 0 ./tools/testing/selftests/powerpc/primitives/asm/firmware.h
>    0 0 0 ./tools/testing/selftests/powerpc/primitives/linux/stringify.h
>    0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/processor.h
>    0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/kasan.h
>    0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/feature-fixups.h
>    0 0 0 ./tools/testing/selftests/powerpc/copyloops/asm/asm-compat.h
>    0 0 0 ./tools/testing/kunit/test_data/test_insufficient_memory.log
>    66 168 1668 ./tools/perf/util/top.h
>    # comment: has a console escape sequence in macro CONSOLE_CLEAR
>    0 0 0 ./tools/perf/util/help-unknown-cmd.h
>    334 1950 141644 ./tools/perf/tests/pe-file.exe.debug
>    58 594 75595 ./tools/perf/tests/pe-file.exe
>    # comment: these are binary files
>    0 0 0 ./tools/virtio/linux/hrtimer.h
>    0 0 0 ./tools/virtio/generated/autoconf.h
>    0 0 0 ./tools/virtio/crypto/hash.h
>    0 0 0 ./tools/build/tests/ex/empty/Build
>    252 1088 5563 ./arch/m68k/hp300/hp300map.map
>    # comment: seems deliberately crafted, probably OK to ignore
>    0 0 0 ./arch/riscv/Kconfig.debug
>    0 0 0 ./drivers/s390/crypto/zcrypt_cex2c.h
>    0 0 0 ./drivers/s390/crypto/zcrypt_cex2c.c
>    0 0 0 ./drivers/s390/crypto/zcrypt_cex2a.h
>    0 0 0 ./drivers/s390/crypto/zcrypt_cex2a.c
>    0 0 0 ./drivers/staging/axis-fifo/README
>    358 1709 12218 ./drivers/tty/vt/defkeymap.map
>    # comment: seems deliberately crafted, probably OK to ignore
>    0 0 0 ./drivers/gpu/drm/ci/xfails/virtio_gpu-none-flakes.txt
>    0 0 0 ./drivers/gpu/drm/ci/xfails/mediatek-mt8173-flakes.txt
>    89 482 16335 ./Documentation/images/logo.gif
>    # comment: this is an image
>    0 0 0 ./Documentation/devicetree/bindings/media/s5p-mfc.txt
>    0 0 0 ./scripts/dummy-tools/dummy-plugin-dir/include/plugin-version.h
>    1190 6057 254726 total
>
>
> ----8<--------- TEST SEQUENCE FOR THIS PATCH -----8<-------------
>
> # fetch reference patch which exhibits this issue
> # => name is corrupted
> linux $ git checkout master
> linux $ b4 shazam -P _ 20231014-alvin-clk-si5351-no-pll-reset-v4-1-a3567024007d@bang-olufsen.dk
> ...
> Applying: dt-bindings: clock: si5351: convert to yaml
> linux $ git format-patch HEAD^
> 0001-dt-bindings-clock-si5351-convert-to-yaml.patch
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi
> grep: (standard input): binary file matches
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> " ipraga" <alsi@bang-olufsen.dk> (in file)
>
>
> # apply my patch to get_maintainer.pl
> # => name is OK
> linux $ b4 shazam 20231014-get-maintainers-utf8-v1-1-3af8c7aeb239@bang-olufsen.dk
> ...
> Applying: get_maintainer: correctly parse UTF-8 encoded names in files
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> Alvin Šipraga <alsi@bang-olufsen.dk> (in file)
>
>
> # remove 'use open qw(:std :encoding(UTF-8))'
> # => name is still corrupted, slightly differently
> linux $ sed -i '/^use open/d' -i ./scripts/get_maintainer.pl
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> ipraga <alsi@bang-olufsen.dk> (in file)
>
>
> # remove only the :std part
> # => name is OK(?), but perl complains about wide char
> linux $ git restore .
> linux $ sed -i 's/:std //' -i ./scripts/get_maintainer.pl
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> Wide character in print at ./scripts/get_maintainer.pl line 2522.
> Alvin Šipraga <alsi@bang-olufsen.dk> (in file)
On Wed, 13 Dec 2023 at 17:06, Alvin Šipraga <ALSI@bang-olufsen.dk> wrote:
>
> Sorry to be a nuisance, but could you please have another look below and
> reconsider this patch? Otherwise NAK is fine, but I wanted to follow up
> on this as it solves an actual, albeit minor, issue for people with
> unusual names when sending and receiving patches.

The patch seems bogus, because it shouldn't have any "Latin" encoding
issues at all.

Opening as utf8 makes sense, but the "Latin" part of the regular
expressions seem bogus.

IOW, isn't '\p{L}' the right pattern for a "letter"? Isn't that what
we actually care about here?

Replacing one locale bug with just another locale bug seems pointless.

Linus
On Wed, Dec 13, 2023 at 05:41:59PM -0800, Linus Torvalds wrote:
> On Wed, 13 Dec 2023 at 17:06, Alvin Šipraga <ALSI@bang-olufsen.dk> wrote:
> >
> > Sorry to be a nuisance, but could you please have another look below and
> > reconsider this patch? Otherwise NAK is fine, but I wanted to follow up
> > on this as it solves an actual, albeit minor, issue for people with
> > unusual names when sending and receiving patches.
>
> The patch seems bogus, because it shouldn't have any "Latin" encoding
> issues at all.
>
> Opening as utf8 makes sense, but the "Latin" part of the regular
> expressions seem bogus.
>
> IOW, isn't '\p{L}' the right pattern for a "letter"? Isn't that what
> we actually care about here?

Yes, you have a point, I was being too conservative with the choice of
'\p{Latin}'. I will send a v2 using '\p{L}'.

>
> Replacing one locale bug with just another locale bug seems pointless.

Thanks for the review!

Kind regards,
Alvin
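The difference between the two properties discussed in this exchange can be shown with a small sketch. The sample names below are hypothetical and written as code point escapes so the result does not depend on the source file's encoding; this is an illustration only, not the v2 patch.

```perl
use strict;
use warnings;

# Hypothetical sample names, one per script.
my %samples = (
    latin    => "\x{160}ipraga",                 # Latin, starts with U+0160
    cyrillic => "\x{418}\x{432}\x{430}\x{43D}",  # Cyrillic letters
);

my %result;
for my $key (sort keys %samples) {
    my $s = $samples{$key};
    $result{$key} = {
        latin  => ($s =~ /^\p{Latin}+$/ ? 1 : 0),  # script property
        letter => ($s =~ /^\p{L}+$/     ? 1 : 0),  # general category "Letter"
    };
    printf "%s: Latin=%d L=%d\n",
        $key, $result{$key}{latin}, $result{$key}{letter};
}
```

\p{Latin} accepts only letters of the Latin script, so the Cyrillic sample fails it, whereas \p{L} accepts letters of any script.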
diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index ab123b498fd9..cb78e11623a6 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -20,6 +20,7 @@ use Getopt::Long qw(:config no_auto_abbrev);
 use Cwd;
 use File::Find;
 use File::Spec::Functions;
+use open qw(:std :encoding(UTF-8));
 
 my $cur_path = fastgetcwd() . '/';
 my $lk_path = "./";
@@ -442,7 +443,7 @@ sub maintainers_in_file {
 	my $text = do { local($/) ; <$f> };
 	close($f);
 
-	my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+	my @poss_addr = $text =~ m$[\p{Latin}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
 	push(@file_emails, clean_file_emails(@poss_addr));
     }
 }
@@ -2460,13 +2461,13 @@ sub clean_file_emails {
 	    $name = "";
 	}
 
-	my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
+	my @nw = split(/[^\p{Latin}\'\,\.\+-]/, $name);
 	if (@nw > 2) {
 	    my $first = $nw[@nw - 3];
 	    my $middle = $nw[@nw - 2];
 	    my $last = $nw[@nw - 1];
 
-	    if (((length($first) == 1 && $first =~ m/[A-Za-z]/) ||
+	    if (((length($first) == 1 && $first =~ m/\p{Latin}/) ||
 		 (length($first) == 2 && substr($first, -1) eq ".")) ||
 		(length($middle) == 1 ||
 		 (length($middle) == 2 && substr($middle, -1) eq "."))) {