From patchwork Thu Dec 14 15:06:53 2023
X-Patchwork-Submitter: Alvin Šipraga
X-Patchwork-Id: 178775
From: Alvin Šipraga
Date: Thu, 14 Dec 2023 16:06:53 +0100
Subject: [PATCH v2] get_maintainer: correctly parse UTF-8 encoded names in files
Message-Id: <20231214-get-maintainers-utf8-v2-1-b188dc7042a4@bang-olufsen.dk>
To: Joe Perches, Linus Torvalds
Cc: Duje Mihanović, Konstantin Ryabitsev, linux-kernel@vger.kernel.org, Alvin Šipraga

While the script correctly extracts UTF-8 encoded names from the
MAINTAINERS file, the regular expressions damage my name when parsing
from .yaml files. Fix this by replacing the Latin-1-compatible regular
expressions with the unicode property matcher \p{L}, which matches on
any letter according to the Unicode General Category of letters.
It's also necessary to instruct Perl to open all files with UTF-8
encoding.

The issue was also identified on the tools mailing list [1]. This should
solve the observed side effects there as well.

Link: https://lore.kernel.org/tools/20230726-gush-slouching-a5cd41@meerkat/ [1]
Signed-off-by: Alvin Šipraga
---
Changes in v2:
- use '\p{L}' rather than '\p{Latin}', so that matching is even more
  inclusive (i.e. match also Greek letters, CJK, etc.)
- fix commit message to refer to tools mailing list, not b4 mailing list
- Link to v1: https://lore.kernel.org/r/20231014-get-maintainers-utf8-v1-1-3af8c7aeb239@bang-olufsen.dk
---
 scripts/get_maintainer.pl | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

---
base-commit: 70f8c6f8f8800d970b10676cceae42bba51a4899
change-id: 20231014-get-maintainers-utf8-32c65c4d6f8a

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index ab123b498fd9..344d0cda9854 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -20,6 +20,7 @@ use Getopt::Long qw(:config no_auto_abbrev);
 use Cwd;
 use File::Find;
 use File::Spec::Functions;
+use open qw(:std :encoding(UTF-8));
 
 my $cur_path = fastgetcwd() . '/';
 my $lk_path = "./";
@@ -442,7 +443,7 @@ sub maintainers_in_file {
 	my $text = do { local($/) ; <$f> };
 	close($f);
 
-	my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+	my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
 	push(@file_emails, clean_file_emails(@poss_addr));
     }
 }
@@ -2460,13 +2461,13 @@ sub clean_file_emails {
 	    $name = "";
 	}
 
-	my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
+	my @nw = split(/[^\p{L}\'\,\.\+-]/, $name);
 	if (@nw > 2) {
 	    my $first = $nw[@nw - 3];
 	    my $middle = $nw[@nw - 2];
 	    my $last = $nw[@nw - 1];
 
-	    if (((length($first) == 1 && $first =~ m/[A-Za-z]/) ||
+	    if (((length($first) == 1 && $first =~ m/\p{L}/) ||
 		 (length($first) == 2 && substr($first, -1) eq ".")) ||
 		(length($middle) == 1 ||
 		 (length($middle) == 2 && substr($middle, -1) eq "."))) {
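
For illustration only (not part of the patch): a minimal standalone sketch of why the old Latin-1 range fails on a name like mine and why the input has to be decoded first. The helper name_words() and the variable names are hypothetical. The sketch uses "use utf8" so that the literal name in the source is a character string, which is an assumption of this example and not something the script itself does. The two character classes are copied from the split() call in clean_file_emails().

```perl
#!/usr/bin/perl
# Illustrative sketch only; name_words() is a hypothetical helper.
use strict;
use warnings;
use utf8;                              # literals in this file are UTF-8 (sketch-only assumption)
use open qw(:std :encoding(UTF-8));    # same pragma the patch adds; here it makes STDOUT UTF-8
use Encode qw(encode);

# Separator classes from the split() call, before and after the patch.
my $old_sep = qr/[^A-Za-zÀ-ÿ\'\,\.\+-]/;
my $new_sep = qr/[^\p{L}\'\,\.\+-]/;

# Split a name into its non-empty words using the given separator class.
sub name_words {
    my ($name, $sep) = @_;
    return grep { length } split($sep, $name);
}

my $name = "Alvin Šipraga";

# "Š" is U+0160, outside the À-ÿ range (U+00C0..U+00FF), so the old class
# treats it as a separator and the surname loses its first letter.
my @old = name_words($name, $old_sep);
print join("|", @old), "\n";           # prints "Alvin|ipraga"

# \p{L} matches any Unicode letter, so the surname survives intact.
my @new = name_words($name, $new_sep);
print join("|", @new), "\n";           # prints "Alvin|Šipraga"

# Without decoding, the regex sees raw UTF-8 bytes rather than characters,
# and even \p{L} cannot keep the surname whole.
my $bytes = encode("UTF-8", $name);
my @raw = name_words($bytes, $new_sep);
print "last word from undecoded bytes: $raw[-1]\n";
```

The last case is the reason for the "use open" line in the first hunk: \p{L} only helps once the file contents have been decoded into characters.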