From patchwork Tue Dec 19 01:25:14 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Alvin_=C5=A0ipraga?= X-Patchwork-Id: 180703 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:24d3:b0:fb:cd0c:d3e with SMTP id r19csp1647926dyi; Mon, 18 Dec 2023 17:26:25 -0800 (PST) X-Google-Smtp-Source: AGHT+IFztajtzZkUdz/KrYwpTMRIkt7Jh/rCbr25x/UEVXq6AyhAmmQ9kj5VlzexXRHP8Es44xqd X-Received: by 2002:a05:6214:19c4:b0:67f:2d42:b21f with SMTP id j4-20020a05621419c400b0067f2d42b21fmr7823536qvc.120.1702949184869; Mon, 18 Dec 2023 17:26:24 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1702949184; cv=none; d=google.com; s=arc-20160816; b=k+XRNzkB8IqPlXEABOIci25PBpuNbminnVuzmSZN0jFuH879px8EeTadvh4s5eKeVE fDILgmcvs3f3urm+qRFFZzHVfF6NDNGCJUgkSAj/8r0AS3H13iRRbbh6BliBzPc5/9gs 76SZy21eD93R14LNyAoJhxgC5M350AVnD9di1C+WzP9SMVPigc/yfDFJGavudLnJb+hj txfR3/27krfTeCMdml9FqW0zeAzKHIVlpOgA2q4DrRBJLkobZp2i0V7FUbTUufz8CspW moXRBatl9AQXyV3XWAGkFZRgK0Db27+SSsGYeCTu//lSWD7XSLu/2q8nJr9LQPckaB7G 3syA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:list-unsubscribe:list-subscribe:list-id:precedence :subject:date:from:dkim-signature; bh=PDBV8cJaB7iV+TTP4pZBNgMOM9Qt20Xbuc5uqu6VF04=; fh=xl2J40ly2ZcWHQktF4kxcRfS4CvpZ1lyNrelw9LazNI=; b=DAEKFpsGzQ7ahb0MRXEZPdsSImnvS2KE2EIxjW0t5OrX6ClEvtinprPmEgnEGZPZ69 TY6z/IsdzqRit0ILOQRWUC9h+MemRvr9A5TbOktut0yOQoeANmd5d6Z5KSjlfaYQBz7b ljcLj5a+Wjb7EqZXyls0MB2864T/EJu3u70sySQEEpAm88t0UhAiAtdqQRrn4n1eP64w hQrpVW3n3F2f+TzuXsMH7fcKAB5IkL8CaOxWQTVAkyjeCZ1s7p7nPhyABaHc27MfNlT5 EAMtAEGcbLdsTpfHC1BY565MsPNiFByBFWDOaXkdzMdrrLa2iQUMGiuN3i/2oSGFCxaJ wTbA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@pqrs.dk header.s=key1 header.b=lfnekWpA; spf=pass (google.com: domain of linux-kernel+bounces-4552-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-4552-ouuuleilei=gmail.com@vger.kernel.org" Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [2604:1380:45d1:ec00::1]) by mx.google.com with ESMTPS id r17-20020a0cb291000000b0067f42e7222asi3739243qve.215.2023.12.18.17.26.24 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 18 Dec 2023 17:26:24 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-4552-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) client-ip=2604:1380:45d1:ec00::1; Authentication-Results: mx.google.com; dkim=pass header.i=@pqrs.dk header.s=key1 header.b=lfnekWpA; spf=pass (google.com: domain of linux-kernel+bounces-4552-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-4552-ouuuleilei=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id 8BC261C22EC3 for ; Tue, 19 Dec 2023 01:26:24 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 248E963AC; Tue, 19 Dec 2023 01:25:47 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=pqrs.dk header.i=@pqrs.dk header.b="lfnekWpA" X-Original-To: linux-kernel@vger.kernel.org Received: from out-171.mta0.migadu.com (out-171.mta0.migadu.com [91.218.175.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9706E15B1 for ; Tue, 19 Dec 2023 01:25:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=pqrs.dk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=pqrs.dk X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=pqrs.dk; s=key1; t=1702949137; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=PDBV8cJaB7iV+TTP4pZBNgMOM9Qt20Xbuc5uqu6VF04=; b=lfnekWpAB6CzsHQTX6yE3W9MFHdgtKLUx+4hEtPY/9stmu7C/2VGc05jgGkTH1x/jOpqs8 ng9+j8baGfujv1RHW0M44wwhjI0cz9SjOZq0T1H08gqzHJmLylyuYPJFKAyqpzVasBifih OgTGD5tseC6RYuU6emWw1lkSbBiYxApssqU59LuNkvdVt1lIeKk9QiOWx5Xy0ky/jHJ6J0 NHIaPUv67bZZHYsUbbZX7OkRDceSVRkfGKpeplV5sMLL5uzE0uCUccQzPXaVZ3qwOLD1jV ITxlvoapBY1LwCsayvq8vj/UKuHlIIJesyN2jZiyzO6N8Xu77j/2DOdnVT5Bvw== From: =?utf-8?q?Alvin_=C5=A0ipraga?= Date: Tue, 19 Dec 2023 02:25:14 +0100 Subject: [PATCH v3 1/2] get_maintainer: correctly parse UTF-8 encoded names in files Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Message-Id: <20231219-get-maintainers-utf8-v3-1-f85a39e2265a@bang-olufsen.dk> References: <20231219-get-maintainers-utf8-v3-0-f85a39e2265a@bang-olufsen.dk> In-Reply-To: <20231219-get-maintainers-utf8-v3-0-f85a39e2265a@bang-olufsen.dk> To: Joe Perches , Linus Torvalds , Andrew Morton Cc: =?utf-8?q?Duje_Mihanovi=C4=87?= , Konstantin Ryabitsev , linux-kernel@vger.kernel.org, =?utf-8?q?Alvin_=C5=A0ipraga?= X-Migadu-Flow: FLOW_OUT X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1785671644345703661 X-GMAIL-MSGID: 1785671644345703661 From: Alvin Šipraga While the script correctly extracts UTF-8 encoded names from the MAINTAINERS file, the regular expressions damage my name when parsing from .yaml files. Fix this by replacing the Latin-1-compatible regular expressions with the unicode property matcher \p{L}, which matches on any letter according to the Unicode General Category of letters. The proposed solution only works if the script uses proper string encoding from the outset, so instruct Perl to unconditionally open all files with UTF-8 encoding. This should be safe, as the entire source tree is either UTF-8 or ASCII encoded anyway. See [1] for a detailed analysis. Furthermore, to prevent the \w expression from matching non-ASCII when checking for whether a name should be escaped with quotes, add the /a flag to the regular expression. The escaping logic was duplicated in two places, so it has been factored out into its own function. The original issue was also identified on the tools mailing list [2]. This should solve the observed side effects there as well. Link: https://lore.kernel.org/all/dzn6uco4c45oaa3ia4u37uo5mlt33obecv7gghj2l756fr4hdh@mt3cprft3tmq/ [1] Link: https://lore.kernel.org/tools/20230726-gush-slouching-a5cd41@meerkat/ [2] Signed-off-by: Alvin Šipraga --- scripts/get_maintainer.pl | 30 +++++++++++++++++------------- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl index 16d8ac6005b6..dac38c6e3b1c 100755 --- a/scripts/get_maintainer.pl +++ b/scripts/get_maintainer.pl @@ -20,6 +20,7 @@ use Getopt::Long qw(:config no_auto_abbrev); use Cwd; use File::Find; use File::Spec::Functions; +use open qw(:std :encoding(UTF-8)); my $cur_path = fastgetcwd() . '/'; my $lk_path = "./"; @@ -445,7 +446,7 @@ sub maintainers_in_file { my $text = do { local($/) ; <$f> }; close($f); - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g; + my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g; push(@file_emails, clean_file_emails(@poss_addr)); } } @@ -1152,6 +1153,17 @@ sub top_of_kernel_tree { return 0; } +sub escape_name { + my ($name) = @_; + + if ($name =~ /[^\w \-]/ai) { ##has "must quote" chars + $name =~ s/(? 2) { my $first = $nw[@nw - 3]; my $middle = $nw[@nw - 2]; my $last = $nw[@nw - 1]; - if (((length($first) == 1 && $first =~ m/[A-Za-z]/) || + if (((length($first) == 1 && $first =~ m/\p{L}/) || (length($first) == 2 && substr($first, -1) eq ".")) || (length($middle) == 1 || (length($middle) == 2 && substr($middle, -1) eq "."))) { From patchwork Tue Dec 19 01:25:15 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Alvin_=C5=A0ipraga?= X-Patchwork-Id: 180704 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:24d3:b0:fb:cd0c:d3e with SMTP id r19csp1647941dyi; Mon, 18 Dec 2023 17:26:26 -0800 (PST) X-Google-Smtp-Source: AGHT+IEhxt4H8scxQMlgp/v3zWSkmZpYVpDZC2sMOYDHdUTRcGjouxywXpBGtmZXKu2VZCK1lawn X-Received: by 2002:a50:cc88:0:b0:553:5609:3344 with SMTP id q8-20020a50cc88000000b0055356093344mr1024426edi.20.1702949186406; Mon, 18 Dec 2023 17:26:26 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1702949186; cv=none; d=google.com; s=arc-20160816; b=rANomJsSNk5FpXShXSLQr4aq2+ZpjrxyT/8J3Y9wfRtvMfRljrpxNDq/RS2DGxn6E2 /L9W4yygm25Ne9gXB9IoNmqj5ScvtaPqNmvth7S6Ras4eWxzevJjDyJ0PNdF8csH9x1N Y+uqcEbcIDe1XMRZppJ2BEvvFNmL2+elDzuMm7MneRshGsGyY/ba3NzD6OV5b+p3V7kn uyPn/bRG+VWAmEmOxz9soyxLMu6HkaSlqQMMLffb7YQ+2ggR8q5Dhnp6xps/h1hkheQU 32ptvMZAJVdM9txiOB3XNAciX8Hg8IvzqqEYof4heIGD9CKSCFGblzt7a+UONG/xNktQ tWQQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:list-unsubscribe:list-subscribe:list-id:precedence :subject:date:from:dkim-signature; bh=DCcA5+GzeT7LmnAC4ZC7lotYpQkFKLlKWVTf7X6olec=; fh=xl2J40ly2ZcWHQktF4kxcRfS4CvpZ1lyNrelw9LazNI=; b=j3T4yBo74Yk3kyvhlN4/NNu9Io43s1QzPN+HuNdmT5BYlxFUMlt/Cs5s/gZknoWb0l VFMC50xu6KwpymdqQREndB5+2s44GBFCDFnqdjlRDoj3XZ3vHGgf2KXF21YJVD9mDtzj qSbHds74vu9+thgUUP63DBGNkQt8iPaCC1+rxXAMn0p7L2yhfXzUKp7zxy0hOfltyVnZ ZSYSueJ1RbzeTGZNlsM5NyGsWInm+GCitrt+2/6DOW6NoSXXixIkmAAXKEiXtYfFqdX6 p6gPSuHPE3rI0QtKVXXcu5FEmKsgyAUg+gj8BWHIJu8PNBdwgZZwpfhq1zu/MmnSPjpH EhsA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@pqrs.dk header.s=key1 header.b=WbvSdKXo; spf=pass (google.com: domain of linux-kernel+bounces-4553-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-4553-ouuuleilei=gmail.com@vger.kernel.org" Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [147.75.80.249]) by mx.google.com with ESMTPS id b94-20020a509f67000000b005539201893fsi119573edf.269.2023.12.18.17.26.26 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 18 Dec 2023 17:26:26 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-4553-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) client-ip=147.75.80.249; Authentication-Results: mx.google.com; dkim=pass header.i=@pqrs.dk header.s=key1 header.b=WbvSdKXo; spf=pass (google.com: domain of linux-kernel+bounces-4553-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-4553-ouuuleilei=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 0918E1F23124 for ; Tue, 19 Dec 2023 01:26:26 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 2A39F63D1; Tue, 19 Dec 2023 01:25:47 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=pqrs.dk header.i=@pqrs.dk header.b="WbvSdKXo" X-Original-To: linux-kernel@vger.kernel.org Received: from out-183.mta0.migadu.com (out-183.mta0.migadu.com [91.218.175.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8D3B615AC for ; Tue, 19 Dec 2023 01:25:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=pqrs.dk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=pqrs.dk X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=pqrs.dk; s=key1; t=1702949137; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=DCcA5+GzeT7LmnAC4ZC7lotYpQkFKLlKWVTf7X6olec=; b=WbvSdKXocvo/vee2x/4gWhabR5oUOFwYlDUL49pFBugA3rlgQAtfZEYOnPGoFZbNEAKeQD L7qjo3jJwn5bQ1HLg8a78fkvseHQ2D8YawQ5IhtNdZ9+7Oy+rqfgixtH/t/w1wlYrVqScT Vtm59qGpbVGYDL/y4IFf51r8+d9hiADQP0fSxjMxBeTsznXA8P/e1kFzjhqClg5bNV6n5c iocVQ3UpcrBEilkx0w2lXyfj4Id8OXjfn9XbY3pj8quvD4X8oDeNSwGJztj/EraYGQ1EsA pZG+H2avnXHdNha9ldle27k5xIaN2pUFnI40PEogjJ/7jP+ITArhmrSDihh/sw== From: =?utf-8?q?Alvin_=C5=A0ipraga?= Date: Tue, 19 Dec 2023 02:25:15 +0100 Subject: [PATCH v3 2/2] get_maintainer: remove stray punctuation when cleaning file emails Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Message-Id: <20231219-get-maintainers-utf8-v3-2-f85a39e2265a@bang-olufsen.dk> References: <20231219-get-maintainers-utf8-v3-0-f85a39e2265a@bang-olufsen.dk> In-Reply-To: <20231219-get-maintainers-utf8-v3-0-f85a39e2265a@bang-olufsen.dk> To: Joe Perches , Linus Torvalds , Andrew Morton Cc: =?utf-8?q?Duje_Mihanovi=C4=87?= , Konstantin Ryabitsev , linux-kernel@vger.kernel.org, =?utf-8?q?Alvin_=C5=A0ipraga?= X-Migadu-Flow: FLOW_OUT X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1785671645910904855 X-GMAIL-MSGID: 1785671645910904855 From: Alvin Šipraga When parsing emails from .yaml files in particular, stray punctuation such as a leading '-' can end up in the name. For example, consider a common YAML section such as: maintainers: - devicetree@vger.kernel.org This would previously be processed by get_maintainer.pl as: - Make the logic in clean_file_emails more robust by deleting any sub-names which consist of common single punctuation marks before proceeding to the best-effort name extraction logic. The output is then correct: devicetree@vger.kernel.org Some additional comments are added to the function to make things clearer to future readers. Link: https://lore.kernel.org/all/0173e76a36b3a9b4e7f324dd3a36fd4a9757f302.camel@perches.com/ Suggested-by: Joe Perches Signed-off-by: Alvin Šipraga --- scripts/get_maintainer.pl | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl index dac38c6e3b1c..ee1aed7e090c 100755 --- a/scripts/get_maintainer.pl +++ b/scripts/get_maintainer.pl @@ -2462,11 +2462,17 @@ sub clean_file_emails { foreach my $email (@file_emails) { $email =~ s/[\(\<\{]{0,1}([A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+)[\)\>\}]{0,1}/\<$1\>/g; my ($name, $address) = parse_email($email); - if ($name eq '"[,\.]"') { - $name = ""; - } + # Strip quotes for easier processing, format_email will add them back + $name =~ s/^"(.*)"$/$1/; + + # Split into name-like parts and remove stray punctuation particles my @nw = split(/[^\p{L}\'\,\.\+-]/, $name); + @nw = grep(!/^[\'\,\.\+-]$/, @nw); + + # Make a best effort to extract the name, and only the name, by taking + # only the last two names, or in the case of obvious initials, the last + # three names. if (@nw > 2) { my $first = $nw[@nw - 3]; my $middle = $nw[@nw - 2]; @@ -2480,18 +2486,16 @@ sub clean_file_emails { } else { $name = "$middle $last"; } + } else { + $name = "@nw"; } if (substr($name, -1) =~ /[,\.]/) { $name = substr($name, 0, length($name) - 1); - } elsif (substr($name, -2) =~ /[,\.]"/) { - $name = substr($name, 0, length($name) - 2) . '"'; } if (substr($name, 0, 1) =~ /[,\.]/) { $name = substr($name, 1, length($name) - 1); - } elsif (substr($name, 0, 2) =~ /"[,\.]/) { - $name = '"' . substr($name, 2, length($name) - 2); } my $fmt_email = format_email($name, $address, $email_usename);