From patchwork Thu Mar 2 23:07:03 2023
X-Patchwork-Submitter: Lewis Hyatt
X-Patchwork-Id: 63650
From: Lewis Hyatt
To: gcc-patches@gcc.gnu.org
Cc: Jason Merrill, Lewis Hyatt
Subject: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
Date: Thu, 2 Mar 2023 18:07:03 -0500
Message-Id: <20230302230703.2234902-1-lhyatt@gmail.com>

The PR complains that we do not handle UTF-8 in the suffix for a user-defined
literal, such as:

bool operator ""_π (unsigned long long);

In fact we don't handle any extended identifier characters there, whether
UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
the "" tokens is included, since then the identifier is lexed in the "normal"
way as its own token. But when it is lexed as part of the string token, this
is handled in lex_string() with a one-off loop that is not aware of extended
characters.
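To illustrate the failure mode described above, here is a minimal standalone
sketch (it is not libcpp code) contrasting an ASCII-only suffix loop with one
that also accepts UTF-8 bytes. The names scan_suffix_ascii and
scan_suffix_extended are hypothetical, and treating every byte >= 0x80 as an
identifier character is a simplifying assumption; the real lexer validates
each code point and also has to deal with UCNs and the $ sign.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical ASCII-only stand-ins for identifier-start and
// identifier-continue checks.
static bool is_id_start (unsigned char c) { return std::isalpha (c) || c == '_'; }
static bool is_id_cont (unsigned char c) { return std::isalnum (c) || c == '_'; }

// Simplifying assumption: any byte >= 0x80 is part of a UTF-8 sequence
// that is valid in an identifier.  A real lexer checks the code point.
static bool is_utf8_byte (unsigned char c) { return c >= 0x80; }

// Models a one-off loop that only knows ASCII identifier characters.
// Returns the number of suffix bytes consumed starting at POS.
std::size_t scan_suffix_ascii (const std::string &s, std::size_t pos)
{
  std::size_t cur = pos;
  if (cur < s.size () && is_id_start (s[cur]))
    {
      ++cur;
      while (cur < s.size () && is_id_cont (s[cur]))
        ++cur;
    }
  return cur - pos;
}

// Same loop, but also accepting UTF-8 bytes, so the whole extended
// suffix is consumed as one identifier.
std::size_t scan_suffix_extended (const std::string &s, std::size_t pos)
{
  std::size_t cur = pos;
  if (cur < s.size () && (is_id_start (s[cur]) || is_utf8_byte (s[cur])))
    {
      ++cur;
      while (cur < s.size () && (is_id_cont (s[cur]) || is_utf8_byte (s[cur])))
        ++cur;
    }
  return cur - pos;
}
```

For the input "abc"_π (with π encoded as the two bytes 0xCF 0x80), the
ASCII-only loop stops after the underscore, while the extended loop consumes
all three bytes of the suffix.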
This patch fixes it by adding a new function scan_cur_identifier() that can be
used to lex an identifier while in the middle of lexing another token.

BTW, the other place that has been mis-lexing identifiers is
lex_identifier_intern(), which is used to implement #pragma push_macro and
#pragma pop_macro. This does not support extended characters either. I will
add that in a subsequent patch, because it can't directly reuse the new
function, but rather needs to lex from a string instead of a cpp_buffer.

With scan_cur_identifier(), we do also correctly warn about bidi and
normalization issues in the extended identifiers comprising the suffix.

libcpp/ChangeLog:

	PR preprocessor/103902
	* lex.cc (identifier_diagnostics_on_lex): New function refactoring
	some common code.
	(lex_identifier_intern): Use the new function.
	(lex_identifier): Don't run identifier diagnostics here, rather let
	the call site do it when needed.
	(_cpp_lex_direct): Adjust the call sites of lex_identifier ()
	accordingly.
	(struct scan_id_result): New struct.
	(scan_cur_identifier): New function.
	(create_literal2): New function.
	(lit_accum::create_literal2): New function.
	(is_macro): Folded into new function...
	(maybe_ignore_udl_macro_suffix): ...here.
	(is_macro_not_literal_suffix): Folded likewise.
	(lex_raw_string): Handle UTF-8 in UDL suffix via
	scan_cur_identifier ().
	(lex_string): Likewise.

gcc/testsuite/ChangeLog:

	PR preprocessor/103902
	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-4.C: New test.

---

Notes:

Hello-

This is the updated version of the patch, incorporating feedback from Jakub
and Jason, most recently discussed here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html

Please let me know how it looks? It is simpler than before with the new
approach. Thanks!

One thing to note.
As Jason clarified for me, a usage like this:

#pragma GCC poison _x
const char * operator "" _x (const char *, unsigned long);

The space between the "" and the _x is currently allowed but will be
deprecated in C++23. GCC currently will complain about the poisoned use of _x
in this case, and this patch, which is just focused on handling UTF-8
properly, does not change this. But it seems that it would be correct not to
apply poison in this case. I can try to follow up with a patch to do so, if it
seems worthwhile? Given the syntax is deprecated, maybe it's not worth it...

For the time being, this patch does add a testcase for the above and xfails
it. For the case where no space is present, which is the part touched by the
present patch, existing behavior is preserved correctly and no diagnostics
such as poison are issued for the UDL suffix. (Contrary to v1 of this patch.)

Thanks! bootstrap + regtested all languages on x86-64 Linux with no
regressions.

-Lewis

 .../g++.dg/cpp0x/udlit-extended-id-1.C        |  68 ++++
 .../g++.dg/cpp0x/udlit-extended-id-2.C        |   6 +
 .../g++.dg/cpp0x/udlit-extended-id-3.C        |  15 +
 .../g++.dg/cpp0x/udlit-extended-id-4.C        |  14 +
 libcpp/lex.cc                                 | 382 ++++++++++--------
 5 files changed, 317 insertions(+), 168 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C

diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
new file mode 100644
index 00000000000..411d4fdd0ba
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
@@ -0,0 +1,68 @@
+// { dg-do run { target c++11 } }
+// { dg-additional-options "-Wno-error=normalized" }
+#include
+using namespace std;
+
+constexpr unsigned long long operator "" _π (unsigned long long x)
+{
+  return 3 * x;
+}
+
+/*
Historically we didn't parse properly as part of the "" token, so check that + as well. */ +constexpr unsigned long long operator ""_Π2 (unsigned long long x) +{ + return 4 * x; +} + +char x1[1_π]; +char x2[2_Π2]; + +static_assert (sizeof x1 == 3, "test1"); +static_assert (sizeof x2 == 8, "test2"); + +const char * operator "" _1σ (const char *s, unsigned long) +{ + return s + 1; +} + +const char * operator ""_Σ2 (const char *s, unsigned long) +{ + return s + 2; +} + +const char * operator "" _\U000000e61 (const char *s, unsigned long) +{ + return "ae"; +} + +const char* operator ""_\u01532 (const char *s, unsigned long) +{ + return "oe"; +} + +bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" } +bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" } + +#define xτy +const char * str = ""xτy; // { dg-warning "invalid suffix on literal" } + +int main() +{ + if (3_π != 9) + __builtin_abort (); + if (4_Π2 != 16) + __builtin_abort (); + if (strcmp ("abc"_1σ, "bc")) + __builtin_abort (); + if (strcmp ("abcd"_Σ2, "cd")) + __builtin_abort (); + if (strcmp (R"(abcdef)"_1σ, "bcdef")) + __builtin_abort (); + if (strcmp (R"(abcdef)"_Σ2, "cdef")) + __builtin_abort (); + if (strcmp ("xyz"_æ1, "ae")) + __builtin_abort (); + if (strcmp ("xyz"_œ2, "oe")) + __builtin_abort (); +} diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C new file mode 100644 index 00000000000..05a2804a463 --- /dev/null +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C @@ -0,0 +1,6 @@ +// { dg-do compile { target c++11 } } +// { dg-additional-options "-Wbidi-chars=any,ucn" } +bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 } +// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 } +// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 } +// { 
dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 } diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C new file mode 100644 index 00000000000..11292e476e3 --- /dev/null +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C @@ -0,0 +1,15 @@ +// { dg-do compile { target c++11 } } + +// Check that we do not look for poisoned identifier when it is a suffix. +int _ħ; +#pragma GCC poison _ħ +const char * operator ""_ħ (const char *, unsigned long); // { dg-bogus "poisoned" } +bool operator ""_ħ (unsigned long long x); // { dg-bogus "poisoned" } +bool b = 1_ħ; // { dg-bogus "poisoned" } +const char *x = "hbar"_ħ; // { dg-bogus "poisoned" } + +/* Ideally, we should not warn here either, but this is not implemented yet. This + syntax has been deprecated for C++23. */ +#pragma GCC poison _ħ2 +const char * operator "" _ħ2 (const char *, unsigned long); // { dg-bogus "poisoned" "" { xfail *-*-*} } +const char *x2 = "hbar2"_ħ2; // { dg-bogus "poisoned" } diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C new file mode 100644 index 00000000000..d1683c4d892 --- /dev/null +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C @@ -0,0 +1,14 @@ +// { dg-options "-std=c++98 -Wc++11-compat" } +#define END ; +#define εND ; +#define EηD ; +#define EN\u0394 ; + +const char *s1 = "s1"END // { dg-warning "requires a space between string literal and macro" } +const char *s2 = "s2"εND // { dg-warning "requires a space between string literal and macro" } +const char *s3 = "s3"EηD // { dg-warning "requires a space between string literal and macro" } +const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string literal and macro" } + +/* Make sure we did not skip the token also in the case that it wasn't found to + be a macro; compilation should fail here. 
*/ +const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" } diff --git a/libcpp/lex.cc b/libcpp/lex.cc index 45ea16a91bc..062935e2371 100644 --- a/libcpp/lex.cc +++ b/libcpp/lex.cc @@ -2057,8 +2057,11 @@ warn_about_normalization (cpp_reader *pfile, } } -/* Returns TRUE if the sequence starting at buffer->cur is valid in - an identifier. FIRST is TRUE if this starts an identifier. */ +/* Returns TRUE if the byte sequence starting at buffer->cur is a valid + extended character in an identifier. If FIRST is TRUE, then the character + must be valid at the beginning of an identifier as well. If the return + value is TRUE, then pfile->buffer->cur has been moved to point to the next + byte after the extended character. */ static bool forms_identifier_p (cpp_reader *pfile, int first, @@ -2154,6 +2157,47 @@ maybe_va_opt_error (cpp_reader *pfile) } } +/* Helper function to perform diagnostics that are needed (rarely) + when an identifier is lexed. */ +static void +identifier_diagnostics_on_lex (cpp_reader *pfile, cpp_hashnode *node) +{ + if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC) + || pfile->state.skipping, 1)) + return; + + /* It is allowed to poison the same identifier twice. */ + if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok) + cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"", + NODE_NAME (node)); + + /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the + replacement list of a variadic macro. */ + if (node == pfile->spec_nodes.n__VA_ARGS__ + && !pfile->state.va_args_ok) + { + if (CPP_OPTION (pfile, cplusplus)) + cpp_error (pfile, CPP_DL_PEDWARN, + "__VA_ARGS__ can only appear in the expansion" + " of a C++11 variadic macro"); + else + cpp_error (pfile, CPP_DL_PEDWARN, + "__VA_ARGS__ can only appear in the expansion" + " of a C99 variadic macro"); + } + + /* __VA_OPT__ should only appear in the replacement list of a + variadic macro. 
*/ + if (node == pfile->spec_nodes.n__VA_OPT__) + maybe_va_opt_error (pfile); + + /* For -Wc++-compat, warn about use of C++ named operators. */ + if (node->flags & NODE_WARN_OPERATOR) + cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES, + "identifier \"%s\" is a special operator name in C++", + NODE_NAME (node)); +} + /* Helper function to get the cpp_hashnode of the identifier BASE. */ static cpp_hashnode * lex_identifier_intern (cpp_reader *pfile, const uchar *base) @@ -2173,41 +2217,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar *base) hash = HT_HASHFINISH (hash, len); result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table, base, len, hash, HT_ALLOC)); - - /* Rarely, identifiers require diagnostics when lexed. */ - if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC) - && !pfile->state.skipping, 0)) - { - /* It is allowed to poison the same identifier twice. */ - if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok) - cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"", - NODE_NAME (result)); - - /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the - replacement list of a variadic macro. */ - if (result == pfile->spec_nodes.n__VA_ARGS__ - && !pfile->state.va_args_ok) - { - if (CPP_OPTION (pfile, cplusplus)) - cpp_error (pfile, CPP_DL_PEDWARN, - "__VA_ARGS__ can only appear in the expansion" - " of a C++11 variadic macro"); - else - cpp_error (pfile, CPP_DL_PEDWARN, - "__VA_ARGS__ can only appear in the expansion" - " of a C99 variadic macro"); - } - - if (result == pfile->spec_nodes.n__VA_OPT__) - maybe_va_opt_error (pfile); - - /* For -Wc++-compat, warn about use of C++ named operators. 
*/ - if (result->flags & NODE_WARN_OPERATOR) - cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES, - "identifier \"%s\" is a special operator name in C++", - NODE_NAME (result)); - } - + identifier_diagnostics_on_lex (pfile, result); return result; } @@ -2221,7 +2231,9 @@ _cpp_lex_identifier (cpp_reader *pfile, const char *name) return result; } -/* Lex an identifier starting at BUFFER->CUR - 1. */ +/* Lex an identifier starting at BASE. BUFFER->CUR is expected to point + one past the first character at BASE, which may be a (possibly multi-byte) + character if STARTS_UCN is true. */ static cpp_hashnode * lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, struct normalize_state *nst, cpp_hashnode **spelling) @@ -2270,42 +2282,51 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, *spelling = result; } - /* Rarely, identifiers require diagnostics when lexed. */ - if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC) - && !pfile->state.skipping, 0)) - { - /* It is allowed to poison the same identifier twice. */ - if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok) - cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"", - NODE_NAME (result)); - - /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the - replacement list of a variadic macro. */ - if (result == pfile->spec_nodes.n__VA_ARGS__ - && !pfile->state.va_args_ok) - { - if (CPP_OPTION (pfile, cplusplus)) - cpp_error (pfile, CPP_DL_PEDWARN, - "__VA_ARGS__ can only appear in the expansion" - " of a C++11 variadic macro"); - else - cpp_error (pfile, CPP_DL_PEDWARN, - "__VA_ARGS__ can only appear in the expansion" - " of a C99 variadic macro"); - } + return result; +} - /* __VA_OPT__ should only appear in the replacement list of a - variadic macro. */ - if (result == pfile->spec_nodes.n__VA_OPT__) - maybe_va_opt_error (pfile); - - /* For -Wc++-compat, warn about use of C++ named operators. 
*/ - if (result->flags & NODE_WARN_OPERATOR) - cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES, - "identifier \"%s\" is a special operator name in C++", - NODE_NAME (result)); - } +/* Struct to hold the return value of the scan_cur_identifier () helper + function below. */ +struct scan_id_result +{ + cpp_hashnode *node; + normalize_state nst; + + scan_id_result () + : node (nullptr) + { + nst = INITIAL_NORMALIZE_STATE; + } + + explicit operator bool () const { return node; } +}; + +/* Helper function to scan an entire identifier beginning at + pfile->buffer->cur, and possibly containing extended characters (UCNs + and/or UTF-8). Returns the cpp_hashnode for the identifier on success, or + else nullptr, as well as a normalize_state so that normalization warnings + may be issued once the token lexing is complete. */ + +static scan_id_result +scan_cur_identifier (cpp_reader *pfile) +{ + const auto buffer = pfile->buffer; + const auto begin = buffer->cur; + scan_id_result result; + if (ISIDST (*buffer->cur)) + { + ++buffer->cur; + cpp_hashnode *ignore; + result.node = lex_identifier (pfile, begin, false, &result.nst, &ignore); + } + else if (forms_identifier_p (pfile, true, &result.nst)) + { + /* buffer->cur has been moved already by the call + to forms_identifier_p. */ + cpp_hashnode *ignore; + result.node = lex_identifier (pfile, begin, true, &result.nst, &ignore); + } return result; } @@ -2365,6 +2386,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, const uchar *base, token->val.str.text = cpp_alloc_token_string (pfile, base, len); } +/* Like create_literal(), but construct it from two separate strings + which are concatenated. LEN2 may be 0 if no second string is + required. 
*/ +static void +create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1, + unsigned int len1, const uchar *base2, unsigned int len2, + enum cpp_ttype type) +{ + token->type = type; + token->val.str.len = len1 + len2; + uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1); + memcpy (dest, base1, len1); + if (len2) + memcpy (dest+len1, base2, len2); + dest[len1 + len2] = 0; + token->val.str.text = dest; +} + const uchar * cpp_alloc_token_string (cpp_reader *pfile, const unsigned char *ptr, unsigned len) @@ -2403,6 +2442,11 @@ struct lit_accum { rpos = NULL; return c; } + + void create_literal2 (cpp_reader *pfile, cpp_token *token, + const uchar *base1, unsigned int len1, + const uchar *base2, unsigned int len2, + enum cpp_ttype type); }; /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer @@ -2445,45 +2489,57 @@ lit_accum::read_begin (cpp_reader *pfile) rpos = BUFF_FRONT (last); } -/* Returns true if a macro has been defined. - This might not work if compile with -save-temps, - or preprocess separately from compilation. */ +/* Helper function to check if a string format macro, say from inttypes.h, is + placed touching a string literal, in which case it could be parsed as a C++11 + user-defined string literal thus breaking the program. User-defined literals + outside of namespace std must start with a single underscore, so assume + anything of that form really is a UDL suffix. We don't need to worry about + UDLs defined inside namespace std because their names are reserved, so cannot + be used as macro names in valid programs. Return TRUE if the UDL should be + ignored for now and preserved for potential macro expansion. */ static bool -is_macro(cpp_reader *pfile, const uchar *base) +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc, + const uchar *suffix_begin, cpp_hashnode *node) { - const uchar *cur = base; - if (! 
ISIDST (*cur)) + if ((suffix_begin[0] == '_' && suffix_begin[1] != '_') + || !cpp_macro_p (node)) return false; - unsigned int hash = HT_HASHSTEP (0, *cur); - ++cur; - while (ISIDNUM (*cur)) - { - hash = HT_HASHSTEP (hash, *cur); - ++cur; - } - hash = HT_HASHFINISH (hash, cur - base); - cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table, - base, cur - base, hash, HT_NO_INSERT)); - - return result && cpp_macro_p (result); + /* Maybe raise a warning here; caller should arrange not to consume + the tokens. */ + if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping) + cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0, + "invalid suffix on literal; C++11 requires a space " + "between literal and string macro"); + return true; } -/* Returns true if a literal suffix does not have the expected form - and is defined as a macro. */ - -static bool -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base) +/* Like create_literal2(), but also prepend all the accumulated data from + the lit_accum struct. */ +void +lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token, + const uchar *base1, unsigned int len1, + const uchar *base2, unsigned int len2, + enum cpp_ttype type) { - /* User-defined literals outside of namespace std must start with a single - underscore, so assume anything of that form really is a UDL suffix. - We don't need to worry about UDLs defined inside namespace std because - their names are reserved, so cannot be used as macro names in valid - programs. 
*/ - if (base[0] == '_' && base[1] != '_') - return false; - return is_macro (pfile, base); + const unsigned int tot_len = accum + len1 + len2; + uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1); + token->type = type; + token->val.str.len = tot_len; + token->val.str.text = dest; + for (_cpp_buff *buf = first; buf; buf = buf->next) + { + size_t len = BUFF_FRONT (buf) - buf->base; + memcpy (dest, buf->base, len); + dest += len; + } + memcpy (dest, base1, len1); + dest += len1; + if (len2) + memcpy (dest, base2, len2); + dest += len2; + *dest = '\0'; } /* Lexes a raw string. The stored string contains the spelling, @@ -2758,26 +2814,25 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) if (CPP_OPTION (pfile, user_literals)) { - /* If a string format macro, say from inttypes.h, is placed touching - a string literal it could be parsed as a C++11 user-defined string - literal thus breaking the program. */ - if (is_macro_not_literal_suffix (pfile, pos)) - { - /* Raise a warning, but do not consume subsequent tokens. */ - if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping) - cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, - token->src_loc, 0, - "invalid suffix on literal; C++11 requires " - "a space between literal and string macro"); - } - /* Grab user defined literal suffix. 
*/ - else if (ISIDST (*pos)) - { - type = cpp_userdef_string_add_type (type); - ++pos; + const uchar *const suffix_begin = pos; + pfile->buffer->cur = pos; - while (ISIDNUM (*pos)) - ++pos; + if (const auto sr = scan_cur_identifier (pfile)) + { + if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc, + suffix_begin, sr.node)) + pfile->buffer->cur = suffix_begin; + else + { + type = cpp_userdef_string_add_type (type); + accum.create_literal2 (pfile, token, base, suffix_begin - base, + NODE_NAME (sr.node), NODE_LEN (sr.node), + type); + if (accum.first) + _cpp_release_buff (pfile, accum.first); + warn_about_normalization (pfile, token, &sr.nst, true); + return; + } } } @@ -2787,21 +2842,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) create_literal (pfile, token, base, pos - base, type); else { - size_t extra_len = pos - base; - uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1); - - token->type = type; - token->val.str.len = accum.accum + extra_len; - token->val.str.text = dest; - for (_cpp_buff *buf = accum.first; buf; buf = buf->next) - { - size_t len = BUFF_FRONT (buf) - buf->base; - memcpy (dest, buf->base, len); - dest += len; - } + accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type); _cpp_release_buff (pfile, accum.first); - memcpy (dest, base, extra_len); - dest[extra_len] = '\0'; } } @@ -2908,39 +2950,40 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character", (int) terminator); + pfile->buffer->cur = cur; + const uchar *const suffix_begin = cur; + if (CPP_OPTION (pfile, user_literals)) { - /* If a string format macro, say from inttypes.h, is placed touching - a string literal it could be parsed as a C++11 user-defined string - literal thus breaking the program. */ - if (is_macro_not_literal_suffix (pfile, cur)) - { - /* Raise a warning, but do not consume subsequent tokens. 
*/ - if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping) - cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, - token->src_loc, 0, - "invalid suffix on literal; C++11 requires " - "a space between literal and string macro"); - } - /* Grab user defined literal suffix. */ - else if (ISIDST (*cur)) + if (const auto sr = scan_cur_identifier (pfile)) { - type = cpp_userdef_char_add_type (type); - type = cpp_userdef_string_add_type (type); - ++cur; - - while (ISIDNUM (*cur)) - ++cur; + if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc, + suffix_begin, sr.node)) + pfile->buffer->cur = suffix_begin; + else + { + /* Grab user defined literal suffix. */ + type = cpp_userdef_char_add_type (type); + type = cpp_userdef_string_add_type (type); + create_literal2 (pfile, token, base, suffix_begin - base, + NODE_NAME (sr.node), NODE_LEN (sr.node), type); + warn_about_normalization (pfile, token, &sr.nst, true); + return; + } } } else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat) - && is_macro (pfile, cur) && !pfile->state.skipping) - cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT, - token->src_loc, 0, "C++11 requires a space " - "between string literal and macro"); + { + const auto sr = scan_cur_identifier (pfile); + /* Maybe raise a warning, but do not consume the tokens. 
*/ + pfile->buffer->cur = suffix_begin; + if (sr && cpp_macro_p (sr.node)) + cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT, + token->src_loc, 0, "C++11 requires a space " + "between string literal and macro"); + } - pfile->buffer->cur = cur; create_literal (pfile, token, base, cur - base, type); } @@ -3915,9 +3958,10 @@ _cpp_lex_direct (cpp_reader *pfile) result->type = CPP_NAME; { struct normalize_state nst = INITIAL_NORMALIZE_STATE; - result->val.node.node = lex_identifier (pfile, buffer->cur - 1, false, - &nst, - &result->val.node.spelling); + const auto node = lex_identifier (pfile, buffer->cur - 1, false, &nst, + &result->val.node.spelling); + result->val.node.node = node; + identifier_diagnostics_on_lex (pfile, node); warn_about_normalization (pfile, result, &nst, true); } @@ -4220,8 +4264,10 @@ _cpp_lex_direct (cpp_reader *pfile) if (forms_identifier_p (pfile, true, &nst)) { result->type = CPP_NAME; - result->val.node.node = lex_identifier (pfile, base, true, &nst, - &result->val.node.spelling); + const auto node = lex_identifier (pfile, base, true, &nst, + &result->val.node.spelling); + result->val.node.node = node; + identifier_diagnostics_on_lex (pfile, node); warn_about_normalization (pfile, result, &nst, true); break; } @@ -4353,7 +4399,7 @@ cpp_digraph2name (enum cpp_ttype type) } /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER. - The buffer must already contain the enough space to hold the + The buffer must already contain enough space to hold the token's spelling. Returns a pointer to the character after the last character written. */ unsigned char * @@ -4375,7 +4421,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident) } /* Write the spelling of a token TOKEN to BUFFER. The buffer must - already contain the enough space to hold the token's spelling. + already contain enough space to hold the token's spelling. Returns a pointer to the character after the last character written. 
FORSTRING is true if this is to be the spelling after translation phase 1 (with the original spelling of extended identifiers), false