From patchwork Mon Sep 26 22:27:25 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Lewis Hyatt <lhyatt@gmail.com>
X-Patchwork-Id: 1475
Date: Mon, 26 Sep 2022 18:27:25 -0400
To: gcc-patches@gcc.gnu.org
Subject: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined
 literal suffix [PR103902]
Message-ID: <20220926222725.GA19652@ldh-imac.local>
References: <20220614212649.GA58025@ldh-imac.local>
 <20220615190616.GA70682@ldh-imac.local>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20220615190616.GA70682@ldh-imac.local>
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: Lewis Hyatt via Gcc-patches
 <gcc-patches@gcc.gnu.org>
From: Lewis Hyatt <lhyatt@gmail.com>
Reply-To: Lewis Hyatt <lhyatt@gmail.com>
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>

On Wed, Jun 15, 2022 at 03:06:16PM -0400, Lewis Hyatt wrote:
> On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote:
> > Hello-
> > 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902
> > 
> > The attached patch resolves PR preprocessor/103902 as described in the patch
> > message inline below. bootstrap + regtest all languages was successful on
> > x86-64 Linux, with no new failures:
> > 
> > FAIL 103 103
> > PASS 542338 542371
> > UNSUPPORTED 15247 15250
> > UNTESTED 136 136
> > XFAIL 4166 4166
> > XPASS 17 17
> > 
> > Please let me know if it looks OK?
> > 
> > A few questions I have:
> > 
> > - A difference introduced with this patch is that after lexing something
> > like `operator ""_abc', then `_abc' is added to the identifier hash map,
> > whereas previously it was not. I feel like this must be OK because with the
> > optional space as in `operator "" _abc', it would be added with or without the
> > patch.
> > 
> > - The behavior of `#pragma GCC poison' is not consistent (including prior to
> >   my patch). I tried to make it more so but there is still one thing I want to
> >   ask about. Leaving aside extended characters for now, the inconsistency is
> >   that currently the poison is checked only when the suffix appears as a
> >   standalone token.
> > 
> >   #pragma GCC poison _X
> >   bool operator ""_X (unsigned long long);   //accepted before the patch,
> >                                              //rejected after it
> >   bool operator "" _X (unsigned long long);  //rejected either before or after
> >   const char * operator ""_X (const char *, unsigned long); //accepted before,
> >                                                             //rejected after
> >   const char * operator "" _X (const char *, unsigned long); //rejected either
> > 
> >   const char * s = ""_X; //accepted before the patch, rejected after it
> >   const bool b = 1_X; //accepted before or after ****
> > 
> > I feel like after the patch, the behavior is the expected behavior for all
> > cases but the last one. Here, we allow the poisoned identifier because it's
> > not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
> > like this or does it need to be addressed?
> 
> Sorry, that version actually did not handle the case of -Wc++11-compat in
> c++98 mode correctly. This updated version fixes that and adds the missing
> test coverage for it. Could you please review this one instead?
> 
> By the way, the pipermail archive seems to permanently mangle UTF-8 in inline
> attachments. I attached the patch also gzipped to address that for the
> archive, since the new testcases do use non-ASCII characters.
> 
> Thanks for taking a look!

Hello-

May I please ping this patch again? Joseph suggested that it would be best if
a C++ maintainer has a look at it. This is one of just a few places left where
we don't handle UTF-8 properly in libcpp, so it would be really nice to get
them fixed up if there is time to review this patch. Thanks!

https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596704.html

I re-attached it here as it required some trivial rebasing on top of recently
pushed changes. As before, I also attached the gzipped version so that the
UTF-8 testcases show up OK in the online archive, in case that's still an
issue. Thanks for taking a look!

-Lewis
[PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

The PR complains that we do not handle UTF-8 in the suffix for a user-defined
literal, such as:

bool operator ""_π (unsigned long long);

In fact we don't handle any extended identifier characters there, whether
UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
the "" tokens is included, since then the identifier is lexed in the "normal"
way as its own token. But when it is lexed as part of the string token, this
is handled in lex_string() with a one-off loop that is not aware of extended
characters.
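
To illustrate the failure mode, here is a minimal standalone sketch (not
libcpp code; the helper names are made up for this example). It contrasts an
ASCII-only suffix scan with one that also accepts non-ASCII bytes. As a
simplifying assumption, any byte >= 0x80 is treated as part of the identifier,
whereas the real lexer validates each code point via forms_identifier_p() and
also handles UCNs and the $ sign.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for this sketch only; these are not libcpp functions.
static bool is_ascii_id_start (unsigned char c)
{
  return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool is_ascii_id_char (unsigned char c)
{
  return is_ascii_id_start (c) || (c >= '0' && c <= '9');
}

// ASCII-only scan: stops at the first byte outside [A-Za-z0-9_], so a
// UTF-8 suffix is cut off at its first multibyte character.
std::size_t ascii_suffix_len (const std::string &s)
{
  if (s.empty () || !is_ascii_id_start (s[0]))
    return 0;
  std::size_t n = 1;
  while (n < s.size () && is_ascii_id_char (s[n]))
    ++n;
  return n;
}

// Extended scan: additionally accepts every non-ASCII byte.  This is a
// deliberate simplification; it does not check that the bytes form a valid
// UTF-8 sequence or that the code point is allowed in an identifier.
std::size_t extended_suffix_len (const std::string &s)
{
  std::size_t n = 0;
  while (n < s.size ())
    {
      const unsigned char c = s[n];
      const bool ok = c >= 0x80
		      || (n == 0 ? is_ascii_id_start (c) : is_ascii_id_char (c));
      if (!ok)
	break;
      ++n;
    }
  return n;
}
```

With the UTF-8 bytes of `_π (` as input ("_\xcf\x80 ("), ascii_suffix_len()
returns 1 (just the underscore) while extended_suffix_len() returns 3, so only
the second scan treats the whole suffix as part of the literal.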

This patch fixes it by adding a new function scan_cur_identifier() that can be
used to lex an identifier while in the middle of lexing another token. It is
somewhat duplicative of the code in lex_identifier(), which handles the normal
case, but I think there's no good way to avoid that without pessimizing the
usual case, since lex_identifier() takes advantage of the fact that the first
character of the identifier has already been analyzed. The code duplication is
somewhat offset by the identifier lexing diagnostics (e.g. for poisoned
identifiers), which were formerly duplicated in two places and are now
factored into their own function, identifier_diagnostics_on_lex(), that is
used in three places.

BTW, the other place that was lexing identifiers is lex_identifier_intern(),
which is used to implement #pragma push_macro and #pragma pop_macro. This does
not support extended characters either. I will add that in a subsequent patch,
because it can't directly reuse the new function, but rather needs to lex from
a string instead of a cpp_buffer.

With scan_cur_identifier(), we do also correctly warn about bidi and
normalization issues in the extended identifiers comprising the suffix, and we
check for poisoned identifiers there as well.
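
The poison check applies to suffixes of string and character literals; a
suffix on a numeric literal such as 1_X is not affected, because the whole
spelling is lexed as a single pp-number and no identifier is ever formed. The
following standalone sketch (again not libcpp code; pp_number_len() is a
hypothetical name) shows roughly how the pp-number grammar absorbs the suffix.
It assumes ASCII input only and omits digit separators and extended
characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the length of the pp-number at the start of S, or 0 if S does not
// begin with one.  Sketch of the grammar: a digit, or '.' followed by a
// digit, then any run of digits, letters, '_', '.', or an exponent letter
// (e E p P) immediately followed by a sign.
std::size_t pp_number_len (const std::string &s)
{
  auto is_digit = [] (char c) { return c >= '0' && c <= '9'; };
  auto is_nondigit = [] (char c) {
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  };

  std::size_t n;
  if (!s.empty () && is_digit (s[0]))
    n = 1;
  else if (s.size () >= 2 && s[0] == '.' && is_digit (s[1]))
    n = 2;
  else
    return 0;

  while (n < s.size ())
    {
      const char c = s[n];
      const bool exponent = c == 'e' || c == 'E' || c == 'p' || c == 'P';
      if (exponent && n + 1 < s.size ()
	  && (s[n + 1] == '+' || s[n + 1] == '-'))
	n += 2;			// Exponent letter plus its sign.
      else if (is_digit (c) || is_nondigit (c) || c == '.')
	++n;			// Includes the letters of any UDL suffix.
      else
	break;
    }
  return n;
}
```

For the input `1_X;` this returns 3: the suffix _X is consumed as part of the
number, so a check that only runs when an identifier is lexed never sees it.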

libcpp/ChangeLog:

	PR preprocessor/103902
	* lex.cc (identifier_diagnostics_on_lex): New function refactors
	common code from...
	(lex_identifier_intern): ...here, and...
	(lex_identifier): ...here.
	(struct scan_id_result): New struct to hold the result of...
	(scan_cur_identifier): ...new function.
	(create_literal2): New function.
	(is_macro): Removed function that is now handled directly in
	lex_string() and lex_raw_string().
	(is_macro_not_literal_suffix): Likewise.
	(lit_accum::create_literal2): New function.
	(lex_raw_string): Make use of new function scan_cur_identifier().
	(lex_string): Likewise.

gcc/testsuite/ChangeLog:

	PR preprocessor/103902
	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-4.C: New test.

diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
new file mode 100644
index 00000000000..411d4fdd0ba
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
@@ -0,0 +1,68 @@
+// { dg-do run { target c++11 } }
+// { dg-additional-options "-Wno-error=normalized" }
+#include <cstring>
+using namespace std;
+
+constexpr unsigned long long operator "" _π (unsigned long long x)
+{
+  return 3 * x;
+}
+
+/* Historically we didn't parse properly as part of the "" token, so check that
+   as well.  */
+constexpr unsigned long long operator ""_Π2 (unsigned long long x)
+{
+  return 4 * x;
+}
+
+char x1[1_π];
+char x2[2_Π2];
+
+static_assert (sizeof x1 == 3, "test1");
+static_assert (sizeof x2 == 8, "test2");
+
+const char * operator "" _1σ (const char *s, unsigned long)
+{
+  return s + 1;
+}
+
+const char * operator ""_Σ2 (const char *s, unsigned long)
+{
+  return s + 2;
+}
+
+const char * operator "" _\U000000e61 (const char *s, unsigned long)
+{
+  return "ae";
+}
+
+const char* operator ""_\u01532 (const char *s, unsigned long)
+{
+  return "oe";
+}
+
+bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" }
+bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" }
+
+#define xτy
+const char * str = ""xτy; // { dg-warning "invalid suffix on literal" }
+
+int main()
+{
+  if (3_π != 9)
+    __builtin_abort ();
+  if (4_Π2 != 16)
+    __builtin_abort ();
+  if (strcmp ("abc"_1σ, "bc"))
+    __builtin_abort ();
+  if (strcmp ("abcd"_Σ2, "cd"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_1σ, "bcdef"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_Σ2, "cdef"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_æ1, "ae"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_œ2, "oe"))
+    __builtin_abort ();
+}
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
new file mode 100644
index 00000000000..05a2804a463
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
@@ -0,0 +1,6 @@
+// { dg-do compile { target c++11 } }
+// { dg-additional-options "-Wbidi-chars=any,ucn" }
+bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 }
+// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 }
+// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 }
+// { dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
new file mode 100644
index 00000000000..6db729c3432
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
@@ -0,0 +1,7 @@
+// { dg-do compile { target c++11 } }
+int _ħ;
+const char * operator ""_ħ (const char *, unsigned long);
+bool operator ""_ħ (unsigned long long x);
+#pragma GCC poison _ħ
+bool b = 1_ħ; // This currently is allowed, is that intended?
+const char *x = "hbar"_ħ; // { dg-error "attempt to use poisoned" }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
new file mode 100644
index 00000000000..a356eba4a3c
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
@@ -0,0 +1,15 @@
+// { dg-options "-std=c++98 -Wc++11-compat" }
+#define END ;
+#define εND ;
+#define EηD ;
+#define EN\u0394 ;
+
+const char *s1 = "s1"END // { dg-warning "requires a space between string literal and macro" }
+const char *s2 = "s2"εND // { dg-warning "requires a space between string literal and macro" }
+const char *s3 = "s3"EηD // { dg-warning "requires a space between string literal and macro" }
+const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string literal and macro" }
+
+/* Make sure we did not skip the token also in the case that it wasn't found to
+   be a macro; compilation should fail here.  */
+const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" }
+
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 41f905dea16..f93a883acce 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -2052,8 +2052,11 @@ warn_about_normalization (cpp_reader *pfile,
     }
 }
 
-/* Returns TRUE if the sequence starting at buffer->cur is valid in
-   an identifier.  FIRST is TRUE if this starts an identifier.  */
+/* Returns TRUE if the byte sequence starting at buffer->cur is a valid
+   extended character in an identifier.  If FIRST is TRUE, then the character
+   must be valid at the beginning of an identifier as well.  If the return
+   value is TRUE, then pfile->buffer->cur has been moved to point to the next
+   byte after the extended character.  */
 
 static bool
 forms_identifier_p (cpp_reader *pfile, int first,
@@ -2143,6 +2146,122 @@ maybe_va_opt_error (cpp_reader *pfile)
     }
 }
 
+/* Helper function to perform diagnostics that are needed (rarely)
+   when an identifier is lexed.  */
+static void identifier_diagnostics_on_lex (cpp_reader *pfile,
+					   cpp_hashnode *node)
+{
+  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
+			|| pfile->state.skipping, 1))
+    return;
+
+  /* It is allowed to poison the same identifier twice.  */
+  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
+    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
+	       NODE_NAME (node));
+
+  /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
+     replacement list of a variadic macro.  */
+  if (node == pfile->spec_nodes.n__VA_ARGS__
+      && !pfile->state.va_args_ok)
+    {
+      if (CPP_OPTION (pfile, cplusplus))
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C++11 variadic macro");
+      else
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C99 variadic macro");
+    }
+
+  if (node == pfile->spec_nodes.n__VA_OPT__)
+    maybe_va_opt_error (pfile);
+
+  /* For -Wc++-compat, warn about use of C++ named operators.  */
+  if (node->flags & NODE_WARN_OPERATOR)
+    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
+		 "identifier \"%s\" is a special operator name in C++",
+		 NODE_NAME (node));
+}
+
+/* Helper function to scan an entire identifier beginning at
+   pfile->buffer->cur, and possibly containing extended characters (UCNs
+   and/or UTF-8).  Returns the cpp_hashnode for the identifier on success, or
+   else nullptr, as well as a normalize_state so that normalization warnings
+   may be issued once the token lexing is complete.  */
+
+struct scan_id_result
+{
+  cpp_hashnode *node;
+  normalize_state nst;
+
+  scan_id_result ()
+    : node (nullptr)
+  {
+    nst = INITIAL_NORMALIZE_STATE;
+  }
+
+  explicit operator bool () const { return node; }
+};
+
+static scan_id_result
+scan_cur_identifier (cpp_reader *pfile)
+{
+  cpp_buffer *const buffer = pfile->buffer;
+  const uchar *const begin = buffer->cur;
+  scan_id_result result;
+  bool need_extended;
+  unsigned int hash = 0;
+  if (ISIDST (*buffer->cur))
+    {
+      hash = HT_HASHSTEP (0, *buffer->cur);
+      ++buffer->cur;
+      while (ISIDNUM (*buffer->cur))
+	{
+	  hash = HT_HASHSTEP (hash, *buffer->cur);
+	  ++buffer->cur;
+	}
+      NORMALIZE_STATE_UPDATE_IDNUM (&result.nst, buffer->cur[-1]);
+      need_extended = forms_identifier_p (pfile, false, &result.nst);
+    }
+  else
+    {
+      if (!forms_identifier_p (pfile, true, &result.nst))
+	return result;
+      need_extended = true;
+    }
+
+  if (need_extended)
+    {
+      do {
+	while (ISIDNUM (*buffer->cur))
+	  {
+	    NORMALIZE_STATE_UPDATE_IDNUM (&result.nst, *buffer->cur);
+	    ++buffer->cur;
+	  }
+      } while (forms_identifier_p (pfile, false, &result.nst));
+
+      if (pfile->warn_bidi_p ())
+	maybe_warn_bidi_on_close (pfile, buffer->cur);
+
+      result.node = _cpp_interpret_identifier (pfile, begin,
+					       buffer->cur - begin);
+    }
+  else
+    {
+      const size_t len = buffer->cur - begin;
+      hash = HT_HASHFINISH (hash, len);
+      result.node = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
+						       begin, len,
+						       hash, HT_ALLOC));
+    }
+
+  identifier_diagnostics_on_lex (pfile, result.node);
+  return result;
+}
+
+
 /* Helper function to get the cpp_hashnode of the identifier BASE.  */
 static cpp_hashnode *
 lex_identifier_intern (cpp_reader *pfile, const uchar *base)
@@ -2162,41 +2281,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar *base)
   hash = HT_HASHFINISH (hash, len);
   result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
 					      base, len, hash, HT_ALLOC));
-
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2259,42 +2344,7 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
       *spelling = result;
     }
 
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      /* __VA_OPT__ should only appear in the replacement list of a
-	 variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2354,6 +2404,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, const uchar *base,
   token->val.str.text = cpp_alloc_token_string (pfile, base, len);
 }
 
+/* Like create_literal(), but construct it from two separate strings
+   which are concatenated.  LEN2 may be 0 if no second string is
+   required.  */
+static void
+create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1,
+		 unsigned int len1, const uchar *base2, unsigned int len2,
+		 enum cpp_ttype type)
+{
+  token->type = type;
+  token->val.str.len = len1 + len2;
+  uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1);
+  memcpy (dest, base1, len1);
+  if (len2)
+    memcpy (dest+len1, base2, len2);
+  dest[len1 + len2] = 0;
+  token->val.str.text = dest;
+}
+
 const uchar *
 cpp_alloc_token_string (cpp_reader *pfile,
 			const unsigned char *ptr, unsigned len)
@@ -2392,6 +2460,11 @@ struct lit_accum {
       rpos = NULL;
     return c;
   }
+
+  void create_literal2 (cpp_reader *pfile, cpp_token *token,
+			const uchar *base1, unsigned int len1,
+			const uchar *base2, unsigned int len2,
+			enum cpp_ttype type);
 };
 
 /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer
@@ -2434,45 +2507,31 @@ lit_accum::read_begin (cpp_reader *pfile)
   rpos = BUFF_FRONT (last);
 }
 
-/* Returns true if a macro has been defined.
-   This might not work if compile with -save-temps,
-   or preprocess separately from compilation.  */
-
-static bool
-is_macro(cpp_reader *pfile, const uchar *base)
+/* Like create_literal2(), but also prepend all the accumulated data from
+   the lit_accum struct.  */
+void
+lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token,
+			    const uchar *base1, unsigned int len1,
+			    const uchar *base2, unsigned int len2,
+			    enum cpp_ttype type)
 {
-  const uchar *cur = base;
-  if (! ISIDST (*cur))
-    return false;
-  unsigned int hash = HT_HASHSTEP (0, *cur);
-  ++cur;
-  while (ISIDNUM (*cur))
+  const unsigned int tot_len = accum + len1 + len2;
+  uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1);
+  token->type = type;
+  token->val.str.len = tot_len;
+  token->val.str.text = dest;
+  for (_cpp_buff *buf = first; buf; buf = buf->next)
     {
-      hash = HT_HASHSTEP (hash, *cur);
-      ++cur;
+      size_t len = BUFF_FRONT (buf) - buf->base;
+      memcpy (dest, buf->base, len);
+      dest += len;
     }
-  hash = HT_HASHFINISH (hash, cur - base);
-
-  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
-					base, cur - base, hash, HT_NO_INSERT));
-
-  return result && cpp_macro_p (result);
-}
-
-/* Returns true if a literal suffix does not have the expected form
-   and is defined as a macro.  */
-
-static bool
-is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
-{
-  /* User-defined literals outside of namespace std must start with a single
-     underscore, so assume anything of that form really is a UDL suffix.
-     We don't need to worry about UDLs defined inside namespace std because
-     their names are reserved, so cannot be used as macro names in valid
-     programs.  */
-  if (base[0] == '_' && base[1] != '_')
-    return false;
-  return is_macro (pfile, base);
+  memcpy (dest, base1, len1);
+  dest += len1;
+  if (len2)
+    memcpy (dest, base2, len2);
+  dest += len2;
+  *dest = '\0';
 }
 
 /* Lexes a raw string.  The stored string contains the spelling,
@@ -2741,26 +2800,53 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
 
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, pos))
-	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*pos))
-	{
-	  type = cpp_userdef_string_add_type (type);
-	  ++pos;
+      const uchar *const suffix_begin = pos;
+      pfile->buffer->cur = pos;
 
-	  while (ISIDNUM (*pos))
-	    ++pos;
+      if (const auto sr = scan_cur_identifier (pfile))
+	{
+	  /* If a string format macro, say from inttypes.h, is placed touching
+	     a string literal it could be parsed as a C++11 user-defined
+	     string literal thus breaking the program.  User-defined literals
+	     outside of namespace std must start with a single underscore, so
+	     assume anything of that form really is a UDL suffix.  We don't
+	     need to worry about UDLs defined inside namespace std because
+	     their names are reserved, so cannot be used as macro names in
+	     valid programs.  */
+	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
+	      && cpp_macro_p (sr.node))
+	    {
+	      /* Maybe raise a warning, but do not consume the tokens.  */
+	      pfile->buffer->cur = suffix_begin;
+	      if (CPP_OPTION (pfile, warn_literal_suffix)
+		  && !pfile->state.skipping)
+		cpp_warning_with_line
+		  (pfile, CPP_W_LITERAL_SUFFIX,
+		   token->src_loc, 0,
+		   "invalid suffix on literal; C++11 requires "
+		   "a space between literal and string macro");
+	    }
+	  else
+	    {
+	      type = cpp_userdef_string_add_type (type);
+	      if (!accum.accum)
+		create_literal2 (pfile, token, base,
+				 suffix_begin - base,
+				 NODE_NAME (sr.node),
+				 NODE_LEN (sr.node),
+				 type);
+	      else
+		{
+		  accum.create_literal2 (pfile, token, base,
+					 suffix_begin - base,
+					 NODE_NAME (sr.node),
+					 NODE_LEN (sr.node),
+					 type);
+		  _cpp_release_buff (pfile, accum.first);
+		}
+	      warn_about_normalization (pfile, token, &sr.nst);
+	      return;
+	    }
 	}
     }
 
@@ -2770,21 +2856,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     create_literal (pfile, token, base, pos - base, type);
   else
     {
-      size_t extra_len = pos - base;
-      uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1);
-
-      token->type = type;
-      token->val.str.len = accum.accum + extra_len;
-      token->val.str.text = dest;
-      for (_cpp_buff *buf = accum.first; buf; buf = buf->next)
-	{
-	  size_t len = BUFF_FRONT (buf) - buf->base;
-	  memcpy (dest, buf->base, len);
-	  dest += len;
-	}
+      accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type);
       _cpp_release_buff (pfile, accum.first);
-      memcpy (dest, base, extra_len);
-      dest[extra_len] = '\0';
     }
 }
 
@@ -2891,39 +2964,58 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
 	       (int) terminator);
 
+  pfile->buffer->cur = cur;
+  const uchar *const suffix_begin = cur;
+
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, cur))
+      if (const auto sr = scan_cur_identifier (pfile))
 	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*cur))
-	{
-	  type = cpp_userdef_char_add_type (type);
-	  type = cpp_userdef_string_add_type (type);
-          ++cur;
-
-	  while (ISIDNUM (*cur))
-	    ++cur;
+	  /* If a string format macro, say from inttypes.h, is placed touching
+	     a string literal it could be parsed as a C++11 user-defined
+	     string literal thus breaking the program.  User-defined literals
+	     outside of namespace std must start with a single underscore, so
+	     assume anything of that form really is a UDL suffix.  We don't
+	     need to worry about UDLs defined inside namespace std because
+	     their names are reserved, so cannot be used as macro names in
+	     valid programs.  */
+	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
+	      && cpp_macro_p (sr.node))
+	    {
+	      /* Maybe raise a warning, but do not consume the tokens.  */
+	      pfile->buffer->cur = suffix_begin;
+	      if (CPP_OPTION (pfile, warn_literal_suffix)
+		  && !pfile->state.skipping)
+		cpp_warning_with_line
+		  (pfile, CPP_W_LITERAL_SUFFIX,
+		   token->src_loc, 0,
+		   "invalid suffix on literal; C++11 requires "
+		   "a space between literal and string macro");
+	    }
+	  else
+	    {
+	      /* Grab user defined literal suffix.  */
+	      type = cpp_userdef_char_add_type (type);
+	      type = cpp_userdef_string_add_type (type);
+	      create_literal2 (pfile, token, base, suffix_begin - base,
+			       NODE_NAME (sr.node), NODE_LEN (sr.node), type);
+	      warn_about_normalization (pfile, token, &sr.nst);
+	      return;
+	    }
 	}
     }
   else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat)
-	   && is_macro (pfile, cur)
 	   && !pfile->state.skipping)
-    cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
-			   token->src_loc, 0, "C++11 requires a space "
-			   "between string literal and macro");
+    {
+      const auto sr = scan_cur_identifier (pfile);
+      /* Maybe raise a warning, but do not consume the tokens.  */
+      pfile->buffer->cur = suffix_begin;
+      if (sr && cpp_macro_p (sr.node))
+	cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
+			       token->src_loc, 0, "C++11 requires a space "
+			       "between string literal and macro");
+    }
 
-  pfile->buffer->cur = cur;
   create_literal (pfile, token, base, cur - base, type);
 }
 
@@ -4322,7 +4414,7 @@ cpp_digraph2name (enum cpp_ttype type)
 }
 
 /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER.
-   The buffer must already contain the enough space to hold the
+   The buffer must already contain enough space to hold the
    token's spelling.  Returns a pointer to the character after the
    last character written.  */
 unsigned char *
@@ -4344,7 +4436,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident)
 }
 
 /* Write the spelling of a token TOKEN to BUFFER.  The buffer must
-   already contain the enough space to hold the token's spelling.
+   already contain enough space to hold the token's spelling.
    Returns a pointer to the character after the last character written.
    FORSTRING is true if this is to be the spelling after translation
    phase 1 (with the original spelling of extended identifiers), false