From patchwork Mon Aug 29 08:15:10 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Jakub Jelinek <jakub@redhat.com>
X-Patchwork-Id: 810
Date: Mon, 29 Aug 2022 10:15:10 +0200
To: Jason Merrill <jason@redhat.com>
Subject: [PATCH] libcpp: Add -Winvalid-utf8 warning [PR106655]
Message-ID: <Ywx1jiBRWLmv2NOZ@tucnak>
MIME-Version: 1.0
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: Jakub Jelinek via Gcc-patches
 <gcc-patches@gcc.gnu.org>
From: Jakub Jelinek <jakub@redhat.com>
Reply-To: Jakub Jelinek <jakub@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>

Hi!

The following patch introduces a new warning, -Winvalid-utf8, similar to
what clang now has, to diagnose invalid UTF-8 byte sequences in comments.
In identifiers and string literals such sequences should be diagnosed
already, but the contents of comments haven't really been verified so far.
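
To spell out what "invalid" means here, below is a rough standalone sketch of
a well-formedness check.  It is not the libcpp code (the patch relies on the
existing _cpp_valid_utf8 helper plus a range check); the function name
valid_utf8_length is made up for illustration.  It rejects stray continuation
bytes, the lead bytes 0xC0, 0xC1 and 0xF5-0xFF, truncated sequences, overlong
encodings, surrogates and code points above U+10FFFF, which are the same
categories the new test exercises.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of libcpp.  Returns the length (1-4) of the
// well-formed UTF-8 sequence starting at P, or 0 if the bytes at P do not
// start a well-formed sequence.  END points one past the last readable byte.
static std::size_t
valid_utf8_length (const unsigned char *p, const unsigned char *end)
{
  if (p >= end)
    return 0;

  const unsigned char c = p[0];
  if (c < 0x80)
    return 1;			// Plain ASCII.

  std::size_t len;
  std::uint32_t cp, min;
  if (c >= 0xC2 && c <= 0xDF)
    { len = 2; cp = c & 0x1F; min = 0x80; }
  else if (c >= 0xE0 && c <= 0xEF)
    { len = 3; cp = c & 0x0F; min = 0x800; }
  else if (c >= 0xF0 && c <= 0xF4)
    { len = 4; cp = c & 0x07; min = 0x10000; }
  else
    return 0;			// 0x80-0xBF, 0xC0, 0xC1, 0xF5-0xFF.

  if (static_cast<std::size_t> (end - p) < len)
    return 0;			// Truncated sequence.

  for (std::size_t i = 1; i < len; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
	return 0;		// Not a continuation byte.
      cp = (cp << 6) | (p[i] & 0x3F);
    }

  if (cp < min)
    return 0;			// Overlong encoding.
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;			// UTF-16 surrogate.
  if (cp > 0x10FFFF)
    return 0;			// Beyond the Unicode range.
  return len;
}
```

A caller scanning a comment would advance by the returned length when it is
nonzero and diagnose the byte(s) at the current position otherwise.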

I'm not sure whether this is enough to claim that P2295R6 is implemented.

The problem is that in the most common case people don't use the
-finput-charset= option.  The sources are often UTF-8, but sometimes they
are in some ASCII-compatible single-byte encoding in which non-ASCII
characters appear only in comments.  So having the warning off by default
is IMO desirable.  Now, if people use an explicit -finput-charset=UTF-8,
perhaps we could enable the warning by default for C++23 and use pedwarn
instead of warning, because then the user has told us explicitly that the
source is UTF-8.  From the paper I understood that one implementation option
is to claim that the implementation supports two encodings: UTF-8, and a
UTF-8-like encoding in which invalid UTF-8 characters in comments are
replaced, say, by spaces.  The latter could be the default, and the former
would be used only with -finput-charset=UTF-8 -Werror=invalid-utf8.
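
To make the proposal above concrete, here is a hypothetical sketch of the
decision it describes.  Nothing like this is in the patch (the patch only
adds an off-by-default warning); the enum, the function and its parameters
are invented for illustration, and the sketch ignores an explicit
-Wno-invalid-utf8.

```cpp
// Hypothetical names, not part of the patch or of GCC.
enum class utf8_diag { off, warning, pedwarn };

// Decide how invalid UTF-8 in comments would be diagnosed under the
// proposed policy:
//  - explicit -finput-charset=UTF-8 together with C++23: pedwarn,
//  - otherwise only when -Winvalid-utf8 was requested: plain warning,
//  - otherwise nothing.
static utf8_diag
invalid_utf8_diag_kind (bool explicit_utf8_input_charset, bool cxx23,
			bool winvalid_utf8_requested)
{
  if (explicit_utf8_input_charset && cxx23)
    return utf8_diag::pedwarn;
  if (winvalid_utf8_requested)
    return utf8_diag::warning;
  return utf8_diag::off;
}
```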

Thoughts on this?

2022-08-29  Jakub Jelinek  <jakub@redhat.com>

	PR c++/106655
libcpp/
	* include/cpplib.h (struct cpp_options): Implement C++23
	P2295R6 - Support for UTF-8 as a portable source file encoding.
	Add cpp_warn_invalid_utf8 field.
	(enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
	* init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8.
	* lex.cc (utf8_continuation): New const variable.
	(utf8_signifier): Move earlier in the file.
	(_cpp_warn_invalid_utf8): New function.
	(_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
	(skip_line_comment): Likewise.
gcc/
	* doc/invoke.texi (-Winvalid-utf8): Document it.
gcc/c-family/
	* c.opt (-Winvalid-utf8): New warning.
gcc/testsuite/
	* c-c++-common/cpp/Winvalid-utf8-1.c: New test.


	Jakub

--- libcpp/include/cpplib.h.jj	2022-08-25 14:25:23.866912426 +0200
+++ libcpp/include/cpplib.h	2022-08-27 12:17:55.185022807 +0200
@@ -560,6 +560,9 @@ struct cpp_options
      cpp_bidirectional_level.  */
   unsigned char cpp_warn_bidirectional;
 
+  /* True if libcpp should warn about invalid UTF-8 characters in comments.  */
+  bool cpp_warn_invalid_utf8;
+
   /* Dependency generation.  */
   struct
   {
@@ -666,7 +669,8 @@ enum cpp_warning_reason {
   CPP_W_CXX11_COMPAT,
   CPP_W_CXX20_COMPAT,
   CPP_W_EXPANSION_TO_DEFINED,
-  CPP_W_BIDIRECTIONAL
+  CPP_W_BIDIRECTIONAL,
+  CPP_W_INVALID_UTF8
 };
 
 /* Callback for header lookup for HEADER, which is the name of a
--- libcpp/init.cc.jj	2022-08-24 09:55:44.571876638 +0200
+++ libcpp/init.cc	2022-08-27 12:18:54.559246323 +0200
@@ -227,6 +227,7 @@ cpp_create_reader (enum c_lang lang, cpp
   CPP_OPTION (pfile, ext_numeric_literals) = 1;
   CPP_OPTION (pfile, warn_date_time) = 0;
   CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
+  CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0;
 
   /* Default CPP arithmetic to something sensible for the host for the
      benefit of dumb users like fix-header.  */
--- libcpp/lex.cc.jj	2022-08-26 09:24:12.089615949 +0200
+++ libcpp/lex.cc	2022-08-27 13:43:40.560769087 +0200
@@ -1704,6 +1704,59 @@ maybe_warn_bidi_on_char (cpp_reader *pfi
   bidi::on_char (kind, ucn_p, loc);
 }
 
+static const cppchar_t utf8_continuation = 0x80;
+static const cppchar_t utf8_signifier = 0xC0;
+
+/* Emit -Winvalid-utf8 warning on invalid UTF-8 character starting
+   at PFILE->buffer->cur.  Return a pointer after the diagnosed
+   invalid character.  */
+
+static const uchar *
+_cpp_warn_invalid_utf8 (cpp_reader *pfile)
+{
+  cpp_buffer *buffer = pfile->buffer;
+  const uchar *cur = buffer->cur;
+
+  if (cur[0] < utf8_signifier
+      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+			     pfile->line_table->highest_line,
+			     CPP_BUF_COL (buffer),
+			     "invalid UTF-8 character <%x> in comment",
+			     cur[0]);
+      return cur + 1;
+    }
+  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+			     pfile->line_table->highest_line,
+			     CPP_BUF_COL (buffer),
+			     "invalid UTF-8 character <%x><%x> in comment",
+			     cur[0], cur[1]);
+      return cur + 2;
+    }
+  else if (cur[3] < utf8_continuation || cur[3] >= utf8_signifier)
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+			     pfile->line_table->highest_line,
+			     CPP_BUF_COL (buffer),
+			     "invalid UTF-8 character <%x><%x><%x> in comment",
+			     cur[0], cur[1], cur[2]);
+      return cur + 3;
+    }
+  else
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+			     pfile->line_table->highest_line,
+			     CPP_BUF_COL (buffer),
+			     "invalid UTF-8 character "
+			     "<%x><%x><%x><%x> in comment",
+			     cur[0], cur[1], cur[2], cur[3]);
+      return cur + 4;
+    }
+}
+
 /* Skip a C-style block comment.  We find the end of the comment by
    seeing if an asterisk is before every '/' we encounter.  Returns
    nonzero if comment terminated by EOF, zero otherwise.
@@ -1716,6 +1769,8 @@ _cpp_skip_block_comment (cpp_reader *pfi
   const uchar *cur = buffer->cur;
   uchar c;
   const bool warn_bidi_p = pfile->warn_bidi_p ();
+  const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
+  const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
 
   cur++;
   if (*cur == '/')
@@ -1765,13 +1820,32 @@ _cpp_skip_block_comment (cpp_reader *pfi
 
 	  cur = buffer->cur;
 	}
-      /* If this is a beginning of a UTF-8 encoding, it might be
-	 a bidirectional control character.  */
-      else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
-	{
-	  location_t loc;
-	  bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
-	  maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
+      else if (__builtin_expect (c >= utf8_continuation, 0)
+	       && warn_bidi_or_invalid_utf8_p)
+	{
+	  /* If this is a beginning of a UTF-8 encoding, it might be
+	     a bidirectional control character.  */
+	  if (c == bidi::utf8_start && warn_bidi_p)
+	    {
+	      location_t loc;
+	      bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
+	      maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
+	    }
+	  if (!warn_invalid_utf8_p)
+	    continue;
+	  if (c >= utf8_signifier)
+	    {
+	      cppchar_t s;
+	      const uchar *pstr = cur - 1;
+	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
+		  && s <= 0x0010FFFF)
+		{
+		  cur = pstr;
+		  continue;
+		}
+	    }
+	  buffer->cur = cur - 1;
+	  cur = _cpp_warn_invalid_utf8 (pfile);
 	}
     }
 
@@ -1789,11 +1863,13 @@ skip_line_comment (cpp_reader *pfile)
   cpp_buffer *buffer = pfile->buffer;
   location_t orig_line = pfile->line_table->highest_line;
   const bool warn_bidi_p = pfile->warn_bidi_p ();
+  const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
+  const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
 
-  if (!warn_bidi_p)
+  if (!warn_bidi_or_invalid_utf8_p)
     while (*buffer->cur != '\n')
       buffer->cur++;
-  else
+  else if (!warn_invalid_utf8_p)
     {
       while (*buffer->cur != '\n'
 	     && *buffer->cur != bidi::utf8_start)
@@ -1813,6 +1889,38 @@ skip_line_comment (cpp_reader *pfile)
 	  maybe_warn_bidi_on_close (pfile, buffer->cur);
 	}
     }
+  else
+    {
+      while (*buffer->cur != '\n')
+	{
+	  if (*buffer->cur < utf8_continuation)
+	    {
+	      buffer->cur++;
+	      continue;
+	    }
+	  if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)
+	      && warn_bidi_p)
+	    {
+	      location_t loc;
+	      bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc);
+	      maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
+	    }
+	  if (*buffer->cur >= utf8_signifier)
+	    {
+	      cppchar_t s;
+	      const uchar *pstr = buffer->cur;
+	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
+		  && s <= 0x0010FFFF)
+		{
+		  buffer->cur = pstr;
+		  continue;
+		}
+	    }
+	  buffer->cur = _cpp_warn_invalid_utf8 (pfile);
+	}
+      if (warn_bidi_p)
+	maybe_warn_bidi_on_close (pfile, buffer->cur);
+    }
 
   _cpp_process_line_notes (pfile, true);
   return orig_line != pfile->line_table->highest_line;
@@ -1919,8 +2027,6 @@ warn_about_normalization (cpp_reader *pf
     }
 }
 
-static const cppchar_t utf8_signifier = 0xC0;
-
 /* Returns TRUE if the sequence starting at buffer->cur is valid in
    an identifier.  FIRST is TRUE if this starts an identifier.  */
 
--- gcc/doc/invoke.texi.jj	2022-08-27 09:14:43.047696028 +0200
+++ gcc/doc/invoke.texi	2022-08-27 14:05:22.417755406 +0200
@@ -365,9 +365,9 @@ Objective-C and Objective-C++ Dialects}.
 -Winfinite-recursion @gol
 -Winit-self  -Winline  -Wno-int-conversion  -Wint-in-bool-context @gol
 -Wno-int-to-pointer-cast  -Wno-invalid-memory-model @gol
--Winvalid-pch  -Wjump-misses-init  -Wlarger-than=@var{byte-size} @gol
--Wlogical-not-parentheses  -Wlogical-op  -Wlong-long @gol
--Wno-lto-type-mismatch -Wmain  -Wmaybe-uninitialized @gol
+-Winvalid-pch  -Winvalid-utf8  -Wjump-misses-init @gol
+-Wlarger-than=@var{byte-size}  -Wlogical-not-parentheses  -Wlogical-op @gol
+-Wlong-long  -Wno-lto-type-mismatch  -Wmain  -Wmaybe-uninitialized @gol
 -Wmemset-elt-size  -Wmemset-transposed-args @gol
 -Wmisleading-indentation  -Wmissing-attributes  -Wmissing-braces @gol
 -Wmissing-field-initializers  -Wmissing-format-attribute @gol
@@ -9569,6 +9569,11 @@ different size.
 Warn if a precompiled header (@pxref{Precompiled Headers}) is found in
 the search path but cannot be used.
 
+@item -Winvalid-utf8
+@opindex Winvalid-utf8
+@opindex Wno-invalid-utf8
+Warn if an invalid UTF-8 character is found in a comment.
+
 @item -Wlong-long
 @opindex Wlong-long
 @opindex Wno-long-long
--- gcc/c-family/c.opt.jj	2022-08-27 09:14:43.036696173 +0200
+++ gcc/c-family/c.opt	2022-08-27 14:03:06.328534617 +0200
@@ -821,6 +821,10 @@ Winvalid-pch
 C ObjC C++ ObjC++ CPP(warn_invalid_pch) CppReason(CPP_W_INVALID_PCH) Var(cpp_warn_invalid_pch) Init(0) Warning
 Warn about PCH files that are found but not used.
 
+Winvalid-utf8
+C ObjC C++ ObjC++ CPP(cpp_warn_invalid_utf8) CppReason(CPP_W_INVALID_UTF8) Var(warn_invalid_utf8) Init(0) Warning
+Warn about invalid UTF-8 characters in comments.
+
 Wjump-misses-init
 C ObjC Var(warn_jump_misses_init) Warning LangEnabledby(C ObjC,Wc++-compat)
 Warn when a jump misses a variable initialization.
--- gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c.jj	2022-08-27 14:01:51.115517571 +0200
+++ gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c	2022-08-27 14:33:21.466802817 +0200
@@ -0,0 +1,39 @@
+// P2295R6 - Support for UTF-8 as a portable source file encoding
+// This test intentionally contains various byte sequences which are not valid UTF-8
+// { dg-do preprocess }
+// { dg-options "-finput-charset=UTF-8 -Winvalid-utf8" }
+
+// a߿ࠀ퟿𐀀􏿿a		{ dg-bogus "invalid UTF-8 character" }
+// a�a					{ dg-warning "invalid UTF-8 character <80> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <bf> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <c0> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <c1> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <f5> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <ff> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <c2> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <e0> in comment" }
+// a���a				{ dg-warning "invalid UTF-8 character <e0><80><bf> in comment" }
+// a���a				{ dg-warning "invalid UTF-8 character <e0><9f><80> in comment" }
+// a��a					{ dg-warning "invalid UTF-8 character <e0><bf> in comment" }
+// a��a					{ dg-warning "invalid UTF-8 character <ec><80> in comment" }
+// a���a				{ dg-warning "invalid UTF-8 character <ed><a0><80> in comment" }
+// a����a				{ dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" }
+// a����a				{ dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" }
+// a����a				{ dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" }
+/* a߿ࠀ퟿𐀀􏿿a		{ dg-bogus "invalid UTF-8 character" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <80> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <bf> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <c0> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <c1> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <f5> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <ff> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <c2> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <e0> in comment" } */
+/* a���a				{ dg-warning "invalid UTF-8 character <e0><80><bf> in comment" } */
+/* a���a				{ dg-warning "invalid UTF-8 character <e0><9f><80> in comment" } */
+/* a��a					{ dg-warning "invalid UTF-8 character <e0><bf> in comment" } */
+/* a��a					{ dg-warning "invalid UTF-8 character <ec><80> in comment" } */
+/* a���a				{ dg-warning "invalid UTF-8 character <ed><a0><80> in comment" } */
+/* a����a				{ dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" } */
+/* a����a				{ dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" } */
+/* a����a				{ dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" } */