libcpp: Add -Winvalid-utf8 warning [PR106655]

Message ID Ywx1jiBRWLmv2NOZ@tucnak
State New, archived
Headers
Series libcpp: Add -Winvalid-utf8 warning [PR106655] |

Commit Message

Jakub Jelinek Aug. 29, 2022, 8:15 a.m. UTC
  Hi!

The following patch introduces a new warning - -Winvalid-utf8 similarly
to what clang now has - to diagnose invalid UTF-8 byte sequences in
comments.  In identifiers and in string literals it should be diagnosed
already but comment content hasn't been really verified.

I'm not sure if this is enough to say P2295R6 is implemented or not.

The problem is that in the most common case, people don't use
-finput-charset= option and the sources often are UTF-8, but sometimes
could be some ASCII compatible single byte encoding where non-ASCII
characters only appear in comments.  So having the warning off by default
is IMO desirable.  Now, if people use explicit -finput-charset=UTF-8,
perhaps we could make the warning on by default for C++23 and use pedwarn
instead of warning, because then the user told us explicitly that the source
is UTF-8.  From the paper I understood one of the implementation options
is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8
like encodings where invalid UTF-8 characters in comments are replaced say
by spaces, where the latter could be the default and the former only
used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used.

Thoughts on this?

2022-08-29  Jakub Jelinek  <jakub@redhat.com>

	PR c++/106655
libcpp/
	* include/cpplib.h (struct cpp_options): Implement C++23
	P2295R6 - Support for UTF-8 as a portable source file encoding.
	Add cpp_warn_invalid_utf8 field.
	(enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
	* init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8.
	* lex.cc (utf8_continuation): New const variable.
	(utf8_signifier): Move earlier in the file.
	(_cpp_warn_invalid_utf8): New function.
	(_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
	(skip_line_comment): Likewise.
gcc/
	* doc/invoke.texi (-Winvalid-utf8): Document it.
gcc/c-family/
	* c.opt (-Winvalid-utf8): New warning.
gcc/testsuite/
	* c-c++-common/cpp/Winvalid-utf8-1.c: New test.


	Jakub
  

Comments

Jason Merrill Aug. 29, 2022, 9:15 p.m. UTC | #1
On 8/29/22 04:15, Jakub Jelinek wrote:
> Hi!
> 
> The following patch introduces a new warning - -Winvalid-utf8 similarly
> to what clang now has - to diagnose invalid UTF-8 byte sequences in
> comments.  In identifiers and in string literals it should be diagnosed
> already but comment content hasn't been really verified.
> 
> I'm not sure if this is enough to say P2295R6 is implemented or not.
> 
> The problem is that in the most common case, people don't use
> -finput-charset= option and the sources often are UTF-8, but sometimes
> could be some ASCII compatible single byte encoding where non-ASCII
> characters only appear in comments.  So having the warning off by default
> is IMO desirable.  Now, if people use explicit -finput-charset=UTF-8,
> perhaps we could make the warning on by default for C++23 and use pedwarn
> instead of warning, because then the user told us explicitly that the source
> is UTF-8.  From the paper I understood one of the implementation options
> is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8
> like encodings where invalid UTF-8 characters in comments are replaced say
> by spaces, where the latter could be the default and the former only
> used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used.
> 
> Thoughts on this?

That sounds good to me.

> 2022-08-29  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR c++/106655
> libcpp/
> 	* include/cpplib.h (struct cpp_options): Implement C++23
> 	P2295R6 - Support for UTF-8 as a portable source file encoding.
> 	Add cpp_warn_invalid_utf8 field.
> 	(enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
> 	* init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8.
> 	* lex.cc (utf8_continuation): New const variable.
> 	(utf8_signifier): Move earlier in the file.
> 	(_cpp_warn_invalid_utf8): New function.
> 	(_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
> 	(skip_line_comment): Likewise.
> gcc/
> 	* doc/invoke.texi (-Winvalid-utf8): Document it.
> gcc/c-family/
> 	* c.opt (-Winvalid-utf8): New warning.
> gcc/testsuite/
> 	* c-c++-common/cpp/Winvalid-utf8-1.c: New test.
> 
> --- libcpp/include/cpplib.h.jj	2022-08-25 14:25:23.866912426 +0200
> +++ libcpp/include/cpplib.h	2022-08-27 12:17:55.185022807 +0200
> @@ -560,6 +560,9 @@ struct cpp_options
>        cpp_bidirectional_level.  */
>     unsigned char cpp_warn_bidirectional;
>   
> +  /* True if libcpp should warn about invalid UTF-8 characters in comments.  */
> +  bool cpp_warn_invalid_utf8;
> +
>     /* Dependency generation.  */
>     struct
>     {
> @@ -666,7 +669,8 @@ enum cpp_warning_reason {
>     CPP_W_CXX11_COMPAT,
>     CPP_W_CXX20_COMPAT,
>     CPP_W_EXPANSION_TO_DEFINED,
> -  CPP_W_BIDIRECTIONAL
> +  CPP_W_BIDIRECTIONAL,
> +  CPP_W_INVALID_UTF8
>   };
>   
>   /* Callback for header lookup for HEADER, which is the name of a
> --- libcpp/init.cc.jj	2022-08-24 09:55:44.571876638 +0200
> +++ libcpp/init.cc	2022-08-27 12:18:54.559246323 +0200
> @@ -227,6 +227,7 @@ cpp_create_reader (enum c_lang lang, cpp
>     CPP_OPTION (pfile, ext_numeric_literals) = 1;
>     CPP_OPTION (pfile, warn_date_time) = 0;
>     CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
> +  CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0;
>   
>     /* Default CPP arithmetic to something sensible for the host for the
>        benefit of dumb users like fix-header.  */
> --- libcpp/lex.cc.jj	2022-08-26 09:24:12.089615949 +0200
> +++ libcpp/lex.cc	2022-08-27 13:43:40.560769087 +0200
> @@ -1704,6 +1704,59 @@ maybe_warn_bidi_on_char (cpp_reader *pfi
>     bidi::on_char (kind, ucn_p, loc);
>   }
>   
> +static const cppchar_t utf8_continuation = 0x80;
> +static const cppchar_t utf8_signifier = 0xC0; >
> +/* Emit -Winvalid-utf8 warning on invalid UTF-8 character starting
> +   at PFILE->buffer->cur.  Return a pointer after the diagnosed
> +   invalid character.  */
> +
> +static const uchar *
> +_cpp_warn_invalid_utf8 (cpp_reader *pfile)
> +{
> +  cpp_buffer *buffer = pfile->buffer;
> +  const uchar *cur = buffer->cur;
> +
> +  if (cur[0] < utf8_signifier
> +      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
> +    {
> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> +			     pfile->line_table->highest_line,
> +			     CPP_BUF_COL (buffer),
> +			     "invalid UTF-8 character <%x> in comment",
> +			     cur[0]);
> +      return cur + 1;
> +    }
> +  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)

Unicode table 3-7 says that the second byte is sometimes restricted to 
less than this range.  Hmm, it looks like one_utf8_to_cppchar doesn't 
check that, either.

Did you consider adding error reporting to one_utf8_to_cppchar?  It 
might be better to localize the utf8 logic.

> +    {
> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> +			     pfile->line_table->highest_line,
> +			     CPP_BUF_COL (buffer),
> +			     "invalid UTF-8 character <%x><%x> in comment",
> +			     cur[0], cur[1]);
> +      return cur + 2;
> +    }
> +  else if (cur[3] < utf8_continuation || cur[3] >= utf8_signifier)
> +    {
> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> +			     pfile->line_table->highest_line,
> +			     CPP_BUF_COL (buffer),
> +			     "invalid UTF-8 character <%x><%x><%x> in comment",
> +			     cur[0], cur[1], cur[2]);
> +      return cur + 3;
> +    }
> +  else
> +    {
> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> +			     pfile->line_table->highest_line,
> +			     CPP_BUF_COL (buffer),
> +			     "invalid UTF-8 character "
> +			     "<%x><%x><%x><%x> in comment",
> +			     cur[0], cur[1], cur[2], cur[3]);
> +      return cur + 4;
> +    }
> +}
> +
>   /* Skip a C-style block comment.  We find the end of the comment by
>      seeing if an asterisk is before every '/' we encounter.  Returns
>      nonzero if comment terminated by EOF, zero otherwise.
> @@ -1716,6 +1769,8 @@ _cpp_skip_block_comment (cpp_reader *pfi
>     const uchar *cur = buffer->cur;
>     uchar c;
>     const bool warn_bidi_p = pfile->warn_bidi_p ();
> +  const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
> +  const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
>   
>     cur++;
>     if (*cur == '/')
> @@ -1765,13 +1820,32 @@ _cpp_skip_block_comment (cpp_reader *pfi
>   
>   	  cur = buffer->cur;
>   	}
> -      /* If this is a beginning of a UTF-8 encoding, it might be
> -	 a bidirectional control character.  */
> -      else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
> -	{
> -	  location_t loc;
> -	  bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
> -	  maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> +      else if (__builtin_expect (c >= utf8_continuation, 0)
> +	       && warn_bidi_or_invalid_utf8_p)
> +	{
> +	  /* If this is a beginning of a UTF-8 encoding, it might be
> +	     a bidirectional control character.  */
> +	  if (c == bidi::utf8_start && warn_bidi_p)
> +	    {
> +	      location_t loc;
> +	      bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
> +	      maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> +	    }
> +	  if (!warn_invalid_utf8_p)
> +	    continue;
> +	  if (c >= utf8_signifier)
> +	    {
> +	      cppchar_t s;
> +	      const uchar *pstr = cur - 1;
> +	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> +		  && s <= 0x0010FFFF)
> +		{
> +		  cur = pstr;
> +		  continue;
> +		}
> +	    }
> +	  buffer->cur = cur - 1;
> +	  cur = _cpp_warn_invalid_utf8 (pfile);
>   	}
>       }
>   
> @@ -1789,11 +1863,13 @@ skip_line_comment (cpp_reader *pfile)
>     cpp_buffer *buffer = pfile->buffer;
>     location_t orig_line = pfile->line_table->highest_line;
>     const bool warn_bidi_p = pfile->warn_bidi_p ();
> +  const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
> +  const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
>   
> -  if (!warn_bidi_p)
> +  if (!warn_bidi_or_invalid_utf8_p)
>       while (*buffer->cur != '\n')
>         buffer->cur++;
> -  else
> +  else if (!warn_invalid_utf8_p)
>       {
>         while (*buffer->cur != '\n'
>   	     && *buffer->cur != bidi::utf8_start)
> @@ -1813,6 +1889,38 @@ skip_line_comment (cpp_reader *pfile)
>   	  maybe_warn_bidi_on_close (pfile, buffer->cur);
>   	}
>       }
> +  else
> +    {
> +      while (*buffer->cur != '\n')
> +	{
> +	  if (*buffer->cur < utf8_continuation)
> +	    {
> +	      buffer->cur++;
> +	      continue;
> +	    }
> +	  if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)
> +	      && warn_bidi_p)
> +	    {
> +	      location_t loc;
> +	      bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc);
> +	      maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> +	    }
> +	  if (*buffer->cur >= utf8_signifier)
> +	    {
> +	      cppchar_t s;
> +	      const uchar *pstr = buffer->cur;
> +	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> +		  && s <= 0x0010FFFF)
> +		{
> +		  buffer->cur = pstr;
> +		  continue;
> +		}
> +	    }
> +	  buffer->cur = _cpp_warn_invalid_utf8 (pfile);
> +	}
> +      if (warn_bidi_p)
> +	maybe_warn_bidi_on_close (pfile, buffer->cur);
> +    }
>   
>     _cpp_process_line_notes (pfile, true);
>     return orig_line != pfile->line_table->highest_line;
> @@ -1919,8 +2027,6 @@ warn_about_normalization (cpp_reader *pf
>       }
>   }
>   
> -static const cppchar_t utf8_signifier = 0xC0;
> -
>   /* Returns TRUE if the sequence starting at buffer->cur is valid in
>      an identifier.  FIRST is TRUE if this starts an identifier.  */
>   
> --- gcc/doc/invoke.texi.jj	2022-08-27 09:14:43.047696028 +0200
> +++ gcc/doc/invoke.texi	2022-08-27 14:05:22.417755406 +0200
> @@ -365,9 +365,9 @@ Objective-C and Objective-C++ Dialects}.
>   -Winfinite-recursion @gol
>   -Winit-self  -Winline  -Wno-int-conversion  -Wint-in-bool-context @gol
>   -Wno-int-to-pointer-cast  -Wno-invalid-memory-model @gol
> --Winvalid-pch  -Wjump-misses-init  -Wlarger-than=@var{byte-size} @gol
> --Wlogical-not-parentheses  -Wlogical-op  -Wlong-long @gol
> --Wno-lto-type-mismatch -Wmain  -Wmaybe-uninitialized @gol
> +-Winvalid-pch  -Winvalid-utf8 -Wjump-misses-init  @gol
> +-Wlarger-than=@var{byte-size}  -Wlogical-not-parentheses  -Wlogical-op  @gol
> +-Wlong-long  -Wno-lto-type-mismatch -Wmain  -Wmaybe-uninitialized @gol
>   -Wmemset-elt-size  -Wmemset-transposed-args @gol
>   -Wmisleading-indentation  -Wmissing-attributes  -Wmissing-braces @gol
>   -Wmissing-field-initializers  -Wmissing-format-attribute @gol
> @@ -9569,6 +9569,11 @@ different size.
>   Warn if a precompiled header (@pxref{Precompiled Headers}) is found in
>   the search path but cannot be used.
>   
> +@item -Winvalid-utf8
> +@opindex Winvalid-utf8
> +@opindex Wno-invalid-utf8
> +Warn if an invalid UTF-8 character is inside of a comment.
> +
>   @item -Wlong-long
>   @opindex Wlong-long
>   @opindex Wno-long-long
> --- gcc/c-family/c.opt.jj	2022-08-27 09:14:43.036696173 +0200
> +++ gcc/c-family/c.opt	2022-08-27 14:03:06.328534617 +0200
> @@ -821,6 +821,10 @@ Winvalid-pch
>   C ObjC C++ ObjC++ CPP(warn_invalid_pch) CppReason(CPP_W_INVALID_PCH) Var(cpp_warn_invalid_pch) Init(0) Warning
>   Warn about PCH files that are found but not used.
>   
> +Winvalid-utf8
> +C objC C++ ObjC++ CPP(cpp_warn_invalid_utf8) CppReason(CPP_W_INVALID_UTF8) Var(warn_invalid_utf8) Init(0) Warning
> +Warn about invalid UTF-8 characters in comments.
> +
>   Wjump-misses-init
>   C ObjC Var(warn_jump_misses_init) Warning LangEnabledby(C ObjC,Wc++-compat)
>   Warn when a jump misses a variable initialization.
> --- gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c.jj	2022-08-27 14:01:51.115517571 +0200
> +++ gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c	2022-08-27 14:33:21.466802817 +0200
> @@ -0,0 +1,39 @@
> +// P2295R6 - Support for UTF-8 as a portable source file encoding
> +// This test intentionally contains various byte sequences which are not valid UTF-8
> +// { dg-do preprocess }
> +// { dg-options "-finput-charset=UTF-8 -Winvalid-utf8" }
> +
> +// a€߿ࠀ퟿𐀀􏿿a		{ dg-bogus "invalid UTF-8 character" }
> +// a�a					{ dg-warning "invalid UTF-8 character <80> in comment" }
> +// a�a					{ dg-warning "invalid UTF-8 character <bf> in comment" }
> +// a�a					{ dg-warning "invalid UTF-8 character <c0> in comment" }
> +// a�a					{ dg-warning "invalid UTF-8 character <c1> in comment" }
> +// a�a					{ dg-warning "invalid UTF-8 character <f5> in comment" }
> +// a�a					{ dg-warning "invalid UTF-8 character <ff> in comment" }
> +// a�a					{ dg-warning "invalid UTF-8 character <c2> in comment" }
> +// a�a					{ dg-warning "invalid UTF-8 character <e0> in comment" }
> +// a���a				{ dg-warning "invalid UTF-8 character <e0><80><bf> in comment" }
> +// a���a				{ dg-warning "invalid UTF-8 character <e0><9f><80> in comment" }
> +// a��a					{ dg-warning "invalid UTF-8 character <e0><bf> in comment" }
> +// a��a					{ dg-warning "invalid UTF-8 character <ec><80> in comment" }
> +// a���a				{ dg-warning "invalid UTF-8 character <ed><a0><80> in comment" }
> +// a����a				{ dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" }
> +// a����a				{ dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" }
> +// a����a				{ dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" }
> +/* a€߿ࠀ퟿𐀀􏿿a		{ dg-bogus "invalid UTF-8 character" } */
> +/* a�a					{ dg-warning "invalid UTF-8 character <80> in comment" } */
> +/* a�a					{ dg-warning "invalid UTF-8 character <bf> in comment" } */
> +/* a�a					{ dg-warning "invalid UTF-8 character <c0> in comment" } */
> +/* a�a					{ dg-warning "invalid UTF-8 character <c1> in comment" } */
> +/* a�a					{ dg-warning "invalid UTF-8 character <f5> in comment" } */
> +/* a�a					{ dg-warning "invalid UTF-8 character <ff> in comment" } */
> +/* a�a					{ dg-warning "invalid UTF-8 character <c2> in comment" } */
> +/* a�a					{ dg-warning "invalid UTF-8 character <e0> in comment" } */
> +/* a���a				{ dg-warning "invalid UTF-8 character <e0><80><bf> in comment" } */
> +/* a���a				{ dg-warning "invalid UTF-8 character <e0><9f><80> in comment" } */
> +/* a��a					{ dg-warning "invalid UTF-8 character <e0><bf> in comment" } */
> +/* a��a					{ dg-warning "invalid UTF-8 character <ec><80> in comment" } */
> +/* a���a				{ dg-warning "invalid UTF-8 character <ed><a0><80> in comment" } */
> +/* a����a				{ dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" } */
> +/* a����a				{ dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" } */
> +/* a����a				{ dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" } */
> 
> 	Jakub
>
  
Jakub Jelinek Aug. 29, 2022, 9:35 p.m. UTC | #2
On Mon, Aug 29, 2022 at 05:15:26PM -0400, Jason Merrill wrote:
> On 8/29/22 04:15, Jakub Jelinek wrote:
> > Hi!
> > 
> > The following patch introduces a new warning - -Winvalid-utf8 similarly
> > to what clang now has - to diagnose invalid UTF-8 byte sequences in
> > comments.  In identifiers and in string literals it should be diagnosed
> > already but comment content hasn't been really verified.
> > 
> > I'm not sure if this is enough to say P2295R6 is implemented or not.
> > 
> > The problem is that in the most common case, people don't use
> > -finput-charset= option and the sources often are UTF-8, but sometimes
> > could be some ASCII compatible single byte encoding where non-ASCII
> > characters only appear in comments.  So having the warning off by default
> > is IMO desirable.  Now, if people use explicit -finput-charset=UTF-8,
> > perhaps we could make the warning on by default for C++23 and use pedwarn
> > instead of warning, because then the user told us explicitly that the source
> > is UTF-8.  From the paper I understood one of the implementation options
> > is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8
> > like encodings where invalid UTF-8 characters in comments are replaced say
> > by spaces, where the latter could be the default and the former only
> > used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used.
> > 
> > Thoughts on this?
> 
> That sounds good to me.

The pedwarn on -std=c++23 -finput-charset=UTF-8 or just documenting that
"conforming" UTF-8 is only -finput-charset=UTF-8 -Werror=invalid-utf8 ?

> > +static const uchar *
> > +_cpp_warn_invalid_utf8 (cpp_reader *pfile)
> > +{
> > +  cpp_buffer *buffer = pfile->buffer;
> > +  const uchar *cur = buffer->cur;
> > +
> > +  if (cur[0] < utf8_signifier
> > +      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
> > +    {
> > +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> > +			     pfile->line_table->highest_line,
> > +			     CPP_BUF_COL (buffer),
> > +			     "invalid UTF-8 character <%x> in comment",
> > +			     cur[0]);
> > +      return cur + 1;
> > +    }
> > +  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)
> 
> Unicode table 3-7 says that the second byte is sometimes restricted to less
> than this range.

That is true and I've tried to include tests for all of those cases in the
testcase and all of them get a warning.  Some of them are through:
  /* Make sure the shortest possible encoding was used.  */

  if (c <=      0x7F && nbytes > 1) return EILSEQ;
  if (c <=     0x7FF && nbytes > 2) return EILSEQ;
  if (c <=    0xFFFF && nbytes > 3) return EILSEQ;
  if (c <=  0x1FFFFF && nbytes > 4) return EILSEQ;
  if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
and others are through:
  /* Make sure the character is valid.  */
  if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
All I had to do outside of what one_utf8_to_cppchar already implements was:

> > +	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> > +		  && s <= 0x0010FFFF)

the <= 0x0010FFFF check, because one_utf8_to_cppchar as written happily
supports up to 6 bytes long UTF-8 which can encode up to 7FFFFFFF:
   00000000-0000007F   0xxxxxxx
   00000080-000007FF   110xxxxx 10xxxxxx
   00000800-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
   00010000-001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   00200000-03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   04000000-7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
while 3-7 only talks about encoding 0..D7FF and D800..10FFFF in up to 4
bytes.

I guess I should try what happens with 0x110000 and 0x7fffffff in
identifiers and string literals.

	Jakub
  
Jakub Jelinek Aug. 29, 2022, 9:54 p.m. UTC | #3
On Mon, Aug 29, 2022 at 11:35:44PM +0200, Jakub Jelinek wrote:
> I guess I should try what happens with 0x110000 and 0x7fffffff in
> identifiers and string literals.

It is rejected in identifiers, but happily accepted in string literals:
const char32_t *a = U"����";
const char32_t *b = U"������";
int a����b = 1;
int c������d = 2;

	Jakub
  
Jason Merrill Aug. 30, 2022, 3:31 a.m. UTC | #4
On 8/29/22 17:35, Jakub Jelinek wrote:
> On Mon, Aug 29, 2022 at 05:15:26PM -0400, Jason Merrill wrote:
>> On 8/29/22 04:15, Jakub Jelinek wrote:
>>> Hi!
>>>
>>> The following patch introduces a new warning - -Winvalid-utf8 similarly
>>> to what clang now has - to diagnose invalid UTF-8 byte sequences in
>>> comments.  In identifiers and in string literals it should be diagnosed
>>> already but comment content hasn't been really verified.
>>>
>>> I'm not sure if this is enough to say P2295R6 is implemented or not.
>>>
>>> The problem is that in the most common case, people don't use
>>> -finput-charset= option and the sources often are UTF-8, but sometimes
>>> could be some ASCII compatible single byte encoding where non-ASCII
>>> characters only appear in comments.  So having the warning off by default
>>> is IMO desirable.  Now, if people use explicit -finput-charset=UTF-8,
>>> perhaps we could make the warning on by default for C++23 and use pedwarn
>>> instead of warning, because then the user told us explicitly that the source
>>> is UTF-8.  From the paper I understood one of the implementation options
>>> is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8
>>> like encodings where invalid UTF-8 characters in comments are replaced say
>>> by spaces, where the latter could be the default and the former only
>>> used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used.
>>>
>>> Thoughts on this?
>>
>> That sounds good to me.
> 
> The pedwarn on -std=c++23 -finput-charset=UTF-8 or just documenting that
> "conforming" UTF-8 is only -finput-charset=UTF-8 -Werror=invalid-utf8 ?

The former.

>>> +static const uchar *
>>> +_cpp_warn_invalid_utf8 (cpp_reader *pfile)
>>> +{
>>> +  cpp_buffer *buffer = pfile->buffer;
>>> +  const uchar *cur = buffer->cur;
>>> +
>>> +  if (cur[0] < utf8_signifier
>>> +      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
>>> +    {
>>> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
>>> +			     pfile->line_table->highest_line,
>>> +			     CPP_BUF_COL (buffer),
>>> +			     "invalid UTF-8 character <%x> in comment",
>>> +			     cur[0]);
>>> +      return cur + 1;
>>> +    }
>>> +  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)
>>
>> Unicode table 3-7 says that the second byte is sometimes restricted to less
>> than this range.
> 
> That is true and I've tried to include tests for all of those cases in the
> testcase and all of them get a warning.  Some of them are through:
>    /* Make sure the shortest possible encoding was used.  */
> 
>    if (c <=      0x7F && nbytes > 1) return EILSEQ;
>    if (c <=     0x7FF && nbytes > 2) return EILSEQ;
>    if (c <=    0xFFFF && nbytes > 3) return EILSEQ;
>    if (c <=  0x1FFFFF && nbytes > 4) return EILSEQ;
>    if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
> and others are through:
>    /* Make sure the character is valid.  */
>    if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
> All I had to do outside of what one_utf8_to_cppchar already implements was:
> 
>>> +	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
>>> +		  && s <= 0x0010FFFF)
> 
> the <= 0x0010FFFF check, because one_utf8_to_cppchar as written happily
> supports up to 6 bytes long UTF-8 which can encode up to 7FFFFFFF:
>     00000000-0000007F   0xxxxxxx
>     00000080-000007FF   110xxxxx 10xxxxxx
>     00000800-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
>     00010000-001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>     00200000-03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>     04000000-7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> while 3-7 only talks about encoding 0..D7FF and D800..10FFFF in up to 4
> bytes.
> 
> I guess I should try what happens with 0x110000 and 0x7fffffff in
> identifiers and string literals.
> 
> 	Jakub
>
  

Patch

--- libcpp/include/cpplib.h.jj	2022-08-25 14:25:23.866912426 +0200
+++ libcpp/include/cpplib.h	2022-08-27 12:17:55.185022807 +0200
@@ -560,6 +560,9 @@  struct cpp_options
      cpp_bidirectional_level.  */
   unsigned char cpp_warn_bidirectional;
 
+  /* True if libcpp should warn about invalid UTF-8 characters in comments.  */
+  bool cpp_warn_invalid_utf8;
+
   /* Dependency generation.  */
   struct
   {
@@ -666,7 +669,8 @@  enum cpp_warning_reason {
   CPP_W_CXX11_COMPAT,
   CPP_W_CXX20_COMPAT,
   CPP_W_EXPANSION_TO_DEFINED,
-  CPP_W_BIDIRECTIONAL
+  CPP_W_BIDIRECTIONAL,
+  CPP_W_INVALID_UTF8
 };
 
 /* Callback for header lookup for HEADER, which is the name of a
--- libcpp/init.cc.jj	2022-08-24 09:55:44.571876638 +0200
+++ libcpp/init.cc	2022-08-27 12:18:54.559246323 +0200
@@ -227,6 +227,7 @@  cpp_create_reader (enum c_lang lang, cpp
   CPP_OPTION (pfile, ext_numeric_literals) = 1;
   CPP_OPTION (pfile, warn_date_time) = 0;
   CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
+  CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0;
 
   /* Default CPP arithmetic to something sensible for the host for the
      benefit of dumb users like fix-header.  */
--- libcpp/lex.cc.jj	2022-08-26 09:24:12.089615949 +0200
+++ libcpp/lex.cc	2022-08-27 13:43:40.560769087 +0200
@@ -1704,6 +1704,59 @@  maybe_warn_bidi_on_char (cpp_reader *pfi
   bidi::on_char (kind, ucn_p, loc);
 }
 
+static const cppchar_t utf8_continuation = 0x80;
+static const cppchar_t utf8_signifier = 0xC0;
+
+/* Emit -Winvalid-utf8 warning on invalid UTF-8 character starting
+   at PFILE->buffer->cur.  Return a pointer after the diagnosed
+   invalid character.  */
+
+static const uchar *
+_cpp_warn_invalid_utf8 (cpp_reader *pfile)
+{
+  cpp_buffer *buffer = pfile->buffer;
+  const uchar *cur = buffer->cur;
+
+  if (cur[0] < utf8_signifier
+      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+			     pfile->line_table->highest_line,
+			     CPP_BUF_COL (buffer),
+			     "invalid UTF-8 character <%x> in comment",
+			     cur[0]);
+      return cur + 1;
+    }
+  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+			     pfile->line_table->highest_line,
+			     CPP_BUF_COL (buffer),
+			     "invalid UTF-8 character <%x><%x> in comment",
+			     cur[0], cur[1]);
+      return cur + 2;
+    }
+  else if (cur[3] < utf8_continuation || cur[3] >= utf8_signifier)
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+			     pfile->line_table->highest_line,
+			     CPP_BUF_COL (buffer),
+			     "invalid UTF-8 character <%x><%x><%x> in comment",
+			     cur[0], cur[1], cur[2]);
+      return cur + 3;
+    }
+  else
+    {
+      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
+			     pfile->line_table->highest_line,
+			     CPP_BUF_COL (buffer),
+			     "invalid UTF-8 character "
+			     "<%x><%x><%x><%x> in comment",
+			     cur[0], cur[1], cur[2], cur[3]);
+      return cur + 4;
+    }
+}
+
 /* Skip a C-style block comment.  We find the end of the comment by
    seeing if an asterisk is before every '/' we encounter.  Returns
    nonzero if comment terminated by EOF, zero otherwise.
@@ -1716,6 +1769,8 @@  _cpp_skip_block_comment (cpp_reader *pfi
   const uchar *cur = buffer->cur;
   uchar c;
   const bool warn_bidi_p = pfile->warn_bidi_p ();
+  const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
+  const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
 
   cur++;
   if (*cur == '/')
@@ -1765,13 +1820,32 @@  _cpp_skip_block_comment (cpp_reader *pfi
 
 	  cur = buffer->cur;
 	}
-      /* If this is a beginning of a UTF-8 encoding, it might be
-	 a bidirectional control character.  */
-      else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
-	{
-	  location_t loc;
-	  bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
-	  maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
+      else if (__builtin_expect (c >= utf8_continuation, 0)
+	       && warn_bidi_or_invalid_utf8_p)
+	{
+	  /* If this is a beginning of a UTF-8 encoding, it might be
+	     a bidirectional control character.  */
+	  if (c == bidi::utf8_start && warn_bidi_p)
+	    {
+	      location_t loc;
+	      bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
+	      maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
+	    }
+	  if (!warn_invalid_utf8_p)
+	    continue;
+	  if (c >= utf8_signifier)
+	    {
+	      cppchar_t s;
+	      const uchar *pstr = cur - 1;
+	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
+		  && s <= 0x0010FFFF)
+		{
+		  cur = pstr;
+		  continue;
+		}
+	    }
+	  buffer->cur = cur - 1;
+	  cur = _cpp_warn_invalid_utf8 (pfile);
 	}
     }
 
@@ -1789,11 +1863,13 @@  skip_line_comment (cpp_reader *pfile)
   cpp_buffer *buffer = pfile->buffer;
   location_t orig_line = pfile->line_table->highest_line;
   const bool warn_bidi_p = pfile->warn_bidi_p ();
+  const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
+  const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
 
-  if (!warn_bidi_p)
+  if (!warn_bidi_or_invalid_utf8_p)
     while (*buffer->cur != '\n')
       buffer->cur++;
-  else
+  else if (!warn_invalid_utf8_p)
     {
       while (*buffer->cur != '\n'
 	     && *buffer->cur != bidi::utf8_start)
@@ -1813,6 +1889,38 @@  skip_line_comment (cpp_reader *pfile)
 	  maybe_warn_bidi_on_close (pfile, buffer->cur);
 	}
     }
+  else
+    {
+      while (*buffer->cur != '\n')
+	{
+	  if (*buffer->cur < utf8_continuation)
+	    {
+	      buffer->cur++;
+	      continue;
+	    }
+	  if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)
+	      && warn_bidi_p)
+	    {
+	      location_t loc;
+	      bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc);
+	      maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
+	    }
+	  if (*buffer->cur >= utf8_signifier)
+	    {
+	      cppchar_t s;
+	      const uchar *pstr = buffer->cur;
+	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
+		  && s <= 0x0010FFFF)
+		{
+		  buffer->cur = pstr;
+		  continue;
+		}
+	    }
+	  buffer->cur = _cpp_warn_invalid_utf8 (pfile);
+	}
+      if (warn_bidi_p)
+	maybe_warn_bidi_on_close (pfile, buffer->cur);
+    }
 
   _cpp_process_line_notes (pfile, true);
   return orig_line != pfile->line_table->highest_line;
@@ -1919,8 +2027,6 @@  warn_about_normalization (cpp_reader *pf
     }
 }
 
-static const cppchar_t utf8_signifier = 0xC0;
-
 /* Returns TRUE if the sequence starting at buffer->cur is valid in
    an identifier.  FIRST is TRUE if this starts an identifier.  */
 
--- gcc/doc/invoke.texi.jj	2022-08-27 09:14:43.047696028 +0200
+++ gcc/doc/invoke.texi	2022-08-27 14:05:22.417755406 +0200
@@ -365,9 +365,9 @@  Objective-C and Objective-C++ Dialects}.
 -Winfinite-recursion @gol
 -Winit-self  -Winline  -Wno-int-conversion  -Wint-in-bool-context @gol
 -Wno-int-to-pointer-cast  -Wno-invalid-memory-model @gol
--Winvalid-pch  -Wjump-misses-init  -Wlarger-than=@var{byte-size} @gol
--Wlogical-not-parentheses  -Wlogical-op  -Wlong-long @gol
--Wno-lto-type-mismatch -Wmain  -Wmaybe-uninitialized @gol
+-Winvalid-pch  -Winvalid-utf8 -Wjump-misses-init  @gol
+-Wlarger-than=@var{byte-size}  -Wlogical-not-parentheses  -Wlogical-op  @gol
+-Wlong-long  -Wno-lto-type-mismatch -Wmain  -Wmaybe-uninitialized @gol
 -Wmemset-elt-size  -Wmemset-transposed-args @gol
 -Wmisleading-indentation  -Wmissing-attributes  -Wmissing-braces @gol
 -Wmissing-field-initializers  -Wmissing-format-attribute @gol
@@ -9569,6 +9569,11 @@  different size.
 Warn if a precompiled header (@pxref{Precompiled Headers}) is found in
 the search path but cannot be used.
 
+@item -Winvalid-utf8
+@opindex Winvalid-utf8
+@opindex Wno-invalid-utf8
+Warn if an invalid UTF-8 character is inside of a comment.
+
 @item -Wlong-long
 @opindex Wlong-long
 @opindex Wno-long-long
--- gcc/c-family/c.opt.jj	2022-08-27 09:14:43.036696173 +0200
+++ gcc/c-family/c.opt	2022-08-27 14:03:06.328534617 +0200
@@ -821,6 +821,10 @@  Winvalid-pch
 C ObjC C++ ObjC++ CPP(warn_invalid_pch) CppReason(CPP_W_INVALID_PCH) Var(cpp_warn_invalid_pch) Init(0) Warning
 Warn about PCH files that are found but not used.
 
+Winvalid-utf8
+C objC C++ ObjC++ CPP(cpp_warn_invalid_utf8) CppReason(CPP_W_INVALID_UTF8) Var(warn_invalid_utf8) Init(0) Warning
+Warn about invalid UTF-8 characters in comments.
+
 Wjump-misses-init
 C ObjC Var(warn_jump_misses_init) Warning LangEnabledby(C ObjC,Wc++-compat)
 Warn when a jump misses a variable initialization.
--- gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c.jj	2022-08-27 14:01:51.115517571 +0200
+++ gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c	2022-08-27 14:33:21.466802817 +0200
@@ -0,0 +1,39 @@ 
+// P2295R6 - Support for UTF-8 as a portable source file encoding
+// This test intentionally contains various byte sequences which are not valid UTF-8
+// { dg-do preprocess }
+// { dg-options "-finput-charset=UTF-8 -Winvalid-utf8" }
+
+// a€߿ࠀ퟿𐀀􏿿a		{ dg-bogus "invalid UTF-8 character" }
+// a�a					{ dg-warning "invalid UTF-8 character <80> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <bf> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <c0> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <c1> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <f5> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <ff> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <c2> in comment" }
+// a�a					{ dg-warning "invalid UTF-8 character <e0> in comment" }
+// a���a				{ dg-warning "invalid UTF-8 character <e0><80><bf> in comment" }
+// a���a				{ dg-warning "invalid UTF-8 character <e0><9f><80> in comment" }
+// a��a					{ dg-warning "invalid UTF-8 character <e0><bf> in comment" }
+// a��a					{ dg-warning "invalid UTF-8 character <ec><80> in comment" }
+// a���a				{ dg-warning "invalid UTF-8 character <ed><a0><80> in comment" }
+// a����a				{ dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" }
+// a����a				{ dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" }
+// a����a				{ dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" }
+/* a€߿ࠀ퟿𐀀􏿿a		{ dg-bogus "invalid UTF-8 character" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <80> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <bf> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <c0> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <c1> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <f5> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <ff> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <c2> in comment" } */
+/* a�a					{ dg-warning "invalid UTF-8 character <e0> in comment" } */
+/* a���a				{ dg-warning "invalid UTF-8 character <e0><80><bf> in comment" } */
+/* a���a				{ dg-warning "invalid UTF-8 character <e0><9f><80> in comment" } */
+/* a��a					{ dg-warning "invalid UTF-8 character <e0><bf> in comment" } */
+/* a��a					{ dg-warning "invalid UTF-8 character <ec><80> in comment" } */
+/* a���a				{ dg-warning "invalid UTF-8 character <ed><a0><80> in comment" } */
+/* a����a				{ dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" } */
+/* a����a				{ dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" } */
+/* a����a				{ dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" } */