From patchwork Wed Jan 25 21:06:32 2023
X-Patchwork-Submitter: Ben Boeckel
X-Patchwork-Id: 48343
From: Ben Boeckel
To: gcc-patches@gcc.gnu.org
Cc: Ben Boeckel, jason@redhat.com, nathan@acm.org, fortran@gcc.gnu.org,
 gcc@gcc.gnu.org, brad.king@kitware.com
Subject: [PATCH v5 1/5] libcpp: reject codepoints above 0x10FFFF
Date: Wed, 25 Jan 2023 16:06:32 -0500
Message-Id: <20230125210636.2960049-2-ben.boeckel@kitware.com>
In-Reply-To: <20230125210636.2960049-1-ben.boeckel@kitware.com>
References: <20230125210636.2960049-1-ben.boeckel@kitware.com>

Unicode does not allow codepoints above 0x10FFFF because they cannot be
represented in UTF-16.

libcpp/

	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
	UTF-16 cannot represent such codepoints, so Unicode as a whole
	rejects them.

Signed-off-by: Ben Boeckel
---
 libcpp/charset.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 3c47d4f868b..f7ae12ea5a2 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -158,6 +158,10 @@ struct _cpp_strbuf
    encoded as any of DF 80, E0 9F 80, F0 80 9F 80, F8 80 80 9F 80, or
    FC 80 80 80 9F 80. Only the first is valid.
 
+   Additionally, Unicode declares that all codepoints above 0010FFFF are
+   invalid because they cannot be represented in UTF-16. As such, all 5- and
+   6-byte encodings are invalid.
+
    An implementation note: the transformation from UTF-16 to UTF-8, or
    vice versa, is easiest done by using UTF-32 as an intermediary. */
 
@@ -216,7 +220,7 @@ one_utf8_to_cppchar (const uchar **inbufp, size_t *inbytesleftp,
   if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
 
   /* Make sure the character is valid. */
-  if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
+  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
 
   *cp = c;
   *inbufp = inbuf;
@@ -320,7 +324,7 @@ one_utf32_to_utf8 (iconv_t bigend, const uchar **inbufp, size_t *inbytesleftp,
   s += inbuf[bigend ? 2 : 1] << 8;
   s += inbuf[bigend ? 3 : 0];
 
-  if (s >= 0x7FFFFFFF || (s >= 0xD800 && s <= 0xDFFF))
+  if (s > 0x10FFFF || (s >= 0xD800 && s <= 0xDFFF))
     return EILSEQ;
 
   rval = one_cppchar_to_utf8 (s, outbufp, outbytesleftp);
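
For reviewers who want to sanity-check the new bound, here is a minimal
standalone sketch of the validity test both hunks now apply, together with
the arithmetic showing why UTF-16 tops out at 0x10FFFF. The helper names
(is_unicode_scalar, utf16_pair_to_codepoint, self_check) are hypothetical
and are not part of libcpp; the sketch only mirrors the condition in the
patch and is not a copy of the charset.cc code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in libcpp): true when c passes the same test the
// patch applies, i.e. it is at most 0x10FFFF and not a UTF-16 surrogate.
constexpr bool is_unicode_scalar(std::uint32_t c) {
  return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical helper: combine a high surrogate (D800..DBFF) and a low
// surrogate (DC00..DFFF) into the codepoint they encode. Each surrogate
// carries 10 payload bits, offset by 0x10000.
constexpr std::uint32_t utf16_pair_to_codepoint(std::uint16_t hi,
                                                std::uint16_t lo) {
  return 0x10000u + ((std::uint32_t{hi} - 0xD800u) << 10) +
         (std::uint32_t{lo} - 0xDC00u);
}

// The largest surrogate pair decodes to exactly 0x10FFFF, which is why
// nothing above that value is representable in UTF-16.
static_assert(utf16_pair_to_codepoint(0xDBFF, 0xDFFF) == 0x10FFFF,
              "UTF-16 upper bound");
static_assert(utf16_pair_to_codepoint(0xD800, 0xDC00) == 0x10000,
              "UTF-16 lowest supplementary codepoint");

// Runtime spot checks around the boundaries the patch cares about.
inline void self_check() {
  assert(is_unicode_scalar(0x10FFFF));   // last valid codepoint
  assert(!is_unicode_scalar(0x110000));  // first value now rejected
  assert(!is_unicode_scalar(0xD800));    // surrogates stay rejected
  assert(is_unicode_scalar(0xE000));     // first value after surrogates
}
```

The sketch also makes the second hunk's comparison change visible: the old
test in one_utf32_to_utf8 used ">=" against 0x7FFFFFFF, while the new one
uses ">" against 0x10FFFF, so 0x10FFFF itself remains accepted.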