nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)

  Hi!

On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fortran@gcc.gnu.org> wrote:
> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <thomas@codesourcery.com> wrote:
>> For example, for Fortran code like:
>>
>>     write (*,*) "Hello world"
>>
>> ..., 'gfortran' creates:
>>
>>     struct __st_parameter_dt dt_parm.0;
>>
>>     try
>>       {
>>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>         dt_parm.0.common.line = 29;
>>         dt_parm.0.common.flags = 128;
>>         dt_parm.0.common.unit = 6;
>>         _gfortran_st_write (&dt_parm.0);
>>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>         _gfortran_st_write_done (&dt_parm.0);
>>       }
>>     finally
>>       {
>>         dt_parm.0 = {CLOBBER(eol)};
>>       }
>>
>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>> really! -- there's a lot of state in Fortran I/O apparently).  That's a
>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>> have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>> "Use custom stacks instead of local memory for automatic storage".)
>>
>> Now, the Nvidia Driver tries to accommodate such largish stack usage,
>> and dynamically increases the per-thread stack as necessary (thereby
>> potentially reducing parallelism) -- if it manages to understand the call
>> graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
>> to disprove existence of recursion is the common problem, as I've read.
>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>
>>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>
>> That's still not an actual problem if the GPU kernel's stack usage still
>> fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>> stack then overflows; device-side SIGSEGV.
>>
>> (There is, by the way, some similar analysis by Tom de Vries in
>> <https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
>> Recursive tests may fail due to thread stack limit".)
>>
>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>> do like their occasional "'printf' debugging", so we ought to make that
>> work (... without pessimizing any "normal" code).
>>
>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>> scope.
>>
>> There is a way to manually set a per-thread stack size, but it's not
>> obvious which size to set: that size needs to work for the whole GPU
>> kernel, and should be as low as possible (to maximize parallelism).
>> I assume that even if GCC did an accurate call graph analysis of the GPU
>> kernel's maximum stack usage, that still wouldn't help: that's before the
>> PTX JIT does its own code transformations, including stack spilling.
>>
>> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
>> (-dlto) for device code".  This might help, assuming that it manages to
>> simplify the libgfortran I/O code such that the PTX JIT then understands
>> the call graph.  But: that's available only starting with recent
>> CUDA 11.4, so not a general solution -- if it works at all, which I've
>> not tested.
>>
>> Similarly, we could enable GCC's LTO for device code generation -- but
>> that's a big project, out of scope at this time.  And again, we don't
>> know whether that helps this case at all.
>>
>> I see a few options:
>>
>> (a) Figure out what it is in the libgfortran I/O implementation that
>> causes "Stack size [...] cannot be statically determined", and re-work
>> that code to avoid that, or even disable certain things for nvptx, if
>> feasible.

> Shrink st_parameter_dt (it's part of the ABI though, kind of).  Lots of the
> bloat is from things that are unused for simpler I/O cases (so some
> "inheritance" could help), and lots of the bloat is from using
> string/length pairs using char * + size_t for what looks like could be
> encoded a lot more efficiently.
>
> There's probably not much low-hanging fruit.

(Similar comments in Janne's email.)

Well, as had to be expected, libgfortran I/O is really just one example,
but the underlying problem may also be triggered in other ways (via other
newlib/libc functions, for example).

So, really a generic solution seems to be called for.

>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>> I don't really want to do that however: it does introduce a bit of
>> complexity in all the generated device code and run-time overhead that we
>> generally would like to avoid.

Directly using '-msoft-stack' isn't actually possible: it implements
"one stack per 32-thread warp", but for OpenACC we need "one stack per
thread of a warp" (that is, each OpenACC 'vector' independently), and
pre-allocating all those stacks (which may be a lot!) from device memory
would, I expect, really hurt overall performance.
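
As a back-of-the-envelope sketch of that cost (the helper name
'stack_pool_bytes' and any thread counts or stack sizes fed into it are
made up for illustration; only the "32 threads per warp" figure comes from
the text above), the pre-allocated pool grows like this:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to pre-allocate stacks for 'threads' threads when
   'threads_per_stack' threads share one stack of 'stack_bytes' bytes.
   Hypothetical helper, not part of GCC or libgomp.  */
static size_t
stack_pool_bytes (size_t threads, size_t threads_per_stack,
                  size_t stack_bytes)
{
  assert (threads_per_stack > 0);
  /* Round up: a partially filled group still needs a whole stack.  */
  size_t stacks = (threads + threads_per_stack - 1) / threads_per_stack;
  return stacks * stack_bytes;
}
```

For the same stack size, 'stack_pool_bytes (n, 1, size)' is 32 times
'stack_pool_bytes (n, 32, size)' whenever 'n' is a multiple of 32, so
going from per-warp to per-thread stacks multiplies the pool accordingly.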

>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>> stack objects into heap allocation (during nvptx offloading compilation).
>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>> code paths this is to affect.  (Might also add some compile-time
>> diagnostic, of course.)  Could maybe even limit this to only be used
>> during libgfortran compilation?  This is then conceptually a bit similar
>> to (b), but localized to relevant parts only.  Has such a thing been done
>> before in GCC, that I could build upon?
>>
>> Any other clever ideas?

> Converting to heap allocation is difficult outside of the frontend and you
> have to be very careful with memleaks.

Heh, in fact it seems to be pretty simple!  (Famous last words?)  See
"[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
attached.  What do people think about such a thing?

Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
'-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
default value for '-mframe-malloc-threshold=[...]' (potentially different
for GCC/nvptx target libraries build vs. user-compiled code?), etc.
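
To illustrate the intended effect at source level: the following is a
hand-written sketch, not taken from the attached patch (which works inside
the compiler, not on C source).  'struct big_state', 'use_state',
'frame_version', and 'heap_version' are made-up names, and the object size
merely mirrors the "half-KiB" figure mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large descriptor such as
   'st_parameter_dt'; the real layout is different.  */
struct big_state
{
  int unit;
  char scratch[508];
};

/* Hypothetical callee that needs the address of the state object.  */
static int
use_state (struct big_state *s)
{
  assert (s != NULL);
  return s->unit + (int) sizeof s->scratch;
}

/* Before: the large object lives in the function's frame.  */
static int
frame_version (int unit)
{
  struct big_state s;
  memset (&s, 0, sizeof s);
  s.unit = unit;
  return use_state (&s);
}

/* After: same logic, with the object moved to the heap and released
   before returning.  Returning -1 on allocation failure is an
   assumption of this sketch only.  */
static int
heap_version (int unit)
{
  struct big_state *s = malloc (sizeof *s);
  if (s == NULL)
    return -1;
  memset (s, 0, sizeof *s);
  s->unit = unit;
  int result = use_state (s);
  free (s);
  return result;
}
```

Both variants compute the same result; the point is that every exit path
after a successful allocation has to run 'free', which is the memleak
concern raised above.  How a real transformation should handle allocation
failure is not covered here.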

> The library is written in C and
> I see heap allocated temporaries there but in at least one
> place a stack one is used:
>
> void
> st_endfile (st_parameter_filepos *fpp)
> {
> ...
>       if (u->current_record)
>         {
>           st_parameter_dt dtp;
>           dtp.common = fpp->common;
>           memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>           dtp.u.p.current_unit = u;
>           next_record (&dtp, 1);
>
> that might be a mistake though - maybe it's enough to change that
> to a heap allocation?  It might be also totally superfluous since
> only 'u' should matter here ... (not sure if the above is the case
> you are running into).

(Have not yet looked into that; won't solve the general issue.)

Grüße
 Thomas


Message ID	87ili2p60p.fsf@euler.schwinge.homeip.net
State	Unresolved
Headers	From: Thomas Schwinge <thomas@codesourcery.com>; To: Richard Biener <richard.guenther@gmail.com>, Tom de Vries <tdevries@suse.de>, <gcc-patches@gcc.gnu.org>; CC: Janne Blomqvist <blomqvist.janne@gmail.com>, <fortran@gcc.gnu.org>, Alexander Monakov <amonakov@ispras.ru>; Date: Fri, 23 Dec 2022 15:08:06 +0100; In-Reply-To: <CAFiYyc0oAd+r97MfpcS8obsLeBmh4Q+qfeyZbszMzhKuR4wQiA@mail.gmail.com>