Message ID | ZWD_fAPqEWkFlEkM@dwarf.suse.cz |
---|---|
Headers |
Date: Fri, 24 Nov 2023 20:54:36 +0100
From: Jiri Bohac <jbohac@suse.cz>
To: Baoquan He <bhe@redhat.com>, Vivek Goyal <vgoyal@redhat.com>, Dave Young <dyoung@redhat.com>, kexec@lists.infradead.org
Cc: linux-kernel@vger.kernel.org, mhocko@suse.cz
Subject: [PATCH 0/4] kdump: crashkernel reservation from CMA
Message-ID: <ZWD_fAPqEWkFlEkM@dwarf.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii |
Series | kdump: crashkernel reservation from CMA | |
Message
Jiri Bohac
Nov. 24, 2023, 7:54 p.m. UTC
Hi,

this series implements a new way to reserve additional crash kernel memory using CMA.

Currently, all the memory for the crash kernel is not usable by the 1st (production) kernel. It is also unmapped so that it can't be corrupted by the fault that will eventually trigger the crash. This makes sense for the memory actually used by the kexec-loaded crash kernel image and initrd and the data prepared during the load (vmcoreinfo, ...). However, the reserved space needs to be much larger than that to provide enough run-time memory for the crash kernel and the kdump userspace. Estimating the amount of memory to reserve is difficult. Being too careful makes kdump likely to end in OOM, being too generous takes even more memory from the production system. Also, the reservation only allows reserving a single contiguous block (or two with the "low" suffix). I've seen systems where this fails because the physical memory is fragmented.

By reserving additional crashkernel memory from CMA, the main crashkernel reservation can be just small enough to fit the kernel and initrd image, minimizing the memory taken away from the production system. Most of the run-time memory for the crash kernel will be memory previously available to userspace in the production system. As this memory is no longer wasted, the reservation can be done with a generous margin, making kdump more reliable. Kernel memory that we need to preserve for dumping is never allocated from CMA. User data is typically not dumped by makedumpfile. When dumping of user data is intended, this new CMA reservation cannot be used.

There are four patches in this series:

The first adds a new ",cma" suffix to the recently introduced generic crashkernel parsing code. parse_crashkernel() takes one more argument to store the CMA reservation size.

The second patch implements reserve_crashkernel_cma() which performs the reservation. If the requested size is not available in a single range, multiple smaller ranges will be reserved.

The third patch enables the functionality for x86 as a proof of concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available through /proc/vmcore by excluding them from the vmcoreinfo PT_LOAD ranges.
Adding other architectures is easy and I can do that as soon as this series is merged.

The fourth patch just updates Documentation/

Now, specifying

  crashkernel=100M crashkernel=1G,cma

on the command line will make a standard crashkernel reservation of 100M, where kexec will load the kernel and initrd. An additional 1G will be reserved from CMA, still usable by the production system. The crash kernel will have 1.1G memory available. The 100M can be reliably predicted based on the size of the kernel and initrd.

When no crashkernel=size,cma is specified, everything works as before.
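To make the arch integration described above concrete, here is a minimal, hedged sketch of how an architecture's early boot code might wire the new option up. The extra CMA argument to parse_crashkernel() and the exact reserve_crashkernel_cma() prototype are assumptions inferred from this cover letter, not code copied from the series:

/*
 * Hedged sketch, not the actual patch: how an arch's setup code might
 * call into the generic crashkernel code once the ",cma" suffix is
 * understood.  The extra cma_size argument to parse_crashkernel() and
 * the reserve_crashkernel_cma() prototype are assumptions based on the
 * cover letter above.
 */
static void __init arch_reserve_crashkernel(void)
{
	unsigned long long crash_size = 0, crash_base = 0;
	unsigned long long low_size = 0, cma_size = 0;
	bool high = false;

	/* Assumed: the generic parser now also returns the ",cma" size. */
	if (parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
			      &crash_size, &crash_base,
			      &low_size, &cma_size, &high))
		return;

	/* The usual unmapped reservation for kernel, initrd and metadata. */
	reserve_crashkernel_generic(boot_command_line, crash_size,
				    crash_base, low_size, high);

	/* New: CMA-backed run-time memory, still usable by userspace. */
	if (cma_size)
		reserve_crashkernel_cma(cma_size);
}

With something like this in place, booting with "crashkernel=100M crashkernel=1G,cma" would behave as described: 100M of unmapped memory for the kexec-loaded image plus 1G of CMA that the production system keeps using.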
Comments
Hi Jiri,

On Sat, Nov 25, 2023 at 3:55 AM Jiri Bohac <jbohac@suse.cz> wrote:
> this series implements a new way to reserve additional crash kernel
> memory using CMA.
[...]
> Kernel memory that we need to preserve for dumping is never allocated
> from CMA. User data is typically not dumped by makedumpfile. When
> dumping of user data is intended, this new CMA reservation cannot be
> used.

Thanks for the idea of using CMA as part of the memory for the 2nd
kernel. However, I have a question:

What if there is on-going DMA/RDMA access on the CMA range when the 1st
kernel crashes? There might be data corruption when the 2nd kernel and
the DMA/RDMA write to the same place; how to address such an issue?

Thanks,
Tao Liu
Hi Tao,

On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> What if there is on-going DMA/RDMA access on the CMA range when the 1st
> kernel crashes? There might be data corruption when the 2nd kernel and
> the DMA/RDMA write to the same place; how to address such an issue?

The crash kernel CMA area(s) registered via cma_declare_contiguous()
are distinct from the dma_contiguous_default_area or device-specific
CMA areas that dma_alloc_contiguous() would use to reserve memory for
DMA.

Kernel pages will not be allocated from the crash kernel CMA area(s),
because they are not GFP_MOVABLE. The CMA area will only be used for
user pages.

User pages for RDMA should be pinned with FOLL_LONGTERM, and that would
migrate them away from the CMA area.

But you're right that DMA to user pages pinned without FOLL_LONGTERM
would still be possible. Would this be a problem in practice? Do you
see any way around it?

Thanks,
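For illustration only, a minimal sketch of the FOLL_LONGTERM distinction described above; the surrounding function is hypothetical and not taken from any real driver, while the GUP calls themselves are the existing kernel API:

/*
 * Hypothetical driver snippet illustrating the point above: a
 * FOLL_LONGTERM pin makes GUP migrate the pages out of CMA/movable
 * pageblocks before pinning, so long-lived DMA buffers cannot stay in
 * the crash kernel CMA area.  A pin without FOLL_LONGTERM (as in the
 * call sites listed later in this thread) gives no such guarantee.
 */
static int example_longterm_pin(unsigned long uaddr, int nr_pages,
				struct page **pages)
{
	int pinned;

	pinned = pin_user_pages_fast(uaddr, nr_pages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned < 0)
		return pinned;

	/* ... program the device to DMA into "pages" here ... */

	unpin_user_pages(pages, pinned);
	return 0;
}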
Hi Jiri,

On Sun, Nov 26, 2023 at 5:22 AM Jiri Bohac <jbohac@suse.cz> wrote:
[...]
> But you're right that DMA to user pages pinned without FOLL_LONGTERM
> would still be possible. Would this be a problem in practice? Do you
> see any way around it?

Thanks for the explanation! Sorry I don't have any ideas so far...

@Pingfan Liu @Baoquan He Hi, do you have any suggestions for it?

Thanks,
Tao Liu
On Sun, Nov 26, 2023 at 5:24 AM Jiri Bohac <jbohac@suse.cz> wrote:
[...]
> But you're right that DMA to user pages pinned without FOLL_LONGTERM
> would still be possible. Would this be a problem in practice? Do you
> see any way around it?

I don't have a real case in mind, but this problem has kept us from
using the CMA area in kdump for years. Most importantly, this method
will introduce a bug that is hard to track down.

For a way around it, maybe you can introduce a specific zone and, for
any GUP, migrate the pages away. I have doubts about whether this
approach is worthwhile, considering the trade-off between benefits and
complexity.

Thanks,

Pingfan
On 11/28/23 at 09:12am, Tao Liu wrote:
[...]
> > But you're right that DMA to user pages pinned without FOLL_LONGTERM
> > would still be possible. Would this be a problem in practice? Do you
> > see any way around it?

Thanks for the effort to bring this up, Jiri.

I am wondering how you will use this crashkernel=,cma parameter, I mean
the scenario for crashkernel=,cma. Asking this because I don't know how
SUSE deploys kdump in SUSE distros: is the kdump kernel's initramfs the
same as the 1st kernel's, or does it contain only the kernel modules
needed for the needed devices? E.g. if we dump to a local disk, will
the NIC driver be filtered out? In the latter case, it could possibly
hit the in-flight DMA issue, e.g. a NIC has a DMA buffer in the CMA
area, but is not reset during kdump bootup because the NIC driver is
not loaded to initialize it. Not sure if this is 100% possible in
theory?

Recently we are seeing an issue on an HPE system where PCI error
messages are always seen in the kdump kernel; it's a local dump, the
NIC device is not needed and the igb driver is not loaded. Adding the
igb driver into the kdump initramfs works around it. It's similar to
the in-flight DMA issue above.

The crashkernel=,cma option requires that no userspace data is dumped.
From our support engineers' feedback, customers never say they don't
need to dump user space data. Assume a server with a huge database
deployed, and the database has often collapsed recently; the database
provider claims it's not the database's fault, so the OS needs to prove
its innocence. What will you do?

So this looks like a nice-to-have to me. At least in Fedora/RHEL's
usage, we may only back-port this patch, and add one sentence in our
user guide saying "there's a crashkernel=,cma added, which can be used
with crashkernel= to save memory. Please feel free to try it if you
like". Unless SUSE or other distros decide to use it as a default
config or something like that. Please correct me if I missed anything
or took anything wrong.

Thanks
Baoquan
On Tue 28-11-23 10:07:08, Pingfan Liu wrote:
[...]
> I don't have a real case in mind, but this problem has kept us from
> using the CMA area in kdump for years. Most importantly, this method
> will introduce a bug that is hard to track down.

Long-term pinning is something that has changed the picture IMHO. The
API had been brewing for a long time but it has been established and
its usage is spreading. Is it possible that some driver could be doing
remote DMA without long-term pinning? Quite possible, but this means
such a driver should be fixed rather than preventing CMA use for this
use case TBH.

> For a way around it, maybe you can introduce a specific zone and, for
> any GUP, migrate the pages away. I have doubts about whether this
> approach is worthwhile, considering the trade-off between benefits and
> complexity.

No, a zone is definitely not an answer to that, because userspace would
need to be able to use that memory, and userspace might pin memory for
direct IO and other purposes. So in the end long-term pinning would
need to be used anyway.
On Tue 28-11-23 10:11:31, Baoquan He wrote:
> On 11/28/23 at 09:12am, Tao Liu wrote:
[...]
> I am wondering how you will use this crashkernel=,cma parameter.
> [...] E.g. if we dump to a local disk, will the NIC driver be filtered
> out? In the latter case, it could possibly hit the in-flight DMA
> issue, e.g. a NIC has a DMA buffer in the CMA area, but is not reset
> during kdump bootup because the NIC driver is not loaded to
> initialize it. Not sure if this is 100% possible in theory?

NIC drivers do not allocate from movable zones (that includes the CMA
zone). In fact the kernel doesn't use GFP_MOVABLE for non-user
requests. RDMA drivers might and do transfer from user-backed memory,
but for that purpose they should be pinning memory (have a look at
__gup_longterm_locked and its callers) and that will migrate it away
from the CMA area.

[...]
> The crashkernel=,cma option requires that no userspace data is dumped.
> From our support engineers' feedback, customers never say they don't
> need to dump user space data. Assume a server with a huge database
> deployed, and the database has often collapsed recently; the database
> provider claims it's not the database's fault, so the OS needs to
> prove its innocence. What will you do?

Don't use CMA backed crash memory then? This is an optional feature.

> So this looks like a nice-to-have to me. At least in Fedora/RHEL's
> usage, we may only back-port this patch, and add one sentence in our
> user guide saying "there's a crashkernel=,cma added, which can be used
> with crashkernel= to save memory. Please feel free to try it if you
> like". Unless SUSE or other distros decide to use it as a default
> config or something like that. Please correct me if I missed anything
> or took anything wrong.

Jiri will know better than me, but for us a proper crash memory
configuration has become a real nut to crack. You do not want to
reserve too much because it effectively cuts off the usable memory, and
we regularly hit "not enough memory" if we try to be savvy. The more
tightly you try to configure it, the easier it is to fail. Even worse,
any in-kernel memory consumer can increase its memory demand and push
the overall consumption off the cliff. So this is not an easy solution
to maintain. CMA-backed crash memory can be much more generous while
still being usable.
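A minimal illustration of the zone argument above (illustrative only, not code from the series): only allocations that carry __GFP_MOVABLE can be satisfied from CMA pageblocks, and in-kernel/driver allocations do not carry it.

#include <linux/gfp.h>

/*
 * Illustrative only: contrast the GFP classes discussed above.
 * Driver/kernel allocations use GFP_KERNEL (not movable), so they can
 * never land in the crash kernel CMA area; anonymous user pages use
 * GFP_HIGHUSER_MOVABLE and may.
 */
static void gfp_zone_example(void)
{
	struct page *ring = alloc_pages(GFP_KERNEL, 2);		/* never placed in CMA */
	struct page *upage = alloc_pages(GFP_HIGHUSER_MOVABLE, 0);	/* may land in CMA */

	if (ring)
		__free_pages(ring, 2);
	if (upage)
		__free_pages(upage, 0);
}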
On 11/28/23 at 10:08am, Michal Hocko wrote:
[...]
> NIC drivers do not allocate from movable zones (that includes the CMA
> zone). In fact the kernel doesn't use GFP_MOVABLE for non-user
> requests. RDMA drivers might and do transfer from user-backed memory,
> but for that purpose they should be pinning memory (have a look at
> __gup_longterm_locked and its callers) and that will migrate it away
> from the CMA area.

OK, in that case, we don't need to worry about the risk of DMA.

[...]
> Don't use CMA backed crash memory then? This is an optional feature.

Guess so. As I said earlier, this is more like a nice-to-have feature;
we can suggest users try it by themselves, since Jiri didn't say how he
will use it.

> Jiri will know better than me, but for us a proper crash memory
> configuration has become a real nut to crack. You do not want to
> reserve too much because it effectively cuts off the usable memory,
> and we regularly hit "not enough memory" if we try to be savvy. The
> more tightly you try to configure it, the easier it is to fail. Even
> worse, any in-kernel memory consumer can increase its memory demand
> and push the overall consumption off the cliff. So this is not an easy
> solution to maintain. CMA-backed crash memory can be much more
> generous while still being usable.

Hmm, Red Hat could go a different way. We have been trying to:
1) customize the initrd for the kdump kernel specifically, e.g. exclude
   unneeded devices' drivers to save memory;
2) monitor device and kernel memory usage if they begin to consume much
   more memory than before. We have CI testing cases to watch this. We
   once found one NIC even eat up GB-level memory; that needed to be
   investigated and fixed.

With these efforts, our default crashkernel values satisfy most cases,
surely not all cases. Only rare cases need to be handled manually by
increasing crashkernel. The crashkernel=,high option was added for this
case: a small amount of low memory under 4G for DMA with
crashkernel=,low, and a big chunk of high memory above 4G with
crashkernel=,high. I can't see where needs are not met.

Wondering how you will use this crashkernel=,cma syntax. On normal
machines and virt guests, not much memory is needed; usually 256M or a
little more is enough. On those high-end systems with hundreds of
gigabytes, even terabytes, of memory, I don't think the memory saved
with crashkernel=,cma makes much sense. Taking out 1G of memory above
4G as crashkernel won't have much impact.

So in my understanding, crashkernel=,cma adds an option users can take
besides the existing crashkernel=,high. As I have said earlier, in Red
Hat we may rebase it to Fedora/RHEL and add one sentence into our user
guide saying "another crashkernel=,cma option can be used to save
memory, please feel free to try it if you like". Then that's it. I
guess SUSE will check the user's configuration, e.g. the dump level of
makedumpfile: if no user space data is needed, crashkernel=,cma is
taken, otherwise the normal crashkernel=xM will be chosen?

Thanks
Baoquan
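For reference, the two reservation styles being compared might look like this on the kernel command line (sizes are illustrative only; shown here as it could appear in GRUB_CMDLINE_LINUX in /etc/default/grub):

# Existing scheme described above: fixed high/low reservation
GRUB_CMDLINE_LINUX="... crashkernel=512M,high crashkernel=128M,low"

# Scheme proposed in this series: small fixed reservation plus
# CMA-backed run-time memory that the production system can keep using
GRUB_CMDLINE_LINUX="... crashkernel=100M crashkernel=1G,cma"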
On 11/28/23 at 10:08am, Michal Hocko wrote:
[...]
> NIC drivers do not allocate from movable zones (that includes the CMA
> zone). In fact the kernel doesn't use GFP_MOVABLE for non-user
> requests. RDMA drivers might and do transfer from user-backed memory,
> but for that purpose they should be pinning memory (have a look at
> __gup_longterm_locked and its callers) and that will migrate it away
> from the CMA area.

Add Don to this thread.

I am not familiar with RDMA. If we reserve a range of 1G memory as CMA
in the 1st kernel, RDMA or any other user space tools could use it.
When corruption happens for any reason, that 1G of CMA memory will be
reused as available MOVABLE memory by the kdump kernel. If there's no
risk at all, I mean 100% safe from RDMA, that would be great.
On Wed 29-11-23 15:57:59, Baoquan He wrote:
[...]
> Hmm, Red Hat could go a different way. We have been trying to:
> 1) customize the initrd for the kdump kernel specifically, e.g.
>    exclude unneeded devices' drivers to save memory;
> 2) monitor device and kernel memory usage if they begin to consume
>    much more memory than before. We have CI testing cases to watch
>    this. We once found one NIC even eat up GB-level memory; that
>    needed to be investigated and fixed.

How do you simulate all the different HW configuration setups that are
in use out there in the wild?
Hi Baoquan,

thanks for your interest...

On Wed, Nov 29, 2023 at 03:57:59PM +0800, Baoquan He wrote:
[...]
> > > I am wondering how you will use this crashkernel=,cma parameter.
> > > [...] E.g. if we dump to a local disk, will the NIC driver be
> > > filtered out? In the latter case, it could possibly hit the
> > > in-flight DMA issue, e.g. a NIC has a DMA buffer in the CMA area,
> > > but is not reset during kdump bootup because the NIC driver is not
> > > loaded to initialize it. Not sure if this is 100% possible in
> > > theory?

yes, we also only add the necessary drivers to the kdump initrd (using
dracut --hostonly).

The plan was to use this feature by default only on systems where we
are reasonably sure it is safe and let the user experiment with it when
we're not sure.

I grepped a list of all calls to pin_user_pages*. Of the 55, about one
half use FOLL_LONGTERM, so these should be migrated away from the CMA
area. Of the rest, there are four cases that don't use the pages to set
up DMA:

mm/process_vm_access.c: pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages,
net/rds/info.c: ret = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
drivers/vhost/vhost.c: r = pin_user_pages_fast(log, 1, FOLL_WRITE, &page);
kernel/trace/trace_events_user.c: ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,

The remaining cases are potentially problematic:

drivers/gpu/drm/i915/gem/i915_gem_userptr.c: ret = pin_user_pages_fast(obj->userptr.ptr + pinned * PAGE_SIZE,
drivers/iommu/iommufd/iova_bitmap.c: ret = pin_user_pages_fast((unsigned long)addr, npages,
drivers/iommu/iommufd/pages.c: rc = pin_user_pages_remote(
drivers/media/pci/ivtv/ivtv-udma.c: err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count,
drivers/media/pci/ivtv/ivtv-yuv.c: uv_pages = pin_user_pages_unlocked(uv_dma.uaddr,
drivers/media/pci/ivtv/ivtv-yuv.c: y_pages = pin_user_pages_unlocked(y_dma.uaddr,
drivers/misc/genwqe/card_utils.c: rc = pin_user_pages_fast(data & PAGE_MASK, /* page aligned addr */
drivers/misc/xilinx_sdfec.c: res = pin_user_pages_fast((unsigned long)src_ptr, nr_pages, 0, pages);
drivers/platform/goldfish/goldfish_pipe.c: ret = pin_user_pages_fast(first_page, requested_pages,
drivers/rapidio/devices/rio_mport_cdev.c: pinned = pin_user_pages_fast(
drivers/sbus/char/oradax.c: ret = pin_user_pages_fast((unsigned long)va, 1, FOLL_WRITE, p);
drivers/scsi/st.c: res = pin_user_pages_fast(uaddr, nr_pages, rw == READ ? FOLL_WRITE : 0,
drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c: actual_pages = pin_user_pages_fast((unsigned long)ubuf & PAGE_MASK, num_pages,
drivers/tee/tee_shm.c: rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE,
drivers/vfio/vfio_iommu_spapr_tce.c: if (pin_user_pages_fast(tce & PAGE_MASK, 1,
drivers/video/fbdev/pvr2fb.c: ret = pin_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE, pages);
drivers/xen/gntdev.c: ret = pin_user_pages_fast(addr, 1, batch->writeable ? FOLL_WRITE : 0, &page);
drivers/xen/privcmd.c: page_count = pin_user_pages_fast(
fs/orangefs/orangefs-bufmap.c: ret = pin_user_pages_fast((unsigned long)user_desc->ptr,
arch/x86/kvm/svm/sev.c: npinned = pin_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
drivers/fpga/dfl-afu-dma-region.c: pinned = pin_user_pages_fast(region->user_addr, npages, FOLL_WRITE,
lib/iov_iter.c: res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);

We can easily check if some of these drivers (some of which we don't
even ship/support) are loaded and decide that such a system is not safe
for CMA crashkernel. Maybe looking at the list more thoroughly will
show that even some of the above calls are actually safe, e.g. because
the DMA is set up for reading only. lib/iov_iter.c seems like it could
be the real problem since it's used by the generic block layer...

> > > The crashkernel=,cma option requires that no userspace data is
> > > dumped. From our support engineers' feedback, customers never say
> > > they don't need to dump user space data. [...]
> >
> > Don't use CMA backed crash memory then? This is an optional feature.

Right. Our kdump does not dump userspace by default and we would of
course make sure ,cma is not used when the user wants to turn on
userspace dumping.

> > Jiri will know better than me, but for us a proper crash memory
> > configuration has become a real nut to crack. [...]
>
> Hmm, Red Hat could go a different way. We have been trying to:
> 1) customize the initrd for the kdump kernel specifically, e.g.
>    exclude unneeded devices' drivers to save memory;

ditto

> 2) monitor device and kernel memory usage if they begin to consume
>    much more memory than before. We have CI testing cases to watch
>    this. We once found one NIC even eat up GB-level memory; that
>    needed to be investigated and fixed.
> With these efforts, our default crashkernel values satisfy most cases,
> surely not all cases. Only rare cases need to be handled manually by
> increasing crashkernel.

We get a lot of problems reported by partners testing kdump on their
setups prior to release. But even if we tune the reserved size up, OOM
is still the most common reason for kdump to fail when the product
starts getting used in real life. It's been pretty frustrating for a
long time.

> Wondering how you will use this crashkernel=,cma syntax. On normal
> machines and virt guests, not much memory is needed; usually 256M or a
> little more is enough. On those high-end systems with hundreds of
> gigabytes, even terabytes, of memory, I don't think the memory saved
> with crashkernel=,cma makes much sense.

I feel the exact opposite about VMs. Reserving hundreds of MB for the
crash kernel on _every_ VM on a busy VM host wastes the most memory.
VMs are often tuned to a well-defined task and can be set up with very
little memory, so the ~256 MB can be a huge part of that. And while
it's theoretically better to dump from the hypervisor, users still
often prefer kdump because the hypervisor may not be under their
control. Also, in a VM it should be much easier to be sure the machine
is safe WRT the potential DMA corruption as it has fewer HW drivers. So
I actually thought the CMA reservation could be most useful on VMs.

Thanks,
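As a concrete illustration of the "check whether a risky driver is loaded" idea above, here is a minimal userspace sketch such as a kdump configuration tool might run. The module names are illustrative examples drawn from the paths in the list, not a vetted deny-list, and the whole helper is an assumption rather than anything proposed in the series:

#include <stdio.h>
#include <string.h>

/* Illustrative module names only; not a complete or vetted deny-list. */
static const char * const risky_modules[] = {
	"i915", "ivtv", "genwqe_card", "rio_mport_cdev", "st",
};

/* Returns 1 if no known-risky module is loaded, 0 otherwise. */
int cma_crashkernel_looks_safe(void)
{
	char line[512];
	FILE *f = fopen("/proc/modules", "r");

	if (!f)
		return 0;	/* be conservative if we cannot tell */

	while (fgets(line, sizeof(line), f)) {
		for (size_t i = 0;
		     i < sizeof(risky_modules) / sizeof(risky_modules[0]); i++) {
			size_t len = strlen(risky_modules[i]);

			/* /proc/modules lines start with "<name> <size> ..." */
			if (!strncmp(line, risky_modules[i], len) &&
			    line[len] == ' ') {
				fclose(f);
				return 0;	/* risky driver present */
			}
		}
	}
	fclose(f);
	return 1;
}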
Baoquan,
hi!

On 11/29/23 3:10 AM, Baoquan He wrote:
[...]
> I am not familiar with RDMA. If we reserve a range of 1G memory as CMA
> in the 1st kernel, RDMA or any other user space tools could use it.
> When corruption happens for any reason, that 1G of CMA memory will be
> reused as available MOVABLE memory by the kdump kernel. If there's no
> risk at all, I mean 100% safe from RDMA, that would be great.

My RDMA days are long behind me... more in mm space these days, so this
still interests me.

I thought, in general, userspace memory is not saved or used in kdumps,
so if RDMA is using CMA space for userspace-based IO (gup), then I
would expect it can be re-used for the kexec'd kernel.

So, I'm not sure what 'safe from RDMA' means, but I would expect RDMA
queues are in-kernel data structures, not userspace structures, and
they would be more/most important to maintain/keep for kdump saving.
The actual userspace data ... ssdd wrt any other userspace data.

dma-bufs allocated from CMA, which are (typically) shared with GPUs
(& RDMA in GPU-direct configs), again, would be shared userspace, not
control/cmd/rsp queues, so I'm not seeing an issue there either.

I would poke the NVIDIA+Mellanox folks for further review in this
space, if my reply leaves you (or others) 'wanting'.

- Don
On 11/29/23 at 10:25am, Michal Hocko wrote:
[...]
> How do you simulate all the different HW configuration setups that are
> in use out there in the wild?

We don't simulate. We do this on a best-effort basis with the existing
systems in our lab. Meanwhile, partner companies will test and report
any OOM they encounter.
On 11/29/23 at 10:03am, Donald Dutile wrote:
[...]
> So, I'm not sure what 'safe from RDMA' means, but I would expect RDMA
> queues are in-kernel data structures, not userspace structures, and
> they would be more/most important to maintain/keep for kdump saving.
> The actual userspace data ... ssdd wrt any other userspace data.
> dma-bufs allocated from CMA, which are (typically) shared with GPUs
> (& RDMA in GPU-direct configs), again, would be shared userspace, not
> control/cmd/rsp queues, so I'm not seeing an issue there either.

Thanks a lot for the valuable input, Don.

Here, Jiri's patches attempt to reserve an area which is used in the
1st kernel as a CMA area, i.e. added into the buddy allocator as
MOVABLE, and which will then be taken as available system memory by the
kdump kernel. That means in the kdump kernel that specific CMA area
will be zeroed out; its content won't be cared about or dumped at all.
The kdump kernel will see it as available system RAM, initialize it and
add it into the memblock and buddy allocators.

Now, we are worried whether there's a risk in re-taking the CMA area
into the kdump kernel as system RAM. E.g. is it possible that the 1st
kernel's ongoing RDMA or DMA will interfere with the kdump kernel's
normal memory accesses? Because the kdump kernel usually only resets
and initializes the needed devices, e.g. the dump target; the unneeded
devices are not shut down and are left alone. We could be overthinking
this, so we would like to make it clear.

> I would poke the NVIDIA+Mellanox folks for further review in this
> space, if my reply leaves you (or others) 'wanting'.
On 11/29/23 at 11:51am, Jiri Bohac wrote: > Hi Baoquan, > > thanks for your interest... > > On Wed, Nov 29, 2023 at 03:57:59PM +0800, Baoquan He wrote: > > On 11/28/23 at 10:08am, Michal Hocko wrote: > > > On Tue 28-11-23 10:11:31, Baoquan He wrote: > > > > On 11/28/23 at 09:12am, Tao Liu wrote: > > > [...] > > > > Thanks for the effort to bring this up, Jiri. > > > > > > > > I am wondering how you will use this crashkernel=,cma parameter. I mean > > > > the scenario of crashkernel=,cma. Asking this because I don't know how > > > > SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's > > > > driver will be filter out? If latter case, It's possibly having the > > > > on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not > > > > reset during kdump bootup because the NIC driver is not loaded in to > > > > initialize. Not sure if this is 100%, possible in theory? > > yes, we also only add the necessary drivers to the kdump initrd (using > dracut --hostonly). > > The plan was to use this feature by default only on systems where > we are reasonably sure it is safe and let the user experiment > with it when we're not sure. > > I grepped a list of all calls to pin_user_pages*. From the 55, > about one half uses FOLL_LONGTERM, so these should be migrated > away from the CMA area. In the rest there are four cases that > don't use the pages to set up DMA: > mm/process_vm_access.c: pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages, > net/rds/info.c: ret = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages); > drivers/vhost/vhost.c: r = pin_user_pages_fast(log, 1, FOLL_WRITE, &page); > kernel/trace/trace_events_user.c: ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT, > > The remaining cases are potentially problematic: > drivers/gpu/drm/i915/gem/i915_gem_userptr.c: ret = pin_user_pages_fast(obj->userptr.ptr + pinned * PAGE_SIZE, > drivers/iommu/iommufd/iova_bitmap.c: ret = pin_user_pages_fast((unsigned long)addr, npages, > drivers/iommu/iommufd/pages.c: rc = pin_user_pages_remote( > drivers/media/pci/ivtv/ivtv-udma.c: err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, > drivers/media/pci/ivtv/ivtv-yuv.c: uv_pages = pin_user_pages_unlocked(uv_dma.uaddr, > drivers/media/pci/ivtv/ivtv-yuv.c: y_pages = pin_user_pages_unlocked(y_dma.uaddr, > drivers/misc/genwqe/card_utils.c: rc = pin_user_pages_fast(data & PAGE_MASK, /* page aligned addr */ > drivers/misc/xilinx_sdfec.c: res = pin_user_pages_fast((unsigned long)src_ptr, nr_pages, 0, pages); > drivers/platform/goldfish/goldfish_pipe.c: ret = pin_user_pages_fast(first_page, requested_pages, > drivers/rapidio/devices/rio_mport_cdev.c: pinned = pin_user_pages_fast( > drivers/sbus/char/oradax.c: ret = pin_user_pages_fast((unsigned long)va, 1, FOLL_WRITE, p); > drivers/scsi/st.c: res = pin_user_pages_fast(uaddr, nr_pages, rw == READ ? FOLL_WRITE : 0, > drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c: actual_pages = pin_user_pages_fast((unsigned long)ubuf & PAGE_MASK, num_pages, > drivers/tee/tee_shm.c: rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE, > drivers/vfio/vfio_iommu_spapr_tce.c: if (pin_user_pages_fast(tce & PAGE_MASK, 1, > drivers/video/fbdev/pvr2fb.c: ret = pin_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE, pages); > drivers/xen/gntdev.c: ret = pin_user_pages_fast(addr, 1, batch->writeable ? 
FOLL_WRITE : 0, &page); > drivers/xen/privcmd.c: page_count = pin_user_pages_fast( > fs/orangefs/orangefs-bufmap.c: ret = pin_user_pages_fast((unsigned long)user_desc->ptr, > arch/x86/kvm/svm/sev.c: npinned = pin_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages); > drivers/fpga/dfl-afu-dma-region.c: pinned = pin_user_pages_fast(region->user_addr, npages, FOLL_WRITE, > lib/iov_iter.c: res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages); > > We can easily check if some of these drivers (of which some we don't > even ship/support) are loaded and decide this system is not safe > for CMA crashkernel. Maybe looking at the list more thoroughly > will show that even some of the above calls are acually safe, > e.g. because the DMA is set up for reading only. > lib/iov_iter.c seem like it could be the real > problem since it's used by generic block layer... Hmm, yeah. From my point of view, we may need make sure the safety of reusing ,cma area in kdump kernel without exception. That we can use it on system we 100% sure, let people to experiment with if if not sure, seems to be not safe. Most of time, user even don't know how to judge the system they own is 100% safe, or the safety is not sure. That's too hard. > > > > The crashkernel=,cma requires no userspace data dumping, from our > > > > support engineers' feedback, customer never express they don't need to > > > > dump user space data. Assume a server with huge databse deployed, and > > > > the database often collapsed recently and database provider claimed that > > > > it's not database's fault, OS need prove their innocence. What will you > > > > do? > > > > > > Don't use CMA backed crash memory then? This is an optional feature. > > Right. Our kdump does not dump userspace by default and we would > of course make sure ,cma is not used when the user wanted to turn > on userspace dumping. > > > > Jiri will know better than me but for us a proper crash memory > > > configuration has become a real nut. You do not want to reserve too much > > > because it is effectively cutting of the usable memory and we regularly > > > hit into "not enough memory" if we tried to be savvy. The more tight you > > > try to configure the easier to fail that is. Even worse any in kernel > > > memory consumer can increase its memory demand and get the overall > > > consumption off the cliff. So this is not an easy to maintain solution. > > > CMA backed crash memory can be much more generous while still usable. > > > > Hmm, Redhat could go in a different way. We have been trying to: > > 1) customize initrd for kdump kernel specifically, e.g exclude unneeded > > devices's driver to save memory; > > ditto > > > 2) monitor device and kenrel memory usage if they begin to consume much > > more memory than before. We have CI testing cases to watch this. We ever > > found one NIC even eat up GB level memory, then this need be > > investigated and fixed. > > With these effort, our default crashkernel values satisfy most of cases, > > surely not call cases. Only rare cases need be handled manually, > > increasing crashkernel. > > We get a lot of problems reported by partners testing kdump on > their setups prior to release. But even if we tune the reserved > size up, OOM is still the most common reason for kdump to fail > when the product starts getting used in real life. It's been > pretty frustrating for a long time. 
I remember SUSE engineers once told me that you boot a kernel, estimate the kdump kernel's memory usage, and then set crashkernel according to that estimation. Is OOM still triggered even when that approach is taken? Just curious, not questioning the benefit of using ,cma to save memory. > > > Wondering how you will use this crashkernel=,cma syntax. On normal > > machines and virt guests, not much meomry is needed, usually 256M or a > > little more is enough. On those high end systems with hundreds of Giga > > bytes, even Tera bytes of memory, I don't think the saved memory with > > crashkernel=,cma make much sense. > > I feel the exact opposite about VMs. Reserving hundreds of MB for > crash kernel on _every_ VM on a busy VM host wastes the most > memory. VMs are often tuned to well defined task and can be set > up with very little memory, so the ~256 MB can be a huge part of > that. And while it's theoretically better to dump from the > hypervisor, users still often prefer kdump because the hypervisor > may not be under their control. Also, in a VM it should be much > easier to be sure the machine is safe WRT the potential DMA > corruption as it has less HW drivers. So I actually thought the > CMA reservation could be most useful on VMs. Hmm, we discussed this upstream before with David Hildenbrand, who works in the virt team. The VM problem is much easier to solve if people complain that the default crashkernel value is wasteful: the shrinking interface is there for them. The crashkernel value can't be enlarged, but shrinking an existing crashkernel reservation works smoothly, and they can do that from a script in a very simple way. Anyway, let's discuss and figure out any risk of ,cma. If all the worries and concerns finally prove unnecessary, then let's have a great new feature. But we can't afford the risk that the ,cma area gets entangled with the 1st kernel's ongoing activity. As we know, unlike a kexec reboot, we only shut down CPUs and interrupts; most devices stay alive, and many of them will not be reset and reinitialized in the kdump kernel if the relevant driver is not included. Earlier, we met several cases of in-flight DMA stomping into memory during kexec reboot because some PCI devices didn't provide a shutdown() method. It gave people a lot of headache to figure out and fix. Similarly for kdump, we absolutely don't want to see that happening with ,cma; it would be a disaster for kdump no matter how much memory it saves, because you don't know what happened or how to debug it until you suspect this and turn it off.
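For reference, the distinction the pin_user_pages audit above is drawing can be shown in a small sketch (illustrative code only, not taken from any of the drivers listed): a pin that is meant to back long-lived DMA passes FOLL_LONGTERM, which makes GUP migrate CMA/ZONE_MOVABLE pages to unmovable memory before the device is programmed.

    /* Illustrative only: a driver setting up long-lived DMA pins with
     * FOLL_LONGTERM, so GUP migrates pages out of CMA/ZONE_MOVABLE before
     * the transfer is programmed and a crashkernel=,cma region cannot
     * remain a DMA target.
     */
    #include <linux/mm.h>

    static int example_pin_for_longterm_dma(unsigned long uaddr, int nr_pages,
                                            struct page **pages)
    {
            return pin_user_pages_fast(uaddr, nr_pages,
                                       FOLL_WRITE | FOLL_LONGTERM, pages);
    }

The calls in the list above that omit FOLL_LONGTERM are the ones the rest of this thread worries about.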
On Thu 30-11-23 11:00:48, Baoquan He wrote: [...] > Now, we are worried if there's risk if the CMA area is retaken into kdump > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA > or DMA will interfere with kdump kernel's normal memory accessing? > Because kdump kernel usually only reset and initialize the needed > device, e.g dump target. Those unneeded devices will be unshutdown and > let go. I do not really want to discount your concerns but I am bit confused why this matters so much. First of all, if there is a buggy RDMA driver which doesn't use the proper pinning API (which would migrate away from the CMA) then what is the worst case? We will get crash kernel corrupted potentially and fail to take a proper kernel crash, right? Is this worrisome? Yes. Is it a real roadblock? I do not think so. The problem seems theoretical to me and it is not CMA usage at fault here IMHO. It is the said theoretical driver that needs fixing anyway. Now, it is really fair to mention that CMA backed crash kernel memory has some limitations - CMA reservation can only be used by the userspace in the primary kernel. If the size is overshot this might have negative impact on kernel allocations - userspace memory dumping in the crash kernel is fundamentally incomplete. Just my 2c
On 11/30/23 at 11:16am, Michal Hocko wrote: > On Thu 30-11-23 11:00:48, Baoquan He wrote: > [...] > > Now, we are worried if there's risk if the CMA area is retaken into kdump > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA > > or DMA will interfere with kdump kernel's normal memory accessing? > > Because kdump kernel usually only reset and initialize the needed > > device, e.g dump target. Those unneeded devices will be unshutdown and > > let go. > > I do not really want to discount your concerns but I am bit confused why > this matters so much. First of all, if there is a buggy RDMA driver > which doesn't use the proper pinning API (which would migrate away from > the CMA) then what is the worst case? We will get crash kernel corrupted > potentially and fail to take a proper kernel crash, right? Is this > worrisome? Yes. Is it a real roadblock? I do not think so. The problem > seems theoretical to me and it is not CMA usage at fault here IMHO. It > is the said theoretical driver that needs fixing anyway. > > Now, it is really fair to mention that CMA backed crash kernel memory > has some limitations > - CMA reservation can only be used by the userspace in the > primary kernel. If the size is overshot this might have > negative impact on kernel allocations > - userspace memory dumping in the crash kernel is fundamentally > incomplete. I am not sure if we are talking about the same thing. My concern is: ==================================================================== 1) system corruption happened, crash dumping is prepared, the cpu and interrupt controllers are shut down; 2) all pci devices are kept alive; 3) the kdump kernel boots up, and initialization is only done on those devices whose drivers are added into the kdump kernel's initrd; 4) the in-flight DMA engines could still be working if their kernel module is not loaded; In this case, if the DMA's destination is located in the crashkernel=,cma region, the DMA writes could continue even after the kdump kernel has put important kernel data into that area. Is this possible, or absolutely not possible, with DMA, RDMA, or any other mechanism that could keep accessing that area? The existing crashkernel= syntax can guarantee the reserved crashkernel area for the kdump kernel is safe. ======================================================================= The 1st kernel's data in the ,cma area is ignored once crashkernel=,cma is taken.
Hi Michal, On 11/30/23 at 08:04pm, Baoquan He wrote: > On 11/30/23 at 11:16am, Michal Hocko wrote: > > On Thu 30-11-23 11:00:48, Baoquan He wrote: > > [...] > > > Now, we are worried if there's risk if the CMA area is retaken into kdump > > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA > > > or DMA will interfere with kdump kernel's normal memory accessing? > > > Because kdump kernel usually only reset and initialize the needed > > > device, e.g dump target. Those unneeded devices will be unshutdown and > > > let go. Re-read your mail, we are saying the same thing, Please ignore the words at bottom from my last mail. > > > > I do not really want to discount your concerns but I am bit confused why > > this matters so much. First of all, if there is a buggy RDMA driver Not buggy DMA or RDMA driver. This is decided by kdump mechanism. When we do kexec reboot, we shutdown cpu, interrupt, all devicees. When we do kdump, we only shutdown cpu, interrupt. > > which doesn't use the proper pinning API (which would migrate away from > > the CMA) then what is the worst case? We will get crash kernel corrupted > > potentially and fail to take a proper kernel crash, right? Is this > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem We may fail to take a proper kernel crash, why isn't it a roadblock? We have stable way with a little more memory, why would we take risk to take another way, just for saving memory? Usually only high end server needs the big memory for crashkernel and the big end server usually have huge system ram. The big memory will be a very small percentage relative to huge system RAM. > > seems theoretical to me and it is not CMA usage at fault here IMHO. It > > is the said theoretical driver that needs fixing anyway. Now, what we want to make clear is if it's a theoretical possibility, or very likely happen. We have met several on-flight DMA stomping into kexec kernel's initrd in the past two years because device driver didn't provide shutdown() methor properly. For kdump, once it happen, the pain is we don't know how to debug. For kexec reboot, customer allows to login their system to reproduce and figure out the stomping. For kdump, the system corruption rarely happend, and the stomping could rarely happen too. The code change looks simple and the benefit is very attractive. I surely like it if finally people confirm there's no risk. As I said, we can't afford to take the risk if it possibly happen. But I don't object if other people would rather take risk, we can let it land in kernel. My personal opinion, thanks for sharing your thought. > > > > Now, it is really fair to mention that CMA backed crash kernel memory > > has some limitations > > - CMA reservation can only be used by the userspace in the > > primary kernel. If the size is overshot this might have > > negative impact on kernel allocations > > - userspace memory dumping in the crash kernel is fundamentally > > incomplete. > > I am not sure if we are talking about the same thing. 
My concern is: > ==================================================================== > 1) system corrutption happened, crash dumping is prepared, cpu and > interrupt controllers are shutdown; > 2) all pci devices are kept alive; > 3) kdump kernel boot up, initialization is only done on those devices > which drivers are added into kdump kernel's initrd; > 4) those on-flight DMA engine could be still working if their kernel > module is not loaded; > > In this case, if the DMA's destination is located in crashkernel=,cma > region, the DMA writting could continue even when kdump kernel has put > important kernel data into the area. Is this possible or absolutely not > possible with DMA, RDMA, or any other stuff which could keep accessing > that area? > > The existing crashkernel= syntax can gurantee the reserved crashkernel > area for kdump kernel is safe. > ======================================================================= > > The 1st kernel's data in the ,cma area is ignored once crashkernel=,cma > is taken. >
On Thu 30-11-23 20:04:59, Baoquan He wrote: > On 11/30/23 at 11:16am, Michal Hocko wrote: > > On Thu 30-11-23 11:00:48, Baoquan He wrote: > > [...] > > > Now, we are worried if there's risk if the CMA area is retaken into kdump > > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA > > > or DMA will interfere with kdump kernel's normal memory accessing? > > > Because kdump kernel usually only reset and initialize the needed > > > device, e.g dump target. Those unneeded devices will be unshutdown and > > > let go. > > > > I do not really want to discount your concerns but I am bit confused why > > this matters so much. First of all, if there is a buggy RDMA driver > > which doesn't use the proper pinning API (which would migrate away from > > the CMA) then what is the worst case? We will get crash kernel corrupted > > potentially and fail to take a proper kernel crash, right? Is this > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem > > seems theoretical to me and it is not CMA usage at fault here IMHO. It > > is the said theoretical driver that needs fixing anyway. > > > > Now, it is really fair to mention that CMA backed crash kernel memory > > has some limitations > > - CMA reservation can only be used by the userspace in the > > primary kernel. If the size is overshot this might have > > negative impact on kernel allocations > > - userspace memory dumping in the crash kernel is fundamentally > > incomplete. > > I am not sure if we are talking about the same thing. My concern is: > ==================================================================== > 1) system corrutption happened, crash dumping is prepared, cpu and > interrupt controllers are shutdown; > 2) all pci devices are kept alive; > 3) kdump kernel boot up, initialization is only done on those devices > which drivers are added into kdump kernel's initrd; > 4) those on-flight DMA engine could be still working if their kernel > module is not loaded; > > In this case, if the DMA's destination is located in crashkernel=,cma > region, the DMA writting could continue even when kdump kernel has put > important kernel data into the area. Is this possible or absolutely not > possible with DMA, RDMA, or any other stuff which could keep accessing > that area? I do nuderstand your concern. But as already stated if anybody uses movable memory (CMA including) as a target of {R}DMA then that memory should be properly pinned. That would mean that the memory will be migrated to somewhere outside of movable (CMA) memory before the transfer is configured. So modulo bugs this shouldn't really happen. Are there {R}DMA drivers that do not pin memory correctly? Possibly. Is that a road bloack to not using CMA to back crash kernel memory, I do not think so. Those drivers should be fixed instead. > The existing crashkernel= syntax can gurantee the reserved crashkernel > area for kdump kernel is safe. I do not think this is true. If a DMA is misconfigured it can still target crash kernel memory even if it is not mapped AFAICS. But those are theoreticals. Or am I missing something?
On Thu, Nov 30, 2023 at 9:29 PM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 30-11-23 20:04:59, Baoquan He wrote: > > On 11/30/23 at 11:16am, Michal Hocko wrote: > > > On Thu 30-11-23 11:00:48, Baoquan He wrote: > > > [...] > > > > Now, we are worried if there's risk if the CMA area is retaken into kdump > > > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA > > > > or DMA will interfere with kdump kernel's normal memory accessing? > > > > Because kdump kernel usually only reset and initialize the needed > > > > device, e.g dump target. Those unneeded devices will be unshutdown and > > > > let go. > > > > > > I do not really want to discount your concerns but I am bit confused why > > > this matters so much. First of all, if there is a buggy RDMA driver > > > which doesn't use the proper pinning API (which would migrate away from > > > the CMA) then what is the worst case? We will get crash kernel corrupted > > > potentially and fail to take a proper kernel crash, right? Is this > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It > > > is the said theoretical driver that needs fixing anyway. > > > > > > Now, it is really fair to mention that CMA backed crash kernel memory > > > has some limitations > > > - CMA reservation can only be used by the userspace in the > > > primary kernel. If the size is overshot this might have > > > negative impact on kernel allocations > > > - userspace memory dumping in the crash kernel is fundamentally > > > incomplete. > > > > I am not sure if we are talking about the same thing. My concern is: > > ==================================================================== > > 1) system corrutption happened, crash dumping is prepared, cpu and > > interrupt controllers are shutdown; > > 2) all pci devices are kept alive; > > 3) kdump kernel boot up, initialization is only done on those devices > > which drivers are added into kdump kernel's initrd; > > 4) those on-flight DMA engine could be still working if their kernel > > module is not loaded; > > > > In this case, if the DMA's destination is located in crashkernel=,cma > > region, the DMA writting could continue even when kdump kernel has put > > important kernel data into the area. Is this possible or absolutely not > > possible with DMA, RDMA, or any other stuff which could keep accessing > > that area? > > I do nuderstand your concern. But as already stated if anybody uses > movable memory (CMA including) as a target of {R}DMA then that memory > should be properly pinned. That would mean that the memory will be > migrated to somewhere outside of movable (CMA) memory before the > transfer is configured. So modulo bugs this shouldn't really happen. > Are there {R}DMA drivers that do not pin memory correctly? Possibly. Is > that a road bloack to not using CMA to back crash kernel memory, I do > not think so. Those drivers should be fixed instead. > I think that is our concern. Is there any method to guarantee that will not happen instead of 'should be' ? Any static analysis during compiling time or dynamic checking method? If this can be resolved, I think this method is promising. Thanks, Pingfan > > The existing crashkernel= syntax can gurantee the reserved crashkernel > > area for kdump kernel is safe. > > I do not think this is true. If a DMA is misconfigured it can still > target crash kernel memory even if it is not mapped AFAICS. But those > are theoreticals. 
Or am I missing something? > -- > Michal Hocko > SUSE Labs >
On Thu 30-11-23 20:31:44, Baoquan He wrote: [...] > > > which doesn't use the proper pinning API (which would migrate away from > > > the CMA) then what is the worst case? We will get crash kernel corrupted > > > potentially and fail to take a proper kernel crash, right? Is this > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem > > We may fail to take a proper kernel crash, why isn't it a roadblock? It would be if the threat was practical. So far I only see very theoretical what-if concerns. And I do not mean to downplay those at all. As already explained proper CMA users shouldn't ever leak out any writes across kernel reboot. > We > have stable way with a little more memory, why would we take risk to > take another way, just for saving memory? Usually only high end server > needs the big memory for crashkernel and the big end server usually have > huge system ram. The big memory will be a very small percentage relative > to huge system RAM. Jiri will likely talk more specific about that but our experience tells that proper crashkernel memory scaling has turned out a real maintainability problem because existing setups tend to break with major kernel version upgrades or non trivial changes. > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It > > > is the said theoretical driver that needs fixing anyway. > > Now, what we want to make clear is if it's a theoretical possibility, or > very likely happen. We have met several on-flight DMA stomping into > kexec kernel's initrd in the past two years because device driver didn't > provide shutdown() methor properly. For kdump, once it happen, the pain > is we don't know how to debug. For kexec reboot, customer allows to > login their system to reproduce and figure out the stomping. For kdump, > the system corruption rarely happend, and the stomping could rarely > happen too. yes, this is understood. > The code change looks simple and the benefit is very attractive. I > surely like it if finally people confirm there's no risk. As I said, we > can't afford to take the risk if it possibly happen. But I don't object > if other people would rather take risk, we can let it land in kernel. I think it is fair to be cautious and I wouldn't impose the new method as a default. Only time can tell how safe this really is. It is hard to protect agains theoretical issues though. Bugs should be fixed. I believe this option would allow to configure kdump much easier and less fragile. > My personal opinion, thanks for sharing your thought. Thanks for sharing.
On Thu 30-11-23 21:33:04, Pingfan Liu wrote: > On Thu, Nov 30, 2023 at 9:29 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 30-11-23 20:04:59, Baoquan He wrote: > > > On 11/30/23 at 11:16am, Michal Hocko wrote: > > > > On Thu 30-11-23 11:00:48, Baoquan He wrote: > > > > [...] > > > > > Now, we are worried if there's risk if the CMA area is retaken into kdump > > > > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA > > > > > or DMA will interfere with kdump kernel's normal memory accessing? > > > > > Because kdump kernel usually only reset and initialize the needed > > > > > device, e.g dump target. Those unneeded devices will be unshutdown and > > > > > let go. > > > > > > > > I do not really want to discount your concerns but I am bit confused why > > > > this matters so much. First of all, if there is a buggy RDMA driver > > > > which doesn't use the proper pinning API (which would migrate away from > > > > the CMA) then what is the worst case? We will get crash kernel corrupted > > > > potentially and fail to take a proper kernel crash, right? Is this > > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem > > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It > > > > is the said theoretical driver that needs fixing anyway. > > > > > > > > Now, it is really fair to mention that CMA backed crash kernel memory > > > > has some limitations > > > > - CMA reservation can only be used by the userspace in the > > > > primary kernel. If the size is overshot this might have > > > > negative impact on kernel allocations > > > > - userspace memory dumping in the crash kernel is fundamentally > > > > incomplete. > > > > > > I am not sure if we are talking about the same thing. My concern is: > > > ==================================================================== > > > 1) system corrutption happened, crash dumping is prepared, cpu and > > > interrupt controllers are shutdown; > > > 2) all pci devices are kept alive; > > > 3) kdump kernel boot up, initialization is only done on those devices > > > which drivers are added into kdump kernel's initrd; > > > 4) those on-flight DMA engine could be still working if their kernel > > > module is not loaded; > > > > > > In this case, if the DMA's destination is located in crashkernel=,cma > > > region, the DMA writting could continue even when kdump kernel has put > > > important kernel data into the area. Is this possible or absolutely not > > > possible with DMA, RDMA, or any other stuff which could keep accessing > > > that area? > > > > I do nuderstand your concern. But as already stated if anybody uses > > movable memory (CMA including) as a target of {R}DMA then that memory > > should be properly pinned. That would mean that the memory will be > > migrated to somewhere outside of movable (CMA) memory before the > > transfer is configured. So modulo bugs this shouldn't really happen. > > Are there {R}DMA drivers that do not pin memory correctly? Possibly. Is > > that a road bloack to not using CMA to back crash kernel memory, I do > > not think so. Those drivers should be fixed instead. > > > I think that is our concern. Is there any method to guarantee that > will not happen instead of 'should be' ? > Any static analysis during compiling time or dynamic checking method? I am not aware of any method to detect a driver is going to configure a RDMA. > If this can be resolved, I think this method is promising. Are you indicating this is a mandatory prerequisite?
On Thu, Nov 30, 2023 at 9:43 PM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 30-11-23 21:33:04, Pingfan Liu wrote: > > On Thu, Nov 30, 2023 at 9:29 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 30-11-23 20:04:59, Baoquan He wrote: > > > > On 11/30/23 at 11:16am, Michal Hocko wrote: > > > > > On Thu 30-11-23 11:00:48, Baoquan He wrote: > > > > > [...] > > > > > > Now, we are worried if there's risk if the CMA area is retaken into kdump > > > > > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA > > > > > > or DMA will interfere with kdump kernel's normal memory accessing? > > > > > > Because kdump kernel usually only reset and initialize the needed > > > > > > device, e.g dump target. Those unneeded devices will be unshutdown and > > > > > > let go. > > > > > > > > > > I do not really want to discount your concerns but I am bit confused why > > > > > this matters so much. First of all, if there is a buggy RDMA driver > > > > > which doesn't use the proper pinning API (which would migrate away from > > > > > the CMA) then what is the worst case? We will get crash kernel corrupted > > > > > potentially and fail to take a proper kernel crash, right? Is this > > > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem > > > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It > > > > > is the said theoretical driver that needs fixing anyway. > > > > > > > > > > Now, it is really fair to mention that CMA backed crash kernel memory > > > > > has some limitations > > > > > - CMA reservation can only be used by the userspace in the > > > > > primary kernel. If the size is overshot this might have > > > > > negative impact on kernel allocations > > > > > - userspace memory dumping in the crash kernel is fundamentally > > > > > incomplete. > > > > > > > > I am not sure if we are talking about the same thing. My concern is: > > > > ==================================================================== > > > > 1) system corrutption happened, crash dumping is prepared, cpu and > > > > interrupt controllers are shutdown; > > > > 2) all pci devices are kept alive; > > > > 3) kdump kernel boot up, initialization is only done on those devices > > > > which drivers are added into kdump kernel's initrd; > > > > 4) those on-flight DMA engine could be still working if their kernel > > > > module is not loaded; > > > > > > > > In this case, if the DMA's destination is located in crashkernel=,cma > > > > region, the DMA writting could continue even when kdump kernel has put > > > > important kernel data into the area. Is this possible or absolutely not > > > > possible with DMA, RDMA, or any other stuff which could keep accessing > > > > that area? > > > > > > I do nuderstand your concern. But as already stated if anybody uses > > > movable memory (CMA including) as a target of {R}DMA then that memory > > > should be properly pinned. That would mean that the memory will be > > > migrated to somewhere outside of movable (CMA) memory before the > > > transfer is configured. So modulo bugs this shouldn't really happen. > > > Are there {R}DMA drivers that do not pin memory correctly? Possibly. Is > > > that a road bloack to not using CMA to back crash kernel memory, I do > > > not think so. Those drivers should be fixed instead. > > > > > I think that is our concern. Is there any method to guarantee that ^^^ Sorry, to clarify, I am only speaking for myself. > > will not happen instead of 'should be' ? 
> > Any static analysis during compiling time or dynamic checking method? > > I am not aware of any method to detect a driver is going to configure a > RDMA. > If there is a pattern, scripts/coccinelle may give some help. But I am not sure about that. > > If this can be resolved, I think this method is promising. > > Are you indicating this is a mandatory prerequisite? IMHO, that should be mandatory. Otherwise, for any unexpected kdump kernel collapse, this feature can never shake off the suspicion that it caused it. Thanks, Pingfan
On Fri 01-12-23 08:54:20, Pingfan Liu wrote: [...] > > I am not aware of any method to detect a driver is going to configure a > > RDMA. > > > > If there is a pattern, scripts/coccinelle may give some help. But I am > not sure about that. I am not aware of any pattern. > > > If this can be resolved, I think this method is promising. > > > > Are you indicating this is a mandatory prerequisite? > > IMHO, that should be mandatory. Otherwise for any unexpected kdump > kernel collapses, it can not shake off its suspicion. I appreciate your carefulness! But I do not really see how such a detection would work and be maintained over time. What exactly is the scope of such a tooling? Should it be limited to RDMA drivers? Should we protect from stray writes in general? Also to make it clear. Are you going to nak the proposed solution if there is no such tooling available?
Hi Michal, On Thu, 30 Nov 2023 14:41:12 +0100 Michal Hocko <mhocko@suse.com> wrote: > On Thu 30-11-23 20:31:44, Baoquan He wrote: > [...] > > > > which doesn't use the proper pinning API (which would migrate away from > > > > the CMA) then what is the worst case? We will get crash kernel corrupted > > > > potentially and fail to take a proper kernel crash, right? Is this > > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem > > > > We may fail to take a proper kernel crash, why isn't it a roadblock? > > It would be if the threat was practical. So far I only see very > theoretical what-if concerns. And I do not mean to downplay those at > all. As already explained proper CMA users shouldn't ever leak out any > writes across kernel reboot. You are right, "proper" CMA users don't do that. But "proper" drivers also provide a working shutdown() method. Experience shows that there are enough shitty drivers out there without working shutdown(). So I think it is naive to assume you are only dealing with "proper" CMA users. For me the question is, what is less painful? Hunting down shitty (potentially out of tree) drivers that cause a memory corruption? Or ... > > We > > have stable way with a little more memory, why would we take risk to > > take another way, just for saving memory? Usually only high end server > > needs the big memory for crashkernel and the big end server usually have > > huge system ram. The big memory will be a very small percentage relative > > to huge system RAM. > > Jiri will likely talk more specific about that but our experience tells > that proper crashkernel memory scaling has turned out a real > maintainability problem because existing setups tend to break with major > kernel version upgrades or non trivial changes. ... frequently test if the crashkernel memory is still appropriate? The big advantage of the latter I see is that an OOM situation has very easy to detect and debug. A memory corruption isn't. Especially when it was triggered by an other kernel. And yes, those are all what-if concerns but unfortunately that is all we have right now. Only alternative would be to run extended tests in the field. Which means this user facing change needs to be included. Which also means that we are stuck with it as once a user facing change is in it's extremely hard to get rid of it again... Thanks Philipp > > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It > > > > is the said theoretical driver that needs fixing anyway. > > > > Now, what we want to make clear is if it's a theoretical possibility, or > > very likely happen. We have met several on-flight DMA stomping into > > kexec kernel's initrd in the past two years because device driver didn't > > provide shutdown() methor properly. For kdump, once it happen, the pain > > is we don't know how to debug. For kexec reboot, customer allows to > > login their system to reproduce and figure out the stomping. For kdump, > > the system corruption rarely happend, and the stomping could rarely > > happen too. > > yes, this is understood. > > > The code change looks simple and the benefit is very attractive. I > > surely like it if finally people confirm there's no risk. As I said, we > > can't afford to take the risk if it possibly happen. But I don't object > > if other people would rather take risk, we can let it land in kernel. > > I think it is fair to be cautious and I wouldn't impose the new method > as a default. Only time can tell how safe this really is. 
It is hard to > protect agains theoretical issues though. Bugs should be fixed. > I believe this option would allow to configure kdump much easier and > less fragile. > > > My personal opinion, thanks for sharing your thought. > > Thanks for sharing. >
Hi Jiri, I'd really love to see something like this to work. Although I also share the concerns about shitty device drivers corrupting the CMA. Please see my other mail for that. Anyway, one more comment below. On Fri, 24 Nov 2023 20:54:36 +0100 Jiri Bohac <jbohac@suse.cz> wrote: [...] > Now, specifying > crashkernel=100M craskhernel=1G,cma > on the command line will make a standard crashkernel reservation > of 100M, where kexec will load the kernel and initrd. > > An additional 1G will be reserved from CMA, still usable by the > production system. The crash kernel will have 1.1G memory > available. The 100M can be reliably predicted based on the size > of the kernel and initrd. I doubt that the fixed part can be predicted "reliably". For sure it will be more reliable than today but IMHO we will still be stuck with some guessing. Otherwise it would mean that you already know during boot which initrd the user space will be loading later on. Which IMHO is impossible as the initrd can always be rebuild with a larger size. Furthermore, I'd be careful when you are dealing with compressed kernel images. As I'm not sure how the different decompressor phases would handle scenarios where the (fixed) crashkernel memory is large enough to hold the compressed kernel (+initrd) but not the decompressed one. One more thing, I'm not sure I like that you need to reserve two separate memory regions. Personally I would prefer it if you could reserve one large region for the crashkernel but allow parts of it to be reused via CMA. Otherwise I'm afraid there will be people who only have one ,cma entry on the command line and cannot figure out why they cannot load the crash kernel. Thanks Philipp > When no crashkernel=size,cma is specified, everything works as > before. >
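For readers who have not looked at the CMA side of the proposal, a boot-time CMA area is declared with the existing cma_declare_contiguous() helper, roughly as in the sketch below. This is only an illustration of the mechanism; the hook point, the handle name and the exact wiring in Jiri's actual patches may well differ.

    #include <linux/cma.h>
    #include <linux/init.h>

    static struct cma *crashk_cma;  /* hypothetical handle name */

    static int __init example_reserve_crashkernel_cma(phys_addr_t size)
    {
            /* base=0 and limit=0 let the allocator pick a suitable range.
             * The resulting area stays usable for movable user pages in
             * the primary kernel and would only be handed over as system
             * RAM to the crash kernel.
             */
            return cma_declare_contiguous(0, size, 0, 0, 0, false,
                                          "crashkernel", &crashk_cma);
    }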
On Fri 01-12-23 12:33:53, Philipp Rudo wrote: [...] > And yes, those are all what-if concerns but unfortunately that is all > we have right now. Should theoretical concerns without an actual evidence (e.g. multiple drivers known to be broken) become a roadblock for this otherwise useful feature? > Only alternative would be to run extended tests in > the field. Which means this user facing change needs to be included. > Which also means that we are stuck with it as once a user facing change > is in it's extremely hard to get rid of it again... I am not really sure I follow you here. Are you suggesting once crashkernel=cma is added it would become a user api and therefore impossible to get rid of?
On Thu, Nov 30, 2023 at 12:01:36PM +0800, Baoquan He wrote: > On 11/29/23 at 11:51am, Jiri Bohac wrote: > > We get a lot of problems reported by partners testing kdump on > > their setups prior to release. But even if we tune the reserved > > size up, OOM is still the most common reason for kdump to fail > > when the product starts getting used in real life. It's been > > pretty frustrating for a long time. > > I remember SUSE engineers ever told you will boot kernel and do an > estimation of kdump kernel usage, then set the crashkernel according to > the estimation. OOM will be triggered even that way is taken? Just > curious, not questioning the benefit of using ,cma to save memory. Yes, we do that during the kdump package build. We use this to find some baseline for memory requirements of the kdump kernel and tools on that specific product. Using these numbers we estimate the requirements on the system where kdump is configured by adding extra memory for the size of RAM, number of SCSI devices, etc. But apparently we get this wrong in too many cases, because the actual hardware differs too much from the virtual environment which we used to get the baseline numbers. We've been adding silly constants to the calculations and we still get OOMs on one hand and people hesitant to sacrifice the calculated amount of memory on the other. The result is that kdump basically cannot be trusted unless the user verifies that the sacrificed memory is still enough after every major upgrade. This is the main motivation behind the CMA idea: to safely give kdump enough memory, including a safe margin, without sacrificing too much memory. > > I feel the exact opposite about VMs. Reserving hundreds of MB for > > crash kernel on _every_ VM on a busy VM host wastes the most > > memory. VMs are often tuned to well defined task and can be set > > up with very little memory, so the ~256 MB can be a huge part of > > that. And while it's theoretically better to dump from the > > hypervisor, users still often prefer kdump because the hypervisor > > may not be under their control. Also, in a VM it should be much > > easier to be sure the machine is safe WRT the potential DMA > > corruption as it has less HW drivers. So I actually thought the > > CMA reservation could be most useful on VMs. > > Hmm, we ever discussed this in upstream with David Hildend who works in > virt team. VMs problem is much easier to solve if they complain the > default crashkernel value is wasteful. The shrinking interface is for > them. The crashkernel value can't be enlarged, but shrinking existing > crashkernel memory is functioning smoothly well. They can adjust that in > script in a very simple way. The shrinking does not solve this problem at all. It solves a different problem: the virtual hardware configuration can easily vary between boots and so will the crashkernel size requirements. And since crashkernel needs to be passed on the commandline, once the system is booted it's impossible to change it without a reboot. Here the shrinking mechanism comes in handy - we reserve enough for all configurations on the command line and during boot the requirements for the currently booted configuration can be determined and the reservation shrunk to the determined value. But determining this value is the same unsolved problem as above and CMA could help in exactly the same way. > Anyway, let's discuss and figure out any risk of ,cma. If finally all > worries and concerns are proved unnecessary, then let's have a new great > feature. 
But we can't afford the risk if the ,cma area could be entangled > with 1st kernel's on-going action. As we know, not like kexec reboot, we > only shutdown CPUs, interrupt, most of devices are alive. And many of > them could be not reset and initialized in kdump kernel if the relevant > driver is not added in. Well since my patchset makes the use of ,cma completely optional and has _absolutely_ _no_ _effect_ on users that don't opt to use it, I think you're not taking any risk at all. We will never know how much DMA is a problem in practice unless we give users or distros a way to try and come up with good ways of determining if it's safe on whichever specific system based on the hardware, drivers, etc. I've successfully tested the patches on a few systems, physical and virtual. Of course this is not proof that the DMA problem does not exist but shows that it may be a solution that mostly works. If nothing else, for systems where sacrificing ~400 MB of memory is something that prevents the user from having any dump at all, having a dump that mostly works with a sacrifice of ~100 MB may be useful. Thanks,
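For completeness, the shrinking interface discussed above is the writable /sys/kernel/kexec_crash_size file. A tiny userspace sketch (the 256 MiB value is made up) of how a boot script can hand part of the reservation back:

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/kernel/kexec_crash_size", "w");

            if (!f)
                    return 1;
            /* Shrink the reserved crashkernel memory to 256 MiB. The size
             * can only be reduced, never grown, which is why shrinking
             * does not solve the "reserve enough up front" problem Jiri
             * describes.
             */
            fprintf(f, "%lu\n", 256UL << 20);
            return fclose(f) ? 1 : 0;
    }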
On Fri, 1 Dec 2023 12:55:52 +0100 Michal Hocko <mhocko@suse.com> wrote: > On Fri 01-12-23 12:33:53, Philipp Rudo wrote: > [...] > > And yes, those are all what-if concerns but unfortunately that is all > > we have right now. > > Should theoretical concerns without an actual evidence (e.g. multiple > drivers known to be broken) become a roadblock for this otherwise useful > feature? Those concerns aren't just theoretical. They are experiences we have from a related feature that suffers exactly the same problem regularly which wouldn't exist if everybody would simply work "properly". And yes, even purely theoretical concerns can become a roadblock for a feature when the cost of those theoretical concerns exceed the benefit of the feature. The thing is that bugs will be reported against kexec. So _we_ need to figure out which of the shitty drivers caused the problem. That puts additional burden on _us_. What we are trying to evaluate at the moment is if the benefit outweighs the extra burden with the information we have at the moment. > > Only alternative would be to run extended tests in > > the field. Which means this user facing change needs to be included. > > Which also means that we are stuck with it as once a user facing change > > is in it's extremely hard to get rid of it again... > > I am not really sure I follow you here. Are you suggesting once > crashkernel=cma is added it would become a user api and therefore > impossible to get rid of? Yes, sort of. I wouldn't rank a command line parameter as user api. So we still can get rid of it. But there will be long discussions I'd like to avoid if possible. Thanks Philipp
On Fri 01-12-23 16:51:13, Philipp Rudo wrote: > On Fri, 1 Dec 2023 12:55:52 +0100 > Michal Hocko <mhocko@suse.com> wrote: > > > On Fri 01-12-23 12:33:53, Philipp Rudo wrote: > > [...] > > > And yes, those are all what-if concerns but unfortunately that is all > > > we have right now. > > > > Should theoretical concerns without an actual evidence (e.g. multiple > > drivers known to be broken) become a roadblock for this otherwise useful > > feature? > > Those concerns aren't just theoretical. They are experiences we have > from a related feature that suffers exactly the same problem regularly > which wouldn't exist if everybody would simply work "properly". What is the related feature? > And yes, even purely theoretical concerns can become a roadblock for a > feature when the cost of those theoretical concerns exceed the benefit > of the feature. The thing is that bugs will be reported against kexec. > So _we_ need to figure out which of the shitty drivers caused the > problem. That puts additional burden on _us_. What we are trying to > evaluate at the moment is if the benefit outweighs the extra burden > with the information we have at the moment. I do understand your concerns! But I am pretty sure you do realize that it is really hard to argue theoreticals. Let me restate what I consider facts. Hopefully we can agree on these points - the CMA region can be used by user space memory which is a great advantage because the memory is not wasted and our experience has shown that users do care about this a lot. We _know_ that pressure on making those reservations smaller results in a less reliable crashdump and more resources spent on tuning and testing (especially after major upgrades). A larger reservation which is not completely wasted for the normal runtime is addressing that concern. - There is no other known mechanism to achieve the reusability of the crash kernel memory to stop the wastage without much more intrusive code/api impact (e.g. a separate zone or dedicated interface to prevent any hazardous usage like RDMA). - implementation wise the patch has a very small footprint. It is using an existing infrastructure (CMA) and it adds a minimal hooking into crashkernel configuration. - The only identified risk so far is RDMA acting on this memory without using proper pinning interface. If it helps to have a statement from RDMA maintainers/developers then we can pull them in for a further discussion of course. - The feature requires an explicit opt-in so this doesn't bring any new risk to existing crash kernel users until they decide to use it. AFAIU there is no way to tell that the crash kernel memory used to be CMA based in the primary kernel. If you believe that having that information available for debugability would help then I believe this shouldn't be hard to add. I think it would even make sense to mark this feature experimental to make it clear to users that this needs some time before it can be marked production ready. I hope I haven't really missed anything important. The final cost/benefit judgment is up to you, maintainers, of course but I would like to remind that we are dealing with a _real_ problem that many production systems are struggling with and that we don't really have any other solution available.
On Fri, 1 Dec 2023 17:59:02 +0100 Michal Hocko <mhocko@suse.com> wrote: > On Fri 01-12-23 16:51:13, Philipp Rudo wrote: > > On Fri, 1 Dec 2023 12:55:52 +0100 > > Michal Hocko <mhocko@suse.com> wrote: > > > > > On Fri 01-12-23 12:33:53, Philipp Rudo wrote: > > > [...] > > > > And yes, those are all what-if concerns but unfortunately that is all > > > > we have right now. > > > > > > Should theoretical concerns without an actual evidence (e.g. multiple > > > drivers known to be broken) become a roadblock for this otherwise useful > > > feature? > > > > Those concerns aren't just theoretical. They are experiences we have > > from a related feature that suffers exactly the same problem regularly > > which wouldn't exist if everybody would simply work "properly". > > What is the related feature? kexec > > And yes, even purely theoretical concerns can become a roadblock for a > > feature when the cost of those theoretical concerns exceed the benefit > > of the feature. The thing is that bugs will be reported against kexec. > > So _we_ need to figure out which of the shitty drivers caused the > > problem. That puts additional burden on _us_. What we are trying to > > evaluate at the moment is if the benefit outweighs the extra burden > > with the information we have at the moment. > > I do understand your concerns! But I am pretty sure you do realize that > it is really hard to argue theoreticals. Let me restate what I consider > facts. Hopefully we can agree on these points > - the CMA region can be used by user space memory which is a > great advantage because the memory is not wasted and our > experience has shown that users do care about this a lot. We > _know_ that pressure on making those reservations smaller > results in a less reliable crashdump and more resources spent > on tuning and testing (especially after major upgrades). A > larger reservation which is not completely wasted for the > normal runtime is addressing that concern. > - There is no other known mechanism to achieve the reusability > of the crash kernel memory to stop the wastage without much > more intrusive code/api impact (e.g. a separate zone or > dedicated interface to prevent any hazardous usage like RDMA). > - implementation wise the patch has a very small footprint. It > is using an existing infrastructure (CMA) and it adds a > minimal hooking into crashkernel configuration. > - The only identified risk so far is RDMA acting on this memory > without using proper pinning interface. If it helps to have a > statement from RDMA maintainers/developers then we can pull > them in for a further discussion of course. > - The feature requires an explicit opt-in so this doesn't bring > any new risk to existing crash kernel users until they decide > to use it. AFAIU there is no way to tell that the crash kernel > memory used to be CMA based in the primary kernel. If you > believe that having that information available for > debugability would help then I believe this shouldn't be hard > to add. I think it would even make sense to mark this feature > experimental to make it clear to users that this needs some > time before it can be marked production ready. > > I hope I haven't really missed anything important. The final If I understand Documentation/core-api/pin_user_pages.rst correctly you missed case 1 Direct IO. In that case "short term" DMA is allowed for pages without FOLL_LONGTERM. Meaning that there is a way you can corrupt the CMA and with that the crash kernel after the production kernel has panicked. 
With that I don't see a chance this series can be included unless someone can explain to me that the documentation is wrong or that I understood it wrong. Having said that, NAcked-by: Philipp Rudo <prudo@redhat.com> > cost/benefit judgment is up to you, maintainers, of course but I would > like to remind that we are dealing with a _real_ problem that many > production systems are struggling with and that we don't really have any > other solution available.
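To make the objection concrete: "case 1" in pin_user_pages.rst is a short-term pin, i.e. a pin taken without FOLL_LONGTERM, so nothing migrates the page out of CMA before the device starts writing. A minimal sketch of that pattern (illustrative only, not lifted from the direct IO code):

    #include <linux/mm.h>

    static int example_short_term_pin(unsigned long uaddr, int nr_pages,
                                      struct page **pages)
    {
            /* No FOLL_LONGTERM: a page sitting inside a crashkernel=,cma
             * range is pinned in place, and a device may still be writing
             * to it when the crash kernel starts reusing that memory.
             */
            return pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
    }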
On 06.12.23 12:08, Philipp Rudo wrote: > On Fri, 1 Dec 2023 17:59:02 +0100 > Michal Hocko <mhocko@suse.com> wrote: > >> On Fri 01-12-23 16:51:13, Philipp Rudo wrote: >>> On Fri, 1 Dec 2023 12:55:52 +0100 >>> Michal Hocko <mhocko@suse.com> wrote: >>> >>>> On Fri 01-12-23 12:33:53, Philipp Rudo wrote: >>>> [...] >>>>> And yes, those are all what-if concerns but unfortunately that is all >>>>> we have right now. >>>> >>>> Should theoretical concerns without an actual evidence (e.g. multiple >>>> drivers known to be broken) become a roadblock for this otherwise useful >>>> feature? >>> >>> Those concerns aren't just theoretical. They are experiences we have >>> from a related feature that suffers exactly the same problem regularly >>> which wouldn't exist if everybody would simply work "properly". >> >> What is the related feature? > > kexec > >>> And yes, even purely theoretical concerns can become a roadblock for a >>> feature when the cost of those theoretical concerns exceed the benefit >>> of the feature. The thing is that bugs will be reported against kexec. >>> So _we_ need to figure out which of the shitty drivers caused the >>> problem. That puts additional burden on _us_. What we are trying to >>> evaluate at the moment is if the benefit outweighs the extra burden >>> with the information we have at the moment. >> >> I do understand your concerns! But I am pretty sure you do realize that >> it is really hard to argue theoreticals. Let me restate what I consider >> facts. Hopefully we can agree on these points >> - the CMA region can be used by user space memory which is a >> great advantage because the memory is not wasted and our >> experience has shown that users do care about this a lot. We >> _know_ that pressure on making those reservations smaller >> results in a less reliable crashdump and more resources spent >> on tuning and testing (especially after major upgrades). A >> larger reservation which is not completely wasted for the >> normal runtime is addressing that concern. >> - There is no other known mechanism to achieve the reusability >> of the crash kernel memory to stop the wastage without much >> more intrusive code/api impact (e.g. a separate zone or >> dedicated interface to prevent any hazardous usage like RDMA). >> - implementation wise the patch has a very small footprint. It >> is using an existing infrastructure (CMA) and it adds a >> minimal hooking into crashkernel configuration. >> - The only identified risk so far is RDMA acting on this memory >> without using proper pinning interface. If it helps to have a >> statement from RDMA maintainers/developers then we can pull >> them in for a further discussion of course. >> - The feature requires an explicit opt-in so this doesn't bring >> any new risk to existing crash kernel users until they decide >> to use it. AFAIU there is no way to tell that the crash kernel >> memory used to be CMA based in the primary kernel. If you >> believe that having that information available for >> debugability would help then I believe this shouldn't be hard >> to add. I think it would even make sense to mark this feature >> experimental to make it clear to users that this needs some >> time before it can be marked production ready. >> >> I hope I haven't really missed anything important. The final > > If I understand Documentation/core-api/pin_user_pages.rst correctly you > missed case 1 Direct IO. In that case "short term" DMA is allowed for > pages without FOLL_LONGTERM. 
Meaning that there is a way you can > corrupt the CMA and with that the crash kernel after the production > kernel has panicked. > > With that I don't see a chance this series can be included unless > someone can explain me that that the documentation is wrong or I > understood it wrong. I think you are right. We'd have to disallow any FOLL_PIN on these CMA pages, or find other ways of handling that (detect that there are no short-term pins any more). But I'm also wondering how MMU-notifier-based approaches might interfere, where CMA pages might be transparently mapped into secondary MMUs, possibly with DMA going on. Are we sure that all these secondary MMUs are inactive as soon as we kexec?
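A rough sketch of the MMU-notifier pattern being referred to (names are illustrative and the ops are left empty; a real user implements the invalidation callbacks), just to show that no pin is involved at all:

    #include <linux/mmu_notifier.h>
    #include <linux/sched.h>

    /* A driver mirroring the CPU page tables into a secondary MMU
     * registers a notifier like this; device access can then target any
     * page of the mm, including CMA pages, without pin_user_pages() ever
     * being called. What stops such access across panic + kexec is
     * exactly the open question above.
     */
    static const struct mmu_notifier_ops example_mirror_ops = {
            /* .invalidate_range_start / .invalidate_range_end go here */
    };

    static struct mmu_notifier example_mn = {
            .ops = &example_mirror_ops,
    };

    static int example_mirror_current_mm(void)
    {
            return mmu_notifier_register(&example_mn, current->mm);
    }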
On Wed 06-12-23 12:08:05, Philipp Rudo wrote: > On Fri, 1 Dec 2023 17:59:02 +0100 > Michal Hocko <mhocko@suse.com> wrote: > > > On Fri 01-12-23 16:51:13, Philipp Rudo wrote: > > > On Fri, 1 Dec 2023 12:55:52 +0100 > > > Michal Hocko <mhocko@suse.com> wrote: > > > > > > > On Fri 01-12-23 12:33:53, Philipp Rudo wrote: > > > > [...] > > > > > And yes, those are all what-if concerns but unfortunately that is all > > > > > we have right now. > > > > > > > > Should theoretical concerns without an actual evidence (e.g. multiple > > > > drivers known to be broken) become a roadblock for this otherwise useful > > > > feature? > > > > > > Those concerns aren't just theoretical. They are experiences we have > > > from a related feature that suffers exactly the same problem regularly > > > which wouldn't exist if everybody would simply work "properly". > > > > What is the related feature? > > kexec OK, but that is a completely different thing, no? crashkernel parameter doesn't affect kexec. Or what is the actual relation? > > > And yes, even purely theoretical concerns can become a roadblock for a > > > feature when the cost of those theoretical concerns exceed the benefit > > > of the feature. The thing is that bugs will be reported against kexec. > > > So _we_ need to figure out which of the shitty drivers caused the > > > problem. That puts additional burden on _us_. What we are trying to > > > evaluate at the moment is if the benefit outweighs the extra burden > > > with the information we have at the moment. > > > > I do understand your concerns! But I am pretty sure you do realize that > > it is really hard to argue theoreticals. Let me restate what I consider > > facts. Hopefully we can agree on these points > > - the CMA region can be used by user space memory which is a > > great advantage because the memory is not wasted and our > > experience has shown that users do care about this a lot. We > > _know_ that pressure on making those reservations smaller > > results in a less reliable crashdump and more resources spent > > on tuning and testing (especially after major upgrades). A > > larger reservation which is not completely wasted for the > > normal runtime is addressing that concern. > > - There is no other known mechanism to achieve the reusability > > of the crash kernel memory to stop the wastage without much > > more intrusive code/api impact (e.g. a separate zone or > > dedicated interface to prevent any hazardous usage like RDMA). > > - implementation wise the patch has a very small footprint. It > > is using an existing infrastructure (CMA) and it adds a > > minimal hooking into crashkernel configuration. > > - The only identified risk so far is RDMA acting on this memory > > without using proper pinning interface. If it helps to have a > > statement from RDMA maintainers/developers then we can pull > > them in for a further discussion of course. > > - The feature requires an explicit opt-in so this doesn't bring > > any new risk to existing crash kernel users until they decide > > to use it. AFAIU there is no way to tell that the crash kernel > > memory used to be CMA based in the primary kernel. If you > > believe that having that information available for > > debugability would help then I believe this shouldn't be hard > > to add. I think it would even make sense to mark this feature > > experimental to make it clear to users that this needs some > > time before it can be marked production ready. > > > > I hope I haven't really missed anything important. 
The final > > If I understand Documentation/core-api/pin_user_pages.rst correctly you > missed case 1 Direct IO. In that case "short term" DMA is allowed for > pages without FOLL_LONGTERM. Meaning that there is a way you can > corrupt the CMA and with that the crash kernel after the production > kernel has panicked. Could you expand on this? How exactly does a direct IO request survive across into the kdump kernel? I do understand the RDMA case because the IO is async and out of the control of the receiving end. Also, if direct IO is a problem, how come this is not a problem for kexec in general? The new kernel usually shares all the memory with the 1st kernel. /me confused.
On Wed 06-12-23 14:49:51, Michal Hocko wrote: > On Wed 06-12-23 12:08:05, Philipp Rudo wrote: [...] > > If I understand Documentation/core-api/pin_user_pages.rst correctly you > > missed case 1 Direct IO. In that case "short term" DMA is allowed for > > pages without FOLL_LONGTERM. Meaning that there is a way you can > > corrupt the CMA and with that the crash kernel after the production > > kernel has panicked. > > Could you expand on this? How exactly direct IO request survives across > into the kdump kernel? I do understand the RMDA case because the IO is > async and out of control of the receiving end. OK, I guess I get what you mean. You are worried that there is
  DIO request
  program DMA controller to read into CMA memory
  <panic>
  boot into crash kernel backed by CMA
  DMA transfer is done.
DIO doesn't migrate the pinned memory because it is considered a very quick operation which doesn't block the movability for too long. That is why I have considered that a non-problem. RDMA on the other hand might pin memory for a transfer for much longer, but that case is handled by migrating the memory away. Now I agree that there is a chance of corruption from DIO. The question I am not entirely clear about right now is how big of a real problem that is. DMA transfers should be a very swift operation. Would it help to wait for a grace period before jumping into the kdump kernel? > Also if direct IO is a problem how come this is not a problem for kexec > in general. The new kernel usually shares all the memory with the 1st > kernel. This is also clearer now. Pure kexec is shutting down all the devices, which should terminate the in-flight DMA transfers.
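A very rough sketch of what such a grace period could look like; the hook, the knob and its default value are invented here for illustration and are not part of the posted series.

    #include <linux/delay.h>

    /* Hypothetical tunable: how long to spin before jumping into the
     * crash kernel when a ,cma reservation is in use.
     */
    static unsigned int crash_cma_dma_grace_ms = 100;

    static void crash_cma_dma_grace_period(void)
    {
            /* The panic path cannot sleep, so only a busy wait is
             * possible; the assumption is that already-programmed
             * short-term DMA (e.g. direct IO) completes within this
             * window. It is a mitigation, not a guarantee.
             */
            mdelay(crash_cma_dma_grace_ms);
    }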
On 12/06/23 at 04:19pm, Michal Hocko wrote:
> On Wed 06-12-23 14:49:51, Michal Hocko wrote:
> > On Wed 06-12-23 12:08:05, Philipp Rudo wrote:
> [...]
> > > If I understand Documentation/core-api/pin_user_pages.rst correctly you
> > > missed case 1 Direct IO. In that case "short term" DMA is allowed for
> > > pages without FOLL_LONGTERM. Meaning that there is a way you can
> > > corrupt the CMA and with that the crash kernel after the production
> > > kernel has panicked.
> >
> > Could you expand on this? How exactly does a direct IO request survive
> > across into the kdump kernel? I do understand the RDMA case because the
> > IO is async and out of control of the receiving end.
>
> OK, I guess I get what you mean. You are worried that there is
> 	DIO request
> 	program DMA controller to read into CMA memory
> 	<panic>
> 	boot into crash kernel backed by CMA
> 	DMA transfer is done.
>
> DIO doesn't migrate the pinned memory because it is considered a very
> quick operation which doesn't block the movability for too long. That is
> why I have considered that a non-problem. RDMA on the other hand might
> pin memory for transfer for much longer, but that case is handled by
> migrating the memory away.
>
> Now I agree that there is a chance of corruption from DIO. The question
> I am not entirely clear about right now is how big of a real problem
> that is. DMA transfers should be a very swift operation. Would it help
> to wait for a grace period before jumping into the kdump kernel?

On x86_64 systems with a hardware IOMMU, people finally fixed this after a
very long history of attempts and arguments. It was not until 2014 that an
HPE engineer came up with a series to copy the 1st kernel's IOMMU page
tables into the kdump kernel so that in-flight DMA from the 1st kernel can
keep transferring to its original target. Later, these attempts and
discussions were turned into code that was merged into the mainline kernel.
Before that, people even tried to introduce reset_devices() before jumping
into the kdump kernel. That was rejected immediately because any extra,
unnecessary action could cause the kdump kernel to fail unpredictably,
given that the 1st kernel is already in an unpredictable, unstable state.
We can't guarantee how swift the DMA transfer could be in the ,cma case;
it will be a venture.

[3] [PATCH v9 00/13] Fix the on-flight DMA issue on system with amd iommu
    https://lists.openwall.net/linux-kernel/2017/08/01/399
[2] [PATCH 00/19] Fix Intel IOMMU breakage in kdump kernel
    https://lists.openwall.net/linux-kernel/2015/06/13/72
[1] [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
    https://lkml.org/lkml/2014/4/24/836

> > Also if direct IO is a problem how come this is not a problem for kexec
> > in general. The new kernel usually shares all the memory with the 1st
> > kernel.
>
> This is also more clear now. Pure kexec is shutting down all the devices
> which should terminate the in-flight DMA transfers.

Exactly. That's what I have been pointing out in this thread.
On Thu 07-12-23 12:23:13, Baoquan He wrote: [...] > We can't guarantee how swift the DMA transfer could be in the cma, case, > it will be a venture. We can't guarantee this of course but AFAIK the DMA shouldn't take minutes, right? While not perfect, waiting for some time before jumping into the crash kernel should be acceptable from user POV and it should work around most of those potential lingering programmed DMA transfers. So I guess what we would like to hear from you as kdump maintainers is this. Is it absolutely imperative that these issue must be proven impossible or is a best effort approach something worth investing time into? Because if the requirement is an absolute guarantee then I simply do not see any feasible way to achieve the goal of reusable memory. Let me reiterate that the existing reservation mechanism is showing its limits for production systems and I strongly believe this is something that needs addressing because crash dumps are very often the only tool to investigate complex issues.
On Thu, 7 Dec 2023 09:55:20 +0100 Michal Hocko <mhocko@suse.com> wrote: > On Thu 07-12-23 12:23:13, Baoquan He wrote: > [...] > > We can't guarantee how swift the DMA transfer could be in the cma, case, > > it will be a venture. > > We can't guarantee this of course but AFAIK the DMA shouldn't take > minutes, right? While not perfect, waiting for some time before jumping > into the crash kernel should be acceptable from user POV and it should > work around most of those potential lingering programmed DMA transfers. I don't think that simply waiting is acceptable. For one it doesn't guarantee that there is no corruption (please also see below) but only reduces its probability. Furthermore, how long would you wait? Thing is that users don't only want to reduce the memory usage but also the downtime of kdump. In the end I'm afraid that "simply waiting" will make things unnecessarily more complex without really solving any issue. > So I guess what we would like to hear from you as kdump maintainers is > this. Is it absolutely imperative that these issue must be proven > impossible or is a best effort approach something worth investing time > into? Because if the requirement is an absolute guarantee then I simply > do not see any feasible way to achieve the goal of reusable memory. > > Let me reiterate that the existing reservation mechanism is showing its > limits for production systems and I strongly believe this is something > that needs addressing because crash dumps are very often the only tool > to investigate complex issues. Because having a crash dump is so important I want a prove that no legal operation can corrupt the crashkernel memory. The easiest way to achieve this is by simply keeping the two memory regions fully separated like it is today. In theory it should also be possible to prevent any kind of page pinning in the shared crashkernel memory. But I don't know which side effect this has for mm. Such an idea needs to be discussed on the mm mailing list first. Finally, let me question whether the whole approach actually solves anything. For me the difficulty in determining the correct crashkernel memory is only a symptom. The real problem is that most developers don't expect their code to run outside their typical environment. Especially not in an memory constraint environment like kdump. But that problem won't be solved by throwing more memory at it as this additional memory will eventually run out as well. In the end we are back at the point where we are today but with more memory. Finally finally, one tip. Next time a customer complaints about how much memory the crashkernel "wastes" ask them how much one day of down time for one machine costs them and how much memory they could buy for that money. After that calculation I'm pretty sure that an additional 100M of crashkernel memory becomes much more tempting. Thanks Philipp
On Wed, 6 Dec 2023 16:19:51 +0100 Michal Hocko <mhocko@suse.com> wrote: > On Wed 06-12-23 14:49:51, Michal Hocko wrote: > > On Wed 06-12-23 12:08:05, Philipp Rudo wrote: > [...] > > > If I understand Documentation/core-api/pin_user_pages.rst correctly you > > > missed case 1 Direct IO. In that case "short term" DMA is allowed for > > > pages without FOLL_LONGTERM. Meaning that there is a way you can > > > corrupt the CMA and with that the crash kernel after the production > > > kernel has panicked. > > > > Could you expand on this? How exactly direct IO request survives across > > into the kdump kernel? I do understand the RMDA case because the IO is > > async and out of control of the receiving end. > > OK, I guess I get what you mean. You are worried that there is > DIO request > program DMA controller to read into CMA memory > <panic> > boot into crash kernel backed by CMA > DMA transfer is done. > > DIO doesn't migrate the pinned memory because it is considered a very > quick operation which doesn't block the movability for too long. That is > why I have considered that a non-problem. RDMA on the other might pin > memory for transfer for much longer but that case is handled by > migrating the memory away. Right that is the scenario we need to prevent. > Now I agree that there is a chance of the corruption from DIO. The > question I am not entirely clear about right now is how big of a real > problem that is. DMA transfers should be a very swift operation. Would > it help to wait for a grace period before jumping into the kdump kernel? Please see my other mail. > > Also if direct IO is a problem how come this is not a problem for kexec > > in general. The new kernel usually shares all the memory with the 1st > > kernel. > > This is also more clear now. Pure kexec is shutting down all the devices > which should terminate the in-flight DMA transfers. Right, it _should_ terminate all transfers. But here we are back at the shitty device drivers that don't have a working shutdown method. That's why we have already seen the problem you describe above with kexec. And please believe me that debugging such a scenario is an absolute pain. Especially when it's a proprietary, out-of-tree driver that caused the mess. Thanks Philipp
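For context, the shutdown method being blamed here is an ordinary driver
callback that the regular kexec path reaches via device_shutdown(). A
sketch with a made-up PCI driver (the foo_* names and the register layout
are invented; pci_driver, pci_get_drvdata() and pci_clear_master() are the
real interfaces) of the quiescing a well-behaved driver is expected to do:

    #include <linux/pci.h>

    /* Hypothetical driver state, for illustration only. */
    struct foo_priv {
        void __iomem *regs;
    };

    static void foo_hw_stop(struct foo_priv *priv)
    {
        /* Device-specific: disable the DMA engines and wait until idle. */
    }

    /*
     * Called from device_shutdown() on reboot/kexec.  If this is missing
     * or broken, descriptor fetches and DMA writes programmed by this
     * device can still be in flight when the next kernel starts.
     */
    static void foo_shutdown(struct pci_dev *pdev)
    {
        struct foo_priv *priv = pci_get_drvdata(pdev);

        foo_hw_stop(priv);
        pci_clear_master(pdev);    /* block further bus-master DMA */
    }

    static struct pci_driver foo_driver = {
        .name     = "foo",
        /* .id_table, .probe, .remove omitted */
        .shutdown = foo_shutdown,
    };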
On Thu 07-12-23 12:13:14, Philipp Rudo wrote: > On Thu, 7 Dec 2023 09:55:20 +0100 > Michal Hocko <mhocko@suse.com> wrote: > > > On Thu 07-12-23 12:23:13, Baoquan He wrote: > > [...] > > > We can't guarantee how swift the DMA transfer could be in the cma, case, > > > it will be a venture. > > > > We can't guarantee this of course but AFAIK the DMA shouldn't take > > minutes, right? While not perfect, waiting for some time before jumping > > into the crash kernel should be acceptable from user POV and it should > > work around most of those potential lingering programmed DMA transfers. > > I don't think that simply waiting is acceptable. For one it doesn't > guarantee that there is no corruption (please also see below) but only > reduces its probability. Furthermore, how long would you wait? I would like to talk to storage experts to have some ballpark idea about worst case scenario but waiting for 1 minutes shouldn't terribly influence downtime and remember this is an opt-in feature. If that doesn't fit your use case, do not use it. > Thing is that users don't only want to reduce the memory usage but also > the downtime of kdump. In the end I'm afraid that "simply waiting" will > make things unnecessarily more complex without really solving any issue. I am not sure I see the added complexity. Something as simple as __crash_kexec: if (crashk_cma_cnt) mdelay(TIMEOUT) should do the trick. > > So I guess what we would like to hear from you as kdump maintainers is > > this. Is it absolutely imperative that these issue must be proven > > impossible or is a best effort approach something worth investing time > > into? Because if the requirement is an absolute guarantee then I simply > > do not see any feasible way to achieve the goal of reusable memory. > > > > Let me reiterate that the existing reservation mechanism is showing its > > limits for production systems and I strongly believe this is something > > that needs addressing because crash dumps are very often the only tool > > to investigate complex issues. > > Because having a crash dump is so important I want a prove that no > legal operation can corrupt the crashkernel memory. The easiest way to > achieve this is by simply keeping the two memory regions fully > separated like it is today. In theory it should also be possible to > prevent any kind of page pinning in the shared crashkernel memory. But > I don't know which side effect this has for mm. Such an idea needs to > be discussed on the mm mailing list first. I do not see that as a feasible option. That would require to migrate memory on any gup user that might end up sending data over DMA. > Finally, let me question whether the whole approach actually solves > anything. For me the difficulty in determining the correct crashkernel > memory is only a symptom. The real problem is that most developers > don't expect their code to run outside their typical environment. > Especially not in an memory constraint environment like kdump. But that > problem won't be solved by throwing more memory at it as this > additional memory will eventually run out as well. In the end we are > back at the point where we are today but with more memory. I disagree with you here. While the kernel is really willing to cache objects into memory I do not think that any particular subsystem is super eager to waste memory. The thing we should keep in mind is that the memory sitting aside is not used in majority of time. Crashes (luckily/hopefully) do not happen very often. 
And I can really see why people are reluctant to waste it. Every MB of
memory has an operational price tag on it. And let's just be really honest,
a simple reboot without a crash dump is very likely a cheaper option than
wasting productive memory, as long as the issue happens very seldom.

> Finally finally, one tip. Next time a customer complaints about how
> much memory the crashkernel "wastes" ask them how much one day of down
> time for one machine costs them and how much memory they could buy for
> that money. After that calculation I'm pretty sure that an additional
> 100M of crashkernel memory becomes much more tempting.

Exactly, and that is why a simple reboot would be preferred over
configuring kdump and investing admin time into re-testing the
configuration after every (major) upgrade to make sure the existing setup
still works. From my experience, crashdump availability hugely improves
the chances of getting the underlying crash diagnosed and the bug solved,
so it is also in our interest to encourage kdump deployments as much as
possible.

Now I do get your concerns about potential issues and I fully recognize
the pain you have gone through when debugging these subtle issues in the
past, but let's not forget that perfect is the enemy of good and that a
best effort solution might be better than no crash dumps at all.

At the end, let me just ask a theoretical question. With the experience
you have gained, would you nack the kexec support if it was proposed now,
just because of all the potential problems it might have?
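Spelled out, the delay floated earlier in this message really is tiny in
code terms. A sketch only: crashk_cma_cnt is the counter quoted above from
the series (its type is assumed here), while the helper name and the grace
period value are made up, and whether any fixed delay is acceptable is
exactly what is being disputed:

    #include <linux/delay.h>

    /* From the series, as quoted above; number of CMA crashkernel ranges. */
    extern unsigned int crashk_cma_cnt;

    /* Made-up value: give already-programmed short-term DMA time to finish. */
    #define CRASH_CMA_DMA_GRACE_MS	10000

    static void crash_cma_dma_grace_period(void)
    {
        /*
         * Only relevant when part of the crashkernel memory is CMA-backed
         * and may therefore still be the target of direct IO set up by the
         * crashed kernel.
         */
        if (crashk_cma_cnt)
            mdelay(CRASH_CMA_DMA_GRACE_MS);
    }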
On 12/07/23 at 12:52pm, Michal Hocko wrote: > On Thu 07-12-23 12:13:14, Philipp Rudo wrote: > > On Thu, 7 Dec 2023 09:55:20 +0100 > > Michal Hocko <mhocko@suse.com> wrote: > > > > > On Thu 07-12-23 12:23:13, Baoquan He wrote: > > > [...] > > > > We can't guarantee how swift the DMA transfer could be in the cma, case, > > > > it will be a venture. > > > > > > We can't guarantee this of course but AFAIK the DMA shouldn't take > > > minutes, right? While not perfect, waiting for some time before jumping > > > into the crash kernel should be acceptable from user POV and it should > > > work around most of those potential lingering programmed DMA transfers. > > > > I don't think that simply waiting is acceptable. For one it doesn't > > guarantee that there is no corruption (please also see below) but only > > reduces its probability. Furthermore, how long would you wait? > > I would like to talk to storage experts to have some ballpark idea about > worst case scenario but waiting for 1 minutes shouldn't terribly > influence downtime and remember this is an opt-in feature. If that > doesn't fit your use case, do not use it. > > > Thing is that users don't only want to reduce the memory usage but also > > the downtime of kdump. In the end I'm afraid that "simply waiting" will > > make things unnecessarily more complex without really solving any issue. > > I am not sure I see the added complexity. Something as simple as > __crash_kexec: > if (crashk_cma_cnt) > mdelay(TIMEOUT) > > should do the trick. I would say please don't do this. kdump jumping is a very quick behavirou after corruption, usually in several seconds. I can't see any meaningful stuff with the delay of one minute or several minutes. Most importantly, the 1st kernel is in corruption which is a very unpredictable state. ... > > Finally, let me question whether the whole approach actually solves > > anything. For me the difficulty in determining the correct crashkernel > > memory is only a symptom. The real problem is that most developers > > don't expect their code to run outside their typical environment. > > Especially not in an memory constraint environment like kdump. But that > > problem won't be solved by throwing more memory at it as this > > additional memory will eventually run out as well. In the end we are > > back at the point where we are today but with more memory. > > I disagree with you here. While the kernel is really willing to cache > objects into memory I do not think that any particular subsystem is > super eager to waste memory. > > The thing we should keep in mind is that the memory sitting aside is not > used in majority of time. Crashes (luckily/hopefully) do not happen very > often. And I can really see why people are reluctant to waste it. Every > MB of memory has an operational price tag on it. And let's just be > really honest, a simple reboot without a crash dump is very likely > a cheaper option than wasting a productive memory as long as the issue > happens very seldom. All the time, I have never heard people don't want to "waste" the memory. E.g, for more than 90% of system on x86, 256M is enough. The rare exceptions will be noted once recognized and documented in product release. And ,cma is not silver bullet, see this oom issue caused by i40e and its fix , your crashkernel=1G,cma won't help either. 
[v1,0/3] Reducing memory usage of i40e for kdump
https://patchwork.ozlabs.org/project/intel-wired-lan/cover/20210304025543.334912-1-coxu@redhat.com/

====== Abstracted from the above cover letter ======
After reducing the allocation of tx/rx/arg/asq ring buffers to the
minimum, the memory consumption is significantly reduced:
 - x86_64: 85.1MB to 1.2MB
 - POWER9: 15368.5MB to 20.8MB
====================================================

And to say a bit more about it: this is not the first attempt to make use
of a ,cma area for crashkernel=. In Red Hat, at least 5 people have tried
to add this; we finally gave up after long discussion and investigation.
This year, one kernel developer in our team raised this again with a very
long mail after his own analysis, and we told him about the discussions
and attempts we had made in the past.
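The i40e work referenced above follows a pattern that exists independently
of the CMA discussion: drivers detect the kdump environment and shrink
their footprint. A generic sketch of that pattern; the foo_* names and the
sizes are invented, is_kdump_kernel() is the real helper:

    #include <linux/crash_dump.h>

    #define FOO_RING_SIZE_DEFAULT	4096
    #define FOO_RING_SIZE_KDUMP	64	/* bare minimum that still works */

    static unsigned int foo_ring_size(void)
    {
        /*
         * In the memory-constrained kdump environment only a trivial
         * amount of traffic is needed (e.g. writing the dump over the
         * network), so don't allocate production-sized rings.
         */
        if (is_kdump_kernel())
            return FOO_RING_SIZE_KDUMP;

        return FOO_RING_SIZE_DEFAULT;
    }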
On 12/07/23 at 09:55am, Michal Hocko wrote: > On Thu 07-12-23 12:23:13, Baoquan He wrote: > [...] > > We can't guarantee how swift the DMA transfer could be in the cma, case, > > it will be a venture. > > We can't guarantee this of course but AFAIK the DMA shouldn't take > minutes, right? While not perfect, waiting for some time before jumping > into the crash kernel should be acceptable from user POV and it should > work around most of those potential lingering programmed DMA transfers. > > So I guess what we would like to hear from you as kdump maintainers is > this. Is it absolutely imperative that these issue must be proven > impossible or is a best effort approach something worth investing time > into? Because if the requirement is an absolute guarantee then I simply > do not see any feasible way to achieve the goal of reusable memory. Honestly, I think all the discussions and proof have told clearly it's not a good idea. This is not about who wants this, who doesn't. So far, this is an objective fact that taking ,cma area for crashkernel= is not a good idea, it's very risky. We don't deny this at the beginning. I tried to present all what I know, we have experienced, we have investigated, we have tried. I wanted to see if this time we can clarify some concerns may be mistaken. But it's not. The risk is obvious and very likely happen. > > Let me reiterate that the existing reservation mechanism is showing its > limits for production systems and I strongly believe this is something > that needs addressing because crash dumps are very often the only tool > to investigate complex issues. Yes, I admit that. But it haven't got to the point that it's too bad to bear so that we have to take the risk to take ,cma instead.
On Fri 08-12-23 09:55:39, Baoquan He wrote: > On 12/07/23 at 12:52pm, Michal Hocko wrote: > > On Thu 07-12-23 12:13:14, Philipp Rudo wrote: [...] > > > Thing is that users don't only want to reduce the memory usage but also > > > the downtime of kdump. In the end I'm afraid that "simply waiting" will > > > make things unnecessarily more complex without really solving any issue. > > > > I am not sure I see the added complexity. Something as simple as > > __crash_kexec: > > if (crashk_cma_cnt) > > mdelay(TIMEOUT) > > > > should do the trick. > > I would say please don't do this. kdump jumping is a very quick > behavirou after corruption, usually in several seconds. I can't see any > meaningful stuff with the delay of one minute or several minutes. Well, I've been told that DMA should complete within seconds after controller is programmed (if that was much more then short term pinning is not really appropriate because that would block memory movability for way too long and therefore result in failures). This is something we can tune for. But if that sounds like a completely wrong approach then I think an alternative would be to live with potential inflight DMAs just avoid using that memory by the kdump kernel before the DMA controllers (PCI bus) is reinitialized by the kdump kernel. That should happen early in the boot process IIRC and the CMA backed memory could be added after that moment. We already do have means so defer memory initialization so an extension shouldn't be hard to do. It will be a slightly more involved patch touching core MM which we have tried to avoid so far. Does that sound like something acceptable? [...] > > The thing we should keep in mind is that the memory sitting aside is not > > used in majority of time. Crashes (luckily/hopefully) do not happen very > > often. And I can really see why people are reluctant to waste it. Every > > MB of memory has an operational price tag on it. And let's just be > > really honest, a simple reboot without a crash dump is very likely > > a cheaper option than wasting a productive memory as long as the issue > > happens very seldom. > > All the time, I have never heard people don't want to "waste" the > memory. E.g, for more than 90% of system on x86, 256M is enough. The > rare exceptions will be noted once recognized and documented in product > release. > > And ,cma is not silver bullet, see this oom issue caused by i40e and its > fix , your crashkernel=1G,cma won't help either. > > [v1,0/3] Reducing memory usage of i40e for kdump > https://patchwork.ozlabs.org/project/intel-wired-lan/cover/20210304025543.334912-1-coxu@redhat.com/ > > ======Abstrcted from above cover letter========================== > After reducing the allocation of tx/rx/arg/asq ring buffers to the > minimum, the memory consumption is significantly reduced, > - x86_64: 85.1MB to 1.2MB > - POWER9: 15368.5MB to 20.8MB > ================================================================== Nice to see memory consumption reduction fixes. But, honestly this should happen regardless of kdump. CMA backed kdump is not to workaround excessive kernel memory consumers. It seems I am failing to get the message through :( but I do not know how else to express that the pressure on reducing the wasted memory is real. It is not important whether 256MB is enough for everybody. Even that would grow to non trivial cost in data centers with many machines. > And say more about it. This is not the first time of attempt to make use > of ,cma area for crashkernel=. 
In redhat, at least 5 people have tried > to add this, finally we gave up after long discussion and investigation. > This year, one kernel developer in our team raised this again with a > very long mail after his own analysis, we told him the discussion and > trying we have done in the past. This is really hard to comment on without any references to those discussions. From this particular email thread I have a perception that you guys focus much more on correctness provability than feasibility. If we applied the same approach universally then many other features couldn't have been merged. E.g. kexec for reasons you have mentioned in the email thread. Anyway, thanks for pointing to regular DMA via gup case which we were obviously not aware of. I personally have considered this to be a marginal problem comparing to RDMA which is unpredictable wrt timing. But we believe that this could be worked around. Now it would be really valuable if we knew somebody has _tried_ that and it turned out not working because of XYZ reasons. That would be a solid base to re-evaluate and think of different approaches. Look, this will be your call as maintainers in the end. If you are decided then fair enough. We might end up trying this feature downstream and maybe come back in the future with an experience which we currently do not have. But it seems we are not alone seeing the existing state is insufficient (http://lkml.kernel.org/r/20230719224821.GC3528218@google.com). Thanks!
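For completeness, one possible shape of the deferral idea raised earlier in
this message: keep the CMA-backed part of the crashkernel range unused by
the kdump kernel until after devices have been reinitialized. This is a
sketch under several assumptions: the range bounds would have to be handed
over to the kdump kernel somehow (the variables below are made up), the
range would have to be covered by the kdump kernel's memory map, and
whether late_initcall() really runs late enough relative to PCI/IOMMU
reinitialization is one of the open questions:

    #include <linux/init.h>
    #include <linux/mm.h>
    #include <linux/io.h>

    /* Made up: bounds of the formerly CMA-backed crashkernel memory, as
     * passed to the kdump kernel (e.g. via its command line). */
    static phys_addr_t deferred_cma_base;
    static phys_addr_t deferred_cma_size;

    /*
     * Until this runs, the range stays reserved, so any DMA the crashed
     * kernel left behind can only land in memory that nothing in the kdump
     * kernel is using yet.
     */
    static int __init crashk_cma_release_deferred(void)
    {
        if (!deferred_cma_size)
            return 0;

        free_reserved_area(phys_to_virt(deferred_cma_base),
                           phys_to_virt(deferred_cma_base + deferred_cma_size),
                           -1, "crashkernel CMA (deferred)");
        return 0;
    }
    late_initcall(crashk_cma_release_deferred);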