[0/4] kdump: crashkernel reservation from CMA

Message ID: ZWD_fAPqEWkFlEkM@dwarf.suse.cz
Series: kdump: crashkernel reservation from CMA

Jiri Bohac Nov. 24, 2023, 7:54 p.m. UTC
  Hi,

this series implements a new way to reserve additional crash kernel
memory using CMA.

Currently, none of the memory reserved for the crash kernel is
usable by the 1st (production) kernel. It is also unmapped so
that it can't be corrupted by the fault that will eventually
trigger the crash. This makes sense for the memory actually used
by the kexec-loaded crash kernel image and initrd and the data
prepared during the load (vmcoreinfo, ...). However, the reserved
space needs to be much larger than that to provide enough
run-time memory for the crash kernel and the kdump userspace.
Estimating the amount of memory to reserve is difficult: being
too conservative makes kdump likely to end in OOM, while being
too generous takes even more memory away from the production
system. Also, the reservation only allows a single contiguous
block (or two with the "low" suffix). I've seen systems where
this fails because the physical memory is fragmented.

By reserving additional crashkernel memory from CMA, the main
crashkernel reservation can be just small enough to fit the 
kernel and initrd image, minimizing the memory taken away from
the production system. Most of the run-time memory for the crash
kernel will be memory previously available to userspace in the
production system. As this memory is no longer wasted, the
reservation can be done with a generous margin, making kdump more
reliable. Kernel memory that we need to preserve for dumping is 
never allocated from CMA. User data is typically not dumped by 
makedumpfile. When dumping of user data is intended this new CMA 
reservation cannot be used.

There are four patches in this series:

The first adds a new ",cma" suffix to the recently introduced generic
crashkernel parsing code. parse_crashkernel() takes one more
argument to store the CMA reservation size.
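
For illustration, an arch-side caller could then look roughly like
the sketch below. This is only a sketch: the name and position of the
extra cma_size output parameter, and the example_ function name, are
assumptions for illustration, not necessarily what the patch does.

	#include <linux/crash_core.h>
	#include <linux/init.h>
	#include <linux/memblock.h>

	static void __init example_arch_reserve_crashkernel(void)
	{
		unsigned long long crash_size, crash_base, low_size, cma_size;
		bool high = false;

		/* parse crashkernel=, ",low", ",high" and the new ",cma" suffix */
		if (parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
				      &crash_size, &crash_base,
				      &low_size, &cma_size, &high))
			return;		/* no crashkernel= on the command line */

		/* ... the usual crashkernel / crashkernel,low reservation ... */

		if (cma_size)
			reserve_crashkernel_cma(cma_size);	/* added by the second patch */
	}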

The second patch implements reserve_crashkernel_cma() which
performs the reservation. If the requested size is not available
in a single range, multiple smaller ranges will be reserved.
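
To give an idea of the mechanism, here is a minimal sketch of such a
fallback loop built on the existing cma_declare_contiguous() API. It is
not the actual patch; the CRASHKERNEL_CMA_RANGES cap and the crashk_cma_*
names are assumptions made for the illustration.

	#include <linux/cma.h>
	#include <linux/init.h>
	#include <linux/printk.h>

	#define CRASHKERNEL_CMA_RANGES	16	/* arbitrary cap for the sketch */

	static struct cma *crashk_cma_areas[CRASHKERNEL_CMA_RANGES];
	static int crashk_cma_cnt;

	void __init reserve_crashkernel_cma(unsigned long long cma_size)
	{
		unsigned long long remaining = cma_size;
		unsigned long long chunk = cma_size;

		while (remaining && crashk_cma_cnt < CRASHKERNEL_CMA_RANGES) {
			if (chunk > remaining)
				chunk = remaining;

			/* try to carve one CMA area of "chunk" bytes anywhere */
			if (!cma_declare_contiguous(0, chunk, 0,
						    CMA_MIN_ALIGNMENT_BYTES, 0, false,
						    "crashkernel",
						    &crashk_cma_areas[crashk_cma_cnt])) {
				crashk_cma_cnt++;
				remaining -= chunk;
			} else if (chunk > CMA_MIN_ALIGNMENT_BYTES) {
				chunk /= 2;	/* no such range; try smaller pieces */
			} else {
				break;		/* give up, reservation stays partial */
			}
		}

		if (remaining)
			pr_warn("crashkernel: only %llu of %llu bytes reserved from CMA\n",
				cma_size - remaining, cma_size);
	}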

The third patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
  through /proc/vmcore by excluding them from the vmcoreinfo
  PT_LOAD ranges (a sketch of this step follows below).
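
The sketch below illustrates that third step. It is loosely modeled on
how arch crash code already trims memory ranges with
crash_exclude_mem_range() when building the elfcorehdr for
kexec_file_load(); the crashk_cma_ranges[] / crashk_cma_cnt names are
assumptions about how the reserved CMA areas might be tracked, not the
actual patch.

	#include <linux/crash_core.h>
	#include <linux/kexec.h>
	#include <linux/range.h>

	/* assumed bookkeeping of the physical extents of the reserved CMA areas */
	extern struct range crashk_cma_ranges[];
	extern int crashk_cma_cnt;

	static int exclude_crashk_cma(struct crash_mem *cmem)
	{
		int i, ret;

		for (i = 0; i < crashk_cma_cnt; i++) {
			/* drop the CMA range from the elfcorehdr PT_LOAD ranges */
			ret = crash_exclude_mem_range(cmem,
						      crashk_cma_ranges[i].start,
						      crashk_cma_ranges[i].end);
			if (ret)
				return ret;
		}
		return 0;
	}
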
Adding other architectures is easy and I can do that as soon as
this series is merged.

The fourth patch just updates Documentation/

Now, specifying
	crashkernel=100M crashkernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.

An additional 1G will be reserved from CMA, still usable by the
production system. The crash kernel will have 1.1G memory
available. The 100M can be reliably predicted based on the size
of the kernel and initrd.

When no crashkernel=size,cma is specified, everything works as
before.
  

Comments

Tao Liu Nov. 25, 2023, 1:51 a.m. UTC | #1
Hi Jiri,

On Sat, Nov 25, 2023 at 3:55 AM Jiri Bohac <jbohac@suse.cz> wrote:
>
> Hi,
>
> this series implements a new way to reserve additional crash kernel
> memory using CMA.
>
> Currently, all the memory for the crash kernel is not usable by
> the 1st (production) kernel. It is also unmapped so that it can't
> be corrupted by the fault that will eventually trigger the crash.
> This makes sense for the memory actually used by the kexec-loaded
> crash kernel image and initrd and the data prepared during the
> load (vmcoreinfo, ...). However, the reserved space needs to be
> much larger than that to provide enough run-time memory for the
> crash kernel and the kdump userspace. Estimating the amount of
> memory to reserve is difficult. Being too careful makes kdump
> likely to end in OOM, being too generous takes even more memory
> from the production system. Also, the reservation only allows
> reserving a single contiguous block (or two with the "low"
> suffix). I've seen systems where this fails because the physical
> memory is fragmented.
>
> By reserving additional crashkernel memory from CMA, the main
> crashkernel reservation can be just small enough to fit the
> kernel and initrd image, minimizing the memory taken away from
> the production system. Most of the run-time memory for the crash
> kernel will be memory previously available to userspace in the
> production system. As this memory is no longer wasted, the
> reservation can be done with a generous margin, making kdump more
> reliable. Kernel memory that we need to preserve for dumping is
> never allocated from CMA. User data is typically not dumped by
> makedumpfile. When dumping of user data is intended this new CMA
> reservation cannot be used.
>

Thanks for the idea of using CMA as part of memory for the 2nd kernel.
However I have a question:

What if there is on-going DMA/RDMA access to the CMA range when the 1st
kernel crashes? There might be data corruption when the 2nd kernel and
DMA/RDMA write to the same place; how can such an issue be addressed?

Thanks,
Tao Liu

> There are four patches in this series:
>
> The first adds a new ",cma" suffix to the recenly introduced generic
> crashkernel parsing code. parse_crashkernel() takes one more
> argument to store the cma reservation size.
>
> The second patch implements reserve_crashkernel_cma() which
> performs the reservation. If the requested size is not available
> in a single range, multiple smaller ranges will be reserved.
>
> The third patch enables the functionality for x86 as a proof of
> concept. There are just three things every arch needs to do:
> - call reserve_crashkernel_cma()
> - include the CMA-reserved ranges in the physical memory map
> - exclude the CMA-reserved ranges from the memory available
>   through /proc/vmcore by excluding them from the vmcoreinfo
>   PT_LOAD ranges.
> Adding other architectures is easy and I can do that as soon as
> this series is merged.
>
> The fourth patch just updates Documentation/
>
> Now, specifying
>         crashkernel=100M craskhernel=1G,cma
> on the command line will make a standard crashkernel reservation
> of 100M, where kexec will load the kernel and initrd.
>
> An additional 1G will be reserved from CMA, still usable by the
> production system. The crash kernel will have 1.1G memory
> available. The 100M can be reliably predicted based on the size
> of the kernel and initrd.
>
> When no crashkernel=size,cma is specified, everything works as
> before.
>
> --
> Jiri Bohac <jbohac@suse.cz>
> SUSE Labs, Prague, Czechia
  
Jiri Bohac Nov. 25, 2023, 9:22 p.m. UTC | #2
Hi Tao, 

On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> However I have a question:
> 
> What if there is on-going DMA/RDMA access on the CMA range when 1st
> kernel crash? There might be data corruption when 2nd kernel and
> DMA/RDMA write to the same place, how to address such an issue?

The crash kernel CMA area(s) registered via
cma_declare_contiguous() are distinct from the
dma_contiguous_default_area or device-specific CMA areas that
dma_alloc_contiguous() would use to reserve memory for DMA.

Kernel pages will not be allocated from the crash kernel CMA
area(s), because kernel allocations are not __GFP_MOVABLE. The CMA
area will only be used for user pages.

User pages for RDMA should be pinned with FOLL_LONGTERM, and that
would migrate them away from the CMA area.
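
For illustration, here is a minimal sketch of what a well-behaved
long-term pinning path looks like; it is not taken from any particular
driver. The point is the FOLL_LONGTERM flag: GUP migrates such pages out
of CMA/ZONE_MOVABLE before pinning them, so they cannot stay in the
crash kernel CMA area.

	#include <linux/errno.h>
	#include <linux/mm.h>

	static int example_pin_user_buffer(unsigned long uaddr, int nr_pages,
					   struct page **pages)
	{
		int pinned;

		/* FOLL_LONGTERM makes GUP migrate the pages away from CMA first */
		pinned = pin_user_pages_fast(uaddr, nr_pages,
					     FOLL_WRITE | FOLL_LONGTERM, pages);
		if (pinned < 0)
			return pinned;			/* hard error */
		if (pinned != nr_pages) {
			unpin_user_pages(pages, pinned);
			return -EFAULT;			/* partial pin, bail out */
		}
		return 0;	/* pages are now safe to map for long-lived DMA */
	}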

But you're right that DMA to user pages pinned without
FOLL_LONGTERM would still be possible. Would this be a problem in
practice? Do you see any way around it?

Thanks,
  
Tao Liu Nov. 28, 2023, 1:12 a.m. UTC | #3
Hi Jiri,

On Sun, Nov 26, 2023 at 5:22 AM Jiri Bohac <jbohac@suse.cz> wrote:
>
> Hi Tao,
>
> On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > However I have a question:
> >
> > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > kernel crash? There might be data corruption when 2nd kernel and
> > DMA/RDMA write to the same place, how to address such an issue?
>
> The crash kernel CMA area(s) registered via
> cma_declare_contiguous() are distinct from the
> dma_contiguous_default_area or device-specific CMA areas that
> dma_alloc_contiguous() would use to reserve memory for DMA.
>
> Kernel pages will not be allocated from the crash kernel CMA
> area(s), because they are not GFP_MOVABLE. The CMA area will only
> be used for user pages.
>
> User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> would migrate them away from the CMA area.
>
> But you're right that DMA to user pages pinned without
> FOLL_LONGTERM would still be possible. Would this be a problem in
> practice? Do you see any way around it?
>

Thanks for the explanation! Sorry I don't have any ideas so far...

@Pingfan Liu @Baoquan He Hi, do you have any suggestions for it?

Thanks,
Tao Liu

> Thanks,
>
> --
> Jiri Bohac <jbohac@suse.cz>
> SUSE Labs, Prague, Czechia
>
  
Pingfan Liu Nov. 28, 2023, 2:07 a.m. UTC | #4
On Sun, Nov 26, 2023 at 5:24 AM Jiri Bohac <jbohac@suse.cz> wrote:
>
> Hi Tao,
>
> On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > However I have a question:
> >
> > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > kernel crash? There might be data corruption when 2nd kernel and
> > DMA/RDMA write to the same place, how to address such an issue?
>
> The crash kernel CMA area(s) registered via
> cma_declare_contiguous() are distinct from the
> dma_contiguous_default_area or device-specific CMA areas that
> dma_alloc_contiguous() would use to reserve memory for DMA.
>
> Kernel pages will not be allocated from the crash kernel CMA
> area(s), because they are not GFP_MOVABLE. The CMA area will only
> be used for user pages.
>
> User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> would migrate them away from the CMA area.
>
> But you're right that DMA to user pages pinned without
> FOLL_LONGTERM would still be possible. Would this be a problem in
> practice? Do you see any way around it?
>

I don't have a real case in mind. But this problem has kept us from
using the CMA area in kdump for years. Most importantly, this method
can introduce bugs that are hard to track down.

For a way around it, maybe you can introduce a specific zone and, for any
GUP, migrate the pages away. I have doubts about whether this approach
is worthwhile, considering the trade-off between benefits and
complexity.

Thanks,

Pingfan
  
Baoquan He Nov. 28, 2023, 2:11 a.m. UTC | #5
On 11/28/23 at 09:12am, Tao Liu wrote:
> Hi Jiri,
> 
> On Sun, Nov 26, 2023 at 5:22 AM Jiri Bohac <jbohac@suse.cz> wrote:
> >
> > Hi Tao,
> >
> > On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > > However I have a question:
> > >
> > > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > > kernel crash? There might be data corruption when 2nd kernel and
> > > DMA/RDMA write to the same place, how to address such an issue?
> >
> > The crash kernel CMA area(s) registered via
> > cma_declare_contiguous() are distinct from the
> > dma_contiguous_default_area or device-specific CMA areas that
> > dma_alloc_contiguous() would use to reserve memory for DMA.
> >
> > Kernel pages will not be allocated from the crash kernel CMA
> > area(s), because they are not GFP_MOVABLE. The CMA area will only
> > be used for user pages.
> >
> > User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> > would migrate them away from the CMA area.
> >
> > But you're right that DMA to user pages pinned without
> > FOLL_LONGTERM would still be possible. Would this be a problem in
> > practice? Do you see any way around it?

Thanks for the effort to bring this up, Jiri.

I am wondering how you will use this crashkernel=,cma parameter, i.e.
the intended scenario for crashkernel=,cma. I am asking because I don't
know how SUSE deploys kdump in SUSE distros. Is the kdump kernel's
initramfs the same as the 1st kernel's, or does it only contain the
kernel modules needed for the required devices? E.g. if we dump to a
local disk, is the NIC driver filtered out? In the latter case, the
in-flight DMA issue is possible: the NIC may have a DMA buffer in the
CMA area but is not reset during kdump bootup because the NIC driver is
not loaded to initialize it. I'm not sure this would always happen, but
is it possible in theory?

Recently we have been seeing an issue on an HPE system where PCI error
messages always appear in the kdump kernel, even though it's a local
dump, the NIC device is not needed and the igb driver is not loaded.
Adding the igb driver into the kdump initramfs works around it. This is
similar to the in-flight DMA issue above.

The crashkernel=,cma option requires that no userspace data be dumped.
From our support engineers' feedback, customers never say they don't
need to dump user space data. Assume a server with a huge database
deployed, the database has been crashing often recently, and the
database vendor claims it's not the database's fault, so the OS needs
to prove its innocence. What will you do then?

So this looks like a nice-to-have to me. At least for Fedora/RHEL usage,
we may only backport this patch and add one sentence to our user guide
saying "there's a new crashkernel=,cma option that can be used together
with crashkernel= to save memory, please feel free to try it if you
like", unless SUSE or other distros decide to use it as the default
config or something like that. Please correct me if I missed anything
or got anything wrong.

Thanks
Baoquan
  
Michal Hocko Nov. 28, 2023, 8:58 a.m. UTC | #6
On Tue 28-11-23 10:07:08, Pingfan Liu wrote:
> On Sun, Nov 26, 2023 at 5:24 AM Jiri Bohac <jbohac@suse.cz> wrote:
> >
> > Hi Tao,
> >
> > On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > > However I have a question:
> > >
> > > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > > kernel crash? There might be data corruption when 2nd kernel and
> > > DMA/RDMA write to the same place, how to address such an issue?
> >
> > The crash kernel CMA area(s) registered via
> > cma_declare_contiguous() are distinct from the
> > dma_contiguous_default_area or device-specific CMA areas that
> > dma_alloc_contiguous() would use to reserve memory for DMA.
> >
> > Kernel pages will not be allocated from the crash kernel CMA
> > area(s), because they are not GFP_MOVABLE. The CMA area will only
> > be used for user pages.
> >
> > User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> > would migrate them away from the CMA area.
> >
> > But you're right that DMA to user pages pinned without
> > FOLL_LONGTERM would still be possible. Would this be a problem in
> > practice? Do you see any way around it?
> >
> 
> I have not a real case in mind. But this problem has kept us from
> using the CMA area in kdump for years.  Most importantly, this method
> will introduce an uneasy tracking bug.

Long term pinning is something that has changed the picture IMHO. The
API had been brewing for a long time, but it has been established and
its usage is spreading. Is it possible that some driver could be doing
remote DMA without the long term pinning? Quite possible, but this means
such a driver should be fixed rather than preventing CMA use for this
usecase TBH.
 
> For a way around, maybe you can introduce a specific zone, and for any
> GUP, migrate the pages away. I have doubts about whether this approach
> is worthwhile, considering the trade-off between benefits and
> complexity.

No, a zone is definitely not an answer to that, because userspace would
need to be able to use that memory and userspace might pin memory for
direct IO and other purposes. So in the end longterm pinning would need
to be used anyway.
  
Michal Hocko Nov. 28, 2023, 9:08 a.m. UTC | #7
On Tue 28-11-23 10:11:31, Baoquan He wrote:
> On 11/28/23 at 09:12am, Tao Liu wrote:
[...]
> Thanks for the effort to bring this up, Jiri.
> 
> I am wondering how you will use this crashkernel=,cma parameter. I mean
> the scenario of crashkernel=,cma. Asking this because I don't know how
> SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
> driver will be filter out? If latter case, It's possibly having the
> on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
> reset during kdump bootup because the NIC driver is not loaded in to
> initialize. Not sure if this is 100%, possible in theory?

NIC drivers do not allocate from movable zones (that includes the CMA
zone). In fact the kernel doesn't use GFP_MOVABLE for non-user requests.
RDMA drivers might and do transfer from user backed memory, but for that
purpose they should be pinning memory (have a look at
__gup_longterm_locked and its callers) and that will migrate it away
from the movable/CMA zone.
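
As a small illustration (a sketch, not taken from any driver): whether
an allocation may be satisfied from a CMA pageblock is decided by the
movable gfp flags, which kernel-internal allocations do not use.

	#include <linux/gfp.h>

	static struct page *example_alloc(bool for_user)
	{
		/*
		 * GFP_HIGHUSER_MOVABLE (used for anonymous user pages) may be
		 * satisfied from CMA pageblocks; GFP_KERNEL never is, because
		 * it lacks __GFP_MOVABLE.
		 */
		gfp_t gfp = for_user ? GFP_HIGHUSER_MOVABLE : GFP_KERNEL;

		return alloc_pages(gfp, 0);
	}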
 
[...]
> The crashkernel=,cma requires no userspace data dumping, from our
> support engineers' feedback, customer never express they don't need to
> dump user space data. Assume a server with huge databse deployed, and
> the database often collapsed recently and database provider claimed that
> it's not database's fault, OS need prove their innocence. What will you
> do?

Don't use CMA backed crash memory then? This is an optional feature.
 
> So this looks like a nice to have to me. At least in fedora/rhel's
> usage, we may only back port this patch, and add one sentence in our
> user guide saying "there's a crashkernel=,cma added, can be used with
> crashkernel= to save memory. Please feel free to try if you like".
> Unless SUSE or other distros decides to use it as default config or
> something like that. Please correct me if I missed anything or took
> anything wrong.

Jiri will know better than me, but for us a proper crash memory
configuration has become a real nut to crack. You do not want to reserve
too much because it effectively cuts off usable memory, and we regularly
hit "not enough memory" when we tried to be savvy. The tighter you try
to configure it, the easier it is to fail. Even worse, any in-kernel
memory consumer can increase its memory demand and push the overall
consumption off the cliff. So this is not an easy-to-maintain solution.
CMA backed crash memory can be much more generous while still being
usable.
  
Baoquan He Nov. 29, 2023, 7:57 a.m. UTC | #8
On 11/28/23 at 10:08am, Michal Hocko wrote:
> On Tue 28-11-23 10:11:31, Baoquan He wrote:
> > On 11/28/23 at 09:12am, Tao Liu wrote:
> [...]
> > Thanks for the effort to bring this up, Jiri.
> > 
> > I am wondering how you will use this crashkernel=,cma parameter. I mean
> > the scenario of crashkernel=,cma. Asking this because I don't know how
> > SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
> > driver will be filter out? If latter case, It's possibly having the
> > on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
> > reset during kdump bootup because the NIC driver is not loaded in to
> > initialize. Not sure if this is 100%, possible in theory?
> 
> NIC drivers do not allocation from movable zones (that includes CMA
> zone). In fact kernel doesn't use GFP_MOVABLE for non-user requests.
> RDMA drivers might and do transfer from user backed memory but for that
> purpose they should be pinning memory (have a look at
> __gup_longterm_locked and its callers) and that will migrate away from
> the any zone.

OK, in that case, we don't need to worry about the risk of DMA.

>  
> [...]
> > The crashkernel=,cma requires no userspace data dumping, from our
> > support engineers' feedback, customer never express they don't need to
> > dump user space data. Assume a server with huge databse deployed, and
> > the database often collapsed recently and database provider claimed that
> > it's not database's fault, OS need prove their innocence. What will you
> > do?
> 
> Don't use CMA backed crash memory then? This is an optional feature.

Guess so. As I said earlier, this is more like a nice-to-have feature;
we can suggest users try it by themselves, since Jiri hasn't said how he
will use it.

>  
> > So this looks like a nice to have to me. At least in fedora/rhel's
> > usage, we may only back port this patch, and add one sentence in our
> > user guide saying "there's a crashkernel=,cma added, can be used with
> > crashkernel= to save memory. Please feel free to try if you like".
> > Unless SUSE or other distros decides to use it as default config or
> > something like that. Please correct me if I missed anything or took
> > anything wrong.
> 
> Jiri will know better than me but for us a proper crash memory
> configuration has become a real nut. You do not want to reserve too much
> because it is effectively cutting of the usable memory and we regularly
> hit into "not enough memory" if we tried to be savvy. The more tight you
> try to configure the easier to fail that is. Even worse any in kernel
> memory consumer can increase its memory demand and get the overall
> consumption off the cliff. So this is not an easy to maintain solution.
> CMA backed crash memory can be much more generous while still usable.

Hmm, Red Hat has gone a different way. We have been trying to:
1) customize the initrd specifically for the kdump kernel, e.g. exclude
unneeded devices' drivers to save memory;
2) monitor device and kernel memory usage in case they begin to consume
much more memory than before. We have CI test cases to watch this. We
once found one NIC eating up GBs of memory; that then needed to be
investigated and fixed.

With these efforts, our default crashkernel values satisfy most cases,
though surely not all cases. Only rare cases need to be handled manually
by increasing crashkernel. crashkernel=,high was added for this case: a
small chunk of low memory under 4G for DMA with crashkernel=,low, and a
big chunk of high memory above 4G with crashkernel=,high. I can't see
which needs are not met.

I am wondering how you will use this crashkernel=,cma syntax. On normal
machines and virt guests, not much memory is needed, usually 256M or a
little more is enough. On those high end systems with hundreds of
gigabytes, even terabytes of memory, I don't think the memory saved with
crashkernel=,cma makes much sense. Taking out 1G of memory above 4G as
crashkernel won't have much impact.

So in my understanding, crashkernel=,cma adds an option users can take
besides the existing crashkernel=,high. As I said earlier, at Red Hat we
may backport it to Fedora/RHEL and add one sentence to our user guide
saying "another option, crashkernel=,cma, can be used to save memory,
please feel free to try it if you like." Then that's it. I guess SUSE
will check the user's configuration, e.g. the dump level of
makedumpfile: if no user space data is needed, crashkernel=,cma is
taken, otherwise the normal crashkernel=xM will be chosen?

Thanks
Baoquan
  
Baoquan He Nov. 29, 2023, 8:10 a.m. UTC | #9
On 11/28/23 at 10:08am, Michal Hocko wrote:
> On Tue 28-11-23 10:11:31, Baoquan He wrote:
> > On 11/28/23 at 09:12am, Tao Liu wrote:
> [...]
> > Thanks for the effort to bring this up, Jiri.
> > 
> > I am wondering how you will use this crashkernel=,cma parameter. I mean
> > the scenario of crashkernel=,cma. Asking this because I don't know how
> > SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
> > driver will be filter out? If latter case, It's possibly having the
> > on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
> > reset during kdump bootup because the NIC driver is not loaded in to
> > initialize. Not sure if this is 100%, possible in theory?
> 
> NIC drivers do not allocation from movable zones (that includes CMA
> zone). In fact kernel doesn't use GFP_MOVABLE for non-user requests.
> RDMA drivers might and do transfer from user backed memory but for that
> purpose they should be pinning memory (have a look at
> __gup_longterm_locked and its callers) and that will migrate away from
> the any zone.

Adding Don to this thread.

I am not familiar with RDMA. Say we reserve a range of 1G of memory as
CMA in the 1st kernel, and RDMA or any other user space tools can use
it. When corruption happens for any reason, that 1G of CMA memory will
be reused as available MOVABLE memory by the kdump kernel. If there is
no risk at all, I mean it's 100% safe from RDMA, that would be great.

>  
> [...]
> > The crashkernel=,cma requires no userspace data dumping, from our
> > support engineers' feedback, customer never express they don't need to
> > dump user space data. Assume a server with huge databse deployed, and
> > the database often collapsed recently and database provider claimed that
> > it's not database's fault, OS need prove their innocence. What will you
> > do?
> 
> Don't use CMA backed crash memory then? This is an optional feature.
>  
> > So this looks like a nice to have to me. At least in fedora/rhel's
> > usage, we may only back port this patch, and add one sentence in our
> > user guide saying "there's a crashkernel=,cma added, can be used with
> > crashkernel= to save memory. Please feel free to try if you like".
> > Unless SUSE or other distros decides to use it as default config or
> > something like that. Please correct me if I missed anything or took
> > anything wrong.
> 
> Jiri will know better than me but for us a proper crash memory
> configuration has become a real nut. You do not want to reserve too much
> because it is effectively cutting of the usable memory and we regularly
> hit into "not enough memory" if we tried to be savvy. The more tight you
> try to configure the easier to fail that is. Even worse any in kernel
> memory consumer can increase its memory demand and get the overall
> consumption off the cliff. So this is not an easy to maintain solution.
> CMA backed crash memory can be much more generous while still usable.
> -- 
> Michal Hocko
> SUSE Labs
>
  
Michal Hocko Nov. 29, 2023, 9:25 a.m. UTC | #10
On Wed 29-11-23 15:57:59, Baoquan He wrote:
[...]
> Hmm, Redhat could go in a different way. We have been trying to:
> 1) customize initrd for kdump kernel specifically, e.g exclude unneeded
> devices's driver to save memory;
> 2) monitor device and kenrel memory usage if they begin to consume much
> more memory than before. We have CI testing cases to watch this. We ever
> found one NIC even eat up GB level memory, then this need be
> investigated and fixed.

How do you simulate all the different HW configuration setups that are
in use out there in the wild?
  
Jiri Bohac Nov. 29, 2023, 10:51 a.m. UTC | #11
Hi Baoquan,

thanks for your interest...

On Wed, Nov 29, 2023 at 03:57:59PM +0800, Baoquan He wrote:
> On 11/28/23 at 10:08am, Michal Hocko wrote:
> > On Tue 28-11-23 10:11:31, Baoquan He wrote:
> > > On 11/28/23 at 09:12am, Tao Liu wrote:
> > [...]
> > > Thanks for the effort to bring this up, Jiri.
> > > 
> > > I am wondering how you will use this crashkernel=,cma parameter. I mean
> > > the scenario of crashkernel=,cma. Asking this because I don't know how
> > > SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
> > > driver will be filter out? If latter case, It's possibly having the
> > > on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
> > > reset during kdump bootup because the NIC driver is not loaded in to
> > > initialize. Not sure if this is 100%, possible in theory?

yes, we also only add the necessary drivers to the kdump initrd (using
dracut --hostonly).

The plan was to use this feature by default only on systems where
we are reasonably sure it is safe and let the user experiment
with it when we're not sure.

I grepped a list of all calls to pin_user_pages*. Of the 55, about one
half use FOLL_LONGTERM, so these should be migrated away from the CMA
area. Among the rest there are four cases that don't use the pages to
set up DMA:
	mm/process_vm_access.c:		pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages,
	net/rds/info.c:	ret = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
	drivers/vhost/vhost.c:	r = pin_user_pages_fast(log, 1, FOLL_WRITE, &page);
	kernel/trace/trace_events_user.c:	ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,

The remaining cases are potentially problematic:
	drivers/gpu/drm/i915/gem/i915_gem_userptr.c:		ret = pin_user_pages_fast(obj->userptr.ptr + pinned * PAGE_SIZE,
	drivers/iommu/iommufd/iova_bitmap.c:	ret = pin_user_pages_fast((unsigned long)addr, npages,
	drivers/iommu/iommufd/pages.c:	rc = pin_user_pages_remote(
	drivers/media/pci/ivtv/ivtv-udma.c:	err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count,
	drivers/media/pci/ivtv/ivtv-yuv.c:		uv_pages = pin_user_pages_unlocked(uv_dma.uaddr,
	drivers/media/pci/ivtv/ivtv-yuv.c:	y_pages = pin_user_pages_unlocked(y_dma.uaddr,
	drivers/misc/genwqe/card_utils.c:	rc = pin_user_pages_fast(data & PAGE_MASK, /* page aligned addr */
	drivers/misc/xilinx_sdfec.c:	res = pin_user_pages_fast((unsigned long)src_ptr, nr_pages, 0, pages);
	drivers/platform/goldfish/goldfish_pipe.c:	ret = pin_user_pages_fast(first_page, requested_pages,
	drivers/rapidio/devices/rio_mport_cdev.c:		pinned = pin_user_pages_fast(
	drivers/sbus/char/oradax.c:	ret = pin_user_pages_fast((unsigned long)va, 1, FOLL_WRITE, p);
	drivers/scsi/st.c:	res = pin_user_pages_fast(uaddr, nr_pages, rw == READ ? FOLL_WRITE : 0,
	drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c:		actual_pages = pin_user_pages_fast((unsigned long)ubuf & PAGE_MASK, num_pages,
	drivers/tee/tee_shm.c:		rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE,
	drivers/vfio/vfio_iommu_spapr_tce.c:	if (pin_user_pages_fast(tce & PAGE_MASK, 1,
	drivers/video/fbdev/pvr2fb.c:	ret = pin_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE, pages);
	drivers/xen/gntdev.c:	ret = pin_user_pages_fast(addr, 1, batch->writeable ? FOLL_WRITE : 0, &page);
	drivers/xen/privcmd.c:		page_count = pin_user_pages_fast(
	fs/orangefs/orangefs-bufmap.c:	ret = pin_user_pages_fast((unsigned long)user_desc->ptr,
	arch/x86/kvm/svm/sev.c:	npinned = pin_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
	drivers/fpga/dfl-afu-dma-region.c:	pinned = pin_user_pages_fast(region->user_addr, npages, FOLL_WRITE,
	lib/iov_iter.c:	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);

We can easily check whether some of these drivers (some of which we
don't even ship/support) are loaded and decide that such a system is not
safe for the CMA crashkernel. Maybe looking at the list more thoroughly
will show that even some of the above calls are actually safe,
e.g. because the DMA is set up for reading only.
lib/iov_iter.c seems like it could be the real
problem since it's used by the generic block layer...

> > > The crashkernel=,cma requires no userspace data dumping, from our
> > > support engineers' feedback, customer never express they don't need to
> > > dump user space data. Assume a server with huge databse deployed, and
> > > the database often collapsed recently and database provider claimed that
> > > it's not database's fault, OS need prove their innocence. What will you
> > > do?
> > 
> > Don't use CMA backed crash memory then? This is an optional feature.

Right. Our kdump does not dump userspace by default and we would
of course make sure ,cma is not used when the user wanted to turn
on userspace dumping.

> > Jiri will know better than me but for us a proper crash memory
> > configuration has become a real nut. You do not want to reserve too much
> > because it is effectively cutting of the usable memory and we regularly
> > hit into "not enough memory" if we tried to be savvy. The more tight you
> > try to configure the easier to fail that is. Even worse any in kernel
> > memory consumer can increase its memory demand and get the overall
> > consumption off the cliff. So this is not an easy to maintain solution.
> > CMA backed crash memory can be much more generous while still usable.
> 
> Hmm, Redhat could go in a different way. We have been trying to:
> 1) customize initrd for kdump kernel specifically, e.g exclude unneeded
> devices's driver to save memory;

ditto

> 2) monitor device and kenrel memory usage if they begin to consume much
> more memory than before. We have CI testing cases to watch this. We ever
> found one NIC even eat up GB level memory, then this need be
> investigated and fixed.
> With these effort, our default crashkernel values satisfy most of cases,
> surely not call cases. Only rare cases need be handled manually,
> increasing crashkernel.

We get a lot of problems reported by partners testing kdump on
their setups prior to release. But even if we tune the reserved
size up, OOM is still the most common reason for kdump to fail
when the product starts getting used in real life. It's been
pretty frustrating for a long time.

> Wondering how you will use this crashkernel=,cma syntax. On normal
> machines and virt guests, not much meomry is needed, usually 256M or a
> little more is enough. On those high end systems with hundreds of Giga
> bytes, even Tera bytes of memory, I don't think the saved memory with
> crashkernel=,cma make much sense.

I feel the exact opposite about VMs. Reserving hundreds of MB for
the crash kernel on _every_ VM on a busy VM host wastes the most
memory. VMs are often tuned to a well defined task and can be set
up with very little memory, so the ~256 MB can be a huge part of
that. And while it's theoretically better to dump from the
hypervisor, users still often prefer kdump because the hypervisor
may not be under their control. Also, in a VM it should be much
easier to be sure the machine is safe WRT the potential DMA
corruption as it has fewer HW drivers. So I actually thought the
CMA reservation could be most useful on VMs.

Thanks,
  
Donald Dutile Nov. 29, 2023, 3:03 p.m. UTC | #12
Baoquan,
hi!

On 11/29/23 3:10 AM, Baoquan He wrote:
> On 11/28/23 at 10:08am, Michal Hocko wrote:
>> On Tue 28-11-23 10:11:31, Baoquan He wrote:
>>> On 11/28/23 at 09:12am, Tao Liu wrote:
>> [...]
>>> Thanks for the effort to bring this up, Jiri.
>>>
>>> I am wondering how you will use this crashkernel=,cma parameter. I mean
>>> the scenario of crashkernel=,cma. Asking this because I don't know how
>>> SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
>>> driver will be filter out? If latter case, It's possibly having the
>>> on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
>>> reset during kdump bootup because the NIC driver is not loaded in to
>>> initialize. Not sure if this is 100%, possible in theory?
>>
>> NIC drivers do not allocation from movable zones (that includes CMA
>> zone). In fact kernel doesn't use GFP_MOVABLE for non-user requests.
>> RDMA drivers might and do transfer from user backed memory but for that
>> purpose they should be pinning memory (have a look at
>> __gup_longterm_locked and its callers) and that will migrate away from
>> the any zone.
> 
> Add Don in this thread.
> 
> I am not familiar with RDMA. If we reserve a range of 1G meory as cma in
> 1st kernel, and RDMA or any other user space tools could use it. When
> corruption happened with any cause, that 1G cma memory will be reused as
> available MOVABLE memory of kdump kernel. If no risk at all, I mean 100%
> safe from RDMA, that would be great.
> 
My RDMA days are long behind me... more in mm space these days, so this still
interests me.
I thought, in general, userspace memory is not saved or used in kdumps, so
if RDMA is using cma space for userspace-based IO (gup), then I would expect
it can be re-used by the kexec'd kernel.
So, I'm not sure what 'safe from RDMA' means, but I would expect RDMA queues
are in-kernel data structures, not userspace structures, and they would be
more/most important to maintain/keep for kdump saving.  The actual userspace
data ... ssdd wrt any other userspace data.
dma-buf's allocated from cma, which are (typically) shared with GPUs
(& RDMA in GPU-direct configs), again, would be shared userspace, not
control/cmd/rsp queues, so I'm not seeing an issue there either.

I would poke the NVIDIA+Mellanox folks for further review in this space,
if my reply leaves you (or others) 'wanting'.

- Don
>>   
>> [...]
>>> The crashkernel=,cma requires no userspace data dumping, from our
>>> support engineers' feedback, customer never express they don't need to
>>> dump user space data. Assume a server with huge databse deployed, and
>>> the database often collapsed recently and database provider claimed that
>>> it's not database's fault, OS need prove their innocence. What will you
>>> do?
>>
>> Don't use CMA backed crash memory then? This is an optional feature.
>>   
>>> So this looks like a nice to have to me. At least in fedora/rhel's
>>> usage, we may only back port this patch, and add one sentence in our
>>> user guide saying "there's a crashkernel=,cma added, can be used with
>>> crashkernel= to save memory. Please feel free to try if you like".
>>> Unless SUSE or other distros decides to use it as default config or
>>> something like that. Please correct me if I missed anything or took
>>> anything wrong.
>>
>> Jiri will know better than me but for us a proper crash memory
>> configuration has become a real nut. You do not want to reserve too much
>> because it is effectively cutting of the usable memory and we regularly
>> hit into "not enough memory" if we tried to be savvy. The more tight you
>> try to configure the easier to fail that is. Even worse any in kernel
>> memory consumer can increase its memory demand and get the overall
>> consumption off the cliff. So this is not an easy to maintain solution.
>> CMA backed crash memory can be much more generous while still usable.
>> -- 
>> Michal Hocko
>> SUSE Labs
>>
>
  
Baoquan He Nov. 30, 2023, 2:42 a.m. UTC | #13
On 11/29/23 at 10:25am, Michal Hocko wrote:
> On Wed 29-11-23 15:57:59, Baoquan He wrote:
> [...]
> > Hmm, Redhat could go in a different way. We have been trying to:
> > 1) customize initrd for kdump kernel specifically, e.g exclude unneeded
> > devices's driver to save memory;
> > 2) monitor device and kenrel memory usage if they begin to consume much
> > more memory than before. We have CI testing cases to watch this. We ever
> > found one NIC even eat up GB level memory, then this need be
> > investigated and fixed.
> 
> How do you simulate all different HW configuration setups that are using
> out there in the wild?

We don't simulate.

We do this on a best-effort basis with the existing systems in our lab.
Meanwhile, partner companies will test and report any OOM they encounter.
  
Baoquan He Nov. 30, 2023, 3 a.m. UTC | #14
On 11/29/23 at 10:03am, Donald Dutile wrote:
> Baoquan,
> hi!
> 
> On 11/29/23 3:10 AM, Baoquan He wrote:
> > On 11/28/23 at 10:08am, Michal Hocko wrote:
> > > On Tue 28-11-23 10:11:31, Baoquan He wrote:
> > > > On 11/28/23 at 09:12am, Tao Liu wrote:
> > > [...]
> > > > Thanks for the effort to bring this up, Jiri.
> > > > 
> > > > I am wondering how you will use this crashkernel=,cma parameter. I mean
> > > > the scenario of crashkernel=,cma. Asking this because I don't know how
> > > > SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
> > > > driver will be filter out? If latter case, It's possibly having the
> > > > on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
> > > > reset during kdump bootup because the NIC driver is not loaded in to
> > > > initialize. Not sure if this is 100%, possible in theory?
> > > 
> > > NIC drivers do not allocation from movable zones (that includes CMA
> > > zone). In fact kernel doesn't use GFP_MOVABLE for non-user requests.
> > > RDMA drivers might and do transfer from user backed memory but for that
> > > purpose they should be pinning memory (have a look at
> > > __gup_longterm_locked and its callers) and that will migrate away from
> > > the any zone.
> > 
> > Add Don in this thread.
> > 
> > I am not familiar with RDMA. If we reserve a range of 1G meory as cma in
> > 1st kernel, and RDMA or any other user space tools could use it. When
> > corruption happened with any cause, that 1G cma memory will be reused as
> > available MOVABLE memory of kdump kernel. If no risk at all, I mean 100%
> > safe from RDMA, that would be great.
> > 
> My RDMA days are long behind me... more in mm space these days, so this still
> interests me.
> I thought, in general, userspace memory is not saved or used in kdumps, so
> if RDMA is using cma space for userspace-based IO (gup), then I would expect
> it can be re-used for kexec'd kernel.
> So, I'm not sure what 'safe from RDMA' means, but I would expect RDMA queues
> are in-kernel data structures, not userspace strucutures, and they would be
> more/most important to maintain/keep for kdump saving.  The actual userspace
> data ... ssdd wrt any other userspace data.
> dma-buf's allocated from cma, which are (typically) shared with GPUs
> (& RDMA in GPU-direct configs), again, would be shared userspace, not
> control/cmd/rsp queues, so I'm not seeing an issue there either.

Thanks a lot for the valuable input, Don.

Here, Jiri's patches reserve an area that is used in the 1st kernel as a
CMA area, i.e. added into the buddy allocator as MOVABLE, and that will
later be taken as available system memory by the kdump kernel. That
means the kdump kernel will zero out that specific CMA area; its content
will not be cared about or dumped at all. The kdump kernel will see it
as available system RAM, initialize it and add it into the memblock and
buddy allocators.

Now, we are worried whether there's a risk in the CMA area being retaken
into the kdump kernel as system RAM. E.g. is it possible that the 1st
kernel's ongoing RDMA or DMA will interfere with the kdump kernel's
normal memory accesses? Because the kdump kernel usually only resets and
initializes the needed devices, e.g. the dump target; the unneeded
devices are not shut down and are left alone.

We could be overthinking this, so we would like to make it clear.

> 
> I would poke the NVIDIA+Mellanox folks for further review in this space,
> if my reply leaves you (or others) 'wanting'.
> 
> - Don
> > > [...]
> > > > The crashkernel=,cma requires no userspace data dumping, from our
> > > > support engineers' feedback, customer never express they don't need to
> > > > dump user space data. Assume a server with huge databse deployed, and
> > > > the database often collapsed recently and database provider claimed that
> > > > it's not database's fault, OS need prove their innocence. What will you
> > > > do?
> > > 
> > > Don't use CMA backed crash memory then? This is an optional feature.
> > > > So this looks like a nice to have to me. At least in fedora/rhel's
> > > > usage, we may only back port this patch, and add one sentence in our
> > > > user guide saying "there's a crashkernel=,cma added, can be used with
> > > > crashkernel= to save memory. Please feel free to try if you like".
> > > > Unless SUSE or other distros decides to use it as default config or
> > > > something like that. Please correct me if I missed anything or took
> > > > anything wrong.
> > > 
> > > Jiri will know better than me but for us a proper crash memory
> > > configuration has become a real nut. You do not want to reserve too much
> > > because it is effectively cutting of the usable memory and we regularly
> > > hit into "not enough memory" if we tried to be savvy. The more tight you
> > > try to configure the easier to fail that is. Even worse any in kernel
> > > memory consumer can increase its memory demand and get the overall
> > > consumption off the cliff. So this is not an easy to maintain solution.
> > > CMA backed crash memory can be much more generous while still usable.
> > > -- 
> > > Michal Hocko
> > > SUSE Labs
> > > 
> > 
>
  
Baoquan He Nov. 30, 2023, 4:01 a.m. UTC | #15
On 11/29/23 at 11:51am, Jiri Bohac wrote:
> Hi Baoquan,
> 
> thanks for your interest...
> 
> On Wed, Nov 29, 2023 at 03:57:59PM +0800, Baoquan He wrote:
> > On 11/28/23 at 10:08am, Michal Hocko wrote:
> > > On Tue 28-11-23 10:11:31, Baoquan He wrote:
> > > > On 11/28/23 at 09:12am, Tao Liu wrote:
> > > [...]
> > > > Thanks for the effort to bring this up, Jiri.
> > > > 
> > > > I am wondering how you will use this crashkernel=,cma parameter. I mean
> > > > the scenario of crashkernel=,cma. Asking this because I don't know how
> > > > SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
> > > > driver will be filter out? If latter case, It's possibly having the
> > > > on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
> > > > reset during kdump bootup because the NIC driver is not loaded in to
> > > > initialize. Not sure if this is 100%, possible in theory?
> 
> yes, we also only add the necessary drivers to the kdump initrd (using
> dracut --hostonly).
> 
> The plan was to use this feature by default only on systems where
> we are reasonably sure it is safe and let the user experiment
> with it when we're not sure.
> 
> I grepped a list of all calls to pin_user_pages*. From the 55,
> about one half uses FOLL_LONGTERM, so these should be migrated
> away from the CMA area. In the rest there are four cases that
> don't use the pages to set up DMA:
> 	mm/process_vm_access.c:		pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages,
> 	net/rds/info.c:	ret = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
> 	drivers/vhost/vhost.c:	r = pin_user_pages_fast(log, 1, FOLL_WRITE, &page);
> 	kernel/trace/trace_events_user.c:	ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> 
> The remaining cases are potentially problematic:
> 	drivers/gpu/drm/i915/gem/i915_gem_userptr.c:		ret = pin_user_pages_fast(obj->userptr.ptr + pinned * PAGE_SIZE,
> 	drivers/iommu/iommufd/iova_bitmap.c:	ret = pin_user_pages_fast((unsigned long)addr, npages,
> 	drivers/iommu/iommufd/pages.c:	rc = pin_user_pages_remote(
> 	drivers/media/pci/ivtv/ivtv-udma.c:	err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count,
> 	drivers/media/pci/ivtv/ivtv-yuv.c:		uv_pages = pin_user_pages_unlocked(uv_dma.uaddr,
> 	drivers/media/pci/ivtv/ivtv-yuv.c:	y_pages = pin_user_pages_unlocked(y_dma.uaddr,
> 	drivers/misc/genwqe/card_utils.c:	rc = pin_user_pages_fast(data & PAGE_MASK, /* page aligned addr */
> 	drivers/misc/xilinx_sdfec.c:	res = pin_user_pages_fast((unsigned long)src_ptr, nr_pages, 0, pages);
> 	drivers/platform/goldfish/goldfish_pipe.c:	ret = pin_user_pages_fast(first_page, requested_pages,
> 	drivers/rapidio/devices/rio_mport_cdev.c:		pinned = pin_user_pages_fast(
> 	drivers/sbus/char/oradax.c:	ret = pin_user_pages_fast((unsigned long)va, 1, FOLL_WRITE, p);
> 	drivers/scsi/st.c:	res = pin_user_pages_fast(uaddr, nr_pages, rw == READ ? FOLL_WRITE : 0,
> 	drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c:		actual_pages = pin_user_pages_fast((unsigned long)ubuf & PAGE_MASK, num_pages,
> 	drivers/tee/tee_shm.c:		rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE,
> 	drivers/vfio/vfio_iommu_spapr_tce.c:	if (pin_user_pages_fast(tce & PAGE_MASK, 1,
> 	drivers/video/fbdev/pvr2fb.c:	ret = pin_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE, pages);
> 	drivers/xen/gntdev.c:	ret = pin_user_pages_fast(addr, 1, batch->writeable ? FOLL_WRITE : 0, &page);
> 	drivers/xen/privcmd.c:		page_count = pin_user_pages_fast(
> 	fs/orangefs/orangefs-bufmap.c:	ret = pin_user_pages_fast((unsigned long)user_desc->ptr,
> 	arch/x86/kvm/svm/sev.c:	npinned = pin_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
> 	drivers/fpga/dfl-afu-dma-region.c:	pinned = pin_user_pages_fast(region->user_addr, npages, FOLL_WRITE,
> 	lib/iov_iter.c:	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
> 
> We can easily check if some of these drivers (of which some we don't
> even ship/support) are loaded and decide this system is not safe
> for CMA crashkernel. Maybe looking at the list more thoroughly
> will show that even some of the above calls are acually safe,
> e.g. because the DMA is set up for reading only.
> lib/iov_iter.c seem like it could be the real
> problem since it's used by generic block layer...

Hmm, yeah. From my point of view, we need to make sure that reusing the
,cma area in the kdump kernel is safe without exception. Using it only
on systems we are 100% sure about, and letting people experiment with it
when we are not sure, does not seem safe. Most of the time users don't
even know how to judge whether the system they own is 100% safe or not.
That's too hard.

> > > > The crashkernel=,cma requires no userspace data dumping, from our
> > > > support engineers' feedback, customer never express they don't need to
> > > > dump user space data. Assume a server with huge databse deployed, and
> > > > the database often collapsed recently and database provider claimed that
> > > > it's not database's fault, OS need prove their innocence. What will you
> > > > do?
> > > 
> > > Don't use CMA backed crash memory then? This is an optional feature.
> 
> Right. Our kdump does not dump userspace by default and we would
> of course make sure ,cma is not used when the user wanted to turn
> on userspace dumping.
> 
> > > Jiri will know better than me but for us a proper crash memory
> > > configuration has become a real nut. You do not want to reserve too much
> > > because it is effectively cutting of the usable memory and we regularly
> > > hit into "not enough memory" if we tried to be savvy. The more tight you
> > > try to configure the easier to fail that is. Even worse any in kernel
> > > memory consumer can increase its memory demand and get the overall
> > > consumption off the cliff. So this is not an easy to maintain solution.
> > > CMA backed crash memory can be much more generous while still usable.
> > 
> > Hmm, Redhat could go in a different way. We have been trying to:
> > 1) customize initrd for kdump kernel specifically, e.g exclude unneeded
> > devices's driver to save memory;
> 
> ditto
> 
> > 2) monitor device and kenrel memory usage if they begin to consume much
> > more memory than before. We have CI testing cases to watch this. We ever
> > found one NIC even eat up GB level memory, then this need be
> > investigated and fixed.
> > With these effort, our default crashkernel values satisfy most of cases,
> > surely not call cases. Only rare cases need be handled manually,
> > increasing crashkernel.
> 
> We get a lot of problems reported by partners testing kdump on
> their setups prior to release. But even if we tune the reserved
> size up, OOM is still the most common reason for kdump to fail
> when the product starts getting used in real life. It's been
> pretty frustrating for a long time.

I remember SUSE engineers once said you boot the kernel, estimate the
kdump kernel's memory usage, and then set crashkernel according to that
estimation. Is OOM still triggered even when that approach is taken?
Just curious, not questioning the benefit of using ,cma to save memory.

> 
> > Wondering how you will use this crashkernel=,cma syntax. On normal
> > machines and virt guests, not much meomry is needed, usually 256M or a
> > little more is enough. On those high end systems with hundreds of Giga
> > bytes, even Tera bytes of memory, I don't think the saved memory with
> > crashkernel=,cma make much sense.
> 
> I feel the exact opposite about VMs. Reserving hundreds of MB for
> crash kernel on _every_ VM on a busy VM host wastes the most
> memory. VMs are often tuned to well defined task and can be set
> up with very little memory, so the ~256 MB can be a huge part of
> that. And while it's theoretically better to dump from the
> hypervisor, users still often prefer kdump because the hypervisor
> may not be under their control. Also, in a VM it should be much
> easier to be sure the machine is safe WRT the potential DMA
> corruption as it has less HW drivers. So I actually thought the
> CMA reservation could be most useful on VMs.

Hmm, we once discussed this upstream with David Hildenbrand, who works
in the virt team. The VM problem is much easier to solve if VMs complain
that the default crashkernel value is wasteful: the shrinking interface
is there for them. The crashkernel value can't be enlarged, but
shrinking the existing crashkernel memory works smoothly, and they can
adjust it in a script in a very simple way.

Anyway, let's discuss and figure out any risk of ,cma. If in the end all
worries and concerns prove unnecessary, then let's have a great new
feature. But we can't afford the risk that the ,cma area could be
entangled with the 1st kernel's ongoing activity. As we know, unlike a
kexec reboot, for kdump we only shut down CPUs and interrupts; most
devices stay alive, and many of them may not be reset and initialized in
the kdump kernel if the relevant driver is not added in.

Earlier, we hit several cases of in-flight DMA stomping into memory
during kexec reboot because some PCI devices didn't provide a shutdown()
method. It gave people a lot of headache to figure out and fix.
Similarly for kdump, we absolutely don't want to see that happening with
,cma; it would be a disaster for kdump, no matter how much memory it
saves, because you don't know what happened or how to debug it until you
suspect this and turn it off.
  
Michal Hocko Nov. 30, 2023, 10:16 a.m. UTC | #16
On Thu 30-11-23 11:00:48, Baoquan He wrote:
[...]
> Now, we are worried if there's risk if the CMA area is retaken into kdump
> kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA
> or DMA will interfere with kdump kernel's normal memory accessing?
> Because kdump kernel usually only reset and initialize the needed
> device, e.g dump target. Those unneeded devices will be unshutdown and
> let go. 

I do not really want to discount your concerns but I am a bit confused
about why this matters so much. First of all, if there is a buggy RDMA
driver which doesn't use the proper pinning API (which would migrate
away from the CMA), then what is the worst case? We will potentially get
the crash kernel corrupted and fail to take a proper crash dump, right?
Is this worrisome? Yes. Is it a real roadblock? I do not think so. The
problem seems theoretical to me and it is not CMA usage at fault here
IMHO. It is the said theoretical driver that needs fixing anyway.

Now, it is really fair to mention that CMA backed crash kernel memory
has some limitations:
	- the CMA reservation can only be used by userspace in the
	  primary kernel. If the size is overshot this might have a
	  negative impact on kernel allocations
	- userspace memory dumping in the crash kernel is fundamentally
	  incomplete.

Just my 2c
  
Baoquan He Nov. 30, 2023, 12:04 p.m. UTC | #17
On 11/30/23 at 11:16am, Michal Hocko wrote:
> On Thu 30-11-23 11:00:48, Baoquan He wrote:
> [...]
> > Now, we are worried if there's risk if the CMA area is retaken into kdump
> > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA
> > or DMA will interfere with kdump kernel's normal memory accessing?
> > Because kdump kernel usually only reset and initialize the needed
> > device, e.g dump target. Those unneeded devices will be unshutdown and
> > let go. 
> 
> I do not really want to discount your concerns but I am bit confused why
> this matters so much. First of all, if there is a buggy RDMA driver
> which doesn't use the proper pinning API (which would migrate away from
> the CMA) then what is the worst case? We will get crash kernel corrupted
> potentially and fail to take a proper kernel crash, right? Is this
> worrisome? Yes. Is it a real roadblock? I do not think so. The problem
> seems theoretical to me and it is not CMA usage at fault here IMHO. It
> is the said theoretical driver that needs fixing anyway.
> 
> Now, it is really fair to mention that CMA backed crash kernel memory
> has some limitations
> 	- CMA reservation can only be used by the userspace in the
> 	  primary kernel. If the size is overshot this might have
> 	  negative impact on kernel allocations
> 	- userspace memory dumping in the crash kernel is fundamentally
> 	  incomplete.

I am not sure if we are talking about the same thing. My concern is:
====================================================================
1) system corruption happens, crash dumping is prepared, CPUs and
interrupt controllers are shut down;
2) all PCI devices are kept alive;
3) the kdump kernel boots up; initialization is only done on those
devices whose drivers are included in the kdump kernel's initrd;
4) in-flight DMA engines could still be working if their kernel
modules are not loaded;

In this case, if the DMA's destination is located in the
crashkernel=,cma region, the DMA writes could continue even when the
kdump kernel has put important kernel data into that area. Is this
possible or absolutely not possible with DMA, RDMA, or any other
mechanism which could keep accessing that area?

The existing crashkernel= syntax can guarantee that the reserved
crashkernel area for the kdump kernel is safe.
=======================================================================

The 1st kernel's data in the ,cma area is ignored once crashkernel=,cma
is taken.
  
Baoquan He Nov. 30, 2023, 12:31 p.m. UTC | #18
Hi Michal,

On 11/30/23 at 08:04pm, Baoquan He wrote:
> On 11/30/23 at 11:16am, Michal Hocko wrote:
> > On Thu 30-11-23 11:00:48, Baoquan He wrote:
> > [...]
> > > Now, we are worried if there's risk if the CMA area is retaken into kdump
> > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA
> > > or DMA will interfere with kdump kernel's normal memory accessing?
> > > Because kdump kernel usually only reset and initialize the needed
> > > device, e.g dump target. Those unneeded devices will be unshutdown and
> > > let go. 


Re-reading your mail, we are saying the same thing. Please ignore the
words at the bottom of my last mail.

> > 
> > I do not really want to discount your concerns but I am bit confused why
> > this matters so much. First of all, if there is a buggy RDMA driver

It's not about a buggy DMA or RDMA driver; this is decided by the kdump
mechanism. When we do a kexec reboot, we shut down CPUs, interrupts and
all devices. When we do kdump, we only shut down CPUs and interrupts.

> > which doesn't use the proper pinning API (which would migrate away from
> > the CMA) then what is the worst case? We will get crash kernel corrupted
> > potentially and fail to take a proper kernel crash, right? Is this
> > worrisome? Yes. Is it a real roadblock? I do not think so. The problem

We may fail to capture a proper crash dump, so why isn't it a
roadblock? We have a stable way that costs a little more memory; why
would we take the risk of another way just to save memory? Usually only
high-end servers need a big crashkernel reservation, and such servers
usually have huge system RAM, so the reservation is a very small
percentage of it.

> > seems theoretical to me and it is not CMA usage at fault here IMHO. It
> > is the said theoretical driver that needs fixing anyway.

Now, what we want to make clear is whether this is a theoretical
possibility or something likely to happen. We have seen several cases
of in-flight DMA stomping on the kexec kernel's initrd in the past two
years because device drivers didn't provide a proper shutdown() method.
For kdump, once it happens, the pain is that we don't know how to debug
it. For a kexec reboot, customers allow us to log into their systems to
reproduce and track down the stomping. For kdump, the system corruption
rarely happens, so the stomping would rarely happen too.

The code change looks simple and the benefit is very attractive. I
would surely like it if people finally confirm there's no risk. As I
said, we can't afford to take the risk if it can possibly happen. But I
don't object if other people would rather take the risk; we can let it
land in the kernel.

This is my personal opinion; thanks for sharing your thoughts.

> > 
> > Now, it is really fair to mention that CMA backed crash kernel memory
> > has some limitations
> > 	- CMA reservation can only be used by the userspace in the
> > 	  primary kernel. If the size is overshot this might have
> > 	  negative impact on kernel allocations
> > 	- userspace memory dumping in the crash kernel is fundamentally
> > 	  incomplete.
> 
> I am not sure if we are talking about the same thing. My concern is:
> ====================================================================
> 1) system corrutption happened, crash dumping is prepared, cpu and
> interrupt controllers are shutdown;
> 2) all pci devices are kept alive;
> 3) kdump kernel boot up, initialization is only done on those devices
> which drivers are added into kdump kernel's initrd;
> 4) those on-flight DMA engine could be still working if their kernel
> module is not loaded;
> 
> In this case, if the DMA's destination is located in crashkernel=,cma
> region, the DMA writting could continue even when kdump kernel has put
> important kernel data into the area. Is this possible or absolutely not
> possible with DMA, RDMA, or any other stuff which could keep accessing
> that area?
> 
> The existing crashkernel= syntax can gurantee the reserved crashkernel
> area for kdump kernel is safe.
> =======================================================================
> 
> The 1st kernel's data in the ,cma area is ignored once crashkernel=,cma
> is taken.
>
  
Michal Hocko Nov. 30, 2023, 1:29 p.m. UTC | #19
On Thu 30-11-23 20:04:59, Baoquan He wrote:
> On 11/30/23 at 11:16am, Michal Hocko wrote:
> > On Thu 30-11-23 11:00:48, Baoquan He wrote:
> > [...]
> > > Now, we are worried if there's risk if the CMA area is retaken into kdump
> > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA
> > > or DMA will interfere with kdump kernel's normal memory accessing?
> > > Because kdump kernel usually only reset and initialize the needed
> > > device, e.g dump target. Those unneeded devices will be unshutdown and
> > > let go. 
> > 
> > I do not really want to discount your concerns but I am bit confused why
> > this matters so much. First of all, if there is a buggy RDMA driver
> > which doesn't use the proper pinning API (which would migrate away from
> > the CMA) then what is the worst case? We will get crash kernel corrupted
> > potentially and fail to take a proper kernel crash, right? Is this
> > worrisome? Yes. Is it a real roadblock? I do not think so. The problem
> > seems theoretical to me and it is not CMA usage at fault here IMHO. It
> > is the said theoretical driver that needs fixing anyway.
> > 
> > Now, it is really fair to mention that CMA backed crash kernel memory
> > has some limitations
> > 	- CMA reservation can only be used by the userspace in the
> > 	  primary kernel. If the size is overshot this might have
> > 	  negative impact on kernel allocations
> > 	- userspace memory dumping in the crash kernel is fundamentally
> > 	  incomplete.
> 
> I am not sure if we are talking about the same thing. My concern is:
> ====================================================================
> 1) system corrutption happened, crash dumping is prepared, cpu and
> interrupt controllers are shutdown;
> 2) all pci devices are kept alive;
> 3) kdump kernel boot up, initialization is only done on those devices
> which drivers are added into kdump kernel's initrd;
> 4) those on-flight DMA engine could be still working if their kernel
> module is not loaded;
> 
> In this case, if the DMA's destination is located in crashkernel=,cma
> region, the DMA writting could continue even when kdump kernel has put
> important kernel data into the area. Is this possible or absolutely not
> possible with DMA, RDMA, or any other stuff which could keep accessing
> that area?

I do understand your concern. But as already stated, if anybody uses
movable memory (CMA included) as a target of {R}DMA then that memory
should be properly pinned. That means the memory will be migrated
somewhere outside of movable (CMA) memory before the transfer is
configured. So, modulo bugs, this shouldn't really happen. Are there
{R}DMA drivers that do not pin memory correctly? Possibly. Is that a
roadblock that prevents using CMA to back crash kernel memory? I do
not think so. Those drivers should be fixed instead.
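
To make that concrete, here is a minimal sketch (illustrative only, not
taken from this series; error handling trimmed) of the pinning pattern
I mean. FOLL_LONGTERM is what makes GUP migrate CMA/ZONE_MOVABLE pages
away before the pin succeeds; a driver that sets up long-lived DMA
without it is exactly the buggy case discussed here:

#include <linux/mm.h>

/*
 * Pin user memory for a long-lived DMA target the "proper" way.  With
 * FOLL_LONGTERM the GUP code migrates any CMA / ZONE_MOVABLE pages
 * elsewhere before pinning, so a crashkernel CMA range can never stay
 * a long-term DMA destination.
 */
static int example_longterm_pin(unsigned long uaddr, int npages,
				struct page **pages)
{
	return pin_user_pages_fast(uaddr, npages,
				   FOLL_WRITE | FOLL_LONGTERM, pages);
}

static void example_longterm_unpin(struct page **pages, int npages)
{
	unpin_user_pages(pages, npages);
}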

> The existing crashkernel= syntax can gurantee the reserved crashkernel
> area for kdump kernel is safe.

I do not think this is true. If a DMA is misconfigured it can still
target crash kernel memory even if that memory is not mapped, AFAICS.
But those are theoretical concerns. Or am I missing something?
  
Pingfan Liu Nov. 30, 2023, 1:33 p.m. UTC | #20
On Thu, Nov 30, 2023 at 9:29 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 30-11-23 20:04:59, Baoquan He wrote:
> > On 11/30/23 at 11:16am, Michal Hocko wrote:
> > > On Thu 30-11-23 11:00:48, Baoquan He wrote:
> > > [...]
> > > > Now, we are worried if there's risk if the CMA area is retaken into kdump
> > > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA
> > > > or DMA will interfere with kdump kernel's normal memory accessing?
> > > > Because kdump kernel usually only reset and initialize the needed
> > > > device, e.g dump target. Those unneeded devices will be unshutdown and
> > > > let go.
> > >
> > > I do not really want to discount your concerns but I am bit confused why
> > > this matters so much. First of all, if there is a buggy RDMA driver
> > > which doesn't use the proper pinning API (which would migrate away from
> > > the CMA) then what is the worst case? We will get crash kernel corrupted
> > > potentially and fail to take a proper kernel crash, right? Is this
> > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem
> > > seems theoretical to me and it is not CMA usage at fault here IMHO. It
> > > is the said theoretical driver that needs fixing anyway.
> > >
> > > Now, it is really fair to mention that CMA backed crash kernel memory
> > > has some limitations
> > >     - CMA reservation can only be used by the userspace in the
> > >       primary kernel. If the size is overshot this might have
> > >       negative impact on kernel allocations
> > >     - userspace memory dumping in the crash kernel is fundamentally
> > >       incomplete.
> >
> > I am not sure if we are talking about the same thing. My concern is:
> > ====================================================================
> > 1) system corrutption happened, crash dumping is prepared, cpu and
> > interrupt controllers are shutdown;
> > 2) all pci devices are kept alive;
> > 3) kdump kernel boot up, initialization is only done on those devices
> > which drivers are added into kdump kernel's initrd;
> > 4) those on-flight DMA engine could be still working if their kernel
> > module is not loaded;
> >
> > In this case, if the DMA's destination is located in crashkernel=,cma
> > region, the DMA writting could continue even when kdump kernel has put
> > important kernel data into the area. Is this possible or absolutely not
> > possible with DMA, RDMA, or any other stuff which could keep accessing
> > that area?
>
> I do nuderstand your concern. But as already stated if anybody uses
> movable memory (CMA including) as a target of {R}DMA then that memory
> should be properly pinned. That would mean that the memory will be
> migrated to somewhere outside of movable (CMA) memory before the
> transfer is configured. So modulo bugs this shouldn't really happen.
> Are there {R}DMA drivers that do not pin memory correctly? Possibly. Is
> that a road bloack to not using CMA to back crash kernel memory, I do
> not think so. Those drivers should be fixed instead.
>
I think that is our concern. Is there any method to guarantee that
this will not happen, rather than it merely "should not" happen?
Any static analysis at compile time or dynamic checking method?

If this can be resolved, I think this approach is promising.

Thanks,

Pingfan

> > The existing crashkernel= syntax can gurantee the reserved crashkernel
> > area for kdump kernel is safe.
>
> I do not think this is true. If a DMA is misconfigured it can still
> target crash kernel memory even if it is not mapped AFAICS. But those
> are theoreticals. Or am I missing something?
> --
> Michal Hocko
> SUSE Labs
>
  
Michal Hocko Nov. 30, 2023, 1:41 p.m. UTC | #21
On Thu 30-11-23 20:31:44, Baoquan He wrote:
[...]
> > > which doesn't use the proper pinning API (which would migrate away from
> > > the CMA) then what is the worst case? We will get crash kernel corrupted
> > > potentially and fail to take a proper kernel crash, right? Is this
> > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem
> 
> We may fail to take a proper kernel crash, why isn't it a roadblock?

It would be if the threat was practical. So far I only see very
theoretical what-if concerns. And I do not mean to downplay those at
all. As already explained proper CMA users shouldn't ever leak out any
writes across kernel reboot.

> We
> have stable way with a little more memory, why would we take risk to
> take another way, just for saving memory? Usually only high end server
> needs the big memory for crashkernel and the big end server usually have
> huge system ram. The big memory will be a very small percentage relative
> to huge system RAM.

Jiri will likely talk about that in more detail, but our experience is
that proper crashkernel memory scaling has turned out to be a real
maintainability problem, because existing setups tend to break with
major kernel version upgrades or non-trivial changes.
 
> > > seems theoretical to me and it is not CMA usage at fault here IMHO. It
> > > is the said theoretical driver that needs fixing anyway.
> 
> Now, what we want to make clear is if it's a theoretical possibility, or
> very likely happen. We have met several on-flight DMA stomping into
> kexec kernel's initrd in the past two years because device driver didn't
> provide shutdown() methor properly. For kdump, once it happen, the pain
> is we don't know how to debug. For kexec reboot, customer allows to
> login their system to reproduce and figure out the stomping. For kdump,
> the system corruption rarely happend, and the stomping could rarely
> happen too.

yes, this is understood.
 
> The code change looks simple and the benefit is very attractive. I
> surely like it if finally people confirm there's no risk. As I said, we
> can't afford to take the risk if it possibly happen. But I don't object
> if other people would rather take risk, we can let it land in kernel.

I think it is fair to be cautious and I wouldn't impose the new method
as a default. Only time can tell how safe this really is. It is hard to
protect against theoretical issues, though. Bugs should be fixed.
I believe this option would make configuring kdump much easier and less
fragile.
 
> My personal opinion, thanks for sharing your thought.

Thanks for sharing.
  
Michal Hocko Nov. 30, 2023, 1:43 p.m. UTC | #22
On Thu 30-11-23 21:33:04, Pingfan Liu wrote:
> On Thu, Nov 30, 2023 at 9:29 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 30-11-23 20:04:59, Baoquan He wrote:
> > > On 11/30/23 at 11:16am, Michal Hocko wrote:
> > > > On Thu 30-11-23 11:00:48, Baoquan He wrote:
> > > > [...]
> > > > > Now, we are worried if there's risk if the CMA area is retaken into kdump
> > > > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA
> > > > > or DMA will interfere with kdump kernel's normal memory accessing?
> > > > > Because kdump kernel usually only reset and initialize the needed
> > > > > device, e.g dump target. Those unneeded devices will be unshutdown and
> > > > > let go.
> > > >
> > > > I do not really want to discount your concerns but I am bit confused why
> > > > this matters so much. First of all, if there is a buggy RDMA driver
> > > > which doesn't use the proper pinning API (which would migrate away from
> > > > the CMA) then what is the worst case? We will get crash kernel corrupted
> > > > potentially and fail to take a proper kernel crash, right? Is this
> > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem
> > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It
> > > > is the said theoretical driver that needs fixing anyway.
> > > >
> > > > Now, it is really fair to mention that CMA backed crash kernel memory
> > > > has some limitations
> > > >     - CMA reservation can only be used by the userspace in the
> > > >       primary kernel. If the size is overshot this might have
> > > >       negative impact on kernel allocations
> > > >     - userspace memory dumping in the crash kernel is fundamentally
> > > >       incomplete.
> > >
> > > I am not sure if we are talking about the same thing. My concern is:
> > > ====================================================================
> > > 1) system corrutption happened, crash dumping is prepared, cpu and
> > > interrupt controllers are shutdown;
> > > 2) all pci devices are kept alive;
> > > 3) kdump kernel boot up, initialization is only done on those devices
> > > which drivers are added into kdump kernel's initrd;
> > > 4) those on-flight DMA engine could be still working if their kernel
> > > module is not loaded;
> > >
> > > In this case, if the DMA's destination is located in crashkernel=,cma
> > > region, the DMA writting could continue even when kdump kernel has put
> > > important kernel data into the area. Is this possible or absolutely not
> > > possible with DMA, RDMA, or any other stuff which could keep accessing
> > > that area?
> >
> > I do nuderstand your concern. But as already stated if anybody uses
> > movable memory (CMA including) as a target of {R}DMA then that memory
> > should be properly pinned. That would mean that the memory will be
> > migrated to somewhere outside of movable (CMA) memory before the
> > transfer is configured. So modulo bugs this shouldn't really happen.
> > Are there {R}DMA drivers that do not pin memory correctly? Possibly. Is
> > that a road bloack to not using CMA to back crash kernel memory, I do
> > not think so. Those drivers should be fixed instead.
> >
> I think that is our concern. Is there any method to guarantee that
> will not happen instead of 'should be' ?
> Any static analysis during compiling time or dynamic checking method?

I am not aware of any method to detect that a driver is going to
configure RDMA.
 
> If this can be resolved, I think this method is promising.

Are you indicating this is a mandatory prerequisite?
  
Pingfan Liu Dec. 1, 2023, 12:54 a.m. UTC | #23
On Thu, Nov 30, 2023 at 9:43 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 30-11-23 21:33:04, Pingfan Liu wrote:
> > On Thu, Nov 30, 2023 at 9:29 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 30-11-23 20:04:59, Baoquan He wrote:
> > > > On 11/30/23 at 11:16am, Michal Hocko wrote:
> > > > > On Thu 30-11-23 11:00:48, Baoquan He wrote:
> > > > > [...]
> > > > > > Now, we are worried if there's risk if the CMA area is retaken into kdump
> > > > > > kernel as system RAM. E.g is it possible that 1st kernel's ongoing RDMA
> > > > > > or DMA will interfere with kdump kernel's normal memory accessing?
> > > > > > Because kdump kernel usually only reset and initialize the needed
> > > > > > device, e.g dump target. Those unneeded devices will be unshutdown and
> > > > > > let go.
> > > > >
> > > > > I do not really want to discount your concerns but I am bit confused why
> > > > > this matters so much. First of all, if there is a buggy RDMA driver
> > > > > which doesn't use the proper pinning API (which would migrate away from
> > > > > the CMA) then what is the worst case? We will get crash kernel corrupted
> > > > > potentially and fail to take a proper kernel crash, right? Is this
> > > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem
> > > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It
> > > > > is the said theoretical driver that needs fixing anyway.
> > > > >
> > > > > Now, it is really fair to mention that CMA backed crash kernel memory
> > > > > has some limitations
> > > > >     - CMA reservation can only be used by the userspace in the
> > > > >       primary kernel. If the size is overshot this might have
> > > > >       negative impact on kernel allocations
> > > > >     - userspace memory dumping in the crash kernel is fundamentally
> > > > >       incomplete.
> > > >
> > > > I am not sure if we are talking about the same thing. My concern is:
> > > > ====================================================================
> > > > 1) system corrutption happened, crash dumping is prepared, cpu and
> > > > interrupt controllers are shutdown;
> > > > 2) all pci devices are kept alive;
> > > > 3) kdump kernel boot up, initialization is only done on those devices
> > > > which drivers are added into kdump kernel's initrd;
> > > > 4) those on-flight DMA engine could be still working if their kernel
> > > > module is not loaded;
> > > >
> > > > In this case, if the DMA's destination is located in crashkernel=,cma
> > > > region, the DMA writting could continue even when kdump kernel has put
> > > > important kernel data into the area. Is this possible or absolutely not
> > > > possible with DMA, RDMA, or any other stuff which could keep accessing
> > > > that area?
> > >
> > > I do nuderstand your concern. But as already stated if anybody uses
> > > movable memory (CMA including) as a target of {R}DMA then that memory
> > > should be properly pinned. That would mean that the memory will be
> > > migrated to somewhere outside of movable (CMA) memory before the
> > > transfer is configured. So modulo bugs this shouldn't really happen.
> > > Are there {R}DMA drivers that do not pin memory correctly? Possibly. Is
> > > that a road bloack to not using CMA to back crash kernel memory, I do
> > > not think so. Those drivers should be fixed instead.
> > >
> > I think that is our concern. Is there any method to guarantee that
                           ^^^ Sorry, to clarify, I am only speaking for myself.

> > will not happen instead of 'should be' ?
> > Any static analysis during compiling time or dynamic checking method?
>
> I am not aware of any method to detect a driver is going to configure a
> RDMA.
>

If there is a pattern, scripts/coccinelle may give some help. But I am
not sure about that.

> > If this can be resolved, I think this method is promising.
>
> Are you indicating this is a mandatory prerequisite?

IMHO, that should be mandatory. Otherwise, for any unexpected kdump
kernel collapse, this feature cannot shake off suspicion.

Thanks,

Pingfan
  
Michal Hocko Dec. 1, 2023, 10:37 a.m. UTC | #24
On Fri 01-12-23 08:54:20, Pingfan Liu wrote:
[...]
> > I am not aware of any method to detect a driver is going to configure a
> > RDMA.
> >
> 
> If there is a pattern, scripts/coccinelle may give some help. But I am
> not sure about that.

I am not aware of any pattern.

> > > If this can be resolved, I think this method is promising.
> >
> > Are you indicating this is a mandatory prerequisite?
> 
> IMHO, that should be mandatory. Otherwise for any unexpected kdump
> kernel collapses,  it can not shake off its suspicion.

I appreciate your carefulness! But I do not really see how such
detection would work and be maintained over time. What exactly is the
scope of such tooling? Should it be limited to RDMA drivers? Should we
protect against stray writes in general?

Also, to make it clear: are you going to NAK the proposed solution if
no such tooling is available?
  
Philipp Rudo Dec. 1, 2023, 11:33 a.m. UTC | #25
Hi Michal,

On Thu, 30 Nov 2023 14:41:12 +0100
Michal Hocko <mhocko@suse.com> wrote:

> On Thu 30-11-23 20:31:44, Baoquan He wrote:
> [...]
> > > > which doesn't use the proper pinning API (which would migrate away from
> > > > the CMA) then what is the worst case? We will get crash kernel corrupted
> > > > potentially and fail to take a proper kernel crash, right? Is this
> > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem  
> > 
> > We may fail to take a proper kernel crash, why isn't it a roadblock?  
> 
> It would be if the threat was practical. So far I only see very
> theoretical what-if concerns. And I do not mean to downplay those at
> all. As already explained proper CMA users shouldn't ever leak out any
> writes across kernel reboot.

You are right, "proper" CMA users don't do that. But "proper" drivers
also provide a working shutdown() method. Experience shows that there
are enough shitty drivers out there without working shutdown(). So I
think it is naive to assume you are only dealing with "proper" CMA
users.

For me the question is, what is less painful? Hunting down shitty
(potentially out of tree) drivers that cause a memory corruption? Or ...

> > We
> > have stable way with a little more memory, why would we take risk to
> > take another way, just for saving memory? Usually only high end server
> > needs the big memory for crashkernel and the big end server usually have
> > huge system ram. The big memory will be a very small percentage relative
> > to huge system RAM.  
> 
> Jiri will likely talk more specific about that but our experience tells
> that proper crashkernel memory scaling has turned out a real
> maintainability problem because existing setups tend to break with major
> kernel version upgrades or non trivial changes.

... frequently testing whether the crashkernel memory is still
appropriate? The big advantage of the latter, as I see it, is that an
OOM situation is very easy to detect and debug. A memory corruption
isn't, especially when it was triggered by another kernel.

And yes, those are all what-if concerns, but unfortunately that is all
we have right now. The only alternative would be to run extended tests
in the field. Which means this user-facing change needs to be included.
Which also means that we are stuck with it, as once a user-facing
change is in, it's extremely hard to get rid of it again...

Thanks
Philipp

> > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It
> > > > is the said theoretical driver that needs fixing anyway.  
> > 
> > Now, what we want to make clear is if it's a theoretical possibility, or
> > very likely happen. We have met several on-flight DMA stomping into
> > kexec kernel's initrd in the past two years because device driver didn't
> > provide shutdown() methor properly. For kdump, once it happen, the pain
> > is we don't know how to debug. For kexec reboot, customer allows to
> > login their system to reproduce and figure out the stomping. For kdump,
> > the system corruption rarely happend, and the stomping could rarely
> > happen too.  
> 
> yes, this is understood.
>  
> > The code change looks simple and the benefit is very attractive. I
> > surely like it if finally people confirm there's no risk. As I said, we
> > can't afford to take the risk if it possibly happen. But I don't object
> > if other people would rather take risk, we can let it land in kernel.  
> 
> I think it is fair to be cautious and I wouldn't impose the new method
> as a default. Only time can tell how safe this really is. It is hard to
> protect agains theoretical issues though. Bugs should be fixed.
> I believe this option would allow to configure kdump much easier and
> less fragile.
>  
> > My personal opinion, thanks for sharing your thought.  
> 
> Thanks for sharing.
>
  
Philipp Rudo Dec. 1, 2023, 11:34 a.m. UTC | #26
Hi Jiri,

I'd really love to see something like this work, although I also share
the concerns about shitty device drivers corrupting the CMA. Please see
my other mail for that.

Anyway, one more comment below.

On Fri, 24 Nov 2023 20:54:36 +0100
Jiri Bohac <jbohac@suse.cz> wrote:

[...]
 
> Now, specifying
> 	crashkernel=100M craskhernel=1G,cma
> on the command line will make a standard crashkernel reservation
> of 100M, where kexec will load the kernel and initrd.
> 
> An additional 1G will be reserved from CMA, still usable by the
> production system. The crash kernel will have 1.1G memory
> available. The 100M can be reliably predicted based on the size
> of the kernel and initrd.

I doubt that the fixed part can be predicted "reliably". For sure it
will be more reliable than today, but IMHO we will still be stuck with
some guessing. Otherwise it would mean that you already know during
boot which initrd the user space will be loading later on, which IMHO
is impossible as the initrd can always be rebuilt with a larger size.
Furthermore, I'd be careful when you are dealing with compressed kernel
images, as I'm not sure how the different decompressor phases would
handle scenarios where the (fixed) crashkernel memory is large enough
to hold the compressed kernel (+initrd) but not the decompressed one.

One more thing, I'm not sure I like that you need to reserve two
separate memory regions. Personally I would prefer it if you could
reserve one large region for the crashkernel but allow parts of it to
be reused via CMA. Otherwise I'm afraid there will be people who only
have one ,cma entry on the command line and cannot figure out why they
cannot load the crash kernel.

Thanks
Philipp
 
> When no crashkernel=size,cma is specified, everything works as
> before.
>
  
Michal Hocko Dec. 1, 2023, 11:55 a.m. UTC | #27
On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
[...]
> And yes, those are all what-if concerns but unfortunately that is all
> we have right now.

Should theoretical concerns without actual evidence (e.g. multiple
drivers known to be broken) become a roadblock for this otherwise
useful feature?

> Only alternative would be to run extended tests in
> the field. Which means this user facing change needs to be included.
> Which also means that we are stuck with it as once a user facing change
> is in it's extremely hard to get rid of it again...

I am not really sure I follow you here. Are you suggesting once
crashkernel=cma is added it would become a user api and therefore
impossible to get rid of?
  
Jiri Bohac Dec. 1, 2023, 12:35 p.m. UTC | #28
On Thu, Nov 30, 2023 at 12:01:36PM +0800, Baoquan He wrote:
> On 11/29/23 at 11:51am, Jiri Bohac wrote:
> > We get a lot of problems reported by partners testing kdump on
> > their setups prior to release. But even if we tune the reserved
> > size up, OOM is still the most common reason for kdump to fail
> > when the product starts getting used in real life. It's been
> > pretty frustrating for a long time.
> 
> I remember SUSE engineers ever told you will boot kernel and do an
> estimation of kdump kernel usage, then set the crashkernel according to
> the estimation. OOM will be triggered even that way is taken? Just
> curious, not questioning the benefit of using ,cma to save memory.

Yes, we do that during the kdump package build. We use this to
find some baseline for memory requirements of the kdump kernel
and tools on that specific product. Using these numbers we
estimate the requirements on the system where kdump is
configured by adding extra memory for the size of RAM, number of
SCSI devices, etc. But apparently we get this wrong in too many cases,
because the actual hardware differs too much from the virtual
environment which we used to get the baseline numbers. We've been
adding silly constants to the calculations and we still get OOMs on
one hand and people hesitant to sacrifice the calculated amount
of memory on the other. 

The result is that kdump basically cannot be trusted unless the
user verifies that the sacrificed memory is still enough after
every major upgrade.

This is the main motivation behind the CMA idea: to safely give
kdump enough memory, including a safe margin, without sacrificing
too much memory.

> > I feel the exact opposite about VMs. Reserving hundreds of MB for
> > crash kernel on _every_ VM on a busy VM host wastes the most
> > memory. VMs are often tuned to well defined task and can be set
> > up with very little memory, so the ~256 MB can be a huge part of
> > that. And while it's theoretically better to dump from the
> > hypervisor, users still often prefer kdump because the hypervisor
> > may not be under their control. Also, in a VM it should be much
> > easier to be sure the machine is safe WRT the potential DMA
> > corruption as it has less HW drivers. So I actually thought the
> > CMA reservation could be most useful on VMs.
> 
> Hmm, we ever discussed this in upstream with David Hildend who works in
> virt team. VMs problem is much easier to solve if they complain the
> default crashkernel value is wasteful. The shrinking interface is for
> them. The crashkernel value can't be enlarged, but shrinking existing
> crashkernel memory is functioning smoothly well. They can adjust that in
> script in a very simple way.

The shrinking does not solve this problem at all. It solves a
different problem: the virtual hardware configuration can easily
vary between boots and so will the crashkernel size requirements.
And since crashkernel needs to be passed on the commandline, once
the system is booted it's impossible to change it without a
reboot. Here the shrinking mechanism comes in handy
- we reserve enough for all configurations on the command line and
during boot the requirements for the currently booted
configuration can be determined and the reservation shrunk to
the determined value.  But determining this value is the same
unsolved problem as above and CMA could help in exactly the same
way.

> Anyway, let's discuss and figure out any risk of ,cma. If finally all
> worries and concerns are proved unnecessary, then let's have a new great
> feature. But we can't afford the risk if the ,cma area could be entangled
> with 1st kernel's on-going action. As we know, not like kexec reboot, we
> only shutdown CPUs, interrupt, most of devices are alive. And many of
> them could be not reset and initialized in kdump kernel if the relevant
> driver is not added in.

Well since my patchset makes the use of ,cma completely optional
and has _absolutely_ _no_ _effect_ on users that don't opt to use
it, I think you're not taking any risk at all. We will never know
how much DMA is a problem in practice unless we give users or
distros a way to try and come up with good ways of determining if
it's safe on whichever specific system based on the hardware,
drivers, etc.

I've successfully tested the patches on a few systems, physical
and virtual. Of course this is not proof that the DMA problem
does not exist but shows that it may be a solution that mostly
works. If nothing else, for systems where sacrificing ~400 MB of
memory is something that prevents the user from having any dump
at all, having a dump that mostly works with a sacrifice of ~100
MB may be useful.

Thanks,
  
Philipp Rudo Dec. 1, 2023, 3:51 p.m. UTC | #29
On Fri, 1 Dec 2023 12:55:52 +0100
Michal Hocko <mhocko@suse.com> wrote:

> On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
> [...]
> > And yes, those are all what-if concerns but unfortunately that is all
> > we have right now.  
> 
> Should theoretical concerns without an actual evidence (e.g. multiple
> drivers known to be broken) become a roadblock for this otherwise useful
> feature? 

Those concerns aren't just theoretical. They come from experience with
a related feature that regularly suffers exactly the same problem, a
problem which wouldn't exist if everybody simply worked "properly".

And yes, even purely theoretical concerns can become a roadblock for a
feature when their cost exceeds the benefit of the feature. The thing
is that bugs will be reported against kexec. So _we_ need to figure out
which of the shitty drivers caused the problem. That puts an additional
burden on _us_. What we are trying to evaluate at the moment is whether
the benefit outweighs the extra burden, given the information we have
at the moment.

> > Only alternative would be to run extended tests in
> > the field. Which means this user facing change needs to be included.
> > Which also means that we are stuck with it as once a user facing change
> > is in it's extremely hard to get rid of it again...  
> 
> I am not really sure I follow you here. Are you suggesting once
> crashkernel=cma is added it would become a user api and therefore
> impossible to get rid of?

Yes, sort of. I wouldn't rank a command line parameter as user API, so
we can still get rid of it. But there will be long discussions I'd like
to avoid if possible.

Thanks
Philipp
  
Michal Hocko Dec. 1, 2023, 4:59 p.m. UTC | #30
On Fri 01-12-23 16:51:13, Philipp Rudo wrote:
> On Fri, 1 Dec 2023 12:55:52 +0100
> Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
> > [...]
> > > And yes, those are all what-if concerns but unfortunately that is all
> > > we have right now.  
> > 
> > Should theoretical concerns without an actual evidence (e.g. multiple
> > drivers known to be broken) become a roadblock for this otherwise useful
> > feature? 
> 
> Those concerns aren't just theoretical. They are experiences we have
> from a related feature that suffers exactly the same problem regularly
> which wouldn't exist if everybody would simply work "properly".

What is the related feature?
 
> And yes, even purely theoretical concerns can become a roadblock for a
> feature when the cost of those theoretical concerns exceed the benefit
> of the feature. The thing is that bugs will be reported against kexec.
> So _we_ need to figure out which of the shitty drivers caused the
> problem. That puts additional burden on _us_. What we are trying to
> evaluate at the moment is if the benefit outweighs the extra burden
> with the information we have at the moment.

I do understand your concerns! But I am pretty sure you realize that
it is really hard to argue about theoreticals. Let me restate what I
consider facts. Hopefully we can agree on these points:
	- the CMA region can be used by user space memory which is a
	  great advantage because the memory is not wasted and our
	  experience has shown that users do care about this a lot. We
	  _know_ that pressure on making those reservations smaller
	  results in a less reliable crashdump and more resources spent
	  on tuning and testing (especially after major upgrades).  A
	  larger reservation which is not completely wasted for the
	  normal runtime is addressing that concern.
	- There is no other known mechanism to achieve the reusability
	  of the crash kernel memory to stop the wastage without much
	  more intrusive code/api impact (e.g. a separate zone or
	  dedicated interface to prevent any hazardous usage like RDMA).
	- implementation wise the patch has a very small footprint. It
	  is using an existing infrastructure (CMA) and it adds a
	  minimal hooking into crashkernel configuration.
	- The only identified risk so far is RDMA acting on this memory
	  without using proper pinning interface. If it helps to have a
	  statement from RDMA maintainers/developers then we can pull
	  them in for a further discussion of course.
	- The feature requires an explicit opt-in so this doesn't bring
	  any new risk to existing crash kernel users until they decide
	  to use it. AFAIU there is no way to tell that the crash kernel
	  memory used to be CMA based in the primary kernel. If you
	  believe that having that information available for
	  debugability would help then I believe this shouldn't be hard
	  to add.  I think it would even make sense to mark this feature
	  experimental to make it clear to users that this needs some
	  time before it can be marked production ready.

I hope I haven't really missed anything important. The final
cost/benefit judgment is of course up to you, the maintainers, but I
would like to remind you that we are dealing with a _real_ problem that
many production systems are struggling with and that we don't really
have any other solution available.
  
Philipp Rudo Dec. 6, 2023, 11:08 a.m. UTC | #31
On Fri, 1 Dec 2023 17:59:02 +0100
Michal Hocko <mhocko@suse.com> wrote:

> On Fri 01-12-23 16:51:13, Philipp Rudo wrote:
> > On Fri, 1 Dec 2023 12:55:52 +0100
> > Michal Hocko <mhocko@suse.com> wrote:
> >   
> > > On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
> > > [...]  
> > > > And yes, those are all what-if concerns but unfortunately that is all
> > > > we have right now.    
> > > 
> > > Should theoretical concerns without an actual evidence (e.g. multiple
> > > drivers known to be broken) become a roadblock for this otherwise useful
> > > feature?   
> > 
> > Those concerns aren't just theoretical. They are experiences we have
> > from a related feature that suffers exactly the same problem regularly
> > which wouldn't exist if everybody would simply work "properly".  
> 
> What is the related feature?

kexec

> > And yes, even purely theoretical concerns can become a roadblock for a
> > feature when the cost of those theoretical concerns exceed the benefit
> > of the feature. The thing is that bugs will be reported against kexec.
> > So _we_ need to figure out which of the shitty drivers caused the
> > problem. That puts additional burden on _us_. What we are trying to
> > evaluate at the moment is if the benefit outweighs the extra burden
> > with the information we have at the moment.  
> 
> I do understand your concerns! But I am pretty sure you do realize that
> it is really hard to argue theoreticals.  Let me restate what I consider
> facts. Hopefully we can agree on these points
> 	- the CMA region can be used by user space memory which is a
> 	  great advantage because the memory is not wasted and our
> 	  experience has shown that users do care about this a lot. We
> 	  _know_ that pressure on making those reservations smaller
> 	  results in a less reliable crashdump and more resources spent
> 	  on tuning and testing (especially after major upgrades).  A
> 	  larger reservation which is not completely wasted for the
> 	  normal runtime is addressing that concern.
> 	- There is no other known mechanism to achieve the reusability
> 	  of the crash kernel memory to stop the wastage without much
> 	  more intrusive code/api impact (e.g. a separate zone or
> 	  dedicated interface to prevent any hazardous usage like RDMA).
> 	- implementation wise the patch has a very small footprint. It
> 	  is using an existing infrastructure (CMA) and it adds a
> 	  minimal hooking into crashkernel configuration.
> 	- The only identified risk so far is RDMA acting on this memory
> 	  without using proper pinning interface. If it helps to have a
> 	  statement from RDMA maintainers/developers then we can pull
> 	  them in for a further discussion of course.
> 	- The feature requires an explicit opt-in so this doesn't bring
> 	  any new risk to existing crash kernel users until they decide
> 	  to use it. AFAIU there is no way to tell that the crash kernel
> 	  memory used to be CMA based in the primary kernel. If you
> 	  believe that having that information available for
> 	  debugability would help then I believe this shouldn't be hard
> 	  to add.  I think it would even make sense to mark this feature
> 	  experimental to make it clear to users that this needs some
> 	  time before it can be marked production ready.
> 
> I hope I haven't really missed anything important. The final

If I understand Documentation/core-api/pin_user_pages.rst correctly,
you missed case 1, Direct IO. In that case "short term" DMA is allowed
on pages pinned without FOLL_LONGTERM. That means there is a way to
corrupt the CMA, and with it the crash kernel, after the production
kernel has panicked.
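
Just to illustrate what I mean, a trivial userspace sketch (device path
and buffer size are made up) ends up in exactly that case: the kernel
pins the destination pages with FOLL_PIN but without FOLL_LONGTERM, so
they are not migrated out of CMA before the device is programmed to DMA
into them:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd;

	/* O_DIRECT needs an aligned buffer; 4096 matches a typical block size */
	if (posix_memalign(&buf, 4096, 1 << 20))
		return 1;
	memset(buf, 0, 1 << 20);	/* fault the pages in; they may sit in CMA */

	fd = open("/dev/sdb", O_RDONLY | O_DIRECT);	/* made-up device */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * The kernel short-term pins buf's pages and programs the device
	 * to DMA into them.  If the machine panics between submission and
	 * completion, the transfer can still land in what is by then the
	 * crash kernel's (CMA-backed) memory.
	 */
	if (read(fd, buf, 1 << 20) < 0)
		perror("read");

	close(fd);
	free(buf);
	return 0;
}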

With that I don't see a chance this series can be included unless
someone can explain to me that the documentation is wrong or that I
have understood it wrong.

Having that said
NAcked-by: Philipp Rudo <prudo@redhat.com>

> cost/benefit judgment is up to you, maintainers, of course but I would
> like to remind that we are dealing with a _real_ problem that many
> production systems are struggling with and that we don't really have any
> other solution available.
  
David Hildenbrand Dec. 6, 2023, 11:23 a.m. UTC | #32
On 06.12.23 12:08, Philipp Rudo wrote:
> On Fri, 1 Dec 2023 17:59:02 +0100
> Michal Hocko <mhocko@suse.com> wrote:
> 
>> On Fri 01-12-23 16:51:13, Philipp Rudo wrote:
>>> On Fri, 1 Dec 2023 12:55:52 +0100
>>> Michal Hocko <mhocko@suse.com> wrote:
>>>    
>>>> On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
>>>> [...]
>>>>> And yes, those are all what-if concerns but unfortunately that is all
>>>>> we have right now.
>>>>
>>>> Should theoretical concerns without an actual evidence (e.g. multiple
>>>> drivers known to be broken) become a roadblock for this otherwise useful
>>>> feature?
>>>
>>> Those concerns aren't just theoretical. They are experiences we have
>>> from a related feature that suffers exactly the same problem regularly
>>> which wouldn't exist if everybody would simply work "properly".
>>
>> What is the related feature?
> 
> kexec
> 
>>> And yes, even purely theoretical concerns can become a roadblock for a
>>> feature when the cost of those theoretical concerns exceed the benefit
>>> of the feature. The thing is that bugs will be reported against kexec.
>>> So _we_ need to figure out which of the shitty drivers caused the
>>> problem. That puts additional burden on _us_. What we are trying to
>>> evaluate at the moment is if the benefit outweighs the extra burden
>>> with the information we have at the moment.
>>
>> I do understand your concerns! But I am pretty sure you do realize that
>> it is really hard to argue theoreticals.  Let me restate what I consider
>> facts. Hopefully we can agree on these points
>> 	- the CMA region can be used by user space memory which is a
>> 	  great advantage because the memory is not wasted and our
>> 	  experience has shown that users do care about this a lot. We
>> 	  _know_ that pressure on making those reservations smaller
>> 	  results in a less reliable crashdump and more resources spent
>> 	  on tuning and testing (especially after major upgrades).  A
>> 	  larger reservation which is not completely wasted for the
>> 	  normal runtime is addressing that concern.
>> 	- There is no other known mechanism to achieve the reusability
>> 	  of the crash kernel memory to stop the wastage without much
>> 	  more intrusive code/api impact (e.g. a separate zone or
>> 	  dedicated interface to prevent any hazardous usage like RDMA).
>> 	- implementation wise the patch has a very small footprint. It
>> 	  is using an existing infrastructure (CMA) and it adds a
>> 	  minimal hooking into crashkernel configuration.
>> 	- The only identified risk so far is RDMA acting on this memory
>> 	  without using proper pinning interface. If it helps to have a
>> 	  statement from RDMA maintainers/developers then we can pull
>> 	  them in for a further discussion of course.
>> 	- The feature requires an explicit opt-in so this doesn't bring
>> 	  any new risk to existing crash kernel users until they decide
>> 	  to use it. AFAIU there is no way to tell that the crash kernel
>> 	  memory used to be CMA based in the primary kernel. If you
>> 	  believe that having that information available for
>> 	  debugability would help then I believe this shouldn't be hard
>> 	  to add.  I think it would even make sense to mark this feature
>> 	  experimental to make it clear to users that this needs some
>> 	  time before it can be marked production ready.
>>
>> I hope I haven't really missed anything important. The final
> 
> If I understand Documentation/core-api/pin_user_pages.rst correctly you
> missed case 1 Direct IO. In that case "short term" DMA is allowed for
> pages without FOLL_LONGTERM. Meaning that there is a way you can
> corrupt the CMA and with that the crash kernel after the production
> kernel has panicked.
> 
> With that I don't see a chance this series can be included unless
> someone can explain me that that the documentation is wrong or I
> understood it wrong.

I think you are right. We'd have to disallow any FOLL_PIN on these CMA
pages, or find other ways of handling that (detect that there are no
short-term pins anymore).

But I'm also wondering how MMU-notifier-based approaches might
interfere, where CMA pages might be transparently mapped into secondary
MMUs, possibly with DMA going on.

Are we sure that all these secondary MMUs are inactive as soon as we kexec?
  
Michal Hocko Dec. 6, 2023, 1:49 p.m. UTC | #33
On Wed 06-12-23 12:08:05, Philipp Rudo wrote:
> On Fri, 1 Dec 2023 17:59:02 +0100
> Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Fri 01-12-23 16:51:13, Philipp Rudo wrote:
> > > On Fri, 1 Dec 2023 12:55:52 +0100
> > > Michal Hocko <mhocko@suse.com> wrote:
> > >   
> > > > On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
> > > > [...]  
> > > > > And yes, those are all what-if concerns but unfortunately that is all
> > > > > we have right now.    
> > > > 
> > > > Should theoretical concerns without an actual evidence (e.g. multiple
> > > > drivers known to be broken) become a roadblock for this otherwise useful
> > > > feature?   
> > > 
> > > Those concerns aren't just theoretical. They are experiences we have
> > > from a related feature that suffers exactly the same problem regularly
> > > which wouldn't exist if everybody would simply work "properly".  
> > 
> > What is the related feature?
> 
> kexec

OK, but that is a completely different thing, no? The crashkernel
parameter doesn't affect kexec. Or what is the actual relation?

> > > And yes, even purely theoretical concerns can become a roadblock for a
> > > feature when the cost of those theoretical concerns exceed the benefit
> > > of the feature. The thing is that bugs will be reported against kexec.
> > > So _we_ need to figure out which of the shitty drivers caused the
> > > problem. That puts additional burden on _us_. What we are trying to
> > > evaluate at the moment is if the benefit outweighs the extra burden
> > > with the information we have at the moment.  
> > 
> > I do understand your concerns! But I am pretty sure you do realize that
> > it is really hard to argue theoreticals.  Let me restate what I consider
> > facts. Hopefully we can agree on these points
> > 	- the CMA region can be used by user space memory which is a
> > 	  great advantage because the memory is not wasted and our
> > 	  experience has shown that users do care about this a lot. We
> > 	  _know_ that pressure on making those reservations smaller
> > 	  results in a less reliable crashdump and more resources spent
> > 	  on tuning and testing (especially after major upgrades).  A
> > 	  larger reservation which is not completely wasted for the
> > 	  normal runtime is addressing that concern.
> > 	- There is no other known mechanism to achieve the reusability
> > 	  of the crash kernel memory to stop the wastage without much
> > 	  more intrusive code/api impact (e.g. a separate zone or
> > 	  dedicated interface to prevent any hazardous usage like RDMA).
> > 	- implementation wise the patch has a very small footprint. It
> > 	  is using an existing infrastructure (CMA) and it adds a
> > 	  minimal hooking into crashkernel configuration.
> > 	- The only identified risk so far is RDMA acting on this memory
> > 	  without using proper pinning interface. If it helps to have a
> > 	  statement from RDMA maintainers/developers then we can pull
> > 	  them in for a further discussion of course.
> > 	- The feature requires an explicit opt-in so this doesn't bring
> > 	  any new risk to existing crash kernel users until they decide
> > 	  to use it. AFAIU there is no way to tell that the crash kernel
> > 	  memory used to be CMA based in the primary kernel. If you
> > 	  believe that having that information available for
> > 	  debugability would help then I believe this shouldn't be hard
> > 	  to add.  I think it would even make sense to mark this feature
> > 	  experimental to make it clear to users that this needs some
> > 	  time before it can be marked production ready.
> > 
> > I hope I haven't really missed anything important. The final
> 
> If I understand Documentation/core-api/pin_user_pages.rst correctly you
> missed case 1 Direct IO. In that case "short term" DMA is allowed for
> pages without FOLL_LONGTERM. Meaning that there is a way you can
> corrupt the CMA and with that the crash kernel after the production
> kernel has panicked.

Could you expand on this? How exactly does a direct IO request survive
into the kdump kernel? I do understand the RDMA case because the IO is
async and out of the control of the receiving end.

Also, if direct IO is a problem, how come this is not a problem for
kexec in general? The new kernel usually shares all the memory with the
1st kernel.

/me confused.
  
Michal Hocko Dec. 6, 2023, 3:19 p.m. UTC | #34
On Wed 06-12-23 14:49:51, Michal Hocko wrote:
> On Wed 06-12-23 12:08:05, Philipp Rudo wrote:
[...]
> > If I understand Documentation/core-api/pin_user_pages.rst correctly you
> > missed case 1 Direct IO. In that case "short term" DMA is allowed for
> > pages without FOLL_LONGTERM. Meaning that there is a way you can
> > corrupt the CMA and with that the crash kernel after the production
> > kernel has panicked.
> 
> Could you expand on this? How exactly direct IO request survives across
> into the kdump kernel? I do understand the RMDA case because the IO is
> async and out of control of the receiving end.

OK, I guess I get what you mean. You are worried that there is 
DIO request
program DMA controller to read into CMA memory
<panic>
boot into crash kernel backed by CMA
DMA transfer is done.

DIO doesn't migrate the pinned memory because it is considered a very
quick operation which doesn't block movability for too long. That is
why I had considered it a non-problem. RDMA, on the other hand, might
pin memory for much longer, but that case is handled by migrating the
memory away.

Now I agree that there is a chance of corruption from DIO. The question
I am not entirely clear about right now is how big of a real problem
that is. DMA transfers should be a very swift operation. Would it help
to wait for a grace period before jumping into the kdump kernel?
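
Something along these lines, perhaps (purely hypothetical sketch, not
part of Jiri's series; the function layout roughly follows
kernel/kexec_core.c, but the delay and its knob are invented here):

/* hypothetical knob, e.g. settable via a kernel command line option */
static unsigned int cma_dma_drain_ms = 100;

void __crash_kexec(struct pt_regs *regs)
{
	if (kexec_trylock()) {
		if (kexec_crash_image) {
			struct pt_regs fixed_regs;

			crash_setup_regs(&fixed_regs, regs);
			crash_save_vmcoreinfo();
			machine_crash_shutdown(&fixed_regs);
			/*
			 * Hypothetical grace period: give any short-term
			 * (e.g. direct IO) DMA programmed by the crashed
			 * kernel a chance to complete before the crash
			 * kernel starts reusing the ,cma ranges.
			 */
			mdelay(cma_dma_drain_ms);
			machine_kexec(kexec_crash_image);
		}
		kexec_unlock();
	}
}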

> Also if direct IO is a problem how come this is not a problem for kexec
> in general. The new kernel usually shares all the memory with the 1st
> kernel.

This is also clearer now. Pure kexec shuts down all the devices, which
should terminate the in-flight DMA transfers.
  
Baoquan He Dec. 7, 2023, 4:23 a.m. UTC | #35
On 12/06/23 at 04:19pm, Michal Hocko wrote:
> On Wed 06-12-23 14:49:51, Michal Hocko wrote:
> > On Wed 06-12-23 12:08:05, Philipp Rudo wrote:
> [...]
> > > If I understand Documentation/core-api/pin_user_pages.rst correctly you
> > > missed case 1 Direct IO. In that case "short term" DMA is allowed for
> > > pages without FOLL_LONGTERM. Meaning that there is a way you can
> > > corrupt the CMA and with that the crash kernel after the production
> > > kernel has panicked.
> > 
> > Could you expand on this? How exactly direct IO request survives across
> > into the kdump kernel? I do understand the RMDA case because the IO is
> > async and out of control of the receiving end.
> 
> OK, I guess I get what you mean. You are worried that there is 
> DIO request
> program DMA controller to read into CMA memory
> <panic>
> boot into crash kernel backed by CMA
> DMA transfer is done.
> 
> DIO doesn't migrate the pinned memory because it is considered a very
> quick operation which doesn't block the movability for too long. That is
> why I have considered that a non-problem. RDMA on the other might pin
> memory for transfer for much longer but that case is handled by
> migrating the memory away.
> 
> Now I agree that there is a chance of the corruption from DIO. The
> question I am not entirely clear about right now is how big of a real
> problem that is. DMA transfers should be a very swift operation. Would
> it help to wait for a grace period before jumping into the kdump kernel?

On x86_64 systems with a hardware IOMMU, people finally fixed this
after a very long history of trying and arguing. It wasn't until 2014
that an HPE engineer came up with a series to copy the 1st kernel's
IOMMU page tables into the kdump kernel so that in-flight DMA from the
1st kernel could continue transferring. Later, these attempts and
discussions were turned into code in the mainline kernel. Before that,
people even tried to introduce reset_devices() before jumping into the
kdump kernel, but that was rejected immediately because any extra,
unnecessary action could cause uncertain failure of the kdump kernel,
given that the 1st kernel is in an unpredictable, unstable state.

We can't guarantee how swift the DMA transfer will be in the ,cma case;
it would be a gamble.

[3]
[PATCH v9 00/13] Fix the on-flight DMA issue on system with amd iommu
https://lists.openwall.net/linux-kernel/2017/08/01/399

[2]
[PATCH 00/19] Fix Intel IOMMU breakage in kdump kernel
https://lists.openwall.net/linux-kernel/2015/06/13/72

[1]
[PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
https://lkml.org/lkml/2014/4/24/836

> 
> > Also if direct IO is a problem how come this is not a problem for kexec
> > in general. The new kernel usually shares all the memory with the 1st
> > kernel.
> 
> This is also more clear now. Pure kexec is shutting down all the devices
> which should terminate the in-flight DMA transfers.

Exactly. That's what I have been pointing out in this thread.
  
Michal Hocko Dec. 7, 2023, 8:55 a.m. UTC | #36
On Thu 07-12-23 12:23:13, Baoquan He wrote:
[...]
> We can't guarantee how swift the DMA transfer could be in the cma, case,
> it will be a venture.

We can't guarantee this of course but AFAIK the DMA shouldn't take
minutes, right? While not perfect, waiting for some time before jumping
into the crash kernel should be acceptable from user POV and it should
work around most of those potential lingering programmed DMA transfers.

So I guess what we would like to hear from you as kdump maintainers is
this. Is it absolutely imperative that these issues be proven
impossible, or is a best-effort approach something worth investing time
into? Because if the requirement is an absolute guarantee then I simply
do not see any feasible way to achieve the goal of reusable memory.

Let me reiterate that the existing reservation mechanism is showing its
limits for production systems and I strongly believe this is something
that needs addressing because crash dumps are very often the only tool
to investigate complex issues.
  
Philipp Rudo Dec. 7, 2023, 11:13 a.m. UTC | #37
On Thu, 7 Dec 2023 09:55:20 +0100
Michal Hocko <mhocko@suse.com> wrote:

> On Thu 07-12-23 12:23:13, Baoquan He wrote:
> [...]
> > We can't guarantee how swift the DMA transfer could be in the cma, case,
> > it will be a venture.  
> 
> We can't guarantee this of course but AFAIK the DMA shouldn't take
> minutes, right? While not perfect, waiting for some time before jumping
> into the crash kernel should be acceptable from user POV and it should
> work around most of those potential lingering programmed DMA transfers.

I don't think that simply waiting is acceptable. For one it doesn't
guarantee that there is no corruption (please also see below) but only
reduces its probability. Furthermore, how long would you wait?
Thing is that users don't only want to reduce the memory usage but also
the downtime of kdump. In the end I'm afraid that "simply waiting" will
make things unnecessarily more complex without really solving any issue.

> So I guess what we would like to hear from you as kdump maintainers is
> this. Is it absolutely imperative that these issue must be proven
> impossible or is a best effort approach something worth investing time
> into? Because if the requirement is an absolute guarantee then I simply
> do not see any feasible way to achieve the goal of reusable memory.
> 
> Let me reiterate that the existing reservation mechanism is showing its
> limits for production systems and I strongly believe this is something
> that needs addressing because crash dumps are very often the only tool
> to investigate complex issues.

Because having a crash dump is so important, I want proof that no
legal operation can corrupt the crashkernel memory. The easiest way to
achieve this is by simply keeping the two memory regions fully
separated, like it is today. In theory it should also be possible to
prevent any kind of page pinning in the shared crashkernel memory. But
I don't know what side effects this would have on mm. Such an idea needs
to be discussed on the mm mailing list first.
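
For illustration only, the mm side would at minimum need a cheap way to
recognize such pages, roughly along these lines (crashk_cma_ranges[] /
crashk_cma_cnt are assumed to be the struct range array and counter
tracked by this series; where a gup/pin path would actually call this,
and at what cost, is exactly what would need to be discussed):

	#include <linux/mm.h>
	#include <linux/pfn.h>

	/*
	 * Illustrative sketch: report whether a page lies in one of the
	 * CMA ranges backing the crashkernel.  A pinning path could then
	 * treat such pages like other CMA/movable pages and migrate them
	 * away before allowing them to become DMA targets.
	 */
	static bool page_in_crashk_cma(struct page *page)
	{
		phys_addr_t addr = PFN_PHYS(page_to_pfn(page));
		int i;

		for (i = 0; i < crashk_cma_cnt; i++) {
			if (addr >= crashk_cma_ranges[i].start &&
			    addr <= crashk_cma_ranges[i].end)
				return true;
		}
		return false;
	}

This only shows the shape of the test, not where it would hook into
pin_user_pages()/GUP or what that would mean for mm.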

Finally, let me question whether the whole approach actually solves
anything. For me the difficulty in determining the correct crashkernel
memory is only a symptom. The real problem is that most developers
don't expect their code to run outside their typical environment,
especially not in a memory-constrained environment like kdump. But that
problem won't be solved by throwing more memory at it as this
additional memory will eventually run out as well. In the end we are
back at the point where we are today but with more memory.

Finally finally, one tip: next time a customer complains about how
much memory the crashkernel "wastes", ask them how much one day of
downtime for one machine costs them and how much memory they could buy for
that money. After that calculation I'm pretty sure that an additional
100M of crashkernel memory becomes much more tempting.

Thanks
Philipp
  
Philipp Rudo Dec. 7, 2023, 11:13 a.m. UTC | #38
On Wed, 6 Dec 2023 16:19:51 +0100
Michal Hocko <mhocko@suse.com> wrote:

> On Wed 06-12-23 14:49:51, Michal Hocko wrote:
> > On Wed 06-12-23 12:08:05, Philipp Rudo wrote:  
> [...]
> > > If I understand Documentation/core-api/pin_user_pages.rst correctly you
> > > missed case 1 Direct IO. In that case "short term" DMA is allowed for
> > > pages without FOLL_LONGTERM. Meaning that there is a way you can
> > > corrupt the CMA and with that the crash kernel after the production
> > > kernel has panicked.  
> > 
> > Could you expand on this? How exactly direct IO request survives across
> > into the kdump kernel? I do understand the RMDA case because the IO is
> > async and out of control of the receiving end.  
> 
> OK, I guess I get what you mean. You are worried that there is 
> DIO request
> program DMA controller to read into CMA memory
> <panic>
> boot into crash kernel backed by CMA
> DMA transfer is done.
> 
> DIO doesn't migrate the pinned memory because it is considered a very
> quick operation which doesn't block the movability for too long. That is
> why I have considered that a non-problem. RDMA on the other might pin
> memory for transfer for much longer but that case is handled by
> migrating the memory away.

Right, that is the scenario we need to prevent.

> Now I agree that there is a chance of the corruption from DIO. The
> question I am not entirely clear about right now is how big of a real
> problem that is. DMA transfers should be a very swift operation. Would
> it help to wait for a grace period before jumping into the kdump kernel?

Please see my other mail.

> > Also if direct IO is a problem how come this is not a problem for kexec
> > in general. The new kernel usually shares all the memory with the 1st
> > kernel.  
> 
> This is also more clear now. Pure kexec is shutting down all the devices
> which should terminate the in-flight DMA transfers.

Right, it _should_ terminate all transfers. But here we are back at the
shitty device drivers that don't have a working shutdown method. That's
why we have already seen the problem you describe above with kexec. And
please believe me that debugging such a scenario is an absolute pain.
Especially when it's a proprietary, out-of-tree driver that caused the
mess.

Thanks
Philipp
  
Michal Hocko Dec. 7, 2023, 11:52 a.m. UTC | #39
On Thu 07-12-23 12:13:14, Philipp Rudo wrote:
> On Thu, 7 Dec 2023 09:55:20 +0100
> Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Thu 07-12-23 12:23:13, Baoquan He wrote:
> > [...]
> > > We can't guarantee how swift the DMA transfer could be in the cma, case,
> > > it will be a venture.  
> > 
> > We can't guarantee this of course but AFAIK the DMA shouldn't take
> > minutes, right? While not perfect, waiting for some time before jumping
> > into the crash kernel should be acceptable from user POV and it should
> > work around most of those potential lingering programmed DMA transfers.
> 
> I don't think that simply waiting is acceptable. For one it doesn't
> guarantee that there is no corruption (please also see below) but only
> reduces its probability. Furthermore, how long would you wait?

I would like to talk to storage experts to get some ballpark idea about
the worst-case scenario, but waiting for 1 minute shouldn't terribly
influence downtime. And remember, this is an opt-in feature. If that
doesn't fit your use case, do not use it.

> Thing is that users don't only want to reduce the memory usage but also
> the downtime of kdump. In the end I'm afraid that "simply waiting" will
> make things unnecessarily more complex without really solving any issue.

I am not sure I see the added complexity. Something as simple as
__crash_kexec:
	if (crashk_cma_cnt) 
		mdelay(TIMEOUT)

should do the trick.
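
Spelled out a little more, as a rough sketch only (crashk_cma_cnt is the
range counter introduced by this series; the grace period below is a
made-up placeholder, not an existing knob):

	#include <linux/delay.h>

	/* hypothetical tunable: how long to let programmed DMA drain */
	#define CRASHK_CMA_DMA_GRACE_MS	10000

	/*
	 * Busy-wait before jumping into the kdump kernel so that short-term
	 * DMA (e.g. direct IO) programmed by the crashing kernel has a
	 * chance to complete.  mdelay() spins and does not depend on timer
	 * interrupts, which matters in the panic path.
	 * crashk_cma_cnt comes from the series' reservation code.
	 */
	static void crashk_cma_dma_grace_period(void)
	{
		if (crashk_cma_cnt)
			mdelay(CRASHK_CMA_DMA_GRACE_MS);
	}

__crash_kexec() could call this just before machine_kexec(). Whether any
fixed timeout is good enough is of course what is being debated below.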

> > So I guess what we would like to hear from you as kdump maintainers is
> > this. Is it absolutely imperative that these issue must be proven
> > impossible or is a best effort approach something worth investing time
> > into? Because if the requirement is an absolute guarantee then I simply
> > do not see any feasible way to achieve the goal of reusable memory.
> > 
> > Let me reiterate that the existing reservation mechanism is showing its
> > limits for production systems and I strongly believe this is something
> > that needs addressing because crash dumps are very often the only tool
> > to investigate complex issues.
> 
> Because having a crash dump is so important I want a prove that no
> legal operation can corrupt the crashkernel memory. The easiest way to
> achieve this is by simply keeping the two memory regions fully
> separated like it is today. In theory it should also be possible to
> prevent any kind of page pinning in the shared crashkernel memory. But
> I don't know which side effect this has for mm. Such an idea needs to
> be discussed on the mm mailing list first.

I do not see that as a feasible option. That would require migrating
memory for any gup user that might end up sending data over DMA.

> Finally, let me question whether the whole approach actually solves
> anything. For me the difficulty in determining the correct crashkernel
> memory is only a symptom. The real problem is that most developers
> don't expect their code to run outside their typical environment.
> Especially not in an memory constraint environment like kdump. But that
> problem won't be solved by throwing more memory at it as this
> additional memory will eventually run out as well. In the end we are
> back at the point where we are today but with more memory.

I disagree with you here. While the kernel is really willing to cache
objects into memory I do not think that any particular subsystem is
super eager to waste memory.

The thing we should keep in mind is that the memory set aside is not
used the majority of the time. Crashes (luckily/hopefully) do not happen
very often. And I can really see why people are reluctant to waste it.
Every MB of memory has an operational price tag on it. And let's just be
really honest, a simple reboot without a crash dump is very likely
a cheaper option than wasting productive memory as long as the issue
happens very seldom.
 
> Finally finally, one tip. Next time a customer complaints about how
> much memory the crashkernel "wastes" ask them how much one day of down
> time for one machine costs them and how much memory they could buy for
> that money. After that calculation I'm pretty sure that an additional
> 100M of crashkernel memory becomes much more tempting.

Exactly, and that is why a simple reboot would be preferred over
configuring kdump and investing admin time in testing the configuration
after every (major) upgrade to make sure the existing setup still works.
From my experience, crash dump availability hugely improves the chances
of getting the underlying crash diagnosed and the bug solved, so it is
also in our interest to encourage kdump deployments as much as possible.

Now I do get your concerns about potential issues and I fully recognize
the pain you have gone through when debugging these subtle issues in the
past, but let's not forget that perfect is the enemy of good and that a
best-effort solution might be better than no crash dumps at all.

In the end, let me just ask a theoretical question: with the experience
you have gained, would you nack kexec support if it were proposed now,
just because of all the potential problems it might have?
  
Baoquan He Dec. 8, 2023, 1:55 a.m. UTC | #40
On 12/07/23 at 12:52pm, Michal Hocko wrote:
> On Thu 07-12-23 12:13:14, Philipp Rudo wrote:
> > On Thu, 7 Dec 2023 09:55:20 +0100
> > Michal Hocko <mhocko@suse.com> wrote:
> > 
> > > On Thu 07-12-23 12:23:13, Baoquan He wrote:
> > > [...]
> > > > We can't guarantee how swift the DMA transfer could be in the cma, case,
> > > > it will be a venture.  
> > > 
> > > We can't guarantee this of course but AFAIK the DMA shouldn't take
> > > minutes, right? While not perfect, waiting for some time before jumping
> > > into the crash kernel should be acceptable from user POV and it should
> > > work around most of those potential lingering programmed DMA transfers.
> > 
> > I don't think that simply waiting is acceptable. For one it doesn't
> > guarantee that there is no corruption (please also see below) but only
> > reduces its probability. Furthermore, how long would you wait?
> 
> I would like to talk to storage experts to have some ballpark idea about
> worst case scenario but waiting for 1 minutes shouldn't terribly
> influence downtime and remember this is an opt-in feature. If that
> doesn't fit your use case, do not use it.
> 
> > Thing is that users don't only want to reduce the memory usage but also
> > the downtime of kdump. In the end I'm afraid that "simply waiting" will
> > make things unnecessarily more complex without really solving any issue.
> 
> I am not sure I see the added complexity. Something as simple as
> __crash_kexec:
> 	if (crashk_cma_cnt) 
> 		mdelay(TIMEOUT)
> 
> should do the trick.

I would say please don't do this. The jump into kdump happens very
quickly after the crash, usually within a few seconds. I can't see
anything meaningful coming out of a delay of one minute or several
minutes. Most importantly, the 1st kernel is corrupted and in a very
unpredictable state.
... 
> > Finally, let me question whether the whole approach actually solves
> > anything. For me the difficulty in determining the correct crashkernel
> > memory is only a symptom. The real problem is that most developers
> > don't expect their code to run outside their typical environment.
> > Especially not in an memory constraint environment like kdump. But that
> > problem won't be solved by throwing more memory at it as this
> > additional memory will eventually run out as well. In the end we are
> > back at the point where we are today but with more memory.
> 
> I disagree with you here. While the kernel is really willing to cache
> objects into memory I do not think that any particular subsystem is
> super eager to waste memory.
> 
> The thing we should keep in mind is that the memory sitting aside is not
> used in majority of time. Crashes (luckily/hopefully) do not happen very
> often. And I can really see why people are reluctant to waste it. Every
> MB of memory has an operational price tag on it. And let's just be
> really honest, a simple reboot without a crash dump is very likely
> a cheaper option than wasting a productive memory as long as the issue
> happens very seldom.

In all this time, I have never heard people say they don't want to
"waste" the memory. E.g., for more than 90% of x86 systems, 256M is
enough. The rare exceptions will be noted once recognized and documented
in the product release.

And ,cma is not a silver bullet; see this OOM issue caused by i40e and
its fix: your crashkernel=1G,cma won't help there either.

[v1,0/3] Reducing memory usage of i40e for kdump
https://patchwork.ozlabs.org/project/intel-wired-lan/cover/20210304025543.334912-1-coxu@redhat.com/

======Abstracted from above cover letter==========================
After reducing the allocation of tx/rx/arg/asq ring buffers to the
minimum, the memory consumption is significantly reduced,
    - x86_64: 85.1MB to 1.2MB 
    - POWER9: 15368.5MB to 20.8MB
==================================================================

And to say more about it: this is not the first attempt to make use
of a ,cma area for crashkernel=. At Red Hat, at least 5 people have tried
to add this; we finally gave up after long discussion and investigation.
This year, one kernel developer on our team raised this again with a
very long mail after his own analysis, and we told him about the
discussions and attempts we have made in the past.
  
Baoquan He Dec. 8, 2023, 2:10 a.m. UTC | #41
On 12/07/23 at 09:55am, Michal Hocko wrote:
> On Thu 07-12-23 12:23:13, Baoquan He wrote:
> [...]
> > We can't guarantee how swift the DMA transfer could be in the cma, case,
> > it will be a venture.
> 
> We can't guarantee this of course but AFAIK the DMA shouldn't take
> minutes, right? While not perfect, waiting for some time before jumping
> into the crash kernel should be acceptable from user POV and it should
> work around most of those potential lingering programmed DMA transfers.
> 
> So I guess what we would like to hear from you as kdump maintainers is
> this. Is it absolutely imperative that these issue must be proven
> impossible or is a best effort approach something worth investing time
> into? Because if the requirement is an absolute guarantee then I simply
> do not see any feasible way to achieve the goal of reusable memory.

Honestly, I think all the discussion and evidence has made it clear that
this is not a good idea. This is not about who wants this and who
doesn't. So far, it is an objective fact that taking a ,cma area for
crashkernel= is not a good idea; it's very risky.

We have not denied this from the beginning. I tried to present everything
I know: what we have experienced, investigated and tried. I wanted to
see whether this time we could clear up some concerns that may be
mistaken. But we couldn't. The risk is obvious and very likely to happen.

> 
> Let me reiterate that the existing reservation mechanism is showing its
> limits for production systems and I strongly believe this is something
> that needs addressing because crash dumps are very often the only tool
> to investigate complex issues.

Yes, I admit that. But it hasn't got to the point where it's so bad to
bear that we have to take the risk of using ,cma instead.
  
Michal Hocko Dec. 8, 2023, 10:04 a.m. UTC | #42
On Fri 08-12-23 09:55:39, Baoquan He wrote:
> On 12/07/23 at 12:52pm, Michal Hocko wrote:
> > On Thu 07-12-23 12:13:14, Philipp Rudo wrote:
[...]
> > > Thing is that users don't only want to reduce the memory usage but also
> > > the downtime of kdump. In the end I'm afraid that "simply waiting" will
> > > make things unnecessarily more complex without really solving any issue.
> > 
> > I am not sure I see the added complexity. Something as simple as
> > __crash_kexec:
> > 	if (crashk_cma_cnt) 
> > 		mdelay(TIMEOUT)
> > 
> > should do the trick.
> 
> I would say please don't do this. kdump jumping is a very quick
> behavirou after corruption, usually in several seconds. I can't see any
> meaningful stuff with the delay of one minute or several minutes.

Well, I've been told that DMA should complete within seconds after the
controller is programmed (if it took much longer than that, short-term
pinning would not really be appropriate because it would block memory
movability for way too long and therefore result in failures). This is
something we can tune for.

But if that sounds like a completely wrong approach, then I think an
alternative would be to live with potential in-flight DMAs and just
avoid having the kdump kernel use that memory before the DMA controllers
(PCI bus) are reinitialized by the kdump kernel. That should happen
early in the boot process IIRC, and the CMA-backed memory could be added
after that moment. We already have means to defer memory initialization,
so an extension shouldn't be hard to do. It would be a slightly more
involved patch touching core MM, which we have tried to avoid so far.
Does that sound like something acceptable?
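
Very roughly, and under a pile of assumptions (the first kernel would
have to hand the ranges over, e.g. on the kdump command line, parsed
here into a hypothetical cma_backed_ranges[]/cma_backed_cnt pair of
struct range array and counter; add_memory() also needs section-aligned
ranges and CONFIG_MEMORY_HOTPLUG), the kdump-kernel side might look like:

	#include <linux/init.h>
	#include <linux/memory_hotplug.h>
	#include <linux/numa.h>
	#include <linux/printk.h>
	#include <linux/range.h>

	/*
	 * Rough sketch: the kdump kernel boots without the CMA backed
	 * ranges and only adds them once built-in device/PCI init has run,
	 * i.e. after lingering DMA masters have been reset.
	 */
	static int __init crashk_cma_add_deferred_memory(void)
	{
		int i;

		for (i = 0; i < cma_backed_cnt; i++) {
			if (add_memory(NUMA_NO_NODE,
				       cma_backed_ranges[i].start,
				       range_len(&cma_backed_ranges[i]),
				       MHP_NONE))
				pr_warn("kdump: failed to add CMA backed range %d\n", i);
		}
		return 0;
	}
	late_initcall(crashk_cma_add_deferred_memory);	/* runs after PCI init */

The existing deferred memory initialization mentioned above may well be
a better fit than memory hotplug; this is only meant to show the shape
of the idea, not a concrete implementation.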

[...]

> > The thing we should keep in mind is that the memory sitting aside is not
> > used in majority of time. Crashes (luckily/hopefully) do not happen very
> > often. And I can really see why people are reluctant to waste it. Every
> > MB of memory has an operational price tag on it. And let's just be
> > really honest, a simple reboot without a crash dump is very likely
> > a cheaper option than wasting a productive memory as long as the issue
> > happens very seldom.
> 
> All the time, I have never heard people don't want to "waste" the
> memory. E.g, for more than 90% of system on x86, 256M is enough. The
> rare exceptions will be noted once recognized and documented in product
> release.
> 
> And ,cma is not silver bullet, see this oom issue caused by i40e and its
> fix , your crashkernel=1G,cma won't help either.
> 
> [v1,0/3] Reducing memory usage of i40e for kdump
> https://patchwork.ozlabs.org/project/intel-wired-lan/cover/20210304025543.334912-1-coxu@redhat.com/
> 
> ======Abstrcted from above cover letter==========================
> After reducing the allocation of tx/rx/arg/asq ring buffers to the
> minimum, the memory consumption is significantly reduced,
>     - x86_64: 85.1MB to 1.2MB 
>     - POWER9: 15368.5MB to 20.8MB
> ==================================================================

Nice to see memory consumption reduction fixes. But honestly, this
should happen regardless of kdump. CMA-backed kdump is not meant to
work around excessive kernel memory consumers. It seems I am failing to
get the message through :( but I do not know how else to express that the
pressure to reduce the wasted memory is real. It is not important
whether 256MB is enough for everybody. Even that grows to a non-trivial
cost in data centers with many machines.

> And say more about it. This is not the first time of attempt to make use
> of ,cma area for crashkernel=. In redhat, at least 5 people have tried
> to add this, finally we gave up after long discussion and investigation.
> This year, one kernel developer in our team raised this again with a
> very long mail after his own analysis, we told him the discussion and
> trying we have done in the past.

This is really hard to comment on without any references to those
discussions. From this particular email thread my perception is that
you guys focus much more on provable correctness than on feasibility. If
we applied the same approach universally then many other features
couldn't have been merged, e.g. kexec, for the reasons you have mentioned
in this email thread.

Anyway, thanks for pointing to the regular DMA via gup case, which we
were obviously not aware of. I personally have considered it a
marginal problem compared to RDMA, which is unpredictable wrt timing.
But we believe that this could be worked around. Now it would be really
valuable if we knew somebody had _tried_ that and it turned out not to
work because of XYZ reasons. That would be a solid base from which to
re-evaluate and think of different approaches.

Look, this will be your call as maintainers in the end. If you have
decided, then fair enough. We might end up trying this feature downstream
and maybe come back in the future with experience which we currently
do not have. But it seems we are not alone in seeing that the existing
state is insufficient (http://lkml.kernel.org/r/20230719224821.GC3528218@google.com).

Thanks!