[2/2] docs/mm: Physical Memory: add structure, introduction and nodes description
Commit Message
From: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
Documentation/mm/physical_memory.rst | 322 +++++++++++++++++++++++++++
1 file changed, 322 insertions(+)
Comments
On Sun, Jan 01, 2023 at 11:45:23AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> ---
> Documentation/mm/physical_memory.rst | 322 +++++++++++++++++++++++++++
> 1 file changed, 322 insertions(+)
>
> diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
> index 2ab7b8c1c863..fcf52f1db16b 100644
> --- a/Documentation/mm/physical_memory.rst
> +++ b/Documentation/mm/physical_memory.rst
> @@ -3,3 +3,325 @@
> ===============
> Physical Memory
> ===============
> +
> +Linux is available for a wide range of architectures so there is a need for an
> +architecture-independent abstraction to represent the physical memory. This
> +chapter describes the structures used to manage physical memory in a running
> +system.
> +
> +The first principal concept prevalent in the memory management is
> +`Non-Uniform Memory Access (NUMA)
> +<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
> +With multi-core and multi-socket machines, memory may be arranged into banks
> +that incur a different cost to access depending on the “distance” from the
> +processor. For example, there might be a bank of memory assigned to each CPU or
> +a bank of memory very suitable for DMA near peripheral devices.
Absolutely wonderfully written. Perhaps put a sub-heading for NUMA here?
An aside, but I think it'd be a good idea to mention base pages, folios and
folio order pretty early on as they get touched as concepts all over the place
in physical memory (but perhaps can wait for other contribs!)
> +
> +Each bank is called a node and the concept is represented under Linux by a
> +``struct pglist_data`` even if the architecture is UMA. This structure is
> +always referenced to by it's typedef ``pg_data_t``. A pg_data_t structure
> +for a particular node can be referenced by ``NODE_DATA(nid)`` macro where
> +``nid`` is the ID of that node.
> +
> +For NUMA architectures, the node structures are allocated by the architecture
> +specific code early during boot. Usually, these structures are allocated
> +locally on the memory bank they represent. For UMA architectures, only one
> +static pg_data_t structure called ``contig_page_data`` is used. Nodes will
> +be discussed further in Section :ref:`Nodes <nodes>`
> +
> +Each node may be divided up into a number of blocks called zones which
> +represent ranges within memory. These ranges are usually determined by
> +architectural constraints for accessing the physical memory. A zone is
> +described by a ``struct zone_struct``, typedeffed to ``zone_t`` and each zone
> +has one of the types described below.
I don't think it's quite right to say 'may' be divided up into zones, as they
absolutely will be so (and the entire physical memory allocator hinges on being
zoned, even if trivially in UMA/single-zone cases).
Also it's struct zone right, not zone_struct/zone_t?
I think it's important to clarify that a given zone does not map to a single
struct zone, rather that a struct zone (contained within a pg_data_t object's
array node_zones[]) represents only the portion of the zone that resides in this
node.
It's fiddly because when I talk about a zone like this I am referring to one of
the 'classifications' of zones you mention below, e.g. ZONE_DMA, ZONE_DMA32,
etc. but you might also want to refer to a zone as being equivalent to a struct
zone object.
I think the clearest thing however is to use the term zone to refer to each of
the ZONE_xxx types, e.g. 'this memory is located in ZONE_NORMAL' and to clarify
that one zone can span different individual struct zones (and thus nodes).
I know it's tricky because you and others have rightly pointed out that my own
explanation of this is confusing, and it is something I intend to rejig a bit
myself!
> +
> +`ZONE_DMA` and `ZONE_DMA32`
> + represent memory suitable for DMA by peripheral devices that cannot
> + access all of the addressable memory. Depending on the architecture,
> + either of these zone types or even they both can be disabled at build
> + time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32`` configuration
> + options. Some 64-bit platforms may need both zones as they support
> + peripherals with different DMA addressing limitations.
It might be worth pointing out that ZONE_DMA spans an incredibly small range
that probably won't matter for any peripherals this side of the cretaceous
period, though that may be more colour than suits the docs :) Perhaps also
worth pointing out that ZONE_DMA32 spans the first 32 bits of physical address
space, just to nail down that a zone refers to a memory range and that, in this
case at least, it is quite as simple as that.
> +
> +`ZONE_NORMAL`
> + is for normal memory that can be accessed by the kernel all the time. DMA
> + operations can be performed on pages in this zone if the DMA devices support
> + transfers to all addressable memory. ZONE_NORMAL is always enabled.
> +
Might be worth saying 'this is where memory ends up if not otherwise in another
zone'.
> +`ZONE_HIGHMEM`
> + is the part of the physical memory that is not covered by a permanent mapping
> + in the kernel page tables. The memory in this zone is only accessible to the
> + kernel using temporary mappings. This zone is available only some 32-bit
> + architectures and is enabled with ``CONFIG_HIGHMEM``.
> +
I comment here only to say 'wow I am so glad I chose to only focus on 64-bit so
I could side-step all the awkward discussion of high pages' :)
> +`ZONE_MOVABLE`
> + is for normal accessible memory, just like ZONE_NORMAL. The difference is
> + that most pages in ZONE_MOVABLE are movable. That means that while virtual
> + addresses of these pages do not change, their content may move between
> + different physical pages. ZONE_MOVABLE is only enabled when one of
> + `kernelcore`, `movablecore` and `movable_node` parameters is present in the
> + kernel command line. See :ref:`Page migration <page_migration>` for
> + additional details.
> +
> +`ZONE_DEVICE`
> + represents memory residing on devices such as PMEM and GPU. It has different
> + characteristics than RAM zone types and it exists to provide :ref:`struct
> + page <Pages>` and memory map services for device driver identified physical
> + address ranges. ZONE_DEVICE is enabled with configuration option
> + ``CONFIG_ZONE_DEVICE``.
> +
> +It is important to note that many kernel operations can only take place using
> +ZONE_NORMAL so it is the most performance critical zone. Zones are discussed
> +further in Section :ref:`Zones <zones>`.
> +
> +The relation between node and zone extents is determined by the physical memory
> +map reported by the firmware, architectural constraints for memory addressing
> +and certain parameters in the kernel command line.
Perhaps worth mentioning device tree here? Though perhaps encapsulated in the
'firmware' reference.
> +
> +For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
> +entire memory will be on node 0 and there will be three zones: ZONE_DMA,
> +ZONE_NORMAL and ZONE_HIGHMEM::
> +
> + 0 2G
> + +-------------------------------------------------------------+
> + | node 0 |
> + +-------------------------------------------------------------+
> +
> + 0 16M 896M 2G
> + +----------+-----------------------+--------------------------+
> + | ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM |
> + +----------+-----------------------+--------------------------+
> +
> +
> +With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled and booted
> +with `movablecore=80%` parameter on an arm64 machine with 16 Gbytes of RAM
> +equally split between two nodes, there will be ZONE_DMA32, ZONE_NORMAL and
> +ZONE_MOVABLE on node 0, and ZONE_NORMAL and ZONE_MOVABLE on node 1::
> +
> +
> + 1G 9G 17G
> + +--------------------------------+ +--------------------------+
> + | node 0 | | node 1 |
> + +--------------------------------+ +--------------------------+
> +
> + 1G 4G 4200M 9G 9320M 17G
> + +---------+----------+-----------+ +------------+-------------+
> + | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
> + +---------+----------+-----------+ +------------+-------------+
> +
Excellent diagrams!
> +.. _nodes:
> +
> +Nodes
> +=====
> +
> +As we have mentioned, each node in memory is described by a ``pg_data_t`` which
> +is a typedef for a ``struct pglist_data``. When allocating a page, by default
> +Linux uses a node-local allocation policy to allocate memory from the node
> +closest to the running CPU. As processes tend to run on the same CPU, it is
> +likely the memory from the current node will be used. The allocation policy can
> +be controlled by users as described in
> +`Documentation/admin-guide/mm/numa_memory_policy.rst`.
> +
> +Most NUMA architectures maintain an array of pointers to the node
> +structures. The actual structures are allocated early during boot when
> +architecture specific code parses the physical memory map reported by the
> +firmware. The bulk of the node initialization happens slightly later in the
> +boot process by free_area_init() function, described later in Section
> +:ref:`Initialization <initialization>`.
> +
> +
> +Along with the node structures, kernel maintains an array of ``nodemask_t``
> +bitmasks called `node_states`. Each bitmask in this array represents a set of
> +nodes with particular properties as defined by `enum node_states`:
> +
> +`N_POSSIBLE`
> + The node could become online at some point.
> +`N_ONLINE`
> + The node is online.
> +`N_NORMAL_MEMORY`
> + The node has regular memory.
> +`N_HIGH_MEMORY`
> + The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled
> + aliased to `N_NORMAL_MEMORY`.
> +`N_MEMORY`
> + The node has memory(regular, high, movable)
> +`N_CPU`
> + The node has one or more CPUs
> +
> +For each node that has a property described above, the bit corresponding to the
> +node ID in the ``node_states[<property>]`` bitmask is set.
> +
> +For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::
> +
> + node_states[N_POSSIBLE]
> + node_states[N_ONLINE]
> + node_states[N_NORMAL_MEMORY]
> + node_states[N_MEMORY]
> + node_states[N_CPU]
> +
> +For various operations possible with nodemasks please refer to
> +`include/linux/nodemask.h
> +<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/nodemask.h>`_.
> +
> +Among other things, nodemasks are used to provide macros for node traversal,
> +namely `for_each_node()` and `for_each_online_node()`.
> +
> +For instance, to call a function foo() for each online node::
> +
> + for_each_online_node(nid) {
> + pg_data_t *pgdat = NODE_DATA(nid);
> +
> + foo(pgdat);
> + }
> +
> +Node structure
> +--------------
> +
> +The struct pglist_data is declared in `include/linux/mmzone.h
> +<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/mmzone.h>`_.
> +Here we briefly describe fields of this structure:
Perhaps worth saying 'The node structure' just to reiterate.
> +
> +General
> +~~~~~~~
> +
> +`node_zones`
> + The zones for this node. Not all of the zones may be populated, but it is
> + the full list. It is referenced by this node's node_zonelists as well as
> + other node's node_zonelists.
Perhaps worth describing what zonelists (and equally zonerefs) are here or
above, and noting that this is the canonical place where zones reside. Maybe
reference populated_zone() and for_each_populated_zone(), given that not all
of the zones here may be populated?
> +
> +`node_zonelists` The list of all zones in all nodes. This list defines the
> + order of zones that allocations are preferred from. The `node_zonelists` is
> + set up by build_zonelists() in mm/page_alloc.c during the initialization of
> + core memory management structures.
> +
> +`nr_zones`
> + Number of populated zones in this node.
> +
> +`node_mem_map`
> + For UMA systems that use FLATMEM memory model the 0's node (and the only)
> + `node_mem_map` is array of struct pages representing each physical frame.
> +
> +`node_page_ext`
> + For UMA systems that use FLATMEM memory model the 0's (and the only) node
> + `node_mem_map` is array of extensions of struct pages. Available only in the
> + kernels built with ``CONFIG_PAGE_EXTENTION`` enabled.
> +
> +`node_start_pfn`
> + The page frame number of the starting page frame in this node.
> +
> +`node_present_pages`
> + Total number of physical pages present in this node.
> +
> +`node_spanned_pages`
> + Total size of physical page range, including holes.
> +
I think it'd be useful to discuss briefly the meaning of managed, spanned and
present pages in the context of zones.
> +`node_size_lock`
> + A lock that protects the fields defining the node extents. Only defined when
> + at least one of ``CONFIG_MEMORY_HOTPLUG`` or
> + ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options are enabled.
> +
> + pgdat_resize_lock() and pgdat_resize_unlock() are provided to manipulate
> + node_size_lock without checking for CONFIG_MEMORY_HOTPLUG or
> + CONFIG_DEFERRED_STRUCT_PAGE_INIT.
> +
> +`node_id`
> + The Node ID (NID) of the node, starts at 0.
> +
> +`totalreserve_pages`
> + This is a per~node reserve of pages that are not available to userspace
> + allocations.
> +
> +`first_deferred_pfn`
> + If memory initialization on large machines is deferred then this is the first
> + PFN that needs to be initialized. Defined only when
> + ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled
> +
> +`deferred_split_queue`
> + Per-node queue of huge pages that their split was deferred. Defined only when ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.
> +
> +`__lruvec`
> + Per-node lruvec holding LRU lists and related parameters. Used only when memory cgroups are disabled. Should not be accessed directly, use mem_cgroup_lruvec() to look up lruvecs instead.
> +
> +Reclaim control
> +~~~~~~~~~~~~~~~
> +
> +See also :ref:`Page Reclaim <page_reclaim>`.
> +
> +`kswapd`
> + Per-node instance of kswapd kernel thread.
> +
> +`kswapd_wait`, `pfmemalloc_wait`, `reclaim_wait`
> + Workqueues used to synchronize memory reclaim tasks
> +
> +`nr_writeback_throttled`
> + Number of tasks that are throttled waiting on dirty pages to clean.
> +
> +`nr_reclaim_start`
> + Number of pages written while reclaim is throttled waiting for writeback.
> +
> +`kswapd_order`
> + Controls the order kswapd tries to reclaim
> +
> +`kswapd_highest_zoneidx`
> + The highest zone index to be reclaimed by kswapd
> +
> +`kswapd_failures`
> + Number of runs kswapd was unable to reclaim any pages
> +
> +`min_unmapped_pages`
> + Minimal number of unmapped file backed pages that cannot be reclaimed. Determined by vm.min_unmapped_ratio sysctl.
> + Only defined when ``CONFIG_NUMA`` is enabled.
> +
> +`min_slab_pages`
> + Minimal number of SLAB pages that cannot be reclaimed. Determined by vm.min_slab_ratio sysctl.
> + Only defined when ``CONFIG_NUMA`` is enabled
> +
> +`flags`
> + Flags controlling reclaim behavior.
> +
> +Compaction control
> +~~~~~~~~~~~~~~~~~~
> +
> +`kcompactd_max_order`
> + Page order that kcompactd should try to achieve.
> +
> +`kcompactd_highest_zoneidx`
> + The highest zone index to be compacted by kcompactd.
> +
> +`kcompactd_wait`
> + Workqueue used to synchronizes memory compaction tasks.
> +
> +`kcompactd`
> + Per-node instance of kcompactd kernel thread.
> +
> +`proactive_compact_trigger`
> + Determines if proactive compaction is enabled. Controlled by vm.compaction_proactiveness sysctl.
> +
> +Statistics
> +~~~~~~~~~~
> +
> +`per_cpu_nodestats`
> + Per-CPU VM statistics for the node
> +
> +`vm_stat`
> + VM statistics for the node.
> +
> +.. _zones:
> +
> +Zones
> +=====
> +
> +.. _pages:
> +
> +Pages
> +=====
> +
> +.. _folios:
> +
> +Folios
> +======
> +
> +.. _initialization:
> +
> +Initialization
> +==============
> --
> 2.35.1
>
Overall it's fantastically written (you're a gifted writer!) and a great basis
on which to build further documentation.
I hope you can forgive the nitpicking (which unfortunately is a little too easy
when reviewing doc patches I feel) and that my comments are useful!
Cheers, Lorenzo
On Sun, Jan 01, 2023 at 11:45:23AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
No patch description really?
> +Each node may be divided up into a number of blocks called zones which
> +represent ranges within memory. These ranges are usually determined by
> +architectural constraints for accessing the physical memory. A zone is
> +described by a ``struct zone_struct``, typedeffed to ``zone_t`` and each zone
> +has one of the types described below.
> +
> +`ZONE_DMA` and `ZONE_DMA32`
> + represent memory suitable for DMA by peripheral devices that cannot
> + access all of the addressable memory. Depending on the architecture,
> + either of these zone types or even they both can be disabled at build
> + time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32`` configuration
> + options. Some 64-bit platforms may need both zones as they support
> + peripherals with different DMA addressing limitations.
> +
> +`ZONE_NORMAL`
> + is for normal memory that can be accessed by the kernel all the time. DMA
> + operations can be performed on pages in this zone if the DMA devices support
> + transfers to all addressable memory. ZONE_NORMAL is always enabled.
> +
> +`ZONE_HIGHMEM`
> + is the part of the physical memory that is not covered by a permanent mapping
> + in the kernel page tables. The memory in this zone is only accessible to the
> + kernel using temporary mappings. This zone is available only some 32-bit
> + architectures and is enabled with ``CONFIG_HIGHMEM``.
> +
> +`ZONE_MOVABLE`
> + is for normal accessible memory, just like ZONE_NORMAL. The difference is
> + that most pages in ZONE_MOVABLE are movable. That means that while virtual
> + addresses of these pages do not change, their content may move between
> + different physical pages. ZONE_MOVABLE is only enabled when one of
> + `kernelcore`, `movablecore` and `movable_node` parameters is present in the
> + kernel command line. See :ref:`Page migration <page_migration>` for
> + additional details.
> +
> +`ZONE_DEVICE`
> + represents memory residing on devices such as PMEM and GPU. It has different
> + characteristics than RAM zone types and it exists to provide :ref:`struct
> + page <Pages>` and memory map services for device driver identified physical
> + address ranges. ZONE_DEVICE is enabled with configuration option
> + ``CONFIG_ZONE_DEVICE``.
I think bullet lists should do the job better, since the zone names are
connected directly to their representations:
---- >8 ----
diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index fcf52f1db16b71..d308b11cfcf7f0 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -35,40 +35,36 @@ architectural constraints for accessing the physical memory. A zone is
described by a ``struct zone_struct``, typedeffed to ``zone_t`` and each zone
has one of the types described below.
-`ZONE_DMA` and `ZONE_DMA32`
- represent memory suitable for DMA by peripheral devices that cannot
- access all of the addressable memory. Depending on the architecture,
- either of these zone types or even they both can be disabled at build
- time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32`` configuration
- options. Some 64-bit platforms may need both zones as they support
- peripherals with different DMA addressing limitations.
+* `ZONE_DMA` and `ZONE_DMA32` represent memory suitable for DMA by peripheral
+ devices that cannot access all of the addressable memory. Depending on the
+ architecture, either of these zone types or even they both can be disabled
+ at build time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32``
+ configuration options. Some 64-bit platforms may need both zones as they
+ support peripherals with different DMA addressing limitations.
-`ZONE_NORMAL`
- is for normal memory that can be accessed by the kernel all the time. DMA
- operations can be performed on pages in this zone if the DMA devices support
- transfers to all addressable memory. ZONE_NORMAL is always enabled.
+* `ZONE_NORMAL` is for normal memory that can be accessed by the kernel all
+ the time. DMA operations can be performed on pages in this zone if the DMA
+ devices support transfers to all addressable memory. ZONE_NORMAL is always
+ enabled.
-`ZONE_HIGHMEM`
- is the part of the physical memory that is not covered by a permanent mapping
- in the kernel page tables. The memory in this zone is only accessible to the
- kernel using temporary mappings. This zone is available only some 32-bit
- architectures and is enabled with ``CONFIG_HIGHMEM``.
+* `ZONE_HIGHMEM` is the part of the physical memory that is not covered by a
+ permanent mapping in the kernel page tables. The memory in this zone is only
+ accessible to the kernel using temporary mappings. This zone is available
+ only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
-`ZONE_MOVABLE`
- is for normal accessible memory, just like ZONE_NORMAL. The difference is
- that most pages in ZONE_MOVABLE are movable. That means that while virtual
- addresses of these pages do not change, their content may move between
- different physical pages. ZONE_MOVABLE is only enabled when one of
+* `ZONE_MOVABLE` is for normal accessible memory, just like ZONE_NORMAL. The
+ difference is that most pages in ZONE_MOVABLE are movable. That means that
+ while virtual addresses of these pages do not change, their content may move
+ between different physical pages. ZONE_MOVABLE is only enabled when one of
`kernelcore`, `movablecore` and `movable_node` parameters is present in the
kernel command line. See :ref:`Page migration <page_migration>` for
additional details.
-`ZONE_DEVICE`
- represents memory residing on devices such as PMEM and GPU. It has different
- characteristics than RAM zone types and it exists to provide :ref:`struct
- page <Pages>` and memory map services for device driver identified physical
- address ranges. ZONE_DEVICE is enabled with configuration option
- ``CONFIG_ZONE_DEVICE``.
+* `ZONE_DEVICE` represents memory residing on devices such as PMEM and GPU.
+ It has different characteristics than RAM zone types and it exists to provide
+ :ref:`struct page <Pages>` and memory map services for device driver
+ identified physical address ranges. ZONE_DEVICE is enabled with configuration
+ option ``CONFIG_ZONE_DEVICE``.
It is important to note that many kernel operations can only take place using
ZONE_NORMAL so it is the most performance critical zone. Zones are discussed
> +For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
> +entire memory will be on node 0 and there will be three zones: ZONE_DMA,
> +ZONE_NORMAL and ZONE_HIGHMEM::
> +
> + 0 2G
> + +-------------------------------------------------------------+
> + | node 0 |
> + +-------------------------------------------------------------+
> +
> + 0 16M 896M 2G
> + +----------+-----------------------+--------------------------+
> + | ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM |
> + +----------+-----------------------+--------------------------+
> +
> +
> +With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled and booted
> +with `movablecore=80%` parameter on an arm64 machine with 16 Gbytes of RAM
> +equally split between two nodes, there will be ZONE_DMA32, ZONE_NORMAL and
> +ZONE_MOVABLE on node 0, and ZONE_NORMAL and ZONE_MOVABLE on node 1::
> +
> +
> + 1G 9G 17G
> + +--------------------------------+ +--------------------------+
> + | node 0 | | node 1 |
> + +--------------------------------+ +--------------------------+
> +
> + 1G 4G 4200M 9G 9320M 17G
> + +---------+----------+-----------+ +------------+-------------+
> + | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
> + +---------+----------+-----------+ +------------+-------------+
I see inconsistent formatting of keywords: some are in inline code and some
are not. I'm leaning towards inlining them all:
---- >8 ----
diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index d308b11cfcf7f0..83e13166508a20 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -19,14 +19,14 @@ a bank of memory very suitable for DMA near peripheral devices.
Each bank is called a node and the concept is represented under Linux by a
``struct pglist_data`` even if the architecture is UMA. This structure is
-always referenced to by it's typedef ``pg_data_t``. A pg_data_t structure
+always referenced to by it's typedef ``pg_data_t``. A ``pg_data_t`` structure
for a particular node can be referenced by ``NODE_DATA(nid)`` macro where
``nid`` is the ID of that node.
For NUMA architectures, the node structures are allocated by the architecture
specific code early during boot. Usually, these structures are allocated
locally on the memory bank they represent. For UMA architectures, only one
-static pg_data_t structure called ``contig_page_data`` is used. Nodes will
+static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
be discussed further in Section :ref:`Nodes <nodes>`
Each node may be divided up into a number of blocks called zones which
@@ -35,48 +35,49 @@ architectural constraints for accessing the physical memory. A zone is
described by a ``struct zone_struct``, typedeffed to ``zone_t`` and each zone
has one of the types described below.
-* `ZONE_DMA` and `ZONE_DMA32` represent memory suitable for DMA by peripheral
- devices that cannot access all of the addressable memory. Depending on the
- architecture, either of these zone types or even they both can be disabled
- at build time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32``
- configuration options. Some 64-bit platforms may need both zones as they
- support peripherals with different DMA addressing limitations.
+* ``ZONE_DMA`` and ``ZONE_DMA32`` represent memory suitable for DMA by
+ peripheral devices that cannot access all of the addressable memory.
+ Depending on the architecture, either of these zone types or even they both
+ can be disabled at build time using ``CONFIG_ZONE_DMA`` and
+ ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
+ both zones as they support peripherals with different DMA addressing
+ limitations.
-* `ZONE_NORMAL` is for normal memory that can be accessed by the kernel all
+* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
the time. DMA operations can be performed on pages in this zone if the DMA
- devices support transfers to all addressable memory. ZONE_NORMAL is always
- enabled.
+ devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
+ always enabled.
-* `ZONE_HIGHMEM` is the part of the physical memory that is not covered by a
+* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
permanent mapping in the kernel page tables. The memory in this zone is only
accessible to the kernel using temporary mappings. This zone is available
only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
-* `ZONE_MOVABLE` is for normal accessible memory, just like ZONE_NORMAL. The
- difference is that most pages in ZONE_MOVABLE are movable. That means that
- while virtual addresses of these pages do not change, their content may move
- between different physical pages. ZONE_MOVABLE is only enabled when one of
- `kernelcore`, `movablecore` and `movable_node` parameters is present in the
- kernel command line. See :ref:`Page migration <page_migration>` for
- additional details.
+* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
+ The difference is that most pages in ``ZONE_MOVABLE`` are movable. That means
+ that while virtual addresses of these pages do not change, their content may
+ move between different physical pages. ``ZONE_MOVABLE`` is only enabled when
+ one of ``kernelcore``, ``movablecore`` and ``movable_node`` parameters is
+ present in the kernel command line. See :ref:`Page migration
+ <page_migration>` for additional details.
-* `ZONE_DEVICE` represents memory residing on devices such as PMEM and GPU.
+* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
It has different characteristics than RAM zone types and it exists to provide
:ref:`struct page <Pages>` and memory map services for device driver
- identified physical address ranges. ZONE_DEVICE is enabled with configuration
- option ``CONFIG_ZONE_DEVICE``.
+ identified physical address ranges. ``ZONE_DEVICE`` is enabled with
+ configuration option ``CONFIG_ZONE_DEVICE``.
It is important to note that many kernel operations can only take place using
-ZONE_NORMAL so it is the most performance critical zone. Zones are discussed
-further in Section :ref:`Zones <zones>`.
+``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
+discussed further in Section :ref:`Zones <zones>`.
The relation between node and zone extents is determined by the physical memory
map reported by the firmware, architectural constraints for memory addressing
and certain parameters in the kernel command line.
For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
-entire memory will be on node 0 and there will be three zones: ZONE_DMA,
-ZONE_NORMAL and ZONE_HIGHMEM::
+entire memory will be on node 0 and there will be three zones: ``ZONE_DMA``,
+``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::
0 2G
+-------------------------------------------------------------+
@@ -89,10 +90,11 @@ ZONE_NORMAL and ZONE_HIGHMEM::
+----------+-----------------------+--------------------------+
-With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled and booted
-with `movablecore=80%` parameter on an arm64 machine with 16 Gbytes of RAM
-equally split between two nodes, there will be ZONE_DMA32, ZONE_NORMAL and
-ZONE_MOVABLE on node 0, and ZONE_NORMAL and ZONE_MOVABLE on node 1::
+With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
+booted with ``movablecore=80%`` parameter on an arm64 machine with 16 Gbytes of
+RAM equally split between two nodes, there will be ``ZONE_DMA32``,
+``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
+``ZONE_MOVABLE`` on node 1::
1G 9G 17G
@@ -116,7 +118,7 @@ Linux uses a node-local allocation policy to allocate memory from the node
closest to the running CPU. As processes tend to run on the same CPU, it is
likely the memory from the current node will be used. The allocation policy can
be controlled by users as described in
-`Documentation/admin-guide/mm/numa_memory_policy.rst`.
+Documentation/admin-guide/mm/numa_memory_policy.rst.
Most NUMA architectures maintain an array of pointers to the node
structures. The actual structures are allocated early during boot when
@@ -127,21 +129,21 @@ boot process by free_area_init() function, described later in Section
Along with the node structures, kernel maintains an array of ``nodemask_t``
-bitmasks called `node_states`. Each bitmask in this array represents a set of
-nodes with particular properties as defined by `enum node_states`:
+bitmasks called ``node_states``. Each bitmask in this array represents a set of
+nodes with particular properties as defined by ``enum node_states``:
-`N_POSSIBLE`
+``N_POSSIBLE``
The node could become online at some point.
-`N_ONLINE`
+``N_ONLINE``
The node is online.
-`N_NORMAL_MEMORY`
+``N_NORMAL_MEMORY``
The node has regular memory.
-`N_HIGH_MEMORY`
+``N_HIGH_MEMORY``
The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled
- aliased to `N_NORMAL_MEMORY`.
-`N_MEMORY`
+ aliased to ``N_NORMAL_MEMORY``.
+``N_MEMORY``
The node has memory (regular, high, movable).
-`N_CPU`
+``N_CPU``
The node has one or more CPUs.
For each node that has a property described above, the bit corresponding to the
@@ -160,7 +162,7 @@ For various operations possible with nodemasks please refer to
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/nodemask.h>`_.
Among other things, nodemasks are used to provide macros for node traversal,
-namely `for_each_node()` and `for_each_online_node()`.
+namely ``for_each_node()`` and ``for_each_online_node()``.
For instance, to call a function foo() for each online node::
@@ -180,126 +182,130 @@ Here we briefly describe fields of this structure:
General
~~~~~~~
-`node_zones`
+``node_zones``
The zones for this node. Not all of the zones may be populated, but it is
the full list. It is referenced by this node's node_zonelists as well as
other node's node_zonelists.
-`node_zonelists` The list of all zones in all nodes. This list defines the
- order of zones that allocations are preferred from. The `node_zonelists` is
- set up by build_zonelists() in mm/page_alloc.c during the initialization of
+``node_zonelists``
+ The list of all zones in all nodes. This list defines the order of zones
+ that allocations are preferred from. The ``node_zonelists`` is set up by
+ ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
core memory management structures.
-`nr_zones`
+``nr_zones``
Number of populated zones in this node.
-`node_mem_map`
+``node_mem_map``
For UMA systems that use the FLATMEM memory model, node 0's (and the only)
- `node_mem_map` is array of struct pages representing each physical frame.
+ ``node_mem_map`` is an array of struct pages representing each physical frame.
-`node_page_ext`
+``node_page_ext``
For UMA systems that use the FLATMEM memory model, node 0's (and the only)
- `node_mem_map` is array of extensions of struct pages. Available only in the
+ ``node_page_ext`` is an array of extensions of struct pages. Available only in the
kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.
-`node_start_pfn`
+``node_start_pfn``
The page frame number of the starting page frame in this node.
-`node_present_pages`
+``node_present_pages``
Total number of physical pages present in this node.
-`node_spanned_pages`
+``node_spanned_pages``
Total size of physical page range, including holes.
-`node_size_lock`
+``node_size_lock``
A lock that protects the fields defining the node extents. Only defined when
at least one of ``CONFIG_MEMORY_HOTPLUG`` or
``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options are enabled.
+ ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
+ manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
+ or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.
- pgdat_resize_lock() and pgdat_resize_unlock() are provided to manipulate
- node_size_lock without checking for CONFIG_MEMORY_HOTPLUG or
- CONFIG_DEFERRED_STRUCT_PAGE_INIT.
-
-`node_id`
+``node_id``
The Node ID (NID) of the node, starts at 0.
-`totalreserve_pages`
+``totalreserve_pages``
This is a per-node reserve of pages that are not available to userspace
allocations.
-`first_deferred_pfn`
+``first_deferred_pfn``
If memory initialization on large machines is deferred then this is the first
PFN that needs to be initialized. Defined only when
``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.
-`deferred_split_queue`
+``deferred_split_queue``
Per-node queue of huge pages whose split was deferred. Defined only when
``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.
-`__lruvec`
- Per-node lruvec holding LRU lists and related parameters. Used only when memory cgroups are disabled. Should not be accessed directly, use mem_cgroup_lruvec() to look up lruvecs instead.
+``__lruvec``
+ Per-node lruvec holding LRU lists and related parameters. Used only when
+ memory cgroups are disabled. It should not be accessed directly, use
+ ``mem_cgroup_lruvec()`` to look up lruvecs instead.
Reclaim control
~~~~~~~~~~~~~~~
See also :ref:`Page Reclaim <page_reclaim>`.
-`kswapd`
+``kswapd``
Per-node instance of kswapd kernel thread.
-`kswapd_wait`, `pfmemalloc_wait`, `reclaim_wait`
+``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
Workqueues used to synchronize memory reclaim tasks.
-`nr_writeback_throttled`
+``nr_writeback_throttled``
Number of tasks that are throttled waiting on dirty pages to clean.
-`nr_reclaim_start`
+``nr_reclaim_start``
Number of pages written while reclaim is throttled waiting for writeback.
-`kswapd_order`
+``kswapd_order``
Controls the page order kswapd tries to reclaim.
-`kswapd_highest_zoneidx`
+``kswapd_highest_zoneidx``
The highest zone index to be reclaimed by kswapd.
-`kswapd_failures`
+``kswapd_failures``
Number of runs in which kswapd was unable to reclaim any pages.
-`min_unmapped_pages`
- Minimal number of unmapped file backed pages that cannot be reclaimed. Determined by vm.min_unmapped_ratio sysctl.
- Only defined when ``CONFIG_NUMA`` is enabled.
+``min_unmapped_pages``
+ Minimal number of unmapped file backed pages that cannot be reclaimed.
+ Determined by ``vm.min_unmapped_ratio`` sysctl. Only defined when
+ ``CONFIG_NUMA`` is enabled.
-`min_slab_pages`
- Minimal number of SLAB pages that cannot be reclaimed. Determined by vm.min_slab_ratio sysctl.
- Only defined when ``CONFIG_NUMA`` is enabled
+``min_slab_pages``
+ Minimal number of SLAB pages that cannot be reclaimed. Determined by
+ ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.
-`flags`
+``flags``
Flags controlling reclaim behavior.
Compaction control
~~~~~~~~~~~~~~~~~~
-`kcompactd_max_order`
+``kcompactd_max_order``
Page order that kcompactd should try to achieve.
-`kcompactd_highest_zoneidx`
+``kcompactd_highest_zoneidx``
The highest zone index to be compacted by kcompactd.
-`kcompactd_wait`
+``kcompactd_wait``
+ Workqueue used to synchronize memory compaction tasks.
-`kcompactd`
+``kcompactd``
Per-node instance of kcompactd kernel thread.
-`proactive_compact_trigger`
- Determines if proactive compaction is enabled. Controlled by vm.compaction_proactiveness sysctl.
+``proactive_compact_trigger``
+ Determines if proactive compaction is enabled. Controlled by
+ ``vm.compaction_proactiveness`` sysctl.
Statistics
~~~~~~~~~~
-`per_cpu_nodestats`
+``per_cpu_nodestats``
Per-CPU VM statistics for the node.
-`vm_stat`
+``vm_stat``
VM statistics for the node.
.. _zones:
> +For various operations possible with nodemasks please refer to
> +`include/linux/nodemask.h
> +<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/nodemask.h>`_.
Instead of linking to Linus's tree, just inline the source path:
---- >8 ----
diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index 83e13166508a20..130880e5c369de 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -158,8 +158,7 @@ For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::
node_states[N_CPU]
For various operations possible with nodemasks please refer to
-`include/linux/nodemask.h
-<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/nodemask.h>`_.
+``include/linux/nodemask.h``.
Among other things, nodemasks are used to provide macros for node traversal,
namely ``for_each_node()`` and ``for_each_online_node()``.
@@ -175,9 +174,8 @@ For instance, to call a function foo() for each online node::
Node structure
--------------
-The struct pglist_data is declared in `include/linux/mmzone.h
-<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/mmzone.h>`_.
-Here we briefly describe fields of this structure:
+The struct pglist_data is declared in ``include/linux/mmzone.h``. Here we
+briefly describe fields of this structure:
General
~~~~~~~
> +.. _zones:
> +
> +Zones
> +=====
> +
> +.. _pages:
> +
> +Pages
> +=====
> +
> +.. _folios:
> +
> +Folios
> +======
> +
> +.. _initialization:
> +
> +Initialization
> +==============
Are these sections stubs (no fields list for each types)? If so, add
admonitions to inform readers:
---- >8 ----
diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index 130880e5c369de..cf61725d93b229 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -311,17 +311,33 @@ Statistics
Zones
=====
+.. admonition:: Stub
+
+ This section is incomplete. Please list and describe the appropriate fields.
+
.. _pages:
Pages
=====
+.. admonition:: Stub
+
+ This section is incomplete. Please list and describe the appropriate fields.
+
.. _folios:
Folios
======
+.. admonition:: Stub
+
+ This section is incomplete. Please list and describe the appropriate fields.
+
.. _initialization:
Initialization
==============
+
+.. admonition:: Stub
+
+ This section is incomplete. Please list and describe the appropriate fields.
Thanks.
On Fri, Jan 06, 2023 at 10:32:46PM +0000, Lorenzo Stoakes wrote:
> On Sun, Jan 01, 2023 at 11:45:23AM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> > ---
> > Documentation/mm/physical_memory.rst | 322 +++++++++++++++++++++++++++
> > 1 file changed, 322 insertions(+)
> >
> > diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
> > index 2ab7b8c1c863..fcf52f1db16b 100644
> > --- a/Documentation/mm/physical_memory.rst
> > +++ b/Documentation/mm/physical_memory.rst
> > @@ -3,3 +3,325 @@
> > ===============
> > Physical Memory
> > ===============
> > +
> > +Linux is available for a wide range of architectures so there is a need for an
> > +architecture-independent abstraction to represent the physical memory. This
> > +chapter describes the structures used to manage physical memory in a running
> > +system.
> > +
> > +The first principal concept prevalent in the memory management is
> > +`Non-Uniform Memory Access (NUMA)
> > +<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
> > +With multi-core and multi-socket machines, memory may be arranged into banks
> > +that incur a different cost to access depending on the “distance” from the
> > +processor. For example, there might be a bank of memory assigned to each CPU or
> > +a bank of memory very suitable for DMA near peripheral devices.
>
> Absolutely wonderfully written.
Thanks to Mel :)
> Perhaps put a sub-heading for NUMA here?
I consider all this text as an high level overview and I'd prefer to keep
it as a single piece.
> An aside, but I think it'd be a good idea to mention base pages, folios and
> folio order pretty early on as they get touched as concepts all over the place
> in physical memory (but perhaps can wait for other contribs!)
The plan is to have "Pages" section Really Soon :)
> > +
> > +Each bank is called a node and the concept is represented under Linux by a
> > +``struct pglist_data`` even if the architecture is UMA. This structure is
> > +always referenced to by it's typedef ``pg_data_t``. A pg_data_t structure
> > +for a particular node can be referenced by ``NODE_DATA(nid)`` macro where
> > +``nid`` is the ID of that node.
> > +
> > +For NUMA architectures, the node structures are allocated by the architecture
> > +specific code early during boot. Usually, these structures are allocated
> > +locally on the memory bank they represent. For UMA architectures, only one
> > +static pg_data_t structure called ``contig_page_data`` is used. Nodes will
> > +be discussed further in Section :ref:`Nodes <nodes>`
> > +
> > +Each node may be divided up into a number of blocks called zones which
> > +represent ranges within memory. These ranges are usually determined by
> > +architectural constraints for accessing the physical memory. A zone is
> > +described by a ``struct zone_struct``, typedeffed to ``zone_t`` and each zone
> > +has one of the types described below.
>
> I don't think it's quite right to say 'may' be divided up into zones, as they
> absolutely will be so (and the entire phsyical memory allocator hinges on being
> zoned, even if trivially in UMA/single zone cases).
Not necessarily. ZONE_DMA or ZONE_NORMAL may span the entire memory.
> Also it's struct zone right, not zone_struct/zone_t?
Right, thanks.
> I think it's important to clarify that a given zone does not map to a single
> struct zone, rather that a struct zone (contained within a pg_data_t object's
> array node_zones[]) represents only the portion of the zone that resides in this
> node.
>
> It's fiddly because when I talk about a zone like this I am referring to one of
> the 'classifications' of zones you mention below, e.g. ZONE_DMA, ZONE_DMA32,
> etc. but you might also want to refer to a zone as being equivalent to a struct
> zone object.
>
> I think the clearest thing however is to use the term zone to refer to each of
> the ZONE_xxx types, e.g. 'this memory is located in ZONE_NORMAL' and to clarify
> that one zone can span different individual struct zones (and thus nodes).
>
> I know it's tricky because you and others have rightly pointed out that my own
> explanation of this is confusing, and it is something I intend to rejig a bit
> myself!
The term 'zone' is indeed somewhat ambiguous, I'll try to come up with more
clear version.
> > +
> > +`ZONE_DMA` and `ZONE_DMA32`
> > + represent memory suitable for DMA by peripheral devices that cannot
> > + access all of the addressable memory. Depending on the architecture,
> > + either of these zone types or even they both can be disabled at build
> > + time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32`` configuration
> > + options. Some 64-bit platforms may need both zones as they support
> > + peripherals with different DMA addressing limitations.
>
> It might be worth pointing out ZONE_DMA spans an incredibly little range that
> probably won't matter for any peripherals this side of the cretaceous period,
On RPi4 ZONE_DMA spans 1G, which is quite some part of the memory ;-)
> > +
> > +`ZONE_NORMAL`
> > + is for normal memory that can be accessed by the kernel all the time. DMA
> > + operations can be performed on pages in this zone if the DMA devices support
> > + transfers to all addressable memory. ZONE_NORMAL is always enabled.
> > +
>
> Might be worth saying 'this is where memory ends up if not otherwise in another
> zone'.
This may not be the case on !x86.
> > +`ZONE_HIGHMEM`
> > + is the part of the physical memory that is not covered by a permanent mapping
> > + in the kernel page tables. The memory in this zone is only accessible to the
> > + kernel using temporary mappings. This zone is available only some 32-bit
> > + architectures and is enabled with ``CONFIG_HIGHMEM``.
> > +
>
> I comment here only to say 'wow I am so glad I chose to only focus on 64-bit so
> I could side-step all the awkward discussion of high pages' :)
>
> > +The relation between node and zone extents is determined by the physical memory
> > +map reported by the firmware, architectural constraints for memory addressing
> > +and certain parameters in the kernel command line.
>
> Perhaps worth mentioning device tree here? Though perhaps encapsulated in the
> 'firmware' reference.
It is :)
> > +Node structure
> > +--------------
> > +
> > +The struct pglist_data is declared in `include/linux/mmzone.h
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/mmzone.h>`_.
> > +Here we briefly describe fields of this structure:
>
> Perhaps worth saying 'The node structure' just to reiterate.
Not sure I follow, can you phrase the entire sentence?
> > +
> > +General
> > +~~~~~~~
> > +
> > +`node_zones`
> > + The zones for this node. Not all of the zones may be populated, but it is
> > + the full list. It is referenced by this node's node_zonelists as well as
> > + other node's node_zonelists.
>
> Perhaps worth describing what zonelists (and equally zonerefs) are here or
> above, and that this is the canonical place where zones reside. Maybe reference
> populated_zone() and for_each_populated_zone() in reference to the fact that not
> all here may be populated?
I'd prefer to start simple and than add more content on top.
> > +
> > +`node_zonelists` The list of all zones in all nodes. This list defines the
> > + order of zones that allocations are preferred from. The `node_zonelists` is
> > + set up by build_zonelists() in mm/page_alloc.c during the initialization of
> > + core memory management structures.
> > +
> > +`nr_zones`
> > + Number of populated zones in this node.
> > +
> > +`node_mem_map`
> > + For UMA systems that use FLATMEM memory model the 0's node (and the only)
> > + `node_mem_map` is array of struct pages representing each physical frame.
> > +
> > +`node_page_ext`
> > + For UMA systems that use FLATMEM memory model the 0's (and the only) node
> > + `node_mem_map` is array of extensions of struct pages. Available only in the
> > + kernels built with ``CONFIG_PAGE_EXTENTION`` enabled.
> > +
> > +`node_start_pfn`
> > + The page frame number of the starting page frame in this node.
> > +
> > +`node_present_pages`
> > + Total number of physical pages present in this node.
> > +
> > +`node_spanned_pages`
> > + Total size of physical page range, including holes.
> > +
>
> I think it'd be useful to discuss briefly the meaning of managed, spanned and
> present pages in the context of zones.
This will be a part of the Zones section.
> Cheers, Lorenzo
On Mon, Jan 09, 2023 at 03:33:15PM +0200, Mike Rapoport wrote:
> > Absolutely wonderfully written.
>
> Thanks to Mel :)
>
I should have known :)
> > > +Each node may be divided up into a number of blocks called zones which
> > > +represent ranges within memory. These ranges are usually determined by
> > > +architectural constraints for accessing the physical memory. A zone is
> > > +described by a ``struct zone_struct``, typedeffed to ``zone_t`` and each zone
> > > +has one of the types described below.
> >
> > I don't think it's quite right to say 'may' be divided up into zones, as they
> > absolutely will be so (and the entire phsyical memory allocator hinges on being
> > zoned, even if trivially in UMA/single zone cases).
>
> Not necessarily. ZONE_DMA or ZONE_NORMAL may span the entire memory.
I see what you mean, here again we get the confusion around zones as a term (And
Willy has yet to propose a 'zolio' :), what I meant to say is that every byte of
memory is in a zone, though a zone may span a node, multiple nodes or all nodes.
> > > +
> > > +`ZONE_DMA` and `ZONE_DMA32`
> > > + represent memory suitable for DMA by peripheral devices that cannot
> > > + access all of the addressable memory. Depending on the architecture,
> > > + either of these zone types or even they both can be disabled at build
> > > + time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32`` configuration
> > > + options. Some 64-bit platforms may need both zones as they support
> > > + peripherals with different DMA addressing limitations.
> >
> > It might be worth pointing out ZONE_DMA spans an incredibly little range that
> > probably won't matter for any peripherals this side of the cretaceous period,
>
> On RPi4 ZONE_DMA spans 1G, which is quite some part of the memory ;-)
>
Ah yeah that's another weirdness, my asahi laptop actually puts everything into
ZONE_DMA so fair point. Arches do complicate things... (hence why I limit my
scope to only one)
> > > +
> > > +`ZONE_NORMAL`
> > > + is for normal memory that can be accessed by the kernel all the time. DMA
> > > + operations can be performed on pages in this zone if the DMA devices support
> > > + transfers to all addressable memory. ZONE_NORMAL is always enabled.
> > > +
> >
> > Might be worth saying 'this is where memory ends up if not otherwise in another
> > zone'.
>
> This may not be the case on !x86.
Yeah again, I am being a fool because I keep burying in my mind the fact that my
Asahi laptop literally doesn't do this... :) I think in 'principle' though it
still is where things should go unless you just decide to have the first zone
only? But in any case, I think then the original explanation is better.
>
> > > +`ZONE_HIGHMEM`
> > > + is the part of the physical memory that is not covered by a permanent mapping
> > > + in the kernel page tables. The memory in this zone is only accessible to the
> > > + kernel using temporary mappings. This zone is available only some 32-bit
> > > + architectures and is enabled with ``CONFIG_HIGHMEM``.
> > > +
> >
> > I comment here only to say 'wow I am so glad I chose to only focus on 64-bit so
> > I could side-step all the awkward discussion of high pages' :)
> >
> > > +The relation between node and zone extents is determined by the physical memory
> > > +map reported by the firmware, architectural constraints for memory addressing
> > > +and certain parameters in the kernel command line.
> >
> > Perhaps worth mentioning device tree here? Though perhaps encapsulated in the
> > 'firmware' reference.
>
> It is :)
Ack, and that makes sense
>
> > > +Node structure
> > > +--------------
> > > +
> > > +The struct pglist_data is declared in `include/linux/mmzone.h
> > > +<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/mmzone.h>`_.
> > > +Here we briefly describe fields of this structure:
> >
> > Perhaps worth saying 'The node structure' just to reiterate.
>
> Not sure I follow, can you phrase the entire sentence?
>
Sorry I wasn't clear here, I meant to say simply reiterate that the pglist_data
struct is the one describing a node.
> > > +
> > > +General
> > > +~~~~~~~
> > > +
> > > +`node_zones`
> > > + The zones for this node. Not all of the zones may be populated, but it is
> > > + the full list. It is referenced by this node's node_zonelists as well as
> > > + other node's node_zonelists.
> >
> > Perhaps worth describing what zonelists (and equally zonerefs) are here or
> > above, and that this is the canonical place where zones reside. Maybe reference
> > populated_zone() and for_each_populated_zone() in reference to the fact that not
> > all here may be populated?
>
> I'd prefer to start simple and than add more content on top.
>
Absolutely, makes sense!
> > > +
> > > +`node_zonelists` The list of all zones in all nodes. This list defines the
> > > + order of zones that allocations are preferred from. The `node_zonelists` is
> > > + set up by build_zonelists() in mm/page_alloc.c during the initialization of
> > > + core memory management structures.
> > > +
> > > +`nr_zones`
> > > + Number of populated zones in this node.
> > > +
> > > +`node_mem_map`
> > > + For UMA systems that use FLATMEM memory model the 0's node (and the only)
> > > + `node_mem_map` is array of struct pages representing each physical frame.
> > > +
> > > +`node_page_ext`
> > > + For UMA systems that use FLATMEM memory model the 0's (and the only) node
> > > + `node_mem_map` is array of extensions of struct pages. Available only in the
> > > + kernels built with ``CONFIG_PAGE_EXTENTION`` enabled.
> > > +
> > > +`node_start_pfn`
> > > + The page frame number of the starting page frame in this node.
> > > +
> > > +`node_present_pages`
> > > + Total number of physical pages present in this node.
> > > +
> > > +`node_spanned_pages`
> > > + Total size of physical page range, including holes.
> > > +
> >
> > I think it'd be useful to discuss briefly the meaning of managed, spanned and
> > present pages in the context of zones.
>
> This will be a part of the Zones section.
Makes sense again!
Overall it's very good. Nitpicking here really!
On Sat, Jan 07, 2023 at 10:55:26AM +0700, Bagas Sanjaya wrote:
> On Sun, Jan 01, 2023 at 11:45:23AM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
>
> No patch description really?
The subject says it all, but I can copy it to the description as well.
> > +Each node may be divided up into a number of blocks called zones which
> > +represent ranges within memory. These ranges are usually determined by
> > +architectural constraints for accessing the physical memory. A zone is
> > +described by a ``struct zone_struct``, typedeffed to ``zone_t`` and each zone
> > +has one of the types described below.
> > +
> > +`ZONE_DMA` and `ZONE_DMA32`
> > + represent memory suitable for DMA by peripheral devices that cannot
> > + access all of the addressable memory. Depending on the architecture,
> > + either of these zone types or even they both can be disabled at build
> > + time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32`` configuration
> > + options. Some 64-bit platforms may need both zones as they support
> > + peripherals with different DMA addressing limitations.
> > +
> > +`ZONE_NORMAL`
> > + is for normal memory that can be accessed by the kernel all the time. DMA
> > + operations can be performed on pages in this zone if the DMA devices support
> > + transfers to all addressable memory. ZONE_NORMAL is always enabled.
> > +
> > +`ZONE_HIGHMEM`
> > + is the part of the physical memory that is not covered by a permanent mapping
> > + in the kernel page tables. The memory in this zone is only accessible to the
> > + kernel using temporary mappings. This zone is available only some 32-bit
> > + architectures and is enabled with ``CONFIG_HIGHMEM``.
> > +
> > +`ZONE_MOVABLE`
> > + is for normal accessible memory, just like ZONE_NORMAL. The difference is
> > + that most pages in ZONE_MOVABLE are movable. That means that while virtual
> > + addresses of these pages do not change, their content may move between
> > + different physical pages. ZONE_MOVABLE is only enabled when one of
> > + `kernelcore`, `movablecore` and `movable_node` parameters is present in the
> > + kernel command line. See :ref:`Page migration <page_migration>` for
> > + additional details.
> > +
> > +`ZONE_DEVICE`
> > + represents memory residing on devices such as PMEM and GPU. It has different
> > + characteristics than RAM zone types and it exists to provide :ref:`struct
> > + page <Pages>` and memory map services for device driver identified physical
> > + address ranges. ZONE_DEVICE is enabled with configuration option
> > + ``CONFIG_ZONE_DEVICE``.
>
> I think bullet lists should do the job better, since the zone names are
> connected directly to their representations:
Agree.
> > +For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
> > +entire memory will be on node 0 and there will be three zones: ZONE_DMA,
> > +ZONE_NORMAL and ZONE_HIGHMEM::
> > +
> > + 0 2G
> > + +-------------------------------------------------------------+
> > + | node 0 |
> > + +-------------------------------------------------------------+
> > +
> > + 0 16M 896M 2G
> > + +----------+-----------------------+--------------------------+
> > + | ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM |
> > + +----------+-----------------------+--------------------------+
> > +
> > +
> > +With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled and booted
> > +with `movablecore=80%` parameter on an arm64 machine with 16 Gbytes of RAM
> > +equally split between two nodes, there will be ZONE_DMA32, ZONE_NORMAL and
> > +ZONE_MOVABLE on node 0, and ZONE_NORMAL and ZONE_MOVABLE on node 1::
> > +
> > +
> > + 1G 9G 17G
> > + +--------------------------------+ +--------------------------+
> > + | node 0 | | node 1 |
> > + +--------------------------------+ +--------------------------+
> > +
> > + 1G 4G 4200M 9G 9320M 17G
> > + +---------+----------+-----------+ +------------+-------------+
> > + | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
> > + +---------+----------+-----------+ +------------+-------------+
>
> I see inconsistency of formatting keywords: some are in inline code and some
> are not. I'm leaning towards inlining them all:
Sure, thanks for the patch :)
> > +For various operations possible with nodemasks please refer to
> > +`include/linux/nodemask.h
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/nodemask.h>`_.
>
> Instead of linking to Linus's tree, just inline the source path:
Ok.
> > +.. _zones:
> > +
> > +Zones
> > +=====
> > +
> > +.. _pages:
> > +
> > +Pages
> > +=====
> > +
> > +.. _folios:
> > +
> > +Folios
> > +======
> > +
> > +.. _initialization:
> > +
> > +Initialization
> > +==============
>
> Are these sections stubs (no fields list for each types)? If so, add
> admonitions to inform readers:
Ok.
@@ -3,3 +3,325 @@
===============
Physical Memory
===============
+
+Linux is available for a wide range of architectures so there is a need for an
+architecture-independent abstraction to represent the physical memory. This
+chapter describes the structures used to manage physical memory in a running
+system.
+
+The first principal concept prevalent in the memory management is
+`Non-Uniform Memory Access (NUMA)
+<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
+With multi-core and multi-socket machines, memory may be arranged into banks
+that incur a different cost to access depending on the “distance” from the
+processor. For example, there might be a bank of memory assigned to each CPU or
+a bank of memory very suitable for DMA near peripheral devices.
+
+Each bank is called a node and the concept is represented under Linux by a
+``struct pglist_data`` even if the architecture is UMA. This structure is
+always referred to by its typedef ``pg_data_t``. A pg_data_t structure
+for a particular node can be referenced by ``NODE_DATA(nid)`` macro where
+``nid`` is the ID of that node.
+
+For NUMA architectures, the node structures are allocated by the architecture
+specific code early during boot. Usually, these structures are allocated
+locally on the memory bank they represent. For UMA architectures, only one
+static pg_data_t structure called ``contig_page_data`` is used. Nodes will
+be discussed further in Section :ref:`Nodes <nodes>`.
+
+Each node may be divided up into a number of blocks called zones which
+represent ranges within memory. These ranges are usually determined by
+architectural constraints for accessing the physical memory. A zone is
+described by a ``struct zone_struct``, typedeffed to ``zone_t`` and each zone
+has one of the types described below.
+
+`ZONE_DMA` and `ZONE_DMA32`
+ represent memory suitable for DMA by peripheral devices that cannot
+ access all of the addressable memory. Depending on the architecture,
+ either of these zone types or even they both can be disabled at build
+ time using ``CONFIG_ZONE_DMA`` and ``CONFIG_ZONE_DMA32`` configuration
+ options. Some 64-bit platforms may need both zones as they support
+ peripherals with different DMA addressing limitations.
+
+`ZONE_NORMAL`
+ is for normal memory that can be accessed by the kernel all the time. DMA
+ operations can be performed on pages in this zone if the DMA devices support
+ transfers to all addressable memory. ZONE_NORMAL is always enabled.
+
+`ZONE_HIGHMEM`
+ is the part of the physical memory that is not covered by a permanent mapping
+ in the kernel page tables. The memory in this zone is only accessible to the
+ kernel using temporary mappings. This zone is available only on some 32-bit
+ architectures and is enabled with ``CONFIG_HIGHMEM``.
+
+`ZONE_MOVABLE`
+ is for normal accessible memory, just like ZONE_NORMAL. The difference is
+ that most pages in ZONE_MOVABLE are movable. That means that while virtual
+ addresses of these pages do not change, their content may move between
+ different physical pages. ZONE_MOVABLE is only enabled when one of
+ `kernelcore`, `movablecore` and `movable_node` parameters is present in the
+ kernel command line. See :ref:`Page migration <page_migration>` for
+ additional details.
+
+`ZONE_DEVICE`
+ represents memory residing on devices such as PMEM and GPU. It has different
+ characteristics than RAM zone types and it exists to provide :ref:`struct
+ page <Pages>` and memory map services for device driver identified physical
+ address ranges. ZONE_DEVICE is enabled with configuration option
+ ``CONFIG_ZONE_DEVICE``.
+
+It is important to note that many kernel operations can only take place using
+ZONE_NORMAL so it is the most performance critical zone. Zones are discussed
+further in Section :ref:`Zones <zones>`.
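+
+For example, the populated zones of the system can be walked with the
+for_each_populated_zone() iterator (an illustrative sketch)::
+
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		pr_info("node %d: zone %s spans %lu pages\n",
+			zone_to_nid(zone), zone->name, zone->spanned_pages);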
+
+The relation between node and zone extents is determined by the physical memory
+map reported by the firmware, architectural constraints for memory addressing
+and certain parameters in the kernel command line.
+
+For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM
+the entire memory will be on node 0 and there will be three zones: ZONE_DMA,
+ZONE_NORMAL and ZONE_HIGHMEM::
+
+ 0 2G
+ +-------------------------------------------------------------+
+ | node 0 |
+ +-------------------------------------------------------------+
+
+ 0 16M 896M 2G
+ +----------+-----------------------+--------------------------+
+ | ZONE_DMA | ZONE_NORMAL | ZONE_HIGHMEM |
+ +----------+-----------------------+--------------------------+
+
+
+With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled and booted
+with the `movablecore=80%` parameter on an arm64 machine with 16 Gbytes of RAM
+equally split between two nodes, there will be ZONE_DMA32, ZONE_NORMAL and
+ZONE_MOVABLE on node 0, and ZONE_NORMAL and ZONE_MOVABLE on node 1::
+
+
+ 1G 9G 17G
+ +--------------------------------+ +--------------------------+
+ | node 0 | | node 1 |
+ +--------------------------------+ +--------------------------+
+
+ 1G 4G 4200M 9G 9320M 17G
+ +---------+----------+-----------+ +------------+-------------+
+ | DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
+ +---------+----------+-----------+ +------------+-------------+
+
+.. _nodes:
+
+Nodes
+=====
+
+As we have mentioned, each node is described by a ``pg_data_t``, which is a
+typedef for a ``struct pglist_data``. When allocating a page, by default
+Linux uses a node-local allocation policy to allocate memory from the node
+closest to the running CPU. As processes tend to run on the same CPU, it is
+likely the memory from the current node will be used. The allocation policy can
+be controlled by users as described in
+`Documentation/admin-guide/mm/numa_memory_policy.rst`.
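+
+For example, a page can be allocated explicitly from a particular node with
+alloc_pages_node() (an illustrative sketch, error handling omitted)::
+
+	/* allocate a single (order-0) page from node nid */
+	struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);
+
+	if (page)
+		__free_pages(page, 0);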
+
+Most NUMA architectures maintain an array of pointers to the node
+structures. The actual structures are allocated early during boot when
+architecture specific code parses the physical memory map reported by the
+firmware. The bulk of the node initialization happens slightly later in the
+boot process by the free_area_init() function, described later in Section
+:ref:`Initialization <initialization>`.
+
+
+Along with the node structures, the kernel maintains an array of ``nodemask_t``
+bitmasks called `node_states`. Each bitmask in this array represents a set of
+nodes with particular properties as defined by `enum node_states`:
+
+`N_POSSIBLE`
+ The node could become online at some point.
+`N_ONLINE`
+ The node is online.
+`N_NORMAL_MEMORY`
+ The node has regular memory.
+`N_HIGH_MEMORY`
+ The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled,
+ it is aliased to `N_NORMAL_MEMORY`.
+`N_MEMORY`
+ The node has memory (regular, high or movable).
+`N_CPU`
+ The node has one or more CPUs.
+
+For each node that has a property described above, the bit corresponding to the
+node ID in the ``node_states[<property>]`` bitmask is set.
+
+For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::
+
+ node_states[N_POSSIBLE]
+ node_states[N_ONLINE]
+ node_states[N_NORMAL_MEMORY]
+ node_states[N_MEMORY]
+ node_states[N_CPU]
+
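+Individual bits can be tested with the node_state() helper. For instance, to
+check whether a node has memory before operating on its node structure::
+
+	if (node_state(nid, N_MEMORY))
+		foo(NODE_DATA(nid));
+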
+For various operations possible with nodemasks please refer to
+`include/linux/nodemask.h
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/nodemask.h>`_.
+
+Among other things, nodemasks are used to provide macros for node traversal,
+namely `for_each_node()` and `for_each_online_node()`.
+
+For instance, to call a function foo() for each online node::
+
+ for_each_online_node(nid) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ foo(pgdat);
+ }
+
+Node structure
+--------------
+
+The struct pglist_data is declared in `include/linux/mmzone.h
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/mmzone.h>`_.
+Here we briefly describe the fields of this structure:
+
+General
+~~~~~~~
+
+`node_zones`
+ The zones for this node. Not all of the zones may be populated, but it is
+ the full list. It is referenced by this node's node_zonelists as well as
+ other nodes' node_zonelists.
+
+`node_zonelists`
+ The list of all zones in all nodes. This list defines the order of zones
+ that allocations are preferred from. The `node_zonelists` is set up by
+ build_zonelists() in mm/page_alloc.c during the initialization of core
+ memory management structures.
+
+`nr_zones`
+ Number of populated zones in this node.
+
+`node_mem_map`
+ For UMA systems that use the FLATMEM memory model, node 0's (and only)
+ `node_mem_map` is an array of struct pages representing each physical frame.
+
+`node_page_ext`
+ For UMA systems that use the FLATMEM memory model, node 0's (and only)
+ `node_page_ext` is an array of extensions of struct pages. Available only
+ in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.
+
+`node_start_pfn`
+ The page frame number of the starting page frame in this node.
+
+`node_present_pages`
+ Total number of physical pages present in this node.
+
+`node_spanned_pages`
+ Total size of physical page range, including holes.
+
+`node_size_lock`
+ A lock that protects the fields defining the node extents. Only defined when
+ at least one of ``CONFIG_MEMORY_HOTPLUG`` or
+ ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options are enabled.
+
+ pgdat_resize_lock() and pgdat_resize_unlock() are provided to manipulate
+ node_size_lock without checking for CONFIG_MEMORY_HOTPLUG or
+ CONFIG_DEFERRED_STRUCT_PAGE_INIT.
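+
+ For example, a sketch of reading the node extents under this lock::
+
+	unsigned long flags, start_pfn, nr_pages;
+
+	pgdat_resize_lock(pgdat, &flags);
+	start_pfn = pgdat->node_start_pfn;
+	nr_pages = pgdat->node_spanned_pages;
+	pgdat_resize_unlock(pgdat, &flags);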
+
+`node_id`
+ The Node ID (NID) of the node, starts at 0.
+
+`totalreserve_pages`
+ This is a per-node reserve of pages that are not available to userspace
+ allocations.
+
+`first_deferred_pfn`
+ If memory initialization on large machines is deferred then this is the first
+ PFN that needs to be initialized. Defined only when
+ ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.
+
+`deferred_split_queue`
+ Per-node queue of huge pages whose split was deferred. Defined only when
+ ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.
+
+`__lruvec`
+ Per-node lruvec holding LRU lists and related parameters. Used only when
+ memory cgroups are disabled. It should not be accessed directly, use
+ mem_cgroup_lruvec() to look up lruvecs instead.
+
+Reclaim control
+~~~~~~~~~~~~~~~
+
+See also :ref:`Page Reclaim <page_reclaim>`.
+
+`kswapd`
+ Per-node instance of kswapd kernel thread.
+
+`kswapd_wait`, `pfmemalloc_wait`, `reclaim_wait`
+ Workqueues used to synchronize memory reclaim tasks.
+
+`nr_writeback_throttled`
+ Number of tasks that are throttled waiting on dirty pages to clean.
+
+`nr_reclaim_start`
+ Number of pages written while reclaim is throttled waiting for writeback.
+
+`kswapd_order`
+ Controls the order kswapd tries to reclaim.
+
+`kswapd_highest_zoneidx`
+ The highest zone index to be reclaimed by kswapd.
+
+`kswapd_failures`
+ Number of runs kswapd was unable to reclaim any pages.
+
+`min_unmapped_pages`
+ Minimal number of unmapped file-backed pages that cannot be reclaimed.
+ Determined by the ``vm.min_unmapped_ratio`` sysctl. Only defined when
+ ``CONFIG_NUMA`` is enabled.
+
+`min_slab_pages`
+ Minimal number of SLAB pages that cannot be reclaimed. Determined by the
+ ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.
+
+`flags`
+ Flags controlling reclaim behavior.
+
+Compaction control
+~~~~~~~~~~~~~~~~~~
+
+`kcompactd_max_order`
+ Page order that kcompactd should try to achieve.
+
+`kcompactd_highest_zoneidx`
+ The highest zone index to be compacted by kcompactd.
+
+`kcompactd_wait`
+ Workqueue used to synchronize memory compaction tasks.
+
+`kcompactd`
+ Per-node instance of kcompactd kernel thread.
+
+`proactive_compact_trigger`
+ Determines if proactive compaction is enabled. Controlled by the
+ ``vm.compaction_proactiveness`` sysctl.
+
+Statistics
+~~~~~~~~~~
+
+`per_cpu_nodestats`
+ Per-CPU VM statistics for the node.
+
+`vm_stat`
+ VM statistics for the node.
+
+.. _zones:
+
+Zones
+=====
+
+.. _pages:
+
+Pages
+=====
+
+.. _folios:
+
+Folios
+======
+
+.. _initialization:
+
+Initialization
+==============