[bpf-next,v4,2/2] docs/bpf: Add LRU internals description and graph
Commit Message
Extend the bpf hashmap docs to include a brief description of the
internals of the LRU map type (setting appropriate API expectations),
including the original commit message from Martin and a variant on the
graph that I had presented during my Linux Plumbers Conference 2022 talk
on "Pressure feedback for LRU map types"[0].
The node names in the dot file correspond roughly to the functions where
the logic for those decisions or steps is defined, to help curious
developers to cross-reference and update this logic if the details of
the LRU implementation ever differ from this description.
[0]: https://lpc.events/event/16/contributions/1368/
Signed-off-by: Joe Stringer <joe@isovalent.com>
---
v4: Move UAPI descriptions outside of the internals section
Fix function reference discrepancies in dot source
Fix incorrect flag references (missing F_)
Simplify logic at bottom of graph for map updates
Add missing return codes to graph for failure cases
v3: Use standard table syntax
Replace inline commit message with reference to commit
Fix incorrect Y/N label for common LRU check
Rename some dotfile variables to reduce confusion between cases
Minor wording touchups
v2: Fix issue that caused initial email submission to fail
---
Documentation/bpf/map_hash.rst | 42 ++++++
Documentation/bpf/map_lru_hash_update.dot | 172 ++++++++++++++++++++++
2 files changed, 214 insertions(+)
create mode 100644 Documentation/bpf/map_lru_hash_update.dot
Comments
Hi Joe,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Joe-Stringer/docs-bpf-Add-LRU-internals-description-and-graph/20230402-040757
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20230401200651.1022113-2-joe%40isovalent.com
patch subject: [PATCH bpf-next v4 2/2] docs/bpf: Add LRU internals description and graph
reproduce:
# https://github.com/intel-lab-lkp/linux/commit/0c42be421b73fffe9160867ac673ea9841982ece
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Joe-Stringer/docs-bpf-Add-LRU-internals-description-and-graph/20230402-040757
git checkout 0c42be421b73fffe9160867ac673ea9841982ece
make menuconfig
# enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, CONFIG_WARN_ABI_ERRORS
make htmldocs
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304022107.IwHc05cs-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> Warning: Orthogonal edges do not currently handle edge labels. Try using xlabels.
On Sun, Apr 02, 2023 at 09:47:49PM +0800, kernel test robot wrote:
> All warnings (new ones prefixed by >>):
>
> >> Warning: Orthogonal edges do not currently handle edge labels. Try using xlabels.
>
Hi,
I can't reproduce the warning above. My system has graphviz 2.42.2
installed (via Debian package). What graphviz version do kernel test
robot have?
On Sat, Apr 01, 2023 at 01:06:51PM -0700, Joe Stringer wrote:
> diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
> index 45d923cd16c4..ddc961f98b27 100644
> --- a/Documentation/bpf/map_hash.rst
> +++ b/Documentation/bpf/map_hash.rst
> @@ -1,5 +1,6 @@
> .. SPDX-License-Identifier: GPL-2.0-only
> .. Copyright (C) 2022 Red Hat, Inc.
> +.. Copyright (C) 2022-2023 Isovalent, Inc.
>
> ===============================================
> BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
> @@ -215,3 +216,44 @@ Userspace walking the map elements from the map declared above:
> cur_key = &next_key;
> }
> }
> +
> +Internals
> +=========
> +
> +This section of the document is targeted at Linux developers and describes
> +aspects of the map implementations that are not considered stable ABI. The
> +following details are subject to change in future versions of the kernel.
> +
> +``BPF_MAP_TYPE_LRU_HASH`` and variants
> +--------------------------------------
> +
> +Updating elements in LRU maps may trigger eviction behaviour when the capacity
> +of the map is reached. There are various steps that the update algorithm
> +attempts in order to enforce the LRU property which have increasing impacts on
> +other CPUs involved in the following operation attempts:
> +
> +- Attempt to use CPU-local state to batch operations
> +- Attempt to fetch free nodes from global lists
> +- Attempt to pull any node from a global list and remove it from the hashmap
> +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> +
> +This algorithm is described visually in the following diagram. See the
> +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
> +the corresponding operations:
> +
> +.. kernel-figure:: map_lru_hash_update.dot
> + :alt: Diagram outlining the LRU eviction steps taken during map update.
> +
> + LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
> + variants. See the dot file source for kernel function name code references.
> +
> +Map updates start from the oval in the top right "begin ``bpf_map_update()``"
> +and progress through the graph towards the bottom where the result may be
> +either a successful update or a failure with various error codes. The key in
> +the top right provides indicators for which locks may be involved in specific
> +operations. This is intended as a visual hint for reasoning about how map
> +contention may impact update operations, though the map type and flags may
> +impact the actual contention on those locks, based on the logic described in
> +the table above. For instance, if the map is created with type
> +``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_F_NO_COMMON_LRU`` then all map
> +properties would be per-cpu.
The doc LGTM, thanks!
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
On Mon, 03 Apr 2023 at 10:41:27 +0700, Bagas Sanjaya wrote:
> On Sun, Apr 02, 2023 at 09:47:49PM +0800, kernel test robot wrote:
> > All warnings (new ones prefixed by >>):
> >
> > >> Warning: Orthogonal edges do not currently handle edge labels. Try using xlabels.
> >
>
> Hi,
>
> I can't reproduce the warning above. My system has graphviz 2.42.2
> installed (via Debian package). What graphviz version do kernel test
> robot have?
I have the same warning on Arch Linux.
$ dot --version
dot - graphviz version 7.1.0 (0)
I can reproduce this issue, will fix it up.
$ dot -V
dot - graphviz version 2.43.0 (0
On Mon, Apr 3, 2023 at 1:49 AM Maxim Mikityanskiy <maxtram95@gmail.com> wrote:
>
> On Mon, 03 Apr 2023 at 10:41:27 +0700, Bagas Sanjaya wrote:
> > On Sun, Apr 02, 2023 at 09:47:49PM +0800, kernel test robot wrote:
> > > All warnings (new ones prefixed by >>):
> > >
> > > >> Warning: Orthogonal edges do not currently handle edge labels. Try using xlabels.
> > >
> >
> > Hi,
> >
> > I can't reproduce the warning above. My system has graphviz 2.42.2
> > installed (via Debian package). What graphviz version do kernel test
> > robot have?
>
> I have the same warning on Arch Linux.
>
> $ dot --version
> dot - graphviz version 7.1.0 (0)
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
+.. Copyright (C) 2022-2023 Isovalent, Inc.
===============================================
BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
@@ -215,3 +216,44 @@ Userspace walking the map elements from the map declared above:
cur_key = &next_key;
}
}
+
+Internals
+=========
+
+This section of the document is targeted at Linux developers and describes
+aspects of the map implementations that are not considered stable ABI. The
+following details are subject to change in future versions of the kernel.
+
+``BPF_MAP_TYPE_LRU_HASH`` and variants
+--------------------------------------
+
+Updating elements in LRU maps may trigger eviction behaviour when the capacity
+of the map is reached. There are various steps that the update algorithm
+attempts in order to enforce the LRU property which have increasing impacts on
+other CPUs involved in the following operation attempts:
+
+- Attempt to use CPU-local state to batch operations
+- Attempt to fetch free nodes from global lists
+- Attempt to pull any node from a global list and remove it from the hashmap
+- Attempt to pull any node from any CPU's list and remove it from the hashmap
+
+This algorithm is described visually in the following diagram. See the
+description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
+the corresponding operations:
+
+.. kernel-figure:: map_lru_hash_update.dot
+ :alt: Diagram outlining the LRU eviction steps taken during map update.
+
+ LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
+ variants. See the dot file source for kernel function name code references.
+
+Map updates start from the oval in the top right "begin ``bpf_map_update()``"
+and progress through the graph towards the bottom where the result may be
+either a successful update or a failure with various error codes. The key in
+the top right provides indicators for which locks may be involved in specific
+operations. This is intended as a visual hint for reasoning about how map
+contention may impact update operations, though the map type and flags may
+impact the actual contention on those locks, based on the logic described in
+the table above. For instance, if the map is created with type
+``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_F_NO_COMMON_LRU`` then all map
+properties would be per-cpu.
new file mode 100644
@@ -0,0 +1,172 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright (C) 2022-2023 Isovalent, Inc.
+digraph {
+ node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes
+ graph [splines=ortho, nodesep=1]
+
+ subgraph cluster_key {
+ label = "Key\n(locks held during operation)";
+ rankdir = TB;
+
+ remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"]
+ hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"]
+ lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"]
+ local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"]
+ no_lock [shape=rectangle,label="no locks held"]
+ }
+
+ begin [shape=oval,label="begin\nbpf_map_update()"]
+
+ // Nodes below with an 'fn_' prefix are roughly labeled by the C function
+ // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c.
+ // Number suffixes and errno suffixes handle subsections of the corresponding
+ // logic in the function as of the writing of this dot.
+
+ // cf. __local_list_pop_free() / bpf_percpu_lru_pop_free()
+ local_freelist_check [shape=diamond,fillcolor=1,
+ label="Local freelist\nnode available?"];
+ use_local_node [shape=rectangle,
+ label="Use node owned\nby this CPU"]
+
+ // cf. bpf_lru_pop_free()
+ common_lru_check [shape=diamond,
+ label="Map created with\ncommon LRU?\n(!BPF_F_NO_COMMON_LRU)"];
+
+ fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2,
+ label="Flush local pending,
+ Rotate Global list, move
+ LOCAL_FREE_TARGET
+ from global -> local"]
+ // Also corresponds to:
+ // fn__local_list_flush()
+ // fn_bpf_lru_list_rotate()
+ fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2,
+ label="Able to free\nLOCAL_FREE_TARGET\nnodes?"]
+
+ fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3,
+ label="Shrink inactive list
+ up to remaining
+ LOCAL_FREE_TARGET
+ (global LRU -> local)"]
+ fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2,
+ label="> 0 entries in\nlocal free list?"]
+ fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2,
+ label="Steal one node from
+ inactive, or if empty,
+ from active global list"]
+ fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3,
+ label="Try to remove\nnode from hashtab"]
+
+ local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"]
+ common_lru_check2 [shape=diamond,
+ label="Map created with\ncommon LRU?\n(!BPF_F_NO_COMMON_LRU)"];
+
+ subgraph cluster_remote_lock {
+ label = "Iterate through CPUs\n(start from current)";
+ style = dashed;
+ rankdir=LR;
+
+ local_freelist_check5 [shape=diamond,fillcolor=4,
+ label="Steal a node from\nper-cpu freelist?"]
+ local_freelist_check6 [shape=rectangle,fillcolor=4,
+ label="Steal a node from
+ (1) Unreferenced pending, or
+ (2) Any pending node"]
+ local_freelist_check7 [shape=rectangle,fillcolor=3,
+ label="Try to remove\nnode from hashtab"]
+ fn_htab_lru_map_update_elem [shape=diamond,
+ label="Stole node\nfrom remote\nCPU?"]
+ fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"]
+ // Also corresponds to:
+ // use_local_node()
+ // fn__local_list_pop_pending()
+ }
+
+ fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle,
+ label="Use node that was\nnot recently referenced"]
+ local_freelist_check4 [shape=rectangle,
+ label="Use node that was\nactively referenced\nin global list"]
+ fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"]
+ fn_htab_lru_map_update_elem3 [shape=rectangle,
+ label="Use node that was\nactively referenced\nin (another?) CPU's cache"]
+ fn_htab_lru_map_update_elem4 [shape=rectangle,fillcolor=3,
+ label="Update hashmap\nwith new element"]
+ fn_htab_lru_map_update_elem5 [shape=oval,label="return 0"]
+ fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"]
+ fn_htab_lru_map_update_elem_EEXIST [shape=oval,label="return -EEXIST"]
+ fn_htab_lru_map_update_elem_ENOENT [shape=oval,label="return -ENOENT"]
+
+ begin -> local_freelist_check
+ local_freelist_check -> use_local_node [xlabel="Y"]
+ local_freelist_check -> common_lru_check [xlabel="N"]
+ common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"]
+ common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"]
+ fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free
+ fn___bpf_lru_node_move_to_free ->
+ fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"]
+ fn___bpf_lru_node_move_to_free ->
+ fn___bpf_lru_list_shrink_inactive [xlabel="N"]
+ fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink
+ fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"]
+ fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"]
+ fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3
+ fn___bpf_lru_list_shrink3 -> local_freelist_check2
+ local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"]
+ local_freelist_check2 -> common_lru_check2 [xlabel = "N"]
+ common_lru_check2 -> local_freelist_check5 [xlabel = "Y"]
+ common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"]
+ local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"]
+ local_freelist_check5 -> local_freelist_check6 [xlabel = "N"]
+ local_freelist_check6 -> local_freelist_check7
+ local_freelist_check7 -> fn_htab_lru_map_update_elem
+
+ fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"]
+ fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2 [xlabel = "N"]
+ fn_htab_lru_map_update_elem2 ->
+ fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"]
+ fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"]
+ fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4
+
+ use_local_node -> fn_htab_lru_map_update_elem4
+ fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4
+ local_freelist_check4 -> fn_htab_lru_map_update_elem4
+
+ fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [label="Success"]
+ fn_htab_lru_map_update_elem4 ->
+ fn_htab_lru_map_update_elem_EBUSY [xlabel="Hashtab lock failed"]
+ fn_htab_lru_map_update_elem4 ->
+ fn_htab_lru_map_update_elem_EEXIST [xlabel="BPF_EXIST set and\nkey already exists"]
+ fn_htab_lru_map_update_elem4 ->
+ fn_htab_lru_map_update_elem_ENOENT [label="BPF_NOEXIST set\nand no such entry"]
+
+ // Create invisible pad nodes to line up various nodes
+ pad0 [style=invis]
+ pad1 [style=invis]
+ pad2 [style=invis]
+ pad3 [style=invis]
+ pad4 [style=invis]
+
+ // Line up the key with the top of the graph
+ no_lock -> local_lock [style=invis]
+ local_lock -> lru_lock [style=invis]
+ lru_lock -> hash_lock [style=invis]
+ hash_lock -> remote_lock [style=invis]
+ remote_lock -> local_freelist_check5 [style=invis]
+ remote_lock -> fn___bpf_lru_list_shrink [style=invis]
+
+ // Line up return code nodes at the bottom of the graph
+ fn_htab_lru_map_update_elem -> pad0 [style=invis]
+ pad0 -> pad1 [style=invis]
+ pad1 -> pad2 [style=invis]
+ //pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis]
+ fn_htab_lru_map_update_elem4 -> pad3 [style=invis]
+ pad3 -> fn_htab_lru_map_update_elem5 [style=invis]
+ pad3 -> fn_htab_lru_map_update_elem_EBUSY [style=invis]
+ pad3 -> fn_htab_lru_map_update_elem_EEXIST [style=invis]
+ pad3 -> fn_htab_lru_map_update_elem_ENOENT [style=invis]
+
+ // Reduce diagram width by forcing some nodes to appear above others
+ local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis]
+ common_lru_check2 -> pad4 [style=invis]
+ pad4 -> local_freelist_check5 [style=invis]
+}