[bpf-next,v3] docs/bpf: Add LRU internals description and graph

Message ID 20230312190600.324573-1-joe@isovalent.com
State New
Headers
Series [bpf-next,v3] docs/bpf: Add LRU internals description and graph |

Commit Message

Joe Stringer March 12, 2023, 7:05 p.m. UTC
  Extend the bpf hashmap docs to include a brief description of the
internals of the LRU map type (setting appropriate API expectations),
including the original commit message from Martin and a variant on the
graph that I had presented during my Linux Plumbers Conference 2022 talk
on "Pressure feedback for LRU map types"[0].

The node names in the dot file correspond roughly to the functions where
the logic for those decisions or steps is defined, to help curious
developers to cross-reference and update this logic if the details of
the LRU implementation ever differ from this description.

[0]: https://lpc.events/event/16/contributions/1368/

Signed-off-by: Joe Stringer <joe@isovalent.com>
---
v3: Use standard table syntax
    Replace inline commit message with reference to commit
    Fix incorrect Y/N label for common LRU check
    Rename some dotfile variables to reduce confusion between cases
    Minor wording touchups
v2: Fix issue that caused initial email submission to fail
---
 Documentation/bpf/map_hash.rst            |  62 ++++++++
 Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
 2 files changed, 228 insertions(+)
 create mode 100644 Documentation/bpf/map_lru_hash_update.dot
  

Comments

Bagas Sanjaya March 13, 2023, 3:54 a.m. UTC | #1
On Sun, Mar 12, 2023 at 12:05:59PM -0700, Joe Stringer wrote:
> +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> +may fail to insert the entry into the map if other CPUs are heavily contending
> +on the global hashmap lock.

"Even if an LRU node can be acquired, ..."

> +
> +This algorithm is described visually in the following diagram. See the
> +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
> +the corresponding operations:
> +
> +.. kernel-figure::  map_lru_hash_update.dot
> +   :alt:    Diagram outlining the LRU eviction steps taken during map update
> +
> +   LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
> +   variants
> +
> <snipped> ...
> +The dot file source for the above figure uses internal kernel function names
> +for the node names in order to make the corresponding logic easier to find.

Shouldn't the figure note above be in :alt:?

Otherwise LGTM.
  
Joe Stringer March 13, 2023, 7:18 p.m. UTC | #2
On Sun, Mar 12, 2023 at 8:54 PM Bagas Sanjaya <bagasdotme@gmail.com> wrote:
>
> On Sun, Mar 12, 2023 at 12:05:59PM -0700, Joe Stringer wrote:
> > +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> > +may fail to insert the entry into the map if other CPUs are heavily contending
> > +on the global hashmap lock.
>
> "Even if an LRU node can be acquired, ..."

Ack.

> > +
> > +This algorithm is described visually in the following diagram. See the
> > +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
> > +the corresponding operations:
> > +
> > +.. kernel-figure::  map_lru_hash_update.dot
> > +   :alt:    Diagram outlining the LRU eviction steps taken during map update
> > +
> > +   LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
> > +   variants
> > +
> > <snipped> ...
> > +The dot file source for the above figure uses internal kernel function names
> > +for the node names in order to make the corresponding logic easier to find.
>
> Shouldn't the figure note above be in :alt:?

Do you mean alt or caption? Alt will hide the information from most developers.

> Otherwise LGTM.

Thanks for the review!
  
John Fastabend March 14, 2023, 5:21 a.m. UTC | #3
Joe Stringer wrote:
> Extend the bpf hashmap docs to include a brief description of the
> internals of the LRU map type (setting appropriate API expectations),
> including the original commit message from Martin and a variant on the
> graph that I had presented during my Linux Plumbers Conference 2022 talk
> on "Pressure feedback for LRU map types"[0].
> 
> The node names in the dot file correspond roughly to the functions where
> the logic for those decisions or steps is defined, to help curious
> developers to cross-reference and update this logic if the details of
> the LRU implementation ever differ from this description.
> 
> [0]: https://lpc.events/event/16/contributions/1368/
> 
> Signed-off-by: Joe Stringer <joe@isovalent.com>


Thanks couple nits below

> ---
> v3: Use standard table syntax
>     Replace inline commit message with reference to commit
>     Fix incorrect Y/N label for common LRU check
>     Rename some dotfile variables to reduce confusion between cases
>     Minor wording touchups
> v2: Fix issue that caused initial email submission to fail
> ---
>  Documentation/bpf/map_hash.rst            |  62 ++++++++
>  Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
>  2 files changed, 228 insertions(+)
>  create mode 100644 Documentation/bpf/map_lru_hash_update.dot
> diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
> index 8669426264c6..61602ce26561 100644
> --- a/Documentation/bpf/map_hash.rst
> +++ b/Documentation/bpf/map_hash.rst
> @@ -1,5 +1,6 @@
>  .. SPDX-License-Identifier: GPL-2.0-only
>  .. Copyright (C) 2022 Red Hat, Inc.
> +.. Copyright (C) 2022-2023 Isovalent, Inc.
>  
>  ===============================================
>  BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
> @@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
>                      cur_key = &next_key;
>              }
>      }
> +
> +Internals
> +=========
> +
> +This section of the document is targeted at Linux developers and describes
> +aspects of the map implementations that are not considered stable ABI. The
> +following details are subject to change in future versions of the kernel.
> +
> +``BPF_MAP_TYPE_LRU_HASH`` and variants
> +--------------------------------------
> +
> +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
> +hence is indexable by key for constant time lookups. Secondly, when at map
> +capacity, map updates will trigger eviction of old entries based on the age of
> +the elements in a set of lists. Each of these properties may be either global
> +or per-CPU, depending on the map type and flags used to create the map:
> +
> ++------------------------+---------------------------+----------------------------------+
> +|                        | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
> ++========================+===========================+==================================+
> +| ``BPF_NO_COMMON_LRU``  | Per-CPU LRU, global map   | Per-CPU LRU, per-cpu map         |
> ++------------------------+---------------------------+----------------------------------+
> +| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map    | Global LRU, per-cpu map          |
> ++------------------------+---------------------------+----------------------------------+

Above all seems API to me. Maybe move the statement about not considered stable
ABI down here? Something like,

"
The internal details of which entry is evicted and acquiring a new entry
are not considered stable and may change in the future. But the current
impelementation is as follows.
"

Or something like that?

> +
> +Notably, there are various steps that the update algorithm attempts in order to
> +enforce the LRU property which have increasing impacts on other CPUs involved
> +in the following operation attempts:
> +
> +- Attempt to use CPU-local state to batch operations
> +- Attempt to fetch free nodes from global lists
> +- Attempt to pull any node from a global list and remove it from the hashmap
> +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> +
> +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> +may fail to insert the entry into the map if other CPUs are heavily contending
> +on the global hashmap lock.

Similarly this is ABI correct? Probably we can also specify the error code? 
Assuming it is just EBUSY or EAGAIN?

> +
> +This algorithm is described visually in the following diagram. See the
> +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
> +the corresponding operations:
> +
> +.. kernel-figure::  map_lru_hash_update.dot
> +   :alt:    Diagram outlining the LRU eviction steps taken during map update
> +
> +   LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
> +   variants
> +
> +Map updates start from the oval in the top right "begin ``bpf_map_update()``"
> +and progress through the graph towards the bottom where the result may be
> +either a successful update or a failure with various error codes. The key in
> +the top right provides indicators for which locks may be involved in specific
> +operations. This is intended as a visual hint for reasoning about how map
> +contention may impact update operations, though the map type and flags may
> +impact the actual contention on those locks, based on the logic described in
> +the table above. For instance, if the map is created with type
> +``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_NO_COMMON_LRU`` then all map
> +properties would be per-cpu.
> +
> +The dot file source for the above figure uses internal kernel function names
> +for the node names in order to make the corresponding logic easier to find.
> diff --git a/Documentation/bpf/map_lru_hash_update.dot b/Documentation/bpf/map_lru_hash_update.dot
> new file mode 100644
> index 000000000000..3a44ebec501e
> --- /dev/null
> +++ b/Documentation/bpf/map_lru_hash_update.dot
> @@ -0,0 +1,166 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +// Copyright (C) 2022-2023 Isovalent, Inc.
> +digraph {
> +  node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes
> +  graph [splines=ortho, nodesep=1]
> +
> +  subgraph cluster_key {
> +    label = "Key\n(locks held during operation)";
> +    rankdir = TB;
> +
> +    remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"]
> +    hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"]
> +    lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"]
> +    local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"]
> +    no_lock [shape=rectangle,label="no locks held"]
> +  }
> +
> +  begin [shape=oval,label="begin\nbpf_map_update()"]
> +
> +  // Nodes below with an 'fn_' prefix are roughly labeled by the C function
> +  // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c.
> +  // Number suffixes and errno suffixes handle subsections of the corresponding
> +  // logic in the function as of the writing of this dot.
> +
> +  // The following corresponds to bpf_lru_pop_free() for common LRU case.
> +  local_freelist_check [shape=diamond,fillcolor=1,
> +    label="Local freelist\nnode available?"];
> +  // The following corresponds to __local_list_pop_free() for common LRU case.
> +  use_local_node [shape=rectangle,
> +    label="Use node owned\nby this CPU"]
> +
> +  common_lru_check [shape=diamond,
> +    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
> +
> +  fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2,
> +    label="Flush local pending,
> +    Rotate Global list, move
> +    LOCAL_FREE_TARGET
> +    from global -> local"]
> +  // Also corresponds to:
> +  // fn__local_list_flush()
> +  // fn_bpf_lru_list_rotate()
> +  fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2,
> +    label="Able to free\nLOCAL_FREE_TARGET\nnodes?"]
> +
> +  fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3,
> +    label="Shrink inactive list
> +      up to remaining
> +      LOCAL_FREE_TARGET
> +      (global LRU -> local)"]
> +  fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2,
> +    label="> 0 entries in\nlocal free list?"]
> +  fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2,
> +    label="Steal one node from
> +      inactive, or if empty,
> +      from active global list"]
> +  fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3,
> +    label="Try to remove\nnode from hashtab"]
> +
> +  local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"]
> +  common_lru_check2 [shape=diamond,
> +    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
> +
> +  subgraph cluster_remote_lock {
> +    label = "Iterate through CPUs\n(start from current)";
> +    style = dashed;
> +    rankdir=LR;
> +
> +    local_freelist_check5 [shape=diamond,fillcolor=4,
> +      label="Steal a node from\nper-cpu freelist?"]
> +    local_freelist_check6 [shape=rectangle,fillcolor=4,
> +      label="Steal a node from
> +        (1) Unreferenced pending, or
> +        (2) Any pending node"]
> +    local_freelist_check7 [shape=rectangle,fillcolor=3,
> +      label="Try to remove\nnode from hashtab"]
> +    fn_htab_lru_map_update_elem [shape=diamond,
> +      label="Stole node\nfrom remote\nCPU?"]
> +    fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"]
> +    // Also corresponds to:
> +    // use_local_node()
> +    // fn__local_list_pop_pending()
> +  }
> +
> +  fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle,
> +    label="Use node that was\nnot recently referenced"]
> +  local_freelist_check4 [shape=rectangle,
> +    label="Use node that was\nactively referenced\nin global list"]
> +  fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"]
> +  fn_htab_lru_map_update_elem3 [shape=rectangle,
> +    label="Use node that was\nactively referenced\nin (another?) CPU's cache"]
> +  fn_htab_lru_map_update_elem4 [shape=diamond,
> +    label="Can lock this\nhashtab bucket?"]
> +  fn_htab_lru_map_update_elem5 [shape=rectangle,fillcolor=3,
> +    label="Update hashmap\nwith new element"]
> +  fn_htab_lru_map_update_elem6 [shape=oval,label="return 0"]
> +  fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"]
> +
> +  begin -> local_freelist_check
> +  local_freelist_check -> use_local_node [xlabel="Y"]
> +  local_freelist_check -> common_lru_check [xlabel="N"]
> +  common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"]
> +  common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"]
> +  fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free
> +  fn___bpf_lru_node_move_to_free ->
> +    fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"]
> +  fn___bpf_lru_node_move_to_free ->
> +    fn___bpf_lru_list_shrink_inactive [xlabel="N"]
> +  fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink
> +  fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"]
> +  fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"]
> +  fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3
> +  fn___bpf_lru_list_shrink3 -> local_freelist_check2
> +  local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"]
> +  local_freelist_check2 -> common_lru_check2 [xlabel = "N"]
> +  common_lru_check2 -> local_freelist_check5 [xlabel = "Y"]
> +  common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"]
> +  local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"]
> +  local_freelist_check5 -> local_freelist_check6 [xlabel = "N"]
> +  local_freelist_check6 -> local_freelist_check7
> +  local_freelist_check7 -> fn_htab_lru_map_update_elem
> +
> +  fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"]
> +  fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2  [xlabel = "N"]
> +  fn_htab_lru_map_update_elem2 ->
> +    fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"]
> +  fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"]
> +  fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4
> +
> +  use_local_node -> fn_htab_lru_map_update_elem4
> +  fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4
> +  local_freelist_check4 -> fn_htab_lru_map_update_elem4
> +
> +  fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [xlabel="Y"]
> +  fn_htab_lru_map_update_elem4 ->
> +    fn_htab_lru_map_update_elem_EBUSY [xlabel="N"]
> +  fn_htab_lru_map_update_elem5 -> fn_htab_lru_map_update_elem6
> +
> +  // Create invisible pad nodes to line up various nodes
> +  pad0 [style=invis]
> +  pad1 [style=invis]
> +  pad2 [style=invis]
> +  pad3 [style=invis]
> +  pad4 [style=invis]
> +
> +  // Line up the key with the top of the graph
> +  no_lock -> local_lock [style=invis]
> +  local_lock -> lru_lock [style=invis]
> +  lru_lock -> hash_lock [style=invis]
> +  hash_lock -> remote_lock [style=invis]
> +  remote_lock -> local_freelist_check5 [style=invis]
> +  remote_lock -> fn___bpf_lru_list_shrink [style=invis]
> +
> +  // Line up return code nodes at the bottom of the graph
> +  fn_htab_lru_map_update_elem -> pad0 [style=invis]
> +  pad0 -> pad1 [style=invis]
> +  pad1 -> pad2 [style=invis]
> +  pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis]
> +  fn_htab_lru_map_update_elem4 -> pad3 [style=invis]
> +  pad3 -> fn_htab_lru_map_update_elem_EBUSY  [style=invis]
> +
> +  // Reduce diagram width by forcing some nodes to appear above others
> +  local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis]
> +  common_lru_check2 -> pad4 [style=invis]
> +  pad4 -> local_freelist_check5 [style=invis]
> +}
> -- 
> 2.34.1
>
  
Maxim Mikityanskiy March 14, 2023, 1:55 p.m. UTC | #4
On Sun, Mar 12, 2023 at 12:05:59PM -0700, Joe Stringer wrote:
> Extend the bpf hashmap docs to include a brief description of the
> internals of the LRU map type (setting appropriate API expectations),
> including the original commit message from Martin and a variant on the
> graph that I had presented during my Linux Plumbers Conference 2022 talk
> on "Pressure feedback for LRU map types"[0].
> 
> The node names in the dot file correspond roughly to the functions where
> the logic for those decisions or steps is defined, to help curious
> developers to cross-reference and update this logic if the details of
> the LRU implementation ever differ from this description.
> 
> [0]: https://lpc.events/event/16/contributions/1368/
> 
> Signed-off-by: Joe Stringer <joe@isovalent.com>
> ---
> v3: Use standard table syntax
>     Replace inline commit message with reference to commit
>     Fix incorrect Y/N label for common LRU check
>     Rename some dotfile variables to reduce confusion between cases
>     Minor wording touchups
> v2: Fix issue that caused initial email submission to fail
> ---
>  Documentation/bpf/map_hash.rst            |  62 ++++++++
>  Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
>  2 files changed, 228 insertions(+)
>  create mode 100644 Documentation/bpf/map_lru_hash_update.dot
> 
> diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
> index 8669426264c6..61602ce26561 100644
> --- a/Documentation/bpf/map_hash.rst
> +++ b/Documentation/bpf/map_hash.rst
> @@ -1,5 +1,6 @@
>  .. SPDX-License-Identifier: GPL-2.0-only
>  .. Copyright (C) 2022 Red Hat, Inc.
> +.. Copyright (C) 2022-2023 Isovalent, Inc.
>  
>  ===============================================
>  BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
> @@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
>                      cur_key = &next_key;
>              }
>      }
> +
> +Internals
> +=========
> +
> +This section of the document is targeted at Linux developers and describes
> +aspects of the map implementations that are not considered stable ABI. The
> +following details are subject to change in future versions of the kernel.
> +
> +``BPF_MAP_TYPE_LRU_HASH`` and variants
> +--------------------------------------
> +
> +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
> +hence is indexable by key for constant time lookups. Secondly, when at map
> +capacity, map updates will trigger eviction of old entries based on the age of
> +the elements in a set of lists. Each of these properties may be either global
> +or per-CPU, depending on the map type and flags used to create the map:
> +
> ++------------------------+---------------------------+----------------------------------+
> +|                        | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
> ++========================+===========================+==================================+
> +| ``BPF_NO_COMMON_LRU``  | Per-CPU LRU, global map   | Per-CPU LRU, per-cpu map         |
> ++------------------------+---------------------------+----------------------------------+
> +| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map    | Global LRU, per-cpu map          |
> ++------------------------+---------------------------+----------------------------------+
> +
> +Notably, there are various steps that the update algorithm attempts in order to
> +enforce the LRU property which have increasing impacts on other CPUs involved
> +in the following operation attempts:
> +
> +- Attempt to use CPU-local state to batch operations
> +- Attempt to fetch free nodes from global lists
> +- Attempt to pull any node from a global list and remove it from the hashmap
> +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> +
> +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> +may fail to insert the entry into the map if other CPUs are heavily contending
> +on the global hashmap lock.
> +
> +This algorithm is described visually in the following diagram. See the
> +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
> +the corresponding operations:
> +
> +.. kernel-figure::  map_lru_hash_update.dot
> +   :alt:    Diagram outlining the LRU eviction steps taken during map update
> +
> +   LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
> +   variants
> +
> +Map updates start from the oval in the top right "begin ``bpf_map_update()``"
> +and progress through the graph towards the bottom where the result may be
> +either a successful update or a failure with various error codes. The key in
> +the top right provides indicators for which locks may be involved in specific
> +operations. This is intended as a visual hint for reasoning about how map
> +contention may impact update operations, though the map type and flags may
> +impact the actual contention on those locks, based on the logic described in
> +the table above. For instance, if the map is created with type
> +``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_NO_COMMON_LRU`` then all map
> +properties would be per-cpu.
> +
> +The dot file source for the above figure uses internal kernel function names
> +for the node names in order to make the corresponding logic easier to find.
> diff --git a/Documentation/bpf/map_lru_hash_update.dot b/Documentation/bpf/map_lru_hash_update.dot
> new file mode 100644
> index 000000000000..3a44ebec501e
> --- /dev/null
> +++ b/Documentation/bpf/map_lru_hash_update.dot
> @@ -0,0 +1,166 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +// Copyright (C) 2022-2023 Isovalent, Inc.
> +digraph {
> +  node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes
> +  graph [splines=ortho, nodesep=1]
> +
> +  subgraph cluster_key {
> +    label = "Key\n(locks held during operation)";
> +    rankdir = TB;
> +
> +    remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"]
> +    hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"]
> +    lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"]
> +    local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"]
> +    no_lock [shape=rectangle,label="no locks held"]
> +  }
> +
> +  begin [shape=oval,label="begin\nbpf_map_update()"]
> +
> +  // Nodes below with an 'fn_' prefix are roughly labeled by the C function
> +  // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c.
> +  // Number suffixes and errno suffixes handle subsections of the corresponding
> +  // logic in the function as of the writing of this dot.
> +
> +  // The following corresponds to bpf_lru_pop_free() for common LRU case.

The comment that points to the function and explains that it's
applicable only to one case is a great idea, it's much clearer compared
to the previous version.

I believe there are some inaccuracies, though. As far as I see it,
local_freelist_check corresponds to __local_list_pop_free in the common
LRU case, specifically, to checking its return value; use_local_node
corresponds to returning that value; and common_lru_check corresponds
to bpf_lru_pop_free (for both common and percpu LRU, that's where the
distinction is made).

> +  local_freelist_check [shape=diamond,fillcolor=1,
> +    label="Local freelist\nnode available?"];
> +  // The following corresponds to __local_list_pop_free() for common LRU case.
> +  use_local_node [shape=rectangle,
> +    label="Use node owned\nby this CPU"]
> +
> +  common_lru_check [shape=diamond,
> +    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];

Nit: the exact flag name is BPF_F_NO_COMMON_LRU.

> +
> +  fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2,
> +    label="Flush local pending,
> +    Rotate Global list, move
> +    LOCAL_FREE_TARGET
> +    from global -> local"]
> +  // Also corresponds to:
> +  // fn__local_list_flush()
> +  // fn_bpf_lru_list_rotate()
> +  fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2,
> +    label="Able to free\nLOCAL_FREE_TARGET\nnodes?"]
> +
> +  fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3,
> +    label="Shrink inactive list
> +      up to remaining
> +      LOCAL_FREE_TARGET
> +      (global LRU -> local)"]
> +  fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2,
> +    label="> 0 entries in\nlocal free list?"]
> +  fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2,
> +    label="Steal one node from
> +      inactive, or if empty,
> +      from active global list"]
> +  fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3,
> +    label="Try to remove\nnode from hashtab"]
> +
> +  local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"]
> +  common_lru_check2 [shape=diamond,
> +    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
> +
> +  subgraph cluster_remote_lock {
> +    label = "Iterate through CPUs\n(start from current)";
> +    style = dashed;
> +    rankdir=LR;
> +
> +    local_freelist_check5 [shape=diamond,fillcolor=4,
> +      label="Steal a node from\nper-cpu freelist?"]
> +    local_freelist_check6 [shape=rectangle,fillcolor=4,
> +      label="Steal a node from
> +        (1) Unreferenced pending, or
> +        (2) Any pending node"]
> +    local_freelist_check7 [shape=rectangle,fillcolor=3,
> +      label="Try to remove\nnode from hashtab"]
> +    fn_htab_lru_map_update_elem [shape=diamond,
> +      label="Stole node\nfrom remote\nCPU?"]
> +    fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"]
> +    // Also corresponds to:
> +    // use_local_node()
> +    // fn__local_list_pop_pending()
> +  }
> +
> +  fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle,
> +    label="Use node that was\nnot recently referenced"]
> +  local_freelist_check4 [shape=rectangle,
> +    label="Use node that was\nactively referenced\nin global list"]
> +  fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"]
> +  fn_htab_lru_map_update_elem3 [shape=rectangle,
> +    label="Use node that was\nactively referenced\nin (another?) CPU's cache"]
> +  fn_htab_lru_map_update_elem4 [shape=diamond,
> +    label="Can lock this\nhashtab bucket?"]
> +  fn_htab_lru_map_update_elem5 [shape=rectangle,fillcolor=3,
> +    label="Update hashmap\nwith new element"]
> +  fn_htab_lru_map_update_elem6 [shape=oval,label="return 0"]
> +  fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"]
> +
> +  begin -> local_freelist_check
> +  local_freelist_check -> use_local_node [xlabel="Y"]
> +  local_freelist_check -> common_lru_check [xlabel="N"]
> +  common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"]
> +  common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"]
> +  fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free
> +  fn___bpf_lru_node_move_to_free ->
> +    fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"]
> +  fn___bpf_lru_node_move_to_free ->
> +    fn___bpf_lru_list_shrink_inactive [xlabel="N"]
> +  fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink
> +  fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"]
> +  fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"]
> +  fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3
> +  fn___bpf_lru_list_shrink3 -> local_freelist_check2
> +  local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"]
> +  local_freelist_check2 -> common_lru_check2 [xlabel = "N"]
> +  common_lru_check2 -> local_freelist_check5 [xlabel = "Y"]
> +  common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"]
> +  local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"]
> +  local_freelist_check5 -> local_freelist_check6 [xlabel = "N"]
> +  local_freelist_check6 -> local_freelist_check7
> +  local_freelist_check7 -> fn_htab_lru_map_update_elem
> +
> +  fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"]
> +  fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2  [xlabel = "N"]
> +  fn_htab_lru_map_update_elem2 ->
> +    fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"]
> +  fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"]
> +  fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4
> +
> +  use_local_node -> fn_htab_lru_map_update_elem4
> +  fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4
> +  local_freelist_check4 -> fn_htab_lru_map_update_elem4
> +
> +  fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [xlabel="Y"]
> +  fn_htab_lru_map_update_elem4 ->
> +    fn_htab_lru_map_update_elem_EBUSY [xlabel="N"]
> +  fn_htab_lru_map_update_elem5 -> fn_htab_lru_map_update_elem6
> +
> +  // Create invisible pad nodes to line up various nodes
> +  pad0 [style=invis]
> +  pad1 [style=invis]
> +  pad2 [style=invis]
> +  pad3 [style=invis]
> +  pad4 [style=invis]
> +
> +  // Line up the key with the top of the graph
> +  no_lock -> local_lock [style=invis]
> +  local_lock -> lru_lock [style=invis]
> +  lru_lock -> hash_lock [style=invis]
> +  hash_lock -> remote_lock [style=invis]
> +  remote_lock -> local_freelist_check5 [style=invis]
> +  remote_lock -> fn___bpf_lru_list_shrink [style=invis]
> +
> +  // Line up return code nodes at the bottom of the graph
> +  fn_htab_lru_map_update_elem -> pad0 [style=invis]
> +  pad0 -> pad1 [style=invis]
> +  pad1 -> pad2 [style=invis]
> +  pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis]
> +  fn_htab_lru_map_update_elem4 -> pad3 [style=invis]
> +  pad3 -> fn_htab_lru_map_update_elem_EBUSY  [style=invis]
> +
> +  // Reduce diagram width by forcing some nodes to appear above others
> +  local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis]
> +  common_lru_check2 -> pad4 [style=invis]
> +  pad4 -> local_freelist_check5 [style=invis]
> +}
> -- 
> 2.34.1
> 

Thanks again for this patch, this piece of documentation really helped
me understand internals of the LRU hashmap.
  
Martin KaFai Lau March 14, 2023, 7:31 p.m. UTC | #5
On 3/12/23 12:05 PM, Joe Stringer wrote:
> Extend the bpf hashmap docs to include a brief description of the
> internals of the LRU map type (setting appropriate API expectations),
> including the original commit message from Martin and a variant on the
> graph that I had presented during my Linux Plumbers Conference 2022 talk
> on "Pressure feedback for LRU map types"[0].
> 
> The node names in the dot file correspond roughly to the functions where
> the logic for those decisions or steps is defined, to help curious
> developers to cross-reference and update this logic if the details of
> the LRU implementation ever differ from this description.
> 
> [0]: https://lpc.events/event/16/contributions/1368/
> 
> Signed-off-by: Joe Stringer <joe@isovalent.com>
> ---
> v3: Use standard table syntax
>      Replace inline commit message with reference to commit
>      Fix incorrect Y/N label for common LRU check
>      Rename some dotfile variables to reduce confusion between cases
>      Minor wording touchups
> v2: Fix issue that caused initial email submission to fail
> ---
>   Documentation/bpf/map_hash.rst            |  62 ++++++++
>   Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
>   2 files changed, 228 insertions(+)
>   create mode 100644 Documentation/bpf/map_lru_hash_update.dot
> 
> diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
> index 8669426264c6..61602ce26561 100644
> --- a/Documentation/bpf/map_hash.rst
> +++ b/Documentation/bpf/map_hash.rst
> @@ -1,5 +1,6 @@
>   .. SPDX-License-Identifier: GPL-2.0-only
>   .. Copyright (C) 2022 Red Hat, Inc.
> +.. Copyright (C) 2022-2023 Isovalent, Inc.
>   
>   ===============================================
>   BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
> @@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
>                       cur_key = &next_key;
>               }
>       }
> +
> +Internals
> +=========
> +
> +This section of the document is targeted at Linux developers and describes
> +aspects of the map implementations that are not considered stable ABI. The
> +following details are subject to change in future versions of the kernel.
> +
> +``BPF_MAP_TYPE_LRU_HASH`` and variants
> +--------------------------------------
> +
> +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
> +hence is indexable by key for constant time lookups. Secondly, when at map
> +capacity, map updates will trigger eviction of old entries based on the age of
> +the elements in a set of lists. Each of these properties may be either global
> +or per-CPU, depending on the map type and flags used to create the map:
> +
> ++------------------------+---------------------------+----------------------------------+
> +|                        | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
> ++========================+===========================+==================================+
> +| ``BPF_NO_COMMON_LRU``  | Per-CPU LRU, global map   | Per-CPU LRU, per-cpu map         |
> ++------------------------+---------------------------+----------------------------------+
> +| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map    | Global LRU, per-cpu map          |
> ++------------------------+---------------------------+----------------------------------+
> +
> +Notably, there are various steps that the update algorithm attempts in order to
> +enforce the LRU property which have increasing impacts on other CPUs involved
> +in the following operation attempts:
> +
> +- Attempt to use CPU-local state to batch operations
> +- Attempt to fetch free nodes from global lists
> +- Attempt to pull any node from a global list and remove it from the hashmap
> +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> +
> +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> +may fail to insert the entry into the map if other CPUs are heavily contending
> +on the global hashmap lock.

The global hashmap lock described here is the action taken in htab_lock_bucket()?

It is a percpu counter added in commit 20b6cc34ea74 ("bpf: Avoid hashtab 
deadlock with map_locked") to avoid deadlock/recursion.

I would suggest to simplify the diagram by removing the "Can lock this hashtab 
bucket?" details. May be a note somewhere to mention why it will still fail to 
shrink the list because the htab_lock_bucket() have detected potential 
deadlock/recursion which is a very unlikely case.


Thanks for the write-up!
  
Joe Stringer March 16, 2023, 1:19 a.m. UTC | #6
On Mon, Mar 13, 2023 at 10:21 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Joe Stringer wrote:
> > Extend the bpf hashmap docs to include a brief description of the
> > internals of the LRU map type (setting appropriate API expectations),
> > including the original commit message from Martin and a variant on the
> > graph that I had presented during my Linux Plumbers Conference 2022 talk
> > on "Pressure feedback for LRU map types"[0].
> >
> > The node names in the dot file correspond roughly to the functions where
> > the logic for those decisions or steps is defined, to help curious
> > developers to cross-reference and update this logic if the details of
> > the LRU implementation ever differ from this description.
> >
> > [0]: https://lpc.events/event/16/contributions/1368/
> >
> > Signed-off-by: Joe Stringer <joe@isovalent.com>
>
>
> Thanks couple nits below
>
> > ---
> > v3: Use standard table syntax
> >     Replace inline commit message with reference to commit
> >     Fix incorrect Y/N label for common LRU check
> >     Rename some dotfile variables to reduce confusion between cases
> >     Minor wording touchups
> > v2: Fix issue that caused initial email submission to fail
> > ---
> >  Documentation/bpf/map_hash.rst            |  62 ++++++++
> >  Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
> >  2 files changed, 228 insertions(+)
> >  create mode 100644 Documentation/bpf/map_lru_hash_update.dot
> > diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
> > index 8669426264c6..61602ce26561 100644
> > --- a/Documentation/bpf/map_hash.rst
> > +++ b/Documentation/bpf/map_hash.rst
> > @@ -1,5 +1,6 @@
> >  .. SPDX-License-Identifier: GPL-2.0-only
> >  .. Copyright (C) 2022 Red Hat, Inc.
> > +.. Copyright (C) 2022-2023 Isovalent, Inc.
> >
> >  ===============================================
> >  BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
> > @@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
> >                      cur_key = &next_key;
> >              }
> >      }
> > +
> > +Internals
> > +=========
> > +
> > +This section of the document is targeted at Linux developers and describes
> > +aspects of the map implementations that are not considered stable ABI. The
> > +following details are subject to change in future versions of the kernel.
> > +
> > +``BPF_MAP_TYPE_LRU_HASH`` and variants
> > +--------------------------------------
> > +
> > +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
> > +hence is indexable by key for constant time lookups. Secondly, when at map
> > +capacity, map updates will trigger eviction of old entries based on the age of
> > +the elements in a set of lists. Each of these properties may be either global
> > +or per-CPU, depending on the map type and flags used to create the map:
> > +
> > ++------------------------+---------------------------+----------------------------------+
> > +|                        | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
> > ++========================+===========================+==================================+
> > +| ``BPF_NO_COMMON_LRU``  | Per-CPU LRU, global map   | Per-CPU LRU, per-cpu map         |
> > ++------------------------+---------------------------+----------------------------------+
> > +| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map    | Global LRU, per-cpu map          |
> > ++------------------------+---------------------------+----------------------------------+
>
> Above all seems API to me. Maybe move the statement about not considered stable
> ABI down here? Something like,
>
> "
> The internal details of which entry is evicted and acquiring a new entry
> are not considered stable and may change in the future. But the current
> impelementation is as follows.
> "
>
> Or something like that?

Yep sounds good to me, I'll fix that up.

> > +
> > +Notably, there are various steps that the update algorithm attempts in order to
> > +enforce the LRU property which have increasing impacts on other CPUs involved
> > +in the following operation attempts:
> > +
> > +- Attempt to use CPU-local state to batch operations
> > +- Attempt to fetch free nodes from global lists
> > +- Attempt to pull any node from a global list and remove it from the hashmap
> > +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> > +
> > +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> > +may fail to insert the entry into the map if other CPUs are heavily contending
> > +on the global hashmap lock.
>
> Similarly this is ABI correct? Probably we can also specify the error code?
> Assuming it is just EBUSY or EAGAIN?

Hmm. It's EBUSY right now. Also looks like we're missing corresponding
docs in the uapi header for this case:

https://github.com/torvalds/linux/blob/9c1bec9c0b08abeac72ed6214b723adc224013bf/include/uapi/linux/bpf.h#L163

I think that "this function may fail due to contention on map usage,
in which case it returns EBUSY" (or similar wording) is reasonable to
document as ABI. IMO that should go in the header linked above (which
gets turned into UAPI docs elsewhere). This particular phrasing is
going into more detail around things like the hashmap lock which
should not be described as part of the uAPI/ABI.
  
Joe Stringer March 16, 2023, 1:22 a.m. UTC | #7
On Tue, Mar 14, 2023 at 6:55 AM Maxim Mikityanskiy <maxtram95@gmail.com> wrote:
>
> On Sun, Mar 12, 2023 at 12:05:59PM -0700, Joe Stringer wrote:

<snip>

> I believe there are some inaccuracies, though. As far as I see it,
> local_freelist_check corresponds to __local_list_pop_free in the common
> LRU case, specifically, to checking its return value; use_local_node
> corresponds to returning that value; and common_lru_check corresponds
> to bpf_lru_pop_free (for both common and percpu LRU, that's where the
> distinction is made).

Ah yes, thanks for the pointers, will fix up. I started with reviewing
the shared case since I was primarily interested in the behaviour
there, then I added the other cases later. Adding the function names
was one of the later ideas but it's difficult to get accurate.

> > +  local_freelist_check [shape=diamond,fillcolor=1,
> > +    label="Local freelist\nnode available?"];
> > +  // The following corresponds to __local_list_pop_free() for common LRU case.
> > +  use_local_node [shape=rectangle,
> > +    label="Use node owned\nby this CPU"]
> > +
> > +  common_lru_check [shape=diamond,
> > +    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
>
> Nit: the exact flag name is BPF_F_NO_COMMON_LRU.

Will fix.

> Thanks again for this patch, this piece of documentation really helped
> me understand internals of the LRU hashmap.

Glad to hear, thanks for the feedback on the patch!
  
Joe Stringer March 16, 2023, 1:54 a.m. UTC | #8
On Tue, Mar 14, 2023 at 12:31 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 3/12/23 12:05 PM, Joe Stringer wrote:
> > Extend the bpf hashmap docs to include a brief description of the
> > internals of the LRU map type (setting appropriate API expectations),
> > including the original commit message from Martin and a variant on the
> > graph that I had presented during my Linux Plumbers Conference 2022 talk
> > on "Pressure feedback for LRU map types"[0].
> >
> > The node names in the dot file correspond roughly to the functions where
> > the logic for those decisions or steps is defined, to help curious
> > developers to cross-reference and update this logic if the details of
> > the LRU implementation ever differ from this description.
> >
> > [0]: https://lpc.events/event/16/contributions/1368/
> >
> > Signed-off-by: Joe Stringer <joe@isovalent.com>
> > ---
> > v3: Use standard table syntax
> >      Replace inline commit message with reference to commit
> >      Fix incorrect Y/N label for common LRU check
> >      Rename some dotfile variables to reduce confusion between cases
> >      Minor wording touchups
> > v2: Fix issue that caused initial email submission to fail
> > ---
> >   Documentation/bpf/map_hash.rst            |  62 ++++++++
> >   Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
> >   2 files changed, 228 insertions(+)
> >   create mode 100644 Documentation/bpf/map_lru_hash_update.dot
> >
> > diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
> > index 8669426264c6..61602ce26561 100644
> > --- a/Documentation/bpf/map_hash.rst
> > +++ b/Documentation/bpf/map_hash.rst
> > @@ -1,5 +1,6 @@
> >   .. SPDX-License-Identifier: GPL-2.0-only
> >   .. Copyright (C) 2022 Red Hat, Inc.
> > +.. Copyright (C) 2022-2023 Isovalent, Inc.
> >
> >   ===============================================
> >   BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
> > @@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
> >                       cur_key = &next_key;
> >               }
> >       }
> > +
> > +Internals
> > +=========
> > +
> > +This section of the document is targeted at Linux developers and describes
> > +aspects of the map implementations that are not considered stable ABI. The
> > +following details are subject to change in future versions of the kernel.
> > +
> > +``BPF_MAP_TYPE_LRU_HASH`` and variants
> > +--------------------------------------
> > +
> > +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
> > +hence is indexable by key for constant time lookups. Secondly, when at map
> > +capacity, map updates will trigger eviction of old entries based on the age of
> > +the elements in a set of lists. Each of these properties may be either global
> > +or per-CPU, depending on the map type and flags used to create the map:
> > +
> > ++------------------------+---------------------------+----------------------------------+
> > +|                        | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
> > ++========================+===========================+==================================+
> > +| ``BPF_NO_COMMON_LRU``  | Per-CPU LRU, global map   | Per-CPU LRU, per-cpu map         |
> > ++------------------------+---------------------------+----------------------------------+
> > +| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map    | Global LRU, per-cpu map          |
> > ++------------------------+---------------------------+----------------------------------+
> > +
> > +Notably, there are various steps that the update algorithm attempts in order to
> > +enforce the LRU property which have increasing impacts on other CPUs involved
> > +in the following operation attempts:
> > +
> > +- Attempt to use CPU-local state to batch operations
> > +- Attempt to fetch free nodes from global lists
> > +- Attempt to pull any node from a global list and remove it from the hashmap
> > +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> > +
> > +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> > +may fail to insert the entry into the map if other CPUs are heavily contending
> > +on the global hashmap lock.
>
> The global hashmap lock described here is the action taken in htab_lock_bucket()?
>
> It is a percpu counter added in commit 20b6cc34ea74 ("bpf: Avoid hashtab
> deadlock with map_locked") to avoid deadlock/recursion.

Hmm, yes that's the lock I had in mind. Thanks for the pointer, I
didn't really understand the motivation for that case previously. That
said, I now find it even harder to think of reasonable wording to
describe in the ABI about how an eBPF program developer should reason
about the -EBUSY failure case.

> I would suggest to simplify the diagram by removing the "Can lock this hashtab
> bucket?" details.

I could swap that to instead have "Update hashmap with new element",
then have two possible outcomes depending on whether that succeeds or
not. I guess this is also similar to John's feedback above that in the
end, EBUSY return code ends up being ABI for the helper. Does that
make sense? One of my goals for the diagram was to at least capture
the various return codes to assist readers in reasoning about the
different failure modes.

> Maybe a note somewhere to mention why it will still fail to
> shrink the list because the htab_lock_bucket() have detected potential
> deadlock/recursion which is a very unlikely case.

I missed the "shrink the list" link here since it seems like this
could happen for any combination of update or delete elems for the
same bucket. But yeah given that also needs to happen on the same CPU,
it does seem very unlikely... Could there be a case something like
"userspace process is touching that bucket, gets interrupted, then the
same CPU runs an eBPF program that attempts to update/delete elements
in the same bucket"?

Previously I had read this to think that EBUSY was the common case and
ENOMEM is the uncommon case, but based on your pointers above I'm less
convinced now, and more surprised that either failure would occur.
Perhaps the failure I had hit was even later in the regular hashtab
update logic. At the time of the incidents I was investigating, we
unfortunately did not record which of the failure cases occurred so I
don't have specific data to back up what we were experiencing. We have
since added such reporting but I haven't received further information
from the failure mode.
  
Martin KaFai Lau March 17, 2023, 6:04 a.m. UTC | #9
On 3/15/23 6:54 PM, Joe Stringer wrote:
> On Tue, Mar 14, 2023 at 12:31 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 3/12/23 12:05 PM, Joe Stringer wrote:
>>> Extend the bpf hashmap docs to include a brief description of the
>>> internals of the LRU map type (setting appropriate API expectations),
>>> including the original commit message from Martin and a variant on the
>>> graph that I had presented during my Linux Plumbers Conference 2022 talk
>>> on "Pressure feedback for LRU map types"[0].
>>>
>>> The node names in the dot file correspond roughly to the functions where
>>> the logic for those decisions or steps is defined, to help curious
>>> developers to cross-reference and update this logic if the details of
>>> the LRU implementation ever differ from this description.
>>>
>>> [0]: https://lpc.events/event/16/contributions/1368/
>>>
>>> Signed-off-by: Joe Stringer <joe@isovalent.com>
>>> ---
>>> v3: Use standard table syntax
>>>       Replace inline commit message with reference to commit
>>>       Fix incorrect Y/N label for common LRU check
>>>       Rename some dotfile variables to reduce confusion between cases
>>>       Minor wording touchups
>>> v2: Fix issue that caused initial email submission to fail
>>> ---
>>>    Documentation/bpf/map_hash.rst            |  62 ++++++++
>>>    Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
>>>    2 files changed, 228 insertions(+)
>>>    create mode 100644 Documentation/bpf/map_lru_hash_update.dot
>>>
>>> diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
>>> index 8669426264c6..61602ce26561 100644
>>> --- a/Documentation/bpf/map_hash.rst
>>> +++ b/Documentation/bpf/map_hash.rst
>>> @@ -1,5 +1,6 @@
>>>    .. SPDX-License-Identifier: GPL-2.0-only
>>>    .. Copyright (C) 2022 Red Hat, Inc.
>>> +.. Copyright (C) 2022-2023 Isovalent, Inc.
>>>
>>>    ===============================================
>>>    BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
>>> @@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
>>>                        cur_key = &next_key;
>>>                }
>>>        }
>>> +
>>> +Internals
>>> +=========
>>> +
>>> +This section of the document is targeted at Linux developers and describes
>>> +aspects of the map implementations that are not considered stable ABI. The
>>> +following details are subject to change in future versions of the kernel.
>>> +
>>> +``BPF_MAP_TYPE_LRU_HASH`` and variants
>>> +--------------------------------------
>>> +
>>> +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
>>> +hence is indexable by key for constant time lookups. Secondly, when at map
>>> +capacity, map updates will trigger eviction of old entries based on the age of
>>> +the elements in a set of lists. Each of these properties may be either global
>>> +or per-CPU, depending on the map type and flags used to create the map:
>>> +
>>> ++------------------------+---------------------------+----------------------------------+
>>> +|                        | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
>>> ++========================+===========================+==================================+
>>> +| ``BPF_NO_COMMON_LRU``  | Per-CPU LRU, global map   | Per-CPU LRU, per-cpu map         |
>>> ++------------------------+---------------------------+----------------------------------+
>>> +| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map    | Global LRU, per-cpu map          |
>>> ++------------------------+---------------------------+----------------------------------+
>>> +
>>> +Notably, there are various steps that the update algorithm attempts in order to
>>> +enforce the LRU property which have increasing impacts on other CPUs involved
>>> +in the following operation attempts:
>>> +
>>> +- Attempt to use CPU-local state to batch operations
>>> +- Attempt to fetch free nodes from global lists
>>> +- Attempt to pull any node from a global list and remove it from the hashmap
>>> +- Attempt to pull any node from any CPU's list and remove it from the hashmap
>>> +
>>> +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
>>> +may fail to insert the entry into the map if other CPUs are heavily contending
>>> +on the global hashmap lock.
>>
>> The global hashmap lock described here is the action taken in htab_lock_bucket()?
>>
>> It is a percpu counter added in commit 20b6cc34ea74 ("bpf: Avoid hashtab
>> deadlock with map_locked") to avoid deadlock/recursion.
> 
> Hmm, yes that's the lock I had in mind. Thanks for the pointer, I
> didn't really understand the motivation for that case previously. That
> said, I now find it even harder to think of reasonable wording to
> describe in the ABI about how an eBPF program developer should reason
> about the -EBUSY failure case.

-EBUSY should be very unlikely. No good idea how to explain it as a user manual/doc.

> 
>> I would suggest to simplify the diagram by removing the "Can lock this hashtab
>> bucket?" details.
> 
> I could swap that to instead have "Update hashmap with new element",
> then have two possible outcomes depending on whether that succeeds or
> not. I guess this is also similar to John's feedback above that in the
> end, EBUSY return code ends up being ABI for the helper. Does that
> make sense? One of my goals for the diagram was to at least capture
> the various return codes to assist readers in reasoning about the
> different failure modes.

don't mind to leave the "Can lock this hashtab bucket?" as is in the diagram. I 
was just thinking a simpler diagram may be easier to grab the big picture quickly.

> 
>> Maybe a note somewhere to mention why it will still fail to
>> shrink the list because the htab_lock_bucket() have detected potential
>> deadlock/recursion which is a very unlikely case.
> 
> I missed the "shrink the list" link here since it seems like this
> could happen for any combination of update or delete elems for the
> same bucket. But yeah given that also needs to happen on the same CPU,
> it does seem very unlikely... 

shrink should try to shrink a couple of stale elems which are likely hashed to 
different buckets. If shrink hits htab_lock_bucket() EBUSY on all stale elems, 
shrink could fail but unlikely.

> Could there be a case something like "userspace process is touching that bucket, 
> gets interrupted, then the same CPU runs an eBPF program that attempts to 
> update/delete elements in the same bucket"?

raw_spin_lock_irqsave() is done after the percpu counter, so there is a gap but 
not sure if it matters though unless the production use case can really hit it.

> 
> Previously I had read this to think that EBUSY was the common case and
> ENOMEM is the uncommon case, but based on your pointers above I'm less
> convinced now, and more surprised that either failure would occur.
> Perhaps the failure I had hit was even later in the regular hashtab
> update logic. At the time of the incidents I was investigating, we
> unfortunately did not record which of the failure cases occurred so I
> don't have specific data to back up what we were experiencing. We have
> since added such reporting but I haven't received further information
> from the failure mode.
  
Joe Stringer April 1, 2023, 6:57 p.m. UTC | #10
On Thu, Mar 16, 2023 at 11:05 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 3/15/23 6:54 PM, Joe Stringer wrote:
> > On Tue, Mar 14, 2023 at 12:31 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> Maybe a note somewhere to mention why it will still fail to
> >> shrink the list because the htab_lock_bucket() have detected potential
> >> deadlock/recursion which is a very unlikely case.
> >
> > I missed the "shrink the list" link here since it seems like this
> > could happen for any combination of update or delete elems for the
> > same bucket. But yeah given that also needs to happen on the same CPU,
> > it does seem very unlikely...
>
> shrink should try to shrink a couple of stale elems which are likely hashed to
> different buckets. If shrink hits htab_lock_bucket() EBUSY on all stale elems,
> shrink could fail but unlikely.

The failure case I had in mind for this is to assume that shrinking
succeeds and we find an LRU node during the htab_map_update_elem()
call through prealloc_lru_pop(), but then immediately afterwards it
makes a direct htab_lock_bucket() call which has just one chance to
succeed based on whether this CPU races against some other user of the
bucket lock. Still seems somewhat rare, but feasible to hit.

> > Could there be a case something like "userspace process is touching that bucket,
> > gets interrupted, then the same CPU runs an eBPF program that attempts to
> > update/delete elements in the same bucket"?
>
> raw_spin_lock_irqsave() is done after the percpu counter, so there is a gap but
> not sure if it matters though unless the production use case can really hit it.

Yeah unfortunately I'm going off an incident from last year and I
don't have this level of visibility into the failure scenario in a
prod-like environment today.
  

Patch

diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
index 8669426264c6..61602ce26561 100644
--- a/Documentation/bpf/map_hash.rst
+++ b/Documentation/bpf/map_hash.rst
@@ -1,5 +1,6 @@ 
 .. SPDX-License-Identifier: GPL-2.0-only
 .. Copyright (C) 2022 Red Hat, Inc.
+.. Copyright (C) 2022-2023 Isovalent, Inc.
 
 ===============================================
 BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
@@ -206,3 +207,64 @@  Userspace walking the map elements from the map declared above:
                     cur_key = &next_key;
             }
     }
+
+Internals
+=========
+
+This section of the document is targeted at Linux developers and describes
+aspects of the map implementations that are not considered stable ABI. The
+following details are subject to change in future versions of the kernel.
+
+``BPF_MAP_TYPE_LRU_HASH`` and variants
+--------------------------------------
+
+An LRU hashmap type consists of two properties: Firstly, it is a hash map and
+hence is indexable by key for constant time lookups. Secondly, when at map
+capacity, map updates will trigger eviction of old entries based on the age of
+the elements in a set of lists. Each of these properties may be either global
+or per-CPU, depending on the map type and flags used to create the map:
+
++------------------------+---------------------------+----------------------------------+
+|                        | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
++========================+===========================+==================================+
+| ``BPF_NO_COMMON_LRU``  | Per-CPU LRU, global map   | Per-CPU LRU, per-cpu map         |
++------------------------+---------------------------+----------------------------------+
+| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map    | Global LRU, per-cpu map          |
++------------------------+---------------------------+----------------------------------+
+
+Notably, there are various steps that the update algorithm attempts in order to
+enforce the LRU property which have increasing impacts on other CPUs involved
+in the following operation attempts:
+
+- Attempt to use CPU-local state to batch operations
+- Attempt to fetch free nodes from global lists
+- Attempt to pull any node from a global list and remove it from the hashmap
+- Attempt to pull any node from any CPU's list and remove it from the hashmap
+
+Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
+may fail to insert the entry into the map if other CPUs are heavily contending
+on the global hashmap lock.
+
+This algorithm is described visually in the following diagram. See the
+description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
+the corresponding operations:
+
+.. kernel-figure::  map_lru_hash_update.dot
+   :alt:    Diagram outlining the LRU eviction steps taken during map update
+
+   LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
+   variants
+
+Map updates start from the oval in the top right "begin ``bpf_map_update()``"
+and progress through the graph towards the bottom where the result may be
+either a successful update or a failure with various error codes. The key in
+the top right provides indicators for which locks may be involved in specific
+operations. This is intended as a visual hint for reasoning about how map
+contention may impact update operations, though the map type and flags may
+impact the actual contention on those locks, based on the logic described in
+the table above. For instance, if the map is created with type
+``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_NO_COMMON_LRU`` then all map
+properties would be per-cpu.
+
+The dot file source for the above figure uses internal kernel function names
+for the node names in order to make the corresponding logic easier to find.
diff --git a/Documentation/bpf/map_lru_hash_update.dot b/Documentation/bpf/map_lru_hash_update.dot
new file mode 100644
index 000000000000..3a44ebec501e
--- /dev/null
+++ b/Documentation/bpf/map_lru_hash_update.dot
@@ -0,0 +1,166 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright (C) 2022-2023 Isovalent, Inc.
+digraph {
+  node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes
+  graph [splines=ortho, nodesep=1]
+
+  subgraph cluster_key {
+    label = "Key\n(locks held during operation)";
+    rankdir = TB;
+
+    remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"]
+    hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"]
+    lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"]
+    local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"]
+    no_lock [shape=rectangle,label="no locks held"]
+  }
+
+  begin [shape=oval,label="begin\nbpf_map_update()"]
+
+  // Nodes below with an 'fn_' prefix are roughly labeled by the C function
+  // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c.
+  // Number suffixes and errno suffixes handle subsections of the corresponding
+  // logic in the function as of the writing of this dot.
+
+  // The following corresponds to bpf_lru_pop_free() for common LRU case.
+  local_freelist_check [shape=diamond,fillcolor=1,
+    label="Local freelist\nnode available?"];
+  // The following corresponds to __local_list_pop_free() for common LRU case.
+  use_local_node [shape=rectangle,
+    label="Use node owned\nby this CPU"]
+
+  common_lru_check [shape=diamond,
+    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
+
+  fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2,
+    label="Flush local pending,
+    Rotate Global list, move
+    LOCAL_FREE_TARGET
+    from global -> local"]
+  // Also corresponds to:
+  // fn__local_list_flush()
+  // fn_bpf_lru_list_rotate()
+  fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2,
+    label="Able to free\nLOCAL_FREE_TARGET\nnodes?"]
+
+  fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3,
+    label="Shrink inactive list
+      up to remaining
+      LOCAL_FREE_TARGET
+      (global LRU -> local)"]
+  fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2,
+    label="> 0 entries in\nlocal free list?"]
+  fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2,
+    label="Steal one node from
+      inactive, or if empty,
+      from active global list"]
+  fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3,
+    label="Try to remove\nnode from hashtab"]
+
+  local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"]
+  common_lru_check2 [shape=diamond,
+    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
+
+  subgraph cluster_remote_lock {
+    label = "Iterate through CPUs\n(start from current)";
+    style = dashed;
+    rankdir=LR;
+
+    local_freelist_check5 [shape=diamond,fillcolor=4,
+      label="Steal a node from\nper-cpu freelist?"]
+    local_freelist_check6 [shape=rectangle,fillcolor=4,
+      label="Steal a node from
+        (1) Unreferenced pending, or
+        (2) Any pending node"]
+    local_freelist_check7 [shape=rectangle,fillcolor=3,
+      label="Try to remove\nnode from hashtab"]
+    fn_htab_lru_map_update_elem [shape=diamond,
+      label="Stole node\nfrom remote\nCPU?"]
+    fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"]
+    // Also corresponds to:
+    // use_local_node()
+    // fn__local_list_pop_pending()
+  }
+
+  fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle,
+    label="Use node that was\nnot recently referenced"]
+  local_freelist_check4 [shape=rectangle,
+    label="Use node that was\nactively referenced\nin global list"]
+  fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"]
+  fn_htab_lru_map_update_elem3 [shape=rectangle,
+    label="Use node that was\nactively referenced\nin (another?) CPU's cache"]
+  fn_htab_lru_map_update_elem4 [shape=diamond,
+    label="Can lock this\nhashtab bucket?"]
+  fn_htab_lru_map_update_elem5 [shape=rectangle,fillcolor=3,
+    label="Update hashmap\nwith new element"]
+  fn_htab_lru_map_update_elem6 [shape=oval,label="return 0"]
+  fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"]
+
+  begin -> local_freelist_check
+  local_freelist_check -> use_local_node [xlabel="Y"]
+  local_freelist_check -> common_lru_check [xlabel="N"]
+  common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"]
+  common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"]
+  fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free
+  fn___bpf_lru_node_move_to_free ->
+    fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"]
+  fn___bpf_lru_node_move_to_free ->
+    fn___bpf_lru_list_shrink_inactive [xlabel="N"]
+  fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink
+  fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"]
+  fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"]
+  fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3
+  fn___bpf_lru_list_shrink3 -> local_freelist_check2
+  local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"]
+  local_freelist_check2 -> common_lru_check2 [xlabel = "N"]
+  common_lru_check2 -> local_freelist_check5 [xlabel = "Y"]
+  common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"]
+  local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"]
+  local_freelist_check5 -> local_freelist_check6 [xlabel = "N"]
+  local_freelist_check6 -> local_freelist_check7
+  local_freelist_check7 -> fn_htab_lru_map_update_elem
+
+  fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"]
+  fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2  [xlabel = "N"]
+  fn_htab_lru_map_update_elem2 ->
+    fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"]
+  fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"]
+  fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4
+
+  use_local_node -> fn_htab_lru_map_update_elem4
+  fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4
+  local_freelist_check4 -> fn_htab_lru_map_update_elem4
+
+  fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [xlabel="Y"]
+  fn_htab_lru_map_update_elem4 ->
+    fn_htab_lru_map_update_elem_EBUSY [xlabel="N"]
+  fn_htab_lru_map_update_elem5 -> fn_htab_lru_map_update_elem6
+
+  // Create invisible pad nodes to line up various nodes
+  pad0 [style=invis]
+  pad1 [style=invis]
+  pad2 [style=invis]
+  pad3 [style=invis]
+  pad4 [style=invis]
+
+  // Line up the key with the top of the graph
+  no_lock -> local_lock [style=invis]
+  local_lock -> lru_lock [style=invis]
+  lru_lock -> hash_lock [style=invis]
+  hash_lock -> remote_lock [style=invis]
+  remote_lock -> local_freelist_check5 [style=invis]
+  remote_lock -> fn___bpf_lru_list_shrink [style=invis]
+
+  // Line up return code nodes at the bottom of the graph
+  fn_htab_lru_map_update_elem -> pad0 [style=invis]
+  pad0 -> pad1 [style=invis]
+  pad1 -> pad2 [style=invis]
+  pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis]
+  fn_htab_lru_map_update_elem4 -> pad3 [style=invis]
+  pad3 -> fn_htab_lru_map_update_elem_EBUSY  [style=invis]
+
+  // Reduce diagram width by forcing some nodes to appear above others
+  local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis]
+  common_lru_check2 -> pad4 [style=invis]
+  pad4 -> local_freelist_check5 [style=invis]
+}