[RFC,net-next,v5,2/2] net: add netmem to skb_frag_t

Message ID 20240109011455.1061529-3-almasrymina@google.com
State New

Commit Message

Mina Almasry Jan. 9, 2024, 1:14 a.m. UTC
Use netmem_ref instead of struct page * in skb_frag_t. Currently a
netmem_ref is always a struct page underneath, but the abstraction
allows efforts to add support for skb frags not backed by pages.

Unfortunately, there is one instance in kcm where skb_frag_t is assumed
to be exactly a bio_vec. For this case, WARN_ON_ONCE and return an error
before doing the cast.
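
For illustration only (not part of this patch): the kcm cast is only
sound while skb_frag_t and struct bio_vec share the same layout. A
minimal userspace sketch of how that assumption could be checked at
compile time follows. The struct definitions are stand-ins copied from
the diff and from the usual bio_vec layout, netmem_ref is assumed to be
a pointer-sized integer, and frag_to_bvec() is a hypothetical helper
that copies with memcpy() so the sketch stays within ISO C aliasing
rules (the kernel itself builds with -fno-strict-aliasing and casts in
place).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins; the kernel types live in <linux/bvec.h>,
 * <net/netmem.h> and <linux/skbuff.h>. */
struct page;                    /* opaque in this sketch */
typedef uintptr_t netmem_ref;   /* assumption: pointer-sized handle */

struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

typedef struct skb_frag {
	netmem_ref netmem;
	unsigned int len;
	unsigned int offset;
} skb_frag_t;

/* The cast in kcm_write_msgs() relies on every one of these. */
static_assert(sizeof(skb_frag_t) == sizeof(struct bio_vec),
	      "skb_frag_t and bio_vec differ in size");
static_assert(offsetof(skb_frag_t, netmem) == offsetof(struct bio_vec, bv_page),
	      "netmem/bv_page offset mismatch");
static_assert(offsetof(skb_frag_t, len) == offsetof(struct bio_vec, bv_len),
	      "len/bv_len offset mismatch");
static_assert(offsetof(skb_frag_t, offset) == offsetof(struct bio_vec, bv_offset),
	      "offset/bv_offset offset mismatch");

/* Hypothetical helper: reinterpret one frag as a bio_vec by copying
 * its bytes, which is well defined once the layouts match. */
static inline struct bio_vec frag_to_bvec(const skb_frag_t *frag)
{
	struct bio_vec bv;

	memcpy(&bv, frag, sizeof(bv));
	return bv;
}
```

If a later change reorders or widens a field in either struct, the
build fails instead of the cast silently reading the wrong bytes.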

Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
that the API can be used to create netmem skbs.
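
As a rough userspace model of how these helpers layer (a sketch, not
the kernel code): the page variants become thin wrappers that convert
with page_to_netmem() and call the netmem variants. struct mini_skb is
a hypothetical, heavily reduced stand-in for sk_buff plus
skb_shared_info, netmem_ref is assumed to be a pointer-sized integer,
the bounds assert is sketch-only, and the pfmemalloc propagation done
by __skb_fill_netmem_desc() is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FRAGS 17            /* stand-in for MAX_SKB_FRAGS */

struct page { int dummy; };     /* stand-in for the kernel's struct page */
typedef uintptr_t netmem_ref;   /* assumption: pointer-sized handle */

typedef struct skb_frag {
	netmem_ref netmem;
	unsigned int len;
	unsigned int offset;
} skb_frag_t;

struct mini_skb {               /* hypothetical reduced sk_buff */
	unsigned int len;
	unsigned int data_len;
	unsigned int truesize;
	unsigned int nr_frags;
	skb_frag_t frags[MAX_FRAGS];
};

static inline netmem_ref page_to_netmem(struct page *page)
{
	return (netmem_ref)page;
}

static inline void skb_frag_fill_netmem_desc(skb_frag_t *frag,
					     netmem_ref netmem, int off,
					     int size)
{
	frag->netmem = netmem;
	frag->offset = off;
	frag->len = size;
}

static inline void skb_fill_netmem_desc(struct mini_skb *skb, int i,
					netmem_ref netmem, int off, int size)
{
	assert(i >= 0 && i < MAX_FRAGS);   /* sketch-only sanity check */
	skb_frag_fill_netmem_desc(&skb->frags[i], netmem, off, size);
	skb->nr_frags = i + 1;
}

static inline void skb_add_rx_frag_netmem(struct mini_skb *skb, int i,
					  netmem_ref netmem, int off, int size,
					  unsigned int truesize)
{
	skb_fill_netmem_desc(skb, i, netmem, off, size);
	skb->len += size;
	skb->data_len += size;
	skb->truesize += truesize;
}

/* The page-based entry point is now only a conversion plus a call. */
static inline void skb_add_rx_frag(struct mini_skb *skb, int i,
				   struct page *page, int off, int size,
				   unsigned int truesize)
{
	skb_add_rx_frag_netmem(skb, i, page_to_netmem(page), off, size,
			       truesize);
}
```

Existing page-based callers keep their signatures, while a provider of
non-page memory would call the _netmem variants directly.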

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

v4:
- Handle error in kcm_write_msgs() instead of only warning (Willem)

v3:
- Renamed the fields in skb_frag_t.

v2:
- Add skb frag filling helpers.

---
 include/linux/skbuff.h | 90 +++++++++++++++++++++++++++++-------------
 net/core/skbuff.c      | 22 ++++++++---
 net/kcm/kcmsock.c      |  9 ++++-
 3 files changed, 86 insertions(+), 35 deletions(-)
  

Comments

Jason Gunthorpe Jan. 16, 2024, 12:01 a.m. UTC | #1
On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > You did not answer my question that I asked here, and ignoring this
> > > question is preventing us from making any forward progress on this
> > > discussion. What do you expect or want skb_frag_page() to do when
> > > there is no page in the frag?
> >
> > I would expect it to do nothing.
> 
> I don't understand. skb_frag_page() with an empty implementation just
> results in a compiler error as the function needs to return a page
> pointer. Do you actually expect skb_frag_page() to unconditionally
> cast frag->netmem to a page pointer? That was explained as
> unacceptable over and over again by Jason and Christian as it risks
> casting devmem to page; completely unacceptable and will get nacked.
> Do you have a suggestion of what skb_frag_page() should do that will
> not get nacked by mm?

WARN_ON and return NULL seems reasonable?

Jason
  
Jason Gunthorpe Jan. 16, 2024, 12:16 p.m. UTC | #2
On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> >>>> You did not answer my question that I asked here, and ignoring this
> >>>> question is preventing us from making any forward progress on this
> >>>> discussion. What do you expect or want skb_frag_page() to do when
> >>>> there is no page in the frag?
> >>>
> >>> I would expect it to do nothing.
> >>
> >> I don't understand. skb_frag_page() with an empty implementation just
> >> results in a compiler error as the function needs to return a page
> >> pointer. Do you actually expect skb_frag_page() to unconditionally
> >> cast frag->netmem to a page pointer? That was explained as
> >> unacceptable over and over again by Jason and Christian as it risks
> >> casting devmem to page; completely unacceptable and will get nacked.
> >> Do you have a suggestion of what skb_frag_page() should do that will
> >> not get nacked by mm?
> > 
> > WARN_ON and return NULL seems reasonable?
> 
> While I am agreed that it may be a nightmare to debug the case of passing
> a false page into the mm system, but I am not sure what's the point of
> returning NULL to caller if the caller is not expecting or handling
> the

You have to return something and NULL will largely reliably crash the
thread. The WARN_ON explains in detail why your thread just crashed.

> NULL returning[for example, most of mm API called by the networking does not
> seems to handling NULL as input page], isn't the NULL returning will make
> the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> As returning NULL seems to be causing a confusion for the caller of
> skb_frag_page() as whether to or how to handle the NULL returning case.

Possibly, though Linus doesn't like BUG_ON on principle..

I think the bigger challenge is convincing people that this devmem
stuff doesn't just open a bunch of holes in the kernel where userspace
can crash it.

The fact you all are debating what to do with skb_frag_page() suggests
to me there isn't confidence...

Jason
  
Mina Almasry Jan. 17, 2024, 6 p.m. UTC | #3
On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > >>>> You did not answer my question that I asked here, and ignoring this
> > >>>> question is preventing us from making any forward progress on this
> > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > >>>> there is no page in the frag?
> > >>>
> > >>> I would expect it to do nothing.
> > >>
> > >> I don't understand. skb_frag_page() with an empty implementation just
> > >> results in a compiler error as the function needs to return a page
> > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > >> cast frag->netmem to a page pointer? That was explained as
> > >> unacceptable over and over again by Jason and Christian as it risks
> > >> casting devmem to page; completely unacceptable and will get nacked.
> > >> Do you have a suggestion of what skb_frag_page() should do that will
> > >> not get nacked by mm?
> > >
> > > WARN_ON and return NULL seems reasonable?
> >

That's more or less what I'm thinking.

> > While I am agreed that it may be a nightmare to debug the case of passing
> > a false page into the mm system, but I am not sure what's the point of
> > returning NULL to caller if the caller is not expecting or handling
> > the
>
> You have to return something and NULL will largely reliably crash the
> thread. The WARN_ON explains in detail why your thread just crashed.
>

Agreed.

> > NULL returning[for example, most of mm API called by the networking does not
> > seems to handling NULL as input page], isn't the NULL returning will make
> > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > As returning NULL seems to be causing a confusion for the caller of
> > skb_frag_page() as whether to or how to handle the NULL returning case.
>
> Possibly, though Linus doesn't like BUG_ON on principle..
>
> I think the bigger challenge is convincing people that this devmem
> stuff doesn't just open a bunch of holes in the kernel where userspace
> can crash it.
>

It does not, and as of right now there are no pending concerns from
any netdev maintainers regarding mishandled devmem checks at least.
This is because the devmem series comes with a full audit of
skb_frag_page() callers [1] and all areas in the net stack attempting
to access the skb [2].

[1] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-10-almasrymina@google.com/
[2] https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-11-almasrymina@google.com/

> The fact you all are debating what to do with skb_frag_page() suggests
> to me there isn't confidence...
>

The debate raging on is related to the performance of skb_frag_page(),
not correctness (and even then, I don't think it's related to
perf...). Yunsheng would like us to optimize skb_frag_page() using an
unconditional cast from netmem to page. This in Yunsheng's mind is a
performance optimization as we don't need to add an if statement
checking if the netmem is a page. I have resisted implementing that
change so far because:

(a) unconditionally casting from netmem to page negates the compiler
type safety that you and Christian are laying out as a requirement for
the devmem stuff.
(b) With likely/unlikely or static branches, the check that netmem is
a page is a no-op for existing use cases, so AFAIU there is no perf
gain from optimizing it out.

But none of this is related to correctness. Code calling
skb_frag_page() will fail or crash if it's not handled correctly
regardless of the implementation details of skb_frag_page(). In the
devmem series we add support to handle it correctly via [1] & [2].
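
To make the "WARN_ON and return NULL" option concrete, here is a
minimal userspace sketch of a checked conversion, assuming non-page
memory is marked by a tag in the low bit of the netmem value. The tag
name NETMEM_NOT_PAGE, the warn_count counter and the stderr message
are hypothetical stand-ins for illustration; this is not the code in
this patch or in the devmem series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct page { int dummy; };     /* stand-in for the kernel's struct page */
typedef uintptr_t netmem_ref;   /* assumption: pointer-sized handle */

/* Hypothetical tag: low bit set means "not backed by a struct page". */
#define NETMEM_NOT_PAGE ((uintptr_t)1)

static_assert(_Alignof(struct page) > 1,
	      "low pointer bit must be free for tagging");

#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

/* Counts every failed conversion; the message is printed once only,
 * loosely imitating WARN_ON_ONCE(). */
static unsigned int warn_count;

static inline netmem_ref page_to_netmem(struct page *page)
{
	return (netmem_ref)page;
}

static inline bool netmem_is_page(netmem_ref netmem)
{
	return !(netmem & NETMEM_NOT_PAGE);
}

static inline struct page *netmem_to_page(netmem_ref netmem)
{
	if (unlikely(!netmem_is_page(netmem))) {
		if (warn_count++ == 0)
			fprintf(stderr, "WARN: netmem is not a page\n");
		return NULL;    /* never hand a fake page to the caller */
	}
	return (struct page *)netmem;
}

typedef struct skb_frag {
	netmem_ref netmem;
	unsigned int len;
	unsigned int offset;
} skb_frag_t;

static inline struct page *skb_frag_page(const skb_frag_t *frag)
{
	return netmem_to_page(frag->netmem);
}
```

On the page-only path the branch is predicted not taken, which is the
basis for point (b); a caller that was not audited for non-page frags
gets NULL plus a warning rather than a bogus struct page pointer.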
  

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a5ae952454c8..e59f76151628 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -37,6 +37,7 @@ 
 #endif
 #include <net/net_debug.h>
 #include <net/dropreason-core.h>
+#include <net/netmem.h>
 
 /**
  * DOC: skb checksums
@@ -359,7 +360,11 @@  extern int sysctl_max_skb_frags;
  */
 #define GSO_BY_FRAGS	0xFFFF
 
-typedef struct bio_vec skb_frag_t;
+typedef struct skb_frag {
+	netmem_ref netmem;
+	unsigned int len;
+	unsigned int offset;
+} skb_frag_t;
 
 /**
  * skb_frag_size() - Returns the size of a skb fragment
@@ -367,7 +372,7 @@  typedef struct bio_vec skb_frag_t;
  */
 static inline unsigned int skb_frag_size(const skb_frag_t *frag)
 {
-	return frag->bv_len;
+	return frag->len;
 }
 
 /**
@@ -377,7 +382,7 @@  static inline unsigned int skb_frag_size(const skb_frag_t *frag)
  */
 static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
 {
-	frag->bv_len = size;
+	frag->len = size;
 }
 
 /**
@@ -387,7 +392,7 @@  static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
  */
 static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
 {
-	frag->bv_len += delta;
+	frag->len += delta;
 }
 
 /**
@@ -397,7 +402,7 @@  static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
  */
 static inline void skb_frag_size_sub(skb_frag_t *frag, int delta)
 {
-	frag->bv_len -= delta;
+	frag->len -= delta;
 }
 
 /**
@@ -417,7 +422,7 @@  static inline bool skb_frag_must_loop(struct page *p)
  *	skb_frag_foreach_page - loop over pages in a fragment
  *
  *	@f:		skb frag to operate on
- *	@f_off:		offset from start of f->bv_page
+ *	@f_off:		offset from start of f->netmem
  *	@f_len:		length from f_off to loop over
  *	@p:		(temp var) current page
  *	@p_off:		(temp var) offset from start of current page,
@@ -2429,22 +2434,37 @@  static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 	return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
+static inline void skb_frag_fill_netmem_desc(skb_frag_t *frag,
+					     netmem_ref netmem, int off,
+					     int size)
+{
+	frag->netmem = netmem;
+	frag->offset = off;
+	skb_frag_size_set(frag, size);
+}
+
 static inline void skb_frag_fill_page_desc(skb_frag_t *frag,
 					   struct page *page,
 					   int off, int size)
 {
-	frag->bv_page = page;
-	frag->bv_offset = off;
-	skb_frag_size_set(frag, size);
+	skb_frag_fill_netmem_desc(frag, page_to_netmem(page), off, size);
+}
+
+static inline void __skb_fill_netmem_desc_noacc(struct skb_shared_info *shinfo,
+						int i, netmem_ref netmem,
+						int off, int size)
+{
+	skb_frag_t *frag = &shinfo->frags[i];
+
+	skb_frag_fill_netmem_desc(frag, netmem, off, size);
 }
 
 static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
 					      int i, struct page *page,
 					      int off, int size)
 {
-	skb_frag_t *frag = &shinfo->frags[i];
-
-	skb_frag_fill_page_desc(frag, page, off, size);
+	__skb_fill_netmem_desc_noacc(shinfo, i, page_to_netmem(page), off,
+				     size);
 }
 
 /**
@@ -2460,10 +2480,10 @@  static inline void skb_len_add(struct sk_buff *skb, int delta)
 }
 
 /**
- * __skb_fill_page_desc - initialise a paged fragment in an skb
+ * __skb_fill_netmem_desc - initialise a fragment in an skb
  * @skb: buffer containing fragment to be initialised
- * @i: paged fragment index to initialise
- * @page: the page to use for this fragment
+ * @i: fragment index to initialise
+ * @netmem: the netmem to use for this fragment
  * @off: the offset to the data with @page
  * @size: the length of the data
  *
@@ -2472,10 +2492,12 @@  static inline void skb_len_add(struct sk_buff *skb, int delta)
  *
  * Does not take any additional reference on the fragment.
  */
-static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
-					struct page *page, int off, int size)
+static inline void __skb_fill_netmem_desc(struct sk_buff *skb, int i,
+					  netmem_ref netmem, int off, int size)
 {
-	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
+	struct page *page = netmem_to_page(netmem);
+
+	__skb_fill_netmem_desc_noacc(skb_shinfo(skb), i, netmem, off, size);
 
 	/* Propagate page pfmemalloc to the skb if we can. The problem is
 	 * that not all callers have unique ownership of the page but rely
@@ -2483,7 +2505,20 @@  static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	 */
 	page = compound_head(page);
 	if (page_is_pfmemalloc(page))
-		skb->pfmemalloc	= true;
+		skb->pfmemalloc = true;
+}
+
+static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
+					struct page *page, int off, int size)
+{
+	__skb_fill_netmem_desc(skb, i, page_to_netmem(page), off, size);
+}
+
+static inline void skb_fill_netmem_desc(struct sk_buff *skb, int i,
+					netmem_ref netmem, int off, int size)
+{
+	__skb_fill_netmem_desc(skb, i, netmem, off, size);
+	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
 /**
@@ -2503,8 +2538,7 @@  static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
 				      struct page *page, int off, int size)
 {
-	__skb_fill_page_desc(skb, i, page, off, size);
-	skb_shinfo(skb)->nr_frags = i + 1;
+	skb_fill_netmem_desc(skb, i, page_to_netmem(page), off, size);
 }
 
 /**
@@ -2530,6 +2564,8 @@  static inline void skb_fill_page_desc_noacc(struct sk_buff *skb, int i,
 
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size, unsigned int truesize);
+void skb_add_rx_frag_netmem(struct sk_buff *skb, int i, netmem_ref netmem,
+			    int off, int size, unsigned int truesize);
 
 void skb_coalesce_rx_frag(struct sk_buff *skb, int i, int size,
 			  unsigned int truesize);
@@ -3378,7 +3414,7 @@  static inline void skb_propagate_pfmemalloc(const struct page *page,
  */
 static inline unsigned int skb_frag_off(const skb_frag_t *frag)
 {
-	return frag->bv_offset;
+	return frag->offset;
 }
 
 /**
@@ -3388,7 +3424,7 @@  static inline unsigned int skb_frag_off(const skb_frag_t *frag)
  */
 static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
 {
-	frag->bv_offset += delta;
+	frag->offset += delta;
 }
 
 /**
@@ -3398,7 +3434,7 @@  static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
  */
 static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
 {
-	frag->bv_offset = offset;
+	frag->offset = offset;
 }
 
 /**
@@ -3409,7 +3445,7 @@  static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
 static inline void skb_frag_off_copy(skb_frag_t *fragto,
 				     const skb_frag_t *fragfrom)
 {
-	fragto->bv_offset = fragfrom->bv_offset;
+	fragto->offset = fragfrom->offset;
 }
 
 /**
@@ -3420,7 +3456,7 @@  static inline void skb_frag_off_copy(skb_frag_t *fragto,
  */
 static inline struct page *skb_frag_page(const skb_frag_t *frag)
 {
-	return frag->bv_page;
+	return netmem_to_page(frag->netmem);
 }
 
 /**
@@ -3524,7 +3560,7 @@  static inline void *skb_frag_address_safe(const skb_frag_t *frag)
 static inline void skb_frag_page_copy(skb_frag_t *fragto,
 				      const skb_frag_t *fragfrom)
 {
-	fragto->bv_page = fragfrom->bv_page;
+	fragto->netmem = fragfrom->netmem;
 }
 
 bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 12d22c0b8551..4fdc33c81969 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -845,16 +845,24 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 }
 EXPORT_SYMBOL(__napi_alloc_skb);
 
-void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
-		     int size, unsigned int truesize)
+void skb_add_rx_frag_netmem(struct sk_buff *skb, int i, netmem_ref netmem,
+			    int off, int size, unsigned int truesize)
 {
 	DEBUG_NET_WARN_ON_ONCE(size > truesize);
 
-	skb_fill_page_desc(skb, i, page, off, size);
+	skb_fill_netmem_desc(skb, i, netmem, off, size);
 	skb->len += size;
 	skb->data_len += size;
 	skb->truesize += truesize;
 }
+EXPORT_SYMBOL(skb_add_rx_frag_netmem);
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+		     int size, unsigned int truesize)
+{
+	skb_add_rx_frag_netmem(skb, i, page_to_netmem(page), off, size,
+			       truesize);
+}
 EXPORT_SYMBOL(skb_add_rx_frag);
 
 void skb_coalesce_rx_frag(struct sk_buff *skb, int i, int size,
@@ -1904,10 +1912,11 @@  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 
 	/* skb frags point to kernel buffers */
 	for (i = 0; i < new_frags - 1; i++) {
-		__skb_fill_page_desc(skb, i, head, 0, psize);
+		__skb_fill_netmem_desc(skb, i, page_to_netmem(head), 0, psize);
 		head = (struct page *)page_private(head);
 	}
-	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
+	__skb_fill_netmem_desc(skb, new_frags - 1, page_to_netmem(head), 0,
+			       d_off);
 	skb_shinfo(skb)->nr_frags = new_frags;
 
 release:
@@ -3645,7 +3654,8 @@  skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
 		if (plen) {
 			page = virt_to_head_page(from->head);
 			offset = from->data - (unsigned char *)page_address(page);
-			__skb_fill_page_desc(to, 0, page, offset, plen);
+			__skb_fill_netmem_desc(to, 0, page_to_netmem(page),
+					       offset, plen);
 			get_page(page);
 			j = 1;
 			len -= plen;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 1184d40167b8..145ef22b2b35 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -636,9 +636,14 @@  static int kcm_write_msgs(struct kcm_sock *kcm)
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			msize += skb_frag_size(&skb_shinfo(skb)->frags[i]);
 
+		if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
+			ret = -EINVAL;
+			goto out;
+		}
+
 		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
-			      skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
-			      msize);
+			      (const struct bio_vec *)skb_shinfo(skb)->frags,
+			      skb_shinfo(skb)->nr_frags, msize);
 		iov_iter_advance(&msg.msg_iter, txm->frag_offset);
 
 		do {