[net-next,v4] hv_netvsc: Mark VF as slave before exposing it to user-mode

Message ID 1699484212-24079-1-git-send-email-longli@linuxonhyperv.com
State New
Headers
Series [net-next,v4] hv_netvsc: Mark VF as slave before exposing it to user-mode |

Commit Message

longli@linuxonhyperv.com Nov. 8, 2023, 10:56 p.m. UTC
  From: Long Li <longli@microsoft.com>

When a VF is being exposed form the kernel, it should be marked as "slave"
before exposing to the user-mode. The VF is not usable without netvsc running
as master. The user-mode should never see a VF without the "slave" flag.

An example of a user-mode program depending on this flag is cloud-init
(https://github.com/canonical/cloud-init/blob/19.3/cloudinit/net/__init__.py)
When scanning interfaces, it checks on if this interface has a master to
decide if it should be configured. There are other user-mode programs perform
similar checks.

This commit moves the code of setting the slave flag to the time before VF is
exposed to user-mode.

Signed-off-by: Long Li <longli@microsoft.com>
---

Change since v1:
Use a new function to handle NETDEV_POST_INIT.

Change since v2:
Add "net" in subject. Add more details on the user-mode program behavior.

Change since v3:
Change target to net-next.

 drivers/net/hyperv/netvsc_drv.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)
  

Comments

Jakub Kicinski Nov. 9, 2023, 2:13 a.m. UTC | #1
On Wed,  8 Nov 2023 14:56:52 -0800 longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> When a VF is being exposed form the kernel, it should be marked as "slave"
> before exposing to the user-mode. The VF is not usable without netvsc running
> as master. The user-mode should never see a VF without the "slave" flag.
> 
> An example of a user-mode program depending on this flag is cloud-init
> (https://github.com/canonical/cloud-init/blob/19.3/cloudinit/net/__init__.py)

Quick grep for "flags", "priv" and "slave" doesn't show anything.
Can you point me to the line of code?

> When scanning interfaces, it checks on if this interface has a master to
> decide if it should be configured. There are other user-mode programs perform
> similar checks.
> 
> This commit moves the code of setting the slave flag to the time before VF is
> exposed to user-mode.

> Change since v3:
> Change target to net-next.

You don't consider this a fix? It seems like a race condition.

> -		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
> -			netdev_notice(vf_netdev,
> -				      "falling back to mac addr based matching\n");
> +		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr) ||
> +		    ether_addr_equal(vf_netdev->dev_addr, ndev->perm_addr))

This change doesn't seem to be described in the commit message.

Please note that we have a rule against reposting patches within 24h:

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#resending-after-review
  
Long Li Nov. 10, 2023, 12:43 a.m. UTC | #2
> Subject: Re: [PATCH net-next v4] hv_netvsc: Mark VF as slave before exposing it
> to user-mode
> 
> On Wed,  8 Nov 2023 14:56:52 -0800 longli@linuxonhyperv.com wrote:
> > From: Long Li <longli@microsoft.com>
> >
> > When a VF is being exposed form the kernel, it should be marked as "slave"
> > before exposing to the user-mode. The VF is not usable without netvsc
> > running as master. The user-mode should never see a VF without the "slave"
> flag.
> >
> > An example of a user-mode program depending on this flag is cloud-init
> > (https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit
> > hub.com%2Fcanonical%2Fcloud-
> init%2Fblob%2F19.3%2Fcloudinit%2Fnet%2F__i
> >
> nit__.py&data=05%7C01%7Clongli%40microsoft.com%7C5fd05bce17d2471c74c
> 00
> >
> 8dbe0c9728b%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C63835092
> 80435
> >
> 56592%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMz
> IiLCJB
> >
> TiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=zAL2hc8338ci8Tl5
> ktZjk
> > mWZKCKWMqa%2BGlsE7Ty9g00%3D&reserved=0)
> 
> Quick grep for "flags", "priv" and "slave" doesn't show anything.
> Can you point me to the line of code?

I'm sorry, The URL I put in the commit should be: (I didn't realize the change has not been merged, here is the buggy code)
https://github.com/canonical/cloud-init/blob/3f515387142007fe0992a45486a1e049198a82f2/cloudinit/net/__init__.py#L1094

The code above needs to work with and without netvsc (the possible master device) present. It doesn't work properly with both conditions as of today. The patch series (with Haiyang's patches) fix that.

Because the code is specific to HyperV, we know we could be handling a VF NIC that is possibly a slave device, so checking on "slave" flag is a reliable indication whether the VF should be handled.

The current workflow in the kernel looks like this:
1. VF net device is created and expose to user-mode
2. VF is bonded to NETVSC (if NETVSC exists on the system)

With the current kernel behavior, the user-mode can possibly see the VF after 1, and before 2 when VF is bonded. When this happens, the user-mode doesn't know if the VF will be bonded in the future (it may never happen on systems without NETVSC). In this case, it doesn't know if it should configure the VF or not.

> 
> > When scanning interfaces, it checks on if this interface has a master
> > to decide if it should be configured. There are other user-mode
> > programs perform similar checks.
> >
> > This commit moves the code of setting the slave flag to the time
> > before VF is exposed to user-mode.
> 
> > Change since v3:
> > Change target to net-next.
> 
> You don't consider this a fix? It seems like a race condition.

I will work with Haiyang to get patch sent in a series.

Thanks,

Long
  
Jakub Kicinski Nov. 10, 2023, 8:05 p.m. UTC | #3
On Fri, 10 Nov 2023 00:43:55 +0000 Long Li wrote:
> The code above needs to work with and without netvsc (the possible
> master device) present.

I don't think that's a reasonable requirement for the kernel code.

The auto-bonding already puts the kernel into business of guessing
policy, which frankly we shouldn't be in.

Having the kernel guess even harder that there will be a master,
but it's not there yet, is not reasonable.
  
Stephen Hemminger Nov. 15, 2023, 4:14 p.m. UTC | #4
On Fri, 10 Nov 2023 12:05:13 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Fri, 10 Nov 2023 00:43:55 +0000 Long Li wrote:
> > The code above needs to work with and without netvsc (the possible
> > master device) present.  
> 
> I don't think that's a reasonable requirement for the kernel code.
> 
> The auto-bonding already puts the kernel into business of guessing
> policy, which frankly we shouldn't be in.
> 
> Having the kernel guess even harder that there will be a master,
> but it's not there yet, is not reasonable.
> 

I wrote the netvsc automatic VF code almost six years ago.
So let me give a little history. The original support of VF's was
done by using a bonding device and script. Haiyang worked hard
to get to work but it could not work on many distro's and had
lots of races and problems.

Jakub is right that in an ideal world, this could all be managed by
userspace. But the management of network devices in Linux is a
dumpster fire! Every distro invents there own solution, last time
I counted there were six different tools claiming to be the
"one network device manager to rule them all". And that doesn't
include all the custom scripts and vendor appliances.

The users requirements were:
 - VF networking should work out of the box
 - VF networking should require no userspace changes
 - It must work with legacy enterprise distro's
 - The first network device must show up as eth0 and it must work.

The Linux ecosystem of userspace but the kernel is a common base.
It was much easier for Microsoft to tell partners to
"use these upstream kernel components" and it will work.
Windows and BSD OS's have a tight binding between kernel and management
from userspace, therefore it is possible to handle things in userspace.

There are still problems (as Long indicated in the patch) because
the VF device does appear in the list of network devices. And
getting the transparent VF support to work in the face of all
the trash of userspace scripts is hard. Part of the problem is
that the state model of Linux network devices is fractured and
poorly documented.

The IFF_SLAVE flag is already used to indicate device is managed
by another driver. It keeps IPv6 from doing local address assignment
and existing userspace should be looking at it. The problem was
that userspace must not see a non-flagged VF device, or it will
get confused.

Microsoft should have exposed only one device in hardware.
Other vendors only expose the VF device and hairpin packets any
pre-processed packets. Part of the problem here is that
VF firmware needs to be updated (too often) and it is a requirement
that VM's do not lose connectivity.

Ideally, several things should happen:
   - Linux should support hiding devices managed by another device
   - the naming of device roles needs to not be master/slave.
  
Jakub Kicinski Nov. 18, 2023, 5:38 p.m. UTC | #5
On Wed, 15 Nov 2023 08:14:06 -0800 Stephen Hemminger wrote:
> Jakub is right that in an ideal world, this could all be managed by
> userspace. But the management of network devices in Linux is a
> dumpster fire! Every distro invents there own solution, last time
> I counted there were six different tools claiming to be the
> "one network device manager to rule them all". And that doesn't
> include all the custom scripts and vendor appliances.

To be clear, I thought Long Li was saying that the goal is work around
cases where VF is probed before netvsc. That seems like something that
can be prevented by the hypervisor.
  
Long Li Nov. 21, 2023, 12:23 a.m. UTC | #6
> On Wed, 15 Nov 2023 08:14:06 -0800 Stephen Hemminger wrote:
> > Jakub is right that in an ideal world, this could all be managed by
> > userspace. But the management of network devices in Linux is a
> > dumpster fire! Every distro invents there own solution, last time I
> > counted there were six different tools claiming to be the "one network
> > device manager to rule them all". And that doesn't include all the
> > custom scripts and vendor appliances.
> 
> To be clear, I thought Long Li was saying that the goal is work around cases where
> VF is probed before netvsc. That seems like something that can be prevented by
> the hypervisor.

Hi Jakub,

I think you misunderstood my response, here is the response again.

(quote)

The current workflow in the kernel looks like this:
1. VF net device is created and expose to user-mode 2. VF is bonded to NETVSC (if NETVSC exists on the system)

With the current kernel behavior, the user-mode can possibly see the VF after 1, and before 2 when VF is bonded. When this happens, the user-mode doesn't know if the VF will be bonded in the future (it may never happen on systems without NETVSC). In this case, it doesn't know if it should configure the VF or not.

(end quote)

The problem is not that VF could be probed before netvsc. The problem is that it's possible that VF is probed, exposed to user-mode earlier than netvsc has a chance to bond it.

Thanks,

Long
  

Patch

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ec77fb9dcf89..fdad58dcc6a8 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2206,9 +2206,6 @@  static int netvsc_vf_join(struct net_device *vf_netdev,
 		goto upper_link_failed;
 	}
 
-	/* set slave flag before open to prevent IPv6 addrconf */
-	vf_netdev->flags |= IFF_SLAVE;
-
 	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
 	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
@@ -2320,11 +2317,9 @@  static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 	 */
 	list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
 		ndev = hv_get_drvdata(ndev_ctx->device_ctx);
-		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
-			netdev_notice(vf_netdev,
-				      "falling back to mac addr based matching\n");
+		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr) ||
+		    ether_addr_equal(vf_netdev->dev_addr, ndev->perm_addr))
 			return ndev;
-		}
 	}
 
 	netdev_notice(vf_netdev,
@@ -2332,6 +2327,19 @@  static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 	return NULL;
 }
 
+static int netvsc_prepare_slave(struct net_device *vf_netdev)
+{
+	struct net_device *ndev;
+
+	ndev = get_netvsc_byslot(vf_netdev);
+	if (!ndev)
+		return NOTIFY_DONE;
+
+	/* set slave flag before open to prevent IPv6 addrconf */
+	vf_netdev->flags |= IFF_SLAVE;
+	return NOTIFY_DONE;
+}
+
 static int netvsc_register_vf(struct net_device *vf_netdev)
 {
 	struct net_device_context *net_device_ctx;
@@ -2753,6 +2761,8 @@  static int netvsc_netdev_event(struct notifier_block *this,
 		return NOTIFY_DONE;
 
 	switch (event) {
+	case NETDEV_POST_INIT:
+		return netvsc_prepare_slave(event_dev);
 	case NETDEV_REGISTER:
 		return netvsc_register_vf(event_dev);
 	case NETDEV_UNREGISTER: