From patchwork Sun Feb 19 13:10:05 2023
X-Patchwork-Submitter: Felix Fietkau
X-Patchwork-Id: 59146
From: Felix Fietkau
To: netdev@vger.kernel.org, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-kernel@vger.kernel.org
Subject: [RFC v3] net/core: add optional threading for backlog processing
Date: Sun, 19 Feb 2023 14:10:05 +0100
Message-Id: <20230219131006.92681-1-nbd@nbd.name>

When dealing with few flows or an imbalance in CPU utilization, static RPS
CPU assignment can be too inflexible. Add support for enabling threaded NAPI
for backlog processing, in order to allow the scheduler to better balance
the processing. This helps spread the load across otherwise idle CPUs.
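For context (not part of the patch itself): with this applied, the feature
would be toggled through the new net.core.backlog_threaded sysctl added
below; the per-CPU kthread names follow from the kthread_run() call in
backlog_set_threaded(). A hypothetical usage sketch:

```shell
# Enable threaded backlog processing; on first enable this spawns one
# napi/backlog-<cpu> kthread per possible CPU and sets NAPIF_STATE_THREADED
# on each per-CPU backlog NAPI instance.
sysctl -w net.core.backlog_threaded=1

# The backlog kthreads are then ordinary kernel threads, visible to and
# balanced by the scheduler, and can be inspected or pinned as usual:
ps -e -o pid,comm | grep 'napi/backlog'
```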
Signed-off-by: Felix Fietkau
---
RFC v3:
 - make patch more generic, applies to backlog processing in general
 - fix process queue access on flush
RFC v2:
 - fix rebase error in rps locking

 include/linux/netdevice.h  |  2 +
 net/core/dev.c             | 78 +++++++++++++++++++++++++++++++++++---
 net/core/sysctl_net_core.c | 27 +++++++++++++
 3 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d9cdbc047b49..b3cef91b1696 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -522,6 +522,7 @@ static inline bool napi_complete(struct napi_struct *n)
 }
 
 int dev_set_threaded(struct net_device *dev, bool threaded);
+int backlog_set_threaded(bool threaded);
 
 /**
  *	napi_disable - prevent NAPI from scheduling
@@ -3192,6 +3193,7 @@ struct softnet_data {
 	unsigned int		cpu;
 	unsigned int		input_queue_tail;
 #endif
+	unsigned int		process_queue_empty;
 	unsigned int		received_rps;
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
diff --git a/net/core/dev.c b/net/core/dev.c
index 357081b0113c..76874513b7b5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4597,7 +4597,7 @@ static int napi_schedule_rps(struct softnet_data *sd)
 	struct softnet_data *mysd = this_cpu_ptr(&softnet_data);
 
 #ifdef CONFIG_RPS
-	if (sd != mysd) {
+	if (sd != mysd && !test_bit(NAPI_STATE_THREADED, &sd->backlog.state)) {
 		sd->rps_ipi_next = mysd->rps_ipi_list;
 		mysd->rps_ipi_list = sd;
 
@@ -5778,6 +5778,8 @@ static DEFINE_PER_CPU(struct work_struct, flush_works);
 /* Network device is going away, flush any packets still pending */
 static void flush_backlog(struct work_struct *work)
 {
+	unsigned int process_queue_empty;
+	bool threaded, flush_processq;
 	struct sk_buff *skb, *tmp;
 	struct softnet_data *sd;
 
@@ -5792,8 +5794,15 @@ static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+
+	threaded = test_bit(NAPI_STATE_THREADED, &sd->backlog.state);
+	flush_processq = threaded &&
+			 !skb_queue_empty_lockless(&sd->process_queue);
 	rps_unlock_irq_enable(sd);
 
+	if (threaded)
+		goto out;
+
 	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
 			__skb_unlink(skb, &sd->process_queue);
@@ -5801,7 +5810,16 @@ static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+
+out:
 	local_bh_enable();
+
+	while (flush_processq) {
+		msleep(1);
+		rps_lock_irq_disable(sd);
+		flush_processq = process_queue_empty == sd->process_queue_empty;
+		rps_unlock_irq_enable(sd);
+	}
 }
 
 static bool flush_required(int cpu)
@@ -5933,16 +5951,16 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		}
 
 		rps_lock_irq_disable(sd);
+		sd->process_queue_empty++;
 		if (skb_queue_empty(&sd->input_pkt_queue)) {
 			/*
 			 * Inline a custom version of __napi_complete().
-			 * only current cpu owns and manipulates this napi,
-			 * and NAPI_STATE_SCHED is the only possible flag set
-			 * on backlog.
+			 * only current cpu owns and manipulates this napi.
 			 * We can use a plain write instead of clear_bit(),
 			 * and we dont need an smp_mb() memory barrier.
 			 */
-			napi->state = 0;
+			napi->state &= ~(NAPIF_STATE_SCHED |
+					 NAPIF_STATE_SCHED_THREADED);
 			again = false;
 		} else {
 			skb_queue_splice_tail_init(&sd->input_pkt_queue,
@@ -6356,6 +6374,53 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 }
 EXPORT_SYMBOL(dev_set_threaded);
 
+int backlog_set_threaded(bool threaded)
+{
+	static bool backlog_threaded;
+	int err = 0;
+	int i;
+
+	if (backlog_threaded == threaded)
+		return 0;
+
+	for_each_possible_cpu(i) {
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
+		struct napi_struct *n = &sd->backlog;
+
+		n->thread = kthread_run(napi_threaded_poll, n, "napi/backlog-%d", i);
+		if (IS_ERR(n->thread)) {
+			err = PTR_ERR(n->thread);
+			pr_err("kthread_run failed with err %d\n", err);
+			n->thread = NULL;
+			threaded = false;
+			break;
+		}
+
+	}
+
+	backlog_threaded = threaded;
+
+	/* Make sure kthread is created before THREADED bit
+	 * is set.
+	 */
+	smp_mb__before_atomic();
+
+	for_each_possible_cpu(i) {
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
+		struct napi_struct *n = &sd->backlog;
+		unsigned long flags;
+
+		rps_lock_irqsave(sd, &flags);
+		if (threaded)
+			n->state |= NAPIF_STATE_THREADED;
+		else
+			n->state &= ~NAPIF_STATE_THREADED;
+		rps_unlock_irq_restore(sd, &flags);
+	}
+
+	return err;
+}
+
 void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 			   int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -11114,6 +11179,9 @@ static int dev_cpu_dead(unsigned int oldcpu)
 	raise_softirq_irqoff(NET_TX_SOFTIRQ);
 	local_irq_enable();
 
+	if (test_bit(NAPI_STATE_THREADED, &oldsd->backlog.state))
+		return 0;
+
 #ifdef CONFIG_RPS
 	remsd = oldsd->rps_ipi_list;
 	oldsd->rps_ipi_list = NULL;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 7130e6d9e263..3eea703b69d7 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -30,6 +30,7 @@ static int int_3600 = 3600;
 static int min_sndbuf = SOCK_MIN_SNDBUF;
 static int min_rcvbuf = SOCK_MIN_RCVBUF;
 static int max_skb_frags = MAX_SKB_FRAGS;
+static int backlog_threaded;
 
 static int net_msg_warn;	/* Unused, but still a sysctl */
 
@@ -165,6 +166,23 @@ static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
 }
 #endif /* CONFIG_RPS */
 
+static int backlog_threaded_sysctl(struct ctl_table *table, int write,
+				   void *buffer, size_t *lenp, loff_t *ppos)
+{
+	static DEFINE_MUTEX(backlog_threaded_mutex);
+	int ret;
+
+	mutex_lock(&backlog_threaded_mutex);
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (write && !ret)
+		ret = backlog_set_threaded(backlog_threaded);
+
+	mutex_unlock(&backlog_threaded_mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_NET_FLOW_LIMIT
 static DEFINE_MUTEX(flow_limit_update_mutex);
 
@@ -514,6 +532,15 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= rps_default_mask_sysctl
 	},
 #endif
+	{
+		.procname	= "backlog_threaded",
+		.data		= &backlog_threaded,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= backlog_threaded_sysctl,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
 #ifdef CONFIG_NET_FLOW_LIMIT
 	{
 		.procname	= "flow_limit_cpu_bitmap",