From patchwork Mon Nov 27 22:05:21 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: NeilBrown X-Patchwork-Id: 170425 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp3495669vqx; Mon, 27 Nov 2023 14:05:51 -0800 (PST) X-Google-Smtp-Source: AGHT+IG5INNTey37CosAXItm8f1ZAKEXCEhlfyZYL6YO2I2sseqB2MiHz4XwVoosK8j0DN92ocv4 X-Received: by 2002:a17:902:d4c5:b0:1cc:22db:cf3d with SMTP id o5-20020a170902d4c500b001cc22dbcf3dmr23467563plg.15.1701122751368; Mon, 27 Nov 2023 14:05:51 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1701122751; cv=none; d=google.com; s=arc-20160816; b=AMVXjEJWb3pBhaqdkmRGvzJR7KcitQdgNoMfgPXhruM9/SSza3BdPLHhEqLrORoE44 Fsu1rPeX8gqBfDyQeFR9XBGJPP11JE3oITqeciynhISqyE95sFUdiFPOoimT+at8pAzY s10MArnMYfnveWUleksoZNA2Becs3z9nao6B8fBYgwuvmwA1qSmW41M9PKRo9vDAxd7b rd6agvOgjdDcjbQ9rCMtxnkwO3NzksVjznHuRUphtAUL/DAFqHVbG03H8pO1Dp0Y0l0N 1n2r3zBJRmUdBr1TOVhBNCjhxIswYbD/TjFrKjAyxFRKzZ+AMwHUjjWKp03sqrkRn7Ts 24ug== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:message-id:date:subject:cc:to:from:mime-version :content-transfer-encoding; bh=pAW5iga1AmyC5/8eqjklmnEceEyMHzGam0M7OSRkvxw=; fh=Ztty36qtkQLs7Q3c1H26mJJAy3kQBJDD0zo+We6C3rQ=; b=pmFByZkxFGdItl2a+ZmDnq92xWnSpwxvuHTMMDYK+URoKvYL/kNY9bn0faAfToksl3 LJrL0gDqU0wxv0dmuJzZASMkmGTmw8SfQa4wfzOZdqqgltgroDEj8VbYYOPnPjx4Ac8z iT2rP2xdSKgtxm05x+GA+boxKtsEywF34q1FEiBS8qEdHzAaYuj5vZ3zuyQG4Q+RUT76 3cHYx0c+dtwU3Qo5t7PZMUJ/tiKBH6PolYHeRKMekq8RZ2kjakAVLUmUW7oh5boJm8Vs lqfLKr3r0pUfubFpjq8Eqktx5SEVPgQwudtogKH2zKFy3y9kdt/iklvN5YOgRiSNsh6F YuqA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: from agentk.vger.email (agentk.vger.email. [2620:137:e000::3:2]) by mx.google.com with ESMTPS id j2-20020a170902da8200b001c77674ea97si10918427plx.423.2023.11.27.14.05.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 27 Nov 2023 14:05:51 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) client-ip=2620:137:e000::3:2; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id AFA17805ADDA; Mon, 27 Nov 2023 14:05:47 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233512AbjK0WFc convert rfc822-to-8bit (ORCPT + 99 others); Mon, 27 Nov 2023 17:05:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58602 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233569AbjK0WF1 (ORCPT ); Mon, 27 Nov 2023 17:05:27 -0500 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2a07:de40:b251:101:10:150:64:2]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 83CC210F0; Mon, 27 Nov 2023 14:05:31 -0800 (PST) Received: from imap1.dmz-prg2.suse.org (imap1.dmz-prg2.suse.org [IPv6:2a07:de40:b281:104:10:150:64:97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 8C3601F388; Mon, 27 Nov 2023 22:05:29 +0000 (UTC) Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id 020361367B; Mon, 27 Nov 2023 22:05:24 +0000 (UTC) Received: from dovecot-director2.suse.de ([10.150.64.162]) by imap1.dmz-prg2.suse.org with ESMTPSA id 1znWJ6QSZWU6OwAAD6G6ig (envelope-from ); Mon, 27 Nov 2023 22:05:24 +0000 MIME-Version: 1.0 From: "NeilBrown" To: Al Viro , Christian Brauner , Jens Axboe , Oleg Nesterov , Chuck Lever , Jeff Layton , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org Subject: [PATCH/RFC] core/nfsd: allow kernel threads to use task_work. Date: Tue, 28 Nov 2023 09:05:21 +1100 Message-id: <170112272125.7109.6245462722883333440@noble.neil.brown.name> X-Spamd-Bar: ++ Authentication-Results: smtp-out2.suse.de; dkim=none; dmarc=fail reason="No valid SPF, No valid DKIM" header.from=suse.de (policy=none); spf=softfail (smtp-out2.suse.de: 2a07:de40:b281:104:10:150:64:97 is neither permitted nor denied by domain of neilb@suse.de) smtp.mailfrom=neilb@suse.de X-Rspamd-Server: rspamd2 X-Spamd-Result: default: False [2.81 / 50.00]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; SPAMHAUS_XBL(0.00)[2a07:de40:b281:104:10:150:64:97:from]; FROM_HAS_DN(0.00)[]; TO_DN_SOME(0.00)[]; TO_MATCH_ENVRCPT_ALL(0.00)[]; NEURAL_HAM_LONG(-0.98)[-0.979]; MIME_GOOD(-0.10)[text/plain]; R_SPF_SOFTFAIL(4.60)[~all:c]; RCVD_COUNT_THREE(0.00)[3]; MX_GOOD(-0.01)[]; RCPT_COUNT_TWELVE(0.00)[13]; DBL_BLOCKED_OPENRESOLVER(0.00)[suse.de:email]; FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(2.20)[]; MIME_TRACE(0.00)[0:+]; RCVD_TLS_ALL(0.00)[]; BAYES_HAM(-3.00)[100.00%]; DMARC_POLICY_SOFTFAIL(0.10)[suse.de : No valid SPF, No valid DKIM,none] X-Spam-Score: 2.81 X-Rspamd-Queue-Id: 8C3601F388 X-Spam-Status: No, score=-0.8 required=5.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Mon, 27 Nov 2023 14:05:47 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783756489930869661 X-GMAIL-MSGID: 1783756489930869661 I have evidence from a customer site of 256 nfsd threads adding files to delayed_fput_lists nearly twice as fast they are retired by a single work-queue thread running delayed_fput(). As you might imagine this does not end well (20 million files in the queue at the time a snapshot was taken for analysis). While this might point to a problem with the filesystem not handling the final close efficiently, such problems should only hurt throughput, not lead to memory exhaustion. For normal threads, the thread that closes the file also calls the final fput so there is natural rate limiting preventing excessive growth in the list of delayed fputs. For kernel threads, and particularly for nfsd, delayed in the final fput do not impose any throttling to prevent the thread from closing more files. A simple way to fix this is to treat nfsd threads like normal processes for task_work. Thus the pending files are queued for the thread, and the same thread finishes the work. Currently KTHREADs are assumed never to call task_work_run(). With this patch that it still the default but it is implemented by storing the magic value TASK_WORKS_DISABLED in ->task_works. If a kthread, such as nfsd, will call task_work_run() periodically, it sets ->task_works to NULL to indicate this. Signed-off-by: NeilBrown --- I wonder which tree this should go through assuming everyone likes it. VFS maybe?? Thanks. fs/file_table.c | 2 +- fs/nfsd/nfssvc.c | 4 ++++ include/linux/sched.h | 1 + include/linux/task_work.h | 4 +++- kernel/fork.c | 2 +- kernel/task_work.c | 7 ++++--- 6 files changed, 14 insertions(+), 6 deletions(-) diff --git a/fs/file_table.c b/fs/file_table.c index de4a2915bfd4..e79351df22be 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -445,7 +445,7 @@ void fput(struct file *file) if (atomic_long_dec_and_test(&file->f_count)) { struct task_struct *task = current; - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { + if (likely(!in_interrupt())) { init_task_work(&file->f_rcuhead, ____fput); if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME)) return; diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c index 66ca50b38b27..c047961262ca 100644 --- a/fs/nfsd/nfssvc.c +++ b/fs/nfsd/nfssvc.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -941,6 +942,7 @@ nfsd(void *vrqstp) } current->fs->umask = 0; + current->task_works = NULL; /* Declare that I will call task_work_run() */ atomic_inc(&nfsdstats.th_cnt); @@ -955,6 +957,8 @@ nfsd(void *vrqstp) svc_recv(rqstp); validate_process_creds(); + if (task_work_pending(current)) + task_work_run(); } atomic_dec(&nfsdstats.th_cnt); diff --git a/include/linux/sched.h b/include/linux/sched.h index 292c31697248..c63c2bedbf71 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1117,6 +1117,7 @@ struct task_struct { unsigned int sas_ss_flags; struct callback_head *task_works; +#define TASK_WORKS_DISABLED ((void*)1) #ifdef CONFIG_AUDIT #ifdef CONFIG_AUDITSYSCALL diff --git a/include/linux/task_work.h b/include/linux/task_work.h index 795ef5a68429..3c74e3de81ed 100644 --- a/include/linux/task_work.h +++ b/include/linux/task_work.h @@ -22,7 +22,9 @@ enum task_work_notify_mode { static inline bool task_work_pending(struct task_struct *task) { - return READ_ONCE(task->task_works); + struct callback_head *works = READ_ONCE(task->task_works); + + return works && works != TASK_WORKS_DISABLED; } int task_work_add(struct task_struct *task, struct callback_head *twork, diff --git a/kernel/fork.c b/kernel/fork.c index 10917c3e1f03..903b29804fe1 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2577,7 +2577,7 @@ __latent_entropy struct task_struct *copy_process( p->dirty_paused_when = 0; p->pdeath_signal = 0; - p->task_works = NULL; + p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL; clear_posix_cputimers_work(p); #ifdef CONFIG_KRETPROBES diff --git a/kernel/task_work.c b/kernel/task_work.c index 95a7e1b7f1da..ffdf4b0d7a0e 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -49,7 +49,8 @@ int task_work_add(struct task_struct *task, struct callback_head *work, head = READ_ONCE(task->task_works); do { - if (unlikely(head == &work_exited)) + if (unlikely(head == &work_exited || + head == TASK_WORKS_DISABLED)) return -ESRCH; work->next = head; } while (!try_cmpxchg(&task->task_works, &head, work)); @@ -157,7 +158,7 @@ void task_work_run(void) work = READ_ONCE(task->task_works); do { head = NULL; - if (!work) { + if (!work || work == TASK_WORKS_DISABLED) { if (task->flags & PF_EXITING) head = &work_exited; else @@ -165,7 +166,7 @@ void task_work_run(void) } } while (!try_cmpxchg(&task->task_works, &work, head)); - if (!work) + if (!work || work == TASK_WORKS_DISABLED) break; /* * Synchronize with task_work_cancel(). It can not remove