Message ID | 20240301-slab-memcg-v1-4-359328a46596@suse.cz |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-88799-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:fa17:b0:10a:f01:a869 with SMTP id ju23csp39205dyc; Fri, 1 Mar 2024 09:43:55 -0800 (PST) X-Forwarded-Encrypted: i=3; AJvYcCVibza3eP+CNYOuWVLPvQhLEUFsgKmvqJwrXwadTXs9Q8SExLN9FNJ2REtdJOf07V42Ata5a3ZWZL/wt6lIoGzGF4i1oA== X-Google-Smtp-Source: AGHT+IEQdW/H/yt/p2nSdiRyoda0GREV2jfoGySGh76Cs0aS2ebDDhMRPAWvX9mgFu7i+oGCkDvg X-Received: by 2002:a17:902:ea0a:b0:1dc:30ab:406b with SMTP id s10-20020a170902ea0a00b001dc30ab406bmr2652806plg.13.1709315035379; Fri, 01 Mar 2024 09:43:55 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1709315035; cv=pass; d=google.com; s=arc-20160816; b=rmSWL3jMX3U3PSTlqp8bw3CWwYOvVotB1OtvqOjtvrFsm2UoZdd6MIjKUKKjMu11oK aGRClZZA8tbgfswflVOEBbReXcmROM4pPWRBUwxU5FNHRqlk93c5Omr7jqR7FAc+4ciB SZmML8lucpStw5THF1/+Z2YD8dn3rQRwY5N/7XL4spGaF/V9XUlpmdSHxoSQnU2Hq53E yddXu9RxeoMZDU9KnVEoplLe3vWU/Q56cKES6oHt7H7awdhJlHCFZyCtZ+X+devgTcBX JCfZmx2Q3AmtyWSUHCGY41IXnEpOIvwYkH8r2b7tjOYHXvTU8yS/t19Uc6sDD26ey/Id yD6A== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=cc:to:in-reply-to:references:message-id:content-transfer-encoding :mime-version:list-unsubscribe:list-subscribe:list-id:precedence :subject:date:from:dkim-signature:dkim-signature:dkim-signature :dkim-signature; bh=3OiKnePSvssqSsATKZD6FgDSrsGbw6Gz9LH7uaXGsMo=; fh=WGGwdVVTGLpVYCxjY83qTUlyyBzuw+WV0hgXC1Po200=; b=gs5Qyq02HS1isMrtd0J7kKUeR5/jH6Tr6b3lbGzHDY6NDcM20uLUcuK7H+MfYERGQZ SfcW+uRliF9ocwJdVEWcPs5CFvWEnMSnnDk+SBfBw6U+BhAxqRT2IZpBhayEnQJBDYcC NBPcvOszP8W/sP2F81XrOCW0IvhhEQxL+Z7oSfUX2wCgsf2D1nU4woaB7ypYL3zNYzk8 b+r0CU75Wz2hFZpTMhaMOjv0+L17MbXnL4zOGEyI4TutW4PIuF8e1pZA+pAEEP7LZzQD cY7jYxubloBxSuqPnd59sLzYFefuWCAcdmAZvvKxJsfXt3KsZFeWOunCa7TJApuCftNg Wj+w==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=hR4RI3M7; dkim=neutral (no key) header.i=@suse.cz header.s=susede2_ed25519; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=hR4RI3M7; dkim=neutral (no key) header.i=@suse.cz header.s=susede2_ed25519; arc=pass (i=1 spf=pass spfdomain=suse.cz dkim=pass dkdomain=suse.cz dkim=pass dkdomain=suse.cz); spf=pass (google.com: domain of linux-kernel+bounces-88799-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) smtp.mailfrom="linux-kernel+bounces-88799-ouuuleilei=gmail.com@vger.kernel.org" Received: from sy.mirrors.kernel.org (sy.mirrors.kernel.org. [147.75.48.161]) by mx.google.com with ESMTPS id w13-20020a170902e88d00b001d9e1ab2acbsi4057308plg.356.2024.03.01.09.43.54 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 01 Mar 2024 09:43:55 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-88799-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) client-ip=147.75.48.161; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=hR4RI3M7; dkim=neutral (no key) header.i=@suse.cz header.s=susede2_ed25519; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=hR4RI3M7; dkim=neutral (no key) header.i=@suse.cz header.s=susede2_ed25519; arc=pass (i=1 spf=pass spfdomain=suse.cz dkim=pass dkdomain=suse.cz dkim=pass dkdomain=suse.cz); spf=pass (google.com: domain of linux-kernel+bounces-88799-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) smtp.mailfrom="linux-kernel+bounces-88799-ouuuleilei=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sy.mirrors.kernel.org (Postfix) with ESMTPS id ED4F9B2A16D for <ouuuleilei@gmail.com>; Fri, 1 Mar 2024 17:08:53 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 9730338398; Fri, 1 Mar 2024 17:07:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=suse.cz header.i=@suse.cz header.b="hR4RI3M7"; dkim=permerror (0-bit key) header.d=suse.cz header.i=@suse.cz header.b="o05E46K6"; dkim=pass (1024-bit key) header.d=suse.cz header.i=@suse.cz header.b="hR4RI3M7"; dkim=permerror (0-bit key) header.d=suse.cz header.i=@suse.cz header.b="o05E46K6" Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C4FBA46B7; Fri, 1 Mar 2024 17:07:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=195.135.223.131 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709312834; cv=none; b=aSP5QSHl+bulrDBSro5CHLmmUSNwv0xOh2W7PCHX9bzVMw+iLY9BY8gk7cqQ7COLt84qFZlEVgcqME6gXzVsVb8xGCtTodcpd49dfWYZxSSlZ0QBB+354r8UWV09wXnoh3iLrxkdWqHlb10l7N1m0Vb4D6ijFhoqBMDkJvwgPB4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709312834; c=relaxed/simple; bh=QnfkJAoInt5fUE+416z0sQ20bCXrkSWJKNSLICwxmGM=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=p95dDAU7RfLK74kD+9IggMtkH5zglerivEdHaMv236iJh6+dGhFTqrcAIe6/PqFS0LKbAwSI4cdmhZo0bk2FzmAdWOSnyBoHTPLywNe/xHhx8npCU3++sYE3ZVCDxjp+7rOOwj1Dk06nZWlYPqojnfhENh7p1D4iaNMHfeZ2xbI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=suse.cz; spf=pass smtp.mailfrom=suse.cz; dkim=pass (1024-bit key) header.d=suse.cz header.i=@suse.cz header.b=hR4RI3M7; dkim=permerror (0-bit key) header.d=suse.cz header.i=@suse.cz header.b=o05E46K6; dkim=pass (1024-bit key) header.d=suse.cz header.i=@suse.cz header.b=hR4RI3M7; dkim=permerror (0-bit key) header.d=suse.cz header.i=@suse.cz header.b=o05E46K6; arc=none smtp.client-ip=195.135.223.131 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=suse.cz Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=suse.cz Received: from imap1.dmz-prg2.suse.org (imap1.dmz-prg2.suse.org [10.150.64.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 13FDA207B4; Fri, 1 Mar 2024 17:07:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1709312831; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=3OiKnePSvssqSsATKZD6FgDSrsGbw6Gz9LH7uaXGsMo=; b=hR4RI3M7QdmI77qwLuZOBimpV0p2ze0qGZw1B4hIMLU2F/D7t3Gu5VbTAR9gTJgDi6vCeY CHs1xCbyeczWk9Hb3kbBfU7sfuhKG4jF0YdapV00Mtliv7GsO2UuM0v+bywHfHlN1udd4J we+zKQVacDSXAT9F9S8KKuGQXrqXA18= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1709312831; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=3OiKnePSvssqSsATKZD6FgDSrsGbw6Gz9LH7uaXGsMo=; b=o05E46K6EN+fy853AkUBuf6HrQD5ZThoOdIKf4tjmcYR9J8tcaU11fvMi6vu+GSL77uZkk Y/9da7wAoHpaIGBA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1709312831; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=3OiKnePSvssqSsATKZD6FgDSrsGbw6Gz9LH7uaXGsMo=; b=hR4RI3M7QdmI77qwLuZOBimpV0p2ze0qGZw1B4hIMLU2F/D7t3Gu5VbTAR9gTJgDi6vCeY CHs1xCbyeczWk9Hb3kbBfU7sfuhKG4jF0YdapV00Mtliv7GsO2UuM0v+bywHfHlN1udd4J we+zKQVacDSXAT9F9S8KKuGQXrqXA18= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1709312831; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=3OiKnePSvssqSsATKZD6FgDSrsGbw6Gz9LH7uaXGsMo=; b=o05E46K6EN+fy853AkUBuf6HrQD5ZThoOdIKf4tjmcYR9J8tcaU11fvMi6vu+GSL77uZkk Y/9da7wAoHpaIGBA== Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id E618113AA3; Fri, 1 Mar 2024 17:07:10 +0000 (UTC) Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167]) by imap1.dmz-prg2.suse.org with ESMTPSA id MKjgNz4L4mUcGQAAD6G6ig (envelope-from <vbabka@suse.cz>); Fri, 01 Mar 2024 17:07:10 +0000 From: Vlastimil Babka <vbabka@suse.cz> Date: Fri, 01 Mar 2024 18:07:11 +0100 Subject: [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <20240301-slab-memcg-v1-4-359328a46596@suse.cz> References: <20240301-slab-memcg-v1-0-359328a46596@suse.cz> In-Reply-To: <20240301-slab-memcg-v1-0-359328a46596@suse.cz> To: Linus Torvalds <torvalds@linux-foundation.org>, Josh Poimboeuf <jpoimboe@kernel.org>, Jeff Layton <jlayton@kernel.org>, Chuck Lever <chuck.lever@oracle.com>, Kees Cook <kees@kernel.org>, Christoph Lameter <cl@linux.com>, Pekka Enberg <penberg@kernel.org>, David Rientjes <rientjes@google.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, Andrew Morton <akpm@linux-foundation.org>, Roman Gushchin <roman.gushchin@linux.dev>, Hyeonggon Yoo <42.hyeyoo@gmail.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Shakeel Butt <shakeelb@google.com>, Muchun Song <muchun.song@linux.dev>, Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-fsdevel@vger.kernel.org, Vlastimil Babka <vbabka@suse.cz> X-Mailer: b4 0.13.0 Authentication-Results: smtp-out2.suse.de; none X-Spam-Level: X-Spam-Score: -6.80 X-Spamd-Result: default: False [-6.80 / 50.00]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; TO_DN_SOME(0.00)[]; FREEMAIL_ENVRCPT(0.00)[gmail.com]; TO_MATCH_ENVRCPT_ALL(0.00)[]; TAGGED_RCPT(0.00)[]; MIME_GOOD(-0.10)[text/plain]; REPLY(-4.00)[]; BAYES_HAM(-3.00)[100.00%]; NEURAL_HAM_LONG(-1.00)[-1.000]; R_RATELIMIT(0.00)[to_ip_from(RL8ogcagzi1y561i1mcnzpnkwh)]; RCVD_COUNT_THREE(0.00)[3]; DKIM_SIGNED(0.00)[suse.cz:s=susede2_rsa,suse.cz:s=susede2_ed25519]; NEURAL_HAM_SHORT(-0.20)[-1.000]; RCPT_COUNT_TWELVE(0.00)[24]; FREEMAIL_TO(0.00)[linux-foundation.org,kernel.org,oracle.com,linux.com,google.com,lge.com,linux.dev,gmail.com,cmpxchg.org,zeniv.linux.org.uk,suse.cz]; FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; RCVD_TLS_ALL(0.00)[]; SUSPICIOUS_RECIPS(1.50)[] X-Spam-Flag: NO X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1792346722299573439 X-GMAIL-MSGID: 1792346722299573439 |
Series |
memcg_kmem hooks refactoring and kmem_cache_charge()
|
|
Commit Message
Vlastimil Babka
March 1, 2024, 5:07 p.m. UTC
This is just an example of using the kmem_cache_charge() API. I think it's placed in a place that's applicable for Linus's example [1] although he mentions do_dentry_open() - I have followed from strace() showing openat(2) to path_openat() doing the alloc_empty_file(). The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that in alloc_empty_file_noaccount() (despite the contradictory name but the noaccount refers to something else, right?) as IIUC it's about kernel-internal opens. alloc_empty_file() is now not doing the accounting, so I added kmem_account_file() that calls the new kmem_cache_charge() API. Why is this unfinished: - there are other callers of alloc_empty_file() which I didn't adjust so they simply became memcg-unaccounted. I haven't investigated for which ones it would make also sense to separate the allocation and accounting. Maybe alloc_empty_file() would need to get a parameter to control this. - I don't know how to properly unwind the accounting failure case. It seems like a new case because when we succeed the open, there's no further error path at least in path_openat(). Basically it boils down I'm unfamiliar with VFS so this depends if this approach is deemed useful enough to finish it. [1] https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ Not-signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- fs/file_table.c | 9 +++++++-- fs/internal.h | 1 + fs/namei.c | 4 +++- 3 files changed, 11 insertions(+), 3 deletions(-)
Comments
On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote: > > This is just an example of using the kmem_cache_charge() API. I think > it's placed in a place that's applicable for Linus's example [1] > although he mentions do_dentry_open() - I have followed from strace() > showing openat(2) to path_openat() doing the alloc_empty_file(). Thanks. This is not the right patch, but yes, patches 1-3 look very nice to me. > The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that > want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that > in alloc_empty_file_noaccount() (despite the contradictory name but the > noaccount refers to something else, right?) as IIUC it's about > kernel-internal opens. Yeah, the "noaccount" function is about not accounting it towards nr_files. That said, I don't think it necessarily needs to do the memory accounting either - it's literally for cases where we're never going to install the file descriptor in any user space. Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't think it's really the right thing either, because > Why is this unfinished: > > - there are other callers of alloc_empty_file() which I didn't adjust so > they simply became memcg-unaccounted. I haven't investigated for which > ones it would make also sense to separate the allocation and accounting. > Maybe alloc_empty_file() would need to get a parameter to control > this. Right. I think the natural and logical way to deal with this is to just say "we account when we add the file to the fdtable". IOW, just have fd_install() do it. That's the really natural point, and also makes it very logical why alloc_empty_file_noaccount() wouldn't need to do the GFP_KERNEL_ACCOUNT. > - I don't know how to properly unwind the accounting failure case. It > seems like a new case because when we succeed the open, there's no > further error path at least in path_openat(). Yeah, let me think about this part. Becasue fd_install() is the right point, but that too does not really allow for error handling. Yes, we could close things and fail it, but it really is much too late at this point. What I *think* I'd want for this case is (a) allow the accounting to go over by a bit (b) make sure there's a cheap way to ask (before) about "did we go over the limit" IOW, the accounting never needed to be byte-accurate to begin with, and making it fail (cheaply and early) on the next file allocation is fine. Just make it really cheap. Can we do that? For example, maybe don't bother with the whole "bytes and pages" stuff. Just a simple "are we more than one page over?" kind of question. Without the 'stock_lock' mess for sub-page bytes etc How would that look? Would it result in something that can be done cheaply without locking and atomics and without excessive pointer indirection through many levels of memcg data structures? Linus
On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote: > On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote: > > > > This is just an example of using the kmem_cache_charge() API. I think > > it's placed in a place that's applicable for Linus's example [1] > > although he mentions do_dentry_open() - I have followed from strace() > > showing openat(2) to path_openat() doing the alloc_empty_file(). > > Thanks. This is not the right patch, but yes, patches 1-3 look very nice to me. > > > The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that > > want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that > > in alloc_empty_file_noaccount() (despite the contradictory name but the > > noaccount refers to something else, right?) as IIUC it's about > > kernel-internal opens. > > Yeah, the "noaccount" function is about not accounting it towards nr_files. > > That said, I don't think it necessarily needs to do the memory > accounting either - it's literally for cases where we're never going > to install the file descriptor in any user space. > > Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't > think it's really the right thing either, because > > > Why is this unfinished: > > > > - there are other callers of alloc_empty_file() which I didn't adjust so > > they simply became memcg-unaccounted. I haven't investigated for which > > ones it would make also sense to separate the allocation and accounting. > > Maybe alloc_empty_file() would need to get a parameter to control > > this. > > Right. I think the natural and logical way to deal with this is to > just say "we account when we add the file to the fdtable". > > IOW, just have fd_install() do it. That's the really natural point, > and also makes it very logical why alloc_empty_file_noaccount() > wouldn't need to do the GFP_KERNEL_ACCOUNT. > > > - I don't know how to properly unwind the accounting failure case. It > > seems like a new case because when we succeed the open, there's no > > further error path at least in path_openat(). > > Yeah, let me think about this part. Becasue fd_install() is the right > point, but that too does not really allow for error handling. > > Yes, we could close things and fail it, but it really is much too late > at this point. > > What I *think* I'd want for this case is > > (a) allow the accounting to go over by a bit > > (b) make sure there's a cheap way to ask (before) about "did we go > over the limit" > > IOW, the accounting never needed to be byte-accurate to begin with, > and making it fail (cheaply and early) on the next file allocation is > fine. > > Just make it really cheap. Can we do that? > > For example, maybe don't bother with the whole "bytes and pages" > stuff. Just a simple "are we more than one page over?" kind of > question. Without the 'stock_lock' mess for sub-page bytes etc > > How would that look? Would it result in something that can be done > cheaply without locking and atomics and without excessive pointer > indirection through many levels of memcg data structures? I think it's possible and I'm currently looking into batching charge, objcg refcnt management and vmstats using per-task caching. It should speed up things for the majority of allocations. For allocations from an irq context and targeted allocations (where the target memcg != memcg of the current task) we'd probably need to keep the old scheme. I hope to post some patches relatively soon. I tried to optimize the current implementation but failed to get any significant gains. It seems that the overhead is very evenly spread across objcg pointer access, charge management, objcg refcnt management and vmstats. Thanks!
On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote: > On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote: > > > > This is just an example of using the kmem_cache_charge() API. I think > > it's placed in a place that's applicable for Linus's example [1] > > although he mentions do_dentry_open() - I have followed from strace() > > showing openat(2) to path_openat() doing the alloc_empty_file(). > > Thanks. This is not the right patch, but yes, patches 1-3 look very nice to me. > > > The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that > > want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that > > in alloc_empty_file_noaccount() (despite the contradictory name but the > > noaccount refers to something else, right?) as IIUC it's about > > kernel-internal opens. > > Yeah, the "noaccount" function is about not accounting it towards nr_files. > That said, I don't think it necessarily needs to do the memory > accounting either - it's literally for cases where we're never going > to install the file descriptor in any user space. Exactly. > Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't > think it's really the right thing either, because > > > Why is this unfinished: > > > > - there are other callers of alloc_empty_file() which I didn't adjust so > > they simply became memcg-unaccounted. I haven't investigated for which > > ones it would make also sense to separate the allocation and accounting. > > Maybe alloc_empty_file() would need to get a parameter to control > > this. > > Right. I think the natural and logical way to deal with this is to > just say "we account when we add the file to the fdtable". > IOW, just have fd_install() do it. That's the really natural point, > and also makes it very logical why alloc_empty_file_noaccount() > wouldn't need to do the GFP_KERNEL_ACCOUNT. > > > - I don't know how to properly unwind the accounting failure case. It > > seems like a new case because when we succeed the open, there's no > > further error path at least in path_openat(). > > Yeah, let me think about this part. Becasue fd_install() is the right > point, but that too does not really allow for error handling. > > Yes, we could close things and fail it, but it really is much too late > at this point. It would also mean massaging 100+ callsites. And having a non-subsystems specific failure step between file allocation, fd reservation and fd_install() would be awkward and an invitation for bugs. > What I *think* I'd want for this case is > > (a) allow the accounting to go over by a bit > > (b) make sure there's a cheap way to ask (before) about "did we go > over the limit" > > IOW, the accounting never needed to be byte-accurate to begin with, > and making it fail (cheaply and early) on the next file allocation is > fine. I think that's a good idea.
diff --git a/fs/file_table.c b/fs/file_table.c index b991f90571b4..6401b6f175ae 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -223,6 +223,11 @@ struct file *alloc_empty_file(int flags, const struct cred *cred) return ERR_PTR(-ENFILE); } +int kmem_account_file(struct file *f) +{ + return kmem_cache_charge(filp_cachep, GFP_KERNEL_ACCOUNT, f); +} + /* * Variant of alloc_empty_file() that doesn't check and modify nr_files. * @@ -234,7 +239,7 @@ struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred) struct file *f; int error; - f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL); + f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL_ACCOUNT); if (unlikely(!f)) return ERR_PTR(-ENOMEM); @@ -468,7 +473,7 @@ void __init files_init(void) { filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN | - SLAB_PANIC | SLAB_ACCOUNT, NULL); + SLAB_PANIC, NULL); percpu_counter_init(&nr_files, 0, GFP_KERNEL); } diff --git a/fs/internal.h b/fs/internal.h index b67406435fc0..06ada11b71d0 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -96,6 +96,7 @@ extern void chroot_fs_refs(const struct path *, const struct path *); struct file *alloc_empty_file(int flags, const struct cred *cred); struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred); struct file *alloc_empty_backing_file(int flags, const struct cred *cred); +int kmem_account_file(struct file *file); static inline void file_put_write_access(struct file *file) { diff --git a/fs/namei.c b/fs/namei.c index 4e0de939fea1..fcf3f3fcd059 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3799,8 +3799,10 @@ static struct file *path_openat(struct nameidata *nd, terminate_walk(nd); } if (likely(!error)) { - if (likely(file->f_mode & FMODE_OPENED)) + if (likely(file->f_mode & FMODE_OPENED)) { + kmem_account_file(file); return file; + } WARN_ON(1); error = -EINVAL; }