[stable] memcg: add refcnt for pcpu stock to avoid UAF problem in drain_all_stock()
Message ID | 20240221081801.69764-1-gongruiqi1@huawei.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-74290-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:693c:2685:b0:108:e6aa:91d0 with SMTP id mn5csp898989dyc; Wed, 21 Feb 2024 00:12:01 -0800 (PST) X-Forwarded-Encrypted: i=3; AJvYcCXagJo8jDVbW0iSgT3HNEcXnzUqkLw1obxefOIAP+JN6vrd+LsG0HDj8NLPBXHYW7tJLwC+HN2FPpdqdFxH/iikvAtYtQ== X-Google-Smtp-Source: AGHT+IEoJK+askwh+VhbA2++H7pBtLV4XCIpk8KevwGt/bBvRmwaN8eyi2eVkBCcvpVDTwmLEVbB X-Received: by 2002:a05:6214:2681:b0:68f:9051:5879 with SMTP id gm1-20020a056214268100b0068f90515879mr5513925qvb.21.1708503121099; Wed, 21 Feb 2024 00:12:01 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1708503121; cv=pass; d=google.com; s=arc-20160816; b=TM16ay+zCL3QXjCK/e0zFEbblI7FaVS5hZ6z4FdXme3RV563+kwUfzHaPx5ZIZx41i j8A4nW6lSqKT9iX9bNBrKoIBEot1IZAv8Qtaj6ECe1OjhHL3WQmfdrOsfrcnWf2930JA sHRc6qCCWJSh5s5IWmN9QETQeEMyYpK7DyeIomO2eyFQsXqHSuAt+aVGCOJjN+VgwzKz mdr7Xs75MRCYVTcVcQP4z0UyQoUoTPtyzUNGm5aOOPCy1lA8AAaCwcDXxy42ssrzgVry vM7mwjvlqEu69q3AD/qCVM5vtHmf3pRsZiTkos2hCzOjS4vDf0t0PUvt+1Auqw/uBZiY 58nw== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:message-id:date:subject:cc:to :from; bh=QswBzDOL/bQpw/1cySEjF/CQyPWFC1XkxLNwNJXPJPc=; fh=sC9h1KtoA2fpJ9v9kFtRxMra5VbMdEMrB1z4VFzsSBM=; b=qvKUds0PBwidV3L+TU+VEVO0XIWb8SqcqmhpBz2wGZxrJeEvf82QFca9uQs8YGZMuL PitgS7FcJlfZdqWGXlQoW83jmiWYjimAHijKsOiXLsA89QWeMLrUo0a/fCj52DRewEJ+ hMJZ3FrLhIyo8SXu3afBEaz5gRaBTszn42wC6v0x2kVLEswjOHNEW8Ig3f9uzb5NbLiE cKd9OXLlahdGgBtVNfq/VxTWSa/1RY11VR+/quHvi7naNUp5Kt6eJ2Psf5ynGWgB30zP ORh/wYaSrIjtrcq/4jWbKYoqNEqsJL3NHU0kxIP1Rd3BLJPs9Lroj7TWjMwbFdhfpiGN gm/A==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; arc=pass (i=1 spf=pass spfdomain=huawei.com dmarc=pass fromdomain=huawei.com); spf=pass (google.com: domain of linux-kernel+bounces-74290-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-74290-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [2604:1380:45d1:ec00::1]) by mx.google.com with ESMTPS id jf15-20020a0562142a4f00b0068f6ee1e8e7si6400323qvb.217.2024.02.21.00.12.00 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 21 Feb 2024 00:12:01 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-74290-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) client-ip=2604:1380:45d1:ec00::1; Authentication-Results: mx.google.com; arc=pass (i=1 spf=pass spfdomain=huawei.com dmarc=pass fromdomain=huawei.com); spf=pass (google.com: domain of linux-kernel+bounces-74290-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-74290-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id CB6CD1C22AFC for <ouuuleilei@gmail.com>; Wed, 21 Feb 2024 08:12:00 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 015423C08F; Wed, 21 Feb 2024 08:11:48 +0000 (UTC) Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [45.249.212.189]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0A79A3B78D; Wed, 21 Feb 2024 08:11:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.189 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708503106; cv=none; b=AFbgVSTOv7glz1Zao6s6au5FzpTIaCiqEkJpZmOxNxN/v6H/4uloTU1AWZImgnaHIQR9NhLLncyFr8kn5zv9U+t0L0zliJApjqqIiTiaIdNkYKnMY499HWg/XjGg5A3g2c7ww/s8/qDijsr7aZM3BOpEFH+F50pxzE96vFKnJNc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708503106; c=relaxed/simple; bh=+2+yAJpVza7viz8w2DkLR7wHx5qlz5d/JWVKzbzMLbQ=; h=From:To:CC:Subject:Date:Message-ID:MIME-Version:Content-Type; b=RzDpM62x27hkspPPn1/iriHDBUb3p7k4KbYvGisfS0d8kZRrUB5NMlWsIGatJjMjc0osLYhOSdSfTr4X2PIlbdRk9nssijDZKm8K8ZpgAQV+nG4x0rpUfhhWaCFmvzJTo0/ipNMhkrwWxUCLwGZOhV1gxIHsIi4HFaXamBUUmgI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.189 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.105]) by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4Tfpq72rMSzNlm7; Wed, 21 Feb 2024 16:10:11 +0800 (CST) Received: from dggpemd100002.china.huawei.com (unknown [7.185.36.164]) by mail.maildlp.com (Postfix) with ESMTPS id 55334140114; Wed, 21 Feb 2024 16:11:34 +0800 (CST) Received: from huawei.com (10.67.174.33) by dggpemd100002.china.huawei.com (7.185.36.164) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.2.1258.28; Wed, 21 Feb 2024 16:11:34 +0800 From: "GONG, Ruiqi" <gongruiqi1@huawei.com> To: <linux-kernel@vger.kernel.org>, <stable@vger.kernel.org>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, Muchun Song <muchun.song@linux.dev> CC: <cgroups@vger.kernel.org>, <linux-mm@kvack.org>, Wang Weiyang <wangweiyang2@huawei.com>, Xiu Jianfeng <xiujianfeng@huawei.com> Subject: [PATCH stable] memcg: add refcnt for pcpu stock to avoid UAF problem in drain_all_stock() Date: Wed, 21 Feb 2024 16:18:01 +0800 Message-ID: <20240221081801.69764-1-gongruiqi1@huawei.com> X-Mailer: git-send-email 2.25.1 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain X-ClientProxiedBy: dggems703-chm.china.huawei.com (10.3.19.180) To dggpemd100002.china.huawei.com (7.185.36.164) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1791495368862431885 X-GMAIL-MSGID: 1791495368862431885 |
Series |
[stable] memcg: add refcnt for pcpu stock to avoid UAF problem in drain_all_stock()
|
|
Commit Message
Gong Ruiqi
Feb. 21, 2024, 8:18 a.m. UTC
commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream.
There was a kernel panic happened on an in-house environment running
3.10, and the same problem was reproduced on 4.19:
general protection fault: 0000 [#1] SMP PTI
CPU: 1 PID: 2085 Comm: bash Kdump: loaded Tainted: G L 4.19.90+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010 drain_all_stock+0xad/0x140
Code: 00 00 4d 85 ff 74 2c 45 85 c9 74 27 4d 39 fc 74 42 41 80 bc 24 28 04 00 00 00 74 17 49 8b 04 24 49 8b 17 48 8b 88 90 02 00 00 <48> 39 8a 90 02 00 00 74 02 eb 86 48 63 88 3c 01 00 00 39 8a 3c 01
RSP: 0018:ffffa7efc5813d70 EFLAGS: 00010202
RAX: ffff8cb185548800 RBX: ffff8cb89f420160 RCX: ffff8cb1867b6000
RDX: babababababababa RSI: 0000000000000001 RDI: 0000000000231876
RBP: 0000000000000000 R08: 0000000000000415 R09: 0000000000000002
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8cb186f89040
R13: 0000000000020160 R14: 0000000000000001 R15: ffff8cb186b27040
FS: 00007f4a308d3740(0000) GS:ffff8cb89f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe4d634a68 CR3: 000000010b022000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
mem_cgroup_force_empty_write+0x31/0xb0
cgroup_file_write+0x60/0x140
? __check_object_size+0x136/0x147
kernfs_fop_write+0x10e/0x190
__vfs_write+0x37/0x1b0
? selinux_file_permission+0xe8/0x130
? security_file_permission+0x2e/0xb0
vfs_write+0xb6/0x1a0
ksys_write+0x57/0xd0
do_syscall_64+0x63/0x250
? async_page_fault+0x8/0x30
entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Modules linked in: ...
It is found that in case of stock->nr_pages == 0, the memcg on
stock->cached could be freed due to its refcnt decreased to 0, which
made stock->cached become a dangling pointer. It could cause a UAF
problem in drain_all_stock() in the following concurrent scenario. Note
that drain_all_stock() doesn't disable irq but only preemption.
CPU1 CPU2
==============================================================================
stock->cached = memcgA (freed)
drain_all_stock(memcgB)
rcu_read_lock()
memcg = CPU1's stock->cached (memcgA)
(interrupted)
refill_stock(memcgC)
drain_stock(memcgA)
stock->cached = memcgC
stock->nr_pages += xxx (> 0)
stock->nr_pages > 0
mem_cgroup_is_descendant(memcgA, memcgB) [UAF]
rcu_read_unlock()
This problem is, unintenionally, fixed at 5.9, where commit 1a3e1f40962c
("mm: memcontrol: decouple reference counting from page accounting")
adds memcg refcnt for stock. Therefore affected LTS versions include
4.19 and 5.4.
For 4.19, memcg's css offline process doesn't call drain_all_stock(). so
it's easier for the released memcg to be left on the stock. For 5.4,
although mem_cgroup_css_offline() does call drain_all_stock(), but the
flushing could be skipped when stock->nr_pages happens to be 0, and
besides the async draining could be delayed and take place after the UAF
problem has happened.
Fix this problem by adding (and decreasing) memcg's refcnt when memcg is
put onto (and removed from) stock, just like how commit 1a3e1f40962c
("mm: memcontrol: decouple reference counting from page accounting")
does. After all, "being on the stock" is a kind of reference with
regards to memcg. As such, it's guaranteed that a css on stock would not
be freed.
Cc: stable@vger.kernel.org # 4.19 5.4
Fixes: cdec2e4265df ("memcg: coalesce charging via percpu storage")
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
---
mm/memcontrol.c | 6 ++++++
1 file changed, 6 insertions(+)
Comments
On Wed 21-02-24 16:18:01, GONG, Ruiqi wrote: > commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream. I think it would be good to mention that this is only a partial backport and also explain why to do a partial rather than the full one. > There was a kernel panic happened on an in-house environment running > 3.10, and the same problem was reproduced on 4.19: > > general protection fault: 0000 [#1] SMP PTI > CPU: 1 PID: 2085 Comm: bash Kdump: loaded Tainted: G L 4.19.90+ #7 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 > RIP: 0010 drain_all_stock+0xad/0x140 > Code: 00 00 4d 85 ff 74 2c 45 85 c9 74 27 4d 39 fc 74 42 41 80 bc 24 28 04 00 00 00 74 17 49 8b 04 24 49 8b 17 48 8b 88 90 02 00 00 <48> 39 8a 90 02 00 00 74 02 eb 86 48 63 88 3c 01 00 00 39 8a 3c 01 > RSP: 0018:ffffa7efc5813d70 EFLAGS: 00010202 > RAX: ffff8cb185548800 RBX: ffff8cb89f420160 RCX: ffff8cb1867b6000 > RDX: babababababababa RSI: 0000000000000001 RDI: 0000000000231876 > RBP: 0000000000000000 R08: 0000000000000415 R09: 0000000000000002 > R10: 0000000000000000 R11: 0000000000000001 R12: ffff8cb186f89040 > R13: 0000000000020160 R14: 0000000000000001 R15: ffff8cb186b27040 > FS: 00007f4a308d3740(0000) GS:ffff8cb89f440000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007ffe4d634a68 CR3: 000000010b022000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > mem_cgroup_force_empty_write+0x31/0xb0 > cgroup_file_write+0x60/0x140 > ? __check_object_size+0x136/0x147 > kernfs_fop_write+0x10e/0x190 > __vfs_write+0x37/0x1b0 > ? selinux_file_permission+0xe8/0x130 > ? security_file_permission+0x2e/0xb0 > vfs_write+0xb6/0x1a0 > ksys_write+0x57/0xd0 > do_syscall_64+0x63/0x250 > ? async_page_fault+0x8/0x30 > entry_SYSCALL_64_after_hwframe+0x5c/0xc1 > Modules linked in: ... > > It is found that in case of stock->nr_pages == 0, the memcg on > stock->cached could be freed due to its refcnt decreased to 0, which > made stock->cached become a dangling pointer. It could cause a UAF > problem in drain_all_stock() in the following concurrent scenario. Note > that drain_all_stock() doesn't disable irq but only preemption. > > CPU1 CPU2 > ============================================================================== > stock->cached = memcgA (freed) > drain_all_stock(memcgB) > rcu_read_lock() > memcg = CPU1's stock->cached (memcgA) > (interrupted) > refill_stock(memcgC) > drain_stock(memcgA) > stock->cached = memcgC > stock->nr_pages += xxx (> 0) > stock->nr_pages > 0 > mem_cgroup_is_descendant(memcgA, memcgB) [UAF] > rcu_read_unlock() > > This problem is, unintenionally, fixed at 5.9, where commit 1a3e1f40962c > ("mm: memcontrol: decouple reference counting from page accounting") > adds memcg refcnt for stock. Therefore affected LTS versions include > 4.19 and 5.4. > > For 4.19, memcg's css offline process doesn't call drain_all_stock(). so > it's easier for the released memcg to be left on the stock. For 5.4, > although mem_cgroup_css_offline() does call drain_all_stock(), but the > flushing could be skipped when stock->nr_pages happens to be 0, and > besides the async draining could be delayed and take place after the UAF > problem has happened. > > Fix this problem by adding (and decreasing) memcg's refcnt when memcg is > put onto (and removed from) stock, just like how commit 1a3e1f40962c > ("mm: memcontrol: decouple reference counting from page accounting") > does. After all, "being on the stock" is a kind of reference with > regards to memcg. As such, it's guaranteed that a css on stock would not > be freed. What does prevent from the following? refill_stock(memcgC) drain_all_stock(memcgB) drain_stock(memcgA) rcu_read_lock() css_put(old->css) memcgA = stock->cached mem_cgroup_is_descendant(memcgA, memcgB) UAF stock->cached = NULL > > Cc: stable@vger.kernel.org # 4.19 5.4 > Fixes: cdec2e4265df ("memcg: coalesce charging via percpu storage") > Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> > --- > mm/memcontrol.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 5a366cf79821..8c04296df1c7 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2015,6 +2015,9 @@ static void drain_stock(struct memcg_stock_pcp *stock) > { > struct mem_cgroup *old = stock->cached; > > + if (!old) > + return; > + > if (stock->nr_pages) { > page_counter_uncharge(&old->memory, stock->nr_pages); > if (do_memsw_account()) > @@ -2022,6 +2025,8 @@ static void drain_stock(struct memcg_stock_pcp *stock) > css_put_many(&old->css, stock->nr_pages); > stock->nr_pages = 0; > } > + > + css_put(&old->css); > stock->cached = NULL; > } > > @@ -2057,6 +2062,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) > stock = this_cpu_ptr(&memcg_stock); > if (stock->cached != memcg) { /* reset if necessary */ > drain_stock(stock); > + css_get(&memcg->css); > stock->cached = memcg; > } > stock->nr_pages += nr_pages; > -- > 2.25.1 >
On Wed, Feb 21, 2024 at 04:18:01PM +0800, GONG, Ruiqi wrote: > commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream. > > There was a kernel panic happened on an in-house environment running > 3.10, and the same problem was reproduced on 4.19: Now queued up, thanks! greg k-h
On Wed 21-02-24 09:40:29, Greg KH wrote: > On Wed, Feb 21, 2024 at 04:18:01PM +0800, GONG, Ruiqi wrote: > > commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream. > > > > There was a kernel panic happened on an in-house environment running > > 3.10, and the same problem was reproduced on 4.19: > > Now queued up, thanks! Give it some more time Greg, http://lkml.kernel.org/r/ZdW2gMajIw_cUN2-@tiehlicka
On 2024/02/21 16:38, Michal Hocko wrote: > On Wed 21-02-24 16:18:01, GONG, Ruiqi wrote: >> commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream. > > I think it would be good to mention that this is only a partial backport > and also explain why to do a partial rather than the full one. > Okay. I think to fix this problem we should add refcnt relation between memcg and stock, and since higher versions have achieved this, maybe it's better to use the same code and align with them. So I put a "commit xxx upstream" here, as requested in kernel docs[1]. So yes it's a partial backport as we only need the stock part. [1]: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#stable-kernel-rules >> There was a kernel panic happened on an in-house environment running >> 3.10, and the same problem was reproduced on 4.19: >> >> general protection fault: 0000 [#1] SMP PTI >> CPU: 1 PID: 2085 Comm: bash Kdump: loaded Tainted: G L 4.19.90+ #7 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 >> RIP: 0010 drain_all_stock+0xad/0x140 >> Code: 00 00 4d 85 ff 74 2c 45 85 c9 74 27 4d 39 fc 74 42 41 80 bc 24 28 04 00 00 00 74 17 49 8b 04 24 49 8b 17 48 8b 88 90 02 00 00 <48> 39 8a 90 02 00 00 74 02 eb 86 48 63 88 3c 01 00 00 39 8a 3c 01 >> RSP: 0018:ffffa7efc5813d70 EFLAGS: 00010202 >> RAX: ffff8cb185548800 RBX: ffff8cb89f420160 RCX: ffff8cb1867b6000 >> RDX: babababababababa RSI: 0000000000000001 RDI: 0000000000231876 >> RBP: 0000000000000000 R08: 0000000000000415 R09: 0000000000000002 >> R10: 0000000000000000 R11: 0000000000000001 R12: ffff8cb186f89040 >> R13: 0000000000020160 R14: 0000000000000001 R15: ffff8cb186b27040 >> FS: 00007f4a308d3740(0000) GS:ffff8cb89f440000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 00007ffe4d634a68 CR3: 000000010b022000 CR4: 00000000000006e0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Call Trace: >> mem_cgroup_force_empty_write+0x31/0xb0 >> cgroup_file_write+0x60/0x140 >> ? __check_object_size+0x136/0x147 >> kernfs_fop_write+0x10e/0x190 >> __vfs_write+0x37/0x1b0 >> ? selinux_file_permission+0xe8/0x130 >> ? security_file_permission+0x2e/0xb0 >> vfs_write+0xb6/0x1a0 >> ksys_write+0x57/0xd0 >> do_syscall_64+0x63/0x250 >> ? async_page_fault+0x8/0x30 >> entry_SYSCALL_64_after_hwframe+0x5c/0xc1 >> Modules linked in: ... >> >> It is found that in case of stock->nr_pages == 0, the memcg on >> stock->cached could be freed due to its refcnt decreased to 0, which >> made stock->cached become a dangling pointer. It could cause a UAF >> problem in drain_all_stock() in the following concurrent scenario. Note >> that drain_all_stock() doesn't disable irq but only preemption. >> >> CPU1 CPU2 >> ============================================================================== >> stock->cached = memcgA (freed) >> drain_all_stock(memcgB) >> rcu_read_lock() >> memcg = CPU1's stock->cached (memcgA) >> (interrupted) >> refill_stock(memcgC) >> drain_stock(memcgA) >> stock->cached = memcgC >> stock->nr_pages += xxx (> 0) >> stock->nr_pages > 0 >> mem_cgroup_is_descendant(memcgA, memcgB) [UAF] >> rcu_read_unlock() >> >> This problem is, unintenionally, fixed at 5.9, where commit 1a3e1f40962c >> ("mm: memcontrol: decouple reference counting from page accounting") >> adds memcg refcnt for stock. Therefore affected LTS versions include >> 4.19 and 5.4. >> >> For 4.19, memcg's css offline process doesn't call drain_all_stock(). so >> it's easier for the released memcg to be left on the stock. For 5.4, >> although mem_cgroup_css_offline() does call drain_all_stock(), but the >> flushing could be skipped when stock->nr_pages happens to be 0, and >> besides the async draining could be delayed and take place after the UAF >> problem has happened. >> >> Fix this problem by adding (and decreasing) memcg's refcnt when memcg is >> put onto (and removed from) stock, just like how commit 1a3e1f40962c >> ("mm: memcontrol: decouple reference counting from page accounting") >> does. After all, "being on the stock" is a kind of reference with >> regards to memcg. As such, it's guaranteed that a css on stock would not >> be freed. > > What does prevent from the following? > > refill_stock(memcgC) drain_all_stock(memcgB) > drain_stock(memcgA) rcu_read_lock() > css_put(old->css) memcgA = stock->cached > mem_cgroup_is_descendant(memcgA, memcgB) UAF > stock->cached = NULL > I think it's not a problem since refill_stock() has disabled irq before calling drain_stock(): refill_stock(memcgC) local_irq_save drain_stock(memcgA) css_put(old->css) <1> stock->cached = NULL local_irq_restore <2> And since css_put(old->css) is an RCU free, memcgA would not be freed at <1> as it's still in grace period. The actual release of memcgA could happen only after irq is enabled (at <2>). And for CPU2, the access to stock->cached in drain_all_stock() is protected by rcu_read_lock(), so from stock->cached we get either NULL, or a memcgA that is still not freed. Please correct me if I have some wrong understanding to RCU. >> >> Cc: stable@vger.kernel.org # 4.19 5.4 >> Fixes: cdec2e4265df ("memcg: coalesce charging via percpu storage") >> Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> >> --- >> mm/memcontrol.c | 6 ++++++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 5a366cf79821..8c04296df1c7 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -2015,6 +2015,9 @@ static void drain_stock(struct memcg_stock_pcp *stock) >> { >> struct mem_cgroup *old = stock->cached; >> >> + if (!old) >> + return; >> + >> if (stock->nr_pages) { >> page_counter_uncharge(&old->memory, stock->nr_pages); >> if (do_memsw_account()) >> @@ -2022,6 +2025,8 @@ static void drain_stock(struct memcg_stock_pcp *stock) >> css_put_many(&old->css, stock->nr_pages); >> stock->nr_pages = 0; >> } >> + >> + css_put(&old->css); >> stock->cached = NULL; >> } >> >> @@ -2057,6 +2062,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) >> stock = this_cpu_ptr(&memcg_stock); >> if (stock->cached != memcg) { /* reset if necessary */ >> drain_stock(stock); >> + css_get(&memcg->css); >> stock->cached = memcg; >> } >> stock->nr_pages += nr_pages; >> -- >> 2.25.1 >> >
On Wed 21-02-24 17:50:27, Gong Ruiqi wrote: > > On 2024/02/21 16:38, Michal Hocko wrote: > > On Wed 21-02-24 16:18:01, GONG, Ruiqi wrote: > >> commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream. > > > > I think it would be good to mention that this is only a partial backport > > and also explain why to do a partial rather than the full one. > > > > Okay. I think to fix this problem we should add refcnt relation between > memcg and stock, and since higher versions have achieved this, maybe > it's better to use the same code and align with them. So I put a "commit > xxx upstream" here, as requested in kernel docs[1]. So yes it's a > partial backport as we only need the stock part. I think it is sufficient to mention that this is a partial backport to minimize the fix to the bare minimum. [...] > > What does prevent from the following? > > > > refill_stock(memcgC) drain_all_stock(memcgB) > > drain_stock(memcgA) rcu_read_lock() > > css_put(old->css) memcgA = stock->cached > > mem_cgroup_is_descendant(memcgA, memcgB) UAF > > stock->cached = NULL > > > > I think it's not a problem since refill_stock() has disabled irq before > calling drain_stock(): > > refill_stock(memcgC) > local_irq_save > drain_stock(memcgA) > css_put(old->css) > <1> > stock->cached = NULL > local_irq_restore > <2> > > And since css_put(old->css) is an RCU free, memcgA would not be freed at > <1> as it's still in grace period. The actual release of memcgA could > happen only after irq is enabled (at <2>). > > And for CPU2, the access to stock->cached in drain_all_stock() is > protected by rcu_read_lock(), so from stock->cached we get either NULL, > or a memcgA that is still not freed. > > Please correct me if I have some wrong understanding to RCU. You are right. Thanks! IRQ disabling is there in one form or the other since db2ba40c277d ("mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting") so 4.8+ is safe. Backports to older kernels would nee to pull this one as well. > >> Cc: stable@vger.kernel.org # 4.19 5.4 > >> Fixes: cdec2e4265df ("memcg: coalesce charging via percpu storage") > >> Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Thanks!
On Wed 21-02-24 11:08:02, Michal Hocko wrote: [...] > You are right. Thanks! IRQ disabling is there in one form or the other > since db2ba40c277d ("mm: memcontrol: make per-cpu charge cache IRQ-safe > for socket accounting") so 4.8+ is safe. Backports to older kernels > would nee to pull this one as well. Scratch that. Prior to db2ba40c277d, we used to disable preemption via {get,put}_cpu_var and that should achive the same result.
On Wed, Feb 21, 2024 at 09:48:05AM +0100, Michal Hocko wrote: > On Wed 21-02-24 09:40:29, Greg KH wrote: > > On Wed, Feb 21, 2024 at 04:18:01PM +0800, GONG, Ruiqi wrote: > > > commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream. > > > > > > There was a kernel panic happened on an in-house environment running > > > 3.10, and the same problem was reproduced on 4.19: > > > > Now queued up, thanks! > > Give it some more time Greg, http://lkml.kernel.org/r/ZdW2gMajIw_cUN2-@tiehlicka Oops, now dropped, thanks! greg k-h
On Wed 21-02-24 11:33:55, Greg KH wrote: > On Wed, Feb 21, 2024 at 09:48:05AM +0100, Michal Hocko wrote: > > On Wed 21-02-24 09:40:29, Greg KH wrote: > > > On Wed, Feb 21, 2024 at 04:18:01PM +0800, GONG, Ruiqi wrote: > > > > commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream. > > > > > > > > There was a kernel panic happened on an in-house environment running > > > > 3.10, and the same problem was reproduced on 4.19: > > > > > > Now queued up, thanks! > > > > Give it some more time Greg, http://lkml.kernel.org/r/ZdW2gMajIw_cUN2-@tiehlicka > > Oops, now dropped, thanks! Race conditions. The thing got resolved in the mean time. The patch is a partial backport, and I am OK with that. Now with my questions resolved this is good to know but a note about this being a partial backport in the changelog would be nice.
On 2024/02/21 18:08, Michal Hocko wrote: > On Wed 21-02-24 17:50:27, Gong Ruiqi wrote: >> >> On 2024/02/21 16:38, Michal Hocko wrote: >>> On Wed 21-02-24 16:18:01, GONG, Ruiqi wrote: >>>> commit 1a3e1f40962c445b997151a542314f3c6097f8c3 upstream. >>> >>> I think it would be good to mention that this is only a partial backport >>> and also explain why to do a partial rather than the full one. >>> >> >> Okay. I think to fix this problem we should add refcnt relation between >> memcg and stock, and since higher versions have achieved this, maybe >> it's better to use the same code and align with them. So I put a "commit >> xxx upstream" here, as requested in kernel docs[1]. So yes it's a >> partial backport as we only need the stock part. > > I think it is sufficient to mention that this is a partial backport to > minimize the fix to the bare minimum. > Okay. Then I will send a v2 and add a paragraph to mention this. > [...]
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5a366cf79821..8c04296df1c7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2015,6 +2015,9 @@ static void drain_stock(struct memcg_stock_pcp *stock) { struct mem_cgroup *old = stock->cached; + if (!old) + return; + if (stock->nr_pages) { page_counter_uncharge(&old->memory, stock->nr_pages); if (do_memsw_account()) @@ -2022,6 +2025,8 @@ static void drain_stock(struct memcg_stock_pcp *stock) css_put_many(&old->css, stock->nr_pages); stock->nr_pages = 0; } + + css_put(&old->css); stock->cached = NULL; } @@ -2057,6 +2062,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) stock = this_cpu_ptr(&memcg_stock); if (stock->cached != memcg) { /* reset if necessary */ drain_stock(stock); + css_get(&memcg->css); stock->cached = memcg; } stock->nr_pages += nr_pages;