From patchwork Wed Mar 15 08:49:36 2023
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 70075
From: Dave Chinner <david@fromorbit.com>
To: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org
Cc: linux-mm@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 yebin10@huawei.com
Subject: [PATCH 2/4] pcpcntrs: fix dying cpu summation race
Date: Wed, 15 Mar 2023 19:49:36 +1100
Message-Id: <20230315084938.2544737-3-david@fromorbit.com>
In-Reply-To: <20230315084938.2544737-1-david@fromorbit.com>
References: <20230315084938.2544737-1-david@fromorbit.com>

From: Dave Chinner <david@fromorbit.com>

In commit f689054aace2 ("percpu_counter: add percpu_counter_sum_all
interface") a race condition between a cpu dying and
percpu_counter_sum() iterating online CPUs was identified. The
solution was to iterate all possible CPUs for summation via
percpu_counter_sum_all().

We recently had a percpu_counter_sum() call in XFS trip over this
same race condition and it fired a debug assert, because the
filesystem was unmounting and the counter *should* be zero just
before we destroy it. That was reported here:

https://lore.kernel.org/linux-kernel/20230314090649.326642-1-yebin@huaweicloud.com/

likely as a result of running generic/648, which exercises
filesystems in the presence of CPU online/offline events.

The solution of using percpu_counter_sum_all() is an awful one. We
use percpu counters and percpu_counter_sum() for accurate and
reliable threshold detection for space management, so a summation
race condition during these operations can result in overcommit of
available space, and that may result in filesystem shutdowns.

As percpu_counter_sum_all() iterates all possible CPUs rather than
just those online or even those present, the mask can include CPUs
that aren't even installed in the machine or, in the case of
machines that can hot-plug CPU-capable nodes, CPUs whose physical
sockets aren't even present in the machine.

Fundamentally, this race condition is caused by the CPU being
offlined being removed from the cpu_online_mask before the notifier
that cleans up per-cpu state is run. Hence percpu_counter_sum() will
not sum the count for a cpu currently being taken offline,
regardless of whether the notifier has run or not. This is the root
cause of the bug.

The percpu counter notifier iterates all the registered counters,
locks each counter and moves its percpu count into the global sum.
This is serialised against other operations that move a percpu count
into the global sum, as well as against percpu_counter_sum()
operations that sum the percpu counts while holding the counter
lock.
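For reference, the notifier in question is the CPU hotplug "dead"
callback in lib/percpu_counter.c. A simplified sketch of it follows
(reconstructed for context, not part of this patch; the
CONFIG_HOTPLUG_CPU guard is elided):

static int percpu_counter_cpu_dead(unsigned int cpu)
{
	struct percpu_counter *fbc;

	spin_lock_irq(&percpu_counters_lock);
	list_for_each_entry(fbc, &percpu_counters, list) {
		s32 *pcount;

		/* Fold the dead cpu's percpu count into the global sum. */
		raw_spin_lock(&fbc->lock);
		pcount = per_cpu_ptr(fbc->counters, cpu);
		fbc->count += *pcount;
		*pcount = 0;
		raw_spin_unlock(&fbc->lock);
	}
	spin_unlock_irq(&percpu_counters_lock);
	return 0;
}

Note that fbc->lock is taken around each fold; it is the same lock
that the summation path holds while summing the percpu counts.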
Hence the notifier is safe to run concurrently with sum operations,
and the only thing we actually need to care about is that
percpu_counter_sum() iterates dying CPUs. That's trivial to do, and
when there are no CPUs dying it has no additional overhead except
for a cpumask_or() operation.

This change makes percpu_counter_sum() always do the right thing in
the presence of CPU hot unplug events and makes
percpu_counter_sum_all() unnecessary. This, in turn, means that
filesystems like XFS, ext4 and btrfs don't have to work out when
they should use percpu_counter_sum() vs percpu_counter_sum_all() in
their space accounting algorithms.

Signed-off-by: Dave Chinner
---
 lib/percpu_counter.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index dba56c5c1837..0e096311e0c0 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -131,7 +131,7 @@ static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc,
 
 	raw_spin_lock_irqsave(&fbc->lock, flags);
 	ret = fbc->count;
-	for_each_cpu(cpu, cpu_mask) {
+	for_each_cpu_or(cpu, cpu_online_mask, cpu_mask) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		ret += *pcount;
 	}
@@ -141,11 +141,20 @@ static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc,
 
 /*
  * Add up all the per-cpu counts, return the result. This is a more accurate
- * but much slower version of percpu_counter_read_positive()
+ * but much slower version of percpu_counter_read_positive().
+ *
+ * We use the cpu mask of (cpu_online_mask | cpu_dying_mask) to capture sums
+ * from CPUs that are in the process of being taken offline. Dying cpus have
+ * been removed from the online mask, but may not have had the hotplug dead
+ * notifier called to fold the percpu count back into the global counter sum.
+ * By including dying CPUs in the iteration mask, we avoid this race condition
+ * so __percpu_counter_sum() just does the right thing when CPUs are being taken
+ * offline.
  */
 s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
-	return __percpu_counter_sum_mask(fbc, cpu_online_mask);
+
+	return __percpu_counter_sum_mask(fbc, cpu_dying_mask);
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
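To make the end result concrete, this is roughly what the summation
helper looks like with the patch applied (reconstructed from the
hunks above; the local variable declarations are assumed, as the
diff does not show them):

static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc,
				     const struct cpumask *cpu_mask)
{
	s64 ret;
	int cpu;
	unsigned long flags;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	/*
	 * With cpu_mask == cpu_dying_mask, this walks the union of the
	 * online and dying masks, so a cpu that has already left
	 * cpu_online_mask but whose dead notifier has not yet folded
	 * its count into fbc->count is still included in the sum.
	 */
	for_each_cpu_or(cpu, cpu_online_mask, cpu_mask) {
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
		ret += *pcount;
	}
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}

Because fbc->lock is held here and also across the fold in the dead
notifier, a dying cpu's count is observed either in its percpu slot
or already in fbc->count, but never in both, so nothing is lost or
double counted.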