[0/3] migrate_pages: fix deadlock in batched synchronous migration

Message ID

20230224141145.96814-1-ying.huang@intel.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
        Huang Ying <ying.huang@intel.com>,
        Hugh Dickins <hughd@google.com>,
        "Xu, Pengfei" <pengfei.xu@intel.com>,
        Christoph Hellwig <hch@lst.de>,
        Stefan Roesch <shr@devkernel.io>, Tejun Heo <tj@kernel.org>,
        Xin Hao <xhao@linux.alibaba.com>, Zi Yan <ziy@nvidia.com>,
        Yang Shi <shy828301@gmail.com>,
        Baolin Wang <baolin.wang@linux.alibaba.com>,
        Matthew Wilcox <willy@infradead.org>,
        Mike Kravetz <mike.kravetz@oracle.com>
Subject: [PATCH 0/3] migrate_pages: fix deadlock in batched synchronous
 migration
Date: Fri, 24 Feb 2023 22:11:42 +0800
Message-Id: <20230224141145.96814-1-ying.huang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

migrate_pages: fix deadlock in batched synchronous migration

[0/3] migrate_pages: fix deadlock in batched synchronous migration
[1/3] migrate_pages: fix deadlock in batched migration
[2/3] migrate_pages: move split folios processing out of migrate_pages_batch()
[3/3] migrate_pages: try migrate in batch asynchronously firstly

Message

Huang, Ying Feb. 24, 2023, 2:11 p.m. UTC

  Two deadlock bugs were reported for the migrate_pages() batching
series.  Thanks to Hugh and Pengfei.  Analysis shows that if we have
locked any folios other than the one we are migrating, it is not
safe in general to wait synchronously, for example, to wait for
writeback to complete or to wait for a buffer head lock.

So 1/3 fixes the deadlock in a simple way, by disabling batching
support for synchronous migration.  The change is straightforward
and easy to understand.  3/3 then reintroduces batching for
synchronous migration by first trying to migrate asynchronously in
batch, optimistically, and then falling back to migrating
synchronously one by one for the folios that failed to migrate.
Testing shows that this effectively restores the TLB flushing
batching performance for synchronous migration.

Best Regards,
Huang, Ying
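
The two-pass strategy described for 3/3 can be illustrated with a
small user-space sketch.  This is not the patch itself: every name
below (sketch_folio, sketch_mode, sketch_migrate_one,
sketch_migrate_sync) is hypothetical, and the "needs_wait" flag is
only an assumed stand-in for conditions such as pending writeback or
a contended buffer head lock.  Under those assumptions, the first
pass handles the whole batch without ever blocking, and the second
pass retries the leftovers one at a time, where blocking is safe
because no other folio is held locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not kernel definitions. */
enum sketch_mode { SKETCH_ASYNC, SKETCH_SYNC };

struct sketch_folio {
	bool needs_wait;	/* assumed: writeback pending, lock contended, ... */
	bool migrated;
};

struct sketch_stats {
	size_t batched;		/* migrated by the asynchronous batch pass */
	size_t fallback;	/* migrated by the synchronous one-by-one pass */
};

/*
 * Try to migrate one folio.  In async mode we must never block, so a
 * folio that would require waiting is reported as a failure instead.
 */
static bool sketch_migrate_one(struct sketch_folio *f, enum sketch_mode mode)
{
	if (f->needs_wait && mode == SKETCH_ASYNC)
		return false;

	/* Sync mode: waiting is modelled as simply clearing the flag. */
	f->needs_wait = false;
	f->migrated = true;
	return true;
}

/* Synchronous migration built from an async batch pass plus a fallback. */
static struct sketch_stats sketch_migrate_sync(struct sketch_folio *folios,
					       size_t n)
{
	struct sketch_stats st = { 0, 0 };
	size_t i;

	assert(folios != NULL || n == 0);

	/* Pass 1: the whole batch, asynchronously (no blocking waits). */
	for (i = 0; i < n; i++)
		if (sketch_migrate_one(&folios[i], SKETCH_ASYNC))
			st.batched++;

	/* Pass 2: only the folios that failed, synchronously, one by one. */
	for (i = 0; i < n; i++)
		if (!folios[i].migrated &&
		    sketch_migrate_one(&folios[i], SKETCH_SYNC))
			st.fallback++;

	return st;
}
```

In this model, a batch in which most folios need no waiting is
almost entirely handled by the first pass, which is where the
batching benefit would come from; only the few blocked folios pay
for the one-by-one fallback.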

Comments

Andrew Morton Feb. 26, 2023, 4:55 a.m. UTC | #1

On Fri, 24 Feb 2023 22:11:42 +0800 Huang Ying <ying.huang@intel.com> wrote:

> Two deadlock bugs were reported for the migrate_pages() batching
> series.

"migrate_pages(): batch TLB flushing"

>  Thanks to Hugh and Pengfei.  Analysis shows that if we have
> locked any folios other than the one we are migrating, it is not
> safe in general to wait synchronously, for example, to wait for
> writeback to complete or to wait for a buffer head lock.
> 
> So 1/3 fixes the deadlock in a simple way, by disabling batching
> support for synchronous migration.  The change is straightforward
> and easy to understand.  3/3 then reintroduces batching for
> synchronous migration by first trying to migrate asynchronously in
> batch, optimistically, and then falling back to migrating
> synchronously one by one for the folios that failed to migrate.
> Testing shows that this effectively restores the TLB flushing
> batching performance for synchronous migration.

If anyone backports the "migrate_pages(): batch TLB flushing" series
into their kernels, they will want to know about such fixes.  So we can
help them by providing suitable Link: tags.

Such a Link: may also be helpful to people who are performing git
bisection searches for some issue but who keep stumbling over the
issues which this series addresses.

Being lazy, I slapped

Fixes: 6f7d760e86fa ("migrate_pages: move THP/hugetlb migration support check to simplify code")

on all three, as this was the final patch in that series.  Inaccurate,
but it means that these fixes will land in a suitable place if anyone
needs them.

Huang, Ying Feb. 27, 2023, 1:25 a.m. UTC | #2

Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 24 Feb 2023 22:11:42 +0800 Huang Ying <ying.huang@intel.com> wrote:
>
>> Two deadlock bugs were reported for the migrate_pages() batching
>> series.
>
> "migrate_pages(): batch TLB flushing"

Yes.  I should have written it that way.

>>  Thanks to Hugh and Pengfei.  Analysis shows that if we have
>> locked any folios other than the one we are migrating, it is not
>> safe in general to wait synchronously, for example, to wait for
>> writeback to complete or to wait for a buffer head lock.
>> 
>> So 1/3 fixes the deadlock in a simple way, by disabling batching
>> support for synchronous migration.  The change is straightforward
>> and easy to understand.  3/3 then reintroduces batching for
>> synchronous migration by first trying to migrate asynchronously in
>> batch, optimistically, and then falling back to migrating
>> synchronously one by one for the folios that failed to migrate.
>> Testing shows that this effectively restores the TLB flushing
>> batching performance for synchronous migration.
>
> If anyone backports the "migrate_pages(): batch TLB flushing" series
> into their kernels, they will want to know about such fixes.  So we can
> help them by providing suitable Link: tags.
>
> Such a Link: may also be helpful to people who are performing git
> bisection searches for some issue but who keep stumbling over the
> issues which this series addresses.
>
> Being lazy, I slapped
>
> Fixes: 6f7d760e86fa ("migrate_pages: move THP/hugetlb migration support check to simplify code")
>
> on all three, as this was the final patch in that series.  Inaccurate,
> but it means that these fixes will land in a suitable place if anyone
> needs them.

Sorry.  I should have added the "Fixes:" tag.  I will be more careful
in the future, and I will add proper "Link:" tags too.

Best Regards,
Huang, Ying