[v2] mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()
Commit Message
We found a softlockup issue in our test. After analyzing the logs, we found
that the relevant CPU call traces are as follows:
CPU0:
_do_fork
-> copy_process()
-> write_lock_irq(&tasklist_lock) //Disable irq,waiting for
//tasklist_lock
CPU1:
wp_page_copy()
->pte_offset_map_lock()
-> spin_lock(&page->ptl); //Hold page->ptl
-> ptep_clear_flush()
-> flush_tlb_others() ...
-> smp_call_function_many()
-> arch_send_call_function_ipi_mask()
-> csd_lock_wait() //Waiting for other CPUs to respond
//to the IPI
CPU2:
collect_procs_anon()
-> read_lock(&tasklist_lock) //Hold tasklist_lock
->for_each_process(tsk)
-> page_mapped_in_vma()
-> page_vma_mapped_walk()
-> map_pte()
->spin_lock(&page->ptl) //Waiting for page->ptl
We can see that CPU1 is waiting for CPU0 to respond to the IPI, CPU0 is
waiting for CPU2 to unlock tasklist_lock, and CPU2 is waiting for CPU1 to
unlock page->ptl. As a result, a softlockup is triggered.
collect_procs_anon() does not modify the task list; it only traverses it
for reading. Therefore, we can use the RCU read lock instead of the
tasklist_lock rwlock, which breaks the softlockup chain above.
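The dependency chain described above can be modelled as a small wait-for graph. The following is only a userspace sketch, not kernel code: struct wait_graph, NR_CPUS_MODEL and the helper names are hypothetical, and the "RCU" variant simply assumes that CPU0 no longer has a reader to wait for once CPU2 stops taking tasklist_lock.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_MODEL 3	/* hypothetical: only the three CPUs in the trace */

/* waits_for[i] is the CPU that CPU i is blocked on, or -1 if it can run. */
struct wait_graph {
	int waits_for[NR_CPUS_MODEL];
};

/* Edges taken from the call traces, with read_lock(&tasklist_lock). */
static struct wait_graph graph_with_tasklist_lock(void)
{
	struct wait_graph g = { .waits_for = {
		[0] = 2,	/* CPU0 spins for tasklist_lock, held by CPU2 */
		[1] = 0,	/* CPU1 waits for CPU0 to respond to the IPI */
		[2] = 1,	/* CPU2 spins for page->ptl, held by CPU1 */
	} };
	return g;
}

/* Same traces, assuming CPU2 walks the list under rcu_read_lock(). */
static struct wait_graph graph_with_rcu(void)
{
	struct wait_graph g = graph_with_tasklist_lock();

	g.waits_for[0] = -1;	/* assumption: no reader blocks the writer */
	return g;
}

/* Follow the edges from @start; a cycle exists if we return to @start. */
static bool cpu_is_deadlocked(const struct wait_graph *g, int start)
{
	int cur = g->waits_for[start];

	for (size_t steps = 0; steps < NR_CPUS_MODEL && cur >= 0; steps++) {
		if (cur == start)
			return true;
		cur = g->waits_for[cur];
	}
	return false;
}
```

In this model, cpu_is_deadlocked() reports a cycle for every CPU in the tasklist_lock graph and for none in the RCU graph, because dropping the CPU0 -> CPU2 edge lets CPU0 make progress and respond to the IPI.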
The same logic can also be applied to:
- collect_procs_file()
- collect_procs_fsdax()
- collect_procs_ksm()
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
Changes since v1:
- 1. Following Matthew's suggestion, only the comment of
find_early_kill_thread() is modified; there is no need to hold the rcu lock.
Changes since RFC[1]:
- 1. According to Naoya's suggestion, modify the tasklist_lock in the
comment about locking order in mm/filemap.c.
- 2. According to Kefeng's suggestion, optimize the implementation of
find_early_kill_thread() without functional changes.
- 3. Modify the title description.
[1] https://lore.kernel.org/lkml/20230815130154.1100779-1-tongtiangen@huawei.com/
---
mm/filemap.c | 3 ---
mm/ksm.c | 4 ++--
mm/memory-failure.c | 16 ++++++++--------
3 files changed, 10 insertions(+), 13 deletions(-)
Comments
On 2023/8/22 2:33, Matthew Wilcox wrote:
> On Mon, Aug 21, 2023 at 05:13:12PM +0800, Tong Tiangen wrote:
>> We found a softlock issue in our test, analyzed the logs, and found that
>> the relevant CPU call trace as follows:
>>
>> CPU0:
>> _do_fork
>> -> copy_process()
>> -> write_lock_irq(&tasklist_lock) //Disable irq,waiting for
>> //tasklist_lock
>>
>> CPU1:
>> wp_page_copy()
>> ->pte_offset_map_lock()
>> -> spin_lock(&page->ptl); //Hold page->ptl
>> -> ptep_clear_flush()
>> -> flush_tlb_others() ...
>> -> smp_call_function_many()
>> -> arch_send_call_function_ipi_mask()
>> -> csd_lock_wait() //Waiting for other CPUs respond
>> //IPI
>>
>> CPU2:
>> collect_procs_anon()
>> -> read_lock(&tasklist_lock) //Hold tasklist_lock
>> ->for_each_process(tsk)
>> -> page_mapped_in_vma()
>> -> page_vma_mapped_walk()
>> -> map_pte()
>> ->spin_lock(&page->ptl) //Waiting for page->ptl
>>
>> We can see that CPU1 waiting for CPU0 respond IPI,CPU0 waiting for CPU2
>> unlock tasklist_lock, CPU2 waiting for CPU1 unlock page->ptl. As a result,
>> softlockup is triggered.
>>
>> For collect_procs_anon(), we will not modify the tasklist, but only perform
>> read traversal. Therefore, we can use rcu lock instead of spin lock
>> tasklist_lock, from this, we can break the softlock chain above.
>
> The only thing that's giving me pause is that there's no discussion
> about why this is safe. "We're not modifying it" isn't really enough
> to justify going from read_lock() to rcu_read_lock(). When you take a
> normal read_lock(), writers are not permitted and so you see an atomic
> snapshot of the list. With rcu_read_lock() you can see inconsistencies.
Hi Matthew,
When rcu_read_lock() is used, the task list can be modified during the
iteration, but the modification is not seen by the iteration. After the
iteration is complete, the task list can be updated through the RCU
mechanism. Therefore, the task list used by the iteration can also be
considered a snapshot.
> For example, if new tasks can be added to the tasklist, they may not
> be seen by an iteration. Is this OK?
Newly added tasks do not access the HWPoison page, because the HWPoison
page has been isolated from the buddy allocator
(memory_failure()->take_page_off_buddy()). Therefore, it is safe whether or
not a newly added task is seen by the iteration.
> Tasks may be removed from the
> tasklist after they have been seen by the iteration. Is this OK?
If a task seen during the iteration is deleted from the task list after the
iteration, its task_struct is not released, because a reference is taken in
__add_to_kill(). Therefore, the subsequent processing in kill_procs()
(sending signals to a task that has been deleted from the task list) is not
affected, so I think it is safe too.
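The reference-counting argument can be illustrated with a minimal userspace sketch. It is not the kernel implementation: struct model_task and the model_* helpers are hypothetical stand-ins for task_struct, get_task_struct() and put_task_struct(), and the model assumes the collector takes its reference before the task exits.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for task_struct; only what the sketch needs. */
struct model_task {
	atomic_int usage;	/* reference count */
	bool on_list;		/* still linked in the task list? */
	int pid;
};

static int model_tasks_freed;	/* counts frees so the effect is observable */

static struct model_task *model_task_new(int pid)
{
	struct model_task *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	atomic_init(&t->usage, 1);	/* reference owned by the task list */
	t->on_list = true;
	t->pid = pid;
	return t;
}

/* Collector side: pin the task, as the reply says __add_to_kill() does. */
static void model_get_task(struct model_task *t)
{
	atomic_fetch_add(&t->usage, 1);
}

/* Drop one reference; the last put frees the object. */
static void model_put_task(struct model_task *t)
{
	if (atomic_fetch_sub(&t->usage, 1) == 1) {
		model_tasks_freed++;
		free(t);
	}
}

/* Task exits: unlink it and drop the reference held by the list. */
static void model_task_exit(struct model_task *t)
{
	t->on_list = false;
	model_put_task(t);
}
```

If model_get_task() runs before model_task_exit(), the object stays valid after it leaves the list and is freed only by the collector's final model_put_task(), which corresponds to the point where kill_procs() has finished with the task.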
>
> As I understand the list RCU code, it guarantees that all tasks which
> were on the list before rcu_read_lock() and remain on the list after
> rcu_read_unlock() will be seen by a list iteration, while tasks which
> are added or removed during that time may or may not be seen.
As described above, my understanding is that writer updates are not visible
during the RCU read.
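The visibility question discussed here can be replayed deterministically in userspace. The sketch below is a single-threaded model with C11 atomics, not the kernel's list RCU code; struct model_node, struct model_list and the model_* helpers are hypothetical, and release/acquire ordering is used only as a rough analogue of rcu_assign_pointer() and rcu_dereference(). Removal and grace periods are not modelled.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node; next is published with release semantics. */
struct model_node {
	int id;
	struct model_node *_Atomic next;
};

struct model_list {
	struct model_node *_Atomic head;
};

static void model_list_init(struct model_list *l)
{
	atomic_init(&l->head, NULL);
}

/* Writer: link @n in front of the current first node, then publish it. */
static void model_add_head(struct model_list *l, struct model_node *n)
{
	struct model_node *first =
		atomic_load_explicit(&l->head, memory_order_relaxed);

	atomic_store_explicit(&n->next, first, memory_order_relaxed);
	atomic_store_explicit(&l->head, n, memory_order_release);
}

/* Writer: link @n behind @prev, then publish it. */
static void model_add_after(struct model_node *prev, struct model_node *n)
{
	struct model_node *next =
		atomic_load_explicit(&prev->next, memory_order_relaxed);

	atomic_store_explicit(&n->next, next, memory_order_relaxed);
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Reader: fetch the first node. */
static struct model_node *model_first(struct model_list *l)
{
	return atomic_load_explicit(&l->head, memory_order_acquire);
}

/* Reader: advance to the following node. */
static struct model_node *model_next(struct model_node *n)
{
	return atomic_load_explicit(&n->next, memory_order_acquire);
}

/* Reader: continue a traversal from @pos, recording the ids it visits. */
static size_t model_walk_from(struct model_node *pos, int *ids, size_t max)
{
	size_t n = 0;

	for (; pos && n < max; pos = model_next(pos))
		ids[n++] = pos->id;
	return n;
}
```

Replaying a reader that has already advanced to the second node of a three-node list, followed by one insert at the head and one insert after the last node, shows both outcomes in one traversal: the walk continues through the node added behind it in list order, but never visits the node added at the head. Nodes that were on the list throughout are always visited.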
Thanks,
Tong.
diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -121,9 +121,6 @@
* bdi.wb->list_lock (zap_pte_range->set_page_dirty)
* ->inode->i_lock (zap_pte_range->set_page_dirty)
* ->private_lock (zap_pte_range->block_dirty_folio)
- *
- * ->i_mmap_rwsem
- * ->tasklist_lock (memory_failure, collect_procs_ao)
*/
static void page_cache_delete(struct address_space *mapping,
diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2925,7 +2925,7 @@ void collect_procs_ksm(struct page *page, struct list_head *to_kill,
struct anon_vma *av = rmap_item->anon_vma;
anon_vma_lock_read(av);
- read_lock(&tasklist_lock);
+ rcu_read_lock();
for_each_process(tsk) {
struct anon_vma_chain *vmac;
unsigned long addr;
@@ -2944,7 +2944,7 @@ void collect_procs_ksm(struct page *page, struct list_head *to_kill,
}
}
}
- read_unlock(&tasklist_lock);
+ rcu_read_unlock();
anon_vma_unlock_read(av);
}
}
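diff --git a/mm/memory-failure.c b/mm/memory-failure.c
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -547,8 +547,8 @@ static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,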
* on behalf of the thread group. Return task_struct of the (first found)
* dedicated thread if found, and return NULL otherwise.
*
- * We already hold read_lock(&tasklist_lock) in the caller, so we don't
- * have to call rcu_read_lock/unlock() in this function.
+ * We already hold rcu lock in the caller, so we don't have to call
+ * rcu_read_lock/unlock() in this function.
*/
static struct task_struct *find_early_kill_thread(struct task_struct *tsk)
{
@@ -609,7 +609,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
return;
pgoff = page_to_pgoff(page);
- read_lock(&tasklist_lock);
+ rcu_read_lock();
for_each_process(tsk) {
struct anon_vma_chain *vmac;
struct task_struct *t = task_early_kill(tsk, force_early);
@@ -626,7 +626,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
add_to_kill_anon_file(t, page, vma, to_kill);
}
}
- read_unlock(&tasklist_lock);
+ rcu_read_unlock();
anon_vma_unlock_read(av);
}
@@ -642,7 +642,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
pgoff_t pgoff;
i_mmap_lock_read(mapping);
- read_lock(&tasklist_lock);
+ rcu_read_lock();
pgoff = page_to_pgoff(page);
for_each_process(tsk) {
struct task_struct *t = task_early_kill(tsk, force_early);
@@ -662,7 +662,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
add_to_kill_anon_file(t, page, vma, to_kill);
}
}
- read_unlock(&tasklist_lock);
+ rcu_read_unlock();
i_mmap_unlock_read(mapping);
}
@@ -685,7 +685,7 @@ static void collect_procs_fsdax(struct page *page,
struct task_struct *tsk;
i_mmap_lock_read(mapping);
- read_lock(&tasklist_lock);
+ rcu_read_lock();
for_each_process(tsk) {
struct task_struct *t = task_early_kill(tsk, true);
@@ -696,7 +696,7 @@ static void collect_procs_fsdax(struct page *page,
add_to_kill_fsdax(t, page, vma, to_kill, pgoff);
}
}
- read_unlock(&tasklist_lock);
+ rcu_read_unlock();
i_mmap_unlock_read(mapping);
}
#endif /* CONFIG_FS_DAX */