[RFC,05/10] mm/hugetlb: Make walk_hugetlb_range() RCU-safe

Message ID 20221030212929.335473-6-peterx@redhat.com
State New
Series mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Commit Message

Peter Xu Oct. 30, 2022, 9:29 p.m. UTC
  RCU makes sure the pte_t* won't go away from under us.  Please refer to the
comment above huge_pte_offset() for more information.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/pagewalk.c | 5 +++++
 1 file changed, 5 insertions(+)
  

Comments

kernel test robot Nov. 6, 2022, 8:14 a.m. UTC | #1
Greeting,

FYI, we noticed WARNING:suspicious_RCU_usage due to commit (built with gcc-11):

commit: 8b7e3b7ca3897ebc4cb7b23c65a4618d64056e3b ("[PATCH RFC 05/10] mm/hugetlb: Make walk_hugetlb_range() RCU-safe")
url: https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-hugetlb-Make-huge_pte_offset-thread-safe-for-pmd-unshare/20221031-053221
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/lkml/20221030212929.335473-6-peterx@redhat.com
patch subject: [PATCH RFC 05/10] mm/hugetlb: Make walk_hugetlb_range() RCU-safe

in testcase: kernel-selftests
version: kernel-selftests-x86_64-9313ba54-1_20221017
with following parameters:

	sc_nr_hugepages: 2
	group: vm

test-description: The kernel contains a set of "self tests" under the tools/testing/selftests/ directory. These are intended to be small unit tests to exercise individual code paths in the kernel.
test-url: https://www.kernel.org/doc/Documentation/kselftest.txt


on test machine: 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Link: https://lore.kernel.org/oe-lkp/202211061521.28931f7-oliver.sang@intel.com


kern  :warn  : [  181.942648] WARNING: suspicious RCU usage
kern  :warn  : [  181.943175] 6.1.0-rc1-00309-g8b7e3b7ca389 #1 Tainted: G S
kern  :warn  : [  181.943972] -----------------------------
kern  :warn  : [  181.944526] include/linux/rcupdate.h:364 Illegal context switch in RCU read-side critical section!
kern  :warn  : [  181.945559]
other info that might help us debug this:

kern  :warn  : [  181.946625]
rcu_scheduler_active = 2, debug_locks = 1
kern  :warn  : [  181.947473] 2 locks held by hmm-tests/9934:
kern :warn : [  181.948016] #0: ffff8884325b2d18 (&mm->mmap_lock#2){++++}-{3:3}, at: dmirror_fault (test_hmm.c:?) test_hmm
kern :warn : [  181.949129] #1: ffffffff858a7860 (rcu_read_lock){....}-{1:2}, at: walk_hugetlb_range (pagewalk.c:?) 
kern  :warn  : [  181.950161]
stack backtrace:
kern  :warn  : [  181.950780] CPU: 9 PID: 9934 Comm: hmm-tests Tainted: G S                 6.1.0-rc1-00309-g8b7e3b7ca389 #1
kern  :warn  : [  181.951863] Hardware name: Dell Inc. Vostro 3670/0HVPDY, BIOS 1.5.11 12/24/2018
kern  :warn  : [  181.952709] Call Trace:
kern  :warn  : [  181.953070]  <TASK>
kern :warn : [  181.953403] dump_stack_lvl (??:?) 
kern :warn : [  181.953890] __might_resched (??:?) 
kern :warn : [  181.954403] __mutex_lock (mutex.c:?) 
kern :warn : [  181.954886] ? validate_chain (lockdep.c:?) 
kern :warn : [  181.955405] ? hugetlb_fault (??:?) 
kern :warn : [  181.955926] ? mark_lock+0xca/0xac0 
kern :warn : [  181.956450] ? mutex_lock_io_nested (mutex.c:?) 
kern :warn : [  181.957039] ? check_prev_add (lockdep.c:?) 
kern :warn : [  181.957580] ? hugetlb_vm_op_pagesize (hugetlb.c:?) 
kern :warn : [  181.958177] ? hugetlb_fault (??:?) 
kern :warn : [  181.958690] hugetlb_fault (??:?) 
kern :warn : [  181.959199] ? find_held_lock (lockdep.c:?) 
kern :warn : [  181.959709] ? hugetlb_no_page (??:?) 
kern :warn : [  181.960255] ? __lock_release (lockdep.c:?) 
kern :warn : [  181.960772] ? lock_downgrade (lockdep.c:?) 
kern :warn : [  181.961292] ? lock_is_held_type (??:?) 
kern :warn : [  181.961830] ? handle_mm_fault (??:?) 
kern :warn : [  181.962363] handle_mm_fault (??:?) 
kern :warn : [  181.962870] ? hmm_vma_walk_hugetlb_entry (hmm.c:?) 
kern :warn : [  181.963501] hmm_vma_fault (hmm.c:?) 
kern :warn : [  181.964096] walk_hugetlb_range (pagewalk.c:?) 
kern :warn : [  181.964639] __walk_page_range (pagewalk.c:?) 
kern :warn : [  181.965160] walk_page_range (??:?) 
kern :warn : [  181.965670] ? __walk_page_range (??:?) 
kern :warn : [  181.966213] ? rcu_read_unlock (main.c:?) 
kern :warn : [  181.966718] ? lock_is_held_type (??:?) 
kern :warn : [  181.967259] ? mmu_interval_read_begin (??:?) 
kern :warn : [  181.967855] ? lock_is_held_type (??:?) 
kern :warn : [  181.968400] hmm_range_fault (??:?) 
kern :warn : [  181.968911] ? down_read (??:?) 
kern :warn : [  181.969383] ? hmm_vma_fault (??:?) 
kern :warn : [  181.969891] ? __lock_release (lockdep.c:?) 
kern :warn : [  181.970416] dmirror_fault (test_hmm.c:?) test_hmm
kern :warn : [  181.971012] ? dmirror_migrate_to_system+0x590/0x590 test_hmm
kern :warn : [  181.971847] ? find_held_lock (lockdep.c:?) 
kern :warn : [  181.972355] ? dmirror_write+0x202/0x310 test_hmm
kern :warn : [  181.973069] ? __lock_release (lockdep.c:?) 
kern :warn : [  181.973586] ? lock_downgrade (lockdep.c:?) 
kern :warn : [  181.974107] ? lock_is_held_type (??:?) 
kern :warn : [  181.974641] ? dmirror_write+0x202/0x310 test_hmm
kern :warn : [  181.975355] ? lock_release (??:?) 
kern :warn : [  181.975845] ? __mutex_unlock_slowpath (mutex.c:?) 
kern :warn : [  181.976444] ? bit_wait_io_timeout (mutex.c:?) 
kern :warn : [  181.977008] ? lock_is_held_type (??:?) 
kern :warn : [  181.977547] ? dmirror_do_write (test_hmm.c:?) test_hmm
kern :warn : [  181.978185] dmirror_write+0x1bf/0x310 test_hmm
kern :warn : [  181.978881] ? dmirror_fault (test_hmm.c:?) test_hmm
kern :warn : [  181.979484] ? lock_is_held_type (??:?) 
kern :warn : [  181.980021] ? __might_fault (??:?) 
kern :warn : [  181.980523] ? lock_release (??:?) 
kern :warn : [  181.981019] dmirror_fops_unlocked_ioctl (test_hmm.c:?) test_hmm
kern :warn : [  181.981732] ? dmirror_exclusive+0x780/0x780 test_hmm
kern :warn : [  181.982485] ? do_user_addr_fault (fault.c:?) 
kern :warn : [  181.983042] ? __lock_release (lockdep.c:?) 
kern :warn : [  181.983562] __x64_sys_ioctl (??:?) 
kern :warn : [  181.984074] do_syscall_64 (??:?) 
kern :warn : [  181.984545] ? do_user_addr_fault (fault.c:?) 
kern :warn : [  181.985103] ? do_user_addr_fault (fault.c:?) 
kern :warn : [  181.985654] ? irqentry_exit_to_user_mode (??:?) 
kern :warn : [  181.986256] ? lockdep_hardirqs_on_prepare (lockdep.c:?) 
kern :warn : [  181.986945] entry_SYSCALL_64_after_hwframe (??:?) 
kern  :warn  : [  181.987569] RIP: 0033:0x7fac2f598e9b
kern :warn : [ 181.988047] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1b 48 8b 44 24 18 64 48 2b 04 25 28 00
All code
========
   0:	00 48 89             	add    %cl,-0x77(%rax)
   3:	44 24 18             	rex.R and $0x18,%al
   6:	31 c0                	xor    %eax,%eax
   8:	48 8d 44 24 60       	lea    0x60(%rsp),%rax
   d:	c7 04 24 10 00 00 00 	movl   $0x10,(%rsp)
  14:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
  19:	48 8d 44 24 20       	lea    0x20(%rsp),%rax
  1e:	48 89 44 24 10       	mov    %rax,0x10(%rsp)
  23:	b8 10 00 00 00       	mov    $0x10,%eax
  28:	0f 05                	syscall 
  2a:*	41 89 c0             	mov    %eax,%r8d		<-- trapping instruction
  2d:	3d 00 f0 ff ff       	cmp    $0xfffff000,%eax
  32:	77 1b                	ja     0x4f
  34:	48 8b 44 24 18       	mov    0x18(%rsp),%rax
  39:	64                   	fs
  3a:	48                   	rex.W
  3b:	2b                   	.byte 0x2b
  3c:	04 25                	add    $0x25,%al
  3e:	28 00                	sub    %al,(%rax)

Code starting with the faulting instruction
===========================================
   0:	41 89 c0             	mov    %eax,%r8d
   3:	3d 00 f0 ff ff       	cmp    $0xfffff000,%eax
   8:	77 1b                	ja     0x25
   a:	48 8b 44 24 18       	mov    0x18(%rsp),%rax
   f:	64                   	fs
  10:	48                   	rex.W
  11:	2b                   	.byte 0x2b
  12:	04 25                	add    $0x25,%al
  14:	28 00                	sub    %al,(%rax)


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
  
Peter Xu Nov. 6, 2022, 4:41 p.m. UTC | #2
On Sun, Nov 06, 2022 at 04:14:10PM +0800, kernel test robot wrote:
> 
> Greeting,
> 
> FYI, we noticed WARNING:suspicious_RCU_usage due to commit (built with gcc-11):
> 
> commit: 8b7e3b7ca3897ebc4cb7b23c65a4618d64056e3b ("[PATCH RFC 05/10] mm/hugetlb: Make walk_hugetlb_range() RCU-safe")
> url: https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-hugetlb-Make-huge_pte_offset-thread-safe-for-pmd-unshare/20221031-053221
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/lkml/20221030212929.335473-6-peterx@redhat.com
> patch subject: [PATCH RFC 05/10] mm/hugetlb: Make walk_hugetlb_range() RCU-safe
> 
> in testcase: kernel-selftests
> version: kernel-selftests-x86_64-9313ba54-1_20221017
> with following parameters:
> 
> 	sc_nr_hugepages: 2
> 	group: vm
> 
> test-description: The kernel contains a set of "self tests" under the tools/testing/selftests/ directory. These are intended to be small unit tests to exercise individual code paths in the kernel.
> test-url: https://www.kernel.org/doc/Documentation/kselftest.txt
> 
> 
> on test machine: 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> If you fix the issue, kindly add following tag
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Link: https://lore.kernel.org/oe-lkp/202211061521.28931f7-oliver.sang@intel.com
> 
> 
> kern  :warn  : [  181.942648] WARNING: suspicious RCU usage
> kern  :warn  : [  181.943175] 6.1.0-rc1-00309-g8b7e3b7ca389 #1 Tainted: G S
> kern  :warn  : [  181.943972] -----------------------------
> kern  :warn  : [  181.944526] include/linux/rcupdate.h:364 Illegal context switch in RCU read-side critical section!
> kern  :warn  : [  181.945559]
> other info that might help us debug this:
> 
> kern  :warn  : [  181.946625]
> rcu_scheduler_active = 2, debug_locks = 1
> kern  :warn  : [  181.947473] 2 locks held by hmm-tests/9934:
> kern :warn : [  181.948016] #0: ffff8884325b2d18 (&mm->mmap_lock#2){++++}-{3:3}, at: dmirror_fault (test_hmm.c:?) test_hmm
> kern :warn : [  181.949129] #1: ffffffff858a7860 (rcu_read_lock){....}-{1:2}, at: walk_hugetlb_range (pagewalk.c:?) 
> kern  :warn  : [  181.950161]
> stack backtrace:
> kern  :warn  : [  181.950780] CPU: 9 PID: 9934 Comm: hmm-tests Tainted: G S                 6.1.0-rc1-00309-g8b7e3b7ca389 #1
> kern  :warn  : [  181.951863] Hardware name: Dell Inc. Vostro 3670/0HVPDY, BIOS 1.5.11 12/24/2018
> kern  :warn  : [  181.952709] Call Trace:
> kern  :warn  : [  181.953070]  <TASK>
> kern :warn : [  181.953403] dump_stack_lvl (??:?) 
> kern :warn : [  181.953890] __might_resched (??:?) 
> kern :warn : [  181.954403] __mutex_lock (mutex.c:?) 
> kern :warn : [  181.954886] ? validate_chain (lockdep.c:?) 
> kern :warn : [  181.955405] ? hugetlb_fault (??:?) 
> kern :warn : [  181.955926] ? mark_lock+0xca/0xac0 
> kern :warn : [  181.956450] ? mutex_lock_io_nested (mutex.c:?) 
> kern :warn : [  181.957039] ? check_prev_add (lockdep.c:?) 
> kern :warn : [  181.957580] ? hugetlb_vm_op_pagesize (hugetlb.c:?) 
> kern :warn : [  181.958177] ? hugetlb_fault (??:?) 
> kern :warn : [  181.958690] hugetlb_fault (??:?) 
> kern :warn : [  181.959199] ? find_held_lock (lockdep.c:?) 
> kern :warn : [  181.959709] ? hugetlb_no_page (??:?) 
> kern :warn : [  181.960255] ? __lock_release (lockdep.c:?) 
> kern :warn : [  181.960772] ? lock_downgrade (lockdep.c:?) 
> kern :warn : [  181.961292] ? lock_is_held_type (??:?) 
> kern :warn : [  181.961830] ? handle_mm_fault (??:?) 
> kern :warn : [  181.962363] handle_mm_fault (??:?) 
> kern :warn : [  181.962870] ? hmm_vma_walk_hugetlb_entry (hmm.c:?) 
> kern :warn : [  181.963501] hmm_vma_fault (hmm.c:?) 
> kern :warn : [  181.964096] walk_hugetlb_range (pagewalk.c:?) 
> kern :warn : [  181.964639] __walk_page_range (pagewalk.c:?) 
> kern :warn : [  181.965160] walk_page_range (??:?) 
> kern :warn : [  181.965670] ? __walk_page_range (??:?) 
> kern :warn : [  181.966213] ? rcu_read_unlock (main.c:?) 
> kern :warn : [  181.966718] ? lock_is_held_type (??:?) 
> kern :warn : [  181.967259] ? mmu_interval_read_begin (??:?) 
> kern :warn : [  181.967855] ? lock_is_held_type (??:?) 
> kern :warn : [  181.968400] hmm_range_fault (??:?) 
> kern :warn : [  181.968911] ? down_read (??:?) 
> kern :warn : [  181.969383] ? hmm_vma_fault (??:?) 
> kern :warn : [  181.969891] ? __lock_release (lockdep.c:?) 
> kern :warn : [  181.970416] dmirror_fault (test_hmm.c:?) test_hmm
> kern :warn : [  181.971012] ? dmirror_migrate_to_system+0x590/0x590 test_hmm
> kern :warn : [  181.971847] ? find_held_lock (lockdep.c:?) 
> kern :warn : [  181.972355] ? dmirror_write+0x202/0x310 test_hmm
> kern :warn : [  181.973069] ? __lock_release (lockdep.c:?) 
> kern :warn : [  181.973586] ? lock_downgrade (lockdep.c:?) 
> kern :warn : [  181.974107] ? lock_is_held_type (??:?) 
> kern :warn : [  181.974641] ? dmirror_write+0x202/0x310 test_hmm
> kern :warn : [  181.975355] ? lock_release (??:?) 
> kern :warn : [  181.975845] ? __mutex_unlock_slowpath (mutex.c:?) 
> kern :warn : [  181.976444] ? bit_wait_io_timeout (mutex.c:?) 
> kern :warn : [  181.977008] ? lock_is_held_type (??:?) 
> kern :warn : [  181.977547] ? dmirror_do_write (test_hmm.c:?) test_hmm
> kern :warn : [  181.978185] dmirror_write+0x1bf/0x310 test_hmm
> kern :warn : [  181.978881] ? dmirror_fault (test_hmm.c:?) test_hmm
> kern :warn : [  181.979484] ? lock_is_held_type (??:?) 
> kern :warn : [  181.980021] ? __might_fault (??:?) 
> kern :warn : [  181.980523] ? lock_release (??:?) 
> kern :warn : [  181.981019] dmirror_fops_unlocked_ioctl (test_hmm.c:?) test_hmm
> kern :warn : [  181.981732] ? dmirror_exclusive+0x780/0x780 test_hmm
> kern :warn : [  181.982485] ? do_user_addr_fault (fault.c:?) 
> kern :warn : [  181.983042] ? __lock_release (lockdep.c:?) 
> kern :warn : [  181.983562] __x64_sys_ioctl (??:?) 
> kern :warn : [  181.984074] do_syscall_64 (??:?) 
> kern :warn : [  181.984545] ? do_user_addr_fault (fault.c:?) 
> kern :warn : [  181.985103] ? do_user_addr_fault (fault.c:?) 
> kern :warn : [  181.985654] ? irqentry_exit_to_user_mode (??:?) 
> kern :warn : [  181.986256] ? lockdep_hardirqs_on_prepare (lockdep.c:?) 
> kern :warn : [  181.986945] entry_SYSCALL_64_after_hwframe (??:?) 

So it is caused by the hmm code doing a page fault during the page walk,
where it'll go into the hugetlb fault logic and try to take sleepable locks.

That's slightly outside my expectation, because logically I think the page
walk hooks should only do trivial work on the pte/pmd/.. being walked,
rather than things as complicated as triggering a page fault like HMM
does.  And it's also surprising to me that we actually allow sleeping there.
But so far it looks safe.

Besides HMM it seems there's yet another user (enable_skey_walk_ops) that
can also yield itself by calling cond_resched().

My current plan is to add some helpers so that when a hook decides to call
code that can sleep, it notifies the walker API first.  They could be
called walk_page_pause() and walk_page_cont(); then for either a fault or
cond_resched(), we could do:

  walk_page_pause(&walk);
  hmm_vma_fault(); // or cond_resched(), etc.
  walk_page_cont(&walk);

We should probably also emphasize somewhere that the mmap lock should never
be released during the whole page walk process, because walk_page_range()
will cache vma pointers.

If there's any better suggestion, please feel free to comment, or I'll give
it a shot with above approach in the next version.
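
To make the pause/continue idea above concrete, here is a minimal
user-space model in C.  It is only a sketch under stated assumptions:
walk_page_pause() and walk_page_cont() are just the names floated above
and do not exist in the kernel, and struct walk_model, the model_*()
functions, sleeping_hook() and simulate_walk() are all hypothetical
stand-ins, not real kernel APIs.  The model counts how often a hook
"sleeps" while the modeled RCU read-side section is held, with and
without the helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space stand-ins for kernel primitives; NOT the real RCU API. */
static int rcu_depth;       /* models RCU read-side nesting depth */
static int illegal_sleeps;  /* sleeps attempted inside a read-side section */

static void model_rcu_read_lock(void)
{
	rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

/* Models might_sleep(): records a violation instead of printing a splat. */
static void model_might_sleep(void)
{
	if (rcu_depth > 0)
		illegal_sleeps++;
}

/* Hypothetical walker state; the real struct mm_walk has no such field. */
struct walk_model {
	bool paused;
};

/* Hypothetical helper: leave the read-side section before sleeping. */
static void walk_page_pause(struct walk_model *walk)
{
	assert(!walk->paused);
	model_rcu_read_unlock();
	walk->paused = true;
}

/* Hypothetical helper: re-enter the read-side section after sleeping. */
static void walk_page_cont(struct walk_model *walk)
{
	assert(walk->paused);
	model_rcu_read_lock();
	walk->paused = false;
}

/* A hook that needs to sleep, standing in for a fault or cond_resched(). */
static void sleeping_hook(struct walk_model *walk, bool use_pause)
{
	if (use_pause)
		walk_page_pause(walk);
	model_might_sleep();
	if (use_pause)
		walk_page_cont(walk);
}

/*
 * Models the loop in walk_hugetlb_range() with the read-side section held
 * around it.  Returns how many times a hook slept inside that section.
 */
static int simulate_walk(int nr_entries, bool use_pause)
{
	struct walk_model walk = { .paused = false };
	int i;

	rcu_depth = 0;
	illegal_sleeps = 0;

	model_rcu_read_lock();
	for (i = 0; i < nr_entries; i++)
		sleeping_hook(&walk, use_pause);
	model_rcu_read_unlock();

	assert(rcu_depth == 0);
	return illegal_sleeps;
}
```

In this model, simulate_walk(n, false) reports one violation per entry,
which corresponds to the splat in the report, while simulate_walk(n, true)
reports none.  One thing the model does not capture: once the read-side
section has been dropped, a pte_t * looked up before the pause presumably
cannot be trusted afterwards, so a real implementation would likely need to
redo the huge_pte_offset() lookup after walk_page_cont().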
  

Patch

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 7f1c9b274906..bbc71c750576 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -302,6 +302,9 @@  static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 
+	/* For huge_pte_offset() */
+	rcu_read_lock();
+
 	do {
 		next = hugetlb_entry_end(h, addr, end);
 		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
@@ -315,6 +318,8 @@  static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 			break;
 	} while (addr = next, addr != end);
 
+	rcu_read_unlock();
+
 	return err;
 }