Message ID | 20221107095232.36828-1-xiubli@redhat.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:6687:0:0:0:0:0 with SMTP id l7csp1952181wru; Mon, 7 Nov 2022 01:58:49 -0800 (PST) X-Google-Smtp-Source: AMsMyM4h+mm/utzDeOVlrFOtVN6L+eHe4TLjg4Svj7D4gF0zqd8aQQAJSqd1rPHhIGYrM4Hxds1+ X-Received: by 2002:aa7:8f35:0:b0:56b:c2dd:a9ae with SMTP id y21-20020aa78f35000000b0056bc2dda9aemr50563564pfr.0.1667815129369; Mon, 07 Nov 2022 01:58:49 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1667815129; cv=none; d=google.com; s=arc-20160816; b=pOGR111ZQt3sLeQ2d2xkCxF/7onHiTYhaYeUjJVhwoBp6wkpavWs/lJWuuxTQRgaQv oNnF0RvTlZqsmdgn1Eyc6D0v959e4xnwvC44GCmCGefiT8TBBgUIkWgADMdn2/qWPBRM CHg+/hn9y7R6TERBBSOF1zqqSONU4sGz+/+2BaRTK8eEAkj8o7jv2z8ylX1iFupE0FGl 2OqbhofySrLE/QGsIKywgiehcCJAEnm8ZO/zFlJbiMly3bK9NS+zM01+1tJwCSgdoH9k qHnACFXQeEsNhhe/Y/ig+1Yr8ihrxxg/UQSjpuJa4vFglpQil8WDo3e/blaZg988D/e0 vVbw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=PRG3TK5PkvceDpSktl/XoLwICyvkxZzD1g+3ZIVVWaA=; b=VysOd7hEMMYfijuNkRNPNDgG2xaTiOtMFiZYUAyG+v1njHXttbl1eUrJpHRWYOEAhx XSX4vJ1900OIl3RsYsbMfsFA2kyWOaZXAUt/U6xwJ/1SFn0oA0vVlvty54DUnSd6T6KF /BRuOvtBmtMrZppIPw4m9CPW4IICcuf9AhhydzE3IRPkCD1qWUY1Foue2KrcAyrZP8ru 4YKHtRsVTSpjTUW3ZnGgeNRGHVkHN++63DVzZ07Q5jm3B83r6vEYMdHMfbZXuU5DLJeU 8xhTGYFc/atiltylMob67JzC+w6PQ9c939zZC3LBF+k9ZRn+OfYG4tITc3QRdWvWsVU8 9EjA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=diCJF5tM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id t7-20020a62ea07000000b0053a339cafb3si8885857pfh.53.2022.11.07.01.58.35; Mon, 07 Nov 2022 01:58:49 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=diCJF5tM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231821AbiKGJyH (ORCPT <rfc822;hjfbswb@gmail.com> + 99 others); Mon, 7 Nov 2022 04:54:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45070 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231909AbiKGJx7 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 7 Nov 2022 04:53:59 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CD4FD183B9 for <linux-kernel@vger.kernel.org>; Mon, 7 Nov 2022 01:52:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1667814777; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=PRG3TK5PkvceDpSktl/XoLwICyvkxZzD1g+3ZIVVWaA=; b=diCJF5tMJTHLG+PXRiZK7bsGr7+wPpWAjHcJE+7cKzg6spXN4wwsL0E4Mq6i/mT2N7nFl+ fC9+FiUz/WktAP2v9nRsdZJus2fu6ba/1OEfeXz94yCdkRlh5LxEh/+k1i8mnIdmLNlH5F PUf3RinMZiKaS1tYMJ3HkCeOtMcY4qo= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-7-n8Lus9X6OkKLq4OgQeNMkw-1; Mon, 07 Nov 2022 04:52:53 -0500 X-MC-Unique: n8Lus9X6OkKLq4OgQeNMkw-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.rdu2.redhat.com [10.11.54.7]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1ADFD86F155; Mon, 7 Nov 2022 09:52:53 +0000 (UTC) Received: from lxbceph1.gsslab.pek2.redhat.com (unknown [10.72.47.117]) by smtp.corp.redhat.com (Postfix) with ESMTP id A1DF6140EBF5; Mon, 7 Nov 2022 09:52:47 +0000 (UTC) From: xiubli@redhat.com To: viro@zeniv.linux.org.uk, jlayton@kernel.org, chuck.lever@oracle.com Cc: axboe@kernel.dk, asml.silence@gmail.com, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, io-uring@vger.kernel.org, ceph-devel@vger.kernel.org, mchangir@redhat.com, idryomov@gmail.com, lhenriques@suse.de, gfarnum@redhat.com, Xiubo Li <xiubli@redhat.com> Subject: [RFC PATCH] fs/lock: increase the filp's reference for Posix-style locks Date: Mon, 7 Nov 2022 17:52:32 +0800 Message-Id: <20221107095232.36828-1-xiubli@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.1 on 10.11.54.7 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1748830917315412918?= X-GMAIL-MSGID: =?utf-8?q?1748830917315412918?= |
Series |
[RFC] fs/lock: increase the filp's reference for Posix-style locks
|
|
Commit Message
Xiubo Li
Nov. 7, 2022, 9:52 a.m. UTC
From: Xiubo Li <xiubli@redhat.com> When closing the file descripters in parallel in multiple threads, who are sharing the same file descripters, the filp_close() will remove all the Posix-style locks. But if two threads both calling the filp_close() it may race and cause use-after-free crash: PID: 327771 TASK: ffff952aa1db3180 CPU: 8 COMMAND: "db2fmp" #0 [ffff95202f33b960] machine_kexec at ffffffff890662f4 #1 [ffff95202f33b9c0] __crash_kexec at ffffffff89122b82 #2 [ffff95202f33ba90] crash_kexec at ffffffff89122c70 #3 [ffff95202f33baa8] oops_end at ffffffff89791798 #4 [ffff95202f33bad0] no_context at ffffffff89075d14 #5 [ffff95202f33bb20] __bad_area_nosemaphore at ffffffff89075fe2 #6 [ffff95202f33bb70] bad_area_nosemaphore at ffffffff89076104 #7 [ffff95202f33bb80] __do_page_fault at ffffffff89794750 #8 [ffff95202f33bbf0] do_page_fault at ffffffff89794975 #9 [ffff95202f33bc20] page_fault at ffffffff89790778 [exception RIP: ceph_fl_release_lock+20] RIP: ffffffffc08247a4 RSP: ffff95202f33bcd0 RFLAGS: 00010286 RAX: ffff952d4ebd8a00 RBX: 0000000000000000 RCX: dead000000000200 RDX: ffff95202f33bd60 RSI: ffff95202f33bd60 RDI: ffff9526b6ac5b00 RBP: ffff95202f33bce0 R8: ffff9526b6ac5b18 R9: ffffffffc083c368 R10: 0000000000001109 R11: 0000000000000000 R12: ffff95202f33bd60 R13: ffff9526b6ac5b00 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff95202f33bce8] locks_release_private at ffffffff892ab3d7 #11 [ffff95202f33bd00] locks_free_lock at ffffffff892ac34d #12 [ffff95202f33bd18] locks_dispose_list at ffffffff892ac44b #13 [ffff95202f33bd40] __posix_lock_file at ffffffff892acdfa #14 [ffff95202f33bda8] posix_lock_file at ffffffff892ad146 #15 [ffff95202f33bdb8] ceph_lock at ffffffffc0824e8a [ceph] #16 [ffff95202f33bdf8] vfs_lock_file at ffffffff892ad185 #17 [ffff95202f33be08] locks_remove_posix at ffffffff892ad239 #18 [ffff95202f33bee0] locks_remove_posix at ffffffff892ad2a0 #19 [ffff95202f33bef0] filp_close at ffffffff8924baa6 #20 [ffff95202f33bf18] __close_fd at ffffffff8926f89c #21 [ffff95202f33bf40] sys_close at ffffffff8924d503 #22 [ffff95202f33bf50] system_call_fastpath at ffffffff89799f92 RIP: 00007f806ec446ab RSP: 00007f80517f0d90 RFLAGS: 00010206 RAX: 0000000000000003 RBX: 00007f8030001a20 RCX: 00007f80300386b0 RDX: 00007f806ef0d880 RSI: 0000000000000001 RDI: 0000000000000006 RBP: 00007f806ef0e3c0 R8: 00007f80517fa700 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 00007f80300035b0 R14: 00007f80517f1104 R15: 000000000000006c ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b We need to make sure that the filp in the file_lock shouldn't be release when any file_lock is still referring to it. For the Posix-style locks, whose owner will be the thread ids, we will increase the filp's reference. URL: https://tracker.ceph.com/issues/57986 Signed-off-by: Xiubo Li <xiubli@redhat.com> --- drivers/android/binder.c | 2 +- fs/file.c | 15 ++++++++++----- fs/locks.c | 18 +++++++++++++++--- include/linux/fs.h | 14 ++++++++++++++ io_uring/openclose.c | 3 ++- 5 files changed, 42 insertions(+), 10 deletions(-)
Comments
On Mon, 2022-11-07 at 17:52 +0800, xiubli@redhat.com wrote: > From: Xiubo Li <xiubli@redhat.com> > > When closing the file descripters in parallel in multiple threads, > who are sharing the same file descripters, the filp_close() will > remove all the Posix-style locks. But if two threads both calling > the filp_close() it may race and cause use-after-free crash: > > PID: 327771 TASK: ffff952aa1db3180 CPU: 8 COMMAND: "db2fmp" > #0 [ffff95202f33b960] machine_kexec at ffffffff890662f4 > #1 [ffff95202f33b9c0] __crash_kexec at ffffffff89122b82 > #2 [ffff95202f33ba90] crash_kexec at ffffffff89122c70 > #3 [ffff95202f33baa8] oops_end at ffffffff89791798 > #4 [ffff95202f33bad0] no_context at ffffffff89075d14 > #5 [ffff95202f33bb20] __bad_area_nosemaphore at ffffffff89075fe2 > #6 [ffff95202f33bb70] bad_area_nosemaphore at ffffffff89076104 > #7 [ffff95202f33bb80] __do_page_fault at ffffffff89794750 > #8 [ffff95202f33bbf0] do_page_fault at ffffffff89794975 > #9 [ffff95202f33bc20] page_fault at ffffffff89790778 > [exception RIP: ceph_fl_release_lock+20] > RIP: ffffffffc08247a4 RSP: ffff95202f33bcd0 RFLAGS: 00010286 > RAX: ffff952d4ebd8a00 RBX: 0000000000000000 RCX: dead000000000200 > RDX: ffff95202f33bd60 RSI: ffff95202f33bd60 RDI: ffff9526b6ac5b00 > RBP: ffff95202f33bce0 R8: ffff9526b6ac5b18 R9: ffffffffc083c368 > R10: 0000000000001109 R11: 0000000000000000 R12: ffff95202f33bd60 > R13: ffff9526b6ac5b00 R14: 0000000000000000 R15: 0000000000000000 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #10 [ffff95202f33bce8] locks_release_private at ffffffff892ab3d7 > #11 [ffff95202f33bd00] locks_free_lock at ffffffff892ac34d > #12 [ffff95202f33bd18] locks_dispose_list at ffffffff892ac44b > #13 [ffff95202f33bd40] __posix_lock_file at ffffffff892acdfa > #14 [ffff95202f33bda8] posix_lock_file at ffffffff892ad146 > #15 [ffff95202f33bdb8] ceph_lock at ffffffffc0824e8a [ceph] > #16 [ffff95202f33bdf8] vfs_lock_file at ffffffff892ad185 > #17 [ffff95202f33be08] locks_remove_posix at ffffffff892ad239 > #18 [ffff95202f33bee0] locks_remove_posix at ffffffff892ad2a0 > #19 [ffff95202f33bef0] filp_close at ffffffff8924baa6 > #20 [ffff95202f33bf18] __close_fd at ffffffff8926f89c > #21 [ffff95202f33bf40] sys_close at ffffffff8924d503 > #22 [ffff95202f33bf50] system_call_fastpath at ffffffff89799f92 > RIP: 00007f806ec446ab RSP: 00007f80517f0d90 RFLAGS: 00010206 > RAX: 0000000000000003 RBX: 00007f8030001a20 RCX: 00007f80300386b0 > RDX: 00007f806ef0d880 RSI: 0000000000000001 RDI: 0000000000000006 > RBP: 00007f806ef0e3c0 R8: 00007f80517fa700 R9: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 > R13: 00007f80300035b0 R14: 00007f80517f1104 R15: 000000000000006c > ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b > > We need to make sure that the filp in the file_lock shouldn't be > release when any file_lock is still referring to it. > > For the Posix-style locks, whose owner will be the thread ids, we > will increase the filp's reference. > > URL: https://tracker.ceph.com/issues/57986 > Signed-off-by: Xiubo Li <xiubli@redhat.com> > --- > drivers/android/binder.c | 2 +- > fs/file.c | 15 ++++++++++----- > fs/locks.c | 18 +++++++++++++++--- > include/linux/fs.h | 14 ++++++++++++++ > io_uring/openclose.c | 3 ++- > 5 files changed, 42 insertions(+), 10 deletions(-) > > diff --git a/drivers/android/binder.c b/drivers/android/binder.c > index 880224ec6abb..03692564d940 100644 > --- a/drivers/android/binder.c > +++ b/drivers/android/binder.c > @@ -1924,7 +1924,7 @@ static void binder_deferred_fd_close(int fd) > if (twcb->file) { > // pin it until binder_do_fd_close(); see comments there > get_file(twcb->file); > - filp_close(twcb->file, current->files); > + filp_close(twcb->file, file_lock_make_thread_owner(current->files)); > task_work_add(current, &twcb->twork, TWA_RESUME); > } else { > kfree(twcb); > diff --git a/fs/file.c b/fs/file.c > index 5f9c802a5d8d..39ad8e74a8d9 100644 > --- a/fs/file.c > +++ b/fs/file.c > @@ -417,6 +417,7 @@ static struct fdtable *close_files(struct files_struct * files) > * files structure. > */ > struct fdtable *fdt = rcu_dereference_raw(files->fdt); > + fl_owner_t owner = file_lock_make_thread_owner(files); > unsigned int i, j = 0; > > for (;;) { > @@ -429,7 +430,7 @@ static struct fdtable *close_files(struct files_struct * files) > if (set & 1) { > struct file * file = xchg(&fdt->fd[i], NULL); > if (file) { > - filp_close(file, files); > + filp_close(file, owner); > cond_resched(); > } > } > @@ -653,6 +654,7 @@ static struct file *pick_file(struct files_struct *files, unsigned fd) > int close_fd(unsigned fd) > { > struct files_struct *files = current->files; > + fl_owner_t owner = file_lock_make_thread_owner(files); > struct file *file; > > spin_lock(&files->file_lock); > @@ -661,7 +663,7 @@ int close_fd(unsigned fd) > if (!file) > return -EBADF; > > - return filp_close(file, files); > + return filp_close(file, owner); > } > EXPORT_SYMBOL(close_fd); /* for ksys_close() */ > > @@ -695,6 +697,7 @@ static inline void __range_cloexec(struct files_struct *cur_fds, > static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, > unsigned int max_fd) > { > + fl_owner_t owner = file_lock_make_thread_owner(cur_fds); > unsigned n; > > rcu_read_lock(); > @@ -711,7 +714,7 @@ static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, > > if (file) { > /* found a valid file to close */ > - filp_close(file, cur_fds); > + filp_close(file, owner); > cond_resched(); > } > } > @@ -816,6 +819,7 @@ struct file *close_fd_get_file(unsigned int fd) > > void do_close_on_exec(struct files_struct *files) > { > + fl_owner_t owner = file_lock_make_thread_owner(files); > unsigned i; > struct fdtable *fdt; > > @@ -841,7 +845,7 @@ void do_close_on_exec(struct files_struct *files) > rcu_assign_pointer(fdt->fd[fd], NULL); > __put_unused_fd(files, fd); > spin_unlock(&files->file_lock); > - filp_close(file, files); > + filp_close(file, owner); > cond_resched(); > spin_lock(&files->file_lock); > } > @@ -1080,6 +1084,7 @@ static int do_dup2(struct files_struct *files, > struct file *file, unsigned fd, unsigned flags) > __releases(&files->file_lock) > { > + fl_owner_t owner = file_lock_make_thread_owner(files); > struct file *tofree; > struct fdtable *fdt; > > @@ -1111,7 +1116,7 @@ __releases(&files->file_lock) > spin_unlock(&files->file_lock); > > if (tofree) > - filp_close(tofree, files); > + filp_close(tofree, owner); > > return fd; > > diff --git a/fs/locks.c b/fs/locks.c > index 607f94a0e789..e8b67f87e0ee 100644 > --- a/fs/locks.c > +++ b/fs/locks.c > @@ -331,6 +331,8 @@ EXPORT_SYMBOL_GPL(locks_owner_has_blockers); > /* Free a lock which is not in use. */ > void locks_free_lock(struct file_lock *fl) > { > + if (fl->fl_file && file_lock_is_thread_owner(fl->fl_owner)) > + fput(fl->fl_file); > locks_release_private(fl); > kmem_cache_free(filelock_cache, fl); > } > @@ -384,7 +386,10 @@ void locks_copy_lock(struct file_lock *new, struct file_lock *fl) > > locks_copy_conflock(new, fl); > > - new->fl_file = fl->fl_file; > + if (file_lock_is_thread_owner(new->fl_owner)) > + new->fl_file = get_file(fl->fl_file); > + else > + new->fl_file = fl->fl_file; > new->fl_ops = fl->fl_ops; > > if (fl->fl_ops) { > @@ -488,13 +493,14 @@ static int flock64_to_posix_lock(struct file *filp, struct file_lock *fl, > } else > fl->fl_end = OFFSET_MAX; > > - fl->fl_owner = current->files; > + fl->fl_owner = file_lock_make_thread_owner(current->files); > fl->fl_pid = current->tgid; > - fl->fl_file = filp; > + fl->fl_file = get_file(filp); > fl->fl_flags = FL_POSIX; > fl->fl_ops = NULL; > fl->fl_lmops = NULL; > > + > return assign_type(fl, l->l_type); > } > > @@ -2243,6 +2249,7 @@ int fcntl_getlk(struct file *filp, unsigned int cmd, struct flock *flock) > > fl->fl_flags |= FL_OFDLCK; > fl->fl_owner = filp; > + fput(filp); > } > > error = vfs_test_lock(filp, fl); > @@ -2376,6 +2383,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd, > cmd = F_SETLK; > file_lock->fl_flags |= FL_OFDLCK; > file_lock->fl_owner = filp; > + fput(filp); > break; > case F_OFD_SETLKW: > error = -EINVAL; > @@ -2385,6 +2393,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd, > cmd = F_SETLKW; > file_lock->fl_flags |= FL_OFDLCK; > file_lock->fl_owner = filp; > + fput(filp); > fallthrough; > case F_SETLKW: > file_lock->fl_flags |= FL_SLEEP; > @@ -2450,6 +2459,7 @@ int fcntl_getlk64(struct file *filp, unsigned int cmd, struct flock64 *flock) > cmd = F_GETLK64; > fl->fl_flags |= FL_OFDLCK; > fl->fl_owner = filp; > + fput(filp); > } > > error = vfs_test_lock(filp, fl); > @@ -2499,6 +2509,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd, > cmd = F_SETLK64; > file_lock->fl_flags |= FL_OFDLCK; > file_lock->fl_owner = filp; > + fput(filp); > break; > case F_OFD_SETLKW: > error = -EINVAL; > @@ -2508,6 +2519,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd, > cmd = F_SETLKW64; > file_lock->fl_flags |= FL_OFDLCK; > file_lock->fl_owner = filp; > + fput(filp); > fallthrough; > case F_SETLKW64: > file_lock->fl_flags |= FL_SLEEP; > diff --git a/include/linux/fs.h b/include/linux/fs.h > index e654435f1651..d7d81962a863 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1028,6 +1028,20 @@ static inline struct file *get_file(struct file *f) > /* legacy typedef, should eventually be removed */ > typedef void *fl_owner_t; > > +/* > + * Set the last significant bit to 1 to mark that > + * we have get a reference of the fl->fl_file. > + */ > +static inline fl_owner_t file_lock_make_thread_owner(fl_owner_t owner) > +{ > + return (fl_owner_t)((unsigned long)owner | 1UL); > +} > + > +static inline bool file_lock_is_thread_owner(fl_owner_t owner) > +{ > + return ((unsigned long)owner & 1UL); > +} > + > struct file_lock; > > struct file_lock_operations { > diff --git a/io_uring/openclose.c b/io_uring/openclose.c > index 67178e4bb282..5a12cdf7f8d0 100644 > --- a/io_uring/openclose.c > +++ b/io_uring/openclose.c > @@ -212,6 +212,7 @@ int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > int io_close(struct io_kiocb *req, unsigned int issue_flags) > { > struct files_struct *files = current->files; > + fl_owner_t owner = file_lock_make_thread_owner(files); > struct io_close *close = io_kiocb_to_cmd(req, struct io_close); > struct fdtable *fdt; > struct file *file; > @@ -247,7 +248,7 @@ int io_close(struct io_kiocb *req, unsigned int issue_flags) > goto err; > > /* No ->flush() or already async, safely close from here */ > - ret = filp_close(file, current->files); > + ret = filp_close(file, owner); > err: > if (ret < 0) > req_set_fail(req); I think this is the wrong approach to fixing this. It also looks like you could hit a similar problem with OFD locks and this patch wouldn't address that issue. The real bug seems to be that ceph_fl_release_lock dereferences fl_file, at a point when it shouldn't rely on that being valid. Most filesystems stash some info in fl->fl_u if they need to do bookkeeping after releasing a lock. Perhaps ceph should be doing something similar?
On 07/11/2022 18:33, Jeff Layton wrote: > On Mon, 2022-11-07 at 17:52 +0800, xiubli@redhat.com wrote: >> From: Xiubo Li <xiubli@redhat.com> >> >> When closing the file descripters in parallel in multiple threads, >> who are sharing the same file descripters, the filp_close() will >> remove all the Posix-style locks. But if two threads both calling >> the filp_close() it may race and cause use-after-free crash: >> >> PID: 327771 TASK: ffff952aa1db3180 CPU: 8 COMMAND: "db2fmp" >> #0 [ffff95202f33b960] machine_kexec at ffffffff890662f4 >> #1 [ffff95202f33b9c0] __crash_kexec at ffffffff89122b82 >> #2 [ffff95202f33ba90] crash_kexec at ffffffff89122c70 >> #3 [ffff95202f33baa8] oops_end at ffffffff89791798 >> #4 [ffff95202f33bad0] no_context at ffffffff89075d14 >> #5 [ffff95202f33bb20] __bad_area_nosemaphore at ffffffff89075fe2 >> #6 [ffff95202f33bb70] bad_area_nosemaphore at ffffffff89076104 >> #7 [ffff95202f33bb80] __do_page_fault at ffffffff89794750 >> #8 [ffff95202f33bbf0] do_page_fault at ffffffff89794975 >> #9 [ffff95202f33bc20] page_fault at ffffffff89790778 >> [exception RIP: ceph_fl_release_lock+20] >> RIP: ffffffffc08247a4 RSP: ffff95202f33bcd0 RFLAGS: 00010286 >> RAX: ffff952d4ebd8a00 RBX: 0000000000000000 RCX: dead000000000200 >> RDX: ffff95202f33bd60 RSI: ffff95202f33bd60 RDI: ffff9526b6ac5b00 >> RBP: ffff95202f33bce0 R8: ffff9526b6ac5b18 R9: ffffffffc083c368 >> R10: 0000000000001109 R11: 0000000000000000 R12: ffff95202f33bd60 >> R13: ffff9526b6ac5b00 R14: 0000000000000000 R15: 0000000000000000 >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >> #10 [ffff95202f33bce8] locks_release_private at ffffffff892ab3d7 >> #11 [ffff95202f33bd00] locks_free_lock at ffffffff892ac34d >> #12 [ffff95202f33bd18] locks_dispose_list at ffffffff892ac44b >> #13 [ffff95202f33bd40] __posix_lock_file at ffffffff892acdfa >> #14 [ffff95202f33bda8] posix_lock_file at ffffffff892ad146 >> #15 [ffff95202f33bdb8] ceph_lock at ffffffffc0824e8a [ceph] >> #16 [ffff95202f33bdf8] vfs_lock_file at ffffffff892ad185 >> #17 [ffff95202f33be08] locks_remove_posix at ffffffff892ad239 >> #18 [ffff95202f33bee0] locks_remove_posix at ffffffff892ad2a0 >> #19 [ffff95202f33bef0] filp_close at ffffffff8924baa6 >> #20 [ffff95202f33bf18] __close_fd at ffffffff8926f89c >> #21 [ffff95202f33bf40] sys_close at ffffffff8924d503 >> #22 [ffff95202f33bf50] system_call_fastpath at ffffffff89799f92 >> RIP: 00007f806ec446ab RSP: 00007f80517f0d90 RFLAGS: 00010206 >> RAX: 0000000000000003 RBX: 00007f8030001a20 RCX: 00007f80300386b0 >> RDX: 00007f806ef0d880 RSI: 0000000000000001 RDI: 0000000000000006 >> RBP: 00007f806ef0e3c0 R8: 00007f80517fa700 R9: 0000000000000000 >> R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 >> R13: 00007f80300035b0 R14: 00007f80517f1104 R15: 000000000000006c >> ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b >> >> We need to make sure that the filp in the file_lock shouldn't be >> release when any file_lock is still referring to it. >> >> For the Posix-style locks, whose owner will be the thread ids, we >> will increase the filp's reference. >> >> URL: https://tracker.ceph.com/issues/57986 >> Signed-off-by: Xiubo Li <xiubli@redhat.com> >> --- >> drivers/android/binder.c | 2 +- >> fs/file.c | 15 ++++++++++----- >> fs/locks.c | 18 +++++++++++++++--- >> include/linux/fs.h | 14 ++++++++++++++ >> io_uring/openclose.c | 3 ++- >> 5 files changed, 42 insertions(+), 10 deletions(-) >> >> diff --git a/drivers/android/binder.c b/drivers/android/binder.c >> index 880224ec6abb..03692564d940 100644 >> --- a/drivers/android/binder.c >> +++ b/drivers/android/binder.c >> @@ -1924,7 +1924,7 @@ static void binder_deferred_fd_close(int fd) >> if (twcb->file) { >> // pin it until binder_do_fd_close(); see comments there >> get_file(twcb->file); >> - filp_close(twcb->file, current->files); >> + filp_close(twcb->file, file_lock_make_thread_owner(current->files)); >> task_work_add(current, &twcb->twork, TWA_RESUME); >> } else { >> kfree(twcb); >> diff --git a/fs/file.c b/fs/file.c >> index 5f9c802a5d8d..39ad8e74a8d9 100644 >> --- a/fs/file.c >> +++ b/fs/file.c >> @@ -417,6 +417,7 @@ static struct fdtable *close_files(struct files_struct * files) >> * files structure. >> */ >> struct fdtable *fdt = rcu_dereference_raw(files->fdt); >> + fl_owner_t owner = file_lock_make_thread_owner(files); >> unsigned int i, j = 0; >> >> for (;;) { >> @@ -429,7 +430,7 @@ static struct fdtable *close_files(struct files_struct * files) >> if (set & 1) { >> struct file * file = xchg(&fdt->fd[i], NULL); >> if (file) { >> - filp_close(file, files); >> + filp_close(file, owner); >> cond_resched(); >> } >> } >> @@ -653,6 +654,7 @@ static struct file *pick_file(struct files_struct *files, unsigned fd) >> int close_fd(unsigned fd) >> { >> struct files_struct *files = current->files; >> + fl_owner_t owner = file_lock_make_thread_owner(files); >> struct file *file; >> >> spin_lock(&files->file_lock); >> @@ -661,7 +663,7 @@ int close_fd(unsigned fd) >> if (!file) >> return -EBADF; >> >> - return filp_close(file, files); >> + return filp_close(file, owner); >> } >> EXPORT_SYMBOL(close_fd); /* for ksys_close() */ >> >> @@ -695,6 +697,7 @@ static inline void __range_cloexec(struct files_struct *cur_fds, >> static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, >> unsigned int max_fd) >> { >> + fl_owner_t owner = file_lock_make_thread_owner(cur_fds); >> unsigned n; >> >> rcu_read_lock(); >> @@ -711,7 +714,7 @@ static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, >> >> if (file) { >> /* found a valid file to close */ >> - filp_close(file, cur_fds); >> + filp_close(file, owner); >> cond_resched(); >> } >> } >> @@ -816,6 +819,7 @@ struct file *close_fd_get_file(unsigned int fd) >> >> void do_close_on_exec(struct files_struct *files) >> { >> + fl_owner_t owner = file_lock_make_thread_owner(files); >> unsigned i; >> struct fdtable *fdt; >> >> @@ -841,7 +845,7 @@ void do_close_on_exec(struct files_struct *files) >> rcu_assign_pointer(fdt->fd[fd], NULL); >> __put_unused_fd(files, fd); >> spin_unlock(&files->file_lock); >> - filp_close(file, files); >> + filp_close(file, owner); >> cond_resched(); >> spin_lock(&files->file_lock); >> } >> @@ -1080,6 +1084,7 @@ static int do_dup2(struct files_struct *files, >> struct file *file, unsigned fd, unsigned flags) >> __releases(&files->file_lock) >> { >> + fl_owner_t owner = file_lock_make_thread_owner(files); >> struct file *tofree; >> struct fdtable *fdt; >> >> @@ -1111,7 +1116,7 @@ __releases(&files->file_lock) >> spin_unlock(&files->file_lock); >> >> if (tofree) >> - filp_close(tofree, files); >> + filp_close(tofree, owner); >> >> return fd; >> >> diff --git a/fs/locks.c b/fs/locks.c >> index 607f94a0e789..e8b67f87e0ee 100644 >> --- a/fs/locks.c >> +++ b/fs/locks.c >> @@ -331,6 +331,8 @@ EXPORT_SYMBOL_GPL(locks_owner_has_blockers); >> /* Free a lock which is not in use. */ >> void locks_free_lock(struct file_lock *fl) >> { >> + if (fl->fl_file && file_lock_is_thread_owner(fl->fl_owner)) >> + fput(fl->fl_file); >> locks_release_private(fl); >> kmem_cache_free(filelock_cache, fl); >> } >> @@ -384,7 +386,10 @@ void locks_copy_lock(struct file_lock *new, struct file_lock *fl) >> >> locks_copy_conflock(new, fl); >> >> - new->fl_file = fl->fl_file; >> + if (file_lock_is_thread_owner(new->fl_owner)) >> + new->fl_file = get_file(fl->fl_file); >> + else >> + new->fl_file = fl->fl_file; >> new->fl_ops = fl->fl_ops; >> >> if (fl->fl_ops) { >> @@ -488,13 +493,14 @@ static int flock64_to_posix_lock(struct file *filp, struct file_lock *fl, >> } else >> fl->fl_end = OFFSET_MAX; >> >> - fl->fl_owner = current->files; >> + fl->fl_owner = file_lock_make_thread_owner(current->files); >> fl->fl_pid = current->tgid; >> - fl->fl_file = filp; >> + fl->fl_file = get_file(filp); >> fl->fl_flags = FL_POSIX; >> fl->fl_ops = NULL; >> fl->fl_lmops = NULL; >> >> + >> return assign_type(fl, l->l_type); >> } >> >> @@ -2243,6 +2249,7 @@ int fcntl_getlk(struct file *filp, unsigned int cmd, struct flock *flock) >> >> fl->fl_flags |= FL_OFDLCK; >> fl->fl_owner = filp; >> + fput(filp); >> } >> >> error = vfs_test_lock(filp, fl); >> @@ -2376,6 +2383,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd, >> cmd = F_SETLK; >> file_lock->fl_flags |= FL_OFDLCK; >> file_lock->fl_owner = filp; >> + fput(filp); >> break; >> case F_OFD_SETLKW: >> error = -EINVAL; >> @@ -2385,6 +2393,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd, >> cmd = F_SETLKW; >> file_lock->fl_flags |= FL_OFDLCK; >> file_lock->fl_owner = filp; >> + fput(filp); >> fallthrough; >> case F_SETLKW: >> file_lock->fl_flags |= FL_SLEEP; >> @@ -2450,6 +2459,7 @@ int fcntl_getlk64(struct file *filp, unsigned int cmd, struct flock64 *flock) >> cmd = F_GETLK64; >> fl->fl_flags |= FL_OFDLCK; >> fl->fl_owner = filp; >> + fput(filp); >> } >> >> error = vfs_test_lock(filp, fl); >> @@ -2499,6 +2509,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd, >> cmd = F_SETLK64; >> file_lock->fl_flags |= FL_OFDLCK; >> file_lock->fl_owner = filp; >> + fput(filp); >> break; >> case F_OFD_SETLKW: >> error = -EINVAL; >> @@ -2508,6 +2519,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd, >> cmd = F_SETLKW64; >> file_lock->fl_flags |= FL_OFDLCK; >> file_lock->fl_owner = filp; >> + fput(filp); >> fallthrough; >> case F_SETLKW64: >> file_lock->fl_flags |= FL_SLEEP; >> diff --git a/include/linux/fs.h b/include/linux/fs.h >> index e654435f1651..d7d81962a863 100644 >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -1028,6 +1028,20 @@ static inline struct file *get_file(struct file *f) >> /* legacy typedef, should eventually be removed */ >> typedef void *fl_owner_t; >> >> +/* >> + * Set the last significant bit to 1 to mark that >> + * we have get a reference of the fl->fl_file. >> + */ >> +static inline fl_owner_t file_lock_make_thread_owner(fl_owner_t owner) >> +{ >> + return (fl_owner_t)((unsigned long)owner | 1UL); >> +} >> + >> +static inline bool file_lock_is_thread_owner(fl_owner_t owner) >> +{ >> + return ((unsigned long)owner & 1UL); >> +} >> + >> struct file_lock; >> >> struct file_lock_operations { >> diff --git a/io_uring/openclose.c b/io_uring/openclose.c >> index 67178e4bb282..5a12cdf7f8d0 100644 >> --- a/io_uring/openclose.c >> +++ b/io_uring/openclose.c >> @@ -212,6 +212,7 @@ int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) >> int io_close(struct io_kiocb *req, unsigned int issue_flags) >> { >> struct files_struct *files = current->files; >> + fl_owner_t owner = file_lock_make_thread_owner(files); >> struct io_close *close = io_kiocb_to_cmd(req, struct io_close); >> struct fdtable *fdt; >> struct file *file; >> @@ -247,7 +248,7 @@ int io_close(struct io_kiocb *req, unsigned int issue_flags) >> goto err; >> >> /* No ->flush() or already async, safely close from here */ >> - ret = filp_close(file, current->files); >> + ret = filp_close(file, owner); >> err: >> if (ret < 0) >> req_set_fail(req); > I think this is the wrong approach to fixing this. It also looks like > you could hit a similar problem with OFD locks and this patch wouldn't > address that issue. For the OFD locks they will set the 'file' struct as the owner just as the flock does, it should be okay and I don't think it has this issue if my understanding is correct here. > The real bug seems to be that ceph_fl_release_lock dereferences fl_file, > at a point when it shouldn't rely on that being valid. Most filesystems > stash some info in fl->fl_u if they need to do bookkeeping after > releasing a lock. Perhaps ceph should be doing something similar? This is the 'filp' memory in filp_close(filp, ...): crash> file.f_path.dentry,f_inode 0xffff952d7ab46200 f_path.dentry = 0xffff9521b121cb40 f_inode = 0xffff951f3ea33550, We can see the 'f_inode' is pointing to the correct inode memory. While later in 'ceph_fl_release_lock()': 41 static void ceph_fl_release_lock(struct file_lock *fl) 42 { 43 struct ceph_file_info *fi = fl->fl_file->private_data; 44 struct inode *inode = file_inode(fl->fl_file); 45 struct ceph_inode_info *ci = ceph_inode(inode); 46 atomic_dec(&fi->num_locks); 47 if (atomic_dec_and_test(&ci->i_filelock_ref)) { 48 /* clear error when all locks are released */ 49 spin_lock(&ci->i_ceph_lock); 50 ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK; 51 spin_unlock(&ci->i_ceph_lock); 52 } 53 } It crashed in Line#47 and the 'fl->fl_file' memory is: crash> file.f_path.dentry,f_inode 0xffff952d4ebd8a00 f_path.dentry = 0x0 f_inode = 0x0, Please NOTE: the 'filp' and 'fl->fl_file' are two different 'file struct'. Can we fix this by using 'fl->fl_u' here ? I was also thinking I could just call the 'get_file(file)' in ceph_lock() and then in ceph_fl_release_lock() release the reference counter. How about this ? Thanks! - Xiubo
On Mon, 2022-11-07 at 20:03 +0800, Xiubo Li wrote: > On 07/11/2022 18:33, Jeff Layton wrote: > > On Mon, 2022-11-07 at 17:52 +0800, xiubli@redhat.com wrote: > > > From: Xiubo Li <xiubli@redhat.com> > > > > > > When closing the file descripters in parallel in multiple threads, > > > who are sharing the same file descripters, the filp_close() will > > > remove all the Posix-style locks. But if two threads both calling > > > the filp_close() it may race and cause use-after-free crash: > > > > > > PID: 327771 TASK: ffff952aa1db3180 CPU: 8 COMMAND: "db2fmp" > > > #0 [ffff95202f33b960] machine_kexec at ffffffff890662f4 > > > #1 [ffff95202f33b9c0] __crash_kexec at ffffffff89122b82 > > > #2 [ffff95202f33ba90] crash_kexec at ffffffff89122c70 > > > #3 [ffff95202f33baa8] oops_end at ffffffff89791798 > > > #4 [ffff95202f33bad0] no_context at ffffffff89075d14 > > > #5 [ffff95202f33bb20] __bad_area_nosemaphore at ffffffff89075fe2 > > > #6 [ffff95202f33bb70] bad_area_nosemaphore at ffffffff89076104 > > > #7 [ffff95202f33bb80] __do_page_fault at ffffffff89794750 > > > #8 [ffff95202f33bbf0] do_page_fault at ffffffff89794975 > > > #9 [ffff95202f33bc20] page_fault at ffffffff89790778 > > > [exception RIP: ceph_fl_release_lock+20] > > > RIP: ffffffffc08247a4 RSP: ffff95202f33bcd0 RFLAGS: 00010286 > > > RAX: ffff952d4ebd8a00 RBX: 0000000000000000 RCX: dead000000000200 > > > RDX: ffff95202f33bd60 RSI: ffff95202f33bd60 RDI: ffff9526b6ac5b00 > > > RBP: ffff95202f33bce0 R8: ffff9526b6ac5b18 R9: ffffffffc083c368 > > > R10: 0000000000001109 R11: 0000000000000000 R12: ffff95202f33bd60 > > > R13: ffff9526b6ac5b00 R14: 0000000000000000 R15: 0000000000000000 > > > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > > > #10 [ffff95202f33bce8] locks_release_private at ffffffff892ab3d7 > > > #11 [ffff95202f33bd00] locks_free_lock at ffffffff892ac34d > > > #12 [ffff95202f33bd18] locks_dispose_list at ffffffff892ac44b > > > #13 [ffff95202f33bd40] __posix_lock_file at ffffffff892acdfa > > > #14 [ffff95202f33bda8] posix_lock_file at ffffffff892ad146 > > > #15 [ffff95202f33bdb8] ceph_lock at ffffffffc0824e8a [ceph] > > > #16 [ffff95202f33bdf8] vfs_lock_file at ffffffff892ad185 > > > #17 [ffff95202f33be08] locks_remove_posix at ffffffff892ad239 > > > #18 [ffff95202f33bee0] locks_remove_posix at ffffffff892ad2a0 > > > #19 [ffff95202f33bef0] filp_close at ffffffff8924baa6 > > > #20 [ffff95202f33bf18] __close_fd at ffffffff8926f89c > > > #21 [ffff95202f33bf40] sys_close at ffffffff8924d503 > > > #22 [ffff95202f33bf50] system_call_fastpath at ffffffff89799f92 > > > RIP: 00007f806ec446ab RSP: 00007f80517f0d90 RFLAGS: 00010206 > > > RAX: 0000000000000003 RBX: 00007f8030001a20 RCX: 00007f80300386b0 > > > RDX: 00007f806ef0d880 RSI: 0000000000000001 RDI: 0000000000000006 > > > RBP: 00007f806ef0e3c0 R8: 00007f80517fa700 R9: 0000000000000000 > > > R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 > > > R13: 00007f80300035b0 R14: 00007f80517f1104 R15: 000000000000006c > > > ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b > > > > > > We need to make sure that the filp in the file_lock shouldn't be > > > release when any file_lock is still referring to it. > > > > > > For the Posix-style locks, whose owner will be the thread ids, we > > > will increase the filp's reference. > > > > > > URL: https://tracker.ceph.com/issues/57986 > > > Signed-off-by: Xiubo Li <xiubli@redhat.com> > > > --- > > > drivers/android/binder.c | 2 +- > > > fs/file.c | 15 ++++++++++----- > > > fs/locks.c | 18 +++++++++++++++--- > > > include/linux/fs.h | 14 ++++++++++++++ > > > io_uring/openclose.c | 3 ++- > > > 5 files changed, 42 insertions(+), 10 deletions(-) > > > > > > diff --git a/drivers/android/binder.c b/drivers/android/binder.c > > > index 880224ec6abb..03692564d940 100644 > > > --- a/drivers/android/binder.c > > > +++ b/drivers/android/binder.c > > > @@ -1924,7 +1924,7 @@ static void binder_deferred_fd_close(int fd) > > > if (twcb->file) { > > > // pin it until binder_do_fd_close(); see comments there > > > get_file(twcb->file); > > > - filp_close(twcb->file, current->files); > > > + filp_close(twcb->file, file_lock_make_thread_owner(current->files)); > > > task_work_add(current, &twcb->twork, TWA_RESUME); > > > } else { > > > kfree(twcb); > > > diff --git a/fs/file.c b/fs/file.c > > > index 5f9c802a5d8d..39ad8e74a8d9 100644 > > > --- a/fs/file.c > > > +++ b/fs/file.c > > > @@ -417,6 +417,7 @@ static struct fdtable *close_files(struct files_struct * files) > > > * files structure. > > > */ > > > struct fdtable *fdt = rcu_dereference_raw(files->fdt); > > > + fl_owner_t owner = file_lock_make_thread_owner(files); > > > unsigned int i, j = 0; > > > > > > for (;;) { > > > @@ -429,7 +430,7 @@ static struct fdtable *close_files(struct files_struct * files) > > > if (set & 1) { > > > struct file * file = xchg(&fdt->fd[i], NULL); > > > if (file) { > > > - filp_close(file, files); > > > + filp_close(file, owner); > > > cond_resched(); > > > } > > > } > > > @@ -653,6 +654,7 @@ static struct file *pick_file(struct files_struct *files, unsigned fd) > > > int close_fd(unsigned fd) > > > { > > > struct files_struct *files = current->files; > > > + fl_owner_t owner = file_lock_make_thread_owner(files); > > > struct file *file; > > > > > > spin_lock(&files->file_lock); > > > @@ -661,7 +663,7 @@ int close_fd(unsigned fd) > > > if (!file) > > > return -EBADF; > > > > > > - return filp_close(file, files); > > > + return filp_close(file, owner); > > > } > > > EXPORT_SYMBOL(close_fd); /* for ksys_close() */ > > > > > > @@ -695,6 +697,7 @@ static inline void __range_cloexec(struct files_struct *cur_fds, > > > static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, > > > unsigned int max_fd) > > > { > > > + fl_owner_t owner = file_lock_make_thread_owner(cur_fds); > > > unsigned n; > > > > > > rcu_read_lock(); > > > @@ -711,7 +714,7 @@ static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, > > > > > > if (file) { > > > /* found a valid file to close */ > > > - filp_close(file, cur_fds); > > > + filp_close(file, owner); > > > cond_resched(); > > > } > > > } > > > @@ -816,6 +819,7 @@ struct file *close_fd_get_file(unsigned int fd) > > > > > > void do_close_on_exec(struct files_struct *files) > > > { > > > + fl_owner_t owner = file_lock_make_thread_owner(files); > > > unsigned i; > > > struct fdtable *fdt; > > > > > > @@ -841,7 +845,7 @@ void do_close_on_exec(struct files_struct *files) > > > rcu_assign_pointer(fdt->fd[fd], NULL); > > > __put_unused_fd(files, fd); > > > spin_unlock(&files->file_lock); > > > - filp_close(file, files); > > > + filp_close(file, owner); > > > cond_resched(); > > > spin_lock(&files->file_lock); > > > } > > > @@ -1080,6 +1084,7 @@ static int do_dup2(struct files_struct *files, > > > struct file *file, unsigned fd, unsigned flags) > > > __releases(&files->file_lock) > > > { > > > + fl_owner_t owner = file_lock_make_thread_owner(files); > > > struct file *tofree; > > > struct fdtable *fdt; > > > > > > @@ -1111,7 +1116,7 @@ __releases(&files->file_lock) > > > spin_unlock(&files->file_lock); > > > > > > if (tofree) > > > - filp_close(tofree, files); > > > + filp_close(tofree, owner); > > > > > > return fd; > > > > > > diff --git a/fs/locks.c b/fs/locks.c > > > index 607f94a0e789..e8b67f87e0ee 100644 > > > --- a/fs/locks.c > > > +++ b/fs/locks.c > > > @@ -331,6 +331,8 @@ EXPORT_SYMBOL_GPL(locks_owner_has_blockers); > > > /* Free a lock which is not in use. */ > > > void locks_free_lock(struct file_lock *fl) > > > { > > > + if (fl->fl_file && file_lock_is_thread_owner(fl->fl_owner)) > > > + fput(fl->fl_file); > > > locks_release_private(fl); > > > kmem_cache_free(filelock_cache, fl); > > > } > > > @@ -384,7 +386,10 @@ void locks_copy_lock(struct file_lock *new, struct file_lock *fl) > > > > > > locks_copy_conflock(new, fl); > > > > > > - new->fl_file = fl->fl_file; > > > + if (file_lock_is_thread_owner(new->fl_owner)) > > > + new->fl_file = get_file(fl->fl_file); > > > + else > > > + new->fl_file = fl->fl_file; > > > new->fl_ops = fl->fl_ops; > > > > > > if (fl->fl_ops) { > > > @@ -488,13 +493,14 @@ static int flock64_to_posix_lock(struct file *filp, struct file_lock *fl, > > > } else > > > fl->fl_end = OFFSET_MAX; > > > > > > - fl->fl_owner = current->files; > > > + fl->fl_owner = file_lock_make_thread_owner(current->files); > > > fl->fl_pid = current->tgid; > > > - fl->fl_file = filp; > > > + fl->fl_file = get_file(filp); > > > fl->fl_flags = FL_POSIX; > > > fl->fl_ops = NULL; > > > fl->fl_lmops = NULL; > > > > > > + > > > return assign_type(fl, l->l_type); > > > } > > > > > > @@ -2243,6 +2249,7 @@ int fcntl_getlk(struct file *filp, unsigned int cmd, struct flock *flock) > > > > > > fl->fl_flags |= FL_OFDLCK; > > > fl->fl_owner = filp; > > > + fput(filp); > > > } > > > > > > error = vfs_test_lock(filp, fl); > > > @@ -2376,6 +2383,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd, > > > cmd = F_SETLK; > > > file_lock->fl_flags |= FL_OFDLCK; > > > file_lock->fl_owner = filp; > > > + fput(filp); > > > break; > > > case F_OFD_SETLKW: > > > error = -EINVAL; > > > @@ -2385,6 +2393,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd, > > > cmd = F_SETLKW; > > > file_lock->fl_flags |= FL_OFDLCK; > > > file_lock->fl_owner = filp; > > > + fput(filp); > > > fallthrough; > > > case F_SETLKW: > > > file_lock->fl_flags |= FL_SLEEP; > > > @@ -2450,6 +2459,7 @@ int fcntl_getlk64(struct file *filp, unsigned int cmd, struct flock64 *flock) > > > cmd = F_GETLK64; > > > fl->fl_flags |= FL_OFDLCK; > > > fl->fl_owner = filp; > > > + fput(filp); > > > } > > > > > > error = vfs_test_lock(filp, fl); > > > @@ -2499,6 +2509,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd, > > > cmd = F_SETLK64; > > > file_lock->fl_flags |= FL_OFDLCK; > > > file_lock->fl_owner = filp; > > > + fput(filp); > > > break; > > > case F_OFD_SETLKW: > > > error = -EINVAL; > > > @@ -2508,6 +2519,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd, > > > cmd = F_SETLKW64; > > > file_lock->fl_flags |= FL_OFDLCK; > > > file_lock->fl_owner = filp; > > > + fput(filp); > > > fallthrough; > > > case F_SETLKW64: > > > file_lock->fl_flags |= FL_SLEEP; > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > index e654435f1651..d7d81962a863 100644 > > > --- a/include/linux/fs.h > > > +++ b/include/linux/fs.h > > > @@ -1028,6 +1028,20 @@ static inline struct file *get_file(struct file *f) > > > /* legacy typedef, should eventually be removed */ > > > typedef void *fl_owner_t; > > > > > > +/* > > > + * Set the last significant bit to 1 to mark that > > > + * we have get a reference of the fl->fl_file. > > > + */ > > > +static inline fl_owner_t file_lock_make_thread_owner(fl_owner_t owner) > > > +{ > > > + return (fl_owner_t)((unsigned long)owner | 1UL); > > > +} > > > + > > > +static inline bool file_lock_is_thread_owner(fl_owner_t owner) > > > +{ > > > + return ((unsigned long)owner & 1UL); > > > +} > > > + > > > struct file_lock; > > > > > > struct file_lock_operations { > > > diff --git a/io_uring/openclose.c b/io_uring/openclose.c > > > index 67178e4bb282..5a12cdf7f8d0 100644 > > > --- a/io_uring/openclose.c > > > +++ b/io_uring/openclose.c > > > @@ -212,6 +212,7 @@ int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > int io_close(struct io_kiocb *req, unsigned int issue_flags) > > > { > > > struct files_struct *files = current->files; > > > + fl_owner_t owner = file_lock_make_thread_owner(files); > > > struct io_close *close = io_kiocb_to_cmd(req, struct io_close); > > > struct fdtable *fdt; > > > struct file *file; > > > @@ -247,7 +248,7 @@ int io_close(struct io_kiocb *req, unsigned int issue_flags) > > > goto err; > > > > > > /* No ->flush() or already async, safely close from here */ > > > - ret = filp_close(file, current->files); > > > + ret = filp_close(file, owner); > > > err: > > > if (ret < 0) > > > req_set_fail(req); > > I think this is the wrong approach to fixing this. It also looks like > > you could hit a similar problem with OFD locks and this patch wouldn't > > address that issue. > > For the OFD locks they will set the 'file' struct as the owner just as > the flock does, it should be okay and I don't think it has this issue if > my understanding is correct here. > They set the the owner to "file", but they don't hold a reference to it. With OFD locks, the file is what holds references to the lock, not the reverse. > > The real bug seems to be that ceph_fl_release_lock dereferences fl_file, > > at a point when it shouldn't rely on that being valid. Most filesystems > > stash some info in fl->fl_u if they need to do bookkeeping after > > releasing a lock. Perhaps ceph should be doing something similar? > > This is the 'filp' memory in filp_close(filp, ...): > > crash> file.f_path.dentry,f_inode 0xffff952d7ab46200 > f_path.dentry = 0xffff9521b121cb40 > f_inode = 0xffff951f3ea33550, > > We can see the 'f_inode' is pointing to the correct inode memory. > > > > While later in 'ceph_fl_release_lock()': > > 41 static void ceph_fl_release_lock(struct file_lock *fl) > 42 { > 43 struct ceph_file_info *fi = fl->fl_file->private_data; > 44 struct inode *inode = file_inode(fl->fl_file); > 45 struct ceph_inode_info *ci = ceph_inode(inode); > 46 atomic_dec(&fi->num_locks); > 47 if (atomic_dec_and_test(&ci->i_filelock_ref)) { > 48 /* clear error when all locks are released */ > 49 spin_lock(&ci->i_ceph_lock); > 50 ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK; > 51 spin_unlock(&ci->i_ceph_lock); > 52 } > 53 } > You only need the inode for most of this. The exception is fi->num_locks, so you may need to test for that in a different way. > It crashed in Line#47 and the 'fl->fl_file' memory is: > > crash> file.f_path.dentry,f_inode 0xffff952d4ebd8a00 > f_path.dentry = 0x0 > f_inode = 0x0, > > Please NOTE: the 'filp' and 'fl->fl_file' are two different 'file struct'. > Yep, I understand the bug. I just don't like the proposed fix. :) > Can we fix this by using 'fl->fl_u' here ? > Probably. You could take and hold an inode reference in there, and maybe add a function that looks at whether there are any locks held against a particular file, rather than trying to count locks in ceph_file_info. > I was also thinking I could just call the 'get_file(file)' in > ceph_lock() and then in ceph_fl_release_lock() release the reference > counter. How about this ? > That may work too, though again, I'd be worried about cyclical dependencies, particularly with OFD locks. If the lock holds a reference to the file, then can the file's refcount ever go to zero if the lock is never explicitly released? I think not. You may also need to consider flock locks too, since they have similar ownership semantics to OFD locks.
On 07/11/2022 20:29, Jeff Layton wrote: > On Mon, 2022-11-07 at 20:03 +0800, Xiubo Li wrote: >> On 07/11/2022 18:33, Jeff Layton wrote: >>> On Mon, 2022-11-07 at 17:52 +0800, xiubli@redhat.com wrote: [...] >>>> diff --git a/io_uring/openclose.c b/io_uring/openclose.c >>>> index 67178e4bb282..5a12cdf7f8d0 100644 >>>> --- a/io_uring/openclose.c >>>> +++ b/io_uring/openclose.c >>>> @@ -212,6 +212,7 @@ int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) >>>> int io_close(struct io_kiocb *req, unsigned int issue_flags) >>>> { >>>> struct files_struct *files = current->files; >>>> + fl_owner_t owner = file_lock_make_thread_owner(files); >>>> struct io_close *close = io_kiocb_to_cmd(req, struct io_close); >>>> struct fdtable *fdt; >>>> struct file *file; >>>> @@ -247,7 +248,7 @@ int io_close(struct io_kiocb *req, unsigned int issue_flags) >>>> goto err; >>>> >>>> /* No ->flush() or already async, safely close from here */ >>>> - ret = filp_close(file, current->files); >>>> + ret = filp_close(file, owner); >>>> err: >>>> if (ret < 0) >>>> req_set_fail(req); >>> I think this is the wrong approach to fixing this. It also looks like >>> you could hit a similar problem with OFD locks and this patch wouldn't >>> address that issue. >> For the OFD locks they will set the 'file' struct as the owner just as >> the flock does, it should be okay and I don't think it has this issue if >> my understanding is correct here. >> > They set the the owner to "file", but they don't hold a reference to it. > With OFD locks, the file is what holds references to the lock, not the > reverse. Yeah, right. But for both OFD and flock they shouldn't hit this issue, because it when removing all the locks having the same owner, which is the 'file', passed by filp_close(filp), the 'file' reference counter must be larger than 0. Because the filp_close() is still using it. This is why using the thread id as the owner is a special case for Posix-style lock. > >>> The real bug seems to be that ceph_fl_release_lock dereferences fl_file, >>> at a point when it shouldn't rely on that being valid. Most filesystems >>> stash some info in fl->fl_u if they need to do bookkeeping after >>> releasing a lock. Perhaps ceph should be doing something similar? >> This is the 'filp' memory in filp_close(filp, ...): >> >> crash> file.f_path.dentry,f_inode 0xffff952d7ab46200 >> f_path.dentry = 0xffff9521b121cb40 >> f_inode = 0xffff951f3ea33550, >> >> We can see the 'f_inode' is pointing to the correct inode memory. >> >> >> >> While later in 'ceph_fl_release_lock()': >> >> 41 static void ceph_fl_release_lock(struct file_lock *fl) >> 42 { >> 43 struct ceph_file_info *fi = fl->fl_file->private_data; >> 44 struct inode *inode = file_inode(fl->fl_file); >> 45 struct ceph_inode_info *ci = ceph_inode(inode); >> 46 atomic_dec(&fi->num_locks); >> 47 if (atomic_dec_and_test(&ci->i_filelock_ref)) { >> 48 /* clear error when all locks are released */ >> 49 spin_lock(&ci->i_ceph_lock); >> 50 ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK; >> 51 spin_unlock(&ci->i_ceph_lock); >> 52 } >> 53 } >> > You only need the inode for most of this. The exception is > fi->num_locks, so you may need to test for that in a different way. > >> It crashed in Line#47 and the 'fl->fl_file' memory is: >> >> crash> file.f_path.dentry,f_inode 0xffff952d4ebd8a00 >> f_path.dentry = 0x0 >> f_inode = 0x0, >> >> Please NOTE: the 'filp' and 'fl->fl_file' are two different 'file struct'. >> > Yep, I understand the bug. I just don't like the proposed fix. :) Yeah, I also think this approach is ugly :-) >> Can we fix this by using 'fl->fl_u' here ? >> > Probably. You could take and hold an inode reference in there, and maybe > add a function that looks at whether there are any locks held against a > particular file, rather than trying to count locks in ceph_file_info. Okay, this sounds good. Let me try this tomorrow. >> I was also thinking I could just call the 'get_file(file)' in >> ceph_lock() and then in ceph_fl_release_lock() release the reference >> counter. How about this ? >> > That may work too, though again, I'd be worried about cyclical > dependencies, particularly with OFD locks. If the lock holds a reference > to the file, then can the file's refcount ever go to zero if the lock is > never explicitly released? I think not. > > You may also need to consider flock locks too, since they have similar > ownership semantics to OFD locks. I will send a V2 later. Thanks Jeff! - Xiubo
diff --git a/drivers/android/binder.c b/drivers/android/binder.c index 880224ec6abb..03692564d940 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -1924,7 +1924,7 @@ static void binder_deferred_fd_close(int fd) if (twcb->file) { // pin it until binder_do_fd_close(); see comments there get_file(twcb->file); - filp_close(twcb->file, current->files); + filp_close(twcb->file, file_lock_make_thread_owner(current->files)); task_work_add(current, &twcb->twork, TWA_RESUME); } else { kfree(twcb); diff --git a/fs/file.c b/fs/file.c index 5f9c802a5d8d..39ad8e74a8d9 100644 --- a/fs/file.c +++ b/fs/file.c @@ -417,6 +417,7 @@ static struct fdtable *close_files(struct files_struct * files) * files structure. */ struct fdtable *fdt = rcu_dereference_raw(files->fdt); + fl_owner_t owner = file_lock_make_thread_owner(files); unsigned int i, j = 0; for (;;) { @@ -429,7 +430,7 @@ static struct fdtable *close_files(struct files_struct * files) if (set & 1) { struct file * file = xchg(&fdt->fd[i], NULL); if (file) { - filp_close(file, files); + filp_close(file, owner); cond_resched(); } } @@ -653,6 +654,7 @@ static struct file *pick_file(struct files_struct *files, unsigned fd) int close_fd(unsigned fd) { struct files_struct *files = current->files; + fl_owner_t owner = file_lock_make_thread_owner(files); struct file *file; spin_lock(&files->file_lock); @@ -661,7 +663,7 @@ int close_fd(unsigned fd) if (!file) return -EBADF; - return filp_close(file, files); + return filp_close(file, owner); } EXPORT_SYMBOL(close_fd); /* for ksys_close() */ @@ -695,6 +697,7 @@ static inline void __range_cloexec(struct files_struct *cur_fds, static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, unsigned int max_fd) { + fl_owner_t owner = file_lock_make_thread_owner(cur_fds); unsigned n; rcu_read_lock(); @@ -711,7 +714,7 @@ static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, if (file) { /* found a valid file to close */ - filp_close(file, cur_fds); + filp_close(file, owner); cond_resched(); } } @@ -816,6 +819,7 @@ struct file *close_fd_get_file(unsigned int fd) void do_close_on_exec(struct files_struct *files) { + fl_owner_t owner = file_lock_make_thread_owner(files); unsigned i; struct fdtable *fdt; @@ -841,7 +845,7 @@ void do_close_on_exec(struct files_struct *files) rcu_assign_pointer(fdt->fd[fd], NULL); __put_unused_fd(files, fd); spin_unlock(&files->file_lock); - filp_close(file, files); + filp_close(file, owner); cond_resched(); spin_lock(&files->file_lock); } @@ -1080,6 +1084,7 @@ static int do_dup2(struct files_struct *files, struct file *file, unsigned fd, unsigned flags) __releases(&files->file_lock) { + fl_owner_t owner = file_lock_make_thread_owner(files); struct file *tofree; struct fdtable *fdt; @@ -1111,7 +1116,7 @@ __releases(&files->file_lock) spin_unlock(&files->file_lock); if (tofree) - filp_close(tofree, files); + filp_close(tofree, owner); return fd; diff --git a/fs/locks.c b/fs/locks.c index 607f94a0e789..e8b67f87e0ee 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -331,6 +331,8 @@ EXPORT_SYMBOL_GPL(locks_owner_has_blockers); /* Free a lock which is not in use. */ void locks_free_lock(struct file_lock *fl) { + if (fl->fl_file && file_lock_is_thread_owner(fl->fl_owner)) + fput(fl->fl_file); locks_release_private(fl); kmem_cache_free(filelock_cache, fl); } @@ -384,7 +386,10 @@ void locks_copy_lock(struct file_lock *new, struct file_lock *fl) locks_copy_conflock(new, fl); - new->fl_file = fl->fl_file; + if (file_lock_is_thread_owner(new->fl_owner)) + new->fl_file = get_file(fl->fl_file); + else + new->fl_file = fl->fl_file; new->fl_ops = fl->fl_ops; if (fl->fl_ops) { @@ -488,13 +493,14 @@ static int flock64_to_posix_lock(struct file *filp, struct file_lock *fl, } else fl->fl_end = OFFSET_MAX; - fl->fl_owner = current->files; + fl->fl_owner = file_lock_make_thread_owner(current->files); fl->fl_pid = current->tgid; - fl->fl_file = filp; + fl->fl_file = get_file(filp); fl->fl_flags = FL_POSIX; fl->fl_ops = NULL; fl->fl_lmops = NULL; + return assign_type(fl, l->l_type); } @@ -2243,6 +2249,7 @@ int fcntl_getlk(struct file *filp, unsigned int cmd, struct flock *flock) fl->fl_flags |= FL_OFDLCK; fl->fl_owner = filp; + fput(filp); } error = vfs_test_lock(filp, fl); @@ -2376,6 +2383,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd, cmd = F_SETLK; file_lock->fl_flags |= FL_OFDLCK; file_lock->fl_owner = filp; + fput(filp); break; case F_OFD_SETLKW: error = -EINVAL; @@ -2385,6 +2393,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd, cmd = F_SETLKW; file_lock->fl_flags |= FL_OFDLCK; file_lock->fl_owner = filp; + fput(filp); fallthrough; case F_SETLKW: file_lock->fl_flags |= FL_SLEEP; @@ -2450,6 +2459,7 @@ int fcntl_getlk64(struct file *filp, unsigned int cmd, struct flock64 *flock) cmd = F_GETLK64; fl->fl_flags |= FL_OFDLCK; fl->fl_owner = filp; + fput(filp); } error = vfs_test_lock(filp, fl); @@ -2499,6 +2509,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd, cmd = F_SETLK64; file_lock->fl_flags |= FL_OFDLCK; file_lock->fl_owner = filp; + fput(filp); break; case F_OFD_SETLKW: error = -EINVAL; @@ -2508,6 +2519,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd, cmd = F_SETLKW64; file_lock->fl_flags |= FL_OFDLCK; file_lock->fl_owner = filp; + fput(filp); fallthrough; case F_SETLKW64: file_lock->fl_flags |= FL_SLEEP; diff --git a/include/linux/fs.h b/include/linux/fs.h index e654435f1651..d7d81962a863 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1028,6 +1028,20 @@ static inline struct file *get_file(struct file *f) /* legacy typedef, should eventually be removed */ typedef void *fl_owner_t; +/* + * Set the last significant bit to 1 to mark that + * we have get a reference of the fl->fl_file. + */ +static inline fl_owner_t file_lock_make_thread_owner(fl_owner_t owner) +{ + return (fl_owner_t)((unsigned long)owner | 1UL); +} + +static inline bool file_lock_is_thread_owner(fl_owner_t owner) +{ + return ((unsigned long)owner & 1UL); +} + struct file_lock; struct file_lock_operations { diff --git a/io_uring/openclose.c b/io_uring/openclose.c index 67178e4bb282..5a12cdf7f8d0 100644 --- a/io_uring/openclose.c +++ b/io_uring/openclose.c @@ -212,6 +212,7 @@ int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) int io_close(struct io_kiocb *req, unsigned int issue_flags) { struct files_struct *files = current->files; + fl_owner_t owner = file_lock_make_thread_owner(files); struct io_close *close = io_kiocb_to_cmd(req, struct io_close); struct fdtable *fdt; struct file *file; @@ -247,7 +248,7 @@ int io_close(struct io_kiocb *req, unsigned int issue_flags) goto err; /* No ->flush() or already async, safely close from here */ - ret = filp_close(file, current->files); + ret = filp_close(file, owner); err: if (ret < 0) req_set_fail(req);