[RFC,2/2] io_uring: add support for getdents

Message ID 20230422-uring-getdents-v1-2-14c1db36e98c@codewreck.org
State New
Series io_uring: add getdents support, take 2

Commit Message

Dominique Martinet April 22, 2023, 8:40 a.m. UTC
  This adds support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from its current position as stored
in the file struct, and the file's position is updated exactly as if
getdents64 had been called.

Additionally, since io_uring has no way of issuing a seek, a flag
IORING_GETDENTS_REWIND has been added that will seek to the start of the
directory like rewinddir(3), for users that might require such a thing.
This acts exactly as if seek and then getdents64 had been called
sequentially, with no atomicity guarantee:
if this wasn't clear, it is the caller's responsibility not to issue
getdents multiple times on a single file in parallel if they want useful
results, as is already the case when using the syscall from multiple
threads.

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---
 include/uapi/linux/io_uring.h |  7 ++++++
 io_uring/fs.c                 | 51 +++++++++++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 +++
 io_uring/opdef.c              |  8 +++++++
 4 files changed, 69 insertions(+)
  

Comments

Dominique Martinet April 23, 2023, 11:43 p.m. UTC | #1
Dave Chinner wrote on Mon, Apr 24, 2023 at 08:40:45AM +1000:
> This doesn't actually introduce non-blocking getdents operations, so
> what's the point? If it just shuffles the getdents call off to a
> background thread, why bother with io_uring in the first place?

As said in the cover letter, my main motivation really is simplifying the
userspace application:
 - style-wise, mixing plain old getdents(2) or readdir(3) into the
middle of an io_uring handling loop just feels wrong; but this may just
be my OCD talking.
 - in my understanding io_uring has its own thread pool, so even if the
actual getdents blocks, other IOs can progress (assuming there are fewer
blocked getdents calls than threads), without having to build one's own
extra thread pool next to the uring handling.
Looking at io_uring/fs.c, the other "metadata related" calls there also
use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
linkat all do), so I didn't think of that as a problem in itself.


> Filesystems like XFS can easily do non-blocking getdents calls - we
> just need the NOWAIT plumbing (like we added to the IO path with
> IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> Indeed, filesystems often have async readahead built into their
> getdents paths (XFS does), so it seems to me that we really want
> non-blocking getdents to allow filesystems to take full advantage of
> doing work without blocking and then shuffling the remainder off to
> a background thread when it actually needs to wait for IO....

I believe that can be done without any change to this API, so that'll be
a very welcome addition when it is ready; I don't think adding the
uring op should wait on this if we can agree that a simple wrapper API
is good enough (or come up with a better one if someone has a Good Idea)

(looking at io_uring/rw.c for comparison, io_getdents() will "just" need
to be adjusted to issue an async req if IO_URING_F_NONBLOCK is set, and
the poll/retry logic sorted out)

Thanks,
  
Clay Harris April 24, 2023, 7:29 a.m. UTC | #2
On Mon, Apr 24 2023 at 08:43:00 +0900, Dominique Martinet quoth thus:

> Dave Chinner wrote on Mon, Apr 24, 2023 at 08:40:45AM +1000:
> > This doesn't actually introduce non-blocking getdents operations, so
> > what's the point? If it just shuffles the getdents call off to a
> > background thread, why bother with io_uring in the first place?
> 
> As said in the cover letter my main motivation really is simplifying the
> userspace application:
>  - style-wise, mixing in plain old getdents(2) or readdir(3) in the
> middle of an io_uring handling loop just feels wrong; but this may just
> be my OCD talking.
>  - in my understanding io_uring has its own thread pool, so even if the
> actual getdents is blocking other IOs can progress (assuming there is
> less blocked getdents than threads), without having to build one's own
> extra thread pool next to the uring handling.
> Looking at io_uring/fs.c the other "metadata related" calls there also
> use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> linkat all do), so I didn't think of that as a problem in itself.

Having written code to create additional worker threads in addition
to using io_uring as a main loop, I'm glad to see this proposal back
again.  I think the original work was shot down much too hastily based
on the file positioning issues.  Really only two cases of directory
position are useful*, which can be expressed either as an off_t
(0 = rewind, -1 = curpos), or a single bit flag.  As I understand the
code, the rewind case shouldn't be any problem.  From a practical
viewpoint, I don't think non-blocking would see a lot of use, but it
wouldn't hurt.

This also seems like a good place to bring up a point I made with
the last attempt at this code.  You're missing an optimization here.
getdents knows whether it is returning a buffer because the next entry
won't fit versus because there are no more entries.  As it doesn't
return that information, callers must always keep calling it back
until EOF.  This means a completely unnecessary call is made for
every open directory.  In other words, for a directory scan where
the buffers are large enough not to overflow, literally twice
as many calls are made to getdents as necessary.  As io_uring is
in-kernel, it could use an internal interface to getdents which would
return an EOF indicator along with the (probably non-empty) buffer.
io_uring would then return that flag with the CQE.


(* As an aside, the only place I've ever seen a non-zero lseek on a
directory, is in a very resource limited environment, e.g. too small
open files limit.  In the case of a depth-first directory scan, it
must close directories before completely reading them, and reopen /
lseek to their previous position in order to continue.  This scenario
is certainly not worth bothering with for io_uring.)

> > Filesystems like XFS can easily do non-blocking getdents calls - we
> > just need the NOWAIT plumbing (like we added to the IO path with
> > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > Indeed, filesystems often have async readahead built into their
> > getdents paths (XFS does), so it seems to me that we really want
> > non-blocking getdents to allow filesystems to take full advantage of
> > doing work without blocking and then shuffling the remainder off to
> > a background thread when it actually needs to wait for IO....
> 
> I believe that can be done without any change of this API, so that'll be
> a very welcome addition when it is ready; I don't think the adding the
> uring op should wait on this if we can agree a simple wrapper API is
> good enough (or come up with a better one if someone has a Good Idea)
> 
> (looking at io_uring/rw.c for comparison, io_getdents() will "just" need
> to be adjusted to issue an async req if IO_URING_F_NONBLOCK is set, and
> the poll/retry logic sorted out)
> 
> Thanks,
> -- 
> Dominique Martinet | Asmadeus
  
Dominique Martinet April 24, 2023, 8:41 a.m. UTC | #3
Thanks!

Clay Harris wrote on Mon, Apr 24, 2023 at 02:29:46AM -0500:
> This also seems like a good place to bring up a point I made with
> the last attempt at this code.  You're missing an optimization here.
> getdents knows whether it is returning a buffer because the next entry
> won't fit versus because there are no more entries.  As it doesn't
> return that information, callers must always keep calling it back
> until EOF.  This means a completely unnecessary call is made for
> every open directory.  In other words, for a directory scan where
> the buffers are large enough to not overflow, that literally twice
> as many calls are made to getdents as necessary.  As io_uring is
> in-kernel, it could use an internal interface to getdents which would
> return an EOF indicator along with the (probably non-empty) buffer.
> io_uring would then return that flag with the CQE.

Sorry I didn't spot that comment in the last iteration of the patch,
that sounds interesting.

This isn't straightforward even in-kernel though: the ctx.actor callback
(filldir64) isn't called when we're done, so we only know we couldn't
fill in the buffer.
We could have the callback record 'buffer full' and consider we're done
if the buffer is full, or just single-handedly declare we're done if we
have more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over,
but I assume a filesystem is allowed to return what it has readily
available and expect the user to come back later?
In which case we cannot use this as a heuristic...

So if we do this, it'll require a way for filesystems to say they're
filling in as much as they can, or go the sledgehammer way of adding an
extra dir_context callback; either way I'm not sure I want to deal with
all that immediately unless I'm told all filesystems will fill as much
as possible without ever failing for any temporary reason in the middle
of iterate/iterate_shared().
Call me greedy, but I believe such a flag in the CQE could also be added
later on without any bad side effects (it's optional to check it to stop
calling early, and there's no harm in not setting it)?


> (* As an aside, the only place I've ever seen a non-zero lseek on a
> directory, is in a very resource limited environment, e.g. too small
> open files limit.  In the case of a depth-first directory scan, it
> must close directories before completely reading them, and reopen /
> lseek to their previous position in order to continue.  This scenario
> is certainly not worth bothering with for io_uring.)

(I also thought of userspace NFS/9P servers, as these two at least get
requests from clients with an arbitrary offset, but I'll be glad to
forget about them for now...)
  
Clay Harris April 24, 2023, 9:20 a.m. UTC | #4
On Mon, Apr 24 2023 at 17:41:18 +0900, Dominique Martinet quoth thus:

> Thanks!
> 
> Clay Harris wrote on Mon, Apr 24, 2023 at 02:29:46AM -0500:
> > This also seems like a good place to bring up a point I made with
> > the last attempt at this code.  You're missing an optimization here.
> > getdents knows whether it is returning a buffer because the next entry
> > won't fit versus because there are no more entries.  As it doesn't
> > return that information, callers must always keep calling it back
> > until EOF.  This means a completely unnecessary call is made for
> > every open directory.  In other words, for a directory scan where
> > the buffers are large enough to not overflow, that literally twice
> > as many calls are made to getdents as necessary.  As io_uring is
> > in-kernel, it could use an internal interface to getdents which would
> > return an EOF indicator along with the (probably non-empty) buffer.
> > io_uring would then return that flag with the CQE.
> 
> Sorry I didn't spot that comment in the last iteration of the patch,
> that sounds interesting.
> 
> This isn't straightforward even in-kernel though: the ctx.actor callback
> (filldir64) isn't called when we're done, so we only know we couldn't
> fill in the buffer.
> We could have the callback record 'buffer full' and consider we're done
> if the buffer is full, or just single-handedly declare we are if we have
> more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over, but I
> assume a filesystem is allowed to return what it has readily available
> and expect the user to come back later?
> In which case we cannot use this as an heuristic...
> 
> So if we do this, it'll require a way for filesystems to say they're
> filling in as much as they can, or go the sledgehammer way of adding an
> extra dir_context dir_context callback, either way I'm not sure I want
> to deal with all that immediately unless I'm told all filesystems will
> fill as much as possible without ever failing for any temporary reason
> in the middle of iterate/iterate_shared().

I don't have a complete understanding of this area, but my thought was
not that we would look for any buffer full condition, but rather that
an iterator could be tested for next_entry == EOF.

> Call me greedy but I believe such a flag in the CQE could also be added
> later on without any bad side effects (as it's optional to check on it
> to stop calling early and there's no harm in not setting it)?

Certainly it could be added later, but I wanted to make sure some thought
was put into it now.  It would be nice to have it sooner rather than later
though...

> 
> > (* As an aside, the only place I've ever seen a non-zero lseek on a
> > directory, is in a very resource limited environment, e.g. too small
> > open files limit.  In the case of a depth-first directory scan, it
> > must close directories before completely reading them, and reopen /
> > lseek to their previous position in order to continue.  This scenario
> > is certainly not worth bothering with for io_uring.)
> 
> (I also thought of userspace NFS/9P servers are these two at least get
> requests from clients with an arbitrary offset, but I'll be glad to
> forget about them for now...)
> 
> -- 
> Dominique Martinet | Asmadeus
  
Dominique Martinet April 24, 2023, 10:55 a.m. UTC | #5
Clay Harris wrote on Mon, Apr 24, 2023 at 04:20:55AM -0500:
> > This isn't straightforward even in-kernel though: the ctx.actor callback
> > (filldir64) isn't called when we're done, so we only know we couldn't
> > fill in the buffer.
> > We could have the callback record 'buffer full' and consider we're done
> > if the buffer is full, or just single-handedly declare we are if we have
> > more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over, but I
> > assume a filesystem is allowed to return what it has readily available
> > and expect the user to come back later?
> > In which case we cannot use this as an heuristic...
> > 
> > So if we do this, it'll require a way for filesystems to say they're
> > filling in as much as they can, or go the sledgehammer way of adding an
> > extra dir_context dir_context callback, either way I'm not sure I want
> > to deal with all that immediately unless I'm told all filesystems will
> > fill as much as possible without ever failing for any temporary reason
> > in the middle of iterate/iterate_shared().
> 
> I don't have a complete understanding of this area, but my thought was
> not that we would look for any buffer full condition, but rather that
> an iterator could be tested for next_entry == EOF.

I'm afraid I don't see any such iterator readily available in the common
code, so that would also have to be per fs as far as I can see :/

Also some filesystems just don't have a next_entry iterator readily
available; for example on 9p the network protocol is pretty much exactly
like getdents, and a readdir call is issued continuously until either
filldir64 returns an error (e.g. buffer full) or the server returns 0
(or an error); once you leave the v9fs_dir_readdir{,_dotl}() call you
lose this information immediately.
Getting this information out would be doubly appreciable because the
"final getdents" to confirm the dir has been fully read is redundant
with the last 9p readdir that ended the previous iteration
(could also be fixed by caching...), but that'll really require fixing
up the iterate_shared() operation to signal such a thing...

> > Call me greedy but I believe such a flag in the CQE could also be added
> > later on without any bad side effects (as it's optional to check on it
> > to stop calling early and there's no harm in not setting it)?
> 
> Certainly it could be added later, but I wanted to make sure some thought
> was put into it now.  It would be nice to have it sooner rather than later
> though...

Yes, if there's an easy way forward I overlooked I'd be happy to spend a
bit more time on it; and discussing never hurts.
In for a penny, in for a pound :)
  
Dominique Martinet April 28, 2023, 6:14 a.m. UTC | #6
Dave Chinner wrote on Fri, Apr 28, 2023 at 03:06:40PM +1000:
> >  - in my understanding io_uring has its own thread pool, so even if the
> > actual getdents is blocking other IOs can progress (assuming there is
> > less blocked getdents than threads), without having to build one's own
> > extra thread pool next to the uring handling.
> > Looking at io_uring/fs.c the other "metadata related" calls there also
> > use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> > linkat all do), so I didn't think of that as a problem in itself.
> 
> I think you missed the point. getdents is not an exclusive operation
> - it is run under shared locking unlike all the other direcotry
> modification operations you cite above. They use exclusive locking
> so there's no real benefit by trying to run them non-blocking or
> as an async operation as they are single threaded and will consume
> a single thread context from start to end.
> 
> Further, one of the main reasons they get punted to the per-thread
> pool is so that io_uring can optimise away the lock contention
> caused by running multiple work threads on exclusively locked
> objects; it does this by only running one work item per inode at a
> time.

Ah, I didn't know that -- thank you for this piece of information.

If we're serializing readdirs per inode, I agree that this
implementation is subpar for filesystems implementing iterate_shared.

> > > Filesystems like XFS can easily do non-blocking getdents calls - we
> > > just need the NOWAIT plumbing (like we added to the IO path with
> > > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > > Indeed, filesystems often have async readahead built into their
> > > getdents paths (XFS does), so it seems to me that we really want
> > > non-blocking getdents to allow filesystems to take full advantage of
> > > doing work without blocking and then shuffling the remainder off to
> > > a background thread when it actually needs to wait for IO....
> > 
> > I believe that can be done without any change of this API, so that'll be
> > a very welcome addition when it is ready;
> 
> Again, I think you miss the point.
> 
> Non blocking data IO came before io_uring and we had the
> infrastructure in place before io_uring took advantage of it.
> Application developers asked the fs developers to add support for
> non-blocking direct IO operations and because we pretty much had all
> the infrastructure to support already in place it got done quickly
> via preadv2/pwritev2 via RWF_NOWAIT flags.
> 
> We already pass a struct dir_context to ->iterate_shared(), so we
> have a simple way to add context specific flags down the filesystem
> from iterate_dir(). This is similar to the iocb for file data IO
> that contains the flags field that holds the IOCB_NOWAIT context for
> io_uring based IO. So the infrastructure to plumb it all the way
> down the fs implementation of ->iterate_shared is already there.

Sure, that sounds like a good approach that doesn't break the API
(iterate/iterate_shared implementations that don't look at the flags
keep working, and filesystems that want to look at it can do so)

Adding such a flag and setting it on the uring side without doing
anything else is trivial enough if polling/rescheduling can be sorted
out.

(I guess at this rate the dir_context flag could also be modified by the
FS to signal it just filled in the last entry, for the other part of
this thread asking to know when the directory is done being iterated
without wasting a trivial empty getdents)

> XFS also has async metadata IO capability and we use that for
> readahead in the xfs_readdir() implementation. hence we've got all
> the parts we need to do non-blocking readdir already in place. This
> is very similar to how we already had all the pieces in the IO path
> ready to do non-block IO well before anyone asked for IOCB_NOWAIT
> functionality....
> 
> AFAICT, the io_uring code wouldn't need to do much more other than
> punt to the work queue if it receives a -EAGAIN result. Otherwise
> the what the filesystem returns doesn't need to change, and I don't
> see that we need to change how the filldir callbacks work, either.
> We just keep filling the user buffer until we either run out of
> cached directory data or the user buffer is full.

I agree with the fs side of it; I'd like to confirm what the uring
side needs to do before proceeding -- looking at the read/write path
there seems to be a polling mechanism in place to tell uring when to
look again, and I haven't looked at this part of the code yet to see
what happens if no such polling is in place (does uring just retry
periodically?)

I'll have a look this weekend.

> > I don't think the adding the
> > uring op should wait on this if we can agree a simple wrapper API is
> > good enough (or come up with a better one if someone has a Good Idea)
> 
> It doesn't look at all hard to me. If you add a NOWAIT context flag
> to the dir_context it should be relatively trivial to connect all
> the parts together. If you do all the VFS, io_uring and userspace
> testing infrastructure work, I should be able to sort out the
> changes needed to xfs_readdir() to support nonblocking
> ->iterate_shared() behaviour.

I still think the userspace <-> ring API is unrelated to the nowait side
of it, but as said in the other part of the thread I'll be happy to give
a bit more time if that helps moving forward.

I'll send another reply around the end of the weekend with either v2 of
this patch or a few more comments.


Thanks,
  
Dominique Martinet April 28, 2023, 11:27 a.m. UTC | #7
Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > We already pass a struct dir_context to ->iterate_shared(), so we
> > have a simple way to add context specific flags down the filesystem
> > from iterate_dir(). This is similar to the iocb for file data IO
> > that contains the flags field that holds the IOCB_NOWAIT context for
> > io_uring based IO. So the infrastructure to plumb it all the way
> > down the fs implementation of ->iterate_shared is already there.
> 
> Sure, that sounds like a good approach that isn't breaking the API (not
> breaking iterate/iterate_shared implementations that don't look at the
> flags and allowing the fs that want to look at it to do so)

Hmm, actually I said that, but io_getdents() needs to know whether the
flag will be honored (if it will be, we can call this when issue_flags &
IO_URING_F_NONBLOCK, but if we're not sure the fs handles it then we
risk blocking)

I'm not familiar with this part of the VFS, but I do not see any kind of
flags for the filesystem to signal if it'll handle it or not -- this is
actually similar to iterate vs. iterate_shared so that'll mean adding
iterate_shared_hasnonblock or something, which is getting silly.

I'm sure you understand this better than me and I'm missing something
obvious here, but I don't think I'll be able to make something I'm happy
with here (in a reasonable timeframe anyway).


Thanks,
  
Dominique Martinet April 29, 2023, 8:07 a.m. UTC | #8
Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > AFAICT, the io_uring code wouldn't need to do much more other than
> > punt to the work queue if it receives a -EAGAIN result. Otherwise
> > the what the filesystem returns doesn't need to change, and I don't
> > see that we need to change how the filldir callbacks work, either.
> > We just keep filling the user buffer until we either run out of
> > cached directory data or the user buffer is full.
> 
> [...] I'd like to confirm what the uring
> side needs to do before proceeding -- looking at the read/write path
> there seems to be a polling mechanism in place to tell uring when to
> look again, and I haven't looked at this part of the code yet to see
> what happens if no such polling is in place (does uring just retry
> periodically?)

Ok so this part can work out as you said, I hadn't understood what you
meant by "punt to the work queue" but that should work from my new
understanding of the ring; we can just return EAGAIN if the non-blocking
variant doesn't have immediate results and call the blocking variant
when we're called again without IO_URING_F_NONBLOCK in flags.
(So there's no need to try to add a form of polling, although that is
possible if we ever become able to do that; I'll just forget about this
and be happy this part is easy)


That just leaves deciding whether a filesystem handles the non-blocking
variant or not; ideally if we can know early (at prep time) we can even
mark REQ_F_FORCE_ASYNC in flags to skip the non-blocking call for
filesystems that don't handle it, and we get the best of both worlds.

I've had a second look and I still don't see anything obvious though;
I'd rather avoid adding a new variant of iterate()/iterate_shared() --
we could use that as a chance to add a flag to struct file_operation
instead? e.g., something like mmap_supported_flags:
-----
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..2ebbf48ee18b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1761,7 +1761,7 @@ struct file_operations {
 	int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
 			unsigned int flags);
 	int (*iterate) (struct file *, struct dir_context *);
-	int (*iterate_shared) (struct file *, struct dir_context *);
+	unsigned long iterate_supported_flags;
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
@@ -1797,6 +1797,10 @@ struct file_operations {
 				unsigned int poll_flags);
 } __randomize_layout;
 
+/** iterate_supported_flags */
+#define ITERATE_SHARED 0x1
+#define ITERATE_NOWAIT 0x2
+
 struct inode_operations {
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
-----

and fix all usages of iterate_shared.

I guess at this rate it might make sense to rename mmap_supported_flags
to some more generic supported_flags instead?...

It's a bit more than I have signed up for, but I guess it's still
reasonable enough. I'll wait for feedback before doing it though; please
say if this sounds good to you and I'll send a v2 with such a flag, as
well as adding flags to dir_context as you had suggested.

Thanks,
  
Dominique Martinet May 1, 2023, 12:49 a.m. UTC | #9
Dave Chinner wrote on Mon, May 01, 2023 at 09:32:41AM +1000:
> > I've had a second look and I still don't see anything obvious though;
> > I'd rather avoid adding a new variant of iterate()/iterate_shared() --
> > we could use that as a chance to add a flag to struct file_operation
> > instead? e.g., something like mmap_supported_flags:
> 
> I don't think that makes sense - the eventual goal is to make
> ->iterate() go away entirely and all filesystems use
> ->iterate_shared(). Hence I think adding flags to select iterate vs
> iterate_shared and the locking that is needed is the wrong place to
> start from here.

(The flag could just go away when all filesystems not supporting it are
gone, and it could be made the other way around (e.g. explicit
NOT_SHARED to encourage migrations), so I don't really see the problem
with this but next point makes this moot anyway)

> Whether the filesystem supports non-blocking ->iterate_shared() or
> not is a filesystem implementation option and io_uring needs that
> information to be held on the struct file for efficient
> determination of whether it should use non-blocking operations or
> not.

Right, sorry. I was thinking that since it's fs/op dependent it made
more sense to keep it next to the iterate operation, but it would be a
layering violation to look at the file_operations vector directly from
the uring code... So having it in the struct file is better from that
point of view.

> We already set per-filesystem file modes via the ->open method,
> that's how we already tell io_uring that it can do NOWAIT IO, as
> well as async read/write IO for regular files. And now we also use
> it for FMODE_DIO_PARALLEL_WRITE, too.
> 
> See __io_file_supports_nowait()....
> 
> Essentially, io_uring already has the mechanism available to it
> to determine if it should use NOWAIT semantics for getdents
> operations; we just need to set FMODE_NOWAIT correctly for directory
> files via ->open() on the filesystems that support it...

Great, I wasn't aware of FMODE_NOWAIT; things are starting to fall in
place.
I'll send a v2 around Wed or Thurs (yay national holidays)

> [ Hmmmm - we probably need to be more careful in XFS about what
> types of files we set those flags on.... ]

Yes, FMODE_NOWAIT will be set on directories, as xfs_dir_open calls
xfs_file_open which sets it unconditionally... So I'll have to check
that other filesystems don't do something similar as a bonus, but it
looks like none that set FMODE_NOWAIT on regular files share the file
open path, so at least that shouldn't be too bad.
Happy to fold the xfs fix in as a prerequisite patch of this series or
to let you do it, just tell me.


Thanks,
  

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 709de6d4feb2..44c87fad2714 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@  struct io_uring_sqe {
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
+		__u32		getdents_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -223,6 +224,7 @@  enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -258,6 +260,11 @@  enum io_uring_op {
  */
 #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
 
+/*
+ * sqe->getdents_flags
+ */
+#define IORING_GETDENTS_REWIND	(1U << 0)
+
 /*
  * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
  * command flags for POLL_ADD are stored in sqe->len.
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..cb555bc738bd 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,13 @@  struct io_link {
 	int				flags;
 };
 
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+	int				flags;
+};
+
 int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +298,47 @@  void io_link_cleanup(struct io_kiocb *req)
 	putname(sl->oldpath);
 	putname(sl->newpath);
 }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off) != 0)
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+	gd->flags = READ_ONCE(sqe->getdents_flags);
+	if (gd->flags & ~IORING_GETDENTS_REWIND)
+		return -EINVAL;
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	int ret;
+
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
+
+	if ((gd->flags & IORING_GETDENTS_REWIND)) {
+		ret = vfs_llseek(req->file, 0, SEEK_SET);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = vfs_getdents(req->file, gd->dirent, gd->count);
+out:
+	if (ret < 0) {
+		if (ret == -ERESTARTSYS)
+			ret = -EINTR;
+
+		req_set_fail(req);
+	}
+
+	io_req_set_res(req, ret, 0);
+	return 0;
+}
+
diff --git a/io_uring/fs.h b/io_uring/fs.h
index 0bb5efe3d6bb..f83a6f3a678d 100644
--- a/io_uring/fs.h
+++ b/io_uring/fs.h
@@ -18,3 +18,6 @@  int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
 int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
 void io_link_cleanup(struct io_kiocb *req);
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..8f770c582ab3 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@  const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.needs_file		= 1,
+		.prep			= io_getdents_prep,
+		.issue			= io_getdents,
+	},
 };
 
 
@@ -648,6 +653,9 @@  const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.name			= "GETDENTS",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)