mbox series

[GIT,PULL,v2] timestamp fixes

Message ID	20230921-umgekehrt-buden-a8718451ef7c@brauner
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:8 as permitted sender) client-ip=2620:137:e000::3:8; From: Christian Brauner <brauner@kernel.org> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Jan Kara <jack@suse.cz>, Jeff Layton <jlayton@kernel.org> Subject: [GIT PULL v2] timestamp fixes Date: Thu, 21 Sep 2023 13:20:46 +0200 Message-Id: <20230921-umgekehrt-buden-a8718451ef7c@brauner> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[GIT,PULL,v2] timestamp fixes \| [GIT,PULL,v2] timestamp fixes

Pull-request

git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert

Message

Christian Brauner Sept. 21, 2023, 11:20 a.m. UTC

  Hey Linus,

/* Summary */
Earlier this week we sent a few minor fixes for the multi-grained
timestamp work in [1]. While we were polishing those up after Linus
realized that there might be a nicer way to fix them we received a
regression report in [2] that fine grained timestamps break gnulib tests
and thus possibly other tools.

The kernel will elide fine-grain timestamp updates when no one is
actively querying for them to avoid performance impacts. So a sequence
like write(f1) stat(f2) write(f2) stat(f2) write(f1) stat(f1) may result
in timestamp f1 to be older than the final f2 timestamp even though f1
was last written too but the second write didn't update the timestamp.

Such plotholes can lead to subtle bugs when programs compare timestamps.
For example, the nap() function in [2] will estimate that it needs to
wait one ns on a fine-grain timestamp enabled filesytem between
subsequent calls to observe a timestamp change. But in general we don't
update timestamps with more than one jiffie if we think that no one is
actively querying for fine-grain timestamps to avoid performance
impacts.

While discussing various fixes the decision was to go back to the
drawing board and ultimately to explore a solution that involves only
exposing such fine-grained timestamps to nfs internally and never to
userspace.

As there are multiple solutions discussed the honest thing to do here is
not to fix this up or disable it but to cleanly revert. The general
infrastructure will probably come back but there is no reason to keep
this code in mainline.

The general changes to timestamp handling are valid and a good cleanup
that will stay. The revert is fully bisectable.

Link: [1]: https://lore.kernel.org/all/20230918-hirte-neuzugang-4c2324e7bae3@brauner
Link: [2]: https://lore.kernel.org/all/bf0524debb976627693e12ad23690094e4514303.camel@linuxfromscratch.org

/* Testing */
clang: Ubuntu clang version 15.0.7
gcc: (Ubuntu 12.2.0-3ubuntu1) 12.2.0

All patches are based on v6.5-rc1 and have been sitting in linux-next.
No build failures or warnings were observed. xfstests pass without
regressions.

/* Conflicts */
At the time of creating this PR no merge conflicts were reported from
linux-next and no merge conflicts showed up doing a test-merge with
current mainline.

The following changes since commit ce9ecca0238b140b88f43859b211c9fdfd8e5b70:

  Linux 6.6-rc2 (2023-09-17 14:40:24 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert

for you to fetch changes up to 647aa768281f38cb1002edb3a1f673c3d66a8d81:

  Revert "fs: add infrastructure for multigrain timestamps" (2023-09-20 18:05:31 +0200)

Please consider pulling these changes from the signed v6.6-rc3.vfs.ctime.revert tag.

Thanks!
Christian

----------------------------------------------------------------
v6.6-rc3.vfs.ctime.revert

----------------------------------------------------------------
Christian Brauner (5):
      Revert "tmpfs: add support for multigrain timestamps"
      Revert "xfs: switch to multigrain timestamps"
      Revert "ext4: switch to multigrain timestamps"
      Revert "btrfs: convert to multigrain timestamps"
      Revert "fs: add infrastructure for multigrain timestamps"

 fs/btrfs/file.c                 | 24 ++++++++++--
 fs/btrfs/super.c                |  5 +--
 fs/ext4/super.c                 |  2 +-
 fs/inode.c                      | 82 ++---------------------------------------
 fs/stat.c                       | 41 +--------------------
 fs/xfs/libxfs/xfs_trans_inode.c |  6 +--
 fs/xfs/xfs_iops.c               |  6 +--
 fs/xfs/xfs_super.c              |  2 +-
 include/linux/fs.h              | 46 +----------------------
 mm/shmem.c                      |  2 +-
 10 files changed, 38 insertions(+), 178 deletions(-)

Comments

Linus Torvalds Sept. 21, 2023, 6:24 p.m. UTC | #1

On Thu, 21 Sept 2023 at 04:21, Christian Brauner <brauner@kernel.org> wrote:
>
>   git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert

So for some reason pr-tracker-bot doesn't seem to have reacted to this
pull request, but it's in my tree now.

I *do* have one reaction to all of this: now that you have made
"i_ctime" be something that cannot be accessed directly (and renamed
it to "__i_ctime"), would you mind horribly going all the way, and do
the same for i_atime and i_mtime too?

The reason I ask is that I *really* despise "struct timespec64" as a type.

I despise it inherently, but I despise it even more when you really
use it as another type entirely, and are hiding bits in there.

I despise it because "tv_sec" obviously needs to be 64-bit, but then
"tv_nsec" is this horrible abomination. It's defined as "long", which
is all kinds of crazy. It's bogus and historical.

And it's wasteful.

Now, in the case of i_ctime, you took advantage of that waste by using
one (of the potentially 2..34!) unused bits for that
"fine-granularity" flag.

But even when you do that, there's up to 33 other bits just lying
around, wasting space in a very central data structure.

So it would actually be much better to explode the 'struct timespec64'
into explicit 64-bit seconds field, and an explicit 32-bit field with
two bits reserved. And not even next to each other, because they don't
pack well in general.

So instead of

        struct timespec64       i_atime;
        struct timespec64       i_mtime;
        struct timespec64       __i_ctime;

where that last one needs accessors to access, just make them *all*
have helper accessors, and make it be

        u64  i_atime_sec;
        u64  i_mtime_sec;
        u64  i_ctime_sec;
        u32  i_atime_nsec;
        u32  i_mtime_nsec;
        u32  i_ctime_nsec;

and now that 'struct inode' should shrink by 12 bytes.

Then do this:

  #define inode_time(x) \
       (struct timespec64) { x##_sec, x##_nsec }

and you can now create a timespec64 by just doing

    inode_time(inode->i_atime)

or something like that (to help create those accessor functions).

Now, I agree that 12 bytes in the disaster that is 'struct inode' is
likely a drop in the ocean. We have tons of big things in there (ie
several list_heads, a whole 'struct address_space' etc etc), so it's
only twelve bytes out of hundreds.

But dammit, that 'timespec64' really is ugly, and now that you're
hiding bits in it and it's no longer *really* a 'timespec64', I feel
like it would be better to do it right, and not mis-use a type that
has other semantics, and has other problems.

              Linus

Linus Torvalds Sept. 21, 2023, 7:28 p.m. UTC | #2

On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@kernel.org> wrote:
>
> We have many, many inodes though, and 12 bytes per adds up!

That was my thinking, but honestly, who knows what other alignment
issues might eat up some - or all - of the theoreteical 12 bytes.

It might be, for example, that the inode is already some aligned size,
and that the allocation alignment means that the size wouldn't
*really* shrink at all.

So I just want to make clear that I think the 12 bytes isn't
necessarily there. Maybe you'd get it, maybe it would be hidden by
other things.

My biggest impetus was really that whole abuse of a type that I
already disliked for other reasons.

> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same
> code. I can take a stab at that in the near future though.

Yea, it's likely to be fairly big and invasive.  That was one of the
reasons for my suggested "inode_time()" macro hack: using the macro
argument concatenation is really a hack to "gather" the pieces based
on name, and while it's odd and not a very typical kernel model, I
think doing it that way might allow the conversion to be slightly less
painful.

You'd obviously have to have the same kind of thing for assignment.

Without that kind of name-based hack, you'd have to create all these
random helper functions that just do the same thing over and over for
the different times, which seems really annoying.

> Since we're on the subject...another thing that bothers me with all of
> the timestamp handling is that we don't currently try to mitigate "torn
> reads" across the two different words. It seems like you could fetch a
> tv_sec value and then get a tv_nsec value that represents an entirely
> different timestamp if there are stores between them.

Hmm. I think that's an issue that we have always had in theory, and
have ignored because it's simply not problematic in practice, and
fixing it is *hugely* painful.

I suspect we'd have to use some kind of sequence lock for it (to make
reads be cheap), and while it's _possible_ that having the separate
accessor functions for reading/writing those times might help things
out, I suspect the reading/writing happens for the different times (ie
atime/mtime/ctime) together often enough that you might want to have
the locking done at an outer level, and _not_ do it at the accessor
level.

So I suspect this is a completely separate issue (ie even an accessor
doesn't make the "hugely painful" go away). And probably not worth
worrying about *unless* somebody decides that they really really care
about the race.

That said, one thing that *could* help is if people decide that the
right format for inode times is to just have one 64-bit word that has
"sufficient resolution". That's what we did for "kernel time", ie
"ktime_t" is a 64-bit nanosecond count, and by being just a single
value, it avoids not just the horrible padding with 'struct
timespec64', it is also dense _and_ can be accessed as one atomic
value.

Sadly, that "sufficient resolution" couldn't be nanoseconds, because
64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
time, because 2**63 nanoseconds is 292 years, so it moved the "year
2038" problem to "year 2262".

And that's ok when we're talking about times that are kernel running
times and we haev a couple of centuries to say "ok, we'll need to make
it be a bigger type", but when you save the values to disk, things are
different. I suspect filesystem people are *not* willing to deal with
a "year 2262" issue.

But if we were to say that "a tenth of microsecond resolution is
sufficient for inode timestamps", then suddenly 64 bits is *enormous*.
So we could do a

    // tenth of a microseconds since Jan 1, 1970
    typedef s64 fstime_t;

and have a nice dense timestamp format with reasonable - but not
nanosecond - accuracy. Now that 292 year range has become 29,247
years, and filesystem people *might* find the "year-31k" problem
acceptable.

I happen to think that "100ns timestamp resolution on files is
sufficient" is a very reasonable statement, but I suspect that we'll
still find lots of people who say "that's completely unacceptable"
both to that resolution, and to the 31k-year problem.

But wouldn't it be nice to have just one single "fstime_t" for file
timestamps, the same way we have "ktime_t" for CPU timestamps?

Then we'd save even more space in the 'struct inode'....

                  Linus

pr-tracker-bot@kernel.org Sept. 21, 2023, 8:10 p.m. UTC | #3

The pull request you sent on Thu, 21 Sep 2023 13:20:46 +0200:

> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b5cbe7c00aa0f7a81ec40c007f81a3e9c84581e3

Thank you!

Jeff Layton Sept. 21, 2023, 9:57 p.m. UTC | #4

On Thu, 2023-09-21 at 12:46 -0700, Linus Torvalds wrote:
> On Thu, 21 Sept 2023 at 12:28, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > 
> > And that's ok when we're talking about times that are kernel running
> > times and we haev a couple of centuries to say "ok, we'll need to make
> > it be a bigger type",
> 
> Note that the "couple of centuries" here is mostly the machine uptime,
> not necessarily "we'll need to change the time in the year 2292".
> 

Right. On-disk formats are really a different matter anyway, so that
value is only relevant within a single running instance.

> Although we do also have "ktime_get_real()" which is encoding the
> whole "nanoseconds since 1970". That *will* break in 2292.
>

Still pretty much SEP, unless we all end up as cyborgs after this.

> Anyway, regardless, I am *not* suggesting that ktime_t would be useful
> for filesystems, because of this issue.
> 
> I *do* suspect that we might consider a "tenth of a microsecond", though.
> 
> Resolution-wise, it's pretty much in the "system call time" order of
> magnitude, and if we have Linux filesystems around in the year-31k,
> I'll happily consider it to be a SEP thing at that point ("somebody
> else's problem").
> 

FWIW, I'm reworking the multigrain ctime patches for internal consumers.
As part of that, when we present multigrain timestamps to userland via
statx, we'll truncate them at a granularity of (NSEC_PER_SEC / HZ).

So, we could easily do that today since we're already going to be
truncating off more than that for external uses. Having a single word to
deal with would sure be simpler too, particularly since we're using
atomic operations here.

I'll have to think about it. The first step is to get all of the
timestamp handling wrappers in place anyway.

Cheers,

David Sterba Sept. 22, 2023, 10:19 a.m. UTC | #5

On Thu, Sep 21, 2023 at 12:28:13PM -0700, Linus Torvalds wrote:
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@kernel.org> wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
> 
> That was my thinking, but honestly, who knows what other alignment
> issues might eat up some - or all - of the theoreteical 12 bytes.
> 
> It might be, for example, that the inode is already some aligned size,
> and that the allocation alignment means that the size wouldn't
> *really* shrink at all.
> 
> So I just want to make clear that I think the 12 bytes isn't
> necessarily there. Maybe you'd get it, maybe it would be hidden by
> other things.

I think all filesystem developers appreciate when struct inode shrinks,
it's usually embedded with additional data and the size grows. I'm on a
mission to squeeze btrfs_inode under 1024 so it fits better to the slab
pages and currently it's about 1100 bytes. 1024 is within reach but it
gets harder to find potential space savings.

Christian Brauner Sept. 22, 2023, 12:16 p.m. UTC | #6

On Thu, Sep 21, 2023 at 11:24:43AM -0700, Linus Torvalds wrote:
> On Thu, 21 Sept 2023 at 04:21, Christian Brauner <brauner@kernel.org> wrote:
> >
> >   git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert
> 
> So for some reason pr-tracker-bot doesn't seem to have reacted to this
> pull request, but it's in my tree now.

I think vger was somehow backed up again. We've been hit kinda hard by
this recently. I've asked Konstantin whether he'd move us to the new
kernel.org managed mailing infra (which will happen eventually anyway)
so hopefully we'll have less of these delays in the very near future...

Christian Brauner Sept. 22, 2023, 12:24 p.m. UTC | #7

> We have many, many inodes though, and 12 bytes per adds up!
> 
> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same

Hm, I really think that isn't an issue. Let the series be big. Half of
the meaningful fs patches explode anyway as soon as you change any
i_op/f_op anyway. If it meaningfully cleans up something then it's worth
it.

Hm, we really need automated testing on vfs.git soon so we can run
xfstests for all major filesystems automatically instead of me having to
kick this off manually every time.

Christian Brauner Sept. 22, 2023, 12:28 p.m. UTC | #8

> I'll have to think about it. The first step is to get all of the

I think it'd be best if we start off with converting the other times in
struct inode to accessor and leave the questions whether timestamps-
until-2292 are enough to solve for later. I don't think the torn
timestamp reads is all that of a pressing issue and mixing both things
up might just stall what is otherwise already a worthy cleanup.

Amir Goldstein Sept. 23, 2023, 6:36 a.m. UTC | #9

On Sat, Sep 23, 2023 at 3:43 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@kernel.org> wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
>
> That was my thinking, but honestly, who knows what other alignment
> issues might eat up some - or all - of the theoreteical 12 bytes.
>
> It might be, for example, that the inode is already some aligned size,
> and that the allocation alignment means that the size wouldn't
> *really* shrink at all.
>
> So I just want to make clear that I think the 12 bytes isn't
> necessarily there. Maybe you'd get it, maybe it would be hidden by
> other things.
>
> My biggest impetus was really that whole abuse of a type that I
> already disliked for other reasons.
>
> > I'm on board with the idea, but...that's likely to be as big a patch
> > series as the ctime overhaul was. In fact, it'll touch a lot of the same
> > code. I can take a stab at that in the near future though.
>
> Yea, it's likely to be fairly big and invasive.  That was one of the
> reasons for my suggested "inode_time()" macro hack: using the macro
> argument concatenation is really a hack to "gather" the pieces based
> on name, and while it's odd and not a very typical kernel model, I
> think doing it that way might allow the conversion to be slightly less
> painful.
>
> You'd obviously have to have the same kind of thing for assignment.
>
> Without that kind of name-based hack, you'd have to create all these
> random helper functions that just do the same thing over and over for
> the different times, which seems really annoying.
>
> > Since we're on the subject...another thing that bothers me with all of
> > the timestamp handling is that we don't currently try to mitigate "torn
> > reads" across the two different words. It seems like you could fetch a
> > tv_sec value and then get a tv_nsec value that represents an entirely
> > different timestamp if there are stores between them.
>
> Hmm. I think that's an issue that we have always had in theory, and
> have ignored because it's simply not problematic in practice, and
> fixing it is *hugely* painful.
>
> I suspect we'd have to use some kind of sequence lock for it (to make
> reads be cheap), and while it's _possible_ that having the separate
> accessor functions for reading/writing those times might help things
> out, I suspect the reading/writing happens for the different times (ie
> atime/mtime/ctime) together often enough that you might want to have
> the locking done at an outer level, and _not_ do it at the accessor
> level.
>
> So I suspect this is a completely separate issue (ie even an accessor
> doesn't make the "hugely painful" go away). And probably not worth
> worrying about *unless* somebody decides that they really really care
> about the race.
>
> That said, one thing that *could* help is if people decide that the
> right format for inode times is to just have one 64-bit word that has
> "sufficient resolution". That's what we did for "kernel time", ie
> "ktime_t" is a 64-bit nanosecond count, and by being just a single
> value, it avoids not just the horrible padding with 'struct
> timespec64', it is also dense _and_ can be accessed as one atomic
> value.

Just pointing out that xfs has already changed it's on-disk timestamp
format to this model (a.k.a bigtime), but the in-core inode still uses
the timespec64 of course.
The vfs can inprise from this model.

>
> Sadly, that "sufficient resolution" couldn't be nanoseconds, because
> 64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
> time, because 2**63 nanoseconds is 292 years, so it moved the "year
> 2038" problem to "year 2262".

Note that xfs_bigtime_to_unix(XFS_BIGTIME_TIME_MAX) is in year
2486, not year 2262, because there was no need to use the 64bit to
go backwards to year 1678.

>
> And that's ok when we're talking about times that are kernel running
> times and we haev a couple of centuries to say "ok, we'll need to make
> it be a bigger type", but when you save the values to disk, things are
> different. I suspect filesystem people are *not* willing to deal with
> a "year 2262" issue.
>

Apparently, they are willing to handle the "year 2486" issue ;)

> But if we were to say that "a tenth of microsecond resolution is
> sufficient for inode timestamps", then suddenly 64 bits is *enormous*.
> So we could do a
>
>     // tenth of a microseconds since Jan 1, 1970
>     typedef s64 fstime_t;
>
> and have a nice dense timestamp format with reasonable - but not
> nanosecond - accuracy. Now that 292 year range has become 29,247
> years, and filesystem people *might* find the "year-31k" problem
> acceptable.
>
> I happen to think that "100ns timestamp resolution on files is
> sufficient" is a very reasonable statement, but I suspect that we'll
> still find lots of people who say "that's completely unacceptable"
> both to that resolution, and to the 31k-year problem.
>

I am guessing that you are aware of the Windows/SMB FILETIME
standard which is 64bit in 100ns units (since 1601).
So the 31k-year "problem" is very widespread already.

But the resolution change is counter to the purpose of multigrain
timestamps - if two syscalls updated the same or two different inodes
within a 100ns tick, apparently, there are some workloads that
care to know about it and fs needs to store this information persistently.

Thanks,
Amir.

Linus Torvalds Sept. 23, 2023, 5:48 p.m. UTC | #10

On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <amir73il@gmail.com> wrote:
>
> Apparently, they are willing to handle the "year 2486" issue ;)

Well, we could certainly do the same at the VFS layer.

But I suspect 10ns resolution is entirely overkill, since on a lot of
platforms you don't even have timers with that resolution.

I feel like 100ns is a much more reasonable resolution, and is quite
close to a single system call (think "one thousand cycles at 10GHz").

> But the resolution change is counter to the purpose of multigrain
> timestamps - if two syscalls updated the same or two different inodes
> within a 100ns tick, apparently, there are some workloads that
> care to know about it and fs needs to store this information persistently.

Those workloads are broken garbage, and we should *not* use that kind
of sh*t to decide on VFS internals.

Honestly, if the main reason for the multigrain resolution is
something like that, I think we should forget about MG *entirely*.
Somebody needs to be told to get their act together.

We have *never* guaranteed nanosecond resolution on timestamps, and I
think we should put our foot down and say that we never will.

Partly because we have platforms where that kind of timer resolution
just does not exist.

Partly because it's stupid to expect that kind of resolution anyway.

And partly because any load that assumes that kind of resolution is
already broken.

End result: we should ABSOLUTELY NOT have as a target to support some
insane resolution.

100ns resolution for file access times is - and I'll happily go down
in history for saying this - enough for anybody.

If you need finer resolution than that, you'd better do it yourself in
user space.

And no, this is not a "but some day we'll have terahertz CPU's and
100ns is an eternity". Moore's law is dead, we're not going to see
terahertz CPUs, and people who say "but quantum" have bought into a
technological fairytale.

100ns is plenty, and has the advantage of having a very safe range.

That said, we don't have to do powers-of-ten. In fact, in many ways,
it would probably be a good idea to think of the fractional seconds in
powers of two. That tends to make it cheaper to do conversions,
without having to do a full 64-bit divide (a constant divide turns
into a fancy multiply, but it's still painful on 32-bit
architectures).

So, for example, we could easily make the format be a fixed-point
format with "sign bit, 38 bit seconds, 25 bit fractional seconds",
which gives us about 30ns resolution, and a range of almost 9000
years. Which is nice, in how it covers all of written history and all
four-digit years (we'd keep the 1970 base).

And 30ns resolution really *is* pretty much the limit of a single
system call. I could *wish* we had system calls that fast, or CPU's
that fast. Not the case right now, and sadly doesn't seem to be the
case in the forseeable future - if ever - either. It would be a really
good problem to have.

And the nice thing about that would be that conversion to timespec64
would be fairly straightforward:

   struct timespec64 to_timespec(fstime_t fstime)
   {
        struct timespec64 res;
        unsigned int frac;

        frac = fstime & 0x1ffffffu;
        res.tv_sec = fstime >> 25;
        res.tv_nsec = frac * 1000000000ull >> 25;
        return res;
   }

   fstime_t to_fstime(struct timespec64 a)
   {
        fstime_t sec = (fstime_t) a.tv_sec << 25;
        unsigned frac = a.tv_nsec;

        frac = ((unsigned long long) a.tv_nsec << 25) / 1000000000ull;
        return sec | frac;
   }

and both of those generate good code (that large divide by a constant
in to_fstime() is not great, but the compiler can turn it into a
multiply).

The above could be improved upon (nicer rounding and overflow
handling, and a few modifications to generate even nicer code), but
it's not horrendous as-is. On x86-64, to_timespec becomes a very
reasonable

        movq    %rdi, %rax
        andl    $33554431, %edi
        imulq   $1000000000, %rdi, %rdx
        sarq    $25, %rax
        shrq    $25, %rdx

and to some degree that's the critical function (that code would show
up in 'stat()').

Of course, I might have screwed up the above conversion functions,
they are untested garbage, but they look close enough to being in the
right ballpark.

Anyway, we really need to push back at any crazies who say "I want
nanosecond resolution, because I'm special and my mother said so".

                Linus

Theodore Ts'o Sept. 23, 2023, 7:30 p.m. UTC | #11

On Sat, Sep 23, 2023 at 10:48:51AM -0700, Linus Torvalds wrote:
> 
> I feel like 100ns is a much more reasonable resolution, and is quite
> close to a single system call (think "one thousand cycles at 10GHz").

FWIW, UUID's (which originally came from Apollo Domain/OS in the
1980's, before getting adopted by OSF/DCE, and then by Linux and
Microsoft) use a 100ns granularity.  And the smart folks at Apollo
figured this out some 4 decades ago, and *no* they didn't use units of
a single nanosecond.  :-)

100ns granularity is also what what ext4 uses for our on-disk format
--- 2**30 just enough to cover 100ns granularity (with only 7% of
wasted number space), and those two bits are enough for us to encode
timestamps into 2446 using a 64-bit timestamp (and what we do past
2446 is pretty much something I'm happy to let someone else deal with,
as I expect I'll be long dead by then.)

(And if someone does happen to event some kind of life-extension
technology, I'm happy to fix it up... later.  :-)

> That said, we don't have to do powers-of-ten. In fact, in many ways,
> it would probably be a good idea to think of the fractional seconds in
> powers of two. That tends to make it cheaper to do conversions,
> without having to do a full 64-bit divide (a constant divide turns
> into a fancy multiply, but it's still painful on 32-bit
> architectures).

It depends on what conversion we need to do.  If we're converting to
userspace's timespec64 data structure, which is denominated in
nanosecods, it's actually much easier to use decimal 100ns units:

#define EXT4_EPOCH_BITS 2
#define EXT4_EPOCH_MASK ((1 << EXT4_EPOCH_BITS) - 1)
#define EXT4_NSEC_MASK  (~0UL << EXT4_EPOCH_BITS)

static inline __le32 ext4_encode_extra_time(struct timespec64 *time)
{
	u32 extra =((time->tv_sec - (s32)time->tv_sec) >> 32) & EXT4_EPOCH_MASK;
	return cpu_to_le32(extra | (time->tv_nsec << EXT4_EPOCH_BITS));
}

static inline void ext4_decode_extra_time(struct timespec64 *time,
					  __le32 extra)
{
	if (unlikely(extra & cpu_to_le32(EXT4_EPOCH_MASK)))
		time->tv_sec += (u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK) << 32;
	time->tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> EXT4_EPOCH_BITS;
}

> Of course, I might have screwed up the above conversion functions,
> they are untested garbage, but they look close enough to being in the
> right ballpark.

We actually have kunit tests for ext4_encode_extra_time() and
ext4_decode_extra_time(), mainly because people *have* screwed it up
when making architecture-specific optimizations or when making global
sweeps of VFS code.  :-)

     	    		     	    	      	  - Ted

Linus Torvalds Sept. 23, 2023, 8:03 p.m. UTC | #12

On Sat, 23 Sept 2023 at 12:30, Theodore Ts'o <tytso@mit.edu> wrote:
>
> It depends on what conversion we need to do.  If we're converting to
> userspace's timespec64 data structure, which is denominated in
> nanosecods, it's actually much easier to use decimal 100ns units:

Actually, your data format seems to be a mix of "decimal nanoseconds"
and then a power-of-two seconds (ie bit shift).

Except it looks like ext4 actually does full nanosecond resolution (30
bits for nanoseconds, 34 bits for seconds). Thus the "only a couple of
hundred years of range".

And yes, that's probably close to optimal. It makes it harder to do
*math* on those dates (because you have seconds and 100ns as separate
fields), but for file timestamps that's likely not a real issue.

It was for 'ktime_t()', where with timers etc the whole "subtract and
add times" happens *all* the time, but for file timestamps you
basically have "set time" together with possibly comparing them (and
you can do comparisons without even splitting the fields if you lay
things out reasonably - which you ext4 doesn't seem to have done).

So yeah, I think that would be a fine 'fstime_t' for the kernel.

Except we'd do it without the EXT4_EPOCH_MASK conditionals, and I
think it would be better to have a bigger range for seconds. If you
give the seconds field three extra bits, you're comfortable in the
"thousands of years", and you still have 27 bits that can encode a
decimal "100ns".

It means that when you convert a fstime_t to timespec64 at stat()
time, you'd have to do a 32-bit "multiply by 100" to get the actual
nanosecond field, but that's cheap everywhere (and obviously the
shift-and-masks to get the separate fields out of the 64-bit value).

               Linus

Amir Goldstein Sept. 23, 2023, 9:29 p.m. UTC | #13

On Sat, Sep 23, 2023 at 8:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Apparently, they are willing to handle the "year 2486" issue ;)
>
> Well, we could certainly do the same at the VFS layer.
>
> But I suspect 10ns resolution is entirely overkill, since on a lot of
> platforms you don't even have timers with that resolution.
>
> I feel like 100ns is a much more reasonable resolution, and is quite
> close to a single system call (think "one thousand cycles at 10GHz").
>

You are right. 100ns is enough resolution for fs timestamps, but:

   1. We cannot truncate existing timestamps on existing files
that are stored in 1ns format even if the 100ns remainder is garbage,
applications depend on tv_nsec not changing, so if vfs truncates to
100ns, filesystems will need to carry the remainder anyway.

   2. The whole idea behind multigrain time is that the 100ns remainder
space is available in on-disk timestamps formats and is not going away,
so better not let it go to waste on garbage and use is for "change cookie".

> > But the resolution change is counter to the purpose of multigrain
> > timestamps - if two syscalls updated the same or two different inodes
> > within a 100ns tick, apparently, there are some workloads that
> > care to know about it and fs needs to store this information persistently.
>
> Those workloads are broken garbage, and we should *not* use that kind
> of sh*t to decide on VFS internals.
>

Sorry, I phrased it completely wrong.
The workloads don't expect 1ns resolution.
The workloads just compare timestamps of objects and expect some sane
not-before ordering rules.
If user visible timestamps are truncated to 100ns all is good.

> End result: we should ABSOLUTELY NOT have as a target to support some
> insane resolution.

This is not the target.

I think the target is explained well in this new v8 revision [1]
of the patch set which does exactly what you promote -
it hides the insane resolution in fs timestamps from users.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjGofKT2LcM_YmoMYsH42ETpB5kxQkh+21nCbYc=w1nEg@mail.gmail.com/T/#mbc87e67faf71cc58c6424335333935bf7740d49e

Theodore Ts'o Sept. 23, 2023, 10:07 p.m. UTC | #14

On Sat, Sep 23, 2023 at 01:03:56PM -0700, Linus Torvalds wrote:
> 
> Except it looks like ext4 actually does full nanosecond resolution (30
> bits for nanoseconds, 34 bits for seconds). Thus the "only a couple of
> hundred years of range".

Hmm, yeah, sorry, I misremembered.  We did talk about possibly going
with 100ns, but if I recall correctly, I think there was a desire that
an arbitrary timespec64 should be encodable into an on-disk timestamp,
and then back again, and hundreds of years of range was considered
Good Enough (tm).

> Except we'd do it without the EXT4_EPOCH_MASK conditionals, and I
> think it would be better to have a bigger range for seconds. If you
> give the seconds field three extra bits, you're comfortable in the
> "thousands of years", and you still have 27 bits that can encode a
> decimal "100ns".

I might be screweing my math, but I believe 24 bits should be enough
to code 10,000,000 units of 100ns (it's enough for 16,777,216), which
should be sufficient.  What am I missing?

As far as how many seconds are needed, that's an area where people of
good will can disagree.  Given that I don't really believe a machine
is going to be up for decades before we will need to reboot and update
the kernel to address zero days, and LTS kernels are going to be
supported for two years going forward, if what we're talking about is
the in-memory time type, my personal opinion is that hundreds of years
is plenty, since it's not hard to change the encoding later.

Cheers,

						- Ted

Linus Torvalds Sept. 23, 2023, 11:31 p.m. UTC | #15

On Sat, 23 Sept 2023 at 15:08, Theodore Ts'o <tytso@mit.edu> wrote:
>
> I might be screweing my math, but I believe 24 bits should be enough
> to code 10,000,000 units of 100ns (it's enough for 16,777,216), which
> should be sufficient.  What am I missing?

You're missing me just being stupid, having a brain-fart, and doing
the math for 10ns despite telling everybody that 100ns should be the
reasonable thing.

Duh.

             Linus

Amir Goldstein Sept. 24, 2023, 8:34 a.m. UTC | #16

On Thu, Sep 21, 2023 at 10:02 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2023-09-21 at 11:24 -0700, Linus Torvalds wrote:
> > On Thu, 21 Sept 2023 at 04:21, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > >   git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert
> >
> > So for some reason pr-tracker-bot doesn't seem to have reacted to this
> > pull request, but it's in my tree now.
> >
> > I *do* have one reaction to all of this: now that you have made
> > "i_ctime" be something that cannot be accessed directly (and renamed
> > it to "__i_ctime"), would you mind horribly going all the way, and do
> > the same for i_atime and i_mtime too?
> >
> > The reason I ask is that I *really* despise "struct timespec64" as a type.
> >
> > I despise it inherently, but I despise it even more when you really
> > use it as another type entirely, and are hiding bits in there.
> >
> > I despise it because "tv_sec" obviously needs to be 64-bit, but then
> > "tv_nsec" is this horrible abomination. It's defined as "long", which
> > is all kinds of crazy. It's bogus and historical.
> >
> > And it's wasteful.
> >
> > Now, in the case of i_ctime, you took advantage of that waste by using
> > one (of the potentially 2..34!) unused bits for that
> > "fine-granularity" flag.
> >
> > But even when you do that, there's up to 33 other bits just lying
> > around, wasting space in a very central data structure.
> >
> > So it would actually be much better to explode the 'struct timespec64'
> > into explicit 64-bit seconds field, and an explicit 32-bit field with
> > two bits reserved. And not even next to each other, because they don't
> > pack well in general.
> >
> > So instead of
> >
> >         struct timespec64       i_atime;
> >         struct timespec64       i_mtime;
> >         struct timespec64       __i_ctime;
> >
> > where that last one needs accessors to access, just make them *all*
> > have helper accessors, and make it be
> >
> >         u64  i_atime_sec;
> >         u64  i_mtime_sec;
> >         u64  i_ctime_sec;
> >         u32  i_atime_nsec;
> >         u32  i_mtime_nsec;
> >         u32  i_ctime_nsec;
> >
> > and now that 'struct inode' should shrink by 12 bytes.
> >
>
> I like it.
>
> > Then do this:
> >
> >   #define inode_time(x) \
> >        (struct timespec64) { x##_sec, x##_nsec }
> >
> > and you can now create a timespec64 by just doing
> >
> >     inode_time(inode->i_atime)
> >
> > or something like that (to help create those accessor functions).
> >
> > Now, I agree that 12 bytes in the disaster that is 'struct inode' is
> > likely a drop in the ocean. We have tons of big things in there (ie
> > several list_heads, a whole 'struct address_space' etc etc), so it's
> > only twelve bytes out of hundreds.
> >
> > But dammit, that 'timespec64' really is ugly, and now that you're
> > hiding bits in it and it's no longer *really* a 'timespec64', I feel
> > like it would be better to do it right, and not mis-use a type that
> > has other semantics, and has other problems.
> >
>
>
> We have many, many inodes though, and 12 bytes per adds up!
>
> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same
> code. I can take a stab at that in the near future though.
>
> Since we're on the subject...another thing that bothers me with all of
> the timestamp handling is that we don't currently try to mitigate "torn
> reads" across the two different words. It seems like you could fetch a
> tv_sec value and then get a tv_nsec value that represents an entirely
> different timestamp if there are stores between them.
>
> Should we be concerned about this? I suppose we could do something with
> a seqlock, though I'd really want to avoid locking on the write side.

As luck would have it, if my calculations are correct, on x86-64 and with
CONFIG_FS_POSIX_ACL=y, CONFIG_SECURITY=y (as they are on
distro kernels), __i_ctime is exactly on split cache lines and maybe even
worse (?), __i_ctime.tv_nsec and the QUERIED bit are on the same
cache line with i_lock :-/

If we reorder the inode timestamps with i_size, we improve the situation
for this specific and very common configuration. Maybe.

Thanks,
Amir.

Christian Brauner Sept. 24, 2023, 10:15 a.m. UTC | #17

> As luck would have it, if my calculations are correct, on x86-64 and with

I hope you're aware that you can use pahole to get struct layouts and
that you don't have to manually calculate this...

> CONFIG_FS_POSIX_ACL=y, CONFIG_SECURITY=y (as they are on
> distro kernels), __i_ctime is exactly on split cache lines and maybe even

(Make sure that your kernel doesn't use randomize_layout...)

tv_nsec has always been on a split cacheline. We've optimized struct
file a little bit in the same way before as there were actual
regressions in some workloads

I can put it into one of the vfs.git perf branches that are tested by
LKP to see if there's any actual performance changes.

5.15:
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int          i_ino;                                                /*    64     8 */
        union {
                const unsigned int i_nlink;                                              /*    72     4 */
                unsigned int       __i_nlink;                                            /*    72     4 */
        };                                                                               /*    72     4 */
        /* typedef dev_t -> __kernel_dev_t -> u32 -> __u32 */ unsigned int               i_rdev; /*    76     4 */
        /* typedef loff_t -> __kernel_loff_t */ long long int              i_size;       /*    80     8 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */ long long int      tv_sec;               /*    88     8 */
                long int           tv_nsec;                                              /*    96     8 */
        }i_atime; /*    88    16 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */ long long int      tv_sec;               /*   104     8 */
                long int           tv_nsec;                                              /*   112     8 */
        }i_mtime; /*   104    16 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */ long long int      tv_sec;               /*   120     8 */
                /* --- cacheline 2 boundary (128 bytes) --- */
                long int           tv_nsec;                                              /*   128     8 */
        }i_ctime; /*   120    16 */
        /* typedef spinlock_t */ struct spinlock {

6.6:
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int          i_ino;                                                /*    64     8 */
        union {
                const unsigned int i_nlink;                                              /*    72     4 */
                unsigned int       __i_nlink;                                            /*    72     4 */
        };                                                                               /*    72     4 */
        /* typedef dev_t -> __kernel_dev_t -> u32 -> __u32 */ unsigned int               i_rdev; /*    76     4 */
        /* typedef loff_t -> __kernel_loff_t */ long long int              i_size;       /*    80     8 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */ long long int      tv_sec;               /*    88     8 */
                long int           tv_nsec;                                              /*    96     8 */
        }i_atime; /*    88    16 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */ long long int      tv_sec;               /*   104     8 */
                long int           tv_nsec;                                              /*   112     8 */
        }i_mtime; /*   104    16 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */ long long int      tv_sec;               /*   120     8 */
                /* --- cacheline 2 boundary (128 bytes) --- */
                long int           tv_nsec;                                              /*   128     8 */
        }__i_ctime; /*   120    16 */

Christian Brauner Sept. 24, 2023, 10:26 a.m. UTC | #18

> > Those workloads are broken garbage, and we should *not* use that kind
> > of sh*t to decide on VFS internals.
> >
> 
> Sorry, I phrased it completely wrong.

Thanks for clearing this up. I had just formulated my own reply but I'll
happily delete it. :)

> The workloads don't expect 1ns resolution.

Yes, they don't. In the revert explanation I just used that number to
illustrate the general ordering problem. The workload that surfaced the
issue is just plain weird of course but it points to a more general
ordering problem.

> The workloads just compare timestamps of objects and expect some sane
> not-before ordering rules.

Yes.

> If user visible timestamps are truncated to 100ns all is good.

Yes.

Jeff Layton Sept. 25, 2023, 11:22 a.m. UTC | #19

On Sat, 2023-09-23 at 10:48 -0700, Linus Torvalds wrote:
> On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <amir73il@gmail.com> wrote:
> > 
> > Apparently, they are willing to handle the "year 2486" issue ;)
> 
> Well, we could certainly do the same at the VFS layer.
> 
> But I suspect 10ns resolution is entirely overkill, since on a lot of
> platforms you don't even have timers with that resolution.
> 
> I feel like 100ns is a much more reasonable resolution, and is quite
> close to a single system call (think "one thousand cycles at 10GHz").
> 
> > But the resolution change is counter to the purpose of multigrain
> > timestamps - if two syscalls updated the same or two different inodes
> > within a 100ns tick, apparently, there are some workloads that
> > care to know about it and fs needs to store this information persistently.
> 
> Those workloads are broken garbage, and we should *not* use that kind
> of sh*t to decide on VFS internals.
> 
> Honestly, if the main reason for the multigrain resolution is
> something like that, I think we should forget about MG *entirely*.
> Somebody needs to be told to get their act together.
> 

As I noted in the other thread, the primary reason for this was to fix
XFS's change cookie without having to rev the on-disk format. If we
could also present fine-grained timestamps to userland and nfsd, then
that would also fix a lot of cache-coherency problems with NFSv3, and
may also help some workloads which depend on comparing timestamps
between files. That'd be a wonderful bonus, but I'm not going to lose
too much sleep if we can't make that work.


> We have *never* guaranteed nanosecond resolution on timestamps, and I
> think we should put our foot down and say that we never will.
>
> Partly because we have platforms where that kind of timer resolution
> just does not exist.
> 
> Partly because it's stupid to expect that kind of resolution anyway.
> 
> And partly because any load that assumes that kind of resolution is
> already broken.
>
> End result: we should ABSOLUTELY NOT have as a target to support some
> insane resolution.
> 
> 100ns resolution for file access times is - and I'll happily go down
> in history for saying this - enough for anybody.
> 
> If you need finer resolution than that, you'd better do it yourself in
> user space.
> 
> And no, this is not a "but some day we'll have terahertz CPU's and
> 100ns is an eternity". Moore's law is dead, we're not going to see
> terahertz CPUs, and people who say "but quantum" have bought into a
> technological fairytale.
> 
> 100ns is plenty, and has the advantage of having a very safe range.
> 

The catch here is that we have at least some testcases that do things
like set specific values in the mtime and atime, and then test that the
same value is retrievable.

Are we OK with breaking those? If we can always say that the stored
resolution is X and that even values that are explicitly set get
truncated then the v8 set I sent on Friday may be ok.

Of course, that set truncates the values at jiffies granularity (~4ms on
my box). That's well above 100ns, so it's possible that's too coarse for
us to handwave this problem away.


> That said, we don't have to do powers-of-ten. In fact, in many ways,
> it would probably be a good idea to think of the fractional seconds in
> powers of two. That tends to make it cheaper to do conversions,
> without having to do a full 64-bit divide (a constant divide turns
> into a fancy multiply, but it's still painful on 32-bit
> architectures).
> 
> So, for example, we could easily make the format be a fixed-point
> format with "sign bit, 38 bit seconds, 25 bit fractional seconds",
> which gives us about 30ns resolution, and a range of almost 9000
> years. Which is nice, in how it covers all of written history and all
> four-digit years (we'd keep the 1970 base).
> 
> And 30ns resolution really *is* pretty much the limit of a single
> system call. I could *wish* we had system calls that fast, or CPU's
> that fast. Not the case right now, and sadly doesn't seem to be the
> case in the forseeable future - if ever - either. It would be a really
> good problem to have.
> 
> And the nice thing about that would be that conversion to timespec64
> would be fairly straightforward:
> 
>    struct timespec64 to_timespec(fstime_t fstime)
>    {
>         struct timespec64 res;
>         unsigned int frac;
> 
>         frac = fstime & 0x1ffffffu;
>         res.tv_sec = fstime >> 25;
>         res.tv_nsec = frac * 1000000000ull >> 25;
>         return res;
>    }
> 
>    fstime_t to_fstime(struct timespec64 a)
>    {
>         fstime_t sec = (fstime_t) a.tv_sec << 25;
>         unsigned frac = a.tv_nsec;
> 
>         frac = ((unsigned long long) a.tv_nsec << 25) / 1000000000ull;
>         return sec | frac;
>    }
> 
> and both of those generate good code (that large divide by a constant
> in to_fstime() is not great, but the compiler can turn it into a
> multiply).
> 
> The above could be improved upon (nicer rounding and overflow
> handling, and a few modifications to generate even nicer code), but
> it's not horrendous as-is. On x86-64, to_timespec becomes a very
> reasonable
> 
>         movq    %rdi, %rax
>         andl    $33554431, %edi
>         imulq   $1000000000, %rdi, %rdx
>         sarq    $25, %rax
>         shrq    $25, %rdx
> 
> and to some degree that's the critical function (that code would show
> up in 'stat()').
> 
> Of course, I might have screwed up the above conversion functions,
> they are untested garbage, but they look close enough to being in the
> right ballpark.
> 
> Anyway, we really need to push back at any crazies who say "I want
> nanosecond resolution, because I'm special and my mother said so".
> 


Yeah if we we're going to establish a floor granularity for timestamps
above 1ns, then making it a power-of-two factor would probably be a good
thing. These calculations are done a _lot_ so we really do want them to
be efficient.

Linus Torvalds Sept. 25, 2023, 4:02 p.m. UTC | #20

On Mon, 25 Sept 2023 at 04:23, Jeff Layton <jlayton@kernel.org> wrote:
>
> The catch here is that we have at least some testcases that do things
> like set specific values in the mtime and atime, and then test that the
> same value is retrievable.

Yeah, I'm sure that happens. But as you say, we already have
per-filesystem granularity issues that means that there is some non-ns
granularity that those tests have to deal with.

Unless they currently just work on one or two filesystems, of course.

> Of course, that set truncates the values at jiffies granularity (~4ms on
> my box). That's well above 100ns, so it's possible that's too coarse for
> us to handwave this problem away.

Note that depending or enforcing some kind of jiffies granularity is
*particularly* bad, because it's basically a random value.

It will depend on architecture and on configuration. On some
architectures it's a fixed value (some have it at 100, which is, I
think, the original x86 case), on others it's "configurable", but not
really (ie alpha is "configurable" in our Kconfig, but the _hardware_
typically has a fixed clock tick at 1024 Hz, but then there are
platforms that are different, and then there's Qemu that likes a lower
frequency to avoid overhead etc etc).

And then we have the "modern default", which is to ask the user at
config time if they want a 100 / 250 / 300 / 1000 HZ value, and the
actual hw clock tick may be much more dynamic than that.

Anyway, what I'm saying is just that we should *not* limit granularity
to anything that has to do with jiffies. Yes, that's still a real
thing in that it's a "cheap read of the current time", but it should
never be seen as any kind of vfs granularity.

And yes, HZ will be in the "roughly 100-1000" range, so if we're
talking granularities that are microseconds or finer, then at most
you'll have rounding issues - and since any HZ rounding issues should
only be for the cases where we set the time to "now" - rounding
shouldn't be an issue anyway, since it's not a time that is specified
by user space.

End result: try to avoid anything to do with HZ in filesystem code,
unless it's literally about jiffies (which should typically be mostly
about any timeouts a filesystem may have, not about timestamps
themselves).

               Linus