x86: enable Data Operand Independent Timing Mode

Message ID 20230125012801.362496-1-ebiggers@kernel.org
State New
Headers
Series x86: enable Data Operand Independent Timing Mode |

Commit Message

Eric Biggers Jan. 25, 2023, 1:28 a.m. UTC
  From: Eric Biggers <ebiggers@google.com>

[ I'm sending this patch out to hopefully get people interested in
  mitigating this Intel CPU vulnerability, as the original thread at
  https://lore.kernel.org/lkml/YwgCrqutxmX0W72r@gmail.com/T/#u died out
  after DIT was enabled on arm64, but nothing was done for x86.  Note:
  I don't currently have a way to properly test this patch, as I don't
  have hardware with Ice Lake or later.  Help with testing this on the
  actual hardware would be greatly appreciated.  I've tested that the
  command line options and sysfs file seem to be wired up properly. ]

According to documentation that Intel published recently [1], Intel CPUs
based on the Ice Lake and later microarchitectures don't guarantee "data
operand independent timing" by default.  I.e., instruction execution
times may depend on the values of data operated on.  This is true for a
wide variety of instructions, including many instructions that are
heavily used in cryptography and have always been assumed to be
constant-time, e.g. additions, XORs, and even the AES-NI instructions.

Cryptography algorithms require constant-time instructions to prevent
side-channel attacks that recover cryptographic keys based on execution
times.  Therefore, without this CPU vulnerability mitigated, it's
generally impossible to safely do cryptography on the latest Intel CPUs.

It's also plausible that this CPU vulnerability can expose privileged
kernel data to unprivileged userspace processes more generally.

To mitigate this CPU vulnerability, it's possible to enable "Data
Operand Independent Timing Mode" (DOITM) by setting a bit in a MSR.
While Intel's documentation suggests that this bit should only be set
where "necessary", that is highly impractical, given the fact that
cryptography can happen nearly anywhere in the kernel and userspace, and
the fact that the entire kernel likely needs to be protected anyway.

Therefore, let's simply enable DOITM globally by default to fix this
vulnerability.  At most this gives up an "optimization" on the very
latest CPUs, restoring the correct behavior from previous CPUs.

Note: this patch does not address the separate but related vulnerability
of MXCSR Configuration Dependent Timing (MCDT) that the Intel document
describes.  A separate patch would need to address that.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 .../ABI/testing/sysfs-devices-system-cpu      |  1 +
 Documentation/admin-guide/hw-vuln/dodt.rst    | 37 ++++++++++++
 Documentation/admin-guide/hw-vuln/index.rst   |  1 +
 .../admin-guide/kernel-parameters.txt         |  7 +++
 arch/x86/include/asm/cpufeatures.h            |  1 +
 arch/x86/include/asm/msr-index.h              |  8 +++
 arch/x86/kernel/cpu/bugs.c                    | 56 +++++++++++++++++++
 arch/x86/kernel/cpu/common.c                  |  4 ++
 arch/x86/kernel/cpu/cpu.h                     |  1 +
 drivers/base/cpu.c                            |  8 +++
 include/linux/cpu.h                           |  2 +
 11 files changed, 126 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/dodt.rst


base-commit: fb6e71db53f3d4351dada7c130fb652eecf994d6
  

Comments

Bagas Sanjaya Jan. 25, 2023, 3:07 a.m. UTC | #1
On Tue, Jan 24, 2023 at 05:28:01PM -0800, Eric Biggers wrote:
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +DODT - Data Operand Dependent Timing
> +====================================
> +
> +Data Operand Dependent Timing (DODT) is a CPU vulnerability that makes the
> +execution times of instructions depend on the values of the data operated on.
> +
> +This vulnerability potentially enables side-channel attacks on data, including
> +cryptographic keys.  Most cryptography algorithms require that a variety of
> +instructions be constant-time in order to prevent side-channel attacks.
> +
> +Affected CPUs
> +-------------
> +
> +This vulnerability affects Intel Core family processors based on the Ice Lake
> +and later microarchitectures, and Intel Atom family processors based on the
> +Gracemont and later microarchitectures.  For more information, see Intel's
> +documentation [1]_.
> +
> +Mitigation
> +----------
> +
> +Mitigation of this vulnerability involves setting a Model Specific Register
> +(MSR) bit to enable Data Operand Independent Timing Mode (DOITM).
> +
> +By the default, the kernel does this on all CPUs.  This mitigation is global, so
> +it applies to both the kernel and userspace.
> +
> +This mitigation can be disabled by adding ``doitm=off`` to the kernel command
> +line.  It's also one of the mitigations that can be disabled by
> +``mitigations=off``.
> +
> +References
> +----------
> +.. [1] Data Operand Independent Timing Instruction Set Architecture (ISA) Guidance
> +   https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html
> diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
> index 4df436e7c4177..cd962f9634dad 100644
> --- a/Documentation/admin-guide/hw-vuln/index.rst
> +++ b/Documentation/admin-guide/hw-vuln/index.rst
> @@ -18,3 +18,4 @@ are configurable at compile, boot or run time.
>     core-scheduling.rst
>     l1d_flush.rst
>     processor_mmio_stale_data.rst
> +   dodt.rst
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 6cfa6e3996cf7..a6a872c4365e6 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1119,6 +1119,12 @@
>  			The filter can be disabled or changed to another
>  			driver later using sysfs.
>  
> +	doitm=off	[X86,INTEL] Disable the use of Data Operand Independent
> +			Timing Mode (DOITM).  I.e., disable the mitigation for
> +			the Data Operand Dependent Timing (DODT) CPU
> +			vulnerability.  For details, see
> +			Documentation/admin-guide/hw-vuln/dodt.rst
> +
>  	driver_async_probe=  [KNL]
>  			List of driver names to be probed asynchronously. *
>  			matches with all driver names. If * is specified, the
> @@ -3259,6 +3265,7 @@
>  					       no_uaccess_flush [PPC]
>  					       mmio_stale_data=off [X86]
>  					       retbleed=off [X86]
> +					       doitm=off [X86,INTEL]
>  
>  				Exceptions:
>  					       This does not have any effect on
 
The doc LGTM, thanks!

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
  
Dave Hansen Jan. 25, 2023, 3:29 p.m. UTC | #2
On 1/24/23 17:28, Eric Biggers wrote:
> To mitigate this CPU vulnerability, it's possible to enable "Data
> Operand Independent Timing Mode" (DOITM) by setting a bit in a MSR.
> While Intel's documentation suggests that this bit should only be set
> where "necessary", that is highly impractical, given the fact that
> cryptography can happen nearly anywhere in the kernel and userspace, and
> the fact that the entire kernel likely needs to be protected anyway.

I think this misses a key point from the documentation:

	This functionality is intended for use by software which has
	already applied other techniques to mitigate software timing
	side channels, such as those documented in Intel's Guidelines
	for Mitigating Timing Side Channels Against Cryptographic
	Implementations.

Translating from Intel-speak: Intel thinks that DOITM purely a way to
make the CPU run slower if you haven't already written code specifically
to mitigate timing side channels.  All pain, no gain.

The kernel as a whole is not written that way.  I'm sure the crypto
folks that are cc'd can tell us specifically if the kernel crypto code
is written following those recommendations.

So, let's call this patch what it is: a potential global slowdown which
protects a very small amount of crypto code, probably just in userspace.
 That is probably the code that's generating your RSA keys, so it's
quite important, but it's also a _very_ small total amount of code.

There's another part here which I think was recently added to the
documentation:

	Intel expects the performance impact of this mode may be
	significantly higher on future processors. 

That's _meant_ to be really scary and keep folks from turning this on by
default, aka. what this patch does.  Your new CPU will be really slow if
you turn this on!  Boo!

All that said, and given the information that Intel has released, I
think this patch is generally the right thing to do.  I don't think
people are wrong for looking at "DODT" as being a new vulnerability.
Intel obviously doesn't see it that way, which is why "DODT" has (as far
as I can tell) not been treated with the same security pomp and
circumstance as other stuff.

Last, if you're going to propose that this be turned on, I expect to see
at least _some_ performance data.  DOITM=1 isn't free, even on Ice Lake.
  
Dave Hansen Jan. 25, 2023, 4:15 p.m. UTC | #3
On 1/25/23 07:29, Dave Hansen wrote:
> There's another part here which I think was recently added to the
> documentation:
> 
> 	Intel expects the performance impact of this mode may be
> 	significantly higher on future processors. 
> 
> That's _meant_ to be really scary and keep folks from turning this on by
> default, aka. what this patch does.  Your new CPU will be really slow if
> you turn this on!  Boo!

*If* we go forward with this patch's approach in the kernel, I think we
should at least consider what the kernel will do in a future where there
are two classes of systems:

	1. Ice Lake era ones where DOITM=1 is cheap enough that it can
	   on by default.
	2. "Future processors" where DOITM=1 by default is too costly.

Maybe for #2 we set DOITM=0 in the kernel.  Maybe we add per-task
controls.

But, there *is* DOITM cost where the large fleets are going to be
tempted to turn it off somehow, somewhere.  The kernel will be better
off if we can design that in now.
  
Ard Biesheuvel Jan. 25, 2023, 4:22 p.m. UTC | #4
On Wed, 25 Jan 2023 at 16:29, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/24/23 17:28, Eric Biggers wrote:
> > To mitigate this CPU vulnerability, it's possible to enable "Data
> > Operand Independent Timing Mode" (DOITM) by setting a bit in a MSR.
> > While Intel's documentation suggests that this bit should only be set
> > where "necessary", that is highly impractical, given the fact that
> > cryptography can happen nearly anywhere in the kernel and userspace, and
> > the fact that the entire kernel likely needs to be protected anyway.
>
> I think this misses a key point from the documentation:
>
>         This functionality is intended for use by software which has
>         already applied other techniques to mitigate software timing
>         side channels, such as those documented in Intel's Guidelines
>         for Mitigating Timing Side Channels Against Cryptographic
>         Implementations.
>
> Translating from Intel-speak: Intel thinks that DOITM purely a way to
> make the CPU run slower if you haven't already written code specifically
> to mitigate timing side channels.  All pain, no gain.
>
> The kernel as a whole is not written that way.  I'm sure the crypto
> folks that are cc'd can tell us specifically if the kernel crypto code
> is written following those recommendations.
>

Cryptography is often singled out because it deals with confidential
data, and if timing variances leak the data, the confidentiality is
violated.

However, this is not fundamentally different from execution at a
higher privilege level. In this case, the data is shielded from other
observers by the h/w enforced privilege boundary rather than
encryption, but if timing variances leak the data, the result is the
same, i.e., data that was assumed to be confidential as far as user
space is concerned is no longer confidential.

All the nospec stuff we added for Spectre v1 serves the same purpose,
essentially, although the timing variances due to cache misses are
likely easier to measure. IOW, some of the kernel is now written that
way in fact, although the author of that doc may have had something
else in mind.

So IMHO, the scope is really not as narrow as you think.

> So, let's call this patch what it is: a potential global slowdown which
> protects a very small amount of crypto code, probably just in userspace.
>  That is probably the code that's generating your RSA keys, so it's
> quite important, but it's also a _very_ small total amount of code.
>
> There's another part here which I think was recently added to the
> documentation:
>
>         Intel expects the performance impact of this mode may be
>         significantly higher on future processors.
>
> That's _meant_ to be really scary and keep folks from turning this on by
> default, aka. what this patch does.  Your new CPU will be really slow if
> you turn this on!  Boo!
>

What is the penalty for switching it on and off? On arm64, it is now
on by default in the kernel, and off by default in user space, and
user space can opt into it using an unprivileged instruction.

> All that said, and given the information that Intel has released, I
> think this patch is generally the right thing to do.  I don't think
> people are wrong for looking at "DODT" as being a new vulnerability.
> Intel obviously doesn't see it that way, which is why "DODT" has (as far
> as I can tell) not been treated with the same security pomp and
> circumstance as other stuff.
>
> Last, if you're going to propose that this be turned on, I expect to see
> at least _some_ performance data.  DOITM=1 isn't free, even on Ice Lake.
  
Dave Hansen Jan. 25, 2023, 4:45 p.m. UTC | #5
On 1/25/23 08:22, Ard Biesheuvel wrote:
...
> All the nospec stuff we added for Spectre v1 serves the same purpose,
> essentially, although the timing variances due to cache misses are
> likely easier to measure. IOW, some of the kernel is now written that
> way in fact, although the author of that doc may have had something
> else in mind.
> 
> So IMHO, the scope is really not as narrow as you think.

I've spoken with the folks who wrote that doc.  They've told me
repeatedly that the scope is super narrow.  Seriously, look at just
*one* thing in the other Intel doc about mitigating timing side-channels[1]:

	be wary of code generated from high-level language source code
	that appears to adhere to all of these recommendations.

The kernel has a fair amount of code written in high-level languages.

The authors of the DOIT doc truly intend the real-world benefits of
DOITM to be exceedingly narrow.  I think it would be fair to say that
they think:

	DOITM is basically useless for most code written in C, including
	basically the entire kernel.

I'll go forward this on to them and make sure I'm not overstating this
_too_ much.

>> That's _meant_ to be really scary and keep folks from turning this on by
>> default, aka. what this patch does.  Your new CPU will be really slow if
>> you turn this on!  Boo!
> 
> What is the penalty for switching it on and off? On arm64, it is now
> on by default in the kernel, and off by default in user space, and
> user space can opt into it using an unprivileged instruction.

Right now, DOITM is controlled by a bit in an MSR and it applies
everywhere.  It is (thankfully) one of the cheap MSRs and is not
architecturally serializing.

That's still not ideal and there is a desire to expose the bit to
userspace *somehow* to make it much, much cheaper to toggle.  But, it'll
still be an extra bit that needs to get managed and context switched.

When I looked, the arm64 bit seemed to be in some flags register that
got naturally saved and restored already on user<->kernel transitions.
Was I reading it right?  It seemed like a really nice, simple mechanism
to me.


1.
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/secure-coding/mitigate-timing-side-channel-crypto-implementation.html
  
Ard Biesheuvel Jan. 26, 2023, 10:20 a.m. UTC | #6
On Wed, 25 Jan 2023 at 17:46, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/25/23 08:22, Ard Biesheuvel wrote:
> ...
> > All the nospec stuff we added for Spectre v1 serves the same purpose,
> > essentially, although the timing variances due to cache misses are
> > likely easier to measure. IOW, some of the kernel is now written that
> > way in fact, although the author of that doc may have had something
> > else in mind.
> >
> > So IMHO, the scope is really not as narrow as you think.
>
> I've spoken with the folks who wrote that doc.  They've told me
> repeatedly that the scope is super narrow.  Seriously, look at just
> *one* thing in the other Intel doc about mitigating timing side-channels[1]:
>
>         be wary of code generated from high-level language source code
>         that appears to adhere to all of these recommendations.
>
> The kernel has a fair amount of code written in high-level languages.
>

This is why we have crypto_memneq(), for instance, which is intended
to be time invariant, whereas the time taken by ordinary memcmp() is
typically correlated with the byte index of the first unequal byte. So
what we do there is compare every byte, instead of returning early on
the first mismatch. We do, however, perform the comparison in the
native word size and not byte by byte.

So if these optimizations result in word comparisons potentially
taking less time if the first byte is a mismatch, we definitely have a
problem. (This particular example may be far fetched but you get my
point)

> The authors of the DOIT doc truly intend the real-world benefits of
> DOITM to be exceedingly narrow.

I understand that this is the intent. But for privileged execution,
this should really be the other way around: the scope for
optimizations relying on data dependent timing is exceedingly narrow
in the kernel, because any data it processes must be assumed to be
confidential by default (wrt user space), and it will probably be
rather tricky to identify CPU bound workloads in the kernel where data
dependent optimizations are guaranteed to be safe and result in a
significant speedup.

This is basically the same argument I made for arm64.

>  I think it would be fair to say that
> they think:
>
>         DOITM is basically useless for most code written in C, including
>         basically the entire kernel.
>
> I'll go forward this on to them and make sure I'm not overstating this
> _too_ much.
>

C code that was not specifically written with data independent timing
in mind may still behave that way today,
C code that *was* specifically written with data independent timing in
mind (such as crypto_memneq()) could potentially lose that property
under these optimizations.

> >> That's _meant_ to be really scary and keep folks from turning this on by
> >> default, aka. what this patch does.  Your new CPU will be really slow if
> >> you turn this on!  Boo!
> >
> > What is the penalty for switching it on and off? On arm64, it is now
> > on by default in the kernel, and off by default in user space, and
> > user space can opt into it using an unprivileged instruction.
>
> Right now, DOITM is controlled by a bit in an MSR and it applies
> everywhere.  It is (thankfully) one of the cheap MSRs and is not
> architecturally serializing.
>
> That's still not ideal and there is a desire to expose the bit to
> userspace *somehow* to make it much, much cheaper to toggle.  But, it'll
> still be an extra bit that needs to get managed and context switched.
>
> When I looked, the arm64 bit seemed to be in some flags register that
> got naturally saved and restored already on user<->kernel transitions.
> Was I reading it right?  It seemed like a really nice, simple mechanism
> to me.
>

Indeed. It is part of PSTATE, which means is gets preserved/restored
along with the rest of the SPSR (saved program state) when an
exception is taken.
  
Jann Horn Jan. 26, 2023, 1:52 p.m. UTC | #7
On Wed, Jan 25, 2023 at 4:30 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 1/24/23 17:28, Eric Biggers wrote:
> > To mitigate this CPU vulnerability, it's possible to enable "Data
> > Operand Independent Timing Mode" (DOITM) by setting a bit in a MSR.
> > While Intel's documentation suggests that this bit should only be set
> > where "necessary", that is highly impractical, given the fact that
> > cryptography can happen nearly anywhere in the kernel and userspace, and
> > the fact that the entire kernel likely needs to be protected anyway.
>
> I think this misses a key point from the documentation:
>
>         This functionality is intended for use by software which has
>         already applied other techniques to mitigate software timing
>         side channels, such as those documented in Intel's Guidelines
>         for Mitigating Timing Side Channels Against Cryptographic
>         Implementations.
>
> Translating from Intel-speak: Intel thinks that DOITM purely a way to
> make the CPU run slower if you haven't already written code specifically
> to mitigate timing side channels.  All pain, no gain.
>
> The kernel as a whole is not written that way.

The kernel as a whole also doesn't really use the FPU registers for
anything other than checksumming and cryptography and stuff like that
(it's disabled in the compiler flags because the FPU registers
normally contain userspace state that must not be clobbered). The
instructions listed on that Intel help page are all weird PM* and VP*
arithmetic instructions that can't be generated from C code in the
kernel (except for weird subsystems in which every function is only
callable in kernel-FPU-enabled mode and the compiler is set to enable
FPU instruction generation, by which I mean amdgpu).

So with the current generations of CPUs, based on that Intel help
page, "the kernel as a whole" should not be affected by this setting,
it's mostly just cryptographic helpers written in assembly. From a
quick-and-dirty grep:

$ find arch/x86/ -name '*.S' | xargs egrep -li
'(PMADDUBSW|PMULUDQ|VPMULHRSW|PMADDWD|VPLZCNTD|VPMULHUW|PMULDQ|VPLZCNTQ|VPMULHW|PMULHRSW|VPMADD52HUQ|VPMULLD|PMULHUW|VPMADD52LUQ|VPMULLQ|PMULHW|VPMADDUBSW|VPMULLW|PMULLD|VPMADDWD|VPMULUDQ|PMULLW|VPMULDQ)'
arch/x86/crypto/nh-avx2-x86_64.S
arch/x86/crypto/nh-sse2-x86_64.S
arch/x86/crypto/poly1305-x86_64-cryptogams.S

That's NHPoly1305 and Poly1305, two crypto functions. Poly1305 is used
for the authentication part of AEAD schemes (in particular WireGuard
uses Poly1305 for all its authentication), NHPoly1305 is for Adiantum
(weird Android disk encryption stuff AFAIK).

> I'm sure the crypto
> folks that are cc'd can tell us specifically if the kernel crypto code
> is written following those recommendations.

Note that all the memory-management and block layer code that copies
page contents around and swaps them to disk and so on should (ideally)
be cryptographically timing-safe (yes I know that's very much not true
with zram swap), because it operates on basically all userspace
memory, including userspace memory used by cryptographic code.

On top of that, some cryptographic secrets (including SSH
authentication keys) need to be persisted somewhere - some
cryptographic key material is stored in filesystems. And of course
there are also secrets that are not quite cryptographic, and can be
similarly important and small as cryptographic keys, such as cookies
and OAuth tokens stored in the various databases of something like a
browser.

I think part of the reason why KSM is nowadays disabled in most cloud
environments is that it introduced timing differences in the MM
subsystem (between taking a write fault on a private page and taking a
write fault on a KSM-shared page), and those timing differences could
be used to leak information between VMs. (See, for example,
<https://gruss.cc/files/remote_dedup.pdf>.)

So I think there is quite a bit more kernel code that *handles
cryptographic intermediate states and cryptographic secrets* (and
non-cryptographic secrets that have similar properties) than kernel
code that *performs cryptography*.

Do we have a guarantee that things like page copying (for example as
part of migration or forking), usercopy (for example for filesystem
access), and whatever checksumming and other things might reasonably
be happening in the block layer (including for swap) and in
filesystems are going to be data-timing-independent on future
processors?

> So, let's call this patch what it is: a potential global slowdown which
> protects a very small amount of crypto code, probably just in userspace.

And a bunch of data paths that this crypto code's secrets might go
through, in particular in the kernel.

Also note that the kernel can do a bunch of important cryptography on
its own, like for disk encryption, filesystem encryption, SSL/TLS
sockets (yes you can have the kernel do the easy parts of TLS for
you), IPSec, WireGuard, and probably a bunch of other things.

>  That is probably the code that's generating your RSA keys, so it's
> quite important, but it's also a _very_ small total amount of code.

Yeah but all the code that handles storage of your RSA keys really
also needs a similar level of protection when touching them.
See https://arxiv.org/pdf/2108.04600.pdf for an example:
"We analyze the exploitability of base64 decoding functions across
several widely used cryptographic libraries. Base64 decoding is used
when loading keys stored in PEM format. We show that these functions
by themselves leak sufficient information even if libraries are
executed in trusted execution environments. In fact, we show that
recent countermeasures to transient execution attacks such as LVI ease
the exploitability of the observed faint leakages, allowing us to
robustly infer sufficient information about RSA private keys with a
single trace. We present a complete attack, including a broad library
analysis, a high-resolution last level cache attack on SGX enclaves,
and a fully parallelized implementation of the extend-and-prune
approach that allows a complete key recovery at medium costs."

> There's another part here which I think was recently added to the
> documentation:
>
>         Intel expects the performance impact of this mode may be
>         significantly higher on future processors.
>
> That's _meant_ to be really scary and keep folks from turning this on by
> default, aka. what this patch does.  Your new CPU will be really slow if
> you turn this on!  Boo!

It does sound really scary, because it sounds like that list of
instructions Intel published might be useless for trying to reason
about the security impact of turning this optimization on for future
hardware?

If the performance impact is going to be higher on newer CPUs then I
would *definitely* want DOITM to be on by default, and then when Intel
can say what instructions are actually affected on a new CPU, we can
figure out whether it's safe to turn it off as a per-cpu-generation
thing?

> All that said, and given the information that Intel has released, I
> think this patch is generally the right thing to do.  I don't think
> people are wrong for looking at "DODT" as being a new vulnerability.
> Intel obviously doesn't see it that way, which is why "DODT" has (as far
> as I can tell) not been treated with the same security pomp and
> circumstance as other stuff.
>
> Last, if you're going to propose that this be turned on, I expect to see
> at least _some_ performance data.  DOITM=1 isn't free, even on Ice Lake.

So all in all, it sounds to me like this should only affect the
performance of crypto code in current CPUs (where we really want
data-independent timing), so it should definitely be enabled on
current CPUs. And on future CPUs it's a surprise package of potential
security problems, so for those we should be turning it on even more?
  
Dave Hansen Jan. 26, 2023, 4:40 p.m. UTC | #8
On 1/26/23 05:52, Jann Horn wrote:
> On Wed, Jan 25, 2023 at 4:30 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> Translating from Intel-speak: Intel thinks that DOITM purely a way to
>> make the CPU run slower if you haven't already written code specifically
>> to mitigate timing side channels.  All pain, no gain.
>>
>> The kernel as a whole is not written that way.
> 
> The kernel as a whole also doesn't really use the FPU registers for
> anything other than checksumming and cryptography and stuff like that
> (it's disabled in the compiler flags because the FPU registers
> normally contain userspace state that must not be clobbered). The
> instructions listed on that Intel help page are all weird PM* and VP*
> arithmetic instructions that can't be generated from C code in the
> kernel (except for weird subsystems in which every function is only
> callable in kernel-FPU-enabled mode and the compiler is set to enable
> FPU instruction generation, by which I mean amdgpu).

Maybe I'm totally missing something, but I thought the scope here was
the "non-data operand independent timing behavior for the listed
instructions" referenced here:

> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html

where the "listed instructions" is this list:

> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html

For example, that includes XOR with the 0x31 and 0x81 opcodes which
there are plenty of in the kernel.

That's a bit wider scope than the crazy instructions like VPLZCNTD.  The
crazy instructions list that I _think_ you were grepping for is the
"Instructions That May Exhibit MCDT Behavior".  That's also a fun one,
but it is more narrow than the DOITM list.
  
Jann Horn Jan. 26, 2023, 5:52 p.m. UTC | #9
On Thu, Jan 26, 2023 at 5:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 1/26/23 05:52, Jann Horn wrote:
> > On Wed, Jan 25, 2023 at 4:30 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >> Translating from Intel-speak: Intel thinks that DOITM purely a way to
> >> make the CPU run slower if you haven't already written code specifically
> >> to mitigate timing side channels.  All pain, no gain.
> >>
> >> The kernel as a whole is not written that way.
> >
> > The kernel as a whole also doesn't really use the FPU registers for
> > anything other than checksumming and cryptography and stuff like that
> > (it's disabled in the compiler flags because the FPU registers
> > normally contain userspace state that must not be clobbered). The
> > instructions listed on that Intel help page are all weird PM* and VP*
> > arithmetic instructions that can't be generated from C code in the
> > kernel (except for weird subsystems in which every function is only
> > callable in kernel-FPU-enabled mode and the compiler is set to enable
> > FPU instruction generation, by which I mean amdgpu).
>
> Maybe I'm totally missing something, but I thought the scope here was
> the "non-data operand independent timing behavior for the listed
> instructions" referenced here:
>
> > https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html
>
> where the "listed instructions" is this list:
>
> > https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html
>
> For example, that includes XOR with the 0x31 and 0x81 opcodes which
> there are plenty of in the kernel.

That list says at the top: "The table below lists instructions that
have data-independent timing."

And the "MCDT (MXCSR-Sensitivity)" column that marks instructions that
do not actually have data-independent timing if you set the MSR to the
unsafe state is only marked for PMADDUBSW, PMADDWD, PMULDQ, PMULHRSW,
PMULHUW, PMULHW, PMULLD, PMULLW, PMULUDQ, VPLZCNTD, VPLZCNTQ,
VPMADD52HUQ, VPMADD52LUQ, VPMADDUBSW, VPMADDWD, VPMULDQ, VPMULHRSW,
VPMULHUW, VPMULHW, VPMULLD, VPMULLQ, VPMULLW, VPMULUDQ. All the others
are guaranteed to always have data-independent timing, if I understand
the table correctly.

> That's a bit wider scope than the crazy instructions like VPLZCNTD.  The
> crazy instructions list that I _think_ you were grepping for is the
> "Instructions That May Exhibit MCDT Behavior".  That's also a fun one,
> but it is more narrow than the DOITM list.
  
Dave Hansen Jan. 26, 2023, 7:12 p.m. UTC | #10
On 1/26/23 09:52, Jann Horn wrote:
>> Maybe I'm totally missing something, but I thought the scope here was
>> the "non-data operand independent timing behavior for the listed
>> instructions" referenced here:
>>
>>> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html
>> where the "listed instructions" is this list:
>>
>>> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html
>> For example, that includes XOR with the 0x31 and 0x81 opcodes which
>> there are plenty of in the kernel.
> That list says at the top: "The table below lists instructions that
> have data-independent timing."

So, first of all, apologies for the documentation.  It needs some work,
and I see where the confusion is coming from.

But, I did just confirm with the folks that wrote it. The "listed
instructions" *ARE* within the scope of being affected by the DOITM=0/1
setting.

Instead of saying:

	The table below lists instructions that have data-independent
	timing.

I think it should probably say something like:

	The table below lists instructions that have data-independent
	timing when DOITM is enabled.

	(Modulo the MXCSR interactions for now)

Somebody from Intel please thwack me over the head if I'm managing to
get this wrong (wouldn't be the first time).
  
Eric Biggers Jan. 26, 2023, 10:37 p.m. UTC | #11
On Thu, Jan 26, 2023 at 11:12:36AM -0800, Dave Hansen wrote:
> On 1/26/23 09:52, Jann Horn wrote:
> >> Maybe I'm totally missing something, but I thought the scope here was
> >> the "non-data operand independent timing behavior for the listed
> >> instructions" referenced here:
> >>
> >>> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html
> >> where the "listed instructions" is this list:
> >>
> >>> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html
> >> For example, that includes XOR with the 0x31 and 0x81 opcodes which
> >> there are plenty of in the kernel.
> > That list says at the top: "The table below lists instructions that
> > have data-independent timing."
> 
> So, first of all, apologies for the documentation.  It needs some work,
> and I see where the confusion is coming from.
> 
> But, I did just confirm with the folks that wrote it. The "listed
> instructions" *ARE* within the scope of being affected by the DOITM=0/1
> setting.
> 
> Instead of saying:
> 
> 	The table below lists instructions that have data-independent
> 	timing.
> 
> I think it should probably say something like:
> 
> 	The table below lists instructions that have data-independent
> 	timing when DOITM is enabled.
> 
> 	(Modulo the MXCSR interactions for now)
> 
> Somebody from Intel please thwack me over the head if I'm managing to
> get this wrong (wouldn't be the first time).

That's my understanding too, based on the part of
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html
that describes DOITM.  The page
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html
is actively misleading, so yes please get that fixed.

I think the following is also extremely important:

	For Intel® Core™ family processors based on microarchitectures before
	Ice Lake and Intel Atom® family processors based on microarchitectures
	before Gracemont that do not enumerate IA32_UARCH_MISC_CTL, developers
	may assume that the instructions listed here operate as if DOITM is
	enabled.

The end result is that on older CPUs, Intel explicitly guarantees that the
instructions in
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html
have data operand independent timing.  But on newer CPUs, Intel has explicitly
removed that guarantee, and enabling DOITM is needed to get it back.

By the way, surely the importance of using DOITM on a particular CPU correlates
strongly with its performance overhead?  So I'm not sure that benchmarks of
DOITM would even be very interesting, as we couldn't necessarily decide on
something like "don't use DOITM if the overhead is more than X percent", since
that would exclude exactly the CPUs where it's the most important to use...

I think the real takeaway here is that the optimizations that Intel is
apparently trying to introduce are a bad idea and not safe at all.  To the
extent that they exist at all, they should be an opt-in thing, not out-opt.  The
CPU gets that wrong, but Linux can flip that and do it right.

- Eric
  
Dave Hansen Jan. 26, 2023, 11:58 p.m. UTC | #12
On 1/26/23 14:37, Eric Biggers wrote:
> The end result is that on older CPUs, Intel explicitly guarantees that the
> instructions in
> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html
> have data operand independent timing.  But on newer CPUs, Intel has explicitly
> removed that guarantee, and enabling DOITM is needed to get it back.

Yep.

> By the way, surely the importance of using DOITM on a particular CPU correlates
> strongly with its performance overhead?  So I'm not sure that benchmarks of
> DOITM would even be very interesting, as we couldn't necessarily decide on
> something like "don't use DOITM if the overhead is more than X percent", since
> that would exclude exactly the CPUs where it's the most important to use...

We've looked at how bad the cure is compared to the disease for *every*
one of these issues.  As far as I know, either the disease has always
gotten better over time or the cure gets cheaper (think IBRS/retpoline
-> EIBRS or RDCL_NO for Meltdown).

DOITM doesn't follow that pattern.  It appears that it's fairly cheap
now, but Intel is reserving the right to make it worse over time.  I
don't know how we can come up with a kernel policy which will be sane
going forward when things get worse over time.

> I think the real takeaway here is that the optimizations that Intel is
> apparently trying to introduce are a bad idea and not safe at all.  To the
> extent that they exist at all, they should be an opt-in thing, not out-opt.  The
> CPU gets that wrong, but Linux can flip that and do it right.

Let's try to break this down a _bit_.

The code most sensitive to the DOITM behavior is code written to be
resistant to timing side-channels today.  DOITM=0 behavior is obviously
unsafe there.  This code is, unfortunately, out in the wild and
completely unannotated and completely unaware of DOITM.  The only way to
mitigate it is to set DOITM=1 whenever it _might_ be running, which is
when userspace runs.  If the kernel can't be bothered to switch DOITM on
kernel entry/exit, then it's game over and DOITM=1 is the only choice.
For this code, it's arguable that even mitigations=off shouldn't set
DOITM=0.

The other extreme is code that's known to be vulnerable to timing
side-channels no matter the value of DOITM.  KSM is an example here.
Setting DOITM=1 isn't going to make KSM's side channels go away.

Then, there's the middle ground.  There is surely some code that does
not have timing side-channels with DOITM=0, but some appear with
DOITM=1.  I _think_ my colleagues at Intel consider this code to be in
the same bucket as "known vulnerable".  Basically, "if it's not designed
to be timing side-channel resistant then it surely is vulnerable to one,
regardless of DOITM."

That middle ground matters a *lot* because it's probably 99.9% of all
software, including essentially the whole kernel.

I think what I'm hearing from folks in this thread is that if going from
DOITM=1->0 _makes_ any code vulnerable in practice, then they view that
as a bug -- a real-world security bug.  It doesn't matter whether the
code was designed to be side-channel resistant or not.  A regression is
a regression.

I'm also hearing that the old (pre-Ice Lake), universal, DOITM=1
behavior is basically architecture because software depends on it.  It
can't be changed except with an explicit software opt-in.  Even if DOITM
were the right basic interface for this opt-in, the default polarity is
wrong for an opt-in.
  
Dave Hansen Jan. 31, 2023, 10:48 p.m. UTC | #13
We've been talking about this inside Intel.  Suffice to say that DOITM
was not designed to be turned on all the time.  If software turns it on
all the time, it won't accomplish what it was designed to do.

The _most_ likely thing that's going to happen is that DOITM gets
redefined to be much more limited in scope.  The current DOITM
architecture is very broad, but the implementations have much more
limited effects.  This means that the architecture can probably be
constrained and still match the hardware that's out there today.  That's
pure speculation (ha!) on my part.

I think we should hold off on merging any DOITM patches until the dust
settles.  As far as I know, there is no pressing practical reason to
apply something urgently.

Any objections?
  
Eric Biggers Feb. 1, 2023, 6:54 a.m. UTC | #14
On Tue, Jan 31, 2023 at 02:48:05PM -0800, Dave Hansen wrote:
> We've been talking about this inside Intel.  Suffice to say that DOITM
> was not designed to be turned on all the time.  If software turns it on
> all the time, it won't accomplish what it was designed to do.

Why wouldn't it accomplish what it is designed to do?  Does it not actually
work?

> 
> The _most_ likely thing that's going to happen is that DOITM gets
> redefined to be much more limited in scope.  The current DOITM
> architecture is very broad, but the implementations have much more
> limited effects.  This means that the architecture can probably be
> constrained and still match the hardware that's out there today.  That's
> pure speculation (ha!) on my part.
>
> I think we should hold off on merging any DOITM patches until the dust
> settles.  As far as I know, there is no pressing practical reason to
> apply something urgently.
> 
> Any objections?

Does this mean that Intel will be restoring the data operand independent timing
guarantee of some instructions that have had it removed?  If so, which
instructions will still be left?

Also, given that the DOITM flag can only be set and cleared by the kernel, and
assuming that Linux won't support DOITM in any way yet (as you're recommending),
what should the developers of userspace cryptographic libraries do?  Does Intel
have a list of which instructions *already* have started having data operand
dependent timing on released CPUs, so that the existing security impact can be
assessed and so that developers can avoid using those instructions?

- Eric
  
Dave Hansen Feb. 1, 2023, 6:09 p.m. UTC | #15
On 1/31/23 22:54, Eric Biggers wrote:
> On Tue, Jan 31, 2023 at 02:48:05PM -0800, Dave Hansen wrote:
>> We've been talking about this inside Intel.  Suffice to say that DOITM
>> was not designed to be turned on all the time.  If software turns it on
>> all the time, it won't accomplish what it was designed to do.
> 
> Why wouldn't it accomplish what it is designed to do?  Does it not actually
> work?

It was designed with the idea that timing-sensitive and *ONLY*
timing-sensitive code would be wrapped with DOITM.  Wrapping that code
would allow the rest of the system to safely operate with fancy new
optimizations that exhibit data dependent timing.

But, first of all, that code isn't wrapped with DOITM-prodding APIs
today.  That model falls apart with current software on current DOITM
implementations.

Second, we consider the kernel in general to be timing-sensitive enough
that we want DOITM=1 behavior in the kernel.

Those lead software folks to set DOITM=1 on all the time.  The fallout
from that is that nobody will ever use those fancy new optimizations.
If nobody can take advantage of them, silicon shouldn't be wasted on
them in the first place.

Basically, DOITM was meant to open the door for adding fancy new
optimizations.  It's a failure if it doesn't do that.

>> The _most_ likely thing that's going to happen is that DOITM gets
>> redefined to be much more limited in scope.  The current DOITM
>> architecture is very broad, but the implementations have much more
>> limited effects.  This means that the architecture can probably be
>> constrained and still match the hardware that's out there today.  That's
>> pure speculation (ha!) on my part.
>>
>> I think we should hold off on merging any DOITM patches until the dust
>> settles.  As far as I know, there is no pressing practical reason to
>> apply something urgently.
>>
>> Any objections?
> 
> Does this mean that Intel will be restoring the data operand independent timing
> guarantee of some instructions that have had it removed?  If so, which
> instructions will still be left?

Speaking for myself and *not* the official Intel plan here.  Yes, I
think the pre-DOITM behavior can and will be restored for basically the
entire list[1] (ignoring MCDT of course).

> Also, given that the DOITM flag can only be set and cleared by the
> kernel, and assuming that Linux won't support DOITM in any way yet
> (as you're recommending), what should the developers of userspace
> cryptographic libraries do? Does Intel have a list of which
> instructions *already* have started having data operand dependent
> timing on released CPUs, so that the existing security impact can be
> assessed and so that developers can avoid using those instructions?
Good questions.  I want to make sure what I'm saying here is accurate,
and I don't have good answers to those right now.  We're working on it,
and should have something useful to say in the next couple of days.

1.
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html
  
Josh Triplett Feb. 1, 2023, 10:33 p.m. UTC | #16
On Wed, Feb 01, 2023 at 10:09:16AM -0800, Dave Hansen wrote:
> It was designed with the idea that timing-sensitive and *ONLY*
> timing-sensitive code would be wrapped with DOITM.  Wrapping that code
> would allow the rest of the system to safely operate with fancy new
> optimizations that exhibit data dependent timing.
> 
> But, first of all, that code isn't wrapped with DOITM-prodding APIs
> today.  That model falls apart with current software on current DOITM
> implementations.
> 
> Second, we consider the kernel in general to be timing-sensitive enough
> that we want DOITM=1 behavior in the kernel.
> 
> Those lead software folks to set DOITM=1 on all the time.  The fallout
> from that is that nobody will ever use those fancy new optimizations.
> If nobody can take advantage of them, silicon shouldn't be wasted on
> them in the first place.
> 
> Basically, DOITM was meant to open the door for adding fancy new
> optimizations.  It's a failure if it doesn't do that.

It seems like it still potentially accomplishes its intended purpose
even if it's opt-in. A prctl to turn it on for a particular process, or
a particular process and its children, could work if that process knows
it wants all the performance it can get and won't be handling data for
which privilege boundaries matter. If this actually has the potential
for substantial optimizations that would be worth it.

But yeah, opt-out seems like a non-starter, since it'd require fixing
all the cryptographic code in the world to request DOITM=1.
  
Dave Hansen Feb. 3, 2023, 4:25 p.m. UTC | #17
On 2/1/23 10:09, Dave Hansen wrote:
> Good questions.  I want to make sure what I'm saying here is accurate,
> and I don't have good answers to those right now.  We're working on it,
> and should have something useful to say in the next couple of days.

This is an attempt to make sure that everyone that is concerned about
DOITM behavior has all the same information as Intel folks before we
make a decision about a kernel implementation.

Here we go...

The execution latency of the DOIT instructions[1] does not depend on the
value of data operands on all currently-supported Intel processors. This
includes all processors that enumerate DOITM support.  There are no
plans for any processors where this behavior would change, despite the
DOITM architecture theoretically allowing it.

So, what's the point of DOITM in the first place?  Fixed execution
latency does not mean that programs as a whole will have constant
overall latency.  DOITM currently affects features which do not affect
execution latency but may, for instance, affect overall program latency
due to side-effects of prefetching on the cache.  Even with fixed
instruction execution latency, these side-effects can matter especially
to the paranoid.

Today, those affected features are:
 * Data Dependent Prefetchers (DDP)[2]
 * Some Fast Store Forwarding Predictors (FSFP)[3].

There are existing controls for those features, including
spec_store_bypass_disable=[4].  Some paranoid software may already have
mitigations in place that are a superset of DOITM.  In addition, both
DDP and FSFP were also designed to limit nastiness when crossing
privilege boundaries.  Please see the linked docs for more details.

That's basically the Intel side of things.  Everyone else should have
all the background that I have.  Now back to maintainer mode...

So, in short, I don't think everyone's crypto libraries are busted today
when DOITM=0.  I don't think they're going to _become_ busted any time soon.

Where do we go from here?  There are a few choices:

1. Apply the patch in this thread, set DOITM=1 always.  Today, this
   reduces exposure to DDP and FSFP, but probably only for userspace.
   It reduces exposure to any future predictors under the DOITM umbrella
   and also from Intel changing its mind.
2. Ignore DOITM, leave it to the hardware default of DOITM=0.  Today,
   this probably just steers folks to using relatively heavyweight
   mitigations (like SSBD) if they want DDP/FSFP disabled.  It also
   leaves Linux exposed to Intel changing its mind on its future plans.
3. Apply the patch in this thread, but leave DOITM=0 as the default.
   This lets folks enable DOITM on a moment's notice if the need arises.

There are some other crazier choices like adding ABI to toggle DOITM for
userspace, but I'm not sure they're even worth discussing.

#1 is obviously the only way to go if the DOITM architecture remains
as-is.  There is talk of making changes, like completely removing the
idea of variable execution latency.  But that's a slow process and would
be a non-starter if *anyone* (like other OSes) is depending on the
existing DOITM definition.

My inclination is to wait a couple of weeks to see which way DOITM is
headed and if the definition is likely to get changed.  Does anyone feel
any greater sense of urgency?


1.
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html
2.
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/data-dependent-prefetcher.html
3.
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/fast-store-forwarding-predictor.html
4. https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
  
Dave Hansen Feb. 3, 2023, 6:25 p.m. UTC | #18
BTW, I'm basically moving forward assuming that we're going to apply
this patch in _some_ form.  I'm going to make some changes, but I'll
discuss them in this thread to make sure we're all on the same page first.

On 1/24/23 17:28, Eric Biggers wrote:
> +Affected CPUs
> +-------------
> +
> +This vulnerability affects Intel Core family processors based on the Ice Lake
> +and later microarchitectures, and Intel Atom family processors based on the
> +Gracemont and later microarchitectures.  For more information, see Intel's
> +documentation [1]_.

I had a hard time following the docs in this area.

But I'm not sure this statement is correct.  The docs actually say:

	For Intel® Core™ family processors based on microarchitectures
	before Ice Lake and Intel Atom® family processors based on
	microarchitectures before Gracemont that do not enumerate
	IA32_UARCH_MISC_CTL, developers may assume that the instructions
	listed here operate as if DOITM is enabled.

A processor needs to be before "Ice Lake" and friends *AND* not
enumerate IA32_UARCH_MISC_CTL to be unaffected.

There's also another tweak that's needed because:

	Processors that do not enumerate IA32_ARCH_CAPABILITIES[DOITM]
	when the latest microcode is applied do not need to set
	IA32_UARCH_MISC_CTL [DOITM] in order to have the behavior
	described in this document...

First, we need to mention the "latest microcode" thing in the kernel
docs.  I also _think_ the whole "microarchitectures before" stuff is
rather irrelevant and we can simplify this down to:

	This vulnerability affects all Intel processors that support
	MSR_IA32_ARCH_CAPABILITIES and set MSR_IA32_ARCH_CAPABILITIES[DOITM]
	when the latest microcode is applied.

Which reminds me.  This:

> +void update_doitm_msr(void)
> +{
> +	u64 msr;
> +
> +	if (doitm_off)
> +		return;
> +
> +	rdmsrl(MSR_IA32_UARCH_MISC_CTL, msr);
> +	wrmsrl(MSR_IA32_UARCH_MISC_CTL, msr | UARCH_MISC_DOITM);
> +}

should probably be:

void update_doitm_msr(void)
{
	u64 msr;

	/*
	 * All processors that enumerate support for DOIT
	 * are affected *and* have the mitigation available.
	 */
	if (!boot_cpu_has_bug(X86_BUG_DODT))
		return;

	rdmsrl(MSR_IA32_UARCH_MISC_CTL, msr);
	if (doitm_off)
		msr &= ~UARCH_MISC_DOITM;
	else
		msr |= UARCH_MISC_DOITM;
	wrmsrl(MSR_IA32_UARCH_MISC_CTL, msr);
}

in case the CPU isn't actually coming out of reset, like if kexec() left
DOITM=1.
  
Roxana Bradescu March 3, 2023, 3:32 a.m. UTC | #19
> On Feb 3, 2023, at 10:25 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> BTW, I'm basically moving forward assuming that we're going to apply
> this patch in _some_ form.  I'm going to make some changes, but I'll
> discuss them in this thread to make sure we're all on the same page first.

Just checking in on the changes mentioned here.

> On 1/24/23 17:28, Eric Biggers wrote:
>> +Affected CPUs
>> +-------------
>> +
>> +This vulnerability affects Intel Core family processors based on the Ice Lake
>> +and later microarchitectures, and Intel Atom family processors based on the
>> +Gracemont and later microarchitectures.  For more information, see Intel's
>> +documentation [1]_.
> 
> I had a hard time following the docs in this area.
> 
> But I'm not sure this statement is correct.  The docs actually say:
> 
> 	For Intel® Core™ family processors based on microarchitectures
> 	before Ice Lake and Intel Atom® family processors based on
> 	microarchitectures before Gracemont that do not enumerate
> 	IA32_UARCH_MISC_CTL, developers may assume that the instructions
> 	listed here operate as if DOITM is enabled.

Have we been able to clarify if this assumption is guaranteed?

> 
> A processor needs to be before "Ice Lake" and friends *AND* not
> enumerate IA32_UARCH_MISC_CTL to be unaffected.
> 
> There's also another tweak that's needed because:
> 
> 	Processors that do not enumerate IA32_ARCH_CAPABILITIES[DOITM]
> 	when the latest microcode is applied do not need to set
> 	IA32_UARCH_MISC_CTL [DOITM] in order to have the behavior
> 	described in this document...
> 
> First, we need to mention the "latest microcode" thing in the kernel
> docs.  I also _think_ the whole "microarchitectures before" stuff is
> rather irrelevant and we can simplify this down to:
> 
> 	This vulnerability affects all Intel processors that support
> 	MSR_IA32_ARCH_CAPABILITIES and set MSR_IA32_ARCH_CAPABILITIES[DOITM]
> 	when the latest microcode is applied.
> 

Certainly a lot cleaner. Would be great if the Intel docs reflected this.

—
Regards, Roxana


> Which reminds me.  This:
> 
>> +void update_doitm_msr(void)
>> +{
>> +	u64 msr;
>> +
>> +	if (doitm_off)
>> +		return;
>> +
>> +	rdmsrl(MSR_IA32_UARCH_MISC_CTL, msr);
>> +	wrmsrl(MSR_IA32_UARCH_MISC_CTL, msr | UARCH_MISC_DOITM);
>> +}
> 
> should probably be:
> 
> void update_doitm_msr(void)
> {
> 	u64 msr;
> 
> 	/*
> 	 * All processors that enumerate support for DOIT
> 	 * are affected *and* have the mitigation available.
> 	 */
> 	if (!boot_cpu_has_bug(X86_BUG_DODT))
> 		return;
> 
> 	rdmsrl(MSR_IA32_UARCH_MISC_CTL, msr);
> 	if (doitm_off)
> 		msr &= ~UARCH_MISC_DOITM;
> 	else
> 		msr |= UARCH_MISC_DOITM;
> 	wrmsrl(MSR_IA32_UARCH_MISC_CTL, msr);
> }
> 
> in case the CPU isn't actually coming out of reset, like if kexec() left
> DOITM=1.
>
  

Patch

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index f54867cadb0f6..b76966aac3470 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -524,6 +524,7 @@  What:		/sys/devices/system/cpu/vulnerabilities
 		/sys/devices/system/cpu/vulnerabilities/itlb_multihit
 		/sys/devices/system/cpu/vulnerabilities/mmio_stale_data
 		/sys/devices/system/cpu/vulnerabilities/retbleed
+		/sys/devices/system/cpu/vulnerabilities/dodt
 Date:		January 2018
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:	Information about CPU vulnerabilities
diff --git a/Documentation/admin-guide/hw-vuln/dodt.rst b/Documentation/admin-guide/hw-vuln/dodt.rst
new file mode 100644
index 0000000000000..03d24f4d97100
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/dodt.rst
@@ -0,0 +1,37 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+DODT - Data Operand Dependent Timing
+====================================
+
+Data Operand Dependent Timing (DODT) is a CPU vulnerability that makes the
+execution times of instructions depend on the values of the data operated on.
+
+This vulnerability potentially enables side-channel attacks on data, including
+cryptographic keys.  Most cryptography algorithms require that a variety of
+instructions be constant-time in order to prevent side-channel attacks.
+
+Affected CPUs
+-------------
+
+This vulnerability affects Intel Core family processors based on the Ice Lake
+and later microarchitectures, and Intel Atom family processors based on the
+Gracemont and later microarchitectures.  For more information, see Intel's
+documentation [1]_.
+
+Mitigation
+----------
+
+Mitigation of this vulnerability involves setting a Model Specific Register
+(MSR) bit to enable Data Operand Independent Timing Mode (DOITM).
+
+By the default, the kernel does this on all CPUs.  This mitigation is global, so
+it applies to both the kernel and userspace.
+
+This mitigation can be disabled by adding ``doitm=off`` to the kernel command
+line.  It's also one of the mitigations that can be disabled by
+``mitigations=off``.
+
+References
+----------
+.. [1] Data Operand Independent Timing Instruction Set Architecture (ISA) Guidance
+   https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 4df436e7c4177..cd962f9634dad 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -18,3 +18,4 @@  are configurable at compile, boot or run time.
    core-scheduling.rst
    l1d_flush.rst
    processor_mmio_stale_data.rst
+   dodt.rst
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6cfa6e3996cf7..a6a872c4365e6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1119,6 +1119,12 @@ 
 			The filter can be disabled or changed to another
 			driver later using sysfs.
 
+	doitm=off	[X86,INTEL] Disable the use of Data Operand Independent
+			Timing Mode (DOITM).  I.e., disable the mitigation for
+			the Data Operand Dependent Timing (DODT) CPU
+			vulnerability.  For details, see
+			Documentation/admin-guide/hw-vuln/dodt.rst
+
 	driver_async_probe=  [KNL]
 			List of driver names to be probed asynchronously. *
 			matches with all driver names. If * is specified, the
@@ -3259,6 +3265,7 @@ 
 					       no_uaccess_flush [PPC]
 					       mmio_stale_data=off [X86]
 					       retbleed=off [X86]
+					       doitm=off [X86,INTEL]
 
 				Exceptions:
 					       This does not have any effect on
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 61012476d66e0..7ddb7390c8a60 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -466,5 +466,6 @@ 
 #define X86_BUG_MMIO_UNKNOWN		X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
 #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
 #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
+#define X86_BUG_DODT			X86_BUG(29) /* CPU has data operand dependent timing by default */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 37ff47552bcb7..1d929ad4301a8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -114,6 +114,11 @@ 
 						 * Not susceptible to
 						 * TSX Async Abort (TAA) vulnerabilities.
 						 */
+#define ARCH_CAP_DOITM			BIT(12)	/*
+						 * The processor supports Data Operand
+						 * Independent Timing Mode, i.e. it is
+						 * vulnerable to data operand dependent timing.
+						 */
 #define ARCH_CAP_SBDR_SSDP_NO		BIT(13)	/*
 						 * Not susceptible to SBDR and SSDP
 						 * variants of Processor MMIO stale data
@@ -155,6 +160,9 @@ 
 						 * supported
 						 */
 
+#define MSR_IA32_UARCH_MISC_CTL		0x00001b01
+#define UARCH_MISC_DOITM		BIT(0)	/* Enable Data Operand Independent Timing Mode. */
+
 #define MSR_IA32_FLUSH_CMD		0x0000010b
 #define L1D_FLUSH			BIT(0)	/*
 						 * Writeback and invalidate the
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index bca0bd8f48464..7ea20a2618232 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -49,6 +49,7 @@  static void __init taa_select_mitigation(void);
 static void __init mmio_select_mitigation(void);
 static void __init srbds_select_mitigation(void);
 static void __init l1d_flush_select_mitigation(void);
+static void __init dodt_select_mitigation(void);
 
 /* The base value of the SPEC_CTRL MSR without task-specific bits set */
 u64 x86_spec_ctrl_base;
@@ -124,6 +125,8 @@  DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 DEFINE_STATIC_KEY_FALSE(mmio_stale_data_clear);
 EXPORT_SYMBOL_GPL(mmio_stale_data_clear);
 
+static bool __ro_after_init doitm_off;
+
 void __init check_bugs(void)
 {
 	identify_boot_cpu();
@@ -167,6 +170,7 @@  void __init check_bugs(void)
 	md_clear_select_mitigation();
 	srbds_select_mitigation();
 	l1d_flush_select_mitigation();
+	dodt_select_mitigation();
 
 	arch_smt_update();
 
@@ -2218,6 +2222,42 @@  static int __init l1tf_cmdline(char *str)
 }
 early_param("l1tf", l1tf_cmdline);
 
+#undef pr_fmt
+#define pr_fmt(fmt)	"DODT: " fmt
+
+static __init int doitm_parse_cmdline(char *str)
+{
+	if (!str)
+		return -EINVAL;
+
+	if (!boot_cpu_has_bug(X86_BUG_DODT))
+		return 0;
+
+	doitm_off = !strcmp(str, "off");
+	if (doitm_off)
+		pr_info("'doitm=off' specified, not enabling Data Operand Independent Timing Mode\n");
+	return 0;
+}
+early_param("doitm", doitm_parse_cmdline);
+
+static __init void dodt_select_mitigation(void)
+{
+	if (cpu_mitigations_off() || !boot_cpu_has_bug(X86_BUG_DODT))
+		doitm_off = true;
+	update_doitm_msr();
+}
+
+void update_doitm_msr(void)
+{
+	u64 msr;
+
+	if (doitm_off)
+		return;
+
+	rdmsrl(MSR_IA32_UARCH_MISC_CTL, msr);
+	wrmsrl(MSR_IA32_UARCH_MISC_CTL, msr | UARCH_MISC_DOITM);
+}
+
 #undef pr_fmt
 #define pr_fmt(fmt) fmt
 
@@ -2415,6 +2455,14 @@  static ssize_t retbleed_show_state(char *buf)
 	return sysfs_emit(buf, "%s\n", retbleed_strings[retbleed_mitigation]);
 }
 
+static ssize_t dodt_show_state(char *buf)
+{
+	if (doitm_off)
+		return sysfs_emit(buf, "Vulnerable\n");
+	else
+		return sysfs_emit(buf, "Mitigation: DOITM\n");
+}
+
 static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
 			       char *buf, unsigned int bug)
 {
@@ -2464,6 +2512,9 @@  static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 	case X86_BUG_RETBLEED:
 		return retbleed_show_state(buf);
 
+	case X86_BUG_DODT:
+		return dodt_show_state(buf);
+
 	default:
 		break;
 	}
@@ -2528,4 +2579,9 @@  ssize_t cpu_show_retbleed(struct device *dev, struct device_attribute *attr, cha
 {
 	return cpu_show_common(dev, attr, buf, X86_BUG_RETBLEED);
 }
+
+ssize_t cpu_show_dodt(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpu_show_common(dev, attr, buf, X86_BUG_DODT);
+}
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9cfca3d7d0e20..b27fa7e730f32 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1325,6 +1325,9 @@  static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 	    !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
 		setup_force_cpu_bug(X86_BUG_ITLB_MULTIHIT);
 
+	if (ia32_cap & ARCH_CAP_DOITM)
+		setup_force_cpu_bug(X86_BUG_DODT);
+
 	if (cpu_matches(cpu_vuln_whitelist, NO_SPECULATION))
 		return;
 
@@ -1972,6 +1975,7 @@  void identify_secondary_cpu(struct cpuinfo_x86 *c)
 	validate_apic_and_package_id(c);
 	x86_spec_ctrl_setup_ap();
 	update_srbds_msr();
+	update_doitm_msr();
 
 	tsx_ap_init();
 }
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 7c9b5893c30ab..187190cf1ccca 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -82,6 +82,7 @@  unsigned int aperfmperf_get_khz(int cpu);
 
 extern void x86_spec_ctrl_setup_ap(void);
 extern void update_srbds_msr(void);
+extern void update_doitm_msr(void);
 
 extern u64 x86_read_arch_cap_msr(void);
 
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 4c98849577d4e..281d4e18dddd2 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -576,6 +576,12 @@  ssize_t __weak cpu_show_retbleed(struct device *dev,
 	return sysfs_emit(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_dodt(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
@@ -587,6 +593,7 @@  static DEVICE_ATTR(itlb_multihit, 0444, cpu_show_itlb_multihit, NULL);
 static DEVICE_ATTR(srbds, 0444, cpu_show_srbds, NULL);
 static DEVICE_ATTR(mmio_stale_data, 0444, cpu_show_mmio_stale_data, NULL);
 static DEVICE_ATTR(retbleed, 0444, cpu_show_retbleed, NULL);
+static DEVICE_ATTR(dodt, 0444, cpu_show_dodt, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
@@ -600,6 +607,7 @@  static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_srbds.attr,
 	&dev_attr_mmio_stale_data.attr,
 	&dev_attr_retbleed.attr,
+	&dev_attr_dodt.attr,
 	NULL
 };
 
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 314802f98b9da..8fa28be9f48bc 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -70,6 +70,8 @@  extern ssize_t cpu_show_mmio_stale_data(struct device *dev,
 					char *buf);
 extern ssize_t cpu_show_retbleed(struct device *dev,
 				 struct device_attribute *attr, char *buf);
+extern ssize_t cpu_show_dodt(struct device *dev,
+			     struct device_attribute *attr, char *buf);
 
 extern __printf(4, 5)
 struct device *cpu_device_create(struct device *parent, void *drvdata,