[1/2] docs: rcu: Add cautionary note on plain-accesses to requirements

Message ID 20230803032408.2514989-1-joel@joelfernandes.org
State New
Series [1/2] docs: rcu: Add cautionary note on plain-accesses to requirements

Commit Message

Joel Fernandes Aug. 3, 2023, 3:24 a.m. UTC
  Add a detailed note to explain the potential side effects of
accessing the gp pointer with a plain load, that is, without the
rcu_dereference() macro; such an access might trip up neighboring code
that does use rcu_dereference().

I haven't verified this with a compiler, but this is what I gather from
the link below, based on Will's experience with READ_ONCE().

Link: https://lore.kernel.org/all/20230728124412.GA21303@willie-the-truck/
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../RCU/Design/Requirements/Requirements.rst  | 32 +++++++++++++++++++
 1 file changed, 32 insertions(+)
  

Comments

Alan Huang Aug. 3, 2023, 12:08 p.m. UTC | #1
> On Aug. 3, 2023, at 11:24, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> 
> [ ... commit message trimmed ... ]
> diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
> index f3b605285a87..e0b896d3fb9b 100644
> --- a/Documentation/RCU/Design/Requirements/Requirements.rst
> +++ b/Documentation/RCU/Design/Requirements/Requirements.rst
> @@ -376,6 +376,38 @@ mechanism, most commonly locking or reference counting
> .. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF]
> .. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
> 
> +Note that there can be strange side effects (due to compiler optimizations)
> +if ``gp`` is ever accessed using a plain load (i.e., without ``READ_ONCE()``
> +or ``rcu_dereference()``), potentially breaking any subsequent
> +``rcu_dereference()``. For example, consider the code:
> +
> +   ::
> +
> +       1 bool do_something_gp(void)
> +       2 {
> +       3   void *tmp;
> +       4   rcu_read_lock();
> +       5   tmp = gp; // Plain-load of GP.
> +       6   printk("Point gp = %p\n", tmp);
> +       7
> +       8   p = rcu_dereference(gp);
> +       9   if (p) {
> +      10     do_something(p->a, p->b);
> +      11     rcu_read_unlock();
> +      12     return true;
> +      13   }
> +      14   rcu_read_unlock();
> +      15   return false;
> +      16 }
> +
> +The behavior of plain accesses involved in a data race is non-deterministic in
> +the face of compiler optimizations. Since accesses to the ``gp`` pointer are
> +by design a data race, the compiler could break this code by caching the value
> +of ``gp`` in a register at line 5, and then using that register's value
> +to satisfy the load at line 10. Thus it is important to never mix

Will’s example is:

	 // Assume *ptr is initially 0 and somebody else writes it to 1
	 // concurrently

	 foo = *ptr;
	 bar = READ_ONCE(*ptr);
	 baz = *ptr;

Then the compiler is within its rights to reorder it to:

	foo = *ptr;
 	baz = *ptr;
 	bar = READ_ONCE(*ptr);

So, the result foo == baz == 0 but bar == 1 is perfectly legal.

But the example here is different: the compiler cannot use the value loaded at line 5
unless it can deduce that tmp equals p, in which case the address dependency
no longer exists.

What am I missing here?

> +plain accesses of a memory location with rcu_dereference() of the same memory
> +location, in code involved in a data race.
> +
> In short, updaters use rcu_assign_pointer() and readers use
> rcu_dereference(), and these two RCU API elements work together to
> ensure that readers have a consistent view of newly added data elements.
> -- 
> 2.41.0.585.gd2178a4bd4-goog
>
  
Joel Fernandes Aug. 3, 2023, 12:35 p.m. UTC | #2
> On Aug 3, 2023, at 8:09 AM, Alan Huang <mmpgouride@gmail.com> wrote:
> 
> 
>> [ ... patch text trimmed ... ]
> 
> Will’s example is:
> 
>     // Assume *ptr is initially 0 and somebody else writes it to 1
>     // concurrently
> 
>     foo = *ptr;
>     bar = READ_ONCE(*ptr);
>     baz = *ptr;
> 
> Then the compiler is within its right to reorder it to:
> 
>    foo = *ptr;
>    baz = *ptr;
>    bar = READ_ONCE(*ptr);
> 
> So, the result foo == baz == 0 but bar == 1 is perfectly legal.

Yes, a bad outcome is perfectly legal amidst a data race. Who said it is not legal?

> 
> But the example here is different,

That is intentional. Will's discussion partially triggered this, though I am wondering
if we should document that case as well.

> the compiler can not use the value loaded from line 5
> unless the compiler can deduce that the tmp is equals to p in which case the address dependency
> doesn’t exist anymore.
> 
> What am I missing here?

Maybe you are trying too hard to rationalize that the sequence mentioned cannot result
in a counter-intuitive outcome, as I did?

The point AFAIU is not just about line 10 but that the compiler can replace any of the
lines after the plain access with the cached value.

Thanks.



  
Alan Huang Aug. 3, 2023, 1:36 p.m. UTC | #3
> On Aug. 3, 2023, at 8:35 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> 
> 
>> On Aug 3, 2023, at 8:09 AM, Alan Huang <mmpgouride@gmail.com> wrote:
>> 
>> 
>>> [ ... patch text and Will's example trimmed ... ]
> 
> Yes, a bad outcome is perfectly legal amidst data race. Who said it is not legal?

My understanding is that it is legal even without a data race; the compiler only preserves the order of volatile accesses.

> 
>> 
>> But the example here is different,
> 
> That is intentional. Wills discussion partially triggered this. Though I am wondering
> if we should document that as well.
> 
>> the compiler can not use the value loaded from line 5
>> unless the compiler can deduce that the tmp is equals to p in which case the address dependency
>> doesn’t exist anymore.
>> 
>> What am I missing here?
> 
> Maybe you are trying to rationalize too much that the sequence mentioned cannot result
> in a counter intuitive outcome like I did?
> 
> The point AFAIU is not just about line 10 but that the compiler can replace any of the
> lines after the plain access with the cached value.

Well, IIUC, according to the C standard, the compiler can do anything if there is a data race (undefined behavior).

However, what if a write is not protected with WRITE_ONCE() but the read is marked with READ_ONCE()?
That's also a data race, right? But the kernel considers it okay if the write is machine-word aligned.

  
Joel Fernandes Aug. 3, 2023, 4:01 p.m. UTC | #4
On Thu, Aug 3, 2023 at 9:36 AM Alan Huang <mmpgouride@gmail.com> wrote:
>
>
> > On Aug. 3, 2023, at 8:35 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >
> >> On Aug 3, 2023, at 8:09 AM, Alan Huang <mmpgouride@gmail.com> wrote:
> >>
> >> 
> >> [ ... patch text and example trimmed ... ]
> >
> > Yes, a bad outcome is perfectly legal amidst data race. Who said it is not legal?
>
> My understanding is that it is legal even without data race, and the compiler only keeps the order of volatile access.

Yes, but I can bet that the author of the code would not have
intended such an outcome; if they had, Will wouldn't have been
debugging it ;-). That's why I called it a bad outcome. The goal of
this patch is to document such a possible unintentional outcome.

>
> Well, IIUC, according to the C standard, the compiler can do anything if there is a data race (undefined behavior).
>
> However, what if a write is not protected with WRITE_ONCE and the read is marked with READ_ONCE?
> That’s also a data race, right? But the kernel considers it is Okay if the write is machine word aligned.

Yes, but there is a compiler between the HLL code and what the
processor sees, and it can tear the write. Without WRITE_ONCE(),
what prevents store-tearing? See [1]. My understanding is that
it is OK only if the reader did a NULL check. In that case the torn
result will not change the semantics of the program. But otherwise,
that's bad.

[1] https://lwn.net/Articles/793253/#Store%20Tearing

thanks,

 - Joel


  
Alan Huang Aug. 3, 2023, 7:25 p.m. UTC | #5
> On Aug. 4, 2023, at 00:01, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> On Thu, Aug 3, 2023 at 9:36 AM Alan Huang <mmpgouride@gmail.com> wrote:
>> 
>> 
>>> On Aug. 3, 2023, at 8:35 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>>> 
>>> 
>>> 
>>>> On Aug 3, 2023, at 8:09 AM, Alan Huang <mmpgouride@gmail.com> wrote:
>>>> 
>>>> 
>>> [ ... earlier rounds of the thread trimmed ... ]
>> 
>> Well, IIUC, according to the C standard, the compiler can do anything if there is a data race (undefined behavior).
>> 
>> However, what if a write is not protected with WRITE_ONCE and the read is marked with READ_ONCE?
>> That’s also a data race, right? But the kernel considers it is Okay if the write is machine word aligned.
> 
> Yes, but there is a compiler between the HLL code and what the
> processor sees which can tear the write. How can not using
> WRITE_ONCE() prevent store-tearing? See [1]. My understanding is that
> it is OK only if the reader did a NULL check. In that case the torn

Yes, a write-write data race where the value written is the same is also fine.

But these are still data races; if the compiler is within its rights to do anything it likes (due to the data race),
we still need WRITE_ONCE() in these cases, even though they are semantically safe.

IIUC, even with _ONCE(), the compiler is within its rights to do anything according to the standard (at least before the upcoming C23), because the standard doesn't consider a volatile access to be atomic.

However, the kernel considers volatile accesses to be atomic, right?

BTW, line 5 in the example is likely to be optimized away. And yes, from the perspective of undefined behavior the compiler can cache the value loaded at line 5, even though I believe that would be a compiler bug from the kernel's perspective.

> result will not change the semantics of the program. But otherwise,
> that's bad.
> 
> [1] https://lwn.net/Articles/793253/#Store%20Tearing
  
Joel Fernandes Aug. 3, 2023, 11:05 p.m. UTC | #6
> On Aug 3, 2023, at 3:26 PM, Alan Huang <mmpgouride@gmail.com> wrote:
> 
> 
>> [ ... earlier rounds of the thread trimmed ... ]

Please trim replies if possible.

> 
> Yes, a write-write data race where the value is the same is also fine.
> 
> But they are still data race, if the compiler is within its right to do anything it likes (due to data race),
> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
> 
> IIUC, even with _ONCE(), the compiler is within its right to do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.
> 
> However, the kernel considers the volatile access to be atomic, right?
> 
> BTW, line 5 in the example is likely to be optimized away. And yes, the compiler can cache the value loaded from line 5 from the perspective of undefined behavior, even if I believe it would be a compiler bug from the perspective of kernel.

I am actually a bit lost with what you are trying to say.  Are you saying that mixing
plain accesses with marked accesses is an acceptable practice? 

I would like others to weigh in as well since I am not seeing what Alan is suggesting.
AFAICS, in the absence of barrier(), any optimization enabled by the plain access
makes mixing it with marked accesses a bad practice.
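For reference, the pattern under discussion can be written as a compilable userspace sketch. READ_ONCE() below is a simplified stand-in for the kernel macro (the real rcu_dereference() additionally carries an address dependency, which this sketch does not model), and gp/do_something() are made-up names; single-threaded here, so no actual race occurs:

```c
#include <assert.h>
#include <stdio.h>

/* Simplified userspace stand-in for the kernel's READ_ONCE(). */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct foo { int a; int b; };

static struct foo *gp;  /* hypothetical RCU-protected pointer */

static int do_something(int a, int b) { return a + b; }

/* Mirrors the patch's example: the plain load of gp is what a compiler
 * could cache and then reuse to satisfy the later marked load. */
static int do_something_gp(void)
{
	struct foo *tmp = gp;            /* plain load of gp */
	printf("Point gp = %p\n", (void *)tmp);
	struct foo *p = READ_ONCE(gp);   /* marked load */
	if (p)
		return do_something(p->a, p->b);
	return -1;
}
```

The concern in the thread is that, since both loads target the same racy location, nothing in the C abstract machine forbids the compiler from feeding the cached plain-loaded value into the uses of p.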

Thanks,

- Joel



> 
>> result will not change the semantics of the program. But otherwise,
>> that's bad.
>> 
>> [1] https://lwn.net/Articles/793253/#Store%20Tearing
>> 
>> thanks,
>> 
>> - Joel
>> 
>> 
>>> 
>>>> 
>>>> Thanks.
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>>> +plain accesses of a memory location with rcu_dereference() of the same memory
>>>>>> +location, in code involved in a data race.
>>>>>> +
>>>>>> In short, updaters use rcu_assign_pointer() and readers use
>>>>>> rcu_dereference(), and these two RCU API elements work together to
>>>>>> ensure that readers have a consistent view of newly added data elements.
>>>>>> --
>>>>>> 2.41.0.585.gd2178a4bd4-goog
> 
>
  
Paul E. McKenney Aug. 4, 2023, 12:01 a.m. UTC | #7
On Fri, Aug 04, 2023 at 03:25:57AM +0800, Alan Huang wrote:
> > 2023年8月4日 00:01,Joel Fernandes <joel@joelfernandes.org> 写道:
> > On Thu, Aug 3, 2023 at 9:36 AM Alan Huang <mmpgouride@gmail.com> wrote:
> >>> 2023年8月3日 下午8:35,Joel Fernandes <joel@joelfernandes.org> 写道:
> >>>> On Aug 3, 2023, at 8:09 AM, Alan Huang <mmpgouride@gmail.com> wrote:
> >>>>> 2023年8月3日 11:24,Joel Fernandes (Google) <joel@joelfernandes.org> 写道:
> >>>>> Add a detailed note to explain the potential side effects of
> >>>>> plain-accessing the gp pointer using a plain load, without using the
> >>>>> rcu_dereference() macros; which might trip neighboring code that does
> >>>>> use rcu_dereference().
> >>>>> 
> >>>>> I haven't verified this with a compiler, but this is what I gather from
> >>>>> the below link using Will's experience with READ_ONCE().
> >>>>> 
> >>>>> Link: https://lore.kernel.org/all/20230728124412.GA21303@willie-the-truck/
> >>>>> Cc: Will Deacon <will@kernel.org>
> >>>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >>>>> ---
> >>>>> .../RCU/Design/Requirements/Requirements.rst  | 32 +++++++++++++++++++
> >>>>> 1 file changed, 32 insertions(+)
> >>>>> 
> >>>>> diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
> >>>>> index f3b605285a87..e0b896d3fb9b 100644
> >>>>> --- a/Documentation/RCU/Design/Requirements/Requirements.rst
> >>>>> +++ b/Documentation/RCU/Design/Requirements/Requirements.rst
> >>>>> @@ -376,6 +376,38 @@ mechanism, most commonly locking or reference counting
> >>>>> .. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF]
> >>>>> .. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
> >>>>> 
> >>>>> +Note that, there can be strange side effects (due to compiler optimizations) if
> >>>>> +``gp`` is ever accessed using a plain load (i.e. without ``READ_ONCE()`` or
> >>>>> +``rcu_dereference()``) potentially hurting any succeeding
> >>>>> +``rcu_dereference()``. For example, consider the code:
> >>>>> +
> >>>>> +   ::
> >>>>> +
> >>>>> +       1 bool do_something_gp(void)
> >>>>> +       2 {
> >>>>> +       3   void *tmp;
> >>>>> +       4   rcu_read_lock();
> >>>>> +       5   tmp = gp; // Plain-load of GP.
> >>>>> +       6   printk("Point gp = %p\n", tmp);
> >>>>> +       7
> >>>>> +       8   p = rcu_dereference(gp);
> >>>>> +       9   if (p) {
> >>>>> +      10     do_something(p->a, p->b);
> >>>>> +      11     rcu_read_unlock();
> >>>>> +      12     return true;
> >>>>> +      13   }
> >>>>> +      14   rcu_read_unlock();
> >>>>> +      15   return false;
> >>>>> +      16 }
> >>>>> +
> >>>>> +The behavior of plain accesses involved in a data race is non-deterministic in
> >>>>> +the face of compiler optimizations. Since accesses to the ``gp`` pointer are
> >>>>> +by design a data race, the compiler could trip this code by caching the value
> >>>>> +of ``gp`` into a register in line 5, and then using the value of the register
> >>>>> +to satisfy the load in line 10. Thus it is important to never mix
> >>>> 
> >>>> Will’s example is:
> >>>> 
> >>>>   // Assume *ptr is initially 0 and somebody else writes it to 1
> >>>>   // concurrently
> >>>> 
> >>>>   foo = *ptr;
> >>>>   bar = READ_ONCE(*ptr);
> >>>>   baz = *ptr;
> >>>> 
> >>>> Then the compiler is within its right to reorder it to:
> >>>> 
> >>>>  foo = *ptr;
> >>>>  baz = *ptr;
> >>>>  bar = READ_ONCE(*ptr);
> >>>> 
> >>>> So, the result foo == baz == 0 but bar == 1 is perfectly legal.
> >>> 
> >>> Yes, a bad outcome is perfectly legal amidst data race. Who said it is not legal?
> >> 
> >> My understanding is that it is legal even without data race, and the compiler only keeps the order of volatile access.
> > 
> > Yes, but I can bet on it the author of the code would not have
> > intended such an outcome, if they did then Will wouldn't have been
> > debugging it ;-). That's why I called it a bad outcome. The goal of
> > this patch is to document such a possible unintentional outcome.
> > 
> >>>> But the example here is different,
> >>> 
> >>> That is intentional. Wills discussion partially triggered this. Though I am wondering
> >>> if we should document that as well.
> >>> 
> >>>> the compiler can not use the value loaded from line 5
> >>>> unless the compiler can deduce that the tmp is equals to p in which case the address dependency
> >>>> doesn’t exist anymore.
> >>>> 
> >>>> What am I missing here?
> >>> 
> >>> Maybe you are trying to rationalize too much that the sequence mentioned cannot result
> >>> in a counter intuitive outcome like I did?
> >>> 
> >>> The point AFAIU is not just about line 10 but that the compiler can replace any of the
> >>> lines after the plain access with the cached value.
> >> 
> >> Well, IIUC, according to the C standard, the compiler can do anything if there is a data race (undefined behavior).
> >> 
> >> However, what if a write is not protected with WRITE_ONCE and the read is marked with READ_ONCE?
> >> That’s also a data race, right? But the kernel considers it is Okay if the write is machine word aligned.
> > 
> > Yes, but there is a compiler between the HLL code and what the
> > processor sees which can tear the write. How can not using
> > WRITE_ONCE() prevent store-tearing? See [1]. My understanding is that
> > it is OK only if the reader did a NULL check. In that case the torn
> 
> Yes, a write-write data race where the value is the same is also fine.
> 
> But they are still data race, if the compiler is within its right to do anything it likes (due to data race),
> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
> 
> IIUC, even with _ONCE(), the compiler is within its right to do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.

Volatile accesses are not specified very well in the standard.  However,
as a practical matter, compilers that wish to be usable for writing device
drivers (whether in kernels or in userspace applications) must compile those
volatile accesses in such a way as to allow reliable device drivers to be
written.

> However, the kernel considers the volatile access to be atomic, right?

The compiler must therefore act as if a volatile access to an aligned
machine-word size location is atomic.  To see this, consider accesses
to memory that is shared by a device driver and that device's firmware,
both of which are written in either C or C++.
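As a rough sketch of the guarantee being relied on, here is a userspace approximation (the macro definitions are simplified from the kernel's rwonce.h; a genuine demonstration of atomicity would of course need two threads or a device on the other side):

```c
#include <assert.h>

/* Simplified forms of the kernel's one-time-access macros: a volatile
 * access to an aligned machine-word-sized object, which the compiler
 * must emit as a single load or store (no tearing, no refetching). */
#define READ_ONCE(x)        (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)  (*(volatile __typeof__(x) *)&(x) = (val))

/* An aligned machine word shared, say, between a driver and firmware. */
static unsigned long shared_word;

static void publish(unsigned long v) { WRITE_ONCE(shared_word, v); }

/* A concurrent reader must observe either the old or the new value in
 * full, never a torn mixture, because the access is one aligned
 * machine-word volatile load. */
static unsigned long snapshot(void) { return READ_ONCE(shared_word); }
```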

Does that help?

							Thanx, Paul

  
Joel Fernandes Aug. 4, 2023, 12:33 p.m. UTC | #8
> On Aug 3, 2023, at 8:01 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
[...]
> 
> Volatile accesses are not specified very well in the standard.  However,
> as a practical matter, compilers that wish to be able to device drivers
> (whether in kernels or userspace applications) must compile those volatile
> accesses in such a way to allow reliable device drivers to be written.

Agreed.

> 
> >> However, the kernel considers the volatile access to be atomic, right?
> 
> The compiler must therefore act as if a volatile access to an aligned
> machine-word size location is atomic.  To see this, consider accesses
> to memory that is shared by a device driver and that device's firmware,
> both of which are written in either C or C++.

Btw it appears TSAN complains bitterly about even volatile 4-byte data races.
Hence we have to explicitly use the atomic API for data-race accesses in Chrome.
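In C11 terms, that workaround looks roughly like this (names are hypothetical; relaxed atomics typically generate the same code as a plain load/store on the targets in question while keeping TSAN quiet):

```c
#include <assert.h>
#include <stdatomic.h>

/* A field read and written concurrently with no ordering requirements:
 * marked _Atomic so TSAN sees an atomic access rather than a data race. */
static _Atomic int racy_flag;

static void set_flag(void)
{
	atomic_store_explicit(&racy_flag, 1, memory_order_relaxed);
}

static int flag_is_set(void)
{
	return atomic_load_explicit(&racy_flag, memory_order_relaxed);
}
```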

Thanks,
Joel



  
Paul E. McKenney Aug. 4, 2023, 2 p.m. UTC | #9
On Fri, Aug 04, 2023 at 08:33:55AM -0400, Joel Fernandes wrote:
[...]
> > 
> > Volatile accesses are not specified very well in the standard.  However,
> > as a practical matter, compilers that wish to be able to device drivers
> > (whether in kernels or userspace applications) must compile those volatile
> > accesses in such a way to allow reliable device drivers to be written.
> 
> Agreed.
> 
> > 
> >> However, the kernel considers the volatile access to be atomic, right?
> > 
> > The compiler must therefore act as if a volatile access to an aligned
> > machine-word size location is atomic.  To see this, consider accesses
> > to memory that is shared by a device driver and that device's firmware,
> > both of which are written in either C or C++.
> 
> Btw it appears TSAN complains bitterly about even volatile 4-byte data races.
> Hence we have to explicitly use the atomic API for data-race accesses in Chrome.

That might have been a conscious and deliberate choice on the part of
the TSAN guys.  Volatile does not imply any ordering in the standard
(other than the compiler avoiding emitting volatile operations out of
order), but some compilers did emit memory-barrier instructions for
volatile accesses.  Which resulted in a lot of problems when such code
found compilers that did not cause the CPU to order volatile operations.

So a lot of people decided to throw the volatile baby out with the
unordered bathwater.  ;-)
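The compiler-order/CPU-order distinction can be sketched as follows (READ_ONCE() stand-in as elsewhere in this thread; the C11 fence plays the role of the kernel's smp_rmb()-style ordering that volatile alone does not provide):

```c
#include <assert.h>
#include <stdatomic.h>

#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int x_val, y_val;

/* The compiler must emit these two volatile loads in source order, but
 * absent the fence nothing stops a weakly ordered CPU from reordering
 * them; the explicit fence is what actually orders the loads. */
static int ordered_reads(void)
{
	int a = READ_ONCE(x_val);
	atomic_thread_fence(memory_order_acquire);
	int b = READ_ONCE(y_val);
	return a + 2 * b;
}
```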

							Thanx, Paul

  
Alan Huang Aug. 4, 2023, 3:47 p.m. UTC | #10
[...]
> 
> I am actually a bit lost with what you are trying to say.  Are you saying that mixing
> plain accesses with marked accesses is an acceptable practice? 


I’m trying to say that sometimes a data race is fine; that’s why we have data_race().

Even though the standard says a data race results in UB.

And IMHO, the possible data race at line 5 in this example is also fine, unless the compiler 
deduces that the value of gp is always the same.
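For readers unfamiliar with it, data_race() is a KCSAN annotation rather than an ordering primitive; outside the kernel it degenerates to something like the sketch below (simplified: the real definition in include/linux/compiler.h also suppresses KCSAN instrumentation around the expression):

```c
#include <assert.h>

/* Simplified stand-in: evaluate the expression exactly once, marking
 * the race as intentional for the (absent, here) race detector.
 * Uses the GCC/Clang statement-expression extension, as the kernel does. */
#define data_race(expr)				\
({						\
	__typeof__(expr) __v = (expr);		\
	__v;					\
})

static int events;  /* hypothetical counter updated by other contexts */

static void record_event(void) { events++; }

/* A deliberately racy read used only for statistics. */
static int events_snapshot(void)
{
	return data_race(events);
}
```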


  
Joel Fernandes Aug. 4, 2023, 4:15 p.m. UTC | #11
On Fri, Aug 4, 2023 at 11:47 AM Alan Huang <mmpgouride@gmail.com> wrote:
>
[...]
> > I am actually a bit lost with what you are trying to say.  Are you saying that mixing
> > plain accesses with marked accesses is an acceptable practice?
>
>
> I’m trying to say that sometimes a data race is fine; that’s why we have data_race().
>
> Even if the standard says data race results in UB.
>
> And IMHO, the possible data race at line 5 in this example is also fine, unless the compiler
> deduces that the value of gp is always the same.

IMHO, no one is saying it is not "fine". As in, such behavior is
neither a compiler bug nor strictly a kernel bug.  More a wtf that the
programmer should know of (it does not hurt to know).

I will rest my case with AlanH pending any input from people who know
more than me. If there is a better way to represent such matters in
the docs, I am happy to make changes to this patch.

Cheers,

 - Joel
  
Joel Fernandes Aug. 4, 2023, 4:17 p.m. UTC | #12
[...]
> > >> However, the kernel consider the volatile access to be atomic, right?
> > >
> > > The compiler must therefore act as if a volatile access to an aligned
> > > machine-word size location is atomic.  To see this, consider accesses
> > > to memory that is shared by a device driver and that device's firmware,
> > > both of which are written in either C or C++.
> >
> > Btw it appears TSAN complains bitterly about even volatile 4-byte data races.
> > Hence we have to explicitly use the atomic API for data-race accesses in Chrome.
>
> That might have been a conscious and deliberate choice on the part of
> the TSAN guys.  Volatile does not imply any ordering in the standard
> (other than the compiler avoiding emitting volatile operations out of
> order), but some compilers did emit memory-barrier instructions for
> volatile accesses.  Which resulted in a lot of problems when such code
> found compilers that did not cause the CPU to order volatile operations.
>
> So a lot of people decided to throw the volatile baby out with the
> unordered bathwater.  ;-)

Thanks for the input, I think TSAN was indeed worried about
memory-ordering even if relaxed ordering was intended. I think there
is a way to tell TSAN to shut up in such situations, but in my last
Chrome sprint, I just used the atomic API with relaxed ordering and
called it a day. :-)

 - Joel
  
Alan Huang Aug. 4, 2023, 4:33 p.m. UTC | #13
>> Yes, a write-write data race where the value is the same is also fine.
>> 
>> But they are still data races; if the compiler is within its rights to do anything it likes (due to a data race),
>> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
>> 
>> IIUC, even with _ONCE(), the compiler is within its rights to do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.
> 
> Volatile accesses are not specified very well in the standard.  However,
> as a practical matter, compilers that wish to be usable for building device
> drivers (whether in kernels or userspace applications) must compile those
> volatile accesses in such a way as to allow reliable device drivers to be written.
> 
> >> However, the kernel considers the volatile access to be atomic, right?
> 
> The compiler must therefore act as if a volatile access to an aligned
> machine-word size location is atomic.  To see this, consider accesses
> to memory that is shared by a device driver and that device's firmware,
> both of which are written in either C or C++.

I learned these things a few months ago. But still thank you!

The real problem is that there may be a data race at line 5, so Joel is correct that the compiler
can cache the value loaded at line 5 according to the standard, given that the standard says a
data race results in undefined behavior, so the compiler might be allowed to do anything. But from the
perspective of the kernel, line 5 is likely a diagnostic read, so it’s fine without READ_ONCE() and the
compiler is not allowed to cache the value.

This situation is like the volatile access.

Am I missing something?

> 
> Does that help?
> 
> Thanx, Paul
> 
>> BTW, line 5 in the example is likely to be optimized away. And yes, as far as the standard's undefined behavior is concerned, the compiler can cache the value loaded at line 5, even if I believe doing so would be a compiler bug from the kernel's perspective.
>> 
>>> result will not change the semantics of the program. But otherwise,
>>> that's bad.
>>> 
>>> [1] https://lwn.net/Articles/793253/#Store%20Tearing
>>> 
>>> thanks,
>>> 
>>> - Joel
>>> 
>>> 
>>>> 
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>>> +plain accesses of a memory location with rcu_dereference() of the same memory
>>>>>>> +location, in code involved in a data race.
>>>>>>> +
>>>>>>> In short, updaters use rcu_assign_pointer() and readers use
>>>>>>> rcu_dereference(), and these two RCU API elements work together to
>>>>>>> ensure that readers have a consistent view of newly added data elements.
>>>>>>> --
>>>>>>> 2.41.0.585.gd2178a4bd4-goog
  
Paul E. McKenney Aug. 4, 2023, 5:02 p.m. UTC | #14
On Fri, Aug 04, 2023 at 12:17:58PM -0400, Joel Fernandes wrote:
> [...]
> > > >> However, the kernel considers the volatile access to be atomic, right?
> > > >
> > > > The compiler must therefore act as if a volatile access to an aligned
> > > > machine-word size location is atomic.  To see this, consider accesses
> > > > to memory that is shared by a device driver and that device's firmware,
> > > > both of which are written in either C or C++.
> > >
> > > Btw it appears TSAN complains bitterly about even volatile 4-byte data races.
> > > Hence we have to explicitly use the atomic API for data-race accesses in Chrome.
> >
> > That might have been a conscious and deliberate choice on the part of
> > the TSAN guys.  Volatile does not imply any ordering in the standard
> > (other than the compiler avoiding emitting volatile operations out of
> > order), but some compilers did emit memory-barrier instructions for
> > volatile accesses.  Which resulted in a lot of problems when such code
> > found compilers that did not cause the CPU to order volatile operations.
> >
> > So a lot of people decided to throw the volatile baby out with the
> > unordered bathwater.  ;-)
> 
> Thanks for the input, I think TSAN was indeed worried about
> memory-ordering even if relaxed ordering was intended. I think there
> is a way to tell TSAN to shut up in such situations, but in my last
> Chrome sprint, I just used the atomic API with relaxed ordering and
> called it a day. :-)

Fair enough!

Note that the Linux kernel's version of TSAN, which is KCSAN, does
interpret volatile accesses more or less as if they were relaxed atomics.

So TSAN could change, but I don't have a dog in that fight.  ;-)

							Thanx, Paul
  
Paul E. McKenney Aug. 4, 2023, 5:27 p.m. UTC | #15
On Sat, Aug 05, 2023 at 12:33:03AM +0800, Alan Huang wrote:
> 
> >> Yes, a write-write data race where the value is the same is also fine.
> >> 
> >> But they are still data races; if the compiler is within its rights to do anything it likes (due to a data race),
> >> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
> >> 
> >> IIUC, even with _ONCE(), the compiler is within its rights to do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.
> > 
> > Volatile accesses are not specified very well in the standard.  However,
> > as a practical matter, compilers that wish to be usable for building device
> > drivers (whether in kernels or userspace applications) must compile those
> > volatile accesses in such a way as to allow reliable device drivers to be written.
> > 
> >> However, the kernel considers the volatile access to be atomic, right?
> > 
> > The compiler must therefore act as if a volatile access to an aligned
> > machine-word size location is atomic.  To see this, consider accesses
> > to memory that is shared by a device driver and that device's firmware,
> > both of which are written in either C or C++.
> 
> I learned these things a few months ago. But still thank you!
> 
> The real problem is that there may be a data race at line 5, so Joel is correct that the compiler
> can cache the value loaded at line 5 according to the standard, given that the standard says a
> data race results in undefined behavior, so the compiler might be allowed to do anything. But from the
> perspective of the kernel, line 5 is likely a diagnostic read, so it’s fine without READ_ONCE() and the
> compiler is not allowed to cache the value.
> 
> This situation is like the volatile access.
> 
> Am I missing something?

I think you have it right.

The point is that we are sometimes more concerned about focusing KCSAN
diagnostics on the core concurrent algorithm, and are willing to take
the very low risk of messed-up diagnostic output in order to get simpler
and better KCSAN diagnostics on the main algorithm.

So in that case, we use data_race() on the diagnostics and other markings
in the main algorithm.

For example, suppose that we had a core algorithm that relied on strict
locking.  In that case, we want to use unmarked plain C-language accesses
in the core algorithm, which will allow KCSAN to flag any accesses that
are not protected by the lock.  But it might be bad for the diagnostic
code to acquire that lock, as this would suppress diagnostics in the case
where the lock was held for too long a time period.  Using data_race()
in the diagnostic code addresses this situation.

							Thanx, Paul


> > Does that help?
> > 
> > Thanx, Paul
> > 
> >> BTW, line 5 in the example is likely to be optimized away. And yes, as far as the standard's undefined behavior is concerned, the compiler can cache the value loaded at line 5, even if I believe doing so would be a compiler bug from the kernel's perspective.
> >> 
> >>> result will not change the semantics of the program. But otherwise,
> >>> that's bad.
> >>> 
> >>> [1] https://lwn.net/Articles/793253/#Store%20Tearing
> >>> 
> >>> thanks,
> >>> 
> >>> - Joel
> >>> 
> >>> 
> >>>> 
> >>>>> 
> >>>>> Thanks.
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>> 
> >>>>>>> +plain accesses of a memory location with rcu_dereference() of the same memory
> >>>>>>> +location, in code involved in a data race.
> >>>>>>> +
> >>>>>>> In short, updaters use rcu_assign_pointer() and readers use
> >>>>>>> rcu_dereference(), and these two RCU API elements work together to
> >>>>>>> ensure that readers have a consistent view of newly added data elements.
> >>>>>>> --
> >>>>>>> 2.41.0.585.gd2178a4bd4-goog
> 
>
  

Patch

diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index f3b605285a87..e0b896d3fb9b 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -376,6 +376,38 @@  mechanism, most commonly locking or reference counting
 .. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF]
 .. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
 
+Note that there can be strange side effects (due to compiler optimizations) if
+``gp`` is ever accessed with a plain load (i.e., without ``READ_ONCE()`` or
+``rcu_dereference()``), potentially breaking any subsequent
+``rcu_dereference()``. For example, consider the code:
+
+   ::
+
+       1 bool do_something_gp(void)
+       2 {
+       3   struct foo *p; void *tmp;
+       4   rcu_read_lock();
+       5   tmp = gp; // Plain-load of GP.
+       6   printk("Point gp = %p\n", tmp);
+       7
+       8   p = rcu_dereference(gp);
+       9   if (p) {
+      10     do_something(p->a, p->b);
+      11     rcu_read_unlock();
+      12     return true;
+      13   }
+      14   rcu_read_unlock();
+      15   return false;
+      16 }
+
+The behavior of plain accesses involved in a data race is non-deterministic in
+the face of compiler optimizations. Since accesses to the ``gp`` pointer are
+by design a data race, the compiler could break this code by caching the value
+of ``gp`` in a register at line 5, and then using that register to satisfy the
+load at line 10. It is therefore important never to mix plain accesses to a
+memory location with ``rcu_dereference()`` of the same memory location in code
+involved in a data race.
+
 In short, updaters use rcu_assign_pointer() and readers use
 rcu_dereference(), and these two RCU API elements work together to
 ensure that readers have a consistent view of newly added data elements.
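
To make the hazard concrete, the caching transformation the new paragraph
warns about amounts to the compiler rewriting the reader roughly as
follows. This is a hypothetical sketch of one permitted compilation (as
the commit message notes, it has not been verified against an actual
compiler):

       bool do_something_gp(void)
       {
         struct foo *p; void *tmp;
         rcu_read_lock();
         tmp = gp;                   // plain load; value cached in a register
         printk("Point gp = %p\n", tmp);

         p = tmp;                    // rcu_dereference(gp) satisfied from the
                                     // cached register value, discarding the
                                     // address dependency
         if (p) {
           do_something(p->a, p->b); // may now observe pre-initialization
           rcu_read_unlock();        // values of the pointed-to structure
           return true;
         }
         rcu_read_unlock();
         return false;
       }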