[v2,1/2] tcp/dcpp: Un-pin tw_timer
Commit Message
The TCP timewait timer is proving to be problematic for setups where scheduler
CPU isolation is achieved at runtime via cpusets (as opposed to statically via
isolcpus=domains).
What happens there is a CPU goes through tcp_time_wait(), arming the time_wait
timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing
interference for the now-isolated CPU. This is conceptually similar to the issue
described in
e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()")
Keep softirqs disabled, but make the timer un-pinned and arm it *after* the
hashdance.
This introduces the following (non-fatal) race:
CPU0 CPU1
allocates a tw
insert it in hash table
finds the TW and removes it
(timer cancel does nothing)
arms a TW timer, lasting
This partially reverts
ed2e92394589 ("tcp/dccp: fix timewait races in timer handling")
and
ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
This also reinstores a comment from
ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step
2" had gone missing.
Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
net/dccp/minisocks.c | 16 +++++++---------
net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++-----
net/ipv4/tcp_minisocks.c | 16 +++++++---------
3 files changed, 29 insertions(+), 23 deletions(-)
Comments
On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> The TCP timewait timer is proving to be problematic for setups where scheduler
> CPU isolation is achieved at runtime via cpusets (as opposed to statically via
> isolcpus=domains).
>
> What happens there is a CPU goes through tcp_time_wait(), arming the time_wait
> timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing
> interference for the now-isolated CPU. This is conceptually similar to the issue
> described in
> e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()")
>
> Keep softirqs disabled, but make the timer un-pinned and arm it *after* the
> hashdance.
>
> This introduces the following (non-fatal) race:
>
> CPU0 CPU1
> allocates a tw
> insert it in hash table
> finds the TW and removes it
> (timer cancel does nothing)
> arms a TW timer, lasting
>
> This partially reverts
> ed2e92394589 ("tcp/dccp: fix timewait races in timer handling")
> and
> ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
>
> This also reinstores a comment from
> ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance")
> as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step
> 2" had gone missing.
>
> Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> net/dccp/minisocks.c | 16 +++++++---------
> net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++-----
> net/ipv4/tcp_minisocks.c | 16 +++++++---------
> 3 files changed, 29 insertions(+), 23 deletions(-)
>
> diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
> index 64d805b27adde..2f0fad4255e36 100644
> --- a/net/dccp/minisocks.c
> +++ b/net/dccp/minisocks.c
> @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
> if (state == DCCP_TIME_WAIT)
> timeo = DCCP_TIMEWAIT_LEN;
>
> - /* tw_timer is pinned, so we need to make sure BH are disabled
> - * in following section, otherwise timer handler could run before
> - * we complete the initialization.
> - */
> - local_bh_disable();
> - inet_twsk_schedule(tw, timeo);
> - /* Linkage updates.
> - * Note that access to tw after this point is illegal.
> - */
> + local_bh_disable();
> +
> + // Linkage updates
> inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> + inet_twsk_schedule(tw, timeo);
We could arm a timer there, while another thread/cpu found the TW in
the ehash table.
> + // Access to tw after this point is illegal.
> + inet_twsk_put(tw);
This would eventually call inet_twsk_free() while the timer is armed.
I think more work is needed.
Perhaps make sure that a live timer owns a reference on tw->tw_refcnt
(This is not the case atm)
Hi Eric,
On Mon, 20 Nov 2023 at 18:56, Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
> > if (state == DCCP_TIME_WAIT)
> > timeo = DCCP_TIMEWAIT_LEN;
> >
> > - /* tw_timer is pinned, so we need to make sure BH are disabled
> > - * in following section, otherwise timer handler could run before
> > - * we complete the initialization.
> > - */
> > - local_bh_disable();
> > - inet_twsk_schedule(tw, timeo);
> > - /* Linkage updates.
> > - * Note that access to tw after this point is illegal.
> > - */
> > + local_bh_disable();
> > +
> > + // Linkage updates
> > inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> > + inet_twsk_schedule(tw, timeo);
>
> We could arm a timer there, while another thread/cpu found the TW in
> the ehash table.
>
>
>
> > + // Access to tw after this point is illegal.
> > + inet_twsk_put(tw);
>
> This would eventually call inet_twsk_free() while the timer is armed.
>
> I think more work is needed.
>
> Perhaps make sure that a live timer owns a reference on tw->tw_refcnt
> (This is not the case atm)
>
I thought that was already the case, per inet_twsk_hashdance():
/* tw_refcnt is set to 3 because we have :
* - one reference for bhash chain.
* - one reference for ehash chain.
* - one reference for timer.
and
tw_timer_handler()
`\
inet_twsk_kill()
`\
inet_twsk_put()
So AFAICT, after we go through the hashdance, there's a reference on
tw_refcnt held by the tw_timer.
inet_twsk_deschedule_put() can race with arming the timer, but it only
calls inet_twsk_kill() if the timer
was already armed & has been deleted, so there's no risk of calling it
twice... If I got it right :-)
On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> Hi Eric,
>
> On Mon, 20 Nov 2023 at 18:56, Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > >
> > > @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
> > > if (state == DCCP_TIME_WAIT)
> > > timeo = DCCP_TIMEWAIT_LEN;
> > >
> > > - /* tw_timer is pinned, so we need to make sure BH are disabled
> > > - * in following section, otherwise timer handler could run before
> > > - * we complete the initialization.
> > > - */
> > > - local_bh_disable();
> > > - inet_twsk_schedule(tw, timeo);
> > > - /* Linkage updates.
> > > - * Note that access to tw after this point is illegal.
> > > - */
> > > + local_bh_disable();
> > > +
> > > + // Linkage updates
> > > inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> > > + inet_twsk_schedule(tw, timeo);
> >
> > We could arm a timer there, while another thread/cpu found the TW in
> > the ehash table.
> >
> >
> >
> > > + // Access to tw after this point is illegal.
> > > + inet_twsk_put(tw);
> >
> > This would eventually call inet_twsk_free() while the timer is armed.
> >
> > I think more work is needed.
> >
> > Perhaps make sure that a live timer owns a reference on tw->tw_refcnt
> > (This is not the case atm)
> >
>
> I thought that was already the case, per inet_twsk_hashdance():
>
> /* tw_refcnt is set to 3 because we have :
> * - one reference for bhash chain.
> * - one reference for ehash chain.
> * - one reference for timer.
>
> and
>
> tw_timer_handler()
> `\
> inet_twsk_kill()
> `\
> inet_twsk_put()
>
> So AFAICT, after we go through the hashdance, there's a reference on
> tw_refcnt held by the tw_timer.
> inet_twsk_deschedule_put() can race with arming the timer, but it only
> calls inet_twsk_kill() if the timer
> was already armed & has been deleted, so there's no risk of calling it
> twice... If I got it right :-)
>
Again, I think you missed some details.
I am OOO for a few days, I do not have time to elaborate.
You will need to properly track active timer by elevating
tw->tw_refcnt, or I guarantee something wrong will happen.
On Thu, 23 Nov 2023 at 17:32, Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > I thought that was already the case, per inet_twsk_hashdance():
> >
> > /* tw_refcnt is set to 3 because we have :
> > * - one reference for bhash chain.
> > * - one reference for ehash chain.
> > * - one reference for timer.
> >
> > and
> >
> > tw_timer_handler()
> > `\
> > inet_twsk_kill()
> > `\
> > inet_twsk_put()
> >
> > So AFAICT, after we go through the hashdance, there's a reference on
> > tw_refcnt held by the tw_timer.
> > inet_twsk_deschedule_put() can race with arming the timer, but it only
> > calls inet_twsk_kill() if the timer
> > was already armed & has been deleted, so there's no risk of calling it
> > twice... If I got it right :-)
> >
>
> Again, I think you missed some details.
>
> I am OOO for a few days, I do not have time to elaborate.
>
> You will need to properly track active timer by elevating
> tw->tw_refcnt, or I guarantee something wrong will happen.
>
Gotcha, let me dig into this then!
On 23/11/23 17:32, Eric Dumazet wrote:
> Again, I think you missed some details.
>
> I am OOO for a few days, I do not have time to elaborate.
>
> You will need to properly track active timer by elevating
> tw->tw_refcnt, or I guarantee something wrong will happen.
I apologize if I'm being thick skulled, I've been staring at the code and
tracing on live systems and I still can't see the issue with refcounts.
The patch has the hashdance set the refcount to 4:
* - one reference for bhash chain.
* - one reference for ehash chain.
* - one reference for timer.
* - one reference for ourself (our caller will release it).
AFAICT, finding & using the socket via the ehash table comes with a refcount_inc
(e.g. __inet_lookup_established()).
Worst case, the lookup happens after programming the timer, and we get a
inet_twsk_deschedule_put(). This reduces the refcount by:
3 via inet_twsk_kill():
1 (sk_nulls_del_node_init_rcu())
1 (inet_twsk_bind_unhash())
1 (inet_twsk_put())
1 via inet_twsk_put()
IOW 4 total. So we can have:
tcp_time_wait()
inet_twsk_hashdance(); // refcount = 4
inet_twsk_schedule(); // timer armed
tcp_v4_rcv()
sk = __inet_lookup_skb(); // refcount = 5 (+1)
inet_twsk_deschedule_put(inet_twsk(sk));
inet_twsk_kill(tw) // refcount = 2 (-3)
inet_twsk_put(tw) // refcount = 1 (-1)
inet_twsk_put(tw) // refcount = 0 (-1)
__inet_hash_connect() can invoke inet_twsk_bind_unhash() by itself before
calling inet_twsk_deschedule_put(), but that just means it won't be done by
the latter, so the total refcount delta remains the same.
Thinking about it differently, this would mean that currently (without the
patch) another CPU can bring the refcount to 0 without disarming the timer,
because the patch is making the initial value one higher.
What am I missing?
Hi,
On Thu, 2023-11-23 at 17:32 +0100, Eric Dumazet wrote:
> On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > So AFAICT, after we go through the hashdance, there's a reference on
> > tw_refcnt held by the tw_timer.
> > inet_twsk_deschedule_put() can race with arming the timer, but it only
> > calls inet_twsk_kill() if the timer
> > was already armed & has been deleted, so there's no risk of calling it
> > twice... If I got it right :-)
>
> Again, I think you missed some details.
>
> I am OOO for a few days, I do not have time to elaborate.
>
> You will need to properly track active timer by elevating
> tw->tw_refcnt, or I guarantee something wrong will happen.
I'm sorry to bring this up again, but I tried to understand what is
missing in Valentin's patch and I could not find it.
Direct link to the patch, just in case the thread has been lost:
https://lore.kernel.org/lkml/20231115210509.481514-2-vschneid@redhat.com/
The patch raises the initial tw->tw_refcnt to 4, so it tracks (in
advance) the reference for the tw_timer. AFAICS the patch is still
prone to the race you mentioned on the RFC:
CPU0:
allocates a tw, insert it in hash table
CPU1:
finds the TW and removes it (timer cancel does nothing)
CPU0:
arms a TW timer, lasting
but I understood such race is acceptable.
Could you please shed some light?
Many thanks,
Paolo
@@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
if (state == DCCP_TIME_WAIT)
timeo = DCCP_TIMEWAIT_LEN;
- /* tw_timer is pinned, so we need to make sure BH are disabled
- * in following section, otherwise timer handler could run before
- * we complete the initialization.
- */
- local_bh_disable();
- inet_twsk_schedule(tw, timeo);
- /* Linkage updates.
- * Note that access to tw after this point is illegal.
- */
+ local_bh_disable();
+
+ // Linkage updates
inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
+ inet_twsk_schedule(tw, timeo);
+ // Access to tw after this point is illegal.
+ inet_twsk_put(tw);
+
local_bh_enable();
} else {
/* Sorry, if we're out of memory, just CLOSE this
@@ -144,6 +144,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
spin_lock(lock);
+ /* Step 2: Hash TW into tcp ehash chain */
inet_twsk_add_node_rcu(tw, &ehead->chain);
/* Step 3: Remove SK from hash chain */
@@ -152,16 +153,15 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
spin_unlock(lock);
- /* tw_refcnt is set to 3 because we have :
+ /* tw_refcnt is set to 4 because we have :
* - one reference for bhash chain.
* - one reference for ehash chain.
* - one reference for timer.
+ * - one reference for ourself (our caller will release it).
* We can use atomic_set() because prior spin_lock()/spin_unlock()
* committed into memory all tw fields.
- * Also note that after this point, we lost our implicit reference
- * so we are not allowed to use tw anymore.
*/
- refcount_set(&tw->tw_refcnt, 3);
+ refcount_set(&tw->tw_refcnt, 4);
}
EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
@@ -207,7 +207,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
tw->tw_prot = sk->sk_prot_creator;
atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
twsk_net_set(tw, sock_net(sk));
- timer_setup(&tw->tw_timer, tw_timer_handler, TIMER_PINNED);
+ timer_setup(&tw->tw_timer, tw_timer_handler, 0);
/*
* Because we use RCU lookups, we should not set tw_refcnt
* to a non null value before everything is setup for this
@@ -232,6 +232,16 @@ EXPORT_SYMBOL_GPL(inet_twsk_alloc);
*/
void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
{
+ /* This can race with tcp_time_wait() and dccp_time_wait(), as the timer
+ * is armed /after/ adding it to the hashtables.
+ *
+ * If this is interleaved between inet_twsk_hashdance() and inet_twsk_put(),
+ * then this is a no-op: the timer will still end up armed.
+ *
+ * Conversely, if this successfully deletes the timer, then we know we
+ * have already gone through {tcp,dcpp}_time_wait(), and we can safely
+ * call inet_twsk_kill().
+ */
if (del_timer_sync(&tw->tw_timer))
inet_twsk_kill(tw);
inet_twsk_put(tw);
@@ -338,16 +338,14 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
if (state == TCP_TIME_WAIT)
timeo = TCP_TIMEWAIT_LEN;
- /* tw_timer is pinned, so we need to make sure BH are disabled
- * in following section, otherwise timer handler could run before
- * we complete the initialization.
- */
- local_bh_disable();
- inet_twsk_schedule(tw, timeo);
- /* Linkage updates.
- * Note that access to tw after this point is illegal.
- */
+ local_bh_disable();
+
+ // Linkage updates.
inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
+ inet_twsk_schedule(tw, timeo);
+ // Access to tw after this point is illegal.
+ inet_twsk_put(tw);
+
local_bh_enable();
} else {
/* Sorry, if we're out of memory, just CLOSE this