Periodically expire flows from flow offload tables
Commit Message
When a flow is added to a flow table for SW- or HW-offload,
the user has no means of controlling the flow once it has
been offloaded. If a number of firewall rules have been made using
time schedules, these rules don't apply to the already
offloaded flows. Adding new firewall rules also doesn't affect
already offloaded flows.

This patch handles flow table retirement, giving the user the option
to at least periodically get the flow back under the control of the
firewall rules, so already offloaded flows can be dropped or be
pushed back to the flow offload tables.

Flow retirement is disabled by default and can be set in seconds
using, e.g., sysctl -w net.netfilter.nf_flowtable_retire=30
Signed-off-by: Michael Lilja <michael.lilja@gmail.com>
---
.../networking/nf_conntrack-sysctl.rst | 7 ++++++
include/net/netfilter/nf_flow_table.h | 1 +
include/net/netns/conntrack.h | 3 +++
net/netfilter/nf_conntrack_standalone.c | 17 ++++++++++++++
net/netfilter/nf_flow_table_core.c | 23 +++++++++++++++----
5 files changed, 47 insertions(+), 4 deletions(-)
Comments
Hi,
On Sun, Oct 23, 2022 at 07:16:58PM +0200, Michael Lilja wrote:
> When a flow is added to a flow table for SW- or HW-offload,
> the user has no means of controlling the flow once it has
> been offloaded. If a number of firewall rules have been made using
> time schedules, these rules don't apply to the already
> offloaded flows. Adding new firewall rules also doesn't affect
> already offloaded flows.
>
> This patch handles flow table retirement, giving the user the option
> to at least periodically get the flow back under the control of the
> firewall rules, so already offloaded flows can be dropped or be
> pushed back to the flow offload tables.
>
> Flow retirement is disabled by default and can be set in seconds
> using, e.g., sysctl -w net.netfilter.nf_flowtable_retire=30
What does your ruleset look like? Could you detail your use case?
Thanks.
Hi,
No problem. Here is a snippet of the rulesets in play. I simplified it because there are a lot of devices and a lot of schedules per device. The 'mark' is set by userspace, so not all flow types are offloaded; that is controlled by userspace:
- - - - snip start - - - -
table inet fw4 {
	flowtable ft {
		hook ingress priority filter
		devices = { lan1, lan2, wan }
		flags offload
	}

	chain mangle_forward {
		type filter hook forward priority mangle; policy accept;
		meta mark set ct mark
		meta mark 0x00000000/16 queue flags bypass to 0
	}

	chain my_devices_rules {
		ether saddr 96:68:97:a7:e8:a7 jump fw_p0_dev0 comment "Device match"
	}

	chain fw_p0_dev0 {
		meta time >= "2022-10-09 18:46:50" meta time < "2022-10-09 19:16:50" counter packets 0 bytes 0 drop comment "!Schedule OFFLINE override"
		meta day "Tuesday" meta hour >= "06:00" meta hour < "07:00" drop
	}

	chain forward {
		type filter hook forward priority filter; policy accept;
		jump my_devices_rules
	}

	chain my_forward_offload {
		type filter hook forward priority filter + 1; policy accept;
		meta mark != 0x00000000/16 meta l4proto { tcp, udp } flow add @ft
	}

	chain mangle_postrouting {
		type filter hook postrouting priority mangle; policy accept;
		ct mark set meta mark
	}
}
- - - - snip end - - - -
The use case is that I have schedules per device to control when they are allowed access to the internet, and if the flows are offloaded, they will not get dropped once a schedule kicks in.
Thanks
> On 25 Oct 2022, at 13.05, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> Hi,
>
> On Sun, Oct 23, 2022 at 07:16:58PM +0200, Michael Lilja wrote:
>> When a flow is added to a flow table for SW- or HW-offload,
>> the user has no means of controlling the flow once it has
>> been offloaded. If a number of firewall rules have been made using
>> time schedules, these rules don't apply to the already
>> offloaded flows. Adding new firewall rules also doesn't affect
>> already offloaded flows.
>>
>> This patch handles flow table retirement, giving the user the option
>> to at least periodically get the flow back under the control of the
>> firewall rules, so already offloaded flows can be dropped or be
>> pushed back to the flow offload tables.
>>
>> Flow retirement is disabled by default and can be set in seconds
>> using, e.g., sysctl -w net.netfilter.nf_flowtable_retire=30
>
> What does your ruleset look like? Could you detail your use case?
>
> Thanks.
Hi,
On Tue, Oct 25, 2022 at 02:36:35PM +0200, Michael Lilja wrote:
> Hi,
>
> No problem. Here is a snippet of the rulesets in play. I simplified it because there are a lot of devices and a lot of schedules per device. The 'mark' is set by userspace, so not all flow types are offloaded; that is controlled by userspace:
>
> - - - - snip start - - - -
> table inet fw4 {
> 	flowtable ft {
> 		hook ingress priority filter
> 		devices = { lan1, lan2, wan }
> 		flags offload
> 	}
>
> 	chain mangle_forward {
> 		type filter hook forward priority mangle; policy accept;
> 		meta mark set ct mark
> 		meta mark 0x00000000/16 queue flags bypass to 0
> 	}
>
> 	chain my_devices_rules {
> 		ether saddr 96:68:97:a7:e8:a7 jump fw_p0_dev0 comment "Device match"
> 	}
>
> 	chain fw_p0_dev0 {
> 		meta time >= "2022-10-09 18:46:50" meta time < "2022-10-09 19:16:50" counter packets 0 bytes 0 drop comment "!Schedule OFFLINE override"
> 		meta day "Tuesday" meta hour >= "06:00" meta hour < "07:00" drop
> 	}
>
> 	chain forward {
> 		type filter hook forward priority filter; policy accept;
> 		jump my_devices_rules
> 	}
>
> 	chain my_forward_offload {
> 		type filter hook forward priority filter + 1; policy accept;
> 		meta mark != 0x00000000/16 meta l4proto { tcp, udp } flow add @ft
> 	}
>
> 	chain mangle_postrouting {
> 		type filter hook postrouting priority mangle; policy accept;
> 		ct mark set meta mark
> 	}
> }
> - - - - snip end - - - -
>
> The use case is that I have schedules per device to control when
> they are allowed access to the internet, and if the flows are
> offloaded, they will not get dropped once a schedule kicks in.
Thanks for explaining.
I suggest moving your 'forward' chain to netdev/ingress using priority
'filter - 1', so the time schedule evaluation is always done before
the flowtable lookup; that is, the schedule rules will always be
evaluated.
In your example, you are using a linear ruleset, which might defeat
the purpose of the flowtable. So I'm attaching a new ruleset
transformed to use maps and the ingress chain as suggested.
table netdev filter {
	map ether_to_chain {
		typeof ether saddr : verdict
		elements = { 96:68:97:a7:e8:a7 comment "Device match" : jump fw_p0_dev0 }
	}

	map schedule_time {
		typeof meta time : verdict
		flags interval
		counter
		elements = { "2022-10-09 18:46:50" - "2022-10-09 19:16:50" comment "!Schedule OFFLINE override" : drop }
	}

	map schedule_day {
		typeof meta day . meta hour : verdict
		flags interval
		counter
		elements = { "Tuesday" . "06:00" - "07:00" : drop }
	}

	chain fw_p0_dev0 {
		meta time vmap @schedule_time
		meta day . meta hour vmap @schedule_day
	}

	chain my_devices_rules {
		ether saddr vmap @ether_to_chain
	}

	chain ingress {
		type filter hook ingress device eth0 priority filter; policy accept;
		jump my_devices_rules
	}
}
Hi,
Thanks for the optimisation suggestions; my nft is a rough conversion from iptables, so I will look into using maps.
The ingress chain will work fine for SW OFFLOAD, but HW OFFLOAD is not solved by this; at least what I see is that once a flow is offloaded to HW, the driver doesn't see the packets.
If I use the ingress chain, I guess I don't have access to 'ct mark' yet? I could think of a use case where schedules should only apply to some 'flow type':
meta mark != 0x12340000/16 meta day "Tuesday" meta hour >= "06:00" meta hour < "07:00" drop
I have more advanced rules that check the ct mark and will need to drop if mark == something. These mark == something rules are applied at runtime, and flowtables don't seem to be flushed on nft load, which is also a reason for my 'flow retire' from the tables.
So my overall goal is to receive packets, mark them with a value depending on 'flow type', and then, for the flows that are allowed to be forwarded, offload them to the ingress flow table for either HW or SW offload. Once in a while I will change the verdict of a 'flow type' and will need that to apply to all existing flows and future flows, besides the fixed schedules, and it should work both for SW OFFLOAD and HW OFFLOAD.
I only have the MT7621 device to play with for HW OFFLOAD, but it works fine with my patch.
Thanks
>
>
> I suggest moving your 'forward' chain to netdev/ingress using priority
> 'filter - 1', so the time schedule evaluation is always done before
> the flowtable lookup; that is, the schedule rules will always be
> evaluated.
>
> In your example, you are using a linear ruleset, which might defeat
> the purpose of the flowtable. So I'm attaching a new ruleset
> transformed to use maps and the ingress chain as suggested.
> <schedules.nft>
Hi,
On Tue, Oct 25, 2022 at 03:32:51PM +0200, Michael Lilja wrote:
> Hi,
>
> Thanks for the optimisation suggestions; my nft is a rough
> conversion from iptables, so I will look into using maps.
>
> The ingress chain will work fine for SW OFFLOAD, but HW OFFLOAD is
> not solved by this; at least what I see is that once a flow is
> offloaded to HW, the driver doesn't see the packets.
>
> If I use the ingress chain, I guess I don't have access to 'ct mark'
> yet? I could think of a use case where schedules should only apply
> to some 'flow type': meta mark != 0x12340000/16 meta day "Tuesday"
> meta hour >= "06:00" meta hour < "07:00" drop
>
> I have more advanced rules that check the ct mark and will need to
> drop if mark == something. These mark == something rules are applied
> at runtime, and flowtables don't seem to be flushed on nft load,
> which is also a reason for my 'flow retire' from the tables.
It should also be possible to notify the flowtable that the ruleset
has been updated. That won't cover the meta day, hour, and time
scenario, though. I think both mechanisms (the 'retire' feature you
propose and ruleset update notifications) are complementary to each
other, and it would be good to have both.
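
[Editor's sketch] As a rough illustration of that notification idea (not an existing interface; both function names below are hypothetical): a ruleset-update handler could walk the flowtable and request teardown of every entry, reusing flow_offload_teardown() and assuming the nf_flow_table_iterate() walker with the three-argument callback signature seen in nf_flow_offload_gc_step() below.

/* Hypothetical sketch of a ruleset-update notification: request
 * teardown for every offloaded flow; the next gc run then removes the
 * entries from the SW/HW tables and packets flow through the ruleset
 * again.
 */
static void nf_flow_table_retire_one(struct nf_flowtable *flow_table,
				     struct flow_offload *flow, void *data)
{
	flow_offload_teardown(flow);
}

static void nf_flow_table_ruleset_updated(struct nf_flowtable *flow_table)
{
	nf_flow_table_iterate(flow_table, nf_flow_table_retire_one, NULL);
}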
> So my overall goal is to receive packets, mark them with a value
> depending on 'flow type', and then, for the flows that are allowed
> to be forwarded, offload them to the ingress flow table for either
> HW or SW offload. Once in a while I will change the verdict of a
> 'flow type' and will need that to apply to all existing flows and
> future flows, besides the fixed schedules, and it should work both
> for SW OFFLOAD and HW OFFLOAD.
>
> I only have the MT7621 device to play with for HW OFFLOAD, but it
> works fine with my patch.
Thanks for explaining.
My suggestions are:
- Add support for this in the flowtable netlink interface (instead of
  a sysctl). I'm going to post a patch to add support for setting the
  flowtable size; it can be used as a reference to expose this new
  'retire' feature.
- flow_offload_teardown() already unsets the IPS_OFFLOAD bit, so
  probably your patch can follow that path too (instead of clearing
  IPS_OFFLOAD_BIT from flow_offload_del()).
static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table,
				    struct flow_offload *flow, void *data)
{
	if (nf_flow_has_expired(flow) ||
	    nf_ct_is_dying(flow->ct))
		flow_offload_teardown(flow);
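
[Editor's sketch] A minimal sketch of that suggestion, assuming the nf_flow_has_retired() helper from the patch: the retire check simply joins the existing expiry conditions, so a retired flow takes the same teardown path as an expired or dying one.

static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table,
				    struct flow_offload *flow, void *data)
{
	/* Treat retirement like expiry: both end up in teardown. */
	if (nf_flow_has_expired(flow) ||
	    nf_ct_is_dying(flow->ct) ||
	    nf_flow_has_retired(flow))
		flow_offload_teardown(flow);
	/* ... rest of the gc step unchanged ... */
}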
Hi,
I will look into using the flowtable netlink interface. I have not tried it yet, but does this possibly give the option of doing something like this:
flowtable ft {
	hook ingress priority filter
	devices = { lan1, lan2, wan }
	flags offload, timeout
}
I would say the above is the most flexible; I just didn't explore that. It would kind of be like 'sets', where you can specify a timeout for when the entries should expire?
With regards to the IPS_OFFLOAD clear in flow_offload_del(): I added that because I saw some weird timeout side effects due to flow_offload_fixup_ct(), but I can re-investigate; it could be that it was early in my investigations and some of the other changes I made have made it obsolete.
Thanks
Michael
Hi,
I just quickly tried following the flow_offload_teardown() path instead of clearing IPS_OFFLOAD in flow_offload_del(), and it does have some side effects: the flow is added again before the HW has actually reported it to be NF_FLOW_HW_DEAD.
The sequence with my patch is:
: Retire -> Remove from hw tables -> Remove from sw tables -> kfree(flow) -> flow_offload_add()
But if flow_offload_teardown() is called on expire I see:
: Retire -> Remove from hw tables -> flow_offload_add() -> Remove from sw tables -> kfree(flow)
I need to investigate why this happens. Maybe the IPS_OFFLOAD flag is cleared too early and should not be cleared until the flow is actually removed, like I do? Maybe the issue was not seen before because, on timeout or when the conntrack is dying, no packets arrive to create the flow again prematurely?
Thanks,
Michael
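
[Editor's note] The early re-add is consistent with how flow_offload_teardown() looked around v6.0; the snippet below is an approximation of that code, with a comment marking the suspected race window:

/* net/netfilter/nf_flow_table_core.c (approximate v6.0-era code) */
void flow_offload_teardown(struct flow_offload *flow)
{
	set_bit(NF_FLOW_TEARDOWN, &flow->flags);

	/* IPS_OFFLOAD is cleared here, while the entry may still sit in
	 * the SW/HW tables waiting for NF_FLOW_HW_DEAD; a packet that
	 * arrives in this window no longer sees IPS_OFFLOAD on the
	 * conntrack entry and can hit the 'flow add' rule again,
	 * calling flow_offload_add() before the old entry is freed.
	 */
	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
	flow_offload_fixup_ct(flow->ct);
}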
On Wed, Oct 26, 2022 at 09:40:11PM +0200, Michael Lilja wrote:
> Hi,
>
> I just quickly tried following the flow_offload_teardown() path instead of clearing IPS_OFFLOAD in flow_offload_del(), and it does have some side effects: the flow is added again before the HW has actually reported it to be NF_FLOW_HW_DEAD.
>
> The sequence with my patch is:
> : Retire -> Remove from hw tables -> Remove from sw tables -> kfree(flow) -> flow_offload_add()
>
> But if flow_offload_teardown() is called on expire I see:
> : Retire -> Remove from hw tables -> flow_offload_add() -> Remove from sw tables -> kfree(flow)
>
> I need to investigate why this happens. Maybe the IPS_OFFLOAD flag is cleared too early and should not be cleared until the flow is actually removed, like I do? Maybe the issue was not seen before because, on timeout or when the conntrack is dying, no packets arrive to create the flow again prematurely?
Hm, IPS_OFFLOAD should be cleared from flow_offload_del() then; it is
cleared too early.
I'll post a fix for nf.git first; then I propose to follow up on this
flowtable feature. I'll keep you on Cc.
Hi,
thanks. I have been too busy elsewhere, so I have not yet looked at
patching the IPS_OFFLOAD handling myself. When I quickly looked at it
last week, IPS_OFFLOAD got set in a few places, so I'm not sure
whether there is still a race somewhere.
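
[Editor's note] For context when reading the patch below: the retire deadline is stored in jiffies and checked with the existing nf_flow_timeout_delta() helper from include/net/netfilter/nf_flow_table.h, which goes negative once a deadline has passed and is safe across jiffies wraparound:

/* Existing helpers in include/net/netfilter/nf_flow_table.h */
#define nf_flowtable_time_stamp	(u32)jiffies

static inline __s32 nf_flow_timeout_delta(unsigned int timeout)
{
	/* Signed result: negative once 'timeout' lies in the past,
	 * correct even when jiffies wraps around.
	 */
	return (__s32)(timeout - nf_flowtable_time_stamp);
}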
@@ -201,3 +201,10 @@ nf_flowtable_udp_timeout - INTEGER (seconds)
 	Control offload timeout for udp connections.
 	UDP connections may be offloaded from nf conntrack to nf flow table.
 	Once aged, the connection is returned to nf conntrack with udp pickup timeout.
+
+nf_flowtable_retire - INTEGER (seconds)
+	- 0 - disabled (default)
+	- not 0 - enabled and sets the number of seconds a flow stays offloaded
+
+	If this option is enabled, offloaded flows retire periodically and return
+	control of the flow to conntrack/netfilter.
@@ -177,6 +177,7 @@ struct flow_offload {
 	unsigned long				flags;
 	u16					type;
 	u32					timeout;
+	u32					retire;
 	struct rcu_head				rcu_head;
 };
@@ -110,5 +110,8 @@ struct netns_ct {
 #if defined(CONFIG_NF_CONNTRACK_LABELS)
 	unsigned int		labels_used;
 #endif
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+	unsigned int		sysctl_flowtable_retire;
+#endif
 };
 #endif
@@ -620,6 +620,9 @@ enum nf_ct_sysctl_index {
 #ifdef CONFIG_LWTUNNEL
 	NF_SYSCTL_CT_LWTUNNEL,
 #endif
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+	NF_SYSCTL_CT_FLOWTABLE_RETIRE,
+#endif
 	__NF_SYSCTL_CT_LAST_SYSCTL,
 };
@@ -967,6 +970,15 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= nf_hooks_lwtunnel_sysctl_handler,
 	},
+#endif
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+	[NF_SYSCTL_CT_FLOWTABLE_RETIRE] = {
+		.procname	= "nf_flowtable_retire",
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.data		= &init_net.ct.sysctl_flowtable_retire,
+		.proc_handler	= proc_dointvec_jiffies,
+	},
 #endif
 	{}
 };
@@ -1111,6 +1123,11 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 	nf_conntrack_standalone_init_dccp_sysctl(net, table);
 	nf_conntrack_standalone_init_gre_sysctl(net, table);
 
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
+	/* Disable retire per default */
+	net->ct.sysctl_flowtable_retire = 0;
+#endif
+
 	/* Don't allow non-init_net ns to alter global sysctls */
 	if (!net_eq(&init_net, net)) {
 		table[NF_SYSCTL_CT_MAX].mode = 0444;
@@ -285,6 +285,12 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 	int err;
 
 	flow->timeout = nf_flowtable_time_stamp + flow_offload_get_timeout(flow);
+	if (nf_ct_net(flow->ct)->ct.sysctl_flowtable_retire) {
+		flow->retire = nf_flowtable_time_stamp +
+			       nf_ct_net(flow->ct)->ct.sysctl_flowtable_retire;
+	} else {
+		flow->retire = 0;
+	}
 
 	err = rhashtable_insert_fast(&flow_table->rhashtable,
 				     &flow->tuplehash[0].node,
@@ -313,6 +319,11 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 }
 EXPORT_SYMBOL_GPL(flow_offload_add);
 
+static inline bool nf_flow_has_retired(const struct flow_offload *flow)
+{
+	return flow->retire && nf_flow_timeout_delta(flow->retire) <= 0;
+}
+
 void flow_offload_refresh(struct nf_flowtable *flow_table,
 			  struct flow_offload *flow)
 {
@@ -327,7 +338,8 @@ void flow_offload_refresh(struct nf_flowtable *flow_table,
 	if (likely(!nf_flowtable_hw_offload(flow_table)))
 		return;
 
-	nf_flow_offload_add(flow_table, flow);
+	if (!nf_flow_has_retired(flow))
+		nf_flow_offload_add(flow_table, flow);
 }
 EXPORT_SYMBOL_GPL(flow_offload_refresh);
@@ -339,6 +351,7 @@ static inline bool nf_flow_has_expired(const struct flow_offload *flow)
 static void flow_offload_del(struct nf_flowtable *flow_table,
 			     struct flow_offload *flow)
 {
+	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
 	rhashtable_remove_fast(&flow_table->rhashtable,
 			       &flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].node,
 			       nf_flow_offload_rhash_params);
@@ -423,12 +436,14 @@ static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table,
 	    nf_ct_is_dying(flow->ct))
 		flow_offload_teardown(flow);
 
-	if (test_bit(NF_FLOW_TEARDOWN, &flow->flags)) {
+	if (test_bit(NF_FLOW_TEARDOWN, &flow->flags) || nf_flow_has_retired(flow)) {
 		if (test_bit(NF_FLOW_HW, &flow->flags)) {
-			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags))
+			if (!test_bit(NF_FLOW_HW_DYING, &flow->flags)) {
 				nf_flow_offload_del(flow_table, flow);
-			else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags))
+			} else if (test_bit(NF_FLOW_HW_DEAD, &flow->flags)) {
+				clear_bit(NF_FLOW_HW, &flow->flags);
 				flow_offload_del(flow_table, flow);
+			}
 		} else {
 			flow_offload_del(flow_table, flow);
 		}