Donald Sharp [Thu, 11 Nov 2021 18:25:35 +0000 (13:25 -0500)]
ospfd: Prevent use after free on shutdown
Running ospf_topo_vrf1 leads us to this valgrind issue:
==2386518== Invalid read of size 8
==2386518== at 0x4971520: route_top (table.c:401)
==2386518== by 0x181F08: ospf_interface_bfd_apply (ospf_bfd.c:126)
==2386518== by 0x182069: ospf_interface_disable_bfd (ospf_bfd.c:158)
==2386518== by 0x18BF51: ospf_del_if_params (ospf_interface.c:557)
==2386518== by 0x18C584: ospf_if_delete_hook (ospf_interface.c:712)
==2386518== by 0x490CA0B: hook_call_if_del (if.c:61)
==2386518== by 0x490D1F3: if_delete_retain (if.c:286)
==2386518== by 0x490D337: if_delete (if.c:309)
==2386518== by 0x490CDED: if_destroy_via_zapi (if.c:200)
==2386518== by 0x49940A9: zclient_interface_delete (zclient.c:2237)
==2386518== by 0x4998062: zclient_read (zclient.c:3969)
==2386518== by 0x4979529: thread_call (thread.c:1908)
==2386518== by 0x4919918: frr_run (libfrr.c:1164)
==2386518== by 0x181AC7: main (ospf_main.c:235)
==2386518== Address 0x5df39a0 is 0 bytes inside a block of size 56 free'd
==2386518== at 0x48399AB: free (vg_replace_malloc.c:538)
==2386518== by 0x492A03E: qfree (memory.c:141)
==2386518== by 0x4970C6F: route_table_free (table.c:141)
==2386518== by 0x4970A36: route_table_finish (table.c:61)
==2386518== by 0x18C543: ospf_if_delete_hook (ospf_interface.c:708)
==2386518== by 0x490CA0B: hook_call_if_del (if.c:61)
==2386518== by 0x490D1F3: if_delete_retain (if.c:286)
==2386518== by 0x490D337: if_delete (if.c:309)
==2386518== by 0x490CDED: if_destroy_via_zapi (if.c:200)
==2386518== by 0x49940A9: zclient_interface_delete (zclient.c:2237)
==2386518== by 0x4998062: zclient_read (zclient.c:3969)
==2386518== by 0x4979529: thread_call (thread.c:1908)
==2386518== by 0x4919918: frr_run (libfrr.c:1164)
==2386518== by 0x181AC7: main (ospf_main.c:235)
==2386518== Block was alloc'd at
==2386518== at 0x483AB65: calloc (vg_replace_malloc.c:760)
==2386518== by 0x4929EFC: qcalloc (memory.c:116)
==2386518== by 0x49709F8: route_table_init_with_delegate (table.c:53)
==2386518== by 0x49717F4: route_table_init (table.c:528)
==2386518== by 0x18C328: ospf_if_new_hook (ospf_interface.c:659)
==2386518== by 0x490C97D: hook_call_if_add (if.c:60)
==2386518== by 0x490CE85: if_create_name (if.c:223)
==2386518== by 0x490DF32: if_get_by_name (if.c:622)
==2386518== by 0x4993F73: zclient_interface_add (zclient.c:2186)
==2386518== by 0x4998062: zclient_read (zclient.c:3969)
==2386518== by 0x4979529: thread_call (thread.c:1908)
==2386518== by 0x4919918: frr_run (libfrr.c:1164)
==2386518== by 0x181AC7: main (ospf_main.c:235)
==2386518==
Fix the ordering to do the individual node tree cleanup after we delete
the data we care about.
Donald Sharp [Thu, 11 Nov 2021 14:39:52 +0000 (09:39 -0500)]
*: use compiler.h MIN/MAX macros instead of everyone having one
We had various forms of min/max macros across multiple daemons
all of which duplicated what we have in compiler.h. Convert
everyone to use the `correct` ones
Table manager is allocated when the VRF is created, but it is freed when
the VRF is disabled. When this VRF is re-enabled, zebra ends up with
table manager being NULL pointer and it crashes on any dereference.
Table manager should be freed when the VRF is deleted, not when it's
disabled.
Igor Ryzhov [Sat, 6 Nov 2021 14:00:46 +0000 (17:00 +0300)]
ospfd: remove commands for broken GR helper mode
Issue #9983 explains what is wrong with the GR helper mode.
To unblock the CI that fails almost all the time on the ospf_gr_topo1
test, remove the commands and disable the test. Also add a reminder to
completely remove the helper mode if no one fixes the code in a month.
David Lamparter [Wed, 10 Nov 2021 11:16:40 +0000 (12:16 +0100)]
doc: stick `libunwind` into build docs
It's strictly optional, but… the backtraces are really much better.
Specifically, `libunwind` is notably more capable in figuring out
function names compared to glibc/libexecinfo `backtrace_symbols()`.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Igor Ryzhov [Wed, 10 Nov 2021 13:36:15 +0000 (16:36 +0300)]
bfdd: fix coverity warnings
show/clear DEFUNs always require either peer label or IP address to be
specified, so if `label` is NULL then `peer_str` is definitely not NULL.
But Coverity doesn't know about that, so it complains about possible
NULL dereference of `peer_str`. This commit should make Coverity happy.
Chirag Shah [Wed, 3 Nov 2021 00:57:07 +0000 (17:57 -0700)]
zebra: svi down remove l2vni from l3vni list
Problem:
L2-VNI SVI down followed by L2-VNI's vxlan device
deletion leads to stale entry into L3VNI's
L2-VNI list.
Solution:
When L2-VNI associated SVI is down, default vrf
is the new tenant vrf.
Remove L2-VNI from L3VNI's l2vni list as
L3VNI/VRF is no longer valid in absence of associated
SVI.
When SVI is up re-add L2-VNI into associated VRF's
L3VNI.
The above remove/add from the L3VNI's L2VNI list is
already done when vxlan or L2-VNI is flaped, just need
to handle when SVI is flapped.
Without fix:
running config:
ip msdp mesh-group foo1 member *
Frr reoad failure log:
2021-11-02 11:05:04,317 INFO: Loading Config object from vtysh show running
line 5: % Unknown command: ip msdp mesh-group foo1 member *
Traceback (most recent call last):
File "/usr/lib/frr/frr-reload.py", line 1950, in <module>
With fix:
--------
running config displays:
ip msdp mesh-group foo1 member 0.0.0.0
David Lamparter [Sun, 7 Nov 2021 14:49:17 +0000 (15:49 +0100)]
tests: allow common_cli.c with logging enabled
common_cli.c disables logging by default so stdio is usable as vty
without log messages getting strewn inbetween. This the right thing for
most tests, but not all; sometimes we do want log messages.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
David Lamparter [Sun, 7 Nov 2021 14:41:18 +0000 (15:41 +0100)]
lib: fix c-ares thread misuse
The `struct thread **ref` that the thread code takes is written to and
needs to stay valid over the lifetime of a thread. This does not hold
up if thread pointers are directly put in a `vector` since adding items
to a `vector` may reallocate the entire array. The thread code would
then write to a now-invalid `ref`, potentially corrupting entirely
unrelated data.
This should be extremely rare to trigger in practice since we only use
one c-ares channel, which will likely only ever use one fd, so the
vector is never resized. That said, c-ares using only one fd is just
plain fragile luck.
Either way, fix this by creating a resolver_fd tracking struct, and
clean up the code while we're at it.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Donald Sharp [Sun, 7 Nov 2021 13:37:09 +0000 (08:37 -0500)]
tests: bfd_ospf_topo1 expects unreasonable convergence times under load
When our CI test system is under high load, expecting bfd to
converge in under 2 seconds is not going to happen. Modify the test
suites to just ensure that things reconvderge.
Donald Sharp [Sun, 7 Nov 2021 12:45:27 +0000 (07:45 -0500)]
tests: Remove debugs from topotests
Debugs take up a significant amount of cpu time as well as
increased disk space for storage of results. Reduce test
over head by removing the debugs, Hopefully this helps
alleviate some of the overloading that we are seeing in
our CI systems.
Donald Sharp [Fri, 5 Nov 2021 21:56:42 +0000 (17:56 -0400)]
ospf6d: Prevent crash in adj_ok
The adj_ok thread event is being added but not killed
when the underlying interface is deleted. I am seeing
this crash:
OSPF6: Received signal 11 at 1636142186 (si_addr 0x0, PC 0x561d7fc42285); aborting...
OSPF6: zlog_signal+0x18c 7f227e93519a7ffdae024590 /lib/libfrr.so.0 (mapped at 0x7f227e884000)
OSPF6: core_handler+0xe3 7f227e97305e7ffdae0246b0 /lib/libfrr.so.0 (mapped at 0x7f227e884000)
OSPF6: funlockfile+0x50 7f227e8631407ffdae024800 /lib/x86_64-linux-gnu/libpthread.so.0 (mapped at 0x7f227e84f000)
OSPF6: ---- signal ----
OSPF6: need_adjacency+0x10 561d7fc422857ffdae024db0 /usr/lib/frr/ospf6d (mapped at 0x561d7fbc6000)
OSPF6: adj_ok+0x180 561d7fc42f0b7ffdae024dc0 /usr/lib/frr/ospf6d (mapped at 0x561d7fbc6000)
OSPF6: thread_call+0xc2 7f227e989e327ffdae024e00 /lib/libfrr.so.0 (mapped at 0x7f227e884000)
OSPF6: frr_run+0x217 7f227e92a7f37ffdae024ec0 /lib/libfrr.so.0 (mapped at 0x7f227e884000)
OSPF6: main+0xf3 561d7fc0f5737ffdae024fd0 /usr/lib/frr/ospf6d (mapped at 0x561d7fbc6000)
OSPF6: __libc_start_main+0xea 7f227e6b0d0a7ffdae025010 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7f227e68a000)
OSPF6: _start+0x2a 561d7fc0f06a7ffdae0250e0 /usr/lib/frr/ospf6d (mapped at 0x561d7fbc6000)
OSPF6: in thread adj_ok scheduled from ospf6d/ospf6_interface.c:678 dr_election()
The crash is in the on->ospf6_if pointer is NULL. The only way this could
happen from what I can tell is that the event is added to the system
and then we immediately delete the interface, removing the memory
but not freeing up the adj_ok thread event.
Donald Sharp [Thu, 4 Nov 2021 17:00:51 +0000 (13:00 -0400)]
ospf6d: Prevent use after free
I am seeing a crash of ospf6d with this stack trace:
OSPF6: Received signal 11 at 1636042827 (si_addr 0x0, PC 0x55efc2d09ec2); aborting...
OSPF6: zlog_signal+0x18c 7fe20c8ca19a7ffd08035590 /lib/libfrr.so.0 (mapped at 0x7fe20c819000)
OSPF6: core_handler+0xe3 7fe20c90805e7ffd080356b0 /lib/libfrr.so.0 (mapped at 0x7fe20c819000)
OSPF6: funlockfile+0x50 7fe20c7f81407ffd08035800 /lib/x86_64-linux-gnu/libpthread.so.0 (mapped at 0x7fe20c7e4000)
OSPF6: ---- signal ----
OSPF6: ospf6_neighbor_state_change+0xdc 55efc2d09ec27ffd08035d90 /usr/lib/frr/ospf6d (mapped at 0x55efc2c8e000)
OSPF6: exchange_done+0x15c 55efc2d0ab4a7ffd08035dc0 /usr/lib/frr/ospf6d (mapped at 0x55efc2c8e000)
OSPF6: thread_call+0xc2 7fe20c91ee327ffd08035df0 /lib/libfrr.so.0 (mapped at 0x7fe20c819000)
OSPF6: frr_run+0x217 7fe20c8bf7f37ffd08035eb0 /lib/libfrr.so.0 (mapped at 0x7fe20c819000)
OSPF6: main+0xf3 55efc2cd75737ffd08035fc0 /usr/lib/frr/ospf6d (mapped at 0x55efc2c8e000)
OSPF6: __libc_start_main+0xea 7fe20c645d0a7ffd08036000 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fe20c61f000)
OSPF6: _start+0x2a 55efc2cd706a7ffd080360d0 /usr/lib/frr/ospf6d (mapped at 0x55efc2c8e000)
OSPF6: in thread exchange_done scheduled from ospf6d/ospf6_message.c:2264 ospf6_dbdesc_send_newone()
The stack trace when decoded is:
(gdb) l *(ospf6_neighbor_state_change+0xdc)
0x7bec2 is in ospf6_neighbor_state_change (ospf6d/ospf6_neighbor.c:200).
warning: Source file is more recent than executable.
195 on->name, ospf6_neighbor_state_str[prev_state],
196 ospf6_neighbor_state_str[next_state],
197 ospf6_neighbor_event_string(event));
198 }
199
200 /* Optionally notify about adjacency changes */
201 if (CHECK_FLAG(on->ospf6_if->area->ospf6->config_flags,
202 OSPF6_LOG_ADJACENCY_CHANGES)
203 && (CHECK_FLAG(on->ospf6_if->area->ospf6->config_flags,
204 OSPF6_LOG_ADJACENCY_DETAIL)
OSPFv3 is creating the event without a managing thread and as such
if the event is not run before a deletion event comes in memory
will be freed up and we'll start trying to access memory we should
not. Modify ospfv3 to track the thread and appropriately stop
it when the memory is deleted or it is no longer need to run
that bit of code.
Donald Sharp [Fri, 5 Nov 2021 15:49:37 +0000 (11:49 -0400)]
tests: pim_basic needs to wait for event to happen under load
The test system under load looks for upstream state only
1 time immediately after sending 2 streams of S,G data
flowing. Give the system some time to process this
and ensure that it actually shows up in a small
amount of time.
Donald Sharp [Fri, 5 Nov 2021 15:13:12 +0000 (11:13 -0400)]
tests: Ensure ospf has reconverged before continuing
The test_ldp_pseudowires_after_link_down test
shuts a link down and was blindly waiting 5 seconds
before just assuming the test system was in a sane
state. Remove the sleep(5) and actually look for
the changed state for the route 2.2.2.2 that the
psueudowire actually depends on.
LEI BAO [Fri, 5 Nov 2021 06:05:36 +0000 (14:05 +0800)]
zebra: Fix the RA send failed in netns mode
In the rtadv_timer(), it always uses the zvrf's socket to send RA
packets. In the vrf-lite mode, it's righ since it uses the default
vrf to send the RA packets. But in the netns mode, it uses socket
in each netns. So the issue only happens in the netns mode because
the zvrf's socket may not be in the same netns as the interface's
netns. In order to compatible with both vrf-lite and netns mode,
the fix uses the if_lookup_by_index() to check whether interfaces
can use the zvrf's socket.
Igor Ryzhov [Thu, 4 Nov 2021 23:04:02 +0000 (02:04 +0300)]
zebra: don't register same hook multiple times
Before 42d4b30e, table_manager_enable was called only once and the hook
was also registered once. After the change, the hook is registered per
each VRF that is created in the system. This is wrong.
Donald Sharp [Thu, 4 Nov 2021 15:45:27 +0000 (11:45 -0400)]
tests: Fix route replace test in all_protocol_startup
The route replace test was doing this seq of events:
a) Create nhg
b) Install route w/ sharpd
c) Ensure it worked
d) Modify nhg
d) Ensure the update group replace worked
The problem is that the sharp code is doing this:
/* Only send via ID if nhgroup has been successfully installed */
if (nhgid && sharp_nhgroup_id_is_installed(nhgid)) {
SET_FLAG(api.message, ZAPI_MESSAGE_NHG);
api.nhgid = nhgid;
} else {
for (ALL_NEXTHOPS_PTR(nhg, nh)) {
api_nh = &api.nexthops[i];
zapi_nexthop_from_nexthop(api_nh, nh);
i++;
}
api.nexthop_num = i;
}
The created nhg has not been successfully installed( or at least
sharpd has not read the results yet) when it gets the command
to install the routes. As such it passes down the individual
nexthops instead. The route replace is never going to work.
Modify the code to add a bit of sleep to allow sharpd to
get notified when the system is under load. At this point
there is no way to query sharpd for whether or not it
thinks it's nhg is installed properly or not. This
test is failing all over the place for a bunch of people
let's get this fixed so people can get running
Donald Sharp [Thu, 4 Nov 2021 12:01:14 +0000 (08:01 -0400)]
zebra: Send up ifindex for redistribution when appropriate
Currently the NEXTHOP_TYPE_IPV4 and NEXTHOP_TYPE_IPV6 are
not sending up the resolved ifindex for the route. This
is causing upper level protocols that have something like
this:
route-map FOO permit 10
match interface swp13
!
router ospf
redistribute static
!
ip route 4.5.6.7/32 10.10.10.10
where 10.10.10.10 resolves to interface swp13. The route-map
will never match in this case.
Since FRR has the resolved nexthop interface, FRR might as
well send it up to be selected on by the upper level protocol
as needed.
Rafael Zalamena [Tue, 2 Nov 2021 21:54:23 +0000 (18:54 -0300)]
bgpd: fix BFD configuration update on TTL change
When altering the TTL of a eBGP peer also update the BFD
configuration. This was only working when the configuration happened
after the peer connection had been established.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Philippe Guibert [Thu, 28 Oct 2021 16:28:42 +0000 (18:28 +0200)]
zebra: update dataplane flowspec address family in ipset_info
It is needed for the ipset entry to know for which address family
this ipset entry applies to. Actually, the family is in the original
ipset structure and was not passed as attribute in the dataplane
ipset_info structure. Add it.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Philippe Guibert [Thu, 28 Oct 2021 11:42:57 +0000 (13:42 +0200)]
zebra: fix flowspec ipset operations
When injecting an ipset entry into the zebra dataplane context, the
ipset name is stored in a separate structure. This will permit the
flowspec plugin to be able to know which ipset has to be appended with
relevant ipset entry.
The problem was that the zebra dataplane objects related to ipset entries
is made up of an union between the ipset structure and the ipset info
structure. This was implying that the two structures were on the same
memory zone, and when extracting the data stored, the data were incomplete.
Fix this by replacing the union structure by a defined struct.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Philippe Guibert [Wed, 27 Oct 2021 14:45:05 +0000 (16:45 +0200)]
ospf6d: avoid writing dumb ospf6 info at startup
in show: 'show ipv6 ospf6' handler command, the reason of SPF
executation is looked up and displayed. At startup, SPF has been
started, but shows no specific reason. Instead of dumping non
initialised string context, reset the string context.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Igor Ryzhov [Tue, 2 Nov 2021 21:29:19 +0000 (00:29 +0300)]
lib: fix crash when terminating inactive VRFs
If the VRF is not enabled, if_terminate deletes the VRF after the last
interface is removed from it. Therefore daemons crash on the subsequent
call to vrf_delete. We should call vrf_delete only for enabled VRFs.
Igor Ryzhov [Tue, 2 Nov 2021 20:54:43 +0000 (23:54 +0300)]
zebra: fix stale pointer when netns is deleted
When the netns is deleted, we should always clear the vrf->ns_ctxt
pointer. Currently, it is not cleared when there are interfaces in the
netns at the time of deletion.
If the netns is re-created, zebra crashes because it tries to use the
stale pointer.